Data Management

A reliable, well designed, and quality assured relational database management system (RDBMS) is fundamental to a program’s ability to effectively use monitoring information to solve environmental problems. For biomonitoring programs in the initial development stage, information management often begins with simple spreadsheet storage of sampling event data (e.g., Excel, Lotus). Eventually, a relational database offers major advantages in terms of efficiency of multi-user data access and editing, quality control, integration with spatial data, and web-based access to data. Hernandez (2003) compiled data standards for environmental sampling, analysis, and results (ESAR), http://www.exchangenetwork.net/standards, which offer guidance for ensuring the proper information is recorded. A proper system for aggregating data and performing the necessary quality control checks is essential. Furthermore, storing these data in STORET and RETrieval system) (via the Central State Exchange [CDX] is advised for maintaining long-term data sets. For programs that have established numeric biocriteria thresholds, data management includes not only proper stewardship of raw data elements but also proper computation of biological metrics and biocriteria threshold information (i.e., the site biocriteria outcome).
Therefore, integration of assessment information from multiple assemblages (fish, macroinvertebrate, algae, etc) can contribute important diagnostic information. State and Tribal biomonitoring programs often collect data over years and decades in contrast to research projects that are typically designed to address research questions that can be answered in one to a few years. The value of long-term datasets to all users depends upon well-documented and properly implemented quality assurance protocols that ensure data integrity, and a RDBMS that allows efficient and transparent statistical and graphical analysis of the data. A strong geographic information system (GIS) linked to a well-designed relational database moves programs toward a more comprehensive watershed perspective in interpreting monitoring data and improves the ability of biological data to meet the increasing information demands of State, Tribal, and federal programs, as well as responsible parties, and the public. A hierarchy of data management is as follows:
- Level 1 - may be solely paper files or simple electronic spreadsheet; QC is cursory at best.
- Level 2 - electronic spreadsheets are the norm; data types are likely separated into separate files and stored independently; QC cursory and mostly for transcription errors.
- Level 3 - electronic relational database customized for individual agency, and not easily transportable to another platform; good QC for data entry, value ranges, and site locations; data can be exported for analysis and biocriteria development; dedicated database management.
- Level 4 - fully developed electronic relational database that includes metadata and strong QC and data validation function (e.g., WQX STORET); clean data are ensured through rigorous QC; various ecological and ancillary data are readily accessible for aggregation and integrated analyses.
Some Frequently Asked Questions
Question: How do I know what type of data management system I need?
Answer: It is recommended that data organization and management transcend the compilation of field and laboratory sheets to an electronic database. Use of spreadsheets is the minimum level of an electronic database management system, but these can be deficient in error checking and data integration. A relational database will address both of these shortcomings. In any case, a thorough quality control check on the database ensures a “clean” dataset for use throughout an agency’s program. A small RDBMS such as Microsoft Access could serve as a logical step from spreadsheets to a relational database. These smaller systems can be used to develop a bioassessment database that includes most of the relational data integrity and validation features of a larger RDBMS. Most large RDBMS are installed on a server which provides options for making the database available through a network or internet connection. Larger RDBMS are usually installed and administered by an agency’s information technology (IT) department. IT departments can help program managers identify qualified professionals to assist with creating a custom database to meet the data management and analysis needs of bioassessment programs.
Question: If I’m able to use electronic spreadsheets or even a small RDBMS such as Microsoft Access, why do I need a data dictionary (metadata)?
Answer: A well-documented data dictionary defines not only how the data in a particular field relates to field operations and data collection, but it specifies how those values are stored and validated. Creating a well-documented data dictionary requires the data manager to address questions ranging from fairly simple to more complex. For example, are the data numeric or text? Are they allowed to be null? Are there restricted values that must be used? The answers to these questions may show that multiple types of data are being stored in one field and should be separated. Additionally, answering these questions helps to bridge the gap between using spreadsheets and moving toward a more robust data management system (RDBMS).
References
Hernandez, M.J. 2003. Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design. Addison-Wesley\Pearson Education, Boston, Massachusetts. 668 pgs.
![[logo] US EPA](http://www.epa.gov/epafiles/images/logo_epaseal.gif)