The Perils of Data Management
Every scientific analysis is founded on the format and reliability of its underlying data. Initial decisions about how to store and format data influence how, or even whether, the data are ever used for scientific analysis. What agency doesn't have data languishing on diskettes in file drawers? The lesson learned from the MAIA pilot was that time spent on data management was never wasted and that the time required was usually more than expected.
During the course of the MAIA study, several hundred sites were sampled for hundreds of variables over six years. Some variables were sampled once per site (e.g., land cover and use); other variables were sampled on every possible occasion (e.g., water chemistry); and still others were sampled on only some site visits (e.g., fish and habitat). For invertebrates and algae, two samples were collected during many site visits: one from pools and the other from riffles. Most sites were visited only once, but some sites were sampled repeatedly, either within one year or in successive years, in order to estimate the variability associated with site-scale data. As a result, the data set was huge and complex, and data management was an issue that was carefully considered from the beginning (Hale et al., 1998; Hale et al., 1999). However, in spite of careful planning, data analysis was slowed by months due to mishandling of data files that resulted in incomplete and corrupted data.
Many agencies struggle with data management choices made early in their programs that restrict their current ability to analyze and interpret those data. Although many state water monitoring programs are smaller in scale than MAIA, the amount of data collected increases each year and databases are much larger than they were ten or even five years ago. Data management takes on a life of its own when more than about 200 sites are sampled for more than 50 variables. Most state sampling programs now exceed this amount of data for monitoring under the Clean Water Act. Data management lessons learned in the Mid-Atlantic will become increasingly relevant to state programs as data accumulate.
Different "names" for the same site caused confusion
The MAIA data set was large and complex; therefore, it was not possible to put all the data in a single file. Multiple files of related data meant that site names were extremely important for matching data across files. Unfortunately, different types of data required different site information to uniquely identify data from a single site visit. Thus, it was not possible to create a single unique code for matching site visits across files that would correctly match all variables.
For land use data, only the site name was needed because satellite data were only recorded once for each site during the course of the project. For fish and chemistry data, the site name, year, and visit number within year were needed to identify data from a unique site visit. For diatom and invertebrate data, the site name, year, visit number and habitat type (pool vs. riffle) were needed. As a consequence, files were easily corrupted when smaller files were merged without using all the variables necessary to match site visits correctly. An incorrectly merged file could have twice as many entries for some variables or data from one site visit assigned to another. This issue will be complicated for any monitoring program when the same site has different types of data associated with it and the data are collected at different times. The lesson learned was that more information should have been included with the data to identify unique sampling occasions.
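The merge problem described above can be illustrated with a minimal sketch. The records, field names, and helper functions below are hypothetical and are not taken from the MAIA files; the point is only that joining on the site name alone pairs every visit with every other visit, while joining on the full key (site, year, visit) keeps one row per site visit.

```python
from collections import defaultdict

# Hypothetical records for one site sampled twice in a year (not MAIA data).
chemistry = [
    {"site": "S1", "year": 1994, "visit": 1, "ph": 7.1},
    {"site": "S1", "year": 1994, "visit": 2, "ph": 6.8},
]
fish = [
    {"site": "S1", "year": 1994, "visit": 1, "fish_taxa": 12},
    {"site": "S1", "year": 1994, "visit": 2, "fish_taxa": 9},
]


def merge(left, right, keys):
    """Inner-join two lists of dicts on the given key fields."""
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in keys)].append(row)
    merged = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            merged.append({**row, **match})
    return merged


def check_unique_key(rows, keys):
    """Raise ValueError if the key fields do not identify each row uniquely."""
    seen = set()
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key in seen:
            raise ValueError(f"duplicate key {key} for fields {keys}")
        seen.add(key)


# Joining on site alone pairs each chemistry visit with both fish visits.
bad = merge(chemistry, fish, ["site"])
# Joining on the full key keeps exactly one row per site visit.
check_unique_key(fish, ["site", "year", "visit"])
good = merge(chemistry, fish, ["site", "year", "visit"])

print(len(bad), len(good))  # prints "4 2"
```

Running a uniqueness check such as `check_unique_key` before every merge is one way to catch an incomplete key before it doubles the number of entries.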
Original data must be archived
Biological data evolve from field sheets for fish or laboratory bench sheets for invertebrates and algae with coded taxonomic identifiers; to files with taxa names, counts and natural history information; to calculated metrics. With each new generation of the data file, the original files can appear crude in retrospect, and the tendency was to lose track of original files with confusing formats once newer versions were created. For the MAIA pilot, original files were the only way to catch major errors in later versions of the data. The lesson learned was to archive the original field or bench sheets along with the first generation of electronic data in a way that the data will not be changed or lost.
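One possible way to keep first-generation electronic files from being changed unnoticed is to record a checksum for each file and mark the files read-only. The sketch below is an assumption about how such an archive might be handled, not a description of the MAIA procedure; the function names are hypothetical.

```python
import hashlib
import stat
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_manifest(archive_dir):
    """Return {relative file name: checksum} and mark each file read-only."""
    root = Path(archive_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            manifest[str(path.relative_to(root))] = sha256_of(path)
            path.chmod(stat.S_IREAD)  # owner read-only; a guard, not a guarantee
    return manifest


def changed_files(archive_dir, manifest):
    """List archived files that are missing or no longer match the manifest."""
    root = Path(archive_dir)
    return [
        name
        for name, checksum in manifest.items()
        if not (root / name).exists() or sha256_of(root / name) != checksum
    ]
```

The manifest would be stored separately from the archive, and `changed_files` rerun whenever a later generation of the data is checked against the originals.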
Different research groups tended naturally to follow different paths in evolving their data files. When the different researchers came together, their results sometimes disagreed either due to errors or different decision rules. Checks against the first generation of the data were often the only way to resolve discrepancies between researchers. Errors crept into data sets, for example, when files were merged incorrectly. Other problems arose when large files were truncated without warning by a spreadsheet or statistical program because the number of variables exceeded the number of columns permitted by the software.
During the course of the project, rules for combining invertebrate taxa evolved independently among different researchers. Similarly, for periphyton, rules for calculating the relative abundance of soft algae and diatoms also evolved. In both cases, the first-generation data files were required to create a correct master file and to calculate metric values consistently. For invertebrates, discrepancies can arise if one mayfly in a sample is identified to genus while another mayfly can only be identified to the family that includes that same genus (either due to damage or being an early instar). Should the two mayflies be counted as two distinct taxa or only one? If the family typically has only one genus, then the decision is easy: they should be counted as one taxon. But if the family has many genera or the genus identified is known to be rare, then perhaps they should be counted as two taxa. For the MAIA data, much time was spent defining consistent rules and creating a master file that used the taxonomic rules to combine taxa. For periphyton, soft algae and diatoms are identified with different laboratory methods; consequently, the taxa counts were originally saved in separate files. In order to combine the taxonomic data for both types of algae, the relative abundance of each type in the original sample had to be considered. Bench data related to sample volumes were retrieved from the earliest data files to combine the information.
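To show why a written decision rule matters, the sketch below encodes one conservative rule for the mayfly example: a family-level identification is not counted as a separate taxon when a genus from the same family is present in the sample. This rule is assumed for illustration only; the actual MAIA rules are not given here, and the sample is made up.

```python
def count_distinct_taxa(identifications):
    """Count taxa, ignoring a family-level ID when a genus in that family is present.

    Each identification is a (family, genus) pair; genus is None when the
    specimen could only be identified to family.
    """
    genera = {(family, genus) for family, genus in identifications if genus is not None}
    families_with_genus = {family for family, _ in genera}
    family_only = {
        family
        for family, genus in identifications
        if genus is None and family not in families_with_genus
    }
    return len(genera) + len(family_only)


# Illustrative sample: one mayfly to genus, one only to the same family,
# and one specimen from a second family identified only to family.
sample = [
    ("Baetidae", "Baetis"),
    ("Baetidae", None),  # damaged or early instar
    ("Heptageniidae", None),
]
print(count_distinct_taxa(sample))  # prints 2
```

Applying a single function like this to the first-generation files, rather than letting each group combine taxa by hand, is one way to keep metric values consistent across researchers.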
Without access to the original archived data, much of the data would have been irretrievably lost. Although the archived data prevented this, a larger lesson learned from MAIA was that an information manager should be an integral part of any scientific team for projects of this scale. Furthermore, the hours dedicated in the beginning of the project to ensure data were correctly stored were dwarfed by the hours spent after the fact trying to repair or retrieve corrupted data.
File structure mattered
Three key points contributed to the usefulness of the MAIA data and the successful analysis of the data in so many publications: simple files were shared via an Internet server, file formatting was consistent across files, and metadata files were provided to describe the variable names. In addition, the data were organized into files that lent themselves to statistical analysis. File structure logically anticipated the type of analysis that was likely to be applied to a particular type of data. When selecting the file structure for different data types, EMAP database designers distinguished between "vertical" and "horizontal" files (Hale and Buffum, 2000). A horizontal file has a single row, or case, for each unique site visit and a single column for every variable recorded for that site visit. In contrast, a vertical file lists different types of data in a single column and may use multiple rows for a single site visit. Vertical files save space when many of the variables contain no information for a particular site visit. Both are appropriate in different situations.
For the MAIA study, vertical files were used to list fish, invertebrate, diatom, and soft (non-diatom) algal taxa collected at each site (Figure 2). The taxon code and name were listed along with the number of individuals in that taxon. A horizontal file structure would have headed each column with the name of one taxon and yielded a file with most of the cells blank, because not all taxa are found at a particular site. For diatoms, the number of taxa averaged 29 per site out of approximately 950 possible. Most spreadsheet programs do not allow so many columns, so rather than split the data across multiple files, a vertical file structure was used. Taxonomic information was also too cumbersome to include in the data file and was instead recorded in a companion file of metadata that listed phylum, class, order, family and other taxonomic details once for each species code.
| SITE CODE | VISIT | YEAR | SAMPLE | TAXA | TAXON | COUNT |
|---|---|---|---|---|---|---|
| DE750S | 1 | 1994 | Pool | BAACBIO | Achnanthes bioreti | 7 |
| DE750S | 1 | 1994 | Pool | BAACMNU | Achnanthes minutissima | 130 |
| DE750S | 1 | 1994 | Pool | BAANVIT | Anomoeneis vitrea | 2 |
| DE750S | 1 | 1994 | Pool | BACBAMP | Cymbella amphicephala | 2 |
| DE750S | 1 | 1994 | Pool | BAEUBIL | Eunotia bilunaris | 2 |
| DE750S | 1 | 1994 | Pool | BASYRURU | Synedra rumpens v rumpens | 16 |
| DE750S | 1 | 1994 | Pool | BATAFLO | Tabellaria flocculosa | 94 |
| MD003S | 1 | 1995 | Riffle | BAACMNGR | Achnanthes minutissima v gracillima | 5 |
| MD003S | 1 | 1995 | Riffle | BACBAMP | Cymbella amphicephala | 1 |
| MD003S | 1 | 1995 | Riffle | BASYNAN | Synedra nana | 1 |
| MD003S | 1 | 1995 | Riffle | BASYULUL | Synedra ulna v ulna | 5 |
| MD003S | 1 | 1995 | Riffle | BATAFLO | Tabellaria flocculosa | 4 |
| ETC. | | | | | | |

| TAXA CODE | PHYLUM | GENUS | SPECIES | V or f | SUBSP | AUTHORITY |
|---|---|---|---|---|---|---|
| BAAC | Bacillariophyta | Achnanthes | spp. | | | |
| BAACBIA | Bacillariophyta | Achnanthes | Biasolettiana | | | (Kützing) Grunow |
| BAACBIO | Bacillariophyta | Achnanthes | bioreti | | | Germain |
| BAACBITH | Bacillariophyta | Achnanthes | biasolettiana | v | thienemannii | (Hustedt) |
| BAACDAU | Bacillariophyta | Achnanthes | daui | | | |
| BAACDEAL | Bacillariophyta | Achnanthes | deflexa | v | alpestris | Lowe & Kociolek |
| BAACDEF | Bacillariophyta | Achnanthes | deflexa | | | Reimer |
| BAACDEL | Bacillariophyta | Achnanthes | delicatula | | | |
| BAACDESE | Bacillariophyta | Achnanthes | delicatula | spp | septentrional | (Øestrup) |
| BAACDIS | Bacillariophyta | Achnanthes | distincta | | | Messikommer |
| BAACEXA | Bacillariophyta | Achnanthes | exigua | v | elliptica | Grunow |
| BAACEXEL | Bacillariophyta | Achnanthes | exigua | | | Hustedt |
| BAACEXI | Bacillariophyta | Achnanthes | exilis | | | Kützing |
| BAALPEL | Bacillariophyta | Amphipleura | pellucida | | | (Kützing) Kützing |
| BAAMOVA | Bacillariophyta | Amphora | ovalis | | | (Kützing) Kützing |
| BAAMPED | Bacillariophyta | Amphora | peducilus | | | (Kützing) Grunow |
| BAAMSUB | Bacillariophyta | Amphora | submontana | | | Hustedt |
| ETC. | | | | | | |

Figure 2. Example data file for diatom samples from MAIA sites and companion metadata file with taxonomic details keyed by taxa codes. The upper table is an example of a vertical file, with more than one row per site visit and taxa names collapsed into a single column rather than one column per taxon.
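
The relationship between the two file structures can be sketched in a few lines. The rows below are a subset of the vertical file in Figure 2; the function name is hypothetical, and treating an unlisted taxon as a count of 0 is an assumption made for this example.

```python
from collections import defaultdict

# Subset of the vertical file in Figure 2:
# (site code, visit, year, sample, taxa code, count).
vertical = [
    ("DE750S", 1, 1994, "Pool", "BAACBIO", 7),
    ("DE750S", 1, 1994, "Pool", "BAACMNU", 130),
    ("DE750S", 1, 1994, "Pool", "BATAFLO", 94),
    ("MD003S", 1, 1995, "Riffle", "BACBAMP", 1),
    ("MD003S", 1, 1995, "Riffle", "BATAFLO", 4),
]


def to_horizontal(rows):
    """Pivot vertical rows into one row per site visit and one column per taxon."""
    taxa = sorted({row[4] for row in rows})
    counts = defaultdict(dict)
    for site, visit, year, sample, code, count in rows:
        counts[(site, visit, year, sample)][code] = count
    # Assumption: a taxon not listed for a site visit is recorded as 0.
    table = {
        key: [found.get(code, 0) for code in taxa]
        for key, found in counts.items()
    }
    return taxa, table


taxa, horizontal = to_horizontal(vertical)
print(taxa)
for key, row in horizontal.items():
    print(key, row)
```

Even with five rows, the horizontal form already carries filler cells; with roughly 950 possible diatom taxa and an average of 29 per site, nearly every cell would be empty, which is the reason given above for storing taxa counts vertically.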