

The Perils of Data Management (continued)

Once calculated, metrics were stored as horizontal files in which a single row of data contained all the information for one site visit (Figure 3). Each data file had a companion metadata file that explained the meaning, units, and derivation of each variable code; each variable occupied one column of the data file. Other horizontal files grouped sets of variables by topic, such as fish metrics, water chemistry, habitat measures, and stressors (Hale et al., 1998).
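The pairing of a horizontal data file with its companion metadata file can be sketched in a few lines of Python. This is only an illustration: the column codes, values, and the helper names `load_rows` and `undocumented_columns` are hypothetical and are not taken from the MAIA files.

```python
import csv
import io

# Hypothetical horizontal data file: one row per site visit,
# one column per variable. All values are invented for illustration.
DATA_CSV = """STRM_ID,VISIT_NO,DATE,PH,DOC,ANC
MD001,1,1994-05-12,7.1,2.3,410.0
MD001,2,1994-07-20,6.9,2.8,395.5
VA014,1,1994-05-15,6.4,1.1,120.2
"""

# Hypothetical companion metadata file: one row per variable code.
METADATA_CSV = """VARIABLE,DESCRIPTION,UNITS
STRM_ID,Stream identifier,none
VISIT_NO,Visit number,none
DATE,Sampling date,YYYY-MM-DD
PH,pH,pH units
DOC,Dissolved organic carbon,mg/L
ANC,Acid neutralizing capacity,ueq/L
"""


def load_rows(text):
    """Parse delimited text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def undocumented_columns(data_text, metadata_text):
    """Return data columns that have no entry in the metadata file."""
    data_columns = csv.DictReader(io.StringIO(data_text)).fieldnames
    documented = {row["VARIABLE"] for row in load_rows(metadata_text)}
    return [name for name in data_columns if name not in documented]


visits = load_rows(DATA_CSV)
missing = undocumented_columns(DATA_CSV, METADATA_CSV)
```

A check like `undocumented_columns` is one simple way to confirm that every column a user downloads has a description and units available alongside it.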

Although the metadata files were somewhat minimal, they were key to the usefulness of the data across projects and research groups (Hale et al., 2000). Unfortunately, additional important information about how the data were collected was not archived with the data. Because sampling protocols changed over time, this omission caused some confusion at the time of analysis. In short, all the attention paid to file structure and data organization was well worth the effort; in fact, more metadata and data description would have been useful.

Simple files were best

Data analysis for MAIA involved multiple institutions and investigators using different statistical software. Posting files on an Internet server was the most practical way to share files among so many remote users. Hosting a searchable relational database that included all the data was an option, but such databases are typically slow and difficult for the host to maintain (Hale et al., 2000). Because researchers were typically interested in only a subset of the data, smaller, simpler files with variables grouped by topic worked best.

Relational database programs (e.g., Access and Oracle) provide the opportunity to develop complicated data structures composed of multiple linked files. However, this approach assumes that users are working in a similar software environment, that is, using the same programs to analyze data or working from the same central database. Relational databases typically link multiple vertical files, thereby saving space by not repeating information that is identical for each entry. Although relational databases avoid some repetition within files, the unseen program code required to support the relationships between files typically takes up more space than a simple flat file with redundant information. Another advantage of a relational database is that data can be stored in one place, with the relationships between variables and files encoded in the program, so individual users are not required to remember or derive these details. However, these relationships can be very complicated to code correctly and to maintain.
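To make the contrast concrete, the sketch below builds two small linked vertical tables and then flattens them into one row per site visit. It uses SQLite through Python's standard `sqlite3` module purely for illustration; the table names, column names, and values are hypothetical, and nothing here reflects how the MAIA data were actually stored.

```python
import sqlite3

# In-memory database holding two linked "vertical" tables.
# Site-level information is stored once; measurements are stored
# as one row per (site, visit, variable). All values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (strm_id TEXT PRIMARY KEY, state TEXT);
CREATE TABLE measurement (
    strm_id  TEXT REFERENCES site(strm_id),
    visit_no INTEGER,
    variable TEXT,
    value    REAL
);
INSERT INTO site VALUES ('MD001', 'MD'), ('VA014', 'VA');
INSERT INTO measurement VALUES
    ('MD001', 1, 'PH', 7.1),
    ('MD001', 1, 'DOC', 2.3),
    ('VA014', 1, 'PH', 6.4),
    ('VA014', 1, 'DOC', 1.1);
""")

# Flatten to a horizontal layout: one row per site visit, one column
# per variable. The join repeats the site-level 'state' value on
# every row, which is the redundancy a flat file accepts.
flat_rows = conn.execute("""
SELECT m.strm_id, s.state, m.visit_no,
       MAX(CASE WHEN m.variable = 'PH'  THEN m.value END) AS ph,
       MAX(CASE WHEN m.variable = 'DOC' THEN m.value END) AS doc
FROM measurement AS m
JOIN site AS s ON s.strm_id = m.strm_id
GROUP BY m.strm_id, s.state, m.visit_no
ORDER BY m.strm_id, m.visit_no
""").fetchall()
conn.close()
```

The query is where the encoded relationships live: a user who receives only the flat result does not need to know how the tables are linked, but someone has to write and maintain that linkage.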

The MAIA data had to be accessible to many remote users who intended to manipulate the data within a variety of software environments. Most statistical software packages expect flat files and cannot import the program information that a relational database uses to link variables across files. Flat files have all the information for a site visit in the same place, that is, in one file on the same row (Hale and Buffum, 2000). Thus, rather than create a complicated database structure from which data would have to be exported for analysis, data files were kept simple from the beginning so that they could be easily downloaded from the EMAP Internet site and quickly entered into the user's own statistical software.
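Because each flat file carries the identifiers for a sampling event on every row, a user who downloads two topic files can combine them without any database software. The following is a minimal sketch under stated assumptions: the key columns `STRM_ID`, `VISIT_NO`, and `DATE`, the variable `FISH_IBI`, the data values, and the function names are all hypothetical.

```python
import csv
import io

# Columns assumed to identify a unique sampling event (hypothetical names).
KEY = ("STRM_ID", "VISIT_NO", "DATE")

# Two hypothetical topic files with invented values.
CHEM = """STRM_ID,VISIT_NO,DATE,PH
MD001,1,1994-05-12,7.1
VA014,1,1994-05-15,6.4
"""
FISH = """STRM_ID,VISIT_NO,DATE,FISH_IBI
MD001,1,1994-05-12,62
VA014,1,1994-05-15,48
"""


def index_by_visit(text):
    """Map each sampling-event key to its row (a dict of strings)."""
    rows = csv.DictReader(io.StringIO(text))
    return {tuple(row[k] for k in KEY): row for row in rows}


def merge_topic_files(left_text, right_text):
    """Combine rows from two flat files that share a sampling event."""
    left = index_by_visit(left_text)
    right = index_by_visit(right_text)
    merged = []
    # Keep only events present in both files, in sorted key order.
    for key in sorted(left.keys() & right.keys()):
        row = dict(left[key])
        row.update(right[key])
        merged.append(row)
    return merged


merged = merge_topic_files(CHEM, FISH)
```

Values stay as strings here; converting them to numbers is left to whatever statistical software the user loads the merged rows into.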

Data: Example of a horizontal data file structure

Metadata: Description of column headings

Figure 3. Example of a horizontal file structure for water chemistry data. Stream ID, visit number and date identify a unique sampling event represented by a single row in the data file. A companion metadata file lists each variable name, its description, and units of measure.
