Metric Testing

Candidate metrics are selected for inclusion in a multimetric index if they are biologically meaningful, consistently associated with human disturbance, not redundant with other metrics, and reliably and easily quantified from field samples (Karr and Chu, 1999; Jackson et al., 2000). Typically the approach selected for metric testing is limited by the type and variety of data available for quantifying human disturbance. For the MAIA study there were virtually no limitations because of the variety of variables measured. For any study relating biological change to the degradation associated with human activities, the most difficult step is identifying an independent measure of human disturbance. Independent measures of disturbance are needed to test for a consistent biological response across the range of possible conditions. Measures of human disturbance must be derived independently of the biological data to avoid simply choosing aspects of disturbance or measures of biology that match our expectations. Because human influence is complex and human activities are multidimensional, the challenge is how to integrate disparate measures of human influence into a single axis of human disturbance for metric testing.

Statistically, it is simplest to compare measures taken at impaired sites with the same measures taken at minimally disturbed reference sites (Barbour et al., 1996). When only a few measurements of disturbance are made, statistical testing may involve simple tests of differences in means (ANOVA) or association between one-dimensional measures of disturbance (regression or correlation; Miltner and Rankin, 1998). When multiple, related measures of disturbance are made, multiple regression may be used to test the disturbance measures together. If, however, these measures are themselves correlated with each other (which they often are), multiple regression's assumption of independence is violated, and the results may not be robust (Loftis et al., 1991; Wang et al., 1998; Olden and Jackson, 2000). When many measures of disturbance are collected, measures may be combined using principal components analysis (Hughes et al., 1998; Norton et al., 2000). Other projects have used a ranking system to summarize information about human disturbance at different spatial scales (Bryce et al., 1999; Fore and Grafe, 2002).

For the MAIA pilot, the wealth of measures associated with human influence necessitated a larger discussion among the researchers involved. The disturbance measures used for metric testing were selected during a series of workshops sponsored by the USEPA. Through discussion and consensus, researchers derived a list that captured multiple aspects of human influence at different spatial scales (McCormick et al., 2001; Klemm et al., 2003). Reference and test conditions were defined to include a subset of least disturbed and most degraded sites (Waite et al., 2000). Measures of nutrient concentration were included to reflect the influence of agriculture and urban land uses; acidity was included to capture the effects of acid rain deposition and acid mine drainage; variables related to sediment and turbidity measured erosion; and measures of riparian condition summarized the physical disturbance near the site. In addition, Bryce et al. (1999) developed an integrative measure of human disturbance at the watershed and reach scale. The percentage of disturbed land in the watershed was also used as a summary measure.

Redundant testing of metrics against multiple measures of disturbance ensured that metrics were selected for their biological meaning rather than by statistical chance. With a large list of candidate metrics and a single test for each, some candidates may meet the criteria for metric selection by chance alone. For any one candidate, the probability of such chance selection equals the significance level chosen for the test. Multiple tests against measures of different aspects of human disturbance avoided this pitfall and ensured that the metrics selected represented meaningful and reliable indicators of biological change associated with human influence.
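
The arithmetic behind this argument can be sketched as follows. The sketch rests on simplifying assumptions that are not part of the MAIA analysis: the tests are treated as independent, no candidate is truly related to disturbance, and a candidate is kept only if it passes every test. The significance level of 0.05 and the three required tests are hypothetical values. Because real disturbance measures are correlated with one another, the actual reduction in chance selections would be smaller than shown here.

```python
def expected_chance_selections(n_candidates, alpha, n_tests_required=1):
    """Expected number of unrelated candidates that pass by chance alone.

    Assumes independent tests, each with false-positive rate alpha, and that
    a candidate is kept only if it passes all n_tests_required tests.
    """
    p_pass_all = alpha ** n_tests_required
    return n_candidates * p_pass_all


# Hypothetical illustration: 120 candidate metrics, alpha = 0.05.
single = expected_chance_selections(120, 0.05)
triple = expected_chance_selections(120, 0.05, n_tests_required=3)
print(f"one test per metric:    {single:.2f} expected chance selections")
print(f"three tests per metric: {triple:.3f} expected chance selections")
```

Under these assumptions, each additional test a metric must pass multiplies the chance-selection probability by the significance level again, which is why testing against several aspects of disturbance guards against spurious selections.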

Simple criteria were used first to eliminate potential metrics

For each assemblage, a large number of candidate metrics were identified for testing (Stevenson and Bahls, 1999; McCormick et al., 2001; Klemm et al., 2003). Simple statistical rules were developed to shorten the long list of candidate metrics to a smaller number that were then considered more carefully.

During the first round of elimination, metrics were evaluated for their range of values. Metrics calculated on the basis of a small number of species might not yield an adequate range of values for calculating taxa richness or percentage relative abundance. For example, taxa richness metrics that could take on values of only 0 or 1 or percentage metrics that only had a range of 10% had too few potential values to distinguish between different levels of human disturbance. These candidate metrics were eliminated in favor of metrics with a broader range of values. For Mid-Atlantic streams, candidate metrics such as number of native cottid species and percentages of Corbicula were eliminated because their ranges of values were simply too small.
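
A range screen of this kind might look like the sketch below. The function name, the thresholds (`min_distinct`, `min_span`), and the site values are all hypothetical, chosen only to mirror the two examples in the text: a richness metric limited to 0 or 1, and a percentage metric spanning only 10 percentage points.

```python
def has_adequate_range(values, min_distinct=3, min_span=None):
    """Return True if a metric takes enough distinct values across sites.

    min_distinct and min_span are hypothetical thresholds for illustration.
    """
    distinct = set(values)
    if len(distinct) < min_distinct:
        return False
    if min_span is not None and max(values) - min(values) <= min_span:
        return False
    return True


# Made-up site values, not MAIA data.
native_cottid_species = [0, 1, 0, 0, 1, 1, 0]        # takes only 0 or 1
pct_tolerant_individuals = [5, 22, 48, 71, 90, 13]   # broad range
pct_rare_taxon = [0, 2, 4, 7, 9, 10]                 # spans only 10 points

print(has_adequate_range(native_cottid_species))
print(has_adequate_range(pct_tolerant_individuals, min_span=10))
print(has_adequate_range(pct_rare_taxon, min_span=10))
```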

Statistical precision was no substitute for correlation with disturbance

If the variability of a candidate metric within individual sites is higher than its variability between all sites, then the measure is unlikely to detect differences in biological condition among sites (or differences at sites that change through time). Signal-to-noise ratios estimate a measure's ability to distinguish differences among sites from differences observed within individual sites. In this context, "signal" is defined as variability of a metric value across all sites and "noise" as variability over repeated visits to the same site during a single year (Kaufmann et al., 1999).
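
As a rough illustration of this definition, the sketch below computes a signal-to-noise ratio from repeat-visit data. It is a simplification and not necessarily the estimator used by Kaufmann et al. (1999): signal is taken as the variance of the site means, and noise as the within-site variance pooled over sites with repeat visits. The function name and the data are hypothetical.

```python
from statistics import mean, variance


def signal_to_noise(visits_by_site):
    """Ratio of among-site variance to pooled within-site variance.

    visits_by_site maps a site id to the metric values from repeat visits.
    Simplification: signal is the variance of the site means, and noise is
    the within-site variance pooled by degrees of freedom.
    """
    site_means = [mean(v) for v in visits_by_site.values()]
    signal = variance(site_means)

    ss_within = 0.0
    df_within = 0
    for v in visits_by_site.values():
        if len(v) > 1:
            m = mean(v)
            ss_within += sum((x - m) ** 2 for x in v)
            df_within += len(v) - 1
    if df_within == 0:
        raise ValueError("at least one site needs repeat visits")
    noise = ss_within / df_within
    return signal / noise


# Made-up values for three sites, each visited twice.
repeatable = {"A": [10.0, 11.0], "B": [20.0, 21.0], "C": [30.0, 31.0]}
noisy = {"A": [10.0, 30.0], "B": [20.0, 12.0], "C": [25.0, 15.0]}

print(signal_to_noise(repeatable))
print(signal_to_noise(noisy))
```

The first data set gives a large ratio because repeat visits agree closely while sites differ; the second gives a ratio below 1 because repeat visits differ more than the sites do.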

Although most metrics incorporated into multimetric indexes have high signal-to-noise ratios, i.e., small within-site variability compared to large between-site variability, a high signal-to-noise ratio alone does not guarantee that a candidate metric will be a meaningful indicator of biological condition. Metric values can be highly repeatable at individual sites but still be unrelated to human disturbance. Consider, for example, pool depth and embeddedness, two candidate metrics described by Kaufmann et al. (1999) in their assessment of habitat measures for MAIA. Pool depth is often considered an indicator of good fish habitat. It is expected to decline as erosion, dredging, and sedimentation fill pools, creating a homogeneous channel profile. Embeddedness was defined as an average of several substrate measures at the stream reach scale; it represented the proportion of the reach filled with sand and fine sediments.

Certainly statistical precision is a desirable property of a good metric, but statistical precision alone does not guarantee a predictable association with human disturbance. For the MAIA study, pool depth was very precise, with a signal-to-noise ratio of 16. Embeddedness was more variable across repeat site visits, with a signal-to-noise ratio of 1.9, which failed to meet the authors' suggested minimum value of 2. Embeddedness, however, showed a strong correlation with human disturbance, whereas pool depth was precise but not related to human disturbance. Embeddedness, though less statistically precise, was the better indicator of biological condition (Figure 5). Thus, statistical precision alone was not a good criterion for metric selection.

The most important signal for biomonitoring is a metric's response to human disturbance. The term "signal" in this case may, in fact, be misleading. The "signal" part of the signal-to-noise ratio is actually just a measure of the observed range of values across all sites. In terms of biological assessment, the actual signal we are interested in detecting is the change in biological condition associated with human disturbance. In short, evaluation of the statistical properties of metrics should never substitute for actual metric testing against disturbance. These are two separate, though complementary, tests.

Figure 5. Ranges of values for mean residual pool depth (RP100, upper panel) and embeddedness (XEMBED, lower panel) for 15 sample sites sorted along the x-axis by disturbance class from least (1) to most (5) disturbed (Bryce et al., 1999). Vertical lines span the range of values recorded for two to six repeat visits to each site. Repeat visits to the same site yielded more similar values for RP100 than for embeddedness indicating greater precision (shorter vertical lines); however, embeddedness consistently increased with greater human disturbance while RP100 did not.

Watershed features were confounded with metric response to disturbance

A good monitoring tool must correlate with human disturbance and show little or no association with natural features. When human disturbance itself is associated with natural features, isolating the biological change associated with human disturbance can be tricky.

Watershed area was an example of this situation. Some fish taxa richness metrics tended to be correlated with watershed area because larger streams have a greater variety of habitat types that support more species. To address this problem, metrics that correlated with watershed area were first regressed against watershed area using only reference sites. The residual values from this regression, that is, the metric values with the influence of watershed area removed statistically, were used to define a new version of the metric that was independent of watershed size. When the "corrected" metrics were significantly correlated with human disturbance, concerns about an underlying spurious correlation with watershed size were eliminated and the metrics were retained as good indicators of human influence. Watershed size was most important for fish, somewhat important for invertebrates, and did not influence diatom assemblages. The different sensitivity across assemblages to watershed area may reflect the relative range size of the organisms. Fish may travel throughout the watershed, invertebrates tend to stay within a reach or local stream, and diatoms may pass their lives on the same rock.
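
The residual correction described above can be sketched as follows. This is a minimal illustration under stated assumptions: a simple linear regression is fitted on reference sites only, and the fitted line is then applied to all sites to obtain residuals. The function names and data are hypothetical, and the text does not say whether watershed area was transformed (for example, log-transformed) before fitting.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def area_corrected(metric, area, is_reference):
    """Residuals from a reference-site regression of the metric on area."""
    ref_x = [a for a, r in zip(area, is_reference) if r]
    ref_y = [m for m, r in zip(metric, is_reference) if r]
    slope, intercept = fit_line(ref_x, ref_y)
    # Observed value minus the value expected for a reference site of that size.
    return [m - (intercept + slope * a) for m, a in zip(metric, area)]


# Made-up data: fish taxa richness and watershed area (arbitrary units).
area = [1.0, 2.0, 3.0, 4.0, 2.0, 4.0]
richness = [4.0, 6.0, 8.0, 10.0, 3.0, 5.0]
is_reference = [True, True, True, True, False, False]

residuals = area_corrected(richness, area, is_reference)
print(residuals)
```

In this made-up example the two non-reference sites have negative residuals: they support fewer taxa than a reference site of the same watershed area would be expected to, which is the quantity then tested against disturbance.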

Metrics from different assemblage types were eliminated for different reasons

The list of plausible metrics proposed for testing in Mid-Atlantic streams was shortest for fish (58), longer for invertebrates (120), and much longer for periphyton (240). The initial list was shortest for fish because the greatest amount of metric testing has occurred for fish; invertebrates come a close second and periphyton a distant third. For periphyton, most of the candidate metrics represented untested hypotheses. Metrics were selected and eliminated for different reasons across assemblages.

The majority of candidate fish metrics were eliminated because they failed to correlate with disturbance (30 metrics; Table 3). In contrast, a larger percentage of candidate invertebrate metrics were consistently associated with disturbance. For the invertebrate index, many metrics that were significantly correlated with multiple measures of disturbance and that demonstrated good statistical properties were excluded because their correlation with one another exceeded 0.7. Approximately 25 metrics were eliminated for this reason, leaving relatively few (7) in the final index (Klemm et al., 2003). Because index precision increases as a function of the number of metrics, some caution may be warranted in eliminating good biological signal on the basis of statistical correlation (Fore, unpublished data). Results are not complete for periphyton, but based on preliminary results for a subset of diatom metrics, there will likely be many metrics to choose from that are significantly correlated with disturbance (Fore, 2002b). Additional criteria related to the type of environmental processes measured by the metrics and metric redundancy will probably be used to select metrics for inclusion in the final multimetric index.
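
A redundancy screen based on the 0.7 correlation threshold might be sketched as below. Several details are assumptions, since the text does not specify them: Pearson correlation is used, the absolute value of the correlation is compared with the threshold, and metrics are considered in a priority order with earlier metrics kept in preference to later ones. The metric names and values are made up.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_redundant(metrics, threshold=0.7):
    """Greedy screen: keep a metric only if |r| <= threshold with all kept.

    metrics is a dict of name -> site values, listed in priority order.
    """
    kept = []
    for name, values in metrics.items():
        if all(abs(pearson_r(values, metrics[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Made-up values at six sites; metric names are hypothetical.
metrics = {
    "ept_taxa": [12, 10, 8, 6, 4, 2],
    "mayfly_taxa": [6, 5, 4, 3, 2, 1],          # tracks ept_taxa exactly
    "pct_tolerant": [10, 40, 15, 35, 20, 30],   # weakly related to ept_taxa
}
print(drop_redundant(metrics))
```

Because the outcome of a greedy screen depends on the order in which metrics are considered, the choice of which member of a correlated pair to keep is a judgment call, consistent with the caution above about discarding good biological signal on statistical grounds alone.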

Table 3. Numbers of candidate metrics tested for MAIA's fish and invertebrate multimetric indexes and a summary of the reasons for which they were eliminated. Starting with the total number of candidate metrics, the number of candidates listed in each row was eliminated because their values spanned an insufficient range, their signal-to-noise ratios were low (indicating low precision), they were redundant with other metrics, they failed to correlate with human disturbance, or their correlation with watershed size could not be corrected. This winnowing process resulted in fewer than 10 metrics included in each final index.

                                              Fish   Invertebrates
Total number of candidate metrics               58             120
Insufficient range                              13              20
Poor signal/noise                                2              66
Redundant                                        3              25
Fail to correlate with disturbance              30               2
Persistent correlation with watershed area       1               0
Number of metrics in final index                 9               7

 
