Discriminant Function Analysis (DFA)
Description: DFA uses a set of independent variables (IV's) to separate cases based on groups you define; the grouping variable is the dependent variable (DV) and it is categorical. DFA creates new variables based on linear combinations of the independent set that you provided. These new variables are defined so that they separate the groups as far apart as possible. How well the model performed is usually reported in terms of the classification efficiency, that is, how many cases would be correctly assigned to their groups using the new variables from DFA. The new variables can also be used to classify a new set of cases.
1 DV = f (2 or more IV's).
Simple example: You notice that streams in different ecoregions seem to be shaped differently. A set of variables related to stream shape might include width, depth, gradient, and flow. With sites grouped according to ecoregion, DFA would test whether stream shape can accurately classify stream sites according to ecoregion.
MAIA example: For the estuary invertebrate index, Llanso et al. (in review) used cluster analysis to group reference sites based on species abundances at each site. After deriving the clusters, they used discriminant function analysis to relate the clusters to 5 environmental variables. For this DFA, the dependent variable was cluster membership and the independent variables were salinity, silt-clay substrate, depth, latitude, and total organic carbon.
| Variable | DF 1 | DF 2 | DF 3 |
|---|---|---|---|
| Salinity (ppt) | 0.99 | 0.07 | -0.01 |
| Silt-Clay | -0.16 | 0.94 | 0.26 |
| Depth (m) | 0.19 | 0.33 | 0.07 |
| Latitude | -0.07 | -0.28 | 0.88 |
| TOC (%) | -0.25 | 0.78 | -0.04 |
Table: Correlations between discriminant functions (DF) and environmental variables. Salinity, percent silt-clay substrate, and latitude best predicted cluster membership for sites. TOC = total organic carbon.
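Correlations like those in the table can be obtained by correlating each environmental variable with the site scores on each discriminant function. The following is a minimal sketch in Python with NumPy, not the procedure used by Llanso et al. The data are synthetic, the variable names, cluster centers, and spreads are hypothetical, and the sketch assumes total (rather than pooled within-group) correlations, because the source does not say which kind was reported.

```python
import numpy as np

# Hypothetical data: 4 clusters of 25 sites each, 3 environmental variables.
rng = np.random.default_rng(1)
names = ["salinity", "silt_clay", "depth"]
centers = np.array([[5.0, 20.0, 2.0],
                    [15.0, 20.0, 3.0],
                    [25.0, 60.0, 4.0],
                    [30.0, 30.0, 8.0]])
spread = np.array([3.0, 10.0, 1.0])
X = np.vstack([rng.normal(c, spread, size=(25, 3)) for c in centers])
cluster = np.repeat(np.arange(4), 25)

# Total, within-group, and between-group sums of squares and cross-products.
Xc = X - X.mean(axis=0)
resid = np.vstack([X[cluster == g] - X[cluster == g].mean(axis=0)
                   for g in np.unique(cluster)])
T = Xc.T @ Xc
S_w = resid.T @ resid
S_b = T - S_w

# Discriminant functions: eigenvectors of S_w^-1 S_b, largest eigenvalue first.
evals, evecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
order = np.argsort(evals.real)[::-1]
evals = evals.real[order]
W = evecs.real[:, order]
scores = Xc @ W  # one column of site scores per discriminant function

# Correlation of each variable (rows) with each discriminant function (columns).
# The sign of each column is arbitrary because eigenvectors have no fixed sign.
loadings = np.corrcoef(np.hstack([X, scores]), rowvar=False)[:3, 3:]

for name, row in zip(names, loadings):
    print(f"{name:10s}", "  ".join(f"{r:6.2f}" for r in row))
```

A variable with a large absolute correlation on a function contributes strongly to the separation that function describes, which is how the table above is read.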
Diatom example (metric signatures): Different sets of metrics are sometimes associated with different types of human disturbance. These patterns can represent a 'signature' for a particular pattern of land use.
Three types of sites were selected from the larger data set: sites in primarily agricultural watersheds, sites with a high percentage of urban development in their watersheds, and sites with mining or toxic point sources upstream.
Discriminant function analysis (DFA) was used to test whether particular metrics would distinguish between these types of disturbance. Urban sites were characterized by larger percentages of valves belonging to species that tolerate salt and species that tolerate inorganic nutrients. In contrast, agricultural sites were characterized by larger percentages of valves belonging to motile genera that tolerate sediment and species that tolerate organic nutrients.
Urban areas have more roads that must be de-iced with salt in the winter, as well as wastewater treated with chloride; either source of chloride could select for salt-tolerant diatoms. Agricultural areas often have erosion, which may select for motile diatoms that can tolerate sediment. Sewage in urban areas is typically treated to yield inorganic nutrients, while untreated livestock waste yields organic nutrients.
How the method works: DFA creates a new variable from the independent variables. This new variable defines a line onto which the group centers would plot as far apart as possible from each other. In other words, this new variable is defined such that it provides the maximum separation between groups of cases. This process repeats with successive new variables, each uncorrelated with the earlier ones, that further separate the group centers.
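The search for maximally separating linear combinations can be written as an eigenvalue problem: the coefficient vectors are the eigenvectors of the within-group scatter matrix inverted and multiplied by the between-group scatter matrix. The sketch below illustrates this with NumPy on synthetic data. The function names `discriminant_axes` and `separation` and the three-group data set are hypothetical, chosen only for illustration.

```python
import numpy as np

def discriminant_axes(X, y):
    """Return eigenvalues and coefficient vectors of the discriminant functions."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    groups = np.unique(y)
    p = X.shape[1]
    grand_mean = X.mean(axis=0)
    S_w = np.zeros((p, p))  # within-group scatter
    S_b = np.zeros((p, p))  # between-group scatter
    for g in groups:
        Xg = X[y == g]
        mean_g = Xg.mean(axis=0)
        S_w += (Xg - mean_g).T @ (Xg - mean_g)
        d = (mean_g - grand_mean)[:, None]
        S_b += len(Xg) * (d @ d.T)
    # Directions that maximize between-group relative to within-group scatter.
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(evals.real)[::-1]
    n_funcs = min(p, len(groups) - 1)  # at most (groups - 1) useful functions
    return evals.real[order][:n_funcs], evecs.real[:, order][:, :n_funcs]

def separation(v, y):
    """Between-group divided by within-group sum of squares for one variable."""
    grand = v.mean()
    groups = np.unique(y)
    between = sum(np.sum(y == g) * (v[y == g].mean() - grand) ** 2 for g in groups)
    within = sum(np.sum((v[y == g] - v[y == g].mean()) ** 2) for g in groups)
    return between / within

# Synthetic data: 3 groups of 30 cases, 3 independent variables.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [3.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
X = np.vstack([rng.normal(c, 1.0, size=(30, 3)) for c in centers])
y = np.repeat([0, 1, 2], 30)

evals, W = discriminant_axes(X, y)
scores = X @ W  # cases projected onto the new variables

for j in range(X.shape[1]):
    print(f"original variable {j}: separation = {separation(X[:, j], y):.2f}")
print(f"discriminant function 1: separation = {separation(scores[:, 0], y):.2f}")
```

The first discriminant function separates the groups at least as well as any single original variable, and its separation ratio equals the largest eigenvalue.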
Assumptions/limitations: Like most multivariate techniques, DFA is sensitive to outliers and assumes multivariate normality. It also assumes that the within-group variability of the independent variables (their variances and covariances) is similar across groups.
Classification efficiency of cases is often tested by plugging in the same data used to define the model. By using the same data to define and test the model, you will almost always overestimate the classification skill of the DFA model. One way to avoid this problem is to jackknife the sample: leave out 1 case at a time, rerun the model on the remaining cases, and use the results to classify the left-out case.
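The jackknife procedure described above can be sketched as follows and compared with testing on the same data used to build the model. This is an illustrative implementation, not a reference one: it assumes equal prior probabilities and a pooled within-group covariance matrix, so each case is assigned to the group whose center is nearest in Mahalanobis distance. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate group means and the inverse pooled within-group covariance."""
    groups = np.unique(y)
    means = np.array([X[y == g].mean(axis=0) for g in groups])
    resid = np.vstack([X[y == g] - X[y == g].mean(axis=0) for g in groups])
    pooled = resid.T @ resid / (len(X) - len(groups))
    return groups, means, np.linalg.inv(pooled)

def predict_lda(model, X):
    """Assign each case to the group with the smallest Mahalanobis distance."""
    groups, means, precision = model
    diff = X[:, None, :] - means[None, :, :]            # shape (cases, groups, variables)
    d2 = np.einsum("ngp,pq,ngq->ng", diff, precision, diff)
    return groups[np.argmin(d2, axis=1)]

def resubstitution_accuracy(X, y):
    """Classification efficiency when the model is tested on its own data."""
    return float(np.mean(predict_lda(fit_lda(X, y), X) == y))

def jackknife_accuracy(X, y):
    """Classification efficiency with each case left out of its own model."""
    hits = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        model = fit_lda(X[keep], y[keep])
        hits += int(predict_lda(model, X[i:i + 1])[0] == y[i])
    return hits / len(X)

# Synthetic data: 3 overlapping groups, 12 cases each, 4 independent variables.
rng = np.random.default_rng(7)
centers = np.array([[0.0, 0.0, 0.0, 0.0],
                    [1.0, 0.5, 0.0, 0.0],
                    [0.0, 1.0, 0.5, 0.0]])
X = np.vstack([rng.normal(c, 1.0, size=(12, 4)) for c in centers])
y = np.repeat([0, 1, 2], 12)

print("resubstitution:", resubstitution_accuracy(X, y))
print("jackknife:     ", jackknife_accuracy(X, y))
```

With small, overlapping groups like these, the resubstitution estimate is typically the more optimistic of the two, which is the bias the jackknife is meant to reduce.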