RENR 580
Clustering Sites by Environmental Factors
In this section, sites are clustered based on the continuous predictor variables: water depth, underwater visibility, mean summer water temperature, mean summer water salinity, and mean summer dissolved oxygen. This groups the sites by their environmental similarities and dissimilarities, allowing us to colour sites by cluster in future figures.
First, I used three methods to determine the optimal number of cluster centres (Fig 1). The gap statistic method suggests that 6 clusters is optimal, the within-cluster sum of squares (WSS) method suggests between 4 and 6 clusters according to the broken-stick rule, and the silhouette method suggests 2 clusters. I chose to proceed with 5 cluster centres, a value within the range suggested by the WSS method and adjacent to the gap statistic optimum.
Fig 1: Determining the optimal number of clusters using the gap statistic (left), total within-cluster sum of squares (centre), and average silhouette width (right).
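To make the silhouette criterion concrete, here is a minimal Python sketch of the average silhouette width for a single clustering. It is only an illustration and not the R code used to produce Fig 1; the function names and the toy coordinates are hypothetical. For each site, a is the mean distance to the other sites in its own cluster and b is the lowest mean distance to the sites of any other cluster; the silhouette is (b - a) / max(a, b), so values near 1 indicate compact, well-separated clusters.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Average silhouette width of one clustering (needs at least 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # a single-site cluster scores 0 by convention
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest neighbouring cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


# Toy data (hypothetical): two tight pairs of sites that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = [0, 0, 1, 1]  # groups the nearby sites together
poor = [0, 1, 0, 1]  # groups distant sites together

print(mean_silhouette(points, good))
print(mean_silhouette(points, poor))
```

Repeating this calculation for clusterings with different numbers of centres and picking the highest average is the idea behind the right-hand panel of Fig 1.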
K-means clustering was performed on the sites using 5 centres, with 50 iterations per run and 50 repeat runs to find the best solution. The clustering results are visualized with an ordination, in this case Principal Coordinates Analysis (PCoA). To help visualize the clusters, I first plotted the results on a three-dimensional plot, which can be rotated to view the clusters from different perspectives (Video 1). Each point on the graph is an observation site, placed relative to the other sites based on the five continuous environmental predictors and coloured by cluster. I then plotted the results on a two-dimensional plot, with vectors that represent the association of each predictor variable with the sites and ellipses that indicate the 95% confidence interval for each cluster (Fig 2).
Video 1: Animation showing clustering results on a 3-dimensional plot.
Fig 2: Sample sites clustered by continuous environmental variables. Ellipses illustrate the 95% confidence interval for each of the five clusters. Black arrows are vectors that represent each of the environmental variables' relative direction and magnitude of association with the sites.
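The repeat-run procedure can be sketched as follows. This is a minimal Python illustration of k-means with random restarts, not the R code behind Fig 2. The function names and the toy site values are hypothetical. Standardizing each variable first is an assumption on my part, since the text above does not say whether the predictors were scaled.

```python
import math
import random


def standardize(rows):
    """Scale each column to mean 0 and standard deviation 1 (assumed step)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(rows, k, max_iter, rng):
    """One run of Lloyd's algorithm from k randomly chosen sites."""
    centres = rng.sample(rows, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(r, centres[c])) for r in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(sq_dist(r, centres[lab]) for r, lab in zip(rows, labels))
    return wss, labels


def kmeans(rows, k, max_iter=50, n_starts=50, seed=1):
    """Repeat the run n_starts times and keep the lowest total WSS."""
    rng = random.Random(seed)
    runs = (kmeans_once(rows, k, max_iter, rng) for _ in range(n_starts))
    return min(runs, key=lambda run: run[0])


# Toy data (hypothetical): depth and temperature for six sites.
sites = [[1.0, 10.0], [1.2, 11.0], [0.9, 10.5],
         [8.0, 30.0], [8.3, 31.0], [7.9, 29.5]]
best_wss, best_labels = kmeans(standardize(sites), k=2)
print(best_wss, best_labels)
```

Keeping the run with the lowest total within-cluster sum of squares is what protects the result from an unlucky starting configuration, which is the purpose of the repeat runs.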
Predictors that Explain Species Distributions
An ordination was performed on the species occurrence records. Because the response variable is ordinated, this is an indirect gradient analysis. The Bray-Curtis dissimilarity was used to calculate the distance matrix, and the pcoa function in R was used for the ordination. The ordination of sites was then plotted, first with sites coloured by habitat type (Figs 3&4), then with sites coloured by the clusters formed in the section above (Figs 5&6). Species and environmental association vectors were plotted as black arrows, and ellipses indicate the 75% confidence interval for each habitat type (Figs 3&4) or environmental cluster (Figs 5&6).
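For readers unfamiliar with the Bray-Curtis measure, the following minimal Python sketch shows how the distance matrix is built from site-by-species counts. It is an illustration only, not the R code used here; the function names and the toy counts are hypothetical. The dissimilarity between two sites is the sum of the absolute differences in species counts divided by the total count across both sites, so it runs from 0 (identical composition) to 1 (no shared species).

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative count vectors."""
    total = sum(u) + sum(v)
    if total == 0:
        raise ValueError("both sites are empty; the dissimilarity is undefined")
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def distance_matrix(sites):
    """Symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    n = len(sites)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = bray_curtis(sites[i], sites[j])
    return d


# Toy data (hypothetical): rows are sites, columns are species counts.
counts = [
    [4, 0, 2],
    [2, 1, 2],
    [0, 5, 0],
]
d = distance_matrix(counts)
for row in d:
    print([round(x, 3) for x in row])
```

In this toy example the first and third sites share no species, so their dissimilarity is exactly 1. A matrix of this kind is the input that a PCoA routine ordinates.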
Fig 3: Ordination of sites by species occurrence records, coloured by habitat type. Ellipses indicate the 75% confidence interval for each habitat type. Black vectors indicate the relative association of each of the species with the sites. The ten species with the strongest associations are labelled. Habitat type codes are explained on the Methods page.
These plots show which species are most strongly associated with which habitat types and with which environmental predictors. For example, Haemulon plumierii and other Haemulon species are most strongly associated with light green, dark green, and dark blue sites (Fig 3), which are the three varieties of isolated patch reef. Halichoeres bivittatus is most strongly associated with orange, light blue, and dark green sites (Fig 3), which are all reefs with low vertical relief. More general trends are also visible: more species are associated with contiguous or spur-groove reefs than with isolated or rubble reefs. Additionally, associations with habitats of high vertical relief are more numerous and stronger than those with habitats of low vertical relief.
As for the continuous environmental predictors (Fig 4), Holacanthus tricolor and Cephalopholis cruentata are most strongly associated with deep sites. Thalassoma bifasciatum, Stegastes partitus, and Halichoeres garnoti are most strongly associated with sites that have high visibility and high dissolved oxygen. Overall, species tend to be associated more strongly with low temperatures and high dissolved oxygen.
Fig 4: Ordination of sites by species occurrence records coloured by habitat type. Ellipses indicate the 75% confidence interval for each habitat type. Black vectors indicate the relative association of each of the environmental predictors with the sites. Habitat type codes are explained on the Methods page.
The same ordination was plotted again, this time with sites coloured by environmental cluster (see the Clustering Sites by Environmental Factors section). This was done to see which method of grouping sites (by habitat type as in Figs 3&4, or by cluster as in Figs 5&6) resulted in greater separation between groups. Comparing the ellipses of Figs 3&4 with those of Figs 5&6, we can see that grouping by habitat type yields a higher degree of separation between groups than grouping by cluster.
Fig 5: Ordination of sites by species occurrence records, coloured by cluster. Ellipses indicate the 75% confidence interval for each cluster. Black vectors indicate the relative association of each of the species with the sites. The ten species with the strongest associations are labelled.
Fig 6: Ordination of sites by species occurrence records, coloured by cluster. Ellipses indicate the 75% confidence interval for each cluster. Black vectors indicate the relative association of each of the environmental predictors with the sites.
Identification of Reefs Potentially at Risk
Predicting Habitat Type
The randomForest package in R was used to predict the habitat type of each site from its species composition. The predicted habitat type was then compared to the actual habitat type, and misclassified sites were identified. Of the 545 sites, 46 (about 8%) were misclassified, meaning they have an atypical species composition for their habitat type. The habitat types with the highest proportion of misclassified sites were those with the fewest records, probably because they provided the least training data to the random forest algorithm (Fig 7). Isolated patch reefs with low vertical relief had the most misclassifications relative to the amount of training data (Fig 7), which suggests that this reef type has the least consistent reef fish communities.
Fig 7: Number of correctly classified and misclassified sites by habitat type.
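The comparison step behind Fig 7 can be sketched as below. This minimal Python illustration covers only the tally of predicted against actual habitat types; it does not fit a random forest and is not the R code used here. The function name, the habitat codes "A" to "C", and the labels themselves are hypothetical, and the predictions are assumed to come from an already fitted classifier.

```python
from collections import Counter


def misclassification_summary(actual, predicted):
    """Per habitat type: number of sites, number misclassified, and proportion."""
    totals = Counter(actual)
    wrong = Counter(a for a, p in zip(actual, predicted) if a != p)
    return {
        habitat: {
            "n": n,
            "misclassified": wrong[habitat],
            "proportion": wrong[habitat] / n,
        }
        for habitat, n in totals.items()
    }


# Toy data (hypothetical): actual and predicted habitat codes for seven sites.
actual = ["A", "A", "A", "A", "B", "B", "C"]
predicted = ["A", "A", "A", "B", "B", "A", "C"]

summary = misclassification_summary(actual, predicted)
for habitat, stats in sorted(summary.items()):
    print(habitat, stats)

# Indices of the sites whose predicted habitat differs from the actual one.
misclassified_sites = [
    i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p
]
print(misclassified_sites)
```

Reporting the proportion alongside the raw count matters because a habitat type with few records can have a high misclassification rate from only one or two errors.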
Comparing Correctly Classified versus Misclassified Sites
To check whether misclassification is associated with any environmental factor, a boxplot comparing correctly classified and misclassified sites was produced for each environmental variable (Fig 8). No significant association was found between misclassification and any environmental predictor.
Fig 8: Comparison of sites with correctly classified habitat type versus misclassified habitat type.
Identifying Reefs Potentially at Risk
Misclassified reefs (sites with atypical fish communities) are marked on the map with rings to locate sites that should be investigated further (Fig 9). Atypical reef fish community structure together with a low Simpson's diversity index is a strong indicator of a reef whose ecological functioning is at risk. For this reason, misclassified sites were ranked by their Simpson's diversity index (Fig 9). The majority of at-risk sites are found in the north-eastern keys; however, this appears to be due to sampling bias. The reef sites with the greatest potential of being at risk should be monitored closely and strongly considered for restoration initiatives.
Fig 9: Map of survey sites. Each misclassified site is identified by a ring. Misclassified sites are ranked by diversity.
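The ranking step can be sketched as follows. This minimal Python illustration is not the R code used for Fig 9; the function names, site identifiers, habitat codes, and counts are all hypothetical. Simpson's index has several common forms, and the sketch assumes the form 1 minus the sum of squared species proportions, where lower values mean lower diversity. The report does not state which form was used.

```python
def simpson_diversity(counts):
    """Simpson's diversity as 1 - sum(p_i^2); this form is an assumption."""
    total = sum(counts)
    if total == 0:
        raise ValueError("site has no individuals; the index is undefined")
    return 1.0 - sum((c / total) ** 2 for c in counts)


def rank_at_risk(sites):
    """Misclassified site ids, ordered from lowest to highest diversity."""
    flagged = [
        (simpson_diversity(s["counts"]), s["id"])
        for s in sites
        if s["actual"] != s["predicted"]
    ]
    return [site_id for _, site_id in sorted(flagged)]


# Toy data (hypothetical): habitat labels and species counts for four sites.
survey = [
    {"id": "s1", "actual": "A", "predicted": "B", "counts": [10, 0, 0]},
    {"id": "s2", "actual": "A", "predicted": "A", "counts": [5, 5, 0]},
    {"id": "s3", "actual": "B", "predicted": "A", "counts": [5, 5, 0]},
    {"id": "s4", "actual": "C", "predicted": "A", "counts": [8, 1, 1]},
]

ranking = rank_at_risk(survey)
print(ranking)
```

Under these assumptions, the sites at the front of the list combine an atypical community with low diversity, which makes them the first candidates for closer monitoring.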