Functional group classification using consensus clustering

doi:10.1371/journal.pcbi.1014278

Fig 1.

Conceptual diagram for consensus clustering with data uncertainty.

(a) The model takes as input the data and their prediction errors, which account for data uncertainty. Both are used to generate resamples, each of which is clustered with a Gaussian mixture model (GMM), with the number of groups selected by the Bayesian information criterion (BIC). (b) These resamples yield a clustering ensemble from which, for each pair of species, the proportion of resamples in which the two were assigned to the same group can be computed. These proportions form the consensus matrix. (c) The consensus matrix is used as a distance matrix for hierarchical clustering, with the number of groups chosen to maximize the silhouette score (no likelihood is available at this stage for BIC-based selection), yielding the final group assignments.
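Step (b) can be illustrated with a minimal sketch. It assumes the clustering ensemble is already available as one list of labels per resample; the function name `consensus_matrix` and the toy `ensemble` are hypothetical and are not taken from the authors' code.

```python
from itertools import combinations


def consensus_matrix(label_sets):
    """Proportion of resamples in which each pair of items shares a cluster.

    label_sets: one label list per resample, all of the same length.
    Labels only need to be consistent within a resample, not across them.
    """
    n_runs = len(label_sets)
    n_items = len(label_sets[0])
    counts = [[0] * n_items for _ in range(n_items)]
    for labels in label_sets:
        for i in range(n_items):
            counts[i][i] += 1  # an item is always linked with itself
        for i, j in combinations(range(n_items), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    return [[c / n_runs for c in row] for row in counts]


# Toy ensemble (illustrative data): three resamples of four species.
ensemble = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
consensus = consensus_matrix(ensemble)
```

Because only co-membership is counted, the arbitrary label names used in each resample do not need to be aligned across resamples.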


Table 1.

Trait statistics. For each trait, the table reports the unit of measurement; the mean and standard deviation of the natural-log-transformed values; the mean, standard deviation, minimum, and maximum on the original scale; the number of error samples; and the mean absolute error (MAE) on the standardised log scale.


Fig 2.

Optimization for number of groups.

All BIC curves and the resulting average curve, shown for (a) unscaled and (b) scaled data. (c) Silhouette score for different aggregations of the consensus matrix. A higher value of K corresponds to cutting the dendrogram at a lower level, which splits one existing group. Note that BIC is optimized independently for each resample in (a) and (b), so partitions with the same number of clusters may differ substantially. In contrast, the silhouette score is optimized at the aggregated level across all resamples. As a result, the optimal numbers of clusters identified by the two criteria will not necessarily coincide.
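The silhouette-based choice of K in panel (c) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the distance matrix `dist` and the candidate partitions are toy values, and `silhouette_from_distances` is a hypothetical helper that uses the standard silhouette definition on a precomputed distance matrix.

```python
def silhouette_from_distances(dist, labels):
    """Mean silhouette score from a precomputed distance matrix.

    Assumes at least two clusters. Singleton clusters contribute 0,
    following the usual convention.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i, lab_i in enumerate(labels):
        own = clusters[lab_i]
        if len(own) == 1:
            continue  # singleton: silhouette is 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != lab_i
        )
        total += (b - a) / max(a, b)
    return total / len(labels)


# Toy distance matrix (illustrative): items 0-1 and 2-3 form tight pairs.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
# Hypothetical partitions obtained by cutting a dendrogram at K = 2 and K = 3.
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
scores = {k: silhouette_from_distances(dist, lab) for k, lab in candidates.items()}
best_k = max(scores, key=scores.get)
```

In this toy case, moving from K = 2 to K = 3 splits the pair (2, 3) into two singletons, which lowers the mean silhouette, so K = 2 is retained.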


Fig 3.

Summary of functional groups results.

(a) The consensus matrix displays the average consensus for all species pairs within clusters along the diagonal, while off-diagonal values represent the average consensus between species from different clusters (row vs. column). Square sizes are proportional to their respective cluster size. (b) Distribution of average consensus for the 42 observed groups, ranging from 0 (no consensus) to 1 (maximum consensus), divided into bins of width 0.1. (c) Size distribution of the 42 groups, where cluster size refers to the number of species per group, with bins of width 100.
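The block averages summarised in panel (a) could be computed along these lines. The sketch assumes that within-cluster averages exclude each species' consensus with itself, which the caption does not state; `block_average` and the toy matrix are hypothetical.

```python
def block_average(consensus, labels, a, b):
    """Average consensus between the members of clusters a and b.

    When a == b, self-pairs (the matrix diagonal) are excluded; this is
    an assumption of the sketch. Returns NaN if no pair exists, for
    example within a singleton cluster.
    """
    rows = [i for i, lab in enumerate(labels) if lab == a]
    cols = [j for j, lab in enumerate(labels) if lab == b]
    values = [consensus[i][j] for i in rows for j in cols if i != j]
    return sum(values) / len(values) if values else float("nan")


# Toy consensus matrix (illustrative) for four species in two clusters.
toy_consensus = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.0],
    [0.1, 0.3, 1.0, 0.6],
    [0.2, 0.0, 0.6, 1.0],
]
toy_labels = [0, 0, 1, 1]
within_0 = block_average(toy_consensus, toy_labels, 0, 0)
within_1 = block_average(toy_consensus, toy_labels, 1, 1)
between = block_average(toy_consensus, toy_labels, 0, 1)
```

Here `within_0` and `within_1` correspond to diagonal squares of the summary matrix and `between` to an off-diagonal square.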


Fig 4.

t-SNE visualization of clustering results.

Species are visualized in a two-dimensional functional space using t-Distributed Stochastic Neighbor Embedding (t-SNE), a probabilistic technique that reduces high-dimensional data (originally 18 dimensions) to two dimensions while preserving the local structure and similarity relationships of the data points. Different colors and markers represent distinct groups, chosen to maximize visual distinction among clusters based on their distribution. The distances in this 2D space are probabilistic, reflecting the likelihood of similarity between species, and approximate the original high-dimensional probability distribution. Note that t-SNE is intended for visualisation of the shape and location of clusters, but it does not have the same interpretation as other dimensionality reduction techniques such as PCA (see S5 and S6 Figs for alternate visualisations). Please also see the interactive version of this figure in S1 Appendix, or available from the GitHub repository.


Fig 5.

Centroids of 42 functional groups.

Mean values were calculated for each group across their species after normalizing the original dataset for each trait. The colour gradient represents these normalized values and is consistent across all traits, with values clipped at 3 standard deviations from the mean to enhance visualization contrast. A dendrogram, constructed using the Ward-linkage method, is included to illustrate the relative consensus distances among the groups.
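The centroid calculation can be sketched as below. Several details are assumptions of this illustration, not statements from the source: normalization is taken to be a per-trait z-score using the population standard deviation, and clipping is applied to the centroid values that would be displayed. The function `group_centroids` and the toy data are hypothetical, and the Ward-linkage dendrogram is not covered.

```python
from statistics import mean, pstdev


def group_centroids(rows, groups, clip=3.0):
    """Per-group mean of z-scored traits, clipped to [-clip, clip].

    rows: one list of trait values per species.
    groups: the group label of each species, in the same order.
    Assumes every trait has non-zero variance.
    """
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]  # population SD (an assumption)
    z_rows = [
        [(v - mus[t]) / sds[t] for t, v in enumerate(row)] for row in rows
    ]
    by_group = {}
    for z_row, g in zip(z_rows, groups):
        by_group.setdefault(g, []).append(z_row)
    return {
        g: [max(-clip, min(clip, mean(col))) for col in zip(*members)]
        for g, members in by_group.items()
    }


# Toy data (illustrative): four species, two traits, two groups.
traits = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
membership = ["A", "A", "B", "B"]
centroids = group_centroids(traits, membership)
```

Sharing one clipped scale across traits is what allows a single colour gradient to be read consistently across all columns of the heatmap.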


Fig 6.

Functional group diversity metrics compared to traditional measures of species richness and functional diversity, calculated for each of n = 16,048 forested pixels across the globe (with more than 3 clusters), as reported in Paz et al. [12].

The points show the raw observations, with the lines denoting the fitted trends from generalised additive models, adjusted for spatial autocorrelation (see Methods), along with 95% confidence intervals. (a-c) Functional group richness (the number of unique groups) is highly correlated with species richness at low levels, but saturates after approximately 500 species in the system. Functional redundancy is weakly correlated with Rao’s Q, which reflects spatial clustering in trait space, but, compared to functional redundancy, exhibits a stronger (albeit still relatively weak) correlation with functional richness. (d-f) Functional redundancy (Simpson’s Index applied to groups) is inversely correlated with species richness, revealing significant variation in redundancy at high levels of richness, but is only weakly correlated with Rao’s Q and functional richness, capturing complementary aspects of both. These results illustrate how traditional diversity metrics can be applied to functional groups, yielding intuitive and simple metrics for quantifying functional composition.
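The two group-level metrics can be sketched for a single pixel as follows. The caption does not say which form of Simpson's Index is used, so this illustration assumes the dominance form, D = Σ pᵢ², where pᵢ is the proportion of the pixel's species that belong to group i; other common variants are 1 − D and 1/D. The function names and the toy pixel are hypothetical.

```python
from collections import Counter


def group_richness(species_groups):
    """Number of unique functional groups present in a pixel."""
    return len(set(species_groups))


def simpson_index(species_groups):
    """Simpson's dominance index over groups: sum of squared proportions.

    Assumed form for this sketch; it approaches 1 when most species
    fall into the same group.
    """
    counts = Counter(species_groups)
    n = len(species_groups)
    return sum((c / n) ** 2 for c in counts.values())


# Toy pixel (illustrative): the group label of each species present.
pixel = ["G1", "G1", "G1", "G2"]
richness = group_richness(pixel)
redundancy = simpson_index(pixel)
```

For this pixel, three of the four species share group G1, so the index is 0.75² + 0.25² = 0.625 with a group richness of 2.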
