SPARTA: Interpretable functional classification of microbiomes and detection of hidden cumulative effects

doi:10.1371/journal.pcbi.1012577

Fig 1.

A schematic representation of SPARTA’s pipeline.

Taking taxonomic tables and their associated labels as inputs, the pipeline derives functional descriptions of the microbiota samples via EsMeCaTa. Both the taxonomic and the functional profiles are then used to train RF models that discriminate Control from Patient profiles. The average importance scores of the variables over all trained forests then serve as the basis for selecting significantly discriminant variables, which can either be fed back into the process iteratively or returned as output. For robustness, the process is repeated 10 times, yielding 10 different lists of significantly discriminant taxa and FAs. These lists are compiled into categories that group variables by level of robustness, based on how frequently they appear in the significant lists: unanimous variables are considered “robust” discriminators, those selected by 75% or more of the classifiers are considered “confident”, and those selected at least once are considered “candidates”. Within the pipeline implementation, robust features are labeled “Core-significant” and the others “Meta-X significant”, X being the percentage of significant variable lists that include them.
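The robustness categories described in this caption can be illustrated with a minimal Python sketch. It assumes the categories are nested (every robust variable is also confident and a candidate), and the function name, argument names, and the exact formatting of the “Meta-X significant” label are illustrative assumptions rather than SPARTA's actual code.

```python
from collections import Counter


def categorize_features(significant_lists, confident_threshold=0.75):
    """Group features by how often they appear across per-run significant lists.

    significant_lists: one list of selected feature names per run (10 in the caption).
    Returns (categories, labels). The nesting of categories is an assumption.
    """
    n_runs = len(significant_lists)
    # Count each feature at most once per run.
    counts = Counter(f for run in significant_lists for f in set(run))

    categories = {"robust": [], "confident": [], "candidate": []}
    labels = {}
    for feature, count in sorted(counts.items()):
        if count == n_runs:
            # Unanimous across all runs.
            categories["robust"].append(feature)
            labels[feature] = "Core-significant"
        else:
            # X is the percentage of lists that include the feature
            # (label formatting is assumed for illustration).
            labels[feature] = f"Meta-{round(100 * count / n_runs)} significant"
        if count / n_runs >= confident_threshold:
            categories["confident"].append(feature)
        # Every counted feature was selected at least once.
        categories["candidate"].append(feature)
    return categories, labels


# Hypothetical example: "a" is selected in 10 runs, "b" in 8, "c" in 1.
runs = [["a", "b"]] * 8 + [["a"], ["a", "c"]]
categories, labels = categorize_features(runs)
```

With these hypothetical inputs, "a" falls in all three categories, "b" is confident and a candidate, and "c" is only a candidate.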


Fig 2.

Classification algorithm implemented in SPARTA.

For a given run k, a test subset is randomly selected from the initial dataset and set aside. A given iteration j consists of training X random forests (20 by default), each with a dedicated validation subset. These X forests are used to compute a median classification performance P(j, k) and a shortlist of important features. This list is used to train the X random forests of iteration j + 1. By default, SPARTA launches 10 runs and 5 iterations.
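The control flow of a single run can be sketched in Python as follows. This is only a structural illustration: `toy_train_forest` is a hypothetical stand-in that returns synthetic performances and importances instead of training a real random forest, the test and validation splits are not modeled, and the "keep the top fraction" rule is an assumed placeholder for SPARTA's actual selection of significant features.

```python
import random
import statistics


def toy_train_forest(features, seed):
    """Hypothetical stand-in for training one forest on its own validation split.

    Returns (performance, {feature: importance}); all values are synthetic.
    """
    rng = random.Random(seed)
    importances = {f: rng.random() for f in features}
    performance = 0.5 + 0.5 * rng.random()
    return performance, importances


def run_once(features, n_forests=20, n_iterations=5, keep_fraction=0.5,
             train_forest=toy_train_forest, seed=0):
    """One run k: returns ([P(j, k) for each iteration j], final shortlist)."""
    performances = []
    current = list(features)
    for j in range(n_iterations):
        # Train the X forests of iteration j on the current feature list.
        results = [train_forest(current, seed * 10_000 + j * 100 + i)
                   for i in range(n_forests)]
        # Median classification performance P(j, k).
        performances.append(statistics.median(p for p, _ in results))
        # Average importance of each feature over the forests.
        mean_importance = {f: statistics.fmean(imp[f] for _, imp in results)
                           for f in current}
        # Assumed shortlist rule: keep the top fraction by average importance.
        ranked = sorted(current, key=mean_importance.get, reverse=True)
        current = ranked[:max(1, int(len(ranked) * keep_fraction))]
    return performances, current


# Hypothetical usage with 16 placeholder feature names.
performances, shortlist = run_once([f"f{i}" for i in range(16)])
```

Passing a real training function as `train_forest` would keep the same loop structure; repeating `run_once` with different seeds would correspond to the 10 runs.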


Fig 3.

Classification performances of RF models trained on taxonomic and functional profiles, and impact of the variable selection on performance.

Median classification performances (AUC) for both types of profiles and each dataset, on the original datasets as well as at the optimal level of selection, over 10 full runs of the pipeline. Each of these runs involved a different randomly selected test set of individuals, which was used for both profiles. Performances and importance scores for each run were computed and averaged over 20 distinctly trained RF models. The number of selection iterations required to obtain the best average among these median AUCs is shown beside each plot. Instances where the difference in performance between functional and taxonomic profiles using SPARTA is significant for the same dataset (based on a Mann-Whitney U-test) are marked with a * symbol.
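As a small illustration of how the optimal level of selection could be read off, the Python sketch below averages the per-run median AUCs at each iteration and returns the iteration with the best average. The data layout (one list of median AUCs per run, with index 0 standing for the original dataset) and the function name are assumptions made for this example.

```python
import statistics


def best_selection_level(median_aucs):
    """median_aucs[k][j]: median AUC of run k at selection iteration j.

    Index 0 is assumed to be the original (unselected) dataset.
    Returns (best iteration index, average of the median AUCs at that iteration).
    """
    n_iterations = len(median_aucs[0])
    averages = [statistics.fmean(run[j] for run in median_aucs)
                for j in range(n_iterations)]
    best = max(range(n_iterations), key=averages.__getitem__)
    return best, averages[best]


# Hypothetical AUC values for two runs and three selection levels.
best, average = best_selection_level([[0.70, 0.80, 0.75],
                                      [0.72, 0.78, 0.79]])
```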


Table 1.

Application of the SPARTA selection process to identify signature taxa and functions on 6 reference datasets.


Fig 4.

Comparison between SPARTA and limma functional selections.

A: Number of important selected FAs for each run at the best iteration, for the six datasets. Number of FAs selected by SPARTA and limma, for all datasets. Limma selections were performed with an adjusted p-value threshold of 0.05. Both selection methods were repeated 10 times, with a common test subset set aside each time. B: Comparison between robust and candidate FAs for the T2D dataset. The limma subsets were obtained using the classic threshold of 0.05. Values indicate the number of annotations in each intersection and do not represent the size of a category as a whole. The white circle includes all annotations from the full dataset.


Table 2.

Sizes of the SPARTA and limma selections.

Limma was applied with an adjusted p-value threshold of 0.05. From left to right, the columns present, for SPARTA and limma, the sizes of the robust, confident, and candidate subsets produced by the corresponding selection method, iterated 10 times with identical test subsets.


Table 3.

Robust subset of annotations from the IBD dataset.


Table 4.

Robust subset of taxa from the IBD dataset.


Fig 5.

Number of taxa associated with each robust annotation, as a function of the number of associated robust taxa, for the IBD dataset.

Four groups of annotations are represented, three of which were determined based on the total number of taxa attached to the annotation: those within the top 10% of the scale of these values were labeled ‘Ubiquitous’, those in the bottom 10% were labeled ‘Specific’, and the others were labeled ‘In-between’. The final category corresponds to the robust significant annotations with no relationship to the robust significant taxa (‘Cumulative’). The highlighted annotations are those used as illustrative examples in Fig 6.
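One possible reading of this grouping is sketched below in Python. It assumes that "top 10% of the scale" refers to the value range (between the minimum and maximum number of attached taxa) rather than to percentiles of the annotations, and that the ‘Cumulative’ label takes precedence over the other three. Both points, as well as the function and argument names, are assumptions made for illustration.

```python
def categorize_annotations(total_taxa, robust_taxa, fraction=0.10):
    """Assign each robust annotation to one of the four groups of the caption.

    total_taxa:  {annotation: number of taxa attached to it}
    robust_taxa: {annotation: number of attached taxa that are robust}
    """
    lowest, highest = min(total_taxa.values()), max(total_taxa.values())
    span = highest - lowest
    groups = {}
    for annotation, total in total_taxa.items():
        if robust_taxa.get(annotation, 0) == 0:
            # No relationship to any robust taxon.
            groups[annotation] = "Cumulative"
        elif total >= highest - fraction * span:
            # Top 10% of the value range (assumed interpretation).
            groups[annotation] = "Ubiquitous"
        elif total <= lowest + fraction * span:
            # Bottom 10% of the value range.
            groups[annotation] = "Specific"
        else:
            groups[annotation] = "In-between"
    return groups


# Hypothetical counts for five placeholder annotations.
groups = categorize_annotations(
    {"fa_a": 100, "fa_b": 1, "fa_c": 50, "fa_d": 95, "fa_e": 60},
    {"fa_a": 5, "fa_b": 1, "fa_c": 3, "fa_d": 2, "fa_e": 0},
)
```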


Fig 6.

Associations between robust functions and the associated robust taxa predicted by SPARTA, for the best iteration on the IBD dataset.

Depicted annotations were selected to be representative examples of the different categories highlighted in Fig 5, and are presented with the same color scheme. Taxa are colored based on their normalized average differential expression between Control (red) and Unhealthy (blue) profiles. The width of the connections is proportional to the importance of the association. The arrow between a given function and the generic ‘Non-robust’ node represents the contribution of non-robust taxa to the considered function.


Table 5.

Distribution of samples within the reference datasets.


Fig 7.

Application of EsMeCaTa and calculation of Scores of FAs in the context of the sparta esmecata step of the pipeline.

The inputs represented here are taxonomic units, potentially containing several species. EsMeCaTa is compatible with this paradigm, but can also process data directly at the species level. EsMeCaTa queries the UniProt database to gather the proteomes of all species included in the input taxon. A meta-proteome for the entire taxon is then computed by clustering with MMseqs2 [71] and retaining the clusters with a 95% incidence across the proteomes. UniProt is then queried a second time to retrieve the FAs of all retained protein clusters. A weighted association between taxon and annotation can be established in this manner. By combining this information with the taxon’s initial abundance, the expression of the FAs can be quantified.
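The final combination step can be illustrated with a short Python sketch. It assumes that the score of an FA in a sample is the sum, over taxa, of the taxon's abundance multiplied by the taxon-annotation weight; the caption only states that the two are combined, so this formula, the function name, and the identifiers in the example are illustrative assumptions.

```python
def functional_scores(abundances, annotation_weights):
    """Compute FA scores for one sample.

    abundances:         {taxon: abundance of the taxon in the sample}
    annotation_weights: {taxon: {annotation: weight of the association}}
    Assumed formula: score(FA) = sum over taxa of abundance * weight.
    """
    scores = {}
    for taxon, abundance in abundances.items():
        # Taxa without any annotation contribute nothing.
        for annotation, weight in annotation_weights.get(taxon, {}).items():
            scores[annotation] = scores.get(annotation, 0.0) + abundance * weight
    return scores


# Hypothetical taxa, annotations, and weights.
scores = functional_scores(
    {"taxon_1": 2.0, "taxon_2": 3.0, "taxon_3": 1.0},
    {"taxon_1": {"fa_1": 1, "fa_2": 2}, "taxon_2": {"fa_1": 4}},
)
```

Applying the function to each sample in turn would yield a functional profile table with one column of scores per sample.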
