Explaining Diversity in Metagenomic Datasets by Phylogenetic-Based Feature Weighting

Metagenomics is revolutionizing our understanding of microbial communities, showing that their structure and composition have profound effects on the ecosystem and in a variety of health and disease conditions. Despite the flourishing of new analysis methods, current approaches based on statistical comparisons between high-level taxonomic classes often fail to identify the microbial taxa that are differentially distributed between sets of samples, since in many cases the taxonomic schema do not allow an adequate description of the structure of the microbiota. This constitutes a severe limitation to the use of metagenomic data in therapeutic and diagnostic applications. To provide a more robust statistical framework, we introduce a class of feature-weighting algorithms that discriminate the taxa responsible for the classification of metagenomic samples. The method unambiguously groups the relevant taxa into clades without relying on pre-defined taxonomic categories, thus including in the analysis also those sequences for which a taxonomic classification is difficult. The phylogenetic clades are weighted and ranked according to their abundance measuring their contribution to the differentiation of the classes of samples, and a criterion is provided to define a reduced set of most relevant clades. Applying the method to public datasets, we show that the data-driven definition of relevant phylogenetic clades accomplished by our ranking strategy identifies features in the samples that are lost if phylogenetic relationships are not considered, improving our ability to mine metagenomic datasets. Comparison with supervised classification methods currently used in metagenomic data analysis highlights the advantages of using phylogenetic information.


Introduction
Thanks to the possibility to characterize microbial communities through next generation sequencing, microbial ecology has become a central topic in many environmental and therapeutic applications. Extensive explorative studies of the microbiota colonizing several districts of the human body have been conducted, highlighting the large variability from site to site, as well as the interpersonal differences in the same body site [1]. The more extensively studied district is the human gastrointestinal tract (GI), whose metagenomics composition appears to be influenced by several factors [2], including age [3,4], geography [5], diet [6], and lifestyle [7]. In addition, a correlation between imbalances or abnormal composition of the gut microbiota and a number of pathologic conditions has been proposed. These alterations might be due to therapeutic interventions, like antibiotic treatment [8], or different lifestyle [9].
The growing body of evidence of the importance of the gut microbiota for the selfsustainability of health of the "holobiont" is opening the debate on the design of therapeutic intervention strategies. Fecal transplantation has shown its effectiveness and safety in the treatment of recurrent Clostridium difficile infections [10], which are known to correlate with altered microbiomes following antibiotic treatment [11]. Alternatives for bioremediation of microbiota alterations is the supplementation of pro-or prebiotics, while it has been suggested that antibiotic treatment and vaccination can be used to guide the structure of the gut microbiota towards a status that is compatible with health [12,13]. Most of these intervention strategies would greatly increase their efficacy using a precise definition of the microbial species that are differentially distributed in health and disease conditions. This task faces several difficulties. On one hand, most of the microorganisms composing the human and environmental microbiota are poorly characterized, difficult to cultivate, and lack a precise taxonomic classification. On the other hand, methods to unambiguously define the microbial taxa that are responsible for these differences are still lacking, and their identification usually relies on a small number of arbitrarily chosen association tests with high-level taxonomic classes, or on statistical learning methods, both evaluating only taxa for which a taxonomic classification is possible [14]. In addition, the low abundance of most microbial taxa in metagenomic samples poses additional challenges only recently tackled with statistical methods [15].
In amplicon metagenomics, the composition in term of microbial genera of a sample is inferred from the high throughput sequencing of a small number of diagnostic genomic loci, the most popular being the V1-V6 variable regions of the 16S rDNA gene for bacteria [16] and the ITS spacer for fungi [17], selectively amplified using broadly conserved PCR primers. As a proxy for species, Operational Taxonomic Units (OTUs) are determined by the clustering of the sequences up to a given level of similarity, usually 97%. Using the OTUs abundances, the differentiation between samples or classes of samples is accomplished by measuring their βdiversity, i.e. the variations in community membership across the different groups [18]. Given that the sequences of marker genes are available, phylogenetic measures of diversity such as UniFrac [19,20] have proven to be able to identify subtle differences in the structures of microbial communities by weighting species abundances with the phylogenetic relationships amongst taxa.
Here we present PhyloRelief, a ranking strategy to identify the taxa significantly contributing to the differentiation of groups of amplicon metagenomic samples. By integrating the phylogenetic relationships amongst taxa into the framework of statistical learning, the method is able to unambiguously group the taxa into clades without relying on a precompiled taxonomy, and accomplishes a ranking of the clades according to their contribution to the sample differentiation. We applied the method to a meta-analysis of two recent datasets of comparative studies of the gut microbiota of European, USA, African and South American healthy individuals, identifying bacterial taxa that are differentially distributed with geography and age. Comparison of the performances of the method to popular feature selection and classification algorithms shows that or strategy is effective in identifying microbial clades associated to the different sample groups, providing a novel analysis method for targeted metagenomic datasets.

Results
PhyloRelief is an algorithm that introduces the Relief [21,22] strategy of feature weighting in a phylogenetic context to identify those OTUs or groups of OTUs that are responsible for the differentiation between classes of samples (i.e. healthy vs. disease, lean vs. obese, population A vs. population B, etc.) in a metagenomic dataset. The method is designed to analyze any set of samples that has been characterized via high throughput sequencing of one or more marker genomic loci, whose sequences have been clustered into OTUs. The process requires that the samples are unambiguously classified into cases and controls according to the description provided by the study design, and that a phylogenetic tree of the OTUs has been obtained by molecular phylogenetic analysis.
The algorithm is composed by two main conceptual steps: i) a scoring scheme that ranks the branches of the OTU tree according to their contribution to the differentiation of the classes, and ii) a merging step that merges nested subtrees into independent clades. At the end of this procedure, PhyloRelief ranks the clades according to their discriminant power between cases and controls.

Definition of the scores
Given a partitioning of the samples into two or more classes ({C1}, {C2},. . .), PhyloRelief ranks the internal branches in the OTU tree by assigning them a score w that reflects their importance in the differentiation of the classes. In its simplest form the procedure is as follows. First, one sample S is randomly chosen and its nearest hit H (i.e. the nearest sample of the same class) and miss M (i.e. the nearest sample of a different class) are individuated (Fig. 1). Next, the score w of each clade is increased by an amount proportional to the contribution of the clade to the distance between S and M, and decreased by an amount proportional to its contribution to the distance between S and H. In this way, the score of those clades that support the fact that S is more distant from M than from H is increased, while the score of those that support the contrary is decreased. A detailed description of the update rules is given in the Methods section. After that the procedure has been repeated over all possible choices of S, each clade has a score w that is high if the clade supports the partitioning of the samples into classes, {C1}, {C2}, and low if it does not (Fig. 1). The critical step of the procedure is the choice of the update function, for which different definitions are possible. Here we define (see Methods): a) an unweighted update function, that, for each clade, is proportional to the fraction of the clade that is unique to one of the classes, i.e. the fraction of the phylogenetic tree from which descend only OTUs belonging to one of the classes; b) a weighted update function, in which each branch of the tree is weighted by a quantity proportional to its unbalance between the classes, i.e. the difference between the number of sequences in samples from one class and from the other. Analogously to the Relief-F extension of the Relief algorithm, PhyloRelief can be applied to multiclass problems and can use k-nearest neighbors in the score computation, becoming robust in the case of noisy or unbalanced data sets [21].

Correlation between the lineages and identification of the clades
The peculiar nature of the features that we are ranking (i.e. subtrees in a tree) introduces a correlation that needs to be taken into account when analyzing the data, and that can be exploited to define a set of independent clades ranking them according to their relevance. If a given branch is heavier, due to unbalanced OTUs distribution between the different classes, its weight will propagate to the parent branches, where it is either reinforced by coalescing with branches sharing a similar unbalance, or diluted if the coalescing branches have contrasting or no unbalances. Exploiting this property, individual lineages can be clustered into taxonomic clades by inspecting the profile of the weights along the tree and identifying the branch where this has a local maximum. This rule, exemplified in S1 Fig, (see Methods) naturally defines a set of independent taxonomic clades and ranks them according to their contribution to the diversification between the classes. Using this ranking, the minimal set of clades necessary to describe the classes to a certain level of accuracy is determined by running non-parametric tests of class diversification, such as PERMANOVA [23] and ANOSIM [24], as a function of the number of clades.

Applications
In order to illustrate the potentialities of the method, we analyzed two recent datasets, one including 528 samples from healthy individuals of different ages from the United States, from Guhaibo Amerindians living in two villages in Venezuela, and from four rural communities in Malawi [5], and the other including samples from 14 healthy children from the Mossi ethnic group living in a rural setting in Burkina Faso and 15 healthy children living in Florence (Italy) [6]. To allow joint analysis of these two datasets, OTUs were picked using a reference database (see Materials and Methods) and the OTU tables were merged and rarefied to the same number of reads. A PCoA analysis of the weighted UniFrac distances ( Fig. 2A) shows that the samples segregate by geographical origin, with the USA and Italian samples clearly distinct from the African (Malawi and Burkina Faso) samples, and the Venezuelan occupying an intermediate position between the two groups. Previous meta-analyses of these data have shown differences in microbiota composition correlating to the "Western" (USA and Italy) or "non-Western" (Malawi, Burkina Faso and Venezuela) origin of the samples, and it has been suggested that these differences are related to the different balance between protein-rich and fiber-rich diet in these communities [2,5,6]. Stratifying the data by age of the subjects shows (S2 Fig) that the age is also an important factor in the variability of the human gut microbiota, and that this variability seems to be highest at younger age.
To identify the taxonomic groups that associate with the geographical origin and that might be correlated to the different diets of the five different populations, we partitioned the samples into two classes, one including the Western subjects (from Italy and the USA), and the other including the non-Western (Malawi, Burkina Faso and Venezuela) subjects, and applied the PhyloRelief algorithm to these two classes. To identify the number of clades that were more relevant to differentiate the two classes, we performed ANOSIM and PERMANOVA analysis with increasing number of clades ranked according to the PhyloRelief weights (Table 1). This procedure showed that both ANOSIM and PERMANOVA had a maximum comprised between 20 and 30 clades, indicating that using this number of clades the separation between the groups is largest. In Fig. 2B we show a phylogenetic tree of the OTUs present in the samples, with those included in the 30 most relevant clades identified by PhyloRelief highlighted (in red OTUs more prevalent in Malawi, Burkina Faso and Venezuela, in green OTUs more prevalent in the USA and Italy). It is worth noting that most of these clades were specifically more represented in the non-Western samples, while only few were specific of the Western individuals, and that much of the differences were confined within the order Bacteroidales. In particular, the Malawi, Burkina Faso and Venezuelan samples were rich in Prevotellaceae, while the Western samples were rich in Ruminococcaceae. In Fig. 2C the PCoA of the weighted UniFrac distances computed on the 30 most relevant clades is shown. Although the Western samples were distinct from the rest, they showed a large degree of variability, with a small fraction of samples from the USA closely related to the Malawi, Burkina Faso and Venezuelan samples. In addition, age was still a major factor, being closely associated to the second component of the PCoA (S3 Fig). To further investigate the individual distribution of the 30 most relevant clades, we show in Fig. 2D a heatmap of the logarithm of the prevalences of the OTUs within these clades. These data confirmed that there was a group of individuals from the USA that were closely related to the non-Western individuals, sharing three clades of Prevotellaceae with most of the Malawi, Burkina Faso and Venezuelan subjects. Stratifying the subjects by age, we found that in both classes young subjects (below 2 years of age) were clearly distinct from older subjects (Fig. 2D, upper panel). In addition, while we found clear separation between Western and non-Western adult subjects, some of the Western young subjects were classified by hierarchical clustering together with the non-Western young subjects and vice-versa, suggesting that at young age cultural or geographical differences are less important in determining the structure of the gut microbiota probably related to the instability of the gut microbiota, a phenomenon typical of childhood [5].
To highlight the role of age, and to identify the age for which the differences between young and older individual was highest, we partitioned the samples into two groups using as variable the age threshold, performing a PERMANOVA analysis of the weighted UniFrac distances between the groups as a function of this threshold. We found (S4 Fig and S1 table) that the differentiation between young and older subjects was largest when the age threshold was set to two years, and that above 14 years of age, there was no difference between the microbiome of young and adult subjects. However, running the PhyloRelief analysis on the complete dataset, we could not unambiguously identify a minimal set of bacterial clades associated to this differentiation (S2 Table). This result was likely due to the different gut microbiota of Western and non-Western adult subjects. For this reason, we repeated the analysis separately for Western and non-Western samples. ANOSIM and PERMANOVA showed that the maximum differentiation between individuals below age of 2 and above age of 2 for the Western and for the non-Western samples was achieved using 90 (Table 2) and 30 clades (Table 3), respectively, where both PERMANOVA R 2 and ANOSIM R have a maximum. The differentiation between young and adults was much sharper in Western subjects, with a prominent role played by Lachnospiraceae and Ruminococcaceae in the adults (Figs. 3A and S5). In non-Western subjects (Figs. 3B and S6), there was also a contribution of the presence of five clades of Prevotellaceae to the differentiation of the adult gut microbiota. In both Western and non-Western samples the younger subjects have higher abundance of Bifidobacteriaceae (Fig. 3), probably due to breast-feeding in infants [5]. Bifidobacteriaceae were present at lower prevalence in most adult subjects, except for the adults from Burkina Faso probably due to the absence of dairy food in adult age in this African population [25]. Predictivity of the ranked features in supervised classification problems The main goal of supervised classification is to build a model from a set of labeled samples to classify new, uncategorized data in high dimensional datasets in the presence of complex relationships between the variables. Identifying a ranking strategy to reduce the dimensionality of the dataset can improve the effectiveness of classification algorithms in metagenomic datasets, where correlations between the variables are introduced both by the phylogenetic relationships between the clades and by the fact that relative abundances are measured. The Random Forest (RF) classifier was recently proven to be the most effective in this class of problems [26,27], both for feature selection and classification. Although the main goal of this work is to define a phylogeny-based OTUs ranking method, it is interesting to assess the predictive power of the ranked taxa for the classification of samples into predefined categories in comparison to other state of the art algorithms. For this purpose, we selected four publicly available datasets, including data from four body sites (forehead vs. external nose and volar forearm vs. popliteal fossa) [1], from fecal samples (IBD vs. healthy) [28] and skin [29] (using both subject identification-3 classes-and subject/hand identification-6 classes-as target) that have recently been used as benchmark in comparative evaluations of classification methods for metagenomic data [26,27,28]. We compared the performance of PhyloRelief coupled with the RF classifier to LEfSe [30], an algorithm that uses statistical tests for biomarker discovery, to MetaPhyl, a recent phylogenybased method for the classification of microbial communities [31] and to Random Forest, used both as classifier and feature selection method. The performances were assessed in terms of average predictive accuracy using the K-category correlation coefficient (KCCC), a multiclass extension of the Matthews Correlation Coefficient (MCC) [32] (see Materials and Methods for details on the procedure). The results are reported in Table 4. We found that while in one case

Discussion
High throughput sequencing applied to the study of microbial communities is revolutionizing the way we understand the role of microorganisms in the environment and in health and disease conditions. The relatively low cost of sequencing has triggered an exponential increase in the amount of data generated, that have highlighted correlations between the structure of the microbiota and important human pathologies for which conventional intervention strategies were not effective. This suggests that a precise definition of the structure of the healthy vs. disease microbiota could allow early diagnosis and the definition of effective intervention strategies in a number of pathologies. To become a viable diagnostic and therapeutic tool, the evolution of sequencing technologies needs to be paralleled by progress in computational tools enabling to significantly correlate phenotypes to the smallest possible number of microbial taxa. This would allow, on one hand, to develop relatively cheap and easy to use diagnostic tools, and on the other hand to design focused and personalized intervention strategies.
PhyloRelief is an algorithm that resolves the problem of relevant taxa identification by applying the Relief strategy of feature ranking in a phylogenetic context. The improvement of this method over existing ones consists in its ability to accomplish a ranking of the microbial clades, defined on the basis of the taxa distribution amongst the samples weighted by phylogenetic information, discovering those that contribute to the differentiation between two or more classes of samples. Importantly, this result is obtained without relying on a predefined set of taxonomic categories that are often hard pressed to describe the complexity of the evolutionary relationships between microorganisms.  We applied the algorithm to case control studies derived from the literature, in all cases identifying taxa that are significantly differentially distributed. Of particular interest were the results obtained when comparing infants vs. adults in the different geographies, showing that age has a much greater influence in the USA and Italy than in the African and South American samples, with a much larger fraction of the OTU differentially distributed between young children and adults in the former than in the latter. Comparing the performances of the algorithm to LEfSe, MetaPhyl and to Random Forest in a classical supervised classification schema using cross validation, we found that the taxa ranked by PhyloRelief had also a high predictive value, performing as well as-and in some cases outperforming-current gold standard methods.
The algorithm is general and does not rely on any specific sequencing technology, as long as a phylogenetic tree of the OTUs and the distribution of the OTUs in the different samples are available. The method presented here is technology agnostic since it can be used to interpret data generated by the targeted amplification of marker genomic loci, such as the variable regions of the 16S rDNA gene for bacteria, or the ITS sequences for fungi as well as complete metagenome sequencing data, such as those obtained with Illumina technologies. In addition, the algorithm can readily be extended to regression problems to include cases where a continuous variable differentiate the individual samples using the RReliefF extension of Relief [21,22]. The PhyloRelief class of algorithms fills a significant gap in the growing array of computational methods that are currently used for the analysis of metagenomic data, and will impact importantly on the application of metagenomics to the development of novel diagnostic markers, leading the application of these approaches from the bench to the bedside.

The PhyloRelief algorithm
We will assume that a phylogenetic tree T of the OTUs is given, and that a distance matrix D S between the samples S has been computed according to some measure of β -diversity. Given the availability of phylogenetic data, β -diversity measures incorporating phylogenetic information, such as weighted and unweighted UniFrac [19,20] have become popular in the context of metagenomic research, but other measures, such as Bray-Curtis dissimilarity index could also be used. Let us define a partitioning of S into sample class {C1} and {C2}. Usually, this partitioning is obtained either by exploratory analysis of the distance matrix D S , or by the study design (e.g. according to the origin of the samples, health status or age of the donor in the case of human samples, etc.).
The purpose of the PhyloRelief algorithm is to rank the OTUs according to their relevance in the partitioning of S into {C1} and {C2}. To accomplish this result, we developed a modified version of the Relief-F procedure that takes into account the phylogenetic information contained in the tree T. To this purpose, the algorithm does not work directly with the OTUs, but with the clades (or sub-trees) T i of the tree T. Below we report the two main steps of the algorithm, i.e. i) the scoring scheme ranking the sub-trees of the tree T, and ii) the merging step that identifies the independent clades.
Definition of the scores. Two different scoring function have been define: a unweighted update function and a weighted update function. Being closely related to the unweighted Uni-Frac distance, the unweighted update function is the natural choice when exploratory analysis of the samples has been performed using unweighted UniFrac, while the weighted update function is the natural choice when weighted UniFrac has been used.
Unweighted PhyloRelief. For each subtree T i (1 i N)of T, we compute a weight w[T i ] with the following iterative procedure: The function d(T i ,A,B) is equal to: where Y S q is equal to 1 if the branch B q contains OTUs from sample S, 0 otherwise. In this way, the contribution of each clade does not take into account the prevalence of the OTUs in the different classes, but just their presence. Consequently, the algorithm identifies those lineages that are specific to one of the classes of samples.
Weighted PhyloRelief. In weighted PhyloRelief we use the same iterative procedure defined above, but the update function d(T i ,A,B) is defined as: The sum runs on all branches B q of T i (including B i ). p A q and p B q are the fraction of the taxa descending from the branch B q that are from samples A and B, respectively. b q is the length of the branch B q . In other words, d(T i ,A,B) is the weighted UniFrac distance between samples A and B due to the OTUs in the subtree T i of the tree T defined by the branch B i . By iterating over S j , the procedure positively weights those sub-trees T i that support the partitioning of S into {C1} and {C2}, and negatively those that do not support the partitioning. m is a user defined parameter that, in practice, can be set to the number of samples in S.
Generalizations and extensions. Analogously to the Relief-F algorithm, PhyloRelief can work with multi-class classification problems. Moreover, in its generalized form, the algorithm again randomly selects a sample S, but then identifies k of its nearest neighbors from the same class, called nearest hits H l , and k nearest neighbors from each of the different classes, called nearest misses M l (C).
Start define the subtrees T i (1 i N) that have the branches B i as root; set all weights w[T i ] = 0; for j: = 1 to m do begin randomly select a sample S j ; find k nearest hits H l ; for each class C6 ¼class(S j ) do find k nearest misses M l (C); where P(C) is the fraction of samples in class C. The factor PðCÞ 1ÀPðclassðS j ÞÞ is required for ensuring appropriate normalization and to guarantee that the contribution of hits and of each class of misses is between 0 and 1.
Definition of the clades. Let T j be a sub-tree of T i . The correlation between the values w[T i ], w[T j ] is illustrated by the example in S1 Fig, where the weights for four samples ("circles", "squares", "triangles" and "stars") partitioned into two classes ("red" and "blue") are shown. The high weights in the bottom clade (Clade b, containing OTUs only from the "red" class of samples) propagates up from the terminal branches until it merges with Clade a, that contains OTUs from the "blue" class of samples. In the parent branch of Clades a and b, the unbalance between the two samples is diluted, and consequently the weight decreases. In this example, Clade a and clade b separately, but not their parent clade, would be responsible for the differentiation between "red" and "blue" samples.
In order to exploit the information contained in the weights w[T i ], it is crucial to define a set of independent clades and rank those in order of importance. To accomplish this, we identify the sub-tree T with the highest weight. In the case of ties, we randomly start from one of the subtrees if these are independent, or take the one closest to the root if these are nested. This defines the first clade. Next, we prune the tree from all the branches descending from T, and from those ascending from T. This last step is needed to avoid the possibility to iteratively enlarge the same clades, given the correlation between the weights of nested sub-trees discussed above. Iteratively applying this rule, we define a set of independent clades ranked according to their weight. Using this ranking, the minimal set of clades necessary to describe the classes to a certain level of accuracy is determined by running non-parametric tests of class diversification, such as PERMANOVA [23] and ANOSIM [24], as a function of the number of clades. Alternatively, univariate non-parametric tests such as the Kruskal-Wallis could be applied for testing whether samples originate from the same distribution in each clade.
IBD, Costello et al. Body Habitats (CBH) and Fierer et al. Subject (FS) Datasets. Preprocessed reads and metadata were downloaded from the Qiime repository http://www.microbio. me/qiime/ under the study ids 1290, 449 and 232 respectively. OTU tables and phylogenetic trees were inferred using the standard QIIME pipeline with default settings, where OTUs were picked with UCLUST [37] at a sequence similarity threshold of 0.97%. Taxonomy was assigned using the Greengenes database (version 2013/05). Rarefactions were performed at the depth of the shallowest sample. Phylogenetic tree were computed using PyNAST [38] (using the Greengenes 2013/05 database as template) and FastTree [39].

Predictive classification pipeline
We compared the predictive performance of PhyloRelief with the Random Forest classifier (PhyloRelief +RF) to LEfSe + RF, MetaPhyl (without feature selection) and Random Forest used as both classifier and feature selection method (RF + RF).
To assess the prediction performance of the weighted features we implemented a predictive pipeline based on a stratified 10x random subsampling cross validation (CV). Data are partitioned into a training set and a testing set (75% and 25% of the samples respectively). In order to avoid overfitting and selection bias effects, the feature selection procedure was included in the cross validation loop [40,41]. For each training set, the number of ranked features n 0 that provides the smallest average KCCC is found by a nested 10x random subsampling CV. Later, the features are ranked using the entire training set and the model is trained using the top ranked n 0 features. The model is finally tested on the independent testing set and a KCCC is computed. In the case of LEfSe + RF, LEfSe was treated as feature selection method using the common p-value threshold of 0.05. For MetaPhyl, no feature selection was performed and the nested CV was used to find the optimal model parameters (parameters grid: λ = {100000, 1000, 100, 10, 1, 0.1, 0.01, 0.001, 0.0001} and w = {0, 0.2, 0.4, 0.6, 0.8, 1}). For the Random Forest classifier, the number of trees was set to 500 and the weights were computed as in [42]. In PhyloRelief OTUs were ranked using the weights computed on the related clades.
The pipeline was developed in Python using the scikit-learn module (http://scikit-learn. org).
Supporting Information S1 Fig. The PhiloRelief scoring scheme In this example, four samples ("circles", "squares", "triangles" and "stars") are partitioned into two classes ("red" and "blue"). The Bottom clades (Clade a and Clade b) have high weight since they contain OTUs only from the "blue" and "red" class of samples, respectively. The higher weights of the branches in clade b take into account the more even distribution of the "blue" class of samples. The weights propagate up from the terminal branches until the two clades merge. In the parent branch of Clade a and Clade b the unbalance between the two samples is diluted, and consequently the weight decreases.  Table. Permutational ANOVA and ANOSIM tests on the effect of age. The sample has been partitioned into two as function of an age threshold and the Permutational ANOVA and ANOSIM tests have been computed using the weighted UniFrac distance. (DOCX) S2 Table. Permutational ANOVA and ANOSIM tests on the effect of the number of clades used in the calculation of the UniFrac distance between young (below two years of age) and older (above two years of age) individuals. (DOCX)