The authors have declared that no competing interests exist.

Conceived and designed the experiments: JF EJA. Performed the experiments: JF. Analyzed the data: JF. Wrote the paper: JF EJA.

High-throughput sequencing based techniques, such as 16S rRNA gene profiling, have the potential to elucidate the complex inner workings of natural microbial communities - be they from the world's oceans or the human gut. A key step in exploring such data is the identification of dependencies between members of these communities, which is commonly achieved by correlation analysis. However, it has been known since the days of Karl Pearson that the analysis of the type of data generated by such techniques (referred to as compositional data) can produce unreliable results since the observed data take the form of relative fractions of genes or species, rather than their absolute abundances. Using simulated and real data from the Human Microbiome Project, we show that such compositional effects can be widespread and severe: in some real data sets many of the correlations among taxa can be artifactual, and true correlations may even appear with opposite sign. Additionally, we show that community diversity is the key factor that modulates the acuteness of such compositional effects, and develop a new approach, called SparCC (available at

Genomic survey data, such as those obtained from 16S rRNA gene sequencing, are subject to underappreciated mathematical difficulties that can undermine standard data analysis techniques. We show that these effects can lead to erroneous correlations among taxa within the human microbiome despite the statistical significance of the associations. To overcome these difficulties, we developed SparCC, a novel procedure tailored to the properties of genomic survey data that allows inference of correlations between genes or species. We use SparCC to elucidate networks of interaction among microbial species living in or on the human body.

The study of natural communities using high throughput genomic surveys, such as 16S rRNA gene profiling, has become routine

A common goal of genomic surveys is to identify correlations between taxa within ecological communities. Correlation analysis provides a well-trodden path to achieving this goal, but we show that it is not valid when applied to genomic survey data (GSD), and may produce misleading results. The challenges associated with GSD stem from the fact that they are a relative, rather than absolute, measure of the abundances of community components. The counts comprising these data (e.g., 16S rRNA gene reads) are set by the amount of genetic material extracted from the community or the sequencing depth, and analysis typically begins by normalizing the observed counts by the total number of counts. The resulting fractions fall into a class of data termed closed or compositional, which possess their own particular geometrical and statistical properties

Although approaches to compositional data analysis have been developed (e.g.

In this paper, we first use simulations and real-world data from the Human Microbiome Project (HMP) to demonstrate that GSD can be severely biased by “compositional” effects, and then identify the factors that modulate their severity. Finally, we present a novel method, called SparCC, and show that it can infer correlations with high accuracy even in the most challenging data sets.

To what extent do compositional artifacts affect real-world GSD? We applied standard statistical methods to 16S rRNA gene survey data from the Human Microbiome Project (HMP)

Networks inferred from standard Pearson correlations display distinct patterns within different body sites, suggestive of biological structure (

Correlation networks based on 16S rRNA gene survey data collected as part of the Human Microbiome Project (HMP), inferred using Pearson correlations (left column) and SparCC (right column). Additionally, Pearson correlation networks were inferred from shuffled HMP data (middle column), where all OTUs are independent. The Pearson networks inferred from shuffled data show patterns similar to the ones seen in the Pearson networks of the real data, especially for low diversity body sites. This indicates that the observed Pearson network structure may be due to biases inherent in compositional data rather than a real biological signal. In contrast, no significant correlations were inferred from the shuffled data using SparCC (data not shown). Nodes represent OTUs, with size reflecting the OTU's average fraction in the community. Edges between nodes represent correlations between the nodes they connect, with edge width and shade indicating the correlation magnitude, and green and red colors indicating positive and negative correlations, respectively. For clarity, only edges corresponding to correlations whose magnitude is greater than 0.3 are drawn. See

The mechanism behind these spurious correlations is straightforward. The pattern observed in the mid-vagina network results from the dominance of OTU 3, a

Compositional effects are severe in some datasets, but mild in others. We found that the diversity of the samples in the dataset (often referred to as alpha diversity) is a good predictor of the strength of compositional effects, which diminish with increased diversity. Intuitively, the fewer OTUs a community comprises, the worse the compositional effects are, the extreme case being a community composed of only two OTUs, which will always appear to be perfectly negatively correlated. Moreover, compositional effects can be significant even in communities composed of many OTUs, if only a few OTUs dominate the community. This notion of diversity can be quantified using the Shannon effective number of OTUs,(
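The two-OTU extreme case can be checked numerically. The following is a minimal sketch, not taken from the original analysis: it assumes two independent log-normal abundances (the helper name `close` and all parameter values are illustrative), normalizes them to fractions, and compares the Pearson correlation before and after normalization.

```python
import numpy as np

def close(abundances):
    """Normalize each sample (row) so that its components sum to 1."""
    abundances = np.asarray(abundances, dtype=float)
    return abundances / abundances.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)

# Two independent "OTUs" observed in 500 samples (illustrative values).
basis = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 2))
fractions = close(basis)

# Correlation between the true abundances: close to zero by construction.
r_basis = np.corrcoef(basis, rowvar=False)[0, 1]

# Correlation between the fractions: x2 = 1 - x1, so it is -1.
r_fractions = np.corrcoef(fractions, rowvar=False)[0, 1]

print(f"basis: {r_basis:+.3f}, fractions: {r_fractions:+.3f}")
```

Because the second fraction is fully determined by the first, the negative correlation arises from the normalization alone, independently of any relationship between the underlying abundances.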

Simulated networks of varying

Basis data were simulated with a known correlation structure. OTU counts were generated by randomly drawing from the basis, and were subsequently subjected to both correlation inference procedures. (A–C) True basis correlation network. (D–F) Networks inferred using the standard procedure. (G–I) Networks inferred using SparCC. The average community diversities, as given by the Shannon entropy effective number of components

If the underlying network has true positive correlations, then compositional effects can be even more pronounced than expected based on the community diversity. This happens because strong correlations between components lower the effective diversity of the sample (i.e., two OTUs that are perfectly correlated behave as a single OTU). This effect can confound naive efforts to correct for compositional effects by comparing observed correlations against shuffled networks. When the data are shuffled, as in the middle column of

Here, we describe a new technique for inferring correlations from compositional data called SparCC (Sparse Correlations for Compositional data). SparCC estimates the linear Pearson correlations between the log-transformed components. Since these correlations cannot be computed exactly (as described below), SparCC utilizes an approximation which is based on the assumptions that: (i) the number of different components (e.g., OTUs or genes) is large, and (ii) the true correlation network is ‘sparse’ (i.e., most components are not strongly correlated with each other). Later, we show that SparCC is surprisingly robust to violations of the sparsity assumption. SparCC does not rely on any particular distribution of the basis variables, i.e. the true abundances in the community can follow any distribution, and the choice of the log-normal distribution in subsequent examples is motivated solely by ease of implementation and empirical fit. For clarity, we present the method in the context of 16S rRNA gene data, where the components are OTUs and the basis variables are their true abundances in a community, but SparCC can be applied to any compositional data for which its approximation is valid.

Like most compositional data analysis techniques, SparCC is based on the log-ratio transformation:

To describe the dependencies in a compositional dataset, Aitchison suggested using the quantity

More accurate estimation can be achieved by iterating the above procedure. At each iteration the strongest correlated OTU pair identified in the previous iteration is excluded from the basis variance estimation. This reinforces sparsity among the remaining pairs and yields better variance and correlation estimates.

OTU fractions need to be estimated from the observed counts to apply SparCC. Normalizing each OTU by the total counts in the sample (the maximum-likelihood estimate) is unreliable for rare OTUs because it overestimates the number of zero fractions

We used the previously described simulated datasets to demonstrate the accuracy of SparCC at inferring correlations, even in highly problematic compositional data dominated by a single OTU (

Root-mean-square error (RMSE) of both Pearson (A) and SparCC (B) inferred correlations, as a function of the density of the underlying correlation network, as given by the probability that any pair of components be strongly correlated

We used SparCC to infer the taxon-taxon interaction networks from the HMP data sets (

Networks inferred using SparCC from the same data as in

In this study we have focused on an outstanding challenge of compositional data analysis – inference of correlations. We have demonstrated that compositional effects are pronounced in 16S rRNA gene surveys of the human microbiome, and, motivated by the properties of these data, have developed a novel procedure for estimating correlations.

We found that diversity of species and density of interactions are the two key factors that influence the severity of compositional effects on correlation estimates, with low diversity, high density data being the most challenging to infer correlations from using standard methods. SparCC does not rely on high diversity; rather, it only requires sparsity of correlations, and in practice it is robust even when the sparsity assumption is strongly violated (30% of all component pairs are strongly correlated). Therefore, we recommend that SparCC be used on any GSD that has low diversity: as a rule of thumb, we recommend an effective number of components of at least 50 for standard techniques (with the potential caveat that if strong positive correlations are present among many OTUs, the effective diversity may be much lower than estimated). We emphasize that simply having many components is not sufficient to avoid compositional effects. For example, 16S rRNA gene surveys from the HMP include hundreds to thousands of distinct OTUs, yet have a relatively low effective number of species, with a small number of species dominating most samples.

An important subclass of GSD consists of genome-wide surveys conducted using techniques such as DNA microarrays, RNA-seq and ChIP-seq. These genome-wide data are also subject to compositional effects; however, because these data tend to have high diversity, the effects are likely to be much less severe or even negligible. For example, the average effective number of genes in microarray experiments available through the

The preponderance of zero values is another area of concern with GSD. These zeros can represent either components that are truly absent from the community, or rare components that, by chance, were not present in the sample drawn from the community. Without additional knowledge, these options are indistinguishable, and, depending on the goal of the analysis, the researcher must decide how to interpret them and choose analysis methods accordingly. We emphasize that the treatment of zero values is a challenge that is in no way unique to compositional data, but is merely highlighted by the log-ratio transformations employed to analyze these data

Though the method presented in this paper allows detection of correlations within communities, many challenges still remain. First, SparCC relies on having reliable component counts, which, as noted in the introduction, is not trivial. Second, the correlations estimated by SparCC measure the linear relationship between log-transformed abundances. Compositional methods for inferring more general dependencies between components, equivalent to rank correlations and mutual information for non-compositional data, have not yet been developed. Third, relating the patterns detected within a community to external factors (e.g., relating the composition of a human gut microbial community to human health status) and detecting temporal patterns within and between communities require non-standard, compositional approaches. While some such methods exist

HMP OTU counts and their taxonomic classification were obtained from the HMPOC dataset, build 1.0, available at

Shuffled datasets are created by assigning each OTU in each sample a number of counts that is randomly sampled from the OTU's observed counts across all samples, with replacement. This procedure ensures that the resulting marginal distributions of counts of each OTU alone are the same as in the real data, and that there are no correlations between the OTUs in the simulated data.
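The shuffling procedure can be sketched as follows. This is an illustrative implementation rather than the authors' code; it assumes a counts matrix with samples as rows and OTUs as columns, and the function name `shuffle_counts` is hypothetical.

```python
import numpy as np

def shuffle_counts(counts, rng=None):
    """Resample each OTU's counts across samples, with replacement.

    counts: array of shape (n_samples, n_otus).
    Each column is resampled independently, so the marginal distribution
    of every OTU is preserved while dependencies between OTUs are removed.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts)
    n_samples, n_otus = counts.shape
    # For every (sample, OTU) cell, pick a random source sample.
    source = rng.integers(0, n_samples, size=(n_samples, n_otus))
    # Fancy indexing: result[s, o] = counts[source[s, o], o].
    return counts[source, np.arange(n_otus)]
```

Drawing the source sample independently for each OTU is what removes the between-OTU correlations; permuting whole rows would leave them intact.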

Simulated communities were generated by sampling the joint abundances of 50 OTUs from a log-normal distribution with a given mean and covariance matrix. The mean abundances were equal for all OTUs except OTU 1, whose abundance was set such that the community will have a given effective number of OTUs (

For each combination of the parameters

The entropy effective number of species of a community is defined as
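As a hedged illustration, the sketch below assumes the standard Shannon form of the effective number, n_eff = exp(H) with H = -Σ x_i ln x_i, where x_i are the component fractions. The function name `effective_number` is hypothetical.

```python
import numpy as np

def effective_number(fractions):
    """Shannon effective number of components, exp(entropy).

    Assumes the standard definition n_eff = exp(-sum(x_i * ln(x_i)));
    zero fractions contribute nothing to the entropy.
    """
    x = np.asarray(fractions, dtype=float)
    x = x / x.sum()              # make sure the fractions are closed
    nonzero = x[x > 0]
    entropy = -np.sum(nonzero * np.log(nonzero))
    return float(np.exp(entropy))

# An even community of 4 OTUs has an effective number of 4, whereas a
# community dominated by one OTU has an effective number close to 1.
even = effective_number([0.25, 0.25, 0.25, 0.25])
dominated = effective_number([0.97, 0.01, 0.01, 0.01])
```

Under this definition a community can contain many OTUs and still have a small effective number when a few of them dominate.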

We adopt a Bayesian framework for estimating the true fractions from the observed counts. Assuming unbiased sampling in the sequencing procedure, and a uniform prior, the joint posterior distribution of the fractions is the Dirichlet distribution

Point estimates of fraction values, if desired, can be given by the mean of the posterior distribution:

It is important to note that in SparCC, as in any method employing log transformations, some pre-processing is required to eliminate zero values. As described above, SparCC employs a variation of the well-known pseudocounts method, which assigns a small fraction to OTUs that were not detected in a sample. This approach implicitly assumes that all components are in fact present in the sample, and that all zero values result from finite detection resolution
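A minimal sketch of this estimation step is given below. It assumes multinomial sampling of reads, so that a uniform prior (a Dirichlet with all parameters equal to 1) yields a Dirichlet posterior with parameters equal to the counts plus 1. The function names are hypothetical.

```python
import numpy as np

def posterior_mean_fractions(counts):
    """Posterior mean of the fractions under a uniform Dirichlet prior.

    Equivalent to adding a pseudocount of 1 to every component before
    normalizing, so no estimated fraction is exactly zero.
    """
    alpha = np.asarray(counts, dtype=float) + 1.0
    return alpha / alpha.sum(axis=-1, keepdims=True)

def sample_fractions(counts, rng=None):
    """Draw one set of fractions per sample from the Dirichlet posterior."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.atleast_2d(np.asarray(counts, dtype=float))
    return np.vstack([rng.dirichlet(row + 1.0) for row in counts])

# A sample with counts (3, 0, 1) gets posterior mean (4/7, 1/7, 2/7):
# the undetected component receives a small, non-zero fraction.
mean_fractions = posterior_mean_fractions([3, 0, 1])
```

Sampling from the posterior, rather than using only its mean, lets downstream estimates reflect the larger uncertainty attached to rare components.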

As noted in the main text, the quantity

Since an exact solution cannot be found, SparCC utilizes an approximation, which is valid when there are many components which are only sparsely correlated.

To elucidate the nature of this approximation, consider the case where all the basis variables have the same variance

Using the above approximation, the basic inference procedure is the following:

1. Estimate the component fractions in all the samples as outlined above, to obtain the fractions matrix

2. Compute the variation matrix

3. Compute the component variations

4. Solve

5. Plug the estimated log-basis variances into
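The steps above can be sketched in NumPy as follows. This is an illustrative reconstruction, not the authors' implementation. It assumes that the variation matrix is T_ij = Var[log(x_i/x_j)] = ω_i² + ω_j² − 2ρ_ij ω_i ω_j, that the component variations are the row sums t_i = Σ_j T_ij, and that neglecting the correlation terms gives the linear system t_i ≈ (D − 2) ω_i² + Σ_j ω_j² for the log-basis variances ω_i². All function names are hypothetical.

```python
import numpy as np

def variation_matrix(fractions):
    """T[i, j] = variance across samples of log(x_i / x_j)."""
    log_x = np.log(np.asarray(fractions, dtype=float))
    n_comp = log_x.shape[1]
    T = np.zeros((n_comp, n_comp))
    for i in range(n_comp):
        # Log-ratios of component i against every component.
        T[i] = (log_x[:, [i]] - log_x).var(axis=0, ddof=1)
    return T

def basic_sparcc(fractions):
    """Approximate basis correlations from strictly positive fractions.

    fractions: array of shape (n_samples, D).
    Returns (correlation matrix, estimated log-basis variances).
    """
    T = variation_matrix(fractions)             # step 2
    D = T.shape[0]
    t = T.sum(axis=1)                           # step 3
    # Step 4: t_i ~ (D - 2) * w_i^2 + sum_j w_j^2, i.e. M @ w^2 = t.
    M = np.ones((D, D)) + (D - 2) * np.eye(D)
    basis_var = np.linalg.solve(M, t)
    # Guard against non-positive estimates (an added safeguard).
    basis_var = np.clip(basis_var, 1e-12, None)
    # Step 5: rho_ij = (w_i^2 + w_j^2 - T_ij) / (2 * w_i * w_j).
    sd = np.sqrt(basis_var)
    corr = (basis_var[:, None] + basis_var[None, :] - T) / (2.0 * np.outer(sd, sd))
    return corr, basis_var
```

The input fractions must contain no zeros, since the log-ratios are undefined otherwise.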

The basic inference procedure can be improved upon by employing the following iterative refinement scheme (

1. Estimate correlations using the basic procedure described above.

2. Identify the most strongly correlated pair of components that was not previously excluded. If the magnitude of this strongest correlation exceeds a given threshold, add this pair to the set of excluded pairs. Otherwise, terminate the estimation procedure.

3. Identify components that form only excluded pairs and completely exclude them from the analysis. Since the assumptions of our method are not met by such components, it is unable to infer their correlations. If all components but three are excluded, terminate the estimation procedure, as the sparsity assumption is violated for the whole system.

4. If any components were excluded, re-estimate the fractions of the remaining components. Note that the new fractions are relative to the new subset of components.

5. Calculate the component variations

6. Use the newly computed component variations to compute the basis correlations, as in steps 4 and 5 of the basic inference procedure.

7. Repeat steps 2 through 6 for a given number of iterations, or until no new strongly correlated pairs are identified.
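A simplified sketch of the refinement loop is given below. It is not the authors' implementation: it covers pair exclusion only and omits the complete removal of components and the re-estimation of fractions. As a conservative stand-in for those steps, it caps the number of excluded pairs at D − 2, which keeps the linear system solvable for D ≥ 5. It assumes the approximation t_i ≈ (D − 2) ω_i² + Σ_j ω_j², with excluded pairs dropped from both sides. All names are hypothetical, and `threshold` and `max_iter` must be chosen by the user.

```python
import numpy as np

def variation_matrix(fractions):
    """T[i, j] = variance across samples of log(x_i / x_j)."""
    log_x = np.log(np.asarray(fractions, dtype=float))
    log_ratios = log_x[:, :, None] - log_x[:, None, :]  # samples x D x D
    return log_ratios.var(axis=0, ddof=1)

def correlations_given_exclusions(T, excluded):
    """Basic inference, leaving excluded pairs out of the variance estimation."""
    D = T.shape[0]
    M = np.ones((D, D)) + (D - 2) * np.eye(D)
    T_used = T.copy()
    for i, j in excluded:
        # Drop t_ij from the row sums of i and j, and the matching
        # variance terms from the linear system.
        T_used[i, j] = T_used[j, i] = 0.0
        M[i, j] -= 1.0
        M[j, i] -= 1.0
        M[i, i] -= 1.0
        M[j, j] -= 1.0
    basis_var = np.linalg.solve(M, T_used.sum(axis=1))
    basis_var = np.clip(basis_var, 1e-12, None)  # added safeguard
    sd = np.sqrt(basis_var)
    # Correlations are still computed for every pair, from the full T.
    return (basis_var[:, None] + basis_var[None, :] - T) / (2.0 * np.outer(sd, sd))

def iterative_sparcc(fractions, threshold, max_iter):
    """Iteratively exclude the strongest pair and re-estimate correlations."""
    T = variation_matrix(fractions)
    D = T.shape[0]
    if D < 5:
        raise ValueError("this sketch assumes at least 5 components")
    excluded = set()
    corr = correlations_given_exclusions(T, excluded)
    for _ in range(min(max_iter, D - 2)):
        magnitude = np.abs(np.triu(corr, k=1))
        for i, j in excluded:
            magnitude[i, j] = 0.0               # skip pairs already excluded
        i, j = np.unravel_index(np.argmax(magnitude), magnitude.shape)
        if magnitude[i, j] <= threshold:
            break                               # no new strongly correlated pair
        excluded.add((int(i), int(j)))
        corr = correlations_given_exclusions(T, excluded)
    return corr, excluded
```

Returning the set of excluded pairs alongside the correlations makes it easy to check how many pairs were removed before trusting the result.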

Note that the iterative procedure can result in correlations whose magnitude is greater than 1, indicating that too many pairs were excluded. Setting a higher exclusion threshold or using fewer iterations will remedy this problem, though the resulting approximation is likely to be of poor accuracy.

Basis correlation can also be inferred using transformed variables (see

To account for the sampling noise, the inference procedure is repeated multiple times, each time with fraction values drawn randomly from their posterior distribution, generating a distribution of each pairwise correlation. The median value of each pairwise correlation distribution is taken as its estimated value. In this work, a threshold of
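The repeated-sampling scheme can be sketched generically as follows. This is an illustration under stated assumptions: fractions are drawn from a Dirichlet posterior with parameters equal to the counts plus 1 (uniform prior), and `infer` stands for any function that maps a fractions matrix to a correlation matrix. The function name and the default number of draws are hypothetical.

```python
import numpy as np

def median_correlations(counts, infer, n_draws=20, rng=None):
    """Median of correlation estimates over posterior draws of the fractions.

    counts: array of shape (n_samples, D) of observed counts.
    infer:  callable mapping an (n_samples, D) fractions matrix to a
            (D, D) correlation matrix.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    estimates = []
    for _ in range(n_draws):
        # One draw of the fractions for every sample.
        fractions = np.vstack([rng.dirichlet(row + 1.0) for row in counts])
        estimates.append(infer(fractions))
    # Element-wise median across the draws.
    return np.median(np.stack(estimates), axis=0)
```

Taking the median rather than the mean makes the final estimate less sensitive to occasional extreme draws.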

For each body site, pairwise correlations between all OTUs were inferred using both Pearson and SparCC as described above. Interaction networks were subsequently built by connecting all OTU pairs that had a correlation magnitude greater than a given threshold.
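Network construction by thresholding can be sketched as below. The function name and the edge representation (a list of index pairs with their correlation) are illustrative choices, not part of the original analysis.

```python
import numpy as np

def correlation_network(corr, threshold):
    """List the edges (i, j, correlation) with |correlation| > threshold.

    Only the upper triangle is scanned, so each pair appears once and
    self-correlations on the diagonal are ignored.
    """
    corr = np.asarray(corr, dtype=float)
    rows, cols = np.triu_indices_from(corr, k=1)
    keep = np.abs(corr[rows, cols]) > threshold
    return [(int(i), int(j), float(corr[i, j]))
            for i, j in zip(rows[keep], cols[keep])]

example = np.array([[1.0, 0.5, -0.1],
                    [0.5, 1.0, -0.4],
                    [-0.1, -0.4, 1.0]])
edges = correlation_network(example, threshold=0.3)
# edges == [(0, 1, 0.5), (1, 2, -0.4)]
```

The sign of each retained correlation is kept so that positive and negative edges can be drawn differently.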

The statistical significance of the inferred correlations can be assessed using a bootstrap procedure. First, a large number of simulated datasets, in which all components are uncorrelated, are generated as described in Material and Methods. Next, correlations are inferred from each simulated dataset using SparCC with the same parameter settings as used for the original data. Finally, for each component pair, the pseudo p-value is the proportion of simulated datasets for which a correlation value at least as extreme as the one computed for the original data was obtained.
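The final step of this procedure can be sketched as follows. The sketch assumes that "at least as extreme" is judged two-sidedly by correlation magnitude, and that the correlation matrices of the simulated datasets have already been computed. The function name is hypothetical.

```python
import numpy as np

def pseudo_p_values(observed_corr, simulated_corrs):
    """Proportion of simulated datasets with |correlation| >= |observed|.

    observed_corr:   (D, D) correlations inferred from the original data.
    simulated_corrs: sequence of (D, D) correlation matrices, one per
                     simulated (uncorrelated) dataset.
    """
    observed = np.abs(np.asarray(observed_corr, dtype=float))
    simulated = np.abs(np.asarray(simulated_corrs, dtype=float))
    # Compare every simulated matrix to the observed one, pair by pair.
    return (simulated >= observed[None, :, :]).mean(axis=0)

observed = np.array([[1.0, 0.5], [0.5, 1.0]])
simulated = [np.array([[1.0, r], [r, 1.0]]) for r in (0.1, -0.6, 0.5, 0.2)]
p = pseudo_p_values(observed, simulated)
# Two of the four simulated values (-0.6 and 0.5) are at least as
# extreme as 0.5, so the off-diagonal pseudo p-value is 0.5.
```

The resolution of the pseudo p-values is limited by the number of simulated datasets: with N datasets, the smallest non-zero value is 1/N.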

All analyses and procedures were implemented in Python, utilizing the Numpy

Correlation values for all HMP body sites inferred using both Pearson and SparCC from real and shuffled data.

(ZIP)

Accuracy of HMP Pearson networks compared to SparCC networks.

(DOC)

Correlation between OTUs decreases with phylogenetic distance.

(DOC)

Correlation inference using transformed variables.

(PDF)

The authors thank Lawrence David, Chris Smillie, Otto Cordero, Olivier Devauchelle, and Dr. Alex Petroff for many helpful discussions.