ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition

Motivation Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. Results There has been a recent surge of interest in using compressed sensing inspired and convex-optimization based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing approach where we use a standard K-means clustering algorithm that partitions a large set of reads into subsets with reasonable computational cost to provide several vectors of first order statistics instead of only single statistical summarization in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity. Availability An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.

For our experiments on real biological data, we used the mock microbial communities database developed in [2]. The database is called even composition Mock Communities (eMC) for chimeric sequence detection where the bacterial species included are known in advance. Three regions (V1-V3, V3-V5, V6-V9) of the 16S rRNA gene of the eMC were sequenced using 454 sequencing technology in four different sequencing centers. In our experiments we focused on the V3-V5 region datasets, since these have been previously used for the evaluation of the BEBaC method (see Experiment 2 of [1]) and SEK, Quikr and Taxy methods (see Experiment in Section 3.2.2 of [3]).

Test dataset (Reads):
Our basic test dataset used under a variety of different in silico experimental conditions is the one used in Experiment 2 of BEBaC [1] and the Experiment in Section 3.2.2 of SEK [3]. The test dataset consists of 91,240 short reads from 21 different species. The length of reads has a range between 450-550 bp and the bacterial community composition is known at the species level by using the following computation performed in [1]. Each individual sequence of the 91,240 read sequences was aligned (local alignment) to all the reference sequences of the reference database and then each read sequence is labeled by the species of the highest scoring reference sequence, followed by computation of the community composition referred to as ground truth.

Training datasets (Reference):
We used a database generated from the eMC database [2]. The database consists of the same M = 21 species present among the reads described previously. The details of the reference database can be found in Experiment 2 of BEBaC [1]. The database was also used for training SEK, Quikr and Taxy in the experiment described in Section 3.2.2 of [3]. The database consists of 113 reference sequences for a total of 21 bacterial species, such that each reference sequence represents a distinct 16S rRNA gene. Thus there is a varying number of reference sequences for each of the considered species. Each reference sequence has an approximate length of 1500 bp, and for each species, the corresponding reference sequences are concatenated to a single sequence. The final reference database consists of 21 sequences where each sequence has an approximate length of 5000 bp.

Computational resources and reproducible codes
We used standard Matlab software. For hardware, we used a Dell Latitude E6400 laptop computer with a 3 GHz processor and 8 GB memory. We also used the cvx [4] convex optimization toolbox. The reproducible Matlab codes are available at http://www.ee.kth.se/ctsoftware.

Relevant methods
For ARK, we use SEK, Taxy [5] and Quikr as the estimation methods employed to each cluster. The choice of SEK is due to its recent development and good performance. The use of Taxy and Quikr is due to fact that they also use the first order moment of the k-mer frequencies of reads. We use the greedy algorithm, called OMP +, 1 sek to realize SEK. We also use the parameters for OMP +,1 sek and parameters for k-mers training and testing setups identical to that used in [3] so that fair comparison can be pursued. Finally we compared the performance of ARK-SEK with the ground truth and BEBaC.

Results for Mock Communities data
Using mock communities data, we carried out experiments where the community composition problem was addressed at the species level.

2/12
Value of k for k-mers: Throughout all experiments, we used k = 4 for reads and reference. Therefore the k-mer feature vectors have a dimension of 256.

k-mers from test dataset:
In the test dataset, described above, the shortest read is 450 bp long. Therefore, for each read sequence, we used first 450 bp sequence to compute k-mers. Using k = 4, the generation of k-mers feature vectors from reads (testset) took 21 minutes of execution time. Note that the k-mers test dataset is same as that used in Section 3.2.2 of [3].

Results:
We first investigated the convergence of ARK and how the performance of ARK behaves with increasing cluster number. We used η = 0.0005. The performance of ARK using three estimation methods is shown in Figure A. We note that the gross performance trend typically improves with the increase in clusters and then saturates. ARK-SEK saturates at 24 clusters with the VD of 0.0103. The ARK-SEK performance is much better than ARK-Taxy and ARK-Quikr. The scope of improvement of SEK via ARK-SEK is found to be limited. In contrast, Figure B shows that Quikr is found to improve significantly via ARK-Quikr (after convergence).
The convergence time of ARK-SEK was 763 seconds (∼12 minutes). Note that K-means clustering is common for all ARK methods either using SEK, Taxy and Quikr. As ARK-Quikr converged at 17 clusters and took 18 seconds in total, it is clear that the clustering algorithm does not require heavy computational resource. Typically K-means clustering is a simple algorithm and hence can be used for a very large size read dataset without much demand in computational resource.
Finally we compare SEK, ARK-SEK and BEBaC against the ground truth in Figure C. BEBaC results are highly accurate with VD = 0.0038, but come with the requirement of a computation time in the order of more than a day. In comparison, ARK-SEK provides a reasonable performance at the expense of total online computation time of 21 + 12 = 33 minutes. Therefore, while BEBaC is assumed to incapable of handling a very large dataset of reads, ARK-SEK can be used for this scenario.

Genus-level reconstructions
To provide more detailed insight into how ARK Quikr and ARK SEK perform in comparison to RDP's NBC, we included here genus level reconstructions of a few real data sets (detailed in the main text). We include three samples spanning three variable regions (V1-V2, V3-V5, and V3-V4) and two body sites (vagina and feces). Since each method reconstructs a large number of very low abundance organisms, for simplicity of visualization, we restricted our attention to those genera that appear at an abundance level of ≥ 5% in any of the three methods. Figure D gives a bar chart of each method's genera-level reconstruction on the three data sets. There was a general trend of the ARK SEK reconstruction agreeing more similarly to the RDP NBC results, as was observed in each of the real biological samples described in the main text.

Primer details
For the samples detailed in the main text, 16S rRNA gene amplicons were generated from the DNA extractions using the primer combinations shown in Table 1, where the letters in italics show the region targeting the 16S rRNA gene, non-italicized letters show Illumina adapter, primer pad and linker sequences and "nnnnnnnnnnnn" indicates where unique 12-base Golay barcode sequences were incorporated for each sample.

Independence of error correction method
One potential issue is that using a different error-correction method might change the real biological data results contained in the main text. We therefore include in this section PCoA plots for the real biological data using an alternate error correction method, as well as results for using joined paired-end reads.

Read trimming
We used the same computational protocol contained in the main text section on real biological data, but for a different error correction protocol. Namely, we simply trimmed the reads to 150bp. Contained in Figure E are the PCoA plots corresponding to Figures 7, 8, and 9 in the main text. Note the similarity of clustering.

Joining paired-end reads
We used the same computational protocol contained in the main text section on real biological data, but this time we first joined the paired-end sequences (using the ea-utils script fastq-join [6]). Contained in Figure F are the PCoA plots corresponding to Figures 7, 8, and 9 in the main text. Note the similarity of clustering.

Effect of k-mer size
We next consider the question of whether using larger k-mers yields improved performance for Quikr, SEK, ARK Quikr, and ARK SEK. To that end, we utilized the same test datasets (Reads) and Training dataset (Reference) as described in the main manuscript. We then ran each method on all simulated datasets for k = 4, 5, 6, 7, 8, 9, 10. For ARK SEK and ARK Quikr, we set the number of clusters to the relatively low value of 10 as a higher number of clusters, combined with larger k-mer sizes, led to prohibitive computational times, particularly for ARK SEK. Figure G demonstrates that each method experiences decreased error as the k-mer size increases, but with diminishing returns (i.e. the decrease in error from k = 9 to k = 10 is less than from k = 8 to k = 9). As expected, the execution time of each method increased exponentially as the number of k-mers increases exponentially in k. Figure H gives the execution time of each method. Figure I gives a semi-log plot with the execution time on a logarithmic scale so the relative execution times can be compared.
Given the rapidly increasing execution time combined with the slow decrease in reconstruction error as a function of k-mer size, we recommend using the largest k-mer size as possible that a user's time constraints will allow for.

Excluding sister taxa
We also investigate here how well each method will perform when the testing data has organisms absent from the training database. To this end, we randomly selected one species from each genera in the RDP training set 7 and formed a testing database. Note that in RDP training set 7, some genera consist of a single representative species. The remaining sequences were then used to form a training database. The testing database was then utilized to form 180 simulated datasets using the same simulation parameters described in the main text (under the Test datasets (Reads) section). Each method (RDP, Quikr, SEK, ARK Quikr, and ARK SEK) was retrained using this new, reduced testing database. The parameters utilized for each method were the same as in the main text, and the number of clusters for ARK was set to 75.
Included in figure J is a bar chart of the mean error for each method at the genus level. Compare this with Figure 4 from the main text. As expected, each method experienced an increase in error. The percent increase in mean error at the genus level for RDP, Quikr, SEK, ARK Quikr, and ARK SEK was respectively 3%, 2%, 50%, 12%, and 24%. This is consistent with previous observations [7, Table 1] where it was shown that including novel organisms in the testing data marginally increases the error of Quikr.
However, even given the increase in error of ARK Quikr and ARK SEK, these two methods remain more accurate than RDP's NBC. Indeed, ARK Quikr remains 14% more accurate than RDP, and ARK SEK remains 43% more accurate than RDP.