Conceived and designed the experiments: MM GMW ES. Analyzed the data: JM SS SY KK RS NS JO ZW. Wrote the paper: JM GMW MM.
The authors have declared that no competing interests exist.
The Human Microbiome Project (HMP) aims to characterize the microbial communities of 18 body sites from healthy individuals. To accomplish this, the HMP generated two types of shotgun data: reference shotgun sequences isolated from different anatomical sites on the human body and shotgun metagenomic sequences from the microbial communities of each site. The alignment strategy for characterizing these metagenomic communities using available reference sequences is important to the success of HMP data analysis. Six next-generation aligners were used to align a community of known composition against a database comprising reference organisms known to be present in that community. All aligners reported nearly complete genome coverage (>97%) for strains with over 6X depth of coverage; however, they differed in speed, memory requirements and ease-of-use issues such as database size limitations and supported mapping strategies. The selected aligner was tested across a range of parameters to maximize sensitivity while maintaining a low false positive rate. We found that constraining alignment length had more impact on sensitivity than did constraining similarity in all cases tested. However, when reference species were replaced with phylogenetic neighbors, similarity began to play a larger role in detection. We also show that choosing the top hit randomly when multiple, equally strong mappings are available increases overall sensitivity at the expense of taxonomic resolution. The results of this study identified a strategy that was used to map over 3 tera-bases of microbial sequence against a database of more than 5,000 reference genomes in just over a month.
A key goal of the Human Microbiome Project (HMP) is the characterization of the microbial communities present in different body habitats
An alternative method to characterize the structure of microbial communities is to generate shotgun metagenomic sequence, which provides advantages such as the exclusion of biases introduced by using the 16S marker gene for community profiling. Bias in shotgun sequencing is introduced mainly by the sequencing platform used, so shotgun data provide a better absolute measurement of species abundances than do 16S rRNA measurements, assuming adequate coverage is generated. Hence, aligning the shotgun metagenomic sequences generated from samples originating from the different body habitats against microbial reference genomes can generate abundance tables that contain information for comparative metagenomics and that are free of typical 16S biases. The best method for generating comprehensive abundance tables is to align the metagenomic shotgun reads against a collection of reference genomes comprising the whole genome sequences of all available microorganisms (including the four major superkingdoms, Archaea, Bacteria, Eukaryota and Viruses). To accomplish this in a timely and robust manner for the HMP, which generated over 7 tera-bases of sequence data, effort was invested in the exploration of available tools and methods.
A wide variety of short read alignment software has been developed in recent years
The final Reference Genome Database (RGD)(
An overview of the process of creating our Reference Genome Database (RGD). Complete and WGS genomes were downloaded from GenBank, plasmid sequences were removed to simplify redundancy screening, and then the Mauve genome assembly tool was used to identify redundant strains that were subsequently removed (except for HMP strains, which were always kept). For strains remaining after redundancy removal, their corresponding plasmids were restored into the database. This database was periodically updated as new strains became available over the course of the project.
The Mock Metagenomic Database (MMD) that was used for aligner comparisons and parameter optimization comprised 20 bacterial genomes from 17 genera and one archaeal strain (
The percentage of the 22,735,802 mock community reads that mapped to the Mock Metagenomic Database (MMD) ranged from 63% to 92%, with the two extremes being from SOAP and SMALT, respectively (
Aligner | Default parameters | Mapping style | Memory footprint (Gb) | Database size limit | Paired and fragment reads map together | Run time (minutes) | Reads mapped (#) | Reads mapped (% of total) |
SOAP | M = 4 r = 1 m = 150 x = 600 | top random | <4 | 4 Gb | no | 84 | 14,376,440 | 63.23 |
MAP | w = 16 a = 3–legacy-cigar | topN = 5 | 14 | no limit | no | 24 | 19,491,796 | 85.73 |
SMALT | f = samsoft | unique only | <4 | varies | no | 165 | 20,874,489 | 91.81 |
BWA | <default> | top random | <4 | 4 Gb | no | 120 | 18,446,241 | 81.13 |
CLC | q p fb ss 180 250 | top random | ∼2 | no limit | yes | 15 | 20,184,374 | 88.78 |
NOVOALIGN | -F STDFQ -r Random -o SAM -I PE 215,50 | top random | <1 | 20 Gb | no | 206 | 19,330,244 | 85.02 |
varies depending on search window size.
4 Gb x step size limit of aligner (max value 5).
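The "% of total" column follows directly from the read counts and the 22,735,802 mock community reads stated in the text. The short Python sketch below reproduces that arithmetic; the function and variable names are illustrative only and are not part of any pipeline described here.

```python
# Total number of mock community reads, as stated in the text.
TOTAL_READS = 22_735_802

# Reads mapped per aligner, copied from the aligner comparison table.
reads_mapped = {
    "SOAP": 14_376_440,
    "MAP": 19_491_796,
    "SMALT": 20_874_489,
    "BWA": 18_446_241,
    "CLC": 20_184_374,
    "NOVOALIGN": 19_330_244,
}


def percent_mapped(mapped, total=TOTAL_READS):
    """Return the percentage of reads mapped, rounded to two decimals."""
    return round(100.0 * mapped / total, 2)


percentages = {name: percent_mapped(n) for name, n in reads_mapped.items()}

for name, pct in sorted(percentages.items(), key=lambda item: item[1]):
    print(f"{name:10s} {pct:6.2f}%")
```

Sorting by percentage makes the two extremes (SOAP lowest, SMALT highest) easy to see.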
Species | BWA breadth (%) | BWA depth (x) | MAP breadth (%) | MAP depth (x) | CLC breadth (%) | CLC depth (x) | SMALT breadth (%) | SMALT depth (x) | SOAP breadth (%) | SOAP depth (x) | NOVOALIGN breadth (%) | NOVOALIGN depth (x) |
| 99.98 | 237.75 | 98.79 | 252.83 | 99.99 | 295.15 | 99.46 | 292.42 | 99.87 | 165.49 | 99.97 | 245.59 |
| 99.98 | 73.02 | 99.99 | 76.73 | 99.99 | 76.25 | 99.52 | 76.06 | 99.90 | 68.53 | 99.98 | 73.62 |
| 99.99 | 40.39 | 99.55 | 37.28 | 99.99 | 42.22 | 98.66 | 41.99 | 99.97 | 38.94 | 99.99 | 40.70 |
| 99.96 | 38.76 | 99.98 | 41.64 | 99.97 | 40.43 | 98.10 | 39.45 | 99.97 | 36.66 | 99.95 | 39.04 |
| 100.00 | 36.95 | 99.51 | 39.47 | 100.00 | 38.75 | 96.98 | 37.53 | 100.00 | 34.62 | 100.00 | 37.24 |
| 99.98 | 34.96 | 99.99 | 37.75 | 99.98 | 39.28 | 99.52 | 39.41 | 99.98 | 28.20 | 99.98 | 35.53 |
| 100.00 | 34.48 | 99.38 | 36.70 | 100.00 | 35.88 | 97.85 | 34.97 | 99.99 | 32.79 | 100.00 | 34.74 |
| 100.00 | 22.15 | 99.85 | 23.74 | 100.00 | 23.02 | 98.24 | 22.78 | 100.00 | 21.11 | 100.00 | 22.29 |
| 99.99 | 21.60 | 99.85 | 24.94 | 100.00 | 23.97 | 95.00 | 22.62 | 99.89 | 18.25 | 99.99 | 21.85 |
| 91.46 | 20.90 | 92.03 | 22.64 | 92.56 | 22.91 | 92.09 | 22.75 | 88.25 | 20.73 | 91.57 | 21.52 |
| 99.90 | 18.88 | 99.94 | 21.15 | 99.94 | 22.58 | 99.37 | 22.87 | 99.51 | 13.91 | 99.91 | 19.44 |
| 100.00 | 15.52 | 99.32 | 15.85 | 100.00 | 16.12 | 99.07 | 16.15 | 99.99 | 14.79 | 99.99 | 15.62 |
| 99.05 | 12.87 | 99.23 | 15.12 | 99.60 | 16.99 | 99.41 | 17.50 | 95.12 | 8.74 | 98.92 | 13.39 |
| 99.96 | 11.09 | 99.97 | 11.55 | 99.97 | 11.55 | 99.26 | 11.61 | 99.92 | 10.55 | 99.95 | 11.16 |
| 99.90 | 10.25 | 98.99 | 10.44 | 99.91 | 10.64 | 98.65 | 10.64 | 99.88 | 9.92 | 99.90 | 10.32 |
| 99.48 | 7.40 | 98.95 | 7.76 | 99.62 | 7.96 | 98.49 | 8.02 | 98.72 | 6.63 | 99.40 | 7.46 |
| 97.91 | 6.54 | 97.35 | 6.72 | 98.19 | 6.79 | 97.01 | 6.87 | 97.70 | 6.46 | 97.90 | 6.59 |
| 89.60 | 3.18 | 89.28 | 3.26 | 90.17 | 3.30 | 89.72 | 3.38 | 88.70 | 3.44 | 89.58 | 3.21 |
| 80.50 | 2.24 | 82.50 | 2.59 | 86.18 | 2.84 | 89.41 | 3.16 | 65.73 | 2.31 | 79.80 | 2.32 |
| 47.94 | 0.91 | 47.76 | 0.92 | 49.00 | 0.96 | 51.63 | 1.03 | 46.94 | 1.88 | 48.02 | 0.92 |
| 20.05 | 0.31 | 20.98 | 0.33 | 20.73 | 0.32 | 25.98 | 0.42 | 19.64 | 1.53 | 20.11 | 0.31 |
A detection cutoff of 1% breadth and 0.01x depth of coverage was used allowing the detection of low abundance species (such as
We looked first at the total number of reads mapped at each parameter combination. The Illumina GAIIx reads from the mock community (22,735,802 reads) were aligned to the MMD, which contained genome sequences for all organisms in the mock community. We found that the minimum length of alignment required (in terms of query length) has more of an effect on mapping sensitivity than does varying the percent identity required within the length of the alignment (
The different parameter combinations were also evaluated with regard to their ability to identify each genus independently, by examining their effects on the breadth and depth of coverage for all the genomes present in the mock community.
Aligner | SOAP | MAP | SMALT | BWA | CLC | NOVOALIGN |
SOAP | | 0.28385 | 0.0217 | 0.49387 | 0.01281 | 0.37475 |
MAP | | | 0.99972 | 1 | 0.99992 | 1 |
SMALT | | | | 0.9992 | 1 | 0.99991 |
BWA | | | | | 0.99989 | 1 |
CLC | | | | | | 0.99997 |
NOVOALIGN | | | | | | |
Based on depth of coverage per genome. Values ≥ 0.05 are considered significantly similar.
Aligner | Correlation coefficient |
SOAP | 0.6960784 |
MAP | 0.7720588 |
SMALT | 0.7941176 |
BWA | 0.7573529 |
CLC | 0.7916667 |
NOVOALIGN | 0.7720588 |
Often the genome of the exact strain present in a microbial community is not represented in the RGD. Therefore, we tested the parameters under low identity conditions, when the exact query strain is not present in the reference, but a taxonomically related organism from the same genus is (
This plot shows the percent of total mock queries able to be mapped to the mock database at each given CLC parameter combination. The Mock vs. Mock data (dark blue) uses the original MMD, which contains all strains present in the mock community. The Mock vs. Amended data (light blue) shows the same results when the mock query is mapped to an amended MMD where several strains were removed and other strains from the same genus were included in their place.
Looking into the coverage of the amended MMD (
This image displays a phylogenetic tree based on 16S data for all 21 strains in the MMD, and also the 4 strains used as replacements in the amended MMD (
Basing the decision on these observations, the suggested cutoff for community profiling using shotgun metagenomic sequences is 80% identity over 75% of the length of the query. This setting represents a good balance between sensitivity and accuracy, even in an environment where not all strains in the community will be represented in the reference database.
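To make the two cutoffs concrete, the following Python sketch applies a minimum identity and a minimum length fraction to a handful of invented toy alignments, and counts how many survive at each of the six parameter combinations tested. It is an illustration under stated assumptions, not the CLC implementation: identity is assumed here to be matches divided by aligned length, length fraction is assumed to be aligned length divided by query length, and all names and toy values are hypothetical.

```python
from itertools import product


def passes(aligned_len, matches, query_len, min_identity, min_length_fraction):
    """Return True if an alignment meets both cutoffs.

    Assumed definitions (not taken from the aligner's documentation):
      identity        = matches / aligned_len
      length fraction = aligned_len / query_len
    """
    if aligned_len == 0 or query_len == 0:
        return False
    identity = matches / aligned_len
    length_fraction = aligned_len / query_len
    return identity >= min_identity and length_fraction >= min_length_fraction


# Hypothetical toy alignments: (aligned_len, matches, query_len).
toy_alignments = [
    (100, 100, 100),  # full-length, perfect match
    (80, 70, 100),    # 80% of the query aligned at 87.5% identity
    (60, 58, 100),    # 60% of the query aligned at ~96.7% identity
    (100, 85, 100),   # full-length at 85% identity
]

# The six combinations described in the text: length fraction x identity.
grid = list(product((0.50, 0.75, 1.00), (0.80, 0.90)))

kept = {
    (length_fraction, identity): sum(
        passes(a, m, q, identity, length_fraction) for a, m, q in toy_alignments
    )
    for length_fraction, identity in grid
}

for (length_fraction, identity), n in kept.items():
    print(f"length >= {length_fraction:.0%}, identity >= {identity:.0%}: {n} kept")
```

With these toy values, the suggested setting of 80% identity over 75% of the query length keeps three of the four alignments, dropping only the one that covers 60% of the query.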
Genus | Original strain in mock community | Replacement strain | Genome wide similarity |
| | | ∼46% |
| | | ∼15% |
| | | ∼81% |
| | | ∼78% |
| | | na |
For
We next mapped the reads from the mock community against the RGD. When using the ‘top random’ mapping strategy (when the aligner randomly reports one hit in the case of multiple equally high scoring top hits) with 80% identity and a 75% fraction of length cutoff, 67% of all mappings are to the correct strains present in the mock community, 21% map to non-mock community strains but within the correct genus, 12% of reads don’t map at all and close to 0% of reads (63,321 out of 22,735,802) map to an organism of the wrong genus (
Considering these results at the species rather than the genus level, we find that under the top random mapping strategy, about 4% of the reads that had previously been classified to the correct genus were not able to be assigned to the correct species (
This plot shows the log transformed depth values for the mock query versus the amended MMD on the y-axis, and the mock query versus the original MMD on the x-axis. Unaffected genera should lie along the diagonal, while those showing a change in depth of coverage will fall off the diagonal. The amended genera are indicated, and the 4 that were swapped do stand off the diagonal. The genus
We also plotted the detected coverage of mock strains when the 22,735,802 mock query sequences were aligned against both the MMD and RGD under both top random and unique placement only mapping strategies. When mapped against the MMD, both strategies displayed very similar coverage for all strains (
We found two cases for which this observation did not hold true. The mock strains
Accuracy was similar for most of the tested aligners, so the choice of aligner was driven primarily by practical considerations, such as which tool had the smallest memory footprint and which ran fastest. Additional major determining factors were i) which aligner could handle the size of our reference database and ii) which aligner could map both paired end reads and fragment reads in a single execution. The number of reference genomes is increasing at a rapid rate. For example, the HMP alone is committed to sequencing 3,000 bacterial genomes over its course, resulting in an ever-increasing size of the RGD (presently 7.3 Gb). Many available next generation aligners impose a 4 Gb database size limitation, which is a technical hurdle for mapping algorithms that use the Burrows-Wheeler transform in their implementation (e.g.
The SOAP aligner was a statistically significant outlier, detecting fewer hits to all strains in the MMD than the other aligners. BWA’s primary weakness was its inability to handle a database larger than 4 Gb in size. The SMALT aligner, while claiming to be able to support larger databases if the user increases the search window size, was unable to handle a database larger than 6 Gb in our hands. In addition, the loss of sensitivity prompted by an increased window size (data not shown) was of concern. Novoalign displayed the smallest memory footprint of all aligners tested during our benchmarking. Its limitation proved to be speed: it was the slowest aligner tested (over 10 fold slower than the frontrunner). MAP performed similarly to CLC and was able to support the large database size we required, but the version tested was limited in that the only available mapping strategy revolved around its topN setting, which reports only hits with that number or fewer identical top hits (i.e. topN = 5 tells MAP not to report a query that aligns equally well to >5 spots in the reference). Drastically increasing the topN value to ensure that we were not missing hits caused a significant increase in the amount of memory needed to complete the alignment. Note that parameter modifications have since been made in MAP to address this issue (Brian Hilbush, RealTime Genomics, personal communication), but only after this evaluation had been completed. Finally, only the CLC aligner was able to map both fragment and paired end reads in a single execution while still considering read pairing information. While several aligners achieved similar levels of sensitivity and accuracy, the overall feature set that CLC offered tipped the balance, and so it was selected for the optimization related analysis in our study.
None of the aligners compared were able to map 100% of the 22,735,802 mock community reads back to the MMD. Depending on the aligner, only 63% to 92% of the mock community queries could be aligned (
The CLC parameters were tested to achieve maximum sensitivity while minimizing false positives on a gross level. Because of limitations in the availability of bacterial organisms for inclusion in the reference database, no amount of parameter tweaking can completely overcome the problem of false positive detection. However, by considering the problem at a higher taxonomic level (the genus level), where we do have good representation across the phylogeny, we were able to arrive at a parameter combination that provides a relatively good profile of a microbial community.
Based on the results, in the ideal case when all organisms in the query pool are represented in the database (as in the case of aligning the mock query data against the MMD), the length constraint had a much stronger impact on sensitivity than did the various similarity settings tested. It was also apparent that only the most stringent length requirements hampered sensitivity. But when we attempted to model the state of live data by replacing several strains with other organisms from within the same genus, we began to see a difference in community structure reflecting changes in the required percent identity. This is expected when sequences are mapped to more divergent strains. Furthermore, there is a significant overall decrease in detection caused by the substitution of
The experiment investigating the effects of mapping strategy on taxonomic resolution (i.e. the ability to correctly identify an organism at a given taxonomic level) showed a clear trade-off between the fraction of the reads representing a sample that can be characterized and the accuracy of that classification. As shown in
Looking into the effect of strain representation within the reference database on mapping resolution, we found a relationship between the number of strains available in the RGD under a given genus and our ability to resolve down to the strain level. Mappings performed against the MMD, where only the mock strains known to be present were available, showed that mapping strategy played little role in the ability to detect coverage. In such a perfect scenario almost any hit will be to the correct strain because there are no phylogenetically closely related neighbors to compete for alignment within conserved regions that could preclude its detection under unique placement only rules. But when a query metagenomic shotgun sequence is mapped to the RGD, strains with many similar neighbors available in the database preclude accurate mappings to finer grained taxonomic levels. In summary, genera with many strains available in the reference database, such as
Furthermore, based on alignments of the mock queries against the RGD under the top random alignment strategy, we found that the number of false positive identifications at the species level is higher than what is seen when taxonomic assignments are made at the genus level. Approximately 4% of these classifications are incorrect at the species level, but all of those reads can be mapped to the correct genus. When using the unique only mapping protocol, we did find a few more false positive classifications at the species level, but the overall misclassification rate was not significantly inflated (0.3% false positive rate). This is expected because the only time a misclassification can happen under unique only rules is when the sequence’s strain of origin is missing from the RGD, but the read happens to fall into a region that is divergent from other close neighbors within the same species, but conserved in some other organism represented in the database. Based on our results, this is a very rare event.
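The trade-off between the two mapping strategies, and the strain, species and genus level bookkeeping used above, can be sketched in a few lines of Python. This is a simplified illustration rather than the procedure used in the study: each read is assumed to arrive with a list of reference strains tied for the best alignment score, and the strain identifiers and taxonomy table are hypothetical.

```python
import random


def choose_hit(best_hits, strategy, rng=random):
    """Pick the reported hit from the references tied for the best score.

    'top_random'  reports one of the tied best hits at random.
    'unique_only' reports a hit only when there is exactly one best hit.
    """
    if not best_hits:
        return None
    if strategy == "unique_only":
        return best_hits[0] if len(best_hits) == 1 else None
    if strategy == "top_random":
        return rng.choice(best_hits)
    raise ValueError(f"unknown strategy: {strategy}")


def classify(hit, true_strain, taxonomy):
    """Label a reported hit by the finest taxonomic level at which it is correct.

    taxonomy maps a strain identifier to a (genus, species) tuple.
    """
    if hit is None:
        return "unmapped"
    if hit == true_strain:
        return "correct strain"
    hit_genus, hit_species = taxonomy[hit]
    true_genus, true_species = taxonomy[true_strain]
    if (hit_genus, hit_species) == (true_genus, true_species):
        return "correct species"
    if hit_genus == true_genus:
        return "correct genus"
    return "wrong genus"


# Hypothetical taxonomy: two strains of one species, a second species in the
# same genus, and one strain from an unrelated genus.
taxonomy = {
    "strainA1": ("GenusA", "species1"),
    "strainA2": ("GenusA", "species1"),
    "strainA3": ("GenusA", "species2"),
    "strainB1": ("GenusB", "species1"),
}

# A read from strainA1 that aligns equally well to three GenusA strains.
tied_hits = ["strainA1", "strainA2", "strainA3"]

unique_label = classify(choose_hit(tied_hits, "unique_only"), "strainA1", taxonomy)
random_label = classify(
    choose_hit(tied_hits, "top_random", random.Random(0)), "strainA1", taxonomy
)
print(unique_label)  # the read is discarded under unique-only rules
print(random_label)  # the read is kept, but possibly at a coarser level
```

In this toy case the unique-only strategy discards the read entirely, whereas the top random strategy always places it in the correct genus but only sometimes on the correct strain, which mirrors the sensitivity versus resolution trade-off discussed above.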
In future studies, the more advanced approach would be to generate a pan-genome (e.g.
In conclusion, we compared six short read aligners for the purpose of identifying an aligner and parameters that will enable accurate profiling of metagenomic communities for any project that uses large NGS datasets and aims at completing the analysis within a reasonable timeframe. We used a mock community sample of known composition and aligned it against the MMD, which comprises genome sequences of all organisms in the community. Five of the six aligners perform similarly well, with the notable exception of the SOAP aligner, which seemed to detect less coverage in general. The selection of CLC aligner was prompted by several practical factors: i) the ability to handle large databases, ii) its ability to map both paired end and fragment sequence data in a single operation and iii) its speed and small memory footprint. The MAP aligner held a respectable second place, but its lack of support for traditional top random & unique only mapping strategies (at the time of this evaluation) and its inability to map both paired end and fragment reads simultaneously kept it from taking the lead.
Once the best performing aligner was chosen, we focused on identifying appropriate parameters for mapping shotgun metagenomic data. When the database provided the exact strain targets for all reads in the query, we found that the length of alignment constraint had the strongest effect on mapping sensitivity, with the percent identity (considering only two fairly stringent settings) having only a minimal effect. But when several MMD strains were swapped out for other organisms from the same genus, the percent similarity setting became more important. When the genome of an exact strain present in the metagenomic community is not sequenced (and is therefore absent from the reference database) but the genome of a close relative is sequenced, a slightly more lenient similarity cutoff can improve sensitivity at the species or genus level. The suggested parameter settings for profiling microbial community structure using metagenomic shotgun sequences are 80% similarity over 75% of the length of the query.
We further explored the issue of mapping resolution and the effects of taxonomic density (i.e. the number of closely related strains available under a species or genus) within the RGD. We weighed the cost of the top random mapping strategy, a loss of resolution at the strain level, against its benefit, the ability to map a larger fraction of samples to the genus level. While identifying a larger percentage of samples at a lower resolution might have more immediate value for some applications than correctly identifying a smaller portion of the samples at a finer grained taxonomic level, it is important to note that the unique placement only alignment strategy makes it possible to map with a greater degree of taxonomic clarity. We also showed that, under the top random mapping strategy, the number of conserved strains or species present within a genus increases the likelihood of correctly identifying the genus of a read while lessening the likelihood of correctly identifying the exact strain (or species). The final Read Mapping Standard Operating Procedure is described in
For a number of the analyses described in this paper we used a mock community comprising 20 bacterial and 1 archaeal species, mixed together at different concentrations per strain
For the ‘Mapping resolution’ analysis we generated a database comprising archaeal, bacterial, lower eukaryotic and viral organisms available in GenBank, referred to as the ‘Reference Genome Database’ (RGD). These sequences were downloaded via keyword search from the NCBI’s GenBank database on 11/10/2009. The bacterial component underwent special processing as described below, but for the other three superkingdoms, we used the keywords “Archaea[ORGN]”, “Virus[ORGN]” and “Eukaryota[ORGN] NOT Bilateria[ORGN] NOT Streptophyta[ORGN]” (for Archaea, Virus, and lower Eukaryotes respectively), along with the descriptor “complete” and/or “WGS”. All archaeal, viral and lower eukaryotic strains found in that manner were included in the RGD. For the bacterial component of the RGD, a similar keyword search was used, “Bacteria[ORGN] and complete” and “Bacteria[ORGN] and WGS”, followed by removing highly redundant strains that were not part of the HMP. For this redundancy removal step, all sequences from a given genome were first tagged with a prefix unique to that strain. This allows a hit to any component of a draft genome to be easily related back to its parent genome, and was a required step to enable the creation of abundance metrics per genome. The complete and draft genomes were categorized on per species level, resulting in categories including anywhere from single strains to those including many strains per species (e.g.
Six aligners were tested, BWA
Alignments from each aligner were collected using a random top-hit strategy for all programs that supported it (BWA, CLC, SOAP, Novoalign), and the default mapping strategy of the aligner for the others (MAP, SMALT). The top random mapping strategy involves reporting only a single, best hit per query, and in the case of a query having multiple, equally strong best hits (i.e. mapping quality 0
The breadth (defined as the percentage of covered bases over the length of the reference genome) and depth (defined as the sum of the depths of each covered base divided by the length of the genome) of coverage were calculated based on all alignments of each genome represented in the MMD using a software package called RefCov (
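As an illustration of the two coverage definitions given above, the Python sketch below computes breadth and depth from a list of per-base depths for a single genome. It is not RefCov and makes no claim about how RefCov is implemented; the function names are hypothetical, and treating the 1% breadth and 0.01x depth detection cutoff as requiring both conditions is an assumption.

```python
def breadth_and_depth(per_base_depth):
    """Compute breadth (%) and depth (x) for one reference genome.

    breadth = 100 * (number of covered bases) / (genome length)
    depth   = (sum of depths over covered bases) / (genome length)
    """
    genome_length = len(per_base_depth)
    if genome_length == 0:
        raise ValueError("genome length must be greater than zero")
    covered = [d for d in per_base_depth if d > 0]
    breadth = 100.0 * len(covered) / genome_length
    depth = sum(covered) / genome_length
    return breadth, depth


def is_detected(breadth, depth, min_breadth=1.0, min_depth=0.01):
    """Apply the detection cutoff, assuming both thresholds must be met."""
    return breadth >= min_breadth and depth >= min_depth


# Toy genome of four bases, two of which are covered (depths 2 and 4).
breadth, depth = breadth_and_depth([0, 2, 4, 0])
print(breadth, depth, is_detected(breadth, depth))
```

For the toy genome, two of four bases are covered, giving a breadth of 50%, and the covered depths sum to 6 over a genome length of 4, giving a depth of 1.5x.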
Parameter optimization was performed only for the aligner that best fulfilled all the required criteria (CLC bio’s CLC Assembly Cell) by varying the minimum similarity setting (-similarity) and the minimum length of alignment setting (-lengthfraction) across 6 different combinations. The tested combinations include: i) 50% length +80% identity (default), ii) 50% length +90% identity, iii) 75% length +80% identity, iv) 75% length +90% identity, v) 100% length +80% identity and vi) 100% length +90% identity. The 64bit version of the CLC Assembly Cell long read alignment program, clc_ref_assemble_long, was used with the parameters “–l <% length> -s <% identity> -p fb ss 180 250” where the –l & -s values were varied as described above. The castosam program was used to extract a sam format file
The effects of changing mapping strategies on mapping accuracy and sensitivity were tested by aligning 22,735,802 Illumina GAIIx reads prepared from the mock community against both the RGD and the MMD. CLC Assembly Cell alignments were run using the 64bit version of the program clc_ref_assemble_long with the parameters “–l <% length> -s <% identity> -p fb ss 180 250” where the <% length> + <% identity> settings varied across: i) 50% length +80% identity (default), ii) 50% length +90% identity, iii) 75% length +80% identity, iv) 75% length +90% identity, v) 100% length +80% identity and vi) 100% length +90% identity. Sam files were then produced from the alignment outputs as described in the mapping resolution analysis section above. We report the number of hits to the exact mock strains, to the genera represented by those mock strains, to the species represented by those mock strains, false positive organism assignments and those with no hits at all. The hits were reported using both the top random and unique only mapping strategies against both the RGD and the MMD. We also report how many strains were present in the RGD per genus represented in the mock community.
We thank Sarah K. Highlander for generating the mock community used for the evaluations and for providing information about it.