In recent years, there has been a veritable boost in next-generation sequencing (NGS) of gene amplicons in biological and medical studies. Huge amounts of data are produced and need to be analyzed adequately. Various online and offline analysis tools are available; however, most of them require extensive expertise in computer science or bioinformatics, and often a Linux-based operating system. Here, we introduce “CoMA–Comparative Microbiome Analysis” as a free and intuitive analysis pipeline for amplicon-sequencing data, compatible with any common operating system. Moreover, the tool offers various useful services including data pre-processing, quality checking, clustering to operational taxonomic units (OTUs), taxonomic assignment, data post-processing, data visualization, and statistical appraisal. The workflow results in highly esthetic and publication-ready graphics, as well as output files in standardized formats (e.g. tab-delimited OTU-table, BIOM, NEWICK tree) that can be used for more sophisticated analyses. The CoMA output was validated by a benchmark test, using three mock communities with different sample characteristics (primer set, amplicon length, diversity). The performance was compared with that of Mothur, QIIME and QIIME2-DADA2, popular packages for NGS data analysis. Furthermore, the functionality of CoMA is demonstrated on a practical example, investigating microbial communities from three different soils (grassland, forest, swamp). All tools performed well in the benchmark test and were able to reveal the majority of all genera in the mock communities. Also for the soil samples, the results of CoMA were congruent to those of the other pipelines, in particular when looking at the key microbial players.
Citation: Hupfauf S, Etemadi M, Fernández-Delgado Juárez M, Gómez-Brandón M, Insam H, Podmirseg SM (2020) CoMA – an intuitive and user-friendly pipeline for amplicon-sequencing data analysis. PLoS ONE 15(12): e0243241. https://doi.org/10.1371/journal.pone.0243241
Editor: Zechen Chong, University of Alabama at Birmingham, UNITED STATES
Received: July 15, 2020; Accepted: November 17, 2020; Published: December 2, 2020
Copyright: © 2020 Hupfauf et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The source code of the CoMA pipeline can be accessed at GitHub (https://github.com/SebH87/coma). The amplicon-sequencing data of the three investigated soils are available at the European Nucleotide Archive under the accession numbers ERS3378910 to ERS3378921 within study PRJEB32269 (https://www.ebi.ac.uk/ena/data/view/PRJEB32269).
Funding: This work was supported by the Federal Ministry of Education, Science and Research (https://www.bmbwf.gv.at/), and by the University of Innsbruck (https://www.uibk.ac.at/) within the scope of the project “MicrobeEnergy”. SH was further supported by the doctoral fellowship (LFU Doktoratsstipendium) of the University of Innsbruck, Austria. MG acknowledges support by the Programa Ramón y Cajal (RYC-2016-21231; Ministerio de Economía y Competitividad, https://www.mineco.gob.es/). The publication fees are sponsored by the "Fonds zur Förderung der Wissenschaftlichen Forschung" (https://www.fwf.ac.at/; Project I 989). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Nowadays, many studies in environmental microbiology [1, 2], molecular ecology [3, 4] and medical diagnostics [5, 6] are either based on or accompanied by next-generation sequencing (NGS) analyses. Amplicon-sequencing (e.g. 16S rRNA, 18S rRNA, internal transcribed spacer (ITS), functional genes) is a cost- and effort-efficient alternative to metagenomic approaches and has become a mainstream, high-throughput technique for the determination of biological community structures. The costs have decreased with time and today a sample may be analyzed for less than 50 €. Moreover, amplicon-sequencing allows for the simultaneous screening of hundreds of samples. All these aspects have made amplicon-sequencing approaches extremely popular and led to a multitude of studies investigating all kinds of habitats, among them the microbiota of humans, aquatic ecosystems, soils, deadwood, wastewater treatment systems and anaerobic digesters [12–14], to name only a few. Although most of these studies focus on prokaryotes by targeting the 16S rRNA genes of bacteria and/or archaea, other structural genes like 18S rRNA, 28S rRNA [16, 17] and ITS, or different functional genes [19, 20], are also commonly used.
However, the downside of these high-throughput approaches is the enormous amount of generated data requiring proper analysis. Modern NGS platforms like Illumina NovaSeq yield up to 2 × 10¹⁰ sequences per run (https://emea.illumina.com/systems/sequencing-platforms.html), stored in large data files, usually several gigabytes (GB) in size. The typical workflow for amplicon-sequencing data analysis comprises demultiplexing of the dataset in order to assign sequences to samples and merging of paired-end reads to obtain single sequences (however, there are also situations where single-end reads are provided or where merging is not recommended for some reason). Then, trimming off primers, adapter sequences or barcodes and filtering of bad sequences based on quality or sequence length is required. Similar sequences are clustered into operational taxonomic units (OTUs) using a common similarity cutoff (typically 97%) and OTUs are assigned to a taxonomic reference database such as Greengenes (last update: 2013), RDP (last update: 2016), SILVA, PR2, MIDORI, ITSoneDB or UNITE, a process called de novo assembling (= sequence clustering before taxonomic assignment). The order can also be reversed, resulting in a different strategy for OTU picking that is called closed-reference, or open-reference if it is a combination of both approaches. However, these picking strategies are rare nowadays and most tools use de novo assembling. As an alternative to OTUs, current research is moving towards amplicon sequence variants (ASVs), where sequences are clustered based on differences down to a single nucleotide. This procedure yields real biological sequences rather than constructed clusters, which can be directly compared between different studies. Moreover, ASV approaches tend to be less time-consuming, and no arbitrary threshold and centroid selections are needed.
On the other hand, OTU clustering algorithms such as USEARCH have turned out to produce particularly reliable results and are applied by many scientists. Current publications indicate that both OTU and ASV approaches yield robust results that are well comparable to each other [30, 31]. After OTU/ASV picking, data post-processing, including singleton removal and data subsampling, may be conducted, followed by statistical appraisal (e.g. alpha-diversity, beta-diversity, hypothesis testing, redundancy analysis) and data visualization.
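The difference between threshold-based OTU clustering and exact sequence variants can be illustrated with a deliberately simplified sketch. It is not the USEARCH/UPARSE or DADA2 algorithm: the `identity` function assumes equal-length, pre-aligned reads (real tools use pairwise alignment), `exact_variants` performs no denoising, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions; assumes equal-length, pre-aligned reads."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads: List[str], threshold: float = 0.97) -> Dict[str, int]:
    """Greedy centroid clustering: the most abundant unique sequences seed OTUs first."""
    otus: Dict[str, int] = {}  # centroid sequence -> number of clustered reads
    for seq, n in Counter(reads).most_common():
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid] += n  # join the first centroid that is similar enough
                break
        else:
            otus[seq] = n  # no centroid matched: this sequence founds a new OTU
    return otus


def exact_variants(reads: List[str]) -> Dict[str, int]:
    """Group identical reads only (no error model); the 100% cutoff analogue."""
    return dict(Counter(reads))
```

With a 97% cutoff, a read that differs from an abundant centroid at 2 of 100 positions is absorbed into that OTU, whereas exact grouping keeps it as a separate variant.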
A wide range of software solutions is available for computing all these steps. Web-based applications promise fast and simple solutions for data analysis. Nonetheless, limited upload capacities and collapsing server speed at peak times often bother the user and may turn online data processing into a frustrating procedure. Beyond web-based applications, there are also various offline tools freely available, among them Mothur and “Quantitative Insights Into Microbial Ecology” (QIIME, QIIME2). Both provide a great variety of different tools in terms of explorative data analysis, quality control (QC), data processing, statistics and data visualization, and allow extensive adjustment and fine-tuning of each step. The drawback of these remarkably wide tool packages and options, however, is that the user needs extensive knowledge and experience in handling command-line tools and sequence-based data processing. This problem can only be partially addressed by the numerous online documentation resources and interactive user blogs. If the user does not have sufficient experience and expertise, the complex workflow may be time-consuming and may lead to mistakes and erroneous conclusions. Moreover, many bioinformatic tools require a Linux operating system and advanced operating skills, which users often lack or can acquire only through time-consuming training. An overview of existing software packages is provided in Table 1, showing the most relevant features as well as drawbacks relative to CoMA.
We developed CoMA (Comparative Microbiome Analysis) as a free pipeline for intuitive and user-friendly analysis of amplicon-sequencing data, available for all common computer platforms. We used various open-source, third-party tools and combined them into a linear analysis workflow in the form of a Bash script, starting with the raw input files (in FASTQ format) and resulting in esthetically pleasing and publication-ready graphics. In addition, output files in standardized formats, such as a tab-delimited OTU-table, an OTU-table in BIOM format and a tree file in NEWICK format, are provided. These allow for subsequent secondary analysis using, for example, Cytoscape, GraPhlAn, LEfSe, PICRUSt or R, if desired. The operation of this tool is remarkably intuitive and makes it accessible even for entry-level users. A graphical user interface facilitates the handling, representing a major advantage compared with command-line based applications. Nevertheless, multiple adjustment parameters and the high degree of automation make CoMA also suitable for advanced users who are looking for an efficient and streamlined data analysis. The tool is capable of handling data from today’s most important NGS platforms, including Illumina MiSeq, Illumina HiSeq and Illumina NovaSeq, but also from the former 454 pyrosequencing technology, which was discontinued in 2016, although its data are still in circulation and analysis tools are still needed. This study focuses on the processing of short reads rather than the long reads produced by currently emerging third-generation sequencing technologies like PacBio or Nanopore. Appropriate analysis tools for long reads are not discussed here but are readily available (e.g. SDip). Three different options for installation are available for CoMA, which can all be downloaded from the CoMA webpage (https://www.uibk.ac.at/microbiology/services/coma.html). A detailed manual describes both the installation and the usage of the tool.
Here, we present the new CoMA pipeline and benchmark it by analyzing three different, constructed mock communities. The performance of CoMA is compared with that of Mothur, QIIME and QIIME2, the currently most popular platforms for NGS data analysis (however, the support for QIIME ended in 2018 and the developers suggest switching to QIIME2). Moreover, we apply CoMA to a real dataset on the prokaryotic microbiota of three different soils (forest, grassland, swamp) in order to demonstrate its functionality also on a practical example. Soil samples were analyzed on an Illumina MiSeq device, using a 16S rRNA amplicon-sequencing approach. We hypothesize that the CoMA pipeline works well in the benchmark test, revealing all included genera. We expect CoMA to perform as efficiently and precisely as Mothur, QIIME and QIIME2, and that the results are stable irrespective of the applied dataset. In addition, we hypothesize that the results for the soil samples obtained by the different analysis tools are highly comparable, especially when looking at the key microbial players.
Material and methods
Implementation of CoMA
CoMA is implemented as a linear data analysis pipeline ranging from the processing of raw input files to the computation of results. It comprises four main sections: data pre-processing and quality checking, OTU clustering and taxonomic assignment, data post-processing, as well as data visualization and statistical appraisal (Fig 1). All steps, except for the merging of paired-end reads and OTU clustering/taxonomic assignment, are optional and can be skipped by the user if desired. Therefore, the tool allows for the (partial) re-calculation of a former, already completed CoMA run or the continuation of an unfinished or terminated analysis. CoMA offers a simple user interface, composed of GTK+ dialogs, which guides the user through the analysis and provides important information on each step of the workflow.
Different colors represent the four sub-sections of the CoMA workflow: Data pre-processing and quality checking (orange), clustering of operational taxonomic units (OTUs) and taxonomic assignment (green), data post-processing (blue) and data visualization and statistical appraisal (yellow). Labelled arrows demonstrate the order of events and name specific file types that are needed as input for each step. Taxonomic assignment is done with BLAST, Lambda or RDP using either one of the available databases (e.g. SILVA) or any custom database provided by the user. Numbers indicate third-party tools that are used for the specific CoMA step: 1 = PANDAseq, 2 = PRINSEQ, 3 = LotuS/sdm, 4 = QIIME, 5 = Mothur. TDOT = Tab-delimited OTU-table. PER = Paired-end reads. SER = Single-end reads. PCoA = Principal coordinates analysis.
Data pre-processing and quality checking.
Data analysis using CoMA requires input files in FASTQ format (paired-end or single-end, either uncompressed or compressed with gzip), which is the standard NGS file format provided by all common sequencing platforms. In case of paired-end reads, forward and reverse reads are first merged at their overlapping region, using the open-source tool “Paired-end assembler for DNA sequences” (PANDAseq). This step is followed by a QC step (an individual quality report is created for each sample) and a subsequent sequence selection step, both accomplished by the “Preprocessing and Information of Sequence data” (PRINSEQ) utility. Amplification primers, sequencing adapters and multiplex identifiers are removed (trimming) and reads are selected based on Phred quality score (PQS), amplicon length and ambiguous bases. The selection parameters are different for each specific primer pair and can be adjusted based on the information gained from QC. A second QC step, checking the success of trimming and sequence selection, finally completes the data pre-processing workflow.
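The read selection criteria named above (mean Phred quality score, sequence length, ambiguous bases) can be sketched in a few lines. This is an illustration only and not the PRINSEQ implementation that CoMA calls; the function names are hypothetical, and the quality encoding is assumed to use an ASCII offset of 33.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, str, str]  # (read id, sequence, quality string)


def parse_fastq(lines: List[str]) -> Iterator[Record]:
    """Yield (id, sequence, quality) from plain four-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = (line.strip() for line in lines[i:i + 4])
        yield header[1:], seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a read, assuming an ASCII offset of 33."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def keep_read(seq: str, qual: str, min_len: int, max_len: int,
              min_mean_q: float, max_ambiguous: int = 0) -> bool:
    """Apply the three selection criteria: length, ambiguous bases, mean quality."""
    if not (min_len <= len(seq) <= max_len):
        return False
    if seq.upper().count("N") > max_ambiguous:
        return False
    return mean_phred(qual) > min_mean_q
```

For the mock-13 settings reported below (240–265 bp, mean PQS > 15), a call would look like `keep_read(seq, qual, 240, 265, 15)`.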
OTU clustering and taxonomic classification.
All steps within this program section are either part of the “Less OTU scripts” (LotuS) pipeline or the “simple demultiplexer” tool (sdm). First, an error-correction step is applied, followed by the removal of artificial chimeric sequences, resulting from cross-hybridization of DNA fragments during library preparation. Subsequently, clustering to OTUs is done with the UPARSE-OTU algorithm at a 97% similarity level, following a de novo assembling approach. OTU sequences are then aligned/classified with a desired gene reference database using either “Basic Local Alignment Search Tool” (BLAST), “Local Aligner for Massive Biological Data” (Lambda) or the “Ribosomal database project” (RDP [57, 58]) classifier. CoMA supports several popular taxonomic databases, including SILVA, Greengenes, RDP, UNITE, HITdb, beetax, and PR2, as well as any custom database provided by the user. A custom database needs to be composed of two individual files: a FASTA file with all sequences and a TAX file with the accompanying taxonomic information. The correct structure for both files is provided in the CoMA manual. It is possible to use multiple databases, where the first serves as primary and the other(s) as backup database(s). OTUs that cannot be assigned to the primary database are then diverted to the backup database(s). This workflow section is completed by the generation of universally applicable output files (tab-delimited OTU-table, BIOM file, NEWICK tree), allowing for specific subsequent data analysis or individual visualization steps by the user if desired.
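The primary/backup database logic described above amounts to a priority-ordered lookup. The following sketch shows only that control flow; the hit tables are hypothetical stand-ins for the output of a classifier such as BLAST, Lambda or RDP, and the function name and return shape are assumptions rather than part of LotuS or CoMA.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a classifier result: OTU id -> lineage string.
Hits = Dict[str, str]


def assign_with_backup(
    otus: List[str], databases: List[Tuple[str, Hits]]
) -> Dict[str, Tuple[Optional[str], str]]:
    """Assign each OTU from the first database (in priority order) that has a hit."""
    result: Dict[str, Tuple[Optional[str], str]] = {}
    for otu in otus:
        result[otu] = (None, "unassigned")  # default if no database has a hit
        for db_name, hits in databases:
            if otu in hits:
                result[otu] = (db_name, hits[otu])
                break  # the primary database wins over any backup
    return result
```

Passing the databases as an ordered list keeps the priority explicit: the first entry plays the role of the primary database, and later entries are consulted only for OTUs that remained unassigned.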
Post-processing of data.
Data post-processing starts with an initial removal step, where very rare OTUs can be dropped from analysis (filter_otus_from_otu_table.py, QIIME). OTUs can be omitted due to a low number of total reads on the one hand or due to rare occurrence within the samples on the other hand. To assess whether the sequencing effort (i.e. number of reads per sample) was sufficient, rarefaction curves are computed using Mothur’s rarefaction.single command. The applied metric can be selected by the user out of the following possibilities: OTUs, Chao1 richness estimator, Shannon-Wiener diversity index, Simpson’s diversity index and Good’s coverage for OTUs. The output is provided either as a text file containing all raw data or as a line plot. Based on rarefaction curve analysis, the dataset can be subsampled to a uniform read count for each sample to bring the data onto a common scale. This step is done using the single_rarefaction.py script provided by the QIIME platform. The pseudo-random number generator used here for subsampling is an implementation of the Mersenne twister. To facilitate use, CoMA provides the read count for each sample, sorted in increasing order. The next step allows for renaming samples; all further results are presented using the newly assigned names. Finally, metadata can be assigned to the samples. These metadata can be used in all upcoming steps in order to group the samples.
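The two filtering ideas in this step, dropping rare OTUs and subsampling every sample to a common depth, can be sketched as follows. This is not the code of the QIIME scripts named above: the table layout and function names are hypothetical, and drawing reads without replacement is an assumption of this sketch. Python's standard `random` module is itself a Mersenne Twister generator, which is why it is used here.

```python
import random
from typing import Dict

OtuTable = Dict[str, Dict[str, int]]  # OTU id -> {sample id: read count}


def drop_rare_otus(table: OtuTable, min_total_reads: int,
                   min_samples: int = 1) -> OtuTable:
    """Keep OTUs with enough reads overall and presence in enough samples."""
    kept: OtuTable = {}
    for otu, counts in table.items():
        total = sum(counts.values())
        present = sum(1 for c in counts.values() if c > 0)
        if total >= min_total_reads and present >= min_samples:
            kept[otu] = counts
    return kept


def subsample(counts: Dict[str, int], depth: int, seed: int = 0) -> Dict[str, int]:
    """Draw `depth` reads without replacement from one sample's OTU counts."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if len(pool) < depth:
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)  # seeded Mersenne Twister for reproducible draws
    out = {otu: 0 for otu in counts}
    for otu in rng.sample(pool, depth):
        out[otu] += 1
    return out
```

Expanding every read into a list is simple but memory-hungry; for depths such as the 92,539 reads used for the soil samples it is still workable, although a production tool would sample from the count vector directly.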
Data visualization and statistical appraisal.
First, summary files, containing the most abundant taxa of each sample at all taxonomic levels (kingdom, phylum, class, order, family, genus, and species) are created. The user can compute a report for bacteria, archaea, fungi and eukaryotes, as well as a general summary. Moreover, a summary report for a specific taxon can be created. During the next step, publication-ready and highly esthetic bar charts and heatmaps can be created for each taxonomic level. The user can choose between plots for bacteria, archaea, fungi, eukaryotes and prokaryotes. Alternatively, plots can be computed for a specific taxon, including all descendent taxa at the lower taxonomic levels. In addition, this step allows the user to set a threshold, based on relative abundance, for taxa to be included in the plots and a decision whether unassigned taxa will be included or excluded from the depictions. Moreover, the user can choose between various popular image file formats (EPS, JPEG, PDF, PNG, PS, RAW, SVG, TIFF) and adjust the pixel density if a raster graphic format was selected. Subsequently, Venn diagrams can be created to compare detected taxa among given groups for each taxonomic level. Within the next step, alpha diversity (or within-sample diversity) can be calculated using various different calculators (observed OTUs, Shannon-Wiener index, Simpson’s index, Pielou’s evenness index, Good’s coverage of counts, Chao1 richness estimator, Faith’s phylogenetic diversity). Every metric has different strengths and limitations—technical discussion of each metric is readily available online and in ecology textbooks. The output is provided either as raw-data text file or as publication-ready bar chart. Finally, CoMA offers two different approaches for analyzing beta diversity: ordination and hierarchical cluster analysis. 
Ordination can be done using various different metrics (Minkowski, Euclidean, Manhattan, Cosine, Jaccard, Canberra, Chebyshev, Braycurtis, Dice, Weighted UniFrac, Unweighted UniFrac). After calculation of a distance matrix, data are visualized in either two- or three-dimensional principal coordinates analysis (PCoA) plots. To quantify the strength of the grouping, statistical tests (ANOSIM, PERMANOVA; 999 permutations) are applied. For cluster analysis, the user can choose between the following linkage methods: Single, Complete, Average (UPGMA), Weighted (WPGMA), Centroid (UPGMC), Median (WPGMC) and Ward. These methods are used to compute the distance d(s,t) between two clusters s and t, using a bottom-up approach. In addition, the user can select a suitable metric to determine the distance between two individual data points: Euclidean, Cosine, Cityblock, Correlation, Jaccard, Braycurtis and Dice. Results from cluster analysis are presented as dendrogram plots.
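To make the diversity calculations more concrete, the sketch below implements several of the count-based alpha-diversity calculators and one beta-diversity route (Braycurtis distance with Average/UPGMA linkage) in plain Python. It is not CoMA's own code: the natural logarithm in the Shannon-Wiener index, the 1 − Σp² form of Simpson's index and the bias-corrected Chao1 formula are assumed conventions, and all function names are hypothetical. Faith's phylogenetic diversity and UniFrac are omitted because they require a tree.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Cluster = Tuple[str, ...]


def shannon(counts: Sequence[int]) -> float:
    """Shannon-Wiener index H' with natural logarithm (log base is an assumption)."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def simpson(counts: Sequence[int]) -> float:
    """Simpson's index in the 1 - sum(p_i^2) form (one of several conventions)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def pielou(counts: Sequence[int]) -> float:
    """Pielou's evenness: H' divided by ln(number of observed OTUs)."""
    s = sum(1 for c in counts if c > 0)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


def goods_coverage(counts: Sequence[int]) -> float:
    """Good's coverage: 1 - (number of singleton OTUs / total reads)."""
    return 1.0 - sum(1 for c in counts if c == 1) / sum(counts)


def chao1(counts: Sequence[int]) -> float:
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def bray_curtis(u: Sequence[float], v: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two samples' OTU count vectors."""
    total = sum(a + b for a, b in zip(u, v))
    return sum(abs(a - b) for a, b in zip(u, v)) / total if total else 0.0


def average_linkage(s: Cluster, t: Cluster,
                    dist: Dict[Tuple[str, str], float]) -> float:
    """UPGMA distance d(s, t): mean pairwise distance between members of s and t."""
    return sum(dist[(a, b)] for a in s for b in t) / (len(s) * len(t))


def upgma(samples: Dict[str, List[float]]) -> List[Tuple[Cluster, Cluster, float]]:
    """Bottom-up clustering; returns the merges as (cluster s, cluster t, d(s, t))."""
    dist = {(a, b): bray_curtis(samples[a], samples[b])
            for a in samples for b in samples}
    clusters: List[Cluster] = [(name,) for name in samples]
    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest average-linkage distance.
        s, t = min(combinations(clusters, 2),
                   key=lambda p: average_linkage(p[0], p[1], dist))
        merges.append((s, t, average_linkage(s, t, dist)))
        clusters = [c for c in clusters if c not in (s, t)] + [s + t]
    return merges
```

The list of merges returned by `upgma` contains the information a dendrogram plot displays: which clusters were joined and at which distance.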
Mock community datasets
All mock community datasets for benchmarking CoMA were taken from the public mockrobiota repository. In order to cover different input sequences in terms of variable region, sequence length and diversity, the three datasets mock-13, mock-16 and mock-26 were selected. Mock-13 comprises 21 bacterial strains (18 genera; three replicates) that were sequenced on an Illumina MiSeq machine (paired-end run) by targeting the V4 region of the 16S rRNA. The same sequencing procedure was also used for mock-16; however, this dataset additionally includes archaea and, with 59 strains (46 genera; three replicates), its overall diversity is much higher than in mock-13. Mock-26 includes ITS data from 11 fungal strains (11 genera) that were sequenced with a gx-flx-titanium sequencer (Roche, Switzerland). The whole dataset comprises over 130 samples, including multiple replicates and different variations of the community (even abundance, uneven abundance, pure culture). For this work, samples Mock.81, Mock.92 and Mock.99 were randomly picked out of the evenly distributed samples.
Mock data analysis with CoMA.
Mock-13 forward and reverse reads were merged in a first step and the quality of the joined reads was then determined. Based on QC analysis, reads were selected due to their sequence length (240–265 bp) and their mean PQS (> 15). The reads were not trimmed since primers and other sequence appendices were already cut off in the raw input files. After the second QC step, reads were clustered into OTUs (> 97% similarity) and representative sequences were aligned to the SILVA database (SSU 132) using BLAST. The same procedure was applied for mock-16, however with two minor modifications. Before quality selection, primers were trimmed off from the forward (19 bp) and reverse end (20 bp). In addition, the PQS selection was more rigorous (> 30) since the overall quality of the reads was higher in mock-16 compared with mock-13.
A slightly different protocol was applied for mock-26 since the dataset was still multiplexed and contained single-end reads. Demultiplexing was done with the “demux emp-single” command within the QIIME2 toolkit since CoMA does not support multiplexed input files in its current version. Selected files (Mock.81, Mock.92, Mock.99) were then subjected to the CoMA pipeline, starting with the first QC step. Reads with a maximum length of 500 bp were then selected. A minimum sequence length was not adjusted since ITS reads typically show a wide distribution in sequence length and thus a loss of information should be avoided. Since the overall quality level of the sequences was high, a minimum average PQS > 30 was adjusted. After the second QC step and the OTU clustering, reads were aligned to the UNITE database (version 8) using the BLAST algorithm.
Mock data analysis with Mothur.
Data analysis of mock-13 and mock-16 with Mothur (v. 1.39.5) was done according to the standard operating procedure (SOP) for Illumina MiSeq data, provided at the Mothur homepage (https://www.mothur.org/wiki/MiSeq_SOP). Input forward and reverse reads were merged with “make.contigs”. After checking the quality with “summary.seqs”, ambiguous bases and reads with a length of > 270 bp (mock-13) and > 292 bp (mock-16) were removed using “screen.seqs”. A table of unique sequences was created with “unique.seqs” and a corresponding count table, providing the occurrences of the unique sequences in each sample, was created using “count.seqs”. To prepare the SILVA (version 132) database, “pcr.seqs” was applied on the database file with the following limit settings: start = 13,862, end = 23,444. Unique sequences were aligned to the prepared database using the “align.seqs” tool and outliers were removed with “screen.seqs” (start = 8, end = 9,578 (mock-13) or 9,582 (mock-16), maxhomop = 8). After a sequence filtering step with “filter.seqs” (vertical = T, trump = .), reads were again checked for unique sequences using the “unique.seqs” tool to exclude potential sequence redundancies introduced in the course of the trimming step. For de-noising of data, sequences were pre-clustered with “pre.cluster” (diffs = 2) and chimeras were removed with “chimera.uchime” and a subsequent “remove.seqs” step. To get rid of sequencing errors, sequences were classified with “classify.seqs” and unknown reads (at kingdom level) as well as reads assigned to chloroplasts, mitochondria or eukaryota were removed with “remove.lineage”. Sequences were then clustered using “cluster.split” (splitmethod = classify, taxlevel = 4, cutoff = 0.03) and a shared file was created with “make.shared”. After taxonomic assignment (“classify.otu”), a BIOM file was created (“make.biom”) and then transformed to a tab-delimited OTU-table using the independent “biom-convert” tool (http://biom-format.org/).
For mock-26, selected files (Mock.81, Mock.92, Mock.99) were imported using “make.file” and “fastq.info”. Low-quality reads were then excluded (“trim.seqs”; qaverage = 15) and all individual files were merged to a combined FASTA file (“merge.files”, “make.group”). After summarizing all data with “summary.seqs”, reads with > 10 ambiguous bases were removed. A greater tolerance of ambiguous bases compared with mock-13/mock-16 was required since all sequences constantly contained a considerable number of ambiguous bases. A sequence length selection was omitted due to the high inconsistency among the ITS data. After finding the unique sequences (“unique.seqs”) and creating a count table (“count.seqs”), analysis was continued as described for the other mock communities, starting with the pre clustering step (“pre.cluster”). The newest Mothur-optimized UNITE release (version 7) was used as reference database for sequence alignment.
Mock data analysis with QIIME.
Data analysis of mock-13 and mock-16 with QIIME (v. 1.8.0) was done according to the recommendations for Illumina data on the QIIME homepage (http://qiime.org/tutorials/processing_illumina_data.html). First, paired-end input reads (all files in FASTQ format) were merged using the “multiple_join_paired_ends.py” script within the QIIME toolkit. This is an alternative version of the “join_paired_ends.py” script and allows the merging of already-demultiplexed FASTQ files. Quality filtering was done with the “multiple_split_libraries_fastq.py” script and chimeras were removed with “usearch61” using the “identify_chimeric_seqs.py” and “filter_fasta.py” scripts. Reads were then clustered into OTUs with the “usearch61” method of “pick_otus.py” and representative (= most abundant) sequences for each OTU were picked using “pick_rep_set.py”. Representatives were then assigned to the QIIME-optimized SILVA SSU database (release 132) with the “assign_taxonomy.py” script (assignment method: uclust) and the OTU-table was constructed using “make_otu_table.py”. The created BIOM file was converted to a tab-delimited OTU-table with the external “biom-convert” tool as described in the previous section.
Selected input files for mock-26 (Mock.81, Mock.92, Mock.99) underwent, in a first step, a quality selection (“split_libraries_fastq.py”), where reads with an average PQS < 15 were removed. All upcoming steps were done as described above for the other mock communities. The newest QIIME-optimized UNITE release (version 8, dynamic construction) served as reference database for sequence alignment (“assign_taxonomy.py”).
Mock data analysis with QIIME2.
Data analysis with QIIME2 (v. 2019.10) followed the “Moving pictures” tutorial, accessible on the program’s webpage (https://docs.qiime2.org/2019.10/tutorials/moving-pictures/). However, mock-13 and mock-16 required a slightly alternative procedure due to their paired-end data structure, and the files from all mock communities (mock-13, mock-16, mock-26) needed to be imported in a different manner since the data were already demultiplexed.
First, mock-13 and mock-16 input files were imported using “qiime tools import” with the settings “SampleData[PairedEndSequencesWithQuality]” as data type and “CasavaOneEightSingleLanePerSampleDirFmt” as input format. After summarizing and quality-checking the imported data (“qiime demux summarize”), “qiime dada2 denoise-paired” was applied. This central step calls the DADA2 pipeline and provides several functions, including sequence trimming and truncation, quality filtering, removal of phiX reads, chimera removal, and the construction of a feature table (suitable filtering settings resulted from the previous quality-checking step). In contrast to all other pipelines used for this study, DADA2 provides ASVs, which are created by grouping unique sequences. ASVs are equivalent to OTUs at a 100% cutoff level and are being discussed as a future replacement for customary OTUs (97% cutoff). Because the pipeline also offers other options for sequence clustering (e.g. Deblur) and results may differ considerably, QIIME2 always means QIIME2-DADA2 within the context of this publication. For mock-16, 19 and 20 bp were trimmed off from the forward and reverse end, respectively. No trimming was applied for mock-13 since all sequence appendices were already removed in the raw input files. Thereafter, taxonomic information was assigned to the representative sequences using a pre-trained Naive Bayes classifier (Silva 132, 99% OTUs from the 515F/806R inner primer region, available at https://docs.qiime2.org/2019.10/data-resources/) and the “qiime feature-classifier classify-sklearn” tool. After exporting the table and taxonomy QIIME2 artifacts with “qiime tools export”, a tab-delimited ASV-table file was created with the external “biom-convert” tool.
For mock-26, selected input files (Mock.81, Mock.92, Mock.99) were imported with “qiime tools import” (data type: “SampleData[SequencesWithQuality]”, input format: “CasavaOneEightSingleLanePerSampleDirFmt”). Sequence preparation and ASV-table construction were again done with the DADA2 pipeline; however, in this case with the single-end version of the plugin (“qiime dada2 denoise-single”). No trimming or sequence truncation was applied. Representative sequences were then taxonomically assigned with a pre-trained UNITE classifier and the “qiime feature-classifier classify-sklearn” plugin. The classifier was trained on the newest UNITE database release (version 8, dynamic construction), and the reads were not trimmed to the ITS primer sites as suggested by the QIIME2 documentation. The final tab-delimited ASV-table file was created as previously described for mock-13 and mock-16.
The chosen dataset for demonstration of the CoMA pipeline originated from a field experiment conducted at the Department of Microbiology, University of Innsbruck, in autumn 2016. It aimed at investigating the microbial community structure in soils from three different habitats: forest (F), grassland (GR) and swamp (S), characterized by different nutrient contents, a specific vegetation and different degrees of human influence. However, all sites were located in close proximity and thus subjected to similar geological and climatic conditions.
Samples were taken in Trins, Tyrol (Austria; 47°04′59” N, 11°25′00” E) on November 28, 2016. The annual average temperature is 4°C and the annual precipitation 770 mm . Samples were taken at the end of the vegetation period, two months after the last fertilization with manure (swamp samples). Details on the sampling sites are given in Table 2.
From each site, four field replicates were taken within an area of approximately 100 m². Each field replicate was composed of nine cores (0–10 cm depth, Ø 2 cm) taken in a regular systematic grid strategy (within a square of 1.4 × 1.4 m). Immediately after the sampling, soil samples were sieved (< 2 mm) and stored at -20°C until use. Soil physico-chemical characterization (S1 Table) was done as described by Fernández-Delgado Juárez et al.
DNA extraction and NGS.
Total DNA was extracted using the NucleoSpin Soil kit (Macherey-Nagel, Düren, Germany) according to the manufacturer's protocol with minor adaptations: 0.25 g (fresh weight) of each of the twelve soil samples (stored at -20°C) was lysed in Lysis Buffer 1 (SL1) at room temperature using a horizontal shaker (MM 2000, Retsch, Haan, Germany) at an amplitude of 80% for 5 min. Elution was done in 2 × 40 μL of Buffer SE. DNA extracts were checked on a 1.5% agarose gel and DNA concentration was determined via QuantiFluor dsDNA dye measurement using the Quantus fluorometer (Promega, Mannheim, Germany). Extracts were stored in low-DNA binding tubes (Axygen, Corning, USA) at -20°C until sequencing. Samples were then subjected to NGS amplicon-sequencing (Microsynth AG, Balgach, Switzerland) on an Illumina MiSeq device using a 250 bp paired-end (v2) approach targeting the V4 region of the 16S SSU rRNA gene. The applied primer pair was 515f (5’-GTGCCAGCMGCCGCGGTAA-3’) and 806r (5’-GGACTACHVGGGTWTCTAAT-3’), as described by Caporaso et al. and recommended for soil in the Earth Microbiome Project.
Soil data analysis with CoMA.
Data analysis with CoMA (quality selection, OTU clustering, taxonomic assignment) was done as previously described for mock-13 but with a PQS cutoff ≥ 30. Rare OTUs with a sum of reads < 50 across all samples were excluded (corresponding to ~0.005% of all reads, as previously described in other studies [72, 73]) and data were subsampled to a depth of 92,539 reads after checking rarefaction curves (S1 Fig). This corresponds to the lowest read count among all samples and all analysis pipelines (CoMA, Mothur, QIIME). At this depth, the rarefaction curves of all samples reached a steady state and thus the sequencing effort was assumed to suffice. Diversity analysis was done using the four provided diversity indices: number of OTUs, Shannon-Wiener, Chao1 and Simpson. For cluster analysis, UPGMA (CoMA input: “average”) was chosen as method and Bray-Curtis distance as metric. This combination is suitable for grouping microbial communities and is therefore often used in microbial ecology studies.
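The four diversity measures named above can be illustrated for a single sample as follows. This is a minimal sketch using common textbook formulations (natural-log Shannon-Wiener, Simpson expressed as 1 − Σp², bias-corrected Chao1); the function names and example counts are hypothetical, and the variants implemented in CoMA may differ.

```python
import math

def richness(counts):
    """Number of OTUs with at least one read."""
    return sum(1 for c in counts if c > 0)

def shannon_wiener(counts):
    """Shannon-Wiener index H' = -sum(p_i * ln p_i); natural log is an assumption."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson(counts):
    """Simpson diversity as 1 - sum(p_i^2) (one of several conventions in use)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton OTUs."""
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return richness(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical OTU read counts of one perfectly even sample.
example = [10, 10, 10, 10]
indices = {
    "OTUs": richness(example),
    "Shannon-Wiener": shannon_wiener(example),
    "Simpson": simpson(example),
    "Chao1": chao1(example),
}
```

Because all four indices are computed from the OTU counts, any difference in OTU clustering between pipelines propagates directly into the diversity estimates.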
Soil data analysis with Mothur.
Soil samples were analyzed with Mothur following the same protocol and settings as described for mock-13. Removal of rare OTUs (sum of reads < 50) as well as data subsampling (92,539 reads) was done with CoMA as described above.
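The two post-processing steps applied identically to all OTU tables (removal of rare OTUs, then subsampling to a common depth) can be sketched as below. This is an illustration only, not the CoMA code: the table layout (OTU name mapped to a list of per-sample counts), the function names and the toy numbers are assumptions.

```python
import random

def drop_rare_otus(table, min_total=50):
    """Keep OTUs whose read sum across all samples is at least `min_total`."""
    return {otu: row for otu, row in table.items() if sum(row) >= min_total}

def subsample(counts, depth, seed=0):
    """Randomly draw `depth` reads without replacement from one sample."""
    if depth > sum(counts):
        raise ValueError("depth exceeds the number of reads in the sample")
    # One list entry per read, labelled with the index of its OTU.
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    out = [0] * len(counts)
    for i in random.Random(seed).sample(pool, depth):
        out[i] += 1
    return out

def rarefy_table(table, depth=None, seed=0):
    """Subsample every sample (column) to a common depth.
    By default the depth is the read count of the smallest sample."""
    otus = list(table)
    n_samples = len(next(iter(table.values())))
    columns = [[table[o][j] for o in otus] for j in range(n_samples)]
    if depth is None:
        depth = min(sum(col) for col in columns)
    rarefied = [subsample(col, depth, seed + j) for j, col in enumerate(columns)]
    return {o: [rarefied[j][i] for j in range(n_samples)] for i, o in enumerate(otus)}

# Toy table with two samples; OTU3 (4 reads in total) falls below the cutoff.
table = {"OTU1": [40, 30], "OTU2": [30, 45], "OTU3": [3, 1]}
filtered = drop_rare_otus(table, min_total=50)
rarefied = rarefy_table(filtered)
```

For the soil dataset the corresponding settings would be `min_total=50` and `depth=92539`.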
Principal component analysis (PCA) was done with the Canoco 5 software package for multivariate data analysis. Microbial families, excluding unassigned taxa, were used as dataset. Prior to analysis, data were centered and log-transformed. Canoco output files were thereafter plotted with Python. Descriptive statistics, analysis of variance (ANOVA) and post-hoc tests were done using SPSS Statistics 24 (IBM, New York, USA) and Excel 2016 (Microsoft, Washington, USA). Post-hoc tests were calculated using the Bonferroni correction. Venn diagrams and cluster analyses were created with in-house developed Python scripts. Dendrograms were created with the UPGMA method as bottom-up approach. The data points were compared either with the Bray-Curtis distance (to compare different soil microbiomes) or with the cosine distance (to compare different analysis platforms). Key microbial families were defined as those including > 5% of all assigned reads within each sample. The normality of the data was tested prior to analysis and the significance level was set at p ≤ 0.05.
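The dendrogram construction described above can be approximated with SciPy, whose method name “average” and metric names “braycurtis” and “cosine” match the terms used here. Whether the in-house scripts rely on SciPy is an assumption, and the sample labels and abundance profiles below are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

def upgma(profiles, metric="braycurtis"):
    """Return a UPGMA (average-linkage) matrix for row-wise abundance profiles."""
    distances = pdist(np.asarray(profiles, dtype=float), metric=metric)
    return linkage(distances, method="average")

# Hypothetical relative abundances (rows: samples, columns: taxa).
labels = ["F1", "F2", "GR1", "S1"]
profiles = [
    [0.50, 0.30, 0.20],
    [0.48, 0.32, 0.20],
    [0.10, 0.60, 0.30],
    [0.05, 0.25, 0.70],
]

tree = upgma(profiles, metric="braycurtis")   # use metric="cosine" to compare platforms
# no_plot=True returns the leaf order without requiring a plotting backend.
leaf_order = dendrogram(tree, labels=labels, no_plot=True)["ivl"]
```

Each row of `tree` records one merge (the two cluster indices, their distance and the size of the new cluster), which is the bottom-up procedure referred to in the text.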
Benchmarking with mock communities
Mock-13 comprised 21 bacterial strains that can be classified to 18 different genera: Acinetobacter, Actinomyces, Bacillus, Bacteroides, Clostridium, Deinococcus, Enterococcus, Escherichia, Helicobacter, Lactobacillus, Listeria, Neisseria, Porphyromonas, Propionibacterium, Pseudomonas, Rhodobacter, Staphylococcus (two strains) and Streptococcus (three strains). All strains were equally distributed and thus each genus corresponded to 4.76% of the total abundance, except for Staphylococcus (9.52%) and Streptococcus (14.29%). Data analysis with CoMA revealed 16 out of 18 genera, similar to Mothur and QIIME (Table 3). QIIME2 was able to find all genera except for Propionibacterium, which was not found with any of the pipelines (Fig 2). This, however, is a known issue and several other studies were not able to detect Propionibacterium in the course of mock community tests. The reason may be mismatches close to the 3’ terminus of the 16S rRNA primer, leading to a low abundance or even a complete absence of this genus. Apart from Propionibacterium, Actinomyces (CoMA), Bacillus (Mothur) and Clostridium (QIIME) were not consistently detected. When looking at the average deviation per taxon, all pipelines performed similarly well. With a mean deviation of 3.11%, Mothur was most precise, closely followed by CoMA (3.27%) and the two QIIME versions (QIIME: 3.50%, QIIME2: 3.45%). On the other hand, Mothur yielded the highest proportion of erroneously assigned sequences (11.85%), followed by QIIME (9.20%). Both CoMA and QIIME2 showed error rates below 1% (0.64% and 0.83%, respectively). Cluster analysis revealed the smallest cosine distance between CoMA and QIIME2 (0.06; S2 Fig). Mothur and QIIME formed a second cluster, showing a distance of 0.15. All pipelines were more closely related to each other than to the set point, which showed an overall distance of 0.22.
The set point (SP) depicts the theoretically expected distribution and serves as reference. The dataset comprised 18 bacterial genera, targeted with 16S rRNA amplicon sequencing.
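The set point for mock-13 follows directly from the strain counts given above, and a pipeline result can be compared against it as sketched below. The sketch assumes that the “average deviation per taxon” is the mean absolute difference in percentage points, which is not stated explicitly in the text; the function names and the example observation are hypothetical.

```python
import math

# Strain counts per genus in mock-13 (21 strains, 18 genera), as listed in the text.
strains = {genus: 1 for genus in [
    "Acinetobacter", "Actinomyces", "Bacillus", "Bacteroides", "Clostridium",
    "Deinococcus", "Enterococcus", "Escherichia", "Helicobacter", "Lactobacillus",
    "Listeria", "Neisseria", "Porphyromonas", "Propionibacterium", "Pseudomonas",
    "Rhodobacter"]}
strains["Staphylococcus"] = 2
strains["Streptococcus"] = 3

total = sum(strains.values())
# Expected relative abundance (%) of every genus under an even strain distribution.
set_point = {genus: 100.0 * n / total for genus, n in strains.items()}

def mean_abs_deviation(observed, expected):
    """Mean |observed - expected| over all expected genera (percentage points)."""
    return sum(abs(observed.get(g, 0.0) - e) for g, e in expected.items()) / len(expected)

def cosine_distance(a, b):
    """1 - cosine similarity of two abundance profiles given as dicts."""
    keys = sorted(set(a) | set(b))
    va = [a.get(k, 0.0) for k in keys]
    vb = [b.get(k, 0.0) for k in keys]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return 1.0 - dot / norm

# Hypothetical result in which only Propionibacterium is missed.
observed = dict(set_point, Propionibacterium=0.0)
deviation = mean_abs_deviation(observed, set_point)
distance = cosine_distance(observed, set_point)
```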
The second 16S mock community included, beside bacterial, also archaeal strains. In total, mock-16 comprised 59 strains that can be assigned to 46 different genera (archaea: 8, bacteria: 38). All strains were equally distributed, resulting in the following abundances at genus level: Acidobacterium (1.69%), Akkermansia (1.69%), Anaerocellum (1.69%), Archaeoglobus (1.69%), Bacteroides (3.39%), Bordetella (1.69%), Burkholderia (1.69%), Caldicellulosiruptor (1.69%), Chlorobaculum (1.69%), Chlorobium (5.08%), Chloroflexus (1.69%), Clostridium (1.69%), Deinococcus (1.69%), Desulfovibrio (3.39%), Dictyoglomus (1.69%), Enterococcus (1.69%), Erwinia (1.69%), Fusobacterium (1.69%), Gemmatimonas (1.69%), Herpetosiphon (1.69%), Hydrogenobaculum (1.69%), Ignicoccus (1.69%), Leptothrix (1.69%), Methanocaldococcus (1.69%), Methanococcus (3.39%), Nanoarchaeum (1.69%), Nitrosomonas (1.69%), Nostoc (1.69%), Pelodictyon (1.69%), Persephonella (1.69%), Porphyromonas (1.69%), Pyrobaculum (3.39%), Pyrococcus (1.69%), Rhodopirellula (1.69%), Rhodospirillum (1.69%), Ruegeria (1.69%), Salinispora (3.39%), Shewanella (3.39%), Sulfitobacter (3.39%), Sulfolobus (1.69%), Sulfurihydrogenibium (3.39%), Thermoanaerobacter (1.69%), Thermotoga (5.08%), Thermus (1.69%), Treponema (3.39%) and Zymomonas (1.69%). With CoMA, 43 out of 46 genera were correctly assigned (Table 3). QIIME detected 44, whereas Mothur and QIIME2 found 40 and 34, respectively. None of the pipelines was able to reveal Rhodospirillum, while Clostridium (Mothur) and Ruegeria (QIIME) were only found with one of the four applied analysis tools (Fig 3). All other genera were at least detected by two pipelines, 32 of them by all of them. CoMA and QIIME turned out to be most precise for this dataset, resulting in an average deviation of 1.22% per taxon. Accuracy rates of Mothur and QIIME2 were 1.52% and 2.09%, respectively. 
Comparing the proportion of wrongly assigned or unassigned genera, CoMA was most efficient (3.83%), followed by QIIME (3.88%). Mothur and QIIME2 both revealed more than one erroneous genus out of ten tested sequences (10.53% and 16.31%, respectively). Cluster analysis based on the mock-16 results revealed a cosine distance < 0.01 between CoMA and QIIME, indicating a very high similarity between these two pipelines (S3 Fig). With a cosine distance of 0.08, Mothur was closer related to them than to QIIME2. However, as seen before for mock-13, all applied pipelines were much closer related to each other (maximum distance: 0.15) than to the set point (overall distance: 0.30).
The set point (SP) depicts the theoretically expected distribution and serves as reference. The dataset comprised 46 archaeal and bacterial genera, targeted with 16S rRNA amplicon sequencing.
The ITS dataset (mock-26) included 11 fungal strains, assigned to 11 different genera: Amanita, Archaeorhizomyces, Cortinarius, Debaryomyces, Fusarium, Hymenoscyphus, Melampsora, Mucor, Rhizina, Sistotrema, Umbelopsis. All taxa were equally distributed, resulting in an abundance of 9.09% per genus. Contrary to mock-13 and mock-16, all genera of this dataset were detected with each of the four pipelines (Table 3, Fig 4). QIIME2 and Mothur showed the highest accuracy with an average deviation of 2.05% and 2.09% per taxon, respectively, followed by CoMA (2.31%) and QIIME (4.01%). QIIME2 was also superior in terms of wrongly assigned or unassigned sequences at genus level for this dataset (0.44%), slightly lower than CoMA with an error rate of 0.71%. QIIME and Mothur followed with distinctly higher error rates of 2.62% and 6.17%, respectively. Cluster analysis revealed the smallest cosine distance between the results computed with CoMA and QIIME2 (< 0.01; S4 Fig). Mothur was slightly more distant (0.02) but still closer to these two than to the set point, which showed an overall distance of 0.05 to the three aforementioned pipelines. For the mock-26 dataset, QIIME was far from the set point as well as from the other analysis tools (0.10).
The set point (SP) depicts the theoretically expected distribution and serves as reference. The dataset comprised 11 fungal genera, targeted with ITS amplicon sequencing.
Taking all mock communities together, CoMA showed the overall highest accuracy with an average deviation of 1.89% per taxon (Table 3), compared to Mothur (1.99%), QIIME (2.17%) and QIIME2 (2.41%). Looking at the hierarchical cluster analysis, CoMA and QIIME2 showed the smallest cosine distance (0.08; S5 Fig). Mothur followed with a distance of 0.11 and QIIME with a distance of 0.12. All four pipelines were much closer related to each other than to the estimated values (overall cosine distance: 0.18).
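The overall deviations reported above appear consistent with pooling all taxa of the three mock communities, i.e. weighting each per-dataset mean by its number of genera (18, 46 and 11). This is an inference from the reported numbers rather than a procedure stated in the text; the sketch below merely shows that such a weighting reproduces the overall values to within about 0.02 percentage points and preserves the ranking.

```python
# Number of expected genera and reported mean deviation (%) per dataset and tool.
genera = {"mock-13": 18, "mock-16": 46, "mock-26": 11}
mean_dev = {
    "CoMA":   {"mock-13": 3.27, "mock-16": 1.22, "mock-26": 2.31},
    "Mothur": {"mock-13": 3.11, "mock-16": 1.52, "mock-26": 2.09},
    "QIIME":  {"mock-13": 3.50, "mock-16": 1.22, "mock-26": 4.01},
    "QIIME2": {"mock-13": 3.45, "mock-16": 2.09, "mock-26": 2.05},
}

def pooled_mean(per_dataset):
    """Mean deviation over all taxa, weighting datasets by their genus count."""
    n_total = sum(genera.values())
    return sum(per_dataset[d] * n for d, n in genera.items()) / n_total

overall = {tool: pooled_mean(values) for tool, values in mean_dev.items()}
ranking = sorted(overall, key=overall.get)   # most accurate tool first
```

An unweighted mean of the three per-dataset values would give noticeably higher numbers (e.g. about 2.27% for CoMA), which is why the weighted reading is assumed here.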
Soil dataset as practical example
In addition to mock communities, also real soil samples were analyzed with CoMA, Mothur and QIIME. A detailed description of the microbiological characteristics of the investigated soil samples determined with CoMA can be found in the supplementary material (S1 File). In the main article, we are focusing on the comparison between the three different analysis tools.
Data analyses with Mothur and QIIME resulted in the same pattern of Shannon-Wiener diversity (H’, Fig 5), Chao1 diversity, Simpson diversity (D) and OTU richness (S6 Fig) as seen for the CoMA pipeline (S1 File). However, the level differed between the applied analysis tools. Mothur generally yielded higher diversity (H’: p = 0.002, Chao1: p < 0.001, D: p = 0.009) and OTU richness (p = 0.004) than CoMA, irrespective of the diversity calculator and the habitat. QIIME data were statistically distinguishable neither from CoMA nor from Mothur, except for the Chao1 index, where QIIME resulted in a higher diversity than CoMA (p = 0.029). When looking at each habitat individually, Shannon-Wiener diversity reached the highest value with Mothur, followed by QIIME and CoMA in all cases.
Four replicates for each habitat are shown. Letters indicate significant differences across the analysis tools for each habitat. F = forest. GR = grassland. S = swamp.
Principal component analysis based on microbial families clustered the CoMA and QIIME samples in close proximity to each other (Fig 6). Data points from grassland and swamp were located in the third quadrant, showing minor spatial deviation, mainly along the PCA Axis 2. The first two PCA axes explained 74% of the total variation within the taxonomic data (53% and 21% for PCA axis 1 and 2, respectively). Forest samples showed a greater distance to the two other habitats and were located in the second quadrant. Samples analyzed with Mothur were located in the positive direction of PCA Axis 1 and thus clearly separated from the data points of CoMA and QIIME. Data points representing the three habitats showed, however, a similar pattern along PCA Axis 2 as those obtained by the other pipelines.
The color code indicates the applied data analysis tool: CoMA, Mothur and QIIME. Q1–Q4 = quadrants of the coordinate system.
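The ordination itself was computed with Canoco 5. For readers who want to retrace the pre-processing and the share of variation explained per axis, a plain NumPy sketch is given below. It assumes a log(x + 1) transformation followed by centering of each family, which may differ from the Canoco settings actually used; the function name and the input matrix are hypothetical.

```python
import numpy as np

def pca(abundance, n_components=2):
    """PCA via singular value decomposition.
    Rows are samples, columns are microbial families."""
    x = np.log1p(np.asarray(abundance, dtype=float))   # log(x + 1) transformation
    x = x - x.mean(axis=0)                              # center every family
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)                 # fraction of variation per axis
    scores = u[:, :n_components] * s[:n_components]     # sample coordinates
    return scores, explained[:n_components]

# Hypothetical family-level read counts for four samples.
counts = [
    [120, 30, 5],
    [110, 35, 8],
    [20, 90, 40],
    [15, 60, 95],
]
scores, explained = pca(counts)
```

The entries of `explained` correspond to the percentages quoted for PCA axis 1 and axis 2 when multiplied by 100.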
To dig deeper into the differences/similarities between the three applied analysis pipelines, numbers of detected taxa were determined for each taxonomic level (Fig 7). Data analysis with CoMA resulted in the highest number of classes and families (114 and 281, respectively) and QIIME of phyla and genera (45 and 415, respectively). Mothur analysis always resulted in the lowest numbers of taxa, irrespective of the taxonomic level. The analysis showed that CoMA and QIIME detected a similar taxa composition at each level. Mothur, however, did not reveal common taxa at a comparable rate but detected several taxa that were neither found with CoMA nor with QIIME. At phylum level, 39 out of 54 taxa (72%) were found with both CoMA and QIIME. Out of these 39 common taxa, 19 were also detected with Mothur (35%). CoMA and QIIME shared 100 (61%) and 98 (60%) out of all detected 164 classes, respectively, while Mothur found only 27 shared classes (16%) with the other pipelines. The same trend was observed at order level where CoMA and QIIME detected 183 (68%) and Mothur 50 (19%) out of 270 taxa, and at family level where CoMA found 243 (70%), QIIME 245 (70%) and Mothur 88 (25%) out of 349 families. However, at genus level, the percentage of taxa detected with all three of the analysis tools was much higher compared to the other taxonomic levels, except for phylum level. CoMA revealed 317 (61%), QIIME 339 (65%) and Mothur 159 (31%) out of 518 genera. The percentage of taxon overlap for the three analysis tools was higher at phylum level (35% of all phyla), followed by genus (26%) and family level (24%). At order and class levels, the percentages of shared taxa were much lower and accounted for only 18% and 15%, respectively. CoMA and QIIME overlapped in 83%, 78%, 82%, 80% and 72% of all commonly found taxa at phylum, class, order, family and genus level, respectively.
Data include all of the three investigated habitats (forest, grassland, swamp).
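The overlap figures above can be derived from plain set operations on the taxon names found by each tool. The sketch below uses hypothetical phylum lists; percentages are expressed relative to all taxa detected by any tool, which matches the way the numbers are reported (e.g. 39 out of 54 phyla).

```python
def venn3(a, b, c):
    """Sizes of the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "A only": len(a - b - c),
        "B only": len(b - a - c),
        "C only": len(c - a - b),
        "A and B only": len((a & b) - c),
        "A and C only": len((a & c) - b),
        "B and C only": len((b & c) - a),
        "A, B and C": len(a & b & c),
    }

def shared_fraction(shared_by, all_groups):
    """Taxa common to `shared_by`, relative to all taxa found in `all_groups`."""
    common = set.intersection(*(set(g) for g in shared_by))
    universe = set().union(*all_groups)
    return len(common) / len(universe)

# Hypothetical phylum lists per tool.
coma = {"Proteobacteria", "Acidobacteria", "Firmicutes", "Chloroflexi"}
qiime = {"Proteobacteria", "Acidobacteria", "Firmicutes", "Nitrospirae"}
mothur = {"Proteobacteria", "Acidobacteria", "Thaumarchaeota"}

regions = venn3(coma, qiime, mothur)
coma_qiime = shared_fraction([coma, qiime], [coma, qiime, mothur])
all_three = shared_fraction([coma, qiime, mothur], [coma, qiime, mothur])
```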
To compare the general output of the pipelines and evaluate CoMA as an alternative analysis tool, key families (i.e. abundance > 5% of reads) of the three habitats were analyzed. Within the forest samples, all families determined with QIIME were also found with CoMA (11 out of 16, Table 4). Eight out of these eleven families were also detected with Mothur; however, Chthoniobacteraceae, Nitrosomonadaceae and Solibacteraceae were not found with Mothur. On the other hand, Mothur resulted in five unique families (Bradyrhizobiaceae, Comamonadaceae, Nitrososphaeraceae, Oxalobacteraceae, Planctomycetaceae) that were not caught with either of the other two tools. In grassland, nine out of 18 key families were identified with all pipelines. CoMA and QIIME determined again the same taxa (14 out of 18), but did not expose Bradyrhizobiaceae, Comamonadaceae, Oxalobacteraceae and Planctomycetaceae, which were found with Mothur. Mothur, in return, did not detect Burkholderiaceae, Chthoniobacteraceae, Nitrosomonadaceae, Solibacteraceae and Xanthobacteraceae in the analyzed soil samples. The same occurred in the swamp soil. All key families detected by QIIME were also identified with CoMA (15 out of 19). Ten of them were equally found with Mothur, except for Burkholderiaceae, Chthoniobacteraceae, Nitrosomonadaceae, Solibacteraceae and vadinHA17 (order: Bacteroidales). On the other hand, Mothur determined four families (Bradyrhizobiaceae, Comamonadaceae, Oxalobacteraceae, Planctomycetaceae) that were neither found with CoMA nor with QIIME. Six key families were identified with all of the three analysis tools irrespective of the soil type: Anaerolineaceae, Chitinophagaceae, Flavobacteriaceae, Gaiellaceae, Mycobacteriaceae and Sphingobacteriaceae.
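The selection and comparison of key families can be sketched as follows. The threshold of > 5% of assigned reads is taken from the Methods; applying it per sample and then comparing the resulting sets between tools is an assumption about the exact procedure, and all counts below are hypothetical.

```python
def key_families(family_counts, threshold=0.05):
    """Families whose share of the assigned reads of one sample exceeds `threshold`.
    `family_counts` must already exclude unassigned reads."""
    total = sum(family_counts.values())
    return {fam for fam, n in family_counts.items() if n / total > threshold}

def compare_tools(by_tool):
    """Return the families found by all tools and those unique to a single tool."""
    shared = set.intersection(*by_tool.values())
    unique = {}
    for tool, families in by_tool.items():
        others = set().union(*(f for t, f in by_tool.items() if t != tool))
        unique[tool] = families - others
    return shared, unique

# Hypothetical family-level read counts of one sample, analyzed with two tools.
coma_counts = {"Chitinophagaceae": 300, "Solibacteraceae": 100,
               "Gaiellaceae": 40, "Flavobacteriaceae": 560}
mothur_counts = {"Chitinophagaceae": 250, "Bradyrhizobiaceae": 150,
                 "Flavobacteriaceae": 600}

by_tool = {"CoMA": key_families(coma_counts), "Mothur": key_families(mothur_counts)}
shared, unique = compare_tools(by_tool)
```

In this toy case Gaiellaceae (4% of reads) falls just below the threshold, illustrating how a family can drop out of the comparison even though it was detected.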
Irrespective of the soil type, CoMA and QIIME identified more taxa than Mothur (p = 0.024, p = 0.032), except for the kingdom and species level, where all of the three pipelines classified 100% and 0–2%, respectively (Table 5). At phylum level, CoMA and QIIME assigned all reads to known taxa whereas Mothur failed in 7–13%. Generally, the classification percentage decreased with lower taxonomic levels (p < 0.001). CoMA and QIIME behaved similarly at all taxonomic levels; however, CoMA identified slightly more orders and families when grassland was analyzed (+5% and +3%, respectively; p < 0.001).
Validation of CoMA
The main purpose of this work was to validate CoMA by comparing the results with those of already established tools. Mothur and QIIME, the two currently most popular analysis tools for amplicon-sequencing data, were chosen as reference for both test assays: a benchmark test using three different mock communities, covering different variable regions, sequence lengths and abundances, as well as the analysis of a real soil dataset. For the benchmark test, QIIME2 was included to compare the OTU-based results also with an ASV approach. ASVs are currently proposed as an alternative to OTUs [28, 78] and we believe that a proper validation of CoMA would not be possible without considering this development. For the real dataset, however, analyses focused solely on OTU-based strategies, which are still predominant in soil microbiological studies.
Benchmarking with mock communities.
The benchmark test showed that all tested pipelines (CoMA, Mothur, QIIME, QIIME2) were able to depict the taxonomy in each of the three mock communities in a proper way. Irrespective of the dataset, taxa were found with a high accuracy and a relatively low error rate. However, when looking closer, differences in performance were found among the tools depending on the dataset. While the CoMA output was closely related to that of QIIME2 in mock-13 and mock-26, a considerable distance between these two tools was seen for mock-16, where CoMA was almost identical with QIIME. This indicates that the accuracy of an analysis highly depends on the dataset and that no general conclusion for the best performing analysis tool can be drawn when not considering sample specificities. Decisive characteristics of input data may include the applied primer system (e.g. 16S rRNA, 18S rRNA, ITS, functional genes), the variable region (e.g. V3 or V4 for 16S rRNA), the sequence length, as well as the taxonomic structure of the sample (e.g. species richness, evenness). Even though a general conclusion is difficult, CoMA seemed to be the overall most consistent platform in this test set since it did not show any severe inaccuracies for any of the three datasets, while the other tools did so for at least one mock community.
When comparing the OTU-based platforms (CoMA, Mothur, QIIME) with QIIME2 computing ASVs, no clear trend was observed. For mock-26, the fungal ITS dataset, QIIME2 performed particularly well and was the most accurate tool for this dataset in terms of mean per-taxon deviation. Overall, however, QIIME2 was ranked last, which may be explained by the lowest overall genus detection rate (< 83%) among all platforms within the test set, especially seen for mock-16. Here, QIIME2 was not able to reveal about 26% (12 out of 46) of all bacterial/archaeal genera. These results raise the question whether ASVs are truly preferable over OTUs, although it needs to be emphasized that this study was not designed to compare OTUs and ASVs in all their aspects and hence no general conclusion can be drawn. Callahan et al. suggested a general replacement of OTUs in favor of ASVs for all marker-gene analyses and argued with a higher reproducibility, a finer sequence resolution and the advantage of comparability between different studies. On the other hand, existing sequencing technologies often lack precision and hence an accurate resolution of exact sequences appears doubtful. Choosing the clustering strategy is a tradeoff between lumping taxa on the one hand (OTUs) and splitting them on the other hand (ASVs), and the decision seems to depend on the analyzed dataset, particularly in terms of the applied primer system and the length of the inner primer region. This is also supported by Edgar, who tested various OTU cutoff levels and found optima of 99% for full-length 16S rRNA, and 100% for short hypervariable regions (e.g. V4 of the 16S rRNA). For the next years, we expect a co-existence of OTUs (probably with a 99% cutoff instead of 97%) and ASVs, always depending on the data to analyze. However, as soon as sequencing technologies reach the next level in terms of precision, ASVs may be favorable and the field should clearly move in this direction.
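The genus detection rates referred to above can be retraced from the numbers of detected genera reported in the Results. The sketch pools all 75 expected genera of the three mock communities, which reproduces the value of just under 83% for QIIME2; the pooling is an interpretation of the text, and the mock-13 counts for Mothur and QIIME are taken as 16, as implied by the statement that they performed similarly to CoMA.

```python
# Expected and detected genera per mock community, as reported in the Results.
expected = {"mock-13": 18, "mock-16": 46, "mock-26": 11}
detected = {
    "CoMA":   {"mock-13": 16, "mock-16": 43, "mock-26": 11},
    "Mothur": {"mock-13": 16, "mock-16": 40, "mock-26": 11},
    "QIIME":  {"mock-13": 16, "mock-16": 44, "mock-26": 11},
    "QIIME2": {"mock-13": 17, "mock-16": 34, "mock-26": 11},
}

def pooled_detection_rate(found):
    """Percentage of all expected genera (pooled over datasets) that were detected."""
    return 100.0 * sum(found.values()) / sum(expected.values())

rates = {tool: pooled_detection_rate(found) for tool, found in detected.items()}

# Genera missed by QIIME2 in mock-16, absolute and as a share of the 46 expected.
missed = expected["mock-16"] - detected["QIIME2"]["mock-16"]
missed_share = 100.0 * missed / expected["mock-16"]
```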
Irrespective of the dataset, the results of all pipelines were more closely related to each other than to the expected values (with one exception: QIIME at mock-26). This indicates that the inaccuracies in this benchmark test were, to a certain degree, caused by the sequencing process rather than by the applied analysis tool. Several other authors who reported a systematic distortion of NGS data [80, 81] drew the same conclusion. Biases occur all over the process, starting with the extraction of DNA, the PCR, and the sequencing procedure itself. According to the authors, the steps of DNA extraction and PCR are particularly susceptible to errors, which further accumulate until the end of the sequencing process. This results in a severe bias towards some strains, while others are neglected. These aspects always need to be considered when using amplicon sequencing and we strongly suggest including samples with a known composition (= mock community) in the analysis in order to estimate the degree of introduced inaccuracy.
In all test categories, CoMA performed well in the benchmark test based on mock communities. It seems to provide proper results, irrespective of the characteristics of the analyzed dataset. The CoMA pipeline is therefore suggested as a highly accurate alternative to the established NGS analysis platforms for any kind of input data. Future CoMA releases will also include the possibility to select the OTU cutoff level (currently: 97%) and to generate ASVs rather than OTUs (using the UNOISE3 algorithm), making the pipeline more flexible and compatible with different opinions in the field, no matter in which direction sequence analysis is heading in the future.
Platform comparison using real soil data.
Comparing the three soils, all of the three pipelines used for the soil dataset (CoMA, Mothur, QIIME) generally showed the same trends concerning microbial diversity. The overall level of diversity, however, was different among these analysis tools, which can be explained by the way diversity and richness are typically calculated. Most algorithms, also those in CoMA, are based on OTUs. Therefore, OTU clustering and subsequent data post-processing steps are crucial and may significantly affect diversity calculations, as well as all other downstream analyses. Generally, there are three different strategies for clustering sequences into OTUs: closed-reference OTU picking, de novo assembling and open-reference OTU picking. Closed-reference OTU picking first assigns sequences to a taxonomic database and thereafter clusters them based on the taxonomic classification. This process is fast, but some problems arise when taxonomic assignment is ambiguous or when no similar entry can be found in the reference database. De novo assembling reverses this approach and clusters sequences in a first step, followed by the taxonomic assignment. This strategy is more robust, especially when analyzing complex datasets, but also tends to be more intense in terms of computational effort. Open-reference clustering is a hybrid of the above approaches. Similar to closed-reference OTU picking, sequences are first assigned to a reference database and unclassified reads are thereafter assembled de novo. This third option combines advantages of both approaches; however, it leads to two different OTU definitions within one single strategy.
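To make the de novo idea concrete, the toy example below clusters sequences greedily around abundance-sorted centroids at a 97% identity threshold. It is deliberately simplified (equal-length sequences, position-wise identity, no chimera handling) and is neither the UPARSE nor the OptiClust algorithm; all names and sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy example expects equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seq_counts, threshold=0.97):
    """Greedy de novo clustering.
    Sequences are processed from most to least abundant; each one joins the first
    centroid it matches at >= `threshold` identity, or founds a new cluster."""
    clusters = []   # list of (centroid, members)
    for seq in sorted(seq_counts, key=lambda s: (-seq_counts[s], s)):
        for centroid, members in clusters:
            if identity(seq, centroid) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters

# Hypothetical unique sequences (length 100) with their read counts.
base = "A" * 100
close_variant = "A" * 98 + "CC"         # 98% identical to `base`
distant_variant = "A" * 90 + "C" * 10   # 90% identical to `base`
otus = greedy_cluster({base: 10, close_variant: 3, distant_variant: 5})
```

Lowering or raising `threshold` changes how many clusters are formed, which illustrates why clustering settings affect OTU-based richness and diversity estimates.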
CoMA as well as Mothur and QIIME all used de novo assembling for clustering of the sequencing data at a 97% similarity level. However, each tool implements individual algorithms or rather complete tool packages. CoMA and QIIME both use USEARCH (in fact UPARSE) for OTU clustering, albeit in different versions (CoMA: USEARCH 10.0.240, QIIME: USEARCH 6.1.544). Mothur uses OptiClust instead. Assuming that both USEARCH versions used in CoMA and QIIME are largely comparable, OptiClust in particular may have caused the observed differences with regard to Mothur. This is supported by our data, where CoMA and QIIME were generally more closely related to each other than to Mothur. Beyond the fundamental OTU clustering, each pipeline includes various steps for quality filtering, including data demultiplexing, data denoising and chimera removal. All these steps vary among the tools and may significantly affect the results of data analyses. Subsequent removal of rare OTUs as well as data subsampling would lead to different outputs as well. However, this can be disregarded for the present comparison since all OTU-tables were treated with the same algorithms.
Data analysis with Mothur resulted in different microbial compositions in the three soil habitats even though most of the key microbes were found with all of the three pipelines. Apart from the different algorithms for OTU clustering and data quality filtering, the output of these analyses might be strongly influenced by the aligning tool/mechanism and the reference database used for taxonomic assignment. In all cases, alignment was done via different tools: BLAST in CoMA, RDP in Mothur and UCLUST in QIIME, following the recommendations in the respective SOPs. However, it should be borne in mind that all of the three pipelines also offer alternative aligners and as such, we are not addressing this aspect in detail. Taking a closer look at the results revealed that most of the conflicting taxa showed particularly low abundances. For the composition of low-abundant taxa, removal of rare reads has a particularly high impact. The same taxon may be slightly above the cutoff in one analysis, whereas it is slightly below in the other. Consequently, a taxon that was in fact determined with both tools may appear in just one of them, resulting in a decreased coverage. We assume that the applied reference database is another important driving factor for the varying results between CoMA/QIIME and Mothur. Both CoMA and QIIME used the newest distribution of SILVA (SLV_132). Taxonomic classification with Mothur, however, was done using the current release of the RDP trainset (Version 16) as described in the SOP. This explains the differences found in terms of unclassified taxa since SILVA includes significantly more sequences than RDP, resulting in a better coverage. Nevertheless, this does not explain the mismatches when comparing microbial core families. These taxa differed not only at family level but also at higher ranks, up to class or phylum level.
Taken together, all comparisons indicate that the CoMA output is generally well comparable to that of Mothur and QIIME, particularly when looking at the microbial key players. Regarding less abundant taxa, differences to Mothur were found.
Assets of the CoMA pipeline
The main advantage of CoMA is the intuitive and user-friendly operation that is supported by the graphical user interface. It allows entry-level users to perform amplicon-sequencing data analysis and to obtain solid results without the need for time- and effort-consuming training. Beginners are further supported by the detailed CoMA manual and a systematic tutorial on basis of a provided simple model dataset. All required steps and input parameters are explained in detail and the small size of the dataset (< 1250 reads per sample) allows for a fast progress without long downtimes. Advanced users can utilize the provided high degree of automation to analyze huge amounts of data efficiently. Moreover, they find various settings for optimizing the workflow and adapting the analysis specifically to their needs. CoMA also offers the possibility to stop an uncompleted analysis at any point to continue later or to recalculate parts of the workflow. This allows for the improvement of a former run by simply adjusting decisive input parameters without the need for a complete recalculation. As an open-source tool package, CoMA’s source code may even be adapted or expanded according to the needs of ambitious users.
CoMA offers a huge variety of functions, starting with data pre-processing and ending with several useful tools for data visualization and statistical appraisal. Data pre-processing follows a state-of-the-art procedure for amplicon-sequencing data: merging of paired-end reads, trimming of primers/barcodes/adapters and quality control with the possibility to filter bad reads. OTU clustering (and in the future ASV clustering), taxonomic assignment and data post-processing are also done according to the current state of knowledge. A particular focus of CoMA was laid on graphics that meet both esthetic and scientific standards, appropriate for publication. The graphics can be created in many different file formats and pixel depths in order to fulfill specific journal requirements. Taxa plots can be created as bar charts but also as heatmaps. The latter are convenient for the depiction of taxonomic data and we strongly encourage their use, particularly for big datasets. Venn plots are another option for the illustrative comparison of taxonomic data. Two or three specifically selected groups can be compared with each other in terms of shared taxa, available for each taxonomic level. In future CoMA releases, the possibility for comparing four or five groups shall be implemented. However, Venn diagrams including too many groups tend to be confusing and lose the clarity of the original idea. Alpha- and beta diversity (ordination, cluster analysis) can be calculated using multiple methods and metrics, allowing the user to choose a suitable combination for the respective dataset.
CoMA also provides the results as text files in standardized formats, which can be used for additional or more sophisticated analyses by advanced users. This may include for instance statistical analyses with software packages such as R. Moreover, taxonomic trees can be created using a text file in NEWICK format. This may either end in a basic depiction of the tree structure (e.g. FastTree) or in highly sophisticated and complex circular plots (e.g. GraPhlAn), combining the taxonomic tree with circular bar charts or heatmaps. The CoMA output can also be used for the determination of overrepresented taxa (e.g. LEfSe, IndVal), the creation of taxonomic networks (e.g. Cytoscape) or the construction of pseudo-metagenomic data to reveal the connection between taxonomy and functional genes (e.g. PICRUSt, Tax4Fun, Piphillin). These are only some examples for further usage, and the choice depends on the research question and the dataset.
CoMA can be run on every common computer operating system (e.g. Linux, Windows, macOS). Currently, three different options for installation are available: a virtual appliance (which can be imported with tools like VMware Workstation, Oracle VM VirtualBox or Parallels Desktop for Mac), a Singularity image, and a direct Linux installer. Each option provides specific advantages and users should select the one most suitable for their needs. The CoMA manual describes each installation option in detail. When choosing the Singularity option, CoMA can be easily used on high performance computing (HPC) systems. Working on HPC clusters is becoming more and more important in modern NGS data analysis, and applying CoMA here combines easy and intuitive operation with tremendous computing power.
In conclusion, CoMA offers several advantages over currently used pipelines, especially for beginners but also for advanced users. These include, in particular, the intuitive and user-friendly operation, the flexible usage on any operating system, and graphical output in publication-ready form. We strongly recommend CoMA for convenient and efficient NGS data analysis to researchers in both the biological and medical fields. CoMA will be updated regularly to maintain its performance and to implement additional utilities (e.g. ASV support). Each update will include the newest versions of the taxonomic databases available at the release date to support the best possible results.
S1 Fig. Rarefaction-curve analysis based on operational taxonomic units (OTUs) created with the CoMA pipeline to evaluate the sequencing effort.
F = forest. GR = grassland. S = swamp.
S2 Fig. Hierarchical cluster analysis showing the cosine similarity between four different analysis platforms and the set point (SP) for the mock-13 dataset.
The dendrogram was calculated with the UPGMA method (unweighted pair group method with arithmetic mean) as bottom-up approach.
S3 Fig. Hierarchical cluster analysis showing the cosine similarity between four different analysis platforms and the set point (SP) for the mock-16 dataset.
The dendrogram was calculated with the UPGMA method (unweighted pair group method with arithmetic mean) as bottom-up approach.
S4 Fig. Hierarchical cluster analysis showing the cosine similarity between four different analysis platforms and the set point (SP) for the mock-26 dataset.
The dendrogram was calculated with the UPGMA method (unweighted pair group method with arithmetic mean) as bottom-up approach.
S5 Fig. Hierarchical cluster analysis showing the cosine similarity between four different analysis platforms and the set point (SP) for all mock datasets (mock-13, mock-16, mock-26).
(A) Abundance (OTU, operational taxonomic unit), (B) Simpson diversity (D) and (C) Chao1 diversity of three different soils after sequencing data analysis with CoMA, Mothur and QIIME. Four replicates are shown for each habitat. F = forest. GR = grassland. S = swamp.
S1 Table. Physico-chemical soil properties of three sites in Gschnitz valley, Tyrol (Austria) investigated to validate the CoMA pipeline.
The table shows means ± standard deviation (n = 4). DM = dry matter. VS = volatile solids. EC = electrical conductivity. WHC = water holding capacity.
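The hierarchical cluster analyses in S2 to S5 Fig combine cosine similarity with UPGMA. A minimal sketch of that combination is given below, assuming that the distance between two profiles is taken as 1 minus their cosine similarity and that only the tree topology (without branch lengths) is needed. The labels A, B and C stand for arbitrary analysis platforms, and all abundance values are hypothetical rather than results from the benchmark.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two abundance vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def upgma(profiles):
    """Bottom-up average-linkage clustering; returns a Newick topology."""
    labels = list(profiles)
    # Pairwise distances between the original profiles (1 - cosine similarity).
    dist = {}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            dist[frozenset((a, b))] = 1.0 - cosine_similarity(profiles[a], profiles[b])

    def average_distance(members_1, members_2):
        # UPGMA: unweighted mean over all pairs of original members.
        pairs = [(x, y) for x in members_1 for y in members_2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    # Each active cluster is (newick string, list of member labels).
    active = [(name, [name]) for name in labels]
    while len(active) > 1:
        best = None
        for i in range(len(active)):
            for j in range(i + 1, len(active)):
                d = average_distance(active[i][1], active[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        left, right = active[i], active[j]
        merged = ("(%s,%s)" % (left[0], right[0]), left[1] + right[1])
        active = [c for k, c in enumerate(active) if k not in (i, j)] + [merged]
    return active[0][0] + ";"


# Hypothetical genus-level relative abundances (SP = set point).
profiles = {
    "SP": [0.25, 0.25, 0.25, 0.25],
    "A": [0.24, 0.26, 0.25, 0.25],
    "B": [0.30, 0.20, 0.25, 0.25],
    "C": [0.10, 0.40, 0.30, 0.20],
}
tree = upgma(profiles)
print(tree)
```

The returned string is in NEWICK format, so it can be passed to any tree viewer that accepts topologies without branch lengths.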
We thank Philipp Dresch for his much-appreciated help in terms of sample collection and preparation, and Monika Leitner for giving us the permission to take soil samples from her property. Moreover, we thank Florian Steiner for critical comments on an earlier version of the manuscript, and Maraike Probst as well as Thomas Klammsteiner for helpful advice in terms of data analysis with Mothur.
- 1. Cole EJ, Zandvakili OR, Blanchard J, Xing B, Hashemi M, Etemadi F. Investigating responses of soil bacterial community composition to hardwood biochar amendment using high-throughput PCR sequencing. Appl. Soil Ecol. 2019; 136: 80–85.
- 2. Zamyadi A, Romanis C, Mills T, Neilan B, Choo F, Coral LA, et al. Diagnosing water treatment critical control points for cyanobacterial removal: Exploring benefits of combined microscopy, next-generation sequencing, and cell integrity methods. Water Res. 2019; 152: 96–105. pmid:30665164
- 3. Jung SW, Kim HJ, Park JS, Lee T-K, Shin K, Jeong S-Y, et al. Planktonic bivalve larvae identification and quantification in Gomso Bay, South Korea, using next-generation sequencing analysis and microscopic observations. Aquaculture. 2018; 490: 297–302.
- 4. Parlapani F, Michailidou S, Anagnostopoulos D, Sakellariou A, Pasentsis K, Psomopoulos F, et al. Microbial spoilage investigation of thawed common cuttlefish (Sepia officinalis) stored at 2° C using next generation sequencing and volatilome analysis. Food Microbiol. 2018; 76: 518–525. pmid:30166182
- 5. Hu HL, Guo LY, Wu HL, Feng WY, Chen TM, Liu G. Evaluation of next-generation sequencing for the pathogenic diagnosis of children brain abscesses. J. Infection. 2019; 78: 323–337. pmid:30659857
- 6. Shin JM, Luo T, Lee KH, Guerreiro D, Botero TM, McDonald NJ, et al. Deciphering endodontic microbial communities by next-generation sequencing. J. Endodont. 2018; 44: 1080–1087. pmid:29861065
- 7. Ravi A, Avershina E, Angell IL, Ludvigsen J, Manohar P, Padmanaban S, et al. Comparison of reduced metagenome and 16S rRNA gene sequencing for determination of genetic diversity and mother-child overlap of the gut associated microbiota. J. Microbiol. Meth. 2018; 149: 44–52. pmid:29501688
- 8. Nurul ANA, Muhammad DD, Okomoda VT, Nur AAB. 16S rRNA-Based metagenomic analysis of microbial communities associated with wild Labroides dimidiatus from Karah Island, Terengganu, Malaysia. Biotechnol. Rep. 2019; 21: e00303. pmid:30671359
- 9. Epelde L, Lanzén A, Martín I, Virgel S, Mijangos I, Besga G, et al. The microbiota of technosols resembles that of a nearby forest soil three years after their establishment. Chemosphere. 2019; 220: 600–610. pmid:30597368
- 10. Probst M, Gómez‐Brandón M, Bardelli T, Egli M, Insam H, Ascher‐Jenull J. Bacterial communities of decaying Norway spruce follow distinct slope exposure and time‐dependent trajectories. Environ. Microbiol. 2018; 20: 3657–3670. pmid:30003645
- 11. Guo B, Liu C, Gibson C, Frigon D. Wastewater microbial community structure and functional traits change over short timescales. Sci. Total Environ. 2019; 662: 779–785. pmid:30708293
- 12. Hupfauf S, Plattner P, Wagner AO, Kaufmann R, Insam H, Podmirseg SM. Temperature shapes the microbiota in anaerobic digestion and drives efficiency to a maximum at 45° C. Bioresource Technol. 2018; 269: 309–318. pmid:30195223
- 13. Hupfauf S, Winkler A, Wagner AO, Podmirseg SM, Insam H. Biomethanation at 45° C offers high process efficiency and supports hygienisation. Bioresource Technol. 2020; 300: 122671. pmid:31901776
- 14. Liu Y, Wachemo AC, Yuan H, Li X. Anaerobic digestion performance and microbial community structure of corn stover in three-stage continuously stirred tank reactors. Bioresource Technol. 2019; 287: 121339. pmid:31100566
- 15. Vlčková K, Kreisinger J, Pafčo B, Čížková D, Tagg N, Hehl AB, et al. Diversity of Entamoeba spp. in African great apes and humans: an insight from Illumina MiSeq high-throughput sequencing. Int. J. Parasitol. 2018; 48: 519–530. pmid:29530647
- 16. Kraková L, Šoltys K, Otlewska A, Pietrzak K, Purkrtová S, Savická D, et al. Comparison of methods for identification of microbial communities in book collections: Culture-dependent (sequencing and MALDI-TOF MS) and culture-independent (Illumina MiSeq). Int. Biodeterior. Biodegradation. 2018; 131: 51–59.
- 17. Nagler M, Kozjek K, Etemadi-Shalamzari M, Insam H, Podmirseg SM. Simple yet effective: microbial and biotechnological benefits of rumen liquid addition to lignocellulose-degrading biogas plants. J. Biotechnol. 2019; 300: 1–10. pmid:31082412
- 18. Seena S, Bärlocher F, Sobral O, Gessner MO, Dudgeon D, McKie BG, et al. Biodiversity of leaf litter fungi in streams along a latitudinal gradient. Sci. Total Environ. 2019; 661: 306–315. pmid:30677678
- 19. Cai W, Liu W, Zhang Z, Feng K, Ren G, Pu C, et al. mcrA sequencing reveals the role of basophilic methanogens in a cathodic methanogenic community. Water Res. 2018; 136: 192–199. pmid:29510338
- 20. Herbold CW, Pelikan C, Kuzyk O, Hausmann B, Angel R, Berry D, et al. A flexible and economical barcoding approach for highly multiplexed amplicon sequencing of diverse target genes. Front. Microbiol. 2015; 6: 731. pmid:26236305
- 21. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microb. 2006; 72: 5069–5072. pmid:16820507
- 22. Cole JR, Wang Q, Fish JA, Chai B, McGarrell DM, Sun Y, et al. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Res. 2013; 42: D633–D642. pmid:24288368
- 23. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2012; 41: D590–D596. pmid:23193283
- 24. Guillou L, Bachar D, Audic S, Bass D, Berney C, Bittner L, et al. The Protist Ribosomal Reference database (PR2): a catalog of unicellular eukaryote small sub-unit rRNA sequences with curated taxonomy. Nucleic Acids Res. 2012; 41: D597–D604. pmid:23193267
- 25. Machida RJ, Leray M, Ho SL, Knowlton N. Metazoan mitochondrial gene sequence reference datasets for taxonomic assignment of environmental samples. Sci. Data. 2017; 4: 1–7. pmid:28291235
- 26. Fosso B, Santamaria M, Consiglio A, De Caro G, Grillo G, Licciuli F, et al. ITSoneDB: a specialized ITS1 database for amplicon-based metagenomic characterization of environmental fungal communities. EMBnet J. 2012; 18: 619.
- 27. Kõljalg U, Larsson KH, Abarenkov K, Nilsson RH, Alexander IJ, Eberhardt U, et al. UNITE: a database providing web‐based methods for the molecular identification of ectomycorrhizal fungi. New Phytol. 2005; 166: 1063–1068. pmid:15869663
- 28. Knight R, Vrbanac A, Taylor BC, Aksenov A, Callewaert C, Debelius J, et al. Best practices for analysing microbiomes. Nat. Rev. Microbiol. 2018; 16: 410–422. pmid:29795328
- 29. Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. Swarm: robust and fast clustering method for amplicon-based studies. PeerJ. 2014; 2: e593. pmid:25276506
- 30. Glassman SI, Martiny JB. Broadscale ecological patterns are robust to use of exact sequence variants versus operational taxonomic units. MSphere. 2018; 3. pmid:30021874
- 31. Zafeiropoulos H, Viet HQ, Vasileiadou K, Potirakis A, Arvanitidis C, Topalis P, et al. PEMA: a flexible Pipeline for Environmental DNA Metabarcoding Analysis of the 16S/18S ribosomal RNA, ITS, and COI marker genes. GigaScience. 2020; 9: giaa022. pmid:32161947
- 32. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microb. 2009; 75: 7537–7541. pmid:19801464
- 33. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods. 2010; 7: 335–336. pmid:20383131
- 34. Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 2019; 37: 852–857. pmid:31341288
- 35. AmpliSAT—Amplicon Sequencing Analysis Tools. Available at http://evobiolab.biol.amu.edu.pl/amplisat/. Accessed: 15.07.2020.
- 36. Fosso B, Santamaria M, Marzano M, Alonso-Alemany D, Valiente G, Donvito G, et al. BioMaS: a modular pipeline for Bioinformatic analysis of Metagenomic AmpliconS. BMC Bioinform. 2015; 16: 203. pmid:26130132
- 37. White JR, Maddox C, White O, Angiuoli SV, Fricke WF. CloVR-ITS: automated internal transcribed spacer amplicon sequence analysis pipeline for the characterization of fungal microbiota. Microbiome. 2013; 1: 6. pmid:24451270
- 38. Ferro M, Antonio EA, Souza W, Bacci M. ITScan: a web-based analysis tool for Internal Transcribed Spacer (ITS) sequences. BMC Res. Notes. 2014; 7: 1–4. pmid:24382056
- 39. Hildebrand F, Tadeo R, Voigt AY, Bork P, Raes J. LotuS: an efficient and user-friendly OTU processing pipeline. Microbiome. 2014; 2: 30. pmid:27367037
- 40. Dong X, Kleiner M, Sharp CE, Thorson E, Li C, Liu D, et al. Fast and simple analysis of MiSeq amplicon sequencing data with MetaAmp. Front. Microbiol. 2017; 8: 1461. pmid:28824589
- 41. Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M, et al. The metagenomics RAST server–a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinform. 2008; 9: 1–8. pmid:18803844
- 42. Albanese D, Fontana P, De Filippo C, Cavalieri D, Donati C. MICCA: a complete and accurate software for taxonomic profiling of metagenomic data. Sci. Rep. 2015; 5: 1–7. pmid:25988396
- 43. Dhariwal A, Chong J, Habib S, King IL, Agellon LB, Xia J. MicrobiomeAnalyst: a web-based tool for comprehensive statistical, visual and meta-analysis of microbiome data. Nucleic Acids Res. 2017; 45: W180–W188. pmid:28449106
- 44. Anslan S, Bahram M, Hiiesalu I, Tedersoo L. PipeCraft: Flexible open‐source toolkit for bioinformatics analysis of custom high‐throughput amplicon sequencing data. Mol. Ecol. Resour. 2017; 17: e234–e240. pmid:28544559
- 45. Gweon HS, Oliver A, Taylor J, Booth T, Gibbs M, Read DS, et al. PIPITS: an automated pipeline for analyses of fungal internal transcribed spacer sequences from the Illumina sequencing platform. Methods Ecol. Evol. 2015; 6: 973–980. pmid:27570615
- 46. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003; 13: 2498–2504. pmid:14597658
- 47. Asnicar F, Weingart G, Tickle TL, Huttenhower C, Segata N. Compact graphical representation of phylogenetic data and metadata with GraPhlAn. PeerJ. 2015; 3: e1029. pmid:26157614
- 48. Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, et al. Metagenomic biomarker discovery and explanation. Genome Biol. 2011; 12: R60. pmid:21702898
- 49. Langille MG, Zaneveld J, Caporaso JG, McDonald D, Knights D, Reyes JA, et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat. Biotechnol. 2013; 31: 814–821. pmid:23975157
- 50. The R Project for Statistical Computing. Available at: https://www.r-project.org/. Accessed: 15.07.2020.
- 51. Heller D, Vingron M, Church G, Li H, Garg S. SDip: A novel graph-based approach to haplotype-aware assembly based structural variant calling in targeted segmental duplications sequencing. bioRxiv. 2020;
- 52. Masella AP, Bartram AK, Truszkowski JM, Brown DG, Neufeld JD. PANDAseq: paired-end assembler for Illumina sequences. BMC Bioinform. 2012; 13: 31. pmid:22333067
- 53. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011; 27: 863–864. pmid:21278185
- 54. Edgar RC. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat. Methods. 2013; 10: 996–998. pmid:23955772
- 55. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990; 215: 403–410. pmid:2231712
- 56. Hauswedell H, Singer J, Reinert K. Lambda: the local aligner for massive biological data. Bioinformatics. 2014; 30: i349–i355. pmid:25161219
- 57. Wang Q, Garrity GM, Tiedje JM, Cole JR. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microb. 2007; 73: 5261–5267. pmid:17586664
- 58. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, et al. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009; 37: D141–D145. pmid:19004872
- 59. Ritari J, Salojärvi J, Lahti L, de Vos WM. Improved taxonomic assignment of human intestinal 16S rRNA sequences by a dedicated reference database. BMC Genom. 2015; 16: 1056. pmid:26651617
- 60. Jones JC, Fruciano C, Hildebrand F, Al Toufalilia H, Balfour NJ, Bork P, et al. Gut microbiota composition is associated with environmental landscape in honey bees. Ecol. Evol. 2018; 8: 441–451. pmid:29321884
- 61. De Cárcer DA, Denman SE, McSweeney C, Morrison M. Evaluation of subsampling-based normalization strategies for tagged high-throughput sequencing data sets from gut microbiomes. Appl. Environ. Microb. 2011; 77: 8795–8798. pmid:21984239
- 62. Matsumoto M, Nishimura T. Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM T. Model. Comput. S. 1998; 8: 3–30.
- 63. Bokulich NA, Rideout JR, Mercurio WG, Shiffer A, Wolfe B, Maurice CF, et al. mockrobiota: a public resource for microbiome bioinformatics benchmarking. mSystems. 2016; 1: e00062–16. pmid:27822553
- 64. Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl. Environ. Microb. 2013; 79: 5112–5120. pmid:23793624
- 65. Schirmer M, Ijaz UZ, D'Amore R, Hall N, Sloan WT, Quince C. Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform. Nucleic Acids Res. 2015; 43: e37–e37. pmid:25586220
- 66. Ihrmark K, Bödeker I, Cruz-Martinez K, Friberg H, Kubartova A, Schenck J, et al. New primers to amplify the fungal ITS2 region–evaluation by 454-sequencing of artificial and natural communities. FEMS Microbiol. Ecol. 2012; 82: 666–677. pmid:22738186
- 67. Climate Data. Trins. Available at: https://de.climate-data.org/europa/oesterreich/tirol/trins-155772/. Accessed: 17.03.2019.
- 68. IUSS Working Group WRB. World Reference Base for Soil Resources 2014, update 2015: International soil classification system for naming soils and creating legends for soil maps. World Soil Resources Reports No. 106, 192. Rome, ITA: FAO. 2015.
- 69. Fernández-Delgado Juárez M, Gómez-Brandón M, Knapp A, Stöhr D, Insam H. Chemical and microbiological properties of alpine forest soils: effects of pelletized ashes in a short-term trial. Forest Ecol. Manag. 2015; 357: 42–49.
- 70. Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. P. Natl. Acad. Sci. USA. 2011; 108: 4516–4522. pmid:20534432
- 71. Thompson LR, Sanders JG, McDonald D, Amir A, Ladau J, Locey KJ, et al. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature. 2017; 551: 457–463. pmid:29088705
- 72. Navas-Molina JA, Peralta-Sánchez JM, González A, McMurdie PJ, Vázquez-Baeza Y, Xu Z, et al. Advancing our understanding of the human microbiome using QIIME. Method. Enzymol. 2013; 531: 371–444. pmid:24060131
- 73. Yamada T, Hino S, Iijima H, Genda T, Aoki R, Nagata R, et al. Mucin O-glycans facilitate symbiosynthesis to maintain gut immune homeostasis. EBioMedicine. 2019; 48: 513–525. pmid:31521614
- 74. Fernandes KA, Kittelmann S, Rogers CW, Gee EK, Bolwell CF, Bermingham EN, et al. Faecal microbiota of forage-fed horses in New Zealand and the population dynamics of microbial communities following dietary change. PLOS One. 2014; 9: e112846. pmid:25383707
- 75. Ter Braak CJF, Šmilauer P. CANOCO reference manual and user's guide: software for ordination, version 5.0. Ithaca: Microcomputer Power; 2012.
- 76. Gohl DM, Vangay P, Garbe J, MacLean A, Hauge A, Becker A, et al. Systematic improvement of amplicon marker gene methods for increased accuracy in microbiome studies. Nat. Biotechnol. 2016; 34: 942–949. pmid:27454739
- 77. Shannon BA, Cohen RJ, Garrett KL. Influence of 16S rDNA primer sequence mismatches on the spectrum of bacterial genera detected in prostate tissue by universal eubacterial PCR. Prostate. 2008; 68: 1487–1491. pmid:18651564
- 78. Callahan BJ, McMurdie PJ, Holmes SP. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J. 2017; 11: 2639. pmid:28731476
- 79. Edgar RC. Updating the 97% identity threshold for 16S ribosomal RNA OTUs. Bioinformatics. 2018; 34: 2371–2375. pmid:29506021
- 80. Brooks JP. Challenges for case-control studies with microbiome data. Ann. Epidemiol. 2016; 26: 336–341. pmid:27255739
- 81. Sinha R, Abu-Ali G, Vogtmann E, Fodor AA, Ren B, Amir A, et al. Assessment of variation in microbial community amplicon sequencing by the Microbiome Quality Control (MBQC) project consortium. Nat. Biotechnol. 2017; 35: 1077. pmid:28967885
- 82. McLaren MR, Willis AD, Callahan BJ. Consistent and correctable bias in metagenomic sequencing experiments. Elife. 2019; 8: e46923. pmid:31502536
- 83. Hugerth LW, Andersson AF. Analysing microbial community composition through amplicon sequencing: from sampling to hypothesis testing. Front. Microbiol. 2017; 8: 1561. pmid:28928718
- 84. Edgar RC. UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing. BioRxiv. 2016; 081257.
- 85. Westcott SL, Schloss PD. De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units. PeerJ. 2015; 3: e1487. pmid:26664811
- 86. Westcott SL, Schloss PD. OptiClust, an improved method for assigning amplicon-based sequence data to operational taxonomic units. MSphere. 2017; 2: e00073–17. pmid:28289728
- 87. Rideout JR, He Y, Navas-Molina JA, Walters WA, Ursell LK, Gibbons SM, et al. Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences. PeerJ. 2014; 2: e545. pmid:25177538
- 88. Balvočiūtė M, Huson DH. SILVA, RDP, Greengenes, NCBI and OTT—how do these taxonomies compare? BMC Genomics. 2017; 18: 114. pmid:28361695
- 89. Price MN, Dehal PS, Arkin AP. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 2009; 26: 1641–1650. pmid:19377059
- 90. De Cáceres M, Legendre P, Moretti M. Improving indicator species analysis by combining groups of sites. Oikos. 2010; 119: 1674–1684.
- 91. Aßhauer KP, Wemheuer B, Daniel R, Meinicke P. Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data. Bioinformatics. 2015; 31: 2882–2884. pmid:25957349
- 92. Iwai S, Weinmaier T, Schmidt BL, Albertson DG, Poloso NJ, Dabbagh K, et al. Piphillin: improved prediction of metagenomic content by direct inference from human microbiomes. PLOS One. 2016; 11. pmid:27820856