CoMA – an intuitive and user-friendly pipeline for amplicon-sequencing data analysis

In recent years, there has been a veritable boost in next-generation sequencing (NGS) of gene amplicons in biological and medical studies. Huge amounts of data are produced and need to be analyzed adequately. Various online and offline analysis tools are available; however, most of them require extensive expertise in computer science or bioinformatics, and often a Linux-based operating system. Here, we introduce “CoMA – Comparative Microbiome Analysis” as a free and intuitive analysis pipeline for amplicon-sequencing data, compatible with any common operating system. The tool offers various useful services including data pre-processing, quality checking, clustering into operational taxonomic units (OTUs), taxonomic assignment, data post-processing, data visualization, and statistical appraisal. The workflow results in highly esthetic and publication-ready graphics, as well as output files in standardized formats (e.g. tab-delimited OTU table, BIOM, NEWICK tree) that can be used for more sophisticated analyses. The CoMA output was validated by a benchmark test, using three mock communities with different sample characteristics (primer set, amplicon length, diversity). The performance was compared with that of Mothur, QIIME and QIIME2-DADA2, popular packages for NGS data analysis. Furthermore, the functionality of CoMA is demonstrated on a practical example, investigating microbial communities from three different soils (grassland, forest, swamp). All tools performed well in the benchmark test and were able to reveal the majority of all genera in the mock communities. For the soil samples, too, the results of CoMA were congruent with those of the other pipelines, in particular when looking at the key microbial players.

Hierarchical cluster analysis using the Bray-Curtis metric showed a clear separation of samples belonging to the three soils (Fig 2). Therefore, it can be assumed that unique microbial communities developed in each habitat. However, the microbial consortia of swamp and grassland were more closely related to each other (similarity of 33%) than to the forest samples (10% similarity). All of the grassland and swamp replicates grouped close to each other (86% and 79% similarity, respectively) while the forest samples showed a significantly lower within-group similarity (57%).
Looking at the bacterial community in detail, forest samples were clearly dominated by Chthoniobacteraceae (13%) and Xanthobacteraceae (10%; Fig 3). The latter family was also found in the two other habitats at a significantly lower abundance (3% and 2% in grassland and swamp, respectively). Chthoniobacteraceae also appeared in grassland (1%) but were below the detection limit in the swamp (< 1%). Other important families in the forest samples were Solibacteraceae (Subgroup 3; 5%), Acidobacteriaceae (Subgroup 1; 4%) and Gemmataceae (4%). These taxa were highly characteristic of the forest habitat and could not be found at comparable abundances in grassland or swamp.
Taking a deeper look into the archaeal community structure of the forest samples, Nitrosotaleaceae and Methanosaetaceae were the most abundant families (relative abundance of 36% and 27%, respectively; showed an abundance of 9% and 8%, respectively, while Methanoregulaceae accounted for at least 3% of all Archaea. In contrast, grassland was completely dominated by Nitrososphaeraceae (99%).
Methanosarcinaceae and Nitrososphaeraceae both accounted for 8% of the classified reads, and Methanomassiliicoccaceae for 5%. Methanospirillaceae (2%) and Methanocaldococcaceae (1%) were low in abundance but still detectable in the swamp samples.
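The similarity percentages reported for the cluster analysis can be illustrated with a minimal sketch. It assumes that similarity is expressed as 100 × (1 − Bray-Curtis dissimilarity) computed on OTU or family counts, and that clusters are joined by average linkage; the linkage method and the internal implementation of CoMA are not stated here, so both are assumptions. The function names and the toy counts are hypothetical and are not taken from the study data.

```python
from itertools import combinations


def bray_curtis_similarity(a, b):
    """Percent Bray-Curtis similarity between two abundance vectors."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both samples are empty")
    return 100.0 * 2.0 * shared / total


def average_linkage(samples):
    """Agglomerative clustering with average linkage (assumed method).

    samples maps a sample name to a list of counts (same taxon order).
    Returns the merges as (cluster_a, cluster_b, mean_similarity_percent),
    where each cluster is a tuple of sample names.
    """
    clusters = [(name,) for name in samples]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            sims = [bray_curtis_similarity(samples[s], samples[t])
                    for s in c1 for t in c2]
            mean_sim = sum(sims) / len(sims)
            if best is None or mean_sim > best[2]:
                best = (c1, c2, mean_sim)
        c1, c2, sim = best
        # Replace the two most similar clusters by their union.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, sim))
    return merges


# Hypothetical toy counts for four taxa in four samples (not study data).
toy = {
    "forest_1": [13, 10, 1, 0],
    "forest_2": [10, 12, 2, 0],
    "grassland_1": [1, 3, 20, 5],
    "swamp_1": [0, 2, 12, 15],
}

for left, right, similarity in average_linkage(toy):
    print(left, right, round(similarity, 1))
```

In this toy example the two forest samples merge first, grassland and swamp merge second, and the two resulting groups join last at a low similarity, which mirrors the qualitative pattern described for Fig 2.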

Discussion
Although located in close proximity, each soil revealed a unique microbial community structure with different diversity and species richness. It is known from the literature that both the vegetation and the soil type are main drivers of microbial diversity in soils [1]. In this case, the vegetation type in particular, forest on the one hand and meadow (grassland, swamp) on the other, may have influenced the microbial community the most, since the climatic conditions and the parent material of the soils were the same for all sites. This assumption is also strengthened by the results of the cluster analysis, where grassland and swamp samples were much more closely related to each other than to the forest. We assume that the soil pH, which was remarkably lower in the forest compared to the other two sites (S1 Table), was another crucial factor shaping microbial communities. Several authors have already reported this, including Nacke et al. [2] and Zhalnina et al. [3], who further stressed the importance of soil texture.
Ivanova et al. [4] stated that soil acidity is the strongest driver in the formation of microbial communities and evokes fundamental changes at phylum level. According to the authors, the vegetation type, as a second but slightly weaker factor, shapes the microbiota at lower taxonomic levels, e.g. order, family, or genus.
The high species richness and diversity in the swamp may be explained by the high content of organic compounds typically found in (partially) flooded soils. Moreover, this site has been fertilized with manure two or three times a year, which is expected to increase the soil organic matter (OM) content. Tiedje et al. [5] raised the hypothesis that a high OM content may explain increased microbial diversity, while Degens et al. [6] particularly emphasized the role of organic carbon. Despite being generally low in species richness and diversity, the forest soil showed a high variation in its microbial community structure among the replicates compared to grassland and swamp. This dissimilarity suggests a higher heterogeneity within microbial communities across the sampling area of 100 m² and may be explained by the different niches and soil heterogeneity typically found in forest soils [7]. The vegetation as well as the physico-chemical characteristics of these local niches may play a key role in the development of individual microbial communities. This assumption may also apply to grassland and swamp; however, the differences between the replicates were much lower because the vegetational heterogeneity of the meadows was not comparable to that of the forest. High microbial heterogeneity in (forest) soils of the same type was also found at a microscopic scale and is assumed to be due to spatial separation and the establishment of microhabitats [8]. Individuals from these microhabitats are generally isolated and can only encounter other microbes through animal and root transport or after rain events.
The dominance of Chthoniobacteraceae in forest soils is in line with the findings of Vivanco et al. [9], who found this family to be dominant in the litter layer of a Nothofagus mixed forest. Chthoniobacteraceae are members of the phylum Verrucomicrobia (class: Spartobacteria, order: Chthoniobacterales), which is increasingly recognized as a crucial phylum of any soil ecosystem and typically accounts for 1.2-10.9% of the total bacteria in soil [10]. The family Xanthobacteraceae is another core member of forest soils [11][12][13]. For grassland, Propionibacteriaceae were among the key families. This is remarkable since these Gram-positive bacteria are anaerobes [14], typically found in the digestive tract of animals [15] but not in unfertilized soil systems. The investigated site, however, had not been fertilized for ten years, and Propionibacteriaceae had not been expected after such a long time. Therefore, we assume that they were introduced to the ecosystem by manure application and persisted in appropriate niches. It remains unclear, however, why Propionibacteriaceae were not found in the swamp, where manure is applied regularly and the fluctuating water level causes, at least partially, anaerobic zones. Nitrosomonadaceae, another abundant family in the grassland samples, contains nitrifying bacteria such as Nitrosomonas and Nitrosospira, which are involved in the oxidation of ammonia to nitrate in the presence of oxygen [16].
Their emergence is in line with the fact that a low ammonium content and an accumulation of nitrate were found in this habitat (S1 Table). Nitrosomonadaceae were also detected in the swamp; however, occasional oxygen limitations might have hindered the nitrification process and may explain higher ammonium and lower nitrate levels compared to the grassland. Beyond that, the organic fertilization of the swamp soil might also have influenced the ammonium and nitrate levels, both directly and indirectly. Other characteristic taxa in the grassland were members of the Actinobacteria (e.g. Gaiellaceae, Solirubrobacterales), which are known as key players in terrestrial and aquatic ecosystems [17,18].
Low nitrate levels in the swamp may alternatively be explained by the distinctive occurrence of Flavobacterium. Some of these microbes are aerobically living denitrifiers (e.g. Flavobacterium denitrificans [19]) that convert nitrate to molecular nitrogen under anaerobic conditions and may explain why no nitrate accumulation took place in the swamp soil. Some denitrifiers are also known within the family Burkholderiaceae, e.g. Ralstonia eutropha (syn.: Alcaligenes eutrophus [20]), which were found in the swamp and in lower abundance in the forest and the grassland soil samples. Burkholderiaceae is a very diverse family and includes nitrogen-fixing bacteria such as Burkholderia vietnamiensis [21]. These form close symbioses with plants and convert gaseous dinitrogen to ammonia, which becomes available to the plant [22,23]. Anaerolineaceae, which were exclusively detected in the swamp, are strict anaerobes, and their presence is indicative of arbuscular mycorrhizal interactions between fungi and plant roots [24].