MIPE: A metagenome-based community structure explorer and SSU primer evaluation tool

An understanding of microbial community structure is an important issue in the field of molecular ecology. The traditional molecular method involves amplification of small subunit ribosomal RNA (SSU rRNA) genes by polymerase chain reaction (PCR). However, PCR-based amplicon approaches are affected by primer bias and chimeras. With the development of high-throughput sequencing technology, unbiased SSU rRNA gene sequences can be mined from shotgun sequencing-based metagenomic or metatranscriptomic datasets to obtain a reflection of the microbial community structure in specific types of environment and to evaluate SSU primers. However, the use of short reads obtained through next-generation sequencing for primer evaluation has not been well resolved. The software MIPE (MIcrobiota metagenome Primer Explorer) was developed to adapt numerous short reads from metagenomes and metatranscriptomes. Using metagenomic or metatranscriptomic datasets as input, MIPE extracts and aligns rRNA to reveal detailed information on microbial composition and evaluate SSU rRNA primers. A mock dataset, a real Metagenomics Rapid Annotation using Subsystem Technology (MG-RAST) test dataset, two PrimerProspector test datasets and a real metatranscriptomic dataset were used to validate MIPE. The software calls Mothur (v1.33.3) and the SILVA database (v119) for the alignment and classification of rRNA genes from a metagenome or metatranscriptome. MIPE can effectively extract shotgun rRNA reads from a metagenome or metatranscriptome and is capable of classifying these sequences and exhibiting sensitivity to different SSU rRNA PCR primers. Therefore, MIPE can be used to guide primer design for specific environmental samples.


Introduction
The elucidation of microbial community structure and diversity is an important issue in the field of molecular ecology. The traditional method is based on PCR, and requires the use of primers that specifically targets SSU genes to characterize a community [1]. The  genes, namely the 16S rRNA gene in Bacteria and Archaea and the 18S rRNA gene in Eukarya, have been widely used in microbial phylogeny since Carl Woese introduced the three-domain system [2,3]. This method has greatly expanded our understanding of microbial diversity and has led to the establishment of some public databases such as National Center for Biotechnology Information (NCBI) GenBank, SILVA, Ribosomal Database Project (RDP) and Greengenes [4][5][6][7][8]. Some studies and tools, such as TestProbe [8] and probeBase [9], have evaluated and improved these universal PCR primers based on these datasets. However, the accuracy of PCR approaches is reduced due to primer bias and chimeras, and these PCR-based databases and primer evaluation tools overestimate primer coverage [10]. The advent of high-throughput sequencing has given rise to a number of shotgun sequencing-based metagenomes and metatranscriptomes [11], and many related datasets have been accumulated. Consequently, some specific websites have started storing metagenomic data, such as the databases CAMERA, iMicrobe, and European Bioinformatics Institute (EBI) Metagenomics [12][13][14]. Because metagenomic and metatranscriptomic sequences are generated without PCR-based amplification, the rRNA sequences in these datasets are used for the microbial population analysis [15,16]. However, PCR amplicon sequencing is not out of date. Different from shotgun sequencing, PCR amplicon sequencing is economical, fast and able to be well aligned to analyze in standard workflows [17]. Several software packages have been developed, such as Mothur [18], Usearch [19] and QIIME [20]. Moreover, in ecology, a large number of environmental samples need to be analyzed in parallel to determine the contribution of environmental parameters to microbial populations, so PCR approaches are still widely used. We previously used different metagenome datasets to evaluate universal primers for the bacterial 16S rRNA gene and found that primer evaluation based on the RDP database overrated the coverage achieved with the primers [10].
Moreover, both primer coverage and PCR efficiency are important for the analysis of specific environmental samples. Increasing primer degeneracy would decrease the efficiency of PCR amplification and specific environments may require the use of specific primers. For example, when amplifying the bacterial 16S rRNA gene from gut samples, researchers mixed another primer with the widely used 16S rRNA primer 27F because the primer 27F cannot cover Bifidobacteria, which is dominant in gut samples [21]. However, 27F is broadly used in analysis of various environmental samples because the content of Bifidobacteria is relatively low in most water and soil samples. Therefore, the development of a software program for primer evaluation based on metagenome or metatranscriptome data from specific environmental samples is necessary.
Although some programs (SSUsearch, EMIRGE, MG-RAST API) [16,22,23] and websites (IMG, MG-RAST) [15,24] have been developed for the metagenomic microbial population analysis, no pipelines are currently available for primer evaluation. TestProbe [8] and probe-Base [9] offer primer evaluation in their websites but rely on the PCR-based SILVA and RDP databases. Some programs, such as PrimerProspector [25] and DegePrime [26], evaluate and develop primers for taxonomic classification, but these programs cannot be used for metagenome sequence datasets and do not include rRNA extraction and global alignment processes, which are necessary for rejecting incorrect primer binding sites in shotgun reads. The tool De-MetaST [27] is available for metagenome datasets but was designed to provide in silico amplicons generated by user-defined degenerate primers found within a user-defined nucleotide database. Therefore, this software only provides information for the covered but not the uncovered parts and cannot be used for the evaluation of primer coverage or for primer modification. Furthermore, this software is not suitable for rRNA primers because it utilizes BLASTx (Basic Local Alignment Search Tool searching protein databases using a translated nucleotide query) for the classification.
To fill in the gap created by software programs for primer evaluation and the microbial population analysis based on shotgun metagenomes, we optimized our automated pipeline, which was established previously [10,28] and integrated it into the software MIPE (MIcrobiota metagenome Primer Explorer). MIPE extracts and classifies rRNA from a metagenome or metatranscriptome to provide information regarding community structure and allow the evaluation of different SSU PCR primers for different taxonomic groups. MIPE calls Mothur (v1.33.3) and the SILVA database for the identification of rRNA and the alignment of these sequences, which allows the identification of primer binding sites. The users only need to input the appropriate metagenome dataset and primer lists. The program automatically generates the output, including information regarding primer evaluation, rRNA sequence and taxonomy. This software also supports a modified process suitable for the same analysis based on a metatranscriptome, in this process, considerable rRNA reads would be extracted for the analysis of the primers. In the study described in this manuscript, MIPE was used to extract SSU sequences from a mock dataset and an MG-RAST metagenome dataset, and was used to evaluate the primer pair F515-R806 utilizing the standard SILVA database, the CAMERA metagenomic dataset and a metatranscriptomic dataset. The results were compared with those obtained using MG-RAST [15] and PrimerProspector [25].

Design principles
MIPE contains three perl scripts and depends on Mothur (v1.33.3) and BLAST (v2.2.26 or higher). MIPE can run under the Linux operating system only and can be downloaded at https://github.com/zoubinok/MIPE.git. MIPE uses private or public shotgun metagenomic or metatranscriptomic rRNA sequences submitted by the user to analyze community structure and SSU primer coverage. It calls Mothur (v1.33.3) and the SILVA database to align and classify rRNA, and a reference sequence is introduced to lock and mark primer binding sites. The reference sequence is separately aligned against the user-defined sequences and primers. Based on the reference sequence location, primer binding sites are extracted with three to five additional bases at both ends to avoid base slips resulting from multiple sequence alignment. The primers are then re-aligned against the primer binding sites obtained from the user-defined sequences to improve the alignment quality, and the sequences that are poorly aligned in this region are deleted. The workflow for MIPE is shown in Fig 1. MIPE can also be used to evaluate large subunit ribosomal RNA (LSU rRNA) primers only if the users replace the SSU reference sequence, the SILVA SSU database and the SSU primers with the LSU reference sequence, the SILVA LSU database (https://www.arb-silva.de/fileadmin/silva_databases/release_119/Exports/ SILVA_119_LSURef_tax_silva_full_align_trunc.fasta.gz) and LSU primers, respectively.

Data preprocessing
MIPE consists of two parts, namely dataset preprocessing and the main program. Dataset preprocessing is needed for metagenome datasets because only approximately 0.2% of the sequences in metagenome datasets are related to SSU rRNA genes [29,30]. To obtain SSU rRNA gene-like sequences in metagenome datasets, 71 representative sequences (45 bacterial, 17 archaeal and 9 eukaryotic sequences) were obtained by clustering the SILVA SSU database (v102) with Usearch (v5.2.32) [19] at a sequence identity level of 75%. These representative sequences are then used as queries in a BLASTn search against the user-defined sequences (default parameters: db alignments per query was 65535). A draft of the user-defined SSU dataset is then built to evaluate different primer sets. Because some hidden Markov model (HMM)-based tools can be used to replace this step, this step was not included in the main program [16,31,32]. For the analysis of metatranscriptome and amplified rRNA gene sequences, MIPE skips this step and runs the main program directly because rRNA gene sequences constitute a large part of these datasets. However, some rRNA sequences in some metatranscriptomic samples are removed for mRNA enrichment; in these cases, using the metagenome workflow would be a wise choice and these rRNA sequences cannot be used to reflect the community. In addition, for other special cases, such as cases with large insertions in SSU sequences, which would decrease alignment accuracy, it would prove beneficial to perform specific preprocessing with other methods before using MIPE [33]. But MIPE uses short reads from shotgun sequencing, so most large insertions would be excluded because of poor alignments in MIPE and then primers can be well evaluated, as well.

MIPE main program
Stage I. This stage aims to classify sequences and extract primer binding sites. The detailed parameters are listed in the program.
The first step is the sequence screening and taxonomy of a metagenome. Based on the sequencing approach, genes may be reverse-complementary sequenced, which may have negative effects on the taxonomy and reduce the accuracy of the primer evaluation. To avoid this problem, each sequence and its reverse-complement sequence are aligned against the SILVA SSU database by calling "align.seqs" in Mothur (v1.33.3) [18]. The correct sequence trend is based on the comparison of two search scores that characterize the similarity of the candidate sequence with the reference database. If both values are lower than the cutoff search score which can be modified by the users, although a value of 30 is recommended, the sequence is not considered part of the SSU rRNA gene. Then, MIPE calls Mothur to classify ("classify. seqs") the selected SSU rRNA gene sequences. If the sequence is classified into Bacteria, Archaea, or Eukarya and the bootstrap value is not less than the threshold cutoff (Mothur declares a minimum cutoff of 60, and 50 is also used for sequences shorter than 250 bp [28,34]), the sequence passes through to the next analysis.
Alternatively, the first step can be the sequence screening and taxonomy of a metatranscriptome. For metatranscriptome datasets, LSU rRNA is also considered because of its high content in the metatranscriptome [28]. Therefore, LSU is extracted in the same manner as SSU, and four search scores are acquired to determine the sequence attributions. The extracted LSU is also used for taxonomy and community analysis with the SILVA LSU database (v119) [8]. The subsequent metatranscriptomic SSU rRNA analysis is the same as that used for metagenomic datasets.
Appending a reference sequence, locating and extracting primer binding sites then follow in order. A reference sequence is then introduced to identify primer binding sites with MIPE. The default reference sequences are the standard full-length 16S rRNA gene sequence of Escherichia coli (GenBank accession number: J01695) and the corresponding sequence of Methanomethylovorans hollandica (GenBank accession number: NC_019977) and the 18S rRNA gene sequence of Saccharomyces cerevisiae (GenBank accession number: NR_132213.1). The user-defined SSU sequences and the reference sequence are combined and aligned against the aligned the SILVA SSU database (v119) using the multiple-sequence aligner "align.seqs" in Mothur (v1.33.3) so that every base is given a unique position on the reference sequence. Users would make a list of degenerate primers as input and make sure both their degenerate metatranscriptome, and the blue background describes steps used for metatranscriptomic sequence screening and taxonomy. The lake blue background details the steps associated with appending a reference sequence, locating and extracting primer binding sites, which form part of Stage I in the MIPE main program. The green background shows Stage II in the MIPE main program.
https://doi.org/10.1371/journal.pone.0174609.g001 and non-degenerate formats have right primer binding sites on the reference sequence. Degenerate primers are split into non-degenerate primers with MIPE, and these non-degenerate primers are aligned against the reference sequence using the same tool. The reference sequence, which is used as a marker, helps extract the primer binding sites in the user-defined SSU sequences.
The SILVA SSU database (v119) that is used with MIPE was downloaded from the Mothur modified Recreated SEED database (http://www.mothur.org/w/images/5/56/Silva.seed_v119. tgz). This database does not have the primer binding regions 8F and 1492R because most bacterial sequences were amplified using this primer pair. To address this shortage, all of the sequences were aligned against an extended SILVA database that was prepared in house. To build the extended SILVA database, the primer 8F was attached to the 5' region of every sequence in the SILVA SSU database (v119) and the reverse complement of the primer 1492R was attached to the 3' region of every sequence. A perl script (SILVA_ENLONG.pl) was provided as a part of MIPE in github.
Stage II. This stage aims to analyze and output the results from primer matching. The Mothur command "align.seqs" is called for the alignment of every extracted primer binding site from the user-defined SSU sequences against the non-degenerate primers to rectify any base slips caused by multiple sequence alignment. Four types of mismatches in the primer binding sites, namely substitution, insertion, deletion and missing fragment, are marked with different signals based on the results of this alignment.
For each non-degenerate primer, MIPE outputs two files, a match type file (".Match_type") and a statistics file (".Stat"). The ".Match_type" file includes details from each sequence evaluation. Five match types, namely match, substitution, insertion, deletion and missing fragment, are expressed by the MIPE match type formats ' = ', 'A(TCG)', 'a(tcg)', 'd' and '.', respectively, to facilitate the statistical analysis which involves each match type for the sequence, the binding sites of the primers, the mismatch base numbers in total, the mismatch base numbers in last 4 bases, the matched degree and the taxonomy information. A single internal primer-template mismatch can greatly decrease PCR efficiency, particularly if the mismatch occurs at the last three to four positions; thus, the degree of matching is defined as "no mismatching" or "only one mismatch that is not found in the last four positions near the 3' end" [35,36]. The match types of each non-degenerate primer separated from each degenerate primer are integrated into ".Match_type" to evaluate degenerate primers using the best match type, which is based on the score provided in the ".Match_type_tmp" files. The format of ".Match_type" is shown in Fig 2A. The ".Stat" file provides a summary of the primer evaluation. All of the results were generated by the analysis program using the "Match_type" files. The ".Stat" file contains six tables: the match and mismatch sequence count, the distribution of completeness, the most frequent mismatch type, the most common match type, the count of each mismatch in every position, and the coverage over taxonomy levels. To reflect the mismatch of any missing information, we introduced the variable "completeness" for each primer binding site. If the primer binding site of a fragment was too marginal to cover all of the bases of a primer or was poorly aligned, this site would not be counted when evaluating the coverage of a certain primer. The format of the ".Stat" file is shown in Fig 2B.

Testing the dataset and demonstration
Comparison of sequence taxonomies. A 13-organism genomic mock dataset was simulated using MetaSim (version 0.9.1; [37]) with a sequence length of 400 bp for each organism, 6X coverage and an exact error model to confirm the accuracy of the taxonomy obtained by . Linear correlations were calculated at the phylum and class levels of Archaea and Bacteria. Taxa whose abundances are greater than 100 reads in either the SILVA SSU (v119) or MG-RAST built-in SSU databases were listed separately and the others were summed into one item. Due to differences in taxonomy information for Eukaryota between the SILVA SSU (v119) and MG-RAST built-in SSU databases, the test for Eukarya was not executed.
Primer evaluation. To compare the primer evaluation results obtained by MIPE and Pri-merProspector [25], the "SILVA test set" and "Metagenome test set" were used. The "SILVA test set" was used with the example pipeline of PrimerProspector, derived from the SILVA SSU database (v104) and filtered at 97% sequence identity with Uclust [19]. The "Metagenome test set" was selected from the CAMERA website (release v.1.3.2.30; http://camera.calit2.net/) based on our previous work in 2012 [10,14]. The primer set F515-R806 [25,40,41] (F515: 5'-GTGCC AGC(A/C)GCCGCGGTAA-3';R806: 5'-GGACTACC(A/C/G)GGGTATCTAAT-3'), which is designed to be universal for nearly all bacterial and archaeal taxa, is widely used in highthroughput sequencing to amplify V4 region of 16S SSU rRNA. We used this primer set for the evaluation. It was also evaluated with the "SILVA test dataset" via PrimerProspector. The MIPE parameters were a search score cutoff of 30, a bootstrap cutoff of 80, the inclusion of Bacteria and other default parameters. The scripts "analyze_primers.py" and "taxa_coverage.py" with default parameters in PrimerProspector were used and the PrimerProspector results were transformed into the MIPE match type format for the comparison. Because the primers F515-R806 are bacterial primers, only bacterial SSU sequences classified by MIPE were compared. In PrimerProspector, a primer binding site with an overall weighted score of at most 1.00 was regarded as a matched site. The phylum-level coverage data presented in the MIPE output " Table 6: Coverage over taxonomy levels" from a ".Stat" file were used for comparison.
The metatranscriptomic dataset SRX155355 (Short Read Archive (SRA) accession number: SRX155355), which is based on our previous work in 2014 [28] was executed using the metatranscriptome section of MIPE. All sequence reads containing "N" were discarded and sequences longer than 400 bp were used. The sequences were checked for chimeric artifacts using the "chimera.uchime" in Mothur (v1.33.3), and the passed sequences were processed with MIPE metatranscriptome (search score cutoff: 20; bootstrap cutoff: 80). completeness, the taxonomy information and the score. (B) Statistics file, ".Stat". The file presents a summary of the primer evaluation, and all tables result from the analysis of the "Match_type" files. The file contains six tables: the match and mismatch sequence count, the distribution of completeness, the most frequent mismatch type, the most common match type, the count of each mismatch in every position, and the coverage over taxonomy levels. Rare biospheres can be found in " Table 6" of the statistics file and sequences whose completeness were less than one were not counted in this table. For more details, please see S3 Table. https://doi.org/10.1371/journal.pone.0174609.g002

Results and discussion Extraction and taxonomic identification of SSU sequences
Running time of MIPE (CPU time) was tested using the 13-organism genomic mock dataset containing 1000000 sequences. MIPE was run on Ubuntu 14.04.1 (2×2.40GHz Quad-core Xeon, 64GB of RAM) but in the single-threaded mode. The dataset preprocessing (MIPE_pre_program. pl) took 3 minutes and extracted 2389 sequences. MIPE main program (MIPE_main_program. pl) took 5 minutes to deal with these 2389 sequences.
The SSU sequences in the 13-organism genomic mock dataset were extracted and classified by MIPE. A goodness-of-fit analysis based on Pearson's Chi-squared test (chisq.test in R and the R code was in S1 Table) at the genus level showed no significant difference between the original percentage and the MIPE-determined percentage (p value > 0.05; S1 Table) indicating that MIPE can extract SSU sequences from an unbiased mock dataset.
The activated sludge sample from MG-RAST dataset 4467420.3 (16663946 reads) was evaluated. Krnoa charts [42] of SSU sequences show the MIPE results, revealing the relative abundance distribution of microbes at different taxonomic levels (S1 and S2 Files). Based on the two results, the main clusters of Bacteria and Archaea are similar. A low content was detected with MIPE for the Eukarya cluster because the taxonomic database SILVA SSU (v119) used for the MIPE analysis does not cover fish, such as Danio, the dominant Eukaryote in this dataset. MG-RAST is broadly used for microbial analysis of shotgun metagenome, which identified rRNA sequences through an initial BLAT search [15]. But MG-RAST built-in SSU database is not available for us and it cannot be replaced. Furthermore, MG-RAST only reports the relative abundance and we cannot get the taxon of every sequence. Thus, we used linear correlation-based comparison instead. Fig 3 clearly shows the significant linear correlation of Archaea and Bacteria read numbers at the phylum (R 2 = 0.897, P = 2.99 e-06) and class level (R 2 = 0.861, P = 2.27 e-07) between the results obtained using the two methods. In conclusion, MIPE can efficiently extract SSU rRNA gene sequences from metagenome datasets and accurately identify the taxonomy of each sequence.
The metatranscriptomic dataset SRX155355 contains 21,035 sequences prior to processing by MIPE. MIPE extracted 10,885 SSU sequences and 7452 LSU sequences. Thus, rRNA sequences accounted for 87.2% of the total sequences. Considering that the whole amount of rRNA genes in a metagenomic dataset is approximately 0.2%, metatranscriptomic datasets provide considerable sequences.
MIPE is flexible because it is a local pipeline to which different SSU or other gene databases can be applied for the analysis of data from different environments because different environments consist of different communities and it is better to select primers for specific environments.

Validation of the primer evaluation process
The primer set F515-R806 were evaluated using the "SILVA test set" and "Metagenome test set" with MIPE and PrimerProspector [25]. Because PrimerProspector was not designed for metagenomes and cannot perform global alignments or classify metagenome sequences using its primer evaluation modules (analyze_primers.py and taxa_coverage.py), we had to input the SSU rRNA gene sequence and taxonomy files for the comparison.
The coverage of most phyla by the primers F515 and R806, as evaluated by MIPE, basically agreed with the results obtained by PrimerProspector, presenting a 0%-to-10% difference (Fig 4) in the "SILVA test set. The different coverages obtained by these two programs were due to different penalty rules. But, in fact, they had the same match type. In other words, there were no differences between two results except penalty rules and penalty rules can get changed in scripts. The  details are shown in sheets "F515_annotation_of_difference" and "R806_annotation_of_difference" of S2 Table. Other sheets of S2 Table original match type files and statistics files. In the analysis of the "Metagenome test set", PrimerProspector found one primer binding site in every sequence for primer F515 or R806, which means that PrimerProspector identified incorrect primer binding sites, because it only performed local alignments and did not anchor the primer binding sites, even if it was an incorrect site. MIPE avoids this problem through the use of a global alignment and the anchoring of primer binding sites. Because the results obtained by PrimerProspector included incorrect information for uncovered sections and cannot be used for primer evaluations and metagenome modifications, just like De-MetaST [27], we can only compare the match types from these two scripts case by case (Sheet: F515_annota-tion_of_difference and F806_annotation_of_difference in S3 Table).
The metatranscriptomic dataset SRX155355 contained 3915 bacterial sequences which had primer binding sites for F515 and 2081 bacterial sequences had primer binding sites for R806. Thus, primer binding sites for F515 and R806 accounted for 18.6% and 9.9% of the total sequences, respectively (see S4 Table). Metatranscriptomic datasets provided considerable sequences for testing primers.

Conclusions
A priori knowledge is important when selecting and developing primers for the microbial population analysis. MIPE is a pipeline that combines automated SSU primer evaluation with rRNA gene extraction and classification for the analysis of shotgun metagenomic and metatranscriptomic datasets, which are becoming increasingly available. Furthermore, "missed" microorganisms from previous studies can also be discovered with newly designed primers based on mismatched types. In the future, MIPE will be able to address other marker genes to offer users a more powerful analysis tool for discovering the features of functional microbial communities, and an SSU rRNA gene database based on metagenomic or metatranscriptomic sequences can be established based on results obtained using the MIPE pipeline.
Supporting information S1 Table. MIPE results obtained using a mock dataset and the R code.