From Gigabyte to Kilobyte: A Bioinformatics Protocol for Mining Large RNA-Seq Transcriptomics Data

RNA-Seq techniques generate hundreds of millions of short RNA reads using next-generation sequencing (NGS). These RNA reads can be mapped to reference genomes to investigate changes of gene expression but improved procedures for mining large RNA-Seq datasets to extract valuable biological knowledge are needed. RNAMiner—a multi-level bioinformatics protocol and pipeline—has been developed for such datasets. It includes five steps: Mapping RNA-Seq reads to a reference genome, calculating gene expression values, identifying differentially expressed genes, predicting gene functions, and constructing gene regulatory networks. To demonstrate its utility, we applied RNAMiner to datasets generated from Human, Mouse, Arabidopsis thaliana, and Drosophila melanogaster cells, and successfully identified differentially expressed genes, clustered them into cohesive functional groups, and constructed novel gene regulatory networks. The RNAMiner web service is available at http://calla.rnet.missouri.edu/rnaminer/index.html.


Introduction
Transcriptome analysis is essential for determining the relationship between the information encoded in a genome, its expression, and phenotypic variation [1,2]. Next-generation sequencing (NGS) of RNAs (RNA-Seq) has emerged as a powerful approach for transcriptome analysis [3,4] that has many advantages over microarray technologies [5,6,7].
A RNA-Seq experiment typically generates hundreds of millions of short reads that are mapped to reference genomes and counted as a measure of expression [5]. Mining the gigabytes or even terabytes of RNA-Seq raw data is an essential, but challenging step in the analysis.
In order to address these challenges, RNAMiner has been developed to convert gigabytes of raw RNA-Seq data into kilobytes of valuable biological knowledge through a five-step data mining and knowledge discovery process. RNAMiner integrates both public tools (e.g., TopHat2 [8], Bowtie2 [9], Cufflinks [10], HTSeq [11], edgeR [12], and DESeq2 [13]) with our in-house tools (MULTICOM-MAP [14,15,16]) to preprocess data and identify differentially expressed genes in the first three steps. In the last two steps, RNAMiner uses our in-house tools MULTI-COM-PDCN [17,18] and MULTICOM-GNET [19,20] to predict both functions and gene regulatory networks of differentially expressed genes, respectively.
As proof of principle, we have applied the RNAMiner protocol to RNA-Seq data generated from Human, Mouse, Arabidopsis thaliana, and Drosophila melanogaster cells. The data mining process successfully produced valuable biological knowledge such as differentially expressed genes, cohesive functional gene groups, and novel hypothetical gene regulatory networks by reducing the size of the initial data set over a thousand-fold.

Methods
Some RNA-Seq data analysis pipelines (e.g. Galaxy [21], KBase [22], iPlant [23]) provide users with a convenient and free platform for RNA-Seq data analysis by combing public tools, such as TopHat [24], Bowtie [25], Cufflinks [10], Cuffmerge [10], and Cuffdiff [10]. As with these pipelines, RNAMiner combines these public tools such as TopHat2 [8], Bowtie2 [9], Cufflinks [10], Cuffdiff [10], and it is free. However, there are several differences between RNAMiner and other pipelines. First, RNAMiner integrates more tools, such as HTSeq [11], edgeR [12], DESeq2 [13], and our in-house MULTICOM-MAP [14,15,16], to calculate gene expression values and identify differentially expressed genes. These tools can generate more accurate consensus results. For example, RNAMiner uses Cuffdiff, edgeR, and DESeq2 to identify differentially expressed genes based on TopHat mapping results and gene expression values calculated by HTSeq and MULTICOM-MAP. RNAMiner generates up to five distinct lists and one consensus list of differentially expressed genes, which usually produces more accurate results. Second, RNAMiner predicts functions of differentially expressed genes and builds gene regulatory networks by integrating our in-house tools MULTICOM-PDCN [17,18] and MULTICOM-G-NET [19,20]. These analyses provide more biological information. Other pipelines (e.g. Galaxy and iPlant) do not provide these analyses. Another software package-KBase-contains a service to predict gene functions, but the service only provides GO annotation for plant genomes. Third, without requirements for user registration and selection of many parameters, RNAMiner is easier to use than other pipelines. Compared to running each tool separately, users can easily run all these tools integrated in RNAMiner at one time and download results generated by all the tools at the RNAMiner web site.
The five data analysis steps of the RNAMiner protocol (Fig 1) are described individually in sub-sections below. Tables 1 and 2 list the versions and the parameters of all the public tools used in RNAMiner and describe the meanings of the parameters.

Mapping RNA-Seq reads to a reference genome
We use two public tools, TopHat2 [8] and Bowtie2 [9], to map RNA-Seq reads to reference genomes in the UCSC genome browser [26] in conjunction with the RefSeq genome reference annotations [27]. The workflow of mapping RNA-Seq reads to a reference genome and calculating gene expression values is illustrated in Fig 2. It is worth noting that, since the RefSeq genome reference annotations contain information about some non-coding small RNAs, the reads of the non-coding RNAs are mapped and counted in addition to regular protein coding mRNAs. MULTICOM-MAP [14,15,16] is used to remove reads mapped to multiple locations in a reference genome from the mapping data in BAM/SAM format [28] generated by TopHat2 and Bowtie2. Only reads mapped to a unique location on the genome are retained to calculate the read counts of the genes. We use MULTICOM-MAP to analyze the mapping results to The RNAMiner protocol for big transcriptomics data analysis. Five blue boxes denote five data analysis steps, i.e. mapping RNA-Seq reads to a reference genome, calculating gene expression values, identifying differentially expressed genes, predicting gene functions, and constructing gene regulatory networks. The tools used in each step are listed inside each box. The external input information is represented by brown boxes and the final output information is represented by green boxes. The information flow between these components is denoted by arrows. doi:10.1371/journal.pone.0125000.g001 Mining Large RNA-Seq Data obtain baseline information, such as the total number of reads, the number of reads mapped to a unique location, and the number of reads mapped to multiple locations. This mapping process can generally reduce the size of datasets by several orders of magnitude. TopHat2 -read-mismatches 2 The maximum number of mismatched nucleotides between a read and a reference allowed for a valid mapping.
-read-gap-length 2 The maximum number of gaps in the alignment between a read and a reference genome allowed for a valid mapping.
-splicemismatches 0 The maximum number of mismatches allowed in the "anchor" region of a spliced alignment.
-segmentmismatches 2 Read segments are mapped independently, allowing up to this number of mismatches in each segment alignment.
-segment-length 25 A read is cut into segments each having at least this length. These segments are mapped independently.

Bowtie2
-end-to-end In this mode, Bowtie2 requires that the entire read to be aligned from one end to the other, without any trimming (or "soft clipping") of characters from either end. Local alignment is not used in Bowtie2.
-sensitive This option generally balances speed, sensitivity and accuracy.
-frag-len-std-dev 80 The standard deviation of fragment lengths.
-max-intron-length 300000 Ignore alignments with gaps longer than this.
Cuffdiff -min-alignmentcount 10 Minimum number of alignments in a locus for testing.
-FDR 0.05 The maximum false discovery rate allowed after statistical correction.
-frag-len-std-dev 80 The standard deviation of fragment lengths.
HTSeq -a 10 Skip all reads with alignment quality lower than this minimum value.
-t Exon Feature type (3rd column in GFF file) to be used. All the features of other types are ignored.
-i gene_id GFF attribute to be used as feature ID.
-m Union Mode to handle reads overlapping more than one feature.

DESeq2
Test LRT Use the likelihood ratio test on the difference in deviation between a full and reduced model formula (nbinomLRT).
fitType parametric The type of fitting of dispersions to the mean intensity. Parametric: fit a dispersion-mean relation via a robust gamma-family GLM.

Calculating gene expression values
For RNAMiner, MULTICOM-MAP [14,15,16] and two public tools: HTSeq [11] and Cufflinks [10] are used to calculate gene expression values according to the genome mapping output and the RefSeq genome reference annotation [27]. MULTICOM-MAP and HTSeq produce raw read counts, while Cufflinks generates normalized values in terms of FPKM, i.e., fragments per kilobase of exon model per million mapped fragments. The normalized gene expression values generated by Cufflinks are used to identify differentially expressed genes in the next step. The read counts generated by MULTICOM-MAP and HTSeq are fed separately into two R Bioconductor packages, edgeR [12] and DESeq2 [13], to identify differentially expressed genes. The  estimated probability that the reads were derived from the isoform. In contrast, MULTICOM--MAP distributes the total count of such reads to each isoform, while HTSeq discards the reads without counting them for any isoform. This analysis step generates the overall expression profile of most genes in a transcriptome and can reduce the size of data from Step 1 by~one thousand-fold, from gigabytes to several megabytes.

Identifying differentially expressed genes
We use Cuffdiff [10] and two R Bioconductor packages, edgeR [12] and DESeq2 [13] to identify differentially expressed genes separately (see Fig 3 for the workflow). EdgeR and DESeq2 identify differentially expressed genes based on the raw read counts calculated by MULTI-COM-MAP and HTSeq, resulting in four lists of differentially expressed genes (i.e., edgeR+-MULTICOM-MAP, edgeR+HTSeq, DESeq2+MULTICOM-MAP, and DESeq2+HTSeq). In contrast Cuffdiff identifies differentially expressed genes directly from the genome mapping outputs containing only reads mapped to a unique location on the genome, resulting in one list of differentially expressed genes. Cuffdiff, edgeR and DESeq2 further adjust p-values by multiple testing using Benjamini and Hochberg's approach, which controls the false discovery rate (FDR) [10,12,13]. Usually, the cut-off of p-value (or q-value) is set to 0.05. Based on the five lists of differentially expressed genes generated by Cuffdiff, edgeR+MULTICOM-MAP, DESeq2+MULTICOM-MAP, edgeR+HTSeq, and DESeq2+HTSeq, a consensus list of differentially expressed genes is generated as the final output which usually comes from the overlap of at least three lists of differentially expressed genes. This step generates valuable information that may play an important role in the biological experiment. For example, the significantly differentially expressed genes identified by RNAMiner could be the targets for new biological experiments. This analysis step can generally reduce the size of data of the previous step by a couple orders of magnitude, condensing the data set size to several hundred kilobytes.

Predicting gene functions
We use MULTICOM-PDCN [17,18], a protein function prediction method ranked among the top methods in the 2011-2012 Critical Assessment of Function Annotation (CAFA) [29], to predict functions of differentially expressed genes (see Fig 4 for the workflow). MULTI-COM-PDCN integrates sequence-profile and profile-profile alignment methods (PSI-BLAST [30] and HHSearch [31]) with protein function databases such as the Gene Ontology database [32], the Swiss-Prot database [33], and the Pfam database [34], to predict functions of proteins in Gene Ontology [32] terms in three categories: biological process, molecular function, and cellular component. MULTICOM-PDCN also provides some statistical information about the predicted functions, such as the number of differentially expressed genes predicted in each function. We then use the Cochran-Mantel-Haenszel test implemented by R program mantelhaen.test [35,36] to check if predicted function terms are good for Fisher's exact test to identify the significantly enriched GO function terms. A p-value from the MH test lower than 0.05 suggests the two nominal variables (e.g., two function terms) are conditionally independent in each stratum [35,36]. We then calculate a p-value of enrichment for each predicted function using R function fisher.test [35,36,37,38,39,40,41,42] and sort the predicted functions by their p-value in ascending order, from the most significant ones to the least significant ones. The list of the most significantly enriched functions can provide an overview of the biological processes differentially perturbed in two biological conditions. Although the physical size of the data and knowledge generated in this step is comparable to the size of the data in the previous step, the differentially expressed genes can be organized in three functional perspectives: biological process, molecular function, and cellular component.

Constructing gene regulatory networks
We use MULTICOM-GNET [19,20] to construct gene regulatory networks based on differentially expressed genes and transcription factors in a genome (see Fig 5 for the workflow). MUL-TICOM-GNET firstly clusters differentially expressed genes with similar expression patterns into functional clusters using the K-means clustering algorithm. Secondly, it builds a binary decision tree to represent potential regulatory relationships between several selected transcription factors (TFs) and the genes in each cluster. Thirdly, it re-assigns differentially expressed genes into clusters whose gene regulatory tree best explained the expression patterns of the genes. The last two steps are repeated until the maximized likelihood of the gene expression data is reached [19,20]. We also use a R network analysis and visualization package "igraph" [43] to visualize gene regulatory networks by linking the regulatory relationships between and within all the gene regulatory modules predicted by MULTICOM-GNET together. The regulatory network construction step generates a comprehensive understanding of underlying mechanisms controlling the expression of a transcriptome and can significantly reduce the size of data. The hundreds of kilobytes of the biological network data provide a system view of the cellular systems, which can be more readily utilized to generate valuable hypotheses for biological experiments.
For replicates from RNA-Seq experiments, RNAMiner maps reads of the replicates to reference genomes and calculates gene expression values separately. The gene expression values of the replicates of two samples are combined into a profile (i.e. a vector of the expression values of a gene in each replicate of each condition), which is input into edgeR and DESeq2 to identify differentially expressed genes. Additionally, the TopHat mapping results of the replicates of two samples are input into Cuffdiff to identify differentially expressed genes. EdgeR [12], DESeq2 [13], and Cuffdiff [10] handle the replicates by modeling the variance (dispersion) in counts across the replicates as a function of the mean count of the replicates. EdgeR [12] estimates the variance by conditional maximum likelihood conditioned on the total count for the gene. DESeq2 [13] uses a flexible and mean-dependent local regression to estimate the variance between the replicates by pooling genes with similar expression levels to enhance the variance estimation. Cuffdiff [10] estimates the variance based on a negative binomial model and uses ttest to calculate the test statistics. Cuffdiff can make a model on each condition with replicates, or use a global model for all conditions together.
Before calling a tool to do data analysis, RNAMiner checks whether the data is appropriate to the tool. For example, MULTICOM-GNET [19,20] is not applied if no transcription factors exist in differentially expressed genes because MULTICOM-GNET needs at least one transcription factor to build gene regulatory networks. Another example is, for some special datasets, overexpression of some treatments in some regions of the genome in one condition leads to very large read counts of some genes in this condition, and dramatic differences of gene expressions between two conditions. This violates the assumption of edgeR's normalization method [12] that the majority of the genes should have similar expression levels. Therefore, calculating a normalization factor across all loci is difficult. RNAMiner will check this assumption and will not call edgeR if it is violated.

Evaluation and Discussion
We tested the RNAMiner protocol on six sets of RNA-Seq data generated from Human, Mouse, Arabidopsis thaliana and Drosophila melanogaster cells in order to evaluate its effectiveness. The details such as organisms, biological conditions, and experimental settings about the six sets of RNA-Seq data were reported in Table 3. The results of each of the five analysis steps are described and discussed as follows.

Results of mapping RNA-Seq reads to a reference genome
RNAMiner used TopHat2 [8] and Bowtie [44] to map RNA-Seq reads in the first and second data sets to the Mouse reference genome (mm9) in the UCSC genome browser [26] in conjunction with the RefSeq genome reference annotation (mm9) [27], map RNA-Seq reads in the third and fourth data sets to the Drosophila melanogaster reference genome (dm3) in the UCSC genome browser [26] in conjunction with the RefSeq genome reference annotation (dm3) [27], map RNA-Seq reads in the fifth data set to the Arabidopsis thaliana reference genome (ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_ chromosomes/) in conjunction with the Arabidopsis thaliana genome reference annotation (ftp://ftp.arabidopsis.org/home/ tair/Genes/TAIR10_genome_release/TAIR10_gff3/), and used TopHat2 [8] and Bowtie2 [9] to map RNA-Seq reads in the sixth data set to the Homo sapiens reference genomes (hg19) in the UCSC genome browser [26] in conjunction with the RefSeq genome reference annotation (hg19) [27]. Tables 4-9 show the mapping statistics of six sets of RNA-Seq data. Overall, more than 70% of reads were mapped to the genome successfully. Particularly, a very high mapping rate (~97%) was reached on the sixth data set. These mapping success rates were within the reasonable range, suggesting the good quality of the data and the correctness of the mapping process. This reads mapping process reduced the size of data by several orders of magnitude.

Gene expression values calculated from the reads mapping data
RNAMiner removed reads that mapped to multiple locations on a reference genome from the mapping data. The gene expression values were calculated by Cufflinks [10], MULTICOM--MAP [14,15,16], and HTSeq [11] on the remaining RNA-Seq reads mapped to unique locations on the genome. Compiling reads mappings into gene expression values generates an overall profile of the expression levels of most genes in a transcriptome, which can reduce the size of dataset by about one thousand-fold (i.e., from gigabytes to megabytes) in our experiments. The compilation process transforms the raw data into meaningful expression profiles of genes. For example, three gene expression plots for comparisons between Control and each treatment in mutant mouse in the first data set are shown in Fig 6,   order to make these figures readable. Usually, the points beyond the diagonal are candidates of differentially expressed genes. MULTICOM-MAP and HTSeq were used to calculate the raw read counts in the third and fourth sets of data. The counts were normalized by dividing them by the total number of uniquely mapped reads in the replicate. The normalized count of a gene was an indicator of the relative expression level of the gene in the replicate. The normalized counts of a gene in multiple replicates of a sample were further averaged and used as the measure of the relative expression level of the gene in the sample.  Fig 10. The left plot in each figure was generated from all the genes, and the right one was generated from differentially expressed genes. The gene expression values were calculated by MULTICOM-MAP and normalized by log 2 . According to the two plots in Figs 9 and 10, the distribution of expression values of differentially expressed genes is quite different than that of the rest of the genes.

Differentially expressed genes identified from the RNA-Seq data
RNAMiner identified differentially expressed genes between control and each treatment using Cuffdiff [10], edgeR [12], and DESeq [13]. The threshold of p-value was set to 0.05 to select differentially expressed genes. For example, the number of differentially expressed genes for each comparison and their overlaps in both mutant mouse and wild-type mouse in the first data set are shown in Fig 11. The number of differentially expressed genes for each comparison in the second data set is shown in Fig 12. These differentially expressed genes were derived from the overlaps of three sets of differentially expressed genes separately identified by Cuffdiff, MULTI-COM-MAP+edgeR, and MULTICOM-MAP+DESeq. As shown in Fig 12, the number of differentially expressed genes increased with the increase of FruHis concentration in the absence or presence of 4 μM lycopene. The number of differentially expressed genes for two comparisons between Col (Wild-Type) and hae-3 hsl2-3 (mutant), between Col_qtrim (Wild-Type) and hae-3 hsl2-3_qtrim (mutant), and their overlaps in the fifth data set are shown in Fig 13. These differentially expressed genes were derived from the overlaps of three sets of differentially expressed genes generated separately by Cuffdiff, MULTICOM-MAP+edgeR, MULTICOM-MAP+DESeq. We also identified differentially expressed genes for two comparisons: between 2pc3 and 1Sfesrrb, between 4ctrl and 3DY131 in the sixth data set using edgeR based on read counts calculated by MULTICOM-MAP. EdgeR identified 6,210 differentially expressed genes for the comparison between 2pc3 and 1Sfesrrb, and 590 differentially expressed genes for the comparison between   4ctrl and 3DY131. On the RNAMiner web service, users can select different p-value (or qvalue) thresholds to select a specific number of differentially expressed genes according to their needs. In addition to generating the testable biological hypotheses (e.g. gene targets for experimental testing), differential gene expression analysis generally reduces the size of data by about two folds, shifting point of interest from almost all the genes in a genome to a small portion of genes most relevant to the biological experiment. in the top 10 biological processes. In these figures, the number besides each column is p-value of the enrichment of each predicted function. Although the step of gene function analysis does not substantially reduce the size of data physically, it can logically summarize hundreds of differentially expressed genes into a small number (i.e., tens) of biological processes activated or deactivated in the biological experiment which sheds light into the potential biological mechanism relevant to the experiment.

Constructed gene regulatory networks
RNAMiner used MULTICOM-GNET [19,20] to construct gene regulatory networks based on differentially expressed genes and transcription factors. For example, a repression gene regulatory module with expression correlation 0.85 in mutant mouse in the first data set is illustrated in Fig 17. This module was comprised of 21 differentially expressed genes. Three transcription factors: Tgfb1i1, Htatip2, and Jun, were predicted to collaboratively regulate this group of genes. An activation gene regulatory module for the comparison between Col (Wild-Type) and hae-3 hsl2-3 (mutant) with expression correlation 0.85 in the fifth data set is shown in Fig 18. This module was comprised of 35 differentially expressed genes. Four transcription factors, AT3G59580, AT1G56650, AT1G28050, and AT1G52890, were predicted to collaboratively regulate this group of genes.
RNAMiner also used a R package "igraph" [43] to visualize gene regulatory networks by linking the regulatory relationships between and within all the gene regulatory modules The step of gene regulatory network reconstruction condenses hundreds of differentially expressed genes and their expression data into dozens of valuable gene regulatory modules, which may reveal the underlying biological mechanism controlling the expression in the biological experiment. The network modules not only provide the human comprehensible interpretation of the gene expression levels, but also the important transcription factors and their target genes that are very valuable for generating hypotheses for new biological experiments.

Use of the RNAMiner Web Service
The RNAMiner web service (Fig 20) is available at http://calla.rnet.missouri.edu/rnaminer/ index.html. Users can submit requests on the home page and receive an email with a link to the data analysis results. 8. Upload reads files. The last three categories request users to upload reads files for both two samples. Users can upload more than one reads files for each sample.
After a request is submitted successfully, one web page (Fig 21) will be shown saying the data is in process. If one user submitted one request to the RNAMiner web service and it is running or it is in the waiting queue, he/she cannot submit another request.

Receive the results
When the data analysis is finished, users will receive an email with a link to one web page (Fig 22) with the data analysis information and a result link. The result page (Fig 23) will be shown by clicking the result link. Users can view and download the analysis data on the result page. The time expense of analyzing a set of RNA-Seq data by RNAMiner depends on how big the data is, how many reads files there are in the data set, and how many jobs there are in the waiting queue. Normally a data analysis can be finished by RNAMiner in several hours. However, the time expense will be longer if there are a lot of jobs in the waiting queue. Our server cannot handle too many jobs at the same time because of CPU and space limitations.

Conclusions
The RNAMiner protocol and pipeline can progressively reduce the size of large datasets to produce valuable and comprehensible biological knowledge of manageable size, ranging from gene expression values, differentially expressed genes, gene function predictions, and gene regulatory networks. The test results on six RNA-Seq datasets of four different species help demonstrate its utility and versatility.
In order to further improve the quality of RNA-Seq data analysis, additional tools can be plugged into the RNAMiner protocol. In the future, we will add a high-speed RNA mapping tool-Gsnap [25] and a high-accuracy RNA mapping tool-Stampy [45] into the pipeline to map RNA reads to reference genomes. For identifying differentially expressed genes, we will include baySeq [46], ShrinkSeq [47], and NOISeq [48] into the pipeline in order to handle various sources of noise in RNA-Seq data even better. Furthermore, we will include an in-house tool of constructing biological networks from a group of co-expressed genes to reconstruct highly valuable metabolic networks and signal transduction networks for gene clusters identified by the RNAMiner protocol. Moreover, we will add the capability of analyzing the function of non-coding small RNAs into RNAMiner and use the information during the reconstruction  of biological networks. The new improvements will be incorporated into the RNAMiner web service for the community to use.