RNA CoMPASS: A Dual Approach for Pathogen and Host Transcriptome Analysis of RNA-Seq Datasets

doi:10.1371/journal.pone.0089445

Figure 1.

Schematic of RNA CoMPASS (RNA comprehensive multi-processor analysis system for sequencing) architecture.

RNA CoMPASS is a graphical user interface (GUI) based parallel computation pipeline for the analysis of both exogenous and human sequences from RNA-seq data. It employs one commercial program and several open-source programs to analyze RNA-seq data sets, including Novoalign, SAMMate, BLAST, and MEGAN. Each step subtracts the reads it maps, so that the remaining unmapped reads can be analyzed further for pathogen discovery. The mapped reads are analyzed separately. The end result of the pipeline is pathogen discovery and host transcriptome analysis.
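
The read-subtraction idea in this caption can be illustrated with a minimal sketch. It is not the RNA CoMPASS implementation: the stage names and the prefix-based "mapping" tests below are hypothetical stand-ins for real aligners, used only to show how each step keeps its mapped reads and passes the unmapped remainder on.

```python
from typing import Callable, Dict, List, Tuple

Read = str
# A stage is a (name, predicate) pair; the predicate says whether a read maps.
Stage = Tuple[str, Callable[[Read], bool]]


def subtract_reads(
    reads: List[Read], stages: List[Stage]
) -> Tuple[Dict[str, List[Read]], List[Read]]:
    """Run reads through the stages in order.

    Each stage keeps the reads it maps and hands only the unmapped
    remainder to the next stage. Returns the mapped reads per stage
    and the reads that no stage could assign.
    """
    mapped: Dict[str, List[Read]] = {}
    remaining = list(reads)
    for name, is_mapped in stages:
        mapped[name] = [r for r in remaining if is_mapped(r)]
        remaining = [r for r in remaining if not is_mapped(r)]
    return mapped, remaining


# Hypothetical toy stages: a read "maps" if it carries a given prefix.
stages: List[Stage] = [
    ("host_alignment", lambda r: r.startswith("HUMAN")),
    ("exogenous_search", lambda r: r.startswith("VIRUS")),
]
reads = ["HUMAN_1", "VIRUS_1", "HUMAN_2", "UNKNOWN_1"]
mapped, unassigned = subtract_reads(reads, stages)
```

In this toy run the host stage retains the two `HUMAN` reads for separate transcriptome analysis, and only the remaining reads reach the exogenous search.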

Figure 2.

Performance Analysis of RNA CoMPASS.

RNA CoMPASS was deployed on a local cluster and benchmarked. An Akata RNA-seq data set was split into six files of varying sizes: file 1, 393.4 MB, 1,397,139 reads; file 2, 757 MB, 2,685,149 reads; file 3, 1.44 GB, 5,120,805 reads; file 4, 2.72 GB, 9,651,466 reads; file 5, 5.01 GB, 25,465,406 reads; file 6, 8.99 GB, 50,930,812 reads. Overall run time was measured for each file on a single machine (blue column) and on the local 4-node cluster (red column). Speedup is shown as a green line.
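
Speedup is conventionally the single-machine run time divided by the cluster run time. The sketch below assumes that definition; the caption does not state the formula or the measured times, so the timings used here are hypothetical placeholders.

```python
def speedup(single_machine_seconds: float, cluster_seconds: float) -> float:
    """Speedup = time on one machine / time on the cluster."""
    if single_machine_seconds < 0 or cluster_seconds <= 0:
        raise ValueError("times must be non-negative, cluster time positive")
    return single_machine_seconds / cluster_seconds


def parallel_efficiency(speedup_value: float, nodes: int) -> float:
    """Fraction of the ideal (linear) speedup achieved on `nodes` nodes."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup_value / nodes


# Hypothetical timings, NOT taken from the benchmark in the figure.
example_single = 3600.0   # seconds on a single machine
example_cluster = 1200.0  # seconds on the 4-node cluster
s = speedup(example_single, example_cluster)  # 3.0
e = parallel_efficiency(s, nodes=4)           # 0.75
```

A speedup close to the node count (4 here) would indicate near-linear scaling; values well below it point to overhead from distribution or serial steps.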

Figure 3.

Detection of EBV in Human B-Cells using RNA CoMPASS.

All 45 single-end RNA-seq data sets (22 lymphoblastoid cell lines (LCLs) and 23 Burkitt's lymphomas) were analyzed using RNA CoMPASS. (A) The virome branch of the taxonomy trees for two representative LCLs and Burkitt's lymphomas was generated using the metagenome analysis tool MEGAN 4. (B) EBV reads were quantified in all 45 RNA-seq data sets and are expressed as EBV reads per 5,000,000 total sequence reads.
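
The normalization in panel (B) amounts to scaling each sample's EBV read count to a common library size. A minimal sketch, assuming simple proportional scaling; the function name and the example counts are hypothetical and not taken from the data sets.

```python
def reads_per_scale(ebv_reads: int, total_reads: int,
                    scale: int = 5_000_000) -> float:
    """Express a read count per `scale` total sequence reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if ebv_reads < 0:
        raise ValueError("ebv_reads must be non-negative")
    return ebv_reads * scale / total_reads


# Hypothetical sample: 1,200 EBV reads out of 10,000,000 total reads.
normalized = reads_per_scale(1_200, 10_000_000)  # 600.0
```

Scaling to a fixed total makes samples sequenced at different depths directly comparable.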

Figure 4.

Circos plot of two EBV samples shows distinct gene expression.

An annotated Circos plot depicts the EBV read coverage across the EBV genome for two samples. The graph displays the number of reads mapped to each nucleotide position of the genome, plotted on a log scale. Blue features represent lytic genes, red features represent latency genes, green features represent potential non-coding genes, and black features represent non-gene features (e.g. repeat regions and origins of replication).
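
Per-nucleotide read coverage of the kind plotted here can be computed from alignment intervals. This is a sketch under stated assumptions, not the method used for the figure: alignments are given as 0-based, end-exclusive `(start, end)` pairs, and the `log10(count + 1)` transform (with a pseudocount of 1 so that uncovered positions map to 0) is an assumption, since the caption says only that coverage is shown in log scale.

```python
import math
from typing import List, Tuple


def per_base_coverage(genome_length: int,
                      alignments: List[Tuple[int, int]]) -> List[int]:
    """Count reads covering each position.

    `alignments` holds 0-based, end-exclusive (start, end) intervals.
    Uses a difference array so the cost is O(reads + genome_length).
    """
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, genome_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    coverage: List[int] = []
    running = 0
    for i in range(genome_length):
        running += diff[i]
        coverage.append(running)
    return coverage


def log_scale(coverage: List[int]) -> List[float]:
    """log10(count + 1); the +1 pseudocount is an assumption."""
    return [math.log10(c + 1) for c in coverage]


# Toy genome of 10 positions with three hypothetical reads.
cov = per_base_coverage(10, [(0, 4), (2, 6), (2, 6)])
# cov == [1, 1, 3, 3, 2, 2, 0, 0, 0, 0]
log_cov = log_scale(cov)
```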

Figure 5.

Heat Map representing Human B-Cells analyzed using RNA CoMPASS.

Human transcript counts from the 45 B-cell samples were imported into the R software environment and analyzed using the edgeR package [15]. Genes with low transcript counts (fewer than 1 count per million (CPM)) in the majority of samples were filtered out. The Manhattan (L1) distance matrix for the samples was computed from the remaining transcript counts and used as input for hierarchical clustering with the Ward algorithm. After each sample was assigned to one of the two groups identified by hierarchical clustering (Human B-Cell or Burkitt's Lymphoma), the glmFit function was used to fit the mean log(CPM) for each group, and likelihood ratio tests were used to identify differentially expressed genes, with adjusted P < 0.05 following the Benjamini-Hochberg correction for multiple testing. The fitted log(CPM) values for the subset of genes that were differentially expressed in the LCL samples relative to the Burkitt's lymphoma samples were then clustered using Euclidean distance and the complete linkage algorithm to detect groups of co-expressed genes. The expression heat map displays the top 250 differentially expressed genes.
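
Three of the steps named in this caption (the CPM filter, the Manhattan distance between samples, and the Benjamini-Hochberg adjustment) can be sketched in plain Python. The analysis itself was done in R with edgeR; the code below is an illustrative re-implementation of the general techniques, not the edgeR code. The function names are hypothetical, CPM is computed here from raw library sizes without edgeR's normalization factors, and "majority of samples" is read as more than half.

```python
from typing import Dict, List


def cpm(counts: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """Counts per million, per sample (column), from a gene -> counts table."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values())
                 for j in range(n_samples)]
    return {gene: [row[j] * 1e6 / lib_sizes[j] for j in range(n_samples)]
            for gene, row in counts.items()}


def filter_low(cpm_table: Dict[str, List[float]],
               threshold: float = 1.0) -> Dict[str, List[float]]:
    """Drop genes below `threshold` CPM in more than half of the samples."""
    kept = {}
    for gene, values in cpm_table.items():
        n_low = sum(1 for v in values if v < threshold)
        if n_low <= len(values) / 2:
            kept[gene] = values
    return kept


def manhattan(a: List[float], b: List[float]) -> float:
    """L1 distance between two sample profiles of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical toy table: 2 genes x 3 samples.
toy_counts = {"g1": [0, 0, 5], "g2": [1_000_000, 1_000_000, 999_995]}
kept_genes = list(filter_low(cpm(toy_counts)))  # only "g2" survives
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [p < 0.05 for p in adj]
```

Ward clustering and the likelihood ratio tests are left out of the sketch because they depend on the edgeR and R clustering implementations referenced in the caption.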
