CHOmics: A web-based tool for multi-omics data analysis and interactive visualization in CHO cell lines

Chinese hamster ovary (CHO) cell lines are widely used in industry for biological drug production. During cell culture development, considerable effort is invested to understand the factors that greatly impact cell growth, specific productivity and product qualities of the biotherapeutics. While high-throughput omics approaches have been increasingly utilized to reveal cellular mechanisms associated with cell line phenotypes and to guide process optimization, comprehensive omics data analysis and management have been a challenge. Here we developed CHOmics, a web-based tool for integrative analysis of CHO cell line omics data that provides interactive visualization of omics analysis outputs and efficient data management. CHOmics has a built-in comprehensive pipeline for RNA sequencing data processing and multi-layer statistical modules to explore relevant genes or pathways. Moreover, advanced functionalities are provided to enable users to customize their analysis and visualize the output systematically and interactively. The tool was also designed with the flexibility to accommodate other types of omics data, thereby enabling multi-omics comparison and visualization at both gene and pathway levels. Collectively, CHOmics is an integrative platform for data analysis, visualization and management that is expected to promote the broader use of omics in CHO cell research.


Overview of CHOmics
The main interface consists of several stacked panels. Recent experiments and recent projects are listed in separate panels for quick access. You can also reach them, and other functions, from the shortcuts in the top menu bar.

Menu Bar
The top menu bar lists shortcuts for quick access to the main functions: Toolbox, My Analysis, Admin, Projects, Comparisons and Samples.
'My Analysis' provides quick access to information on all 'Experiments', 'Samples' and 'Analysis'.
'Admin' allows users to manage data files in the private and shared folders and to review the platforms applied to all data sets.
'Projects', 'Comparisons' and 'Samples' each provide a search function and access to a specific project, comparison or sample, respectively.

Experiments and Analyses
An experiment is used to run the built-in RNA sequencing pipeline on raw sequencing data. Once the experiment is created, users can upload raw fastq files and sample meta information, and then launch the built-in pipeline. When the analysis is complete, an analysis report is generated and the results can be exported as a 'Project' for visualization and cross-project comparison.

Projects
A project is used for data mining and data visualization. Users can create a project either by importing an analysis report from an 'Experiment' or by uploading pre-processed data. Within a project, users can explore different features of the data (e.g., gene expression profiling, sample clustering, PCA, differentially expressed genes and pathways), compare the analysis with other projects, or perform a meta-analysis by combining multiple projects.
Each project mainly consists of samples, which carry both meta information and omics profiles, and comparisons, which show the statistical differences among samples.

Samples
A project may include many samples, which can be searched through 'Samples' in the top menu bar. Each sample has its own properties, including Species, CellType, DiseaseState, etc. (details are available by clicking the button on the left end). To change the columns displayed in the table, use the table settings (green button). Users can also select samples and save them to the sample list. Samples from the list can be loaded into other analysis or visualization tools such as the heatmap.
Each sample has a gene expression profile. CHOmics offers multiple ways to analyze and visualize samples, including the correlation tool (noted by 'C'), gene expression plot (noted by 'E'), expression heatmap (noted by 'H') and PCA analysis (noted by 'P').

Comparisons
A comparison is a comparative analysis between two groups of samples, including differential gene expression analysis and pathway enrichment analysis.
A large amount of meta data is available for each comparison. The dashboard gives an overview of the key categories, and the detailed description of each comparison contains the full information.

Tools for sample analysis
Selected comparisons can be saved to the comparison list (yellow button) for easy loading into the plotting tools.
Each comparison also offers several options for more complex visualization and analysis: bubble plot of gene expression (noted by 'B'), meta-analysis (noted by 'M'), pathway heatmap plot (noted by 'H'), significantly changed genes (noted by 'C'), volcano plot (noted by 'V'), WikiPathways mapping (noted by 'W'), and Reactome and KEGG pathway mapping (noted by 'R' and 'K', respectively).

Genes
Genome-wide gene expression values were measured in each sample using RNA-Seq or microarrays. All the human genes that have expression values are listed in the gene table. Gene annotations from different platforms were all mapped to NCBI gene IDs (EntrezID) for consistency across platforms.
To find a gene, you can use its gene symbol, gene description, gene alias, NCBI gene ID, Ensembl gene ID or UniProt ID.
For some common genes, the symbol used in publications is often not the official symbol, in which case you can search the alias field. For example, TP53 is often referred to as P53 in publications. If you do not know the official symbol, search for P53 in the alias field or for tumor protein p53 in the description field to find it.
The NCBI Gene search https://www.ncbi.nlm.nih.gov/gene is a good source for official gene symbols and IDs.
You can view the full details of a gene by clicking the button.
From gene details, you can access RNA-Seq data in a box plot, or view all comparisons including this gene in a bubble plot.

Upload fastq files to experiment
After the experiment is created, users can upload fastq or fastq.gz files through remote URLs, server files or local files. The files are uploaded to the private folder named 'Experiments' automatically.

View expression plot View Bubble plot across comparisons
In the folder 'Experiments', there may be multiple subfolders corresponding to different experiments. Users can easily modify the folder or upload new files to the folder.

Upload data file to project
Besides raw RNA sequencing data (fastq files), CHOmics also allows other types of data to be used to start a project, including meta data (i.e., project and samples), expression data, and summary data. These data should be uploaded in comma separated values (CSV) or tab separated values (TSV) with either

RNAseq analysis pipeline
After fastq files are uploaded to the experiment as described in Section 2.1, users can start the analysis with the built-in pipeline, which consists of four main steps: Raw Data QC (quality control); Alignment; Gene Counts and QC; and DEG, GSEA and GO analysis. When the analysis is complete, the results can be exported into a project for visualization.
After each of the first three steps completes, a report is generated that summarizes the metrics for that step: raw data QC, alignment with the Subread method, and gene count distribution with sample/gene count QC, respectively.
In the raw data QC report, the quality of all fastq files is checked with FastQC. Sequencing read information and quality control metrics are summarized for each individual fastq file.
In the alignment report, the parameter settings and quality metrics (e.g., mapped reads, junctions) are listed for each fastq file.
In the report for the Gene Counts and QC step, several metrics are calculated and plotted for a comprehensive evaluation of genes and samples, including: reads mapping to genes, distribution of detected genes, percentage of reads for highly expressed genes, normalization and boxplot of gene expression, sample grouping and clustering, sample correlation and outlier detection.
'1. Assign reads to genes' plots the summary of read-to-gene mapping, showing the percentage of reads assigned to genes or left unassigned because they were unmapped, overlapped no feature, or mapped ambiguously. '6. Sample correlation' creates scatter plots of the correlation between sample pairs. Biological replicates from the same group should look similar in the scatter plot and should have higher correlation values with each other than with samples from other groups.
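The replicate check behind the sample correlation plot can be illustrated with a minimal sketch. It assumes Pearson correlation on log2 expression values; the correlation measure CHOmics actually uses is not stated here, and the sample names and values are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(profiles):
    """Return {(sample_a, sample_b): r} for every pair of samples."""
    return {
        (a, b): pearson(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Hypothetical log2 expression values for five genes in four samples.
profiles = {
    "D72_rep1": [5.0, 7.1, 2.0, 9.3, 4.2],
    "D72_rep2": [5.2, 7.0, 2.1, 9.0, 4.0],
    "D108_rep1": [8.1, 3.0, 6.2, 4.1, 7.9],
    "D108_rep2": [8.0, 3.3, 6.0, 4.4, 8.1],
}

correlations = pairwise_correlations(profiles)
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

Replicates within a group should score close to 1, while pairs drawn from different groups score lower. A replicate that correlates poorly with its own group is a candidate outlier.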

DE, GSEA and GO analysis
After completing the first three steps for sample quality control and gene count readout, users can start the statistical analysis as the last step of the pipeline. It includes differential expression analysis (DEG), gene set enrichment analysis (GSEA) and gene ontology (GO) analysis.
DEG analysis compares gene expression between two groups; each such pairing is called a comparison. Users can design one or multiple comparisons for DEG analysis. In each comparison, differentially expressed genes are identified with a limma model, followed by GSEA pathway analysis and GO enrichment analysis, which explore the enrichment of DEGs in diverse pathways.
The reports for DEG and pathway analysis are attached to each comparison once the analysis is complete.
The DEG analysis report contains a table summarizing the up- and down-regulated DEGs, along with a heatmap that clusters DEG expression (up to the top 1000 DEGs).

Parameter for gene set analysis
Design comparison for DEG analysis

Report for DEG, GSEA and GO analysis
In the report for GO enrichment analysis, bar plots show the significance of enrichment of up- or down-regulated DEGs in different pathway databases, e.g., GO, KEGG and WikiPathways.
Similarly, in the report for GSEA analysis, enrichment results for up- and down-regulated DEGs in pathways from the MSigDB database are plotted separately with their significance level (FDR).

Pathway analysis for down-regulated DEGs
Variables Plot shows the weights of the top contributing genes in each PC.
Variable Data summarizes the weight of each gene in each individual PC.
Individuals Plot shows the relationship among samples in the space spanned by the selected PCs.
Individual Data summarizes the score of each sample on each individual PC.

Save results
The PCA results can be saved and loaded again later.
Users can upload their own data matrix or pre-calculated data for PCA analysis and visualization.

Meta-Analysis
Meta-Analysis can be used to identify genes that are changed consistently across multiple projects. It is listed as one functional module in toolbox panel. In the example below, we are looking for the most significant DEGs in three comparisons.

Upload own data for PCA analysis
Graphic options. You can use attributes to define sample color or shapes.

Data matrix for PCA
The meta-analysis pipeline computes three types of results: 1) Maximum p-value (maxP). This method targets DEGs that have small p-values in all comparisons. We recommend using maxP if you are looking for DEGs that are common to several studies. 2) Fisher's p-value. Fisher's method sums the log-transformed p-values obtained from the individual studies. This p-value combination method is useful if you want to identify DEGs in any of the comparisons. 3) A simple counting method, which reports how often a gene is classified as an up- or down-regulated DEG across all the comparisons. The default DEG cutoff is a two-fold change and FDR<0.05, but users can change the cutoff.
In most cases, combining maxP (smaller values are more significant) with the counting method (e.g., up-regulated in 50% of studies) gives the most biologically relevant results for genes that are consistently regulated across comparisons.
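The three summaries can be sketched for a single gene as follows. This is an illustration under stated assumptions, not the CHOmics implementation: maxP is returned as the largest per-comparison p-value (whether the tool transforms it further is not stated), Fisher's method uses the standard chi-square reference distribution with 2k degrees of freedom, and the function names and input values are hypothetical.

```python
import math


def max_p(p_values):
    """maxP: the largest p-value, small only if every comparison is significant."""
    return max(p_values)


def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) follows chi-square with 2k df under the null.

    All p-values must be greater than zero.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2
    # Closed-form chi-square survival function for even degrees of freedom (2k):
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def up_fraction(log_fcs, fdrs, lfc_cutoff=1.0, fdr_cutoff=0.05):
    """Counting method: fraction of comparisons calling the gene an up-regulated DEG.

    The defaults mirror the two-fold change (log2FC >= 1) and FDR < 0.05 cutoff.
    """
    hits = sum(1 for lfc, fdr in zip(log_fcs, fdrs)
               if lfc >= lfc_cutoff and fdr < fdr_cutoff)
    return hits / len(log_fcs)


# Hypothetical results for one gene in three comparisons.
p_values = [0.001, 0.0004, 0.02]
log_fcs = [1.5, 2.1, 0.8]
fdrs = [0.01, 0.005, 0.2]

print("maxP:", max_p(p_values))
print("Fisher:", fisher_combined_p(p_values))
print("up fraction:", up_fraction(log_fcs, fdrs))
```

In this example the third comparison is weak, so maxP stays comparatively large while Fisher's combined p-value is still small. That contrast is why maxP suits the search for genes changed in every comparison and Fisher's method suits genes changed in at least one.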

Select genes for visualization and other analysis
In the above example, we used a relatively loose filtering criterion (N.data.points>1, up-regulation in more than 30% of studies, and Combined_Pval_MaxP<=0.0001) because only a small number of genes pass the stringent default criterion.
The data table shows the genes that pass the filters. The table can be sorted by maxP value. A different filter can be applied to get down-regulated genes.
The results can be saved for future access. There are also links to several other tools. The download meta data link saves a CSV file that contains the results for all genes.
Next, we select all the genes that pass the filter by checking the box for all listed genes, and use the bubble plot to visualize the results.
The resulting bubble plot will show all three comparisons for each gene.
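For users who download the meta-analysis results as a CSV file, the relaxed up-regulation filter quoted above could be reproduced offline roughly as follows. N.data.points and Combined_Pval_MaxP are taken from the text, whereas the Up_Percentage column name, the gene names and the values are hypothetical placeholders.

```python
def passes_up_filter(row, min_points=1, min_up_pct=30.0, max_maxp=1e-4):
    """Apply the relaxed up-regulation filter described in the text."""
    return (
        row["N.data.points"] > min_points
        and row["Up_Percentage"] > min_up_pct  # hypothetical column name
        and row["Combined_Pval_MaxP"] <= max_maxp
    )


# Hypothetical rows, as they might be read from the downloaded CSV file.
rows = [
    {"Gene": "GeneA", "N.data.points": 3, "Up_Percentage": 66.7, "Combined_Pval_MaxP": 5e-5},
    {"Gene": "GeneB", "N.data.points": 3, "Up_Percentage": 100.0, "Combined_Pval_MaxP": 1e-6},
    {"Gene": "GeneC", "N.data.points": 1, "Up_Percentage": 100.0, "Combined_Pval_MaxP": 1e-7},
    {"Gene": "GeneD", "N.data.points": 3, "Up_Percentage": 33.3, "Combined_Pval_MaxP": 0.01},
]

# Keep the genes that pass, most significant (smallest maxP) first.
selected = sorted(filter(passes_up_filter, rows),
                  key=lambda r: r["Combined_Pval_MaxP"])
print([r["Gene"] for r in selected])  # ['GeneB', 'GeneA']
```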

Unclick p-value and FDR to show only logFC
The data table below the bubble plot can also be used for filtering. Recall that in the advanced settings we chose to display logFC only, which makes it easier to look for genes whose direction of change reverses between time points. The logFC values are color coded (red for increase, blue for decrease), so we can see that most genes show upregulation in D84, then downregulation in D96, and then upregulation in D108.
You can also redo the plot, check all columns to include p-value and FDR in the table, and export the results to an Excel file.
The workflow above uses up-regulated genes as example. You can get down-regulated genes from the filter step in meta-analysis result page.

Visualize Gene Expression
CHOmics provides tools to visualize gene expression levels across multiple genes, samples and omics types. For each gene, you can view its expression levels across multiple samples.

View Gene Expression from multiple samples
Choose the Gene Expression tool from Toolbox -> Gene Expression Plot from top menu, and enter the official symbol of genes or load gene list from saved lists. Alternatively, in the gene details page, click View Gene Expression link.

View Gene Expression in Heatmap
Heatmap can be useful to visualize gene profiles from multiple samples. It can also provide information about how genes and samples cluster.
You can enter genes and samples in the box, or quickly load pre-saved genes and samples from your collection. By default, the gene expression data are log2 transformed and scaled across samples for each gene, and the scaled values are limited to the range -3 to 3 before the data are displayed in the heatmap. This works well in most situations, but advanced users can change the options. For example, to keep the samples in the order you entered them, uncheck "Cluster Samples".
The heatmap is rendered by CanvasXpress. You can change the plot size if needed.
In the example heatmap, we entered a few genes that are significantly differentially expressed between time D72 and time D108. The heatmap clustering shows that the samples cluster clearly by time point, with the expression of most genes increasing over time.
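The default heatmap preprocessing (log2 transform, per-gene scaling across samples, then limiting to -3..3) can be sketched for one gene as below. The pseudocount of 1 and the use of a z-score with the sample standard deviation are assumptions, since the exact transformation and scaling used by CHOmics are not specified here.

```python
import math
import statistics


def scale_for_heatmap(values, pseudocount=1.0, limit=3.0):
    """Log2-transform one gene's values, z-score them across samples, clip to +/-limit."""
    logged = [math.log2(v + pseudocount) for v in values]  # pseudocount is an assumption
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)
    if sd == 0:
        # A gene with constant expression carries no pattern; show it as neutral.
        return [0.0] * len(logged)
    return [max(-limit, min(limit, (v - mean) / sd)) for v in logged]


# Hypothetical expression values for one gene in 12 samples, one extreme outlier.
expression = [1] * 11 + [1023]
scaled = scale_for_heatmap(expression)
print([round(v, 2) for v in scaled])
```

Clipping keeps a single extreme sample from stretching the color scale and washing out the differences among the remaining samples.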

Multi-omics Expression View
Besides plotting transcriptomics data, CHOmics also enables the visualization of other types of omics data, such as proteomics, and comparison across omics types.
Here is an example that compares gene expression (transcriptomics) and protein expression (proteomics) of the gene CTSA at different time points, using the 'Gene Expression Plot' tool. By right-clicking the plotting area, users can group the samples by treatment time point while segregating the data by omics type (i.e., Samplesource).

Dashboard View of Comparison
The dashboard shows a summary of all the comparisons.
The above dashboard shows the comparisons from different Categories, Cell Type, Disease State, Treatment, Platform, etc. Below the dashboard, there is also a table listing all the comparisons.
In addition, users can set Dashboard Preference to change how the comparison summary is displayed.

Display Options
Platforms used to generate data for these diseases
In the bubble plot, the X-axis shows the log2 fold change of the comparison and the Y-axis shows 'Case_treatment'. Each dot represents the result for this gene from one comparison. The color of the dot represents 'Case_Samplesource' (here set to the omics type), and the size of the dot represents significance (larger is more significant).
The user can click the color legend on the right to select or deselect omics types. Hovering over a dot shows more details, and clicking the dot links to other graphs.
The toolbar at the top right corner allows the user to zoom and pan the graph.
The screenshot below shows the same bubble chart after selecting one omics type (i.e., transcriptomics) and zooming into a portion of the chart.

Select omics to show in bubble plot
Move over mouse to click to see comparison detail

Bubble Plot of Multiple Genes and Multiple Comparisons
It can be useful to look at a set of genes (e.g., all differentially expressed genes, or genes from a certain pathway) in a set of related comparisons (e.g., all from the same disease).
To view this type of bubble plot, select the link for Multiple Genes vs. multiple comparisons.
In the Genes and Comparisons Bubble plot window, you can enter the gene symbols and the comparison names. However, it is much easier to use the saved genes and saved comparisons features, or other tools in the system, to quickly get a gene set. Please see below for details.
In the example below, we use the dashboard to select 6 comparisons for different time points in CHO cell lines. We save the comparisons and load them into the bubble plot tool. For the gene list, we take the upregulated genes from the comparison D72 vs D108 and paste them into the gene names field.

Enrichment from Up and Down Regulated Genes
When you view the details of a comparison, the functional enrichment results are shown. Briefly, for each comparison, we generate the up- and down-regulated gene lists and compare these lists with all genes in the genome to identify functions that are significantly enriched.
In the example above, the comparison is D108 vs D72, and the top up-regulated biological processes are response to virus and immune effector process.
Clicking the left menu switches the bar charts between categories (Gene Ontology, KEGG, Molecular signature, Protein domain, etc.).
The bar charts show the top 10 categories. To view the complete results, click the Enrichment Report.
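Comparing a gene list against all genes in the genome is commonly done with a one-sided hypergeometric (over-representation) test. The text does not name the test CHOmics uses, so the sketch below is only an assumed illustration of the idea, and the function name and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(genome_size, term_size, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn at random from a genome
    of genome_size genes, term_size of which are annotated to the term."""
    upper = min(term_size, list_size)
    total = comb(genome_size, list_size)
    tail = sum(
        comb(term_size, k) * comb(genome_size - term_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 300 up-regulated genes, 12 of which fall in a
# 150-gene term, against a background of 20000 genes.
print(hypergeom_enrichment_p(20000, 150, 300, 12))
```

A small p-value means the term contains more genes from the list than random sampling would explain. In practice the p-values across all tested terms would also be corrected for multiple testing.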

View full report
In the enrichment report, the full list of functional terms is shown in order of p-value.

View Changed Genes from a Functional Term in Volcano Plot
From the bar chart, click a functional term, and you have the option to view its genes in a volcano plot.
Once you click the link in the popup window, volcano plot will be generated for the comparison with the changed genes from the selected term highlighted.

Gene Set Enrichment from Ranked Genes
For each comparison, we produce a rank file for all genes using logFC. We use PAGE (Parametric Analysis of Gene Set Enrichment) to identify significant biological changes. PAGE can be more sensitive for comparisons where the logFC is relatively small, but most genes in a functional set show the same direction of change.
The predefined gene sets were from MSigDB.
For each comparison, the top up-regulated and down-regulated gene sets are plotted.
To view the full list of gene sets, you can click the report for genes as shown in following figure.
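The PAGE statistic mentioned above can be sketched as follows. As published, PAGE computes a z-score for a gene set as Z = (Sm - mu) * sqrt(m) / sigma, where mu and sigma are the mean and standard deviation of logFC over all genes, Sm is the mean logFC of the gene set and m is its size. The use of the population standard deviation, the two-sided normal p-value and all names and values below are assumptions for illustration.

```python
import math
import statistics


def page_z_score(all_log_fc, gene_set):
    """PAGE z-score of a gene set, given logFC values for all genes."""
    values = list(all_log_fc.values())
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population SD over all genes (assumption)
    members = [all_log_fc[g] for g in gene_set if g in all_log_fc]
    m = len(members)
    set_mean = statistics.mean(members)
    return (set_mean - mu) * math.sqrt(m) / sigma


def two_sided_p(z):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical logFC values; the gene set shifts modestly but consistently upward.
all_log_fc = {"g1": 0.4, "g2": 0.5, "g3": 0.3, "g4": -0.6,
              "g5": 0.1, "g6": -0.4, "g7": -0.2, "g8": -0.1}
z = page_z_score(all_log_fc, {"g1", "g2", "g3"})
print(round(z, 3), two_sided_p(z))
```

Because the statistic grows with the square root of the set size, a set whose members all move modestly in the same direction can reach significance even when no single gene would pass a DEG cutoff.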

Multi-layer visualization
If you are interested in a particular pathway, sometimes it is useful to map the RNA-Seq or microarray data to the pathway for visualization.

Check gene set
The user can add multiple comparisons from the pathway plot tool by clicking Add Comparison link.
Besides showing log2 Fold Change, the user can also show statistical significance by clicking Enable Second Visualization Columns.
The pathway plot will now have multiple color bars corresponding to the different comparisons.
Choose multiple comparisons

Customized analysis pipeline

Use alternative tool or algorithm
The analysis pipeline is modular, and each step can be modified by users to apply an alternative method if desired. Users should be familiar with Linux bash to run the analysis steps and with PHP programming to modify the source code.
The full analysis pipeline has four steps, and each step is listed in a bash file in the analysis folder in the system. These bash files are created by the PHP program chomics/app/bxgenomics/bxgenomics_exe_analysis.php when users launch the analysis pipeline online in a web browser via chomics/app/bxgenomics/analysis.php.
For example, the current pipeline uses Subread to perform alignment. To switch the pipeline to the STAR program for alignment, users need to take the following steps:
1) Install the STAR program on the server and prepare a STAR index for the CHO genome.
2) Check the commands in step_1.sh and change them as needed. In this case, the Subread command (subjunc step) needs to be replaced by the equivalent STAR command. Since STAR can sort the BAM files, the samtools sort step can be omitted. Finally, because the STAR output file is named SampleIDAligned.sortedByCoord.out.bam, an extra step is needed to rename it to SampleID.sorted.bam so that step2.sh can output gene count files with the correct sample names.
3) Edit the PHP program chomics/app/bxgenomics/bxgenomics_exe_analysis.php, find the part that generates step_1.sh (the section is marked as "Step 1. Alignment with Subread"), and make changes accordingly.
4) Test the updated system to make sure it works as expected.
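The renaming in step 2 is simple, but an error there would leave the gene count files with the wrong sample names. A minimal sketch of the operation is shown below in Python for illustration; in the actual pipeline the equivalent commands would live in the generated bash file. The function name and the demo file are hypothetical.

```python
import tempfile
from pathlib import Path

# Suffix STAR appends to the output prefix when writing coordinate-sorted BAM files.
STAR_SUFFIX = "Aligned.sortedByCoord.out.bam"


def rename_star_outputs(directory):
    """Rename SampleIDAligned.sortedByCoord.out.bam to SampleID.sorted.bam.

    Returns the new file names so the caller can log or verify them.
    """
    renamed = []
    for bam in sorted(Path(directory).glob(f"*{STAR_SUFFIX}")):
        sample_id = bam.name[: -len(STAR_SUFFIX)]
        target = bam.with_name(f"{sample_id}.sorted.bam")
        bam.rename(target)
        renamed.append(target.name)
    return renamed


# Demonstration on an empty placeholder file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "S1Aligned.sortedByCoord.out.bam").touch()
    print(rename_star_outputs(tmp))
```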