Hydra: A mixture modeling framework for subtyping pediatric cancer cohorts using multimodal gene expression signatures.

Precision oncology has primarily relied on coding mutations as biomarkers of response to therapies. While transcriptome analysis can provide valuable information, incorporation into workflows has been difficult. For example, the relative rather than absolute gene expression level needs to be considered, requiring differential expression analysis across samples. However, expression programs related to the cell-of-origin and tumor microenvironment effects confound the search for cancer-specific expression changes. To address these challenges, we developed an unsupervised clustering approach for discovering differential pathway expression within cancer cohorts using gene expression measurements. The hydra approach uses a Dirichlet process mixture model to automatically detect multimodally distributed genes and expression signatures without the need for matched normal tissue. We demonstrate that the hydra approach is more sensitive than widely-used gene set enrichment approaches for detecting multimodal expression signatures. Application of the hydra analysis framework to small blue round cell tumors (including rhabdomyosarcoma, synovial sarcoma, neuroblastoma, Ewing sarcoma, and osteosarcoma) identified expression signatures associated with changes in the tumor microenvironment. The hydra approach also identified an association between ATRX deletions and elevated immune marker expression in high-risk neuroblastoma. Notably, hydra analysis of all small blue round cell tumors revealed similar subtypes, characterized by changes to infiltrating immune and stromal expression signatures.

The first step is to identify multimodally expressed (ME) genes in your expression dataset. After this, there are two main usage modes for hydra that use the ME gene list: supervised gene set analysis using the sweep command, and unsupervised gene-set clustering analysis using the enrich command. In both modes, a Jupyter notebook is produced to analyze and visualize the results.
The sweep analysis is useful if you are interested in investigating gene sets or gene signatures that have known relevance in your disease of interest. The results of the sweep command can identify gene sets with the greatest power for differentiating subtypes of samples.
The enrich command is useful for generating new hypotheses about subtype-specific expression by identifying novel subtype clusters in your expression data through multivariate clustering with the ME gene list.

Options
Run the hydra command by itself or with the -h flag to see descriptions of all options and arguments:

    docker run -it -v $PWD:/data jpfeil/hydra:0.2.4 -h

Test:
Test data is available in the hydra/test directory along with a bash script with example commands.
Step 1: Identify Multimodally Expressed Genes Using filter
Use the filter tool to identify multimodally expressed genes in your expression dataset. This will generate a MultiModalGenes directory. The next step in the pipeline is to perform supervised or unsupervised clustering analysis with sweep or enrich.
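To make the idea of a multimodally expressed gene concrete, here is a minimal sketch in plain Python. It is not hydra's implementation: hydra uses a Dirichlet process mixture model, whereas this sketch swaps in a simpler technique, comparing a one-component and a two-component Gaussian fit by BIC. The function names (`two_gaussian_em`, `is_multimodal`) and the simulated expression values are assumptions made for illustration only.

```python
import math
import random


def normal_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def one_gaussian_loglik(xs):
    """Maximum log-likelihood of the data under a single Gaussian."""
    n = len(xs)
    mean = sum(xs) / n
    var = max(sum((x - mean) ** 2 for x in xs) / n, 1e-6)
    return sum(normal_logpdf(x, mean, var) for x in xs)


def two_gaussian_em(xs, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture (illustrative stand-in).

    Returns (loglik, weights, means); loglik is the value computed in the
    final E-step.
    """
    n = len(xs)
    srt = sorted(xs)
    means = [srt[n // 4], srt[(3 * n) // 4]]  # start at the quartiles
    grand_mean = sum(xs) / n
    start_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, 1e-6)
    variances = [start_var, start_var]
    weights = [0.5, 0.5]
    loglik = float("-inf")
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        loglik = 0.0
        for x in xs:
            logp = [math.log(weights[k]) + normal_logpdf(x, means[k], variances[k])
                    for k in range(2)]
            top = max(logp)
            probs = [math.exp(lp - top) for lp in logp]
            total = sum(probs)
            loglik += top + math.log(total)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-9)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6)
    return loglik, weights, means


def is_multimodal(xs):
    """Flag a gene as multimodal when the two-component fit has the lower BIC."""
    n = len(xs)
    bic_one = 2 * math.log(n) - 2 * one_gaussian_loglik(xs)   # mean, variance
    bic_two = 5 * math.log(n) - 2 * two_gaussian_em(xs)[0]    # 2 means, 2 vars, 1 weight
    return bic_two < bic_one


if __name__ == "__main__":
    random.seed(0)
    # Simulated expression for one gene: two well-separated sample groups.
    expression = ([random.gauss(0.0, 1.0) for _ in range(100)]
                  + [random.gauss(8.0, 1.0) for _ in range(100)])
    print("multimodal:", is_multimodal(expression))
```

A Dirichlet process mixture, as used by hydra, does not require fixing the number of components in advance, so this two-component comparison only approximates the simplest bimodal case.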

Flags
Step 2: Identify coordinated expression of multimodally expressed genes
After running the sweep analysis, you can use the Jupyter notebook to analyze the results (see the last section of this README).

Option 2: Unsupervised Enrichment Analysis Using enrich
The enrich command finds enrichment of multimodally expressed genes within a user-defined database of gene sets. There are two ways to perform the enrich analysis: with the command-line tool or in a Jupyter notebook. We recommend the Jupyter notebook approach because it provides more flexibility for investigating clusters. The command-line approach is presented first, but we encourage you to also read the Jupyter notebook approach below.
The enrich method includes an important parameter known as the minimum component probability. This is an additional filter to remove multimodally expressed genes that influence a small subset of your samples.
This parameter lets you restrict the enriched genes to those that influence a larger number of patients. Before running enrich, you can use the scan command in the Jupyter notebook to adjust the minimum component probability during multivariate clustering. After running enrich, you can use the Jupyter notebook to analyze the results (see next section).
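The effect of the minimum component probability can be sketched as follows. This is an illustration under stated assumptions, not hydra's code: it assumes each multimodal gene has a list of mixture component weights and that a gene is dropped when its smallest component weight falls below the threshold. The gene names, weights, and function names are hypothetical.

```python
def filter_by_min_component_prob(gene_weights, min_prob):
    """Keep genes whose smallest mixture-component weight is at least min_prob.

    gene_weights maps a gene name to its list of component weights
    (hypothetical structure; hydra's internal representation may differ).
    """
    return {gene: w for gene, w in gene_weights.items() if min(w) >= min_prob}


def scan_thresholds(gene_weights, thresholds):
    """Count how many genes survive the filter at each candidate threshold."""
    return {t: len(filter_by_min_component_prob(gene_weights, t))
            for t in thresholds}


if __name__ == "__main__":
    # Hypothetical component weights for three genes.
    example_weights = {
        "GENE_A": [0.5, 0.5],        # two balanced expression states
        "GENE_B": [0.95, 0.05],      # second state affects few samples
        "GENE_C": [0.7, 0.2, 0.1],   # three states, one of them small
    }
    for threshold, n_kept in scan_thresholds(example_weights, [0.02, 0.08, 0.2]).items():
        print(threshold, n_kept)
```

Raising the threshold keeps only genes whose expression states each cover a larger share of the cohort, which is why the number of surviving genes shrinks as the threshold grows.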

Perform GO enrichment clustering across multimodally expressed genes
Step 3: Analyze results with Jupyter notebook
The Jupyter notebook is an interactive environment for investigating expression data. It comes with all of the hydra code and dependencies pre-installed. When prompted, enter the token printed to stdout.
The first step is to add the hydra library to your path from within the docker container.

    import sys
    sys.path.append('/opt/hydra/')
    import library.analysis as hydra

If working with sweep results, you can use this code snippet to identify which gene-sets are "hits".

    hits = hydra.SweepAnalysis().rank(<path to MultivariateAnalysis directory>)
This provides the number of clusters identified and the Kullback-Leibler divergence, which is a measure of how different the clusters are in expression space. We recommend prioritizing gene-sets with a large Kullback-Leibler divergence, to identify clusters that have significantly different expression patterns.
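For intuition about why a larger Kullback-Leibler divergence indicates better-separated clusters, here is a small sketch using the standard closed form for Gaussians. How hydra computes the divergence internally is not shown in this README, so this is a generic illustration: it assumes each cluster is summarized by per-gene means and variances (a diagonal Gaussian), and the gene-set names and scores are hypothetical.

```python
import math


def gaussian_kl(mean_p, var_p, mean_q, var_q):
    """KL(p || q) in nats for two 1-D Gaussians (standard closed form)."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mean_p - mean_q) ** 2) / var_q
                  - 1.0)


def diagonal_gaussian_kl(means_p, vars_p, means_q, vars_q):
    """KL(p || q) for diagonal Gaussians: the sum of per-dimension terms."""
    return sum(gaussian_kl(mp, vp, mq, vq)
               for mp, vp, mq, vq in zip(means_p, vars_p, means_q, vars_q))


def rank_by_divergence(gene_set_scores):
    """Order gene-set names from largest to smallest divergence."""
    return sorted(gene_set_scores, key=gene_set_scores.get, reverse=True)


if __name__ == "__main__":
    # Two hypothetical clusters described by mean expression of two genes.
    close = diagonal_gaussian_kl([0.0, 0.0], [1.0, 1.0], [0.5, 0.5], [1.0, 1.0])
    far = diagonal_gaussian_kl([0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [1.0, 1.0])
    scores = {"SET_CLOSE": close, "SET_FAR": far}
    print(rank_by_divergence(scores))
```

Note that the divergence is asymmetric, so KL(p || q) and KL(q || p) generally differ; summing the two directions is one common way to obtain a symmetric score.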
It is also possible to perform the enrich analysis in a Jupyter notebook on your laptop. All you need is the path to the input expression matrix and the directory of MultiModalGene models.
The minimum component probability filter can be used to tune the resolution of the clustering analysis with respect to the number of samples available. We provide a method ScanEnrichmentAnalysis to explore how the minimum probability thresholds influence gene set enrichment and the number of clusters. We also provide routines for characterizing clusters using GSEA. This analysis provides the gene-sets that are enriched in each cluster. This can be used to identify biological themes within the cluster, including the tumor microenvironment state and druggable pathway expression.