An R package for divergence analysis of omics data

Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample level analysis, and is applicable on many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with data from the Cancer Genome Atlas.


Introduction
The technologies that provide us with high-dimensional omics data continue to advance at a rapid rate. Particularly in the last decade, the available modalities of omics data have considerably expanded and now include, among others, coding and noncoding RNA expression, micro RNA expression, protein expression, epigenetic profiling related to histones and CpG methylation, copy-number and mutation profiling. As a result, the analysis of multi-modal omics data has become indispensable in many domains of biological and medical research.
Whereas the quantity and quality of available data has appreciably increased, the results of research based on such data are often not reliable, robust and replicable. A key challenge has been to quantify the level of variability and diversity in omics profiles in a given population, and to separate normal and technical variability from abnormal variability indicative of a biological property such as disease.
Recently we have introduced divergence analysis as a method for simplifying high-dimensional omics data for bioinformatic analyses [1]. This method conceptually parallels the widespread use of deviation from normality as a disease marker in clinical testing, such as a blood based prostate specific antigen (PSA) test [2,3]. Given a high-dimensional omics data profile, it can be converted to a binary or ternary string of the same length, where each value now indicates how the original value diverges from a baseline or reference population. The features of interest may be univariate such as a gene, a CpG site, or a protein, in which case the original a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 profile is converted to one of three labels 0, -1 and 1 indicating whether the level of a feature is, respectively, within the reference range for that feature, or below or above that range. Or, the features of interest may be multivariate, such as sets of genes representing pathways of interest. In this case, divergence coding is binary, where a 0 or 1 for a multivariate feature indicates, respectively, that the set of feature is inside or outside the support of the multivariate, baseline distribution.
Divergence coding can be used for a wide variety of types of data analysis: class comparison and prediction, feature selection, regression, combination of omics modalities, and many others. Here we present how to use this package and showcase the many different ways in which the divergence coding can be manipulated to perform interesting analyses. In the following we use breast normal and tumor samples from the TCGA project [6] for which RNA-seq expression profiles as well as methylation profiles are provided. First we use some basic examples to illustrate the workflow, followed by more complex types of data analyses that combine multiple omics modalities, and demonstrate how to use divergence in both the univariate and multivariate settings. A subset of this data is available with the R package.

Methods
We may summarize the divergence method as follows. Before computing the baseline range (univariate) or support (multivariate) and the resulting divergence coding, a rank-transformation is applied to all data. Consider an omics sample represented by a vector X = (X j ), j = 1..m, m being the dimensionality of the omics profiles. The following transformation is applied which converts the data to a normalized rank, with the minimum being zero. Q j ðXÞ ¼ ji 2 1::m : 0 < X i � X j j ji 2 1::m : 0 < X i j Then Q j (X) 2 [0, 1]. The zero minimum is particularly useful in preserving the zero valued mass usually observed in many omics data modalities, such as RNA-Seq. Now consider a multivariate feature indexed by S-i.e. a subset of the given m features (the univariate scenario is a special case when |S| = 1). We will denote the corresponding subset of a given omics profile X following the quantile transformation as Q(X) S . Suppose we have n such profiles that constitute the baseline group. We estimate the baseline support as follows: given a parameter γ 2 [0, 1], we compute l which is the floor of nγ. Then from each baseline sample k, if r k is the distance from Q(X) S to it's l th nearest neighbor in the multivariate feature space. If we denote the sphere around Q(X) S of radius r k as U S k then the support of the baseline is the union of the regions covered by these spheres around each baseline sample, which we denote asÛ S : Following the baseline estimation in this manner, for any given omics profile X, the divergence coding Z(X) S can be computed as: For the univariate scenario, the support is the union of a series of intervals. However, we apply a further simplification by replacing these with a single interval spanning the lowest end to the highest end of these intervals. Accordingly the divergence coding becomes ternary: Z(X) j 2 {−1, 0, 1} with −1 indicating a value below the baseline interval for feature j, 1 indicating a value above the baseline interval, and 0 indicating no divergence.
In practice we use two more parameters, α and β. The β parameter is used for allowing a certain number of outliers to be exempted from the baseline. Given the radii of the spheres r 1 .. r n , let � r be the (1 − β) th percentile of these values. Then we select only the spheres with r k � � r to compose the baseline. Once the baseline is computed, we can compute the divergence coding for the baseline samples; then α is the average proportion of divergent features (multivariate or univariate) among the baseline cohort. As discussed in the following section, we specify α and β and then select the γ value that fits these specified parameters. However the functionality for estimating a baseline for a specified γ parameter is available in the package as well.
For a more detailed description of the method, see [1].

Univariate workflow
To carry out an analysis based on divergence, the first step is to determine the case and control samples. The control cohort here will be used for computing the baseline interval for each feature, and we will refer to it as the baseline cohort or baseline group. Once the baseline is computed, the divergence values for the case cohort will be computed with reference to the baseline.
What data should be used as the baseline cohort depends on the problem at hand and needs careful investigation based on the biological or clinical questions of interest that are being investigated. In many scenarios of disease based data, and in particular cancer, normal samples would be a good choice. The more normal samples available, the more robust the baseline will be.
All samples, both in the case cohort and the baseline cohort should have the same features -for example genes or microarray probes, and be from the same platform (e.g. RNASeq or a specific microarray platform). Ideally the baseline cohort should be from the same experiment to avoid study-specific batch effects as much as possible. In general we suggest at least 20 samples be available to compute the baseline.
The divergence algorithm applies a scaled rank transformation to all samples before the digitization of the data. The package provides functionality for the user to compute this rank transformation manually, or it can be set to compute internally when the divergence computation is requested by the user. No other normalization procedures are applied, and the user may apply any normalization procedures beforehand as necessary.
There are three parameters involved in the divergence computation: α, β and γ. All parameters are in the (0, 1) range. The β parameter adjusts the support to account for a certain percentage of outliers included in the baseline data, and the γ parameter provides a way to widen or tighten the support around each baseline sample. The closer γ is to 1, the further the support around each normal sample will extend. For a given support, the α value is simply the average number of divergent features per sample for the baseline cohort.
In usage, we usually provide the α and β values and a range of possible γ values and let the package find the most appropriate γ value out of these for the given α and β values. By default, the package uses α = 0.01, β = 0.95, and γ 2 {0.01, 0.02, . . ., 0.09, 0.1, 0.2, . . ., 0.9} as a list of candidate γ values. Thus if you use these default values, the package will consider 95% of the baseline cohort to be included in the support and find the smallest possible γ value from the given list that will provide the average number of divergent features per sample in the baseline cohort to be 1% or less. For more details, see [1].
We use RNA-Seq data spanning 20530 genes from the TCGA Breast Cancer dataset [6,7] for the following analysis. This data consists of 1097 tumor samples and 114 normal samples, the latter which we will use as our baseline cohort.
Using the default parameters available, we compute the digitized ternary divergence format for the tumor samples with respect to the normal baseline of 114 samples. The algorithm selects a γ value of 0.3, which yields an α value of 0.007 which is below the α threshold of 1% which was specified as the default.
The tumor data contains 601 ER+ and 179 ER-samples. We can see whether the number of divergent genes per sample are similar between the two ER groups or not (Fig 1).
As another example of using the size of the divergent feature set as an indicator of the divergence of the sample, we can compare it against clinically obtained covariates such as the percentage of normal and tumor cells found in the sample. We see that these percentages track with sample level divergence (Fig 2). Table 1 shows the correlations obtained with the percentages of normal, tumor, and stromal cells provided from pathologist observations in the TCGA data. A similar comparison can be made with the cell type enrichment scores provided by xCell, a method which provides a score for gene expression samples suggesting the level of enrichment in the sample for different cell types [8]. As see in Table 1, the number of divergent genes in breast tumor The size of the divergent set in each sample-i.e. the set of features that are divergent in a given sample-can be uses as a measurement of the divergence of a sample. This allows the comparison of sample level divergences between sample groups. When digitized with respect to a normal breast basenline, the boxplots compare the number of divergent features per sample between ER+ and ER-breast tumors. Given the higher risk associated with ER-breast tumors, we expect to observe higher divergence from normality compared to ER+. The same trend is observed if the consideration is limited to the number of upper or lower divergent features only.
https://doi.org/10.1371/journal.pone.0249002.g001 samples show high correlation with the enrichment scores provided for many cell types. These results indicate that some of the divergence observed is reflective of the cell type heterogeneity in the specimen.
To perform a differential expression analysis at the feature level for digitized divergence data, the package provides a χ 2 test functionality, which we can use to perform a χ 2 test for each gene between ER+ and ER-samples. Even simpler, we can merely compute the divergent probability of each gene over the ER+ and ER-samples respectively. Note that the divergent probability of a feature over a group of samples is simply the number of samples for which the feature is non-zero in the divergence space divided by the number of samples. Fig 3 shows the divergent probability of each gene for ER+ and ER-samples. The samples in blue are genes that are significant under a Bonferroni-adjusted P � 0.05 threshold from the χ 2 test. The ESR1 gene, which can be considered a marker for ER status, is highly significant and is colored in red. Table 2 shows the top ten genes by rank of χ 2 test p-value.
We can observe how these genes track with the ER level by comparing the expression level to that of ESR1 expression, both in the regular expression value space and in the divergence space. Figs 4 and 5 show the values of ESR1 expression among ER+ and ER-breast tumor samples against two of the genes highly differentially expressed between the two sample groups, ACADSB and PSAT1, in the regular expression space (log 2 transcripts per million) and in the divergence space.

Multivariate workflow
In the multivariate scenario, the features of interest are composite-that is, they are sets of univariate features, for example a gene set indicating a disease specific signature or a pathway. The workflow is similar to that of the univariate case, with the exception that the divergence coding will be either 0 or 1-that is, it indicates whether the feature is divergent or not, whereas for the univariate case it provided a direction of divergence as well.
With the same breast gene expression data from TCGA which we have used so far, in the following we examine the divergence coding obtained for the KEGG pathways from MSIGDB gene set collection [9]. Computing the divergence coding is similar to the univariate case, except that the multivariate features need to be specified. As with the univariate case, the reference sample data and the samples for which the divergence coding need to be computed are provided in matrix form, along with a range of γ values to choose from, a β value, and the required α threshold. The multivariate features are provided as a list, where each element of the list is a vector of univariate features which are available in the data matrices provided. In the example we present, the same TCGA sourced RNA-Seq expression data used previously are used. The KEGG pathway list has 186 gene sets, where each set ranges from 10 to 389 gene symbols. These gene symbols are available in the RNA-Seq expression data and will be used to estimate the baseline support from the matched normal samples and the placement of the tumor samples either within or outside the support in the high dimensional space representing each KEGG gene set. The output will be a matrix comprising of the binary divergence coding for each KEGG pathway and each tumor sample.
Given that the divergence coding is a highly simplified representation, we can simply observe it in terms of either the features or the samples and observe the proportion of values in Table 1. Correlation between cell type signatures and divergence score. The divergence score (number of divergent genes in a sample) for breast tumor samples based on gene expression is correlated with the scores from the xCell tool reflecting enrichment in the sample for different cell types. The 20 cell types with the largest spearman correlations are shown. Also shown are correlations between the divergence score with the normal, tumor, and stromal cell percentages provided in the TCGA clinical covariate data. each divergence state. Fig 6 shows the sample proportions for some of the KEGG pathways, and similarly at the sample level in Fig 7. As before, we can apply a χ 2 test to each pathway between the ER+ and ER-sample groups to identify which pathways are highly differentially divergent between the two groups (Fig 8). The pathways that meet a P � 0.05 cutoff after Bonferroni adjustment for multiple testing are shown in blue. Differentially divergent features between two phenotypes. The probability of divergence for each feature is plotted for two sub-groups of breast tumors, ER+ and ER-. Genes far from the diagonal are likely to have highly different divergence patterns between the two groups (e.g. ESR1). Alternatively, a χ 2 test can be performed for each feature between the two groups, and points in blue indicate genes that are χ 2 test p-value significiant at a bonferroni adjusted P � 0.05 threshold.

Further examples
In this section we provide a series of different example analyses that use RNA-Seq gene expression, microarray gene expression, and methylation 450k data with both univariate and multivariate modes of divergence.
Discrimination between Luminal sub types with KEGG pathways. In this example, we demonstrate using multivariate divergence coding to train a classifier that provides a mechanistically useful interpretation, and examine its robustness across other datasets. Continuing the example provided in the previous section, we use the multivariate divergence coding obtained for the breast cancer tumor samples with respect to a normal breast baseline, with KEGG pathway gene sets as our multivariate feature sets of interest.  The figure indicates that projection onto one of the leading principal components of the divergence data showed a significant ability to differentiate the Luminal A subset of tumor samples from the other PAM50 sub types showed. Based on this observation, we may train a classifier using the binary divergence coding for classifying between Luminal A and Luminal B sub types. Here we opt for decision trees and random forests as our classifiers of interest, given that with binary divergence data, these classifiers can provide more intuitive understanding of  the biological underpinning behind the separation of the two groups, especially decision trees. Each branch of a decision tree is a binary decision of whether a certain KEGG pathway is displaying aberrant activity or not as indicated by divergence.
Candidate features for the classifiers were selected as the top 20 pathways resulting from χ 2tests between the two groups in the TCGA data (231 Luminal A samples and 127 Luminal B samples). The resulting decision tree provides a training accuracy of 0.85, while a random forest classifier provides 0.94. This decision tree is shown in Fig 10. Next, we obtained independent microarray data for breast cancer from the GPL96 platform from publicly available studies (GEO accession numbers GSE2034, GSE12093, GSE17705, GSE25055, GSE25065) [10][11][12][13]. Due to lack of availability of normal samples from these studies, we used the 'normal-like' sub type as a baseline for obtaining KEGG pathway divergence coding for the Luminal samples. To ensure that the resulting data has a similar level of divergence to that from TCGA, we took the samples that would not be used for testing (i.e. omitting the Luminal samples), and computed divergence coding for the KEGG pathways for a range of γ values. From the resulting divergence codings, we selected the γ parameter that provided the highest correlation in divergence probabilities for the KEGG pathways to that from TCGA data, and used this setting to compute divergence codings for the Luminal A and B samples that would be used for testing (127 Luminal A samples, 66 Luminal B samples). Validating the above classifiers on these microarray samples provided accuracies of 0.64 and 0.71 from the decision tree and the random forest respectively (see Table 3).
We believe with larger datasets and more complex training and tuning procedures, more robust classifiers that work well in cross-study cases may be developed. Our aim here is to demonstrate a fairly simple example of deriving a classifier from divergence that provides some understanding of the possible biological mechanisms underpinning the classifier, while also providing some degree of portability to a different dataset for validation.
Gene set enrichment. Traditional types of analysis can also be performed with divergence coding if so desired. Gene set enrichment analysis (GSEA), which is the standard approach to examining whether a given ranking of genes in enriched in a collection of gene sets, is one such method we showcase here. For this example, we use the univariate gene expression divergence coding computed for breast tumor samples, along with the chemical and genetic perturbations collection of gene sets from the MSigDB project [9]. As a simple example, we use the ER+ and ER-sets of samples used previously. A noteworthy point here is than unlike a traditional setting where differential expression analysis performed on a binary phenotype comparison is used for obtaining the gene rankings, with divergence any set of features can be ranked over a single set of samples without the need for a binary comparison. Thus, we compute the divergence probabilities for genes for ER+ and ERsamples respectively, each set of samples providing us a ranking of genes. These rankings are then used to perform two sets of GSEA analyses, and the comparison of the results can be used to identify gene sets that are enriched with different directionalities in the two groups (see Fig  11). While in these results we compare the two ER groups, any ER group, or any other set of samples-may be analyzed with GSEA using divergence without the need for a binary phenotype comparison.
Divergent CPG clusters. Here we use methylation data from the TCGA breast cancer cohort with divergence. The data we use are the methylation beta values in the [0, 1] range indicating the proportion of methylation observed at a given CpG. As before, we use the methylation data for the TCGA normal samples as a baseline to compute divergence for the tumor samples. The mode of divergence used here is univariate divergence-i.e. we obtain a divergence coding for each sample at each CpG.
We can now use this data to look for differentially methylated regions. By arranging the CpG positions along the genome by chromosome and location, we may cluster the CpGs into regions of interest. Here we cluster CpGs in each chromosome into clusters such that that all adjacent CpGs in a cluster are no more than 300 bp apart. Then we simply compute the average sum of divergence observed over the tumor samples in each cluster. Suppose we have k clusters; given i = 1..n samples, for cluster r k this is with z i,j being the binary divergence coding for CpG j in the cluster and sample i.
Further, we can assign a p-value to each region by computing a null distribution for the clusters through permutation of the CpG labels. The following CpG clusters appear to be the most highly divergent in breast tumor samples (Table 4).
We can compare these results to those obtained via bumphunter [14], a popular package used for finding differentially methylated regions. Fig 12 shows the cluster divergence sums (prior to normalization by cluster size) obtained for the top 10,000 clusters against the areas computed from bumphunter with the methylation data. Bumphunter searches for 'differentially methylated regions', or DMRs, which are regions within the specified clusters where all adjacent CpGs are differentially methylated between normal and tumor. For each cluster we  take the DMRs found by bumphunter for that cluster and take the area (which indicates the degree of differential methylation as measured by a regression coefficient in the bumphunter model, summarized for the region) for the most significant DMR therein for comparison with our divergence based approach. The results show a strong agreement for the clusters with high amounts of divergence. Significant gene-CpG associations. Having computed univariate divergence for two omics modalities-both methylation and gene expression-in this example we combine the two to discover what CpG-gene pairings show significant simultaneous divergence from both modalities. We are provided a table of CpGs mapped to nearby genes such that each CpG is mapped to a gene (with a single gene being mapped to one or more CpGs). We can compute the proportion of samples for which a given CpG-gene pair are both divergent.  By considering the ternary form of divergence (i.e. −1/0/1), in addition to this proportion we can also compute the 'concordant' and 'discordant' proportions-i.e. the proportions of samples for which the sign of divergence are also in agreement. Thus samples for which both the gene and CpG are divergent with a + 1, or both with a −1-would be considered concordant, while other samples would be considered discordant. Table 5 shows the 15 gene-CpG mappings with the largest measurements of co-divergence proportions. Further, we compared the values of p j,k with the correlation for the gene-CpG pair between the gene expression and methylation values. Fig 13 shows the results for the 10,000 gene-CpG pairs with the largest p j,k values, indicating pairs with extremely positive or negative correlations are more likely to indicate high divergence. Some of these pairs are labeled with the associated gene. Fig 14 shows these pairs with the distance between the CpG and the associated gene as provided by the annotation, showing the CpGs mapped to a position near the gene are more likely to indicate co-aberrant activity as expected.
Survival analysis with combined divergence. In this example, we compute and use codivergence in the gene expression and methylation modalities in a different way. Using the gene-CpG annotation pairing previously mentioned, we use multivariate divergence to compute divergence for methylation data at the gene level. That is, sets of CpGs that are mapped to a single gene are treated as multivariate features of interest; multivariate divergence is computed for them using a normal methylation baseline. For genes mapped to a single CpG, binarized univariate divergence is computed. This provides us with a gene level divergence computed from the methylation data, which we can now combine with the gene level divergence computed previously with the gene expression data. The two divergence codings for the same genes and samples-one from the methylation data and one from the gene expression data-can now be 'added'. The binary values are combined to indicate co-divergence in both modalities. Thus for a given sample and gene, a 1 indicates the sample being divergent by gene expression level at that gene, and also being divergent in the CpG methylation space annotated to that gene (0 indicating no divergence in one or both modalities).
The co-divergence probabilities from this resulting data, for any gene considered, is an indicator of how likely the gene is demonstrating aberrant activity in both the transcriptome and the methylome simultaneously. The gene divergence probabilities resulting from this data are similar to the gene-CpG pair proportions computed in the previous example.
We now turn to look at relapse free survival (RFS) with this co-divergence data. We select tumor samples that are in stages I or II. This provides 23 samples with recorded relapses, and 260 censored samples. The selection criteria here is reflective of patients for whom existing genomic diagnostic tests such as MammaPrint [15,16] are offered for risk evaluation.
To examine genes of interest with respect to relapse, we first select samples that have a censoring time of greater than 3 years among those with no relapse observed. Genes that have a divergence probability �0.3 in at least one group (i.e. between the relapsed or censored) are selected, followed by χ 2 -tests performed to select differentially divergent genes between the two groups. We select the resulting top 20 genes to fit to a cox proportional hazards survival model.
Genes are iteratively added to the model fit, keeping only genes that provide a significant coefficient estimate at a P � 0.05 threshold. Table 6 shows the resulting estimates (log-rank test P < 10 −5 ). Apolipoprotein D (APOD) is primarily down-divergent in the gene expression data, and low expression of it has been noted to be associated with poor prognosis in breast cancer [17,18]. Regulator of G protein signaling 6 (RGS6), which is down-divergent in most tumor samples, has been observed to be a suppressor of breast cancer [19]. BEX1 has been linked to ER+ breast cancer (a greater proportion of ER+ samples are divergent compared to ER-samples), as well as acute myeloid leukemia (AML), chronic myeloid leukemia (CML), and squamous cell cancer [20][21][22]. The human protein atlas data suggests PROZ as a prognostic marker in liver cancer, with higher protein expression observed in a subset of breast cancer samples [23,24]. This suggests further examination of these genes both in the gene expression and methylation spheres for their contribution to increased risk in stage breast cancer.

Conclusion
In the above results we have showcased a variety of analyses that can be performed with divergence at the univariate and multivariate level. While we have used RNA-seq, microarray and methylation 450k data here, the software is applicable to many other modalities of high dimensional omics data, some of which we have showed in [1]. While our previous publication was aimed at explaining the divergence framework from a statistical perspective, here we aim to explain how to use the package and conduct different types of analyses from divergence data with simple, practical examples.
Once the data has been processed as necessary and the baseline cohort identified, the R package can be used to compute the divergence coding quite easily. A full outline of the Table 6. Cox proportional hazards model estimations with relapse free survival data. Genes were iteratively selected to add to a cox proportional hazards model with the combined gene expression and methylation divergence codings to obtain the above coefficients. functions that can be used and the workflows possible are provided in the package vignette [4].
Here we have presented some of the many different ways that the digitized divergence coding can be visualized and analyzed by the user. All the analyses conducted in this mansucript are available as R code through the publicly available github repository https://github.com/wikum/divergenceApplications. (The divergence package is available through https://www.bioconductor.org/packages/release/bioc/html/ divergence.html). The code provides an option for downloading the relevant TCGA data through the TCGAbiolinks package [25].
Divergence provides high utility with respect to omics data analysis. We note that it is a novel framework without any comparable prior methods: the ability to binarize omics data, applicability across many different platforms, enabling sample level scoring and analysis, and being applicable in both univariate and multivariate modes are highly desirable features that are not usually found together in other methods for omics data analysis. Given this fact, we believe that the divergence package will be highly useful to the computational biology research community in their pursuits.