Autoregressive Higher-Order Hidden Markov Models: Exploiting Local Chromosomal Dependencies in the Analysis of Tumor Expression Profiles

Changes in gene expression programs play a central role in cancer. Chromosomal aberrations such as deletions, duplications and translocations of DNA segments can lead to highly significant positive correlations of gene expression levels of neighboring genes. This should be utilized to improve the analysis of tumor expression profiles. Here, we develop a novel model class of autoregressive higher-order Hidden Markov Models (HMMs) that carefully exploit local data-dependent chromosomal dependencies to improve the identification of differentially expressed genes in tumor. Autoregressive higher-order HMMs overcome generally existing limitations of standard first-order HMMs in the modeling of dependencies between genes in close chromosomal proximity by the simultaneous usage of higher-order state-transitions and autoregressive emissions as novel model features. We apply autoregressive higher-order HMMs to the analysis of breast cancer and glioma gene expression data and perform in-depth model evaluation studies. We find that autoregressive higher-order HMMs clearly improve the identification of overexpressed genes with underlying gene copy number duplications in breast cancer in comparison to mixture models, standard first- and higher-order HMMs, and other related methods. The performance benefit is attributed to the simultaneous usage of higher-order state-transitions in combination with autoregressive emissions. This benefit could not be reached by using each of these two features independently. We also find that autoregressive higher-order HMMs are better able to identify differentially expressed genes in tumors independent of the underlying gene copy number status in comparison to the majority of related methods. This is further supported by the identification of well-known and of previously unreported hotspots of differential expression in glioblastomas demonstrating the efficacy of autoregressive higher-order HMMs for the analysis of individual tumor expression profiles. Moreover, we reveal interesting novel details of systematic alterations of gene expression levels in known cancer signaling pathways distinguishing oligodendrogliomas, astrocytomas and glioblastomas. An implementation is available under www.jstacs.de/index.php/ARHMM.


Appendix A: Prior distributions
A problem-specific characterization of an autoregressive higher-order HMM is obtained by including prior knowledge about the distribution of measurements into the training of the model.
This prior knowledge is integrated by defining a prior distribution over the parameters of the HMM λ := ( π, A, B). As outlined in the main manuscript, the prior distribution P [ λ | Θ ] is defined to be a product of independent prior distributions for the initial state distribution π, the set of transition matrices A and the emission parameters B of the HMM.

Prior for initial state distribution
The conjugate prior for the initial state distribution π := (π i ) i∈S with initial state probability π i := exp(Λ π i ) is given by the transformed Dirichlet distribution with hyperparameter vector Θ 1 := (ϑ i ) i∈S that contains each pseudocount ϑ i ∈ R + . Z(Θ 1 ) := Γ( i∈S ϑ i )/ i∈S Γ(ϑ i ) with Gamma function Γ(x) = ∞ 0 u x−1 · exp(−u) du for all x ∈ R + represents the normalization constant. The transformed Dirichlet prior has been specified in a general manner by MacKay (1998). Dirichlet priors are generally applied to model prior knowledge for initial state distributions (e.g. Durbin et al. (1998); Bishop (2006)). Following Seifert et al. (2011), each ϑ i is set to ϑ i = 3.

Prior for set of transition matrices
The prior distribution for the set of transition matrices A := {A 1 , . . . , A L } is defined by a product of prior distributions under consideration of the hyperparameters Θ 2 := (Θ l 2 ) 1≤l≤L and a prior distribution D l 2 ( A l | Θ l 2 ) for each transition matrix A l ∈ A. The prior for the transition matrix A l := (a ij ) i∈S l ,j∈S with transition probability a ij := exp(Λ a ij ) is defined by a product of transformed Dirichlet distributions in dependency of the hyperparameters Θ l 2 := (Θ l 2,i ) i∈S l with hyperparameter vector Θ l 2,i := (ϑ ij ) j∈S containing each pseudocount ϑ ij ∈ R + . Each transformed Dirichlet distribution represents the specific conjugate prior for the state-specific transition probabilities of the corresponding state-context i of length l. The normalization constant Z(.) is defined in analogy to the previous section. Each ϑ ij has been set to ϑ ij = 3.

Prior for emission parameters
In the following, we focus on the prior distribution D 3 ( B | Θ 3 ) for the emission parameters to enable the integration of prior knowledge into the training of the state-specific autoregressive Gaussian emission densities. To avoid that each gene t in a tumor expression profile k is having its specific emission prior, we define the prior distribution for the state-specific autoregressive mean µ t i (k) := µ i + M m=1 c im · o t−m (k) · δ t,m with δ t,m = 0 if t − m < 1 and otherwise δ t,m = 1 using the following strategy. We only explicitly model prior knowledge for each basic statespecific mean µ i and assume constant priors for the corresponding coefficients (c i1 , . . . , c iM ).
This leads to a reduction of the prior for µ t i (k) to a prior that only depends on the corresponding µ i . Based on that, the emission prior for an HMM with autoregressive Gaussian emissions can be defined by a product of independent Gaussian-Inverted-Gamma distributions as done for standard non-autoregressive HMMs in Seifert et al. (2011). Details of this emission prior in the context of autoregressive HMMs are considered subsequently.
The prior for the emission parameters B is defined by a product of independent Gaussian- with respect to the hyperparameter matrix Θ 3 := (η i , i , r i , α i ) i∈S containing the a priori mean η i ∈ R, the scale parameter i ∈ R + , the shape parameter r i ∈ R + and the scale parameter α i ∈ R + . The Gaussian-Inverted-Gamma distribution has been specified in a general form by Evans et al. (2000). For each state i ∈ S the state-specific emission prior is given by a Gaussian distribution with mean η i and standard deviation σ i / √ i as prior distribution for the basic state-specific mean µ i of the corresponding autoregressive mean µ t i (k) in combination with an Inverted-Gamma distribution with shape parameter r i and scale parameter α i as prior of the state-specific standard deviation σ i . Based on that, prior knowledge can be modeled for µ t i (k) by choosing an a priori mean η i and a strength i of belief in this choice in combination with proper values for r i and α i to characterize the corresponding state-specific standard deviation σ i (see emission parameter estimation formulas below and in Seifert et al. (2011) to obtain an overview of the influence of hyperparameters on the state-specific emission parameters).

Prior hyperparameters
The choice of appropriate hyperparameters for the prior distribution is generally problemspecific and depends on the size and the characteristics of the considered data set. A histogram of log-ratios (e.g. Fig. S1) can help to set the hyperparameter value of the mean η i of each state i ∈ S for characterizing each state of the model. For the analysis of the breast cancer and glioma gene expression data sets, the values η − := −2, η = := 0, and η + := 2 were used (each η i is defined to be identical to the initially user-specified value of mean µ i of state i ∈ S). These values specify that underexpressed genes are modeled by state '−', unchanged genes are represented by state '=' and that overexpressed are realized by state '+'. To improve the separation between these gene categories, less flexibility is allowed for the training of the emission parameters of the states '−' and '+' by using the scale parameters − = + := 15, 000 in comparison to = := 1, 000 for the state '='. To characterize each state i ∈ S by an appropriate standard deviation, the shape parameter r i := 10 and the scale parameter α i := 10 −4 were used. Finally, for each state i ∈ S the hyperparameter ϑ i := 3 was used for the initial state prior, and for each combination of a state-context i ∈ S l with a next state j ∈ S the hyperparameter ϑ ij := 3 was used for the transition prior. This basic initialization strategy has led to robust results in Seifert et al. (2011). That was further confirmed by an in-depth sensitivity analysis of autoregressive higher-order HMMs with respect to varying prior hyperparameter settings (see Figure 3e and f in the main manuscript). Yet, users have still the possibility to change these settings to enable the modeling of specific characteristics of other data sets. However, that was not necessary for the three different tumor data sets that we have analyzed here.

Basics of the Bayesian Baum-Welch algorithm
The Bayesian Baum-Welch algorithm is a training procedure that iteratively determines new for the next HMM λ(h+1) based on the current parameters of the HMM λ(h) (h = 1 initial HMM) by combining Baum's auxiliary function Q( λ | λ(h) ) with the prior distribution P [ λ | Θ ] defined in Eqn. (1). This algorithm extends the typically used Baum-Welch training (Baum (1972);Rabiner (1989); Durbin et al. (1998)) by the integration of prior knowledge. In the following sections, Baum's auxiliary function and the prior distribution are considered to derive estimators of initial state, transition and emission parameters for the next HMM λ(h + 1). Finally, a computational scheme of the Bayesian Baum-Welch algorithm is provided.

Splitting of Baum's auxiliary function
Baum's auxiliary function provides the basis for the estimation of the model parameters using the standard Baum-Welch algorithm (Baum (1972); Rabiner (1989)). In combination with the specified prior distribution of the HMM, Baum's auxiliary function is used to derive the basics for the parameter estimation in the context of the Bayesian Baum-Welch algorithm. Baum's auxiliary function is given by in analogy to Seifert (2010) of an emission sequence o(k) = ( o 1 (k), . . . , o T k (k)) and a corresponding state sequence q = (q 1 , . . . , q T k ). This complete-data-likelihood is given by under an HMM λ := ( π, A, B) of order L with initial state distribution π, set of transition matrices A and emission parameters B. All these model parameters are specified in detail in the methods section of the main manuscript. Based on the complete-data-likelihood, Baum's auxiliary function can be split up into three independent functions to represent each class of model parameters separately. Subsequently, these three functions are considered in detail and the corresponding parameter estimators or parameter estimation strategies will be derived.

Estimation of initial state parameters
To estimate the initial state probabilities for the HMM λ(h + 1), Baum's auxiliary function for the estimation of the initial state probabilities is required. This function is given by based on expressing the sum over all state sequences q ∈ S T k by two sums. The first sum considers all initial states i ∈ S, and the second sum marginalizes over all state sequences q ∈ S T k with initial state q 1 = i leading to P [ q 1 = i | o(k), λ(h) ], which represents the state-posterior γ k 1 (i) computed under the current HMM λ(h) using an extended Forward-Backward algorithm for higher-order HMMs as described in Seifert (2010). Finally, the initial state probability π i is parameterized by Λ π i := log (π i ) in the log-space to provide the basics for the parameter estimation.
The new initial state probability π (h+1) i of state i ∈ S is determined by combining Baum's auxiliary function for initial state parameters with the corresponding prior distribution D 1 ( π | Θ 1 ) defined in Eqn. (2) with respect to the constraint i∈S exp(Λ π i ) = 1. This is done by applying the method of Lagrange multipliers to the auxiliary function Q 1 ( π | λ(h) ) + log(D 1 ( π | Θ 1 )) − δ · (( i∈S exp(Λ π i )) − 1) with variable Λ π i and Lagrange multiplier δ. This leads to the new initial state probability for the HMM λ(h + 1).

Estimation of transition parameters
Baum's auxiliary function for transition parameters is required to estimate the transition probabilities for the next HMM λ(h + 1). This function is specified by considering each specific auxiliary function Q l 2 ( A l | λ(h) ) defined for each transition matrix A l ∈ A of an HMM of order L.
For a fixed state-context length 1 ≤ l < L, Baum's auxiliary function for transition matrix A l is given by substituting the sum over all state sequences q ∈ S T k by three sums. Two of these sums are shown explicitly and the third sum is substituted as explained subsequently. The first sum considers each current state-context i ∈ S l . The second sum considers each next state j ∈ S. Now, a third sum is necessary to marginalize over all state sequences q ∈ S T k with fixed current state-context q 1...l = i and fixed next state q t+1 = j. This third sum is simplified to its ] computed under the current HMM λ(h) using an extended Forward-Backward algorithm as outlined in Seifert (2010). Finally, the transition probability a ij is parameterized by Λ a ij := log(a ij ) in the log-space to provide the basics for the parameter estimation.
The new transition probability a (h+1) ij for a transition from state-context i ∈ S l to state j ∈ S is determined by combining Baum's auxiliary function for transition parameters with the corresponding prior distribution D l 2 ( A l | Θ l 2 ) given in Eqn.
(3) with respect to the constraint j∈S exp(Λ a ij ) = 1. This is done by applying the method of Lagrange multipliers to the auxil-iary function Q l 2 ( A l | λ(h) ) + log(D l 2 ( A l | Θ l 2 )) − i∈S l δ i · (( j∈S exp(Λ a ij )) − 1) with variable Λ a ij and Lagrange multiplier δ i . That leads to the new transition probability for the HMM λ(h + 1). This transition probability is only considered for a state-transition at the fixed time-step l. The derivation of state-transition probabilities used at all time-steps t ≥ L considering the full memory on predecessor states is done in a similar manner and will be outlined in the following.
Baum's auxiliary function for transition matrix A L considered for all state-transitions at timesteps t ≥ L is given by substituting the sum over all state sequences q ∈ S T k by three sums in analogy to the previously described derivation of the other auxiliary functions for transition parameters. The auxiliary function Q L 2 ( A L | λ(h) ) contains an additional sum over all time-steps t ≥ L. Again, the transition probability a ij is parameterized in the log-space by Λ a ij := log(a ij ).
The new transition probability a (h+1) ij for a transition from state-context i ∈ S L to state j ∈ S is determined by combining Baum's auxiliary function for transition parameters with the cor- (3) with respect to the constraint j∈S exp(Λ a ij ) = 1. This is again done by applying the method of Lagrange multipliers to the auxiliary function for the HMM λ(h + 1).

Estimation of emission parameters
In the following, Baum's auxiliary function for emission parameters is derived and prepared for the estimation of state-specific autoregressive means and standard deviations. Some basics of the parameter estimation of the autoregressive mean have already been outlined in the main manuscript. Here, detailed derivations of the parameter estimation process are given.

Baum's auxiliary function for emission parameters
To estimate the emission parameters for the next HMM λ(h + 1), Baum's auxiliary function for emission parameters needs to be specified. This function is given by including the substitution of the sum over all state sequences q ∈ S T k by two sums. The first sum runs over all current states i ∈ S. Now, a second sum is required to marginalize over all state sequences q ∈ S T k with fixed current state q t = i. The second sum can be simplified to its marginal probability γ k representing exactly the state-posterior computed under the current HMM λ(h) using extended Forward and Backward algorithms (Seifert (2010)).
To ease the representation of the emission parameter estimation for an autoregressive Gaussian emission distribution, some formulas and definitions given in the main manuscript are reconsidered. The expression level o t (k) of a gene t in a profile k is modeled by a statespecific autoregressive Gaussian emission distribution of order M ≥ 0 defined by with respect to the state-specific standard deviation σ i ∈ R + and the state-specific autoregressive mean for the expression level of gene t in profile k. Here, µ i ∈ R defines the basic state-specific mean and the coefficients (c i1 , . . . , c iM ) ∈ R M are used to model the impact of predecessor expression levels on the gene-specific mean. Additionally, at the start of an emission sequence when the complete memory on M predecessor emissions does not exist, we have to truncate the modeling of dependencies by defining the function δ t,m to be zero in cases where t − m < 1 and otherwise one. The specified autoregressive Gaussian emission distribution is further used to refine Baum's auxiliary function for emission parameters given in Eqn. (8) leading to with respect to transformations according to logarithm rules and ignoring constant terms. This function is used subsequently in combination with the emission prior to estimate the parameters of the state-specific autoregressive mean µ t i (k) (h+1) and the corresponding state-specific standard deviation σ (h+1) i of the next HMM λ(h + 1).

Estimation of state-specific autoregressive means
The parameters of the state-specific autoregressive mean µ t i (k) of the autoregressive Gaussian emission density of order M of state i ∈ S are determined by maximizing Baum's auxiliary function for emission parameters Q 3 ( B | λ(h) ) given in (9) in combination with the emission (4). This is done by determining the critical point of the derivatives ) with respect to the basic state-specific mean µ i and all corresponding coefficients c i1 , . . . , c iM . The resulting derivatives are set to zero and the individual terms are multiplied by σ 2 i . Additional regrouping results in a system of M + 1 linear equations enabling to determine the parameters µ i and c i1 , . . . , c iM for the next model λ(h + 1). The first equation of this system of linear equations is given by derived from the derivation of the basic state-specific mean µ i . In addition to this, each of the corresponding M equations with fixed index 1 ≤ m ≤ M is given by derived on the basis of the derivative of Q 3 ( B | λ(h) ) + log(D 3 ( B | Θ 3 )) with respect to the corresponding coefficient c im . For all equations, all components are known except the basic state-specific mean µ i and its corresponding coefficients c i1 , . . . , c iM for defining the autore-gressive mean µ t i (k) (h+1) of the next HMM λ(h + 1). These parameters can be determined by using standard solvers for systems of linear equations. Considering continuous gene expression measurements, this system of linear equations is expected to have full rank enabling to determine a unique solution for the unknown parameters µ i and c i1 , . . . , c iM . These parameters specify the autoregressive mean µ t i (k) (h+1) of state i ∈ S of the next HMM λ(h + 1). We utilize the publicly available Jama package (Hicklin et al. (2012)) to compute each state-specific autoregressive mean.

Estimation of state-specific standard deviations
The state-specific standard deviation σ (h+1) i of state i ∈ S for the next HMM λ(h + 1) is determined by maximizing Baum's auxiliary function for emission parameters Q 3 ( B | λ(h) ) given in (9) in combination with the emission prior D 3 ( B | Θ 3 ) defined in (4). This is done by calculating the critical point of the derivative of Q 3 ( B | λ(h) ) + log(D 3 ( B | Θ 3 )) with respect to the standard deviation σ i . This leads to the state-specific standard deviation for the next HMM λ(h + 1). Here, the corresponding state-specific autoregressive mean µ t i (k) (h+1) occurs only in the first part of the denominator, because the prior has only been specified on the corresponding basic mean µ

Computational scheme of the Bayesian Baum-Welch algorithm
The computational scheme of the Bayesian Baum-Welch algorithm is specified in the following in terms of an initialization and an iteration step under consideration of the parameter estimation formulas derived in the previous sections.
• Initialization: Choose an initial autoregressive HMM λ(1) with an emission process of order M and a state-transition process of order L.
for each given emission sequence o(k) using the Forward-Backward algorithm adapted to higher-order HMMs in Seifert (2010).
-Compute the optimal parameters of the next HMM λ(h + 1) with the help of these precomputations.
1. Compute the initial state probability in (5) for each state i ∈ S. (6) and (7) for each state-context i ∈ S l with 1 ≤ l ≤ L and next state j ∈ S.

Compute the transition probabilities in
3. For each state i ∈ S, compute the basic mean µ (h+1) i and its corresponding coefficients c i1 , . . . , c iM for the autoregressive mean µ t i (k) (h+1) by solving the corresponding system of linear equations in section 2.5.2. Additionally, compute the standard deviation σ (h+1) i in (10).
-Stop if the log-posterior under the next HMM λ(h + 1) has increased less than a predefined threshold in comparison to the log-posterior under the current HMM λ(h), otherwise start the next iteration step with h := h + 1.

Markov Models to glioma data
Here, we consider gene expression profiles of glioblastomas by de Tayrac et al. (2009) and demonstrate that autoregressive HMMs are valuable for the identification of previously known and so far uncharacterized hotspots of differential expression. We complement this by further evaluating the identified hotspot genes in the context of a large independent cohort of glioma samples based on the Rembrandt repository by Madhavan et al. (2009). Additionally, we compiled an independent glioma gene expression data set based on data from the Repository for Molecular Brain Neoplasia Data (Rembrandt, current release 1.5.9) by Madhavan et al. (2009). We performed stringent quality controls of downloaded gene expression arrays and removed all arrays with hybridization artifacts. We further did a standard Affymetrix microarray processing utilizing a customized design file from BrainArray (HGU133Plus2 version 15.0.0) in combination with GCRMA (Wu et al. (2004)) normalization. The final data set contains tumor gene expression profiles of 89 different gliomas (45 glioblastomas, 33 astrocytomas, and 11 oligodendrogliomas) for which gene expression levels of 16,282 genes are quantified in terms of log-ratios with respect to an average normal brain reference computed based on data from Rembrandt. We utilize this data set to further investigate the expression behavior of hotspots of differential expression identified in the data set by de Tayrac et al. (2009) in the context of an independent large cohort of glioma samples. Local chromosomal dependencies between gene expression levels in gliomas are shown in Figure 1 of the main manuscript.

Identification of hotspots of differential expression in glioblastomas
Here, we apply our autoregressive HMMs to investigate their potential to identify hotspots of differential expression in glioblastomas. Therefore, we utilized publicly available glioblastoma gene expression data from de Tayrac et al. (2009). We adapted the AR(4)-HMM(2) to the tumor expression profiles of the 18 different patients based on the initial basic settings described in the methods section of the main manuscript. To reveal hotspots of differential expression across all tumors, we first determined the most likely underlying expression state of each gene in each patient by state-posterior decoding. We next ranked all genes according to their frequencies of underexpressed and overexpressed predictions across all samples and generated lists of hotspot genes. Utilizing a very stringent hotspot criterion defining that a gene must be predicted in at least 17 of 18 samples, we identified 21 hotspots of overexpression (Table S5) and 16 hotspots of underexpression (Table S6) in glioblastomas. The corresponding glioblastoma expression profiles of these genes across the different patient samples are shown in Figure S9. We note that these stringent hotpots were very similar considering different autoregressive HMMs. We performed in-depth literature studies of all hotspot genes using PubMed (www.ncbi.nlm.nih.gov/pubmed) and GeneCards (www.genecards.org) and find that seven of these genes have not been reported as hotspots of differential expression in gliomas (glioblastomas, or astrocytomas of WHO grades II or III) or other types of cancer so far. These genes are summarized in Table S7 and will be discussed together with other interesting hotspot genes in the following.
In-depth literature studies revealed that the identified hotspots of overexpression are especially enriched in genes that have previously been reported as being overexpressed in gliomas (11 of 21: glioblastomas, or astrocytomas of WHO grades II or III) or other tumors (9 of 21). So far, only one gene, CKAP2L, has not been reported as hotspot of overexpression. Currently, CKAP2L does not have an assigned annotation, but it is known to be a paralog of CKAP2, which is associated with the regulation of aneuploidy, cell cycle, and cell death in a p53-dependent manner. Thus, this gene may also be important in glioblastomas.
In contrast to these findings, literature-based studies of hotspots of underexpression revealed that a relatively large proportion of these genes has not been reported as hotspots of underexpression in gliomas or other tumors so far (6 of 16). The majority of these genes are involved in brain development, neuron function and differentiation. Three of these genes (CABP1, GABRA5, CACNG3) encode for ion channels. SNAP25 encodes for a presynaptic plasma membrane protein involved in the regulation of neurotransmitter release. MPPED1, a paralog of MPPED2, encodes for a metallophosphoesterase involved in brain development.
SMPX encodes for a small protein which may play a role in the regulatory network through which muscle cells coordinate their structural and functional states during growth, adaptation and repair. Other interesting hotspots of underexpression are NEUROD6, DRD1 and SLC12A5.
NEUROD6 encodes for a transcription factor involved in the development and the differentiation of the nervous system. DRD1 encodes for a G-protein coupled dopamine receptor involved in the regulation of neuronal growth and development. SLC12A5 (alias KCC2) encodes for a membrane K-Cl cotransporter, which has been recently reported to affect cell migration, cell morphology and malignancy of cancer cell lines (Zdebik (2011)).
Interestingly, a slightly relaxed hotspot criterion of predictions in at least 15 of 18 samples has led to the identification of four hotspots of overexpression (CDCA2, CD44, OIP5, TNFRSF12A) that have recently been suggested as potential therapeutic targets in other tumors (see supporting text of Table S5 for more details). Considering the hotspots of underexpression, two other interesting genes were identified. WIF1, a known tumor suppressor inhibiting the Wnt signaling pathway, and CHD5, a potential tumor suppressor involved in chromatin remodeling and gene transcription. Additional gene ontology analysis of these extended hotspots using GOrilla (Eden et al. (2009)) revealed that hotspot genes of overexpression are significantly enriched in categories related to cell cycle, cell division, mitosis and mitotic spindle formation (p-value < 1e-7). Hotspots of underexpression are significantly enriched in categories related to synapses, cell junctions and cell-cell signaling (p-value < 1e-7).

Analysis of identified hotspots of differential expression in the context of different types of gliomas
Next, we analyzed the expression behavior of the identified hotspots of over-and underexpression in independent data sets taken from the Rembrandt brain tumor repository (Madhavan et al. (2009)). In addition to gene expression profiles of glioblastomas, Rembrandt also contains gene expression profiles of astrocytomas and oligodendrogliomas enabling to study how the identified hotspots behave in different types of gliomas. Therefore, we mapped the identified 21 hotspots of overexpression and the identified 16 hotspots of underexpression to the corresponding genes measured in Rembrandt. Based on that, we created a heatmap visualizing the expression behavior of the corresponding genes ( Figure S10). We clearly see that the identified hotspots of over-and underexpression in glioblastomas predicted by the AR (4)  Histograms of log-ratios representing breast cancer gene expression levels of genes with at least two-, three-, or four-fold increased copy numbers in tumor compared to healthy tissue based on the breast cancer gene expression data set by Pollack et al. (2002). The majority of these genes (two-fold: 856, three-fold: 228, four-fold: 115) have increased expression levels indicated by log-ratios greater than zero. The mean log-ratio of each histogram is significantly greater than zero (t-test: p-value < 4.5e-18).

Two-fold increased copy numbers
Three-fold increased copy numbers Four-fold increased copy numbers

(b) Transfer to other similar breast cancer data
Two-fold increased copy numbers Three-fold increased copy numbers Four-fold increased copy numbers Identification of overexpressed genes with at least two-, three-or four-fold increased copy numbers by autoregressive HMMs in breast cancer gene expression data from Pollack et al. (2002). Initial HMMs with emission processes of order M ∈ {0, . . . , 5} (AR(M )) and state-transition processes of order L ∈ {0, . . . , 5} (HMM-Order) were trained by the Bayesian Baum-Welch algorithm on fifty percent of the data. The identification of candidate genes of overexpression with increased copy numbers is quantified by the true positive rate (TPR) reached at a fixed false positive rate of 5% for the training data (a) and independent test data (b).

Fig. S5: Influence of varying initial model parameter settings on the identification of overexpressed genes with known increased copy numbers by different AR-HMMs (a) Performance on breast cancer data
Two-fold increased copy numbers Three-fold increased copy numbers Four-fold increased copy numbers

(b) Transfer to other similar breast cancer data
Two-fold increased copy numbers Three-fold increased copy numbers Four-fold increased copy numbers Identification of overexpressed genes with at least two-, three-or four-fold increased copy numbers by autoregressive HMMs in breast cancer gene expression data from Pollack et al. (2002). Initial HMMs were trained with respect to twelve different initial model parameter settings (Table S1) in analogy to the study in Figure S4. Average true positive rates (TPRs) and corresponding standard deviations reached at a fixed false positive rate of 5% for the different initialization scenarios are shown for the training data in (a) and for the independent test data in (b).

Fig. S6: Influence of varying prior hyperparameter settings on the identification of overexpressed genes with known increased copy numbers by different AR-HMMs (a) Performance on breast cancer data
Two-fold increased copy numbers Three-fold increased copy numbers Four-fold increased copy numbers

(b) Transfer to other similar breast cancer data
Two-fold increased copy numbers Three-fold increased copy numbers Four-fold increased copy numbers Identification of overexpressed genes with at least two-, three-or four-fold increased copy numbers by autoregressive HMMs in breast cancer gene expression data from Pollack et al. (2002). Initial HMMs were trained with respect to twelve different prior parameter settings (Table S1) in analogy to the study in Figure S4. Average true positive rates (TPRs) and corresponding standard deviations reached at a fixed false positive rate of 5% for the different initialization scenarios are shown for the training data in (a) and for the independent test data in (b). Comparison of the AR(4)-HMM(2) and related methods with respect to the identification of overexpressed genes with at least two-fold (a), three-fold (b) and four-fold (c) increased copy numbers based on breast cancer data by Pollack et al. (2002). The performance of each method is quantified by a ROC curve displaying the true positive rate (TPR) reached at different levels of false positive rates (FPR). Overall, the AR(4)-HMM(2) with a fourth-order autoregressive emission process and a second-order state-transition process reaches the best performance. Overlap of the top 300 underexpressed genes (a) and the top 300 overexpressed genes (b) identified in the Rembrandt data set by the AR(4)-HMM(2) with known cancer signaling pathways taken from ConsensusPathDB (Kamburov et al. (2011)). Grey bars represent pathwayspecific overlaps expected for randomly chosen top 300 candidate gene lists. See Figure 6 in the main manuscript for a detailed view on the most discriminative pathways between oligodendrogliomas, astrocytomas and glioblastomas. See Table S4 for an overview with functional annotations of cancer signaling pathways.

Fig. S9: Hotspots of differential expression in glioblastomas
Hotspot genes of over-and underexpression identified by the AR (

Tab. S1: Sensitivity analysis settings
Overview of the parameter settings used to perform the sensitivity analysis to study the robustness of the predictions made by autoregressive HMMs. Table S1A summarizes the different tested initial model parameter settings. Table S1B summarizes  For these different scenarios, we only highlight parameters that were different from the standard settings to ease the readability. We note that the changes of the initial transition parameters in case studies 1 to 4 are directly transferred to the initial transition parameters (see Model initialization of the methods section in the main manuscript; a scale factor s = 0.0125 was used in case study 1 to get well-defined transition matrices, in analogy s = 0.025 was used in case study 2). Based on these different settings, the corresponding autoregressive HMMs were trained on fifty percent of the breast cancer gene expression data set by Pollack et al. (2002) using the Bayesian Baum-Welch algorithm. We always used the standard prior hyperparameter settings for the training of models with varying initial model parameter settings. In analogy, we always used the standard initial model parameter settings for the training of models with varying prior hyperparameters. The influences of these different scenarios on the predictions of autoregressive HMMs are summarized in Figure S5 for the different model initializations and in Figure S6 for the different prior hyperparameter settings. Generally, we find that the predictions of autoregressive HMMs are very robust.

Tab. S1A: Varying Initial Model Parameters
Initial HMM parameters Case Study Identification of overexpressed genes with at least two-, three-or four-fold increased copy numbers by different methods based on all breast cancer gene expression profiles by Pollack et al. (2002). Methods are compared based on true positive rates (TPR) reached at different levels of fixed false positive rates (FPR) and based on their the area under the receiver operator characteristic curve (AU-ROC). Methods are sorted in decreasing order of corresponding AU-ROC values reached for the identification of overexpressed genes with at least four-fold increased copy numbers. We note that DSHMM and SHMM as well as MixMod and LFC reached identical performances. The AR(4)-HMM(2) reaches globally the best performance outperforming other related methods. was not linked to other types of cancer (1 of 21). CKAP2L did not have an assigned annotation, but it is a paralog of CKAP2, which is associated with the regulation of aneuploidy, cell cycle, and cell death in a p53-dependent manner (http://www.genecards.org).
Additionally, we note that four genes (CDCA2, OIP5, TNFRSF12A, CD44) recently sug-  We performed PubMed (www.ncbi.nlm.nih.gov/pubmed) literature searches and GeneCards (www.genecards.org) analysis to better characterize these genes. Dark grey rows represent genes reported as underexpressed in gliomas (glioblastomas, or astrocytomas of WHO grades II and III) or in other studies (3 of 16). Light grey rows represent genes that have already been identified to play a role in other tumors (7 of 16). White rows represent genes for which we did not find a link to gliomas or other types of cancer (6 of 16). Generally, a large proportion of identified hotspot genes is involved in brain development, neuron functions and differentiation (http://www.genecards.org). For example, four of these genes (KCNV1, CABP1, GABRA5, CACNG3) encode for ion channels. NEUROD6 encodes for a transcription factor involved in the development and the differentiation of the nervous system. DRD1 encodes for a G-protein coupled dopamine receptor involved in the regulation of neuronal growth and development.
SNAP25 encodes for a presynaptic plasma membrane protein involved in the regulation of neurotransmitter release. MPPED1, a paralog of MPPED2, encodes for a metallophosphoesterase involved in brain development. SMPX encodes for a small protein which may play a role in the regulatory network through which muscle cells coordinate their structural and functional states during growth, adaptation and repair. Another interesting hotspot of underexpression is SLC12A5. SLC12A5 (alias KCC2) encodes for a membrane K-Cl cotransporter, which has been recently reported to affect cell migration, cell morphology and malignancy of cancer cell lines (Zdebik (2011)).
Additionally, we note that the tumor suppressor gene WIF1 (16 of 18) and the potential tumor suppressor gene CHD5 (15 of 18) have been identified when we relaxed the hotspot criterion.
Standard gene ontology enrichment analysis of underexpressed hotspot genes identified in at least 15 of 18 tumors (93 genes) utilizing GOrilla (Eden et al. (2009) Summary of identified novel hotspots of differential expression in glioblastomas for which we did not find a link to other cancer studies based on PubMed literature searches. These hotspots were identified by the AR(4)-HMM (2)