Fast Bayesian Inference of Copy Number Variants using Hidden Markov Models with Wavelet Compression

By integrating Haar wavelets with Hidden Markov Models, we achieve drastically reduced running times for Bayesian inference using Forward-Backward Gibbs sampling. We show that this improves detection of genomic copy number variants (CNV) in array CGH experiments compared to the state-of-the-art, including standard Gibbs sampling. The method concentrates computational effort on chromosomal segments which are difficult to call, by dynamically and adaptively recomputing consecutive blocks of observations likely to share a copy number. This makes routine diagnostic use and re-analysis of legacy data collections feasible; to this end, we also propose an effective automatic prior. An open source software implementation of our method is available at http://schlieplab.org/Software/HaMMLET/ (DOI: 10.5281/zenodo.46262). This paper was selected for oral presentation at RECOMB 2016, and an abstract is published in the conference proceedings.


Introduction
The human genome shows remarkable plasticity, leading to significant copy number variations (CNV) within the human population [1]. They contribute to differences in phenotype [2][3][4], ranging from benign variation over disease susceptibility to inherited and somatic diseases [5], including neuropsychiatric disorders [6][7][8] and cancer [9,10]. Separating common from rare variants is important in the study of genetic diseases [5,11,12], and while the experimental platforms have matured, interpretation and assessment of pathogenic significance remains a challenge [13].
Computationally, CNV detection is a segmentation problem, in which consecutive stretches of the genome are to be labeled by their copy number; following the conventions typically employed in CNV method papers, e.g. [14][15][16][17], we use this term rather abstractly to denote segments of equal mean value, not actual ploidy, though for homogeneous samples the latter can be easily assigned. Along with a variety of other methods [14][15][16], Hidden Markov Models (HMM) [42] play a central role [17,[43][44][45][46][47][48][49][50][51][52], as they directly model the separate layers of observed measurements, such as log-ratios in array comparative genomic hybridization (aCGH), and their corresponding latent copy number (CN) states, as well as the underlying linear structure of segments.
As statistical models, they depend on a large number of parameters, which have to be either provided a priori by the user or inferred from the data. Classic frequentist maximum likelihood (ML) techniques like Baum-Welch [53,54] are not guaranteed to be globally optimal, i. e. they can converge to the wrong parameter values, which can limit the accuracy of the segmentation. Furthermore, the Viterbi algorithm [55] only yields a single maximum a posteriori (MAP) segmentation given a parameter estimate [56]. Failure to consider the full set of possible parameters precludes alternative interpretations of the data, and makes it very difficult to derive pvalues or confidence intervals. Furthermore, these frequentist techniques have come under increased scrutiny in the scientific community.
Bayesian inference techniques for HMMs, in particular Forward-Backward Gibbs sampling [57,58], provide an alternative for CNV detection as well [59][60][61]. Most importantly, they yield a complete probability distribution of copy numbers for each observation. As they are sampling-based, they are computationally expensive, which is problematic especially for highresolution data. Though they are guaranteed to converge to the correct values under very mild assumptions, they tend to do so slowly, which can lead to oversegmentation and mislabeling if the sampler is stopped prematurely.
Another issue in practice is the requirement to specify hyperparameters for the prior distributions. Despite the theoretical advantage of making the inductive bias more explicit, this can be a major source of annoyance for the user. It is also hard to justify any choice of hyperparameters when insufficient domain knowledge is available.
Recent work of our group [62] has focused on accelerating Forward-Backward Gibbs sampling through the introduction of compressed HMMs and approximate sampling. For the first time, Bayesian inference could be performed at running times on par with classic maximum likelihood approaches. It was based on a greedy spatial clustering heuristic, which yielded a static compression of the data into blocks, and block-wise sampling. Despite its success, several important issues remain to be addressed. The blocks are fixed throughout the sampling and impose a structure that turns out to be too rigid in the presence of variances differing between CN states. The clustering heuristic relies on empirically derived parameters not supported by a theoretical analysis, which can lead to suboptimal clustering or overfitting. Also, the method cannot easily be generalized for multivariate data. Lastly, the implementation was primarily aimed at comparative analysis between the frequentist and Bayesian approach, as opposed to overall speed.
To address these issues and make Bayesian CNV inference feasible even on a laptop, we propose the combination of HMMs with another popular signal processing technology: Haar wavelets have previously been used in CNV detection [63], mostly as a preprocessing tool for statistical downstream applications [28][29][30][31][32] or simply as a visual aid in GUI applications [21,64]. A combination of smoothing and segmentation has been suggested as likely to improve results [65], and here we show that this is indeed the case. Wavelets provide a theoretical foundation for a better, dynamic compression scheme for faster convergence and accelerated Bayesian sampling (Fig 1). We improve simultaneously upon the typically conflicting goals of accuracy and speed, because the wavelets allow summary treatment of "easy" CN calls in segments and focus computational effort on the "difficult" CN calls, dynamically and adaptively. This is in contrast to other computationally efficient tools, which often simplify the statistical Sampling is performed on a compressed version of the data, using sufficient statistics for block-wise computations (panel b) to accelerate inference in Bayesian Hidden Markov Models. During the sampling (panel c) parameters and copy number sequences are sampled iteratively. During each iteration, the sampled emission variances determine which coefficients of the data's Haar wavelet transform are dynamically set to zero. This controls potential break points at finer or model or use heuristics. The required data structure can be efficiently computed, incurs minimal overhead, and has a straightforward generalization for multivariate data. We further show how the wavelet transform yields a natural way to set hyperparameters automatically, with little input from the user.
We implemented our method in a highly optimized end-user software, called HaMMLET. Aside from achieving an acceleration of up to two orders of magnitude, it exhibits significantly improved convergence behavior, has excellent precision and recall, and provides Bayesian inference within seconds even for large data sets. The accuracy and speed of HaMMLET also makes it an excellent choice for routine diagnostic use and large-scale re-analysis of legacy data. Notice that while we focus on aCGH in this paper as the most straightforward biological example of univariate Gaussian data, the method we present is a general approach to Bayesian HMM inference as long as the emission distributions come from the exponential family, implying that conjugate priors exist and the dimension of its sufficient statistics remain bounded with increasing sample size. It can thus be readily generalized and adapted to read-depth data, SNP arrays, and multi-sample applications.

Simulated aCGH data
A previous survey [65] of eleven CNV calling methods for aCGH has established that segmentation-focused methods such as DNAcopy [14,36], an implementation of circular binary segmentation (CBS), as well as CGHseg [37] perform consistently well. DNAcopy performs a number of t-tests to detect break-point candidates. The result is typically over-segmented and requires a merging step in post-processing, especially to reduce the number of segment means. To this end MergeLevels was introduced by [66]. They compare the combination DNAcopy+MergeLevels to their own HMM implementation [17] as well as GLAD (Gain and Loss Analysis of DNA) [27], showing its superior performance over both methods. This established DNAcopy+MergeLevels as the de facto standard in CNV detection, despite the comparatively long running time.
The paper also includes aCGH simulations deemed to be reasonably realistic by the community. DNAcopy was used to segment 145 unpublished samples of breast cancer data, and subsequently labeled as copy numbers 0 to 5 by sorting them into bins with boundaries (−1, −0.4, −0.2, 0.2, 0.4, 0.6, 1), based on the sample mean in each segment (the last bin appears to not be used). Empirical length distributions were derived, from which the sizes of CN aberrations are drawn. The data itself is modeled to include Gaussian noise, which has been established as sufficient for aCGH data [67]. Means were generated such as to mimic random tumor cell proportions, and random variances were chosen to simulate experimenter bias often observed in real data; this emphasizes the importance of having automatic priors available when using Bayesian methods, as the means and variances might be unknown a priori. The data comprises three sets of simulations: "breakpoint detection and merging" (BD&M), "spatial resolution study" (SRS), and "testing" (T) (see their paper for details). We used the MergeLevels implementation as provided on their website. It should be noted that the superiority of DNAcopy+-MergeLevels was established using a simulation based upon segmentation results of DNAcopy itself.
We used the Bioconductor package DNAcopy (version 1.24.0), and followed the procedure suggested therein, including outlier smoothing. This version uses the linear-time variety of CBS [15]; note that other authors such as [35] compare against a quadratic-time version of CBS [14], which is significantly slower. For HaMMLET, we use a 5-state model with automatic hyperparameters Pðs 2 0:01Þ ¼ 0:9 (see section Automatic priors), and all Dirichlet hyperparameters set to 1.
Following [62], we report F-measures (F 1 scores) for binary classification into normal and aberrant segments (Fig 2), using the usual definition of F ¼ 2pr pþr being the harmonic mean of precision p ¼ TP TPþFP and recall r ¼ TP TPþFN , where TP, FP, TN and FN denote true/false positives/ negatives, respectively. On datasets T and BD&M, both methods have similar medians, but HaMMLET has a much better interquartile range (IQR) and range, about half of CBS's. On the spatial resolution data set (SRS), HaMMLET performs much better on very small aberrations. This might seem somewhat surprising, as short segments could easily get lost under compression. However, Lai et al. [65] have noted that smoothing-based methods such as quantile smoothing (quantreg) [23], lowess [24], and wavelet smoothing [29] perform particularly well in the presence of high noise and small CN aberrations, suggesting that "an optimal combination of the smoothing step and the segmentation step may result in improved performance". Our wavelet-based compression inherits those properties. For CNVs of sizes between 5 and 10, CBS and HaMMLET have similar ranges, with CBS being skewed towards better values; CBS has a slightly higher median for 10-20, with IQR and range being about the same. However, while HaMMLET's F-measure consistently approaches 1 for larger aberrations, CBS does not appear to significantly improve after size 10. The plots for all individual samples can be found in Web Supplement S1-S3, which can be viewed online at http://schlieplab.org/Supplements/ HaMMLET/, or downloaded from https://zenodo.org/record/46263 (DOI: 10.5281/zenodo. 46263).

High-density CGH array
In this section, we demonstrate HaMMLET's performance on biological data. Due to the lack of a gold standard for high-resolution platforms, we assess the CNV calls qualitatively. We use raw aCGH data (GEO:GSE23949) [68] of genomic DNA from breast cancer cell line BT-474 (invasive ductal carcinoma, GEO:GSM590105), on an Agilent-021529 Human CGH Whole Genome Microarray 1x1M platform (GEO:GPL8736). We excluded gonosomes, mitochondrial and random chromosomes from the data, leaving 966,432 probes in total.
HaMMLET allows for using automatic emission priors (see section Automatic priors) by specifying a noise variance, and a probability to sample a variance not exceeding this value. We compare HaMMLET's performance against CBS, using a 20-state model with automatic priors, Pðs 2 0:1Þ ¼ 0:8, 10 prior self-transitions and 1 for all other hyperparameters. CBS took over 2 h 9 m to process the entire array, whereas HaMMLET took 27.1 s for 100 iterations, a speedup of 288. The compression ratio (see section Effects of wavelet compression on speed and convergence) was 220.3. CBS yielded a massive over-segmentation into 1,548 different copy number levels; cf. Web Supplement S4 at https://zenodo.org/record/46263. As the data is derived from a relatively homogeneous cell line as opposed to a biopsy, we do not expect the presence of subclonal populations to be a contributing factor [69,70]. Instead, measurements on aCGH are known to be spatially correlated, resulting in a wave pattern which has to be removed in a preprocessing step; notice that the internal compression mechanism of HaMMLET is derived from a spatially adaptive regression method, so smoothing is inherent to our method. CBS performs such a smoothing, yet an unrealistically large number of different levels remains, likely due to residuals of said wave pattern. Furthermore, repeated runs of CBS yielded different numbers of levels, suggesting that indeed the merging was incomplete. This can cause considerable problems downstream, as many methods operate on labeled data. A common approach is to consider a small number of classes, typically 3 to 4, and associate them semantically with CN labels like loss, neutral, gain, and amplification, e.g. [27,59,61,67,[71][72][73][74][75]. In inference models that contain latent categorical state variables, like HMM, such an association is readily achieved by sorting classes according to their means. In contrast, methods like CBS typically yield a large, often unbounded number of classes, and reducing it is the declared purpose of merging algorithms, see [66]. Consider, for instance, CGHregions [74], which uses a 3-label matrix to define regions of shared CNV events across multiple samples by requiring a maximum L 1 distance of label signatures between all probes in that region. If the domain of class labels was unrestricted and potentially different in size for each sample, such a measure would not be meaningful, since the i-th out of n classes cannot be readily identified with the i-th out of m classes for n 6 ¼ m, hence no two classes can be said to represent the same CN label. Similar arguments hold true for clustering based on Hamming distance [72] or ordinal similarity measures [71]. Furthermore, even CGHregions's optimized computation of medoids takes several minutes to compute. As the time depends multiplicatively on the number of labels, increasing it by three orders of magnitude would increase downstream running times to many hours.
For a more comprehensive analysis, we restricted our evaluation to chromosome 20 (21,687 probes), which we assessed to be the most complicated to infer, as it appears to have the highest number of different CN states and breakpoints. CBS yields a 19-state result after 15.78 s (Fig 3,  top). We have then used a 19-state model with automated priors (Pðs 2 0:04Þ ¼ 0:9Þ, 10 prior self-transitions, all other Dirichlet parameters set to 1) to reproduce this result. Using noise control (see Methods), our method took 1.61 s for 600 iterations. The solution we obtained is consistent with CBS (Fig 3, middle and bottom). However, only 11 states were part of the final solution, i. e. 8 states yielded no significant likelihood above that of other states. We  Fast Bayesian Inference of Copy Number Variants using Hidden Markov Models with Wavelet Compression observe superfluous states being ignored in our simulations as well. In light of the results on the entire array, we suggest that the segmentation by DNAcopy has not sufficiently been merged by MergeLevels. Most strikingly, HaMMLET does not show any marginal support for a segment called by CBS around probe number 4,500. We have confirmed that this is not due to data compression, as the segment is broken up into multiple blocks in each iteration (cf. Web Supplement S5 at https://zenodo.org/record/46263). On the other hand, two much smaller segments called by CBS in the 17,000-20,000 range do have marginal support of about 40% in HaMMLET, suggesting that the lack of support for the larger segment is correct. It should be noted that inference differs between the entire array and chromosome 20 in both methods, since long-range effects have higher impact in larger data.
We also demonstrate another feature of HaMMLET called noise control. While Gaussian emissions have been deemed a sufficiently accurate noise model for aCGH [67], microarray data is prone to outliers, for example due to damages on the chip. While it is possible to model outliers directly [60], the characteristics of the wavelet transform allow us to largely suppress them during the construction of our data structure (see Methods). Notice that due to noise control most outliers are correctly labeled according to the segment they occur in, while the short gain segment close to the beginning is called correctly.

Effects of wavelet compression on speed and convergence
The speedup gained by compression depends on how well the data can be compressed. Poor compression is expected when the means are not well separated, or short segments have small variance, which necessitates the creation of smaller blocks for the rest of the data to expose potential low-variance segments to the sampler. On the other hand, data must not be overcompressed to avoid merging small aberrations with normal segments, which would decrease the F-measure. Due to the dynamic changes to the block structure, we measure the level of compression as the average compression ratio, defined as the product of the number of data points T and the number of iterations N, divided by the total number of blocks in all iterations. As usual a compression ratio of one indicates no compression.
To evaluate the impact of dynamic wavelet compression on speed and convergence properties of an HMM, we created 129,600 different data sets with T = 32,768 many probes. In each data set, we randomly distributed 1 to 6 gains of a total length of {100, 250, 500, 750, 1000} uniformly among the data, and do the same for losses. Mean combinations (μ loss , μ neutral , μ gain ) were chosen from log 2 . These values have been selected to yield a wide range of easy and hard cases, both well separated, low-variance data with large aberrant segments as well as cases in which small aberrations overlap significantly with the tail samples of high-variance neutral segments. Consequently, compression ratios range from *1 to *2, 100. We use automatic priors Pðs 2 0:2Þ ¼ 0:9, self-transition priors α ii 2 {10, 100, 1000}, non-self transition priors α ij = 1, and initial state priors α 2 {1,10}. Using all possible combinations of the above yields 129,600 different simulated data sets, a total of 4.2 billion values.
We achieve speedups per iteration of up to 350 compared to an uncompressed HMM ( Fig  4). In contrast, [62] have reported ratios of 10-60, with one instance of 90. Notice that the speedup is not linear in the compression ratio. While sampling itself is expected to yield linear speedup, the marginal counts still have to be tallied individually for each position, and dynamic block creation causes some overhead. The quantization artifacts observed for larger speedup are likely due to the limited resolution of the Linux time command (10 milliseconds). Compressed HaMMLET took about 11.2 CPU hours for all 129,600 simulations, whereas the uncompressed version took over 3 weeks and 5 days. All running times reported are CPU time measured on a single core of a AMD Opteron 6174 Processor, clocked at 2.2 GHz.
We evaluate the convergence of the F-measure of compressed and uncompressed inference for each simulation. Since we are dealing with multi-class classification, we use the micro-and macro-averaged F-measures (F mi , F ma ) proposed by [76]: Here, TP i denotes a true positive call for the i-th out of M states, π and ρ denote precision and recall. These F-measures tend to be dominated by the classifier's performance on common and rare categories, respectively. Since all state labels are sampled from the same prior and hence their relative order is random, we used the label permutation which yielded the highest sum of micro-and macro-averaged F-measures. The simulation results are included in Web Supplement S6 at https://zenodo.org/record/46263. In Fig 5, we show that the compressed version of the Gibbs sampler converges almost instantly, whereas the uncompressed version converges much slower, with about 5% of the cases failing to yield an F-measure >0.6 within 1,000 iterations. Wavelet compression is likely to yield reasonably large blocks for the majority class early on, which leads to a strong posterior estimate of its parameters and self-transition probabilities. As expected, F ma are generally worse, since any misclassification in a rare class has a larger impact. Especially in the uncompressed version, we HaMMLET's speedup as a function of the average compression during sampling. As expected, higher compression leads to greater speedup. The non-linear characteristic is due to the fact that some overhead is incurred by the dynamic compression, as well as parts of the implementation that do not depend on the compression, such as tallying marginal counts. observe that F ma tends to plateau until F mi approaches 1.0. Since any misclassification in the majority (neutral) class adds false positives to the minority classes, this effect is expected. It implies that correct labeling of the majority class is a necessary condition for correct labeling of minority classes, in other words, correct identification of the rare, interesting segments requires the sampler to properly converge, which is much harder to achieve without compression. It should be noted that running compressed HaMMLET for 1,000 iterations is unnecessary on the simulated data, as in all cases it converges between 25 and 50 iterations. Thus, for all practical purposes, further speedup by a factor of 40-80 can be achieved by reducing the number of iterations, which yields convergence up to 3 orders of magnitude faster than standard FBG.

Coriell, ATCC and breast carcinoma
The data provided by [77] includes 15 aCGH samples for the Coriell cell line. At about 2,000 probes, the data is small compared to modern high-density arrays. Nevertheless, these data sets have become a common standard to evaluate CNV calling methods, as they contain few and simple aberrations. The data also contains 6 ATCC cell lines as well as 4 breast carcinoma, all of which are substantially more complicated, and typically not used in software evaluations. In Fig 6, we demonstrate our ability to infer the correct segments on the most complex example, a T47D breast ductal carcinoma sample of a 54 year old female. We used 6-state automatic priors with Pðs 2 0:1Þ ¼ 0:85, and all Dirichlet hyperparameters set to 1. On a standard laptop, HaMMLET took 0.09 seconds for 1,000 iterations; running times for the other samples were similar. Our results for all 25 data sets have been included in Web Supplement S7 at https:// zenodo.org/record/46263.

Discussion
In the analysis of biological data, there are usually conflicting objectives at play which need to be balanced: the required accuracy of the analysis, ease of use-using the software, setting software and method parameters-and often the speed of a method. Bayesian methods have obtained a reputation of requiring enormous computational effort and being difficult to use, for the expert knowledge required for choosing prior distributions. It has also been recognized HaMMLET's inference of copy-number segments on T47D breast ductal carcinoma. Notice that the data is much more complex than the simple structure of a diploid majority class with some small aberrations typically observed for Coriell data.  [60,62,78] that they are very powerful and accurate, leading to improved, high-quality results and providing, in the form of posterior distributions, an accurate measure of uncertainty in results. Nevertheless, it is not surprising that a hundred times larger effort in computation alone prevented wide-spread use.
Inferring Copy Number Variants (CNV) is a quite special problem, as experts can identify CN changes visually, at least on very good data and for large, drastic CN changes (e. g., long segments lost on both chromosomal copies). With lesser quality data, smaller CN differences and in the analysis of cohorts the need for objective, highly accurate, and automated methods is evident.
The core idea for our method expands on our prior work [62] and affirms a conjecture by Lai et al. [65] that a combination of smoothing and segmentation will likely improve results. One ingredient of our method are Haar wavelets, which have previously been used for pre-processing and visualization [21,64]. In a sense, they quantify and identify regions of high variation, and allow to summarize the data at various levels of resolution, somewhat similar to how an expert would perform a visual analysis. We combine, for the first time, wavelets with a full Bayesian HMM by dynamically and adaptively infering blocks of subsequent observations from our wavelet data structure. The HMM operates on blocks instead of individual observations, which leads to great saving in running times, up to 350-fold compared to the standard FB-Gibbs sampler, and up to 288 times faster than CBS. Much more importantly, operating on the blocks greatly improves convergence of the sampler, leading to higher accuracy for a much smaller number of sampling iterations. Thus, the combination of wavelets and HMM realizes a simultaneous improvement on accuracy and on speed; typically one can have one or the other. An intuitive explanation as to why this works is that the blocks derived from the wavelet structure allow efficient, summary treatment of those "easy" to call segments given the current sample of HMM parameters and identifies "hard" to call CN segment which need the full computational effort from FB-Gibbs. Note that it is absolutely crucial that the block structure depends on the parameters sampled for the HMM and will change drastically over the run time. This is in contrast to our prior work [62], which used static blocks and saw no improvements to accuracy and convergence speed. The data structures and linear-time algorithms we introduce here provide the efficient means for recomputing these blocks at every cycle of the sampling, cf. Fig 1. Compared to our prior work, we observe a speed-up of up to 3,000 due to the greatly improved convergence, O(T) vs. O(T log T) clustering, improved numerics and, lastly, a C++ instead of a Python implementation.
We performed an extensive comparison with the state-of-the-art as identified by several review and benchmark publications and with the standard FB-Gibbs sampler on a wide range of biological data sets and 129,600 simulated data sets, which were produced by a simulation process not based on HMM to make it a harder problem for our method. All comparisons demonstrated favorable results for our method when measuring accuracy at a very noticeable acceleration compared to the state-of-the-art. It must be stressed that these results were obtained with a statistically sophisticated model for CNV calls and without cutting algorithmic corners, but rather an effective allocation of computational effort.
All our computations are performed using our automatic prior, which derives estimates for the hyperparameters of the priors for means and variances directly from the wavelet tree structure and the resulting blocks. The block structure also imposes a prior on self-transition probabilities. The user only has to provide an estimate of the noise variance, but as the automatic prior is designed to be weak, the prior and thus the method is robust against incorrect estimates. We have demonstrated this by using different hyperparameters for the associated Dirichlet priors in our simulations, which HaMMLET is able to infer correctly regardless of the transition priors. At the same time the automatic prior can be used to tune certain aspects of the HMM if stronger prior knowledge is available. We would expect further improvements from non-automatic, expert-selected priors, but refrained from using them for the evaluation, as they might be perceived as unfair to other methods.
In summary, our method is as easy to use as other, statistically less sophisticated tools, more accurate and much more computationally efficient. We believe this makes it an excellent choice both for routine use in clinical settings and principled re-analysis of large cohorts, where the added accuracy and the much improved information about uncertainty in copy number calls from posterior marginal distributions will likely yield improved insights into CNV as a source of genetic variation and its relationship to disease.

Methods
We will briefly review Forward-Backward Gibbs sampling (FBG) for Bayesian Hidden Markov Models, and its acceleration through compression of the data into blocks. By first considering the case of equal emission variances among all states, we show that optimal compression is equivalent to a concept called selective wavelet reconstruction, following a classic proof in wavelet theory. We then argue that wavelet coefficient thresholding, a variance-dependent minimax estimator, allows for compression even in the case of unequal emission variances. This allows the compression of the data to be adapted to the prior variance level at each sampling iteration. We then derive a simple data structure to dynamically create blocks with little overhead. While wavelet approaches have been used for aCGH data before [29,33,34,63], our method provides the first combination of wavelets and HMMs.

Bayesian Hidden Markov Models
Let T be the length of the observation sequence, which is equal to the number of probes. An HMM can be represented as a statistical model ðq; A; y; π j yÞ, with transition matrix A, a latent state sequence q = (q 0 , q 1 , . . ., q T−1 ), an observed emission sequence y = (y 0 , y 1 , . . ., y T−1 ), emission parameters θ, and an initial state distribution π.
In the usual frequentist approach, the state sequence q is inferred by first finding a maximum likelihood estimate of the parameters, ðA ML ; y ML ; π ML Þ ¼ arg max ðA;y;πÞ LðA; y; π j yÞ; using the Baum-Welch algorithm [53,54]. This is only guaranteed to yield local optima, as the likelihood function is not convex. Repeated random reinitializations are used to find "good" local optima, but there are no guarantees for this method. Then, the most likely state sequence given those parameters,q ¼ arg max q Pðq j A ML ; y ML ; π ML ; yÞ; is calculated using the Viterbi algorithm [55]. However, if there are only a few aberrations, that is there is imbalance between classes, the ML parameters tend to overfit the normal state which is likely to yield incorrect segmentation [62]. Furthermore, alternative segmentations given those parameters are also ignored, as are the ones for alternative parameters.
The Bayesian approach is to calculate the distribution of state sequences directly by integrating out the emission and transition variables, Pðq; A; y; π j yÞ dπ dy dA: Since this integral is intractable, it has to be approximated using Markov Chain Monte Carlo techniques, i. e. drawing N samples, ðq ðiÞ ; A ðiÞ ; y ðiÞ ; π ðiÞ Þ $ Pðq; A; y; π j yÞ; and subsequently approximating marginal state probabilities by their frequency in the sample Iðq ðiÞ t ¼ sÞ: Thus, for each position t, we get a complete probability distribution over the possible states. As the marginals of each variable are explicitly defined by conditioning on the other variables, an HMM lends itself to Gibbs sampling, i. e. repeatedly sampling from the marginals ðA j q; y; y; πÞ, ðy j q; A; y; πÞ, ðπ j A; y; y; qÞ, and ðq j A; y; y; πÞ, conditioned on the previously sampled values. Using Bayes's formula and several conditional independence relations, the sampling process can be written as y $ Pðyjq; y; t y Þ / Pðq; yjyÞPðyjt y Þ; π $ PðπjA; q; t π Þ / PðA; qjπÞPðπjt π Þ; and q $ PðqjA; y; y; πÞ; where τ x represents hyperparameters to the prior distribution Pðx j t x Þ. Typically, each prior will be conjugate, i. e. it will be the same class of distributions as the posterior, which then only depends on updated parameters τ ? , e.g. A $ PðA j t ? A Þ ¼ PðA j π; q; t A Þ. Thus τ π and t Aðk;:Þ , the hyperparameters of π and the k-th row of A, will be the α i of a Dirichlet distribution, and τ θ = (α, β, ν, μ 0 ) will be the parameters of a Normal-Inverse Gamma distribution.
Notice that the state sequence does not depend on any prior. Though there are several schemes available to sample q, [58] has argued strongly in favor of Forward-Backward sampling [57], which yields Forward-Backward Gibbs sampling (FBG) above. Variations of this have been implemented for segmentation of aCGH data before [60,62,78]. However, since in each iteration a quadratic number of terms has to be calculated at each position to obtain the forward variables, and a state has to be sampled at each position in the backward step, this method is still expensive for large data. Recently, [62] have introduced compressed FBG by sampling over a shorter sequence of sufficient statistics of data segments which are likely to come from the same underlying state. Let B ≔ ðB w Þ W w¼1 be a partition of y into W blocks. Each block B w contains n w elements. Let y w, k the k-th element in B w . The forward variable α w (j) for this block needs to take into account the n w emissions, the transitions into state j, and the n w − 1 self-transitions, which yields and Pðy w;k j m; s 2 Þ: The ideal block structure would correspond to the actual, unknown segmentation of the data. Any subdivision thereof would decrease the compression ratio, and thus the speedup, but still allow for recovery of the true breakpoints. In addition, such a segmentation would yield sufficient statistics for the likelihood computation that corresponds to the true parameters of the state generating a segment. Using wavelet theory, we show that such a block structure can be easily obtained.

Wavelet theory preliminaries
Here, we review some essential wavelet theory; for details, see [79,80]. Let be the Haar wavelet [81], and ψ j, k (x) ≔ 2 j/2 ψ(2 j x − k); j and k are called the scale and shift parameter. Any square-integrable function over the unit interval, f 2 L 2 ([0,1)), can be approximated using the orthonormal basis fc j;k j j; k 2 Z; À1 j; 0 k 2 j À 1g, admitting a multiresolution analysis [82,83]. Essentially, this allows us to express a function f(x) using scaled and shifted copies of one simple basis function ψ(x) which is spatially localized, i. e. non-zero on only a finite interval in x. The Haar basis is particularly suited for expressing piecewise constant functions. Finite data y ≔ (y 0 , . . ., y T−1 ) can be treated as an equidistant sample f(x) by scaling the indices to the unit interval using x t ≔ t T . Let h ≔ log 2 T. Then y can be expressed exactly as a linear combination over the Haar wavelet basis above, restricted to the maximum level of sampling resolution (j h − 1): The wavelet transform d ¼ Wy is an orthogonal endomorphism, and thus incurs neither redundancy nor loss of information. Surprisingly, d can be computed in linear time using the pyramid algorithm [82].

Compression via wavelet shrinkage
The Haar wavelet transform has an important property connecting it to block creation: Letd be a vector obtained by setting elements of d to zero, thenŷ ¼ W Td ≔Ŵ T d is called selective wavelet reconstruction (SW). If all coefficients d j, k with ψ j, k (x t )6 ¼ψ j, k (x t+1 ) are set to zero for some t, thenŷ t ¼ŷ tþ1 , which implies a block structure onŷ. Conversely, blocks of size >2 (to account for some pathological cases) can only be created using SW. This general equivalence between SW and compression is central to our method. Note thatŷ does not have to be computed explicitly; the block boundaries can be inferred from the position of zero-entries ind alone.
Assume all HMM states had the same emission variance σ 2 . Since each state is associated with an emission mean, finding q can be viewed as a regression or smoothing problem of finding an estimateμ of a piecewise constant function μ whose range is precisely the set of emission means, i. e.
Unfortunately, regression methods typically do not limit the number of distinct values recovered, and will instead return some estimateŷ 6 ¼μ. However, ifŷ is piecewise constant and minimizes kμ Àŷk, the sample means of each block are close to the true emission means. This yields high likelihood for their corresponding state and hence a strong posterior distribution, leading to fast convergence. Furthermore, the change points in μ must be close to change points inŷ, since moving block boundaries incurs additional loss, allowing for approximate recovery of true breakpoints.ŷ might however induce additional block boundaries that reduce the compression ratio.
In a series of ground-breaking papers, Donoho, Johnstone et al. [84][85][86][87][88] showed that SW could in theory be used as an almost ideal spatially adaptive regression method. Assuming one could provide an oracle Δ(μ, y) that would know the true μ, then there exists a method M SW ðy; DÞ ¼Ŵ T SW using an optimal subset of wavelet coefficients provided by Δ such that the quadratic risk ofŷ SW ≔Ŵ T SW d is bounded as By definition, M SW would be the best compression method under the constraints of the Haar wavelet basis. This bound is generally unattainable, since the oracle cannot be queried. Instead, they have shown that for a method M WCT (y, λσ) called wavelet coefficient thresholding, which sets coefficients to zero whose absolute value is smaller than some threshold λσ, there exists some l ?
This l ? T is minimax, i. e. the maximum risk incured over all possible data sets is not larger than that of any other threshold, and no better bound can be obtained. It is not easily computed, but for large T, on the order of tens to hundreds, the universal threshold l u T ≔ ffiffiffiffiffiffiffiffiffi ffi 2lnT p is asymptotically minimax. In other words, for data large enough to warrant compression, universal thresholding is the best method to approximate μ, and thus the best wavelet-based compression scheme for a given noise level σ 2 .

Integrating wavelet shrinkage into FBG
This compression method can easily be extended to multiple emission variances. Since we use a thresholding method, decreasing the variance simply subdivides existing blocks. If the threshold is set to the smallest emission variance among all states,ŷ will approximately preserve the breakpoints around those low-variance segments. Those of high variance are split into blocks of lower sample variance; see [89,90] for an analytic expression. While the variances for the different states are not known, FBG provides a priori samples in each iteration. We hence propose the following simple adaptation: In each sampling iteration, use the smallest sampled variance parameter to create a new block sequence via wavelet thresholding (Algorithm 1).
Algorithm 1 Dynamically adaptive FBG for HMMs 1: procedure HaMMLETðy; t A ; t y ; t π Þ 2: T |y| 3: l ffiffiffiffiffiffiffiffiffiffiffi 2 ln T p 4: A $ PðA j t A Þ 5: y $ Pðy j t y Þ 6: π $ Pðπ j t π Þ 7: for i = 1, . . ., N do 8: s min min s i fŝ MAD ; s i j s 2 i 2 yg 9: Create block sequence B from threshold λσ min 10: q $ Pðq j A; B; y; πÞ using Forward-Backward sampling 11: Add count of marginal states for q to result 12: A $ PðA j t ? A Þ ¼ PðA j π; q; t A Þ / Pðπ; q j AÞPðA j t A Þ 13: y $ Pðy j t ? y Þ ¼ Pðy j q; B; t y Þ / Pðq; B j yÞPðy j t y Þ 14: π $ Pðπ j t ? π Þ ¼ Pðπ j A; q; t π Þ / PðA; q j πÞPðπ j t π Þ 15: end for 16: end procedure While intuitively easy to understand, provable guarantees for the optimality of this method, specifically the correspondence between the wavelet and the HMM domain remain an open research topic. A potential problem could arise if all sampled variances are too large. In this case, blocks would be under-segmented, yield wrong posterior variances and hide possible state transitions. As a safeguard against over-compression, we use the standard method to estimate the variance of constant noise in shrinkage applications, as an estimate of the variance in the dominant component, and modify the threshold definition to l Á minfŝ MAD ; s i 2 yg. If the data is not i.i.d.,ŝ 2 MAD will systematically underestimate the true variance [28]. In this case, the blocks get smaller than necessary, thus decreasing the compression.

A data structure for dynamic compression
The necessity to recreate a new block sequence in each iteration based on the most recent estimate of the smallest variance parameter creates the challenge of doing so with little computational overhead, specifically without repeatedly computing the inverse wavelet transform or considering all T elements in other ways. We achieve this by creating a simple tree-based data structure.
The pyramid algorithm yields d sorted according to (j, k). Again, let h ≔ log 2 T, and ℓ ≔ h − j. We can map the wavelet ψ j, k to a perfect binary tree of height h such that all wavelets for scale j are nodes on level ℓ, nodes within each level are sorted according to k, and ℓ is increasing from the leaves to the root (Fig 7). d represents a breadth-first search (BFS) traversal of that tree, with d j, k being the entry at position b2 j c+k. Adding y i as the i-th leaf on level ℓ = 0, each non-leaf node represents a wavelet which is non-zero for the n ≔ 2 ℓ data points y t , for t in the the interval I j, k ≔ [kn, (k+1)n − 1] stored in the leaves below; notice that for the leaves, kn = t.
This implies that the leaves in any subtree all have the same value after wavelet thresholding if all the wavelets in this subtree are set to zero. We can hence avoid computing the inverse wavelet transform to create blocks. Instead, each node stores the maximum absolute wavelet coefficient in the entire subtree, as well as the sufficient statistics required for calculating the likelihood function. More formally, a node N ℓ,t corresponds to wavelet ψ j,k , with ℓ = h − j and t = k2 ℓ (ψ −1,0 is simply constant on the [0,1) interval and has no effect on block creation, thus we discard it). Essentially, ℓ numbers the levels beginning at the leaves, and t marks the start position of the block when pruning the subtree rooted at N ℓ,t . The members stored in each node are: • The number of leaves, corresponding to the block size: • The sum of data points stored in the subtree leaves: • Similarly, the sum of squares: • The maximum absolute wavelet coefficient of the subtree, including the current d j, k itself: All these values can be computed recursively from the child nodes in linear time. As some real data sets contain salt-and-pepper noise, which manifests as isolated large coefficients on the lowest level, its is possible to ignore the first level in the maximum computation so that no information to create a single-element block for outliers is passed up the tree. We refer to this technique as noise control. Notice that this does not imply that blocks are only created at even t, since true transitions manifest in coefficients on multiple levels. Mapping of wavelets ψ j, k and data points y t to tree nodes N ℓ, t . Each node is the root of a subtree with n = 2 ℓ leaves; pruning that subtree yields a block of size n, starting at position t. For instance, the node N 1,6 is located at position 13 of the DFS array (solid line), and corresponds to the wavelet ψ 3,3 . A block of size n = 2 can be created by pruning the subtree, which amounts to advancing by 2n − 1 = 3 positions (dashed line), yielding N 3,8 at position 16, which is the wavelet ψ 1,1 . Thus the number of steps for creating blocks per iteration is at most the number of nodes in the tree, and thus strictly smaller than 2T. The block creation algorithm is simple: upon construction, the tree is converted to depthfirst search (DFS) order, which simply amounts to sorting the BFS array according to (kn, j), and can be performed using linear-time algorithms such as radix sort; internally, we implemented a different linear-time implementation mimicking tree traversal using a stack. Given a threshold, the tree is then traversed in DFS order by iterating linearly over the array (Fig 7,  solid lines). Once the maximum coefficient stored in a node is less or equal to the threshold, a block of size n is created, and the entire subtree is skipped (dashed lines). As the tree is perfect binary and complete, the next array position in DFS traversal after pruning the subtree rooted at the node at index i is simply obtained as i + 2n − 1, so no expensive pointer structure needs to be maintained, leaving the tree data structure a simple flat array. An example of dynamic block creation is given in Fig 8. Once the Gibbs sampler converges to a set of variances, the block structure is less likely to change. To avoid recreating the same block structure over and over again, we employ a technique called block structure prediction. Since the different block structures are subdivisions of each other that occur in a specific order for decreasing σ 2 , there is a simple bijection between the number of blocks and the block structure itself. Thus, for each block sequence length we register the minimum and maximum variance that creates that sequence. Upon entering a new iteration, we check if the current variance would create the same number of blocks as in the previous iteration, which guarantees that we would obtain the same block sequence, and hence can avoid recomputation.
The wavelet tree data structure can be readily extended to multivariate data of dimensionality m. Instead of storing m different trees and reconciling m different block patterns in each iteration, one simply stores m different values for each sufficient statistic in a tree node. Since we have to traverse into the combined tree if the coefficient of any of the m trees was below the threshold, we simply store the largest N ℓ,t [d] among the corresponding nodes of the trees, which means that the block creation can be done in O(T) instead of O(mT), i. e. the dimensionality of the data only enters into the creation of the data structure, but not the query during sampling iterations. contained all states in almost equal number. However, due to possible class imbalance, means for short segments far away from μ 0 can have low sampling probability, as they do not contribute much to the sample variance of the data. We thus define δ to be the sample variance of block means in the compression obtained byŝ 2 MAD , and take the maximum of those two variances. We thus obtain m 0 ≔ S 1 n ; and n ¼ b max

Numerical issues
To assure numerical stability when working with probabilities, many HMM implementations resort to log-space computations, which involves a considerable number of expensive function calls (exp, log, pow); for instance, on Intel's Nehalem architecture, log (FYL2X) requires 55 operations as opposed to 1 for adding and multiplying floating point numbers (FADD, FMUL) [93]. Our implementation, which differs from [62] greatly reduces the number of such calls by utilizing the block structure: The term accounting for emissions and self-transitions within the block can be written as Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much stabler logarithm function.
exp À X n w k¼1 ðy w;k À m j Þ 2 2s 2 j þ ðn w À 1Þ log A jj À n w log s j ! : Using the block's sufficient statistics n w ; S 1 ≔ X n w k¼1 y w;k ; S 2 ≔ X n w k¼1 y 2 w;k : the exponent can be rewritten as E w ðjÞ ≔ 2m j S 1 À S 2 2s 2 j þ Kðn w ; jÞ; Kðn w ; jÞ ≔ ðn w À 1Þ log A jj À n w log s j þ m 2 j 2s 2 j ! : K(n w , j) can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. Notice that the expressions above correspond to the canonical exponential family form exp(ht(x), θi − F(θ) + k(x)) of a product of Gaussian distributions. Hence, equivalent terms can easily be derived for non-Gaussian emissions, implying that the same optimizations can be used in the general case of exponential family distributions: Only the dot product of the sufficient statistics t(x) and the parameters θ has to be computed in each iteration and for each block, while the log-normalizer F(θ) can be precomputed for each iteration, and the carrier measure k(x) (which is 0 for Gaussian emissions) only has to be computed once.
To avoid overflow of the exponential function, we subtract the largest such exponents among all states, hence E w (j) 0. This is equivalent to dividing the forward variables by exp max k E w ðkÞ ; which cancels out during normalization. Hence we obtaiñ a w ðjÞ ≔ exp E w ðjÞ À max k E w ðkÞ X n w i¼1 a wÀ1 ðiÞA ij ; which are then normalized toâ w ðjÞ ¼ã w ðjÞ P kãw ðkÞ :