Fig 1.

Overview of HaMMLET.

Instead of individual computations per observation (panel a), Forward-Backward Gibbs sampling is performed on a compressed version of the data, using sufficient statistics for block-wise computations (panel b), to accelerate inference in Bayesian hidden Markov models. During sampling (panel c), parameters and copy number sequences are sampled iteratively. In each iteration, the sampled emission variances determine which coefficients of the data's Haar wavelet transform are dynamically set to zero. This controls potential breakpoints at finer or coarser resolution or, equivalently, defines blocks of variable number and size (panel c, bottom). Our approach thus yields a dynamic, adaptive compression scheme that greatly improves convergence speed, accuracy, and running times.
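The thresholding step described above can be sketched in a few lines (a minimal illustration with our own function names, not HaMMLET's implementation): small Haar detail coefficients are zeroed, and the constant runs of the reconstruction define the blocks.

```python
import numpy as np

def haar_forward(y):
    """Orthonormal Haar transform; len(y) must be a power of two.
    Returns the coarsest approximation and per-level detail coefficients."""
    approx = np.asarray(y, dtype=float)
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details

def haar_inverse(approx, details):
    """Invert haar_forward (coarsest detail level is applied first)."""
    a = approx.copy()
    for d in reversed(details):
        even = (a + d) / np.sqrt(2.0)
        odd = (a - d) / np.sqrt(2.0)
        a = np.empty(2 * len(d))
        a[0::2], a[1::2] = even, odd
    return a

def blocks_from_threshold(y, thr):
    """Zero all detail coefficients with |d| < thr, reconstruct, and
    return the start positions of the constant runs (the blocks)."""
    approx, details = haar_forward(y)
    details = [np.where(np.abs(d) >= thr, d, 0.0) for d in details]
    rec = haar_inverse(approx, details)
    return [0] + [t for t in range(1, len(rec))
                  if not np.isclose(rec[t], rec[t - 1])]
```

With a step signal such as `[0,0,0,0,5,5,5,5]` and a small threshold, only the coarse jump coefficient survives, so the reconstruction is piecewise constant and yields the two blocks starting at positions 0 and 4; a larger sampled variance corresponds to a larger threshold and hence fewer, coarser blocks.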


Fig 2.

F-measures of CBS (light) and HaMMLET (dark) for calling aberrant copy numbers on simulated aCGH data [66].

Boxes represent the interquartile range (IQR = Q3 − Q1), with a horizontal line marking the median (Q2), whiskers extending 1.5 IQR beyond Q1 and Q3, and a bullet marking the mean. HaMMLET achieves the same or better F-measures in most cases; on the SRS simulation its F-measure converges to 1 for larger segments, whereas CBS plateaus for aberrations greater than 10.
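For reference, the F-measure compared here is the harmonic mean of precision and recall; a minimal single-class sketch (function name and label convention are our own, assuming the positive class marks aberrant positions):

```python
def f_measure(true_labels, pred_labels, positive):
    """F-measure for one class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive
             for t, p in zip(true_labels, pred_labels))
    fp = sum(t != positive and p == positive
             for t, p in zip(true_labels, pred_labels))
    fn = sum(t == positive and p != positive
             for t, p in zip(true_labels, pred_labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

For example, with truth `[1,1,0,0]` and prediction `[1,0,1,0]` both precision and recall are 0.5, so the F-measure is 0.5.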


Fig 3.

Copy number inference for chromosome 20 in invasive ductal carcinoma (21,687 probes).

CBS produces a 19-state solution (top); however, a compressed 19-state HMM supports only an 11-state solution (bottom), suggesting insufficient level merging in CBS.


Fig 4.

HaMMLET’s speedup as a function of the average compression during sampling.

As expected, higher compression leads to greater speedup. The relationship is non-linear because the dynamic compression incurs some overhead, and because parts of the implementation, such as tallying marginal counts, do not depend on the compression at all.
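One way to make this non-linearity concrete is an Amdahl-style model that splits the work per iteration into a part that shrinks with compression and a fixed part plus overhead; the fractions below are illustrative assumptions, not measured values.

```python
def modeled_speedup(compression_ratio, fixed_fraction=0.10, overhead_fraction=0.05):
    """Amdahl-style sketch: a fixed fraction of the work (e.g. tallying
    marginal counts) does not shrink with compression, and dynamic block
    creation adds overhead; both cap the achievable speedup.
    All fractions here are illustrative assumptions, not measurements."""
    compressible = 1.0 - fixed_fraction
    time = fixed_fraction + overhead_fraction + compressible / compression_ratio
    return 1.0 / time
```

Under this model the speedup grows with the compression ratio but saturates below 1 / (fixed_fraction + overhead_fraction), which qualitatively matches the non-linear curve described in the caption.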


Fig 5.

F-measures for simulation results.

The median value (black) and quantile ranges (in 5% steps) of the micro-averaged (top) and macro-averaged (bottom) F-measures (F_mi, F_ma) for uncompressed (left) and compressed (right) FBG inference, on the same 129,600 simulated data sets, using automatic priors. The x-axis counts iterations only and does not reflect the additional speedup obtained through compression. Notice that the compressed HMM converges within no more than 50 iterations (inset figures, right).
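Micro-averaging pools true- and false-positive counts over all states before computing a single F-measure, whereas macro-averaging computes one F-measure per state and then averages; a small sketch under the standard definitions (function name is ours):

```python
from collections import Counter

def micro_macro_f(true_labels, pred_labels, classes):
    """Micro- and macro-averaged F-measures over a set of classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0
    # Micro: pool the counts over all classes, then compute one F-measure.
    f_mi = f1(sum(tp[c] for c in classes),
              sum(fp[c] for c in classes),
              sum(fn[c] for c in classes))
    # Macro: compute per-class F-measures, then take their mean.
    f_ma = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return f_mi, f_ma
```

Macro-averaging weights rare copy-number states equally with the majority state, which is why the two averages can diverge on data dominated by one class.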


Fig 6.

HaMMLET’s inference of copy-number segments on T47D breast ductal carcinoma.

Notice that the data is much more complex than the simple structure typically observed for Coriell data, i.e. a diploid majority class with some small aberrations.


Fig 7.

Mapping of wavelets ψ_{j,k} and data points y_t to tree nodes N_{j,t}.

Each node is the root of a subtree with n leaves; pruning that subtree yields a block of size n, starting at position t. For instance, the node N_{1,6} is located at position 13 of the DFS array (solid line) and corresponds to the wavelet ψ_{3,3}. A block of size n = 2 can be created by pruning its subtree, which amounts to advancing by 2n − 1 = 3 positions (dashed line), yielding N_{3,8} at position 16, which is the wavelet ψ_{1,1}. Thus the number of steps needed to create blocks in each iteration is at most the number of nodes in the tree, and hence strictly smaller than 2T.
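The skip-by-2n − 1 arithmetic can be sketched on a DFS-ordered array of a perfect binary tree (an illustrative toy, not HaMMLET's actual data layout; the `prune` predicate stands in for the wavelet threshold test):

```python
def dfs_nodes(T):
    """Leaf counts of the nodes of a perfect binary tree over T leaves
    (T a power of two), listed in depth-first order."""
    if T == 1:
        return [1]
    sub = dfs_nodes(T // 2)
    return [T] + sub + sub

def create_blocks(T, prune):
    """prune(pos, n) -> bool decides whether the subtree at DFS position
    pos, spanning n leaves, becomes a single block.  Pruning advances
    the cursor by 2n - 1 positions; descending advances by 1, so at most
    2T - 1 positions (the node count) are ever visited."""
    nodes = dfs_nodes(T)
    pos, start, blocks = 0, 0, []
    while pos < len(nodes):
        n = nodes[pos]
        if n == 1 or prune(pos, n):
            blocks.append((start, n))   # block of size n at leaf `start`
            pos += 2 * n - 1            # skip the entire subtree in DFS order
            start += n
        else:
            pos += 1                    # descend to the left child
    return blocks
```

For T = 8, pruning every subtree of at most 4 leaves visits only 3 of the 15 DFS positions and yields two blocks, (0, 4) and (4, 4); never pruning degenerates to 8 single-observation blocks.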


Fig 8.

Example of dynamic block creation.

The data is of size T = 256, so the wavelet tree contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 of which (round markers) yielded a block (vertical lines at the bottom). Sampling is hence done on a short array of 19 blocks instead of 256 individual values, a compression ratio of 13.5. The horizontal lines in the bottom subplot are the block means derived from the sufficient statistics in the nodes. Notice how the algorithm creates small blocks around the breakpoints, e.g. at t ≈ 125; this requires traversing to lower levels and thus induces additional blocks in other parts of the tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, which is unproblematic, as it increases the degrees of freedom in the sampler.
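Block-wise sufficient statistics of the kind mentioned here (count, sum, and sum of squares per block, from which block means and variances fall out) can be sketched as follows; the function name and layout are illustrative assumptions, not HaMMLET's internals.

```python
import numpy as np

def block_stats(y, starts):
    """Per-block sufficient statistics (count, sum, sum of squares).
    Once computed, the sampler can evaluate block means and likelihood
    terms without revisiting individual observations."""
    ends = list(starts[1:]) + [len(y)]
    stats = []
    for s, e in zip(starts, ends):
        seg = np.asarray(y[s:e], dtype=float)
        stats.append((e - s, seg.sum(), (seg ** 2).sum()))
    return stats
```

For instance, `block_stats([1, 1, 3, 3], [0, 2])` gives `[(2, 2.0, 2.0), (2, 6.0, 18.0)]`, i.e. block means 1.0 and 3.0; the compression ratio is simply T divided by the number of blocks, e.g. 256/19 ≈ 13.5 in this figure.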
