Genomic region detection via Spatial Convex Clustering

Several modern genomic technologies, such as DNA-Methylation arrays, measure spatially registered probes that number in the hundreds of thousands across multiple chromosomes. The measured probes are by themselves less interesting scientifically; instead scientists seek to discover biologically interpretable genomic regions comprised of contiguous groups of probes which may act as biomarkers of disease or serve as a dimension-reducing pre-processing step for downstream analyses. In this paper, we introduce an unsupervised feature learning technique which maps technological units (probes) to biological units (genomic regions) that are common across all subjects. We use ideas from fusion penalties and convex clustering to introduce a method for Spatial Convex Clustering, or SpaCC. Our method is specifically tailored to detecting multi-subject regions of methylation, but we also test our approach on the well-studied problem of detecting segments of copy number variation. We formulate our method as a convex optimization problem, develop a massively parallelizable algorithm to find its solution, and introduce automated approaches for handling missing values and determining tuning parameters. Through simulation studies based on real methylation and copy number variation data, we show that SpaCC exhibits significant performance gains relative to existing methods. Finally, we illustrate SpaCC’s advantages as a pre-processing technique that reduces large-scale genomics data into a smaller number of genomic regions through several cancer epigenetics case studies on subtype discovery, network estimation, and epigenetic-wide association.


Introduction
Modern genomic technologies take fine-grained measurements on on human subjects that allow for increasingly individualized treatment options for various diseases.Several of these technologies capture genomic information at spatially registered locations on the DNA sequence; examples include point mutations, next generation sequencing, copy number variation, and the focus of this paper, DNA Methylation arrays which measure epigenetic variation.Here, the units returned by the technology, CpG sites, are not of primary interest to scientists.More important are regions of CpG sites whose cumulative impact affects gene function [Das and Singal, 2004].To this end, we introduce an unsupervised feature learning technique which maps technological units to biological units by coalescing probes into contiguous genomic regions that are common across multiple subjects.
Epigenetic technologies measure genetic aspects that affect gene regulation beyond gene expression and transcriptomics.One such example is DNA Methylation, which measures the addition of a methyl group to CpG sites creating 5-methylcytosine.High methylation levels (hypermethylation) have been shown to block gene transcription in cancer.Similarly, low methylation levels (hypomethylation), typically at a global level, have also been observed by cancer researchers [Das and Singal, 2004].Recent advances in whole genome bisulfite sequencing technology now yield methylation intensity measurements at hundreds of thousands of CpG sites across the genome [Bibikova et al., 2009].The ratio of methylated intensity to total intensity, the so-called beta-value, is returned as a measure of the DNA methylation level at each site.Such technologies interrogate genomic regions (such as gene promoter regions) by taking measurements of multiple CpG sites in close spatial proximity; beta-values in these regions are often strongly correlated, indicating that they behave functionally as a unit [Eckhardt et al., 2006].These observations imply that probes which are close in genomic distance and which display similar methylation levels are better treated as functional units, or genomic regions.Previous work on region detection in the context of methylation has focused on differentially methylated region (DMR) discovery.Approaches in this area have utilized both smoothing techniques [Hansen et al., 2012, Jaffe et al., 2012] and the concept of linkage disequilibrium [Shoemaker et al., 2010].The task of DMR discovery is, however, an inherently supervised one.In this work, we focus on the unsupervised discovery of genomic regions for methylation data.We develop a method for grouping probes into genomic regions that leads to more interpretable scientific measurements, improves the performance of downstream analyses, and can thus serve as a dimensionality reducing pre-processing step for methylation data.
Another example of grouping spatial genomics data is the well-studied problem of copy number segmentation.Copy number variation (CNV) is one measure of structural variability in the genome that quantifies large scale insertions and deletions of genetic information across the DNA sequence.By utilizing array-CGH technology, structural differences relative to a reference sample are quantified as the log ratio of intensities at various loci across the genome.Such differences have been linked to various cancers such as breast and lung [Shlien and Malkin, 2009].Similar to DNA methylation, copy number measurements at a particular loci are typically of lesser interest than regions of gain or loss, which signal large-scale amplification or deletions [Taylor et al., 2008].As such, the problem of Copy Number Segmentation has received much attention and numerous methods exist for this type of analysis [Olshen et al., 2004, Venkatraman and Olshen, 2007, Wang et al., 2005, Nowak et al., 2011, Bleakley and Vert, 2011].Due to both its well-studied nature and similarity to region detection in the case of methylation data, we consider the task of copy number segmentation as a benchmark, and show that our method performs competitively to many state of the art methods.For each data type we plot raw probe values overlayed with SpaCC clusters for three subjects.Subjects are ordered according to maximum deviation from the average cluster centroid over all subjects.For methylation data we note the ordering detects a hypomethylated region (blue), while for copy number data the ordering sorts patients according to genetic variability.In both cases, we note the discovery of common genomic regions across all patients.
For both methylation and CNV data, scientists seek entire regions of CNV amplifications or deletions or regions of CpG sites that behave as functional units.For these regions to serve as meaningful features for downstream multivariate statistical analyses, they must be consistent across all subjects.To address this, we introduce an unsupervised learning method which maps technological units to more meaningful biological units by clustering probes into genomic regions.An example of the genomic regions resulting from our method, Spatial Convex Clustering (SpaCC), can be seen in Figure 1.To detect such regions, we utilize the popular concept of fusion penalties [Tibshirani et al., 2005] by extending methods for convex clustering to be appropriate for spatial genomics data [Hocking et al., 2011, Chi and Lange].We specifically focus on developing a fully automated and data-driven method that can be used to pre-process spatial genomics data into genomic regions for downstream analyses (Section 2).While we illustrate our method on CNV data to benchmark our method against widely used segmentation approaches (Section 3.1), the main focus of this paper is on methylation data.For this, we show how our method can lead to improved results for biomarker discovery in region-based epigenetic-wide association studies (Section 4.3), and our resulting genomic regions can yield improved features in downstream multivariate analysis such as clustering for subtype discovery (Section 4.2.1) and epigenetic network estimation (Section 4.2.2).

Spatial Convex Clustering
Our objective is to discover groups of probes which (i) have similar measurements, (ii) are spatially contiguous, and (iii) are nearby on the chromosome.Taken together, (i)-(iii) deliver groups of probes similar enough to be considered as single functional units, or genomic regions.Importantly, we require that (i)-(iii) be satisfied simultaneously for all subjects, so that the genomic regions discovered are intrinsically meaningful, and not artifacts of a particular sample.One approach to multi-subject genomic region detection would be to consider clustering the genomic probes based on the subject observations, as opposed to the more commonly used clustering of observations.To this end, we merge ideas that use fusion penalties for copy number segmentation [Wang et al., 2005, Bleakley and Vert, 2011, Nowak et al., 2011] and the more recently introduced convex clustering methods [Chi and Lange, Hocking et al., 2011] to develop an automated pipeline for feature learning with spatial genomics data.In particular, we address data-specific weighting schemes, automated methods for detecting the number and extent of clusters, as well as a principled approach for handling missing values.

Optimization Problem and Algorithm
Let X ∈ R n×p be our data matrix with n subjects and p variables (probes).Our SpaCC problem is defined as minimize Here, we use the Frobenius loss associated with the Gaussian distribution to estimate the cluster means, given by the columns of U ∈ R n×p .The second regularization term induces sparsity in the adjacent column differences of U , thus forcing adjacent column mean estimates to exactly fuse as the regularization parameter, γ is increased; fused columns of U form a spatially contiguous cluster of variables.As entire columns of U are fused as a unit, the clusters are common to all samples.The level of regularization is governed by γ, with larger values encouraging more fusion and larger spatial clusters, and the weights, w i .As discussed subsequently, the latter are taken as spatial weights proportional to the inverse distance between adjacent probes that are specific to each genomic technology; these in turn, lead to more interpretable results and computationally efficient algorithms.Note that in contrast to convex clustering methods [Chi and Lange, Hocking et al., 2011] which cluster observations, our approach clusters variables; additionally, we do not permit clusters among arbitrary groups of probes, and instead permit only genomically adjacent probes to coalesce together.In this sense, our method builds on the success of several existing fusion-based approaches that have been proposed specifically for CNV data [Bleakley andVert, 2011, Nowak et al., 2011].Finally, note that we apply our SpaCC problem separately to data for each chromosome.
To fit our SpaCC model, we adopt an approach introduced by [Chi and Lange] for convex clustering problems.We reformulate (1) by introducing an auxiliary variable V , where , and rewrite the penalty in terms of V .We then use the Alternating Minimization Algorithm (AMA) [Tseng, 1991] optimization algorithm to fit our SpaCC problem.Formulating the augmented Lagrangian, we obtain updates for both primal variables (U •i , V •i ) and dual variables (Λ •i ), as shown in Algorithm 1.
Data: X, w, γ Result: U , V while err > tol do

Spatial Weights
An important input to our SpaCC problem is the weight vector, w, which incorporates genomic spatial structure and yields more interpretable results and faster computation.Importantly, the choice of weights is inversely proportional to the genomic distance between adjacent probes.While there are many such possible spatial weighting schemes, we propose to use w i = exp {−σd i }, where d i = dist(probe i , probe i+1 ) is the distance in basepairs between probes.Here, σ is chosen based on properties of genomic data.For example with copy number data, it is common to have sizable portions of a chromosome amplified or deleted [Redon et al., 2006].Hence, weights that decay at a slower rate as a function of genomic distance will perform well for CNV data; specifically, we take σ = 0.00001 in our empirical studies.For methylation data, we expect methylated regions to form in small localized CpG islands near promoter regions of genes [Eckhardt et al., 2006].Hence, we take σ = 0.0002 for methylation data, which yields a more rapid decay in weights as a function of genomic distance.Examples of these weight choices are shown in the supplemental materials and the efficacy of these choices are tested in simulations, Section 3, and real data examples, Section 4.
Overall, tailoring spatial weights to specific technologies yields more interpretable genomic regions, with larger clusters in CNV data and smaller localized methylated regions, Figure 1.
Additionally, our choice of weights can dramatically decrease the computational burden of our SpaCC algorithm.Specifically, we apply hard-thresholding to the weights to set tiny weights to zero.Exact sparsity in the weight values prevents distant adjacent probes from coalescing into a common genomic region.Further if T weights are set to zero, the penalty term of ( 1) is perfectly separable into T + 1 terms, yielding T + 1 subproblems.These subproblems may in turn be solved in parallel.For example using our methylation weight choice, the Chromosome 17 TCGA Level 3 Breast Cancer Methylation SpaCC problem, Figure 1, separates into 910 subproblems that can be solved completely in parallel.This yields dramatic computational gains when compared to the original problem which consists of p = 23515 probes.

Missing Values and Parameter Selection
We seek to develop a fully automated method for detecting multi-subject genomic regions in CNV and methylation data; often automated methods are more reliable and reproducible, as practitioners have no knobs to tune that can yield different results across studies and labs.To this end, we introduce two automated, optimization-based approaches to handling practical problems with spatial genomics data: missing data and regularization parameter selection.First, missing values tend to be a major problem for large-scale high-throughput genomics data.For example, the TCGA Breast Cancer Level 3 Methylation Chromosome 17 data shown in Figure 1, contains 14270 total missing values.Similarly for TCGA Ovarian Cancer Level 2 Copy Number Variation, Chromosome 17 contains 2349 missing values.Typically, missing genomics data is handled via a two-step process where one first imputes the missing values using many popular off-the-shelf imputation routines [Troyanskaya et al., 2001] and then continues with analyses of the fully imputed data.
This approach, however, is less reproducible as results of downstream analyses can change depending on the imputation procedure employed.Instead, we propose to fit our SpaCC model in the presence of missing data, effectively eliminating the need to preform a separate imputation step.
As before, let X ∈ R n×p be our data matrix.Let M = M n × M p ⊂ {1, . . ., n} × {1, . . ., p} denote the indices of missing elements.We adopt an approach similar to that in Chi et al. [2016] to fit our SpaCC procedure in the presence of missing values.Specifically, we fit our SpaCC loss function only over the the non-missing elements of X, given by the indices M C : minimize First, notice that this optimization problem is still convex, and hence our approach will yield the global solution.We propose to optimize (2) using the majorization minimization (MM) algorithm [Hunter and Lange , 2004].Defining the surrogate function to be notice that g() majorizes the objective f (U ); namely (i) Iteratively minimizing the surrogate objective creates a non-increasing sequence of objective values we augment (3) leading to following algorithm to fit our SpaCC problem in the presence of missing data: end Algorithm 2: SpaCC Algorithm for Missing Data Notice that in contrast to traditional imputation routines, Algorithm 2 does not explicitly replace missing elements in X prior to analysis.Rather, missing indices are filled in iteratively via the SpaCC cluster means for the genomic regions, U .Now that we have an automated way of fitting SpaCC with missing values, we leverage this to propose a k-fold cross-validation scheme to select the single tuning parameter, γ.Note that γ controls both the number of genomic regions and the extent of these regions.To select γ, we employ the approach of Wold [1978] where we remove random elements of X in each fold and take the optimal γ as the parameter whose SpaCC imputed solution fit in the presence of missing values most closely aligns with the removed data.Specifically, for k = 1, . . ., K, where K the number of folds, we define the indices to be left out at the kth k=1 is an approximately equal sized partition of the non-missing elements of X.For each fold k, we introduce additional missing elements via C k and solve the missing data problem, Algorithm 2. Specifically, Given γ, we solve SpaCC with missing values given by I k : minimize We can solve this problem via Algorithm 2, replacing M with I k ; we denote the solution as (U * γ ) k .Then, we evaluate the performance of γ on the kth fold by comparing the solution (U * γ ) k to the left out (non-missing) elements of X via mean squared error: validation algorithm is given as follows: Given the tuning parameter, γ * , selected by minimizing the cross-validation error, we noticed that SpaCC identifies genomic regions which are spatially smaller than desired.Stated another way, cross-validation tends to underestimate the sparsity level in the differences, V .Such behavior is well known for the lasso and other sparse problems and is hence why many advocate using the one-standard-error cross-validation rule [van de Geer et al., 2011].An alternative approach is to post-process the results after cross-validation by thresholding [Meinshausen and Yu, 2009].We adopt such a scheme and propose to threshold the elements of V γ * at a level proportional to the estimated noise, σ2 ; specifically we threshold at the level, n σ, similar to that proposed by Meinshausen and Yu [2009].In Figure 2, we give examples of cross-validation error curves over a sequence of γ values for the TCGA Ovarian Copy Number and Methylation chromosome 17 example from Figure 1; both the minimum cross-validation error as well as our proposed thresholding level are shown.The efficacy of our cross-validation and post-thresholding procedure are further studied in subsequent simulations.Overall, we have developed an automated and principled optimization-based approach for handling missing values and selecting the number and extent of genomic regions that offers many advantages over competing approaches.

Simulation Studies
We study the performance of SpaCC empirically and compare our approach to existing methods through simulations based on real array-CGH and DNA-methylation data.Before presenting our simulation results, it will be helpful to introduce some notation.Our probes are indexed by j ∈ {1, 2, . . ., p}.Associated with each probe is a genomic location l j ∈ R, j = 1, . . ., p, and we denote the distance between genomic locations by d ij .The probes are partitioned into clusters, c g , indexed by g = 1, . . ., G.For each probe j we denote the cluster to which it belongs via r(j) = {g | j ∈ c g }.Our method and competitors will estimate a clustering {ĉ g } Ĝ g=1 , with ĉg ⊂ {1, . . ., p} and ĉg ∩ ĉg = ∅.We will then evaluate the performance of all methods by by comparing the estimated clustering {ĉ g } Ĝ g=1 to the true clustering {c g } G g=1 using common clustering metrics such as the Rand, Adjusted Rand, and Jaccard Indexes [Wagner and Wagner, 2007].Note that these clustering metrics are general and do not specifically account for the fact that we require clusters to be spatially contiguous.Hence, we also employ a lesser-known entropy-based clustering metric, the Variation of Information (VI) metric [Meilȃ, 2007], that is better suited to measuring the information loss between two sets of clusters that may be nested, as we would often expect with spatially contiguous clusters.

Simulation Studies: Copy Number Segmentation
We first evaluate the performance of SpaCC for the well-studied problem of copy number segmentation for array-CGH data.While not our primary focus, the problem shares many attributes in common with the problem of genomic region detection for methylation data.These common features along with many popular tools and software packages [Seshan and Olshen, Zhang] make the problem an ideal benchmark to test the performance of our method.
Our simulations are based on TCGA Ovarian Cancer Level II array-CGH data for chromosome 17 [Network et al., 2011], where we adopt the probe locations and use data for observed subjects to form the mean simulated CNV signal as well as the locations of copy number segments with amplifications or deletions.Specifically, we simulate series for j = 1 . . .p probes and i = 1 . . .n subjects from the following model: Here, µ j is the base mean of the series which is taken from a subject in the TCGA data that was detected as having no amplifications or deletions according to DNACopy [Seshan and Olshen]; s i,r(j) is an indicator of an amplification or deletion for subject i in region r(j) where the regions r(j) are taken from the observed segments as detected by DNACopy for subjects from the TCGA data; m i,r(j) gives the mean shift for the amplification or deletion in region r(j); and ij ∼ N (0, σ) is iid additive noise.As not all subjects will have copy number changes for each region, we simulate s i,r(j) ∼ Bernoulli(q).Also, as each subject could have an amplification or deletion of differing magnitude, we simulate m i,r(j) ∼ ±U (a, b).
We study four simulation scenarios of varying difficulty.First, we study the effect of the magnitude of the copy number changes relative to the noise level by taking a = 0.2, b = 0.4, σ = .1 for large changes, and a = 0.05, b = 0.3, σ = .1 for small magnitude shifts that are more difficult to detect.Next, we use two different subsets of DNAcopy regions detected for a real TCGA patient, with easier or more difficult to detect shapes to seed the segment boundaries; that is, easier to detect segments tend to be spatially longer and harder to detect segments are shorter and more fragmented.Note that the easy and hard set of segments are shown in the Supplemental Materials.Also related to shape, we vary the probability of a shift, q = 0.7, 0.5, with a larger probability corresponding greater consensus across subjects, and hence easier shift detection.Results indicate that SpaCC outperforms according to all metrics on all regimes in this simulation.As illustrated subsequently for real data, DNACopy segments each subject separately.As such, DNACopy performs well on per-subject basis, but performs poorly at the detecting segments common across all patients.To address this shortcoming, the CNTools package and its implementation of the Circular Binary Segmentation (CBS) algorithm offers an alternative whereby common segments are returned for all subjects.
Yet, CNTools' still fails to reach a reasonable consensus regarding common subject segments.Given the discrepancy between subject's amplification/deletion regions, CNTools tends to deliver shorter segments, wherein agreement can be reached across several subjects.These short segments often partition the true, larger segments, but are both difficult to interpret and necessarily score poorly on the various metrics.In contrast, SpaCC's regularized solution yields an improved consensus between the individual series and their common segments, detecting larger segments more closely aligned with the underlying truth.

Simulation Studies: Methylation Region Detection
Next we evaluate SpaCC's performance for the task of methylation region discovery.We again model our study based on real data, utilizing TCGA Breast Cancer Level 3 Methylation data for chromosome 17 [Network et al., 2012].Initial clusters, {c g }, are detected using SpaCC, which then act as the ground truth for what follows.We simulate methylation beta values via the cdf transform X ij = Φ(z ij ), for subjects i = 1, . . ., n and probes j = 1, . . ., p, to ensure values in (0, 1).Our simulated methylation regions are denoted by the degree of correlation of probes within the region.Specifically, we take z ij to be z i ∼ N p (0, Σ).
with spatial covariance given by Σ where d ij is the distance between probes.Here σ w controls the decay rate of the within cluster spatial correlation, and σ b similarly between clusters.The difficulty of the simulation is controlled by difference between the the within and between spatial correlation decay rate.By varying the ratio of decay rates, σ b /σ w , we consider three scenarios of high, medium, and low within region spatial correlation, relative to between; these scenarios correspond to (σ w , σ b ) = (100KB, .01KB),(100KB, .1KB),(100KB, 1KB), respectively.We compare SpaCC's perfor-  given by the degree of spatial correlation.
We again note that SpaCC outperforms existing region-detection methods across all regimes and metrics.
A portion of SpaCC's success may be attributed to the continuous nature of its spatial fusions, wherein clusters are joined in a smooth fashion via continuous spatial weight decay.The LD method, by contrast, implements a greedy strategy for accumulating clusters based on a discrete window size, here 500 bp.This fixed window size gives the LD method less flexibility to detect long range clusters when they are present; SpaCC, by not choosing apriori distance cutoffs, does not have this difficulty.

Applications of SpaCC to Cancer Genomics Data
We now illustrate how SpaCC can be used as a pre-processing tool to improve the analysis of real array-CGH and DNA-methylation data.We plot three subjects' copy number series for the whole chromosome and overlay the estimated segments, colored consecutively, detected by each method; the level of each segment corresponds to the subject's average copy number per segment.The three subjects plotted represent the subject with the minimum, median, and maximum copy number changes across the cohort.These results visually illustrate the many advantages of SpaCC over existing tools for copy number segmentation.In particular, DNACopy, while performing well on a per-subject basis delivers inconsistent segments across multiple subjects.Indeed, the number of segments returned by DNACopy varies from 1 to 57, depending on subject.Thus, the segments produced by DNACopy cannot be used as a pre-processing step before further multivariate analyses, as the features are unaligned across the subjects.Conversely, CNTools, while delivering segments consistent across the subjects, finds difficulty assessing the trade-off between subject-specific patterns and patterns common to all subjects.The result is an awkward consensus with many (3754), short segments.This "shattered" appearance makes interpretation difficult due to both the number of segments and their size.In contrast, SpaCC finds longer more interpretable segments (61), finding a more appropriate balance between subject-specific and common patterns across the cohort.As such, SpaCC is ideally suited as a pre-processing method to segment and reduce the dimension of copy number data before further multivariate analyses.We now study and illustrate how SpaCC can be used for biomarker discovery and as a pre-processing tool that yields a reduced and more meaningful set of features for downstream analysis of methylation data.First, we apply SpaCC to discover methylation regions from the TCGA Level 3 Breast Cancer data [Network et al., 2012] for chromosome 17 with n = 791 subjects and p = 23515 probes.We visually examine methylation regions returned by SpaCC in Figure 4. Notice that the methylation regions detected are much shorter than their Copy Number counterparts.Segments in the latter tend to be amplified or deleted in large chromosomal regions, whereas regions of consistent methylation levels to occur in small localized areas corresponding to promoter regions of genes [Eckhardt et al., 2006].On the right in Figure 4, we can easily see how SpaCC features aggregate across probes whose levels are meaningful and consistent across all subjects; this is illustrated by the region denoted in blue.In contrast, SpaCC has not fused the two probes flanking this region as their levels differ significantly from the blue region for the subjects with the minimum and median variation in methylation levels shown.These local regions allow us to capture finegrained characteristics that are more biologically meaningful than examining single probes.For example, the highlighted blue region is hypomethylated in most subjects, but hypermethylated for the bottom subject

Detecting Methylation Regions with SpaCC
shown, which has the maximum variation in methylation levels.This region corresponds to the promoter region of the ABCC3 gene which is associated with HER2 and luminal breast tumors O'Brien et al. [2008].
In total, SpaCC detected 9, 080 methylation regions for chromosome 17, which offers a reduction from the original 23, 515 probes.

Breast Cancer Methylation Subtype Discovery
Several have recently suggested that methylation levels can be used to define cancer subtypes, and in breast cancer, methylation levels have been used to characterize the well-known expression-based subtypes [Van der Auwera et al., 2010].Here, we illustrate how reducing methylation data to the SpaCC-derived reduced set of features offers improvements in downstream multivariate analyses such as subtype detection.We continue to work with the TCGA breast cancer data for chromosome 17 and consider classifying between the Basal and Luminal (A and B) subtypes.To fairly assess the performance of SpaCC's region-based features relative to using the raw probe values, we consider a simple classification scheme where we first reduce the data using principal components and then fit a Naive Bayes classifier.In Figure 5, we repeatedly split the data into training and test sets and report the misclassification errors on the test sets for a range of principal components (PCs).We also show PC scatterplots for SpaCC-based and probe-based analysis, illustrating the superior discriminatory power of SpaCC-based features.Also notice that SpaCC-based features outperform in terms of misclassification error for all the number of PCs considered.By reducing the noisy probe-level data into more biologically meaningful units, SpaCC is able to yield improved results for subtype detection.
Next, individual probes, X ij are simulated as deviations about the region means via X ij ∼ beta(α, τ ), where α, τ are chosen to ensure E[X ij ] = m ig .Finally, we generate our response as a linear function of a subset of the region means: y i = β 1 m ig1 + . . .β d m ig d + i , where ∼ N (0, σ 2 ), and β l ∼ N (β seed , √ β seed ).The difficulty of the simulation is controlled by the signal-to-ratio (SNR) level, here the size of β seed relative to σ 2 .Our simulation consists of n = 94 patients relative to p = 23, 515 probes.
We compare our SpaCC-based rEWAS approach to rEWAS using linkage disequilibrium [Shoemaker et al., 2010] and the Fisher product method [Yu et al., 2015] as well as EWAS on the raw probes.For all methods, we fit a univariate linear regression model at each probe or region and corrected for multiplicity using Benjamini-Hochberg's method to control the FDR [Benjamini and Hochberg, 1995].The objective is to recover the regions or all the probes contained within the regions that determine the outcome.Notice then that there are two ways to report the true positive rate (TPR) and false discovery proportion (FDP) for this simulation study.First, we can use the regions detected as significant and compare the spatial extent of a detected region to that of the corresponding true region to determine a true positive rate; we call this the region point-of-view (RegionPOV) and the FDP is defined analogously.Second, we can compare the probes detected, or the probes within regions detected as significant, to the probes that lie within the true regions to determine the true positive rate; we call this the probe point-of-view (ProbePOV).The Supplemental Materials contain formal definitions of these metrics used to evaluate our simulation study.
In Figure 7, we report the results of our simulation study at various SNR levels and FDR levels.Compared to probe-based EWAS and LD-based rEWAS, SpaCC-based rEWAS, achieves comparable FDP levels while achieving higher TPR according to both probe and region-based metrics.Overall, this study confirms our intuition that first reducing methylation data to regions via SpaCC, leads to increases in statistical power for EWAS.Our analysis reveals several important epigenetic markers which are largely consistent with the cancer literature.In the top portion of the Table 5, we list the most significant discoveries from rEWAS which were also discovered by EWAS.We also report the epigenetic marker location, its gene target or the nearest gene, and a brief description of its role in the cancer literature.Note that further details on the literature are given single integrative analysis to discover meaningful regions across differing biological modalities.An R package SpaCCr implementing our method is available from CRAN [Nagorski].

Figure 1 :
Figure 1: SpaCC detected regions for TCGA Level 3 Breast Cancer DNA-Methylation Data n = 791, p = 23515 (left) and TCGA Level 2 Ovarian Cancer Copy Number Variation (array-CGH) Data with n = 456, p = 7658 (right), both for Chromosome 17.For each data type we plot raw probe values overlayed with SpaCC Figure 2: Cross Validation plots for both methylation (left) and Copy Number (right).Sparsity-level is plotted along the x-axis and Average Error on the y-axis.The red line shows the sparsity-level of minimum Average Error, and the blue line shows sparsity level obtained after thresholding.

Figure 3 :
Figure 3: Copy Number Segmentation results for TCGA Ovarian cancer chromosome 17.We compare SpaCC (top) to DNAcopy (bottom left) and CNTools (bottom right).Copy number series over the chromosome are represented by black points and segments are represented by horizontal colored lines.The subjects with the minimum, median, and maximum copy number changes over the cohort are visualized.

Figure 4 :
Figure 4: Breast Cancer methylation region detection results.A portion of Chromosome 17 with SpaCCdetected regions overlayed for the subjects with the minimum, median, and maximum variation in methylation levels (left).We zoom in on the 48.70Mb -48.72Mb region of chromosome 17 (right) containing the promoter region of the ABCC3 gene.

Figure 5 :
Figure 5: (Left) Principal Component scatterplots for SpaCC features (top) and raw probes (bottom), illustrating increased separation between Basal and Luminal subtypes for SpaCC features.(Right) Naive Bayes misclassification error on repeated test/train splits for both Probe and SpaCC features.SpaCC achieves lower error rates for all number of PC's.
chromosome 17 data and use Gaussian Graphical Models to infer networks where either the raw probes or the SpaCC-estimated regions serve as nodes.Networks are estimated via the graphical lasso [Friedman et al., 2008] with a common (dimension dependent) regularization parameter λ = 8 log(p) n and stability selection [Meinshausen and Bühlmann, 2010] with a common threshold parameter of τ = .95;we utilize the BigQUIC software [Dhillon] to estimate both the probe and region-based networks.

Figure 7 :
Figure 7: rEWAS Simulation Results: True Positive Rate (TPR) and False Discovery Proportion (FDP) for SpaCC-based rEWAS, LD-based rEWAS and probe-based EWAS over various SNR levels and FDR levels.Results are reported according to both probe point-of-view (ProbePOV; top) and Region point-of-view (RegionPOV; bottom) metrics.

Figure 8 :
Figure 8: Manhattan Plots with 1% and 5% FDR lines for EWAS (top) and SpaCC-based rEWAS (bottom)analysis for lung cancer methylation.We note the region-based analysis recovers a sizable portion of the probes deemed significant in the probe-based analysis.In addition, a large number of additional probes are recovered at an equivalent FDR-level.

Table 2 :
SpaCC and Linkage Disequilibrium performance for region detection on simulated methylation data.We see SpaCC outperforms linkage disequilibrium across all metrics and for all levels of difficulty, Using case studies from TCGA ovarian, breast, and lung cancers we show how SpaCC to those of the two most popular competitors, DNACopy [Seshan and Olshen] and CNTools [Zhang].

Table 4 :
(Left) Probe (region) significance overlap between region and probe based analysis at 5% FDR level.Total Number of .05-levelprobes is 287.Total Number of .05-levelregions is 556.Total Number of probes in .05-levelregions is 861.(Right)Similarly at 1% FDR level.Total number of .01-levelprobes is 29 .Total number of .01-levelregions is 49.Total Number of probes in .01-regions is 77.