Identification of Networks of Co-Occurring, Tumor-Related DNA Copy Number Changes Using a Genome-Wide Scoring Approach

Tumorigenesis is a multi-step process in which normal cells transform into malignant tumors following the accumulation of genetic mutations that enable them to evade the growth control checkpoints that would normally suppress their growth or result in apoptosis. It is therefore important to identify those combinations of mutations that collaborate in cancer development and progression. DNA copy number alterations (CNAs) are one of the ways in which cancer genes are deregulated in tumor cells. We hypothesized that synergistic interactions between cancer genes might be identified by looking for regions of co-occurring gain and/or loss. To this end we developed a scoring framework to separate truly co-occurring aberrations from passenger mutations and dominant single signals present in the data. The resulting regions of high co-occurrence can be investigated for between-region functional interactions. Analysis of high-resolution DNA copy number data from a panel of 95 hematological tumor cell lines correctly identified co-occurring recombinations at the T-cell receptor and immunoglobulin loci in T- and B-cell malignancies, respectively, showing that we can recover truly co-occurring genomic alterations. In addition, our analysis revealed networks of co-occurring genomic losses and gains that are enriched for cancer genes. These networks are also highly enriched for functional relationships between genes. We further examine sub-networks of these networks, core networks, which contain many known cancer genes. The core network for co-occurring DNA losses we find seems to be independent of the canonical cancer genes within the network. Our findings suggest that large-scale, low-intensity copy number alterations may be an important feature of cancer development or maintenance by affecting gene dosage of a large interconnected network of functionally related genes.


Introduction
Tumor development is generally thought to be a process in which healthy cells transform into malignant tumor cells through the step-wise acquisition of oncogenic alterations [1,2]. This implies that certain changes have to occur together for effective oncogenic transformation of a normal cell. There are a multitude of (epi-)genetic lesions that cause deregulated expression of oncogenes and tumor suppressor genes. Co-operative deregulation of cancer genes has indeed been observed in several different settings. Retroviral insertional mutagenesis screens in mice have shown preferential co-mutation of specific combinations of genes within the same tumor [3]. Likewise, in a study where a thousand individual tumors were screened for mutations in 17 different oncogenes, preferential co-mutation of the PIK3CA and KRAS genes was observed [4].
Besides single basepair mutations or retroviral integrations, the activity of genes can also be perturbed by DNA copy number alterations that arise as a result of genomic instability, which is frequently observed in tumor cells [1]. Whether genomic instability is important for tumor initiation is controversial, but its contribution to tumor progression is undisputed [5,6]. Loss of DNA is a mechanism for the tumor to eliminate copies of tumor suppressor genes, which prevent cancer formation. Conversely, DNA copy number gain or amplification may lead to activation of oncogenes that promote tumor development. We aimed to find genomic regions of gains and losses that are preferentially gained or lost together. We could subsequently link genes that lie in cooccurring regions to each other, allowing us to find functional interactions that reveal the mechanisms underlying tumor development.
DNA copy number alterations (CNAs) may be measured on microarray platforms [7]. Array-based comparative genomic hybridization (aCGH) of differentially labeled tumor and normal (2n) DNA is performed on oligonucleotide-or Bacterial Artificial Chromosome (BAC) based microarray platforms. For each probe on the microarray, the ratio of signal intensities of tumor versus normal DNA is a measure of the relative DNA copy number of the corresponding genomic region in the tumor sample. Platforms designed to identify single nucleotide polymorphisms (SNPs) can also infer CNAs by comparing the raw probe intensity values measured in a tumor sample with a reference sample.
In order to extract those DNA copy number aberrations that preferentially occur together, we developed an analysis framework. The basic premise of our analysis is to define a pair-wise score for any given pair of genomic locations present in the dataset. This scoring index will only be high if both genomic locations are recurrently aberrated in multiple independent samples within the tumor panel, and if they co-vary similarly over the different samples ( Figure 1). Using a Gaussian kernel convolution method we look for aggregates of high scores in the 2D genomic pair-wise space ( Figure 2). The top peaks in the convolved score matrix can be mapped back to two distinct co-mutated genomic locations. The genes that reside in these genomic locations can then be functionally related to each other.
The raw data consist of non-discrete measurements of the average DNA copy number of the population of cells present in the measured sample. The signal consists of a measurement of a heterogeneous population of tumor cells, which may contain many populations potentially carrying different mutations and copy number alterations, as well as normal (diploid) cells. To reduce heterogeneity as much as possible we choose to analyze a collection of hematopoietic tumor cell lines, which on a persample basis can be considered clonal. There were several other reasons for analyzing this particular dataset. First, it is a high resolution dataset of well-characterized, clinically relevant samples. Although these samples are cell lines, they are widely used as a model system for the diseases from which they have been derived. Second, this collection of samples includes cell lines derived from T-and B-cell leukemias carrying rearranged T-cell receptor and immunoglobulin loci, respectively. We therefore should be able to separate these two distinct lymphoid malignancies based on cooccurring DNA copy number losses at the T-cell receptor and immunoglobulin loci. During T-and B-cell development, these loci undergo DNA recombination and gene deletion in a process known as V(D)J-recombination. The human genome contains three specific T-cell receptor loci (alpha/delta, beta and gamma) on two different chromosomes that determine their variability. B-cells have three different loci (IgG kappa, IgG lambda and the IgG heavy chain) on three different chromosomes that undergo recombination to generate a diverse repertoire of immunoglobulins. Since T-and B-cells only undergo recombination of their respective loci after lineage commitment, it is unlikely that T-cell receptor loci are recombined in B-cells and vice-versa. If our approach is successful at finding co-occurring losses, it should identify the co-occurring rearrangements at the T-cell receptor alpha/delta and beta/gamma loci in T-cell leukemias. Similarly, we should be able to pick up co-occurring losses at the IgG kappa, lambda and heavy chain loci in B-cell malignancies.

Results
Defining a continuous co-occurrence score A classic example of finding associations in a large (binary) dataset is association rule mining. Identification of cooperating events in continuous data requires a different approach than binary association rule mining. First we developed a method to score for co-occurrence between two continuous measurements ( Figure 1). We then applied this score in a framework that is able to find co-occurrences in genome-wide measurements. This framework is shown in Figure 2 and is detailed in the Materials and Methods. DNA copy number measurements at two different genomic loci can be visualized in a 2D space, with each axis representing measurements at a certain genomic locus. A point in this space represents a sample in which both loci were measured. Figure 1a shows four hypothetical combinations of measurements. We sought to score for co-occurring high or low values in the DNA copy number data; in other words, regions that display similar patterns of large-amplitude amplification and deletion across the tumor set. This situation is shown in the third panel Figure 1a. The other panels show other potential situations that may arise when comparing two continuous measurements. To score for cooccurring gains all negative values are set to zero (Figure 1b). To score for co-occurring losses all positive values need to be set to zero and the absolute values of the measurements need to be used.
We use the covariance of the two measurements to score for cooccurring loci. This score only rewards a high value to a truly cooccurring and co-varying pairs of measurements ( Figure 1c, right panel). However, a high covariance alone is not sufficient, since it is possible that a high covariance occurs while at least one of the loci never reaches a high amplitude (see Figures 1e and 1f).For this reason we multiply the covariance score with the sum of the individual valued in each sample. This method of scoring only rewards a high value to a co-varying pair of measurements with a large aberration amplitude across the tumors (Figure 1c, right panel).
A framework for genome-wide co-occurrence scoring The co-occurrence scores can be computed for every pair of genomic loci (Figure 2c). By performing a two-dimensional Gaussian kernel convolution on these scores in the co-occurrence space we can take local neighborhood effects into account. This operation is performed for different kernel widths in order to capture scale dependent effects, resulting in a Convolved Cooccurrence Matrix (CCM) as shown in Figure 2d. High values in this matrix represent candidate co-occurring regions in the data. A peak in the CCM can be mapped back to two specific loci, the size of which is determined by the s parameter of the Gaussian function used to convolve the score matrix ( Figure 2e). The genes that are located in the loci associated with a peak in the CCM are subsequently investigated. We examined both enrichment for

Author Summary
It is generally accepted that a normal cell has to acquire multiple mutations in order to become a malignant tumor cell. Considerable effort has been invested in finding single genes involved in tumor initiation and progression, but relatively little is known about the constellations of cancer genes that effectively collaborate in oncogenesis. In this study we focus on the identification of co-occurring DNA copy number alterations (i.e., gains and losses of pieces of DNA) in a series of tumor samples. We describe an analysis method to identify DNA copy number mutations that specifically occur together by examining every possible pair of positions on the genome. We analyze a dataset of hematopoietic tumor cell lines, in which we define a network of specific DNA copy number mutations. The regions in this network contain several well-studied cancer related genes. Upon further investigation we find that the regions of DNA copy number alteration also contain large networks of functionally related genes that have not previously been linked to cancer formation. This might illuminate a novel role for these recurrent DNA copy number mutations in hematopoietic malignancies.
known cancer genes in these gene lists and we investigated functional relationships between the genes derived from the two loci ( Figure 2f). Additional details can be found in the Materials and Methods section.

Co-occurrences in hematological cell lines
We ran our analyses on the aCGH profiles of 95 hematological tumor cell lines analyzed on the Affymetrix Genome-Wide Human SNP Array 6.0. See the supplemental data (Dataset S1) for a list of the cell lines that were analyzed. The data was generated by the Cancer Genome Project (Wellcome Trust Sanger Institute, Hinxton, UK). We employed three scale parameters: 2Mb (s = 1/3 Mb), 10Mb (s = 5/3 Mb) and 20Mb (s = 10/ 3 Mb). In the remainder of this text we will refer to these as Scales 2, 10 and 20. These scales roughly determine the size of the aberrant regions we expect to find. By employing a small, medium and large scale we maximize the chance of detecting co-occurring changes of all possible sizes. To remain conservative we limited our primary analysis to the top 50 peaks in the Convolved Cooccurrence Matrix (CCM) for each of the scales and each of the comparisons (gain-gain, loss-loss, loss-gain). This resulted in 9 top-50 lists of co-occurring regions retrieved from this dataset.

Co-occurrences involving the T-cell receptor and IgG loci
A substantial fraction of the 95 cell lines are derived from T-or B-cell lymphomas with functionally rearranged T-cell receptor or IgG genes. We therefore expected to identify co-occurring losses at the T-cell receptor alpha/delta and beta/gamma loci in the T-cell leukemias. Similarly, our method should identify co-occurring losses at the IgG kappa, lambda and heavy chain loci in B-cell malignancies. Because the recombination loci for both the T-cell receptor and the IgG genes are both relatively small (in the 1Mb Each of the four hypothetical measurement pairs is plotted in scatter plot, giving each sample in the set an x-and y-coordinate. The random pair (first panel) is a noisy pair containing no effect. The constitutive member pair (second panel) consists of one measurement that is continuously high, paired with a measurement that varies between two noisy levels. The co-occurring signal (third panel) consists of two noisy measurements that alternate between a high and a basal level, but show concerted change. The mutual exclusive pair (fourth panel) also alternates between two levels but one measurement excludes the other from also reporting a high value. b. In this example we show scoring for co-occurring gains. Therefore we set all negative values to zero. To score for loss-loss pairs we would need to set all positive value to zero and continue using the absolute values. For loss-gain analysis we would set the positive values of the x (y) axis to zero and use the absolute values in the x (y) direction. c. The first panel shows the resulting scores of the four pairs of measurements if only the sum of the minimum is used. The second panel shows the score when the covariance is included. doi:10.1371/journal.pcbi.1000631.g001 range) we expected to retrieve these co-occurring losses in the small (2 Mb) scale analyses. Since we disregarded co-occurrences on the same chromosome we expected to find five co-occurring losses. Indeed, four of the five expected co-occurring losses are present in the top 50 peaks of the Scale 2 analyses (Table 1). Figure 3 shows two examples from the top 50 lists of co-occurring loci. The separation of T-and B-cell lines is immediately apparent. T-cell lines are strongly associated with losses in the T-cell receptor loci. A large subset of B-cell lines are associated with losses in the IgG loci. However, a subset of the B-cell lines is not associated with any loss of these loci. In this particular subset of lines the IgG loci seem to be gained. It is known that the IgG loci are favorite partners for oncogenic translocations [8]. Whether this is the cause of the amplification of these loci is not known.

Cancer gene enrichment in co-occurring loci
While the recovery of the V(D)J-related recombination loci as co-occurring losses serves as a positive control for our analysis approach, we are mainly interested in identifying cooperating genes or regions that might play a role in cancer. To see whether the locations we recover are linked to this disease, we analyzed whether the co-occurring genomic loci are enriched for genes known to play a role in cancer. As a reference gene set we used the Cancer Gene Census list [9]. The results of this analysis are shown Figure 2. Schematic overview of co-occurrence analysis. a. Overview of aCGH data. Both y y i and y y j are vectors of genomic grid points spanning a chromosome arm (see Materials and Methods). The genomic grid is constructed from aCGH probe measurements, as explained in the Materials and Methods section. b. The combinations of y y i and y y j are used to construct a genomic pair-wise space in which all further calculations are performed. In this panel a schematic view of the genomic pair-wise space is shown. Each pair of genomic grid points between y y i and y y j is a point in this space and each point contains two values. A pair-wise genomic matrix exists for each tumor in the data set. c. To score for co-occurrence, the minimum value of the pairs of genomic grid points are summed over the tumors and the co-variance over tumors of all genomic grid points is calculated. This results in two equally sized matrices which are multiplied element wise to produce the co-occurrence score matrix. This matrix is again represented in the genomic pair-wise space (S co-oc ). d. The co-occurrence score matrix is convolved with a Gaussian matrix to find local enrichment of high cooccurrence scores in the pair-wise space. Peaks in the convolved co-occurrence matrix are translated back to two genomic regions (i n and j n ) that are annotated as being co-aberrated across the tumor set. e. For the n-th peak in the Convolved Co-occurrence Matrix (CCM) two gene sets, A n and B n , are defined, based on a 2s window centered on the peak. f1. Using a protein-protein interaction database the interactions between gene sets derived from a single co-occurrence peak are analyzed, producing a set of interactions (I A n B n ). f2. Using the Cancer Gene Census we inspect the resulting gene sets for presence of known tumor-suppressor genes and oncogenes. doi:10.1371/journal.pcbi.1000631.g002 in Table 2. As can be seen, the co-lost loci are mainly enriched for tumor suppressor genes, and the gain-gain regions for oncogenes. Since one expects loss of tumor suppressors and gain of oncogenes, this is a logical result, further increasing our confidence that our approach identifies truly relevant genomic loci.
Discovering functional relationships between co-occurring loci While finding enrichment for cancer genes is an encouraging result, this does not explain the possible cooperation between two loci. We expect that the co-occurring loss of two regions points to a functional relationship between the constituents of the genomic loci. A co-occurrence between two genomic regions can point to many different kinds of interactions between the genes residing in both regions, e.g. biochemical interactions of the protein products or functional collaboration of two cancer genes in tumorigenesis. We therefore decided to employ interaction data to shed further light on the genes present in the co-occurring regions. We translated the cooccurring pairs of genomic loci to pairs of gene sets, and we investigated the functional relationships of their protein products using the STRING database [10] (version 8.1). The STRING model weighs functional associations between genes based on several different sources of evidence, among which: biochemical interaction, joint presence in a pathway, high-throughput interaction experiments, text mining and interactions of homologs in other species. To find a functional relationship between two co-occurring  regions we looked for a direct interaction in the STRING database between the two gene-sets defined by our co-occurrence analysis.
To determine whether the number of observed interactions is significant, we compared the number of direct interactions found between genes located in the top 50 co-occurring regions to a set of randomly chosen pairs of genomic loci. The metric we used to determine significance is the ratio between the number of interacting genes and the total number of genes found on the genomic loci. A p-value for enrichment for direct interactions was calculated using a two-tailed Fisher's exact test. Results are shown in Figure 4. As can be seen, the only analysis that resulted in an enrichment of functional interactions is Scale 20, for all three situations. We found no enrichment for interacting protein coding genes on Scale 2 (not shown) and Scale 10. Since we evaluated gene sets in a window one-third the size of the analysis-scale we may be under-estimating the size of the co-occurring loci and the larger Scale 20 actually captures the size of the aberrations best.
In order to keep control of the complexity, we considered in our co-occurrence analysis only radially symmetric kernels, i.e. Gaussian kernels with diagonal, equal variance covariance matrices.
This implies that asymmetric co-occurring regions -where a small locus co-occurs with a large locus -will not be optimally detected.
Since an asymmetric co-occurring region typically consists of a series of symmetric co-occurring regions detected on a smaller scale (just like a rectangle can be constructed from a collection of squares), we set out to construct larger co-occurring regions from the results of the smaller scales using a hierarchical clustering approach. For details see Supplemental Figure S1.
Briefly, we collected the loci involved in the top 500 cooccurrences of the Scale 2 analysis. This resulted in 1000 genomic loci. For each pair of loci, we calculated the genomic distance in base pairs. The distance between two loci on different chromosome arms was set to a default high value (1 * 10 8 ). This resulted in a 100061000 distance matrix. On this distance matrix we performed single linkage hierarchical clustering. The resulting dendrogram was cut at 1 * 10 7 bp (5 kernel widths). The resulting clusters are unique genomic loci and were represented as nodes in a graph. Clusters were then linked if a co-occurrence was found between individual loci of different clusters. These links are represented as edges in a graph. The result of the clustering analysis is shown in Figure 5.    A network view of co-occurring copy number changes As can be seen in Figure 5 we were able to construct a network of co-occurring copy number changes for the gain-gain, loss-loss and gain-loss situations. As expected, the gain-gain and loss-loss networks show enrichment for oncogenes and tumor suppressor genes, respectively. The gain-loss network only shows enrichment for tumor suppressors. The percentage of genes involved in functional interactions between the nodes that are linked in the graph vastly exceeds the functional interaction enrichment found in the single scale 20 Mb analyses. At least 11% of the genes present in the genomic locations -represented by the nodes in the graphs -have high confidence (.0.9) annotated functional interactions along the edges as revealed by STRING analysis.
The thickness of the edges in the graphs shown in Figure 5 indicates how often a co-occurrence was found in the top 500 of the Scale 2 analysis. Several edges were strongly supported by co-occurrences in the top 500. These strongly supported edges were always associated with loci that were ranked high in the co-occurrence list (as indicated by node size). The nodes that are associated with these highly represented edges seem to form an important subgraph. To reveal these subgraphs, we removed all edges supported by less than 5% of the top 500 co-occurrences. For brevity and simplicity we only consider the gain-gain and loss-loss networks. This resulted in the two core networks shown in Figures 6 and 7.

The gain-gain core network
The edge thickness of the gain-gain core network shown in Figure 6 represents the number of functional interactions found using the STRING database between genes that map within the loci described by the nodes. To determine the common denominator among the interacting genes, we employed Ingenuity Pathway Analysis (IPA; Ingenuity Systems) to perform a functional enrichment analysis on all genes residing within the gain-gain core network. This revealed strong enrichment for processes involved in cancer (Figure 6b). From Figure 6a it is immediately apparent that most of the functional interactions are found between 1q and 7p/q. If we remove the 1q node from the entire network described in Figure 5 the enrichment for functional interaction drops dramatically (Figure 6c). Therefore, we hypothesize that the co-occurring gain between 1q and 7p/q is the most important effect in the gain-gain analysis in this dataset. This is strengthened by the fact that almost all known oncogenes within the entire network map to 1q, 7p or 7q (Figure 6a). The well-studied canonical oncogene MYC maps to 8q and is not a determining hub in the gene interaction network as constructed by STRING. Figure 6. The gain-gain core network. a. The reduced core network for the gain-gain analysis obtained by pruning all edges with less than 5% support in the top 500 list of the Scale 2 analysis. Edge thickness and label represent the number of functional interactions between genes associated with the nodes being connected based on the STRING database. The oncogenes as defined by the Cancer Gene Census that map within the regions defined by the nodes are shown in rectangular insets. b. Representation of the 10 most enriched Ingenuity terms associated with the entire collection of genes in the core network that have a STRING interaction along the edges. The x-axis shows the 2log transformed p value, corrected by the Benjamini Hochberg procedure as implemented in the Ingenuity software. c. Functional interaction enrichment is shown as a bar graph, which represent the ratio of interacting genes with respect to the total number of genes. P-values are determined using a Fishers' Exact test with randomly selected pairs of loci representing the null hypothesis. doi:10.1371/journal.pcbi.1000631.g006 The loss-loss core network The loss-loss core network is shown schematically in Figure 7a. A loss of approximately 18 megabases on chromosome 17p appears to be a central hub, which is co-lost with several other genomic loci. These loci show a very high enrichment of genes that interact with 17p, and of the six loci, four contain multiple known tumor suppressor genes. A functional enrichment analysis of all genes residing on loci co-lost with 17p, reveals many cancerrelated processes (Figure 7b), suggesting that the interacting genes are most likely also the cancer-relevant genes. If we remove 17p from the original network we see a large decrease in the percentage of genes involved in functional interactions (Figure 7c) confirming the importance of 17p in the loss-loss network.
The co-occurring losses involving 17p might target non-canonical cancer genes One of the most intensively studied cancer genes, TP53, resides in the 17p locus. Furthermore, the canonical cancer gene RB1 and the CDKN2A/B locus are present in two of its co-lost regions. Since these are well known tumor suppressors, and therefore the subject Figure 7. The loss-loss core network. a. The reduced core network for the loss-loss analysis determined by pruning all edges with less than 5% support in the top 500 list of the Scale 2 analysis. Edge thickness and label represent the number of functional interactions between genes associated with the nodes being connected based on the STRING database. The tumor suppressor genes as defined by the Cancer Gene Census that map within the regions defined by the nodes are shown in rectangular insets. b. Representation of the 10 most enriched Ingenuity terms associated with the entire collection of genes that have a STRING interaction between the 17p region and 9p, 9q, 13q, 16q or 22q as determined by the Ingenuity software. The x-axis shows the 2log transformed p value, corrected by the Benjamini Hochberg procedure as implemented in the Ingenuity software. c. Functional interaction enrichment is shown as a bar graph, which represent the ratio of interacting genes with respect to the total number of genes. P-values are determined using a Fishers' Exact test with randomly selected pairs of loci representing the null hypothesis. d. A functional interaction network around the nuclear co-repressor NCOR1 (also known as TRAC1) is shown. This network is a part of the network of interactors derived from the 17p interacting regions after removal of the canonical cancer genes TP53, RB1, CDKN2A and CDKN2B from the analysis. e. Illustration of the retroviral insertions mapped near CBFA2T3, recovered in a large screen of MuLV retroviral mutagenesis [11]. Insertions are shown as triangles.

Blue triangles indicate insertions in the direction of transcription (plus), red triangles indicate insertions in the anti-transcription direction (minus).
Insertions linked by dashed boxes are bi-allelic integrations recovered from the same tumor. doi:10.1371/journal.pcbi.1000631.g007 of thousands of research papers, they might constitute the bulk of the functional relationships in our analysis. To test this hypothesis, we excluded these four genes and repeated the interaction analysis of the core network. As can be seen in Figure 7c, the enrichment is only slightly lower without the canonical genes, suggesting that the functional relationship between the co-occurring losses on 17p and the other loci are driven by other genes.
We investigated the remaining 113 interactors for any interesting interactions that might be a target of this collection of co-occurring losses. Within the total network of interactors we found a sub-network centered on the nuclear co-repressor NCOR1 (TRAC1) (Figure 7d). This interaction network included -besides NCOR1 -the peroxisome proliferator-activated receptor alpha (PPARA), the MAPK pathway suppressor GPS2, the nuclear coactivator (and known tumor-suppressor) p300 and a gene of unknown function, CBFA2T3. All interactions found are based on physical binding and co-occurrence in Pubmed abstracts.

Combining co-occurrence data with insertional mutagenesis data
To see whether we could find more information regarding the putative tumor suppressor function of the different interactors, we tested if we could corroborate our findings with data from a large retroviral insertional mutagenesis (IM) screen where hematopoietic tumors were induced through Murine Leukemia virus (MuLV) infection of wild-type mice or Trp53 or p19-ARF deficient mice [11]. An illustration of the retroviral insertions sites near Cbfa2t3 is shown in Figure 7d. Although Cbfa2t3 was not flagged as a common integration site, several viral integrations near this gene were found. Remarkably, two individual tumors harbored a biallelic integration near the transcription start site of Cbfa2t3, suggesting functional inactivation of this candidate tumor suppressor gene. Indeed, bi-allelic integration is thought to be a hallmark of tumor suppressor genes in IM screens [12].
Given that we find this sub-network of interactors in a cooccurring network of DNA copy number losses and the recovery of inactivating insertions in a retroviral IM screen, we conclude that this network might be a putative tumor suppressor network.

Discussion
In this study we present a genome-wide analysis for finding collaborating DNA copy number changes on different chromosomes. Using our 2D kernel convolution framework we can score and find co-occurring DNA copy number changes in a high quality, high resolution aCGH dataset. Using a dataset of hematological cell lines we are able to recover DNA copy number alterations specific to the cell lineage of the samples. Furthermore we uncover cancer-related networks of co-occurring DNA copy number changes.

Previous work on co-occurring copy number changes
Several studies have investigated concerted copy number changes in aCGH data. In studies on lung cancer [13] and ovarian cancer [14] the authors performed a post-hoc cooccurrence analysis on genomic locations that were found to be significantly altered in a one-dimensional analysis. A more integrated effort to analyze relations between CNAs in brain cancer was published recently [15]. Although this study scores systematically for co-aberration, it is limited in resolution as it employs cytobands as the genomic unit within which aberrations are scored. Cytobands are relatively arbitrarily determined entities and are quite heterogeneous in size. Furthermore this approach is dependant on converting the continuous-valued copy number data to discrete copy number calls. This results in loss of important information since it removes the possibility of weighting the intensity of a CNA. In contrast, our approach is able to correct for unequal probe distances, enabling us to perform our analysis on a very high (20 kbp) resolution. In addition, our scoring method not only incorporates the sign of the copy number change, but also its intensity and the concomitant CNAs within the immediate genomic neighborhood.

Determining significance of co-occurrences
The output of our analysis does not include a measure of significance. Constructing a background distribution based on permutations of the DNA copy number data would mean rerunning our analysis thousands of times, a task which remains computationally infeasible at this stage. Furthermore, the multipletesting problem would have to be properly addressed, given that the number of tests is the square of the number of grid points in the 2D space. Due to the complexity of the analysis procedure (minimum operation and kernel smoothing) the definition of an analytical null distribution has remained elusive. Therefore, we have chosen to work with top n results, residing in the extremes of the results distribution, thus minimizing the chance of including false positives. The top n lists allowed us to generate workable results which we have validated extensively with other sources of evidence.

Analytical challenges
While we were able to use a distributed computing solution for our analysis, we were fortunate to have the required computational architecture at our disposal. Since the problem basically consists of repeating the same action many times it could be wellsuited to software optimization or a hardware based solution where the most time-consuming actions are handled by a dedicated processing unit.
When looking for areas in the 2D pair-wise space highly enriched for co-occurrence scores we convolve this space with a 2D-Gaussian kernel. The sigma parameter of this function is a representation of the size of the aberrations we expect to recover. Currently we make the implicit assumption that the co-occurring aberrations have the same size by using a symmetric kernel for the convolution. This could be relieved by allowing for an asymmetrical (ellipsoid) Gaussian kernel for all combinations of scales used. Clearly, this comes at the cost of increased computational complexity. Here we resolve this issue by concatenation of the results obtained in a small scale. In this way we can recover cooccurring losses of different sizes that give a better enrichment for functional interactions when combined with the single peaks obtained in a higher scale analysis.

Gene dosage effects on a large number of genes
In our analysis of a set of cell lines derived from hematological malignancies we found enrichment of cancer related genes and functional interactions in co-occurring DNA copy number changes. Our results suggest that tumorigenesis requires elimination of multiple gatekeeper genes and gain of multiple oncogenes as demonstrated by the presence of many functional interactions between the loci in the gain-gain and loss-loss core networks.
Haploinsufficiency is a well known characteristic of several tumor suppressor genes, where simple reduction of gene dosage by loss of gene copies at the DNA level can already promote oncogenic transformation [16]. It is conceivable that changes in gene dosage of multiple interconnected genes involved in cancerrelated processes such as cell cycle, DNA repair and signaling can also weaken a cells defense against uncontrolled cell proliferation.
In this case, heterozygous loss or gain of large genomic regions, such as the ones identified in this study, might effectively sensitize cells to become tumorigenic.

17p as a central player in co-occurring losses
We show that the 17p loss and its co-lost regions are highly enriched for functional relationships, which are not fully explained by the presence of the TP53 gene, often thought to be the single target of this deletion [17][18][19]. Although TP53 is no doubt an important target of the DNA copy number loss, our analysis indicates that the concomitant loss of other genes near TP53, as well as co-occurring losses on the other genomic loci may together account for the full tumorigenic effect.
Loss of the loci on 17p, 9p, 9q, 13q, 16q and 22q has been reported previously for several types of hematological malignancies represented in our dataset [20][21][22][23]. The picture that emerges from this analysis of collaborative aberrations is that many of the reported losses collaborate with the frequently occurring 17p loss as a central hub. We don't recover co-occurring losses among the spoke loci in the core network. This could suggest that the non-17p regions form subsets of co-occurring losses with 17p, whose interconnections themselves do not occur frequently enough in the top 500 co-occurring losses we investigated.

NCOR1 and its interactors
Not all of the gene-gene interactions defined by the 17p network involve the well-known canonical cancer genes TP53, RB1 and CDKN2A (INK4a/ARF). We found one sub-network of genes around NCOR1 which might be an example of other tumor suppressor genes that are affected by the concerted loss of these genomic loci.
The hub of this interaction network, NCOR1, is a well-known transcriptional co-repressor that associates in a ligand-independent manner with nuclear receptors [24]. It is responsible, together with the closely related factor SMRT, for recruitment of HDAC proteins to the DNA to induce transcriptional silencing. Its role in cancer is not well-established. NCOR1 null mice die in early embryogenesis [25]. A dominant-negative mutant of NCOR1 is known to increase proliferation in hepatocytes [26] and more recently it has been shown that NCOR1 decreases AKT phosphorylation, thus countering its pro-survival signal [27]. It would seem that specific loss -or at least decrease in gene dosage of NCOR1 -might increase proliferation and promote survival.
All interactions between NCOR1 and its partner genes (PPARA, GPS2 and CBFA2T3) have been based on co-occurrence in PubMed abstract and true physical binding [28][29][30][31]. CBFA2T3 is a close relative of ETO, which is a target of the recurrent AML1-ETO translocation that occurs in acute myeloid leukemia. It has been shown that the fusion gene AML1-ETO actually interferes with the CBFA2T3-NCOR1 interaction, and that its oncogenic effect derives from that inhibition [31]. In a retroviral insertional mutagenesis screen in mice, Cbfa2t3 is recurrently targeted by biallelic retroviral integrations, which are predicted to cause functional inactivation of Cbfa2t3 [11]. PPARA is a member of the Peroxisome proliferator-activated receptors, and has been implicated in hepatocellular carcinoma development in rodents [32]. Since other members of this family, such as PPARG, exhibit a tumor suppressor-like phenotype, it is possible that PPARA can act as a tumor suppressor in hematological malignancies. GPS2 is a known suppressor of JNK signaling [33], which is one of the constituents of the MAP kinase signaling pathway. Deregulation of this pathway is a well-known phenomenon in cancer [34]. Taken together with the association between NCOR1 and the known tumor suppressor p300, our data suggest a selective advantage for loss of multiple constituents that interact with NCOR1 since they all may have tumor suppressor-like activities.
Many studies focus on a single hematological malignancy in which a single combination of aberrations might be important [19,35,36]. Since we examine a large panel of samples derived from many different hematological malignancies, our results might not specifically apply to any single type of lymphoma or leukemia. They might hint at more general processes that are important for the tumors to arise and maintain themselves. However, one should not forget that this analysis is based on a panel of cell lines, which may have adapted to in vitro tissue culture conditions by acquiring additional aberrations that are rarely found in real tumors in patients. Furthermore, given the fact that we examine copy number changes it might be worthwhile to analyze a highly genomically unstable tumor type, such as BRCA1/2-related breast cancer.

Conclusions
We have developed a method for genome-wide analysis of collaborating DNA copy number changes and their corresponding networks. Using this approach we have identified a loss-loss network centered around a region on human chromosome 17p. This network is highly enriched for functional relationships and hints at a more complex system of tumor suppression in which many different genes are affected simultaneously to induce cancer. We show one example of a sub-network around the nuclear corepressor NCOR1 that may be a novel network of tumor suppressor genes that are affected by the observed co-occurring losses. The observation that DNA copy number changes may affect gene dosage of larger numbers of cancer-relevant genes deviates from the classical view where mutations in a few (5-7) cancer genes lead to tumor development. Our data support the notion of cancerrelated networks or pathways, where multiple collaborating genes are deregulated simultaneously to induce oncogenesis. Such a network view of oncogenesis is an important step towards developing effective drug targets because it increases the number of potential targets. However, this view also implies that multiple molecules need to be targeted simultaneously in order to achieve optimal therapy response and to reduce the risk of therapy resistance.

Materials and Methods
Transformation of probe-measurements to genomic grid Datasets consisting of array-based copy number measurements are continuously increasing in size. If probe level interactions are evaluated, the analysis space is of dimensionality N 2 for N probes on the genome. As a result, the analysis time and memory usage will also increase quadratically with the number of probes. Instead of a grid positioned at the genomic positions of the probes, we employ an equally spaced genomic grid as a basis for all subsequent steps. The distance between grid-points is a userdefined variable, and will determine the finest resolution of the outcome and computational efficiency. We have performed all analyses using a genomic grid with a grid spacing of 20 Kb. Given a genome of G base pairs and a grid spacing of D, this results in N G grid positions, with N G~t G=Ds, where tzs represents the integer part of the real number, z. The grid positions can be represented in the following row vector: x x~fx k jk~1,2,:::,N g g, where x k~k D.
Let the aCGH profile of the t th tumor be represented by the following row vector of probe measurements: a a t~f a t,k j k~1,2,:::,pg, with p being the number probes. Let the midpositions of the probes be located at z z~fz k jk~1,2,:::,pg. To employ the genomic grid we need to compute, for each aCGH array, the value of the aCGH profile on the grid points. We achieve this by performing, for the k th grid position, x k , a kernel-weighted regression of all probe values situated in the range ½x k {D; x k zD~½D(k{1); D(kz1), employing a triangular kernel centered at x k , with maximal amplitude of 1 and width of 2 D. More specifically, the interpolated copy number aberration at the k th grid position is given by: Here the set J k is the set of probe positions such that J kf jj x k {z j ƒD, j~1 . . . pg. The interpolated copy number profile of the t th tumor is represented by the row vector: b b tf b t,k j k~1,2,:::,N g g. The complete dataset of T tumors is refopresented by the matrix B~f b b 1 , b b 2 ,:::, b b T g, where the probe values of the t th tumor constitute the t th row of matrix B. Because we treat gains and losses separately we have four different cooccurrence situations to be considered given two loci on the genomic grid: i) gain/gain, ii) gain/loss, iii) loss/loss and iv) loss/gain. So, when evaluating the co-occurrence of loci i and j, we will evaluate the behavior of i) columns B z .,i and B z .,j for gain/gain; ii) columns B z .,i and B { .,j for gain/loss; iii) columns B { .,i and B { .,j for loss/loss and iv) columns B { .,i and B z .,j for loss/gain. (Here X .,i is the i th column of matrix X ). All subsequently described steps will be performed for these four situations separately, where y y i~f y t,i jt~1,2,:::,Tg and y y j~f y t,j jt~1,2,:::,Tg will be employed as shorthand for the abovementioned column vectors of interpolated copy number values associated with genomic grid positions i and j, respectively.

Co-occurrence score
The first component of the co-occurrence score is the continuous variant of the AND Boolean logic function: the minimum operation. For two grid points, i and j, the sum across all tumors of the minimal probe value per tumor at i and j, s min i,j , is calculated as follows: These values are aggregated in a matrix, S minf s min i,j ji,j~1,2,:::,N g g. If we only use the minimum as a scoring function, those grid positions that are ubiquitously aberrated will always receive a high score, regardless of the aberration pattern in the other grid position. Two regions that are aberrated ubiquitously in all tumors are undoubtly important to the tumor but they are not necessarily functionally related. They might be a hallmark of the particular disease under study, but show no direct functional interaction. To prevent these ubiquitously aberrated regions from dominating the analysis and to detect those regions that represent functional co-occurrences, we weigh the minimum score computed above with the covariance of the interpolated probe values at the two grid positions i and j, where m i and m j are the expected values of the probe values at grid position i and j across tumors, respectively (i.e. m i1 T X T t~1 y t,i and m j~1 T X T t~1 y t,j ). These values are aggregated in a matrix, S cov~f s cov i,j ji,j~1,2,:::,N g g. We combine both the minimum matrix and the co-variance matrix by element-wise multiplication to form the co-occurrence score matrix, S co-ocf s co{oc i,j ji,j~1,2,:::,N g g, with Kernel convolution Since we believe the co-occurrence score to be a smooth variable, and since neighboring co-occurrence values can therefore be employed to reduce the noise locally, we convolve the cooccurrence score matrix with an isotropic 2D Gaussian kernel function. In practice this implies sampling the 2D Gaussian kernel function on a square grid consisting of M6M genomic grid positions and then performing the convolution of this sampled kernel function with the co-occurrence score matrix. The sampled isotropic 2D Gaussian function is defined as G~fg i,j ji,j1 {M,{(M{1),:::,{1,0,1,:::,(M{1),Mg, with The standard deviation of the isotropic Gaussian, s, determines the scale of the analysis. Since the Gaussian quickly decays we set M~t 3s D s, allowing contributions from +3s, convolving with a finite kernel with minimal loss in accuracy. The scale of an analysis is therefore defined as 6s. The scales employed in this study are: 2 Mb (s = 1/3 Mb), 10Mb (s = 5/3 Mb) and 20Mb (s = 10/ 3 Mb). Before the convolution step, we pad the co-occurrence matrix by mirroring the true data at each chromosome boundary and each centromere. By convolving the appropriately padded cooccurrence score matrix and the sampled 2D Gaussian function the Convolved Co-occurrence Matrix (CCM) is obtained: with C~fc i,j ji,j~½1,2,:::,N g g, Ã as the convolution operator and This matrix is a representation of the amount of co-occurrence between two locations on the genome. We calculate a CCMmatrix for each possible combination of chromosome-arms and for each of the four combinations of gains and losses listed above.

Distributed computing
With 39 unique chromosome arms in the human genome (disregarding the p-arms of the acrocentric chromosomes and the sex chromosomes), three different scales and 4 triangular pair-wise matrices to evaluate (loss-loss, gain-gain, gain-loss and loss-gain) we compute 8892 different CCMs. To solve this problem computationally we used a large distributed computing cluster. Our choice of resolution of the genomic grid was bounded by the memory present on the nodes. We set D to 20000 base pairs, which is the lowest value still allowing the largest chromosomearm pair to be successfully computed on one computing node.

Downstream analyses
For each CCM we determine the top N peaks for each combination of gains and losses. The n th peak represents two cooccurring loci, i n and j n , and the location of the peak is defined by two co-ordinates on the genomic grid: (x in ; x jn ). For each locus, we define a region of interest of size 2s centered on x in and x jn , respectively. We define this small region of interest to only select regions that are very near to the actual peak. To investigate the cooccurrence for functional relationships, we extract, for each of the co-occurring loci, the genes present in the regions of interest. More specifically, we define, for loci i n and j n , the associated gene sets A n and B n , where A n~f gjx in {sƒq g ƒx in zs,Vgg ð 10Þ and B n~f gjx jn {sƒq g ƒx jn zs,Vgg ð 11Þ Where q g is the position of gene g, which we chose to be the midposition of the gene. The genesets were established by a BioMart query from the Ensembl database. We restricted ourselves to the bio type 'protein_coding'.

Cancer Gene Census (CGC) enrichment
The list of CGC genes was obtained from the CGC website (http://www.sanger.ac.uk/genetics/CGP/Census/). The reference list of all genes was retrieved from the Ensembl website, with a filter to keep only genes with bio-type = 'protein_coding'. This left 18840 genes. All CGC genes that could not be mapped back to the reference gene set were excluded. The CGC genes that were annotated as 'recessive' were used as the tumor-suppressor genes and 'dominant' as oncogenes. Enrichment for all CGC genes, the tumor-suppressor subgroup and the oncogene subgroup in the gene sets determined by the co-occurrence analysis was calculated using a Fisher's exact test.

Analysis of functional relationships
The set of pairs of interacting genes which are such that one gene is associated with locus i n and the other gene of the pair with locus j n is then defined as I A n B n~ffg k ,g l gjg k [A n , g l [B n , I C (g k ,g l )w0:9g ð12Þ Where I C (g k ,g l ) represents the confidence of interaction, according to the STRING database, between genes g k and g l . We then determine all gene lists of interactors for the top N peaks of a given co-occurrence analysis, i.e.: For each of the top N co-occurring loci, we also determine the total number of genes in the regions of interest of those loci. So, for loci i n and j n we define the set: The total number of genes associated with the top N co-occurring loci is then given by The interaction ratio, R N I , is then defined as where a j j denotes the cardinality of set a. As a control we randomly pick size-matched locations for all co-occurring regions (i n ,j n ) in the top N and repeat the process for recovering interactions. For 100 randomly chosen co-occurring regions we calculate the resulting R RAND I . A Fisher's exact test is then used to asses the significance of enrichment of R N I versus R RAND I .

Hierarchical clustering of co-occurrence loci
For all pairs of co-occurring loci, (i n ,j n ), present in the top N of an analysis, L N~f (i n ,j n )jn~1,2,:::Ng let the set of loci representing the first and second member of the co-occurrence locus be defined as L N I~f i n jn~1,2,:::Ng and L N J~f j n jn~1,2,:::Ng respectively. Given that the pairs of genomic locations corresponding to the top N co-occurring loci are given by X N~f (x in ; x jn )jn~1,2,:::,Ng we define the set of genomic locations loci involved in cooccurrences as We perform hierarchical clustering on matrix D using single linkage hierarchical clustering. Leaf nodes are assigned to clusters using a distance cutoff of 10 7 bp (10Mb). Clusters are represented as nodes in a graph. Edges between nodes are drawn if any cooccurrence relationship is found between loci present in the nodes.

Data description
The case we subjected to analysis was a dataset containing 105 cell-lines derived from hematological origin. The aCGH measurements were done on 1.8 million probe Affymetrix SNP 6.0 arrays. After data pre-processing we were left with 95 samples. These cell lines are a subset of the Cancer Genome Project cancer cell line project (http://www.sanger.ac.uk/genetics/CGP/CellLines/). A list of the cell lines included in this dataset can be found in Dataset S1. Figure S1 Hierarchical clustering of small scale CCM peaks. a. All peaks present in the top 500 that report a co-occurrence between these chromosome arms can be represented as a table of 1000 genomic loci, with 500 pairwise co-occurrence relationships. b. The 1000 genomic loci are individually clustered based on their genomic distance. The distance between two loci is the absolute difference in basepairs if both are mapped to the same chromosome arm and a high constant value if not. Single linkage is used to grow a dendrogram from these distances. Clusters of genomic loci are defined by aggregating all children under a certain set cutoff distance, 10Mb in this case. c. The clusters defined in b. are represented as nodes in a network. Edges are drawn between nodes if a co-occurrence relationship exists between any genomic loci assigned to the clusters. Found at: doi:10.1371/journal.pcbi.1000631.s001 (1.14 MB EPS)

Supporting Information
Dataset S1 This file contains information about the cell lines used in this study. Found at: doi:10.1371/journal.pcbi.1000631.s002 (0.03 MB XLS)