The Mutational Landscape of Acute Promyelocytic Leukemia Reveals an Interacting Network of Co-Occurrences and Recurrent Mutations

Preliminary Acute Promyelocytic Leukemia (APL) whole exome sequencing (WES) studies have identified a huge number of somatic mutations affecting more than a hundred different genes mainly in a non-recurrent manner, suggesting that APL is a heterogeneous disease with secondary relevant changes not yet defined. To extend our knowledge of subtle genetic alterations involved in APL that might cooperate with PML/RARA in the leukemogenic process, we performed a comprehensive analysis of somatic mutations in APL combining WES with sequencing of a custom panel of targeted genes by next-generation sequencing. To select a reduced subset of high confidence candidate driver genes, further in silico analysis were carried out. After prioritization and network analysis we found recurrent deleterious mutations in 8 individual genes (STAG2, U2AF1, SMC1A, USP9X, IKZF1, LYN, MYCBP2 and PTPN11) with a strong potential of being involved in APL pathogenesis. Our network analysis of multiple mutations provides a reliable approach to prioritize genes for additional analysis, improving our knowledge of the leukemogenesis interactome. Additionally, we have defined a functional module in the interactome of APL. The hypothesis is that the number, or the specific combinations, of mutations harbored in each patient might not be as important as the disturbance caused in biological key functions, triggered by several not necessarily recurrent mutations.


Introduction
Acute Promyelocytic Leukemia (APL) is characterized by the PML/RARA rearrangement as a consequence of the translocation t(15;17)(q24;q21). However, it is well established that this aberration alone is not able to trigger the whole leukemic phenotype [1]. Additional chromosomal abnormalities to t (15;17) and other gene mutations (e.g. FLT3 mutations) have been reported to act as potential secondary hits in up to 40% of APL cases. [2] However, little is known about the nature and precise role of these cooperating events in APL [3].
The recent advent of next-generation sequencing (NGS) technologies has allowed for the identification of a growing number of novel mutations in leukemia and other cancers. In APL, preliminary whole genome/exome studies involving about thirty APL patients identified several somatic mutations affecting up to 135 different genes in a non-recurrent manner, except for FLT3, WT1 and KRAS. These findings suggest that APL is a heterogeneous disease with secondary relevant changes not yet defined [4][5][6][7][8]. Recent approaches that take into account the physical and functional interactions of the mutated genes have been useful to discover driver mutations in several cancers [8][9][10][11].
To extend our knowledge of subtle genetic alterations involved in APL that might cooperate with PML/RARA in the leukemogenic process, we have performed whole-exome sequencing (WES) of diagnosis-completed remission matched samples on a selected discovery cohort of de novo APL patients and targeted resequenced novel candidate genes in an extended cohort. Finally, we used network analysis to select a reduced subset of high confidence candidate driver genes.

Material and Methods Summary
Whole-exome sequencing (WES) of diagnosis-completed remission matched samples from 5 APL patients (discovery cohort) was performed to identify gene mutations. WES data were analyzed using an in-house bioinformatics pipeline (Data in S1 Material and Methods and S1 Fig) to compare the coding sequence of matched samples and identify somatically acquired deleterious changes. The variants detected were confirmed in both samples of each patient by Sanger sequencing. Furthermore, complete coding sequence of 97 genes (17 novel candidate genes with confirmed somatic mutations from in-house results and 80 genes reported to be mutated in at least 1 patient from previous APL studies [4,6]) were targeted re-sequenced in 25 additional patients (extended cohort) using SureDesign Tool (Agilent) for NGS, according to the manufacturer's instructions. Finally, network enrichment analysis [12,13] was used to help in the prioritization of candidate genes on the basis of their connectivity, mutational recurrence, mutational co-occurrence and cancer related functionalities.
According to the Declaration of Helsinki, written informed consent was obtained from all patients, and the protocol was approved by the Research Ethics Board of the Health Research Institute Hospital La Fe (No.2012/0175).
Further Material and Methods details are provided in the Extended Experimental Procedures at the S1 Material and Methods.

Whole Exome Sequencing
Average target coverage obtained in both diagnostic and remission samples was 61.5x (S1 Table). In addition, 81% of the target regions were covered by more than 10 reads, with a high concordance (~98%) for detected SNPs from array assays (Data in S2 Fig). A total of 309,498 variant positions were detected in all sequenced exons from all the samples (Table 1).

Validation of the candidate variants in the discovery cohort
Sanger sequencing of samples from the discovery cohort confirmed 18 of the variant candidates (17 missense coding mutations and 1 indel) in 17 genes. The affected genes were ADC, ALPK3, APPL1, CSNK1A1L, FAM171A1, FBLN1, FILIP1L, FLT3 (carrying 2 mutations), GJB7, HMGCR, KIAA0317, MDN1, NR4A2, ORC3, PRICKL2, PTPRT and ZNF518B (S2 Table). No recurrent mutations were observed except for FLT3 mutations, neither when the mutations were checked in an independent cohort of 65 de novo APL patients. The remaining variations have not been reported before in APL cases. The mutational spectrum was dominated by C>T/G>A transitions.
Screening for somatic mutations in the complete coding sequence of 97 genes in the extended cohort Details of the 25 patients included in the extended cohort are provided in S3 Table and the genes included in the targeted resequencing are described in S4 Table. Overall, mean target coverage was 223.5x. Absolute median coverage at positions where mutations were identified was 208.4x.
In addition, the protein kinase, ubiquitination and the transcription factor category were more frequent than those awaited by the number of genes that compiled each category (Fig 1B,  1C and 1D). The same observation was made for the ubiquitination and spliceosome category, when we compared our results with those expected in the 1000 genomes repository [14].
Candidate gene prioritization. Network analysis [12,13], as implemented in the Babelomics [16] tool, was also used to prioritize the best candidate genes. For this propose, we included the set of recurrently mutated genes from NGS results beside PML, RARA and FLT3-ITD data of each patient obtained from cytogenetic and molecular analysis performed at diagnosis time. So, 17 out of the 46 selected genes, together with PML and RARA, were found to be significantly more connected that the random expectation (P = 0.02) (Fig 2). Some of the genes (STAG2, U2AF1, SMC1A, USP9X, IKZF1, LYN, MYCBP2 and PTPN11) also displayed an accumulation of mutations significantly higher that the corresponding healthy controls taken from the 1000 genomes repository [14]. Among these genes, the functional implications of the network could be observed in the edges connecting well-known cancer-related genes with similar functional annotations. PTPN11 interacted with PML through an intermediate gene, STAT1. At the same time, another member of the STAT family, STAT3, networked with PML and with a gene that works as a transmembrane protein (LYN). Moreover, we have identified an edge relating the genes with transmembrane protein function with those from the spliceosome (U2AF1, and USP9X) and the cohesin complex (SMC1A and STAG2) that have an important role in cancer. Concurrently, the cohesin complex was joined by the nuclear pore complex proteins (such as NUP98) with the spliceosome, which networked with PML through a transcription factor binding and with the Rho GTPase cycle through signaling pathways (MYCBP2).
We reassessed the value of our network analyses by adding the mutations reported in APL cases of TCGA dataset. The combination of both cohorts (n = 45) displayed an accumulation of mutations in some genes. Several genes without recurrence in both series were enriched for mutations so we finally obtained an input of 59 genes. Despite the increase in the number of genes, the 8 candidate genes (STAG2, U2AF1, SMC1A, USP9X, IKZF1, LYN, MYCBP2 and PTPN11) obtained from our cohort analysis remained significantly involved in the network (Data in S4 Fig).

Co-occurrence of SNVs
To assess patterns of co-occurrence we focused on the 46 genes mutated in more than one patient. Fig 3 depicts the presence of concomitant mutations within single patients, where multiple pairs of genes showed co-occurring mutations. We discovered a total of 32 interactions of mutations occurring at the same time in 2 or more patients with statistical significance (P 0.05). Interestingly, many of these co-occurrences took place within the significant subnetwork of protein interactions that link many of the affected genes. For example, PRPF8 mutations were carried simultaneously with mutations in ABL, DNAH9 and PTPRG.
Mutations in PTPRG coincided with mutations in other two genes of the network, MYCBP2 and KIAA0317. Moreover mutations were simultaneous in SMC1A and LYN2 (Fig 3). This association was not replicated in the 1000 genomes repository [14].

Correlation with clinical data
No single mutation or functional category showed an association with clinical variables or prognostic impact in terms of OS or RFS.

Discussion
This study shows a comprehensive analysis of APL, combining WES with the assessment of somatic mutations by a custom next-generation sequencing panel of targeted genes. It represents, to our knowledge, the largest analysis of somatic alterations in this subtype of acute leukemia. After careful filtering of variants, we identified deleterious mutations in eight genes previously Network-based analysis (SNOW) applied to 46 selected genes (33 recurrent and RARA, PML, and FLT3 genes). The network was complemented with the co-occurrence relationships, in order to summarize the two kind of significant results. Significant network-based analysis genes are coloured in light blue and stroked with a magenta border whether they resulted also significant in co-occurrence analysis. Genes only included by cooccurrence are coloured in magenta. Intermediate genes were painted in white and square shaped. While grey edges represent protein-protein interaction, relationships, broad orange dashed lines describe significant co-occurrences. Moreover, main genes are grouped depending on their biological role (cohesin complex, signalling pathways, spliceosome, RHO-GTPase, retinoic acid regulators and other cellular processes roles).  reported as mutated in other myeloid neoplasms. Network enrichment analysis significantly points to these 8 recurrently mutated genes to be involved in the pathogenesis of APL.
To identify leukemia-specific somatic alterations that could cooperate in the development and progression of leukemia [6,7], WES analysis was performed in 5 of diagnosis-completed remission matched samples. We examined DNA from bone marrow or peripheral blood at complete remission of each single case (defined according to the recommendations of Cheson et al.). Using a matched remission sample as a germline control we could established the de novo nature of each anomaly and eliminate false positive results that correspond to low-frequency variations that may not be reported in public SNPs databases. Matched samples were systematically covered by more than 10 reads, previously reported to be sufficient for variant calling [17]. In parallel, to minimize errors associated with any measurement, diploid coverage for matched samples was assessed using a set of high-quality SNPs derived from an array platform. We detected a 98% of concordance among the 1% of the SNPs common for the WES and the SNP-A approach. The small differences observed could be due to the fact that SNP-A focuses exclusively on SNP´s position, while WES screens the whole exome with different coverage in different regions. WES regions with low coverage are filtered by the designed pipeline, missing SNPs therein. In addition to the stringent filtering criteria, to prevent false-positive calls in WES, all alterations were analyzed by direct sequencing in paired diagnosis and complete remission samples. Only a part of the initially detected variants were confirmed, probably due to the limited sensitivity of Sanger sequencing, able to detect mutations that are present in at least 20% of the analyzed cells, or to the introduction of some artefacts during the WES amplification. Moreover, some false negative results were possible, due to the difficulty in the identification of indels or the presence of low frequency alleles. As a whole, these technical limitations might imply an underestimation of candidate mutated genes in our discovery cohort.
In a first step, we identified 18 genetic variants none of them, with the exception of FLT3, previously known to be mutated in leukemia. When we extended the validation of these candidate variants to 65 patients from an independent cohort, none of the newly discovered mutations were found to be recurrent. The same was found when we compared our mutated genes with those previously reported by the TCGA study and from APL patients previously described [4,7]. This is in agreement with recent reports, where no recurrent mutations were found [4,6,7]. It is also consistent with current models of leukemogenesis in APL, which hold that the vast majority of mutations occur as random events in normal precursor cells before these cells acquire an initiating mutation, so only a small portion of the alterations detected in each APL exome might be cooperating events, acting PML-RARA as the main driver mutation [6]. Furthermore, recent studies have highlighted that certain cancer types display fewer mutations on average. Actually, it has been shown that in leukemia the number of somatic mutations harbored by patient is 3 to 7 times lower that in solid tumors [18]. In particular, the study developed by the TCGA consortium brought out that in APL the expected recurrent somatic mutations by patient were almost twice lower than in other subtypes of AML. [4,6,7].
To test the relationship between our WES candidate genes and mutations in oncogenes known to be involved in myeloid malignancies, we selected a panel of 97 target genes known to be recurrently mutated in APL with a frequency 2% [7]. The selected subset of genes was analyzed in an extended cohort of 25 patients using in-house scripts. As no germline matched sample was available from each patient, and to avoid the selection of germline variants, we implemented a germinality test that was also complemented with a comparison against the healthy 1000G cohort and other variant population databases, in order to filter out the majority of germline variants. On the other hand, our protocol takes into account several sequencing artifacts as strand bias and other measurable effects derived from poorly mapped reads, what ensures to minimize the proportion of false positive mutations. The resequencing produced a high level of coverage for the targeted sequences, thus increasing our ability to detect mutations with less than 10% presence. We found a mean of 9.48 mutations (range 1-47) in each patient across 54 genes.
In line with recent reports, highlighting the complex nature of genomic aberrations in leukemia [4,6,7], we observed a wide spectrum of different mutations affecting well-known as well as novel APL targets. Of note, we found 8 genes mutated at a frequency significantly higher (8-40%; P<0.01) than expected in the 1000 genomes repository [14], and therefore with a strong novel potential to be involved in APL pathogenesis. With the exception of 4 genes well characterized in hematological neoplasms, we identified several mutations affecting genes without any known evidence to be involved directly in APL. Among them, 2 genes, PRICKLE2 and KIAA0317, were also detected in the discovery cohort by WES. Focusing on the allele frequency of each variation, it was not possible to distinguish between those mutations that are likely to be initiating or to be cooperating events. Functional validation studies will be required to assess the importance of these newly defined somatic mutations for APL pathogenesis.
To provide a reliable approach to prioritize genes for additional analysis, we used exploratory in silico approaches. First, we grouped mutations into larger sets, or pathways, according to the functional GO categories and examined patterns of mutual co-occurrence or exclusion between these groups. The co-occurring mutations affected mainly genes which were participating in the metabolism, along with genes acting as a protein kinase or as transmembrane proteins, and also mutated genes related to the signaling, spliceosome and ubiquitination functions. Additionally, network analysis resulted in a significantly connected subnetwork in which 17 of the candidate genes, recurrently harboring deleterious mutations, were immersed. These findings reinforce the hypothesis that a functional module, partially defined by this subnetwork, is associated to APL, and that specific combinations of mutations within this module contribute to the arising and maintenance of APL. This result strongly suggests that all these genes are close in the interactome, as frequently occur with genes of the same disease [19,20]. This confirms that mutations also co-occur in the interactome neighborhood. Genes connected to the candidate genes were involved in several key functionalities, such as cell cycle, kinases, transcriptional factors, splicesome and cohesin complex. These results are in agreement with those previously reported by the TCGA consortium, where APL patients were affected by several mutations in spliceosome genes and other unusually reported myeloid tyrosine kinases and transcriptional factors [4,6,7]. It seems that in the development of APL the impairment of a specific relevant gene is not as essential as the interaction of several mutated genes belonging to different functionally related categories. As a result, the combination of the specific APL chromosomal rearrangement with other mutated genes might trigger the leukemogenesis, displaying the importance of this biologic link. This hypothesis could explain the lack of concordance in the set of genes reported by previous studies, where numerous genes have been described to be involved in a not recurrent manner in APL, but with similar cell functions [4,6,7].
A major limitation to our study is the relative small sample size. We have analyzed an initial cohort of 5 patients by WES and 25 patients by a targeted gene-panel using NGS. Previous studies have also been conducted in small cohorts, ranged from 1-20 APL patients [4][5][6][7]. Although our series is one of the largest APL cohorts (n = 30) analyzed by NGS, it has been difficult to set up a significant independent association with outcome for any gene mutation or functional category, as neither was in similar preceding reports in APL. Therefore, more solid evidence based on larger series with long-term follow-up are needed to correlate our results with other clinical data.
In summary, this report describes the analysis of the largest number of gene mutations in APL carried out to date. After prioritization and network analysis we found 8 individual genes (STAG2, U2AF1, SMC1A, USP9X, IKZF1, LYN, MYCBP2 and PTPN11) with a strong novel potential of being involved in APL pathogenesis. In fact, on the basis of the study of recurrent mutations in interaction interfaces, PTPN11 has recently been suggested to be a cancer driver [21]. Also, genes STAG2 and U2AF1 were detected as drivers in AML and SMC1A in distinct cancers by the IntOGen tool [22]. Additionally, we have defined a functional module in the interactome of APL, hypothesizing that the number or the specific combination of mutations harbored in each patient might not be as important as the perturbation caused in biological key functions, triggered by several not necessarily recurrent mutations [23]. Finally, despite our study reports new APL candidate genes, it is clear that the understanding of the total number and clonal distribution of mutations in this disease is still limited and the future potential of discovery of new disease genes is high. Genes are represented by nodes and their sizes defined from the number of significant co-occurrences they are implied. Edges represent co-occurrences between pairs of genes. Every edge is labelled with the number of samples that carries the mutated pair of genes as follows: higher than expected co-occurrences are coloured in green, while lower than expected (only one) in red. Edge width is proportional to the statistical p-value of chi-square test. Those genes co-occurring only at one single patient are painted in white. Seven co-occurrence subnetworks arise from the significant co-occurrence network, where remarkably a single component connected the half of represented genes. In contrast, 3 pairs are simultaneously mutated only in 2 different individuals, and 3 significant co-occurrence subnetworks, only in 1 patient. (TIF) S1 Material and Methods. Further Material and Methods details are provided in the Extended Experimental Procedures at the Supplemental Information.