Detection of new drivers of frequent B-cell lymphoid neoplasms using an integrated analysis of whole genomes

B-cell lymphoproliferative disorders exhibit a diverse spectrum of diagnostic entities with heterogeneous behaviour. Multiple efforts have focused on the determination of the genomic drivers of B-cell lymphoma subtypes. In the meantime, the aggregation of diverse tumors in pan-cancer genomic studies has become a useful tool to detect new driver genes, while enabling the comparison of mutational patterns across tumors. Here we present an integrated analysis of 354 B-cell lymphoid disorders. 112 recurrently mutated genes were discovered, of which KMT2D, CREBBP, IGLL5 and BCL2 were the most frequent, and 31 genes were putative new drivers. Mutations in CREBBP, TNFRSF14 and KMT2D predominated in follicular lymphoma, whereas those in BTG2, HTA-A and PIM1 were more frequent in diffuse large B-cell lymphoma. Additionally, we discovered 31 significantly mutated protein networks, reinforcing the role of genes such as CREBBP, EEF1A1, STAT6, GNA13 and TP53, but also pointing towards a myriad of infrequent players in lymphomagenesis. Finally, we report aberrant expression of oncogenes and tumor suppressors associated with novel noncoding mutations (DTX1 and S1PR2), and new recurrent copy number aberrations affecting immune check-point regulators (CD83, PVR) and B-cell specific genes (TNFRSF13C). Our analysis expands the number of mutational drivers of B-cell lymphoid neoplasms, and identifies several differential somatic events between disease subtypes.

In this report we present an integrated genomic analysis of diverse mature B-cell lymphoid neoplasms using whole genome sequencing (WGS) data produced by the International Cancer Genome Consortium (ICGC) [23]. Our results expand the catalog of B-cell lymphoma driver genes, identify novel putative drivers based on functionally connected subnetworks and characterize new structural aberrations and regulatory mutations that modify the expression of several oncogenes and tumor suppressors.

Data source and analysis
WGS data from the CLLE-ES and MALY-DE projects produced by the ICGC were analyzed. This cohort included 132 CLL, 36 Burkitt lymphoma, 85 DLBCL, 97 follicular lymphoma and 4 unspecified B-cell lymphoma cases. RNAseq expression data for a subgroup of the samples was also available.
Tumor-normal matched whole genomes were processed using the bcbio-nextgen pipeline, which provides best practices for NGS data analysis [24]. GRCh37 was used as the reference genome. Low complexity regions, areas with abnormally high coverage, sequences with single nucleotide stretches >50bp and loci with alternative or unplaced contigs in the reference genome were not analyzed. Some polymorphic regions in noncoding regions are prone to be classified as mutation hotspots due to artifacts or biases in the sequencing process (mainly in low coverage regions), and suspicious elements were manually discarded from downstream analysis. Single nucleotide and indel mutation detection was performed with vardict-java version 1.5.8 [25], varscan version 2.4.3 [26], mutect2 implemented in GATK version 3.8 [27] and freebayes version 1.1.0.46 [28] using default bcbio-nextgen parameters. Events with a minimum sequencing depth (DP) of 10 and a genotype quality (GQ) of 20 Phred in both tumor and normal samples were selected. A mutation was called when detected by �2 callers. Mutations were annotated to the 1000G [29], gnomAD [30] and ExAc [31] databases. Tumor mutations reported as polymorphisms with a minimum allele frequency > 0.001 in any population were discarded. For copy number aberration (CNA) detections, the CNVkit version 0.9.6a0 [32] algorithm was used (ploidy-adjusted and with default parameters). We initially used the circular binary segmentation algorithm, observing important hypersegmentation that could lead to an increase in false positives. Therefore, we finally used the HaarSeg segmentation method, retrieving a vast majority of cases with segment counts in the range of 100-200. Events detected in centromeric and telomeric regions were discarded. Similarly, we also removed events within low 100bp-read mappability regions according to UCSC tracks.

Detection of mutation drivers in coding regions
Three methods were used to detect driver genes using nonsynonymous coding mutation data: MutSigCV [33], dNdScv [34] and OncodriveFML [35]. MutSigCV version 1.3.5 and dNdSCV were run with default parameters. OncodriveFML was run using CADD 1.3 scores. Significance threshold was set at FDR of 10% for all methods.
Hierarchical HotNet [36] was used to infer networks of functionally connected mutated genes. The following protein-protein interaction networks were used: Hint+Hi2012, Irefindex9 and Multinet. Mutation frequency and log-transformed MutSigCV p-values were used as input scores. Heat scores were permuted 100 times for each network. Hierarchies were constructed and processed with default parameters. The deviation of observed dendrogram distribution from the random expectation at different similarity thresholds was calculated, and significance threshold was set to p-value <0.05. Finally, a consensus network (G 2 ) was created from the resulting significant subnetworks.

Noncoding region annotation and mutation enrichment analysis
Annotations corresponding to promoter regions, 5'UTR, 3'UTR and lincRNAs were retrieved from Genecode version 18 [37]. Enhancer regions were obtained from the GeneHancer database [38], and those supported by two or more sources of evidence ("elite" enhancers) were selected. Transcription start sites (TSS) were defined as the 100bp-region upstream of the point of transcription. Regulatory regions within telomeric and centromeric positions were discarded.
LARVA [39] was used to identify areas with evidence of positive selection of mutations. LARVA models the mutation counts of each target region as a β-binomial distribution in order to handle overdispersion. LARVA also includes replication timing information in order to estimate local mutation rate, and provides a β-binomial distribution adjusted for replication timing which is used to compute p-values. Significance threshold was set to FDR<10%. As we used LARVA including tumor classification data, the mutation background estimation was calculated for each tumor subtype.
Regions targeted by aSHM were retrieved from literature analysis [40][41][42], and were used to annotate the list of significantly mutated noncoding regions identified by LARVA. In the case of genes without annotation, we used Signal [43] to test for local enrichments in the mutational signature of aSHM.

Recurrent focal CNA detection
Gistic2.0 [44] was used to identify recurrent CNA. Focal CNA were defined as those spanning a maximum of 25% of an arm's length. Deletions were called in regions with tumor/normal log ratios < -0.3, and amplifications in regions with ratios >0.3. Evenly spaced pseudomarkers were automatically created by the algorithm, and regions were considered only if they spanned 10 or more pseudomarkers. Sex chromosomes were not analyzed. Arm-level peel off was enabled, and residual q-values were calculated after removing segments shared with higher peaks. Significance threshold was set to FDR of 10%.

Gene expression analysis and association with CNA and regulatory mutations
RNA-seq data from tumor samples were transformed to FPKM counts and then rank normalized. The Wilcoxon-Rank sum test was used to detect changes in gene expression between mutated and wild-type cases. Changes in expression of the nearest gene were analyzed. When multiple regulatory regions mapped the same gene, p-values were adjusted for multiple testing using the FDR method (significance threshold of 5%). In the case of significant associations, we used non-linear Kernel regression adjusted for disease subtype in order to rule out independence from diagnostic subtype [45]. In the case of CNA, association with expression of the affected genes was analyzed using Pearson's correlation. P-values were adjusted for multiple testing using the FDR method (significance threshold of 5%).

Differential distribution of somatic events and pathways analysis
Differential mutation analysis between the different disease subtypes was performed using Fisher's exact test (significance FDR threshold of 5%).
WebGestalt [46] was used to analyze enrichment of gene networks in biological pathways. The KEGG database was used as reference, and a significance threshold of FDR 5% was chosen. 5,743,241 mutations were detected in 354 B-cell malignancy samples. A minor proportion of these affected protein-coding regions (1.34%), of which 71.64% were non-synonymous. Among these, missense mutations predominated (87.64%), followed by nonsense mutations (6.79%) and splice-site mutations (2.76%). The vast majority of mutations were either intergenic (40.28%) or intronic (45.25%). Mutation rate in the cohort was 2.51 mutations/Mb. This mutation rate was different for the different B-cell neoplasms. Mutation rate was 0.48 mutations/Mb in CLL, 0.75 in Burkitt Lymphoma, 3.14 mutations/Mb in follicular lymphoma and 5.69 mutations/Mb in DLBCL.
Missense mutations were the most frequent at the exome level, but some of these novel drivers were predominantly affected by other types of mutations. Nonsense mutations were more frequent in MAGI3, RFX7, TRAF6, VMA21 and WDR93. The genes HIST2H3D and HLA-C were prone to suffer multiple types of mutations in the same patient, whereas frameshift deletions predominated in LAPTM5. Finally, the following genes were targeted by multiple mutation types: ANGPT1, ID2, IRF1, MAGI3, SYNCRIP and ZFP36L1. 31 significantly mutated protein subnetworks were detected, involving 313 different genes (Fig 3  and S5 Table). 8 networks were mutated in >10% patients. The widest network (Network 1) was composed of 153 genes, among which CREBBP, EEF1A1, GNA13, STAT6 and TP53 were the main hubs. This network was enriched in pathways such as "B cell receptor signalling pathway", "Hepatitis B" and "NFKB signalling pathway" (S6 Table and S1 Fig). The second widest network (Network 2) was composed of 22 genes centered around BCL2. As expected, this network was notoriously enriched in "Apoptosis" pathway genes (S6 Table and
A fraction of the patients (58%) had matched RNAseq data available. We tested association between regulatory mutations and expression of the adjacent genes. Strong underexpression of DTX1 was associated with mutations in its promoter, which includes the GH12J113056 enhancer region (promoter q-value, 2.90 x 10 −4 ; enhancer q-value, 4.80 x 10 −3 , Fig 4). In the same line, mutations in S1PR2 enhancer were also significantly associated with S1PR2 underexpression (pvalue 2.46 x 10 −4 , Fig 4). On the contrary, mutations in the PAX5/ZCCHC7 enhancer were significantly associated with higher expression of ZCCHC7 (q-value 3.31 x 10 −2 ) but not with PAX5 expression (q-value 0.93). Using non-linear kernel regression adjusted for diagnostic subtype, we could confirm independent associations for DTX1 and GH12J11305 mutations (p-value 2.51x 10 −3 ) and S1PR2 and its enhancer (p-value 5.01 x 10 −3 ), but not for the association of ZCCHC7 expression with mutations in the PAX5/ZCCHC7 enhancer (p-value 0.17).

Differential events between B-cell lymphoma subtypes
We studied the distribution of the significant genomic events across the different B-cell lymphoproliferative subtypes (S10 Table). The greatest number of significant disparities (FDR < 5%) was discovered between CLL versus DLBCL and CLL versus follicular lymphoma. We detected 77 differential events between CLL and DLBCL, 34 differential events between CLL and follicular lymphoma, 11 differential events between CLL and Burkitt lymphoma, 17 differential events between DLBCL and Burkitt lymphoma, 11 differential events between follicular lymphoma and DLBCL and 9 differential events between follicular lymphoma and Burkitt lymphoma. Overall, 76.43% of all differential events were between CLL and any of the other lymphomas.
Although most differential events were less common in CLL, IGH deletions were highly enriched in CLL compared with the remaining lymphomas, and IGL deletions were more frequent in CLL compared to follicular lymphoma. Additionally, 11p15.5 deletions were more frequent in CLL than in follicular lymphoma or DLBCL, and 11q22.3 deletions were significantly more frequent in CLL than in DLBCL. In the same line, non-coding mutations in the IGH locus were significantly more frequent in CLL than in DLBCL or follicular lymphoma, and those in the IGL loci predominated in CLL over DLBCL. Furthermore, non-coding mutations in RP11-789C2.1 (4q28.3) and in the first intron of BACH2 (6q15) were significantly increased in CLL compared with DLBCL. On the contrary, coding mutations in 97 driver genes were significantly depleted in CLL compared with the remaining disease subgroups, likely reflecting the lower mutational burden on CLL. Additionally, 10 structural aberrations (9 deletions and 1 amplification) were significantly less frequent in CLL than in DLBCL or follicular lymphoma. The most significant were particularly predominant in DLBCL cases, and these were 6q26 loss (q-value 5.36 x 10 −5 ), 16q21 loss (q-value 4.56 x 10 −4 ) and 13q13.3 gain (q-value 4.56 x 10 −4 ).
As expected, mutations in MYC, ID3 and CCND3 were more frequent in Burkitt Lymphoma than in DLBCL or follicular lymphoma, and additionally we also detected the significant enrichments of Burkitt Lymphomas in TP53 and FBXO11 mutations. On the contrary, mutations in KMT2D and BCL2 were more prevalent among follicular lymphoma and DLBCL than in Burkitt Lymphoma. 1p36.32 deletion was absent in Burkitt lymphoma, but noncoding mutations in RP11-44H4.1 (3q27.3) were enriched in Burkitt lymphoma compared to DLBCL. Finally, the comparison of follicular lymphoma with DLBCL revealed significant differences in the mutational frequency of 11 genes. Of these, mutations in CREBBP, KMT2D, TNFRSF14 and RRAGC were enriched in follicular lymphoma, and those of BTG2, HLA-A, PIM1, IGLL5, SOCS1, CD83 and SGK1 were enriched in DLBCL.

Discussion
Different analyses of B-cell lymphoproliferative disorders have deconvoluted part of the complex genomic landscape of these neoplasms. Despite extensive evidence in other fields [20,34], this is the first combined analysis of whole genomes of B-cell lymphoid tumors performed to our knowledge. In this work, we detected 112 recurrently mutated genes across the genomes of different B-cell lymphoid malignancies, of which 31 (27.7%) were not previously described in any of the analyzed tumor subtypes. Among these, some of the most frequently mutated (FAM230A, FAM186A and PABPC3) are barely characterized genes with testis-biased expression. On the contrary, many others play roles in pathways linked with lymphomagenesis.  [54] in oncogenesis. Several other genes are members of the family of known lymphoma drivers, such as CXCR5 [55], HIST2H3D [56], ID2 [57] and IRF1 [58]. Additionally, 180 regulatory regions were significantly enriched in mutations, with a significant contribution of aSHM target loci (67.7% of cases). Importantly, we could detect regulatory mutations accompanied by aberrant underexpression of the tumor suppressors DTX1 [59,60] and S1PR2 [61,62]. 31 significantly mutated subnetworks involving 313 genes were discovered. In comparison with single gene approaches, this perspective provides a more complete landscape of mutational processes in B-cell lymphomas, and points towards the existence of new altered proteomic subnetworks in lymphomas. As a result, the genes CREBBP, BCL2, EEF1A1, GNA13, STAT6 and TP53, and the pathways "B-cell receptor", "Apoptosis", "Notch signalling", "Polycomb Repressive Complex" and "Toll-like receptor" emerge as master players of lymphomagenesis. Nevertheless, our results also support the implication of a myriad of novel players in the pathogenesis of B-cell lymphoid disorders, such as cell signalling proteins, cell-cycle regulators, ion transporters, cytoskeleton proteins, vesicle trafficking factors, extracellular enzymes and immunity genes. For some of these, the association with lymphomagenesis is well established, as in the case of CR2/CD21 [63], CD81 [64], GLI1 [65], SMO [66] and the self-activating autocrine loops of IL10 and IL13 with their receptors [67,68]. For other genes, likely associations can be inferred from its function, such as the cell-cycle checkpoint protein MAD1L1 [69] and the angiogenesis inhibitor BAI2 [70]. Finally, another group of genes belongs to emerging pathways in cancer whose function in B-cell lymphomas awaits further elucidation, such as glutamate receptors [71], ion channels [72] and microvesicles [73].
Finally, some clues are provided about the distribution of these mutational events across Bcell tumor subtypes. A relative depletion of CLL in mutations affecting common drivers was found, in line with the lower mutational burden of this tumor. Additionally, several differences in mutation frequencies were also detected between follicular lymphoma, DLBCL and Burkitt lymphoma. As expected, mutations in frequent drivers such as CREBBP, KMT2D , TNFRSF14 and RRAGC were more frequent among follicular lymphomas, those of MYC, CCND3 and ID3 prevailed among Burkitt lymphoma and those of BTG2, PIM1, SGK1 and SOCS1 were more frequent among DLBCL. Some of these findings are concordant with reported frequencies of driver genes across distinct lymphoma subtypes [85][86][87], whereas others provide new clues about the pathogenesis and possible drug targets in these tumors. For example, the increased mutational burden of TP53 and the BCL6-regulator gene FBXO11 among Burkitt lymphomas suggest an increased deregulation of these pathways in this disease [88,89]. Additionally, the skewed mutational profile of the immune check-point regulator CD83 [90] towards DLBCL tumors might have both biological and therapeutic implications.
This work, as many others, has some limitations. For example, although the included B-cell disorders represent a majority of patients in real practice, other frequent B-cell malignancies need to be taken into account in the future. Furthermore, protein network analysis is still limited by incomplete annotation of the protein interactome and by the type of input scores that can be used as input. Additionally, it should be noted that data produced from different research groups can be affected by batch effects, which is the reason why we used a uniform and optimized pipeline for the analysis.
In conclusion, we present an integrated overview of the genomic drivers of some of the most frequent B-cell lymphoproliferative disorders. Our results shed new light about the pathogenic mutations and structural aberrations in coding and noncoding regulatory regions of the genome of B-cell lymphoproliferative disorders, and pinpoint towards disease-specific mutational events that might be useful both for therapeutic and diagnostic purposes.