Developing Molecular Signatures for Chronic Lymphocytic Leukemia

Chronic lymphocytic leukemia (CLL) is a clonal malignancy of mature B cells that displays great clinical heterogeneity: many patients have an indolent disease that will not require intervention for many years, while others present with an aggressive and symptomatic leukemia requiring immediate treatment. Although there is no cure for CLL, the disease is treatable and current standard chemotherapy regimens have been shown to prolong survival. Recent advances in our understanding of the biology of CLL have led to the identification of numerous cellular and molecular markers with potential diagnostic, prognostic and therapeutic significance. We have used the recently developed digital multiplexed gene-expression technique (DMGE) to analyze a cohort of 30 CLL patients for the expression of specific genes with known diagnostic and prognostic potential. Starting from a set of 290 genes we were able to develop a molecular signature, based on the analysis of 13 genes, which distinguishes CLL from normal peripheral blood and from normal B cells, and a second signature based on 24 genes, which distinguishes mutated from unmutated cases (LymphCLL Mut). A third classifier (LymphCLL Diag), based on a 44-gene signature, distinguished CLL cases from a series of other B-cell chronic lymphoproliferative disorders (n = 51). While the methodology presented here has the potential to provide a "ready to use" classification tool in routine diagnostics and clinical trials, application to larger sample numbers is still needed and should provide further insights into its robustness and utility in clinical practice.


Introduction
Chronic lymphocytic leukemia (CLL) is the most common leukemia in the Western world. Diagnosis is based on the results of flow cytometric analysis of malignant B cells obtained from peripheral blood, bone marrow, lymph nodes, and other organs. The characteristic phenotype
The analysis of the IGHV-D-J mutation status was performed on genomic DNA after isolation of leukemic cells on a Ficoll gradient. PCR amplification of IgH rearrangements was performed with either family-specific VH leader primers [11] or FR1 primers, using the BIOMED-2 protocol [12]. PCR amplicons were subjected to direct sequencing on both strands. Sequence data were analyzed using the IMGT database and the IMGT/V-QUEST tool (http://www.imgt.org). Only productive rearrangements were evaluated. VH sequences with a germline homology of 98% or higher were considered unmutated, and those with a homology of less than 98% were considered mutated [13].
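The 98% cut-off described above can be written as a small helper. This is only an illustrative sketch: the function name `classify_ighv` and its parameters are hypothetical, and the germline identity is assumed to have been obtained beforehand (for example from an IMGT/V-QUEST report).

```python
def classify_ighv(germline_identity_percent: float, cutoff: float = 98.0) -> str:
    """Classify an IGHV sequence from its percent identity to germline.

    Identity >= cutoff -> "unmutated"; identity < cutoff -> "mutated".
    The function name and signature are illustrative assumptions.
    """
    if not 0.0 <= germline_identity_percent <= 100.0:
        raise ValueError("identity must be a percentage between 0 and 100")
    return "unmutated" if germline_identity_percent >= cutoff else "mutated"
```

The sketch implements only the binary cut-off stated in the text and makes no attempt to model borderline cases.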
All the relevant patient information is presented in S1 Table. Normal blood samples were obtained from blood donors of the Geneva blood transfusion center. Pure CD19+ B cells were prepared from four Ficoll-enriched normal blood samples using a Selection Kit from Stemcell Technologies, according to the manufacturer's instructions. Purity of the isolated cell populations was verified by flow cytometry with specific anti-CD19 and anti-CD20 antibodies and was >95% in all cases (data not shown).
Additional samples for this study were obtained from the three centers from a series of 51 patients with the following diagnoses: 20 mantle cell lymphoma (MCL), 22 marginal zone lymphoma (MZL) including 8 splenic marginal zone lymphoma (SMZL) with villous lymphocytes, 4 follicular lymphoma (FL), 5 hairy cell leukemia (HCL). RNA from these samples was processed in the same way as from the CLL samples described above.

mRNA Analysis
For the analysis with the nCounter system, either 250 ng of extracted mRNA or mRNA in lysis buffer, corresponding to the equivalent of 10^5 cells, was used, according to the manufacturer's protocol (NanoString Technologies, Seattle, WA, USA). In brief, 4 μl of cell lysate or extracted mRNA was hybridized with the NanoString CodeSet overnight at 65°C. Probes for the analysis of 290 different genes were synthesized by NanoString Technologies, together with probes for nine normalization genes (S2 Table). After probe hybridization and NanoString nCounter digital reading, counts for each mRNA species were extracted, analyzed using an in-house Excel macro, and then expressed as counts (molecules of mRNA/sample), as described previously [14]. The nCounter CodeSet contained two types of built-in controls: positive controls (spiked mRNA at various concentrations to assess the overall assay performance) and negative controls (alien probes for background calculation). Data handling and analysis were performed as described: background correction consisted of subtracting the negative control average plus two SD from the original counts. To select adequate normalization genes from the series of nine candidates included in the CodeSet (ACTB, TBP, RPL19, RPLP0, G6PD, ABCF1, B2M, TPT1, RPS23), their relative stability was evaluated using the geNorm method [15]. For the final normalization of the sample values, the geometric mean of the counts obtained for the three selected normalization genes (RPL19, RPLP0 and TPT1) was calculated and used as the normalization factor.
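The background correction and normalization steps described above can be sketched as follows. This is a minimal illustration, not the Excel macro used in the study. Several details are assumptions because the text does not specify them: the use of the sample standard deviation, the flooring of negative values at zero, and the division of each corrected count by the normalization factor. The function names are hypothetical.

```python
from math import prod
from statistics import mean, stdev

# Normalization genes selected in the text.
NORMALIZERS = ("RPL19", "RPLP0", "TPT1")


def background_level(negative_controls):
    """Mean of the negative-control counts plus two standard deviations."""
    # Assumption: sample SD (the text does not say sample vs population SD).
    return mean(negative_controls) + 2 * stdev(negative_controls)


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return prod(values) ** (1.0 / len(values))


def normalize_sample(raw_counts, negative_controls, normalizers=NORMALIZERS):
    """Background-correct one sample and scale it by its normalization factor.

    raw_counts: dict mapping gene name -> raw count for one sample.
    negative_controls: counts of the negative-control probes of that sample.
    """
    background = background_level(negative_controls)
    # Assumption: counts below background are floored at zero.
    corrected = {g: max(c - background, 0.0) for g, c in raw_counts.items()}
    factor = geometric_mean([corrected[g] for g in normalizers])
    # Assumption: the factor is applied by simple division.
    return {g: c / factor for g, c in corrected.items()}
```

With negative controls of 4, 6 and 8 counts, the background level is 6 + 2 × 2 = 10 counts, which is subtracted from every probe before the geometric mean of the three normalization genes is computed.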
The data have been submitted to GEO and can be accessed under accession number GSE66425.

Establishing a gene list for mRNA analysis
We performed an extensive literature search and extracted a set of 290 genes from published articles and public databases, satisfying one of the following criteria: reported to be overexpressed in normal B cells compared to other blood cells; to be over- or under-expressed in CLL samples compared to normal blood samples; or to be over- or under-expressed in CLL samples compared to other B-CLPD (S2 Table). In total, a set of 299 genes (290 genes + 9 normalization genes) was used to study the mRNA profile in 5 normal peripheral blood (PB) samples, 4 purified B cell samples, and 30 samples from patients with CLL.

Statistical analysis
Microsoft Excel, GraphPad Prism and Partek Genomics Suite software packages were used for statistical calculations and data presentation; a p-value < 0.05 was considered significant. An arbitrary cut-off was chosen to describe a gene as being over- or under-expressed in comparisons between two patient cohorts: ≥50 counts by DMGE and a ≥2-fold change in expression in either direction (mean of population 1 divided by mean of population 2, or the inverse ratio, ≥2), with a p-value ≤0.05, using Student's t-test.
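The selection rule above (minimum expression, minimum fold change, significant t-test) can be sketched in plain Python. This is an illustration only and does not reproduce the software packages named in the text. The function names are hypothetical; applying the 50-count threshold to the higher of the two group means is an assumption; and the t-test is the classical pooled-variance Student's t-test, with the p-value obtained by numerical integration to avoid external dependencies.

```python
import math
from statistics import mean, variance


def two_sided_p(t, df, steps=1000):
    """Two-sided p-value for a t statistic with df degrees of freedom.

    The substitution x = sqrt(df) * tan(theta) turns the t-density integral
    into a bounded integral of cos(theta) ** (df - 1), which is evaluated
    with Simpson's rule (steps must be even).
    """
    theta_max = math.atan(abs(t) / math.sqrt(df))
    h = theta_max / steps
    total = 1.0 + math.cos(theta_max) ** (df - 1)  # integrand at both ends
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * math.cos(i * h) ** (df - 1)
    integral = total * h / 3
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    central_mass = 2 * math.exp(log_coef) / math.sqrt(math.pi) * integral
    return min(1.0, max(0.0, 1.0 - central_mass))


def student_t_pvalue(a, b):
    """Two-sided p-value of the pooled-variance Student's t-test."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    diff = mean(a) - mean(b)
    if pooled == 0:
        return 1.0 if diff == 0 else 0.0
    t = diff / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return two_sided_p(t, df)


def is_differential(group1, group2, min_counts=50.0, min_fold=2.0, alpha=0.05):
    """Apply the count, fold-change and p-value criteria to one gene."""
    m1, m2 = mean(group1), mean(group2)
    # Assumption: the count threshold applies to the higher group mean.
    if max(m1, m2) < min_counts:
        return False
    low, high = sorted((m1, m2))
    fold = math.inf if low == 0 else high / low
    return fold >= min_fold and student_t_pvalue(group1, group2) <= alpha
```

For example, a gene with counts of about 1000 in four purified B cell samples and about 30 in five PB samples passes all three criteria, whereas a gene with similar means in both groups fails the fold-change test regardless of its p-value.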

Results
Numerous gene expression profiling studies have been performed during the last two decades on B-CLPD, and CLL in particular, based on microarray technologies. We set out to test a recently developed DMGE method for its potential utility in CLL diagnostics and prognosis determination.

Genes expressed preferentially by normal B-cells
To identify genes expressed preferentially by normal B cells compared to other blood cells, samples with purified B cells were prepared as described above and compared to samples from normal PB. Out of 290 genes, 99 fulfilled the criteria of an expression level ≥50 counts and a ≥2-fold change in expression (mean of the four pure B cell samples divided by the mean of the five normal PB samples ≥2), with a p-value ≤0.05. As expected, this list contained genes coding for B-cell specific surface antigens, namely CD19, CD20 and CD79, as well as for B-cell specific transcription factors, like PAX5 and SOX11, or for immunoglobulin heavy and light chains (Table 1; for the complete list of the 99 genes see S3 Table).
Fold changes with values of 20-40 corresponded to the ratio of B cells present in the pure B cell and peripheral blood samples (>95% versus 2-3%, respectively) and were typically found for B cell specific genes, like CD19, CD20 and CD22. Fold changes with values <20 corresponded to genes that are expressed not only by B cells but also by other peripheral blood cells, or to genes for which the hybridization kinetics were not optimal.

Genes expressed preferentially in CLL samples compared to normal blood cells
A panel of 30 CLL samples, with a mean of 80% malignant B cells/sample (range: 58-100% per sample) and characterized for their typical cell surface phenotype and the presence or absence of IgVH mutations, was used for this study. To find, among the 290 genes, those that differentiate most effectively between CLL and normal samples, we compared the mean gene counts from the 30 CLL samples to the mean from the normal PB samples and to the mean from the pure B cell samples. As before, we selected only genes that fulfilled the arbitrary criteria of an expression level ≥50 counts and a ≥2-fold change in expression with a p-value ≤0.05. Fig 1A shows the Venn diagram for the genes expressed preferentially in the three different sample groups.
We obtained a list of 111 genes expressed preferentially in CLL samples compared to normal PB, and of 86 genes compared to pure B cells (S4 Table). Forty-four genes were common to both lists (Table 2A). Interestingly, this set of 44 genes contained genes well known to be overexpressed in CLL, like CD5, LPL and ROR1, but also kappa, lambda and IgG genes, showing that on a per-cell basis CLL B cells produce more IgG mRNA than normal B cells.
As expected, principal component analysis (PCA) performed on all the 290 genes resulted in a clear separation of the samples according to their origin (pure B cells, normal PB or CLL; Fig 1B); restricting the PCA to the 44 genes defined above resulted in a slightly different distribution, with normal PB and pure B cells clustered together, but with all CLL samples clearly separated from them (Fig 1C). A detailed analysis of the expression levels of these 44 genes showed a wide range of coefficients of variation (CV; Table 2A). Thirteen of the 44 genes were expressed at a highly similar level in all CLL samples, with a CV < 0.5 (BMI1, CD200, CD27, CD5, COL9A2, DNMBP, FAIM3, GNRH1, LEF1, RASGRF1, ROR1, SFMBP1, TTN) (S1 Fig).
These thirteen genes with the lowest CV can therefore be used as a classifier that unambiguously separates CLL samples from normal PB samples, being restricted to genes preferentially and homogeneously expressed in CLL B cells compared to normal B cells (Fig 1D). Several genes were also found to be specifically underexpressed in all CLL samples as compared to samples with pure B cells (Table 2B). These correspond to genes downregulated in CLL B cells compared to normal B cells, like IL-6, TIMP4 and MMP12, whose mRNA is absent in CLL B cells although normal B cells produce it in high amounts (mean: 1758.9, 92.7 and 90.5 copies/sample, respectively).
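The restriction to homogeneously expressed genes can be illustrated with a short sketch that computes the coefficient of variation (standard deviation divided by mean) for each gene across the CLL samples and keeps those below 0.5. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def coefficient_of_variation(counts):
    """CV = standard deviation / mean of the counts of one gene."""
    m = mean(counts)
    if m == 0:
        raise ValueError("CV is undefined for a mean of zero")
    # Assumption: sample standard deviation.
    return stdev(counts) / m


def homogeneous_genes(counts_by_gene, max_cv=0.5):
    """Return, in alphabetical order, the genes whose CV is below max_cv.

    counts_by_gene: dict mapping gene name -> list of normalized counts,
    one value per CLL sample.
    """
    return sorted(
        gene
        for gene, counts in counts_by_gene.items()
        if coefficient_of_variation(counts) < max_cv
    )
```

A gene with counts such as 100, 110, 90 and 100 has a CV of about 0.08 and is retained, whereas a gene that is present in some samples and absent in others has a CV above 1 and is excluded.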

Correlation between protein expression and mRNA expression
Flow cytometry allows the analysis of surface and intracytoplasmic antigens of CLL B cells. We studied the correlation between some antigens measured by routine flow cytometry (i.e., CD19, CD20, CD5, CD23, CD38, CD200, kappa, lambda, ZAP70) and mRNA counts obtained by the nCounter measurements (Table 3). The mRNA expression levels corresponded to the protein expression levels detected by flow cytometry. In all the samples the CLL B cells strongly expressed CD5, CD23, CD43 and CD200 on their surface, whereas CD20 expression was decreased compared to normal B cells.
Correlating the CD38 mRNA counts with the results from flow cytometry (S1 Table) resulted in a correlation coefficient of 0.53, similar to the correlation coefficients found in a previous study on surface antigens in acute myeloid leukemia (AML) blasts [9] (Fig 2A). We then compared the mRNA counts in the cohort of CD38pos CLL patients with the CD38neg cohort, and the ZAP70pos cohort with the ZAP70neg one: as expected, CD38pos and ZAP70pos B-cell samples exhibited higher mRNA counts than the corresponding negative samples (Fig 2B). Interestingly, the comparison with normal B cells gave different results for CD38 and ZAP70: normal B cells, although expressing low surface CD38 protein, showed higher CD38 mRNA counts than the CD38neg CLL samples, but lower ZAP70 mRNA counts than the ZAP70neg CLL samples (Fig 2B).
Analysis of the kappa/lambda ratio at the mRNA and protein level showed a 100% concordance between flow cytometry results and mRNA counts: all CLL cases that were monoclonal by cytometry also showed abnormal ratios in their mRNA counts (Fig 2C).
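A correlation coefficient between mRNA counts and flow cytometry values can be computed as in the sketch below. The text does not state which coefficient was used; Pearson's r is assumed here, and the function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

In this setting, `x` would hold the CD38 mRNA counts and `y` the corresponding flow cytometry values, one pair per patient.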

Analysis of mutated and unmutated cases of CLL
One of the most significant prognostic factors identified in CLL that ultimately ties to the biology of the disease is the mutational status of the variable region of the immunoglobulin heavy chains. We determined the mutation status in our cohort of patients and found 11 patients with mutated and 17 patients with unmutated IgVH; 2 patients were borderline (S1 Table). In order to determine which genes were correlated with the mutation status, we compared mutated to unmutated samples and listed genes overexpressed either in mutated versus unmutated, or in unmutated versus mutated samples (Table 4). Twenty-four genes were found to be differentially expressed: nineteen genes were overexpressed in unmutated, and five in mutated samples. Among the differentially expressed genes were nine genes from the 44-gene list, which we used to distinguish CLL from normal samples, as well as several genes described in the literature: CD38, ZAP-70, LPL, etc. ([4], [16], [17], [18]; see also Table 4). This 24-gene panel, called "LymphCLL Mut", allowed a clear distinction between both types of CLL (Fig 3; S2 Fig).
Fig 1. (A) [...] group, if they showed an expression level ≥50 counts and a ≥2-fold difference in expression levels between the 2 groups, with a p-value ≤0.05. (B) Samples from PB (n = 5), B cell samples (n = 4), and samples from CLL patients (n = 30) were analyzed by PCA, based on the results of the differential expression of 290 genes. (C) PCA on the same samples as in (B), but using a restricted set of 44 genes relevant for this purpose according to their differential expression in CLL, B cells and normal blood. (D) Heat map of normal PB, pure B cells, and CLL samples, analyzed with thirteen genes overexpressed homogeneously (CV<5%) in all CLL samples compared to normal PB and pure B cells. Unsupervised analysis shows a perfect clustering of the CLL samples compared to the normal samples.
doi:10.1371/journal.pone.0128990.g001
Total light chain production was increased in mutated vs unmutated samples (185207 counts vs 324812 counts; p = 0.008).

Analysis of LDOC1 expression
LDOC1 mRNA has been reported to be highly expressed in aggressive cases of CLL and to correlate with IgVH mutation status and with prognosis [22]. When we analyzed LDOC1 mRNA expression in our 30 CLL cases, we indeed found a dichotomous distribution, completely different from the homogeneous distribution that we described for the thirteen genes used for the CLL classifier (S1 Fig). Interestingly, among all the 290 genes analyzed, only eight genes were found to correlate with LDOC1 expression: six genes with a positive correlation (SEPT10, LPL, CD26, EPB41L2, CXCR6, CRY1) and two genes with a negative correlation (ADAM29, CD150; Table 5; S3 Fig). ADAM29 was exclusively expressed in samples with absent/low LDOC1 expression and vice versa (Fig 4A). Superficially, this expression pattern corresponded to the IgVH mutation status of these samples, but a closer inspection yielded a group of five samples with absence of both LDOC1 and ADAM29 mRNAs (two mutated and three unmutated cases). The LDOC1/ADAM29 ratio clearly reflects this separation into three different groups of samples. Interestingly, in a previous report Oppezzo et al. published the LPL/ADAM29 ratio as a surrogate marker for IgVH status [24]. Comparing this ratio in our samples to the IgVH status showed concordance in 27/30 samples; the three discordant samples corresponded to samples with absent LDOC1 or ADAM29 expression (Fig 4B).

Validating the CLL classifier
In order to develop a clinically useful classifier, CLL samples have to be distinguished unambiguously not only from normal PB samples, but also from other lymphoma subtypes. We therefore tested the 44 genes found to distinguish CLL from normal PB samples on a series of 51 patients with different B-CLPD, i.e., MCL, MZL, FL and HCL. PCA showed a clear separation of the CLL from all the other B-CLPD samples, with the exception of one confirmed CLL case, which was misdiagnosed (Fig 5). None of the B-CLPD samples was misdiagnosed as CLL.
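The classification result can be summarized as sensitivity and specificity. The sketch below assumes that the CLL group consists of the 30 samples of the cohort, of which one was misclassified, and that all 51 other B-CLPD samples were classified correctly; the function names are hypothetical.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of CLL cases that the classifier recognizes as CLL."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of non-CLL cases that the classifier recognizes as non-CLL."""
    return true_negatives / (true_negatives + false_positives)


# Assumed counts: 29 of 30 CLL samples and 51 of 51 other B-CLPD samples
# were classified correctly.
cll_sensitivity = sensitivity(true_positives=29, false_negatives=1)
cll_specificity = specificity(true_negatives=51, false_positives=0)
```

Under these assumptions the sensitivity is 29/30, or about 97%, and the specificity is 51/51, or 100%.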

Discussion
In the present work we describe the use of DMGE, a recently developed technique for the quantitative and parallel analysis of hundreds of mRNAs, in a cohort of 30 CLL patients. Starting from a set of 290 genes with preferential expression in B cells and CLL cells described in previously published reports, we were able to establish lists of genes with preferential expression in normal and in CLL B cells, respectively, which allowed us to distinguish unambiguously CLL samples from normal PB samples and CLL B cells from normal B cells. Restricting these gene lists to genes expressed homogeneously by all CLL samples, independent of chromosomal abnormalities, yielded a classifier based on the analysis of only 13 genes. Applying this classifier in an unsupervised analysis of our cohort resulted in a perfect separation of all 30 CLL samples.
Table 4. Genes with an expression level > 50, and a ratio mutated/unmutated or unmutated/mutated > 2 with a p-value < 0.05 are shown; references from the literature for each cited gene are given, when available.
doi:10.1371/journal.pone.0128990.t004
Adding kappa/lambda ratios to the classifier will certainly increase its discriminative power, since our results show for all cases a clear distinction between polyclonal samples (normal PB and sorted normal B cells) and CLL samples with essentially monoclonal B cell populations. In a previous study, we used DMGE in acute myeloid leukemia (AML) to correlate leukemic blast mRNA expression with surface antigens determined by flow cytometry [9]. The present study confirms a close correlation between flow cytometry results and DMGE analysis for some surface proteins.
The DMGE technique also allows the study of genes with prognostic relevance in parallel with the 13-gene diagnostic classifier, using a single assay. Interestingly, these genes fall broadly into two categories: those expressed with wide variations in different samples (up to a 10^4-fold difference in mRNA expression; e.g., LILR4 and CLLU1), and those with a present/absent, dichotomous pattern (e.g., LDOC1, LPL, ADAM29).
Analysis of the IgVH mutation status is widely used to distinguish patients with a good prognosis from those with a bad prognosis. Several surrogate markers have been described in the literature and shown to correlate with the IgVH mutation status. We could confirm most of them, such as ZAP70, CD38, LPL and LDOC1 (Table 4). In contrast, we did not find any differential expression for FCRL2 (p = 0.08) or HS1 (p = 0.10), which have also been reported to vary between mutated and unmutated CLL samples [36], [37]. In an unsupervised analysis, 28/30 (93%) CLL samples were correctly classified using the "LymphCLL Mut" classifier based on 24 genes with a differential expression between IgVH mutated and unmutated cases.
The LPL/ADAM29 ratio has previously been described as a surrogate marker for the IgVH mutation status [24], and also as being related to prognosis [38]. Whereas this ratio distinguishes two different types of CLL samples, the determination of the LDOC1/ADAM29 ratio allowed the distinction of 3 subclasses: IgVH mutated samples with high expression of ADAM29, unmutated samples with high expression of LDOC1, and a third category (mixed mutated and unmutated samples) without expression of LDOC1 and ADAM29. This third group did not show any common IgVH usage or chromosomal abnormalities. Future studies will have to determine whether this category has any clinical significance or correlates with prognosis.
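The three subclasses described above can be illustrated with a simple rule on the two mRNA counts. This is a hypothetical sketch: the text gives no numerical threshold for calling LDOC1 or ADAM29 "expressed", so the 50-count expression cut-off used elsewhere in the study is borrowed here as an assumption, and the function name and group labels are illustrative.

```python
def ldoc1_adam29_group(ldoc1: float, adam29: float, min_counts: float = 50.0) -> str:
    """Assign a CLL sample to a subclass from its LDOC1 and ADAM29 counts.

    min_counts is an assumed threshold for calling a gene "expressed".
    """
    ldoc1_on = ldoc1 >= min_counts
    adam29_on = adam29 >= min_counts
    if ldoc1_on and not adam29_on:
        return "LDOC1-high"
    if adam29_on and not ldoc1_on:
        return "ADAM29-high"
    if not ldoc1_on and not adam29_on:
        return "double-negative"
    # Co-expression of both genes is not described in the text.
    return "unclassified"
```

A sample with high LDOC1 and absent ADAM29 counts would fall into the group that mostly corresponds to unmutated IgVH, the reverse pattern into the group that mostly corresponds to mutated IgVH, and a sample lacking both mRNAs into the third, mixed category.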
In our final analysis we tested the 44-gene signature, which differentiated CLL from normal PB samples, on a set of 51 samples from patients with various common B-CLPD. Similar to the flow cytometric RMH score, our "LymphCLL Diag" molecular classifier distinguished CLL from other B-CLPD with high sensitivity and specificity (97% and 100%, respectively). By combining the "LymphCLL Diag" gene panel with the "LymphCLL Mut" panel, the kappa/lambda ratio and the LDOC1/ADAM29 ratio, a complete diagnostic and prognostic procedure could be performed in one single "ready to use" assay, based on a panel of 61 genes. Several of the laboratory analyses described for diagnostic and prognostic purposes in CLL are time- and labor-intensive and not well suited for routine testing in most clinical laboratories. One example is the determination of the IgVH mutational status, which is rather expensive, needs specialized know-how, and is currently only performed in a restricted number of laboratories under the expertise of molecular biologists [39]. Threshold levels are arbitrary (in most reports sequences with > 2% deviation from germline are considered mutated) and a grey zone exists [39].
Another example is ZAP70 expression analysis, either by flow cytometry, which has been largely abandoned due to difficulties in standardization [40] [41], or by RT-PCR, which requires purification of B cells prior to the assay [39], rendering this approach unsuitable for routine diagnostics.
The new sequencing technologies also hold the promise of providing valuable data for prognosis determination in CLL patients, most notably on TP53, NOTCH1, ATM, SF3B1 and BIRC3 mutations [42] [43]. Mutations in these genes occur in approx. 2%-17% of CLL patients at diagnosis, and the prognostic importance of some of them has already been studied in prospective trials [44] [45]. Whether this information is complementary to established prognostic factors and to results from mRNA and gene expression studies like ours should ideally be investigated in large future prospective and comparative trials. Although already widely used for research purposes, deep sequencing techniques are not yet used in routine laboratories, and the expensive, labor-intensive technology and complex bioinformatics software will make this transfer challenging.
With the development of DMGE a new technique has arrived that allows genetic profiling with the parallel analysis of hundreds of mRNAs by a technically very simple method. DMGE has a short turn-around time of < 2 days, needs minimal hands-on time for technicians and is much less costly than whole gene expression profiling or deep sequencing. Moreover, this technique allows for automated data analysis and has a read-out that is intuitive and does not need complicated bioinformatics tools for analysis or interpretation. By focusing our approach on the analysis of a highly selected set of genes expressed preferentially by B cells, we could obtain signatures from the analysis of whole blood samples, rendering an additional B cell purification step unnecessary.
The parallel quantitative analysis of tens to hundreds of mRNAs allows the integration of several diagnostic and prognostic factors in one assay, in contrast to numerous past studies, which analyzed only one or two factors at a time. It should therefore be ideally suited for large trials aiming at the comparison of multiple factors in many different patient samples and for the molecular characterization of cases without available living cells for flow cytometry, such as cDNA or FFPE samples. Smaller labs could also profit from this approach for routine diagnostics based on automated data analysis.
To fully appreciate the clinical usefulness and discriminative power of this approach, prospective studies with a much larger number of CLL samples will have to be performed, including samples from other B-CLPD and reactive/inflammatory conditions. Additional prognostic markers can easily be incorporated into the classifier and then be studied simultaneously in clinical trials, but also in routine diagnosis. Integrating information from gene profiling studies with results from genomic mutation and NGS analyses should ultimately lead to better prognostication schemes for patients.