The authors have declared that no competing interests exist.
Conceived and designed the experiments: MAC AJ DRW RMT. Performed the experiments: MAC. Analyzed the data: MAC DRW RMT. Contributed reagents/materials/analysis tools: SB LW AJ. Wrote the paper: MAC RMT.
Cell of origin classification of diffuse large B-cell lymphoma (DLBCL) identifies subsets with biological and clinical significance. Despite the established nature of the classification, existing studies display variability in classifier implementation, and a comparative analysis across multiple data sets is lacking. Here we describe the validation of a cell of origin classifier for DLBCL, based on balanced voting between 4 machine-learning tools: the DLBCL automatic classifier (DAC). This shows superior survival separation for assigned Activated B-cell (ABC) and Germinal Center B-cell (GCB) DLBCL classes relative to a range of other classifiers. DAC is effective on data derived from multiple microarray platforms and from formalin-fixed paraffin-embedded samples, and is parsimonious, using 20 classifier genes. We use DAC to perform a comparative analysis of gene expression in 10 data sets (2030 cases). We generate ranked meta-profiles of genes showing consistent class-association using ≥6 data sets as a cut-off: ABC (414 genes) and GCB (415 genes). The transcription factor
Diffuse large B-cell lymphomas (DLBCL), the commonest human lymphoma type, can be separated into distinct categories based on gene expression signature, and relationship to normal stages of B-cell differentiation
Since the inception of the cell of origin classification
In routine practice most diagnostic material is formalin-fixed and paraffin embedded (FFPE), which may not yield comparable gene expression data to that obtained from fresh material. While approaches have been developed to circumvent this issue, immunohistochemical surrogates fail to recapitulate the success of the gene expression based classifier on a consistent basis
The segregation of DLBCL into cell of origin classes has led to profound insight into disease biology
Here we describe a systematic analysis of tools for implementation of the cell of origin classification of DLBCL. We establish a platform robust classifier based on balanced voting between four machine-learning tools. This is effective on FFPE material, and provides improved survival separation for ABC and GCB classes for the majority of data sets analyzed. We make this tool available as open source software. The development of this tool allows the first comparative analysis of gene expression across 10 DLBCL data sets encompassing 2030 cases uniformly classified with the same implementation. We define meta-profiles of genes consistently associated with ABC- and GCB class, assess consistent molecular signature enrichments and provide these data as a resource.
A formalin-fixed paraffin embedded (FFPE) data set was produced by the Haematological Malignancy Diagnostic Service (HMDS; St. James’s Institute of Oncology, Leeds) and details of preparation, epidemiology and outcome data are described in detail elsewhere
Each dataset was quantile normalized using the R Limma package
For all data sets the probes for each of the classifier genes were merged using 2 methods: (1) the median value across the probes for each gene (
The Survival library for R was used to analyze right-censored survival data. Overall survival was estimated using the Kaplan-Meier method and modeled with the Cox proportional hazards technique
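To make the Kaplan-Meier estimate for right-censored data concrete, the following is a minimal pure-Python sketch. It is not the R Survival code used in the study; the function name `kaplan_meier` and the toy follow-up times are assumptions introduced only for illustration.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if right-censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        survival *= 1.0 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up times (months); 0 marks a censored observation.
toy_times = [6, 12, 12, 18, 24, 30]
toy_events = [1, 1, 0, 1, 0, 0]
for t, s in kaplan_meier(toy_times, toy_events):
    print(f"t={t:>2}  S(t)={s:.3f}")
```

Censored patients contribute to the at-risk count until their last follow-up but never trigger a drop in the curve, which is what distinguishes this estimate from a simple proportion of survivors.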
Classifier | Average Rank | HMDS (GSE32918): Hazard Ratio | 95% CI Lower | 95% CI Upper | p-value | GSE10846 R-CHOP: Hazard Ratio | 95% CI Lower | 95% CI Upper | p-value | GSE10846 CHOP: Hazard Ratio | 95% CI Lower | 95% CI Upper | p-value |
SMO | 3.7 | 0.56 | 0.32 | 1.01 | 0.052 | 0.25 | 0.14 | 0.46 | 8.07E-06 | 0.36 | 0.23 | 0.55 | 2.06E-06 |
LMT | 12.0 | 0.57 | 0.33 | 0.98 | 0.043 | 0.28 | 0.15 | 0.50 | 2.35E-05 | 0.40 | 0.26 | 0.61 | 3.17E-05 |
BayesNet | 14.3 | 0.67 | 0.38 | 1.17 | 0.155 | 0.24 | 0.13 | 0.45 | 9.18E-06 | 0.41 | 0.26 | 0.64 | 7.47E-05 |
J48 | 15.0 | 0.68 | 0.40 | 1.17 | 0.162 | 0.28 | 0.15 | 0.50 | 1.58E-05 | 0.41 | 0.27 | 0.63 | 4.29E-05 |
RF100 | 16.3 | 0.85 | 0.48 | 1.48 | 0.561 | 0.24 | 0.13 | 0.44 | 3.34E-06 | 0.41 | 0.27 | 0.63 | 5.06E-05 |
RF200 | 18.3 | 0.78 | 0.45 | 1.35 | 0.379 | 0.24 | 0.13 | 0.44 | 3.79E-06 | 0.44 | 0.29 | 0.68 | 1.81E-04 |
LPS 0.9 MaxAvgMerge | 21.3 | 0.68 | 0.36 | 1.27 | 0.226 | 0.31 | 0.17 | 0.55 | 8.49E-05 | 0.40 | 0.26 | 0.63 | 7.65E-05 |
LPS 0.9 MedianMerge | 21.7 | 0.64 | 0.34 | 1.20 | 0.162 | 0.32 | 0.17 | 0.58 | 1.77E-04 | 0.42 | 0.27 | 0.65 | 1.19E-04 |
LPS 0.8 MedianMerge | 22.3 | 0.73 | 0.41 | 1.28 | 0.267 | 0.31 | 0.17 | 0.54 | 5.60E-05 | 0.44 | 0.29 | 0.67 | 1.17E-04 |
LPS 0.8 MaxAvgMerge | 23.7 | 0.71 | 0.40 | 1.25 | 0.235 | 0.33 | 0.19 | 0.58 | 1.04E-04 | 0.43 | 0.28 | 0.66 | 1.20E-04 |
REPTree | 24.7 | 1.51 | 0.85 | 2.69 | 0.164 | 0.43 | 0.24 | 0.77 | 4.64E-03 | 0.47 | 0.31 | 0.71 | 4.03E-04 |
FT | 27.7 | 0.77 | 0.41 | 1.46 | 0.426 | 0.33 | 0.18 | 0.61 | 3.92E-04 | 0.48 | 0.30 | 0.76 | 1.92E-03 |
BFTree | 28.7 | 0.91 | 0.51 | 1.64 | 0.758 | 0.40 | 0.22 | 0.72 | 2.33E-03 | 0.50 | 0.32 | 0.76 | 1.38E-03 |
NBTree | 28.7 | 0.96 | 0.50 | 1.88 | 0.915 | 0.29 | 0.15 | 0.52 | 5.61E-05 | 0.57 | 0.37 | 0.89 | 1.30E-02 |
RandomTree | 28.7 | 0.97 | 0.55 | 1.72 | 0.914 | 0.27 | 0.13 | 0.57 | 5.58E-04 | 0.44 | 0.28 | 0.70 | 4.97E-04 |
SimpleCart | 30.3 | 0.93 | 0.52 | 1.67 | 0.804 | 0.48 | 0.27 | 0.85 | 1.27E-02 | 0.56 | 0.36 | 0.86 | 8.46E-03 |
Results obtained with individual machine-learning tools, trained on the Wright et al. data set and using 20 classifier genes are shown. Survival separation between ABC and GCB classes for the data sets GSE32918, and GSE10846 divided into CHOP and R-CHOP components, was used for assessment. Hazard Ratios were generated for GCB relative to ABC as baseline. The classifiers were ordered by their average rank across the data sets; with rank determined by the p-value of the ABC/GCB separation. The LPS classifier was used for comparison with either a 0.8 or 0.9 p-value cut-off, with either MaxAvgMerge or MedianMerge methods of combining probes (see Materials and Methods). The Classifier Identity, Hazard Ratio (GCB vs ABC as baseline), 95% confidence interval of the Hazard Ratio, and the resulting p-value for survival separation are shown.
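The ordering by average rank described in this legend can be sketched as follows. This is an illustrative outline only: the function name `average_ranks` is an assumption, ties are not handled, and only three classifiers from the table are included, so the resulting ranks are smaller than the published values, which were computed over more classifiers.

```python
def average_ranks(p_values):
    """p_values: {data_set: {classifier: p-value}} -> {classifier: mean rank}.

    Within each data set the smallest p-value receives rank 1.
    """
    totals = {}
    for per_classifier in p_values.values():
        ordered = sorted(per_classifier, key=per_classifier.get)
        for rank, name in enumerate(ordered, start=1):
            totals[name] = totals.get(name, 0) + rank
    n_data_sets = len(p_values)
    return {name: total / n_data_sets for name, total in totals.items()}


# p-values for three of the classifiers, copied from the table above.
table_subset = {
    "GSE32918": {"SMO": 0.052, "LMT": 0.043, "BayesNet": 0.155},
    "GSE10846 R-CHOP": {"SMO": 8.07e-06, "LMT": 2.35e-05, "BayesNet": 9.18e-06},
    "GSE10846 CHOP": {"SMO": 2.06e-06, "LMT": 3.17e-05, "BayesNet": 7.47e-05},
}
ranks = average_ranks(table_subset)
for name in sorted(ranks, key=ranks.get):
    print(f"{name:<9} {ranks[name]:.2f}")
```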
An implementation of the LPS classifier was generated that can process the Weka ARFF file format
Using the Wright et al data as training set, 12 machine learning tools (BayesNet, BFTree, FT, J48, LMT, NBTree, RandomTree, REPTree, RF100, RF200, SimpleCart and SMO) were trained using the Weka package (Weka version 3.6.5)
The Wright et al. data set consists of expression data for 240 patients annotated as one of three classes: ABC (n = 73; 30.4%), GCB (n = 115; 47.9%) or Type-III (n = 52; 21.7%). The trained Weka classifiers output predictions for each sample analyzed consisting of p-values for each of the 3 classes, the class with the largest p-value giving the predicted class. The LPS classifiers assign everything with a confidence greater than a p-value threshold (0.8/0.9) to ABC/GCB and everything below the threshold to Type-III/Unclassified. The predicted classes were used to rank the classifiers (see Classifier Ranking).
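The two assignment rules described above can be contrasted in a short sketch. The function names, and the treatment of the LPS output as a single ABC probability with the GCB probability as its complement, are assumptions made for illustration rather than details taken from the implementations used in the study.

```python
def predict_class(scores):
    """Weka-style call: the class with the largest score wins, with no threshold."""
    return max(scores, key=scores.get)


def lps_style_call(p_abc, threshold=0.9):
    """Threshold-based call: confident cases become ABC or GCB, the rest Type-III.

    Assumption: p_abc is the probability of ABC and 1 - p_abc that of GCB.
    """
    p_gcb = 1.0 - p_abc
    if p_abc > threshold:
        return "ABC"
    if p_gcb > threshold:
        return "GCB"
    return "Type-III"


print(predict_class({"ABC": 0.2, "GCB": 0.7, "Type-III": 0.1}))
print(lps_style_call(0.85, threshold=0.9))
print(lps_style_call(0.85, threshold=0.8))
```

The same case can move between ABC and Type-III purely as a function of the threshold, which is why the percentage of unclassified cases is tracked alongside survival separation.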
The 6 best individual Weka classifiers (LMT, SMO, BayesNet, J48, RF100, RF200) were combined using the Weka Vote scheme with the average of probabilities combination rule. Each classifier was removed in turn to generate all possible 5-tool classifiers (5N). This process was iterated with the best meta-classifier at each N progressing to the next level (e.g. best 5N → all possible 4N). As with the individual classifiers these were analyzed by comparing the survival of their ABC/GCB assigned cases.
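The average-of-probabilities rule and the first round of tool removal can be sketched as below. This mimics the behaviour described for the Weka Vote scheme but is not Weka code; `vote_average` and the per-tool probabilities are hypothetical.

```python
from itertools import combinations

CLASSES = ("ABC", "GCB", "Type-III")


def vote_average(per_tool_probs):
    """Average class probabilities across tools; return (winning class, averages)."""
    n_tools = len(per_tool_probs)
    averaged = {c: sum(p[c] for p in per_tool_probs) / n_tools for c in CLASSES}
    return max(averaged, key=averaged.get), averaged


# Hypothetical outputs of four tools for a single sample.
tool_outputs = [
    {"ABC": 0.6, "GCB": 0.3, "Type-III": 0.1},
    {"ABC": 0.2, "GCB": 0.5, "Type-III": 0.3},
    {"ABC": 0.7, "GCB": 0.2, "Type-III": 0.1},
    {"ABC": 0.5, "GCB": 0.1, "Type-III": 0.4},
]
label, averaged = vote_average(tool_outputs)
print(label, averaged)

# Removing each tool in turn from the six-tool vote gives the 5-tool candidates.
tools = ["LMT", "SMO", "BayesNet", "J48", "RF100", "RF200"]
five_tool_sets = list(combinations(tools, 5))
print(len(five_tool_sets), "five-tool meta-classifiers to compare")
```

In the toy sample one tool favours GCB, yet the averaged probabilities still select ABC, illustrating how a single dissenting tool is outvoted.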
The data sets GSE10846 (split into CHOP/R-CHOP), GSE4475 and GSE19246 were used to generate additional lists of classifier genes from 3 sources using the published classifications (ABC, GCB and Type-III/unclassified): (1) the 20 Wright genes were ranked using the Weka CfsSubsetEval method (search method: GreedyStepwise). (2) the 185 genes from Dave et al. present on the 3 platforms were ranked using CfsSubsetEval (search method: GreedyStepwise)
Using the classes assigned by the best classifier (LMT_J48_RF100_SMO) a linear model was fitted to the gene expression data using the R Limma package
Enrichment of genes against gene-lists was assessed using a hypergeometric test, where the draw is the significantly differentially expressed genes, the successes are the signature genes and the population is the genes present on the platform. To avoid any bias the genes used for training the machine-learning tool were removed from the signatures before assessment. For each assessment Z-scores were generated by comparing against random distributions.
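The hypergeometric test described above corresponds to an upper-tail probability, which can be sketched in pure Python. The function name `hypergeom_upper_tail` and the toy numbers are assumptions; the parameters follow the mapping given in the text (draw = differentially expressed genes, successes = signature genes, population = genes on the platform).

```python
from math import comb


def hypergeom_upper_tail(overlap, population, successes, draws):
    """P(X >= overlap) when `draws` genes are taken from `population` genes,
    of which `successes` belong to the signature."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Toy example: 10 genes on the platform, 4 in the signature,
# 3 differentially expressed genes, 2 of which are in the signature.
print(hypergeom_upper_tail(overlap=2, population=10, successes=4, draws=3))
```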
The ABC/GCB classification of the 11 DLBCL datasets (GSE10846 split in 2) was assessed by comparing the overlap between the up-regulated genes in ABC/GCB for each dataset against all others. The genes that were significantly up-regulated (adjusted p-value <0.05) in ABC/GCB were defined in each data set, creating a set of gene lists linked to either ABC or GCB class. Enrichment analysis (described above) was then carried out for differentially expressed genes from each dataset individually against the set of gene lists defined for all other data sets for each class (e.g. ABC genes data set 1 vs ABC genes data set 2, ABC genes data set 1 vs ABC genes data set 3 etc.) (Z-scores from random distributions of 107 samples). The Z-scores were then averaged between the two directions of analysis (data set 1 vs data set 2 and data set 2 vs data set 1) and also between ABC/GCB.
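One way to obtain a Z-score from random distributions, and to average it over the two directions of comparison, is sketched below. The text does not specify how the random distributions were generated, so the sampling scheme (random gene lists of matching size drawn from the platform), the function name `overlap_z_score`, and the small number of random draws are all assumptions.

```python
import random
from statistics import mean, pstdev


def overlap_z_score(genes_a, genes_b, platform_genes, n_random=1000, seed=0):
    """Z-score of the overlap between two gene lists against random lists
    of the same size as genes_a drawn from the platform."""
    rng = random.Random(seed)
    set_a, set_b = set(genes_a), set(genes_b)
    observed = len(set_a & set_b)
    population = sorted(platform_genes)
    null = [
        len(set_b & set(rng.sample(population, len(set_a))))
        for _ in range(n_random)
    ]
    spread = pstdev(null)
    if spread == 0:
        raise ValueError("random overlaps show no variation")
    return (observed - mean(null)) / spread


# Hypothetical gene identifiers on a 200-gene platform.
platform = [f"g{i}" for i in range(200)]
abc_set1 = [f"g{i}" for i in range(20)]
abc_set2 = [f"g{i}" for i in range(15)] + [f"g{i}" for i in range(100, 105)]

z_forward = overlap_z_score(abc_set1, abc_set2, platform)
z_reverse = overlap_z_score(abc_set2, abc_set1, platform)
print((z_forward + z_reverse) / 2)  # averaged over both directions
```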
A data set of 12,323 gene signatures was created by merging signatures downloaded from
Signature enrichment analysis was carried out for the significantly differentially expressed genes (adjusted p-value <0.05). The contrast of ABC and GCB classes yields two lists, those up-regulated in ABC or GCB, which were analyzed separately. Enrichment of the signatures was assessed as described above.
Heat-maps were generated using the MEV program from the TM4 package of microarray tools
A graphical user interface (GUI) driven implementation of the DAC was created to simplify the classification process for users. The code behind the GUI was written in Python (
The classifier takes as input a tab separated list of raw gene/probe expression values. The file is quantile normalised (normalizeQuantiles function of R Limma package) and then if there are multiple probes for a gene these are merged by taking their median value. Finally, Z-scores are generated for each gene across the samples
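The three preprocessing steps can be illustrated with a compact pure-Python sketch. It is not the DAC code, which calls the normalizeQuantiles function of the R Limma package; the function names are hypothetical, ties are not given special treatment, and the use of the population standard deviation for Z-scores is an assumption.

```python
from statistics import mean, median, pstdev


def quantile_normalize(samples):
    """samples: list of per-sample value lists of equal length.
    Each value is replaced by the mean of the values sharing its rank."""
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    reference = [mean(s[k] for s in sorted_samples) for k in range(n)]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


def merge_probes_by_median(probe_to_gene, probe_values):
    """Collapse probes mapping to the same gene to their per-sample median."""
    rows_by_gene = {}
    for probe, values in probe_values.items():
        rows_by_gene.setdefault(probe_to_gene[probe], []).append(values)
    # zip(*rows) walks the probes of one gene sample by sample.
    return {g: [median(col) for col in zip(*rows)] for g, rows in rows_by_gene.items()}


def z_scores(values):
    """Z-scores for one gene across samples."""
    centre, spread = mean(values), pstdev(values)
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - centre) / spread for v in values]


print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
print(merge_probes_by_median(
    {"p1": "BCL6", "p2": "BCL6", "p3": "IRF4"},
    {"p1": [1, 4], "p2": [3, 2], "p3": [5, 6]},
))
print(z_scores([1, 2, 3]))
```

After quantile normalization every sample shares the same distribution of values, so the subsequent per-gene Z-scores reflect relative expression across samples rather than array-wide intensity differences.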
The automatic classifier allows a background file (>30 samples of random class generated on the equivalent platform) to be used for classification of individual samples. The program finds the genes shared between the two files, though ideally the two files should be from the exact same platform. The file to be classified (1 or more samples) is split into individual samples and each of these is separately appended to the background file, followed by quantile normalization and generation of Z-scores. Once this process finishes for all individual samples the resulting Z-scores are merged and used to generate an ARFF file, which is then processed as above.
The classifier and manual are available from:
Our goal was to establish an implementation of the cell of origin classifier that was robust against variation in microarray platform and fresh or FFPE sample type. The cell of origin classifier is distinguished by two linked characteristics: (1) the ability to define two primary classes of DLBCL, ABC and GCB, with significant differences in outcome when treated with the combination chemotherapy regimen CHOP (cyclophosphamide, hydroxydaunorubicin, vincristine (Oncovin), and prednisolone), alone or combined with the anti-CD20 monoclonal antibody rituximab (R-CHOP), and (2) the fact that these classes are linked to extended patterns of gene expression reflecting underlying molecular pathogenesis. Our assessment of classifier performance was therefore based first on the ability to define ABC and GCB classes with differences in outcome, using overall survival, and second on the demonstration that the defined classes of DLBCL across multiple data sets showed similar overall patterns of gene expression and appropriate segregation of non-classifier genes and molecular signatures.
In addition to the data set generated on a custom Lymphochip spotted cDNA microarray, on which the original formulation of the cell of origin classification by Wright et al. was based
We used two data sets (GSE32918, and GSE10846 divided into CHOP and R-CHOP treated groups) for classifier development. At each stage classifiers were ranked by the survival separation between the assigned ABC/GCB classes (see Classifier ranking in methods). We proceeded through the following steps (
Colored boxes (gray/green) depict different training data sets. Step 1: assessment of individual machine-learning tools vs LPS; Step 2: assessment of machine-learning tool combinations; Step 3: assessment of classifier gene sets, training on GSE10846_R-CHOP, and testing on previously seen and unseen data sets; Step 4: further assessment on unseen data sets; Step 5: classification of additional data sets, evaluation of differential gene expression in all-by-all comparison, downstream analysis with meta-profiles and enrichment of molecular signatures.
Twenty of the 27 genes, described by Wright et al.
Machine-learning tools can be combined to provide the potential advantage of balanced voting between classifiers generated from individual tools. We again used survival separation between assigned ABC and GCB classes as a metric to test the performance of balanced voting between classifiers. To assess such meta-classifiers we combined the best 6 individual machine-learning tools (
Classifier | Average Rank | HMDS (GSE32918): Hazard Ratio | 95% CI Lower | 95% CI Upper | p-value | GSE10846 R-CHOP: Hazard Ratio | 95% CI Lower | 95% CI Upper | p-value | GSE10846 CHOP: Hazard Ratio | 95% CI Lower | 95% CI Upper | p-value |
LMT_J48_RF100_SMO | 5.0 | 0.56 | 0.33 | 0.97 | 0.037 | 0.26 | 0.14 | 0.47 | 9.88E-06 | 0.37 | 0.24 | 0.57 | 6.38E-06 |
LMT_RF200_RF100_SMO | 5.7 | 0.65 | 0.37 | 1.15 | 0.136 | 0.24 | 0.13 | 0.43 | 3.16E-06 | 0.36 | 0.23 | 0.55 | 2.21E-06 |
RF100_SMO | 5.7 | 0.58 | 0.33 | 1.02 | 0.060 | 0.26 | 0.14 | 0.47 | 8.66E-06 | 0.37 | 0.24 | 0.56 | 3.46E-06 |
LMT_RF100_SMO | 6.7 | 0.65 | 0.37 | 1.13 | 0.126 | 0.25 | 0.14 | 0.45 | 4.88E-06 | 0.36 | 0.23 | 0.55 | 2.54E-06 |
J48_RF100_SMO | 7.3 | 0.56 | 0.32 | 0.96 | 0.036 | 0.27 | 0.15 | 0.48 | 1.21E-05 | 0.40 | 0.26 | 0.61 | 2.21E-05 |
LMT_J48_RF200_SMO | 7.7 | 0.59 | 0.34 | 1.01 | 0.054 | 0.26 | 0.14 | 0.47 | 9.88E-06 | 0.39 | 0.25 | 0.60 | 1.45E-05 |
J48_RF200_RF100_SMO | 9.7 | 0.65 | 0.38 | 1.12 | 0.124 | 0.26 | 0.14 | 0.47 | 1.02E-05 | 0.39 | 0.25 | 0.59 | 1.43E-05 |
LMT_J48_RF200_RF100_SMO | 10.0 | 0.62 | 0.36 | 1.07 | 0.085 | 0.26 | 0.14 | 0.47 | 9.88E-06 | 0.40 | 0.26 | 0.61 | 2.61E-05 |
LMT_J48_SMO | 10.0 | 0.57 | 0.33 | 0.98 | 0.041 | 0.27 | 0.15 | 0.49 | 1.67E-05 | 0.40 | 0.26 | 0.62 | 3.09E-05 |
LMT_RF200_RF100_BN_SMO | 12.0 | 0.64 | 0.37 | 1.11 | 0.112 | 0.29 | 0.16 | 0.53 | 5.43E-05 | 0.38 | 0.25 | 0.58 | 6.56E-06 |
J48_SMO | 12.7 | 0.60 | 0.35 | 1.03 | 0.062 | 0.28 | 0.15 | 0.50 | 1.58E-05 | 0.41 | 0.27 | 0.63 | 4.29E-05 |
LMT_J48_RF200_RF100 | 12.7 | 0.60 | 0.35 | 1.02 | 0.061 | 0.26 | 0.14 | 0.48 | 1.38E-05 | 0.41 | 0.27 | 0.63 | 4.63E-05 |
LMT_J48_RF200_BN_SMO | 13.0 | 0.72 | 0.41 | 1.24 | 0.230 | 0.26 | 0.14 | 0.48 | 1.35E-05 | 0.38 | 0.25 | 0.59 | 1.40E-05 |
LMT_J48_RF100_BN_SMO | 14.0 | 0.72 | 0.41 | 1.24 | 0.230 | 0.26 | 0.14 | 0.48 | 1.35E-05 | 0.39 | 0.26 | 0.60 | 1.94E-05 |
LMT_J48_RF200_RF100_BN_SMO | 15.7 | 0.72 | 0.41 | 1.24 | 0.230 | 0.26 | 0.14 | 0.48 | 1.35E-05 | 0.41 | 0.27 | 0.63 | 3.95E-05 |
LMT_J48_RF200_RF100_BN | 17.7 | 0.70 | 0.41 | 1.22 | 0.211 | 0.27 | 0.15 | 0.49 | 1.95E-05 | 0.41 | 0.26 | 0.63 | 4.54E-05 |
J48_RF100 | 19.0 | 0.68 | 0.40 | 1.17 | 0.162 | 0.28 | 0.15 | 0.50 | 1.58E-05 | 0.46 | 0.30 | 0.70 | 2.95E-04 |
LMT_J48_RF100 | 19.0 | 0.70 | 0.41 | 1.20 | 0.193 | 0.27 | 0.15 | 0.49 | 1.78E-05 | 0.43 | 0.28 | 0.65 | 8.56E-05 |
J48_RF200_RF100_BN_SMO | 19.3 | 0.74 | 0.43 | 1.28 | 0.284 | 0.27 | 0.15 | 0.49 | 1.71E-05 | 0.41 | 0.27 | 0.63 | 5.02E-05 |
Machine-learning tools were combined using balanced voting to generate meta-classifiers. The best 6 individual classifiers were combined and, through iterative cycles of classifier removal, meta-classifiers of 5, 4, 3 and 2 machine-learning tools were tested. Survival separation between assigned ABC and GCB classes for the data sets GSE32918, and GSE10846 divided into CHOP and R-CHOP components, was used for assessment. The classifiers were ordered by their average rank across the data sets, with rank determined by the p-value of the ABC/GCB separation. The Classifier Identity, Hazard Ratio (GCB vs ABC as baseline), 95% confidence interval of the Hazard Ratio, and the resulting p-value for survival separation are shown.
To arrive at a single classifier we considered both the rank of all classifiers assessed across the data sets and the percentage of cases assigned to ABC or GCB class, since improved segregation of outcome between ABC and GCB cases could come at the expense of fewer cases assigned to one or other of these classes, and increased assignment to Type-III/unclassified. Only three classifiers, LMT_J48_RF100_SMO, RF100_SMO, and SMO, were consistently ranked in the top 25% of all classifiers tested in every data set (8 out of 31 or better). Amongst these LMT_J48_RF100_SMO gave the lowest average assignment to the Type-III/unclassified subset (17%), and was therefore selected for further analysis (
A notable feature amongst published applications of the cell of origin classification is the variable number of genes used for classifier implementation (15 to 183)
Classifier | GSE10846 CHOP: Hazard Ratio | 95% CI Lower | 95% CI Upper | p-value | GSE32918: Hazard Ratio | 95% CI Lower | 95% CI Upper | p-value |
GEO Published Class | 0.41 | 0.27 | 0.64 | 6.43E-05 | – | – | – | – |
LMT_J48_RF100_SMO Classified | ||||||||
Train Wright | ||||||||
Wright20 | 0.37 | 0.24 | 0.57 | 6.38E-06 | 0.56 | 0.33 | 0.97 | 0.037 |
Train GSE10846 R-CHOP | ||||||||
Wright20 | 0.42 | 0.27 | 0.64 | 4.98E-05 | 0.86 | 0.51 | 1.46 | 0.573 |
Wright10 | 0.49 | 0.32 | 0.75 | 1.04E-03 | 0.87 | 0.52 | 1.46 | 0.594 |
Wright5 | 0.47 | 0.31 | 0.74 | 9.15E-04 | 0.87 | 0.51 | 1.47 | 0.598 |
Dave185 | 0.49 | 0.33 | 0.75 | 7.96E-04 | 0.78 | 0.46 | 1.34 | 0.367 |
Dave100 | 0.46 | 0.30 | 0.70 | 3.69E-04 | 0.97 | 0.56 | 1.68 | 0.919 |
Dave50 | 0.49 | 0.32 | 0.73 | 5.67E-04 | 0.92 | 0.53 | 1.58 | 0.761 |
Dave20 | 0.50 | 0.33 | 0.76 | 1.09E-03 | 1.17 | 0.68 | 1.99 | 0.572 |
Dave10 | 0.46 | 0.31 | 0.71 | 3.29E-04 | 0.93 | 0.55 | 1.58 | 0.785 |
All185 | 0.49 | 0.33 | 0.73 | 5.67E-04 | 0.92 | 0.54 | 1.55 | 0.748 |
All100 | 0.45 | 0.29 | 0.70 | 3.05E-04 | 0.90 | 0.52 | 1.56 | 0.706 |
All50 | 0.45 | 0.29 | 0.69 | 2.22E-04 | 1.03 | 0.59 | 1.78 | 0.918 |
All20 | 0.45 | 0.29 | 0.68 | 1.87E-04 | 0.86 | 0.50 | 1.49 | 0.602 |
All10 | 0.44 | 0.29 | 0.67 | 1.42E-04 | 0.88 | 0.52 | 1.48 | 0.619 |
LPS Classified | ||||||||
Train Wright | ||||||||
Wright20 | 0.42 | 0.27 | 0.65 | 1.19E-04 | 0.64 | 0.34 | 1.20 | 0.162 |
Train GSE10846 R-CHOP | ||||||||
Wright20 | 0.43 | 0.28 | 0.66 | 1.00E-04 | 0.79 | 0.46 | 1.36 | 0.392 |
Wright10 | 0.40 | 0.26 | 0.62 | 4.37E-05 | 0.94 | 0.53 | 1.67 | 0.826 |
Wright5 | 0.45 | 0.28 | 0.70 | 4.60E-04 | 0.89 | 0.51 | 1.53 | 0.663 |
Dave185 | 0.52 | 0.34 | 0.80 | 2.59E-03 | 0.81 | 0.46 | 1.43 | 0.475 |
Dave100 | 0.51 | 0.33 | 0.77 | 1.47E-03 | 0.81 | 0.46 | 1.42 | 0.457 |
Dave50 | 0.51 | 0.33 | 0.78 | 1.74E-03 | 0.84 | 0.48 | 1.45 | 0.526 |
Dave20 | 0.45 | 0.29 | 0.70 | 3.90E-04 | 1.05 | 0.59 | 1.86 | 0.869 |
Dave10 | 0.46 | 0.30 | 0.71 | 5.08E-04 | 1.00 | 0.57 | 1.77 | 0.994 |
All185 | 0.44 | 0.28 | 0.67 | 1.29E-04 | 0.83 | 0.49 | 1.42 | 0.496 |
All100 | 0.47 | 0.31 | 0.71 | 3.44E-04 | 0.87 | 0.51 | 1.50 | 0.622 |
All50 | 0.46 | 0.31 | 0.70 | 3.01E-04 | 0.90 | 0.52 | 1.53 | 0.687 |
All20 | 0.45 | 0.29 | 0.68 | 2.12E-04 | 0.85 | 0.50 | 1.45 | 0.544 |
All10 | 0.49 | 0.33 | 0.74 | 7.50E-04 | 0.78 | 0.44 | 1.36 | 0.375 |
The results obtained with classifiers trained on the Wright et al. data using 20 classifier genes were compared against those obtained with classifiers trained on the GSE10846 R-CHOP component using either the same 20 classifier genes, or a range of different classifier gene selections. Shown are the results for classifying the GSE10846 CHOP data (left) and GSE32918 (right). In each table the survival separation observed with the published GEO classes (top) was compared to the meta-classifier (middle) and the LPS (bottom). The Classifier identity, Hazard Ratio (GCB vs ABC as baseline), 95% confidence interval of the Hazard Ratio, and the resulting p-value for survival separation are shown. In the meta-classifier and LPS portions of the tables the results are shown for training on the Wright et al. data set (20 classifier genes) followed by the results for classifiers trained on the GSE10846 R-CHOP data set with different sets of classifier genes (Wright20-Wright5, Dave185-Dave10, All185-All10).
For this analysis we used the GSE10846 data set, which was generated on the Affymetrix Human Genome U133 Plus 2.0 and contained a more comprehensive representation of genes than the Wright et al. data generated on the Lymphochip platform. We trained on the R-CHOP treated component and tested on the CHOP treated component. Increasing the number of classifier genes did not improve outcome separation for either LMT_J48_RF100_SMO or LPS classifiers (
We next tested the performance of the LMT_J48_RF100_SMO meta-classifier on two additional data sets, GSE4475 and Monti et al., for which outcome data was available
A. GSE4475
Classifier | Hazard Ratio | 95% CI Lower | 95% CI Upper | p-value |
GEO Published Classes | 0.3622 | 0.1660 | 0.7907 | 0.011 |
LMT_J48_RF100_SMO Classified | ||||
Train Wright | ||||
Wright16 | 0.38 | 0.20 | 0.73 | 0.004 |
Train GSE10846 R-CHOP | ||||
Wright16 | 0.35 | 0.17 | 0.70 | 0.003 |
Dave185 | 0.39 | 0.20 | 0.75 | 0.005 |
Dave100 | 0.42 | 0.22 | 0.81 | 0.010 |
Dave50 | 0.44 | 0.23 | 0.83 | 0.011 |
Dave20 | 0.46 | 0.24 | 0.87 | 0.017 |
Dave10 | 0.37 | 0.19 | 0.73 | 0.004 |
All185 (137 Actual) | 0.45 | 0.23 | 0.89 | 0.022 |
All100 (69 Actual) | 0.43 | 0.22 | 0.87 | 0.019 |
All50 (35 Actual) | 0.37 | 0.18 | 0.76 | 0.007 |
All20 (15 Actual) | 0.45 | 0.23 | 0.90 | 0.024 |
All10 (7 Actual) | 0.38 | 0.20 | 0.73 | 0.004 |
LPS Classified | ||||
Train Wright | ||||
Wright16 | 0.36 | 0.17 | 0.76 | 0.007 |
Train GSE10846 R-CHOP | ||||
Wright16 | 0.35 | 0.17 | 0.73 | 0.005 |
Dave185 | 0.46 | 0.23 | 0.93 | 0.031 |
Dave100 | 0.46 | 0.23 | 0.93 | 0.031 |
Dave50 | 0.46 | 0.22 | 0.93 | 0.030 |
Dave20 | 0.44 | 0.22 | 0.89 | 0.023 |
Dave10 | 0.35 | 0.17 | 0.72 | 0.004 |
All185 (137 Actual) | 0.44 | 0.21 | 0.89 | 0.023 |
All100 (69 Actual) | 0.43 | 0.22 | 0.85 | 0.016 |
All50 (35 Actual) | 0.36 | 0.18 | 0.73 | 0.005 |
All20 (15 Actual) | 0.46 | 0.24 | 0.89 | 0.021 |
All10 (7 Actual) | 0.31 | 0.15 | 0.62 | 0.001 |
B. Monti et al.
Classifier | Hazard Ratio | 95% CI Lower | 95% CI Upper | p-value |
Monti Class | 0.34 | 0.18 | 0.65 | 1.09E-03 |
LMT_J48_RF100_SMO | 0.47 | 0.26 | 0.86 | 1.51E-02 |
LPS 0.9 MaxAvgMerge | 0.49 | 0.26 | 0.93 | 2.95E-02 |
The effect of training data set and classifier gene selection was assessed on a previously unseen data set GSE4475 (
In contrast to GSE4475, the Monti et al. data set proved difficult to separate into classes with significant differences in survival when using probe selection and normalization parameters effective on other data sets. However using the same parameters for probe selection and normalization as reported by Monti et al.
We conclude that a cell of origin classifier using balanced voting of LMT_J48_RF100_SMO machine-learning tools, using 20 classifier genes is applicable to data sets generated on multiple platform types, separating DLBCL into ABC and GCB classes with significant survival differences. Training on the Wright et al. data set provides better performance on the FFPE data generated on the Illumina platform, and performs well on data sets derived from fresh material generated on Affymetrix platforms. We therefore selected the combination of balanced voting of LMT_J48_RF100_SMO machine-learning tools, 20 classifier genes and training on the Wright et al. data for downstream analysis.
The comparison of survival separation for ABC and GCB classes by different classifiers did not assess the effects of classifier choice on a case-by-case basis. To assess this we examined the classification choice for each case, for every classifier tested in all 3 data sets used for selecting classifiers. The resulting maps of classification choice illustrate an important point (
The classes assigned by 31 tested classifiers for the GSE10846 CHOP data set are shown along with published classes in GEO and those assigned by the LPS classifier (GCB = blue, Type-III = green, and ABC = yellow). Samples are vertically ordered by class assigned by the meta-classifier LMT_J48_RF100_SMO (later referred to as “DAC”); this meta-classifier assigns confidence scores for each class, and the class with highest confidence is selected for each sample. Within each class samples are ranked by classification confidence. At either extreme, samples are ordered from high to low confidence GCB, and from low to high confidence ABC. In the Type-III category high confidence cases are shown centrally, flanked by lower confidence Type-III cases. On either side the latter are ordered by GCB or ABC signal (identified by GCB or ABC being the second highest classification confidence). The first column (labelled with black bar and red 5) identifies the classes assigned by LMT_J48_RF100_SMO, followed by results obtained for 30 other machine-learning classifiers, with the classes assigned for each case in the appropriate color. Classifiers are ranked (number above each column) from left to right according to the significance of survival separation between assigned ABC and GCB classes; note that LMT_J48_RF100_SMO was selected as the reference based on overall performance across multiple data sets, and in this data set is ranked 5th (shown in red) for survival separation. On the far right, the published class assignments linked in GEO to the data set (GEO class, orange bar) and the classes assigned by the LPS classifier using either a 0.8 or 0.9 p-value threshold (dark gray bars) are shown.
The ABC and GCB classes assigned by balanced voting between LMT_J48_RF100_SMO were characterized by more significant survival separation than published classification choices for most data sets. Assessment of individual case-by-case class assignments showed that this was not attributable to an increase in the percentage of Type-III class. Across both components of the GSE10846 data set there was little difference in the number of Type-III cases assigned: R-CHOP, 15% LMT_J48_RF100_SMO vs 14% GEO-published; CHOP, 15% LMT_J48_RF100_SMO vs 17% GEO-published. The improvement in outcome separation was thus due to the selection of cases for inclusion in the ABC or GCB groups. In this regard, LMT_J48_RF100_SMO assigned more cases to the GCB category (52% vs 46% and 51% vs 42% for R-CHOP and CHOP subsets) and fewer cases to the ABC category (33% vs 40% and 34% vs 41% for R-CHOP and CHOP subsets). Since the assignments made by LMT_J48_RF100_SMO were associated with more significant separation in survival between ABC and GCB classes, and this is the primary clinical characteristic of the cell of origin classification
The concept of a “molecular gray zone” was inherent in the original formulation of the cell of origin classifier where cases that fall below a defined confidence threshold for either ABC or GCB class were assigned to the Type-III/unclassified category
The data set is indicated on the left above the relevant heat-map. The LMT_J48_RF100_SMO (later referred to as “DAC”) assigned class is shown by the “Class” bar at the top of each heat-map; blue = GCB, green = Type-III and yellow = ABC. The classification confidence is shown in the “Classifier Confidence” bar under the “Class” bar (red to blue = high to low-confidence). The classifier genes are ordered vertically for each heat-map as shown in the expanded box, and the class-association of the genes is indicated by the vertical colored bar (yellow = ABC and blue = GCB). The expression values for each sample are shown as Z-scores using a blue to red color scale for low to high expression (−2 to +2).
The balanced voting classifier provides a confidence for assignment to ABC, GCB or Type-III/unclassified and cases are assigned to the class with highest confidence, without a hard threshold. To assess whether classifier confidence assigned by the balanced voting classifier had clinical significance we examined the outcome of ABC and GCB cases by classifier confidence across all 4 data sets using hard confidence thresholds in 0.1 steps incrementing from 0.5. The results demonstrated that classifier confidence was generally associated with increasing differences in survival separation, although not in a linear fashion, since in some instances the Hazard Ratio of GCB vs ABC class was lowest for a 0.7 rather than 0.8 threshold. This was potentially attributable to the small number of highest confidence cases in any individual data set. Overall, high confidence ABC-DLBCL had a poor outcome (range of Hazard Ratios for 5 data sets 0.23 to 0.36
Routine implementation of the cell of origin classification in a clinical setting requires individual cases to be assigned to ABC, GCB or Type-III/unclassified class as they occur, rather than classification of a large collection of cases as embodied in the data sets used in this study. We therefore developed a downloadable application, featuring a simple graphical user interface, which employs the LMT_J48_RF100_SMO balanced voting approach to classify either a large collection (as in this study) or individual samples, given a background data-set from the same platform. We refer to this application as the “Diffuse Large B-cell Lymphoma Automatic Classifier” (DAC), and use this designation for the remainder of this manuscript. This application offers other groups the opportunity to directly compare their own classifications to those generated by DAC in this study. The classifier, user guide, and example data can be downloaded from:
To verify that DAC represented a fully transferable implementation we evaluated additional DLBCL data sets imported from GEO (
We identified differentially expressed genes between ABC and GCB class defined by DAC for each data set. We refer to these differentially expressed genes as “class-associated”. We performed a pairwise comparison of the resulting lists of class-associated genes for all data set combinations, using a hypergeometric test (
We next proceeded to a more refined comparative analysis of gene expression, assessing consistent associations of individual genes with the primary cell of origin classes, ABC and GCB DLBCL. We compared the lists of class-associated genes for all 11 data sets, maintaining the sub-division of GSE10846 by CHOP/R-CHOP treatment, ranking genes first by the number of data sets in which the genes were class-associated and second by the median normalized fold change (
Up-regulated in ABC | | | | Up-regulated in GCB | | | |
Gene | Classifier Gene | Median Normalised FC | Number Of Files | Gene | Classifier Gene | Median Normalised FC | Number Of Files |
ZBTB32 | - | 0.69 | 11 | MME | Yes | 0.67 | 11 |
KCNA3 | - | 0.59 | 11 | LMO2 | Yes | 0.66 | 11 |
CYB5R2 | - | 0.58 | 11 | SPINK2 | - | 0.62 | 11 |
CCND2 | Yes | 0.56 | 11 | STAG3 | - | 0.56 | 11 |
IRF4 | Yes | 0.54 | 11 | LRMP | Yes | 0.41 | 11 |
PHF16 | - | 0.52 | 11 | ASB13 | - | 0.41 | 11 |
FAM46C | - | 0.52 | 11 | AUTS2 | - | 0.40 | 11 |
BATF | - | 0.52 | 11 | MAPK10 | - | 0.40 | 11 |
PIM2 | - | 0.46 | 11 | BCL6 | Yes | 0.40 | 11 |
TNFRSF13B | - | 0.44 | 11 | SLC12A8 | - | 0.37 | 11 |
FUT8 | Yes | 0.41 | 11 | PLEKHF2 | - | 0.36 | 11 |
SH3BP5 | Yes | 0.40 | 11 | SSBP2 | - | 0.35 | 11 |
ADTRP | - | 0.39 | 11 | DENND3 | Yes | 0.31 | 11 |
ENTPD1 | Yes | 0.37 | 11 | FADS3 | - | 0.30 | 11 |
TCF4 | - | 0.34 | 11 | ITPKB | Yes | 0.28 | 11 |
ARID3A | - | 0.33 | 11 | PTK2 | - | 0.24 | 11 |
HSP90B1 | - | 0.31 | 11 | HIP1R | - | 0.23 | 11 |
PIM1 | Yes | 0.31 | 11 | STS | - | 0.20 | 11 |
BCL2L10 | - | 0.30 | 11 | VGLL4 | - | 0.16 | 11 |
BLNK | Yes | 0.30 | 11 | SULT1A1 | - | 0.15 | 11 |
CREB3L2 | - | 0.28 | 11 | MYBL1 | - | 0.65 | 10 |
MAN1A1 | - | 0.26 | 11 | TTC9 | - | 0.34 | 10 |
CFLAR | - | 0.20 | 11 | ZPBP2 | - | 0.46 | 9 |
CLINT1 | - | 0.13 | 11 | FNDC1 | - | 0.43 | 9 |
BSPRY | - | 0.45 | 10 | SNX22 | - | 0.32 | 9 |
ARID3B | - | 0.28 | 10 | EEPD1 | - | 0.29 | 9 |
ATP13A3 | - | 0.19 | 10 | ANKRD13A | - | 0.24 | 9 |
C1ORF186 | - | 0.54 | 9 | SERPINA9 | Yes | 0.93 | 8 |
TOX2 | - | 0.53 | 9 | LINC00487 | - | 0.72 | 8 |
CLECL1 | - | 0.49 | 9 | LOC285286 | - | 0.43 | 7 |
LRRC33 | - | 0.42 | 9 | SNX29P1#SNX29P2 | - | 0.87 | 6 |
FOXP1 | Yes | 0.38 | 9 | LOC440864 | - | 0.55 | 6 |
ZNF385C | - | 0.34 | 9 | C12ORF77 | - | 0.36 | 6 |
CCDC50 | - | 0.24 | 9 | | | | |
PARP15 | - | 0.53 | 8 | | | | |
MPEG1 | - | 0.43 | 8 | | | | |
TBC1D27 | Yes | 0.42 | 8 | | | | |
FAM108C1 | - | 0.41 | 8 | | | | |
ISY1#RAB43 | - | 0.17 | 6 | | | | |
Genes shown are differentially expressed and up-regulated in the indicated class in all data sets (shown for ≥6) that have a corresponding probe (ABC (
Excluded Wright classifier genes (
The meta-profiles are informed by consistency of expression across multiple different data sets. This provides an important variable for assessing the likely significance of a gene in the lymphoma class. Notably, of the 414 genes in the ABC-DLBCL meta-profile, only 24 were class-associated in 11/11 data sets. Microarray platforms differ in the selection of probes and hence in the ability to assess individual genes. When differential representation of genes was taken into account, 42 genes were ABC class-associated in all data sets in which probes for the gene were present on the platform used (100% class-association). Similarly, for GCB-DLBCL, 20 genes were class-associated in 11/11 data sets, with 35 being 100% class-associated when accounting for gene representation on the platforms used.
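The ranking and consistency criteria described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to build the meta-profiles: the input layout (a dictionary per gene, with `None` marking a data set where a probe exists but the gene was not class-associated), the function name `rank_meta_profile` and the toy values are all hypothetical.

```python
from statistics import median


def rank_meta_profile(calls, min_datasets=6):
    """Rank genes by number of supporting data sets, then by median fold change.

    calls: {gene: {dataset: normalized_fold_change or None}}
      - a dataset key is present only if the platform carries a probe for the gene
      - None means a probe exists but the gene was not class-associated there
    """
    rows = []
    for gene, per_dataset in calls.items():
        fold_changes = [fc for fc in per_dataset.values() if fc is not None]
        if len(fold_changes) < min_datasets:
            continue  # below the data set cut-off
        rows.append({
            "gene": gene,
            "n_datasets": len(fold_changes),
            "median_fc": median(fold_changes),
            # class-associated wherever a probe was available
            "fully_consistent": len(fold_changes) == len(per_dataset),
        })
    # Primary key: data set count (descending); secondary: median FC (descending).
    rows.sort(key=lambda r: (-r["n_datasets"], -r["median_fc"]))
    return rows


# Toy input with hypothetical genes and values (not data from this study).
toy_calls = {
    "geneA": {"set1": 0.5, "set2": 0.6, "set3": 0.7},
    "geneB": {"set1": 0.9, "set2": 0.8, "set3": None},
    "geneC": {"set1": 0.9, "set2": 1.0, "set3": 1.1},
    "geneD": {"set1": 0.4, "set2": 0.3},  # no probe on the set3 platform
    "geneE": {"set1": 0.2, "set2": None, "set3": None},
}
ranked = rank_meta_profile(toy_calls, min_datasets=2)
for row in ranked:
    print(row)
```

In this toy input `geneD` is counted as fully consistent despite appearing in only two data sets, because the third platform carries no probe for it; this mirrors the distinction drawn above between 11/11 data sets and 100% class-association.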
A notable feature was that the top three genes (
We reasoned that if the data sets were appropriately classified, then the meta-profiles should overlap significantly with gene signatures defined in previous work linked to cell of origin class or molecular pathogenesis (i.e. contain more signature genes than expected by chance for a list of equivalent size). Furthermore, if this condition were met, and since the meta-profiles were uniquely informed by consistency of gene expression across multiple data sets, analysis of gene signature databases would additionally form a basis for identifying novel associations relevant to disease biology. The databases MSigDB, GeneSigDB, as well as the SignatureDB of the Lymphoma Leukemia Molecular Profiling Project, provide an extensive compendium of gene signatures/sets related to particular cell states, pathways, or gene features
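To make the notion of "more signature genes than expected by chance" concrete, the sketch below computes an expected overlap, its standard deviation and a Z-score from the hypergeometric distribution. It is an illustration only: the text does not state how the chance expectation was derived, and the background size of 24,730 genes is an assumption chosen here so that the expected overlap is of the same order as the reported values; it is not a figure taken from this study. The function name `enrichment_z` is likewise hypothetical.

```python
from math import sqrt


def enrichment_z(overlap, signature_size, profile_size, n_background):
    """Z-score of an observed overlap against a hypergeometric null.

    Null model (assumption of this sketch): signature_size genes are drawn
    without replacement from n_background genes, profile_size of which
    belong to the meta-profile.
    """
    p = profile_size / n_background
    mean = signature_size * p
    variance = (
        signature_size * p * (1 - p)
        * (n_background - signature_size) / (n_background - 1)
    )
    sd = sqrt(variance)
    return mean, sd, (overlap - mean) / sd


# Overlap, signature size and meta-profile size as reported for the
# ABCgtGCB_U133AB signature and the 414-gene ABC meta-profile;
# n_background is an ASSUMED value, not taken from the source.
mean, sd, z = enrichment_z(
    overlap=167, signature_size=270, profile_size=414, n_background=24730
)
print(f"expected={mean:.2f} sd={sd:.2f} z={z:.2f}")
```

A Z-score summarises the size of the excess; the significance itself would still be assessed with the hypergeometric test and multiple-testing correction.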
Enriched | ||||||||||
Gene Signature | Overlapping | GeneSigSize | randomAvg | randomSD | %Overlap | Zscore | FDR | Source | ||
ABCgtGCB_U133AB | 167 | 270 | 4.52 | 2.10 | 61.85 | 77.45 | 5.61E-235 ** | SignatureDB | ||
ABC_gt_GCB_PMBL_MCL_BL_U133AB | 37 | 46 | 0.77 | 0.87 | 80.43 | 41.65 | 2.68E-55 ** | SignatureDB | ||
NFkB_Up_all_OCILy3_Ly10 | 13 | 64 | 1.07 | 1.02 | 20.31 | 11.65 | 9.07E-09 ** | SignatureDB | ||
MYD88_Ngo_etal | 21 | 266 | 4.45 | 2.08 | 7.89 | 7.96 | 6.83E-07 ** | SignatureDB | ||
NFkB_Up_HBL1 | 18 | 211 | 3.53 | 1.85 | 8.53 | 7.80 | 2.28E-06 ** | SignatureDB | ||
V$NFKB_Q6_01 | 18 | 231 | 3.86 | 1.94 | 7.79 | 7.29 | 8.17E-06 ** | MSigDB_C3 | ||
NFkB_Up_bothOCILy3andLy10 | 8 | 37 | 0.62 | 0.78 | 21.62 | 9.48 | 1.33E-05 ** | SignatureDB | ||
chr3q29 | 7 | 54 | 0.91 | 0.94 | 12.96 | 6.47 | 1.22E-03 ** | MSigDB_C1 | ||
chr18p11 | 8 | 75 | 1.25 | 1.11 | 10.67 | 6.09 | 1.38E-03 ** | MSigDB_C1 | ||
chr3q13 | 7 | 86 | 1.44 | 1.19 | 8.14 | 4.69 | 0.01 * | MSigDB_C1 | ||
chr3q21 | 7 | 99 | 1.66 | 1.27 | 7.07 | 4.19 | 0.02 * | MSigDB_C1 | ||
chr3p21 | 11 | 237 | 3.96 | 1.96 | 4.64 | 3.58 | 0.03 * | MSigDB_C1 | ||
chr18q21 | 6 | 82 | 1.37 | 1.16 | 7.32 | 3.99 | 0.04 * | MSigDB_C1 | ||
KEGG_FOCAL_ADHESION | 7 | 198 | 3.31 | 1.80 | 3.54 | 2.06 | 0.27 | MSigDB_C2 | ||
TGGAAA_V$NFAT_Q4_01 | 40 | 1883 | 31.48 | 5.35 | 2.12 | 1.59 | 0.33 | MSigDB_C3 | ||
chr12p13 | 6 | 201 | 3.36 | 1.81 | 2.99 | 1.46 | 0.45 | MSigDB_C1 | ||
CROONQUIST_STROMAL_STIMULATION_UP | 2 | 60 | 1.00 | 0.99 | 3.33 | 1.00 | 0.64 | MSigDB_C2 | ||
SUNG_METASTASIS_STROMA_UP | 3 | 110 | 1.84 | 1.34 | 2.73 | 0.87 | 0.66 | MSigDB_C2 | ||
KEGG_ECM_RECEPTOR_INTERACTION | 2 | 84 | 1.40 | 1.17 | 2.38 | 0.51 | 0.75 | MSigDB_C2 | ||
Depleted | ||||||||||
GCB_gt_ABC_U133plus | 0 | 297 | 4.97 | 2.20 | 0.00 | −2.26 | 0.07 | SignatureDB | ||
Stromal-1_DLBCL_survival_predictor | 0 | 260 | 4.35 | 2.06 | 0.00 | −2.12 | 0.12 | SignatureDB | ||
GC_B_cell_U133Plus | 3 | 324 | 5.42 | 2.29 | 0.93 | −1.06 | 0.59 | SignatureDB | ||
chr17q24 | 0 | 42 | 0.70 | 0.83 | 0.00 | −0.85 | 0.80 | MSigDB_C1 | ||
chr2q23 | 0 | 22 | 0.37 | 0.60 | 0.00 | −0.61 | 0.86 | MSigDB_C1 |
Gene signature enrichments in the meta-profiles were assessed with a hypergeometric test. Shown are selected signatures discussed in the text, including those related to the reciprocal class; a comprehensive list is provided in
Enriched | ||||||||||
Gene Signature | Overlapping | GeneSigSize | randomAvg | randomSD | %Overlap | Zscore | FDR | Source | ||
GCB_gt_ABC_U133plus | 162 | 297 | 4.98 | 2.20 | 54.55 | 71.46 | 4.59E-214 ** | SignatureDB | ||
GC_B_cell_U133Plus | 82 | 324 | 5.43 | 2.30 | 25.31 | 33.33 | 3.05E-70 ** | SignatureDB | ||
Stromal-1_DLBCL_survival_predictor | 61 | 260 | 4.36 | 2.06 | 23.46 | 27.52 | 3.92E-49 ** | SignatureDB | ||
TGGAAA_V$NFAT_Q4_01 | 68 | 1883 | 31.57 | 5.36 | 3.61 | 6.80 | 1.15E-07 ** | MSigDB_C3 | ||
KEGG_FOCAL_ADHESION | 18 | 198 | 3.32 | 1.80 | 9.09 | 8.16 | 4.99E-07 ** | MSigDB_C2 | ||
CROONQUIST_STROMAL_STIMULATION_UP | 10 | 60 | 1.01 | 0.99 | 16.67 | 9.05 | 3.19E-06 ** | MSigDB_C2 | ||
KEGG_ECM_RECEPTOR_INTERACTION | 11 | 84 | 1.41 | 1.18 | 13.10 | 8.16 | 7.88E-06 ** | MSigDB_C2 | ||
SUNG_METASTASIS_STROMA_UP | 12 | 110 | 1.84 | 1.34 | 10.91 | 7.57 | 1.51E-05 ** | MSigDB_C2 | ||
chr12p | 2 | 6 | 0.10 | 0.31 | 33.33 | 6.03 | 0.04 * | MSigDB_C1 | ||
chr17q24 | 4 | 42 | 0.70 | 0.83 | 9.52 | 3.96 | 0.04 * | MSigDB_C1 | ||
chr2q23 | 3 | 22 | 0.37 | 0.60 | 13.64 | 4.37 | 0.05 * | MSigDB_C1 | ||
MYD88_Ngo_etal | 10 | 266 | 4.46 | 2.08 | 3.76 | 2.66 | 0.09 | SignatureDB | ||
chr3q13 | 4 | 86 | 1.44 | 1.19 | 4.65 | 2.15 | 0.24 | MSigDB_C1 | ||
chr3q21 | 4 | 99 | 1.66 | 1.27 | 4.04 | 1.84 | 0.31 | MSigDB_C1 | ||
V$NFKB_Q6_01 | 7 | 231 | 3.87 | 1.94 | 3.03 | 1.61 | 0.34 | MSigDB_C3 | ||
NFkB_Up_HBL1 | 5 | 211 | 3.54 | 1.86 | 2.37 | 0.79 | 0.59 | SignatureDB | ||
chr3q29 | 1 | 54 | 0.91 | 0.94 | 1.85 | 0.10 | 0.81 | MSigDB_C1 | ||
NFkB_Up_all_OCILy3_Ly10 | 1 | 64 | 1.08 | 1.03 | 1.56 | −0.07 | 0.84 | SignatureDB | ||
Depleted | ||||||||||
ABCgtGCB_U133AB | 0 | 270 | 4.53 | 2.10 | 0.00 | −2.16 | 0.07 | SignatureDB | ||
chr3p21 | 2 | 237 | 3.98 | 1.97 | 0.84 | −1.00 | 0.55 | MSigDB_C1 | ||
chr18p11 | 0 | 75 | 1.26 | 1.11 | 0.00 | −1.13 | 0.59 | MSigDB_C1 | ||
ABC_gt_GCB_PMBL_MCL_BL_U133AB | 0 | 46 | 0.77 | 0.87 | 0.00 | −0.89 | 0.73 | SignatureDB | ||
NFkB_Up_bothOCILy3andLy10 | 0 | 37 | 0.62 | 0.78 | 0.00 | −0.79 | 0.78 | SignatureDB | ||
chr18q21 | 1 | 82 | 1.38 | 1.16 | 1.22 | −0.32 | 0.81 | MSigDB_C1 |
Gene signature enrichments in the meta-profiles were assessed with a hypergeometric test. Shown are selected signatures discussed in the text, including those related to the reciprocal class; a comprehensive list is provided in
We also considered signature enrichment for differentially expressed genes for each class in each data set independently. This approach assesses the consistency of signature enrichment across data sets (
Several oncogenic pathways have been established for ABC-DLBCL
Other signatures of NFkB activity have been defined in distinct cellular contexts. Our analysis allowed a ranking of the relative enrichment of these signatures across both the meta-profiles and individual data sets. Of 46 signatures related to NFkB, 8 were enriched in the ABC but not the GCB meta-profile, and of these 3 (“NFkB_Up_all_OCILy3_Ly10”, “NFkB_Up_HBL1”, “NFkB_Up_bothOCILy3andLy10”) were enriched in 11/11 individual data sets (
For GCB-DLBCL a distinctive feature was association with stromal signatures. The most significant enrichment was observed for the signature Stromal-1_DLBCL_survival_predictor (ranked 4th, 23.5% overlap, FDR corrected p-value = 3.92E-49) (
Recurrent cytogenetic abnormalities characterise DLBCL and show association with cell of origin class and outcome
Human chromosomal cytobands are depicted using gray scales, with chromosomes displayed vertically in numerical order. Enrichment or depletion of cytobands was assessed using a hypergeometric test and the MSigDB C1 component against the meta-profiles and across differentially expressed genes for each data set individually. Observed enrichments and depletions are shown as the average Z-scores with red to blue color scale (average Z-score = +6 to −2) as indicated in the insert. These are derived from analyses of individual data sets (
Despite the established nature of the cell of origin classification, a comparative analysis of gene expression across multiple data sets classified with the same algorithm has not been published. To allow such an analysis we have developed a robust classifier based on 20 genes, described in the original Wright classifier implementation
An issue of primary importance in the evaluation of a test is the performance against a “gold standard” and the choice of metric that is used to assess performance. The most significant clinical feature of the cell of origin classification is the ability to separate DLBCL into two major subgroups, ABC and GCB, with different survival
An important observation emerging from this study is the existence of an extended molecular gray zone, representing cases whose classification is sensitive to the type of classifier implementation used. A substantial group of cases in each data set was equivalently classified by most classifier implementations and thus had a consistent class. In contrast, the differences in outcome separation observed were attributable to cases that had more marginal expression of classifier genes and moved between classes in a fashion dependent on classifier implementation. While a “molecular gray zone” was inherent in the cell of origin classification, and in its original form encompassed the Type-III or unclassifiable cases
The development of meta-profiles representing the most consistent differentially expressed genes between ABC- and GCB-DLBCL is significant since these gene lists are uniquely informed by the consistency of differential gene expression between multiple data sets. Indeed limited numbers of genes were detected as ABC- or GCB-associated in all data sets, and these genes are likely to be enriched for core regulators. Amongst the transcription factors most consistently linked to the ABC-subset is
Another feature of the ABC-DLBCL meta-profile was the fact that three genes,
Enrichments of gene signatures derived from previous studies
In conclusion, the robust classifier algorithm, DAC, provides a tool with which to consistently classify DLBCL cases regardless of microarray platform type. It has potential applications in the research and clinical setting, since it is designed, and is currently being used, to allow real-time assessment of individual incident cases. At present, real-time classification of DLBCL cases into molecular classes does not affect primary clinical management decisions, but in future this may change. The analysis we have performed highlights the effect that classifier choice can have on class assignment, and argues for rigorous evaluation of the classifier algorithm in such settings. The development of this classifier has allowed the generation of a useful resource in which the consistency of class-associated gene expression provides a method for identifying associations of relevance to disease biology, and in particular highlights transcription factors operating in ABC-DLBCL.
This work was supported by a Cancer Research UK Senior Clinical Fellowship (RMT) grant ref C7845/A10066. We thank Ming Du and Gina Doody for critical reading of the manuscript.