Using Rule-Based Machine Learning for Candidate Disease Gene Prioritization and Sample Classification of Cancer Gene Expression Data

Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL’s classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.


Datasets and normalization Diffuse large B-cell lymphoma (DLBCL)
The lymphoma dataset [1] contains expression values for 7,129 genetic probes and 77 microarray samples, 58 of which were obtained from patients suffering from diffuse large B-cell lymphoma (D), while the remaining samples are derived from a related B-cell lymphoma type, termed follicular lymphoma (F). The experiments in this microarray study have been carried out on an Affymetrix HU6800 oligonucleotide platform.
To pre-process the raw data, the Variance stabilizing normalization method [2] was applied to filter out intensity-dependent variance. This was done using the vsn library and the expresso package in the R statistical learning environment [3]. Moreover, a thresholding was applied based on the suggestions in the supplementary material of the original publication associated with the dataset [1], and a fold-change filter used to remove features with low variance (all genetic probe vectors with less than a 3-fold change between the maximum and minimum expression value were discarded), resulting in 2647 remaining genetic probes (the pre-processed data is available online: http://icos.cs.nott.ac.uk/datasets/microarray.html).

Prostate cancer
The prostate cancer dataset [4] consists of expression measurements for 12,600 genetic probes across 50 normal tissues and 52 prostate cancer tissues. All experiments have been carried out on Affymetrix Hum95Av2 arrays. Due to the large number of samples, the fast GeneChip RMA (GCRMA) normalization algorithm was applied [5], a method that combines stochastic and sequence-based physical models to estimate the mRNA abundances. Moreover, thresholding was employed based on the suggestions of the original publication associated with the dataset [4] and a fold change filter to remove all probes with less than a 2-fold change between the maximum and minimum expression value, providing 2135 remaining genetic probes (the pre-processed data is available online: http://icos.cs.nott.ac.uk/datasets/microarray.html). Moreover, to investigate the robustness of the prediction and feature selection results across different filtering settings (see sections 2 and 3 in the Suppl. Mat.), two additional versions of the prostate cancer data were obtained by applying a 3-fold change filter (providing 340 remaining genetic probes) and a 1.5-fold change filter (providing 7355 remaining genetic probes).

Breast cancer
The breast cancer dataset from the collaborating Queen's Medical Centre [6,7,8,9] provides gene expression values for 128 primary breast tumors across 47,293 genetic probes. Two groups of tumor samples can be distinguished in the data, the luminal group (L, 84 samples), which is characterised by oestrogen receptor expression, and the non-luminal group (N, 44 samples, no oestrogen receptor expression). The expression profiling procedure has previously been described in detail [6,7,8], and has also been applied in a recent ensemble gene selection analysis of this dataset [9]. Since the probe level data was obtained from a Sentrix Human-6 BeadChip platform (v1.0, Illumina, San Diego, CA), the dedicated Bioconductor "beadarray" package was used for normalization and probe replicate summarization (the pre-processed data is available online: http://icos.cs.nott.ac.uk/datasets/microarray.html).

Cross-validation results for different dataset pre-processing variants
To analyze the robustness of BioHEL's sample classification performance across different settings for the pre-filtering of datasets and the maximum number of selected attributes, the whole experimental protocol as outlined in the main manuscript was applied again on six different pre-processed versions of the largest dataset (obtained from the prostate cancer study by Singh et al. [4]. For this purpose, first, two additional pre-filtered versions of the prostate cancer data were generated by removing all genetic probe vectors with less than a 3-fold change (resulting in 340 remaining genetic probes), and respectively a 1.5-fold change (resulting in 7355 remaining genetic probes), between the maximum and minimum expression value (see classification results for these datasets in Table 1 and 2). Moreover, the original pre-processed version of the prostate cancer dataset with a 2-fold filtering was used in combination with four different maximum numbers of selected features (5, 15, 50 and 100; see Tables 3, 4, 5 and 6). For all these input settings, the predictive performance was evaluated for each combination between the prediction methods (BioHEL, GAssist, RF, SVM, PAM) and the feature selection methods (CFS, RFS, PLSS) using both external 10-fold cross-validation (CV) and Leave-one-out CV (LOOCV), as described in the main manuscript.
Overall, BioHEL attained average cross-validated classification accuracies between 84% and 95% for all of the above settings, with robust results across all data pre-processing methods and similar performances in relation to the other benchmark approaches as compared to the default settings used in the main manuscript. The best average accuracies obtained with BioHEL for each combination of a dataset pre-processing with a cross-validation procedure (highlighted in bold face in the tables) were always above 90% and similar to the average accuracies obtained with the best alternative benchmark approach (in six cases, the same maximum performance was reached, in one case BioHEL was the only method reaching the highest average accuracy, and in the remaining cases BioHEL's highest average accuracy was within 3% of the best overall accuracy).
In summary, these results confirm that BioHEL's performance is robust and competitive in comparison to other widely used microarray classification approaches across different pre-filtering settings with and without feature selection. 10-fold cross-validation results obtained with BioHEL, SVM, RF and PAM and three feature selection methods (CFS, PLSS, RFS) on the prostate cancer dataset using two pre-processing variants (2-fold and 1.5-fold filtering); AVG = average accuracy, STDDEV = standard deviation; the highest accuracies achieved with BioHEL and the best alternative method are both shown in bold type for each dataset. Leave-one-out cross-validation results obtained with BioHEL, SVM, RF and PAM and three feature selection methods (CFS, PLSS, RFS) on the prostate cancer dataset using two pre-processing variants (2-fold and 1.5-fold filtering); AVG = average accuracy, STDDEV = standard deviation; the highest accuracies achieved with BioHEL and the best alternative method are both shown in bold type for each dataset.   Leave-one-out cross-validation results obtained with BioHEL, SVM, RF and PAM and three feature selection methods (CFS, PLSS, RFS) on the prostate cancer dataset using alternative settings for the maximum number of selected features (5 and 15, see next page for 50 and 100 maximum features); AVG = average accuracy, STDDEV = standard deviation; the highest accuracies achieved with BioHEL and the best alternative method are both shown in bold type for each dataset. Leave-one-out cross-validation results obtained with BioHEL, SVM, RF and PAM and three feature selection methods (CFS, PLSS, RFS) on the prostate cancer dataset using alternative settings for the maximum number of selected features (50 and 100, see previous page for 5 and 15 maximum features); AVG = average accuracy, STDDEV = standard deviation; the highest accuracies achieved with BioHEL and the best alternative method are both shown in bold type for each dataset.
In addition to the comparison of the classification results across different feature selection, prediction, crossvalidation (CV) and pre-processing methods, we also investigated the stability of the feature selection results across different CV-cycles, using BioHEL both with and without external attribute selection on the smallest and largest pre-filtered version of the prostate cancer data by Singh et al. [4]). For this purpose, for each 10-fold CV-cycle the 10 top-ranked selected genetic probes were determined, and the number of times they were chosen across at least 8 of 10 CV-cycles was recorded (these numbers are reported as robustness scores in Tables 7 and 8). The highest scores (5 and 6) were obtained when using BioHEL without external feature selection method. However, in combination with some of the external selection methods (e.g. PLSS) the same or similar robustness scores were reached. RFS tended to provide slightly lower scores than the other approaches, including a score of 0 in one case (for the 1.5-fold filtering of the prostate cancer dataset, providing the largest number of 7355 remaining genetic probes), which might result from the stochastic nature of the random forests approach.
When comparing the attribute identifiers for the 10 top-ranked features obtained from BioHEL using no external selection with those obtained directly from PLSS, RFS and CFS (prior to the classification), several features that were shared across different CV-cycles for a single selection method were also shared across these different algorithms (see Tables 9, 10 and 11 for the results on the 1.5-fold, 2-fold and 3-fold pre-filtered version of the prostate cancer dataset; attributes which appear at least twice across these tables are highlighted by matching colors). On all pre-processed versions of the prostate cancer dataset BioHEL shares at least its first 4 top-ranked features with at least one of the external feature selection methods. Given that only the 10 highest-ranked features are considered for each method in this comparison and the initial dataset contained between 7355 and 340 attributes (for 1.5-fold and 3-fold filtering, respectively), these results reveal a significant concordance between the top-ranked features for different methods. Moreover, since most of these high-ranked attributes across different methods were also selected by the univariate PLSS approach, the corresponding features are univariately significant, with four notable exceptions: The genetic probes 37068 at (phospholipase A2, group VII), 38291 at (proenkephalin), 914 g at (transcriptional regulator ERG) and 38634 at (cellular retinol binding protein 1) appear among the top 10 attributes for multiple multivariate selection methods but never among the top-ranked features for PLSS. This might indicate that these attributes are selected by the multivariate approaches due to their significance in the presence of other features (since the outcome variable is binary, a meaningful comparison of the correlation between these features and the outcome was not possible).
A more specific analysis of the individual shared and non-shared genetic probes across the different selection methods was not within the scope of this study; however, in the main manuscript a detailed discussion of the top-ranked attributes selected by the more robust ensemble feature selection method (Ensemble FS) and BioHEL-based feature ranking (BioHEL FR) is provided in the literature mining section. Feature selection robustness scores across 10 cross-validation cycles on the prostate cancer dataset, corresponding to the number of features among the 10 top-ranked attributes that were selected across at least 8 out of 10 cycles. The results are shown both for the smallest and largest version of the prostate cancer dataset (3-fold and 1.5-fold pre-filtering) using the same settings as for the corresponding classification results reported in Table 1 (for the other pre-processing variants of this dataset, see Table 8). Feature selection robustness scores across 10 cross-validation cycles on the prostate cancer dataset, corresponding to the number of features among the 10 top-ranked attributes that were selected across at least 8 out of 10 cycles. The results are shown for the 2-fold pre-processing of the prostate cancer dataset using different maximum numbers of selected features in the external attribute selection (5, 15, 50 and 100 -see also Tables 3 and 4 for the corresponding classification results). Table 9: Comparison of the 10 top-ranked features for different feature selection methods and BioHEL without external selection on the prostate cancer dataset (1.5-fold pre-processing)