The authors have declared that no competing interests exist.
Conceived and designed the experiments: HHM RV CR PM. Performed the experiments: HHM RV CR IT. Analyzed the data: HHM RV CR IT. Contributed reagents/materials/analysis tools: HHM RV CR RB IT. Wrote the paper: HHM RV CR IT PM. Critical review of content: RB. First draft of manuscript: HHM.
The prediction of breast cancer intrinsic subtypes has been introduced as a valuable strategy to determine patient diagnosis, prognosis and therapy response. The PAM50 method, based on the expression levels of 50 genes, uses a single sample predictor model to assign subtype labels to samples. Intrinsic errors reported within this assay demonstrate the challenge of identifying and understanding the breast cancer groups. In this study, we aim to: a) identify novel biomarkers for subtype individuation by exploring the competence of a newly proposed method named the CM1 score, and b) apply ensemble learning, as opposed to a single classifier, for sample subtype assignment. The overarching objective is to improve class prediction.
The microarray transcriptome data sets used in this study are the METABRIC breast cancer data, recorded for over 2000 patients, and the publicly available integrated data from the ROCK database, with 1570 samples. We first computed the CM1 score to identify the probes with highly discriminative patterns of expression across samples of each intrinsic subtype. We then assessed the ability of 42 selected probes to assign correct subtype labels using 24 different classifiers from the Weka software suite. For comparison, the same method was applied to the list of 50 genes from the PAM50 method.
The CM1 score identified 30 novel biomarkers for predicting breast cancer subtypes and confirmed the role of 12 well-established genes. Intrinsic subtypes assigned using the CM1 list and the ensemble of classifiers are more consistent and homogeneous than the original PAM50 labels. The new subtypes show accurate distributions of the current clinical markers ER, PR and HER2, and of survival curves, in the METABRIC and ROCK data sets. Remarkably, the paradoxical attribution of the original labels reinforces the limitations of employing a single sample classifier to predict breast cancer intrinsic subtypes.
Breast cancer has been perceived as several distinct diseases characterised by intrinsic aberrations, heterogeneous behaviour and divergent clinical outcome [
The transcriptomic patterns observed across subtypes have given us insight into the molecular complexity and inherent alterations in tumour cells that shape breast cancer heterogeneity and unpredictable outcome [
Although independent cohorts attempted to identify molecular subtypes, the chosen microarray-based Single Sample Predictor (SSP) model revealed unreliable assignments and modest agreement between studies [
In this report, we focus on the use of a ranking feature method based on the newly proposed CM1 score [
The METABRIC microarray data set used in this study is hosted by the European Bioinformatics Institute (EBI) and deposited in the European Genome-Phenome Archive (EGA) at
The second data set is publicly available in ROCK online portal [
In brief, both METABRIC and ROCK data sets have information on patients’ long-term clinical and pathological outcomes, including the sample assignment into intrinsic subtypes (luminal A, luminal B, HER2-enriched, normal-like, and basal-like) according to the PAM50 method [
In this study, we propose a systematic approach that aims at improving breast cancer subtype prediction. The systematic approach is built based on feature selection and data mining concepts. We first compute the CM1 score—using the microarray mRNA expression values—to rank the whole set of probes based on their discriminative power across breast cancer subtypes. We then select the top 10 probes that best represent each intrinsic subtype. The quality of this selection is assessed using a set of classifiers from the Weka software suite with the METABRIC and ROCK data sets, followed by the statistical analysis. The process flow is depicted in
The image shows the method steps based on
The CM1 score is a supervised univariate method used to measure the difference in expression levels of samples in two different classes [
To define the most discriminative probes for each breast cancer subtype (luminal A, luminal B, HER2-enriched, normal-like and basal-like), we computed the CM1 score for each of 48803 probes taking the subtype of interest and the remaining ones. This results in 5 lists of 48803 CM1 scores.
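The one-versus-rest ranking described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the scoring function `mean_difference` is a stand-in (a plain difference of class means) and is not the CM1 definition, and the names `rank_probes`, `expression` and `labels` are hypothetical. Any two-class score, including CM1, could be passed in through the `score` parameter.

```python
from statistics import mean


def mean_difference(in_class, rest):
    """Stand-in score (NOT the CM1 definition): difference of class means."""
    return mean(in_class) - mean(rest)


def rank_probes(expression, labels, subtype, score=mean_difference, top=5):
    """Score each probe for `subtype` versus all remaining samples.

    expression: dict mapping probe ID -> list of expression values per sample
    labels:     list of subtype labels, aligned with the sample order
    Returns the `top` most positive probes, the `top` most negative probes,
    and the full dict of scores.
    """
    scores = {}
    for probe, values in expression.items():
        in_class = [v for v, lab in zip(values, labels) if lab == subtype]
        rest = [v for v, lab in zip(values, labels) if lab != subtype]
        scores[probe] = score(in_class, rest)

    ordered = sorted(scores, key=scores.get, reverse=True)
    # Keep only strictly positive / strictly negative scores so the two
    # lists never overlap, even when there are very few probes.
    positive = [p for p in ordered[:top] if scores[p] > 0]
    negative = [p for p in reversed(ordered[-top:]) if scores[p] < 0]
    return positive, negative, scores


# Toy data (invented for illustration): three probes, five samples.
labels = ["LumA", "LumA", "Basal", "Basal", "Her2"]
expression = {
    "probe_1": [5, 6, 1, 1, 2],
    "probe_2": [1, 1, 7, 8, 2],
    "probe_3": [3, 3, 3, 3, 3],
}
positive, negative, scores = rank_probes(expression, labels, "Basal", top=1)
```

Running the function once per subtype gives one ranked list per subtype, mirroring the five lists of scores mentioned above.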
Considering the fact that Parker et al. (2009) [
The quality of the CM1 list for distinguishing subtypes was assessed using a list of well-known classifiers available in the Weka data mining software suite [
A similar approach was performed with the PAM50 list to serve as baseline for comparing the results obtained with the 42 probes from the CM1 list. The 50 genes identified by Parker et al. (2009) [
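Combining the predictions of several classifiers into one label per sample can be sketched as a majority vote. This is a minimal sketch under stated assumptions: the exact consensus rule is not spelled out in this section, so a strict majority (more than half of the votes) is assumed, and the sentinel `NO_CONSENSUS` and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical sentinel for samples without a strict majority.
NO_CONSENSUS = "I"


def majority_label(votes, no_consensus=NO_CONSENSUS):
    """Return the label chosen by more than half of the classifiers."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else no_consensus


def ensemble_labels(predictions):
    """Combine per-classifier predictions into one label per sample.

    predictions: list of label lists, one list per classifier,
                 all aligned on the same sample order.
    """
    return [majority_label(sample_votes) for sample_votes in zip(*predictions)]


# Toy example: three classifiers, three samples.
predictions = [
    ["LA", "LB", "H"],
    ["LA", "LB", "B"],
    ["LA", "N", "N"],
]
consensus = ensemble_labels(predictions)
```

With 24 classifiers, as used in this study, the same function applies unchanged; only the length of `predictions` differs.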
Given a
The average sensitivity (AS) [
The consensus of the different classification methods concerning the samples’ labels was measured by the popular interrater reliability metric Fleiss’ kappa [
Assuming a
Kappa values range from
The agreement between pairs of sample labellings was also quantified using this metric. It ranges from 0 to 1, where 1 indicates an almost perfect concordance between the two compared bipartitions, and 0 a complete discordance between them. The
The survival analysis for each breast cancer subtype is performed using Cox proportional hazards model from the package
To understand the results described in this section, we introduce the sequence of our approach which combines the
The CM1 score was applied to rank the set of 48803 probes for each of the five subtypes in the METABRIC discovery data set (Supporting Information
Probe ID | Gene name | Gene symbol and aliases | [Refs.] |
---|---|---|---|
ILMN_1684217 | Aurora kinase B | [ |
|
ILMN_1683450 | Cell division cycle associated 5 | [ |
|
ILMN_1747016 | Centrosomal protein 55kDa | [ |
|
ILMN_2212909 | Maternal embryonic leucine zipper kinase | [ |
|
ILMN_1714730 | Ubiquitin-conjugating enzyme E2C | [ |
|
ILMN_1796059 | Ankyrin repeat domain 30A | [ |
|
ILMN_1651329 | Long intergenic non-protein coding RNA 993 | ||
ILMN_2310814 | Microtubule-associated protein tau | [ |
|
ILMN_1728787 | Anterior gradient 3 | [ |
|
ILMN_1688071 | N-acetyltransferase 1 | [ |
|
ILMN_1729216 | Crystallin, alpha B | [ |
|
ILMN_1666845 | Keratin 17 | [ |
|
ILMN_1786720 | Prominin 1 | [ |
|
ILMN_1753101 | V-set domain containing T cell activation inhibitor 1 | [ |
|
ILMN_1798108 | Chromosome 6 open reading frame 211 | ||
ILMN_1747911 | Cyclin-dependent kinase 1 | [ |
|
ILMN_1666305 | Cyclin-dependent kinase inhibitor 3 | [ |
|
ILMN_1678535 | Estrogen receptor 1 | [ |
|
ILMN_2149164 | Secreted frizzled-related protein 1 | [ |
|
ILMN_1788874 | Serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 3 | [ |
|
ILMN_1785570 | Sushi domain containing 3 | [ |
|
ILMN_1803236 | Chloride channel accessory 2 | [ |
|
ILMN_2161820 | Glycine-N-acyltransferase-like 2 | [ |
|
ILMN_1810978 | Mucin-like 1 | [ |
|
ILMN_1773459 | SRY (sex determining region Y)-box 11 | [ |
|
ILMN_1674533 | Transient receptor potential cation channel, subfamily V, member 6 | [ |
|
ILMN_1687235 ILMN_2358760 | Hepsin | [ |
|
ILMN_1655915 | Matrix metallopeptidase 11 (stromelysin 3) | [ |
|
ILMN_1711470 | Ubiquitin-conjugating enzyme E2T (putative) | [ |
|
ILMN_1740609 | Chemokine (C-C motif) ligand 15 | [ |
|
ILMN_1789507 | Collagen, type XI, alpha 1 | [ |
|
ILMN_1651282 | Collagen, type XVII, alpha 1 | [ |
|
ILMN_1723684 | Duffy blood group, atypical chemokine receptor | [ |
|
ILMN_1809099 | Interleukin 33 | [ |
|
ILMN_1766650 | Forkhead box A1 | [ |
|
ILMN_1811387 | Trefoil factor 3 (intestinal) | [ |
|
ILMN_1738401 | Forkhead box C1 | [ |
|
ILMN_1689146 | Gamma-aminobutyric acid (GABA) A receptor, pi | [ |
|
ILMN_1807423 | Insulin-like growth factor 2 mRNA binding protein 3 | [ |
|
ILMN_1692938 | Phosphoserine aminotransferase 1 | [ |
|
ILMN_1668766 | Rhophilin associated tail protein 1 | [ |
Probe ID | Luminal A score | Luminal A rank | Luminal B score | Luminal B rank | Her2 score | Her2 rank | Normal score | Normal rank | Basal score | Basal rank | Symbol | PAM50 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ILMN_1728787 | 0.203 | 5 | 0.144 | 5 | -0.314 | 2 | | 54 | -0.461 | 3 | AGR3 | |
ILMN_1796059 | 0.216 | 3 | | 8730 | | 1434 | | 3666 | -0.390 | 5 | ANKRD30A | |
ILMN_1684217 | -0.203 | 1 | | 74 | | 497 | | 146 | | 97 | AURKB | |
ILMN_1798108 | | 1980 | 0.155 | 2 | | 68 | | 405 | | 179 | C6orf211 | |
ILMN_1740609 | | 476 | | 43 | | 970 | 0.252 | 3 | | 2776 | CCL15 | |
ILMN_1747911 | | 80 | 0.144 | 4 | | 2080 | | 194 | | 1496 | CDC2 | |
ILMN_1683450 | -0.196 | 3 | | 30 | | 306 | | 79 | | 166 | CDCA5 | |
ILMN_1666305 | | 16 | 0.146 | 3 | | 438 | | 167 | | 917 | CDKN3 | |
ILMN_1747016 | -0.195 | 5 | | 88 | | 362 | | 73 | | 127 | CEP55 | x |
ILMN_1803236 | | 1875 | | 354 | 0.316 | 3 | | 688 | | 13483 | CLCA2 | |
ILMN_1789507 | | 12176 | | 5363 | | 1820 | -0.155 | 3 | | 9245 | COL11A1 | |
ILMN_1651282 | | 915 | | 16 | | 4821 | 0.244 | 4 | | 12205 | COL17A1 | |
ILMN_1729216 | | 6657 | -0.153 | 5 | | 3008 | | 52 | | 45 | CRYAB | |
ILMN_1723684 | | 456 | | 14 | | 2830 | 0.255 | 2 | | 4215 | DARC | |
ILMN_1678535 | | 8 | 0.181 | 1 | -0.360 | 1 | | 7 | -0.440 | 4 | ESR1 | x |
ILMN_1766650 | | 70 | | 85 | | 12522 | | 216 | -0.478 | 2 | FOXA1 | x |
ILMN_1738401 | | 1047 | | 10 | | 2254 | | 226 | 0.443 | 1 | FOXC1 | x |
ILMN_1689146 | | 1177 | | 13 | | 1833 | | 283 | 0.414 | 2 | GABRP | |
ILMN_2161820 | | 310 | | 270 | 0.333 | 1 | | 791 | | 1479 | GLYATL2 | |
ILMN_1687235 | | 79 | | 1942 | | 58 | -0.157 | 2 | | 211 | HPN | |
ILMN_2358760 | | 105 | | 1941 | | 73 | -0.152 | 4 | | 284 | HPN | |
ILMN_1807423 | | 1269 | | 2087 | | 21820 | | 11567 | 0.405 | 3 | IGF2BP3 | |
ILMN_1809099 | | 3400 | | 141 | | 6282 | 0.275 | 1 | | 23413 | IL33 | |
ILMN_1666845 | | 8365 | -0.186 | 2 | | 3879 | | 35 | | 29 | KRT17 | x |
ILMN_1651329 | 0.221 | 1 | | 2481 | | 1149 | | 1159 | | 20 | LOC646360 | |
ILMN_2310814 | 0.221 | 2 | | 8776 | | 33 | | 1131 | | 23 | MAPT | x |
ILMN_2212909 | -0.196 | 4 | | 137 | | 501 | | 92 | | 65 | MELK | x |
ILMN_1655915 | | 5274 | | 3486 | | 3832 | -0.166 | 1 | | 4148 | MMP11 | x |
ILMN_1810978 | | 20520 | | 9 | 0.326 | 2 | | 6 | | 1495 | MUCL1 | |
ILMN_1688071 | 0.215 | 4 | | 902 | -0.256 | 5 | | 24 | | 19 | NAT1 | x |
ILMN_1786720 | | 988 | -0.174 | 3 | | 273 | | 465 | | 20 | PROM1 | |
ILMN_1692938 | | 68 | | 343 | | 93 | | 1864 | 0.391 | 5 | PSAT1 | |
ILMN_1668766 | | 721 | | 62 | | 1415 | | 368 | 0.405 | 4 | ROPN1 | |
ILMN_1788874 | | 148 | | 4633 | -0.259 | 4 | | 1961 | | 1462 | SERPINA3 | |
ILMN_2149164 | | 11497 | -0.203 | 1 | | 1697 | 0.244 | 5 | | 40 | SFRP1 | x |
ILMN_1773459 | | 185 | | 621 | 0.293 | 5 | | 10046 | | 483 | SOX11 | |
ILMN_1785570 | | 11 | | 2499 | -0.308 | 3 | | 438 | | 82 | SUSD3 | |
ILMN_1811387 | | 26 | | 64 | | 1263 | | 661 | -0.521 | 1 | TFF3 | |
ILMN_1674533 | | 643 | | 605 | 0.300 | 4 | | 2756 | | 1819 | TRPV6 | |
ILMN_1714730 | -0.200 | 2 | | 9 | | 318 | | 43 | | 353 | UBE2C | x |
ILMN_1711470 | | 56 | | 7 | | 1732 | -0.145 | 5 | | 1113 | UBE2T | x |
ILMN_1753101 | | 474 | -0.153 | 4 | | 2424 | | 3373 | | 1522 | VTCN1 | |
The CM1 scores for the topmost 5 positive and negative probe IDs in each subtype are given. The ranks correspond to the position of the probe from the topmost positive or negative (with 1 being the top ranked score at either side). The rightmost two columns indicate the gene symbol the probe maps to, and which genes appear also in the PAM50 list.
The effectiveness of the CM1 list for segregating the five subtypes is depicted in
The annotated genes are defined for each subtype as an intrinsic, highly discriminative, signature. Samples were ordered according to the gene expression similarities in each breast cancer subtype. Colours represent the selected genes and sample subtypes: luminal A (yellow), luminal B (green), HER2-enriched (purple), normal-like (blue), and basal-like (red).
The heat map shows 42 probes (rows) and 997 samples (columns) from the discovery set ordered according to gene expression similarity, based on a memetic algorithm [
A detailed description of our 42 probes in the context of the literature can be found in Supporting Information
The box plots show the values of 997 samples in the METABRIC discovery set, 989 in the validation set, and 1570 in the ROCK test set.
After applying the ensemble learning, several statistical measures were computed as referred in
We determined the performance of the ensemble learning (Supporting Information
Dataset | CM1 list: CV | CM1 list: AS | PAM50 list: CV | PAM50 list: AS |
---|---|---|---|---|
METABRIC discovery | 0.731 ± 0.057 | 0.763 ± 0.060 | 0.752 ± 0.064 | 0.781 ± 0.070 |
METABRIC validation | 0.632 ± 0.036 | 0.641 ± 0.039 | 0.643 ± 0.041 | 0.650 ± 0.047 |
ROCK test set | 0.571 ± 0.060 | 0.673 ± 0.077 | 0.578 ± 0.054 | 0.687 ± 0.081 |
Values are given as
 | METABRIC discovery | | | | | | METABRIC validation | | | | | | ROCK test set | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
 | LA | LB | H | N | B | I | LA | LB | H | N | B | I | LA | LB | H | N | B | I |
LA | 435 | 19 | 2 | 2 | 0 | 8 | 252 | 2 | 0 | 0 | 0 | 1 | 452 | 122 | 2 | 0 | 0 | 17 |
LB | 24 | 234 | 0 | 0 | 0 | 10 | 62 | 156 | 0 | 0 | 0 | 6 | 18 | 371 | 42 | 0 | 2 | 14 |
H | 4 | 4 | 67 | 0 | 2 | 10 | 23 | 45 | 71 | 2 | 2 | 10 | 0 | 1 | 13 | 0 | 0 | 0 |
N | 13 | 0 | 8 | 31 | 0 | 6 | 80 | 0 | 0 | 59 | 0 | 5 | 115 | 8 | 36 | 74 | 56 | 50 |
B | 0 | 0 | 10 | 2 | 103 | 3 | 6 | 7 | 22 | 19 | 142 | 17 | 0 | 0 | 0 | 7 | 166 | 4 |
Rows contain labels assigned by the majority of classifiers trained with the CM1 list, while columns contain the original METABRIC labels assigned using the PAM50 method. In this table,
 | METABRIC discovery | | | | | | METABRIC validation | | | | | | ROCK test set | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
 | LA | LB | H | N | B | I | LA | LB | H | N | B | I | LA | LB | H | N | B | I |
LA | 440 | 17 | 1 | 1 | 0 | 7 | 254 | 0 | 0 | 0 | 0 | 1 | 530 | 46 | 2 | 0 | 0 | 15 |
LB | 25 | 239 | 0 | 0 | 0 | 4 | 56 | 162 | 0 | 0 | 0 | 6 | 53 | 327 | 34 | 0 | 3 | 30 |
H | 0 | 5 | 72 | 0 | 1 | 9 | 21 | 39 | 80 | 0 | 0 | 13 | 0 | 0 | 12 | 0 | 0 | 2 |
N | 9 | 0 | 2 | 34 | 1 | 12 | 82 | 0 | 0 | 55 | 0 | 7 | 105 | 4 | 18 | 92 | 67 | 53 |
B | 0 | 0 | 7 | 1 | 103 | 7 | 4 | 7 | 20 | 14 | 145 | 23 | 0 | 0 | 3 | 0 | 172 | 2 |
Rows contain labels assigned by the majority of classifiers trained with the PAM50 list, while columns contain the original METABRIC labels assigned using the PAM50 method. In this table,
 | METABRIC discovery | | | | | | METABRIC validation | | | | | | ROCK test set | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
 | LA | LB | H | N | B | I | LA | LB | H | N | B | I | LA | LB | H | N | B | I |
LA | 450 | 15 | 0 | 4 | 0 | 7 | 390 | 14 | 1 | 4 | 0 | 14 | 550 | 8 | 0 | 10 | 0 | 17 |
LB | 20 | 235 | 0 | 0 | 0 | 2 | 12 | 185 | 8 | 0 | 0 | 5 | 112 | 361 | 0 | 0 | 0 | 29 |
H | 0 | 0 | 75 | 2 | 1 | 9 | 0 | 1 | 83 | 0 | 1 | 8 | 0 | 4 | 67 | 0 | 8 | 21 |
N | 0 | 0 | 0 | 28 | 0 | 7 | 6 | 0 | 0 | 61 | 1 | 12 | 0 | 0 | 0 | 67 | 0 | 7 |
B | 0 | 0 | 2 | 0 | 101 | 2 | 0 | 0 | 1 | 0 | 140 | 3 | 0 | 0 | 0 | 2 | 219 | 3 |
I | 4 | 11 | 5 | 2 | 3 | 12 | 9 | 8 | 7 | 4 | 3 | 8 | 26 | 4 | 2 | 13 | 15 | 25 |
Rows contain the labels assigned by the majority of classifiers trained with the CM1 list, while columns contain labels assigned by the majority of classifiers trained with PAM50 list. In this table,
The Average Sensitivity statistic was used to characterize the average proportion of accurately labelled samples in each subtype. Considering the analysis with CM1 list, the measure was 0.76±0.06 in the METABRIC discovery set and 0.64±0.04 in the validation set; and with PAM50 list was 0.78±0.07 and 0.65±0.05, respectively. Likewise, the average sensitivity calculated for the ROCK test set was 0.67±0.07 using the CM1 and 0.69±0.08 with PAM50 list. A complete table containing the performance of all individual classification methods is available in the (Supporting Information
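As described above, the average sensitivity is the per-subtype proportion of correctly labelled samples, averaged over subtypes. A minimal sketch follows; the function name and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def average_sensitivity(true_labels, predicted_labels):
    """Mean over classes of (correctly labelled samples / samples in class)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, predicted in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == predicted:
            hits[truth] += 1
    per_class = {label: hits[label] / totals[label] for label in totals}
    return sum(per_class.values()) / len(per_class), per_class


# Toy example: four luminal A samples and two basal-like samples.
truth = ["LA", "LA", "LA", "LA", "B", "B"]
predicted = ["LA", "LA", "LA", "LB", "B", "N"]
avg, per_class = average_sensitivity(truth, predicted)
```

Because every subtype contributes equally to the mean, a small subtype that is poorly predicted lowers the statistic as much as a large one would, which is why it complements plain accuracy on unbalanced subtype distributions.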
Fleiss’ kappa was computed to assess the reliability of agreement between two raters, as displayed in
METABRIC | ROCK | |||
---|---|---|---|---|
discovery | validation | test set | ||
CM1 | 0.73 | 0.753 | 0.626 | |
PAM50 | 0.724 | 0.729 | 0.59 | |
CM1 | 0.814 | 0.596 | 0.591 | |
PAM50 | 0.84 | 0.618 | 0.641 | |
0.859 | 0.832 | 0.804 |
Rows entitled
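For readers who want to reproduce an agreement value of this kind, the following is a minimal sketch of Fleiss' kappa using the standard textbook definition, which may differ in detail from the implementation used in this study. It assumes every sample is rated by the same number of raters (here, classifiers); the function name and the input layout are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of samples, each a list of rater labels.

    Assumes at least two raters and the same number of raters per sample.
    Returns NaN when chance agreement is 1 (every rating is identical).
    """
    n_samples = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(sample) != n_raters for sample in ratings):
        raise ValueError("need the same number (>= 2) of raters per sample")

    category_totals = Counter()
    agreement_sum = 0.0
    for sample in ratings:
        counts = Counter(sample)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this sample.
        squared = sum(c * c for c in counts.values())
        agreement_sum += (squared - n_raters) / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_samples
    total_ratings = n_samples * n_raters
    expected = sum((t / total_ratings) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Toy example: three raters label two samples identically.
perfect = fleiss_kappa([["LA", "LA", "LA"], ["B", "B", "B"]])
# Toy example: two raters disagree on every sample.
opposed = fleiss_kappa([["LA", "B"], ["LA", "B"]])
```

The input is one list of labels per sample, so the output of an ensemble of classifiers can be passed in directly after transposing it to a per-sample layout.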
Considering the agreement of the ensemble of classifiers, there was a
The agreement between the different sample labellings was also scrutinized using the Adjusted Rand Index measure (
METABRIC | ROCK | ||
---|---|---|---|
discovery | validation | test set | |
0.757 | 0.426 | 0.453 | |
0.792 | 0.457 | 0.507 | |
0.822 | 0.788 | 0.642 |
This contains the agreement between the original and predicted labels of samples in the discovery and validation sets.
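The Adjusted Rand Index between two labellings of the same samples can be computed from their contingency counts. The sketch below uses the standard pair-counting formulation and is an illustration only, not the code used in this study; the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labellings of the same samples."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two labellings of the same >= 2 samples")
    n = len(labels_a)

    # Pairs of samples grouped together in both, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labellings put all samples in one group.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy example: the same grouping under different label names.
original = ["LA", "LA", "LB", "LB", "B"]
relabelled = ["x", "x", "y", "y", "z"]
ari_same = adjusted_rand_index(original, relabelled)
```

The index depends only on which samples are grouped together, not on the label names, so it can compare the original and ensemble labellings directly.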
The number of samples in each original PAM50 subtype is markedly different across the METABRIC sets (
The bars represent the number of samples in each breast cancer subtype. In the first row, the labels refer to the original assignment using the PAM50 method. The following rows show the new labels attributed using an ensemble of 24 classifiers with PAM50 and CM1 lists, respectively. Samples were classified as
We summarize the similarities and differences in subtypes distribution (graphically displayed in
The image shows the similarity between the subtype distributions for the METABRIC discovery (MD) and validation (MV) sets, and the ROCK test set (RS). The labels were assigned in the original data sets using the PAM50 method, and relabelled in this study by ensemble learning using the PAM50 and CM1 lists. The similarity is measured using the square root of the Jensen-Shannon divergence. Darker shades represent more similar distributions, while lighter shades refer to divergent patterns. The diagonal shows the darkest colour, as each data set is closest to itself. According to this image, labels assigned by ensemble learning with the CM1 and PAM50 lists are highly similar, and both exhibit lower levels of agreement with the original labels assigned using a single classifier (PAM), or PAM50 method.
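The distance used in this figure, the square root of the Jensen-Shannon divergence between two subtype distributions, can be sketched as follows. The base-2 logarithm (which bounds the distance by 1) is an assumption, as the base is not stated here, and the function names are hypothetical. The two count vectors are the row totals of the METABRIC discovery and validation blocks of the contingency tables in this section (997 and 989 samples).

```python
from math import log2, sqrt


def normalise(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(counts_a, counts_b):
    """Square root of the Jensen-Shannon divergence between two count vectors."""
    p, q = normalise(counts_a), normalise(counts_b)
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
    return sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


# Subtype counts in table row order, for the two METABRIC sets.
discovery = [466, 268, 87, 58, 118]
validation = [255, 224, 153, 144, 213]
distance = jensen_shannon_distance(discovery, validation)
```

The distance is symmetric and equals 0 for identical distributions, which matches the darkest diagonal described in the caption.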
Given the heterogeneity among breast cancer patients and the intricate assignment of PAM50 labels in the original METABRIC data set, we further investigated whether significant differences exist in the analysis of current clinical markers (ER, PR and HER2). Figs
(A) Discovery and (B) Validation. The bars represent the number of samples with ER positive and negative in the five intrinsic subtypes, based on the patients’ clinical information. The top row is based on the original subtype labels obtained with the PAM50 list and a single classifier (PAM). Middle and bottom rows are based on the labels obtained by Ensemble Learning using the PAM50 and CM1 lists, respectively.
(A) Discovery and (B) Validation. The bars represent the number of samples with PR positive and negative distributed in the five intrinsic subtypes, based on the patients’ clinical information. The top row is based on the original subtype labels obtained with the PAM50 list and a single classifier (PAM). Middle and bottom rows are based on the labels obtained by Ensemble Learning using the PAM50 and CM1 lists, respectively.
(A) Discovery and (B) Validation. The bars represent the number of samples with
Subsequently, we illustrate the survival curves for all breast cancer subtypes using Cox proportional hazards model, as described in
The survival curves for each breast cancer subtype are generated using a Cox proportional hazards model based on tumour grade and size, patient age, number of positive lymph nodes and ER status. Each curve represents the survival probability at a given time after diagnosis. Ticks on a curve correspond to observations of patients who are still alive, while drops indicate deaths. The probability curves based on the last 10 observations are plotted as dashed lines. The top row is based on the original subtype labels obtained with the PAM50 list and a single classifier (PAM). Middle and bottom rows are based on the labels obtained by Ensemble Learning using the PAM50 and CM1 lists, respectively.
In this study, we exposed the power of the CM1 list for improving the breast cancer subtype prediction in the METABRIC and ROCK data sets. The CM1 score portrayed 30 novel genes as potential biomarkers, along with 12 well-established markers shared between CM1 and PAM50 lists. The 42 biomarkers have a great potential to differentiate breast cancer intrinsic subtypes. Among them,
Within the application of an ensemble of classifiers, CM1 and PAM50 lists showed concordant predictive power for disease subtyping. In fact, there was an
In spite of luminals sharing the same origin and large molecular commonalities [
Overall, the new intrinsic subtype labels based on the CM1 list and ensemble learning revealed more accurate distributions of clinical markers (ER, PR and HER2) and survival curves, when compared to the original PAM50 labels in the METABRIC cohort and ROCK test set. Interestingly, the CM1 list shows
The document shows the CM1 probe list along with an extensive literature review. The 42 CM1 biomarkers revealed a great potential to differentiate breast cancer intrinsic subtypes in the METABRIC and ROCK data sets. The expression levels of the 30 novel markers and 12 well-established genes vary across subtypes. The vast majority have been associated with breast cancer, whether or not in the context of subtyping.
(PDF)
Box plots illustrating the expression levels for all selected transcripts in the CM1 list in the METABRIC discovery and validation sets, and the ROCK test set. The figure shows the probes' differential behaviour across breast cancer intrinsic subtypes.
(TIFF)
Table listing the CM1 score used to rank the set of 48803 probes for each of the five breast cancer subtypes in the METABRIC discovery data set. In each case, we selected the top 10 highly discriminative probes (5 with the greatest positive CM1 score values—indicating up-regulated probes relative to the other subtypes, and 5 with the smallest negative values—representing down-regulation).
(XLSX)
Table describing the performance of each classifier on the METABRIC discovery and validation sets, and ROCK test set using the CM1 list. It shows the percentage of correctly, incorrectly and not classified samples, Fleiss Kappa index, Cramer’s V, Average Sensitivity, and other values for classification. The 24 classifiers from the Weka software suite are also listed. The labels predicted by each classifier for all samples using CM1 list are defined as: 1—luminal A, 2—luminal B, 3—HER2-enriched, 4—normal-like, 5—basal-like. Count of predicted labels was obtained with the consensus of the majority of classifiers.
(XLSX)
Table describing the performance of each classifier on the METABRIC discovery and validation sets, and ROCK test set using the PAM50 list. It shows the percentage of correctly, incorrectly and not classified samples, Fleiss Kappa index, Cramer’s V, Average Sensitivity, and other values for classification. The 24 classifiers from the Weka software suite are also listed. The labels predicted by each classifier for all samples using the PAM50 list are defined as: 1—luminal A, 2—luminal B, 3—HER2-enriched, 4—normal-like, 5—basal-like. Count of predicted labels was obtained with the consensus of the majority of classifiers.
(XLSX)
Table containing the Fleiss’ Kappa agreement of labels for the METABRIC discovery and validation sets, and ROCK test set. It shows the overall agreement
(XLSX)
PM is supported by Australian Research Council (ARC,
PM and RB also acknowledge the support of Cancer Institute of New South Wales (