Abstract
Dimensionality reduction of microarray data is a very challenging task due to high computational time and the large amount of memory required to train and test a model. Genetic programming (GP) is a stochastic approach to solving a problem. For high dimensional datasets, GP does not perform as well as other machine learning algorithms. To explore the inherent property of GP to generalize models from low dimensional data, we need to consider dimensionality reduction approaches. Random projections (RPs) have gained attention for reducing the dimensionality of data with reduced computational cost, compared to other dimensionality reduction approaches. We report that the features constructed from RPs perform extremely well when combined with a GP approach. We used eight datasets out of which seven have not been reported as being used in any machine learning research before. We have also compared our results by using the same full and constructed features for decision trees, random forest, naive Bayes, support vector machines and k-nearest neighbor methods.
Citation: Tariq H, Eldridge E, Welch I (2018) An efficient approach for feature construction of high-dimensional microarray data by random projections. PLoS ONE 13(4): e0196385. https://doi.org/10.1371/journal.pone.0196385
Editor: Yun Li, University of North Carolina at Chapel Hill, UNITED STATES
Received: June 27, 2017; Accepted: April 12, 2018; Published: April 27, 2018
Copyright: © 2018 Tariq et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data is available publicly and links for the data are provided in the paper.
Funding: We received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
A microarray is a collection of DNA or RNA attached to a solid surface. The purpose of a microarray is to profile expression or to assess genome content in closely related cells or organisms [1]. Microarray datasets have become a center of attention for researchers working in the bioinformatics and machine learning domains. Studying the underlying patterns of differential gene expression is a major challenge in these datasets, as the number of instances for both training and testing is usually less than 100, while the number of features ranges from 6,000 to 60,000. High dimensionality implies high computational cost and massive memory requirements for training. The capacity of the trained models is also compromised by what is known as the curse of dimensionality [2]. Several studies have been carried out to find a robust machine learning method to classify such data [3].
Evolutionary algorithms (EAs) are population-based, random search techniques in which a population of solutions is updated iteratively using algorithm-specific heuristics until convergence is achieved [4]. Genetic programming (GP) is one of the most popular techniques in the EA community. Since GP's introduction by Koza [5], the research community has frequently applied it to problems such as optimization, control, data mining, image processing and signal processing [6]. Dimensionality reduction maps data from a high-dimensional space to a low-dimensional space, assuming that the intrinsic structure of the high-dimensional data remains intact in the low-dimensional space. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are the two most commonly used dimensionality reduction techniques. Both construct features that perform well with various machine learning algorithms, but their high computational cost is a major limitation. To address this cost, Random Projection (RP), which maps data to a randomly generated, low-dimensional latent space, was proposed [7]. The motivation behind the current work was to explore the effectiveness of RP for feature construction to improve the classification performance of a GP classifier on high-dimensional microarray datasets. This work addresses the following objectives:
- To investigate the performance of GP on very high-dimensional microarray datasets.
- To investigate the performance of random projection-based features constructed with GP.
- To investigate how k-Nearest Neighbours (KNNs), Support Vector Machines (SVMs), Decision Trees (DT), Naive Bayes (NB) and Random Forests (RFs) perform on very high-dimensional microarray datasets compared to GP.
Background
GP is a population-based method to evolve programs [8]. It typically follows these steps:
- Initialization: produce an initial population of programs from terminal and function sets.
- Until a certain stopping criterion is fulfilled, perform:
  - Evaluation: the fitness of each individual program is calculated by a pre-selected fitness function.
  - Selection: select a subset of programs to produce the next generation of programs based on their fitness scores.
  - Evolution: generate the new generation by copying a program into it unchanged (reproduction), combining different parts of two programs (crossover), or randomly changing a part of a program (mutation).
- Return the solution with the highest fitness.
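The loop above can be sketched in a representation-agnostic way. The following is a minimal illustration, not the ECJ implementation used in this work: the function `evolve`, its parameter names, the operator probabilities and the tournament size of 3 are all hypothetical, and the toy individuals are bit lists rather than program trees, chosen only so the example stays short.

```python
import random

def evolve(init, fitness, crossover, mutate, pop_size=40, generations=25,
           p_crossover=0.8, p_mutation=0.1, tournament_size=3, seed=0):
    """Evaluate, select, vary, repeat; return the fittest individual seen."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]        # initialization
    best = max(population, key=fitness)
    for _ in range(generations):                             # stopping criterion
        scores = [fitness(ind) for ind in population]        # evaluation

        def select():
            # Selection: fittest of a few randomly drawn individuals.
            picks = rng.sample(range(pop_size), tournament_size)
            return population[max(picks, key=lambda i: scores[i])]

        offspring = []
        while len(offspring) < pop_size:                     # evolution
            r = rng.random()
            if r < p_crossover:
                child = crossover(select(), select(), rng)   # crossover
            elif r < p_crossover + p_mutation:
                child = mutate(select(), rng)                # mutation
            else:
                child = select()                             # reproduction
            offspring.append(child)
        population = offspring
        best = max(population + [best], key=fitness)
    return best


# Toy stand-in problem (assumption, not a program tree): maximize the ones.
def init(rng):
    return [rng.randint(0, 1) for _ in range(12)]

def crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(a, rng):
    i = rng.randrange(len(a))
    return a[:i] + [1 - a[i]] + a[i + 1:]

best = evolve(init, sum, crossover, mutate)
```

In a GP setting, `init`, `crossover` and `mutate` would operate on program trees, and `fitness` would score a program's predictions on the training data.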
Terminal and function set
In GP, each program is a tree-like structure where terminal nodes are the feature values and internal nodes are elements of a pre-determined function set, in our case (+, −, ÷, ×).
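As an illustration of this tree structure, the sketch below evaluates a small program on one sample. The tuple encoding, the names `evaluate` and `protected_div`, the choice to return 1.0 on division by zero, and the thresholding of the output at zero to obtain a class label are all assumptions made for the example; the text does not specify these details.

```python
import operator

def protected_div(a, b):
    # Assumption: a "protected" division that returns 1.0 for a zero divisor.
    return a / b if abs(b) > 1e-12 else 1.0

# Function set from the text: +, -, division and multiplication.
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": protected_div,
}

def evaluate(node, features):
    """Recursively evaluate a program tree on one sample.

    A terminal is an int (index into the feature vector); an internal node
    is a tuple (symbol, left_subtree, right_subtree).
    """
    if isinstance(node, int):
        return features[node]
    symbol, left, right = node
    return FUNCTIONS[symbol](evaluate(left, features), evaluate(right, features))

# (x0 * x1) - (x2 / x3)
program = ("-", ("*", 0, 1), ("/", 2, 3))
output = evaluate(program, [2.0, 3.0, 8.0, 4.0])   # 2*3 - 8/4 = 4.0

# Assumption: map the real-valued output to a binary class by its sign.
predicted_class = 1 if output > 0 else 0
```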
Fitness function
To measure the fitness of a program, we used the Matthews correlation coefficient (MCC). The MCC is a correlation between the observed and predicted classes, which in our case is defined as:

$$\mathrm{MCC} = \frac{N_{tp} N_{tn} - N_{fp} N_{fn}}{\sqrt{(N_{tn} + N_{fn})(N_{tn} + N_{fp})(N_{tp} + N_{fn})(N_{tp} + N_{fp})}} \tag{1}$$

where $N_{tp}$, $N_{tn}$, $N_{fp}$ and $N_{fn}$ are the number of true positives (TPs), true negatives (TNs), false positives (FPs) and false negatives (FNs), respectively. When the denominator is 0, MCC is set to 0. The standardized fitness of a rule was calculated as:

$$\mathrm{fitness} = \frac{1 + \mathrm{MCC}}{2} \tag{2}$$
Since MCC ranges between -1.0 and +1.0, the standardized fitness ranges between 0.0 and +1.0, the higher values being better with 1.0 being the best.
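A minimal sketch of this fitness computation follows. The function names are hypothetical, and the rescaling (1 + MCC) / 2 is assumed here because it is the linear map that takes the range [-1, +1] to the stated range [0, 1].

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    if denominator == 0:
        return 0.0          # convention stated in the text
    return (tp * tn - fp * fn) / denominator

def standardized_fitness(tp, tn, fp, fn):
    # Assumed rescaling of MCC from [-1, 1] to [0, 1]; higher is better.
    return (1.0 + mcc(tp, tn, fp, fn)) / 2.0

perfect = standardized_fitness(tp=5, tn=5, fp=0, fn=0)      # every sample right
inverted = standardized_fitness(tp=0, tn=0, fp=5, fn=5)     # every sample wrong
degenerate = standardized_fitness(tp=10, tn=0, fp=0, fn=0)  # zero denominator
```

Unlike plain accuracy, the MCC does not reward a classifier that always predicts the majority class, which matters for unbalanced datasets such as Placenta (64 non-smokers, 12 smokers).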
Dataset description
For experimentation, we have chosen eight high-dimensional microarray datasets.
- Lung Cancer Histology: This dataset is a comparison of two non-small cell lung cancer histology sub-types [9]. It contains expression levels of 54,675 RNAs from 58 carcinoma samples: 18 squamous cell carcinomas (SCC) and 40 adenocarcinomas (AC). The complete dataset is available at https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS3627.
- Oral Mucosa: This dataset provides insight into the carcinogenic effects of cigarette smoking [10]. The dataset has expression levels of 54,675 RNAs from 79 oral mucosa samples: 39 smokers and 40 non-smokers. The complete dataset is available at https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS3709.
- B lymphocytes: This dataset was generated during a study conducted on US white females [11]. The objective was to analyse the effect of smoking on circulating B lymphocytes, because B cells are directly linked with the onset of smoking-induced diseases. The dataset contains expression levels of 22,283 RNAs from 79 blood samples of females: 40 non-smokers and 39 smokers. The complete dataset is available at https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS3713.
- Placenta: This dataset provides insight into the effect of tobacco smoking on the placenta [12]. Smoking increases the risk of preterm delivery and other complications during pregnancy. The dataset has expression levels of 11,155 RNAs taken from the placentas of 76 females: 64 non-smokers and 12 smokers. The dataset is available at https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS3793.
- Melanoma: This dataset provides insight into the molecular basis of primary melanoma and melanoma metastasis [13]. It has the microarray expression levels of 22,283 RNAs from 83 melanoma samples: 31 primary melanomas and 52 melanoma metastases. The dataset is available at https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS3966.
- Breast cancer: This dataset contains 97 cDNA microarrays, each representing 24,481 genes based on the biopsy specimens of primary breast tumors of patients with germline mutations of relapsed and non-relapsed. The pre-processed dataset is available at http://csse.szu.edu.cn/staff/zhuzx/Datasets.html.
- Skeletal muscle: This dataset gives an overview of molecular changes in the skeletal muscles of young and old people [14]. The dataset has 54,675 RNA expression levels from 110 samples: 62 young and 48 old. The pre-processed dataset is available at https://www.ncbi.nlm.nih.gov/sites/GDSbrowser
- Osteoarthritis: This dataset provides insight into the molecular changes of osteoarthritis patients [15]. It has 48,802 RNA microarray expression levels from 139 patients: 106 osteoarthritis and 33 control. The pre-processed dataset is available at https://www.ncbi.nlm.nih.gov/sites/GDSbrowser
Experimental set up
To measure the performance of our method for feature selection and classification, we conducted several experiments on the eight microarray datasets. ECJ [16] was used for GP, and the Weka package [17] was used to implement random projections for feature construction. The Weka API was also used for the KNN, SVM, DT, NB and RF classifiers. We used 10-fold cross-validation to avoid feature selection bias for all of the above methods. Our experimental design is shown in Fig 1.
Training-set and test-set performance evaluations are reported in Tables 3 and 4, respectively. Performance was measured in each GP run for each fold and used to calculate mean accuracies and standard deviations over the 10 folds.
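The fold-wise evaluation can be sketched as follows. This is an illustration only: the helper names are hypothetical, the split is a plain shuffled (unstratified) partition, and the sample standard deviation is used, none of which is specified in the text.

```python
import random
import statistics

def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def mean_and_std(fold_accuracies):
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.mean(fold_accuracies), statistics.stdev(fold_accuracies)

# Example with 79 samples, the size of the Oral Mucosa dataset.
splits = list(k_fold_splits(79, k=10))
fold_sizes = [len(test) for _, test in splits]
```

Each sample appears in exactly one test fold, so every accuracy entering the mean is measured on data that was not used to train the corresponding model.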
In random projection, the original d-dimensional data is projected through the origin onto a k-dimensional (k << d) subspace, using a random k × d matrix R whose columns have unit length [18]. In matrix notation, where $X_{d \times N}$ is the original set of N d-dimensional observations,

$$X^{RP}_{k \times N} = R_{k \times d} X_{d \times N} \tag{3}$$

is the projection of the data onto the lower k-dimensional subspace.
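A small pure-Python sketch of this projection is given below. It is not the Weka implementation used in the experiments: the function names are hypothetical, and drawing the entries of R from a standard Gaussian before normalizing each column is an assumption, since the distribution is not stated here.

```python
import math
import random

def random_projection_matrix(k, d, seed=0):
    """Return a k x d matrix R (list of rows) whose columns have unit length."""
    rng = random.Random(seed)
    # Assumption: Gaussian entries; other distributions are also common.
    R = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    for j in range(d):
        norm = math.sqrt(sum(R[i][j] ** 2 for i in range(k)))
        for i in range(k):
            R[i][j] /= norm
    return R

def project(R, x):
    """Map one d-dimensional observation x to k constructed features (R x)."""
    return [sum(r * v for r, v in zip(row, x)) for row in R]

# Tiny example: 40 original features reduced to 5 constructed features.
R = random_projection_matrix(k=5, d=40, seed=1)
sample = [float(i % 7) for i in range(40)]
constructed = project(R, sample)
```

Because R is generated without looking at the data, constructing the features costs only one matrix multiplication, in contrast to PCA or LDA, which must first estimate statistics from the training set.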
Table 1 gives a summary of the parameters used. Ramped half-and-half was used to generate the initial population of programs, where the individual tree depth ranges from 2 to 8. Tournament selection with size 7 and a population of size 1024 was used. Elitism is applied to copy the best individual into the next generation. The evolutionary process terminates once the maximum number of generations is reached. The whole experiment was repeated 30 times with different random seeds.
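Ramped half-and-half initialization can be sketched as below. This is an illustrative version rather than ECJ's builder: the names are hypothetical, depth is counted in edges from the root, the "grow" method stops early with an assumed probability of 0.5, and trees use the same tuple encoding as a simple binary expression tree over feature indices.

```python
import random

FUNCTIONS = ["+", "-", "*", "/"]

def build_tree(rng, n_features, depth, full):
    """Build a tree of at most `depth` levels below the root.

    full=True  -> every branch reaches exactly `depth` ("full" method).
    full=False -> branches may end early in a terminal ("grow" method).
    """
    if depth == 0 or (not full and rng.random() < 0.5):
        return rng.randrange(n_features)          # terminal: feature index
    return (rng.choice(FUNCTIONS),
            build_tree(rng, n_features, depth - 1, full),
            build_tree(rng, n_features, depth - 1, full))

def tree_depth(tree):
    if isinstance(tree, int):
        return 0
    return 1 + max(tree_depth(tree[1]), tree_depth(tree[2]))

def ramped_half_and_half(pop_size, n_features, min_depth=2, max_depth=8, seed=0):
    """Cycle through the depth range, alternating full and grow blocks."""
    rng = random.Random(seed)
    depths = list(range(min_depth, max_depth + 1))
    population = []
    for i in range(pop_size):
        depth = depths[i % len(depths)]
        full = (i // len(depths)) % 2 == 0
        population.append(build_tree(rng, n_features, depth, full))
    return population

# Small demonstration; the experiments used a population of 1024.
population = ramped_half_and_half(pop_size=28, n_features=50)
```

Mixing depths and both construction methods gives the initial population a variety of tree sizes and shapes, so that evolution does not start from structurally similar programs.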
We use accuracy to measure the performance of models on training and test sets. For training data, the performance is measured as:

$$\mathrm{Accuracy} = \frac{N_{tp} + N_{tn}}{N_{tp} + N_{tn} + N_{fp} + N_{fn}}$$
Results and discussion
We used eight datasets, all of which have a very low number of instances and a very large number of features. As Table 2 shows, GP with the full feature set did not achieve good training accuracy compared to the other machine learning algorithms. In most cases, SVM and RF achieved very good training accuracy.
The same pattern holds for test-set accuracy, as shown in Table 3.
SVMs performed exceptionally well on almost all the high-dimensional datasets. For the Skeletal Muscle and Adenocarcinoma datasets, they achieved greater than 96%. RF also achieved very good results on all the datasets, most impressively on the Skeletal Muscle and Melanoma datasets. For KNNs (k = 3), the Skeletal Muscle, Adenocarcinoma, and Melanoma datasets showed good results. For NB and DT, Melanoma and Skeletal Muscle showed better results than the other datasets. With GP, the Adenocarcinoma and Melanoma datasets showed better performance than the rest of the datasets.
As shown in Table 4, the features newly constructed by random projections tell a different story from the full feature set. We constructed three sets of features (50, 100 and 150 features) for each of the datasets. GP showed excellent results in all cases. With 50 constructed features, GP consistently showed its best results; as the number of constructed features increases, its accuracy gradually decreases. For the other algorithms, the patterns are different. For the Adenocarcinoma dataset, average accuracy increases slowly as the number of features grows. The most significant change is for DT and RF, where average accuracy rises by 12%.
For Oral Mucosa, 50 constructed features showed better results than 100 and 150 features for all the algorithms except KNNs and RF. For the B-Cells dataset, DT and KNNs, along with GP, showed better results with a lower number of features. For the highly unbalanced Placenta dataset, accuracy was maintained for all of the algorithms except DT. In most cases, there is only a very small difference in accuracy between RF, NB, KNNs, and SVMs.
For the Melanoma and Breast Cancer datasets, a higher number of constructed features gives better results for all the methods except DT. For the Skeletal Muscle dataset, KNNs and SVMs show better results as the number of features increases, while the inverse is true for DT, NB, and RF. For the Osteoarthritis dataset, a higher number of features gives better classification accuracies than the other feature subsets for all the methods.
When we compare results from the full feature set with those from the constructed features, GP shows a significant increase in overall accuracy with random projection-based constructed features, as shown in Fig 2, and a decrease in standard deviation. For all the datasets, there is an increase of 15% to 40%. For DT, there is a decrease of 5% to 15% in overall accuracy. For NB, there is a decrease of 2% to 5%. For KNNs, the B-Cells, Melanoma, Adenocarcinoma and Osteoarthritis datasets show better results than with the full feature set, with an increase of 2% to 10%. For SVMs, there is an increase of 2% to 20% for most of the datasets except Adenocarcinoma, Oral Mucosa, and B-Cells. For RF, there is an increase of 1% to 7% in the overall accuracy with the newly constructed feature sets for most of the datasets.
Conclusion and future work
In light of the above results, random projections are very effective for feature construction when combined with genetic programming as a classifier. For future work, we will explore this method to address other high-dimensional problems such as DNA-binding protein prediction [19], detection of tubule boundaries [20], methylation site prediction [21], phosphorylation site prediction [22] and protein-protein interaction prediction [23,24].
References
- 1. Pollack JR, Perou CM, Alizadeh AA, Eisen MB, Pergamenschikov A, Williams CF, et al. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nature genetics. 1999 Sep 1;23(1):41–6. pmid:10471496
- 2. Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A. Recent advances and emerging challenges of feature selection in the context of big data. Knowledge-Based Systems. 2015 Sep 30; 86:33–45.
- 3. Piatetsky-Shapiro G, Tamayo P. Microarray data mining: facing the challenges. ACM SIGKDD Explorations Newsletter. 2003 Dec 1;5(2):1–5.
- 4. Krishna GJ, Ravi V. Evolutionary computing applied to customer relationship management: A survey. Engineering Applications of Artificial Intelligence. 2016 Nov 30; 56:30–59.
- 5. Koza JR. Genetic programming: on the programming of computers by means of natural selection. MIT press; 1992.
- 6. Koza JR. Human-competitive results produced by genetic programming. Genetic Programming and Evolvable Machines. 2010 Sep 1;11(3–4):251–84.
- 7. Zhao R, Mao K. Semi-random projection for dimensionality reduction and extreme learning machine in high-dimensional space. IEEE Computational Intelligence Magazine. 2015 Aug;10(3):30–41.
- 8. Heřmanovský M, Havlíček V, Hanel M, Pech P. Regionalization of runoff models derived by genetic programming. Journal of Hydrology. 2017 Apr 30; 547:544–56.
- 9. Kuner R, Muley T, Meister M, Ruschhaupt M, Buness A, Xu EC, et al. Global gene expression analysis reveals specific patterns of cell junctions in non-small cell lung cancer subtypes. Lung cancer. 2009 Jan 31;63(1):32–8. pmid:18486272
- 10. Boyle JO, Gümüş ZH, Kacker A, Choksi VL, Bocker JM, Zhou XK, et al. Effects of cigarette smoke on the human oral mucosal transcriptome. Cancer Prevention Research. 2010 Mar 1;3(3):266–78. pmid:20179299
- 11. Pan F, Yang TL, Chen XD, Chen Y, Gao G, Liu YZ, et al. Impact of female cigarette smoking on circulating B cells in vivo: the suppressed ICOSLG, TCF3, and VCAM1 gene functional network may inhibit normal cell function. Immunogenetics. 2010 Apr 1;62(4):237–51. pmid:20217071
- 12. Bruchova H, Vasikova A, Merkerova M, Milcova A, Topinka J, Balascak I, et al. Effect of maternal tobacco smoke exposure on the placental transcriptome. Placenta. 2010 Mar 31;31(3):186–91. pmid:20092892
- 13. Xu L, Shen SS, Hoshida Y, Subramanian A, Ross K, Brunet JP, et al. Gene expression changes in an animal melanoma model correlate with aggressiveness of human melanoma metastases. Molecular Cancer Research. 2008 May 1;6(5):760–9. pmid:18505921
- 14. Raue U, Trappe TA, Estrem ST, Qian HR, Helvering LM, Smith RC, et al. Transcriptome signature of resistance exercise adaptations: mixed muscle and fiber type-specific profiles in young and old adults. Journal of applied physiology. 2012 May 15;112(10):1625–36. pmid:22302958
- 15. Ramos YF, Bos SD, Lakenberg N, Böhringer S, den Hollander WJ, Kloppenburg M, et al. Genes expressed in blood link osteoarthritis with apoptotic pathways. Annals of the rheumatic diseases. 2014 Oct 1;73(10):1844–53. pmid:23864235
- 16. Luke S, Panait L, Balan G, Paus S, Skolicki Z, Bassett J, et al. Ecj: A java-based evolutionary computation research system. Downloadable versions and documentation can be found at the following url: http://cs.gmu.edu/eclab/projects/ecj. 2006 Feb.
- 17. Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann; 2016 Oct 1.
- 18. Bingham E, Mannila H. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining 2001 Aug 26 (pp. 245–250). ACM.
- 19. Wei L, Tang J, Zou Q. Local-DPP: An improved DNA-binding protein prediction method by exploring local evolutionary information. Information Sciences. 2017 Apr 30;384:135–44.
- 20. Su R, Zhang C, Pham TD, Davey R, Bischof L, Vallotton P, et al. Detection of tubule boundaries based on circular shortest path and polar‐transformation of arbitrary shapes. Journal of microscopy. 2016 Nov 1;264(2):127–42. pmid:27172164
- 21. Wei L, Xing P, Shi G, Ji ZL, Zou Q. Fast prediction of protein methylation sites using a sequence-based feature selection technique. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2017 Feb 16.
- 22. Wei L, Xing P, Tang J, Zou Q. PhosPred-RF: a novel sequence-based predictor for phosphorylation sites using sequential information only. IEEE Transactions on NanoBioscience. 2017 Jan 31.
- 23. Wei L, Xing P, Zeng J, Chen J, Su R, Guo F. Improved prediction of protein–protein interactions using novel negative samples, features, and an ensemble classifier. Artificial Intelligence in Medicine. 2017 Mar 4.
- 24. Wei L, Wan S, Guo J, Wong KK. A novel hierarchical selective ensemble classifier with bioinformatics application. Artificial Intelligence in Medicine. 2017 Feb 27.