An efficient approach for feature construction of high-dimensional microarray data by random projections

Dimensionality reduction of microarray data is a very challenging task due to the high computational time and large amount of memory required to train and test a model. Genetic programming (GP) is a stochastic approach to problem solving, but on high-dimensional datasets it does not perform as well as other machine learning algorithms. To exploit the inherent ability of GP to generalize models from low-dimensional data, we need to consider dimensionality reduction approaches. Random projections (RPs) have gained attention for reducing the dimensionality of data at a lower computational cost than other dimensionality reduction approaches. We report that the features constructed from RPs perform extremely well when combined with a GP approach. We used eight datasets, seven of which have not previously been reported in any machine learning research. We also compared our results by using the same full and constructed features with decision tree, random forest, naive Bayes, support vector machine and k-nearest neighbor methods.


Introduction
A microarray is a collection of DNA or RNA probes attached to a solid surface. The purpose of a microarray is expression profiling or assessing the genome content of closely related cells or organisms [1]. Microarray datasets have become a center of attention for researchers working in the bioinformatics and machine learning domains. Studying the underlying patterns of differential gene expression is a major challenge in these kinds of datasets, as the number of instances for both training and testing is usually less than 100, while the number of features ranges from 6,000 to 60,000. High dimensionality implies high computational cost and massive memory requirements for training. The capacity of trained algorithms is also compromised by what is known as the curse of dimensionality [2]. Several studies have been carried out to find a robust machine learning method to classify such data [3].
Evolutionary algorithms (EAs) are population-based, random search techniques in which a population of solutions is updated iteratively, using algorithm-specific heuristics, until convergence is achieved [4]. Genetic programming (GP) is one of the most popular techniques in the EA community. Since GP's introduction by Koza [5], the research community has frequently applied it to problems such as optimization, control, data mining, image processing and signal processing [6]. Dimensionality reduction maps data from a high-dimensional space to a low-dimensional space, assuming that the intrinsic structure of the high-dimensional data can remain intact in the low-dimensional space. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are the two most commonly used dimensionality reduction techniques. These two techniques construct features which perform well with various machine learning algorithms, but high computational cost is one of their major limitations. To address this issue, Random Projection (RP), which maps data to a randomly generated, low-dimensional latent space, was proposed [7]. The motivation behind the current work was to explore the effectiveness of RP for feature construction to improve the classification performance of a GP classifier on high-dimensional microarray datasets. The purpose of this work was to address the following objectives: 1. To investigate the performance of GP on very high-dimensional microarray datasets.
2. To investigate the performance of random projection-based features constructed with GP.
3. To investigate how the k-Nearest Neighbours (KNNs), Support Vector Machines (SVMs), Decision Trees (DT), Naive Bayes (NB) and Random Forests (RFs) algorithms perform on very high-dimensional microarray datasets as compared to GP.

Background
GP is a population-based method to evolve programs [8]. It typically follows these steps: 1. Initialization: produce an initial population of programs from terminal and function sets.
2. Until a stopping criterion is fulfilled, perform: • Evaluation: the fitness of each individual program is calculated by a pre-selected fitness function.
• Selection: select a subset of programs, based on their fitness scores, to produce the next generation of programs.
• Evolution: generate the new generation by copying a program into it unchanged (reproduction), combining parts of two parent programs (crossover), or randomly altering a part of a program (mutation).
3. Return the solution with the highest fitness.
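The loop above can be sketched in a few dozen lines of Python. This is only an illustrative toy, not the paper's setup (which used ECJ, a population of 1024, and an MCC-based fitness): the tiny regression task, the reduced function set, and all parameter values below are assumptions made for the sketch.

```python
import random
import operator

# Toy GP: an int is a terminal (a feature index), a 3-tuple
# (op, left, right) is an internal node applying an operator.
FUNCS = {'+': operator.add, '-': operator.sub, '*': operator.mul}
N_FEATURES = 3

def random_tree(depth):
    if depth == 0 or random.random() < 0.3:
        return random.randrange(N_FEATURES)          # terminal: feature index
    op = random.choice(list(FUNCS))
    return (op, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if isinstance(tree, int):
        return x[tree]
    op, left, right = tree
    return FUNCS[op](evaluate(left, x), evaluate(right, x))

def fitness(tree, data):
    # toy fitness: negative squared error against a known target
    return -sum((evaluate(tree, x) - y) ** 2 for x, y in data)

def tournament(pop, fits, size=7):
    picks = random.sample(range(len(pop)), size)
    return pop[max(picks, key=lambda i: fits[i])]

def subtrees(tree):
    yield tree
    if not isinstance(tree, int):
        _, left, right = tree
        yield from subtrees(left)
        yield from subtrees(right)

def crossover(a, b, p=0.2):
    # with probability p, graft a random subtree of b into this position of a
    if random.random() < p:
        return random.choice(list(subtrees(b)))
    if isinstance(a, int):
        return a
    op, left, right = a
    return (op, crossover(left, b, p), crossover(right, b, p))

def mutate(tree, p=0.2):
    # subtree mutation: with probability p, replace this subtree with a new one
    if random.random() < p:
        return random_tree(2)
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    return (op, mutate(left, p), mutate(right, p))

def evolve(data, pop_size=60, generations=15):
    pop = [random_tree(4) for _ in range(pop_size)]
    for _ in range(generations):
        fits = [fitness(t, data) for t in pop]
        elite = pop[max(range(pop_size), key=lambda i: fits[i])]
        nxt = [elite]                                # elitism
        while len(nxt) < pop_size:
            parent = tournament(pop, fits)
            if random.random() < 0.8:
                nxt.append(crossover(parent, tournament(pop, fits)))
            else:
                nxt.append(mutate(parent))
        pop = nxt
    fits = [fitness(t, data) for t in pop]
    return pop[max(range(pop_size), key=lambda i: fits[i])]

random.seed(0)
# target is x0 + x1, so the best achievable fitness is 0
data = [((a, b, c), a + b) for a in range(3) for b in range(3) for c in range(3)]
best = evolve(data)
print(fitness(best, data))
```

Note the use of tournament selection and single-individual elitism, matching the experimental setup described later; the subtree crossover here is deliberately simplified.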

Terminal and function set
In GP, each program is a tree-like structure in which the terminal nodes are feature values and the internal nodes are elements of a pre-determined function set, in our case the arithmetic operators (+, −, ×, ÷).
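Division in a GP function set is conventionally "protected" so that every randomly generated tree remains evaluable. A minimal sketch of that convention follows; returning 1 on a zero denominator is a common choice assumed here, not a detail taken from the paper.

```python
def protected_div(a, b):
    # protected division: guard against division by zero so any tree
    # built from the function set can always be evaluated
    return a / b if b != 0 else 1.0

print(protected_div(6, 3))   # 2.0
print(protected_div(5, 0))   # 1.0
```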

Fitness function
In order to measure the fitness of a program, we used the Matthews correlation coefficient (MCC). The MCC is a correlation between the observations and the predictions, which in our case is defined as:

MCC = (N_tp × N_tn − N_fp × N_fn) / √((N_tp + N_fp)(N_tp + N_fn)(N_tn + N_fp)(N_tn + N_fn))

where N_tp, N_tn, N_fp, and N_fn are the numbers of true positives (TPs), true negatives (TNs), false positives (FPs) and false negatives (FNs), respectively. When the denominator is 0, MCC is set to 0. The standardized fitness of a rule was calculated as:

fitness = (MCC + 1) / 2

Since MCC ranges between −1.0 and +1.0, the standardized fitness ranges between 0.0 and +1.0, with higher values being better and 1.0 being the best.
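The fitness computation follows directly from these definitions; a small sketch:

```python
import math

def mcc(n_tp, n_tn, n_fp, n_fn):
    # Matthews correlation coefficient; set to 0 when the denominator is 0
    denom = math.sqrt((n_tp + n_fp) * (n_tp + n_fn)
                      * (n_tn + n_fp) * (n_tn + n_fn))
    if denom == 0:
        return 0.0
    return (n_tp * n_tn - n_fp * n_fn) / denom

def standardized_fitness(n_tp, n_tn, n_fp, n_fn):
    # map MCC from [-1, +1] onto [0, 1]; higher is better, 1.0 is best
    return (mcc(n_tp, n_tn, n_fp, n_fn) + 1.0) / 2.0

print(round(mcc(90, 80, 10, 20), 3))                   # 0.704
print(round(standardized_fitness(90, 80, 10, 20), 3))  # 0.852
```

Note that an all-zero confusion matrix yields a standardized fitness of 0.5, the value of a random predictor, rather than an error.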

Dataset description
For experimentation, we have chosen eight high-dimensional microarray datasets.

Experimental set up
To measure the performance of our method for feature selection and classification, we conducted several experiments on the eight different microarray datasets. ECJ [16] was used for GP, and the Weka package [17] was used to implement random projections for feature construction. The Weka API was also used for the KNNs, SVMs, DT, NB and RFs classifiers. We used k-fold cross-validation, with k = 10, to avoid feature selection bias for all of the above methods. Our experimental design is shown in Fig 1. In random projection, the original d-dimensional data are projected through the origin onto a k-dimensional (k << d) subspace, using a random k × d matrix R whose columns have unit lengths [18]. In matrix notation, where X_{d×N} is the original set of N d-dimensional observations, the projection is X^{RP}_{k×N} = R_{k×d} X_{d×N}. Table 1 gives a summary of the parameters used. Ramped half-and-half was used to generate the initial population of programs, with individual tree depths ranging from 2 to 8. Tournament selection with size 7 and a population of size 1024 was used. Elitism was applied to copy the best individual into the next generation. Once the maximum number of generations is reached, the evolutionary process terminates. The whole experiment was repeated 30 times with random seeds.
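The projection step can be sketched with NumPy as follows. The Gaussian entries of R and the synthetic data shape are assumptions for the sketch; the paper used Weka's random projection implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, k = 6000, 80, 50            # many genes, few samples, target dimension

X = rng.standard_normal((d, N))   # stand-in for the d x N expression matrix

# random k x d matrix R with unit-length columns, as in [18]
R = rng.standard_normal((k, d))
R /= np.linalg.norm(R, axis=0)

X_rp = R @ X                      # projected data X^RP, shape (k, N)
print(X_rp.shape)                 # (50, 80)
```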
We use accuracy, the fraction of correctly classified instances, to measure the performance of models on both the training and test sets: accuracy = (N_tp + N_tn) / (N_tp + N_tn + N_fp + N_fn). Performance was measured in each GP run for each fold and used to calculate mean accuracies and standard deviations over the 10 folds; the results are given in Tables 3 and 4.
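A 10-fold cross-validation loop of the kind described above can be sketched with scikit-learn; the SVM classifier and the synthetic data here are stand-ins for the paper's Weka/ECJ setup, chosen only to make the sketch runnable.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 200))   # synthetic: 100 samples, 200 features
y = rng.integers(0, 2, size=100)

train_accs, test_accs = [], []
for tr, te in StratifiedKFold(n_splits=10, shuffle=True,
                              random_state=1).split(X, y):
    clf = SVC().fit(X[tr], y[tr])
    train_accs.append(accuracy_score(y[tr], clf.predict(X[tr])))
    test_accs.append(accuracy_score(y[te], clf.predict(X[te])))

# mean accuracy and standard deviation over the 10 folds
print(f"train: {np.mean(train_accs):.3f} +/- {np.std(train_accs):.3f}")
print(f"test:  {np.mean(test_accs):.3f} +/- {np.std(test_accs):.3f}")
```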

Results and discussion
We have used eight datasets, all of which have a very low number of instances and a very large number of features. As Table 2 shows, GP with the full feature set did not give good training accuracy compared to the other machine learning algorithms. In most cases, SVM and RF achieved very good training accuracy.
The case is similar for test set accuracy, shown in Table 3. SVMs performed exceptionally well on almost all the high-dimensional datasets, achieving greater than 96% on the Skeletal Muscle and Adenocarcinoma datasets. RF also achieved very good results on all the datasets, most impressively on the Skeletal Muscle and Melanoma datasets. With KNNs (k = 3), the Skeletal Muscle, Adenocarcinoma, and Melanoma datasets showed good results. For NB and DT, the Melanoma and Skeletal Muscle datasets showed better results than the other datasets. With GP, the Adenocarcinoma and Melanoma datasets performed better than the rest.
As shown in Table 4, the features newly constructed using random projections tell a different story from the full feature set. We constructed three sets of features for each dataset. GP showed excellent results in all cases; with 50 constructed features, GP achieved the best results every time. As the number of constructed features increases, its accuracy gradually decreases. The patterns for the other algorithms are different. For the Adenocarcinoma dataset, a higher number of features gives accuracy comparable, on average, to a lower number of features, as with GP. For the highly unbalanced Placenta dataset, accuracy was maintained for all of the algorithms except DT. In most cases, there is only a very small difference in accuracy among RF, NB, KNNs, and SVMs.
For the Melanoma and Breast Cancer datasets, a higher number of constructed features gives better results for all methods except DT. For the Skeletal Muscle dataset, KNNs and SVMs show better results as the number of features increases, while the inverse is true for DT, NB, and RF. For the Osteoarthritis dataset, a higher number of features yields better classification accuracy than the other feature subsets for all methods.
When we compare results from the full feature set with the constructed features, GP shows a significant increase in overall accuracy with the random projection-based constructed features, as shown in Fig 2, along with a decrease in standard deviation. For all the datasets, the increase is between 15% and 40%. For DT, there is a decrease of 5% to 15% in overall accuracy; for NB, a decrease of 2% to 5%. With KNNs, the B-Cells, Melanoma, Adenocarcinoma and Osteoarthritis datasets show better results than with the full feature set, with an increase of 2% to 10%. For SVMs, there is an increase of 2% to 20% on most datasets, the exceptions being Adenocarcinoma, Oral Mucosa, and B-Cells. For RF, there is an increase of 1% to 7% in overall accuracy with the newly constructed feature sets on most datasets.

Conclusion and future work
In light of the above results, it is evident that random projections are very effective for feature construction when combined with genetic programming as a classifier. In future work, we will explore this method on other high-dimensional problems, such as DNA-binding protein prediction [19], tubule boundary detection [20], methylation site prediction [21], phosphorylation site prediction [22] and protein-protein interaction prediction [23,24].