Heterogeneous Ensemble Combination Search Using Genetic Algorithm for Class Imbalanced Data Classification

doi:10.1371/journal.pone.0146116

Table 1.

List of base classifiers used in GA-EoC.

Fig 1.

The steps in preprocessing the training dataset and generating the base classifier models.

The process starts by taking the training dataset as input. First, it balances the class distribution if the training data are imbalanced. Next, it selects features using the (α, β)−k Feature Set selection method if a selected feature set is not already available. Then, it creates training and validation folds from the training dataset for 10-fold cross-validation. These folds are saved and used for internal validation of the ensembles. Finally, it generates a model for each classifier on each training fold (Train 1 to Train 10) and saves the models for future use.
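
The balancing and fold-creation steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the caption does not name the balancing technique, so random oversampling is used here as a stand-in, and the helper names `oversample_minority` and `make_folds` are hypothetical. Feature selection and model training are omitted.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, seed=0):
    """Balance classes by duplicating randomly chosen minority samples.

    Assumption: random oversampling stands in for whatever balancing
    method the original pipeline uses.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


def make_folds(n_samples, k=10, seed=0):
    """Return k (train_idx, valid_idx) pairs; each index is validated once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        splits.append((train, sorted(folds[i])))
    return splits


# Toy imbalanced data: 6 positive and 24 negative samples.
X = [[float(i)] for i in range(30)]
y = [1] * 6 + [0] * 24
Xb, yb = oversample_minority(X, y)
splits = make_folds(len(Xb), k=10)
```

Saving `splits` once and reusing it mirrors the caption's point that the same folds serve every later ensemble evaluation.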

Fig 2.

Overall process flow of the proposed GA-EoC algorithm.

In GA-EoC, each individual represents an EoC, and the genetic algorithm is used to find the best EoC based on its performance on the validation folds. For each individual, an EoC is constructed using the base classifier models of a training fold, and the MCC score of the EoC is calculated on the corresponding validation fold generated beforehand (Fig 1). The average MCC score over the 10 folds is taken as the fitness value of the individual. The algorithm iterates, creating a new population from the current one, until a terminating condition is satisfied. The individual with the best fitness value from the final population is returned as the solution.
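
The fitness evaluation and search loop described above might look like the sketch below. Several details are assumptions not stated in the caption: individuals are encoded as bit lists, the ensemble combines members by majority vote (ties go to class 0), and the selection, crossover and mutation operators are deliberately simple placeholders. The per-fold predictions of each base classifier are assumed to be precomputed.

```python
import math
import random


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def ensemble_predict(individual, preds):
    """Majority vote (assumed rule) over the classifiers switched on."""
    chosen = [preds[c] for c, bit in enumerate(individual) if bit]
    n = len(chosen[0])
    return [1 if 2 * sum(p[i] for p in chosen) > len(chosen) else 0
            for i in range(n)]


def fitness(individual, fold_preds, fold_labels):
    """Average MCC of the ensemble over all validation folds."""
    if not any(individual):
        return -1.0  # an empty ensemble gets the worst possible score
    scores = [mcc(y, ensemble_predict(individual, preds))
              for preds, y in zip(fold_preds, fold_labels)]
    return sum(scores) / len(scores)


def evolve(fold_preds, fold_labels, n_classifiers,
           pop_size=20, generations=30, seed=0):
    """Placeholder GA: keep the top half, refill by crossover + mutation.

    Requires n_classifiers >= 2 and pop_size >= 4.
    """
    rng = random.Random(seed)

    def score(ind):
        return fitness(ind, fold_preds, fold_labels)

    pop = [[rng.randint(0, 1) for _ in range(n_classifiers)]
           for _ in range(pop_size)]
    for _ in range(generations):  # terminating condition: fixed budget
        pop.sort(key=score, reverse=True)
        parents = pop[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_classifiers)
            child = a[:cut] + b[cut:]
            flip = rng.randrange(n_classifiers)
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = parents + children
    return max(pop, key=score)


# Toy data: 2 folds, 3 classifiers (perfect, inverted, always-positive).
labels = [1, 1, 0, 0]
toy_preds = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 1]]
fold_preds = [toy_preds, toy_preds]
fold_labels = [labels, labels]
best = evolve(fold_preds, fold_labels, n_classifiers=3)
```

Because predictions are computed once per fold and cached, each fitness call only performs voting and an MCC computation, which keeps the search loop cheap.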

Fig 3.

Representation of an individual in GA-EoC and its mapping into the corresponding base classifiers for ensemble combination.
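
One plausible way to picture this mapping is a bit list in which position i switches base classifier i on or off. The sketch below assumes that binary encoding, which the caption does not spell out, and uses placeholder classifier names rather than the actual entries of Table 1.

```python
# Hypothetical placeholder names; the real list is given in Table 1.
BASE_CLASSIFIERS = ["clf_A", "clf_B", "clf_C", "clf_D", "clf_E"]


def decode(individual, names=BASE_CLASSIFIERS):
    """Map a bit list to the names of the classifiers it selects."""
    if len(individual) != len(names):
        raise ValueError("individual length must match classifier count")
    return [name for bit, name in zip(individual, names) if bit]


def encode(selected, names=BASE_CLASSIFIERS):
    """Inverse mapping: build the bit list for a set of classifier names."""
    chosen = set(selected)
    return [1 if name in chosen else 0 for name in names]


ensemble = decode([1, 0, 1, 0, 0])
```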

Table 2.

Characteristics of the datasets used for experiments.

Table 3.

Distribution of the training and testing data in the PubFig05 dataset.

Table 4.

Outcome of the (α, β)−k Feature Set selection method for three different setups (UAB, IAB, UEAB), showing the number of selected features for each binary-class dataset of PubFig05.

Table 5.

Classification performances (MCC scores) of the base classifiers and GA-EoC for all experiments.

Table 6.

Classification accuracies achieved by the base classifiers and GA-EoC for all experiments.

Fig 4.

Confusion matrices comparing the best classification performances using the 18-protein biomarker.

Classification performances achieved by (a-b) [Ray et al., 07], (c-d) [R.Moscato, 08] and (e-f) the proposed GA-EoC, for TestSetAD and TestSetMCI, respectively.

Table 7.

Average classification performances (in terms of accuracy and MCC) using the 18-protein biomarker.

Table 8.

Average classification performances (in terms of accuracy and MCC) using the 5-protein biomarker.

Fig 5.

Best classification performances of the state-of-the-art method vs. the proposed method with the 5-protein biomarker.

Comparison of the best classification performances using the 5-protein biomarker (RavettiMoscato-AD-Trn-5) as the training dataset and TestSetAD and TestSetMCI as the test datasets. Classification performances achieved by (a-b) [R.Moscato, 08] and (c-d) GA-EoC, for TestSetAD and TestSetMCI, respectively.

Table 9.

Average classification performances on the UAB setup.

Fig 6.

Classification performances of GA-EoC and other ensembles of classifiers on the PubFig05 datasets.

The classification performances of AdaBoostM1, Bagging, Random Forest and GA-EoC are compared in terms of Precision, Accuracy and F-Measure scores for (a) UAB datasets, (b) IAB datasets and (c) UEAB datasets.
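
For reference, the three scores compared in this figure follow the standard confusion-matrix definitions. The sketch below computes them for a binary problem; the function name `binary_metrics` and the toy labels are illustrative and do not come from the paper's experiments.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, accuracy and F-measure from binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"precision": precision, "accuracy": accuracy,
            "f_measure": f_measure}


# Toy example: 3 positives and 5 negatives, with one error of each kind.
scores = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0],
                        [1, 1, 0, 1, 0, 0, 0, 0])
```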
