## Abstract

Among numerous artificial intelligence approaches, k-Nearest Neighbor algorithms, genetic algorithms, and artificial neural networks are considered the most common and effective methods for classification problems in numerous studies. In the present study, the results of implementing a novel hybrid feature selection-classification model using the above-mentioned methods are presented. The purpose is to benefit from the synergies obtained by combining these technologies for the development of classification models. Such a combination creates an opportunity to capitalize on the strengths of each algorithm and to make up for their deficiencies. To develop the proposed model, with the aim of obtaining the best array of features, first, feature ranking techniques such as Fisher's discriminant ratio and class separability criteria were used to prioritize features. Second, the resulting arrays of top-ranked features were used as the initial population of a genetic algorithm to produce optimum arrays of features. Third, using a modified k-Nearest Neighbor method as well as an improved method of backpropagation neural networks, the classification process was carried out based on the optimum arrays of features selected by the genetic algorithm. The performance of the proposed model was compared with thirteen well-known classification models based on seven datasets. Furthermore, statistical analysis was performed using the Friedman test followed by post-hoc tests. The experimental findings indicated that the novel proposed hybrid model resulted in significantly better classification performance compared with all 13 classification methods. Finally, the performance results of the proposed model were benchmarked against the best ones reported for state-of-the-art classifiers in terms of classification accuracy for the same data sets. The substantial findings of the comprehensive comparative study revealed that the performance of the proposed model in terms of classification accuracy is desirable, promising, and competitive with the existing state-of-the-art classification models.

**Citation:** Salari N, Shohaimi S, Najafi F, Nallappan M, Karishnarajah I (2014) A Novel Hybrid Classification Model of Genetic Algorithms, Modified k-Nearest Neighbor and Developed Backpropagation Neural Network. PLoS ONE 9(11): e112987. https://doi.org/10.1371/journal.pone.0112987

**Editor:** Sergio Gómez, Universitat Rovira i Virgili, Spain

**Received:** July 22, 2013; **Accepted:** October 21, 2014; **Published:** November 24, 2014

**Copyright:** © 2014 Salari et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Funding:** The authors have no support or funding to report.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

In the last decade, the extensive effect of classification models on decision making in various scientific fields, including medicine, has attracted a lot of attention. Classification is the assignment of an individual or an item to one of a set of classes, so that the decision is made based on the characteristics of that individual or item. Successful classification depends on two major factors, “how to select the most informative features” and the “classifier method”, especially in the field of medical classification. The widespread incongruity of features in this field has made the selection of a subset of the best features more significant, and has given it a more effective and valuable role in improving the performance of the classification model. The classifier function is built using a set of training patterns for which the correct classification is known, called the training set. Thus, it is expected that proper selection of methods at each stage would lead to a classification model with successful performance.

Following the first classification rule presented in 1936 by Fisher in the statistical classification literature [1], various classification models have been proposed. Among them, the k-Nearest Neighbor (k-NN) method, a non-parametric classifier that is simple and efficient to implement and understand, has been well received. For instance, in order to improve classification accuracy, Weinberger and Saul [2] presented an improved k-NN algorithm that uses the Mahalanobis distance as the distance criterion. An improved hierarchical k-NN model was introduced by Kubota et al. [3]; the high capability and sensitivity of this model in the fine discrimination of classes is noteworthy. Zeng et al. [4] proposed a modified k-NN classification algorithm based on local averages and class statistics. That is, in addition to local information from the k nearest neighbors of a new unclassified instance, general information about the neighbors in each class is analyzed separately.

Artificial neural network is an efficient approach that in recent years has been considered by researchers as one of the most useful and applicable constructs in artificial intelligence. This is due to its numerous advantages such as being non-parametric (no requirement for any primary assumption on data), self-adaptiveness, ability to be generalized, and having a high capacity in modeling non-linear patterns. This approach is a functional technology that provides the user the possibility to obtain the best linear combination of features in order to achieve his/her goals including the classification of complex models, estimation of non-linear functions and prediction [5].

In the medical field, Olmez and Dokur [6] proposed the use of an artificial neural network algorithm to classify heart beats. In their model, they first selected the best features using dynamic programming; then, using artificial neural networks, they successfully classified heart beats into seven categories. To classify heart beat data, Rajendra A et al. [7] employed artificial neural networks and a Fuzzy equivalence classifier. Qiu et al. [8] presented a model for the classification of cervical cancer risk using artificial neural networks; the findings indicated sensitivity and specificity of 98% and 97%, respectively. Salari et al. [9] used artificial neural network methods for the prediction of late onset heart failure. In 2013, Salari et al. [10] put forward an integrated model based on an artificial intelligence approach for medical data classification.

However, traditional methods which are based on single technology were gradually replaced by hybrid models. Hybrid models which are increasingly getting noticed by researchers are a relatively new approach which include innovative, creative, and appropriate combination of several models for achieving a final common goal with a performance far better than traditional models based on single technology. The main idea behind these models is to benefit from the synergies among technologies. This characteristic provides the opportunity to learn about the exclusive strengths of each technology and can be used as a means of compensating for the deficiencies, and overcome limitations of each technology [11], [12].

The review of medical literature indicates that research on the application of hybrid models in the field of artificial intelligence is growing. Chakraborty [13] proposed an integrated approach for cancer classification and simultaneous gene selection. He argues that, because only a small part of the large number of genes in this field is suitable for discriminating between different types of cancer, it will be better if these two processes take place simultaneously. The application of this model is choosing findings among suitable genes and simultaneously developing a model of possible nearest neighbor for cancer classification. Ostermark [14] proposed a classification hybrid model by employing genetic algorithms, Fuzzy logic, and artificial neural networks. Aci et al. [15] presented a hybrid model with a combination of genetic algorithms, Bayesian methods, and k-NN. Their goal is to eliminate the data that are barrier to learning to achieve successful results in classification. Khashei et al. [16] proposed a hybrid model combining artificial neural networks and multiple linear regression. This model has been proposed for classification purposes, and for achieving higher accuracy and a more generalized application than the traditional artificial neural network models.

In 2014, Seera and Lim [17] also put forward a hybrid intelligent system for medical data classification. The proposed system was a hybrid of the Fuzzy Min–Max neural network, the classification and regression tree (CART), and the random forest model. They concluded that the domain users (i.e., medical practitioners) were able to comprehend the predictions given by the hybrid intelligent system, thus accepting its role as a useful and efficient medical decision support system. Also in 2014, Shao et al. [18] addressed the heart disease classification problem by combining multivariate adaptive regression splines (MARS), logistic regression, artificial neural network, and rough set (RS) techniques. In the initial step, the proposed hybrid model reduced the set of explanatory variables using logistic regression, MARS, and RS techniques. Subsequently, the selected variables were employed as inputs to the artificial neural network in the process of classifying heart disease patients. Experimental results showed the effectiveness of the proposed hybrid model in classifying heart disease.

Forghani and Yazdi [19] came up with a hybrid model called “robust support vector machine-trained fuzzy system”. The proposed hybrid classifier was built from a combination of a support vector machine and Fuzzy if–then rules. Experimental results showed that the proposed approach yields very fast training and testing convergence times with a good overall classification accuracy rate. This model achieved 63% classification accuracy on the Cleveland multi-class data set. Zhang and Zhang [20] suggested a hybrid method employing Rotation Forest in conjunction with AdaBoost. This model achieved 55.62% and 74.69% classification accuracies for the Cleveland multi-class and Pima's data sets, respectively. A classification model entitled “Forest Optimization Algorithm” was proposed by Ghaemi and Feizi-Derakhshi [21]. It was established by incorporating a few trees into the forests to improve the predictive accuracy of classifiers. This classification model attained 58.14% and 71.11% accuracies for the Cleveland multi-class and Pima's data sets, respectively. Zhang et al. [22] came up with a novel k-NN-based algorithm, 3N-Q, for enhancing the accuracy of k-NN classifiers. The reported experimental results demonstrated that 3N-Q is efficient and accurate for performing classification tasks.

The review of literature indicates that models with diverse applications based on various combinations of k-NN, genetic algorithms and artificial neural networks have been proposed for classification purposes. However, no attempt has been made in the literature on classification models to link all three methods. Therefore, it can be argued that doing so is a novel approach that adds to the body of literature in this field. The present study aims to present a new model that appropriately links the above-mentioned three methods. It is expected that the synergy resulting from the combination of these elements improves classification performance, especially in various medical fields.

The model begins with feature prioritization using ranking techniques that facilitate learning, such as Fisher's discriminant ratio and class separability criteria. Fisher's discriminant ratio orders features by how well they discriminate one class from another (pairwise), whereas the class separability criteria order features by how well they discriminate each class from all other classes. Then, using the optimization capabilities of the genetic algorithm, optimized arrays were produced, with the previously ranked feature arrays serving as the initial population of the genetic algorithm. Afterwards, using the modified k-NN method in parallel with a Developed Back Propagation Neural Network (DBPNN) method, classification was carried out on the optimized feature arrays selected by the genetic algorithm. Finally, a Fuzzy class membership method was applied to integrate the proposed classes and finalize the decision.

The new proposed model was tested with six data sets taken from the University of California Irvine (UCI) machine-learning repository as well as a dataset in the real world called Acute Coronary Syndrome Event — in Kermanshah, Iran (ACSEKI). From these data sets, four were on heart diseases, two on breast cancer and one on diabetes. In addition, the performance of the proposed new hybrid model was compared to some of the well-known classification models.

The rest of this study is organized as follows: section two presents the materials and methods including a brief explanation of each applied approach in the hybrid model; the framework and building process of the proposed hybrid model is described in detail; the model performance assessment process is presented and a detailed plan for the statistical evaluation of the model is provided. The results of the performance evaluation of the proposed model are discussed in section three comparing to some of the well-known classification models (based on seven different data sets) as well as statistical evaluation results. Finally, section four includes the conclusion.

## Material and Methods

In this section, first the attributes of each dataset are explained. Second, a brief review of the concepts and methods of Genetic Algorithms, Backpropagation Neural Networks (BPNNs), fuzzy class memberships, and k-NN is presented. Finally, the proposed model is thoroughly described.

### 2.1 Data sets technical information

To test the proposed hybrid model in this study, widespread and different standard data sets from the real world were used. Among these data sets, four were on heart disease, two on breast cancer and one on diabetes. These data sets, briefly discussed here, are similar in terms of number and type of features, number of classes, and number of missing values.

One of the data sets applied in heart field is ACSEKI. Using the Euro Heart Survey on ACS, designed by the European Society of Cardiology, we registered all admitted patients referred to the Imam Ali hospital, the main center for cardiovascular care in Kermanshah, Iran. While the first Euro Heart Survey of ACS was conducted in 25 countries (in 2000–2001), the second survey involved 32 European countries. For the purpose of this registry, all hospitalized patients diagnosed with ACS during 2010–2011 were included. According to the standard protocol of this registry, all patients with unstable angina as well as those suspected of acute myocardial infarction were included.

A total of 2068 patients were enrolled in the study. They were divided into four groups based on ACS causes: STEMI, NSTEMI, UA, and other. A case report form was completed for each patient by the attending physician. A data collection officer reviewed and checked each form for missing data. Table 1 shows the distribution of ACS causes.

For each subject, 266 clinical factors were collected consisting of numerical and nominal features. Based on numerous interviews with cardiologists, and examining references in relevant medical literature, 26 seminal features for classification of ACS were selected. These factors along with their description and data types are shown in Table 2.

The other six data sets used in this study are taken from the UCI, each taken from a different source. The Cleveland dataset collected in Cleveland Clinic Foundation is designed to determine the presence or the absence of heart disease in individuals based on some of their features. This dataset consists of 75 predictive features in addition to the type of the disease diagnosis feature that must be anticipated for new cases. From these 75 features, based on the expert views of heart specialists, 13 features that were more important in the disease diagnosis were chosen. The list and explanations of these features are presented in Table 3. Similarly, disease diagnosis feature includes five different classes; the first class belongs to healthy individuals while the other four classes belong to people affected with heart disease, according to the disease severity. The sample distribution of these classes is shown in Table 4.

In order to balance the distribution of classes in the Cleveland multi-class dataset, the four classes of individuals with heart disease of different severity in the disease diagnosis feature were merged into one. The result was the Cleveland two-class dataset, whose disease diagnosis feature includes two classes, healthy and sick. The sample distribution of the new classes is presented in Table 5.

The Hungarian dataset is the last dataset in the field of heart disease used in this study. This dataset, which was collected at the Hungarian Institute of Cardiology, Budapest, consists of 249 instances. Its features have the same structure as those of the Cleveland two-class dataset. The sample distribution of its classes is indicated in Table 6.

Another dataset used in this study is the Wisconsin Breast Cancer (WBC) which is provided by the researchers of the Wisconsin university. Between 1989 and 1991, Doctor Wolberg collected a group of digital images taken from fine needle aspirates (FNA) of biopsies from the breast of patients diagnosed with breast cancer. Afterwards, nine features of the images processed were measured. These features describe the characteristics of the cell nucleus shown in the image. The list of these nine features along with the diagnosis feature are illustrated in Table 7. Furthermore, disease diagnosis feature is comprised of two malignant and benign classes whose sample distribution is presented in Table 8. It should be noted that this dataset consists of 699 instances and also has missing values.

Another dataset used in the present study is Wisconsin Diagnostic Breast Cancer (WDBC). This dataset was collected in 1995 by Wolberg et al., also to diagnose breast cancer. To start with, digital images of FNA of biopsies from the breast of patients with breast cancer were collected. Then, ten features were measured using image processing methods to analyze the size, shape, and tissue of each cell nucleus; their description is given in Table 9. Calculating the mean, the standard deviation, and the extreme value (mean of the three most extreme values) for each of these 10 features resulted in 30 predictive features. As the final result of this process, the WDBC dataset includes 569 instances and 30 predictive features along with the disease diagnosis feature. The disease diagnosis feature comprises malignant and benign classes, whose sample distribution is shown in Table 10. It should be pointed out that this dataset has no missing values.

The Pima Indians diabetes data set is the last data set taken from the UCI machine-learning repository. In the Pima database, all patients are Pima-Indian women at least 21 years old living near Phoenix, Arizona, USA. The Pima dataset includes 768 instances and 8 predictive features per instance, along with the disease diagnosis feature, which comprises two classes, healthy and sick. The predictive features along with their explanations are listed in Table 11. Note that the healthy class is labeled “negative to diabetes”, whereas the sick class is labeled “positive to diabetes”; their sample distribution is shown in Table 12.

### 2.2 Genetic algorithms

A genetic algorithm is a random search method, introduced by John Holland and his students in 1975 and based on Darwin's theory of evolution [23]. In this method, following the law of “survival of the fittest”, the process begins with an initial population (a set of candidate responses) and proceeds iteratively to find better responses, guided by a target function that expresses the goal of the problem and indicates the performance of the individuals (responses). The selection of the target function depends on the nature of the problem. In this study, where the goal is more effective classification, the target function is classifier accuracy, which is to be maximized. A simple algorithm for this approach is as follows:

- Step 1: Create the primary population
- Step 2: Evaluate the members of the primary population by fitness function
- Step 3: Select one or more pair of the population probabilistically based on fitness function
- Step 4: Crossover selected pairs from the third step
- Step 5: Randomly mutate some of the newly created members of the fourth step (within the permitted limit of the response set)
- Step 6: Evaluate all of the created population (new generation) based on the fitness function
- Step 7: End the operation in case the algorithm stopping criterion is met, otherwise repeat the operation from the third step
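
The seven steps above can be sketched as a short generic loop. This is only an illustrative sketch, not the authors' implementation: the function names (`run_ga`, `one_point`, `flip_one`), the fixed number of generations used as the stopping criterion, the survivor rule (keep the fittest of parents and children), and the toy bit-counting fitness are all assumptions made for the example.

```python
import random

def run_ga(fitness, population, crossover, mutate, generations=50, rng=None):
    """Generic GA loop following Steps 1-7 (all names are illustrative)."""
    rng = rng or random.Random(0)
    size = len(population)
    pop = list(population)                       # Step 1: primary population
    for _ in range(generations):                 # Step 7: assumed stop rule (fixed generations)
        scores = [fitness(ind) for ind in pop]   # Steps 2 and 6: evaluate members
        weights = [s + 1e-9 for s in scores]     # avoid an all-zero weight vector
        children = []
        while len(children) < size:
            # Step 3: probabilistic, fitness-proportional selection of a pair
            p1, p2 = rng.choices(pop, weights=weights, k=2)
            c1, c2 = crossover(p1, p2, rng)      # Step 4: crossover
            children += [mutate(c1, rng), mutate(c2, rng)]  # Step 5: mutation
        # Assumed survivor rule: keep the fittest among parents and children
        pop = sorted(pop + children, key=fitness, reverse=True)[:size]
    return max(pop, key=fitness)

def one_point(p1, p2, rng):
    """Exchange the tails of two equal-length lists at a random cut point."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def flip_one(ind, rng, rate=0.1):
    """With probability `rate`, flip one randomly chosen bit."""
    if rng.random() < rate:
        i = rng.randrange(len(ind))
        ind = ind[:i] + [1 - ind[i]] + ind[i + 1:]
    return ind

# Toy usage: maximize the number of ones in a 12-bit string.
seed_rng = random.Random(42)
initial = [[seed_rng.randint(0, 1) for _ in range(12)] for _ in range(20)]
best = run_ga(sum, initial, one_point, flip_one, generations=30)
```

In the study itself the fitness would be classifier accuracy rather than a bit count; the loop structure stays the same.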

### 2.3 Backpropagation neural network

One of the efficient methods for solving complicated problems is breaking them down into simpler subproblems that are easier to comprehend and describe. It is these simple structures that, placed next to one another, describe the final complex system of a network. Hebb and Perceptron models are among the simplest yet most efficient arrangements proposed for artificial neural networks, consisting of an input layer, zero or a few hidden layers, and an output layer. In this construct, all the neurons of a layer are linked to all the neurons of the next layer. This arrangement constitutes a fully connected network. Nonetheless, due to the weakness of these networks in learning complex problems, BPNNs (i.e., multilayer, feedforward networks trained by backpropagation) were proposed by Werbos [24] and Rumelhart et al. [25]. These networks strike a good balance between memorization power and generalization power. The general layout of this network is illustrated in Figure 1. Training this network includes three stages: first, calculation of the output corresponding to the input (i.e., the feed-forward phase); second, error calculation and propagation to the previous layers (i.e., the backpropagation phase); and finally, adjustment of the network weights (i.e., the adjustment phase). The mathematical basis of the backpropagation algorithm is an optimization technique called gradient descent. The gradient of a function is the direction in which the function ascends most rapidly, and consequently the negative gradient is the direction in which the function descends most rapidly.
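
To make the three training phases concrete, the following minimal sketch trains a single sigmoid neuron by gradient descent on the logical OR function. It is a deliberately reduced illustration, not the network used in the study: with only one unit there are no earlier layers to propagate to, so the second phase reduces to computing the error signal at the output. The squared-error loss, the learning rate, the epoch count, and the function names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Feed-forward output of one sigmoid neuron."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_neuron(samples, epochs=5000, lr=0.5):
    """Train one sigmoid neuron with per-sample gradient descent."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            # Phase 1 (feed-forward): compute the output for this input
            y = predict(w, b, x)
            # Phase 2 (error signal): derivative of 0.5 * (y - target)^2
            # with respect to the neuron's pre-activation
            delta = (y - target) * y * (1.0 - y)
            # Phase 3 (adjustment): step along the negative gradient
            w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
            b -= lr * delta
    return w, b

# Toy usage: learn the logical OR of two binary inputs.
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_neuron(or_data)
```

In a multilayer network the same `delta` would additionally be passed backwards through the hidden layers by the chain rule before the weights are adjusted.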

### 2.4 Fuzzy class memberships

Fuzzy set theory was first introduced by Lotfi Zadeh in 1965 as a generalization of classical set theory. Following this, he presented a logic by the same name [26]. Uncertainty appears in various forms in all fields and phenomena. Moreover, there are many inexact concepts around us that we refer to daily in various forms and for which no precise quantity can be measured. Based on deductive thinking and considering various factors, the human brain defines and evaluates statements in a way that is difficult to translate into mathematical language and formulas. Fuzzy logic is a relatively new approach proposed to overcome these increasing complexities and to design and model systems that would otherwise require complex and advanced mathematics, using linguistic variables and expert knowledge. It is an approach to problem solving that is far closer to human methods of thinking and learning. In keeping with this approach, many studies have been carried out on pattern recognition and decision analysis [27]–[30].

In the realm of classification, too, final decisions can be made based on this approach. Thus, in the present study, the final decision in determining the class of a new instance is made according to this approach by producing an array of class membership probabilities, which gives the degree to which the instance is allocated to each specific class. Next, by integrating the resulting class membership arrays, a final decision is reached with a fuzzy strategy. This approach is explained in detail in the proposed model section [31].

### 2.5 k-Nearest Neighbor Algorithm (k-NN)

One of the simplest machine-learning algorithms, in terms of implementation and understanding, is k-NN. Because it gives good results on many classification problems despite its simplicity, this non-parametric method is still extremely popular in many research fields [32]. No training takes place in this method; all of the training data are simply memorized. Therefore, it is known as an instance-based and lazy method [33]. To classify a new instance, first the similarity, that is, the distance between this instance and every instance in the training set, is determined. Then, the k instances in the training set closest to the new instance are chosen. Among these k selected instances, the class with the highest absolute frequency is taken as the class of the new instance. Three key elements can be observed in this algorithm: a set of classified instances as the training set, a criterion for calculating the distance or similarity between instances (usually the Euclidean distance), and the value of k [3], [34].
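
The procedure just described can be sketched in a few lines. This is a minimal illustration using the Euclidean distance; the function name `knn_predict` and the toy data are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training instances."""
    # Sort training indices by Euclidean distance to the new instance
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    # Count the class labels of the k closest instances
    votes = Counter(train_y[i] for i in order[:k])
    # The most frequent class among the k neighbors is the prediction
    return votes.most_common(1)[0][0]

# Toy usage with two well-separated groups of points.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
label_near_origin = knn_predict(train_X, train_y, (0.2, 0.1), k=3)
label_far = knn_predict(train_X, train_y, (5.5, 5.5), k=3)
```

There is no fitting step: the training set is stored as is, and all the work happens at prediction time, which is why the method is called lazy.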

The performance of the k-NN algorithm can be improved by allocating a weight to each of the k nearest neighbors. This weight is chosen based on the distance of each neighbor from the new (observed) instance to be classified, and usually has an inverse relationship with this distance. With such weights, it becomes possible to use all training instances for classification instead of only the k nearest ones. This method is called the distance-weighted k-NN algorithm and is effectively applied to many practical inductive inference problems. It is robust to noise and efficient when there is a large amount of training data [33].
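
A minimal sketch of the distance-weighted variant follows. The weight 1/(d + eps) is one common inverse-distance choice and is an assumption here, since the text only states that the weight is inversely related to distance; the function name and the toy data are likewise illustrative.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, x, k=None, eps=1e-9):
    """Distance-weighted vote; k=None uses every training instance."""
    neighbors = sorted(
        ((math.dist(p, x), label) for p, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    if k is not None:
        neighbors = neighbors[:k]
    scores = defaultdict(float)
    for d, label in neighbors:
        # Assumed weighting: closer neighbors contribute more to their class
        scores[label] += 1.0 / (d + eps)
    return max(scores, key=scores.get)

# Toy usage: one very close "a" instance against two more distant "b" instances.
X = [(0, 0), (3, 0), (3, 1)]
y = ["a", "b", "b"]
weighted_label = weighted_knn_predict(X, y, (0.5, 0))
```

In this toy case a plain majority vote over the three instances would return "b", whereas the weighted vote returns "a" because the single "a" instance is much closer to the query.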

### 2.6 Proposed model

The proposed novel hybrid model is created by combining artificial intelligence methods, including genetic algorithms, DBPNN, modified k-NN, and fuzzy class membership, with some pattern recognition methods in order to improve classification accuracy. The implementation stages of the suggested model are presented in Figure 2. The full details of these stages are discussed below. Taking advantage of these benefits is one rationale behind using this algorithm as a part of the proposed hybrid model in the current study.

- Step 1. Pre-processing: First, pre-processing is performed on all the data. That is, all quantitative features of the dataset, except the variable that identifies the disease class (the response variable), are normalized. It should be pointed out that the dataset is considered as n records, each m-dimensional, that is, each record includes m features.
- Step 2. Medical feature selection: Based on expert views of skilled specialists, features that are important for disease diagnosis are selected.
- Step 3. Establishing prioritized feature arrays in different viewpoints:

The basic “majority voting” rule is considered a fundamental weakness of the conventional k-NN algorithm [35]. When the class distribution is skewed, samples of a more frequent class tend to dominate the prediction for an unclassified sample [36], because their large number makes them common among the k nearest neighbors. In this stage, to overcome this issue (i.e., the inherent bias toward the majority class), the features defined in the previous step are prioritized in different ways using two pattern recognition methods.

The rationale behind this strategy is that, in the present study, the nearest neighbors are found from different viewpoints (i.e., based on each of the generated prioritized feature arrays), which can greatly reduce this deficiency. Different combinations of features result in different feature arrays, each with its own advantages. Ergo, for a new test instance vector, each feature array can lead to different nearest neighbors. These feature arrays can therefore be seen as nearest-neighbor finders that are independent of each other, and together they help overcome the majority-voting difficulty.

The “Fisher discriminant ratio” [37], a separability criterion, is derived from Fisher discriminant analysis. Owing to its good class separability capability, it is a popular approach that is widely used in many pattern recognition problems; for example, see [38]–[40]. The Fisher discriminant ratio produces a prioritized feature array according to the potential of each feature to discriminate between two specific classes. For instance, for a dataset with three classes c1, c2, and c3, features can be prioritized in three different ways. In the first, features that best discriminate classes c1 and c2 in a linear fashion receive higher priority. The other two feature arrays are prioritized in the same way for discriminating c1 from c3 and c2 from c3. In general, the Fisher discriminant ratio method builds one ranked array of features per pair of classes, i.e., c(c−1)/2 arrays for a c-class dataset.
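
The pairwise ranking can be sketched as follows. The text does not state the formula it uses, so this sketch assumes the common per-feature definition FDR = (μ_a − μ_b)² / (σ_a² + σ_b²) for classes a and b; the function name `fdr_rankings`, the small constant guarding against division by zero, and the toy data are also assumptions.

```python
from itertools import combinations
from statistics import mean, pvariance

def fdr_rankings(X, y):
    """Return, for every pair of classes, feature indices ranked by FDR."""
    classes = sorted(set(y))
    n_features = len(X[0])
    rankings = {}
    for a, b in combinations(classes, 2):          # one array per class pair
        scores = []
        for j in range(n_features):
            col_a = [row[j] for row, lab in zip(X, y) if lab == a]
            col_b = [row[j] for row, lab in zip(X, y) if lab == b]
            # Assumed definition: squared mean gap over summed variances
            num = (mean(col_a) - mean(col_b)) ** 2
            den = pvariance(col_a) + pvariance(col_b) + 1e-12
            scores.append(num / den)
        # Highest ratio first: the most discriminative feature leads the array
        rankings[(a, b)] = sorted(range(n_features),
                                  key=lambda j: scores[j], reverse=True)
    return rankings

# Toy usage: feature 0 separates class 1 from 2, feature 1 separates class 3.
X = [[0.0, 5.0], [0.2, 5.1],
     [3.0, 5.0], [3.2, 5.2],
     [0.1, 9.0], [0.3, 9.1]]
y = [1, 1, 2, 2, 3, 3]
ranks = fdr_rankings(X, y)
```

For the three classes in this toy example the function returns three arrays, one for each of the pairs (1, 2), (1, 3), and (2, 3), matching the three-class example in the text.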

The other method used in this study for feature prioritization is the “Class Separability Criteria” method. In this method, feature arrays are prioritized/ranked according to the potential of each feature to separate a specific class from all other classes. This method therefore creates one ranked array of features for each class under each criterion. The Class Separability Criteria can prioritize features using five different criteria, namely the t-test, relative entropy, Chernoff bound, Mann-Whitney test, and receiver operating characteristic (ROC) curve (i.e., these criteria quantify the relationship between each feature and the desired output).

Assuming a dataset with four distinct classes, the Fisher discriminant ratio yields six arrays (one per pair of classes). Each of the five class separability criteria yields four arrays (one per class). In total, 6+(5×4) = 26 arrays are built. At the end of this step, feature reduction is performed, i.e., about one third of the elements of each array, those with the lowest separability potential, are eliminated.
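
The array count and the reduction step can be checked with a short sketch. The helper names are hypothetical, and because the text says only "about one third", the rounding rule used in `drop_weakest_third` is an assumption.

```python
from math import comb

def n_initial_arrays(n_classes, n_criteria=5):
    """Pairwise Fisher arrays plus one array per class for each criterion."""
    return comb(n_classes, 2) + n_criteria * n_classes

def drop_weakest_third(ranked_features):
    """Keep the top of a ranked array, dropping roughly the last third."""
    n_drop = round(len(ranked_features) / 3)   # assumed rounding rule
    return ranked_features[:len(ranked_features) - n_drop]

# Four classes give 6 + (5 x 4) = 26 arrays, as in the text.
total_arrays = n_initial_arrays(4)
# A ranked array of 13 features keeps its 9 top-ranked entries.
kept = drop_weakest_third(list(range(13)))
```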

- Step 4. Generating optimized features' arrays by genetic algorithm:

Based on the feature arrays obtained in the third step, we attempted to create new generations of arrays with a higher ability in classification tasks. A GA was adopted to accomplish this goal. Since this algorithm was explained in section 2.2, only its implementation is discussed here. It should be mentioned that choosing a good initial population is a challenge in GA [41]. Therefore, the feature arrays ranked and reduced by the different methods (i.e., Fisher discriminant ratio and Class Separability Criteria) can provide an acceptable initial population for the GA. Accordingly, the feature arrays produced in step 3 are taken as the initial population of this algorithm. Then, members of the initial population are evaluated by the fitness function (or target function). Since the goal in the next stage is to apply the results of this genetic algorithm in a k-NN classification so as to increase its accuracy, the most suitable fitness function for evaluating the members of the population is the k-NN classifier itself. Thus, the k-NN method performs the classification based on each feature array, and the efficiency of the k-NN approach for an array determines that array's fitness.

Next, weighted random selection is performed on the members of the population using the Roulette wheel approach, with weights determined by the results of the fitness function. The crossover operator is then applied to each pair of feature arrays selected as parents to produce the children of the new generation. Among the various crossover methods, one-point crossover was used in this study; that is, for each pair of parent arrays, the parts of the two arrays beyond a randomly selected point are exchanged. If a specific feature appears in both parent arrays, the children of the new generation may contain repeated features after crossover. The mutation operator (which randomly replaces a specific feature with a value within its permissible limit) is used to solve this problem. These steps complete the first iteration of the algorithm, and a new generation is created that combines the children with the members of the initial population that had a higher fitness. With this generation serving as the population of the next iteration, the algorithm continues until the stop condition is reached. At the end of the genetic algorithm, the process of reducing and selecting the features is, in effect, finished, and the results are the best feature arrays, which assist the classification in the next step.
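The loop described in step 4 can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: a toy fitness function stands in for the k-NN accuracy, the stop condition is assumed to be a fixed number of generations, and all function names are hypothetical.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Weighted random selection: probability proportional to fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]

def one_point_crossover(parent_a, parent_b, rng):
    """Exchange the tails of two equal-length arrays at a random cut point."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def repair_by_mutation(child, all_features, rng):
    """Replace each repeated feature with a randomly chosen unused feature."""
    unused = [f for f in all_features if f not in child]
    rng.shuffle(unused)
    seen, repaired = set(), []
    for feature in child:
        if feature in seen:
            feature = unused.pop()
        seen.add(feature)
        repaired.append(feature)
    return repaired

def evolve(population, fitness_fn, all_features, generations=20, seed=0):
    """Run a fixed number of generations (assumed stop condition)."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        fitnesses = [fitness_fn(member) for member in population]
        children = []
        while len(children) < size:
            parent_a = roulette_select(population, fitnesses, rng)
            parent_b = roulette_select(population, fitnesses, rng)
            for child in one_point_crossover(parent_a, parent_b, rng):
                children.append(repair_by_mutation(child, all_features, rng))
        # Keep the fittest members among parents and children.
        merged = sorted(population + children, key=fitness_fn, reverse=True)
        population = merged[:size]
    return population

# Toy fitness standing in for k-NN accuracy: features 0, 1 and 2 are "useful".
def toy_fitness(array):
    return 1 + len(set(array) & {0, 1, 2})

initial = [[3, 4, 5, 6], [0, 7, 8, 9], [1, 4, 6, 8], [2, 5, 7, 9]]
final = evolve(initial, toy_fitness, all_features=list(range(10)))
print(final[0], toy_fitness(final[0]))
```

Because the fittest parents survive into the next generation, the best fitness in this sketch can never decrease from one generation to the next.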

- Step 5. Classifying new instance vector using modified k-NN

The classification is implemented using the k-NN method with a modification in its algorithm, namely in the final decision-making step that determines the class label of the new instance. To classify a new instance vector $\mathbf{x}$, as in the conventional k-NN method, the distance (usually Euclidean) of this instance to all the instances of the training set (whose class labels are known) is first calculated. Second, these calculated distances are sorted. Then, the k closest instances/neighbors of the new instance in the training set are selected. In conventional k-NN, the final decision about the class label of the new instance vector is made by the majority vote of these k closest neighbors. In the present study, this step of k-NN is improved by allocating a Fuzzy class membership to the new instance/input vector (i.e., incorporating Fuzzy set theory (Fuzzy Logic) instead of crisp set theory (Boolean Logic) in k-NN). That is, an array with a dimension equal to the number of classes is built (i.e., the “Fuzzy class membership array”), and the votes of every class are inserted into the array (without applying the majority vote). In effect, the Fuzzy class membership array is calculated through equation 1:

$$\mu_i(\mathbf{x}) = \frac{1}{k}\sum_{j=1}^{k} I\big(y(X_j) = i\big) \tag{1}$$

where k is the number of nearest neighbors, $X_j$ is the j-th nearest neighbor of $\mathbf{x}$ in the k-NN method, $y(X_j)$ is its class label, and $I(\cdot)$ is the indicator function. Hence, this array indicates the degree of belongingness of the new instance to each class (i.e., the number of neighbors belonging to each class divided by k). In this stage, the class membership array is calculated for each of the feature arrays output by the GA. Subsequently (in steps 8 and 9), these arrays are integrated with the Fuzzy class membership array derived from DBPNN to predict the final class label of the new instance $\mathbf{x}$.
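A minimal sketch of the Fuzzy class membership array of step 5 is given below. It assumes Euclidean distance and toy data; the function name `fuzzy_knn_membership` and the optional `features` argument (the indices of one GA feature array) are hypothetical.

```python
from math import dist

def fuzzy_knn_membership(x, train_X, train_y, classes, k, features=None):
    """Share of the k nearest training neighbours of x in each class."""
    if features is not None:
        # Restrict the comparison to the features of one GA output array.
        x = [x[i] for i in features]
        train_X = [[row[i] for i in features] for row in train_X]
    order = sorted(range(len(train_X)), key=lambda j: dist(x, train_X[j]))
    nearest_labels = [train_y[j] for j in order[:k]]
    # No majority vote: every class keeps its share of the k votes.
    return [nearest_labels.count(c) / k for c in classes]

train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
train_y = ["a", "a", "b", "b", "b"]
membership = fuzzy_knn_membership((0.2, 0.2), train_X, train_y, ["a", "b"], k=3)
print(membership)   # two of the three nearest neighbours belong to class "a"
```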

- Step 6. Introducing Dynamic BackPropagation Neural Network

The main aim of this phase is to introduce an improved neural network that carries out the classification process in parallel with the k-NN method modified in the previous step. Because these two classification approaches are constructed differently, the synergy between them is expected to improve the classification. The presented model is a dynamic neural network that is called the DBPNN in this study. It differs from the traditional BPNN in that the transfer function is made dynamic in each epoch, so that the learning speed and accuracy increase remarkably. The log-Sigmoid (logistic) and tan-Sigmoid functions are the most common transfer functions used in BPNNs. Owing to their desirable characteristics, these functions have shown an appropriate performance in feedback neural networks. One of their benefits is that their derivative can be expressed in terms of the function itself, that is (Equations 2–3):

$$f(x) = \frac{1}{1 + e^{-x}} \tag{2}$$

$$f'(x) = f(x)\,\big(1 - f(x)\big) \tag{3}$$

The graph and the derivative of this function are shown in Figure 3.

To increase the learning speed and accuracy, the active domain is made dynamic in this study. For the logistic function, the domain is (−∞, ∞) and the range is [0, 1], but its active domain is limited to the interval [−4, 4]. In other words, this function takes a value close to 0 for inputs in (−∞, −4) and a value close to 1 for inputs in (4, ∞), with a maximum error of 0.018; that is, ignoring this error, the output equals 0 or 1.
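The error level of 0.018 quoted for the edges of the active domain can be verified numerically. The short check below is only an illustration; `logistic` is a hypothetical helper name.

```python
import math

def logistic(x: float) -> float:
    """Standard logistic (log-Sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-4.0, 0.0, 4.0):
    print(f"logistic({x:+.0f}) = {logistic(x):.4f}")
# logistic(-4) = 0.0180
# logistic(+0) = 0.5000
# logistic(+4) = 0.9820
```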

In the new method, DBPNN, the parameters of the sigmoid function are modified in each epoch so that the active domain corresponds to the network inputs and weights. This goal is reached step by step through suitable modifications. As mentioned above, the range of the log-Sigmoid function is the interval [0, 1]. We now define a function with a log-Sigmoid kernel whose range is [−1, 1]. To this end, we must define a map from [0, 1] to [−1, 1]. Assuming that this map is a linear modifier, we can consider , and consequently we have equations 4–6. (4)where; (5)(6)where;

As mentioned previously, the active domain of the log-Sigmoid function is [−4, 4]. With an appropriate change of variable, we now define another function whose active domain can be any desired range [m, M]; in other words, any desired range [m, M] is mapped onto [−4, 4] (Equations 7–8). (7)(8)

Comparing equations 6 and 8 yields equations 9–13.

The derivative of function f (.) is presented in equations 14–18.(14)(15)(16)(17)(18)

As can be seen, a function with a log-Sigmoid kernel is proposed whose active domain can be changed dynamically to any desired range [m, M]. In addition, the derivative of this function can be expressed in terms of the log-Sigmoid function. The next goal is to obtain a function whose output range can be changed dynamically to any desired range [a, b]. To this end, that is, to transfer the range of the function from [−1, 1] to [a, b], equations 19–22 are presented.(19)(20)(21)(22)

Based on the above, we now combine these results to propose a function with a log-Sigmoid kernel whose active domain is any desired range [m, M] instead of [−4, 4] and whose output range is any desired range [a, b] instead of [0, 1], that is:

Assume that we have equations 23–25:(23)(24)(25)

The transferred function is (Equations 26–28):(26)(27)(28)

The first derivative of this function is obtained through (Equations 29–31):(29)(30)(31)

Therefore, the final function with the log-Sigmoid kernel (in effect, the log-Sigmoid function with new parameters), which maps the range [m, M] to the range [a, b], is shown together with its derivative in equations 32 and 33. (32)(33)
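To make the idea of a transfer function with an adjustable active domain and output range concrete, the sketch below uses one possible parameterization. It is an assumption, not necessarily the exact form of equations 32 and 33: the input range [m, M] is mapped affinely onto [−4, 4], and the logistic output is rescaled to [a, b], i.e. $f(x) = a + (b-a)\,\sigma\big(8(x-m)/(M-m) - 4\big)$. Under this form the derivative can be written in terms of f itself: $f'(x) = 8\,(f-a)(b-f)/\big((b-a)(M-m)\big)$. The function names are hypothetical.

```python
import math

def dynamic_logistic(x, m, M, a=-1.0, b=1.0):
    """Logistic kernel with active domain [m, M] and output range [a, b].

    Assumed form: affine map of [m, M] onto [-4, 4], then rescaling of
    the logistic output from [0, 1] to [a, b].
    """
    u = 8.0 * (x - m) / (M - m) - 4.0
    return a + (b - a) / (1.0 + math.exp(-u))

def dynamic_logistic_derivative(x, m, M, a=-1.0, b=1.0):
    """Derivative expressed through the function value itself."""
    f = dynamic_logistic(x, m, M, a, b)
    return 8.0 * (f - a) * (b - f) / ((b - a) * (M - m))

m, M = -10.0, 30.0
centre = (m + M) / 2.0
print(dynamic_logistic(centre, m, M))             # midpoint of [a, b]
print(dynamic_logistic_derivative(centre, m, M))  # largest slope, at the centre
```

In this parameterization the slope peaks at the centre of [m, M], which matches the property discussed in the text that the derivative is maximal in the centre of the definition range.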

The shape of this dynamic logistic function and its derivative for different inputs and outputs are shown in Figures 4 and 5.

This final modified function is dynamic in the sense that it yields a Sigmoid-type function for any specified input and output ranges. Additionally, the derivative of this function can be calculated from the function value at every point. Another benefit of this function is that its derivative is maximal at the center of its definition range, which can be very important for neural networks. This is because, for values farthest from the real output value, the network takes the longest steps, resulting in quicker learning. It should be mentioned that in common networks, if the step length is large, the network cannot regulate the weights precisely and a large error will always exist. Conversely, if the step length is small, the possibility of becoming trapped in a local minimum increases, and learning in this local minimum is very slow. Therefore, functions such as the log-Sigmoid or the tan-Sigmoid, whose derivative increases in the case of high error, were introduced.

A criticism of static backpropagation networks is that the regular transfer functions they use (such as tan-Sigmoid and/or log-Sigmoid) function well only in a limited part of their domain, known as the active range. Outside this range, the network output is nearly constant and the derivative is very close to zero.

Using this dynamic function as the transfer function of the network can be a major step toward removing this deficiency: in each epoch, an appropriate active range [m = min, M = max] is determined for each transfer function according to the network weights, and this range is mapped to [−1, 1]. We have used this neural network in our algorithm to carry out the classification in parallel with the modified k-NN method. In this neural network, the features that expert specialists consider important in the diagnosis of the disease (results of step 2) are taken as input, and the degree of belongingness of a new instance vector to every specific class is the output.

- Step 7. Assigning Fuzzy class membership array to new instance vector in DBPNN

It should first be mentioned that the proposed hybrid classifier has been designed so that, like some other classifiers (e.g., Naive Bayes), it assigns to each new case a probability or score that indicates its degree of belongingness to each specific class (i.e., Fuzzy class membership). In other words, this classifier is a scoring or ranking classifier and not a discrete one. The “Fuzzy class membership array” is an array with a dimension equal to the number of classes. This array indicates the degree of belongingness of the new instance vector to each specific class. For instance, assume that in a classification problem with five classes, the proposed classifier assigns to a new case the belongingness degrees shown in Table 13. As these belongingness degrees are independent, they do not necessarily sum to one.

- Step 8. Removing Fuzzy class membership arrays whose highest belongingness degree is small

In this and the next step, one of the classes is selected from the “Fuzzy class memberships” as the predicted class for the new instance vector. To this end, consider a classification problem with “c” classes: if the highest belongingness degree in a class membership array is less than 3/(2×c), this array is not used in decision making. This is because a class membership array whose highest belongingness degree is small (i.e., less than 3/(2×c)) probably does not have good decision-making power, which could stem from an inappropriate selection of neighbors.

- Step 9. Assigning a class label to new instance vector

In this step, the defuzzification operation, i.e., changing the output from a Fuzzy form to a crisp form, is carried out. The final decision on the class label of the instance vector is reached by integrating the class membership arrays obtained from the DBPNN and k-NN classifiers. In order to perform the integration in the best possible way, a weight is applied to this process based on the number of class membership arrays created by k-NN (i.e., a weighted average is adopted). Various methods have been proposed for integration in Fuzzy theory. In this study, the predicted class label of the new instance is the class that has the maximum total belongingness degree. Therefore, the degree of belongingness to each specific class is averaged over all the class membership arrays.
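Steps 8 and 9 can be sketched together as follows. The 3/(2×c) threshold comes from the text; the exact weighting scheme is not spelled out there, so the sketch assumes that the single DBPNN array receives a weight equal to the number of retained k-NN arrays. The names `is_informative`, `predict_label`, and `dbpnn_weight` are hypothetical.

```python
def is_informative(membership, n_classes):
    """Step 8: keep an array only if its largest degree reaches 3 / (2c)."""
    return max(membership) >= 3.0 / (2.0 * n_classes)

def predict_label(knn_arrays, dbpnn_array, classes, dbpnn_weight=None):
    """Step 9: weighted average of the retained arrays, then arg max."""
    c = len(classes)
    kept_knn = [arr for arr in knn_arrays if is_informative(arr, c)]
    if dbpnn_weight is None:
        # Assumption: the DBPNN array counts as much as all k-NN arrays together.
        dbpnn_weight = max(len(kept_knn), 1)
    weighted = [(1.0, arr) for arr in kept_knn]
    if is_informative(dbpnn_array, c):
        weighted.append((float(dbpnn_weight), dbpnn_array))
    if not weighted:
        return None  # no array passed the threshold
    total = sum(w for w, _ in weighted)
    averaged = [sum(w * arr[i] for w, arr in weighted) / total for i in range(c)]
    return classes[averaged.index(max(averaged))]

knn_arrays = [[0.6, 0.2, 0.2], [0.4, 0.3, 0.3], [0.2, 0.7, 0.1]]
dbpnn_array = [0.1, 0.8, 0.3]
label = predict_label(knn_arrays, dbpnn_array, ["A", "B", "C"])
print(label)   # the second k-NN array is dropped (0.4 < 0.5); class "B" wins
```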

### 2.7 Performance assessment

#### 2.7.1 Model validation.

Model validation is possibly the most important step in the model building sequence. There are various resampling-based model validation methods, with cross-validation being the most popular [42]. In the process of model construction, model selection, and model validation, cross-validation assesses a model based on error/accuracy-rate estimation (i.e., estimation of the generalization error/accuracy). In the current study, the repeated random sub-sampling cross-validation method was chosen from among the available cross-validation methods (e.g., holdout, k-fold, and leave-one-out cross-validation) [43], [44].

In this approach, the dataset is split into a training set and a test set. The training set is used to find the model parameters of interest (i.e., model construction and selection), while the test set is used to evaluate the generalization performance of the final classifier/model. The train–test process is repeated several times using different random samples, which is a common way to reduce bias. Finally, the estimate of the overall error/accuracy rate is obtained by averaging the separate error/accuracy rate estimates produced in the different iterations [45].
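The repeated random sub-sampling procedure can be sketched as below. The toy threshold model, the 30% test fraction, and the ten repetitions are assumptions chosen for illustration only; the function names are hypothetical.

```python
import random

def repeated_random_subsampling(data, fit, accuracy,
                                n_repeats=10, test_fraction=0.3, seed=0):
    """Average the test accuracy over several random train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = list(data)
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(accuracy(fit(train), test))
    return sum(scores) / len(scores)

def fit_threshold(train):
    """Toy model: threshold halfway between the two class means."""
    neg = [x for x, y in train if y == 0]
    pos = [x for x, y in train if y == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

def threshold_accuracy(threshold, samples):
    """Fraction of samples whose side of the threshold matches the label."""
    return sum((x >= threshold) == bool(y) for x, y in samples) / len(samples)

data = [(x, int(x >= 50)) for x in range(100)]
estimate = repeated_random_subsampling(data, fit_threshold, threshold_accuracy)
print(f"estimated accuracy: {estimate:.3f}")
```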

In pattern recognition problems, cross-validation can help prevent two fundamental problems: (i) overfitting of the final model (i.e., the final model is unable to generalize to unseen data) and (ii) an overly optimistic error rate estimate (i.e., one that is lower than the true error rate). On the other hand, if the classification model is to be selected and the error/accuracy rate estimated simultaneously, a three-way data split procedure should be applied during the cross-validation process. That is, the data should be divided into three disjoint (non-overlapping) sets, namely the training, validation, and test sets.

In this approach, the training set is used for learning, i.e., to optimize the tuning parameters of the model/classifier (e.g., in MLP, to determine the optimal weights and biases with the back-propagation rule). The validation set is used to optimize the structural/regularization parameters of the model/classifier (e.g., in MLP, to determine the optimal number of hidden units and a stopping point for the algorithm). The test set is used only to estimate the error/accuracy rate (i.e., to assess the performance) of the final model. In other words, testing starts only once both the tuning and the regularization parameters of the model/classifier have been optimized. The procedure using a three-way data split is as follows (Dougherty [42]):

- Step i: Divide the available data set into training, validation, and test sets.
- Step ii: Choose the architecture and training parameters.
- Step iii: Train the model using the training set.
- Step iv: Assess the model using the validation set.
- Step v: Repeat steps ii–iv using different architectures and training parameters.
- Step vi: Select the best model as the final model i.e. based on the estimate of the overall error/accuracy rate on validation set.
- Step vii: Evaluate the final model using the test set.

It should be noted that this procedure is based on the holdout method. If another cross-validation method is utilized, steps iii and iv should be repeated k times (i.e., k is the number of folds in the k-fold method or the number of repetitions in the random sub-sampling method).
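Steps i to vii of the holdout-based three-way split can be sketched as follows. A one-dimensional k-NN with candidate values of k stands in for "architecture and training parameters"; the data, the split sizes, and all names are assumptions made for this illustration.

```python
import random

def fit(train, k):
    """Toy 1-D k-NN: 'training' only stores the data and the parameter k."""
    return train, k

def predict(model, x):
    train, k = model
    nearest = sorted(train, key=lambda sample: abs(sample[0] - x))[:k]
    return int(2 * sum(label for _, label in nearest) > k)  # majority vote

def accuracy(model, samples):
    return sum(predict(model, x) == y for x, y in samples) / len(samples)

def select_and_evaluate(candidates, train, validation, test):
    best = None
    for k in candidates:                        # steps ii and v
        model = fit(train, k)                   # step iii
        val_acc = accuracy(model, validation)   # step iv
        if best is None or val_acc > best[1]:   # step vi
            best = (model, val_acc)
    # step vii: the test set is touched only once, by the selected model
    return best[0][1], best[1], accuracy(best[0], test)

rng = random.Random(0)
data = [(x, int(x >= 30)) for x in range(60)]
rng.shuffle(data)
train, validation, test = data[:36], data[36:48], data[48:]   # step i
best_k, val_acc, test_acc = select_and_evaluate([1, 3, 5], train, validation, test)
print(best_k, val_acc, test_acc)
```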

#### 2.7.2 Performance evaluation criteria.

Performance evaluation is one of the major steps in introducing a new classification model. The performance evaluation criteria are usually built from a confusion matrix and can be categorized into two major approaches: numerical and graphical. Numerical approaches summarize and quantify the performance of a final classifier (i.e., a fully trained model) in a single number, whereas graphical approaches portray the performance in a plot. In this study, numerical approaches have been adopted as performance evaluation criteria. A confusion matrix or contingency table is a C×C matrix/table in which each row indicates the actual class and each column indicates the predicted class. The diagonal elements of a confusion matrix indicate how many subjects have been classified correctly, whereas the off-diagonal elements indicate how many subjects have been misclassified. Therefore, larger diagonal elements and smaller off-diagonal elements reflect a higher level of classification power, and vice versa.

#### Binary-class performance evaluation criteria.

Various numerical criteria described in the literature are used to quantify the performance of a binary-class classifier (i.e., to measure the generalization capability of classifiers). The conventional data layout of the 2×2 confusion matrix used to calculate these numerical criteria is shown in Table 14. Here, TP and TN stand for the number of positive and negative examples, respectively (i.e., sick and healthy people), that are classified correctly, while FP and FN stand for the number of negative and positive examples, respectively, that are classified falsely (i.e., healthy people classified as sick and sick people classified as healthy).

On the basis of the results of Table 14, it is possible to derive many of the classification performance metrics defined in Figure 6 [46].

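Accuracy is the rate of correctly classified subjects. Its appeal lies in presenting a single summary estimate of the overall effectiveness of a classifier, which has made it the most commonly used measure for this purpose. However, in classification problems with imbalanced classes, accuracy is not a proper criterion [47], because the overall accuracy varies with the class frequencies (i.e., disease prevalence), as shown in equations 34–36:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{34}$$

$$\text{Accuracy} = \frac{TP}{TP + FN}\cdot\frac{TP + FN}{N} + \frac{TN}{TN + FP}\cdot\frac{TN + FP}{N} \tag{35}$$

$$\text{Accuracy} = \text{Sensitivity}\times\text{Prevalence} + \text{Specificity}\times(1 - \text{Prevalence}) \tag{36}$$

where Prevalence = (TP+FN)/N and (1−Prevalence) = (TN+FP)/N, with N = TP+TN+FP+FN.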

Accordingly, classification accuracy alone is not sufficient for evaluating the performance of classification models [48]. Sensitivity denotes the percentage of actual positive cases (i.e., sick subjects) correctly recognized by the classifier, whereas specificity denotes the percentage of actual negative cases (i.e., healthy subjects) correctly recognized by the classifier. These two criteria thus quantify the classifier's performance on the different classes. The Negative Predictive Value (NPV) is the proportion of predicted negatives that are actual negatives. The precision (i.e., positive predictive value) is the proportion of predicted positives that are actual positives. Precision and recall usually trade off against each other: as precision increases, recall usually decreases, and vice versa. The F-measure criterion, which takes both precision and recall into account, is the harmonic mean of these two [46]. The values of the F-measure lie in the interval [0, 1], and larger values denote higher classification performance. The MCC is a correlation coefficient based on the true and false positives and negatives, which can be used as a performance evaluation criterion in binary-class classification [49]. It returns a value between −1 and +1, where −1, 0, and +1 indicate the worst possible classification, a completely random classification, and a perfect classification, respectively.
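The binary-class criteria discussed above can be computed from the four cells of the confusion matrix as in the sketch below. The counts are made up for illustration, the function name `binary_metrics` is hypothetical, and the code assumes that no denominator is zero.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Common binary-class criteria from a 2x2 confusion matrix."""
    n = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)          # recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)            # positive predictive value
    npv = tn / (tn + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "f_measure": f_measure,
        "mcc": mcc,
        "prevalence": (tp + fn) / n,
    }

m = binary_metrics(tp=40, tn=45, fp=5, fn=10)
# Accuracy as a prevalence-weighted mix of sensitivity and specificity.
mix = m["sensitivity"] * m["prevalence"] + m["specificity"] * (1 - m["prevalence"])
print(m["accuracy"], round(mix, 10), round(m["f_measure"], 4), round(m["mcc"], 4))
```

The second printed value equals the first, which illustrates why accuracy shifts with prevalence even when sensitivity and specificity stay fixed.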

#### Multi-class performance evaluation criteria.

Performance evaluation criteria for multi-class classification are still an open research topic, because most of the criteria in the literature were originally designed only for binary-class tasks, although significant efforts have been made to extend them in the last few years [50]–[52]. These efforts have produced expanded forms of the criteria, adapted to multi-class classification through one of two strategies: one vs. one and one vs. rest. In the one vs. one strategy, the performance is evaluated by measuring the capability to discriminate the subjects of one class (i.e., considered as the positive or reference class) from the subjects of another class (i.e., considered as the negative class). In the second strategy, one vs. rest (i.e., one vs. all), the performance is assessed by measuring the capability to discriminate the subjects of one class (i.e., considered as the positive or reference class) from the subjects of all the other classes (i.e., considered as the negative class). The one vs. one and one vs. rest strategies produce a separate 2×2 confusion matrix for each “pair of classes” and for “each class”, respectively, with a corresponding set of values of the classification performance criteria. Ultimately, the desired criteria are obtained by combining the results appropriately. It should be noted that the one vs. one strategy was used in the present study.

In this section, the extended forms of some of the performance measures for multi-class classification problems are briefly defined. As mentioned earlier, the F-measure is the harmonic mean of recall and precision. In multi-class problems, $P_i$, $R_i$ and $F_i$ stand for the precision, recall, and F-measure for class i (i.e., the reference class), respectively, and are defined in Equations 37–39:

$$P_i = \frac{TP_i}{TP_i + FP_i} \tag{37}$$

$$R_i = \frac{TP_i}{TP_i + FN_i} \tag{38}$$

$$F_i = \frac{2\,P_i\,R_i}{P_i + R_i} \tag{39}$$

Here, $TP_i$ stands for the number of examples from class “i” (i.e., the reference class) that are correctly classified to class “i”, $FN_i$ stands for the number of examples from class “i” that are falsely classified to another class, and $FP_i$ stands for the number of examples that are falsely classified to class “i”. Ultimately, the overall precision, recall, and F-measure of the multi-class classification problem can be obtained by two different kinds of average, namely the micro-average and the macro-average. The micro-average precision (P-micro), micro-average recall (R-micro), and micro-average F-measure (F-micro) are computed by Equations 40–42, respectively, where c is the number of classes. F-micro is the harmonic mean of P-micro and R-micro.

$$P_{micro} = \frac{\sum_{i=1}^{c} TP_i}{\sum_{i=1}^{c}\left(TP_i + FP_i\right)} \tag{40}$$

$$R_{micro} = \frac{\sum_{i=1}^{c} TP_i}{\sum_{i=1}^{c}\left(TP_i + FN_i\right)} \tag{41}$$

$$F_{micro} = \frac{2\,P_{micro}\,R_{micro}}{P_{micro} + R_{micro}} \tag{42}$$

The macro-average precision (P-macro), macro-average recall (R-macro), and macro-average F-measure (F-macro) are each computed in two steps: first, precision, recall, and F-measure are computed locally for each reference class (by equations 37–39, respectively); second, the values obtained for all reference classes are averaged for each criterion by Equations 43–45, respectively.(43)(44)(45)*where*; (45-1)
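The micro- and macro-averages can be sketched from a confusion matrix whose rows are actual classes and whose columns are predicted classes. The sketch follows the verbal description above (F-macro as the plain mean of the per-class F-measures), uses a made-up 3×3 matrix, and assumes that every class has at least one actual and one predicted example; the function name is hypothetical.

```python
def micro_macro(cm):
    """Micro- and macro-averaged precision, recall and F-measure."""
    c = len(cm)
    tp = [cm[i][i] for i in range(c)]
    fn = [sum(cm[i]) - tp[i] for i in range(c)]                       # row minus diagonal
    fp = [sum(cm[r][i] for r in range(c)) - tp[i] for i in range(c)]  # column minus diagonal
    p = [tp[i] / (tp[i] + fp[i]) for i in range(c)]
    r = [tp[i] / (tp[i] + fn[i]) for i in range(c)]
    f = [2 * p[i] * r[i] / (p[i] + r[i]) for i in range(c)]
    p_micro = sum(tp) / (sum(tp) + sum(fp))
    r_micro = sum(tp) / (sum(tp) + sum(fn))
    f_micro = 2 * p_micro * r_micro / (p_micro + r_micro)
    return {
        "micro": (p_micro, r_micro, f_micro),
        "macro": (sum(p) / c, sum(r) / c, sum(f) / c),
    }

cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 2, 8]]
averages = micro_macro(cm)
print(averages["micro"])
print(averages["macro"])
```

For single-label problems such as this one, every false positive for one class is a false negative for another, so P-micro and R-micro coincide with the overall accuracy.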

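The confusion entropy (CEN) is an entropy theory-based criterion that has recently been introduced for evaluating the performance of multi-class classifiers. This criterion takes full advantage of the misclassification information in the confusion matrix; in fact, it evaluates the confusion level of the class distribution of the misclassified samples. In a c-class classification problem with a c×c confusion matrix $C$, the CEN is defined by Equations 46–47:

$$CEN = \sum_{j=1}^{c} P_j\,CEN_j \tag{46}$$

$$CEN_j = -\sum_{k=1,\,k\neq j}^{c}\left(P^{j}_{j,k}\log_{2(c-1)}P^{j}_{j,k} + P^{j}_{k,j}\log_{2(c-1)}P^{j}_{k,j}\right) \tag{47}$$

where $P^{j}_{i,j}$ is defined as the misclassification probability of classifying the samples of class i to class j subject to class j and is calculated by Equation 48; $P^{i}_{i,j}$ is defined as the misclassification probability of classifying the samples of class i to class j subject to class i and is calculated by Equation 49, with $P^{i}_{i,i} = 0$; and $P_j$ is defined as the confusion probability of class j and is calculated by Equation 50:

$$P^{j}_{i,j} = \frac{C_{i,j}}{\sum_{k=1}^{c}\left(C_{j,k} + C_{k,j}\right)}, \quad i \neq j \tag{48}$$

$$P^{i}_{i,j} = \frac{C_{i,j}}{\sum_{k=1}^{c}\left(C_{i,k} + C_{k,i}\right)}, \quad i \neq j \tag{49}$$

$$P_j = \frac{\sum_{k=1}^{c}\left(C_{j,k} + C_{k,j}\right)}{2\sum_{k=1}^{c}\sum_{l=1}^{c} C_{k,l}} \tag{50}$$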

In binary-class classification, the CEN can be calculated directly from the confusion matrix results by Equation 51 (Jurman et al. [51]):(51)where N = TP+TN+FP+FN.

The CEN takes values between 0 and 1, where 0 indicates a perfect classification and 1 indicates the worst possible classification, i.e., its interpretation runs in the opposite direction to that of the other evaluation criteria. To make the CEN results directly comparable with the other evaluation criteria, the CEN value is subtracted from 1 in this paper. Jurman et al. [51] showed that CEN cannot be used reliably in binary-class classification, where it can even take values greater than 1. Therefore, the CEN is not employed as a performance evaluation criterion for the binary-class cases in the present study.
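A sketch of the multi-class CEN is given below. It assumes the commonly stated definition: for each class j, the off-diagonal entries of row j and column j are divided by the sum of row j and column j, their entropies are taken with logarithm base 2(c−1), and the per-class values are weighted by the confusion probability of class j. The function name and the example matrices are made up for illustration, and at least two classes are assumed.

```python
import math

def confusion_entropy(cm):
    """Confusion entropy of a c x c confusion matrix (rows: actual class)."""
    c = len(cm)
    total = sum(sum(row) for row in cm)
    base = 2 * (c - 1)

    def plogp(p):
        # Convention: 0 * log(0) = 0.
        return p * math.log(p, base) if p > 0 else 0.0

    cen = 0.0
    for j in range(c):
        denom_j = sum(cm[j][k] + cm[k][j] for k in range(c))
        if denom_j == 0:
            continue  # class j never occurs, neither as actual nor as predicted
        cen_j = 0.0
        for k in range(c):
            if k != j:
                cen_j -= plogp(cm[j][k] / denom_j) + plogp(cm[k][j] / denom_j)
        cen += (denom_j / (2 * total)) * cen_j   # weight by P_j
    return cen

perfect = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]
scattered = [[0, 5, 5], [5, 0, 5], [5, 5, 0]]
for name, cm in (("perfect", perfect), ("scattered", scattered)):
    cen = confusion_entropy(cm)
    print(name, round(cen, 6), "reported as 1 - CEN =", round(1 - cen, 6))
```

Reporting 1 − CEN, as done in this paper, makes larger values correspond to better classification, in line with the other criteria.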

The MCC is another performance criterion, which has been developed and adapted to multi-class problems [51], [53] and it is calculated by Equation 52.(52)
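The multi-class MCC can be sketched from the confusion matrix as follows. Equation 52 is not reproduced here; the sketch uses the commonly stated generalization, $(s\,t - \sum_k p_k a_k)\,/\,\sqrt{(s^2 - \sum_k p_k^2)(s^2 - \sum_k a_k^2)}$, where s is the number of samples, t the number of correctly classified samples, and $p_k$ and $a_k$ the number of samples predicted as, and actually belonging to, class k. The function name and the symbol names are assumptions.

```python
import math

def multiclass_mcc(cm):
    """Multi-class MCC from a confusion matrix (rows: actual class)."""
    c = len(cm)
    s = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(c))
    actual = [sum(cm[i]) for i in range(c)]
    predicted = [sum(cm[r][i] for r in range(c)) for i in range(c)]
    numerator = correct * s - sum(p * a for p, a in zip(predicted, actual))
    denominator = math.sqrt(
        (s * s - sum(p * p for p in predicted))
        * (s * s - sum(a * a for a in actual)))
    return numerator / denominator if denominator else 0.0

print(multiclass_mcc([[8, 1, 1], [2, 6, 2], [0, 2, 8]]))
# For two classes the value agrees with the binary MCC:
print(multiclass_mcc([[45, 5], [10, 40]]))
```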

### 2.8 Statistical tests

Statistical analysis is an integral part of any scientific research [54]. Nevertheless, little attention has been paid to these valuable procedures in artificial intelligence-based studies [55], [56], and surprisingly few recent studies in the literature have performed a statistical analysis. Therefore, in the present study, statistical analysis is performed to improve the evaluation of the performance of the novel hybrid model.

From a methodological point of view, the statistical methods developed for carrying out statistical analyses can be categorized as parametric and nonparametric methods [54]. As a general rule, statistical inference procedures used for evaluating a dependent variable measured on a nominal or ordinal scale are categorized as nonparametric procedures, whereas those used for evaluating a dependent variable measured on an interval or ratio scale are categorized as parametric procedures. It should also be mentioned that there are other underlying assumptions, i.e., independence, normality, and homogeneity, which need to be checked for a safe and prudent use of parametric tests [55]. However, there is a trade-off here. Even though parametric tests are generally more powerful than their nonparametric analogs, many researchers believe that if one or more of the fundamental assumptions of a parametric test are violated, the power advantage of the parametric test may be negated, so that the statistical analysis loses credibility [54], [57]. Accordingly, in such circumstances a prudent approach is to employ a suitable nonparametric test.

#### 2.8.1 Preliminary analysis (Checking the conditions for a safe use of parametric tests).

In this section, the three underlying assumptions that need to be checked for a safe use of parametric tests are briefly described. By definition, two events are independent if the occurrence of one of them does not modify the probability of the occurrence of the other [58]. In the present case, the independence of the obtained results follows from the independent runs of the algorithm with randomly generated initial seeds. It is therefore necessary to check only the remaining fundamental assumptions of a parametric test.

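A random variable X has the normal or Gauss distribution with mean µ and standard deviation σ (i.e., X ∼ N (µ, σ)) if its probability density function is given by Equation 53 [58]:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{53}$$

where $-\infty < x < \infty$.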

In the present study, the Shapiro-Wilk test was used as a preliminary analysis to verify the normality hypothesis. This test is the most effective omnibus test capable of detecting departures from normality caused by skewness, kurtosis, or both [59]. The Shapiro-Wilk test employs a weighted sum of the ordered observations to quantify how far the observations are from a normal distribution. The p-value is then derived from the sum of the squares of these disparities.

In addition, as a preliminary analysis, the homoscedasticity assumption was assessed using Levene's test. Levene's test indicates whether or not there is a significant violation of the equality of variances. In other words, this test checks whether k samples present homogeneity of variances (i.e., homoscedasticity).

#### 2.8.2 Primary and supplementary analysis (Friedman test and post-hoc test).

In artificial intelligence studies, particularly in modeling, multiple comparisons with a control method is one of the most commonly used statistical procedures [60]. The control method is usually the algorithm of most interest to the researchers, commonly the novel proposed algorithm. In the statistical terminology of multiple comparisons tests, a block is composed of at least three results or subjects, each corresponding to the performance evaluation of an algorithm or method on a data set. The Friedman test is a non-parametric multiple comparisons test equivalent to the repeated measures analysis of variance, and it can be used in this context as the primary analysis [54]. The null hypothesis of the Friedman test states that all the algorithms have equivalent performance, so a rejection of the null hypothesis implies the existence of significant differences among the performance of at least two algorithms.

Once Friedman's test rejects the null hypothesis, the evaluation can proceed with a post-hoc test (i.e., a set/family of pairwise multiple comparisons tests) [61]. The post-hoc test is performed as a supplementary analysis in order to find the significant differences between the performance of the control method (the novel proposed algorithm) and that of the rest of the algorithms used. The post-hoc test statistic for comparing the i-th and j-th algorithms is given by Equation 54 [56]:

$$z = \frac{R_i - R_j}{\sqrt{\dfrac{k(k+1)}{6N}}} \tag{54}$$

where $R_i$ stands for the mean rank calculated through the Friedman test for the i-th algorithm, k stands for the number of classifiers to be compared, and N stands for the number of data sets used in the comparison. Here, the z-statistic value is used to determine the corresponding probability (p-value) from the table of the standard normal distribution, which is then compared with an appropriate α.

In performing the post-hoc tests, controlling the global Type I error, i.e., the so-called family-wise error rate (FWER), is a key issue [61]. More precisely, the FWER is the probability of rejecting at least one true null hypothesis among a family of pairwise multiple comparisons tests. The classic Bonferroni procedure is an appealing approach for controlling the FWER (i.e., maintaining FWER≤α) owing to its applicability in various situations [62]. In practice, the Bonferroni procedure controls the FWER by adjusting the p-values obtained from the post-hoc test. On the other hand, the original Bonferroni procedure is generally considered conservative. To overcome this issue, some modifications of the original Bonferroni procedure that are much more powerful have been presented in the literature [63], [64]. In this study, the focus was on Holm's sequentially rejective step-down procedure, a modified Bonferroni-based procedure for determining the adjusted p-value.
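The supplementary analysis can be sketched with the standard library alone. The sketch assumes the commonly used post-hoc statistic $z = (R_i - R_j)/\sqrt{k(k+1)/(6N)}$, two-sided p-values from the standard normal distribution, and Holm's step-down adjustment; the scores are made up and all function names are hypothetical. The Friedman statistic itself is not computed here.

```python
import math

def mean_ranks(scores):
    """scores[d][a]: score of algorithm a on data set d; rank 1 is best."""
    n_alg = len(scores[0])
    totals = [0.0] * n_alg
    for row in scores:
        order = sorted(range(n_alg), key=lambda a: -row[a])
        pos = 0
        while pos < n_alg:
            end = pos
            while end + 1 < n_alg and row[order[end + 1]] == row[order[pos]]:
                end += 1
            avg_rank = (pos + end) / 2 + 1    # ties share the average rank
            for a in order[pos:end + 1]:
                totals[a] += avg_rank
            pos = end + 1
    return [t / len(scores) for t in totals]

def posthoc_z(rank_i, rank_j, k, n):
    """z statistic for comparing two mean ranks (k algorithms, n data sets)."""
    return (rank_i - rank_j) / math.sqrt(k * (k + 1) / (6.0 * n))

def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def holm_adjust(p_values):
    """Holm's sequentially rejective step-down adjusted p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted, running_max = [0.0] * m, 0.0
    for step, i in enumerate(order):
        running_max = max(running_max, (m - step) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Made-up accuracies: 4 data sets (rows) x 3 algorithms (columns); column 0 is the control.
scores = [[0.90, 0.80, 0.70], [0.95, 0.85, 0.80], [0.88, 0.90, 0.70], [0.92, 0.80, 0.80]]
ranks = mean_ranks(scores)
k, n = len(ranks), len(scores)
raw_p = [two_sided_p(posthoc_z(ranks[0], ranks[j], k, n)) for j in range(1, k)]
print(ranks, raw_p, holm_adjust(raw_p))
```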

## Results

To evaluate the performance of the newly proposed model, called the Hybrid Model, seven standard data sets (i.e., five binary-class and two multi-class) were used in this study. Using these data sets and some of the most commonly used evaluation criteria, the classification performance of the proposed model was compared to that of thirteen well-known classification methods. These methods are Adaptive Network Fuzzy Inference System (ANFIS), Radial Basis Function (RBF), k-NN, DWk-NN, Partial Distance k-NN (PDk-NN), MLP, Naive Bayes (NB), BPNN, Iterative Dichotomiser 3 (ID3), Bagging ID3, and Generalized Linear Models (GLM) with four different distributions.

### 3.1 Binary-class results

Based on the five binary-class data sets, the classification was performed using the proposed model and all the other thirteen methods mentioned previously. The performance of the methods was subsequently evaluated in terms of the overall classification accuracy, sensitivity, specificity, precision, F-measure and MCC on the basis of the confusion matrix results. The experimental results, as presented in Table 15, indicate that the proposed hybrid model has significantly outperformed all the other methods in terms of all the considered evaluation criteria for all the five binary-class data sets.
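For reference, the six binary-class criteria can be derived from a 2×2 confusion matrix as sketched below. The sketch uses the standard textbook definitions, which are assumed to agree with those used in this study; the function name and the counts in the example are hypothetical.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard evaluation criteria from the cells of a 2x2 confusion matrix.

    Assumes every denominator is non-zero except for MCC, which is set to 0
    when its denominator vanishes.
    """
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical counts, not taken from Table 15.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```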

### 3.2 Analysis of the conditions for a safe use of parametric tests on the binary-class results

The Shapiro-Wilk normality test was performed, at a significance level of α = 0.05, on the results obtained with the six commonly used classification evaluation criteria for the fourteen methods on the five binary-class data sets. The test shows that the normality condition is not fulfilled in some cases; the detailed results are omitted here to avoid overwhelming the reader.

To verify the homoscedasticity hypothesis, Levene's test was performed, at a significance level of α = 0.05, on the results obtained from the six classification evaluation criteria on the five binary-class data sets. Levene's test checks whether the fourteen methods exhibit homogeneity of variances. The results (p-values) are shown in Table 16, where the symbol “*” indicates that the homoscedasticity condition was not satisfied for a given data set and performance evaluation criterion.

Based on the results of the normality and homoscedasticity tests, it can be concluded that the necessary conditions for using parametric tests are not fulfilled in some cases. Thus, non-parametric tests are appropriate for testing the null hypothesis that all the methods perform similarly.

### 3.3 Friedman and post-hoc tests' results for multiple comparisons on the binary-class results

The Friedman test was carried out, at a significance level of α = 0.05, on the classification performance results of the proposed hybrid model and the other thirteen methods, obtained from the six classification evaluation criteria on the five binary-class data sets. The results, including test statistics and p-values, are presented in Table 17, where the symbol “*” indicates that the null hypothesis was rejected for a given performance evaluation criterion. As shown in Table 17, there are statistically significant differences between the algorithms' performance in terms of all six classification evaluation criteria. Accordingly, to show the differences between the performance of the proposed model and that of the other algorithms more concretely, the post-hoc test was performed. The p-values resulting from the post-hoc test were then adjusted using the Bonferroni–Holm procedure. The post-hoc test results, including the z-score, unadjusted p-value, Holm adjustment coefficient, and adjusted p-value for all six performance evaluation criteria, are presented in Table 18. It is apparent from the table that the performance of the proposed model differs significantly from that of each of the other algorithms for every one of the six classification evaluation criteria. In other words, the proposed model significantly outperformed all the other algorithms in terms of all the considered evaluation criteria.
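To make the ranking step concrete, the sketch below computes the mean ranks and the Friedman chi-square statistic for a table of scores. It assumes the standard formulation summarized in [56], with rank 1 assigned to the best score on each data set and average ranks for ties, and it omits any tie correction; the function names and the score table are hypothetical. The p-value would additionally require the chi-square distribution with k - 1 degrees of freedom, which is not implemented here.

```python
def average_ranks(scores, higher_is_better=True):
    """Rank the scores of one data set (1 = best), averaging ranks over ties."""
    keyed = [(-s if higher_is_better else s) for s in scores]
    order = sorted(range(len(keyed)), key=lambda i: keyed[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and keyed[order[end + 1]] == keyed[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for t in range(pos, end + 1):
            ranks[order[t]] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(score_table):
    """score_table[d][a] is the score of algorithm a on data set d."""
    n = len(score_table)
    k = len(score_table[0])
    rank_rows = [average_ranks(row) for row in score_table]
    mean_ranks = [sum(row[a] for row in rank_rows) / n for a in range(k)]
    chi2 = 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0
    )
    return mean_ranks, chi2


# Hypothetical accuracies of three algorithms on four data sets.
table = [
    [0.95, 0.90, 0.85],
    [0.92, 0.88, 0.80],
    [0.97, 0.91, 0.89],
    [0.90, 0.86, 0.70],
]
mean_ranks, chi2 = friedman_statistic(table)
print(mean_ranks, chi2)
```

The mean ranks returned here are the quantities that enter the pairwise post-hoc comparisons.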

### 3.4 Multi-class results

For the two multi-class data sets, classification was carried out in the same way as described for the binary-class data sets. The performance evaluation criteria were then determined for each classifier based on the confusion matrix results. The results are shown in Table 19 in terms of overall classification accuracy, P-micro, P-macro, R-micro, R-macro, F-micro, F-macro, MCC and 1-CEN. The experimental results in Table 19 demonstrate that the proposed hybrid model outperformed all the other algorithms in terms of all the considered evaluation criteria for the two multi-class data sets.
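The micro- and macro-averaged criteria can be obtained from a multi-class confusion matrix as sketched below. The sketch assumes the convention of [50], in which F-macro is computed from the macro-averaged precision and recall (another common convention averages the per-class F-scores instead); the function name and the confusion matrix are hypothetical, and MCC and CEN are not covered.

```python
def micro_macro_scores(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n_classes = len(confusion)
    tp = [confusion[c][c] for c in range(n_classes)]
    # False positives: column total minus the diagonal entry.
    fp = [sum(confusion[r][c] for r in range(n_classes)) - tp[c]
          for c in range(n_classes)]
    # False negatives: row total minus the diagonal entry.
    fn = [sum(confusion[c]) - tp[c] for c in range(n_classes)]

    def safe_div(a, b):
        return a / b if b else 0.0

    p_micro = safe_div(sum(tp), sum(tp) + sum(fp))
    r_micro = safe_div(sum(tp), sum(tp) + sum(fn))
    p_macro = sum(safe_div(tp[c], tp[c] + fp[c]) for c in range(n_classes)) / n_classes
    r_macro = sum(safe_div(tp[c], tp[c] + fn[c]) for c in range(n_classes)) / n_classes
    return {
        "P-micro": p_micro,
        "R-micro": r_micro,
        "F-micro": safe_div(2 * p_micro * r_micro, p_micro + r_micro),
        "P-macro": p_macro,
        "R-macro": r_macro,
        "F-macro": safe_div(2 * p_macro * r_macro, p_macro + r_macro),
    }


# Hypothetical three-class confusion matrix, not taken from Table 19.
scores = micro_macro_scores([[5, 1, 0], [2, 3, 1], [0, 0, 4]])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Micro-averaging pools the counts over all classes, so larger classes dominate, whereas macro-averaging weights every class equally.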

### 3.5 Analysis of the conditions for a safe use of parametric tests on the multi-class results

The Shapiro-Wilk normality test was carried out, at a significance level of α = 0.05, on the results obtained with the nine evaluation criteria for the fourteen methods on the two multi-class data sets. The test shows that the normality condition is not fulfilled in some cases; the detailed results are omitted here to avoid overwhelming the reader.

To verify the homoscedasticity hypothesis, Levene's test was carried out, at a significance level of α = 0.05, on the results obtained from the nine classification evaluation criteria on the two multi-class data sets. The results (p-values) are presented in Table 20, where the symbol “*” indicates that the homoscedasticity condition was not satisfied for a given data set and performance evaluation criterion. Based on the results of the normality and homoscedasticity tests, it can be concluded that the necessary conditions for using parametric tests are not fulfilled in some cases. Thus, non-parametric tests are appropriate for testing the null hypothesis that all the methods perform similarly.

### 3.6 Friedman and post-hoc tests' results for multiple comparisons on the multi-class results

The Friedman test was performed, at a significance level of α = 0.05, on the results obtained from the nine classification evaluation criteria on the two multi-class data sets. The experimental results, including test statistics and p-values, are shown in Table 21, where the symbol “*” indicates that the null hypothesis was rejected for a given performance evaluation criterion.

As shown in Table 21, there are statistically significant differences between the algorithms' performance for all nine classification evaluation criteria. Accordingly, to show the differences between the performance of the proposed model and that of the other algorithms more concretely, a post-hoc test was performed. The p-values resulting from the post-hoc test were then adjusted using the Bonferroni–Holm procedure. The results of the post-hoc tests, including the z-score, unadjusted p-value, Holm adjustment coefficient, and adjusted p-value for all nine performance evaluation criteria, are presented in Table 22. Some highlights of the table are outlined below. As expected, overall, the proposed model significantly outperformed the other algorithms.

More specifically, there are significant differences between the performances of the proposed model and the rest of the considered algorithms (i.e., the proposed model significantly outperformed the other algorithms) in terms of the MCC and 1-CEN criteria. Moreover, the proposed model achieved higher classification accuracy than 11 out of the 13 considered algorithms (i.e., except the BPNN and Bagging-ID3). Furthermore, the experimental results represented a meaningful improvement of the proposed model over all the used algorithms except the ID3, BPNN and Bagging-ID3 in terms of P-micro. Besides, it is apparent that the proposed model has a higher performance than all the other used algorithms except the ID3 and Bagging-ID3 in terms of P-macro. In addition, the results revealed the superiority of the proposed model over all the other considered methods except the BPNN and in terms of R-micro. Also, it can be concluded that the performance of the proposed model surpasses all the other considered methods except ANFIS and Bagging-ID3 on the basis of R-macro. Finally, based on F-micro and F-macro criteria, the proposed model demonstrated superior performance for 11 out of the 13 (i.e., except the ID3 and Bagging-ID3) and 12 out of the 13 considered algorithms respectively (i.e., except the Bagging-ID3).

### 3.7 Comparison with the other state-of-the-art models

In this section, the results of the proposed hybrid model are compared with those of the state-of-the-art classifiers (i.e., single, hybrid or ensemble SVM-based and random forest-based models) reported in the recent literature, in terms of classification accuracy on the data sets under consideration. The results of this comprehensive comparative survey are reported in Table 23. It is apparent from the table that the proposed model shows very promising performance. More specifically, as shown in Table 23, the proposed model achieved higher classification accuracy than 13 of the 14 algorithms on the multi-class Cleveland data set.

Moreover, on the binary-class Cleveland data set, the accuracy of the proposed model was competitive with 14 of the 16 state-of-the-art classifiers. Furthermore, the proposed model is more accurate than all six algorithms considered on the Hungarian data set. On the WDBC and WBC data sets, the classification accuracy of the proposed model surpasses that of 21 of 26 and 20 of 22 state-of-the-art classifiers, respectively. Finally, on the Pima Indian diabetes data set, the hybrid model obtains higher classification accuracy than 26 of the 30 state-of-the-art classifiers.

## Conclusions

Classification models based on artificial intelligence have had a significant impact on predictive decision making in various sciences, including medicine. Numerous studies have been carried out on these classification models. Nevertheless, research continues in pursuit of models with better efficiency. Combining different methods and algorithms to find more efficient hybrid models is an approach that has attracted a lot of attention. The fundamental point in the structure of such models is the proper selection of their components, so that the hybrid model benefits from the distinctive strengths of each method or algorithm and gains accuracy from their combination. Building such a combination can compensate for the deficiencies of the participating methods and yield a hybrid model with the fewest deficiencies.

The main goal of the present study was to improve the accuracy of classification models by taking advantage of the synergy expected to emerge from the hybridization of the components of the proposed model. The proposed hybrid model was implemented by combining several artificial intelligence methods and algorithms in three main stages. First, appropriate features were selected and optimized feature arrays were created from them using pattern recognition methods together with the optimization capabilities of the GA. Second, classification was performed in parallel by the modified k-NN and by the BPNN extended into the DBPNN. Finally, the class-allocation decisions were integrated using fuzzy class membership.

It should be pointed out that, in this study, the decision-making process in class allocation was improved through an adjustment made in the last stage of the k-NN algorithm, so that several sets of k-dimensional optimized arrays of features (generated by the GA) were used instead of one set. In addition, an extended network called DBPNN was created by introducing a dynamic transfer function for the BPNN. A problem from which the usual transfer functions of these networks, such as the log-sigmoid and tangent-sigmoid functions, suffer is the limitation of their active domain: they perform well only within a limited range of their domain. Outside this range, the output is nearly constant and the derivative is very close to zero. This slows the learning process and reduces the classifier's accuracy; both problems were resolved via this method.
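The saturation effect described above can be checked numerically. The sketch below evaluates the ordinary log-sigmoid function and its derivative at a few hypothetical inputs; it only illustrates the limited active domain and does not implement the dynamic transfer function of the DBPNN.

```python
import math


def log_sigmoid(x):
    """Ordinary log-sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-x))


def log_sigmoid_derivative(x):
    """Derivative of the log-sigmoid, s(x) * (1 - s(x))."""
    s = log_sigmoid(x)
    return s * (1.0 - s)


# The derivative peaks at x = 0 and vanishes quickly as |x| grows,
# so weight updates driven by it become very small outside the active domain.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  output = {log_sigmoid(x):.6f}  "
          f"derivative = {log_sigmoid_derivative(x):.6f}")
```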

The proposed hybrid model was evaluated by repeated random sub-sampling cross-validation and three-way data splits on six data sets taken from the UCI machine-learning repository and one real-world data set called ACSEKI. Four of the data sets relate to heart disease, two to breast cancer, and one to diabetes. In this evaluation, using the seven data sets and some of the most commonly used evaluation criteria, the classification performance of the proposed model was compared with that of thirteen of the most well-known classification methods.
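The sampling scheme can be sketched as follows: each repetition shuffles the sample indices and cuts them into training, validation, and test parts. The 60/20/20 proportions, the number of repetitions, the seed, and the function names are assumptions made for illustration only; they are not the settings used in the study.

```python
import random


def three_way_split(n_samples, train_fraction, validation_fraction, rng):
    """Shuffle the indices and cut them into train / validation / test parts."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(round(train_fraction * n_samples))
    n_validation = int(round(validation_fraction * n_samples))
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_validation]
    test = indices[n_train + n_validation:]
    return train, validation, test


def repeated_subsampling(n_samples, n_repeats, train_fraction=0.6,
                         validation_fraction=0.2, seed=0):
    """Draw independent random three-way splits, one per repetition."""
    rng = random.Random(seed)
    return [three_way_split(n_samples, train_fraction, validation_fraction, rng)
            for _ in range(n_repeats)]


# Hypothetical setting: 100 samples, 5 repetitions.
splits = repeated_subsampling(n_samples=100, n_repeats=5)
for train, validation, test in splits:
    print(len(train), len(validation), len(test))
```

In such a scheme the evaluation criteria are typically computed on the test part of each repetition and then averaged over the repetitions.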

The statistical analysis was performed using the non-parametric Friedman test followed by post-hoc tests (i.e., Dunn's pairwise multiple comparison tests). The p-values resulting from the post-hoc test were adjusted using the Bonferroni–Holm procedure. The experimental results indicated that the proposed hybrid model significantly outperformed all thirteen other classification methods considered and effectively increased the classification accuracy as well. Furthermore, in a comprehensive comparative survey, the performance of the proposed model was benchmarked against the best results reported for the state-of-the-art classifiers (i.e., single, hybrid or ensemble SVM-based and random forest-based classifiers) in the recent literature, in terms of classification accuracy on the same data sets. This comparative survey provides a concise summary of results showing that the efficiency of the proposed model is desirable, promising, and competitive with the state-of-the-art classification models available in the literature. It is worth mentioning that hybrid models inevitably involve some minor complexities, and the model proposed in this study is no exception. Nevertheless, given the demonstrated capability and effectiveness of this model in classification tasks in three different fields, namely ACS, breast cancer and diabetes, there is hope that the proposed model could be used as a tool for the quick, timely, and accurate diagnosis of diseases in other medical fields, as well as in non-medical ones.

## Acknowledgments

We thank Mr. Aliakbar Kiaei and Dr. Hossein Moayedi for their technical support during the study.

## Author Contributions

Conceived and designed the experiments: NS SS FN MN IK. Performed the experiments: NS SS FN MN IK. Analyzed the data: NS SS FN MN IK. Contributed reagents/materials/analysis tools: NS SS FN MN IK. Wrote the paper: NS.

## References

- 1. Raudys S (2001) Statistical and Neural Classifiers: An integrated Approach to Design: Springer-Verlag New York Incorporated.
- 2. Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10:207–244.
- 3. Kubota R, Uchino E, Suetake N (2008) Hierarchical K-Nearest neighbor classification using feature and observation space information. IEICE Electronics Express 5:114–119.
- 4. Zeng Y, Yang Y, Zhao L (2009) Nonparametric classification based on local mean and class statistics. Expert Systems with Applications 36:8443–8448.
- 5. Bishop CM (1995) Neural Networks for Pattern Recognition. Oxford: Oxford University Press.
- 6. Olmez T, Dokur Z (2003) Classification of heart sounds using an artificial neural network. Pattern Recognition Letters 24:617–629.
- 7. Rajendra AU, Subbanna BP, Iyengar SS, Rao A, Dua S (2003) Classification of heart rate data using artificial neural network and fuzzy equivalence relation. Pattern Recognition 36:61–68.
- 8. Qiu X, Tao N, Tan Y, Wu X (2007) Constructing of the risk classification model of cervical cancer by artificial neural network. Expert Systems with Applications 32:1094–1099.
- 9. Salari N, Shohaimi S, Najafi F, Nallappan M, Karishnarajah I (2012) An improved Artificial Neural Network based model for Prediction of Late Onset Heart Failure. Life Science Journal 9.
- 10. Salari N, Shohaimi S, Najafi F, Nallappan M, Karishnarajah I (2013) Application of pattern recognition tools for classifying acute coronary syndrome: an integrated medical modeling. Theoretical Biology and Medical Modelling 10:57.
- 11. Shapiro A (2002) The merging of neural networks, fuzzy logic, and genetic algorithms. Insurance: Mathematics and Economics 31:115–131.
- 12. Hur J, Kim JW (2008) A hybrid classification method using error pattern modeling. Expert Systems with Applications 34:231–241.
- 13. Chakraborty S (2009) Simultaneous cancer classification and gene selection with bayesian nearest neighbor method: an integrated approach. Computational Statistics & Data Analysis 53:1462–1474.
- 14. Ostermark R (2000) A hybrid genetic fuzzy neural network algorithm designed for classification problems involving several groups. Fuzzy Sets and Systems 114:311–324.
- 15. Aci M, Inan C, Avci M (2010) A hybrid classification method of K nearest neighbor, bayesian methods and genetic algorithm. Expert Systems with Applications 37:5061–5067.
- 16. Khashei M, Reza Hejazi S, Bijari M (2008) A new hybrid artificial neural networks and fuzzy regression model for time series forecasting. Fuzzy Sets and Systems 159:769–786.
- 17. Seera M, Lim CP (2014) A hybrid intelligent system for medical data classification. Expert Systems with Applications 41:2239–2249.
- 18. Shao YE, Hou C-D, Chiu C-C (2014) Hybrid intelligent modeling schemes for heart disease classification. Applied Soft Computing 14:47–52.
- 19. Forghani Y, Yazdi HS (2014) Robust support vector machine-trained fuzzy system. Neural Networks 50:154–165.
- 20. Zhang C, Zhang J (2008) RotBoost: A technique for combining Rotation Forest and AdaBoost. Pattern recognition letters 29:1524–1536.
- 21. Ghaemi M, Feizi-Derakhshi M (2014) Forest optimization algorithm. Expert Systems with Applications 41:6676–6687.
- 22. Zhang S, Mouhoub M, Sadaoui S (2014) 3N-Q: natural nearest neighbor with quality. Computer and Information Science 7:p94.
- 23. Holland JH (1975) Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. USA: University of Michigan.
- 24. Werbos PJ (1974) Beyond regression: new tools for prediction and analysis in the behavioral sciences. Harvard University
- 25. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323:533–536.
- 26. Zadeh LA (1965) Fuzzy sets. Information and control 8:338–353.
- 27. Gupta MM, Ragade RK, Yager RR (1979) Advances in Fuzzy Set Theory and Applications: North Holland.
- 28. Wang P, Chang S (1980) Fuzzy Sets: Theory of Applications to Policy Analysis and Information Systems: Springer.
- 29. Kandel A (1982) Fuzzy Techniques in Pattern Recognition: Cambridge Univ Press.
- 30. Keller JM, Gray MR, Givens JA (1985) A Fuzzy k-Nearest neighbor algorithm. Systems, Man and Cybernetics, IEEE Transactions on 580–585.
- 31. Bezdek JC (1981) Pattern Recognition with Fuzzy Objective Function Algorithms: Kluwer Academic Publishers.
- 32. Ver Hoef J, Temesgen H (2013) A comparison of the spatial linear model to nearest neighbor (K-NN) methods for forestry applications. PLoS ONE 8:e59129.
- 33. Wu X, Kumar V, Ross QJ, Ghosh J, Yang Q, et al. (2008) Top 10 algorithms in data mining. Knowledge and Information Systems 14:1–37.
- 34. Song Y, Huang J, Zhou D, Zha H, Giles C (2007) Iknn: Informative k-nearest neighbor pattern classification. Knowledge Discovery in Databases: PKDD 2007:248–264.
- 35. Premaratne P (2014) Effective hand gesture classification approaches. Human Computer Interaction Using Hand Gestures: Springer Singapore. pp. 105–143.
- 36. Mitchell TM (1997) Machine learning. Part II. McGraw-Hill Boston, MA.
- 37. Webb AR, Copsey KD (2011) Statistical pattern recognition. Statistical Pattern Recognition: John Wiley & Sons, Ltd.
- 38. Segovia F, Bastin C, Salmon E, Górriz J, Ramírez J, et al. (2014) Combining pet images and neuropsychological test data for automatic diagnosis of alzheimer's disease. PLoS ONE 9:e88687.
- 39. Segovia F, Górriz JM, Ramírez J, Chaves R, Illán IÁ (2012) Automatic differentiation between controls and Parkinson's disease DaTSCAN images using a Partial Least Squares scheme and the Fisher Discriminant Ratio. pp. 2241–2250.
- 40. Lu Y, Wang L, Lu J, Yang J, Shen C (2014) Multiple kernel clustering based on centered kernel alignment. Pattern Recognition
- 41. Guo J, White J, Wang G, Li J, Wang Y (2011) A genetic algorithm for optimized feature selection with resource constraints in software product lines. Journal of Systems and Software 84:2208–2221.
- 42. Dougherty G (2013) Estimating and comparing classifiers. Pattern Recognition and Classification. New York: Springer pp. 157–176.
- 43. Boulesteix AL, Strobl C (2009) Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction. BMC medical research methodology 9:85.
- 44. Borra S, Di Ciaccio A (2010) Measuring the prediction error. A comparison of cross-validation, bootstrap and covariance penalty methods. Computational Statistics & Data Analysis 54:2976–2989.
- 45. Dybowski R, Gant V (2001) Clinical Applications of Artificial Neural Networks. Cambridge: Cambridge University Press.
- 46. Gu Q, Zhu L, Cai Z (2009) Evaluation measures of the classification performance of imbalanced data sets. Computational Intelligence and Intelligent Systems: Springer. 461–471.
- 47. Cho BH, Yu H, Kim K, Kim T, Kim I, et al. (2008) Application of irregular and unbalanced data to predict diabetic nephropathy using visualization and feature selection methods. Artificial Intelligence in Medicine 42:37–53.
- 48. Alberg AJ, Park JW, Hager BW, Brock MV, Diener WM (2004) The Use of “Overall Accuracy” To Evaluate The Validity of Screening or Diagnostic Tests. Journal of General Internal Medicine 19:460–465.
- 49. Baldi P, Brunak S, Chauvin Y, Andersen C, Nielsen H (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16:412–424.
- 50. Sokolova M, Lapalme G (2009) A systematic analysis of performance measures for classification tasks. Information Processing & Management 45:427–437.
- 51. Jurman G, Riccadonna S, Furlanello C (2012) A comparison of MCC and CEN error measures in multi-class prediction. PLoS ONE 7:e41882.
- 52. Fawcett T (2006) An introduction to ROC analysis. Pattern recognition letters 27:861–874.
- 53. Gorodkin J (2004) Comparing two K-category assignments by a K-Category correlation coefficient. Computational Biology and Chemistry 28:367–374.
- 54. Sheskin D (2003) Handbook of Parametric and Nonparametric Statistical Procedures: CRC Press.
- 55. Derrac J, García S, Molina D, Herrera F (2011) A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm and Evolutionary Computation 1:3–18.
- 56. Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 7:1–30.
- 57. García S, Fernández A, Luengo J, Herrera F (2009) On learning the derivatives of an unknown mapping with multilayer feedforward networks. Soft Computing 13:959–977.
- 58. Casella G, Berger RL (1990) Statistical Inference: Duxbury Press Belmont, CA.
- 59. Razali N, Wah YB (2011) Power comparisons of shapiro-wilk, kolmogorov-smirnov, lilliefors and anderson-darling tests. Journal of Statistical Modeling and Analytics 2:21–33.
- 60. Garcia S, Herrera F (2008) An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons. Journal of Machine Learning Research 9.
- 61. Zar JH (1999) Biostatistical Analysis: Pearson Education India.
- 62. Dunn OJ (1961) Multiple comparisons among means. Journal of the American Statistical Association 56:52–64.
- 63. Hochberg Y (1988) A sharper bonferroni procedure for multiple tests of significance. Biometrika 75:800–802.
- 64. Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics 65–70.