One-Hot Vector Hybrid Associative Classifier for Medical Data Classification

Pattern recognition and classification are two of the key topics in computer science. In this paper, a novel method for the task of pattern classification is presented. The proposed method combines a hybrid associative classifier (Clasificador Híbrido Asociativo con Traslación, CHAT, in Spanish), a coding technique for output patterns called one-hot vector, and majority voting during the classification step. The method is termed CHAT One-Hot Majority (CHAT-OHM). The performance of the method is validated by comparing the accuracy of CHAT-OHM with that of other well-known classification algorithms. During the experimental phase, the classifier was applied to four datasets related to the medical field. The results also show that the proposed method outperforms the original CHAT in classification accuracy.


Introduction
Recognizing objects is an automatic, routine task for humans, and there is a myriad of problems involving pattern recognition. Simulating the human capacity for object recognition has been a very important topic in computer science. For several decades, various approaches that can be implemented on computers have been developed to simulate the human ability to recognize objects. One such approach is the associative approach, whose main purpose is to correctly retrieve complete patterns from input patterns.
The first known model of associative memories is the Lernmatrix, developed in 1961 by Karl Steinbuch [1]. Some years later, an optical device capable of behaving as an associative memory was created by Buneman and Longuet-Higgins [2]. In 1972, the work of Anderson [3], Kohonen [4], and to some extent Nakano [5], led to the model that is now known by the generic name of Linear Associator. In the same year, Shun-Ichi Amari published a theoretical work about self-organizing nets of threshold elements [6]. The work of Amari represents an essential background to one of the most important associative models: the Hopfield memory [7]. In the late 1980s, Kosko [8] developed a bidirectional associative memory from two Hopfield memories. The morphological associative memories, introduced by Ritter et al. in 1998 [9], represented a qualitative leap for associative models. These models incorporated concepts from mathematical morphology, which give them several advantages over the known models. Associative models have been widely and successfully used in different applications such as pollutant concentration prediction [10], pattern classification [11], and image processing [12,13], among others.
In this paper, a method that combines a hybrid associative classifier, a coding technique for output patterns, and majority voting is presented. The rest of this paper is organized as follows. Section 2 describes all the materials and methods needed to develop our proposal. Section 3 describes how the experimental phase was conducted and discusses the results. Some conclusions are presented in Section 4, and finally the Acknowledgments and References are included.

Associative Memories
An associative memory M is a system that relates input patterns and output patterns as follows [14]: $x \rightarrow M \rightarrow y$, with x and y being the input and output pattern vectors, respectively. Each input vector forms an association with its corresponding output vector. An associative memory is represented by a matrix whose ij-th component is $m_{ij}$. For each positive integer k, the corresponding association is denoted as $(x^k, y^k)$. The matrix M is generated from a finite set of previously known associations, called the fundamental set. If $\mu$ is an index, the fundamental set is represented as $\{(x^\mu, y^\mu) \mid \mu = 1, 2, \ldots, p\}$, where p is the cardinality of the fundamental set. The patterns that form the fundamental set are called fundamental patterns. If it holds that $x^\mu = y^\mu \ \forall \mu \in \{1, 2, \ldots, p\}$, M is autoassociative; otherwise it is heteroassociative. If we consider the fundamental set of patterns $\{(x^\mu, y^\mu) \mid \mu = 1, 2, \ldots, p\}$, where n and m are the dimensions of the input patterns and output patterns, respectively, it is said that $x^\mu \in A^n$, $A = \{0, 1\}$, and $y^\mu \in A^m$. Then the j-th component of an input pattern $x^\mu$ is $x_j^\mu \in A$. Analogously, the j-th component of an output pattern $y^\mu$ is represented as $y_j^\mu \in A$. Therefore, the fundamental input and output patterns are represented as follows: $x^\mu = (x_1^\mu, x_2^\mu, \ldots, x_n^\mu)^t \in A^n$ and $y^\mu = (y_1^\mu, y_2^\mu, \ldots, y_m^\mu)^t \in A^m$. A distorted version of a pattern $x^\mu$ to be recalled is denoted as $\tilde{x}^\mu$. An unknown input pattern to be recalled is denoted as $x^\omega$. If an unknown input pattern $x^\omega$ is fed to an associative memory M and the output corresponds exactly to the associated pattern $y^\omega$, it is said that recalling is correct.

Lernmatrix
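The Lernmatrix is a heteroassociative memory that can function as a binary pattern classifier if the output patterns are properly selected [14]. It is an input and output system that accepts a binary input pattern $x^\mu \in A^n$, $A = \{0, 1\}$, and produces as output the class $y^\mu \in A^m$. For a class $k \in \{1, 2, \ldots, m\}$, where m is the number of classes in the fundamental set, the class is coded according to the following expression: $y_k^\mu = 1$ and $y_j^\mu = 0$ for $j = 1, 2, \ldots, k-1, k+1, \ldots, m$. The Lernmatrix is represented by a matrix M. At the beginning of the learning phase, each component $m_{ij}$ of M is set to zero and then it is updated according to the rule $m_{ij} + \Delta m_{ij}$, where:

$$\Delta m_{ij} = \begin{cases} +\varepsilon & \text{if } y_i^\mu = 1 = x_j^\mu \\ -\varepsilon & \text{if } y_i^\mu = 1 \text{ and } x_j^\mu = 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\varepsilon$ is any positive constant that was previously chosen. The recovery phase consists of finding the class vector for a given vector $x^\omega \in A^n$. Finding the class means obtaining the coordinates of the vector $y^\omega \in A^m$ that corresponds to the pattern $x^\omega$. The i-th coordinate $y_i^\omega$ is obtained according to the following expression:

$$y_i^\omega = \begin{cases} 1 & \text{if } \sum_{j=1}^{n} m_{ij} \, x_j^\omega = \max_{h = 1, \ldots, m} \left[ \sum_{j=1}^{n} m_{hj} \, x_j^\omega \right] \\ 0 & \text{otherwise} \end{cases}$$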
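
The learning and recovery phases of the Lernmatrix can be illustrated with a minimal sketch. It assumes the commonly stated Lernmatrix update (add ε in the active class row where the input bit is 1, subtract ε where it is 0, leave the other rows unchanged) and a recall that marks every row whose response equals the maximum. The function names `lernmatrix_learn` and `lernmatrix_recall` are hypothetical and do not come from the paper.

```python
from typing import List


def lernmatrix_learn(xs: List[List[int]], ys: List[List[int]],
                     eps: float = 1.0) -> List[List[float]]:
    """Build an m x n Lernmatrix from binary inputs xs and one-hot classes ys."""
    m, n = len(ys[0]), len(xs[0])
    M = [[0.0] * n for _ in range(m)]  # every component starts at zero
    for x, y in zip(xs, ys):
        for i in range(m):
            if y[i] != 1:
                continue  # rows of the other classes are left unchanged
            for j in range(n):
                # +eps where the input bit is 1, -eps where it is 0
                M[i][j] += eps if x[j] == 1 else -eps
    return M


def lernmatrix_recall(M: List[List[float]], x: List[int]) -> List[int]:
    """Return a class vector with a 1 in every row that attains the maximum response."""
    responses = [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]
    top = max(responses)
    return [1 if r == top else 0 for r in responses]
```

Note that if several rows tie for the maximum, the recalled vector contains more than one 1, so the class is ambiguous in that case.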

Linear Associator
The learning phase of the Linear Associator consists of two steps: 1. For each of the p associations $(x^\mu, y^\mu)$, find the matrix $y^\mu \cdot (x^\mu)^t$ of dimensions $m \times n$. 2. Add the p matrices to obtain the memory $M = \sum_{\mu=1}^{p} y^\mu \cdot (x^\mu)^t$. The recovering phase consists of presenting an input pattern $x^\omega$ to the memory M and performing the following operation: $M \cdot x^\omega = \left[ \sum_{\mu=1}^{p} y^\mu \cdot (x^\mu)^t \right] \cdot x^\omega$.

CHAT
The CHAT is a hybrid associative classifier developed by Santiago-Montero [16], which is based on two associative memories: the Lernmatrix and the Linear Associator. This classifier overcomes some limitations of these two memories by combining the learning and recovery phases of both models. The first proposed model was called CHA, which combined the learning phase of the Linear Associator and the recovering phase of the Lernmatrix, but this model sometimes fails to perform a correct classification. To overcome this limitation, a new version was proposed by adding a new step to the model: translation of the coordinate axes. This new version was named CHAT. With this translation of axes, the new origin is located at the centroid of the input pattern vectors.

CHAT Algorithm
1. Let $\{x^\mu \mid \mu = 1, 2, \ldots, p\}$ be a set of n-dimensional fundamental input patterns with real values in its components, which are grouped into m classes.
2. To each of the fundamental input patterns belonging to class k, an output vector of size m is assigned. This vector consists of zeros, except for the k-th component, whose value is set to 1.
3. Calculate the mean vector of the set of input patterns according to definition 3.1.
4. The mean vector is taken as the new origin of the coordinate axes.
5. Translate the patterns of the input set according to definition 3.2.
6. Apply the learning phase, which is the same as the learning phase of the Linear Associator, to the translated set obtained in the previous step.
7. Translate the patterns that have to be classified using definition 3.2.
8. Apply the recovering phase, which is the same as the recovering phase of the Lernmatrix, to the translated set obtained in the previous step.
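
The eight steps above can be sketched in a few lines of plain Python. This is an illustrative reading of the algorithm, not the authors' implementation: it assumes that the mean vector is the component-wise average of the training patterns, that translation is a subtraction of that mean, and that ties in the recovering phase are broken by taking the first maximum. Classes are indexed from 0, and the names `chat_train` and `chat_classify` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def mean_vector(xs: List[Vector]) -> Vector:
    """Component-wise mean of the input patterns (assumed form of definition 3.1)."""
    p = len(xs)
    return [sum(col) / p for col in zip(*xs)]


def translate(x: Vector, origin: Vector) -> Vector:
    """Move a pattern to the new origin (assumed form of definition 3.2)."""
    return [xi - oi for xi, oi in zip(x, origin)]


def chat_train(xs: List[Vector], labels: List[int],
               n_classes: int) -> Tuple[List[Vector], Vector]:
    """Steps 1 to 6: translate the inputs, then learn as a Linear Associator."""
    origin = mean_vector(xs)
    M = [[0.0] * len(origin) for _ in range(n_classes)]
    for x, k in zip(xs, labels):
        xt = translate(x, origin)
        # With a one-hot class vector, the outer product y * x^t only
        # contributes to row k of the memory.
        for j, value in enumerate(xt):
            M[k][j] += value
    return M, origin


def chat_classify(M: List[Vector], origin: Vector, x: Vector) -> int:
    """Steps 7 and 8: translate the pattern, then pick the row with the largest response."""
    xt = translate(x, origin)
    responses = [sum(m_ij * x_j for m_ij, x_j in zip(row, xt)) for row in M]
    return responses.index(max(responses))  # first maximum wins (assumption)
```

For example, training on two well-separated groups of two-dimensional points and classifying a point near either group returns the index of that group.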

CHAT-OHM
In this section, the proposed method is described. This proposal is part of the results achieved by several members of the Neural Networks and Unconventional Computing Laboratory of the Centro de Investigación en Computación, Instituto Politécnico Nacional, in an attempt to improve the performance of the CHAT model [16]. This joint effort resulted in a number of methods that implement variations on the CHAT, the proposed method being one of them.

One-hot Vectors
One-hot vector is a coding technique for output patterns. It is used in the proposed method instead of the original coding technique presented in step 2 of the CHAT algorithm described in the previous section.

Majority Voting
The classification phase consists of finding the output vector $y^\omega$ to which an unknown input pattern $x^\omega$ belongs. Majority voting is a procedure used during the classification phase to perform this task.

The final steps of the CHAT-OHM algorithm are:
6. Apply the learning phase, which is the same as the learning phase of the Linear Associator, to the translated set obtained in the previous step.
7. Translate the patterns that have to be classified using definition 3.2.
8. Apply the recovering phase, which is the same as the recovering phase of the Lernmatrix, to the translated set obtained in the previous step. Because of the way in which the classes were coded, we obtain an output vector $z^\omega$ of size p and not the desired output class $\tilde{y}^\omega$ of size m. Our algorithm performs an extraction step.
9. Perform the majority voting explained in the previous section.
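
The text does not spell out the one-hot coding or the voting rule in detail, so the following is only one possible reading, not the authors' implementation. It assumes that each of the p fundamental patterns receives its own one-hot output vector of size p, that the recovering phase marks every position of the output vector that attains the maximum response, and that the extraction step maps the marked positions back to their class labels before the most frequent label is chosen. The names `chat_ohm_train` and `chat_ohm_classify` are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Tuple

Vector = List[float]


def chat_ohm_train(xs: List[Vector],
                   labels: List[Hashable]) -> Tuple[List[Vector], Vector, List[Hashable]]:
    """Translate the inputs and learn with one output position per pattern."""
    p = len(xs)
    origin = [sum(col) / p for col in zip(*xs)]
    # ASSUMPTION: pattern mu is coded as a one-hot vector of size p with a 1 in
    # position mu, so the sum of outer products has the translated pattern mu
    # as its row mu.
    M = [[xi - oi for xi, oi in zip(x, origin)] for x in xs]
    return M, origin, list(labels)


def chat_ohm_classify(M: List[Vector], origin: Vector,
                      labels: List[Hashable], x: Vector) -> Hashable:
    """Recall a size-p vector, extract the class labels, and take a majority vote."""
    xt = [xi - oi for xi, oi in zip(x, origin)]
    responses = [sum(a * b for a, b in zip(row, xt)) for row in M]
    top = max(responses)
    z = [1 if r == top else 0 for r in responses]  # output vector of size p
    # Extraction step (assumed): keep the labels of the marked patterns.
    votes = Counter(label for label, z_i in zip(labels, z) if z_i == 1)
    return votes.most_common(1)[0][0]
```

Under this reading, the vote only matters when several fundamental patterns tie for the maximum response; with real-valued attributes such ties are rare, which is one reason to treat the sketch as a simplification.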

Data Sets
This section provides a brief description of the datasets used during the experimental phase. All the datasets were taken from the University of California at Irvine Machine Learning Repository [17]. A summary of the main characteristics of the datasets is shown in Table 1.

Haberman's Survival Dataset
The dataset contains cases from a study conducted at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer. The dataset contains 306 instances, which belong to two different classes: 225 instances belong to the first class (patients who survived 5 years or more) and 81 instances belong to the second class (patients who died within 5 years). The dataset has 4 attributes, including the class attribute. The purpose of the dataset is to predict the survival status of patients who have undergone breast cancer surgery.

Wisconsin Breast Cancer Dataset
This dataset was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. The dataset has information on clinical cases of breast cancer. The dataset contains 699 instances belonging to two classes: 458 instances belong to the first class (benign) and 241 belong to the second class (malignant). Each instance consists of 10 attributes, including the class attribute. The dataset has 16 patterns with one missing value each. The instances with missing values were deleted from the original dataset and the resulting dataset was used for the experimental phase.

Liver Disorders Dataset
The Liver Disorders dataset was created by BUPA Medical Research Ltd. This dataset presents the results of a study of liver disorders that might arise from excessive alcohol consumption. It contains 345 instances belonging to two classes: 145 instances belong to the first class and 200 instances belong to the second class. Each instance consists of 7 attributes, including the class attribute.

Hepatitis Disease Dataset
This dataset contains information on the clinical results of hepatitis patients. It contains 155 instances belonging to two classes: 32 instances belong to the first class (die) and 123 instances belong to the second class (alive). Each instance consists of 20 attributes, including the class attribute. This dataset has multiple missing values. Due to the small size of the dataset and the considerable number of missing values, the instances with missing values cannot be discarded. In this case, the missing values were substituted by the class mode for categorical features and by the class mean for continuous features.
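
The substitution of missing values by the class mode or class mean can be sketched as follows. This is a minimal illustration under stated assumptions: missing values are represented by `None`, the set of categorical column indices is supplied by the caller, and a column with no known value inside a class is left untouched. The function name `impute_by_class` and the toy data are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Set


def impute_by_class(rows: List[list], labels: List[Hashable],
                    categorical: Set[int]) -> List[list]:
    """Fill None with the class mode (categorical columns) or class mean (other columns)."""
    filled = [list(r) for r in rows]
    n_cols = len(rows[0])
    for label in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == label]
        for c in range(n_cols):
            known = [rows[i][c] for i in members if rows[i][c] is not None]
            if not known:
                continue  # nothing to estimate from within this class
            if c in categorical:
                fill = Counter(known).most_common(1)[0][0]  # class mode
            else:
                fill = sum(known) / len(known)  # class mean
            for i in members:
                if filled[i][c] is None:
                    filled[i][c] = fill
    return filled
```

Each fill value is computed only from instances of the same class, so the two classes never share information during imputation.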

Machine Learning Algorithms
This section provides a short description of the algorithms used during the experimental phase. All of these algorithms are implemented in WEKA 3: Data Mining Software in Java [18]. Further details on the implementation of these algorithms can be found in [19].

IB1
IB1 is a basic nearest-neighbor instance-based learner that finds the training instance closest in Euclidean distance to the given test instance and predicts the same class as this training instance. If several instances qualify as the closest, the first one found is used [19].
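
The nearest-neighbor rule described above can be sketched in plain Python. This is not WEKA's code; it is a minimal illustration that assumes numeric attributes and breaks ties by keeping the first closest instance found, as the description states. The function name `ib1_predict` is hypothetical.

```python
import math
from typing import Hashable, List, Sequence


def ib1_predict(train_x: List[Sequence[float]], train_y: List[Hashable],
                x: Sequence[float]) -> Hashable:
    """Predict the class of the training instance closest to x in Euclidean distance."""
    best_index = 0
    best_distance = math.dist(train_x[0], x)
    for i in range(1, len(train_x)):
        d = math.dist(train_x[i], x)
        if d < best_distance:  # strict comparison: on a tie, the first one found is kept
            best_index, best_distance = i, d
    return train_y[best_index]
```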

ConjunctiveRule
ConjunctiveRule implements a simple conjunctive rule learner that predicts either a numeric or a nominal class value. Uncovered test instances are assigned the default class value of the uncovered training instances. The information gain (nominal class) or variance reduction (numeric class) of each antecedent is computed, and rules are pruned using reduced-error pruning [19].

RandomTree
Trees built by RandomTree test a given number of random features at each node, performing no pruning. The tree is constructed considering K randomly chosen attributes at each node. It also has an option to allow estimation of class probabilities based on a hold-out set [19].

RandomForest
RandomForest constructs random forests by bagging ensembles of random trees. A random forest is a classifier consisting of a collection of tree-structured classifiers and each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [20].

BFTree
BFTree constructs a decision tree using a best-first expansion of nodes rather than the depth-first expansion used by standard decision tree learners. Pre-pruning and post-pruning options are available, based on finding the best number of expansions to use via cross-validation on the training data. While fully grown trees are the same for best-first and depth-first algorithms, the pruning mechanism used by BFTree will yield a different pruned tree structure than that produced by depth-first methods [19].

SMO
SMO implements John Platt's sequential minimal optimization algorithm for training a support vector classifier, using kernel functions such as polynomial or Gaussian kernels. Missing values are replaced globally, nominal attributes are transformed into binary ones, and attributes are normalized by default. For further details of the implementation, see [21].

AdaBoostM1
AdaBoostM1 is a variation of boosting, a method for combining multiple models that seeks models that complement one another. The classifier is constructed by combining various classifiers produced by repeatedly running, for T rounds, a given "weak" learning algorithm on various distributions over the training data. Finally, the booster combines the T "weak" hypotheses into a single final hypothesis [22].

MultiBoostAB
MultiBoostAB combines boosting with a variant of wagging to prevent overfitting. MultiBoosting is an extension of the AdaBoost technique [23]. Wagging is a technique that allows variance reduction, while AdaBoost performs both variance and bias reduction. MultiBoost is achieved by wagging a set of sub-committees of classifiers, each sub-committee formed by AdaBoost. When forming decision committees using C4.5 as the base learning algorithm, MultiBoost has been shown to produce committees with lower error than AdaBoost.

RBFNetwork
RBFNetwork implements a normalized Gaussian radial basis function network, deriving the centers and widths of hidden units using k-means and combining the outputs obtained from the hidden layer using logistic regression if the class is nominal and linear regression if it is numeric. The activations of the basis functions are normalized to sum to 1 before they are fed into the linear models [19].

NaiveBayes
NaiveBayes implements the probabilistic Naïve Bayes classifier. The NaiveBayes algorithm is based on Bayes' rule and assumes that the attributes are conditionally independent given the class, and it posits that no hidden or latent attributes influence the prediction process [24].

BayesNet
Bayesian networks are alternative ways of representing a conditional probability distribution by means of directed acyclic graphs (DAGs). In this model, each node represents a random variable and an arrow connecting a parent node with a child node indicates a relationship between them [25]. BayesNet learns Bayesian nets under two assumptions: nominal attributes (numeric ones are pre-discretized) and no missing values (any such values are replaced globally).

NaiveBayesMultinomial
NaiveBayesMultinomial implements the multinomial Naïve Bayes classifier. A standard Naïve Bayes classifier is based on Bayes' rule but does not take into account the number of occurrences of an element. The multinomial Naïve Bayes classifier incorporates frequency information to perform classification [19].

ComplementNaiveBayes
ComplementNaiveBayes builds a Complement Naïve Bayes classifier as described by Rennie et al. [26]. In that work, the authors proposed heuristic solutions to some problems presented by Naïve Bayes classifiers. In particular, they proposed a solution for skewed data, that is, the situation in which one class has more training examples than another, which causes the classifier to prefer one class over the other.

DecisionTable
DecisionTable builds a simple decision table majority classifier. This table has two components: a set of features that are included in the table and a body consisting of labeled instances from the space defined by the features [27].

LWL
LWL is a general algorithm for locally weighted learning. It assigns weights using an instance-based method and builds a classifier from the weighted instances. Different classifiers can be selected, but a good choice is Naïve Bayes for classification problems and linear regression for regression problems. Attribute normalization is turned on by default [19].

DMNBtext
Another Naïve Bayes scheme for text classification is DMNBtext. It learns a multinomial Naïve Bayes classifier in a combined generative and discriminative fashion. DMNBtext injects a discriminative element into parameter learning by considering the current classifier's prediction for a training instance before updating frequency counts. When processing a given training instance, the counts are incremented by one minus the predicted probability for the instance's class value. DMNBtext allows users to specify how many iterations over the training data the algorithm will make, and whether word frequency information should be ignored, in which case the method learns a standard Naïve Bayes model rather than a multinomial one [19].

MultiScheme
MultiScheme selects the best classifier from a set of candidates using cross-validation of percentage accuracy or mean-squared error for classification and regression, respectively. The number of folds is a parameter. Performance on training data can be used instead [19].

Vote
Vote provides a baseline method for combining classifiers. The default scheme is to average their probability estimates or numeric predictions, for classification and regression, respectively. Other combination schemes are available, for example, majority voting for classification [19].
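
The two combination schemes mentioned above can be contrasted with a short sketch. It is not WEKA's implementation; it assumes that each base classifier returns one probability per class, in the same class order, and that ties are broken by taking the first maximum. The function names `vote_average` and `vote_majority` are hypothetical.

```python
from collections import Counter
from typing import List


def vote_average(prob_lists: List[List[float]]) -> int:
    """Average the class probability estimates and return the winning class index."""
    n = len(prob_lists)
    averaged = [sum(col) / n for col in zip(*prob_lists)]
    return averaged.index(max(averaged))


def vote_majority(prob_lists: List[List[float]]) -> int:
    """Let each classifier vote for its most probable class and return the most voted one."""
    votes = Counter(probs.index(max(probs)) for probs in prob_lists)
    return votes.most_common(1)[0][0]
```

The two schemes can disagree: a single very confident classifier may dominate the average while being outvoted under majority voting.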

VotedPerceptron
VotedPerceptron implements the voted perceptron algorithm. The solution vector found by the perceptron algorithm depends greatly on the order in which the instances are encountered. One way to make the algorithm more stable is to use all the weight vectors encountered during learning, not just the final one, letting them vote on a prediction. Each weight vector contributes a certain number of votes [19].

Normalization
During the experiments performed on the original data, we observed that some of the datasets present large differences in scale between features. To avoid the effect that an overly large variable can have on the classification performance, the datasets were normalized and the experiments were performed with the normalized datasets. Normalization can prevent some features from dominating just because they have large numeric values. Subtracting the mean and dividing by the standard deviation can be an appropriate normalization method for this situation [31]. The normalization was performed separately on each attribute. Normalization was calculated using the following expression: $z_i = \frac{x_i - \mu}{\sigma}$, where $z_i$ is the normalized value of $x_i$, $\mu$ is the mean of the population and $\sigma$ is the standard deviation of the population.
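
The per-attribute normalization can be sketched as follows. The sketch uses the population standard deviation, as in the text, and maps constant attributes to zero to avoid a division by zero; that last choice is an assumption, and the function name `zscore_columns` is hypothetical.

```python
import math
from typing import List


def zscore_columns(rows: List[List[float]]) -> List[List[float]]:
    """Normalize each attribute (column) as z = (x - mean) / population standard deviation."""
    normalized_columns = []
    for column in zip(*rows):
        mu = sum(column) / len(column)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in column) / len(column))
        if sigma == 0:
            # Constant attribute: mapped to zeros (assumption, not from the text).
            normalized_columns.append([0.0] * len(column))
        else:
            normalized_columns.append([(v - mu) / sigma for v in column])
    return [list(row) for row in zip(*normalized_columns)]
```

After this transformation, two attributes that differ only in scale, such as (1, 2, 3) and (100, 200, 300), receive essentially the same normalized values, and roughly half of the values become negative.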

Wilson's Edition
One of the most popular filtering algorithms is Wilson's Edition [28]. The general idea of this method is to identify and remove noisy or atypical patterns, primarily those which lie in the overlap area between two or more classes. The process consists of applying the k-nearest-neighbor rule (usually k = 3) to estimate the class of each pattern in the dataset. Those patterns whose class does not correspond to the majority class of their k nearest neighbors are discarded [28].
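
The filtering procedure can be illustrated with a short sketch. It assumes Euclidean distance, excludes each pattern from its own neighborhood, and breaks distance ties by index order; these details are assumptions rather than part of the description above. The function name `wilson_edit` is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def wilson_edit(xs: List[Sequence[float]], labels: List[Hashable],
                k: int = 3) -> List[int]:
    """Return the indices of the patterns kept after Wilson's editing."""
    keep = []
    for i, (x, label) in enumerate(zip(xs, labels)):
        # Distances from pattern i to every other pattern, nearest first.
        others = sorted((math.dist(x, xs[j]), j) for j in range(len(xs)) if j != i)
        neighbor_labels = [labels[j] for _, j in others[:k]]
        majority = Counter(neighbor_labels).most_common(1)[0][0]
        if majority == label:
            keep.append(i)  # the pattern agrees with its neighborhood
    return keep
```

In the usage below, a single pattern labeled "b" placed in the middle of the "a" cluster is the only one discarded.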

Algorithm Comparison
One of the objectives of this study is to perform a consistent comparison between the classification performance of our proposal and that of other well-known pattern classification algorithms. Two aspects need to be addressed: the selection of a suitable test set and the method used to compare the classification performance of each algorithm. To predict the performance of a classifier, we need to assess the success rate on a dataset that takes no part in the construction (training phase) of the classifier. When a large amount of data is available, selecting a suitable test set is not a problem: one simply uses a large training set and a large test set. But the question of predicting performance with limited data is still controversial. There are many techniques, of which cross-validation is the method of choice in most situations. Kohavi [29] compared cross-validation and bootstrap; the results show that bootstrap has low variance but extremely large bias for some problems, and as a consequence stratified 10-fold cross-validation is recommended. To compare our proposal with other pattern classification algorithms, we used the 10-fold cross-validation approach.

Classification Accuracy
For classification problems, the performance of a classifier can be measured in terms of the success rate. The classifier predicts the class of each instance in the test set; if the class is correct, it is counted as a success. The success rate is the proportion of successes over the whole set of test instances. In this paper, the accuracy of the classifiers is expressed as a percentage, and was computed according to the following expression: $\text{Accuracy} = \frac{\text{number of correctly classified instances}}{\text{total number of test instances}} \times 100$.

Validation Method
According to [19] the standard way of predicting the classification accuracy of a learning technique is to use stratified 10-fold cross-validation. This method divides the dataset into 10 parts in which each class is represented in approximately the same proportion as in the full dataset. The classification algorithms will be executed 10 times, in each execution one different part will be used as the test set and the classification algorithm will be trained with the remaining nine parts. The success rate will be calculated for each execution. Finally, the 10 success rates are averaged to yield an overall success rate.
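
The validation scheme can be sketched as follows. This is a minimal illustration, not the procedure used by WEKA: it assumes that stratification is done by shuffling the instances of each class and dealing them to the folds in turn, and that every fold ends up non-empty. The names `stratified_folds` and `cross_validate`, and the `train_fn` and `predict_fn` callbacks, are hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List


def stratified_folds(labels: List[Hashable], k: int = 10, seed: int = 0) -> List[List[int]]:
    """Split instance indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)  # random re-ordering before fold generation
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal the class out in turn
    return folds


def cross_validate(xs: list, labels: List[Hashable], train_fn: Callable,
                   predict_fn: Callable, k: int = 10, seed: int = 0) -> float:
    """Average success rate (as a percentage) over k train/test executions."""
    rates = []
    for test_indices in stratified_folds(labels, k, seed):
        held_out = set(test_indices)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = train_fn(train_x, train_y)
        hits = sum(predict_fn(model, xs[i]) == labels[i] for i in test_indices)
        rates.append(100.0 * hits / len(test_indices))  # assumes a non-empty fold
    return sum(rates) / len(rates)
```

Calling `cross_validate` several times with different `seed` values and averaging the results gives a repeated cross-validation estimate.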

Experiments and Discussion
In this section we present and discuss the results obtained during the experimental phase, throughout which four datasets were used to obtain the classification performance of each of the compared classification algorithms. The datasets used in this section were taken from the UCI Machine Learning Repository [17]. A brief summary of the datasets is presented in Table 1.
The performance achieved by the proposed method is compared with the performance of 19 well-known methods taken from the WEKA 3 Data Mining Software [18]. Further information about the algorithms used can be found in [19]. All experiments were conducted using a personal computer with an Intel Core i3-2100 processor and 4096 MB of RAM, running the Ubuntu 13.04 64-bit operating system.
To ensure valid comparison of classification performance, the same conditions and validation schemes were applied in each experiment. Classification performance of each of the algorithms was calculated using stratified 10-fold cross-validation, with random re-ordering of the patterns before fold generation. In order to account for the random re-ordering of the patterns, the experiments for each classification algorithm, including our proposal, were executed 10 times using the stratified 10-fold cross-validation approach and the results averaged to obtain a final success rate for each algorithm. These results are used to compare the performance of our proposal and the other classification algorithms.

Original Datasets
In this subsection we analyze the classification accuracy results of each one of the compared algorithms, when applied to the original four datasets that were selected for this study. Table 2 shows the classification accuracy achieved by the original CHAT model and by our proposal in the four datasets. It is worth noting that CHAT-OHM achieved the best classification accuracy for all the datasets. In some cases the improvement in the classification accuracy is quite significant, as in the cases of the Breast Cancer dataset and the Hepatitis Disease dataset, with an improvement of 31.9 percent and 16.77 percent, respectively. The improvement for the Liver Disorders dataset is 5.82 percent, which is still important. The Haberman's Survival dataset is where we observed the least improvement with only 0.41 percent. Table 3 shows the classification accuracy achieved by our proposal and the 19 classification algorithms from WEKA, against which we will compare our method. For each dataset, the highest classification accuracy is emphasized with boldface.
As we can observe in Table 3, CHAT-OHM does not surpass all the other classification algorithms; still, it exhibits competitive classification accuracy. For the Breast Cancer dataset, CHAT-OHM achieved an accuracy of 95% (9th place), only 2.34% below the best performer, BayesNet. For the Liver Disorders dataset, the best classifier was RandomForest, with 68.44% classification accuracy, while CHAT-OHM reached 9th place, 6.99% below it. For the Haberman's Survival dataset, CHAT-OHM achieved a classification accuracy of 66.36%, which leaves it in 18th place, 8.44% below the best classifier, NaiveBayes. The best performance for the Hepatitis Disease dataset was achieved by RandomForest with 90.61% classification accuracy, while CHAT-OHM was positioned in 13th place with a classification accuracy of 84.96%.
Notice, however, that despite not exhibiting the best performance for any given dataset, CHAT-OHM behaves consistently: the proposed method reached 9th place for the Breast Cancer dataset and the Liver Disorders dataset, 13th place for the Hepatitis Disease dataset, and 18th place for the Haberman's Survival dataset. On the other hand, BayesNet is the best classifier for the Breast Cancer dataset, while ranking 17th for the Liver Disorders dataset, 16th for the Haberman's Survival dataset, and 9th for the Hepatitis Disease dataset. Another example of this inconsistent performance is the NaiveBayes algorithm: it ranks 5th for the Breast Cancer dataset, is the worst method for the Liver Disorders dataset, is the best for the Haberman's Survival dataset, and ranks 10th for the Hepatitis Disease dataset.

Normalized Datasets
While performing the experiments, we noticed that some attribute values are significantly larger than the values of the rest of the attributes. As recommended by [31], the datasets can be normalized to avoid the impact of such scale differences. The justification usually given for this normalization is that it prevents certain features from dominating merely because they have large numerical values. Table 4 shows the classification accuracy achieved by our proposal and the 19 classification algorithms from WEKA, when applied to normalized datasets. For each dataset, the highest classification accuracy is emphasized with boldface. In general, no significant variations were observed with respect to the results on the datasets without normalization. In most cases the improvement is less than 2 percent, with only two clear exceptions: VotedPerceptron and DMNBtext, which significantly increased their classification accuracy. The former exhibits an improvement of 5.79% for the Breast Cancer dataset and 6.94% for the Hepatitis Disease dataset, while the latter shows an improvement of 24.99% for the Breast Cancer dataset, 6.19% for the Liver Disorders dataset and 9.61% for the Hepatitis Disease dataset.
The performance of CHAT-OHM was not significantly affected by normalization, except for the Hepatitis Disease dataset, where an improvement of 4.56% changes its rank from 13th place (Table 3) to 4th place (Table 4).
The normalization method used in our experiments produces both positive and negative normalized values. This situation did not allow us to perform the experiment with two classification algorithms from WEKA, ComplementNaiveBayes and NaiveBayesMultinomial, since these algorithms are unable to handle negative values.

Outliers Treatment
During the testing phase, we also noticed the presence of some atypical patterns in the datasets. To verify the presence of outliers in the datasets, a method for detection and deletion of outliers called Wilson's Edition was applied to the datasets [28]. Table 5 shows the number of outliers found and deleted from the four datasets using this technique. The information presented in this table shows that the Breast Cancer dataset contains only 3.22% of outliers, while the Haberman's Survival dataset contains 36.45%. The fact that most of the classifiers work much better for the Breast Cancer dataset may be explained by the near absence of outliers in this dataset. The decision boundary between the classes appears to be better defined for the Breast Cancer dataset; thus the classification algorithms exhibit a higher classification accuracy than the one achieved with the other datasets, where the decision boundaries seem less well defined. Table 6 shows the classification accuracy achieved by our proposal and the 19 classification algorithms from WEKA, when applied to datasets without outliers. For each dataset, the highest classification accuracy is emphasized with boldface. The removal of outliers leads to an improvement for all the classification algorithms presented in this work. For the Breast Cancer dataset 22 outliers were removed, which represent 3.22% of the original dataset. The improvement in the classification accuracy for this dataset varies from 0.5% to 4.27%. CHAT-OHM shows an improvement of 3.95% for the Breast Cancer dataset, which changes its position from 9th place (Table 3) to 4th place (Table 6). Table 7 shows the classification accuracy achieved by our proposal and the 19 classification algorithms from WEKA, when applied to normalized datasets without outliers.
For the Liver Disorders dataset, increases in the classification accuracy can be observed when compared with the experiments performed on the original datasets. But if we compare the results of the experiments on the datasets without outliers with those on the normalized datasets without outliers, the classification accuracy gets worse instead of better. In general, it seems that for this dataset it is better not to use normalization and instead rely on the removal of outliers. On the other hand, CHAT-OHM performed better with the normalized and outlier-free dataset. The original performance was 61.45%, the performance with the outlier-free dataset was 69.23% and the performance with the normalized outlier-free dataset was 74.13%; with this improvement the classifier changes its rank from 9th place with the original dataset (Table 3) to 4th place with the normalized outlier-free dataset (Table 7).
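
As a quick check of the CHAT-OHM figures quoted above for the Liver Disorders dataset, the gains can be computed as plain differences between accuracies. Reading "improvement" as a difference in percentage points is an assumption, although it matches the three accuracies given:

$$69.23 - 61.45 = 7.78, \qquad 74.13 - 61.45 = 12.68, \qquad 74.13 - 69.23 = 4.90$$

The first value is the gain from removing outliers, the second is the gain from removing outliers and normalizing, and the third is the additional gain contributed by normalization once the outliers are gone.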

Improvement Analysis
The results presented in Tables 3, 4, 6, and 7 show that no single classification algorithm outperforms all the others on all the problems presented. This observation is consistent with the No-Free-Lunch Theorems presented by Wolpert and Macready [30], which establish that, for any algorithm, a performance gain on one kind of problem is offset by a performance loss on other kinds of problems. Tables 8, 9, 10, and 11 show the percentage of improvement achieved by our proposal and the 19 classification algorithms from WEKA when applied to the normalized datasets, the datasets without outliers, and the normalized datasets without outliers, for each of the four datasets used.
For the Breast Cancer dataset, CHAT-OHM exhibits an improvement of 3.95%, the second-highest improvement among the algorithms when the outliers are removed. The dataset that showed the greatest improvements with the removal of outliers was Haberman's Survival. On average, the classification accuracy improved from 71.82% to 91.49%. The improvements for this dataset range from 15.86% to 28.37%. The CHAT-OHM exhibits an improvement of 24.08% when the outliers are removed, which places it fourth among the algorithms with the highest improvements. For the Liver Disorders dataset, the improvements when the outliers are removed range from 3.65% to 20.13%. The CHAT-OHM shows an improvement of 7.78%, which is relatively low compared with the improvements shown by the rest of the algorithms for this dataset.
With the normalized outlier-free datasets, CHAT-OHM shows an improvement of 12.68% on the Liver Disorders dataset, and its rank changes from 9th place to 4th place. It was also the classifier with the best improvement for this dataset. For the Haberman's Survival dataset, the model exhibits a 24.08% increase in its performance, which was the fourth-best improvement for this dataset.
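The improvement figures quoted in this section appear to be absolute differences between two accuracy values, expressed in percentage points, since the accuracies reported for CHAT-OHM on the Liver Disorders dataset (61.45%, 69.23%, and 74.13%) reproduce the quoted improvements of 7.78% and 12.68%. The short sketch below checks this arithmetic; the function name `improvement` is a hypothetical helper and the interpretation as a simple difference is an assumption inferred from these numbers.

```python
def improvement(accuracy_before, accuracy_after):
    """Difference between two accuracies, in percentage points.

    Assumes the improvement is a plain difference (not a relative
    change); rounded to two decimals to match the reported precision.
    """
    return round(accuracy_after - accuracy_before, 2)


# CHAT-OHM accuracies for the Liver Disorders dataset, as quoted in the text.
original = 61.45
without_outliers = 69.23
normalized_without_outliers = 74.13

gain_without_outliers = improvement(original, without_outliers)
gain_normalized = improvement(original, normalized_without_outliers)

# Average accuracy over all classifiers for Haberman's Survival,
# before and after outlier removal, as quoted in the text.
average_gain_haberman = improvement(71.82, 91.49)
```

Under this reading, the average gain for the Haberman's Survival dataset (from 71.82% to 91.49%) lies inside the reported range of 15.86% to 28.37%, as expected.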

Conclusions
In this paper, we present a method that combines a hybrid associative classifier, a coding technique for output classes, and a majority voting procedure during the classification phase. This method is called CHAT-OHM. During the experimental phase, the method was applied to four different datasets related to the medical field. The performance of the method was compared with that of 19 machine learning algorithms implemented in the WEKA data mining software.
The proposed method uses an associative classifier, the CHAT [16], combined with a novel coding technique and a voting procedure. The results obtained show that the proposed method improves on the results obtained by the original CHAT.
However, the experiments show that the CHAT-OHM is sensitive to the presence of outliers. To improve the classification accuracy of this algorithm, it has to be combined with a method for the detection and removal of outliers. In the present work, we use Wilson's Edition as such a method.
The CHAT-OHM presented fairly good results and a consistent behavior when applied to the four datasets used in this study. The performance of the model was not significantly affected by the normalization process. On the other hand, it was positively affected by the removal of outliers, displaying a remarkable improvement in its performance, such as the ranking improvement for the Breast Cancer dataset (4th place) with a performance increase of 3.95%. Another significant performance enhancement was obtained with the Liver Disorders dataset using normalization and outlier removal: the CHAT-OHM improved its rank to 4th place, with an increase in classification accuracy of 12.68%.
It should be mentioned that our proposal is part of a family of methods based on the CHAT classifier. The main difference between these methods is the coding technique used by each one, for example modified Johnson-Möbius binary coding and Gray coding, among others.
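To make the notions of output coding and majority voting concrete, the following sketch shows how a class index can be written as a one-hot vector or as a binary-reflected Gray code, and how a set of predicted labels can be reduced to a single decision by majority vote. It is a generic illustration only: the function names are hypothetical, the way CHAT-OHM produces the individual votes is not reproduced here, and the modified Johnson-Möbius coding is omitted.

```python
from collections import Counter


def one_hot(class_index, n_classes):
    """One-hot vector: a single 1 at the class position, 0 elsewhere."""
    return [1 if i == class_index else 0 for i in range(n_classes)]


def gray_code(class_index, n_bits):
    """Binary-reflected Gray code of class_index, most significant bit first."""
    g = class_index ^ (class_index >> 1)
    return [(g >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


def majority_vote(predicted_labels):
    """Label that occurs most often (no special tie handling in this sketch)."""
    return Counter(predicted_labels).most_common(1)[0][0]


# Hypothetical example with four classes and three individual predictions.
class_as_one_hot = one_hot(2, 4)
class_as_gray = gray_code(2, 2)
decision = majority_vote(["benign", "malignant", "benign"])
```

With a one-hot code, each class has its own output component, so the output vector is as long as the number of classes; a Gray code needs fewer bits but no longer dedicates one component to each class.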