Imbalanced learning: Improving classification of diabetic neuropathy from magnetic resonance imaging

One of the fundamental challenges when dealing with medical imaging datasets is class imbalance. Class imbalance occurs when the number of instances in the class of interest is low relative to the rest of the data. This study applies oversampling strategies to balance the classes and improve classification performance. We evaluated four classifiers, k-nearest neighbors (k-NN), support vector machine (SVM), multilayer perceptron (MLP) and decision trees (DT), with 73 oversampling strategies. We used these imbalanced learning oversampling techniques to improve classification on two datasets, one distinctively sparser and one more clustered. This work reports the best oversampling and classifier combinations and concludes that, on our datasets, oversampling methods consistently outperformed no oversampling, hence improving the classification results.


Introduction
Machine learning has enabled us to extract patterns from data to build predictive models. However, machine learning models tend to suffer from class imbalance, especially in biomedical diagnosis [1]. In this context, class imbalance describes the skewed representation of a disease phenotype, whereby some classes appear more frequently than others [2]. An imbalanced class label can lead to biased learning in classification algorithms such as k-nearest neighbors (k-NN), support vector machines (SVM), decision trees (DT) and multilayer perceptron (MLP). This occurs because of their inherent tendency to favour and overfit to the majority classes [3]. Many machine learning classifier algorithms assume that the number of instances in each class is roughly similar. However, by biasing training towards the majority classes, we risk overlooking the unique and occasionally more important minority classes.
There are currently three categories of approach to managing imbalanced data. The simplest is the Algorithm Level approach, whereby existing classifier learning algorithms are tuned for class imbalance; one example is k-NN [4]. The second group of methods takes a Data Level approach, which includes preprocessing methods (e.g. the Synthetic Minority Oversampling Technique (SMOTE) [5]; see Fig 1), whereby additional training samples are generated for minority classes to rebalance the class distribution. The third approach, the Cost Sensitive technique [6][7][8], lies between the data and algorithm level approaches. In these techniques, a higher cost is assigned to minority samples during the training process; well known examples are SVM [9] and ADACost [10]. In this paper, we only consider Data Level preprocessing methods, specifically oversampling methods. Firstly, these methods tackle the root of the imbalanced learning problem, which is the lack of data. Secondly, they allow easy application of a machine learning pipeline, unlike cost-sensitive and classifier specific solutions.
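To make the Data Level idea concrete, the following is a minimal sketch of SMOTE-style interpolation, written with the Python standard library only. It is an illustration under stated assumptions, not the implementation used in this study: the function name `smote_sample` and the parameters `k`, `n_new` and `seed` are hypothetical, and the sketch assumes at least two minority samples with numeric features.

```python
import random

def smote_sample(minority, k=5, n_new=10, seed=0):
    """Create n_new synthetic points, each placed on the line segment between
    a randomly chosen minority sample and one of its k nearest minority
    neighbours (the core idea of SMOTE)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        # squared Euclidean distance between two feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # candidate neighbours: every other minority sample, nearest first
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Illustrative usage on four made-up 2-D minority points
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sample(points, k=2, n_new=5)
```

Because each synthetic point is a convex combination of two existing minority samples, the generated data stay inside the region already occupied by the minority class.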
Some real life examples of class imbalanced problems include credit card fraud detection [11], text recognition [12] and, crucially, healthcare diagnostics [13]. Increasingly, advances in machine learning classification, especially in the field of medical imaging, are being used to diagnose diseases and predict treatment outcomes in various medical conditions [14]. In our work, we look closely at classifying diabetic peripheral neuropathy (DPN) subjects. DPN is a common condition affecting half of all diabetic subjects and is a challenging condition to manage effectively [15]. With current treatments, the best outcome we can achieve is 50% pain relief in only a third of subjects. The current approach assumes that all subjects respond similarly to a given drug when in fact there is wide variability in response. Over the last 10 years, we have demonstrated using magnetic resonance (MR) neuroimaging that altered brain structure and functional connectivity could serve as a possible Central Pain Signature (CPS) for painful DPN [16,17], which could provide a means of stratifying subjects to the right treatment first time.
The primary aim of this study was to determine whether oversampling improves the diagnostic performance of machine learning classification trained on MR imaging features. We compared 73 oversampling techniques against a baseline of no oversampling and conventional SMOTE to justify our exhaustive approach. Our secondary aim was to determine which oversampling strategies result in the best performance over two distinct datasets (clustered and sparse; see Fig 2). Both datasets use MR imaging features, specifically resting state and structural features commonly associated with the pain pathways in DPN subjects [16,17]. The first is an imbalanced binary classification dataset derived from what is traditionally a multiclass classification problem. For this dataset, we tried to classify painful DPN from three other groups consisting of healthy volunteers (HV), no neuropathy (noDPN) and painless DPN. Our second, sparser dataset was used to investigate oversampling methods when applied to a smaller dataset with a particularly focused disease phenotype. Here we looked more closely at painful DPN, in particular responsiveness to treatment.

MRI methods
Subjects. Our dataset comprises subjects with diabetes (N = 121) and healthy controls (N = 37). All subjects underwent detailed clinical and neurophysiological assessments to diagnose and phenotype DPN [18]. Subjects with diabetes were divided into three groups: no DPN (N = 42), painless DPN (N = 40) and painful DPN (N = 39). In the first analysis (DTS1), we classified subjects with painful DPN from the rest of the subjects. There were no significant differences in mean age or gender distribution (p > 0.05) between these two groups.
For the second analysis (DTS2), we used a different subset of subjects with painful DPN who had been assessed for response to neuropathic pain treatment. We divided these subjects into responders and non-responders. We used the NTSS-6 questionnaire, which grades neuropathic pain intensity and duration, to define responders (N = 13) as those with a score [19] below seven and non-responders (N = 40) as those with a pain score of seven or above. Table 1 also shows the other characteristics of the two binary classification datasets used. Written informed consent was obtained before subjects participated in the study, which had prior approval by the Sheffield Local Research Ethics Committee.
Dataset assessment. As shown in Table 1, Dataset 1 (DTS1; see S1 Table) comprises diabetic and healthy control (HC) subjects separated into painful and non-painful subject classes. Dataset 2 (DTS2; see S2 Table) contains only DPN subjects separated into responders and non-responders to treatment.
ATR refers to the number of attributes or features used, N is the total number of instances or subjects, N+ is the number of majority class samples, N- is the number of minority class samples and IR is the imbalance ratio (N+/N-). Both datasets have similar imbalance ratios (IR ≈ 3) and a similar ATR. Keeping IR and ATR constant allows us to focus this paper on exploring the two types of datasets. We kept the ATR similar by selecting the best features from our imaging data using the recursive feature elimination (RFE) method, as described in the next section. As shown in Fig 2, DTS1 has a more structured sub-grouping or sub-clustering structure, as the non-painful class contains HC, painless DPN and no DPN subjects, which have distinctive neuroimaging characteristics. DTS2, however, is a more random or sparser dataset with a smaller minority sample size. In addition, all the subjects in this dataset are painful DPN subjects, making this dataset highly similar in neuroimaging characteristics.

Fig 2. DTS1 shows classes with more pockets of bunching together (Clustered), whereas DTS2 is a more sporadic class dataset (Sparser).
https://doi.org/10.1371/journal.pone.0243907.g002
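As a quick check of the reported IR ≈ 3, the ratio can be recomputed from the group sizes given in the Subjects paragraph. The sketch below assumes that IR is the number of majority samples divided by the number of minority samples; the helper name `imbalance_ratio` is hypothetical. It also shows how many synthetic minority samples would be needed to bring each dataset to an IR of one, as is done after oversampling in this work.

```python
def imbalance_ratio(n_majority, n_minority):
    """Imbalance ratio, assumed here to be majority count / minority count."""
    return n_majority / n_minority

# Group sizes taken from the Subjects paragraph
dts1_minority = 39              # painful DPN
dts1_majority = 37 + 42 + 40    # healthy controls + no DPN + painless DPN
dts2_minority = 13              # responders
dts2_majority = 40              # non-responders

ir_dts1 = imbalance_ratio(dts1_majority, dts1_minority)
ir_dts2 = imbalance_ratio(dts2_majority, dts2_minority)
print(round(ir_dts1, 2), round(ir_dts2, 2))  # prints: 3.05 3.08

# Synthetic minority samples needed to reach IR = 1
extra_dts1 = dts1_majority - dts1_minority  # 80
extra_dts2 = dts2_majority - dts2_minority  # 27
```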
Image acquisition & processing. MRI acquisition. In the weeks before treatment, all subjects underwent MRI using a Philips Achieva 3 Tesla system (Philips Medical Systems, Holland) with a 32-channel head coil. Anatomical data were acquired using a T1-weighted magnetisation prepared rapid acquisition gradient echo sequence with the following parameters: repetition time (TR) 7.2 ms, echo time (TE) 3.2 ms, flip angle 8˚ and voxel size 0.9 mm³, yielding isotropic spatial resolution. A 6-minute resting-state fMRI sequence was acquired while subjects fixated on a cross, using a T2*-weighted pulse sequence with TE = 35 ms, TR = 2600 ms and in-plane pixel dimensions of 1.8 mm × 1.8 mm. Contiguous trans-axial slices with a thickness of 4 mm were orientated in the oblique axial plane parallel to the AC-PC bisection, covering the whole cerebral cortex.
Resting state. ROI-ROI based analysis was performed using the CONN functional connectivity toolbox (version 18.a) [20]. This software was also used to perform all preprocessing steps (using the default preprocessing pipeline), as well as subsequent statistical analyses, on all subject scans. In CONN's preprocessing pipeline, raw functional images were slice-time corrected, realigned (motion corrected), unwarped and coregistered to each subject's T1-weighted dataset in accordance with standard algorithms. Resulting images were then normalized to Montreal Neurological Institute (MNI) coordinate space, spatially smoothed (5 mm full-width at half maximum) and resliced to yield 2 × 2 × 2 mm voxels. Regional mean blood oxygenation level dependent (BOLD) time series were extracted from each patient for 10 chosen regions, with each ROI defined with a spherical radius of 5 mm. The 10 sources chosen were the insular cortex (l,r), postcentral gyrus (l,r), precentral gyrus (l,r), thalamus (l,r) and the cingulate gyrus (a,p). Pearson's correlation coefficients were calculated between each region's BOLD time series and every other region's BOLD time series to form a 10 × 164 matrix for each patient. The correlation coefficients were z-transformed using Fisher's transform to normalize the distribution.

Structural processing. Cortical reconstruction and volumetric segmentation were performed with FreeSurfer software [21] (http://surfer.nmr.mgh.harvard.edu).
Preprocessing includes motion correction and averaging [22] of volumetric T1-weighted images, removal of nonbrain tissue [23] using a hybrid watershed/surface deformation procedure, affine registration to the Talairach atlas [24,25], intensity normalization, tessellation of the gray matter-white matter boundary, automated topology correction [26,27], and surface deformation following intensity gradients to optimally place the gray/white and gray/cerebrospinal fluid borders at the location where the greatest shift in intensity defines the transition to the other tissue class [28,29]. Intensity and continuity information from the entire three-dimensional MR volume is used in the segmentation and deformation procedures to produce surface-based maps. These maps subsequently produce representations of cortical thickness, calculated as the closest distance from the gray/white to the gray/cerebrospinal fluid boundaries at each vertex on the tessellated surface (34). Cortical thickness (in mm) and volumes of the insular cortex, postcentral gyrus, precentral gyrus, thalamus and the cingulate gyrus were assessed.
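The correlation and Fisher transform step of the resting-state analysis above can be illustrated with a short standard-library sketch. It is not the CONN implementation; the function names `pearson_r`, `fisher_z` and `connectivity_row` are hypothetical and the time series in the usage lines are made-up numbers.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length time series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def fisher_z(r):
    """Fisher transform z = atanh(r) = 0.5 * ln((1 + r) / (1 - r)).
    Undefined for |r| = 1, e.g. a series correlated with itself."""
    return math.atanh(r)

def connectivity_row(seed_series, target_series_list):
    """z-transformed correlations of one seed ROI with each target ROI."""
    return [fisher_z(pearson_r(seed_series, t)) for t in target_series_list]

# Illustrative usage with made-up BOLD values
seed = [1.0, 2.0, 3.0, 4.0]
targets = [[2.0, 1.0, 4.0, 3.0], [4.0, 3.5, 1.0, 2.0]]
row = connectivity_row(seed, targets)
```

Stacking one such row per seed region gives the per-subject connectivity matrix whose entries serve as resting-state features.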

Machine learning methods
Oversampling strategies. In this work we have kept the IR as one after the application of oversampling for all classification experiments. There are more than 100 variants of SMOTE in the literature [30], but we adopted only 73 oversamplers in our study and discounted techniques that are essentially similar. In total we conducted 292 oversampling classification experiments and four no oversampling experiments using four different classifiers. We have also categorised the oversamplers based on their key operating principles, as reported by [31] and shown in S3 Table. As shown in S3 Table, each oversampling method falls into a few operating principles; however, some techniques are unique and do not fall into any particular operating principle. In the results section, we report which operating principles perform best on our two datasets. Reporting the best oversamplers in this way should also allow future studies with similarly sparse or clustered datasets to select potential oversamplers not described in this work. In the rest of this section, we summarise some of the most prevalent operating principles.

Dimensionality reduction. These techniques reduce the dimensions of the data to a lower dimensional space. Some common techniques are principal component analysis (PCA) and linear discriminant analysis (LDA).
Component-wise sampling. Attributes are sampled independently in this method and the assumption is that the entire volume of a hypercube spanned by two neighboring minority samples belongs to the minority class.
Ordinary sampling. These techniques are very similar to conventional SMOTE methods (see Fig 1) and adopt the underlying principle that new minority samples are generated along the line segment between two neighbouring minority samples.
Borderline. Borderline methods increase the number of minority samples that border majority samples. The objective of using a borderline method is to allow the classifier to be able to distinguish between these borderline observations more easily.
Using a sampling density. The key principle of density based methods is to assign a weighted distribution to different minority class examples.
Use of clustering. In these methods, clustering techniques are used to identify minority concepts, and then the oversampling is done within the individual clusters independently.
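As an illustration of the Borderline principle listed above, the sketch below selects the minority samples that sit near the class boundary, following the rule used by Borderline-SMOTE: a minority sample is treated as borderline when at least half, but not all, of its m nearest neighbours belong to the majority class. This is a minimal standard-library sketch, not the toolbox implementation used in this study; the function name `danger_set` and the example points are hypothetical.

```python
def danger_set(minority, majority, m=5):
    """Return minority samples whose m nearest neighbours (searched over the
    whole training set) are at least half, but not all, majority samples."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # label 1 = minority, label 0 = majority
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    danger = []
    for i, p in enumerate(minority):
        # exclude the sample itself, then keep the m closest points
        others = [(q, lab) for j, (q, lab) in enumerate(labelled) if j != i]
        nearest = sorted(others, key=lambda t: sq_dist(p, t[0]))[:m]
        n_major = sum(1 for _, lab in nearest if lab == 0)
        if m / 2 <= n_major < m:
            danger.append(p)  # borderline: oversample around this point
    return danger

# Illustrative usage: only the minority point at (1, 1) lies near the boundary
minority_pts = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0]]
majority_pts = [[1.1, 1.0], [1.0, 1.2], [3.0, 3.0], [3.1, 3.0]]
borderline = danger_set(minority_pts, majority_pts, m=3)
```

Samples whose neighbours are all majority are skipped because they are more likely to be noise, while samples surrounded by their own class are already easy to classify.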
Classifier algorithms. We used four different classifiers covering neural network methods, ensemble methods and lazy learners. These were chosen as they offer classifier diversity and are among the most widely adopted base classifiers in the imbalanced learning literature [32]. The four classifiers used were:

k-NN. K-nearest neighbor (k-NN) [33,34] is a lazy learning algorithm that stores all instances corresponding to training data points in n-dimensional space. When a new data point is received, it analyses the k closest stored instances (nearest neighbors) and returns the most common class as the prediction. We trained the k-NN classifier used in this work by optimising the number of nearest neighbours and using uniform weighting between them.
SVM. Linear support vector machines (SVM) [35] classify cases by finding a separating boundary termed a hyperplane. The distance of a subject from the hyperplane relates to the likelihood of that subject belonging to a class. SVMs have been used in a range of problems and have already been successful in pattern recognition in bioinformatics, cancer diagnosis [36] and other areas.
MLP. A multilayer perceptron is a class of feedforward artificial neural network composed of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. The MLP is trained using a supervised learning technique called backpropagation.
DT. A decision tree uses a tree-like graph or model of decisions and their possible consequences, and is an algorithm that contains only conditional control statements. This classifier uses a tree representation in which each leaf node corresponds to a class label and attributes are represented at the internal nodes of the tree.
Performance measures. One of the most crucial steps in defining a machine learning model is model evaluation. In binary classification problems, traditional metrics such as accuracy, precision and recall have been widely accepted as standard evaluation measures. These are not suitable in imbalanced scenarios, since the performance of the majority class will be overrepresented. Usage of oversampling techniques maintains a reasonable performance for the majority samples whilst improving the classification of minority samples. We compare three performance measures in this work: the G score, F1 score and AUC score. Previous works have investigated the effectiveness of different measures and have concluded that these measures fit best for imbalanced data problems [37][38][39][40]. First, we introduce some notation: TP, TN, FP and FN are the number of true positive, true negative, false positive and false negative samples, respectively, with P = TP + FN and N = TN + FP. The selected measures are defined as follows.
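G score is the geometric mean of the accuracies achieved on minority and majority instances:

$$G = \sqrt{\frac{TP}{P} \cdot \frac{TN}{N}}$$

F1 score is interpreted as the harmonic mean of the precision (PR) and recall (RE):

$$F1 = \frac{2 \cdot PR \cdot RE}{PR + RE}, \qquad PR = \frac{TP}{TP + FP}, \qquad RE = \frac{TP}{P}$$

where precision is the proportion of samples classified as positive that are truly positive, and recall is the proportion of all positive samples that are classified as positive.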
AUC score (Area Under the receiver operating characteristic Curve) characterises the area under the curve of sensitivities plotted against corresponding false positive rates (FPR):

$$AUC = \int_0^1 \mathrm{Sens}(\mathrm{FPR}) \, d\mathrm{FPR}, \qquad \mathrm{Sens} = \frac{TP}{P}, \qquad \mathrm{FPR} = \frac{FP}{N}$$

Experimental protocol. All analyses were performed using the Scikit-learn package in Python [41]. The oversampling algorithms listed in S3 Table were implemented by adapting a freely available imbalanced learning toolbox [42]. Our full experimental workflow can be seen in Fig 3 and is described below.
The first step in our experiment involved data exploration and cleaning. This involved selecting the most relevant regions (the 10 regions described in the MRI methods section above) from the resting state and volumetric brain image analysis that define our classification problems, as reported in previous work [16,17]. Next, both our datasets were numerically normalised and standardised to a common scale. Feature selection was then performed using the RFE method to reduce dimensionality and avoid selecting highly correlated features. After feature selection, the original imbalanced dataset was oversampled before cross validation. We also split the data with a 0.7/0.3 training/testing split according to the subject classes found in Table 1. We describe the details of the evaluation methodology in the rest of this subsection.
Classifier parameters. To evaluate each classifier we selected different combinations of hyperparameters. For the MLP classifier, we used one hidden layer with the logistic activation function and set the number of hidden units to 10%, 50% or 100% of the number of input features. For the k-NN classifier, we used standard or distance weighted decision functions with L2 distance and 3, 5 or 7 voting neighbors. For the DT classifier, we selected Gini-impurity or entropy as the splitting criterion, with either unbounded depth or a maximum depth of 3 or 5. We used a linear SVM with L1 or L2 penalties, a compatible hinge or squared hinge loss, and the regularisation parameter C set to 1 or 10.
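The parameter combinations described above can be enumerated as plain dictionaries. The sketch below is an assumption-laden illustration rather than the study's code: the dictionary keys follow scikit-learn parameter names (`hidden_layer_sizes`, `n_neighbors`, `max_depth`, `penalty`, `loss`, `C`), the function name `classifier_grids` is hypothetical, hidden unit counts are obtained by truncation, and "compatible" penalty and loss pairs are taken to mean the pairs that scikit-learn's `LinearSVC` accepts (it does not support an L1 penalty with hinge loss).

```python
from itertools import product

def classifier_grids(n_features):
    """Enumerate the hyperparameter combinations for the four classifiers."""
    # Hidden units as 10%, 50% and 100% of the input features
    # (truncating and using at least one unit are assumptions of this sketch)
    hidden = sorted({max(1, int(n_features * f)) for f in (0.1, 0.5, 1.0)})
    mlp = [{"hidden_layer_sizes": (h,), "activation": "logistic"} for h in hidden]

    # p=2 selects the L2 (Euclidean) distance
    knn = [{"n_neighbors": k, "weights": w, "p": 2}
           for k, w in product((3, 5, 7), ("uniform", "distance"))]

    # max_depth=None means the tree depth is unbounded
    dt = [{"criterion": c, "max_depth": d}
          for c, d in product(("gini", "entropy"), (None, 3, 5))]

    # Penalty/loss pairs assumed to be the "compatible" ones
    valid_pairs = [("l1", "squared_hinge"), ("l2", "hinge"), ("l2", "squared_hinge")]
    svm = [{"penalty": pen, "loss": loss, "C": c}
           for (pen, loss), c in product(valid_pairs, (1, 10))]

    return {"MLP": mlp, "k-NN": knn, "DT": dt, "SVM": svm}

# Illustrative usage with 13 input features
grids = classifier_grids(13)
```

Under these assumptions the k-NN, DT and SVM grids each contain six combinations.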
Cross validation. We evaluated classification performance by repeated stratified k-fold cross-validation with 10 splits and 10 repeats.
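A minimal sketch of how repeated stratified k-fold splitting works is given below, using only the standard library. In practice, scikit-learn's `RepeatedStratifiedKFold` provides this functionality; the generator here, with the hypothetical name `repeated_stratified_kfold`, is only meant to show the idea that every fold keeps roughly the class proportions of the full dataset and that the shuffling is redone on each repeat.

```python
import random

def repeated_stratified_kfold(labels, n_splits=10, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) pairs, n_splits per repeat."""
    rng = random.Random(seed)
    # group sample indices by class label
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)

    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # deal each class round-robin so every fold gets its share
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
            yield train, test

# Illustrative usage with the DTS2 class sizes (13 responders, 40 non-responders)
labels = [1] * 13 + [0] * 40
splits = list(repeated_stratified_kfold(labels))
```

With 10 splits and 10 repeats this yields 100 train/test pairs, and stratification ensures that each test fold contains at least one minority sample even when the minority class has only 13 members.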
Performance evaluation. The performance of each oversampler classifier combination (OC) was evaluated on 30 random oversampler parameter combinations with six different classifier parameter combinations. We also oversampled the training set data before classifier training. We evaluated the F1, AUC and G scores and reported the top results for each dataset, classifier and oversampler. Throughout this paper we used the average score (AS), the mean of the AUC, F1 and G scores, to compare oversamplers and classifiers, allowing an unbiased performance evaluation.
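The three measures and the average score can be computed directly from the confusion-matrix counts and classifier scores. The following standard-library sketch uses hypothetical function names; the F1 expression 2TP / (2TP + FP + FN) is algebraically the harmonic mean of precision and recall, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one), which equals the area under the ROC curve.

```python
import math

def g_score(tp, tn, fp, fn):
    """Geometric mean of sensitivity (TP/P) and specificity (TN/N)."""
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)

def auc_score(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_score(auc, f1, g):
    """AS: unweighted mean of the three measures."""
    return (auc + f1 + g) / 3

# Illustrative usage with made-up counts and scores
g = g_score(tp=8, tn=30, fp=10, fn=2)
f1 = f1_score(tp=8, fp=10, fn=2)
auc = auc_score([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
as_value = average_score(auc, f1, g)
```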

Results
All detailed findings from 292 oversampling and four no oversampling experiments are shown in supplementary tables, S4

Oversampling algorithm comparison
We have shown the top 10 performers for each dataset in Tables 2 and 3 below, aggregated by the performance measure results over all four classifiers. Next, we ranked the oversamplers using the average rank (average AUC, F1 and G rank) to obtain a more unbiased oversampler ranking rather than ranking using AS.
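The difference between ranking by AS and ranking by average rank can be seen in a small sketch. The oversampler names and scores below are made up purely for illustration and are not results from this study; the function name `average_rank` is hypothetical, and ties are broken by input order in this sketch.

```python
def average_rank(scores):
    """scores maps oversampler name -> {"AUC": .., "F1": .., "G": ..}.
    Returns each oversampler's mean rank over the three measures (1 = best)."""
    measures = ("AUC", "F1", "G")
    ranks = {name: [] for name in scores}
    for m in measures:
        # highest score gets rank 1
        ordered = sorted(scores, key=lambda name: scores[name][m], reverse=True)
        for position, name in enumerate(ordered, start=1):
            ranks[name].append(position)
    return {name: sum(r) / len(r) for name, r in ranks.items()}

# Made-up scores for three hypothetical oversamplers
example = {
    "A": {"AUC": 0.9, "F1": 0.5, "G": 0.6},
    "B": {"AUC": 0.8, "F1": 0.7, "G": 0.7},
    "C": {"AUC": 0.7, "F1": 0.6, "G": 0.5},
}
mean_ranks = average_rank(example)
best = min(mean_ranks, key=mean_ranks.get)
```

Ranking on each measure separately prevents one measure with a wider numeric spread from dominating the comparison, which is the motivation for using average rank here.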
The top performers always perform better than the baseline comparisons (no oversampling or SMOTE). Usage of oversampling, including baseline SMOTE, outperforms no oversampling, as shown in Table 2. When compared to no oversampling, the best oversampler (Rank 1) gives an improved AS performance of 12.9 percent for DTS1 and 13.2 percent for DTS2. We also conducted an independent samples t-test to test the significance of oversampling versus no oversampling and obtained a p value of 0.159 for DTS1 and 0.044 for DTS2.
Using baseline SMOTE also yields an AS boost of 14.2 percent for DTS1 and 11.3 percent for DTS2. Comparing baseline SMOTE to the best oversampler there was an improvement of

Classifier comparison
SVM is the best performing classifier compared to the other three classifiers (see Table 4). Based on AS scores, SVM performs 3.74 percent better than MLP, the next best classifier, for DTS1 and 5.42 percent better than MLP for DTS2. DT is consistently the worst performing classifier choice. We also compared SVM with the other three base classifiers when oversampling is used; an independent t-test gave a p value of 0.024 for DTS1 and 0.044 for DTS2.

Oversampling classifier combination comparison
The top 10 oversamplers with their respective classifiers are shown in Table 5. We ranked these based on AS scores and show that an oversampler combined with the SVM classifier always outperforms an oversampler combined with DT, k-NN or MLP. This is true for both datasets. The SVM classifier provided the top performing AS score, at 0.76 for DTS1 and 0.93 for DTS2. For DTS1, the top performers using the SVM classifier are A_SUWO [47] and Borderline_SMOTE1 [48], which perform the best amongst all the oversamplers. The top performing SVM oversampler combinations for DTS2 are SMOTE_Cosine [49] and Borderline_SMOTE1 [48], which achieve the highest AS score.

Operating principles
The top three operating principles for DTS1 are Ordinary Sampling, Density Based and Application (see Table 6). Some examples of Ordinary Sampling are ProWSyn [50] and ADASYN [51]. ADASYN also falls in the Density Based category. Another example in the Density Based category is A_SUWO [47]. This oversampler is the top performing OC, as shown in Table 5. The top three principles for DTS2 were Application, Uses Classifier and Ordinary Sampling. Examples of oversamplers in the Uses Classifier operating principle include SMOTE_IPF [52]. The Application principle covers oversamplers developed for specific applications, an example being the CE_SMOTE [53] oversampler. CE_SMOTE and SMOTE_FRST_2T [54] fall in both the Application and Ordinary Sampling principles, which are two of the top three operating principles for DTS2. In contrast, we observed moderate performance when using density estimation or dimensionality reduction.

Discussion
The two key findings from this study were: 1) using an oversampling strategy results in better classification performance than not oversampling at all. When we implemented oversampling, we demonstrated a 14.3 percent improvement on average across both datasets and performance measures. 2) Classification can be improved further by empirically comparing 73 oversamplers. We reported the best oversamplers, classifiers and operating principles for DTS1 and DTS2. In this work, we did not review or discuss oversampling techniques in depth; instead, we used them objectively as a means of improving classification. We have also not considered undersampling or hybrid methods, as we only have a small minority sample size. For DTS1 (more clustered), Assembled SMOTE and SL_graph_SMOTE were the best oversamplers. Both these techniques are categorised principally as Borderline methods. These methods perform well on DTS1 as the clusters of minority and majority samples are more closely packed together. Hence, by using Borderline methods we can distinguish between the two classes more easily. The top 5 oversamplers for DTS1 are further discussed in S1 Appendix. Individually, the best overall OC is A_SUWO when used with SVM. This method is ideal for this dataset as it works best to differentiate sub-clusters of minority samples from majority classes that are close together. It oversamples the sub-clusters by assigning weights to their instances whilst avoiding generating overlapping synthetic instances by considering the majority instances that overlap minority ones. In terms of operating principles, Ordinary Sampling, Density Based and Application methods are the best performers. These principles implement oversampling in a way that is very similar to conventional SMOTE. One reason these are successful is that they make the right compromise between introducing variance and staying close to the original distribution of our dataset.
In DTS2, a smaller and sparser dataset with a smaller minority sample, the best overall oversampler was Lee, in which a noise filtering approach very similar to the k-NN approach is used. Using a post-processing noise-filtering step enhances the performance on a dataset with a small minority sample. The top 5 oversamplers for DTS2 are further discussed in S1 Appendix. Individually, the best OC is SMOTE_Cosine with SVM as the base classifier. This oversampler has been shown to work better with the SVM classifier [49]. Oversampling based on the Application and Uses Classifier operating principles gives the best performance, whereas the worst performance is achieved with density estimation and dimensionality reduction. These methods fail firstly because the number of minority samples is extremely low (N- = 13) and the number of attributes (ATR = 13) is not smaller than N-. Secondly, due to the sparse nature of this dataset, minority samples can sometimes be mistakenly identified as noise [31].
In terms of the number of misclassified cases, there was little change when comparing the oversampling techniques with one another for both DTS1 and DTS2. However, we observed a significant change in misclassification compared to no oversampling. When we used oversampling in DTS1, we found a misclassification rate of 21.1 percent (15 out of 71 cases), compared to 27.6 percent (13 out of 47 cases) with no oversampling. We found a similar trend for DTS2, with 8.3 percent misclassification (2 out of 24 cases) when oversampling was used, compared to 12.5 percent (2 out of 16 cases) with no oversampling. We reported these numbers based on a 70/30 training/testing split and using the top OC performer for both datasets.
As a whole, SVM performs better than the other three classifiers. SVM is robust, precise and easier to train on smaller datasets. SVM also has the ability to generate nonlinear decision boundaries using methods designed for linear classifiers and adopts a flexible decision boundary. This adaptive boundary ability is very important in handling the problem of imbalanced datasets [9,55]. In our results, we found that the AUC scores are comparatively higher than the F1 or G scores. This was because the features (ATR) were selected by maximising AUC. There is scope to improve the AS scores further through better feature selection and hyperparameter tuning of the classifier (in the present study, parameter tuning was constrained by computational time). However, we emphasise that the objective of this work was primarily to compare oversamplers over our two datasets rather than to push AS scores. Future experiments will explore a larger sample size for the smaller dataset (DTS2). We will also explore using other base classifiers and unsupervised techniques such as principal component analysis and K-means clustering.
To conclude, we have reported the most appropriate oversampling approaches for two distinct datasets. In addition, to our knowledge, this is the first study addressing 73 different oversampling strategies to improve the diagnostic performance of machine learning classification on MRI datasets. Our findings provide an insight into the best approach to improving the binary classification of imbalanced datasets.
Supporting information S1