Abstract
We introduce a supervised learning method that classifies each test point by selecting the class for which its inclusion causes the minimum displacement of the class's existing n-th central moment. After each such inclusion, the n-th central moment of the corresponding class is updated by incremental calculations in constant time, i.e., each class evolves gradually and changes its definition incrementally after the inclusion of every new data point. We then use k-fold and stratified k-fold cross-validation techniques to compare the performance of our proposed model with various state-of-the-art supervised learning algorithms, including Neural Network (NN), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbor (KNN) and Logistic Regression (LR), on the Pima Indians Diabetes (PID) dataset and the Wisconsin breast cancer dataset, both popular datasets in machine learning research. Our analyses suggest that the performances of the different MDEM algorithms proposed here, involving different orders of moments, vary within the range of [83.19%, 95.82%] of the best algorithm under consideration in k-fold and stratified k-fold cross-validation for the PID dataset. Moreover, for the Wisconsin breast cancer dataset, the different variants of the MDEM algorithms achieve accuracy scores in the range of [88.85%, 96.41%] of the best algorithm. Finally, we compare the results produced by the different algorithms by constructing the corresponding confusion matrices.
Citation: Nizam AM (2025) Minimum Displacement in Existing Moment (MDEM)- A new supervised learning algorithm by incrementally constructing the moments of the underlying classes. PLoS One 20(12): e0336933. https://doi.org/10.1371/journal.pone.0336933
Editor: Zeheng Wang, Commonwealth Scientific and Industrial Research Organisation, AUSTRALIA
Received: May 15, 2025; Accepted: October 31, 2025; Published: December 5, 2025
Copyright: © 2025 Ahmed Mehedi Nizam. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data used in this study are publicly available and can be accessed from the Pima Indians Diabetes Database (https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database) and the Breast Cancer Wisconsin (Diagnostic) Data Set (https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data).
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction and scope of the current study
To date, there are numerous supervised machine learning algorithms, each having its own strengths and weaknesses. For example, K-Nearest Neighbor (KNN), one of the earliest and perhaps most straightforward algorithms for supervised learning, can train itself in constant time, i.e., as soon as a labelled input is provided, the model can learn it instantly without further processing. However, it has a testing complexity of O(nd), where n is the total number of training points and d is the dimensionality of the feature space. This is a rather daunting task, specifically when we have a large set of training data, as we need to calculate the distance between the new point and all previously classified points [1]. Using techniques like the KD Tree and Ball Tree, the average running time in the testing phase can be improved up to O(log n) at the expense of a costlier time complexity in the training phase [2,3]. However, the worst-case query time complexity is still O(n) [2,3].
Apart from KNN, another much-studied supervised learning algorithm is the Support Vector Machine (SVM), which aims to construct maximum-margin hyperplanes amongst the training data points [4]. The time complexity of the SVM method depends, among other things, upon the algorithm used for optimization (e.g., quadratic programming or gradient-based methods), the dimensionality of the data, the type of kernel function used, the number of support vectors, and the size of the training dataset. The worst-case training time complexities of linear and non-linear SVMs are found to be O(n² × d) and O(n²d)-O(n³) respectively, where n is the size of the training sample and d is the dimensionality of the feature space, although the time complexity can be improved upon using various stochastic optimization techniques [5,6]. The testing-phase time complexities of SVMs are found to be O(d) for a linear kernel and O(s × d) for a non-linear kernel, where s is the number of support vectors.
Perhaps one of the most straightforward algorithms for supervised learning is the Nearest Centroid Classifier (NCC), which estimates the centroid of each class in the training phase; this can be done in O(n × d) time, where n is the number of training samples and d is the dimensionality of the feature space [7]. Moreover, in NCC, every new instance can be classified in O(c × d) time, where c is the number of centroids, as classification involves computing the Euclidean distance between each centroid and the new sample under consideration [8].
Another algorithm frequently used in the classification of labelled data is the Random Forest (RF), which works by constructing multiple decision trees in the training phase, where each tree is trained with a subset of the total data [9]. To predict the final output of the RF in the testing phase, a majority-voting technique is used to combine the results of the multiple decision trees. The time complexity of RF depends upon the number of trees in the forest (t), the sample size (n), the dimensionality of the feature space (d) and the tree height (h), among other things. The training time complexity of RF is found to be O(t × n × d × h), while the testing complexity is O(t × h) per sample.
Another important algorithm for classification is Logistic Regression (LR), which is used primarily for binary classification, although it can easily be adapted to handle multiclass classification problems. The training-phase time complexity of LR depends, among other things, on the number of training samples, the number of iterations and the dimensionality of the feature space, where the number of iterations depends further on the choice of optimization algorithm (stochastic, batch gradient descent or alike) [10]. In a nutshell, the total time complexity of LR in the training phase can be summarized as O(E × n × d), where E is the number of iterations, n is the size of the training sample and d is the dimensionality of the input space. On the other hand, the testing time complexity of LR per sample is O(d), as it simply involves computing the dot product of the weight vector (w) and the feature vector (x) [10,11].
However, perhaps the most popular and widely used supervised learning algorithm is the Neural Network (NN), which is inspired by the networks of biological neurons that comprise the human brain and is presently used extensively in image and video processing, natural language processing, healthcare, autonomous vehicle routing, finance, robotics, gaming and entertainment, marketing and customer service, anomaly detection, etc. The performance of a Neural Network depends upon the number of hidden layers, the number of neurons per layer, the number of epochs, the input size, the input dimensions, etc. If there are L hidden layers each having M neurons, then the training time complexity of the Neural Network can be summarized as O(L × M × E × n × d), where E is the number of epochs/iterations, n is the sample size and d is the dimensionality of the input space, while the testing time complexity per sample of the said NN is O(L × M × d) [12,13]. The choices of the number of hidden layers L, the number of neurons per layer M and the number of epochs E are somewhat arbitrary, i.e., we can choose any value for L, M and E from a seemingly infinite range.
In fact, all of the above algorithms apart from KNN have one or more arbitrary parameters to be set, e.g., the number of iterations, the number of trees, the choice of optimization algorithm, the choice of kernel, the number of hidden layers, the number of nodes in each hidden layer, the choice of activation function, etc. Although KNN has a deterministic training and testing complexity that can be anticipated beforehand, its testing time complexity is linear in the size of the training set, which is very time-consuming and renders KNN impractical in the case of large training data. Here, we propose a new supervised learning algorithm that has a deterministic running time and can learn in O(nd) time and classify new inputs in O(kd) time, where n is the number of inputs, d is the dimensionality of the input space and k is the number of classes under consideration. For a specific problem, the dimensionality of the input space d and the number of classes k are fixed. Thus, unlike KNN, the training-phase time complexity of our proposed algorithm is linear in the number of inputs and the testing time complexity per sample is constant. So, whenever we need a light-weight deterministic algorithm like KNN that, unlike KNN, can classify new instances in constant time, we can use our proposed algorithm, which does not involve solving a complex quadratic programming problem (as in SVM) or operations that require matrix multiplication (as in NN) or the like.
In the training phase, our proposed algorithm finds the n-th moment (raw or central) of each attribute of every class. In the testing phase, the algorithm temporarily includes the new input into each of the k classes and computes the new, temporary n-th moment for each attribute of each class after a temporary inclusion of the new data point into every possible class. The new input is then finally classified into the class for which such inclusion causes the minimum displacement in the existing n-th moment of the underlying class attributes. Once the new input is classified, the n-th moment of the attributes of the respective class is updated to reflect the change, while the moments of all other classes are left unchanged. Thus, apart from classifying new input in constant time, our algorithm also evolves incrementally after the inclusion of every new data point, which makes the model dynamic in nature.
The rest of the article is organized as follows: Sect 2 provides the definitions of the raw and central moments as used in our analysis, Sect 3 describes the new algorithm for supervised learning based upon the Minimum Displacement in Existing Moment (MDEM) technique as devised here, Sect 4 discusses the time complexities of the proposed algorithm, Sect 5 provides the theoretical foundation of the proposed MDEM algorithms, Sect 6 derives a sufficient condition for optimality under the minimum perturbation criteria of the MDEM algorithms, Sect 7 presents the methodology used for the empirical analysis, Sect 8 describes and elaborates the data, Sect 9 presents the various preprocessing techniques used for data cleansing, Sect 10 discusses the empirical results and compares the performance of our proposed algorithms to that of various state-of-the-art supervised learning techniques including K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), Logistic Regression (LR) and Neural Network (NN), and finally, Sect 11 concludes the article.
2 Raw and central moments
2.1 Raw moment
The n-th raw moment of a random variable X is defined as the expected value of the n-th power of X, i.e.,

μ′_n = E[X^n].
When n = 1, the first raw moment of X is the mean of the random variable X.
2.2 Central moment
The n-th central moment of a random variable X is defined as the expected value of the n-th power of the deviation of the random variable X from its mean μ = E[X]. Mathematically,

μ_n = E[(X − μ)^n].
When n = 1, the first central moment of the random variable X about its mean is zero, and when n = 2, the second central moment is the variance of the random variable X. For any arbitrary n, if we expand the expression E[(X − μ)^n] using the binomial theorem, we get the following expression:

E[(X − μ)^n] = Σ_{k=0}^{n} C(n, k) (−1)^{n−k} E[X^k] μ^{n−k}. (1)
Using the linearity property of the expectation operator, we can simplify Eq 1 for n = 2 as follows:

E[(X − μ)^2] = E[X^2] − μ^2. (2)
For n = 3, Eq 1 can be simplified into the following form to get the 3rd central moment about the mean:

E[(X − μ)^3] = E[X^3] − 3μE[X^2] + 2μ^3. (3)
And for n = 4, Eq 1 can be simplified as below to get the 4th central moment about the mean:

E[(X − μ)^4] = E[X^4] − 4μE[X^3] + 6μ^2E[X^2] − 3μ^4. (4)
We use Eqs 2, 3 and 4 to calculate 2nd, 3rd and 4th central moments respectively as part of our current proposition.
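The raw-moment identities in Eqs 2, 3 and 4 can be checked numerically against the direct definition of the central moment. The following illustrative Python sketch (not part of the original algorithms; all variable names are ours) performs this check on a small sample:

```python
# Verify Eqs 2-4: central moments expressed through the raw moments E[X^l].
data = [5.0, 10.0, 15.0, 8.0, 20.0]
N = len(data)

mu = sum(data) / N                                            # mean (first raw moment)
raw = {l: sum(x**l for x in data) / N for l in range(1, 5)}   # E[X^l], l = 1..4

# Direct definition: n-th central moment = E[(X - mu)^n]
central = {n: sum((x - mu)**n for x in data) / N for n in (2, 3, 4)}

# Eq 2: E[X^2] - mu^2
eq2 = raw[2] - mu**2
# Eq 3: E[X^3] - 3*mu*E[X^2] + 2*mu^3
eq3 = raw[3] - 3*mu*raw[2] + 2*mu**3
# Eq 4: E[X^4] - 4*mu*E[X^3] + 6*mu^2*E[X^2] - 3*mu^4
eq4 = raw[4] - 4*mu*raw[3] + 6*mu**2*raw[2] - 3*mu**4

assert abs(eq2 - central[2]) < 1e-9
assert abs(eq3 - central[3]) < 1e-9
assert abs(eq4 - central[4]) < 1e-9
```

Each identity holds exactly up to floating-point error, confirming that the central moments used by MDEM can be recovered from running powered sums alone.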
As we have discussed above, the first raw moment and the second central moment of a distribution are its mean and variance respectively. Mean and variance are two important statistics that capture key information about the distribution. Moreover, in this regard, we may recall that the third central moment of a distribution is popularly known as its (non-normalized) skewness, which represents how symmetric the distribution is with respect to its mean. If the distribution has zero skewness, then the data points are evenly distributed on both sides of the mean. Positive skewness (or right skew) indicates a longer tail on the right side, while negative skewness (or left skew) indicates a longer tail on the left side.
On the other hand, the fourth central moment of a probability distribution is known as its kurtosis in non-normalized form. Kurtosis represents the heaviness of the two tails as well as the sharpness of the peak of a probability distribution as compared to a normal distribution. If a distribution has higher kurtosis, i.e., heavier tails and a sharper peak than a normal distribution, then it has a higher concentration of outliers than a normal distribution and is known as a leptokurtic distribution. On the other hand, if a distribution has low kurtosis, i.e., lighter tails and a flatter peak than a normal distribution, then it is supposed to have fewer outliers and less variability around its mean.
The four moments mentioned above, namely, mean, variance, skewness and kurtosis, are four very important sample statistics that say a lot about the underlying distribution. So, we consider all of these attributes in our current study, although we could equivalently consider any other higher-order moments. In our approach, every new point is classified into the class that is least perturbed in terms of mean, variance, skewness or kurtosis as a result of such inclusion.
3 Proposed supervised learning algorithm based upon Minimum Displacement in Existing Moment (MDEM)
We begin our analysis by sketching the algorithm for the 1st raw moment, i.e., mean. In this step, we intuitively describe the main idea behind the current discourse and then in subsequent steps, we enhance the reasoning to account for higher order central moments of the classes under construction.
3.1 MDEM in mean
Let us assume that we are attempting to solve a multiclass classification problem involving M different classes, enumerated from 1 to M. Let us also assume that each training instance has N numerical attributes. Based upon the values of these N attributes, each training instance is classified into one of the M possible classes. To begin with, we scan the training inputs one at a time and incrementally calculate the attribute sums of the respective class. The sum of the k-th attribute over all members of the j-th class is stored in Sum[j][k]. This is done through line [10–14] of Algorithm 1. As soon as a new training instance is found to belong to class j, count[j] is increased by one. After we are done with scanning the training rows, we calculate the mean of each attribute of every class. This is done simply by dividing Sum[i][j] by count[i], for 1 ≤ i ≤ M and 1 ≤ j ≤ N, in line [15–17] of Algorithm 1. This marks the end of the training phase of our algorithm.
At the end of the training phase, we have a 2D array Mean[][] containing the mean of the j-th attribute of the i-th class at Mean[i][j]. We also have another 2D array, namely Sum[][], containing the sum of the j-th attribute over the i-th class at Sum[i][j]. In the testing phase, we incrementally update these statistics as soon as a new instance is classified into class i by our proposed algorithm. Thus, after every such update, our algorithm evolves to accommodate the new changes. At the very beginning of the testing step, we scan a new test row and temporarily include it into every possible class. This allows us to calculate the new temporary mean of every attribute of each class after such pseudo-inclusion. This is done in line [2–5] of Algorithm 2. Next, we find the Euclidean distance between the existing mean and the new temporary mean for each class. It is to be noted in this regard that both Mean[k] and tMean[k], 1 ≤ k ≤ M, are N-dimensional vectors of the attributes. This is done in line [6–8] of Algorithm 2.
Euclidean distances thus calculated need to be multiplied by the cardinality of the respective class lest the class with high cardinality eat up every new point due to its gravitational pull. This is diagrammatically presented in Fig 1. In Fig 1, we have two classes, namely, Class-1 and Class-2. Class-1 already has 46 members in it, while Class-2 only has 2 members. As soon as we have a new point A to be classified, we notice that the inclusion of point A causes very little displacement in the existing mean of Class-1 as compared to Class-2. To be precise, inclusion of point A into Class-1 causes its mean (Class-1’s mean) to be shifted from point 1 to point 2. However, if point A is instead classified into Class-2, then its mean (Class-2’s mean) is shifted from point 3 to point 4. As evident from Fig 1, distance between [1,2] is quite small as compared to that of [3,4]. So, apparently at this point, we may consider the new point A to be classified into Class-1. However, as we can visually comprehend from Fig 1, point A is supposed to be classified into Class-2. To resolve the issue, we multiply the Euclidean distance calculated as above by the cardinality of the respective class. This modification of weighting the Euclidean distance by the class cardinality is done in line 8 of Algorithm 2.
Algorithm 1. Pseudocode for MDEM in mean: training phase.
Algorithm 2. Pseudocode for MDEM in mean: Testing phase.
Next, we find the Index at which the weighted distance is minimized, and this Index represents the class of the new testing instance under consideration. Then, we update the sum and mean of all attributes of the Index-th class with the temporary sum and temporary mean respectively. All other means and sums (other than those of the Index-th class) are left unchanged before the commencement of a new iteration with a new testing instance.
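For concreteness, the logic of Algorithms 1 and 2 can be sketched in Python as follows. This is an illustrative reimplementation under our own naming (`MDEMMean`, `predict_one`, etc.), not the published pseudocode itself:

```python
class MDEMMean:
    """Sketch of MDEM in mean: Algorithm 1 (fit) and Algorithm 2 (predict)."""

    def __init__(self, n_classes, n_attrs):
        self.M, self.N = n_classes, n_attrs
        self.sums = [[0.0] * n_attrs for _ in range(n_classes)]  # Sum[j][k]
        self.count = [0] * n_classes                             # count[j]

    def fit(self, X, y):
        # Lines 10-14 of Algorithm 1: accumulate per-class attribute sums.
        for row, cls in zip(X, y):
            self.count[cls] += 1
            for k in range(self.N):
                self.sums[cls][k] += row[k]

    def _mean(self, cls):
        # Lines 15-17 of Algorithm 1: Mean[i][j] = Sum[i][j] / count[i].
        c = self.count[cls]
        return [s / c for s in self.sums[cls]]

    def predict_one(self, row):
        best_cls, best_wd = -1, float("inf")
        for j in range(self.M):
            mean = self._mean(j)
            c = self.count[j]
            # Temporary mean after pseudo-inclusion of the new row into class j.
            t_mean = [(self.sums[j][k] + row[k]) / (c + 1) for k in range(self.N)]
            dist = sum((t_mean[k] - mean[k]) ** 2 for k in range(self.N)) ** 0.5
            wd = dist * c        # weight by class cardinality (line 8, Alg. 2)
            if wd < best_wd:
                best_cls, best_wd = j, wd
        # Incremental update: only the winning class evolves.
        self.count[best_cls] += 1
        for k in range(self.N):
            self.sums[best_cls][k] += row[k]
        return best_cls
```

After fitting on labelled rows, each call to `predict_one` both classifies the row and folds it into the winning class, so the model keeps evolving at test time, exactly as described above.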
3.2 MDEM in n-th central moment
The proposed algorithm involving higher-order central moments intuitively uses the same idea as discussed in the previous subsection. In the training phase, we calculate the n-th central moment of every attribute of each class. Let us numerically describe the idea using 3rd-order central moments (n = 3). Let us assume that the 5th attribute of the 2nd class has the values 5, 10, 15, 8, 20, which have a mean of 11.6. So, the 3rd central moment of the 5th attribute of the 2nd class is ((5 − 11.6)³ + (10 − 11.6)³ + (15 − 11.6)³ + (8 − 11.6)³ + (20 − 11.6)³)/5, or 58.75. So, after the training phase, we have the n-th central moment of each attribute of every class. At the testing phase, the new instance is temporarily included into all of the classes and the new temporary n-th central moment of each attribute of every class is calculated. Now, for each class j, 1 ≤ j ≤ M, both the temporary n-th central moment and the existing n-th central moment are vectors of length N, where N is the number of attributes. Next, for each class j, we calculate the N-dimensional Euclidean distance between the temporary and existing n-th central moments. The distance thus calculated is then multiplied by the cardinality of the respective class in order to prevent the most densely populated class from engulfing every new test input.
To calculate the n-th central moment in line with Eq 1, we have to have the l-th powered sum (Σ x^l, 1 ≤ l ≤ n) for each attribute of each class beforehand. These powered sums are generated in line [8–13] of Algorithm 3, where Sum[l][j][k] indicates the l-th powered sum of the k-th attribute of the j-th class. Apart from the powered sums of each attribute of each class, we need to calculate the mean of each attribute of each class in order to determine the n-th central moments in line with Eq 1. These means are generated in line [14–16] of Algorithm 3. Once we have the means and powered sums, we can calculate the n-th central moments according to the formula given in Eq 1. To generate the expected value for any combination of the n-th central moment and the mean, we need to divide it by the cardinality of the respective class. The calculation of the n-th central moments is done in line [17–20] of Algorithm 3. This marks the end of the training phase of our algorithm.
After the end of the training phase, we have captured the value of the n-th central moment of the j-th attribute (1 ≤ j ≤ N) of the i-th class (1 ≤ i ≤ M) in Moment[i][j]. Next, we temporarily include every new test instance into every possible class and calculate the new temporary moment of each attribute of each class. This is done in line [1–9] of Algorithm 4. Next, we calculate the N-dimensional Euclidean distance between the temporary and existing n-th central moments and weight each such distance by the cardinality of the respective class. This is done in line [10–12] of Algorithm 4, and these weighted distances are preserved in wd[]. Next, we select the index value at which the weighted distance wd[] is minimized, and this index value indicates the class to which the new test instance is assigned by our algorithm. Once the class is fixed for the new instance, the existing moments, means and powered sums corresponding to that specific class are set to the temporary moment, temporary mean and temporary powered sums as calculated previously, and the count for that specific class is increased by one. These steps are done in line [15–19] of Algorithm 4. For all other classes, the existing means, moments and powered sums are left unchanged before the beginning of a new iteration for the yet-to-be-classified test rows.
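The training and testing steps of Algorithms 3 and 4 can likewise be sketched in Python. The helper `central_moment` recovers the n-th central moment of a class attribute from its powered sums in the spirit of Eq 1; all names are our own and the sketch is illustrative only (it assumes every class receives at least one training point):

```python
from math import comb

def central_moment(psums, count, n):
    """n-th central moment from the powered sums S_l = sum(x**l), per Eq 1."""
    mu = psums[1] / count
    return sum(comb(n, l) * (-mu) ** (n - l) * psums[l] / count
               for l in range(n + 1))

class MDEMMoment:
    """Sketch of MDEM in the n-th central moment (Algorithms 3 and 4)."""

    def __init__(self, n_classes, n_attrs, order):
        self.M, self.N, self.n = n_classes, n_attrs, order
        # psums[j][k][l] = sum of (k-th attribute of class j) ** l, l = 0..order
        self.psums = [[[0.0] * (order + 1) for _ in range(n_attrs)]
                      for _ in range(n_classes)]

    def _include(self, psums_jk, value):
        # Powered sums after adding one value (S_0 doubles as the count).
        return [s + value ** l for l, s in enumerate(psums_jk)]

    def fit(self, X, y):
        for row, cls in zip(X, y):
            for k in range(self.N):
                self.psums[cls][k] = self._include(self.psums[cls][k], row[k])

    def predict_one(self, row):
        best_cls, best_wd = -1, float("inf")
        for j in range(self.M):
            count = self.psums[j][0][0]      # S_0 of any attribute = cardinality
            sq = 0.0
            for k in range(self.N):
                old = central_moment(self.psums[j][k], count, self.n)
                tmp = central_moment(self._include(self.psums[j][k], row[k]),
                                     count + 1, self.n)
                sq += (tmp - old) ** 2
            wd = sq ** 0.5 * count           # weight by class cardinality
            if wd < best_wd:
                best_cls, best_wd = j, wd
        for k in range(self.N):              # only the winning class evolves
            self.psums[best_cls][k] = self._include(self.psums[best_cls][k],
                                                    row[k])
        return best_cls
```

Because each class keeps only its powered sums, both the pseudo-inclusion and the final update run in constant time per attribute, mirroring the incremental updates of line [15–19] of Algorithm 4.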
Algorithm 3. Pseudocode for MDEM in n-th central moment: Training phase.
Algorithm 4. Pseudocode for MDEM in n-th central moment: Testing phase.
4 Time complexity analysis of the MDEM algorithm
In this section, we will analyze the time complexity of our proposed algorithm both in training and testing phase. We split our analysis into two parts: In first part, we determine the training and testing time complexity of MDEM in mean and in the second part, we analyze the training and testing time complexity of MDEM algorithms involving higher order central moments.
4.1 Time complexity of MDEM in mean
In the training phase, we scan the training rows one by one and check the class of each instance; if the class value is found to be j, then the value of Sum[j][k] is increased by the amount of the k-th attribute value of the training instance under consideration. This is done in line [10–14] of Algorithm 1. So, the time complexity of this step in the training phase is O(P × N), where P is the number of training rows and N is the number of attributes. As the number of attributes N for a specific problem is fixed, the overall time complexity of this step is linear in the number of training rows. Once we have calculated Sum[][] for all the training rows, we can divide Sum[i][j] by count[i], for 1 ≤ i ≤ M and 1 ≤ j ≤ N, to get the mean value of each attribute of each class. This is done in line [15–17] of Algorithm 1. The time needed in this step is O(M × N). For a specific problem, the number of classes (M) and the number of attributes (N) are fixed. Thus, the overall time complexity of the training phase is O(P), where P is the number of training instances.
In the testing phase, we temporarily include each test instance in every class and calculate the temporary mean of each attribute of each class, which can be done in O(M × N) time, where M is the number of classes and N is the number of attributes. This is done in line [1–5] of Algorithm 2. Next, we calculate the weighted and unweighted Euclidean distances between the existing and temporary mean of each class, which is done in line [6–8] of Algorithm 2 in O(M × N) time. Next, we find the index value at which the weighted Euclidean distance as calculated above is minimized, which takes O(M) time (line 9 of Algorithm 2). Finally, we update the sum and mean with the temporary sum and temporary mean of the attribute values of the respective class, which is done in O(N) time (line [11–13] of Algorithm 2). For a specific problem, M and N are fixed, which implies that every new instance can be classified in constant time.
4.2 Time complexity of MDEM in n-th central moment
In the training phase, we need to calculate n powered sums (n being the number of moments considered), as shown in line [8–13] of Algorithm 3. This can be done in O(n) time. This step of calculating the powered sums needs to be repeated for each of the N attributes of the training row under consideration. So, for each training row, we need to calculate n × N powered sums (line [10–12] of Algorithm 3). The steps mentioned in line [10–12] are repeated for each training row as well, and if there are P training rows, then we need O(P × n × N) operations in line [8–13] of Algorithm 3. For a specific problem, n and N are fixed beforehand. Thus, the time complexity of line [8–13] of the training phase is linear in the number of training rows, i.e., O(P). Once the powered sums are generated, we can calculate the mean of each attribute of each class in O(M × N) time (line [14–16] of Algorithm 3) and the n-th central moment of each attribute of each class in O(n × M × N) time (line [17–20] of Algorithm 3). As the values of n, M and N are prefixed for a specific problem, the overall time complexity of the training phase is linear in the number of training instances, i.e., O(P).
In the testing phase, we temporarily include every test instance into each of the possible M classes and calculate the resulting temporary means, temporary powered sums and temporary moments (line [1–9] of Algorithm 4). For every test row, we need to calculate the temporary mean (line 4 of Algorithm 4), the temporary powered sums (line [5–6] of Algorithm 4) and the temporary moments (line [8–9] of Algorithm 4) for each attribute of each class, which can be done in O(M × N), O(n × M × N) and O(n × M × N) time respectively. As the values of n, M and N are predetermined for a specific problem, the above steps can be completed in constant time for every new test instance. The next step in the testing phase involves calculating the Euclidean distance between the vectors of the existing and temporary moments of every class (line [10–12] of Algorithm 4), which can be done in O(M × N) time. Finding the index value at which the weighted distance wd is minimum (line 18) can be done in O(M) time. Finally, updating the mean, the moments and the powered sums for the respective class can be done in O(N), O(N) and O(n × N) time respectively (line [20–26]). As we have mentioned previously, since the values of n, M and N are fixed beforehand for a specific problem, the overall time complexity of classifying a new test instance is constant.
5 MDEM as a size-weighted minimum perturbation classifier
In this section, we provide a theoretical foundation for the MDEM algorithms proposed in the current discourse. To do so, we invoke the Minimum Perturbation Principle (MPP), which says that when we need to update a dynamic model to accommodate new instances, the solution that minimally perturbs the system's initial structures and/or parameters should be selected. The Minimum Perturbation Principle evaluates how small changes can influence classification and model output and is foundational in the areas of perceptual learning [14], adversarial machine learning [15,16], robustness testing [15], sensitivity-based feature selection [17], etc. In our context, we apply the MPP with a view to minimizing the weighted displacement in the existing n-th central moment of a particular class resulting from the inclusion of a new instance into that class. So, when we need to classify a new instance x, we first calculate the temporary moment TM_i for each class i due to the inclusion of the new instance into class i and measure how far TM_i is from the class's initial n-th central moment M_i, i.e., we estimate |TM_i − M_i| for each class, weight such raw displacement with the class cardinality |C_i| and choose the class for which such weighted displacement, |TM_i − M_i| × |C_i|, is minimized. Except for the weighting factor, the idea is inspired by the MPP. We now show that, for both the mean and the n-th central moment of a class, such an unweighted difference between the existing and new temporary moment is inversely proportional to the class cardinality; as such, we need to multiply the unweighted difference by the cardinality of the respective class in order to get an unbiased estimate of the perturbation metric.
Proposition 1: When we add a new point to an existing set of points, then the displacement in the existing mean of the class resulting from such inclusion is inversely proportional to class size.
Proof: Let us assume we have N points in a class, given by x_1, x_2, …, x_N, and that the existing mean of the class is μ. After the inclusion of a new point x_{N+1}, the new mean μ′ is:

μ′ = (Nμ + x_{N+1})/(N + 1).

Therefore, the change in mean as a result of the inclusion of point x_{N+1} is given by the following:

μ′ − μ = (x_{N+1} − μ)/(N + 1).

From the above expression, we can see that the displacement is inversely proportional to the new class cardinality (N + 1).
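Proposition 1 can also be confirmed numerically. The short illustrative sketch below (plain Python, our own variable names) holds the existing mean and the new point fixed while the class size N grows, and checks the displacement against the closed form:

```python
# Displacement of the class mean caused by adding a fixed new point x_new:
# |mu' - mu| = |x_new - mu| / (N + 1), i.e. inversely proportional to N + 1.
x_new, mu = 7.0, 2.0              # keep the existing mean fixed as N grows
for N in (10, 100, 1000):
    total = N * mu                # a class of N points with mean mu
    mu_new = (total + x_new) / (N + 1)
    displacement = abs(mu_new - mu)
    predicted = abs(x_new - mu) / (N + 1)
    assert abs(displacement - predicted) < 1e-12
```

As N grows tenfold, the displacement shrinks roughly tenfold, which is exactly why the raw displacement must be re-weighted by the class cardinality before classes of different sizes can be compared.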
Proposition 2: When we add a new point to an existing set of points, then the displacement in the existing n-th central moment of the class resulting from such inclusion is inversely proportional to class size.
Proof: Suppose we have a dataset with N points: x_1, x_2, …, x_N. The mean of these points is given by the following:

μ = (1/N) Σ_{i=1}^{N} x_i.

By definition, the n-th central moment of the class is given by:

M_n = (1/N) Σ_{i=1}^{N} (x_i − μ)^n.
Now, let us separately analyze how the mean and the n-th central moment of the class change due to the inclusion of a new point x_{N+1}.
- New mean after inclusion of a new point

Due to the inclusion of x_{N+1}, the new mean becomes:

μ′ = (Nμ + x_{N+1})/(N + 1).

Now let us assume that the distance between the new point and the existing mean is δ, i.e., δ = x_{N+1} − μ. If we substitute the value of δ into the above expression for the new mean, we get:

μ′ = μ + δ/(N + 1). (5)
- New central moment after inclusion of a new point

The new n-th central moment is:

M′_n = (1/(N + 1)) Σ_{i=1}^{N+1} (x_i − μ′)^n.

Now, we substitute the value of μ′ from Eq 5 inside the summation sign of the above expression. Doing so, the above expression turns out to be:

M′_n = (1/(N + 1)) [ Σ_{i=1}^{N} ((x_i − μ) − δ/(N + 1))^n + (x_{N+1} − μ′)^n ].

Let us assume h = δ/(N + 1). This reduces the above expression to the following:

M′_n = (1/(N + 1)) [ Σ_{i=1}^{N} ((x_i − μ) − h)^n + (x_{N+1} − μ′)^n ]. (6)
Then, by expanding the term inside the summation sign by binomial expansion, we get:

((x_i − μ) − h)^n = Σ_{k=0}^{n} C(n, k) (−h)^k (x_i − μ)^{n−k}.

Substituting this from the above equation into Eq 6 yields:

M′_n = (1/(N + 1)) [ Σ_{i=1}^{N} Σ_{k=0}^{n} C(n, k) (−h)^k (x_i − μ)^{n−k} + (x_{N+1} − μ′)^n ].

As the indices of both sums range over finite sets, we can use the properties of associativity and commutativity to interchange the order of the summations as follows:

M′_n = (1/(N + 1)) [ Σ_{k=0}^{n} C(n, k) (−h)^k Σ_{i=1}^{N} (x_i − μ)^{n−k} + (x_{N+1} − μ′)^n ].

At this stage, we may recall that:

Σ_{i=1}^{N} (x_i − μ)^{n−k} = N M_{n−k}.

Substituting this into the above expression, we get:

M′_n = (1/(N + 1)) [ N Σ_{k=0}^{n} C(n, k) (−h)^k M_{n−k} + (x_{N+1} − μ′)^n ]. (7)
Now, we rewrite the term outside the summation, i.e., (x_{N+1} − μ′)^n, in a manner that serves our purpose. So, we substitute the value of μ′ from Eq 5 into that term, which results in:

x_{N+1} − μ′ = δ − δ/(N + 1) = Nδ/(N + 1) = Nh, and hence (x_{N+1} − μ′)^n = N^n h^n.

Now, substituting this value into Eq 7 yields:

M′_n = (N/(N + 1)) [ Σ_{k=0}^{n} C(n, k) (−h)^k M_{n−k} + N^{n−1} h^n ]. (8)
- Displacement in the n-th central moment

The displacement in the n-th central moment due to the inclusion of a new point is given by:

ΔM_n = M′_n − M_n.

Now, substituting the value of M′_n from Eq 8 into the above expression, we get:

ΔM_n = (N/(N + 1)) [ Σ_{k=0}^{n} C(n, k) (−h)^k M_{n−k} + N^{n−1} h^n ] − M_n. (9)
Now, the scaling factor \frac{N}{N+1} fades out to 1 for any practical application as the value of N grows. Moreover, as we may recall here, the value of \delta is defined as the difference between the new point and the existing mean, i.e., \delta = x_{N+1} - \mu. Thus, \delta is independent of our decision making process and is given beforehand. Additionally, when we are estimating the n-th central moment involving (N+1) instances, all the lower order moments involving N elements are already given, i.e., all the M_{n-k}, k = 1, 2, \ldots, n, are estimated and do not change. So, the only variable thing that remains in the expression for the moment difference is a power series of h, and, as discussed earlier, h is inversely proportional to the class cardinality since it is defined as h = \frac{\delta}{N+1}. As h is inversely proportional to the class cardinality, higher order terms of h in the displacement equation will have an increasingly lower contribution to the measured displacement as N grows. Thus, we can assume that the stated displacement measure varies approximately with the first power of h, i.e., D \propto h = \frac{\delta}{N+1}, or more concisely, D \propto \frac{1}{N+1}, as \delta is given beforehand and is independent of our selection criteria. This completes the proof.
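As a concrete illustration of the update rules in Eqs 5 and 8, the following Python sketch (an illustrative helper, not part of any released implementation) takes a class summarized by its cardinality N, mean and central moments, and returns the updated statistics after one inclusion using arithmetic whose cost is independent of N:

```python
from math import comb

def update_class_moments(N, mu, M, x_new):
    """Incrementally include x_new into a class summarized by its
    cardinality N, mean mu and central moments M[0..n_max]
    (with M[0] = 1 and M[1] = 0 by definition).
    Implements Eq 5 for the mean and Eq 8 for each central moment."""
    delta = x_new - mu                 # distance from the existing mean
    h = delta / (N + 1)
    mu_new = mu + h                    # Eq 5: mu' = mu + delta/(N+1)
    n_max = len(M) - 1
    M_new = [1.0, 0.0] + [0.0] * (n_max - 1)
    for n in range(2, n_max + 1):
        # binomial sum over the already-known lower order moments
        s = sum(comb(n, k) * (-h) ** k * M[n - k] for k in range(n + 1))
        # Eq 8: scaled sum plus the pulled-out (x_{N+1} - mu')^n term
        M_new[n] = (N / (N + 1)) * s + (N ** n / (N + 1) ** (n + 1)) * delta ** n
    return N + 1, mu_new, M_new
```

For example, starting from the two points {0, 2} (mean 1, variance 1) and including 4 reproduces the directly computed mean and central moments of {0, 2, 4}.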
6 A sufficient condition for optimality under minimum perturbation criteria
Suppose we need to classify a new instance x and let the true class of x be c^*. To do so, we need to calculate the displacement of the existing moment of each class caused by the temporary inclusion of the new instance x into that particular class. For class c, let this empirical displacement be denoted by \hat{D}_c(x). We will include the new instance x into the class for which the estimated \hat{D}_c(x) is minimum, i.e.,

\hat{c}(x) = \arg\min_{c} \hat{D}_c(x)

One important thing to note here is that we are estimating the empirical displacement \hat{D}_c(x), which may be different from the true population displacement D_c(x). This implies that the true population classifier should choose the label of the new instance as follows:

c^*(x) = \arg\min_{c} D_c(x)

As the true class is c^*, D_{c^*}(x) will be minimum and all other D_c(x) will be greater than D_{c^*}(x). Let us now define a margin m(x) that represents the displacement between D_{c^*}(x) and the second best option, i.e.,

m(x) = \min_{c \neq c^*} D_c(x) - D_{c^*}(x)
We now use this definition of margin to establish a sufficient condition for optimality under minimum perturbation criteria of our proposed approach.
Proposition 3: If the difference between the empirical and the true population displacement for all the classes is bounded by \frac{m(x)}{2}, i.e., \left|\hat{D}_c(x) - D_c(x)\right| < \frac{m(x)}{2} for every class c, then MDEM will always choose the correct class under minimum perturbation criteria.
Proof:
From the assumption, we have \left|\hat{D}_c(x) - D_c(x)\right| < \frac{m(x)}{2} for every class c. This implies:

D_c(x) - \frac{m(x)}{2} < \hat{D}_c(x) < D_c(x) + \frac{m(x)}{2} \quad (10)

For the optimal class c^* under minimum perturbation criteria, we use the inequality involving the second and third term of Eq 10, which entails:

\hat{D}_{c^*}(x) < D_{c^*}(x) + \frac{m(x)}{2}

On the other hand, for any class c \neq c^*, taking the first two terms of Eq 10 and using the definition of the margin m(x), we have:

\hat{D}_c(x) > D_c(x) - \frac{m(x)}{2} \geq D_{c^*}(x) + m(x) - \frac{m(x)}{2} = D_{c^*}(x) + \frac{m(x)}{2}

Thus, \hat{D}_c(x) > \hat{D}_{c^*}(x) for all c \neq c^*.

This means, for any arbitrary class c \neq c^*, the estimated empirical displacement is always greater than that of the true class, provided the condition mentioned in Proposition 3 is satisfied. As the estimated empirical displacement is minimum for the true class c^*, MDEM always chooses c^* as the labelling class for the new instance x under minimum perturbation criteria. This completes the proof. □
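A quick numerical sanity check of Proposition 3, using made-up displacement values (not taken from the paper's experiments): as long as every empirical displacement deviates from its true value by strictly less than half the margin, the argmin over classes never changes.

```python
import random

random.seed(0)
# hypothetical true displacements; the true class is "c_star"
D_true = {"c_star": 1.0, "b": 1.8, "c": 2.2}
# margin: gap between the true class and the second best option (here 0.8)
m = min(d for c, d in D_true.items() if c != "c_star") - D_true["c_star"]

for _ in range(10_000):
    # perturb each displacement by strictly less than m/2, as Proposition 3 assumes
    D_hat = {c: d + random.uniform(-0.99 * m / 2, 0.99 * m / 2)
             for c, d in D_true.items()}
    # the empirically chosen class always equals the true class
    assert min(D_hat, key=D_hat.get) == "c_star"
```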
7 Methodology
To begin with, we apply various data preprocessing techniques, e.g., identification and replacement of missing values, detection and removal of outliers using Inter Quartile Range (IQR) filter and feature scaling techniques to normalize the data within the range [0-1]. Processed data are then fed into various machine learning algorithms including MDEM in mean, MDEM in variance (2nd central moment), MDEM in 3rd central moment, MDEM in 4th central moment, KNN with 3, 5, 7 neighbors and NN with 1, 2, 3 hidden layers each comprising 5 neurons with 50, 100 and 150 epochs. We use Eqs 2, 3 and 4 to calculate 2nd, 3rd and 4th central moments used in our proposed analysis based on Minimum Displacement in Existing Moment (MDEM) technique.
Here, we only consider the first raw moment as well as the 2nd, 3rd and 4th order central moments, as they capture specific attributes of the underlying distribution, namely, mean, variance, skewness and kurtosis respectively. It is to be noted in this regard that moments of order higher than four are known as hyper-skewness (odd orders) and hyper-kurtosis (even orders); they are rarely used in statistical analysis and only capture fine details of skewness and kurtosis respectively.
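To make the decision rule concrete, here is a minimal one-dimensional sketch of MDEM in mean (illustrative code only, not the implementation used for the reported experiments): training accumulates each class's cardinality and mean in one pass, and a test point is assigned to the class whose mean would be displaced the least, i.e., the class minimizing |x − μ_c|/(N_c + 1), after which that class's statistics are updated incrementally.

```python
def mdem_mean_fit(xs, ys):
    """Accumulate (cardinality, mean) per class in one linear pass."""
    stats = {}
    for x, c in zip(xs, ys):
        n, mu = stats.get(c, (0, 0.0))
        stats[c] = (n + 1, mu + (x - mu) / (n + 1))   # incremental mean update
    return stats

def mdem_mean_classify(stats, x):
    """Pick the class whose mean moves least if x is included, then include x,
    so the class evolves after every inclusion."""
    c = min(stats, key=lambda k: abs(x - stats[k][1]) / (stats[k][0] + 1))
    n, mu = stats[c]
    stats[c] = (n + 1, mu + (x - mu) / (n + 1))
    return c
```

For instance, with one class centered near 1.0 and another near 5.0, a test point at 1.1 is labeled with the first class and a point at 5.1 with the second.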
For the purpose of this analysis, we randomly split our dataset into k equally sized folds. We then use (k-1) folds to train our model, and the remaining fold is used to analyze the performance of the trained model; this is popularly known as the k-fold cross validation technique. The choice of k is rather arbitrary, and there exist bias-variance trade-offs associated with the choice of k in k-fold cross validation [18]. One of the preferred choices for k is 5, as it has been shown to yield test error estimates that suffer neither from excessively high bias nor from extreme variance [18]. So, we check the performance of different machine learning algorithms using the 5-fold cross validation technique. Apart from that, we also compare the performance of different algorithms under 2, 3 and 7-fold cross validation.
While the k-fold cross validation technique may perform well for balanced data, its performance deteriorates once it is used to handle imbalanced or skewed data [19]. As we will mention later in the data section, the data used in our present analysis are skewed to some degree, i.e., we have an unequal number of observations in each class. To overcome the hurdles faced by the k-fold cross validation technique in classifying imbalanced data, we use stratified k-fold cross validation, which intends to solve the problem of imbalanced data to some extent. In stratified k-fold cross validation, the folds are generated by preserving the relative percentage of each class. Like k-fold, we use 2, 3, 5 and 7 folds in our stratified k-fold cross validation technique to analyze the performance of different algorithms.
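The fold construction can be sketched as follows (a simplified pure-Python illustration of the stratification idea; in practice standard tooling is used): indices are grouped by class and dealt round-robin into k folds, so each fold preserves the class proportions.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Split indices 0..len(labels)-1 into k folds, preserving the
    relative percentage of each class (round-robin within each class)."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds
```

With the raw PID class sizes (500 non-diabetic, 268 diabetic) and k = 2, each fold receives 250 non-diabetic and 134 diabetic records.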
8 Description of data
We use the Pima Indian Diabetes (PID) dataset, originally developed by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), which is part of the United States' National Institutes of Health. The Pima Indians are a Native American people who traditionally lived along the Gila and Salt rivers in Arizona, United States, and can now be found in various parts of Arizona, US and Mexico [20]. This group has a high prevalence of diabetes among its members, and diabetes research on the Pima Indians is often considered significant and representative of global health [21]. The PID dataset comprises records of 768 females aged 21 and above from the Pima Indian population and is widely considered a benchmark dataset in diabetes research [22]. 08 (eight) attributes, namely, number of pregnancies, plasma glucose concentration in an oral glucose tolerance test, diastolic blood pressure, triceps skin fold thickness, serum insulin level, BMI, diabetes pedigree function and age are recorded, along with a class variable representing the prevalence of diabetes in the respective individual. Descriptive statistics of the PID dataset are presented in Table 1.
As can be seen from Table 1, out of 768 records in PID dataset, 500 are non-diabetic, while the rest 268 are diabetic.
Apart from the PID dataset, we also use the Wisconsin breast cancer dataset, which comprises 569 records each containing 30 features [23]. The feature values are extracted from digitized images of fine needle aspirates (FNA) of breast masses. To be precise, there are 10 real valued features, namely, radius (mean of distances from center to points on the perimeter), texture (standard deviation of gray-scale values), perimeter, area, smoothness (local variation in radius lengths), compactness (perimeter²/area − 1.0), concavity (severity of concave portions of the contour), concave points (number of concave portions of the contour), symmetry and fractal dimension ("coastline approximation" − 1). From these 10 real valued features, a total of 30 features are synthesized, representing the mean, standard error and worst (mean of the three largest values) of each feature. There is one output value, which represents whether the extracted breast mass is benign or malignant. In the main dataset, each record also contains the patient ID number, which is not relevant to the current analysis. Descriptive statistics of the records are presented in Table 2.
As can be seen from Table 2, out of 569 records in Wisconsin breast cancer dataset, 357 breast masses are benign, while the rest 212 are malignant.
9 Data preprocessing
Data preprocessing is essential before feeding any data into machine learning algorithms as it helps build a better machine learning model with greater accuracy. Data preprocessing in our current analysis involves identification and replacement of missing values, detection and removal of the outliers and normalization.
9.1 Identification and replacement of missing values
In this step, we identify the missing values in the PID dataset and replace them with the corresponding mean values. It has been observed that three attributes, namely, number of pregnancies, diabetes pedigree function and age have no missing values, i.e., we have all 768 values for these three attributes. For the other attributes, there are some missing values. To be precise, plasma glucose level, blood pressure, triceps skin thickness, insulin level and BMI have 5, 35, 227, 374 and 11 missing values respectively. We replace these missing values with their respective mean values for the sake of our current analysis.
On the other hand, there are no missing values in the Wisconsin breast cancer dataset.
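Mean imputation as described above can be sketched in a few lines (illustrative only; the actual preprocessing was performed with standard tools):

```python
def impute_with_mean(values):
    """Replace missing entries (None) in a column with the mean of
    the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```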
9.2 Detection and removal of outliers
In this step, we identify and remove outliers from the PID dataset. Outliers are simply data points that differ significantly from all other data points under consideration; they may arise due to variability in the measurement, experimental errors, etc. We remove the outliers from the dataset lest we run the risk of building an over-fitted model that performs well on training data but behaves poorly on new test data. We use the Inter Quartile Range (IQR) filter to detect and remove outliers from our dataset. After applying the IQR filter on our data, we have observed that there are 49 outliers and 719 normal records. We remove these 49 outliers from our analysis and continue with the remaining 719 observations.
On the other hand, applying IQR filter on Wisconsin breast cancer data, we find that there are 55 outliers and 10 extreme values. We remove them at this step as part of the data cleansing process.
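The IQR rule used above works as follows (a minimal sketch; quartile conventions differ slightly between tools, so outlier counts may vary at the margins): any value outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] is treated as an outlier.

```python
import statistics

def iqr_filter(values, k=1.5):
    """Split values into (kept, outliers) using the Inter Quartile Range rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # Q1, median, Q3
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if lo <= v <= hi]
    outliers = [v for v in values if v < lo or v > hi]
    return kept, outliers
```

For example, in the column [1, 2, ..., 9, 100], only the value 100 is flagged as an outlier.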
9.3 Feature scaling
Feature scaling is a data preprocessing technique intended to standardize the attribute values within a permissible range. In our present analysis, we have heterogeneous attribute values that differ significantly from one another in their orders of magnitude. For example, the number of pregnancies in the PID dataset varies between [0-17], while the plasma glucose level swings over an even larger range of [44-199]. So, if we do not normalize the data, the machine learning algorithms will tend to assign higher weights to plasma glucose level than to number of pregnancies, which is not intended. Thus, by normalizing all the attribute values within the range [0-1], we can build a better machine learning model. We use the unsupervised normalization filter in Weka to normalize the attribute values within the range [0-1].
The same is applicable to the Wisconsin breast cancer data as well. Like the PID dataset, the features in the Wisconsin breast cancer data also vary widely in their magnitudes. For example, the average value of perimeter mean is 91.97, while the average value of concavity mean is only 0.09. So, it is better to normalize all the features within the [0-1] range before feeding them into any learning algorithm. As with the PID dataset, we use the unsupervised normalization filter in Weka to cast the attribute values into the [0-1] range.
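The [0-1] scaling applied here amounts to standard min-max normalization, which can be sketched per column as:

```python
def min_max_scale(column):
    """Rescale a numeric column linearly into the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: map everything to 0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```

In this way, attributes like pregnancies (0-17) and glucose (44-199) end up on the same [0, 1] scale and receive comparable influence in distance- or moment-based computations.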
10 Results
10.1 PID dataset
We run 19 different machine learning algorithms, namely, MDEM (in 4 variants), Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN) (in 3 variants) and Neural Network (NN) (in 9 variants), and note down their performances based upon their ability to properly classify the PID dataset. We use both k-fold and stratified k-fold cross validation techniques to analyze the performances of different algorithms. Apart from calculating the accuracy scores of different models, we also estimate their respective confusion matrices. Results under the k-fold cross validation technique are presented in Tables 3, 4 and 5, while the results under the stratified k-fold cross validation technique are presented in Tables 6, 7 and 8.
From Table 3, we can see that MDEM in mean, variance, 3rd and 4th central moment have obtained accuracy scores of 73.43%, 70.51%, 68.00% and 68.29% respectively under the 2-fold cross validation technique. The accuracies obtained by Logistic Regression (LR), Random Forest (RF) and Support Vector Machine (SVM) are found to be 74.83%, 75.94% and 76.63%. On the other hand, KNN algorithms with 3, 5, 7 neighbors have accuracies of 69.82%, 70.38% and 72.32% respectively. Moreover, the performance of the NN varies between 66.48% and 74.55% depending upon the number of hidden layers and epochs. The best NN accuracy under current consideration, i.e., 74.55%, is obtained for the NN with 3 hidden layers and 150 epochs, while the worst is obtained for the NN with 1 hidden layer and 100 epochs. As can be seen from Table 3, MDEM in mean performs better than KNN with 3, 5, 7 neighbors and 8 (eight) of the NN models under consideration in the 2-fold cross validation technique. The best algorithm under 2-fold cross validation is found to be Support Vector Machine (SVM) with a run time accuracy of 76.63%. So, the performances of MDEM in 04 different variants are within the range of [89.12% − 95.82%] of the best algorithm under consideration. The results are graphically presented in Fig 2.
Moreover, as can be seen from Table 4, the precision scores of different MDEM algorithms vary between 0.54–0.58. For 2-fold cross validation, the best precision score of 0.73 is obtained for the NN with 1 hidden layer and 100 epochs. So, the precision scores of MDEM algorithms are within [73.97% − 79.45%] of the best algorithm. On the other hand, the recall scores of MDEM algorithms are found within the range of [0.37–0.68]. It is to be noted in this regard that MDEM in mean has obtained the best recall score of 0.68 amongst all the algorithms considered. With a high recall score, MDEM algorithms have obtained F1 scores and MCCs of [0.44–0.63] and [0.24–0.43] respectively. From Table 5, we can see that MDEM in mean has obtained the highest F1-score of 0.63, while its MCC of 0.43 is only next to that of SVM and RF.
Under 3-fold cross validation, the accuracies of MDEM in mean, variance, 3rd and 4th central moments are 73.57%, 70.37%, 68.00% and 67.17% respectively. The best algorithm under 3-fold cross validation is Logistic Regression (LR) with an accuracy of 77.47%. So, the performances of different variants of MDEM are within the range of [86.78% − 94.97%] of the best algorithm for 3-fold. Details are given in Fig 2. Moreover, from Table 5, we can see that the F1 score and MCC of different MDEM algorithms vary within the range of [0.46–0.64] and [0.24–0.43] respectively. The F1 score of MDEM in mean is the highest amongst all the 19 algorithms considered, while its MCC is only next to that of SVM and RF.
For 5-fold cross validation, the accuracies of different MDEM algorithms are 73.44%, 70.37%, 67.74% and 65.23% respectively. As can be seen from Table 3, the best algorithm under the 5-fold cross validation technique is Support Vector Machine (SVM) with a run time accuracy of 77.06%. So, MDEM algorithms run within the range of [87.54% − 95.30%] of the best algorithm under 5-fold cross validation. Details are given in Fig 2. Additionally, from Table 5, we can find that the F1 score and MCC of different MDEM algorithms lie within [0.46–0.64] and [0.21–0.44] respectively. From Table 5, it is also evident that the F1 score of MDEM in mean is the best amongst all the 19 algorithms considered, while its MCC is only next to that of LR, RF and SVM.
For 7-fold cross validation, the accuracies of MDEM in mean, variance, 3rd and 4th central moments are 73.16%, 70.93%, 67.46% and 64.81% respectively. The best algorithm for 7-fold cross validation is found to be Support Vector Machine (SVM) with a run time accuracy of 77.48%. So, different MDEM algorithms run within the range of [83.65% − 94.42%] of the best algorithm under the 7-fold cross validation technique. Detailed results are graphically presented in Fig 2. Moreover, from Table 5, we can see that the F1 score and MCC of different MDEM algorithms lie within [0.46–0.63] and [0.21–0.43] respectively. From Table 5, it is also evident that the F1 score of MDEM in mean is the best amongst all the 19 algorithms considered, while its MCC is only next to that of LR, RF, SVM and one of the NNs.
To summarize, the performance of different MDEM algorithms varies between [83.65% − 95.82%] of the best algorithm under 2, 3, 5 and 7-Fold cross validation technique. Moreover, F1 score and MCC of different MDEM variants are within [0.46–0.63] and [0.21–0.43] respectively. As can be seen from Table 5, MDEM in mean has the highest F1 score, while its MCC is only next to that of LR, RF, SVM and 3 of the NNs.
As we have mentioned earlier, there are 719 records after the outliers and extreme values are removed from our PID dataset. Out of these 719 records, 477 are non-diabetic and 242 are diabetic. So, the sample under consideration is not quite uniform; rather, it is skewed to some extent towards non-diabetic records. A preferred choice for working with such an imbalanced dataset is to use the stratified k-fold cross validation technique instead of simple k-fold. In this part of the analysis, we summarize the performances of different MDEM algorithms along with other state of the art algorithms under stratified 2, 3, 5 and 7-fold cross validation techniques. The detailed results are presented in Tables 6, 7 and 8, while the summarized statistics are shown graphically in Fig 3.
From Table 6, we can see that the accuracies of MDEM algorithms in mean, variance, 3rd and 4th central moment under stratified 2-fold cross validation are 73.99%, 70.79%, 68.01% and 66.90%. The best algorithm in this case is found to be Logistic Regression (LR) with a run time accuracy of 77.33%. So, MDEM algorithms run within the range of [86.51% − 95.68%] of the best algorithm under consideration. This is graphically presented in Fig 3.
Additionally, from Table 8, we can see that the F1 score and MCC of different MDEM algorithms vary within the range of [0.46–0.64] and [0.23–0.44] respectively. It is evident from Table 8 that F1 score of MDEM in mean (0.64) is maximum amongst all the 19 algorithms considered and its MCC of 0.44 is only next to that of RF, LR and one of the NNs.
Under stratified 3-fold cross validation technique, accuracies of different MDEM algorithms are found to be 73.02%, 70.93%, 67.03% and 67.18%, where MDEM in mean is the best performing one, while MDEM in 3rd central moment is the worst performing one with run time accuracy of 73.02% and 67.03% respectively. The best algorithm under stratified 3-fold cross validation is Logistic Regression (LR) with an accuracy of 77.05%. So, as can be seen from Fig 3, different MDEM algorithms run within the range of [87.19% − 94.77%] of the best algorithm under consideration. Moreover, as can be seen from Table 8, F1 score and MCC of different MDEM algorithms vary within the range of [0.47–0.63] and [0.24–0.43] respectively. F1 score of MDEM in mean is the best amongst the 19 algorithms considered, while its MCC is only next to that of LR, RF, SVM and one of the NNs.
Moreover, the run time performances of different MDEM algorithms under stratified 5-fold cross validation technique are found to be 73.85%, 70.78%, 67.31% and 67.03%. The best algorithm under current scenario is Logistic Regression (LR) with an accuracy of 77.75%. This implies, MDEM algorithms run within the range of [86.21% − 94.98%] of the best algorithm under present consideration as can be seen from Fig 3. In addition, F1 score and MCC of the MDEM algorithms vary within [0.47–0.64] and [0.23–0.44] respectively (see Table 8). F1 score of MDEM in mean is the best amongst all the algorithms, while its MCC is next to that of SVM, LR, RF and one of the NNs.
Finally, we analyze the performance of different algorithms under the stratified 7-fold cross validation technique. As can be seen from Table 6, MDEM in mean, variance, 3rd and 4th central moment have accuracies of 73.85%, 70.65%, 67.18% and 66.07% respectively. The best algorithm under stratified 7-fold cross validation is Random Forest (RF) with a run time accuracy of 79.42%. So, MDEM algorithms run within the range of [83.19% − 92.99%] of the best algorithm under consideration. Additionally, the F1 score and MCC of different MDEM algorithms are found to be within [0.49–0.64] and [0.24–0.44] respectively. It is evident from Table 8 that MDEM in mean has the highest F1 score, while its MCC is only next to that of SVM, LR, RF and one of the NNs.
10.2 Wisconsin breast cancer dataset
Like the PID dataset, we run the same 19 algorithms on the Wisconsin breast cancer dataset and compare their performances. The results obtained are presented in Tables 9–14. To be more precise, Table 9 contains the accuracy scores, Table 10 the precision and recall scores and Table 11 the F1 scores and MCCs of different algorithms under the k-fold cross validation technique. On the other hand, Table 12 documents the accuracy scores, Table 13 the precision and recall scores and Table 14 the F1 statistics and MCCs of all the 19 algorithms under the stratified k-fold cross validation technique.
From Table 9, we can see that the accuracy scores of different MDEM algorithms vary between [86.77% − 93.00%] in 2-fold cross validation. The best algorithm in 2-fold is found to be a NN with 3 hidden layers and 100 epochs, which has an accuracy score of 97.47%. Thus, the MDEM algorithms run within [89.02% − 95.41%] of the best algorithm in the 2-fold cross validation technique. Moreover, from Table 11 we can see that the F1 scores and MCCs of different variants of MDEM fluctuate between [0.84–0.90] and [0.76–0.85]. The best F1 score and MCC amongst the MDEM variants are obtained for MDEM in mean (0.90 and 0.85), which are a bit lower than the overall best F1 score of 0.96 and MCC of 0.94 respectively.
Under 3-fold cross validation, MDEM algorithms have obtained accuracy scores within the range [86.77% − 93.97%]. The best amongst the MDEM algorithms is MDEM in mean with an accuracy score of 93.97%, while the overall best is 97.66% (for the NN with 3 hidden layers and 150 epochs). Thus, the performance of MDEM algorithms varies within the range of [88.85% − 96.22%] of the best algorithm. Moreover, the F1 score and MCC of the best MDEM algorithm are 0.91 and 0.86 respectively, which represent a good fit.
For 5-fold cross validation, the accuracy scores of MDEM algorithms are found to be within [87.74% − 93.79%], while the best accuracy score of 97.09% is obtained for a NN with 2 hidden layers and 150 epochs (Table 9). Thus, the accuracy scores of MDEM algorithms vary within [90.47% − 96.60%] of the best algorithm under consideration. Additionally, the F1 score and MCC of the best MDEM algorithm is found to be 0.90 and 0.86 respectively, which also represent a good fit.
Next, the accuracy scores of the MDEM algorithms under 7-fold cross validation are found to fluctuate within [88.71% − 93.96%], while the best accuracy score of 97.67% is obtained for SVM. Thus, the accuracy scores of MDEM algorithms vary within the [90.83% − 96.20%] range of the best algorithm under consideration. Additionally, the F1 score and MCC of the best MDEM algorithm turn out to be 0.91 and 0.86 respectively, which are obtained for MDEM in mean. As can be seen from Table 11, the best F1 score of 0.97 and MCC of 0.95 are obtained for SVM, which are quite comparable to those of MDEM in mean.
Accuracy scores of different MDEM algorithms as percentage of the performance metric of the best algorithm under consideration in k-fold cross validation are graphically presented in Fig 4.
As we are done with the analysis of our algorithms under k-fold cross validation technique, we now compare their performances using stratified k-fold cross validation. To start with, the accuracy scores, precision and recall scores along with the corresponding F1 scores and MCCs are presented in Tables 12, 13 and 14 respectively.
From Table 12, we can see that the accuracy scores of different MDEM algorithms vary between [86.77% − 93.77%] in stratified 2-fold cross validation. The best algorithm in 2-fold is found to be a NN with 3 hidden layers and 100 epochs with an accuracy score of 97.28%. Thus, the performance of the MDEM algorithms is within the range of [89.20% − 96.40%] of the best algorithm under consideration. The F1 score and MCC of the best MDEM algorithm are found to be 0.90 and 0.86 respectively (Table 14).
For stratified 3-fold cross validation, the best algorithm is found to be a NN with 3 hidden layers and 100 epochs yielding an accuracy score of 98.25%. The accuracy scores of different MDEM algorithms are noted to be within [87.55% − 93.97%], which are within [89.11% − 95.64%] range of the best algorithm. F1 score and MCC of the best MDEM algorithm are found to be 0.91 and 0.87 respectively (Table 14).
For stratified 5-fold cross validation, the best algorithm is a NN with 3 hidden layers and 100 epochs, having an accuracy score of 97.47%, while the performance of MDEM algorithms vary between [87.94% − 94.16%]. So, MDEMs run within the [90.22% − 96.60%] range of the best algorithm. F1 score and MCC of the best MDEM algorithm are found to be 0.91 and 0.87 respectively (Table 14).
Finally, for 7-fold stratified cross validation, the best algorithm is a NN with an accuracy score of 97.86% (NN with 3 hidden layers and 100 epochs). On the other hand, the accuracy scores of MDEM algorithms vary within [87.95% − 94.17%], which is very close to the accuracy scores of the best algorithm. F1 score and MCC of the best MDEM algorithm are found to be 0.91 and 0.87 respectively, which represent a quite good fit (Table 14).
Accuracy scores of different MDEM algorithms as percentage of the performance metric of the best algorithm under consideration in stratified k-fold cross validation are graphically presented in Fig 5.
10.3 Run time analysis
In this subsection, we will analyze the running time of different algorithms using k-fold cross validation techniques. Although not reported here, stratified k-fold cross validation technique also entails equivalent results. Results using k-fold cross validation are tabulated in Tables 15–18. To be precise, Tables 15 and 16 summarize the running time for Pima Indian Diabetes dataset, while Tables 17 and 18 summarize the same for Wisconsin breast cancer dataset. We will discuss the running time analysis of different algorithms using PID dataset and Wisconsin breast cancer dataset in separate sections.
10.3.1 Running time analysis using PID dataset.
From Table 15, we can see that the training time of MDEM in mean under 2, 3, 5 and 7-fold cross validation using the PID dataset is the best amongst all the simulated MDEM algorithms, which agrees nicely with our previous theoretical analysis. Moreover, MDEM in mean has a significantly lower running time in the training phase compared to LR, RF, SVM and all the nine NN variants. However, all the 03 KNN variants train faster than MDEM in mean; as we have discussed earlier, KNNs have constant running time in the training phase, as almost zero pre-processing is needed at this stage. But KNN performance suffers much in the testing phase, as its time complexity is linear in the number of training and testing data. As is evident from Table 16, MDEM in mean runs noticeably faster in the testing phase than all the 03 KNN variants as well as all the nine NN variants. However, it is outperformed by LR, RF and SVM in the testing phase.
10.3.2 Running time analysis using Wisconsin breast cancer dataset.
For Wisconsin breast cancer dataset also, the best MDEM algorithm in training and testing phase is again MDEM in mean, as can be seen from Tables 17 and 18. From Table 17, we can see that MDEM in mean runs better than LR, RF and all the nine NN variants in the training phase. But, it is outperformed by SVM and all the 03 KNN variants. On the other hand, in testing phase, MDEM in mean runs better than all the 03 KNN and 09 NN variants. However, it is outperformed by LR, RF and SVM.
10.4 Discussion
From the above discussion, it can be seen that the MDEM algorithm performs within the [83.19% − 95.82%] bound of the best algorithm under consideration for the PID dataset. Moreover, for the Wisconsin breast cancer dataset, its performance swings between [88.85% − 96.41%]. For both datasets, MDEMs involving lower order moments perform remarkably well compared to MDEMs involving higher order moments. In fact, the lower bounds of the afore-mentioned accuracy scores are attributed to MDEMs involving 3rd and 4th order central moments, while the upper bounds are practically due to MDEMs in mean. Also, as can be seen from Tables 5, 8, 11 and 14, MDEMs in mean and variance have consistently obtained better F1 scores and MCCs compared to MDEMs in skewness and kurtosis. However, which version of the MDEM algorithm will perform better than the others depends mostly on the underlying dataset used in the analysis. If the dataset favors homogeneity in lower order moments like mean and variance over higher order ones like skewness and kurtosis, then using lower order moments delivers better performance, and in our cases we have observed exactly this. However, there may be other datasets that prefer homogeneity in skewness and kurtosis to that in mean and variance, and for those datasets, MDEMs involving higher order moments may perform better.
MDEM algorithms are better suited to numerous situations where speed in running time is more important than cutting-edge accuracy. For example, MDEM algorithms can be practically deployed in different text classification scenarios, e.g., classifying documents according to search query, email filtering (spam vs non-spam, as well as categorization and labelling of emails), categorization of news articles (sports, politics, international etc.), populating newsfeeds for users in social media, sentiment analysis (classifying reviews as positive, negative or neutral), support ticket routing (assigning incoming tickets to the right department), legal and medical document tagging (tagging documents based on predefined categories) and the like. These are some of the applications where speed is preferred to optimality, as the data here are massive in quantity and the consequences of possible misclassification are relatively less severe.
11 Conclusion
Here, we have proposed a new supervised learning algorithm that can train itself in time linear in the number of training rows. After the training phase, the algorithm can effectively classify every new instance in constant time for a particular problem and can instantly change its definition after each such inclusion. As we have discussed throughout this article, this significant improvement in running time is obtained at the cost of a slightly less than optimal performance. For the Pima Indian Diabetes (PID) dataset, our algorithms are found to perform within the range of [83.19% − 95.82%] of the best algorithm at hand. For the Wisconsin breast cancer dataset, our algorithms perform within the range of [88.85% − 96.41%] of the best algorithm under consideration. So, whenever we are in need of a multiclass learning algorithm intended to handle massive amounts of data, e.g., newsfeed generation in social media, email filtering, text classification and categorization, support ticket routing or sentiment analysis, the different variants of our proposed MDEM algorithm can come up as a suitable choice with a far better running time and a slightly less than optimal performance.
Acknowledgments
The author is grateful to the editor and reviewers of the journal for their valuable suggestions.
References
- 1. Arya S, Mount DM, Netanyahu NS, Silverman R, Wu AY. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J ACM. 1998;45(6):891–923.
- 2. Bentley JL. Multidimensional binary search trees used for associative searching. Commun ACM. 1975;18(9):509–17.
- 3. Yianilos PN. Data structures and algorithms for nearest neighbor search in general metric spaces. In: Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 1993. p. 311–21.
- 4. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–97.
- 5. Burges CJC. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery. 1998;2(2):121–67.
- 6. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST). 2011;2(3):1–27.
- 7. Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. Cambridge, UK: Cambridge University Press; 2008.
- 8. Dabney AR, Storey JD. Optimality driven nearest centroid classification from genomic data. PLoS One. 2007;2(10):e1002. pmid:17912341
- 9. Ho TK. Random decision forests. In: Proceedings of 3rd International Conference on Document Analysis and Recognition. 1995. p. 278–82. https://doi.org/10.1109/icdar.1995.598994
- 10. Bishop CM, Nasrabadi NM. Pattern recognition and machine learning. Springer; 2006.
- 11. Hosmer Jr DW, Lemeshow S, Sturdivant RX. Applied logistic regression. Wiley; 2013.
- 12. Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016.
- 13. Bengio Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning. 2009;2(1):1–127.
- 14. Shan H, Sompolinsky H. Minimum perturbation theory of deep perceptual learning. Phys Rev E. 2022;106(6–1):064406. pmid:36671118
- 15. Fawzi A, Fawzi O, Frossard P. Analysis of classifiers’ robustness to adversarial perturbations. Mach Learn. 2017;107(3):481–508.
- 16. Brau F, Rossolini G, Biondi A, Buttazzo G. On the minimal adversarial perturbation for deep neural networks with provable estimation error. IEEE Trans Pattern Anal Mach Intell. 2023;45(4):5038–52. pmid:35914038
- 17. Čyras K, Birch D, Guo Y, Toni F, Dulay R, Turvey S, et al. Explanations by arbitrated argumentative dispute. Expert Systems with Applications. 2019;127:141–56.
- 18. Gareth J, Daniela W, Trevor H, Robert T. An introduction to statistical learning: with applications in R. Springer; 2013.
- 19. He H, Ma Y. Imbalanced learning: foundations, algorithms, and applications. 2013.
- 20. Schulz LO, Bennett PH, Ravussin E, Kidd JR, Kidd KK, Esparza J, et al. Effects of traditional and western environments on prevalence of type 2 diabetes in Pima Indians in Mexico and the U.S. Diabetes Care. 2006;29(8):1866–71. pmid:16873794
- 21. Chang V, Bailey J, Xu QA, Sun Z. Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms. Neural Comput Appl. 2022:1–17. pmid:35345556
- 22. Larabi-Marie-Sainte S, Aburahmah L, Almohaini R, Saba T. Current techniques for diabetes prediction: review and case study. Applied Sciences. 2019;9(21):4604.
- 23. Dua D, Graff C. Breast cancer wisconsin (diagnostic) data set. 2017. https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data