
Enhancing symbolic image classification through Gaussian copulas and optimized distinguishing points

  • Sri Winarni ,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Software, Writing – original draft

    sri.winarni@unpad.ac.id

    Affiliation Department of Statistics, Universitas Padjadjaran, Sumedang, Indonesia

  • Sapto Wahyu Indratno,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Writing – original draft

    Affiliation Statistics Research Group, Institut Teknologi Bandung, Bandung, Indonesia

  • Mohd Shahizan Othman,

    Roles Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision

    Affiliation Faculty of Computing, Universiti Teknologi Malaysia, Johor Bahru, Malaysia

  • Siti Zaiton Mohd Hashim,

    Roles Methodology, Resources, Software

    Affiliation Faculty of Computing, Universiti Teknologi Malaysia, Johor Bahru, Malaysia

  • Mohd Murtadha Mohamad,

    Roles Data curation, Resources, Validation

    Affiliation Faculty of Computing, Universiti Teknologi Malaysia, Johor Bahru, Malaysia

  • Apri Junaidi,

    Roles Data curation, Software, Visualization

    Affiliation Faculty of Computing, Universiti Teknologi Malaysia, Johor Bahru, Malaysia

  • Ebenezer Bonyah,

    Roles Resources, Visualization, Writing – review & editing

    Affiliations Department of Mathematics Education, Akenten Appiah-Menka University of Skills Training and Entrepreneurial Development, Kumasi, Ghana, Azerbaijan University of Architecture and Construction, Baku, Azerbaijan

  • Anindya Apriliyanti Pravitasari,

    Roles Data curation, Methodology, Visualization

    Affiliation Department of Statistics, Universitas Padjadjaran, Sumedang, Indonesia

  • Triyani Hendrawati,

    Roles Software, Visualization, Writing – review & editing

    Affiliation Department of Statistics, Universitas Padjadjaran, Sumedang, Indonesia

  • Irlandia Ginanjar

    Roles Funding acquisition, Project administration, Writing – review & editing

    Affiliation Department of Statistics, Universitas Padjadjaran, Sumedang, Indonesia

Abstract

This paper proposes a new method that uses symbolic data represented by empirical cumulative distribution functions (ECDF) and by distribution functions built on sets of ECDF values, referred to as the distribution function of distribution values (DFDV), as image features. This differs from conventional image classification studies, which mostly rely on pixel intensity values as features. Such symbolic representations provide a more general characterisation of pixel intensity patterns across image areas. The main novelty of this research is the construction of the DFDV from an optimised selection of distinguishing points, chosen to capture fundamental changes in intensity distributions within image classes. Whereas previous studies used predetermined distinguishing points, the proposed study introduces a clustering-based approach to find distinguishing points that maximise class separability. Experimental evaluation on the MNIST handwritten digits dataset demonstrates the effectiveness of the proposed method, achieving an average classification accuracy of 68.27% and a highest accuracy of 95.33%. These results indicate that integrating clustering-based symbolic feature extraction with copula-based modelling provides a competitive and promising approach for image classification tasks.

Introduction

Image classification is a branch of artificial intelligence (AI) and a fundamental task used across industries such as healthcare, security, and automation. Owing to rapid technological development, there is a considerable rise in the need for effective and accurate classification methods to support object identification, numerical image analysis, and plant disease identification [1–6]. The recognition of handwritten digits is of exceptional importance among these applications because it is widely used in automation procedures such as cheque scanning, document reading, and data entry. Errors in digit detection can have a major impact on financial security, productivity, and the quality of an activity. Digit recognition is also relevant to cybersecurity, most notably in verifying user inputs in captchas, authenticating handwritten signatures, and detecting fraudulent numerical codes [7,8]. As a result, modern studies aim to create new methods that enhance the correctness, stability, and data reliability of image classification systems. The development of reliable and accurate classification techniques remains a major research field.

A number of remedies have been suggested over the past few years to enhance the performance of image classification. Deep learning has generated enormous excitement due to its capacity to learn complicated, high-dimensional features from image data [9–11]. Typically, such methods have significant processing requirements and need huge amounts of labelled data; moreover, their core processes are generally hard to interpret [12–15]. Probabilistic approaches, on the other hand, such as the Gaussian copula, build a more explanatory modelling framework, since the dependencies between features are represented explicitly on a statistical basis. Although this approach is rather rare in image classification, it has considerable potential, especially given the importance of interpretability and statistical structure over mere model simplicity.

The Gaussian copula is a function that links the marginal distributions of a number of random components to create their joint distribution. The dependencies between the variables are determined by a multivariate normal distribution [16,17]. Gaussian copulas are advantageous for image classification because they explain the associations between image features and labels. This advantage is specifically useful for modelling complex image data such as handwritten digits, which show variations in style and thickness. The present paper explores the use of symbolic information, in the form of distribution functions, representing image features treated as random variables within the Gaussian copula framework. This methodology differs from the usual analysis, where pixel intensity information is used directly as features.

Symbolic data are data that carry information beyond numerical values and that may be categories, functions, or distributions with intrinsic meanings [18,19]. In this work, symbolic data are used as visual features by treating the breakdown of pixel intensities as distributions characterised as stochastic variables. Such an approach allows more comprehensive statistical modelling compared to the use of raw pixel values or simple feature vectors. It has several advantages: (1) it summarises random-variable behaviour in the convenient form of a distribution; (2) it reinforces stability in the feature representation; and (3) it compacts the data structure, which facilitates statistical analysis. Thus, the method can serve as a relevant alternative for the statistical modelling of image data.

Previous studies involving the use of copulas in image classification systems are characterised by a heavy reliance on pixel intensity values and their variation attributes [20–23]. A symbolic data technique using distribution functions, in particular the empirical cumulative distribution function (ECDF) and the distribution function of distribution values (DFDV), as image features was proposed in a recent study [24]. In that study, the DFDV was developed using a predetermined set of distinguishing points to encapsulate essential elements of the ECDF. Nonetheless, employing fixed distinguishing points may constrain the capacity to identify the most pertinent variances among image categories. This study enhances [25], which previously achieved an accuracy of 62.66%, by introducing an innovative strategy for identifying distinguishing points using clustering. This adaptive technique seeks to discern image features that more efficiently distinguish classes by identifying salient patterns in the data. The dynamic selection of differentiating features is anticipated to enhance classification accuracy by more accurately representing the statistical structure of the images.

To evaluate the proposed framework, experiments are conducted using the MNIST handwritten digit dataset, which remains a widely adopted benchmark for image classification research. Recent MNIST studies predominantly focus on optimizing predictive performance through convolutional neural networks, recurrent models, and hybrid deep learning architectures [26–30], typically relying on pixel-level representations and increasingly complex neural structures. In contrast, this study does not aim to introduce a new neural architecture, but rather to demonstrate a fundamentally different statistical modeling perspective. MNIST was chosen as it is a standard dataset for image classification, facilitating performance comparison with previous studies. Its variations in handwriting and pixel intensity reflect real-world challenges, allowing methods that perform well to be adapted for applications such as OCR and symbol recognition.

The main novelty of this work lies in integrating optimized symbolic feature construction based on ECDF with Gaussian copula modeling within a unified classification framework. Specifically, the proposed approach constructs DFDV using clustering-based optimized distinguishing points and captures dependence structures among symbolic features through copula modeling. This integrated framework provides a statistically grounded and interpretable alternative to conventional pixel-driven methods.

The study aims to (1) develop a copula model utilising distribution functions as random variables; (2) emphasise the discriminative capacity of the ECDF and the DFDV as salient image features; and (3) improve class separability via a clustering-based methodology for identifying optimal distinguishing points. This research’s primary contributions are: (1) the creation of a Gaussian copula-based framework tailored for image classification tasks; (2) the incorporation of pixel intensity distributions as symbolic features to enhance classification performance; and (3) the progression of probabilistic methods in symbolic data modelling. The subsequent sections of this work are structured as follows. Section 1 introduces the background, rationale, objectives, and importance of incorporating Gaussian copula modelling in image classification. Section 2 delineates the methodology, encompassing the theoretical framework of the Gaussian copula, the formulation of symbolic data via the ECDF and DFDV, and the proposed clustering-based approach for identifying discriminative distinguishing points. Section 3 details the experiments performed with the MNIST dataset of handwritten digits and delineates the classification results achieved with the proposed strategy. Section 4 brings forward the results discussion and analysis, including interpretation of the performance and comparison with other studies. The last section, Section 5, provides an overall conclusion by highlighting the important findings of the study and suggesting areas of future research.

Methods

In this section, the methodology of this research is outlined. It begins with a definition of the Gaussian copula, which is used to model dependencies among random variables. The following subsections explain how symbolic data features are generated via the ECDF and DFDV, how the distinguishing points are determined, the parameter estimation methodology, and the classification procedure of the proposed copula-based model.

Overview of Gaussian copula

The Gaussian copula is a mathematical construction that defines the dependence relationship between random variables independently of their marginal distributional forms, through a transformation of the multivariate normal distribution. A copula provides a strong model for expressing the relation between multivariate random variables because it creates a clear distinction between the dependence structure and the distributions of the individual variables. The methodology follows copula theory, whereby a multivariate distribution is conceived as an integration of its marginal distributions with a copula. Building on the multivariate normal case, the Gaussian copula captures linear correlations while allowing the marginal distributions to differ from the normal distribution [31].

Consider random variables $X_1, \ldots, X_d$, each with a marginal CDF $F_i$. The Gaussian copula transforms these variables into uniform random variables $U_i = F_i(X_i)$ on the range $[0,1]$. The transformed variables retain the dependency patterns embedded in the copula. We can mathematically articulate the joint CDF of random variables $X_1, \ldots, X_d$ with marginal distributions $F_1, \ldots, F_d$ as follows:

$$H(x_1,\ldots,x_d) = C\big(F_1(x_1),\ldots,F_d(x_d)\big) \quad (1)$$

The copula function $C \colon [0,1]^d \to [0,1]$ generalises the dependence between the referred-to variables. The theory of copulas has its theoretical footing in Sklar's theorem, which postulates that any multivariate distribution can be presented as a copula joined with the marginal distributions [32]. In particular:

$$H(x_1,\ldots,x_d) = C(u_1,\ldots,u_d) \quad (2)$$

where $u_i = F_i(x_i)$. This separation allows us to model the dependence structure $C$ independently of the marginals $F_i$. For completely continuous distributions, the associated density function is defined as:

$$h(x_1,\ldots,x_d) = c\big(F_1(x_1),\ldots,F_d(x_d)\big)\prod_{i=1}^{d} f_i(x_i) \quad (3)$$

where $f_i$ is the marginal density and $c$ is the copula density. The multivariate normal distribution forms the Gaussian copula. Let $Z = (Z_1,\ldots,Z_d)$ represent a vector of standard normal random variables characterised by a covariance matrix $\Sigma$. The Gaussian copula exhibits the following characteristics:

$$C_{\Sigma}(u_1,\ldots,u_d) = \Phi_{\Sigma}\big(\Phi^{-1}(u_1),\ldots,\Phi^{-1}(u_d)\big) \quad (4)$$

Here, $\Phi_{\Sigma}$ is the cumulative distribution function (CDF) of the multivariate normal distribution defined by the covariance matrix $\Sigma$, and $z_i = \Phi^{-1}(u_i)$, where $\Phi^{-1}$ represents the inverse of the standard normal CDF, with $u_i \in [0,1]$ for $i = 1,\ldots,d$ signifying the marginal uniform variables [33].

The dependency structure in the Gaussian copula is defined by the covariance matrix of a multivariate normal distribution. This structure allows it to capture and represent linear relationships among the variables. At the same time, the marginal distributions of the variables are flexible and do not need to follow a normal distribution, enabling diverse applications with customized marginal properties. However, Gaussian copulas are limited in capturing tail dependencies, as the Gaussian structure inherently assumes symmetric dependency patterns. This restriction is in contrast to the $t$-copula and other copulas that have the ability to capture stronger tail dependencies. The Gaussian copula provides a density function for dependency modeling:

$$c_{\Sigma}(u_1,\ldots,u_d) = \frac{\phi_{\Sigma}(z_1,\ldots,z_d)}{\prod_{i=1}^{d}\phi(z_i)} \quad (5)$$

in which $u_1,\ldots,u_d$ are uniform random variables related to the original data through the marginal cumulative distributions, $z = \big(\Phi^{-1}(u_1),\ldots,\Phi^{-1}(u_d)\big)$ is the vector of inverse standard normal transformations, and $\Sigma$ is the covariance matrix with ones along the diagonal. $\phi$ is the standard normal density function, whereas $\Phi$ signifies its cumulative distribution function (CDF). In this formulation the denominator is a normalisation term that accounts for the fixed marginal distributions [20]. The joint density is transformed so that the marginal distributions fit a uniform distribution on the range [0,1]. Note that $u_i = \Phi(z_i)$; hence,

$$z_i = \Phi^{-1}(u_i) \quad (6)$$

Substituting (6) into the density function (5) and focusing on the dependency structure of the Gaussian copula, the term $\prod_{i=1}^{d}\phi(z_i)$ may be eliminated to arrive at:

$$c_{\Sigma}(u_1,\ldots,u_d) = \frac{\phi_{\Sigma}\big(\Phi^{-1}(u_1),\ldots,\Phi^{-1}(u_d)\big)}{\prod_{i=1}^{d}\phi\big(\Phi^{-1}(u_i)\big)} \quad (7)$$
$$c_{\Sigma}(u_1,\ldots,u_d) \propto \phi_{\Sigma}(z_1,\ldots,z_d) \quad (8)$$

The proportionality reflects the omission of the marginal density elements needed to uphold the uniformity of the marginals. This simplification highlights the ability of the copula to describe dependencies regardless of the specific marginal distributions. The joint density function of the Gaussian copula is expressed as follows:

$$c_{\Sigma}(u_1,\ldots,u_d) = \frac{1}{\sqrt{|\Sigma|}}\exp\!\Big(-\frac{1}{2}\,z^{\top}\big(\Sigma^{-1} - I\big)z\Big) \quad (9)$$

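As a sanity check, Equation (9) can be evaluated directly. The sketch below is a minimal illustration, not the authors' implementation; the function name and test values are ours. With the identity correlation matrix the density reduces to 1, reflecting independence.

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_density(u, sigma):
    """Evaluate c_Sigma(u) = |Sigma|^{-1/2} exp(-0.5 z^T (Sigma^{-1} - I) z),
    where z_i = Phi^{-1}(u_i) are the normal scores of the uniforms u."""
    z = norm.ppf(np.asarray(u, dtype=float))
    d = len(z)
    quad = z @ (np.linalg.inv(sigma) - np.eye(d)) @ z
    return float(np.linalg.det(sigma) ** -0.5 * np.exp(-0.5 * quad))

# Identity correlation: the copula density is 1 everywhere (independence).
print(gaussian_copula_density([0.3, 0.7], np.eye(2)))  # -> 1.0
# Positive dependence inflates the density near concordant pairs.
print(gaussian_copula_density([0.9, 0.9], np.array([[1.0, 0.5],
                                                    [0.5, 1.0]])))
```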
Parameter estimation and marginal modeling

Gaussian copula models are often estimated using the maximum likelihood (ML) approach, in which the marginal and copula parameters are computed jointly. This means optimising the likelihood function over all parameters, including those of the marginal distributions ($\theta$) and the covariance matrix ($\Sigma$). However, this method does not scale well: the cost of optimising the full log-likelihood grows rapidly with the number of random variables and is especially heavy for high-dimensional data sets. The two-stage inference functions for margins (IFM) method [34] usually deals with this difficulty; it saves a great deal of computational burden while still estimating the parameters accurately. The whole likelihood function for a Gaussian copula model can be articulated as follows:

$$L(\theta, \Sigma) = \prod_{j=1}^{n}\Big[c_{\Sigma}\big(u_{1j},\ldots,u_{dj}\big)\prod_{i=1}^{d} f_i(x_{ij};\theta_i)\Big] \quad (10)$$

in which $|\Sigma|$ denotes the determinant of the covariance matrix, $u_{ij} = F_i(x_{ij};\theta_i)$ with $j = 1,\ldots,n$, and $f_i(x_{ij};\theta_i)$ is the marginal density of the $i$-th variable of the $j$-th observation [35]. The associated log-likelihood function can be articulated as:

$$\ell(\theta, \Sigma) = \sum_{j=1}^{n}\log c_{\Sigma}\big(u_{1j},\ldots,u_{dj}\big) + \sum_{j=1}^{n}\sum_{i=1}^{d}\log f_i(x_{ij};\theta_i) \quad (11)$$

The IFM approach divides the estimating procedure into two phases: the marginal parameters ($\theta_i$) are initially computed for each variable individually; subsequently, the copula parameters, particularly the covariance matrix ($\Sigma$), are estimated while the marginal parameters remain fixed. We first derive the marginal parameters by optimising the likelihood function for the marginal distributions. The marginal parameters are obtained by maximising the marginal likelihood $\ell_i(\theta_i)$:

$$\hat{\theta}_i = \arg\max_{\theta_i}\sum_{j=1}^{n}\log f_i(x_{ij};\theta_i) \quad (12)$$

where $i = 1,\ldots,d$. The computed marginal parameters serve as inputs for the copula likelihood function, facilitating the estimation of the copula parameters. The estimate of the copula parameters is accomplished by optimising the likelihood function $\ell_C(\Sigma)$:

$$\hat{\Sigma} = \arg\max_{\Sigma}\sum_{j=1}^{n}\log c_{\Sigma}\big(F_1(x_{1j};\hat{\theta}_1),\ldots,F_d(x_{dj};\hat{\theta}_d)\big) \quad (13)$$

The joint log-likelihood function is expressed as follows:

$$\ell(\theta,\Sigma) = \ell_C(\Sigma;\hat{\theta}) + \sum_{i=1}^{d}\ell_i(\theta_i) \quad (14)$$

where $\theta = (\theta_1,\ldots,\theta_d)$ signifies the marginal parameters and $\Sigma$ represents the covariance matrix that parameterises the copula. The optimisation process commences with the estimation of the marginal parameters via $\ell_i(\theta_i)$. Once determined, these estimates are incorporated into the copula likelihood function $\ell_C(\Sigma)$. The copula parameters are subsequently determined by maximising it. Collectively, these estimates fulfil the system of equations:

$$\Big(\frac{\partial \ell_1}{\partial \theta_1},\ldots,\frac{\partial \ell_d}{\partial \theta_d},\frac{\partial \ell_C}{\partial \Sigma}\Big) = \mathbf{0} \quad (15)$$

In the Gaussian copula framework, marginal distributions may be estimated by either parametric or non-parametric methods, contingent upon the characteristics of the data. Kernel density estimation (KDE) is a prevalent non-parametric technique that estimates the probability density function of a random variable without presupposing any particular parametric structure. KDE operates by interpolating observable data points to generate a continuous representation of the underlying distribution. The KDE for a given set of data points is articulated as:

$$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\Big(\frac{x - x_i}{h}\Big) \quad (16)$$

Let $n$ signify the total number of data points, $K$ indicate the kernel function, $x_i$ represent the $i$-th observed data point, and $h$ imply the bandwidth parameter [36]. The choice of bandwidth influences the smoothness of the estimated density function. A larger bandwidth yields smoother and more generalised estimates, whereas a smaller bandwidth produces sharper and more reactive estimates that discern small intricacies in the data. KDE is advantageous when parametric assumptions regarding marginal distributions are inadequate or when the data exhibit intricate structures that parametric distributions cannot adequately represent, owing to its inherent flexibility.
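Equation (16) can be implemented in a few lines. The sketch below is illustrative only; the Gaussian kernel and bandwidth value are our choices. It estimates a density from simulated data and checks that the estimate integrates to approximately one.

```python
import numpy as np

def kde(x, data, h):
    """f_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h) with a Gaussian kernel K."""
    u = (x - data[:, None]) / h                      # shape (n, len(x))
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel values
    return k.sum(axis=0) / (len(data) * h)

rng = np.random.default_rng(0)
sample = rng.normal(size=500)
grid = np.linspace(-4.0, 4.0, 801)
dens = kde(grid, sample, h=0.4)
print(round(float(dens.sum() * (grid[1] - grid[0])), 3))  # close to 1
```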

Symbolic feature construction with ECDF and DFDV

The ECDF serves as a statistical tool to characterise the distribution of pixel intensities in an image. For a certain image $j$, let $X$ represent a random variable corresponding to the pixel values, where $x \in \{0, 1, \ldots, 255\}$. The ECDF, denoted as $\hat{F}_j(x)$, is computed using the following formula:

$$\hat{F}_j(x) = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\big(x_{ij} \le x\big) \quad (17)$$

In this equation, $N$ denotes the total pixel count in the image, $x_{ij}$ indicates the intensity of the $i$-th pixel in the $j$-th image, and $\mathbf{1}(\cdot)$ is an indicator function that equals $1$ when the pixel intensity is less than or equal to $x$, and $0$ otherwise [37].

The ECDF $\hat{F}_j(x)$ represents the proportion of pixels in the image with intensity values less than or equal to a given intensity level $x$. It provides a thorough summary of pixel intensities across the entire image. The ECDF value increases with rising $x$, spanning from $0$ (signifying that no pixels have values less than or equal to $x$) to $1$ (showing that all pixels satisfy the criterion $x_{ij} \le x$). The ECDF elucidates the distribution of pixel intensities, enabling the examination and comparison of image attributes such as brightness and contrast.
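Equation (17) amounts to counting pixels. The following sketch (a toy illustration with a hypothetical 2×2 image, not the paper's code) evaluates the ECDF at a given intensity level:

```python
import numpy as np

def ecdf(image, x):
    """F_hat(x): proportion of pixels with intensity <= x (Eq. 17)."""
    pixels = np.asarray(image).ravel()
    return float(np.mean(pixels <= x))

img = np.array([[0, 64],
                [128, 255]])   # toy image, intensities in 0..255
print(ecdf(img, 128))  # 3 of the 4 pixels are <= 128, so 0.75
```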

Let $\hat{F}_j$ denote the ECDF of the pixel values from the $j$-th image, where $j = 1,\ldots,M$. Upon calculating the ECDF for a collection of images, the DFDV consolidates the ECDF values at designated distinguishing points. The DFDV at a particular distinguishing point $t$ is defined as the probability that the ECDF values at that point are less than or equal to a certain value $z$. The DFDV function at point $t$ can be articulated as:

$$G_t(z) = P\big(\hat{F}(t) \le z\big) = \frac{1}{M}\sum_{j=1}^{M}\mathbf{1}\big(\hat{F}_j(t) \le z\big) \quad (18)$$

$\hat{F}_j(t)$ signifies the ECDF of the $j$-th image evaluated at $t$, whereas $G_t$ aggregates the ECDFs of all images. The term $t$ denotes a certain pixel intensity level utilised as a differentiating criterion. The variable $z$, which ranges from $0$ to $1$, represents the cumulative probability at $t$, namely the value of the ECDF assessed at point $t$ [25]. These distinguishing points encapsulate essential traits in the distribution of pixel intensities among various image categories. Differences in the choice of these points yield different feature shapes and hence influence the discriminative ability of the features obtained. The creation of the ECDF and its further conversion into the DFDV is a procedure described in [25] and provides a well-structured approach to represent image data in symbolic form using cumulative distribution-based features.

The DFDV characterises different images according to the distribution of their pixel intensities. It finds important distribution patterns that distinguish one type of image from another by comparing DFDV values at certain distinguishing points. The representation is a more condensed symbolic data representation, unlike traditional pixel-based data. DFDV data are reduced in size and therefore computationally efficient, whilst preserving the capability of portraying complex relationships existing between pixel intensities inside the image [38,39].
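Stacking the per-image ECDF values at one distinguishing point yields the DFDV of Equation (18). The sketch below uses hypothetical values, for illustration only:

```python
import numpy as np

def dfdv(ecdf_at_t, z):
    """G_t(z): proportion of images whose ECDF value at the distinguishing
    point t is <= z (Eq. 18)."""
    return float(np.mean(np.asarray(ecdf_at_t) <= z))

# Hypothetical ECDF values F_j(t) for five images at one point t.
values_at_t = [0.20, 0.40, 0.50, 0.70, 0.90]
print(dfdv(values_at_t, 0.5))  # 3 of 5 values are <= 0.5, so 0.6
```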

Cluster-based distinguishing point selection

A distinguishing point is a particular pixel value $t$ at which the ECDF is assessed. The probability value $z$, i.e., the realisation of the ECDF at point $t$, is represented as $\hat{F}_j(t)$. The DFDV function is given by $G_t(z)$, the probability that $\hat{F}(t) \le z$. This gives information on the detailed spread of pixel values at that level.

The K-means clustering algorithm detects the distinguishing points. It assigns the data $\hat{F}_j(t)$, for $j = 1,\ldots,M$, into $v$ cluster groups based on similarity [40,41]. It starts with a random choice of initial centroids $\mu_1,\ldots,\mu_v$. Every data point is assigned to the cluster with the nearest centroid, using the following measure of distance:

$$c_j = \arg\min_{k \in \{1,\ldots,v\}}\big(\hat{F}_j(t) - \mu_k\big)^2 \quad (19)$$
$$\mu_k = \frac{1}{n_k}\sum_{j:\,c_j = k}\hat{F}_j(t) \quad (20)$$

$c_j$ represents the cluster assigned to $\hat{F}_j(t)$, whereas $\mu_k$ represents the centroid value of cluster $k$ for $k = 1,\ldots,v$. Afterward, the centroids are recalibrated by taking the mean of the data points in every cluster [42]; $n_k$ is the number of data points in cluster $k$. The redistribution of the data points and the updating of the centroids are repeated until the clusters converge, at which point the algorithm terminates [43].

Algorithm 1. K-means Clustering of ECDF points

Input: Set of ECDF values $\hat{F}_j(t)$, for $j = 1,\ldots,M$, number of clusters $v$

Output: Cluster assignments $c_j$, centroids $\mu_1,\ldots,\mu_v$

1: Initialize centroids $\mu_1,\ldots,\mu_v$ randomly

2: Repeat:

  For each $\hat{F}_j(t)$ in the data:

    Assign $\hat{F}_j(t)$ to the cluster with the nearest centroid

  For each cluster $k$:

    Update centroid $\mu_k$ as the mean of the assigned points

3: Until centroids converge

In this study, the number of clusters was set to v = 10, corresponding to the ten digit classes (0–9) in the MNIST dataset. This design choice ensures that the clustering process aligns structurally with the class composition of the dataset. Euclidean distance was used as the similarity measure, and centroid updates were iterated until convergence.
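A minimal version of this clustering step, using scikit-learn's KMeans on simulated 1-D ECDF values (the data and their separation here are synthetic, chosen only to illustrate the v = 10 setting), might look as follows:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic ECDF values at one candidate distinguishing point: ten tight
# groups standing in for the ten MNIST digit classes.
levels = np.linspace(0.05, 0.95, 10)
values = np.concatenate([rng.normal(c, 0.005, size=50) for c in levels])

# Algorithm 1 with v = 10 clusters on the 1-D ECDF values.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(values.reshape(-1, 1))
print(np.sort(km.cluster_centers_.ravel()).round(2))
```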

This method clusters ECDF values to capture the unique characteristics of the pixel value distribution in each category of images, enabling determination of the most suitable discriminative level to use in classification. After grouping the data, the quality of the obtained clusters can be evaluated using the Silhouette coefficient. Suppose that we have a dataset of points $\hat{F}_j(t)$ (where $j = 1,\ldots,M$), referring to the value of the ECDF at point $t$ for the $j$-th image.

The Silhouette coefficient for point $i$ is determined as follows:

$$a(i) = \frac{1}{|C_a| - 1}\sum_{j \in C_a,\, j \ne i} d(i, j) \quad (21)$$
$$b(i) = \min_{C_b \ne C_a}\frac{1}{|C_b|}\sum_{j \in C_b} d(i, j) \quad (22)$$
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \quad (23)$$

Let $a(i)$ be the mean distance from $i$ to all other data points in the same cluster. Conversely, $b(i)$ denotes the mean distance from $i$ to the points in the closest neighbouring cluster. To compute $a(i)$, the cluster $C_a$ consists of the points that belong to the same cluster as $i$, where $d(i,j)$ denotes the distance between $i$ and $j$. To calculate $b(i)$, denote by $C_a$ the cluster including $i$ and by $C_b$ the nearest neighbouring cluster. Let $|C_a|$ and $|C_b|$ represent the numbers of points in clusters $C_a$ and $C_b$, respectively; $d(i,j)$ then denotes the distance between point $i$ and a point $j$ situated in a distinct cluster [44]. The overall Silhouette coefficient $\bar{S}$ for the entire dataset is the mean of all individual coefficients:

$$\bar{S} = \frac{1}{M}\sum_{i=1}^{M} s(i) \quad (24)$$

The Silhouette coefficient ranges from $-1$ to $1$. A value close to $1$ indicates that a data point is well aligned with its cluster and distinctly separated from others. A score of about $-1$ indicates that the data point is closer to an alternative cluster, suggesting a potential clustering error [45]. The mean Silhouette coefficient over all clusters is calculated at various candidate points to determine the optimal distinguishing point, i.e., the one that maximises the average Silhouette coefficient.

Distinguishing points are selected using a silhouette-based optimization strategy. First, ECDFs are constructed for all images. A set of candidate threshold values is generated across the ECDF domain, and the corresponding ECDF values are computed. For each candidate threshold, K-means clustering (Algorithm 1) is performed with $v = 10$ clusters, consistent with the ten MNIST digit classes, using Euclidean distance. The average silhouette coefficient $\bar{S}$ is then calculated to assess clustering quality. The first distinguishing point is selected as the candidate maximizing $\bar{S}$, and the second is chosen as the next highest value excluding the first. In this study, the number of distinguishing points is set to two, balancing clustering quality and model complexity within the copula framework. The overall procedure is illustrated in Fig 1.
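The selection loop can be sketched as below (an illustrative reimplementation using scikit-learn with synthetic data; the helper name and candidate grid are ours): each candidate threshold is scored by the average silhouette of its K-means clustering, and the top-scoring thresholds are kept.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_distinguishing_points(ecdf_matrix, thresholds, v=10, n_points=2):
    """ecdf_matrix[j, c] holds image j's ECDF value at candidate threshold c.
    Cluster each column into v groups, score it by the average silhouette
    coefficient, and return the n_points best thresholds."""
    scores = []
    for c in range(ecdf_matrix.shape[1]):
        col = ecdf_matrix[:, c].reshape(-1, 1)
        labels = KMeans(n_clusters=v, n_init=10, random_state=0).fit_predict(col)
        scores.append(silhouette_score(col, labels))
    best = np.argsort(scores)[::-1][:n_points]
    return [thresholds[c] for c in best]

rng = np.random.default_rng(0)
# One candidate with ten well-separated groups, two with pure noise.
separated = np.repeat(np.linspace(0.05, 0.95, 10), 20) + rng.normal(0, 0.004, 200)
noisy_a, noisy_b = rng.uniform(0, 1, 200), rng.uniform(0, 1, 200)
ecdf_matrix = np.column_stack([noisy_a, separated, noisy_b])
print(select_distinguishing_points(ecdf_matrix, thresholds=[50, 128, 200]))
```

The well-separated column yields the highest silhouette, so its threshold (128 in this toy setup) is returned first.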

Fig 1. Flowchart for selecting two optimal distinguishing points based on average Silhouette coefficients.

The first distinguishing point is selected as the candidate maximizing the average silhouette coefficient $\bar{S}$, and the second is chosen as the candidate with the next highest coefficient, excluding the first selected point.

https://doi.org/10.1371/journal.pone.0346790.g001

Gaussian copula with DFDV for image classification

This research employs a Gaussian copula methodology for image classification, utilising random variables characterised by the DFDV derived from the ECDF. The algorithm categorises images into classes, with each image segmented into areas. For each partition, the distinguishing points are utilised to assess pixel values, yielding the random variables. These random variables encapsulate the fundamental distribution patterns of pixel intensities throughout the image. Implementing this technique necessitates several essential processes, each of which is detailed in the subsequent sections.

The distinguishing points are identified through the analysis of comprehensive data from all classes. The ECDF is initially created for the data points across all classes, and candidate distinguishing points are chosen. These points signify critical thresholds that may elucidate disparities in the pixel intensity distributions among the various classes. The ECDF for a particular class, partition, and image adheres to Equation (17). This function offers a cumulative depiction of pixel intensities for a specified area of the image. The DFDV, originating from the ECDF, encapsulates the distribution at the designated distinguishing points, as articulated in Equation (18). This transformation enables a concise symbolic representation of image regions, hence enhancing subsequent analysis and classification.

The K-means algorithm is employed to partition the data into clusters at each candidate distinguishing point. The clustering procedure is directed by Equations (19) and (20), which govern the formation of the clusters. Equations (21–23) calculate the Silhouette coefficient to assess the quality of clustering at each prospective distinguishing point. This statistic assesses the extent to which each data point aligns with its assigned cluster compared to other clusters. We determine the optimal threshold as the point that maximises the average Silhouette coefficient. This ensures that the chosen distinguishing point embodies the characteristics that distinguish the image classes.

This study consistently develops the copula model for each class, enhancing generality. The copula model for the $k$-th class can be expressed as follows:

$$H_k(x_1,\ldots,x_d) = C_k\big(F_{1k}(x_1),\ldots,F_{dk}(x_d)\big) \quad (25)$$

This formulation enables the copula function to encapsulate the dependency structure among the random variables, while the marginal distributions manage the individual behaviour of each variable. Through this generalisation, the model is rendered applicable to a variety of image classes, thereby facilitating a consistent and systematic representation of the data. The random variables are defined as transformations $U_i = F_i(X_i)$, where the variables $U_i$ exhibit a uniform distribution over the interval $[0,1]$. The correlation between these random variables is expressed as follows by a copula function:

$$C(u_1,\ldots,u_d) = P\big(U_1 \le u_1,\ldots,U_d \le u_d\big) \quad (26)$$

Furthermore, the joint distribution function will be associated with the joint probability density function, which is defined as follows:

$$c(u_1,\ldots,u_d) = \frac{\partial^d C(u_1,\ldots,u_d)}{\partial u_1\cdots\partial u_d} \quad (27)$$

This study employed the Gaussian copula model to examine the interrelationship among random variables related to features, partitions, and distinguishing points within the image. The Gaussian copula function utilised in this work is articulated as follows:

$$C_{\Sigma}(u_1,\ldots,u_d) = \Phi_{\Sigma}\big(\Phi^{-1}(u_1),\ldots,\Phi^{-1}(u_d)\big) \quad (28)$$

The density function of the Gaussian copula is obtained from its distribution function as illustrated below:

$$c_{\Sigma}(u_1,\ldots,u_d) = \frac{\phi_{\Sigma}(z_1,\ldots,z_d)}{\prod_{i=1}^{d}\phi(z_i)} \quad (29)$$

with $z_i = \Phi^{-1}(u_i)$. The joint density function of the Gaussian copula model is expressed as follows:

$$h(x_1,\ldots,x_d) = c_{\Sigma}\big(F_1(x_1),\ldots,F_d(x_d)\big)\prod_{i=1}^{d} f_i(x_i) \quad (30)$$

We formulate the likelihood function so as to simultaneously involve both the copula and the marginal distributions in establishing the parameters of the Gaussian copula model. This joint likelihood formulation consists of two elements: one for the Gaussian copula and another for the marginal components, as follows:

$$\ell(\theta,\Sigma) = \ell_C(\Sigma) + \sum_{i=1}^{d}\ell_i(\theta_i) \quad (31)$$

Here $\theta$ denotes the parameters of the marginal distributions of the copula, and $\ell_C(\Sigma)$ is the log-likelihood of the Gaussian copula component. The second term is the log-likelihood of the marginal distributions, whereby $\ell_i(\theta_i)$ denotes the log-likelihood related to the marginal distribution of each region and partition.

Parameter estimation is performed by maximizing the joint log-likelihood function with respect to both and . To ensure computational efficiency and numerical stability, the optimization is conducted using the two-step Inference Functions for Margins (IFM) procedure described in Equations (12) and (13). In the first stage, the marginal parameters are estimated independently by maximizing the marginal likelihood function as given in Equation (12). Subsequently, the copula parameter is estimated by maximizing the copula log-likelihood in Equation (13), while keeping the marginal estimates fixed. This procedure enables coherent modeling of both the marginal distributions and their dependence structure within the Gaussian copula framework.
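The two-stage logic can be illustrated with a small semiparametric variant (a sketch, not the paper's estimator: margins are handled non-parametrically via pseudo-observations, and the stage-two correlation of the normal scores is used as a one-step approximation to the copula MLE):

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_copula_ifm(X):
    """Stage 1: estimate margins (here, empirical CDFs via ranks/(n+1)).
    Stage 2: with margins fixed, estimate Sigma from the normal scores
    z_ij = Phi^{-1}(u_ij)."""
    n = X.shape[0]
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    U = ranks / (n + 1)                  # pseudo-observations u_ij
    Z = norm.ppf(U)                      # normal scores
    return np.corrcoef(Z, rowvar=False)  # copula correlation estimate

rng = np.random.default_rng(1)
latent = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=4000)
# Monotone, non-normal margins: the copula correlation should still be ~0.8.
X = np.column_stack([np.exp(latent[:, 0]), latent[:, 1] ** 3])
print(fit_gaussian_copula_ifm(X).round(2))
```

Because ranks are invariant under monotone marginal transformations, the dependence estimate is unaffected by the non-normal margins, which is precisely the separation of margins and dependence that the IFM procedure exploits.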

Classification procedure

The method used in this study is a classification approach set in a probabilistic framework, where the intention is to determine the likelihood that a particular test image belongs to a particular class. Let $x$ denote the new input image and $C_k$ the $k$-th class. The conditional probability $P(C_k \mid x)$ gives the probability that the input falls in the corresponding class. It is computed from the joint probability of the input and the class, which is the product of two factors: (1) the probability of observing the input given the class, $P(x \mid C_k)$, and (2) the prior probability of the class, $P(C_k)$. The likelihood $P(x \mid C_k)$ is estimated using the Gaussian copula model trained on the available data, while the prior is the number of training samples belonging to the $k$-th class divided by the size of the entire training set [46].
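The decision rule above — choose the class maximizing the joint probability of input and class — can be sketched as follows, where `class_densities` and `priors` are hypothetical placeholders for the trained per-class copula densities and the empirical class priors:

```python
def classify(x, class_densities, priors):
    """Assign x to the class k maximizing P(x | C_k) * P(C_k).

    `class_densities[k]` is a callable returning the copula-based joint
    density of feature vector x under class k; `priors[k]` = n_k / n.
    Both containers are illustrative placeholders, not the authors' API.
    """
    scores = {k: priors[k] * class_densities[k](x)
              for k in class_densities}
    return max(scores, key=scores.get)
```

Note that a sufficiently strong prior can override the likelihood, which is the expected Bayesian behaviour.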

After computing the conditional probabilities for each class, the classifier assigns the test image to the class with the greatest probability. This approach bases classification judgements on the most dependable statistical evidence extracted from the data and the model. The confusion matrix evaluates the effectiveness of the classification model by counting all correct and incorrect predictions for each class. The main performance indicator is accuracy, the ratio of successful predictions to the total number of predictions. Precision, recall, and F1-score are also computed to provide a more comprehensive evaluation. Together, these measures test the model's competency in distinguishing between image classes and guarantee an overall assessment of its predictive accuracy [47]. The overall structure of the proposed approach, including both training and testing phases, is presented in Fig 2.
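The reported metrics can be derived directly from the confusion matrix; a minimal sketch, assuming rows index the actual classes and columns the predicted classes, with precision, recall, and F1 macro-averaged over classes:

```python
import numpy as np

def metrics_from_confusion(cm):
    """Accuracy and macro-averaged precision/recall/F1 from a confusion
    matrix with rows = actual classes, columns = predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                                    # correct predictions
    accuracy = tp.sum() / cm.sum()
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)  # per predicted class
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)     # per actual class
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return accuracy, precision.mean(), recall.mean(), f1.mean()
```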

Fig 2. Block diagram of the proposed image classification method using symbolic ECDF and DFDV features.

Starting from data preprocessing and ECDF construction on partitioned images, distinguishing points are selected by clustering ECDF values and evaluating Silhouette coefficients (detailed in Fig 1). The DFDV and Gaussian copula parameters are estimated from training data. In testing, the model evaluates the ECDF features and computes the joint copula density for final classification.

https://doi.org/10.1371/journal.pone.0346790.g002

Fig 2 illustrates the overall classification framework. In the training phase, images are partitioned and their ECDFs are constructed. Optimal distinguishing points are selected based on the highest Silhouette coefficients (see Fig 1). Using the ECDF values at these points, the parameters of the DFDV are estimated for each class. The dependence structure among features is then modeled using a Gaussian copula with an estimated correlation matrix. In the testing phase, ECDF values at the selected distinguishing points are evaluated using the estimated marginal parameters, and the joint copula density is computed using the estimated correlation matrix. Each test image is assigned to the class that maximizes the corresponding joint density value.

Experiment and results

To assess the effectiveness of the proposed approach, we conducted experiments on MNIST, a well-known benchmark dataset in image classification. The dataset contains 70,000 grayscale images of handwritten digits (0–9), each 28 × 28 pixels, divided into 60,000 training and 10,000 test images. Each image is a matrix of pixel intensity values between 0 and 255. In the present study, symbolic data modelled with a Gaussian copula are used to improve classification. Preprocessing consists of normalising the pixel intensities and extracting features using the ECDF and the DFDV, where the DFDV is a random variable that summarises the critical distributional properties of pixel intensity values over image partitions. We estimated the Gaussian copula model by applying the methods explained in the previous section, and measured classification performance using the mean and maximum accuracy on the test set.

Preprocessing data

This experiment preprocesses the MNIST handwritten digit dataset to improve feature extraction and thereby classification accuracy. The dataset is freely available at https://www.kaggle.com/datasets/hojjatk/mnist-dataset. All images are rescaled to a 20 × 20 grid, which preserves the basic outline of the digit shapes while reducing redundant information [48]. Each resized image is then partitioned into equal sections to better capture local characteristics. This decomposition enables the study of localised properties, which are essential for symbolic data representations. Previous results [25] showed that the best modelling performance was achieved by dividing the images into four horizontal parts, as shown in Fig 3.

Fig 3. Image partitioning into four horizontal segments.

Each digit image is divided into four horizontal parts to extract localized ECDF features.

https://doi.org/10.1371/journal.pone.0346790.g003

This finding highlights the need to determine an optimal partition size for effective representation of distribution-related features in image classification. The pixel values are normalised to the range [0, 1] using min-max scaling, which standardises the data for subsequent processing.
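A minimal sketch of this preprocessing step (assuming the 20 × 20 resizing has already been applied upstream, e.g. by an image library):

```python
import numpy as np

def preprocess(img, n_parts=4):
    """Min-max normalise a grayscale image to [0, 1] and split it into
    `n_parts` equal horizontal bands, as described in the text.
    The 20 x 20 resizing step is assumed to have been applied already."""
    img = np.asarray(img, dtype=float)
    lo, hi = img.min(), img.max()
    norm = (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)
    return np.array_split(norm, n_parts, axis=0)  # split along rows
```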

Outcomes of the distinguishing points

Identifying distinguishing points entails selecting the pixel values that best differentiate the features of the classes. For each of the ten candidate points, the ECDF values are clustered using K-means, and the Silhouette coefficient is computed to evaluate clustering quality. This coefficient assesses how well each data point matches its own cluster relative to the other clusters. We identify the two points with the highest Silhouette coefficients as the distinguishing points, as they most effectively differentiate the characteristics of the classes. Selecting 10 candidate points ensures adequate coverage of pixel variation while maintaining computational efficiency, and limiting the final selection to two distinguishing points reduces model complexity and mitigates the risk of overfitting. Fig 4 depicts the methodology for pinpointing these locations. We produced the ECDF for all images spanning the ten classes.
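The selection procedure can be sketched as follows. This is an illustration only: a crude one-dimensional K-means and a hand-rolled Silhouette computation stand in for library routines (e.g., scikit-learn's `KMeans` and `silhouette_score`), and all names are hypothetical:

```python
import numpy as np

def silhouette_1d(x, labels):
    """Mean Silhouette coefficient for 1-D values x with cluster labels."""
    s = []
    for i, xi in enumerate(x):
        own = labels[i]
        same = x[labels == own]
        a = np.abs(same - xi).sum() / max(len(same) - 1, 1)   # intra-cluster
        b = min((np.abs(x[labels == c] - xi).mean()           # nearest other
                 for c in np.unique(labels) if c != own), default=0.0)
        m = max(a, b)
        s.append((b - a) / m if m > 0 else 0.0)
    return float(np.mean(s))

def best_thresholds(ecdf_at, candidates, k=3, n_pick=2):
    """Return the n_pick candidate points whose ECDF values cluster best.

    ecdf_at(t) yields the ECDF values of all training images evaluated at
    candidate point t; quality is scored by the mean Silhouette coefficient.
    """
    scores = []
    for t in candidates:
        v = np.asarray(ecdf_at(t), dtype=float)
        # crude 1-D k-means: quantile initialisation + a few Lloyd steps
        centers = np.quantile(v, np.linspace(0.1, 0.9, k))
        for _ in range(10):
            lab = np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)
            centers = np.array([v[lab == c].mean() if np.any(lab == c)
                                else centers[c] for c in range(k)])
        scores.append(silhouette_1d(v, lab))
    order = np.argsort(scores)[::-1][:n_pick]
    return [candidates[i] for i in order]
```

A candidate point at which the ECDF values form tight, well-separated groups scores near 1 and is preferred over points where the values are evenly spread.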

Fig 4. Illustration of distinguishing point determination selection using K-means clustering and Silhouette coefficients.

The first step is constructing the ECDF for all data. The ECDF values are evaluated at ten candidate thresholds. At each candidate point, the ECDF values are clustered into groups using K-means, and the clustering quality is assessed using the average Silhouette coefficient. The two thresholds with the highest values are selected as the optimal distinguishing points.

https://doi.org/10.1371/journal.pone.0346790.g004

We select two distinguishing points for each partition. The resulting points may vary across partitions, depending on the obtained Silhouette coefficient values. The chosen distinguishing points are those that most efficiently differentiate the image characteristics among the categories. We construct the Gaussian copula model using the IFM approach and the optimal distinguishing points. Table 1 presents the mean Silhouette coefficient values.

Table 1. Silhouette coefficient values for 10 candidate distinguishing points. Each row corresponds to a specific image partition, and each column represents a candidate distinguishing point. For each partition, the two candidate points with the highest Silhouette coefficient values (in bold) are selected as optimal distinguishing points.

https://doi.org/10.1371/journal.pone.0346790.t001

For each partition, the pixel thresholds associated with the two highest Silhouette coefficients are selected as distinguishing points. This guarantees that the chosen points yield optimal separability for clustering among classes. At partition 1, the maximum Silhouette coefficient of 0.6897 indicates that the corresponding threshold yields the most distinct clustering for this scenario, and the two positions with the highest Silhouette values are chosen. Likewise, for partition 2 the chosen points are 0.03 and 0.5, and for partition 3 the best points are 0.3 and 0.5; for partition 4, the two thresholds with the best clustering performance are selected. These values serve as thresholds that distinctly separate the image classes analyzed later on. The selected distinguishing points represent the pixel values that provide the most reliable separation between image classes and serve as the basis for the subsequent clustering and classification stages of the pipeline.

Results for ECDF and DFDV

After identifying the distinguishing points based on the greatest Silhouette coefficients, the ECDFs are computed for every partition and class. The pixel intensity distribution of each class within each partition produces one ECDF, so each class has four ECDFs, one per partition. With two distinguishing points chosen per partition, each class yields eight distinguishing points in total. These ECDFs illustrate the cumulative probability of pixel intensities, reflecting variation in image attributes across partitions. Figs 5 and 6 illustrate the ECDFs for each class and partition, offering a visual depiction of the pixel distributions that underpin the development of the Gaussian copula model.

Fig 5. ECDF plots of four partitions for digit classes 0–4.

Each subplot shows the ECDF with rows corresponding to partitions and columns to digit classes. Red and blue vertical dashed lines denote the first and second selected distinguishing points, respectively.

https://doi.org/10.1371/journal.pone.0346790.g005

Fig 6. ECDF plots of four partitions for digit classes 5–9.

Each subplot shows the ECDF with rows corresponding to partitions and columns to digit classes. Red and blue vertical dashed lines denote the first and second selected distinguishing points, respectively.

https://doi.org/10.1371/journal.pone.0346790.g006

The ECDF plots represent the full distribution of pixel intensities in the separate image partitions (rows) for the different image classes (columns). The visualisation comprises two segments: the upper segment presents the ECDF results for classes 0–4, whereas the lower segment exhibits the ECDF results for classes 5–9. Each row represents an image partition and each column an image class (0–9). The distribution patterns in the graphs underscore discrepancies among partitions and classes. In certain classes, pixel intensities are dispersed throughout the intensity spectrum, signifying variability in pixel attributes within those segments. Conversely, other classes have distributions concentrated within particular intensity ranges, indicating a predominance of pixels with uniform attributes. This phenomenon is seen in specific partitions where the ECDF curves ascend sharply over a limited range, indicating that a significant proportion of pixels share the same intensity value.

The disparities in distribution patterns among classes and partitions indicate that each class exhibits distinct traits as evidenced by its pixel distributions. The divisions segment the image, facilitating a more nuanced study of pixel changes. The red and blue vertical lines in the plots serve as essential markers for comprehending the data distribution and constructing the Gaussian copula model.

The development of the DFDV relies on collecting the ECDF values at the distinguishing points indicated in the plots. For each image class, a total of 8 DFDVs are produced, one at each distinguishing point of each partition. The DFDVs are used as random variables in the Gaussian copula model. The DFDV results are illustrated in Fig 7.

Fig 7. DFDV results for ten image classes.

Each subplot corresponds to a digit class (0–9), showing the estimated DFDV curves for each of the 8 features formed from 4 partitions and 2 selected distinguishing points. The variations in curve shapes across digit classes reflect how distributional differences are captured and used for classification.

https://doi.org/10.1371/journal.pone.0346790.g007

The DFDV plot above shows that the DFDV distribution pattern varies across the image classes (0 to 9). The DFDV curves of the classes differ in shape and in the locations of their concentration peaks, attesting to differences between classes. In some classes the DFDV distribution is more clustered, with pronounced peaks, while in others it is more dispersed, with flatter peaks. This trend shows that DFDV values differ across image classes and partitions, reflecting between-class differences in pixel intensity characteristics. The divergence in DFDV distributions is an influential factor in understanding the disparity in pixel features of each partition, and it forms the input to the Gaussian copula model, which captures the associations between partitions. These findings also reveal the suitability of the DFDV for identifying subtle changes in the distributions of pixel intensities.


Results for Gaussian Copula Model

After obtaining the DFDV values at the distinguishing points, the next stage analyses the distributional properties of the DFDV. Kernel density estimation (KDE) is applied as a non-parametric method for smoothly and accurately estimating the probability density function of the DFDV. KDE enables the evaluation and comparison of the DFDV distribution for each image class without prior assumptions about the distribution's structure. Estimating the KDE primarily involves determining the bandwidth, which regulates the smoothness of the computed density. The bandwidth is estimated using Equation (16), and the Kolmogorov-Smirnov (KS) test assesses the goodness-of-fit of the derived bandwidth values by quantitatively evaluating the correspondence between the DFDV data and the distribution generated by KDE. Table 2 summarises the estimated bandwidth parameters and the outcomes of the KS test.
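A minimal sketch of this step, with SciPy's rule-of-thumb bandwidth (Scott's rule) standing in for the bandwidth of Equation (16), which is not reproduced here:

```python
import numpy as np
from scipy import stats

def fit_kde_and_ks(sample):
    """Fit a Gaussian KDE to DFDV values and assess goodness-of-fit with a
    KS test against the KDE's own (numerically integrated) CDF."""
    kde = stats.gaussian_kde(sample)           # bandwidth: Scott's rule
    grid = np.linspace(sample.min() - 1.0, sample.max() + 1.0, 2048)
    cdf = np.cumsum(kde(grid))
    cdf /= cdf[-1]                             # numerical CDF of the KDE
    ks = stats.kstest(sample, lambda x: np.interp(x, grid, cdf))
    return kde.factor, ks.pvalue
```

A large KS p-value indicates that the smoothed density is consistent with the empirical distribution of the DFDV values.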

Table 2. Estimated Bandwidth Parameters and Results of the Kolmogorov-Smirnov Test. The table presents the bandwidth values (outside the parentheses) alongside the associated KS test p-values (inside the parentheses) for classes 0–9 over four partitions.

https://doi.org/10.1371/journal.pone.0346790.t002

This analysis assesses the KDE bandwidths and the KS test p-values over different partitions and thresholds. Small bandwidth values signify narrow kernel smoothing, and large p-values signify a strong correspondence between the ECDF and KDE distributions. The copula model employs a Gaussian copula to define the joint distribution of the eight interrelated random variables. The copula density function is intended to represent both the marginal behaviours of the variables and their dependence structure. The model is defined as:

$$f(x_1,\dots,x_8) = c_{\Sigma}\big(F_1(x_1),\dots,F_8(x_8)\big)\prod_{j=1}^{8} f_j(x_j), \qquad c_{\Sigma}(u) = \frac{1}{\sqrt{\det \Sigma}}\exp\!\Big(-\tfrac{1}{2}\, z^{\top}\big(\Sigma^{-1}-I\big)z\Big),\ z_j = \Phi^{-1}(u_j) \qquad (31)$$

Here, $\Sigma$ denotes the covariance matrix describing the interrelationships among the eight random variables, $F_j$ signifies the marginal cumulative distribution function of each variable, $\Phi^{-1}$ symbolises the inverse of the standard normal cumulative distribution function, and $f_j$ are the marginal density functions of the respective variables.
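The Gaussian copula density term can be transcribed directly from its standard formula; a minimal check (not the authors' code):

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_density(u, R):
    """Gaussian copula density c_R(u) = det(R)^{-1/2} *
    exp(-0.5 * z^T (R^{-1} - I) z), with z_j = Phi^{-1}(u_j)."""
    z = norm.ppf(np.asarray(u, dtype=float))
    quad = z @ (np.linalg.inv(R) - np.eye(len(z))) @ z
    return float(np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(R)))
```

With an identity correlation matrix the density is identically 1, recovering independence; positive correlation inflates the density at concordant points.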

The parameters of the Gaussian copula model are determined by formulating a likelihood function based on the copula density function. This procedure incorporates the dependence structure represented by the correlation matrix and the marginal distributions . Parameter estimation is conducted via the IFM, facilitating a distinct and efficient delineation of the marginal components and dependencies, hence yielding optimal parameter estimations. The estimated parameters are displayed in Table 3.

Table 3. Results of the estimated covariance matrix of the Gaussian Copula Model for digit classes . This table presents the estimated covariance matrices for the Gaussian copula model across all ten digits classes. Each matrix captures the dependence structure among the eight DFDV features derived from ECDF values at selected distinguishing points. Variations in the correlation strengths across classes reflect class-specific inter-feature dependencies used in the final classification stage.

https://doi.org/10.1371/journal.pone.0346790.t003

We performed classification using the Bayesian technique, grounded in the estimated parameter values of the obtained covariance matrices. Under this approach, the data are assigned to the appropriate categories using the interdependence among variables captured in the covariance matrices. Performance measures were then computed to evaluate the model's ability to categorise the classes accurately. The metrics used are accuracy, precision, recall, and F1-score, presented in Table 4. A comparison is also made with the previous approach [25], which employed fixed distinguishing points.

Table 4. Performance comparison between fixed and optimally selected distinguishing points. The table presents average and maximum classification performance metrics (accuracy, precision, recall, and F1-score) using fixed distinguishing points versus optimally selected points. The proposed method improves average accuracy and recall, indicating better generalization, while maintaining competitive maximum performance levels.

https://doi.org/10.1371/journal.pone.0346790.t004

As shown in Table 4, the proposed method, which employs optimally selected distinguishing points based on the highest Silhouette coefficients, achieves an average accuracy of 68.27% and recall of 66.70%, outperforming the previous method that used fixed distinguishing points (62.22% and 62.46%, respectively). This improvement indicates better generalization and enhanced ability to distinguish class-specific distributional features across partitions. Although the previous approach achieves slightly higher maximum values in accuracy (96.92%) and precision (95.30%), the proposed method maintains competitive maximum performance (up to 95.40% recall), while demonstrating more consistent results overall. These findings affirm the robustness of the model and highlight the effectiveness of using optimized symbolic features in enhancing the classification of handwritten digits. To provide a more detailed view of the classification performance, the confusion matrix for the proposed method is presented in Fig 8.

Fig 8. Confusion matrix for the proposed method using optimally selected distinguishing points.

Cells show the number of predictions for each actual versus predicted digit. Darker colors indicate higher counts, highlighting the most frequent misclassifications.

https://doi.org/10.1371/journal.pone.0346790.g008

Discussion

The MNIST handwritten digit experiments show the relationship between pixel intensity distributions and classification effectiveness under symbolic data representation and copula-based modeling. The preprocessing method, which rescales the images to a 20 × 20 grid and divides them into four horizontal sections, underscores the importance of local feature extraction in symbolic data analysis. This segmentation precisely captures the structural characteristics of handwritten digits, while min-max scaling standardises the input and enhances the reliability of the subsequent analysis. The chosen partition size matches previous research results [25].

The Silhouette coefficient is essential for achieving successful differentiation of image classes. The two points selected in each segment balance model simplicity against classification accuracy. These results confirm the importance of selecting particular points to reduce data dimensionality while preserving the elements essential for correct classification. The effectiveness of the clustering-based methodology is reflected in a comparison with previous research: the fixed distinguishing points preset in the earlier study achieved an average accuracy of only 62.44% [25], whereas identifying distinguishing points through cluster analysis raised the mean accuracy to 68.27%, clearly confirming that the proposed methodology enhances classification performance. Moreover, the best precision increased to 95.33%. Alongside accuracy, precision, recall, and F1-score also improved relative to previous studies. This underscores the value of cluster analysis for identifying distinguishing points that improve the overall performance of the model.

The ECDF analysis provides a detailed account of the pixel intensity distribution in each partition and class. The distinct patterns of the ECDF curves indicate that each class has its own characteristic pixel distribution. The ECDFs yield the DFDV values that highlight between-class differences, with high peaks in certain classes (e.g., 1 and 4) and flatter distributions in others (e.g., 0 and 5). These trends explain the effectiveness of the DFDV in identifying subtle differences in pixel attributes.

The ability of the Gaussian copula to describe the joint distribution of the DFDV over multiple regions emphasises its effectiveness in modelling the correlation among local variables. Using KDE to estimate the marginal distributions offers flexibility in handling non-standard distributions, with the bandwidth parameter validated by the Kolmogorov-Smirnov test. Despite its advantages, the current study has certain limitations: the reliance on a rigid partitioning scheme can limit the flexibility of the system when handling more complex data. Nevertheless, the findings demonstrate the effectiveness of symbolic data representation and copula modelling in image classification. The approach particularly suits datasets with complex patterns of interdependence, as the MNIST application demonstrates.

The effectiveness of classification in this study, which integrates symbolic data and Gaussian copula on the MNIST dataset, is comparable with CNN-based applications, including LeNet, which achieve higher accuracies [49,50]. Our method also shows similar performance to traditional probabilistic models, including Bayesian and Gaussian Mixture Models, with mean accuracy around 75–85% [39]. Importantly, unlike CNNs or LeNet, our approach is interpretable and avoids the “black-box” issue. This transparency allows for better understanding of feature interactions and decision-making in classification tasks, which is often not possible with deep learning models. Although CNNs and LeNet achieve slightly higher accuracy, our method provides competitive performance while offering explainable results. These findings are also compatible with recent works using copulas in image classification [20,21,51] which demonstrate that copulas can effectively model the interactions among variables in visual datasets.

Although our method provides interpretable results, its accuracy is slightly lower than CNN and LeNet. The performance also depends on the choice of distinguishing points in the symbolic representation; different selections of these points can lead to varying classification results. Additionally, the current study focuses on the MNIST dataset, and further work is needed to assess scalability, robustness to noise, and generalization to more complex datasets. Future research could explore hybrid approaches that combine the interpretability of copula-based methods with the high accuracy of deep learning, as well as methods to optimize the selection of distinguishing points for improved performance.

Conclusions

This paper demonstrates the ability of symbolic data representation and copula-based modelling to improve image classification performance, here on the MNIST handwritten digit dataset. Pixel intensities are transformed into empirical cumulative distribution functions (ECDF), and their values are summarised by the distribution function of distribution values (DFDV), offering an alternative means of describing the local features of an image.

The incorporation of local feature extraction by image scaling and horizontal segmentation, along with cluster analysis to identify prominent distinguishing points, markedly improves classification performance, with an average accuracy of 68.27% and a maximum accuracy of 95.33%. This represents a significant enhancement over previous work, which reported 62.66% accuracy with a fixed-point selection technique.

Moreover, the Gaussian copula model captures the connection between local symbolic features, while KDE provides adaptability in handling non-parametric marginal distributions. Although this approach relies on existing segmentation and clustering techniques, it shows strong potential in scenarios where interpretability, data organisation, and class distinctiveness are most important. The results emphasise the potential of combining symbolic data modelling with a copula-based framework as a complement to classical machine learning algorithms.

In this study, we demonstrated that integrating symbolic data with Gaussian copula provides an interpretable approach for MNIST image classification. While the accuracy is slightly lower than CNN and LeNet, our method avoids the “black-box” issue, allowing better understanding of feature interactions through distinguishing points. This approach can be applied in settings where interpretability is critical. Future improvements could focus on optimizing the selection of distinguishing points, combining copula-based interpretability with deep learning accuracy, and extending the method to larger or more complex datasets.

Acknowledgments

The authors wish to convey their profound appreciation to the anonymous reviewers for their insightful remarks and constructive recommendations, which have markedly enhanced the quality and clarity of this paper. This research was funded by Universitas Padjadjaran, Indonesia, through the Internal Matching Funds Research Grant (IMF) for the project “Model Classification of Rice Plant Diseases Based on Deep Learning and Gaussian Copula to Support Sustainable Precision Agriculture,” under contract number 4356/UN6.D/PT.00/2025.

References

  1. Alem A, Kumar S. End-to-end convolutional neural network feature extraction for remote sensed images classification. Appl Artif Intell. 2022.
  2. Alem A, Kumar S. Transfer learning models for land cover and land use classification in remote sensing image. Appl Artif Intell. 2022;36.
  3. Shoaib M, Shah B, Ei-Sappagh S, Ali A, Ullah A, Alenezi F, et al. An advanced deep learning models-based plant disease detection: A review of recent research. Front Plant Sci. 2023;14:1158933. pmid:37025141
  4. Lindroth H, Nalaie K, Raghu R, Ayala IN, Busch C, Bhattacharyya A, et al. Applied Artificial Intelligence in Healthcare: A Review of Computer Vision Technology Application in Hospital Settings. 2024:1–29.
  5. Yenikaya MA, Kerse G. Artificial intelligence in the healthcare sector: comparison of deep learning networks using chest X-ray images. Frontiers in Public Health. 2024:1–11.
  6. Alhejaily AMG. Artificial intelligence in healthcare. Biomedical Reports. 2025.
  7. Jiang Q. A financial handwritten digit recognition model based on artificial intelligence. 2022;2:2–6.
  8. Zhang Z, Hamadi HAL, Member S, Damiani E, Member S, Yeun CY. Explainable artificial intelligence applications in cyber security: state-of-the-art in research. IEEE Access. 2022;10:93104–39.
  9. Ming Seng L, Bang Chen Chiang B, Arabee Abdul Salam Z, Yih Tan G, Tong Chai H. MNIST handwritten digit recognition with different CNN architectures. J Appl Technol Innov. 2021;5:2600–7304.
  10. Yahya AA, Tan J, Hu M. A Novel Handwritten Digit Classification System Based on Convolutional Neural Network Approach. Sensors (Basel). 2021;21(18):6273. pmid:34577479
  11. Lestari KE, Winarni S, Prihandhika A, Nugraha ES, Yudhanegara MR. Neurocognitive prediction of dyslexic handwriting pattern using an explainable AI-driven custom LiteBinaryNet-CNN. Commun Math Biol Neurosci. 2025;2025.
  12. Yang X, Yang H, Huang H, Song K. Evolution of Tax Exemption Policy and Pricing Strategy Selection in a Competitive Market. Mathematics. 2024.
  13. Talaei Khoei T, Ould Slimane H, Kaabouch N. Deep learning: systematic review, models, challenges, and research directions. Neural Comput & Applic. 2023;35(31):23103–24.
  14. Hosain T, Rahman J, Mridha MF, Kabir M. Explainable AI approaches in deep learning: Advancements, applications and challenges. Computers and Electrical Engineering. 2024;117:109246.
  15. Habiba U, Habib MK, Fritzsch J, Wagner S. How do ML practitioners perceive explainability? An interview study of practices and challenges. 2025. https://doi.org/10.1007/s10664-024-10565-2
  16. Nelsen RB. Properties and applications of copulas: A brief survey. First Brazilian Conf Stat Model Insur Financ. 2003;3:1–18.
  17. Durante F, Puccetti G, Scherer M, Vanduffel S. My introduction to copulas. Dependence Modeling. 2017;5(1):88–98.
  18. Billard L, Diday E. Symbolic Data Analysis: Definitions and Examples. 2004.
  19. Beranger B, Lin H, Sisson S. New models for symbolic data analysis. Adv Data Anal Classif. 2022;17(3):659–99.
  20. Stitou Y, Lasmar N, Berthoumieu Y. Copulas based multivariate gamma modeling for texture classification. ICASSP, IEEE Int Conf Acoust Speech Signal Process - Proc. 2009;1045–1048.
  21. Salinas-Gutiérrez R, Hernández-Aguirre A, Rivera-Meraz MJJ, Villa-Diharce ER. Using Gaussian Copulas in supervised probabilistic classification. Stud Comput Intell. 2010;318:355–72.
  22. Bansal R, Hao X, Liu J, Peterson BS. Using Copula distributions to support more accurate imaging-based diagnostic classifiers for neuropsychiatric disorders. Magn Reson Imaging. 2014;32(9):1102–13. pmid:25093634
  23. Pérez-Díaz ÁP, Salinas-Gutiérrez R, Hernández-Quintero A, Dalmau-Cedeño O. Supervised classification based on copula functions. Res Comput Sci. 2017;133:9–18.
  24. Winarni S, Indratno SW, Arisanti R, Pontoh RS. Image feature extraction using symbolic data of cumulative distribution functions. Mathematics. 2024.
  25. Indratno SW, Winarni S, Sari KN. Classification of images using Gaussian copula model in empirical cumulative distribution function space. PLoS One. 2024;19(12):e0309884. pmid:39642166
  26. Shao H, Ma E, Zhu M, Deng X, Zhai S. MNIST Handwritten Digit Classification Based on Convolutional Neural Network with Hyperparameter Optimization. 2023. https://doi.org/10.32604/iasc.2023.036323
  27. Wang R. Handwritten digit recognition based on the MNIST dataset under PyTorch. 2023;0:450–5. https://doi.org/10.54254/2755-2721/8/20230216
  28. Wen Y, Ke W. Improved localization and recognition of handwritten digits on MNIST dataset with ConvGRU. 2025:1–16.
  29. Matijašević P, Mravik M. Handwritten digit recognition using convolutional neural networks and big data processing. 2025;531–5. https://doi.org/10.15308/Sinteza-2025-531-535
  30. Ben D, Noured D. Handwritten digit recognition: Comparative analysis of ML, CNN, vision transformer, and hybrid models on the MNIST dataset. 2025.
  31. 31. Trivedi PK, Zimmer DM. Copula Modeling: An Introduction for Practitioners. Foundations and Trends® in Econometrics. 2007;1(1):1–111.
  32. 32. Czado C, Nagler T. Vine Copula Based Modeling. Annual Review of Statistics and Its Application. 2022;9:453–77.
  33. 33. Patton AJ. A review of copula models for economic time series. J Multivar Anal. 2012;110:4–18.
  34. 34. Ko V, Hjort NL. Model robust inference with two-stage maximum likelihood estimation for copulas. J Multivar Anal. 2019;171:362–81.
  35. 35. Ko V, Hjort NL. Copula information criterion for model selection with two-stage maximum likelihood estimation. Econom Stat. 2019;12:167–80.
  36. 36. Duong JECT. Multivariate kernel smoothing and its applications. Taylor & Francis Group; 2018.
  37. 37. Castro R. Introduction and the empirical CDF. 2013:1–10.
  38. 38. Diday E, Vrac M. Mixture decomposition of distributions by copulas in the symbolic data analysis framework. Discrete Applied Mathematics. 2005;147:27–41.
  39. 39. Vrac M, Chédin A, Diday E. Clustering a global field of atmospheric profiles by mixture decomposition of copulas. J Atmos Ocean Technol. 2005;22:1445–59.
  40. 40. Sinaga KP, Yang M-S. Unsupervised K-Means Clustering Algorithm. IEEE Access. 2020;8:80716–27.
  41. 41. Ikotun AM, Ezugwu AE, Abualigah L, Abuhaija B, Heming J. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Inf Sci (Ny). 2023;622:178–210.
  42. 42. Wong GN, Weiner ZJ, Tkachenko AV, Elbanna A, Maslov S, Goldenfeld N. Modeling COVID-19 Dynamics in Illinois under Nonpharmaceutical Interventions. Phys Rev X. 2020;10:41033.
  43. 43. Ghazal MT, Zahid Hussain M, Said AR, Nadeem A, Kamrul Hasan M, Ahmad M, et al. Performances of K-Means Clustering Algorithm with Different Distance Metrics. Intelligent Automation & Soft Computing. 2021;29(3):735–42.
  44. 44. Tambunan HB, Barus DH, Hartono J, Alam AS, Nugraha DA, Usman HHH. Electrical peak load clustering analysis using K-means algorithm and silhouette coefficient. Proceeding - 2nd Int Conf Technol Policy Electr Power Energy, ICT-PEP 2020. 2020. 258–262.
  45. 45. Shahapure KR, Nicholas C. Cluster quality analysis using silhouette score. Proc - 2020 IEEE 7th Int Conf Data Sci Adv Anal DSAA 2020. 2020. 747–8.
  46. 46. Sen S, Diawara N, Iftekharuddin KM. Statistical pattern recognition using gaussian copula. J Statist Theory Pract. 2015;9(4):768–77.
  47. 47. Bickel EP, Diggle P, Fienberg S, Gather U. Copula theory and its applications. 2009.
  48. 48. Chaki J, Dey N. A beginner’s guide to image preprocessing techniques. Taylor & Francis Group; 2019.
  49. 49. Kadam SS, Adamuthe AC, Patil AB. CNN model for image classification on MNIST and Fashion-MNIST dataset. J Sci Res. 2020;64:374–84.
  50. 50. Beohar D, Rasool A. Handwritten digit recognition of MNIST dataset using deep learning state-of-the-art artificial neural network (ANN) and Convolutional Neural Network (CNN). 2021 Int Conf Emerg Smart Comput Informatics, ESCI 2021. 2021; 542–8.
  51. 51. Lasmar N-E, Berthoumieu Y. Gaussian Copula multivariate modeling for texture image retrieval using wavelet transforms. IEEE Trans Image Process. 2014;23(5):2246–61. pmid:24686281