Assessment of clustering techniques to support the analyses of soybean seed vigor

Soy is the main product of Brazilian agriculture and the fourth most cultivated bean globally. Since soy cultivation tends to grow and the market is large, guaranteeing product quality is indispensable for enterprises to stay competitive. Industries perform vigor tests to acquire information and evaluate the quality of soybean seed lots before planting. The tetrazolium test, for example, provides information about moisture damage, stink bug damage, or mechanical damage. However, the verification of the cause of each damage and its severity is done by an analyst, sample by sample. Since this is massive and exhausting work, it is susceptible to mistakes. Proposals involving different supervised learning approaches, including active learning strategies, have already been applied and have brought significant results. Therefore, this paper analyzes the performance of unsupervised techniques for classifying soybean seeds. An extensive experimental evaluation was performed, considering 9 different clustering algorithms (partitional, hierarchical, and density-based) applied to 5 image datasets of soybean seeds submitted to the tetrazolium test, including different damages and/or their severity levels. To describe those images, we considered 18 traditional feature extractors. We also considered four metrics (accuracy, FOWLKES, DAVIES, and CALINSKI) and two dimensionality reduction techniques (principal component analysis and t-distributed stochastic neighbor embedding) for validation. The results make it possible to identify descriptors and clustering algorithms that can be used as preprocessing in other learning processes, accelerating and improving the classification process for key agricultural problems.


Introduction
Soy is the fourth most cultivated bean globally and the main product of Brazilian agriculture. For 2021/22, Brazil estimated a record production of 142.009 million tons of soybeans. This value results from a 3.4% increase in cultivated area compared to 2020/21 and a 4.3% increase in productivity per hectare [1].
Those numbers could only be achieved with scientific advances and the availability of new technologies in the productive sector [2]. Among these advances, mechanization, soil management, and prevention solutions for pests and diseases stand out.
Vigor tests are a way to obtain good information for consequent scientific advances. They are commonly used to find quality differences among seed lots during storage or after sowing, identifying the best lots and their planting conditions [3]. One of these tests is the tetrazolium test. It determines the vigor of soybean lots and offers information on the causes of quality reduction, identifying mechanical damage, moisture damage, and bug deterioration [4].
However, the analysis for damage classification is done visually by a specialist. This task may be unfeasible considering the number of samples in a lot. The damage analysis of soybeans submitted to the tetrazolium test is a slow and tiresome process when performed manually, since it is a visual task and requires hours of work [4].
Efforts in the literature [3, 5-8] show that supervised learning techniques for damage classification after the tetrazolium test can present good results. Therefore, to increase productivity, papers such as [5-8] propose active learning techniques (supervised classification) for this type of analysis. Despite achieving significant results, these papers focus on supervised learning techniques.
Unsupervised learning techniques are rarely explored in this context. Some active learning strategies that use clustering techniques as pre-processing consider, for example, the k-means algorithm. However, no extensive experimental evaluation comparing the performance of different clustering algorithms has been done. Therefore, this paper investigates different unsupervised techniques to contribute to the research on unsupervised learning and active learning strategies applied to the classification of soybean damages.
In summary, this paper aims to perform an extensive experimental evaluation, considering different unsupervised techniques to improve the classification process of soybeans subjected to the tetrazolium test. Our contributions are fourfold: i) organization of soybean image groups; ii) extraction and selection of well-suited features that describe the images for clustering; iii) extensive analyses and performance evaluation of different clustering techniques; iv) novel comparative analyses and validation of the obtained results considering different image datasets, clustering algorithms, descriptor techniques, and metrics for a vital agricultural problem.

Materials and methods
This section presents the image description techniques and unsupervised learning methods used in this paper. We also present different methods for clustering evaluation and graphical/visualization results.

Descriptor learning
To classify a set of soybean images, it is necessary to capture the different damage patterns that separate them into specific classes. Descriptors consider different visual properties based on color, shape, and texture. Each descriptor extracts and generates a feature vector (numeric values) describing the images.

Feature extraction.
There are several feature extractors in the literature. Color-based extractors are widely used, especially for natural image classification. Some extractors consider the color histogram [9], which describes the global image content according to the pixel percentage of each color.
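As an illustration of the idea, a simple global color histogram can be sketched in a few lines; the choice of 4 bins per RGB channel is an assumption of this example, not the configuration of the extractor in [9]:

```python
import numpy as np

def global_color_histogram(image, bins_per_channel=4):
    """Quantize each RGB channel into a few bins and return the
    percentage of pixels falling into each quantized color."""
    # image: H x W x 3 uint8 array; 256 // bins_per_channel values per bin.
    quantized = (image // (256 // bins_per_channel)).astype(np.int64)
    # Collapse each pixel's (r, g, b) bin triple into a single color index.
    idx = (quantized[..., 0] * bins_per_channel
           + quantized[..., 1]) * bins_per_channel + quantized[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins_per_channel ** 3)
    return hist / hist.sum()  # a 64-D feature vector that sums to 1

# Toy 2x2 "image": two pure red pixels and two pure blue pixels.
img = np.array([[[255, 0, 0], [255, 0, 0]],
                [[0, 0, 255], [0, 0, 255]]], dtype=np.uint8)
vec = global_color_histogram(img)
```

Each image then becomes a fixed-length numeric vector, which is the kind of input the clustering algorithms of Section 2.2 operate on.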
Some extractors combine different types of features, like the Fuzzy Color and Texture Histogram (FCTH) [22] and the Joint Composite Descriptor (JCD) (FCTH + CEDD), which combine characteristics based on color and texture to describe images.

Machine learning
Machine learning is an area of artificial intelligence that proposes the development of systems capable of learning a specific pattern or behavior automatically using examples or experience. To do so, different supervised and unsupervised learning approaches can be considered.

Supervised classification.
In supervised classification, the learning process is performed through a set of previously labeled (training) data [23]. In this process, an oracle or specialist needs to annotate samples that best represent each class. Then, after training, the algorithm can present a better result [24]. Examples of supervised learning methods from the literature are: Random Forest (RF) [25], Support Vector Machines (SVM) [26], and Optimum-Path Forest (OPF) [27].
However, supervised learning techniques present challenges: class imbalance in datasets, noise related to imperfect data or outliers, overfitting, underfitting, and missing values for some features. These challenges directly depend on the data used in training.

Unsupervised classification.
Unsupervised classification aims to find data clusters in a multidimensional feature space according to specific similarity criteria. Using similarity relations, similar samples are grouped in the same cluster. Therefore, it is possible to describe the inherent characteristics of each cluster produced by the grouping process, enabling a better understanding of the clustered data. Clustering methods can be separated into categories: partitional, hierarchical, and density-based.
Partitional methods make k data partitions (clusters). This model uses an iterative relocation technique to improve an initial partitioning. The main method of this category is the well-known k-means. Proposed in [28], it is one of the most popular clustering algorithms. This method considers four phases: initialization, clustering, centroids' movement, and optimization. In the initialization, random samples are defined as centroids (i.e., central cluster points). The clustering phase calculates the distance (e.g., Euclidean) between all samples and centroids. Next, each sample is assigned to the cluster whose centroid is at the shortest distance. After the clustering, the centroids' movement phase calculates the sample mean of each cluster, and the computed means become the new centroids. Finally, the optimization phase repeats the clustering and centroids' movement phases until the central values of the clusters stabilize, reaching the final clusters.
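The four phases above can be reproduced with scikit-learn's KMeans; a minimal sketch on synthetic 2-D data, where the two Gaussian blobs merely stand in for image feature vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two well-separated 2-D blobs playing the role of two seed classes.
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(5, 0.2, (30, 2))])

# n_clusters is k; fit() runs the initialization, assignment, and
# centroid-movement phases repeatedly until the centroids stabilize.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels, centroids = km.labels_, km.cluster_centers_
```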
Like k-means, the k-medoids algorithm [29] has the same phases. However, in the centroids' movement phase, the new center is not the cluster mean (a "ghost" point that need not coincide with any sample), but the cluster sample (called the medoid) that minimizes the distance to all other samples in the cluster. Compared to k-means, this method is less susceptible to noise, since outliers have little effect on the medoid choice. CLARANS [30] and FCM [31] are other examples of partitional algorithms.
Hierarchical methods are divided into agglomerative and divisive. These methods build (agglomerative) or split (divisive) a binary tree, in which leaves are data samples and connections are made based on data distance (dissimilarity). AGNES [32], CURE [33], and ROCK [34] are examples of hierarchical algorithms.
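AGNES-style agglomerative clustering is directly available in scikit-learn (CURE and ROCK require dedicated implementations, e.g., in the pyclustering library); a minimal sketch, where the average-linkage choice is an assumption of this example:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (15, 2)), rng.normal(4, 0.2, (15, 2))])

# Agglomerative (bottom-up): each sample starts as its own cluster and
# the two closest clusters are merged until n_clusters remain.
labels = AgglomerativeClustering(n_clusters=2,
                                 linkage="average").fit_predict(X)
```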
Density-based methods perform the clustering according to a density established through input parameters. These parameters dictate the minimum density necessary to form a cluster. The main technique of this category is Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [35]. This algorithm defines that a cluster has central (core) points and border points. Central points are those that, given a radius r, have at least a minimum number (MinPts) of neighboring points (samples). Border points do not have the minimum number of neighbors, but at least one of their neighbors must be a central point. The technique finishes when no new points can be assigned to a cluster. This paper also considers the density-based algorithm named OPTICS [36].
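In scikit-learn's DBSCAN, eps plays the role of the radius r and min_samples the role of MinPts; a sketch on synthetic data where an isolated point ends up flagged as noise (label -1):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus one isolated point far from both.
X = np.vstack([rng.normal(0, 0.1, (20, 2)),
               rng.normal(3, 0.1, (20, 2)),
               [[10.0, 10.0]]])

# Core points have >= min_samples neighbours within radius eps; the
# isolated point is neither core nor border, so it is labelled -1.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
```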

Clustering evaluation.
Two types of metrics are considered to analyze clustering: those based on the true labels of the samples, and those that do not use this information. Although our datasets provide the samples' true labels, we also considered metrics that ignore such information to reach a broader and better analysis. These metrics are Fowlkes-Mallows (FOWLKES) [37], Davies-Bouldin (DAVIES) [38], and Calinski-Harabasz (CALINSKI) [39].
Another well-known metric is accuracy, which gives the percentage of correctly clustered samples: the number of correctly clustered samples divided by the total number of samples. Clustering techniques are not committed to correctly defining labels, only to grouping samples. As a result, clusters may or may not adequately represent different classes. Therefore, the accuracy calculation is based on a method that reorganizes the assigned labels to find the highest possible accuracy [40]. To do so, it uses the confusion matrix, in which rows comprise true labels and columns comprise clusters. The goal is to find the column reorganization that generates the best accuracy (i.e., an organization in which the largest value of each row lies on the main diagonal of the matrix). Another metric that uses true label knowledge is FOWLKES [37], defined as the geometric mean of precision and recall. Eq 1 presents the metric formula, where TP corresponds to the number of true positives (samples correctly grouped with label C that are part of C), FP to the number of false positives (samples grouped with label C that are not part of C), and FN to the number of false negatives (samples that should be grouped with label C but were not). The Fowlkes metric ranges between 0 and 1; the higher the value, the greater the similarity between the found clusters and the ground-truth classification.

$$\mathrm{FOWLKES} = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}} \quad (1)$$

When the ground-truth labels of the samples are unknown, metrics are employed to evaluate the quality of the obtained clusters. These metrics are generally based on cohesion and separation measures. Cohesion refers to the distance between samples of the same cluster, and separation indicates the distance between clusters [41]. The DAVIES metric [38] compares the distance between clusters with their sizes to indicate how good the separation between them is. Eq 2 formally defines the DAVIES metric:

$$\mathrm{DAVIES} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} R_{ij} \quad (2)$$
where k represents the total number of clusters and R_ij comprises the comparison between two clusters, calculated as presented by Eq 3:

$$R_{ij} = \frac{s_i + s_j}{d_{ij}} \quad (3)$$
where s_i represents the mean distance of each sample of cluster i to its centroid, and d_ij is the distance between the centroids of clusters i and j. The index has a minimum of 0, and lower values indicate better groupings. CALINSKI is another metric that does not use true label knowledge. In this case, high values indicate better clustering definitions (i.e., dense and well-separated clusters). Its value is given by the ratio between the between-cluster dispersion and the within-cluster dispersion [39].
For a dataset E of size n_E with k clusters, the CALINSKI definition is presented by Eq 4, where tr(B_k) represents the trace of the between-cluster dispersion matrix, and tr(W_k) is the trace of the within-cluster dispersion matrix (defined by Eqs 5 and 6):

$$\mathrm{CALINSKI} = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \times \frac{n_E - k}{k - 1} \quad (4)$$

$$W_k = \sum_{i=1}^{k} \sum_{x \in C_i} (x - c_i)(x - c_i)^{\mathsf{T}} \quad (5)$$

$$B_k = \sum_{i=1}^{k} n_i (c_i - c_E)(c_i - c_E)^{\mathsf{T}} \quad (6)$$

In this case, C_i represents the sample set of the i-th cluster, c_i is the i-th cluster center, c_E is the center of E, and n_i is the number of samples of the i-th cluster.
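With scikit-learn, the three metrics can be computed directly; the synthetic data below only illustrates their expected behavior (FOWLKES needs the true labels, while DAVIES and CALINSKI use only the geometry of the clusters):

```python
import numpy as np
from sklearn.metrics import (fowlkes_mallows_score, davies_bouldin_score,
                             calinski_harabasz_score)

rng = np.random.default_rng(0)
# Two well-separated blobs; "truth" marks which blob each sample is from.
X = np.vstack([rng.normal(0, 0.2, (25, 2)), rng.normal(6, 0.2, (25, 2))])
truth = np.array([0] * 25 + [1] * 25)
good = truth.copy()        # clustering that matches the blobs exactly
bad = np.tile([0, 1], 25)  # clustering that mixes the two blobs

fm_good, fm_bad = fowlkes_mallows_score(truth, good), fowlkes_mallows_score(truth, bad)
db_good, db_bad = davies_bouldin_score(X, good), davies_bouldin_score(X, bad)
ch_good, ch_bad = calinski_harabasz_score(X, good), calinski_harabasz_score(X, bad)
```

The good clustering yields FOWLKES = 1, a lower DAVIES value, and a higher CALINSKI value than the bad one, matching the directions stated above.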

Dimensionality reduction.
Descriptors presented in Section 2.1.1 describe images as high-dimensional vectors. Therefore, direct graphical visualization of the samples' values is impossible. To solve this problem, dimensionality reduction methods are applied to reduce the samples to two or three dimensions.
Dimensionality reduction techniques (or feature reduction) are separated into feature selection methods and feature transformation methods [42]. Selection methods evaluate which sample features are the most important for a better representation. Transformation methods use all sample features to calculate a new representation and are more suitable for graphical visualization.
Some transformation-based feature reduction techniques use linear functions, and others use nonlinear functions. Principal Component Analysis (PCA) [43] performs a linear mapping of the data to a reduced dimension such that the data variance in the reduced dimension is maximized. Another technique is t-distributed Stochastic Neighbor Embedding (t-SNE) [44], a nonlinear technique that models pairwise similarities as probabilities and minimizes the divergence between the probability distributions of the original and the reduced representations.
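Both reductions are available in scikit-learn; a sketch reducing hypothetical 18-D feature vectors (one per image) to 2-D for plotting, where the perplexity value is an assumption of this example:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 18))  # 60 hypothetical 18-D feature vectors

X_pca = PCA(n_components=2).fit_transform(X)    # linear, variance-maximizing
X_tsne = TSNE(n_components=2, perplexity=10.0,
              random_state=0).fit_transform(X)  # nonlinear, neighbour-preserving
```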

Proposed methodology
This section presents the proposed methodology for soybean image classification, describing the steps of each process. We present the description of the soybean image datasets (Section 3.1), the experimental scenarios (Section 3.2), the image descriptors, and the algorithm configuration. Soybean images can be organized in many ways, resulting in different datasets for experimental evaluation. In this stage, we removed the images with stain and clipping problems. Section 3.1 presents more details about the organized datasets.
We extracted features for each organized dataset. In this stage, each soybean image is represented by a feature vector, considering the descriptors presented in Section 2.1.
After the feature extraction process, the clustering methods (presented in Section 2.2.2) are evaluated through different metrics (mentioned in Section 2.2.3). Therefore, it is possible to analyze the performance of each algorithm and its configurations in soybean classification.

Description of the datasets
After the tetrazolium test, analysts perform a visual process to identify and label each soybean portion (containing two internal parts and two external parts) into classes. The class is defined by the damage type and its severity level.
In [8], two image acquisitions were performed in a real soybean seed enterprise, where 1400 images were obtained from the first one and 1358 from the second one. Then, we generated different subsets from both image acquisitions in our work.
To do so, we consider each soybean portion labeled separately, given that we observed better performance in this scenario in [8]. This paper identifies the datasets as D1-D5; D4 and D5 are reorganizations of the D3 dataset. Tables 1 to 5, respectively, present a description and the number of samples in each dataset class. D1 and D2 are, respectively, the first and second acquisitions, in which we organized samples with different damages and severity levels between 0 and 3. However, for the D1 dataset, images that presented stains and cutting problems were removed during the organization stage, since they can impact the algorithms' learning process and performance. The D3 dataset is the aggregation of D1 and D2. Since D3 presents all possible samples, the D4 and D5 datasets were created from it. D4 is a reorganization of D3 in which we removed the severity levels (i.e., only damage types are considered). In the D5 dataset, both the severity levels and the differentiation of internal and external portions were removed. These new datasets were created to measure how changes in the number of samples and classes impact the experiments. Table 6 summarizes each dataset regarding its number of classes and samples.

Description of the scenarios
In order to perform the experiments, we defined different scenarios. First, we extract features from the datasets based on the image descriptors shown in Table 7 and described in Section 2.1.
After the description process, different clustering algorithms were applied and analyzed through different metrics (see Sections 2.2.3 and 2.2.4). Table 8 shows the clustering methods considered in our work. Following the literature, default values were used as input parameters for the clustering methods; as the number of clusters, the number of classes of each dataset was used.
As presented in Section 2.2.2, density-based algorithms need the radius and the minimum number of neighboring points as parameters. We used the technique presented in [35] to set the radius value. We performed a parametric analysis to define the minimum number of neighboring points: multiple experiments were performed, changing the parameter value (i.e., grid search). Then, we chose the value for which the resulting number of clusters was closest to the number of classes.
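This tuning loop can be sketched as follows; taking a high percentile of the sorted k-nearest-neighbour distances as a stand-in for the k-distance "elbow" of [35], and the MinPts range 2-10, are assumptions of this example rather than the paper's exact settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Three synthetic blobs standing in for three classes.
X = np.vstack([rng.normal(4 * i, 0.3, (20, 2)) for i in range(3)])
n_classes = 3

# Radius: a high percentile of the k-nearest-neighbour distances.
k = 4
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
eps = float(np.percentile(dists[:, k], 90))

# Grid search over MinPts: keep the value whose cluster count
# (ignoring noise, label -1) is closest to the number of classes.
best = None
for min_pts in range(2, 11):
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    n_clusters = len(set(labels.tolist()) - {-1})
    gap = abs(n_clusters - n_classes)
    if best is None or gap < best[0]:
        best = (gap, min_pts, n_clusters)
```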
To compute the accuracy, we applied the clustering label reorganization method [40], as presented in Section 2.2.3.
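A common way to implement this reorganization is the Hungarian algorithm applied to the negated confusion matrix; a sketch assuming scipy is available (the helper name clustering_accuracy is ours, not from [40]):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """Best accuracy over all one-to-one mappings of cluster ids to
    class ids, found on the confusion matrix."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    # Confusion matrix: rows are true classes, columns are clusters.
    cm = np.zeros((len(classes), len(clusters)), dtype=int)
    for i, c in enumerate(classes):
        for j, g in enumerate(clusters):
            cm[i, j] = int(np.sum((true_labels == c) & (cluster_labels == g)))
    # Column reorganization that maximizes the diagonal (negate to maximize).
    rows, cols = linear_sum_assignment(-cm)
    return cm[rows, cols].sum() / len(true_labels)

# Cluster ids differ from class ids, but the grouping itself is perfect.
perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
```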

Results and discussion
The accuracy results make it possible to perform a first performance analysis. Tables 9 to 13 present the accuracy results obtained by each descriptor and clustering method for each dataset (D1-D5, respectively).
The bold values highlight the clustering methods that obtained the best result for each descriptor.In addition, the underlined values emphasize the descriptor that reached the best performance for each clustering method.Finally, the asterisk value represents the best combination (descriptor and clustering method).
It is possible to observe that the CURE and ROCK algorithms stand out for most datasets and descriptors, presenting the best results. The AGNES algorithm also reached good results, although with lower accuracy values compared to CURE and ROCK.
According to the obtained results, no descriptor can be considered the best for all clustering methods. However, in general, the FCTH, GCH, and RCS descriptors presented the best accuracy values, especially when combined with the AGNES, CURE, and ROCK clustering algorithms.
Since the internal and external samples belong to the same cluster in the D5 dataset, it is possible to observe a different behavior compared to the other datasets. For instance, we notice that the OPTICS and ROCK algorithms present the highest accuracy for several descriptors. ROCK presents the best result, with an accuracy of 83.1%.
In addition to the accuracies, performance analyses were performed using the other metrics (FOWLKES, DAVIES, and CALINSKI). Tables 14-18 present the results obtained by each descriptor: the datasets D1 and D2 with CURE and the datasets D3, D4, and D5 with ROCK, since these were the best combinations of descriptor and clustering algorithm. In order to achieve a better understanding of the results obtained with the several metrics, it is vital to observe the graphical clustering representation. To do so, the PCA and t-SNE dimensionality reduction techniques were considered (Figs 3-8). Figs 3-8 present the highlighted clustering results from Table 17 (i.e., the best combinations of descriptor and clustering algorithm according to each metric). The colors represent an obtained cluster, while the symbols represent the true classes of the samples.
As presented in Table 4, the majority (83%) of the dataset's samples are from the perfect external and internal classes. It is possible to observe the clustering of these samples, represented by circle and diamond symbols (Figs 3-8). Neither the PCA nor the t-SNE visualizations show the other damage classes correctly clustered. In the clustering presented in Fig 3, even with an accuracy of 81%, it is possible to note that the majority of the moisture, bug, and mechanical damage samples were considered perfect; that is, almost all of the correct assignments refer to the perfect classes. The clustering shown in Fig 4, selected by the DAVIES metric (Table 17), does not indicate a good result either, since a cluster containing almost all samples was generated. In this case, the accuracy obtained from the clustering was 44%. Therefore, DAVIES is not a good metric for this context.
We noticed the same behavior in Fig 5, where we can see that the results obtained with the CALINSKI metric (e.g., 6614.24 for the MPO_ROCK pair in Table 17) do not show a good clustering formation. The majority of the samples from both perfect classes (i.e., XE and XI, represented by circle and diamond, respectively), which comprise almost 83% of the samples, were divided into five clusters, impacting the accuracy (48%). Therefore, CALINSKI is not a good metric for this context either.

Conclusion
In this paper, we performed an extensive experimental evaluation, considering different unsupervised learning techniques applied to soybean seed image datasets from the tetrazolium test. To do so, we used 5 image datasets considering different scenarios, damages, and/or their respective severity levels. To describe these images, we considered 18 different image descriptors. Moreover, we evaluated 9 clustering algorithms from different paradigms (partitional, hierarchical, and density-based) under 4 metrics (accuracy, FOWLKES, DAVIES, and CALINSKI). We also considered 2 dimensionality reduction techniques (PCA and t-SNE) to validate the analyses and visualize the clusters' distributions.
Analyzing the obtained results, we observed similar behavior across the different datasets, in which the best accuracies were driven by the number of perfect samples (without damage) in the dataset. The class imbalance, including the excess of perfect samples, disturbs the clustering algorithms' performance, especially for classes with few samples.
We generally reached the best accuracy results for each descriptor with the AGNES, CURE, and ROCK clustering algorithms; FCTH, GCH, and RCS presented the highest accuracy values.

Fig 1(a) and 1(b) present (pre and post) matrix reorganization examples, respectively.

Fig 2 illustrates the pipeline of our proposed methodology.

Table 9. Accuracy results obtained by each descriptor and clustering method from the dataset D1.
The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). The mean accuracies considering all descriptors and clustering methods are also presented. https://doi.org/10.1371/journal.pone.0285566.t009

Table 10. Accuracy results obtained by each descriptor and clustering method from the dataset D2.
The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). The mean accuracies considering all descriptors and clustering methods are also presented. https://doi.org/10.1371/journal.pone.0285566.t010

Table 11. Accuracy results obtained by each descriptor and clustering method from the dataset D3.
The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). The mean accuracies considering all descriptors and clustering methods are also presented. https://doi.org/10.1371/journal.pone.0285566.t011

Table 12. Accuracy results obtained by each descriptor and clustering method from the dataset D4.
The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). The mean accuracies considering all descriptors and clustering methods are also presented. https://doi.org/10.1371/journal.pone.0285566.t012

Table 13. Accuracy results obtained by each descriptor and clustering method from the dataset D5.
The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). The mean accuracies considering all descriptors and clustering methods are also presented.

Table 14. Results obtained by each descriptor with the clustering method CURE, considering each of the metrics (accuracy, FOWLKES, DAVIES, and CALINSKI), from the dataset D1.
Bold values highlight the best descriptors' results according to each metric.

Table 15. Results obtained by each descriptor with the clustering method CURE, considering each of the metrics (accuracy, FOWLKES, DAVIES, and CALINSKI), from the dataset D2.
Bold values highlight the best descriptors' results according to each metric.

Table 16. Results obtained by each descriptor with the clustering method ROCK, considering each of the metrics (accuracy, FOWLKES, DAVIES, and CALINSKI), from the dataset D3.
Bold values highlight the best descriptors' results according to each metric. It is possible to observe that the FOWLKES metric presents results similar to the accuracy. When the DAVIES and CALINSKI metrics reach their best results (bold values), in most cases the accuracy values are inferior to the best accuracy obtained for the clustering. This may indicate that metrics based on cohesion and separation should not, in this context, be used to assess correctness. https://doi.org/10.1371/journal.pone.0285566.t016

Table 17. Results obtained by each descriptor with the clustering method ROCK, considering each of the metrics (accuracy, FOWLKES, DAVIES, and CALINSKI), from the dataset D4.
Bold values highlight the best descriptors' results according to each metric. https://doi.org/10.1371/journal.pone.0285566.t017

Table 18. Results obtained by each descriptor with the clustering method ROCK, considering each of the metrics (accuracy, FOWLKES, DAVIES, and CALINSKI), from the dataset D5.
Bold values highlight the best descriptors' results according to each metric. https://doi.org/10.1371/journal.pone.0285566.t018