
Assessment of clustering techniques to support the analyses of soybean seed vigor

  • Eduardo R. de Oliveira,

    Roles Investigation, Methodology, Software, Validation, Visualization, Writing – original draft

    Affiliation Department of Computing, Federal University of Technology - Parana, Cornelio Procopio, PR, Brazil

  • Pedro H. Bugatti,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Supervision, Validation, Writing – review & editing

    Affiliations Department of Computing, Federal University of Technology - Parana, Cornelio Procopio, PR, Brazil, Department of Computing, Federal University of Sao Carlos, Sao Carlos, SP, Brazil

  • Priscila T. M. Saito

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – review & editing

    priscilasaito@ufscar.br

    Affiliations Department of Computing, Federal University of Technology - Parana, Cornelio Procopio, PR, Brazil, Department of Computing, Federal University of Sao Carlos, Sao Carlos, SP, Brazil

Abstract

Soy is the main product of Brazilian agriculture and the fourth most cultivated bean globally. Since soy cultivation tends to increase, and given this large market, guaranteeing product quality is indispensable for enterprises to stay competitive. Industries perform vigor tests to acquire information and evaluate the quality of soybean seed lots. The tetrazolium test, for example, provides information about moisture damage, stink bug damage, or mechanical damage. However, the verification of the cause of the damage and its severity is done by an analyst, seed by seed. Since this is massive and exhausting work, it is susceptible to mistakes. Proposals involving different supervised learning approaches, including active learning strategies, have already been applied to this problem and have brought significant results. Therefore, this paper analyzes the performance of unsupervised techniques for classifying soybeans. An extensive experimental evaluation was performed, considering nine different clustering algorithms (partitional, hierarchical, and density-based) applied to 5 image datasets of soybean seeds submitted to the tetrazolium test, including different damages and/or their severity levels. To describe those images, we considered 18 traditional feature extractors. We also considered four metrics (accuracy, FOWLKES, DAVIES, and CALINSKI) and two dimensionality reduction techniques (principal component analysis and t-distributed stochastic neighbor embedding) for validation. The results show that this paper makes essential contributions, since it identifies descriptors and clustering algorithms that can be used as preprocessing in other learning processes, accelerating and improving the classification process of key agricultural problems.

1 Introduction

Soy is the fourth most cultivated bean globally and the main product of Brazilian agriculture. For 2021/22, Brazil estimated a record production of 142.009 million tons of soybeans. This value results from an increase of 3.4% in the cultivated area compared to 2020/21 and of 4.3% in productivity per hectare [1].

Those numbers could only be achieved with scientific advances and the availability of new technologies in the productive sector [2]. Among these advances, mechanization, soil management, and solutions for pest and disease prevention stand out.

Vigor tests are a way to obtain valuable information for subsequent scientific advances. They are commonly used to find quality differences among seed lots during storage or after sowing, highlighting the lots in the best condition for planting [3]. One of these tests is the tetrazolium test. It determines the vigor of soybean lots and offers information on the causes of quality reduction, identifying mechanical damage, moisture damage, and stink bug deterioration [4].

However, the analysis for damage classification is done visually by a specialist. This task may be unfeasible considering the number of samples in a lot. The damage analysis of soybeans submitted to the tetrazolium test is a slow and tiresome process when performed manually, since it is a visual task and requires hours of work [4].

Efforts in the literature [3, 5–8] show that supervised learning techniques for damage classification after the tetrazolium test can present good results. Therefore, to increase productivity, papers such as [5–8] propose active learning techniques (supervised classification) for this type of analysis. Despite their significant results, these papers focus on supervised learning techniques.

Unsupervised learning techniques are rarely explored in this context. Some active learning strategies use clustering techniques as pre-processing, considering, for example, the k-means algorithm. However, no extensive experimental evaluation comparing the performance of different clustering algorithms has been done. Therefore, this paper investigates different unsupervised techniques to contribute to research on unsupervised learning and active learning strategies applied to the classification of soybean damage.

In summary, this paper aims to perform an extensive experimental evaluation, considering different unsupervised techniques to improve the classification process of soybeans subjected to the tetrazolium test. Our contributions are fourfold: i) organization of soybean image groups; ii) extraction and selection of well-suited features that describe the images’ clustering; iii) extensive analyses and performance evaluation of different clustering techniques; iv) novel comparative analyses and validation of the obtained results considering different image datasets, clustering, descriptors techniques and different metrics to a vital agricultural problem.

2 Materials and methods

This section presents image description techniques and unsupervised learning methods used in this paper. We also present different methods for clustering evaluation and graphical/visualization results.

2.1 Descriptor learning

To classify a set of soybean images, it is necessary to capture the different damage patterns that separate them into specific classes. Image descriptors consider different visual properties based on color, shape, and texture. Each descriptor extracts and generates a feature vector (of numeric values) describing an image.

2.1.1 Feature extraction.

There are several feature extractors in the literature. Color-based extractors are widely used, especially for natural image classification. Some extractors consider the color histogram [9], which describes the global image content according to the pixel percentage of each color.

Other examples of color-based extractors are: Auto Color Correlogram (ACC) [10], Border/Interior Pixel Classification (BIC) [11], Color and Edge Directivity Descriptor (CEDD) [12], Global Color Histogram (GCH) [13], Local Color Histogram (LCH) [14], Reference Color Similarity (RCS) [15], among others.

Texture properties can also represent images since they have information about luminosity, spatial distribution, and structural arrangement of the surface relating to the neighboring region. Gabor filters [16], Haralick descriptors [17], Local Binary Pattern (LBP) [18], Moments [19], First Order Statistics (MPO), First Order Statistics w/ color (MPOC), Pyramid Histogram of Oriented Gradients (PHOG) [20], Tamura [21] are examples of texture-based extractors.

Some extractors combine different types of features, such as the Fuzzy Color and Texture Histogram (FCTH) [22] and the Joint Composite Descriptor (JCD), which merges FCTH and CEDD, combining color- and texture-based characteristics to describe images.
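As an illustration of a color-based descriptor, the sketch below implements a minimal global color histogram in the spirit of GCH (a simplified illustration, not the implementation of [13]): each RGB channel is quantized into a few bins, and the descriptor is the normalized count of pixels per quantized color.

```python
import numpy as np

def global_color_histogram(image, bins_per_channel=4):
    """Quantize each RGB channel into `bins_per_channel` levels and
    return the normalized count of pixels falling in each color bin."""
    # Map 0-255 intensities to per-channel bin indices.
    q = (image.astype(np.uint32) * bins_per_channel) // 256
    # Combine the three channel indices into one color index.
    idx = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins_per_channel ** 3)
    # Normalize to the pixel percentage of each color, as described above.
    return hist / hist.sum()

# Toy 2x2 RGB "image": two red-ish and two green-ish pixels.
img = np.array([[[250, 10, 10], [240, 5, 5]],
                [[10, 250, 10], [5, 240, 5]]], dtype=np.uint8)
fv = global_color_histogram(img)  # 64-dimensional feature vector
```

With 4 bins per channel the feature vector has 4³ = 64 dimensions, matching the kind of fixed-length vectors the clustering algorithms consume.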

2.2 Machine learning

Machine learning is an area of artificial intelligence that proposes the development of systems capable of learning a specific pattern or behavior automatically using examples or experience. To do so, different supervised and unsupervised learning approaches can be considered.

2.2.1 Supervised classification.

In supervised classification, the learning process is performed on a set of previously labeled (training) data [23]. In this process, an oracle or specialist annotates the samples that best represent each class; after training on them, the algorithm can present better results [24]. Examples of supervised learning methods from the literature are: Random Forest (RF) [25], Support Vector Machines (SVM) [26], and Optimum-Path Forest (OPF) [27].

However, supervised learning presents challenges: class imbalance in the dataset, noise from imperfect data or outliers, overfitting, underfitting, and missing values for some features. These challenges depend directly on the data used for training.

2.2.2 Unsupervised classification.

Unsupervised classification aims to find data clusters in a multidimensional feature space according to specific similarity criteria. Using similarity relations, similar samples are grouped in the same cluster. It is thus possible to describe the inherent characteristics of each cluster produced by the grouping process, enabling a better understanding of the clustered data. Clustering methods can be separated into categories: partitional, hierarchical, and density-based.

Partitional methods split the data into k partitions (clusters). This model uses an iterative relocation technique to improve an initial partitioning. The main method of this category is the well-known k-means. Proposed in [28], it is one of the most popular clustering algorithms. The method comprises four phases: initialization, clustering, centroid movement, and optimization. In the initialization, random samples are chosen as centroids (i.e., central cluster points). The clustering phase calculates the distance (e.g., Euclidean) between every sample and every centroid; each sample is then assigned to the cluster whose centroid is closest. After the clustering, the centroid-movement phase recomputes each centroid as the mean of the samples in its cluster. Finally, the optimization phase repeats the clustering and centroid-movement phases until the cluster centers stabilize, yielding the final clusters.
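The four phases above map directly onto scikit-learn's KMeans; the sketch below runs it on synthetic two-dimensional feature vectors (illustrative data, not the soybean descriptors used in the paper).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs standing in for image feature vectors.
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# n_clusters plays the role of k; fit() performs initialization, the
# clustering/assignment phase, centroid movement, and repeats the last
# two until the centroids stabilize.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_              # cluster assignment of each sample
centroids = km.cluster_centers_  # final "ghost" centroids (cluster means)
```

Because the blobs are well separated, each blob ends up in its own cluster regardless of the random initialization.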

Like k-means, the k-medoids algorithm [29] has the same phases. However, in the center-movement phase, the new center is not the cluster mean (a "ghost" point that need not coincide with any sample), but the sample (called the medoid) of the cluster that minimizes the total distance to all other samples. Compared to k-means, this method is less susceptible to noise, since outliers have little effect on the medoid choice. CLARANS [30] and FCM [31] are other examples of partitional algorithms.

Hierarchical methods are divided into agglomerative and divisive. These methods build a binary tree bottom-up (agglomerative) or split it top-down (divisive), in which nodes correspond to groups of samples and connections are made based on the distance (dissimilarity) between them. AGNES [32], CURE [33], and ROCK [34] are examples of hierarchical algorithms.
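As a sketch of the agglomerative (AGNES-style) variant, scikit-learn's AgglomerativeClustering merges the closest groups bottom-up until the requested number of clusters remains (a stand-in for illustration; the paper's AGNES implementation may differ).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two tight groups of points; the agglomerative procedure repeatedly
# merges the pair of groups at the smallest (average) distance.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
agg = AgglomerativeClustering(n_clusters=2, linkage="average").fit(X)
labels = agg.labels_  # each sample's cluster after cutting the tree
```

Cutting the merge tree at two clusters recovers exactly the two spatial groups.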

Density-based methods perform clustering according to a density established through input parameters, which dictate the minimum density necessary to form a cluster. The main technique of this category is Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [35]. This algorithm defines that a cluster has core points and border points. Core points are those that, given a radius r, have at least a minimum number (MinPts) of neighboring points (samples). Border points do not reach the minimum number of neighbors, but at least one of their neighbors must be a core point. The technique finishes when no new points can be assigned to any cluster. This paper also considers the density-based algorithm named OPTICS [36].
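A minimal sketch of this behavior with scikit-learn's DBSCAN, on an illustrative toy dataset (not the soybean images); `eps` and `min_samples` correspond to the radius r and MinPts described above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A dense group of four points plus one far-away point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [10.0, 10.0]])

# eps is the radius r; min_samples is MinPts.
db = DBSCAN(eps=0.3, min_samples=3).fit(X)
labels = db.labels_  # -1 marks noise points that joined no cluster
```

The four nearby points each have enough neighbors within the radius to be core points and form one cluster, while the isolated point is labeled -1 (noise), a case k-means cannot express.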

2.2.3 Clustering evaluation.

Two types of metrics are considered to analyze clusterings: those based on the true labels of the samples, and those that do not use this information. Although our datasets provide the samples' true labels, we also considered metrics that ignore such information, to reach a broader and better analysis. Besides accuracy, the metrics considered are Fowlkes-Mallows (FOWLKES) [37], which uses the true labels, and Davies-Bouldin (DAVIES) [38] and Calinski-Harabasz (CALINSKI) [39], which do not.

Another well-known metric is accuracy, which gives the percentage of correctly clustered samples: the number of correctly clustered samples divided by the total number of samples. Clustering techniques are not committed to naming labels correctly, only to grouping samples; as a result, clusters may or may not adequately correspond to the different classes. Therefore, the accuracy calculation relies on a method that reorganizes the assigned labels to find the highest possible accuracy [40]. It uses the confusion matrix, in which rows correspond to true labels and columns to clusters. The goal is to find the column reordering that yields the best accuracy (i.e., an ordering in which the largest value of each row lies on the main diagonal of the matrix). Fig 1(a) and 1(b) present matrix reorganization examples (before and after), respectively.

Fig 1. Matrix reorganization to reach the best accuracy from a given cluster: (a) before and (b) after reorganization.

https://doi.org/10.1371/journal.pone.0285566.g001
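The column-reorganization step is an assignment problem over the confusion matrix; the sketch below solves it with scipy's Hungarian-algorithm solver (our own illustration — the procedure of [40] may search differently, but the optimum is the same).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """Best accuracy over all one-to-one mappings of clusters to classes."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    # Square confusion matrix: rows are true labels, columns are clusters.
    n = max(true_labels.max(), cluster_labels.max()) + 1
    cm = np.zeros((n, n), dtype=int)
    for t, c in zip(true_labels, cluster_labels):
        cm[t, c] += 1
    # Reorder columns so the largest entries land on the main diagonal.
    rows, cols = linear_sum_assignment(cm, maximize=True)
    return cm[rows, cols].sum() / len(true_labels)

# Cluster ids 0/1 are swapped relative to the true classes, with one error:
# the best mapping still recovers 5 of 6 samples.
acc = clustering_accuracy([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 1])
```

Note that a plain label-equality accuracy would score this clustering near zero, while the reorganized accuracy is 5/6.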

Another metric that uses true-label knowledge is FOWLKES [37], defined as the geometric mean of precision and recall. Eq 1 presents the metric formula, where TP is the number of true positives (samples correctly grouped under label C that belong to C), FP the number of false positives (samples grouped under label C that do not belong to C), and FN the number of false negatives (samples that should have been grouped under label C, but were not). The FOWLKES metric ranges between 0 and 1; the higher the value, the greater the similarity between the found clusters and the ground-truth classification.

FM = TP / √((TP + FP) · (TP + FN))    (1)
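The metric is available in scikit-learn as `fowlkes_mallows_score`; a small sketch on illustrative labelings (not the paper's data):

```python
from sklearn.metrics import fowlkes_mallows_score

true = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 1, 0, 0, 0]  # same partition, cluster ids merely renamed

# Identical groupings score 1.0 even though the cluster ids differ,
# since the metric compares pairs of samples, not label names.
perfect = fowlkes_mallows_score(true, pred)

# A partially mixed grouping scores strictly between 0 and 1.
mixed = fowlkes_mallows_score(true, [0, 0, 1, 1, 0, 1])
```

This invariance to label renaming is why FOWLKES needs no confusion-matrix reorganization, unlike accuracy.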

When the ground-truth labels of the samples are unknown, metrics are employed to evaluate the quality of the obtained clusters. These metrics are generally based on cohesion and separation measures. Cohesion refers to the distance between samples of the same cluster, and separation indicates the distance between clusters [41].

The DAVIES metric [38] compares the distance between clusters with their sizes to indicate how good the separation between them is. Eq 2 formally defines the DAVIES metric,

DB = (1/k) · Σ_{i=1..k} max_{j≠i} R_ij    (2)

where k represents the total number of clusters and R_ij compares two clusters, as given by Eq 3,

R_ij = (s_i + s_j) / d_ij    (3)

where s_i represents the mean distance of the samples of cluster i to its centroid, and d_ij is the distance between the centroids of clusters i and j. Results are greater than or equal to 0, and lower values indicate better clusterings.

CALINSKI is another metric that does not use true-label knowledge. In this case, high values indicate better clustering definitions (i.e., dense and well-separated clusters). Its value is the ratio between the dispersion among clusters and the internal dispersion of each cluster [39].

For a dataset E of size n_E and k clusters, the CALINSKI metric is defined by Eq 4, where tr(B_k) is the trace of the between-cluster dispersion matrix and tr(W_k) is the trace of the within-cluster dispersion matrix (defined by Eqs 5 and 6). Here, C_i is the set of samples of cluster i, c_i is the center of the i-th cluster, c_E is the center of the dataset E, and n_i is the number of samples in the i-th cluster.

CH = [tr(B_k) / tr(W_k)] · [(n_E − k) / (k − 1)]    (4)

W_k = Σ_{i=1..k} Σ_{x ∈ C_i} (x − c_i)(x − c_i)^T    (5)

B_k = Σ_{i=1..k} n_i (c_i − c_E)(c_i − c_E)^T    (6)
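Both label-free metrics are available in scikit-learn; the sketch below (on synthetic blobs, not the paper's data) shows that well-separated clusters score low on DAVIES and high on CALINSKI, while a random labeling of the same points scores worse on both.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# Two well-separated blobs with a matching labeling.
X = np.vstack([rng.normal(0, 0.3, (40, 2)),
               rng.normal(6, 0.3, (40, 2))])
labels = np.array([0] * 40 + [1] * 40)

db = davies_bouldin_score(X, labels)      # lower is better
ch = calinski_harabasz_score(X, labels)   # higher is better

# A random labeling of the same points: poor cohesion and separation.
bad = rng.integers(0, 2, size=80)
db_bad = davies_bouldin_score(X, bad)
ch_bad = calinski_harabasz_score(X, bad)
```

Since neither score looks at true labels, they can be computed on any clustering result, which is exactly their role in the evaluation above.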

2.2.4 Dimensionality reduction.

The descriptors presented in Section 2.1.1 describe images as high-dimensional vectors, so the samples cannot be visualized graphically in their original form. To solve this problem, dimensionality reduction methods are applied to project the samples into two or three dimensions.

Dimensionality (or feature) reduction techniques are separated into feature selection methods and feature transformation methods [42]. Selection methods evaluate which features are the most important for a better representation. Transformation methods use all the features to compute a new representation and are more suitable for graphical visualization.

Transformation-based feature reduction techniques use linear or nonlinear functions. Principal Component Analysis (PCA) [43] performs a linear mapping of the data to a reduced dimension in which the data variance is maximized. Another is t-distributed Stochastic Neighbor Embedding (t-SNE) [44], a nonlinear technique that computes pairwise similarity probabilities and minimizes the divergence between the probabilities in the reduced and original spaces.
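Both reductions can be sketched with scikit-learn, here applied to random 50-dimensional vectors standing in for descriptor outputs (illustrative data only):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# 60 samples of 50-dimensional "feature vectors".
X = rng.normal(0, 1, (60, 50))

# Linear mapping that keeps the two directions of maximal variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Nonlinear embedding that preserves neighborhood similarities.
X_tsne = TSNE(n_components=2, perplexity=10,
              random_state=0).fit_transform(X)
```

Either 60×2 output can then be scatter-plotted, coloring points by cluster and shaping them by true class, as done in the figures of Section 4.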

3 Proposed methodology

This section presents the proposed methodology for soybean image classification, describing the steps of each process. We present the description of the soybean image datasets (Section 3.1), experimental scenarios (Section 3.2), image descriptors, and algorithm configuration. Fig 2 illustrates the pipeline of our proposed methodology.

Soybean images can be organized in many ways, resulting in different datasets for experimental evaluation. We removed the images with stain and clipping problems in this stage. Section 3.1 presents more details about the organized datasets.

We extracted features for each organized dataset. In this stage, each soybean image is represented by a feature vector, considering descriptors presented in Section 2.1.

From the feature extraction process, clustering methods (presented in Section 2.2.2) are evaluated through different metrics (mentioned in Section 2.2.3). Therefore, it is possible to analyze the performance of each algorithm and its configurations in soybean classification.

3.1 Description of the datasets

After the tetrazolium test, analysts perform a visual process to identify and label each soybean portion (containing two internal parts and two external parts) into classes. The class is defined by the damage type and its severity level.

In [8], two image acquisitions were performed in a real soybean seed enterprise, where 1400 images were obtained from the first one and 1358 from the second one. Then, we generated different subsets from both image acquisitions in our work.

To do so, we consider each soybean portion labeled separately, given that we observed better performance in this scenario in [8]. This paper identifies the datasets as D1–D5; D4 and D5 are reorganizations of the D3 dataset. Tables 1 to 5, respectively, present the description and number of samples of each class in each dataset.

Table 1. Classes’ descriptions, samples’ distribution for each class of dataset D1, and their respective severity levels.

https://doi.org/10.1371/journal.pone.0285566.t001

Table 2. Classes’ descriptions, samples’ distribution for each class of dataset D2, and their respective severity levels.

https://doi.org/10.1371/journal.pone.0285566.t002

Table 3. Classes’ descriptions, samples’ distribution for each class of dataset D3, and their respective severity levels.

https://doi.org/10.1371/journal.pone.0285566.t003

Table 4. Classes’ descriptions and samples’ distribution for each class of dataset D4.

https://doi.org/10.1371/journal.pone.0285566.t004

Table 5. Classes’ descriptions and samples’ distribution for each class of dataset D5.

https://doi.org/10.1371/journal.pone.0285566.t005

D1 and D2 correspond, respectively, to the first and second acquisitions, in which we organized samples with different damages and severity levels between 0 and 3. In the D1 dataset, images presenting stains and cutting problems were removed during the organization stage, since they can impact the algorithms' learning process and performance.

The D3 dataset is the aggregation of D1 and D2. Since D3 contains all available samples, the D4 and D5 datasets were created from it. D4 is a reorganization of D3 in which the severity levels were removed (i.e., only damage types are considered). In the D5 dataset, both the severity levels and the differentiation between internal and external portions were removed. These new datasets were created to measure how changes in the number of samples and classes impact the experiments. Table 6 summarizes each dataset regarding its number of classes and samples.

3.2 Description of the scenarios

To perform the experiments, we defined different scenarios. First, we extract features from the datasets using the image descriptors shown in Table 7 and described in Section 2.1.1.

Table 7. Image descriptors used to extract features from each soybean seed image, and their respective feature vector dimensionality.

https://doi.org/10.1371/journal.pone.0285566.t007

After the description process, different clustering algorithms were applied and analyzed through different metrics (see Sections 2.2.3 and 2.2.4). Table 8 shows the clustering methods considered in our work.

Following the standard in the literature, default values were used as input parameters for the clustering methods; where a number of clusters is required, it was set to the number of classes.

As presented in Section 2.2.2, density-based algorithms require the radius and the minimum number of neighboring points as parameters. We used the technique presented in [35] to set the radius value. To define the minimum number of neighboring points, we performed a parametric analysis: multiple experiments were run varying the parameter value (i.e., grid search), and we chose the value for which the resulting number of clusters was closest to the number of classes.

To define the accuracy, we performed a clustering label organization method [40], as presented in Section 2.2.3.

4 Results and discussion

The accuracy results make it possible to perform the first performance analysis. Tables 9 to 13 present accuracy results obtained by each descriptor and clustering method from each dataset (D1-D5, respectively).

Table 9. Accuracy results obtained by each descriptor and clustering method from the dataset D1.

The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). The mean accuracies considering all descriptors and clustering methods are also presented.

https://doi.org/10.1371/journal.pone.0285566.t009

Table 10. Accuracy results obtained by each descriptor and clustering method from the dataset D2.

The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). The mean accuracies considering all descriptors and clustering methods are also presented.

https://doi.org/10.1371/journal.pone.0285566.t010

Table 11. Accuracy results obtained by each descriptor and clustering method from the dataset D3.

The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). The mean accuracies considering all descriptors and clustering methods are also presented.

https://doi.org/10.1371/journal.pone.0285566.t011

Table 12. Accuracy results obtained by each descriptor and clustering method from the dataset D4.

The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). The mean accuracies considering all descriptors and clustering methods are also presented.

https://doi.org/10.1371/journal.pone.0285566.t012

Table 13. Accuracy results obtained by each descriptor and clustering method from the dataset D5.

The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). The mean accuracies considering all descriptors and clustering methods are also presented.

https://doi.org/10.1371/journal.pone.0285566.t013

The bold values highlight the clustering methods that obtained the best result for each descriptor. In addition, the underlined values emphasize the descriptor that reached the best performance for each clustering method. Finally, the asterisk value represents the best combination (descriptor and clustering method).

It is possible to observe that CURE and ROCK algorithms stand out for most datasets and descriptors, presenting the best results. The AGNES algorithm also reached good results. However, AGNES presented lower accuracy values compared to CURE and ROCK algorithms.

According to the obtained results, no descriptor can be considered the best for all clustering methods. However, in general, FCTH, GCH, and RCS descriptors presented the best accuracy values, especially when performed with AGNES, CURE, and ROCK clustering algorithms.

Since the internal and external samples are from the same cluster considering the D5 dataset, it is possible to observe a different behavior compared to other datasets. For instance, we notice that the OPTICS and ROCK algorithms present the highest accuracy for several descriptors. ROCK presents the best result, with an accuracy of 83.1%.


In addition to the accuracies, performance analyses were performed using the other metrics (FOWLKES, DAVIES, and CALINSKI). Tables 14–18 present the results obtained for each descriptor, considering datasets D1 and D2 with CURE and datasets D3, D4, and D5 with ROCK, since these were the best combinations (descriptor and clustering algorithm pairs). The best results (descriptors) according to each metric are highlighted in bold.

Table 14. Results obtained by each descriptor and clustering method CURE, considering each of the metrics (accuracy, FOWLKES, DAVIES and CALINSKI), from the dataset D1.

Bold values highlight the best descriptors’ results according to each metric.

https://doi.org/10.1371/journal.pone.0285566.t014

Table 15. Results obtained by each descriptor and clustering method CURE, considering each of the metrics (accuracy, FOWLKES, DAVIES and CALINSKI), from the dataset D2.

Bold values highlight the best descriptors’ results according to each metric.

https://doi.org/10.1371/journal.pone.0285566.t015

Table 16. Results obtained by each descriptor and clustering method ROCK, considering each of the metrics (accuracy, FOWLKES, DAVIES and CALINSKI), from the dataset D3.

Bold values highlight the best descriptors’ results according to each metric.

https://doi.org/10.1371/journal.pone.0285566.t016

Table 17. Results obtained by each descriptor and clustering method ROCK, considering each of the metrics (accuracy, FOWLKES, DAVIES and CALINSKI), from the dataset D4.

Bold values highlight the best descriptors’ results according to each metric.

https://doi.org/10.1371/journal.pone.0285566.t017

Table 18. Results obtained by each descriptor and clustering method ROCK, considering each of the metrics (accuracy, FOWLKES, DAVIES and CALINSKI), from the dataset D5.

Bold values highlight the best descriptors’ results according to each metric.

https://doi.org/10.1371/journal.pone.0285566.t018

It is possible to observe that the FOWLKES metric presents results similar to the accuracy. When the DAVIES and CALINSKI metrics reach their best results (bold values), in most cases the corresponding accuracy values are inferior to the best accuracy obtained for the clustering. This may indicate that metrics based on cohesion and separation should not, in this context, be used to assess the correctness of the clusterings.

To achieve a better understanding of the results obtained with the several metrics, it is vital to observe the graphical representation of the clusterings. To do so, the PCA and t-SNE dimensionality reduction techniques were considered. Figs 3–8 present the clustering results highlighted in Table 17 (i.e., the best descriptor and clustering algorithm combinations according to each metric). Colors represent the obtained clusters, while symbols represent the true classes of the samples.

Fig 3. Clusterings’ visualization obtained by PCA from the D4 dataset, considering the best pair RCS_ROCK (descriptor_clustering algorithm), and according to the accuracy metric.

https://doi.org/10.1371/journal.pone.0285566.g003

Fig 4. Clusterings’ visualization obtained by PCA from the D4 dataset, considering the best pair TAMURA_ROCK (descriptor_clustering algorithm), and according to the DAVIES metric.

https://doi.org/10.1371/journal.pone.0285566.g004

Fig 5. Clusterings’ visualization obtained by PCA from the D4 dataset, considering the best pair MPO_ROCK (descriptor_clustering algorithm), and according to the CALINSKI metric.

https://doi.org/10.1371/journal.pone.0285566.g005

Fig 6. Clusterings’ visualization obtained by t-SNE from the D4 dataset, considering the best pair RCS_ROCK (descriptor_clustering algorithm), and according to the accuracy metric.

https://doi.org/10.1371/journal.pone.0285566.g006

Fig 7. Clusterings’ visualization obtained by t-SNE from the D4 dataset, considering the best pair TAMURA_ROCK (descriptor_clustering algorithm), and according to the DAVIES metric.

https://doi.org/10.1371/journal.pone.0285566.g007

Fig 8. Clusterings’ visualization obtained by t-SNE from the D4 dataset, considering the best pair MPO_ROCK (descriptor_clustering algorithm), and according to the CALINSKI metric.

https://doi.org/10.1371/journal.pone.0285566.g008

As presented in Table 4, the majority (83%) of the dataset's samples belong to the perfect external and internal classes. The clustering of these classes, represented by circle and diamond symbols, can be observed in Figs 3–8. In both the PCA and t-SNE visualizations, the remaining damage classes were not correctly clustered. In the clustering presented in Fig 3, even with an accuracy of 81%, most samples of the moisture, bug, and mechanical damage classes were assigned to the perfect clusters; that is, almost all of the correct assignments refer to the perfect classes.

When analyzing Fig 4, it is possible to conclude that the highlighted results for DAVIES metric (e.g., 0.35 value to the TAMURA_ROCK pair in Table 17), do not indicate a good clustering, since a cluster with almost all samples was generated. In this case, the accuracy obtained from the clustering was 44%. Therefore, DAVIES is not a good metric for this context.

We noticed this same behavior in Fig 5, where we can see that the results obtained from the CALINSKI metric (e.g., 6614.24 for the MPO_ROCK pair in Table 17), do not show a good clustering formation. The majority of samples from both perfect classes (i.e., XE and XI, represented by circle and diamond, respectively) that comprise almost 83% of the samples, were divided into five clusters, impacting the accuracy (48%). Therefore, CALINSKI is not a good metric for this context either.

5 Conclusion

In this paper, we performed an extensive experimental evaluation, considering different unsupervised learning techniques applied to soybean seed image datasets from the tetrazolium test. To do so, we used 5 image datasets covering different scenarios, damages, and/or their respective severity levels. To describe these images, we considered 18 different image descriptors. Moreover, we evaluated 9 clustering algorithms from different paradigms (i.e., partitional, hierarchical, and density-based) under 4 metrics (accuracy, FOWLKES, DAVIES, and CALINSKI). We also considered 2 dimensionality reduction techniques (PCA and t-SNE) to validate the analyses and visualize the clusters’ distributions.

Analyzing the obtained results, we observed similar behavior across the different datasets, in which the best accuracies followed the number of perfect samples (without damage) in each dataset. The class imbalance of the samples, including the excess of perfect samples, disturbs the clustering algorithms’ performance, especially in classes with few samples.

In general, we reached the best accuracy results for each descriptor with the AGNES, CURE, and ROCK clustering algorithms. FCTH, GCH, and RCS presented the highest accuracy values among all the descriptors evaluated, through the FCTH_CURE, GCH_CURE, and RCS_ROCK combinations.

Regarding the generated visualizations, it was possible to observe that the FOWLKES metric presented results similar to those of accuracy; this suggests that FOWLKES can be used as a substitute for accuracy in clustering analyses.
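FOWLKES behaves like accuracy because, unlike DAVIES and CALINSKI, it is an external metric: it compares pairs of samples against the ground-truth labels. A small sketch with scikit-learn's implementation (the label vectors below are illustrative):

```python
import numpy as np
from sklearn.metrics import fowlkes_mallows_score

true_labels = np.array([0] * 50 + [1] * 25 + [2] * 25)

# A clustering identical to the ground truth up to a renaming of cluster ids.
permuted = np.array([2] * 50 + [0] * 25 + [1] * 25)
# A degenerate clustering that lumps every sample into a single cluster.
degenerate = np.zeros(100, dtype=int)

print(fowlkes_mallows_score(true_labels, permuted))    # 1.0 (perfect agreement)
print(fowlkes_mallows_score(true_labels, degenerate))  # well below 1.0
```

Because the score is invariant to cluster renaming, it rewards clusterings that match the classes while penalizing the degenerate one-big-cluster solutions that mislead the internal metrics.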
