Assessment of clustering techniques to support the analyses of soybean seed vigor

Eduardo R. de Oliveira; Pedro H. Bugatti; Priscila T. M. Saito

doi:10.1371/journal.pone.0285566

Abstract

Soy is the main product of Brazilian agriculture and the fourth most cultivated bean globally. Since soy cultivation tends to increase and due to this large market, the guarantee of product quality is an indispensable factor for enterprises to stay competitive. Industries perform vigor tests to acquire information and evaluate the quality of soy planting. The tetrazolium test, for example, provides information about moisture damage, bedbugs, or mechanical damage. However, the verification of the damage reason and its severity are done by an analyst, one by one. Since this is massive and exhausting work, it is susceptible to mistakes. Proposals involving different supervised learning approaches, including active learning strategies, have already been used, and have brought significant results. Therefore, this paper analyzes the performance of non-supervised techniques for classifying soybeans. An extensive experimental evaluation was performed, considering (9) different clustering algorithms (partitional, hierarchical, and density-based) applied to 5 image datasets of soybean seeds submitted to the tetrazolium test, including different damages and/or their levels. To describe those images, we considered 18 extractors of traditional features. We also considered four metrics (accuracy, FOWLKES, DAVIES, and CALINSKI) and two-dimensionality reduction techniques (principal component analysis and t-distributed stochastic neighbor embedding) for validation. Results show that this paper presents essential contributions since it makes it possible to identify descriptors and clustering algorithms that shall be used as preprocessing in other learning processes, accelerating and improving the classification process of key agricultural problems.

Citation: de Oliveira ER, Bugatti PH, Saito PTM (2023) Assessment of clustering techniques to support the analyses of soybean seed vigor. PLoS ONE 18(8): e0285566. https://doi.org/10.1371/journal.pone.0285566

Editor: Usman Qamar, National University of Science and Technology, PAKISTAN

Received: January 14, 2023; Accepted: April 26, 2023; Published: August 25, 2023

Copyright: © 2023 de Oliveira et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All files are available from https://github.com/BioinfoCP/visual-features-soybean-vigor.

Funding: This work has been supported by National Council for Scientific and Technological Development - CNPq; Coordination for the Improvement of Higher Education Personnel - CAPES; Funda\c{c}\~{a}o Arauc\’{a}ria; SETI; UTFPR; and UFSCar. There was no additional external funding received for this study.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Soy is the fourth most cultivated bean globally and the main product in Brazilian agriculture. In 2021/22, Brazil estimates a production record of 142,009 million tons of soybeans. This value was obtained from the increase of 3.4% of the cultivated area compared to 2020/21, and 4.3% of increased productivity per hectare [1].

Those numbers would only be achieved with scientific advances and the availability of new technologies in the productive sector [2]. Among advances, mechanization, soil management, and prevention solutions for pests and diseases stand out.

Vigor tests are a way to obtain good information for consequent scientific advances. They are commonly used to find quality differences among bean lots during storage or after sowing, presenting the best and highlighting their planting conditions [3]. One of these tests is the tetrazolium test. It determines the vigor of soybean lots and offers information on the causes of quality reduction, identifying mechanical damages, moisture, and bug deterioration [4].

However, the analysis for damage classification is done visually by a specialist. This task may be unavailable considering the number of samples in a lot. The damage analysis of soybeans submitted to the tetrazolium test is a slow and tiresome method when performed manually since it is a visual task and requires hours of work [4].

Efforts in the literature [3, 5–8] show that supervised learning techniques for damage classification after the tetrazolium test could present good results. Therefore, to increase productivity, papers such as [5–8] propose active learning techniques (supervised classification) for this type of analysis. Despite having significant results, these papers focus on supervised learning techniques.

Non-supervised learning techniques are rarely explored in this context. Some active learning strategies that use grouping techniques as pre-processing consider, for example, the k-means algorithm. However, an extensive experimental evaluation is not done, comparing the performances of different clustering algorithms. Therefore, this paper investigates different non-supervised techniques to contribute to the research of unsupervised learning and active learning strategies applied to the classification of soybean damages.

In summary, this paper aims to perform an extensive experimental evaluation, considering different unsupervised techniques to improve the classification process of soybeans subjected to the tetrazolium test. Our contributions are fourfold: i) organization of soybean image groups; ii) extraction and selection of well-suited features that describe the images’ clustering; iii) extensive analyses and performance evaluation of different clustering techniques; iv) novel comparative analyses and validation of the obtained results considering different image datasets, clustering, descriptors techniques and different metrics to a vital agricultural problem.

2 Materials and methods

This section presents image description techniques and unsupervised learning methods used in this paper. We also present different methods for clustering evaluation and graphical/visualization results.

2.1 Descriptor learning

It is necessary to capture the different damage patterns that separate into specific classes to classify a set of soybean images. These descriptors consider different visual properties based on color, shape, and texture. Therefore, each descriptor extracts and generates a feature vector (numeric values), describing the images.

2.1.1 Feature extraction.

There are several feature extractors in the literature. Color-based extractors are widely used, especially for natural image classification. Some extractors consider the color histogram [9], which describes the global image content according to the pixel percentage of each color.

Other examples of color-based extractors are: Auto Color Correlogram (ACC) [10], Border/Interior Pixel Classification (BIC) [11], Color and Edge Directivity Descriptor (CEED) [12], Global Color Histogram (GCH) [13], Local Color Histogram (LHC) [14], Reference Color Similarity (RCS) [15], among others.

Texture properties can also represent images since they have information about luminosity, spatial distribution, and structural arrangement of the surface relating to the neighboring region. Gabor filters [16], Haralick descriptors [17], Local Binary Pattern (LBP) [18], Moments [19], First Order Statistics (MPO), First Order Statistics w/ color (MPOC), Pyramid Histogram of Oriented Gradients (PHOG) [20], Tamura [21] are examples of texture-based extractors.

Some extractors combine different types of features, like the Fuzzy Color and Texture Histogram (FCTH) [22] and the Join Composite Descriptor (JCD) (FCTH + CEDD) extractors that combine characteristics based on color and texture to describe images.

2.2 Machine learning

Machine learning is an area of artificial intelligence that proposes the development of systems capable of learning a specific pattern or behavior automatically using examples or experience. To do so, different supervised and unsupervised learning approaches can be considered.

2.2.1 Supervised classification.

In supervised classification, the learning process is performed through a set of (training) previously labeled data [23]. An oracle or specialist needs to annotate samples that best represent a class in this process. Then, the algorithm, after training, can present a better result [24]. Examples of supervised learning methods from the literature are: Random Forest (RF) [25], Support Vector Machines (SVM) [26], and Optimum-Path Forest (OPF) [27].

However, the supervised learning technique presents challenges: sample unbalancing in dataset classes, noises related to imperfect data or outliers, overfitting, underfitting, and missing values for some features. These challenges directly depend on the data used in learning training.

2.2.2 Unsupervised classification.

The unsupervised classification aims to find data clusters in multidimensional feature space according to specific similarity criteria. Using similarity relations, similar samples are grouped in the same cluster. Therefore, it is possible to describe the inherent characteristics of each cluster made from the grouping process, enabling a better understanding of the clustered data. Several clustering methods can be separated into categories: partitional, hierarchical, and density-based.

Partitional methods make k data partitions (clusters). This model uses an iterative relocation technique to improve partitioning from initial partitioning. The main method of this category is the well-known k-means. Proposed in [28], it is one of the most popular clustering algorithms. This method considers four phases: initialization, clustering, centroids’ movement, and optimization. In the initialization, random samples are defined as centroids (i.e., central cluster points). The clustering phase calculates the distance (e.g., Euclidean) between all samples and centroids. Next, each sample is assigned to the cluster where the centroid presents the shortest distance. After the clustering, the centroids’ movement phase calculates the sample mean of each cluster. Finally, samples closer to the mean are assigned as new centroids. Therefore, the optimization phase performs the clustering phases and centroids’ movement repeatedly until the central values of the clusters stabilize, reaching the final clusters.

Like k-means, the k-medoids algorithm [29] has the same phases. However, in the centroids’ movement phase, it is not the nearest sample that is defined as the centroid (i.e., ghost sample), but one of the samples (called medoid) of the clustering that minimizes the distance for all samples. Compared to k-means, this method is less susceptible to noise, since the outliers have little effect on the medoid choice. CLARANS [30] and FCM [31] are other examples of partitional algorithms.

Hierarchical methods are divided into agglomerative and divisive. These methods build (agglomerative) or split (divisive) a binary tree, in which nodes are samples of the data clustering, and connections are made based on data distance (dissimilarity). AGNES [32], CURE [33], and ROCK [34] are examples of hierarchical algorithms.

Density-based methods perform the clustering according to density established through input parameters. These parameters dictate the minimum density necessary to form a cluster. The main technique of this category is the Density-Based Spatial Clustering of Application with Noise (DBSCAN) [35]. This algorithm defines that a cluster has central points and border points. Central points are those that, given a radius r, have a defined (MinPts) minimum neighboring points (samples). Border points do not have minimum neighboring points, but one of the neighbors must be a central point. This technique finishes when no new points can be assigned to a cluster. This paper also considers the density-based algorithm named OPTICS [36].

2.2.3 Clustering evaluation.

Two metrics are considered to analyze clustering: those based on the true label of samples; and those not using this information. Although our datasets present the samples’ true label, we also considered metrics that do not consider such information to reach a broader and better analysis. These metrics are Fowlkes-Mallows (FOWLKES) [37], Davies-Bouldin (DAVIES) [38] and Calinski-Harabasz (CALINSKI) [39].

Another well-known metric is accuracy, and it gives the correctness samples percentage. It is calculated by the number of correctly clustered samples divided by the total number of samples. Clustering techniques are not committed to correctly defining labels, only to cluster samples. As a result, clusters may or may not adequately represent different classes. Therefore, accuracy calculation is based on a method that organizes the assigned labels and finds the highest accuracy possible [40]. To do so, it is used data from the confusion matrix, in which lines comprises true labels and columns are clusters. The goal is to find the column reorganization that generates the best accuracy result (i.e., an organization where the biggest values of each line are on the main diagonal of the matrix). Fig 1(a) and 1(b) present (pre and post) matrix reorganization examples, respectively.

Download:

Fig 1. Matrix reorganization to reach the best accuracy from a given cluster: (a) before and (b) after reorganization.

https://doi.org/10.1371/journal.pone.0285566.g001

Another metric that uses true label knowledge is FOWLKES [37], which is defined as the geometric mean of precision and recall. Eq 1 presents the metric formula, where TP corresponds to the number of true positives (samples correctly grouped with label C, that are part of C), and FN represents the number of false negatives (samples that should be grouped with C label, but were not). The range of the Fowlkes metric is between 0 and 1. The higher the value, the greater the similarity between the found clusters and the ground-truth classification. (1)

When the ground-truth labels of the samples are unknown, metrics are employed to evaluate the quality of obtained clustered. These metrics are generally based on cohesion and separation measures. Cohesion refers to the distance between samples of the same cluster, and separation indicates the distance between clusters [41].

The DAVIES metric [38] compares the distance between clusters with their sizes to indicate how good the separation between them is. Eq 2 formally defines the DAVIES metric. (2) where k represents the total number of clusters and R_ij comprises the comparison between two clusters. R_ij calculation is presented by Eq 3. (3) where s_i represents the mean distance of each sample of the group i to its centroid, and d_ij is the distance between i and j clusters’ centroids. Results range from 0 to 1, and lower values indicate better groupings.

CALINSKI is another metric that does not use true label knowledge. In this case, high values indicate better clustering definitions (i.e., dense and well-separated clusters). Its value is generated by the dispersion sum ratio between clusters and the internal dispersion of each cluster [39].

To a dataset E with size n_E and k clusters, the CALINSKI definition is presented by Eq 4, where tr(B_k) represents the dispersion matrix trace between clusters, and tr(W_k) is the cluster internal dispersion matrix trace (defined by Eqs 5 and 6). In this case, C_i represents the i cluster’s samples set, c_i is the i-th cluster center, c_e is the e center, and n_i is the number of samples from the i-th cluster. (4) (5) (6)

2.2.4 Dimensionality reduction.

Descriptors presented in Section 2.1.1 describe images in multiple-dimension vectors. Therefore, the graphic visualization of samples’ real values is impossible. In order to solve this problem, dimensionality reduction methods are applied to reduce samples in two or three dimensions.

Dimensionality reduction techniques (or feature reduction) are separated into feature selection methods and feature transformation methods [42]. Selection methods are based on evaluating which sample features are the most important for a better representation. Transformation methods use all sample features to calculate a new representation and are more suitable for graphic visualization.

Some feature reduction techniques based on transformation use linear and nonlinear functions. Principal Component Analysis (PCA) [43] is a technique that performs linear data mapping to a reduced dimension. Therefore, in the reduced dimension, data variation is maximized. Another one is the t-distributed Stochastic Neighbor Embedding (t-SNE) [44], a nonlinear technique that calculates the probability of features’ similarity and minimizes the divergence between the reduced and the original sample probability.

3 Proposed methodology

This section presents the proposed methodology for soybean image classification, describing the steps of each process. We present the description of the soybean image datasets (Section 3.1), experimental scenarios (Section 3.2), image descriptors, and algorithm configuration. Fig 2 illustrates the pipeline of our proposed methodology.

Download:

Fig 2. Pipeline of our proposed methodology.

https://doi.org/10.1371/journal.pone.0285566.g002

Soybean images can be organized in many ways, resulting in different datasets for experimental evaluation. We removed the images with stain and clipping problems in this stage. Section 3.1 presents more details about the organized datasets.

We extracted features for each organized dataset. In this stage, each soybean image is represented by a feature vector, considering descriptors presented in Section 2.1.

From the feature extraction process, clustering methods (presented in Section 2.2.2) are evaluated through different metrics (mentioned in Section 2.2.3). Therefore, it is possible to analyze the performance of each algorithm and its configurations in soybean classification.

3.1 Description of the datasets

After the tetrazolium test, analysts perform a visual process to identify and label each soybean portion (containing two internal parts and two external parts) into classes. The class is defined by the damage type and its severity level.

In [8], two image acquisitions were performed in a real soybean seed enterprise, where 1400 images were obtained from the first one and 1358 from the second one. Then, we generated different subsets from both image acquisitions in our work.

To do so, we consider each soybean portion labeled separately, given that we observed better performance from this scenario in [8]. This paper identifies the datasets as D₁-D₅. D₄ and D₅ ones are reorganizations of the D₃ dataset. Tables 1 to 5, respectively, present a description and quantity of samples in each dataset class.

Download:

Table 1. Classes’ descriptions, samples’ distribution for each class of dataset D₁, and their respective severity levels.

https://doi.org/10.1371/journal.pone.0285566.t001

Download:

Table 2. Classes’ descriptions, samples’ distribution for each class of dataset D₂, and their respective severity levels.

https://doi.org/10.1371/journal.pone.0285566.t002

Download:

Table 3. Classes’ descriptions, samples’ distribution for each class of dataset D₃, and their respective severity levels.

https://doi.org/10.1371/journal.pone.0285566.t003

Download:

Table 4. Classes’ descriptions and samples’ distribution for each class of dataset D₄.

https://doi.org/10.1371/journal.pone.0285566.t004

Download:

Table 5. Classes’ descriptions and samples’ distribution for each class of dataset D₅.

https://doi.org/10.1371/journal.pone.0285566.t005

D₁ and D₂ are, respectively, the first and second acquisitions, in which we organized samples with different damages and their severity level between 0 and 3. However, considering the D₁ dataset, images that presented stains and cutting problems were removed in the organization stage of the datasets since they can impact algorithms’ learning process and performance.

D₃ dataset is the aggregation between the D₁ and the D₂ ones. Since D₃ presents all possible samples, the D₄ and the D₅ datasets were created from it. D₄ is a reorganization of the D₃ dataset where we removed the severity levels (i.e., only damage types are considered). In the D₅ dataset, both severity levels and the differentiation of internal and external portions were removed. These new datasets were created to measure how the change in the samples’ number and classes impact experiments. Table 6 summarizes each dataset regarding their number of classes and samples.

Download:

Table 6. Number of classes and samples for each dataset.

https://doi.org/10.1371/journal.pone.0285566.t006

3.2 Description of the scenarios

In order to perform experiments, we defined different scenarios. First, we extract features from the datasets based on the image descriptors shown in Table 7 and described in Section 3.1.

Download:

Table 7. Image descriptors used to extract features from each soybean seed image, and their respective feature vector dimensionality.

https://doi.org/10.1371/journal.pone.0285566.t007

After the description process, different clustering algorithms were applied and analyzed through different metrics (see Sections 2.2.3 and 2.2.4). Table 8 shows the clustering methods considered in our work.

Download:

Table 8. Clustering methods used in our experiments.

https://doi.org/10.1371/journal.pone.0285566.t008

According to the literature standard, default values were used as input parameters for clustering methods. In these cases, the number of classes was used.

As presented in Section 2.2.2, density-based algorithms need the radius and the minimum of neighbor’s points as a parameter. We used the technique presented in [35] to set the radius value. We performed a parametric analysis to define the minimum of the neighbor’s points. To do so, multiple experiments were performed, changing the parameter value (i.e., grid-search). Then, the value was chosen in which the resulting clustering number was the closest to the total of classes.

To define the accuracy, we performed a clustering label organization method [40], as presented in Section 2.2.3.

4 Results and discussion

The accuracy results make it possible to perform the first performance analysis. Tables 9 to 13 present accuracy results obtained by each descriptor and clustering method from each dataset (D₁-D₅, respectively).

Download:

Table 9. Accuracy results obtained by each descriptor and clustering method from the dataset D₁.

The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). It is also presented the mean accuracies considering all descriptors and clustering methods.

https://doi.org/10.1371/journal.pone.0285566.t009

Download:

Table 10. Accuracy results obtained by each descriptor and clustering method from the dataset D₂.

The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). It is also presented the mean accuracies considering all descriptors and clustering methods.

https://doi.org/10.1371/journal.pone.0285566.t010

Download:

Table 11. Accuracy results obtained by each descriptor and clustering method from the dataset D₃.

The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). It is also presented the mean accuracies considering all descriptors and clustering methods.

https://doi.org/10.1371/journal.pone.0285566.t011

Download:

Table 12. Accuracy results obtained by each descriptor and clustering method from the dataset D₄.

The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). It is also presented the mean accuracies considering all descriptors and clustering methods.

https://doi.org/10.1371/journal.pone.0285566.t012

Download:

Table 13. Accuracy results obtained by each descriptor and clustering method from the dataset D₅.

The bold values correspond to the best clustering methods for each descriptor. The underlined values highlight the descriptor that reached the best performance for each clustering algorithm. The asterisk value represents the best combination (descriptor and clustering algorithm). It is also presented the mean accuracies considering all descriptors and clustering methods.

https://doi.org/10.1371/journal.pone.0285566.t013

The bold values highlight the clustering methods that obtained the best result for each descriptor. In addition, the underlined values emphasize the descriptor that reached the best performance for each clustering method. Finally, the asterisk value represents the best combination (descriptor and clustering method).

It is possible to observe that CURE and ROCK algorithms stand out for most datasets and descriptors, presenting the best results. The AGNES algorithm also reached good results. However, AGNES presented lower accuracy values compared to CURE and ROCK algorithms.

According to the obtained results, no descriptor can be considered the best for all clustering methods. However, in general, FCTH, GCH, and RCS descriptors presented the best accuracy values, especially when performed with AGNES, CURE, and ROCK clustering algorithms.

Since the internal and external samples are from the same cluster considering the D₅ dataset, it is possible to observe a different behavior compared to other datasets. For instance, we notice that the OPTICS and ROCK algorithms present the highest accuracy for several descriptors. ROCK presents the best result, with an accuracy of 83.1%.

Since the internal and external samples are from the same cluster considering the D₅ dataset, it is possible to observe a different behavior compared to other datasets. For instance, the OPTICS and ROCK algorithms present the highest accuracy for several descriptors. ROCK presents the best result, with an accuracy of 83.1%.

In addition to the accuracies, performance analyses were performed using other metrics (FOWLKES, DAVIES e CALINSKI). Tables 14–18 present the results obtained for each descriptor, considering the datasets D₁ and D₂ with CURE and the datasets D₃, D₄ and D₅ with ROCK, since these were the best combinations (descriptor and clustering algorithm pair). In bold are highlighted the best results (descriptors) obtained according to each metric.

Download:

Table 14. Results obtained by each descriptor and clustering method CURE, considering each of the metrics (accuracy, FOWLKES, DAVIES and CALINSKI), from the dataset D₁.