
Swin-HSSAM: A green coffee bean grading method by Swin transformer

  • Yujie Jiao,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Project administration, Resources, Software, Writing – original draft

    Affiliations Faculty of Mechanical and Electrical Engineering, Yunnan Agriculture University, Kunming, China, Key Laboratory for Crop Production and Smart Agriculture of Yunnan Province, Kunming, China, Yunnan Key Laboratory of Coffee, Kunming, China

  • Yuqing Zhao,

    Roles Funding acquisition, Investigation, Methodology, Project administration, Supervision, Writing – review & editing

    Affiliations Faculty of Mechanical and Electrical Engineering, Yunnan Agriculture University, Kunming, China, Key Laboratory for Crop Production and Smart Agriculture of Yunnan Province, Kunming, China, Yunnan Key Laboratory of Coffee, Kunming, China, Faculty of Transportation Engineering, Kunming University of Science and Technology, Kunming, China

  • Aoying Jia,

    Roles Formal analysis

    Affiliations Faculty of Mechanical and Electrical Engineering, Yunnan Agriculture University, Kunming, China, Key Laboratory for Crop Production and Smart Agriculture of Yunnan Province, Kunming, China

  • Tianyun Wang,

    Roles Data curation, Formal analysis, Supervision

    Affiliations Faculty of Mechanical and Electrical Engineering, Yunnan Agriculture University, Kunming, China, Key Laboratory for Crop Production and Smart Agriculture of Yunnan Province, Kunming, China

  • Jiashun Li,

    Roles Software

    Affiliation Key Laboratory for Crop Production and Smart Agriculture of Yunnan Province, Kunming, China

  • Kaiming Xiang,

    Roles Investigation

    Affiliations Faculty of Mechanical and Electrical Engineering, Yunnan Agriculture University, Kunming, China, Key Laboratory for Crop Production and Smart Agriculture of Yunnan Province, Kunming, China

  • Hangyu Deng,

    Roles Supervision

    Affiliations Faculty of Mechanical and Electrical Engineering, Yunnan Agriculture University, Kunming, China, Key Laboratory for Crop Production and Smart Agriculture of Yunnan Province, Kunming, China

  • Maochang He,

    Roles Supervision

    Affiliations Faculty of Mechanical and Electrical Engineering, Yunnan Agriculture University, Kunming, China, Key Laboratory for Crop Production and Smart Agriculture of Yunnan Province, Kunming, China

  • Rui Jiang,

    Roles Validation

    Affiliations Faculty of Mechanical and Electrical Engineering, Yunnan Agriculture University, Kunming, China, Key Laboratory for Crop Production and Smart Agriculture of Yunnan Province, Kunming, China

  • Yue Zhang

    Roles Supervision, Validation, Visualization

    674406584@qq.com

    Affiliations Key Laboratory for Crop Production and Smart Agriculture of Yunnan Province, Kunming, China, Yunnan Key Laboratory of Coffee, Kunming, China, College of Big Data, Yunnan Agricultural University, Kunming, China

Abstract

A novel shifted window (Swin) Transformer model for coffee bean grading, called Swin-HSSAM, is proposed to address the difficulty of accurately classifying green coffee beans and the low identification accuracy of existing methods. The model uses the Swin Transformer as its backbone network, fuses features from the second, third, and fourth stages using the high-level screening-feature pyramid network (HS-FPN) module, and incorporates a selective attention module (SAM) for discriminative power enhancement to strengthen the feature outputs before classification. Fusion Loss is employed as the loss function. Experimental results on a proprietary coffee bean dataset demonstrate that the Swin-HSSAM model achieved an average grading accuracy of 96.34% across the three grade levels and nine defect subcategories, outperforming the AlexNet, VGG16, ResNet50, MobileNet-v2, Vision Transformer (ViT), and CrossViT models by 3.86, 2.56, 0.44, 4.05, 5.36, and 5.40 percentage points, respectively. Evaluations on a public coffee bean dataset show that, compared with the same models, Swin-HSSAM improved the average grading accuracy by 1.01, 0.13, 4.75, 0.85, 0.73, and 0.27 percentage points, respectively. These results indicate that the Swin-HSSAM model not only achieves high grading accuracy but also generalizes well, providing a novel solution for the automated grading and identification of green coffee beans.

1. Introduction

Coffee, tea, and cocoa are the three most traded beverages globally. Since the introduction of coffee to China in 1892, Yunnan has emerged as the country’s largest coffee-producing region [1], with a coffee output of 113,600 tons, accounting for 98% of national production. The size and appearance of green coffee beans directly affect the economic value of coffee products; thus, grading beans by size and sorting out defective beans are critical steps in coffee production. As the coffee industry evolved, the international Specialty Coffee Association (SCA) and domestic coffee associations began developing more detailed professional guidelines for coffee sorting [2,3]. Coffee grading is typically performed using mechanical and manual methods against increasingly stringent standards, which adds complexity to the task. Mechanical sorting is limited in its ability to identify defective beans, while manual sorting, although more effective, is labor-intensive and inefficient [4,5]. Therefore, a method for the rapid, non-destructive sorting of green coffee beans based on their appearance needs to be developed.

Machine learning is now widely employed in grading tasks that are labor-intensive [6], time-consuming [7], or demand high precision [8]. Recent studies on coffee grading have primarily used conventional machine learning and deep learning techniques. Bazame et al. [9] employed the Darknet framework in conjunction with YOLOv3-tiny to identify the ripeness of coffee fruits, achieving an accuracy of 83%. Chou et al. [10] developed a deep learning-based defective bean inspection scheme, along with an automatic data augmentation method based on a generative adversarial network to enhance it, which demonstrated 80% accuracy in identifying defective beans. Chang et al. [5] introduced a novel deep learning approach for detecting eight types of defective coffee beans, obtaining an accuracy of 95.2%. Akbar et al. [11] employed color histograms and local binary patterns to extract features from green coffee beans, then applied random forest and k-nearest neighbors algorithms for grading, achieving accuracies of 87.87% and 80.47%, respectively. Zhao et al. [12] used machine vision to extract three types of coffee bean features and employed a support vector machine for defect grading, achieving an accuracy of 84.9%.

Deep learning is a specialized subset of machine learning. The prominence of transformer models in natural language processing is well established within the deep learning community [13]. The transformer’s attention mechanism captures non-local dependencies, yielding a broader receptive field [14]. Because of its robust performance, its applications span a wide array of fields, including speech recognition [15], object detection [16], video understanding [17], and multimodal learning [18]. Transformer architectures have been extended with image processing capabilities, notably the Vision Transformer (ViT) [19] and the shifted window (Swin) Transformer [20]. The exceptional performance of ViT and the Swin Transformer [21] in image recognition tasks demonstrates the potential of transformers in visual applications. Wang et al. [22] developed SwinGD, a Swin Transformer-based algorithm for identifying grape clusters, which achieved a mean average precision (mAP) of 94% at an Intersection over Union (IoU) threshold of 0.5. Si et al. [23] introduced a dual-branch model, DBCoST, which integrates convolutional neural network (CNN) and Swin Transformer branches and includes a feature fusion module consisting of a residual module and an enhanced Squeeze-and-Excitation network; the model achieves a recognition accuracy of 97.32% for diseased apple leaves.

This paper introduces a novel green coffee bean grading method named Swin-HSSAM, which uses the Swin Transformer network for feature extraction. The model was trained with the newly proposed Fusion Loss, and the extracted features were further enhanced through a high-level screening-feature pyramid network (HS-FPN) and a selective attention module (SAM). This approach enables the grading and identification of green coffee beans while simultaneously sorting out defective ones. The new model not only improves detection accuracy but also speeds up detection, providing a theoretical foundation for the development of future green coffee bean grading systems.

2. Materials and methods

2.1 Dataset and experimental environment

This study evaluated green Arabica coffee bean samples collected by Changmu Coffee Company (Pu’er City, Yunnan, China) and manually sorted by globally certified coffee quality appraisers. The samples totaled 10,378 beans. The classification standards used are outlined in Table 1 [3].

Table 1. Grading Standards for Green Coffee Beans.

https://doi.org/10.1371/journal.pone.0322198.t001

However, the classification of defective beans in this grading standard is rather vague and lacks detailed categorization. Therefore, to enhance the training outcomes and enrich the dataset, we have supplemented it with the types of defective beans delineated in the guidelines recommended by the SCA. The categories of defective beans as defined in the guidelines are shown in Fig 1.

Fig 1. Different types of defective coffee beans.

https://doi.org/10.1371/journal.pone.0322198.g001

The images were categorized into four types: three classes of normal beans totaling 7,894 images, and 2,484 images of substandard beans. Specifically, the normal beans comprise 2,460 images of first-grade beans, 2,601 images of second-grade beans, and 2,833 images of third-grade beans. The substandard beans include nine types: black, broken, cherry pods, fungus damage, husk, immature, parchment, shells, and sour. The distribution of the data is detailed in Table 2. A graphical representation of the dataset is shown in Fig 2.

Fig 2. Different grades of green coffee beans.

https://doi.org/10.1371/journal.pone.0322198.g002

Images were captured using a Huawei Enjoy 20 Pro from a height of 140 mm. The sample images were first analyzed in their original form and then converted to grayscale. Gaussian filtering was applied to remove noise, followed by gamma transformation to enhance detail and increase contrast. Image edges were sharpened using the Canny operator. Based on differences in grayscale characteristics, a fixed threshold range of 175–255 was selected to convert each image into binary format, effectively separating the target from the background. The images were then inverted to emphasize the feature areas. Three rounds of opening and closing operations were performed to fill gaps and eliminate edge burrs. Contour tracing was used to identify the largest contour of each green coffee bean image and capture its minimum bounding rectangle. To minimize errors, the width and height of the bounding rectangle were each expanded by 10 pixels, and the images were cropped to 224 × 224 pixels [13], as shown in Fig 3. The normal and defective beans across the three grades were then divided into training and test sets at an 8:2 ratio, yielding 8,304 images for training and 2,074 for testing.
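The segmentation-and-crop step above can be sketched in plain NumPy. This is a minimal illustration of the fixed-threshold binarization, largest-foreground bounding box, and 10-pixel margin expansion described in the text; the Canny, morphological, and inversion steps (done with OpenCV in practice) are omitted, and the function and variable names are illustrative:

```python
import numpy as np

def crop_bean(gray, thresh=175, margin=10, out_size=224):
    """Binarize a grayscale bean image with a fixed threshold, find the
    foreground bounding box, expand it by a fixed margin, and place the
    crop on a square out_size x out_size canvas."""
    mask = gray >= thresh                      # fixed-threshold binarization
    ys, xs = np.nonzero(mask)                  # foreground pixel coordinates
    if ys.size == 0:
        raise ValueError("no foreground found")
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    # expand the bounding rectangle by `margin` pixels on each side
    y0, x0 = max(0, y0 - margin), max(0, x0 - margin)
    y1 = min(gray.shape[0], y1 + margin)
    x1 = min(gray.shape[1], x1 + margin)
    crop = gray[y0:y1, x0:x1]
    # pad to a square canvas (no interpolation, unlike a true resize)
    canvas = np.zeros((out_size, out_size), dtype=gray.dtype)
    h, w = min(crop.shape[0], out_size), min(crop.shape[1], out_size)
    canvas[:h, :w] = crop[:h, :w]
    return canvas
```

In the actual pipeline the crop would be resized to 224 × 224 with interpolation rather than zero-padded; the padding here only keeps the sketch dependency-free.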

Fig 3. Image data preprocessing of Arabica green coffee beans.

https://doi.org/10.1371/journal.pone.0322198.g003

The experimental platform was configured as follows: the computer used an Intel Xeon Gold 6230R processor and an NVIDIA RTX A6000 graphics card. All models were built with the PyTorch 1.13.1 deep learning framework on the Jupyter platform, using Python 3.9. Model training employed the AdamW optimizer with an initial learning rate of 0.00005. A StepLR scheduler was used for learning rate adjustment, halving the rate every 50 epochs.
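The StepLR schedule described above multiplies the rate by a fixed factor at fixed intervals, so the effective rate at any epoch has a simple closed form. A dependency-free sketch (the function name is illustrative; in PyTorch this corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=50`, `gamma=0.5`):

```python
def steplr(initial_lr: float, epoch: int, step_size: int = 50, gamma: float = 0.5) -> float:
    """Effective learning rate under a StepLR schedule: the rate is
    multiplied by `gamma` once every `step_size` epochs."""
    return initial_lr * gamma ** (epoch // step_size)

# With the paper's settings (initial rate 5e-5, halved every 50 epochs):
# epochs 0-49 -> 5e-5, epochs 50-99 -> 2.5e-5, epochs 100-149 -> 1.25e-5
```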

2.2 Swin transformer

The Swin Transformer is a deep learning model based on the foundational principles of the Transformer. Compared with the ViT, the Swin Transformer is more efficient and accurate. As shown in Fig 4a and 4b, unlike the ViT, the Swin Transformer reduces computational complexity by partitioning the feature map into smaller windows. It constructs a hierarchical representation by starting with small patches and progressively merging adjacent patches layer by layer. However, while this method reduces the computational load, it also impedes the exchange of information between windows [24]. To address this issue, the Swin Transformer shifts the partitioned windows, as shown in Fig 4c. This shift enables information exchange between windows that would otherwise not communicate, without increasing computational complexity.
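The window shift can be implemented as a cyclic roll of the feature map before partitioning, as in the original Swin implementation. A minimal NumPy sketch (the window size and shift of half a window are illustrative defaults):

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping (win, win, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, C)

def shifted_windows(x, win):
    """Cyclically shift the map by win // 2 so that the next round of
    window attention mixes tokens across the previous window borders."""
    shifted = np.roll(x, shift=(-(win // 2), -(win // 2)), axis=(0, 1))
    return window_partition(shifted, win)

x = np.arange(8 * 8 * 1).reshape(8, 8, 1).astype(float)
plain = window_partition(x, 4)     # 4 windows of shape (4, 4, 1)
shift = shifted_windows(x, 4)      # same window count, contents straddle old borders
```

Because the shift is cyclic, the window count (and hence the attention cost) is unchanged; only the grouping of tokens into windows differs.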

Fig 4. Feature maps structure.

(a) Feature maps of the original ViT; (b) the hierarchical feature map approach; (c) the shifted-window approach for computing self-attention.

https://doi.org/10.1371/journal.pone.0322198.g004

Fig 5 shows the fundamental architecture of the Swin Transformer. Using the patch splitting module, the input RGB image was initially divided into the smallest unit tokens, i.e., non-overlapping patches. This module is composed of two components, namely, Patch Partition and Linear Embedding. Subsequently, through overlapping Swin Transformer blocks, the receptive field was expanded, model capacity was enhanced, and cross-window information fusion was achieved. Ultimately, this process completed the task of feature representation learning, which was used to generate patch tokens for creating hierarchical feature representations.

Fig 5. Shifted window Transformer architecture.

https://doi.org/10.1371/journal.pone.0322198.g005

2.3 Swin Transformer block

The Swin Transformer block is constructed around shifted windows. An even number of blocks forms one stage, and stages consisting of different numbers of blocks yield different Swin Transformer variants; the Swin-T model (a compact variant of the Swin Transformer) generally comprises 2, 2, 6, and 2 blocks per stage. Each block contains four components: a layer norm (LN) layer, a multihead self-attention (MSA) module, residual connections, and a two-layer multilayer perceptron (MLP). Odd-numbered blocks use the window-based MSA (W-MSA) module [25], while even-numbered blocks employ the shifted-window-based MSA (SW-MSA) module. Under this organizational scheme, feature maps are calculated using the formulas below:

$\hat{z}^{l} = \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1}$  (1)

$z^{l} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l}$  (2)

$\hat{z}^{l+1} = \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l}$  (3)

$z^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1}$  (4)

In the block structure (Fig 6), $\hat{z}^{l}$ and $\hat{z}^{l+1}$ denote the outputs of the $l$th W-MSA module and the subsequent SW-MSA module, respectively, while $z^{l}$ denotes the output of the $l$th MLP module.
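The block's two residual sub-layers (Eqs 1–4) can be sketched in PyTorch. This is a structural illustration only: `nn.MultiheadAttention` over the window tokens stands in for the actual W-MSA/SW-MSA implementations, and all module names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    """Pre-norm residual structure of Eqs (1)-(4): LN -> attention -> add,
    then LN -> MLP -> add. Real W-MSA/SW-MSA restricts attention to
    (shifted) local windows; here full attention stands in for it."""
    def __init__(self, dim=96, heads=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z):                       # z: (batch, tokens, dim)
        h = self.norm1(z)
        z = self.attn(h, h, h)[0] + z           # Eq (1)/(3): (S)W-MSA + residual
        z = self.mlp(self.norm2(z)) + z         # Eq (2)/(4): MLP + residual
        return z

tokens = torch.randn(1, 49, 96)                 # e.g. one 7x7 window of 96-dim tokens
out = SwinBlockSketch()(tokens)
```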

Fig 6. Swin Transformer block.

LN, layer norm layer; W-MSA, window-based multi-head self-attention module; SW-MSA, shifted-window-based multi-head self-attention module; MLP, multilayer perceptron. $\hat{z}^{l}$, output of the $l$th (S)W-MSA module; $z^{l}$, output of the $l$th MLP module.

https://doi.org/10.1371/journal.pone.0322198.g006

2.4 High-level screening-feature fusion pyramid

Identifying objects in computer vision remains difficult due to their scale variation. Feature pyramids, which are built on image pyramids, form a fundamental solution to this challenge [26]. They adapt to object scale changes by switching between different levels within the pyramid, while the scale of the pyramid itself remains unchanged. The advent of deep learning has enabled the integration of pyramid structures into CNNs to establish multi-scale representations of images. Lin et al. [27] introduced a novel architecture known as FPN, which combines features from different layers to form feature maps of varying scales within the feature pyramid. Using FPN, a greater integration of shallow feature map information is achieved, enhancing small object detection accuracy and providing more robust semantic information.

Chen et al. [28] developed the HS-FPN (high-level screening-feature pyramid network) for merging multi-scale features, especially for fine-grained targets such as leukocytes. This design enabled their model to capture a more comprehensive representation of leukocyte features. The structure of the HS-FPN, depicted in Fig 7, consists of two primary components: a feature selection module and a feature fusion module.

Fig 7. Architecture of the high-level screening-feature pyramid network.

https://doi.org/10.1371/journal.pone.0322198.g007

Initially, feature maps of varying scales undergo a selection process within the feature selection module. Subsequently, through the selective feature fusion (SFF) mechanism, high- and low-level information within these feature maps is synergistically integrated. This fusion generates features rich in semantic content, which is especially useful for detecting small-scale targets, thereby enhancing the model’s detection capabilities.

2.5 SAM for discriminative power enhancement

The SAM (selective attention module) comprises three primary components: a control depthwise separable convolution (CDSC) module [29], a fully connected layer, and an exponential channel component. The CDSC module consists of a depthwise convolution followed by a pointwise convolution; the number of output feature maps in the depthwise convolution equals the number of input channels. Using depthwise convolution instead of standard convolution saves computational resources, as shown in Eq 5 [30]. The fundamental architectures of the SAM and CDSC modules are depicted in Figs 8 and 9, respectively.

Fig 8. SAM discriminability enhancement module structure diagram.

https://doi.org/10.1371/journal.pone.0322198.g008

Fig 9. Detailed structure diagram of the control depthwise separable convolution module.

https://doi.org/10.1371/journal.pone.0322198.g009

$\dfrac{K^{2} \cdot C_{in} + C_{in} \cdot C_{out}}{K^{2} \cdot C_{in} \cdot C_{out}} = \dfrac{1}{C_{out}} + \dfrac{1}{K^{2}}$  (5)

In Equation 5, Cin denotes the number of channels in the input feature map; Cout represents the number of channels in the output feature map; and K indicates the size of the convolution kernel.

Compared with traditional convolution, the CDSC reduces computational cost by decreasing the number of parameters to roughly a factor of 1/K² of the standard count. The depthwise results are then passed through the GELU function and normalized before being fed into the pointwise convolution, which adeptly combines feature maps to generate new ones. The GELU function, expressed in Eq 6, applies a smooth, non-linear transformation that suppresses non-essential activations while augmenting the network’s capability to represent intricate data distributions and patterns.

$\text{GELU}(x) = x\,\Phi(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^{3}\right)\right]\right)$  (6)
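The parameter savings of Eq 5 can be checked numerically: a depthwise separable convolution uses K·K·Cin depthwise weights plus Cin·Cout pointwise weights, versus K·K·Cin·Cout for a standard convolution. A sketch (bias terms ignored; the channel counts and kernel size are illustrative):

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution."""
    return k * k * c_in * c_out

def dsc_params(c_in, c_out, k):
    """Weight count of a depthwise separable convolution:
    depthwise (k*k weights per input channel) + pointwise (1x1)."""
    return k * k * c_in + c_in * c_out

c_in, c_out, k = 96, 96, 3
ratio = dsc_params(c_in, c_out, k) / conv_params(c_in, c_out, k)
# ratio equals 1/c_out + 1/k**2, roughly 1/k**2 when c_out is large
```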

2.6 Fusion loss

Fusion Loss combines Focal Loss and Cross-Entropy Loss. For each sample, it computes the Focal Loss and the Cross-Entropy Loss, weights them by weight_focal and weight_ce, respectively, and sums the weighted terms to obtain the composite loss value. Fusion Loss notably improves model performance by reducing loss while maintaining recognition accuracy [31].

Fusion Loss penalizes incorrectly predicted samples, enhancing the model’s accuracy in learning target categories. This helps address issues of class imbalance by swiftly focusing on samples that are difficult to distinguish. The formula for Fusion Loss is as follows:

$L_{\text{Fusion}} = w_{f}\left[-\alpha_{t}\left(1-p_{t}\right)^{\gamma}\log\left(p_{t}\right)\right] + w_{c}\left[-\sum_{i} q_{i}\log\left(p_{i}\right)\right]$  (7)

where $p_{t}$ represents the predicted probability; $\alpha_{t}$ is the adjustment factor balancing the importance of positive and negative samples; $\gamma$ is the focusing parameter that diminishes the weight of easily classified samples; $q_{i}$ is a binary indicator that equals 1 when the sample belongs to class $i$ and 0 otherwise; and $w_{f}$ and $w_{c}$ are the values of weight_focal and weight_ce, respectively.
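A per-sample sketch of Fusion Loss in plain Python, combining the focal and cross-entropy terms with the weights wf and wc. The default alpha, gamma, and weight values are illustrative assumptions (the 0.7/0.3 weighting mirrors the configuration discussed later), and a real implementation would operate on batched tensors:

```python
import math

def fusion_loss(probs, target, w_ce=0.7, w_focal=0.3, alpha=0.25, gamma=2.0):
    """Weighted sum of cross-entropy and focal loss for one sample.
    probs: predicted class probabilities; target: true class index."""
    p_t = probs[target]
    ce = -math.log(p_t)                                    # cross-entropy term
    focal = -alpha * (1.0 - p_t) ** gamma * math.log(p_t)  # focal term
    return w_ce * ce + w_focal * focal

# A confidently correct prediction yields a small loss, while a hard,
# poorly classified sample is up-weighted by the focal term.
easy = fusion_loss([0.9, 0.05, 0.05], 0)
hard = fusion_loss([0.2, 0.4, 0.4], 0)
```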

2.7 Swin-HSSAM

Using the Swin-T model as the foundational architecture significantly reduces computational demands during training. The initial input size for green coffee bean images was 224 × 224 pixels, which, following the patch partition operation, was reduced to 56 × 56. After transformation through four stages, the output dimension was further reduced to 7 × 7. The stages produce feature maps at four distinct scales, S2, S3, S4, and S5, with the spatial size of each feature map diminishing with increasing depth. The feature maps S3, S4, and S5 were transformed into P3, P4, and P5 through a channel attention mechanism and a 1 × 1 convolution. Outputs from the feature selection phase were directed into the feature fusion section, where each feature map was upsampled to align spatial resolutions. In the feature fusion section, a top-down fusion strategy was implemented, culminating in the output N3 from the SFF feature fusion module. This output was subsequently enhanced by the SAM module and linked to a fully connected layer for classification. The model employed Fusion Loss as its loss function.
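The fusion path above can be sketched at the shape level in NumPy: channel-attention weighting of each stage’s map, a 1 × 1 projection to a shared channel width, nearest-neighbour upsampling, and top-down addition. The sigmoid-of-mean attention, random weights, and stage sizes are illustrative simplifications of the actual HS-FPN, not its exact operators:

```python
import numpy as np

def channel_attention(x):
    """Weight each channel of a (C, H, W) map by a sigmoid of its global average."""
    w = 1.0 / (1.0 + np.exp(-x.mean(axis=(1, 2))))
    return x * w[:, None, None]

def project(x, weight):
    """1 x 1 convolution expressed as a matrix product over the channel axis."""
    c, h, w = x.shape
    return (weight @ x.reshape(c, -1)).reshape(-1, h, w)

def upsample2x(x):
    """Nearest-neighbour 2x spatial upsampling."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
s4 = rng.standard_normal((192, 14, 14))     # deeper stage, e.g. S4
s3 = rng.standard_normal((96, 28, 28))      # shallower stage, e.g. S3
w4 = rng.standard_normal((96, 192)) * 0.1   # 1x1 projection to 96 channels
p4 = project(channel_attention(s4), w4)
n3 = channel_attention(s3) + upsample2x(p4)  # top-down fused map, (96, 28, 28)
```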

Shallower features identify smaller objects more effectively. Therefore, the Swin-HSSAM model enhances Swin-T by incorporating the HS-FPN structure, which effectively integrates shallow features into the final output. Adding the SAM structure before classification further strengthens the learning of target features and local feature information within each channel. The receptive field expands exponentially as the layers deepen, significantly improving the model’s accuracy in identifying small targets such as coffee beans. Fig 10 is a schematic diagram of the Swin-HSSAM structure.

3. Experimental results and analysis

3.1 Evaluation criteria

The performance of the various models was evaluated using metrics such as mAP, accuracy, F1-score, and speed (frames per second, FPS). True positives (TP), true negatives (TN), false positives (FP), false negatives (FN), recall (R), and precision (P) were used to define these criteria (Eqs 8–10). The models were assessed against each other by comparing these metrics, and losses were calculated using Fusion Loss.

$P = \dfrac{TP}{TP+FP}, \qquad R = \dfrac{TP}{TP+FN}$  (8)

$\text{Accuracy} = \dfrac{TP+TN}{TP+TN+FP+FN}$  (9)

$F1 = \dfrac{2PR}{P+R}$  (10)
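The confusion-matrix metrics above translate directly into code; a small sketch with illustrative counts:

```python
def metrics(tp, tn, fp, fn):
    """Precision, recall, accuracy, and F1-score from confusion-matrix counts."""
    p = tp / (tp + fp)                       # precision
    r = tp / (tp + fn)                       # recall
    acc = (tp + tn) / (tp + tn + fp + fn)    # accuracy
    f1 = 2 * p * r / (p + r)                 # F1-score (harmonic mean of P and R)
    return p, r, acc, f1

p, r, acc, f1 = metrics(tp=90, tn=95, fp=5, fn=10)
```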

3.2 Swin-HSSAM modeling tests

Compared with the baseline Swin-T model, the Swin-HSSAM model improved the mAP metric by 1.31 percentage points. As shown in Table 3, both average accuracy and F1 score improved by 3.5 percentage points. Incorporating HS-FPN into the baseline model enables the capture of more detailed features from green coffee beans, while the feature enhancement provided by SAM enriches semantic content, leading to improved detection accuracy. Furthermore, the increase in F1 score indicates that the Swin-HSSAM model produces fewer false positives and missed detections than the baseline. Table 4 details the performance of the Swin-HSSAM model on the nine defective bean classes. As shown in Table 4, the detection accuracy for six types of defective beans exceeded 95%; Immature and Fungus damage achieved accuracies above 90%, while Parchment had the lowest detection accuracy at 86.5%, likely due to its limited sample size and minimal visual distinction from normal beans.

Table 3. Complexities between the Swin-HSSAM and Swin-T models.

https://doi.org/10.1371/journal.pone.0322198.t003

Table 4. Comparison between the different types of defective coffee beans (Swin-HSSAM).

https://doi.org/10.1371/journal.pone.0322198.t004

Figs 11 and 12 illustrate the training and validation loss curves, highlighting the convergence behavior of the model. As shown, the Swin-HSSAM model achieves significantly lower loss and faster convergence than the baseline. This improvement demonstrates that introducing Fusion Loss addresses the class imbalance among defective green coffee beans and the challenging classification of normal beans, thereby enhancing model accuracy.

Fig 11. Loss change curve of the Swin-HSSAM and Swin-T models (training).

https://doi.org/10.1371/journal.pone.0322198.g011

Fig 12. Loss change curve of the Swin-HSSAM and Swin-T models (validation).

https://doi.org/10.1371/journal.pone.0322198.g012

3.3 Impact of different Fusion Loss weights on model performance

Combining Fusion Loss with the Swin-T model yielded excellent experimental results and remarkable identification accuracy. This section explores the impact on model performance of different fusion weights for combining Focal Loss and Cross-Entropy Loss. Nine distinct weight combinations were tested while all other variables remained constant. The results in Table 5 indicate that the best performance was achieved with the Cross-Entropy Loss and Focal Loss weights set to 0.7 and 0.3, respectively, yielding the highest mAP, accuracy, and F1 scores of 97.82%, 93.74%, and 93.74%. This weighting indicates that Cross-Entropy Loss plays the dominant role while Focal Loss assumes a secondary position: training prioritizes minimizing the discrepancy between the predicted and true probability distributions while still attending to hard-to-classify samples, thereby enhancing the model’s overall performance.
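The weight search above amounts to a sweep over complementary (weight_ce, weight_focal) pairs. A sketch in which the evaluation function is a stand-in for a full train-and-validate run (here a placeholder score that peaks at weight_ce = 0.7, mirroring the outcome reported in Table 5):

```python
def sweep(evaluate, steps=9):
    """Try weight_ce in {0.1, ..., 0.9} with weight_focal = 1 - weight_ce,
    and return the best-scoring pair according to `evaluate`."""
    combos = [(round(i / 10, 1), round(1 - i / 10, 1)) for i in range(1, steps + 1)]
    return max(combos, key=lambda pair: evaluate(*pair))

# Placeholder validation score; a real run would train the model per combo.
best = sweep(lambda wc, wf: -abs(wc - 0.7))
```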

Table 5. Comparison of complexity between different loss weightings.

https://doi.org/10.1371/journal.pone.0322198.t005

3.4 Impact of different HS-FPN structure on model performance

This section describes a comparative experiment that adjusted the standard HS-FPN structure to determine the most effective stages for information fusion in green coffee bean grading. The experimental results in Table 6 indicate that the structure combining S3, S4, and S5 achieved the highest mAP, accuracy, and F1 scores at 98.51%, 96.34%, and 96.35%, respectively. Loss curves for the different HS-FPN structures are presented in Fig 13.

Table 6. Comparison of complexity between different HS-FPN structures.

https://doi.org/10.1371/journal.pone.0322198.t006

Fig 13. Loss change curves of different high-level screening-feature pyramid network structures.

https://doi.org/10.1371/journal.pone.0322198.g013

The results indicate that deeper information layers are more advantageous for recognizing small targets such as coffee beans. Blank areas in the global information likely hindered the identification of coffee bean grades, thereby decreasing accuracy. Consequently, we chose an HS-FPN structure that fuses information from S3, S4, and S5.

3.5 Ablation study of the Swin-HSSAM model

An ablation study was performed to evaluate the influence of the HS-FPN module, SAM module, and Fusion Loss on the model’s stability and accuracy. Table 7 indicates that using only the HS-FPN with the Swin-T grading model led to a 0.31 percentage-point increase in mAP and a 0.65 percentage-point improvement in accuracy. Fusion Loss alone produced a 0.62 percentage-point increase in mAP and a 0.90 percentage-point increase in accuracy. SAM alone increased the mAP by 0.65 percentage points and accuracy by 1.67 percentage points. Combining HS-FPN and Fusion Loss resulted in a 0.65 percentage-point increase in mAP and a 1.00 percentage-point increase in accuracy. Combining HS-FPN and SAM yielded a 1.17 percentage-point increase in mAP and a 2.30 percentage-point improvement in accuracy. Combining SAM and Fusion Loss enhanced the mAP by 0.87 percentage points and accuracy by 2.59 percentage points. Lastly, combining HS-FPN, SAM, and Fusion Loss increased the mAP by 1.31 percentage points and accuracy by 3.50 percentage points. These results demonstrate that introducing the HS-FPN, SAM, and Fusion Loss modules together enhanced both the accuracy and the stability of the model.

3.6 Proposed method versus other models

To evaluate the performance of the proposed Swin-HSSAM method, a series of comparative experiments were conducted against AlexNet [32], VGG16 [33], ResNet50 [34,35], MobileNet-v2 [36], Vision Transformer (ViT) [19], and CrossViT [37]. Swin-HSSAM achieved an accuracy of 96.34% and an mAP of 98.51%, surpassing the AlexNet, VGG16, ResNet50, MobileNet-v2, ViT, and CrossViT models by 3.86, 2.56, 0.44, 4.05, 5.36, and 5.40 percentage points in accuracy, respectively, and by 1.70, 0.47, 0.58, 0.92, 2.31, and 1.88 percentage points in mAP, respectively. The variations in loss and accuracy are shown in Figs 14 and 15, and Table 8 compares the proposed method with the other models. The F1 score, which combines precision and recall, ranges from 0 to 1, with 1 representing optimal performance and 0 the poorest. The F1 scores for first-grade and defective beans are higher than those for second- and third-grade beans. This discrepancy may be attributed to the minimal size difference between second- and third-grade beans, which causes more frequent misjudgments and confusion, thereby reducing the F1 scores. Compared with more complex deep learning models such as ViT and CrossViT, the Swin-HSSAM model’s parameter count and FLOPs did not increase excessively, preserving recognition speed while maintaining high accuracy and F1 values. Fig 14 shows that the Swin-HSSAM model achieves faster convergence and lower loss than the other models, and Fig 15 shows that it attains the highest accuracy among all models evaluated.

Table 8. Comparison of complexity between different models.

https://doi.org/10.1371/journal.pone.0322198.t008

Fig 14. Loss change curves of different models.

https://doi.org/10.1371/journal.pone.0322198.g014

Fig 15. Accuracy change curves of different models.

https://doi.org/10.1371/journal.pone.0322198.g015

The experimental results demonstrate that the proposed Swin-HSSAM model not only performed well but also maintained a rapid identification speed compared with other models.

3.7 Proposed method versus other models on a public dataset

The Swin-HSSAM model was compared with the AlexNet, ResNet50, VGG16, MobileNet-v2, ViT, and CrossViT models on a public dataset (sourced from Kaggle at https://www.kaggle.com/datasets/gpiosenka/coffee-bean-dataset-resized-224-x-224); the results are presented in Table 9. The Swin-HSSAM model achieved the highest scores, with an mAP of 97.83%, an F1 score of 96.48%, and an average accuracy of 96.50%, making it the top-performing model on this dataset. These comparative tests on a public dataset demonstrate the model’s strong generalization capability.

Table 9. Comparison of complexity between different models on new dataset.

https://doi.org/10.1371/journal.pone.0322198.t009

4. Discussion

The above results demonstrate that the Swin-T network exhibits high accuracy in small target detection, and that the fused HS-FPN and SAM modules are highly efficient at extracting features. This enhanced performance suits multiple types of small target detection tasks. To reduce loss while addressing hard-to-distinguish samples (e.g., sour versus immature beans among the defective classes), we further incorporated Fusion Loss, ultimately yielding the Swin-HSSAM model.

The Swin-HSSAM model is superior to the mainstream models AlexNet, VGG16, ResNet50, MobileNet-v2, ViT, and CrossViT. Incorporating the HS-FPN and SAM modules and simultaneously introducing Fusion Loss considerably improved model accuracy and recall, implying that the model fully extracts the rich features of green coffee beans and effectively labels and classifies them into different categories. Green coffee beans present complex situations. First, the size difference between the three classes of normal beans is minimal; in particular, the size difference between second- and third-grade beans is even smaller. Second, the features of different defective beans are diverse. The mainstream models differ in their detection performance on each category; however, the Swin-HSSAM model outperforms all of them on every category. This demonstrates that incorporating the HS-FPN, SAM module, and Fusion Loss enables the accurate capture of detailed information at different scales, enriches the semantic information of features, and improves the accuracy and comprehensiveness of image understanding. Notably, the Swin-HSSAM model has a higher mAP and a considerably improved average accuracy compared with ResNet. Furthermore, the model achieves relatively high F1 values in classifying first-, second-, and third-grade green beans.

Accurately identifying different defective beans is challenging in green coffee bean classification because their characteristics are complex. For example, parchment beans are similar to normal beans in shape and color, while sour beans resemble fungus-damaged beans in color. Thus, distinguishing the color and texture characteristics of black, sour, and moldy beans of varying severity is difficult. The Swin-HSSAM model showed superiority in classifying the nine types of defective beans. Table 10 presents the classification results: the mAP values for black, broken, cherry pod, husk, shell, and sour beans were above 95%, those for fungus-damaged and immature beans were above 90%, and only parchment beans, at 88.86%, fell below 90%. These results demonstrate the advantage of the Swin-HSSAM model in extracting feature information (color, edge, and texture).

thumbnail
Table 10. Comparison of complexities between different models based on nine defects.

https://doi.org/10.1371/journal.pone.0322198.t010
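The per-category mAP values above are built from average precision over ranked detections. A minimal sketch of that computation (the ranking below is hypothetical, not data from the paper) is:

```python
def average_precision(ranked_correct, num_positives):
    """Average precision for one class.

    ranked_correct[i] is True if the i-th detection (sorted by descending
    confidence) matches a ground-truth instance; num_positives is the total
    number of ground-truth instances of the class.
    """
    hits, ap = 0, 0.0
    for rank, correct in enumerate(ranked_correct, start=1):
        if correct:
            hits += 1
            ap += hits / rank  # precision at this recall point
    return ap / num_positives

# Hypothetical ranking for one defect class: 3 of 4 instances found,
# with one false positive ranked third.
print(round(average_precision([True, True, False, True], num_positives=4), 4))  # 0.6875
```

The mAP reported per category is then the mean of these per-class AP values, which is why a single hard class such as parchment can pull the overall figure down.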

Comparisons on the public dataset demonstrated the superior generalization ability of the Swin-HSSAM model. The images in the public dataset differ from those in the self-constructed (proprietary) dataset in illumination, size, and resolution. Furthermore, the coffee beans in this dataset were roasted before classification. Nevertheless, the Swin-HSSAM model still obtained higher mAP values, average accuracy, and F1 scores. Thus, the Swin-HSSAM model generalizes well, exhibits better stability against interference, and can adapt to a wider range of new and unknown situations.

The Swin-HSSAM model has a moderate number of parameters. Specifically, it has more parameters than the lightweight MobileNet-v2, slightly more than ResNet50, and fewer than the other models; this can be attributed to the hierarchical structure and adaptive feature selection mechanism of HS-FPN. The inference speed of the Swin-HSSAM model is slightly higher than those of MobileNet-v2 and CrossViT, comparable to that of ViT, and lower than those of the other comparison models, such as ResNet50. Notably, the inference speed of the Swin-HSSAM model is determined mainly by the model configuration and hardware, as current GPU devices are not specifically optimized for Transformer-class models. Furthermore, advanced techniques such as parallel computing and hardware acceleration could further improve the inference speed of the model. As GPU devices specifically optimized for Transformer-class models become available, the performance of the Swin-HSSAM model will improve substantially, making it suitable for resource-constrained environments.
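Inference speed comparisons of this kind are typically reported as average latency over repeated forward passes after a warm-up phase. A generic timing sketch (the workload below is a stand-in, not the actual model) is:

```python
import time

def measure_latency(fn, warmup=5, runs=50):
    """Average wall-clock latency of fn() in milliseconds.

    Warm-up iterations are discarded so one-time costs (caching, JIT
    compilation, GPU kernel setup) do not skew the average.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs * 1000.0

# Stand-in workload instead of a real forward pass.
latency_ms = measure_latency(lambda: sum(i * i for i in range(10_000)))
print(latency_ms > 0)  # True
```

Averaging over many runs also smooths scheduler jitter, which matters when comparing models whose latencies differ by only a few milliseconds.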

Future research should further explore the model’s capacity to accurately identify a wider array of defects, ideally covering all defect categories enumerated by the SCAA. The Swin-HSSAM model currently categorizes green coffee beans into four classes; however, the grading of coffee beans in commercial transactions could be segmented into finer categories. Future studies could refine the model to meet comprehensive grading requirements for all green coffee beans. Furthermore, the Swin-HSSAM model could be applied to evaluating other fruits, such as cherries and tomatoes.

5. Conclusions

This paper described the development of the novel Swin-HSSAM method, based on the Swin Transformer, for grading green coffee beans. This approach leverages the strengths of the Transformer network, SAM module, and HS-FPN module and is equipped with pretrained weights to extract image features. These features are fused via the HS-FPN and enhanced by the SAM before being fed into a classification head to predict labels. During the experimental phase, the proposed Swin-HSSAM method generated impressive results compared with combinations of the Swin Transformer and other classifiers, achieving mAP, accuracy, and F1 scores of 98.51%, 96.34%, and 96.35%, respectively. Extensive experiments on green coffee bean grading demonstrated the exceptional performance of the proposed method, with the Swin Transformer serving as an efficient feature extractor. The proposed method was proven effective and shows broad potential for identifying other fruits and vegetables, as well as promising applications in agricultural product sorting.
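The stage ordering described above (backbone features → HS-FPN fusion → SAM enhancement → classification head) can be sketched abstractly. Every function below is a placeholder standing in for the real network component; the names, shapes, and labels are illustrative assumptions, not the paper's implementation:

```python
# Abstract sketch of the Swin-HSSAM inference pipeline.
def swin_backbone(image):
    # Pretrained Swin Transformer: image -> multi-scale feature maps
    # (here reduced to numbers at three scales for illustration).
    return {"stage1": image * 1, "stage2": image * 2, "stage3": image * 4}

def hs_fpn_fuse(features):
    # HS-FPN: select and fuse features across scales into one representation.
    return sum(features.values())

def sam_enhance(fused):
    # SAM: re-weight the fused features by spatial attention.
    return fused * 0.9  # placeholder attention weight

def classify(enhanced, labels=("grade 1", "grade 2", "grade 3", "defective")):
    # Classification head: map enhanced features to a grade label.
    return labels[int(enhanced) % len(labels)]

grade = classify(sam_enhance(hs_fpn_fuse(swin_backbone(3))))
print(grade)
```

The point of the sketch is the data flow, not the arithmetic: each stage consumes the previous stage's output exactly in the order the method's architecture prescribes.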

References

  1. Huang J, Li W, Xia B, Hu F. Strategies on how to promote the development of coffee industry with high quality in Yunnan Province [in Chinese]. Trop Agric Sci Technol. 2022;45(03):21-29. https://doi.org/10.16005/j.cnki.tast.2022.03.005
  2. Specialty Coffee Association of America [Internet]. 2016 May [cited 2024 Nov 4]. Available from: http://www.scaa.org
  3. Yunnan Provincial Administration for Market Regulation. Small-Berry Coffee Part VII: Green Bean Grading. DB53/T 149.7—2023 [Standard, in Chinese]. 2023 Jul 10.
  4. Caporaso N, Whitworth MB, Cui C, Fisk ID. Variability of single bean coffee volatile compounds of Arabica and robusta roasted coffees analysed by SPME-GC-MS. Food Res Int. 2018;108:628–40. pmid:29735099
  5. Chang S-J, Huang C-Y. Deep Learning Model for the Inspection of Coffee Bean Defects. Applied Sciences. 2021;11(17):8226.
  6. Afonso M, Fonteijn H, Fiorentin FS, Lensink D, Mooij M, Faber N, et al. Tomato Fruit Detection and Counting in Greenhouses Using Deep Learning. Front Plant Sci. 2020;11:571299. pmid:33329628
  7. Gai R, Chen N, Yuan H. A detection algorithm for cherry fruits based on the improved YOLO-v4 model. Neural Comput & Applic. 2021;35(19):13895–906.
  8. Zheng H, Wang G, Li X. Swin-MLP: a strawberry appearance quality identification method by Swin Transformer and multi-layer perceptron. Food Measure. 2022;16(4):2789–800.
  9. Bazame HC, Molin JP, Althoff D, Martello M. Detection, classification, and mapping of coffee fruits during harvest with computer vision. Computers and Electronics in Agriculture. 2021;183:106066.
  10. Chou Y-C, Kuo C-J, Chen T-T, Horng G-J, Pai M-Y, Wu M-E, et al. Deep-Learning-Based Defective Bean Inspection with GAN-Structured Automated Labeled Data Augmentation in Coffee Industry. Applied Sciences. 2019;9(19):4166.
  11. Akbar M, Rachmawati E, Sthevanie F. Visual Feature and Machine Learning Approach for Arabica Green Coffee Beans Grade Determination. In: Proceedings of the 6th International Conference on Communication and Information Processing; 2020.
  12. Zhao Y, Yang H, Zhang Y, Yang Y, Yang Y, Sai M. Detection of defective Arabica green coffee beans based on feature combination and SVM [in Chinese]. Transactions of the Chinese Society of Agricultural Engineering. 2022;38(14):295-302.
  13. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A, et al. Attention is all you need. In: Advances in Neural Information Processing Systems; 2017:30.
  14. Zhou D, Kang B, Jin X, Yang L, Lian X, Hou Q, et al. DeepViT: Towards deeper vision transformer. arXiv:2103.11886 [Preprint]. 2021 [cited 2024 Dec 17]. Available from: https://arxiv.org/abs/2103.11886
  15. Huang C-W, Chen Y-N. Adapting Pretrained Transformer to Lattices for Spoken Language Understanding. 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). 2019:845–52.
  16. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: European Conference on Computer Vision; 2020. p. 213-229. https://doi.org/10.1007/978-3-030-58452-8_13
  17. Arnab A, Dehghani M, Heigold G, Sun C, Lučić M, Schmid C. ViViT: A Video Vision Transformer. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021:6816–26.
  18. Xu P, Zhu X, Clifton DA. Multimodal Learning With Transformers: A Survey. IEEE Trans Pattern Anal Mach Intell. 2023;45(10):12113–32. pmid:37167049
  19. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929 [Preprint]. 2020 [cited 2024 Dec 17]. Available from: https://arxiv.org/abs/2010.11929
  20. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021:10012-10022. https://doi.org/10.1109/ICCV48922.2021.00986
  21. Hao S, Zhang L, Jiang Y, Hu H, Wang J, Ji Z, et al. ConvNeXt-ST-AFF: A Novel Skin Disease Classification Model Based on Fusion of ConvNeXt and Swin Transformer. IEEE Access. 2023.
  22. Wang J, Zhang Z, Luo L, Zhu W, Chen J, Wang W. SwinGD: A Robust Grape Bunch Detection Model Based on Swin Transformer in Complex Vineyard Environment. Horticulturae. 2021;7(11):492.
  23. Si H, Li M, Li W, Zhang G, Wang M, Li F, et al. A Dual-Branch Model Integrating CNN and Swin Transformer for Efficient Apple Leaf Disease Classification. Agriculture. 2024;14(1):142.
  24. Chen T, Mo L. Swin-Fusion: Swin-Transformer with Feature Fusion for Human Action Recognition. Neural Process Lett. 2023;55(8):11109–30.
  25. Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, et al. Swin-Unet: Unet-like pure transformer for medical image segmentation. In: Proceedings of the European Conference on Computer Vision; 2022:205-218.
  26. Adelson EH, Anderson CH, Bergen JR, Burt PJ. Pyramid methods in image processing. RCA Eng. 1984;29(6):33-41.
  27. Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition; 2017:2117-2125.
  28. Chen Y, Zhang C, Chen B, Huang Y, Sun Y, Wang C, et al. Accurate leukocyte detection based on deformable-DETR and multi-level feature fusion for aiding diagnosis of blood diseases. Comput Biol Med. 2024;170:107917. pmid:38228030
  29. Chollet F. Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 1251-58.
  30. Wang F, Si B, Yu G, Zhang Y. Fusion of Swin Transformer and discriminative enhancement module target detection algorithm [in Chinese]. Journal of Optoelectronics·Laser. Forthcoming 2024. p. 1-10.
  31. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision; 2017. p. 2980-2988.
  32. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90.
  33. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 [Preprint]. 2014 [cited 2024 Dec 17]. Available from: https://arxiv.org/abs/1409.1556
  34. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 770-778.
  35. Jing E, Zhang H, Li Z, Liu Y, Ji Z, Ganchev I. ECG heartbeat classification based on an improved ResNet-18 model. Computational and Mathematical Methods in Medicine. 2021;2021.
  36. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018:4510–20.
  37. Chen CF, Fan Q, Panda R. CrossViT: Cross-attention multi-scale vision transformer for image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021. https://doi.org/10.1109/ICCV48922.2021.00041