
Interpretable crop pest and disease identification based on comparative concept tree

  • Bingjing Jia,

    Roles Methodology, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Anhui Science and Technology University, Bengbu, China

  • Zhiwei Zheng,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Visualization

    Affiliation Anhui Science and Technology University, Bengbu, China

  • Jinyu Zeng,

    Roles Methodology, Validation

    Affiliation Anhui Science and Technology University, Bengbu, China

  • Lei Shi,

    Roles Validation

    Affiliation State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China

  • Hua Ge,

    Roles Data curation, Project administration, Resources

    Affiliation Anhui Science and Technology University, Bengbu, China

  • Chenguang Song

    Roles Resources, Supervision

    songcg@ahstu.edu.cn

    Affiliation Anhui Science and Technology University, Bengbu, China

Abstract

Deep learning provides new methods for crop pest and disease identification and control, offering unique advantages in terms of recognition accuracy and efficiency. However, deep learning models generally lack interpretability, and their internal decision-making processes are difficult to understand. This, to some extent, undermines users’ trust in the model’s predictions and hinders its large-scale application in agricultural production. Therefore, improving model transparency and interpretability has become an important research direction. To address this issue, this study proposes a novel interpretable crop pest and disease identification model, the Contrastive Prototype Tree (CPTR). The model is designed around the core structure of “concept prototypes and decision tree,” which builds clear prototype matching paths for each recognition result. This enables the model to not only have strong classification capability but also provide intuitive explanations. Additionally, the study introduces the SimCLR contrastive learning framework to enhance the model’s ability to express deep image features. SimCLR guides the model to learn more discriminative visual features by maximizing the similarity between positive sample pairs and minimizing the similarity between negative sample pairs, thereby improving overall recognition performance. This study evaluated the model on three datasets: AppleLeaf9, Cassava, and Cashew. The experimental results show that CPTR achieves accuracies of 83.74%, 94.80%, and 96.01% on the three datasets, representing improvements of 4.12%, 0.34%, and 0.51% compared to Prototype Tree, respectively. These results indicate that the proposed model achieves the highest accuracy across different datasets, demonstrating its effectiveness.

1 Introduction

Globally, crop pests and diseases pose a significant challenge to agricultural production and food security [1,2]. With the intensification of global climate change and changes in agricultural production methods, the types of crop pests and diseases continue to increase, and their spread has become more complex and rapid [3,4], causing major difficulties for agricultural workers. Effective identification and early prevention of crop pests and diseases are crucial for improving crop yield and quality, reducing excessive pesticide use, minimizing environmental pollution, and ensuring food security. Efficient and accurate pest and disease identification methods can not only reduce pesticide usage but also prevent crop yield loss caused by pesticide misuse, thereby increasing crop yield while protecting the ecological environment [5,6].

Traditional identification methods mainly rely on agricultural experts’ on-site observations and subjective judgment, which are not only inefficient but also significantly reduce both efficiency and accuracy when faced with large-scale production environments and diverse disease types [7]. In recent years, with the development of image recognition and computer vision, deep learning has gradually become an important technological approach for crop pest and disease identification [8,9]. Convolutional Neural Networks (CNNs), with their multi-layered nonlinear structure, can learn discriminative deep features from raw images, thereby significantly improving the accuracy of pest and disease image recognition [10].

Mainstream deep learning models often exist as “black boxes,” with their internal reasoning mechanisms lacking transparency [11], making it difficult for agricultural practitioners to trust the model’s output in practical applications. Especially when the model makes incorrect predictions, users often struggle to trace the decision process, analyze the causes of errors, and effectively locate model flaws, which reduces the model’s practical value in production [12]. As a result, “interpretability” has gradually become a key research direction for AI models. Researchers have proposed various interpretable deep model structures, including gradient-weighted class activation mapping (Grad-CAM) [13], concept bottleneck networks [14], and prototype networks [15], all of which have improved model interpretability to some extent.

Prototype learning structures have gained widespread attention due to their strong semantic nature and clear reasoning processes. The core idea is to introduce “concept prototypes” as an intermediate bridge in classification, enabling the model not only to provide prediction results but also to highlight the semantic representative regions or samples upon which the predictions are based. Building upon this, the Concept Prototype Tree (Prototype Tree) [16] further extends the expressive capacity of prototype learning by organizing multiple concept prototypes in a hierarchical manner through a tree structure. This allows the model to make decisions along a clear logical path, progressively approaching the final classification result. This modular, staged reasoning mechanism not only enhances the model’s interpretability but also improves its ability to represent complex class relationships, making it easier for users to intuitively understand the model’s decision-making basis and process.

However, prototype learning still faces challenges such as blurry decision boundaries and insufficient feature representation when dealing with highly similar or widely varying image categories. SimCLR contrastive learning [17] addresses this by pulling positive sample pairs closer and pushing negative sample pairs farther apart in the feature space, enabling the model to learn more discriminative feature embeddings. This method can automatically extract structurally stable and semantically rich features from images.

Based on the above background, this paper proposes an interpretable pest and disease identification model that integrates SimCLR contrastive learning and the Concept Prototype Tree structure—Contrastive Prototype Tree (CPTR). The CPTR model is designed to balance both accuracy and interpretability. It uses CNN as the base feature extraction module and incorporates the SimCLR mechanism to enhance feature expression capabilities. Additionally, it combines a trainable prototype module with a binary concept tree structure in the decision-making process, thus achieving semantic transparency and structural explicitness in the classification path. Unlike traditional CNNs or prototype networks, CPTR not only outputs prediction results during inference but also showcases the relationships between the input image and multiple concept prototypes. This structure provides good local interpretability while offering a global decision-making context through the tree-based organization, significantly improving users’ understanding and trust in the model’s prediction rationale.

The structure of this paper is organized as follows: Section 2 presents relevant domestic and international research progress, with a focus on reviewing the development of deep learning-based crop pest and disease identification methods and interpretable models. Section 3 provides a detailed description of the overall framework and key techniques of the proposed Contrastive Prototype Tree model, including dataset construction, preprocessing, the SimCLR contrastive learning mechanism, and the concept prototype tree structure. Section 4 outlines the experimental design, parameter settings, and result analysis, emphasizing the performance of CPTR across multiple crop pest and disease datasets and its interpretability validation. Section 5 concludes the paper, discussing the innovations, limitations, and future research directions.

The main contributions of this paper are as follows:

  • This study proposes an interpretable crop pest and disease identification model based on SimCLR contrastive learning and the Concept Prototype Tree.
  • To enhance the model’s understanding and recognition capability of image information, the SimCLR contrastive learning mechanism is introduced to optimize feature learning, effectively improving the discriminative power and generalizability of feature expression.
  • Experimental research is conducted on multiple real-world crop pest and disease image datasets. Compared with existing mainstream models, CPTR demonstrates superior overall performance.

2 Related work

2.1 Crop disease and pest identification based on deep learning

Crop pest and disease identification is a critical component in ensuring agricultural production efficiency and crop yield [18]. Traditional identification methods often rely on expert knowledge and manual experience, which are not only inefficient but also prone to subjective bias. With the development of computer vision and deep learning technologies, image-based pest and disease identification methods have gradually become mainstream.

In early studies, researchers primarily used classic convolutional neural network (CNN) architectures to classify crop pest and disease images. Fuentes et al. [19] combined three detectors—Faster R-CNN, SSD, and R-FCN—with backbone networks such as VGG-16, ResNet-50, and ResNeXt-50 to achieve real-time identification of tomato pests and diseases, maintaining high accuracy and low false positive rates even in complex environments. These object detection models are capable of locating and classifying disease areas in images, laying the foundation for practical applications in real-world scenarios.

To improve recognition ability in complex field scenarios, subsequent studies introduced drone remote sensing images and image segmentation techniques. Tetila et al. [20] used a drone-based aerial image acquisition method combined with SLIC superpixel segmentation and ResNet extractors to achieve automatic identification of soybean pest and disease areas, with the highest classification accuracy reaching 93.82%. In addition, image augmentation and preprocessing technologies have become key methods for enhancing model performance. For example, the image processing system developed by Devaraj [21] on the MATLAB platform significantly improved overall recognition accuracy in stages such as preprocessing, feature extraction, and classification. Martos et al. [22] integrated remote sensing technology, artificial intelligence, and advanced sensor technologies to achieve efficient management and sustainable development of agricultural production.

At the same time, transfer learning has been widely applied in pest and disease identification tasks as an effective method to alleviate the scarcity of agricultural image samples. Barbedo [23] found that deep models generalize significantly better when trained on more numerous and more varied samples, highlighting the importance of building high-quality pest and disease image databases. In practical applications, mainstream models are often pre-trained on large-scale datasets like ImageNet and then fine-tuned for crop pest and disease identification tasks to enhance the model’s performance on target tasks.

In summary, deep learning methods have shown promising application prospects in crop pest and disease identification. However, several issues remain in their practical application: first, obtaining high-quality annotated data in the agricultural field is difficult, which limits the training and deployment of deep models; second, existing models generally lack transparent decision logic, making it challenging to meet the need for interpretability in agricultural practice; third, the models have insufficient generalization ability across different regions or environmental conditions, leading to performance degradation.

2.2 Interpretable deep learning

Deep neural networks face challenges in gaining widespread trust and acceptance in agricultural applications due to the lack of transparency in their decision-making processes. Explainable Artificial Intelligence (XAI) technologies improve users’ trust in model decisions by providing visual and semantic explanations of model behavior [24].

Early studies often employed post-hoc explainability methods, such as Grad-CAM, which generates activation heatmaps through gradient backpropagation, and LIME [25], which models feature contribution values through input perturbations. Shrikumar et al.’s DeepLIFT algorithm [26] can trace the impact path of input changes on the output, enhancing local interpretability. Gopalan et al. [27] proposed a maize leaf disease classification model based on ResNet152 and combined it with the Grad-CAM method to improve model interpretability. This model achieved accuracies of 99.95% in training and 98.34% in testing, effectively distinguishing between four types of maize leaf diseases.

However, post-hoc methods often lack stability and are disconnected from the model’s original structure. To overcome these shortcomings, researchers have proposed structurally interpretable models, with the “Concept Bottleneck Model (CBM)” and “Concept Prototype Network (ProtoPNet)” being the most representative. CBM achieves semantic-level interpretability by constructing an intermediate layer with explicit semantic representations, where model decisions are built upon high-level concepts. ProtoPNet, on the other hand, introduces a set of class prototype images, making the classification process resemble a “this looks like that” analogy, and has achieved good interpretability results in multiple fine-grained recognition tasks. Zeng et al. [28] proposed the CDPNet model, a deformable ProtoPNet model for interpretable maize leaf disease identification. This model, by combining deformable convolution with ProtoPNet’s concept prototypes, can capture more flexible and precise disease areas, thereby improving both disease diagnosis accuracy and interpretability.

The advantage of concept prototype methods lies in their intuitive reasoning logic and strong semantic associations. By using prototypes as intermediaries, these methods link input images with classes, significantly enhancing model transparency and human interpretability. For example, ProtoPNet can show, “This image belongs to apple rust disease because it closely resembles the red rust spots in this prototype image.” This image-to-prototype visual mapping provides a strong local explanation basis.

However, existing concept prototype models also exhibit notable limitations. On one hand, their prototype matching mechanisms often rely on global or fixed local feature representations, making them less effective when dealing with highly similar classes or images with complex internal structures. This can lead to misclassification or ambiguous reasoning. On the other hand, most current methods do not consider the structural relationships among prototypes and therefore lack the ability to express a global decision path. As a result, although the models are interpretable, their explanations remain fragmented at the “image-to-image” level and fail to provide a complete semantic reasoning chain.

To address these issues, this study adopts the design principles of the Prototype Tree model and introduces a tree-structured representation on top of prototype learning, proposing a more hierarchical and holistic structurally interpretable framework. In this framework, multiple prototypes are organized as decision tree nodes according to semantic or discriminative pathways. The input image is matched through the tree from top to bottom, enabling multi-level reasoning that progresses from coarse to fine and from abstract to specific. Each branching decision corresponds to a prototype match and can explain why a particular path is chosen over another. This approach effectively integrates local image-level explanations with the global decision-making logic. To better compare representative interpretable deep learning methods, Table 1 summarizes the core ideas, advantages, and limitations of several widely used models.

3 Model

3.1 CPTR

The CPTR model integrates the SimCLR contrastive learning mechanism with the Concept Prototype Tree structure, balancing feature expression capability and model interpretability. The model consists of three main components: (1) a CNN feature extraction layer, which converts the input image into high-dimensional feature representations; (2) a Concept Prototype Tree layer, which contains several trainable prototype nodes organized into a binary decision tree for hierarchical feature discrimination; and (3) a SimCLR contrastive learning and training strategy, which combines the contrastive loss with the classification loss as the optimization objective.

The input crop pest and disease image is passed through a convolutional neural network (CNN) to obtain the feature map z. During the training phase, this study applies random transformations to each input image to generate another view, and the same CNN is used to extract features to obtain z′. The SimCLR contrastive loss is then applied to pull z and z′ closer in the feature space while pushing them away from the features of other samples. This contrastive learning process enhances the discriminability and robustness of the features extracted by the CNN. Next, the feature map z is passed through the Concept Prototype Tree layer for discrimination. The specific process is as follows: the model calculates the similarity between z and each prototype at the tree nodes, and routes z to the corresponding child node with a probability based on that similarity. In this way, the input sample propagates through the tree structure, matching the corresponding pest and disease feature prototypes layer by layer, ultimately reaching one or more leaf nodes. Because of the probabilistic soft routing, the input features are effectively “distributed” to each leaf node with varying probabilities. The model then performs a weighted fusion of the category predictions from the leaf nodes according to these probabilities to obtain the final output.

The decision tree layer in CPTR differs from traditional decision trees: it allows samples to propagate along multiple paths simultaneously in a soft branching manner, improving the model’s adaptability to complex samples while maintaining differentiability. During training, the model’s optimization objective is composed of both classification loss and contrastive loss. These two losses are combined using a weighted sum:

L_total = L_CE + λ · L_SimCLR        (1)

The total loss function described above combines the supervised classification loss and the contrastive learning loss. Here, L_CE represents the cross-entropy loss, which measures the discrepancy between the model’s predictions and the ground-truth labels; L_SimCLR denotes the SimCLR contrastive loss, which aims to enhance the discriminability of the learned feature representations; and λ is a weighting coefficient that controls the relative contribution of the two losses. The overall architecture of the CPTR model is illustrated in Fig 1.

To examine the effect of the loss weighting coefficient λ in Eq. (1), we conducted a small sensitivity analysis on the AppleLeaf9 dataset. Keeping all other training settings fixed, we evaluated the model with several values of λ around the default setting. The results show that the overall classification performance varies only slightly within this range, indicating that the model is relatively insensitive to λ within a reasonable interval. Among these settings, the default value provides a favorable balance between classification accuracy and feature discriminability, and is therefore adopted as the final setting in all experiments reported in this paper.
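As a concrete illustration, the weighted objective in Eq. (1) amounts to a few lines of code. This is a minimal Python sketch: the function names and the default λ value of 0.5 are illustrative assumptions, not values fixed by the paper.

```python
import math

def cross_entropy(probs, label):
    # Supervised classification loss L_CE for one sample:
    # negative log-probability assigned to the ground-truth class.
    return -math.log(probs[label])

def total_loss(probs, label, simclr_loss, lam=0.5):
    # Eq. (1): L_total = L_CE + lambda * L_SimCLR.
    # lam = 0.5 is a placeholder; the paper tunes lambda separately.
    return cross_entropy(probs, label) + lam * simclr_loss
```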

3.2 CNN feature extraction layer

In the CPTR model, the CNN feature extraction module serves as the foundational component for image encoding, responsible for transforming the input crop pest and disease images into high-dimensional, structured visual feature representations. This module employs predefined deep convolutional neural network architectures (VGG19, ResNet152, DenseNet161) as backbone networks, leveraging their excellent feature extraction capabilities in the field of image recognition. Let the input image be x ∈ R^{C×H×W}, where C is the number of channels and H and W are the height and width of the original image. After passing through the convolutional layers and nonlinear transformations, the output feature map is denoted as:

z = f_θ(x)        (2)

where f_θ represents the CNN feature extraction function with parameter set θ, and z denotes the output feature map.

Each feature vector z_{i,j} corresponds to a semantic representation of position (i, j) in the original image and contains deep structural information about the responses of the convolutional kernels to that region. The entire feature map z is essentially a dense representation of the input image in a high-dimensional semantic space, and serves as the basis for region matching in the subsequent concept prototype tree module.

3.3 Conceptual prototype tree layer

To enhance the interpretability of the model while maintaining classification performance, this study introduces into the CPTR model a concept prototype tree structure that fuses prototype learning with a soft decision tree. The structure places concept prototypes at its core and, together with a probabilistic routing mechanism, realizes a step-by-step transparent inference process from feature representation to category prediction.

In the concept prototype tree, each internal node n is associated with a trainable prototype p_n. Taking a spatial size of 1×1, each prototype is a small patch whose number of channels D is the same as the number of channels in the feature map z. This design is able to capture fine-grained local discriminative features in the input image, such as leaf surface details and lesion distribution. To determine which local region of the input feature map is closest to a node’s prototype, we slide over the feature map z to extract all local patches z̃ and compute the Euclidean distance between each of them and p_n. Ultimately, the local patch with the smallest distance is selected as the optimal match:

z̃* = argmin_{z̃ ∈ patches(z)} ‖z̃ − p_n‖₂        (3)

This optimal patch z̃* represents the local region of the input image that is most similar to the prototype, and serves as the basis for node decisions. Through this local matching mechanism, the model perceives the key features of the input image at a fine-grained level and guides the subsequent routing of the samples accordingly.
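The sliding-window matching in Eq. (3) can be sketched in pure Python. The nested-list layout (a list of D channel planes, each an H×W grid) and the function name are assumptions made for illustration; a real implementation would use tensor operations.

```python
import math

def nearest_patch_distance(feature_map, prototype):
    # feature_map: D channel planes, each an H x W grid of activations.
    # prototype: a 1x1 prototype given as D channel values.
    # Returns the smallest Euclidean distance between the prototype and
    # any spatial position of the feature map (Eq. 3).
    D = len(prototype)
    H, W = len(feature_map[0]), len(feature_map[0][0])
    best = float("inf")
    for i in range(H):
        for j in range(W):
            d = math.sqrt(sum((feature_map[c][i][j] - prototype[c]) ** 2
                              for c in range(D)))
            best = min(best, d)
    return best
```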

After the prototype matching is completed, the routing direction of the sample in the tree needs to be determined based on the degree of matching. Specifically, the probability of the sample propagating from the current node to the right child node is defined as:

p_right(z) = exp(−‖z̃* − p_n‖₂)        (4)

and the probability of propagation to the left child node is its complement:

p_left(z) = 1 − p_right(z)        (5)

This soft split strategy allows samples to propagate to both child nodes with certain probabilities instead of being forced onto a single path, preserving the uncertainty and informativeness of the decision-making process. In addition, the exponential function naturally maps the Euclidean distance to a probability, ensuring that the probability of propagating to the right branch is higher when the distance is smaller and the similarity is higher.

This design differs from the single-path routing of traditional hard decision trees, enabling the CPTR model to handle fuzzy category boundaries and large sample heterogeneity effectively while remaining differentiable and thus amenable to end-to-end training.
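Under the exponential mapping of Eqs. (4) and (5), the two branch probabilities at a node follow directly from the prototype distance. A minimal sketch (function name assumed):

```python
import math

def soft_routing(distance):
    # Eq. (4): smaller prototype distance -> higher right-branch probability.
    p_right = math.exp(-distance)
    # Eq. (5): the left branch receives the complement.
    p_left = 1.0 - p_right
    return p_left, p_right
```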

After the samples are routed through multiple layers of nodes, they eventually reach each leaf node with a certain probability distribution. Let P_ℓ be the path from the root node to leaf node ℓ. The total probability of a sample reaching ℓ along this path is:

π_ℓ = ∏_{e ∈ P_ℓ} p_e        (6)

where p_e denotes the transition probability of the sample along edge e on the path. The path probability π_ℓ reflects the global likelihood of the sample’s reasoning process in the concept prototype tree structure.

Each leaf node ℓ learns a category distribution parameter vector c_ℓ, which is normalized by Softmax to obtain a standardized category probability distribution σ(c_ℓ). Ultimately, the category prediction ŷ for the input image is obtained by summing the outputs of all leaf nodes weighted by their path probabilities:

ŷ = Σ_{ℓ ∈ L} π_ℓ · σ(c_ℓ)        (7)

This weighted fusion mechanism not only makes model predictions end-to-end differentiable and trainable but also ensures transparent traceability of the inference path and decision basis. Unlike traditional black-box neural networks, CPTR provides a clear and verifiable decision chain for each classification prediction, thus significantly improving the model’s trustworthiness and usability in real-world application scenarios such as agricultural production. The concept prototype tree structure is illustrated in Fig 2.
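Eqs. (6) and (7) together can be sketched as follows; the input layout (one list of edge probabilities and one logit vector per leaf) is an assumption chosen for clarity, not the model’s actual interface.

```python
import math

def tree_prediction(leaf_edge_probs, leaf_logits):
    # leaf_edge_probs: for each leaf, the routing probabilities of the
    # edges on its root-to-leaf path; leaf_logits: that leaf's class logits.
    def softmax(v):
        m = max(v)
        e = [math.exp(x - m) for x in v]
        s = sum(e)
        return [x / s for x in e]

    num_classes = len(leaf_logits[0])
    pred = [0.0] * num_classes
    for edges, logits in zip(leaf_edge_probs, leaf_logits):
        pi = 1.0
        for p in edges:          # Eq. (6): product of edge probabilities
            pi *= p
        dist = softmax(logits)   # normalized leaf class distribution
        for k in range(num_classes):
            pred[k] += pi * dist[k]   # Eq. (7): path-weighted fusion
    return pred
```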

3.4 SimCLR contrastive learning

In order to improve the feature representation of the CPTR model, this study introduces the SimCLR contrastive learning framework for feature optimization. SimCLR obtains a more discriminative image feature representation by constructing pairs of positive and negative samples to maximize the similarity between positive samples and minimize the similarity between negative samples in the feature space.

Specifically, for a given input image x, this study first generates two different augmented views, denoted x̃_i and x̃_j, by means of stochastic data augmentation strategies (e.g., random cropping, random flipping, color perturbation). Since both views are derived from the same original image, they are treated as a positive sample pair. The augmented views are then fed into a weight-sharing convolutional neural network feature extractor followed by a nonlinear projection head to obtain low-dimensional feature representations z_i and z_j, respectively.

Given a batch of training samples, the SimCLR contrastive loss is used to optimize the model parameters so that augmented views of the same original image are drawn closer together in feature space while features of different images are pushed farther apart. The NT-Xent loss is defined as:

ℓ_{i,j} = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1[k ≠ i] · exp(sim(z_i, z_k)/τ) ]        (8)

where τ denotes the temperature parameter, N is the number of original samples within a single training batch (yielding 2N augmented views), 1[k ≠ i] denotes the indicator function that excludes the anchor sample itself from the denominator, ℓ_{i,j} denotes the loss for the positive pair (z_i, z_j), and sim(z_i, z_k) denotes the cosine similarity between feature vectors. The contrastive learning framework is shown in Fig 3.
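For reference, the per-pair NT-Xent term of Eq. (8) can be computed as below. This is an unbatched sketch with assumed function names, not an efficient implementation; τ defaults to 0.5 here purely for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity sim(u, v) between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def nt_xent_pair(features, i, j, tau=0.5):
    # Eq. (8): loss for the positive pair (i, j) among 2N augmented views.
    # The denominator ranges over all views except the anchor i itself.
    num = math.exp(cosine(features[i], features[j]) / tau)
    den = sum(math.exp(cosine(features[i], features[k]) / tau)
              for k in range(len(features)) if k != i)
    return -math.log(num / den)
```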

4 Experiments

4.1 Dataset

To systematically evaluate the recognition capability of the proposed model under different crop types and pest disease symptoms, this paper selected three publicly available and representative crop pest and disease image datasets: Cassava, Cashew, and AppleLeaf9. These datasets all originate from real field environments, featuring high resolution, clear category differences, and significant intra-class variation, which effectively reflect the complexity of actual agricultural scenarios.

To ensure fairness and reproducibility of the experimental results, all datasets were split into training and testing sets according to a unified principle. During the split process, independent random sampling was performed within each category to ensure balanced class distribution; the random process used a fixed seed of seed = 42 to ensure consistent splitting results across different experiments. The split ratio was 80% for the training set and 20% for the testing set. The training set was used for model training and parameter updates, while the testing set was only used in the final performance evaluation phase and was not involved in any model tuning or hyperparameter selection.
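The stratified 80/20 split with a fixed seed described above can be reproduced with a short helper; the function name and list-based interface are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_ratio=0.8, seed=42):
    # Shuffle the sample indices of each class independently with a fixed
    # seed, then cut each class at the given ratio (80% train / 20% test).
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(len(idxs) * train_ratio)
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return train_idx, test_idx
```

Because the seed is fixed, repeated calls yield identical splits, which is what makes the comparison across models reproducible.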

The Cassava dataset is sourced from the CCMT platform [29], and primarily consists of leaf images of cassava crops collected from real field environments. It includes five categories: Healthy Leaves, Bacterial Blight, Brown Spot, Green Mite, and Mosaic. The dataset contains a total of 7,508 images, with the training and testing sets split according to the 80/20 ratio described above while maintaining a balanced class distribution. The images were collected under various lighting conditions, angles, and leaf statuses, effectively simulating the diversity and complexity of cassava field diseases.

The Cashew dataset also comes from the CCMT project and focuses on the identification of typical pests and diseases on cashew leaves. The dataset includes four categories: Healthy, Anthracnose, Red Rust, and Leaf Miner, with a total of 6,109 images. All images were manually collected and annotated by experts, with clear disease labels and high image quality, making the dataset suitable for constructing multi-class classification tasks.

The AppleLeaf9 dataset [30] is a composite dataset built for apple leaf disease identification, integrating multiple publicly available subsets, including the PlantVillage database [31], ATLDSD, PPCD2020, and PPCD2021 [32–34]. It contains a total of 14,582 images covering nine categories: Alternaria Leaf Spot, Brown Spot, Frogeye Leaf Spot, Grey Spot, Mosaic, Powdery Mildew, Rust, Scab, and Healthy Leaves. Approximately 94% of the images were captured under natural field conditions, incorporating complex factors such as uneven lighting and background interference, which greatly enhances the dataset’s practical adaptability.

It should be noted that the AppleLeaf9 dataset is constructed by integrating multiple publicly available sub-datasets, and the original data do not provide unique identifiers at the leaf or scene level. As a result, when performing stratified random splitting, it is not possible to strictly guarantee that images originating from the same leaf or the same acquisition scene do not appear simultaneously in both the training and test sets. This potential sample correlation may, to some extent, lead to a slight overestimation of the overall performance.

Nevertheless, to ensure a fair comparison among different methods, all models in this study are evaluated using exactly the same data split. Under identical data conditions, the relative performance differences between models remain comparable and informative.

Detailed information about the dataset is provided in Table 2, and sample images from the dataset are shown in Fig 4.

Table 2. Details of Cashew, Cassava and Appleleaf9 datasets.

https://doi.org/10.1371/journal.pone.0343715.t002

Fig 4. Partial images of Cashew, Cassava, and AppleLeaf9 datasets.

https://doi.org/10.1371/journal.pone.0343715.g004

4.2 Data pre-processing

To improve the model’s generalization in crop pest and disease identification tasks and prevent overfitting during training, while also providing rich augmented sample views for the SimCLR contrastive learning module, this study applied systematic data pre-processing. First, all images were resized to 224×224 pixels to ensure consistent input dimensions. Then, two random augmented views were generated independently for each image to construct positive sample pairs for SimCLR. The augmentation strategies were: random perspective transformation (distortion_scale of 0.2, applied with probability 0.5) to simulate geometric deformation of leaves viewed at different angles; color jitter (brightness, contrast, saturation, and hue adjustment ranges all set to 0.4, applied with probability 0.8) to simulate natural lighting and color variation; horizontal flipping (probability 0.5) to improve robustness to orientation changes; and random affine transformation (rotation angles of ±10°, translation scale of 0.05, shear angles of ±2°, applied with probability 0.8) to increase the spatial diversity of the samples.

All augmented images were normalized using the ImageNet standard for channel normalization, with mean values of [0.485, 0.456, 0.406] and standard deviations of [0.229, 0.224, 0.225] for the RGB channels, ensuring the stability of feature distribution. This data pre-processing strategy significantly improved the model’s robustness under varying lighting, angles, and background conditions, and provided sufficient view diversity for SimCLR contrastive learning, thereby enhancing the discriminability and generalization capability of the feature representations. As shown in Fig 5, the SimCLR-based data augmentation strategy generates two distinct augmented views of each original image through a sequence of stochastic transformations, providing the foundation for contrastive representation learning.

4.3 Experimental parameters

In this study, all experiments were conducted at a resolution of 224×224. During the first 30 epochs of training, the backbone network parameters were frozen and only the concept prototype layer and SimCLR projection head were optimized, ensuring stable convergence of the feature extraction part. In the remaining 70 epochs, all parameters were unfrozen for end-to-end joint optimization.
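The freeze-then-unfreeze warm-up could be implemented as below; the attribute name `model.backbone` is illustrative, chosen only to separate the pretrained backbone from the newly added modules.

```python
import torch.nn as nn

def set_backbone_trainable(model: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze the backbone while leaving the concept prototype
    layer and SimCLR projection head trainable."""
    for param in model.backbone.parameters():
        param.requires_grad = trainable

def apply_schedule(model: nn.Module, epoch: int, warmup_epochs: int = 30) -> None:
    """Warm-up schedule from the text: backbone frozen for the first 30
    epochs, then unfrozen for end-to-end joint optimization."""
    set_backbone_trainable(model, trainable=(epoch >= warmup_epochs))
```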

The training was conducted for a total of 100 epochs with a batch size of 64. The Adam optimizer was used with β₁ = 0.9 and β₂ = 0.999, together with weight decay. A layered learning rate strategy was employed for different modules: a smaller learning rate was used for the backbone network (CNN backbone) to avoid disrupting the pretrained weights, while a primary learning rate of 0.001 was applied to the concept prototype layer and projection head to accelerate the convergence of the new feature space.

The learning rate scheduling adopted a milestone-based decay strategy, where the learning rate was multiplied by a decay factor of 0.1 every 10 epochs starting from epoch 60. All hidden layers used ReLU as the activation function. The concept layer nodes used the Sigmoid activation function to maintain the independence of semantic concept responses, while the final classification output layer used the Softmax function for normalization to achieve multi-class pest and disease identification. The channel dimension of the concept prototypes was set to D = 256, and the decision tree depth was set to 3.

In the SimCLR contrastive learning module, the temperature parameter τ was set to 0.5. The projection head used a two-layer multi-layer perceptron: the first layer was Linear(256 → 2048) followed by Batch Normalization and ReLU activation, and the second layer was Linear(2048 → 128). The output vector was L2-normalized before loss calculation to compute cosine similarity. Similarity in the SimCLR branch was measured by cosine distance, while prototype matching in the concept prototype tree module was based on Euclidean distance. This configuration preserved the stability of the pre-trained features while ensuring fast convergence of the newly introduced modules and overall training stability, effectively balancing classification accuracy, feature expressiveness, and interpretability.
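The projection head and the standard SimCLR NT-Xent loss with τ = 0.5 can be sketched as follows; this is the textbook formulation of the loss, assumed here to match the paper's SimCLR branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Projection head as described: Linear(256 -> 2048) + BN + ReLU, Linear(2048 -> 128).
projection_head = nn.Sequential(
    nn.Linear(256, 2048),
    nn.BatchNorm1d(2048),
    nn.ReLU(inplace=True),
    nn.Linear(2048, 128),
)

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss used by SimCLR; z1, z2 are the projections of the two
    augmented views (each of shape N x 128)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # L2-normalize
    sim = z @ z.t() / temperature                        # cosine similarities
    n = z1.size(0)
    sim.fill_diagonal_(float("-inf"))                    # exclude self-pairs
    # The positive of sample i in the first half is i + n, and vice versa.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)
```

Maximizing the positive-pair similarity while treating all other samples in the batch as negatives is exactly the mechanism, described in the abstract, by which SimCLR sharpens the learned features.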

4.4 Evaluation indicators

To evaluate the effectiveness of the proposed model, CPTR is compared with Prototype Tree, VGG19 [35], ResNet152 [36], DenseNet161 [37], Vision Transformer [38], and Swin Transformer [39] on three datasets. All methods adopt the same data augmentation strategy, and their performance is assessed using four commonly used metrics: accuracy, precision, recall, and F1 score.

To ensure a fair and controlled comparison, all compared methods are evaluated under identical pretraining conditions. Specifically, the backbone networks of all models—including standard CNN baselines (VGG19, ResNet152, DenseNet161), Vision Transformer, Swin Transformer, Prototype Tree, and the proposed CPTR—are initialized using weights pretrained on the iNaturalist 2017 dataset.

By adopting a unified pretraining strategy across different model families, the influence of pretraining data is effectively controlled. Under this setting, performance differences among methods can be attributed to differences in model architectures and training mechanisms rather than advantages arising from pretraining.

Within the prototype-based framework, CPTR and Prototype Tree further share the same backbone architecture and identical pretraining initialization. Therefore, the performance improvement achieved by CPTR over Prototype Tree reflects the contribution of the proposed SimCLR-based contrastive learning mechanism. This interpretation is further supported by the ablation study reported in Section 4.6, where the effect of SimCLR is examined while keeping all other settings unchanged.

For clarity, all reported results (Accuracy, macro-averaged Precision, Recall, and F1) are presented as mean ± standard deviation over three independent runs with different random seeds.

Accuracy is the proportion of correctly categorized samples to the total sample.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$

The precision rate is the proportion of samples correctly predicted to be in the positive category to all samples predicted to be in the positive category.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{10}$$

Recall is the proportion of samples correctly predicted to be in the positive category to all samples that are actually in the positive category.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{11}$$

The F1 score, the harmonic mean of precision and recall, is a composite metric that is particularly applicable in cases of class imbalance.

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{12}$$

Here, TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.

For multi-class classification, Precision, Recall, and F1 are computed in a macro-averaged manner by treating each class as one-vs-rest and averaging across classes.
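The macro-averaged metrics above can be computed directly from the per-class counts, as in this small sketch (equivalent in spirit to scikit-learn's `average="macro"` option):

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision/recall/F1, computed
    one-vs-rest per class and averaged with equal class weight."""
    classes = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(classes)
    return {"accuracy": accuracy,
            "precision": sum(precisions) / k,
            "recall": sum(recalls) / k,
            "f1": sum(f1s) / k}
```

Because each class contributes equally regardless of its sample count, macro averaging rewards balanced performance across rare and common disease categories alike.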

4.5 Comparative experiment

To comprehensively validate the effectiveness of the proposed CPTR model, comparative experiments were conducted on three crop pest and disease datasets: AppleLeaf9, Cassava, and Cashew. All results are averages of three independent runs. The comparison models include traditional convolutional neural networks (VGG19, ResNet152, DenseNet161) as well as the recent high-performing Vision Transformer and Swin Transformer, to ensure the comprehensiveness and objectivity of the evaluation. The experimental results are shown in Table 3.

Table 3. Comparison of Experimental Results (mean ± std over three independent runs).

https://doi.org/10.1371/journal.pone.0343715.t003

Overall, while traditional CNN models and transformer-based models performed well on some datasets, they still fell short of the CPTR model. Specifically, on the most complex AppleLeaf9 dataset, which has subtle inter-class differences, CPTR demonstrated a significant advantage: CPTR-VGG19 achieved an accuracy of 83.74%, nearly 6% higher than the corresponding VGG19 model, with across-the-board improvements in precision, recall, and F1 score. This indicates that contrastive learning effectively enhanced the model’s ability to capture fine-grained disease features. On the Cassava dataset, CPTR-DenseNet161 achieved an accuracy of 94.80%, outperforming all baseline models, including the relatively strong Swin Transformer (94.73%), further confirming that CPTR retains its advantage even when class boundaries are clear. On the Cashew dataset, CPTR-DenseNet161 achieved the highest accuracy of 96.01% and also led all other models on the remaining metrics, showcasing its high recognition stability and generalization ability.

In summary, CPTR outperformed traditional CNN and transformer models across all three datasets, with a particular advantage in fine-grained recognition tasks. This success is attributed to CPTR’s combination of SimCLR contrastive learning and the Concept Prototype Tree structure, which enhances the model’s feature expression ability while maintaining strong interpretability and robustness.

4.6 Ablation experiment

To further analyze the model’s performance, this paper compares the classification results of Prototype Tree and CPTR on the three crop pest and disease datasets before and after the introduction of the SimCLR contrastive learning module. The experimental results are shown in Table 4 (using DenseNet161 for feature extraction as an example). The introduction of SimCLR had a positive impact on the model’s accuracy and other evaluation metrics.

On the AppleLeaf9 dataset, CPTR’s accuracy increased from 78.66% to 80.48%, with significant improvements in precision, recall, and F1 score, indicating that contrastive learning enhanced the model’s discriminative ability for complex disease features. On the Cassava dataset, although Prototype Tree already showed high performance, after introducing SimCLR, CPTR still achieved improvements in accuracy, precision, recall, and F1 score, further validating the effectiveness of contrastive learning in enhancing feature expression. On the Cashew dataset, CPTR also achieved a small performance improvement compared to the baseline model, indicating that SimCLR contrastive learning provides consistent optimization across different datasets.

Overall, the addition of SimCLR made the model more robust and discriminative across all evaluation metrics, providing higher accuracy and interpretability for deep learning models in crop pest and disease identification tasks.

4.7 Confusion matrix

This paper conducts a confusion matrix analysis of the CPTR model’s prediction results on the AppleLeaf9, Cassava, and Cashew datasets. The confusion matrix provides an intuitive way to reflect the model’s classification accuracy across different categories, as well as the specific category combinations that are prone to confusion. It is an essential tool for evaluating the model’s fine-grained recognition ability.

From the Fig 6 results, it can be observed that CPTR exhibits a highly concentrated diagonal distribution across all three datasets, indicating that the model consistently classifies the majority of the samples correctly. For categories with a large number of samples or distinct feature patterns (such as Mosaic in Cassava, Healthy in Cashew, and the categories in AppleLeaf9 with clear lesion features), the model shows near-perfect classification performance, with very few misclassifications. In contrast, some categories with high similarity or ambiguous boundaries, such as the similar-looking Brown Spot and Grey Spot in AppleLeaf9 (classes 8 and 9), or certain spot diseases in Cashew, still show a small amount of confusion. However, the overall error rate is noticeably lower than that of Prototype Tree and other comparison models, indicating that CPTR has stronger feature expression capability for fine-grained distinctions.

At the same time, after the inclusion of SimCLR, CPTR demonstrated clearer inter-class boundaries across multiple datasets, with misclassifications being more concentrated in a few hard-to-differentiate categories. There were no systemic biases or large-scale misclassifications. This suggests that contrastive learning effectively enhanced the model’s sensitivity to key lesion textures, shapes, and color variations, improving the model’s robustness in complex disease scenarios.

In conclusion, the confusion matrix analysis further confirms that CPTR exhibits stable and reliable classification ability in crop pest and disease identification tasks.

4.8 Interpretable analysis

In addition to its accuracy advantage, CPTR also offers excellent model decision interpretability. As shown in Fig 7, CPTR uses the Concept Prototype Tree structure to visually present the model’s decision-making process in a tree-like form. The figure illustrates the concept prototype decision tree trained on the Cassava dataset (tree height h = 3), where the image at each node represents a concept prototype, and the leaf nodes correspond to specific pest and disease categories. Through this tree-based prototype display, the model’s decision path is clearly revealed: during each binary decision, the model progresses along different branches based on the “presence” or “absence” of certain feature patterns in the input image, until it reaches the leaf node to give the final classification result. Each path from the root node to the leaf node corresponds to a human-understandable decision logic.

To further enhance the interpretability of the tree model, CPTR pruned the prototype tree after training: redundant branches with unclear decision boundaries were removed, leaving only prototype nodes with more certain decisions. After pruning, the model’s prediction accuracy remained almost unchanged, but the number of concept prototypes in the tree was significantly reduced, and the decisions at each leaf node became clearer. By visualizing the concept tree, users can intuitively understand how the model gradually makes decisions based on specific image features, greatly increasing trust in the model’s predictions.

In summary, the CPTR model not only achieves high-accuracy recognition but also provides transparent decision-making grounds by integrating the prototype tree structure, striking a good balance between performance and interpretability in crop pest and disease identification tasks.

As illustrated in Fig 8, the input image is first processed by the backbone network to extract feature representations, which are then propagated through the prototype tree along a similarity-driven path. At each internal node, the model selects the most relevant branch based on the matching relationship between the current feature representation and the corresponding prototype, thereby progressively narrowing the decision space. This process intuitively reflects the model’s hierarchical discrimination from low-level local visual patterns to high-level semantic concepts. Finally, the sample reaches a leaf node and outputs the predicted class, which is consistent with the class associated with the sequence of prototype matches along the decision path. This example demonstrates how the prototype-based reasoning process directly supports the model’s final decision, providing an interpretable explanation for an individual prediction while maintaining strong classification performance.

Fig 8. An illustrative decision path of the CPTR model for a single sample.

https://doi.org/10.1371/journal.pone.0343715.g008
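The similarity-driven routing described above can be sketched as a walk down a binary tree. The node structure, the exp(-distance) similarity, and the 0.5 branching threshold are illustrative assumptions for a hard-routing variant, not the authors' exact implementation.

```python
import torch

def route_to_leaf(tree, features):
    """Route a feature map (patches x D) down a binary prototype tree.
    At each internal node, branch right when the closest patch is
    sufficiently similar to the node's prototype, otherwise branch left.
    Returns the leaf label and the decision path for inspection."""
    node, path = tree, []
    while not node["is_leaf"]:
        # Squared Euclidean distance from the prototype to every patch,
        # converted to a similarity in (0, 1] via exp(-d).
        d = torch.min(((features - node["prototype"]) ** 2).sum(dim=-1))
        similarity = torch.exp(-d)
        go_right = bool(similarity > 0.5)
        path.append((node["name"], similarity.item(), go_right))
        node = node["right"] if go_right else node["left"]
    return node["label"], path
```

The returned `path` is exactly the kind of per-sample explanation Fig 8 visualizes: which prototypes were matched, how strongly, and which branch each match selected.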

To quantitatively evaluate the consistency between the model’s prediction and its interpretable decision process, we adopt the Fidelity metric. For an input sample $x_i$, let $\hat{y}_i$ denote the final predicted class produced by the CPTR model, and let $\tilde{y}_i$ denote the class inferred from the corresponding prototype tree decision path, i.e., the class associated with the leaf node selected by the prototype tree for sample $x_i$. Fidelity is defined as

$$\mathrm{Fidelity} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = \tilde{y}_i\right] \tag{13}$$

where $N$ is the number of test samples and $\mathbb{1}[\cdot]$ is the indicator function, which equals 1 if its argument is true and 0 otherwise. A higher Fidelity value indicates stronger agreement between the model’s final output and the reasoning outcome provided by the prototype tree; in other words, Fidelity measures how often the final prediction of CPTR matches the class implied by its prototype-tree decision path, quantifying the consistency between the model’s prediction and its interpretable reasoning process.
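Eq. (13) reduces to a simple agreement rate between the two prediction streams:

```python
def fidelity(model_preds, tree_preds):
    """Eq. (13): fraction of test samples whose final model prediction
    matches the class implied by the prototype-tree decision path."""
    assert len(model_preds) == len(tree_preds), "prediction lists must align"
    agree = sum(y == t for y, t in zip(model_preds, tree_preds))
    return agree / len(model_preds)
```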

The experimental results are shown in Table 5. The results show that CPTR achieved higher fidelity on both the Apple and Cassava datasets (97.42% and 98.80%, respectively), showing an improvement over Prototype Tree. This indicates that after incorporating contrastive learning, the features learned by the model are more robust, and the decision path is more consistent with the prototype matching process. On the Cashew dataset, CPTR’s fidelity is slightly lower than that of Prototype Tree, but it still remains at a very high level, suggesting that its interpretability is still reliable.

Table 5. Comparison of fidelity across different models on datasets.

https://doi.org/10.1371/journal.pone.0343715.t005

5 Conclusion

In this paper, to address the problem of insufficient model interpretability in crop pest and disease identification, we propose an interpretable identification algorithm that integrates SimCLR contrastive learning with a conceptual prototype tree structure: the Contrastive Prototype Tree (CPTR). The algorithm uses SimCLR contrastive learning to enhance feature expressiveness and improves the transparency of the decision-making process through the conceptual prototype tree, strengthening interpretability while preserving recognition accuracy. CPTR was evaluated on three crop pest and disease image datasets, AppleLeaf9, Cashew, and Cassava. The results show that the model provides global decision interpretability through the tree structure while maintaining high accuracy, giving a clear explanation of the decision path for each prediction. In summary, CPTR demonstrates excellent performance and good interpretability in the crop pest and disease recognition task, providing strong technical support and new research directions for interpretable deep learning in agriculture. In future work, we will further validate CPTR on more diverse, large-scale field datasets and explore lightweight architectures to support deployment on edge devices for real-time diagnosis.

References

  1. 1. Wang C, Wang X, Jin Z, Müller C, Pugh TAM, Chen A, et al. Occurrence of crop pests and diseases has largely increased in China since 1970. Nat Food. 2022;3(1):57–65. pmid:37118481
  2. 2. Ratnadass A, Fernandes P, Avelino J, Habib R. Plant species diversity for sustainable management of crop pests and diseases in agroecosystems: a review. Agron Sustain Dev. 2011;32(1):273–303.
  3. 3. Donatelli M, Magarey RD, Bregaglio S, Willocquet L, Whish JPM, Savary S. Modelling the impacts of pests and diseases on agricultural systems. Agric Syst. 2017;155:213–24. pmid:28701814
  4. 4. Oerke E-C. Crop losses to pests. J Agric Sci. 2005;144(1):31–43.
  5. 5. Zhaoyu Z, Yifei C, Huanliang X, Peisen Y, Haoyun W. Review of key techniques for crop disease and pest detection. Nongye Jixie Xuebao/Transact Chinese Soc Agricul Mach. 2021;52(7).
  6. 6. Türkoğlu M, Hanbay D. Plant disease and pest detection using deep learning-based features. Turk J Elec Eng Comp Sci. 2019;27(3):1636–51.
  7. 7. Kotwal J, Kashyap DrR, Pathan DrS. Agricultural plant diseases identification: from traditional approach to deep learning. Mat Today: Proceed. 2023;80:344–56.
  8. 8. Baruffaldi S, van Beuzekom B, Dernis H, Harhoff D, Rao N, Rosenfeld D. Identifying and measuring developments in artificial intelligence: making the impossible possible. 2020.
  9. 9. Goralski MA, Tan TK. Artificial intelligence and sustainable development. Inter J Manag Edu. 2020;18(1):100330.
  10. 10. Rahman CR, Arko PS, Ali ME, Khan MAI, Wasif A, Jani MR. Identification and recognition of rice diseases and pests using deep convolutional neural networks. ArXiv. 2018.
  11. 11. Fan F-L, Xiong J, Li M, Wang G. On interpretability of artificial neural networks: a survey. IEEE Trans Radiat Plasma Med Sci. 2021;5(6):741–60. pmid:35573928
  12. 12. Broniatowski DA, Broniatowski DA. Psychological foundations of explainability and interpretability in artificial intelligence. US Department of Commerce, National Institute of Standards and Technology; 2021.
  13. 13. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International conference on computer vision, 2017. 618–26.
  14. 14. Koh PW, Nguyen T, Tang YS, Mussmann S, Pierson E, Kim B. Concept bottleneck models. In: International conference on machine learning, 2020. 5338–48.
  15. 15. Chen C, Li O, Tao D, Barnett A, Rudin C, Su JK. This looks like that: deep learning for interpretable image recognition. Adv Neural Inform Proces Syst. 2019;32.
  16. 16. Nauta M, Van Bree R, Seifert C. Neural prototype trees for interpretable fine-grained image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. 14933–43.
  17. 17. Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. In: International conference on machine learning, 2020. 1597–607.
  18. 18. Zhang J, Huang Y, Pu R, Gonzalez-Moreno P, Yuan L, Wu K, et al. Monitoring plant diseases and pests through remote sensing technology: a review. Comput Electro Agricul. 2019;165:104943.
  19. 19. Fuentes A, Yoon S, Kim SC, Park DS. A robust deep-learning-based detector for real-time tomato plant diseases and pests recognition. Sensors (Basel). 2017;17(9):2022. pmid:28869539
  20. 20. Tetila EC, Machado BB, Astolfi G, Belete NA de S, Amorim WP, Roel AR, et al. Detection and classification of soybean pests using deep learning with UAV images. Computers and Electronics in Agriculture. 2020;179:105836.
  21. 21. Devaraj A, Rathan K, Jaahnavi S, Indira K. Identification of plant disease using image processing technique. In: 2019 International Conference on Communication and Signal Processing (ICCSP), 2019. 0749–53. https://doi.org/10.1109/iccsp.2019.8698056
  22. 22. Martos V, Ahmad A, Cartujo P, Ordoñez J. Ensuring Agricultural Sustainability through Remote Sensing in the Era of Agriculture 5.0. Applied Sciences. 2021;11(13):5911.
  23. 23. Barbedo JGA. Impact of dataset size and variety on the effectiveness of deep learning and transfer learning for plant disease classification. Comput Electron Agricul. 2018;153:46–53.
  24. 24. Doshi-Velez F, Kim B. Towards a rigorous science of interpretable machine learning. 2017. https://arxiv.org/abs/1702.08608
  25. 25. Ribeiro MT, Singh S, Guestrin C. Why should I trust you? Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. 1135–44.
  26. 26. Shrikumar A, Greenside P, Kundaje A. Learning important features through propagating activation differences. In: International conference on machine learning, 2017. 3145–53.
  27. 27. Gopalan K, Srinivasan S, Singh M, Mathivanan SK, Moorthy U. Corn leaf disease diagnosis: enhancing accuracy with resnet152 and grad-cam for explainable AI. BMC Plant Biol. 2025;25(1).
  28. 28. Zeng J, Jia B, Song C, Ge H, Shi L, Kang B. CDPNet: a deformable ProtoPNet for interpretable wheat leaf disease identification. Frontiers in Plant Science. 2023;16:1676798.
  29. 29. Mensah PK, Akoto-Adjepong V, Adu K, Ayidzoe MA, Bediako EA, Nyarko-Boateng O, et al. CCMT: Dataset for crop pest and disease detection. Data Brief. 2023;49:109306. pmid:37360671
  30. 30. Yang Q, Duan S, Wang L. Efficient identification of apple leaf diseases in the wild using convolutional neural networks. Agronomy. 2022;12(11):2784.
  31. 31. Hughes D, Salathé M. An open access repository of images on plant health to enable the development of mobile disease diagnostics. arXiv preprint. 2015. https://arxiv.org/abs/1511.08060
  32. 32. Sun Z, Feng Z, Chen Z. Highly accuracy and lightweight detection of apple leaf diseases based on YOLO. 2024.
  33. 33. Feng J. Apple tree leaf disease segmentation dataset. 2022.
  34. 34. Thapa R, Zhang K, Snavely N, Belongie S, Khan A. The plant pathology challenge 2020 data set to classify foliar disease of apples. Appl Plant Sci. 2020;8(9):e11390. pmid:33014634
  35. 35. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint. 2014. https://arxiv.org/abs/1409.1556
  36. 36. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016. 770–8.
  37. 37. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017. 4700–8.
  38. 38. Dosovitskiy A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint. 2020. https://arxiv.org/abs/2010.11929
  39. 39. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021. 10012–22.