Abstract
Object identification has been widely used in many applications, relying on annotated data with bounding boxes that specify each object's exact location and category in images and videos. However, relatively little research has addressed identifying plant species in their natural environments. Natural habitats play a crucial role in preserving biodiversity, ecological balance, and overall ecosystem health, so effective monitoring is necessary for safeguarding them; one way of doing this is by identifying the typical and early warning plant species. Our study quantitatively evaluates the performance of six popular object detection models on our dataset collected in the wild, comprising various plant species from four habitats: screes, dunes, grasslands, and forests. The dataset employed in this work includes data collected by human operators and by the quadrupedal robot ANYmal C. Pre-trained object detection models were chosen for the experiments and fine-tuned on our dataset to achieve better performance. These models comprise two one-stage (RetinaNet and YOLOv8n), two two-stage (Faster RCNN and Cascade RCNN), and two transformer-based detectors (DETR and Deformable DETR). Extensive experiments were performed on the four habitat datasets, applying class balancing and hyperparameter tuning, and the obtained results are discussed.
Citation: Kaur P, Grassi A, Bonini F, Valle B, Borgatti MS, Rivieccio G, et al. (2025) Artificial vision models for the identification of Mediterranean flora: An analysis in four ecosystems. PLoS One 20(9): e0327969. https://doi.org/10.1371/journal.pone.0327969
Editor: Najib Ben Aoun, Al Baha University, SAUDI ARABIA
Received: July 28, 2024; Accepted: June 24, 2025; Published: September 5, 2025
Copyright: © 2025 Kaur et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The dataset is openly available in the Zenodo repository (https://zenodo.org/records/11504938; DOI: 10.5281/zenodo.11504938).
Funding: This research was supported by Grant agreement No. 101016970, European Union’s Horizon 2020 Research and Innovation Programme - ICT-47-2020. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Habitats are essential in preserving biodiversity, reflecting the intricate interplay between living organisms and their surrounding environment. With the growing impact of human activities (deforestation, fossil fuel burning, pollution, trampling, etc.) on natural habitats, it has become essential to have precise and timely information regarding the condition of ecosystems. Recognised as pivotal by various researchers [1–5], habitats have been singled out by the European Commission (EC) as essential targets for efficiently assessing the state of nature conservation, as outlined in the 92/43/EEC “Habitats” Directive [6,7]. This directive stands as a cornerstone of European biodiversity conservation policy [8,9], notably through its Annex I, which identifies “Natural habitat types of Community interest whose conservation requires the designation of Special Areas of Conservation” [10]. In Europe, the conservation status of Annex I habitats is scheduled for monitoring every six years, guided by official reference guidelines established at both European and national levels [11–13]. These guidelines encompass criteria such as the habitat’s distribution, structure, functions, and the conservation status of its ‘typical’ species, which are species serving as indicators for habitat conservation status [6]. These ‘typical’ species act as proxies, offering insights into environmental changes, including those known as ‘sentinels’ [14–16].
While artificial intelligence is increasingly employed across various applications, monitoring natural habitats predominantly relies on field observations by human experts, particularly in terrestrial environments [17]. However, several challenges hinder the effective monitoring of habitat conservation status as outlined in the Habitats Directive (92/43/EEC) [6,7]: (1) reliance on human operators as the primary monitoring option, (2) subjectivity introduced by human involvement, which may lead to inconsistency, and (3) limited time windows for habitat monitoring, requiring additional professional surveyors as the number of habitats increases. To address these challenges, state-of-the-art deep learning techniques can automate habitat monitoring processes or support human experts in the field. So, we curated image data of target species, including typical species (TS), alien species (AS), and early warning species (EWS), from four habitats: screes, dunes, grasslands, and forests. The presence of TS signifies a favourable habitat conservation status, whereas the presence of EWS and AS indicates potential threats to habitat health [17]. The dataset used in the current experimental analysis combines data collected by human operators and by the quadrupedal robot ANYmal C [18] during habitat monitoring missions carried out by plant experts and robotics engineers. Fig 1 shows the quadrupedal robot ANYmal C in the screes habitat and highlights the sensors utilised for data acquisition across multiple tasks [19]. Among these sensors, the RGB-D cameras were specifically used to acquire the data employed in this study.
ANYmal C is equipped with various sensors, including four Intel RealSense D435 RGB-D cameras for capturing high-resolution images. These sensors enable a large-scale and time-effective data acquisition that can be later used for habitat monitoring.
After data collection, domain experts annotated all the human- and robot-collected images with bounding boxes using the Labelbox tool [20] to prepare them for automatic plant species localisation and classification. Our aim is to monitor natural habitat conservation status by identifying the target species within the four habitats, thereby aiding humans in assessing habitat condition. This article therefore presents a comprehensive quantitative analysis of six object detection models, pre-trained on the COCO dataset [21], on our plant species dataset from four habitats, namely, screes, dunes, grasslands, and forests. The six models include two one-stage object detectors: YOLOv8 [22] and RetinaNet [23]; two two-stage detectors: Faster RCNN [24] and Cascade RCNN [25]; and two transformer-based detectors: DETR [26] and Deformable DETR [27]. These deep learning models were chosen for their distinct strengths: one-stage detectors for speed and efficiency, two-stage detectors for high accuracy and refined object localisation, and transformer-based models for handling complex scenes and robust feature extraction. This enabled us to address, to some extent, the diverse challenges in plant species identification, such as class imbalance, overlapping objects, and the need for precise localisation. YOLOv8 was implemented separately from the other detectors, which were implemented in the MMDetection 2.x toolbox [28]. The data annotations were exported from the Labelbox tool in the COCO format using a Python script and then converted to the YOLO format for YOLOv8. This study focuses on the performance analysis of the six detection models on our plant species dataset collected in the wild by human operators and the quadrupedal robot ANYmal C.
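The COCO-to-YOLO annotation conversion mentioned above amounts to re-expressing each absolute-pixel box in normalised centre coordinates. A minimal sketch (the function name and values are illustrative, not taken from the paper's export scripts):

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO box [x_min, y_min, width, height] in pixels to a
    YOLO box [x_center, y_center, width, height] normalised to [0, 1]."""
    x, y, w, h = bbox
    return [(x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h]

# e.g. a 100x50 px box at (10, 20) in a 200x100 px image
print(coco_to_yolo([10, 20, 100, 50], 200, 100))  # [0.3, 0.45, 0.5, 0.5]
```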
Contributions
The significant contributions of the proposed study are as follows:
- An extensive experimental analysis of six object detectors (belonging to three different types) for identifying plant species in four diverse habitats.
- Performance evaluation of the object detection methods, excluding YOLOv8, after introducing class balancing using the ClassBalancedDataset wrapper in the MMDetection toolbox [28].
- YOLOv8 efficacy analysis on test data from all four habitats after hyperparameter tuning. The implementation details of our experimental study are available at https://github.com/parminder1050/Plant_species_identification.git.
Related work
This section includes various research works in diverse application areas, such as plant disease detection, identification of active landslides and marine ship targets, identification of plant and weed species, etc., incorporating the six methods (Faster RCNN, Cascade RCNN, RetinaNet, YOLOv8, DETR, and Deformable DETR) used in this experimental study. Table 1 presents a broad qualitative comparison of the utilised object detection methods based on specific parameters.
Detecting plant diseases is challenging because: (1) normal and diseased plant leaves look similar; and (2) colour, intensity, shape, and size vary significantly across both the background and foreground of plant images. To overcome these issues, Nawaz et al. [29] introduced a Faster RCNN employing a VGG-19 network to detect and categorise different plant diseases. The proposed technique was tested on the PlantVillage data repository and proved promising in recognising plant disease under diverse image-capturing scenarios. A real-time Faster RCNN has been utilised in [30] for tomato plant leaf disease detection. Saleem et al. [31] extended their previous research on weed detection by performing an in-depth analysis of the Faster RCNN technique with ResNet-101, as it was the best-performing method among those compared. After this investigation, they enhanced Faster RCNN by incorporating diverse anchor box scales and aspect ratios to improve weed detection performance, especially for the Chinee apple weed class. Identification of diabetic retinopathy has been performed using Faster RCNN with feature fusion (a combination of multiple image feature extractors) [32]. Cia et al. [33] employed an improved Faster RCNN based on an attended ResNet-34 and feature pyramid networks to identify active landslides over vast areas.
Cascade RCNN has been used for identifying marine ship targets [34]. Detecting ship targets is challenging due to harsh conditions such as fog, snow, and rain, and because the targets are usually occluded or small; here, Cascade RCNN was compared with Faster RCNN and found to perform better. Mo et al. [35] proposed an optimisation model for defect detection on steel surfaces based on Cascade RCNN, termed CR-SSDD, designed to identify various categories of defects across multiple scales on the steel surface. Cao et al. [36] proposed an improved Cascade RCNN method for detecting ceramic tile surface defects in diverse texture backgrounds; improved ResNet and feature pyramid network variants are utilised to identify low-contrast and small-scale defects. Tang et al. [37] introduced a Cascade RCNN approach with a ResNet-50 backbone to identify athletes in soccer videos, a burdensome task due to blurring and occlusion; a hybrid attention module was proposed to enhance the occluded and blurred features in the feature pyramid. Cao et al. [38] performed multi-scale face mask identification using an improved Cascade RCNN technique dubbed MFMDet. It utilises a recursive feature pyramid to handle the multi-scale features extracted by the backbone network, enlarging the receptive field and improving the representation of multi-scale features, thereby achieving effective detection, particularly for small objects.
Weeds pose a significant threat to both the yield and quality of crops, and automating weed detection is challenging due to the overlapping presence of crops and weeds. Peng et al. [39] therefore proposed a RetinaNet-based technique to detect weeds in paddy fields; a dataset of rice plants and eight different categories of weeds was created through on-site photography and web crawling. Vecvanags et al. [40] proposed two algorithms based on RetinaNet and Faster RCNN to detect ungulates in images. A new dataset of wild ungulates was compiled in Latvia, and training optimisation and the effect of data augmentation on algorithm performance were also discussed. Miao et al. [41] proposed an improved lightweight RetinaNet for fast and accurate ship detection in Synthetic Aperture Radar (SAR) images, adjusting the architecture's initial aspect ratios with K-means clustering. Mahum et al. [42] introduced Lung-RetinaNet, a RetinaNet-based technique to detect lung tumours.
Accurate identification of plant traits is essential for timely monitoring and assessing their growth and readiness for harvest. Solimani et al. [43] proposed the use of the YOLOv8 model to proficiently detect flowers, fruits, and nodes in tomato plants. They have also addressed the challenge of uneven distribution of samples, potentially resulting in misclassification and disruptions in model recognition. Brucal et al. [44] performed a tomato leaf disease detection using YOLOv8 to identify nine diseases. Yang et al. [45] utilised YOLOv8 for plant leaf detection, reducing the background interference for further leaf segmentation tasks. Wang et al. [46] performed an automatic vegetable disease detection in a greenhouse environment using an improved YOLOv8 model. The proposed approach shows promising results on the self-composed vegetable disease detection dataset. Uddin et al. [47] proposed a Cauli-Det approach based on an improved YOLOv8 model for automatic cauliflower disease localisation and classification. They utilised images taken with smartphones and handheld devices for experimentation. A fine-tuned pre-trained YOLOv8 architecture has been used to detect regions affected by diseases and extract spatial features for localising and categorising the diseases.
Zhong et al. [48] proposed a DETR-based human-object interaction detection technique, increasing DETR's robustness by identifying hard-positive queries, which are required to make accurate predictions from partial visual cues. Kumar et al. [49] utilised the DETR model for target detection; fine-tuned on a custom dataset, it showed a noticeable improvement in the number of training epochs required, both visually and statistically. Yuan et al. [50] performed sea cucumber detection using YOLOv5 and DETR and compared the performance of both architectures. CNN-based methods are unreliable in accurately identifying vertebral levels, so Tang et al. [51] proposed the use of DETR for lamina detection, spine curve measurement, and vertebral level identification, with three significant goals: (1) automatic identification of lamina pairs, (2) spinal curvature assessment, and (3) vertebral level recognition. Cheng et al. [52] applied DETR to insulator defect detection by automatically detecting faults in UAV-captured insulator images. Transfer learning was employed to train the high-performing DETR model with a minimally sized insulator image dataset, and an improved loss function was integrated to compensate for DETR's limitations in detecting small objects at precise scales.
Wang et al. [53] utilised Deformable DETR for farmland obstacle detection from the UAV's perspective. Deformable DETR's ability to capture long-range dependencies is somewhat constrained by local receptive fields and local self-attention mechanisms, so a global modelling capability (inspired by non-local neural networks) was incorporated into the front end of ResNet to improve its performance. Airport security systems typically use radar-based detection to monitor UAVs and birds within the clearance area, but they often struggle to accurately identify the type and number of UAVs as well as the size and quantity of birds. In light of this issue, Shanliang et al. [54] proposed a Deformable DETR based detection method for UAVs and birds to enhance airport supervision of the clearance area and mitigate safety risks. Wang et al. [55] proposed a students' classroom behaviour detection system based on Deformable DETR with a Swin Transformer and a lightweight Feature Pyramid Network; the feature pyramid structure lets the system efficiently handle the multi-scale feature maps extracted by the Swin Transformer, enhancing detection accuracy for targets of varying sizes and scales.
Methodology
This section comprises various subsections, such as Techniques, Model training, Evaluation metrics, Dataset, and Experimental setup, related to our experimental study. The Techniques subsection incorporates a detailed description of the three broad model categories (Two-stage detection, One-stage detection, and Transformer-based detection algorithms) and the six methods under these categories (two of each type). The other subsections provide the implementation details and the dataset description.
Techniques
- Two-stage detectors: Two-stage detection algorithms are a class of object detection algorithms characterised by two primary stages: region proposal and object detection. In the initial stage, these algorithms generate a collection of candidate object regions or proposals within the input image. Subsequently, they conduct classification and refinement of these proposals to identify objects and their precise locations. These algorithms typically amalgamate various techniques, encompassing feature extraction, region proposal, classification, and bounding box regression. The two two-stage detectors used in this study are as follows:
- (a) Faster RCNN: Faster RCNN (Region-based Convolutional Neural Network) [24] revolutionised object detection by integrating Region Proposal Networks (RPNs) directly into the detection pipeline. This two-stage approach generates region proposals and fine-tunes them through a shared convolutional backbone, enabling streamlined end-to-end training. By leveraging anchor boxes for proposal generation and parameter sharing, Faster RCNN achieves notable enhancements in both accuracy and speed, facilitating real-time object detection across diverse applications.
- (b) Cascade RCNN: Cascade RCNN [25] elevates object detection capabilities by utilising a cascade of detectors, which iteratively refine object proposals. This multi-stage approach involves progressively filtering out false positives and improving localisation accuracy. Cascade RCNN comprises multiple stages, each featuring a dedicated detector trained to address specific aspects of the detection task. During inference, proposals meeting the threshold in the initial stage undergo further refinement and evaluation in subsequent stages. Through the iterative refinement of object proposals, Cascade RCNN mostly achieves superior detection accuracy compared to single-stage detectors.
- One-stage detectors: One-stage detection algorithms predict bounding boxes and class labels for objects within an input image in a single network pass. In contrast to two-stage detection algorithms, they do not require a separate region proposal step. Typically comprising a single neural network architecture, these algorithms take an input image and produce a collection of bounding boxes, along with their associated class labels and confidence scores. They optimise an end-to-end objective function, which integrates classification and localisation losses, enabling them to directly predict object locations and categories from the input image. The two one-stage detectors utilised in the current study are as follows:
- (a) RetinaNet: RetinaNet [23] introduces a single-stage object detection framework to overcome challenges encountered in object detection tasks, including class imbalance and localisation accuracy. Key to RetinaNet’s success is its novel focal loss function, which effectively handles the imbalance between foreground and background samples during training. The network architecture encompasses a feature pyramid network (FPN) backbone for multi-scale feature extraction, coupled with convolutional layers that predict class probabilities and bounding box offsets at each feature pyramid level.
- (b) YOLOv8: Introduced by the Ultralytics team [22], YOLOv8 represents the latest iteration of the renowned YOLO series of detection algorithms. It is an open-source, cutting-edge model distributed under the General Public License [56]. Following minor modifications to the YOLOv3 model, Glenn Jocher introduced YOLOv5 [57], paving the way for YOLOv8, an enhanced iteration of YOLOv5. YOLOv8 introduces numerous improvements to enhance detection accuracy, speed, and robustness. It adopts a deep neural network architecture capable of predicting bounding boxes and class probabilities directly from full-sized input images in a single pass. Notable enhancements in YOLOv8 include anchor-free detection, mosaic augmentation, and updates to the convolution blocks within the model, such as replacing the C3 module with the C2f module. YOLOv8 model structure summary and the enhancements from YOLOv5 can be found in [58].
- Transformer-based detectors: Transformer-based detection algorithms employ transformer architectures as the foundation for feature extraction and processing. Originally devised for natural language processing tasks, transformers have recently gained popularity in computer vision applications, particularly object detection. These algorithms harness the transformer architecture, comprising self-attention mechanisms and multi-layer perceptrons, to effectively capture spatial relationships and dependencies among image features. Unlike conventional convolutional neural networks (CNNs), transformers do not rely on fixed-size receptive fields, enabling them to capture long-range dependencies throughout the entire input image. Critical components of transformer-based detection algorithms encompass the transformer backbone, which is responsible for extracting high-level features from the input image while capturing spatial relationships and contextual information across the image, the object detection head, and the training phase, during which the loss function typically integrates terms for object classification and bounding box regression. The two transformer-based detectors used in this study are as follows:
- (a) DETR: DETR (DEtection TRansformer) [26] represents a groundbreaking object detection framework that leverages transformer-based architectures to conduct end-to-end object detection without the need for traditional anchor-based region proposal networks (RPNs) or non-maximum suppression (NMS). The fundamental innovation of DETR resides in its direct set prediction mechanism, wherein it concurrently forecasts object queries alongside their associated class labels and bounding boxes utilising transformer-based encoder-decoder architecture. This transformative approach enables DETR to tackle challenges inherent in conventional object detection methods, such as fixed anchor boxes, arbitrary object counts, and imbalanced class distributions. By leveraging transformers’ ability to capture global context and relationships between objects, DETR achieves competitive performance while offering a unified and flexible framework for object detection tasks.
- (b) Deformable DETR: Deformable DETR [27] is an expansion of the original DETR framework. It elevates object detection performance by integrating deformable attention mechanisms into the transformer-based architecture. This innovative approach empowers the model to dynamically adjust the spatial sampling locations of features, facilitating more adaptable and versatile feature extraction. Deformable DETR effectively captures object deformations, scale variations, and occlusions by integrating deformable attention, resulting in enhanced localisation accuracy and resilience in intricate scenes. Additionally, Deformable DETR preserves the end-to-end nature of the DETR framework, enabling seamless integration into existing detection pipelines while achieving state-of-the-art performance in various object detection benchmarks.
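DETR's direct set prediction (described above) pairs predictions with ground-truth objects through a minimum-cost bipartite matching; the ground-truth set is padded with "no object" entries so the matching is one-to-one. The toy sketch below uses brute force over permutations for clarity, whereas DETR solves the same problem efficiently with the Hungarian algorithm; the cost-matrix values are invented:

```python
from itertools import permutations

def bipartite_match(cost):
    """Return the minimum-cost one-to-one assignment between N predictions
    (rows) and N targets (columns) as a list of (row, col) pairs.
    Brute force, O(N!), for illustration only."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(enumerate(best))

# Prediction 0 matches target 1 cheaply, and vice versa.
print(bipartite_match([[9.0, 1.0], [2.0, 8.0]]))  # [(0, 1), (1, 0)]
```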
Model training
Model training and implementation were performed on NVIDIA GeForce RTX 2080 Ti and NVIDIA Tesla V100 GPUs. All the models were pre-trained on the COCO dataset [21] and fine-tuned on our dataset.
Evaluation metrics
The evaluation metrics utilised are the typical COCO object detection metrics: precision, recall, the confusion matrix, mAP50, mAP75, and mAP50-95 [59]. Precision measures the ratio of true positives to all positive predictions, indicating the model's capacity to minimise false positives. Recall measures the ratio of true positives to all actual positives, indicating the model's effectiveness in detecting every instance of a class. The confusion matrix provides a comprehensive view of the results, displaying the numbers of true positives, true negatives, false positives, and false negatives for each class. mAP50 denotes the mean Average Precision at an IoU (Intersection over Union) threshold of 0.5, mAP75 at an IoU threshold of 0.75, and mAP50-95 the average over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
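Under these metrics, a prediction counts as a true positive when its IoU with a matched ground-truth box meets the threshold (0.5 for mAP50, 0.75 for mAP75). A minimal IoU sketch for axis-aligned (x1, y1, x2, y2) boxes, as our own illustration:

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 unit of overlap / 7 of union
```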
Dataset
The dataset includes images of selected target species across four habitats: screes, dunes, grasslands, and forests. Details regarding the data collection process for each habitat (grasslands, screes, forests, and dunes) are provided in [19,60–62], respectively. Expert botanists annotated the collected images by delineating bounding boxes around the relevant plant species using the Labelbox tool [20]. Fig 2 presents sample images showcasing four species from each habitat. The dataset was carefully compiled by domain experts who visited the habitat locations and conducted field operations during the appropriate time of the year to capture plant blooming periods. The dataset used in the current study combines the data collected by human operators and by the robot. Table 2 lists the target plant species from each habitat considered for experimentation, and Table 3 outlines the distribution of training, validation, and testing data instances. Fig 3 shows the bounding box count of each category in the training datasets of screes, dunes, grasslands, and forests, respectively; a class imbalance is visible in the sub-figures for every habitat. The dataset is available at https://zenodo.org/records/11504938.
Experimental setup
The specific configuration of each model implemented in MMDetection is presented in Table 4. The columns include the technique, the backbone used, the type of network or the attention heads (AHs) and transformer layers (TLs) for transformer-based methods, the training schedule or duration, learning rate, optimizer, weight decay, momentum, and the maximum number of training epochs. The model configuration is the same for Faster RCNN, Cascade RCNN, and RetinaNet. The training schedule 1x refers to MMDetection's standard 12-epoch schedule, while the training durations 150e and 50e indicate that the model was trained for 150 and 50 epochs, respectively. Under the Network column, FPN stands for Feature Pyramid Network. The AdamW optimizer, a variant of Adam that integrates weight decay directly into its update step, was utilised for DETR and Deformable DETR. All the models were pre-trained on the COCO dataset.
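In MMDetection 2.x, settings of the kind listed in Table 4 are expressed in Python config files that inherit from a base config. The fragment below is a hypothetical illustration only; the file paths and exact values are ours, not the paper's actual configs:

```python
# Hypothetical MMDetection 2.x config: inherit the Faster RCNN R50-FPN 1x
# recipe and override the optimiser and schedule, starting from COCO weights.
_base_ = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'

optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
runner = dict(type='EpochBasedRunner', max_epochs=12)
load_from = 'checkpoints/faster_rcnn_r50_fpn_1x_coco.pth'  # COCO pre-trained
```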
Table 5 shows the YOLOv8 hyperparameter values used for model training across all the habitats. The first column lists the hyperparameters, and the second column shows the default values used in the YOLOv8 architecture, which are the same for all habitats. The subsequent columns present the hyperparameter values for two tuning settings: 200e, 10iter and 100e, 20iter, meaning that hyperparameter tuning was performed for 10 iterations of 200 epochs and for 20 iterations of 100 epochs, respectively. After obtaining the best hyperparameters for a particular setting, YOLO was trained separately for each habitat with those values for 200 epochs. With default parameters, YOLO was trained for 500 epochs with a patience of 50 (the number of epochs to wait without any improvement in the validation metrics before stopping training early).
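The patience mechanism described above can be sketched as follows; this is our simplified illustration of early stopping, not Ultralytics' actual implementation:

```python
def stopping_epoch(val_scores, patience=50):
    """Return the epoch at which training stops: either when `patience`
    consecutive epochs pass without improving the best validation score,
    or at the final epoch otherwise."""
    best, best_epoch = float('-inf'), 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # no improvement for `patience` epochs
    return len(val_scores) - 1

# Best score at epoch 2; with patience=5, training halts at epoch 7.
print(stopping_epoch([0.1, 0.2, 0.3] + [0.3] * 60, patience=5))  # 7
```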
Results and discussion
The results presented in this section were all obtained by testing the trained models on the test data; values in bold depict the highest value in the respective row. The mAP scores at three different IoU thresholds for the screes, dunes, grasslands, and forests habitats are presented in Table 6. YOLOv8 performs best for all the habitats except dunes, where RetinaNet has slightly higher mAP and mAP50 values and Cascade RCNN leads on mAP75. YOLOv8n denotes the nano variant of YOLOv8, the smallest model in the family. Cascade RCNN shows the second-best performance. The superior results of YOLOv8 and RetinaNet may stem from their use of the focal loss, which handles class imbalance [23]; Cascade RCNN also performs well thanks to its multi-stage architecture, which refines proposals in subsequent stages [25]. The transformer-based models (DETR and Deformable DETR) likely underperform because they require substantially more training data [27]; DETR performs worst, as it also requires a considerable number of training epochs to converge [26].
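The focal loss credited above down-weights easy examples so that training focuses on hard, rare ones. A minimal binary sketch of the published formulation FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), using the RetinaNet paper's defaults alpha = 0.25 and gamma = 2; the function and example values are our own illustration:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for predicted probability p and label y in {0, 1}.
    The (1 - p_t)**gamma factor shrinks the loss of well-classified examples."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct positive contributes far less than a hard one.
assert focal_loss(0.99, 1) < focal_loss(0.10, 1)
```

With gamma = 0 and alpha = 1, the expression reduces to the ordinary cross-entropy, which is a quick sanity check on the implementation.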
Table 7 shows the mAP scores after incorporating class balancing on all habitats using the ClassBalancedDataset wrapper in MMDetection. The oversample threshold (a float) was set to 0.1 for all the habitats, considering the instance (bounding box) count of each species in the training dataset of each habitat. This value sets a frequency threshold below which a category's data is repeated during training. Class balancing was not performed for YOLOv8, as the wrapper is specific to MMDetection and YOLO already has some built-in class balancing; the same YOLOv8 results are nevertheless included in Table 7 for comparison with the other models. We did not apply any separate class balancing techniques before implementing the models. The results obtained after applying the ClassBalancedDataset wrapper are mixed. On close inspection, the mAP scores after class balancing decreased primarily for grasslands and forests, possibly due to overfitting, as these habitats have fewer instances than screes and dunes. For screes and dunes, the scores mostly increased, except for Cascade RCNN. YOLOv8 results remain better than those of the other models (even without separate class balancing), except for the dunes habitat, where RetinaNet leads with YOLO close behind. Cascade RCNN's mAP scores dropped after class balancing, possibly due to overfitting or its complex architecture: Cascade RCNN refines proposals through a sequence of stages [25], and class balancing might interfere with this progressive refinement, particularly if some classes are over-represented in the initial stages, yielding sub-optimal proposals for subsequent stages.
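MMDetection's ClassBalancedDataset implements repeat-factor sampling: with threshold t = 0.1, a category appearing in a fraction f of the images gets repeat factor r = max(1, sqrt(t / f)), so only categories rarer than the threshold are oversampled. A sketch of that formula; the example frequencies below are invented:

```python
import math

def repeat_factor(freq, oversample_thr=0.1):
    """Repeat factor for a category appearing in `freq` of the images:
    categories above the threshold keep factor 1; rarer ones are repeated."""
    return max(1.0, math.sqrt(oversample_thr / freq))

# A common species is untouched; a rare one is repeated about twice.
print(repeat_factor(0.4))    # 1.0
print(repeat_factor(0.025))  # ~2.0
```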
As YOLOv8 outperformed the other models even after class balancing, it was chosen for hyperparameter tuning to improve performance further where possible. A separate YOLOv8 analysis therefore compares the results before (with default parameters) and after hyperparameter tuning (Table 8). YOLOv8 was run with default parameters for up to 500 epochs with a patience of 50, so the model trained for a different number of epochs on each habitat; these counts are given in brackets in the first row and third column of each habitat in Table 8. The third and fourth columns report the evaluation metrics after hyperparameter tuning with the tune method in YOLOv8 under two settings (described above): 200 epochs with 10 iterations and 100 epochs with 20 iterations, respectively. Results mostly improved after hyperparameter tuning, except for forests, where the default YOLOv8 parameters performed best.
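The tune method in YOLOv8 searches hyperparameters by repeatedly mutating the best configuration found so far and retraining for a fixed number of epochs per iteration. The following toy sketch captures this idea (a simplified stand-in, not the Ultralytics implementation; `fake_train` and `lr0` are illustrative):

```python
import random

def toy_tune(train_fn, base_hyp, iterations=10, epochs=100, seed=0):
    """Mutation-based search in the spirit of YOLOv8's tune method:
    each iteration perturbs the best hyperparameters found so far,
    trains briefly, and keeps the candidate if fitness (e.g. mAP)
    improves.  A simplified sketch, not the Ultralytics code.
    """
    rng = random.Random(seed)
    best_hyp = dict(base_hyp)
    best_fit = train_fn(best_hyp, epochs)
    for _ in range(iterations):
        cand = {k: v * rng.uniform(0.8, 1.2) for k, v in best_hyp.items()}
        fit = train_fn(cand, epochs)
        if fit > best_fit:
            best_hyp, best_fit = cand, fit
    return best_hyp, best_fit

# Stand-in "training" whose fitness peaks at lr0 = 0.01 (illustrative).
def fake_train(hyp, epochs):
    return -abs(hyp["lr0"] - 0.01)

best, fitness = toy_tune(fake_train, {"lr0": 0.02}, iterations=20)
```

The two settings in Table 8 trade off a similar budget differently: fewer, longer trainings (200 epochs, 10 iterations) versus more, shorter ones (100 epochs, 20 iterations).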
The final best results from all models for each habitat are shown in Table 9. These are accumulated from all the tables above, with or without class balancing and hyperparameter tuning. Although the final results are mixed, most of the best results involve class balancing or, in the case of YOLOv8, hyperparameter tuning. YOLO performed best among all models, so the class-wise evaluation metrics obtained from the best YOLOv8 model for each habitat are also reported in Tables 10–13. In addition to the mAP score, these tables report Precision and Recall values. The total number of test images for each habitat appears in the Test column of Table 3, and the Instances column gives the total number of ground-truth bounding boxes of each class in the test images. Figs 4–7 display the predictions of all models on screes, dunes, grasslands, and forests, along with the ground-truth labels, on a single example test image per habitat. Class-name initials are shown with the bounding boxes to keep the predictions legible; further examples are provided as supporting figures. The following inferences can be drawn from the predicted results:
(A) Ground truth bounding boxes. (B) Faster RCNN. (C) Cascade RCNN. (D) RetinaNet. (E) YOLOv8. (F) DETR. (G) Deformable DETR.
(A) Ground truth bounding boxes. (B) Faster RCNN. (C) Cascade RCNN. (D) RetinaNet. (E) YOLOv8. (F) DETR. (G) Deformable DETR.
(A) Ground truth bounding boxes. (B) Faster RCNN. (C) Cascade RCNN. (D) RetinaNet. (E) YOLOv8. (F) DETR. (G) Deformable DETR.
(A) Ground truth bounding boxes. (B) Faster RCNN. (C) Cascade RCNN. (D) RetinaNet. (E) YOLOv8. (F) DETR. (G) Deformable DETR.
- Almost all models handle simple cases well (for instance, one or two objects, a clearly visible plant, in focus against a blurred background).
- In all four habitats, YOLO mostly predicted bounding boxes closely matching the ground-truth labels, with good confidence scores.
- DETR consistently produced far more detections than the ground truth, for instance multiple bounding boxes of varied sizes for a single plant.
- For screes, the models ranked by quality of predicted labels are YOLO, Deformable DETR, Cascade RCNN, Faster RCNN, RetinaNet, and DETR.
- For dunes, the order is YOLO, Cascade RCNN, Faster RCNN, RetinaNet, Deformable DETR, and DETR.
- For grasslands, the labels predicted by all models are nearly identical and match the ground truth. YOLO still performs best, followed by Faster RCNN, RetinaNet, and Deformable DETR with similar performance, with DETR and Cascade RCNN close behind. These results may be explained by the grasslands test examples being simpler than those of the other habitats.
- For forests, the prediction performance of Cascade RCNN is slightly better than that of YOLO. As most of the plants/bounding boxes are very small, Cascade RCNN performs very well thanks to its stronger small-object detection, achieved by refining bounding box predictions across its consecutive stages [25]. Faster RCNN and RetinaNet perform similarly, with Deformable DETR and DETR last.
Although YOLO achieves the best mAP scores in all cases, this does not directly imply that the other models perform poorly; the predictions in Figs 4–7 show that they also produce good results. YOLO matches the ground truth almost exactly in most cases. However, the two-stage detectors, especially Cascade RCNN, may also correctly detect species that the botanists missed or did not annotate during labelling. It is sometimes impossible to annotate every instance of a plant species in an image, for reasons such as a large number of instances, very small plant size, species growing in groups, or species that are not clearly visible. Figs 8–11 display the confusion matrices of the best models (with or without class balancing or hyperparameter tuning) as sub-figures for screes, dunes, grasslands, and forests, respectively. Class-name initials are used in the confusion matrices instead of full names to save space. The confusion matrices show that the models are confused more often between an actual category and the background than among different categories. There are two main reasons: (1) similarity between the categories and the background, and (2) the models sometimes detect plant species in the background that the botanists did not annotate, perhaps owing to poor clarity or very small plant size. For grasslands (Fig 10), results for the plant species Orchis pauciflora are not included in the confusion matrix, as no instance of this class occurs in the test set owing to its very small number of instances.
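How the class-versus-background confusion arises can be made concrete with a minimal matching sketch: each ground-truth box is greedily matched to a prediction by IoU, unmatched ground truths fall into the background column (missed plants), and unmatched predictions fall into the background row (detections without an annotation). The helper names below are illustrative, not our evaluation code:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def confusion_pairs(gts, preds, iou_thr=0.5):
    """Greedy (true_class, predicted_class) pairs; "bg" marks the
    background row/column.  gts and preds are lists of (box, class)."""
    pairs, used = [], set()
    for g_box, g_cls in gts:
        match = None
        for i, (p_box, _) in enumerate(preds):
            if i not in used and iou(g_box, p_box) >= iou_thr:
                match = i
                break
        if match is None:
            pairs.append((g_cls, "bg"))        # annotated plant missed
        else:
            used.add(match)
            pairs.append((g_cls, preds[match][1]))
    for i, (_, p_cls) in enumerate(preds):
        if i not in used:
            pairs.append(("bg", p_cls))        # detection with no annotation
    return pairs
```

A detection of an unannotated plant lands in a ("bg", class) cell, which is exactly the background confusion discussed above; a real evaluator would match by best IoU rather than first hit.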
The numbers in the confusion matrices represent percentages.
The numbers in the confusion matrices represent percentages.
The numbers in the confusion matrices represent percentages.
The numbers in the confusion matrices represent percentages.
Conclusion
This article presented an experimental study of the performance of six popular object detection algorithms, namely Faster RCNN and Cascade RCNN (two-stage), RetinaNet and YOLOv8 (one-stage), and DETR and Deformable DETR (transformer-based), for plant species identification in four habitats: screes, dunes, grasslands, and forests. The results of extensive experiments on all four datasets are discussed. Based on the mAP scores, YOLOv8n was found to be the best-performing model overall with respect to the ground-truth labels. Although YOLOv8 is best at detecting all categories in an image with bounding boxes nearly matching the ground truth, the other models also detect some plant species that were missed during annotation; since these count as false positives, they lower the overall precision, recall, and mAP scores. Confusion matrices and prediction results are also presented in this study for further clarity. This experimental study faces a few issues: (1) class imbalance, the major one, (2) small datasets, (3) limited data diversity, and (4) the integration of human-collected and robot-collected data, carried out to increase the number of instances and add diversity. This data integration might have affected the results because human experts and robots captured the images differently. Moreover, deep learning models generally require large datasets with a high and nearly equal number of instances per class, including diverse examples of each category. Although the class balancing wrapper in MMDetection was used to mitigate the class imbalance issue, it did not work very well. In the future, we will attempt to balance the data with a custom data augmentation technique inspired by state-of-the-art methods such as generative adversarial networks and diffusion models, which may improve plant species identification.
Supporting information
S1 Fig. Bounding box predictions from six models on a Screes test image1.
(A) Ground truth bounding boxes. (B) Faster RCNN. (C) Cascade RCNN. (D) RetinaNet. (E) YOLOv8. (F) DETR. (G) Deformable DETR.
https://doi.org/10.1371/journal.pone.0327969.s001
(JPG)
S2 Fig. Bounding box predictions from six models on a Screes test image2.
(A) Ground truth bounding boxes. (B) Faster RCNN. (C) Cascade RCNN. (D) RetinaNet. (E) YOLOv8. (F) DETR. (G) Deformable DETR.
https://doi.org/10.1371/journal.pone.0327969.s002
(JPG)
S3 Fig. Bounding box predictions from six models on a Dunes test image1.
(A) Ground truth bounding boxes. (B) Faster RCNN. (C) Cascade RCNN. (D) RetinaNet. (E) YOLOv8. (F) DETR. (G) Deformable DETR.
https://doi.org/10.1371/journal.pone.0327969.s003
(JPG)
S4 Fig. Bounding box predictions from six models on a Dunes test image2.
(A) Ground truth bounding boxes. (B) Faster RCNN. (C) Cascade RCNN. (D) RetinaNet. (E) YOLOv8. (F) DETR. (G) Deformable DETR.
https://doi.org/10.1371/journal.pone.0327969.s004
(JPG)
S5 Fig. Bounding box predictions from six models on a Grasslands test image1.
(A) Ground truth bounding boxes. (B) Faster RCNN. (C) Cascade RCNN. (D) RetinaNet. (E) YOLOv8. (F) DETR. (G) Deformable DETR.
https://doi.org/10.1371/journal.pone.0327969.s005
(JPG)
S6 Fig. Bounding box predictions from six models on a Grasslands test image2.
(A) Ground truth bounding boxes. (B) Faster RCNN. (C) Cascade RCNN. (D) RetinaNet. (E) YOLOv8. (F) DETR. (G) Deformable DETR.
https://doi.org/10.1371/journal.pone.0327969.s006
(JPG)
S7 Fig. Bounding box predictions from six models on a Forests test image1.
(A) Ground truth bounding boxes. (B) Faster RCNN. (C) Cascade RCNN. (D) RetinaNet. (E) YOLOv8. (F) DETR. (G) Deformable DETR.
https://doi.org/10.1371/journal.pone.0327969.s007
(JPG)
S8 Fig. Bounding box predictions from six models on a Forests test image2.
(A) Ground truth bounding boxes. (B) Faster RCNN. (C) Cascade RCNN. (D) RetinaNet. (E) YOLOv8. (F) DETR. (G) Deformable DETR.
https://doi.org/10.1371/journal.pone.0327969.s008
(JPG)
Acknowledgments
We would like to express our deepest gratitude to Daniela Gigante (Department of Agricultural, Food and Environmental Sciences, University of Perugia, Italy), Marco Caccianiga (Department of Biosciences, Università degli Studi di Milano, Italy), Simonetta Bagella (Department of Chemical, Physical, Mathematical and Natural Sciences, Università di Sassari, Italy), Claudia Angiolini (Department of Life Sciences, Università degli Studi di Siena, Italy), Giovanni Di Lorenzo, Franco Angelini, and Manolo Garabini (Centro di Ricerca “Enrico Piaggio” and Dipartimento di Ingegneria dell’Informazione, Università di Pisa, Italy) for their time, effort, and guidance in the data collection and labelling process.
References
- 1. Noss RF. Ecosystems as conservation targets. Trends Ecol Evol. 1996;11(8):351. pmid:21237874
- 2. Cowling RM, Knight AT, Faith DP, Ferrier S, Lombard AT, Driver A, et al. Nature conservation requires more than a passion for species. Conservation Biology. 2004;18(6):1674–6.
- 3. Bunce RGH, Bogers MMB, Evans D, Halada L, Jongman RHG, Mucher CA, et al. The significance of habitats as indicators of biodiversity and their links to species. Ecological Indicators. 2013;33:19–25.
- 4. Berg C, Abdank A, Isermann M, Jansen F, Timmermann T, Dengler J. Red lists and conservation prioritization of plant communities – a methodological framework. Applied Vegetation Science. 2014;17(3):504–15.
- 5. Keith DA, Rodríguez JP, Brooks TM, Burgman MA, Barrow EG, Bland L, et al. The IUCN red list of ecosystems: motivations, challenges, and applications. Conservation Letters. 2015;8(3):214–26.
- 6. Directive H, et al. Council Directive 92/43/EEC of 21 May 1992 on the conservation of natural habitats and of wild fauna and flora. Official Journal of the European Union. 1992;206(7):50.
- 7. European Commission. Our life insurance, our natural capital: an EU biodiversity strategy to 2020: communication from the Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions. Publications Office of the European Union; 2011.
- 8. Pullin AS, Báldi A, Can OE, Dieterich M, Kati V, Livoreil B, et al. Conservation focus on Europe: major conservation policy issues that need to be informed by conservation science. Conserv Biol. 2009;23(4):818–24. pmid:19627313
- 9. Henle K, Bauch B, Auliya M, Külvik M, Pe‘er G, Schmeller DS, et al. Priorities for biodiversity monitoring in Europe: a review of supranational policies and a novel scheme for integrative prioritization. Ecological Indicators. 2013;33:5–18.
- 10. European Commission D. Interpretation manual of European Union habitats. Eur. 2013;28:1–144.
- 11. Evans D, Arvela M. Assessment and reporting under Article 17 of the Habitats Directive. Brussels: European Commission; 2011.
- 12. Jongman RHG. Biodiversity observation from local to global. Ecological Indicators. 2013;33:1–4.
- 13. Angelini P, Casella L, Grignetti A, Genovesi P. Manuali per il monitoraggio di specie e habitat di interesse comunitario (Direttiva 92/43/CEE) in Italia: habitat. ISPRA; 2016.
- 14. Caro T. Conservation by proxy: indicator, umbrella, keystone, flagship, and other surrogate species. Island Press; 2010.
- 15. Gigante D, Attorre F, Venanzoni R, Acosta A, Agrillo E, Aleffi M, et al. A methodological protocol for Annex I Habitats monitoring: the contribution of vegetation science. Plant Sociology. 2016;53(2):77–87.
- 16. Bonari G, Fantinato E, Lazzaro L, Sperandii MG, Acosta ATR, Allegrezza M, et al. Shedding light on typical species: implications for habitat monitoring. Plant Sociology. 2021;58(1):157–66.
- 17. Angelini F, Angelini P, Angiolini C, Bagella S, Bonomo F, Caccianiga M, et al. Robotic monitoring of habitats: the natural intelligence approach. IEEE Access. 2023.
- 18. Hutter M, Gehring C, Lauber A, Gunther F, Bellicoso CD, Tsounis V, et al. ANYmal - toward legged robots for harsh environments. Advanced Robotics. 2017;31(17):918–31.
- 19. Angelini F, Pollayil MJ, Valle B, Borgatti MS, Caccianiga M, Garabini M. Robotic monitoring of Alpine screes: a dataset from the EU Natura2000 habitat 8110 in the Italian Alps. Sci Data. 2023;10(1):855. pmid:38040689
- 20. Labelbox. https://labelbox.com
- 21. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D. Microsoft COCO: common objects in context. In: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V. 2014. p. 740–55.
- 22. Ultralytics Team. Ultralytics YOLOv8 docs. https://docs.ultralytics.com/
- 23. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. 2017. p. 2980–8.
- 24. Ren S, He K, Girshick R, Sun J. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems. 2015;28.
- 25. Cai Z, Vasconcelos N. Cascade R-CNN: delving into high quality object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. p. 6154–62.
- 26. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: European Conference on Computer Vision. Springer; 2020. p. 213–29.
- 27. Zhu X, Su W, Lu L, Li B, Wang X, Dai J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint 2020. https://arxiv.org/abs/2010.04159
- 28. Chen K, Wang J, Pang J, Cao Y, Xiong Y, Li X, et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv preprint 2019. https://arxiv.org/abs/1906.07155
- 29. Nawaz M, Nazir T, Khan MA, Rajinikanth V, Kadry S. Plant disease classification using VGG-19 based Faster-RCNN. In: International Conference on Advances in Computing and Data Sciences. 2023. p. 277–89.
- 30. Alruwaili M, Siddiqi MH, Khan A, Azad M, Khan A, Alanazi S. RTF-RCNN: An architecture for real-time tomato plant leaf diseases detection in video streaming using Faster-RCNN. Bioengineering. 2022;9(10):565.
- 31. Saleem MH, Potgieter J, Arif KM. Weed detection by faster RCNN model: an enhanced anchor box approach. Agronomy. 2022;12(7):1580.
- 32. Nur-A-Alam Md, Nasir MdMK, Ahsan M, Based MdA, Haider J, Palani S. A faster RCNN-based diabetic retinopathy detection method using fused features from retina images. IEEE Access. 2023;11:124331–49.
- 33. Cai J, Zhang L, Dong J, Guo J, Wang Y, Liao M. Automatic identification of active landslides over wide areas from time-series InSAR measurements using Faster RCNN. International Journal of Applied Earth Observation and Geoinformation. 2023;124:103516.
- 34. Wu L, Chen S, Liang C. Target detection of marine ships based on a cascade RCNN. J Phys: Conf Ser. 2022;2185(1):012028.
- 35. Mo Y, Liu L, Zhu L, Fu S. CR-SSDD: an optimization model for steel surface defect detection based on cascade RCNN. In: 7th International Conference on Vision, Image and Signal Processing (ICVISP 2023). IET; 2023.
- 36. Cao Y, Wang Y, Feng H, Wang T. Method for detecting surface defects of ceramic tile based on improved Cascade RCNN. In: 2022 4th International Conference on Frontiers Technology of Information and Computer (ICFTIC). 2022. p. 41–5. https://doi.org/10.1109/icftic57696.2022.10075095
- 37. Tang K, Ma W, Fei Z, Gao Y, Yuan Y, Xu Q. Cascade RCNN with hybrid attention and dual pooling for soccer player detection. In: 2023 6th International Conference on Artificial Intelligence and Big Data (ICAIBD). 2023. p. 695–700. https://doi.org/10.1109/icaibd57115.2023.10206144
- 38. Cao R, Mo W, Zhang W. MFMDet: multi-scale face mask detection using improved Cascade rcnn. J Supercomput. 2023;80(4):4914–42.
- 39. Peng H, Li Z, Zhou Z, Shao Y. Weed detection in paddy field using an improved RetinaNet network. Computers and Electronics in Agriculture. 2022;199:107179.
- 40. Vecvanags A, Aktas K, Pavlovs I, Avots E, Filipovs J, Brauns A, et al. Ungulate detection and species classification from camera trap images using RetinaNet and faster R-CNN. Entropy. 2022;24(3):353.
- 41. Miao T, Zeng H, Yang W, Chu B, Zou F, Ren W, et al. An improved lightweight retinanet for ship detection in SAR images. IEEE J Sel Top Appl Earth Observations Remote Sensing. 2022;15:4667–79.
- 42. Mahum R, Al-Salman AS. Lung-RetinaNet: lung cancer detection using a retinanet with multi-scale feature fusion and context module. IEEE Access. 2023;11:53850–61.
- 43. Solimani F, Cardellicchio A, Dimauro G, Petrozza A, Summerer S, Cellini F, et al. Optimizing tomato plant phenotyping detection: boosting YOLOv8 architecture to tackle data complexity. Computers and Electronics in Agriculture. 2024;218:108728.
- 44. Brucal SGE, de Jesus LCM, Peruda SR, Samaniego LA, Yong ED. Development of tomato leaf disease detection using YoloV8 model via RoboFlow 2.0. In: 2023 IEEE 12th Global Conference on Consumer Electronics (GCCE). 2023. p. 692–4. https://doi.org/10.1109/gcce59613.2023.10315251
- 45. Yang T, Zhou S, Xu A, Ye J, Yin J. An approach for plant leaf image segmentation based on YOLOV8 and the improved DEEPLABV3. Plants (Basel). 2023;12(19):3438. pmid:37836178
- 46. Wang X, Liu J. Vegetable disease detection using an improved YOLOv8 algorithm in the greenhouse plant environment. Sci Rep. 2024;14(1):4261. pmid:38383751
- 47. Uddin MS, Mazumder MKA, Prity AJ, Mridha MF, Alfarhood S, Safran M, et al. Cauli-Det: enhancing cauliflower disease detection with modified YOLOv8. Front Plant Sci. 2024;15:1373590. pmid:38699536
- 48. Zhong X, Ding C, Li Z, Huang S. Towards hard-positive query mining for DETR-based human-object interaction detection. In: European Conference on Computer Vision. 2022. p. 444–60.
- 49. Kumar A, Singh SK, Dubey SR. Target detection using transformer: a study using DETR. In: Computer Vision and Machine Intelligence: Proceedings of CVMI 2022. Springer; 2023. p. 747–59.
- 50. Yuan X, Fang S, Li N, Ma Q, Wang Z, Gao M, et al. Performance comparison of sea cucumber detection by the Yolov5 and DETR approach. JMSE. 2023;11(11):2043.
- 51. Tang Y, Chen H, Qian L, Ge S, Zhang M, Zheng R. Detection of spine curve and vertebral level on ultrasound images using DETR. In: 2022 IEEE International Ultrasonics Symposium (IUS). 2022. p. 1–4. https://doi.org/10.1109/ius54386.2022.9958621
- 52. Cheng Y, Liu D. An image-based deep learning approach with improved DETR for power line insulator defect detection. Journal of Sensors. 2022;2022:1–22.
- 53. Wang D, Li Z, Du X, Ma Z, Liu X. Farmland obstacle detection from the perspective of UAVs based on non-local deformable DETR. Agriculture. 2022;12(12):1983.
- 54. Shanliang L, Yunlong L, Jingyi Q, Renbiao W. Airport UAV and birds detection based on deformable DETR. J Phys: Conf Ser. 2022;2253(1):012024.
- 55. Wang Z, Yao J, Zeng C, Li L, Tan C. Students’ classroom behavior detection system incorporating deformable DETR with swin transformer and light-weight feature pyramid network. Systems. 2023;11(7):372.
- 56. Jocher G. Ultralytics YOLOv8. https://github.com/ultralytics/ultralytics
- 57. Jocher G, Chaurasia A, Stoken A, Borovec J, Kwon Y, Michael K, et al. Ultralytics/yolov5: v7.0 – YOLOv5 SOTA realtime instance segmentation. Zenodo. 2022.
- 58. Ultralytics. Ultralytics YOLOv8. https://github.com/ultralytics/ultralytics/issues/189
- 59. Yan B, Fan P, Lei X, Liu Z, Yang F. A real-time apple targets detection method for picking robot based on improved YOLOv5. Remote Sensing. 2021;13(9):1619.
- 60. Angelini F, Pollayil MJ, Bonini F, Gigante D, Garabini M. Robotic monitoring of grasslands: a dataset from the EU Natura2000 habitat 6210 * in the central Apennines (Italy). Sci Data. 2023;10(1):418. pmid:37369670
- 61. Pollayil MJ, Angelini F, de Simone L, Fanfarillo E, Fiaschi T, Maccherini S, et al. Robotic monitoring of forests: a dataset from the EU habitat 9210 * in the Tuscan Apennines (central Italy). Sci Data. 2023;10(1):845. pmid:38040693
- 62. Angelini F, Pollayil MJ, Rivieccio G, Caria MC, Bagella S, Garabini M. Robotic monitoring of dunes: a dataset from the EU habitats 2110 and 2120 in Sardinia (Italy). Sci Data. 2024;11(1):238. pmid:38402293