Abstract
Globally, agriculture holds significant importance for human food, economic activities, and employment opportunities. Wheat stands out as the most cultivated crop in the farming sector; however, its annual production faces considerable challenges from various diseases. Timely and accurate identification of these wheat plant diseases is crucial to mitigate damage and enhance overall yield. Pakistan stands among the leading crop producers due to favorable weather and rich soil for production. However, traditional agricultural practices persist, and there is insufficient emphasis on leveraging technology. A significant challenge faced by the agriculture sector, particularly in countries like Pakistan, is the untimely and inefficient diagnosis of crop diseases. Existing methods for disease identification often result in inaccuracies and inefficiencies, leading to reduced productivity. This study proposes an efficient application for wheat crop disease diagnosis, adaptable for both mobile devices and computer systems as the primary decision-making engine. The application utilizes sophisticated machine learning techniques, including Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and AdaBoost, combined with feature extraction methods such as Count Vectorization (CV) and Term Frequency-Inverse Document Frequency (TF-IDF). These advanced methods collectively achieve up to 99% accuracy in diagnosing 14 key wheat diseases, representing a significant improvement over traditional approaches. The application provides a practical decision-making tool for farmers and agricultural experts in Pakistan, offering precise disease diagnostics and management recommendations. By integrating these cutting-edge techniques, the system advances agricultural technology, enhancing disease detection and supporting increased wheat production, thus contributing valuable innovations to both the field of machine learning and agricultural practices.
Citation: Niaz AA, Ashraf R, Mahmood T, Faisal CMN, Abid MM (2025) An efficient smart phone application for wheat crop diseases detection using advanced machine learning. PLoS ONE 20(1): e0312768. https://doi.org/10.1371/journal.pone.0312768
Editor: Vaibhav Kumar Singh, Indian Agricultural Research Institute, INDIA
Received: March 22, 2024; Accepted: October 13, 2024; Published: January 8, 2025
Copyright: © 2025 Niaz et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data supporting the findings of this study are available on Figshare (https://figshare.com/articles/dataset/Dataset_xlsx/27282912?file=49936995) and GitHub (https://github.com/RehanAshrafNTU/Wheat-disease).
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
The agriculture sector is the backbone of Pakistan’s economic growth. In 2013, Pakistan was the 7th largest wheat-producing country, owing in part to its large agricultural area [1]. According to the FAO (Food and Agriculture Organization of the United Nations), it has the oldest and largest irrigation system in the entire South Asian region [2, 3]. In terms of nominal gross domestic product (GDP), the country’s economy ranked 23rd in the world (purchasing power parity) [4].
Pakistan now stands among the top 10 producers of cotton, wheat, sugarcane, dates, mango, kinnow (oranges), and rice. Major crops (wheat, rice, cotton, and sugar cane) contribute approximately 4.9% of overall GDP, while other crops contribute 2.1% [4]. With its wheat production ranking falling from 7th to 10th, the agriculture sector faces many difficulties in wheat crop production [1, 3]. This decline is partly attributed to limitations in current disease detection methods, which are often inaccurate and inefficient, contributing to substantial losses in yield and economic value.
Low wheat production is exacerbated by several factors, including inadequate early disease detection, outdated cultivation methods, and limited knowledge of modern agricultural practices. Disease detection is particularly critical, because diseases can reduce annual global wheat production by 15%–20% [5]. Agriculture extension services, though prominent in Punjab, are hindered by the lack of trained agriculturists and of modern machinery and equipment [3]. Early disease detection, in wheat as in other crops, would help farmers manage their fields better and apply appropriate methods to protect the crop [3]. Furthermore, disease detection in rural areas is often ineffective and inefficient due to the shortage of agriculturists, affecting approximately 70% of the country’s rural population [1]. Wheat plant diseases are challenging for farmers to identify, requiring them to manually inspect large fields or rely on agricultural experts, which is time-consuming and labor-intensive [6].
Recent advancements in computer technology, particularly in human-computer interaction and artificial intelligence (AI), have opened new avenues for improving disease detection. Intelligent systems utilizing AI-based and Computer Vision (CV) methods can assist farmers in identifying wheat diseases more efficiently. Despite these advancements, many existing systems face challenges due to the limited scope of disease datasets, which restrict their ability to detect a wide range of wheat diseases accurately. To address these challenges, there is a growing need for comprehensive datasets and robust machine learning (ML) systems that can operate effectively across diverse field conditions [7–11].
In this study, a new, state-of-the-art (SOTA) dataset of symptom-based text data for wheat diseases is introduced. The information in the dataset comes from a wide variety of wheat farms, the internet, the literature, and the University of Agriculture Faisalabad in the province of Punjab, Pakistan. This research focuses on fourteen diseases: black stem rust, leaf rust, stripe rust, loose smut, flag smut, complete bunt, partial bunt, ear cockle, tundo, black point complex, common bunt, sooty head molds, stagonospora nodorum blotch, and barley yellow dwarf. These diseases have been only narrowly studied in text-based datasets, so the dataset offers a general method for diagnosing diseases affecting wheat crops. The proposed system identifies wheat diseases from text data using ML models. The disease diagnoses were reviewed by field experts, which helped in two ways: (1) the accuracy and precision of the ML models improved, and (2) it generated useful agricultural information in the form of rules and disease categories that is incorporated into this study. Finally, the proposed system is compared on performance with the current state-of-the-art approaches in the literature.
This study presents a system for wheat disease detection based on ML techniques. Different feature extraction and classification algorithms are investigated comprehensively to achieve high accuracy across 14 different diseases of wheat crops. The primary contributions of this study are outlined below:
- This study concentrates on a key crop in Pakistan, namely wheat. An upgraded text-based dataset for wheat diseases is generated, compiled from diverse sources in Pakistan, and incorporating state-of-the-art information from [1]. This study emphasizes 14 specific wheat diseases.
- The system identifies fourteen prevalent wheat diseases and provides management recommendations through an application serving as the decision-making engine in the backend. Additionally, an application has been developed to function as the user interface in the front end.
- The diagnosis of the various wheat diseases is additionally validated by domain experts, contributing to the enhanced accuracy of the system.
- The effectiveness of two distinct vectorization methods, namely count vectorization and TF-IDF, is assessed in terms of accuracy, precision, recall, and F1 score.
- The machine learning model is fine-tuned through hyperparameter tuning, adjusting three specific hyperparameters (Maximum Depth, Minimum Samples Split, and Minimum Samples Leaf) for both the CART DT and Random Forest. This resource-intensive adjustment aims to select the optimal model and prevent overfitting.
- A comparative assessment was carried out to evaluate ML techniques for recognizing wheat diseases. The proposed system demonstrated an accuracy of 99% in classifying wheat diseases. Owing to its robust generalization and notable recognition rate, the proposed system is suitable for deployment in diverse real-time industrial applications.
The subsequent sections of this paper are organized as follows: Section 2 reviews pertinent literature on wheat disease classification using machine learning. Section 3 details the study’s proposed research methodology, encompassing data collection, data preparation, feature extraction, and machine learning model selection and evaluation. Section 4 presents the study’s results and discussion, including the performance of four ML algorithms and the influence of hyperparameter tuning. Section 5 concludes the paper by summarizing the key findings and suggesting directions for future research in wheat disease classification using machine learning.
2. Literature review
Artificial intelligence (AI) has revolutionized numerous fields by embedding human-like abilities such as learning, reasoning, and perception into software. This advancement allows computers to perform tasks that were once exclusively handled by humans. With the increase in computing power, access to large datasets, and the development of advanced AI algorithms, AI is now utilized in a variety of domains. These applications range from finger vein recognition [12] and diabetic retinopathy detection [13–17] to RNA engineering [18, 19], cancer detection [20–22], biomathematical problems [13, 23, 24], and smart agriculture [25].
The agriculture industry is severely impacted whenever significant crops are attacked by pests. Wheat, maize, rice, and other food crops are especially vulnerable to disease and need to be protected and managed effectively. The agricultural sector has suffered economically over the last couple of decades as a result of a global reduction in crop yield caused by various diseases. Wheat is susceptible to many different diseases, but some of the worst are loose smut, yellow/stripe rust, flag smut, and black stem rust. Various researchers have made contributions to distinct facets of precision agriculture [26]. Precise and early detection of plant diseases is vital for enhancing agricultural productivity sustainably. Historically, human specialists have been essential for identifying plant anomalies resulting from diseases, pests, nutrient deficiencies, or adverse weather conditions [27]. Fast and precise recognition of wheat leaf diseases and their severity is advantageous for the exact prevention and management of such diseases [28].
The main cause of low crop yield in developing countries, including Pakistan, is incorrect and late disease diagnosis [29, 30]. In Pakistan, some of the prominent diseases are loose smut, yellow or stripe rust, flag smut, and black stem rust [31]. Research in this area has recently demonstrated sizable economic advantages [32, 33], indicating that agricultural productivity is increasing. Many successful expert systems in a variety of fields, such as economics, healthcare, education, weather forecasting, market trends, and different kinds of planning activities, have been created as a result of technology adaptation [34–37]. Statistics regarding Pakistani smartphone usage patterns have been reported by the Pakistan Advertiser Society [38]: 72% of people in Pakistan use smartphones, 68% of smartphone users are Android users, and 60% of users have multiple cell phones. The country’s 3G and 4G networks have proliferated widely, fueling the rapid expansion of the smartphone market. Consequently, various strategies and applications, specifically disease diagnosis systems, have been put forth in the literature to help the agriculture sector. In [39], a data acquisition framework based on the traditional waterfall model was proposed to collect crop disease data.
In [40], the authors discussed various approaches and methodologies used in the CLAES (Central Laboratory for Agricultural Expert Systems) Egypt to create expert systems. CLAES has also researched Egyptian rice disease classification rules. To derive classification rules and assess them in comparison with NN (Neural Network) outcomes, the authors used the C4.5 DT (Decision Tree) algorithm. The performance of the CITEX (Citrus Expert System) and the CUPTEX (Cucumber Expert System) as agricultural training tools has been described by [41]. CUPTEX and CITEX are divided into four sub-systems with the names irrigation, fertilization, treatment, and verification. Performance improvements are seen after various experiments. The early diagnosis of fruit disease has been achieved using a variety of data mining techniques [42]. Spore traps and automatic meteorological stations are used to gather spore and weather data, respectively. The percentage of instances that are successfully classified, the SD (Standard Deviation), the MAE (Mean Absolute Error), and the RAE (Relative Absolute Error) have all been calculated using various data mining approaches, such as J48, SMO (Sequential Minimal Optimization), and Zero-R. The fruit infection prediction model was trained using the J48 algorithm, which successfully identified 90% of the total occurrences [42].
Back-propagation neural networks have been used in China to create an “intelligent web based system for diagnosis” for the management of cotton diseases [43]. Eight distinct types of cotton diseases have been used to evaluate this technology. For the identification of diseases in the oilseed crops soybean, peanuts, and rapeseed mustard, fuzzy logic has been employed in India [44]. For the wheat crop, [45] suggests an expert system that makes inferences based on rules. A system facilitating two-way cloud-based communication, disease recognition algorithms, and automated services for crops was developed by [46]. A variety of parameter settings have been used to assess performance. To aid farmers in identifying plant illnesses, classification algorithms are introduced in [47]. The approach can also provide advice on how to treat various diseases. An image processing-based method for plant disease diagnosis is proposed in [48]. Clustering techniques are applied to the leaf image to segment out the damaged area. The disease is estimated by a fuzzy logic-based classifier based on attributes of the affected part. To control agricultural diseases, [49] gives a review of existing expert systems.
In this field, there are few viable theoretical frameworks, and substantial work remains on the practical side of disease detection. For instance, diagnosing fruit diseases using a rule-based expert system has been proposed in [50]. In [51], a graphical and descriptive approach for locating plant illnesses is suggested and prototyped. Additionally, [52] proposes a knowledge management and acquisition system for storing and managing crop disease data, which may serve as a basis for future diagnoses. A web-based application utilizing a rule promotion technique for diagnosing illnesses in oilseed crops is presented in [53]. Meanwhile, [43] offers a neural network-based solution for disease diagnosis in Chinese cotton using a backpropagation algorithm. [54] introduces new features in a mobile app for Windows Phones that employs image processing methods for early plant disease detection. In rural Africa, [55] presents a smartphone-based app with a focus on diagnosing banana and tuber infections. Despite advancements in machine learning (ML) for disease detection, existing methods exhibit notable shortcomings, such as limited dataset diversity, inadequate handling of noisy or incomplete data, and reduced robustness across varying field conditions. For instance, while the mobile application Plantix provides disease diagnosis through leaf images and offers advice for extension professionals and farmers [56], its primary focus on visual symptom analysis does not address dataset limitations or adaptability to diverse conditions. Similarly, deep neural networks discussed in [57] face challenges with data variability and generalization. Image processing methods for early disease detection, as outlined in [58], may struggle with accuracy and applicability across a broad range of diseases. A literature review in [59] highlights smartphone sensors in agriculture but notes practical deployment issues.
In response to these gaps, this study introduces a system that leverages a curated text-based dataset specifically compiled from agricultural research publications, extension manuals, and expert annotations [1, 3]. While the dataset size is relatively small, it has been meticulously curated to include the most prevalent and impactful wheat diseases. The focus on quality over quantity ensures that the dataset captures the essential characteristics necessary for accurate disease detection. Additionally, advanced machine learning techniques—such as Decision Trees (DT), Random Forest (RF), Support Vector Machines (SVM), and AdaBoost—are employed. These are combined with feature extraction techniques like Count Vectorization (CV) and Term Frequency-Inverse Document Frequency (TF-IDF). These advanced methods enable the model to achieve up to 99% accuracy in diagnosing fourteen prevalent wheat diseases. Although the dataset size might not be extensive, the robust preprocessing techniques and careful selection of features ensure model stability and reliability, making a significant contribution to wheat disease detection.
3. Proposed research methodology
In this research, the wheat crop has been selected for implementation of the proposed method. Using the verification of domain experts, the implementation of this methodology has enabled the detection and classification of wheat diseases. The process flow of the suggested method is illustrated in Fig 1. Textual information has been provided by the domain expert. The model is discussed further below.
3.1 Data collection
The collection of raw data relating to symptoms and their associated diseases is an essential component of the proposed system. Data related to wheat has been acquired from many sources, encompassing the websites of governmental agriculture agencies and agricultural research institutions, as well as the SOTA literature, the internet, farmers, and domain experts. The primary data collected from these various sources have been stored in a rudimentary database. The document comprises a total of 100 elements. The dataset consists of a total of 120 columns representing symptoms and 61 rows representing diseases.
3.2 Data preprocessing
First, information is compiled from various sources, including the SOTA literature, the internet, farmers, and domain experts. The standard Knowledge Discovery in Databases (KDD) process, as outlined by [60], has been used to extract valuable knowledge from the raw data. Redundant attributes that appeared under distinct names have been consolidated in collaboration with agricultural domain experts. Information gain has been employed to discover the features that have the highest significance in the classification of a given disease. The dimensions of the data have been reduced and simplified based on the findings of the information gain analysis.
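The information gain criterion mentioned above can be illustrated with a minimal sketch. It assumes categorical symptom attributes and uses the usual entropy-based definition; the attribute names, values, and labels are toy placeholders, not rows from the study's dataset, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on one feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Toy placeholder rows (not taken from the paper's dataset).
labels = ["disease_A", "disease_A", "disease_B", "disease_B"]
crop_part = ["stem", "stem", "leaf", "leaf"]     # separates the two classes perfectly
color = ["brown", "yellow", "brown", "yellow"]   # carries no class information

gain_part = information_gain(crop_part, labels)
gain_color = information_gain(color, labels)
print(gain_part, gain_color)
```

An attribute with a gain near zero, such as the placeholder color column here, would be a candidate for removal during dimensionality reduction.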
The primary objective of this study is to examine the four key input properties that characterize the symptoms associated with a specific disease, as well as the corresponding output attribute that represents the disease itself. The data cleaning technique yielded a total of 43 distinct instances, which encompassed 14 different diseases affecting wheat crops. The data that had been cleaned underwent further verification and authentication by domain experts and the University of Agriculture. The principles of diseases have been presented in Table 1.
3.3 Feature extraction
Feature extraction holds a paramount role in the development of machine learning models. The effectiveness of machine learning algorithms is intricately tied to the selection and engineering of features. When the features extracted are pertinent to the specific objectives of the task, machine learning classifiers can achieve notably high accuracies in distinguishing between various classes.
The fundamental concept behind feature extraction is to sift through the multitude of potential data attributes and retain only those that bear substantial importance in characterizing the object of study. By identifying and prioritizing features that strongly contribute to the representation of the subject matter, this process not only enhances the model’s ability to differentiate between classes but also yields significant computational benefits.
In essence, feature extraction is a strategic curation of data attributes, favoring those with the greatest significance in describing the underlying patterns in the data. This meticulous feature selection and engineering not only elevates the predictive power of machine learning models but also streamlines the computational complexity by sparing the system from the burden of processing extraneous or less meaningful features.
Wheat disease signs and symptoms are labeled as "WheatDisease" to serve as a predictor variable in Table 1. The domain experts who validated the data agreed that these characteristics are useful in diagnosing disease. Some of these signs are also reported in the literature on wheat crop diseases. For classification, the symptoms of the wheat diseases that are most prevalent and well known in the literature are chosen. These characteristics include the occurrence of a symptom in any region of the plant.
3.3.1 Performing feature extraction with count vectorizer.
Textual descriptions of wheat diseases are transformed into token-based vectors, creating a Bag of Words model that relies on word frequency. This transformation is achieved using the CountVectorizer class from the scikit-learn library, which constructs vectors that represent the features of the text data. Before the count vectorizer is applied, high-frequency words are analyzed, as they serve as a comprehensive representation of the text-based wheat disease descriptions.
3.3.2 Performing feature extraction with TF-IDF (Term Frequency-Inverse Document Frequency).
Next, this study employs the Term Frequency-Inverse Document Frequency (TF-IDF) method. It calculates the term frequency (TF), i.e., the frequency of each word within a text-based description of a wheat disease, and weights it by the inverse document frequency (IDF), which lowers the weight of words that occur in many descriptions. This process results in a vocabulary comprising the words present in the text data. This vocabulary is then used to encode both seen and unseen text.
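The weighting can be made concrete with a small sketch that uses the textbook form tf × ln(N / df). This is an illustrative assumption: scikit-learn's `TfidfVectorizer` applies a smoothed IDF and vector normalization by default, so its numbers differ from the ones produced here. The descriptions are toy placeholders.

```python
from collections import Counter
from math import log


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy placeholder descriptions (not taken from the paper's dataset).
docs = ["stem brick red streaks", "leaf yellow stripes", "leaf brick red spots"]
weights = tf_idf(docs)
print(weights[0])
```

In the first description, "stem" occurs in only one document and therefore receives a higher weight than "brick", which occurs in two.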
3.4 Data splitting
After pre-processing and feature extraction steps on the text-based datasets, the next phase involves dividing the data into training and test sets. In this context, 80% of the dataset is allocated for training purposes, while the remaining 20% is reserved for testing. This partitioning is essential for the subsequent application of classification algorithms. The training data, constituting 80% of the original dataset, serves as the input for training the classifiers. During this training phase, the ML algorithms learn patterns and relationships within the data. On the other hand, the test data, comprising the remaining 20%, is employed to evaluate the performance of the trained classifiers.
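The 80/20 partition can be sketched with scikit-learn's `train_test_split`. The texts and labels are placeholders, and the fixed `random_state` is an assumption added only to make the sketch reproducible; the source does not state a seed.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: ten descriptions and labels (not the paper's dataset).
texts = [f"symptom description {i}" for i in range(10)]
labels = ["disease_A"] * 5 + ["disease_B"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,     # 80% for training, 20% for testing, as described above
    random_state=42,   # assumed seed, only for reproducibility of this sketch
)
print(len(X_train), len(X_test))
```

If the split is applied to the 43 cleaned instances reported in Section 3.2, a 20% test set holds roughly nine instances, which is fewer than the 14 disease classes, so not every class can appear in the test set.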
3.5 Classification
In the realm of Machine Learning (ML), supervised classification stands out as a foundational task, offering a rich variety of classification algorithms. For the current evaluation, four algorithms, namely Decision Trees (DT), Random Forest (RF), Support Vector Machine (SVM), and AdaBoost, have been selected and applied to the training data to construct predictive models.
Decision Trees, as classifiers, embody tree-like structures defined by rules and are esteemed for their interpretability, mirroring human reasoning [61]. Widely utilized in both research and practical applications, Decision Trees, such as Quinlan’s C4.5 [62] and Classification and Regression Tree (CART) [63], hold a prominent position in the field of ML [64, 65]. The advantages of DT induction algorithms extend to robustness in handling noisy data, including managing missing values and imbalanced classes. Furthermore, they are recognized for their low computational cost and adeptness in handling redundant attributes [66].
The significance of hyper-parameter values in ML algorithms cannot be overstated, directly impacting the predictive performance of the models they generate. Researchers have dedicated extensive studies to understand the influence of hyper-parameters on various algorithms, employing techniques ranging from traditional methods like Grid Search (GS) and Random Search (RS) [67], to advanced approaches such as meta-heuristics (MTH) [68] and meta-learning (MtL) [69]. Despite the abundance of studies on hyper-parameter optimization for Support Vector Machines (SVMs) [70, 71] and Neural Networks (NNs) [72], fewer investigations have specifically targeted the optimization of hyper-parameters for DT induction algorithms [73–75].
Moving to Random Forest, it is defined as a classifier comprising a collection of tree-structured classifiers, each casting a unit vote for the most popular class at input x. With a large number of trees generated, the winning class is determined by the one with the most votes [76].
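The unit-vote rule described above amounts to a majority count over the individual tree predictions. The sketch below shows only that rule, with placeholder votes; it does not train any trees. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class that received the most unit votes."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Placeholder predictions from five hypothetical trees for one input x.
votes = ["disease_A", "disease_B", "disease_A", "disease_C", "disease_A"]
winner = majority_vote(votes)
print(winner)
```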
Support Vector Machine, another classifier in the evaluation, is a supervised learning model particularly effective in text categorization. This classifier establishes optimal boundaries to separate positive and negative training samples, demonstrating resilience against overfitting when provided with less noisy data [77].
AdaBoost is effective in mitigating overfitting, particularly in scenarios with less noisy data, as observed in the study by [78]. However, due to its inherent binary nature, AdaBoost achieved an equivalent level of accuracy to SVM when applied to the given dataset.
The superiority of the classification methods used in this study—Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and AdaBoost—stems from their proven effectiveness in handling text-based data for classification tasks. These methods are particularly well-suited for text data due to their ability to handle high-dimensional features, as seen with Count Vectorization (CV) and Term Frequency-Inverse Document Frequency (TF-IDF). While recent advanced methods like deep learning are highly effective with large image datasets, they often require substantial computational resources and large volumes of labeled data, which may not always be feasible in real-world agricultural settings. The chosen methods in this study strike a balance between computational efficiency, accuracy, and applicability to diverse field conditions, providing a practical and reliable solution for wheat disease detection based on text data.
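A minimal sketch of how the four classifiers could be trained on TF-IDF features with scikit-learn follows. The toy texts, the placeholder labels, and every parameter value (number of trees, linear kernel, random seeds) are assumptions made for illustration; the study's actual settings are not reproduced here.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy placeholder data (not the paper's dataset).
texts = [
    "stem brick red long narrow streaks",
    "stem red streaks",
    "leaf yellow stripes",
    "leaf pale yellow stripes",
]
labels = ["disease_A", "disease_A", "disease_B", "disease_B"]

classifiers = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM": SVC(kernel="linear"),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

predictions = {}
for name, classifier in classifiers.items():
    # Each pipeline vectorizes the raw text and then fits the classifier.
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(texts, labels)
    predictions[name] = str(model.predict(["leaf with yellow stripes"])[0])
print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary learned from the training texts tied to the model that uses it. Replacing `TfidfVectorizer` with `CountVectorizer` gives the count-based variant.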
3.6 The dominant hyperparameter: Maximum depth’s central role
Is the maximum depth hyperparameter the most critical factor influencing the complexity and effectiveness of tree-based models? Typically, when using the CART Decision Tree algorithm, three hyperparameters (Maximum Depth, Minimum Samples Split, and Minimum Samples Leaf) are fine-tuned to find the ideal model and prevent overfitting, which can be computationally intensive [79]. To reduce the computational burden while still achieving a robust model, it’s advisable to concentrate on optimizing those specific hyperparameters that have the greatest impact on model performance.
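Tuning the three hyperparameters can be sketched with scikit-learn's `GridSearchCV`, where they map to the `max_depth`, `min_samples_split`, and `min_samples_leaf` parameters of `DecisionTreeClassifier`. The candidate values, the two-fold cross-validation, and the toy data are assumptions for illustration, not the grid used in the study.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy placeholder data (not the paper's dataset).
texts = [
    "stem red streaks", "stem brick red streaks",
    "stem red long streaks", "stem dark red streaks",
    "leaf yellow stripes", "leaf pale yellow stripes",
    "leaf yellow narrow stripes", "leaf bright yellow stripes",
]
labels = ["disease_A"] * 4 + ["disease_B"] * 4

pipeline = Pipeline([
    ("vec", CountVectorizer()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Assumed candidate values; every combination is fitted once per fold.
param_grid = {
    "tree__max_depth": [1, 2, 3, None],
    "tree__min_samples_split": [2, 4],
    "tree__min_samples_leaf": [1, 2],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

This grid has 4 × 2 × 2 = 16 combinations, each fitted on every fold, which illustrates why the search grows expensive. Restricting the grid to `tree__max_depth` alone would reduce it to four candidates.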
3.7 Mobile application
Smartphone applications have become pivotal in enhancing accessibility and efficiency across various sectors, including agriculture. The proposed wheat disease detection system leverages the widespread use of Android smartphones to bring advanced machine learning capabilities directly to the hands of farmers. This integration is crucial for ensuring that the technological solutions developed in this study are both practical and impactful as shown in Fig 2.
Fig 2(a). User input screen.
The mobile application is designed to be user-friendly, allowing farmers to diagnose wheat diseases in the field with minimal effort. Users can input visual characteristics of the wheat plant, such as the color and appearance of affected areas, through an intuitive interface. For example, in one scenario, a farmer might select ’stem’ for the crop part, ’brick red’ for the color, and ’long and narrow streaks’ for the appearance. The application then processes this input using the trained machine learning models and provides an accurate diagnosis, along with recommended management strategies.
One of the key advantages of integrating smartphones into this system is the ability to perform disease detection on-site and in real-time, without the need for expensive or complex equipment. This is particularly beneficial in rural areas where access to professional agricultural services may be limited. By offering an offline mode, the application ensures that farmers can still utilize its features even in areas with poor internet connectivity, making it a reliable tool in various field conditions. During the development and deployment phases, several challenges were encountered, including issues related to data quality, model accuracy, and user interface design. These were addressed by collaborating with local experts to ensure relevant data, optimizing machine learning models for better accuracy, and designing an intuitive, user-friendly interface. Additionally, offline functionality and comprehensive training were implemented to ensure usability in areas with limited internet access.
The system’s backend, built using the scikit-learn ML library, processes the input data and matches it against the trained models to generate predictions. The results are then displayed on the smartphone screen, providing farmers with actionable insights in a matter of seconds. This seamless integration of smartphones into the disease detection pipeline not only streamlines the diagnostic process but also empowers farmers to make informed decisions quickly, ultimately leading to better crop management and higher yields.
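The path from user input to prediction can be sketched as follows, under several assumptions: the three selections (crop part, color, appearance) are joined into one text string, a count-vectorized decision tree stands in for the trained model, and the training rows, class labels, and advice strings are placeholders. The `diagnose` function and the `ADVICE` mapping are hypothetical names, not part of the published application.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy placeholder training rows and advice (not the paper's data or model).
TRAINING_TEXTS = [
    "stem brick red long and narrow streaks",
    "leaf yellow stripes",
    "ear black powdery mass",
]
TRAINING_LABELS = ["disease_A", "disease_B", "disease_C"]
ADVICE = {
    "disease_A": "placeholder management advice for disease_A",
    "disease_B": "placeholder management advice for disease_B",
    "disease_C": "placeholder management advice for disease_C",
}

model = make_pipeline(CountVectorizer(), DecisionTreeClassifier(random_state=0))
model.fit(TRAINING_TEXTS, TRAINING_LABELS)


def diagnose(crop_part, color, appearance):
    """Join the three user selections into one description and classify it."""
    query = " ".join([crop_part, color, appearance]).lower()
    disease = str(model.predict([query])[0])
    return {"disease": disease, "advice": ADVICE[disease]}


result = diagnose("stem", "brick red", "long and narrow streaks")
print(result)
```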
In conclusion, the integration of smartphones in this wheat disease detection system represents a significant step forward in agricultural technology. It bridges the gap between advanced machine learning models and practical, on-the-ground application, making sophisticated disease diagnosis accessible to farmers everywhere. Fig 2 illustrates the mobile application’s ability to assist farmers in diagnosing wheat diseases directly in the field with minimal effort.
3.8 Performance evaluation of the proposed system
The performance of the proposed system in classifying wheat diseases is assessed using the confusion matrix (Table 2) for the diseases listed in Table 1. Four evaluation metrics are computed using Eqs (1) through (4) from the values in the confusion matrix of each ML model. To summarize overall performance, averages are calculated across the columns. The critical values taken from each confusion matrix are:
True_Positive (TP): Specifies the number of instances in which the ML model accurately identified positive samples in the testing data as positive.
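True_Negative (TN): Specifies the number of instances in which the ML model accurately identified negative samples in the testing data as negative.
False_Positive (FP): Specifies the number of instances in which the ML model incorrectly identified negative samples as positive; for a given disease, these are samples of other classes that were assigned to it.
False_Negative (FN): Specifies the number of instances in which the ML model incorrectly identified positive samples as negative; for a given disease, these are its samples that were assigned to other classes.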
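The four counts can be read off a multi-class confusion matrix one disease at a time. The sketch below assumes the standard textbook definitions of accuracy, precision, recall, and F1 score, since Eqs (1) through (4) are not reproduced in this section, and it assumes that rows hold true classes and columns hold predicted classes. The function names and the example matrix are hypothetical.

```python
def per_class_counts(confusion, k):
    """Return (TP, FP, FN, TN) for class k of a square confusion matrix.

    Assumed convention: confusion[i][j] counts samples whose true class
    is i and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                       # true k, predicted other
    fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true other
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def class_metrics(confusion, k):
    """Standard one-vs-rest metrics for class k."""
    tp, fp, fn, tn = per_class_counts(confusion, k)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical 3-class matrix, not taken from Table 2.
EXAMPLE_MATRIX = [
    [5, 1, 0],
    [2, 3, 0],
    [0, 0, 4],
]

if __name__ == "__main__":
    for k in range(len(EXAMPLE_MATRIX)):
        print(k, class_metrics(EXAMPLE_MATRIX, k))
```

Averaging the per-class values then gives a single figure per metric; how the averages are weighted affects the result and is a separate choice.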
4. Results and discussion
This section describes the test environment and the outcomes of classifying wheat diseases from text data using DT, RF, SVM, and AdaBoost. It specifically addresses the accuracy of the proposed system, as shown in Table 3, and provides an in-depth analysis of the effectiveness of the ML models on the classification task. The performance of the proposed system is assessed by examining its accuracy, precision, recall, and F1 score.
4.1 Test environment
The experimental setup underlying this study is characterized by the system, datasets, and machine learning algorithms used. The experiments are conducted on a 64-bit Windows 10 Enterprise platform. The central processing unit is an Intel(R) Core(TM) i5-7200U CPU running at a base clock speed of 2.50 GHz, with a maximum turbo frequency of 2.71 GHz. The system is equipped with 8.0 gigabytes (GB) of Random Access Memory (RAM). The primary machine learning classifiers employed are DT, RF, SVM, and AdaBoost, which form the basis for the analysis and classification of the data. The proposed system is implemented in Python 3.11. The machine learning models are trained and tested with the scikit-learn library, version 1.3.0, and images, results, and graphs are visualized with the Matplotlib library, version 3.7.1.
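Because results can depend on library versions, it may help to record the environment alongside the experiments. The following is a minimal sketch using only the Python standard library; the helper names package_version and environment_report are hypothetical, and a package that is not installed is reported as None.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("scikit-learn", "matplotlib")):
    """Collect interpreter, operating system, and package versions in one dict."""
    report = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    for name in packages:
        report[name] = package_version(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```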
4.2 Decision tree
In the first experiment (Fig 3), a Decision Tree (DT) is employed with a count vectorizer, using the hyperparameters max_depth = 10, min_samples_split = 3, and min_samples_leaf = 1. The model achieved an accuracy of 86%, meaning that it correctly predicted the outcome in 86% of cases. Precision, which measures the accuracy of positive predictions, is 77%: of all instances predicted as positive, 77% are indeed positive. Recall, which gauges the model’s ability to identify all relevant instances, is 86%, so the model captured 86% of all actual positive instances. The F1 score, a balanced metric combining precision and recall, is 80%. In the second experiment, TF-IDF (Term Frequency-Inverse Document Frequency) is used without specific hyperparameters. Despite the absence of hyperparameter tuning, accuracy remained at 86%, and precision, recall, and F1 score are likewise unchanged at 77%, 86%, and 80%, respectively. TF-IDF without hyperparameter adjustments therefore performed on par with the count vectorizer with selected hyperparameters, indicating that DT is robust to both choices on this dataset.
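The first DT configuration can be written down as a scikit-learn pipeline. This is a sketch under stated assumptions rather than the study's code: the three hyperparameter values come from the text, while the function name, the fixed random_state, and the toy sentences used to fit the tree are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


def build_decision_tree_pipeline():
    """Count-vectorizer features feeding a depth-limited decision tree."""
    tree = DecisionTreeClassifier(
        max_depth=10,         # the tree may be at most 10 levels deep
        min_samples_split=3,  # a node needs at least 3 samples to be split
        min_samples_leaf=1,   # a leaf may hold a single sample
        random_state=0,       # assumption: fixed seed for repeatable splits
    )
    return make_pipeline(CountVectorizer(), tree)


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    texts = [
        "orange pustules on leaves",
        "orange pustules on leaf surface",
        "yellow stripes along leaf veins",
        "yellow stripes on leaves",
        "black powder replaces grain",
        "ear turns into black powder",
    ]
    labels = ["leaf rust", "leaf rust", "stripe rust",
              "stripe rust", "loose smut", "loose smut"]
    pipeline = build_decision_tree_pipeline().fit(texts, labels)
    print(pipeline.named_steps["decisiontreeclassifier"].get_depth())
```

Swapping CountVectorizer for TfidfVectorizer and dropping the explicit hyperparameters would give the analogue of the second experiment.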
4.3 Random forest
In the first experiment, whose outcomes are presented in Fig 4, Random Forest (RF) is employed with a count vectorizer. The model achieved an accuracy of 86%, meaning that it made correct predictions in 86% of cases. Precision, which reflects the accuracy of positive predictions, is 82%, signifying that 82% of instances predicted as positive are indeed positive. Recall, which measures the model’s ability to identify all relevant instances, is 86%, indicating that the model captured 86% of all actual positive instances. The F1 score, a balanced metric combining precision and recall, is 83%. In the second experiment, TF-IDF is used with the hyperparameters max_depth = None and min_samples_split = 2. Here accuracy is 85%, precision is 79%, recall is 85%, and the F1 score is 82%. In summary, the shift from count vectorizer to TF-IDF, together with the change of hyperparameters, produced a small decrease in every metric: accuracy, recall, and F1 score each fell by one percentage point, and precision fell by three. This suggests that the choice of vectorization method (count vectorizer or TF-IDF) and the tuning of hyperparameters both influence the model’s predictive capabilities, although the effect on RF is modest.
4.4 Support Vector Machine
In the first experiment (Fig 5), a Support Vector Machine (SVM) is applied with the count vectorizer, and the results demonstrate a high level of performance. Accuracy reached 99%, indicating that the model correctly predicted outcomes in 99% of cases. Precision, which gauges the accuracy of positive predictions, is notably high at 95%, signifying that almost all instances predicted as positive are indeed positive. Recall, representing the model’s ability to identify all relevant instances, is 98%, implying that the model captured 98% of all actual positive instances. The F1 score, a balanced metric combining precision and recall, is 96%. In the second experiment, TF-IDF is employed instead of the count vectorizer, and performance decreased substantially. Accuracy dropped to 89%, precision decreased to 83%, and recall dropped to 89%, so the model both made fewer correct positive predictions and captured a lower percentage of actual positive instances. Consequently, the F1 score declined to 85%. In summary, the transition from count vectorizer to TF-IDF led to a notable decrease in all of SVM’s performance metrics. This emphasizes the sensitivity of SVM to the choice of vectorization method; in this case, the count vectorizer proved more effective in capturing patterns in the data.
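The two representations differ in how much weight each term receives, which matters for a margin-based classifier because its decision boundary depends on inner products between feature vectors. The sketch below contrasts raw counts with TF-IDF weights in plain Python. It assumes whitespace tokenization and the smoothed IDF with L2 normalization that scikit-learn documents as the TfidfVectorizer defaults. The function names and example sentences are hypothetical, and the study itself does not test why SVM reacts to the change.

```python
import math
from collections import Counter

# Hypothetical symptom descriptions, for illustration only.
EXAMPLE_DOCS = [
    "yellow stripes on leaves",
    "orange pustules on leaves",
    "black powder on ear",
]


def count_vectors(docs):
    """Bag-of-words counts: one {term: count} mapping per document."""
    return [Counter(doc.lower().split()) for doc in docs]


def tfidf_vectors(docs):
    """TF-IDF with smoothed IDF, then L2 normalisation of each document."""
    counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1
           for t, df in doc_freq.items()}
    vectors = []
    for c in counts:
        weighted = {t: tf * idf[t] for t, tf in c.items()}
        # Fall back to 1.0 so an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weighted.values())) or 1.0
        vectors.append({t: w / norm for t, w in weighted.items()})
    return vectors


if __name__ == "__main__":
    print(count_vectors(EXAMPLE_DOCS)[0])
    print(tfidf_vectors(EXAMPLE_DOCS)[0])
```

In the first example sentence, "yellow" and "on" both have a count of 1, but TF-IDF gives "yellow" the larger weight because "on" occurs in every document.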
4.5 AdaBoost
In the first experiment, AdaBoost is employed with a count vectorizer, and the results are summarized in Fig 6. Accuracy reached 91%, signifying that the model made correct predictions in 91% of cases. Precision, which measures the accuracy of positive predictions, is 86%, indicating that 86% of instances predicted as positive are indeed positive. Recall, representing the model’s ability to identify all relevant instances, is 91%, implying that the model captured 91% of all actual positive instances. The F1 score, a balanced metric combining precision and recall, is 88%. In the second experiment, TF-IDF is employed instead of the count vectorizer, and performance decreased. Accuracy dropped to 89%, precision decreased to 83%, and recall dropped to 89%, so the model captured a lower percentage of actual positive instances. Consequently, the F1 score declined to 85%. To summarize, the transition from the count vectorizer to TF-IDF led to a noticeable decrease in AdaBoost’s accuracy, precision, recall, and F1 score. This highlights the sensitivity of AdaBoost to the choice of vectorization method; in this case, the count vectorizer proved more effective in capturing patterns in the data.
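In the figures reported above, recall equals accuracy for most model and vectorizer pairs, for example 86% and 86% for DT and 91% and 91% for AdaBoost with the count vectorizer. The source does not state how the per-class scores were averaged. The pattern is consistent with a support-weighted average, for which recall is mathematically identical to overall accuracy, but this is only an assumption. The sketch below checks that identity on a hypothetical confusion matrix (rows are true classes, columns are predicted classes) and contrasts it with the unweighted macro average.

```python
def overall_accuracy(confusion):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


def weighted_recall(confusion):
    """Per-class recall averaged with weights proportional to class support."""
    total = sum(sum(row) for row in confusion)
    score = 0.0
    for k, row in enumerate(confusion):
        support = sum(row)
        if support:
            score += (support / total) * (row[k] / support)
    return score


def macro_recall(confusion):
    """Per-class recall averaged with equal weight for every class."""
    recalls = [row[k] / sum(row) for k, row in enumerate(confusion) if sum(row)]
    return sum(recalls) / len(recalls)


# Hypothetical 3-class matrix, not taken from the study.
EXAMPLE_MATRIX = [
    [5, 1, 0],
    [2, 3, 0],
    [0, 0, 4],
]

if __name__ == "__main__":
    print(overall_accuracy(EXAMPLE_MATRIX))
    print(weighted_recall(EXAMPLE_MATRIX))
    print(macro_recall(EXAMPLE_MATRIX))
```

The identity holds because each class's support cancels in the weighted sum, leaving the diagonal total divided by the sample count.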
DT, SVM, RF, and AdaBoost are compared in Fig 7 to determine how vectorization strategies affect model performance. Decision Trees performed consistently with count vectorization and TF-IDF, with stable accuracy, precision, recall, and F1 score, so DT appears insensitive to the vectorization approach. SVM, in contrast, is sensitive to vectorization, losing accuracy, precision, recall, and F1 score when switching from count vectorizer to TF-IDF. This emphasizes the need for careful selection of the vectorization method for SVM. Random Forest also performed slightly worse with TF-IDF. Although the drop is much smaller than for SVM, the pattern suggests that count vectorization may benefit Random Forest in this scenario. AdaBoost likewise performed worse with TF-IDF, supporting the view that the vectorization approach can affect SVM, RF, and AdaBoost. Beyond vectorization, hyperparameter choices may also matter for SVM, which may need further tuning to perform well with TF-IDF features. With SVM, TF-IDF reduced recall from 98% to 89% and precision from 95% to 83%. Because precision and recall can respond differently to such changes, the goals of the analysis should guide the balance between correctly predicting positive instances and capturing all relevant instances.
This study suggests that model performance varies between algorithms; the vectorization method and hyperparameters should therefore be tailored to the dataset and analytic goals. These insights help in evaluating algorithmic performance and in making informed decisions when applying machine learning to real-world challenges.
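A comparison of this kind can be organized as a loop over every classifier and vectorizer pair. The following is a minimal sketch, not the study's protocol: the single stratified hold-out split, the weighted averaging of metrics, the linear SVM kernel, the default hyperparameters, and the name compare are all assumptions introduced for illustration.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Factories, so that every run gets fresh, unfitted objects.
VECTORIZERS = {"CV": CountVectorizer, "TF-IDF": TfidfVectorizer}
CLASSIFIERS = {
    "DT": lambda: DecisionTreeClassifier(random_state=0),
    "RF": lambda: RandomForestClassifier(random_state=0),
    "SVM": lambda: SVC(kernel="linear"),  # kernel choice is an assumption
    "AdaBoost": lambda: AdaBoostClassifier(random_state=0),
}


def compare(texts, labels, test_size=0.25, seed=0):
    """Score every classifier/vectorizer pair on one stratified hold-out split."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    results = {}
    for vec_name, make_vec in VECTORIZERS.items():
        for clf_name, make_clf in CLASSIFIERS.items():
            model = make_pipeline(make_vec(), make_clf())
            model.fit(x_train, y_train)
            predicted = model.predict(x_test)
            precision, recall, f1, _ = precision_recall_fscore_support(
                y_test, predicted, average="weighted", zero_division=0
            )
            results[(clf_name, vec_name)] = {
                "accuracy": accuracy_score(y_test, predicted),
                "precision": precision,
                "recall": recall,
                "f1": f1,
            }
    return results
```

Calling compare(texts, labels) with a list of symptom descriptions and their disease labels returns one metrics dictionary per pair, keyed by (classifier, vectorizer). Using the same split for every pair keeps the comparison fair, though repeated splits or cross-validation would give more stable estimates.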
The discussion highlights the effectiveness of the proposed system in wheat disease detection compared to existing state-of-the-art (SOTA) approaches, as shown in Table 3. While previous methods like Fuzzy Logic [1] and Decision Trees [3] demonstrated high accuracy (99.3% and 98%, respectively) in classifying a limited number of diseases, the proposed system excels by accurately identifying a broader spectrum of 14 wheat diseases. Within the proposed system, Decision Trees achieved 86% accuracy with both Count Vectorizer (CV) and Term Frequency-Inverse Document Frequency (TF-IDF), while Random Forest showed similar robustness with accuracies of 86% (CV) and 85% (TF-IDF). The Support Vector Machine (SVM) classifier was particularly effective, reaching 99% accuracy with CV, though it was lower at 89% with TF-IDF. AdaBoost also performed well, achieving 91% accuracy with CV and 89% with TF-IDF.
The comparison suggests that while traditional methods perform well in specific contexts, the proposed system’s ability to handle more complex datasets with multiple diseases is a significant advantage. However, the discussion would benefit from a deeper comparison with recent literature, particularly concerning advanced methods like fuzzy logic and deep learning or hybrid approaches.
Additionally, the integration of smartphone applications, although briefly mentioned, plays a crucial role in the system’s practical utility. The application, designed for ease of use with features like language support and offline functionality, allows farmers to quickly and accurately identify wheat diseases, even in regions with limited access to advanced agricultural services. By combining machine learning with a user-friendly smartphone interface, the proposed system not only improves disease detection accuracy but also enhances its accessibility and practicality in real-world agricultural settings, ultimately supporting better decision-making and crop management.
5. Conclusions
This study introduced a system for the identification of wheat diseases using an ML approach and a text-based dataset comprising various types of wheat diseases, such as black stem rust, leaf rust, stripe rust, loose smut, flag smut, complete bunt, partial bunt, ear cockle, tundo, black point complex, common bunt, sooty head molds, stagonospora nodorum blotch, and barley yellow dwarf. Feature selection techniques are employed to ensure precise preprocessing, and two feature representations are extracted using the count vectorizer and TF-IDF. Four machine learning models are trained on the extracted features, and their performance is compared with that of SOTA ML models. The effectiveness of the proposed system is assessed using a diverse set of evaluation metrics, including accuracy, precision, recall, and F1 score. The comparative analysis in Table 3 shows that the proposed system demonstrates a higher level of accuracy than current SOTA methodologies, as it successfully predicted 14 different wheat diseases together with management recommendations. Future work aims to expand the existing dataset to include more classes of wheat diseases as well as other crops, to deploy the system in regions outside Pakistan, and to integrate a treatment suggestion feature for the detected diseases. This would enable effective and timely assistance to farmers in the field while minimizing the waste of resources and time on a global scale.
References
- 1. Toseef M. and Khan M.J., An intelligent mobile application for diagnosis of crop diseases in Pakistan using fuzzy inference system. Computers and Electronics in Agriculture, 2018. 153: p. 1–11.
- 2. Chouhan S.S., Singh U.P., and Jain S., Applications of computer vision in plant pathology: a survey. Archives of computational methods in engineering, 2020. 27(2): p. 611–632.
- 3. Haider W., et al., A generic approach for wheat disease classification and verification using expert opinion for knowledge-based decisions. IEEE Access, 2021. 9: p. 31104–31129.
- 4. Nation F.U. FAO in Pakistan. Pakistan at a Glance 2023 [cited 2023 14-10-2022]; Available from: https://www.fao.org/pakistan/our-office/pakistan-at-a-glance/en/.
- 5. Figueroa M., Hammond‐Kosack K.E., and Solomon P.S., A review of wheat diseases—a field perspective. Molecular plant pathology, 2018. 19(6): p. 1523–1536. pmid:29045052
- 6. Jha K., et al., A comprehensive review on automation in agriculture using artificial intelligence. Artificial Intelligence in Agriculture, 2019. 2: p. 1–12.
- 7. Khan N., et al., An adaptive game-based learning strategy for children road safety education and practice in virtual space. Sensors, 2021. 21(11): p. 3661. pmid:34070237
- 8. Haroon U., et al., A multi-stream sequence learning framework for human interaction recognition. IEEE Transactions on Human-Machine Systems, 2022. 52(3): p. 435–444.
- 9. Khan H., et al., Automated Wheat Diseases Classification Framework Using Advanced Machine Learning Technique. Agriculture, 2022. 12(8): p. 1226.
- 10. Yar H., et al., Vision sensor-based real-time fire detection in resource-constrained IoT environments. Computational intelligence and neuroscience, 2021. 2021. pmid:34970311
- 11. Patrício D.I. and Rieder R., Computer vision and artificial intelligence in precision agriculture for grain crops: A systematic review. Computers and electronics in agriculture, 2018. 153: p. 69–81.
- 12. Bilal A., Sun G., and Mazhar S., Finger-vein recognition using a novel enhancement method with convolutional neural network. Journal of the Chinese Institute of Engineers, 2021. 44(5): p. 407–417.
- 13. Bilal A., et al., Improved Support Vector Machine based on CNN-SVD for vision-threatening diabetic retinopathy detection and classification. Plos one, 2024. 19(1): p. e0295951. pmid:38165976
- 14. Bilal A., et al., EdgeSVDNet: 5G-enabled detection and classification of vision-threatening diabetic retinopathy in retinal fundus images. Electronics, 2023. 12(19): p. 4094.
- 15. Bilal A., et al., NIMEQ-SACNet: A novel self-attention precision medicine model for vision-threatening diabetic retinopathy using image data. Computers in Biology and Medicine, 2024. 171: p. 108099. pmid:38364659
- 16. Bilal A., et al., Diabetic retinopathy detection and classification using mixed models for a disease grading database. IEEE Access, 2021. 9: p. 23544–23553.
- 17. Bilal A., et al., AI-based automatic detection and classification of diabetic retinopathy using U-Net and deep learning. Symmetry, 2022. 14(7): p. 1427.
- 18. Feng X., et al., Advancing single-cell RNA-seq data analysis through the fusion of multi-layer perceptron and graph neural network. Briefings in Bioinformatics, 2024. 25(1): p. bbad481.
- 19. Yu X., et al., iDNA-OpenPrompt: OpenPrompt learning model for identifying DNA methylation. Frontiers in Genetics, 2024. 15: p. 1377285. pmid:38689652
- 20. Bilal A., et al., BC-QNet: A quantum-infused ELM model for breast cancer diagnosis. Computers in Biology and Medicine, 2024. 175: p. 108483. pmid:38704900
- 21. Bilal A., et al., IGWO-IVNet3: DL-based automatic diagnosis of lung nodules using an improved gray wolf optimization and InceptionNet-V3. Sensors, 2022. 22(24): p. 9603. pmid:36559970
- 22. Bilal A., et al., Lung nodules detection using grey wolf optimization by weighted filters and classification using CNN. Journal of the Chinese Institute of Engineers, 2022. 45(2): p. 175–186.
- 23. Bilal A. and Sun G., Neuro-optimized numerical solution of non-linear problem based on Flierl–Petviashivili equation. SN Applied Sciences, 2020. 2(7): p. 1166.
- 24. Bilal A., et al., Neuro-optimized numerical treatment of HIV infection model. International Journal of Biomathematics, 2021. 14(05): p. 2150033.
- 25. Bilal A., et al., Increasing crop quality and yield with a machine learning-based crop monitoring system. Comput Mater Continua, 2023. 76(2): p. 2401–2426.
- 26. Paul A., et al., A review on agricultural advancement based on computer vision and machine learning. Emerging Technology in Modelling and Graphics: Proceedings of IEM Graph 2018, 2020: p. 567–581.
- 27. Ngugi L.C., Abelwahab M., and Abo-Zahhad M., Recent advances in image processing techniques for automated leaf pest and disease recognition–A review. Information processing in agriculture, 2021. 8(1): p. 27–51.
- 28. Bao W., et al., Identification of wheat leaf diseases and their severity based on elliptical-maximum margin criterion metric learning. Sustainable Computing: Informatics and Systems, 2021. 30: p. 100526.
- 29. Anwar S.A., Salahuddin G., and Rauf C.A., Nematode diseases of rice in the Punjab, Pakistan. Pakistan Journal of Agricultural Research, 1993. 14(2/3): p. 184–191.
- 30. Mukhtar I., Sunflower disease and insect pests in Pakistan: A review. African crop science journal, 2009. 17(2).
- 31. Atiq-ur-Rehman R., et al., Status of foliar diseases of wheat in Punjab, Pakistan. Mycopath, 2011. 9(1): p. 39–42.
- 32. Evenson R.E., Waggoner P.E., and Ruttan V.W., Economic benefits from research: An example from agriculture. Science, 1979. 205(4411): p. 1101–1107. pmid:17735033
- 33. Gollin D., Parente S., and Rogerson R., The role of agriculture in development. American economic review, 2002. 92(2): p. 160–164.
- 34. Das S., Guha D., and Dutta B., Medical diagnosis with the aid of using fuzzy logic and intuitionistic fuzzy logic. Applied Intelligence, 2016. 45(3): p. 850–867.
- 35. Dagar P., Jatain A., and Gaur D. Medical diagnosis system using fuzzy logic toolbox. in International Conference on Computing, Communication & Automation. 2015. IEEE.
- 36. Gokmen G., et al., Evaluation of student performance in laboratory applications using fuzzy logic. Procedia-Social and Behavioral Sciences, 2010. 2(2): p. 902–909.
- 37. Abhishek K., et al., Weather forecasting model using artificial neural network. Procedia Technology, 2012. 4: p. 311–318.
- 38. Socity P.A. Smartphone Usage in Pakistan. Technical Report. 2022 10 December 2022; Available from: https://pas.org.pk/smart-phone-usage-in-pakistan-infographics/.
- 39. Rafea A., et al., Development and implementation of a knowledge acquisition methodology for crop management expert systems. Computers and electronics in agriculture, 1993. 8(2): p. 129–146.
- 40. El-Telbany M.E., Warda M., and El-Borahy M., Mining the Classification Rules for Egyptian Rice Diseases. Int. Arab J. Inf. Technol., 2006. 3(4): p. 303–307.
- 41. Rafea A. and Shaalan K., Using expert systems as a training tool in the agriculture sector in Egypt. Expert Systems with applications, 1996. 11(3): p. 343–349.
- 42. Ilic M., et al. Data mining model for early fruit diseases detection. in 2015 23rd Telecommunications Forum Telfor (TELFOR). 2015. IEEE.
- 43. Li H., et al. WEB-based intelligent diagnosis system for cotton diseases control. in International Conference on Computer and Computing Technologies in Agriculture. 2010. Springer.
- 44. Kamalak P. and Hemalatha K., Agro genius: an emergent expert system for querying agricultural clarification using data mining technique. Int. J. Eng. Sci, 2012. 1(11): p. 34–39.
- 45. Khan F.S., et al. Dr. Wheat: a Web-based expert system for diagnosis of diseases and pests in Pakistani wheat. in Proceedings of the world congress on engineering. 2008.
- 46. Sarangi S., Umadikar J., and Kar S., Automation of Agriculture Support Systems using Wisekar: Case study of a crop-disease advisory service. Computers and electronics in agriculture, 2016. 122: p. 200–210.
- 47. Camargo A., et al., Intelligent systems for the assessment of crop disorders. Computers and electronics in agriculture, 2012. 85: p. 1–7.
- 48. Naik S.I., Kanandreddy V., and Sannakki S., Plant disease diagnosis system for improved crop yield. International Journal of Innovations in Engineering and Technology, 2014. 4: p. 198–204.
- 49. Shafinah K., et al., A FRAMEWORK OF AN EXPERT SYSTEM FOR CROP PEST AND DISEASE MANAGEMENT. Journal of Theoretical & Applied Information Technology, 2013. 58(1).
- 50. Dewanto S. and Lukas J. Expert system for diagnosis pest and disease in fruit plants. in EPJ Web of Conferences. 2014. EDP Sciences.
- 51. Abu-Naser S.S., Kashkash K., and Fayyad M., Developing an expert system for plant disease diagnosis. 2010.
- 52. Kolhe S., et al. KMSCD: Knowledge management system for crop diseases. in 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC). 2009. IEEE.
- 53. Kolhe S., et al., A web-based intelligent disease-diagnosis system using a new fuzzy-logic based approach for drawing the inferences in crops. Computers and electronics in agriculture, 2011. 76(1): p. 16–27.
- 54. Petrellis N. A smart phone image processing application for plant disease diagnosis. in 2017 6th international conference on modern circuits and systems technologies (MOCAST). 2017. IEEE.
- 55. www.psu.edu. New mobile app diagnoses crop diseases in the field and alerts rural farmers. 2022 [cited 11 December 2022]; Available from: https://www.psu.edu/news/research/story/new-mobile-app-diagnoses-crop-diseases-field-and-alerts-rural-farmers/.
- 56. www.psu.edu. Plantix: A Mobile Application for Agriculture Sector. Technical Report. 2022 [cited 12 December 2022]; Available from: http://www.plantix.net.
- 57. Mohanty S.P., Hughes D.P., and Salathé M., Using deep learning for image-based plant disease detection. Frontiers in plant science, 2016. 7: p. 1419. pmid:27713752
- 58. Petrellis N. Plant Disease Diagnosis Based on Image Processing, Appropriate for Mobile Phone Implementation. in HAICTA. 2015.
- 59. Pongnumkul S., Chaovalit P., and Surasvadi N., Applications of smartphone-based sensors in agriculture: a systematic review of research. Journal of Sensors, 2015. 2015.
- 60. Fayyad U., Piatetsky-Shapiro G., and Smyth P., From data mining to knowledge discovery in databases. AI magazine, 1996. 17(3): p. 37–37.
- 61. Mantovani R.G., et al. Hyper-parameter tuning of a decision tree induction algorithm. in 2016 5th Brazilian Conference on Intelligent Systems (BRACIS). 2016. IEEE.
- 62. Quinlan J.R., C4.5: programs for machine learning. 2014: Elsevier.
- 63. Breiman L., et al., Classification and regression trees. Belmont, CA: Wadsworth International Group, 1984.
- 64. Maimon O. and Rokach L., Data mining and knowledge discovery handbook. Vol. 2. 2005: Springer.
- 65. Jankowski D. and Jackowski K. Evolutionary algorithm for decision tree induction. in Computer Information Systems and Industrial Management: 13th IFIP TC8 International Conference, CISIM 2014, Ho Chi Minh City, Vietnam, November 5–7, 2014. Proceedings 14. 2014. Springer.
- 66. Barros R.C., et al., A survey of evolutionary algorithms for decision-tree induction. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2011. 42(3): p. 291–312.
- 67. Braga I., et al. A note on parameter selection for support vector machines. in Advances in Soft Computing and Its Applications: 12th Mexican International Conference on Artificial Intelligence, MICAI 2013, Mexico City, Mexico, November 24–30, 2013, Proceedings, Part II 12. 2013. Springer.
- 68. Friedrichs F. and Igel C., Evolutionary tuning of multiple SVM parameters. Neurocomputing, 2005. 64: p. 107–117.
- 69. Feurer M., Springenberg J., and Hutter F. Initializing bayesian hyperparameter optimization via meta-learning. in Proceedings of the AAAI Conference on Artificial Intelligence. 2015.
- 70. Chapelle O., et al., Choosing multiple parameters for support vector machines. Machine learning, 2002. 46: p. 131–159.
- 71. Ali S. and Smith-Miles K.A., A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 2006. 70(1–3): p. 173–186.
- 72. Bergstra J., et al., Algorithms for hyper-parameter optimization. Advances in neural information processing systems, 2011. 24.
- 73. Reif M., Shafait F., and Dengel A. Prediction of classifier training time including parameter optimization. in KI 2011: Advances in Artificial Intelligence: 34th Annual German Conference on AI, Berlin, Germany, October 4–7, 2011. Proceedings 34. 2011. Springer.
- 74. Molina M., et al., Meta-Learning Approach for Automatic Parameter Tuning: A Case Study with Educational Datasets. International Educational Data Mining Society, 2012.
- 75. Reif M., et al., Automatic classifier selection for non-experts. Pattern Analysis and Applications, 2014. 17: p. 83–96.
- 76. Breiman L., Random forests. Machine learning, 2001. 45: p. 5–32.
- 77. Ahmad M., et al., Hybrid tools and techniques for sentiment analysis: a review. Int. J. Multidiscip. Sci. Eng, 2017. 8(3): p. 29–33.
- 78. Banerjee I., et al., Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification. Artificial intelligence in medicine, 2019. 97: p. 79–88. pmid:30477892
- 79. sklearn.tree.DecisionTreeClassifier. 2023 [cited 2023 19-11-2023]; Available from: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.