Prediction method of sugarcane important phenotype data based on multi-model and multi-task

Jihong Sun; Chen Sun; Zhaowen Li; Ye Qian; Tong Li

doi:10.1371/journal.pone.0312444

Abstract

The efficacy of generalized sugarcane yield prediction models holds significant implications for global food security. Given that machine learning algorithms often surpass the precision of remote sensing technology, further exploration of machine learning algorithms in the development of sugarcane yield prediction models is imperative. In this study, we employed six key phenotypic traits of sugarcane, specifically plant height, stem diameter, third-node length (internode length), leaf length, leaf width, and field brix, along with eight machine learning methods: logistic regression, linear regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Backpropagation Neural Network (BPNN), Decision Tree, Random Forest, and the XGBoost algorithm. The aim was to establish an intelligent model ensemble for predicting the two crucial phenotypic characteristics that determine sugarcane yield, stem diameter and plant height, ultimately enhancing the overall yield. The experimental findings indicate that the XGBoost algorithm outperforms the other seven algorithms in predicting these significant phenotypic traits of sugarcane. Furthermore, an analysis of the sugarcane intelligent prediction model’s performance under a specialized data environment, incorporating self-prepared data, reveals that the XGBoost algorithm exhibits greater stability. Notably, the data pertaining to these crucial phenotypic traits have a profound impact on the efficacy of the intelligent models. The research demonstrates that a sugarcane yield prediction model ensemble, incorporating multiple intelligent algorithms, can accurately forecast stem diameter and plant height, thereby predicting sugarcane yield. Additionally, this approach, combined with the principles of sugarcane cross-breeding, provides a valuable reference for the artificial breeding of new sugarcane varieties that excel in stem diameter and plant height, bridging a research gap in indirect yield prediction through sugarcane phenotypic traits.

Citation: Sun J, Sun C, Li Z, Qian Y, Li T (2024) Prediction method of sugarcane important phenotype data based on multi-model and multi-task. PLoS ONE 19(12): e0312444. https://doi.org/10.1371/journal.pone.0312444

Editor: Paulo Eduardo Teodoro, Federal University of Mato Grosso do Sul, BRAZIL

Received: June 29, 2024; Accepted: October 7, 2024; Published: December 13, 2024

Copyright: © 2024 Sun et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript.

Funding: This research work was funded by the Open Fund of Yunnan Key Laboratory of Crop Production and Smart Agriculture. This work has been supported by the Major Project of Science and Technology of Yunnan Province under Grant No. 202302AE090020, No. 202002AE090010, and No. 202002AD080002, and by the Scholarship for Academic Leader of Yunnan Province, funding No. 202405AC350108. Funded by Prof. Tong Li, who was responsible for resource management, publication decisions and guidance in the study.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Sugarcane stands as the preeminent sugar crop globally, serving as a primary source of food and energy for humans. With an annual output exceeding 1.75 billion tons, it ranks among the world’s most extensively cultivated crops. Brazil, the foremost producer in the world, contributes over 700 million tons annually, while India, the second-largest producer, generates more than 300 million tons. China, occupying the fourth position, produces over 200 million tons annually [1]. However, sugarcane’s productivity has been significantly hampered by numerous challenges, including continuous cropping obstacles, diseases and pests, low yield, and lodging. Therefore, accurate prediction of sugarcane yield is paramount for global sugar security.

Crop yield, as a crucial aspect of global food production, constitutes one of the most prevalent research domains among scholars [2]. The development of crop yield prediction models falls into two primary categories: classical empirical models [3–7] and machine learning models [8, 9]. Classical empirical models typically rely on field surveys, biophysical simulations, and statistical frameworks to establish estimation models [10]. In field surveys, seasoned farmers or experts predict crop yields based on field observations. This process demands considerable time and labor, and the effectiveness of these predictions is inherently subjective, relying heavily on the forecaster’s experience. Biophysical models can simulate crop growth stages and yields, providing insights into crop development under diverse meteorological conditions [11, 12]. Nevertheless, these models necessitate rigorous testing and calibration, often requiring a vast array of input parameters, such as soil moisture, meteorological data, and agricultural management information [13]. While the accuracy of such models generally hovers around 70%, they necessitate comprehensive data collection at the scale of the crop-growing area, introducing substantial uncertainty and reducing their predictive performance at larger scales [14]. Statistical models, grounded in probability theory, utilize mathematical statistics to establish functional relationships between variables based on experimental measurements [15]. However, capturing the intricate nonlinear relationships between dependent and independent variables remains a challenge within this framework [16].

In recent scholarly endeavors, numerous studies have demonstrated the superiority of models incorporating multi-intelligent algorithm fusion over traditional field surveys, biophysical models, and statistical models in the context of crop science applications. Within this context, machine learning algorithms can predict crop yields by exploring the nonlinear relationship between influencing factors and crop yields, successfully addressing classification issues and gaining widespread application [17]. The precision of these models is contingent upon the accuracy of the influencing factors and the comprehensiveness of the data. Furthermore, the accuracy of these influencing factors can be retrospectively validated through the construction of verification models utilizing machine learning techniques [18–20]. Currently, machine learning methodologies have gained widespread adoption in forecasting agricultural economic variables [21–24]. Notably, significant achievements have been recorded in the realm of crop yield prediction [25, 26], underscoring the transformative potential of this technology in advancing agricultural predictions and management.

Khaki and Wang [27] formulated a residual neural network model to forecast output, marking a significant advancement in predictive analytics. Mupangwa et al. [28] introduced a long short-term memory (LSTM) model, which seamlessly integrates heterogeneous crop phenology, meteorological, and remote sensing data to predict maize yield at the county level. This innovative model outperforms LASSO and random forest, accounting for 76% of yield variability across the entire corn belt. Khaki et al. [29] furthered this research by developing a convolutional neural network-recurrent neural network (CNN-RNN) framework, enabling precise predictions of corn and soybean yields in 13 states of the US corn belt. Noorunnahar Mst et al. [30] used Autoregressive Integrated Moving Average (ARIMA) and Extreme Gradient Boosting (XGBoost) methods to predict annual rice production in Bangladesh (1961–2020) and compared their respective performances. Jiang et al. [31] also employed an LSTM model, leveraging a combination of crop phenology, meteorology, and remote sensing data to forecast maize yield at the county scale. Yuan Liu et al. [32] capitalized on machine learning and deep learning algorithms to devise wheat yield prediction models, leveraging satellite-derived high-resolution and coarse-resolution SIF, vegetation indices, and other pertinent data. Their work comprehensively evaluates and compares the performance of these models. Moreover, the current advancements in machine learning research methodologies facilitate the automatic and cost-effective acquisition of high-accuracy production data [33]. Through the implementation of IoT (Internet of Things) equipment, real-time meteorological data, soil conditions, crop diseases and pests, nutrient deficiencies, and other critical information within the planting area can be captured and transmitted to designated locations via sensors. This approach ensures the collection of high-precision, vast volumes of data, ultimately enhancing the accuracy of predictive models.

A recent scholarly inquiry delves into the integration of machine learning and crop modeling to enhance the accuracy of crop yield forecasting in the US corn belt. The principal objective aims to investigate whether the hybrid approach of crop modeling and machine learning (ML) can yield a superior predictive model, providing the utmost precision in yield projections, and identify the most efficacious crop modeling functions and ML integration strategies for corn yield prediction [34]. Additionally, although efforts have been made in real-time prediction of sugarcane yield based on harvester engine parameters and ML methods [35], there remains a scarcity of studies examining the utilization of diverse ML algorithms to construct an intelligent algorithmic ensemble for predicting crop phenotypic traits and determining crop yield.

In the realm of constructing machine learning models, certain researchers employ diverse algorithms to build identical models and conduct comparative analyses [36]. Virendra Kumar Shrivastava [37] and colleagues conducted research on machine learning (ML) technology, utilizing various input features to forecast the temperature in New Delhi for the next year with a 6-hour resolution. They compared and analyzed the prediction results obtained from the Deep Neural Network Model (DNNM) with those from a multiple regression model, achieving commendable outcomes. Kanchan Bala [38] employed grid search and bagging techniques to optimize the selected classifier (SVM-RBF), ultimately identifying the optimal classifier. However, there is a scarcity of research that combines the construction and comparative analysis of intelligent algorithm model groups utilizing multiple machine learning algorithms with the prediction of crop phenotypic traits based on the characteristics of the research subject. Specifically, the field of smart agriculture still lacks a comprehensive approach for leveraging phenotypic characteristics of multiple crop varieties to forecast yield and guide cross-breeding strategies. Therefore, further exploration of novel methodologies utilizing various ML techniques to mine phenotypic data for crop yield prediction is imperative.

In the present study, we focused on predicting two critical sugarcane phenotypic characteristics that significantly influence yield: stem diameter and plant height. Initially, we curated a dataset comprising six phenotypic traits from the sugarcane resource nursery at Yunnan Agricultural University. To assess the performance of sugarcane yield prediction, we employed eight diverse machine learning algorithms, encompassing logistic regression, linear regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Backpropagation Neural Network (BPNN), Decision Tree, Random Forest, and XGBoost, to construct predictive models. Our research aims to address the following fundamental inquiries: (1) Which machine learning algorithm, or combination of algorithms, offers superior predictive capabilities for sugarcane phenotypic characteristics? (2) Which intelligent algorithm, or algorithms, is most effective in developing a predictive model that accurately captures the key factors determining yield?

Study area and data processing

Study area

The Sugarcane Resources Research Institute at Yunnan Agricultural University conducted a comprehensive survey and collection of wild sugarcane germplasm resources across China spanning the years 1985 to 1993. During this period, they successfully gathered 824 clones encompassing 18 species from 9 distinct genera within the Sugarcane Subrace. Leveraging advanced scientific conservation techniques, the institute established a sugarcane resource garden, where the planted wild sugarcane plants have been iteratively maintained to the present day [39]. These wild varieties have further been utilized in breeding programs, resulting in the development of new sugarcane cultivars through hybridization with existing commercial varieties. In the current study, we have developed an intelligent model group to predict sugarcane yield. This model group was constructed by analyzing phenotypic traits of sugarcane plants in the Yunnan sugarcane resource nursery and integrating machine learning algorithms. The resulting model not only offers a predictive framework but also serves as a platform for further popularization and demonstration, thus contributing to the enhancement of sugarcane cultivation and yield optimization.

Data set and data preprocessing

Datasets.

Phenotypic trait data were collected from wild sugarcane in the Yunnan sugarcane resource nursery and from new sugarcane varieties bred through wild sugarcane hybridization. These data provide the basis for constructing a yield prediction model from sugarcane phenotypic traits.

Dataset 1: Data pertaining to 33 phenotypic traits were comprehensively gathered, encompassing species name, plant height, stem diameter, tiller count, underground stem characteristics, internode length, stem color and shape, bud groove presence, wax band presence, node characteristics, root point characteristics, bud shape, bud size, bud color, bud growth status, leaf length, leaf width, widest main vein dimensions, leaf color, leaf tongue characteristics, ear morphology, leaf sheath features, hair group No. 57 of the leaf sheath, flag leaf length and width, inflorescence length, inflorescence maximum width, inflorescence color, inflorescence axis characteristics, growth period, flowering period, fruiting period, and field brix. The initial dataset comprised 1068 wild sugarcane plants, resulting in a data matrix of 1068 rows and 33 columns.

Dataset 2: In a scientific investigation, phenotypic trait data from 572 wild sugarcane plants were systematically collected, encompassing key attributes such as species name, plant height, stem diameter, internode length, leaf length, leaf width, and field brix. The dataset comprised a total of nine phenotypic traits, arranged in a matrix format with 572 rows and nine columns.

Data analysis and processing.

Initially, dataset 1 and dataset 2 were merged into a unified experimental dataset. Subsequently, an analysis revealed that stem diameter and plant height were the phenotypic traits most strongly correlated with sugarcane yield. Further refinement identified six key phenotypic traits: plant height, stem diameter, internode length, leaf length, leaf width, and field brix, which were selected as predictive indicators of significant research value for wild sugarcane yield, as referenced in previous studies [40, 41]. After rigorous data screening, the principle of outlier handling was applied, resulting in the elimination of data points outside the normal distribution range. The remaining 555 groups of valid data, totaling 3330 data points, are presented in Table 1. The refined experimental dataset, excluding redundant and abnormal data, comprises the aforementioned six phenotypic traits: plant height, stem diameter, internode length, leaf length, leaf width, and field brix, with a dataset size of 555 * 6.

Table 1. Collection of experimental data (in cm).

https://doi.org/10.1371/journal.pone.0312444.t001

Following the selection of phenotypic characteristics, the dataset was classified in accordance with the Specification for the Description of Sugarcane Germplasm Resources and Data Standards [42], along with the expert advice of the sugarcane research team. Table 2 outlines the categorization rules for the phenotypic traits of wild sugarcane. As shown, plants with a height of 0–99 cm constitute the first category, while 100 cm is designated as the threshold for effective stems. Each subsequent 50 cm increment in plant height represents a new category, totaling nine distinct categories. For stem diameter, the range of 0–0.49 cm comprises the initial category, and each additional 0.2 cm increment signifies a new category, yielding eight categories overall. Internode length is initially classified into the 0–4.9 cm range, with every subsequent 5 cm increment representing a new type, resulting in seven types. Leaf length is initially categorized within 0–4.9 cm, and every 50 cm increment designates a new category, totaling ten categories. Leaf width is initially classified as 0–0.49 cm, with each additional 0.5 cm constituting a new category, resulting in eleven categories. Lastly, field brix is initially categorized as 0–9.99%, and every 2% increment establishes a new category, totaling six categories.
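The plant height rule described above can be illustrated with a short Python sketch. The function name `plant_height_category` is hypothetical, and treating the ninth category as open-ended (everything from 450 cm upward) is an assumption, since the text only gives the first category, the 50 cm step, and the total of nine categories.

```python
import math


def plant_height_category(height_cm: float, n_categories: int = 9) -> int:
    """Map a plant height in cm to a category index from 1 to n_categories."""
    if height_cm < 0:
        raise ValueError("height must be non-negative")
    if height_cm < 100:
        # Below the 100 cm effective-stem threshold: first category.
        return 1
    # Every further 50 cm opens a new category, starting with category 2.
    category = 2 + math.floor((height_cm - 100) / 50)
    # Assumption: the last category absorbs all taller plants.
    return min(category, n_categories)


example_categories = [plant_height_category(h) for h in (80, 100, 149.9, 150, 449, 450, 600)]
```

The other traits follow the same pattern with their own first-category bound and step; for the decimal steps (for example 0.2 cm for stem diameter) integer units such as millimetres would avoid floating-point edge cases at the category boundaries.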

Table 2. Classification rule table for phenotypic characteristics of wild sugarcane (in cm).

https://doi.org/10.1371/journal.pone.0312444.t002

The original data were classified in accordance with the categorization rules outlined in Table 2. After classification, the data were consolidated into a unified dataset, presented in Table 3. Following the application of these rules, the dataset size expanded to 555 × 12, reflecting the integration of both the raw data and the categorical information.

Table 3. Raw data classification data table (in cm).

https://doi.org/10.1371/journal.pone.0312444.t003

Data standardization is the process of scaling data to a predefined range, effectively eliminating the constraints of units and transforming it into a dimensionless pure value. This approach enables the comparison and weighting of indicators with varying units or magnitudes.

A prime example of data standardization is data normalization, which transforms the raw data into decimals in the range [0,1]. In the current study, we employ the Min-Max standardization algorithm [43] as the preferred method for data standardization.

Specifically, the standardized processing of data involves the following steps:

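Transform the sequence x₁, x₂,⋯,x_n:

$$y_i = \frac{x_i - \min_{1 \le j \le n} x_j}{\max_{1 \le j \le n} x_j - \min_{1 \le j \le n} x_j} \tag{1}$$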

This yields the new sequence y₁, y₂,⋯,y_n∈[0,1]. In this study, the Min-Max standardization method was employed to execute a linear transformation on the original dataset, mapping the values to the range [0,1]. In addition, class imbalance was identified within the dataset, and it was addressed through the application of random oversampling.
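As a minimal sketch of the Min-Max transformation described above, the following Python function scales one column of values to [0,1]. The function name and the sample values are illustrative only, and returning zeros for a constant column is an assumption made to avoid division by zero.

```python
def min_max_scale(values):
    """Linearly map a sequence of numbers onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative plant heights in cm (not taken from the study's data).
heights = [120.0, 180.0, 240.0, 300.0]
scaled = min_max_scale(heights)
```

The smallest value maps to 0 and the largest to 1, so traits measured in different units become directly comparable.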

The data obtained after random oversampling with stem diameter and plant height as dependent variables are shown in Table 4. After random oversampling, each category accounts for 12.5% of the samples when stem diameter is the dependent variable and 11.1% when plant height is the dependent variable. Because oversampling parts of the sample data may itself affect modeling accuracy, SMOTE-ENN combined sampling was also performed on the standardized data to balance them again, with stem diameter and plant height as the dependent variables. Consequently, the data volume after the random oversampling process with plant height (stem diameter) as the dependent variable reached 1620 × 12 (1151 × 12).
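Random oversampling can be sketched in a few lines of Python. This is an illustrative version that duplicates randomly chosen minority samples until every category matches the largest one; the function name, the toy data, and the choice of the largest class as the target size are assumptions rather than details given in the study.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # assumption: balance up to the largest class
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: three categories with 3, 2 and 1 samples.
rows = [[0.1], [0.2], [0.3], [0.8], [0.9], [0.5]]
labels = [1, 1, 1, 2, 2, 3]
new_rows, new_labels = random_oversample(rows, labels)
balanced = Counter(new_labels)
```

With k categories balanced in this way, each category holds 1/k of the samples, which agrees with the 12.5% (eight stem diameter categories) and 11.1% (nine plant height categories) quoted above.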

Table 4. Experimental data set.

https://doi.org/10.1371/journal.pone.0312444.t004

Methods

Due to the close relationship between sugarcane yield and key phenotypic traits, specifically the direct proportionality between plant height and stem diameter values and yield, it is evident that a higher plant height and larger stem diameter indicate a greater weight per plant and hence higher yield. Therefore, constructing a sugarcane plant height/stem diameter prediction model based on machine learning algorithms enables accurate prediction of plant height/stem diameter values, given the collection of other important phenotypic traits of sugarcane. This, in turn, indirectly estimates the yield potential of sugarcane. Additionally, in the process of sugarcane hybrid breeding, the key phenotypic trait prediction model developed in this study, which can precisely predict plant height/stem diameter values, can provide materials for the artificial selection of new sugarcane varieties with superior stem diameter and plant height, thus filling the research gap in indirectly predicting yield through phenotypic traits of sugarcane.

This study integrates various machine learning algorithms, encompassing both non-integrated and integrated learning approaches, to construct an intelligent predictive model for wild sugarcane. The research delves into the analysis of the prediction outcomes and model accuracy, thereby investigating the significance of employing machine learning methodologies in forecasting phenotypic traits of wild sugarcane.

Algorithm selection

In this study, we utilized six distinct non-integrated learning algorithms, namely decision tree, multiclass logistic regression, K-nearest neighbors (KNN), support vector machine (SVM), backpropagation neural network (BPNN), and linear regression. Additionally, we employed two integrated learning algorithms: random forest and XGBoost [44]. The aim was to comprehensively compare and analyze the performance of these algorithms.

Setting of evaluation indicators

In the classification experiment, we employ accuracy, recall, precision, and F1-score as metrics to evaluate the performance of the classification model. Similarly, in the regression experiment, mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and coefficient of determination (R²) are used as the evaluation indicators [45]. The characteristics of these metrics are outlined below:

The confusion matrix [46] is a matrix representation of the classification model’s prediction outcomes and a valuable tool for evaluating classification performance. The columns of the confusion matrix represent the predicted values, while the rows correspond to the actual values. Each element counts the samples of one actual category that were assigned to one predicted category, so the diagonal elements count the instances where the classifier’s prediction aligns with the true category. The confusion matrix facilitates the calculation of accuracy, recall, precision, and F1-score.

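In the following, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives read from the confusion matrix.

Accuracy: the proportion of all samples whose category is predicted correctly. The larger the accuracy, the better.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

Precision: among the samples predicted as positive, the proportion that are actually positive. The larger the precision, the better.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3}$$

Recall: among the samples that are actually positive, the proportion predicted as positive. The larger the recall, the better.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{4}$$

F1: the harmonic mean of precision and recall. Precision and recall trade off against each other. Ideally both are high, but in practice one often finds high precision with low recall, or low precision with high recall. When both need to be balanced, the F1 indicator can be used.

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5}$$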
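The four classification metrics can be computed directly from a confusion matrix whose rows are actual categories and whose columns are predicted categories. The sketch below is illustrative: the function name and the example matrix are hypothetical, and averaging the per-category precision, recall and F1 with equal weight (macro-averaging) is an assumption, since the averaging scheme for the multi-category case is not stated.

```python
def classification_metrics(cm):
    """Accuracy and macro-averaged precision, recall and F1 from a confusion matrix.

    cm[i][j] is the number of samples of actual category i predicted as category j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, actually another category
        fn = sum(cm[k]) - tp                       # actually k, predicted another category
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical two-category example: rows are actual, columns are predicted.
example_cm = [[8, 2],
              [1, 9]]
metrics = classification_metrics(example_cm)
```

For this example, 17 of the 20 samples lie on the diagonal, so the accuracy is 0.85, and the macro-averaged recall is (0.8 + 0.9) / 2 = 0.85.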

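In the following, n is the number of samples, y_i is the actual value of sample i, ŷ_i is the corresponding predicted value, and ȳ is the mean of the actual values.

MSE (Mean Square Error): the expected value of the square of the difference between the predicted value and the actual value. The smaller the value, the higher the accuracy of the model.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \tag{6}$$

RMSE (Root Mean Square Error): the square root of MSE. The smaller the value, the higher the accuracy of the model.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \tag{7}$$

MAE (Mean Absolute Error): the average value of the absolute error, which reflects the actual magnitude of the prediction error. The smaller the value, the higher the accuracy of the model.

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{8}$$

R² (coefficient of determination): measures how much the model improves on a prediction that uses only the mean of the actual values. The closer the result is to 1, the higher the accuracy of the model.

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2} \tag{9}$$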
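The regression indicators can be checked on a small hand-made example. The following sketch uses only the Python standard library; the function name and the sample values are illustrative and not taken from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R² for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}


# Illustrative values: two predictions are off by 1, two are exact.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [3.0, 4.0, 5.0, 8.0]
scores = regression_metrics(y_true, y_pred)
```

Here the squared errors sum to 2 over four samples, giving MSE = 0.5 and MAE = 0.5, while the total sum of squares around the mean of 5 is 20, giving R² = 1 − 2/20 = 0.9.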

Experimental design

In this study, we first established a data resource library through the collection, systematic organization, and analysis of data on sugarcane phenotypic traits. Based on the features of this library, we selected eight machine learning algorithms with differing performance characteristics to develop predictive models for two crucial phenotypic characteristics: stem diameter and plant height. These characteristics are fundamental determinants of sugarcane yield, so predicting them serves the objective of predicting sugarcane yield, particularly in high-yield scenarios. The subsequent sections detail the experimental design.

In our experimental setup, we initially utilized the standardized 555 sets of data alongside randomly oversampled data comprising 1620 plant heights and 1768 stem diameters, which were designated as the testing and training sets. Subsequently, we employed eight diverse algorithms to classify and regress the plant height and stem diameter separately. For the classification prediction model, we adopted seven algorithms, namely decision tree, multiclass logistic regression, KNN, support vector machine, BP neural network, random forest, and XGBoost. However, when constructing the regression prediction model, we substituted multiclass logistic regression with linear regression and utilized algorithms such as CART decision tree, linear regression, KNN, support vector machine, BP neural network, random forest, and XGBoost. During the experiment, we compared the efficacy of ensemble learning with non-ensemble learning approaches and analyzed the model accuracy before and after random oversampling. Notably, during the modeling process, 70% of the data was allocated for training, while 30% was designated for testing. When utilizing each model, the parameter adjustments were tailored to match the corresponding classification model parameters. Overall, a total of 84 experiments were conducted, and the performance of each experimental model was evaluated based on its stable value, which was deemed as the final result.
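The 70%/30% allocation mentioned above can be sketched as a shuffled split. This is only an illustration: the function name is hypothetical, and a plain random shuffle with a fixed seed is an assumption, since the text does not say whether the split was stratified by category.

```python
import random


def split_70_30(rows, train_fraction=0.7, seed=0):
    """Shuffle the row indices and split them into training and testing subsets."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = int(train_fraction * len(rows))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Stand-in for the 555 standardized records: one integer id per record.
records = list(range(555))
train_set, test_set = split_70_30(records)
```

For 555 records this yields 388 training and 167 testing records, with every record appearing in exactly one of the two subsets.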

By employing a self-generated data integration pattern, we augmented the dataset. Specifically, we appended modified values of the phenotypic traits: (plant height + 100 * 0.4), (stem diameter + 0.2 * 0.4), (internode length + 5 * 0.4), (leaf length + 20 * 0.4), (leaf width + 0.5 * 0.4), and (field brix + 2 * 0.4). The randomly oversampled data were then utilized for machine learning modeling. Upon acquiring the initial results, we applied SMOTE-ENN combination sampling and modeled the prediction outcomes in the same manner, which allowed us to compare the performance of the different modeling approaches. Fig 1 illustrates the experimental design process, outlining the integration of the self-generated data pattern, data augmentation, modeling, and the application of SMOTE-ENN combination sampling.
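The self-generated records can be illustrated as follows. The per-trait steps and the factor 0.4 are taken from the expressions above; the function name, the dictionary keys, the number of records to shift, and the sample record are hypothetical.

```python
import random

# Per-trait step sizes from the expressions above, each multiplied by 0.4.
STEP = {
    "plant_height": 100,
    "stem_diameter": 0.2,
    "internode_length": 5,
    "leaf_length": 20,
    "leaf_width": 0.5,
    "field_brix": 2,
}
FACTOR = 0.4


def augment(records, n_new, seed=0):
    """Append n_new shifted copies of randomly selected records to the dataset."""
    rng = random.Random(seed)
    chosen = rng.sample(records, n_new)
    shifted = [
        {trait: rec[trait] + STEP[trait] * FACTOR for trait in STEP}
        for rec in chosen
    ]
    return records + shifted


# Hypothetical record; the values are not taken from the study's data.
sample = {
    "plant_height": 200.0,
    "stem_diameter": 1.0,
    "internode_length": 10.0,
    "leaf_length": 100.0,
    "leaf_width": 2.0,
    "field_brix": 12.0,
}
expanded = augment([sample], n_new=1)
```

Each new record is a copy of an existing one moved by a fixed offset per trait (for example 40 cm in plant height), so the expanded dataset stays within a plausible range of the measured values.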

Fig 1. Flowchart for modeling important phenotypic characters in sugarcane.

https://doi.org/10.1371/journal.pone.0312444.g001

Results

A predictive model for key phenotypic traits of sugarcane, based on machine learning, has been constructed

This model primarily utilizes six phenotypic traits of sugarcane, namely plant height, stem diameter, third-node length (internode length), leaf length, leaf width, and field brix, as the dataset for constructing predictive models of the key phenotypic traits (plant height and stem diameter). Five to eight suitable machine learning algorithms are selected based on data characteristics such as the volume and size of the collected data. Taking the remaining five phenotypic traits (stem diameter, third-node length (internode length), leaf length, leaf width, and field brix) as influencing factors, an intelligent predictive model is constructed for plant height prediction. The performance of different machine learning models is compared, and the model with the best performance is selected as the predictive model for sugarcane plant height. A predictive model for sugarcane stem diameter is constructed using the same method.

Performance comparison of sugarcane stem diameter/plant height classification prediction models using different machine learning algorithms

In the experiment aimed at predicting stem diameter classification, we employed seven algorithms: logistic regression, KNN, support vector machine, BP neural network, decision tree, random forest, and XGBoost, to construct stem diameter classification prediction models. The outcomes of these experiments are presented in Table 5. Following data standardization, the stem diameter was directly modeled for prediction experiments. Among the various algorithms, logistic regression exhibited the highest modeling accuracy, with evaluation index values spanning from 0.56 to 0.59. Conversely, the support vector machine algorithm displayed the lowest accuracy, maintaining values between 0.33 and 0.45. After random oversampling, the performance of the algorithms changed significantly. The XGBoost model achieved the highest modeling accuracy, with evaluation index values ranging from 0.91 to 0.92. However, the support vector machine continued to exhibit the lowest accuracy, maintaining values between 0.44 and 0.47. Logistic regression and the BP neural network performed similarly, while KNN and the decision tree also showed comparable accuracies. When utilizing SMOTE-ENN combined sampling for modeling, the XGBoost and Random Forest algorithms displayed high modeling accuracy, with various evaluation indicators stabilizing between 0.97 and 0.98. Conversely, the modeling accuracy of the support vector machine, logistic regression, and BP neural network algorithms remained relatively low, ranging from 0.54 to 0.57. The other models performed well. Fig 2 provides a comparative analysis of the modeling accuracy of the seven algorithms under different data processing methods, using F1 and Acc indicators. The results indicate that random oversampling can improve model accuracy to a certain extent, while the SMOTE-ENN combined sampling method can comprehensively enhance model performance. Furthermore, the XGBoost and Random Forest algorithms emerged as the most effective for sugarcane stem diameter classification modeling.
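For readers unfamiliar with the SMOTE-ENN combined sampling compared above, the following standard-library Python sketch shows the two stages in simplified form: SMOTE creates synthetic minority samples by interpolating between a sample and one of its nearest same-category neighbours, and ENN (edited nearest neighbours) then removes samples whose category disagrees with the majority of their nearest neighbours. All function names, the neighbour counts, the majority-vote rule, and the toy data are assumptions for illustration; this is not the implementation used in the study.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest(i, points, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: sq_dist(points[i], points[j]))
    return others[:k]


def smote(X, y, k=2, seed=0):
    """Oversample every smaller category up to the size of the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = [list(x) for x in X], list(y)
    for label, count in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        if len(members) < 2:
            continue  # a single sample has no neighbour to interpolate towards
        for _ in range(target - count):
            i = rng.randrange(len(members))
            j = rng.choice(nearest(i, members, k))
            u = rng.random()  # position on the segment between the two samples
            X_out.append([a + u * (b - a) for a, b in zip(members[i], members[j])])
            y_out.append(label)
    return X_out, y_out


def enn(X, y, k=3):
    """Keep only samples whose category matches the majority of their k neighbours."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in nearest(i, X, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, seed=0):
    """SMOTE oversampling followed by ENN cleaning."""
    X_bal, y_bal = smote(X, y, seed=seed)
    return enn(X_bal, y_bal)


# Toy data: six samples of category 0 near the origin, three of category 1 far away.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [0.2, 0.8], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
X_res, y_res = smote_enn(X, y)
class_counts = Counter(y_res)
```

On this toy data the two clusters are well separated, so SMOTE brings category 1 up to six samples and ENN removes nothing. When a sample sits inside the other category's cluster, the ENN stage discards it, which is how the combination both balances the categories and cleans overlapping regions.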

Fig 2. Comparison of the performance of sugarcane stem diameter classification prediction models.

https://doi.org/10.1371/journal.pone.0312444.g002

Table 5. Stem diameter/plant height classification prediction model accuracy table.

https://doi.org/10.1371/journal.pone.0312444.t005

The experiment for plant height classification prediction followed the same methodology as the stem diameter experiment. As shown in Table 5, after data standardization, the XGBoost and Random Forest algorithms achieved the highest accuracy in forecasting plant height, with evaluation metrics ranging between 0.45 and 0.50, surpassing the other models, which registered approximately 0.40. After random oversampling, modeling precision improved substantially across all models. Specifically, the XGBoost and Random Forest algorithms led the way, achieving evaluation index values spanning from 0.81 to 0.87. The KNN and decision tree algorithms also improved significantly, reaching accuracy levels of 0.73 and 0.75, respectively. The other models maintained a stable accuracy of approximately 0.6. With the combined SMOTE-ENN sampling methodology, each model attained its peak accuracy. Notably, the KNN, decision tree, XGBoost, and Random Forest algorithms all exceeded the 0.9 accuracy threshold, while the other algorithms attained an accuracy of approximately 0.75. Fig 3 offers a comparative analysis of the modeling accuracy of the seven algorithms, employing F1 and Acc metrics, under various data processing techniques. The results demonstrate that the SMOTE-ENN combined sampling method significantly enhances model accuracy, with the XGBoost and Random Forest algorithms emerging as the most effective in sugarcane plant height classification modeling.

Fig 3. Comparison of the performance of sugarcane plant height classification prediction models.

https://doi.org/10.1371/journal.pone.0312444.g003

Performance comparison of sugarcane stem diameter/plant height regression prediction models using different machine learning algorithms

In the stem diameter and plant height regression experiment, seven algorithms were used to build the prediction models: linear regression, KNN, support vector machine, backpropagation (BP) neural network, decision tree, random forest, and XGBoost. The models were constructed under three data processing techniques: data standardization, random oversampling, and combined sampling. Their performance was evaluated with MSE, RMSE, MAE, and R². The full results are presented in Table 6, which gives a detailed performance analysis of the 42 models constructed with the seven algorithms. Fig 4 compares the modeling accuracy of the seven algorithms under the different data processing methods, with RMSE and R² as the primary indicators. The decision tree, random forest, and XGBoost algorithms performed best, and XGBoost was the most effective for constructing stem diameter regression prediction models.
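
For reference, the four regression metrics named above can be computed as in the following sketch, which uses their standard definitions and only the Python standard library. The function name `regression_metrics` and the sample values are illustrative assumptions, not data from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n      # mean squared error
    mae = sum(abs(e) for e in errors) / n     # mean absolute error

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "R2": 1 - ss_res / ss_tot,
    }


# Hypothetical stem diameter values (cm) and model predictions.
y_true = [2.0, 2.5, 3.0, 3.5]
y_pred = [2.1, 2.4, 3.2, 3.3]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Lower MSE, RMSE, and MAE indicate smaller errors, while R² approaches 1 as the predictions explain more of the variance in the observed values. RMSE is the square root of MSE, so the two always rank models in the same order.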

Fig 4. Comparison of the performance of sugarcane stem diameter/plant height regression prediction models.

https://doi.org/10.1371/journal.pone.0312444.g004

Table 6. Stem diameter/plant height regression prediction model accuracy table.

https://doi.org/10.1371/journal.pone.0312444.t006

Expanding experimental data to construct predictive models

First, 100 data records were randomly selected from the experimental dataset, and the interval distance between the values of each phenotypic trait was determined. A value equal to 1.4 times this interval distance was then added to the original values, generating 100 new records. These self-generated records were merged with the original dataset to form an expanded dataset, which was modeled with the two best-performing algorithms, Random Forest and XGBoost. As shown in Fig 5, with SMOTE-ENN combined sampling, the evaluation metrics of the Random Forest stem diameter classification model ranged from 0.932 to 0.94, slightly below the pre-expansion accuracy. The XGBoost model performed better, with evaluation metrics between 0.968 and 0.969, although these were also slightly lower than before expansion. In the plant height classification experiment, SMOTE-ENN combined sampling with Random Forest modeling yielded evaluation metrics between 0.975 and 0.978, a marginal improvement over pre-expansion accuracy, whereas the XGBoost model declined slightly, with evaluation metrics between 0.969 and 0.972. The results in Table 7 show that the change in the model evaluation index values, whether an increase or a decrease, was approximately 0.005 when self-generated data were included.

Using the expanded data, regression models for stem diameter and plant height were then constructed with the Random Forest and XGBoost algorithms, coupled with SMOTE-ENN sampling. As illustrated in Fig 6, both models performed well, with R² values close to 1 and the other metrics not exceeding 0.1. However, as Table 7 shows, their performance was marginally inferior to that of the models built before the data expansion. This suggests that the inclusion of self-generated data in this experiment had a negligible impact on the evaluation index values of the models.
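
The expansion rule can be sketched as follows. The text does not define "interval distance" precisely, so this sketch ASSUMES it is the mean gap between consecutive sorted values of a trait within the selected records; that definition, the function name `expand_dataset`, and the toy rows are all illustrative and may differ from what the authors actually did.

```python
import random


def expand_dataset(rows, n_new=100, factor=1.4, seed=0):
    """Append shifted copies of randomly selected rows to the dataset.

    ASSUMPTION: the "interval distance" of a trait is the mean gap between
    consecutive sorted values of that trait in the selected rows.
    """
    rng = random.Random(seed)
    n_traits = len(rows[0])
    selected = rng.sample(rows, min(n_new, len(rows)))

    # One interval distance per trait (column).
    steps = []
    for j in range(n_traits):
        values = sorted(r[j] for r in selected)
        gaps = [b - a for a, b in zip(values, values[1:])]
        steps.append(sum(gaps) / len(gaps) if gaps else 0.0)

    # New record = original record + factor * interval distance, per trait.
    synthetic = [
        [r[j] + factor * steps[j] for j in range(n_traits)] for r in selected
    ]
    return rows + synthetic, steps


# Hypothetical records: [stem diameter, leaf length].
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]

expanded, steps = expand_dataset(rows, n_new=4)
print(len(expanded), steps)
```

Because every generated record is a deterministic shift of a real one, the expanded data carries little new information, which is consistent with the small changes in the evaluation metrics reported for this experiment.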

Fig 5. Performance graph of classification model after expanding data.

https://doi.org/10.1371/journal.pone.0312444.g005

Fig 6. Plot of regression model performance after expanding data.

https://doi.org/10.1371/journal.pone.0312444.g006

Table 7. Comparison table of model performance before and after data expansion.

https://doi.org/10.1371/journal.pone.0312444.t007

The impact of important phenotypic features on the performance of intelligent models

In this experiment, four variables were excluded one at a time: "leaf length," "leaf width," "internode length," and "field brix." The SMOTE-ENN combination sampling method was then applied for data preprocessing, and classification and regression models were constructed with XGBoost, the algorithm identified as optimal for predicting sugarcane phenotype characteristics. Detailed results are presented in Table 8.
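
The single-factor removal procedure can be sketched as a loop that drops one feature, rescores the model, and records the change against the full-feature baseline. To stay self-contained, the sketch below substitutes a 1-nearest-neighbor classifier scored by leave-one-out accuracy for the XGBoost models used in the study; the function names and the two-feature toy data are hypothetical.

```python
def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier (stand-in model)."""
    correct = 0
    for i, xi in enumerate(X):
        # Nearest other sample by squared Euclidean distance.
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(xi, X[j])),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


def single_feature_ablation(X, y, names, score=loo_1nn_accuracy):
    """Return the baseline score and the score change after dropping each feature."""
    baseline = score(X, y)
    deltas = {}
    for k, name in enumerate(names):
        X_drop = [row[:k] + row[k + 1:] for row in X]  # remove column k
        deltas[name] = score(X_drop, y) - baseline
    return baseline, deltas


# Hypothetical data: the first feature separates the classes, the second is noise.
names = ["internode_length", "leaf_length"]
X = [[0.0, 0.0], [0.1, 5.0], [1.0, 0.1], [1.1, 5.1]]
y = ["low", "low", "high", "high"]

baseline, deltas = single_feature_ablation(X, y, names)
print(baseline, deltas)
```

A positive change after removing a feature suggests the feature was adding noise, while a negative change suggests the model relied on it. In this toy case, dropping the noisy second feature raises the score, and dropping the informative first feature does not.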

Table 8. Comparison table of model performance after removing a single influencing factor.

https://doi.org/10.1371/journal.pone.0312444.t008

Performance analysis of important phenotypic characteristics on sugarcane stem diameter/plant height classification prediction models

After the "leaf width" variable was removed, both the random forest and XGBoost algorithms were used to build stem diameter prediction models. All evaluation metrics ranged from 0.934 to 0.964. Compared with the corresponding models that retained this variable, the metrics declined slightly: the XGBoost model showed a decline of 0.011–0.012 across the evaluation indicators, while the random forest model showed a reduction of 0.21–0.25.

With the "leaf width" variable removed, XGBoost was also used to construct a plant height prediction model. All of its evaluation metrics were 0.976, slightly above those of the original model. After the "internode length" variable was removed, the random forest model still performed well, with all evaluation metrics between 0.806 and 0.813, although slightly below the original results. Finally, with the "field brix" variable excluded, the random forest model produced indicators between 0.975 and 0.978, a better result than the original model.

Performance analysis of regression prediction models for sugarcane stem diameter/plant height based on important phenotypic characteristics

As before, four variables were excluded one at a time: "leaf length," "leaf width," "internode length," and "field brix." The data were processed with the SMOTE-ENN combination sampling method, and XGBoost, identified as the optimal algorithm, was used to construct classification and regression models for predicting sugarcane phenotype characteristics. Detailed results are given in Table 8.

After each influencing factor was removed individually, a classification prediction model for sugarcane plant height was built. The accuracy of each indicator improved slightly, by between 0.001 and 0.15, indicating a minimal impact on model performance. Removing the "leaf length" factor produced the largest gain in model accuracy, suggesting that this factor is redundant and should be excluded. Conversely, excluding the "internode length" factor produced the largest drop in model accuracy, implying that it plays a pivotal role in predicting plant height. The same approach revealed notable variations in the accuracy of the stem diameter classification model. Eliminating the "leaf width" factor caused the greatest deterioration in model performance, whereas excluding the "internode length" factor led to the greatest improvement. This suggests that leaf width is a crucial factor in the model, while internode length is not a significant influencing factor and should be excluded.

When developing a regression prediction model for plant height, removing the "field brix" factor caused the largest drop in model performance, indicating its importance in the model. Removing the other factors led to relatively minor changes in the evaluation metrics.

Fig 7 shows the performance of each model after one influencing variable was removed, with the F1 value as the metric for the classification models and the R² value for the regression models. The plant height classification model improved most after the "leaf length" factor was excluded, attaining an F1 value of 0.987, and this configuration was adopted as the final plant height classification model in this study. Similarly, the stem diameter classification model achieved its peak accuracy after the "internode length" factor was removed, with an F1 value of 0.985, and was designated the final stem diameter classification model. For the plant height regression model, removing the "leaf length" factor raised the R² value to 0.991, and this configuration became the final plant height regression model. For the stem diameter regression model, the R² value rose to 0.988 after either the "leaf length" or the "field brix" factor was removed, and this configuration became the final model. The prediction models for both sugarcane plant height and stem diameter were thus determined.

Fig 7. Performance of classification model and regression model after removing single influencing factors.

https://doi.org/10.1371/journal.pone.0312444.g007

Discussion

Comparison of model performance of different data processing methods

The experimental results show that, in the classification and prediction experiments on sugarcane phenotypic traits (stem diameter and plant height), the machine learning models built on data that were only standardized have comparatively low accuracy. Random oversampling improves prediction accuracy, and the SMOTE-ENN combination sampling technique raises the accuracy of the prediction models to the highest level. Our study also observed that adding self-fitting data lowered the accuracy of most models slightly, suggesting that the integration of self-fitting data does not benefit predictive accuracy. Despite this reduction, SMOTE-ENN combined sampling still increased the predictive accuracy of the models, which underscores its effectiveness in enhancing the predictive performance of the classification models.

Therefore, by using the SMOTE-ENN combined sampling technique to balance the dataset, we can not only improve the generalization ability of the models but also mitigate the challenges associated with data collection in the classification and prediction of sugarcane stem diameter and plant height, ultimately enhancing the overall accuracy of the models.

Performance comparison of different machine learning algorithms

Our findings indicate considerable variability in the performance of the seven machine learning algorithms used to construct the prediction models. The models developed with the random forest and XGBoost algorithms outperformed those constructed with the five other algorithms: decision tree, logistic regression, K-nearest neighbors (KNN), support vector machine (SVM), and backpropagation (BP) neural network. This superiority can be attributed to the random forest algorithm's ability to mitigate overfitting risks and enhance prediction accuracy by integrating predictions from multiple decision trees, while XGBoost's compatibility with diverse datasets and its parallel processing capabilities contribute to its strong performance. XGBoost was particularly robust in high-precision prediction, making it the preferred algorithm for constructing prediction models for sugarcane phenotypic traits such as stem diameter and plant height.

Among the remaining five algorithms, the decision tree performed only slightly below XGBoost and random forest, likely owing to its proficiency in handling datasets with missing attributes and its fast execution during testing. The KNN-based prediction model also achieved relatively high accuracy, while the models constructed with the logistic regression, SVM, and BP neural network algorithms had accuracy levels below 0.6. The KNN algorithm requires substantial computational resources, but it is simple to understand and implement. In contrast, the logistic regression algorithm struggles to capture the full range of information in the data, which limits its ability to handle complex data types, anomalies, and missing data. The SVM's performance is sensitive to the choice of kernel function parameters and to missing data, while the BP neural network faces challenges in determining the optimal number of hidden layers and nodes.

Performance analysis of predictive models after adjusting influencing factors (stem diameter, plant height)

After the influential factor "leaf width" was removed, the decision tree, random forest, and XGBoost algorithms were used to reconstruct the stem diameter classification prediction model. The comparison with the original model clarified the role of "leaf width" in the prediction of sugarcane stem diameter classification: the accuracy of the reconstructed models improved by 0.01 to 0.08 compared with the original model. Similarly, after the influential factor "stem diameter" was excluded, the decision tree and random forest algorithms were used to reconstruct the plant height prediction model, and the accuracy of the new and original models differed only marginally, by 0.01 to 0.02. When "internode length" was removed, the plant height prediction model reconstructed with XGBoost differed in accuracy from the original model by only 0.01.

These outcomes indicate that including or excluding individual influencing factors does not significantly alter overall model accuracy. However, because the experimental design for reducing influencing factors was relatively simple, this study did not undertake a comprehensive analysis of the impact of individual or multiple factors on model performance. Future research should examine this area in more depth.

Analysis of the scalability of a sugarcane important phenotypic data prediction model based on multi-model and multi-task approach

To construct a practical and scalable predictive model for key phenotypic traits of sugarcane, this study employed the XGBoost algorithm, which is optimal for predicting phenotypic characteristics of sugarcane. We established field brix classification and regression models and conducted scalability experiments to enhance the model’s performance. The results of the scalability testing are presented in Table 9.

Table 9. Table of accuracy for field brix classification/regression prediction models.

https://doi.org/10.1371/journal.pone.0312444.t009

In the field brix classification/regression prediction experiment, data were processed through methods such as data standardization, random oversampling, and combined sampling. Subsequently, XGBoost was employed to construct a sugarcane field brix classification/regression prediction model. Experimental results indicated that the performance of the field brix classification model was superior after combined sampling, achieving accuracy, recall, precision, and F1 scores of 0.957, 0.957, 0.956, and 0.956, respectively. Similarly, the field brix regression model demonstrated satisfactory predictive performance, with MSE, RMSE, MAE, and R² values reaching 0.092, 0.302, 0.084, and 0.969, respectively, achieving precise prediction of field brix values. This demonstrates the scalability of the sugarcane important phenotypic data prediction model based on multi-model and multi-task, which is suitable for predicting different phenotypic characteristics of sugarcane.
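
The averaging scheme behind the multi-class accuracy, recall, precision, and F1 values is not stated. The sketch below ASSUMES support-weighted averaging, under which weighted recall is mathematically equal to accuracy; that property would be consistent with the identical accuracy and recall values reported above, but it remains an assumption. The function name and the toy labels are illustrative.

```python
from collections import Counter


def weighted_classification_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1 (assumed averaging)."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)

        p_c = tp / predicted if predicted else 0.0   # per-class precision
        r_c = tp / count                             # per-class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0

        weight = count / n                           # class share of the samples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical brix classes and predictions.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]

scores = weighted_classification_metrics(y_true, y_pred)
print(scores)
```

With this averaging, each class contributes in proportion to its number of samples, so the summed recall terms reduce to the total number of correct predictions divided by the sample count, which is the accuracy.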

Conclusion

In addressing the challenges posed by the strong subjectivity and low prediction accuracy of classical empirical models, this study pioneers a comprehensive approach by developing a suite of significant sugarcane phenotype feature prediction models that harness the power of multiple integrated intelligent algorithms. Specifically, we leverage eight machine learning algorithms to formulate a novel method for indirectly forecasting yield based on sugarcane phenotype feature values. The methodology commences with data collection, where outliers are identified and mitigated, followed by data standardization. Subsequently, we employ random oversampling techniques and combined sampling methods to curate a robust data resource library, which serves as the foundation for model construction. Utilizing 555 sets of standardized and randomly sampled data as our training and testing sets, we deploy both non-ensemble and ensemble learning algorithms to construct an intelligent prediction model ensemble dedicated to wild sugarcane phenotypic traits. Furthermore, to enrich our dataset, we introduce a self-fitting data integration rule and employ the SMOTE-ENN method for combined sampling. By modeling in a consistent manner, we derive prediction results that are then compared against diverse modeling approaches. The analysis leads to the following key conclusions:

(1) Significant improvements in model performance were observed through a rigorous data preprocessing workflow encompassing data standardization, random oversampling, and SMOTE-ENN processing. Notably, in the stem diameter prediction model, following SMOTE-ENN processing, the already promising prediction results were further enhanced. Specifically, metrics such as accuracy, recall, precision, and F1 value exhibited substantial gains. The most notable enhancement was achieved by the KNN-based stem diameter classification prediction model, achieving improvements of 17.1%, 17.1%, 18.89%, and 17.65% respectively. Similarly, in plant height prediction, the KNN-based plant height classification prediction model yielded the highest performance gains, improving by 30.05%, 30.05%, 33.47%, and 32.44% in accuracy, recall, precision, and F1 value.

(2) In a comparative analysis with mainstream machine learning algorithms, including logistic regression, linear regression, KNN, support vector machine, BP neural network, decision tree, and random forest, the XGBoost algorithm emerged as the most effective for predicting stem diameter and plant height. When constructing the stem diameter prediction model, XGBoost significantly elevated accuracy, recall, precision, and F1 value to 0.976, 0.976, 0.977, and 0.976 respectively. This represented increases of 0.527, 0.527, 0.64, and 0.594 over the base models. Similarly, in the plant height prediction model, XGBoost optimized accuracy, recall, precision, and F1 value to 0.974, 0.974, 0.974, and 0.973 respectively, with gains of 0.555, 0.555, 0.55, and 0.576 in each metric.

(3) Upon augmenting the amount of self-generated data, the model’s performance exhibited minimal variations. As evident from the experimental results, the difference in model evaluation index values, both before and after the inclusion of self-fitting data, hovered around 0.005. This suggests that in the context of predicting crucial phenotypic data in sugarcane, the influence of incorporating self-fitting data on the model’s evaluation index value is negligible.

(4) To validate the model's performance, we conducted a screening process of phenotypic features. Removing the "leaf length" factor from the plant height classification model raised the F1 value to 0.987. Similarly, the F1 value reached 0.985 after the "internode length" factor was removed from the stem diameter classification prediction model. In the plant height regression model, the R² value improved to 0.991 following the removal of the "leaf length" factor. In the stem diameter regression model, the R² value improved to 0.988 after either "leaf length" or "field brix" was removed. Each model's performance was thus further optimized on the basis of the original framework, which allowed the influential phenotypic factors in sugarcane to be determined.

(5) This research is grounded in wild sugarcane phenotype data, leveraging sugarcane phenotypic characteristics as predictive factors to develop models for forecasting sugarcane stem diameter and plant height. Specifically, the models aim to predict the characteristic values of stem diameter and plant height post-planting for this sugarcane variety, thereby indirectly estimating sugarcane yield. Given the accuracy of these models in predicting post-planting stem diameter and height for specific traits, the application of the stem diameter prediction model in the hybridization of sugarcane varieties with taller plants and narrower stems holds the potential to enhance the likelihood of cultivating new sugarcane varieties with larger stem diameters and heights. Similarly, the plant height prediction model can be applied to the hybridization of sugarcane varieties with broader stems and shorter plants, increasing the probability of developing new sugarcane varieties with both larger stem diameters and plant heights. This approach offers a novel methodology for breeding new sugarcane varieties.

References

1. Dong L, Kai W, Yan H. Current Status and Trend of Industrial Development of Major Tropical Crops in the World. Tropical Agricultural Science. 2021;41(9):111–6.
- View Article
- Google Scholar
2. Mangla M, Sharma N, Mohanty SN. A sequential ensemble model for software fault prediction. Innovations in Systems and Software Engineering. 2021:1–8.
- View Article
- Google Scholar
3. Dubey SK, Gavli A, Yadav S, Sehgal S, Ray SS. Remote sensing-based yield forecasting for sugarcane (Saccharum officinarum L.) crop in India. Journal of the Indian Society of Remote Sensing. 2018;46:1823–33.
- View Article
- Google Scholar
4. Jayawardhana W, Chathurange V. Extraction of agricultural phenological parameters of Sri Lanka using MODIS, NDVI time series data. Procedia Food Science. 2016;6:235–41.
- View Article
- Google Scholar
5. Lai Y, Pringle M, Kopittke PM, Menzies NW, Orton TG, Dang YP. An empirical model for prediction of wheat yield, using time-integrated Landsat NDVI. International journal of applied earth observation and geoinformation. 2018;72:99–108.
- View Article
- Google Scholar
6. Saeed U, Dempewolf J, Becker-Reshef I, Khan A, Ahmad A, Wajid SA. Forecasting wheat yield from weather data and MODIS NDVI using Random Forests for Punjab province, Pakistan. International journal of remote sensing. 2017;38(17):4831–54.
- View Article
- Google Scholar
7. Mkhabela MS, Mkhabela MS, Mashinini NN. Early maize yield forecasting in the four agro-ecological regions of Swaziland using NDVI data derived from NOAA’s-AVHRR. Agricultural and Forest Meteorology. 2005;129(1–2):1–9.
- View Article
- Google Scholar
8. Aghighi H, Azadbakht M, Ashourloo D, Shahrabi HS, Radiom S. Machine learning regression techniques for the silage maize yield prediction using time-series images of Landsat 8 OLI. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. 2018;11(12):4563–77.
- View Article
- Google Scholar
9. Sorjamaa A, Hao J, Reyhani N, Ji Y, Lendasse A. Methodology for long-term prediction of time series. Neurocomputing. 2007;70(16–18):2861–9.
- View Article
- Google Scholar
10. Wang H, Huo Z, Zhou G, Wu L, Feng H. Monitoring and forecasting winter wheat freeze injury and yield from multi-temporal remotely sensed data. Intelligent Automation & Soft Computing. 2016;22(2):255–60.
- View Article
- Google Scholar
11. Humphreys E, Gaydon D, Eberbach P. Evaluation of the effects of mulch on optimum sowing date and irrigation management of zero till wheat in central Punjab, India using APSIM. Field Crops Research. 2016;197:83–96. pmid:27698532
- View Article
- PubMed/NCBI
- Google Scholar
12. Singh R, Krishnan P, Singh VK, Sah S, Das B. Combining biophysical parameters with thermal and RGB indices using machine learning models for predicting yield in yellow rust affected wheat crop. Scientific Reports. 2023;13(1):18814. pmid:37914800
- View Article
- PubMed/NCBI
- Google Scholar
13. Lobell DB, Hammer GL, McLean G, Messina C, Roberts MJ, Schlenker W. The critical role of extreme heat for maize production in the United States. Nature climate change. 2013;3(5):497–501.
- View Article
- Google Scholar
14. Peng B, Guan K, Zhou W, Jiang C, Frankenberg C, Sun Y, et al. Assessing the benefit of satellite-based Solar-Induced Chlorophyll Fluorescence in crop yield prediction. International Journal of Applied Earth Observation and Geoinformation. 2020;90:102126.
- View Article
- Google Scholar
15. Paccioretti P, Bruno C, Gianinni Kurina F, Córdoba M, Bullock D, Balzarini M. Statistical models of yield in on‐farm precision experimentation. Agronomy Journal. 2021;113(6):4916–29.
- View Article
- Google Scholar
16. Lobell DB, Asseng S. Comparing estimates of climate change impacts from process-based and statistical crop models. Environmental Research Letters. 2017;12(1):015001.
- View Article
- Google Scholar
17. Shrivastava VK, Shrivastava A, Sharma N, Mohanty SN, Pattanaik CR. Deep learning model for temperature prediction: A case study in New Delhi. Journal of Forecasting. 2023;42(6):1445–60.
- View Article
- Google Scholar
18. Cai Y, Guan K, Lobell D, Potgieter AB, Wang S, Peng J, et al. Integrating satellite and climate data to predict wheat yield in Australia using machine learning approaches. Agricultural and forest meteorology. 2019;274:144–59.
- View Article
- Google Scholar
19. Cao J, Zhang Z, Tao F, Zhang L, Luo Y, Zhang J, et al. Integrating multi-source data for rice yield prediction across China using machine learning and deep learning approaches. Agricultural and Forest Meteorology. 2021;297:108275.
- View Article
- Google Scholar
20. Feng P, Wang B, Li Liu D, Waters C, Xiao D, Shi L, et al. Dynamic wheat yield forecasts are improved by a hybrid approach using a biophysical model and machine learning technique. Agricultural and Forest Meteorology. 2020;285:107922.
- View Article
- Google Scholar
21. Kang Y, Ozdogan M, Zhu X, Ye Z, Hain C, Anderson M. Comparative assessment of environmental variables and machine learning algorithms for maize yield prediction in the US Midwest. Environmental Research Letters. 2020;15(6):064005.
- View Article
- Google Scholar
22. Leng G, Hall JW. Predicting spatial and temporal variability in crop yields: an inter-comparison of machine learning, regression and process-based models. Environmental research letters: ERL [Web site]. 2020;15(4):044027. pmid:32395176
- View Article
- PubMed/NCBI
- Google Scholar
23. Hoffman A, Kemanian A, Forest C. The response of maize, sorghum, and soybean yield to growing-phase climate revealed with machine learning. Environmental Research Letters. 2020;15(9):094013.
- View Article
- Google Scholar
24. Crane-Droesch A. Machine learning methods for crop yield prediction and climate change impact assessment in agriculture. Environmental Research Letters. 2018;13(11):114003.
- View Article
- Google Scholar
25. Jeong JH, Resop JP, Mueller ND, Fleisher DH, Yun K, Butler EE, et al. Random forests for global and regional crop yield predictions. PloS one. 2016;11(6):e0156571. pmid:27257967
- View Article
- PubMed/NCBI
- Google Scholar
26. Cai Y, Guan K, Peng J, Wang S, Seifert C, Wardlow B, et al. A high-performance and in-season classification system of field-level crop types using time-series Landsat data and a machine learning approach. Remote sensing of environment. 2018;210:35–47.
- View Article
- Google Scholar
27. Khaki S, Wang L. Crop yield prediction using deep neural networks. Frontiers in plant science. 2019;10:621. pmid:31191564
- View Article
- PubMed/NCBI
- Google Scholar
28. Mupangwa W, Chipindu L, Nyagumbo I, Mkuhlani S, Sisito G. Evaluating machine learning algorithms for predicting maize yield under conservation agriculture in Eastern and Southern Africa. SN Applied Sciences. 2020;2(5):952.
- View Article
- Google Scholar
29. Khaki S, Wang L, Archontoulis SV. A CNN-RNN framework for crop yield prediction. Frontiers in Plant Science. 2020;10:492736. pmid:32038699
- View Article
- PubMed/NCBI
- Google Scholar
30. Noorunnahar M, Chowdhury AH, Mila FA. A tree based eXtreme Gradient Boosting (XGBoost) machine learning model to forecast the annual rice production in Bangladesh. PloS one. 2023;18(3):e0283452. pmid:36972270
- View Article
- PubMed/NCBI
- Google Scholar
31. Jiang H, Hu H, Zhong R, Xu J, Xu J, Huang J, et al. A deep learning approach to conflating heterogeneous geospatial data for corn yield estimation: A case study of the US Corn Belt at the county level. Global change biology. 2020;26(3):1754–66. pmid:31789455
- View Article
- PubMed/NCBI
- Google Scholar
32. Liu Y, Wang S, Wang X, Chen B, Chen J, Wang J, et al. Exploring the superiority of solar-induced chlorophyll fluorescence data in predicting wheat yield using machine learning and deep learning methods. Computers and Electronics in Agriculture. 2022;192:106612.
- View Article
- Google Scholar
33. Van Klompenburg T, Kassahun A, Catal C. Crop yield prediction using machine learning: A systematic literature review. Computers and Electronics in Agriculture. 2020;177:105709.
- View Article
- Google Scholar
34. Shahhosseini M, Hu G, Huber I, Archontoulis SV. Coupling machine learning and crop modeling improves crop yield prediction in the US Corn Belt. Scientific reports. 2021;11(1):1606. pmid:33452349
- View Article
- PubMed/NCBI
- Google Scholar
35. Maldaner LF, de Paula Corrêdo L, Canata TF, Molin JP. Predicting the sugarcane yield in real-time by harvester engine parameters and machine learning approaches. Computers and Electronics in Agriculture. 2021;181:105945.
- View Article
- Google Scholar
36. Shrivastava VK, Shrivastava A, Sharma N, Mohanty SN, Pattanaik CR. Deep learning model for temperature prediction: an empirical study. Modeling Earth Systems and Environment. 2023;9(2):2067–80.
- View Article
- Google Scholar
37. Bala K, Paul S, Mohanty SN, Mahapatra S. Improved prediction analysis with hybrid models for thunderstorm classification over the ranchi region. New Generation Computing. 2024;42(1):7–31.
- View Article
- Google Scholar
38. Mangla M, Mehta V, Mohanty SN, Sharma N, Preetham A. Statistical growth prediction analysis of rice crop with pixel-based mapping technique. International Journal of Artificial Intelligence and Soft Computing. 2022;7(3):208–27.
- View Article
- Google Scholar
39. He S, Yang Q, Xiao F, Zhang F, He L. Investigations and collections of wild germplasm plants related to sugarcane in China. Sugarcane. 1994;1:11–7.
- View Article
- Google Scholar
40. Alam M, Nath UK, Karim K, Ahmed M, Mitul R. Genetic variability of exotic sugarcane genotypes. Scientifica. 2017;2017. pmid:29348970
41. Wang Z-p, Liu L, Deng Y-c, Li Y-j, Zhang G-m, Lin S-h, et al. Establishing a forecast mathematical model of sugarcane yield and Brix reduction based on the extent of pokkah boeng disease. Sugar Tech. 2017;19:656–61.
42. Qing C. Research on standardization of sugarcane germplasm resources data and construction of a sharing platform. 2008-06-16.
43. Jain S, Shukla S, Wadhvani R. Dynamic selection of normalization techniques using data complexity measures. Expert Systems with Applications. 2018;106:252–62.
44. Leevy JL, Hancock J, Khoshgoftaar TM, Peterson JM. IoT information theft prediction using ensemble feature selection. Journal of Big Data. 2022;9(1):6.
45. Jumin E, Zaini N, Ahmed AN, Abdullah S, Ismail M, Sherif M, et al. Machine learning versus linear regression modelling approach for accurate ozone concentrations prediction. Engineering Applications of Computational Fluid Mechanics. 2020;14(1):713–25.
46. Li X, Yi S, Cundy AB, Chen W. Sustainable decision-making for contaminated site risk management: A decision tree model using machine learning algorithms. Journal of Cleaner Production. 2022;371:133612.

