Gradient boosting for yield prediction of elite maize hybrid ZhengDan 958

Oumnia Ennaji; Sfia Baha; Leonardus Vergutz; Achraf El Allali

doi:10.1371/journal.pone.0315493

Abstract

Understanding accurate methods for predicting yields in complex agricultural systems is critical for effective nutrient management and crop growth. Machine learning has proven to be an important tool in this context. Numerous studies have investigated its potential for predicting yields under different conditions. Among these algorithms, Random Forest (RF) has gained prominence due to its ability to manage large, high-dimensional data sets, as well as its ability to uncover complicated non-linear relationships and interactions between variables. RF is particularly suitable for scenarios with categorical variables and missing data. Given the complex web of management practices and their non-linear effects on yield prediction, it is important to investigate new machine learning algorithms. In this context, our study focused on the evaluation of gradient boosting methods, particularly Extreme Gradient Boosting (XGB) and Gradient Boosting Regressor (GBR), as potential candidates for yield estimation of the maize hybrid Zhengdan 958. Our aim was not only to evaluate and compare these algorithms with existing approaches, but also to comprehensively analyze the resulting model uncertainties. Our approach includes comparing multiple machine learning algorithms, developing and selecting suitable features, fine-tuning the models by training and adjusting the hyperparameters, and visualizing the results. Using a recent dataset of over 1700 maize yield data pairs, our evaluation included a spectrum of algorithms. Our results show robust prediction accuracy for all algorithms. In particular, the predictions of XGB (RMSE = 0.37, R² = 0.87 and MAE = 0.26) and GBR (RMSE = 0.39, R² = 0.86 and MAE = 0.27) emphasized the central role of weather characteristics and confirmed the high dependence of crop yield prediction on environmental attributes. Utilizing the capabilities of gradient boosting for yield prediction holds immense potential and is consistent with the promise of this method to serve as a catalyst for further investigation in this evolving field.

Citation: Ennaji O, Baha S, Vergutz L, El Allali A (2024) Gradient boosting for yield prediction of elite maize hybrid ZhengDan 958. PLoS ONE 19(12): e0315493. https://doi.org/10.1371/journal.pone.0315493

Editor: Prabina Kumar Meher, ICAR Indian Agricultural Statistics Research Institute, INDIA

Received: April 8, 2024; Accepted: November 26, 2024; Published: December 17, 2024

Copyright: © 2024 Ennaji et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting information files.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Improving management decisions within crop production systems holds great importance in narrowing yield gaps and safeguarding food security. However, conventional approaches to agriculture have struggled to keep pace with the escalating demands of more intricate systems, particularly those involving small, scattered farms. These challenges are exacerbated by the misalignment of management recommendations with production systems and environmental variables, as observed in the literature [1]. One of the factors leading to low profitability and inefficient use of inputs is low knowledge of limiting factors. Historically, understanding region-specific best practices has relied on controlled experiments within small plots, followed by dissemination through extension services. This methodology has yielded benefits but falls short in addressing contemporary challenges, including climate change, management, and labor supply.

The anticipation of crop yields stands as an important element in sustainable cultivation and optimal utilization of natural resources. However, the intricacies involved in yield prediction render it an extremely complex task influenced by many factors [2, 3]. This complexity prevents both researchers and farmers from accurately forecasting yields and the subsequent economic gains over both short and extended time frames. Prior to the advent of machine learning, conventional statistical models, field surveys, and simulation models were used in diverse combinations to explain and predict crop-yield dynamics [4].

In recent times, machine learning techniques have emerged as potentially effective tools for probing the factors influencing crop yields and for yield prediction itself [1, 5–7]. The integration of these methods has the potential to enhance precision in nutrient management, encourage sustainable cultivation, and fortify food security. Machine learning-based tools can foster the preservation of biodiversity while concurrently augmenting crop yields across various cropping systems, soil compositions, and climatic conditions [8, 9]. Notably, each algorithm offers unique insights into the complexity of these challenges. Maize, being a model species, has been scrutinized extensively for yield and fertilization aspects in diverse experimental setups across global locations. Recent studies have showcased the integration of machine learning in various hybrid configurations.

The research by [10] established a structured approach to distinguish between the spatial and temporal aspects of crop yield, enabling the assessment of different data types in predicting yields and grasping their connection to underlying mechanisms. By employing multiple linear regression (MLR) and random forest (RF), the authors highlighted the importance of within-field location in modeling corn and soybean yields in Nebraska. Other researchers, such as [11], coupled machine learning with spatiotemporal soil fertility data to predict corn yields. In addition, [12] leveraged satellite-based climate and vegetation indices to forecast maize yields for smallholder farmers, effectively demonstrating a scalable yield estimation method even in data-scarce scenarios. Further contributions include a comprehensive evaluation of corn yield prediction, wherein six machine learning algorithms were tested alongside environmental variables from satellite observations [13]. [14] explored diverse machine learning models, including XGBoost, Gradient Boosting Machine (GBM), Random Forest, Decision Tree, Adaptive Boosting, and Neural Network, to forecast corn hybrid yields. XGBoost and GBM showed good predictive results, with XGBoost standing out due to its advanced regularization techniques. Such studies have helped shape the advancement of yield prediction modeling. In the context of maize (Zea mays L.), a staple food of considerable economic significance, yield is influenced by an array of variables and environmental dynamics [15–18]. This study focuses on assessing the predictive performance of gradient boosting regressors for corn yield prediction, including XGBoost and GBR [19, 20]. The investigation included a comparative analysis involving other machine learning algorithms used to predict yields from high-dimensional data, in order to compare them with GB-based algorithms.
The primary goal of this study is to explore the responsiveness of yields to diverse environmental variables, evaluate the contribution of weather variables to seasonal crop yield predictions, and discern disparities in the performance of distinct machine learning algorithms, particularly in the case of the ZhengDan 958 hybrid and GB-based machine learning algorithms. A schematic overview of the methodology is presented in Fig 1.

Fig 1. General workflow of the machine learning approach used in the current study.

https://doi.org/10.1371/journal.pone.0315493.g001

For this particular study, the focus centers on summer maize, which has a critical role in the food security of China, the second largest global producer and importer of maize [21]. Comprehending the intricate interplay between climate and corn production within the Chinese context is critical, and scientists have undertaken numerous investigations on the subject [22, 23]. In maize breeding, inbred lines diverge into two primary categories: the Stiff Stalk Heterosis Group (SS group) and the Non-Stiff Stalk Heterosis Group (NSS group), based on the degree of grain yield heterosis. Typically, intragroup crosses exhibit lower grain yield heterosis than intergroup crosses. The study’s focal point is the hybrid Zhengdan 958, a prominent maize hybrid widely cultivated in China. Its parent inbred lines stem from distinct heterosis groups: Zheng 58 belongs to the PA subgroup of modern U.S. hybrid-derived germplasm (part of the SS group), while Chang 7–2 belongs to the Tsipingtou (TSPT) heterosis group, a subset of the NSS group [24]. Noteworthy attributes of Zhengdan 958 include yield, planting density, and stress resilience. However, few studies evaluating yield parameters were found in the literature [25–27]. Our study contributes to these efforts by showcasing the important features linked to yield in this maize hybrid.

Materials and methods

Study site and dataset

The study area focuses on China. The area under maize, one of the most important cereals, has increased rapidly in China from 23.1 Mha (10⁶ hectares) in 2000 to 42.1 Mha in 2018. The average yield also rose from 4.60 tons per hectare to 6.10 tons per hectare over the same period. In the North China Plain (NCP), smallholders grow spring maize (Zea mays L.) together with winter wheat and achieve average maize yields of 5.39 Mg ha⁻¹ [28]. The summer maize variety ZhengDan 958, the most commonly grown variety in China, was used as test material. No irrigation was applied during the growing season except at sowing. The dataset comprises 1700 data points in the NCP. In crop modeling, a range of agronomic principles is considered in order to produce features that reflect their impact on crop growth and yield formation. We harmonized several datasets to build the groundwork for our machine learning workflow. Our initial step entailed preparing the dataset curated by [29], with a specific focus on isolating Zhengdan 958 from other cultivars during the June to September timeframe of the years 2005 to 2010. We then identified meteorological indicators that influence yield [6]. Weather data from diverse sources were homogenized and adjusted to the same temporal and spatial resolution. The result is a comprehensive dataset encompassing temperature (minimum, maximum, and average temperatures at a height of 2 meters), humidity, precipitation, atmospheric pressure (surface pressure), and soil characteristics alongside NPK applications. Table 1 summarizes the final features used to build the models.

Table 1. Description of the features used to build the yield prediction models for Zhengdan 958.

https://doi.org/10.1371/journal.pone.0315493.t001

Methodology

A number of different predictive models were trained to predict yields, encompassing eXtreme Gradient Boost (XGB), Gradient Boosting (GBR), Random Forest (RF), Artificial Neural Network (ANN), Support Vector Regression (SVR), Linear Regression (LR), Decision Trees (DT), and K-Nearest Neighbors (KNN). The eXtreme Gradient Boost (XGB) algorithm employs a gradient boosting technique fine-tuned for decision and regression trees. The Gradient Boosting Regressors (GBRs) utilize an iterative approach to refine response variable estimates by adapting new models. Central to this methodology is the development of base learning models exhibiting high correlation with the negative gradient of the ensemble’s associated loss function. In contrast, Random Forest (RF) constructs multiple regression trees using feature subsets and bootstrap samples, combining outcomes for robust predictions.
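The idea that each new base learner follows the negative gradient of the loss can be illustrated with a minimal sketch. It assumes a squared-error loss (for which the negative gradient is simply the residual), a single input feature, and depth-1 trees; the function names and toy data are hypothetical and do not reproduce the XGB or GBR configurations used in this study.

```python
# Toy gradient boosting for squared-error loss using depth-1 trees (stumps).
# Hypothetical helper names; one feature only; not the study's XGB/GBR setup.

def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error.

    Requires at least two distinct values in x.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:  # drop the max so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosting(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if xi <= threshold else right_mean)
                 for xi, p in zip(x, preds)]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}


def predict(model, xi):
    """Sum the shrunken contributions of all stumps on top of the base value."""
    total = model["base"]
    for threshold, left_mean, right_mean in model["stumps"]:
        total += model["learning_rate"] * (left_mean if xi <= threshold else right_mean)
    return total


# Toy step-shaped data (made up for illustration).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0]
model = fit_boosting(x, y)
```

The learning rate shrinks each stump's contribution, so the ensemble approaches the targets gradually; libraries such as XGBoost add regularization and second-order information on top of this basic loop.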

Multilayer Perceptron (MLP), an artificial neural network, consists of input, hidden and output layers. MLPs use backpropagation for training and can learn complex, non-linear relationships between input features and target output values. With multiple hidden layers and non-linear activation functions such as ReLU or Sigmoid, MLPs can capture interactions in the data that simpler models may miss. In this study, we use an MLP for our regression task of yield forecasting. The MLP maps input variables such as weather, soil and management data to continuous yield values.

The Support Vector Regression (SVR) algorithm attempts to locate the hyperplane (or line, in simple linear regression) that keeps as many observations as possible within a tolerance margin (ε) of the predicted values, penalizing only the deviations that fall outside this margin. Linear Regression (LR) learns from data by minimizing a loss, typically quantified as RMSE or MSE, through algorithms like gradient descent [30]. Decision Trees (DT), a widely employed algorithm, address both regression and classification tasks, with nodes representing data attributes and branches symbolizing decisions or rules based on these attributes. The final result is represented by the leaf nodes. K-Nearest Neighbors (KNN), on the other hand, is a versatile machine learning algorithm handling both regression and classification. This algorithm compares the similarity of features between a new data point and the already labeled data points in its vicinity.

To confirm the non-linearity of the data, we performed the Teräsvirta neural network test [31]. The test statistic was 669.109 with a p-value of 10⁻¹⁶, indicating a strong rejection of the null hypothesis of linearity and confirming the presence of non-linear relationships in the dataset. This result supports the use of non-linear models, such as gradient boosting, in our analysis. The test was conducted using a random seed of 42 to ensure reproducibility.

For each algorithm, a grid search across the hyperparameter space was employed to find the best-performing parameters using 5-fold cross-validation. The best-performing hyperparameters were then used to build and evaluate each model. Fig 2 shows the overall framework followed in our study.
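The grid search with 5-fold cross-validation can be sketched as follows. This is only an illustration under stated assumptions: the model is a one-feature K-Nearest Neighbors regressor, the single hyperparameter is the number of neighbours k, the data are synthetic, and all function names are hypothetical. The actual grids and models of the study are not reproduced here.

```python
import random


def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points closest to the query."""
    neighbours = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - query))[:k]
    return sum(target for _, target in neighbours) / len(neighbours)


def k_fold_indices(n, n_folds, seed=42):
    """Shuffle the indices 0..n-1 once and deal them into n_folds groups."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cv_mse(x, y, k, n_folds=5):
    """Mean squared error averaged over the held-out folds."""
    fold_scores = []
    for held_out in k_fold_indices(len(x), n_folds):
        held = set(held_out)
        train_x = [x[i] for i in range(len(x)) if i not in held]
        train_y = [y[i] for i in range(len(y)) if i not in held]
        errors = [(y[i] - knn_predict(train_x, train_y, x[i], k)) ** 2
                  for i in held_out]
        fold_scores.append(sum(errors) / len(errors))
    return sum(fold_scores) / len(fold_scores)


def grid_search(x, y, grid):
    """Evaluate every candidate and keep the one with the lowest CV error."""
    results = {k: cv_mse(x, y, k) for k in grid}
    best_k = min(results, key=results.get)
    return best_k, results


# Synthetic data: a noisy linear trend (made up for illustration).
noise = random.Random(0)
x = [i / 2 for i in range(40)]
y = [2.0 * xi + noise.gauss(0.0, 1.0) for xi in x]

grid = [1, 3, 5, 9]
best_k, results = grid_search(x, y, grid)
```

The same folds are reused for every candidate, so the candidates are compared on identical splits; with several hyperparameters the grid becomes the Cartesian product of the candidate values.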

Fig 2. Workflow diagram showing the steps from data processing, model training, and testing.

https://doi.org/10.1371/journal.pone.0315493.g002

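The performance of the models was assessed using three metrics: root mean squared error (RMSE), mean absolute error (MAE), and coefficient of determination (R²) [5]. These metrics are outlined in Eqs 1, 2 and 3:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{1}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{3}$$

Here, $y_i$ is the actual measured yield, $\hat{y}_i$ is the predicted yield, $\bar{y}$ represents the mean of the actual measured yield, and $n$ is the number of observations.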
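As a small worked example, the three metrics can be computed directly from their standard definitions. The yield values below are made up for illustration and are not taken from the study dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted yields.
measured = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]

print(round(rmse(measured, predicted), 4))  # 0.3536
print(round(mae(measured, predicted), 4))   # 0.25
print(round(r2(measured, predicted), 4))    # 0.975
```

RMSE and MAE are expressed in the units of the yield variable, whereas R² is unitless, which is why the three are usually reported together.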

Results

Prediction performance

One of the objectives of this study is to compare the performance of XGBoost (XGB) and Gradient Boosting Regressor (GBR) with that of Random Forest (RF), which is commonly used for yield prediction. In addition to these models, we evaluated MLP, KNN, DT, SVR and LR for a comprehensive comparison of predictive performance. The results in Table 2 and Fig 3 show that XGBoost has the highest prediction accuracy, with an R² of 0.84, an RMSE of 0.40 and an MAE of 0.29. GBR, which was developed similarly to XGBoost, achieved an R² of 0.57, with an RMSE of 0.66 and an MAE of 0.51. RF showed a strong performance, with an R² of 0.72, an RMSE of 0.54 and an MAE of 0.40. MLP also performed well, with an R² of 0.74 and an RMSE of 0.52, whereas KNN, DT, SVR and LR showed comparatively lower performance, with R² values between 0.13 and 0.59 and higher RMSE values. The cross-validation results show that XGBoost consistently delivered the best generalization performance, with the lowest MSE across all models. This highlights the robustness of XGBoost across different data splits, in contrast to models such as GBR and RF, which had higher cross-validation errors, indicating greater variability in performance on unseen data. While XGBoost significantly outperformed both GBR and RF in terms of prediction accuracy and stability, both GBR and RF performed competitively. Detailed results, including training and testing performance measures, can be found in the supplementary material (S4 Table). Overall, XGBoost provides the best performance, with RF and GBR also showing strong results at slightly higher cross-validation errors. Fig 3 illustrates the RMSE, MAE and R² using a 5-fold cross-validation.

Fig 3. Comparison of the performance of ML algorithms in predicting maize yield using 5-fold CV.

The performance of the models was assessed using RMSE, R² and MAE.

https://doi.org/10.1371/journal.pone.0315493.g003

Table 2. RMSE, MAE and R² of the models using 5-fold CV.

https://doi.org/10.1371/journal.pone.0315493.t002

XGB, MLP, RF, and GBR were selected for further optimization using hyperparameter tuning. Table 3 highlights that both XGB and GBR outperformed RF in terms of RMSE, R², and MAE values. Specifically, XGB and GBR achieved RMSE values of 0.37 and 0.39, R² values of 0.87 and 0.86, and MAE values of 0.26 and 0.27, respectively. In contrast, the results of hyperparameter tuning for RF were relatively lower, with an RMSE of 0.71, an R² of 0.51, and an MAE of 0.56. MLP, while performing better than RF, showed weaker performance compared to both XGB and GBR, with an RMSE of 0.51, an R² of 0.75, and an MAE of 0.38. These results demonstrate that XGB and GBR, after hyperparameter tuning, offer superior predictive performance in comparison to both MLP and RF, particularly in terms of minimizing RMSE and maximizing R². To confirm the best-performing model, we conducted the Diebold-Mariano test [32], which revealed that XGBoost significantly outperforms all other models in terms of forecasting accuracy. The results show a statistically significant improvement in performance, confirming XGBoost as the most accurate model among those tested. Full test results are available in the supplementary material (S5 Table).
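The logic of the Diebold-Mariano comparison can be sketched in a simplified form. The sketch assumes a squared-error loss, a one-step horizon (so only the lag-0 variance of the loss differential is used), and a normal approximation for the p-value; the error values are made up, the function name is hypothetical, and the exact variant applied in the study may differ.

```python
from statistics import NormalDist, mean, pvariance


def diebold_mariano(errors_a, errors_b):
    """Simplified DM test on squared-error loss differentials.

    A negative statistic means model A has the lower average squared error.
    Assumes a one-step horizon, so no autocovariance terms are included.
    """
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    n = len(d)
    statistic = mean(d) / (pvariance(d) / n) ** 0.5
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(statistic)))
    return statistic, p_value


# Hypothetical prediction errors of two models on the same observations.
errors_a = [0.10, -0.20, 0.15, -0.10, 0.05, 0.20, -0.15, 0.10]
errors_b = [0.50, -0.60, 0.40, -0.70, 0.30, 0.60, -0.50, 0.45]

dm_stat, dm_p = diebold_mariano(errors_a, errors_b)
```

Because the test works on paired loss differences, both models must be evaluated on exactly the same observations; for small samples a corrected version of the statistic with a Student t reference distribution is often preferred.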

Table 3. The predictive performance of XGB, GBR, MLP, and RF after hyper-parameters tuning (the highest accuracy was highlighted in bold).

https://doi.org/10.1371/journal.pone.0315493.t003

To illustrate the comparative results, we generated plots depicting the predicted versus actual yield for the selected algorithms. Fig 4 illustrates the correlation between the predicted and actual yield. A positive-sloped linear regression line is shown within each plot. While some values deviate marginally from the line, these deviations may be attributed to bias effects. Overall, the visual analysis is consistent with the reported performance of XGBoost, GBR, MLP and RF.

Fig 4. Comparison between predicted and actual crop yield of each model: XGB, GBR, MLP and RF.

https://doi.org/10.1371/journal.pone.0315493.g004

We performed normality tests on the residuals for each of the models, as presented in Fig 5. The results indicated that the residuals for both the Gradient Boosting Regressor (GBR) and eXtreme Gradient Boosting (XGB) models showed slight deviations from normality. This outcome is typical for non-linear models, where such deviations are common. However, these deviations do not substantially impact the predictive performance of the models, as GBR and XGB do not rely heavily on normality assumptions. Detailed test results are available in the supplementary material S1 and S2 Figs.
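The points of a residual QQ plot can be computed as in the following sketch, which pairs the sorted standardized residuals with standard normal quantiles. The plotting positions (i + 0.5)/n are one common convention and are an assumption here, as are the function name and the residual values.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical, observed) quantile pairs for a normal QQ plot."""
    n = len(residuals)
    centre, spread = mean(residuals), stdev(residuals)
    # Standardize so the residuals are compared with the standard normal.
    observed = sorted((r - centre) / spread for r in residuals)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))


# Hypothetical residuals (measured minus predicted yield).
residuals = [-1.2, -0.4, -0.1, 0.0, 0.2, 0.3, 0.9]
points = qq_points(residuals)
```

If the residuals were normally distributed, the pairs would fall close to the identity line; systematic curvature at the ends indicates heavy or light tails.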

Fig 5. Residual Quantile-Quantile (QQ) plot for each model: XGB, GBR, MLP, and RF (the QQ plot compares the observed residuals against the standardized normal distribution).

https://doi.org/10.1371/journal.pone.0315493.g005

Feature importance

We studied the importance of the features to determine the weight of each feature using the best-performing algorithms: XGB, GBR, MLP, and RF. Feature importance scores were calculated based on how often a feature is used in constructing decision trees within the ensemble and how much those splits improve the model’s performance. For the MLP model, the importance of the features was determined using permutation importance. This method evaluates the impact of each feature by randomly permuting its values and measuring the resulting decrease in model performance, which is quantified by the change in mean squared error (MSE). Our exploration shows the significance of diverse features such as temperature at 2 meters altitude (°C), maximum temperature at 2 meters altitude, minimum temperature at 2 meters altitude, relative humidity at 2 meters altitude, precipitation (mm/day), and surface pressure (kPa) in relation to the yield of the maize cultivar Zhengdan 958 as shown in Fig 6.
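Permutation importance, as described for the MLP model, can be sketched in a model-agnostic way. The example assumes that the fitted model is available as a plain prediction function; the toy model, the data, and the function names are hypothetical and are used only to show the mechanics.

```python
import random


def mse(predict, rows, targets):
    """Mean squared error of a prediction function on a dataset."""
    return sum((t - predict(r)) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, n_repeats=5, seed=42):
    """Average increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(predict, permuted, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy setup: the model uses feature 0 and ignores feature 1.
rows = [[float(i), float((7 * i) % 10)] for i in range(10)]
targets = [3.0 * r[0] for r in rows]


def toy_model(row):
    return 3.0 * row[0] + 0.0 * row[1]


scores = permutation_importance(toy_model, rows, targets)
```

A feature the model ignores receives an importance of zero, while shuffling an informative feature raises the error. Strongly correlated features can share or mask importance, which is one reason rankings differ between models.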

Fig 6. Comparison between feature importance of each model: XGB, GBR, MLP, and RF.

https://doi.org/10.1371/journal.pone.0315493.g006

XGB’s assessment of feature importance revealed that minimum temperature at 2 meters held the greatest influence over the yield response of the ZhengDan 958 variety. Other key features included N input, surface pressure, temperature at 2 meters, and relative humidity at 2 meters, which all had a significant impact on the yield predictions. This highlights the importance of weather-related factors in XGB’s predictive model, particularly temperature. For Random Forest (RF), SOM (Soil Organic Matter) emerged as the most dominant feature influencing yield prediction, followed by Olsen-P, Ava-K, and P2O5 input. RF’s reliance on soil attributes underscores its focus on nutrient availability and soil health as major determinants of crop yield, with weather features playing a comparatively lesser role. In the case of Gradient Boosting Regressor (GBR), P2O5 input was identified as the most important feature, followed by SOM, surface pressure, and temperature range at 2 meters. This suggests that while soil properties like P2O5 and SOM play a significant role, GBR also places substantial emphasis on weather-related factors, balancing soil and weather influences more evenly than RF. The MLP model, on the other hand, ranked relative humidity at 2 meters as the most important feature, with minimum temperature at 2 meters, precipitation, and maximum temperature at 2 meters following closely. This indicates that MLP’s predictive capability is primarily driven by atmospheric conditions, particularly humidity and temperature. These findings emphasize the critical role that temperature-related features play across the models, particularly in the case of XGB and MLP. Additionally, soil properties like P2O5, SOM, and Olsen-P were consistently important for models like RF and GBR, demonstrating the varied importance that each model assigns to weather and soil factors. Notably, XGB and GBR showed greater agreement in prioritizing both soil and weather features, while RF placed more emphasis on soil attributes, and MLP leaned more towards atmospheric conditions.

The outcomes of feature importance underscore a consistent preference for weather traits (as showcased in Fig 6), alongside the importance of nutrient availability and nutrient status attributes. In addition, we conducted a stepwise regression analysis. The key variables identified included ‘P2O5 input (kg ha-1)’, ‘K2O input (kg ha-1)’, ‘Olsen-P (mg kg-1)’, ‘Ava-K (mg kg-1)’, ‘SOM (g kg-1)’, ‘Surface Pressure (kPa)’, and ‘N input (kg ha-1)’, which were selected based on their statistical significance. A one-unit increase in these parameters is associated with a significant increase in yield (in kg/ha), while holding all other variables constant. The inclusion of both soil and weather features is therefore likely to improve the predictive precision of maize yield for the Zhengdan 958 hybrid. The full test results can be found in the supplementary material (S3 Table).

Discussion

Using a large collection of maize observations from China, we undertook a comprehensive analysis to summarize the changes in yield response across varying fertilizer rates, soil attributes, and weather conditions. Our investigation used the summer maize cropping system and the cultivar ZhengDan 958 as the focal study crop. An important step of our approach involved the integration of weather data, particularly historical and site-specific information, which is a crucial stride towards enhancing the robustness and precision of yield predictions. Within this study, we directed our attention towards assessing the efficacy of two gradient-boosting machine learning algorithms, namely XGB and GBR, in the prediction of ZhengDan 958’s yield. This endeavor marks the first time this particular cultivar has been used as the benchmark for comparing diverse machine learning algorithms. To facilitate this comparative analysis, we compiled a multi-source dataset to identify the most powerful algorithm and the key features for corn yield prediction. Our study has underscored the feasibility of harnessing machine learning techniques to generate site-specific yield predictions. While prior research has acknowledged the efficacy of random forest models in predicting corn production [1, 33, 34], our investigation reveals that GBR and XGB surpass the performance of the random forest algorithm on the ZhengDan 958 data. A recent comparative analysis conducted by [13] also highlighted that XGBoost outperformed alternative algorithms. Moreover, a review by [35] that evaluated over fifty studies in crop yield prediction underscored the prevalence of gradient-boosting trees and random forests as the preferred algorithms within this domain, in concordance with our findings. Similarly, a study by Burdett et al. employed multiple linear regression, artificial neural networks, decision trees, and random forests to establish methodologies capable of associating soil attributes with crop yields at a subfield scale. The features included topographic attributes and soil nutrient data such as pH, soil organic matter content (OM), cation exchange capacity (CEC), soil phosphorus content, zinc (Zn), potassium (K), elevation, and topographic moisture index. Among these techniques, random forests exhibited superior performance, achieving an R² value of 0.85 for corn yield prediction [34]. Furthermore, our integrated dataset analysis underscores GBR and XGBoost as contenders for robust yield prediction and significant attribute selection, as compared to RF and MLP. In addition to the predictive power attributed to algorithmic enhancements, the performance observed is also influenced by the array of environmental and soil properties [13, 36]. Prior research has often focused on factors such as soil characteristics, fertilizer utilization, and management methodologies to predict crop yields, often neglecting the substantial impact of weather and environmental attributes [29]. However, these neglected elements are worthy of attention as determinants of crop growth. Recent trends have witnessed an increasing integration of weather features into analyses, facilitated by the availability of expansive datasets offering real-time or historical weather information through APIs. Feature selection outcomes have underscored the influence of independent attributes, highlighting the efficacy of weather features in yield prediction [37]. Among them, precipitation, temperature, and surface pressure have been consistently favored by top-performing algorithms.

For instance, the inclusion of these features resulted in a significant enhancement in GBR’s performance, reflected in a reduction of RMSE from 0.76 to 0.39, an improvement in R² from 0.43 to 0.86, and a reduction of MAE from 0.58 to 0.27. For XGBoost, the optimized model achieved an RMSE of 0.37, R² of 0.87, and MAE of 0.26, which represents an improvement from 0.47, 0.78, and 0.34, respectively. MLP showed moderate performance with an RMSE of 0.51, an R² of 0.75, and an MAE of 0.38, which represents an improvement from 0.68, 0.54, and 0.51, respectively. In contrast, RF yielded the lowest performance, with an RMSE of 0.71, R² of 0.51, and an MAE of 0.56, which represents a deterioration from 0.65, 0.58, and 0.48, respectively. The results of the models without the weather features are presented in the supplementary material (S1 Table). These results indicate the superiority of gradient-boosting methods like XGB and GBR over RF and MLP in predicting maize hybrid yields. The outcomes also underscore the notable role of soil attributes, particularly SOM, NPK availability, and utilization, in yield prediction. Furthermore, the integration of diverse data sources has been found to improve the model’s accuracy and robustness. Hence, instead of relying solely on a singular dataset, a holistic approach including various data sources is advocated [1, 13, 38].

Feature contributions can also be measured via SHapley Additive exPlanations (SHAP) values. Features that contribute more to the prediction are associated with higher absolute SHAP values. Fig 7 shows the SHAP values of the features used in our study when using GBR for prediction. For example, the figure shows that several observations (red points positioned on the left of the x-axis) are negatively correlated with ‘Surface Pressure’. The figure also shows the importance of ‘P2O5 input’ along with the two weather features as the top three contributing features.
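The quantity that SHAP values approximate can be shown with an exact Shapley computation on a very small model. The sketch below assumes that "absent" features are replaced by a single baseline value and enumerates all feature coalitions, which is only feasible for a handful of features; the toy model, inputs, and function names are hypothetical, and SHAP libraries rely on faster, model-specific algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature for one prediction.

    Features outside a coalition take their baseline value (an assumption;
    other choices, such as averaging over a background dataset, exist).
    """
    n = len(x)

    def value(coalition):
        mixed = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return predict(mixed)

    contributions = []
    for j in range(n):
        others = [k for k in range(n) if k != j]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi += weight * (value(members | {j}) - value(members))
        contributions.append(phi)
    return contributions


# Toy model with one interaction term between features 0 and 2.
def toy_model(row):
    return 2.0 * row[0] - 1.0 * row[1] + 0.5 * row[0] * row[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print([round(v, 4) for v in phi])  # [2.75, -2.0, 0.75]
```

The contributions sum to the difference between the prediction for x and the prediction for the baseline (here 1.5), and the interaction term is split equally between the two features involved.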

Fig 7. SHAP values of feature importance of the best performing model XGB.

https://doi.org/10.1371/journal.pone.0315493.g007

Conclusions

In summary, this study demonstrates the successful application of machine learning algorithms to produce accurate and timely yield estimates for the maize variety Zhengdan 958 in China. By integrating various data sources, including soil features and weather characteristics, we developed machine learning models that provide accurate, site-specific yield predictions. Among the models tested, XGBoost proved to be the best performing, closely followed by GBR, underlining the strength of gradient boosting techniques in yield prediction. Although the MLP model also delivered reasonable results, it lagged behind XGBoost and GBR. The inclusion of weather features, particularly temperature and humidity, significantly improved the prediction accuracy of both XGBoost and GBR, as evidenced by their strong performance metrics. This result underscores the crucial role of environmental factors in crop yield modeling, alongside traditional soil attributes such as P2O5, SOM and Olsen-P. Our results support the growing realization that machine learning, when properly applied, has the potential to transform agricultural practices by improving yield predictions and supporting data-driven decisions. This study also suggests that future research should incorporate long-term field trials to further refine yield prediction models, especially for specific varieties such as Zhengdan 958. These efforts can contribute to the development of personalized decision support systems tailored to smallholder farmers, enabling more accurate yield predictions, higher nutrient efficiency and better economic returns. Such systems can play an important role in optimizing fertilizer use, reducing environmental impacts and ultimately increasing crop productivity.
However, further research is needed to develop even more robust models that incorporate additional variables such as environmental impacts and residual nutrient inputs to the soil. Creating reliable indices to measure these multiple influences will be critical to improving model accuracy and applicability. Overall, this study sets the stage for a broader application of machine learning techniques in agriculture and highlights their potential to make informed decisions and promote sustainable agricultural practices.

Supporting information

S1 Fig. Frequency distribution of the most selected features used in the current study.

The histogram visualizes the importance of various features.

https://doi.org/10.1371/journal.pone.0315493.s001

(TIF)

S2 Fig. Residual plots for model evaluation.

These plots help assess the accuracy and errors in predictions.

https://doi.org/10.1371/journal.pone.0315493.s002

(TIF)

S3 Fig. Residual histograms for models.

This figure illustrates the distribution of residuals, showcasing the deviations.

https://doi.org/10.1371/journal.pone.0315493.s003

(TIF)

S1 Table. Model performance comparison.

Detailed performance metrics (R², MSE, RMSE, and MAE) for each model evaluated in the study.

https://doi.org/10.1371/journal.pone.0315493.s004

(PDF)

S2 Table. Descriptive statistics for features.

Summary statistics (Mean, Std, Min, Max) for all input variables used in the models.

https://doi.org/10.1371/journal.pone.0315493.s005

(PDF)

S3 Table. Stepwise selected features.

Features selected for the models based on their statistical significance (p-values).

https://doi.org/10.1371/journal.pone.0315493.s006

(PDF)

S4 Table. Predictive performance testing and training results with CV errors.

Metrics for training and testing datasets, and cross-validation errors across models.

https://doi.org/10.1371/journal.pone.0315493.s007

(PDF)

S5 Table. Diebold-Mariano test results.

Comparative statistics for model accuracy between pairs of models.

https://doi.org/10.1371/journal.pone.0315493.s008

(PDF)
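
For readers unfamiliar with the Diebold-Mariano test, the following is a minimal Python sketch of its simplest form: squared-error loss, no autocorrelation correction, and a normal approximation for the p-value. These choices are assumptions made for illustration; the variant used for S5 Table is not specified here. The function name `diebold_mariano` and the data are hypothetical.

```python
import math


def diebold_mariano(y_true, pred_a, pred_b):
    """Simplified Diebold-Mariano statistic with squared-error loss.

    A negative statistic means model A has the lower average loss.
    Assumes uncorrelated loss differentials (no autocovariance terms).
    """
    n = len(y_true)
    if n < 2 or n != len(pred_a) or n != len(pred_b):
        raise ValueError("need at least two paired observations")

    # Loss differential for each observation: loss(A) - loss(B).
    d = [(t - a) ** 2 - (t - b) ** 2 for t, a, b in zip(y_true, pred_a, pred_b)]
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
    if var_d == 0:
        raise ValueError("loss differentials have zero variance")

    stat = mean_d / math.sqrt(var_d / n)
    # Two-sided p-value under the standard normal approximation.
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value


# Hypothetical observations and predictions (illustrative values only).
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
model_a = [1.1, 1.9, 3.1, 3.9, 5.1, 5.9]
model_b = [1.5, 2.4, 3.6, 3.3, 5.5, 6.8]
stat, p_value = diebold_mariano(y, model_a, model_b)
print(stat, p_value)
```

A small p-value suggests that the difference in predictive accuracy between the two models is unlikely to be due to chance alone.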

Acknowledgments

We extend our gratitude to Dr. X. Yan from the International Magnesium Institute, College of Resources and Environment at Fujian Agriculture and Forestry University, for generously providing open access to the valuable data used in this study.

References

1. Ennaji O., Vergutz L. & El Allali A. Machine learning in nutrient management: A review. Artificial Intelligence In Agriculture. (2023).
2. Paudel D., Boogaard H., Wit A., Janssen S., Osinga S. & Pylianidis C. Machine learning for large-scale crop yield forecasting. Agricultural Systems. 187 pp. 103016 (2021).
3. Pantazi X., Moshou D., Alexandridis T., Whetton R. & Mouazen A. Wheat yield prediction using machine learning and advanced sensing techniques. Computers And Electronics In Agriculture. 121 pp. 57–65 (2016).
4. Liakos K., Busato P., Moshou D., Pearson S. & Bochtis D. Machine learning in agriculture: A review. Sensors (Switzerland). 18, 1–29 (2018). pmid:30110960
5. Coulibali Z., Cambouris A. & Parent S. Site-specific machine learning predictive fertilization models for potato crops in Eastern Canada. PLOS ONE. 15, e0230888 (2020). pmid:32764750
6. Qin Z., Myers D., Ransom C., Kitchen N., Liang S., Camberato J., et al. Application of Machine Learning Methodologies for Predicting Corn Economic Optimal Nitrogen Rate. Agronomy Journal. 110, 2596–2607 (2018).
7. Sweet L., Müller C., Anand M. & Zscheischler J. Cross-validation strategy impacts the performance and interpretation of machine learning models. Artificial Intelligence For The Earth Systems. pp. 1–35 (2023).
8. Barbosa A., Trevisan R., Hovakimyan N. & Martin N. Modeling yield response to crop management using convolutional neural networks. Computers And Electronics In Agriculture. 170 pp. 105197 (2020).
9. Luo Y., Wang H., Cao J., Li J., Tian Q., Leng G., et al. Evaluation of machine learning-dynamical hybrid method incorporating remote sensing data for in-season maize yield prediction under drought. Precision Agriculture. pp. 1–25 (2024).
10. Franz T., Pokal S., Gibson J., Zhou Y., Gholizadeh H., Tenorio F., et al. The role of topography, soil, and remotely sensed vegetation condition towards predicting crop yield. Field Crops Research. 252 pp. 107788 (2020).
11. Nyeki A., Kerepesi C., Daróczy B., Benczúr A., Milics G. & Nagy J. Application of spatio-temporal data in site-specific maize yield prediction with machine learning methods. Precision Agriculture. 22, 1397–1415 (2021).
12. Zhang L., Zhang Z., Luo Y., Cao J., Xie R. & Li S. Integrating satellite-derived climatic and vegetation indices to predict smallholder maize yield using deep learning. Agricultural And Forest Meteorology. 311 pp. 108666 (2021).
13. Kang Y., Ozdogan M., Zhu X., Ye Z., Hain C. & Anderson M. Comparative assessment of environmental variables and machine learning algorithms for maize yield prediction in the US Midwest. Environmental Research Letters. 15, 064005 (2020).
14. Sarijaloo F., Porta M., Taslimi B. & Pardalos P. Yield performance estimation of corn hybrids using machine learning algorithms. Artificial Intelligence In Agriculture. 5 pp. 82–89 (2021).
15. Jiang W., Liu X., Qi W., Xu X. & Zhu Y. Using QUEFTS model for estimating nutrient requirements of maize in the Northeast China. Plant, Soil And Environment. 63, 498–504 (2017).
16. Ogutu G., Franssen W., Supit I., Omondi P. & Hutjes R. Probabilistic maize yield prediction over East Africa using dynamic ensemble seasonal climate forecasts. Agricultural And Forest Meteorology. 250 pp. 243–261 (2018).
17. Kipkulei H., Bellingrath-Kimura S., Lana M., Ghazaryan G., Baatz R., Matavel C., et al. Maize yield prediction and condition monitoring at the sub-county scale in Kenya: synthesis of remote sensing information and crop modeling. Scientific Reports. 14, 14227 (2024). pmid:38902311
18. Villiers C., Mashaba-Munghemezulu Z., Munghemezulu C., Chirima G. & Tesfamichael S. Assessing Maize Yield Spatiotemporal Variability Using Unmanned Aerial Vehicles and Machine Learning. Geomatics. 4, 213–236 (2024).
19. Li Y., Zeng H., Zhang M., Wu B. & Qin X. Global de-trending significantly improves the accuracy of XGBoost-based county-level maize and soybean yield prediction in the Midwestern United States. GIScience & Remote Sensing. 61, 2349341 (2024).
20. Mahesh P. & Soundrapandiyan R. Yield prediction for crops by gradient-based algorithms. PLOS ONE. 19, e0291928 (2024). pmid:39186769
21. Food and Agriculture Organization [FAO] Global Food Production Data. (2021), Available from: https://www.fao.org/statistics/en/.
22. Tao Z., Chen Y., Chao L., Zou J., Peng Y., Yuan S. et al. The causes and impacts for heat stress in spring maize during grain filling in the North China Plain—A review. Journal Of Integrative Agriculture. 15, 2677–2687 (2016).
23. Holst J., Liu W., Zhang Q. & Doluschitz R. Crop evapotranspiration, arable cropping systems and water sustainability in southern Hebei, PR China. Agricultural Water Management. 141 pp. 47–54 (2014).
24. Ma J., Li J., Cao Y., Wang L., Wang F., Wang H., et al. Comparative study on the transcriptome of maize mature embryos from two China elite hybrids Zhengdan958 and Anyu5. PLOS ONE. 11, e0158028 (2016). pmid:27332982
25. Lai J., Li R., Xu X., Jin W., Xu M., Zhao H. et al. Genome-wide patterns of genetic variation among elite maize inbred lines. Nature Genetics. 42, 1027–1030 (2010). pmid:20972441
26. Li H., Liu T., Cao Y., Wang L., Zhang Y., Li J. et al. Transcriptomic analysis of maize mature embryos from an elite maize hybrid Zhengdan958 and its parental lines. Plant Growth Regulation. 76, 315–325 (2015).
27. Li H., Yang Q., Gao L., Zhang M., Ni Z. & Zhang Y. Identification of heterosis-associated stable QTLs for ear-weight-related traits in an elite maize hybrid Zhengdan 958 by design III. Frontiers In Plant Science. 8 pp. 561 (2017). pmid:28469626
28. Dai Y. A revised checklist of corticioid and hydnoid fungi in China for 2010. Mycoscience. 52, 69–79 (2011).
29. Yan X., Chen X., Ma C., Cai Y., Cui Z., Chen X., et al. What are the key factors affecting maize yield response to and agronomic efficiency of phosphorus fertilizer in China?. Field Crops Research. 270 pp. 108221 (2021).
30. Abbas F., Afzaal H., Farooque A. & Tang S. Crop yield prediction through proximal sensing and machine learning algorithms. Agronomy. 10, 1046 (2020).
31. Terasvirta T., Lin C. & Granger C. Power of the neural network linearity test. Journal Of Time Series Analysis. 14, 209–220 (1993).
32. Diebold F. & Mariano R. Comparing predictive accuracy. Journal Of Business & Economic Statistics. 20, 134–144 (2002).
33. Han J., Zhang Z., Cao J., Luo Y., Zhang L. & Li Z. Prediction of winter wheat yield based on multi-source data and machine learning in China. Remote Sensing. 12 (2020).
34. Burdett H. & Wellen C. Statistical and machine learning methods for crop yield prediction in the context of precision agriculture. Precision Agriculture. 23, 1553–1574 (2022).
35. Van Klompenburg T., Kassahun A. & Catal C. Crop yield prediction using machine learning: A systematic literature review. Computers And Electronics In Agriculture. 177 pp. 105709 (2020).
36. Cedric L., Adoni W., Aworka R., Zoueu J., Mutombo F. & Krichen M. Crops yield prediction based on machine learning models: case of west african countries. Smart Agricultural Technology. pp. 100049 (2022).
37. Elavarasan D., Vincent D., Sharma V., Zomaya A. & Srinivasan K. Forecasting yield by integrating agrarian factors and machine learning models: A survey. Computers And Electronics In Agriculture. 155 pp. 257–282 (2018).
38. Lischeid G., Webber H., Sommer M., Nendel C. & Ewert F. Machine learning in crop yield modelling: A powerful tool, but no surrogate for science. Agricultural And Forest Meteorology. 312 pp. 108698 (2022).
