A machine learning approach for estimating forage maize yield and quality in NW Spain

Silverio García-Cortés; Agustín Menéndez-Díaz; María José Bande-Castro; Alfonso Carballal-Samalea; Adela Martínez-Fernández; Jose Alberto Oliveira-Prendes

doi:10.1371/journal.pone.0326364

Abstract

Crop models simulate crop growth and development according to different climatic, soil and crop management conditions. The CSM-CERES-Maize model (DSSAT) was adapted to simulate forage maize yields by calibrating the genetic parameters of six cultivars: SE1–200, SE2–300 and SE3–400 in three sites and three years in Asturias, and XU1–200, XU2–300 and XU3–400 in four sites and three years in Galicia. Calibration using the CSM-CERES-Maize model, together with the use of historical meteorological data (2000–2022) from the study sites, enabled simulation of forage maize yield (whole plant dry matter yield) and quality (whole plant net energy for lactation yield and whole plant crude protein yield) for six cultivars during the 23-year period. LightGBM models (a machine learning technique) were used with the simulated forage maize yield, quality data, historical weather, soil, and management data to capture non-linear relationships in the data and to identify the most influential variables for crop yield and quality predictions. The results of the model evaluation yielded an accuracy of 94.7% (R² score = 0.86) for forage maize yield, an accuracy of 94.0% (R² score = 0.84) for the net energy for lactation yield and an accuracy of 93.0% (R² score = 0.85) for the crude protein yield. Variable importance plots revealed Growing Season and Radiation from sowing to harvest to be the top two most influential predictor variables. In Asturias and Galicia, the cultivars with the longest cycle (cultivars cycle 400) are those with the highest values for the variables studied in the 23 years of historical meteorological data (average of three sites in Asturias and four sites in Galicia with three sowing dates in each site).
The models will be available to make predictions for forage maize yield and quality by non-specialist users, using the geographical location of the crop field, cultivar type, sowing and harvest date and probable values of weather variables during the growing season as input data.

Citation: García-Cortés S, Menéndez-Díaz A, Bande-Castro MJ, Carballal-Samalea A, Martínez-Fernández A, Oliveira-Prendes JA (2025) A machine learning approach for estimating forage maize yield and quality in NW Spain. PLoS One 20(8): e0326364. https://doi.org/10.1371/journal.pone.0326364

Editor: RISHIRAJ DUTTA, Asian Disaster Preparedness Center, THAILAND

Received: October 16, 2024; Accepted: May 29, 2025; Published: August 12, 2025

Copyright: © 2025 García-Cortés et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data and code relevant to this paper are available from: https://doi.org/10.5281/zenodo.15470090.

Funding: This research was supported by a mobility research grant awarded to JAO (TAD/CRP PO 500109615) from the OECD Co-operative Research Programme. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

1. Introduction

Production of forage maize in Spain is concentrated in the northern part of the country. The main production regions are Galicia, Asturias and Cantabria, covering areas of 73836 ha, 7033 ha and 4610 ha respectively [1]. Together, these three regions account for 89% of the area planted with forage maize in Spain.

In collaboration with seed-producing companies, an evaluation of commercial hybrid maize cultivars for silage was initiated in 1996 in Asturias by the Regional Service for Agri-Food Research and Development (SERIDA) and in 1999 in Galicia by the Agricultural Research Centre of Mabegondo (CIAM) [2,3] and has been continued to the present. High levels of interannual variability in the results can occur owing to differences in climatology (temperature, precipitation, etc.). It is therefore important to have data available for more than one year to enable agronomic characterization of cultivars.

The sensitivity of maize grain yield to elevated temperatures (alone or associated with water or nutrient stress) is much higher during the critical flowering period, which determines the number of grains per plant, or during grain filling, which influences grain weight and quality [4].

Process-based (functional) models are excellent tools for studies quantifying the effects of management, genetics, soil and climate on crop yields and phenology. The CSM-CERES-Maize model [5–7] within the DSSAT package has been widely used to support decision-making regarding the management of irrigation and fertilization, as well as the choice of maize cultivars. CERES-Maize continues to be the most widely used maize model globally and remains the basis of other maize models, including APSIM [8] and the CSM-IXIM [9].

The authors of [10,11] initially adapted the CSM-CERES-Maize model (CSM = Cropping System Model), in the software provided by the Decision Support System for Agrotechnology Transfer (DSSAT) [12], to simulate the growth and development of three forage maize cultivars (SE1–200, SE2–300, and SE3–400) in three sites (Barcia, Villaviciosa and Grado) in Asturias.

One of the major problems in the selection of genotypes (cultivars) with high productivity in different environments (locations, years) is the genotype × environment interaction (GEI). Multi-environment trials (MET) often use the AMMI (additive main-effects and multiplicative interaction) model, which is popular for analyzing MET data with fixed effects [13]. This model is a statistical tool for identifying GEI patterns and allows grouping genotypes according to response characteristics (identification of stable genotypes) and detecting trends between environments [14]. Moreover, DSSAT-based seasonal analysis was conducted to examine the interannual variation in forage maize productivity in combination with meteorological data available for 23 years (2000–2022) in the three sites in Asturias. Similar work was later performed with data from three cultivars (XU1–200, XU2–300, and XU3–400) in four sites (Ribadeo, Ordes, Deza and Sarria) in Galicia [15].

The objective of the present study was to obtain a predictive model using a forage maize dataset (field, weather, soil and management data) and a machine learning technique to help optimize agronomic practices and harvest decisions for forage maize farmers and policymakers in Northwestern Spain.

The workflow followed in this study, which combines existing and new data, is summarized in Fig 1.

Fig 1. General workflow.

The left vertical part represents the field data sources and the creation of synthetic data using CSM-CERES and DSSAT. The right part shows the machine learning modelling phase applied to these data. The outputs of the process are a forage maize production prediction model, the associated validation metrics and the relative importance of the input variables for those predictions.

https://doi.org/10.1371/journal.pone.0326364.g001

This article is organized as follows: the Materials and methods section details the field data collected across seven localities in the northwest of Spain, along with the adaptation of the CSM-CERES-Maize model to simulate forage maize yield and quality by calibrating the genetic parameters of six cultivars. Additionally, the methodology for generating synthetic data to analyze interannual variations in production over a 23-year period is described. Subsequently, the Machine learning processing subsection describes the LightGBM model and its application in developing a predictive model for forage maize yield based on the data presented in the previous section. The Results and discussion section presents the model validation metrics obtained using the dataset reserved for this purpose. Finally, the study concludes with a summary of key findings and insights in the final section.

2. Materials and methods

2.1 Experimental sites and minimum data sets in Asturias and Galicia

Maize cultivars (SE1–200, FAO-200; SE2–300, FAO-300; SE3–400, FAO-400) were evaluated in field trials conducted in three sites in Asturias: Barcia (43.5402, −6.4954 and 25 masl), Villaviciosa (43.4722, −5.4361 and 10 masl) and Grado (43.3764, −6.0625 and 50 masl). The cultivars are given fictitious names here due to the strict confidentiality regarding the name of the hybrids imposed by the seed companies.

The soil in the Barcia site, located on the western coast of Asturias, is characterized as a loam soil (order Inceptisol, suborder Udepts, great group Dystrudepts). The soil in Villaviciosa, in the central coastal zone, has a clay-loam texture (order Entisol, suborder Fluvents, great group Udifluvents). The soil in Grado, situated in an interior valley in central Asturias, has a sandy clay loam texture (order Inceptisol, suborder Udepts, great group Dystrudepts) [16].

All the study sites belong to the temperate oceanic climate zone (type Cfb), according to the Köppen-Geiger climate classification [17]. The temperature in the coldest month is lower than 18 °C but higher than −3 °C. The mean temperature is lower than 22 °C in all months and higher than 10 °C in at least four months of the year. The precipitation does not vary significantly between seasons [18].

Maize cultivars (XU1–200, FAO-200; XU2–300, FAO-300; XU3–400, FAO-400) were evaluated in experimental field trials at four locations in Galicia: Ribadeo (43.5458, −7.0816 and 43 masl), Ordes (43.0432, −8.4458 and 300 masl), Deza (43.6995, −8.3192 and 400 masl) and Sarria (42.8194, −7.3758 and 520 masl). The soils included in the trials in Ribadeo (located in A Mariña Oriental, north-east of Lugo), Ordes (centre of the province of A Coruña) and Deza (north of Pontevedra) have a sandy loam texture. The Ribadeo soil is developed on slate and those of Ordes and Deza on schist. The soil in the Sarria (south of Lugo) trials has a sandy-clay loam texture and is classified as slate soil.

The Ribadeo experimental site, according to the Köppen-Geiger climate classification, belongs to the humid temperate climate with warm summer (type Cfb). The average temperature of the coldest month is below 18 °C and above −3 °C. The average temperature in the hottest month does not reach 22 °C, but it is higher than 10 °C for four or more months of the year. Rainfall is spread throughout the year, and there is no dry season.

The other three experimental sites belong to the temperate rainy climate with dry and warm summer (Csb). The mean temperature of the coldest month is below 18 °C and above −3 °C. The mean temperature of the warmest month does not reach 22 °C and exceeds 10 °C in four or more months of the year. Precipitation exceeds evaporation. Rainfall decreases considerably in summer, coinciding with high temperatures.

A historical series of meteorological data for the experimental sites, covering a period of 23 years, was obtained at the weather stations closest to the experimental sites and provided by the State Meteorological Agency [19] for Asturias and by the regional MeteoGalicia [20] for Galicia.

2.2 Cultivar characteristics

Table 1 shows the sowing date (day of the year, doy, or Julian date), anthesis date (doy), harvest date (doy) and the periods between sowing and anthesis and between sowing and harvest (Growing season).

Table 1. Mean values and Standard deviations in brackets of hybrid forage maize cultivars (SE1-200, SE2-300 and SE3-400 in Asturias and XU1-200, XU2-300 and XU3-400 in Galicia).

https://doi.org/10.1371/journal.pone.0326364.t001

2.3 Adaptation of the CSM-CERES-Maize model

CSM-CERES-Maize requires six parameters, known as “genetic coefficients”, to characterize different cultivars. Each genetic coefficient has a direct influence on a specific crop model variable (Table 2).

Table 2. Genetic coefficients that characterize each cultivar type in the CSM-CERES-Maize model.

https://doi.org/10.1371/journal.pone.0326364.t002

The estimated genetic coefficients for three cultivars (SE1–200, SE2–300, SE3–400) for three locations in Asturias and for three cultivars (XU1–200, XU2–300, XU3–400) for four locations in Galicia for three years are indicated in Table 3.

Table 3. Estimated genetic coefficients in three cultivars (SE1-200, SE2-300, SE3-400) in 3 years and 3 locations in Asturias and three cultivars (XU1-200, XU2-300, XU3-400) obtained with experimental data from 3 years and 4 locations in Galicia.

https://doi.org/10.1371/journal.pone.0326364.t003

At present, CSM-CERES-Maize simulates seven phenological states: germination, emergence, end of the juvenile phase (in maize, leaf 6 and leaf 7 are juvenile-to-adult transition leaves and leaf 8 is normally an adult leaf, according to [21]), flower initiation (anthesis date), 75% of the plants with visible stigmas (female flowering), initiation of grain filling and physiological maturity. These stages do not provide sufficient detail for producing forage maize destined for silage, as the harvest time is determined in the field by the position of the milk line in the grain. This variable (not simulated by the CSM-CERES-Maize model) is commonly used as an indicator of the optimal moisture content for harvesting forage maize for silage [22,23]. To overcome this model limitation, it was assumed that at the time of harvesting the forage maize, the milk line in the grain is halfway between the crown and the tip, i.e., 13 days before physiological maturity, as indicated by [24].

The CSM-CERES-Maize model was calibrated with data obtained in the 2012–2013–2014 field trials at the three evaluation locations in Asturias and in the 2014–2015–2016 field trials at the four evaluation locations in Galicia [25].

2.4 Simulation of the interannual variation in forage maize production

To examine the interannual variation in the anthesis date, whole plant dry matter production and whole plant nitrogen production, the model was run with historical meteorological data for a period of 23 years (2000–2022) to simulate the production and phenology of the cultivars in each of the study locations in Asturias and Galicia [25].

In Spain and France, net energy in feed is expressed in a barley feed unit (one kg of standard barley contains one Unité Fourragère, one UF). Net energy represents the energy used by the animal’s body for maintenance, growth and production. The net energy for lactation (NEL) in dairy cattle is also given according to the French standards and expressed as UFL (UF Laitières = milk forage unit = NEL, with 1 UFL = 7.1 MJ/kg DM = 1.7 Mcal NEL) according to [26]. UFL values are obtained from laboratory determination of in vitro digestibility of organic matter and organic matter in the feeds; they are therefore highly correlated with feed digestibility.

As the current CSM-CERES-Maize model does not enable calculation of the net energy of the maize forage (UFL/kg DM) and whole plant net energy for lactation yield (UFL/ha), the dry matter production of the whole plant was converted into energy production by considering the ear harvest index (HIPD = ear dry matter production/total biomass production), as demonstrated by [27].

Values of 0.61 UFL/kg DM and 1.08 UFL/kg DM were used for the energy contents (energy value) of the foliage (stems + leaves + husks) and ears respectively. The following equations were used:

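\[ \text{UFL/kg DM} = 1.08 \times \text{HIPD} + 0.61 \times (1 - \text{HIPD}) \tag{1} \]

\[ \text{UFL/ha} = \text{UFL/kg DM} \times \text{kg DM/ha} \tag{2} \]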

The CSM-CERES-Maize model enables calculation of the kg N/ha but not of the kg CP/ha. Almost all the N in plants is present as amino acids in proteins, and the average N content of proteins is 16%, so the N-to-protein conversion factor is 100/16 = 6.25 [28]. The crude protein yield (kg CP/ha) was therefore calculated as kg N/ha × 6.25.
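The conversions described in this subsection can be illustrated with a short Python sketch. It assumes that the energy content of the whole plant is the HIPD-weighted average of the ear and foliage energy values given above, which is an assumption about the form of Eqs. (1) and (2) rather than a confirmed transcription. The function names and the example inputs are hypothetical.

```python
# Energy values and conversion factor taken from the text.
UFL_FOLIAGE = 0.61  # UFL/kg DM for stems + leaves + husks
UFL_EAR = 1.08      # UFL/kg DM for ears
N_TO_CP = 6.25      # 100/16, nitrogen-to-crude-protein factor


def ufl_per_kg_dm(hipd):
    """Whole-plant energy content (UFL/kg DM), assuming an HIPD-weighted mean."""
    if not 0.0 <= hipd <= 1.0:
        raise ValueError("HIPD is a fraction and must lie between 0 and 1")
    return UFL_EAR * hipd + UFL_FOLIAGE * (1.0 - hipd)


def ufl_per_ha(dm_yield_kg_ha, hipd):
    """Net energy for lactation yield (UFL/ha) from dry matter yield (kg DM/ha)."""
    return dm_yield_kg_ha * ufl_per_kg_dm(hipd)


def crude_protein_per_ha(n_yield_kg_ha):
    """Crude protein yield (kg CP/ha) from nitrogen yield (kg N/ha)."""
    return n_yield_kg_ha * N_TO_CP


# Hypothetical inputs: 20000 kg DM/ha, ear harvest index 0.5, 200 kg N/ha.
energy_yield = ufl_per_ha(20000.0, 0.5)
protein_yield = crude_protein_per_ha(200.0)
```

Under this assumed weighting, a higher ear harvest index raises the energy content of the forage, because the ears carry more energy per kg of dry matter than the foliage.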

The variables included in the Asturian and Galician forage maize dataset used in the machine learning technique are described in Table 4.

Table 4. Description of the variables included in the Asturian and Galician forage maize dataset generated with the CSM-CERES-Maize model.

https://doi.org/10.1371/journal.pone.0326364.t004

2.5 Genotype × environment × management interaction

A three-way ANOVA was conducted to determine the effects of Cultivar, Site and Sowing date on dry matter yield. The three independent variables or factors (Cultivar, Site and Sowing date) were considered fixed factors.

Whether the three-way Site × Cultivar × Sowing date interaction is statistically significant can be read from the “Site x Cultivar x Sowing date” row in Table 5.

Table 5. Summary of Three-Way Analysis of Variance for Site, Cultivar and Sowing date factors on dry matter yield (kg DM/ha).

https://doi.org/10.1371/journal.pone.0326364.t005

There was no statistically significant three-way interaction between Cultivar, Site and Sowing date, F(24, 1386) = 1.287, p ≥ 0.05, but all the two-way interactions were significant.

There was a statistically significant two-way interaction effect between Cultivar and Site on the dry matter yield of the maize cultivars, F(12, 1386) = 5.288, p < 0.001. This indicates that the cultivars were affected differently by the Sites. There were also statistically significant two-way interaction effects on dry matter yield between Site and Sowing date, F(12, 1386) = 5.775, p < 0.001, and between Cultivar and Sowing date, F(4, 1386) = 4.616, p < 0.001.

Because the two-way interactions are statistically significant, they are interpreted in preference to the main effects. When a two-way interaction (e.g., Site × Cultivar) is significant, the interaction itself is usually of primary interest and the main effects (e.g., Site and Cultivar) are of less interest, because the effect of Cultivar is known to change across levels of Site. Nevertheless, there were statistically significant main effects of Site, F(6, 1386) = 201.6, p < 0.001, Cultivar, F(2, 1386) = 447.9, p < 0.001, and Sowing date, F(2, 1386) = 617.9, p < 0.001, on dry matter yield.

The graphical analysis (not presented here) showed non-crossover interactions [29]: the differences in performance (kg DM/ha) between the Cultivars vary across the other factors (Sites or Sowing dates), but the ranking of the Cultivars from most to least productive does not change with Site or Sowing date.

2.6 Machine learning processing

The goal is to utilize machine learning techniques to develop a model tailored to this geographic region that can provide production predictions using simple variables associated with the given locations, weather forecasts, and crop management practices. This approach would enable producers or agricultural managers to simulate production scenarios in advance without requiring in-depth knowledge of complex agronomic parameters, such as those governing functional models like CSM-CERES-Maize or the genetic calibration parameters needed for forage maize adaptations. A machine learning technique, such as the ones used in this study, can construct a predictive model for maize production based on these input variables. This approach eliminates the need to explicitly model the physical functional relationships between variables and can establish fundamentally complex and non-linear relationships between the input and output variables with suitable precision. The field data were observed during the years 2012–2013–2014 in Asturias and 2014–2015–2016 in Galicia. Out of the total 23 years of data, 6 years correspond to field observations and the rest are simulated, meaning that approximately 25% of the data are field based. This proportion of field data versus simulated data is expected to be maintained in both the training and test sets.
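One way to keep the proportion of field-based and simulated records similar in the training and test sets is a stratified split. The following Python sketch illustrates the idea with the standard library only; the record layout, the `source` key, the 20% test fraction and the record counts are hypothetical and are not taken from the study's code.

```python
import random


def stratified_split(records, key, test_fraction=0.2, seed=0):
    """Split records so that each value of `key` keeps its share in both sets."""
    rng = random.Random(seed)
    groups = {}
    for rec in records:
        groups.setdefault(rec[key], []).append(rec)
    train, test = [], []
    for members in groups.values():
        members = members[:]          # copy so the caller's list is untouched
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def field_share(rows):
    """Fraction of rows that come from field observations."""
    return sum(r["source"] == "field" for r in rows) / len(rows)


# Hypothetical dataset: 60 field-based and 170 simulated records.
records = [{"id": i, "source": "field"} for i in range(60)]
records += [{"id": 60 + i, "source": "simulated"} for i in range(170)]

train, test = stratified_split(records, key="source", test_fraction=0.2)
print(field_share(train), field_share(test))
```

Splitting within each group, rather than over the pooled data, is what guarantees that the field share does not drift between the two sets by chance.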

2.6.1 Programming language and packages used.

Python was used as the programming language, within the Jupyter Notebook environment. The Python packages used were numpy and pandas for basic programming; sklearn, lightgbm, xgboost, optuna and shap for the machine learning processes; and the auxiliary packages joblib, matplotlib, streamlit and plotly for graphics and file outputs. Data and code links are supplied in S1 File in the Supporting Information section.

2.6.2 Exploratory Data Analysis (EDA).

EDA is the first step in almost every machine learning study. It helps to understand the structure, quality, and characteristics of the dataset, ensuring that the machine learning models are built on reliable and meaningful data.

Summary statistics of the variables used in the machine learning model are shown in Table 6. These variables were obtained after calibrating and evaluating the CSM-CERES-Maize (DSSAT) model to simulate forage maize production in Asturias and Galicia. Adaptation of the model, together with the use of historical meteorological data (2000–2022) from the study sites, enabled simulation for the whole plant dry matter yield, the whole plant milk forage unit yield and the whole plant crude protein yield of the six cultivars during the 23-year period.

Table 6. Mean values of the variables included in the Asturian and Galician forage maize dataset and the associated standard deviations (SD) for three forage maize cultivars in three locations: Barcia, Villaviciosa and Grado and three years 2012, 2013 and 2014 in Asturias and three forage maize cultivars in four locations: Ribadeo, Ordes, Deza and Sarria and three years 2014, 2015 and 2016 in Galicia.

https://doi.org/10.1371/journal.pone.0326364.t006

The mean values obtained for these variables are within the values obtained by the evaluation network of forage maize varieties in Asturias and Galicia [30,31].

Good-quality long-term station data play a significant role in characterizing climatic conditions and in assessing their suitability for agricultural production [32]. Recent studies highlight the importance of gridded meteorological datasets as reliable alternatives to directly measured data, especially in areas with sparse weather station coverage [33]. Examples of available meteorological databases include NASA POWER (Prediction Of Worldwide Energy Resources), a project of the NASA Langley Research Center (https://power.larc.nasa.gov/) that provides solar and meteorological data sets from NASA research in support of renewable energy, building energy efficiency and agricultural needs, and CHIRPS (Climate Hazards Group InfraRed Precipitation with Station data), from the Climate Hazards Center, University of California, Santa Barbara, which provides precipitation data (https://www.chc.ucsb.edu/data/chirps) [34].

A clustermap analysis (Fig 2) was performed to examine the linear correlations between the input variables and the target production variables, together with the hierarchical grouping of the input variables. The color and numbers in the cells of the chart indicate the linear correlation between each input variable (rows) and each target variable (columns). The lines in the margin group the input variables that carry similar information with respect to the prediction target values. Radiation and Growing season showed the largest positive correlations, whereas Sowing date showed the largest negative correlation. The chart in Fig 2 shows roughly two large groups of input variables, which do not clearly correspond to physically meaningful groupings such as soil or meteorological variables. In addition, linear correlations between input variables are high for only a few variables, such as Radiation and Growing season duration, which may be encoding the same type of information.

Fig 2. Dendrogram chart, grouping similar input variables and linear correlation values with respect to production variables.

Input variables by rows and target variables by columns.

https://doi.org/10.1371/journal.pone.0326364.g002

Meteorological variables for the Asturias and Galicia locations are shown in Fig 3. For Asturias, the average values of the meteorological variables over 23 years during the forage maize crop cycle (sowing to harvest) for Tmax (22.9 °C, SD = 1.41), Tmin (15.3 °C, SD = 1.25), Total solar radiation (2296 MJ/m², SD = 320.7) and Total precipitation (208 mm, SD = 76.8 mm) were similar to those obtained in Galicia during the maize crop cycle for Tmax (24.3 °C, SD = 2.22) and Tmin (12.7 °C, SD = 1.97) and slightly better (higher values) in terms of Total solar radiation (2115 MJ/m², SD = 405.7) and Total precipitation (145 mm, SD = 57.0). The maximum and minimum temperature values obtained in the 23 years of historical meteorological data in Asturias and Galicia are within the ranges considered suitable for maize cultivation [35]. Maximum temperatures (Tmax) registered during the 23-year period in the seven experimental sites were always below 30 °C. Brief or prolonged episodes (more than 5 days) of high-temperature stress (>35 °C), especially at the flowering stage (longer anthesis-silk interval), lead to reduced yields [36–38].

Fig 3. Historical weather data (Tmax, Tmin, Solar radiation and Precipitation) for the 2000–2022 period in the Asturias and Galicia study locations.

https://doi.org/10.1371/journal.pone.0326364.g003

In Asturias (Fig 4), the highest value for the whole plant dry matter yield (22887 kg DM/ha) was obtained in 2018 with cultivar 400 and average daily maximum temperatures of 21.9–25.2 °C, average daily minimum temperatures of 14.5–17.4 °C, total solar radiation of 2148–2772 MJ/m² and total rainfall of 195–412 mm. In Galicia (Fig 4), the highest value for the whole plant dry matter yield (19637 kg DM/ha) was obtained in 2000 for cultivar 400, with average daily maximum temperatures of 22.3–27.0 °C, average daily minimum temperatures of 10.7–15.9 °C, total solar radiation of 1960–2987 MJ/m² and total rainfall of 87–236 mm.

Fig 4. Comparison of whole plant dry matter yield (kg DM/ha) simulated using 23 years of meteorological data in Asturias (Barcia, Grado, Villaviciosa) and Galicia locations for cultivars 200, 300 and 400.

https://doi.org/10.1371/journal.pone.0326364.g004

2.6.3 Regressor Machine Learning algorithms tested in the study.

2.6.3.1 Random Forest Regressor: Random Forest Regressor (RFR) [39] is a supervised ensemble learning algorithm used for regression and classification tasks. It builds multiple decision trees using random subsets of data and features and averages their predictions to improve accuracy and reduce overfitting. Each tree partitions the feature space to minimize a loss function, typically the mean squared error (MSE) in regression. RFR is robust to outliers and requires tuning of hyperparameters such as the number of trees and tree depth. In this study, hyperparameter optimization was performed using Optuna. Feature importance was also computed to evaluate the contribution of each predictor. These methods have already been used for maize model prediction [40].

One drawback of RFR is that hyperparameters such as the number of trees and the maximum depth of each tree must be tuned to achieve optimal performance. In this case, we performed a randomized search (RandomizedSearchCV) [41], which randomly samples values from a specified search distribution for each hyperparameter, evaluates the performance of the model with each random combination, and repeats the process for a defined number of iterations [42]. Finally, the feature importance is obtained to assess the relative amount of information extracted from each feature during model training.
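The randomized search procedure can be sketched in a few lines of plain Python. This is a simplified illustration of the sampling loop only: it omits the cross-validation that scikit-learn's RandomizedSearchCV performs, and the search space, the `toy_objective` function and its optimum are hypothetical stand-ins for a real validation error.

```python
import random


def random_search(objective, space, n_iter=50, seed=0):
    """Sample n_iter random hyperparameter combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter from its search distribution.
        params = {name: sampler(rng) for name, sampler in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search distributions for two random forest hyperparameters.
space = {
    "n_estimators": lambda rng: rng.randint(50, 500),
    "max_depth": lambda rng: rng.randint(2, 20),
}


def toy_objective(params):
    """Hypothetical validation error, lowest at 300 trees and depth 8."""
    return ((params["n_estimators"] - 300) / 100) ** 2 + (params["max_depth"] - 8) ** 2


best_params, best_score = random_search(toy_objective, space, n_iter=200)
print(best_params, best_score)
```

In a real workflow the objective would train the regressor with the sampled hyperparameters and return a cross-validated error, so each iteration is far more expensive than in this toy example.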

2.6.3.2 LightGBM: LightGBM (Light Gradient Boosting Machine) [43] is a gradient boosting framework based on decision trees that is optimized for speed and efficiency. It grows trees leaf-wise (as opposed to level-wise) with depth constraints, allowing better accuracy and faster training on large datasets. LightGBM handles categorical variables natively and supports parallel and GPU learning. It is particularly effective for high-dimensional data. In this study, hyperparameter tuning was performed using Optuna, and feature importance was computed to assess predictor contributions.

2.6.3.3 XGBoost: XGBoost (Extreme Gradient Boosting) [44] is a scalable, regularized boosting algorithm that builds additive regression trees using gradient descent to minimize a specified loss function. It includes built-in regularization to prevent overfitting and supports parallel computation, missing value handling, and customizable loss functions. XGBoost is known for its strong performance in structured data problems. In this work, we optimized hyperparameters using Optuna and evaluated global feature importance to interpret the model’s behavior.

2.6.3.4 AdaBoost: AdaBoost (Adaptive Boosting) [45] is an ensemble learning method that combines multiple weak learners, typically shallow decision trees, into a stronger predictive model. It builds models sequentially, with each new learner focusing on the samples most misclassified by previous ones. AdaBoost adjusts the weights of training examples to emphasize difficult cases. While simple and effective for certain problems, it is more sensitive to noisy data and outliers. In this study, we included AdaBoost for comparison, although it showed lower predictive performance compared to the other methods.

2.6.4 Model performance.

The following metrics were used to evaluate the predictive quality of the models: R² coefficient, RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), MAPE (Mean Absolute Percentage Error) and Accuracy (for regression). The metrics are defined in Table 7.
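As a hedged illustration of these metrics, the following Python sketch computes them with their standard definitions. The definition of Accuracy as 100 − MAPE is an assumption, since Table 7 is not reproduced here, and the example values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Standard regression metrics; y_true must not contain zeros (MAPE)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": mape,
        "Accuracy": 100.0 - mape,  # assumed definition, not confirmed by Table 7
    }


# Hypothetical observed and predicted values.
metrics = regression_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
print(metrics)
```

RMSE and MAE are in the units of the target variable (e.g., kg DM/ha), whereas R², MAPE and Accuracy are unitless, which makes the latter easier to compare across the three target variables.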

Table 7. Metrics for Regression evaluation.

https://doi.org/10.1371/journal.pone.0326364.t007

2.6.5 Optuna: Hyperparameter optimization.

Optuna [46] is an automatic hyperparameter optimization framework that uses intelligent sampling algorithms, such as Tree-structured Parzen Estimator (TPE), to efficiently explore the hyperparameter space. Unlike grid search, which exhaustively tests predefined combinations, Optuna dynamically selects promising regions based on past evaluation results, leading to faster convergence [47]. It also implements pruning strategies, such as MedianPruner, to stop unpromising trials early, reducing computation time. In this study, Optuna was used to optimize key hyperparameters for LightGBM, XGBoost, and Random Forest models, improving performance while minimizing overfitting.

2.6.6 Uncertainty quantification of model predictions: bootstrapping.

Bootstrapping [48,49] is a resampling technique used to quantify the uncertainty of predictions made by regression models, including those built with algorithms like LightGBM. The process involves generating multiple datasets by sampling with replacement from the original training data. For each resampled dataset, a separate LightGBM model is trained. When making predictions, the ensemble of these models provides a distribution of predicted values for each input. By analyzing this distribution—specifically, by computing percentiles such as the 2.5th and 97.5th percentiles—one can construct prediction intervals that reflect the uncertainty associated with the model’s predictions. This approach captures variability due to both the data and the model, offering a more comprehensive understanding of prediction confidence.
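The resampling procedure can be sketched as follows. To keep the example self-contained, a simple least-squares line is used as a stand-in for the LightGBM regressor; the data, the noise level and the number of resampled models are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares line; a stand-in for the LightGBM model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def percentile(sorted_vals, q):
    """Percentile q (0-100) of a sorted list, with linear interpolation."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def bootstrap_interval(xs, ys, x_new, n_models=500, seed=0):
    """2.5th, 50th and 97.5th percentiles of bootstrap predictions at x_new."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # skip degenerate resamples that cannot define a line
        slope, intercept = fit_line(bx, by)
        preds.append(slope * x_new + intercept)
    preds.sort()
    return percentile(preds, 2.5), percentile(preds, 50), percentile(preds, 97.5)


# Hypothetical data: yield grows linearly with growing season length, plus noise.
data_rng = random.Random(1)
xs = [100 + 2 * i for i in range(30)]
ys = [100 * x + data_rng.gauss(0, 500) for x in xs]

low, mid, high = bootstrap_interval(xs, ys, x_new=130)
print(low, mid, high)
```

The width of the interval reflects how much the fitted model changes when the training data are resampled, so it narrows as more data become available.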

2.6.7 SHAP values and variable permutation tests.

SHAP values are an interpretability technique that explains how each feature contributes to a machine learning model’s prediction. Based on the cooperative game theory developed by Lloyd Shapley, they assign an importance value to each feature, quantifying its contribution to the prediction.
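The underlying Shapley computation can be illustrated by brute-force enumeration for a model with a handful of features. This sketch replaces absent features with a single baseline value, which is a simplification of what the shap package does with background data; the toy model, the input and the baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in the subset take their real value, the rest the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the coalition without feature i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    """Hypothetical model with two linear terms and one interaction."""
    return 2 * z[0] + 3 * z[1] + z[0] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions add up to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is shared equally between the two features involved. Enumeration scales exponentially with the number of features, which is why tree-specific algorithms are used in practice.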

Permutation variable tests in machine learning involve shuffling the values of a feature (or the target labels) and measuring how much the model’s performance degrades. If permuting a feature leads to a significant drop in performance, that feature is considered important for the model predictions. This approach is model-agnostic and can be used both for interpreting feature importance and for testing its statistical significance.
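A permutation test of this kind can be sketched with scikit-learn's `permutation_importance`. The example assumes `scikit-learn` is installed and uses synthetic data in which the first column drives the target, the second has a weak effect, and the third is pure noise. It is not the setup used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): strong, weak and irrelevant predictors.
X = rng.normal(size=(400, 3))
y = 4.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column of the held-out data 10 times and record the mean drop in R².
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking, result.importances_mean.round(3))
```

Evaluating on held-out data rather than on the training set keeps the importance scores from rewarding features the model has merely memorized.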

3. Results and discussion

3.1 Model evaluation and selection

Several regression algorithms were tried by training individual models: LightGBM, XGBoost, AdaBoost and Random Forest Regressor (RFR).

Support Vector Regression (SVR) and GradientBoost models (the latter implemented in scikit-learn and quite similar to XGBoost) were also tested. They are not included here because SVR gave very poor results and GradientBoost gave results very similar to those of XGBoost. Table 8 presents the results of the “basic” models (without hyperparameter optimization) for each target variable.


Table 8. Metrics obtained for different regression algorithms and target variables.

https://doi.org/10.1371/journal.pone.0326364.t008

LightGBM is the best-performing model across the tests, except for the UFL/ha model, where RFR obtains slightly better R² and RMSE values. The hyperparameters of LightGBM and XGBoost are optimized in the following tests.

3.2 Residual analysis and variable importance during training

Plotting the actual values against the values predicted by the model (Fig 5) also allows a visual assessment of the expected quality of the predictions. A perfect prediction would place all the red points in Fig 5 on the diagonal line of zero difference.


Fig 5. Scatter plot of observed vs. predicted target Dry Matter yields (kg DM/ha).

Solid line represents the 1:1 relationship between observed and predicted yields. Model comparison metrics are R², Accuracy and Mean Absolute Error (MAE).

https://doi.org/10.1371/journal.pone.0326364.g005

In Fig 5, we can see that, in general, the predictions of the Dry Matter yield model align quite well with the 1:1 line (the zero residual line). The plots for the other models, kg CP/ha and UFL/ha (not shown here for brevity), show that their predictions are also quite accurate, with small errors in the predicted values, although there is some scatter. For all the target variables, the actual and predicted values were consistent, and the dispersion in the graph increased with the magnitude of the values to be predicted. However, as can be seen in Table 8, the MAPE values (in percent) are very similar, so similar relative errors in predictions are expected for all three models.
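The distinction between absolute and relative error made above can be illustrated with a short calculation. The sketch below uses hypothetical yields and standard definitions of R², RMSE, MAE and MAPE. Treating accuracy as 100 minus MAPE is an assumption made here for illustration and is not taken from this section.

```python
def regression_metrics(actual, predicted):
    """Return R², RMSE, MAE, MAPE (%) and an accuracy figure for paired values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    mape = 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / n
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": (ss_res / n) ** 0.5,
        "mae": mae,
        "mape": mape,
        "accuracy": 100 - mape,  # assumption: accuracy taken as 100 - MAPE
    }


# Hypothetical yields in kg DM/ha, for illustration only.
actual = [18000.0, 20000.0, 22000.0, 16000.0]
predicted = [17100.0, 21000.0, 22000.0, 16800.0]
m = regression_metrics(actual, predicted)
print({k: round(v, 4) for k, v in m.items()})
```

Because MAPE divides each error by the observed value, a target with larger magnitudes can show wider absolute scatter while keeping a similar relative error.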

In Fig 6, we observe that the variables with the most significant influence on the training of the models for all three target variables are consistently Growing season and Radiation. Solar radiation constitutes the primary energy source for crop production. Cloudy, rainy periods that limit the amount of solar radiation available to a maize crop during susceptible stages of development contribute to regional differences that can have significant effects on yield [50]. Crop productivity (kg DM/ha) is a function of the amount of Photosynthetically Active Radiation (PAR) absorbed or intercepted by the crop, which depends on the incident PAR, and of the radiation use efficiency (RUE, in g/MJ PAR) over the period in which the crop is grown, assuming that other factors (pests, diseases, water, nutrients, etc.) are not limiting [51].
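The radiation-based view of productivity described above is commonly written as a sum over the days of the growing season. The notation below is ours, introduced only for illustration: $f_d$ is the fraction of incident PAR intercepted by the canopy on day $d$ (between 0 and 1), $\mathrm{PAR}_d$ is the daily incident PAR in MJ m⁻² d⁻¹, and RUE is assumed constant over the season.

```latex
\mathrm{DM}\ (\mathrm{g\,m^{-2}})
  = \mathrm{RUE} \times \sum_{d = d_{\mathrm{sow}}}^{d_{\mathrm{harv}}} f_d \,\mathrm{PAR}_d ,
\qquad
1\ \mathrm{g\,m^{-2}} = 10\ \mathrm{kg\,ha^{-1}}
```

Written this way, a longer growing season adds terms to the sum and higher radiation enlarges each term, which is consistent with these two variables ranking first in Fig 6.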


Fig 6. Relative importance of the variables, calculated for LightGBM Optimized models of dry matter yield (kg DM/ha), net energy for lactation yield (UFL/ha) and crude protein yield (kg CP/ha).

https://doi.org/10.1371/journal.pone.0326364.g006

Beyond these, the influence of other variables varies depending on the specific target variable. However, their impact on the overall reduction of Root Mean Square Error (RMSE) is comparatively minor. This indicates that while secondary variables contribute to the model performance, their effect on improving prediction accuracy is limited compared to the primary factors.

Effects of temperature on biomass production and its components, radiation interception and RUE are twofold. First, and most important, temperature changes the duration of the period from sowing to harvest (Growing season). At high latitudes in Europe, Asia and North America, warming over recent decades has extended this period, with positive implications for crop growth and yield. Second, RUE is non-linearly related to temperature, an effect that is mediated by the effects of temperature on leaf gross photosynthesis, respiration and dry matter partitioning [52].

Maximum temperatures, which rank fifth or sixth among the influential variables, were always below 30 ºC in our study. Maize plants are sensitive to heat stress (>30 ºC), and grain yield declines strongly when temperatures above this threshold are maintained for a long time [53].

The growing cycle of the FAO maize cultivars (200, 300 and 400) depends on the thermal time, i.e., the sum of temperatures that the maize accumulates each day from the day of sowing to the day of harvest (maize silage) or until the day of physiological maturity (maize grain).

Each maize cultivar has its own thermal time; the number of days needed to reach this thermal time (related to the Growing Season) varies every year. Several studies have quantified the impact of climate change, in particular the increase in temperatures on maize cultivation in Spain [54,55]. The findings of these studies show that the increase in temperature causes a decrease in yield, even under non-water-limiting conditions, due to the shortened growing cycle. Thus, for maize forage and a given sowing date and site in a warmer than normal summer, the time to harvest will be shorter (fewer days), while in a summer with cooler than normal temperatures, the time to harvest will be longer (more days). Therefore, for the FAO cultivars (200, 300 and 400) and a given sowing date and site, a long growing cycle will be more advantageous in hot summers, and a short growing cycle will be more advantageous in cooler summers.
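The link between temperature and the number of days to harvest can be sketched with a simple degree-day accumulation. The base temperature of 8 °C, the thermal time target and the constant daily temperatures below are hypothetical values chosen for illustration, and the function name is ours.

```python
def days_to_thermal_time(daily_tmax, daily_tmin, target, t_base=8.0):
    """Return the number of days needed to accumulate `target` degree-days.

    Returns None if the target is not reached within the supplied series.
    """
    accumulated = 0.0
    for day, (tmax, tmin) in enumerate(zip(daily_tmax, daily_tmin), start=1):
        # Daily contribution: mean temperature above the base, never negative.
        accumulated += max(0.0, (tmax + tmin) / 2 - t_base)
        if accumulated >= target:
            return day
    return None


# Hypothetical constant-temperature seasons of 200 days each.
cool = days_to_thermal_time([22.0] * 200, [12.0] * 200, target=1200.0)
warm = days_to_thermal_time([26.0] * 200, [14.0] * 200, target=1200.0)
print(cool, warm)  # 134 100
```

With the same thermal time requirement, the warmer series reaches the target 34 days earlier, which is the shortening of the growing cycle discussed above.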

3.3 Hyperparameter optimization

The LightGBM method provides the best results, along with XGBoost. These two models, together with the Random Forest Regressor, have undergone hyperparameter optimization, yielding the results shown in Table 9 for the kg DM/ha, UFL/ha and kg CP/ha predictions.

The improvements achieved through hyperparameter optimization are modest in terms of both R² and RMSE, but LightGBM remains the best-performing model.

The two variables that strongly influence predictions across all cases are Growing season and Radiation. These analyses were performed on the specified models after their hyperparameters had been optimized using the Optuna package.

Hyperparameter tuning did not yield a significant improvement over the base models (Table 9). The model for kg CP/ha has a lower absolute MAE and is expected to make better predictions for this target variable than those for UFL/ha and kg DM/ha, even though the R² score for kg CP/ha is slightly lower. However, the three models are almost equivalent considering that the kg CP/ha variable is one order of magnitude lower than the other two. Nonetheless, all three models explained more than 86% of the variability in the data, with a relative error of about 6-7% (with respect to the mean of the target variable), which we believe is a remarkable result.


Table 9. Metrics obtained for different regression algorithms after hyperparameter optimization.

https://doi.org/10.1371/journal.pone.0326364.t009

3.4 Variable permutation tests

To assess the influence of the predictor variables on the model predictions, variable permutation tests were performed (Fig 7).


Fig 7. Variable permutation importance test results for kg DM/ha, UFL/ha and kg CP/ha predictions.

https://doi.org/10.1371/journal.pone.0326364.g007

The results of the permutation tests confirm that the most influential variables for the predictions are Growing season and Radiation, except for kg CP/ha, because crude protein is strongly influenced by the Harvest date.

3.5 SHAP values

The SHAP values computed for the trained LightGBM models explain how each feature contributes to the models’ predictions (Fig 8). Based on the cooperative game theory developed by Lloyd Shapley, they assign an importance value to each feature, quantifying its contribution to the prediction.


Fig 8. SHAP values showing the impact of every variable on predictions for kg DM/ha, UFL/ha and kg CP/ha.

https://doi.org/10.1371/journal.pone.0326364.g008

For the trained LightGBM model, we can infer the following from the plot.

Growing season has a significant impact: high values (in red) have positive SHAP values, indicating that a longer growing season contributes positively to yield. High Radiation and Harvest date values (in red) also contribute positively to the prediction. Interestingly, early Sowing dates (in blue) show a positive impact, whereas later dates (in red) tend to have a negative effect.

For Tmax (°C) and Tmin (°C), higher temperatures appear to have a positive impact on the prediction. Cultivar and Site categorical features show centered impact distributions without a clear directional trend.

The vertical spread of points for the same feature suggests possible interactions with other variables. For example, the vertical dispersion of SHAP values around zero for the Radiation variable, combined with mixed red and blue colors, indicates that the same SHAP value (i.e., the same impact on the prediction) can occur for different feature values (as reflected by the colors). This suggests that the effect of that variable depends on the values of other variables in the model. A similar interaction appears in Growing Season, particularly in the region where 1000 < SHAP < 2000, possibly involving other variables as well.

3.6 Bootstrapping results

In Fig 9, the indices of the first 50 samples from the test set are shown on the x-axis (only the first 50 samples are plotted to improve the visibility of the lines and the confidence interval). The y-axis displays the values of dry matter yield per hectare. The blue line represents the actual values corresponding to these samples from the test set. The orange line corresponds to the mean yield predicted by the LightGBM model during the bootstrapping process with 100 trained models. The light blue shaded band represents the 95% confidence interval (2.5% and 97.5% percentiles) associated with those mean predicted values of the orange line.


Fig 9. Bootstrap technique results for LightGBM predictions from the test set, with 95% confidence interval band.

https://doi.org/10.1371/journal.pone.0326364.g009

In this case, the orange line closely follows the blue one, which suggests that the model captures the general trend of the actual test data well. Furthermore, the confidence interval band is narrow, indicating that the model’s confidence is high and that large variability in the predictions is not expected. The proportion of predictions for the total test set samples that fall within the 95% confidence band is approximately 55%. This relatively low percentage may suggest that there is still some bias in the data that the model has not been able to capture, or it could correspond to noise in the original data. This also corresponds to the value of the variance not modeled by the LightGBM training, as indicated by the R² metric of 0.86.
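The coverage figure quoted above is the fraction of test observations that fall inside their interval. A minimal sketch of that calculation, with a hypothetical function name and made-up numbers, is shown below. One possible reading, which is ours and hedged accordingly, is that percentiles taken across bootstrap-refitted models mainly describe the spread of the fitted predictions rather than the noise in individual observations, so coverage below the nominal 95% would not be surprising.

```python
def interval_coverage(actual, lower, upper):
    """Fraction of actual values lying inside [lower, upper], bounds inclusive."""
    inside = sum(lo <= a <= hi for a, lo, hi in zip(actual, lower, upper))
    return inside / len(actual)


# Made-up values for illustration only.
actual = [10.0, 12.0, 15.0, 9.0]
lower = [9.0, 12.5, 14.0, 9.5]
upper = [11.0, 13.5, 16.0, 10.5]
coverage = interval_coverage(actual, lower, upper)
print(coverage)  # 0.5
```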

3.7 Prediction web application

To make the predictive capabilities of the developed models available to the public, a web application has been created in which users can obtain predictions of kg DM/ha, UFL/ha, and kg CP/ha after entering some input data.

Fig 10 shows the main interface of this app.


Fig 10. Prediction web app.

(The link to this app is supplied in S1 File, in the Supporting information section.)

https://doi.org/10.1371/journal.pone.0326364.g010

3.8 Influence of Growing season and Radiation on Dry matter yield

Growing season and Radiation were the variables that most influenced both the global reduction of the Root Mean Square Error (RMSE) and the predictions (as obtained in the variable permutation tests performed). We also calculated a set of predictions (30 for each Site studied) under conditions considered favorable for these two variables (i.e., Growing season and Radiation above their respective mean values), unfavorable (i.e., Growing season and Radiation below their means), and intermediate (when neither the favorable nor the unfavorable combination is met). The remaining predictor variables were assigned their overall mean values, and for the categorical variable “Site”, 30 samples were taken for each of the seven locations with different “Cultivar” values. The result can be seen in Fig 11.
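The three scenario classes described above can be expressed as a simple rule on the two variables. The sketch below uses a hypothetical function name, hypothetical mean values and hypothetical units. It only labels the scenarios and does not reproduce the model predictions.

```python
def classify_scenario(growing_season, radiation, gs_mean, rad_mean):
    """Label a (Growing season, Radiation) pair relative to the two means."""
    if growing_season > gs_mean and radiation > rad_mean:
        return "favorable"
    if growing_season < gs_mean and radiation < rad_mean:
        return "unfavorable"
    # Everything else: the two variables disagree, or one sits exactly at its mean.
    return "intermediate"


# Hypothetical means and sample points, for illustration only.
gs_mean, rad_mean = 150.0, 2500.0
samples = [(170.0, 2800.0), (130.0, 2200.0), (170.0, 2200.0)]
labels = [classify_scenario(gs, rad, gs_mean, rad_mean) for gs, rad in samples]
print(labels)  # ['favorable', 'unfavorable', 'intermediate']
```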


Fig 11. Effect of Growing Season (GS) and Radiation (Rad) on predicted dry matter yield (kg DM/ha).

Horizontal axes cover the full range of GS and Rad values used to train the model.

https://doi.org/10.1371/journal.pone.0326364.g011

To better appreciate the result, we have developed a web app (the link is supplied in S1 File, in the Supporting information section) where the user can rotate and interact with a 3D version of Fig 11, show or hide the interpolating polynomial surface, change its degree, and check the numerical data.

The model shows a robust response, yielding higher outputs under favorable conditions for these two variables, and the opposite behavior under unfavorable conditions. This indicates stability in the predictions, with no signs of excessive variability or erratic outcomes.

4. Conclusions

The machine learning models obtained have potential as a tool to evaluate the production and quality of forage maize as affected by the geographical location of the field, cultivar type, sowing and harvesting dates, and assumptions regarding weather variables during the growing season.

This paper presents a procedure that allows the creation of these predictive models of yield and nutritional quality, which could be applied to other types of crops and locations.

To our knowledge, there are currently no other models for forage maize available for this geographic area. It is important to note that, for cultivars other than those used in this study, it will be necessary to train and test a new machine learning model based on the new data available.

These three prediction models (one LightGBM model for each target variable) are implemented in a public web app. The link to that tool is supplied in S1 File, in the Supporting information section.

Supporting information

S1 File.

https://doi.org/10.1371/journal.pone.0326364.s001

(DOCX)

Acknowledgments

The authors are grateful for a grant provided by the “OECD Co-operative Research Programme” to support a research visit by the second author at the University of Florida, Gainesville, Florida, USA in 2022. We also gratefully acknowledge the reviewers for their insightful comments and constructive suggestions, which have greatly improved the quality of this manuscript. Finally, we acknowledge with gratitude the assistance of Ms. Christine Francis in the preparation of this paper in English.

References

1. MAPA. Encuesta sobre superficies y rendimiento de cultivos (ESYRCE), resultados provisionales nacionales y autonómicos. Madrid: Ministerio de Agricultura, Pesca y Alimentación (MAPA); 2023. 47 p. Available from: https://www.mapa.gob.es/es/estadistica/temas/estadisticas-agrarias/agricultura/esyrce/
2. Carballal A, González C, Piñeiro I, Vega S, Martínez A. Evaluación de variedades de maíz (1996-2022). Actualización año 2022. Oviedo, Asturias: Consejería de Ciencia, Innovación y Universidad, Servicio Regional de Investigación y Desarrollo Agroalimentario (SERIDA); 2022.
3. Bande MJ. Evaluación de variedades de maíz forrajero en Galicia (1999-2022). Actualización 2023. Vaca Pinta. 2023;37:124–32.
4. Lizaso JI, Ruiz-Ramos M, Rodríguez L, Gabaldon-Leal C, Oliveira JA, Lorite IJ, et al. Modeling the response of maize phenology, kernel set, and yield components to heat stress and heat shock with CSM-IXIM. Field Crops Res. 2017;214:239–54.
5. Ritchie JT, Singh U, Godwin DC, Bowen WT. Cereal growth, development and yield. In: Tsuji GY, Hoogenboom G, Thornton PK, editors. Understanding options for agricultural production. Dordrecht, The Netherlands: Kluwer Academic Publishers; 1998. p. 79–98.
6. Jones JW, Hoogenboom G, Porter CH, Boote KJ, Batchelor WD, Hunt LA, et al. The DSSAT cropping system model. Eur J Agron. 2003;18:235–65.
7. Kumar K, Parihar CM, Nayak HS, Sena DR, Godara S, Dhakar R, et al. Modeling maize growth and nitrogen dynamics using CERES-Maize (DSSAT) under diverse nitrogen management options in a conservation agriculture-based maize-wheat system. Sci Rep. 2024;14(1):11743. pmid:38778072
8. Keating BA, Carberry PS, Hammer GL, Probert ME, Robertson MJ, Holzworth D, et al. An overview of APSIM, a model designed for farming systems simulation. Eur J Agron. 2003;18(3–4):267–88.
9. Lizaso JI, Boote KJ, Jones JW, Porter CH, Echarte L, Westgate ME, et al. CSM‐IXIM: a new maize simulation model for DSSAT version 4.5. Agron J. 2011;103(3):766–79.
10. Addiscott TM, Wagenet RJ. Concepts of solute leaching in soils: a review of modelling approaches. J Soil Sci. 1985;36(3):411–24.
11. Oliveira JA, Boote KJ, Oliveira FAA, Hoogenboom G, Carballal A, Martínez-Fernández A. Adaptación del modelo CSM-CERES-Maize (DSSAT) para simular la producción de maíz forrajero: variación interanual en Asturias. Vaca Pinta. 2023;40:140–55.
12. Hoogenboom G, Porter CH, Boote KJ, Shelia V, Wilkens PW, Singh U, et al. The DSSAT crop modeling ecosystem. In: Boote KJ, editor. Advances in crop modeling for a sustainable agriculture. Cambridge, UK: Burleigh Dodds Science Publishing; 2019. p. 173–216.
13. Sa’diyah H, Hadi AF. AMMI model for yield estimation in multi-environment trials: a comparison to BLUP. Agric Agric Sci Procedia. 2016;9:163–9.
14. Singh SB, Kumar S, Kumar R, Kumar P, Yathish KR, Jat BS, et al. Stability analysis of promising winter maize (Zea mays L.) hybrids tested across Bihar using GGE biplot and AMMI model approach. IJGPB. 2024;84(01):73–80.
15. Oliveira JA, Bande MJ. Productividad, estabilidad y adaptabilidad de cultivares de maíz forrajero en Galicia. Vaca Pinta. 2024;44:90–6.
16. USDA-Soil Taxonomy. A basic system of soil classification for making and interpreting soil surveys. Agriculture Handbook No. 436. 2nd ed. Washington (DC): Soil Survey Staff; 1999.
17. Kottek M, Grieser J, Beck C, Rudolf B, Rubel F. World map of the Köppen-Geiger climate classification updated. Meteorol Z. 2006;15(3):259–63.
18. Arnfield AJ. Koppen climate classification: Encyclopedia Britannica; 2021. Available from: https://www.britannica.com/science/Koppen-climate-classification
19. Agencia Estatal de Meteorología. AEMET: Agencia Estatal de Meteorología; n.d. [cited 2025 May 19]. Available from: https://www.aemet.es/es/portada
20. MeteoGalicia. MeteoGalicia: predición do tempo de Galicia; n.d. [cited 2025 May 19]. Available from: https://www.meteogalicia.gal/
21. Orkwiszewski JA, Poethig RS. Phase identity of the maize leaf is determined after leaf initiation. Proc Natl Acad Sci U S A. 2000;97(19):10631–6. pmid:10973480
22. Wiersma DW, Carter P, Albrecht KA, Coors JG. Kernel milkline stage and corn forage yield, quality, and dry matter content. J Prod Agric. 1993;6(23/24):94–9.
23. Havilah EJ, Kaiser AG, Nicol H. Use of a kernel milk line score to determine stage of maturity in maize crops harvested for silage. Aust J Exp Agric. 1995;35(6):739–43.
24. Braga RP, Cardoso MJ, Coelho JP. Crop model based decision support for maize (Zea mays L.) silage production in Portugal. Eur J Agron. 2008;28(3):224–33.
25. Oliveira JA, Boote KJ, Oliveira FAA, Bande MJ, Carballal A, Martínez-Fernández A, et al. Using DSSAT to examine the interannual weather variation in forage maize productivity in Asturias and Galicia (Spain). Proceedings of the 18th Congress of the European Society for Agronomy; Rennes, France; Aug 26–30; 2024. p. 557–9.
26. INRA. Alimentation des ruminants. France: Editions Quae. 2018. 728 p.
27. Cox WJ, Cherney JH, Cherney DJR, Pardee WD. Forage quality and harvest index of corn hybrids under different growing conditions. Agron J. 1994;86(2):277–82.
28. Angell AR, Mata L, de Nys R, Paul NA. The protein content of seaweeds: a universal nitrogen-to-protein conversion factor of five. J Appl Phycol. 2016;28:511–24.
29. Allard RW, Bradshaw AD. Implications of genotype‐environmental interactions in applied plant breeding. Crop Sci. 1964;4(5):503–8.
30. Carballal A, González C, Menéndez M, Piñeiro I, Corral ME, Martínez-Fernández A. Maíz forrajero en Asturias. Evaluación de variedades (1996-2023). Actualización 2023. Vaca Pinta. 2024;44:112–31.
31. Bande MJ. Evaluación de variedades de maíz forrajero en Galicia (1999-2023). Actualización 2024. Vaca Pinta. 2024;44:100–9.
32. Sinclair S, Pegram G. Combining radar and rain gauge rainfall estimates using conditional merging. Atmos Sci Lett. 2005;6(1):19–22.
33. Araghi A, Adamowski JF. Assessment of 30 gridded precipitation datasets over different climates on a country scale. Earth Sci Inform. 2024;17(2):1301–13.
34. Funk C, Peterson P, Landsfeld M, Pedreros D, Verdin J, Shukla S, et al. The climate hazards infrared precipitation with stations--a new environmental record for monitoring extremes. Sci Data. 2015;2:150066. pmid:26646728
35. Sánchez B, Rasmussen A, Porter JR. Temperatures and the growth and development of maize and rice: a review. Glob Chang Biol. 2014;20(2):408–17. pmid:24038930
36. Elmore RW. Stress, anthesis-silk interval and corn yield potential. Iowa Integrated Crop Management News; 2012 [cited 20254 Jul 12]. Available from: https://dr.lib.iastate.edu/server/api/core/bitstreams/5687f70f-7c08-44e8-a96f-fdefea91e37c/content
37. Gourdji SM, Sibley AM, Lobell DB. Global crop exposure to critical high temperatures in the reproductive period: historical trends and future projections. Environ Res Lett. 2013;8:024041.
38. Shim D, Lee K-J, Lee B-W. Response of phenology- and yield-related traits of maize to elevated temperature in a temperate region. Crop J. 2017;5(4):305–16.
39. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
40. Gawdiya S, Kumar D, Ahmed B, Sharma RK, Das P, Choudhary M, et al. Field scale wheat yield prediction using ensemble machine learning techniques. Smart Agric Technol. 2024;9:100543.
41. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res. 2012;13:281–305.
42. Buitinck L, Louppe G, Blondel M, Pedregosa F, Mueller A, Grisel O, et al. API design for machine learning software: experiences from the scikit-learn project. ECML PKDD Workshop: Languages for Data Mining and Machine Learning; 2013. p. 108–22.
43. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3146–54.
44. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16). ACM; 2016. p. 785–94.
45. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997;55(1):119–39.
46. Optuna Developers. Optuna: a hyperparameter optimization framework; n.d. [cited 2025 May]. Available from: https://optuna.org/
47. Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM; 2019. p. 2623–31. doi: https://doi.org/10.1145/3292500.3330701
48. Efron B. Bootstrap methods: another look at the jackknife. Ann Statist. 1979;7(1):1–26.
49. Palmer G, Du S, Politowicz A, Emory JP, Yang X, Gautam A, et al. Calibration after bootstrap for accurate uncertainty quantification in regression models. npj Comput Mater. 2022;8(1):115.
50. Yang Y, Guo X, Liu H, Liu G, Liu W, Ming B, et al. The effect of solar radiation change on the maize yield gap from the perspectives of dry matter accumulation and distribution. J Integr Agric. 2021;20(2):482–93.
51. Monteith JL. Climate and the efficiency of crop production in Britain. Philos Trans R Soc Lond B. 1977;281:277–94.
52. Villalobos FJ, Fereres E, editors. Principles of agronomy for sustainable agriculture. Springer International Publishing; 2016. doi: https://doi.org/10.1007/978-3-319-46116-8
53. Schauberger B, Archontoulis S, Arneth A, Balkovic J, Ciais P, Deryng D, et al. Consistent negative response of US crops to high temperatures in observations and crop models. Nat Commun. 2017;8:13931. pmid:28102202
54. Mínguez MI, Ruiz-Ramos M, Díaz-Ambrona CH, Quemada M, Sau F. First-order impacts on winter and summer crops assessed with various high-resolution climate models in the Iberian Peninsula. Clim Change. 2007;81(S1):343–55.
55. Lizaso JI, Ruiz-Ramos M, Rodríguez L, Gabaldon-Leal C, Oliveira JA, Lorite IJ, et al. Impact of high temperatures in maize: phenology and yield components. Field Crops Res. 2018;216:129–40.

[ref1] 1. MAPA. Encuesta sobre superficies y rendimiento de cultivos (ESYRCE), resultados provisionales nacionales y autonómicos. Madrid: Ministerio de Agricultura, Pesca y Alimentación (MAPA); 2023. 47 p. Available from: https://www.mapa.gob.es/es/estadistica/temas/estadisticas-agrarias/agricultura/esyrce/

[ref2] 2. Carballal A, González C, Piñeiro I, Vega S, Martínez A. Evaluación de variedades de maíz (1996-2022). Actualización año 2022. Oviedo, Asturias: Consejería de Ciencia, Innovación y Universidad, Servicio Regional de Investigación y Desarrollo Agroalimentario (SERIDA); 2022.

[ref3] 3. Bande MJ. Evaluación de variedades de maíz forrajero en Galicia (1999-2022). Actualización 2023. Vaca Pinta. 2023;37:124–32.
View Article
Google Scholar

[4] View Article

[5] Google Scholar

[ref4] 4. Lizaso JI, Ruiz-Ramos M, Rodríguez L, Gabaldon-Leal C, Oliveira JA, Lorite IJ, et al. Modeling the response of maize phenology, kernel set, and yield components to heat stress and heat shock with CSM-IXIM. Field Crops Res. 2017;214:239–54.
View Article
Google Scholar

[7] View Article

[8] Google Scholar

[ref5] 5. Ritchie JT, Singh U, Godwin DC, Bowen WT. Cereal growth, development and yield. In: Tsuji GY, Hoogenboom G, Thornton PK, editors. Understanding options for agricultural production. Dordrecht, The Netherlands: Kluwer Academic Publishers; 1998. p. 79–98.

[ref6] 6. Jones JW, Hoogenboom G, Porter CH, Boote KJ, Batchelor WD, Hunt LA, et al. The DSSAT cropping system model. Eur J Agron. 2003;18:235–65.
View Article
Google Scholar

[11] View Article

[12] Google Scholar

[ref7] 7. Kumar K, Parihar CM, Nayak HS, Sena DR, Godara S, Dhakar R, et al. Modeling maize growth and nitrogen dynamics using CERES-Maize (DSSAT) under diverse nitrogen management options in a conservation agriculture-based maize-wheat system. Sci Rep. 2024;14(1):11743. pmid:38778072
View Article
PubMed/NCBI
Google Scholar

[14] View Article

[15] PubMed/NCBI

[16] Google Scholar

[ref8] 8. Keating BA, Carberry PS, Hammer GL, Probert ME, Robertson MJ, Holzworth D, et al. An overview of APSIM, a model designed for farming systems simulation. Eur J Agron. 2003;18(3–4):267–88.
View Article
Google Scholar

[18] View Article

[19] Google Scholar

[ref9] 9. Lizaso JI, Boote KJ, Jones JW, Porter CH, Echarte L, Westgate ME, et al. CSM‐IXIM: a new maize simulation model for DSSAT version 4.5. Agron J. 2011;103(3):766–79.
View Article
Google Scholar

[21] View Article

[22] Google Scholar

[ref10] 10. Addiscott TM, Wagenet RJ. Concepts of solute leaching in soils: a review of modelling approaches. J Soil Sci. 1985;36(3):411–24.
View Article
Google Scholar

[24] View Article

[25] Google Scholar

[ref11] 11. Oliveira JA, Boote KJ, Oliveira FAA, Hoogenboom G, Carballal A, Martínez-Fernández A. Adaptación del modelo CSM-CERES-Maize (DSSAT) para simular la producción de maíz forrajero: variación interanual en Asturias. Vaca Pinta. 2023;40:140–55.
View Article
Google Scholar

[27] View Article

[28] Google Scholar

[ref12] 12. Hoogenboom G, Porter CH, Boote KJ, Shelia V, Wilkens PW, Singh U, et al. The DSSAT crop modeling ecosystem. In: Boote KJ, editor. Advances in crop modeling for a sustainable agriculture. Cambridge, UK: Burleigh Dodds Science Publishing; 2019. p. 173–216.

[ref13] 13. Sa’diyah H, Hadi AF. AMMI model for yield estimation in multi-environment trials: a comparison to BLUP. Agric Agric Sci Procedia. 2016;9:163–9.
View Article
Google Scholar

[31] View Article

[32] Google Scholar

[ref14] 14. Singh SB, Kumar S, Kumar R, Kumar P, Yathish KR, Jat BS, et al. Stability analysis of promising winter maize (Zea mays L.) hybrids tested across Bihar using GGE biplot and AMMI model approach. IJGPB. 2024;84(01):73–80.
View Article
Google Scholar

[34] View Article

[35] Google Scholar

[ref15] 15. Oliveira JA, Bande MJ. Productividad, estabilidad y adaptabilidad de cultivares de maíz forrajero en Galicia. Vaca Pinta. 2024;44:90–6.
View Article
Google Scholar

[37] View Article

[38] Google Scholar

[ref16] 16. USDA-Soil Taxonomy. A basic system of soil classification for making and interpreting soil surveys. Agriculture Handbook No. 436. 2nd ed. Washington (DC): Soil Survey Staff; 1999.

[ref17] 17. Kotteck M, Grieser J, Beck C, Rudolf B, Rubel F. Meteorologische Zeitschrift. Meteorol Z. 2006;15(3):259–63.
View Article
Google Scholar

[41] View Article

[42] Google Scholar

[ref18] 18. Arnfield AJ. Koppen climate classification: Encyclopedia Britannica; 2021. Available from: https://www.britannica.com/science/Koppen-climate-classification

[ref19] 19. Agencia Estatal de Meteorología. AEMET: Agencia Estatal de Meteorología; n.d. [cited 2025 May 19]. Available from: https://www.aemet.es/es/portada

[ref20] 20. MeteoGalicia. MeteoGalicia: predición do tempo de Galicia; n.d. [cited 2025 May 19]. Available from: https://www.meteogalicia.gal/

[ref21] 21. Orkwiszewski JA, Poethig RS. Phase identity of the maize leaf is determined after leaf initiation. Proc Natl Acad Sci U S A. 2000;97(19):10631–6. pmid:10973480

[ref22] 22. Wiersma DW, Carter P, Albrecht KA, Coors JG. Kernel milkline stage and corn forage yield, quality, and dry matter content. J Prod Agric. 1993;6(23/24):94–9.

[ref23] 23. Havilah EJ, Kaiser AG, Nicol H. Use of a kernel milk line score to determine stage of maturity in maize crops harvested for silage. Aust J Exp Agric. 1995;35(6):739–43.

[ref24] 24. Braga RP, Cardoso MJ, Coelho JP. Crop model based decision support for maize (Zea mays L.) silage production in Portugal. Eur J Agron. 2008;28(3):224–33.

[ref25] 25. Oliveira JA, Boote KJ, Oliveira FAA, Bande MJ, Carballal A, Martínez-Fernández A, et al. Using DSSAT to examine the interannual weather variation in forage maize productivity in Asturias and Galicia (Spain). Proceedings of the 18th Congress of the European Society for Agronomy; Rennes, France; Aug 26–30; 2024. p. 557–9.

[ref26] 26. INRA. Alimentation des ruminants. France: Editions Quae. 2018. 728 p.

[ref27] 27. Cox WJ, Cherney JH, Cherney DJR, Pardee WD. Forage quality and harvest index of corn hybrids under different growing conditions. Agron J. 1994;86(2):277–82.

[ref28] 28. Angell AR, Mata L, de Nys R, Paul NA. The protein content of seaweeds: a universal nitrogen-to-protein conversion factor of five. J Appl Phycol. 2016;28:511–24.

[ref29] 29. Allard RW, Bradshaw AD. Implications of genotype‐environmental interactions in applied plant breeding. Crop Sci. 1964;4(5):503–8.

[ref30] 30. Carballal A, González C, Menéndez M, Piñeiro I, Corral ME, Martínez-Fernández A. Maíz forrajero en Asturias. Evaluación de variedades (1996-2023). Actualización 2023. Vaca Pinta. 2024;44:112–31.

[ref31] 31. Bande MJ. Evaluación de variedades de maíz forrajero en Galicia (1999-2023). Actualización 2024. Vaca Pinta. 2024;44:100–9.

[ref32] 32. Sinclair S, Pegram G. Combining radar and rain gauge rainfall estimates using conditional merging. Atmos Sci Lett. 2005;6(1):19–22.

[ref33] 33. Araghi A, Adamowski JF. Assessment of 30 gridded precipitation datasets over different climates on a country scale. Earth Sci Inform. 2024;17(2):1301–13.

[ref34] 34. Funk C, Peterson P, Landsfeld M, Pedreros D, Verdin J, Shukla S, et al. The climate hazards infrared precipitation with stations--a new environmental record for monitoring extremes. Sci Data. 2015;2:150066. pmid:26646728

[ref35] 35. Sánchez B, Rasmussen A, Porter JR. Temperatures and the growth and development of maize and rice: a review. Glob Chang Biol. 2014;20(2):408–17. pmid:24038930

[ref36] 36. Elmore RW. Stress, anthesis-silk interval and corn yield potential. Iowa Integrated Crop Management News; 2012 [cited 20254 Jul 12]. Available from: https://dr.lib.iastate.edu/server/api/core/bitstreams/5687f70f-7c08-44e8-a96f-fdefea91e37c/content

[ref37] 37. Gourdji SM, Sibley AM, Lobell DB. Global crop exposure to critical high temperatures in the reproductive period: historical trends and future projections. Environ Res Lett. 2013;8:024041.

[ref38] 38. Shim D, Lee K-J, Lee B-W. Response of phenology- and yield-related traits of maize to elevated temperature in a temperate region. Crop J. 2017;5(4):305–16.

[ref39] 39. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

[ref40] 40. Gawdiya S, Kumar D, Ahmed B, Sharma RK, Das P, Choudhary M, et al. Field scale wheat yield prediction using ensemble machine learning techniques. Smart Agric Technol. 2024;9:100543.

[ref41] 41. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res. 2012;13:281–305. Available from: http://scikit-learn.sourceforge.net

[ref42] 42. Buitinck L, Louppe G, Blondel M, Pedregosa F, Mueller A, Grisel O, et al. API design for machine learning software: experiences from the scikit-learn project. ECML PKDD Workshop: Languages for Data Mining and Machine Learning; 2013. p. 108–22.

[ref43] 43. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3146–54.

[ref44] 44. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM; 2016. p. 785–94.

[ref45] 45. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997;55(1):119–39.

[ref46] 46. Optuna Developers. Optuna: a hyperparameter optimization framework; n.d. [cited 2025 May]. Available from: https://optuna.org/

[ref47] 47. Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM; 2019. p. 2623–31. doi: https://doi.org/10.1145/3292500.3330701

[ref48] 48. Efron B. Bootstrap methods: another look at the jackknife. Ann Statist. 1979;7(1):1–26.

[ref49] 49. Palmer G, Du S, Politowicz A, Emory JP, Yang X, Gautam A, et al. Calibration after bootstrap for accurate uncertainty quantification in regression models. npj Comput Mater. 2022;8(1):115.

[ref50] 50. Yang Y, Guo X, Liu H, Liu G, Liu W, Ming B, et al. The effect of solar radiation change on the maize yield gap from the perspectives of dry matter accumulation and distribution. J Integr Agric. 2021;20(2):482–93.

[ref51] 51. Monteith JL. Climate and the efficiency of crop production in Britain. Philos Trans R Soc Lond B. 1977;281:277–94.

[ref52] 52. Villalobos FJ, Fereres E, editors. Principles of agronomy for sustainable agriculture. Springer International Publishing; 2016. doi: https://doi.org/10.1007/978-3-319-46116-8

[ref53] 53. Schauberger B, Archontoulis S, Arneth A, Balkovic J, Ciais P, Deryng D, et al. Consistent negative response of US crops to high temperatures in observations and crop models. Nat Commun. 2017;8:13931. pmid:28102202

[ref54] 54. Mínguez MI, Ruiz-Ramos M, Díaz-Ambrona CH, Quemada M, Sau F. First-order impacts on winter and summer crops assessed with various high-resolution climate models in the Iberian Peninsula. Clim Change. 2007;81(S1):343–55.

[ref55] 55. Lizaso JI, Ruiz-Ramos M, Rodríguez L, Gabaldon-Leal C, Oliveira JA, Lorite IJ, et al. Impact of high temperatures in maize: phenology and yield components. Field Crops Res. 2018;216:129–40.
