Options for calibrating CERES-maize genotype specific parameters under data-scarce environments

Most crop simulation models require the use of Genotype Specific Parameters (GSPs) which provide the Genotype component of G×E×M interactions. Estimation of GSPs is the most difficult aspect of most modelling exercises because it requires expensive and time-consuming field experiments. GSPs could also be estimated using multi-year and multi locational data from breeder evaluation experiments. This research was set up with the following objectives: i) to determine GSPs of 10 newly released maize varieties for the Nigerian Savannas using data from both calibration experiments and by using existing data from breeder varietal evaluation trials; ii) to compare the accuracy of the GSPs generated using experimental and breeder data; and iii) to evaluate CERES-Maize model to simulate grain and tissue nitrogen contents. For experimental evaluation, 8 different experiments were conducted during the rainy and dry seasons of 2016 across the Nigerian Savanna. Breeder evaluation data were also collected for 2 years and 7 locations. The calibrated GSPs were evaluated using data from a 4-year experiment conducted under varying nitrogen rates (0, 60 and 120kg N ha-1). For the model calibration using experimental data, calculated model efficiency (EF) values ranged between 0.88–0.94 and coefficient of determination (d-index) between 0.93–0.98. Calibration of time-series data produced nRMSE below 7% while all prediction deviations were below 10% of the mean. For breeder experiments, EF (0.58–0.88) and d-index (0.56–0.86) ranges were lower. Prediction deviations were below 17% of the means for all measured variables. Model evaluation using both experimental and breeder trials resulted in good agreement (low RMSE, high EF and d-index values) between observed and simulated grain yields, and tissue and grain nitrogen contents. It is concluded that higher calibration accuracy of CERES-Maize model is achieved from detailed experiments. If unavailable, data from breeder experimental trials collected from many locations and planting dates can be used with lower but acceptable accuracy.

Most crop simulation models require the use of Genotype Specific Parameters (GSPs) which provide the Genotype component of G×E×M interactions. Estimation of GSPs is the most difficult aspect of most modelling exercises because it requires expensive and time-consuming field experiments. GSPs could also be estimated using multi-year and multi locational data from breeder evaluation experiments. This research was set up with the following objectives: i) to determine GSPs of 10 newly released maize varieties for the Nigerian Savannas using data from both calibration experiments and by using existing data from breeder varietal evaluation trials; ii) to compare the accuracy of the GSPs generated using experimental and breeder data; and iii) to evaluate CERES-Maize model to simulate grain and tissue nitrogen contents. For experimental evaluation, 8 different experiments were conducted during the rainy and dry seasons of 2016 across the Nigerian Savanna. Breeder evaluation data were also collected for 2 years and 7 locations. The calibrated GSPs were evaluated using data from a 4-year experiment conducted under varying nitrogen rates (0, 60 and 120kg N ha -1 ). For the model calibration using experimental data, calculated model efficiency (EF) values ranged between 0.88-0.94 and coefficient of determination (d-index) between 0.93-0.98. Calibration of time-series data produced nRMSE below 7% while all prediction deviations were below 10% of the mean. For breeder experiments, EF (0.58-0.88) and d-index (0.56-0.86) ranges were lower. Prediction deviations were below 17% of the means for all measured variables. Model evaluation using both experimental and breeder trials resulted in good agreement (low RMSE, high EF and d-index values) between observed and simulated grain yields, and tissue and grain nitrogen contents. It is concluded that higher calibration accuracy of CERES-Maize model is achieved from detailed experiments. If unavailable, data from breeder experimental trials collected from many locations and planting dates can be used with lower but acceptable accuracy.

Introduction
Maize has become an important crop in Nigeria in the past decades due to its importance as food for human consumption; feed for animals and as a source of industrial raw material [1]. Despite its importance, yield of maize has remained quite low in the Savannas mostly due to biotic and abiotic constraints [2]. In recent years, new early and extra early maturing maize varieties that are tolerant to most of the biotic and abiotic constraints have been developed for the Nigerian Savannas by the International Institute for Tropical Agriculture (IITA) and its partners. Several agronomic technologies have also been developed to increase the productivity of these varieties with a view to enhancing maize productivity. Before the varieties are released, they are usually grown under multi-locational yield and crop management evaluation trials over several years. Dissemination of such varieties and technologies will require setting up of costly and time-consuming experiments across wide areas. This is needed to adequately evaluate the Genotype × Environment interaction which demonstrates the performance of each variety across diverse environments. Unless this is done, breeders cannot conclusively recommend genotypes for specific environments.
Crop simulation modeling offers an opportunity to explore the potential of new varieties and crop management practices in different environments (soil, climate, management) prior to their release [3]. Recently, use of crop simulation models, particularly DSSAT, is on the increase in Africa through initiatives such as the Agricultural Models Inter-Comparison Project (AgMIP) [4]. In West Africa, the CERES-Maize model has been recently used by McCarthy et al. [3] to evaluate climate-sensitive farm management practices in the Northern Regions of Ghana. Adnan et al. [5,6] used the same model to determine the nitrogen fertilization requirements of early maturing maize in the Sudan Savanna of Nigeria and the optimum planting dates of maize in Northern Nigeria. Iyanda et al. [7] used the CERES-Maize model to identify potential zones for maize production in Nigeria. One of the major requirements for the use of crop simulations is calibration of Genotype Specific Parameters (GSPs). GSPs are sets of parameters that enable crop models to simulate the performance of diverse genotypes under varying soil, weather and management conditions [8]. Like all other parameters in crop simulation models, the GSPs must have a physical or biological meaning [9]. Measuring GSPs directly from real systems (farm and field level) is very complex and impractical, and results in highly inaccurate and uncertain values of estimated variables [10,11]. Direct measurement requires setting up of field or growth chamber studies, collection of many samples, and exposure to different photoperiods where necessary. The most common method for deriving GSPs is from field experiments designed specifically for their estimation [12,13]. This process is quite expensive, time consuming and requires regular sampling of growth, phenology and yield data for each variety following a set of minimum dataset rules [8]. Since the movement of models from research and policy to adoption by farmers and extension, the need for rapid estimation of GSPs for newly released varieties has become more urgent [14]. Several concerns have been raised even in locations where abundant and high-quality data for calibration of GSPs for model uses are available. In a recent publication, Seidel et al. [15] presented various methods for improving the current methods of calibrating crop models.
Since most models have been developed elsewhere in Europe and USA, their use outside their domain of development requires a great deal of data for their calibration and validation. Several approaches for estimating GSPs have been documented. The genetic coefficient calculator (GENCALC) was used by Anothai et al. [16] to determine variety coefficients for new peanut lines in Thailand from standard varietal trials. From their experiments, they were able to successfully calibrate groundnut GSPs using a set of field experiments and yield evaluation experiments using the GENCALC software. Mavromatis et al. [17] successfully generated GSPs of soybean from crop performance trials in Georgia, USA. Bannayan et al. [18] employed a pattern recognition technique, which is based on similarity measures, to estimate GSPs for maize. In their experiments, pattern recognition was used as an alternative to GENCALC and GLUE in estimation of maize GSPs. The generalized likelihood uncertainty estimation (GLUE) method was used by He et al. [19] to successfully estimate maize GSPs in North Carolina. Welch et al. [14] used data from private-sector variety performance trials to develop soybean GSPs in the soybean belt of the United States of America. Buddhaboon et al. [20] used GENCALC and GLUE to estimate GSPs of deep water rice using CERES-Rice model. Most recently Lamsal et al. [21] used the independent component analysis (ICA) and separate factor approaches to estimate soybean GSPs from large breeding trial datasets in the USA.
With a growing number of researchers using the DSSAT model in the Savannas of Africa, there is need to evaluate the GSP calibration step as it is the aspect that requires the greatest amount of data and expertise. Calibration of GSPs can also be done using secondary data from breeders who routinely conduct multi-location trials. Such datasets are available in Africa where strong breeding programs are present. Because the conventional method of calibrating GSPs is quite expensive and laborious, there is need to utilize secondary breeder trial data for calibrating maize GSPs and to evaluate the accuracy of this approach by comparing it with calibrations done using detailed calibration experiments. The present research compares data generated from conventional experiments and from breeder evaluation trials. This is done to justify the claim that available data from breeder evaluation experiments can potentially be used for generating maize GSPs when setting up conventional experiments is unfeasible.
The objectives of this research were: i) to determine GSPs of 10 newly released open pollinated (OPV) and hybrid maize varieties for the Nigerian Savannas using data from both field experiments specifically designed for this purpose (herein called calibration experiments) and by using data from breeder varietal evaluation trials (herein called breeder evaluation experiments); ii) to compare the accuracy of the GSPs generated using calibration and breeder data; and iii) to evaluate the ability of the GSPs calibrated using the 2 methods to simulate grain yield and tissue/grain nitrogen contents of maize.

Model description
The maize model used in this study is the CSM CERES-Maize model of DSSAT version 4.6. Detailed description of the CERES-maize model of DSSAT can be found in Jones et al [22]. CERES-maize is variety and site specific and operates on a daily time step. It dynamically simulates the development of roots and shoots, the growth and senescence of leaves and stems, biomass accumulation, and the growth of maize grain yield as a function of soil and weather conditions, crop management practices, and variety characteristics. The model uses a standardized system for model inputs and outputs that have been described elsewhere [23,24]. The input system enables the user to select crop genotype (variety), weather, soil, and management data appropriate to experiment being simulated. Required crop genetic inputs for CERES Maize are given in Table 1.  Table 2). The calibration experiment consisted of 20 varieties, but we focused on 10 varieties that were common to both the on-station calibration experiments and breeder varietal evaluation experiments ( Table 3). The calibration experiments were conducted near irrigation facilities so as to maintain optimum moisture by irrigating when the soil moisture is below field capacity. Moisture conditions were monitored using a Time Domain Reflectometry (TDR) Meter 6050X1 TRASE SYSTEM (Soilmoisture Equipment Corp.). Recommended levels of mineral fertilizers for the region were applied (120N:60P:60K kg ha -1 ); potassium (K) was applied in form of Muriate of Potash, phosphorus in the form of Single Super Phosphate, and Nitrogen was applied in the form of Urea. While all the P and K fertilizers were applied at sowing; only half of the N fertilizer was applied at the time of sowing and the other half applied 21 days later. In addition, poultry manure (approximately NPK 1.1:0.8:0.5) was added to the fields at the rate of 5 tons ha -1 to maintain optimum nutrient status. The calibration experiments were laid down in a Randomized Complete Block Design (RCBD) with four replications. The gross plot consisted of six ridges, 0.75 m apart and 3 m long (plot area = 13.5 m 2 ). The two innermost ridges were used as the net plot for yield assessment and for sampling purposes. A space of 0.5 m was used between plots and 1m between replications. The experimental fields were cleared, harrowed, ridged and thereafter sprayed with a pre-emergence herbicide, Primextra (Atrazine + Metolachlor) at the rate of 4lha -1 before planting. The maize was sown at intra-row spacing of 0.25m at two seeds per hole, and later thinned to one plant giving a population of 53, 333 plants ha -1 .

Field experiments
For the breeder evaluations, experimental units are one-row plots, each 4 m long with inter-row spacing of 0.75 m and intra-row spacing of 0.40 m. Three seeds planted and later thinned to two per hill at 2 weeks after emergence to give a final plant population density of about 66,666 plants ha -1 . Fertilizer is usually applied at the rate of 60 kg ha -1 of NPK 15:15:15 at 2 WAP. An additional 60 kg ha -1 N using urea is top dressed at 5 WAP. The trials are kept weed free by applying atrazine (1 -chloro-3-ethylamino-5-isopropylamino-2,4,6-triazine) and gramoxone (1,1-dimethyl-4,4-bipyridinium dichloride) as pre-and post-emergence herbicides at 5 L in 220 L of water ha -1 and subsequently by hoeing. Grain yield is calculated based on 80% (800 g grain kg -1 ear weight) shelling percentage and adjusted to 150 g kg -1 moisture content.
For calibration using data from breeder evaluation trials, we collected long-term yield evaluation data from breeders at the International Institute for Tropical Agriculture (IITA), Ibadan and the Institute for Agricultural Research (IAR), Zaria. Data for the 10 maize varieties used in this study were selected. The bulk data was subjected to various quality checks. We used data for the 2012 and 2013 seasons from seven locations where weather and soil data were available. Table 2 shows the locations and data used in the calibration with breeder data. For model validation, field experiments were conducted at the Research Farm (11 o 59'N, 8 o 25'E 466m above sea level) of the Faculty of Agriculture, Bayero University, Kano in the rainy seasons between 2013 to 2016 (four seasons). The treatments consisted of three rates of nitrogen (0, 60 and 120 kg N ha -1 ) and ten maize varieties used in the calibration experiments ( Table 2). Treatments were laid out in a split-plot design with three replications. Nitrogen rates were assigned to the main plots while the varieties were assigned to the sub-plot.
Although the experiments were conducted in the rainy season, moisture contents were monitored, and supplementary irrigation was provided to ensure no moisture stress. All conventional agronomic cultural practices were followed. The data collected for model evaluation includes grain yield (kg ha -1 ), total grain nitrogen (kg ha -1 ), total tissue nitrogen (kg ha -1 ) and nitrogen harvest index (percentage). Total grain and tissue nitrogen were determined using the Micro Kjeldahl method.

Plant measurements
Evaluation of crop development was done by observing the phenology of the different maize varieties and recording the length of time (days) it takes to attain each phenological phase. The measurements were then converted to growing degree days (GDD) using a base temperature of 8˚C and adopting the relationship: Ten plants were tagged from the center of each plot in each replication for phenological observations. The end of the juvenile stage (i.e. panicle initiation) was determined through destructive sampling and dissection of three plants, followed by observation of apical meristem to check for floral bud development at 2 d intervals starting from 14d after emergence. The end of juvenile stage was recorded when the male flower primordial were visible in 50% of plants examined. Days to 50% tasseling was recorded when tassels were observed on 50% of the tagged plants. Physiological maturity observations were conducted as follows: kernels were removed from the base, middle and distal end of each sampled ear daily, starting when husks begin to show signs of drying. Days to physiological maturity was recorded when 50% of the kernels in each tagged ear had formed a black layer, indicating physiological maturity.
Plant biomass was taken at four different stages: vegetative, anthesis, grain filling and physiological maturity. Five plants within a one-meter strip in a row were cut at the ground level as suggested by Ogoshi et al. [11]. Leaves were separated from the stem, chopped and dried in the shade for three days. Both stems and leaves were oven dried at 70˚C for 36-48 hours until the sample had attained constant weight. Yield and yield component measurements were taken at harvest maturity. Plant height was measured from five randomly tagged plants within the net plot using a standard field meter rule. Other variables measured included: the number of seeds per unit area (seed # m -2 ), dry seed weight (g m -2 ), dry cob weight (g m -2 ), dry husks weight (g m -2 ), grain yield (kg ha -1 ) and stover weight at harvest (kg ha -1 ). All yield and yield component measurements were done using procedures and formulae described by Ogoshi et al. [11]. Total grain and tissue nitrogen (measured for the evaluation experiments only) were determined using the Micro-Kjeldahl method.

Soil and weather data
Detailed soil studies were conducted for each experimental location before planting. Soil pits were dug in each location, and soil samples were taken from each layer. The collected samples were then analyzed for pH, texture, moisture, bulk density, exchangeable potassium (K), organic matter, phosphorus (P), total nitrogen and CEC. For the detailed calibration and validation experiments, daily weather data were collected from weather stations (Watchdog 2000 Series, Spectrum Technologies) adjacent to all experimental sites. All weather stations were less than 5 km away from the experimental sites.

Initialization of soil and weather parameters
Daily records of minimum and maximum temperature, total solar radiation, and total rainfall are required for the CERES-Maize model weather initialization. The Weatherman utility in DSSAT was used to input the weather data to create the weather file used by the CERES-Maize model. The Weatherman utility also requires information on name of weather station, latitude, longitude and altitude. Soil data tool (SBuild) was used to create the soil database which was used for the general simulation purposes. Name of the country, name of experimental site, site code, site coordinates, soil series and classification were among the data entered in this utility. Initial soil water was set to field capacity for all locations for the calibration experiments, while for the breeder evaluation this condition was not set, leaving the inputted moisture properties of the soils in each location. Measured soil characteristics taken from each profile were used to calculate the soil physical and chemical parameters that are needed to run the model. For calibration experiments, we assumed that N was not limiting while for the breeder evaluation nitrogen was simulated although N stress was not recorded in any of the locations. For the evaluation experiments however, Nitrogen was simulated, and application was done according to treatments. For other simulation options, initial conditions were as reported for each year and location, the Priestly-Taylor/Ritchie method was selected for simulation of evapotranspiration while the Soil Conservation Service (SCS) method was selected for simulation of infiltration. Photosynthesis was simulated using the radiation use efficiency method, while hydrology and soil evaporation were simulated using the Ritchie Water Balance and Suleiman-Ritchie methods respectively. Phosphorus and Potassium were not simulated in all trials and locations.

Estimating genotype specific parameters
The GENCALC program of the DSSAT (Version 4.6) was used to calibrate the GSPs of the maize varieties. GENCALC is a software package that facilitates the calculation of variety coefficients for use in existing crop models including the CSM-CERES-Maize Model [25]. The CSM-CERES-Maize model has GSPs that define growth and development characteristics or traits of a maize variety (Table 1). Three parameters (P1, P2 and P5) define the life cycle development characteristics, two coefficients (G2 and G3) define growth and yield characteristics and one coefficient, PHINT, defines leaf tip appearances [10]. All the candidate genetic coefficients were selected and calculated using GENCALC except P2 because all the varieties used were day-neutral. Conventionally day-neutral varieties should have constant P2 value, ideally the value should be zero which means that the variety does not generate delays when photoperiods exceed 12.5 hours. In our calibration procedure, a small positive number (0.01) was used as P2 for all varieties so that computer arithmetic problems like division by zero are prevented. The varieties used in the trials were representative of all the maturity groups, i.e. extra early to late maturity. The default values in DSSAT were therefore used as initial coefficients for the extra-early, early and late maturity classes. Variety coefficient values for each variety are then varied, relative to each simulated and observed measurement. The model algorithm then searches the output file and uses the difference between simulated and observed variables to decide whether to increase or decrease the value of the coefficient that is being estimated. When GENCALC finds a good fit for each observation, it averages the coefficients and calculates the root mean square error (RMSE) [26]. According to each genetic parameter, the process is repeated until the best fit is selected. An interactive procedure is used by GENCALC where the user changes the variety coefficient step to minimize the errors and speed-up the convergence of the algorithm. The search finishes when the user accepts the parameters providing the lowest RMSE for a single target trait.
For calibration of maize genotypes using the experimental data, four variables connected to four out of the six coefficients were directly measured (P1, P5, G2 and PHINT), while P2 was not estimated because all the varieties used in the experiments were day-neutral. G3 of the initial genotypes were first selected and later adjusted using a set of truncated rules in the GEN-CALC2.rul file until a good fit is observed. For calibration using the breeder data, five out of the six coefficients (except P2) were estimated following an optimization procedure (Fig 1) similar to that used by Anothai et al. [16]. This approach has not been reported for maize, especially in Sub-Saharan Africa. At each step of the calibration process in GENCALC, measured number of days from emergence to flowering was compared with days to anthesis (ADAP), measured number of days from emergence to physiological maturity was compared to days to physiological maturity (MDAP), measured grain yield at harvest was compared with harvest weight at maturity (HWAM), measured overall biomass at maturity was compared with tops weight at maturity (CWAM), measured maximum leaf area index was compared with maximum leaf area index (LAIX), while measured harvest index was compared with harvest index at maturity (HIAM).The generated coefficients were then used to run sensitivity analysis, using various iterations (not less than 6000 for each coefficient) to confirm the accuracy of the sequential approach. The adjustment for each target coefficient was done while all other nontarget coefficients were kept constant. Despite the sensitivity analysis conducted, there is a possibility that pathologies associated with staged optimizations like GENCALC will occur thereby influencing the goodness of fit [27].

Model evaluation
The model was calibrated using data from conventional experiments or breeder evaluation trials. Model evaluation was done using data from the nitrogen trials ( Table 2). The data sets used for model evaluation were of two types; single measured data and time series data. For single measured data, we used r 2 and root mean squared error (RMSE) (Eq 1) to evaluate the agreements between simulated and observed values. Normalized Root Mean Squared Error (nRMSE, Eq 2) and the index of agreement (d, Eq 3) [28] were used to evaluate the time series data. We used nRMSE for time series data because RMSE varies with growth over time as the magnitude of the growth variables increase. The d-statistic was used because it gives a single index of model performance, which covers bias and variability; it also indicates 1:1 prediction better than R 2 . A low value for nRMSE (expressed in percent) is desired to define a good fit. The d statistic has values between zero and one, with one being the best fit. The modeling efficiency, EF [29] was employed to test modeling efficiency (Eq 4). EF has no dimension and an EF = 1 corresponds to a perfect match between observed and simulated data. When EF < 0, the simulated values are worse than simply using the observed mean. R 2 , RMSE, RMSEn, dindex and EF are shown in Eqs 1, 2, 3 and 4 respectively.
Where n is the number of observations, S i is the simulated data, m i is the measured data, and � m is the mean of the measured data.

Genotype specific parameters.
The values of GSPs generated using data from calibration experiments and breeder evaluation data are shown in Table 4. The highest degree days from emergence to end of Juvenile stage (P1) was recorded for OBA 9 in both experimental and breeder data. For number of days from silking to end of physiological maturity (P5), the highest values were recorded for Seedco white in both the experimental and breeder data. The lowest P1 values were recorded for IITA EE white using both experimental and breeder data. The variety Seedco white produced the largest number of maximum possible kernels (G2) for experimental data while OBA 9 had the highest values for breeder data. The value of G3 (kernel filling rate) ranged between 6.55 and 8.42 for the experimental data, and between 6.39 and 8.51 for the breeder data. Phyllochron interval (PHINT) values ranged from 36.9 and 45.5˚Cd for the experimental data and between 35.7 and 50.2˚Cd for the breeder data. The results show that about half of the GENCALC estimates are near to or beyond two SEMs away from measured values. Majority of these estimates are for the phenology parameters P1, P5 and PHINT.   (Fig 3).

Biomass and leaf area index.
Biomass and LAI were measured at juvenile stage, at anthesis, and at physiological maturity for the calibration data only. Fig 4 shows the result of simulation of above-ground biomass and LAI for Sammaz 32 across the trial locations. Good agreements were found between simulated and observed variables for all other varieties. Biomass was simulated with higher accuracy than LAI across all locations. Simulation of both biomass and LAI were most accurate using data from Samaru (d-index = 0.96, RMSE = 547.3 for biomass and d-index 0.92, RMSE 0.022 for LAI). Calibration of both variables had the lowest accuracy at Dambatta. Agreements between observed and simulated LAI were closer for the earliest measurement (juvenile stage), followed by measurement at anthesis, and physiological maturity in all locations except at Samaru where the reverse was observed. For biomass however, measurement at physiological maturity produced the closest agreements between observed and simulated values, while measurement at anthesis produced the lowest agreement between observed and simulated variables.

Yield and yield attributes.
Yield and yield attributes were well calibrated for all varieties in both calibration and breeder datasets. Table 5 shows the result of comparisons between  Options for calibrating CERES-maize genotype specific parameters under data-scarce environments observed and simulated mean grain yields of all varieties across different locations. Calibration of grain yield using experimental data was more accurate, as evidenced by low percentage prediction deviations (3.1 to 12.9). Values for model statistics were also good for the experimental data (RMSE = 264.6 kg ha -1 , nRMSE = 11.1%, and d-index = 0.97). For the breeder data however, prediction deviations of up to 24.7% were observed, with higher RMSE (510.1 kg ha -1 ) and nRMSE (16.1%). Negative prediction deviation which indicate under simulation was only observed in one location (BGD 13) for the breeder evaluation data, while in all instances positive prediction deviations were observed.

Model validation experiments
Grain and tissue nitrogen, as well as grain yield, at harvest were simulated using independent datasets from trials conducted at BUK during the rainy seasons between 2013 and 2016. Simulations were done using GSPs generated from both experimental and breeder data. Table 6 shows the comparison between observed and simulated grain yields with accompanying model statistics for the two datasets taking SAMMAZ 32 and EE-White as examples. Grain yield was well simulated for both varieties using both datasets, although better fits were observed for GSPs from the calibration data. Nonetheless, low values of RMSE (below 2% of mean for experimental and 4.5% for breeder), high values of d index (0.99 for experimental and 0.96 for breeder) and good EF values (slightly less than 1 for both datasets) were observed. Tables 7 and 8 shows comparisons of simulated grain and stover nitrogen using GSPs generated from calibration and breeder evaluation experiments. Better agreements between observed and simulated grain and stover Nitrogen were observed at high Nitrogen (120 and 60 Options for calibrating CERES-maize genotype specific parameters under data-scarce environments Kg N) for both calibration and breeder evaluation experiments. At zero nitrogen application however, the agreements between observed and simulated values where low as evidenced by higher RMSE and lower d-index values.

Discussion
Calibrated GSPs from the on-station experiments and breeder evaluation experiments were similar to GSPs reported for related varieties in West and Southern Africa [30][31][32] with respect to yield and yield attributes. For growth and phenology however, data from our experiments produced better calibration of growth and phenology than earlier reported experiments in the Nigerian Savannas. For calibration using both experimental and breeder data, we set the values of P2 to 0.01 to simulate the day-neutral characteristics of all the varieties used. Recent Table 5. Observed and simulated mean grain yields (kg ha -1 ) of all varieties across different locations.

Data Type
Observed � Simulated PD%# publications by Lamsal et al. [21,33] highlighted the need to test for possible pathologies while estimating GSPs in crop simulation models. Pathologies like expressivity failure (occur when a model cannot reproduce some observations for any combination of GSP values due to how the models' mathematical structure is set up), equifinality (multiple GSP combinations producing exactly same model predictions), and environmental hypersensitivity (GSP estimates depend on the environments used in generating them) should be checked especially if GSPs generated are to be used for genetic mapping. The presence of equifinality in our results is suggested by the closeness of the predictions generated from the two sets of GSPs even though 50% of the estimates differ by close to two SEMs or more. However, while more detailed testing for this phenomenon might be useful future work, our goal was to assess the degree of alignment between model predictions and observations given different sources of calibration data and this has been shown to be adequate. A high percentage (75%) of the GENCALC estimates that are near to or beyond two SEMs of the measured values were recorded for coefficients that determine phenology and therefore dependent on accurate measurement of developmental events (in observed days) and subsequent conversion to degree days. This high percentage shows that phenological events like number of days to flowering and number of days to maturity were not accurately measured in the breeder experiments due to small sample sizes and because they are not the traits of interest in the breeding program. This is evidenced for example by the under-simulation of days to flowering by 2.2 days and over simulation of days to maturity by 1.8 days for SAMMAZ 32 Options for calibrating CERES-maize genotype specific parameters under data-scarce environments shown in Fig 2. The implication of poor phenology measurements is seen by a slight over-estimation of yield and yield attributes thereby confirming assertions made by Kumudini et al. [34] who suggested that accurate prediction of phenology is fundamental to determining crop adaptation and yield potential. Calibration of the GSPs using the on-station experiment datasets produced better model fits than the breeder evaluation data as expected. The closeness of fit observed for the on-station data could be attributed to better experimental sites (soils with higher fertility and better moisture retention), better crop management (timely weeding, fertilizer application etc.) and higher experimental precision. This is evidenced by the breeder data having higher experimental errors for all measured variables when compared to the evaluation experiments. The evaluation experiments were also done on larger plot sizes and no missing plants were recorded at harvest, while in the breeder data smaller plots were used and there were no considerations for missing plants during yield calculations. In addition, for the experimental datasets more plantrelated variables were measured compared to the breeder evaluation experiment data where only grain yield, days to flowering, plant height and days to physiological maturity were measured. For the breeder evaluation experiment, the closeness between observed and simulated plant heights was low. This could be attributed to the fact that most breeder trials are conducted under water limited conditions, thus rainfall variability may affect crop performance and data quality. Although the model can properly simulate water stress, no stress was observed in any of the breeder evaluation sites and years. Grain yield and days to anthesis were simulated more accurately than plant height for the breeder evaluation experiment. This can be attributed to the high number of datasets used (7 locations and 2 seasons). Anothai et al. [23] suggested that more accurate predictions of yield and phenology are observed when data is collected from many locations and seasons. For the on-station experiment, plant height, number of leaves, leaf area index, biomass, number of grains per meter square and grain yields were well calibrated as the differences between observed and simulated values were very minimal.
According to literature [23,35] when many years and locations are available, GSPs calibrated using breeder evaluation experiments produced very accurate comparisons between observed and simulated growth, yield and phenology of maize. As suggested by Fensterseifer et al. [35], uncertainties exist in the reliability of model based simulations of growth, yield and phenology when calibrations are done using data from trials conducted under few environmental conditions. Also, several researchers [36,37] reported that the major factors determining the success of a model calibration process, which determines the applicability of the model on a larger scale is dependent on the wide variability of data used during the calibration process. Thorp et al. [38] suggested that for accurate calibration of crop models, integration of time variation using different planting dates and seasons, and spatial differences using different locations of datasets should be adopted for calibration of crop models using datasets from yield/breeder evaluation trials. To further verify these claims, we re-ran a couple of contrasting varieties under both on-station experiments and breeder evaluation experiments using different number of trials and data sets. For the on-station experiments, we first reduced the number of experiments by subtracting 2 stations concurrently (i.e. reducing from 8-6, 6-4 and 4-2). With every decrease in number of experiments, a subsequent decrease in model efficiency and increase in prediction error were recorded. The higher the number of trials the better the model fitted the observations, also reducing the number of experiments to 4 led to EF and dindex values below 0.4, while further reduction to 2 reduced the model efficiency to 0.25 and increased the prediction error to 55%. Using 4 experiments and all measured data produced the lowest level of acceptable model statistics (d-index � 0.50, nRMSE � 16% and EF �0.4). For the breeder evaluation experiments, every reduction in number of experiments led to a decrease in model efficiency and an increase in prediction error. We also reduced the number of datasets from the evaluation experiments to the same that was used in the breeder evaluation experiments. This resulted in decrease in model efficiency (0.89 to 0.81 for Sammaz 32 and from 0.94 to 0.88 for Seedco white). This shows that the number of experimental sites is more important than amount of calibration data if the minimum data sets (MDS) are collected as shown in Table 9. This view is supported by [35] and [31]. When many locations and planting dates are available, data from breeder evaluation experiments in the SSA can be used to make good calibration of calibrations with lower RMSE & nRMSE and higher d-index & EF values.
Although the calibration experiments provided more accurate GSPs, they are still very expensive and laborious and thus are nearly impossible to carry out especially in Sub-Saharan Africa where expertise and resources are limiting. Breeder evaluation data could also be used for calibration of GSPs where such data is available. As shown earlier, very accurate GSPs could be generated if large amount of data from many years (also planting dates) and various locations are available. This will go a long way in providing model users with cheap and easy ways of calibrating GSPs of existing and newly released varieties to their locations.
Evaluating the generated GSPs for simulation of grain yield, tissue nitrogen and grain nitrogen using independent datasets resulted in good agreements between observed and simulated values. For grain yield, comparisons of measured and simulated values using both GSPs generated from experimental and breeder data showed very close agreements under medium and high nitrogen applications. For comparisons under nitrogen stressed conditions however, poor agreements existed between observed and simulated grain yields for both GSPs. This is a common occurrence with simulations of grain yield and yield attributes under low nitrogen fertilizer applications. Gungula et al. [39], reported that the CERES-Maize model poorly predicts performance of maize under low nitrogen conditions in the tropics. The agreements between observed and simulated grain and stover nitrogen for both GSPs under high fertilizer applications is an indication that CERES-model still performs best under high nitrogen applications especially on tropical soils.
The CERES-Maize model has been shown over the years to be an important tool in evaluating crop management [3], climate change impacts [40], fertilizer recommendations [5,39] and yield forecasting [41]. Calibrating the newly released maize varieties currently recommended for the Nigerian maize belts will provide an important input requirement for using crop models to evaluate major production constraints including optimum stand density (OSD), appropriate varietal selection (targeting/stability analysis), choice of major partner crop (in case of mixed cropping) and fertilizer (especially N and P) managements. The availability of accurate GSPs for all major varieties will also increase the applicability of the model on a wider scale and for broader applications. Options for calibrating CERES-maize genotype specific parameters under data-scarce environments

Conclusion
Financial as well as time constraints coupled with frequent release of new varieties makes it difficult for model users to conveniently calibrate GSPs of crop models using detailed calibration experiments. Large numbers of evaluation trials are conducted across multiple locations under diverse planting dates by breeders and other growers prior to varietal release. Availability of such datasets, especially from evaluation trials conducted under minimal stress (moisture and nutrient) conditions provides an opportunity for efficient and rapid means of generating GSPs of newly released maize varieties. A systematic approach (as proposed in this study) as well as availability of large datasets from different locations and planting dates provide opportunities for estimation of accurate GSPs. Although it is possible to generate GSPs from breeder evaluation data, care must be taken to collect data from trials conducted under optimal conditions and not too far away from weather stations. Also, breeder data to be used for calibration of crop models must be collected from sites where detailed soil data is available. Additionally, appropriate tests must be conducted to ensure that pathologies such as equifinality, expressivity failures and environmental hypersensitivity are minimized especially when the objective is to generate GSPs for genetic mapping or for application under many environments where the estimation was not conducted. Availability of GSPs of new varieties as soon as they are released will help farmers and growers to make improved site-specific decision support tools (DST). Also, researchers will be provided with new ways to making variety groupings as well as studying complex Genotype, Environment, Management (G×E×M) interactions. Model users should endeavor to join breeding units/teams to ensure collection of robust data needed for model calibrations that are not traditionally collected by breeders.