Parameterization-induced uncertainties and impacts of crop management harmonization in a global gridded crop model ensemble

Global gridded crop models (GGCMs) combine agronomic or plant growth models with gridded spatial input data to estimate spatially explicit crop yields and agricultural externalities at the global scale. Differences in GGCM outputs arise from the use of different biophysical models, setups, and input data. GGCM ensembles are frequently employed to bracket uncertainties in impact studies without investigating the causes of divergence in outputs. This study explores differences in maize yield estimates from five GGCMs based on the public domain field-scale model Environmental Policy Integrated Climate (EPIC) that participate in the AgMIP Global Gridded Crop Model Intercomparison initiative. Albeit using the same crop model, the GGCMs differ in model version, input data, management assumptions, parameterization, and selection of subroutines affecting crop yield estimates via cultivar distributions, soil attributes, and hydrology among others. The analyses reveal inter-annual yield variability and absolute yield levels in the EPIC-based GGCMs to be highly sensitive to soil parameterization and crop management. All GGCMs show an intermediate performance in reproducing reported yields with a higher skill if a static soil profile is assumed or sufficient plant nutrients are supplied. An in-depth comparison of setup domains for two EPIC-based GGCMs shows that GGCM performance and plant stress responses depend substantially on soil parameters and soil process parameterization, i.e. hydrology and nutrient turnover, indicating that these often neglected domains deserve more scrutiny. For agricultural impact assessments, employing a GGCM ensemble with its widely varying assumptions in setups appears the best solution for coping with uncertainties from lack of comprehensive global data on crop management, cultivar distributions and coefficients for agro-environmental processes. However, the underlying assumptions require systematic specifications to cover representative agricultural systems and environmental conditions. Furthermore, the interlinkage of parameter sensitivity from various domains such as soil parameters, nutrient turnover coefficients, and cultivar specifications highlights that global sensitivity analyses and calibration need to be performed in an integrated manner to avoid bias resulting from disregarded core model domains. Finally, relating evaluations of the EPIC-based GGCMs to a wider ensemble based on individual core models shows that structural differences outweigh in general differences in configurations of GGCMs based on the same model, and that the ensemble mean gains higher skill from the inclusion of structurally different GGCMs. Although the members of the wider ensemble herein do not consider crop-soil-management interactions, their sensitivity to nutrient supply indicates that findings for the EPIC-based sub-ensemble will likely become relevant for other GGCMs with the progressing inclusion of such processes.


Introduction
Over the past decade, global gridded crop models (GGCMs) evolved to become major tools for agricultural climate change impact assessments [1][2][3][4][5][6][7][8]. They are also employed for studies on agricultural externalities [9][10][11][12][13] and provide key data for land use change and agro-economic models [14][15][16][17]. Typically, GGCMs are combinations of (a) a core model that estimates crop yields and externalities of crop production for a given set of input data and (b) a model framework that processes specified input data and runs the core model over large regions or the globe based on computational interfaces and georeferenced data from earth observations, statistical databases, or modelers' assumptions. Most often, core models are based either on ecosystems models adapted for representing cropping systems or on field-scale crop models with varying level of detail in agro-environmental or management processes.
Despite their wide use and substantial deviations among studies based on single GGCMs (e.g. [2,11]), there is little insight regarding actual drivers behind these uncertainties, which can be grouped into simulated processes, process algorithms, input data, and parameterization including management assumptions. Field-scale crop models themselves have been subject to a wide range of uncertainty analyses ranging from crop yield and biomass performance [18] including in-season dynamics [19] to soil organic matter (SOM) spin-up [20], static or transient soil handling [21,22], and trial management [23]. Since the inception of the Agricultural Model Intercomparison and Improvement Project (AgMIP, [24]) these experiments are frequently carried out for crop model ensembles.
Past uncertainty analyses for GGCMs or large-scale crop models in contrast have mostly addressed GGCMs' sensitivity to input data from different sources for climate (e.g. [2,11,25]) and soil (e.g. [26][27][28]), spatial resolution of input data (e.g. [29,30]), and deviations in outputs from different core models within the same framework ( [28,30]). More recently, also core model subroutines have come under scrutiny, for example PET estimation methods [31], temperature response functions [32], and the parameterization of soil processes relating to land degradation at the regional scale [33]. In a recent GGCM ensemble study, Müller et al. [7] attributed shares of overall deviations in ensemble climate impact projections and found that the selection of GGCMs contributed the majority of overall uncertainty, at least when the effects of CO 2 fertilization are accounted for.
Most prior studies targeting sources of uncertainty, however, have been performed for single GGCMs and typically for one singled out core model component. This provides thorough understanding of the considered components and their importance within a given GGCM and setup. Yet, it has previously not been investigated how assumptions on parameterizations and management-for which at present only limited data exist at the global scale-vary across GGCMs developed by different research groups, how single domains of GGCM setups interact, and how these translate into differences in GGCM outputs and performance. We present here a first evaluation of drivers in differences among yield estimates produced by a GGCM ensemble with a focus on setup configurations. Simulations were performed within AgMIP's Global Gridded Crop Model Intercomparison (GGCMI) initiative phase 1 [34] in which GGCMs have been forced with their default or partly harmonized input data. The ensemble considered herein consists of 12 members (Table A in S1 File), five of which are based on the field-scale model Environmental Policy Integrated Climate (EPIC), while the remainder has unique core models. This allows for an in-depth evaluation how setup-related uncertainties translate into differences in GGCM outputs within the EPIC-based sub-ensemble and to assess how this sub-ensemble compares to the wider ensemble.
While other ensemble members have more complex routines for plant growth and yield formation processes (Table A in S1 File; [23]), EPIC presently stands out in its detailed representation of soil processes including organic matter and nutrient cycling, hydrology, and erosion as well as impacts of tillage on soil properties, besides being the only core model in GGCMI phase 1 used in multiple GGCMs. EPIC provides options for tracking changes in soil properties transiently or to reinitialize soils each year. This enables to account for temporal variations in soil quality or to focus on plant growth process with limited impact of dynamics in soil properties. Both options are employed in the ensemble based on modelers' preferences. Further differences arise from the selection of subroutines for hydrologic processes, cultivar distributions, management assumptions, and process parameterization.
In most of the ensemble's other GGCMs, detailed soil representations are absent as of now or soils are reinitialized annually. Still, few ensemble members such as APSIM-and DSSATbased GGCMs [35][36][37][38] consider some of them in principle. Others have implemented transient dynamic soil routines and tillage operations affecting these in more recent versions not include in this experiment, as is the case for LPJmL and LPJ-GUESS [39][40][41]. Detailed soil processes are being implemented in other branches of ORCHIDEE (e.g. [42]) from which they may be transferred to the crop version. Beyond the present ensemble, various gridded crop models with detailed crop-soil-management interactions based on field-scale core models such as DNDC, Expert-N, MONICA, or STICS have been developed for regional studies (e.g. [43]), to which the same uncertainties apply.
This renders the present ensemble timely for the evaluation of differences in GGCM setups including the parameterization of agro-environmental processes and management assumptions, which we hypothesize to greatly affect GGCM outputs and performance. Furthermore, it allows to assess how such differences in setups relate to differences among core model formulations that entail substantial differences from inclusion and conceptualization to implementation of plant growth and agro-environmental processes. Through focusing the analysis on a single core model EPIC, it is possible to concentrate on the effects of different plausible setups whilst holding model structure constant, allowing comparison with a wider ensemble where both setup and structure vary. The knowledge gained from the analyses can inform both modelers and the wider scientific community, making use of the publicly available data (e.g. [44][45][46][47][48]), about the magnitude of process-and parameterization-induced uncertainty across highly distinct setups and the importance of specific setup domains.
The overarching aim of the study is to evaluate sources of uncertainty in maize yield estimates among the five EPIC-based GGCMs and to compare differences within this sub-ensemble to the wider ensemble. Specifically, the objectives are to (a) identify key assumptions and setup components that drive differences in yield estimates, (b) derive priorities for further improvements in GGCM input data and harmonization, and (c) assess how findings for the sub-ensemble relate to a wider ensemble of different core models.
Complementary, a detailed evaluation is conducted for two EPIC-based GGCMs by stepwise introducing aggregated setup domains from one into the other in order to examine the importance of cultivar setups, organic matter turn-over and nutrient cycling, hydrologic parameterization, soil parameterization and handling, and crop management.

Global gridded crop models participating in this study
A total of 14 GGCMs contributed simulation outputs to GGCMI phase 1 [34], 12 of which provided outputs with harmonized input data (see next section). Five of the GGCMs (EPIC-BOKU, EPIC-IIASA, EPIC-TAMU, GEPIC, and PEPIC) are based on the field-scale model EPIC, while the remainder differs in core plant growth models. This study focuses on the five EPIC-based GGCMs, while the seven non-EPIC-based GGCMs are included in the Supplementary Information and Discussion to put the EPIC ensemble in a wider context, one of them only in the evaluation of model intercorrelation. Table A in S1 File provides an overview of key GGCM characteristics concerning plant growth, yield formation, and soil processes. More detailed information are provided on the website of ISI-MIP (http://www.isimip.org) and in Müller et al. [49,50]. In the further text, the term "model" refers to the core model routines and hence the structural differences, and the term "GGCM" to the global crop model framework, which may-as in the case of the EPIC-based GGCMs-only differ in setup and configuration.

Crop management scenarios
Six crop management scenarios (Table 1) were simulated to allow for quantifying differences among GGCMs from assumptions on management. Three different harmonization setups on growing season (planting and harvest dates) and nutrient supply (default, fully harmonized, harmonized & sufficient nutrients) were combined with two water management scenarios: rainfed only and sufficiently irrigated using automatic irrigation scheduling based on water deficit. Parameterizations-except for the cumulative temperature requirement to reach maturity (see next section), which were adjusted according to the growing season setup of each management scenario-were not altered among scenarios to allow for evaluating the impacts of data harmonization alone.
The default scenario represents each research group's assumptions on annual fertilizer application rates and growing seasons (see Text C in S1 File for EPIC-based GGCMs). It serves for evaluating differences among GGCMs if only climate data are harmonized. The fully harmonized (fullharm) setup allows for identifying remaining differences if annual nutrient application rates and growing seasons are harmonized using the input data described below. The fully harmonized setup with sufficient nutrient application (harm-suffN, referred to as harmnon in the simulation protocol [34]) aims to virtually eliminate plant nutrient deficits and consequently impacts of soil nutrient dynamics. This minimizes differences among GGCMs resulting from the setup of fertilizer application and soil nutrient cycling. As EPIC-TAMU was first setup in the course of this project, its default and fullharm setups are identical. To allow for evaluating the effect of harmonization from the default to the fullharm setups for the other EPIC-based GGCMs, supplementary results are shown with exclusion of EPIC-TAMU. This is also the case for LPJmL and LPJ-GUESS of the wider ensemble included in supplementary evaluations.

The Environmental Policy Integrated Climate (EPIC) model
The EPIC model was first developed in the 1980s to assess the impacts of soil management on crop yields [53]. It has since been updated frequently to cover e.g. effects of elevated atmospheric CO 2 concentration on plant growth [54], detailed soil organic matter cycling [55,56], and an extended number of crop types and cultivars [57,58] among others [59]. The presently publically available version is EPIC v.0810.
EPIC estimates potential biomass increase on a daily time-step based on light interception and conversion of CO 2 to biomass. Plant growth and phenology are calculated based on the daily accumulation of heat units. Potential biomass increase is constrained by water and nutrient (nitrogen (N) and phosphorus (P)) deficits, adverse temperature, and aeration stress. On each day of the crop growth period, the potential biomass gain is adjusted by the major plant growth-regulating factor to obtain the actual biomass increment. Hence, only one stress factor limits biomass accumulation on a given day. Root growth can be limited by soil strength, adverse soil temperature, and aluminum toxicity. At maturity, the model calculates crop yield based on above ground biomass and an actual harvest index HI a , which is estimated within a range given by potential HI (HI max ) and minimum HI under water stress (HI min ).
Besides plant growth and yield formation, EPIC estimates a wide range of environmental processes, for example wind and water erosion rates, turnover and partitioning of organic matter (OM) based on the CENTURY model [55,60], mineral N and P cycling, evapotranspiration (ET), fluxes of selected gases, and soil hydrologic processes. All of these have feedbacks on Table 1. Crop management scenarios based on Elliott et al. [34]. The default setup represents each modelling group's own assumptions, input data and management parameters. The harmonized scenarios use the same growing season data [51] and the same annual application rates for N and P [52] (fullharm) or sufficient nutrient supply (harm-suffN) to avoid nutrient-related plant growth limitations. See Fig E, Fully harmonized, irrigated fullharm sufficient harmon. 2) harmon. 2) harmon. 3) Fully harmonized, rainfed -harmon. 2) harmon. 2) harmon. 3) Harmonized & suff. nutrients, irrig. harm-suffN sufficient sufficient sufficient harmon. 3) Harmonized & suff. nutrients, rainfed -sufficient sufficient harmon. 3) plant growth, mainly through nutrient and water availability. EPIC has one central plant growth module, but provides various subroutines for calculating several of the externalities, e.g. six methods for water erosion estimation, eleven methods for estimating field capacity (FC) and wilting point (WP) including static input of own estimates or data, and five options for potential evapotranspiration (PET) among others. While this allows for adjusting the model to site conditions for which one method may be more appropriate than the other, it introduces another dimension of uncertainty besides the numeric parameterization of processes itself. Further information on relevant subroutines are provided in Text A in S1 File.

Parameterizations of the EPIC-based global gridded crop models
Selection of global parameters. All EPIC-based GGCMs except EPIC-TAMU use EPIC v.0810 as the core model. EPIC-TAMU uses the experimental version v.1102, which has additional routines mainly for OM and nutrient cycling, but the same plant growth module. As shown in Text B (S1 File), EPIC v.1102 produces virtually identical outputs in high-input regions but shows differences in low-input agriculture where nutrient cycling has larger impacts on plant growth. It may hence be considered another configuration of EPIC in this context.
All EPIC-based GGCMs have been applied in prior studies except EPIC-TAMU, which has first been set up in the course of this project. Based on prior applications or modellers' parameter estimates suitable for global simulations, the EPIC-GGCMs differ substantially in their parameterization and selection of subroutines. E.g. GEPIC has earlier been set up for reproducing small-holder agriculture in sub-Saharan Africa [61], relying partly on parameters calibrated in West Africa [58] whereas EPIC-IIASA has frequently been applied in high-input regions such as the EU [62] or China [63]. Yet, modelling purposes and target regions may change over time with ongoing research and parameterizations herein are accordingly results of both earlier research and modellers' assumptions on globally representative settings. Outlines of EPIC-based GGCM setups, purposes, and prior applications are provided in Text C (S1 File). Table 2 gives an overview of the setups and parameterizations grouped by hydrology, soil degradation, OM and nutrient cycling, crop management, and crop growth apart from cultivar definitions, which are described in the subsequent section.
Numbers in parentheses below refer to column "No" in the table. As there are interactions among core model processes such as hydrology and OM cycling, this grouping is tentative and partly owed to the model structure. Concerning the choice of major subroutines, three GGCMs use Penman-Monteith (PM; [64]) for PET estimation (1) and two Hargreaves (HG; [65]) in different parameterizations. Only EPIC-IIASA uses prior estimated FC and WP parameters (9) while all other EPIC-based GGCMs estimate these parameters using Rawls method online ( [66]). Water erosion (14) is considered in two of the GGCMs (and wind erosion in an additional one) with deviations in estimation method (16) and scaling of sediment yield (15). Three EPIC-GGCMs have a dynamic soil profile (19) with transient updating of profile depth, texture, OM, nutrient pools, and hydrology. In the two GGCMs with static soil profiles, soil texture and OM are re-initialized at the beginning of each year, but not mineral nutrient pools and soil humidity. All GGCMs are run transiently (20), except for GEPIC, which is run for each decade separately with a spin-up of 30 years (see Text C in S1 File). All three methods available for estimating denitrification (21) are used in the EPIC-based ensemble. Numeric parameters agree in some cases among GGCMs, especially if default parameter values have been selected (e.g. microbial decay rate (22)), but differ in several cases among four to five EPIC-GGCMs as is the case for the N volatilization coefficient (26). Different values have also been selected for defining irrigation water and fertilizer application strategies  (27)(28)(29), and EPIC-TAMU differs in addition from the other GGCMs in the parameterization of selected plant growth process (30)(31)(32)(33).

Geographic distributions and parameterization of maize cultivars
Differences in parameterizations of crop cultivars are evaluated here based on the parameters HI min , HI max , and optimal temperature ranges only. The latter is defined as the base and maximum temperatures for plant growth (see Text A, i.e. eq. S1 and S5, in S1 File). Although potential heat units (PHU), the temperature sum to reach maturity (Text A in S1 File), are an important cultivar characteristic as well, they are typically prescribed in GGCMs by growing season input data and long-term climate in each grid cell in order to meet reported sowing and harvest dates [61]. Between one and four different maize cultivars were planted within each EPIC-GGCM (Fig 1; Table D in S1 File). EPIC-IIASA uses four cultivars in its default setup ( Fig 1A) that are attributed to major world regions based on climatic and economic  3) Field capacity (FC) and wilting point (WP) can be estimated by 11 different methods or be an input in soil files. Saturated hydraulic conductivity (K sat ) can be estimated according to Rawls method or be input. For EPIC-IIASA these parameters were estimated based on the ROSETTA model as described in Text C (S1 File). 4) Describes the dependence of curve number (CN) estimation on soil moisture, which can be based on five methods, among them soil moisture gradient with profile depth or calculation of a daily soil moisture index (SMI) 5) Water and wind erosion can be turned on or off and water erosion is estimated by different methods (see below) 6) Water erosion rates are lowered by the given fraction (0 corresponds to virtually eliminated water erosion, 1 to no erosion control) 7) MUSS: Modified Universal Soil Loss Equation for Small Watersheds; RUSL2: Modified Revised Universal Soil Loss Equation 8) Static: annual re-initialization of soil profile, except water content and mineral nutrients; dynamic: transient updating of soil parameters throughout simulation 9) GEPIC is run separately for each decade as described in Text C (S1 File) 10) EPIC: original EPIC method [53]; CI: Cesar Izaurralde method [56]; AK: Armen Kemanian method (unpublished) 11) The auto-fertilizer and irrigation triggers define at which stress level fertilizer or water are being applied. E.g., a value of 0.8 for the auto-fertilizer trigger implies that fertilizer is applied on a given day if potential biomass production would be limited by >20%. PEPIC employs rigid timing of N fertilizer application and has accordingly no threshold.
https://doi.org/10.1371/journal.pone.0221862.t002  characteristics. In the harmonized setup scenarios, cultivars 1 and 3 were merged as growing season length was defined according to common input data sets. EPIC-TAMU ( Fig 1B) plants high-and low-yielding varieties. The latter is assigned to countries in which maize yields have stagnated or decreased within the past decades according to Ray et al. [67]. The high-yielding variety is assigned to all other regions. The same two maize cultivars were distributed in GEPIC and PEPIC (Fig 1C) based on the human development index (HDI). The high-yielding variety is planted in all countries with HDI�80, which corresponds to "very high development" according to UN classification [68]. EPIC-BOKU used the high-yielding variety in all grid cells (Fig 1D).

Common input data
Climate forcing data based on the WFDEI GPCC dataset [69] at a spatial resolution of 0.5˚x 0.5˚were provided by the ISI-MIP and GGCMI projects. The climate data are based on temperature and solar radiation from ERA-interim [70] and precipitation from GPCC [71]. All EPIC-based GGCMs used soil data from the ISRIC-WISE database [72] mapped to the Digital Soil Map of the World [73]. For EPIC-BOKU and EPIC-IIASA, the 5000 soil profiles had been reduced to the original 120 soil typologic units WISE is based on [74]. Soil hydraulic parameters not provided in the WISE database (FC, WP and saturated conductivity (K S )) were estimated for EPIC-IIASA using the ROSETTA model [75,76] and calculated endogenously in the other EPIC-based GGCMs using Rawls method (Table 2). Figure C in S1 File shows distributions of key soil parameters for both GGCMs. For the harmonized runs, nutrient application rates for N and P were based on crop-specific mineral fertilizer application rates from [52] combined with N and P embedded in manure [77]. Harmonized planting dates and growing season lengths were based on Sacks et al. [51], complemented by gap filling with data from the MIRCA2000 dataset [78] and LPJmL [79]. Both datasets were provided by the GGCMI project [34]. Default runs were carried out using individual fertilizer and growing season data within each GGCM.

Permutation of setup domains for GEPIC and EPIC-IIASA
To assess the importance of single data and parameterization domains within the EPIC-GGCMs, aggregated parameter domains of EPIC-IIASA were step-wise introduced into GEPIC. Parameters and routines were grouped into the six domains (Table 3) of cultivar distribution (Cult), soil parameterization (SoilD), soil handling (SoilP), nutrient turnover-related coefficients (CoeffN), hydrologic coefficients (CoeffW), and crop management (Manage). The two GGCMs were selected because yield simulations have no trend in time (Fig 2) and there are substantial differences in their setups (Table 2).
GEPIC was run with all 64 (2 6 ) resulting setup combinations using the land mask of EPI-C-IIASA to ensure consistency. The evaluation focuses on rainfed yield estimates as these cover the whole range of uncertainty impacts. Magnitudes of plant growth stresses are included to analyze drivers behind different yield estimates. Benchmarking against reported yields at the country-level serves for quantifying the contribution of single setup domains to GGCM performance besides the GGCM's sensitivity for a given setup domain in contrasting countries.

Evaluation and reference data
Yield aggregation. Crop yields are compared and evaluated at the global, national, and grid level as well as aggregated to Koeppen-Geiger regions (Fig D in S1 File). Agreement among GGCMs is compared in relation to fertilizer application rates, mean annual precipitation (MAP), and cultivar distributions to identify drivers of deviations in yield estimates.
Global and national average yields (YD av ) were calculated from simulated rainfed and irrigated yields in each grid cell and the respective rainfed and irrigated harvested areas obtained from the MIRCA2000 dataset [78], which provides rainfed and irrigated areas for various   Table 3. Composition of aggregated setup domains the comparison of GEPIC in EPIC-IIASA in the fully harmonized (fullharm) scenario (Table 1). Numbers in the first column are used in selected figures to keep annotation short, otherwise the abbreviation is used. Numbers in column "Parameters considered" refer to those in Table 2. When referencing the setup domain parameterizations from each GGCM, e = EPIC-IIASA and g = GEPIC (e.g. eCult refers to cultivar setup of EPIC-IIASA).

No Setup domain and abbreviation Parameters considered Effect in the EPIC model 1 Cultivars (Cult)
• see Fig 1 for distribution of cultivars and Table D in S1 File for differences in cultivar parameterization • scaling of yields based on potential HI max • higher sensitivity to water stress with lower HI min • temperature ranges for optimal crop growth 2 Soil parameterization (SoilD) •

Parameterization of organic matter and nutrient cycling (CoeffN)
• where YD av,c is the national average yield in country c, YD i,g is yield under irrigated conditions in grid cell g, YD r,g is yield under rainfed conditions in grid cell g, HA i,g is irrigated area in grid cell g, and HA r,g is rainfed area in each grid cell g, and m is the number of grid cells in country c. We acknowledge the uncertainty introduced from spatial aggregation [80] but as the focus is on a comparison among GGCMs, we consider this to be of minor importance here.
Metrics for GGCM agreement. The coefficient of variation (CV) [%] was used as a metric for absolute bias among yields averaged throughout the study period (CV av ) as well as changes in inter-annual yield dynamics if GGCM setup components are introduced from GEPIC into EPIC-IIASA (CV t ). The coefficient of variation is expressed as where S is the standard deviation and � X is the mean of yields throughout the evaluation period in each grid cell or globally aggregated. CV av was calculated for the period 1980-2009 as the first simulation year 1979 did not have a complete growing season globally. CV t was calculated the same way but after scaling the time series average to 1 in order to avoid bias caused by changes in the magnitude of yields.
The mean error (ME) was used in the same evaluations as a metric for absolute bias including the sign of change: where YD est is the yield estimate, YD ref is the reference yield, and n is the number of years considered. In the permutation of GGCM setup domains (Table 3) difference are evaluated in relative terms compared to the original EPIC-IIASA setup. ME hence corresponds there to the fraction of relative change [-].
To test the agreement in inter-annual yield variability among GGCMs, we used the timeseries correlation [34,49] according to Pearson's correlation coefficient r calculated for yield time-series in each grid cell pairwise among all GGCMs according to where n is the sample size, x i‥n and y i‥n are paired samples of yield estimates from two GGCMs, x i and y i are the ith elements of each total sample, and � x and � y are the sample means. As one of the EPIC-based GGCMs exhibited a substantial decline in yields after the first simulation years, the evaluation period was limited to 1980-1990 in the GGCM inter-comparison in order to avoid bias from unexpected model behavior later in the simulation period.
All evaluations were carried out with the statistics software R [81] using the packages ggplot2 [82], corrplot [83], and the heatmap.2 function of gplots [84] in a modified version from Müller et al. [49] for visualization.
Benchmark metrics for reproducing reported yields. Skills of the ensemble with respect to reproducing reported inter-annual yield variability and absolute yields have been assessed in detail in Müller et al. [49] across various sets of benchmark, land use, and climate data. Due to uncertainties inherent also in benchmark and land use data (see below), the performance evaluation herein does not aim at identifying an ensemble member performing optimally at global or regional scales. Rather, it serves for comparing skills of GGCMs in relation to differences in setups, i.e. the number of countries in which a specific GGCM shows high performance or how performance changes with permutation of model setup domains. Thereby, we focus on the two harmonized management scenarios fullharm and harm-suffN to ensure comparability in key input data.
Following the methodology of Müller et al. [49], we used time-series correlation coefficient r (Eq 4) between detrended national average simulated and reported yields as the main metric. Reported national and global average yields were obtained from FAOSTAT [85]. Detrending was performed in order to remove temporal trends in reported yields due to changes in technology and management by subtracting the 5-year moving mean [34,49]. As a reference, the detrended yields from FAOSTAT were multiplied by their mean of the period 1997-2003 for which fertilizer inputs are representative. The comparison of reported versus simulated global and national yields was performed for the complete simulation period 1980-2009. A significance threshold of p<0.1 (at approx. r>0.31) was selected for defining good performance compared to reported yields. The mean bias in absolute yield estimates from reported yields was measured as mean error (Eq 3).
The evaluation of GGCM performance itself is often limited by the quality of benchmark [49] and land-use data [80], characteristics of climate data (Ruane et al., in preparation), and representativeness of management data for a given region. Benchmarking itself is hence subject to substantial uncertainties and was here limited to major producers and other countries for which available benchmark and management data can be considered representative. These were selected based on whether (a) production and harvest area data had not been estimated by FAO and (b) harvested area did not fluctuate by >100% throughout the study period to account for the static cropland mask used in the aggregations.

Effects of harmonization on global average maize yield estimates
If the EPIC-based GGCMs are run in their default setups, global average simulated maize yields differ by up to 124% annually (mean 95%) using the lowest estimate as a reference (Fig  2A; Table E in S1 File). This is mainly due to very high yield estimates from EPIC-BOKU of around 8 t ha -1 , while the other EPIC-based GGCMs have yield estimates of around 4-6 t ha -1 . The ranges decrease to 55% if harmonized planting dates and fertilizer application rates are used ( Fig 2B) and further to 26% with sufficient nutrient supply ( Fig 2C). Accordingly, the bias from reported yields varies greatly by GGCM and scenario with the largest bias in terms of ME for EPIC-BOKU in the default setup and the lowest for GEPIC in the fullharm scenario (Table F in S1 File). The mean bias is not constant over time, however, with significant negative trends in yield estimates for PEPIC in all setup scenarios, and for EPIC-BOKU in the fullharm and harm-suffN scenarios. EPIC-IIASA in contrast shows a slight positive trend in its default setup. Still, major inter-annual patterns agree among GGCMs and reported data reasonably well and the whole EPIC-based ensemble reproduces a reported peak in global average yield in 2004.

Spatial differences in mean and inter-annual maize yield estimates
Deviations in long-term mean yields. Spatially, the deviation of maize yield estimates among the EPIC-GGCMs is largest with the default setups in tropical and arid regions (Fig D in S1 File) with CV av of up to 224% and CV av � 44% in > 50% of all grid cells (Fig 3A and 3B; Table G in S1 File). The most distinct differences with CV av > 100% were found in sub-Saharan Africa, South America, India, and Southeast Asia. The smallest differences occur in mid and high latitudes of both hemispheres, where (a) fertilizer inputs are at moderate or high levels (Fig E in S1 File), (b) most GGCMs plant the same high-yielding cultivar (Fig 1) and (c) the climatology usually defines a narrow growing season window limiting differences among GGCMs in planting date assumptions. Rainfed cultivation results in larger differences among GGCMs in (semi-)arid regions of Central and West Asia, the Western USA and Northeastern Brazil. If irrigation water is applied, differences increase in most parts of sub-Saharan Africa and Central India, but decrease in most of North and South America, Central Asia, and Europe.
Harmonizing fertilizer and growing season data reduces CV av to �62% under rainfed and �54% under irrigated conditions in 75% of all grid cells (Fig 3C and 3D; Table G in S1 File). Spatial patterns remain similar to those found for the default managements, but CV av also increases substantially in few regions after harmonization such as Western Russia or Southern China. The application of sufficient nutrients further reduces differences among EPIC-GGCMs (Fig 3E and 3F). Agreement improves especially in regions with low or moderate reported fertilizer application rates (see Fig E in S1 File) such as India, sub-Saharan Africa, and South America, but higher CV av compared to the default setups remains in Western Russia and Southern China. Excluding EPIC-TAMU, for which missing default simulations have been replaced by the fullharm setup, from the analysis results in comparable spatial patterns (Fig G in S1 File), lower CV av in the lower percentiles, and higher CV av in the upper percentiles except for the maximum value (Table H in S1 File).
In the regions with increased CV av in the harmonized setups, different mechanisms are at play. Selecting an administrative unit from each of the two regions (Fig I in S1 File) indicates that in the province Hunan (China), yields decrease uniformly for all EPIC-based GGCMs, Parameterization-induced uncertainties and setup harmonization in global crop models except EPIC-TAMU which has identical default and fullharm setups, leaving the absolute difference constant but increasing the CV av value. The fact that there is little difference between the fullharm and harm-suffN setups and higher yields in the default scenario indicates that the harmonized growing season, which refers to a side season at the end of the year with low precipitation and 129 days until harvest (not shown), drives the yield decrease. In the oblast Krasnodar (Russia), GGCM responses to harmonization are more diverse. GEPIC and PEPIC show a decrease in yields from default to fullharm followed by an increase with harm-suffN, indicating a partial impact of nutrient supply. Yet, for EPIC-IIASA, which simulates an adapted cultivar in this region (Fig 1A), yields increase continuously among scenarios. EPIC--BOKU in turn shows substantial yield reductions in the harmonized setups indicating that sufficient nutrient supply cannot offset yields in this GGCM.
Deviations in inter-annual yield dynamics. In the default setup, the median time-series correlation coefficient r among the EPIC-based GGCMs is often around zero (Fig 4A and 4B; see also Fig M in S1 File), except for temperate and cold regions in case of sufficient irrigation and additionally in arid regions under rainfed conditions. Globally, still >40% of all grid cells have a median correlation that is statistically significant (r with at least p<0.1, Table I in S1 File) under rainfed conditions, but only 17% with sufficient irrigation. Harmonization provides often a slight improvement, foremost with a higher correlation in regions that already had a moderate agreement in the default setups (Fig 4C and 4D). Low agreement prevails especially in the tropics and along the Eurasian border, where the correlation partly decreases compared to default. With sufficient nutrient supply (Fig 4E and 4F), there is a significant correlation in 68% of grid cells under both irrigated and rainfed water supply, and a very high agreement at p<0.01 in 51% or 43%, respectively (Table I in S1 File). The largest deviations remain in the humid tropics of sub-Saharan Africa and South America, South(-East) Asia, and along the Eurasian border. Excluding EPIC-TAMU from the analysis results again in similar spatial patterns (Fig H in S1 File), but a substantially lower fraction of grid cells with a good median agreement occurs in the harm-suffN scenario (Table J in S1 File).
Impact of fertilizer supply on deviations in maize yield estimates. As indicated by the cross-scenario analyses above, the bias among GGCMs in terms of CV av is largely driven by crop nutrient supply. Accordingly, within the fullharm scenario CV av is inversely correlated with the level of N fertilizer supply in most climate regions (Fig 5). Although linear regressions are highly significant in all climate region x water supply combinations, the explained variance is substantially higher under irrigated (Fig 5A-5D) than under rainfed conditions (Fig 5E-5H). CV av is on average at about 60-85% in all climate regions at very low N application levels and highest in arid regions under irrigated conditions. It decreases on average to about 22% in arid and temperate and 17% in tropical and cold regions at applications rates above 200 kg N ha -1 yr -1 . Rainfed cultivation substantially dampens the effect of nutrient application rates as a driver for differences among GGCMs in (semi-)arid climates and leaves larger deviations at moderate to high application rates also in other climate regions. The large deviation at fertilizer application rates >300 kg N ha -1 in arid regions under rainfed conditions (Fig 5E) is apparently caused by soil hydrology as indicated by the substantially lower deviation under irrigated conditions (Fig 5A).
Similarly, the correlation among GGCMs increases with increasing fertilizer application rates (Fig 6A-6H) especially under irrigated conditions (Fig 6A-6D), but with comparably little explained variance. The highest impact of fertilizer application can be found in cold and temperate climates under irrigated conditions (Fig 6B and 6C). In arid regions under rainfed conditions, the correlation is often already high at low fertilizer application rates (Fig 6E) implying that climatic drivers dominate here the GGCM inter-correlation. In the tropics in contrast, where fertilizer application rates are commonly moderate to low, the correlation is at all application rates lower than in other regions (Fig 6D and 6H). Binning the fertilizer application rates (Fig J in S1 File) shows that there may rather be thresholds of application rates allowing for high correlation among GGCMs at least in arid, cold and temperate regions where a substantial increase can be found for the first two at 50-100 kg N ha -1 and for the latter at >150 kg N ha -1 . Again, this occurs foremost with sufficient water supply (Fig J, panel a-d in S1 File).
Impact of cultivar distributions on deviations in maize yield estimates. Besides nutrient supply, an important driver for remaining differences in spatial mean yield estimates is the Fig 5. Coefficient of variation for maize yields among EPIC-based GGCMs compared to fertilizer application rates. Results are shown for the fully harmonized management scenario (fullharm) with sufficiently irrigated (a-d) or rainfed (e-h) water supply in each grid cell of four major climate regions. Linear regressions are limited to �200 kg N ha -1 , which commonly corresponds to sufficient N supply [86]. cultivar distribution. If all GGCMs plant one of the high-yielding cultivars 1 or 2, CV av is typically lowest (Fig K in S1 File). The deviation increases for regions in which all four GGCMs that use Cultivar 4 plant this variety and is often 50-100% higher than the first option if none of the cultivars dominates. This applies to regions in which GEPIC and PEPIC but not EPI-C-IIASA and EPIC-TAMU plant cultivar 4. The effect is stronger under rainfed than under irrigated conditions. Cultivar distributions can hence also explain some of the remaining differences in the harm-suffN scenario (Fig 2E and 2F), e.g. in Western Russia (see above).
The impact on the time-series correlation coefficient is less evident (Fig L in S1 File). Regardless of the water supply regime and cultivar definitions, the agreement is high in arid regions with lowest agreement if all GGCMs plant a high-yielding cultivar with irrigation. At overall lower agreement, the picture is similar in the tropics where it also applies to the mixed cultivar definitions. Cold and foremost temperate regions in contrast show a gradient in decreasing agreement among GGCMs from uniform planting of the high-yielding cultivar towards dominant low-yielding cultivar or mixed cultivar definitions.

Impact of single setup domains on yield deviations
The further evaluation of differences in setup domains between EPIC-IIASA and GEPIC ( Fig  7A-7P) focuses on the relative difference from the complete EPIC-IIASA setup (Fig 7A). Complementary magnitudes of plant stresses are provided in the Supplementary Information (Fig  O in S1 File). They cannot be related to differences in yield estimates directly as their impact depends on estimates of potential biomass growth in the EPIC model (driven e.g. by growing season length, climate and management; see also Text A, S1 File) and the timing of the stress occurrence. They are hence addressed per panel but not among different managements. Selected examples for single grid cells are provided in Figs O and P in S1 File, maps of dominant stresses for contrasting setups in Fig R in S1 File.   Fig 6. Median time-series correlation coefficient r for maize yields among EPIC-based GGCMs compared to fertilizer application rates. Results are shown for the fully harmonized management scenario (fullharm) with sufficiently irrigated (a-d) or rainfed (e-h) water supply in each grid cell of four major climate regions. Linear regressions are limited to �200 kg N ha -1 , which commonly corresponds to sufficient N supply [86]. https://doi.org/10.1371/journal.pone.0221862.g006

Parameterization-induced uncertainties and setup harmonization in global crop models
If only the management is used from the GEPIC setup and everything else is set to the EPI-C-IIASA setup, yields increase slightly compared to the full EPIC-IIASA setup (Fig 7A) despite an increase in phosphorus (P) and water (W) deficits (Fig O, panel a in S1 File) and show an increase in inter-annual yield variability in terms of CV t . This is caused by the narrower row spacing in GEPIC (Table C in S1 File), which increases the estimate of potential biomass (Text A in S1 File) often resulting in higher actual biomass estimates despite higher stress occurrence (see Fig P in S1 File for grid cell example). Replacing also the cultivars scales yields down and Parameterization-induced uncertainties and setup harmonization in global crop models increases variability as GEPIC plants the low-yielding, drought-sensitive cultivar 4 in a larger number of countries (Fig 1A and 1C). Introducing the gCoeffN parameters into the setup ( Fig  7B) increases yields in all cultivar x management combinations and affects inter-annual yield dynamics whereas nutrient-related stresses decrease (Fig O, panel b in S1 File) due to more rapid turnover of organic matter (  eCoeffN vs gCoeffN). The slight increase in temperature (T) stress is hence a secondary effect due to the stress handling in the EPIC model selecting only the major limiting factor for biomass production on a given day (see Methods). The gCoeffW parameters in turn import little change on yield variability but slightly scale yields up for each CoeffN parameterization ( Fig  7C and 7D).
Replacing in a further step the static soil handling of EPIC-IIASA by the dynamic decadal runs of GEPIC (Fig 7E-7H) alters yield levels and inter-annual dynamics substantially with about 15% lower yields than in the corresponding eSoilP scenarios. Nutrient deficits become the dominant growth constraint, especially in combination with eCoeffN (Fig O, panel e,g in S1 File), which causes a slower release of nutrients from OM and higher volatilization of N. The higher P stress with gCoeffN ( Fig O, panel f,h in S1 File) is often a secondary effect of high N availability early in the simulation that causes more rapid P mining form the soil in low-P input regions and a concomitant increase in P stress.
Introducing in addition the soil parameters of GEPIC gSoilD into the setup combinations (Fig 7I-7P) results in an increase in yield estimates and changes in inter-annual yield variability in all scenarios (Fig 7A-7H vs Fig 7I-7P). This is driven by decreases in N stress and increases in P stress if a static soil profile eSoilP is employed or if the dynamic soil handling gSoilP is combined with eCoeffN ( Fig O, panel m,o in S1 File). The most significant difference between the soil parameterizations is in the estimation of hydraulic parameters field capacity (FC) and wilting point (WP) where EPIC-IIASA has typically higher values for the first and lower for the latter (Fig C in S1 File). Both parameters affect a wide range of processes in the EPIC model, among them the threshold for percolation of water and the optimal soil humidity for microbial processes (see Text A in S1 File). The gSoilD component hence allows for providing larger amounts of nutrients from OM as required soil humidity is reached earlier, but causes higher water stress as an effect of (a) lower water storage capacity and (b) higher model sensitivity to climate stresses caused by higher nutrient supply. In the combination of the static soil profile eSoilP and the parameter set gCoeffN (Fig 7J and 7L), nutrient stresses are virtually eliminated and yield estimates are foremost driven by climate (Fig O, panel d; Fig R, panel d in S1 File), potential biomass accumulation, and cultivar specification.
A correlation matrix of global area-weighted yields among all permutations (Fig S in S1 File) shows that the combination of eSoilD, gSoilP, and eCoeffN (Fig 7E and 7G) has the lowest agreement with the remainder of setups (Fig S in S1 File). In turn, the nutrient and OM turnover parameterizations and soil parameters of GEPIC (gCoeffN and gSoilD) as well as the static soil handling of EPIC-IIASA (eSoilP) render the GGCM resilient to changes in other setup domains (Fig T in S1 File), while the remaining setup domains show bimodal distributions and hence depend more strongly on interactions.

Impacts of setups on GGCM performance
The EPIC-based GGCMs show a mixed performance in the fullharm setups (Table 4). GEPIC and PEPIC, notably the two GGCMs considering a dynamic soil profile and erosion (Table 2), exhibit relatively poor skills in terms of countries in which they have best performance. However, all EPIC-GGCMs have a good performance in about half or more of the countries considered. If sufficient nutrients are supplied, the numbers of countries in which each GGCM is best performing or has at least a high performance, converge to about ±2, except for PEPIC in the latter case.
For the ten major maize producing countries, substantial variability in GGCM performance can be observed for both the fullharm (Fig 8A) and the harm-suffN (Fig 8B) scenarios. With the fully harmonized setup, EPIC-IIASA shows in most cases the best performance, followed by PEPIC, and finally EPIC-BOKU and GEPIC. All GGCMs show high performance in the USA and France and low performance for Indonesia or Mexico (see also Table K in S1 File). With sufficient nutrient supply, the best performing GGCMs change in various countries, primarily those with overall low to moderate time-series correlation, such as Brazil, Indonesia, and Mexico with decreases in ensemble performance in the latter two (Fig 8B). For some countries in which at least one GGCM has a high performance in the fullharm scenario, several EPIC-GGCMs achieve better results with sufficient nutrient supply, i.e. in Argentina and to a lesser extent in India (Table K in S1 File).   Table K in S1 File.

Table 4. Numbers of countries (out of 99 for which benchmark data and GGCM outputs are available) in each harmonized setup scenario, in which each EPIC-based GGCMs has the highest (column "best") performance compared against reported yields within the EPIC ensemble and all countries (column "all") in which the correlation coefficient is significant at p<0.1 and positive.
https://doi.org/10.1371/journal.pone.0221862.g008

Parameterization-induced uncertainties and setup harmonization in global crop models
Accordingly, also the permutation of setup domains greatly affects the performance (Fig 9). The maximum correlation coefficient increases in all countries compared to the basic fullharm setups of the two GGCMs (Fig 8A), except for Argentina (with slightly lower r due to different digit precisions in output files used in GEPIC). In various countries, the setup with the highest correlation coefficient also exceeds the highest value of the EPIC ensemble (Fig 8A), apart from China and France.
The sensitivity of the GGCM to the selection of setup domains differs considerably among countries with distinct patterns of positive or adverse impacts (Fig U in S1 File). In the USA, the correlation is considerably high with any setup whereas the OM and nutrient cycling parameterization eCoeffN typically provides a higher correlation (Fig U, panel a in S1 File). The opposite is the case for the corresponding GEPIC component gCoeffN. For China in contrast, the performance is overall moderate and most sensitive to soil parameters but also cultivar definitions (Fig U, panel b in S1 File). In Argentina and India (Fig U, panel c,d in S1 File), two countries with low reported fertilizer rates (Table K in S1 File), performance shows a very high sensitivity to setup specifics ranging from r<0 to r>0.8 with lowest results for soil handling from GEPIC (gSoilP). Most setup components however show wide or bimodal distributions indicating that relative performance of single domains depends on the combination with other parameters.

Effects of harmonization on GGCM agreement
At the pixel level, harmonization expectedly decreases differences between simulated means and increases correlation among the EPIC-based GGCMs in most parts of the world (Figs 3  and 4), driven by harmonized growing seasons and more importantly increasing conformity in nutrient supply. Considering that EPIC-GGCMs agree on average in the majority of grid cells with respect to inter-annual variability (Table I in S1 File) and have low remaining differences in means if nutrient deficits are eliminated (Table G in S1 File) shows that this element causes here the greatest impact in GGCM convergence. This is most apparent under irrigated conditions where the agreement in inter-annual variability improves to a minimum level of significance in 7% of grid cells form default (17%) to fullharm (24%) setups and in >45% from fullharm to harm-suffN (69%). Yet, substantial differences remain in inter-annual variability in about one third and CV av is > 30% in more than a quarter of grid cells highlighting that setup components other than nutrient supply and growing seasons cause lower but still significant impacts on yield estimates. The fact that CV av among GGCMs increases in few regions (e.g. Fig I in S1 File) indicates that data used for harmonization will require further evaluation and continued improvement to best reflect actual on-ground conditions. E.g., the harmonized growing seasons for Southern China reflect a side season in winter, while typically summer maize is planted in the region [87], which may explain the low simulated yields. Vice versa, the fact that GGCMs exhibit highly diverse response to harmonization implies that also GGCMs will need to be iteratively adapted to harmonized input data in future experiments, while it was in this experiment prescribed not to adapt default parameterizations in the harmonized scenarios [34].

Effects of setup domains on GGCM agreement and performance
Cultivar distributions. Cultivar distributions (Fig 1) resulted mostly in the scaling of yields (Fig 7) due to different parameterizations of the upper and lower HI coefficients as PHU are prescribed by growing season input data and climate. The lower HI boundary is approached under water stress and hence renders the EPIC model more susceptible to water deficits during the reproductive growth stage [58], affecting also the magnitude of inter-annual yield variability but hardly inter-annual dynamics as such (Fig 7). Additional effects may still incur from cycling of plant residues, whose amounts depend on the fraction of crop removed at harvest. Planting a cultivar with lower base and optimum temperatures (Fig 1A) contributes to higher yield estimates from EPIC-IIASA in most of temperate Europe and Russia (not shown). Overall, the effect of cultivar distributions on GGCM agreement for inter-annual yield variability (Table J in S1 File) is lower than that of nutrient supply (Table H in S1 File) but can have substantial impacts in temperate and tropic regions.
Soil attribute estimation, soil handling, and crop nutrient supply. Among the two EPIC-GGCMs compared in detail, soil parameters differ hardly in primary characteristics, but considerably in the derived hydrologic properties FC, WP, and K S (Fig C in S1 File), which can either be input or estimated by EPIC based on various routines (Table 2). These parameters have a substantial effect on inter-annual yield variability (Fig 7E and 7M). They affect modeled drought stress by defining soil water storage capacity and hence the plant available water volume, and also modulate nutrient availability from soil OM mineralization, especially if a dynamic soil profile is assumed. Estimation methods for FC and WP depend on the set of soil samples for which they were developed [88,89]. At present, there is no single optimal method for deriving soil hydraulic parameters globally [90] with none of the methods employed here providing the best performance across the majority of the main producing countries (Fig 9) and substantially varying sensitivities to these parameters among regions ( Fig  U in S1 File).
Whether soils are treated dynamically with transient carry-over of all soil properties or statically with annual re-initialization of soil properties greatly influences performance (Table 4 and Figs 8 and 9). Especially GEPIC and PEPIC, which have dynamic profile handling and consider soil erosion processes, exhibit high performance in the least number of countries for the fullharm setup, which improves if sufficient nutrients are supplied ( Table 4). The GGCMs showing the best performance in terms of total number of countries with good performance in the fullharm scenario and the lowest increase in the harm-suffN scenario are EPIC-BOKU and EPIC-IIASA, both of which employ static soil profiles. In the case of EPIC-IIASA the setup results in a dominance of climatic stresses over nutrient supply (Fig O, panel a in S1 File) compared to GEPIC, which has substantial nutrient limitations in low-input regions (Fig O, panel p in S1 File). The findings are in line with an earlier performance evaluation of the ensemble investigated here [49], which also showed that GGCMs with a static soil profile exhibited often better skills in reproducing national inter-annual yield variability.
Hence, while dynamic soil profiles are essential for assessing agricultural externalities such as soil OM dynamics as well as climate change adaptation and mitigation options [21,22], static soil profiles and even more so sufficient nutrient supply appear key for obtaining high performance in terms of inter-annual yield variability. Obviously, inter-annual dynamics of nutrient supply from OM dominate-especially in the case of limited fertilizer supply-over interannual climate dynamics, hampering the signal of the latter in inter-annual yield dynamics. This is perspicuous, as the limited number of pixels in a given country at the employed spatial resolution cannot be expected to reflect OM dynamics on actual farmland. Rather, the latter must be assumed to level out at the country-scale to a certain extent, leaving a signal that is to a larger or smaller extent climate driven [67] and can hence be better picked up by GGCMs not considering soil dynamics.
This poses a dilemma for GGCM performance evaluation and applications to crop-soilmanagement studies. It may lend itself to assess GGCM performance with respect to climatedriven inter-annual yield variability using a static soil profile and/or sufficient nutrient supply, but to use dynamic soil profiles and business-as-usual fertilization rates for impact and adaptation assessments. However, the latter requires additional evaluations to ensure the correct representation and parameterization of relevant processes, which is as of now strongly limited by lack of globally representative data.
Soil organic matter and hydrologic process parameterization. The parameterizations of soil OM and nutrient turnover encompass in various cases recommended parameter ranges [91] and hence a wide range in assumptions on microbial process dynamics. For example, coefficients for slow to passive humus partitioning (Table 2, parameter 23) or N volatilization (parameter 26) cover ranges of one to two orders of magnitude. Values for both parameters are at the edges of recommended ranges [91] and hence bracket extreme cases. Herein, the parameterization of soil OM-related processes was found to be of considerable importance concerning GGCM performance, which depends strongly on this set of parameters in few countries (Fig U in S1 File) but with no clear superiority of either setup from EPIC-IIASA or GEPIC (Fig 9). Yield magnitudes and inter-annual dynamics are greatly affected as well depending on which parameter set is selected (Fig 7). Various field and regional studies have shown that OM partitioning and turnover parameters for CENTURY [60], which is at the core of the OM cycling routines of EPIC and a range of other crop models such as DSSAT or Day-Cent, are subject to substantial uncertainty as OM pools are conceptually represented and hence not feasible to measure [92]. Therefore, these parameters typically require calibration to regional conditions, which is presently lacking within the ensemble and generally difficult at the global scale due to limited data availability. GGCM setups may here need to be informed by cross calibration based on representative site data [93] beyond crop yields.
We found a rather low sensitivity to hydrologic model domains (Figs 7 and 9; Fig T; Fig U  in S1 File). In a recent study, Liu et al. [31] reported that the PET estimation method PM provides the best results globally within the otherwise constant setup of PEPIC with respect to national crop yield estimates, albeit with overall minor differences compared to using the HG method. As for soil OM cycling, methods for hydrologic processes such as algorithms for PET estimation have often been developed for specific regions or require local calibration for optimal performance [94], which is as well presently not implemented in the ensemble members.
Crop management operations. The substantial impact of crop nutrient supply as such on agreement and performance of GGCMs has already been discussed above. General management coefficients for irrigation water and fertilizer application ( Table 2, parameter 27-29) follow pragmatic assumptions rather than the representation of actual farming systems. Low trigger thresholds (i.e. high values for parameters 27 and 29) allow for a more rapid plant stress reduction but cause in the case of suboptimal fertilizer application rates an earlier consumption of the annual maximum rate, especially if also a low trigger threshold is selected for irrigation water application, which can result in stronger leaching. Low fertilizer application thresholds in contrast do not allow for full plant stress reduction in the harm-suffN scenario. In contrast to the four other EPIC-based GGCMs, PEPIC was set up with a rigid timing of fertilizer application, which likely contributes to the negative yield trend observed. A study based on the same GGCM with automatic fertilizer application of the same annual amounts resulted in overall higher yields [95].
The field operations compared for GEPIC and EPIC-IIASA (Table C in S1 File) differ most substantially in the removal of plant residue and row spacing, affecting long-term nutrient availability and potential biomass estimation, both of which depend in practice on socio-economic decisions and prevailing practices on-farm or locally.

Representativeness of setups for global agricultural systems
The inherently different assumptions on agro-environmental systems parameterization within the EPIC-based GGCM ensemble result in greatly varying crop growth conditions and limitations. For EPIC-BOKU the average global yields in the default setup indicate the virtual elimination of nutrient stresses (Fig 2A). Similarly, EPIC-IIASA exhibits in its fullharm setup dominant climate stresses (Fig O, panel a; Fig R, panel a in S1 File), which can be assumed to be highly similar in the default setup that is based on nearly identical input data (Text C in S1 File) and shows only slightly lower global average yields (Fig 2A). For GEPIC in contrast, productivity on large parts of the global cropland-especially in the tropics-is limited by nutrient deficits (Fig R, panel b in S1 File), which can also be expected for PEPIC based on the identical cultivar distribution, similar global average yield levels (Fig 2A), and the consideration of dynamic soil handling (Table 2). EPIC-TAMU presents a compromise including soil OM dynamics-with rather rapid turnover rates and moderate N volatilization (Table 2)-over time but no water erosion. The substantial differences in nutrient deficits and associated climate sensitivity can hence also explain why in two earlier studies EPIC-BOKU responded more sensitively to climatic change than GEPIC [6], and EPIC-BPOKU and EPIC-IIASA provided more explained variance for weather-induced yield variability than GEPIC [96]. Thus, the large variation in yield projections from various GGCMs seen in e.g. Rosenzweig et al. [6] for a smaller ensemble may be as much due to setup as to model structural differences.
However, due to lack of spatial parameterizations except for a limited number of cultivars, none of the GGCMs can be expected to represent a globally optimal setup. The improved GGCM performance if setup domains are permutated (Fig U in S1 File and Fig 9) and the presence of all EPIC-based GGCMs among the top performing GGCMs (Fig 8 and Table 4) highlight that no globally uniform parameterization performs optimally for all countries. Instead, the EPIC ensemble allows for covering a range of uncertainties that may exist in a given pixel or the sub-grid scale [97] below the spatial resolution of 0.5˚x 0.5˚, e.g. through the representation of low-input or high-input agricultural systems and associated plant growth limitations, which coexist in close proximity especially in transition countries [98]. The diversity in cultivars, soil types, and tillage regimes potentially occurring at the sub-grid level, however, cannot be assumed to be fully covered by the range of setups present in the ensemble. In addition, all GGCMs assume high-input agriculture in regions with high reported average fertilizer application rates (Fig E in S1 File), which neglects potential imbalances and heterogeneity in input systems present in these regions. For a more targeted representation of agricultural systems within the ensemble, setups may rather need to be defined in a systematic way to cover distinct agricultural production systems for impact studies.
Increasing the spatial resolution of input data may suggest a viable alternative to bracketing uncertainties at the sub-grid scale. Although this has been shown to improve performance or at least affect yield estimates, especially in environmentally heterogeneous regions with respect to soils, climate, and topography (e.g. [29,43,99]), it can at present not be expected to improve the representation of management practices. Global data on cultivar distributions even between broad regions are lacking, which requires assumptions by modelers (cf. Fig 1). The same is the case for tillage practices including residue management, which may also vary in time subject to farmers' management preferences, economic opportunities, and incentives for specific agricultural practices. Albeit spatial data on representative management practices such as tillage systems are in the process of being compiled globally [100] and remote sensing products may allow for spatial attribution of field management practices in the future [101,102], employing an ensemble to bracket uncertainties appears the most robust approach meanwhile. Nevertheless, increasing the spatial resolution will be required to better reflect actually cultivated soils [27] or microclimate [99] at the sub-grid scale. However, the fact that high performance could be achieved herein for the majority of countries (Figs 8 and 9; Fig AB in S1 File and Table 4) indicates that the presently employed resolution is in most regions sufficient for obtaining robust nationally weighted estimates.

The EPIC-based GGCMs in the context of a wider ensemble
Performing key analyses also for a wider ensemble of six GGCMs based on other core models (Text D; Figs U to AE and associated tables in S1 File) shows in various aspects contrasting results. While global average yield estimates converge among the EPIC-based GGCMs, they diverge in the wider ensemble (Fig V in S1 File). Importantly, the spread among EPIC-based GGCMs and the non-EPIC-based in the default setup is highly comparable indicating that the differences in input data, parameterization, and management assumptions are for absolute yield levels of similar importance as core model selection itself. The wide range in estimates for non-nutrient limited yield potentials (Fig V, panel f in S1 File) can be attributed to very high or low cultivar productivity in specific GGCMs due to core model processes as such and cultivar parameterizations. The increase in absolute yields from the fullharm to harm-suffN scenario, which is substantial for some GGCMs as indicated by the threefold increase for PEGASUS, shows that also most of the non-EPIC-based GGCMs are moderately to highly responsive to nutrient supply, but that their sensitivities vary greatly.
Similar to absolute yield estimates, the range of correlation coefficients of global mean inter-annual yield variability compared to FAOSTAT reported data increases with harmonization for the non-EPIC-based sub-ensemble (Fig AD in S1 File) due to decreasing skill for some GGCMs. The EPIC-based sub-ensemble has here a comparably constant range from default to fullharm and a narrower in the harm-suffN scenario. Both metrics indicate that harmonization tends to improve the agreement among GGCMs with the same structure but not among structurally different GGCMs, which vary in their sensitivities to specific setup components such as nutrient inputs and have partly been calibrated in their default setups to reflect reported crop yields [49]. Except for the default scenario, these results are in line with a recent field-scale study finding that crop model structure has a greater impact on deviations than parameterization [103]. Yet, the default scenario is not comparable here, as ground conditions such as management and soil are typically well known at the field scale but pose substantial uncertainty in global studies.
A common finding across the whole ensemble is that GGCM performance for major producing countries is highest for GGCMs with either a static soil profile or sufficient nutrient supply and in most cases performance of individual GGCMs is higher with sufficient nutrient supply compared to the fullharm setup (Fig AB in S1 File). E.g., PEGASUS only considers hydrology in its soil module but has a multiplicative plant stress impact combining temperature, water, and nutrient limitations from insufficient fertilizer application [104] resulting in higher performance in most countries in the harm-suffN setup (Fig AB in S1 File). As pointed out above, this suggests that GGCMs should not only be evaluated for business-as-usual fertilizer application but also sufficient nutrient supply if inter-annual yield variability is the benchmark. The increasing spreads in mean yields and correlations with reported yields (Fig V and  Fig AD in S1 File), partly caused by very low yields or negative trends in yields, furthermore suggest that while most GGCMs have been positively evaluated in their default setups and partly been calibrated to reproduce reported yields, the resultant default parameterizations may not be compatible with the harmonization approach taken here. This suggests that harmonization should be an iterative process with re-evaluation of GGCMs.
The agreement in inter-annual yield variability among GGCMs at the pixel level is far lower for the whole ensemble (Fig W in S1 File) or the non-EPIC based GGCMs (Fig Y in S1 File) than for the EPIC-based GGCMs alone and only shows a trend in improvement with increasing harmonization (Table M in S1 File), which is not evident if the EPIC-based GGCMs are not considered (Table O in S1 File). The substantial divergence even in the harm-suffN scenario, where under non-nutrient limited conditions plant phenology and photosynthesis dominate biomass accumulation and yield formation, indicates that these processes-differing greatly among GGCMs and core models (Table A in S1 File)-exhibit differences that exceed the impact of nutrient supply in the EPIC-based sub-ensemble (Table I in S1 File). Spatially, the agreement is best in temperate and cold regions of the northern hemisphere with clear climatic constraints of growing seasons, which impacts planting and harvest in the default setups but may also affect cultivar definitions carried over to the harmonized setups. Similar to the EPIC-based GGCMs, agreement is poor across the tropics. Interestingly, the non-EPIC-based GGCMs also show little agreement in arid regions, indicating substantial differences in crop water use and/or soil hydrology, which can result either from process implementations in the core models or from divergence in soil input data.
Albeit GGCMs have partly been harmonized in this study, it is at this point not feasible to attribute differences in GGCM outputs to specific setup domains ranging from input data to parameterization and process representation. A basic evaluation (Fig AC in S1 File) indicates that GGCMs with site-based core models and with prescribed leaf area development have a tendency towards higher skill in reproducing average global inter-annual yield variability. However, due to the sample bias caused by the multiple EPIC-based GGCMs, robustly identifying model routines and structures providing high skill will require more scrutiny. As pointed out below, this will also require further harmonization or in-depth analyses for contrasting locations.
Finally, evaluations of the multi-GGCM mean (MGM) for simulated vs reported global inter-annual yield variability show that the MGM has typically a higher score than individual GGCMs across (sub-)ensembles and scenarios (Fig AD in S1 File), analogously to earlier findings for an ensemble of structurally different field-scale wheat models [19]. An exception occurs for the harmonized setups of the EPIC-based sub-ensemble in which some GGCMs have higher skill. The highest time-series correlation coefficient occurs for the MGM of the whole ensemble in the default and harm-suffN setups, but is marginally surpassed by the MGM of the non-EPIC-based GGCMs in the fullharm scenario. These evaluations indicate that the MGM of the whole ensemble has a higher gain from including GGCMs with structural differences rather than multiple configurations of the same core model. The resilience of MGM for the whole ensemble to exclusion of specific GGCMs (Fig AE in S1 File) furthermore shows that the incompatibility of single GGCMs with harmonization does hardly affect the ensemble mean.

Conclusions for global crop model and ensemble applications
Extension of sensitivity analyses and model calibration. Our findings highlight the importance of parameter choices in global-scale crop modeling studies that have not received much attention so far. While recent years have seen a vast growth in sensitivity analyses and calibration efforts of crop models at the field, local, and regional scales [103,[105][106][107] with increasing methodologic sophistication [108], GGCMs have been subject to such studies only to a very limited extent (e.g. [109,110]). In all cases, a focus is typically on directly plant growth related parameters (e.g. photosynthesis, leaf development, or temperature response) or these are identified as the most sensitive variables.
Acknowledging the narrow parameterization of cultivars herein, our findings show that other setup components such as soil processes and parameters have substantial impact on yield estimates (Fig 7; Fig T; Fig U in S1 File). Such parameter domains should hence be included in future sensitivity analyses in order to derive a full picture of model sensitivity globally. The same applies to calibration efforts as uncertainties related to agro-environmental processes and management are otherwise projected unto plant or cultivar coefficients and will likely affect impact assessments or evaluations of adaptation measures. At the regional scale, the sensitivity of EPIC to soil and management parameters alone has been evaluated in a very recent impact study addressing climate change and soil degradation in Europe [33], indicating high sensitivity to soil and management parameters, which partly outweigh climate impacts.
Bracketing uncertainties and potential for further harmonization. The GGCM parameterizations by individual modeling groups do not reflect poor vs. careful parameterization but the lack of reference and input data on many aspects of agricultural production systems, such as soil and crop management. The uncertainty in management and soil parametrization is a problem that is specific to large-scale and global gridded crop model applications, while fieldscale assessments have typically precise information on management practices, variety characteristics and soil properties. Still, the importance of these parameters for simulation results, as shown here, demonstrates that simple extrapolation of a small set of site-specific parameterizations is not possible. Global-scale applications are thus challenged to improve on the representation of management and soil conditions. In addition, applications of crop models at all scales need to better understand and represent the dynamics in environmental (e.g. soil degradation) and management (e.g. variety selection) conditions. Thus, employing a large GGCM ensemble with substantial differences in setups presents an asset as it allows for covering some of the uncertainties in the range of possible environment x management conditions, but the management aspects need to be better represented in models by process implementation and better informed by suitable input data.
Whether to further harmonize GGCM setups for ensemble runs is primarily a question of whether (a) GGCMs are being evaluated with respect to procedural differences (comparison studies) or (b) ranges of agro-environmental conditions are to be covered (impact assessments). New datasets relevant for GGCMs are continuously becoming available and may aide in reducing or attributing uncertainties within the ensemble for case (a).
Recently, Gbegbelegbe et al. [111] have presented a global distribution of representative wheat cultivars for crop modelling based on extensive data on physiology and global agro-climatic zoning. The same is in principle feasible for maize using agro-climatic mega-environments [112]. Cultivar specifications, however, are difficult to translate between models and will therefore require new data management tools as proposed by Porter et al. [113]. Global tillage practices-albeit not relevant for all GGCMs (Table A in S1 File)-are presently being compiled [100] and will likely become publically available in the near future.
Soil properties lend themselves to be harmonized to avoid differences in nutrient supply in low-input regions from SOM mineralization and especially differences in soil hydrology once appropriate data become available. The derived soil parameters FC, WP, and K S , which were here shown to play a major role in drought response and soil microbial processes, are used directly or have a counterpart in all GGCMs. For optimal results, they need to be estimated based on regionally calibrated functions and algorithms [90]. Albeit a new global soil database provides these estimates, they were derived based on soil samples foremost from sub-Saharan Africa and can hence not be considered representative globally (update of [114]), while a dataset specifically calibrated for Europe exists [115]. Spatially parameterizing microbial turnover of soil OM and nutrients in turn appears not feasible in the near term due to lack of comprehensive field data.
Implications for the wider ensemble and ensemble studies. Several of the conclusions stated above also apply to the wider ensemble of this study, such as the potential for further harmonization of input data and setups on the one hand, but also the requirement for iterative GGCM parameterization and re-evaluation after harmonization on the other. I.e., local and global increases in deviations with harmonization underpin that this is not a trivial process. Our results suggest that common input data, foremost growing seasons, will require more scrutiny to ensure they reflect the most common practices. Within this study, the protocol mandated that default parameterizations remain constant among scenarios to study the effect of harmonization without confounding impacts of adjusted parameters [34]. Yet, unanticipated GGCM behavior, e.g. resulting in lower than expected yields or negative yield trends despite sufficient nutrient supply (Fig F, Fig W, Table L in S1 File), suggests that in future studies employing harmonized ensembles, GGCMs should be re-evaluated after harmonization to ensure valid responses to changes in input data and crop management. Noteworthy, our findings are robust against excluding individual GGCMs (Text D in S1 File).
Further findings for the EPIC-based sub-ensemble will become relevant for other GGCMs with the progressing inclusion of crop-soil-management-related processes. The majority of non-EPIC-based GGCMs shows a moderate to strong response to nutrient supply from the fullharm to harm-suffN scenario, which can be substantial as in the case of PEGASUS. Accordingly, the progressing inclusion of agro-environmental processes such as transient nutrient cycling [39][40][41][42], presently only accounted for in the EPIC-based GGCMs within this ensemble, bears the potential to further exacerbate differences in GGCM outputs.
Evaluations of the EPIC-based sub-ensemble show that the inclusion of multiple configurations of the same core model has several merits in bracketing uncertainties. However, the MGM performance for different sub-ensembles (Text D in S1 File; Fig AE in S1 File) highlights that the ensemble mean has a higher gain from including structurally distinct GGCMs compared to varying configurations of the same core model. Yet, the range of GGCM skills and their divergence with harmonization suggest that GGCM setups are of high importance as well and performance of both single GGCMs and ensemble means are typically highest in the default setups for which most GGCMs have originally been set up and evaluated.
Investigating the role of specific model processes or process conceptualizations in detail is beyond the scope of this study and the underlying experiment. Grouping GGCMs by basic characteristics suggests higher skill in reproducing global average inter-annual yield dynamics for site-based GGCMs and those with prescribed phenology. However, this will need to be studied in more detail across scales in specifically designed experiments covering both parameterizations and process representations based on different conceptualizations.
Supporting information S1 File. The supplementary information (SI) is provided in one single PDF file. Text A. Relevant routines of the EPIC model. Text B. Differences between EPIC model versions v0810 and v1102. Text C. Description of EPIC-based GGCMs. Text D. Evaluations of the wider ensemble. Table A. Relevant characteristics of GGCMs in this study based on Müller et al. [30]. Table B. Legend for Table 2 in main paper with brief explanation for parameters differing among EPIC-based GGCMs. Table C Table 1 in main article). Table H. Same as Table G but excluding EPIC-TAMU. Table I. Fractions of grid cells [%] in which the median time series correlation among the EPIC-based GGCMs (Fig 4 in main article) fulfils a certain level of significance. Table J. Same as Table I but excluding EPIC-TAMU. Table K. Time-series correlation coefficient r for each GGCM in the ten major maize producing countries for the fullharm and harm-suffN scenarios ( Table 1 in main paper) and annual N fertilizer application rates for maize in each country. Table L Fig I. Long-term maize yield estimates for the EPIC-based GGCMs in two spatial units at administrative level 2 with increase in CV av after harmonization in Fig 3 of the  main body. Fig J. Median time-series correlation coefficient r for maize yields among EPICbased GGCMs compared to binned fertilizer application rates in the fully harmonized management scenario (fullharm) with sufficiently irrigated (a-d) or rainfed (e-h) water supply in each grid cell of four major climate regions. Fig K. Coefficient of variation (CV av ) among maize yield estimates in the harm-suffN scenario in grid cells in which either all GGCMs plant the high-yielding cultivars 1 or 2 (Fig 1 in main paper) or in which at least four GGCMs plant the low-yielding drought-sensitive cultivar 4 or in which cultivar types are mostly mixed. Fig  L. Median time-series correlation coefficient in the harm-suffN scenario in grid cells in which either all GGCMs plant the high-yielding cultivars 1 or 2 (Fig 1 in main paper and Table D) or in which at least four GGCMs plant the low-yielding drought-sensitive cultivar 4 or in which cultivar types are mostly mixed. Fig M. Frequency distribution of time-series correlation coefficients among EPIC-based GGCMs in each grid cell and for each management scenario. Fig  N. Global average rainfed maize yields over a 29 year period for 64 setup combinations based on the EPIC-IIASA and GEPIC setups ( Table 3 in main article). Fig O. Box-and-whisker plots of global averaged growth stresses over a 29 year period for 64 setup combinations based on the EPIC-IIASA and GEPIC setups under rainfed conditions ( Table 3 in main paper). Fig P. (a) Monthly total biomass and (b-c) stress occurrence for a single year in a randomly sampled grid cell of the US Corn Belt differing the managements of EPIC-IIASA and GEPIC (eManage/gManage) with otherwise identical setups (included in Fig 7A of the main article) and a static soil profile (eSoilP) . Fig Q. Annual (a-c) yields, (d-f) stresses, (g-i) water fluxes, and (j-l) nitrogen fluxes for a randomly sampled grid in Ukraine with low fertilizer application for three EPIC-GGCM setups differing in soil parameters (SoilD) and OM/nutrient cycling parameterization (CoeffN) using dynamic soil profile handling (gSoilP).  Table 1 Table 1 in main article). Solid and dashed lines at the top of each panel indicated the location of the major peak in the distribution for rainfed (dashed) or sufficiently irrigated (solid) simulations of each management scenario ( Table 1 in main article). Fig AB. Time-series correlation coefficients for all GGCMs with the fullharm and harm-suffN scenarios (x-axis) in the top ten maize producing countries (right y-axis) and the best performing GGCM/setup combination including the r value (left y-axis). Fig AC. Time-series correlation coefficients for GGCMs grouped by basic characteristics (Table A) for the three setup scenarios. Fig AD. Box-and-whisker plots of time-series correlation coefficients for single GGCMs against FOASTAT global reported yields grouped into the (sub-)ensembles "All GGCMs", "EPIC-based GGCMs", and "non-EPIC-based GGCMs" for each setup scenario (a-c). Fig AE. Box-and-whisker plots of time-series correlation coefficients for permutations of multi-GGCM means excluding one GGCM at a time against FOASTAT global reported yields for each setup scenario (a-c). (PDF)