The authors have declared that no competing interests exist.

Accounting for interannual climatic variations is a well-known issue for simulation-based studies of environmental systems. It often requires intensive sampling (e.g., averaging the simulation outputs over many climatic series), which hinders many sequential processes, in particular optimization algorithms. We propose here an approach based on selecting a subset from a large basis of climatic series, using an ad hoc similarity function and clustering. A non-parametric reconstruction technique is introduced to accurately estimate the distribution of the output of interest using only the subset sampling. The proposed strategy is non-intrusive and generic (i.e., transposable to most models with climatic data inputs), and can be combined with most “off-the-shelf” optimization solvers. We apply our approach to sunflower ideotype design using the crop model SUNFLO. The underlying optimization problem is formulated as a multi-objective one to account for risk aversion. Our approach performs well even for limited computational budgets, significantly outperforming standard strategies.

Using numerical models of complex dynamic systems has become a central process in many fields, including engineering and the natural sciences. It is now an essential tool for water resource management, the adaptation of anthropic or natural systems to a changing climatic context, and the design of new production systems.

Often, the objective pursued by model users amounts to solving an optimization problem, that is, finding the set of input parameters of the model that maximizes (or minimizes) the output of interest (cost, production level, environmental impact, etc.). Examples of such problems abound with environmental models, including water distribution systems design [

Within the wide range of potential approaches to solve such optimization problems,

However, a well-known difficulty, shared by many users of agricultural or ecological models, lies in dealing with climatic information. Many models require series of measurements of precipitation, temperature, etc., as input variables: typically, a crop model requires day-to-day measurements over the agricultural season. Those inputs are particularly crucial for agricultural or ecological models, for which the climate has a preponderant impact on the system. To avoid drawing conclusions biased by the choice of a particular set (e.g., year) of climatic data, scenario approaches can be used, duplicating the analysis for a small number of distinct climates [

A natural solution is to treat the climate as a random variable, which allows the use of the robust (or noisy) optimization framework (see e.g. [

In this work, we propose to address the issue of propagating climatic uncertainties in an optimization algorithm with a reasonable computational cost. Our approach is based on a subset selection in a large basis of climatic series (Section 3.1). A non-parametric reconstruction technique is introduced to estimate accurately the distribution of the output of interest based on this subset (Section 3.2). Our solution is designed as non-intrusive and generic, i.e. transposable to most models with climatic data inputs and to most black-box optimization solvers, while allowing parallel computing.

As an application problem, which we use as a running example throughout this article, we consider the optimization of phenotypes of sunflower (or

The rest of this paper is organized as follows: Section 2 briefly reviews previous work on phenotype optimization, and describes the SUNFLO model and the multi-objective optimization formulation used to solve the problem at hand. Section 3 is dedicated to the optimization algorithm. Section 4 provides the experimental setup and Section 5 the numerical results.

In this section, we first describe briefly the SUNFLO model and corresponding climatic data. Then, we define an optimization problem to account for climatic uncertainty.

SUNFLO is a process-based model which was developed to simulate sunflower grain yield (in tons per hectare) and oil concentration as a function of climatic time series, environment (soil and climate), management practices and genetic diversity [

A sunflower cultivar is represented by a combination of eight genetic coefficients, which are the inputs to be optimized. They describe various aspects of crop structure or functioning: phenology, plant architecture, response curve of physiological processes to drought and biomass allocation. We assume that the coefficients can take continuous values between a lower and an upper bound, determined from a dataset of existing cultivars. The variables and their domain of variation are reported in

Genetic coefficient | Min | Max |
---|---|---|
Temperature sum from emergence to the beginning of flowering (TDF1, °C) | 765 | 907 |
Temperature sum from emergence to seed physiological maturity (TDM3, °C) | 1540 | 1830 |
Number of leaves at flowering (TLN) | 22.2 | 36.7 |
Light extinction coefficient during vegetative growth (K) | 0.780 | 0.950 |
Rank of the largest leaf of leaf profile at flowering (LLH) | 13.5 | 20.6 |
Area of the largest leaf of leaf profile at flowering (LLS, cm^{2}) | 334 | 670 |
Threshold for leaf expansion response to water stress (LE) | -15.6 | -2.31 |
Threshold for stomatal conductance response to water stress (TR) | -14.2 | -5.81 |

As climatic inputs, SUNFLO uses daily measures over a year of five variables: minimal and maximal temperatures (T_{min} and T_{max}, °C), global incident radiation (R, MJ/m^{2}), evapotranspiration (

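We denote c = (T_{min}, T_{max}, R, E, P) a climatic series, and (f(x_{i}, c_{j}))_{1 ≤ i ≤ I, 1 ≤ j ≤ J} the I × J matrix of model outputs for a set of phenotypes {x_{1}, …, x_{I}} and a set of climatic series {c_{1}, …, c_{J}}.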

In the following, we consider that the set of climatic series Ω is discrete, since we use historic climatic data (as opposed to using a stochastic generator for instance [

From a farmer's point of view, the objective would be to find a phenotype that maximizes the yield for the year to come, without knowing the climate data in advance. Let

However, in general, a farmer also wishes to integrate some protection against risk into the decision. Such a problem is often referred to as _{α} is the average yield over the (

The multi-objective optimization problem is then:

The two objective functions, _{α}[_{K}. Ω_{K} is chosen prior to optimization (Section 3.1); then, the optimization algorithm is run using Ω_{K} and a specific inference strategy (Section 3.2).
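To make the two objectives concrete, the following minimal sketch computes an empirical expectation and an empirical CVaR of the yield over a discrete set of climatic series. It assumes that the CVaR is the average of the worst fraction `alpha` of the yields; the exact convention for α used in the article is not reproduced here, and the yield values are made up for illustration.

```python
import math

def expected_yield(yields):
    """Empirical mean of the yields over a discrete set of climatic series."""
    return sum(yields) / len(yields)

def cvar_worst_fraction(yields, alpha):
    """Average of the worst `alpha` fraction of the yields (lower tail).

    Assumption: `alpha` is the tail fraction in (0, 1]; this may differ
    from the convention used for alpha in the article.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    n_tail = max(1, math.ceil(alpha * len(yields)))
    worst = sorted(yields)[:n_tail]  # lowest yields first
    return sum(worst) / n_tail

# Hypothetical yields (t/ha) of one phenotype under eight climatic series.
yields = [2.0, 3.5, 1.0, 2.5, 3.0, 1.5, 2.5, 4.0]
mean_y = expected_yield(yields)
cvar_y = cvar_worst_fraction(yields, alpha=0.25)
```

With this convention the CVaR never exceeds the mean, so a phenotype can only improve its CVaR by raising its yields in the least favourable years.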

To select our subset, we propose to define a distance (or, conversely, a similarity) between two climatic series, then choose a set of series

A classical tool for time series analysis is an algorithm called Dynamic Time Warping (DTW, [

The dotted lines represent the optimal matching computed by DTW, for a window size of seven days.

Given two climatic series c_{i} and c_{j}, five distances can be computed with DTW, one for each variable: d(c_{i}, c_{j})^{Tmin}, d(c_{i}, c_{j})^{Tmax}, d(c_{i}, c_{j})^{R}, d(c_{i}, c_{j})^{E} and d(c_{i}, c_{j})^{P}.
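For readers unfamiliar with DTW, a minimal sketch of the windowed version is given below for a single variable. It assumes an absolute-difference local cost and a band constraint (indices may not differ by more than `window`); the actual step pattern and cost used in the study may differ, and the short series are illustrative only.

```python
def dtw_distance(a, b, window):
    """DTW cost between two series, with matches restricted to |i - j| <= window.

    Assumptions: absolute difference as local cost, classical step pattern
    (match, insertion, deletion).
    """
    n, m = len(a), len(b)
    w = max(window, abs(n - m))  # the band must at least reach the last cell
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a[i-1] matched again
                                    cost[i][j - 1],      # b[j-1] matched again
                                    cost[i - 1][j - 1])  # one-to-one match
    return cost[n][m]

# The same "event" occurring one day apart in two toy series.
series_a = [0, 0, 1, 2, 1, 0, 0]
series_b = [0, 1, 2, 1, 0, 0, 0]
shifted_cost = dtw_distance(series_a, series_b, window=1)
rigid_cost = dtw_distance(series_a, series_b, window=0)
```

With a window of zero the measure reduces to a day-by-day comparison, whereas a small window lets events that are slightly shifted in time be matched at no cost.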

However, these distances are not sufficient, since two climatic series can be far from each other with respect to the DTW distance yet lead to the same outputs if some key features with respect to the model are similar. Unfortunately, these features are problem-dependent and in general unknown, even to experts. Hence, we propose to define a sixth, model-dependent distance. To do so, we choose first a small set of

To avoid scaling issues and attribute equal importance to all variables, we use the normalization procedure described in [_{ij} = _{i}, _{j}), _{ij} = _{jj} and _{ii} = 0). We first compute a corresponding similarity matrix

Finally, we combine the six dissimilarities into a single scalar using a convex combination:
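A convex combination of dissimilarity matrices can be sketched as follows. The weights and the tiny 2 × 2 matrices are purely illustrative assumptions; the weights actually used in the study are not stated in this passage.

```python
def combine_dissimilarities(matrices, weights):
    """Return sum_k w_k * D_k, with w_k >= 0 and sum_k w_k = 1."""
    if len(matrices) != len(weights):
        raise ValueError("one weight per matrix is required")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")
    n = len(matrices[0])
    return [[sum(w * mat[i][j] for w, mat in zip(weights, matrices))
             for j in range(n)]
            for i in range(n)]

# Six hypothetical normalized dissimilarity matrices for two climatic series.
matrices = [[[0.0, v], [v, 0.0]] for v in (0.2, 0.4, 0.6, 0.8, 1.0, 0.0)]
weights = [1.0 / 6.0] * 6  # equal weights, chosen here for illustration
combined = combine_dissimilarities(matrices, weights)
```

Because the weights are non-negative and sum to one, the combined matrix stays symmetric with a zero diagonal whenever each component matrix has these properties.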

Once the matrix of dissimilarities (d_{ij})_{1 ≤ i, j ≤ N} is computed, most unsupervised clustering algorithms can be used to split the set of climatic series Ω into subsets. However, a difficulty here is that the centroids of the clusters cannot be computed. Hence, we use a variation of the k-means algorithm that only requires

The algorithm divides the set Ω into K classes, the k-th class containing N^{k} elements. In each class, a representative element c^{k} is chosen to define the representative set, hence: Ω_{K} = {c^{1}, …, c^{K}}.
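One way to cluster from a dissimilarity matrix alone is a k-medoids-style iteration, in which each class is represented by one of its own members. The sketch below is an assumption about the general form of such an algorithm, not the exact variant used in the study; the function names, the number of restarts and the toy data are hypothetical.

```python
import random

def assign(dist, medoids):
    """Label each item with the position (in `medoids`) of its closest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(dist, k, n_restarts=10, max_iter=100, seed=0):
    """Return (cost, medoids, labels) of the best of several random restarts."""
    rng = random.Random(seed)
    n = len(dist)
    best = None
    for _ in range(n_restarts):
        medoids = rng.sample(range(n), k)
        for _ in range(max_iter):
            labels = assign(dist, medoids)
            updated = []
            for c in range(k):
                # Keep the current medoid if a class happens to be empty.
                members = [i for i in range(n) if labels[i] == c] or [medoids[c]]
                # New medoid: the member closest, in total, to the other members.
                updated.append(min(members,
                                   key=lambda m: sum(dist[m][i] for i in members)))
            if updated == medoids:
                break
            medoids = updated
        labels = assign(dist, medoids)
        cost = sum(dist[i][medoids[labels[i]]] for i in range(n))
        if best is None or cost < best[0]:
            best = (cost, medoids, labels)
    return best

# Toy example: six items forming two obvious groups on a line.
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
cost, medoids, labels = k_medoids(dist, k=2)
```

The medoids returned play the role of the representative elements: each one is an actual climatic series, so the simulator can be run on it directly.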

Now, we assume that when a new input _{K}). The next step is to obtain accurate estimations of the objective functions _{α}[

Computing directly the objective functions would lead to large errors, in particular for CVaR_{α}[

Hence, we propose to infer the distribution using a non-parametric method, by re-using the data computed for the classification step, that is, the output matrix ^{k}(

We decompose further ^{k}(^{k}(^{k}(_{j}). We then define averages of ^{k}⟧, by:

The colours show the different classes, and the vertical bars the output values for the representative elements.

We first notice that the subset data alone is insufficient to accurately evaluate the mean or the CVaR. Then, we see that the actual distribution does not seem to belong to a known family of distributions, and that using a normal distribution introduces a large bias. Conversely, using a non-parametric reconstruction allows us to match the shape of the actual distribution.
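The general idea of reconstructing a full distribution from a few representative runs can be illustrated with a deliberately simplified sketch. It is not the article's estimator, which relies on normalized residuals: here we assume plain additive residuals, averaged over a reference set of phenotypes, and all names and numbers are hypothetical.

```python
def reconstruct_outputs(reference, labels, medoids, rep_outputs):
    """Estimate the output for every climatic series from representative runs only.

    reference[i][j]: output of reference phenotype i under climatic series j
    labels[j]:       class of series j
    medoids[k]:      index of the representative series of class k
    rep_outputs[k]:  output of the NEW phenotype under the representative of class k

    Assumption: within a class, the offset between a series and its
    representative is roughly phenotype-independent, so its average over the
    reference phenotypes can be added to the new representative output.
    """
    n_ref = len(reference)
    estimates = []
    for j, k in enumerate(labels):
        rep = medoids[k]
        mean_residual = sum(row[j] - row[rep] for row in reference) / n_ref
        estimates.append(rep_outputs[k] + mean_residual)
    return estimates

# Two reference phenotypes, five climatic series split into two classes.
reference = [[1.0, 2.0, 3.0, 10.0, 12.0],
             [2.0, 3.0, 4.0, 11.0, 13.0]]
labels = [0, 0, 0, 1, 1]
medoids = [1, 3]
rep_outputs = [5.0, 20.0]  # new phenotype, simulated on the two representatives
estimates = reconstruct_outputs(reference, labels, medoids, rep_outputs)
```

The reconstructed values can then be passed to any estimator of the mean or of a tail statistic, at the cost of one simulation per class rather than one per climatic series.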

In our study, we found that this reconstruction method provided a satisfactory trade-off between robustness, simplicity and accuracy. Yet, many refinements would be possible at this point, for instance by introducing intra-class rescaling (different normalization for each class), bias correction, or using the distance from the phenotype

Finally, the multi-objective optimization problem can be solved with any black-box algorithm for

The strategy we adopt to overcome this problem is to introduce, during the optimization procedure, a step of re-evaluation of averages of normalized residuals

We used historic climatic data from five French locations where sunflower is frequently grown (Reims, Dijon and Lusignan in the north of the country, and Avignon and Blagnac in the south, see

We used the

Then, a k-means algorithm is run. Since it provides only a local optimum, we performed several restarts to improve robustness. Several numbers of classes

To assess the validity of our approach, we conducted an empirical comparison with two alternatives: a random search and a black-box optimization, both based on the full set of climate series. We compare the different approaches based on an equal number of calls to SUNFLO (that is, we do not consider the other time costs related to each approach). We consider three budgets, which we refer to as large (380,000), medium (95,000) and small (23,750).

Our approach (denoted

For each budget, we define the number of iterations and the population size for the full and two-step approaches. As a rule-of-thumb, we set the number of iterations to approximately five times the population size [

Optimization experiment | Budget | Nb of iterations | Pop size | Real nb of simulations |
---|---|---|---|---|
Random search | small | - | 125 | 23,750 |
Random search | medium | - | 500 | 95,000 |
Random search | large | - | 2,000 | 380,000 |
Full | small | 25 | 5 | 24,700 |
Full | medium | 50 | 10 | 96,900 |
Full | large | 100 | 20 | 383,000 |
Two-step | small | 71(×2) | 14 | 23,960 |
Two-step | medium | 152(×2) | 30 | 95,600 |
Two-step | large | 308(×2) | 61 | 380,780 |
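The simulation counts for the approaches based on the full set of climatic series can be checked with simple arithmetic. The sketch below assumes 190 climatic series, a value inferred from the table (23,750 / 125) rather than stated in this section, and assumes that the full optimization evaluates one initial population plus one population per iteration.

```python
N_CLIMATES = 190  # assumption: inferred from 23,750 simulations / 125 candidates

def random_search_calls(n_candidates, n_climates=N_CLIMATES):
    """Each candidate phenotype is simulated under every climatic series."""
    return n_candidates * n_climates

def full_optimization_calls(n_iterations, pop_size, n_climates=N_CLIMATES):
    """Assumption: initial population plus one population per iteration."""
    return (n_iterations + 1) * pop_size * n_climates

small_random = random_search_calls(125)
medium_random = random_search_calls(500)
small_full = full_optimization_calls(25, 5)
medium_full = full_optimization_calls(50, 10)
```

Under these assumptions the small and medium rows of the table are recovered exactly for both approaches.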

Since all the algorithms used are stochastic, each optimization experiment is replicated ten times to assess the robustness of the results.

Average yields, computed on the base

The average yields provide complementary information on the clusters. Overall, the clusters cover a large range of yields, which distinguishes series from the same locations (1 and 5 or 4 and 9). Note that integrated quantities, such as the annual average evapotranspiration, do not explain the different classes well. In particular, rain episodes and their timing are known to have a high impact, which may be “seen” by our composite distance but is challenging to display here.

Now, we compare the performances of the approaches described in

For the small and medium budgets, the two-step approach clearly outperforms the other approaches (with the exception of two outliers with the small budget). For the large budget, the regular MOPSO-CD performs slightly better, which is expected: when parsimony is not necessary, using approximate objectives instead of actual ones tends to slow down, rather than accelerate, convergence. However, our two-step approach with the medium budget performs almost as well as the regular approach with the large budget, that is, for a quarter of the cost. In the small budget case, which is likely to arise when more time-consuming environmental models are considered, our approach still provides reasonably good results, while the classical optimization fails to outperform random search.

Finally, we characterize the results on the phenotype space. We compare here the Pareto set obtained by merging all the runs (which can be considered as close to the actual solution) with one run of the two-step method; we chose the run on the medium budget with the median performance. For readability, we only consider a subset of the Pareto set of size five, equally spaced along the Pareto front. The Pareto fronts and sets are represented in

The bold symbols correspond to a subset of five optimal phenotypes that are shown on the figures on the right, where each line represents a phenotype.
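Merging the solutions of several runs into a single Pareto front amounts to keeping the non-dominated points of the pooled results. A minimal sketch is given below, assuming both objectives (expectation and CVaR of the yield) are maximized; the run results are made-up numbers.

```python
def pareto_front(points):
    """Non-dominated subset of `points` when both coordinates are maximized."""
    front = []
    for p in points:
        # p is dominated if another point is at least as good on both objectives.
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return sorted(set(front))  # remove duplicates, order by first objective

# Hypothetical (expectation, CVaR) pairs returned by two optimization runs.
run_a = [(2.0, 1.0), (2.5, 0.8)]
run_b = [(2.2, 1.1), (2.4, 0.7), (2.6, 0.5)]
merged_front = pareto_front(run_a + run_b)
```

This quadratic-time filter is sufficient for the small solution sets handled here; dedicated algorithms exist for larger fronts.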

We can see first that considering both the expectation and CVaR for optimization leads to a large variety of optimal phenotypes. Looking back at the plant characteristics corresponding to those solutions, the optimum value for three traits had little variability, meaning that those traits were important plant characteristics for crop performance in the tested environments. Those traits depicted plants adapted to water deficit: a late maturity (TDM3), largest leaves at the bottom of the plant (LLH) and a conservative strategy for stomatal conductance regulation (TR). The five other traits (TDF1, TLN, K, LLS, LE) displayed variability in optimal values, which was identified as the basis of the performance/stability trade-off (expectation/CVaR). Here, the traits (except TLN) vary monotonically along the Pareto front.

Distinct plant types could be identified in the phenotype space. For example, the

The Pareto set obtained with the two-step method reproduces part of these features: the fixed traits are similar (except TLN, which is fixed to approximately 0.5, but this parameter is known to have little impact on the yield, see [

Overall, the two-step method made it possible to identify the few key traits responsible for the cultivar's global adaptation capacity, as well as secondary traits that support alternative resource use strategies underlying the yield expectation/stability trade-off.

In this article, we proposed an algorithm for the optimization of black-box models with uncertain climatic inputs, and applied it to the design of robust sunflower cultivars. Our approach does not require any

Nevertheless, we see many opportunities for further improvements. First, one could study the impact of the number of clusters on the results, which is a recurrent question with clustering methods. Second, a popular strategy to reduce the computational costs is to combine optimization with the use of surrogate modelling (see for instance [

The dataset used throughout this manuscript and the R code used to call the SUNFLO model and run the proposed algorithm are provided in a single archive file.
