Optimization of black-box models with uncertain climatic inputs—Application to sunflower ideotype design

Accounting for the interannual climatic variations is a well-known issue for simulation-based studies of environmental systems. It often requires intensive sampling (e.g., averaging the simulation outputs over many climatic series), which hinders many sequential processes, in particular optimization algorithms. We propose here an approach based on a subset selection in a large basis of climatic series, using an ad-hoc similarity function and clustering. A non-parametric reconstruction technique is introduced to estimate accurately the distribution of the output of interest using only the subset sampling. The proposed strategy is non-intrusive and generic (i.e. transposable to most models with climatic data inputs), and can be combined to most “off-the-shelf” optimization solvers. We apply our approach to sunflower ideotype design using the crop model SUNFLO. The underlying optimization problem is formulated as a multi-objective one to account for risk-aversion. Our approach achieves good performances even for limited computational budgets, outperforming significantly standard strategies.


Introduction
Using numerical models of complex dynamic systems has become a central process in many fields, including engineering or natural sciences. It is now an essential tool for water resource management, adaptation of anthropic or natural systems to a changing climatic context or the conception of new production systems.
Many times, the objective pursued by model users amounts to solving an optimization problem, that is, find the set of input parameters of the model that maximizes (or minimizes) the output of interest (cost, production level, environmental impact, etc.). Examples of such problems abound with environmental models, including water distribution systems design [1] or agricultural watershed management [2]. In agronomy, in the past two decades crop models have received a growing attention [3][4][5][6], as they can be used to help improve the plant performances, either through cultural practices [7,8] or model-assisted plant breeding [9,10]. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Within the wide range of potential approaches to solve such optimization problems, blackbox optimization methods have proven to be popular in this context [9,11], in particular because they are in essence non-intrusive: they only require pointwise evaluations of the model at hand (output value for a given set of inputs), as opposed to knowing the underlying mechanisms of the model, derivative information, etc. This greatly facilitates implementation and avoids developping taylored algorithms.
However, a well-known difficulty, shared by many agricultural or ecological models users, lies in dealing with climatic information. Many models require series of measures of precipitation, temperature, etc., as input variables: typically, a crop model requires day-to-day measures over the agricultural season. Those inputs are particularly crucial for agricultural or ecological models, for which the climate has a preponderant impact on the system. To avoid drawing conclusions biased by the choice of a particular set (e.g., year) of climatic data, scenarios approaches can be used, duplicating the analysis for a small number of distinct climates [12,13]. Alternatively, one may compute utility functions based on the model outputs over a (large) number of climatic datasets [14,15]. This avoids the potentially challenging task of identifying scenarios, but requires intensive computation, as the number of datasets must be large enough to obtain a stable estimation of the utility (see for instance [8] for a discussion on uncertainty propagation on agro-ecosystem models). This approach rapidly becomes computationally prohibitive if the analysis is embedded in an optimization loop, even for moderately complex models.
A natural solution is to treat the climate as a random variable, which allows the use of the robust (or noisy) optimization framework (see e.g. [1]). However, if readily available codes abound for continuous, box-constrained parameters and deterministic outputs, solutions become scarce for systems depending on stochastic phenomena. Besides, the problem formulation becomes more complex, as typically risk-aversion preferences need to be accounted for [8,15].
In this work, we propose to address the issue of propagating climatic uncertainties in an optimization algorithm with a reasonable computational cost. Our approach is based on a subset selection in a large basis of climatic series (Section 3.1). A non-parametric reconstruction technique is introduced to estimate accurately the distribution of the output of interest based on this subset (Section 3.2). Our solution is designed as non-intrusive and generic, i.e. transposable to most models with climatic data inputs and to most black-box optimization solvers, while allowing parallel computing.
As an application problem, which we use as a running example through this article, we consider the optimization of phenotypes of sunflower (or ideotype design, see [11] for a review of recent developments in this research area). Plant performance (e.g., yield), computed using the crop model SUNFLO [16], is maximized with respect to its morphological and/or physiological traits. To account for risk-aversion, the problem is formulated as a multi-objective one.
The rest of this paper is organized as follow: Section 2 briefly reviews previous work on phenotype optimization, describes the SUNFLO model and the multi-objective optimization formulation to solve the problem at hand. Section 3 is dedicated to the optimization algorithm. Sections 4 provides the experimental setup and 5 numerical results.

Sunflower phenotype optimization
In this section, we first describe briefly the SUNFLO model and corresponding climatic data. Then, we define an optimization problem to account for climatic uncertainty.

Model definition
SUNFLO is a process-based model which was developed to simulate sunflower grain yield (in tons per hectare) and oil concentration as a function of climatic time series, environment (soil and climate), management practices and genetic diversity [16]. It is implemented in the RECORD project [5] which is dedicated to agro-ecosystems studies. It allows to assess the performance of sunflower cultivars in agronomic conditions, hence can be used to perform varietal selection or design.
A sunflower cultivar is represented by a combination of eight genetic coefficients, which are the inputs to be optimized. They describe various aspects of crop structure or functioning: phenology, plant architecture, response curve of physiological processes to drought and biomass allocation. We assume that the coefficients can take continuous values between a lower and an upper bound, determined from a dataset of existing cultivars. The variables and their domain of variation are reported in Table 1.
As climatic inputs, SUNFLO uses daily measures over a year of five variables: minimal and maximal temperatures (T min and T max ,˚C), global incident radiation (R, MJ/m 2 ), evapotranspiration (E, mm, Penman-Monteith [17]) and precipitations (P, mm). Note that the model actually uses only the data corresponding to the cultural year (for sunflower, April to October, 180 days). Fig 1 provides an example of such data.
We denote x 2 X & R d a particular phenotype, X being the hyper-rectangle defined by the bounds, and c = {T min , T max , R, E, P} 2 O a given climatic series. Here, c can be seen as a matrix of size 5 × 180. Hence, the yield can be seen as a function of the phenotype and the climatic series: In Section 3, we denote by y(X, C) ≔ (y(x i , c j )) 1 i I, 1 j J the I × J matrix of yield values for a set of phenotypes X = {x 1 , . . .,x I } and a set of climatic series C = {c 1 , . . ., c J }. In the following, we consider that the set of climatic series O is discrete, since we use historic climatic data (as opposed to using a stochastic generator for instance [18]). We note: Card(O) = N the number of available series for the study.

Optimization under climatic uncertainty
From a farmer point of view, the objective would be to find a phenotype that maximizes the yield for the year to come, without knowing in advance the climate data. Let C be the climatic series of the upcoming year (the upper case denoting a random variable); we consider in the following that C is uniformly distributed over O. The yield y(x, C) is then also a random variable (which we denote in the following Y(x)), which makes its direct maximization with respect to x meaningless. A natural formulation is to maximize the yield expectation: However, in general, a farmer also wishes to integrate some prevention against risk in its decision. Such a problem is often referred to as robust optimization in the engineering literature (see for instance [19] for a review). In econometrics, this is many times handled by utility functions [20,21], which offer an automated trade-off between average performance and risk aversion. However, in order to obtain several solutions ranging from risk neutral to highly risk averse, a solution is to consider this problem as multi-objective, by introducing a second criterion to maximize that accounts for the risk [15]. We consider the conditional value-at-risk (CVaR, [22]), defined as: For the sunflower application, CVaR α is the average yield over the (N × α)-th worst years with the usual definition of the quantile: P½Y Q a ðYÞ ¼ a, and α 2 (0, 0.5]. Note that in general, a CVaR provides an information that is close to the one of a quantile, but enjoys better stability anbd regularity [22]. The multi-objective optimization problem is then:

Optimization with a representative subset
The two objective functions, E½YðxÞ and CVaR α [Y(x)], require running the SUNFLO simulator N times every time a new phenotype x is evaluated. Embedded in an optimization loop, which typically requires thousands to millions of calls to the objective functions, this evaluation step becomes prohibitive. We propose to address this problem by replacing the large climatic dataset O by a small representative set O K . O K is chosen prior to optimization (Section 3.1); then, the optimization algorithm is run using O K and a specific inference strategy (Section 3.2).

Choosing a representative subset of climatic data
To select our subset, we propose to define a distance (or, conversely, a similarity) between two climatic series, then choose a set of series far from each other using clustering algorithms.

Computing dissimilarities between climatic time series.
A classical tool for time series analysis is an algorithm called Dynamic Time Warping (DTW, [23]). In short, DTW computes distances between time series, allowing two time series that are similar but locally out of phase to align in a non-linear manner, by matching events within a given window. Fig 2  illustrates this concept on two Tmax curves. This feature is critical in our context, as for instance two rain time series containing both a large rain event at close dates will have a similar effect on crop, hence should be considered as close to each other.
Given two climatic series c i and c j , five distances can be computed with DTW, one for each variable: However, these distances are not sufficient, since two climatic series can be far from each other with respect to the DTW distance yet lead to the same outputs if some key features with respect to the model are similar. Unfortunately, these features are problem-dependent and in general unknown, even to experts. Hence, we propose to define a sixth, model-dependent distance. To do so, we choose first a small set of l inputs: B ¼ fx 1 ; . . . ; x l g. Typically, B can be chosen by Latin Hypercube Sampling (LHS, [24]) to "fill" the search space R d and obtain a large variety of input values. For this set, the output is computed for all the climatic series. Then, the model-based distance is simply the Euclidean distance with respect to y: Note that this distance strongly depends on the choice of B. Based on a space-filling design, it aims at being relevant on average over X, but might be misleading locally. We address this specific point in Section 3.3.
To avoid scaling issues and attribute equal importance to all variables, we use the normalization procedure described in [25], which works as follow: Let D be a N × N matrix of distances (with values d ij = d(x i , x j ), d ij = d jj and d ii = 0). We first compute a corresponding similarity matrix S, with values: Then, we normalize S with: and the normalized dissimilarity matrix " D has elements defined as: Finally, we combine the six dissimilarities into a single scalar using a convex combination: with a T min þ . . . þ a M ¼ 1. In the following, we use a M ¼ 1=2 and the other weights equal to 1/10, to give an equal weight to the model-based distance and the DWT distances.

Choosing a representative subset using classification.
Once the matrix of dissimilarities (δ ij ) 1 i, j N is computed, most unsupervised clustering algorithms can be used to split the set of climatic series O into subsets. However, a difficulty here is that the centroids of the clusters cannot be computed. Hence, we use a variation of the k-means algorithm that only requires dissimilarities to the centroids. We follow the approach described in [25].
The algorithm divides the set O into K classes C 1 ; . . . ; C K . A class C k contains N k elements fc k 1 ; . . . ; c k K g. Any element c 2 O is uniquely attributed to one class and we have:

Non-parametric reconstruction of distributions
Now, we assume that when a new input x is considered, one may compute the outputs corresponding to the representative set: y(x, O K ). The next step is to obtain accurate estimations of the objective functions E½YðxÞ and CVaR α [Y(x)] based on these values.
Computing directly the objective functions would lead to large errors, in particular for CVaR α [Y(x)], that requires an accurate representation of the tail distribution. A natural alternative is to infer a distribution on the small data set, and compute the objectives on the distribution. However, in the sunflower case the form of the empirical distribution (Fig 3) does not readily call for a given parametric model, and misspecifying the distribution shape may result with large bias.
Hence, we propose to infer the distribution using a non-parametric method, by re-using the data computed for the classification step, that is, the output matrix yðB; OÞ. To do so, we assume a mixture model for y (each component corresponding to a class C k ): f standing for the probability density function and Y k (x) being the output within the class k.
We decompose further Y k (x) as the sum of the output value at the representative element and a residual term: The intra-class distribution is then characterized by the residuals ε k (x), which determine the form, spread (or amplitude), and bias (i.e., difference between the average value and the value for the representative element). All these elements vary from one class to another, which advocates the use of non-parametric approaches. To model ε k (x), we introduce the weighted variance of the yield over the representative set: Note that for a new phenotype x, the only data available is y(x, ω j ). We then define averages of normalized residuals: and the output is approximated, with i uniformly taken from ⟦1, N k ⟧, by: Fig 4 illustrates the reconstruction technique for a given (randomly chosen) phenotype, by showing how the residuals corresponding to each class are used to obtain the estimated distribution. We can see that the range and shape of the residuals vary considerably from one class to another. Also, their distributions around the representative element differ: as the residuals do not have a zero mean, the value of the representative element is not necessarily central for each class. Comparing the reconstructed (Fig 4, left) and actual (right) distributions, we see that the mixture is globally the same on both graphs. Fig 3 shows the cumulative distribution function (CDF) of the actual yield and of two estimations: using the method described above and a simple parametric method, which consists in assuming a Gaussian distribution of the yield. The empirical CDF corresponding to the subset values only is also depicted, with unequal steps to account for the different number of elements in each class.
We first notice that the subset data is by itself insufficient to evaluate accurately the mean or the CVaR. Then, we see that the actual distribution does not seem to belong to a known distribution, and using a normal distribution introduces a large bias. Inversely, using a nonparametric reconstruction allows us to match the shape of the actual distribution.
In our study, we found that this reconstruction method provided a satisfying trade-off between robustness, simplicity and accuracy. Yet, many refinements would be possible at this point, for instance by introducing intra-class rescaling (different normalization for each class), bias correction, or using the distance from the phenotype x to the basis B.

Optimization: A two-step approach
Finally, the multi-objective optimization problem can be solved with any black-box algorithm for E½Ŷ ðxÞ and CVaR a ½Ŷ ðxÞ, withŶ ðxÞ a mixture ofŶ 1 ðxÞ; . . . ;Ŷ K ðxÞ. However, the objective estimates are based on the phenotype basis B, which is sampled uniformly over X to offer a general representation of the phenotype space. This feature is important at the beginning of the optimization to ensure that the optimizer does not get trapped into poorly represented regions. However, as the optimizer converges towards the solution, the search space becomes more narrow, and a substantial gain in performance can be achieved by modifying the estimates so that they are more accurate in the optimal region. The strategy we adopt to overcome this problem is to introduce, during the optimization procedure, a step of re-evaluation of averages of normalized residuals " ε k (Eq 8). First, we run the optimization with the initial basis B. Then, we select l new phenotypes from the obtained Pareto set to form B 0 and we evaluate yðB 0 ; OÞ and the new version of " ε k by performing N × l simulations. Finally, we restart the optimization with the new estimates of " ε k . We have found (Section 5) that this two-step strategy was sufficient on our problem, while relatively easy to implement.

Climatic dataset
We used historic climatic data from five French locations where sunflower crop frequent (Reims, Dijon and Lusignan that are in the north of the country, and in Avignon and Blagnac in the south, see

Climate subset selection
We used the R package dtw [26] to compute the distances. The window size (that is, the maximum shift allowed) is the main influential parameter of the method. We used here expert Optimization of black-box models with uncertain climatic inputs knowledge to choose it based on the sunflower phenology: for precipitation, a window of ±3 days is used; for the other variables, a window of ±7 days is chosen (e.g. we considered that a weekly shift on temperature changes little the yield). The phenotype set B is chosen as a 10-point LHS; hence, for this step the method required 1,900 calls to the SUNFLO model.
Then, a k-means algorithm is run. Since it provides a local optimum only, we performed several restarts to achieve a good robustness. Several number of classes K were tested; we found empirically that K = 10 provided a satisfying trade-off between the representation capability of the subset and the computational cost during the optimization loop.

Optimization setup and comparison benchmark
To assess the validity of our approach, we conducted an empirical comparison with two alternatives: a random search and a black-box optimization, both based on the full set of climate series. We compare the different approaches based on an equal number of calls to SUNFLO (that is, we do not consider the other time costs related to each approach). We consider three budgets, which we refer to as large (380,000), medium (95,000) and small (23,750).
Our approach (denoted two-step in the following) uses the MOPSO-CD metaheuristic (Multi-Objective Particle Swarm Optimization with Crowding Distance, [27]), available in the R package mopsocd. Its two main parameters are the population size and number of generations (their product being equal to the number of function evaluations, i.e. the budget). A standard optimization based on all the series is also performed using the original MOPSO-CD algorithm (henceforth full MOPSO). Random search is based on optimized Latin hypercube sampling (LHS) to fill the search space X using the R package lhs.
For each budget, we define the number of iterations and the population size for the full and two-step approaches. As a rule-of-thumb, we set the number of iterations to approximately five times the population size [28]. For the two-step algorithm, each evaluation of the objectives requires ten SUNFLO runs (while the other approaches require 190 runs), which allows a larger population and number of iterations, but the initial and intermediate learning steps have a 10 × 190 cost. Table 2 reports the detailed setups for each budget and algorithm.
Since all the algorithms used are stochastic, each optimization experiment is replicated ten times to assess the robustness of the results.  Optimization of black-box models with uncertain climatic inputs of France (Avignon and Blagnac, hence on average hotter and drier), four from the north (4, 5, 7 and 10) and three with mixed locations. The average yield provide a complementary information on the clusters. Overall, the clusters cover a large range of yields, which distinguishes series from the same locations (1 and 5 or 4 and 9). Note that integrated quantities, such as evapotranspiration annual average do not explain well the different classes. Especially, there is a known high impact of rain episodes and their location in time, which may be "seen" by our composite distance, but is challenging to display here.

Phenotype optimization
5.2.1 Algorithm performance assessment. Now, we compare the performances of the approaches described in Table 2. We use a classical multi-criteria indicator (hypervolume) that measures the volume of the objective space behind the Pareto front (the larger the better, see for instance [29] for details).  For the small and medium budgets, the two-step approach clearly outperforms the other approaches (with the exception of two outliers with the small budget). For the large budget, we see that the regular MOPSO-CD performs slightly better, which is expected. Indeed, as soon as there is no necessity of parsimony, using approximate objectives instead of actual ones tends to slow down, rather than accelerate, convergence. However, we can conclude that our twostep approach with medium budget performs almost as well as the regular approach with large budget, hence for a quarter of the budget. In the small budget case, which is likely to happen when more expensive environmental models (time-wise) are considered, our approach still provides reasonably good results, while the classical optimization fails at providing a better performance than random search.

Sunflower ideotype analysis.
Finally, we characterize the results on the phenotype space. We compare here the Pareto set obtained by merging all the runs (which can be considered as close to the actual solution) with one run of the two-step method; we chose the run on the medium budget with the median performance. For readability, we only consider a subset of the Pareto set of size five, equally spaced along the Pareto front. The Pareto fronts and sets are represented in Fig 8. We can see first that considering both the expectation and CVaR for optimization leads to a large variety of optimal phenotypes. Looking back at the plant characteristics corresponding to those solutions, the optimum value for three traits had little variability, meaning that those traits were important plant characteristics for crop performance in the tested environments. Those traits depicted plants adapted to water deficit: a late maturity (TDM3), Optimization of black-box models with uncertain climatic inputs largest leaves at the bottom of the plant (LLH) and a conservative strategy for stomatal conductance regulation (TR). The five other traits (TDF1, TLN, K, LLS, LE) displayed variability in optimal values, which was identified as the basis of the performance/stability trade-off (expectation/CVaR). Here, the traits (except TLN) vary monotonically along the Pareto front.
Distinct plant types could be identified in the phenotype space. For example, the pink-red plant type had an early flowering (TDF1), a low light extinction efficiency (K) and a low plant leaf area (LLS); those characteristics correspond to a conservative resource management strategy. In an opposite manner, the light-blue type displays a late flowering, a high efficiency to intercept light and a larger plant leaf area, characteristics usually associated with a productive but risky crop type when facing strong water deficit [30]. The strategy associated with plant types identified from the phenotype space matched their position in the Pareto front, i.e the light-blue plant type was more efficient but less stable than the pink-red one.
The Pareto set obtained with the two-step method reproduces part of these features: the fixed traits are similar (except TLN, which is fixed to approximately 0.5, but this parameter is known to have little impact on the yield, see [16]) and the variation of TDF1 and LLS is wellcaptured. However, on this run the method failed at finding the variation of the K and LE traits: this probably explains why the largest mean and CVaR values (extremities of the actual Pareto front) are missed.
Overall, the two-step method allowed to identify the few key traits that are responsible for the cultivar global adaptation capacity as well as secondary traits that support alternative resource use strategies underlying the yield expectation/stability trade-off.

Summary and perspectives
In this article, we proposed an algorithm for the optimization of black-box models with uncertain climatic inputs, and applied it to the design of robust sunflower cultivars. Our approach does not require any a priori knowledge on the system besides parameter bounds, hence is usable with most simulators depending on similar climatic data. Using subset selection for the climates allowed us to reduce substantially the computational time without adding implementation issues. If bias correction seems inevitable during optimization, we showed that a twostep strategy was sufficient to achieve convergence: this point is critical as it allows our approach to be combined with any black-box multi-objective solver.
Nevertheless, we see many opportunities for further improvements. First, one could study the impact of the number of clusters on the results, which is a recurrent question with clustering methods. Second, a popular strategy to reduce the computational costs is to combine optimization with the use of surrogate modelling (see for instance [1]). Our approach straightforwardly extends to such approaches, and would result in very parsimonious algorithms that may be beneficial for expensive simulations. Finally, we have chosen here to use a two-step strategy to allow the use of "off-the-shelf" optimization solvers. Interlinking optimization and learning may improve substantially the efficiency of the method, although requiring the development of an ad hoc algorithm.
Supporting information S1 Dataset and Code. Data and software instructions for implementing computer experiments. The dataset used throughout this manuscript and the R code used to call the SUNFLO model and run the proposed algorithm are provided in a single archive file. (TGZ)