How to account for the uncertainty from standard toxicity tests in species sensitivity distributions: An example in non-target plants

This research proposes new perspectives for accounting for the uncertainty on 50% effective rates (ER50), used as interval inputs for species sensitivity distribution (SSD) analyses, and evaluates how including this uncertainty may influence the estimation of the 5% hazard rate (HR5). We explored various endpoints (survival, emergence, shoot dry weight) for non-target plants from seven standard greenhouse studies that used different experimental approaches (vegetative vigour vs. seedling emergence) and applied seven herbicides at different growth stages. Firstly, for each endpoint of each study, a three-parameter log-logistic model was fitted to the experimental toxicity test data for each species under a Bayesian framework, to obtain a posterior probability distribution for the ER50. Then, in order to account for the uncertainty on the ER50, we explored two censoring criteria to automatically censor the ER50, taking the ER50 probability distribution and the range of tested rates into account. Secondly, based on the dose-response fitting results and the censoring criteria, we considered input ER50 values for SSD analyses in three ways (point estimates chosen as ER50 medians, interval-censored ER50 based on their 95% credible intervals, and ER50 censored according to one of the two criteria), fitting a log-normal distribution under a frequentist framework to obtain the three corresponding HR5 estimates. We observed that the SSD fitted reasonably well when there were at least six distinct intervals among the ER50 values. By comparing the three SSD curves and the three HR5 estimates, we show that both propagating the uncertainty from the ER50 estimates and including censored data in SSD analyses often lead to smaller point estimates of the HR5, which is more conservative in a risk assessment context. In addition, we recommend not focusing solely on the point estimate of the HR5, but also looking at the precision of this estimate as depicted by its 95% confidence interval.


Introduction
Today, species sensitivity distributions (SSD) are established as a key tool for the environmental risk assessment (ERA) of chemicals [1,2]. They provide a reliable assessment of the range of sensitivities within a plant or animal community of interest [3] and thereby allow the estimation of indicators such as the 5% hazard concentration or rate (HC5 or HR5), that is, the concentration or rate expected to affect 5% of the species within the community. An estimate of the HC5 or HR5 can be obtained by fitting a probability distribution to a collection of toxicity values, such as 50% effective concentrations or rates (EC50 or ER50), which requires a statistically robust analysis. Toxicity values are usually derived from a regression model fitted to toxicity test data observed at several treatment levels at a target time point. This fit provides toxicity values as point estimates, but an uncertainty can also be associated with them, either through a confidence interval (under a frequentist framework) or a credible interval (under a Bayesian framework). Nevertheless, this uncertainty, as well as other sources of uncertainty [4], is rarely propagated to the HC5 or HR5 estimates afterwards. This motivated our study, supported by recent works that also recognise the usefulness of considering interval ecotoxicological data [5].
The SSD method is widely used in the field of non-target terrestrial plant (NTTP) studies for the purpose of assessing the risk of plant protection products [6]. NTTP are defined as non-crop plants located outside the treatment area according to the Guidance Document (GD) on Terrestrial Ecotoxicology [6]. In NTTP studies, treatment levels or exposure concentrations are instead called tested rates, corresponding to application rates in the field. We will therefore use the notations ER50 and HR5 hereafter.
For SSD analyses, up to ten NTTP species are usually studied for the ERA of pesticides. The tested rates are selected prior to the experiments, sometimes being the same for some or all of the chosen species. The highest tested rate usually corresponds to the highest authorised application rate of the herbicide in the field, which ensures the realism of the ecotoxicological evaluation with respect to agricultural practices. However, this highest tested rate might be too low to elicit large toxic effects (i.e., close to 100% effect, as is classically sought when setting the highest dose of a dose-response analysis for animals), especially for fungicides and insecticides. This specific point will be discussed in our paper. Consequently, unbounded right-censored ER50 values (namely ER50 values greater than the highest tested rate) can occur when the range of tested rates does not really match the observed sensitivity of a plant species, or when this species is not affected at the highest tested rate intended according to good agricultural practices. Note that such unbounded ER50 values may have been produced using a validated standard experimental procedure, so that there is no reason to question them. Additionally, the GD on Terrestrial Ecotoxicology does not provide any advice on how to deal with unbounded ER50 values, or with the uncertainty associated with the ER50 estimates, when performing SSD analyses [6]. As a consequence, the common practice is first to ignore the uncertainty by considering point estimates only (usually the mean estimate), and second, either to discard unbounded ER50 values from the analysis or to substitute them with arbitrary values (e.g., the highest tested rate), even if this is rarely done in practice. Nevertheless, proceeding in such a way is a clear loss of valuable information, with several drawbacks.
Ignoring uncertainty prevents accounting for low (i.e., in the lower tail of their probability distribution) or high (i.e., in the upper tail of their probability distribution) ER50 values that are statistically probable, thus leading to potentially biased HR5 estimates (i.e., either over- or under-estimated values). Discarding unbounded ER50 values may result in (i) a range of remaining ER50 values that does not cover the full range of sensitivities originally displayed by the set of chosen species (the most sensitive or the most tolerant species may, for example, be excluded, again producing biased HR5 estimates); and (ii) a reduced sample size: unbounded ER50 values can occur for many species, so that, after discarding them, the input data set might be too small to allow an SSD analysis to be performed. This latter issue is of great concern for risk assessment, since the SSD analysis is currently the only higher-tier option prescribed by the GD on Terrestrial Ecotoxicology and widely accepted by authorities. Thus, being unable to finalise an SSD analysis may prevent refining the risk assessment of some chemical substances. The GD on Terrestrial Ecotoxicology indeed considers that the SSD analysis is more suitable than a tier-1 approach (based on a single endpoint and a single species) to achieve the environmental protection goal, because it takes into account the available data on the sensitivity of several species [6]. Moreover, substituting unbounded ER50 values with arbitrary ones does not make much sense and may subjectively produce biased HR5 estimates.
Within this context, based on seven NTTP case studies, each including several data sets, we aimed to revisit SSD analyses by accounting for the uncertainty on ER50 values (referred to as interval-censored values hereafter) together with the inclusion of censored values, in particular right-censored values (corresponding to unbounded ER50 values), which commonly occur in toxicity tests on NTTP in practice. Left-censored values are rare, because the tested-rate range imposed by the standard protocols is better adapted to assessing effects on the more sensitive species. We also tried to quantify how both types of censored values may influence the final estimate of the HR5.

Materials
Laboratory experiment data sets on NTTP were available for seven case studies on products with various herbicidal modes of action (Table 1). Each study provides data from two toxicity tests: seedling emergence (SE) according to OECD guideline 208 [7] and vegetative vigour (VV) according to OECD guideline 227 [8]. For each study, 10 species (hereafter named using their EPPO code [9]; see S1 Table for common names of species) were exposed to a range of five tested rates of a product plus a control (i.e., absence of product), which were applied either to the soil surface (SE tests) or directly to the plants (VV tests). In addition, in study 4, extra experiments at lower tested rates were conducted for two of the species (CUMSA and LYPES) in the VV test, and in study 7, extra experiments at lower tested rates were carried out for two of the species (ALLCE and BEAVA) in the SE test. During the experiments, plants were observed for 21 days after day 0. Day 0 is defined as the day on which 50% of the control seedlings have emerged for SE tests, and as the day of application for VV tests. During the 21-day observation period, seedling emergence, seedling survival and visual injury in each replicate were followed weekly (at days 0, 7, 14 and 21) in SE tests, while plant survival and visual injuries were followed weekly in VV tests, also in each replicate. At the end of the experiments (in both SE and VV tests), shoots were cut off and dried, and the shoot dry weight was then measured in each replicate. For each study, five quantitative endpoints at day 21 were thus available: emergence, survival and shoot dry weight for SE tests, and survival and shoot dry weight for VV tests.

Methods
To assess the effects of the studied herbicides on NTTP, we first analysed the effects of the seven products on the five endpoints for each of the 10 species (that is, a total of 7 × 5 × 10 data sets) by fitting a dose-response model to the experimental toxicity test data, thus getting ER50 estimates for each data set. The modelling process was carried out under a Bayesian framework, which provides a posterior probability distribution for the ER50 that can then be used as a basis to quantify the uncertainty on the ER50. These ER50 values, together with their uncertainty, were then used as inputs for the SSD analyses leading to the HR5 estimates.

Dose-response model. For SE tests, the observed counts at day 21 (numbers of emerged or surviving seedlings) in each replicate i were assumed to follow a binomial distribution with a success probability defined as a function f of the tested rate x_i (Eqs (1) and (2)). For shoot dry weight, assuming that the weight W_i in replicate i is normally distributed with mean μ_i and standard deviation σ, with μ_i defined as the same function f of the tested rate, we get W_i ~ N(μ_i, σ) with μ_i = f(x_i) (Eqs (3) and (4)). In Eqs (1) to (4), f was chosen as the three-parameter log-logistic function (Eq (5)):

f(x) = d / (1 + (x/e)^b)

Parameters b, d and e are positive. Parameter b is a shape parameter translating the intensity of the effect, d corresponds to the endpoint value in control data (i.e., in the absence of product) and e corresponds to the ER50. Within the Bayesian framework, we have to specify a prior distribution for the model parameters b, d and e (and σ when modelling shoot dry weight data). The prior distributions are given in Table 2.
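The three-parameter log-logistic function of Eq (5) can be sketched as follows; this is an illustrative Python translation (the paper's analyses were run in R), with parameter names matching the text:

```python
import numpy as np

def log_logistic3(x, b, d, e):
    """Three-parameter log-logistic curve: f(x) = d / (1 + (x/e)^b).
    d is the control (rate 0) response, e is the ER50 (f(e) = d/2),
    b is the shape parameter; the lower asymptote is fixed at 0."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    pos = x > 0
    y[pos] = d / (1.0 + (x[pos] / e) ** b)
    y[~pos] = d  # at rate 0 (control), the response equals d
    return y
```

By construction, the curve passes through d/2 at x = e, which is why parameter e is directly interpretable as the ER50.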
Estimation of parameters. Model computations were performed in R [10] with JAGS, using Gibbs sampling via Markov chain Monte Carlo (MCMC) simulations [11]. The R package morse [12] was used to analyse emergence and survival data. In package morse, if no inhibition of plant emergence (or no mortality) is observed in the control groups, parameter d is automatically set to 1 by default; a two-parameter log-logistic model is then fitted to the data. Emergence and survival data can also be conveniently analysed with the MOSAIC platform [13]; a modus operandi is provided in S1 Appendix. A specific R code based on the R package rjags was built to fit shoot dry weight data. This code is made freely available through an R Shiny web application (https://mosaic.univ-lyon1.fr/growth), for reproducibility of the results on shoot dry weight data.
Three chains were run, first for an initialisation phase of 3,000 iterations and a burn-in phase of 5,000 iterations; then the Raftery and Lewis diagnostic was used to set the necessary thinning and number of iterations to reach a given level of precision in the posterior samples. These posterior samples provide a joint posterior probability distribution, as well as marginal posterior probability distributions, for all model parameters.
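The Bayesian machinery behind these fits (here JAGS/Gibbs sampling) can be illustrated with a minimal random-walk Metropolis sketch in Python on simulated survival data. Everything below is an illustrative assumption, not the paper's code: the data are simulated, d is fixed at 1 (no control mortality, as morse does by default), and only (log10 b, log10 e) are sampled, with a normal prior on log10 e centred on the tested-rate range as described in the Methods:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated survival data: survivors out of 50 at six tested rates,
# generated from a log-logistic curve with true ER50 = 2.0 and b = 2.0.
rates = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 8.0])
true_b, true_e = 2.0, 2.0
p_true = 1.0 / (1.0 + (rates / true_e) ** true_b)
surv = rng.binomial(50, p_true)

def log_post(lb, le):
    """Log-posterior for (log10 b, log10 e); d fixed at 1."""
    if not (-2.0 < lb < 2.0):          # uniform prior on log10 b
        return -np.inf
    b, e = 10.0 ** lb, 10.0 ** le
    p = np.clip(1.0 / (1.0 + (rates / e) ** b), 1e-12, 1 - 1e-12)
    loglik = np.sum(surv * np.log(p) + (50 - surv) * np.log(1 - p))
    # Normal prior on log10 e derived from the tested-rate range
    m = 0.5 * (np.log10(rates.min()) + np.log10(rates.max()))
    s = 0.25 * (np.log10(rates.max()) - np.log10(rates.min()))
    return loglik - 0.5 * ((le - m) / s) ** 2

# Random-walk Metropolis: accept a proposal with probability
# min(1, posterior ratio); keep samples of log10 e after burn-in.
theta = np.array([0.0, 0.0])           # start at b = 1, e = 1
lp = log_post(*theta)
samples = []
for i in range(20000):
    prop = theta + rng.normal(0.0, 0.1, size=2)
    lp_prop = log_post(*prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    if i >= 5000:                      # discard burn-in
        samples.append(theta[1])

er50_median = 10.0 ** np.median(samples)  # posterior median of the ER50
```

The retained samples approximate the marginal posterior of the ER50, which is exactly the object used in the next subsection to build credible intervals and censoring decisions.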
Censoring ER50 estimates to account for the uncertainty. The output of interest from the previous dose-response analyses is the posterior probability distribution of the ER50 (Fig 1A), which quantifies the uncertainty on the ER50 estimation, summarised as a 95% credible interval (CI95): the range of values within which the ER50 has a 95% chance to be found. Hence, we considered the use of the CI95 of the ER50 estimates as inputs of SSD analyses, instead of point estimates (median or mean values), as a good way to propagate the uncertainty on the ER50 estimates into subsequent analyses. But the following questions then arise: should we always consider the bounds of the CI95 as the uncertainty limits of the ER50 and use this interval as the ER50 input value for the SSD analysis? Is the ER50 estimate always precise enough, compared to the range of tested rates, to be used as it is in the SSD? How can the precision of the ER50 estimate be assessed with respect to the range of tested rates? Is there any situation in which the estimated ER50 should be considered as right-censored?

To ensure agronomic realism, the common practice in standard toxicity tests with NTTP is to use the maximal field application rate as the maximum tested rate in the experimental design. However, some species are not affected, or only slightly affected, at the highest intended application rate: the application rates that would be needed to elicit strong effects in the plants are unknown, thus leading to unbounded ER50 estimates (greater than the highest tested rate). Such ER50 estimates will usually be imprecise (having a large CI95), which may suggest considering them as right-censored values. That is why we carefully considered the relevance of the ER50 estimates (quantified through their CI95) with respect to the range of tested rates, in particular the highest tested rate (max_rate).
In order to decide on the most appropriate mathematical option for automatically right-censoring the ER50, we propose two criteria based on overlap ratios between the [LCI95; max_rate] and [LCI95; UCI95] intervals, where LCI95 and UCI95 are the lower and upper bounds of the CI95, respectively:
1. A first criterion based on a ratio of probabilities (denoted C1, Eq (6)), defined as the probability that the ER50 lies within [LCI95; max_rate] divided by the probability that the ER50 lies within [LCI95; UCI95]:

C1 = P(LCI95 ≤ ER50 ≤ max_rate) / P(LCI95 ≤ ER50 ≤ UCI95)

As illustrated in Fig 1B, criterion C1 corresponds to the orange surface divided by the (orange + grey) surface.
2. A second criterion based on a ratio of distances (denoted C2, Eq (7)), defined as the distance max_rate − LCI95 (set to 0 if max_rate < LCI95) divided by the extent of the CI95:

C2 = max(max_rate − LCI95, 0) / (UCI95 − LCI95)

As illustrated in Fig 1C, criterion C2 corresponds to the orange segment divided by the (orange + grey) segment.
Decision. Once the criterion is calculated, we need a decision threshold (denoted T) to decide whether or not to right-censor the ER50 (Eq (8)). If the ratio is greater than T, we keep an interval-censored ER50 corresponding to the whole CI95, namely [LCI95; UCI95]; otherwise, we consider a right-censored ER50 with a lower bound equal to the minimum of LCI95 and max_rate, namely [min(LCI95, max_rate); +∞).
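The two criteria and the decision rule can be sketched as follows, given a posterior sample of the ER50 (as produced by the dose-response fit) and the highest tested rate. This is an illustrative Python implementation of the rules described above, not the authors' code; the function name is an assumption:

```python
import numpy as np

def censor_er50(posterior, max_rate, criterion="C1", T=0.5):
    """Apply the C1 (probability-ratio) or C2 (distance-ratio) censoring
    rule to a posterior sample of ER50 values. Returns (lower, upper);
    upper is np.inf for a right-censored ER50."""
    lci, uci = np.quantile(posterior, [0.025, 0.975])   # CI95 bounds
    if criterion == "C1":
        # P(LCI95 <= ER50 <= max_rate) / P(LCI95 <= ER50 <= UCI95)
        num = np.mean((posterior >= lci) & (posterior <= max_rate))
        den = np.mean((posterior >= lci) & (posterior <= uci))
        ratio = num / den
    else:
        # (max_rate - LCI95)+ / (UCI95 - LCI95)
        ratio = max(max_rate - lci, 0.0) / (uci - lci)
    if ratio > T:
        return (lci, uci)                  # interval-censored by the CI95
    return (min(lci, max_rate), np.inf)    # right-censored
```

For a posterior well below max_rate the ratio is close to 1 and the CI95 is kept; for a posterior mostly above max_rate the ratio drops towards 0 and the ER50 is right-censored.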

SSD analyses.
Our final objective is to explore the influence of considering the uncertainty on the ER50 in SSD analyses, and specifically its impact on HR5 estimates. Given the way we have taken the uncertainty on the ER50 into account (see above), this means studying how interval- and/or right-censored ER50 input values impact the SSD analysis and the HR5 estimation. Thus, SSD analyses were carried out based on ER50 values from the seven studies, first analysed with a dose-response model as previously described. For each case study and each endpoint, based on the dose-response fitting results, we considered input ER50 values for an SSD analysis in the three following ways:
i. only point estimates (chosen as the medians of the probability distributions of the ER50 estimates);
ii. interval-censored ER50 based on their CI95, used as such in a mathematically sound way;
iii. ER50 censored according to criterion C1 with a decision threshold T = 0.5 (denoted C1T0.5), used as such in a mathematically sound way.
SSD analyses were run by fitting a log-normal probability distribution to the ER50 input values under a frequentist framework based on the R package fitdistrplus [14]. This R package allows the user to deal with censored data in a mathematically sound way. An alternative is to use the web platform MOSAIC and its SSD module (https://mosaic.univ-lyon1.fr/ssd) [15].
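The paper relies on fitdistrplus (fitdistcens) in R for this step; the same censored maximum-likelihood idea can be sketched in Python. In the log-likelihood, an interval-censored ER50 [L, U] contributes log(F(U) − F(L)) and a right-censored ER50 [L, +∞) contributes log(1 − F(L)), where F is the log-normal CDF; the HR5 is then the 5th percentile of the fitted distribution. The function name and the Nelder-Mead choice are illustrative assumptions:

```python
import numpy as np
from scipy import stats, optimize

def fit_lognormal_censored(intervals):
    """Maximum-likelihood fit of a log-normal SSD to censored ER50 values.
    `intervals` is a list of (lower, upper) pairs; upper = np.inf encodes
    a right-censored ER50. Returns (mu, sigma) on the log scale and HR5."""
    lo = np.array([i[0] for i in intervals])
    hi = np.array([i[1] for i in intervals])

    def nll(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)          # keep sigma positive
        z_lo = (np.log(lo) - mu) / sigma
        z_hi = (np.log(hi) - mu) / sigma   # log(inf) = inf, so CDF -> 1
        prob = stats.norm.cdf(z_hi) - stats.norm.cdf(z_lo)
        return -np.sum(np.log(np.clip(prob, 1e-300, None)))

    res = optimize.minimize(nll, [np.mean(np.log(lo)), 0.0],
                            method="Nelder-Mead")
    mu, sigma = res.x[0], np.exp(res.x[1])
    hr5 = np.exp(mu + sigma * stats.norm.ppf(0.05))  # 5th percentile
    return mu, sigma, hr5
```

Because each censored observation contributes its full interval probability, right-censored species pull the fitted distribution without requiring an arbitrary substitute value.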

Dose-response analyses
All results on dose-response analyses are displayed in files report_xxx.pdf in S1 Archive for each case study, each species and each endpoint (five files per case study). Under the Bayesian framework, whatever the data set, the species and the endpoint, we always succeeded in fitting a dose-response curve and getting a posterior probability distribution for the ER50. For certain endpoints in certain studies, we got a well-shaped sigmoidal dose-response curve with a median estimate of the ER50 within the range of tested rates. Nevertheless, in cases where the herbicides did not elicit a strong effect on the chosen species, we got a flat dose-response curve with a high median estimate of the ER50, in particular for the survival endpoint of the VV tests. Fig 2 illustrates an example of a dose-response curve along with some goodness-of-fit criteria. The data used for this example is the shoot dry weight of the VV test from case study 1 for species BEAVA. The median fitted dose-response curve in Fig 2A is represented by a solid orange line associated with its CI95 as a grey band; it describes the shoot dry weight of the sugar beet as a function of the product tested rate. The goodness-of-fit of the fitted model can be checked using a posterior predictive check (PPC) plot, illustrated in Fig 2B. The PPC plot shows the observed shoot dry weight values against their corresponding predictions (black dots), along with their CI95 (vertical segments, green if the CI95 contains the observed value and red otherwise). The model is considered well fitted if around 95% of the black dots are within their CI95. Note that for the emergence and survival data sets, the previous statement does not necessarily hold, because observations are pooled per tested rate.
The precision of the model parameter estimates can be visualised in Fig 2C by comparing the posterior distribution (orange surface) to the prior one (grey surface) for each parameter; in Fig 2D, we can check for correlations between parameters. A posterior distribution narrower than the prior for each parameter, and low correlations between parameters, are expected to ensure the goodness-of-fit of the model; this is the case in this example.

Censoring of ER50 estimates
The censoring decision for an ER50 depends on both the criterion (C1 or C2) and the decision threshold T. To study the influence of the criterion and the decision threshold on censoring decisions, we tried seven T values between 0 and 1, and examined the censoring results obtained with both criteria and the seven decision thresholds. The data used for this example is the shoot dry weight of the VV test from case study 1 for species ALLCE.
In this example, the ER50 is either interval-censored or right-censored depending on the criterion and the T value. Moreover, the censored ER50 values vary slightly according to the criterion and the decision threshold. Most of the time, criteria C1 and C2 led to the same censoring decision for our seven case studies; criterion C1 (based on the whole probability distribution of the ER50) was therefore finally preferred. See files ER50_censoring.pdf in S1 Archive (seven files in total) for results on other species and other endpoints. Regarding the decision threshold T, the seven decision thresholds led to the same censoring decision in the following cases:
• case 1: when the CI95 of the ER50 lies entirely within the range of tested rates, an ER50 interval-censored by its CI95 is always obtained;
• case 2: when the CI95 of the ER50 lies entirely outside the range of tested rates, a right-censored ER50 [max_rate; +∞) is always obtained.
Consequently, the decision threshold influences the censoring decision when the CI95 of the ER50 overlaps the range of tested rates. In this case, the higher T is, the more often the ER50 will be right-censored. Hence, a too high T value may generate many right-censored ER50 values and lead to considering some ER50 estimates as right-censored when we would rather have used their CI95 to quantify their uncertainty. On the other hand, a too low T value may almost always lead to using an interval-censored ER50 with its CI95, even when most of the probable values of the ER50 (within the support of its posterior probability distribution) are greater than the highest tested rate; in such a case, we would rather have right-censored it with respect to the range of tested rates. Therefore, we have a preference for T = 0.5, as a neutral value. Hence, for the subsequent SSD analyses, we only considered ER50 values censored according to C1 and T = 0.5 (C1T0.5).

SSD and HR5
Three ways of handling ER50 values in SSD analyses were studied and compared for the seven case studies. For each case study, all results on SSD and HR5 are displayed in files SSD_analyses.pdf in S1 Archive (seven files in total). In total, we performed SSD analyses on 105 data sets (7 studies × 5 endpoints × 3 types of ER50 inputs). We had almost no convergence problems for parameter estimation, except for the survival and shoot dry weight endpoints of the VV test for case studies 2 and 5. In case study 2, for the survival endpoint, fitting a log-normal distribution to the data set with ER50 values censored according to C1T0.5 failed because the 10 censored ER50 values were all equal to the same interval [max_rate; +∞). In the other cases, convergence failed because all censored ER50 values were too close to each other. Fig 4 illustrates an example of three SSD analyses based on medians of the ER50 only (A), ER50 interval-censored by their CI95 (B) and ER50 censored according to C1T0.5 (C). The obtained HR5 estimates are denoted HR5,1, HR5,2 and HR5,3, respectively. The data for this example is the shoot dry weight of the VV test from case study 1. In this example, the three SSD curves fitted the ER50 values well. The SSD curve in Fig 4C has a larger 95% confidence interval (CoI95) than the ones in Fig 4A and 4B. The estimated HR5,3 was smaller than HR5,1 and HR5,2, with a larger CoI95.
Summary of the seven case studies. Results on the HR5 based on the three ways of handling ER50 values for the seven case studies are given in Table 3; the corresponding SSD curves are given in files SSD_analyses.pdf in S1 Archive.
According to the SSD curves, SSD fitted poorly when there were fewer than six distinct intervals among the 10 input ER50 intervals, most of which were right-censored (e.g., the emergence endpoint for case studies 1, 2, 3, 6 and 7; the survival endpoint of the SE test for case studies 2 and 3; the survival endpoint of the VV test for case studies 1, 3 and 5; the shoot dry weight endpoint of the VV test for case study 5). In such cases, we often found that HR5 estimates were imprecise, with a large CoI95. The HR5,3 estimates taking the right-censoring into account were often greater than the HR5,1 and HR5,2 estimates based on the ER50 medians only or on ER50 interval-censored by their CI95. Note that we nevertheless performed these SSD analyses to obtain HR5 estimates and to illustrate that HR5 values obtained for data sets where fewer than six species show clear adverse effects are imprecise. It would therefore make no sense to fit an SSD in such cases for routine analyses.
According to the SSD curves, SSD fitted reasonably well when there were at least six distinct intervals among the input ER50 values, only some of which were right-censored (e.g., the emergence endpoint for case study 5; the survival endpoint of the SE test for case studies 1, 4, 5, 6 and 7; the shoot dry weight endpoint of the SE test for the seven case studies; the survival endpoint of the VV test for case studies 4, 6 and 7; the shoot dry weight endpoint of the VV test for case studies 1, 3, 4, 6 and 7). We almost always observed that HR5,3 was lower than HR5,1 and HR5,2, with a few exceptions: for the survival endpoint of the SE test for case study 4, HR5,3 was twice as large as HR5,1 and HR5,2; for the three endpoints of the SE test for case study 5, HR5,1, HR5,2 and HR5,3 were very close; and for the shoot dry weight endpoint of the VV test for case study 7, HR5,3 was slightly greater than HR5,1 but smaller than HR5,2.
Concerning the influence of including right-censored data on the HR5 estimate, based on the SSD curves for our seven case studies, we found that in cases where HR5,3 was lower than the HR5,1 and HR5,2 estimates, the right-censored ER50 values were mostly obtained for less sensitive species (i.e., species with higher median ER50 values). This result was found frequently among the seven case studies, and most of the time in case studies with few right-censored ER50 values for the measured endpoints; this suggests that such a situation will occur in most of the cases encountered when analysing SSD data for NTTP. It follows from the fact that the tested-rate range (as imposed by the standard protocols) is better adapted to assessing effects on the more sensitive species. On the other hand, we found that in certain cases where HR5,3 was greater than the HR5,1 and HR5,2 estimates, the right-censored ER50 values were obtained not only for less sensitive species but also for highly sensitive species (i.e., species with lower median ER50 values). This happened mostly in case studies with many right-censored ER50 values, as the application rates were not adapted for most of the chosen species for the measured endpoints. In a risk assessment context, this situation should typically lead to new experiments with higher tested rates.

Dose-response modelling
Model choice. A wide variety of models are in common use to describe a dose-response relationship for ecotoxicity test data, such as probit, log-logistic and Weibull models [16-18]. Log-logistic models have been widely used in weed science and have been recommended as a standard for herbicide dose-response analyses [19]; they are by far the most commonly used models for describing toxicity test data [20], and they can properly handle not only continuous data but also quantal data. Hence, for our NTTP data, we always used log-logistic models with the same deterministic part to analyse emergence, survival and shoot dry weight data, thus facilitating comparisons. A three-parameter log-logistic model was therefore chosen to analyse not only emergence and survival data, but also shoot dry weight data. Although a four-parameter log-logistic model could have been tested and might have been selected for shoot dry weight data, we preferred the three-parameter one, since it is reasonable to fix the lower asymptote (parameter c) at 0, considering that at very high application rates all plants can die, even though the actual data do not fully support this assumption in a few rare cases [20]. Moreover, for none of our data sets did the addition of one extra parameter significantly improve the model fit (results not shown).
Choice of priors. A quasi-non-informative uniform prior distribution was chosen for the logarithm of parameter b within the interval [−2, 2], in order to cover a wide variety of dose-response shapes. In certain cases, we had an extremely flat dose-response curve (no effect on the endpoint was observed at any of the tested rates), so that the posterior of parameter b was as wide as the prior, even when we enlarged the prior support interval. In these cases, however, the imprecise estimation of parameter b did not influence our conclusions on the ER50 estimates, since the ER50 values were considered as right-censored with a lower bound equal to the highest tested rate.
For parameter e, we used the same prior as the one used in the R package morse [12,21]: a normal distribution for the logarithm of parameter e, with mean and standard deviation (presented in the Methods section) computed from the experimental design. This choice implies that parameter e has a probability slightly greater than 95% of lying within the range of tested rates.
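One design-derived choice with the stated property (slightly more than 95% of the prior mass within the tested-rate range) is to centre the prior on the midpoint of the log-rates and set the standard deviation to a quarter of the log-range, so that ±2 sd spans the range (≈95.4% of a normal distribution). The exact constants used by morse may differ; this sketch is an assumption consistent with the text:

```python
import math

def prior_log_e(min_rate, max_rate):
    """Mean and standard deviation of a normal prior on log10(e),
    derived from the positive tested-rate range. With sd equal to a
    quarter of the log-range, the interval mean +/- 2 sd coincides with
    [log10(min_rate), log10(max_rate)], i.e. about 95.4% prior mass."""
    m = 0.5 * (math.log10(min_rate) + math.log10(max_rate))
    s = 0.25 * (math.log10(max_rate) - math.log10(min_rate))
    return m, s
```

For example, with tested rates spanning 0.1 to 10, the prior on log10(e) is centred on 0 with standard deviation 0.5.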
A uniform prior distribution was chosen for parameter d within the interval [0, d_max]. For the emergence and survival data sets, d_max equals 1, representing a 100% probability of emergence or survival. For the shoot dry weight data sets, d_max should ideally be chosen according to expert knowledge, equal to the highest expected shoot dry weight given the species and the environmental conditions (in the experiment) under consideration. Nevertheless, for practical convenience, d_max was chosen as twice the highest observed shoot dry weight for the species under consideration, and the observation with the highest observed value was then discarded from the dose-response analyses.
Concerning the prior distribution for the variance parameter of the Gaussian distribution (shoot dry weight data only), there are several commonly used specifications (e.g., an inverse gamma or an inverse chi-square distribution on the variance parameter, or a uniform distribution on the standard deviation parameter). We finally assigned a uniform prior distribution to the standard deviation parameter σ within the interval [0, d_max/2].

Censoring of ER 50 estimates
We chose to use criterion C1, based on the ratio of probabilities, with a T value equal to 0.5 to automatically censor the ER50 estimates. Indeed, criteria C1 and C2 led to very similar censoring decisions for the ER50 estimates, but criterion C1 was preferred since it is based on the whole probability distribution of the ER50. However, if criterion C2 had been chosen for censoring the ER50 estimates, there would have been very few changes in the SSD analyses and HR5 estimates. Regarding the decision threshold T, there is no rule set in stone for its choice. We have a preference for T = 0.5, as a neutral value. This medium value may avoid considering very imprecise ER50 estimates as interval-censored by their CI95, and may also avoid considering sufficiently precise ER50 estimates as right-censored. Based on the 350 data sets we analysed, almost the same results and conclusions are reached for any choice of T between 0.4 and 0.6. We propose to always use the intermediate value of 0.5 for the sake of simplicity and comparability, and we recommend avoiding T values below 0.4 or above 0.6, for the reasons mentioned above.

Uncertainty on the ER50 estimates. There are some limitations in the way we considered the uncertainty on the ER50 estimates. Indeed, the uncertainty was simply summarised either by an interval-censored ER50 with its CI95 or by a right-censored ER50 accounting for the range of tested rates. This method does not use all the available information on the ER50 (i.e., the full posterior distribution). It is conceivable that the uncertainty on the ER50 estimates could be handled in better ways, allowing the full posterior distribution of the ER50 to be propagated into subsequent SSD analyses. Further research would be needed to explore this possibility.

SSD analyses
Compared to the traditional deterministic approach that relies on the most sensitive individual toxicity data, the probabilistic SSD method has numerous advantages [2,22]. As always in statistics, an SSD can be built with either parametric or non-parametric methods. Several parametric distributions have already been proposed for SSD, such as the log-logistic [23], log-normal [3,24-28], Burr Type III [29] and Weibull distributions. The popularity of the parametric approach for SSD is due to its mathematical simplicity and to the fact that it requires fewer data points than non-parametric approaches. The log-normal and log-logistic distributions are the most commonly used for SSD [3,26,27]. The present paper used a log-normal distribution to fit the ER50 values without testing the normality of the log-transformed values, since our purpose was not to find the best fit to the toxicity values, but rather to study the influence of accounting for the uncertainty of the ER50 inputs on HR5 estimates.
Concerning minimum data requirements for fitting an SSD, a minimum of five to ten species is deemed acceptable for regulatory purposes in aquatic ecotoxicity [2] and, in the context of environmental risk assessment, a minimum of six species is required, ten being usually recommended [6]. Indeed, small input samples may lead to high uncertainty in the fitted SSD [26]. Ten data points were also suggested by Wheeler et al. [30] to generate reliable estimates upon which regulatory decisions may be based. In the present paper, ten NTTP species were therefore tested for each of the seven case studies, providing ten (generally distinct) ER50 values for the SSD analyses. However, for certain endpoints of certain case studies, when taking right-censored ER50 values into account in the SSD analyses, the number of distinct ER50 values diminished, since some of the right-censored values were equal to the same interval [max_rate, +∞), with max_rate equal for all species. Consequently, we had some cases where the SSD was poorly fitted with fewer than six distinct intervals for the ER50 values, some of which were right-censored. When this happens, it is better not to consider the SSD results and their corresponding estimated HR5 values, and if possible to add new experiments with higher application rates. Based on our seven case studies, at least six distinct intervals (whatever the ER50 point values) appears to be the minimum requirement to reasonably fit an SSD and obtain a relevant 95% confidence interval on the final HR5 estimate.
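A minimal sketch of a frequentist log-normal SSD fit to censored ER50 inputs, by maximising the interval likelihood (akin to what `fitdistcens` from the R package fitdistrplus does; the function names and starting values here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_lognormal_ssd(intervals):
    """Maximum-likelihood fit of a log-normal SSD to censored ER50 inputs.
    Each input is a (lower, upper) interval on the rate scale; upper =
    np.inf encodes a right-censored ER50 at [max_rate, +inf)."""
    lo = np.log([a for a, _ in intervals])
    hi = np.array([np.log(b) if np.isfinite(b) else np.inf for _, b in intervals])

    def nll(theta):  # negative log-likelihood of the interval data
        mu, log_sigma = theta
        sigma = np.exp(log_sigma)
        p = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
        return -np.sum(np.log(np.clip(p, 1e-300, None)))

    res = minimize(nll, x0=[np.mean(lo), 0.0], method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])   # (mu, sigma) on the log scale

def hr5(mu, sigma):
    """HR5 = 5th percentile of the fitted log-normal SSD."""
    return float(np.exp(norm.ppf(0.05, mu, sigma)))
```

A degenerate right-censored ER50 simply contributes the survival term 1 − Φ((ln max_rate − μ)/σ) to the likelihood, which is why several species sharing the same interval [max_rate, +∞) add little distinct information and can leave the fit poorly constrained.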
Influence of right-censored data on HR5 estimates. For endpoints for which the SSD was fitted on at least six distinct intervals for the ER50 inputs, HR5 estimates taking the right-censoring into account were most of the time lower than HR5 estimates based on ER50 medians or on ER50 interval-censored by their CI95. These results are consistent with a simulation study conducted by Green et al. [16], which demonstrated that the mathematically sound way of using censored data tends to underestimate the HR5 compared to the HR5 resulting from a data set without censored values, with greater underestimation associated with a greater percentage of censoring. In addition, we found that, when right-censored ER50 values were obtained for the less sensitive species, which is the most common case in practice, the HR5 values obtained by including the right-censoring were most of the time smaller than those obtained by handling censored data as non-censored data.
From a general point of view, based on our seven case studies for the five endpoints we analysed, the influence of including right-censored data on the HR5 estimate depends on whether the right-censored ER50 values were obtained for the more sensitive or the less sensitive species. In addition, if right-censored data are spread randomly among the chosen species, the HR5 obtained by considering right-censored ER50 values can be either greater or smaller than the HR5 obtained by handling censored data as non-censored data.
Sensitivity of endpoints. In the Results section, we did not mention the sensitivity of endpoints. However, from Table 3, we found that the shoot dry weight endpoint, from either the SE or the VV test, almost always appeared as the most sensitive endpoint with the lowest estimated HR5 value, except for case study 5, for which the HR5 results for three endpoints (emergence, survival and shoot dry weight of the SE test) were very close. The shoot dry weight from the VV test appeared more often as the most sensitive endpoint (four times out of the seven case studies) than the one from the SE test. In addition, for the shoot dry weight endpoint, ER50 values were rarely right-censored according to criterion C1 with T = 0.5. We therefore recommend always collecting and analysing shoot dry weight data in order to assess the risk of herbicides to NTTP by using SSD analyses based on censored ER50 inputs.

Experimental design.
In certain studies, the tested rates were not specifically adapted to the sensitivity of some species. Hence, some species were not affected, or only slightly affected, at the highest intended application rate, leading to right-censored ER50 values. These right-censored values may affect the estimation of the HR5, for example by making the estimate less precise. If the precision of the HR5 is not considered sufficient (i.e., if the CV is high), conducting new experiments with higher application rates may help to refine the final HR5 estimate.
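The precision of the HR5 can be quantified by a nonparametric bootstrap, as sketched below. For simplicity this self-contained example works from point ER50 values with a moment-matching log-normal fit, which is a deliberate simplification of the censored fit used in the paper; all function names are hypothetical:

```python
import numpy as np

Z05 = -1.6448536269514722   # 5th-percentile quantile of the standard normal

def hr5_point(er50s):
    """HR5 from point ER50 inputs: 5th percentile of a log-normal SSD
    fitted by moment matching on the log scale (illustrative shortcut)."""
    logs = np.log(np.asarray(er50s, dtype=float))
    return float(np.exp(np.mean(logs) + Z05 * np.std(logs, ddof=1)))

def bootstrap_hr5(er50s, n_boot=2000, seed=1):
    """Nonparametric bootstrap of the HR5: resample species with
    replacement, refit, and report the point estimate, its 95% confidence
    interval and its coefficient of variation (CV)."""
    rng = np.random.default_rng(seed)
    er50s = np.asarray(er50s, dtype=float)
    boots = np.array([hr5_point(rng.choice(er50s, size=er50s.size))
                      for _ in range(n_boot)])
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return hr5_point(er50s), (float(lo), float(hi)), float(boots.std() / boots.mean())
```

A wide bootstrap interval, or a high CV, signals that the HR5 estimate is too imprecise to be used on its own and that additional experiments may be warranted.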

Conclusion
All our results confirm the usefulness of our integrated approach, going from raw toxicity test data all the way to the HR5 (or HC5) estimate, with uncertainty propagated throughout the data analysis process. Accounting for ER50 (or EC50) estimates as intervals clearly avoids discarding inputs for SSD analyses or arbitrarily converting them to point values. It also avoids inflating the uncertainty in the apical HR5 (or HC5) estimate, by keeping as many ER50 (or EC50) inputs as possible whatever their associated type of interval (bounded or not). Additionally, the method we propose is applicable to any taxon in ecotoxicology. The results we presented are based on a total of 350 data sets from seven case studies, each with five endpoints (survival, emergence, shoot dry weight) for ten non-target terrestrial plants from standard greenhouse experiments that used different experimental designs (vegetative vigour vs. seedling emergence) and applied herbicides at different growth stages. The Bayesian framework allowed the estimation of ER50 values and their 95% credible intervals for all data sets, even when the dose-response curve did not reach a strong effect at the highest tested rate. Combined with a statistically sound approach for including censored ER50 estimates in the SSD computation, we maximised the use of existing species data when building the SSD, thus avoiding discarding right- and/or left-censored data that may be obtained from laboratory studies for less or more sensitive species. Our overarching study confirmed that at least six distinct intervals (whatever the ER50 point values) are required as input to the SSD analysis to ensure a reliable estimate of the HR5.
Our paper finally proposes a statistically sound method for propagating the uncertainty of the ER50 (or EC50) estimates, considered as interval-censored values, down to the HR5 (or HC5) estimates. This method delivers both point estimates and bootstrap 95% confidence intervals of the HR5 (or HC5). It illustrates that both propagating the uncertainty from the ER50 (or EC50) estimates and including interval-censored data as inputs for SSD analyses may change both the point estimate and the 95% confidence interval of the HR5 (or HC5). The extent of the change depends on the characteristics of the ER50 (or EC50) input values (e.g., whether censored data were obtained for less or more sensitive species, or were randomly spread among the tested species) and on the chosen criteria for handling the uncertainty of the ER50 (or EC50) values. Consequently, when comparing and interpreting the final results, we recommend not focusing solely on the point estimate of the HR5 (or HC5), but also looking at the precision of this estimate as depicted by its 95% confidence interval. A narrow confidence interval indicates a precise HR5 (or HC5) estimate, and thus low uncertainty. This information integrates both the differences in sensitivity and the uncertainty of the ER50 (or EC50) inputs across a range of species, all the way down to the HR5 (or HC5) estimation; it is therefore particularly valuable for an informed use of the HR5 (or HC5) value in the context of environmental risk assessment.
Supporting information

S1 Table. Corresponding species and common names for the EPPO codes. (PDF)

S2 Table. Formulation of the active substances in the seven products. (PDF)

S1 Appendix. Reproduction of the results via MOSAIC. (PDF)

S1 Archive. Zip file containing all supplementary results, organised as seven folders (one per case study). Each folder contains five files report_xxx.pdf with the detailed results of the dose-response analyses (one file per endpoint), one file ER50_censoring.pdf with the censored ER50 values and one file SSD_analyses.pdf with the results of the SSD analyses.