A hybrid particle-ensemble Kalman filter for problems with medium nonlinearity

A hybrid particle ensemble Kalman filter is developed for problems with medium non-Gaussianity, i.e. problems where the prior is very non-Gaussian but the posterior is approximately Gaussian. Such situations arise, e.g., when nonlinear dynamics produce a non-Gaussian forecast but a tight Gaussian likelihood leads to a nearly-Gaussian posterior. The hybrid filter starts by factoring the likelihood. First the particle filter assimilates the observations with one factor of the likelihood to produce an intermediate prior that is close to Gaussian, and then the ensemble Kalman filter completes the assimilation with the remaining factor. How the likelihood gets split between the two stages is determined in such a way to ensure that the particle filter avoids collapse, and particle degeneracy is broken by a mean-preserving random orthogonal transformation. The hybrid is tested in a simple two-dimensional (2D) problem and a multiscale system of ODEs motivated by the Lorenz-‘96 model. In the 2D problem it outperforms both a pure particle filter and a pure ensemble Kalman filter, and in the multiscale Lorenz-‘96 model it is shown to outperform a pure ensemble Kalman filter, provided that the ensemble size is large enough.


Introduction
Data assimilation of high-dimensional dynamical systems routinely falls to various kinds of ensemble Kalman filters (EnKF) [1]. Ensemble Kalman filters make two fundamental approximations: the first is that the likelihood and prior are both Gaussian, and the second is that the mean and covariance of the Gaussian prior are approximated from an ensemble. The EnKF is known to converge to the correct posterior in the limit of large ensemble size when the distributions are Gaussian [2], but it clearly will not converge to the correct posterior in the presence of non-Gaussianity.
Although nonlinear dynamics, nonlinear observation operators, and non-Gaussian error distributions lead to non-Gaussian priors and likelihoods in many applications, the degree of non-Gaussianity is not always so great that it severely degrades EnKF performance. This has led several authors to classify problems according to the degree of nonlinearity, i.e. the degree of non-Gaussianity [34,37,38]. Following [38] we distinguish three categories: • Mild nonlinearity: The prior and posterior are both approximately Gaussian.
• Medium nonlinearity: The prior is very non-Gaussian but the posterior is approximately Gaussian.
• Strong nonlinearity: The prior and posterior are both very non-Gaussian.
Particle filters and non-Gaussian extensions of the EnKF are not needed in situations with mild nonlinearity, while problems with strong nonlinearity can greatly benefit from such methods. Problems with medium nonlinearity can arise when nonlinear dynamics produce a non-Gaussian prior, but a highly accurate Gaussian likelihood generates a nearly Gaussian posterior. The concept of medium nonlinearity is related to the Laplace approximation [39]. Morzfeld and Hodyss [38] argue that variational methods are more appropriate for medium nonlinearity than EnKF methods because the former make a Gaussian approximation of the posterior, while the latter make a Gaussian approximation of the prior.
The goal of the present work is to develop a hybrid of the SIR particle filter with the EnKF that is appropriate for problems with medium nonlinearity. The hybrid is based on the likelihood splitting of Frei & Künsch [23]. At each assimilation cycle, part of the observational information is incorporated by means of an SIR step, and then the remaining observational information is incorporated with a serial square root version of the EnKF. Particle degeneracy that results from the resampling step of the SIR is broken by a mean preserving random orthogonal transformation of the ensemble, as seen in certain EnKFs [40][41][42] and momentmatching particle filters [43,44]. The goal of the hybrid is to present the EnKF with an intermediate prior that is closer to Gaussian than the true prior. The curse of dimensionality in the particle filter is mitigated by assimilating only part of the observational information, i.e. only moving partway from the prior to the posterior, thereby enabling accurate results with practical ensemble sizes. The hybrid presented here is broadly similar to other hybrids (e.g. [23,24]), and differs mainly in the explicit focus on problems with medium nonlinearity and in details of the implementation. Differences are discussed further in section 2.3.
The hybrid particle ensemble Kalman filter is presented in section 2. The new hybrid is compared to the hybrids from [23,24] and to a particle filter and an ensemble Kalman filter in the context of a simple two-dimensional problem in section 3. A multiscale Lorenz-'96 model from [45] is described in section 4.1, followed by a description of the data assimilation system configuration in section 4.2. The EnKF component of the hybrid uses multiplicative inflation and localization, and the method used to optimize the values of these parameters is described in section 4.3. Results of the experiments are described in section 5, followed by a conclusion in section 6.

SIR
Standard sequential importance resampling (SIR) particle filters work as follows [3,46]. Each ensemble member x ðiÞ 0 (or 'particle') starts with equal weight w ðiÞ 0 ¼ 1=N, where N is the ensemble size and i = 1, . . ., N. Subscripts refer to time, while superscripts in parenthesis refer to ensemble members. Each ensemble member is forecast until the next assimilation cycle. At assimilation cycle j the weights are updated using the likelihood L(x) w ðiÞ j ¼w ðiÞ where Z j is a normalization constant to ensure that the weights sum to one, andw j denotes the effect of resampling: without resampling at step j we havew ðiÞ j ¼ w ðiÞ j whereas with resampling we havew ðiÞ j ¼ 1=N. A resampling is then applied whereby particles with high weights are replicated and particles with low weights are eliminated. There are a variety of resampling algorithms; here we use the so-called 'systematic' resampling scheme of [47].
It is well known that the weights of a particle filter can collapse, especially in high dimensions, i.e. a small number of particles receive a weight near one while all others receive a weight near zero [7,8]. After resampling, only the high-weight particles are left. If an optimal-transport based alternative to resampling is used [24,48,49], then all particles are transported to a very small vicinity of the high-weight particles. In both cases the posterior distribution is poorly estimated. The number of particles with a substantial portion of the weight can be approximated by the effective sample size The ESS takes values between 1 and N, and small ESS indicates that the weights have collapsed.

Ensemble square root filter (ESRF)
There are many ensemble Kalman filters, any of which could be hybridized with the smoothed particle filter. We focus here on an ensemble square root filter (ESRF) developed in [50] for sequential assimilation of observations possessing uncorrelated errors. At a single assimilation cycle the ensemble is denoted fx ðiÞ g N i¼1 . The ensemble mean is denoted � x, and the scaled ensemble perturbation matrix is denoted The ensemble covariance matrix is thus AA T . Covariance inflation is applied by replacing A with ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi 1 þ r p A, where r > 0 is a tunable inflation factor. Observations are linear, and a single scalar observation y takes the form Here the observation error � is a sample from a zero-mean normal distribution with variance γ 2 and the row vector H = h T extracts the observations from the state vector x. It is convenient

PLOS ONE
to define the row vector v T = h T A. With this notation, the ESRF from [50] corresponds to the following update of the ensemble mean and the following update of the scaled ensemble perturbation matrix ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi where σ 2 = v T v and A a is the scaled analysis ensemble perturbation matrix. Localization is applied by multiplying the increments elementwise by a localization vector ρ. The elements of ρ are e −(d/L) 2 /2 , where d is the distance from x i to y and L is a tunable localization radius. This amounts to updating Eqs 5 and 6 to and where � denotes an elementwise product. Evensen was the first to suggest resampling the posterior within the context of an ensemble square root filter by multiplying A a from the right by a random orthogonal matrix [40]. Since the posterior ensemble covariance matrix is A a A aT , this kind of resampling does not change the ensemble covariance matrix. Sakov & Oke [42] pointed out that the random orthogonal matrix should have 1 (the vector whose elements are all 1) as an eigenvector in order for the resampling to preserve the ensemble mean. We construct a new scaled ensemble perturbation matrix A a by multiplying A a from the right by a random orthogonal matrix Q that has 1 as an eigenvector. The matrix Q is constructed as follows [42] The matrix U is an orthogonal matrix whose first column is proportional to 1, while the matrix P is a random orthogonal matrix of size N − 1 × N − 1. The matrix U is time independent. With a large ensemble size it can become costly to sample a new P at each assimilation cycle. In principle the matrix Q could be constructed once and used repeatedly, but in our numerical experiments P is resampled at each assimilation cycle. Using this method, a single assimilation cycle proceeds as follows • Form the ensemble mean � x and scaled ensemble perturbation matrix A.
• Inflate the scaled ensemble perturbation matrix: A (1 + r)A • For each observation, find � x a and A a using Eqs 8 and 9.
• Resample the posterior ensemble by replacing A a with A a Q.
• Reconstitute the ensemble according to x ðiÞ ¼ � x a þ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi N À 1 p A i where A i is the i th column of A.

SIR-ESRF hybrid
The SIR/ensemble square root filter (SIR-ESRF) hybrid developed here is based on the bridging method of Frei and Künsch [23]. The likelihood L(x) is split into a product (L(x)) α � (L(x)) 1 −α where α 2 [0, 1] is the "splitting factor". The hybrid proceeds by having the SIR particle filter assimilate using the likelihood (L(x)) α , followed by an ESRF assimilation using the likelihood (L(x)) 1−α . In principle, the methods can be applied in either order [24], but the method developed here is intended for situations where the prior is non-Gaussian but the posterior is nearly Gaussian ('medium' nonlinearity according to [38]). In such cases the intermediate posterior produced after the first assimilation with the particle filter should be closer to Gaussian than the prior. The ESRF subsequently performs an assimilation on a problem that more closely conforms to its underlying Gaussian approximation.
Following Frei & Künsch [23] we choose the splitting factor α to ensure that the effective sample size is within some tolerance of a tunable theshold. This is achieved with a rootfinding method. A large ESS threshold implies a small α, though the precise value of α depends on the ensemble size. If α = 0, then the hybrid reverts to a pure ESRF because all the particle filter weights become equal.
The resampling step of the SIR particle filter leads to a degeneracy where there are multiple copies of some ensemble members. In our numerical experiments we use a deterministic system of ordinary differential equations, so the dynamics do not break the degeneracy. We opt to follow the ESRF assimilation with a mean-preserving random orthogonal transformation that resamples the ensemble within the Gaussian posterior, as described in the foregoing section.
There are two other extant hybrid particle/ensemble Kalman filters: those of [23] and [24]. Our hybrid is essentially the same as the hybrid of [24] with the following differences: We use standard resampling methods for the particle filter part of the hybrid instead of the Ensemble Transform Particle Filter (ETPF) method of [48], and we break degeneracy using a random orthogonal transformation rather than the 'particle rejuvenation' procedure of [24]. Our use of a random orthogonal transformation is motivated by the focus on medium nonlinearity problems. Naive implementations of the ETPF are computationally expensive, and in the experiments with the Hénon map described in section 3 there seems to be little benefit in using the ETPF instead of standard resampling.
The hybrid of [23] is significantly different from the one proposed here and from the hybrid of [24] because the first step of Frei & Künsch's hybrid is really a Gaussian mixture model update and not a particle filter update (cf. [51]), though it does limit to a pure SIR particle-filter update in the limit α ! 1. In particular the particle weights in the hybrid of [23] are different from those used here and in [24], and are more expensive to evaluate. Particle degeneracy is avoided in the hybrid of [23] by using a stochastic update for each step: a perturbed-observation Gaussian mixture update (cf. [51]) for the first step and a perturbed-observation EnKF for the second step (cf. [52,53]). It is worth noting that the hybrids of [23] and [24] are generally intended to overcome non-Gaussianity in the filtering problem. The hybrid developed here is quite similar to that of [24] but has a tighter focus: we expect the hybrid to achieve near-optimal performance on problems with medium nonlinearity, but not on problems with strong nonlinearity.

Blurring observations
The development of particle filters that avoid or reduce the incidence of collapse is an active area of research. The authors recently proposed an alternative that uses the same forecast as the standard particle filter, but imposes a generalized random field model of observation errors [17]. When the observation errors are Gaussian, the likelihood takes the form LðxÞ / exp À 1 2 ðy À HðxÞÞ T R À 1 ðy À HðxÞÞ Here y is the observation vector, H is the observation (or 'forward') operator, and R is the observation error covariance matrix. In the particle filter of [17], the observation error covariance matrix R is replaced by a covariance matrix that has increasing variance at small spatial scales. In practice this is implemented by blurring (i.e. smoothing) the innovations y − H(x).
The authors recently developed a fast algorithm for blurring scattered data in arbitrary dimensions for this purpose [54].
In the numerical experiments presented here, the spatial domain is periodic and Fourier methods are used to apply the blurring. The true observation error covariance matrix is R = γ 2 I. In the particle filter with blurred observations this is replaced by γ 2 (S T S) −1 , where the matrix S corresponds to an operator that attenuates the Fourier coefficients using the following spectrum where β and ℓ are tunable parameters and k is the Fourier wavenumber. More general blurring spectra are trivial to implement in our experiments, but the above blurring corresponds to the spectrum of the fast algorithm for scattered data developed in [54]. Replacing the true likelihood by a likelihood associated with spatial blurring means that the particle filter is approximating a distribution other than the true Bayesian posterior. The effect of this blurring is to make the likelihood uninformative at small scales, so that the posterior reverts to the prior at small scales. At large scales the blurring likelihood is close to the true likelihood, so the approximate posterior is close to the true posterior. Blurring reduces the effective dimension of the problem by confining the dimensionality to that of the large scales. This has the effect of reducing the minimum ensemble size needed to avoid collapse. It can also improve uncertainty quantification of large scales for a fixed ensemble size.

Numerical experiment: Hé non map
This section serves to illustrate a specific problem with medium nonlinearity, and to compare the three hybrid particle/ensemble Kalman filters with a particle filter and an ensemble Kalman filter. Rather than performing a cycled data assimilation experiment where the output of one cycle serves as the initial condition for the next, we repeat the same experiment multiple times. This serves to focus attention on a single Bayesian assimilation update, avoiding the complication associated with cycled data assimilation where the degree of non-Gaussianity can vary from one cycle to the next.
The prior imposed is the joint distribution of U and V obtained by applying one iteration of the Hénon map to a standard normal initial condition on U 0 and V 0 , i.e.
The prior probability density is shown in color in the upper left panel of

PLOS ONE
for convenience). Since the prior is clearly non-Gaussian, a pure EnKF solution is expected to give a biased result regardless of ensemble size. In contrast, the hybrid should achieve nearly optimal performance provided that the ensemble size is large enough to avoid sampling errors.
To illustrate these ideas we compare five methods: 1. ETPF: A pure particle filter from [48] 2. ESRF: A pure ESRF described in section 2.2 3. GMM-EnKF: The Gaussian mixture model-EnKF hybrid of [23] 4. ETPF-ESRF: The hybrid of [24] combining the ETPF and the serial square root ESRF described in section 2. The center left panel illustrates the particle filter (ETPF). The ETPF posterior sample is tightly clustered around a small number of the prior samples, which reflects the fact that the ESS is very low (ESS = 5 in this example), despite having an ensemble size of 100 for a problem with dimension 2. This illustrates the severe ensemble size requirements of particle filters. The lower left panel shows the ESRF. The ESRF produces a posterior close to the true value in this case, but the posterior ensemble it produces shows clear discrepancy from the true posterior.
The hybrids all choose the split α to produce an ESS of 30. ETPF-ESRF and SIR-ESRF are shown in the center right and lower right panels, respectively; they use the same split α and the same particle weights. The two methods produce very similar results; one notable difference is that ETPF-ESRF produces an intermediate distribution with less particle degeneracy than SIR-ESRF. This difference in the intermediate distribution does not have a significant impact on the final posterior distribution. GMM-EnKF (upper right panel) uses a different formula for the particle weights-because the first step is a Gaussian mixture model rather than a sum of delta distributions-and thus chooses a different split α to achieve the target ESS of 30. As a result, GMM-EnKF produces an intermediate distribution that is more tightly clustered on the observation in comparison to the other hybrids. The posterior ensemble is also slightly less dispersed than the other hybrids, but is qualitatively similar.
To carefully compare the performance of the different methods, we solve the problem 1,000 times for each method over a range of ESS thresholds. The results are compared on the basis of the root mean squared error (RMSE) where the mean is taken over the 1,000 experiments, and the continuous ranked probability score (CRPS; [55,56]). The median of these 1,000 CRPS values is used as a summary statistic. We also run a standard SIR particle filter with 10 4 particles, as a reference approximation of the true Bayesian posterior. Fig 2 shows the performance of the methods as a function of the ESS threshold. The top panels show RMSE and the bottom panels show the median of CRPS for the U (left) and V (right) variables. The mean ESS of the ETPF over 1,000 trials is 4.4, so the smallest ESS threshold was set to 10. The ETPF performance is shown on the plots at ESS = 0, purely for convenience. The pure ESRF performance is shown on the plots at ESS = 100. In each panel the performance of the pure particle filter with an ensemble size of 10 4 is shown for reference.

PLOS ONE
All three hybrids perform similarly, though the hybrid of [23] performs slightly worse than the other two in terms of RMSE. This may be because the first step of the GMM-EnKF hybrid uses a GMM whose component Gaussians all use a covariance matrix obtained from the full prior ensemble; performance might be improved by using a clustering approach in the GMM following [57]. The hybrids of [24] and section 2.3 are both able to perform better than the pure particle filter when the ESS threshold is low, and are able to nearly match the performance of the true Bayesian filter as approximated by the pure particle filter with 10 4 particles. This is because the pure particle filter with 100 particles is still limited by low ESS (as underscored by the typical ESS value of 4.4).
The differences in RMSE between the methods are fairly small-on the order of 25% at most. Differences in CRPS, which measures the quality of the uncertainty quantification (UQ) associated with the ensemble, are much larger. The hybrid filters all achieve nearly optimal UQ, achieving more than 50% improvement in CRPS over both ETPF and ESRF.
The pure particle filter is quite general in the sense that it generates a consistent estimator of the true Bayesian posterior for a wide range of problems. The cost of this generality is the requirement of a very large ensemble size. The hybrids trade this generality for improved performance using smaller ensemble sizes on a specific subset of problems, namely those with medium nonlinearity.
The code and data associated with this section can be found in [58].

A two-scale Lorenz-'96 model
The experiments in this section make use of a model inspired by the Lorenz-'96 model [59,60] and developed in [45]. The standard two-scale (or 'two-layer') Lorenz-'96 model includes two sets of variables, X k and Y j,k . There are fewer X k variables, and they evolve more slowly than the Y j,k variables, so the X k variables are typically viewed as 'large-scale' while the Y j,k variables are viewed as 'small-scale. ' The difficulty with this model is that it lacks a clear connection to a spatial field of a physical quantity like temperature or velocity, observations of which contain both large and small scales. A model inspired by the Lorenz-'96 models that possesses a single set of variables x i with distinct large-scale and small-scale dynamics was developed in [45] and has been used recently as a test model for data assimilation in [61]. The model is governed by a system of ordinary differential equations of the form where h; F 2 R, J 2 N, 1 is a vector of ones, and The number of state variables in x is 41J; here J = 128 for a total system dimension of 5248.

Data assimilation system configuration
Reference solutions are generated by drawing initial conditions from an uncorrelated standard normal distribution and propagating the initial conditions by 9.0 time units by numerical intergration of the dynamical model, at which point the state arrives at a statistical steady state (cf. Fig 3). Upon reaching that statistically steady state, a reference state is produced at 1500 time intervals separated by 1.2 time units. In the usual interpretation of the standard Lorenz-'96 model, this time interval corresponds to 6 days, which is quite long compared to other studies. At shorter time intervals the model exhibits only mild nonlinearity, where the forecast distribution is still very nearly Gaussian even though the dynamics are nonlinear. At 6 days the forecast distributions are certifiably non-Gaussian, as shown in Fig 4. This figure was produced by projecting a forecast ensemble of 1200 members onto the three leading singular vectors of the ensemble's empirical covariance matrix. The forecast distribution is dramatically non-Gaussian within this subspace-therefore the EnKF assumption of a Gaussian prior is invalid.
Our hybrid is intended for situations with medium non-Gaussianity, where the prior is not Gaussian but the posterior is nearly Gaussian. To achieve an approximately Gaussian

PLOS ONE
posterior in the face of a non-Gaussian prior requires a large number of sufficiently-accurate observations. Observations are taken at every fourth grid point (i.e. 32 observations for each of the 41 large-scale modes), with observation error variance γ 2 = 1/2. This density and accuracy of observations is sufficient to produce a nearly-Gaussian posterior without rendering the data assimilation procedure superfluous. (If the observations are dense enough and accurate enough then the filter adds essentially no information to the observations; this situation is avoided here, as the filter accuracy remains better than the observational accuracy).
Ensemble members are initialized by propagating a sample from the uncorrelated multivariate standard normal distribution by 9.0 time units to arrive at an ensemble of substantially disparate states near the dynamic's attractor. Because this initial forecast ensemble is fairly uninformative of the true state, there is a transient in filter performance while the filter approaches its asymptotic optimal performance. The results of the first 100 assimilation cycles are ignored in computations of filter performance statistics, so that the results presented are reflective of the statistical steady state of the filter. The data assimilation system was run for 1500 cycles, i.e. nearly 25 years, for each trial in the experiment.

Parameter optimization
The ESRF used here has two primary tunable parameters: the inflation factor r and the localization radius L. The SIR-ESRF hybrid filter has an additional tunable parameter, the ESS threshold that determines the splitting factor α. The version of the hybrid filter with blurred observations (denoted BSIR-ESRF) also has tunable parameters related to the blurring, but

PLOS ONE
these should not be viewed as a primary means of optimizing performance; we expect the hybrid to outperform the pure ESRF using only reasonable blurring parameters chosen a priori. The demarcation between large and small scales occurs at Fourier wavenumber 20 for the Lorenz-'96 model considered here, so the blurring is chosen to have a Fourier spectrum To help substantiate a comparison between our SIR-ESRF hybrid approach and the pure ESRF filter, we independently tuned the respective filter parameters. This began by generating parameter configurations, described hereafter as "arms," from a Sobol sequence of low-discrepancy quasirandom numbers in a bounding box that we chose as a search space [63]. (The term "arm" comes from the literature on multi-armed bandits and denotes a particular configuration to be tested.) The range of inflation factors considered was from r = 0 to r = 0.08 for the pure ESRF, and from r = 0 to r = 0.15 for the hybrid. The range of ESS thresholds for the hybrid was from 66 to 400 for N = 400 and 200 to 1200 for the N = 1200. The range of localization radius L was from 128 points (equal to the separation between large-scale Lorenz-'96 modes) and 320 points. At larger localization radii the filter performance became highly erratic, with some experiments performing extremely well and others extremely poorly. It seems likely that at large localization radii there are rare occurrences of spurious long-range correlations that significantly degrade the filter performance.
For each arm, we ran at least four separate experiments with different reference solutions and initial ensembles. For each assimilation cycle we computed the resulting root mean square error (RMSE), spread, and continuous ranked probability score (CRPS; [55,56]) for both the forecast (prior) and analysis (posterior). RMSE and spread are scalar quantities at each timestep, but CRPS was computed for each state variable at each timestep. We then aggregated these quantities by computing a mean over all state variables, timesteps, and assimilation trials -excluding the first 100 timesteps to allow for filter burn-in.
We elected to optimize for mean analysis CRPS because it quantifies the accuracy of the entire distributional estimate, whereas RMSE only describes accuracy of the ensemble mean point estimate. The ensemble spread would also provide an estimate of the distributional accuracy, but CRPS is preferable in its ability to quantify the accuracy of non-Gaussian distributional estimates. The median was excluded as an aggregation function to optimize because we found it to be insufficiently sensitive to situations in which the filter produces large intermittent excursion from the true state.
After exploring broad patterns with a Sobol sequence, we switched to a Bayesian optimization method for choosing new arms to evaluate. Using a Bayesian optimization method substantially accelerated convergence to optimal filter parameters relative to the quasirandom search. In short, this involved fitting a Gaussian process surrogate model to the mean CRPS observations as a function on the parameter search space, and then choosing new arms that maximize a utility function under that surrogate model. We chose a utility function that estimates improvement from previously observed arms expected under the surrogate model. The arms are then evaluated in parallel, by running the filter on a subset of the reference simulations using those arms' filtering parameters. Those results are then incorporated with previous results to fit a new Gaussian process surrogate model used in the next iteration of the Bayesian optimization loop. The technical details of the optimization strategy we used are described in S1 Appendix.

Results: Lorenz-'96
The three methods have indistinguishable performance at ensemble sizes smaller than 400, and the performance of all three methods improves with increasing N up to N = 400. This suggests that for N < 400 sampling errors limit filter performance more than errors due to non-Gaussianity. It is probable that this threshold could be reduced with more sophisticated inflation and localization strategies (e.g. [64,65]). Table 1 presents the results for the optimal parameter configurations of each method at two ensemble sizes N = 400 and N = 1200.

N = 400
At an ensemble size of 400 the three methods yield very similar results. In all three cases the filter is clearly doing better than simply trusting the observations, because the RMSE is nearly half the standard deviation of observation error. The SIR-ESRF hybrid is slightly worse than the other two on average, because three of the eight runs produced significantly worse results, with analysis CRPS above 1.4. In contrast, the BSIR-ESRF hybrid produces results quite similar to the ESRF. One notable difference is that the optimal inflation parameter r is larger for the hybrid filters than for the pure ESRF, presumably to counteract the under-dispersion that results from the resampling step in the particle filter. (The optimal r for SIR-ESRF is 0.06 vs 0.026 for ESRF.) The optimal effective sample size for the SIR-ESRF hybrid was 297, which is fairly large compared to the ensemble size of 400. Fig 5 shows the GP surrogate model's prediction for analysis CRPS as a function of localization radius and inflation ratio (1+ r) for the pure ESRF filter at N = 400. Gray squares indicate parameter configurations where experiments were run. The left panel plots the mean of the GP, while the right panel plots the standard deviation. The optimal parameters are in a fairly broad well, with near-optimal localization radii ranging from 100 to 300 and corresponding inflation factors from r = 0 to r = 0.06. Interestingly, as the localization radius increases the corresponding optimal inflation factor does too. At larger localization radii the filter makes more use of each observation leading to greater reduction in the posterior spread, which needs to be counterbalanced by increased inflation. The GP surrogates for analysis RMSE and for forecast metrics are qualitatively similar, as is the behavior of the hybrid filters. The optimal localization radii for the three filters are L = 209 (ESRF), L = 279 (SIR-ESRF), and L = 238 (BSIR-ESRF). These optimal values should not be over-interpreted, because the filters are not overly sensitive to the localization radius within the broad well that contains the optimal values. Nevertheless, the fact that the hybrids are able to use a larger localization radius might suggest that the

PLOS ONE
particle filter resampling step is eliminating outliers that would otherwise lead to spurious long-range correlations.
All of the methods lead to under-dispersed ensembles in the sense that the ensemble spread is less than the RMSE. The forecast RMSE is 23% larger than the forecast spread for both ESRF and BSIR-ESRF, while it is 30% larger for SIR-ESRF. Forecast spread is here measured before inflation, but in every case the inflation is not enough to match the inflated spread to the forecast RMSE. The under-dispersion is worse for the analysis ensembles, with RMSE bigger than spread by 50% for ESRF and BSIR-ESRF, and by 80% for SIR-ESRF. This mismatch between spread and RMSE can be reduced by tuning the parameters (particularly by increasing the inflation), but only at the cost of increasing both the RMSE and the CRPS.

N = 1200
When the ensemble size is increased from 400 to 1200 the performance of the pure ESRF remains essentially flat, with only a minor improvement in analysis RMSE. This shows that for N � 400 the performance of the pure ESRF is limited primarily by the Gaussian approximation rather than by sampling errors. The optimal inflation parameter for ESRF reduces from r = 0.026 at N = 400 to r = 0.015 at N = 1200, and the optimal localization radius increases from L = 209 to L = 250. This is consistent with the intuition that as ensemble size increases less inflation and localization are needed to counteract sampling errors. The ensemble remains about as under-dispersed as at N = 400, with forecast RMSE 22% larger than forecast spread and analysis RMSE 42% larger than analysis spread.
The performance of the hybrid filters improves with increased ensemble size, with small improvements in CRPS and RMSE. The SIR-ESRF hybrid is now nearly indistinguishable from the pure ESRF, and the BSIR-ESRF hybrid outperforms both by 5-10% in CRPS and RMSE. It is not clear whether further improvements could be obtained by increasing the ensemble size, or whether the hybrids are already close to the true Bayesian posterior. No

PLOS ONE
investigations have been performed at larger ensemble sizes due to the computational expense of optimizing parameters with very large ensembles.
Blurring of the observations enables the BSIR-ESRF hybrid to slightly out-perform the ESRF and the SIR-ESRF hybrid. The split parameter α for a fixed ESS threshold tends to be larger in the hybrid with blurred observations: at N = 400 the median α for SIR-ESRF is 10 −3.14 while the median α for BSIR-ESRF is 10 −2.91 ; at N = 1200 the median α for SIR-ESRF is 10 −2.96 while the median α for BSIR-ESRF is 10 −2.48 . This suggests that for a fixed split α the blurring increases the ESS, but when the split α is instead chosen to produce a desired ESS the smoothing instead impacts which ensemble members are selected for resampling. Heuristically this can be explained as follows: The hybrid essentially decides a priori how many distinct ensemble members will remain after resampling (by fixing the ESS), so the only impact of the blurring will be on which ensemble members are eliminated and which are replicated. The effect of blurring is to trade improved assimilation performance at large scales for degraded performance at small scales; this trade is effective because predictability on longer time horizons comes from the large scales. Indeed, the RMSE of the analysis mean projected onto the large scale modes is better for BSIR-ESRF than the other two methods.
When the ensemble size increases from N = 400 to N = 1200 the optimal inflation for the SIR-ESRF hybrid decreases from r = 0.06 to r = 0.04, and the optimal localization radius increases from L = 279 to L = 316. The optimal ESS threshold is 757, although results are not overly sensitive for ESS thresholds in the range of 500 to 800. The optimal parameters of the BSIR-ESRF method are similar: r = 0.02, L = 319, and ESS threshold 642. Code and summary data associated with this section can be found in [58].

Conclusions
This paper has developed a hybrid particle ensemble Kalman filter targeting applications with medium nonlinearity, i.e. applications where the prior (forecast) distribution is very non-Gaussian but the posterior (analysis) distribution is close to Gaussian. It was argued in [38] that variational methods are more appropriate than the EnKF in this situation, because they approximate the posterior as Gaussian whereas EnKF methods approximate the prior as Gaussian. The hybrid developed here is a pure ensemble approach for problems with medium nonlinearity, not requiring any variational optimization. The particle filter acts first and results in an intermediate distribution that is closer to Gaussian than the prior; this intermediate distribution is then presented to the EnKF, and matches more closely the Gaussian approximation inherent to the EnKF. The hybrid developed here is similar in spirit to previouslydeveloped hybrids (e.g. [23,24]). The main differences, besides the emphasis on medium nonlinearity, are the use of a serial square root filter and of a random resampling of the posterior ensemble to break the particle degeneracy introduced by the resampling step of the particle filter.
The hybrid SIR-ESRF developed here includes a resampling step that reduces the number of distinct ensemble members seen by the EnKF part of the hybrid (which is, in this case, a serial square root ESRF). The EnKF's performance is limited by sampling errors even in purely Gaussian problems, so reducing the number of distinct ensemble members used within the EnKF increases the sampling errors and can hurt performance. Our hybrid is configured such that the ESS in the particle filter step is specified a priori, and we find that the optimal ESS threshold for the hybrid needs to be at least as large as the ensemble size needed to obtain optimal performance in the pure EnKF. (ESRF performance stopped improving for ensemble sizes greater than 400, and the optimal ESS in the hybrid was between 500 and 800.) This leads one to expect that a larger ensemble size is required for the hybrid to outperform a pure EnKF, so that the particle filter component of the hybrid can effectively resample from the full ensemble size down to a size that is still large enough to obtain good EnKF performance. In problems where non-Gaussianity presents in the form of a few outliers in an otherwise nearly-Gaussian distribution, the hybrid will presumably need only a slightly larger ensemble size, so that it can eliminate outliers during the resampling step. But in problems where the forecast exhibits pathological non-Gaussianities such as multi-modality or strong curvature such as that seen in Fig 4, a much larger ensemble may be needed in order for the hybrid to outperform the pure EnKF. The SIR-ESRF hybrid did not achieve significant improvements over the pure ESRF in our experiments on the multiscale Lorenz-'96 model, but the BSIR-ESRF using smoothing of the innovations achieved limited improvements (5-10% improvement in CRPS and RMSE). This limited improvement compared to a pure ESRF may be a reflection of the fact that non-Gaussianity of the forecast is confined to a fairly low dimensional subspace associated with the leading singular vectors, i.e. the fastest directions of expansion along the system's attractor.