## Abstract

The process of integrating observations into a numerical model of an evolving dynamical system, known as data assimilation, has become an essential tool in computational science. These methods, however, are computationally expensive as they typically involve large matrix multiplication and inversion. Furthermore, it is challenging to incorporate a constraint into the procedure, such as requiring a positive state vector. Here we introduce an entirely new approach to data assimilation, one that satisfies an information measure and uses the unnormalized Kullback-Leibler divergence, rather than the standard choice of Euclidean distance. Two sequential data assimilation algorithms are presented within this framework and are demonstrated numerically. These new methods are solved iteratively and do not require an adjoint. We find them to be computationally more efficient than Optimal Interpolation (3D-Var solution) and the Kalman filter whilst maintaining similar accuracy. Furthermore, these Kullback-Leibler data assimilation (KL-DA) methods naturally embed constraints, unlike Kalman filter approaches. They are ideally suited to systems that require positive valued solutions as the KL-DA guarantees this without need of transformations, projections, or any additional steps. This Kullback-Leibler framework presents an interesting new direction of development in data assimilation theory. The new techniques introduced here could be developed further and may hold potential for applications in the many disciplines that utilize data assimilation, especially where there is a need to evolve variables of large-scale systems that must obey physical constraints.

**Citation:** Pimentel S, Qranfal Y (2021) A data assimilation framework that uses the Kullback-Leibler divergence. PLoS ONE 16(8): e0256584. https://doi.org/10.1371/journal.pone.0256584

**Editor:** Ibrahim Hoteit, King Abdullah University of Science and Technology, SAUDI ARABIA

**Received:** March 10, 2021; **Accepted:** August 10, 2021; **Published:** August 26, 2021

**Copyright:** © 2021 Pimentel, Qranfal. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** The code used to generate all data and figures in the manuscript is available on Figshare (https://dx.doi.org/10.6084/m9.figshare.14194004).

**Funding:** This work was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

Data assimilation is the process by which we merge two types of information about a dynamic system: a numerical model of the underlying processes and observations of the evolving system. The resulting analysis should ideally be optimal in the sense of accounting for the errors and representativeness of both the model and the observations. The data assimilation procedure can be used to improve initial conditions, boundary conditions and/or parameter values of the numerical model, resulting in better estimates of the state of the system and improved predictability. Data assimilation is most prominently used in the atmospheric and oceanographic sciences, where it is essential for modern numerical weather prediction [1]. In addition to being used extensively throughout the geosciences [2], it is increasingly proving to be a useful computational tool in a wide array of other disciplines, including medicine [3], epidemiology [4], ecology [5], and neurobiology [6].

In recent years there has been a renewed interest in the mathematical foundations of data assimilation (e.g., [7, 8]). One of the research developments being pursued is the consideration of different metrics for the model–observation differences and the regularizer that are minimized in data assimilation algorithms. The standard data assimilation approach involves minimizing an objective function of weighted L_{2}-norms, otherwise known as Tikhonov regularization. However, minimization involving an L_{1}-norm for the regularization (or background) term has also been used, and this is found to be particularly useful for tracking sharp fronts and discontinuities (e.g., [9, 10]). Rao et al. [11] found L_{1}-norm data assimilation to be beneficial when dealing with outlier observations, but with the drawback that solutions lacked smoothness near the mean. This desirable property was retained by using the Huber-norm, a hybrid that utilizes L_{1} in the presence of outliers and L_{2} close to the mean. An alternative approach to data assimilation, explored by Feyeux et al. [12], utilizes optimal transport theory, and within this context the Wasserstein distance (minimizing kinetic energy) replaces the L_{2} distance. Feyeux et al. [12] demonstrated how this approach holds potential for addressing position errors; see also Li et al. [13]. In this paper we introduce another alternative approach to data assimilation, one which explores an information perspective and uses the Kullback-Leibler divergence.

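The Kullback-Leibler divergence (KL) between two probability distributions, *P* and *Q*, originally proposed by Kullback and Leibler [14], is defined as the expectation of the logarithmic difference between the probabilities *P* and *Q*, where the expectation is taken using *P*,

$$\mathrm{KL}(P, Q) = \mathbb{E}_{P}\!\left[\ln\frac{P}{Q}\right]. \tag{1}$$

For two probability densities *P* and *Q* of a continuous random variable *x* we have

$$\mathrm{KL}(P, Q) = \int P(x)\,\ln\frac{P(x)}{Q(x)}\,\mathrm{d}x, \tag{2}$$

and for two probability distributions of a discrete random variable, with probabilities $p_{i}$ and $q_{i}$,

$$\mathrm{KL}(P, Q) = \sum_{i} p_{i}\,\ln\frac{p_{i}}{q_{i}}. \tag{3}$$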

Sometimes referred to as the cross-entropy distance or relative entropy, the Kullback-Leibler divergence can be thought of as measuring the discrepancy between probability distributions, in this case the divergence of *P* from *Q*. From a Bayesian perspective it is a measure of information gained when an a priori probability distribution, *Q*, is updated to the posterior probability distribution, *P*. The Kullback-Leibler divergence is not strictly a true metric, and is not symmetric, hence in general KL(*P*, *Q*) ≠ KL(*Q*, *P*). This is because it is an expected value (Eq 1) and therefore it can differ depending on which distribution you take the expectation with respect to. Therefore, the Kullback-Leibler divergence is not a measure of distance in the usual sense but rather can be thought of as a directed, or orientated, distance; although a symmetric KL-functional, known as the Jensen-Shannon divergence (JSD), can be constructed as
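
$$\mathrm{JSD}(P, Q) = \tfrac{1}{2}\,\mathrm{KL}(P, M) + \tfrac{1}{2}\,\mathrm{KL}(Q, M), \tag{4}$$

where $M = \tfrac{1}{2}(P + Q)$.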

A broader class of f-divergence was introduced by Csiszár (see, for example, [15]):

$$D_{f}(P, Q) = \sum_{i=1}^{n} q_{i}\, f\!\left(\frac{p_{i}}{q_{i}}\right), \tag{5}$$

where *f* is a convex function on (0, ∞) with *f*(1) = 0, *P* = (*p*_{1}, …, *p*_{n})^{⊤} and *Q* = (*q*_{1}, …, *q*_{n})^{⊤}. Following Csiszár [15], the KL-divergence for arbitrary $P, Q \in \mathbb{R}^{n}_{+}$ can be considered the f-divergence with *f*(*t*) = *t* ln(*t*) − *t* + 1, that is

$$\mathrm{KL}(P, Q) = \sum_{i=1}^{n} \left( p_{i}\,\ln\frac{p_{i}}{q_{i}} - p_{i} + q_{i} \right). \tag{6}$$

This is sometimes referred to as the generalized KL-divergence or the unnormalized KL-divergence; noting that if *P* and *Q* are probability distributions then $\sum_{i} p_{i} = \sum_{i} q_{i} = 1$ and the linear terms fall away giving the (normalized) KL-divergence (Eq 3). The Bregman divergences [16] are another important class of divergences between non-negative vectors, defined in terms of a strictly convex function, of which the generalized KL-divergence (Eq 6) is also a special case. Eq (6) is the definition that will be used throughout this paper, and is consistent with use by others, such as those within the signal processing and optimization community (e.g., [17, 18]). Note that the two fundamental properties remain intact; namely,

- (i) non-negativity: KL(*P*, *Q*) ≥ 0 with equality if and only if *P* = *Q*, and
- (ii) asymmetry: KL(*P*, *Q*) ≠ KL(*Q*, *P*).

At its essence KL(*P*, *Q*) is a coding penalty associated with selecting *Q* to approximate *P*. This KL-divergence also satisfies the homogeneity property of a distance:

$$\mathrm{KL}(cP, cQ) = c\,\mathrm{KL}(P, Q) \quad \text{for any } c > 0.$$

It is informative to contrast the KL-divergence (Eq (6)) with the Euclidean distance, d, in a simple case: d(1001, 1000) = d(2, 1) = 1 whereas KL(1001, 1000) ≈ 0.0005 and KL(2, 1) = 0.3863. The discrepancy in KL values actually has greater similarity to that of the relative Euclidean distance (1/1001 ≈ 0.001 and 1/2 = 0.5). In this sense the Euclidean distance can be loosely thought of in terms of an absolute difference and the KL-divergence in terms of a relative difference.
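The scalar comparison above is easy to reproduce numerically. The following is a minimal sketch, assuming the generalized KL-divergence is evaluated entry-wise as the sum of *p* ln(*p*/*q*) − *p* + *q*; the function name `generalized_kl` and the small test vectors are illustrative choices, not part of the original work.

```python
import numpy as np

def generalized_kl(p, q):
    """Unnormalized KL divergence: sum_i p_i*ln(p_i/q_i) - p_i + q_i, for p, q > 0."""
    p = np.atleast_1d(np.asarray(p, dtype=float))
    q = np.atleast_1d(np.asarray(q, dtype=float))
    if np.any(p <= 0) or np.any(q <= 0):
        raise ValueError("generalized_kl requires strictly positive entries")
    return float(np.sum(p * np.log(p / q) - p + q))

# Absolute difference, relative difference, and KL for the two scalar pairs.
for p, q in [(1001.0, 1000.0), (2.0, 1.0)]:
    print(f"d = {abs(p - q):.1f}, relative = {abs(p - q) / p:.4f}, "
          f"KL = {generalized_kl(p, q):.4f}")

# Asymmetry and homogeneity checked on a small positive vector pair (illustrative values).
P = np.array([2.0, 0.5, 1.0])
Q = np.array([1.0, 1.0, 1.0])
print(generalized_kl(P, Q), generalized_kl(Q, P))
print(np.isclose(generalized_kl(3 * P, 3 * Q), 3 * generalized_kl(P, Q)))
```

The two KL values differ by roughly three orders of magnitude even though the absolute differences are identical, which is the "relative difference" behaviour described in the text.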

The KL-divergence, as an information measure rather than a distance measure, will lead to an interesting new approach to data assimilation. Furthermore, the naturally embedded positivity constraint will prove useful in many problems to which data assimilation is applied. Constraints are often required, or are desirable, to enforce non-negativity of certain physical quantities, such as length, volume and mass, variables such as precipitation and humidity, and concentrations of tracers. However, Kalman filter type approaches do not naturally handle such constraints. Quick-fix approaches such as simply setting negative values to zero are not optimal. The more involved approach of Gaussian anamorphosis (e.g. [19, 20]), whereby a nonlinear change of variables, such as a log transform, is introduced during the analysis step, is also not ideal and may not be suitable for certain applications (e.g. [21]). More sophisticated constrained Kalman filtering data assimilation methods such as [22, 23] incorporate an optimization step that involves a projection into the constrained region. In contrast, our proposed KL-divergence filtering method will guarantee a positive state vector by construction, without need for any projection, transformations, ad-hoc adjustments, or any additional steps.

## Methods

In this section we review the traditional formulation of the data assimilation problem and outline sequential solution approaches. We then extend this formulation to describe a new data assimilation approach that utilizes Kullback-Leibler divergence.

### The data assimilation problem

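Suppose at some time *t*_{k} we have partial observations, *y*_{k}, and a background estimate, $x^{b}_{k}$, of some true state, $x^{t}_{k}$. The best estimate of that state, utilizing both background and observations, is given by the minimum, with respect to *x*, of an objective function, known as the 3D-Var cost function,

$$J(x) = \left\| x - x^{b}_{k} \right\|^{2}_{(P^{b}_{k})^{-1}} + \left\| y_{k} - H_{k} x \right\|^{2}_{R_{k}^{-1}}. \tag{7}$$

Here $\|a\|^{2}_{A} = \langle a, a \rangle_{A}$ is a squared L_{2}-norm weighted by a symmetric positive definite matrix *A* (here an inverse covariance matrix), with the weighted inner product defined as 〈*a*, *b*〉_{A} = *a*^{⊤} *Ab*. The observations contain serially uncorrelated Gaussian errors, *μ*_{k}, and are related to the state by an observation operator *H*_{k}, such that $y_{k} = H_{k} x^{t}_{k} + \mu_{k}$ with $\mathbb{E}[\mu_{k}] = 0$ and $\mathbb{E}[\mu_{k}\mu_{k}^{\top}] = R_{k}$. The background estimate also contains serially uncorrelated Gaussian errors, such that $\mathbb{E}[x^{b}_{k} - x^{t}_{k}] = 0$ and $\mathbb{E}[(x^{b}_{k} - x^{t}_{k})(x^{b}_{k} - x^{t}_{k})^{\top}] = P^{b}_{k}$. The minimum of the cost function (Eq 7) is a standard result given as:

$$x_{k} = x^{b}_{k} + P^{b}_{k} H_{k}^{\top} \left( H_{k} P^{b}_{k} H_{k}^{\top} + R_{k} \right)^{-1} \left( y_{k} - H_{k} x^{b}_{k} \right). \tag{8}$$

From a Bayesian perspective *x*_{k} is the expectation of the state of the system conditioned on the data, $x^{b}_{k}$ and *y*_{k} (see, for example, [7]). In the case of non-Gaussian errors this will still be the best linear unbiased estimator.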
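As an illustration of the direct solve, the sketch below computes the standard best-linear-unbiased update, background plus gain times innovation, with NumPy. The function name `oi_analysis`, the variable names, and all numerical values are hypothetical and chosen only to keep the example small; a linear solve is used in place of an explicit matrix inverse.

```python
import numpy as np

def oi_analysis(x_b, P_b, y, H, R):
    """Closed-form minimiser of the quadratic 3D-Var cost (the OI update)."""
    innovation = y - H @ x_b                 # observation minus background
    S = H @ P_b @ H.T + R                    # innovation covariance
    # Solve S w = innovation instead of forming the inverse explicitly.
    return x_b + P_b @ H.T @ np.linalg.solve(S, innovation)

# Hypothetical two-variable system in which only the first variable is observed.
x_b = np.array([1.0, 2.0])                   # background state
P_b = np.array([[1.0, 0.5], [0.5, 1.0]])     # background error covariance
H = np.array([[1.0, 0.0]])                   # observation operator
R = np.array([[0.25]])                       # observation error covariance
y = np.array([2.0])                          # observation

x_a = oi_analysis(x_b, P_b, y, H, R)
print(x_a)
```

The off-diagonal entry of `P_b` is what moves the unobserved second variable; with a diagonal background covariance it would be left unchanged by this update.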

### The Kalman filter

In a sequentially updated system the background state is provided by the model forecast, $x^{f}_{k}$, at time *t*_{k}, and is then updated by the best estimate (Eq 8) to give the analysis state, $x^{a}_{k}$, at time *t*_{k}. This analysis state is then evolved forward in time by a numerical model of the evolving system, *M*_{k,k+1}, to give the forecast state, $x^{f}_{k+1}$, at time *t*_{k+1}. In a similar manner the background errors are also evolved in time and updated to give a forecast error covariance, $P^{f}_{k+1}$, and an analysis error covariance, $P^{a}_{k+1}$, at time *t*_{k+1}. For a linear system, with matrices *M*_{k−1,k} and *H*_{k}, this process gives rise to the well-known Kalman filter:

$$x^{f}_{k} = M_{k-1,k}\, x^{a}_{k-1}, \qquad P^{f}_{k} = M_{k-1,k}\, P^{a}_{k-1}\, M_{k-1,k}^{\top} + Q_{k}, \tag{9}$$

$$x^{a}_{k} = x^{f}_{k} + K_{k} \left( y_{k} - H_{k} x^{f}_{k} \right), \qquad P^{a}_{k} = \left( I - K_{k} H_{k} \right) P^{f}_{k}, \tag{10}$$

where

$$K_{k} = P^{f}_{k} H_{k}^{\top} \left( H_{k} P^{f}_{k} H_{k}^{\top} + R_{k} \right)^{-1} \tag{11}$$

is referred to as the Kalman gain matrix and *Q*_{k} denotes the covariance of the model error (assumed normally distributed, unbiased and serially uncorrelated).
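One forecast and analysis cycle of the linear Kalman filter can be sketched as follows. This is a textbook form written under the assumptions stated in the text (linear model and observation operator, unbiased Gaussian errors); the function name `kalman_step` and the example matrices are hypothetical.

```python
import numpy as np

def kalman_step(x_a, P_a, M, Q, H, R, y):
    """One forecast/analysis cycle of the linear Kalman filter."""
    # Forecast: propagate the state and its error covariance.
    x_f = M @ x_a
    P_f = M @ P_a @ M.T + Q
    # Gain K = P_f H^T (H P_f H^T + R)^{-1}, computed with a linear solve
    # (valid because P_f and the innovation covariance S are symmetric).
    S = H @ P_f @ H.T + R
    K = np.linalg.solve(S, H @ P_f).T
    # Analysis: correct the forecast with the innovation.
    x_new = x_f + K @ (y - H @ x_f)
    P_new = (np.eye(x_f.size) - K @ H) @ P_f
    return x_new, P_new

# Hypothetical example: a 90-degree clockwise rotation with the first variable observed.
M = np.array([[0.0, 1.0], [-1.0, 0.0]])
H = np.array([[1.0, 0.0]])
x_a, P_a = np.array([1.0, 0.0]), np.eye(2)
x_a, P_a = kalman_step(x_a, P_a, M, 0.01 * np.eye(2), H,
                       np.array([[0.1]]), np.array([0.2]))
print(x_a)
print(P_a)
```

The covariance propagation and the solve against the innovation covariance are the steps whose cost grows quickly with system size, which is the motivation for the approximations discussed next.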

### Kalman filter approximations

It is computationally expensive to propagate the error covariance matrix forward in time (Eq 9), prohibitively so for large systems. Furthermore, computing the full Kalman gain matrix (Eq 11), at each step, is typically impractical as it involves multiplying and inverting large matrices (which may be ill conditioned). As such, an approximation often used is to fix the error covariance matrix, that is, let *P*_{k} = *P*_{0} for all *t*_{k}. This implementation method is often referred to as Optimal Interpolation (OI). To aid the inversion step the matrix is typically modified to have a simplified structure. For a non-linear model and/or observation operator a linearization about the background state is required in order to propagate covariances and this formulation is known as the extended Kalman filter (EKF). The ensemble Kalman filter (EnKF) is a popular implementation approach as it does not require a linear approximation, but instead involves propagating an ensemble of analysis vectors and then updating the ensemble using the observations, where the state vector is the ensemble mean and the state error covariance matrix is constructed by the ensemble covariance matrix (see, for example, [24]).

### Kullback-Leibler regularization

The use of Kullback-Leibler minimization for static inverse problems has previously been established. For example, Resmerita and Anderssen [25] have highlighted the choice of KL-divergence as both residue minimizer and regularizer in a two-term cost function for solving ill-posed linear inverse problems. We now describe two iterative methods for minimizing functionals involving additive KL-divergence terms.

#### Expectation maximization (EM).

The expectation maximization (EM) algorithm [17] was originally used by Byrne [26] to determine the solution to the following regularization problem:

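For $x \in \mathbb{R}^{N}_{+}$ and 0 ≤ *α* ≤ 1 minimise

$$\alpha\,\mathrm{KL}(d, Tx) + (1 - \alpha)\,\mathrm{KL}(q, x) \tag{12}$$

to solve a possibly inconsistent linear system *Tx* = *d*, where $T \in \mathbb{R}^{M \times N}$ and $d \in \mathbb{R}^{M}_{+}$, and where $q \in \mathbb{R}^{N}_{+}$ is an a priori estimate of *x*.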

The solution, equivalent to maximizing Burg entropy, is derived from alternating minimization of related KL distances between convex sets [26]. Following Qranfal and Byrne [27] the iterative solution can be described in a single step:
(13)
where for some initial start vector *x*^{0} > 0, *ℓ* is iterated until convergence, provided that for each *j*.
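To make the iterative character of the EM approach concrete, here is a minimal sketch of a multiplicative fixed-point iteration. It assumes the objective is α KL(*d*, *Tx*) + (1 − α) KL(*q*, *x*) and uses the update obtained by setting the gradient of that objective to zero (equivalently, by minimizing the usual EM surrogate). The exact form and normalization of the published iteration may differ, so the function `em_regularized`, its arguments, and the test system are all illustrative.

```python
import numpy as np

def em_regularized(T, d, q, alpha, x0=None, tol=1e-9, max_iter=10_000):
    """Fixed-point iteration for min_x alpha*KL(d, Tx) + (1 - alpha)*KL(q, x).

    Assumes T has non-negative entries and d, q, x0 are strictly positive.
    """
    T, d, q = (np.asarray(a, dtype=float) for a in (T, d, q))
    s = T.sum(axis=0)                          # column sums s_j = sum_i T_ij
    x = q.copy() if x0 is None else np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        back = T.T @ (d / (T @ x))             # sum_i T_ij d_i / (T x)_i
        x_new = (alpha * x * back + (1 - alpha) * q) / (alpha * s + (1 - alpha))
        if np.max(np.abs(x_new - x)) < tol:    # stop when successive iterates agree
            return x_new
        x = x_new
    return x

# Hypothetical positive system: data generated from x = (1, 3), prior estimate q = (2, 2).
T = np.array([[0.5, 0.5], [0.2, 0.8]])
d = T @ np.array([1.0, 3.0])
q = np.array([2.0, 2.0])
alpha = 0.7
x_hat = em_regularized(T, d, q, alpha, x0=np.ones(2))
print(x_hat)
```

Every quantity in the update is a product, sum, or ratio of positive numbers, so the iterates remain positive without any projection step.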

#### The simultaneous multiplicative algebraic reconstruction technique (SMART).

The simultaneous multiplicative algebraic reconstruction technique (SMART) was introduced by Byrne [17, 26] as a means of solving the following problem:

For *x* > 0 and 0 ≤ *α* ≤ 1 minimise

$$\alpha\,\mathrm{KL}(Tx, d) + (1 - \alpha)\,\mathrm{KL}(x, q) \tag{14}$$

to approximate the solution to the linear system of equations *Tx* = *d* with *q* an a priori estimate of *x*. Recall that the KL-divergence is not symmetric, hence Eq (14) is formally a different problem from that of Eq (12).

The solution to Eq (14), equivalent to maximizing Shannon entropy, was determined by Byrne [26] using a two-step alternating projections algorithm, which can be expressed in a convex combination compact form [28]:
(15)
where *ℓ* is iterated until convergence, starting from an initial guess *x*^{0} > 0.
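A comparable sketch for a SMART-type iteration is given below. It assumes the objective is α KL(*Tx*, *d*) + (1 − α) KL(*x*, *q*) and applies the multiplicative SMART update in the log domain, scaled by the weighted column sums so that it reduces to a convex combination of logarithms when the columns of *T* sum to one. This is one way of writing such an iteration, not necessarily the published compact form; `smart_regularized` and the test values are hypothetical.

```python
import numpy as np

def smart_regularized(T, d, q, alpha, x0=None, tol=1e-9, max_iter=10_000):
    """Log-domain iteration for min_x alpha*KL(Tx, d) + (1 - alpha)*KL(x, q).

    Assumes T has non-negative entries and d, q, x0 are strictly positive.
    """
    T, d, q = (np.asarray(a, dtype=float) for a in (T, d, q))
    c = alpha * T.sum(axis=0) + (1 - alpha)     # weighted column sums
    x = q.copy() if x0 is None else np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = alpha * (T.T @ np.log(d / (T @ x))) + (1 - alpha) * np.log(q / x)
        x_new = x * np.exp(step / c)            # multiplicative update keeps x > 0
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x

# Same hypothetical system as one might use for the EM-type iteration.
T = np.array([[0.5, 0.5], [0.2, 0.8]])
d = T @ np.array([1.0, 3.0])
q = np.array([2.0, 2.0])
alpha = 0.7
x_hat = smart_regularized(T, d, q, alpha, x0=np.ones(2))
print(x_hat)
```

Because the two objectives swap the arguments of each KL term, this solution generally differs from the EM-type solution for the same data, although both remain positive.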

### Kullback-Leibler based data assimilation

The cost functions of Eqs (12) and (14) can be reformulated to solve a data assimilation problem analogous to Eq (7). In this context the linear system *Tx* = *d* can be used to represent how the observations *y* are related to the state vector *x* through the observation operator *H*, namely *Hx* = *y*, and the a priori estimate *q* is given by the model forecast *x*^{f}. For example, taking the case from Eq (12), we can derive a weighted Kullback-Leibler objective function, using our data assimilation notation, as follows:
(16)
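The covariance matrices, *R*_{k} and $P^{f}_{k}$, and their inverses, are symmetric positive definite and by the Cholesky decomposition may be expressed in the form:

$$R_{k}^{-1} = U_{k}^{\top} U_{k}, \tag{17}$$

$$\left(P^{f}_{k}\right)^{-1} = V_{k}^{\top} V_{k}, \tag{18}$$

where *U* and *V* are upper triangular matrices with positive diagonal entries. Just as $\|a\|^{2}_{R_{k}^{-1}} = \|U_{k} a\|^{2}_{2}$, for example, so too for the weighted KL-divergence; hence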
(19)
The KL-divergence must be between two positive quantities (see Eq 6); therefore, we are restricted in that we require positive entries for *U*_{k} *y*_{k}, *U*_{k} *H*_{k} *x*, $V_{k} x^{f}_{k}$, and *V*_{k} *x*. Suppose we have a positive system, such that *y*_{k}, *H*_{k}, and $x^{f}_{k}$ each have positive entries. This would be the case for a wide range of applied science applications; if not, a general system can always be converted to a positive system after applying some transformations (see, for example, [28]). However, even with the supposed positive system, *U*_{k} and *V*_{k} may still have negative off-diagonal entries. For this derivation we will therefore restrict ourselves to white noise for the observation error and the forecast error, such that

$$R_{k} = (\sigma_{o})_{k}^{2}\, I, \tag{20}$$

$$P^{f}_{k} = (\sigma^{f})_{k}^{2}\, I. \tag{21}$$
This is the usual structure for the observation error (Eq 20) as observations are typically local in space and considered independent; however, for the forecast error (Eq 21) this form does present some restrictions as it limits our ability to spread information from an observed part of the system to an unobserved part. We will circumvent this to some extent by interpolating observations to all grid points and assimilating these interpolated values, with reduced weight the further they are from the measurement location. This localization procedure enables the smooth spatial spread of information from the measurement point to nearby locations, without the need for off-diagonal terms in the covariance matrix.

Proceeding with (20) and (21) we have $U_{k} = (\sigma_{o})_{k}^{-1} I$ and $V_{k} = (\sigma^{f})_{k}^{-1} I$, with (*σ*_{o})_{k}, (*σ*^{f})_{k} > 0 for all *k*, and after some algebraic adjustments we can derive
(22)
We now have a weighted KL objective function that matches the form of Eq (12) and the minimum can be found using the iterative method of Eq (13). This can be solved sequentially as a filtering algorithm by evolving the model state and updating the forecast based on the observations. We have therefore outlined an EM data assimilation filter that can be compared to traditional 3D-Var/OI. Namely,
(23)
for *j* = 1, …, *N* and *ℓ* is iterated until convergence, and where .

A similar procedure can be followed to produce a SMART data assimilation filter
(24)
for *j* = 1, …, *N* and *ℓ* is iterated until convergence, and where .

Hence, we have now developed two new data assimilation methods that minimize Kullback-Leibler divergence. What we have proposed is an entirely new perspective to the traditional data assimilation scheme, one that involves an information measure for the closeness of fit between model and data. Furthermore, these KL data assimilation algorithms guarantee positivity for the solution without the need for projections or transformations and they do not require an adjoint code or the storage, multiplication or inversion of covariance matrices.

## Numerical experiments

We examine the performance of the EM and SMART data assimilation filters with respect to Optimal Interpolation (OI) and the Kalman filter (KF), including the extended Kalman filter (EKF) and the ensemble Kalman filter (EnKF) for a non-linear application. To demonstrate and compare these algorithms we perform so-called ‘twin experiments’ whereby noisy pseudo-observations, , are taken from a truth run . The initial model state is offset from the truth, , the model is evolved forward in time, , and the observations are assimilated in an attempt to recover the truth run. The error statistics are such that
(25)
(26)
(27)
We will demonstrate the KL-minimizing data assimilation methods (EM filter and SMART filter) using three different numerical experiments.

### Experiment 1

We first consider a two-dimensional linear dynamics problem taken from [8]

$$x_{k+1} = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} x_{k}, \tag{28}$$

in which the state vector is rotated 90° in a clockwise direction at each step. To perform these simulations in the positive quadrant we rotate about the point . We first translate the state vector so that the point of rotation is moved to the origin, then rotate the relocated state vector about the origin (Eq 28), and finally undo the translation step to return the state vector to its new rotated location. Within our twin-experiment the background guess follows these deterministic dynamics; however, the truth solution involves the addition of random noise to Eq 28, producing stochastic dynamics as the noise causes a random shift of the origin (see Fig 1). The experiments are performed with an initial condition offset from , where the offset is taken from the normal distribution with mean 0 and variance . The random model error (that produces the stochastic dynamics) is normally distributed with mean zero and variance . Observations are generated from the truth run at each step from the *x*_{1} variable only and include a normally distributed random measurement error of mean zero and variance .
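The translate, rotate, translate-back procedure can be sketched in a few lines. The centre of rotation, the initial state, and the noise level below are placeholder values chosen only so that the trajectory stays in the positive quadrant, and adding the model noise after the rotation is an assumption about where the noise enters.

```python
import numpy as np

ROT_CW_90 = np.array([[0.0, 1.0], [-1.0, 0.0]])   # 90-degree clockwise rotation

def rotate_about(x, centre, rng=None, model_std=0.0):
    """Rotate x by 90 degrees clockwise about `centre`, optionally adding noise."""
    shifted = x - centre                  # move the centre of rotation to the origin
    rotated = ROT_CW_90 @ shifted         # rotate about the origin
    noise = rng.normal(0.0, model_std, size=2) if rng is not None else 0.0
    return rotated + centre + noise       # undo the translation

centre = np.array([10.0, 10.0])           # hypothetical centre of rotation
x0 = np.array([11.0, 10.0])               # hypothetical initial state

# Deterministic (background) dynamics: a period-4 orbit around the centre.
background = [x0]
for _ in range(4):
    background.append(rotate_about(background[-1], centre))

# Stochastic (truth) dynamics with a hypothetical model-error standard deviation.
rng = np.random.default_rng(0)
truth = [x0]
for _ in range(4):
    truth.append(rotate_about(truth[-1], centre, rng=rng, model_std=0.1))

print(np.array(background))
```

In the noise-free case the state returns to its starting point after four steps, while the noisy trajectory drifts, which is the contrast between the background and truth runs described above.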

Fig 1. The modelled phase plane of the background solution (a) and the truth solution (b). The initial conditions are given by the red dot and the solution points are in blue.

### Experiment 2

This one-dimensional non-linear dynamics problem involves a sine map (29) and produces deterministic behaviour converging to a period-2 solution; however, the addition of noise creates stochastic dynamics producing considerably different bistable behaviour between two separate period-2 solutions [8]. We start with a mean initial condition of 10. To run this experiment with positive values we subtract 10 from the solution before applying the forward model (Eq 29) and then add 10 after the sine map has been applied (see Fig 2). The stochastic behaviour of the truth run is generated with the addition of normally distributed model errors of mean 0 and variance . The model forecast has an initial condition error that is taken from the normal distribution with mean 0 and variance . Observations are acquired from the truth run at each step and measurement error is added that is normally distributed with mean 0 and variance .

Fig 2. The modelled background solution (a) and the truth solution (b).

### Experiment 3

In this example we have a spatio-temporal model, a partial differential equation with a single spatial variable, *x*, namely the one-dimensional linear advection equation,

$$\frac{\partial \chi}{\partial t} + v\,\frac{\partial \chi}{\partial x} = 0. \tag{30}$$

Here the state vector is the 1-D concentration *χ* = *χ*(*x*, *t*) which is advected with constant fluid velocity *v*. This model (Eq 30) has an analytical solution, which we will use in this study,

$$\chi(x, t) = \chi(x - vt, 0). \tag{31}$$
We take *v* = 1 m s^{−1}, discretize with spacing 1 m and use a timestep of 1 s. For the baseline experiments we have 400 spatial gridpoints and evolve over 600 timesteps. We apply periodic boundary conditions. The state vector, *χ*, is initialized as a pseudo-random wave (Fig 5(a)). This smooth periodic initial state is sampled from a normal distribution with mean 0, variance , and a decorrelation length of 20. The solution consists of a superposition of sinusoids with different wavelengths, where the shorter waves are penalized, and where each wave has a random phase [24]. The background state at initialization is offset from the truth by drawing another sample from the distribution and adding this to the true state. Observations are taken from the truth run and used in the assimilation (see Fig 5(b), 5(c)). Every 12 timesteps 20 observations are taken from the 400 possible spatial locations which are randomly sampled from a uniform distribution without replacement. Observation error is added to each observation, where the noise is normally distributed with mean zero and variance .
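With unit velocity, grid spacing, and timestep, the analytical solution moves the field exactly one grid cell per step, so the truth run and the observation schedule can be sketched as below. The smooth random initial field is a simple stand-in (a sum of damped sinusoids with random phases), not the construction of [24], and the standard deviations are placeholders because the variances are not reproduced here.

```python
import numpy as np

N_GRID, N_STEPS, OBS_EVERY, N_OBS = 400, 600, 12, 20

def smooth_random_wave(n, decorrelation_length, std, rng, n_modes=50):
    """Stand-in periodic random field: damped sinusoids with random phases."""
    x = np.arange(n)
    wavenumber = 2 * np.pi * np.arange(1, n_modes + 1) / n
    amp = np.exp(-0.25 * (wavenumber * decorrelation_length) ** 2)  # damp short waves
    phase = rng.uniform(0.0, 2 * np.pi, size=n_modes)
    field = (amp[:, None] * np.cos(wavenumber[:, None] * x + phase[:, None])).sum(axis=0)
    return std * field / field.std()

def advect(chi, steps=1):
    """Exact advection for v = 1, dx = 1, dt = 1 with periodic boundaries."""
    return np.roll(chi, steps)

def sample_observations(chi_truth, rng, obs_std):
    """Draw N_OBS distinct grid locations and return noisy observations there."""
    idx = rng.choice(chi_truth.size, size=N_OBS, replace=False)
    return idx, chi_truth[idx] + rng.normal(0.0, obs_std, size=N_OBS)

rng = np.random.default_rng(1)
chi0 = smooth_random_wave(N_GRID, decorrelation_length=20, std=1.0, rng=rng)

truth, n_assimilations = chi0.copy(), 0
for step in range(1, N_STEPS + 1):
    truth = advect(truth)
    if step % OBS_EVERY == 0:
        obs_idx, obs_val = sample_observations(truth, rng, obs_std=0.1)
        n_assimilations += 1              # an analysis step would be applied here

print(n_assimilations)
```

Using `np.roll` relies on the Courant number being exactly one; for any other choice of velocity, spacing, or timestep the shift would need interpolation.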

## Results and discussion

For the experiments conducted in this work we found that the two KL-minimizing data assimilation methods provided near-identical solutions. We will therefore only present the results from the EM filter, which we will henceforth refer to as the Kullback-Leibler data assimilation (KL-DA) solution. To give a sense of how differences might arise, consider Eq 1. When minimizing KL(*P*, *Q*) we want *P* ≃ *Q* or *P* ≪ *Q*. Now suppose *Q* has two peaks; then *P* might match one peak (*P* ≃ *Q*) and miss the other (*P* ≪ *Q*), which we might think of as “mode-seeking”. In contrast, when minimizing KL(*Q*, *P*) we want *P* ≃ *Q* or *Q* ≪ *P*; hence *P* might allocate mass between the two peaks of *Q*, which is “mean-seeking” (see [29]). Different applications might then give rise to different solutions from the EM filter and the SMART filter, although this is not something we have explored in this study.

In experiment 1 we show that the KL-DA method is effective and accurately tracks the unobserved variable (Fig 3(a)). The KL-DA results are shown to be equivalent to the OI solution, with the Kalman filter solution being superior to both (Fig 3(b)). This is to be expected as the full Kalman filter is updating the error covariances through the simulation, unlike the static covariance of the OI and KL-DA systems.

Fig 3. The modelled solution (a) and running average root mean square error (b).

For the 1-D nonlinear problem, experiment 2, we again find that the solution of the KL-DA method is identical to that of the OI (Fig 4(a)). We find that the KL-DA (and OI) is more accurate than the extended Kalman filter, which produces a higher frequency of larger errors, as shown in the probability histogram of the errors (Fig 4(b)). This problem is not well suited for the extended Kalman filter because of destabilization intervals [8]. The ensemble Kalman filter (EnKF) solution is found to have a slightly narrower range of error values than the KL-DA and OI (Fig 4(b)), but this comes at considerably greater computational cost (here we used 100 ensemble members to characterize the evolving error statistics).

Fig 4. The modelled data assimilation solutions (a) and the log-scale histogram of the errors (b).

In experiment 3 we have background error with a decorrelation length scale and so the error covariance matrix, *P*^{f}, contains important off-diagonal structure (unlike in experiments 1 and 2). As the KL-DA method uses diagonal covariance matrices we employ a local assimilation approach, as detailed in our earlier derivation. The observations are linearly interpolated and assimilated at each grid point. The uncertainty assigned to these interpolated observations is increased exponentially the further they are from the actual observation location, such that beyond a certain distance the uncertainty is so great that the interpolated observation will have no bearing on the analysis. This is an effective way of spatially spreading the observation information to locations nearby the measurement points (see Fig 5(b) and 5(c)). It is also different from that of OI (and Kalman filter) which use the *P*^{f} matrix in the Kalman gain to spread information; hence, the KL-DA and OI solutions are no longer alike as was the case for the previous experiments. We find that the KL-DA solution converges toward the truth much faster than both OI and the Kalman filter (Fig 6). As expected errors are reduced much more slowly in the OI than the Kalman filter as the error covariance is not evolved or updated (Fig 6). Note that as each individual simulation will be different, because of random errors and the random selection of observation locations, we have presented our results (in Fig 6 and Table 1) as averages calculated from multiple realizations (repeat simulations).
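The local assimilation idea described above can be illustrated with a short sketch: observations are linearly interpolated to every grid point, and the standard deviation attached to each interpolated value grows exponentially with distance from the nearest real observation. The growth law, the length scale, and the function name `spread_observations` are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def spread_observations(obs_idx, obs_val, n_grid, obs_std, length_scale):
    """Interpolate observations to all grid points on a periodic domain and
    inflate their uncertainty with distance from the nearest observation."""
    obs_idx = np.asarray(obs_idx)
    grid = np.arange(n_grid)
    # Periodic linear interpolation between observation locations.
    y_interp = np.interp(grid, obs_idx, obs_val, period=n_grid)
    # Periodic distance from each grid point to its nearest observation.
    sep = np.abs(grid[:, None] - obs_idx[None, :])
    dist = np.minimum(sep, n_grid - sep).min(axis=1)
    # Hypothetical inflation law: exponential growth with distance.
    std = obs_std * np.exp(dist / length_scale)
    return y_interp, std

# Hypothetical observations on a 400-point periodic grid.
rng = np.random.default_rng(2)
obs_idx = rng.choice(400, size=20, replace=False)
obs_val = 5.0 + rng.normal(0.0, 0.5, size=20)
y_interp, std = spread_observations(obs_idx, obs_val, n_grid=400,
                                    obs_std=0.1, length_scale=3.0)
print(std.min(), std.max())
```

Grid points far from any measurement receive such a large standard deviation that the interpolated value carries almost no weight in the analysis, which mimics the effect of off-diagonal covariance structure without storing it.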

Fig 5. The state vector solution from the truth run (blue), the no assimilation run (black), and the Kullback-Leibler data assimilation solution (red). Shown at (a) the initial condition (t = 0), (b) the time of first observations (t = 12), and (c) the time of second observations (t = 24). The observations (green circles) are taken at random locations and include random measurement error. The arrow indicates the direction of the advected flow.

Fig 6. The error, , growth over time using the Kalman filter (purple), the Kullback-Leibler data assimilation (red), and the OI (black dashes) solutions. The error curves are determined based on the mean of 100 simulations.

These averages are mean values taken over 30 simulations using experiment 3.

The KL-minimizing data assimilation methods are found to be substantially faster than the Kalman filter and faster than OI for large systems (see Table 1). The algorithms have not necessarily been coded for optimal efficiency; nonetheless, these timing comparisons provide further evidence of the computational advantages of the KL data assimilation approach, especially for large systems. The efficiency of the EM and SMART filters partly derives from their avoidance of matrix algebra. In contrast, the Kalman filter approaches depend critically on the Kalman gain matrix (Eq 11), and determining this requires computing the inverse of the matrix (*HP*^{f} *H*^{⊤} + R). Even if *P*^{f} is fixed, as is the case for OI, the *H* matrix (observation operator) will be different due to the changing measurement locations in experiment 3, thus requiring a new inverse to be computed at each assimilation time. The challenges of matrix multiplication, matrix storage, and computing matrix inverses increase substantially with the size of the system and may become intractable for some very large applications. For example, as we increase the number of grid points in our study we find that the time needed to complete the simulations increases dramatically for the Kalman filter (see Table 1). As is well known, evolving and updating the error covariance in the Kalman filter requires significant additional computational resources over that of OI. In practice, however, the EnKF proves effective as it does not propagate a covariance matrix, but instead makes an ensemble forecast. For high dimensional applications the EnKF can be implemented by assimilating single observations serially (e.g. [30]) or by performing the analysis step in a local region (e.g. [31]). Regardless, any ensemble forecast will be more expensive than the single forecast of the KL-DA method; nonetheless, it could be worth pursuing some of the benefits of the EnKF by developing an ensemble approach to the KL-DA method.

With regard to accuracy, we find that the final solutions of the KF are much closer to the truth than the OI solutions, but only slightly better than those of the KL-DA method (see Table 1). For example, for the largest system tested the relative percentage error at the end of the simulation was 3.06% for KF, 3.29% for KL-DA and 6.59% for OI. At the same time, the KL-minimizing filter is substantially quicker than the Kalman filter and is also found to be faster than the OI (direct solve 3D-Var) for sufficiently large systems (see Table 1). For example, in the largest problem tested the KL-DA simulation was 36 times faster to complete than the KF and 1.2 times faster than the OI. Therefore, the larger the system the more advantageous the KL data assimilation approaches become. These comparisons provide a baseline for assessing the KL-method and give an indication of its potential.

For all experiments performed in this study only a couple of iterations of the SMART and EM filters are needed for convergence (|*x*^{ℓ} − *x*^{ℓ−1}| < 10^{−9}). We should emphasize that the OI solution involves computing the direct and exact solution (Eq 8), but that iterative 3D-Var methods can also be employed to find the cost function minimum (e.g., [32]); however, such numerical minimization algorithms require evaluating both the cost function and its gradient. As such, we expect the computational advantages of the KL-DA approach to remain against the iterative minimization algorithms of 3D-Var.

A shortcoming of the KL data assimilation set-up described is the assumed diagonal structure of the covariance matrix. Realistic multivariable data assimilation applications typically require at least tri-diagonal structure in order to adjust correlated variables. Future work will explore adaptations to the current KL-minimizing filters in order to address this limitation and increase their utility. For example, an ensemble KL-DA approach could be developed that allows for both covariance updating and the direct adjustment of unobserved, but correlated, variables.

Many applications in the applied sciences require constraints on state variables; for instance, negative quantities are not physically possible for sea-ice concentration or ice sheet thickness. Nonetheless, even with positive observations and a positive forecast vector, the data assimilation update (Eq 10) can result in physically unsound negative values occurring in the analysis state. Despite a positive system, the innovation vector *y* − *Hx*^{f} could be negative or the Kalman gain (Eq 11) could contain negative values because of the matrix inverse. Often, in such cases, any negative values in the analysis state vector are simply set to zero in a post-processing step necessary to maintain consistency with the physical model (see, for example, [33, 34]). However, this decision is rather ad hoc and the result is no longer the optimal solution provided by the data assimilation algorithm. Another frequently used approach is Gaussian anamorphosis, which involves a change of variables for the state vector and the observations: the nonlinear transformation is applied before the update step and then its inverse is used to return to physical space for the forecast (e.g. [19]). The application of anamorphosis functions for these transformations may not be straightforward and the choice can strongly influence performance (e.g. [20]), although the logarithm is a popular function choice. Although these approaches give non-negative analyses, they are not necessarily optimal and do not generally conserve mass (e.g. [21]). To counteract such shortcomings, more elaborate techniques have been developed that involve solving an optimization problem subject to convex constraints at the analysis step; see, for example, [21–23].
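A small numerical sketch shows how a negative analysis value can arise from entirely positive inputs. It assumes the standard form of the analysis update, *x*^{a} = *x*^{f} + *K*(*y* − *Hx*^{f}) with *K* = *BH*^{T}(*HBH*^{T} + *R*)^{−1}, a two-variable state in which only the first variable is observed, and purely hypothetical values for the forecast, the observation, and the error covariances `B` and `r`.

```python
# Two-variable state, one observation of the first variable (H = [1, 0]).
x_f = [1.0, 0.1]        # positive forecast (hypothetical values)
y = 0.2                 # positive observation (hypothetical value)
B = [[1.0, 0.9],
     [0.9, 1.0]]        # forecast error covariance (hypothetical values)
r = 0.01                # observation error variance (hypothetical value)

# Innovation y - H x_f is negative even though y and x_f are positive.
innovation = y - x_f[0]

# With H = [1, 0], the gain K = B H^T (H B H^T + r)^(-1) reduces to
# the first column of B divided by (B[0][0] + r).
K = [B[0][0] / (B[0][0] + r), B[1][0] / (B[0][0] + r)]

# Analysis update: the unobserved, correlated variable is pulled below zero.
x_a = [x_f[j] + K[j] * innovation for j in range(2)]

# Common post-processing fix: set negative values to zero.
x_a_clipped = [max(v, 0.0) for v in x_a]

print(x_a)          # second component is negative
print(x_a_clipped)  # second component clipped to 0.0
```

Here the negative value comes from a negative innovation combined with a positive cross-covariance. Clipping it to zero restores physical consistency, but the clipped vector is no longer the minimizer of the cost function that produced `x_a`.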
In contrast, the data assimilation methods we have formulated here are guaranteed to produce analysis states with positive values because they are based on KL-divergence and hence are ideally suited for any applications requiring such physical constraints. No adjustments, transformations, or projections are required and the constraint is naturally embedded in the KL-minimizing filtering algorithms. For example, in our twin experiment 3 problem if we produce the initial pseudo-random wave around the zero line and then offset by the minimum observation (to achieve a positive system) and run the data assimilation experiments with reduced observation error (e.g., ) then the Kalman filter solution produces multiple (undesired) negative values. However, when using the KL-minimizing filters there are no occurrences of negative values and positivity is naturally enforced. Future work will build on this potential by exploring more realistic applications that require positivity and directly comparing the outcomes of KL-DA to Gaussian anamorphosis and other competing approaches.
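The reason positivity holds by construction can be seen in a minimal sketch of the classical SMART iteration from the image reconstruction literature (e.g., [26]). Again, this is the data-term-only iteration rather than the SMART filter of this study, and the function name `smart_step`, the matrix `H`, and the numerical values are hypothetical. Each component is multiplied by an exponential, which is always positive, so a positive starting vector can never change sign.

```python
import math


def smart_step(H, y, x):
    """One classical SMART update (illustrative only):
    x_j <- x_j * exp( (1 / s_j) * sum_i H_ij * log(y_i / (Hx)_i) ),
    where s_j is the j-th column sum of H.
    """
    n_obs, n_state = len(H), len(x)
    Hx = [sum(H[i][j] * x[j] for j in range(n_state)) for i in range(n_obs)]
    log_ratio = [math.log(y[i] / Hx[i]) for i in range(n_obs)]
    x_new = []
    for j in range(n_state):
        s_j = sum(H[i][j] for i in range(n_obs))
        exponent = sum(H[i][j] * log_ratio[i] for i in range(n_obs)) / s_j
        # exp(.) > 0, so x_j keeps its (positive) sign.
        x_new.append(x[j] * math.exp(exponent))
    return x_new


H = [[1.0, 0.5],
     [0.0, 1.0]]
y = [0.2, 0.05]   # small positive observations, well below the forecast
x = [1.0, 0.1]    # positive forecast used as the starting point

history = []
for _ in range(50):
    x = smart_step(H, y, x)
    history.append(list(x))

print(x)  # remains positive at every iteration
```

No clipping, transformation, or projection appears anywhere in the loop; the constraint is carried by the form of the update itself.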

The Kullback-Leibler divergence has previously been used within standard data assimilation methods. For example, Mansouri et al. [35] minimize the KL divergence to generate the optimal importance proposal distribution within a particle filter. This KL ‘measure’ is also used for model selection via the Akaike information criterion; see, for example, Burnham and Anderson [36]. In particular, Lang et al. [37] used such a method within an ensemble Kalman filter for parameterization estimation. The Kullback-Leibler divergence has recently been used to incorporate inequality constraints for an ensemble Kalman filter [23]. That methodology involves first solving the unconstrained ensemble Kalman filter and then projecting the results into the constrained region. In the projection step, a distribution is sought in the constrained region that is close to that of the unconstrained region, and it is determined by solving a convex optimization problem using the KL-divergence. In contrast, our approach can be considered favourable in that we guarantee our solution, by construction and without projection, to belong to the desired constrained region, namely the positive octant for this study. As an original contribution, we have demonstrated that data assimilation methods can be developed that seek to minimize the Kullback-Leibler divergence, between model forecast and observations as well as between the forecast and the control state, within a two-term weighted cost function. We have shown that these new approaches are computationally efficient and are ideally suited for situations where physical constraints on the state vector are necessary. Such scenarios commonly arise within many state and parameter estimation problems across numerous disciplines.

## Conclusion

We have derived two new data assimilation algorithms that minimize Kullback-Leibler divergence, rather than the L_{2}-norm of standard data assimilation methods. This foundational information-based perspective provides a new way to conceptualize the data assimilation problem. The unnormalized Kullback-Leibler divergence is a measure of the discrepancy between two positive vectors, and is a natural way to characterize the differences between the model prediction and the data. Because this ‘measure’ is not symmetric we have developed two independent filtering schemes, namely the simultaneous multiplicative algebraic reconstruction technique (SMART) filter and the expectation maximization (EM) filter. These proposed KL data assimilation schemes have been implemented numerically and the results compared to Kalman filter approaches. The two algorithms (EM filter and SMART filter) are shown to provide near-identical solutions with accuracy matching the 3D-Var solution using the Optimal Interpolation (OI) method with the same information inputs. We have highlighted several advantages of the KL-based data assimilation methods and indicated the future potential of this approach. The KL methods are computationally much faster than the Kalman filter as they are iterative schemes that have no need for matrix storage, matrix multiplication or computing a matrix inverse. For larger systems the KL-based data assimilation approach is shown to have substantial computational advantages over the Kalman filter and 3D-Var/OI. Furthermore, the KL data assimilation methods are ideal for applications that require state variables (or parameters) to obey certain constraints, such as physical limitations on their values. The KL-divergence applies to positive vectors only and so naturally embeds a constraint without any need for additional steps, such as transformations or projections, unlike the Kalman filter schemes. 
We have outlined important theoretical and conceptual details and highlighted how this promising new approach can be further improved by focusing on adapting the methods so that error covariance can be evolved and more complicated covariance structure can be incorporated. The KL-DA framework developed in this paper will be used as a foundation for future work demonstrating the methods in more sophisticated applications. In summary, the proposed Kullback-Leibler minimizing filtering methods provide a new data assimilation framework that might hold potential for applications involving time-varying variables of large-scale systems and where physical constraints and limited computational resources present challenges.
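To make the asymmetry of the ‘measure’ concrete, the following minimal sketch evaluates the standard unnormalized Kullback-Leibler divergence between two positive vectors, KL(*a*, *b*) = Σ_j [*a*_j log(*a*_j/*b*_j) + *b*_j − *a*_j], in both orderings. The function name `kl_unnormalized` and the two test vectors are hypothetical.

```python
import math


def kl_unnormalized(a, b):
    """Unnormalized KL divergence between positive vectors a and b:
    KL(a, b) = sum_j [ a_j * log(a_j / b_j) + b_j - a_j ].
    """
    return sum(aj * math.log(aj / bj) + bj - aj for aj, bj in zip(a, b))


a = [1.0, 2.0, 3.0]
b = [2.0, 2.0, 1.0]

kl_ab = kl_unnormalized(a, b)
kl_ba = kl_unnormalized(b, a)
kl_aa = kl_unnormalized(a, a)

print(kl_ab, kl_ba, kl_aa)  # two different nonnegative values, then zero
```

Because swapping the arguments changes the value, minimizing over the first or the second argument leads to different iterations, which is why two separate filtering schemes arise. The logarithm is defined only for positive ratios, which is where the positivity requirement on both vectors enters.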

## References

- 1. Kalnay E. Atmospheric Modeling, Data Assimilation and Predictability. Cambridge University Press; 2003.
- 2. Fletcher S. Data Assimilation for the Geosciences. Elsevier; 2017.
- 3. Albers D, Behn CD, Hripcsak G. Data Assimilation in Medicine. SIAM News. 2020;53:07.
- 4. Nadler P, Wang S, Arcucci R, Yang X, Guo Y. An epidemiological modelling approach for COVID-19 via data assimilation. Eur J Epidemiol. 2020;35:749–761. pmid:32888169
- 5. Zobitz JM, Desai AR, Moore DJP, Chadwick MA. A primer for data assimilation with ecological models using Markov Chain Monte Carlo (MCMC). Oecologia. 2011;167:599–611. pmid:21874332
- 6. Miller A, Dawei L, Platt J, Daou A, Margoliash D, Abarbanel HDI. Statistical data assimilation: Formulation and examples from Neurobiology. Front Appl Math Stat. 2018;4:53.
- 7. Reich S, Cotter C. Probabilistic Forecasting and Bayesian Data Assimilation. Cambridge University Press; 2015.
- 8. Law K, Stuart A, Zygalakis K. Data Assimilation: A Mathematical Introduction. Springer; 2015.
- 9. Freitag M, Nichols N, Budd C. L_{1}-regularisation for ill-posed problems in variational data assimilation. Proc Appl Math Mech. 2010;10:665–668.
- 10. Freitag M, Nichols N, Budd C. Resolution of sharp fronts in the presence of model error in variational data assimilation. Q J R Meteorol Soc. 2013;139:742–757.
- 11. Rao V, Sandu A, Ng M, Nino-Ruiz E. Robust data assimilation using L_{1} and Huber norms. SIAM J Sci Comput. 2017;39(3):B548–B570.
- 12. Feyeux N, Vidard A, Nodet M. Optimal transport for variational data assimilation. Nonlin Processes Geophys. 2018;25:55–66.
- 13. Li L, Vidard A, Le Dimet FX, Ma J. Topological data assimilation using Wasserstein distance. Inverse Problems. 2018;35:015006.
- 14. Kullback S, Leibler R. On information and sufficiency. Ann Math Statist. 1951;22:79–86.
- 15. Csiszár I. Axiomatic Characterizations of Information Measures. Entropy. 2008;10:261–273.
- 16. Bregman LM. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics. 1967;7(3):200–217.
- 17. Byrne C. Iterative Optimization in Inverse Problems. Hoboken: Taylor and Francis; 2014.
- 18. Byrne C. Signal Processing: A Mathematical Approach. 2nd ed. Chapman and Hall; 2014.
- 19. Simon E, Bertino L. Application of the Gaussian anamorphosis to assimilation in a 3-D coupled physical-ecosystem model of the North Atlantic with the EnKF: a twin experiment. Ocean Sci. 2009;5:495–510.
- 20. Amezcua J, Leeuwen PJV. Gaussian anamorphosis in the analysis step of the EnKF: a joint state-variable/observation approach. Tellus A. 2014;66:23493.
- 21. Janjić T, McLaughlin D, Cohn SE, Verlaan M. Conservation of mass and preservation of positivity with ensemble-type Kalman filter algorithms. Mon Weather Rev. 2014;142:755–773.
- 22. Albers DJ, Blancquart PA, Levine ME, Seylabi EE, Stuart A. Ensemble Kalman methods with constraints. Inverse Problems. 2019;35(9):095007. pmid:33223593
- 23. Li R, Magbool N, Huang B, Prasad V. Constrained ensemble Kalman filter based on Kullback-Leibler (KL) divergence. J Process Control. 2019;81:150–161.
- 24. Evensen G. Data Assimilation: The Ensemble Kalman Filter. Springer; 2009.
- 25. Resmerita E, Anderssen RS. Joint additive Kullback-Leibler residual minimization and regularization for linear inverse problems. Math Meth Appl Sci. 2007;30:1527–1544.
- 26. Byrne C. Iterative image reconstruction algorithms based on cross-entropy minimization. IEEE Trans on Image Processing. 1993;2(1):96–102. pmid:18296198
- 27. Qranfal Y, Byrne C. EM filter for time-varying SPECT reconstruction. Int J of Pure and Appli Math. 2011;73(4):379–403.
- 28. Qranfal Y, Byrne C. SMART filter for dynamic SPECT image reconstruction. Int J of Pure and Appli Math. 2011;73(4):405–434.
- 29. Sanz-Alonso D, Stuart A, Taeb A. Data Assimilation and Inverse Problems. arXiv:1810.06191v2 [Preprint]. 2019. Available from: https://arxiv.org/abs/1810.06191
- 30. Whitaker JS, Hamill TM. Ensemble Data Assimilation without Perturbed Observations. Mon Wea Rev. 2002;130:1913–1924.
- 31. Hunt B, Kostelich E, Szunyogh I. Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter. Physica D. 2007;230:112–126.
- 32. Courtier P, Andersson E, Heckley W, Vasiljevic D, Hamrud M, Hollingsworth A, et al. The ECMWF implementation of three-dimensional variational assimilation (3D-Var). I: Formulation. Quart J Roy Meteor Soc. 1998;124:1783–1807.
- 33. Mathiot P, Beatty CK, Fichefet T, Goosse H, Massonnet F, Vancoppenolle M. Better constraints on the sea-ice state using global sea-ice data assimilation. Geosci Model Dev. 2012;5:1501–1515.
- 34. Bonan B, Nodet M, Ritz C, Peyaud V. An ETKF approach for initial state and parameter estimation in ice sheet modelling. Nonlin Processes Geophys. 2014;21:569–582.
- 35. Mansouri M, Nounou H, Nounou M. Kullback-Leibler divergence-based improved particle filter. 2014 IEEE 11th International Multi-Conference on Systems, Signals & Devices (SSD14). 2014. https://doi.org/10.1109/SSD.2014.6808793
- 36. Burnham K, Anderson D. Model selection and multimodel inference: A practical information-theoretic approach. Springer; 2002.
- 37. Lang M, Van Leeuwen PJ, Browne P. A systematic method of parameterisation estimation using data assimilation. Tellus A. 2016;68:29012.