Abstract
We develop an efficient algorithm to find optimal observation times by maximizing the Fisher information for the birth rate of a partially observable pure birth process involving n observations. Partially observable implies that at each of the observation time points for counting the number of individuals present in the pure birth process, each individual is observed independently with a fixed probability p, modeling detection difficulties or constraints on resources. We apply concepts and techniques from generating functions, using a combination of symbolic and numeric computation, to establish a recursion for evaluating and optimizing the Fisher information. The recursion, while still computationally intensive, greatly improves on previously known computational methods which quickly became intractable even in the n = 2 case. Our numerical results reveal the efficacy of this new method. An implementation of the algorithm is available publicly.
Citation: Eshragh A, Skerritt MP, Salvy B, McCallum T (2025) Optimal experimental design for partially observable pure birth processes. PLoS One 20(8): e0328707. https://doi.org/10.1371/journal.pone.0328707
Editor: Hoda Bidkhori, George Mason University College of Science, UNITED STATES OF AMERICA
Received: December 24, 2024; Accepted: July 4, 2025; Published: August 29, 2025
Copyright: © 2025 Eshragh et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data, as well as the code to generate it, are available in a GitHub repository. https://github.com/matt-sk/POPBP-Fisher-Information-Optimisation.git (The data can be found in the ‘data’ folders underneath the ‘Maple/Optimisation/*’ folders).
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Optimal experimental design is a statistical methodology for selecting efficient and effective ways to gather data (cf. [1,2]). It aims to maximize the amount of information obtained from the experiments, quantified by the Fisher information. The significance of Fisher information lies in its connection to the asymptotic variance of maximum likelihood estimators. By leveraging Fisher information, one can find (asymptotically) unbiased estimators with minimum variance, where the optimality of a design depends on the statistical model and is assessed with respect to some criterion. As an example, the widely recognized "D-optimality" criterion focuses on maximizing the determinant of the Fisher information. This motivates the computation of the Fisher information for different statistical and stochastic models, particularly for continuous-time Markov chains.
Continuous-time Markov chains (CTMCs) have gained significant popularity in modeling a range of phenomena such as evolutionary, ecological, and epidemiological processes, owing to their capability to efficiently capture the discrete, interactive, and stochastic aspects of these processes [see, e.g., 3–5]. A crucial element in the application of CTMCs is the estimation of model parameters. Initial research was directed at parameter estimation for stochastic processes like pure birth [e.g., 6], pure death [e.g., 7], and birth-and-death processes [e.g., 8–12], under both continuous and equidistant discrete observation intervals. The scope was broadened by Becker & Kersting [13], who were successful in formulating an explicit expression for the Fisher information pertaining to the pure birth process, and derived optimal observation times for a pure birth process by applying optimal experimental design methods.
The applicability of the pure birth process was further broadened by Bean, Elliot, Eshragh, & Ross [14] and Bean, Eshragh, & Ross [15]. They investigated the Fisher information for the pure birth process under discrete, non-equidistant observation times, namely the partially observable pure birth process (POPBP), where each observation is modeled as a binomial random variable dependent on the actual population size. This approach adds realism and complexity to the analysis, and is particularly relevant in the early stages of pest and disease invasions or cell growth experiments, where smaller populations make diffusion approximations less effective. Bean et al. [14] showed that the POPBP is not Markovian of any order. In addition, Bean et al. [15] developed an efficient approximation to find and optimize the Fisher information, which was previously restricted to only two observations. As a practical application, Eshragh et al. [16] recently utilized these results to model and analyze the dynamics of the COVID-19 population in its early stages in Australia. Our work in this paper improves upon the foundations and methods of those papers.
The use of generating functions in combinatorics and probability theory is classical [e.g., 17–21]. In many cases, notably in relation to Markov chains, the generating functions turn out to be rational functions, that are themselves related to linear recurrences with constant coefficients. This is the situation we encounter in this article and exploit algorithmically to establish a recursion to compute the Fisher information for a POPBP. This technique allows us to efficiently derive the optimal experimental design numerically for more than two observations. To the best of our knowledge, this is the first attempt in applying generating functions in the context of optimal experimental design.
This article is structured as follows: 'Optimal experimental design' presents optimal experimental design methods and shows how to find the Fisher information for a pure birth process. 'Partially observable pure birth process' introduces the partially observable pure birth process, formulates its Fisher information, and develops analytical results for the structure of optimal experimental designs. 'Generating functions for the likelihood' applies generating function techniques to establish an efficient recursion for calculating the Fisher information for partially observable pure birth processes. The 'Experimental methodology' and 'Experimental results' sections exploit this methodology to run comprehensive numerical experiments for different values of the model parameters. 'Conclusion' concludes and addresses future work.
Notation
The notation used in this paper is summarized in Table 1.
Optimal experimental design
Consider the stochastic process $\{X_t,\, t \ge 0\}$, where the random variable $X_t$ is characterized by a probability mass/density function $f(x_t; \boldsymbol{\theta})$, with $\boldsymbol{\theta} = (\theta_1, \dots, \theta_k)$ representing an unknown parameter vector. To estimate this vector accurately, we employ the method of maximum likelihood estimation (MLE). This involves taking $n$ observations $\boldsymbol{X} = (X_{t_1}, \dots, X_{t_n})$ of the process and maximizing the likelihood function:
$$L(\boldsymbol{\theta}; \boldsymbol{x}) := f(\boldsymbol{x}; \boldsymbol{\theta}),$$
where $\boldsymbol{X}$ denotes a random vector of observations, $\boldsymbol{x} = (x_{t_1}, \dots, x_{t_n})$ its realization, and $f(\boldsymbol{x}; \boldsymbol{\theta})$ the joint probability mass/density function of the observed sample.
It is well-known that MLEs asymptotically follow a normal distribution, characterized by a variance matrix denoted as $\Sigma(\boldsymbol{\theta})$. The inverse of this matrix, introduced in Definition 1, plays a crucial role in statistical estimation theory.

Definition 1 (Fisher Information). The inverse of the variance matrix $\Sigma(\boldsymbol{\theta})$, referred to as the Fisher information matrix and denoted by $I(\boldsymbol{\theta})$, is a $k \times k$ matrix defined as:
$$I(\boldsymbol{\theta}) := \Sigma^{-1}(\boldsymbol{\theta}).$$
The Fisher information matrix plays a key role in quantifying the amount of information that a random sample carries about an unknown parameter upon which the likelihood depends [22].
An optimal experimental design is defined as an experimental design that optimizes an appropriate function of the Fisher information matrix [2]. Common optimality criteria identified in the literature include:
- A-optimality: Minimizing the trace of the inverse Fisher information matrix, which is equivalent to minimizing the trace of the variance matrix,
- D-optimality: Maximizing the determinant of the Fisher information matrix,
- E-optimality: Maximizing the minimum eigenvalue of the Fisher information matrix,
- T-optimality: Maximizing the trace of the Fisher information matrix.
It is important to note that if the parameter vector $\boldsymbol{\theta}$ contains only a single parameter, then $I(\boldsymbol{\theta})$ becomes a scalar. In this simplified scenario, all the above criteria converge, effectively becoming equivalent to maximizing the Fisher information.
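As a concrete illustration of these criteria (our example, not from the paper), the following Python sketch evaluates all four on a hypothetical 2×2 Fisher information matrix, using the closed-form eigenvalues of a symmetric 2×2 matrix:

```python
import math

# Hypothetical 2x2 Fisher information matrix (illustrative values only).
I = [[4.0, 1.0],
     [1.0, 3.0]]

trace = I[0][0] + I[1][1]
det = I[0][0] * I[1][1] - I[0][1] * I[1][0]

# Eigenvalues of a symmetric 2x2 matrix, in closed form.
disc = math.sqrt((trace / 2.0) ** 2 - det)
eig_min = trace / 2.0 - disc
eig_max = trace / 2.0 + disc

# A-optimality: minimize the trace of the inverse (= trace of the variance matrix).
a_criterion = trace / det          # tr(I^{-1}) = tr(I)/det(I) for a 2x2 matrix
# D-optimality: maximize the determinant.
d_criterion = det
# E-optimality: maximize the smallest eigenvalue.
e_criterion = eig_min
# T-optimality: maximize the trace.
t_criterion = trace

print(a_criterion, d_criterion, e_criterion, t_criterion)
```

Note that A-optimality is stated as a minimization of $\operatorname{tr}(I^{-1})$, whereas the other three criteria are maximizations of functions of $I$ itself.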
The Fisher information matrix can be calculated through one of the two following expectations (see, e.g., [23] Chapter 13):
$$I(\boldsymbol{\theta}) = E\!\left[\nabla \log L(\boldsymbol{\theta}; \boldsymbol{X}) \left(\nabla \log L(\boldsymbol{\theta}; \boldsymbol{X})\right)^{T}\right], \tag{1}$$
$$I(\boldsymbol{\theta}) = -E\!\left[\nabla^{2} \log L(\boldsymbol{\theta}; \boldsymbol{X})\right], \tag{2}$$
where $\nabla$ denotes the gradient vector and $\nabla^{2}$ the Hessian matrix, both with respect to the parameter vector $\boldsymbol{\theta}$. Note that in Eq (1) the superscript $T$ indicates the transpose operation. For a single parameter ($k = 1$), these expressions simplify to the first and second derivatives of the log-likelihood function with respect to θ, thus reducing the Fisher information to a scalar.
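For a single-parameter model, the agreement of these two expectations can be checked exactly. The sketch below (our illustration, using a single Bernoulli(θ) observation rather than a birth process) computes both and recovers the textbook value 1/(θ(1−θ)):

```python
# Exact check that the two expectation forms of the Fisher information agree
# for a single-parameter (k = 1) toy model: one Bernoulli(theta) observation.
# Illustration only; the paper applies the identity to birth processes.

def fisher_via_score(theta):
    # E[(d/dtheta log f(X; theta))^2], expectation over x in {0, 1}.
    total = 0.0
    for x, prob in ((0, 1 - theta), (1, theta)):
        score = x / theta - (1 - x) / (1 - theta)
        total += prob * score ** 2
    return total

def fisher_via_hessian(theta):
    # -E[d^2/dtheta^2 log f(X; theta)].
    total = 0.0
    for x, prob in ((0, 1 - theta), (1, theta)):
        second_deriv = -x / theta ** 2 - (1 - x) / (1 - theta) ** 2
        total += prob * second_deriv
    return -total

theta = 0.3
print(fisher_via_score(theta))    # both equal 1/(theta*(1 - theta))
print(fisher_via_hessian(theta))
```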
Eq (2) demonstrates that the calculation of the Fisher information matrix relies on the likelihood function $L(\boldsymbol{\theta}; \boldsymbol{x})$. Thus, if computing the likelihood function is complex or infeasible, this complexity is likely to carry over to the Fisher information's calculation. Additionally, even possessing an explicit expression for the likelihood function does not guarantee the straightforward derivation of the Fisher information.
An exception to these challenges occurs in the case of observations derived from a pure birth process, where both the likelihood function and the Fisher information can be explicitly determined. The pure birth process is presented in Definition 2.
Definition 2 (Pure Birth Process, PBP). The stochastic process $\{X_t,\, t \ge 0\}$ is called a pure birth process (PBP) with birth rate parameter $\lambda > 0$ if $X_t$ represents the population size at time $t$ and the transition rate from state $x_t$ to the next state, $x_t + 1$, is precisely $\lambda x_t$.
Throughout, we assume the initial population size, $x_0$, at time $t_0 = 0$ is known. Furthermore, for $t > s \ge 0$, the conditional probability mass function of the random variable $X_t$ given $X_s = x_s$, over the values $x_t \in \{x_s, x_s + 1, \dots\}$, is
$$\Pr(X_t = x_t \mid X_s = x_s) = \binom{x_t - 1}{x_t - x_s}\, e^{-\lambda x_s (t - s)} \left(1 - e^{-\lambda (t - s)}\right)^{x_t - x_s}, \tag{3}$$
a negative binomial distribution (see, e.g., [24] Chapter 13).
Becker & Kersting [13] extensively studied the Fisher information of observations obtained from the PBP to estimate the unknown birth rate parameter λ. They demonstrated that for observations from the PBP with parameter λ at specific observation times $t_1 < t_2 < \cdots < t_n$ within a predetermined time horizon $[0, \tau]$, the likelihood function for these observations is given by:
$$L(\lambda; \boldsymbol{x}) = \prod_{i=1}^{n} \Pr\!\left(X_{t_i} = x_{t_i} \mid X_{t_{i-1}} = x_{t_{i-1}}\right), \tag{4}$$
where $t_0 = 0$. This representation of the likelihood function, as a product, facilitates the evaluation of the Fisher information via Eq (2).
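The product form of Eq (4) is straightforward to evaluate numerically. The sketch below assumes the standard Yule-process transition law for the PBP; the function names are ours:

```python
import math

def pbp_transition(x_t, x_s, lam, dt):
    """Pr(X_t = x_t | X_s = x_s) for a pure birth (Yule) process with rate lam."""
    if x_t < x_s:
        return 0.0
    return (math.comb(x_t - 1, x_t - x_s)
            * math.exp(-lam * x_s * dt)
            * (1.0 - math.exp(-lam * dt)) ** (x_t - x_s))

def pbp_likelihood(lam, times, counts, x0=1):
    """Product-form likelihood for observed counts at the given times."""
    like, prev_t, prev_x = 1.0, 0.0, x0
    for t, x in zip(times, counts):
        like *= pbp_transition(x, prev_x, lam, t - prev_t)
        prev_t, prev_x = t, x
    return like

# Each conditional pmf should sum to (almost) 1 when truncated far enough.
total = sum(pbp_transition(x, 2, 1.0, 0.5) for x in range(2, 200))
print(total)  # very close to 1

print(pbp_likelihood(1.0, [0.5, 1.0], [2, 3]))
```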
Utilizing this formulation, Becker & Kersting [13] derived an explicit expression for the Fisher information for the random observation vector $\boldsymbol{X}$ used in estimating λ:
$$I(\lambda) = x_0 \sum_{i=1}^{n} \frac{(t_i - t_{i-1})^{2}\, e^{\lambda t_{i-1}}}{1 - e^{-\lambda (t_i - t_{i-1})}}, \tag{5}$$
where $t_0 = 0$.
Furthermore, they showed that with given values of τ, n, and λ, the optimal experimental design can be uniquely determined by solving the following optimization equations:
where the functions and
are defined as:
For large sample sizes, an approximate solution to these equations simplifies the experimental design process:
We compare these approximate observation times against those calculated directly (by optimizing Eq (5)) in the 'Optimization method' subsection of 'Experimental methodology', below.
Although this approach is interesting, it may not be practical due to real-world restrictions, such as time and budget constraints, which may prevent us from observing and counting all individuals in the population at each observation time ti. To address this issue, we employ a modified stochastic process, a partially observable pure birth process (POPBP), which will be explained in Partially observable pure birth process.
Partially observable pure birth process
Consider a PBP with an unknown birth rate λ. To estimate this unknown parameter λ, we aim to take $n$ observations at times $t_1 < t_2 < \cdots < t_n$. Let us now assume that at each observation time $t_i$, we may not be able to observe the entire population size $x_{t_i}$, but can only observe a random sample from it. Consequently, we define a POPBP as follows:
Definition 3 (Partially Observable Pure Birth Process, POPBP [14]). Consider the PBP with birth rate λ. If the random variables $Y_t$ are defined such that the conditional random variable $Y_t \mid X_t = x_t$ follows the $\mathrm{Bin}(x_t, p)$ distribution, where $0 < p \le 1$, then the stochastic process $\{Y_t,\, t \ge 0\}$ is called the partially observable pure birth process (POPBP) with parameters $(\lambda, p)$.
Remark 4. Definition 3 implies that, for a population size at time $t$ equal to $x_t$, where each of these $x_t$ individuals can be observed independently with probability $p$, the random variable $Y_t$ then counts the total number of observed individuals at that time. Consequently, the POPBP with parameters $(\lambda, 1)$ simplifies to the PBP with the same parameter λ, because when $p = 1$, every individual in the population is observed, mirroring the observation conditions of a PBP. Furthermore, it is assumed throughout that the parameter $p$ is both fixed and known.
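A POPBP path is easy to simulate, which can help build intuition for the model: run the underlying pure birth process (exponential waiting times with rate λx) and apply binomial thinning at each observation time. This sketch is illustrative only and is not part of the paper's estimation machinery:

```python
import random

def simulate_popbp(lam, p, obs_times, x0=1, rng=None):
    """Simulate one path of a POPBP: a Yule process thinned by binomial
    observation with detection probability p at each observation time.
    (Illustrative sketch; not the paper's computation code.)"""
    rng = rng or random.Random()
    t, x = 0.0, x0
    xs, ys = [], []
    for obs_t in obs_times:
        # Advance the pure birth process: with x individuals, the waiting
        # time to the next birth is exponential with rate lam * x.
        while True:
            wait = rng.expovariate(lam * x)
            if t + wait > obs_t:
                break          # overshoot; memorylessness lets us discard it
            t += wait
            x += 1
        t = obs_t
        xs.append(x)
        # Partial observation: each individual is seen independently w.p. p.
        ys.append(sum(1 for _ in range(x) if rng.random() < p))
    return xs, ys

xs, ys = simulate_popbp(lam=1.0, p=0.5, obs_times=[0.5, 1.0],
                        rng=random.Random(42))
print(xs, ys)
```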
Bean et al. [14] demonstrated that the POPBP $\{Y_t,\, t \ge 0\}$, characterized by parameters $(\lambda, p)$, does not exhibit the Markovian property. This characteristic means that the likelihood function for observations from the POPBP cannot be simplified utilizing the Markovian property, in contrast to the PBP. Taking into account this significant difference, Bean et al. [15] derived the likelihood function for the POPBP as follows:
$$L(\lambda; \boldsymbol{y}) = \sum_{x_1 = y_1}^{\infty} \cdots \sum_{x_n = y_n}^{\infty}\ \prod_{i=1}^{n} \binom{x_i}{y_i} p^{y_i} (1 - p)^{x_i - y_i}\, \Pr\!\left(X_{t_i} = x_i \mid X_{t_{i-1}} = x_{i-1}\right), \tag{7}$$
where $x_0$ is the known initial population size and $\Pr(X_{t_i} = x_i \mid X_{t_{i-1}} = x_{i-1})$ denotes the PBP transition probability.
Eq (7) reveals that, unlike the likelihood function for a PBP (i.e., Eq (4)), the likelihood function for the POPBP cannot be represented simply as a product form. This complexity suggests that calculations involving the likelihood function, including those for the Fisher information, will be considerably more challenging than those for the PBP.
The Fisher information for the parameter λ, based on $n$ observations of the POPBP, is presented as:
$$I_n(\lambda) = \sum_{y_1 = 0}^{\infty} \cdots \sum_{y_n = 0}^{\infty} \frac{\left(\frac{\partial}{\partial \lambda} L(\lambda; \boldsymbol{y})\right)^{2}}{L(\lambda; \boldsymbol{y})}. \tag{8}$$
The calculation of the partial derivative in Eq (8) can be stated in terms of functions as follows:
Substituting Eqs (7), (9) into Eq (8) allows for the calculation of the Fisher information for the POPBP. However, this process does not lead to a simplified form as seen with the PBP, complicating numerical calculations and optimization efforts.
Notably, the computation in Eq (8) involves $n + 2$ infinite series, including those over $x_n$ in both the numerator and denominator of the summand, as well as $n$ series over $y_1, \dots, y_n$. To achieve a desirable precision level in numerical calculations of the Fisher information, Bean et al. [15] recommended a truncation criterion based on Chebyshev's inequality, coupled with a relative-error criterion. This approach ensures that the ratio of the summand to the cumulative sum up to the current point is below a predetermined significance level.
Numerical calculations for a POPBP are challenging due to the infinite summations required to compute the Fisher information. This complexity is magnified as $n$, the number of observation times, increases. Specifically, each additional observation necessitates the truncation of three more infinite series: one for the summation over $y_n$, another for the likelihood function over $x_n$, and a third for the partial derivative over $x_n$. Consequently, computation times can become prohibitively long, even for relatively small $n$. For example, with n = 3 and
, estimating optimal observation times for p ranging from 0.01 to 0.99 in increments of 0.01 is projected to take five years (This estimate is based on calculations implemented in C++.), highlighting the significant computational demands.
Moreover, as λ increases, computation time escalates exponentially due to the truncation points being exponential functions of λ. For instance, maximizing the Fisher information for n = 2 and varying p from 0.01 to 0.99 in 0.01 steps takes approximately 14 hours for $\lambda = 2$. However, increasing λ from 2 to 5 raises the estimated computation time to two years, underscoring the exponential growth in computational demand with parameter increases.
Bean et al. [15] developed an approximation for the Fisher information in the POPBP with two observations (n = 2) as follows:
where the constituent quantities are as defined in Bean et al. [15]. Their work demonstrates, both theoretically and numerically, that Eq (10) provides a highly accurate approximation of the exact Fisher information. Furthermore, they proved that as λ increases, the approximation error quickly diminishes to zero. Significantly, because the approximation does not involve any infinite summation, it enables the rapid approximation of the Fisher information for any value of λ.

Unfortunately, while this approximation is excellent for n = 2 observation times, extending the approach to higher values of n becomes intractable due to the increasing computational complexity and the absence of straightforward analytical solutions. In Generating functions for the likelihood, we introduce a novel numerical algorithm designed to compute and maximize the Fisher information for the POPBP more efficiently, addressing these challenges.
We conclude this section by demonstrating the rescaling property of the optimal experimental design for the POPBP, as articulated in Proposition 5, which plays a crucial role in enhancing the efficiency and applicability of experimental designs.
Proposition 5. If $(t_1^{\ast}, \dots, t_n^{\ast})$ constitutes an optimal experimental design for a POPBP with parameters $(\lambda, p)$ and a time-horizon of 1, then for any fixed $\tau > 0$, the scaled design $(\tau t_1^{\ast}, \dots, \tau t_n^{\ast})$ forms the corresponding optimal experimental design for a POPBP with parameters $(\lambda / \tau, p)$ and a time-horizon of τ.
Proof: Denote by $I_n^{(1)}$ and $I_n^{(\tau)}$ the Fisher information for a POPBP with parameters $(\lambda, p)$ over a time-horizon of 1, and for a POPBP with parameters $(\lambda / \tau, p)$ over a time-horizon of τ, respectively. According to Eq (8), the Fisher information $I_n^{(1)}$ for any arbitrary set of sampling times $(t_1, \dots, t_n)$ equates, up to a multiplicative constant that does not depend on the sampling times, to $I_n^{(\tau)}$ for the scaled set $(\tau t_1, \dots, \tau t_n)$. Therefore, if $(t_1^{\ast}, \dots, t_n^{\ast})$ maximizes $I_n^{(1)}$, the scaled set $(\tau t_1^{\ast}, \dots, \tau t_n^{\ast})$ naturally maximizes $I_n^{(\tau)}$, establishing its optimality for the latter process. □
Remark 6. Proposition 5 implies that to find an optimal experimental design for a given POPBP with time-horizon τ, we can find the corresponding optimal experimental design for the rescaled POPBP with time-horizon 1 and then, by a simple linear transformation, convert it to an optimal experimental design of the original process. Thus, without loss of generality, we assume henceforth that $\tau = 1$.
Generating functions for the likelihood
In this section, which can be considered the main contribution of this paper, we develop a new approach involving the use of generating functions to compute the Fisher information for higher values of n and λ.
The generating function of a sequence $(a_{z_1, \dots, z_m})$ indexed by non-negative integer variables $z_i$ is the formal power series
$$\phi(u_1, \dots, u_m) = \sum_{z_1 \ge 0} \cdots \sum_{z_m \ge 0} a_{z_1, \dots, z_m}\, u_1^{z_1} \cdots u_m^{z_m}.$$
When the generating function ϕ is a rational function
$$\phi = \frac{P}{Q}$$
with two polynomials $P$ and $Q$, the sequence satisfies a linear recurrence with constant coefficients obtained by equating the coefficients of the same powers of $u_1, \dots, u_m$ on both sides of the identity
$$Q(u_1, \dots, u_m)\, \phi(u_1, \dots, u_m) = P(u_1, \dots, u_m).$$
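As a one-variable illustration of this principle (our example, not from the paper): for the Fibonacci generating function φ(u) = u/(1 − u − u²), equating coefficients in Q(u)φ(u) = P(u) yields a_k = a_{k−1} + a_{k−2} for k ≥ 2, and the coefficients can be generated directly from the recurrence:

```python
# One-variable sketch: phi(u) = P(u)/Q(u) with P = u, Q = 1 - u - u^2.
# Equating coefficients of u^k in Q(u)*phi(u) = P(u) gives
# a_k - a_{k-1} - a_{k-2} = [k == 1], i.e. the Fibonacci recurrence.

N = 20
P = [0, 1]            # coefficients of P(u) = u
Q = [1, -1, -1]       # coefficients of Q(u) = 1 - u - u^2

# Build the series coefficients a_k directly from the recurrence (Q[0] = 1).
a = [0] * N
for k in range(N):
    rhs = P[k] if k < len(P) else 0
    a[k] = rhs - sum(Q[j] * a[k - j] for j in range(1, min(k, len(Q) - 1) + 1))

# Check: the truncated product Q(u) * (sum_k a_k u^k) reproduces P(u).
prod = [sum(Q[j] * a[k - j] for j in range(len(Q)) if k - j >= 0)
        for k in range(N)]
assert prod[:len(P)] == P and all(c == 0 for c in prod[len(P):])

print(a[:10])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34] -- the Fibonacci numbers
```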
Motivated by this idea, we develop a recursive equation for the likelihood function of a POPBP, which we utilize to calculate and maximize the Fisher information. In our application, the initial population size $x_0$ is known, so it is not an input variable of the likelihood function. There is no difficulty in considering the initial population size as a random variable, which will be constant in our special case. Thus, the generating function of the likelihood function (Eq (7)) we consider is defined by
$$\phi(u_0, u_1, \dots, u_n) = \sum_{x_0 \ge 0}\ \sum_{y_1 \ge 0} \cdots \sum_{y_n \ge 0} L(\lambda; \boldsymbol{y})\, u_0^{x_0} u_1^{y_1} \cdots u_n^{y_n},$$
where $\boldsymbol{y} = (y_1, \dots, y_n)$. It turns out that this is a simple rational function.
Lemma 7. Consider the POPBP with parameters $(\lambda, p)$ with $n$ observations. The generating function of the likelihood function is given by
where q = 1−p and (Qi) is a family of polynomials defined by and
with the convention .
Proof: By adopting the convention that the binomial coefficient $\binom{n}{k}$ is 0 for k < 0 and k > n, all sums in Eq (13) can be taken over all integers (for the variables $x_i$) and over all non-negative integers (for the variables $y_i$).
Summation over the variables $y_i$ using the binomial theorem reduces the generating function to
with
where and
for
. By the binomial theorem,
and then by induction
The final sum is
which concludes the proof. □
Remark 8. The important special case when $x_0$ is fixed and equal to 1 corresponds to extracting the coefficient of $u_0$ in this rational function. This is achieved by setting $u_0 = 1$ (hence $Q_0 = 1$) in Eq (14).
For notational convenience, we write for
and if
,
with the convention that this value is 0 if any of the entries of is negative. We use the same notation with
and
in the case when x0 is fixed and equal to 1. Also
denotes the monomial
.
With this notation, a consequence of the explicit form of the generating function is a simple recursion for the likelihood function.
Theorem 9. Consider the POPBP with parameters $(\lambda, p)$ with $n$ observations. The likelihood function satisfies the following recurrence equation:
where is the coefficient of
in the polynomial $Q_n$ of Lemma 7, while
is the coefficient of
in the numerator of Eq (14). In the special case when x0 is fixed and equal to 1, the same result holds with
and
in the place of
and
.
Proof: This is the result of multiplying both sides of Eq (14) by its denominator and extracting the coefficient of each monomial on both sides. □
Remark 10. The coefficients of $Q_n$ are easily computed from the recurrence Eq (15). Similarly, the coefficients of the numerator of Eq (14) follow easily from its expression. Note that both polynomials have degree 1 in each of the variables $u_0, u_1, \dots, u_n$.
Remark 11. The recurrence Eq (15) shows that for $p$ and the $t_i$ in the interval [0,1], all the coefficients of the recurrence are positive, making it numerically stable. Moreover, if $p < 1$ and all the observation times are distinct ($0 < t_1 < \cdots < t_n$), all these coefficients are non-zero: the recurrence has exactly $2^{n+1}$ terms.
Remark 12. The formula Eq (16) also gives the initial conditions. For instance, with , it gives
.
By taking the derivative with respect to λ of both sides of the recurrence equation for the likelihood function given in Theorem 9, one obtains a similar recurrence equation for the derivative of the likelihood function. This expression involves the derivatives of the coefficients with respect to λ; they are the coefficients of the polynomial $\partial Q_n / \partial \lambda$. These polynomials are computed thanks to the recurrence
which can be simplified using
By exploiting all these results together, we can calculate the Fisher information for the POPBP using Eq (8) for a given initial population size x0. Note that in numerical evaluations, all infinite sums in the calculation of the Fisher information should be properly truncated.
Experimental methodology
In Generating functions for the likelihood we used generating functions to develop a recursive equation for the likelihood function of a POPBP. As stated in Partially observable pure birth process, computing and maximizing the Fisher information for a POPBP, even for small values of n and λ, can be very time consuming. Nonetheless, this section shows that the results of Generating functions for the likelihood can significantly speed up the computation of the Fisher information, and accordingly help us derive optimal experimental designs for POPBPs efficiently. Recall that the goal is to compute the following optimal observation times:
We used Maple 2017 to symbolically pre-compute the generating function for the likelihood function and its derivative, which are used to compute the Fisher information.
Parallelization
For a vector $\boldsymbol{z} = (z_1, \dots, z_m)$, we call $|\boldsymbol{z}| = z_1 + \cdots + z_m$ its degree. A consequence of the recurrence relation for the likelihood function is that the recursive computation of its value at $\boldsymbol{z}$ relies entirely on values at vectors $\boldsymbol{z}'$ of smaller degree (and similarly for its derivative). We exploit this observation to enable parallelization of the computation.
Definition 13 (Slice). Let $S \ge 0$ be an integer. We define
$$\mathcal{S}_S = \{\boldsymbol{y} \in \mathbb{Z}_{\ge 0}^{n} : |\boldsymbol{y}| = S\}$$
and call it the slice of the computation of degree $S$. Thus the Fisher information may be rewritten
$$I_n(\lambda) = \sum_{S = 0}^{\infty}\ \sum_{\boldsymbol{y} \in \mathcal{S}_S} \frac{\left(\frac{\partial}{\partial \lambda} L(\lambda; \boldsymbol{y})\right)^{2}}{L(\lambda; \boldsymbol{y})}.$$
We compute the Fisher information by computing each slice in turn, starting at 0, with the computations for each slice being independently computed in parallel. We store the values of the likelihood and its derivative from the terms of each slice until they are no longer needed.
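The slice idea can be sketched on a toy multivariate recurrence (ours, not the POPBP recurrence of the paper), in which the value at a vector depends only on vectors of strictly smaller degree, so slices can be filled in increasing order of degree and the entries within a slice are mutually independent:

```python
from itertools import product as iproduct
from math import factorial

# Toy recurrence: f(z) = sum_i f(z - e_i) with f(0, ..., 0) = 1, whose
# solution is the multinomial coefficient.  The value at any vector z uses
# only vectors of degree |z| - 1, so slices can be filled in order, and the
# entries within one slice are independent of each other.

def slice_of(S, m):
    """All vectors of m non-negative integers of degree (sum) S."""
    return [z for z in iproduct(range(S + 1), repeat=m) if sum(z) == S]

def compute_by_slices(S_max, m):
    values = {(0,) * m: 1}          # slice 0
    for S in range(1, S_max + 1):
        # Every entry of slice S uses only slice S - 1; these loop bodies
        # could run in parallel.
        for z in slice_of(S, m):
            values[z] = sum(values.get(z[:i] + (z[i] - 1,) + z[i + 1:], 0)
                            for i in range(m) if z[i] > 0)
    return values

vals = compute_by_slices(6, 3)
expected = factorial(6) // (factorial(2) * factorial(1) * factorial(3))
print(vals[(2, 1, 3)], expected)  # both 60
```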
Implementation considerations
The above computation method for the Fisher information was implemented in C++, and compiled to a shared library to facilitate its use with third-party optimization software. Interested readers can access the code through GitHub (https://github.com/matt-sk/POPBP-Fisher-Information-Calculator.git).
This implementation includes multi-threaded code that computes the terms of an individual slice in parallel, as well as a single-threaded variant; both options are included, so the user may choose at runtime whether to take advantage of multiple processors.
The computation proceeds one slice at a time, starting at S = 0 and continuing until the sum of values for a slice does not change the accumulated value. That is, we terminate the computation when adding a slice's total to the accumulated sum leaves it unchanged in floating-point arithmetic.
We observe from Eq (16) in Theorem 9 that the computation of a single slice needs only the n previous slices. Consequently, our computation only stores the current slice and the n previous slices, discarding no-longer-needed slices as we compute.
Currently, the values of n for which our implementation can compute are fixed, due to the generating functions having been pre-computed for fixed n. The coefficients of the recurrence relation from Eq (16) are hard-coded in the software, although in such a way that little modification is needed to add new cases. As of the time of writing, our implementation is capable of computing the Fisher information for values of n up to 5.
We have written our implementation using C++ templates in such a way that it should be capable of computation using any desired precision. For the templated code to work with arbitrary precision numeric types, those numeric types must use overloaded arithmetic operators. We have not tested it using arbitrary precision libraries; we have used only IEEE single (32 bit) and double (64 bit) precision (C++ float and double types, respectively). We computed the results in this article with double precision.
Our implementation takes λ, p, and the observation times $(t_1, \dots, t_n)$ as input, and computes the value of the Fisher information. The coefficients of the recurrence and of its derivative depend on the values of these parameters, so our implementation begins by computing and storing these coefficients. More precisely, our implementation computes and stores the coefficients in a pre-scaled form so as to save a division operation in the computation of each term (and thus save many divisions over the computation of the Fisher information).
Each slice is stored in an array. To compute the value of the likelihood for a given vector we must be able to access arbitrary values within the earlier slices. To do so we must be able to index each element in the slice. That is, we must be able to describe in code a bijection between the integers $0, 1, 2, \dots$ and the elements of the slice. We have described the bijection with a recurrence relation, and computed it symbolically in Maple for the supported values of n.
We would like a more generic solution to this indexing problem (i.e., one that does not require hard coding for each new value of n for which we want to compute). Such a solution would need to be at least as fast in implementation as our current method. We note that an early iteration of the implementation used an associative container (std::unordered_map in C++; readers familiar with Python can think of this as a dictionary) as a generic solution, but this solution proved to be slower than the current implementation, presumably due to the search time inherent in the data structure.
Being able to preserve locality between elements of a slice and the required elements of the lower slices needed for their computation—whether through indexing, a clever data structure, or otherwise—would be particularly desirable. That is, we would like to be able to reliably partition each slice (roughly evenly) in such a way that each partitioned subset, as well as the subsets of the lower slices required for its computation, are easily extractable in contiguous memory with few unneeded extra elements. Doing so would allow us to more easily break up the computation of a single slice over multiple computation devices (e.g., using GPUs or MPI) and—if the partitions were small enough—could also allow some more fine-grained memory caching optimizations on a single machine.
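One generic indexing scheme (our sketch, not the paper's hard-coded bijection) ranks the vectors of a slice, i.e. the weak compositions of S into m parts, lexicographically using binomial coefficients:

```python
from math import comb
from itertools import product as iproduct

# Vectors of m non-negative integers summing to S are weak compositions,
# of which there are comb(S + m - 1, m - 1).  Lexicographic ranking gives
# an O(m*S) bijection onto 0 .. count-1 without any hard-coded cases.

def rank(z):
    """Lexicographic index of z among all vectors of the same length and sum."""
    m, rem, r = len(z), sum(z), 0
    for i, zi in enumerate(z):
        parts_left = m - i - 1          # positions after position i
        if parts_left == 0:
            break                       # the last entry is forced
        for v in range(zi):
            # Completions of the remaining positions summing to rem - v.
            r += comb(rem - v + parts_left - 1, parts_left - 1)
        rem -= zi
    return r

# Brute-force check on a small slice: ranks must be exactly 0 .. count-1
# in lexicographic order.
m, S = 3, 4
slice_S = sorted(z for z in iproduct(range(S + 1), repeat=m) if sum(z) == S)
ranks = [rank(z) for z in slice_S]
print(len(slice_S))  # comb(6, 2) = 15
```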
Finally, we note that although our implementation can compute the Fisher information for the case n = 5, no results are printed in this paper for this case, as the optimization times proved to be prohibitively slow.
Optimization method
To optimize the Fisher information (i.e., find the optimal observation times $(t_1^{\ast}, \dots, t_n^{\ast})$) for fixed values of λ and p, we use Maple's NLPSolve function (which itself relies on numerical code from the NAG library) to perform the optimization, using our C++ implementation (accessed through the shared library) for computation of the Fisher information.
We know that $0 \le t_1 \le t_2 \le \cdots \le t_n$ and that $t_n \le \tau = 1$. Consequently, we search over the domain $0 \le t_1 \le \cdots \le t_n \le 1$. The boundaries of this domain are when $t_i = 0$ for any $i$, when $t_i = 1$ for any $i$, and when $t_i = t_{i+1}$ for some $i$. Note that when n is large enough we may have boundaries that are unions of these types of boundary.
We optimize the interior and each boundary individually. However, we also know—because we know the population size at time t = 0 (i.e., x0) almost surely—that $t_i^{\ast} \ne 0$ for any $i$, so we exclude any boundaries where $t_i = 0$.
The NLPSolve function offers different optimization methods, and we use two of them depending on whether the region we are optimizing over is one-dimensional or multi-dimensional. For one-dimensional regions (boundaries with only a single varying $t_i$, or with a single parameter $t$ for a single group of equal arguments $t_i = t_{i+1} = \cdots = t_j$) we use the "branchandbound" method, which performs a global search. For multi-dimensional regions (all other regions) we use the default method, which for our problem is the "SQP" (sequential quadratic programming) method.
Definition 14. Numerical results reveal that for a fixed set of parameters, the value of $t_i^{\ast}$ is equal to 1 for small values of p. However, a value of p exists at which $t_i^{\ast}$ drops suddenly from 1. We call such a point a "drop value" and denote it by $p_i^{\ast}$. Clearly, $p_n^{\ast}$ does not exist because $t_n^{\ast}$ is always 1.
The graphs of $t_i^{\ast}$ in Experimental results are produced by first choosing a fixed λ and then calculating the so-called drop values using a binary search. For every $i < n$ we presume $t_i^{\ast} = 1$ when p = 0, and that $t_i^{\ast} < 1$ when p = 1, and conduct a binary search on p to bound $p_i^{\ast}$. We search for $p_1^{\ast}$ first, then $p_2^{\ast}$, and so on. However, because each optimization produces $(t_1^{\ast}, \dots, t_n^{\ast})$ for a particular p value, we update the upper and lower bounds of all $p_i^{\ast}$ while we are searching for a particular one. This approach allows the later binary searches to perhaps have narrower bounds to begin searching within.
The bounds for each drop value yield an open interval, $(l_i, u_i)$, which is stored after computation. The maximum width of the interval is specified at computation time; however, due to the nature of the binary search algorithm, the computation may produce a narrower bounding interval. All drop values for the results presented in this paper have been bounded to within an interval of width less than $10^{-6}$.
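The binary search for a single drop value can be sketched as follows; the expensive optimization producing $t_i^{\ast}(p)$ is replaced here by a hypothetical step function, and all names are ours:

```python
# Binary-search sketch for bounding a drop value.  The stand-in t_star
# equals 1 below an unknown threshold and drops after it, mimicking the
# qualitative behaviour of the optimal observation times.

HIDDEN_DROP = 0.62317  # unknown to the search; for demonstration only

def t_star(p):
    return 1.0 if p < HIDDEN_DROP else 0.8  # hypothetical optimiser output

def bound_drop_value(lo=0.0, hi=1.0, tol=1e-6):
    """Shrink (lo, hi) around the drop value until hi - lo < tol."""
    while hi - lo >= tol:
        mid = (lo + hi) / 2.0
        if t_star(mid) == 1.0:
            lo = mid     # still before the drop
        else:
            hi = mid     # at or past the drop
    return lo, hi

lo, hi = bound_drop_value()
print(lo, hi)  # an interval of width < 1e-6 containing HIDDEN_DROP
```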
Proposition 15. The calculated open bounding intervals for two drop values overlap if and only if the calculated bounds are identical.
Proof: Suppose that the binary search for drop value $p_i^{\ast}$ has completed with bounds $(l_i, u_i)$, and that $i < n - 1$. We consider the computation of $p_{i+1}^{\ast}$.

First, observe that no value of $p \in (l_i, u_i)$ can have been tested in any of the previous binary search calculations. If one had been, then either the upper or lower bound of $p_i^{\ast}$ would have been updated at that time.

Second, observe that $l_i$ must be a lower bound for $p_{i+1}^{\ast}$. This is because $t_{i+1}^{\ast} \ge t_i^{\ast}$ by definition, and we know that $t_i^{\ast} = 1$ when $p = l_i$. So it must be the case that $t_{i+1}^{\ast} = 1$ when $p = l_i$, and so $l_i$ is a lower bound for $p_{i+1}^{\ast}$.

Now consider the value of $t_{i+1}^{\ast}$ when $p = u_i$. Note that this value would have been observed during the binary search for one of the previous drop values, and the starting bounds for $p_{i+1}^{\ast}$ would have been updated appropriately at that time.

- If $t_{i+1}^{\ast} = 1$ when $p = u_i$, then it must be the case that $p_{i+1}^{\ast} \ge u_i$. Moreover, at the commencement of the binary search for $p_{i+1}^{\ast}$, the lower bound would already be at least $u_i$. So the binary search must produce bounds $l_{i+1}$ and $u_{i+1}$ such that $l_i < u_i \le l_{i+1} < u_{i+1}$, so the open intervals can not overlap.

- If $t_{i+1}^{\ast} < 1$ when $p = u_i$, then it must be the case that $p_{i+1}^{\ast} \le u_i$. Moreover, the first and second observations, above, imply that the starting bounds in the binary search must be $(l_i, u_i)$. The binary search for $p_i^{\ast}$ has completed, so the width of the open interval $(l_i, u_i)$ is less than the required threshold. Thus the starting bounds are already within the required threshold, and so the binary search will terminate immediately, yielding $l_{i+1} = l_i$ and $u_{i+1} = u_i$.

We apply this reasoning iteratively starting at $i = 1$ and the result follows. □
Once we have bounded all of the drop values, we produce the optimization graph in parts. Each part is the interval from the upper bound of one drop value to the lower bound of the next (inclusive). We do not need to compute the optimal observation times for any value of p below the lower bound of the first drop value, because there they are always 1.
For each part we use Maple’s plot function to plot the optimal observation times for p in the interval. As a consequence of the plotting, the optimal observation times are calculated for values of p within the interval, chosen by the plot function. We ensure these values of p are chosen so that they are at most 0.005 apart, and that at least three lie in the interval. Note that the optimal observation times are usually evaluated for significantly more values of p than the minimum needed to fulfill this requirement, because the plot function employs an adaptive algorithm whereby it may choose additional values of p so as to produce a smoother plot. When all the parts are plotted, we overlay them together on a single pair of axes to produce the graph.
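The minimal sampling requirement just described (grid points at most 0.005 apart, at least three per part) can be sketched as follows. This is an illustrative C++ stand-in for what the Maple scripts request from the plot function, not the actual implementation; Maple's adaptive algorithm then refines this grid further.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Evenly spaced sample points in [a, b], at most `maxGap` apart, with at
// least `minPts` points in the interval. (Default values mirror the text:
// 0.005 apart, at least three points.)
inline std::vector<double> sampleGrid(double a, double b,
                                      double maxGap = 0.005, int minPts = 3) {
    int steps = std::max(minPts - 1,
                         static_cast<int>(std::ceil((b - a) / maxGap)));
    std::vector<double> pts;
    for (int i = 0; i <= steps; ++i)
        pts.push_back(a + (b - a) * i / steps);  // endpoints included
    return pts;
}
```

For example, a very narrow part such as [0.2, 0.21] is still sampled at three points, while [0, 1] requires at least 201.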
An important edge case.
Our implementation described in Implementation considerations does not work when p = 1. However, this case is precisely the case of a PBP. So the values for p = 1 are calculated separately using Eq (5) and are manually appended to the plot data, ensuring the plot never attempts to evaluate p = 1 using our C++ implementation.
Note that we optimize Eq (5) directly, instead of using the approximation from Eq (6) (the approximate optimum for Eq (5)) at the end of Optimal experimental design. Recall that the approximation is an asymptotic result in the number of observations. As such, the utility of the approximation for the values of n we use in this paper is poor.
The difference in Fisher information (Eq (8)) between using the approximated observation times and using the directly optimized ones is visualized in Fig 1. To constrain the size of the difference in the Fisher information, we calculate the ratio of the Fisher information from the approximation to the Fisher information from direct optimization, and subtract that ratio from one. This gives a clear visual indication of the percentage difference in Fisher information: a larger magnitude indicates a greater difference, a positive value indicates that the optimized observation times produce a larger Fisher information (and thus are a more optimal set of observation times), and a negative value would indicate that the Becker and Kersting approximations are more optimal than those produced by the numerical optimization.
Directly optimized versus Becker & Kersting approximation.
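Under our reading of this sign convention (an assumption on our part: the quantity plotted is one minus the ratio of approximated to directly optimized Fisher information), the computation is a one-liner; `fiDirect` and `fiApprox` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Relative gain of direct optimisation over the approximation, under our
// reading of the Fig 1 convention: positive when the directly optimised
// observation times carry more Fisher information.
inline double relativeGain(double fiDirect, double fiApprox) {
    return 1.0 - fiApprox / fiDirect;  // e.g. 0.05 means a 5% gain
}
```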
We see that the directly optimized Fisher information is as much as 5% greater than the approximated optimal Fisher information when n = 2. Moreover, perhaps surprisingly, the difference in fact increases as λ grows. Nonetheless, the maximum discrepancy does appear to shrink as n grows, as we expect from the asymptotic result, decreasing to slightly under 1% when n = 4. We also see, unsurprisingly, that the approximated optimal observation times are never more optimal than the numerically optimized observation times.
If we restrict our attention to only the ranges of λ reported in our results (which differ for n = 2, n = 3, and n = 4), the discrepancy is much less pronounced. The results are shown in Fig 2, in which we also look at the difference between the directly optimized and the approximated observation times.
Directly optimized versus Becker & Kersting approximation.
We see little noticeable difference in the Fisher information, but do see the difference in the optimized observation times reach between 0.01 and 0.02 (which could affect the resulting parameter estimate in the second or third decimal place). We note that although this difference is small, it was nonetheless large enough that when we used the approximation to compute the value at p = 1, we saw a noticeable “kink” in the graphs of the optimal observation times, where the values numerically optimized from the POPBP Fisher information (as p approaches 1) met the value given by the approximation at p = 1.
Ultimately, optimizing Eq (5) directly is more accurate than using the Becker and Kersting approximation (i.e., Eq (6)). Consequently, we adopted that strategy for our computations.
Experimental results
In this section we present the results of our computations. The presented results, along with the tools for producing them (predominantly Maple scripts, and some bash shell scripts), can be found in a GitHub repository (https://github.com/matt-sk/POPBP-Fisher-Information-Optimisation.git). Note that this repository is different from the repository for our C++ implementation of the Fisher information calculation, described above. The results repository includes the C++ implementation repository as a submodule.
Optimal observation times
Recall that the optimal observation times are those that maximize the Fisher information, that p is the probability of an individual from the population being observed at any given time, that n is the number of observation times, and that λ is the growth rate of the population.
Two observations (n = 2).
Figs 3, 4, 5, 6, 7, 8, and 9 show the values of the optimal observation times as p varies in the case of n = 2 observation times. They cover seven increasing values of λ, the largest being 5. Recall that t2 = 1 always, so the only observation time shown in these figures is t1.
For the value of λ in Fig 3, our calculations bound the drop value to within an open interval of width less than 10⁻⁶. Looking at Fig 3, the motivation for the name “drop value” should be apparent: at the drop value, the graph of t1 suddenly drops from 1. We indicate this drop with a dotted vertical line.
To understand why this drop happens, recall that lower values of p mean the probability of individuals being observed at time t1 is lower. To obtain a better estimator of λ, we want to be able to observe more individuals to more accurately gauge how much the population has grown (from x0 = 1 individual initially). Waiting the maximum amount of time by taking the first (and second) observations at time 1 increases the expected number of individuals observed.
However, a point is reached where the information obtained from taking the first observation earlier outweighs the information obtained from taking the first observation at time t1 = 1. This moment occurs precisely at the drop point (i.e., at the drop value). Rather than decreasing smoothly from 1, t1 drops instantly.
Note that in Fig 3 the values of t1 decrease after the drop point. Conversely, in Fig 4 the values increase after the drop point. This behavior continues for all subsequent values of λ (for n = 2), although the curvature may vary.
We observe that the drop values decrease as λ increases. Our computations give the drop value for each of the remaining values of λ, and we see drops in the graphs of Figs 4, 5, 6, 7, 8, and 9 at the corresponding values of p. We see this pattern of decreasing drop values continue when n = 3 and n = 4.
To visualize the behavior of the change in the drop values, we plot them against λ in Fig 10. We see that the decrease is not linear.
Recall that our computations bound the drop value. To speed up the computation time for the graph in Fig 10, for each λ we computed until the upper and lower bounds for the drop value were less than 10⁻³ apart (instead of the 10⁻⁶ used in our other drop-value calculations). Two hundred values of λ were used to generate the graph. Note that we have plotted both the upper and lower bounds in Fig 10; however, the bounds are sufficiently close that the difference cannot be discerned in the graph.
This behavior can be explained by recalling that a higher λ means a higher population growth rate, so the population grows relatively larger (compared with a population with a lower growth rate) over time. This growth means a lower probability p suffices to obtain a satisfactory expected number of observed individuals at the first observation time (i.e., a lower drop value).
We see, in Fig 11, the Fisher information plotted against t1 and p, for the case of n = 2 observations and three growth rates. We have additionally overlaid the optimal Fisher information onto the graph (as a thick black line on each surface). If we observe each plot looking down (observing only the t1–p plane), we see in the black line (the optimal Fisher information) precisely the shape of the curves in Figs 5, 6, and 7, respectively. If we observe each plot looking only at the plane of the Fisher information against p, we see in the black line precisely the shape of the optimal Fisher information as seen in the corresponding graphs we present in Optimal Fisher information later in this section.
Fisher information plotted against t1 and p; the left, mid, and right panels correspond to the three growth rates.
Recall that Bean et al. [15] could not calculate the Fisher information for high values of λ (even in the n = 2 case), and thus could not assess the quality of the approximation for such λ. However, as our new approach allows us to calculate the Fisher information for high values of λ more efficiently, we can use it to assess the quality of the approximation for finding optimal observation times for higher values of λ. To this end, we show in each of Figs 12, 13, 14, 15, 16, 17, and 18 both the directly optimized and the approximated optimal observation times, each figure for a different value of λ.
We computed the approximated optimal observation times using the same optimization libraries in Maple described above, but using a symbolic representation of the objective instead of our C++ Fisher information calculation implementation. Undefined values exist in Eq (10) when t1 = t2, when t1 = 0, and when t2 = 0. We accounted for these by pre-computing in Maple the limits of the formula for all of these cases and employing a piecewise function to choose the correct expression.
The aforementioned piecewise function evaluates the equality of t1 and t2 by checking whether |t1 − t2| is below a small numeric threshold (i.e., numeric zero instead of symbolic zero) to account for cases where the values are not precisely the same, but are sufficiently close to yield the undefined values. We similarly check for t1 and t2 being equal to numeric zero.
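The numeric-zero guard can be sketched as follows. The formula below is an illustrative stand-in, not Eq (10): it is undefined at t1 = t2, where its limit is exp(−t1); the tolerance value is our assumption, as the paper does not state one.

```cpp
#include <cassert>
#include <cmath>

// Assumed tolerance for "numeric zero" (not taken from the paper).
constexpr double kNumericZero = 1e-9;

inline bool numericallyEqual(double a, double b) {
    return std::fabs(a - b) < kNumericZero;
}

// Illustrative stand-in for a formula that is undefined at t1 = t2:
// f(t1, t2) = (t2 - t1) / (exp(t2) - exp(t1)), whose limit as t2 -> t1
// is exp(-t1). The guard routes nearly-equal inputs to the limit branch.
inline double piecewiseSafe(double t1, double t2) {
    if (numericallyEqual(t1, t2))
        return std::exp(-t1);                              // pre-computed limit
    return (t2 - t1) / (std::exp(t2) - std::exp(t1));      // generic formula
}
```

Without the guard, inputs such as t2 = t1 + 10⁻¹² would divide two quantities that are both numerically near zero; with it, they take the limit branch instead.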
Note that the graphs in Figs 12 and 13 have a vertical gray line immediately after the dotted gray line indicating the sudden drop. These vertical lines are due to the optimization routines finding intermediate values of t1 for some values of p immediately after the drop value but before a second (more significant) drop. The effect is more apparent visually in Fig 13. The remaining graphs do not exhibit this behavior.
For the values of λ in Figs 12 and 13, the directly optimized t1 drops to 0.95 after the drop point, and then drops again, more significantly, at a slightly larger value of p. Comparatively, the approximated curve drops only once and does not drop as far as the directly optimized curve; nonetheless, the curves are remarkably close.
In all cases, we see both curves approach the same value as p approaches 1. This observation is clearest for the smaller values of λ, for which the approximation appears to be poorest. The reason is that when p = 1 the POPBP reduces to a PBP, and both the directly optimized and the approximated quantities coincide with the exact form of the Fisher information for a PBP as given in Eq (5). We note that we did not force the computation of Eq (5) for p = 1 when computing the approximated curve, as we did for the directly optimized curve, and yet the curves nonetheless meet.
In all cases, we see that as p increases, the two curves (directly optimized and approximated) become closer together. Fig 12 notwithstanding, as λ increases the curves become closer together as well. Therefore, when n = 2, we can use the approximation to find the optimal observation times when λ is large, because the calculation is faster and the approximation is quite good. When λ is small, we can optimize directly, because the calculation time is then shorter, and the approximation is poorer.
Three observations (n = 3).
Figs 19, 20, 21, 22, 23, and 24 show the values of the optimal observation times as p varies in the case of n = 3 observation times. They cover six increasing values of λ, the largest being 4. Recall that t3 = 1 always, so only t1 and t2 are shown in these figures.
In one of these cases, the end of the curve for t1 before a drop value appears to meet the beginning of the curve for t2 after it. Looking more closely at this region (as shown in Fig 25) shows that, although they are very close, the curves do not meet.
Plot zoomed in to show the curves do not touch.
Observe that in all the figures, the graphs drop a second time, at the second drop value. We did not explicitly calculate these second (nor third, etc.) drops of the individual curves, yet they nonetheless appear in the calculated graphs. Recall that each drop value is specifically defined as (and thus computed to be) the first value of p at which a particular observation time drops. Consequently, the second drop value marks the first drop of its associated observation time, and we stress that none of the computations for its bounds took the other observation times into account at all.
Furthermore, we do not compute any values of the observation times for values of p within the bounds of any drop value. Thus whether the curves drop at exactly the second drop value, or at a point very close to it, is unclear. However, we note that an earlier calculation had the drop values bounded by intervals of width less than 10⁻¹², and we observed the same phenomenon in the graphs calculated at that time. We abandoned and recomputed that data because we had not recorded timing information for the calculations and, importantly, the timing information reported in this paper is for the computations used to produce the results reported herein. When recomputing, we decided that bounding to within an interval of width less than 10⁻¹² was overkill, and we opted to use 10⁻⁶ instead.
We see this behavior consistently in all our graphs for n = 3 and n = 4. In general, the graphs of all of the optimal observation times suddenly drop in value at each drop value.
In all cases, we see the same phenomena with regard to drop values that we do in the n = 2 case. As λ increases, both drop values decrease. Unsurprisingly, the first drop value is always less than the second; however, the distance between them varies as λ increases.
To visualize the behavior of the change in both drop values, we plot them against λ in Fig 26. As we did for the n = 2 case, we computed until the bounding intervals were of width less than 10⁻³, and we plot both the upper and lower bounds in the figure for each drop value.
Four observations (n = 4).
Figs 27, 28, and 29 show the values of the optimal observation times as p varies in the case of n = 4 observation times. They cover three increasing values of λ, the largest being 1. Recall that t4 = 1 always, so only t1, t2, and t3 are shown in these figures.
In all cases, we see the same phenomena with regard to drop values that we did for the n = 2 and n = 3 cases. As λ increases, all three drop values decrease. Unsurprisingly, the first drop value is always less than the second, which in turn is always less than the third; however, the distances between them vary as λ increases.
We did not plot the drop values against λ to visualize the behavior of their change. We chose not to do so because of the large time required to compute the drop values for a single λ (even when only computing to an interval of width less than 10⁻³), and the large number of values of λ required to produce such a plot.
Optimal Fisher information
Figs 30, 31, 32, 33, 34, 35, and 36 show the optimal Fisher information corresponding to the optimal parameters from the ‘Two observations (n = 2)’, ‘Three observations (n = 3)’, and ‘Four observations (n = 4)’ subsections, above.
In all cases the graph appears to be increasing. We see an initial concave-down curve followed by a sudden change of curvature and direction in the graph (and a non-differentiable point at the interface between the two). Careful observation shows that each such change corresponds to a drop value. Indeed, the curves for n = 2 have only a single change, whereas the curves for n = 3 have two changes (although in some cases the second change can be difficult to see), and for n = 4 we have three changes.
As should be expected, for a fixed growth rate λ we see that the optimal Fisher information increases as the number of observations, n, increases. For small λ the curves appear to converge to the same value as p approaches 1; however, this convergence ceases as λ grows. Furthermore, for a fixed growth rate λ and number of observations n, the optimal Fisher information increases as the probability of observation p increases.
A surprising observation is that the maximal optimal Fisher information decreases as λ increases from 0.5 to 2, which can be seen by comparing the optimal Fisher information for p = 1 on each graph and observing that it is decreasing. This observation is clearer when we draw the graphs side by side on equally sized axes as shown in Figs 37 and 38.
Timings
All computations reported in this section were performed on Intel(R) Xeon(R) Gold 6150 CPUs running at 2.70 GHz. All computations for n = 2 were performed single-threaded, whereas all computations for larger n were performed multi-threaded using 36 simultaneous threads.
The reason for the single-threaded computations for n = 2 is that we found the multi-threaded implementation to be slower than the single-threaded implementation for small values of λ (when n = 2). An individual computation of the Fisher information was so fast that the time required to initialize the threading was significant enough to overshadow the time required to compute with a single thread. As such, and because running all computations for a single n with the same Maple scripts was easier, all computations for n = 2 were performed single-threaded, even though some of these computations might have benefited from multi-threading.
We note that the scheduling software placed an upper limit of 300 hours (12.5 days) on a single computation. The only computation for which this 300-hour limitation was a concern was the computation of the graph data for n = 3 with the largest value of λ. The code was written to allow a terminated computation to recommence from the point of termination; however, the saved data that allowed for recommencement was at the granularity of completed optimizations. Consequently, the optimization being performed when the computation terminated at the 300-hour boundary needed to begin again when the computation resumed in the next 300-hour block, and so some amount of re-computation was unavoidable.
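The granularity issue can be sketched as follows. This is a hypothetical C++ illustration of checkpointing at the level of completed optimizations (the actual implementation is in Maple): an optimization that was interrupted mid-flight leaves no checkpoint entry and must simply be redone on resumption.

```cpp
#include <cassert>
#include <functional>
#include <map>

// Hypothetical checkpoint store: results of completed optimisations,
// keyed by the value of p they were computed for.
using Checkpoint = std::map<double, double>;

// Skip p values whose optimisation completed before an interruption;
// anything else (including a half-finished optimisation) is recomputed.
inline double computeOrSkip(Checkpoint& saved, double p,
                            const std::function<double(double)>& optimise,
                            int& recomputed) {
    auto it = saved.find(p);
    if (it != saved.end()) return it->second;  // completed earlier: skip
    ++recomputed;                              // work done after resumption
    return saved[p] = optimise(p);
}

// Demo: one value checkpointed before a simulated interruption, one not.
inline int demoRecomputeCount() {
    Checkpoint saved;
    saved[0.5] = 1.0;                                  // finished earlier
    int recomputed = 0;
    auto optimise = [](double p) { return 2.0 * p; };  // toy objective
    computeOrSkip(saved, 0.5, optimise, recomputed);   // skipped
    computeOrSkip(saved, 0.6, optimise, recomputed);   // recomputed
    computeOrSkip(saved, 0.6, optimise, recomputed);   // now cached
    return recomputed;
}
```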
The optimization time grows as n increases. Furthermore, for fixed n, the optimization times increase as λ increases and as p increases.
The times taken to bound each drop value in an open interval of width less than 10⁻⁶ are given in Table 2. We show the total time, the number of optimizations performed (i.e., the number of different values of p tested), and the fastest, slowest, and average times for individual optimizations.
Note that because of the nature of the binary search, the optimizations will tend to clump around the drop values. If those drop values are small, we will be performing more of the faster optimizations; conversely, if the drop values are large, we will be performing more of the slower optimizations. As such, the table should not be taken as a strong indication of the relative speed of the computation times for different parameters n and λ.
Also note that because there are n–1 drop values, the number of optimizations performed must grow as n increases.
The times taken to compute the data required to produce the graphs in Experimental results are given in Table 3. Note that more optimizations are performed for the graph creation (compared with the drop-value computation); however, we only optimized for values of p between the drop values. In particular, no values of p below the first drop value are ever calculated.
Furthermore, as λ increases, we see (as discussed above) a decrease in the drop values and an increase in the size of the intervals between them. Consequently, more values of p need to be optimized to meet the requirements enforced in the plotting of the graphs (see above), resulting in a two-fold slowdown: the amount of time per optimization increases, as does the number of optimizations performed. We see this effect in the drastic increase in the time taken as λ increases for fixed n, compared with the moderate increase in the average optimization times.
In the case of n = 3 with the largest value of λ, the 300-hour upper computation limit, the corresponding resumptions, and the unavoidable partial re-computation yield only an approximate computation time. A total of six interruptions (and thus resumptions) occurred during the computation, with the (partially) recomputed optimizations taking approximately 9, 15, 25, 32, 37, and 41 hours, respectively. As such, the reported time is an overestimate of the “true” computation time, and the discrepancy could be, in the worst case, in the vicinity of six days and fifteen hours. Nonetheless, even a discrepancy that large is not sufficient to undermine the broad pattern of growth we see as λ increases.
Conclusion
Determining an optimal experimental design for a growing population governed by a POPBP is a difficult problem. In this article, we developed a new approach to compute the Fisher information for higher values of n and λ. With the use of generating functions, we constructed recursive equations for the likelihood function and its derivative, which we used to calculate the Fisher information. This approach allowed us to calculate the Fisher information more efficiently and accordingly determine which observation times maximize the Fisher information for given values of n, λ, and p.
For future work, we plan to develop further theoretical results on the optimal experimental design of a POPBP with the help of the numerical experiments obtained in this paper.
We expect we can speed up the optimization process by using the drop values to rule out boundaries to check, thus computing fewer optimizations for any given combination of n, p, and λ. For example, for fixed n and λ, if we know we are optimizing for a value of p larger than the upper bound of a drop value, then we know the corresponding optimal observation time is strictly less than 1, so we can avoid optimizing over any boundaries wherein that observation time equals 1. This technique can even work within the binary search to find the drop values, if we recall that our initial assumption was that the observation times equal 1 for all i; however, we note that the implementation must be careful about how it handles cases where p is inside multiple bounding intervals simultaneously. An early test of this idea almost doubled the computation speed when we used the drop-value information (compared with the current method), but we have not yet finished implementation and testing, nor do we have properly rigorous timing data.
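The bookkeeping behind this pruning idea can be sketched as follows; the interval data and counting function are hypothetical illustrations, not our implementation.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical sketch of boundary pruning: given the stored open bounding
// intervals (lo, hi) for each drop value, count how many "observation time
// = 1" boundary cases can be skipped at a candidate p.  If p lies strictly
// above hi for a drop value, that drop has certainly occurred, so the
// corresponding boundary cannot contain the optimum.
inline int skippableBoundaries(double p,
                               const std::vector<std::pair<double, double>>& bounds) {
    int skip = 0;
    for (const auto& b : bounds)
        if (p > b.second) ++skip;  // drop certainly occurred before p
    return skip;
}
```

For p inside one of the stored intervals, the corresponding boundary cannot be ruled out, which is exactly the case requiring the extra care mentioned above.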
Finally, recent advances in GPU technologies (particularly in memory sizes) open the possibility of exploiting the highly parallel nature of our computations on GPU hardware.
Acknowledgments
We would like to thank Prof. Nigel Bean and Prof. Joshua Ross from the University of Adelaide for discussing topics in partially observable pure birth processes, in particular the proof of Definition 5, with the first author. An early version of this work was discussed with Alin Bostan and Frédéric Chyzak at Inria in 2014.
References
- 1. Goos P, Mylona K. Quadrature methods for Bayesian optimal design of experiments with nonnormal prior distributions. J Comput Graph Statist. 2017;27(1):179–94.
- 2. López-Fidalgo J. Optimal experimental design. Springer; 2023.
- 3. Keeling MJ, Ross JV. On methods for studying stochastic disease dynamics. J R Soc Interface. 2008;5(19):171–81. pmid:17638650
- 4. Black AJ, McKane AJ. Stochastic formulation of ecological models and their applications. Trends Ecol Evol. 2012;27(6):337–45. pmid:22406194
- 5. Biron-Lattes M, Bouchard-Côté A, Campbell T. Pseudo-marginal inference for CTMCs on infinite spaces via monotonic likelihood approximations. J Comput Graph Statist. 2022;32(2):513–27.
- 6. Keiding N. Estimation in the birth process. Biometrika. 1974;61(1):71–80.
- 7. Pagendam DE, Pollett PK. Locally optimal designs for the simple death process. J Statist Plan Inference. 2010;140(11):3096–105.
- 8. Keiding N. Maximum likelihood estimation in the birth-and-death process. Ann Statist. 1975;3:363–72.
- 9. Ross JV, Taimre T, Pollett PK. On parameter estimation in population models. Theor Popul Biol. 2006;70(4):498–510. pmid:16984803
- 10. Pagendam DE, Pollett PK. Optimal sampling and problematic likelihood functions in a simple population model. Environ Model Assess. 2008;14(6):759–67.
- 11. Pagendam DE, Pollett PK. Robust optimal observation of a metapopulation. Ecol Modell. 2010;221(21):2521–5.
- 12. Pagendam DE, Pollett PK. Optimal design of experimental epidemics. J Statist Plan Inference. 2013;143(3):563–72.
- 13. Becker G, Kersting G. Design problems for the pure birth process. Adv Appl Probab. 1983;15(2):255–73.
- 14. Bean NG, Elliott R, Eshragh A, Ross JV. On binomial observations of continuous-time Markovian population models. J Appl Probab. 2015;52(2):457–72.
- 15. Bean NG, Eshragh A, Ross JV. Fisher information for a partially observable simple birth process. Commun Statist Theory Methods. 2016;45(24):7161–83.
- 16. Eshragh A, Alizamir S, Howley P, Stojanovski E. Modeling the dynamics of the COVID-19 population in Australia: a probabilistic analysis. PLoS One. 2020;15(10):e0240153. pmid:33007054
- 17. Feller W. An introduction to probability theory and its applications. 3rd ed. John Wiley; 1968.
- 18. Stanley RP. Enumerative combinatorics. Wadsworth & Brooks/Cole; 1986.
- 19. Stanley RP. Enumerative combinatorics. Cambridge University Press; 1999.
- 20. Wilf HS. Generatingfunctionology. A K Peters; 2006.
- 21. Flajolet P, Sedgewick R. Analytic combinatorics. Cambridge University Press; 2009.
- 22. Spall JC. Monte Carlo computation of the Fisher information matrix in nonstandard settings. J Comput Graph Statist. 2005;14(4):889–909.
- 23. Spall JC. Introduction to stochastic search and optimization: estimation, simulation and control. New Jersey: John Wiley & Sons; 2003.
- 24. Ross SM. Stochastic processes. New York: John Wiley & Sons; 1996.