adaPop: Bayesian inference of dependent population dynamics in coalescent models

The coalescent is a powerful statistical framework that allows us to infer past population dynamics leveraging the ancestral relationships reconstructed from sampled molecular sequence data. In many biomedical applications, such as in the study of infectious diseases, cell development, and tumorgenesis, several distinct populations share evolutionary history and therefore become dependent. The inference of such dependence is a highly important, yet a challenging problem. With advances in sequencing technologies, we are well positioned to exploit the wealth of high-resolution biological data for tackling this problem. Here, we present adaPop, a probabilistic model to estimate past population dynamics of dependent populations and to quantify their degree of dependence. An essential feature of our approach is the ability to track the time-varying association between the populations while making minimal assumptions on their functional shapes via Markov random field priors. We provide nonparametric estimators, extensions of our base model that integrate multiple data sources, and fast scalable inference algorithms. We test our method using simulated data under various dependent population histories and demonstrate the utility of our model in shedding light on evolutionary histories of different variants of SARS-CoV-2.


Proofs of identifiability
We first recall a few preliminary definitions and results that will be used throughout the proofs. The review that follows is based on (author?) (1) Chapter 2.
Let t be distributed according to density p θ θ θ with respect to the Lebesgue measure, with θ θ θ ∈ Θ, where θ θ θ is a vector of length s. The Expected Fisher Information (EFI) is defined as I(θ θ θ ) = ||I i, j (θ θ θ )||, where I(θ θ θ ) depends on a given parametrization. Suppose that θ i = h i (ξ 1 , . . . , ξ p ), for i = 1, . . . , s and let J denote the Jacobian matrix J = || ∂ θ i ∂ ξ j ||. Then the EFI I(ξ ξ ξ ) can be written as a function of I(θ θ θ ) as follows I(ξ ξ ξ ) = J I(θ θ θ (ξ ξ ξ )) J T . (S1) If p θ θ θ belongs to an exponential family, then θ θ θ is identifiable if I(θ θ θ ) is nonsingular in a convex set containing Θ ((author?) (2) Theorem 3). In Section 3 of the main text, we assumed that the EPS is piecewise-constant, N e = (N e,i ) 1:M , with the boundary points described by a regular grid (k i ) 1:M+1 . Under the coalescent process, we have that g is distributed according to the coalescent density p N e given in Eq. (1) in the main text, and p N e belongs to the exponential family; i.e., to show that N e is identifiable, we need to show that I(N e ) is nonsingular.
In the case of a single population, one can show that I(N e ) is a diagonal matrix with diagonal entries (c 1 N −2 e,1 , . . . , c M N −2 e,M ), where c i denotes the number of coalescent times in the interval [k i , k i+1 ). The determinant |I(N e )| is non-zero if and only if c i > 0 for all i = 1 . . . , M. If this is the case, then N e is identifiable since I(N e ) is full-rank, hence non-singular. The identifiability of N e is also studied by (author?) (3).
The simplest model of two populations is to treat them independently: g A is distributed to a coalescent with EPS N A e = (N A e,i ) 1:M and g B is distributed to a coalescent with EPS N B e = (N B e,i ) 1:M . For the same argument described above, . The hierarchical models described in Eqs. (4) and (5) can be seen as reparametrization of the independent model. To prove Propositions 1 and 2, we will use I(N A e , N B e ) and apply Eq. (S1). The two models define two different reparametrizations ξ ξ ξ and J will be different.
hence it is a block diagonal matrix. The block diagonal structure follows from the independence of consecutive grid intervals. Its determinant is 1 being the product of the determinants of the subsquare 2 × 2 matrices along the diagonal. The result follows, because the determinant of the product of square matrices is the product of the determinants, i.e., |I(N e , γ)| ̸ = 0 and I(N e , γ) is nonsingular.

Proof of Proposition 2
In model Eq.

Simulation details
Details of the trajectories: Here, we describe N A e and N B e for the six scenarios considered: • Scenario 1: • Scenario 3: Simulation strategy: We fix n = 200. For a given scenario, we set n 1 = 2, s 1 = 0, and simulate n − n 1 sampling locations from a Poisson process with rate 500. This defines s and n = (2, 1, . . . , 1). We stress that the sampling is uniform, it does not depend neither on N A e nor N B e , and the rate is chosen such that the expected number of samples in the interval [0, 0.4] is equal to n. The interval [0, 0.4] is of particular interest because it is where all the change points in N A e and N B e are concentrated. We repeat this step twice in order to define two pairs (n, s), which we label (n A , s A ) and (n B , s B ).
Conditionally on n A , s A , and N A e , we sample coalescent times t A according to Algorithm 3 in (author?) (4). We repeat the same step for t B . We sample from the inhomogeneous Poisson processes using the Lewis-Shedler thinning algorithm (5). Note that there is no need to sample tree topologies because the vector t is a sufficient statistic for N e (t).
We rely on a discretization of N e and γ defined by a regular grid (k i ) 1:M+1 . A guideline on how to choose M can be found in (author?) (6). Throughout Section 3 of the main text, we assumed that s A 1 and s B 1 are equal to zero. In applications, this assumption might not hold. For example, the SARS-CoV-2 data sets in Section 4.2 where there are no samples of non-delta variant at time 0 (see heatmaps Figure 4).
Special care is required in this case. A condition for the identifiability of (N e,1 , . . . , N e,M , γ 1 , . . . , γ M ) is that there should be at least one coalescent event per grid interval for both populations. Trivially, this condition will fail if one of the two populations does not have samples at the beginning of the grid.
A solution to this problem is the following. Use as population A, the population that has samples collected at  (N e,1 , . . . , N e,L−1 ). In the analysis of SARS-CoV-2 data, we set the delta sample as population A, the non-delta as population B. Figure 4 depicts the truncation of γ: there are no estimates for γ in the last few weeks. Table S1 reports the average value of each statistic pooling together all scenarios and all grid points. We also average the performance metrics of N A e and N B e . Table S1 differs from the table in the main manuscript because here, we are splitting the performance of each method by scenario.

Synthetic data: additional results
adaPop+Pref stands out again as the best performing method. It only struggles to recover N B e in Scenario 2: neither adaPop nor adaPop+Pref can recover the flat EPS as well as noPop and parPop. parPop is the best performing method in this case. The reason is that by simply setting β = 0, one can recover the flat N B e . parPop is the method with the most polarized performance. When the parametric model is correctly specified (Scenarios 1, 2, and 6), it is among the best-performing methodologies. The accuracy deteriorates substantially when the model is misspecified (Scenarios 3, 4, and 5).
adaPop is often the second best performing method after adaPop+Pref. Surprisingly, it has a performance almost identical to parPop when parPop is correctly specified. This is surprising given that adaPop is more heavily parameterized. When parPop is misspecified, adaPop seems to adapt to the varying dependency and exhibit good performance.

SARS-CoV-2 molecular data analysis
For each country, we randomly sampled 150 sequences with delta variant and 150 sequences without delta variants collected in the period of 2021-03-01 to 2021-09-30 from publicly available high-coverage complete sequences in GISAID (7). The date was chosen based on the availability of the first identified delta variants in each country ( Figure S1). The resulting sampled sequence accession IDs can be found at the end of the Supplementary Materials. The classification of the delta variant group was based on the B.1.617.2 and AY lineages under the Pango lineage designation system (8). The sequences were aligned to the reference sequence EPI_ISL_402124 (9) with MAFFT (10) by GISAID. From the multiple sequence alignment, we filtered out sites with more than 20% missing values in each country.
We analyzed sequences of each variant group of each country independently with BEAST2 (11) using the Extended Bayesian Skyline prior on N e (t) (12), the HKY mutation model with empirically estimated base frequencies (13), and the strict clock model with the rate constrained to [8 × 10 −4 , 1 × 10 −3 ] substitutions per site per year (14). The chain was ran for 20 × 10 6 iterations, thinning every 1000 and with 20% burnin. The resulting maximum credibility clade (MCC) trees can be found in Figures S2 and S3.
With the inferred MCC trees, we examined the quantification of the dependence between the two variants in each country with four methods we introduced: noPop, parPop, adaPop, and adaPop+Pref (Section 4.1 in the main text). Results using the noPop, parPop, and adaPop methods are shown in Figures S4, S5, and S6, respectively, and the results with the adaPop+Pref method is shown in Figure 4 of the main text. Table S1. Summary Statistics of Posterior Inference of γ, N A e , and N B e . Each entry is computed as the mean of the performance metric for a given scenario (100 datasets per scenario). The metrics for N A e and N B e have been also averaged. The numbers in bold indicate the method(s) with the best performance (and within 10% of the best) for each performance metric: the highest for ENV, the lowest for DEV and RWD.