Accelerating Bayesian inference of dependency between mixed-type biological traits

Inferring dependencies between mixed-type biological traits while accounting for evolutionary relationships between specimens is of great scientific interest yet remains infeasible when trait and specimen counts grow large. The state-of-the-art approach uses a phylogenetic multivariate probit model to accommodate binary and continuous traits via a latent variable framework, and utilizes an efficient bouncy particle sampler (BPS) to tackle the computational bottleneck—integrating many latent variables from a high-dimensional truncated normal distribution. This approach breaks down as the number of specimens grows and fails to reliably characterize conditional dependencies between traits. Here, we propose an inference pipeline for phylogenetic probit models that greatly outperforms BPS. The novelty lies in 1) a combination of the recent Zigzag Hamiltonian Monte Carlo (Zigzag-HMC) with linear-time gradient evaluations and 2) a joint sampling scheme for highly correlated latent variables and correlation matrix elements. In an application exploring HIV-1 evolution from 535 viruses, the inference requires joint sampling from an 11,235-dimensional truncated normal and a 24-dimensional covariance matrix. Our method yields a 5-fold speedup compared to BPS and makes it possible to learn partial correlations between candidate viral mutations and virulence. Computational speedup now enables us to tackle even larger problems: we study the evolution of influenza H1N1 glycosylations on around 900 viruses. For broader applicability, we extend the phylogenetic probit model to incorporate categorical traits, and demonstrate its use to study Aquilegia flower and pollinator co-evolution.


Introduction
An essential goal in evolutionary biology is to understand the across-trait covariation observed within biological samples, or taxa, ranging from plants and animals to microorganisms and pathogens such as human immunodeficiency virus (HIV) and influenza.This task is difficult because taxa are implicitly correlated through their shared evolutionary history often described with a reconstructed phylogenetic tree.Here, tree tips correspond to the taxa themselves, and internal nodes are their unobserved ancestors.Inferring across-trait covariation requires a highly structured model that can explicitly describe the tree structure and adjust for acrosstaxa covariation.Phylogenetic models do exactly this but are computationally challenging because one must integrate out unobserved ancestor traits while accounting for uncertainties arising from tree estimation.The computational burden increases when taxon and trait counts grow large and becomes worse when traits include continuous and discrete quantities.Cybis et al. develop the first phylogenetic method that can assess across-trait covariation while controlling for a large, unknown evolutionary tree with hundreds of tips [1].To jointly model mixed-type traits, this approach assumes discrete traits arise from continuously valued latent variables that follow a Brownian diffusion along the tree [2].Assuming latent processes is a common strategy for modeling mixed-type data and it finds uses across various fields [3][4][5][6][7].Subsequent work by [8] solves an essential identifiability issue in [1] by adding specific constraints on the diffusion covariance.The resulting model in particular generalizes the multivariate probit model [9].The most important contribution of [8], however, is an efficient inference scheme that achieves order-of-magnitudes efficiency gains over [1].In this work, we significantly advance performance compared to [8] to solve even larger problems.
Here is an intuition on why our new inference scheme, to be formally introduced in Methods section, outperforms the one by [8].For N taxa each with P continuous or binary traits, Bayesian inference for the phylogenetic probit model involves repeatedly sampling latent variables X from their conditional posterior, an (N × P)-dimensional truncated normal distribution.The (N × P) size of the truncated normal distribution results from having one latent variable for each taxon and each trait.For this task, [8] develop a bouncy particle sampler (BPS) [10] combined with an efficient dynamic programming approach that speeds up the most expensive step in the BPS implementation.Their approach, however, fails to address another source of computational inefficiency in posterior inference under the phylogenetic probit model-a high degree of correlation between X and C. [8] use a separate Hamiltonian Monte Carlo sampler [11,HMC] to sample C and update the two sets of parameters alternately within a random-scan Gibbs scheme [12].The phylogenetic probit model assumes X to follow a multivariate Gaussian distribution whose covariance matrix incorporates C. By the model's very design, therefore, the values in C influence the strength and direction of the correlations between elements of X.This correlation between the two parameters slows down convergence and mixing of the Gibbs scheme as each update of X or C is strongly influenced by the current value of the other parameter.To address this issue, our present solution utilizes a state-of-theart Markov chain Monte Carlo (MCMC) method called Zigzag-HMC [13].Unlike BPS, this method allows a joint update of X and C through differential operator splitting [13,14] that generalizes the previously proposed split HMC framework based on Hamiltonian splitting [11,15].Zigzag-HMC can further take advantage of the same OðNÞ gradient evaluation strategy developed by [8].
Our sampling scheme greatly improves the mixing of elements in C and thus provides a reliable estimate of the across-trait partial correlation matrix R, the inverse of the correlation matrix normalized to have unit diagonals.The partial correlation between two traits quantifies their conditional dependence that accounts for, and hence removes confounding by, the effects of other traits in the model.Use of partial correlations thus allow us to gain insight into potential causal pathways and help guide further research into underlying biological mechanisms.
We apply our methodology to three real-world examples.First, we re-evaluate the HIV evolution application in [8] and identify HIV-1 gag immune-escape mutations linked with virulence through strong conditional dependence relationships.Our findings closely match with the experimental literature and indicate a general pattern in the immune escape mechanism of HIV.Second, we examine the influenza H1N1 glycosylation pattern across different hosts and detect strong conditional dependencies between glycosylation sites closely related to host switching.Finally, we investigate how floral traits of Aquilegia flower attract different pollinators, for which we generalize the phylogenetic probit model to accommodate a categorical pollinator trait.

Mixed-type trait evolution
We describe biological trait evolution with the phylogenetic multivariate probit model following [8] and extend it to unordered categorical traits as in [1].While we do not consider ordered categorical traits in this work and leave it to future work to support such traits, the mapping of latent variables in this case can also be found in [1].We either know the phylogenetic tree F a priori or infer it from a molecular sequence alignment S [16].In our two largescale HIV and influenza applications (Results section) with available sequence data, we use a continuous-time Markov chain evolutionary model [17] to construct pðSjFÞ and so infer F simultaneously.We refer interested readers to [16] for more details on tree sampling.When investigating the efficiency gain of our method over [8], we utilize a fixed tree for a more direct comparison and also to reduce the overall run-time.For our third application on flower and pollinator co-evolution, we adopt the same fixed tree as in [18].
Consider N taxa on a tree F ¼ ðV; bÞ that is a directed, bifurcating acyclic graph.The node set V of size 2N − 1 contains N tip nodes, N − 2 internal nodes and one root node.The branch lengths b = (b 1 , . .., b 2N−2 ) denote the child-parent distance in real time.We observe P mixedtype traits for each taxon.The trait data Y = {y ij } = (Y cont , Y disc ) partition as Y cont , an N × P cont matrix of continuous traits and Y disc , an N × P disc matrix of discrete ones.We associate with each trait a latent variable x ij 2 R, if the j-th trait is continuous or binary, and a (m j − 1)-dimensional latent vector X ij ¼ fx ij;k g 2 R m j À 1 , if the trait is categorical, where m j denotes the number of categorical classes.Continuous traits y ij can be seen as as latent variables that are directly observed so x ij = y ij .To relate latent variables to observed discrete traits, we assume a threshold model for binary traits and a choice model for traits with more than two classes.For a binary trait y ij , For a categorical trait y ij , the possible classes are fc 1 ; . . .; c m j g with the reference class being c 1 .
We have where x ij;max ¼ maxðx i;j ; . . .; x i;jþm j À 2 Þ.This data augmentation strategy is a common choice to model categorical data [19].
After concatenating all the latent variables, for each node i = 1, . .., 2N − 1 in F we have P lat -dimensional latent variable X i 2 R P lat with P lat ¼ P cont þ P P cont þP disc j¼P cont þ1 ðm j À 1Þ.As a side note, for continuous y ij the corresponding x ij is observed, and so X i is actually a partially latent vector.Since in our applications only a small fraction of y ij is continuous, we omit "partial" to ease the notation.
The latent variables follow a multivariate Brownian diffusion process along F such that X i distributes as a multivariate normal (MVN) where X pa(i) is the parent node value and the P lat × P lat covariance matrix O describes the across-trait association.The intuition behind b i O is that the further away a child node is from its parent node (larger b i ), the bigger difference between their node values.Assuming a conjugate root prior X 2NÀ 1 � N ðμ 0 ; o À 1 ΩÞ with prior mean μ 0 and prior variance ω −1 O, we can analytically integrate out latent variables on all internal nodes.Marginally, then, the N × P lat tip latent variables X have the matrix normal distribution where M = (μ 0 , . .., μ 0 ) T is an N × P lat mean matrix and the across-taxa covariance matrix [20].The diffusion matrix VðF Þ is a function of branch lengths such that its diagonal elements represent the sum of branch lengths from a tip to the root, while the offdiagonal elements are the branch length from the root to the most recent common ancestor of two tips.The augmented likelihood of X and Y factorizes as pðY; X j Υ; Ω; μ 0 ; oÞ ¼ pðY j XÞpðX j Υ; Ω; μ 0 ; oÞ; where p(Y|X) = 1 if X are consistent with Y according to Eqs (1) and (2) and 0 otherwise.Following [8], we decompose O = DCD where C is a P lat × P lat correlation matrix.The diagonal entries of C are all equal to 1, while the off-diagonal entries lie in the range of [−1, 1] and represent the correlations between pairs of latent variables and, hence, their corresponding traits.
The P lat × P lat diagonal matrix D = {σ ii } for i = 1, . .., P lat contains the marginal standard deviation of each latent variable.Importantly, since discrete traits only inform the sign or ordering of their underlying latent variables, certain elements of D must be set as a fixed value to ensure that the model is parameter-identifiable [8].Without loss of generality, we fix σ ii = 1 for σ ii corresponding to discrete traits.For continuous traits, the square of the corresponding element (s 2 ii ) multiplied by a branch length is the marginal variance for the Brownian diffusion process along that branch (Eq 3).In other words, this product reports the amount of trait variation that accumulates along a branch.[8] demonstrate the necessity of this DCD decomposition, which also allows a non-informative prior [21,LKJ] on C. For goodness-of-fit of the phylogenetic probit model we refer interested readers to [8] where the explicit tree modeling leads to a significantly better fit.

A novel inference scheme
We sample from the joint posterior to learn the across-trait correlation C pðC; D; X; F j Y; SÞ / pðY j XÞ � pðX j C; D; where we drop the dependence on hyper-parameters (Υ, μ 0 , ω) to ease notation.We fix μ 0 to be a P lat -dimensional zero vector and ω to be 1.We then specify the priors p(C, D) and pðF Þ as in [8] where pðF Þ is a typical coalescent tree prior on F [22] and p(C, D) = p(C)p(D).We set independent log normal priors on D diagonals that correspond to continuous traits.We assume an LKJ prior on the Cholesky factor of C to ensure that C and O are positive definite and invertible.[8] use a random-scan Gibbs [12] scheme to alternately update X, {C, D} and F from their full conditionals [16].They sample X from an NP lat -dimensional truncated normal distribution with BPS and deploy the standard HMC based on Gaussian momentum [23] to update {C, D}.Instead, we simulate the joint Hamiltonian dynamics on {X, C, D} by combining novel Hamiltonian zigzag dynamics on X [24] and traditional Hamiltonian dynamics on {C, D}.This strategy enables an efficient joint update of the two highly-correlated sets of parameters.The improved efficiency allow us to focus on the across-trait partial correlation matrix R = {r ij }.After collecting the MCMC samples of O, we obtain R by the standard transformation [25]: Since R measures the linear relationship between pairs of variables after controlling for effects of all other variables in the model, R usually lies in a more-constrained space than C and is more difficult for the sampler to effectively explore its posterior distribution.We demonstrate the improved efficiency of our method in inferring R in Results section.In the subsequent sections, we first describe how Zigzag-HMC samples X from a truncated normal and then detail the joint update of {X, C, D}.Zigzag-HMC for truncated multivariate normals.We outline the main ideas behind HMC [11] before describing Zigzag-HMC as a version of HMC based on Hamiltonian zigzag dynamics [13,24].In order to sample a d-dimensional parameter x = (x 1 , . .., x d ) from the target distribution π(x), HMC introduces an auxiliary momentum variable p ¼ ðp 1 ; . . .; p d Þ 2 R d and samples from the product density π(x, p) = π(x)π(p) by numerically discretizing the Hamiltonian dynamics where U(x) = −log π(x) and K(p) = −log π(p) are the potential and kinetic energy.In each HMC iteration, we first draw p from its marginal distribution pðpÞ � N ð0; IÞ, a standard Gaussian and then approximate (8) from time t = 0 to t = τ by L = bτ/�c steps of the leapfrog update with step size � [26]: The end state is a valid Metropolis proposal that one accepts or rejects according to the standard acceptance probability formula [27,28].Zigzag-HMC differs from standard HMC insofar as it posits a Laplace momentum and the velocity v ≔ dx/dt 2 {±1} d depends only on the sign of p and thus remains constant until one of p i 's undergoes a sign change (an "event").To understand how the Hamiltonian zigzag dynamics (10) evolve over time, one must investigate when such events happen.Before moving to the truncated MVN, we first review the event time calculation for a general π(x) following [24].Let τ (k) be the kth event time and (x (τ (0) ), v (τ (0) ), p (τ (0) )) is the initial state at time τ (0) .Between τ (k) and τ (k+1) , x follows a piecewise linear path and the dynamics evolve as and Therefore we can derive the (k + 1)th event time and the dimension causing this event is i* = argmin i t i .At the moment of τ (k+1) , the i*th velocity component flips its sign Then the dynamics continue for the next interval [τ (k+1) , τ (k+2) ).We now consider simulating the Hamiltonian zigzag dynamics for a truncated MVN arising from the phylogenetic probit model.
where μ and S are the mean vector and covariance matrix for the MVN and map(�) is the mapping from the vectorized latent variables x to y as in Eqs ( 1) and (2).In other words, y is the NP-dimensional vectorized discrete data such that x 2 R d for d = NP lat .Since vectorizing the random variables under a matrix normal distribution (3) results in a MVN distribution, we have S = O � Υ where � denotes the Kronecker product.The mean vector μ is N copies of the pre-specified root prior mean vector μ 0 concatenated together.
In the setting of Eq (15), we have rU(x) = S −1 x whenever x 2 {map(x) = y}.Importantly, this structure allows us to simulate the Hamiltonian zigzag dynamics exactly and efficiently [24].We handle the constraint map(x) = y with a technique from [11] where the constraint boundaries embody "hard walls" that the Hamiltonian zigzag dynamics "bounce" against upon impact.To distinguish different types of events, we define gradient events arising from solutions of Eq (13), binary events arising from hitting binary data boundaries and categorical events arising from hitting categorical data boundaries.
We first consider how to find the gradient event time.Starting from a state (x, v, p), by plugging in rU(x) = S −1 x to Eq (13), we can calculate the gradient event time t g by first solving d quadratic equations and then taking the minimum among all positive roots of Eq (16).When π(x) is a truncated MVN arising from the phylogenetic probit model, we exploit the efficient gradient evaluation strategy in [8] to obtain S −1 (x − μ) and S −1 v without the notorious Oðd 3 Þ cost to invert S. In our application, μ is a vector of all zeros since we set the root prior mean μ 0 to be all zero.If there is prior knowledge about μ 0 , we can use another fixed value without increasing the computational cost.
Next, we focus on the binary and categorical events.We partition x into two sets: S bin = {x i : x i is for binary data} and S cat = {x i : x i is for categorical data}.Starting from a state (x, v, p), a binary event happens at time t b when the trajectory first reaches a binary boundary at dimen- Here, we only need to check the dimensions satisfying x i v i < 0, i.e., those for which the trajectory is heading towards the boundary.At time t b , the trajectory bounces against the binary boundary, and so the i b th velocity and momentum element both undergo an instantaneous flip v i b À v i b , p i b À p i b , while other dimensions stay unchanged.Finally, we turn to categorical events.Suppose that a categorical trait y j = c k belongs to one of m possible classes, and x 1 , x 2 , . .., x m−1 the underlying latent variables.Eq (2) specifies the boundary constraints.If k = 1, the m − 1 latent variables must be all negative, which poses the same constraint as if they were for n − 1 binary traits, therefore we can solve the event time using Eq (17).If k > 1, we must check when and which two dimensions first violate the order constraint x k−1 = max(x 1 , . .., x m−1 ) > 0. With the dynamics starting from (x, v, p), the categorical event time t j c is given by for when x i c reaches x k−1 and violates the constraint.To identify i c we only need to check dimensions with v k−1 < v i where the distance c is for a single y j and we need to consider all categorical data to find the actual categorical event time t c ¼ min j t j c .We now present the dynamics simulation with all three event types included, starting from a state (x, v, p) with x 2 {map(x) = y}: 1. Solve t g , t b , t c using Eqs ( 16), ( 17) and ( 18) respectively.
2. Determine the actual (first) event time t = min{t g , t b , t c } and update x and p as in Eqs (11) and ( 12) for a duration of t.
3. Make instantaneous velocity and momentum sign flips according to the rules of the actual event type, then go back to Step 1.
Based on the above discussion, Algorithm 1 describes one iteration of Zigzag-HMC on truncated MVNs where we simulate the Hamiltonian zigzag dynamic for a pre-specified duration t total .For a truncated MVN arising from the phylogenetic probit model, the most computationally expensive step is the gradient evaluation in Line 3, where a matrix-vector multiplication by the precision matrix F = S −1 is involved.A matrix inversion to evaluate F directly is expensive since F = O −1 � Υ −1 and computing Υ −1 has a cost of OðN 3 Þ.We adopt the dynamic programming strategy of [8] to reduce the cost of Line 3 from either We refer interested readers to [8] for details on the dynamic programming strategy.In brief, this strategy avoids explicitly inverting Υ by recursively traversing the tree [20] to obtain N conditional densities that directly translate to the desired gradient φ x .
Algorithm 1 Zigzag-HMC for multivariate truncated normal distributions else if a categorical boundary event happens at i c1 , i c2 then 20: Jointly updating latent variables and across-trait covariance.The N × P lat latent variables and P lat × P lat across-trait covariance are highly correlated with each other, so individual Gibbs updates can be inefficient.The posterior conditional of X is truncated normal and thus allows for the efficient Hamiltonian zigzag simulation.The conditional distribution for covariance components C and D has no such special structure, so we map them to an unconstrained space and deploy Hamiltonian dynamics based on Gaussian momentum.We use a standard mapping of C elements to real numbers [29] that first transforms C to canonical partial correlations (CPC) that fall in [−1, 1] and then apply the Fisher transformation to map CPC to the real line.We then construct the joint update of latent variables and covariance via differential operator splitting [13,14] to approximate the joint dynamics of Laplace-Gauss mixed momenta.
We denote the two concatenated sets of parameters X and {C, D} as x = (x G , x L ) with momenta p = (p G , p L ), where indices G and L refer to Gaussian or Laplace momenta.The joint sampler updates (x G , p G ) first, then (x L , p L ), followed by another update of (x G , p G ).This symmetric splitting ensures that the simulated dynamics is reversible and hence constitutes a valid Metropolis proposal mechanism [13].The LG-STEP function in Algorithm 2 describes the process of simulating the joint dynamics for time duration 2� via the analytical Hamiltonian zigzag dynamics for (x L , p L ) and the approximate leapfrog dynamics ( 9) for (x G , p G ).Because x G and x L can have very different scales, we incorporate a tuning parameter, the step size ratio r, to allow different step sizes for the two dynamics.To approximate a trajectory of the joint dynamics from t = 0 to t = τ, we apply the function LG-STEP m = bτ/2�c times, and accept or reject the end point following the standard acceptance probability formula [27,28].We call this version of HMC based on Laplace-Gauss mixed momenta as LG-HMC and describe one iteration of LG-HMC in Algorithm 2 where the inputs include the joint potential function U(x G , x L ).We use LG-HMC to update {X, C, D} as a Metropolis-within-Gibbs step of our random-scan Gibbs scheme.The overall sampling efficiency strongly depends on m, the step size � and the step size ratio r, so it is preferable to auto-tune all of them.We provide an empirical method to automatically tune r in S1 File.We provide another option utilizing the no-U-turn algorithm to automatically decide the trajectory length m [23] and call the resulting algorithm LG No-U-Turn Sampler (LG-NUTS).We adapt the step size � with primal-dual averaging to achieve an optimal acceptance rate [23].
Algorithm 2 One LG-HMC iteration . Record the initial state 2: end for .Calculate the acceptance probability a, where K G and K L denote the kinetic energy based on Gaussian or Laplace momentum and k�k 1 , k�k 2 are the L 1 and L 2 norm.6:

Results
To illustrate the broad applicability of our method, we detail three real-world applications and discuss the scientific findings.We first apply our method to the HIV virulence application of [8].The improved efficiency allows us to estimate the across-trait partial correlation with adequate effective sample size (ESS) and to reveal the conditional dependence among traits of scientific interest.We use the same HIV data set to demonstrate that LG-HMC and LG-NUTS outperform BPS (Section "Efficiency gain from the new inference scheme"), followed by two more LG-NUTS applications on influenza and Aquilegia flower evolution.We conclude this section with MCMC convergence criteria and timing results.

HIV immune escape
In the HIV evolution application of [8], a main scientific focus lies on the association between HIV-1 immune escape mutations and virulence, the pathogen's ability to cause disease.The human leukocyte antigen (HLA) system is predictive of the disease course as it plays an important role in the immune response against HIV-1.Through its rapid evolution, HIV-1 can acquire mutations that aid in escaping HLA-mediated immune response, but the escape mutations may reduce its fitness and virulence [30,31].[8] identify HLA escape mutations associated with virulence while controlling for the unknown evolutionary history of the viruses.However, [8] interpret their results based on the across-trait correlation C which only informs marginal associations that can remain confounded.Now armed with a more efficient inference method, we direct our attention towards the across-trait partial correlation matrix R.
The data contain N = 535 aligned HIV-1 gag gene sequences collected from 535 patients between 2003 and 2010 in Botswana and South Africa [31].Each sequence is associated with 3 continuous and 21 binary traits.The continuous virulence measurements are replicative capacity (RC), viral load (VL) and cluster of differentiation 4 (CD4) cell count.The binary traits include the existence of HLA-associated escape mutations at 20 different amino acid positions in the gag protein and another trait for the sampling country (Botswana or South Africa).For example, we only detect a negative conditional dependence between RC and CD4.In other words, holding one of CD4 and RC as constant, the other does not affect VL, suggesting that RC increases VL via reducing CD4.The fact that RC is not found to share a strong conditional dependence with VL may be explained by the strong modulatory role of immune system on VL.Only when viruses with higher RC also lead to more immune damage, as reflected in the CD4 count, higher VL may be observed as a consequence of less suppression of viral replication.As such, our findings are in line with the demonstration that viral RC impacts HIV-1 immunopathogenesis independent of VL [32].
The partial correlation also helps to decipher epistatic interactions and how the escape mutations and potential compensatory mutations affect HIV-1 virulence.For example, we find a strong positive partial correlation between T186X and T190X.Studies have shown that T186X is highly associated with reduced VL [33,34] and it requires T190I to partly compensate for this impaired fitness so the virus stays replication competent [35].The negative conditional dependence between T186X and RC and the positive conditional dependence between T190I and RC are consistent with this experimental observation.In contrast, with the strong positive association between T186X and T190, the marginal association fails to identify their opposite effects on RC.Another pair of mutations that potentially shows a similar interaction is H28X and M30X, which have a positive and negative partial correlation with VL, respectively.These mutations have indeed been observed to co-occur in gag epitopes from longitudinally followed-up patients [36].Fig 1B keeps all the other compensatory mutation pairs in Fig 1A such as A146X-I147X and A163X-S165X that find confirmation in experimental studies [37,38].
More generally, when considering the viral trait RC and the infection trait VL, for which their variation are to a considerable extent attributable to viral genetic variation [39], we reveal an intriguing pattern.As in Fig 1C, when two escape mutations impair virulence, and there is a conditional dependence between them, it is always negative.When two mutations have opposing effects on these virulence traits, the conditional dependence between them (if present) is almost always positive, with one exception of the negative effect between V168I and S357X.For example, T186X and I61X both have a negative impact on RC and the negative effect between them suggests that their additive, or even potentially synergistic, impact on RC is inhibited.Moreover, they appear to benefit from a compensatory mutation, T190X, which has been corroborated for the T186X-T190X pair at least as reported above.Also for VL, the conditional dependence between mutations that both have a negative impact on this virulence trait is consistently negative.Several of these individual mutations may benefit from H28X as a compensatory mutation, as indicated by the positive effect between pairs that include this mutation, and as suggested above for H28X-M30X.This illustrates the extent to which escape mutations may have a negative impact on virulence and the need to evolve compensatory mutations to restore it.We note that our analysis is not designed to recover compensatory mutations at great length as we restrict it to a limited set of known escape mutations, while mutations on many other sites may be compensatory.In fact, our analysis suggests that some of the considered mutations may be implicated in immune escape due to their compensatory effect rather than a direct escape benefit.

Efficiency gain from the new inference scheme
We demonstrate that the joint update of latent variables X and the covariance matrix O significantly improve inference efficiency.For this purpose we use the large HIV dataset from Section "HIV immune escape" with N = 535, P disc = 21, P cont = 3, where the efficiency gain becomes significant.Our implementations of the algorithms have been validated on smaller truncated MVNs, on which simple rejection sampling can provide the ground truth up to quantifiable Monte Carlo errors.The Zigzag-HMC implementation has also been validated through the standalone implementation in an R package "htdg" [40].
We consider 4 sampling schemes BPS, Zigzag-HMC, LG-HMC, and LG-NUTS.To enable a more direct comparison while saving computational time, we separate tree inference from the inference for O and X and fix F as the maximum clade credibility tree from the HIV immune escape application.BPS and Zigzag-HMC only update X and we use the standard NUTS transition kernel (i.e. standard HMC combined with no-U-turn algorithm) for the O elements.LG-HMC employs the joint update of X and O described in Section "Jointly updating latent variables and across-trait covariance".LG-NUTS additionally employs the No-U-Turn algorithm to decide the number of steps and a primal-dual averaging algorithm to calibrate the step size.We set the same t total for BPS and Zigzag-HMC for a fair comparison.To tune LG-HMC, we first supply it with an optimal step size � learned by LG-NUTS, then decide the number of steps m = 100 as it gives the best performance among the choices (10, 100, 1000).We conduct 3 independent simulations for each sampling scheme and report the perrun-time ESS for 5 parameters-the across-trait correlation C, partial correlation R, latent variable X, log joint density log p(X, O) and log likelihood l(X, O).C and R are of primary scientific interest as they provide insights into correlation structure among the traits.Examining ESS of the highest dimensional parameter X is also important for diagnostic purposes.ESS's of log p(X, O) and l(X, O) help us additionally evaluate how well the samplers explore the target distribution overall.As reported in Table 1, BPS is outperformed by the three other samplers in terms of efficiency for all five parameters.While a formal theoretical analysis is beyond the scope of this work, we provide an empirical explanation for the different performances of BPS and Zigzag-HMC in S2 File.LG-HMC achieves the highest per run-time ESS for R, resulting in a 5× speed-up compared to BPS.The result also highlights that inferring R is more challenging than inferring C, with the elements of R generally having lower ESS, but the difficulty can be largely eliminated by jointly updating X and O through LG-HMC and LG-NUTS.Although Zigzag-HMC achieves much higher ESS for X than LG-HMC, the latter performs best in the most difficult and critical task of updating R. Compared to LG-HMC, LG-NUTS exhibits lower efficiency and higher variance across the 3 runs, likely due to the No-U-Turn algorithm's tendency to require some extraneous leapfrog steps [41,42].We also provide the histograms for the per run-time ESS of R elements in S1 Fig. Based on our findings, we recommend using LG-HMC with multiple choices of hyper parameters (m, �), with a good starting point being (100, 0.01), or the auto-tuned LG-NUTS.

Glycosylation of Influenza A virus H1N1
Influenza A viruses of the H1N1 subtype currently circulate in birds, humans, and swine [43][44][45], where they are responsible for substantial morbidity and mortality [46,47].The two

Table 1. Efficiency comparison among different sampling schemes (BPS, Zigzag-HMC, LG-HMC, LG-NUTS).
We calculate effective sample size (ESS) per hour runtime for the elements of C, R, X, log joint density log p(X, O), and likelihood l(X, O).For the three multivariate parameters (C, R, X) with dimensions 276, 276, and 11,235, respectively, we report the minimal ESS across all dimensions.We conduct three independent simulations for each method and report the ESS values in the first three rows.We include the mean and standard deviation in the last row for each method to provide a summary of its overall performance.The bold number indicates the highest value in each of the five columns.For BPS, given the larger number of iterations required to achieve convergence, we record one sample of X every 1,000 iterations to comply with storage limitations, and report upper bounds of the actual ESS by multiplying the ESS from thinned samples by 1,000.

ESS/hour
* The ESS estimates after 1/1000 thinning are 0.76, 0.67, 0.10 https://doi.org/10.1371/journal.pcbi.1011419.t001 surface glycoproteins hemagglutinin (HA) and neuraminidase (NA) interact with a cell surface receptor and so their characteristics largely affect virus fitness and transmissibility.Mutations in the HA and NA, particularly in their immunodominant head domain, sometimes produce glycosylations that shield the antigenic sites against detection by host antibodies and so help the virus evade antibody detection [48][49][50][51].On the other hand, glycosylation may interfere with the receptor binding and also be targeted by the innate host immunity to neutralize viruses.Therefore there must be an equilibrium between competing pressures to evade immune detection and maintain virus fitness [52,53].The number of glycosylations that leads to this balance is expected to vary in host species experiencing different strengths of immune selection.Despite decades of tracking IAVs evolution in humans for vaccine strain selection and recent expansions of zoonotic surveillance, the evolvability and selective pressures on the HA and NA have not been rigorously compared across multiple host species.Here, we examine the conditional dependence between host type and multiple glycosylation sites by estimating the posterior distribution of across-trait partial correlation while jointly inferring the IAVs evolutionary history.We use hemagglutinin (H1) and neuraminidase (N1) sequence data sets for influenza A H1N1 produced by Trovão et al. as described in [54].We scan all H1 and N1 sequences to identify potential N-linked glycosylation sites, based on the motif Asn-X-Ser/Thr-X, where X is any amino acid other than proline (Pro) [55].We then set a binary trait for each sequence encoding for the presence or absence of glycosylations at a particular amino acid site.We keep sites with a glycosylation frequency between 20% and 80% for our analysis.This gives six sites in H1 and four sites in N1.We include another binary trait for the host type being mammalian (human or swine) or avian, so the sample sizes are N = 964, P = 7 (H1) and N = 896, P = 5 (N1).
The six H1 glycosylation sites consist of three pairs that are physically close (63/94, 129/163, and 278/289, see Fig 2).Sites 63 and 94 are particularly close to each other, though distances will vary slightly with sequence.A negative conditional dependence suggests glycosylation at two close sites may be harmful for the virus (63/94 and 278/289) while a positive effect between two sites suggests a potential benefit (63/129 and 94/278).We detect a negative conditional dependence between mammalian host and glycosylation site 94 and 289.Avian viruses have a stronger tendency to have site 289 glycosylated (Fig 2).
In N1, glycosylations are more strongly correlated than H1 (Fig 3).Two pairs of glycosylation sites have a positive conditional dependency in between (50/68 and 50/389) and two pairs (44/68 and 68/389) have a negative one.We omit a structural interpretation since all sites but 389 are located in the NA stalk, for which no protein structure is available.There is a positive conditional dependence between mammalian host and glycosylations at sites 44 and 68.None of the avian lineages has glycosylation site 44 while most swine and some human lineages have it.Similarly, glycosylation at site 68 is present in most swine and human lineages but only in avian lineages circulating in wild birds, not those in poultry.

Aquilegia flower and pollinator co-evolution
Reproductive isolation allows two groups of organisms to evolve separately, eventually forming new species.For plants, pollinators play an important role in reproductive isolation [56].We examine the relationship between floral phenotypes and the three main pollinators for the columbine genus Aquilegia: bumblebees, hummingbirds, and hawk moths [18].Here, the pollinator species represents a categorical trait with three classes and we choose bumblebee with the shortest tongue as the reference class.Fig 4 provides the across-trait correlation and partial correlation.Compared to a similar analysis on the same data set that only looks at correlation or marginal association [1], partial correlation controls confounding and indicates the conditional dependencies between pollinators and floral phenotypes that can bring new insights.
For example, we observe a positive marginal association between hawk moth pollinator and spur length but no conditional dependence between them.The marginal association matches with the observation that flowers with long spur length have pollinators with long tongues [18,57].The absence of a conditional dependence makes intuitive sense because hawk moth's long tongue is not likely to stop them from visiting a flower with short spurs when the other floral traits are held constant.In fact, researchers observe that shortening the nectar spurs does not affect hawk moth visitation [58].Similarly, the positive partial correlation between orientation and hawk moth also finds experimental support.The orientation trait is the angle of flower axis relative to gravity, in the range of (0, 180).A small orientation value implies a pendent flower whereas a large value represents a more upright flower [59].Due to their different morphologies, hawk moths prefer upright flowers while hummingbirds tend to visit pendent ones.Making the naturally pendent Aquilegia formosa flowers upright increases hawk moth visitation [59].These results suggest that partial correlation may have predictive power for results from carefully designed experiments with controlled variables.

MCMC setup and convergence assessment
We run all simulations on a node equipped with AMD EPYC 7642 server processors which possess 48 cores and 96 threads, with a base clock speed of 2.3 GHz.For every MCMC run, the minimal effective sample size (ESS) across all dimensions of X and R after burn-in is above 100.As another diagnostic, for our two large-scale applications on HIV-1 and H1N1 influenza, we run three independent chains and confirm the potential scale reduction statistic R for all partial correlation elements falls between [1, 1.03], below the common criterion of 1.1 [60].To reach a minimal ESS = 100 across all R elements, the post burn-in run-time and number of MCMC transition kernels applied for the joint inference are 21 hours and 1.3 × 10 6 (HIV-1), 113 hours and 7.9 × 10 7 (H1), 76 hours and 1.4 × 10 8 (N1).These run-times suggest the difficulty of our large-scale inference tasks where besides the main challenge of sampling {X, C, D}, updating the many tree parameters with Metropolis-Hastings transition kernels also takes a large number of iterations.To reduce the computational burden associated with tree inference, one practical approach is to utilize a set of pre-computed trees and incorporate tree swaps within the MCMC transition kernel.

Discussion
Learning how different biological traits interact with each other from many evolutionarily related taxa is a long-standing problem of scientific interest that sheds light on various aspects of evolution.Towards this goal, we develop a scalable solution that significantly improves inferential efficiency compared to established state-of-the-art approaches [1,8].Our novel strategy enables learning across-trait conditional dependencies that are more informative than the previous marginal association based analyses.This approach provides reliable estimates of across-trait partial correlations for large problems, on which the established BPS-based method struggles.In two large-scale analyses featuring HIV-1 and H1N1 influenza, the improved efficiency allows us to infer conditional dependencies among traits of scientific interest and therefore investigate some of the most important molecular mechanisms underlying the disease.In addition, our approach incorporates automatic tuning, so that the most influential tuning parameters automatically adapt to the specific challenge the target distribution presents.Finally, we extend the phylogenetic probit model to include categorical traits and illustrate its use in examining the co-evolution of Aquilegia flower and pollinators.
We leverage the cutting-edge Zigzag-HMC [13] to tackle the exceedingly difficult computational task of sampling from a high-dimensional truncated normal distribution in the context of the phylogenetic probit model.Zigzag-HMC proves to be more efficient than the previously optimal approach that uses the BPS, especially when combined with differential operator splitting to jointly update two sets of parameters X and O that are highly correlated.The improved efficiency allows us to obtain reliable estimates of the conditional dependencies among traits.In our applications, we find that these conditional dependencies better describe trait interactions than do the marginal associations.It is worth mentioning that another closely related sampler, the Markovian zigzag sampler [61], or MZZ, may also be appropriate for this task but provides lower efficiency than Zigzag-HMC [24].While Zigzag-HMC is a recent and less explored version of HMC, BPS and MZZ are two central methods within the piecewise deterministic Markov process literature that have attracted growing interest in recent years [62,63].Intriguingly, the most expensive step of all three samplers is to obtain the log-density gradient, and the same linear-order gradient evaluation method [8] largely speeds it up.
We now consider limitations of this work and the future directions to which they point.First, the phylogenetic probit model does not currently accommodate a directional effect among traits since it only describes pairwise and symmetric correlations.However, the real biological processes are often not symmetric but directional, where it is common that one reaction may trigger another but not the opposite way.A model allowing directed paths is preferable since it better describes the complicated causal network among multiple traits.Graphical models with directed edges [64] are commonly used to learn molecular pathways [65,66], but challenges remain to integrate these methods with a large and randomly distributed phylogenetic tree.Toward this goal, one may construct a continuous-time Markov chain to describe how discrete traits evolve [67,68], but with P binary traits the transition rate matrix grows to the astronomical size 2 P .Second, though our method achieves the current best inference efficiency under the phylogenetic probit model, there is still room for improvement.In the influenza glycosylation example, we use a binary trait indicating the host being either avian or mammal (human or swine), instead of setting a categorical trait for host type.In fact, we choose not to use a three-class host type trait because it causes poor mixing for the partial correlation elements.We suspect two potential reasons for this.First, according to our model assumptions for categorical traits (Eq 2), the latent variables underneath the same trait are very negatively correlated, leading to a more correlated and challenging posterior.Second, in our specific data sets, the glycosylation sites tend to be similar in human and swine viruses, further increasing the correlation among posterior dimensions.One potential solution is to de-correlate some latent variables by grouping them into independent factors using phylogenetic factor analysis [69,70].Finally, one may consider a logistic or softmax function to map latent variables to the probability of a discrete trait.This avoids the hard truncations in the probit model but also adds another layer of noise.It requires substantial effort to develop an approach that overcomes the above limitations while supporting efficient inference at the scale of applications in this work.
Fig 1 depicts across-trait correlations and partial correlations with posterior medians > 0.2 (or < −0.2).Compared to correlations (Fig 1A), we observe more partial correlations with greater magnitude (Fig 1B).They indicate conditional dependencies among traits after removing effects from other variables in the model, helping to explore the causal pathway.

Fig 1 .
Fig 1. (A) Across-trait correlation and (B) partial correlation with a posterior median > 0.2 or < −0.2 (in color).HIV gag mutation names start with the wild type amino acid state, followed by the amino acid site number according to the HXB2 reference genome and end with the amino acid as a result of the mutation ('X' means a deletion).Country = sample region: 1 = South Africa, -1 = Botswana; RC = replicative capacity; VL = viral load; CD4 = CD4 cell count.(C) Conditional dependencies between HIV-1 immune escape mutations that affect RC or VL.Node and edge color indicates whether the dependence is positive (orange) or negative (blue).https://doi.org/10.1371/journal.pcbi.1011419.g001

Fig 2 .
Fig 2. (A) Across-trait partial correlation among H1 glycosylation sites and host type with a posterior median > 0.2 or < −0.2 (in color and number).(B) HA structure of a 2009 H1N1 influenza virus (PDB entry 3LZG) with six glycosylation sites highlighted.Site 278 and 289 are in the stalk domain and all others are in the head domain.(C) The maximum clade credibility (MCC) tree with branches colored by the posterior median of the latent variable underlying H1 glycosylation site 289.The heatmap on the right indicates the host type of each taxon.https://doi.org/10.1371/journal.pcbi.1011419.g002