Estimation of temporal covariances in pathogen dynamics using Bayesian multivariate autoregressive models

It is well recognised that animal and plant pathogens form complex ecological communities of interacting organisms within their hosts. Although community ecology approaches have been applied to determine pathogen interactions at the within-host scale, methodologies enabling robust inference of the epidemiological impact of pathogen interactions are lacking. Here we developed a novel statistical framework to identify statistical covariances from the infection time-series of multiple pathogens simultaneously. Our framework extends Bayesian multivariate disease mapping models to analyse multivariate time series data by accounting for within- and between-year dependencies in infection risk and incorporating a between-pathogen covariance matrix which we estimate. Importantly, our approach accounts for possible confounding drivers of temporal patterns in pathogen infection frequencies, enabling robust inference of pathogen-pathogen interactions. We illustrate the validity of our statistical framework using simulated data and applied it to diagnostic data available for five respiratory viruses co-circulating in a major urban population between 2005 and 2013: adenovirus, human coronavirus, human metapneumovirus, influenza B virus and respiratory syncytial virus. We found positive and negative covariances indicative of epidemiological interactions among specific virus pairs. This statistical framework enables a community ecology perspective to be applied to infectious disease epidemiology with important utility for public health planning and preparedness.


Introduction
Animals and plants are exposed to a diverse community of pathogenic organisms that co-circulate in time and space. When multiple pathogens infect the same tissue, they effectively share an ecological niche that provides the opportunity for interspecific interactions [1,2,3]. It is known that pathogen interactions may alter the within-host dynamics of infection with consequences for the population transmission of some common infections. Interactions among microorganisms include the promoting or inhibiting effects of gut microbiota on invading pathogenic bacteria in the gastrointestinal tract [4]; the enhanced carriage of pneumococcal bacteria following influenza infection in the respiratory tract [5]; and immune-driven enhancement of Zika virus infection following Dengue virus exposure [6]. The complex ecology of pathogen communities therefore has potentially important implications for the control of infectious diseases and public health preparedness.
While some interactions between pathogens have been suggested by anecdotal epidemiological observations, statistical support for their occurrence and impact on the pathogen population dynamics is lacking. This is owing in part to a paucity of appropriate long-term time series data for many disease systems that describe infection frequencies for multiple pathogens simultaneously. Moreover, although several statistical methods are available for multivariately analysing health-related time series data (including cross-correlation, generalized linear models, wavelet decomposition and spectral analyses [7,8,9,10]) retrospectively inferring non-independent patterns among multiple time series has not been the primary focus.
Bayesian disease mapping models represent a class of regression model that has received much attention in recent years for the analysis of spatial distributions of incidence data routinely collected by public health bodies [11,12]. These models are typically applied to incidence data to estimate spatial patterns of disease risk over a geographical region with several models proposed to capture spatial autocorrelations [9] using conditional autoregressive priors [13,14]. While extensions to disease mapping models have been made to include temporal patterns [15,13] and space-time interactions [16,17], most disease mapping applications focus on spatial structures [18] with temporal dependencies in disease incidence often being overlooked [19,20]. However, multivariate forms of disease mapping models provide a suitable framework for estimating temporal dependencies between pathogens as they naturally incorporate a between-disease (or pathogen) covariance matrix [21].
Modelling multiple pathogens simultaneously allows assessment of specific patterns and non-independencies of infections risk among different pathogens. By estimating the between-pathogen covariance matrix, we aimed to develop a statistical tool that readily enables the joint estimation of pathogen dependencies on the temporal dimension and distinguishes simple correlations from genuine  [22]. Sampling date, patient age, patient gender, the sample origin (hospital or general practice submission that we used as a proxy for infection severity) were recorded. Multiple samples from the same patient received within a 30-day period were aggregated into a single episode of respiratory illness resulting in 28,647 patient episodes. A patient was considered virus-positive during an episode if at least one clinical sample was positive during the 30-day window. We refer the reader to Nickbakhsh et al. [23] for a full description of these data.
Whilst data are available at the individual level, we are predominantly interested in estimating non-independent patterns in temporal patterns between the five viruses at the population level. Therefore, for each virus, data were aggregated into monthly infection counts across the time frame of this study.
From the 28,647 patient episodes, 4,759 were positive to at least one virus group and detection was most common in children aged between 1 and 5 years [23]. Detection of any virus in a given clinical sample was most common in December and least common in August. We observed differing patterns between the five viruses (Fig 1, black lines). IBV, RSV and CoV were more prevalent in winter, AdV was generally less common with a slight increase prevalence in spring and MPV shifts from winter peaks to summer peaks after 2010 [23].
Relative risks identify time points where observed counts are higher or lower than expected, with expected counts accounting for expected seasonality and risk factors associated with respiratory infection [23]. Conventionally, relative risk would measure the risk of infection, if exposed to a risk factor, relative to the risk if unexposed. In this context, relative risk measures the risk of infection of a virus relative to an estimated expected risk. Thus the relative risks measure the excess risk of viral infection that cannot be explained by seasonality or patient demographics. Therefore, by inferring dependencies between viral species in terms of excess risks, we can directly infer viral interactions. We provide a full description of the estimated expected risks in the expected count section.  Multivariate Spatio-temporal model Conditional autoregressive models are extensively used in the analysis of spatial data to model the relative risk of a virus or more generally a disease [24,25]. The class of Bayesian model typically used in this context is given by where Y i , E i and RR i are the observed count, expected count, derived from available patient demographic data (refer to expected counts section), and relative risk for some index i (for example, location or time interval) [14] and φ = {φ 1 , . . . , φ I } spatial random effects modelled jointly through a conditional autoregressive (CAR) distribution [26] φ ∼ MVN(0, (τ (D − λW )) −1 ).
Matrix W is a proximity matrix, λ a smoothing parameter, τ a measure of precision and D a diagonal Extending this model to multiple viruses, or more generally multiple pathogens, then where Y iv , E iv and RR iv are the observed count, expected count and relative risk of virus v and α v a virus specific intercept term. A multivariate CAR (MCAR) distribution can jointly model φ by incorporating a between virus covariance matrix Temporal autocorrelations may be induced in this model, at time point j, through the conditional The parameter s controls the level of temporal autocorrelation such that s = 0 implies no autocorrelation whereas s = 1 is equivalent to a first order random walk [16]. Typically, where temporal autocorrelations are modelled through the conditional expectation, spatial autocorrelations are modelled through the precision matrix [16].

Modelling multivariate time series data
We aim to model monthly time series count data from multiple viruses simultaneously over a nine year period. We index over monthly time intervals and so monthly autocorrelations are modelled in terms of the precision matrix and yearly autocorrelations are modelled in terms of the conditional expectation in a similar fashion to the multivariate spatial-temporal model detailed above. The observed count of virus v in month m of year t, Y mtv is modelled in terms of the expected count E mtv and relative risk RR mtv : with α v an intercept term specific to virus v and φ .t. = {φ .t1 , . . . , φ .tV } a vector of random effects modelled conditionally through a MCAR prior This parameterisation of a MCAR model captures both the seasonal trends of each virus via Ω and long-term temporal trends via s 1 , . . . , s V . The conditional expectation of φ .t. depends on the previous year φ .t−1. , capturing long term temporal tends. By allowing dependencies between neighbouring months, we account for seasonality in viral infection frequencies.

Inferring viral interactions
We focus primarily on the estimation of covariance matrix Λ −1 in order to infer potential temporal dependencies. By formally testing which off-diagonal entries of Λ −1 are significantly different from zero, we can explicitly provide statistical support for viral interactions.

MCAR prior specification
The covariance structure of the MCAR distribution used to model random seasonal-temporal effects is the Kronecker product of precision matrices Ω and Λ.
The between-virus precision matrix Λ accounts for dependencies between viral relative risks in terms of monthly trends. Wishart priors can be used for unstructured precision matrices such as Λ [29], however, we employed a modified Cholesky decomposition to estimate covariance matrix Λ −1 : where Σ was a diagonal matrix with elements proportional to viral standard deviations and Γ a lower triangular matrix relating to viral correlations [30]. This parameterisation ensured the positive-definiteness of Λ −1 , although we note that other parameterisations are available [31].
Matrix Ω captures seasonal trends in terms of monthly dependencies defined through a proximity matrix W . We will consider two possible constructions of W .

Neighbourhood structure
Assuming neighbouring months are more similar than distance months, W can be defined such that w ij = 1 if months i and j are neighbouring months and w ij = 0 if months i and j are not neighbouring months. In this paper, neighbours were fixed as the previous and subsequent three months. Taking a neighbourhood approach, we set where λ is a smoothing parameter and D a 12 × 12 diagonal matrix with D i = j w neighij . The total number of nearest neighbours of month i [27,32].

Autoregressive structure
Under this construction, W was defined through an autogressive process and the corresponding matrix denoted by W auto . We set the ijth entry of W auto (i = j) to be ρ dij with d ij the distance between months i and j and ρ a temporal correlation parameter satisfying ρ < 1. We defined distance as the number of months between i and j.
Taking an autoregressive approach, we set with D a diagonal matrix with D i = j w autoij . We note that these formulations can easily be extended to other MCAR structures [27,33].

Expected counts
We

Full model
Let Y mtv denote the observed count of virus v during the mth month of year t, V the total number of viruses and T the number of years. We modelled Y mtv in terms of the expected count E mtv and relative risk RR mtv and used a Cholesky decomposition to estimate covariance matrix Λ −1 [34] (Fig 2).
This model was implemented in jags [35] using the R2jags package [36] in R [37]. All results are averaged across five independent chains. In each chain, we took 50,000 thinned draws across 500,000 iterations after a burn-in period of 300,000 iterations. Our R code used to model these data and an example of simulated are available upon request to the first author. We note that the multivariate intrinsic Gaussian CAR prior distribution is fully specified in GeoBUGS [38]. However, our approach allows for other parameterisations of the MCAR distribution providing more flexibility in separating monthly and yearly temporal dependencies.

Simulation Studies
The specific aim of this paper was to estimate the between-virus covariance matrix Λ −1 . We show the validity of our proposed model (Fig 2) in modelling multivariate time series data through a three-virus simulation and that this model can accurately and precisely estimate Λ −1 through a five-virus simulation.

Three-virus example
We first present an analysis of simulated data for three viruses; virus 1, virus 2 and virus 3. Seasonal effects were assigned such that virus 1 peaked in winter, virus 2 peaked in summer and virus 3 had no seasonal pattern (Fig 3). Virus 1 and virus 2 were negatively correlated (Λ 12 = −0.5) and both viruses were independent of virus 3 (Λ 13 = Λ 23 = 0).
The probabilities and expected counts of each virus in each month were estimated using the method described in Expected counts. Individual level data were simulated in order to reflect the virological diagnostic data. We simulated 200 samples per month per virus over a four year period. For each sample, an age, sex and severity were drawn from the observed virological diagnostic data distributions [23].
Regression coefficients used to estimate the probability of each virus were drawn such that β intercept = 0, Random effects φ were drawn from multivariate normal distributions with yearly smoothing parameters and monthly smoothing parameter (s 1 , s 2 , s 3 and λ) set at 0.5. Seasonal dependencies were simulated through neighbourhood matrix W neigh defined in MCAR prior specification. We set virus specific intercept terms α 1 = α 2 = α 3 = 0 and calculated the relative risks of each virus at each time point.
Finally, observed counts were taken as the product of relative risks and expected counts (example shown in Fig 3c, black lines).
We fitted both models (Fig 2, neighbourhood and autoregressive structure) to data simulated only through the neighbourhood structure and estimated higher posterior density intervals (HPDI) for each covariance parameter (Λ 12 ,Λ 13 andΛ 23 ). Posterior probabilities were then estimated to assess the probability of zero being included in each interval, synonymous to Bayesian p-values defined in terms of lower tail posterior probabilities [39,40]. Covariance parameters with a posterior probability less than 0.05 were deemed different from zero [39]. In order to control for multiple comparisons, covariance parameters with an adjusted probability, controlling the false discovery rate [39,41], less than 0.05 were deemed different from zero and used as support for a significant covariance between the corresponding viruses.
Generally, significant negative covariances were estimated between virus 1 and virus 2 and non-significant covariances between virus 1 and virus 3 and between virus 2 and virus 3 using both the neighbourhood and autoregressive constructions.
Under the neighbourhood structure, the probability of correctly identifying a non-zero covariance was 0.98 ( Reducing the covariance between virus 1 and virus 2 (Λ 12 = −0.25) decreased the power from 0.94 (under the neighbourhood structure with the multiple comparisons correction) to 0.64 and a further reduction to Λ 12 = −0.1 reduced the corresponding power to 0.22. Therefore, as expected, this test has high power to detect strong associations but lacks the ability to detect an association between viruses with a small dependency.
This simulation study illustrates our method can powerfully and accurately estimate the relative risk of each virus across the time frame of the simulation whilst controlling the familywise error rate and is an important new tool in modelling multivariate time series data.

Five-virus example
To test the accuracy of the full model in estimating the between virus covariance matrix, we simulated data from five viruses over a 15 year period (Fig 4) with Since temporal variances within each virus were set to 1 (diagonal entries of Λ * −1), we refer to virus covariances within this section as correlations. Virus 1 and virus 2 were negatively correlated (such that when the prevalence of one virus increases, the prevalence of the other decreases), virus 4 and virus 5 were positively correlated and virus 3 circulated independently.
Data from the first three viruses were simulated under identical conditions to those described in the three-virus example. Seasonal effects were assigned such that virus 4 and virus 5 peaked in autumn (Fig 4, observed counts). The probabilities and expected counts of virus 4 and virus 5 in each month were estimated using the method described in Expected counts. Likewise, regression coefficients were drawn such that β intercept = 0, β age ∼ N (0, 0.1), β gender ∼ N (0, 0.1) and β severity ∼ N (0, 0.1) . Random effects φ were drawn from multivariate normal distributions with yearly smoothing parameters and monthly smoothing parameter (s 1 , s 2 , s 3 , s 4 , s 5 and λ) set as 0.5. Seasonal dependencies were simulated through neighbourhood matrix W neigh defined in MCAR prior specification. We set virus specific intercept terms α 1 = α 2 = α 3 = α 4 = α 5 = 0 and calculated the relative risks of each virus at each time point. Finally, observed counts were taken as the product of relative risks and expected counts (Fig 4, grey lines).
Using the full model with the neighbourhood construction, we initially estimated Λ −1 using only the first year of data ( Fig 5, Year 1). We then combined the first two years of data to estimated Λ −1 (Fig 5, Year 2). This process was repeated, adding the current year's data at each iteration, until all fifteen years of data were combined to estimate Λ −1 .  from year 9 onwards. We found very similar results using W auto (results not shown).
This example shows this method can precisely and accurately estimate the between virus covariance matrix given long term time series data.

Correlation Estimate
Year 1 Year 2 Year 3 Year 4 Year 5 Year 6 Year 7 Year 8 Year 9 Year 10 Year 11 Year 12 Year 13 Year 14 Year 15

Application to virological diagnostic data
We applied our model (Fig 2) to monthly infection count data from five respiratory viruses (AdV, Cov, MPV, IBV and RSV) across nine years in order to infer interactions between these viruses.
For comparison, we applied the full model assuming all five viruses to be independent by setting Under the neighbourhood structure, we found that allowing dependencies between viruses (Λ −1 = I 5 ) provided a better fit to these data (DIC=2795.6 compared to DIC=3583.8). However, the autoregressive structure with Λ −1 = I 5 minimised DIC (DIC=2686.4).
Estimating RR and Λ −1 Expected counts were estimated for each virus as described in Expected counts (Fig 1, grey lines). The

Discussion
Humans, animals and plants are exposed to a plethora of cocirculating pathogens, which creates frequent opportunity for ecological interactions between them. We adapted Bayesian multivariate disease mapping methods to provide a powerful statistical tool for inferring pathogen-pathogen interactions from diagnostic and/or surveillance time series data. Whilst standard multivariate disease mapping frameworks investigate the joint spatial distribution of multiple diseases coinfecting a population simultaneously, our method instead analyses the joint temporal distribution of multiple infections. Because multivariate disease mapping naturally incorporates a between-disease covariance matrix, these methods conveniently lend themselves to the inference of temporal signatures of pathogen-pathogen interactions when adapted to analyse temporal dependencies.
By applying our framework to extensive diagnostic data accrued over a nine-year period from a well defined patient population, our analysis provides evidence of epidemiological interactions among respiratory viruses. This is particularly important because acute respiratory infections are the most common cause of illness and mortality and are primarily attributed to a group of viruses that occupy a shared ecological niche in the respiratory tract. Although observational data [42,43,44,45] and univariate regression models [46,44,47,48] indicate the potential for interactions among these common pathogens, limited evidence exists of their impact on epidemiological infection dynamics.
A recent study used wavelet decomposition to infer temporal correlations in the time series of a set of respiratory viruses [?]. However, as those statistical methods do not enable different causal hypotheses to be tested, the authors were unable to determine whether pathogen-pathogen interactions were the consequence of cross-immunity or of other drivers of seasonal correlation. Under the autoregressive structure which provided a better fit to these data, our analysis provides robust evidence of a positive covariance between RSV and MPV and a negative covariance between IBV and AdV.
In summary, we adapted a multivariate disease mapping framework to retrospectively infer pathogen-pathogen interactions in a statistically robust fashion. Applying this approach to time series data of pathogens that co-circulate in a given population will lead to a better understanding of the joint epidemiological dynamics of diseases. It is anticipated that these inferences will enhance the understanding of linked pathogen dynamics and inform forecasting of disease incidence and improve public health preparedness. In addition, they will result in better ways to evaluate the impact of public health interventions thus aiding the design of measures to control infectious diseases.