Serosolver: an open source tool to infer epidemiological and immunological dynamics from serological data

We present a flexible, open source R package designed to obtain additional biological and epidemiological insights from commonly available serological datasets. Analysis of serological responses against pathogens with multiple strains such as influenza pose a specific statistical challenge because observed antibody responses measured in serological assays depend both on unobserved prior infections and the resulting cross-reactive antibody dynamics that these infections generate. We provide a general modelling framework to jointly infer these two typically confounded biological processes using antibody titres against current and historical strains. We do this by linking latent infection dynamics with a mechanistic model of antibody dynamics that generates expected antibody titres over time. This makes it possible to use observations of antibodies in serological assays to infer an individual’s infection history as well as the parameters of the antibody process model. Our aim is to provide a flexible inference package that can be applied to a range of datasets studying different viruses over different timescales. We present two case studies to illustrate how our model can infer key immunological parameters, such as antibody titre boosting, waning and cross-reaction, and well as latent epidemiological processes such as attack rates and age-stratified infection risk.

First we outline how this expression is flexibly implemented in the serosolver possible test strains. Following previous work [27], the expected log titre individual i 89 has against the strain that circulated at time j when observed at time k is defined as a 90 linear combination of the contribution of antibody responses from each prior infection: 91 The model components are defined by: 92 1. Long-term boosting. This is defined by a parameter µ 1 , equivalent to the expected 93 persistent rise in titre against a homologous strain following primary infection.  3. Long-term cross-reactive antibody response from related strains. We assume the 99 level of cross-reaction between a test strain j and infecting strain m ∈ Z i 100 decreases linearly with antigenic distance (see Data section below for definition). 101 The cross-reaction function is d 1 (j, m) = max{0, 1 − σ 1 δ m,j }, where δ m,j is 102 antigenic distance between strains j and m, and σ 1 is a fitted parameter. to the package code. 119 Antigenic distance 120 The antibody process model described in Equation 2 includes parameters which 121 describe short-and long-term cross-reactive antibody processes. These processes depend 122 on a metric of antigenic distance between each pair of strains [23]. In the model, the 123 antigenic distance δ m,j between strains m and j is therefore defined by a matrix of 124 pairwise distances. Serosolver can accommodate antigenically varying strains (all δ m,j 125 are specified) or a single homologous strain (all δ m,j = 0). The extent to which strains 126 are antigenically distinct or similar can be described using the distance matrix.

127
Observation model

128
The expected titre X i,j,k defined in Equation 2 feeds into the observation model, which 129 converts the continuously valued model predicted titre into a discrete observed titre.

130
The distribution of the observed titre consists of a normally distributed random variable 131 g(s) with mean X i,j,k and variance ε, which is then censored to account for 132 integer-valued log titres in the assay. Hence the probability of observing an empirical 133 titre at time k within the limits of a particular assay Y i,j,k ∈ {0, ..., Y max } given August 8, 2019 6/30 Serosolver includes an additional option to include strain-specific measurement bias, 136 which may arise through strain-specific differences in assay reactivity [26,[30][31][32].

137
Specifically, an additional observation error is added to the predicted log antibody titres; 138 this measurement error can be different for each individuals strain or can be specified 139 for a group (or cluster) of strains. The predicted titre X i,j,k taking into account 140 strain-specific measurement bias is given as: Where χ j is the measurement offset for strain j. The hierarchical form of the 142 measurement bias term may also be specified by the user: χ j may be estimated as an 143 independent parameter for each j; may be assumed to come from a hierarchical 144 distribution χ j ∼ N (χ, σ 2 χ ); and may be fixed for particular strains/groups e.g., fixing 145 is given by, Each infection event (Z i,j ) is the outcome of a single Bernoulli trial, with probability 155

171
Application to influenza A/H3N2

172
The initial development of serosolver focused on influenza A/H3N2, which has 173 circulated in human populations since 1968 and has undergone substantial antigenic 174 evolution over this time [23,[32][33][34]. Figure 1 illustrates how our analytical approach 175 applies to influenza A/H3N2. In this case, we make the assumption that the antigenic 176 distance between strains can be described by a two-dimensional distance, with strains An individual is infected with the orange virus, which results in boosting and waning of homologous antibody titres. In parallel, antibodies that cross react with viruses at different points in antigenic space also boost and wane (green and blue viruses). The individual is later infected by the green virus, which leads to further boosting and waning of antibodies. Bottom panel: HI titres measured from serum samples taken at different times capture different parts of the homologous and cross reactive antibody kinetics. Different sampling strategies will represent different subsets of these measurements e.g., a cross-sectional study might inform a single subplot, whereas a longitudinal study might inform just the orange bars from each of the three subplots. Clearly a sampling strategy with multiple serum samples and many viruses tested per sample will provide the most information. structures that are not well characterised by standard prior assumptions (i.e., highly 216 dispersed attack rates and variation in individual-level susceptibility) [35]. We therefore 217 provide the user with flexibility in the assumed infection history and attack rate priors, 218 with different prior assumptions each bringing their own biases and rationale.
Serosolver includes four infection history prior options. We summarise these priors in 220 the main text, though an extensive discussion is provided in Supplementary Material 1. 221 Hyper-prior on the probability of infection over time, version 1: Under 222 this prior, the probability of infection is given by φ j . The infection generating process is: 223 where f is a user specified function describing the prior distribution on φ, P (φ j ). By 224 default, f is the uniform distribution, φ j ∼ unif (0, 1), though it may be set to process is: The probability of infection in a given time period is independent of other time  Rather, a prior is placed on the total number of infections that an individual is expected 242 to experience over the course of their life. This is the prior used in our previous The prior on the per-capita attack rate across all individuals therefore follows a This assumption places a beta-binomial prior on both the number of infections at a  Furthermore, although much of the development of this package came from analysis of 293 influenza A/H3N2 dynamics, these concepts and inputs are easily adaptable to antigenically stable pathogens by specifying the input antigenic map.

295
The package work flow is divided into a number of distinct stages, which handle the 296 data and parameter inputs, simulation, inference, posterior diagnostics, and analysis 297 (Fig 2) We developed the package to rely on only a few function calls for each of these 298 stages, but with ample room for customisation and flexibility at each stage.

299
To set up the model, users only need to provide: a data frame describing the model 300 parameters (they can also change a flag to fix or estimate any of the parameters); a chains may be run in parallel locally, but we note that much of our own work with 321 serosolver is done using a high performance cluster. Inputs and outputs for the serosolver R package. The package has two sets of inputs to define data and parameters. These feed into the process model that can either be used to simulate data by itself, or combined with observed data and MCMC to obtain three posterior outputs: individual-level infection histories, population probability of infection, and biological parameters. Once these posteriors have been obtained, serosolver can run MCMC diagnostics and plot key immunological and epidemiological processes

323
We present two case studies to highlight the range of insights that serosolver can  status ( Fig 3A) and age (Fig 3B). Finally, we can estimate biological parameters 344 shaping the short-term antibody response (Fig 3C). 345 We were able to estimate quarterly exposure rates, which could include either  The second case study considers cross-sectional serological samples collected in southern 376 China in 2009, which were tested against nine historical influenza A/H3N2 strains that 377 circulated between 1968 and 2008 [29,42]. Serosolver can be used to reconstruct suggesting either agreement between the data and prior or a lack of information in the 385 data. We also identified clear age-specific patterns of infection. Fig 5D shows the experience antibody boosting, less frequently as they get older [27]. Fig 5E shows  susceptibility and titre at time of infection [43]. These methods could also apply to 430 other pathogens; a similar model structure has recently been used to examine latent 431 titres for dengue [44]. interest.

448
At present, serosolver focuses on inference for a single exposure type. However, for 449 viruses like influenza and dengue, individuals may be exposed to multiple subtypes or 450 serotypes in the same season. Exposure to one antigen may cross react with another 451 antigen providing protection against antigens an individual has not been directly individual [47]. In its current form, serosolver can estimate differences between 458 exposures by being fit independently to different subtypes. It can also fit models 459 separately to vaccinated or unvaccinated populations to estimate how serological 460 dynamics vary between these groups. Although this is a useful first approximation, 461 future versions of serosolver will include potential for multiple exposure types during 462 the same season so that any interactions can be modelled explicitly.

463
There is increasing evidence that serological titre data contain substantial additional 464 information about infection and immunity dynamics, which are not captured by simple 465 four-fold rise metrics [14,44,48,49] Furthermore, in multi-strain pathogen systems, 466 evidence is mounting that individual-level heterogeneity in unobserved exposure 467 histories is a key driver of susceptibility to infection and disease [26,47,50,51].

468
Serosolver provides a generic framework to extract this information from commonly 469 collected data. As serological data become increasingly available, it will be important to 470 develop modern analytical methods and tools that account for known biological and 471 epidemiological processes that may confound or bias inference [49,[52][53][54].