Estimating fine age structure and time trends in human contact patterns from coarse contact data: The Bayesian rate consistency model

Since the emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), large-scale social contact surveys are now longitudinally measuring the fundamental changes in human interactions in the face of the pandemic and non-pharmaceutical interventions. Here, we present a model-based Bayesian approach that can reconstruct contact patterns at 1-year resolution even when the age of the contacts is reported coarsely by 5 or 10-year age bands. This innovation is rooted in population-level consistency constraints in how contacts between groups must add up, which prompts us to call the approach presented here the Bayesian rate consistency model. The model can also quantify time trends and adjust for reporting fatigue emerging in longitudinal surveys through the use of computationally efficient Hilbert Space Gaussian process priors. We illustrate estimation accuracy on simulated data as well as social contact data from Europe and Africa for which the exact age of contacts is reported, and then apply the model to social contact data with coarse information on the age of contacts that were collected in Germany during the COVID-19 pandemic from April to June 2020 across five longitudinal survey waves. We estimate the fine age structure in social contacts during the early stages of the pandemic and demonstrate that social contact intensities rebounded in an age-structured, non-homogeneous manner. The Bayesian rate consistency model provides a model-based, non-parametric, computationally tractable approach for estimating the fine structure and longitudinal trends in social contacts and is applicable to contemporary survey data with coarsely reported age of contacts as long as the exact age of survey participants is reported.


Introduction
The transmission of human respiratory diseases such as influenza, tuberculosis, and COVID-19 is directly driven by the rate of close social contact between individuals. Social contact studies such as the pivotal POLYMOD study [1] have been widely acknowledged as an effective method of obtaining social contact estimates to assess infection risk and to parameterise mathematical infectious disease models [2][3][4]. Consequently, they provide critical epidemiological insights which inform the implementation and evaluation of non-pharmaceutical interventions [5] as well as public health policies such as vaccination schedules [6].
Since the outbreak of COVID-19, longitudinal social contact surveys have been conducted in Europe and around the world, providing indispensable information on the evolving patterns of human mixing behaviour during the pandemic [7,8]. In Germany, the COVIMOD study collected social contact data for nearly two years, and initial analyses [9] focused on data from April to June 2020 during the first partial lockdown in Germany to quantify the scale of social contact reductions relative to pre-pandemic contact patterns. It is important to estimate high-resolution age-and gender-specific human contact patterns and their trends over time, but this has remained an unsolved challenge due to changes in survey methodology from prepandemic studies [1,9] and limitations in available inference methods [2,10,11].
First, COVIMOD and other COVID-era social contact studies record the age of contacts by large age bands of 5 to 10 years, reflecting that often it is difficult for study participants to know the exact age of their contacts. In contrast, the pre-pandemic POLYMOD surveys primarily collected data on the exact age of contacts. Subsequently developed methods, including thin-plate regressions [2] and Gaussian Markov Random Field model approaches [10] relied on such data to estimate high-resolution contact patterns, and these methods are no longer directly applicable. Most studies have thus resorted to the convenient and speedy bootstrap approach implemented in the socialmixr library [11], although it can only provide contact estimates in the same coarse age bands that the data are reported in, which mask the full structure of the actual contact patterns [12].
Second, most COVID-era studies adopted retrospective web-based survey protocols and conducted longitudinal repeat surveys [7,[13][14][15]. The survey waves are typically inter-dependent because a number of participants were surveyed in multiple waves, and additional participants were recruited to replenish the cohort size. While this approach provides valuable longitudinal data, it also introduces the issue of reporting fatigue, where participants tend to report fewer contacts in subsequent participation due to becoming tired of filling out the survey. It follows that directly applying existing methods, which do not incorporate adjustments to counter the confounding reporting fatigue effects, is bound to lead to incorrect estimates. Additionally, participants sometimes found it difficult to recall specific age and gender information for all of their contacts. Instead, they were allowed to report an estimate for the total number of contacts on that occasion [9], which again may result in under-ascertainment of contact intensities if these data are not accounted for in inference approaches of contact patterns.
This work presents a non-parametric Bayesian model to infer age-and gender-specific contact patterns and time trends at high 1-year resolution from longitudinal survey data. The primary innovation of the model is the ability to infer contact patterns by 1-year age bands even when the age contacts are reported in broad age bands. We call this model-based approach the Bayesian rate consistency model for reasons that will be clear soon. In addition, we use recently developed Hilbert Space Gaussian Process approximations [16] to gain substantial advances in computational efficiency, enabling us to make full Bayesian inferences over time and uncover the time trends in social contact structure. We demonstrate that it is crucial to model contact patterns over time to account for reporting fatigue effects in inter-dependent longitudinal survey waves. The primary purpose behind developing the Bayesian rate consistency model is its application to contemporary COVID-19-era survey data, which we present for data spanning the first five survey waves of the COVIMOD study in Germany. We present high-resolution estimates of age-and gender-specific social contacts for each survey wave and describe their time evolution. We also place the inferred contact dynamics into a pre-pandemic context and quantify the differences in contact intensity change by the age of contacts.

Ethics statement
COVIMOD was approved by the ethics committee of the Medical Board Westfalen-Lippe and the University of Münster, reference number 2020-473-fs. Written consent was obtained from all participants. The POLYMOD data collection was approved by national institutional review boards [1]. As only anonymised COVIMOD and POLYMOD data were used in this study, an institutional review was not required for reanalysis.

The COVIMOD study
The COVIMOD study was launched in April 2020 and continued until December 2021, constituting 33 survey waves. Participants were recruited through email invitations to existing panel members of the online market research platform IPSOS i-say [17]. To ensure the sample's broad representativeness of the German population, quota sampling was conducted based on age, gender, and region. Participants were invited to participate in multiple waves to track changes in social behaviour and attitudes toward COVID-19. When the participant size did not meet the sampling quota due to study withdrawals, new participants were recruited into the study. This approach enabled the COVIMOD study to obtain longitudinal samples, but it also introduced the issue of response fatigue, where the number of detailed contacts reported decreased compared to previous participation, irrespective of the survey wave. To procure information on children, a subgroup of adult participants living with children under the age of 18 were selected to be proxies. This procedure meant that middle-aged adults were under-sampled as they completed the survey on behalf of their children.
The COVIMOD questionnaire was based on the CoMix study and includes questions on demographics, the presence of a household member belonging to a high-risk group, attitudes towards COVID-19 as well as related government measures, and current preventative behaviors [9,18]. Participants were also asked to provide information about their social contacts between 5 a.m. the preceding day to 5 a.m. the day of answering the survey. Following the prepandemic POLYMOD study, a contact is defined as either a skin-to-skin contact such as a kiss or a handshake (physical contact) or an exchange of words in the presence of another person (non-physical contacts) [1]. Participants were asked to report the age group, gender, relation, the contact setting (e.g. home, school, workplace, place of entertainment, etc.), and whether the contact was a household member. For survey waves 1 and 2, participants were asked to provide each contact's information separately. However, some participants reported contacts to groups of individuals (e.g., customers, clients) for which a specific number of contacts was assumed (Additional file 2 of [9]). From wave 3 onward, in addition to being asked to report contacts individually, participants were allowed to report a total number of contacts that were either physical or non-physical contacts in the sense described above and which they did not report individually, which we refer to as aggregate contacts. Additionally, some participants could not recall or preferred not to answer the age or gender information of some of the individual contacts that they reported. We treat these three types of entries with missing age or gender equally and refer to them as missing & aggregate contact reports. A copy of the COVI-MOD questionnaire may be found in Additional file 1 of [9].
This current work concerns the first five survey waves of the COVIMOD study. In Fig 1A  and 1B, we show the sampling periods with the number of daily COVID-19 cases, cumulative COVID-19-related deaths, and the OxCGRT Stringency Index which is a measure that captures the degree of containment and closure policies in place for COVID-19 on a scale of 0% to 100% [19]. The following COVID-19 policy timeline is obtained from the ACAPS COVID-19 Government Measures dataset [20]. The first COVIMOD survey was administered from April 30 th to May 6 th in the year 2020, towards the end of the first partial lockdown and the first wave of cases. Before the beginning of the first survey (April 20 th ), small stores, auto dealers, and bookstores were allowed to reopen under strict hygiene regulations. During the final few days of the survey period (May 4 th to 6 th ), phase-out measures were announced by the government, including the step-wise uptake of schools, the reopening of hairdressers under strict hygiene regulations, lifting of the ban on public gatherings of 30 people indoors and 50 outdoors, resumption of religious services, and reopening of public services such as museums, botanical gardens, zoos, and playgrounds. The second wave of the COVIMOD survey was administered from May 14 th to May 21 st . During this period, additional phase-out measures were announced, including the resumption of all cross-country transport and the reopening of hotels and restaurants. International travel to neighbouring countries was also slightly relaxed during this period. The third, fourth, and fifth waves of COVIMOD surveys were taken from May 28 th to July 4 th , June 11 th -22 nd , and June 26 th to July 1 st , respectively. There was no notable introduction or reduction of social contact restriction measures during this time, but international travel restrictions were relaxed primarily for Schengen and EU countries. COVID-19 cases and deaths remained stable during this period (Fig 1A and 1B).
After excluding participants who prefer not to provide age or gender information and 25 participants above the age of 84, there were 1549, 1345, 1076, 1881, and 1603 participants for waves 1 to 5. We observed 3244, 4852, 6344, 13471, and 8353 total contacts for each wave. In Fig 1C, we show the proportion of participants who consented to the survey multiple times. Most participants in waves 2 and 3 had participated in wave 1, with only 6.8% and 16% of participants being new to the survey. The proportion dropped sharply in wave 4, where only 35.1% of initial participants remained. Hence the majority (57.7%) of wave 4 participants were first-time participants. On the contrary, no new participants were enrolled for wave 5, and individuals who participated for the second, third, and fifth time took up approximately 45%, 10.7%, 9.6%, and 34.6% of the sample.

Data processing
In this study, we excluded reports from 20 participants (0.3% of the total) who did not report their own age or gender, as these data are essential for our modelling framework. Further, and following ethical guidelines, the participant age information for children was reported in discrete age bands, i.e., 0-4, 5-9, 10-14, 15-18 years. To obtain fine-age information for participants under 18, we imputed their age by drawing from a discrete uniform distribution with bounds set as the minimum and maximum age of the participant's age category. Finally, regarding the three types of missing & aggregate contacts reported, i.e. contacts reported to groups of individuals, additional contacts reported for who participants did not provide individual information, and individual contacts reported for who participants did not provide age and gender information, we truncated these at 60 (90 th percentile) to remove the effects of extreme outliers. We show the distribution of observed contacts in S1 and S2 Figs.

Estimating contact patterns
Throughout, we focus on estimating contact patterns and dynamics between men and women (superscripts g, h 2 {M, F}) and high resolution one year age groups (subscripts a; b 2 B ¼ f0; 1; 2; � � � 84g). We note that the cutoff at age 84 is a specific choice on account of the age aggregation scheme and the age-specific sample size of COVIMOD. One may change this cutoff for different surveys in an appropriate fashion. To introduce basic notations, let us momentarily consider male-to-female contacts at some fixed time t, suppress the time index, and assume that the exact age of reported contacts is known. We start by considering the total number of contacts Y MF ab that are reported by men of age a who participate in the survey to all women of age b in the population. The number of male survey participants of age a is N M a , and the number of women of age b in the population is P F b . When N M a > 0 for age group a, the contact counts Y MF ab are defined for all b 2 B, and are either zero or positive. When there are no participants for some age group a, there is no corresponding contact data, and we denote the participant age groups in the survey by A M ¼ fa 2 B: N M a > 0g. Finally, we denote the number of age groups in A M by A M and the number of age groups in B by B. We also consider analogous notations for all other gender combinations using the superscripts g and h.
From the observed data, we seek to estimate the contact intensity m MF ab , the average number of contacts from one male participant of age a to all women aged b in the population within a 24-hour time window. We follow [1,2,8,10,[21][22][23] and model the count data with an overdispersion adjusting Negative Binomial observation model, in shape-scale form, Here, m gh ab are the expected contacts, which are expressed in terms of the target quantity of interest, m gh ab , and the known N g a . The overdispersion parameter ν is strictly greater than zero (ν > 0), such that the variance of the contact counts Var½Y gh ab � ¼ m gh ab ð1 þ nÞ is greater than the mean. The contact rate is defined as the probability of contact between one male aged a and one female aged b within a 24-hour time window, i.e.
Crucially, in a closed population, the number of total contacts between the same population strata must be self-consistent and add up to the same, i.e.
for all a; b 2 B. From this, we find contact rates are symmetric because g MF ab ¼ g FM ba for all a, b, and similarly g MM ab ¼ g MM ba and g FF ab ¼ g FF ba for all a < b. Property (3) implies that data on age group a informs contact rates in both age dimensions, which we will exploit heavily below. If the survey captures participants for all possible age groups, i. e. A g ¼ B, the estimation problem reduces to B × B + B × (B + 1)/2 × 2 = B(2B + 1) free contact rate parameters rather than 4B 2 free parameters, an almost 50% reduction. To take advantage of these self-consistency constraints, we follow [10] and expand Eq (1c) to log m gh . Specifically, we model the f gh through computationally efficient Gaussian process approximations as described below, and for ease of notation, write f gh (a, b). The random functions act as age-age-specific offsets to the baseline parameter and thus capture the age structure in human contact intensities. Using random functions, we can estimate arbitrary age-specific contact patterns. This is important because human contact patterns have changed substantially since the COVID-19 pandemic with school closures and other non-pharmaceutical interventions.

Recovering fine age structure from coarse data
For COVIMOD and similar contact surveys, participants were asked to report their contacts in the coarse age groups Importantly, the exact age of the participant is known, and we can leverage this information through the symmetry property in Eq (3) to estimate contact intensities at a much finer resolution. Because of this fundamental property, we call our resulting model the "Bayesian rate consistency model". Using the shape-scale parameterisation in Eq (1), it follows that where g, h 2 {M, F}, a 2 A g , b 2 B, and c 2 C. We will demonstrate below that the high-resolution contact intensities m gh ab are identifiable from coarse contact data.

Estimating dynamics in contact patterns
It is in principle, straightforward to extend model (1-6) to capture time trends in contact patterns, but a particular challenge arises in the context of repeat surveillance. Across COVIMOD survey waves, many participants agreed to report data in multiple rounds, and analyses indicate that participants tend to repeat fewer contacts in subsequent surveys, a phenomenon called reporting fatigue. The time trends in the primary data are thus confounded by longitudinal reporting behaviour. To control for reporting fatigue, we denote by Y gh trac the number of contacts to individuals of age group c and gender h that are reported in survey wave t by all participants of age a and gender g who have participated r time(s). All other notation extends analogously.
In the simplest case, we introduce age-homogeneous reporting fatigue effects r r 2 R at repeat response times r = 0, 1, 2, . . ., and jointly model the longitudinal data with where t = 1, 2, . . . indicates the survey waves, r = 0, 1, 2, . . . repeat surveillance, and ρ 0 = 0 for ease of notation. Here, reporting fatigue is captured by negative ρ r (that decreases with r), which in turn will adjust the contact intensities log m gh tac in follow-up survey rounds to higher estimates than the primary data suggest. New participants entered the COVIMOD survey in each survey wave, so we have data to provide independent information on the contact dynamics and reporting fatigue, with the model borrowing strength across all the data available.
A second challenge is that the number of missing & aggregate contact reports also fluctuated over time. As denominator in our sample, we now consider all participants of age a and gender g regardless of whether they reported individual contacts or missing & aggregate contacts and denote their total in wave t with repeat participation r by N g tra as in (7c). We further denote the number of missing & aggregate contact reports by participants of age a and gender g in survey wave t by T g ta . For instance, if a participant in wave t of age a and gender g had 2 contacts with missing age or information and reported 18 aggregate contacts, then we added 20 to T g ta . In addition, consider the number of total contacts with detailed age information by participants of age a and gender g in survey wave t by Y g ta ¼ P r;c;h Y gh trac . Thus, we can calculate the proportion of contacts that are reported with detailed age information, Assuming that the age of contacts is missing at random within each participant group of gender g, age a and wave t, we then use (8) as an additional offset term in the linear predictor, In practice, if increasingly many participants only provide aggregated contact reports, we have that S g ta < 1, and in turn, this will adjust the contact intensities log m gh tab in later survey waves to higher estimates than the data with full age-specific details suggest.

Non-parametric modelling of contact dynamics
We regularise our inferences in high-dimensional parameter space by associating the random (4) with computationally efficient, zero-mean, two-dimensional Hilbert Space Gaussian Process priors [24,25]. In what follows, we will drop the time and gender sub-and superscripts to ease notation and present our modelling of a generic random function f that represents age structure in contact patterns. Zero-mean two-dimensional Gaussian Processes (GPs) are powerful prior models for random functions. For any finite collection of two-dimensional inputs, the function values are multivariate normal with mean zero. We always have AB count observations on the grid x 1 = (a 1 , b 1 ), . . ., x AB = (a A , b B ) defined by A male and female participant age groups in A and all possible B population age groups in B. The multivariate normal has then a covariance matrix K 2 R AB�AB whose i, j th entries are specified by a covariance kernel function k(x i , x j ). Here, we decompose the 2D kernel function for computational efficiency and model each component through squared exponential or Matérn class kernels. Specifically, using the squared exponential kernels as an example, we have where the scaling parameters α a , α b control the magnitude of the random function in the corresponding dimension, and the lengthscale parameters l a , l b control the bandwidth. The product in Eq (10a) is also known as Kronecker decomposition because the covariance matrix K equals the Kronecker product of the covariance matrices of the kernels with one-dimensional inputs, For computing purposes, we exploit that K 1 , K 2 are positive semi-definite and decompose the covariance matrices as where the superscript > denotes transposition. Using the mixed product property of Kronecker operations, we obtain This shows that the zero-mean two-dimensional GP prior attached to the random function f on the AB inputs x = (x 1 , . . ., x AB ) can be obtained by linear transformation of AB i. i. d. standard Gaussian random variables z � N ð0; 1Þ, In Eq (12), the left-hand side denotes the AB-dimensional column vector of the random function evaluated at the inputs, and the right-hand side shows how the Kronecker product is calculated by a series of basic arithmetic operations. The reshape operation transforms the AB dimensional column vector z column-wise into a A × B dimensional matrix, and the vec operation flattens A × B dimensional matrices column-wise into an AB dimensional column vector.
Eq (12) also shows that the computational cost of two-dimensional GPs is entirely determined by calculating, first, L 1 , L 2 for each new set of GPs hyperparameters, and then, second, performing the arithmetic operations associated with (L 2 � L 1 )z. We use Hilbert Space Gaussian Process (HSGP) approximations [16] to each of the kernels k 1 and k 2 in Eq (10) to reduce the computational cost associated with the first step. For brevity, we refer readers to the excellent introductions to HSGPs in [24,25], and here merely note that the stationary isotropic kernels can be expressed as an infinite sum that involves the spectral density S, eigenfunctions ϕ i and eigenvalues λ i , i = 1, . . ., 1, associated with a certain Laplacian eigenvalue problem on a compact domain O that is strictly larger than B. For convenience, the input domain B is shifted with the midpoint at zero, and then O is written as [−L, L] for some L > 0. To ease notation, we continue to write the shifted inputs in what follows as a i , and b i . The HSGP approximationk 1 to k 1 on the domain [−L 1 , L 1 ] is then obtained by truncating the infinite sum to the first M 1 terms, ffi ffi ffi ffiffi l 1 Crucially, the GP hyperparameters α a , l a enter only in Eq (13b), and the eigenvalues and eigenfunctions are the same regardless of the GP hyperparameters and depend only on the domain boundary value L 1 together with the observed inputs a, a 0 . This speeds up Bayesian computations significantly because the eigenvalues in Eq (13c) and eigenfunctions in Eq (13d) can be precomputed once and for all. For Matérn and other kernels, the corresponding spectral densities may be found in [26]. Rewriting Eq (13a) in matrix notation, we see that L 1 is approximated by where the A × M 1 matrix F 1 has the i, j entries � 1 j ða i Þ, and the Þ. Again, F 1 does not depend on the GP hyperparameters and can be precomputed. The arithmetic operations in Eq (14) can harness computationally efficient diagonal-post-multiply functions in many linear algebra libraries. The HSGP approximation to the k 2 kernel is analogous. The tuning parameters of the HSGP approximations are the integers M 1 , M 2 and the boundary values L 1 , L 2 , and we determine these using established diagnostics [25]. The zero-mean Kronecker-decomposed HSGP prior associated with our random functions f on the input grid x is then wherez is a M 1 M 2 dimensional column vector of i. i. d. standard normal random variables, and the non-negative hyperparameters are θ = (α a , l a , α b , l b ).

Difference-in-age parameterisation
Human contact patterns tend to concentrate among individuals of similar age and individuals with similar age gaps (parent-child, grandparent-child and grandparent-parent). To capture this diagonal structure in the simple Kronecker decomposed priors in Eq (10) for our 2D random functions f, we follow [22] and define f on an age by difference-in-age space rather than an age by age space. This amounts to rotating the age by age space by 45 degrees so that the peer-peer, parent-child, grandparent-child, and grandparent-parent contacts correspond to horizontal lines in the re-parameterised space and match the structure of our Kronecker decomposed priors (10) (see also Fig 1 of Vandendijck et al. [22]). Specifically, we consider age differences d 2 D ¼ fÀ 84; À 83; . . . ; 83; 84g, and re-parameterise the points ða; We are only interested in the random functions evaluated on the original points, which we write as f(d(a, b)) for all ða; bÞ 2 A � B. The number of age differences D = 169 in D is larger than the number of one-year age groups B = 85 and so the difference-in-age parameterisation entails higher computational cost in the calculations that underpin Eq (15). We will show in the results section that this parameterisation is crucial to obtaining accurate estimates of typical fine-age diagonal and off-diagonal human contact patterns from coarsely reported age data, and so recommend its use despite the higher computational cost.

Full Bayesian model and numerical inference
To complete our model for inferring contact dynamics from longitudinal survey data, we specified commonly used priors on all model parameters. The full model is specified for survey waves t = 1, . . ., 5, reporting repeats r = 0, . . ., 4, gender g, h 2 {M, F}, participant age groups a 2 A trg and population age groups b 2 B by n � Exponentialð1Þ ð17dÞ a ti � Cauchy þ ð0; Monte Carlo draws from the joint posterior distribution of all parameters was obtained with the probabilistic computing language Stan [27] via the cmdstanr interface version 0.5.2. Eight chains were run in parallel for 500 warmup iterations and 1000 iterations thereafter. Initial sampling was facilitated by adding the nugget 10 −13 to a gh trab . We typically observed a small number of divergences in the NUTS algorithm, but these accounted for less than 0.005% of samples and were considered to be of no concern. The smallest effective sample size was 1892, and theR convergence diagnostics were below 1.01, indicating that the Markov chains converged and mixed well [28,29]. Trace plots for the parameters with the smallest effective sample size and maximumR are shown in S8 Fig. The R and Stan codes for our models, along with other analysis codes are available at https://github.com/MLGlobalHealth/bayes-rate-consistency.

Simulated social contact data
To validate the Bayesian rate consistency model, we created synthetic datasets that mimic social contact patterns before the COVID-19 pandemic (pre-COVID-19) and during the pandemic (in-COVID-19). To reduce experiment run time, we limited participants and contact ages from 6 to 49 years and assumed that contact intensity patterns do not vary by gender.
Contact intensities were set to be highest among individuals of similar age, mimicking ageassortative contact behaviour. To simulate parent-child contact dynamics, we defined individuals between 6-18 as children and individuals between 30-39 as parents and set the contact intensities between parent-child groups to higher values. Similarly, individuals between 19-29 were defined as children of individuals from 40 to 49, and contact intensities between these groups were also set to a higher value. The resulting patterns are shown in the top left panel of Figs 2 and 3. For full details, we refer readers to S1 Text. From the stylised contact intensity scenarios, we next randomly generated age-and gender-specific contact counts for five different participant size configurations, N = 250, 500, 1000, 2000, 5000, by sampling from a Poisson distribution such that Y gh ab � Poissonðl gh ab Þ where l gh ab ¼m gh ab N g a . We set N g a such that the agegender counts of the participants were representative of the 2011 German census population [30]. To mimic the age reporting scheme in the COVIMOD surveys, we aggregated the simulated contact counts by Y gh ac ¼  Fig 2 illustrates the simulated contact patterns in the pre-COVID-19 scenario by 1-year age bands, along with the corresponding data by coarse 5 and 10-year age bands for a sample size of 2,000 participants, and the fits with the Bayesian rate consistency model that aim to recover the fine age structure by 1-year age bands. Fig 3 shows our results for the in-COVID-19 scenario. The age-age parameterisation performed poorly for both simulation scenarios, especially in regions of the contact matrix where the degree of age aggregation is large, i.e., 10-year age intervals as opposed to 5-year age intervals. For such large reporting intervals of age groups, the contact intensity patterns that we could estimate with the age-age parameterisation showed idiosyncratic bimodal patterns along the main diagonal of the contact intensity matrices (bottom left panels in Figs 2 and 3). In comparison, the difference-in-age parameterisation captured age-assortative contact patterns and the sub-diagonal parent-children contact patterns with much better accuracy. The estimated contact intensity patterns for other gender combinations and simulation scenarios were qualitatively very similar and reported in S4 and S5 Figs. In several previous social contact surveys, participants were asked to report the exact age of their contacts. This allowed us to further test the Bayesian rate consistency model by artificially withholding the exact age of each contact in real-world data. Specifically, we tested how closely the reconstructed fine-age contact intensities inferred on the masked data with the age of contact masked into broad age bands match those inferred from the actual data with the exact age of contacts known. Fig 4 shows the outcomes of this test on data from 1,292 participants who reported 31,670 contacts in Germany as part of the POLYMOD study [1]. The contact intensities learned from the data with the exact age of contact and those learned from the data with age of contact masked into broad age bands were qualitatively very similar, and the mean absolute difference across each age-age gender-gender group was 0.145 in the context of a mean contact intensity of 0.120 in each age-age gender-gender group. We repeated this test on data from 1,122 participants from the Manicaland HIV/STD Prevention Study [31], a general population cohort study carried out since 1998 in Manicaland, the easternmost province of Zimbabwe. In 2013, participants were asked to report on their social contacts, which were either physical or non-physical, in two-way conversations with three or more words in the physical presence of another person, and the exact age was reported for 24,480 contacts. We chose these data for a second test because Manicaland, Zimbabwe's population structure and social

PLOS COMPUTATIONAL BIOLOGY
contact patterns differ markedly from those seen in Europe [1,31]. Fig 5 shows our test outcomes on the Manicaland data. The Bayesian rate consistency model generated again very similar estimates on the data with the age of contact masked into broad age bands as compared to the actual data with the exact age of contact known, and the mean absolute difference across each age-age gender-gender group was 0.179 in the context of a mean contact intensity of 0.125 in each age-age gender-gender group.

Gaussian process approximations enable fast Bayesian inference
In Table 1 and S6 and S7 Figs, we compare the performance of the Bayesian rate consistency model with various parameterisations for different sample sizes and scenarios in terms of estimation accuracy and computing runtimes. Specifically, to assess how well HSGP models can approximate full-rank 2DGP models, we ran simulations for both scenarios and parameterisations with a sample size fixed at 2000. We compared the model fits to the simulation truth in Empirical and estimated contact intensity patterns for POLYMOD using the Bayesian rate consistency model for age-granular contact data and age-stratified contact data. (Top row) Crude empirical contact intensity patterns. (Middle row) Posterior median contact intensity estimates from the Bayesian rate consistency model applied to fine-age contact data meaning the age of contacts was not aggregated into large age bands. (Bottom row) Posterior median contact intensity estimates from the Bayesian rate consistency model applied to coarse-age contact data meaning the age of contacts was aggregated into larger age bands similarly to the COVIMOD study.
https://doi.org/10.1371/journal.pcbi.1011191.g004 terms of the mean absolute error (MAE) based on the age-age-specific inferred contact intensities, expected log posterior density (ELPD), the percentage that the predicted values are inside the 95% prediction intervals according to posterior predictive check (PPC), and the median running time.
The difference-in-age parameterisation achieved significantly better accuracy than the ageage parameterisation but also required more time to fit due to the introduction of additional A 2 − A nuisance parameters. The computational toll of the difference-in-age parameterisation was most strongly reflected in the median runtimes under the full-rank 2DGPs, which could take more than half a day to fit. Using HSGPs, we reduced median runtimes by more than 15-fold while maintaining an accuracy close to that of full-rank 2DGP models. The bottom section of Table 1 compares the accuracy of the difference-in-age HSGP models across survey sample sizes. In general, smaller sample sizes led to less accurate estimates. Empirical and estimated contact intensity patterns for Zimbabwe using the Bayesian rate consistency model for age-granular contact data and age-stratified contact data. (Top row) Crude empirical contact intensity patterns. (Middle row) Posterior median contact intensity estimates from the Bayesian rate consistency model applied to fine-age contact data (i.e., the age of contacts is not aggregated into large age bands). (Bottom row) Posterior median contact intensity estimates from the Bayesian rate consistency model applied to coarse-age contact data (i.e., the age of contacts is aggregated into large age bands in a similar fashion to COVIMOD). https://doi.org/10.1371/journal.pcbi.1011191.g005

Modelling marked structures in age-specific contact patterns
Social contacts are strongly structured by age, reflecting common behaviour and social norms around family size, reproductive age, schooling, and other factors [22]. In turn, smooth process kernels such as the squared exponential (Eq (10b)) may not be well suited to describe marked changes in contact intensities. On our simulated contact scenarios, we find indeed that Matérn 5 2 and Matérn 3 2 kernels performed better in comparison to squared exponential kernels in terms of accuracy (S1 Table). The difference in accuracy between Matérn 3 2 and Matérn 5 2 was small and qualitatively indistinguishable (S4 and S5 Figs), and in the following results, we considered the Matérn 5 2 kernel.  where Y gh ac denote the total number of contacts from participants of gender g and age group a to individuals of gender h and age category c, N g a are age-and gender-specific sample size, and S g a are age-and gender-specific proportion of reports with complete age and gender information. Fig 6 also shows the contact intensity estimates obtained via bootstrapping with the socialmixr package [11], and the estimates from the Bayesian rate consistency model. We found that the crude estimates are sparse and fluctuate greatly, even between neighbouring age groups. Next, the socialmixr estimates aggregate the observations by the large age categories in which the contacts were reported and do not borrow available information through the exact age of participants to obtain higher resolution estimates. The socialmixr estimates are adjusted for symmetry in contact rates but are not adjusted for reporting fatigue or missing & aggregate contact reports (see respectively Eqs (7) and (9). We also observed that contacts  [11]. (Bottom row) Posterior median contact intensity estimates from the Bayesian rate consistency model. For the crude estimates and socialmixr estimates, estimates are divided by the number of integer ages in the corresponding age group to make the estimates more comparable (e.g., if the age of contacts is 20 − 24 than the contact estimate is divided by 5 while if the age of contacts is 25 − 34 the contact estimate is divided by 10). The exact runtime arguments for this comparison are given in script figure-6.R on our accompanying GitHub repository.

Model-based estimates of contact patterns in Germany by 1-year age groups
https://doi.org/10.1371/journal.pcbi.1011191.g006 with missing age information are imputed by sampling the missing age only from all contacts of the participants of the same age group [11].
In comparison, the contact patterns estimated with the Bayesian rate consistency model align with those estimated with socialmixr, but provide finer age resolution. We achieve this higher resolution by logical constraints on who contacts whom in a closed population (recall Eq (1)), and not via imputation. These constraints imply that data on the exact age of survey participants provide information on the exact age of contacts even though they are reported in coarse age brackets. The patterns reveal strong age-assortativeness in mixing patterns that is indicated by the high intensities on the main diagonals of the contact intensity matrices shown in the bottom row of Fig 6. Lying approximately 30 years away from the main diagonal, two strips of high contact intensity fade with increasing age and correspond to intergenerational contacts between parents and children. While this age-dependent pattern persists over time (S9 and S12 Figs), we will also show below that the increases in social contact intensities in subsequent waves were far from uniform across age.

Controlling for time-varying reporting effects
We can sum the contacts' age dimension of estimated contact intensities to obtain the average number of contacts from one person of age a per day. For brevity, we call these "marginal" contact intensities. In Fig 7, we show the estimated marginal contact intensities under the Bayesian rate consistency model, which simultaneously accounts for the time-varying reporting effects that emerge through reporting fatigue (repeat participation in the longitudinal COVIMOD survey) and missing & aggregate contact reports (participants unable to list contacts individually), and which are clearly present in the data as shown in Figs 1 and S1. We compare these (adjusted) marginal contact intensities in Fig 7 to those obtained without adjusting for missing & aggregate contact reports (but adjusting for reporting fatigue), those obtained without adjusting for reporting fatigue (but adjusting for missing & aggregate contact reports), and those obtained without any adjustments for missing & aggregate contact reports or reporting fatigue. Furthermore, we compared the marginal contact intensities of wave 4 given by the final longitudinal model, which adjusts for reporting fatigue with a cross-sectional model fitted on data from the same wave but with repeating participants excluded from the data (S13 Fig). We found that the reporting effect sizes can be estimated simultaneously with all other age, gender and time parameters, meaning that they are mathematically consistent with the symmetry constraints in contact rates in closed populations. The results clearly show that adjusting for missing & aggregate contact reports and reporting fatigue significantly increased the estimated marginal contact intensities in a non-trivial manner that depends on the contribution of repeat survey participants to each survey wave. It is implausible that the actual contact intensities did not increase as contact reduction measures were progressively eased between wave 1 to wave 5 of the COVIMOD survey, as is suggested by the estimates without adjustments. This led us to conclude that longitudinal contact patterns must be estimated from the available data with a model-based framework that can adjust for fatigue effects and missingness.

Social contact intensities from May to July 2020 in Germany remained largely below pre-pandemic levels
With these adjustments in place, we next compared the contact intensities seen during the first 5 waves of the COVIMOD study to those observed in the pre-pandemic POLYMOD study conducted between 2006 and 2008 [1] (Figs 8 and S14). Social contact intensities in the first 5 waves remained substantially below pre-pandemic levels, which illustrates the stark impact that the COVID-19 pandemic and non-pharmaceutical interventions had on social interactions when protective vaccines were unavailable, and mortality following COVID-19 infection was high [32]. Interestingly, the POLYMOD data also suggest that women between ages 20 and 50 had more contacts than men of the same age range before the pandemic (Fig 8). Another point that emerges from the fine-age analysis is that by wave 5, the marginal contact intensities of individuals aged 70 were very similar to pre-pandemic levels. This could reflect that the relatively few social contacts of older individuals aged 70 and above are essential human interactions that are challenging to reduce. Fig 7 shows that the contact intensities increased consecutively from wave 1 to wave 5, with the increases being most substantial from wave 1 to wave 2. These increases are primarily due to marked increases in the number of contacts that were reported in aggregate over the first five survey waves (S1-S3 Figs). Fig 9 illustrates the relative percentage increase in the marginal contact intensities relative to those in wave 1. Although there were marked differences in the contact patterns between men and women (Fig 7), the relative increases showed no significant differences between the two genders, which suggests that the gender differences in contact patterns may arise from underlying gender-dependent social contact structures that are captured in the survey, rather than non-pharmaceutical interventions. Comparing waves 2 and 3 to wave 1, contact intensities tended to increase more in older age groups. For wave 4, we observe a sharp increase in contacts among men approximately 20 years of age. However, we find that this pattern is sensitive to data pre-processing criteria, as we explain below. Comparing wave 5 to wave 1, we found the strongest evidence that contact intensities did not increase homogeneously for all age groups over time and instead tended to increase more in older age groups.

Differential rebound of age-specific social contacts
We next focus on characterising the dynamics in contact intensities for specific age groups. Intuitively, this corresponds to slicing the contact intensity matrices reported in Fig 6 across rows for a fixed column, and for brevity, we call these the "conditional" contact intensities. In Fig 10 (top), we illustrate the conditional contact intensities for individuals aged 10, 20, 35, and 70 years to represent the contact intensities of school children, young adults, the working, and the ageing population. We observe two peaks for participants aged 10 and 20: a larger sharp peak corresponding to contacts between peers and a shorter rounded peak with individuals approximately 45-50 years older, representing contacts with their parents. For participants aged 35, we observe an additional third peak with individuals aged 60 to 70, predominantly corresponding to contacts with their parents. Participants aged 70 generally mixed with individuals of a similar age, with some contact between individuals aged approximately 40 but almost no contact with individuals under 20. These core patterns of social contacts are present across all survey waves. Fig 10 (bottom) shows the ratio of the conditional contact intensities in survey waves 2, 3, 4 and 5 relative to those in wave 1. Ratios above 1 thus indicate increases in social contacts. We find that the increases in social contacts were not homogeneous by the age of contacted individuals. Focusing on wave 5 relative to wave 1, we find that for children aged 10, the conditional contact intensities in individuals of the same age and roughly aged 70 rose particularly strongly. For individuals aged 20, increases in their social contacts were more homogeneous, except for those with young children, which remained similar to those seen in wave 1. For individuals aged 35, increases in their social contacts were concentrated in slighter younger and all older individuals, but not their peers, reflecting that individuals in the 35-year age group retained social contact with their peers during intense non-pharmaceutical interventions in wave 1. Individuals aged 70 homogeneously increased their conditional contact intensities with younger individuals. These findings suggest that the age patterns in social contacts changed in a structured and complex manner after the first COVID-19 wave in Germany.

Discussion
We developed the Bayesian rate consistency model in order to regain the ability to quantify and characterise social contact patterns at high age resolution from contemporary, longitudinal survey data on social contacts. The main contributions of our model-based approach to estimating the trends in human contact patterns are as follows. The Bayesian rate consistency model can reasonably accurately estimate high-resolution social contact patterns from data that aggregate the age of contacts into large age bands. The model also enables adjusting for the confounding effects due to the aggregated reporting of contacts and reporting fatigue in longitudinal social contact surveys. These advancements are particularly relevant to COVIDera social contact studies for which the age of contacts is reported in coarse age brackets and to which participants contributed to multiple survey waves [7,[13][14][15]. We draw from methodologies which serve as principal workhorses in spatial statistics to map the landscape of social contacts [33,34] and incorporate a recently developed GP approximation technique to alleviate the computational bottleneck which often plagues such spatial models [16,25].
In applying our model to the first five waves of the COVIMOD study, we gained insights into the age-and gender-specific social contact dynamics and their evolution over time at a resolution of 1-year age bands. We found relatively limited rebounds in contact intensities after lifting contact reduction measures and that social contact intensities remained substantially below pre-pandemic values until the end of our study period in July 2020 (Fig 7). These results are consistent with social contact survey data in England [13], Belgium [14], the Netherlands [15], and elsewhere. Relative to pre-pandemic estimates, the largest contact reductions occurred among children and young adults (Fig 7). Still, reductions were substantial across all age groups, indicating the strong impact of non-pharmaceutical interventions on social interactions and successfully reduced infection risk during the first months of the COVID-19 pandemic [32].
Over the first five COVIMOD survey waves between May and July 2020, we observed structured changes in contact patterns, with social contact intensities rebounding more rapidly with the increasing age of individuals (Fig 7). These increases in social contacts were also far from uniform across population age bands of contacted individuals. Most increases tended to centre on the off-diagonals of the age-age social contact intensity matrix (Fig 6). These findings are hard to obtain without detailed inferences of social contact patterns by 1-year age bands [13][14][15]. We also noted that despite a continued increase in contact intensity, COVID-19 cases remained stable during the analysis period (Fig 1). This may be because contact counts remained much lower than pre-pandemic estimates despite the ease in non-pharmaceutical interventions [9]. Other protective measures such as face masks, hygiene regulations including surface disinfection, remote work, and warmer summer weather may also have contributed to keeping infections at bay [5]. These observations demonstrate that longitudinal data collection on social contacts and tailored tools to analyse these data are essential to characterise the impact of non-pharmaceutical interventions and the possible social contact routes through which pathogens can spread because the observed dynamics in social contacts do not follow easily predictable patterns.
Our work is not without limitations. First, some issues arise from the sampling methodology of the COVIMOD survey. Participants were recruited from an online panel for market research with email invitations, and although quota sampling was performed, the final samples were not fully representative of the population [9]. Previous work proposed using post-stratification weights to re-scale the data, but a sensitivity analysis did not reveal large differences between weighted and unweighted estimates [9]. Participants consisted only of those with internet access who possibly adhered more to social distancing rules as such a demographic is more likely to respond to health surveys. Participants received guidance only through the text within the questionnaires, which may have been misinterpreted, and participants may report more contacts in paper-based surveys than in an online survey [35]. Additionally, we truncated aggregate contact reports at 60, but different thresholds may lead to slight changes in the inference results (S1 Fig). Next, the difference-in-age parameterisation may not be appropriate if social contacts do not follow the pattern where high intensities lie on the contact matrix's main diagonal and sub-diagonals. This is relevant when investigators wish to conduct analyses for other contexts, e.g., work and transport, where contact patterns may not depend on age and age difference. However, it is easy for investigators to revert to the classical age-age parameterisation if they deem it more appropriate. We provide several template Stan model files in the accompanying GitHub repository. Furthermore, our fully Bayesian modelling framework is currently limited to analysing approximately 10 longitudinal survey rounds. While the recently proposed Hilbert Space Gaussian Process priors enable fast Bayesian inferences on cross-sectional data [16,24,25], additional research is needed to scale up the approach to survey data from 30 waves or more. Potential avenues include HSGP modelling over the time domain within our approach, Integrated Nested Laplace Approximations, penalised splinebased regressions or even variational autoencoders [2,10,36,37]. Finally, and most importantly, the Bayesian rate consistency model requires participant age information to be reported by 1-year age bands. The exact age of participants is usually recorded without error but not necessarily made publicly available [7].

Conclusion
In summary, we provide a new statistical method to strengthen the global pandemic preparedness toolkit, which enables epidemiologists and policymakers to obtain a clearer picture of how infectious respiratory pathogens such as SARS-CoV-2 can spread through populations in near real-time. The Bayesian rate consistency model can analyse contemporary, longitudinal social contact survey data and estimate contact intensity by 1-year age bands. We validated the model on simulated social contact data for different scenarios and real-world data from Europe and Africa where contacts have been recorded in 1-year age bands, and any inference errors made by artificially coarsening the data can be assessed. The outputs-contact intensity matrices by gender and 1-year age band with uncertainty quantification-are central to estimating the effective reproduction number of pathogens in real-time with greater precision than currently possible from contact estimates by coarse age bands [6,10,13]. The outputs are also crucial for parameterising infectious disease models and more accurately forecasting cases, hospitalisations and deaths [32,38,39], and understanding the drivers of disease spread [38,40]. Contact estimates from the first waves of the COVIMOD and CoMix social contact surveys [7] may be of particular interest as they represent the patterns during a time of stay-at-home orders across Europe and subsequent relaxation of non-pharmaceutical interventions, and could be used as templates of social contact patterns in future pandemic emergencies when such data are not immediately available. Vaccine allocation strategies also depend critically on understanding how pathogens spread through structured populations. Optimal allocations count the direct benefits of allocating life-saving vaccines to individuals with the highest fatality risk and the indirect benefits that accrue through prioritised allocations to population groups that would otherwise drive spread disproportionately through their contact and mixing patterns [41,42]. We thus find the Bayesian rate consistency model promises to aid the understanding of contact behaviour, more realistic parameterisations of infectious disease models, and a deeper understanding of how infectious respiratory diseases are propagated through populations. (Top row) Crude empirical social contact intensity patterns, with crude contact intensities above a value of 3 truncated for visualisation purposes. There are some age groups with no participants, and they are represented by white vertical columns. (Middle row) Contact intensity patterns as estimated by the socialmixr R package [11]. (Bottom row) Contact intensity patterns are given by our Bayesian model. (Top row) Crude empirical social contact intensity patterns, with crude contact intensities above a value of 3 truncated for visualisation purposes. There are some age groups with no participants, and they are represented by white vertical columns. (Middle row) Contact intensity patterns as estimated by the socialmixr R package [11]. (Bottom row) Contact intensity patterns are given by our Bayesian model. (JPEG)

S11 Fig. Empirical and estimated social contact intensity patterns for COVIMOD wave 4.
(Top row) Crude empirical social contact intensity patterns, with crude contact intensities above a value of 3 truncated for visualisation purposes. There are some age groups with no participants, and they are represented by white vertical columns. (Middle row) Contact intensity patterns as estimated by the socialmixr R package [11]. (Bottom row) Contact intensity patterns are given by our Bayesian model. Ratmann.