Projecting contact matrices in 177 geographical regions: An update and comparison with empirical data for the COVID-19 era

Mathematical models have played a key role in understanding the spread of directly-transmissible infectious diseases such as Coronavirus Disease 2019 (COVID-19), as well as the effectiveness of public health responses. As the risk of contracting directly-transmitted infections depends on who interacts with whom, mathematical models often use contact matrices to characterise the spread of infectious pathogens. These contact matrices are usually generated from diary-based contact surveys. However, the majority of places in the world do not have representative empirical contact studies, so synthetic contact matrices have been constructed using more widely available setting-specific survey data on household, school, classroom, and workplace composition combined with empirical data on contact patterns in Europe. In 2017, the largest set of synthetic contact matrices to date was published for 152 geographical locations. In this study, we update these matrices with the most recent data and extend our analysis to 177 geographical locations. Due to the observed geographic differences within countries, we also quantify contact patterns in rural and urban settings where data are available. Further, we compare both the 2017 and 2020 synthetic matrices to out-of-sample empirically-constructed contact matrices, and explore the effects of using both the empirical and synthetic contact matrices when modelling physical distancing interventions for the COVID-19 pandemic. We found that the synthetic contact matrices show qualitative similarities to the contact patterns in the empirically-constructed contact matrices. Models parameterised with the empirical and synthetic matrices generated similar findings, with few differences observed in age groups where the empirical matrices have missing or aggregated age groups. This finding means that synthetic contact matrices may be used in modelling outbreaks in settings for which empirical studies have yet to be conducted.


Introduction
The emergence of the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), which causes Coronavirus Disease 2019 (COVID-19), has affected the lives of billions worldwide [1]. SARS-CoV-2 is predominantly transmitted between people via respiratory droplets and, as such, the transmission dynamics are strongly influenced by the number and type of close contacts between infectious and susceptible individuals [2][3][4][5][6][7].
Mathematical models have played a key role in understanding both the spread of directly-transmissible infectious diseases such as COVID-19 [8][9][10] and the effectiveness of public health responses [11][12][13][14][15][16]. Since transmission events can rarely be directly observed and measured, most transmission models are based on the social contact hypothesis [17], which implies that the risk of transmission between a susceptible and an infected individual is proportional to the rate of contact between them [18]. Rates of contact are known to differ according to characteristics such as the ages of both individuals and the setting in which the contact takes place, such as the home, school or workplace; they are also commonly assortative, and infection may be concentrated in demographic segments as a result [17,19,20].
Age-structured models often define the rate of mixing between age groups through a mixing matrix where the elements represent the frequency of contact between two individuals from subgroups (such as age groups) represented by the columns and rows. Mixing matrices can be generated from surveys that record the number and type of contacts between people, such as the respondent-completed diaries used in the landmark POLYMOD contact pattern study, which measured social contact patterns in eight European countries [20]. However, the majority of countries around the world lack data from contact surveys that can be used to inform the mixing matrix. This problem is particularly acute in low- and lower-middle-income countries (LMICs), where only 4 studies are available, compared to 54 in high-income countries [21]. Our previous work [22] used country-specific data on household size, school, and workplace composition plus empirical contact data from the POLYMOD survey to generate age- and location-specific contact matrices (synthetic contact matrices) to use in settings where social contact patterns had not yet been directly measured.

The code used to generate these analyses and the updated synthetic matrices are available at https://github.com/kieshaprem/synthetic-contact-matrices (DOI: 10.5281/zenodo.4889500).

Funding:
The following funding sources are acknowledged as providing funding for the named authors. KP, PK and MJ were partly funded by the Bill & Melinda Gates Foundation (INV-003174). MJ and NGD were partly funded by the National Institute for Health Research (NIHR; NIHR200929). MJ was partly funded by NIHR using UK aid from the UK Government to support global health research (16/137/109). The views expressed in this publication are those of the author(s) and not necessarily those of the NIHR or the UK Department of Health and Social Care. KvZ was partly funded by DfID/Wellcome Trust (Epidemic Preparedness Coronavirus research programme 221303/Z/20/Z) and DfID/Wellcome Trust/NIHR (Elrha R2HC/UK DFID/Wellcome Trust/NIHR). KP, PK and MJ were partly funded by the European Union's Horizon 2020 research and innovation programme, project EpiPose (101003688). RME was partly funded by HDR UK (MR/S003975/1) and UK MRC (MC_PC 19065

These synthetic contact matrices have been widely used in models of SARS-CoV-2 spread and the impact of interventions such as physical distancing which alter the pattern of contacts (e.g. [13]). Following the publication of our previous work, new empirical contact surveys have been conducted in LMICs (reviewed in [21]), full demographic data are now available for more countries for older age groups, which is particularly salient given the age-gradient in the severity of COVID-19 [23,24], and more recent household composition data are now available for more countries than before. Updating the matrices is particularly important since public health interventions during the pandemic, such as shielding, are often age-structured [25].
Geographic differences within countries have also been observed, with large early outbreaks in urban population centres such as Wuhan, New York, London and Madrid [26,27] spreading into more rural areas, which in many countries may lack the healthcare infrastructure to handle surges in severe cases. Tailored public health response in rural and urban settings may thus be called for to minimise unnecessary economic and social impacts. Assessing such policies requires differences between contact patterns in rural and urban environments to be quantified, which has previously been done only for a few countries [28][29][30]. In these studies, individuals in rural settings documented more contacts at home than their urban counterparts [28,29]. However, individuals in rural settings in Zimbabwe [30] reported a lower total number of contacts than those in peri-urban settings. The study in Southern China observed no qualitative difference in overall contact patterns between rural and urban populations [28].
In this paper, we update the synthetic contact matrices with the most recent data, compare them to measured contact matrices, and develop customised contact matrices for rural and urban settings. We use these to explore the effects of physical distancing interventions for the COVID-19 pandemic in a transmission model.

Updating country-specific demography and setting parameters
As in Prem et al. [22], we employed a Bayesian hierarchical modelling framework to estimate the age- and location-specific contact rates in each of the POLYMOD countries (Belgium, Germany, Finland, United Kingdom, Italy, Luxembourg, the Netherlands, Poland), accounting for repeat measurements of contacts made in different locations by the same individual. We model the number of contacts documented by individual $i$ at a particular location $L$ with an individual in age group $\alpha$ as $X^{L}_{i,\alpha} \sim \mathrm{Po}(\mu^{L}_{a_i,\alpha})$, where the mean parameter varies for each individual $i$, by $i$'s age, $a_i$, and by location, i.e. $\mu^{L}_{a_i,\alpha} = \sigma_i \lambda^{L}_{a_i,\alpha}$. The $\sigma_i$ parameter characterises differences in social activity levels between individuals, i.e., the random effect belonging to individual $i$. The $\lambda^{L}_{a_i,\alpha}$ parameter denotes the frequency of contact between individuals (or contact rate per day) from two age groups, $a$ and $\alpha$, at location $L$, and it is the key estimand. Because the number of contacts should be comparable for individuals of similar ages, we imposed smoothness between successive age groups for the $\lambda^{L}_{a_i,\alpha}$ parameter as described in section A.8 in S1 Text. Noninformative prior distributions were assumed for all parameters in the model, as detailed in [22].
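To make the observation model concrete, the following is a minimal simulation sketch of the Poisson counts with an individual-level multiplier. It is not the authors' estimation code: the contact rates in `lam`, the gamma distribution used for the random effects, and the function name `simulate_contacts` are all illustrative assumptions.

```python
import numpy as np

def simulate_contacts(contact_rate, activity, rng):
    """Draw X_{i,alpha} ~ Poisson(sigma_i * lambda_{a_i,alpha}) for every individual i.

    contact_rate: array (n_individuals, n_contact_age_groups); row i holds the
        contact rates for individual i's own age group a_i at one location.
    activity: array (n_individuals,); the random effects sigma_i.
    """
    mean = activity[:, None] * contact_rate  # mu = sigma_i * lambda
    return rng.poisson(mean)

rng = np.random.default_rng(1)

# Hypothetical daily contact rates at one location for three age groups
# (rows: respondent age group a, columns: contact age group alpha).
lam = np.array([[4.0, 1.0, 0.5],
                [1.0, 3.0, 1.0],
                [0.5, 1.0, 2.0]])

n = 5000
ages = rng.integers(0, 3, size=n)                 # age group a_i of each respondent
sigma = rng.gamma(shape=4.0, scale=0.25, size=n)  # ASSUMPTION: gamma random effect with mean 1
counts = simulate_contacts(lam[ages], sigma, rng)

# Average reported contacts of the youngest respondents, by contact age group.
print(counts[ages == 0].mean(axis=0))
```

Because the random effects average to one in this sketch, the per-age-group means of the simulated counts should sit close to the corresponding row of `lam`, while the counts themselves are more dispersed than a plain Poisson draw.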
We updated the synthetic contact matrices [22] with more recent data on population age structure, household age structure of 43 countries with recent Demographic and Health Surveys (DHS) [31] and socio-demographic factors for 177 geographical regions, including countries and some subnational regions such as the Hong Kong and Macau Special Administrative Regions (SARs) of the People's Republic of China. We include 14 country characteristics from the World Bank and United Nations Educational, Scientific and Cultural Organization Institute for Statistics (UIS) databases: gross domestic product per capita, total fertility rate and adolescent fertility rate, population density, population growth rate, internet penetration rate, and secondary school education attainment levels, as proxies of development; and under-five mortality rate, the life expectancy of males and females, mortality rates of males, risk of maternal death, mortality from road traffic injury, and the incidence of tuberculosis, as proxies for overall health in the country. The DHS provides nationally-representative household surveys, with the largest dataset, from India, containing information on approximately 3 million individuals from about 600 000 households (see Table A in S1 Text). To project the household age structure for a geographical location with no available household data, we use a weighted mean of the population-adjusted household age structures of the POLYMOD and DHS countries as described in the S1 Text. Because the household age structures vary across countries in different stages of development and with different demographics, we use the updated 14 indicators, all standardized by z-scoring, to quantify the similarities between countries with and without household data to derive these weights.
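The weighting idea can be sketched as follows. The exact weighting scheme is given in S1 Text and is not reproduced here; this sketch assumes inverse Euclidean distance between z-scored indicator vectors purely for illustration, and the function names, toy indicator values and toy household age matrices are hypothetical.

```python
import numpy as np

def zscore(indicators):
    """Standardise each indicator (column) to mean 0 and standard deviation 1."""
    return (indicators - indicators.mean(axis=0)) / indicators.std(axis=0)

def project_household_structure(indicators, has_data, ham, target):
    """Weighted mean of known household age matrices for a country without data.

    indicators: array (n_countries, n_indicators) of country characteristics.
    has_data: boolean array (n_countries,), True where a household age matrix exists.
    ham: array (n_with_data, n_age, n_age), ordered like the True entries of has_data.
    target: row index of the country to project.
    """
    z = zscore(indicators)
    dist = np.linalg.norm(z[has_data] - z[target], axis=1)
    weights = 1.0 / (dist + 1e-9)  # ASSUMPTION: inverse-distance similarity weights
    weights /= weights.sum()
    return np.tensordot(weights, ham, axes=1), weights

# Toy example: four countries, three indicators; the last country has no household data.
indicators = np.array([[1.0, 10.0, 0.20],
                       [5.0, 30.0, 0.80],
                       [9.0, 50.0, 0.50],
                       [5.2, 31.0, 0.75]])
has_data = np.array([True, True, True, False])
ham = np.stack([np.full((2, 2), 1.0), np.full((2, 2), 2.0), np.full((2, 2), 3.0)])

projected, weights = project_household_structure(indicators, has_data, ham, target=3)
print(weights)
print(projected)
```

In this toy setup the second country is the most similar to the target, so it receives the largest weight and dominates the projected matrix.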
We internally validated the household age matrices using leave-one-out validation to verify these matrices describing household structure could be reverse-engineered for countries (POLYMOD and DHS) for which empirical household age matrices were available, as described in Prem et al. [22] and in S1 Text.
By accounting for the demographic structure, household structure (where known), and a variety of metrics including workforce participation and school enrolment, we then estimated contact patterns at home, work, school and other locations for non-POLYMOD countries. Specifically, the population age compositions for 177 geographical regions were obtained from the United Nations Population Division [32]. To derive the working population matrices for each geographical location $c$, we use the labour force participation rate by sex and 5-year age groups, $w^{c}_{a}$, for the 177 geographical regions from the International Labour Organization (ILO) [33]. We derive the working population matrix of a country, $W^{c}_{a,\alpha}$, from the cross (outer) product of $w^{c}_{a}$ and $w^{c}_{\alpha}$; its elements describe the probability of encounter between individuals from two age groups, $a$ and $\alpha$, in the workplace.
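As a small worked illustration of this construction, the sketch below forms a working population matrix as the outer product of a participation-rate vector with itself. The participation rates and the function name are made up for the example and are not ILO values.

```python
import numpy as np

def working_population_matrix(participation):
    """W[a, alpha] = w[a] * w[alpha]: both individuals must be in the workforce."""
    w = np.asarray(participation, dtype=float)
    return np.outer(w, w)

# Hypothetical labour force participation rates for four age groups.
w = [0.0, 0.5, 0.8, 0.1]
W = working_population_matrix(w)
print(W)
```

The resulting matrix is symmetric by construction, and any age group with zero participation (such as young children) contributes a row and column of zeros.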
When constructing the school-going population matrices, we use the country-specific pupil-to-teacher ratio in schools at various levels of education (i.e., pre-primary, primary, secondary and tertiary), enrolment rates of students at various levels of education, starting ages and number of years of schooling at various levels of education from UIS [34] and the distribution of teachers by age from the Organisation for Economic Co-operation and Development (OECD) [35]. Using the country-specific data, we first estimate the number of students in each age group by education level. Together with the country-specific pupil-to-teacher ratio at each education level and the distribution of teachers and workforce by age, we then project the number of teachers in each age group. Both students and teachers form the school-going population. Similar to the formulation of the working population matrix, the school-going population matrix estimates the probability of an encounter between two ages. The steps to construct both the working and school-going populations are detailed in S1 Text.
After projecting populations at home, work and school for the 177 geographical regions, we infer the synthetic age- and location-specific contact matrices (S1 Text). For contacts in other locations (not home, work or school), we adjusted the POLYMOD contact matrices with the country-specific population. We also compare the proportion of contacts at other locations measured from the empirical contact studies.
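The student and teacher projection can be sketched roughly as below. This is a deliberate simplification that collapses all education levels into one and uses invented numbers; the authors' level-by-level procedure is described in S1 Text, and the function name and inputs here are hypothetical.

```python
import numpy as np

def school_population(population, enrolment, pupil_teacher_ratio, teacher_age_dist):
    """Project students and teachers per age group for a single education level.

    population: people in each age group.
    enrolment: fraction of each age group enrolled in school.
    pupil_teacher_ratio: pupils per teacher (scalar).
    teacher_age_dist: share of teachers falling in each age group (sums to 1).
    """
    students = np.asarray(population, dtype=float) * np.asarray(enrolment, dtype=float)
    n_teachers = students.sum() / pupil_teacher_ratio
    teachers = n_teachers * np.asarray(teacher_age_dist, dtype=float)
    return students, teachers

# Hypothetical age groups: 0-4, 5-9, 10-14, 15-19, 20-39, 40-64.
population = [100, 100, 100, 100, 400, 400]
enrolment = [0.0, 0.9, 0.8, 0.0, 0.0, 0.0]
teacher_age_dist = [0.0, 0.0, 0.0, 0.0, 0.6, 0.4]

students, teachers = school_population(population, enrolment, 17, teacher_age_dist)
school_going = students + teachers  # both groups form the school-going population
print(students, teachers)
```

With 170 enrolled pupils and a ratio of 17 pupils per teacher, the sketch projects 10 teachers, split across the adult age groups according to the assumed teacher age distribution.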

Stratifying contacts by rural and urban areas
We stratified the age- and location-specific contact matrices according to rural and urban areas using the rural and urban population age compositions for all geographical regions of the world from the United Nations Population Division [36] (see [37] for urban and rural classification). The nationally-representative DHS household surveys additionally provide data for rural and urban areas [31], allowing us to derive rural-urban household age matrices. We compare the population age compositions and household age matrices in rural and urban settings of countries with stratified household data (S1 Text).
We assessed the age-specific labour force participation rates by rural and urban regions from ILO [38]. Using the differences in rural and urban schools' pupil-to-teacher ratio from OECD [39], we construct rural and urban school population matrices. These differences were available for 36 countries, and we assumed the OECD average for the regions without data. We also compare the mean total number of contacts among children (0-9-year-olds) and older adults (60-69-year-olds), as well as the basic reproduction number in rural and urban settings.

Comparing synthetic matrices to empirical contact matrices
We extracted data from all contact surveys listed in the Zenodo social contact database [ We then compared each element of the empirical matrices with our synthetic matrices. We also compared the proportion of contacts in "Other" locations, since this was the only setting not directly informed by local data (other than population age structure) in the synthetic matrices. To understand potential sources of differences between the empirical and synthetic matrices as well as between empirical matrices from different regions, we extracted details of how each survey was conducted from the original publications. Table 1 summarises the changes between the construction of the 2017 and 2020 synthetic matrices. Analyses were done in R version 3.6.2 [51], and the code is deposited at https://zenodo.org/record/4889500 [52].

Impact on modelling of interventions
We compare the difference in reduction of COVID-19 cases between using the empirical and synthetic matrices in models of COVID-19 epidemics in ten geographical regions (China, France, Hong Kong SAR, Kenya, Peru, the Russian Federation, South Africa, Uganda, Vietnam and Zimbabwe) using an age-stratified compartmental model [13,25]. We model an unmitigated epidemic and three intervention scenarios: 20% physical distancing, 50% physical distancing, and national lockdown. In all intervention scenarios, we assume a 50% reduction in transmission from individuals with clinical symptoms through self-isolation. In addition, we assume the following: (i) 20% physical distancing: 20% reduction in transmission outside of the household; (ii) 50% physical distancing: 50% reduction in transmission outside of the household; (iii) national lockdown: the pooled mean reduction in setting-specific contacts (i.e. at home, school, work, and other places) as observed in lockdowns implemented in several countries during the COVID-19 pandemic [53][54][55][56][57]. We considered six contact matrices when modelling the interventions for the COVID-19 pandemic: the empirically-constructed contact matrices at the study-year and adjusted for the 2020 population, the 2017 synthetic matrices, and the updated synthetic matrices at the national, rural, or urban settings.
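One simple way to represent the physical distancing scenarios is to scale the non-household components of a setting-specific contact matrix, as in the sketch below. It assumes that a reduction in transmission outside the household can be represented as the same proportional reduction in work, school and other contacts; the toy matrices and the function name are illustrative and this is not the transmission model of [13,25].

```python
import numpy as np

def distancing_matrix(setting_matrices, reduction):
    """Combine setting-specific matrices, scaling all non-household contacts.

    setting_matrices: dict with keys "home", "work", "school" and "other".
    reduction: fractional reduction outside the household (0.2 for 20%).
    """
    scale = 1.0 - reduction
    non_home = sum(m for name, m in setting_matrices.items() if name != "home")
    return setting_matrices["home"] + scale * non_home

# Toy two-age-group matrices; values are illustrative only.
settings = {
    "home": np.ones((2, 2)),
    "work": 2.0 * np.ones((2, 2)),
    "school": np.ones((2, 2)),
    "other": np.ones((2, 2)),
}

baseline = distancing_matrix(settings, 0.0)
distancing_20 = distancing_matrix(settings, 0.2)
distancing_50 = distancing_matrix(settings, 0.5)
print(baseline[0, 0], distancing_20[0, 0], distancing_50[0, 0])
```

A lockdown scenario would instead apply a separate scaling factor to each setting, which is a small extension of the same idea.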
More details of the model can be found in sections A.7 and B.5 of S1 Text.

Results
Twenty-five geographical regions were added to this study compared to the 2017 study. We also updated the population demographic data used for all countries including Namibia, Syrian Arab Republic, Republic of South Sudan, Kuwait, and Vanuatu, where the proportion of individuals aged > 70 years was previously not recorded. The 11 contact surveys used to generate the empirical contact matrices, covering 11 geographical locations, adopted varied methods (Table 2). The surveys differed substantially from each other and from the original POLYMOD survey in sampling frames and survey methodology. Dodd et al. [29] measured social contacts among adults in South Africa and Zambia. Three surveys were conducted in exclusively rural regions [41,47,49] (including one in a remote highlands region [47]), three other surveys were conducted only in urban regions [43,44,48], and the remaining five surveys were conducted in a variety of urban and rural settings [29,30,45,46,50]. Although most studies adopted random or stratified sampling to recruit their respondents, a handful included convenience [43,47,48] and quota [44,45] sampling methods in their recruitment. In most contact diary approaches, contacts are categorised as physical contacts (e.g., skin-to-skin contacts) and nonphysical contacts (e.g. two-way conversations with three or more words in the physical presence of another person) [20]. The studies were roughly evenly split between those that asked respondents to fill in surveys retrospectively [29,41,44,47,50] and those that did so prospectively [30,43,45,46,48,49].
The estimated proportions of contacts in other locations from the POLYMOD contact survey largely match analogous figures in empirical contact studies from the five geographical locations that report this (Shanghai and Hong Kong SAR, China; the Russian Federation; Peru; and Zimbabwe) but are higher than those from France for most ages (Fig 1). The proportion is slightly higher in the synthetic matrices for adults (i.e., 20-40-year-olds) in Shanghai, Hong Kong and the Russian Federation, and slightly lower for older individuals (i.e., >60-year-olds) in Peru, but all other ages match closely.
The pronounced diagonals observed in all contact matrices are matched in the synthetic matrices (Figs 2 and 3), as are the secondary diagonals indicating the occurrence of intergenerational mixing. The updated synthetic contact matrices show close similarities to empirical matrices (median correlation between normalised synthetic and empirical matrices 0.82, interquartile range 0.66-0.84). In most geographical regions, both matrices are similar in terms of symmetry. However, there are a few places such as Zimbabwe and China (Shanghai) where the synthetic matrix is more symmetrical than the empirical matrix, as the latter shows more weight above the diagonal (young people report more contacts with old people than vice versa). The degree of symmetry of both synthetic and empirical matrices in each region is compared in Table E in S1 Text.
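The comparison between normalised matrices can be sketched as follows. The sketch assumes a Pearson correlation over all matrix entries and complete matrices with no missing cells; the text does not specify the correlation measure in this section, and the example matrices and function names are invented.

```python
import numpy as np

def normalise(matrix):
    """Scale a contact matrix so that its dominant eigenvalue equals 1."""
    dominant = np.max(np.abs(np.linalg.eigvals(matrix)))
    return matrix / dominant

def matrix_correlation(a, b):
    """Pearson correlation between the entries of two normalised matrices (assumed measure)."""
    return np.corrcoef(normalise(a).ravel(), normalise(b).ravel())[0, 1]

# Toy three-age-group matrices; the second is just a rescaled copy of the first.
empirical = np.array([[5.0, 1.0, 0.5],
                      [1.0, 4.0, 1.0],
                      [0.5, 1.0, 2.0]])
synthetic = 3.0 * empirical

norm_empirical = normalise(empirical)
rho = matrix_correlation(empirical, synthetic)
print(rho)
```

Normalising by the dominant eigenvalue removes differences in the overall number of contacts, so the comparison reflects the shape of the mixing pattern rather than its scale; a rescaled copy of a matrix therefore correlates perfectly with the original.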
We reconstructed the empirical household age structures for the POLYMOD and DHS countries with high fidelity (median correlation between the observed and modelled household age matrix (HAM) 0.92, with an interquartile range 0.85-0.95) (see S1 Text sections A.2.2 and B.1 for details). The differences in the population and household age composition by rural and urban settings are presented in section B.2 in S1 Text.
For many of the low-income countries (LIC), a larger mean number of contacts among children (i.e., 0-9-year-olds) was observed in rural settings than in urban settings. However, in high-income countries (HIC), urban settings had a larger mean number of contacts among children (Fig 4). Among HIC, the basic reproduction numbers in rural and urban settings are positively correlated (r = 0.73, 95% confidence interval: 0.52-0.84).
The choice of using synthetic or empirical matrices did not make a large difference to the infection attack rate for an unmitigated epidemic (Fig D in S1 Text), or to the overall number of severe COVID-19 cases predicted in a mathematical model of SARS-CoV-2 transmission and disease across the three physical distancing interventions (Fig 5 and Fig E in S1 Text). Where there were discrepancies, their relative magnitude differed between countries. Differences were more marked in specific age groups (e.g. older people in Hong Kong SAR, Kenya, Peru, Uganda, Vietnam and Zimbabwe; 10-20-year-olds in China; 20-24-year-olds in Russia). The largest age-related differences could potentially be attributed to particular features of empirical survey design such as missing (Peru, Russia) or aggregated (Kenya, South Africa, Uganda, Vietnam) age groups, mode of questionnaire chosen by participants (Hong Kong SAR) and survey administration during school holidays (Zimbabwe) (see Table D in S1 Text for details).

Discussion
Social mixing patterns have not been directly measured in most countries or regions within countries, particularly in low- and lower-middle-income settings. Synthetic contact matrices provide alternative age- and location-specific social mixing patterns for countries in different stages of sociodemographic and economic development [22]. The synthetic contact matrices presented here were derived by the amalgamation of several data sources and methods: (i) integration into a Bayesian hierarchical framework of age- and location-specific contact rates from eight European countries from the POLYMOD contact study; (ii) construction of age-structured populations at home, work, and school in many non-POLYMOD countries by combining household age-structure data from the POLYMOD study and DHS (which include mostly data from lower-income countries), socio-demographic factors from the UN Population Division and various international indicators; and (iii) projection of age-structured populations at home, work, and school and age- and location-specific contact matrices to other non-POLYMOD and non-DHS countries. Both empirical and synthetic contact matrices capture age-assortativity in mixing patterns; the pronounced primary diagonal highlights that

Figure caption: Comparison of the normalised empirical and synthetic age-specific contact matrices in five geographical regions. The empirical matrices collected from contact surveys, modelled synthetic contact matrices, and the scatter plots of the entries in the observed (x-axis) and modelled (y-axis) contact matrices are presented. The correlation between the empirical and synthetic matrices is shown. Each matrix is normalised such that its dominant eigenvalue is 1. To match the population surveyed in the empirical studies, the contact matrices from rural settings of Kenya and Peru are presented, and the contact matrix from urban settings of China is presented. No data are available in the grey regions.

This paper provides a substantial update and improvement to previous synthetic matrices published in 2017 (Table 1). Improvements in the availability of demographic data globally have enabled us to provide validated approximations to age- and location-specific contact rates for 177 geographical regions covering 97.2% of the world's population, compared to 152 geographical regions covering 95.9% previously. Household data from 34 additional LMICs were included in the revision. We have also used the most recent data to build the working and school-going populations. We have extended the method to project contact patterns in rural and urban settings using country-specific urban and rural data. We find a higher positive correlation in mean contact rates and basic reproduction number in rural and urban settings of HIC, owing to the smaller rural-urban differences in these countries. Moreover, when assessing the consistency of results under different mixing assumptions (empirical and synthetic), we observed small differences in the modelled reduction in number of cases across the three physical distancing interventions for the COVID-19 pandemic.
The synthetic matrices provide consistency for inter-country comparisons since they are based on common datasets. This is challenging to achieve through empirical data collection (see Table 2). For such studies, surveying across the whole population poses several challenges. Establishing a sampling frame and obtaining a sample representative of an entire country's population is expensive and in some regions logistically challenging, so researchers often restrict studies to a particular subpopulation. For instance, many recent empirical contact studies only represent certain subregions of countries rather than entire countries. Sometimes surveys rely on nonprobability sampling techniques [43-45,47,48], e.g., convenience and quota sampling, when probability sampling techniques are not feasible. Paper or online self-reported contact diaries are largely used in social contact surveys. Compared to less common face-to-face interviews, respondent-filled contact diaries have a less demanding data collection procedure but may report a lower response rate [21,58]. Zhang et al. [43] found significantly more contacts documented by telephone interview than by self-reporting in Shanghai, China. In addition, contact diaries can be administered prospectively or retrospectively (Table 2). In Hong Kong, prospective surveys have been shown to be less prone to recall bias than their retrospective counterparts [44], but it is often more challenging to find willing participants for prospective surveys. However, a study in Belgium [59] found no appreciable difference between retrospective and prospective surveying.
Other methods, e.g., proximity sensors and phone-based GPS trackers or Bluetooth scanners, have also been employed to measure mixing patterns between individuals [60][61][62][63] and are forming part of many countries' contact tracing efforts during the COVID-19 pandemic [64], though most have been implemented to protect users' privacy by storing data with the user rather than centrally. When we compared our synthetic matrices with empirical contact matrices from 11 studies using contact diaries, we found broad consistencies between findings from the two approaches. However, there were also differences which might reflect the heterogeneity in methods used to collect empirical data.
Another consideration affecting both synthetic and empirical matrices is that they change over time. Estimating synthetic matrices relies on the POLYMOD contact survey administered more than a decade ago. Another larger contact survey, BBC Pandemic [60,65], conducted in the UK, used mobile phone-based GPS tracking instead of diary-based surveys and reported a decrease in contacts among adolescents compared to POLYMOD, which may reflect substitution of face-to-face contacts with electronic communication in this age group. Moreover, in addition to the rural-urban environment, age- and location-specific contact patterns could vary by socioeconomic conditions within countries. More differences are expected as countries implement physical distancing measures to mitigate the COVID-19 pandemic. The COVID-19 pandemic has affected contact patterns, in particular how we come into contact with one another, whether through non-pharmaceutical interventions or reactive behavioural changes. Baseline expected contact rates, such as those inferred here, are critical for determining the amount of change in contact rates in response to the pandemic.
Understanding the impact of the COVID-19 pandemic on contact patterns requires a detailed analysis of contact surveys conducted during the pandemic, taking into account baseline contacts and the intensity of non-pharmaceutical interventions. Future studies are also needed to quantify the possible long-term behavioural changes.
Synthetic and empirical matrices have complementary strengths and limitations. Empirical contact patterns are dependent on the study design, the study population, and when the survey is administered. The synthetic contact matrices are constructed using proxies of contacts such as population and household age structures and country characteristics. However, the datasets used to develop these proxy measures (notably population age structure and DHS data) are generally much larger and more nationally representative than most empirical contact studies. To assess the robustness or consistency of the results under different mixing patterns, modellers should consider using multiple contact matrices constructed using different methods for sensitivity analyses.

Conclusion
In this study, we provide synthetic contact matrices for 177 geographical regions by updating our previous matrices with larger and more recent datasets on population age structure, household, school and workplace composition. The synthetic contact matrices reproduce the main features of the contact patterns in the out-of-sample empirically collected contact matrices.