Estimating household contact matrices structure from easily collectable metadata

Contact matrices are a commonly adopted data representation, used to develop compartmental models for epidemic spreading, accounting for the contact heterogeneities across age groups. Their estimation, however, is generally time and effort consuming and model-driven strategies to quantify the contacts are often needed. In this article we focus on household contact matrices, describing the contacts among the members of a family and develop a parametric model to describe them. This model combines demographic and easily quantifiable survey-based data and is tested on high resolution proximity data collected in two sites in South Africa. Given its simplicity and interpretability, we expect our method to be easily applied to other contexts as well and we identify relevant questions that need to be addressed during the data collection procedure.


Introduction
Infectious diseases such as COVID-19 and influenza are transmitted through close proximity contacts [1] and the modeling thereof is a problem of great interest for public health.The design of effective non-pharmaceutical interventions to mitigate the epidemic spreading often relies on models capable to predict the future or to reconstruct the past of the epidemic's state, see for instance [2][3][4][5][6].Households represent the minimal unit of disease transmission and play a fundamental role in determining the evolution of a viral spread [7].Empirical evidences suggest that, especially at the household level, the commonly adopted homogeneous mixing hypothesis is insufficient to faithfully explain contagion [8][9][10].On the contrary, it is necessary to account for age-dependent contact matrices that represent the diversities -across different age classes -in the frequency of contacts as well as in the transmission parameters [11][12][13][14][15].
Contact matrices are generally estimated through surveys in which the participants have to self-report their contacts in terms of number, duration and (presumed) age of the interacting individual [15][16][17][18].Known limitations of this technique include under-reporting of contacts and overestimation of their durations [19,20].Determining household contact matrices (HCM) is resource-intensive, hardly scalable and technically challenging, especially in low-resource sub-Saharan African countries with high infectious diseases burden and where the data collection is still very limited [21][22][23][24][25].Consequently, a growing attention is devoted to theoretically model HCM.Some of the most popular models to estimate contact matrices rely on the demographic properties of the population under study [17,26], eventually taking the setting (e.g.school, work, home) in which the interactions take place into account.These models assume that the number of contacts between age groups approximately scales as the product of the two population sizes involved, i.e. the number of all possible pairs.In [16] the authors further considered how to make estimates of contact matrices available in countries where the mixing patterns were not directly estimated.More recently, [27] introduced generalized contact matrices in which socio-economic factors are included as well.The authors propose a simple model inducing assortative mixing that is pervasively observed in real-world data.
Here we consider HCM obtained from proximity sensors, encoding the sequence of contacts among a group of selected participants with high resolution in space and time.The proximity sensors are developed by the SocioPatterns collaboration (sociopatterns.org,[28]) and allow us to study and model human dynamics [21,[29][30][31][32][33] and directly estimate HCM by aggregating individuals' contacts across time.We analyze the data collected during the PHIRST study [34,35], a 3-year long experiment conducted in South Africa, designed to provide reliable data-driven guidance to limit viral transmission [34,[36][37][38][39][40][41][42].We show that, although demographic properties are determinant in shaping the HCM, they are insufficient to accurately capture the contacts structure and further age-dependent parameters must be introduced to model the higher sociability typically observed among young people [43].Our parametric model can be calibrated with surveys but, unlike the direct estimation of the full contact matrix, they introduce several advantages.Firstly one only needs to report one's age and not the age of the other interacting individuals, making the estimation process more reliable by design.Secondly, the number of parameters to be estimated scales linearly with the number of age bins (and not quadratically) and the binning itself can be chosen a posteriori.Our method can thus be seen as a reliable compromise between a parameter-free demographic model and a direct estimation of the contact matrix from surveys.Testing our results on the high-resolution measurements, we show that one can approximate the HCM with a cosine similarity equal to 0.96 and 0.98 in the two sites.

Data descriptive statistics
We now provide an overview of the data collection strategy, as well as some basic descriptive statistics.

Data collection
The PHIRST study was a prospective household cohort study described previously in [34,38].We enrolled a cohort of households 2018 at two sites in South Africa (urban: Klerksdorp, North West and rural: Agincourt, Mpumalanga) and followed households Data collection schedule for the 60 selected households.Each row corresponds to a household with the rural site on the left and the urban site on the right.Time is displayed on the x axis and dates are reported in the day/month format.Vertical gray lines correspond to the beginning and end of each deployment.A black dot indicates that at least one contact was measured, while a white one that no contact was recorded on that day. up for 8 to 10 months.Wearable proximity sensors were deployed for 10 to 14 days to all consenting household members to measure high-resolution household contact patterns during three periods of the year.Sensors were worn in PVC pouches on the chest or on a lanyard.Participants were requested to wear the sensor on in the morning, keep it on the entire day (even when leaving the home), take it off at night and store it separately from other household member's sensors.Not all participants felt comfortable wearing sensors outside of the home and instead took sensors off when not at home.Participants were requested to complete a diary to indicate the times the sensor was put on and taken off during the day.Twice a week, the staff visited each household and reminded participants to wear the sensors, monitored if all sensors were still working, and replaced batteries where sensors had stopped working.After at least a ten-day deployment, sensors were collected at the next routine household visit of study staff to the household and taken back to the study office where batteries were removed and data was downloaded from the sensors.After the data cleaning procedure, detailed in Section S.1, our dataset is composed of 307 individuals subdivided into 60 households.For consistency, we choose to consider only households for which the data quality was sufficiently high in all three deployments.The exclusion can be due to the displacement of some individuals or to technical problems with specific sensors.As discussed in the supplementary material, the cleaned dataset is representative of the original both in terms of size and age distributions.Figure 1 summarizes the data collection schedule.

Ethical approval
All experiments were performed in accordance with relevant guidelines and regulations: ethics permission to conduct the experiment was received from the Wits Human Research Ethics Committee (Medical) (ethics reference no.150808) as well as the Mpumalanga Provincial Research & Ethics Committee.Written informed consent was sought and received from all participants or their caregivers.Properties of the measured data: a: normalized contact matrix across the three deployments.The color code refers to the values of the logarithm of R counts whose entries are proportional to the ratio between the number of contacts and the number of possible interacting pairs, setting the mean of R counts to 1.The two axis correspond to the age groups and the number reported indicates the highest age of each group.b: contact duration distribution expressed as number of seconds of interaction across the three deployments in logarithmic scale.

Contact matrices
In this section we describe the properties of the contact matrices as measured by the proximity sensors, after having provided some formal definitions.We thus consider a total of 180 HCM.With the notation C, S we refer to the contact matrices storing the counts/time of interaction between pairs of age groups respectively, or, more precisely C ab = number of contacts per day between a and b, S ab = total time in contact per day between a and b.

Definitions
These matrices should be compared with their expectation, i.e. with the contact matrix obtained assuming a given household line-up and that people interact at random.This is given by [26] 1. Contact matrix similarity across the deployments.Cosine similarity between the measured contact matrices R C (left) and R S (right) in the three deployments.
where Φ a is the number of people in the age group a in a given HCM; ρ = a Φ a is the total number of people and δ ab is the Kroeneker delta (equal to 1 is a = b and equal to 0 otherwise).For a set X of HCM, we define C (X ) , S (X ) , T (X ) as the average of the respective matrix over all X and R where γ (X ) is a constant to impose that the average of R (X ) C equals one.In an analogous way, we define R (X ) S replacing C with S. In words, the entries of R (X ) exceed one for the pairs that interact more than expected and are below one otherwise.If a pair cannot have interactions, we conventionally set R (X ) = 1.To simplify the notation, in the remainder we drop the index X .

Properties of the measured matrices
Given that we considered the same set of households across the three deployments, changes in the HCM structure can mainly be amenable to a seasonality effect.Table 1 precisely shows the cosine similarity between R C (left) and R S (right) for X 1 , X 2 , X 3 being the set of all households in the three deployments.The table reports high similarity values for R C , suggesting that the structure of the contact matrix does not vary a lot across the three deployments.Smaller values are instead obtained by R S implying that the seasonality effect majorly involves the duration (rather than the structure) of the contacts.This observation agrees with the distribution of the individual contact durations, obtained from approximately 10 5 proximity measurements shown in Figure 2b which follows a broad distribution, as expected [44].This distribution broadens in the third deployment when south-African winter is approaching.More quantitatively, we computed the 99 th percentile for the three distributions that is approximately 12 minutes in the first deployment, 27 in the second and 60 in the last.Figure 2a shows instead the matrix log(R C ) across the three deployments, evidencing that younger age groups tend to interact more, regardless of the age group they are interacting with.Based on these observations, we attempt to model the matrix C whose behavior is more predictable than S. Given the result of Table 1, the deployments are treated as three independent, equally reliable measurements of the HCMs.2), while the blue curve corresponds to the second order model of Section 3.2.b, c: correlation between the fluctuations of the activity δ (u) , the group average degree δ (η) and the presence of a major occupation outside the house δ (y) .The quantities δ a,c are defined in Equation ( 3).The Pearson correlation coefficient r is reported in text.

Main result
We introduce two parametric models to approximate the HCM that combine three age-dependent parameters: the number of individuals per age group, the in-house hourly presence and an intensity of activity factor.All the parameters involved in the model only depend on a single age class and not on the interactions between pairs of age classes, as it is commonly required in self-reporting surveys.This allows us to decrease the number of parameters to be estimated from order of n 2 age to n age .We here propose some example of questions to estimate the in-house hourly presence and the intensity of activity factor.
• How much time do you typically spend at home in each hour of the day?
• How much of this time do you typically spend in isolation?
• How many face-to-face interactions do you have per day?
As we will see in the remainder, these questions permit to calibrate the parameters of our model, allowing one to obtain a more faithful representation of contact matrices than the one obtained from purely demographic models.In Section 3.3 we describe some practical implications of our results and the relation to the questions listed above.

A first order model for household interaction
In this section, we define a parametric model to approximate the contact matrix C, as measured by proximity sensors.All matrices here refer to sets of HCM but we drop the index X to keep a light notation.Let T be the matrix defined in Equation (1).We define CT , an approximation of C, as where u ∈ R nage is a set of parameters that represent the activity of each age group and '•' denotes the entry-wise Hadamard product.The entries of this matrix are ( CT ) ab = T ab u a u b and a large number of interactions are expected when many members are present (large values of T ab ) and when they correspond to highly active age groups, such as [0 − 4, 5 − 9], as per Figure 2b.

Model validation
We deploy the following steps to test our model, as detailed and motivated in Section S.2.We independently randomly sample 2500 sets X of 8 HCM without replacement out of the 180 available.For each sampled X we compute the vector u that best approximates C, minimizing a modified Canberra distance [45] between the measured and the estimated matrix, as described in Section S.2.The entries of this vector contain the activity of each age group for the set X .Figure 3a displays the histogram of the cosine similarity between the approximation CT and the measured matrix C and evidences a good agreement between the two matrices with a cosine similarity equal to 0.9 or larger for 53% of the data.This similarity is of the same order of the one observed across the three deployments and reported in Table 1. Figure 3 further shows the same histogram for T being used as an estimator of C.This purely demographic model is much less accurate and reaches a cosine similarity greater than 0.9 for only 7% of the data and 50% of the data have a similarity greater or equal to 0.75.

Interpretation of the parameters
Besides the goodness of the approximation itself, our main interest is to assess whether the vector u can be estimated from easily observable quantities.To do so, for each sampled X we further compute the vector η ∈ R nage .Its element η a is the number of daily interactions per individual, averaged over all individuals in a given age group a. Intuitively, u and η should correlate: a higher activity has to be observed when people are more active.Note that η aggregates all individual's contacts and is oblivious of the age group binning.We divide the sets X according to their activity vector representation u into k = 4 groups with a hierarchical clustering algorithm.For each x ∈ {u, η}, we then write the value corresponding to age a and class c as where xa is the average over the 4 groups, and δ a,p are the fluctuations.Figures 3b shows the scatter plot of the fluctuations of δ (u) and δ (η) , evidencing a strong correlation with a highly significant (p-value less than 10 −3 ) Pearson coefficient of 0.85.This analysis suggests that the measured contact matrix can be estimated with a high precision from aggregated (hence more easily collectable) data being the average number of contacts per individual in the same age group.We now introduce a further parameter y that is even more easily observable than η and has a weaker but still strong correlation with u.Specifically, the entries of y ∈ [0, 1] nage indicate the fraction of people for each age group having an occupation outside the house requiring at least three hours a day.This quantity is expected to be negatively correlated with u, since lower activities should be observed when people spend more time outside the household.Repeating the same procedure detailed for η, we obtain Figures 3f showing indeed that high values of u a are obtained for low y a , as expected (see the red squares for the group [40 − 49]).The correlation between the fluctuations of u and y is reported in Figure 3c, reaching a significant Pearson coefficient of −0.65.We underline that y is a very aggregated quantity that does not directly involve contacts.The first row, in blue, corresponds to Agincourt, the rural site, while the second, in purple, to Klerksdorp, the urban site.The first column shows the matrix C aggregated over the three deployments, as measured by the proximity sensors.The second column is the corresponding random encounter matrix T .The third and the fourth are the estimates obtained by our first and second order models, respectively.All matrices are normalized by the empirical average of their entries.
We now discuss a refined model with respect to Eq (2) that keeps simultaneously into account the activity and the time spent at home.We show that this model produces better estimates of the contact matrices and can be conveniently used to predict the HCM originally excluded from our study.

A second order model for household interaction
In Equation (1) we introduced the matrix T that encodes a purely demographic interaction model in which a higher contact rate is entirely explained by a higher number of interacting individuals.In practice, however, contacts can happen only when people are in the same physical space.To model this effect, we propose an extension of T , that we denote with P .Let v i ∈ {0, 1} 24 be a binary-value presence vector of i, denoting the presence in the house for each hour of the day.The definition of P then reads where V a is the set of all individuals in the age group a.Note that if v i,t = 1 for all i and all t, the definition of P corresponds to the one of T .The scalar product between v T i v j quantifies the time in which i and j had simultaneously contacts with members inside the household.If it equals zero, then there is no chance that i and j got in contact at all.In other words, P predicts the contact rate assuming people get in proximity at random, but keeping into account that people are not always and simultaneously inside the house.We generalize the model of Eq (2) replacing T with P and obtaining CP .Practically, the proximity sensors do not provide us with the information of whether or not an individual is at home in a given moment, but only if it is interacting with another household member.For each individual we then construct a binary indicator on whether or not he/she interacted with someone in a particular hour of the day during the deployment and use this as a proxy for v.

T
CT CP Agincourt 0.83 0.95 0.96 Klerksdorp 0.89 0.95 0.98 Table 2. Goodness of the contact matrix estimation for different methods.The score is reported in terms of cosine similarity and the naming is consistent with Figure 4 which this table refers to.

Model testing
The blue histogram of Figure 3a shows the cosine similarity between the actual and estimated contact matrices obtained using P .A clear gain in accuracy is achieved, obtaining a cosine similarity is greater than 0.9 for 75% of the data.
We finally test the goodness of our model for the two sites separately on all (householddeployment) valid pairs, hence also those that were initially excluded because of quality issues in some (but not all) deployments.We use as u its average realization over the 2500 samples and compare the result of the predicted matrix T, CT and CP with the measured one (Figure 4), considering the two sites separately.The cosine similarity scores reported in Table 2 provide and striking evidence of how contact matrices are approximated with high precision using few age-dependent parameters.

Practical implications
Let us briefly discuss some implications of our results and suggest how these could be translated into practical recommendations for data collection.Survey based estimations are, to-date, the most common and reliable way to estimate contact matrices.This method, however, has some notable limitations -that we discussed in the Section 1and would benefit from the design of simpler questionnaires.We highlight that one can accurately estimate HCM from self-reported quantities that are, by design, more easily and reliably estimated.Our model combines the probability that two individuals meet with an age-dependent activity driven model [46].
We suggested some examples of questions that can be formulated to calibrate our model.For instance, the question "How much time do you typically spend at home in each hour of the day?", can be used to quantify the vectors v of Equation ( 4), needed to obtain P .The similarity of these vectors gives already a good estimation of the probability of interaction of the household members.Even if our experiment focused only on the household contacts, we envision that this approach can be directly extended to other settings, designing context-related contact matrices as done in [26].Moreover, one can think of providing a finer estimation of v considering a multi-day average, so that v t ∈ [0, 1] is a probability to be at home (or, more generally, in a given place) at time t.The question "How much of this time do you typically spend in isolation?" then can allow one to re-weight the entries v t to account for an actual probability of encounter.The last question "How many face-to-face interactions you have per day?" is an example of how one can quantify an individuals' activity rate.Given these estimates, the age parameters are obtained simply aggregating them according to the relevant age-group to obtain the activity vector u.
Our result brings an empirical evidence that most of the structure of contact matrices measured with high-resolution proximity sensors can be reliably captured with a simple statistical model combining behavioral parameters with demographic ones.While it comes as no surprise that a generalization of the matrix T would lead to better estimates, the most important aspects of our results are listed as follows: • Simple, environment-independent models can accurately estimate HCM.The high quality and size of the PHIRST dataset gave us great insights into the problem of HCM estimation.Backed by these empirical data, not only can we say that the proposed parametric model generally improves the estimation accuracy, but we can numerically quantify it, observing very high level of agreement with the HCM obtained with the costly high resolution measurements • Our proposed models are highly interpretable.We expect its parameters to be easily estimated with surveys, addressing questions such as those listed in Section 1.
We expect this to be one of the significant outcomes of our research as we identified some practical questions to calibrate our model, bypassing proximity sensors.
• All parameters are aggregated by age group and involve the behavior of single individuals and do not depend on the age class of other members.This aspect naturally reduces the number of parameters of the model, making the estimation process simpler and addresses the important requirement for surveys that the questions asked should have a simple answer.
The questions suggested in Section 1 constitute an example of possible ways to estimate the activity parameters and are limited to the quantities that turned out to provide a significant explanation of HCM in our experiment setting.Other metadata (such as the number of rooms in the house, the wealth status or the distinction between the rural and the urban site) could potentially be informative to explain the HCM structure, even if they were not in our analysis.
The main limitations of our methodology are related to the quality and nature of the available data.The first concern is related to the time-dependent data collection component which we essentially neglected here.When dealing with contact matrices, it is customary to distinguish between weekdays and weekends.In our measurements, the first and third waves of measurements in households were made asynchronously.After the cleaning procedure, it emerged that, as a consequence of the adoption of this choice for the scheduling of data collection in the field, weekdays and weekends are not evenly distributed among households and changes in the measured HCM are potentially associated with this effect.To cope with this problem, when dealing with asynchronous measurements it would be preferable to consider the same days of the week for all households.A closely related concern is that we have considered all three deployments as equal, even though they correspond to rather different periods in the year.The data sparsity and quality did not allow us to detect any significant change in the seasonality of the contact patterns, except for the duration of contact distribution shown in Figure 2c.It is nonetheless a very reasonable assumption that the contact behavior changes during the year.Our suggestion to investigate individuals' behavioral habits can easily overcome this problem, designing time-dependent expected matrices that could adapt even to diverse scenarios such as, during a quarantine.
In conclusion, our study proposes a parametric model to estimate contact matrices with high accuracy.It improves over the purely demographic models in terms of accuracy and over the purely survey-based approaches in terms of simplicity of the data collection.
July 25, 2023 10/16 Given its simplicity and interpretability, we envision that our framework can be adopted to estimate contact matrices beyond the household setting.As a practical application, our results can impact the strategy to design the surveys currently adopted to quantify social contacts to mitigate the Covid19 and similar epidemics [47,48].
S Supplementary information: Appendix S.1 Data collection and pre-processing Proximity data are measured with the SocioPatterns sensors that we here introduce, addressing the interested reader to [28] for a more detailed reference.Their functioning is based on the emission of low-power signals.Participants are asked to wear the sensor on their chest, so that when they engage in a face-to-face interaction with another participant, the respective sensors can exchange packets of information with a frequency that does not exceed one packet per second.A contact is measured if, in the time-span of 20 seconds, two sensors exchange at least one packet, recording the unique identifier of the interacting sensor, the time at which the interaction occurred and the attenuation of the signal from the sender to the receiver.This attenuation is related to the distance between the two sensors and can be used to filter suitably-defined close-range proximity relations.Additionally, each sensor periodically records some status properties that log metadata and diagnostic information.Among these, an accelerometer allows one to know every 15 minutes if the sensor is moving or not.Given the sensitivity of the accelerometer and the time-scale at which it operates, one can assume that if the sensor is still, then it is not worn.The cleaning procedure is summarized as follows: 1.All contacts measured by non-moving sensors are removed: this is to avoid including spurious contacts between sensors that are, for instance, kept inside a drawer 2. Contacts are filtered and only those with a suitable attenuation threshold.This threshold corresponds to an interaction between two sensors that are approximately at 2 meters, even if this is a context-dependent relation that depends on external parameters, such as, for instance, humidity.
3. All contacts happening before the beginning of the deployment (as reported in the diaries) and after its end are removed.These contacts may exist, because sensors may be collected on different dates from the ones of the planned experiment, but they are removed because sensors' use may be non-systematic, hence unreliable.Moreover, the first and last day of measurement are removed as well.During these days, very intense activity patterns are typically observed due to the interaction with the people dispatching the sensors.Since this kind of interaction deviates from the standard conditions, it is not considered.
4. The data collected by the sensors contains information on the hardware identification code.A mapping relates this identifier with the individuals' pseudonym that allows us to relate contacts and metadata.Errors at this stage make it impossible to relate contacts to people and results in the red dots shown in Figure S.1a.
5. As a minimal request, we impose that, after this cleaning procedure, a deployment can be considered valid only if it has two or more days of measurement.We found this to be a good trade-off between high quality data to work with and a sufficiently comprehensive inclusion principle.Household-deployment pairs that do not fulfill this condition are denoted in blue in Figure S.1a.
6. Finally, non-circadian activity patterns are identified.A great excess of activity during night hours was observed in three households (yellow dots of Figure S.1a) during the first deployment.This may occur, for instance, if the sensors are left in proximity on a vibrating surface: the accelerometer filter does not remove these contacts even though the sensors were not worn at that moment.
Only the households in which all three deployments led to valid measurements (all green dots in Figure S.1a) were included in our study.the age and household size histograms for the whole dataset against its cleaned version, showing that our inclusion principle did not affect either of the four distributions.

Sampling the villages
First of all, in order to devise a good approximation of HCM, it is necessary to define a suitable distance to compare them.When comparing different households, however, one has to consider that typically there are some age groups with no individuals.More formally, this means that for some age group a, Φ a = 0.In some extreme cases there is no way to consistently compare HCM because the corresponding contact matrices are complementary, i.e. the zeros one correspond to the non-zeros of the other.
To address this problem, we choose to compare groups of HCM (villages) X , i.e. small groups of household-deployment (h, d) pairs that guarantee that Φ a > 0 for all a.To build the samples X we then first select some (h, d) at random, with the constraint to achieve Φ a > 0 for all a (we sample only the pairs that can contribute to increasing the zeros entries of this vector) and then we randomly pick other pairs until the fixed size of X is reached.We empirically choose |X | = 8 because it is a good trade-off between two competing effects: if |X | is too low there is a possibility of over-representing households with elderly members that are fewer and hence more valuable to get the condition Φ a > 0 for all a; on the other hand, very large values of |X | will tend towards an "averaging" effect that leads all villages to be very similar to one-another.
where Ã is the matrix A divided by its mean (and equivalently B is B divided by its mean).The distance d C is the Canberra distance computed on the matrices Ã, B, instead of A, B, hence we refer to it as modified Canberra distance.This choice of the distance is motivated by the two following points 1.The entries of C may differ even by a factor 100 as shown in Figure 4.The cosine similarity is meaningful to quantify the proximity of two matrices but it naturally tends to give more weight to entries with a larger magnitude.For this reason it is unsuited for an optimization as it would poorly estimate the small entries of C. On the opposite, the relative distance d C gives approximately the same weight to all matrix entries and can be used for this purpose.
2. The modified Canberra distance compares a normalized version of the contact matrices because we are interested in determining them up to a constant factor.We then have for any α, β > 0, d C (αA, βB) = d C (A, B).

Occupation parameter
We here detail the strategy to determine the vectors y, η appearing in Figure 3b, c, referred to as occupation and compliance vector respectively.
In the PHIRST data collection process, the participants were asked to specify locations or activities in which they spend more than three hours a day for more than three days per week.The options to choose from included: school, university, work, pub, social clubs, hanging out with friends, street vendors and church.We then define a Boolean variable for each person indicating whether or not he/she has a major activity outside the household, i.e. if he/she answered positively to any of the questions above.The value of y a is the average of the Boolean indicator for all people of age a in X .

Fig 1 .
Fig 1.Data collection schedule for the 60 selected households.Each row corresponds to a household with the rural site on the left and the urban site on the right.Time is displayed on the x axis and dates are reported in the day/month format.Vertical gray lines correspond to the beginning and end of each deployment.A black dot indicates that at least one contact was measured, while a white one that no contact was recorded on that day.

Fig 2 .
Fig 2. Properties of the measured data: a: normalized contact matrix across the three deployments.The color code refers to the values of the logarithm of R counts whose entries are proportional to the ratio between the number of contacts and the number of possible interacting pairs, setting the mean of R counts to 1.The two axis correspond to the age groups and the number reported indicates the highest age of each group.b: contact duration distribution expressed as number of seconds of interaction across the three deployments in logarithmic scale.

Fig 3 .
Fig 3. Test of the model for household interaction.a: histogram of the cosine similarity between C and its estimators.The gray curve corresponds to the histogram over the 2500 realization of X using T as an estimator of C. The orange curve is obtained with the first order model of (2), while the blue curve corresponds to the second order model of Section 3.2.b, c: correlation between the fluctuations of the activity δ (u) , the group average degree δ (η) and the presence of a major occupation outside the house δ (y) .The quantities δ a,c are defined in Equation (3).The Pearson correlation coefficient r is reported in text.

Fig 4 .
Fig 4. Measured vs estimated normalized contact matrices in the two sites.The first row, in blue, corresponds to Agincourt, the rural site, while the second, in purple, to Klerksdorp, the urban site.The first column shows the matrix C aggregated over the three deployments, as measured by the proximity sensors.The second column is the corresponding random encounter matrix T .The third and the fourth are the estimates obtained by our first and second order models, respectively.All matrices are normalized by the empirical average of their entries.
Figure S.1b, c, d, e further show

Fig S. 1 .
Fig S.1.Raw data characteristics.a: data quality.On the x-axis we plot households, while on the y-axis the deployments.For each (household-deployment) we assign a color code: black indicates that the household did not participate; red that all household's sensors had data quality issues and did not provide valid measurements; blue that there are less than two days of measurement; yellow that a non circadian activity is observed; green none of the above.b and d: age distribution in Agincourt and Klerksdorp, respectively.The green bars are referred to the whole data-set, while the purple one only refers to the 60 households with valid measurements in all three deployments (see a). Blue dots are obtained by multiplying the height of the green bars for the fraction of the included households, that is the expected bar height, given the cleaned dataset size.c and e: household size distribution.Legends and colors follow b and d.
Given the samples of villages we now compute the value u as the result of the following optimization problemu = arg min v : v T 1=const d C C, T • vv T , where T • (vv T ) ab = T ab v a vb and d C is a modified Canberra distance.Let A, B be two symmetric matrices of size n age , then d C (A, B) = nage i=1 j≤i

Contact matrices incorporate the contacts subdivided by age groups. They are square and symmetric, of size n age , the number of age bins considered. Here the age groups are divided into [0 − 4, 5 − 9, 10 − 19, 20 − 29, 30 − 39, 40 − 49, 50+] years: the finer grain of younger ages is because of the large proportion of population in those age brackets
, shown in Figure S.1.Each HCM refers to a single household and a specific deployment. :