
Conceived and designed the experiments: FI FT MC FCB SM MA PM. Performed the experiments: FI FT MC SM MA. Analyzed the data: EZ EDF PM. Wrote the paper: FI PM. Design of the paper: PM.

The authors have declared that no competing interests exist.

Knowledge of social contact patterns still represents the most critical step for understanding the spread of directly transmitted infections. Data on social contact patterns are, however, expensive to obtain. A major issue is then whether the simulation of synthetic societies might help to reliably reconstruct such data. In this paper, we compute a variety of synthetic age-specific contact matrices through simulation of a simple individual-based model (IBM). The model is informed by Italian Time Use data and routine socio-demographic data (e.g., school and workplace attendance, household structure). The model is named “Little Italy” because each artificial agent is a clone of a real person: each agent's daily diary is the one observed in a corresponding real individual sampled in the Italian Time Use Survey. We also generated contact matrices from the socio-demographic model underlying the Italian IBM for pandemic prediction. These synthetic matrices are then validated against recently collected Italian serological data for varicella (VZV) and parvovirus (B19). Their performance in fitting sero-profiles is compared with that of other matrices available for Italy, such as the Polymod matrix. Synthetic matrices show the same qualitative features as those estimated from sample surveys: for example, strong assortativeness and the presence of super- and sub-diagonal stripes related to contacts between parents and children. When validated against serological data, Little Italy matrices fit worse than the Polymod one for VZV, but better than competing matrices for B19. This is the first occasion on which synthetic contact matrices are systematically compared with real ones and validated against epidemiological data. The results suggest that simple, carefully designed synthetic matrices can provide a fruitful complementary approach to questionnaire-based matrices.
The paper also supports the idea that, depending on the transmissibility level of the infection, either the number of different contacts, or repeated exposure, may be the key factor for transmission.

Data on social contact patterns are fundamental to designing adequate control policies for directly transmissible infectious diseases, ranging from a flu pandemic to tuberculosis, to recurrent epidemics of childhood diseases. Most countries in the world do not have such data. We propose an approach to generate synthetic contact data by simulating an artificial society that integrates routinely available socio-demographic data, such as data on household composition or on school participation, with Time Use data, which are increasingly available. We then validate the ensuing simulated contact data against real epidemiological data for varicella and parvovirus. The results suggest that the approach is potentially a very fruitful one, and provide some insights into the biology of transmission of close-contact infectious diseases.

A century after the first contributions giving birth to mathematical epidemiology, and after 20 years of fast growth since the first public health oriented contributions

A critical aspect common to all such models, is the parameterization of social contact patterns, i.e. how people socially mix with each other

In a relatively simple case, where individuals are stratified by age only, contact patterns are represented in the form of contact matrices whose entries represent the average number of contacts that individuals in age group $i$ have with individuals in age group $j$

Recently, important progress has been made in this area through direct collection of contact data by means of sample surveys

A drawback of time use data is that they usually do not give direct information about the number of social contacts of respondents, or the time they spent in contacts. They only give “marginal” information on the time individuals allocated to the various daily activities

In this paper, we follow the same line and aim to reconstruct contact and time-in-contacts matrices by simulating a suitable “minimalistic” socio-demographic individual-based model for Italy. The model is parameterized by integrating time use data from the Italian time use survey

With our approach we generate three different types of contact matrices, possibly informative of distinct aspects of the biology of transmission: (a) a matrix describing the time spent in contact (Type 1)

In addition, we extracted an adequate

The Italian Time Use (TU) survey was carried out by the Italian National Statistical Agency between 2002 and 2003

A 24-hour day, starting from 4am, is divided into 144 time slots of 10 minutes each, called “ticks”. For each tick, the respondent's diary records the type of location where the person was and the type of activity done. Due to privacy issues, records always refer to types of places and types of activities, rather than exact places and exact activities. Types of places and types of activity are given unique codes (e.g. 1 for home, 2 for office, etc. for locations; 1 for working, 2 for caring for children, etc. for activities). However, these codes are identical for every individual. Therefore, if at the same chronological time two people are both working, each in his/her own office, we have two records with the same codes, but this does not imply that they are in the same office doing the same thing. This has some drawbacks. First, there is never any clue about the purpose of the activity undertaken. For example, if in a certain tick someone reports being on the public transportation network, there is no indication of the reason for being there: it could be to go from home to the office, to take children to school before going to work, or anything else. The same applies to places, with a single notable exception: if at any time two individuals report that they are at home, and we know from other data that they both belong to the same household, we can infer that they are in the same place. This is the only case in which the partial information given by respondents can be correctly augmented.
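The diary layout described above can be sketched in a few lines of Python. This is only an illustration, not the survey's actual data format: the list-of-location-codes representation and the function names `tick_to_clock` and `same_place_ticks` are assumptions made here, and only the code 1 for "home" is taken from the example in the text.

```python
TICKS_PER_DAY = 144      # 24 hours split into 10-minute slots
TICK_MINUTES = 10
DAY_START_HOUR = 4       # diaries start at 4am
HOME = 1                 # location code for "home" (from the text's example)


def tick_to_clock(tick):
    """Return the clock time (HH:MM) at which a 0-based tick starts."""
    if not 0 <= tick < TICKS_PER_DAY:
        raise ValueError("tick must be in 0..143")
    minutes = (DAY_START_HOUR * 60 + tick * TICK_MINUTES) % (24 * 60)
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


def same_place_ticks(diary_a, diary_b, same_household):
    """Ticks at which two respondents can be inferred to share a place.

    Each diary is a sequence of location codes, one per tick (assumed layout).
    Only the "both at home, same household" case is decidable: identical
    codes elsewhere (e.g. two offices) do not imply the same physical place.
    """
    if not same_household:
        return []
    return [t for t, (a, b) in enumerate(zip(diary_a, diary_b))
            if a == HOME and b == HOME]
```

For instance, `tick_to_clock(0)` gives `"04:00"`, and two household members whose diaries both show code 1 at a tick are treated as co-located at that tick, whereas two unrelated respondents with the same office code are not.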

Finally, routine socio-demographic data on a) family size and composition

To create an artificial society that matches the one revealed by the Time Use survey, some assumptions were made. Unlike other approaches (e.g., the Portland model), whose aim is to create artificial societies that are as close as possible to a real population, we opted for an artificial society based on a “minimally” complex set of rules that is nonetheless representative of the Italian population. This seems to be a useful starting point: by considering a simple spatial structure and a minimal set of activities/locations (school and work, the household, and “other”, non-school, non-household contacts), which are those considered fundamental in basic epidemiological explanations, we avoid the need to include several extra assumptions for model parameterization. Further activities and locations can nonetheless be easily included.

Let us list the assumptions adopted. First, we restricted our model to individuals followed over an average workday. This choice sets Little Italy's population to 18,085 (artificial) individuals. We chose to ignore weekend days because they were surveyed on a different group of respondents, so additional assumptions would be necessary to link workday and weekend agendas.

During the day, agents move to and from different places. Most of the time, respondents reported being at home, in the office or at school. For the rest, they either declared that they were in more specific places (e.g., bakery, park, etc.) or that they were moving from one place to another (e.g., on foot, by car or by bus). We chose a square grid of size 150×150 as Little Italy's “environment”, in order to allocate families in single cells representing houses while leaving appropriate space for schools and workplaces. Each square in the grid is identified by a pair of coordinates. We allocated one house for each household on a random cell of the grid. House cells can contain at most 5 families.
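The house allocation step can be illustrated as follows. This is a minimal sketch under stated assumptions, not the Java/Repast implementation used for Little Italy: the function name `allocate_houses`, the `reserved` set (one hypothetical way of leaving space for schools and workplaces) and the rejection-sampling loop are all choices made here for illustration. Only the 150×150 grid and the limit of 5 families per cell come from the text.

```python
import random
from collections import Counter

GRID_SIZE = 150              # side of the square grid, as in the text
MAX_FAMILIES_PER_CELL = 5    # capacity of a house cell, as in the text


def allocate_houses(n_households, reserved=frozenset(), grid_size=GRID_SIZE, seed=0):
    """Assign every household a random (x, y) cell, at most 5 per cell.

    `reserved` is a hypothetical set of cells kept free for schools
    and workplaces; cells in it never receive a house.
    """
    free_cells = grid_size * grid_size - len(reserved)
    if n_households > free_cells * MAX_FAMILIES_PER_CELL:
        raise ValueError("not enough room on the grid")
    rng = random.Random(seed)
    occupancy = Counter()
    homes = []
    for _ in range(n_households):
        # Redraw until a non-reserved cell with spare capacity is found.
        while True:
            cell = (rng.randrange(grid_size), rng.randrange(grid_size))
            if cell not in reserved and occupancy[cell] < MAX_FAMILIES_PER_CELL:
                break
        occupancy[cell] += 1
        homes.append(cell)
    return homes
```

Redrawing on collision is cheap here because the grid offers far more capacity (22,500 cells, 5 families each) than the population requires.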

In order to host all students aged 3–18, and Little Italy's only university, we allocated schools at random on the grid.

The setting up of workplaces required a few more assumptions, since respondents only reported that they were at work during some ticks but gave no information about either the size of the company they were working for, or the number of colleagues (and in many situations, like, for instance, bus drivers, workers “share the environment” with people that are not necessarily colleagues). Therefore, we drew samples of firms from the workforce size distribution of Italian firms in cities having population size comparable to Little Italy, i.e. 10,001–20,000 inhabitants

Two aspects of the previous process are worth mentioning. First, each agent declares how much time she/he spends going, say, from home to the office by car. This time is a proxy for the distance from home to office, and the proxy must hold consistently all over Little Italy. It is not possible, for example, that agent A takes 1 time slot (10 minutes) to move by 20 cells on the grid while agent B covers the same distance in 6 time slots (1 hour) if they both declare using the car. This would mean that A's car moves 6 times faster than B's, which is possible but unlikely. We proceeded as follows. After workers are assigned to firms, a random re-assignment of houses is performed: two households exchange their houses if, in the new configuration, the actual distances between offices and houses are closer to the ones that can be inferred from their diaries. A large number of exchanges is carried out until the error cases are a negligible fraction of total workers.
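The house-exchange procedure can be sketched as a simple pairwise swap heuristic. Several details below are assumptions made for illustration and are not stated in the text: the Manhattan distance on the grid, a single travel speed `cells_per_tick`, one worker per household, and a fixed number of attempted swaps in place of the stopping rule based on the fraction of error cases.

```python
import random


def manhattan(a, b):
    """Grid distance between two cells (an assumed metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def commute_error(home, work, ticks, cells_per_tick):
    """Gap between the grid distance and the distance implied by the diary."""
    return abs(manhattan(home, work) - ticks * cells_per_tick)


def total_error(homes, works, ticks, cells_per_tick):
    """Sum of commute errors over all households."""
    return sum(commute_error(h, w, t, cells_per_tick)
               for h, w, t in zip(homes, works, ticks))


def swap_houses(homes, works, ticks, cells_per_tick=5, n_swaps=10_000, seed=0):
    """Swap the houses of two random households whenever that lowers their joint error."""
    rng = random.Random(seed)
    homes = list(homes)
    n = len(homes)
    if n < 2:
        return homes
    for _ in range(n_swaps):
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j:
            continue
        before = (commute_error(homes[i], works[i], ticks[i], cells_per_tick)
                  + commute_error(homes[j], works[j], ticks[j], cells_per_tick))
        after = (commute_error(homes[j], works[i], ticks[i], cells_per_tick)
                 + commute_error(homes[i], works[j], ticks[j], cells_per_tick))
        if after < before:
            homes[i], homes[j] = homes[j], homes[i]
    return homes
```

Because a swap is accepted only when it reduces the joint error of the two households involved, the total error never increases, and the set of house cells is left unchanged: only their assignment to households moves.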

Second, there are workers who declared not having a single workplace, for example plumbers. For them, we decided to set their moving workplace at random: each time they are about to go somewhere, the simulation chooses a random square on the grid as their next workplace. Commercial places are created on the grid and their workers are assigned to them. Students are assigned to classes in schools, according to routine data

To run Little Italy, at every tick each agent must be put somewhere on the grid. This requires each entry in an agent's list of activities to be mapped to a pair of coordinates. This, in turn, requires a detailed modelling of the agents' movements over Little Italy. Details are reported in

Little Italy was coded in Java, using Repast 3 libraries

To keep track of contacts between agents, a definition of contact was necessary. The adopted “marker” of contact was “having shared the same physical environment with someone else” (e.g. the same house, the same class at school, the same bus) during a given tick. This corresponds to a form of localized random mixing. Assume, for instance, that during a given tick there are 20 pupils aged 7 and one teacher aged 44 in a classroom. Based on our definition, each pupil has 19 contacts with people of the same age and one contact with an adult (aged 44), while the teacher has 20 contacts with 7-year-olds.
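The classroom example can be reproduced with a short counting routine. The sketch below is illustrative only; the function names and the dictionary keyed by (age of contactor, age of contact) are assumptions made here, not the data structures of the original simulation.

```python
from collections import Counter


def contacts_in_environment(ages):
    """Total contacts in one shared environment during one tick.

    `ages` lists the age of every agent present. Returns a dict mapping
    (age of contactor, age of contact) to the number of contacts,
    where everybody present is in contact with everybody else.
    """
    counts = Counter(ages)
    totals = {}
    for age_i, n_i in counts.items():
        for age_j, n_j in counts.items():
            others = n_j - 1 if age_i == age_j else n_j  # nobody contacts themselves
            if others > 0:
                totals[(age_i, age_j)] = n_i * others
    return totals


def per_person(ages):
    """Contacts per individual of each age, by age of the contacted person."""
    counts = Counter(ages)
    return {(i, j): total / counts[i]
            for (i, j), total in contacts_in_environment(ages).items()}


classroom = [7] * 20 + [44]
print(per_person(classroom))
```

For the classroom with 20 pupils and one teacher, each pupil gets 19 same-age contacts and 1 contact with the 44-year-old, and the teacher gets 20 contacts with 7-year-olds, as in the text.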

By aggregating across time ticks, matrices reporting the total number of contacts between each pair of ages were computed for the following activity/locations: household, school, work, transport, other activities. Then, by summing through activities we computed overall (i.e. including contacts through all locations) contact matrices by age, whose elements _{ij}_{ij}_{ij}_{ij}

From total matrices _{ij}_{i}_{ij}_{ij}_{ij}n_{i}_{ji}n_{j}_{ji}

Since Little Italy matrices do not offer information on contacts for the age group 0–2 years (which is not included in the Italian Time Use Survey), we integrated our matrices using Polymod data for that age group. These computations were applied to each “world”, and then the average of the ensuing matrices was taken.

We note that the different types of contact matrices considered correspond to different views of the contact process, perhaps useful to capture different aspects of the biology of transmission. The Type 1 matrix might be relevant for infections for which the time of exposure matters (for instance, infections with low transmissibility, where the probability of transmission cumulates over time). The Type 3 matrix implies that what really matters is the number of social partnerships, independently of the number of repetitions of contact episodes and of the time spent together
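The difference between a time-based and a partnership-based view can be made concrete with a small sketch. It assumes co-presence records in a hypothetical format (for each tick, a list of groups of agent identifiers sharing an environment) and computes two summaries: minutes spent in contact, in the spirit of Type 1, and number of distinct partners, in the spirit of Type 3. The exact definitions used for the published matrices, and the Type 2 matrix, are not reproduced here.

```python
from collections import defaultdict
from itertools import permutations

TICK_MINUTES = 10  # length of one tick


def contact_summaries(groups_by_tick, age_of):
    """Return (minutes in contact, distinct partnerships) by pair of ages.

    groups_by_tick: one entry per tick, each a list of groups of agent ids
                    sharing the same environment (assumed input format).
    age_of:         dict mapping agent id to age.
    """
    minutes = defaultdict(int)
    seen_pairs = set()
    for groups in groups_by_tick:
        for group in groups:
            for a, b in permutations(group, 2):   # ordered pairs, a != b
                minutes[(age_of[a], age_of[b])] += TICK_MINUTES
                seen_pairs.add((a, b))            # repeated meetings count once
    partners = defaultdict(int)
    for a, b in seen_pairs:
        partners[(age_of[a], age_of[b])] += 1
    return dict(minutes), dict(partners)
```

Two agents who share a room for a whole morning add many minutes to the first summary but a single partnership to the second, which is why the two kinds of matrix can rank age groups differently.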

The Big-Italy matrix was extracted from the IBM used to simulate the spread and control of an influenza pandemic in Italy

We compared the performances of the Little and Big Italy matrices with two other contact matrices available for Italy: a) the overall (i.e. including all reported contacts) Polymod matrix based on survey data collected in eight European countries

Recently collected Italian serological data (age range 0–79 years, sample size = 2,517) on varicella-zoster-virus (VZV)

Fitting contact matrices to serological data was performed using a standard approach

The

Given the estimate of the transmission parameter _{ij} = qm_{ij}_{0}_{0}_{ij}

In order to achieve a high degree of explanatory power of the data to be used as a benchmark of the goodness of fit of the various matrices, we also considered a flexible non parametric model, given by a constrained monotonically increasing P-splines model

We measure assortativeness in the various matrices considered using two different indices. The first one is the _{ij}_{ij}_{XY}

Contour plots of Type 1, 2, and 3 average contact matrices based on 5-years age groups (0–4, 5–9, etc) are reported (

Type 1 (left), Type 2 (center), Type 3 (right). X-axis = age of the contactors, Y-axis = age of his/her contacts.

Proportions _{ii}

| Matrix | Index 1 | Index 2 |
|---|---|---|
| Little Italy Type 1 | 0.225 | 0.316 |
| Little Italy Type 2 | 0.184 | 0.412 |
| Little Italy Type 3 | 0.195 | 0.428 |
| Big Italy | 0.094 | 0.661 |
| Polymod | 0.157 | 0.632 |
| Time-Use | 0.070 | 0.569 |

With regards to contacts between parents and their children, well evidenced in


Fit to Italian serological data for VZV by an SIR model based on the various contact matrices considered: observed vs predicted immunity profiles to VZV, by age. Dots size proportional to sample frequency of serological data.

Fit to Italian serological data for B19 by an SIR model based on the various contact matrices considered: observed vs predicted immunity profiles to B19, by age. Dots size proportional to sample frequency of serological data.

| Infection | Matrix | Transmission parameter (interval) | Deviance (df) | AIC | R_{0} |
|---|---|---|---|---|---|
| VZV | Little Italy Type 1 | 0.051 (0.047, 0.055) | 276.42 (1 df) | 447.61 | 3.14 |
| VZV | Little Italy Type 2 | 1.35 (1.29, 1.42) | 111.37 (1 df) | 282.57 | 4.94 |
| VZV | Little Italy Type 3 | 1.42 (1.35, 1.51) | 190.15 (1 df) | 361.34 | 3.43 |
| VZV | Big Italy | 12.35 (11.67, 13.09) | 101.11 (1 df) | 272.30 | 4.80 |
| VZV | Polymod | 11.37 (10.80, 11.99) | 67.34 (1 df) | 238.53 | 4.77 |
| VZV | Time-Use | 4.28 (4.09, 4.47) | 114.32 (1 df) | 285.51 | 4.11 |
| VZV | Non-parametric | | 64.30 (4.92 df) | 243.33 | |
| B19 | Little Italy Type 1 | 0.029 (0.028, 0.030) | 135.61 (1 df) | 402.11 | 1.72 |
| B19 | Little Italy Type 2 | 0.73 (0.71, 0.75) | 157.24 (1 df) | 423.74 | 2.67 |
| B19 | Little Italy Type 3 | 0.82 (0.80, 0.84) | 159.90 (1 df) | 426.39 | 1.98 |
| B19 | Big Italy | 5.39 (5.20, 5.60) | 195.99 (1 df) | 462.48 | 2.10 |
| B19 | Polymod | 5.26 (5.06, 5.48) | 202.91 (1 df) | 469.41 | 2.21 |
| B19 | Time-Use | 2.23 (2.16, 2.30) | 195.60 (1 df) | 462.09 | 2.14 |
| B19 | Non-parametric | | 81.23 (3.95 df) | 353.63 | |

For VZV, the Polymod matrix provides the best fit. The Big-Italy and the Little Italy Type 2 matrices perform better than the Time Use matrix but substantially worse than the Polymod matrix, whereas the Little Italy Type 1 and 3 matrices fit poorly. Note that the non-parametric model performs slightly better than the Polymod matrix in terms of deviance, but worse in terms of AIC, due to its larger parameterization. This suggests that the Polymod matrix represents an excellent “explanans” for VZV transmission. Disregarding the Little Italy Type 1 and 3 matrices, which fit poorly, the ensuing values for R_{0}
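The trade-off between deviance and AIC can be checked on the reported VZV figures. The sketch below assumes the usual relation AIC = deviance + 2·df + constant; the constant is not given in the text, so the code only verifies that the reported numbers are consistent with a common offset. The dictionary name and layout are choices made here.

```python
# (deviance, degrees of freedom, AIC) as reported for the VZV fits
VZV_FITS = {
    "Little Italy Type 1": (276.42, 1.0, 447.61),
    "Little Italy Type 2": (111.37, 1.0, 282.57),
    "Little Italy Type 3": (190.15, 1.0, 361.34),
    "Big Italy": (101.11, 1.0, 272.30),
    "Polymod": (67.34, 1.0, 238.53),
    "Time-Use": (114.32, 1.0, 285.51),
    "Non-parametric": (64.30, 4.92, 243.33),
}


def aic_offset(deviance, df, aic):
    """Constant left over after removing deviance and the 2*df penalty."""
    return aic - deviance - 2 * df


offsets = {name: round(aic_offset(*row), 2) for name, row in VZV_FITS.items()}
best_by_deviance = min(VZV_FITS, key=lambda name: VZV_FITS[name][0])
best_by_aic = min(VZV_FITS, key=lambda name: VZV_FITS[name][2])
print(offsets)
print(best_by_deviance, best_by_aic)
```

The offsets agree across rows up to rounding, so the penalty of about 2 × 4.92 for the non-parametric model is what moves it behind the one-parameter Polymod fit in AIC despite its slightly lower deviance.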

Things are different for B19. The Type 1 matrix provides the best fit, and overall the three Little Italy matrices perform better than the other matrices. It must be acknowledged, however, that the fit remains far from the one provided by the non-parametric model, suggesting that there is still room for large improvements in the explanation. In particular, the Big-Italy and the Time Use matrices, though clearly performing worse than the Little Italy matrices, are not worse than the Polymod matrix. The ensuing values for R_{0} range between 1.6 and 2.6. Explaining the differences between the fits of B19 and VZV is not easy, since we lack tools to globally compare two arbitrary contact matrices. Assortativeness measures, however, provide some clues. The three Little Italy matrices predict a very steep immunity profile at low ages, which then suddenly flattens to a plateau. This sudden change in regime, a pattern known to occur in the presence of strong assortativeness, allows the Little Italy matrices to better explain the B19 data, which show a sharp plateauing (though with large randomness). On the other hand, this behavior prevents the Little Italy matrices from capturing the observed VZV profile.

Finally, given that the large-scale (transport and shopping malls) contacts of the Little Italy model required several assumptions to be parameterized, it was important to check the influence of these activities/locations on the result of the fit. We therefore fitted the Little Italy matrices without taking into account such activities, i.e. relying only on household and school/workplace contacts. The results in the fit of B19 by Little Italy Type 1 and 3 matrices are reported (

| Matrix | Contacts included | Transmission parameter (interval) | Deviance (df) | AIC | R_{0} |
|---|---|---|---|---|---|
| Little Italy Type 1 | All contacts | 0.0293 (0.0284, 0.0301) | 135.61 (1 df) | 402.11 | 1.716 |
| Little Italy Type 1 | Without transportation | 0.0293 (0.0284, 0.0301) | 135.65 (1 df) | 402.14 | 1.713 |
| Little Italy Type 1 | Without transportation & malls | 0.0294 (0.0286, 0.0303) | 138.36 (1 df) | 404.85 | 1.661 |
| Little Italy Type 3 | All contacts | 0.818 (0.796, 0.842) | 159.90 (1 df) | 426.39 | 1.982 |
| Little Italy Type 3 | Without transportation | 0.826 (0.804, 0.850) | 160.70 (1 df) | 427.2 | 1.96 |
| Little Italy Type 3 | Without transportation & malls | 0.867 (0.844, 0.893) | 200.11 (1 df) | 466.61 | 1.639 |

Substantial improvements have been achieved in recent times in our knowledge of social contact patterns

The ensuing contact matrices by age were fitted, on the basis of simple transmission models, to Italian serological data for VZV and B19. Goodness-of-fit comparisons with other available contact matrices, such as the questionnaire-based Polymod matrix and the Time-use matrix, were also made. The main results show that for VZV the best fit is provided by the Polymod matrix, which performs excellently, and much better than the artificial matrices. For B19, however, all Little Italy matrices fit the data quite well, and better than the competing matrices available, including the Polymod one.

This paper represents, as far as the authors know, the first comparison on real epidemiological data of bottom-up approaches to the generation of contact data with approaches based on direct estimation of contacts, such as the Polymod study. Our results on VZV provide further evidence of the merits of the Polymod study, which represents a great advancement in our understanding of contact patterns. However, the better fit to B19 provided by artificial matrices compared to questionnaire-based matrices is indicative of the difficulty of finding “universal” contact patterns that can satisfactorily explain many different infections. Therefore, though artificial matrices cannot replace observed ones, they can certainly represent valuable tools to assist mathematical modellers in the formulation of alternative assumptions.

An important related question is why different infections are better explained by different types of contact matrices. May this be due to the characteristics of the contacts which matter to various infectious diseases? The traditional WAIFW

With regards to model parameterization, the Little Italy model uses real data to parameterize the small-scale components (household sizes, schools, workplaces) of the contact network. On the other hand, assumptions were necessary to parameterize the large-scale components of the network, e.g. travel and shopping mall patterns. Nonetheless, we could at least make such patterns fully consistent with the general design of the Little Italy model: the daily time spent travelling, or in supermarkets, by each Little Italy agent matches, through an optimization procedure, the time spent by the corresponding real individual. To appreciate the potential impact on data fitting of the ad-hoc assumptions on travel patterns, we also fitted the model excluding contacts on transport and in shopping malls, showing that in the most significant cases the results were essentially unaffected. This suggests that the “empirically robust” component of the model is sufficient for the main target of the paper, i.e. the generation of contact data. Obviously, given the lack of appropriate epidemiological data to validate travel assumptions, using Little Italy for further investigations beyond those presented here, for example epidemic prediction or informing social distancing measures, certainly requires caution. Future work will be devoted to the analysis of the model's robustness to the assumptions on its large-scale components.

Given the simplicity of the adopted definition of contact, the current model cannot reproduce, without resorting to further data and hypotheses, the richness of the data obtained by the Polymod survey, which also collected noteworthy information such as the intimacy and frequency of contacts. This is clearly a shortcoming since these types of contacts are arguably important for most respiratory infections

However, the current Little Italy model can potentially be used to answer several important questions. For example, the model can be expanded to describe contacts in a rural-urban environment, given the representativeness of Italian Time Use data for rural and urban populations. Moreover, longer simulations could address how contacts cumulate (a) over periods of time comparable in length to the infectivity period, (b) between workdays and weekend days

Further, this paper would like to reinforce the perspective that contact data and time-use data provide useful complementary information. On the data-gathering side, major gains could be achieved by combining the two approaches. This could be achieved, for example, by supplementing time-use surveys with a few questions about people “contacted” (for example those with whom a conversation was held) during any given activity or time slot. This would provide data that consistently incorporate the relationship between time of exposure and contacts. With regards to studies of transmission, it would be important to better understand how to integrate the two types of data, for example by comparing time use data and Polymod data on durations of contacts.

A final point regards the information embedded in age specific serological data, which are the base for infection control strategies. As clear for example for VZV

Little Italy details.

(0.12 MB PDF)

Big Italy details.

(0.19 MB PDF)

Description of the nonparametric model used to ground fitting results.

(0.02 MB PDF)

Little Italy Type 1 activity-specific matrices.

(0.41 MB PDF)

The authors warmly thank three anonymous reviewers of the Journal whose comments contributed substantially to improve the quality and exposition of the manuscript.