The author has declared that no competing interests exist.

Conceived and designed the experiments: PH. Performed the experiments: PH. Analyzed the data: PH. Contributed reagents/materials/analysis tools: PH. Wrote the paper: PH.

One of network epidemiology's central assumptions is that the contact structure over which infectious diseases propagate can be represented as a static network. However, contacts are highly dynamic, changing at many time scales. In this paper, we investigate conceptually simple methods to construct static graphs for network epidemiology from temporal contact data. We evaluate these methods on empirical and synthetic model data. For almost all our cases, the network representation that captures the most relevant information is a so-called exponential-threshold network. In these, each contact contributes a weight that decreases exponentially with time, and there is an edge between a pair of vertices if the weight between them exceeds a threshold. Networks of aggregated contacts over an optimally chosen time window perform almost as well as the exponential-threshold networks. On the other hand, networks of accumulated contacts over the entire sampling time, and networks of concurrent partnerships, perform worse. We discuss these observations in the context of the temporal and topological structure of the data sets.

To understand how diseases spread in a population, it is important to study the network of people in contact. Many methods to model epidemic outbreaks make the assumption that one can treat this network as static. In reality, we know that contact patterns between people change in time, and old contacts are soon irrelevant; we do not need to know Marie Antoinette's lovers to understand the HIV epidemic. This paper investigates methods for constructing networks of people that are as relevant as possible for disease spreading. The most promising method, which we call the exponential-threshold network, works by letting contacts contribute less the further from the beginning of an outbreak they take place. We investigate the methods both on artificial models of the contact patterns and on empirical data. Besides searching for the optimal network representation, we also investigate how the structure of the original data set affects the performance of the representations.

In the 1980s and 1990s, mathematical epidemiology of infectious diseases made great progress. During these years, researchers went from models where every individual meets everyone else with equal probability, to a framework of networks where people are considered as connected if one can infect the other. This new body of theories, network epidemiology

Network epidemiology rests on coarse simplifications, perhaps the biggest being that one usually does not explicitly model the dynamic aspects of contact patterns

Consider a sequence of contacts—triples (_{i}_{i}

In this paper, we will use both empirical and artificially generated temporal-network data sets. We investigate three classes of network representations to find which one can predict ∑_{i}_{i}_{i}

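We will compare three conceptually simple methods of reducing a contact sequence to a static network (illustrated in the accompanying figure). The first representation, time-slice networks, is a network of all contacts within a time window [t_{start},t_{stop}]. The second representation, ongoing networks, connects pairs of vertices that have contacts both before and after the window [t_{start},t_{stop}]. This is thus a network of edges, or relationships, that are concurrently active over the time window. This method takes its name from the literature on sexually transmitted infections, where it is believed that the level of concurrent partnerships is a key factor in understanding how contact patterns influence epidemics. In the third representation, exponential-threshold networks, each contact contributes a weight e^{−t/τ} that decays with the time t of the contact, and a pair of vertices is connected if its total weight exceeds a threshold Ω.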

To the left in all panels is a temporal network where each horizontal line is the timeline of a vertex. The vertical curves symbolize a contact between two vertices at one time step. Panel A shows the construction of the time-slice network. Two vertices are connected if they have at least one contact in the time interval [t_{start},t_{stop}]. In panel B, a vertex pair is connected if they have contacts before t_{start} and after t_{stop}. Finally, panel C illustrates how the contact sequence is reduced to a weighted graph that is converted to an unweighted graph by requiring an edge to have a weight over a certain threshold Ω. The thickness of the lines in panel C is proportional to the weight between the pair.
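The constructions in panels A and B can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the paper: it assumes the contact sequence is a list of `(i, j, t)` triples, and the function names `time_slice_network`, `ongoing_network` and `degrees` are hypothetical. The strict inequalities in the ongoing network (a contact strictly before t_{start} and one strictly after t_{stop}) are also an assumption.

```python
from collections import defaultdict


def time_slice_network(contacts, t_start, t_stop):
    """Pairs with at least one contact at a time t in [t_start, t_stop]."""
    edges = set()
    for i, j, t in contacts:
        if i != j and t_start <= t <= t_stop:
            edges.add(frozenset((i, j)))  # undirected, no multiple edges
    return edges


def ongoing_network(contacts, t_start, t_stop):
    """Pairs with a contact before t_start and a contact after t_stop."""
    first, last = {}, {}
    for i, j, t in contacts:
        if i == j:
            continue  # ignore self-contacts (simple graphs only)
        pair = frozenset((i, j))
        first[pair] = min(t, first.get(pair, t))
        last[pair] = max(t, last.get(pair, t))
    # Assumption: both inequalities are strict.
    return {p for p in first if first[p] < t_start and last[p] > t_stop}


def degrees(edges):
    """Degree of every vertex that has at least one edge."""
    deg = defaultdict(int)
    for pair in edges:
        for v in pair:
            deg[v] += 1
    return dict(deg)
```

Setting `t_start` and `t_stop` to the first and last time stamps of the data in `time_slice_network` gives the network of accumulated contacts.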

As mentioned, we evaluate the network representations by comparing the importance (∑_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}

As a start, we will analyze empirical contact sequences of the type outlined above (lists of potentially contagious contacts—who has been in contact with whom at what time). These empirical data sets are more or less related to disease spreading; but they all serve as examples of different temporal-network structures. The data sets fall into three categories—online communication, face-to-face and sexual encounters. The latter two categories are of course more interesting for the spread of infectious diseases (while the former perhaps could be interesting for the spread of e-mail viruses). Of online communication data, we study two e-mail networks—from Refs.

| | 57,189 | 3,188 | 28,972 | 159(8) | 113 | 16,730 |
| | 92,442 | 31,857 | 115,684 | 647(57) | 2,196 | 39,044 |
| | 444,160 | 115,684 | 529,890 | 6,027(350) | 20,818 | 50,632 |
| λ | 0.298 | 0.031 | 0.108 | 0.067 | 0.028 | 0.416 |
| | 112.0d | 81.6d | 512.0d | 7.3(1)h | 2.5d | 2,232d |
| | 0.416 | 0.383 | 0.652 | 0.40(6) | 0.632 | 0.432 |

Turning to the main results of this section, we display the performance of the network representations in

| Time slice ρ_{max} | Time slice t_{start} | Time slice t_{stop} | Ongoing ρ_{max} | Ongoing t_{start} | Ongoing t_{stop} | Exponential threshold ρ_{max} | Exponential threshold τ | Exponential threshold Ω | Acc. ρ |
| 0.735(5) | 0 | 0.42(3) | 0.497(5) | 0.25(3) | 0.25(3) | 0.771(2) | 0.40(1) | 0.30(2) | 0.456(4) |
| 0.907(4) | 0 | 0.25(2) | 0.914(1) | 0.20(3) | 0.20(3) | 0.931(3) | 1.0(1) | 0.26(2) | 0.883(3) |
| 0.821(3) | 0 | 0.65(3) | 0.421(3) | 0.25(2) | 0.25(2) | 0.861(2) | 0.10(4) | 0.16(2) | 0.706(5) |
| 0.77(2) | 0 | 0.72(5) | 0.53(2) | 0.39(2) | 0.39(2) | 0.87(1) | 0.70(3) | 0.71(2) | 0.76(1) |
| 0.787(2) | 0 | 0.10(1) | 0.743(2) | 0.10(3) | 0.11(2) | 0.775(2) | 0.04(1) | 0.020(2) | 0.532(8) |
| 0.711(2) | 0 | 0.77(2) | 0.301(4) | 0.60(2) | 0.60(2) | 0.721(3) | 0.040(3) | 0.20(1) | 0.489(7) |

Another observation is that the aggregate networks, the most common static network representation of temporal network data, perform poorly (ranging from 59% to 95% of the maximal correlation value in the table above). The ongoing networks perform very differently for different data sets: sometimes (_{start} = _{stop} (the special case studied in Ref.

The occasional poor performance of the ongoing networks is a bit surprising in the light of the reported importance of concurrent partnerships for disease spreading in sexual networks

The time-slice networks perform consistently well: in one case better than, and in the other cases close to, the exponential-threshold networks (on average ρ≈0.04 lower). They carry the most relevant information if the time interval begins early. Indeed, the optimizing starting time is almost always the same as the beginning of the epidemics. This means that in practice they also, like the exponential-threshold networks, weigh the interactions with a weight decreasing with time (only that this weight function is discontinuous). The relative duration of the optimal time slice varies considerably (from 10% to 77% of the entire sampling time). Ref.

We now take a deeper look at the regions of optimal parameter values for the three classes of network representations. If one wants a quick analysis without the optimization procedure of this paper, then how can one set the parameters? Are there rules of thumb? We use the _{i}_{stop}-values (relative both to the sampling time and the mean interevent time). Nevertheless, we still do not know how to estimate this value without running disease simulations. The good news is that the network representation is rather insensitive to the choice of _{stop}.

Panel A shows data for the time-slice networks; B displays results for the ongoing networks and C gives the picture for the exponential-threshold representation. The dotted line illustrates the exponential form of the region of optimality (the equation being τ/^{Ω/0.32}). The quantities of dimension time are, as indicated, rescaled by the total sampling time

The ongoing networks typically are maximized at t_{start}≈t_{stop} for some intermediate value smaller than the duration. Also here, it is hard to give an estimate of this parameter value, other than that it falls within the optimal time window of the time-slice data. The last method, the exponential-threshold networks, is frequently optimized along a curve τ ∝ e^{Ω/Ω′}, where Ω′ is a constant. This is because larger decay factors give larger weights and thus larger thresholds. The t_{stop} for time-slice networks, t_{start} ( = t_{stop}) for ongoing networks and Ω′ for the exponential-threshold networks.

The fact that different methods work better for different data sets, and that the important disease spreaders are harder to predict in some data than in others, of course comes from differences in the temporal network structure. In

Next, we turn to studying the network structure of the optimized networks of the three types of network representations. The results are shown in

| | 25,995 | 2,752 | 23,941 | 132(7) | 84 | 10,958 |
| | 38,938 | 18,324 | 93,348 | 545(49) | 531 | 22,095 |
| | 0.982 | 0.997 | 0.975 | 0.87(2) | 1 | 0.934 |
| _{0} | 0.676 | 0.934 | 0.847 | 0.984(2) | 1.00 | 0.750 |
| | 3.70 | 2.86 | 4.07 | 3.7(1) | 1.94 | 5.95 |
| _{0} | 3.63 | 2.83 | 3.71 | 2.69(4) | 2.05 | 4.36 |

| | 9,787 | 2,245 | 761 | 28(1) | 78 | 867 |
| | 12,494 | 10,558 | 548 | 51(5) | 286 | 784 |
| | 0.847 | 1 | 0.293 | 0.51(3) | 1 | 1 |
| _{0} | 0.579 | 0.924 | 0.432 | 0.91(1) | 0.99 | 0.99 |
| | 5.25 | 2.99 | 4.98 | 1.9(1) | 2.42 | 7.32 |
| _{0} | 4.10 | 2.94 | 2.77 | 2.7(1) | 2.43 | 4.84 |

| | 31,451 | 2,357 | 22,287 | 147(7) | 110 | 10,566 |
| | 47,949 | 12,856 | 78,608 | 455(43) | 864 | 20,390 |
| | 0.984 | 1 | 0.963 | 0.57(4) | 1 | 0.924 |
| _{0} | 0.654 | 0.932 | 0.833 | 0.94(1) | 1.00 | 0.736 |
| | 3.77 | 2.93 | 4.29 | 3.7(2) | 2.01 | 6.00 |
| _{0} | 3.69 | 2.86 | 3.81 | 3.21(8) | 2.04 | 4.38 |

| | 57,189 | 3,188 | 28,972 | 159(8) | 113 | 16,730 |
| | 92,442 | 31,857 | 115,684 | 647(57) | 2,196 | 39,044 |
| | 0.999 | 1 | 0.977 | 0.81(3) | 1 | 0.945 |
| _{0} | 0.989 | 0.997 | 0.880 | 0.980(2) | 0.990 | 0.792 |
| | 3.93 | 2.78 | 4.05 | 4.17(13) | 1.66 | 5.78 |
| _{0} | 4.10 | 2.85 | 3.79 | 2.83(3) | 1.70 | 4.36 |

Now we will explore effects of the temporal-network structure and the stability of the above observations in a model network. It would be quite impossible to scan all facets of temporal-network structure. Rather, we will focus on the effect of overlapping relationships on the performance of the representations. Can the ongoing networks outperform the time-slice and exponential-threshold networks for some temporal networks with a high degree of overlapping relationships? We set up the simulation so as to mimic as much of the observed structure as possible, while simultaneously controlling the average fraction of concurrent partnerships. The latter is achieved through a parameter, μ ∈ (0,1], where larger values mean more relationships that are concurrent. An outline of the construction algorithm is shown in

Steps 1–2 represent the configuration model used to create a static network. Then, in Step 3, we assign active intervals (time periods where contacts are allowed). In Steps 4–6, we assign contact times within the intervals from the same interevent time distribution.

In _{i}

We display the maximum value of the Spearman rank correlation as a function of the overlap parameter μ (a model-parameter controlling the fraction of concurrent relationships). Error bars showing the standard error would be smaller than the symbol size and are not plotted.

When μ = 1, in the limit of many contacts per edge, the ongoing and time-sliced networks will be the same (simply equaling the network of aggregated contacts). The difference, seen in

Panel A shows the number of (non-zero degree) vertices in the network; B displays the average degree; C gives the relative size of the largest connected component; while D shows the corresponding figure to C for null models of the same degree sequences as in C, but otherwise random. The error bars are displayed if they are about the same size as the symbols and show the standard error.

The optimizing parameter values are presented in the Supporting Information,

We have explored how to encode as much information as possible from a temporal network and a known start time of an infection into static graphs, so that a predictor of disease-spreading importance (degree) is as accurate as possible. The main conclusions are that, on one hand, exponential-threshold networks generally perform best; on the other hand, time-slice networks often perform almost as well. Our general recommendation is thus to use exponential-threshold networks if possible. However, the simplicity in constructing and optimizing a time-slice network makes it a feasible alternative. Straightforwardly using a network of accumulated contacts is not a good idea; for some data sets, the performance is less than 60% of the maximum. In addition, the ongoing networks, which record contacts that are active simultaneously, perform rather poorly. The performance is better when there are relatively many concurrent edges, or partnerships (i.e. when these networks are rather dense), but never as good as the other two methods. It is well established that the overall level of concurrent partnerships increases the frequency of population-wide outbreaks

How much do our results generalize beyond our current analysis? There are of course many other ways to evaluate the performance of network representations. Instead of the performance measure that we consider (the ability of a vertex's degree to predict its rank in a list of estimated sizes of outbreaks originating at that particular vertex), one can imagine other measures. Different types of centrality measures

Ideally, an importance measure should weigh together both these aspects. In most types of data, we expect these aspects to be strongly correlated and we settle for the mentioned expected outbreak size. One can also think of other prediction tasks for the comparison than finding influential spreaders, for example predicting the epidemic threshold, the peak time of the epidemics, the prevalence as a function of time or the final outbreak size. Such studies would require us to study a specific disease-spreading model for the static network. This added complication is the main reason that we avoid such a direction. However, we also believe (as mentioned above) that predicting influential spreaders is a comparatively easy task. If one cannot say who would be an influential spreader, but still gets the epidemic threshold right, the latter seems rather like luck. (Investigating this hypothesis rigorously would be an interesting future direction.)

How much do our conclusions depend on the disease simulation model and its parameter values? The per-contact transmission probability probably does not affect the ranking of the vertices (even if the expected outbreak sizes can vary non-linearly). The duration of the infective state, however, could change the ranking. In Ref.

Maybe the most serious reason to be cautious about generalizing our results is that we have investigated only a limited set of temporal-network structures. Indeed one can imagine numerous types of correlations between temporal structure and network position—correlations between edges connected to the same vertex, between vertices connected by an edge, etc. A promising sign, however, is that the empirical data sets span a rather large range of static network structure (both in terms of the network of accumulated contacts and the optimized networks). In the end, it is probably impossible to scan all temporal-network structures. Rather, we hope for higher quality empirical data. This would also allow us to better tailor the network representations to specific pathogens.

The ρ_{max}-values—between 0.68 (for the synthetic data) to 0.93 (for the

An interesting question for the future is why some data sets give higher performance values. With the degree sequences of the accumulated networks, ρ_{max} is bounded above by about 0.95–0.98 (1 is unattainable because of the degeneracy of degrees). The discrepancy comes from the network-construction methods being too blunt to capture the relevant temporal-network structure. On the other hand, it may be too much to ask of the method to rank the bulk of peripheral vertices accurately; the difference between them will probably be smaller than the errors in the raw data set. Another open future direction is to design other network representations, perhaps putting different weight depending on burstiness

We consider a set

We simulate disease spreading by a version of the SIR model defined as follows. Start the simulation from a situation where all vertices are susceptible. The outbreak is then initiated from a seed

Ideally, we should scan the entire (λ,δ) parameter space, but this would be computationally too demanding. Rather, we will try to simulate the disease spreading where it is easy to separate the more important from the less important individuals. This happens at intermediate λ- and δ-values. (For an infinite system, it would be around the epidemic threshold, but for the finite systems that we consider, thresholds are ill defined, so we avoid that terminology.) As a simple principle, we choose δ as one fifth of the sampling time and λ such that the average outbreak size becomes one fifth of the size with λ = 1 and δ =
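To make the roles of λ and δ concrete, the following is a minimal sketch of an SIR process run directly on a contact sequence. It is not the paper's simulation code. It assumes that contacts are `(i, j, t)` triples, that the seed becomes infectious at time 0, that λ is the per-contact transmission probability and that δ is the duration of the infective state. The function names are hypothetical. A vertex infected at time t may also transmit in contacts that share the same time stamp and are processed after it in the sorted order, which is a simplification.

```python
import random


def sir_outbreak_size(contacts, seed, lam, delta, rng):
    """One SIR run over the contact list; returns the number of vertices ever infected."""
    infected_at = {seed: 0.0}  # assumption: the seed is infectious from t = 0
    for i, j, t in sorted(contacts, key=lambda c: c[2]):
        for src, dst in ((i, j), (j, i)):
            t_inf = infected_at.get(src)
            if t_inf is None or dst in infected_at:
                continue  # src never infected, or dst already infected/recovered
            # src is infectious during [t_inf, t_inf + delta], then recovers
            if t_inf <= t <= t_inf + delta and rng.random() < lam:
                infected_at[dst] = t
    return len(infected_at)


def expected_outbreak_size(contacts, seed, lam, delta, runs=100, rng_seed=0):
    """Average outbreak size over independent runs started at the same seed vertex."""
    rng = random.Random(rng_seed)
    total = sum(sir_outbreak_size(contacts, seed, lam, delta, rng) for _ in range(runs))
    return total / runs
```

Calling `expected_outbreak_size` once per vertex gives one importance value per seed, which is the quantity the static-network degrees are compared against.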

We limit ourselves to simple graphs (unweighted and undirected graphs that have no multiple edges or self-edges) and require that their construction should be conceptually simple. The simplest type of such representations is the time-slice network: an edge in these is any pair of vertices (i,j) with at least one contact at a time t such that t_{start}≤t≤t_{stop}. If t_{start} and t_{stop} are the beginning and end of the data set, then we speak of an aggregated network (which probably is the most common representation when running disease simulations on empirical network data). In an ongoing network, a pair of vertices is connected if it has contacts at times t and t′ with t<t_{start}≤t_{stop}<t′. A special case is t_{start} = t_{stop}.

The last type of network representation that we test is exponential-threshold networks. In these, each pair of vertices is assigned a weight
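As an illustration of the exponential-threshold construction, here is a short sketch under stated assumptions: contacts are `(i, j, t)` triples with t measured from the beginning of the outbreak, each contact adds e^{−t/τ} to the weight of its vertex pair, and a pair becomes an edge when its summed weight exceeds Ω. The function name is hypothetical, and the weights are left unnormalized here, which may differ from the normalization behind the Ω values reported in the tables.

```python
import math
from collections import defaultdict


def exponential_threshold_network(contacts, tau, omega):
    """Edges between pairs whose exponentially discounted contact weight exceeds omega."""
    weights = defaultdict(float)
    for i, j, t in contacts:
        if i == j:
            continue  # simple graphs only: skip self-contacts
        # Later contacts contribute less: weight e^{-t/tau} per contact.
        weights[frozenset((i, j))] += math.exp(-t / tau)
    return {pair for pair, w in weights.items() if w > omega}
```

A large τ makes all contacts count almost equally, so the construction approaches a threshold on the number of contacts per pair. A small τ keeps only pairs with early contacts.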

In our tables discussing the structure of the data sets and derived networks, we use a number of quantities that we will define here.

To quantify the tendency of contacts to be temporally separated by broadly distributed intervals, we use the

Another important quantity is the

We compare the static network measures by the corresponding values from a randomized null model with the same set of degrees but otherwise no structure. An instance of this model is generated by: sequentially going through all edges (
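A null model with the same degrees but otherwise random structure can be generated by pairwise edge swaps. The sketch below uses a common swap scheme and should be read as an assumption, since the exact procedure is not fully recoverable from the text here. It visits every edge once, picks a random partner edge, and exchanges endpoints unless that would create a self-edge or a multiple edge. The function name `rewire_null_model` is hypothetical.

```python
import random


def rewire_null_model(edges, rng):
    """Degree-preserving randomization of a simple graph given as a set of 2-element frozensets."""
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    m = len(edge_list)
    for a in range(m):
        i, j = edge_list[a]
        b = rng.randrange(m)
        k, l = edge_list[b]
        if rng.random() < 0.5:
            k, l = l, k  # pick the orientation of the partner edge at random
        if len({i, j, k, l}) < 4:
            continue  # the swap would create a self-edge (also covers a == b)
        new1, new2 = frozenset((i, l)), frozenset((k, j))
        if new1 in edge_set or new2 in edge_set:
            continue  # the swap would create a multiple edge
        edge_set -= {frozenset((i, j)), frozenset((k, l))}
        edge_set |= {new1, new2}
        edge_list[a] = (i, l)
        edge_list[b] = (k, j)
    return edge_set
```

Each accepted swap replaces edges (i,j) and (k,l) with (i,l) and (k,j), so every vertex keeps its degree while the wiring is randomized.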

We estimate importance of a vertex _{i}

To estimate the importance of a vertex in the disease spreading from the static networks, we _{i}
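The comparison between degree in a static network and simulated outbreak size is reported as a Spearman rank correlation elsewhere in the paper. A self-contained sketch of that coefficient follows. The function names are hypothetical, and the use of average ranks for ties is an assumption, chosen because many vertices share the same degree.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in order[pos:end + 1]:
            ranks[k] = avg
        pos = end + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the rank-transformed sequences x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one sequence is constant
    return cov / (vx * vy) ** 0.5
```

Here `x` would hold the degrees of the vertices in one network representation and `y` the estimated outbreak sizes for the same vertices, in the same order.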

The method to generate synthetic contact sequences is outlined in ^{4} consecutive times, we give up and delete the remaining stubs. In this paper, we use a truncated power-law distribution to mimic the skewed, broad degree distributions of the empirical networks. To be specific, we draw the random numbers from a distribution_{min} = 1, _{max} =

After the network topology is generated, we proceed to assign times of contacts to the edges. We assume a contact over an edge can only take place during an

We proceed by generating a time series with, once again, a truncated power-law shape. We draw interevent times Δ from a distribution P(Δ) ∝ Δ^{−β} for Δ_{min}≤Δ≤Δ_{max}, with Δ_{min} = 1, Δ_{max} = 10^{4} and β = 2. We generate
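One way to draw interevent times with this shape is inverse-transform sampling. The sketch below assumes a continuous power law with hard cutoffs at Δ_{min} and Δ_{max}, which is only one reading of "truncated power law"; a discrete-time variant would round or sample integers instead. The function names are hypothetical.

```python
import random


def sample_truncated_power_law(rng, beta=2.0, x_min=1.0, x_max=1e4):
    """Draw x in [x_min, x_max] with density proportional to x**(-beta), for beta != 1."""
    if beta == 1.0:
        raise ValueError("beta = 1 needs the logarithmic form of the inverse CDF")
    u = rng.random()  # uniform in [0, 1)
    a = x_min ** (1.0 - beta)
    b = x_max ** (1.0 - beta)
    # Invert the CDF F(x) = (a - x**(1 - beta)) / (a - b).
    return (a - u * (a - b)) ** (1.0 / (1.0 - beta))


def interevent_time_series(rng, n_contacts, t_first=0.0, **kwargs):
    """Contact times obtained by accumulating sampled interevent times."""
    times = [t_first]
    for _ in range(n_contacts - 1):
        times.append(times[-1] + sample_truncated_power_law(rng, **kwargs))
    return times
```

With β = 2 most gaps are close to Δ_{min}, while occasional gaps of order Δ_{max} produce the bursty pattern that the synthetic data are meant to mimic.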

Scatter plot of the degree in a time-slice network with parameters _{start} = 0 and _{start} = _{i}

(TIF)

The model parameters for optimizing the time-slice networks for the synthetic data sets as a function of the overlap parameter.

(TIF)

The model parameters for optimizing the ongoing networks for the synthetic data sets as a function of the overlap parameter.

(TIF)

The model parameters for optimizing the exponential-threshold networks for the synthetic data sets as a function of the overlap parameter. Panel A shows values for the decay exponent τ in units of the total sampling time and B shows the τ→∞ limit cutoff Ω.

(TIF)

Maximal performance values for the empirical data sets. This table is exactly corresponding to

(PDF)

We are grateful for helpful comments from Naoki Masuda, Fariba Karimi, Luis E. C. Rocha and Jari Saramäki.
