## Abstract

As infectious disease surveillance systems expand to include digital, crowd-sourced, and social network data, public health agencies are gaining unprecedented access to high-resolution data and have an opportunity to selectively monitor informative individuals. Contact networks, which are the webs of interaction through which diseases spread, determine whether and when individuals become infected, and thus who might serve as early and accurate surveillance sensors. Here, we evaluate three strategies for selecting sensors—sampling the most connected, random, and friends of random individuals—in three complex social networks—a simple scale-free network, an empirical Venezuelan college student network, and an empirical Montreal wireless hotspot usage network. Across five different surveillance goals—early and accurate detection of epidemic emergence and peak, and general situational awareness—we find that the optimal choice of sensors depends on the public health goal, the underlying network and the reproduction number of the disease (*R*_{0}). For diseases with a low *R*_{0}, the most connected individuals provide the earliest and most accurate information about both the onset and peak of an outbreak. However, identifying network hubs is often impractical, and they can be misleading if monitored for general situational awareness, if the underlying network has significant community structure, or if *R*_{0} is high or unknown. Taking a theoretical approach, we also derive the optimal surveillance system for early outbreak detection but find that real-world identification of such sensors would be nearly impossible. By contrast, the friends-of-random strategy offers a more practical and robust alternative. It can be readily implemented without prior knowledge of the network, and by identifying sensors with higher than average, but not the highest, epidemiological risk, it provides reasonably early and accurate information.

## Author Summary

As public health agencies strive to harness big data to improve outbreak surveillance, they face the challenge of extracting meaningful information that can be directly used to improve public health, without incurring additional costs. In this article, we address the question: *Which nodes in a social network should be selectively monitored to detect and monitor outbreaks as early and accurately as possible?* We derive best-case performance scenarios, and show that a practical strategy for data collection–recruiting friends of randomly selected individuals–is expected to perform reasonably well, in terms of the timing and reliability of the epidemiological information collected.

**Citation:** Herrera JL, Srinivasan R, Brownstein JS, Galvani AP, Meyers LA (2016) Disease Surveillance on Complex Social Networks. PLoS Comput Biol 12(7): e1004928. https://doi.org/10.1371/journal.pcbi.1004928

**Editor:** Marcel Salathé, Ecole Polytechnique Federale de Lausanne, SWITZERLAND

**Received:** August 25, 2015; **Accepted:** April 19, 2016; **Published:** July 14, 2016

**Copyright:** © 2016 Herrera et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** All relevant data are within the paper and its Supporting Information files.

**Funding:** This work was funded by NIH/NIGMS MIDAS grant U01 GM087719. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

Public health agencies rely on diverse sources of information for detecting emerging outbreaks, situational awareness (e.g., estimating prevalence or severity), prediction of future burden, and triggering initiation of control measures. For influenza alone, the CDC has deployed at least eight different surveillance systems [1]. With the public health sector facing increasing budget constraints [2, 3], disease surveillance is at a critical juncture where next-generation big data can potentially be harnessed to revolutionize traditional data-limited practices and improve real-time situational awareness, early detection and forecasting of disease outbreaks.

HealthMap—an event-based system that aggregates worldwide news to generate global health risk maps—was among the first effective demonstrations of internet-driven surveillance [4, 5]. In 2009, Google Flu Trends—a detection algorithm for internet search queries of influenza-related terms—brought next-generation indicator-based syndromic surveillance to the forefront of public health [6–11]. It generally aligns well with seasonal dynamics in the US and Europe, but fell short during the 2009 H1N1 pandemic [12–14]. In the last few years, next-generation surveillance has exploded with efforts to combine both event and syndromic indicator data from search engines [15, 16], crowdsourcing (e.g., Flu Near You in the US and Influenzanet in Europe) [17, 18], Twitter (e.g., MappyHealth) [19, 20], and Facebook [21, 22]. While these new approaches are promising, public health agencies face the significant challenge of comprehensively integrating these diverse data sources to achieve specific surveillance objectives. Many next generation data sources, whether passively scraping data gathered for an incidental purpose or actively engaging volunteer participants, can be used to infer the underlying network through which disease, opinions or information spreads.

Decades of sociology and epidemiology research have demonstrated that network structure can profoundly influence the spread of disease and behavior, and determine if and when individuals are affected [23–30]. In particular, there are diverse methods for quantifying the importance or *centrality* of a *node* (individual) in a network, many of which have been shown to predict epidemiological risk and indicate optimal targets for interventions such as vaccination [31–37, 41–43].

In designing disease surveillance systems for networked populations, one seeks to identify nodes (*sensors*) that are likely to provide timely and accurate indications of epidemic activity. While this task is analogous to selecting efficient targets for vaccination in networked populations, the best sensors are not necessarily those most likely to be infected and infect others. Nodes that are the earliest or most often infected may be unreliable indicators of the broader epidemiological situation. Conversely, a representative cross-section of a network may provide accurate situational awareness, but its rate of detection may be too slow to serve as a timely trigger of control measures. Rapid, targeted action during the initial phase of an outbreak is fundamental to effectively curtailing transmission and minimizing disease burden. In previous work on livestock diseases, a network-path-based strategy was proposed for identifying surveillance locations that would provide timely and accurate outbreak data [40]. In a recent analysis of disease surveillance in a high school population, Smieszek and Salathé introduced a promising sensor selection criterion (total time students spend collocated with other students) that is expected to yield timelier and more accurate information than alternative centrality-based criteria [47, 48]. Christakis et al. performed an experimental comparison of two social-network-based strategies in a college population [46]. In one strategy, the sensors were a random selection of students; in the other, the sensors were identified as friends of one or more random students. The *friends-of-random* surveillance group was expected to be biased towards more central individuals, and provided an indication of the 2009–2010 pandemic H1N1 influenza epidemic that was two weeks earlier than the *random* surveillance group.

Here, we use a mathematical model to systematically evaluate these and other strategies for selecting surveillance sensors across several networks and for an ensemble of common public health objectives. We quantify the timing and accuracy of the information gained by monitoring the disease states of strategically chosen sensors, as well as the robustness of the information across epidemiological scenarios characterized by different reproduction numbers, *R*_{0}s. We find that the best surveillance targets are not always those with the highest epidemiological risk or those most representative of the underlying network.

## Methods

### Epidemic model

We simulate disease outbreaks in contact networks using a stochastic chain-binomial model that classifies the disease status of individuals as susceptible-exposed-infected-recovered (SEIR) [44, 45]. Networks consist of *nodes* representing individuals and *edges* between pairs of nodes representing contacts between individuals. The *degree* of a node is the number of other nodes to which it is connected via an edge.

During a simulated epidemic, each node is in one of four states: susceptible (S), exposed to disease but not yet infectious (E), infectious (I), or recovered (R). If a node *i* in state S shares an edge with a node *j* in state I, then *j* will infect *i* with probability *β* and *i* will transition from S to E. After a period of *l* days, *i* will enter the infectious state I. It will remain infectious for *d* days, and then move to the immune state R.
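As a rough illustration of the transmission process just described, the following Python sketch simulates a single outbreak in daily time steps. It is not the authors' code: the function name `simulate_seir`, the adjacency-dictionary representation of the network, and the reading of *β* as a per-day, per-edge infection probability are all assumptions made for this example.

```python
import random


def simulate_seir(adj, beta, latent_days=4, infectious_days=7, seed=None):
    """Simulate one outbreak; return the set of infectious nodes on each day.

    adj  : dict mapping each node to a list of its neighbours (undirected).
    beta : ASSUMED per-day probability that an infectious node infects
           a given susceptible neighbour.
    """
    rng = random.Random(seed)
    state = {node: "S" for node in adj}
    days_left = {}  # days remaining in the current E or I stage

    patient_zero = rng.choice(list(adj))
    state[patient_zero] = "I"
    days_left[patient_zero] = infectious_days
    infectious_by_day = [{patient_zero}]

    while days_left:  # someone is still exposed or infectious
        # Transmission: each S-I edge transmits independently today.
        newly_exposed = set()
        for j, status in state.items():
            if status != "I":
                continue
            for i in adj[j]:
                if state[i] == "S" and rng.random() < beta:
                    newly_exposed.add(i)

        # Progression: E -> I after latent_days, I -> R after infectious_days.
        for node in list(days_left):
            days_left[node] -= 1
            if days_left[node] == 0:
                if state[node] == "E":
                    state[node] = "I"
                    days_left[node] = infectious_days
                else:
                    state[node] = "R"
                    del days_left[node]

        # Nodes infected today start their latent period tomorrow.
        for i in newly_exposed:
            state[i] = "E"
            days_left[i] = latent_days

        infectious_by_day.append({n for n, s in state.items() if s == "I"})
    return infectious_by_day
```

Dividing the size of each daily set by the number of nodes gives a prevalence time series for the whole population; intersecting each set with a sensor subset gives the corresponding series for a surveillance group.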

The *reproduction number* of a disease, denoted *R*_{0}, indicates the growth rate of an epidemic and the expected number of secondary infections arising from a single infected host in an entirely susceptible population. Sustained epidemics are only possible when *R*_{0} > 1. In a random network, *R*_{0} is related to *β* as follows [49]:
(1)
where 〈*k*〉 and 〈*k*^{2}〉 are the mean degree and the mean squared degree, respectively, of nodes in the network. *R*_{0} depends explicitly on both the intrinsic transmission rate of the pathogen and the structure of the network. For our analyses, we specify *R*_{0} and use Eq 1 to solve for the corresponding *β*. For the empirical networks considered, clustering, modularity and other non-random structures may cause the resulting *R*_{0} to differ slightly from the one initially specified.
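The step of solving for the transmission parameter from a target *R*_{0} can be sketched as follows. Because Eq 1 is not reproduced in this text, the sketch assumes the widely used random-network relation *R*_{0} = *T*(〈*k*^{2}〉 − 〈*k*〉)/〈*k*〉, where *T* is the per-edge transmissibility over the whole infectious period; this may differ from the exact form the authors used. The function names and the optional conversion to a per-day probability are likewise hypothetical.

```python
def transmissibility_from_r0(degrees, r0):
    """Per-edge transmissibility T implied by a target R0.

    ASSUMES R0 = T * (<k^2> - <k>) / <k>, which may not match Eq 1 exactly.
    The degree sequence must not consist only of degrees 0 and 1.
    """
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(k * k for k in degrees) / n
    t = r0 * mean_k / (mean_k2 - mean_k)
    if not 0.0 <= t <= 1.0:
        raise ValueError("target R0 is not attainable on this degree sequence")
    return t


def daily_probability(t, infectious_days):
    """Per-day infection probability whose cumulative effect over the
    infectious period equals the transmissibility t."""
    return 1.0 - (1.0 - t) ** (1.0 / infectious_days)
```

The second helper matters only if the transmission probability is applied once per day of infectiousness rather than once per edge.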

For each simulation, we fix the latent period to *l* = 4 days and the infectious period to *d* = 7 days, roughly in the range of estimates for common respiratory diseases, including influenza [50, 51]. Epidemics are initialized with a single random infected node and allowed to evolve until there are no remaining infected nodes.

### Contact networks

Social interactions often generate complex network structures, with features that impose non-trivial constraints on the flow of information, behavior and disease [52–55]. We evaluated network-based surveillance strategies using three classes of social networks with distinct topological attributes.

- *Scale-free networks*: Networks generated using the Barabási-Albert algorithm [55] with *N* = 10,000 nodes, starting with *m*_{0} = 3 nodes and iteratively adding nodes with edges to *m* = 3 existing nodes.
- *Student network*: A social network formed by *N* = 4,634 students (nodes) of the Engineering Department at the Universidad de Los Andes in Mérida, Venezuela, where edges indicate that students attended the same class during the fall 2008 semester (for more information, see Supporting Information S1 Fig).
- *Montreal WiFi network*: A co-location network for *N* = 103,425 users (nodes) of the Île Sans Fil free public wireless network in Montreal, Canada, where edges represent concurrent hotspot usage [56].

The degree distributions of the scale-free and Montreal networks resemble power laws [55, 56], while the student network has a relatively homogeneous (Poisson) degree distribution. The Montreal network, but not the other two, exhibits strong community structure [56].

### Surveillance strategies

We propose three strategies for designing network-based surveillance systems. Each strategy is a criterion for selecting a subset of individuals whose disease state will be monitored: (1) *most connected*: select the highest degree individuals in the network; (2) *random*: select individuals at random; and (3) *random acquaintance*: select a random acquaintance of random individuals (which should be biased toward high degree individuals [57]). These strategies are illustrated in Fig 1 for a scale-free network, where each surveillance subset includes five of the 100 nodes (in red).

**Fig 1.** Red circles indicate nodes that are selected to be surveillance sensors. For the random acquaintance strategy, yellow squares indicate randomly chosen nodes from which one random acquaintance was selected to be a surveillance sensor.

The most connected strategy assumes complete knowledge of the network structure, whereas the random and random acquaintance strategies do not.
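The three selection rules can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names, the adjacency-dictionary input, and the choice to keep drawing until `m` distinct sensors are found in the random acquaintance strategy are assumptions.

```python
import random


def most_connected(adj, m):
    """Return the m highest-degree nodes (requires the full network)."""
    return sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:m]


def random_nodes(adj, m, rng):
    """Return m distinct nodes chosen uniformly at random."""
    return rng.sample(list(adj), m)


def random_acquaintance(adj, m, rng):
    """Return m distinct nodes, each found as a random neighbour of a
    randomly chosen node (uses only local information)."""
    with_neighbours = [v for v in adj if adj[v]]
    if m > len(with_neighbours):
        raise ValueError("not enough nodes with at least one neighbour")
    sensors = set()
    while len(sensors) < m:
        seed_node = rng.choice(with_neighbours)   # random individual
        sensors.add(rng.choice(adj[seed_node]))   # one of their contacts
    return list(sensors)
```

Only `most_connected` needs to see every node's degree; the other two functions touch the network only through uniform draws and a single neighbour lookup.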

### Evaluation of surveillance strategies

We assess the performance of each surveillance strategy with respect to four different public health goals, listed below (Fig 2). For each strategy-network combination, we build surveillance subsets by selecting 1% of all nodes (unless otherwise specified) via the strategy. We then estimate performance by running stochastic SEIR simulations, and make the following four comparisons between the prevalence time series in the whole population and that in the surveillance subset:

- *Early warning*: The lag between the surveillance subset reaching 1% prevalence and the entire population reaching 1% prevalence [58–60].
- *Peak timing*: The lag between the surveillance subset reaching its epidemic peak and the entire population reaching its epidemic peak.
- *Peak magnitude*: The ratio of peak prevalence in the surveillance subset to peak prevalence overall.
- *Situational awareness*: The complement of the normalized mean absolute error (MAE), minimized over possible lags, is given by (2). Here, *x*_{t} and *y*_{t} are the prevalence in the surveillance subset and in the whole population at time *t*, respectively, *N* is the population size, *M* is the size of the surveillance subset, and *λ* is the lag.
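The four comparisons listed above can be sketched as follows, taking two prevalence time series (fractions infected per day) as input. The function names are hypothetical, lead times are defined here as positive when the surveillance subset is ahead of the population, and, because Eq 2 is not reproduced in this text, the normalization of the MAE by peak population prevalence is an assumption rather than the authors' definition.

```python
def first_crossing(series, threshold):
    """Index of the first day on which prevalence reaches the threshold."""
    for day, value in enumerate(series):
        if value >= threshold:
            return day
    return None


def lead_time(subset, population, threshold=0.01):
    """Days by which the subset reaches the threshold before the population."""
    t_sub = first_crossing(subset, threshold)
    t_pop = first_crossing(population, threshold)
    if t_sub is None or t_pop is None:
        return None
    return t_pop - t_sub


def peak_lead(subset, population):
    """Days by which the subset peaks before the population."""
    return population.index(max(population)) - subset.index(max(subset))


def peak_ratio(subset, population):
    """Peak prevalence in the subset relative to the population."""
    return max(subset) / max(population)


def situational_awareness(subset, population, max_lag):
    """1 minus the lag-minimized MAE, normalized (ASSUMPTION) by the
    population's peak prevalence."""
    best = float("inf")
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(subset[t], population[t + lag])
                 for t in range(len(subset))
                 if 0 <= t + lag < len(population)]
        if pairs:
            mae = sum(abs(x - y) for x, y in pairs) / len(pairs)
            best = min(best, mae)
    return 1.0 - best / max(population)
```

Keeping `max_lag` small relative to the length of the series avoids scoring a lag at which only the flat tails of the two curves overlap.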

**Fig 2.** To evaluate strategies, we compare the epidemic curve (prevalence time series) of the subset of nodes under surveillance (green lines) with the epidemic curve for the whole population (blue lines). We calculate the time lag between the surveillance group and whole population reaching 1% prevalence (*early warning*). We also calculate the time lag between the surveillance group and whole population reaching their epidemic peaks and the ratio of the magnitudes of the two peaks (*peak forecasting*), as well as the complement of the normalized mean absolute error (MAE) (*situational awareness*).

All results are averaged over 2000 stochastic SEIR simulations. At the beginning of each simulation, the surveillance subset is chosen anew according to the given strategy. For each objective function, we quantify both the magnitude of the effect and its robustness with respect to a key epidemiological quantity, *R*_{0}. High sensitivity of the information provided by a surveillance system to *R*_{0} indicates that the system may be unreliable or uninterpretable in situations where *R*_{0} is unknown or changing.

## Results and Discussion

In all three networks, the most connected strategy selects subsets of nodes that are most likely to experience earlier and more intense epidemics, whereas the random strategy yields collections of sensors that are highly representative of the population as a whole (Fig 3). The random acquaintance strategy produces subsets that provide some early warning in the scale-free and Montreal networks, but not in the highly homogeneous student network. The epidemic curves in the Montreal network occasionally exhibit multiple peaks, driven by underlying community structure [56].

**Fig 3.** Lines indicate the fraction of infected nodes overall (magenta) and in 1% subsets of nodes selected according to the most connected (red), random (blue), and random acquaintance (green) surveillance strategies during a single SEIR simulation with *R*_{0} = 3.

A systematic evaluation of the three strategies in the three focal networks (Fig 4) shows that the most connected strategy consistently provides the earliest warning for both the beginning and peak of the season. The most connected strategy also exhibits the highest peaks and the least overall similarity to the full epidemic curve (Fig 4, red points). However, the timing of the early warning can be highly sensitive to *R*_{0}, presenting a challenge when there is uncertainty regarding *R*_{0}. For example, when *R*_{0} = 3, the most connected surveillance subset crosses the season onset threshold an average of 2.5, 27 and 35 days before the entire population in the student, scale-free and Montreal networks, respectively. When *R*_{0} = 5, these early warning periods decrease to averages of 0.71, 15.2 and 18.1 days, respectively (Fig 4B, 4D and 4F). The epidemic peak in the most connected surveillance subsets also depends on *R*_{0}, reducing confidence in the estimation of peak burden under uncertainty (Fig 4A, 4C and 4E). In the Montreal network, the average ratio between the peak in the surveillance subset and the peak overall decreases from 29 to 15.3 as *R*_{0} increases from 3 to 5 (Fig 4E). In general, as *R*_{0} increases, the height of the epidemic peak in the entire population approaches that in the most connected subset of the population.

**Fig 4.** Points and error bars indicate mean and standard deviation in performance over 2000 simulations, respectively. Performance depends on both *R*_{0} and network structure: scale-free (graphs A and B), student (graphs C and D), and Montreal (graphs E and F).

The random strategy yields surveillance systems that closely reflect the overall epidemiological dynamics, with early warning values close to zero and peak ratios close to one, across all networks and values of *R*_{0} (Fig 4, blue points). The random acquaintance surveillance groups perform relatively well in both the scale-free and Montreal networks (Fig 4A, 4B, 4E and 4F, green points). The random acquaintance approach offers helpful early warning for both season onset and peak, though not as much as the most connected group. Importantly, the random acquaintance approach exhibits greater robustness with respect to *R*_{0} in the timing of early warning, peak ratio, and situational awareness (overall correlation between surveillance epidemic curve and population epidemic curve) compared with the most connected group. However, in the student network, the random and random acquaintance subsets are virtually indistinguishable (Fig 4C and 4D). The student network is highly homogeneous, with most nodes having close to the average number of contacts. Thus, random acquaintances tend to be average as well.

As the size of a surveillance system increases, the detected epidemic curves converge on the full epidemic curve, thereby improving situational awareness (Fig 5). In the scale-free and student networks, the performance of the three surveillance strategies stabilizes once the surveillance subset reaches around 3% of the population, which entails tracking 300 of 10,000 nodes and 139 of 4,634 nodes in the two networks, respectively. In the Montreal network, the random and random acquaintance groups reach their optimal performance by 0.5% (517 of 103,425 nodes), while the most connected group is still improving beyond 10% (10,342 of 103,425 nodes).

**Fig 5.** Surveillance groups were chosen using the most connected (red), random (blue), and random acquaintance (green) strategies.

There are innumerable alternative strategies for selecting surveillance nodes, including prioritization based on other well-studied network centrality measures [38, 39]. For example, *k*-shell decomposition [37] and eigenvector centrality [53] are more computationally demanding and challenging to implement in practice, yet are not expected to significantly improve outcomes (S2 Fig).

### A theoretically optimal surveillance strategy

Following Newman [53], we use percolation theory to model SIR epidemics on networks, and derive the optimal surveillance group for early detection of an epidemic. We consider a disease with transmissibility *β* and recovery rate *γ* spreading through a network of size *N*. During the initial outbreak, the probabilities of each node being infected at time *t* are approximated by the vector

$$\mathbf{x}(t) \approx e^{(\beta\kappa - \gamma)t}\,\mathbf{v} \tag{3}$$

where *κ* is the leading eigenvalue of the adjacency matrix and **v** its corresponding eigenvector [53].

We extend this equation to calculate the time lag between a subset *S* of the network of size *M* ≤ *N* reaching a given prevalence threshold *p* and the overall population prevalence reaching *p*. Let **1** be the vector of length *N* containing all ones, **1** = (1, …, 1), and **1**_{S} be the binary vector of dimension *N* indicating which *M* nodes are under surveillance:

$$(\mathbf{1}_S)_i = \begin{cases} 1 & \text{if node } i \in S \\ 0 & \text{otherwise} \end{cases}$$

For example, if the 1% most connected nodes were selected for surveillance in a network of size *N* = 1000, then the entries of **1**_{S} corresponding to the ten highest degree nodes would be one, and the remaining entries would be zero.

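Let *τ* and *τ*_{S} be the times at which the entire population and a given surveillance group reach the prevalence thresholds *p* and *p*_{S}, respectively. Substituting into the above equation, we find

$$p \approx \frac{\mathbf{v}\cdot\mathbf{1}}{N}\, e^{(\beta\kappa - \gamma)\tau} \tag{4}$$

and

$$p_S \approx \frac{\mathbf{v}\cdot\mathbf{1}_S}{M}\, e^{(\beta\kappa - \gamma)\tau_S} \tag{5}$$

To solve for the timing of early warning achieved through surveillance, Δ*τ* = *τ*_{S} − *τ*, we equate *p* = *p*_{S},

$$\frac{\mathbf{v}\cdot\mathbf{1}}{N}\, e^{(\beta\kappa - \gamma)\tau} = \frac{\mathbf{v}\cdot\mathbf{1}_S}{M}\, e^{(\beta\kappa - \gamma)\tau_S} \tag{6}$$

This implies

$$\Delta\tau = \frac{1}{\beta\kappa - \gamma}\,\ln\frac{c}{c_S} \tag{7}$$

where *c* = **v** · **1**/*N* and *c*_{S} = **v** · **1**_{S}/*M* are the average eigenvector centralities in the network as a whole and the surveillance subset, respectively. The early season lag between the surveillance subset and the whole population can thus be positive or negative, and depends on the ratio of their average eigenvector centralities.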
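To make the dependence on average eigenvector centrality concrete, the sketch below evaluates the lag numerically for a small network. It assumes that early growth follows exp((*βκ* − *γ*)*t*) along the leading eigenvector, so that Δ*τ* = ln(*c*/*c*_{S})/(*βκ* − *γ*); this is a reading of the derivation above rather than a verified transcription of Eq 7. The function names are hypothetical, and the eigenvector is obtained by a shifted power iteration, a standard technique chosen here for simplicity.

```python
import math


def leading_eigenpair(adj, iterations=1000):
    """Leading eigenvalue and unit-norm eigenvector of the adjacency matrix
    of a connected, undirected network given as {node: [neighbours]}."""
    v = {node: 1.0 for node in adj}
    for _ in range(iterations):
        # Multiply by (A + I); the shift avoids oscillation on bipartite graphs.
        w = {i: v[i] + sum(v[j] for j in adj[i]) for i in adj}
        norm = math.sqrt(sum(x * x for x in w.values()))
        v = {i: x / norm for i, x in w.items()}
    # Rayleigh quotient v.(A v) with |v| = 1 gives the eigenvalue of A.
    kappa = sum(v[i] * sum(v[j] for j in adj[i]) for i in adj)
    return kappa, v


def early_warning_lag(adj, sensors, beta, gamma):
    """Delta tau = tau_S - tau; negative values mean the sensors are early."""
    kappa, v = leading_eigenpair(adj)
    growth = beta * kappa - gamma
    if growth <= 0:
        raise ValueError("no epidemic growth: beta * kappa must exceed gamma")
    c = sum(v.values()) / len(adj)                       # network average
    c_s = sum(v[i] for i in sensors) / len(sensors)      # sensor average
    return math.log(c / c_s) / growth
```

For a star network with one hub and four leaves, monitoring the hub alone gives *c*/*c*_{S} = 0.6 and hence a negative lag, whereas monitoring every node gives a lag of zero.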

We assessed the validity of this mean field approximation by comparing the expected early warning period (Eq 7) to simulated early warning periods for both the most connected subset and the subset of the 1% highest eigenvector centrality nodes. To match the assumptions of our mean field model, we simulated SIR rather than SEIR transmission dynamics. The simulations mirrored the theoretical expectations for both types of surveillance subsets in all three networks, as shown for the scale-free network (Fig 6).

**Fig 6.** As *R*_{0} increases, the lag between the surveillance subset and the entire population reaching the early detection threshold decreases for both the (A) 1% highest eigenvector centrality nodes and (B) 1% highest degree nodes. Red curves indicate theoretical approximations; box plots show distribution of SIR simulation results.

Next, we solve for the surveillance subset that maximizes the length of the early warning period. For a given surveillance system size *M*, the earliest warning is achieved when **1**_{S} indicates the *M* nodes in the network with the highest valued entries in **v**. Thus, the theoretically optimal surveillance strategy for early warning of epidemic onset selects nodes with the highest eigenvector centrality.

Importantly, Δ*τ* depends on the disease parameters *β* and *γ*. Regardless of the choice of surveillance nodes **1**_{S}, the timing of the early warning period will, therefore, increase as *R*_{0} decreases. An exception occurs when the average eigenvector centrality in the surveillance subset equals that in the population as a whole (*c* = *c*_{S}). In that case, there is no early warning (Δ*τ* = 0). These properties are reflected in the sensitivity to *R*_{0} observed in our simulations (Fig 4B, 4D and 4F).

For the networks under consideration, the most connected strategy produces surveillance groups with relatively high eigenvector centrality while the random strategy yields groups with average eigenvector centrality. However, eigenvector centrality in random acquaintance groups depends on the underlying network: in homogeneous networks such as the student network, it will be average, whereas in heterogeneous networks, it will be above average.

### Identifying the optimal surveillance nodes

Identifying individuals with the highest eigenvector centrality is challenging in real-world populations, where the underlying network structure is generally unknown. However, finding individuals with above average degree centrality is possible using local information. If eigenvector centrality is correlated with degree centrality, as it is in the three networks we consider (see Fig 7), it may be possible to use highly connected nodes as a proxy for high eigenvector centrality nodes.

**Fig 7.** Both eigenvector and degree centralities are scaled to have maximum value 1, and log-log plots are shown for (A) and (C). The student network shows strong correlation between the two centrality measures, with a Spearman rank correlation coefficient of 0.819. For the scale-free and Montreal networks, the measures have more moderate rank correlation coefficients of 0.441 and 0.620, respectively.

One strategy for finding high degree centrality nodes is to follow chains of random acquaintances. This has been explored extensively in the context of respondent-driven sampling, such as chain-referral (i.e., “snowball”) sampling [64]. In particular, consider the simple random walk in which, at each step, the walker moves to a neighboring node selected uniformly at random. For connected, undirected networks, this is equivalent to the PageRank algorithm with no damping factor [53]. Assuming the network is connected, the distribution of the random walker after *m* steps approaches a stationary distribution as *m* → ∞, in which the probability of landing on a node is exactly proportional to its degree [65]. Thus, the more connected the node, the more likely we are to reach it.

Precisely, let *k*_{i} be the degree of node *i* and *P*(*k*) the degree distribution of the network. The *n*th moment of the degree distribution is:

$$\langle k^n \rangle = \frac{1}{N}\sum_{i=1}^{N} k_i^n = \sum_{k} k^n P(k) \tag{8}$$

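Let *D*_{m} denote the degree of the node at which the random walk resides on the *m*th step, starting from a node chosen uniformly at random. Assuming the mean degree 〈*k*〉 < ∞, then the distribution of *D*_{∞} is given by *kP*(*k*)/〈*k*〉. If 〈*k*^{3}〉 < ∞, which is true for any finite graph but will be violated for power-law networks without cutoff, the mean and standard deviation of *D*_{∞} are given by

$$\mu_\infty = \frac{\langle k^2 \rangle}{\langle k \rangle}, \qquad \sigma_\infty = \sqrt{\frac{\langle k^3 \rangle}{\langle k \rangle} - \left(\frac{\langle k^2 \rangle}{\langle k \rangle}\right)^2} \tag{9}$$
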
By comparison, the distribution of randomly sampled nodes (*D*_{0}) has mean *μ*_{0} = 〈*k*〉 and standard deviation $\sigma_0 = \sqrt{\langle k^2 \rangle - \langle k \rangle^2}$. Thus, the random walk sample is biased towards nodes with larger degrees. For intermediate values of *m*, the distribution of *D*_{m} can only be derived with full knowledge of the underlying graph. However, this distribution converges to that of *D*_{∞} at a rate that depends on the second largest eigenvalue of the adjacency matrix of the graph. If this eigenvalue is close to one, which is usually the case for connected networks with high modularity, convergence is very slow and random walk sampling may require numerous steps to achieve its optimal performance. Methods that bias the random walk towards higher eigenvector centrality nodes should be more effective in this setting. For example, the maximal entropy random walk samples nodes proportional to eigenvector centrality just as the simple random walk considered earlier samples nodes according to their degree centralities [66]. However, the transition probabilities of the maximal entropy random walk require global information about the network, making it impractical to implement without approximation as part of the sampling strategy.
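The degree bias of random-walk sampling can be checked numerically. The sketch below computes the stationary mean and standard deviation implied by the distribution *kP*(*k*)/〈*k*〉 and provides a simple random walker. The function names are hypothetical, and the code assumes an undirected network with no isolated nodes, supplied as an adjacency dictionary.

```python
import random


def degree_moment(adj, n):
    """n-th moment of the degree distribution, <k^n>."""
    return sum(len(neigh) ** n for neigh in adj.values()) / len(adj)


def stationary_degree_stats(adj):
    """Mean and standard deviation of the degree of a node drawn with
    probability proportional to its degree, i.e. from k P(k) / <k>."""
    k1, k2, k3 = (degree_moment(adj, n) for n in (1, 2, 3))
    mu = k2 / k1
    variance = max(0.0, k3 / k1 - mu ** 2)   # guard against rounding error
    return mu, variance ** 0.5


def random_walk_endpoint(adj, steps, rng):
    """Node reached after `steps` moves from a uniformly random start."""
    node = rng.choice(list(adj))
    for _ in range(steps):
        node = rng.choice(adj[node])
    return node
```

On a star with one hub and four leaves, the mean degree is 1.6, whereas a degree-proportional sample has mean 2.5 and standard deviation 1.5, which illustrates how strongly the walk favours hubs in heterogeneous networks.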

Eq 9 provides a theoretical upper bound on the mean centrality that can be achieved when using a random walk on a network to design a surveillance system. In particular, for a random-walk surveillance subset of size *M* = *ϵN* with fixed *ϵ* and large *N*, the empirical mean of the sample will become approximately normal with mean *μ*_{∞} and standard deviation $\sigma_\infty/\sqrt{M}$, as illustrated for our three study networks (Fig 8).

**Fig 8.** For purposes of comparison, degree (blue) and eigenvector centrality (red) are divided by the maximum degree and maximum eigenvector centrality, respectively, in each network. Mean degree approaches its theoretical limit (dashed lines), and mean eigenvector centrality also increases as the random walks progress in the (A) scale-free (subset contains *ϵ* = 1% of nodes), (B) student (*ϵ* = 2%), and (C) Montreal (*ϵ* = 0.1%) networks. As expected, the mean degree converges to a normal distribution with mean *μ*_{∞} and standard deviation *σ*_{∞} (gray shading) as the number of steps in the walk, *m*, increases. The random walks converge within a few steps in the scale-free and student networks, but require more steps in the highly modular Montreal network.

### Conclusions

The success of both traditional surveillance systems such as the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet) and next generation participatory systems including FluNearYou [17, 18], depends on targeted recruitment of reliable, informative providers. With Meaningful Use and the advent of digital disease detection, we are moving from an era of sparse, volunteer-based data into an era of data inundation [16, 61]. Nonetheless, we still face the challenge of finding reliable data sources. Effective mining of electronic medical records, social media and other internet source data, such as Google, Twitter or Facebook, requires sifting through petabytes of data for streams that can provide early and accurate information about emerging outbreaks. While random representative sampling is a good rule-of-thumb and has guided the development of numerous surveillance systems, we can improve the timeliness of surveillance by exploiting our evolving understanding of social networks and their impacts on infectious disease dynamics [24, 28–30, 45, 49, 52–55, 62, 63].

In an ideal scenario where both the contact network and the reproduction number (*R*_{0}) of the disease are known in advance, public health agencies can monitor the most informative nodes and achieve very accurate and early assessments of emerging epidemics. For example, we find that surveillance of the most connected individuals in the Montreal WiFi network can increase the lead time for detecting epidemic emergence by two to three weeks and for anticipating the epidemic peak by over a week. We show analytically that the optimal strategy for early detection of emerging outbreaks is targeting individuals with the highest eigenvector centrality, a measure that considers the connectivity of a node’s neighbors, and those neighbors’ neighbors, and so on [53]. It can only be calculated with full knowledge of the network, and estimates the proportion of time spent on a node during an infinitely long random walk along the edges of the network. While this strategy provides the longest lead time (between the surveillance system crossing a prevalence threshold and the rest of the population crossing that threshold), the timing is highly dependent on *R*_{0}. In fact, regardless of which nodes are under surveillance, epidemiological activity becomes more synchronized and the lag time shrinks as *R*_{0} increases.

This ideal scenario is generally unrealistic. When the contact network is unknown, we cannot easily identify the most central individuals, for many measures of centrality. Even if we could monitor the most connected individuals, correct interpretation of the resulting signal requires some knowledge of *R*_{0}. In general, low *R*_{0} implies a longer lag time between epidemiological events in the surveillance group and corresponding events in the general population, and a larger discrepancy between prevalence in the surveillance group and overall epidemiological activity. Several recent studies have identified epidemiologically relevant measures of centrality that can be estimated from readily obtainable school, social network, and workplace data [42, 43, 47, 48]. We hypothesize that these more tractable centrality-based sensors may exhibit a similar trade-off between timeliness and robustness.

The random acquaintance strategy, which chooses random contacts of random nodes, provides a practical method for identifying individuals with higher than average centrality. The intuition is that when choosing a random *friend of a node* rather than just a *random node*, the choice is biased towards individuals with more friends. In heterogeneous networks, such as the scale-free and Montreal WiFi networks considered here, random acquaintance groups provide some degree of early warning (significantly more than randomly selected nodes) and exhibit epidemic curves that reflect overall disease activity (significantly better than the most connected nodes). This is corroborated by the empirical finding that friends of random students served as better outbreak sentinels than random students during the 2009 H1N1 pandemic [46]. Although the timing of the early warning and the discrepancy between the estimated prevalence and true prevalence will depend on *R*_{0}, the uncertainty can potentially be quantified and incorporated into confidence intervals.
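The bias toward well-connected individuals can be illustrated with a short simulation. The sketch below is an assumption-laden illustration, not the procedure used to generate the results in this paper: it builds a synthetic heterogeneous network with a simple preferential-attachment rule, then compares the mean degree of randomly chosen sensors with that of random acquaintances. All function names, the network size, and the sample size are hypothetical choices.

```python
import random


def preferential_attachment_graph(n, m, rng):
    """Grow an undirected network in which each new node links to m
    existing nodes chosen with probability proportional to their degree."""
    adj = {i: set() for i in range(n)}
    stubs = []  # every node appears once per edge end, so sampling is degree-biased
    # Start from a fully connected core of m + 1 nodes.
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            adj[i].add(j)
            adj[j].add(i)
            stubs += [i, j]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs += [new, t]
    return adj


def random_sensors(adj, k, rng):
    """Random strategy: k distinct nodes chosen uniformly at random."""
    return rng.sample(sorted(adj), k)


def friends_of_random_sensors(adj, k, rng):
    """Random acquaintance strategy: pick a random node, then one of its
    contacts uniformly at random. Repeats are allowed in this sketch."""
    nodes = sorted(adj)
    sensors = []
    for _ in range(k):
        seed_node = rng.choice(nodes)
        sensors.append(rng.choice(sorted(adj[seed_node])))
    return sensors


def mean_degree(adj, sensors):
    return sum(len(adj[v]) for v in sensors) / len(sensors)


rng = random.Random(2016)
network = preferential_attachment_graph(n=2000, m=2, rng=rng)
random_mean = mean_degree(network, random_sensors(network, 500, rng))
friend_mean = mean_degree(network, friends_of_random_sensors(network, 500, rng))
print("mean degree, random sensors:", random_mean)
print("mean degree, friends of random:", friend_mean)
```

Note that the acquaintance sampler only ever asks a selected individual to name one contact, so it requires no global knowledge of the network, which is the practical appeal of the strategy.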

In a relatively homogeneous network, such as the Venezuelan student network, the random acquaintance strategy finds fairly average nodes and does not improve upon the random strategy with respect to the surveillance objectives. This finding is consistent with basic theory on Erdős-Rényi networks: in a random network with a Poisson degree distribution, a random acquaintance has, on average, exactly the mean number of contacts in addition to the contact through which it was reached [57]. Therefore, if a population is sufficiently homogeneous, surveillance systems should simply target random individuals or employ other methods for identifying highly connected individuals.
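The Poisson case can be checked with a short calculation. The sketch below assumes an uncorrelated random network, in which the chance of reaching a node by following an edge is proportional to its degree. The symbols $p_k$ (degree distribution), $q_k$ (degree distribution of a node reached along an edge) and $\lambda$ (mean degree of the Poisson network) are notation introduced here for illustration.

$$
q_k = \frac{k\,p_k}{\langle k \rangle},
\qquad
\sum_k k\,q_k = \frac{\langle k^2 \rangle}{\langle k \rangle}
= \langle k \rangle + \frac{\mathrm{Var}(k)}{\langle k \rangle}.
$$

For a Poisson degree distribution, $\mathrm{Var}(k) = \langle k \rangle = \lambda$, so

$$
\sum_k k\,q_k = \lambda + 1,
\qquad
\sum_k (k-1)\,q_k = \lambda .
$$

Under these assumptions, a random acquaintance has on average only one contact more than a random node, namely the edge along which it was reached, and its number of other contacts equals the population mean. In heterogeneous networks, where $\mathrm{Var}(k)$ is much larger than $\langle k \rangle$, the same expression gives a substantial gain in degree.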

We conclude that the friends-of-random strategy, while not optimal for all public health objectives, balances risk and representativeness, provides reasonably robust, accurate and early warning, and can be applied without knowledge of the underlying contact network. Volunteer-based surveillance systems, like Flu Near You, could potentially improve coverage by recruiting friends of existing members. Network analysis, in general, allows us to anticipate individual-level epidemiological risk and can thereby help us improve and strategically extend surveillance systems to enhance the early and reliable identification of outbreaks.

## Supporting Information

### S1 Fig. Venezuelan students network.

Key features of the Venezuelan students network

https://doi.org/10.1371/journal.pcbi.1004928.s001

(JPEG)

### S2 Fig. Comparison of strategies.

Performance of the top-degree (green), eigenvector centrality (red) and k-shell decomposition strategies when calculating the Peak time difference (left) and Peak ratio (right) for the three networks used: Scale-free (top), Students (middle) and Montreal (bottom).

https://doi.org/10.1371/journal.pcbi.1004928.s002

(JPEG)

## Author Contributions

Conceived and designed the experiments: LAM JLH. Performed the experiments: JLH RS. Analyzed the data: JLH LAM RS. Wrote the paper: JLH LAM APG JSB RS.

## References

- 1. Centers for Disease Control and Prevention. Overview of Influenza Surveillance in the United States. 2016. Available: http://www.cdc.gov/flu/weekly/overview.htm. Accessed April 22nd 2014
- 2. National Association of County & City Health Officials. July 2013. Available: http://www.naccho.org/topics/infrastructure/lhdbudget/upload/Survey-Findings-Brief-8-13-13-3.pdf
- 3. Brownstein JS, Freifeld CC, Madoff LC. Digital Disease Detection—Harnessing the Web for Public Health Surveillance. The New England journal of medicine. 2009;360(21):2153–2157. pmid:19423867
- 4. Brownstein JS, Freifeld CC, Reis BY, Mandl KD. Surveillance Sans Frontières: Internet-based emerging infectious disease intelligence and the HealthMap project. PLoS Med 2008;5:e151–e151. pmid:18613747
- 5. Freifeld CC, Mandl KD, Reis BY, Brownstein JS. HealthMap: global infectious disease monitoring through automated classification and visualization of Internet media reports. J Am Med Inform Assoc 2008; 15:150–157
- 6. Ginsberg J, Mohebbi MH, Patel RS, Branner L, Smolinski MS, et al. (2009), Detecting influenza epidemics using search engine query data. Nature 457:1012–1014. pmid:19020500
- 7. Carneiro HA, Mylonakis E (2009) Google trends: a web-based tool for real time surveillance of disease outbreaks. Clinical Infect. Dis. 49: 1557–1564.
- 8. Pervaiz Fahad, Pervaiz Mansoor, Abdur Rehman Nabeel, Saif Umar; FluBreaks: Early Epidemic Detection from Google Flu Trends. J Med Internet Res. 2012 Sep-Oct; 14(5): e125. pmid:23037553
- 9. Ortiz JR, Zhou H, Shay DK, Neuzil KM, Fowlkes AL, et al. (2011) Monitoring Influenza Activity in the United States: A Comparison of Traditional Surveillance Systems with Google Flu Trends. PLoS ONE 6(4): e18687. pmid:21556151
- 10. Dugas AF, Jalalpour M, Gel Y, Levin S, Torcaso F, et al. (2013) Influenza Forecasting with Google Flu Trends. PLoS ONE 8(2): e56176. pmid:23457520
- 11. Seifter A, Schwarzwalder A, Geis K, Aucott J. The utility of “Google Trends” for epidemiological research: Lyme disease as an example. Geospatial Health. 2010;4:135–137. pmid:20503183
- 12. Valdivia A, López-Alcalde J, Vicente M, Pichiule M, Ruiz M, Ordobas M. Monitoring influenza activity in Europe with Google Flu Trends: comparison with the findings of sentinel physician networks—results for 2009–10. Euro Surveill. 2010;15(29):pii = 19621. pmid:20667303
- 13. Scarpino SV, Dimitrov NB, Meyers LA. Optimizing provider recruitment for influenza surveillance networks. PLoS computational biology. 2012;8(4):e1002472. pmid:22511860
- 14. Wilson N, Mason K, Tobias M, Peacey M, Huang QS, Baker M. Interpreting “Google Flu Trends” data for pandemic H1N1 influenza: The New Zealand experience. Euro Surveill. 2009;14(44):pii = 19386. pmid:19941777
- 15. Yuan Q, Nsoesie EO, Lv B, Peng G, Chunara R, Brownstein JS. Monitoring influenza epidemics in China with search query from Baidu. PloS one. 2013;8(5):e64323. pmid:23750192
- 16. Brownstein JS, Freifeld CC and Madoff LC, Digital Disease Detection—Harnessing the Web for Public Health Surveillance, New England Journal of Medicine 21(360), 2153–2157 (2009).
- 17. Flu Near You. 2015. Available: https://flunearyou.org/
- 18. Chunara R, Aman S, Smolinski M, Brownstein J. Flu Near You: An Online Self-reported Influenza Surveillance System in the USA. ISDS Conference Abstracts. 2013;5(1)
- 19. Chew C, Eysenbach G. Pandemics in the age of Twitter: Content analysis of tweets during the 2009 H1N1 outbreak. PLoS ONE. 2010;5:e14118. pmid:21124761
- 20. Broniatowski DA, Paul MJ, Dredze M. National and Local Influenza Surveillance through Twitter: An Analysis of the 2012–2013 Influenza Epidemic. PloS one. 2013;8(12):e83672. pmid:24349542
- 21. Boulos M, Sanfilippo A, Corley C, Wheeler S. Social web mining and exploitation for serious applications: Technosocial predictive analytics and related technologies for public health, environmental and national security surveillance. Computer Methods and Programs in Biomedicine. 2010;100:16–23.
- 22. Lee BK. Epidemiologic Research and Web 2.0—the User-driven Web. Epidemiology. 2010;21(6):760–763. pmid:20924229
- 23. Newman M.E.J., (2002), Spread of epidemic disease on networks. Phys. Rev. E 66, art. no.-016128.
- 24. Meyers L.A., Newman M.E.J., et al., (2003), Applying network theory to epidemics: control measures for mycoplasma pneumoniae outbreaks. Emerg. Infect. Dis. 9, 204.
- 25. Salathé M, Jones JH (2010) Dynamics and Control of Diseases in Networks with Community Structure. PLoS Comput Biol 6(4): e1000736. pmid:20386735
- 26. Zonghua Liu and Bambi Hu; Epidemic spreading in community networks; Europhys. Lett., 72 (2), pp. 315–321 (2005).
- 27. May Robert M. and Lloyd Alun L.; Infection dynamics on scale-free networks; Phys Rev E, 64, 066112.
- 28. Meyers L.A., Pourbohloul B., Newman M.E.J., Skowronski D. M., Brunham R.C.; Network theory and SARS: predicting outbreak diversity; Journal of Theoretical Biology 232 (2005) 71–81. pmid:15498594
- 29. Meyers L.A., Newman M.E.J., Pourbohloul B.; Predicting epidemics on directed contact networks; Journal of Theoretical Biology 240 (2006) 400–418. pmid:16300796
- 30. Volz E.; SIR dynamics in random networks with heterogeneous connectivity; J. Math. Biol. (2008) 56:293–310. pmid:17668212
- 31. Barrat A., Barthélemy M., Vespignani A. (2008) Dynamical processes on complex networks. Cambridge, UK: Cambridge University Press.
- 32. Friedkin N.E. (1991) Theoretical foundations for centrality measures. Am. J. Sociol. 96, 1478–1504.
- 33. Albert R., Jeong H., Barabási A.-L. (2000), Error and attack tolerance of complex networks, Nature 406, 378–382. pmid:10935628
- 34. Cohen R., Erez K., ben-Avraham D., Havlin S. (2001) Breakdown of the internet under intentional attack. Phys. Rev. Lett. 86, 3682–3685.
- 35. Pastor-Satorras R., Vespignani A. (2001) Epidemic spreading in scale-free networks. Phys. Rev. Lett. 86, 3200–3203. pmid:11290142
- 36. Lloyd A., May R. (2001) Epidemiology: how viruses spread among computers and people. Science 292, 1316–1317. pmid:11360990
- 37. Kitsak M., Gallos L. K., Havlin S., Liljeros F., Muchnik L., Stanley H. E., Makse H. A. (2010) Identification of influential spreaders in complex networks, Nature Physics 6, 888–893.
- 38. Christley R.M., Pinchbeck G.L., Bowers R.G., Clancy D., French N.P., Bennett R. and Turner J., Infection in Social Networks: Using Network Analysis to Identify High-Risk Individuals, Am J Epidemiol 2005;162:1024–1031. pmid:16177140
- 39. Bell DC, Atkinson JS, Carlson JW (1999) Centrality measures for disease transmission networks. Social Networks 21: 1–21.
- 40. Bajardi P., Barrat A., Savani L., Colizza V. (2012) Optimizing surveillance for livestock disease spreading through animal movements. J. R. Soc. Interface, 9, 2814–2825. pmid:22728387
- 41. Cohen R., Havlin S., ben-Avraham D. (2003) Efficient immunization strategies for computer networks and populations, Physical Review Letters 91, 247901. pmid:14683159
- 42. Potter Gail E., Smieszek Timo and Sailer Kerstin (2015). Modeling workplace contact networks: The effects of organizational structure, architecture, and reporting errors on epidemic predictions. Network Science, 3, pp 298–325. pmid:26634122
- 43. Mastrandrea R, Fournet J, Barrat A (2015) Contact Patterns in a High School: A Comparison between Data Collected Using Wearable Sensors, Contact Diaries and Friendship Surveys. PLoS ONE 10(9): e0136497. pmid:26325289
- 44. Abbey H; An examination of the Reed-Frost theory of epidemics. Hum Biol 1952, 24:201–233. pmid:12990130
- 45. Ferrari MJ, Bansal S, Meyers LA, Bjornstad ON; Network frailty and the geometry of herd immunity; Proc R Soc B Biol Sci 2006, 273: 2743–2748.
- 46. Christakis N.A., Fowler J.H. (2010), Social Network Sensors for Early Detection of Contagious Outbreaks. PLoS ONE 5(9): e12948. pmid:20856792
- 47. Smieszek Timo and Salathé Marcel, A low-cost method to assess the epidemiological importance of individuals in controlling infectious disease outbreaks, BMC Medicine 2013, 11:35. pmid:23402633
- 48. Chowell Gerardo and Viboud Cécile, A practical method to target individuals for outbreak detection and control, BMC Medicine 2013, 11:36. pmid:23402649
- 49. Meyers L.A. (2007) Contact network epidemiology: Bond percolation applied to infectious disease prediction and control. Bulletin of the American Mathematical Society 44: 63–86.
- 50. Huai Y, Xiang N, Zhou L, Feng L, Peng Z, Chapman RS, et al. Incubation period for human cases of avian influenza A (H5N1) infection, China. Emerging Infectious Diseases, Vol. 14, No. 11, November 2008.
- 51. De Serres G, Rouleau I, Hamelin M-E, Quach C, Skowronski D, Flamand L, et al. Contagious period for pandemic (H1N1) 2009. Emerging Infectious Diseases, Vol. 16, No. 5, May 2010.
- 52. Watts D.J. and Strogatz S.H.; Collective dynamics of ‘small-world’ networks. Nature 393, 440–442 (1998). pmid:9623998
- 53. Newman M.E.J. (2010), Networks: An Introduction, Oxford University Press.
- 54. Cho A. (2013), Network science at center of surveillance dispute, Science 340, 1272. pmid:23766303
- 55. Barabási A.-L., Albert R. (1999), Emergence of scaling in random networks, Science 286, 509–512. pmid:10521342
- 56. Hoen A.G., Hladish T.J., Eggo R.M., Lenczner M., Brownstein J.S., Meyers L.A. (2015). Epidemic Wave Dynamics Attributable to Urban Community Structure: A Theoretical Characterization of Disease Transmission in a Large Network. Journal of Medical Internet Research 17: e169. pmid:26156032
- 57. Newman M.E.J. (2001), Ego-centered networks and the ripple effect. Social Networks 25, 83–95.
- 58. Flu view. 2015–2016 Influenza Season, Week 6, ending February 13, 2016. Available: http://www.cdc.gov/flu/weekly/pdf/External_F1606.pdf
- 59. Thompson William W., Comanor Lorraine, and Shay David K., Epidemiology of Seasonal Influenza: Use of Surveillance Data and Statistical Models to Estimate the Burden of Disease, The Journal of Infectious Diseases 2006; 194:S82–91. pmid:17163394
- 60. Rath TM, Carreras M, Sebastiani P. Automated detection of influenza epidemics with hidden Markov models. In: Proceedings of the international symposium on intelligent data analysis; 2003.
- 61. Department of Health and Human Services. Centers for Medicare & Medicaid Services. July 2010. Available: http://www.gpo.gov/fdsys/pkg/FR-2010-07-28/pdf/2010-17207.pdf
- 62. Bansal S. and Meyers L.A. (2012) The impact of past epidemics on future disease dynamics. Journal of Theoretical Biology 309: 176–184. pmid:22721993
- 63. Volz E.M., Miller J.C., Galvani A.P., Meyers L.A. (2011) Effects of Heterogeneous and Clustered Contact Patterns on Infectious Disease Dynamics. PLoS Computational Biology 7: e1002042. pmid:21673864
- 64. Salganik M. J., and Heckathorn D. D. (2004). Sampling and estimation in hidden populations using respondent-driven sampling. Sociological methodology, 34(1), 193–240.
- 65. Lovász L. (1993). Random walks on graphs: A survey. Combinatorics, Paul Erdős is Eighty, 2(1), 1–46.
- 66. Burda Z., Duda J., Luck J. M., and Waclaw B. (2009). Localization of the maximal entropy random walk. Physical Review Letters, 102(16), 160602. pmid:19518691