Digital contact tracing and network theory to stop the spread of COVID-19 using big-data on human mobility geolocalization

The spread of COVID-19 caused by the SARS-CoV-2 virus has become a worldwide problem with devastating consequences. Here, we implement a comprehensive contact tracing and network analysis to find an optimized quarantine protocol to dismantle the chain of transmission of coronavirus with minimal disruptions to society. We track billions of anonymized GPS human mobility datapoints to monitor the evolution of the contact network of disease transmission before and after mass quarantines. As a consequence of the lockdowns, people’s mobility decreases by 53%, which results in a drastic disintegration of the transmission network by 90%. However, this disintegration did not halt the spreading of the disease. Our analysis indicates that superspreading k-core structures persist in the transmission network to prolong the pandemic. Once the k-cores are identified, an optimized strategy to break the chain of transmission is to quarantine a minimal number of ‘weak links’ with high betweenness centrality connecting the large k-cores.


I. INTRODUCTION
In the absence of a vaccine or treatment for COVID-19, state-sponsored lockdowns have been implemented worldwide to halt the spread of the ongoing pandemic, creating large social and economic disruptions [1][2][3]. In addition, some countries have also implemented digital contact tracing protocols to track the contacts of infected people and reinforce quarantines by targeting those at high risk of becoming infected [4][5][6][7][8][9][10][11][12][13]. Here we develop, calibrate, and deploy a contact tracing algorithm to track the chain of disease transmission across society. We then search for quarantine protocols to halt the epidemic spreading with minimal social disruptions [14][15][16][17][18][19].
Our study uses two complementary datasets. The first includes data from 'Grandata-United Nations Development Programme partnership to combat COVID-19 with data' [20]. It is composed of anonymized global positioning system (GPS) data from a compilation of hundreds of mobile applications (apps) across Latin America that allow us to track the trajectories of people (users). The data identify each mobile phone device with a unique encrypted mobile ID and specify its latitude and longitude through time, encoded as a geohash with 12-digit precision. Typically, this dataset generates ∼450 million GPS data points per day across Latin America and, in particular, in the state of Ceará, Brazil (see SM sections I-V).
The second dataset is an anonymized list of confirmed COVID-19 patients obtained from the Health Department authorities from both states. It includes the geohash of the address, the SARS-CoV-2 test detection date, and the first day of COVID-19 symptoms. We cross-match the geolocation of the patients with the GPS dataset to obtain the encrypted mobile ID of the patients (see SM sections I-V). We then trace the geolocalized trajectories of COVID-19 patients over a period of -14/+7 days from the onset of symptoms to look for contacts of the infected person and define the transmission network using the model described below.

II. COVID-19 MODEL
The COVID-19 spreading model is represented by a Susceptible-Exposed-Infectious-Recovered (SEIR) process [15] (Fig. 1a). The infectiousness period of an infected person starts 2 days before and lasts up to 5 days after the onset of symptoms [21]. In this paper, we add two days to each of these limits to conservatively capture most transmissions. Thus, in principle, to trace those people potentially infected by COVID-19 patients, we track contacts from 4 days before to 7 days after the reported date of first symptoms (see Fig. 1a). In addition, we extend the tracing period further back in time to also consider exposures that could come from asymptomatic cases. Exposure starts the incubation period of the infected person, which can occur up to 12.5 days before the onset of symptoms (5.2 days on average, 95th percentile 12.5 days [22,23], Fig. 1a). To conservatively trace these exposure events, we add ∼2 days to this incubation period and obtain the widely used 14-day period. Hence, to trace transmission and exposure cases, we perform contact tracing over -14/+7 days from the onset of symptoms (Fig. 1a). We note that the peak of infectiousness as well as 44% (95% confidence interval, 25-69%) of infected cases occur during the pre-symptomatic stage [21]. Thus, performing contact tracing is essential to stop the spreading of the disease.
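The arithmetic behind the -14/+7 window can be written out as a short sketch. The day counts are the ones quoted above; the function name `tracing_window` and the dictionary keys are illustrative choices, not part of the original pipeline.

```python
from datetime import date, timedelta

# Day counts quoted in the text; all names here are illustrative.
INFECTIOUS_BEFORE = 2   # infectiousness starts 2 days before symptom onset [21]
INFECTIOUS_AFTER = 5    # and lasts up to 5 days after onset [21]
SAFETY_MARGIN = 2       # conservative margin added to each limit
EXPOSURE_LOOKBACK = 14  # 12.5-day incubation (95th percentile) plus ~2 days


def tracing_window(onset: date) -> dict:
    """Return the dates that bound contact tracing around symptom onset."""
    return {
        # Earliest date at which the patient may have been exposed.
        "exposure_start": onset - timedelta(days=EXPOSURE_LOOKBACK),
        # Earliest date at which the patient may have infected others.
        "transmission_start": onset
        - timedelta(days=INFECTIOUS_BEFORE + SAFETY_MARGIN),
        # Last date at which the patient may have infected others.
        "end": onset + timedelta(days=INFECTIOUS_AFTER + SAFETY_MARGIN),
    }


window = tracing_window(date(2020, 3, 20))
```

For an onset on March 20, 2020, the window spans March 6 to March 27, with the transmission part starting on March 16.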

III. CONTACT MODEL
The GPS geolocation of the trajectories of both infected and susceptible people is used to trace several layers of contacts in the transmission network using the following model. A contact at time stamp $n$ is initiated with an infected user (source) at time $t_0$ (see Fig. 1b). At $t_0$ we draw a contact area as a circle centered on the source position with radius $r$. We then gather all the GPS datapoints from susceptible users (targets) that enter the contact area from $t_0$ to $t_0 + T$, where $T$ is the total exposure time. We follow the trajectories of source and target within the time-space area and compute the probability of infection at time stamp $n$ as

$$p_i[n] = p_d[n]\, p_t[n],$$

where $p_d[n]$ is the spatial component and $p_t[n]$ is the temporal component. When the average overlap between source and target is zero, then $p_d[n] = 1$, and when the overlap is $2r$, then $p_d[n] = 0$. On the other hand, when the exposure time is $\geq T$, then $p_t[n] = 1$, and it decreases to $p_t[n] = 0$ as the exposure time decreases (see SM sections I-V for definitions). The probability $p_d[n]$ quantifies the contact probability for two users in the same area defined by $r$. A contact requires not only a spatial overlap but also a temporal overlap, $p_t[n]$, which quantifies the probability that two users met based on the time they both spent in the same area. We then combine these two probabilities for each time stamp $n$ into their product.
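As a minimal sketch of the product rule, the snippet below combines a spatial and a temporal factor for one time stamp. The exact functional forms are defined in SM sections I-V and are not reproduced here; the linear interpolations between the stated end points (1 at zero and 0 at $2r$ for the spatial factor, 0 at zero and 1 at $T$ for the temporal factor) are assumptions made only for illustration, as are all function and argument names. The defaults $r = 8$ m and $T = 30$ min are the calibrated values reported in the text.

```python
def spatial_probability(d: float, r: float = 8.0) -> float:
    """p_d: 1 when the spatial quantity d (the average 'overlap' of the text)
    is zero, 0 when it reaches 2r. Linear decay is an assumption."""
    return min(1.0, max(0.0, 1.0 - d / (2.0 * r)))


def temporal_probability(exposure_min: float, T: float = 30.0) -> float:
    """p_t: 1 when the shared time is at least T minutes, decreasing to 0
    as the shared time goes to zero. Linear growth is an assumption."""
    return min(1.0, max(0.0, exposure_min / T))


def contact_probability(d: float, exposure_min: float,
                        r: float = 8.0, T: float = 30.0) -> float:
    """p_i = p_d * p_t for a single time stamp."""
    return spatial_probability(d, r) * temporal_probability(exposure_min, T)


# Half the spatial range and half the exposure time give 0.5 * 0.5.
p_example = contact_probability(d=8.0, exposure_min=15.0)
```

Under these assumed forms, a full 30-minute exposure at zero spatial separation gives probability one, and either factor reaching zero rules the contact out.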
Contacts with a low probability of infection $p_i[n]$, but repeated through time, can also infect the target. To incorporate this effect in the model, we define the probability of infection for a series of repeated contacts, $P_i[n]$, through a recursive formula running from time 1 to $n$ with $P_i[0] = 0$. The iteration of contacts between source and target, $P_i[n]$, generates a higher probability of infection than a single contact $p_i[n]$. This means that there is a difference between a short single contact between two people and short repeated contacts between the same people: the latter scenario should carry a larger probability of infection than the former.

Fig. 1. (a) The COVID-19 pandemic is represented with a SEIR model. From exposure (E) the virus incubates on average for 5.2 days (12.5 days at the 95th percentile); symptoms start 2 days after the onset of infectiousness (I), and the disease lasts up to 17 days until recovery (R). We use a window of -14/+7 days from the first symptoms to detect infectiousness and exposure. (b) Contact area used in the contact tracing model. The grey person is at the first datapoint of the source at $t_0$. We collect all datapoints for every user in a $T = 30$ min forward window ($t_1, t_2, t_3, \ldots, t_0 + T$) within an 8 m circle from the initial position. For each target (green and red) we compute the average position and the time spent inside the contact area (red part of the trajectory line). (c) Partial transmission tree of the outbreak of confirmed SARS-CoV-2 infections identified by contact tracing during calibration in March 2020. Links go from the source of infection to the target. The colors represent the day of first symptoms for each node, and node size is the out-degree.

[...] reproduction number $R_0 = 2.78$ in Ceará in March 2020 (see SM sections I-V). We obtain $T = 30$ min, $r = 8$ m and $p_c = 0.9$. Thus, a contact is defined with probability one when the exposure lasts at least 30 minutes within a distance of 8 m.
This calibration procedure provides the partial transmission tree of the outbreak from patient zero to the end of the calibration period shown in Fig. 1c.
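The accumulation of repeated contacts described in this section can be illustrated with a short sketch. It assumes the standard rule for combining independent exposure events, $P_i[n] = P_i[n-1] + (1 - P_i[n-1])\,p_i[n]$ with $P_i[0] = 0$; this specific form is an assumption for illustration and may differ from the definition in SM sections I-V. Treating $p_c = 0.9$ as the threshold above which a contact is recorded is likewise an assumed reading of the calibrated parameter, and the function names are hypothetical.

```python
def cumulative_probability(p_single):
    """Accumulate per-time-stamp probabilities p_i[1..n] into P_i[1..n].

    Assumed recursion (independent exposure events):
        P[n] = P[n-1] + (1 - P[n-1]) * p[n],  with P[0] = 0.
    """
    P = 0.0
    history = []
    for p in p_single:
        P = P + (1.0 - P) * p
        history.append(P)
    return history


def is_contact(p_single, p_c: float = 0.9) -> bool:
    """Flag a contact when the accumulated probability reaches p_c
    (interpreting p_c as a threshold is an assumption)."""
    history = cumulative_probability(p_single)
    return bool(history) and history[-1] >= p_c


# One brief encounter versus the same encounter repeated four times.
single = cumulative_probability([0.5])
repeated = cumulative_probability([0.5, 0.5, 0.5, 0.5])
```

With this rule, four repetitions of a contact with $p_i = 0.5$ accumulate to 0.9375 and cross the threshold, whereas a single such contact does not, which is the qualitative behavior the model requires.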

IV. TRANSMISSION NETWORK MODEL
Next, we create the contact network of coronavirus transmission by first tracing the trajectories of confirmed COVID-19 patients to search for contacts -14/+7 days from the onset of symptoms using the above model. From the first contact layer, we add four layers of contacts to constitute the contact network of transmission that is used to monitor the progression of the pandemic. The time-varying network is aggregated into a snapshot defined over a time window of a week [15] (SM Section S6.1). We find that other aggregation windows give results similar to those presented.
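The two steps described above, weekly aggregation and layer-by-layer expansion from confirmed patients, can be sketched as follows. The input format (tuples of day index and two user IDs) and the function names are hypothetical; patients are assigned layer 0, their direct contacts layer 1, and four further layers are added, so the expansion stops at layer 5.

```python
from collections import defaultdict, deque


def weekly_snapshots(contacts):
    """Group (day, u, v) contact events into undirected weekly edge sets."""
    weeks = defaultdict(set)
    for day, u, v in contacts:
        if u != v:  # ignore degenerate self-contacts
            weeks[day // 7].add(frozenset((u, v)))
    return dict(weeks)


def contact_layers(edges, patients, max_layer=5):
    """Breadth-first expansion from patients (layer 0) up to max_layer."""
    adj = defaultdict(set)
    for edge in edges:
        u, v = tuple(edge)
        adj[u].add(v)
        adj[v].add(u)
    layer = {p: 0 for p in patients}
    queue = deque(patients)
    while queue:
        u = queue.popleft()
        if layer[u] == max_layer:
            continue  # do not expand beyond the last layer
        for v in adj[u]:
            if v not in layer:
                layer[v] = layer[u] + 1
                queue.append(v)
    return layer


# Toy data: a chain of contacts in week 0 and one unrelated contact in week 1.
contacts = [(0, "p", "a"), (1, "a", "b"), (2, "b", "c"), (3, "c", "d"),
            (4, "d", "e"), (5, "e", "f"), (9, "p", "z")]
snapshots = weekly_snapshots(contacts)
layers = contact_layers(snapshots[0], patients=["p"])
```

In the toy chain, user `e` lands in layer 5 and user `f`, one step further out, is left out of the week-0 network.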
Next, we analyze the spatio-temporal properties of the contact network. The government of the State of Ceará imposed a mass quarantine on March 19, 2020 which led to a decrease in people's mobility by 56.5% as shown in Fig. 2a. During the lockdown, only the displacements of essential workers were allowed. A large decrease in mobility is also observed across all Latin America, see [20].

V. GIANT CONNECTED COMPONENT (GCC)
To understand the effect of the lockdown on the contact network, we reason by analogy with a 'bond percolation' process [15,16,24]. In bond percolation, the network connectivity is reduced by removing a small fraction of links (bonds) between nodes, and the global disruption in network connectivity is monitored by studying the normalized size of the giant connected component (see Methods). Following this analogy, the lockdown acts as a percolation process, and therefore we monitor the GCC of the transmission network before and after the lockdown. We find a large decrease in the size of the GCC [15,24] within 6 days of the implementation of the lockdown on March 19, when the GCC is almost fully dismantled, decreasing by 89.6% relative to its pre-lockdown size (Fig. 2a).
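The bond percolation picture can be made concrete with a small sketch that removes a random fraction of links and reports the normalized GCC size. The random removal rule, the function names, and the toy path graph are illustrative assumptions; in the study the links are removed by the lockdown itself, not sampled.

```python
import random
from collections import defaultdict


def gcc_size(nodes, edges):
    """Number of nodes in the largest connected component."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first traversal of one component
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def bond_percolation(nodes, edges, removed_fraction, seed=0):
    """Remove a random fraction of links; return the normalized GCC size."""
    rng = random.Random(seed)
    edges = list(edges)
    n_keep = len(edges) - round(removed_fraction * len(edges))
    kept = rng.sample(edges, n_keep)
    return gcc_size(nodes, kept) / len(nodes)


# Toy example: a path of six nodes.
nodes = list(range(6))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
intact = bond_percolation(nodes, edges, 0.0)
dismantled = bond_percolation(nodes, edges, 1.0)
```

Sweeping `removed_fraction` between 0 and 1 on a real snapshot traces how quickly the GCC breaks apart as links are lost.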
Despite the disintegration of the GCC, the cumulative number of cases kept growing albeit at a lower rate (Fig. 2a). We find that the mass quarantine was able to reduce the basic reproduction number from $R_0 = 2.78$ before lockdown to an effective reproduction number of $R_e = 1.2$ after the lockdown (Fig. 2a). Despite this disruption in the network connectivity, $R_e$ has not decreased below one, as it would have been needed to curb the spread of the disease.

Fig. 2. (a) Evolution for different metrics in Ceará, Brazil, previous to the mass quarantine (grey area), right after the imposed quarantine (yellow area) and later. The plot shows the root mean square displacement (MSD) normalized by the maximum value over the total period (blue), the cumulative number of cases (green) and the size of the GCC normalized by the maximum value over the total period (black). The uncertainty corresponds to the standard error (SE). The mobility data is showcased in the Grandata-United Nations Development Programme map shown in https://covid.grandata.com. The initial rise in GCC is due to the lack of data before March 1. (b) The plot shows the 0.5-kcore size (red), the 0.5-kshell size (cyan) all normalized by their respective maximum value pre-lockdown. While the size of the 0.5-kshell is reduced drastically during the lockdown, the 0.5-kcore was not reduced as much and keeps increasing, contributing to sustain the pandemic. The 0.5-kcore seems to follow the trend in the MSD, which we plot again to show this trend.
The drastic reduction in the GCC is visually apparent in the contact networks in Fig. 3. Before lockdown on March 19 (Fig. 3a), the network is a strongly connected, unstructured 'hairball'. Eight days into the lockdown on March 27 (Fig. 3b), the network has been untangled into a set of strongly connected modules integrated by tenuous paths of contacts. This structure is even more pronounced a few weeks later on April 28 (Fig. 3c).

Fig. 3. (a) Transmission network on March 19 (pre-lockdown). A highly connected 'hairball' network is observed. The disconnected components of the 7-core ($k^{\rm core}_{\max} = 12$ in this network) are colored. These components are well connected into the hairball network, as expected since mobility and connectivity are high. (b) The pre-quarantine hairball in (a) has been untangled and the k-cores have emerged 8 days into the lockdown on March 27. Here, we color the nodes according to layers of the transmission network starting at COVID-19 patients (black nodes). The size of the nodes is proportional to their degree. (c) Network on April 28 including the components of the 5-core in different colors ($k^{\rm core}_{\max} = 7$ for this network). The high betweenness centrality node representing the weak link of this k-core is visible. (d) We plot on the map of Fortaleza the location of the contacts constituting the components of the 5-core of the April 28 network in (c). The size of the circles in the map corresponds to the number of contacts inside each location. The colors correspond to the clusters of the 5-core in (c). The 5-core sustaining transmission is composed of clusters of contacts localized in hospitals, large warehouses and business buildings. Hospital 3, one of the largest in Fortaleza, constitutes the maximal $k^{\rm core}_{\max} = 7$ of the pandemic.

VI. SUPERSPREADING K-CORE STRUCTURES
The highly connected modules found in Figs. 3b and 3c are k-core structures [25][26][27][28] of higher complexity than the GCC (which is a 1-core) that are known to sustain an outbreak even when the GCC has been disintegrated [15,28]. The k-core of a graph is the maximal subgraph in which all nodes have a degree (number of connections) larger than or equal to k [25][26][27][28]. The k-shell is the periphery of the k-core and is composed of all the nodes that belong to the k-core but not to the (k+1)-core (see SM sections I-V for definitions and SM Figs. S2, S3, and S4). The k-core is obtained by iteratively pruning the nodes with degree smaller than k. For instance, the 3-core is obtained by removing the 1-shell and 2-shell in a k-shell decomposition process (see SM Figs. S2, S3). Thus, all nodes in a k-core have degree at least k, and are connected to other nodes with degree at least k too. K-cores are nested and can be made of disconnected components (see SM Fig. S4). High k-cores are those with large k, up to a maximal $k^{\rm core}_{\max}$, and constitute the innermost, most important part of the network. In theory, the high k-cores are known from network science studies to be the reservoir of disease transmission persistence [15,28]. In contrast, low peripheral k-shells (see SM Fig. S2) do not contribute as much to the spread as the high inner k-cores. Figure 2b shows that despite the disappearance of the GCC, there is a significant maximal k-core that was not dismantled by the mass quarantine. The figure shows that the outer k-shells of the transmission network (i.e., the 0.5-kshell, defined as the union of the k-shells with $k = 1, 2, \ldots, \tfrac{1}{2} k^{\rm core}_{\max} - 1$, see SM sections I-V) are disintegrated in the lockdown, decreasing by 91% with respect to their pre-quarantine size, in tandem with the GCC. However, the inner k-core (i.e., the 0.5-kcore, defined as the k-core with $k = \tfrac{1}{2} k^{\rm core}_{\max}$, see SM sections I-V) persists in the lockdown.
The figure shows that the decrease of the 0.5-kcore is only 50%, compared to the 91% decrease of the 0.5-kshell; the former even increases slightly at the end of April, following the same trend as mobility (see Fig. 2b). This process is visually corroborated by the evolution of the networks from Fig. 3a to 3c, where we observe the disappearance of the peripheral k-shells and the persistence of the maximal k-core. Indeed, the unessential contacts in the peripheral k-shells may have been pruned first during social distancing.
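The k-shell decomposition and the split into 0.5-kcore and 0.5-kshell can be sketched with a plain pruning loop. The helper names and the toy graph are hypothetical. The text does not say how $\tfrac{1}{2} k^{\rm core}_{\max}$ is rounded when $k^{\rm core}_{\max}$ is odd, so the comparison against the unrounded half used below is an assumption.

```python
def build_adj(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def core_numbers(adj):
    """k-shell index of every node, obtained by iterative pruning."""
    deg = {u: len(nbrs) for u, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # The lowest remaining degree sets the shell being peeled.
        k = max(k, min(deg[u] for u in remaining))
        peel = [u for u in remaining if deg[u] <= k]
        while peel:
            u = peel.pop()
            if u not in remaining:
                continue  # already pruned via a duplicate entry
            remaining.discard(u)
            core[u] = k
            for v in adj[u]:
                if v in remaining:
                    deg[v] -= 1
                    if deg[v] <= k:
                        peel.append(v)
    return core


def half_core_split(adj):
    """Return (k_max, 0.5-kcore nodes, 0.5-kshell nodes)."""
    core = core_numbers(adj)
    k_max = max(core.values())
    half = k_max / 2  # rounding for odd k_max is an assumption
    inner = {u for u, c in core.items() if c >= half}
    outer = {u for u, c in core.items() if 1 <= c < half}
    return k_max, inner, outer


# Toy graph: a 5-clique (4-core), node 5 tied to two clique members
# (2-shell), and a pendant node 6 hanging off node 5 (1-shell).
clique = [(a, b) for a in range(5) for b in range(a + 1, 5)]
adj = build_adj(clique + [(5, 0), (5, 1), (6, 5)])
k_max, inner_core, outer_shell = half_core_split(adj)
```

Tracking the sizes of `inner_core` and `outer_shell` across weekly snapshots gives the kind of curves plotted in Fig. 2b.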
Using numerical simulations, we corroborate previous results indicating that the infection can persist in these high k-cores of the network, while virus persistence in outer k-shells is less important [15,28]. We use an SIR model on the transmission network (Fig. 4a and SM Fig. S14A), showing that the maximal k-cores of the network sustain the spreading of the disease more efficiently than the outer k-shells. Thus, the maximal k-core components of the contact network are plausible drivers of disease transmission. Apart from this structural explanation (i.e., the k-core), epidemiological factors may also play a role in the persistence of the disease, such as a transition of the disease to vulnerable communities with high demographic density, or with many inhabitants per household, where isolation is poorly fulfilled.
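A comparison of this kind can be set up with a minimal discrete-time SIR simulation on a contact network. The update rule, the parameters `beta` (per-contact transmission probability per step) and `gamma` (recovery probability per step), and the function names are illustrative assumptions; the simulation settings behind Fig. 4a are not specified in this section.

```python
import random


def sir_outbreak_size(adj, seed_node, beta, gamma=1.0, rng=None):
    """Run one SIR outbreak from seed_node; return the final number of
    recovered nodes. Requires gamma > 0 so that the outbreak terminates."""
    if rng is None:
        rng = random.Random(0)
    infected = {seed_node}
    recovered = set()
    while infected:
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                susceptible = v not in infected and v not in recovered
                if susceptible and rng.random() < beta:
                    newly_infected.add(v)
        newly_recovered = {u for u in infected if rng.random() < gamma}
        recovered |= newly_recovered
        infected = (infected - newly_recovered) | newly_infected
    return len(recovered)


def mean_outbreak(adj, seeds, beta, runs=200, seed=0):
    """Average outbreak size over runs seeded uniformly from `seeds`."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        total += sir_outbreak_size(adj, rng.choice(seeds), beta, rng=rng)
    return total / runs


# Toy network: a path 0-1-2 plus an isolated node 3.
adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
full_spread = sir_outbreak_size(adj, 0, beta=1.0)
no_spread = sir_outbreak_size(adj, 0, beta=0.0)
```

Calling `mean_outbreak` once with seeds drawn from a high k-core and once with seeds drawn from the outer k-shells, at the same `beta`, gives a rough measure of how much more efficiently the inner cores sustain spreading.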
When we plot the geolocation of the contacts forming the maximal k-core on the map of Ceará, we find that these contacts take place in highly transited areas of the capital Fortaleza, such as hospitals, business buildings, warehouses, as well as large condominiums (see Fig. 3d). These contacts generate superspreading k-core events that generalize the conventional notion of superspreaders, which refers mainly to individuals with a large number of transmission contacts [29][30][31]. However, connections are not everything [17,18]. K-core superspreaders not only generate a large number of transmission contacts, but their contacts are also highly connected people, and so forth.

Fig. 4. [...] versus the removal node fraction, q. Nodes are removed (in order of increasing efficiency): randomly (blue); by the highest k-shell followed by high degree inside the k-shell [28]; by highest degree (orange); by collective influence (red) [19]; and by the highest value of betweenness centrality (green) [32,33]. After each removal we re-compute all metrics. The best strategy among those studied is removing the nodes by the highest value of betweenness centrality. (c)-(d) Effect of removing three high betweenness centrality nodes shown in Fig. 4b in the network of Fig. 3c. (c) We show the 2-core component of the network after the removal of 12 high betweenness centrality nodes. The red node is the one with the highest betweenness centrality value (next node to remove, 13th) and the blue node is the 14th removal. Different k-cores and k-shells are shown in different colors. (d) Network k-cores are disintegrated after the removal of the high betweenness centrality nodes.

VII. OPTIMIZED QUARANTINE
The existence of k-cores in the transmission network suggests that a more structured quarantine could be deployed to either isolate or destroy those cores that help maintain the spread of the virus. We perform an optimal percolation analysis [17][18][19] to find the minimal number of people that must be quarantined to dismantle the transmission network. We compare different strategies to find the best among them to break the network by ranking the nodes based on (1) the number of contacts (hub-removal) [15,17,18], (2) the largest k-shells and then the degree inside the k-shells [15,28], (3) the collective influence algorithm for optimal percolation [19], and (4) betweenness centrality [32][33][34][35] (we also try other centralities, see SM sections I-V). Figure 4b shows the normalized size of the GCC versus the fraction of removed nodes following the different strategies, as well as a random null model of removal, in a typical network under lockdown on April 28 (March 19 pre-lockdown results are plotted in SM Fig. S14B). While the disease can persist in the k-cores (Fig. 4a), quarantining people directly inside the maximal k-core is not an optimal strategy. The reason is that k-cores are populated by hyper-connected hubs that require many removals to break the GCC [34] (around 7%, see Fig. 4b). For the same reason, directly removing the hubs is not the optimal strategy either, since the hubs are within the maximal k-core and not outside. A collective influence strategy [19] improves over hub-removal since it takes into account how hubs are spatially distributed, yet it is far from optimal. Fig. 4b shows that the best strategy among those compared is to quarantine people by their betweenness centrality. By removing just the top 1.6-2% of the high betweenness centrality people, the GCC is disintegrated. This result is consistent with the particular structure of the transmission networks seen in Fig. 3b, c and Fig. 4.
The betweenness centrality of a node is proportional to the number of shortest paths in the network going through that node. Thus, given the particular structure of the networks in Figs. 3b, c, and Fig. 4c, the high betweenness centrality nodes are the bottlenecks of the network, i.e., loosely connected bridges between the densely connected k-core components. These connectors are the 'weak links', a fundamental concept in sociology proposed by Granovetter [36], according to which strong ties (i.e., contacts in the k-cores) clump together forming clusters. A strategically located weak tie between these densely 'knit clumps' then becomes the crucial bridge that transmits the disease (or information [36]) between k-cores. These weak links are people traveling among the different k-core components, allowing the disease to escape the cores into the rest of society. These bridges are displayed in the network of Fig. 4c as the yellow, blue and red nodes. The removal of these high betweenness centrality people disconnects the k-core components of the network entirely, as shown in Fig. 4d, halting the disease transmission from one core to the other [34,37].
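The weak-link picture can be reproduced on a toy network with a sketch of adaptive removal by betweenness centrality, recomputing the ranking after every removal as in the Fig. 4 analysis. Betweenness is computed here with Brandes' algorithm for unweighted graphs, which is a choice made for this example rather than a statement about the original implementation; the function names and the two-clique toy graph are likewise illustrative.

```python
from collections import deque
from itertools import combinations


def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes, unweighted graphs)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0.0)
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1.0, 0
        queue = deque([s])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while order:  # accumulate dependencies from the farthest nodes back
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both end points.
    return {v: b / 2.0 for v, b in bc.items()}


def gcc_size(adj):
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def remove_by_betweenness(adj, n_remove):
    """Remove the top-betweenness node n_remove times, recomputing each time."""
    adj = {u: set(nbrs) for u, nbrs in adj.items()}  # work on a copy
    removed, sizes = [], [gcc_size(adj)]
    for _ in range(n_remove):
        bc = betweenness(adj)
        target = max(bc, key=bc.get)
        for v in adj.pop(target):
            adj[v].discard(target)
        removed.append(target)
        sizes.append(gcc_size(adj))
    return removed, sizes


# Two 4-cliques (dense 'cores') joined only through bridge node 4.
edges = (list(combinations(range(4), 2)) + list(combinations(range(5, 9), 2))
         + [(3, 4), (4, 5)])
toy = build_adj(edges)
removed, sizes = remove_by_betweenness(toy, 1)
```

In this toy graph the bridge node 4 has the lowest degree but the highest betweenness, and removing it alone cuts the GCC from 9 nodes to 4 while leaving both cliques intact.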
An important finding is that quarantining the large superspreading k-cores is neither optimal (as shown in Fig. 4b, green curve) nor practical, since they are mainly composed of essential workers who need to remain operational (Fig. 3d). Thus, the best strategy, in conjunction with a mass quarantine, is to disconnect these k-cores from the rest of the social network (Figs. 4c and 4d), rather than quarantining the people inside the k-cores. This can be done by quarantining the high betweenness centrality weak links, which simultaneously preserves the operational k-cores. However, individuals belonging to the maximal k-cores should be tested at a higher frequency to promptly detect their infectiousness before the symptoms start, to help control the spreading inside the k-cores.

VIII. SUMMARY
Isolating the k-core structures by quarantining the high betweenness centrality weak links in the transmission network proves to be an effective way to dismantle the GCC of the disease while keeping essential k-cores working. While destroying the strong links and cores is a less manageable task to execute and control, isolating the weak links between cores is a more feasible task that will assure the dismantling of the GCC. In other words, if one core is infected, the disease will be controlled within that core and not extended to the rest of society.
As governments around the world are racing to roll out digital contact tracing apps to curb the spread of coronavirus [4][5][6][7][8][9][10][11], our modeling suggests possible quarantine protocols that could become key in the second phase of reopening economies across the world and, in particular, in developing countries where resources are scarce. Overall, our network-based optimized protocol is reproducible in any setting and could become an efficient solution to halt the critical progress of the COVID-19 pandemic worldwide drawing upon effective quarantines with minimal disruptions.