Modeling the spread of the Zika virus using topological data analysis

Zika virus (ZIKV), a disease spread primarily through the Aedes aegypti mosquito, was identified in Brazil in 2015 and was declared a global health emergency by the World Health Organization (WHO). Epidemiologists often use common state-level attributes such as population density and temperature to determine the spread of disease. By applying techniques from topological data analysis, we believe that epidemiologists will be able to better predict how ZIKV will spread. We use the Vietoris-Rips filtration on high-density mosquito locations in Brazil to create simplicial complexes, from which we extract homology group generators. Previously epidemiologists have not relied on topological data analysis to model disease spread. Evaluating our model on ZIKV case data in the states of Brazil demonstrates the value of these techniques for the improved assessment of vector-borne diseases.


Topological Data Analysis
Topological data analysis (TDA) is a growing field of statistics that makes use of techniques from topology to analyze data. We believe TDA provides valuable tools that epidemiologists can use to improve modeling the spread of disease. Specifically, we will explain the very basics of persistent homology, such as the Vietoris-Rips filtration, as well as how to carry out the filtration using the R programming language. Finally, we explain the persistence diagram, which is a common way of visualizing and extracting information from the Vietoris-Rips filtration.

Persistent Homology
The central idea of persistent homology is to compute topological features of data through a series of polyhedra, which are simplicial complexes based on a parameter, . First to understand what a simplicial complex is, we define the notion of a simplex. In geometry, a simplex is a generalization of a triangle or tetrahedron to n dimensions. For example, a k-simplex is a kdimensional convex hull composed of k + 1 vertices (a 2-simplex is a triangle). A simplicial complex is then a combination of simplices. Each simplicial complex is contained in the subsequent simplicial complex with a larger ε parameter. After obtaining a simplicial complex, its homology vector space can be computed. Each element of the homology vector space represents a type of structure in the complex. But does this structure also exist in the data? To answer this question, we track each homology element (or homology class) as the ε parameter grows. If the feature is statistical noise, the existence of that homology class will likely be short. If the feature is a real signal, the homology class is likely to persist for a longer time 1 . Hence, this method is called persistent homology.

S1 Fig. Example simplex visualizations.
An example of simplexes from dimension 0 to dimension 3. These are used to build a simplicial complex, which is the product of a Vietoris-Rips Filtration. See section 1.2 for a detailed walkthrough of this filtration.

Introduction to the Vietoris-Rips Filtration
The Vietoris-Rips filtration is a popular tool in topological data analysis because it can be used to encode valuable information about the underlying topology of data. The filtration produces a set of simplicial complexes that add a topological skeleton to point clouds. However, it is good to note that in practice, datasets are often too large for the filtration to be calculated in a computationally feasible manner.
To build a complex from a point cloud data set using the Vietoris Rips filtration, we want to approximate the data by a polyhedron, which is a simplicial complex. We define the simplex inductively. The 0-simplices, the vertices of the complex, are defined by the data points. We first initialize a small parameter ε starting at an ε of 0. We surround each vertex with a 2D ball with radius equal to ε and grow these ε balls. If two ε balls intersect, we create a 1-simplex by joining the two 0-simplices at the center of these balls to create a line segment. If 3 ε balls intersect, we create a 2-simplex by joining the three 0-simplices at the center of the 3 balls in a triangle. Now we construct the homology vector spaces by finding the cycles in the complex which are the generators of the homology vector space. A cycle is a simplex which is not part of the boundary of a simplex of higher dimension. For example, the vertices of a 1-simplex are not cycles. The components of the polyhedron are the 0-cycles. A loop of 1-simplices, which is not the boundary of a 2-simplex (triangles in the complex) is a 1-cycle. So the 0-cycles count the components of the polyhedron and the 1-cycles count the non-trivial loops in the polyhedron. Basically the cycles are just the holes of various dimension in the complex. These cycles are the generators of the homology vector spaces denoted by H0, H1, H2 for each dimension. The dimension of the vector spaces H0N, H1N are the indices we compute and interpret for the mosquito data.

S2 Fig. Vietoris-Rips at increasing values of ε.
A visualization at various ε values of the Vietoris-Rips filtration applied to a toy dataset. Notice how in panel a, we begin with ε = 0 since there are no balls present, and we have six 0-simplices. In panel b we begin to grow ε, and we observe that two points become connected to each other, forming our first 1-simplex. In panel c as we grow ε even further, another pair of points becomes connected, birthing another 1-simplex. There are several existing software packages that can compute the Vietoris-Rips filtration, including the TDA package in R, Perseus, and Dionysus.

Persistence Diagram
Typically, persistent homology is visualized through either a barcode diagram or a persistence diagram. A persistence diagram is a plot of the birth and death times of each topological feature in every dimension. The first coordinate of each point denotes the birth time of the feature, while the second coordinate denotes the death time. Formally, a persistence diagram is a set of points expressed as: , ∈ ' , ≥ 0, ≤ }. Therefore, all points lie on or above the line of equality. Points close to the line of equality denote features with shorter lifetimes, while points that lie farther away from the line denote features that persist for a longer time.

S4 Fig. Example persistence diagram.
A persistence diagram that summarizes information from a Vietoris-Rips filtration. The lengths of the vertical arrows represent the lifetime of the respective features that they point to. The green square is a feature that was born and died at the same time. This can happen when a loop is formed but is immediately closed at once.

Modeling using topological features
We are interested in studying the 0-dimensional homology group generators (H0 features) and 1dimensional homology group generators (H1 features) of the simplicial complexes representing the original point clouds of mosquito occurrences of each state of Brazil. The number of H0 features at the the beginning of the filtration correspond to the number of Aedes aegypti mosquito occurrences in each state, since each occurrence starts out as a separate connected component as 0-simplexes in the filtration. Therefore, the H0 features all have birth time of = 0. As we grow , 1-simplexes are created from intersections of the -balls, and the number of connected components decreases. When a connected component dies by joining another connected component, we note its death time or the value at which the component dies. The H1 features correspond to the empty loops born in the Vietoris-Rips filtration. The H1 features die in the filtration when they are filled in by 2-simplexes. We note the birth and death time of these topological features, which are visually displayed in persistence diagrams. We then compute the lifetimes of the H1 features using the birth and death times in the persistent diagrams and record the maximum lifetime for the H1 features for each filtration.

Code and Data Availability
All of the data used for analysis in this paper is 3 rd -party data. We had no special privileges to access the data.

Aedes aegypti Mosquitos
Our mosquito data comes from the Global Compendium of Aedes aegypti and Ae. albopictus occurrence dataset [2]. The full dataset is hosted by the Dryad Digital Repository at http://dx.doi.org/10.5061/dryad.47v3c. In total we consider 5057 entries for Brazil. Each entry represents a mosquito population in a region called a polygon, which is an area with dimension greater than 5km x 5km. Each polygon represents a survey conducted within a Brazilian municipality with a positive finding of a mosquito population of non-zero size. No attempt was made to quantify the true number of mosquitos within a polygon. There are 5570 municipalities in Brazil.

Municipality Weighting
Aedes aegypti mosquitos have been shown to have higher reproductive rates as temperature increases (while below 32° C). And similar population increases have been demonstrated in regions with higher rainfall. Given these entomological features, we weight municipalities that have conditions suitable to large Aedes aegypti mosquito populations. We do this by randomly generating additional points in a 5km radius around municipalities in the top 5% most habitable municipalities based on these climate heuristics. The simplicial complex that we build through the Vietoris-Rips filtration then reflects this weighting since municipalities that are more habitable will have more 0D and 1D features.

Country-level Temperature
We obtained mean monthly temperatures in degrees Celsius from 98 weather stations which cover all states of Brazil in 2010 using FAOClim-Net, an agroclimatic database management system [3]. The data is available at http://geonetwork3.fao.org/climpag/. We take the average of these monthly temperatures to obtain a mean annual temperature for each state of Brazil in the year 2010. We use these temperatures as a proxy for the temperatures of each state of Brazil in recent years.

Municipality-level Temperature and Rainfall
Data on average monthly temperature between March to May in Brazil on a municipality level is available at http://www.ipeadata.gov.br.

Population Density
We obtained population data and geographic area (km) for each state of Brazil from the Instituto Brasileiro de Geografia e Estatistica (IBGE) [4]. We use the estimated resident populations in states of Brazil in the year 2014. To obtain an estimated population density for each state, we divide the estimated resident population data by the geographic area (km). The data can be downloaded at http://downloads.ibge.gov.br/downloads_estatisticas.htm

Zika Cases
The data on the number of Zika cases in each state of Brazil come from monthly reports published by Brazil's Ministry of Health [5]. Although the Ministry of Health publishes cases that are under investigation, we only used data on the confirmed cases. We used the latest cumulative data published weekly, which is from July 2, 2016. There are 27 states in Brazil, so we have 27 data points. The data can be obtained from the PDF available at: http://portalsaude.saude.gov.br/index.php/o-ministerio/principal/leia-mais-o-ministerio/197secretaria-svs/20799-microcefalia