Inferring Epidemic Contact Structure from Phylogenetic Trees

Contact structure is believed to have a large impact on epidemic spreading and consequently using networks to model such contact structure continues to gain interest in epidemiology. However, detailed knowledge of the exact contact structure underlying real epidemics is limited. Here we address the question whether the structure of the contact network leaves a detectable genetic fingerprint in the pathogen population. To this end we compare phylogenies generated by disease outbreaks in simulated populations with different types of contact networks. We find that the shape of these phylogenies strongly depends on contact structure. In particular, measures of tree imbalance allow us to quantify to what extent the contact structure underlying an epidemic deviates from a null model contact network and illustrate this in the case of random mixing. Using a phylogeny from the Swiss HIV epidemic, we show that this epidemic has a significantly more unbalanced tree than would be expected from random mixing.


A Sackin index for a random network generated by the configuration model
The Sackin index is a measure of tree balance [1]. It is defined as where d i is the number of internal nodes you need to traverse when connecting leaf i to the root of the tree and N is the number of leaves. For our purposes it will be useful to rewrite this as a sum over distances in the transmission network, The transmission network is the subgraph of the contact network where only the edges along which an infection event occured are preserved and nodes that were not infected are removed (figure S1 and [2]). We're only interested in the expected value of I S . If the network is of infinite size, i.e. N much larger than the epidemic size, then we can take the D j to be independent of each other. Thus, with N and D random variables that take on the number of leaves N (m) that are at a distance D(m) from the root and m is the distance of the corresponding node in the infection network from the initial seed of the epidemic. V (m) is the subset of node indices that are at this distance m.
If we assume that no super-infections occured then the transmission networks has a tree-like strucutre. In that case the distance D(m) in the phylogenetic tree translates from the distance m in the network as, Here,K(m) is the average number of infections caused by a node at distance m andF (j) is the average infection order of of a node at a distance j. We can assume that there is no preferential order in which the neighbors of a given node are infected, sô Inserting (6) and (7) in (5), If the contact network is sufficiently sparse, then it too will have a tree-like strucutre and short-range loops with be rare. This means that a node fails to infect one of its neighbors, that neighbor will stay uninfected for the rest of the epidemic. In this case, the number of infections caused by a node, K(m), should be independent of m. WritingK(m) = κ The first term is just total number of infected nodes N . The second sum, ∑ m z m m is the sum of all path lengths from the initially infected node to all other nodes. If the initially infected node is chosen at random, then this is equal to the N times the mean path length ℓ in the infection network. Finally, B Strong local clustering as well as super-spreading leads to unbalanced trees Figure 1A in the main text shows that both the Barabási-Albert (BA) model [3] as well as the Watts-Strogatz (WS) [4] model can result in large tree imbalance compared to the Erdős-Rényi (ER) model [5]. Here we illustrate this effect using two idealized cases of the two models.
As an extreme case of the WS model one can imagine a 1D lattice (without cyclic boundary conditions) where all nodes are only just connected to their immediate neighbors (figure S1A). If an individual at the beginning of the chain is initially infected, then this individual will just infect its immediate neighbor who will infect the next immediate neighbor until the epidemic either dies out or all individuals have been infected. This results in a maximally unbalanced tree, as branching only happens and the right-most tree branch. Conversely, a star graph can be seen as an extreme case of a preferential attachment model, where all individuals are connected to a single center node (figure S1B). If the center node is initially infected it will continue to infect its neighbors until it either recovers or dies. This also results in a maximally unbalanced tree. As the tree shape alone cannot distinguish between the two processes, they both result in phylogenetic trees with the same Sackin index, despite having very different contact structures.
This illustration also reflects the dependence on path length and degree variance. Both networks in figure S1 have the same mean degree since the number of nodes and edges is the same in both networks. However, the star graph has a very large degree variance (one node with degree n − 1 and n − 1 nodes with degree 1) and a short mean path length of approximately 2. In the chain graph, all nodes except the ends have degree 2, but the mean path length is of order n.

C Comparison of different tree balance statistics
Several measures exist that quantify the level of imbalance of phylogenetic trees (see [6] for an overview). Of these, the most commonly used are the Sackin index and the Colless index. The Sackin index sums the number of internal nodes d i between each leaf i and the root of the tree, while the Colless index compares the sizes of the right and left subtree and at each internal node (r j and l j respectively), The normalizing constant above results from a maximally unbalanced tree, thus constraining I C to the interval [0, 1]. It has been shown that these two measures of tree balance are highly correlated for trees generated by the Yule model and the uniform model [7].
In [8] Blum and François introduced a further shape statistic, It is possible to rewrite the Sackin index, such that

D Testing the HIV tree using different tree imbalance statistics
We compare the tree imbalance of the phylogenetic tree from the Swiss HIV epidemic to trees generated under a birth-death process with density-dependent birth rates (SIR model with random mixing) using the measures mentioned in above. We are able to reject the birth-death process as a null model using the (normalized) Sackin index (and the Colless index, since these are strongly correlated) for the full tree as well as for the three largest transmission clusters (see figure 7 in main text). It is not possible, however, to reject the birth-death process using the s-Index statistic. When comparing the joint distribution of the normalized Sackin index and the s-Index for trees generated by the SIR model, we show that only a small number of trees (437/10 000) generated by this process are rejected by the Sackin index test while not being rejected by the s-Index test ( figure S3). This indicates that even though the HIV tree cannot be distinguished from an SIR model using the s-Index, it can reject the SIR model when using both both the s-Index as well as the Sackin index.
This argument is supported when looking at the sub-sampling behavior of the s-Index for the HIV tree as well as the transmission groups within. While we cannot reject the SIR model using the s-Index alone for the full HIV tree, the tree from a subsample of individuals is rejected by the s-Index, as are the transmission groups (see figure S4).
The values for N , M and R 0 in figure S3 were chosen to be realistic and conservative, however, it is possible that the SIR model with other parameters could no longer be rejected using tree imbalance statistics. Figure S6 shows the distribution of imbalance measures for trees generated by the SIR model under different parameterizations. The HIV tree rejects the SIR model for all of these parameter combinations. We thus consider this result stable.

E Path length
The control parameter in the Watts-Strogatz model is the rewiring probability p. When p = 0 no edges are rewired and the network is equal to a ring where each node is connected to its K nearest neighbors. The degree distribution of such a network is, As p → 1 the network approaches a ER graph which has a Poisson degree distribution.
Varying p affects two properties of the network: the clustering coefficient and the mean shortest path [4]. The strongest effect on path length is in the small p range, while the strongest effect on the clustering coefficient is in the large p range [4]. Figure S5 shows the dependence on the rewiring probability for a fixed transmissibility T and mean number of neighbors, compared to a random graph generated by the configuration model with degree sequence equal to that of th WS graph. The reduction in tree imbalance is strongest in the low p range, indicating that this effect is due to the reduction in mean path length, rather than reduction in the clustering coefficient or the variation in degree distribution.