Finding Evidence for Local Transmission of Contagious Disease in Molecular Epidemiological Datasets

Surveillance systems of contagious diseases record information on cases to monitor incidence of disease and to evaluate effectiveness of interventions. These systems focus on a well-defined population; a key question is whether observed cases are infected through local transmission within the population or whether cases are the result of importation of infection into the population. Local spread of infection calls for different intervention measures than importation of infection. Besides standardized information on time of symptom onset and location of cases, pathogen genotyping or sequencing offers essential information to address this question. Here we introduce a method that takes full advantage of both the genetic and epidemiological data to distinguish local transmission from importation of infection, by comparing inter-case distances in temporal, spatial and genetic data. Cases that are part of a local transmission chain will have shorter distances between their geographical locations, shorter durations between their times of symptom onset and shorter genetic distances between their pathogen sequences as compared to cases that are due to importation. In contrast to generic clustering algorithms, the proposed method explicitly accounts for the fact that during local transmission of a contagious disease the cases are caused by other cases. No pathogen-specific assumptions are needed due to the use of ordinal distances, which allow for direct comparison between the disparate data types. Using simulations, we test the performance of the method in identifying local transmission of disease in large datasets, and assess how sensitivity and specificity change with varying size of local transmission chains and varying overall disease incidence.


Pairwise dissimilarities
Let D m i (a, b) be the measured distance between two cases a and b for data type i. There could be identical values in our dataset (i.e. D m i (a, b) = 0). As we use ordinal distances to detect cases lying close together, detection of local transmission clusters will be more challenging when many values are identical; no ordering exists on these. To be able to make comparison between cases with identical values and those with distinct values, we will assume that for all cases for which the same value was measured, the actual value lies a random infinitesimal distance away from this measured value. This is actually true for the temporal data, which is always interval censored, as dates but not exact times are given. It is not true for genetic data, which are discrete. However, these can be seen as a proxy for evolutionary time separating two samples, which is again continuous.
We define the dissimilarity d i (a, b) between two cases a and b as the expected number of cases between them, plus one: Here '∧' denotes the logical AND operator; A ∧ B is true if and only if both A and B are true. To see that the definition above coincides with the expected value plus one, consider three points a, b and c, with a and b having the same observed value ()D m i (a, b) = 0), each lying a random infinitesimal distance away from their measured values. Then the probability that any two of the actual pairwise distances are equal is zero. Therefore, if D m i (a, c) = 0, the probability that b is in between a and c is 1 2 . If D m i (a, c) = 0, the probability that b is in between a and c is 1 3 . Further note that, for distinct cases, when identical values do not occur in our dataset the definition above is equivalent to equation (1) in the main text.
As in the main text, the full dissimilarity between two cases a and b is given by

Putative transmission clusters
For any subset S ⊆ D, define l(S) as the largest dissimilarity in the minimum spanning tree of S. Note that several minimum spanning trees can exist, but l(S) is unique (see lemma 1). To test the null hypothesis of independence between data types, we construct the set D from D by randomly permuting the values of the data types. D is identical to D for each of the data types, but satisfies the null hypothesis. We then define the p-value for S as the probability that a subset with at least that size and at most that largest dissimilarity exists under the null hypothesis: and we call S a putative transmission cluster (PTC) if this p-value is beneath a threshold of 0.001. We can limit the number of clusters we have to test using hierarchical clustering, a technique that yields a dendogram of the dataset. A dendogram can be defined as a function h : [0, ∞) → P D , where P D is the set of all partitions of the dataset D, with the properties that m ≤ m implies h(m) ≤ h(m ) (i.e. every element of h(m) is a subset of an element of h(m )), and h is eventually the whole dataset (h(m) = D for sufficiently large m). h(m) here is the set of subsets S of D such that l(S) ≤ m and the only set S 2 that contains S and has l(S 2 ) ≤ m is S itself. Let S be the set of subsets of D that are in h(m) for some m. By lemma 2, subsets of D that are a PTC are always contained in an element of S which is also a PTC. Since we are interested in whether cases belong to a cluster or not, we only have to test the elements of S for being a PTC.
Lemma 1. For any weighted graph G, all minimal spanning trees have the same maximum edge weight.
To prove this, let's assume T 1 and T 2 are minimal spanning trees of G, such that their maximum edge weights are different. Without loss of generality, let the maximum edge weight of T 1 be larger, and let e ∈ T 1 be an edge with this weight. Now select an edge e from T 2 such that e is in the cut induced by e in T 1 . As the maximum edge weight of T 2 is smaller than that of T 1 by assumption, the weight of e is smaller than that of e. The tree (T 1 − {e}) ∪ e is a spanning tree of G, with total weight less than T 1 . This is a contradiction, as T 1 was a minimal spanning tree.
Lemma 2. If S ⊆ D is a putative transmission cluster (PTC), ∃T ∈ S with S ⊆ T and T a PTC.
Either S ∈ h(l(S)) ⊆ S and we are done, or ∃T ∈ h(l(S)) ⊆ S, with S ⊂ T . Because S is a PTC we have P (∃S ⊂ D : |S | ≥ |S|, l(S ) ≤ l(S)) < 0.001 since furthermore |T | > |S|, l(T ) = l(S) and l is monotonically increasing in cluster size, we have that which shows that T is also a PTC.

Details of simulations 3.1 Generating simulated datasets
In our first simulation scenario, all locally infected cases belong to one large outbreak. One index case was generated near the start of the study period so the outbreak would be completed within the time window. As the variance in the final size of a large outbreak generated by branching processes is quite large, we restricted this outbreak to be of size exactly one tenth of the size of the total simulated dataset. We therefore generated cases until the number was reached, and picked an infector for each from the set of previously generated cases. As assigning cases randomly from the set of already generated cases would amount to strong superspreading behavior (the index case would get a large number of infectees assigned), we preferentially picked more recently generated cases.
In particular, we set the probability for any generated case to be assigned as an infector as twice the probability of picking the case generated before. For example, when three cases had been generated, they would be picked as an infector with probabilities 1/7, 2/7 and 4/7. This procedure is arbitrary, but simple and keeps the expected number of infections per infected individual bounded. For example, the expected number of infections caused by the index case would be N i=1 1 2 i −1 ≈ 1.61. The small and very small outbreaks were generated using branching processes, as explained in the main text. Below we calculate the expected size of the outbreaks generated in this way.

Final size calculations
To find the expected value of the final size S of the outbreaks in the second and third scenario, let f (x) be the probability that one infectious case infects x others. For the geometric distribution we use, f (x) = p x (1 − p), where p = R 1+R and R is the expected number of infections per infectious case. As each case infected again infects new cases, we have which simplifies to E(S) = 1 1−R , yielding expected sizes 2 and 10 9 for the R values of 0.5 and 0.1 used.
As only outbreaks of value of at least 2 are characterized as clusters in our analysis, we might also want to find the expected size of these clusters: E(S|S > 1). This is equivalent to conditioning on the index case causing at least one infection. We get yielding 4 and 20/9≈2.22 for the two scenarios.

Additional simulation results
In this section we give the results obtained by applying the proposed method to simulations where the absolute distances between infector-infected pairs are smaller than in the main text, and to simulations where 20% of cases are unobserved.

Small distances
The statistical signal left by clusters of cases depends for a large part on the relation for each of the data types between infector-infected pairs. When distances in these data types are smaller, the statistical signal is stronger. To illustrate this, we performed additional simulations in which these distances are smaller. We simulate as described in the main text, but the time distance is now exponentially distributed with expectation 0.5 (1 in the main text), the geographical distance is N (0, 2) (N (0, 4) in the main text), and the expected number of mutations is now 0.1 (0.5 in the main text). Clustering performance is given in figure S2 and table S1. As the statistical signal is much stronger, the distinction between outbreak and unrelated cases is much clearer than in the main text.

Unobserved cases
Many datasets face the problem of missing or unobserved cases. Here, we tested the performance of our method when facing unobserved cases. We do this by performing simulations as described in the main text, and then separately discarding each of the cases with probability 0.2, thus discarding 20% of cases at random. We applied our method to this reduced datasets, results are given in figure S3 and table S2. Datasets with missing cases are similar to complete datasets with larger distances between the cases; thus this scenario constitutes the opposite of the one in the previous section. As expected, clustering performance decreases for most scenarios. A notable exception are the very small clusters, where sensitivity actually increases. As these transmission clusters are mainly of size two, discarding a case does not lead to larger distances, but to elimination of the cluster. Thus the number of cases and transmission clusters is affected, but not the intra-cluster distances. Outbreak cases and unrelated cases can be distinguished for all scenarios, showing that the method can provide useful results even when cases are unobserved.