Conceived and designed the experiments: EMV SDWF. Performed the experiments: EMV. Analyzed the data: EMV SDWF. Contributed reagents/materials/analysis tools: EMV SDWF. Wrote the paper: EMV JSK MJW ALB SDWF.
The authors have declared that no competing interests exist.
Phylogenies of highly genetically variable viruses such as HIV-1 are potentially informative of epidemiological dynamics. Several studies have demonstrated the presence of clusters of highly related HIV-1 sequences, particularly among recently HIV-infected individuals, which have been used to argue for a high transmission rate during acute infection. Using a large set of HIV-1 subtype B pol sequences collected from men who have sex with men, we demonstrate that virus from recent infections tends to be phylogenetically clustered at a greater rate than virus from patients with chronic infection (‘excess clustering’) and also tends to cluster with other recent HIV infections rather than chronic, established infections (‘excess coclustering’), consistent with previous reports. To determine the role that a higher infectivity during acute infection may play in excess clustering and coclustering, we developed a simple model of HIV infection that incorporates an early period of intensified transmission, and explicitly considers the dynamics of phylogenetic clusters alongside the dynamics of acute and chronic infected cases. We explored the potential for clustering statistics to be used for inference of acute stage transmission rates and found that no single statistic explains much of the variance in parameters controlling acute stage transmission rates. We demonstrate that high transmission rates during the acute stage are not the main cause of excess clustering of virus from patients with early/acute infection compared to chronic infection, which may simply reflect the shorter time since transmission in acute infection. Higher transmission during acute infection can result in excess coclustering of sequences, while the extent of clustering observed is most sensitive to the fraction of infections sampled.
Diversity of viral genetic sequences depends on epidemiological mechanisms and dynamics; however, the exact mechanisms responsible for patterns observed in phylogenies of HIV remain poorly understood. We observe that virus taken from patients with early/acute HIV infection is more likely to be closely related. By developing a mathematical model of HIV transmission, we show how these and other patterns arise as a simple consequence of intensified transmission during the early/acute stage of HIV infection; however, observing these patterns is highly dependent on sampling a significant fraction of prevalent infections.
Phylogenetic clusters of closely related virus such as HIV arise from the epidemiological dynamics and transmission by infected hosts. If virus is phylogenetically clustered, it is an indication that the hosts are connected by a short chain of transmissions.
If superinfection is rare, and assuming an extreme bottleneck at the point of transmission, each lineage in a phylogenetic tree corresponds to a single infected individual with its own unique viral population.
Given a phylogeny of virus reconstructed from
The sizes of the groupings that arise from a clustering algorithm have been interpreted as a reflection of the heterogeneity of epidemiological transmission. The distribution of cluster sizes of HIV is often skewed right, and depending on the definition of clustering used, can have a heavy tail.
When the taxa of the phylogeny are labeled, such as with the demographic, behavioral or clinical attributes of the individuals from whom the virus was sampled, one can further analyze statistical properties of clustered taxa. Similar taxa, such as those arising from acute infections, may cluster together (or
Many more early infections are phylogenetically clustered than late infections. For future reference, we will refer to this as
If an early infection is clustered, it is more likely to be coclustered with another early infection than expected by chance alone. For future reference, we will refer to this as
The distribution of phylogenetic cluster sizes is skewed to the right and is potentially heavy-tailed.
Below, we illustrate these clustering patterns using 1235 HIV-1 subtype B
These common clustering features motivate several questions. How informative are clustering patterns about the underlying epidemic? In particular, how does higher transmissibility per act during early infection shape the phylogeny of virus? To address these questions, we have developed a simple mathematical framework that demonstrates the connection between epidemiological dynamics and the expected patterns of clustering from a transmission tree and the corresponding phylogeny.
Our modeling work suggests that common features of HIV phylogenies are not coincidences, but universal features of certain viral phylogenies. We expect to see similar patterns for any disease whose natural history features an early period of intensified transmission. High transmission rates during early infection may be a consequence of higher transmissibility per act due to high viral loads, but are also influenced by behavioral factors, such as fluctuating risk behavior.
This research was reviewed by the Institutional Review Board at the University of Michigan. Data used in this research were originally collected for HIV surveillance purposes. Data were anonymized by staff at the Michigan Department of Community Health before being provided to investigators. Because this research falls under the original mandate for HIV surveillance, it was not classified as human subjects research.
Our analysis consists of an empirical component which establishes clustering patterns for a geographically and temporally delineated set of HIV sequences, and an analytical component which establishes a possible mechanism that could generate the observed patterns.
We examined the phylogenetic relationships of 1235 HIV-1 subtype B partial
A maximum clade credibility phylogeny was estimated with BEAST 1.6.2
The phylogeny was converted into a matrix of pairwise distances between taxa expressed in units of calendar time. The distance between a pair of taxa was the TMRCA estimated by BEAST. Taxa were then classified into clusters using hierarchical clustering algorithms. A pair of taxa were considered to be clustered if the estimated TMRCA did not exceed a given threshold, and a range of thresholds was examined, from 0.5% of the maximum distance to the distance corresponding to the point where 90% of taxa are clustered with at least one other taxon.
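The thresholding rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the pairwise TMRCA distances are already available as a square matrix, and the function name `cluster_by_tmrca` and the toy matrix are hypothetical. Taxa are grouped whenever they are linked by a chain of pairs whose TMRCA does not exceed the threshold, which is what single-linkage hierarchical clustering cut at that height would produce.

```python
def cluster_by_tmrca(tmrca, threshold):
    """Assign a cluster label to each taxon.

    Two taxa share a label when they are connected by a chain of pairs
    whose TMRCA does not exceed `threshold` (single-linkage grouping).
    `tmrca` is a symmetric square matrix given as a list of lists.
    """
    n = len(tmrca)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if tmrca[i][j] <= threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]


# Hypothetical TMRCA matrix (years) for four taxa.
toy_tmrca = [
    [0.0, 1.0, 6.0, 6.0],
    [1.0, 0.0, 6.0, 6.0],
    [6.0, 6.0, 0.0, 3.0],
    [6.0, 6.0, 3.0, 0.0],
]
for threshold in (2.0, 4.0, 6.0):
    print(threshold, cluster_by_tmrca(toy_tmrca, threshold))
```

Sweeping the threshold, as in the loop at the end, mirrors the range of thresholds examined in the text: small thresholds leave most taxa as singletons, and larger ones merge them into progressively bigger clusters.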
Coclustering of early/acute infections was investigated using a clinical variable (CD4 count) and a measure of genetic diversity of the virus. Both CD4 and sequence diversity are imprecise indicators of stage of infection. Nevertheless, with a large population-based sample, even noisy indicators of stage of infection are useful for illustrating phylodynamic patterns.
In most cases, CD4 counts were assessed contemporaneously with samples collected for sequencing. The CD4 cell counts can be informative about disease progression and can be used as a noisy predictor of the unknown date of infection.
Recent work
A simple analysis was conducted to establish the existence of excess clustering and coclustering in the Michigan sequences. This analysis is not designed to classify our sample into an early/acute component or to estimate the date of infection for each unit.
To illustrate excess clustering of early/acute infections, we calculated the mean CD4 cell count and mean frequency of ambiguous sites (FAS) over the sample units that fall in a phylogenetic cluster. Because all clustering thresholds are arbitrary, we explored a large range of values, up to the point where 90% of the sample was clustered with at least one other unit. The standard error of the estimated mean was calculated assuming simple random sampling. For small threshold distances, very few taxa are clustered and the standard error is large; it decreases monotonically as the threshold is increased and more taxa are clustered.
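The calculation described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the covariate values and TMRCA matrix are invented, the helper names are hypothetical, and a taxon is treated as clustered when at least one other taxon lies within the threshold TMRCA. The standard error is the usual simple-random-sampling estimate, the sample standard deviation divided by the square root of the number of clustered taxa.

```python
import math


def mean_and_se(values):
    """Sample mean and its standard error under simple random sampling."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, float("nan")  # SE is undefined for a single value
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


def clustered_summary(tmrca, covariate, thresholds):
    """For each threshold, summarize the covariate over clustered taxa.

    Returns a list of (threshold, n_clustered, mean, standard_error).
    """
    n = len(covariate)
    rows = []
    for t in thresholds:
        clustered = [
            covariate[i]
            for i in range(n)
            if any(j != i and tmrca[i][j] <= t for j in range(n))
        ]
        if clustered:
            mean, se = mean_and_se(clustered)
        else:
            mean, se = float("nan"), float("nan")
        rows.append((t, len(clustered), mean, se))
    return rows


# Hypothetical inputs: TMRCA in years and CD4 counts for four taxa.
toy_tmrca = [
    [0.0, 1.0, 6.0, 6.0],
    [1.0, 0.0, 6.0, 6.0],
    [6.0, 6.0, 0.0, 3.0],
    [6.0, 6.0, 3.0, 0.0],
]
toy_cd4 = [600.0, 500.0, 300.0, 200.0]
summary = clustered_summary(toy_tmrca, toy_cd4, thresholds=[0.5, 2.0, 4.0])
for row in summary:
    print(row)
```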
To illustrate excess coclustering, we classified taxa into three categories of CD4: those with CD4
Following the approach outlined in
In
Parameter                              Symbol   Value
Transmission rate of early/acute                1 per 47 days
Transmission rate of chronic                    1 per 1207 days
Mean duration of risk behavior                  19.5 years
Mean duration of early/acute period             180 days
Mean duration of chronic period                 10 years
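As a rough guide to what these values imply, the expected number of transmissions per infection in each stage can be worked out directly. The sketch below is a back-of-envelope calculation, not the model of the paper: it assumes exponentially distributed stage durations, treats cessation of risk behavior as a competing exit from both stages, and assumes every contact is susceptible (so it describes the early epidemic only). All variable names are hypothetical, since the symbols used in the table are not reproduced here.

```python
DAYS_PER_YEAR = 365.0

# Values from the parameter table, expressed as per-day rates.
beta_early = 1.0 / 47.0                       # transmissions per day, early/acute
beta_chronic = 1.0 / 1207.0                   # transmissions per day, chronic
exit_risk = 1.0 / (19.5 * DAYS_PER_YEAR)      # leaving risk behavior
progress = 1.0 / 180.0                        # early/acute -> chronic
end_chronic = 1.0 / (10.0 * DAYS_PER_YEAR)    # end of chronic period


def early_transmission_share():
    """Share of a case's expected transmissions made during the early stage.

    Assumes competing exponential hazards and a fully susceptible
    population; both are simplifying assumptions of this sketch.
    """
    # Expected transmissions while in the early/acute stage.
    early_tx = beta_early / (progress + exit_risk)
    # Probability of reaching the chronic stage before leaving risk behavior.
    p_reach_chronic = progress / (progress + exit_risk)
    # Expected transmissions while in the chronic stage.
    chronic_tx = p_reach_chronic * beta_chronic / (end_chronic + exit_risk)
    return early_tx / (early_tx + chronic_tx)


share = early_transmission_share()
print(f"early-stage share of transmissions: {share:.3f}")
```

Under these assumptions the early stage, although much shorter than the chronic stage, contributes more than half of the expected transmissions, because its per-day transmission rate is roughly 25 times higher.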
Corresponding to an epidemic model of the form 1, we can define a coalescent process
Some of the properties of phylogenies that we seek to reproduce with the model developed below are:
The number of lineages as a function of time (NLFT), also known as the
The fraction of sampled early/acute and chronic infections which are clustered given a threshold TMRCA.
Within a given cluster there will be a number of early/acute taxa and a number of chronic taxa. We will calculate the correlation coefficient between these counts across all clusters given a threshold TMRCA.
The moments of the distribution of cluster sizes, including the mean, variance, and skew of cluster sizes.
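Several of the properties listed above can be computed directly once each taxon has a cluster label and a stage label. The following sketch is illustrative only: the function name, the stage labels `"early"` and `"chronic"`, and the toy data are assumptions, singletons are counted as clusters of size one when computing the correlation and moments, and skew is taken to be the standardized third central moment.

```python
import math
from collections import Counter


def cluster_statistics(labels, stages):
    """Summaries of a clustering of taxa labeled 'early' or 'chronic'."""
    sizes = Counter(labels)
    early = Counter(c for c, s in zip(labels, stages) if s == "early")
    chronic = Counter(c for c, s in zip(labels, stages) if s == "chronic")

    def fraction_clustered(stage):
        members = [c for c, s in zip(labels, stages) if s == stage]
        return sum(sizes[c] > 1 for c in members) / len(members)

    ids = sorted(sizes)
    x = [early[c] for c in ids]    # early/acute count per cluster
    y = [chronic[c] for c in ids]  # chronic count per cluster
    k = len(ids)
    mx, my = sum(x) / k, sum(y) / k
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    corr = sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else float("nan")

    size_list = [sizes[c] for c in ids]
    mean = sum(size_list) / k
    var = sum((v - mean) ** 2 for v in size_list) / k
    third = sum((v - mean) ** 3 for v in size_list) / k
    skew = third / var ** 1.5 if var > 0 else float("nan")

    return {
        "frac_early_clustered": fraction_clustered("early"),
        "frac_chronic_clustered": fraction_clustered("chronic"),
        "early_chronic_correlation": corr,
        "mean_size": mean,
        "var_size": var,
        "skew_size": skew,
    }


# Hypothetical example: six taxa in three clusters.
toy_labels = [0, 0, 0, 1, 1, 2]
toy_stages = ["early", "early", "chronic", "chronic", "chronic", "early"]
stats = cluster_statistics(toy_labels, toy_stages)
print(stats)
```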
Dark branches with taxa labeled
The ancestor function is strictly decreasing in reverse time and converges to one (a single lineage) when the most recent common ancestor of the sample is reached. The initial value of the ancestor function (when the population is sampled) is equal to the sample size
The ancestor functions derived from equations 1, and which are derived in the
Real epidemics in a finite population will have transmission trees such that the number of lineages at any time is a random variable. The mean-field model presented in equation 1 can be viewed as a description of the dynamics of a stochastic system in the limit of large population size. In this case, we can adapt the coalescent to make approximate descriptions of the stochastic properties of the transmission tree in large populations. The ancestor functions will reflect the approximation of the actual (random) number of lineages. Previous work has demonstrated that deterministic descriptions can be excellent approximations for the number of lineages over time.
Given a clustering threshold TMRCA
Many summary statistics that are potentially informative about transmission dynamics can be derived from these moments. The moments are difficult to interpret, so in practice we use them to calculate summary statistics such as variance and skew of the CSD. Below, we examine 30 summary statistics derived from the first three moments and multiple clustering thresholds.
For example, the variance of cluster sizes counting only type
Event-driven stochastic simulations were conducted to verify the suitability of the deterministic approximations for inference. Simulations implemented a variation on the Gillespie algorithm
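For readers unfamiliar with event-driven simulation, a minimal sketch of the standard Gillespie direct method applied to a two-stage infection process is given below. It is not the variation used in this study. The model structure (a closed population, frequency-dependent transmission, an early stage followed by a chronic stage and then removal), the function name and the default population size are all assumptions made for illustration; only the rate values are borrowed from the parameter table.

```python
import random


def simulate_two_stage(n_pop=1000, n_seed=5, t_max=3650.0, seed=1,
                       beta_early=1 / 47, beta_chronic=1 / 1207,
                       gamma_early=1 / 180, gamma_chronic=1 / 3650):
    """Gillespie direct method for an assumed two-stage infection model.

    Times are in days. Returns final compartment sizes and event counts.
    """
    rng = random.Random(seed)
    s, early, chronic, removed = n_pop - n_seed, n_seed, 0, 0
    t = 0.0
    counts = {"tx_early": 0, "tx_chronic": 0, "progress": 0, "removal": 0}

    while early + chronic > 0:
        # Rate of each possible event given the current state.
        rates = {
            "tx_early": beta_early * early * s / n_pop,
            "tx_chronic": beta_chronic * chronic * s / n_pop,
            "progress": gamma_early * early,
            "removal": gamma_chronic * chronic,
        }
        total = sum(rates.values())
        # Waiting time to the next event is exponential with the total rate.
        t += rng.expovariate(total)
        if t > t_max:
            break
        # Pick which event occurs, with probability proportional to its rate.
        event = rng.choices(list(rates), weights=list(rates.values()))[0]
        counts[event] += 1
        if event in ("tx_early", "tx_chronic"):
            s -= 1
            early += 1
        elif event == "progress":
            early -= 1
            chronic += 1
        else:
            chronic -= 1
            removed += 1

    return {"S": s, "early": early, "chronic": chronic,
            "removed": removed, "counts": counts}


result = simulate_two_stage(seed=42)
print(result)
```

Recording which stage each transmission came from, as the `counts` dictionary does here, is the kind of bookkeeping needed to compare simulated lineages in early and chronic infection against a deterministic approximation.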
We have further conducted an investigation into the potential of various summary statistics of the viral phylogeny for inference of underlying epidemiological parameters. Of particular interest is the fraction of transmissions that occur during early HIV infection. As indicated above, it is possible that phylogenetic clustering of early infections reflects elevated transmission during early/acute HIV infection, which we will define as the infectious period from zero to six months. The following simulation experiment was carried out to identify informative statistics:
Parameters
For each set of parameters, the HIV ODE model was integrated. The number of transmissions by early/acute and chronic cases was recorded. The number of stage transitions from acute to chronic was also recorded.
For each record of transmissions and stage transitions, a coalescent tree was simulated using the method described in
For each coalescent tree, summary statistics were calculated and recorded. These statistics consisted of the following: the number of lineages as a function of time before the most recent sample; the correlation between the number of early/acute and chronic infections with threshold TMRCA; the fraction of acute/recent taxa which remain unclustered (not clustered with any other taxa); the fraction of chronic taxa which remain unclustered; the mean number of taxa clustered with an early/acute infection; the mean number of taxa clustered with a chronic infection. Each of these statistics was calculated using five threshold TMRCA values uniformly distributed between one year and 25 years before the most recent sample.
The coalescent tree was simulated such that the sample size matched that of the Detroit MSM phylogeny, and the heterochronous sampling of that phylogeny was reproduced in the coalescent tree. Furthermore, the number of early/acute versus chronic taxa sampled was determined using the BED test for recency of infection for each patient.
Summary statistics were centralized around the mean and rescaled by their standard deviation (
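The centering and rescaling of summary statistics in the final step amounts to computing z-scores for each statistic across the simulated trees. A minimal sketch, assuming the statistics are stored as one list of values per statistic (the dictionary layout and the statistic names are hypothetical):

```python
import statistics


def standardize(columns):
    """Center each statistic on its mean and divide by its standard deviation.

    `columns` maps a statistic name to its values across simulated trees.
    A statistic with zero variance is mapped to zeros.
    """
    scaled = {}
    for name, values in columns.items():
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample standard deviation
        scaled[name] = [(v - mean) / sd if sd > 0 else 0.0 for v in values]
    return scaled


# Hypothetical values of two statistics across three simulated trees.
toy_stats = {
    "frac_early_unclustered": [0.1, 0.2, 0.3],
    "mean_cluster_size": [2.0, 2.0, 2.0],
}
scaled = standardize(toy_stats)
print(scaled)
```

Putting all statistics on a common scale in this way makes their associations with the epidemiological parameters directly comparable.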
The mean CD4 cell count and FAS for clustered taxa is shown in
Left: The mean CD4 cell count (top) and frequency of ambiguous sites (bottom) versus the threshold TMRCA used to form clusters. Middle: The assortativity coefficient, a measure of similarity of coclustered taxa, versus the threshold TMRCA used to form clusters. Assortativity of CD4 is at top, and frequency of ambiguous sites is at bottom. Right: The size of each matrix element is proportional to number of coclusterings between taxa categorized by CD4 (top,
In general, the deterministic model offers an excellent approximation to the stochastic system. All trajectories pass through or close to the median of simulation predictions.
The x-axis gives the time since the beginning of the epidemic, or equivalently, the threshold TMRCA used to calculate the number of lineages over time. Green describes the simulated number of late infections. Blue describes the simulated number of early infections. Dots show the simulated ancestor function for the number of lineages that correspond to late infections, and x's show the simulated ancestor function for lineages in early infection. Dashed lines show the prediction of the deterministic coalescent. The top row shows results for a sample taken at 15 years following the initial infections, and the bottom row shows results for a sample at 30 years. The right column shows results for a sample fraction of 20%, and the left column for a census of prevalent infections (100%).
In
Many summary statistics calculated from an HIV gene genealogy can be informative about the fraction of transmissions attributable to early/acute infection,
The threshold TMRCA was five years before the most recent sample. Sample size and distribution of samples over time was matched to the Detroit MSM phylogeny.
The fraction of taxa which are phylogenetically clustered also varies with
Using the mathematical model, we explored many parameters including the threshold TMRCA for clustering, the sample fraction, and the time relative to the beginning of the epidemic at which sampling occurs.
The time of sampling makes little difference to the qualitative nature of the tree statistics if sampling occurs after the peak epidemic prevalence (around 15 years). However, the sample fraction (the fraction of prevalent infections sampled) has a large effect on all tree statistics. As the threshold TMRCA increases, the fraction remaining unclustered drops much more precipitously when the sample fraction is large than when it is small. This occurs because each transmission can cause a sample unit to become clustered; a large sample size implies that transmissions will have a greater probability of resulting in an observable coalescent event (e.g. it results in a larger ratio
Early infections become clustered at a much greater rate than late infections. This corresponds to the excess clustering of early/acute infections observed in many phylogenies. By virtue of being infected in the recent past, an acute infection inevitably has a very recent common ancestor with another infection who transmitted to that individual. Mathematically, this is reflected in transmission terms of the form
When the sample fraction is non-negligible, the fraction of the sample in a cluster levels off for intermediate thresholds. Similar phenomena were noted by Lewis et al.
The skewness of the CSD shows a similar trend (
A practical consequence of having an intermediate to large sample fraction is that chains of acute-stage transmission will account for many of the clusters observed at low thresholds. If a taxon is clustered with an early infection, then it is
Corroborating
We have used coalescent models to characterize the phylogenetic patterns of a virus which produces an early stage of intensified transmission followed by a long period of low infectiousness. These patterns have been observed in multiple phylogenies of HIV-1 from MSM and IDU, and our model suggests that these should be general features for epidemics which feature early and intense transmission. These patterns are not necessarily a consequence of complex sexual network structure
While there has been much discussion of how clustering of acute infections is caused by the intensity of transmission during the acute stage, the amount of excess clustering that will be observed is also very sensitive to the sample fraction. Even if transmission rates in the early/acute stage are equal to those in the late/chronic stage, we would still observe excess clustering of early/acute infections provided the sample fraction was large enough. This is a simple consequence of early/acute infections being connected by short branch lengths to the individual who transmitted infection. An advantage of the coalescent framework used in this investigation is that it is accurate even with large sample fractions.
Some of the statistics which are most informative of the underlying epidemiological processes are those based on coclustering of labeled taxa, such as the correlation between the number of early and late infections in a cluster. Such statistics tend to be the most responsive to variation of the intensity of transmission during early infection, and are therefore good candidates for future estimation of the fraction of transmissions that occur during the first few months of infection with HIV. Knowing the frequency of early transmission is essential to prevention efforts, since these transmissions are the most difficult to prevent. Individuals with early and acute infection are usually not aware of the infection, and are therefore not susceptible to many interventions. Modeling to evaluate strategies such
Future work could focus on finding ways to use statistics derived from the CSD for estimation of epidemiological parameters within an approximate Bayesian framework.
The authors thank Eve Mokotoff and MaryGrace Brandt and colleagues at the Michigan Department of Community Health for assisting with access to the HIV drug-resistance database.