Optimized phylogenetic clustering of HIV-1 sequence data for public health applications

doi:10.1371/journal.pcbi.1010745

Table 1.

Summary of sequence data characteristics.

Length is the median length of nucleotide (nt) sequences. HXB2 coords = reference nucleotide coordinates in the HXB2 genome (Genbank accession K03455). Year type: sequences are annotated with year of sample collection, and in some cases date of HIV diagnosis. N = Total sample size, including both old and new sequences. Incid = number of sequences in ‘incident’ subset (most recent year). Subtype classifications were derived from the original data sources, when available, or generated de novo with SCUEAL [60].

Fig 1.

Examples of clustering definition and growth criteria.

Subtree (A) represents a paraphyletic cluster (C₁) of two background (old) sequences, O₁ and O₃, excluding a third sequence O₂ that is too distant from the root node of C₁. Consequently, O₂ becomes its own cluster of one, C₂. Subtree (B) illustrates a monophyletic cluster where all background sequences meet this criterion. Subtrees (C-E) depict the addition of a new (incident) sequence N to an existing cluster. In (C), the new sequence is added with 100% confidence to a cluster of one background sequence, O₂. Conversely, the placement of N in subtrees (D) and (E) is highly uncertain (with bootstrap supports of 40% and 60%). For (D), N becomes incorporated into the same cluster irrespective of its placement, so the bootstrap values are irrelevant. In contrast, neither placement of N in subtree (E) meets the clustering criteria because of the resulting branch lengths; as a result, N becomes a new cluster of one, C₃.
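
The growth cases in subtrees (C-E) can be illustrated with a minimal sketch. It assumes that each candidate placement of N has already been checked against the branch-length criteria, and that support is pooled across placements that fall in the same cluster. The function name `assign_new_sequence`, the `(cluster_id, support)` input format and the pooling rule are illustrative assumptions, not the published implementation.

```python
def assign_new_sequence(placements, min_support=0.95):
    """Decide which cluster (if any) a new sequence joins.

    placements: list of (cluster_id, support) pairs, one per candidate
    placement. cluster_id is None when that placement fails the
    branch-length criteria (hypothetical encoding).
    Returns the cluster id, or None if the sequence becomes a singleton.
    """
    support_by_cluster = {}
    for cluster_id, support in placements:
        if cluster_id is None:
            continue  # this placement does not meet the clustering criteria
        support_by_cluster[cluster_id] = (
            support_by_cluster.get(cluster_id, 0.0) + support
        )
    if not support_by_cluster:
        return None  # subtree (E): no placement qualifies
    best = max(support_by_cluster, key=support_by_cluster.get)
    # Pooled support must reach the minimum threshold (assumed rule).
    return best if support_by_cluster[best] >= min_support else None


case_c = assign_new_sequence([("C2", 1.0)])               # one certain placement
case_d = assign_new_sequence([("C1", 0.4), ("C1", 0.6)])  # same cluster either way
case_e = assign_new_sequence([(None, 0.4), (None, 0.6)])  # neither placement qualifies
```

Under these assumptions, cases (C) and (D) return an existing cluster, while case (E) returns `None`, which corresponds to N forming a new cluster of one.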

Fig 2.

Distributions of branch lengths in HIV-1 pol phylogenies.

Each density summarizes the distribution of branch lengths (measured in units of expected nucleotide substitutions per site) for different locations and types of branches, as indicated in the upper right corner of each plot. We used Gaussian kernel densities with default bandwidths adjusted by factors 1.5, 0.75 and 0.75, respectively. Densities are labeled on the right with the corresponding number of branches. Internal and terminal branches are derived from the phylogeny reconstructed from background sequences. New terminal branches refer to additional branches to incident (new) sequences as placed onto the phylogeny by maximum likelihood (pplacer).
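
As a rough illustration of a Gaussian kernel density with an adjusted bandwidth, the sketch below multiplies a rule-of-thumb bandwidth by an adjustment factor. The choice of Silverman's rule as the default bandwidth, the function names and the example branch lengths are assumptions made for illustration; the software and default used for the figure may differ.

```python
import math
import statistics


def rule_of_thumb_bandwidth(values):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(values)
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-0.2)


def gaussian_kde(values, grid, adjust=1.0):
    """Evaluate a Gaussian kernel density estimate at each grid point.

    adjust multiplies the rule-of-thumb bandwidth: values above 1 give a
    smoother curve, values below 1 a more detailed one.
    """
    h = adjust * rule_of_thumb_bandwidth(values)
    norm = 1.0 / (len(values) * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - v) / h) ** 2) for v in values)
        for g in grid
    ]


# Hypothetical branch lengths (expected substitutions per site).
branch_lengths = [0.001, 0.002, 0.004, 0.008, 0.01, 0.015, 0.02, 0.03]
step = 0.0005
grid = [-0.1 + i * step for i in range(501)]  # -0.1 to 0.15
smooth = gaussian_kde(branch_lengths, grid, adjust=1.5)
detailed = gaussian_kde(branch_lengths, grid, adjust=0.75)
```

A factor of 1.5 widens each kernel and flattens the curve, whereas 0.75 narrows the kernels and preserves more local structure.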

Fig 3.

Difference in AIC between Poisson-linked models of cluster growth.

Clusters and growth are defined at 41 different maximum distance thresholds from 0 to 0.04 with a minimum bootstrap support requirement of 95% for ancestral nodes. The AIC of a null model where size predicts growth is subtracted from the AIC of a proposed model where size and mean time predict growth. The darker colour in each plot corresponds to the AIC differences for a maximum likelihood tree built from the full set of old sequences, while the lighter colour represents the mean AIC difference obtained at each threshold for 100 approximate likelihood trees built on 80% subsamples of the old sequences, drawn without replacement. The shaded area represents 1 standard deviation from the mean AIC difference for subsamples at each threshold. Date of sequence collection was used to measure time for all data sets.
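
The direction of the subtraction can be made concrete with a small sketch. It takes the fitted mean growth per cluster from each model as given (the model fitting itself is not shown) and assumes the null model has two parameters (intercept and size) and the proposed model three (intercept, size and mean time). The function names and the numbers are hypothetical.

```python
import math


def poisson_loglik(counts, means):
    """Poisson log-likelihood of observed counts given fitted means."""
    return sum(
        y * math.log(mu) - mu - math.lgamma(y + 1)
        for y, mu in zip(counts, means)
    )


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * loglik


def delta_aic(counts, null_means, proposed_means, k_null=2, k_proposed=3):
    """AIC of the proposed model minus AIC of the null model.

    Negative values favour the proposed model (size and time).
    """
    aic_null = aic(poisson_loglik(counts, null_means), k_null)
    aic_proposed = aic(poisson_loglik(counts, proposed_means), k_proposed)
    return aic_proposed - aic_null


# Hypothetical growth counts for four clusters and fitted means.
growth = [0, 1, 3, 5]
fitted_null = [2.25, 2.25, 2.25, 2.25]
fitted_proposed = [0.5, 1.0, 3.0, 4.5]
difference = delta_aic(growth, fitted_null, fitted_proposed)
```

With this sign convention, a more negative difference means that adding time to the model improves the fit by more than the penalty for the extra parameter.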

Fig 4.

Effect of bootstrap thresholds on ΔAIC profiles.

AIC differences between two Poisson-linked models of cluster growth are shown for all four full data sets, with clusters and growth defined at 41 different maximum distance thresholds from 0 to 0.04. The AIC of a null model where size predicts growth is subtracted from the AIC of a proposed model where size and time predict growth. For the data sets where either sequence collection date or associated patient diagnostic date could define time, the AIC differences for each are shown in separate colours. Solid lines represent AIC differences obtained using an additional bootstrap threshold of 0.95 to define clusters and growth, while dashed lines were obtained without this requirement.

Table 2.

Cluster statistics under paraphyletic and monophyletic clustering.

‘No bootstrap’ corresponds to paraphyletic clustering with b_min = 0; otherwise this threshold defaults to 95%. Optimal d_max is the pairwise distance threshold selected by minimizing ΔAIC, in units of expected number of nucleotide substitutions. ‘Number of clusters’ only counts clusters with two or more background sequences, i.e., this number does not include singletons. Total growth is the number of incident (new) sequences connected to clusters of background sequences. Growing clusters is the number of clusters to which incident sequences attach.
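
Selecting the optimal d_max amounts to taking the threshold with the lowest ΔAIC over a grid. The sketch below assumes a grid of 41 evenly spaced thresholds from 0 to 0.04 and a hypothetical ΔAIC profile; the names `select_optimal_threshold` and `profile` are illustrative.

```python
def select_optimal_threshold(thresholds, delta_aic_values):
    """Return the threshold with the lowest ΔAIC and that ΔAIC value."""
    best_index = min(
        range(len(thresholds)), key=lambda i: delta_aic_values[i]
    )
    return thresholds[best_index], delta_aic_values[best_index]


# 41 evenly spaced thresholds: 0, 0.001, ..., 0.04.
thresholds = [i / 1000 for i in range(41)]

# Hypothetical ΔAIC profile with its minimum at 0.016.
profile = [1e4 * (t - 0.016) ** 2 - 5.0 for t in thresholds]

best_d_max, best_delta = select_optimal_threshold(thresholds, profile)
```

If several thresholds tie for the minimum, this sketch returns the smallest of them, which is an arbitrary choice.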

Fig 5.

Concordance between predicted growth in phylogenetic clusters and the actual (simulated) transmission network.

The top and bottom barplots summarize phylogenetic clusters obtained under bootstrap thresholds of 0% and 95%, respectively. Bars correspond to the number of incident cases mapped to phylogenetic clusters, coloured by three distance thresholds: d_max = 0.01, the ΔAIC optimum, and d_max = 0.03. Distance in transmission network is the shortest path in the transmission network between the incident case and any member of the predicted cluster. A distance of zero means the actual source individual is in the cluster, and distances greater than zero indicate the actual source was not sampled. Unsampled indicates that none of the sampled infections in the transmission history of an incident case were members of the phylogenetic cluster. Discordant indicates that the actual source individual was sampled but does not appear in the predicted cluster.
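
The shortest-path distance can be sketched with a breadth-first search over the transmission network. The sketch assumes an undirected adjacency-list representation and reports the number of intermediate individuals between the incident case and the nearest cluster member, so that a direct source in the cluster gives zero. That offset is an interpretation of the caption, and the function name and example network are hypothetical.

```python
from collections import deque


def intermediates_to_cluster(network, incident, cluster):
    """Number of individuals between the incident case and the cluster.

    network: dict mapping each individual to its transmission contacts.
    Returns 0 when a direct contact of the incident case is in the
    cluster, and None when no cluster member is reachable.
    """
    members = set(cluster)
    seen = {incident}
    queue = deque([(incident, 0)])
    while queue:
        node, edges = queue.popleft()
        if node in members and node != incident:
            return edges - 1  # edges on the path minus one (assumed convention)
        for neighbour in network.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, edges + 1))
    return None


# Hypothetical transmission chain A -> B -> C -> N, stored undirected.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "N"], "N": ["C"]}
direct = intermediates_to_cluster(network, "N", {"C"})
indirect = intermediates_to_cluster(network, "N", {"A"})
```

Breadth-first search visits individuals in order of increasing path length, so the first cluster member reached is guaranteed to be the nearest one.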
