AFYP and SDWF conceived and designed the experiments. AFYP performed the experiments. AFYP analyzed the data. AFYP, FIL, and SLKP contributed reagents/materials/analysis tools. AFYP wrote the paper.

¤ Current address: SAC Inverness, Research and Development Division, Drummondhill, Inverness, Scotland, United Kingdom

The authors have declared that no competing interests exist.

The third variable loop (V3) of the human immunodeficiency virus type 1 (HIV-1) envelope is a principal determinant of antibody neutralization and progression to AIDS. Although it is undoubtedly an important target for vaccine research, extensive genetic variation in V3 remains an obstacle to the development of an effective vaccine. Comparative methods that exploit the abundance of sequence data can detect interactions between residues of rapidly evolving proteins such as the HIV-1 envelope, revealing biological constraints on their variability. However, previous studies have relied implicitly on two biologically unrealistic assumptions: (1) that founder effects in the evolutionary history of the sequences can be ignored, and (2) that statistical associations between residues occur exclusively in pairs. We show that comparative methods that neglect the evolutionary history of extant sequences are susceptible to a high rate of false positives (20%–40%). Therefore, we propose a new method to detect interactions that relaxes both of these assumptions. First, we reconstruct the evolutionary history of extant sequences by maximum likelihood, shifting focus from extant sequence variation to the underlying substitution events. Second, we analyze the joint distribution of substitution events among positions in the sequence as a Bayesian graphical model, in which each branch in the phylogeny is a unit of observation. We perform extensive validation of our models using both simulations and a control case of known interactions in HIV-1 protease, and apply this method to detect interactions within V3 from a sample of 1,154 HIV-1 envelope sequences. Our method greatly reduces the number of false positives due to founder effects, while capturing several higher-order interactions among V3 residues. By mapping these interactions to a structural model of the V3 loop, we find that the loop is stratified into distinct evolutionary clusters.
We extend our model to detect interactions between the V3 and C4 domains of the HIV-1 envelope, and account for the uncertainty in mapping substitutions to the tree with a parametric bootstrap.

The human immunodeficiency virus type 1 (HIV-1) possesses a highly variable envelope comprising the glycoproteins gp120 and gp41, which mediate the binding and entry of the virus into a host cell. The viral envelope is also a potent antigen for neutralizing antibodies [

The detection of interactions among residues in rapidly evolving viral proteins such as the HIV-1 envelope is an important and unresolved problem. First of all, the failure to account for such interactions can hamper efforts to map genetic variation to virus phenotypes, such as coreceptor usage, neutralization sensitivity, or drug resistance. For example, a substitution at position 306 in HIV-1 gp120 (relative to the HXB2 reference sequence) is necessary, but not sufficient, to induce a shift in coreceptor usage in HIV-1; full expression of this phenotype requires additional substitutions at positions 320 or 324 [

The third variable domain (V3) of the HIV-1 envelope typically spans 33 to 35 residues that are bounded by two invariant cysteines that form a disulfide bond to create a loop. The V3 loop is characterized by extensive sequence variation, and is a principal determinant of important HIV-1 phenotypes such as coreceptor usage and cell tropism [

To date, comparative studies of HIV-1

In this study, we propose a new method for detecting interactions from an arbitrary sample of genetic sequences that relaxes both of these assumptions. We apply our method to analyzing residue–residue interactions in the V3 loop of HIV-1 gp120, which has emerged as a model system for the implementation of association test statistics or classification algorithms [

To address the second assumption, we will analyze the phylogenetically augmented data _{S}

We apply this “evolutionary-network” model to detect interactions among residues in the V3 loop of the HIV-1 envelope. Using maximum likelihood, we infer a phylogenetically independent set of substitution events. Interactions among residues are manifested as correlated substitutions within this inferred set, such that substitutions affecting a subset of residues tend to be mapped to the same branch of the tree. Because the phylogenetic inference of substitution events is susceptible to some uncertainty, we carry out a parametric bootstrap procedure to quantify the sensitivity of the results from a maximum-likelihood reconstruction. We also apply our method to several control cases, including the better-characterized compensatory interactions in HIV-1 protease, to validate our results for the V3 loop. Our analysis reveals a large number of interactions among residues that fall into stratified clusters along the length of the V3 loop.
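As a minimal illustration of the data structure this implies (not the authors' implementation), the Python sketch below treats each branch as the set of sites that received a nonsynonymous substitution, and tabulates how often a pair of sites changes on the same branch. The toy branch data, the function names, and the 0.5 continuity correction in the odds ratio are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple


def cooccurrence_table(
    branches: Dict[str, Set[int]], i: int, j: int
) -> Tuple[int, int, int, int]:
    """Count branches as (both, only_i, only_j, neither) for sites i and j."""
    both = only_i = only_j = neither = 0
    for sites in branches.values():
        has_i, has_j = i in sites, j in sites
        if has_i and has_j:
            both += 1
        elif has_i:
            only_i += 1
        elif has_j:
            only_j += 1
        else:
            neither += 1
    return both, only_i, only_j, neither


def odds_ratio(table: Tuple[int, int, int, int], correction: float = 0.5) -> float:
    """Odds ratio of a 2x2 table; the correction (an assumption) avoids division by zero."""
    both, only_i, only_j, neither = (x + correction for x in table)
    return (both * neither) / (only_i * only_j)


# Hypothetical toy data: branch label -> sites with a nonsynonymous substitution.
toy_branches = {
    "b1": {5, 7},
    "b2": {5, 7, 30},
    "b3": set(),
    "b4": {12},
    "b5": set(),
    "b6": {7},
}

table = cooccurrence_table(toy_branches, 5, 7)
print(table, round(odds_ratio(table), 2))
```

Because each branch is counted once, a substitution inherited by many descendant sequences contributes a single observation rather than one per sequence, which is the property that guards against founder effects.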

A total of 1,154 full-length sequences of HIV-1

We fit a codon substitution model [

We analyzed the matrix of substitution events mapped to branches in the phylogeny (_{S}^{190} possible network structures on 33 nodes; this number clearly precludes an exhaustive search for an optimal structure. Furthermore, more than one network structure may be supported equivocally by the data, especially when the number of observations is small relative to the number of nodes [^{36} permutations of 33 nodes—and yields a smoother posterior probability surface with improved MCMC convergence properties [
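To give a sense of the scale of the two search spaces contrasted above, the following sketch counts labeled directed acyclic graphs with Robinson's recurrence and compares that count with the number of node orderings (33!). This is an illustrative calculation only; the function name `count_dags` is hypothetical, and the recurrence is a standard combinatorial result rather than something taken from this paper.

```python
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    total = 0
    for k in range(1, n + 1):
        # k = number of nodes with no incoming edges (inclusion-exclusion).
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
    return total


n_nodes = 33
n_structures = count_dags(n_nodes)
n_orders = factorial(n_nodes)

# Report orders of magnitude rather than the full integers.
print(f"structures on {n_nodes} nodes: ~10^{len(str(n_structures)) - 1}")
print(f"orderings of {n_nodes} nodes:  ~10^{len(str(n_orders)) - 1}")
```

The gap of roughly 150 orders of magnitude between the two counts is what makes sampling over node orders far more tractable than sampling over structures directly.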

Following Friedman and Koller [_{S}_{P}

We ran the Markov chain for 10^{6} steps with a burn-in period of 10^{5} steps, which we have found to be more than sufficient for convergence for networks of this size. We ran a duplicate Markov chain and found that Gelman and Rubin's convergence diagnostic [^{4} steps of the chain at equal intervals. We estimated the posterior probability for each of 528 possible edges as the proportion of structures in the sample in which the edge was present in either direction, weighted by _{S}^{5} steps with a burn-in period of 10^{4} steps and 10^{3} samples. The profile of each chain was visually inspected to evaluate convergence. The frequency of edges with a posterior probability exceeding 0.95 was summed across bootstrap samples to quantify the sensitivity of edges in the maximum-likelihood consensus network to uncertainty in the reconstruction of ancestral sequences.
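The edge-level summary described above can be sketched as follows. This is a simplified, unweighted version: it treats every sampled structure equally and ignores the weighting mentioned in the text, and the toy samples and function name are hypothetical. An edge is counted as present if it appears in either direction.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Edge = Tuple[int, int]  # directed edge (parent, child)


def edge_posteriors(
    samples: Iterable[Set[Edge]], n_nodes: int
) -> Dict[FrozenSet[int], float]:
    """Fraction of sampled structures containing each undirected edge."""
    counts = {frozenset(pair): 0 for pair in combinations(range(n_nodes), 2)}
    n_samples = 0
    for dag in samples:
        n_samples += 1
        # Collapse direction: (a, b) and (b, a) map to the same undirected edge.
        for pair in {frozenset(edge) for edge in dag}:
            counts[pair] += 1
    if n_samples == 0:
        raise ValueError("at least one sampled structure is required")
    return {pair: c / n_samples for pair, c in counts.items()}


# Hypothetical toy sample of four structures on three nodes.
toy_samples = [{(0, 1), (1, 2)}, {(1, 0)}, {(0, 1)}, {(2, 1), (0, 1)}]
posteriors = edge_posteriors(toy_samples, n_nodes=3)
consensus = {pair for pair, p in posteriors.items() if p > 0.95}

# With 33 nodes there are 33 * 32 / 2 = 528 undirected edges to score.
n_possible_edges = len(list(combinations(range(33), 2)))
print(posteriors, consensus, n_possible_edges)
```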

We employed three validation procedures to evaluate the accuracy of our methods. First, we invoked a paired binary-character model, originally developed to analyze the evolution of N-linked glycosylation site motifs [

Second, we simulated the evolution of nucleotide sequences along the tree according to a more realistic codon substitution model whose parameters were estimated from the original alignment of V3 sequences. We randomly generated 100 replicate alignments with the same dimensions and characteristics (e.g., expected codon frequencies) as our observed V3 alignment by this method. Because the codon substitution model assumes that an alignment is a set of independently evolving codon sites [

Third, we applied the evolutionary-network model to a set of HIV-1 protease sequences, in which compensatory interactions are substantially better characterized empirically or structurally than the V3 loop, particularly in the context of drug resistance [

We mapped interactions from the consensus Bayesian network to a three-dimensional structure of the V3 loop of HIV-1 gp120 complexed to the CD4 receptor and X5 antibody (Research Collaboratory for Structural Bioinformatics Protein Data Bank [RCSB PDB]), using the visualization software Chimera (University of California San Francisco, Computer Graphics Lab [

The maximum-likelihood reconstruction of ancestral sequences along the tree resulted in approximately 1.87 nonsynonymous substitutions per branch. The mean number of inferred nonsynonymous substitutions differed significantly between internal (0.48 substitutions per branch) and terminal (3.28) branches of the tree (Wilcoxon rank sum test,

We validated the accuracy of our evolutionary-network model using three different controls. First, we simulated the evolution of HIV-1 V3-like sequences along the original phylogeny as vectors of binary characters switching between consensus and nonconsensus residues. Each consecutive pair of residues was constrained to coevolve according to an adjustable parameter ɛ, where ɛ = 1 corresponded to independently evolving sites. We contrasted the performance of a binary-state analog of the evolutionary-network model, reconstructing substitution events by maximum likelihood, against the results from applying the Fisher exact test to the extant binary sequences (

(A) A receiver operating characteristic (ROC)-like curve, in which the

(B) Boxplots corresponding to the false-positive rate (as a fraction of the total number of pairwise comparisons = 528) from the corresponding analysis (evolutionary-network [Evol-Net] or Fisher exact test [Fisher]) of simulated sequences evolving according to a null model of codon substitutions in which sites evolve independently, using parameter settings estimated from the original V3 sequence alignment. We generated 100 replicate simulations of nucleotide sequences evolving along the original neighbor-joining tree.

Second, we simulated the evolution of HIV-1 V3 sequences along the phylogeny using a more realistic codon-based substitution model. Because this model assumes that each codon site evolves independently, the number of significant associations from each replicate simulation provided an estimate of the false-positive rate. Using the Fisher exact test on the pairwise combination of amino acids in simulated sequences, we found false-positive associations between 160.8 out of 528 pairs on average (10% and 90% quantiles = 115.4 and 217.3, respectively;
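As a quick arithmetic check, the counts of false-positive pairs reported for the Fisher exact test can be converted to rates by dividing by the 528 possible pairs among 33 sites. The variable names below are illustrative; the counts are those given in the text.

```python
# Number of unordered pairs among 33 V3 sites.
N_PAIRS = 33 * 32 // 2

# False-positive pair counts reported for the Fisher exact test.
fisher_counts = {"mean": 160.8, "q10": 115.4, "q90": 217.3}

# Convert each count to a fraction of all pairwise comparisons.
fisher_rates = {name: count / N_PAIRS for name, count in fisher_counts.items()}

for name, rate in fisher_rates.items():
    print(f"{name}: {rate:.1%}")
```

The resulting rates, roughly 22% to 41% with a mean near 30%, are consistent with the 20%–40% range quoted in the abstract.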

Third, we applied our evolutionary-network method to analyze HIV-1 subtype B protease sequences isolated from 2,461 patients undergoing drug regimens including at least one protease inhibitor [

The prior probability of every potential edge was set to 0.5. Given our augmented dataset, the distribution of the posterior probabilities of edges was strongly U-shaped, with a distinct cluster of edges with probabilities exceeding 0.95 (

This histogram indicates the frequency of the marginal posterior probability of the 528 possible edges, sampled from a Markov chain over the space of node orders (see

Each node corresponds to a residue in the V3 loop, numbered according to its position in the consensus sequence (identical to Korber et al. [

The strongest association in the consensus network occurred between residues 5 and 7 (OR = 155.6), which jointly defined a conserved N-linked glycosylation site motif (i.e., NNTR). Upon inspection, we found 28 phylogenetically independent events in which substitutions occurred along the branch, affecting both residues and disrupting the motif. Because a substitution at either residue would have been sufficient to eliminate the N-linked glycosylation site motif, this association suggested the presence of additional constraints on V3 in the absence of glycosylation. We also found evidence of an interaction between residues 5 and 30 (OR = 53.4). Although these residues resided on the opposite strands of the V3 loop, they were roughly equidistant from the base (

A three-dimensional visualization of the structure of the V3 loop, using structural coordinates from a model comprising the HIV-1 gp120 core protein complexed to the CD4 receptor and the X5 antibody [

Two of the network components (R12–F19 and I13–Q17) represented positive associations that were nested with respect to the secondary structure (

A large component comprised associations among the nodes S10, D24, I25, and I26, which all mapped to residues in the stem region of the V3 loop [

We applied the 11/25 rule to classify 131 of the extant sequences as yielding CXCR4-binding virus, i.e., having an “X4” phenotype. Thirty-seven of the X4 sequences formed monophyletic groups, for which each common ancestor might also be interpreted as X4. However, each X4 sequence would most likely have been derived from a CCR5-binding ancestor over the course of an infection [

The final component of the network, comprising associations among the nodes N4, D28, and Q31, mapped to the base of the V3 loop. Although the network components {N5, T7, R30} and {N4, D28, Q31} are nested with respect to the amino acid sequence, preliminary analyses via molecular dynamics simulation of the V3 loop suggested that the side chains of D28 and Q31 occupied a distinct space apart from R30 (unpublished data). Thus, the consensus network components defined a stratified V3 loop with respect to its secondary structure, with clusters of putative interactions localized to its tip, stem, or base regions (

By mapping the statistical associations identified by edges in the consensus network to a structural model of the V3 loop [28], we were able to calculate the average distance separating the residue pairs with respect to the folded protein, or the number of residues separating the pair in the amino acid sequence (i.e., tertiary and primary distances, respectively). We also generated null distributions of mean distances by randomizing the residues occupying nodes of the consensus network. Although the observed mean primary distance (11 residues) coincided with the mean of the null distribution (10.8), the observed mean tertiary distance (10.1 Å) was significantly lower than expected (16.7 Å,
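The randomization described in this paragraph can be sketched as a simple permutation procedure. The version below is a hedged illustration, not the authors' code: it uses primary (sequence) distance only, takes as input just the five residue pairs named explicitly in the text rather than the full nine-edge consensus network, and all function names are hypothetical. A tertiary-distance version would substitute a distance function based on structural coordinates.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Edge = Tuple[int, int]


def mean_edge_distance(edges: Sequence[Edge], dist: Callable[[int, int], float]) -> float:
    """Average distance between the two residues joined by each edge."""
    return mean(dist(a, b) for a, b in edges)


def null_mean_distances(
    edges: Sequence[Edge],
    residues: Sequence[int],
    dist: Callable[[int, int], float],
    n_perm: int = 1000,
    seed: int = 0,
) -> List[float]:
    """Mean edge distance after randomly reassigning residues to network nodes."""
    rng = random.Random(seed)
    nodes = sorted({x for edge in edges for x in edge})
    null = []
    for _ in range(n_perm):
        # Draw distinct residues for the nodes, keeping the network topology fixed.
        relabel = dict(zip(nodes, rng.sample(residues, len(nodes))))
        null.append(mean_edge_distance([(relabel[a], relabel[b]) for a, b in edges], dist))
    return null


def primary_distance(a: int, b: int) -> float:
    """Number of positions separating two residues in the sequence."""
    return abs(a - b)


# Illustrative subset of edges named in the text (not the full consensus network).
subset_edges = [(5, 7), (5, 30), (10, 24), (12, 19), (13, 17)]
v3_residues = list(range(1, 34))  # 33 positions, an assumption about numbering

observed = mean_edge_distance(subset_edges, primary_distance)
null = null_mean_distances(subset_edges, v3_residues, primary_distance)
p_lower = sum(d <= observed for d in null) / len(null)
print(observed, round(mean(null), 2), p_lower)
```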

Residue–residue interactions between the V3 and C4 domains of gp120 have been documented in previous experimental work [

A consensus network assembled from edges with marginal posterior probabilities exceeding 0.95, in which nodes represent nonsynonymous substitution events at codon sites in the V3 and C4 domains. Nodes that correspond to residues in the C4 domain are shaded blue (numbered according to their position in the consensus

Mapping substitutions within the V3 loop to branches in the tree allowed us to partition the analysis between terminal and internal branches, i.e., focusing on HIV-1 evolution within or among hosts, respectively. The network obtained from an analysis of substitutions mapped to terminal branches was very similar to the original network, recovering the edges N5–T7, S10–D24, R12–F19, and I13–Q17 (

Our analysis of the covariation among residues comprising the V3 loop of the HIV-1 envelope glycoprotein gp120 is the first to model sequence variation as a joint probability distribution in a phylogenetic context. We refer to this type of analysis as an evolutionary-network model. By simulating sequences on the inferred phylogeny under a null model of independent evolution among sites, we show that analyses that do not account for common ancestry are susceptible to a high false-positive rate, even after applying corrections for multiple comparisons. Consequently, such analyses tend to over-report the number of significant associations within the V3 loop, ranging in one case from 39 to 157, depending on the association test statistic and method of adjusting for the false discovery rate [

Five out of the nine putative interactions that were identified by our evolutionary-network model have previously been reported in comparative studies. Pairing of residues 10 and 24 is ubiquitous [

Unfortunately, very few interactions between specific residues in the V3 loop have been described consistently by experimental or comparative studies (see

Similarly, there are few documented cases of interactions between specific residues in V3 and C4. Morrison et al. [

In light of this, we have performed extensive tests to validate the accuracy of our model. Our simulations indicate that mapping substitution events to the phylogeny is very effective at removing the confounding influence of founder effects, reducing the high false-positive rate experienced by other methods by almost two orders of magnitude. In addition, complex patterns of conditional dependence among codon sites in V3 were revealed by our use of Bayesian network models. In sum, we find that the evolutionary-network model can reliably identify true interactions with a very low rate of false positives. Although our model is a considerable improvement over previous methods, it still requires a number of assumptions. First of all, we are mapping substitution events to branches in a tree that we assume to be a known quantity. It is possible to quantify this uncertainty by simultaneously sampling the topology of the tree and parameters of a nucleotide substitution model from a posterior probability distribution [

Secondly, we implicitly assume that our codon substitution model is an accurate representation of the true process underlying the evolution of our sample of HIV-1

Third, our application of the evolutionary-network model to V3 loop sequences implicitly assumes that residue–residue interactions are constant throughout the evolutionary history of the sequences. This assumption is susceptible to subtype-specific interactions [

Finally, our analysis of covariation in V3 handles all nonsynonymous substitutions at a given site equivalently, i.e., making no distinction between the specific residues involved. This approximation greatly reduces the dimensionality of the model to binary states (presence or absence of any nonsynonymous substitution). As in the case of subtype-specific interactions, this approximation could potentially mask residue-specific interactions [

The paradigm of a subdivision of function among sections of the V3 loop originated with experiments identifying the “tip” region as the principal neutralizing determinant [

Ultimately, our goal is to map residue–residue interactions to host factors and clinically relevant virus phenotypes, such as coreceptor usage or neutralization sensitivity. Because our unit of observation consists of inferred evolutionary events rather than observed variation, we will require an evolutionary model for every phenotype to be included in the analysis, including continuous traits. The task of detecting interactions among components of genotype or phenotype has rapidly grown in its significance to HIV-1 research. Outside of the ongoing work on associating sequence variation in the V3 loop with coreceptor usage [

This network was assembled from edges with marginal posterior probabilities exceeding a cutoff of 0.95, obtained from applying our evolutionary-network method to an alignment of HIV-1 subtype B protease sequences. Edges are labeled with their marginal posterior probability expressed as a percentage. Nodes are labeled with the alignment consensus residue and position of the codon site (consistent with the HXB2 reference sequence). Codon sites that have been previously implicated in resistance to protease inhibitors or subsequent compensatory mutations are labeled in pink (cross-resistant) or orange (specific to nelfinavir).


This contour plot was generated from random assignments of V3 residues to nodes of the consensus network. Mean primary distance (


(A) A maximum-likelihood network obtained from substitutions mapped to terminal branches of the tree. Each node corresponds to a residue in the V3 domain, numbered according to its position in the consensus sequence and labeled with the consensus amino acid. Edges connecting nodes indicate an interaction between residues.

(B) A consensus network obtained from substitutions mapped to internal branches of the tree. Edges are labeled with their corresponding parametric bootstrap support values.


In this table, we summarize the evidence for various putative interactions between pairs of residues in the HIV-1 envelope V3 loop. C = evidence from comparative studies of V3 sequences, E = evidence from experimental mutagenesis of V3 sequences. Entries in parentheses indicate putative interactions within intervals of the V3 sequence that were not completely resolved to specific residue pairs. Overall, concordance among studies is poor, with the possible exception of associations between the residues S10, R12, and D24. Citations for each study were abbreviated as follows: K93 = Korber et al. (1993) [


The Research Collaboratory for Structural Bioinformatics Protein Data Bank (

We thank Andrew Leigh Brown and Selene Zarate for helpful discussions, and two anonymous reviewers for their insightful comments on previous versions of this manuscript.

Bayesian Dirichlet metric

human immunodeficiency virus type 1

third variable domain