The Evolutionary Analysis of Emerging Low Frequency HIV-1 CXCR4 Using Variants through Time—An Ultra-Deep Approach

Large-scale parallel pyrosequencing produces unprecedented quantities of sequence data. However, when generated from viral populations current mapping software is inadequate for dealing with the high levels of variation present, resulting in the potential for biased data loss. In order to apply the 454 Life Sciences' pyrosequencing system to the study of viral populations, we have developed software for the processing of highly variable sequence data. Here we demonstrate our software by analyzing two temporally sampled HIV-1 intra-patient datasets from a clinical study of maraviroc. This drug binds the CCR5 coreceptor, thus preventing HIV-1 infection of the cell. The objective is to determine viral tropism (CCR5 versus CXCR4 usage) and track the evolution of minority CXCR4-using variants that may limit the response to a maraviroc-containing treatment regimen. Five time points (two prior to treatment) were available from each patient. We first quantify the effects of divergence on initial read k-mer mapping and demonstrate the importance of utilizing population-specific template sequences in relation to the analysis of next-generation sequence data. Then, in conjunction with coreceptor prediction algorithms that infer HIV tropism, our software was used to quantify the viral population structure pre- and post-treatment. In both cases, low frequency CXCR4-using variants (2.5–15%) were detected prior to treatment. Following phylogenetic inference, these variants were observed to exist as distinct lineages that were maintained through time. Our analysis, thus confirms the role of pre-existing CXCR4-using virus in the emergence of maraviroc-insensitive HIV. The software will have utility for the study of intra-host viral diversity and evolution of other fast evolving viruses, and is available from http://www.bioinf.manchester.ac.uk/segminator/.


Introduction
Sequencing platforms, such as the 454 Life Sciences' GS-FLX pyrosequencing system, has greatly parallelized the determination of nucleotide order within genetic material, resulting in the ability to produce extremely large datasets [1]. The vast numbers of short sequence segments produced (termed reads) in conjunction with intrinsic error rates associated with the sequencing platform [2,3] pose challenging computational problems [4,5]. Importantly, these data have the potential to provide previously unprecedented insight into the extent of pathogen variation (diversity) that exists within a single individual. This is particularly important in the detection of minority variants, for example, those associated with drug resistance [6][7][8][9][10][11].
To date, software has focused on eukaryotic and prokaryotic genome-scale sequencing with its associated megabase reference genomes and vast quantities of read data [5,12]. For such studies traditional fast alignment algorithms [13][14][15] that employ flexible k-mer matching are not capable of mapping reads to a reference sequence within a reasonable time. Consequently new software tools have been developed that incorporate faster string matching techniques at the expense of dealing with variation [12,[16][17][18]. For highly variable genomes this limitation will result in data loss as reads with more than the specified numbers of mismatches, in relation to a template sequence, are discarded. This loss can occur non-randomly with reads representing minority subpopulations being less likely to be mapped to the template. For example, two distinct phenotypes of HIV-1 exist that are defined by the host coreceptor that is used during cell entry. The coreceptors involved are chemokine (C-C motif) receptor 5 (CCR5) and chemokine (C-X-C) receptor 4 (CXCR4). The location of the viral genome that determines the phenotype is the third variable (V3) loop, a highly variable region [19] located within HIV's envelope gene, env [20][21][22]. The most often used genomic reference sequence for HIV-1 is HXB2, a CXCR4-using virus. When mapping V3 data to HXB2, and limiting the number of mismatches allowed, reads representing CCR5 variants are more likely to be lost during mapping as a result of known amino acid changes associated with that phenotype [23,24]. This may result in a misleading ratio of coreceptor use within an HIV-1 population, which can have consequences for drug treatment decisions [25]. Thus, for rapidly evolving viruses, such as HIV-1 [26][27][28][29], a limitation on the number of mismatches tolerated during read mapping is less than optimal. Modification of traditional k-mer matching approaches [13] is a more suitable approach and becomes scalable (due to the smaller genome sizes) as they allow for increased tolerance when dealing with higher levels of variation in viral 454 data [6,11].
Prior to any evolutionary study reads must be accurately mapped and aligned. Our software performs these tasks as well as subsequent tropism testing, phylogenetic tree inference and visualization (Fig. 1). We demonstrate the software's underlying framework in order to quantify the effects of divergence on the mapping of reads to a template sequence. In addition to unbiased mapping of data, a reduction of divergence between reads and template is favorable for the removal of platform dependent insertions. Characteristically with 454 data there is a high rate of insertion error associated with the chemistry involved during the pyrosequencing process [2,3]. Failure to remove such insertions can result in a further loss of usable data when translations are required during downstream analysis.
We apply our software to temporally sampled 454 datasets from two HIV-1 infected individuals in order to characterize the emergence of low frequency CXCR4-using variants following treatment with an HIV entry-inhibitor drug, the CCR5 antagonist maraviroc. As the drug will not directly impact on viruses using the CXCR4 coreceptor [30], patients are screened for their presence prior to treatment [25,31]. The aim is to distinguish a viral population that is exclusively CCR5 tropic (R5) from a viral population including either dual-mixed, DM, (R5 and exclusively CXCR4-using [X4]), or R5 and dual-tropic viruses (those that can use both CCR5 and CXCR4 [R5X4]). Note, we refer to both X4 and R5X4 tropic viruses as CXCR4using. This application of our software demonstrates that sequence data generated from the 454 platform -in conjunction with coreceptor prediction tests based on HIV's V3 region [23,24,32] -permits the quantification and evolutionary analysis of HIV-1 tropism present at low frequencies within a sample more effectively than could be achieved using standard population sequencing technologies [6,11]. Our software will also have utility for studying the within-host diversity of other fast evolving viruses.

Datasets
Samples for pyrosequencing were obtained from two HIV-1 infected males, patients D and E, both of whom were treatment naïve and enrolled in the QD arm of the A4001026 study [31], were infected with subtype B virus and received maraviroc once daily together with zidovudine and lamivudine. They discontinued the study due to insufficient clinical response: patient D continued to week 2 and patient E to week 24. Both had the M184V mutation in reverse transcriptase at failure, which confers resistance to the background therapy and, in addition, patient D had M41M/L and K70K/R. Samples were collected over five time points: screening (40 days before), day 1, week 2 (day 15), week 12 (day 80) and week 16 (day 107) for patient D and screening (41 days before), day 1, week 8 (day 57), week 24 (day 162) and week 30 (day 211) for patient E.
For each sample, RNA extraction and amplification from the gp120 region of env was performed. The amplicons were subjected to nebulization to generate fragments of approximately 600 nucleotides. These were amplified as described in Margulies et al. [1]. and sequenced on the Genome Sequencer 20 (GS20, Roche Applied Sciences). Standard protocols for the generation of a library of tagged single-stranded DNA molecules were used (for details, see Margulies et al. [1]). The GS20 software package was used to generate the sequence files. This resulted in files containing between 14,000 and 31,000 reads ( Table 1). The data is available at the NCBI Sequence Read Archive (http:// www.ncbi.nlm.nih.gov/Traces/sra) under accession number SRA023641.1.

K-mer mapping
Although RNA extraction and amplification was carried out across the entire gp120 gene, the region that is required for the coreceptor prediction of HIV-1 variants is V3 [23,24,32]. Reads covering this region were identified using a k-mer matching process similar to the initial phase of the BLAST algorithm [13]. For a single read the location of all k-mers of size five are identified across the template sequence. If a matching region is found for the read, the frequencies of k-mer hits will be above the random level of background noise at that location ( Fig. 1). For each dataset the coordinates of the template in relation to the HXB2 reference genome are 6900 to 7305. This accommodates longer reads that may span the entire V3 region. The coordinates of the V3 loop within this region are 7110 to 7217. Dataset specificity was generated within the templates using a premapping to the HXB2 reference sequence for which multiple alignments are then generated within windows of size 70 (using a 20 nucleotide overlap). Within each of these alignments columns containing more than 50% gaps were removed. Consensus sequences, created for each alignment, are then appended in order to form data-specific templates to which reads are then remapped.
To explore the effects of using a consensus template on k-mer mapping, for each dataset, we compared the number of reads mapped to the data-specific template for that dataset to the number of reads mapped using HXB2 -the latter being the premapping stage prior to consensus template generation. Next, we took our patient D screening dataset and introduced random variation into the consensus template in sequential steps of 2, 4 up to 26% (50 repetitions for each). After each introduction of random variation, the k-mer mapping was performed and the number of reads successfully mapped recorded. It should be noted that no precise pairwise alignment to the template sequence is

Author Summary
Due to high data volumes, error rates, and short sequence lengths, new sequencing technologies present a new challenge for computational biology. In addition, highdepth (or ultra-deep) datasets, for example from pathogens, contain exceptionally large amounts of variation over short genomes or genomic regions. Here we present software for the processing and downstream analysis of such short-read viral sequence data. We apply the software to the analysis of two HIV-1 infected individuals who did not respond optimally to the drug maraviroc. For each patient, pyrosequence data was available for five time points. In both cases we detect distinct clusters of lowfrequency drug-insensitive variants that were present prior to maraviroc treatment and effectively unmasked by the removal of the drug-sensitive HIV. Figure 1. The data analysis framework. On the left hand side the preprocessing of the template sequence prior to read mapping is illustrated. The fragments titled ''k-mers'' are all the unique words (length = 5) within the template sequence. These are stored along with their corresponding locations. On the opposite side all k-mers of equal length, extracted from the read, are shown. The plot indicates the frequency of k-mer matches across the template sequence for a single read. Grey boxes indicate processing events that take place within the framework. The yellow circles indicate optimization steps: (i) only exact k-mer matches used (ii) a heuristic alignment not constructed from the k-mer matching (just the k-mer match frequencies are plotted) and (iii) only the appropriate region of the template is pairwise aligned to the read. doi:10.1371/journal.pcbi.1001022.g001 generated during k-mer mapping. Instead the reads within our datasets covering the portion of the HIV-1 genome between the coordinates 6900 to 7305 are identified along with their approximate start positions. The k-mer mapping approach implicitly allows for a higher degree of tolerance in identifying such reads when compared to approaches that limit the number of mismatches [12,[16][17][18] as, although the process involves matching exact k-mers to the template at any given location, the overall frequency of k-mer hits will increase at the most likely location of the read across the template (Fig. 1). This approach, thus, does not specify an exact limitation on the number of mismatches allowed between the read itself and the template.

Pairwise alignment
Following k-mer mapping, reads are pairwise aligned to the consensus template using Smith and Waterman [33]. Indices obtained during k-mer mapping are used to optimize the process by only aligning reads to the appropriate region. Platform dependent insertion error, which makes up the majority of non-biological error [2,3], is accounted for by maintaining reference to the dataset specific template. Specifically, insertions relative to the template, which represents an in frame consensus sequence, are removed. The frequency at which these insertions occurred across the V3 region was recorded. The usage of a data-specific consensus sequence is important to ensure that insertions naturally existing within the population are not erroneously removed based on use of a divergent template sequence.

Tropism prediction and phylogenetic inference
Reads spanning the V3 region of env were extracted, truncated, identical reads removed (frequencies were stored) and multiply aligned using Muscle [34], packaged with the software. Coreceptor prediction was performed using the 11/24/25 ''charge rule'' [23,32], implemented within the software and using the PSSM web tool [24]. Sequence logos were generated for inferred R5 and CXCR4-using sequence present at each time point using the Web Logo tool [35]. Nucleotide sequences, annotated with coreceptor predictions, were used to infer evolutionary relationships by maximum likelihood using PhyML [36], packaged with the software. The HKY model of sequence evolution was used. The resulting phylogenetic trees were visualized using an integrated version of CTree [37]. Because bootstrapping is unreliable when performed on very short sequence alignments, the significance of the identified clusters within datasets representing the early time points was determined by comparing the ratio between the intra-cluster pairwise distance and the inter-cluster pairwise distance (of five random clusters) to a distribution of values obtained for 500 sets randomly assigned clusters. A low intra-cluster pairwise distance relative to the inter-cluster pairwise distance implies a robust cluster [38,39]. Additionally clustering significance was tested using the approximate likelihood ratio test [40] for branches as implemented within PhyML.

Key functions of the software
The pipeline used for processing the initial read data is available within our software (Fig. 1). Implemented in Java the executable runs on Mac OS X, Linux and Windows. All required external binaries are included within the package. The input is a FASTA formatted file containing unmapped read data. Output files are in FASTA, TXT, PDF or NEWICK format as appropriate. A summary of the key functions incorporated into the software are: (i) accurate mapping of next generation sequence data containing high amounts of variation, (ii) exportation of reads spanning user defined regions of the template, (iii) translation of reads, (iv) determination of nucleotide and/or amino acid residue frequencies, (v) generation of a consensus sequence across the entire dataset taking into account data-specific indels, thus, reducing dependency on a generic template, (vi) removal of reads based on a hamming distance from their corresponding region on the template, (vii) generation of a multiple alignment of reads spanning a particular region of the template using Muscle [34], (viii) detection and annotation of low frequency variants, (ix) inference of phylogenetic trees using PhyML [36], (x) tree label searching based on the annotation produced in viii and visualized using CTree [37], and (xi) management of bar coded data. During the scaffolding process, a number of output plots are generated to summarize the data. These include read length distributions and template coverage. The latter is portrayed in a circular plot to allow for longer templates to be displayed optimally.

K-mer mapping
For each dataset, following k-mer mapping to the consensus template, high read coverage was observed across the V3 region (Table 1). In each case when HXB2 was used as a template sequence fewer reads are mapped. For patient D the mean loss of reads is 26.4%, while for patient E it is 36.5%, the difference being due to the divergence between patients D and E's data-specific templates and HXB2 (Table 1). When random  variation in sequential steps of 2 to 26% was introduced into the consensus template derived from the patient D screening dataset and k-mer mapping performed on reads from that dataset, a reduction in the number of mapped can be observed that is directly proportional to increasing divergence (Fig. 2).

Pairwise alignment
Reads were pairwise aligned to the appropriate region (identified from the k-mer mapping step) of the data-specific consensus templates and those spanning both the start and end of the V3 region were extracted and truncated. Between 17 and 33% of reads contain at least one insertion event across the V3 region in  comparison to the consensus template ( Table 2). The vast majority of these insertions were observed to be singleton or dinucleotide insertions (causing a frame shift), with a mean per site frequency of 0.18%. Note, this frequency is after the sequences have been truncated. These are lower numbers than would be expected if complete reads had been included [3] as the starts and ends of the majority of the reads have been removed by the truncation step. During the alignment process such insertions were removed in order to maintain as many correctly translated V3 regions as possible.

Tropism prediction and phylogenetic inference
For both patients a high coverage of in-frame reads across the V3 region was observed at each time point (Tables 3 and 4) with many unique variants. Those reads that could not be translated correctly were discarded, resulting in the lower numbers observed in Tables 3 and 4 than those presented in  table 2. When both the charge rule and PSSM tests were performed on these data, CXCR4-using variants were detected prior to treatment within both patients (Tables 3  and 4). On maraviroc treatment, for patient D, CXCR4-using virus increased to a frequency of 41% and for patient E increased to 99% at the sampling times (Tables 3 and 4). Interestingly, despite the CXCR4-using population increasing in patient D and becoming dominant in patient E ontreatment, the reduction in viral load corresponds to an order of magnitude less CXCR4-using virus than that prior to treatment. The inference of the evolutionary history for each dataset, revealed the majority of CXCR4-using variants formed a distinct cluster, present prior to maraviroc treatment and divergent to the main R5 populations present prior to treatment (Fig. 3 and. 4). The inset plots confirm the significance of each CXCR4-using cluster based on the comparison of intra-and inter-pairwise distances and confirmed using the approximate likelihood ratio test (Fig. 3 and 4). The sequence logos beside each phylogeny represents a comparison between the sequence characteristics of the R5 tropic and CXCR4-using variants. At each time point, key differences in charge [23,24,32] can be observed at sites 11 for patient D and at site 25 for patient E.
For each patient for the two time points prior to therapy when the CXCR4-using variants located within the R5 tropic clusters were investigated, they were observed to be more similar to their closely related R5 tropic counterparts than to the distinct clusters of CXCR4-using variants ( Fig. 3 and 4, screening and Day 1). For patient D there are four such variants (1.1% of the CXCR4-using population), while for patient E there are ten (7% of the CXCR4-using population) prior to therapy.
It is important to note, the extent of the divergence in the phylogenetic trees is mainly due to the high number of either unique or rare variants. These variants cluster around high frequency variants within the population (Fig. 5). For example, at screening (Patient D) only ten variants make up 75% of the viral population with a single variant contributing to 23% of the population. Despite a proportion of this variation being due to sequencing error, this level of variation emerging in relatively short time periods highlights the extreme mutability of HIV. Tables 3 and 4 show the results of the PSSM test carried out on the extracted V3 sequences. In all cases, with the exception of patient E (week 8), PSSM confirms similar levels of CXCR4-using virus to those predicted by the charge rule. PSSM predicts viruses to be CXCR4-using based on scores being higher than a threshold of 22.88 (Fig. 6), and to be R5 tropic based on scores being below a threshold value of 26.96 (Fig. 6). Between these two threshold values a reliable PSSM prediction cannot be made and so composite PSSM utilizes the charge rule at sites 11 and 25 [24]. Note, we have also included site 24 in the composite prediction [23].

Discussion
We have developed freely available software for the management and downstream analysis of pathogen sequence data. We demonstrate the utility of this software by applying it to the detection, and subsequent evolutionary analysis, of drug resistant variants within two temporally sampled patients infected with HIV-1. In our study of the V3 region we demonstrate that both the CXCR4-using viral populations, which emerge during maraviroc treatment, do not evolve de novo. Instead, confirming previous studies [6,11,41], they emerge from a pre-existing, distinct, viral subpopulation that is present prior to therapy (Fig. 3  and 4). Note, the phylogenetic analysis was repeated excluding sites 11, 24 and 25 (those used in the charge rule test), and the same divergent CXCR4-using cluster identified (and statistically supported), indicating convergent evolution is not biasing the inferences. If the R5 variants had evolved during treatment to become CXCR4-using they would be more closely related to these R5 counterparts. In addition, characteristic differences between R5 and X4 variants (specifically at sites 11, 24 or 25 in V3) are observed at screening and day 1 ( Fig. 3 and 4, sequence logos).
We quantified the effects of HIV-1 mutability on the k-mer mapping process prior to downstream analysis. Using the patient D screening dataset, in conjunction with the consensus template for that data within which random mutations were introduced in sequential steps, we observed that at a divergence level of 26% just over 20% of the reads originally mapped to the unaltered consensus map successfully (Fig. 2). This demonstrates there is a direct relationship between the number of reads that are mapped successfully and the level of divergence between the data and template sequence. The usage of an inappropriate template will, thus, very probably result in the non-random loss of data, introducing an unnecessary bias. Indeed for each of our datasets, when mapped to HXB2 rather than the data-specific consensus templates, between 24 and 44% of reads covering the V3 region were not mapped as a result of divergence between the consensus templates and HXB2 which ranged form 14 to16.5% (Table 1).
When an amino acid translation step is performed, in our case for inferring reads as R5 or CXCR4-using, data loss can be further minimized by utilizing a correction procedure relative to the inframe template sequence. Platform-dependent insertions make up the majority of sequencing error usually resulting in an overrepresentation of frame shifts within the reads [2,3]. Correction based on a divergent template will result in a greater probability of complete codons being removed erroneously and therefore it is optimal to use a template that is dataset specific ( Figure S1). In Tsibris et al., [11] where no such correction was performed on temporally sampled data from two subtype B infected patients much of the data was removed. Within one sample a platform dependent insertion within a known homopolymeric stretch resulted in the staggering removal of 85% reads. Using a correction approach, based on an in-frame consensus template, reduces this loss greatly (Table 5).
Algorithms for the computational prediction of tropism are highly dependent on the available training datasets. In the case of PSSM, for example, the training data used in the web tool defines the threshold cutoff values (22.88 and 26.96) used in the coreceptor prediction [24]. When data falls between the current threshold values the PSSM web tool uses the charge rule [24]. This can be misleading as seen for patient E week 8 (Fig. 6), the charge rule called 99.3% of the population as CXCR4-using based on the presence of a positive charge at site 25, while PSSM called 39.1% of the population as CXCR4-using (Table 4). For the latter only 0.3% of variants fall above the PSSM CXCR4using threshold. The remaining 38.8% of CXCR4-using variants is based on the charge rule and not the PSSM scores. The variants that fall below the CCR5-using threshold (60.9%), despite the majority still possessing a positively charged residue at site 25, have been called based on their PSSM scores. The most likely explanation is that these variants are dual tropic and typing them as R5 is incorrect.
It is also important to consider how much CXCR4-using virus is acceptable in the context of combination therapy. At present a 2% threshold has been proposed by RH [42]. In our study both patients had greater than 2% CXCR4-using virus at screening and the CXCR4-using population was greater than 10,000 copies/ml. Interestingly, although the CXCR4-using virus is clearly present during therapy, the overall CXCR4using plasma HIV-1 RNA was reduced during the treatment phase, presumably due to the effect of the other drugs used with maraviroc.
In conclusion, our results demonstrate that, in conjunction with appropriate software, pyrosequencing data has utility for the evolutionally analysis and detection of low frequency variants within viral populations. In our analysis we have provided a high-resolution snapshot, through temporally sampled data, of intra-patient viral diversity and evolution associated with the CCR5-antagonist maraviroc. We have also quantified the effects of viral diversity on the initial k-mer mapping of read data in relation to the correction of platform Figure 6. Tropism prediction. Frequency plots of PSSM scores of unique V3 sequences within each dataset. The red area indicates the region below the 26.96 threshold (R5), the green region indicates the area above the 22.88 threshold (CXCR4-using) and the grey area indicates the region between the two thresholds. The numbers within each plot area indicate the percentage of reads called as R5 or CXCR4-using for that region. doi:10.1371/journal.pcbi.1001022.g006 Table 5. Reference template comparison.

No reading frame correction
Data-specific consensus based frame correction dependent insertion error. The features of the software used here can be applied to other drug susceptibility and resistance studies, within other genomic regions of HIV-1 or to other pathogen genomes. Figure S1 Example of frame correction to a dataset specific template and HXB2. Pairwise alignment of the V3 region of the consensus template (patient D, screening) to that of HXB2 (A).

Supporting Information
Correction of the V3 sequence based on HXB2, and subsequent loss of a complete codon (B). Correction of the V3 sequence based on the data-specific template, and no codon loss (C). Author Contributions