Evolution and Taxonomic Classification of Alphapapillomavirus 7 Complete Genomes: HPV18, HPV39, HPV45, HPV59, HPV68 and HPV70

Background The species Alphapapillomavirus 7 (alpha-7) contains human papillomavirus genotypes that account for 15% of invasive cervical cancers and are disproportionately associated with adenocarcinoma of the cervix. Complete genome analyses enable identification and nomenclature of variant lineages and sublineages. Methods The URR/E6 region was sequenced to screen for novel variants of HPV18, 39, 45, 59, 68, 70, 85 and 97 from 1147 cervical samples obtained from multiple geographic regions that had previously been shown to contain an alpha-7 HPV isolate. To study viral heterogeneity, the complete 8 kb genome of 128 isolates, including 109 sequenced for this analysis, were annotated and analyzed. Viral evolution was characterized by constructing phylogenic trees using maximum-likelihood and Bayesian algorithms. Global and pairwise alignments were used to calculate total and ORF/region nucleotide differences; lineages and sublineages were assigned using an alphanumeric system. The prototype genome was assigned to the A lineage or A1 sublineage. Results The genomic diversity of alpha-7 HPV types ranged from 1.1% to 6.7% nucleotide sequence differences; the extent of genome-genome pairwise intratype heterogeneity was 1.1% for HPV39, 1.3% for HPV59, 1.5% for HPV45, 1.6% for HPV70, 2.1% for HPV18, and 6.7% for HPV68. ME180 (previously a subtype of HPV68) was designated as the representative genome for HPV68 sublineage C1. Each ORF/region differed in sequence diversity, from most variable to least variable: noncoding region 1 (NCR1) / noncoding region 2 (NCR2) > upstream regulatory region (URR) > E6 / E7 > E2 / L2 > E1 / L1. Conclusions These data provide estimates of the maximum viral genomic heterogeneity of alpha-7 HPV type variants. The proposed taxonomic system facilitates the comparison of variants across epidemiological and molecular studies. Sequence diversity, geographic distribution and phylogenetic topology of this clinically important group of HPVs suggest an independent evolutionary history for each type.


Introduction
Genital human papillomavirus (HPV) infections constitute one of the most common sexually transmitted infections; however, only a subset of these cause cervical cancer and its immediate precursor, cervical intraepithelial neoplasia 3 (CIN3) [1].Cervix cancer is the third most common cancer among women worldwide and is of particular importance in developing countries and based on the fact that relatively young women are stricken [2,3].Most oncogenic or high-risk (HR) HPV types associated with invasive cervical cancer are phylogenetically clustered within the species groups Alphapapillomavirus 9 (HPV16-related) or Alphapapillomavirus 7 (HPV18-related) [4,5].These two species groups account for approximately 75% and 15% of all cervical cancers worldwide, respectively [6][7][8].
The International Committee on Taxonomy of Viruses (ICTV) designates the definitions for genera and species groups of isolates within the family Papillomaviridae based on recommendations of an international PV working group committee.However, it does not set standards below the species level [9].Within the PV research community, a distinct papillomavirus genotype is established when the nucleotide sequence of the L1 ORF of a cloned virus differs from that of any other characterized types by at least 10% [4,5].However, the complete viral genome sequence is used to classify closely related HPV variants, since changes are not uniform across the genome.We have previously employed this nomenclature system to classify HPV6/11, HPV16 and HPV16 related alpha-9 HPV variants [10][11][12]; isolates of the same HPV type are grouped into variant lineages or sublineages when the pairwise nucleotide sequences of their genomes differ by 1.0%-10.0%or 0.5%-1.0%,respectively [12].The lexicon of other taxonomic terms, such as serotypes, strains, subtypes, is not recommended in order to maintain a common language [4].
HPV variants have been shown to differ in geographic origins [13][14][15][16], evolutionary dynamics [17][18][19][20][21], and pathologic potential [15,16,[22][23][24].A comprehensive classification system can facilitate investigation of the clinical and biological role sequence variations play in genotype-phenotype associations.Moreover, since groups of single nucleotide polymorphisms (SNPs) are highly correlated within each lineage/sublineage, this allows HPV researchers to discuss the properties of HPV variant lineages without having to describe sets of nucleotide changes facilitating comparisons of data across studies.For example, an HPV33 variant (C7732G) and an HPV58 variant (C632T and G760A) reported to be associated with higher risks of cervical cancer [25,26] can now be classified into HPV33 sublineage A2 and HPV58 sublineage A3, respectively [12].In addition, recent reports using the alpha-9 HPV variant nomenclature examined risk of cervical intraepithelial neoplasia grades 2-3 (CIN2/3) by HPV31 variants [27], which could be compared with other studies supporting the notion that the HPV31 C lineage was a risk factor for CIN2/3 [28].
In this report, the complete 8 kb genomes of 128 variants representing major lineages and sublineages of alpha-7 HPV types (HPV18, 39,45,59, 68 and 70) were selected and analyzed to capture maximum viral heterogeneity.Variations across the genomes are identified and the phylogeny and nomenclature of alpha-7 variant lineages/sublineages are described.Genomic diversity, evolutionary dynamics and geographic clustering are discussed.

Ethics Statement
The studies providing cervical samples for this work have been IRB approved by each ethics committee and all samples received in the Burk lab were coded and did not have individual identifying information.In details, Rwanda -The Rwanda National Ethics Committee and the Institutional Review Board of Montefiore Medical Center, Bronx NY approved the study protocol and the consent process.

Clinical Specimens, Identification of Novel HPV Variants and Whole Genome Sequencing
DNA from cervicovaginal samples already determined to have alpha-7 types (HPV18, 39, 45, 59, 68 and 70) by previous testing were available from women participating in epidemiological studies worldwide, including -Costa Rica [29], Taiwan [30], Thailand [31,32], Rwanda [33], Burkina Faso [34] and Zambia [35].The methods for sample collection and HPV typing are provided in the references from each study.The number of samples analyzed for each type is shown in Table 1.The HPV genomes within the DNA samples were classified by sequencing the URR and/or E6 regions from PCR products as described [12,27].Briefly, we used type-specific primers to amplify a partial fragment of the URR region and/or the E6 ORF using a one-tube nested PCR method [36].The E6 ORF was evaluated only for those specimens that did not yield data for the URR region.The PCR product sizes were confirmed by gel electrophoresis, purified using the QuickStep 2 PCR Purification kit (Edge BioSystems, Gaithersburg, MD) or QIAquick Gel Extraction kit (Qiagen, Valencia, CA) and submitted for Sanger sequencing of both strands at the Einstein Genomics Facility.The sequences for each type were separately aligned and preliminary phylogenetic trees were used to identify samples that likely contained divergent viral genomes (data not shown).Based on this analysis, we selected type-specific viral isolates for complete genome sequencing that (1) represented novel variant clades or (2) had 2 or more isolates that contained at least 2 unique sequence variations (e.g., single nucleotide polymorphisms (SNPs)) not present in other isolates within the URR/E6 regions.
The number of genomes selected for sequencing was based on identification of divergent isolates as described above and differed for each type (Table 1).The complete 8 kb genomes from clinical samples were amplified in 2 to 3 overlapping fragments using type-specific primer sets (available from authors) based on the prototype sequence of each type.For overlapping PCR, an equal mixture of AmpliTaq Gold DNA polymerase (Applied Biosystems, Carlsbad, CA) and Platinum Taq DNA Polymerase (Invitrogen, Carlsbad, CA) was utilized as previously described [12,37].PCR products of anticipated size, as determined by gel analyses, were either directly sequenced or cloned into pGEM-T easy (Promega, Madison, WI) or TOPO TA pCR2.1 vectors (Invitrogen, Carlsbad, CA) and then sequenced.Comparison of repeat sequencing of PCR products from the same isolates resulted in a difference of less than one change per 8,000 bp; whereas, comparison of the cloned genomes gave a difference of approximately one difference per 5,000 bp.For discrepancies between sequences, we used the sequence of the PCR product as the consensus sequence.HPV complete genome sequences were submitted to GenBank (Table S1).In addition, a GenBank search, at the time of initial analysis (November 2012), for alpha-7 complete genomes identified 9 HPV18 isolates from Thailand [38], 1 HPV59 and 2 HPV68 isolates from China [39,40] and the set of reference sequences (i.e., HPV18, HPV39, HPV45, HPV59, HPV68 and HPV70.)ME180, previously a subtype of HPV68, was included as a variant of HPV68.HPV85 and HPV97 genomes were not investigated in this study due to the limited diversity of currently reported genomes [18].The accession numbers of all sequences used in this report are listed in Table S1.

Evolutionary analyses and phylogenetic tree construction
The nucleotide sequences of the complete circular genomes were linearized at the first ATG of the E1 ORF and globally aligned using the program MAFFT v 6.864b [41].Based on the concept of a single ancestor for each type, a unique genome size is assigned to each HPV type based on the global alignment; the variation in genome sizes of isolated variants is the result of insertions and deletions (indels).Each indel was counted as one event.The assignment of position numbers for each nucleotide is based on the nucleotide numbering of the prototype reference sequence.
Maximum likelihood (ML) trees were constructed using RAxML MPI v 7.2.8.27 [42] and PhyML MPI v 1.4.3 [43] with optimized parameters based on the aligned complete genome e Total number and percentage of variable nucleotide positions based on a reference genome for each HPV type as described above.Nucleotide variations include single nucleotide polymorphisms (SNPs) and indels, which are considered equivalent to one variation per indel, independent of indel size; f Number of amino acid codons (not including overlapping ORFs) based on the reference genome for each type as described above.Cumulative number of amino acid codons are taken from 7 ORFs (E6, E7, E1, E2, E5, L2 and L1), E4 is not counted separately nor are other overlapping ORFs; g Total number and percentage of variable amino acids based on the total number of amino acid positions derived from the reference genome for each HPV type.
nucleotide sequences.MrBayes v 3.1.2[44] with 10,000,000 cycles for the Markov chain Monte Carlo (MCMC) algorithm was used to generate Bayesian trees.A 10% discarded burn-in was set to eliminate iterations at the beginning of the MCMC run.For Bayesian tree construction, the computer program ModelTest v 3.7 [45] was used to identify the best evolutionary model; the identified GTR model was set for among-site rate variation and allowed substitution rates of aligned sequences to be different.Data were bootstrap resampled 1,000 times in PhyML.
SNPs within the HPV genomes and lineage-specific SNPs were determined from alignments of type specific variant genomes using MEGA v 5.05 [46] and MacClade v 4.08 [47], respectively.Mean nucleotide differences and standard errors between and within type-specific lineages and sublineages were calculated from the global sequence alignment of each type using MEGA v 5.05 bootstrapped 1,000 times [46].The pdistance method in MEGA v 5.05 [46] was used to calculate the genome-genome pairwise differences from the above alignments.The rarefaction curves for each type were generated by EstimateS v 8.2 [48].

HPV variant distribution, lineage classification and nomenclature
HPV isolates for sequencing were selected from a large set of samples previously typed for HPV and further analyzed for novel variants by sequencing a fragment of the URR and/or E6 ORF (Table 1).From the complete genome sequences of the isolates sequenced in this study and those obtained from GenBank, HPV variant lineage classification and nomenclature were assigned, as previously described [10,12].The phylogenetic topologies for each type combined with an approximate cut off of 1.0% difference between complete genomes were used to define major variant lineages (1.0%-10.0%).Each major lineage was named using an alphanumeric, with the "A" clade always containing the reference genome for each type.Differences between genomes in the 0.5%-1.0%range were designated as sublineages (e.g., A1, A2, etc.).When the reference genome could be classified as a sublineage, it was always assigned into the "A1" clade.An overview of the alpha-7 HPV types and variant lineages are shown in Figure 1.

Genomic Diversity of HPV18 Variants
HPV18 complete genome variants have been previously investigated and two major lineages, European (E) and African (Af) were identified [14,18].The URR and/or E6 regions of 380 HPV18 isolates from cervical samples were sequenced and clustered into distinct groups as shown in Figure 2 [14,18].Numerous novel SNPs or SNP patterns, e.g., G437A, C7549A and T7475G (Figure S1A), were detected in several samples from which 23 variants with maximum diversity were selected for complete genome sequencing (Table 1).In addition, twentytwo HPV18 complete genome variants from Costa Rican (n = 13) [18] and Thai women (n = 9) [38] that were previously published (Table S1) are included in the current analyses.
A total of 422/7857 (5.4%) nucleotide positions showed variations whereas, 165/2476 (6.7%) codons specified variable amino acids (i.e., they represent nonsynonymous SNPs) (Table 1).The maximum nucleotide pairwise difference between the most dissimilar isolates was 2.1% (Table S2, second column).The most variable region was the noncoding region 2 (NCR2) between the E5 and L2 ORFs with 11.6% overall nucleotide diversity.Consistent with a previous report [18], three different indel events were detected across the genome in the following regions: the E2/E4 ORFs (6 bp deletion), the noncoding region 1 (NCR1 is the region between the stop codon of the E2 ORF and the start codon of the E5 ORF) (19 to 20 bp deletion), and the URR (7 bp deletion) (Figure S1A).

Genomic diversity of HPV39 variants
We amplified and sequenced the E6/URR region of 122 HPV39-containing samples and selected 19 isolates for genome sequencing (Table 1 and Table S1).The overall nucleotide and amino acid variable positions of the complete genomes (19 sequenced for this analysis plus the reference sequence makes a total of 20 complete genomes) were 2.4% (186 sites among 7912 nt) and 2.5% (61 sites among 2428 aa), respectively.The noncoding regions (NCR2 and URR) were more variable than the coding ORFs.HPV39 isolates Tw562 and BF182 were the most distantly related genomes with a nucleotide sequence pairwise difference of 1.1%; this distance represented the maximum inter-lineage diversity of HPV39 variants (Table S2).
Phylogenetic analyses based on complete genome sequences clustered HPV39 variants into 2 lineages designated A and B (Figure 3).Lineage A was further divided into two sublineages, A1 and A2 (0.48 ± 0.06% different); both sublineages were nearly equally distant to lineage B, with mean differences of 0.98 ± 0.09% and 0.97 ± 0.08%, respectively (Figure 3 and Table S3).Isolate Qv36565 (an A1 variant) may bridge the evolution of sublineages A1 and A2, sharing a set of nucleotide changes with A2 variants (E1 -C1298G, C1683T, A2257C, G2355A; E2 -T3157C; E5 -A4081G; L1 -C6785A) (Figure S1B).Lineage and sublineage specific nucleotide variations were determine across the genomes (3 SNPs for sublineage A1, 7 for A2, 37 for lineage A, and 35 for lineage B) (Figure S2).Sublineage A2 variants had a nine amino acid repeat within the E1 ORF (DAEGE(H/N) Figure 1.Alpha-7 phylogenetic tree showing representative types and variant lineages.A maximum likelihood (ML) tree was constructed using RAxML MPI v 7.2.8.27 [42] and PhyML MPI v 1.4.3 [43] inferred from the global alignment of complete circular genome nucleotide sequences linearized at the first ATG of the E1 ORF.Numbers on or near branches indicate support indices < 100% by RAxML and PhyML, respectively.The shaded areas represent groupings of lineages and sublineages of HPV18, HPV39, HPV45, HPV59, HPV68 and HPV70; the prototype sequences of HPV85 and HPV97 are also included as indicated, although no variant lineages are distinguished due to the limited number of isolates of these types.The length of broken and solid lines represent distance between clades, although the number of changes is different for these two lines, as indicated in the upper left corner of the figure.HPV51, an alpha-5 type, was set as the outgroup.doi: 10.1371/journal.pone.0072565.g001GGS), and a 52 bp repeat was detected within the URR region of isolate Qv29509 (an A1 variant).

Genomic diversity of HPV45 variants
Eleven variants from 217 HPV45-containing samples were selected for complete genome sequencing; they represented 4 women from Costa Rica, 3 from Rwanda, 2 from Zambia, and 2 from Burkina Faso.Classification of HPV45 variants was based on the analysis of these samples combined with 13 previously reported HPV45 genomes, including the reference isolate [18,49].There were a total of 239 nucleotide sites amongst the Figure 2. HPV18 tree topology and pairwise comparisons of individual complete genomes.Phylogenetic trees were inferred from global alignment of complete genome nucleotide sequences (the other alpha-7 HPV reference prototypes were set as the outgroup).Numbers on or near branches indicate support indices in the following order: maximum likelihood (ML) bootstrap percentages using RAxML MPI v 7.2.8.27 [42] and PhyML MPI v 1.4.3 [43], and Bayesian credibility value percentage using MrBayes v 3.1.2[44].An asterisk (*) indicates 100% agreement between methods."NA" indicates disagreement between a method and the reference RAxML tree at a given node.Thus, one tree is shown, but three different methods of tree construction were used to estimate the support of the provided tree, as explained above.Distinct variant lineages (i.e., termed A, B, and C) are classified according to the topology and nucleotide sequence differences from > 1% to < 10%; distinct sublineages (e.g., termed A1 and A2) were also inferred from the tree topology and nucleotide sequence differences in the > 0.5% to < 1% range.The percent nucleotide differences for each isolate compared to all other isolates (i.e., 1 x 1 comparisons) are shown in the panel to the right of the phylogeny.Values for each comparison of a given isolate are connected by lines and the comparison to self is indicated by the 0% difference point.Symbols and colored lines are used to distinguish each isolate.doi: 10.1371/journal.pone.0072565.g0027858 bp HPV45 genome that were variable (3.0%).Of the 2440 encoded amino acids, 89 (3.6%) showed variations (Tables 1 and S2, Figure S1C).The maximum nucleotide difference was 1.5%; the most divergent isolates were Qv27565 (an A2 variant) and Qv35960 (a B1 variant); this distance represented the maximum diversity between lineage A and B variants (Table S2 and Figure 4).

Genomic diversity of HPV59 variants
Forty-five HPV59 isolates had the URR/E6 regions sequenced; six samples encompassing each unique variation pattern were selected for complete genome sequencing (Table 1).In total, 8 complete genomes, including the prototype [50] and one isolate from a Chinese woman [40] were analyzed.There were 169/7898 (2.1%) variable nucleotide positions and 66/2441 (2.7%) variable amino acids (Table S2 and Figure S1D).The maximum pairwise nucleotide difference was 1.3%, which was observed between isolates LZod68 and Qv33993  S2); these divergent isolates were part of HPV59 sublineage A1 and lineage B, respectively, as inferred from the phylogenetic analysis (Figure 5).Isolates within lineage A were more variable than those from lineage B and were further classified into sublineages A1, A2 and A3, with mean differences of 0.60%-0.89%(Table S3).HPV59 lineages A and B had mean differences of 1.11%-1.26%and can be distinguished by 53 lineage-specific SNPs across the genome (Figure S2).
Inspection of the distribution of the forty-five HPV59 variants from 4 geographic regions indicated that 61% (20/33) of variants from Costa Rica were A2, whereas 67% (4/6) and 60% (3/5) of samples from Rwanda and Burkina Faso were of the B lineage, respectively (data not shown).Thus, the B lineage was more common in samples from Africa, but the numbers were limited and the difference was not statistically significant (p = 0.11).

Genomic diversity of HPV70 variants
Amplification and sequencing the URR and/or E6 regions of 295 samples containing HPV70 clustered isolates into two distinct groups from which 8 representative isolates were selected for complete genome analyses, in addition to the reference sequence [52] (Table 1).In total, 192 nucleotide variable positions were observed across the 7922 bp genome (2.4%) and 67 of 2413 (2.8%) encoded amino acids were variable (Table S2).The maximum pairwise nucleotide difference was 1.6% between Qv27211 and Qv17574 (Table S2).Within the HPV70 genomes, we detected indels within the NCR2 and URR regions, and a 6-bp insertion (ACTGTA) between nt 6123 and 6124 within the L1 ORF resulting in an insertion of threonine and valine, between L1 aa position 178 and 179 (Figure S1F).

Figure 5. HPV59 variant tree topology and pairwise comparisons of individual complete genomes.
The phylogenetic tree was constructed as described in Figure 2. Distinct variant lineages and sublineages were determined as described in Figure 2. The percent nucleotide sequence differences are shown in the panel to the right of the phylogenetic tree as described in Figure 2.

Alpha-7 HPV evolution
To investigate the relationship between isolates from the alpha-7 species group, we took one representative genome from each lineage or sublineage, constructed a phylogeny and plotted the percent genome differences from global alignments (Figure 8).Four clades were apparent from this analysis-HPV18/45/97, HPV59, HPV85 and HPV39/68/70 corresponding to the 4 deep-seated branches leading to these clades.Each clade was approximately 30% different from isolates within other clades, whereas the clades with more than one type (i.e., HPV18/45/97 and HPV39/68/70) consistently shared at least 80% of nucleotides.The intratype differences are displayed as a heatmap (Figure S3).

Alpha-7 HPV genomic diversity
Rarefaction curves of parsim-informative SNPs (sites detected in ≥ 2 samples) for combined genomes of each type were plotted to estimate the coverage of type specific variations within the sampled cohorts and genomes analyzed (Figure S4A).Although sampling of genomes within the targeted populations may increase the repertoire of genomic variability, it is unlikely to reveal novel variant lineages since the curves flatten out with increasing numbers of genomes already sequenced.Interestingly, the detection of singleton SNPs (variations present only once in the sampled genomes) increases almost linearly with increasing sample size of each type (Figure S4B).Rare variations add information on mutations that potentially occurred during recent epochs of explosive population growth or natural selection and may be identified from sequencing larger sample sizes [53].As sequencing larger numbers of isolates becomes more economically feasible with Next-Gen technologies, false positive SNPs within the genomes of additional variants remains unavoidable due to errors introduced by PCR amplification, sequencing technologies, software analyses and other unknown causes of DNA sequence error.
The pairwise inter-lineage mean differences inferred from the genome sequences revealed maximum genomic diversity of HPV68 and HPV18, followed by HPV70, HPV45, HPV59 and HPV39.When each ORF/region was compared, the noncoding Figure 6.HPV68 variant tree topology and pairwise comparisons of individual complete genomes.The phylogenetic tree was constructed as described in Figure 2. Distinct variant lineages and sublineages were determined as described in Figure 2. The percent nucleotide sequence differences are shown in the panel to the right of the phylogenetic tree as described in Figure 2. doi: 10.1371/journal.pone.0072565.g006regions (NCR1, NCR2 and URR) most often showed the largest variability, followed by the E5 and E4/E2 overlapping ORFs (Table S2).The diversity of the E6, E7, E1, E2, L2 and L1 ORFs varied by type; nevertheless, E1 and L1 ORFs were more conserved in terms of the overall nucleotide diversity compared to the other ORFs (Table S2).Indel events were observed within both coding and non-coding regions, resulting in variability in length of specific ORFs/regions among variants of the same types (Figure S1 and Figure S5).

Discussion
Alpha-7 (HPV18, 39, 45, 59, 68, 70, 85 and 97) is the second most important HPV species group with regard to carcinogenicity.In an effort to investigate and understand the genetic basis of HPV pathogenicity, we have sought to establish the genetic heterogeneity of PVs derived from the common ancestor to all HPV-related oncogenic types [17,18,[54][55][56].To this end, we describe in this manuscript the characterization of a set of full viral genomes from the alpha-7 species either sequenced in our lab or obtained from GenBank.An overview of the tree topology (Figure 1) indicates that each type has a very different evolutionary history.This is further supported by the different diversities of isolates from each type (e.g., 1.1% for HPV39 and 6.7% for HPV68) and surprising varied geographic distribution of lineages and sublineages.Nevertheless, there are some common features that indicate residual punctuated evolution.These include 3 levels of heterogeneity, suggesting at least three evolutionary events including -(i) the rapid expansion of the host population resulting in viral variant lineages and sublineages predominantly in the < 3% range corresponding to events occurring approximately 0.2-1.0 million years ago (MYA) as previously reported [18]; (ii) a subspecies event accounting for the difference between types in the HPV18/45/97 and HPV39/68/70 clades of approximately 18% corresponding to events occurring approximately 5-10 MYA [18], and (iii) the speciation of the alpha-7 most recent common ancestor (MRCA) that resulted in the approximately 30% difference between members of the different alpha-7 clades corresponding to events occurring approximately 10-20 MYA [18].Each alpha-PV species group has its own unique properties of evolutionary topology, diversity and geographic distribution of variant lineages.For comparison, the alpha-9 HPV clades differ by approximately 27%, whereas when all species groups within the alpha-HPV taxon are considered, they differ by approximately 40% ± 5%.However, since we have sampled from extant circulating alpha-7 HPVs and although we have tried to capture maximum viral heterogeneity, it is possible that additional lineages and sublineages will be discovered.Nevertheless, we do not believe that discovery of additional alpha-7 genomes will impact the basic observations derived from the currently available genomes.
This work reports the analysis of a large set of alpha-7 complete genomes from the additional sequencing of 109 variants from 1147 alpha-7 HPV isolates plus 19 complete genomes available from GenBank.These genomes were sampled from > 12,000 women in the Americas, Africa and Asia that were tested for HPV infection and subsequently selected based on the analysis of the URR/E6 regions to identify samples representing the major variant lineages and the maximum genomic diversity.We used a complete genome nucleotide sequence difference of approximately 1.0%-10.0%between two or more variants of sample type to define distinct variant lineages.Similarly, differences across the complete genome of 0.5%-1.0%were used to distinguish sublineages.In contrast to genera, species and type definitions based on the L1 ORF nucleotide sequence, complete genome alignments are used to classify variant lineages, since the recently evolved variant genomes have changes that are not evenly distributed throughout the genome.
Similar to HPV16-related alpha-9 species variants, isolates of HPV18, 39, 45, 59, 68, 70 form at least two deeply separated clades implying codivergence of archaic Hominid and HPV variants.The maximum differences between lineages within a given type were variable (1.1%-6.7%);this most likely reflects different divergence times when isolates of a specific type split from their most recent common ancestor (MRCA).Alternatively, uncharacterized lineage or sublineage variants might exist in an isolated and/or unsampled population or could have disappeared by genetic isolation and/or host demise.Nevertheless, discrete events in human evolution are captured in the patterns of HPV variability.
The evolution and pathogenesis of HPV18 and HPV45 variants have been previously reported [14,18].However, the epidemiological and pathological characteristics of other alpha-7 variant lineages warrant further study.Natural variation exists within all populations of organisms.This occurs partly because random mutations cause changes in the genome of an individual organism, a significant portion of genetic variation is functionally neutral in that it produces no phenotypic effect or significant difference in fitness, while some variations or variation patterns affect a phenotype and could be the result of niche adaptation or a function of fitness.Given the low mutation rate of papillomaviruses [57], natural selection remains an important mechanism for adaption of this slowly evolving virus.However, man-made selection could be a potential force driving the emergence of new SNPs or variant lineages with the introduction of VLP vaccines.Based on what we know about the evolution of the HPV genome, it is unlikely that HPV will show significant change in response to the vaccine, since there is no known mechanism for the virus to develop an increased rate of mutation as it uses the host replication machinery.Nevertheless, empirical data will be needed to assess the underlying evolution of HPV variants in the face of immune pressure generated by the vaccines; whether we will observe the emergence of new variants or variant lineages, or the extinction of particular lineages or types over time remains to be determined.
HPV genome is indicated outside the double-stranded circle.Lengths of each ORF and region are indicated by the histogram pointing to the region/ORF in the figure.The length in nucleotide sequences (bp) for each alpha-7 HPV genome is indicated with the minimal and maximal lengths represented by the bars with dots or highlighted in grey, respectively.The diagram of the HPV genome is not drawn to scale and the histogram for each ORF/region is presented in a different range of values.(PDF)

Figure 3 .
Figure 3. HPV39 variant tree topology and pairwise comparisons of individual complete genomes.The phylogenetic tree was constructed as described in Figure 2. Distinct variant lineages and sublineages were determined as described in Figure 2. The percent nucleotide sequence differences are shown in the panel to the right of the phylogenetic tree as described in Figure 2. doi: 10.1371/journal.pone.0072565.g003

Figure 4 .
Figure 4. HPV45 variant tree topology and pairwise comparisons of individual complete genomes.The phylogenetic tree was constructed as described in Figure 2. Distinct variant lineages and sublineages were determined as described in Figure 2. The percent nucleotide sequence differences are shown in the panel to the right of the phylogenetic tree as described in Figure 2. doi: 10.1371/journal.pone.0072565.g004 doi: 10.1371/journal.pone.0072565.g005

Figure 7 .
Figure 7. HPV70 variant tree topology and pairwise comparisons of individual complete genomes.The phylogenetic tree was constructed as described in Figure 2. Distinct variant lineages and sublineages were determined as described in Figure 2. The percent nucleotide sequence differences are shown in the panel to the right of the phylogenetic tree as described in Figure 2. doi: 10.1371/journal.pone.0072565.g007

Figure 8 .
Figure 8. Phylogenetic tree and pairwise comparisons of representative Alpha-7 lineages and sublineage complete genomes.The phylogenetic tree was constructed as described in Figure 2. Representative genomes for each variant lineage and sublineage described in this report are used and the names of the isolates are shown to the right of the tree branches.The percent nucleotide sequence differences are shown in the panel to the right of the phylogenetic tree as described in Figure 2. doi: 10.1371/journal.pone.0072565.g008 Burkina Faso -The Yerelon Cohort Research Programme was approved by the Ethical Committee of the Centre Muraz, Bobo Dioulasso, the National Ethical Committee of the Ministry of Health, Burkina Faso, and the Ethics Committee of the London School of Hygiene and Tropical Medicine, Zambia -The study was approved by the Research Ethics Committee of the University of Zambia and the Institutional Review Board of the University of Alabama at Birmingham, Thailand -The study protocols were reviewed and approved by the committees on human subject research at Johns Hopkins Bloomberg School of Public Health, Baltimore, MD; Merck & Co., Inc., West Point, PA, each participating recruitment site, and the Institutional Review Board of the Thailand Ministry of Health (MOH), Thailand.Taiwan -The study was approved by the Institution Review Board of the National Taiwan University University College of Public Health.Guanacaste, Costa Rica -The study and informed consent forms were approved by Institutional Review Boards of Costa Rica and the U.S. National Cancer Institute.

Table 1 .
Summary of HPV isolates, genome sizes, variability and variant lineages.

HPV type Isolates tested a Genomes analyzed b Genome size (nucleotide) c Mean GC content (%) Number of CpG site d Variable nt positions e Number of aa codons f Variable aa positions g Variant lineage / sublineage
a Number of isolates characterized by sequencing the URR ± E6 region; b Number of complete genomes analyzed (including sequenced for this report, the prototype and other complete genomes available in NCBI/GenBank), see TableS1for complete list with accession numbers; c Number of nucleotides within the reference genome based on one genome size for each HPV type calculated from the global sequence alignments (see Materials and Methods).Minimum and maximum lengths of sequenced genomes for each type are shown and differ from the presence of insertions and deletions (indels); d Number of CpG sites is the cumulative number in each alignment;