Table 1.
Start and end nucleotide positions (genomic coordinates) for early (E1–E7) and late (L1–L2) genes for each HPV genotype analyzed in this study. Gene boundaries were obtained from reference genomes curated in the Papillomavirus Episteme (PaVE) database and were used to map gene regions onto multiple sequence alignments for entropy-based variability analyses. “N.A.” indicates that the gene was not annotated or not present in the corresponding reference genome.
Fig 1.
Shannon entropy (H) distributions across genome positions for each HPV genotype.
Most sites cluster at low H (high intragenotype conservation), with sparse right-tail bins indicating variable regions. These profiles provide a baseline to identify conserved genomic regions and variable windows that may be informative for genotype differentiation.
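As a minimal sketch of how a per-position entropy profile of this kind could be computed, the Python snippet below evaluates a normalized Shannon entropy for each alignment column. The normalization (dividing by log2 of a four-letter nucleotide alphabet so that H falls in 0–1), the exclusion of gaps and ambiguous bases, and the function names are assumptions made for illustration; they are not taken from the study's pipeline.

```python
import math
from collections import Counter

def column_entropy(column, alphabet="ACGT"):
    """Normalized Shannon entropy (0-1) of one alignment column.

    Assumption: gaps and ambiguous symbols are ignored, and H is divided
    by log2(len(alphabet)) so that a uniform column scores 1.
    """
    counts = Counter(base for base in column.upper() if base in alphabet)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: a gap-only column is treated as H = 0
    # H = sum p * log2(1/p), written this way to avoid a negative zero
    h = sum((n / total) * math.log2(total / n) for n in counts.values())
    return h / math.log2(len(alphabet))

def alignment_entropy(sequences):
    """Entropy of every column of an alignment given as equal-length strings."""
    return [column_entropy("".join(col)) for col in zip(*sequences)]

# Toy alignment: first column fully conserved, second column maximally variable
print(alignment_entropy(["AA", "AC", "AG", "AT"]))  # [0.0, 1.0]
```

A histogram of the returned values per genotype would give a distribution of the kind described in this caption.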
Fig 2.
Heatmap of mean Shannon entropy (Mean H; 0–1) by gene (E2, E4, E5, E6, E7, L1, L2) and HPV genotype.
Each cell shows the average H computed over all positions annotated for that gene within the genotype. Color scale: low values (purple) = higher conservation; high values (green/yellow) = greater variability. White cells indicate missing data/lack of gene annotation.
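The per-gene averaging behind each heatmap cell can be sketched as follows. This is an illustrative Python example under stated assumptions: the entropy profile is indexed by reference genome position, gene boundaries are 1-based inclusive (start, end) pairs, genes do not wrap around the circular genome origin, and an unannotated gene ("N.A.") is represented as `None`. The coordinates and entropy values below are toy numbers, not values from Table 1.

```python
from statistics import mean

def mean_entropy_by_gene(entropy, gene_bounds):
    """Average H over the positions annotated for each gene.

    entropy     : list of per-position H values (index 0 = position 1)
    gene_bounds : dict gene -> (start, end), 1-based inclusive, or None
                  when the gene is not annotated (hypothetical encoding)
    """
    result = {}
    for gene, bounds in gene_bounds.items():
        if bounds is None:
            result[gene] = None  # would be drawn as a white cell
            continue
        start, end = bounds
        result[gene] = mean(entropy[start - 1:end])
    return result

# Toy data for illustration only
toy_entropy = [0.0, 0.2, 0.4, 1.0, 0.0, 0.0]
toy_bounds = {"E6": (1, 3), "E7": (4, 6), "E5": None}
print(mean_entropy_by_gene(toy_entropy, toy_bounds))
```

Repeating this for every genotype yields one row or column of the gene-by-genotype matrix.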
Fig 3.
Intergenotype metrics per gene based on Shannon entropy.
Rows = genes (E2, E4, E5, E6, E7, L1, L2); columns = metrics: Mean H (mean entropy across all annotated positions of the gene across genotypes); Median H (median entropy); IQR H (interquartile range, Q3 − Q1), which quantifies the central dispersion of H and is robust to extreme values; % Conserved (percentage of positions with H = 0); and % High (percentage of positions with H > 0.5). The color scale is normalized by column (0–1), whereas the numerical labels show the raw (unnormalized) values. In this summary, high Mean/Median H values together with high % High and low % Conserved indicate genes with intergenotype divergence, while low Mean/Median H values and high % Conserved identify genes that are relatively conserved across genotypes.
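The summary metrics defined in this caption can be illustrated with a short Python sketch. The thresholds (H = 0 for conserved, H > 0.5 for high) follow the caption; the quartile method (`method="inclusive"`), the min-max rule used for column normalization, and all function names are assumptions for illustration, since the caption does not specify them.

```python
from statistics import mean, median, quantiles

def gene_metrics(h_values, high_cutoff=0.5):
    """Mean H, Median H, IQR H, % Conserved and % High for one gene."""
    # Assumption: quartiles by linear interpolation over the observed range
    q1, _, q3 = quantiles(h_values, n=4, method="inclusive")
    n = len(h_values)
    return {
        "mean_h": mean(h_values),
        "median_h": median(h_values),
        "iqr_h": q3 - q1,
        "pct_conserved": 100 * sum(h == 0 for h in h_values) / n,
        "pct_high": 100 * sum(h > high_cutoff for h in h_values) / n,
    }

def normalize_columns(table):
    """Min-max scale each metric across genes to 0-1 (for coloring only).

    Assumption: "normalized by column" is read here as min-max scaling.
    """
    metric_names = list(next(iter(table.values())).keys())
    scaled = {gene: {} for gene in table}
    for name in metric_names:
        values = [row[name] for row in table.values()]
        lo, hi = min(values), max(values)
        for gene, row in table.items():
            scaled[gene][name] = 0.0 if hi == lo else (row[name] - lo) / (hi - lo)
    return scaled

# Toy entropy values for two genes (illustrative, not study data)
raw = {
    "E6": gene_metrics([0.0, 0.0, 0.2, 0.6, 1.0]),
    "L1": gene_metrics([0.0, 0.0, 0.0, 0.0, 0.2]),
}
colors = normalize_columns(raw)  # drives the color scale; labels use `raw`
```

Keeping the raw and scaled tables separate mirrors the figure, where colors are comparable within a column while the printed numbers remain in their original units.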
Fig 4.
Intergenotype variability per gene in the consensus MSA.
(A) Percentage of positions by Shannon entropy category (Conserved: H = 0; Intermediate: 0.1–0.5; High: H > 0.5) for each gene (E2, E4, E5, E6, E7, L1, L2). A higher proportion of “High” indicates greater divergence between genotypes. (B) Mean entropy (0–1) per gene, with points labeled by their corresponding values. Early genes (E5, E6, and E7) show the highest means and therefore greater variability, whereas capsid genes (L1 and L2) display the lowest mean entropy values, indicating greater relative conservation.
Fig 5.
Maximum likelihood tree inferred from the alignment of consensus sequences of the HPV genotypes analyzed, representing their overall sequence divergence.
The genotypes in the tree were grouped by hierarchical clustering (Ward.D2) into four main clusters (colors) to contextualize intergenotypic variability. Cluster 1 (Green): HPV 156 genotype, showing the greatest basal divergence. Cluster 3 (Blue): the densest and most recently diverged group, comprising closely related oncogenic genotypes (HPV 16, 31, 33, 35, 52, 58, 67, 73). Clusters 2 (Orange) and 4 (Pink): intermediate groups containing the remaining oncogenic genotypes. The phylogenetic structure supports the need for highly specific differential markers, given the evolutionary proximity between high-risk genotypes.