Distinctive variation in the U3R region of the 5' Long Terminal Repeat from diverse HIV-1 strains

Functional mapping of the 5’LTR has shown that the U3 and the R regions (U3R) contain a cluster of regulatory elements involved in the control of HIV-1 transcription and expression. As the HIV-1 genome is characterized by extensive variability, here we aimed to describe mutations in the U3R from various HIV-1 clades and CRFs in order to highlight strain specific differences that may impact the biological properties of diverse HIV-1 strains. To achieve our purpose, the U3R sequence of plasma derived virus belonging to different clades (A1, B, C, D, F2) and recombinants (CRF02_AG, CRF01_AE and CRF22_01A1) was obtained using Illumina technology. Overall, the R region was very well conserved among and across different strains, while in the U3 region the average inter-strains nucleotide dissimilarity was up to 25%. The TAR hairpin displayed a strain-distinctive cluster of mutations affecting the bulge and the loop, but mostly the stem. Like in previous studies we found a TATAA motif in U3 promoter region from the majority of HIV-1 strains and a TAAAA motif in CRF01_AE; but also in LTRs from CRF22_01A1 isolates. Although LTRs from CRF22_01A1 specimens were assigned CRF01_AE, they contained two NF-kB sites instead of the single TFBS described in CRF01_AE. Also, as previously describe in clade C isolates, we found no C/EBP binding site directly upstream of the enhancer region in CRF22_01A1 specimens. In our study, one-third of CRF02_AG LTRs displayed three NF-kB sites which have been mainly described in clade C isolates. Overall, the number, location and binding patterns of potential regulatory elements found along the U3R might be specific to some HIV-1 strains such as clade F2, CRF02_AG, CRF01_AE and CRF22_01A1. These features may be worth consideration as they may be involved in distinctive regulation of HIV-1 transcription and replication by different and diverse infecting strains.


Introduction
The high diversity of HIV-1 has led to its classification into four groups named M, N, O and P. Group M which is the most prevalent worldwide has been further subdivided into at least nine genetically distinct subtypes or clades (A-D, F-H, J, and K), several circulating recombinant forms (CRFs) and unique recombinant forms (URFs) [1]. Several studies have reported the impact of HIV-1 strains on transmission, replication, pathogenesis, diagnosis and response to therapy [2,3]. Thus, subtype A was shown to be less pathogenic than non-A subtypes and to have a lower replication rate than subtype C [4,5]. The heterosexual transmission rate of HIV-1 appeared to be higher in subtype A compared with D, and in CRF01_AE compared with subtype B [6,7]. Subtype D has been associated with faster disease progression compared with subtype A, C and CRFs [8][9][10]. The overall influence of inter-and intra-clade genetic variability on the biological properties of HIV-1 is not fully understood. However, there is increasing evidence that the genetic diversity of the LTR has the potential to affect the replication rate and expression of HIV-1 [11][12][13][14]. For example, it was shown that a single mutation changing one of the two canonical targets for NF-κB into a binding site for GABP transcription factor, results in higher replication rate and transmission efficiency of CRF01_AE compared with subtype B [15]. Also, mutations in the sequence motifs targeted by the CCAAT/enhancer-binding proteins (C/EBP) and specificity proteins 1 (Sp1) were correlated with disease severity and neurologic impairment [16,17].
During the life cycle of HIV-1, transcriptional regulation of the integrated provirus is a key step ahead of viral replication and expression. This early step is regulated by a synergistic interaction of host transcription factors and viral proteins that bind to specific targets in the 5'LTR of the HIV-1 genome. The HIV-1 5'LTR is a complex structure of approximately 640 bp in length with a high concentration of transcription factor binding sites (TFBS). It has been divided into the U3, R and U5 functional regions. The U3 region located upstream of the transcription start site was subdivided into the modulatory (nt -454 to -104), the enhancer (nt -105 to -79) and the promoter (nt -78 to -1) segments [13,18]. The modulatory segment contains numerous TFBS [19], and in certain HIV-1 variants, this segment is extended with an insert of 15 to 34 nucleotides known as the Most Frequent Naturally occurring Length Polymorphism (MFNLP) [20][21][22]. The enhancer region contains binding sites for the NF-kB family of transcription factors, and the promoter region carries the TATA box and three binding sites for Sp1 [18]. Following the promoter, the R region which is, by definition, the transcription initiation site comprises the trans-activating responsive (TAR) element that mediates activation of transcription through its binding to the viral Tat protein [23,24], and the polyadenylation signal (poly A) involved in the addition of the 3'-poly (A) tail [25]. Along with these elements, various binding sites for different families of transcription factors have also been described in the R region [26][27][28][29]. Interaction between the TAR element and the viral protein Tat is required to enhance transcriptional elongation and gene expression of HIV-1 [30,31]. However, prior to the presence of Tat, the 5'LTR is able to support basal transcription that results in the production of short transcripts allowing viral proteins such as Tat to be made [31]. This capacity of the 5'LTR to control HIV-1 basal transcription is driven by TFBS in the U3 promoter and enhancer segments primarily [31], but also involves cooperative action of regulatory elements in the U3 modulatory segment [32]. Besides its role in transcription, the U3 and the R region of the 5'LTR (U3R) have also been involved in HIV-1 replication, expression and silencing [33][34][35][36][37][38]. Therefore, characterization of the U3R from different HIV-1 strains is of great importance as it might contribute to a better understanding of the influence of HIV-1 variants on the fitness, pathogenicity and disease progression. To date, only few studies have investigated the genetic diversity the HIV-1 5'LTR in non-B and non-C strains and in most of them the common approach was the clone-based Sanger sequencing methodology [39][40][41]. This method is costly, time-consuming and labor intensive. Also, it may be limited by preferential selection that can occur during molecular cloning and only a few hundred clones can be generated and sequenced [42,43]. Next-generation sequencing (NGS) technology is a relevant alternative to overcome these limitations by allowing massively parallel amplification at high coverage of thousands of reads per base pair. Thus, NGS provides the potential to reduce the time and facilitate the detection of intra-specimen variability without the need for cloning PCR amplicons before sequencing. The main objective of this study is to describe the genetic variability of the U3R from viruses representing various clades and diverse CRFs isolated from plasmas samples, by using NGS Illumina sequencing technology.

Specimens and RNA isolation
Sixty five high titer virus derived from de-identified plasma specimens obtained from DUKE EQAPOL viral diversity program [44], and five CRF02_AG virus isolated from stored and deidentified plasma specimens [45-47] as previously described [48] were selected for this study. The study was approved by the Duke University Institutional Review Board for clinical investigation (Pro0029507) and the Review Board of the US HHS/Food and Drug Administration Research in Human Subjects Committee (exempt reference number 01-044B). The selected seventy virus stocks included clade B (n = 18), clade A1 (n = 10), clade C (n = 9), clade D (n = 8), clade F2 (n = 3), clade G (n = 4), CRF02_AG (n = 9), CRF01_AE (n = 5), CRF22_01A1 (n = 4). Viral RNA was extracted using Virus QIAamp Viral RNA Mini Kit (QIAGEN) according to the manufacturers' instructions.

Primer design
References and consensus sequences representing the first 1000 nucleotides of the HIV-1 genome from multiple subtypes and CRFs were downloaded from Los Alamos National Laboratory HIV database and aligned in MEGA version 7 software [49]. Regions with the highest degree of conservation were screened and used to design PCR primers that were optimized for melting temperatures, GC content and reduced hairpin and/or primer dimer formation (NCBI/primer3-BLAST; SMS). Due to mismatches in some strains, selected primers were designed with degenerate nucleotides to match all group M isolates.

cDNA synthesis, PCR amplification and purification
Reverse transcription was performed by using Superscript III First-strand synthesis System for RT-PCR (Invitrogen 18080-051). The cDNA was amplified by touchdown (TD) polymerase chain reaction (PCR) procedure with 2 rounds of amplification. Each amplification run was performed in a 50μl reaction volume using Promega PCR master mix (Promega, Madison, WI, USA cat M7505) supplemented with 0.4μM of primer Table 1. Cycling conditions for both outer and nested TD-PCR was as follows: initial incubation at 95˚C for 3 min followed by 10 cycles at 95˚C for 30 s, 64˚C for 30 s with a 1˚C decrement per cycle, and 72˚C for 50 s. Subsequently, 25 cycles were performed at 95˚C for 30 s, 55˚C for 30 s, and 72˚C for 50 s followed by a final extension of 5 min at 72˚C. The second round PCR amplicons were separated by gel electrophoresis. The expected PCR product size of about 850 pb were excised from the gel and purified using QIAquick Gel Extraction Kit (Qiagen).

Ultra-deep sequencing
Miseq (Illumina, San Diego, CA) sequencing of the gel purified PCR product was carried out at the core facility of the Food and Drug Administration (FDA, White oak, MD, USA). Briefly, concentration of purified amplicons was measured by using Qubit dsDNA BR Assay System (Covaris, Woburn, MA, USA) and (2 ng of DNA purified product was processed for NGS library preparation using the Nextera XT DNA Sample Preparation Kit. Specimens were run in the Miseq instrument using a MiSeq v2 kit (500 cycles) to produce paired-end reads of approximately 250pb. After automated cluster generation in MiSeq, the sequencing was processed and fastq files were generated.

Sequence analysis
All fastq files were imported into CLC Genomics Workbench version 9.0 (CLC bio/Qiagen, Aarhus, Denmark) using default option. Reads were paired and trimmed to obtain reads with an average Phred score ! 20. Contigs were generated using de novo assembly tools with the default options. The reads were mapped back to contigs using previous published parameters [50]. For characterization of the 5'LTR region, the minimum contig length was set at 400bp. For each specimen, contigs with a minimum coverage of 1000 reads were exported from the assembly and used for further analysis. The subtype of each full length LTR contig was assigned using COMET HIV-1 subtype tool [51]. Sequences from the same subtypes were aligned in MEGA 7 using the 5'end of HXB2 sequence (from nucleotide 76 to 850, K03455) as a reference to position the U3R sequence and to locate regulatory motifs. Mutations were identified by comparing sequences from each HIV-1 studied strain to the reference HXB2 and within matched subtype. Identification of potential TFBS was based on similarity with a reported binding site pattern and prediction with the Match 1.0 Public program available online (http:// gene-regulation.com/pub/programs.html). A pairwise comparison in CLC Genomics Workbench was used to calculate the percentage of identity.

Variability of the U3R region of the 5'LTR
Compared to HXB2, the U3 region was the most conserved in clade B with about 88% nucleotide identity versus less than 80% in average for all the other strains. The variability of the U3 region was less than 12% between sequences from the same strains and increased to an average of 23% between sequences from different strains. For most of our studied strains, the highest variability was found in the U3 modulatory segment with less than 75% of nucleotide identity. Nevertheless, the enhancer segment of clade C 5'LTRs share less than 65% of identity with the other strains. Although the number of mutations were higher in the modulatory segment than in the other segments of the U3 region, it also appears to be well conserved within sequences from the same HIV-1 strain (>85% of nucleotides identity). In the enhancer segment the intra-strain diversity varied from 2% (CRF01_AE) to 26% (CRF02_AG), while it was about 6% in the promoter segment. With an average nucleotide identity of 85%, this latter segment of the U3 region displayed few changes when sequences from different strains were compared.

Variability of the U3 region
U3 modulatory spanning nucleotides -378 to -176. Numerous transcription factors have been described in this portion of the U3 modulatory segment [13]. Nucleotide sequence from -378 to -364 which encompasses targets for proteins related to the NF-1 and CREB/ATF family of transcription factors [53, 54], include the CTGATTGGC motif (nt -378 to -370) that was conserved in 56% of the sequences analyzed. Its variant CAGATTGGC was mainly observed in the CRFs as well as in clades G and D specimens, while CAGACTGGC was shown in 1/3 of the CRF02_AG specimens ( Fig 1A). The adjacent sequence (nt -369 to -364) with the consensus motif AGAA(C/T)T was conserved in 55% of the sequences analyzed, while other variants such as ACAACT were mainly found in CRF01_AE and CRF22_01A1 viruses ( Fig 1A). TFBS such as COUP-TF, AP-1, ETS-1 and GATA have been reported within the sequence spanning nt -356 to -325 [55-57]. In our dataset, the first half of the binding site for COUP/AP-1 (AG GGCCA; nt -356 to -350) was highly conserved (80%) with few variants such as the consensus G(G/C)GACCA found in clade C LTRs ( Fig 1A). On the contrary, the second half of this TFBS (nt -349 to -343) which includes a functional target for AP-1 family of proteins (AP-1 (II)), was more variable (Fig 1A). AP1 (II) overlaps a GATA binding site (nt -343 to -338) with the motifs AGATT(T/C) mostly found in F2, CRF12_BF, G and CRF02_AG LTRs, and AGATA(T/C) predominant in all the other studied strains. Further downstream (nt -333 to -327), the TGACCTT consensus for the reported AP-1(I), was found in all the strains studied except in CRF01_AE, CRF22_01A 1 and in clade A1 viruses. Indeed, the AP-1 (I) variant TATGTTT was shared by 80% of CRF01_AE and 40% of CRF22_01A1 specimens. In the remaining sequence from CRF22_01A1 viruses, the consensus sequence TGTG(T/C)TT was found. In clade A1 LTRs, the variant TAACATT generates with its 3'end overlapping nucleotides a CATTTG motif (nt -330 to -325), that matches the E-box consensus CANNTG (Fig 1A).
Contiguously, the GATA-like binding site GGATGG (nt -326 to -319) [58], was mutated to GGGTGG in CRF02_AG LTRs and in some sequences from various other strains (Fig 1A). The consensus TGCTACAAG(C/T) (nt -320 to -310) of the first half of the binding site for the negative regulator of transcription (NRT-1)[59], was mainly conserved in CRF22_01A1 (80%)  isolates. It was changed to a TGCTTCAAG(C/T) consensus in a majority of non -CRF02_AG, non-clade G and D specimens, and to TG(C/T)TTCAAAC variant in clade G and CRF02_AG LTRs. The second half of the NRT-1 was reported to include an E-box/c-Myb and AP-1 sites [60]. In our study, the E-box consensus (CANNTG, nt -303 to -298) was not found in clade G and CRF02_AG LTRs. It was replaced by the non-specific variant CA(A/G)TGG also shared with others strains such as CRF22_01A1 isolates ( Fig 1A). This latter variant also modified the overlapping AP-1 motif (nt -299 to -293). The contiguous GATA target [60, 61], displayed two major variants in our study (Fig 1A), GGTAGA (45%) and AGTAGA (26%) that were closely related but not identical to the canonical (A/T)GATA(A/G) binding site for GATA factors. Nucleotides spanning the AP1-1 and GATA binding sites (nt -296 to -283) lead to a consensus   with a fairly high variability in LTRs from clades B and A as well as CRF02_AG specimens, but with some specific patterns in the other strains analyzed. The HXB2 sequence GCCAGAGAA GTTAG was mutated to CCCAAGGGAAGTAG in clade C, to T(G/C)CAAAGGAGG(C/T) AG in clade D, and to TCCAAGAGAAGTAG in CRF22_01A1 isolates. These variants probably contain specific TFBS, as observed with the clade D included motif AAGGAG that matches a recognition site for Ets-2 [62].
Although it displayed one nucleotide mismatch with the (C/A)TTNCNN(C/A)A consensus described as the best fit for the binding of the C/EBP family of transcription factors [72][73][74], the sequence CTTGTTACA in HXB2 (nt -254 to -246) was reported as a C/EBPβ target (US3). In our study, the US3 consensus (CTTNTNNCA) was found in at least 60% of clades B, D and G assigned LTRs, and the variant TTTNCNNCA mainly found in clade C LTR was also shown to be able to bind C/EBP proteins [74][75][76]. In most of the other studied strains, variants of the US3 generally included more than one nucleotide mismatches with the C/EBP consensus motif (Fig 1C). The adjoining sequence (nt -244 to -234) seemed to include various overlapping TFBS with some strain specific motifs ( Fig 1C). Thus, at the relative location of the GATA-like motif CCTAT(G/A) (nt -244 to -239) that was found in non-clade G and non-CRFs isolates, the CCCATG variant was shown in LTRs subtyped as CRF01_AE only, while CCCATC was found in clade G and CRF02_AG only. Subsequently, we observed a CREB/ATF/AP-1-like binding site in several studied isolates, except for clade G and CRF02_AG. In these latter specimens, we found an E-box with the sequence motif CATCTG. Majority of sequences subtyped as F2 (60%), CRF12_BF (80%) as well as some clade B (25%) displayed a TATAA motif within nucleotides spanning this location (Fig 1C). Further downstream, the nucleotide sequence includes a potential Ets/GATA-like element (nt-232 to -222) and other various potential TFBS such as TGACCC(A/G) in clade B or TGA(G/T)GA(A/G) in clade A1 which resemble were a CREB/ATF/AP-1 target (nt -221 to -212).
The sequence GAGAGAGAAGTG in HXB2 (nt -216 to -203), reported to encompass a second NFAT footprint in this segment of the U3 region [65, 68], was mainly conserved in clade D isolates. In our dataset, this NFAT footprint displayed strain specific mutations generating motif that might potentially be recognized by other transcription factor (Fig 1C). For example, the GAGAAAGAAGT(G/A) consensus was mainly found in clade B LTRs (45%) include C/EBPlike motif, while the GA(C/T)A(G/A)A(G/A)ARGTG variant predominant in CRF02_AG isolates includes a GATA-like motif. The highest variability of this NFAT footprint was found in clade A1 isolates in which we noticed that 50% of the variants contain a GAAACAT motif (nt -208to -202) closely related to the LEF-1/TCF recognition site [64]. The adjacent motif TAGAGTGGA (nt -201 to -193) displayed some specific motifs in most of the HIV-1 non-B strains with some the variant containing an AP-1/ATF/CREB-like motif (Fig 1C). The following nucleotide sequence (nt -188 to -182) was characterized by low intra-clade variability and majority of the variants found contain a motif homologous to the AP-1/ATF/CREB family of proteins target (Fig 1C).
U3 modulatory spanning nucleotides -176 to -106. Several consensus binding sites for transcription factors such as C /EBP, USF-1, Ets, LEF-1 and RBF-2 have been described within the nucleotide sequence from position -176 to the end of the modulatory segment [41, 73,[77][78][79]. In this part of the U3 region, the US2 and US1 targets for C/EBP proteins were reported within the HXB2 sequence AGCATTTCATCA (nt -176 to -165) and AGCTTGCTACA (nt -116 to -106) respectively [80]. In our study, the consensus motif (A/G)GC(A/C)TTNC(A/G)NCA found in clades B (95%) and A1 (20%) was the only US2 variant conformed to the canonical binding site for C/EBP proteins. Non-canonical distinct variants were observed among non-B strains and some of them displayed up to three nucleotides mismatches with the consensus (Fig 1D). The consensus (A/G)GCATTNANNCA was found in about 40% of A1 and 45% of CRF02_AG LTRs, while NGCAC(A/G)CAGACA predominant in sequences from clade C (88%) was also found in LTRs from CRF22_01A1 specimens (40%). In our study, the sequence CGCAGACA included in this latter US2 variant was no longer strictly associated to clade C isolates as previously reported [39,81], as it was also found in the emerging CRF22_01A1 recombinant. On the other hand, we observed that the US2 consensus AGCA(C/A)GAAAACA was only displayed by CRF01_AE specimens. In all the clade D and in some CRF02_AG LTRs, the US2 variant (A/G)GCATTTGA (A/G)CA includes the E-box motif CATTTG similar to what we reported earlier in clade A1 viruses (nt-330 to -325); whereas the AGC(C/A)(C/A)TGAGACA consensus in CRF12_BF (80%) and F2 (60%) LTRs, contains a TGAGACA motif closely related to an AP-1/CREB/ATF target.
Overlapping with the US2 target, the nucleotide sequence CACATGGC in HXB 2 (nt -166 to -159) contains the well described E-box consensus motif CAC(G/A)TG recognized by bHLHzip proteins such as USF-1 or TF3 [75,82,83]. This USF-1 consensus was mainly found in clade B LTRs (70%), in some CRF22_01A1 (20%), and in few clade C isolates (13%). At this position, additional E-boxes with non-USF-1 consensus motifs were found in clade C (50%), while none of the variants found in the other studied strains match the CANNTG consensus ( Fig 1D). According to a previous study, these non-CANNTG variant do not interact with a recombinant USF-1 factor in vitro [84]. We noticed that the A5T mutation of the USF-1 variant CACAAGGC displayed by the clade D and few clade B LTRs, generates an Ets-like core motif (AAGGC). The sequence downstream of the USF-1 motif (nt -153 to -146), includes a potential GATA motif with a high variability across our dataset (Fig 1D). In this sequence, we also observed that the consensus AGAAA(C/T)AT predominant in F2 and CRF12_BF LTRs (Fig 1D), contains a potential recognition site for TCF/LEF family of proteins [64].
Contiguously, the consensus motif (C/T)ATCCGGA (nt -148 to -141) binding site for RBF-1/Ets-1 was found to be very well conserved across all our studied HIV-1 strains except for CRF01_AE isolates mainly in which it was changed to a C(G/A)TCC(T/A)GA variant as previously described [14,40,41]. Here, the RBF-1/Ets-1 target overlaps two major consensus motifs CCG(G/C)AGT and CC(T/A)GAGT (mainly CRF01_AE) in 79% and 14% of the analyzed LTRs respectively (Fig 1D). These two consensuses contain a motif homologous to the recognition site the ATF/CREB/AP-1 family of transcription factors. The canonical consensus for LEF-1/TCF factors commonly reported in proximity of the Ets-1 target [82], was found in about 60% of the non-B, non-G and non-CRF01_AE specimens in our study. For this TFBS, a TATAA(A/G)(A/G) variant which includes a TATAA motif was predominant in G (67%) and CRF01_AE (60%) isolates. Nonetheless, this TATAA motif has been shown to be non-functional in CRF01_AE isolates [41]. The highest variability was observed in clade B LTRs where only 25% of the TCF/LEF variants matched the canonical motif. Adjacent to the binding site for TCF/LEF, an insert of 7 nucleotides with the consensus motif ACTGAGA was specifically found in F2 and CRF 12_BF sequence. In this study, we noticed that this insert reported as a partial RBE site motif [41], was closely related to the AP-1-like element involved in the expression of the tumor suppressor protein p53 [85].
Further downstream, the ACTGCTGA motif of the RBEIII site (nt -129 to -122) targeted by the RBF-2 complex [22,77], was particularly well conserved (93%) across HIV-1 clades and CRFs in our study. This RBE III intersects an AP1-like element (nt -124 to -118) that was conserved in only 25% of the clade B LTRs. Besides the most frequent variant TGA CACA predominant in sequences from non-B, D, G and non-CRF01_AE isolates, the AP-1-like sequence displayed some strains distinctive motifs (Fig 1D). It was followed by the MFNLP which in accordance with previous reports [21,22], was displayed by about 33% of the LTRs in our study. Commonly observed in LTRs from CRF22_01A1 specimens as well as in LTRs subtyped as F2, CRF12_BF (80%) and CRF02_AG (67%), this insert was less frequent ( 40%) in clades A1, G, C and D. It was scarcely found in LTRs subtyped as B, and was totally absent in the LTRs from CRF01_AE isolates. The MFNLP includes a duplicate of the RBEIII and/or the AP-1-like motif along with a GATA-like motif in the LTRs subtyped as CRF02_AG, CRF12_BF and F2 or a C/EBP-like motif in LTRs from clade D isolates. At the end of the modulatory segment, the US1 consensus A(A/G)(C/A/T)TTTCTACA found in clades B (70%) and D (80%) LTRs, was the only in our dataset to be conformed to a reported binding site for C/EBP proteins. In all the other analyzed 5'LTR sequences, the following variants were found: (A/G)AGTT(G/A)CTGAC in A1, G, and CRF02_AG, AAGATTCTAA (G/A) in F2 and CRF12_BF, while AAGTTTCTAAC which overlaps a TCF/LEF-like motif (CTA ACTA) was found in CRF01_AE specimens (Fig 1D). In our study, there was no US1 variant in sequences from clade C isolates as previously reported [76], neither in LTRs from CRF22_01A1 specimens.
U3 enhancer spanning nucleotides -104 to -80. The two canonical sequences GGGA CTTTCC for the binding of NF-κB factor in the enhancer segment of the U3 region were very well conserved. Nevertheless, with an overall percentage of similarity of 85%, NF-κB2 (nt -104 to -95) was slightly less conserved than NF-κB1 (nt -90 to -81) which showed a nucleotide similarity of about 91% with HXB2. Besides the consensus A(G/A)GACTTCC specifically found in CRF01_AE specimens, we observed within the NF-κB2 site some point mutations such as C9A in 20% of CRF02_AG, T6C in 40% of CRF12_BF LTRs and C10T in 40% of the LTRs from CRF22_01A1 isolates (Fig 2). The C10T mutation was also the most frequent NF-κB1variant and was observed only in CRF12_BF (80%) and F2 (40%) LTRs. This C10T variant found was shown to have no impact on the affinity and the functionality of the NF-κB factors [86]. In our study, an additional binding site for NF-kB with the consensus motif GGGGCGTTCC was found in majority of clade C specimens as previously reported [15,41] and with the canonical NF-kB motif in one-third of CRF02_AG LTRs.
The NF-κB2/ NF-κB1 interspace motif GCTG (nt -94 to -91) was overall well conserved. This spacer is embedded within the AP-2 recognition sequence CCGCTGGGGA [87,88], and overlaps motif targeted by GABP factors. Indeed, it has been shown that GABP binds with high affinity to the variant CTTCCG of overlapping the NF-κB2 site of CRF01_AE viruses only; and with the low affinity to the variant CTTTCCG present in most of the other studied strains [15]. In our dataset, we observed three major variants of the interspace motif (Fig 2): 1) ACTG scattered among several specimens and predominant in CRF22_01A1, 2) (G/A)CCT found in clade G, and 3) AAAG found in CRF12_BF and F2 LTRs only. This latter variant, by overlapping NF-κB2, led to a CTTTCCAAA motif conformed to the canonical consensus for the binding of C/EBP proteins ((C/A)TTNCNN(C/A)A). Furthermore, we noticed that the mutation of the AP-2 target in CRF02_AG appeared only in LTRs containing an extra NF-κB site.
U3 promoter spanning nucleotides -78 to -1. Three sites named Sp III, Sp II and Sp I for the binding of Sp1 and related factors have been described along with the TATA box in this segment of the U3 region. In our study, the Sp binding sites displayed a high degree of intraclade conservation and some inter-clade variability. Of the three, the sequence of the Sp III proximal to NF-κB1 displayed the highest variability (Fig 2). Thus, the HXB 2 Sp III consensus motif GAGGCGTGGC (nt -77 to -68) was found in 20% of clade B LTR only, while specific variants, such as GAGGCGTA(A/C)C was found in clade D isolates, and GAAGGCG(T/G)GCC which include a GABP-like motif (GAAGGC) was found in F2 (60%) and CRF12_BF (80%) LTRs (Fig 2). For this TFBS, the variant GAGGTGTGGT (5T10T) predominant in LTRs from clades A1 and CRF02_AG viruses (! 80%) was also found in sequences from CRF22_01A1 (50%) and clade C (22%), and GAGGTGTGGC (5T) seen in all the LTRs from CRF01_AE was Variability in the U3R region of diverse HIV-1 strains shared with clade B LTRs as previously observed [89,90]. Overall, variability of the Sp III in our study was concordant with previous studies reporting a high conservation of the central Gs (GGCG) and the frequent C5T/A mutation [90,91]. The HXB 2 Sp II motif TGGGCGGGAC (nt -66 to -57), mainly conserved in clades B and C sequences was mutated to TGGGCGGAGT in LTRs from clade A1 (90%) and CRF22_01A1 (100%) viruses. In our study, Sp II strain specific variants such as TGGGAGGG(G/A)C, TGGGAGGAGT, and (G/A)GGG(A/C)GGAGT were respectively observed in LTRs from clade G, CRF02_AG and CRF01_AE specimens (Fig  2). These three consensus were not conformed to the GC box consensus (GGGCGGPuPuPy) reported as recognition site for Sp1 family of proteins [92]. Sp I proximal to the TATA box (nt -55 to -46), was the most conserved of the three targets for Sp proteins, with the consensus motif (G/T)GGGAGTGG(C/T) found in more than 90% of the sequences analyzed.
Overlapping with the Sp I site, the subsequent nucleotide sequence GCGAGCCCT in HXB 2 (nt -47 to -39) was conserved in clade B LTRs only (Fig 2). In our study, this sequence displayed several variations such as GTCAACCCT predominant in clade C, GTTAACCC seen only in F2 (40%) and CRF12_BF (80%) sequences, and the consensus GCTAACCCT observed in most of the other studied LTRs. We noted that these variants contained a potential E-box/c-Myb element. Further downstream, the TATA box motif (TATAA) flanked at its 3' and 5' end by an Ebox motif, was extremely well conserved in all our strains studied except for CRF01_AE and CRF22_01A1 where it was changed to TAAAA in 60% and 80% LTR sequence respectively (Fig  2A). The CAGATG motif of the 5'E-box predominant in almost all the LTRs analyzed, was mutated to CAGAAG in clade G (100%) and in CRF02_AG (33%) viruses. However, we noted that in these two HIV-1 strains, a motif similar to the 5'E-box was observed in the sequence adjacent to the US3 target located in the modulatory segment (nt -242 to -237). The 3'E-box consensus sequence CAGCTG reported as a binding site for bHLH factors such as AP-4 [71], was mainly conserved in non-G and non-CRFs specimens where the CAGCCG variant was predominant (Fig 2). Thus, unlike what was previously reported [41,74], our study suggests that the variability of the 3'E-box motif might not be restricted to clade G and CRF01_AE isolates. The sequence between the 3'E-box and the transcriptional start site was reported to contain a target for Oct-1/Oct-2 factor [93]. In this sequence, the CTTTT(T/C)GCCT consensus (nt -15 to -6) was the major variant in non-G, non-A1 and non CRFs isolates (Fig 2), whereas CTTC TCGC(C/T)T consensus was shared by clade A1 (30%), clade G (100%) and CRF02_AG (78%) LTRs, and CTTT(G/T)CGCTT was predominant in CRF01_AE and CRF22_01A1 specimens.

Variability of the repeat (R) region
The TAR element (nt 1 to 59) fold into a hairpin structure composed of a bulge, a loop and a stem. In our study, we found that all the constitutive elements of the TAR structure were affected by singles and clustered sets of mutations across different HIV-1 strains. In general, the stem displayed the highest variability followed by the bulge (nt 23 to 25, with the TCT motif in HXB 2 ) and the loop (nt 30 to 35, with a CTGGGA motif in HXB 2 ).
In accordance with previous reports, we found a C24T change in most of the CRF02_AG and majority of clades D, F2, and G isolates [39,40], but also in CRF12_BF LTRs. Also, we found a deletion of T25 leading to a two nucleotides bulge in clade A1 and -CRF01_AE specimens as previously observed [5,89], and in CRF22_01A1 as well (Fig 3). Furthermore, we noted that mutations of the bulge in clade A1 (60%) and CRF01_AE (80%) isolates mainly also affect the overlapping consensus TGAGCC(T/C) reported as a AP-1 recognition site [29]. The consensus motif of the loop was changed to a CCGGGA variant in most of the CR01_AE viruses as previously reported [89], and also in isolates CRF22_01A1 (60%), clades C (50%), and Al (20%). Additionally, a CTGAGAG variant of the loop was observed in clade D LTRs (40%) and in few sequences subtyped as A1 (10%) and B (5%). This latter variant of the loop converted the well conserved GGGAGCTCTC sequence reported as a NF-κB binding site [27], into a GAGAGCTCTC motif.
Mutations in the stem principally affect the A10:51T, G11:T50, T13:A48 and A22:T40 base pairing. We observed that most of the non-B strains displayed a cluster of point mutations rather than a single point mutation (Fig 3). Thus, while the single mutation A48G (affecting the T13:A48 pairing) was found in most of clade B LTRs, this change was associated with G11T clade D or with 11T22G in CRF22_01A1 specimens (Fig 3). In clade G and CRF02_AG specimens, we observed that the pairing T11:A50 and C13:G48 were associated with various other mutations especially in CRF02_AG viruses which displayed the highest variability for this feature of the TAR element. The T:A pairing at position 11 and 50 also found in CRF01_AE and clade C isolates, was associated with 13G and 13G48T co-variation respectively. The stem overlaps the footprint of a sequence reported to contain binding sites for cellular factors such as LBP-1 (nt -4 to +21) and CTF/NF-I [94,95].
In the Poly-A hairpin (nt 61 to 105), the polyadenylation signal motif AATAAA (nt 74 to 79) as well as the AP-1 target (nt 88 to 95) [96,97], were extremely well conserved in our study (Fig 3). We observed that mutation affecting the stem of the poly A hairpin lead to some strains specific patterns. Thus, the ACTGCTTAAG sequence (nt 60 to 69) which includes a CREB-like recognition site [98], was changed to ACTGCTTAAAG in CRF01_AE and to ACTGCTTGAAG in CRF22_01A1 specimens. In CRF12_BF (80%) and F2 (40%) LTRs, a C61T mutation transformed the CREB-like target ACTGCTTAA into an ATTGGCTTAA motif conformed to the canonical sequence for the binding of C/EBP proteins (Fig 3). The highest inter-clade variability of the poly A hairpin was observed within the HXB2 sequence TCAAGTAGT (nt 96 to 104) ending the R region. Indeed, this motif conserved in less than one-third of our analyzed sequences, was changed in a strains specific consensus such as CT(A/G)AG(C/T)AGT in clade C LTRs and TAA(A)iGTGGT is LTRs subtyped as CRF01_AE (Fig 3). The variant TTAAAGCAGT found in CRF12_BF and F2 LTRs includes a perfect match to the consensus binding site for CREB factors [98]. This potential TFBS may compensate the conversion of the reported CREB-like site (nt 60 to 69) into a potential C/EBP-like observed earlier in these LTR sequences.

Discussion
To the best of our knowledge, this study is the first to describe the diversity of the 5'LTR U3R region from diverse HIV-1 strains including the newly emerging CRF22_01A1 HIV-1 strain. The CRF01_AE subtype assigned to sequences from CRF22_01A1 may be explained by the close relation of the two strains. Indeed, it has been reported that the gag, nef and segments of the env genes of a typical CRF22_01A1 strain, can be subtyped as CRF01_AE [47,52]. The finding of additional contigs with different subtypes that are different from the primary genotype in some of our specimens can be explained as artefacts and/or contamination during our experiment. Alternatively, they can also result from the intra-host diversity of HIV-1 population. This highlights the fact that Next-generation sequencing, by detecting low-frequency mutation [42], might be a more reliable tool to analyze the genetic diversity of HIV-1. The intra-host heterogeneity of HIV-1 population reflects its high mutation rate and rapid turnover, but also the co-circulation of various HIV-1 strains within the same patient. Thus, the F2 contig found in a CRF02_AG specimen from Cameroon might corroborates the recent report of dual-infection with CRF02_AG and F2 in this country [99,100]. Three out of five contigs assigned to be CRF12_BF were from clade B specimens collected in Bolivia, Spain and Japan. In these countries, a high prevalence or emerging BF recombinant circulating along with pure B have been reported [101][102][103][104][105].
The U3R region of the HIV-1 5'LTR is involved in the control of both basal and trans-activated transcription of the HIV-1 genome though its interaction with several cellular transcription factors. Therefore, mutations affecting its nucleotide sequence might drive potentials differences in the biological property of diverse HIV-1 strains. Here, we found that the degree of genetic variability of the U3R can vary from very low to fairly high, depending on the segment analyzed. Nevertheless, it has been demonstrated that even minor changes in this region could influence the replication rate and the disease severity of HIV-1 infection [15,55].
In the U3 modulatory segment, besides binding sites for the RBE III and Ets-1 sites that were conserved in almost all the LTRs, majority of the TFBS were distinctive of one or few HIV-1 strains. Several motifs aligned with the TFBS reported in HXB2 displayed a sequence that varied from the reported consensus. As most of the regulatory elements that were studied belonged to clade B LTRs, functions of these TFBS described in non-B strains remains to be determined. Although the affinity for C/EBP proteins of the motifs found in non-B, C and D is yet to be investigated, it has been suggested that the binding ability and functions of C/EBP proteins can be modulated by interaction with other factors such as ATF/CREB/ AP-1 [106], E-boxes targeted proteins [107], and TATAA-binding proteins [108]. This might explain the finding of potential E-box (in CRF02_AG, clades B, C, D and G), AP-1 like (clade F2, CRF12_BF and CRF22_01A1) and Ets (CRF22_01A1 and CRF01_AE) binding sites in close proximity or embedded into the reported C/EBP binding site in our sequences analyzed. Thus, the differential repartition of these TFBS according to the HIV-1 strain might probably direct a distinctive mechanism of regulation of HIV-1 transcription. As previously reported, we also observed a high variability of the GATA binding motif. This variability has addressed the affinity, specificity and function of GATA factors [109]. Indeed, it has been reported that the specificity of GATA binding sites along with the spanning TFBS influence the fate of GATA-3 mediated T-cell differentiation and function [110]. Thus, the combination of strains-specific GATA binding sites along with distinctive spanning mutations may potentially impact pathogenicity of HIV-1 according to the strain.
Several TFBS have been shown to be able to activate or repress the transcriptional activity of HIV-1 depending on their affinity with the targeted factor and the spanning nucleotides [35,71,83,84,[111][112][113]. Thus, in accordance with a recent report [35], the presence of the AP-1 variant TGACACA in CRF02_AG, CRF22_01A1, CRF12_BF, as well as in clades F2, C and A1 isolates may increase their capacity to establish HIV-1 latent infection compared to other strains. This predisposition may be affected by intra-clade variability due to the duplication of this variant within the MNFLP present in some of these specimens. On the other hand, the presence of an insert motif similar to the AP-1 activator of the p53 promoter might contribute to an increase reactivation of HIV-1 in F2 and CRF12_BF, as it has been shown that p53 could reactivate HIV-1 replication from its latency in U1 cells with upregulation/activation of host transcription factors such as AP-1 (Wang; X., et al, 2017, unpublished data).
In our study, the important TATA box element as well as key TFBS such as NF-κB was fairly well conserved. In accordance with previous reports [66, [114][115][116], the configuration and copy number of NF-κB sites found in our study; suggest inter-strain differences of the NF-κB induced activity. Also, as suggested by our finding in CRF02_AG specimens, this activity might also vary within isolates from the same HIV-1 strain. Furthermore, the variability found in the NF-κB inter-space required for the optimal binding of factors such as NFAT, Ets and AP-2 [87,88,117,118], may also favor a distinctive modulation of the NF-κB activity in CRF01_AE, CRF22_01A1, CRF12_BF and clade F2. The 5T variant of the Sp III was shown to be able to abrogate the binding of Sp1 factors [90], and to be associated with HIV-1 disease progression [17]. As the mechanism leading to faster HIV-1 disease progression observed with CRF01_AE and some clade B isolates remain to be explored [114,[119][120][121], the contribution of Sp III along with other TFBS specificities should be considered. Also, it was reported that Sp1 and GABP factors cooperates to activate various genes such as CD18 which overexpression during HIV-1 infection has been related to disease severity [122]. The potential GABP binding site included in the Sp III variant of F2 and CRF12_BF LTRs, could also differentially impact the outcome of HIV-1 infection with these isolates.
In our study, the bulge residue T23, the loop as well as paired nucleotides below (A22:T40; G21:C41) and above (G26:C39 and A27:T38) shown to be important of the TAR activity were conserved in majority of the sequence analyzed. However, strain specific mutations within and outside these the features, might explained some differences in the transcriptional activity and replication rate between subtypes and CRFs [41,66]. Changes in the Watson-Crick complementarity observed here, might also affect the TAR structure which has been shown to be important for its function, and for the activity of the downstream 5'LTR U5 region and the Gag leader sequence [123][124][125]. Also, it has been reported that regulation of the transcriptional activity driven by TAR-tat interaction can be enhanced through synergic action with various TFBS such the AP-1 targets with a crucial role in the replication and transcription of HIV-1 [97]. In most of the sequence analyzed in our study, the TFBS in the R region were well conserved assuming that the biological mechanism in which these TFBS are involved might be similar for various HIV-1 strains. Nevertheless, the potential loss of one of the AP-1/ATF/CREBlike sites in CRF01_AE and CRF22_01A1 specimens, and the relocation of the CREB-like in CRF12_BF and some F2 LTRs associated with the presence of a potential C/EBP site might suggest a distinctive TAR mediated activity for these isolates.
It is known that the differential activity of the LTR due to the lack or mutation of a TFBS can be compensated by various mechanisms such as the gain of additional factors, the interplay between TFBS in close proximity or the delocalization of the TFBS. On the other hand, association of potential strains distinctive mutations in the U3R may contribute to strain-specific function of the 5'LTRs [66,116]. For CRF01_AE isolates for example, it was suggested that its 2nt bulge (TC), 22G31C and TAAAA co-mutations promote its higher transcriptional activity and replication rate compared with clade B [41,66,126].
In this study, we have reported distinguishing patterns within and outside the reported regulatory element in the U3R region from diverse HIV-1 strains. Due to the important role of this region in the control of the HIV-1 transcription, it is reasonable to suggest that strain specific variability greatly contributed to the effectiveness of replication and gene expression of different HIV-1 subtypes and CRFs. Intra and inter-clade variability of the U3R region should be further explored in future studies related to the replicative capacity and pathogenicity of HIV-1, as this may lead to a better understanding of phenotypic effects of HIV-1genetic diversity.
Supporting information S1