Table 1.
Integration datasets used in this study.
Fig 1.
Gene expression and integration.
The complete set of ca 22,000 RefSeq genes was slightly modified to remove overlaps (see Methods). The non-overlapping genes were divided into either 100 bins or 4 bins, based on transcripts per million reads (TPM) obtained from the RNA-seq analysis of the in vitro infected PBMC. The 100-bin RNA-seq data are shown in green triangles in all panels. The combined IS data are shown for the genes in each of the 4 bins in all panels for PBMC (blue), pre-ART donors (plum) and on-ART donors (red). Darker colors in C and D indicate proviruses oriented in the same direction as the host gene; lighter colors are in the opposite orientation. A. Total integration site (IS) density (sites/Mb) in each bin, normalized to the average for the whole genome (125 sites/Mb for PBMC, 4.28 sites/Mb pre-ART, and 10.7 sites/Mb on-ART). B. The orientation of the proviruses relative to the host gene was calculated for each bin as (proviruses oriented the same as the gene - proviruses oriented opposite the gene)/(total proviruses). Dashed lines indicate p values (binomial) >0.05. C and D. Ratios of proviruses per bin for the pre-ART (C) or on-ART (D) samples. Note that the higher the ratio, the smaller the number of IS in the donor samples relative to the in vitro infected PBMC samples. S1 Fig shows the same results with the 100-bin IS data included.
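The quantities plotted in panels A and B can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the exact two-sided binomial test against a 50:50 null is an assumed reading of "p values (binomial)", not necessarily the procedure the authors used.

```python
from math import comb


def relative_density(n_sites, bin_size_mb, genome_avg_sites_per_mb):
    """IS density of a bin (sites/Mb) divided by the genome-wide average (panel A)."""
    return (n_sites / bin_size_mb) / genome_avg_sites_per_mb


def orientation_bias(same, opposite):
    """(same - opposite) / total, as defined for panel B.

    Ranges from -1 (all proviruses opposite the gene) to +1 (all the same).
    """
    total = same + opposite
    if total == 0:
        raise ValueError("bin contains no proviruses")
    return (same - opposite) / total


def binomial_two_sided_p(same, opposite):
    """Exact two-sided binomial p value against a 50:50 null (assumed test).

    Under the null, every outcome k has probability comb(n, k) / 2**n, so the
    p value is the summed probability of all outcomes no more likely than
    the observed one.
    """
    n = same + opposite
    observed = comb(n, same)
    tail = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return tail / 2 ** n
```

For example, a bin with 60 proviruses in the same orientation as the gene and 40 in the opposite orientation has an orientation bias of 0.2.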
Fig 2.
Comparative ranking of genes by integration frequency.
The non-overlapping RefSeq genes were ranked according to the number of unique IS in the PBMC in vitro dataset (blue squares), along with the number of sites in the same genes in the pre-ART (plum triangles) and on-ART (red diamonds) datasets. A. All 20,207 genes in the dataset are shown; a little over half have one or more IS in the in vitro PBMC dataset. Points with no IS have been removed for clarity, and the genes with no IS in PBMC are assigned a rank at random. The inset shows an expanded view of the 30 genes with the most IS. The top 10 are labeled, as is STAT5B (number 28). B. Expanded view of the 2500 genes (indicated by the box in A) with the most IS. The 7 genes in which an integrated provirus directly contributes to persistence and/or clonal expansion of the infected cell are circled and labeled. Some pre-ART points are indicated with arrows for clarity. C. Venn diagram showing the genes with at least one IS in the three datasets, and how those genes are shared among the datasets. Numbers indicate the number of genes in each category.
Table 2.
Genes that are favored targets for integration in PBMC infected in vitro and the on-ART samples.
Table 3.
Genes in which integrated provirus can be selected in vivo a.
Fig 3.
Using the Excel-based application described in S2 Fig, a selected region of the genome of any size is divided into 250 bins. The IS in each bin are tabulated and the number of IS in the bin is shown as a bar, with the orientation of the provirus relative to the numbering of the chromosome indicated by color (red for the same as the chromosome and blue for the opposite). The grey boxes indicate the location and relative RNA level of the genes in each bin. RefSeq genes (from the hg19 sequence database) are shown at the bottom of each plot, with arrows indicating their orientation relative to the numbering of the chromosome. The figure shows the distribution of genes and IS on chromosome 16 at 3 different scales. In all panels, the top image shows the distribution of IS from PBMC infected in vitro; the bottom, the distribution of the unique IS data from on-ART donors. A. Entire chromosome 16 (ca 353,000 bp/bin); the arrow shows the region in which IS are enriched in the on-ART samples. The box shows the region enlarged in panel B. Because <10% of the 824 genes could be accommodated on the X-axis, chromosomal position is shown instead. B. A 10 Mb region of chromosome 16 (40 kb/bin) centered on MKL2. Again, the box indicates the region expanded in panel C. C. A 2 Mb region of chromosome 16 (8 kb/bin) centered on MKL2. The boxed region is expanded in Fig 4A.
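The binning described above (a region of any size split into 250 equal bins, with IS counted separately by orientation) can be illustrated as follows. This is a minimal Python sketch, not the authors' Excel-based application; the function name, the `(position, strand)` input format, and the half-open region convention are assumptions made for the example.

```python
def bin_integration_sites(sites, start, end, n_bins=250):
    """Count integration sites per bin across [start, end), split by strand.

    sites: iterable of (position, strand) pairs, where strand is '+' for a
    provirus oriented with the chromosome numbering and '-' for the opposite.
    Returns (plus_counts, minus_counts, bin_width_bp).
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    width = (end - start) / n_bins
    plus = [0] * n_bins
    minus = [0] * n_bins
    for pos, strand in sites:
        if not (start <= pos < end):
            continue  # site lies outside the selected region
        # Guard against floating-point rounding at the right-hand edge.
        idx = min(int((pos - start) / width), n_bins - 1)
        if strand == "+":
            plus[idx] += 1
        else:
            minus[idx] += 1
    return plus, minus, width
```

With a 2 Mb region this gives a bin width of 8,000 bp, matching the 8 kb/bin quoted for panel C.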
Fig 4.
A. The complete MKL2 gene (308 bp/bin). The gray bars show the position of exons, with their height indicating the expression level of the gene in TPM, as in Fig 3. The protein coding region is indicated by the colored arrows above the maps. The boxed region is enlarged in panel B. B. Cluster of on-ART MKL2 IS in introns 4–6 (50 bp/bin). Top, in vitro PBMC IS; middle, on-ART unique sites; bottom, on-ART all sites.
Fig 5.
HIV IS in other genes in which proviruses can give the host cell a selective advantage.
Maps were generated for each of the genes in Table 3 as described in Fig 4. For each pair of panels, the map at the top shows the IS distribution for the PBMC infected in vitro, and the on-ART distribution is at the bottom. A. STAT5B (309 bp/bin). B. BACH2 (1.48 kb/bin). C. MKL1 (906 bp/bin). D. MYB (151 bp/bin). E. IL2RB (197 bp/bin). F. POU2F1 (828 bp/bin).
Fig 6.
IS distribution, RNA level, and clonal amplification.
The amplification ratio for all integrated proviruses was calculated as (total number of IS)/(number of unique IS). A. Each RefSeq gene in the non-overlapping dataset was assigned to one of 100 bins (ca. 200 genes in each bin) based on the level of RNA for the gene (as TPM in the in vitro infected PBMC), and the overall amplification ratio was calculated for the sites in each bin. Non-expressed genes were assigned to bins at random. The regression lines shown have slopes of 0.0008 and 0.0013 (p = 0.88 and 0.12) for pre-ART and on-ART bins, respectively. B. Clonal amplification ratios pre-ART and on-ART. Color coding is the same as in previous figures: plum, pre-ART; red, on-ART.
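The amplification ratio defined above is simple to compute from a list of observed sites. The Python sketch below assumes each observation is recorded as a site identifier that is identical for repeated recoveries of the same provirus; the function name and the identifier format are hypothetical and chosen only for illustration.

```python
def amplification_ratio(site_ids):
    """(total number of IS) / (number of unique IS) for one set of observations.

    site_ids: one entry per recovered IS; repeated recoveries of the same
    site (for example, from a clonally expanded cell) share an identifier.
    A ratio of 1.0 means every site was seen exactly once.
    """
    total = len(site_ids)
    unique = len(set(site_ids))
    if unique == 0:
        raise ValueError("no integration sites supplied")
    return total / unique
```

For example, four observations made up of one site recovered three times and a second site recovered once give a ratio of 4/2 = 2.0. Applying the same function to the sites in each expression bin would give the per-bin values described for panel A.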
Table 4.
IS detected in centromeric repeat DNA a.