Integration in oncogenes plays only a minor role in determining the in vivo distribution of HIV integration sites before or during suppressive antiretroviral therapy

HIV persists during antiretroviral therapy (ART) as integrated proviruses in cells descended from a small fraction of the CD4+ T cells infected prior to the initiation of ART. To better understand what controls HIV persistence and the distribution of integration sites (IS), we compared about 15,000 and 54,000 IS from individuals pre-ART and on ART, respectively, with approximately 395,000 IS from PBMC infected in vitro. The distribution of IS in vivo is quite similar to the distribution in PBMC, but modified by selection against proviruses in expressed genes, by selection for proviruses integrated into one of 7 specific genes, and by clonal expansion. Clones in which a provirus integrated in an oncogene contributed to cell survival comprised only a small fraction of the clones persisting in on ART. Mechanisms that do not involve the provirus, or its location in the host genome, are more important in determining which clones expand and persist.


Author summary
In HIV-infected individuals, a small fraction of the infected cells persist and divide. This reservoir persists during fully suppressive ART and can rekindle the infection if ART is discontinued. Because the number of possible sites of HIV DNA integration is very large, each infected cell, and all of its descendants, can be identified by the site where the provirus is integrated (IS). To understand the selective forces that determine the fates of infected cells in vivo, we compared the distribution of HIV IS in freshly-infected cells to cells from HIV-infected donors sampled both before and during ART. We found that, as previously reported, integration favors highly-expressed genes. However, over time, the fraction of cells with proviruses integrated in highly-expressed genes decreases, implying

Introduction
Integration of viral DNA into the host cell genome is an essential step in the replication of HIV. In the course of untreated infection, the majority of infected cells die within a few days [1][2][3]. A small fraction of cells containing integrated proviral DNA persists for many years of suppressive antiretroviral therapy (ART) and a small fraction of the cells containing intact persistent proviruses is the source of the virus that re-emerges when therapy is interrupted [4].
Understanding the mechanism(s) by which the infected cells persist is central to developing and evaluating strategies to cure HIV infection. Recent studies describing the distribution of integration sites (IS) in cells obtained from infected people have provided important clues about the persistence of infected cells (reviewed in [5,6]). The distribution of proviruses in HIV-infected individuals on long-term ART derives from their initial distribution in newly infected cells, modified by selection that promotes their survival, preferential loss, or clonal expansion. In vitro experiments show that the IS for HIV proviruses are widely distributed in the human genome; however, the distribution is far from random, favoring the bodies of highly-expressed genes located in gene-dense regions [7][8][9]. Retroviral integration exhibits a modest preference for specific bases around the site of integration [10,11], and there is no discernable preference for the integration of the viral DNA in either orientation relative to the direction of transcription of the target gene [7,12,13].
It is likely that integration follows the same rules in cells infected in culture and in a human host. In untreated chronic HIV infection, between one in 100 and one in 1000 CD4+ T cells are infected every day; almost all of the newly infected cells die. In untreated individuals, there are more than 10 8 new integration events every day, a process that continues for many years. Starting soon after the initial infection, a small fraction of the infected cells survive, growing into clones that become large enough to detect (>10 5 cells) in as few as a 4 weeks [14]. Most of the surviving cells (>98%) carry defective proviruses [15,16]. However, despite the fact that cells that carry intact infectious proviruses represent only a small fraction of the cells that survive, they still add up to a large number [16]. At least some of the infectious proviruses are in a poorly defined, but reversible, state of latency. Such cells persist for decades of ART and can produce progeny viruses that can rekindle the infection when ART is interrupted [17,18]. Collectively, the intact infectious proviruses constitute the viral reservoir.
The reservoir comprises proviruses in cells that were infected prior to the initiation of ART (and their descendants), and is not maintained by persistent viral replication in blood or solid tissues [4,14,19,20]. Studies of the distribution of IS in cells from infected individuals revealed that at least 40% of the long-lived, persistently infected cells were in a small number of clones each derived from a single infected cell. Clonally amplified cells can persist for at least 11 years [21][22][23]. Some clonally amplified CD4+ T cells carry infectious proviruses, and can occasionally express infectious viruses with the potential to rekindle the infection when ART is discontinued [22,[24][25][26][27].
Although we now know that there is extensive clonal expansion of infected cells in vivo, important questions remain: 1) Integration in at least three genes (MKL2, BACH2, STAT5B) [21,23,28] can promote the survival and/or proliferation of HIV-infected cells. Are there other genes in which proviral DNA provides a selective advantage to the infected cell? 2) Are there other factors that help to reshape the initial distribution of IS in individuals on long term ART? 3) What is the relative importance of provirally-mediated expression of such genes and other factors (e.g., the immune response) in the clonal expansion and persistence of HIVinfected cells?
To address these questions, we prepared a large IS library from stimulated PBMCs infected in culture with HIV [29]. We then compared the distribution of the IS to the patterns of gene expression in these cells and to the distribution of IS obtained from cells from infected people both before ART and during suppressive ART. Our results support the conclusions that integration into oncogenes, while striking, is only a small factor in HIV persistence, and that clonal expansion must be driven largely by factors unrelated to the location of the provirus.

Study participants
IS data were obtained from the donors described in the studies listed in S1 Table, and their detailed characteristics can be found in the references therein. All studies had the approval of the relevant institutional IRB, and informed consent was obtained for all the samples collected, as described in the cited studies.

Integration site datasets
The distribution of IS in people living with HIV on long-term ART can only be properly understood in the context of the distribution of IS at the time the cells were initially infected. Unfortunately, it has not been possible to prepare sufficiently large IS libraries from newlyinfected people [14]. We therefore analyzed the distribution of HIV IS in phytohemagglutinin (PHA)-stimulated CD8-depleted PBMC isolated from two HIV-negative individuals. PHA activation was used because HIV prefers to infect mitogen or antigen stimulated cells. Except for a ca. 8-fold decline in the number of IS obtained, there is very little difference in either gene expression or IS distribution in PHA stimulated and unstimulated cells; see S6 Fig). The PBMC were infected in vitro with replication competent HIV-1 BAL (referred to hereafter as "PBMC"). We compared that distribution to the distribution of IS in cells obtained from infected individuals both pre-ART and on-ART using identical methods. The comparative analysis of the three datasets allowed us to assess the effects of various selective forces-both positive and negative-that modify the distribution of IS during suppressive antiretroviral ART.
Total DNA was extracted from the PBMC two days following infection, to allow time for integration, but not selection for, or against, proviruses integrated in specific sites or specific genes [29]. We used linker-mediated PCR [30] as previously described [21,31] to identify IS. When we compared the gene distribution of the IS from the PBMC obtained from the two individuals, the correlation coefficient was 0.98 and, for the rest of the analyses, we combined the data from the two sets of IS. The combined dataset from the in vitro-infected PBMC contained 394,975 independent IS. It will be referred to as the "PBMC dataset" for the remainder of the paper.
By combining in vivo IS from several studies [14,20,21,32] (S1 Table), we were able to assemble datasets comprising a total of 15,821 IS from pre-ART samples from 25 HIV-infected donors (hereafter referred to as "donors") and 53,945 IS from on-ART samples from 36 donors (Tables 1 and S1). The introduction of a shearing step prior to linker ligation, PCR, and paired-end sequencing [30] creates random break points in the flanking host DNA, making it possible to distinguish identical IS that arose by clonal expansion of a single infected cell [21,30,31]. After collapsing the IS with multiple breakpoints to a unique site, the two libraries contained 13,283 unique IS pre-ART and 33,336 unique IS on-ART (Table 1).

In vivo selection against proviruses integrated in highly-expressed genes
HIV preferentially integrates a DNA copy of its genome into highly-expressed genes in generich regions [8,9,33]. However, the initial analyses were based on small datasets (<1 IS/gene), used methodology (restriction enzyme cleavage) which has the potential to introduce bias based on local base composition differences, and used cell lines rather than primary cells. To confirm the prior results and extend them to relevant target cells, we mapped the IS in the PBMC dataset into genes from the entire RefSeq gene database [34], slightly modified to remove gene overlaps, yielding a set of 20,207 genes. The overlaps were removed to prevent mapping single IS to two genes. To compare the sites of integration to the expression of the host genes, we performed an RNA-seq analysis using the same PBMC that were used to generate the PBMC IS dataset. Table 1 summarizes the results of this analysis. As expected, most of the IS (83%) were in genes, which comprise only 38% of the human genome, and most (82%) were in expressed genes, which comprise about 22% of the genome. Thus, only 18% of the HIV IS were found in the 78% of the genome that is not in expressed genes (defined as those with �0.5 transcripts per million reads, TPM). A similar preference was evident in the pre-ART data in which 79% of the IS were in genes and 76% were in expressed genes. In the on-ART dataset 77% of the unique IS were in genes and 74% of the pre-ART sites were in expressed genes. Although only 2.1% of the genome is in exons, exonic IS comprised 6.5% of the IS in PBMC and somewhat less (5.6% and 4.7%) of the unique IS in pre-and on-ART samples, respectively. The differences between the percentages of the IS in genes in the in vitro PBMC (83%) and in the on-ART (79%) or pre-ART (76%) datasets, although not large, were highly significant (Table 1). There was an even larger difference in expressed genes (82% vs 76% and 74%). We compared the relationship of IS to gene expression in the three datasets using our non-overlapping gene set (available in S1 Data). All RefSeq genes were divided into 100 bins based on the RNA-seq data (as transcripts per million reads, TPM) obtained from the PHA-stimulated HIV-infected PBMC (Fig 1, green triangles) and also divided into 4 bins, again based on TPM. The mean integration site density (unique IS/Mb, normalized to the mean for each dataset) for the 3 IS datasets when the IS are divided into the 4 bins, is presented in Fig 1A. The same data, divided into 100 bins, are shown in S1 Fig. As expected [8,9,33], the results showed a strong correlation between integration site density and gene expression for all three datasets. There was a small, but highly significant, difference in the integration site density in the 3 datasets for the genes with the highest TPM. When the IS data were divided into 4 bins, the fraction of IS that were in the quartile of the most highly-expressed genes was about 10% lower in the pre-ART dataset than in the PBMC dataset, and about 18% lower in the on-ART dataset, compared to the PBMC dataset (p = 3.3 x 10 −41 ). A smaller effect was seen in the second quartile, based

PLOS PATHOGENS
on gene expression (p = 2.0 x 10 −218 ). No similar effect was seen in the genes in the bins with two lowest levels of expression. This result implies the presence of selection, in vivo, against cells containing proviruses integrated in the most highly-expressed genes, and shows that the selection operated both pre-ART and on ART.
We and others previously reported a preference for proviruses integrated in an orientation opposite to the host gene in HIV-infected individuals and SIV-infected macaques on-ART (5% and 9%, respectively) that is not seen in vitro [21,23,29]. There was, as expected, no significant bias for the orientation of the proviruses, relative to the host gene, in the PBMC dataset ( Fig 1B). However, enrichment for proviruses orientated opposite to the gene in the on-ART dataset, relative to the PBMC dataset, was evident (S2 Table, p = 1.7 x 10 −63 ). We also observed a smaller, but significant, enrichment for proviruses in the opposite orientation to the gene in the pre-ART data (S2 Table,  To understand the nature of the in vivo selection, we separated the IS data based on whether the provirus was oriented with the gene, or opposite the gene, and then plotted IS data relative to the level of expression of the host gene, dividing the genes into four bins based on their levels of RNA expression. There was no significant selection against proviruses that were in the same orientation as the host gene in genes that were not expressed in either the pre-ART or the on-ART dataset ( Fig 1B). In expressed genes, there was a significant selection against proviruses oriented with the gene in the on-ART dataset [p values ranging from about 0.013 to 5 x 10 −47 (S2 Table)]; the strongest selection was seen in the highly-expressed (>30 TPM) genes. A similar pattern was seen with the pre-ART data, although the effects were smaller and the effect for the genes that were expressed at a low level was not significant. While the difference in IS distribution between the PBMC and the in vivo samples could have reflected differences in initial integration preferences between the two conditions, the differences between pre-and on-ART samples are best explained by selective forces that affect the survival or proliferation of the infected cells. The on-ART data ( Fig 1D) also showed that there is weaker selection against cells with proviruses integrated in highly-expressed genes in the opposite orientation to the gene (p = 2.3 x 10 −48 ); again the strength of selection decreased with decreasing levels of gene expression. Similar, but much weaker, selection was seen in the pre-ART dataset (p = 0.011).

Selection involving proviruses integrated in individual genes
To examine the contribution of proviruses in genes that are favored for HIV integration (hot spots) we ranked genes in both the PBMC and the on-ART datasets by IS density, as unique sites per kb (Table 2). For the most part, the IS site density in individual genes showed that the in vitro and in vivo data were in good agreement ( Table 2 and Fig 2). We and others previously reported that proviruses integrated in certain introns of three genes (BACH2, MKL2, STAT5B) can confer a selective advantage for the clonal expansion and/or survival of infected cells [21,23,28]. The in vivo enrichment of IS in these genes is due to post-integration selection, an effect that has sometimes been misinterpreted in the literature as reflecting preferential "hot spots" for integration [35,36].
To ask if there are other genes in which an HIV provirus could provide a selective advantage to the host cell in vivo, we compared, for each gene, the fraction of IS in the in vitro PBMC dataset to the fraction in the pre-ART and on-ART in vivo datasets (Fig 2). The top two panels show the same data using two different scales. Panel A shows the data for all of the genes. The inset shows the top 30 genes, showing that, for most genes, there is a strong correlation between the IS distribution in vitro and in vivo. There is one obvious exception to this correlation in the top 30 genes, STAT5B (at number 28 in PBMC), a gene in which HIV proviruses are known to confer a selective advantage on the host cell [28]. Panel B shows the top 2600 genes by IS frequency in PBMC. Although there is some scatter, the in vivo IS data closely mirror the PBMC data.
Despite the overall selection against proviruses in expressed genes, there were a few genes, circled in Fig 2B, in which there was a higher fraction of proviruses in vivo in the on-ART dataset than in vitro, implying in vivo selection for cells with HIV proviruses in these genes. As will be discussed below, proviruses in these genes show other evidence for a positive in vivo selective effect in the host cells-clustering of IS within the gene and a strong bias toward selection for proviruses in the same orientation as the gene (Table 3).
Panel C is a Venn diagram that shows the extensive overlap between the genes that are favored for integration in the 3 datasets. Of 4,247 genes with at least one IS in the pre-ART dataset, 4090 (96%) also had at least one IS in the PBMC dataset (there were 11,422 genes IS with at least one IS in PBMC). Similarly, of the 6,366 genes in the on-ART dataset, only 422 (6.6%) had no IS in either of the other two sets, and 5,099 (93%) of the genes with at least one IS in the on-ART dataset had at least one IS in PBMC. These comparisons validate the use of the in vitro PBMC data as a surrogate for the initial distribution of IS in vivo and show that the strongest determinant of the overall distribution of HIV proviruses in infected individuals, both pre-ART and on-ART, is the initial distribution at the time the cells (or their ancestors) were infected.

Mapping and comparing the distribution of integration sites in vivo and in vitro
To help compare the distribution of IS in different datasets, we developed an Excel-based tool which can be used to view different sized regions of the host genome ranging from a selected portion of a gene to a whole chromosome. The tool, available as an Excel workbook in S1 Data, makes it possible to compare the IS site distribution and gene expression levels for up to 5 IS datasets at a time (S2 Fig). The workbook, including the visualization tool, gene list, and

PLOS PATHOGENS
What shapes the HIV integration landscape during antiviral therapy?
the RNA-seq integration data used in this study is available as S1 Data.  Fig 2B) [21]. The IS data are shown at several scales, from the whole chromosome (bin size ca 355,000 bp, Fig 3A) down to a fraction of a gene (bin size 150 bp, see Fig 4B). At the three largest scales, the pattern of IS is strikingly similar in the PBMC and on-ART datasets, and closely matches the distribution of expressed genes.

Clustered IS in specific genes
The one exception to the strong similarity in the distribution of the proviruses in chromosome 16 in the PBMC and on-ART datasets is in the MKL2 gene (also known as  Comparative ranking of genes by integration frequency. The non-overlapping RefSeq genes were ranked according to the number of unique IS in the PBMC in vitro dataset (blue squares), along with the number of sites in the same genes in the pre-ART (plum triangles) and on-ART (red diamonds) datasets. A. All 20,207 genes in the dataset are shown; a little over half have one or more IS in the in vitro PBMC dataset. Points with no IS have been removed for clarity, and the genes with no IS in PBMC are assigned a rank at random. The inset shows an expanded view of the 30 genes with the most IS. The top 10 are labeled, as is STAT5B (number 28). B. Expanded view of the 2500 genes (indicated by the box in A) with the most IS. The 7 genes in which an integrated provirus directly contributes to persistence and/or clonal expansion of the infected cell are circled and labeled. Some pre-ART points are indicated with arrows for clarity. C. Venn diagram showing the genes with at least one IS in the three datasets, and how those genes are shared among the datasets. Numbers indicate the number of genes in each category.
https://doi.org/10.1371/journal.ppat.1009141.g002  In addition to MKL2, HIV proviruses integrated in two other genes, BACH2 and STAT5B have been previously shown to contribute to the growth and persistence of infected cells in individuals on ART [21,23,28]. Proviruses in these genes share characteristics with the MKL2 proviruses: enrichment in vivo relative to their frequency in PBMC infected in vitro; clustering in one or a few neighboring introns; and orientation of the proviruses is the same as the host gene ( Fig 5 and Table 3). The current dataset provides strong statistical support for selection of proviruses integrated in BACH2 and STAT5B on-ART (Table 3), as well as evidence of selection for proviruses in these two genes pre-ART (Fig 2).
It is possible that there are additional genes in which HIV proviruses can provide a selective advantage for the host cell. We analyzed the total on-ART IS dataset for the three hallmarks of selection for cells with proviruses in specific genes: 1) enrichment relative to the PBMC dataset; 2) enrichment for proviruses in the same orientation as the gene; and 3) clustering of the IS in one or a few nearby introns ( Table 3).
As expected, when all 5,944 genes with IS in both the on-ART and the PBMC datasets were analyzed and ranked by the combined p value for the relative number of IS and for orientation bias (S2 Data), the three top ranking genes were BACH2, STAT5B, and MKL2. Four more genes were identified using the new on-ART dataset: MKL1 (also known as MRFTA), a paralogue of MKL2 [37]; IL2RB, encoding the β chain of the IL-2 receptor (https://www.ncbi.nlm. nih.gov/gene/3560); MYB, a known protooncogene originally identified in an avian retrovirus [38], and POU2F1, a ubiquitous OCT-1 motif binding transcription factor, whose misexpression is associated with a variety of cancers [39]. All of these genes have been linked directly to cell growth or survival (see Discussion). In all of these genes, the proviruses in the on-ART dataset displayed a strong bias for integration in the same orientation as the gene and the number of IS was significantly greater than what would have been predicted based on the PBMC dataset ( Fig 5 and Table 3). In each of the seven genes, the IS in the on-ART datasets showed evidence of clustering in specific introns. In the current on-ART dataset, these seven genes were the only ones for which there was significant evidence for positive selection of cells with an integrated HIV provirus in vivo. Taken together, the 555 independent IS in these genes in the on-ART data comprised about 2.2% of the unique IS in genes. Because our on-ART dataset is large, we were able to identify genes for which the number of IS represented a very small fraction of the total IS (MYB is 10/33,336 unique IS, or 0.003%). There may be additional genes that we have not identified in which proviral integration could lead to positive selection; however, the fraction of the total IS in such genes must be very small (see Discussion).
In the analysis of the on-ART dataset, we identified a number of genes for which there was apparent enrichment of proviruses oriented opposite to the host gene. In 4 out of 6 cases, these genes had fewer total IS in the on-ART dataset than would be expected based on the PBMC dataset (S3 Table), consistent with stronger than average selection against cells with proviruses  The protein coding region is indicated by the colored arrows integrated in the same orientation as the gene. Unlike the seven genes in which there was evidence of selection for a provirus in the same orientation as the gene, in the four genes with reduced same-sense integration, the overall distribution of IS within the genes appeared to be similar to that seen in the PBMC dataset, without obvious clustering (S3 Fig). One of the 6 genes listed in S3 Table (SLC6A16) had a statistically significant increase in the number of on-ART IS relative to the PBMC IS, suggesting that there might have been selection for cells with proviruses in the gene, despite their opposite-sense orientation. In both the PBMC and on-ART datasets, there was obvious clustering of IS in a large SLC6A16 intron preceding the start of translation (S3C and S3D Fig) However, this gene, encoding a transporter involved in neurotransmission [40], is unlikely to affect T cell growth.
We also considered the possibility that there could be regions outside of genes where proviruses could provide a selective advantage to the infected cell in vivo. We used an unbiased scan to look for evidence of clustering of proviruses within 10-kb regions in the on-ART dataset (S4 Fig and S4 Table). With the exception of genes that had already been described, all of the clusters that were found in the on-ART dataset were also present in the PBMC dataset. Sites in which there was clustering in both the in vitro and in vivo dataset are regions that are favored for integration "hot spots"-this is in contrast to the clusters that arise in vivo from post-integration selection. Not surprisingly, clusters of proviruses were found in some of the genes that are the most favored targets for integration, including PACS1 (S4F Fig), the gene which had the largest number of sites in the infected PBMC (Fig 2A, Inset), and the lncRNA genes NEAT1 and MALAT1 (S4E Fig), which are among the genes with the highest density of IS in vivo and in vitro ( Table 2). One of the hot spot clusters identified in both the PBMC and on-ART datasets appeared to be intergenic (S4 Fig, panel H). However, analysis of the RNA-seq data revealed a transcription unit in this region which has not yet been annotated.

Clonal amplification of HIV-infected cells
In the analyses presented thus far, we collapsed all the breakpoints associated with each IS to one, a step that simplified the comparisons of the in vitro and in vivo data. One of the most striking features of the distribution of HIV IS in vivo is the presence of identical IS that derive from clonally-amplified cells. The effect of clonal amplification on the IS data can be seen by comparing the distribution of IS in the MKL2 cluster in Fig 4B before (bottom panel) and after (middle panel) the amplified sites were collapsed down to unique sites. In this case, the 30 unique IS were found a total of 144 different times-an average amplification ratio of nearly 5, ranging from 1 (no amplification) to more than 25. Overall, the amplification ratio for all sites across the entire genome was 1.6 for the on-ART samples, and 1.2 for the pre-ART samples (Fig 6)., That the pre-ART ratio is >1 confirms that infected cells can grow into clones before ART is initiated [14] The amplification ratios were independent of whether the proviruses were inside or outside of genes, the orientation of the proviruses relative to the host genes, and of the level of expression of the host gene (Figs 6 and S4). The amplification ratios for the proviruses in the seven genes where proviruses could be selected, although quite variable, were, on average, not significantly different from all the other genes ( Fig 6B). When the amplification ratios for each site were ranked and plotted in decreasing order, the sizes of the clones in the seven genes were scattered across the distribution (S5 Fig). Thus, the degree of clonal amplification cannot be taken as evidence of positive selection for a provirus in a specific location, and such proviruses are not a significant contributor to the overall level of clonal expansion.

PLOS PATHOGENS
What shapes the HIV integration landscape during antiviral therapy?

Fig 5. HIV IS in other genes in which proviruses can give the host cell a selective advantage.
Maps were generated for each of the genes in Table 3 as described in

PLOS PATHOGENS
What shapes the HIV integration landscape during antiviral therapy?

Integration in centromeric repeats
The complex nature of the alphoid repeat sequences precludes precise mapping of IS in centromeres, and, for this reason, they are not annotated in either the hg 19 or hg38 genome database and were not detected using our standard pipeline to identify and locate IS, even though the centromeres comprise approximately 7% of the genome [41]. [See, for an example, the large gene-free area around 40 Mb of chromosome 16 (Fig 3)]. It has recently been reported that in rare HIV-infected individuals who maintain very low levels of viremia in the absence of ART (known as elite controllers) the intact proviruses are preferentially found within centromeric sequences [42]. We used a manual protocol to search specifically for centromeric IS in a portion of the data from the PBMC and on-ART datasets. Although finding IS in centromeres is not trivial, and we are not confident that we identified all the centromeric IS in the datasets, Table 4 shows that there are detectable IS in the centromeric repeats in both the PBMC and on-ART datasets. The data show that the amplification ratio of the on-ART proviruses (1.70) is not significantly different from the proviruses found outside genes (1.62, P = 0.77). Nor is the amplification ratio in centromeric proviruses different from proviruses that are integrated in genes Proviruses integrated in centromeric regions do not appear to be expressed and there are no known genes in centromeres. Thus, these results support our conclusion that factors other than the misexpression of genes with integrated proviruses are the primary factors that control the clonal expansion of infected cells.

Discussion
HIV persists for decades on fully suppressive ART, primarily as poorly expressed proviruses in infected cells that are long-lived, and continue to divide. Many (perhaps all) of these long-lived cells are in expanded clones. The present study was undertaken to better understand the factors that affect which provirus-containing cells survive in vivo. We compared a large IS dataset obtained from PBMC infected in vitro to pre-ART and on-ART datasets compiled from a number of in-house studies of HIV-infected individuals. The approach we used [21,30,31] provided IS data that are unbiased by the distribution of restriction enzyme sites in the human genome. The data were extensively screened to remove cross-contamination and any PCR artifacts or other errors.
There are three factors that modify the initial distribution of proviruses in vivo: 1) negative selection against cells with proviruses in highly-expressed genes; 2) positive selection for cells with proviruses in specific regions of certain genes; and 3) stochastic effects, often due to immune signaling [43], that lead to clonal amplification of some infected cells. The primary determinant of the pre-and on-ART distribution of IS is their initial distribution at the time a Data from additional assays of some samples (S1 Table) were analyzed for IS in centromeric repeat DNA.
b Recalculated using the proportions in Table 1 to estimate the numbers of total and unique IS for the different-sized samples.
c A portion of the data from PBMC infected in vitro. Because "amplification" is due only to DNA replication immediately following infection, it is without significance here and is not reported.
d From the donors (S1 Table), including additional samples that were taken after those reported in the rest of the manuscript. the cells were infected. In the experiments we report here, that distribution was modeled using the distribution of IS in donor PBMC infected in vitro. In general, the three IS datasets are very similar. The excellent agreement, for most genes, in the distribution of the IS in the PBMC data and the two in vivo datasets supports our use of the PBMC IS dataset to accurately reflect the initial distribution of IS in vivo. The greater similarity between the distribution of IS in PBMC and the distribution of IS in the pre-ART samples compared to the on-ART samples is consistent with the pre-ART dataset comprising a mixture of newly infected, short-lived cells and cells that have been infected for longer times. Differences between the PBMC and on-ART IS datasets can provide insights into the selective forces that reshape the initial distribution of IS in vivo.
As has been known for many years, HIV DNA integration strongly favors the bodies of transcribed genes and the orientation of the provirus is independent of the chromosomal orientation of the host gene [7][8][9]. The strong association of HIV integration with expressed genes is shown in Figs 1 and 2 and Table 1, where the normalized integration density (IS/Mb) in expressed genes was more than 15-fold greater than in the rest of the genome. Integration density in PBMC was 8-fold greater in genes with high expression levels than in those expressed at low levels.
Many highly-expressed genes are good targets for HIV integration; however, high levels of gene expression do not always predict high levels of integration. This effect may be mediated by the distribution of LEDGF on chromatin [9,[44][45][46]. For example, in S2 Fig, which shows three adjacent STAT genes, all of which produce similar levels of RNA in PBMC. STAT5B is a good target for HIV integration, STAT3, less so, and STAT5A has very few IS. These discrepancies may reflect the presence, in the CD8-depleted PBMCs used for the RNA-seq analysis, of differentially expressed genes in cells that are and are not targets for HIV infection, or genes that express very stable transcripts, so that the steady state RNA levels measured by RNA-seq may not be indicative of the level of transcription, and there could be other explanations as well.
Almost all regions of the host genome that are not expressed are poor targets for HIV integration. This result was consistently seen both in vitro and in vivo. An unbiased search for clustered IS turned up one apparent intergenic region, on chromosome 6, in which there was a cluster of IS but no annotated gene. However, the RNA-seq data show that this region is transcribed, although not yet annotated as a gene encoding a protein or a known long noncoding RNA. This result suggests that clusters of HIV IS might be a useful way to check genome databases for genes that have not been annotated. Very rarely, we found annotated genes that were good targets for integration both in vivo and in vitro, in the absence of detectable levels of RNA in PBMC. HORMAD2, whose protein product is involved in meiosis, is a good example (S4 Fig), in that it had the largest number of proviral ISs of any non-expressed gene in both the in vitro-infected PBMC and on-ART datasets. Interestingly, but perhaps coincidentally, a provirus in this gene (in the form of a solo LTR) is in one of the largest clones we have seen in any on-ART donor [21,47].
Although the distributions of IS in genes in the PBMC, pre-ART, and on-ART datasets are very similar, some clear differences emerged when the frequency of the IS was plotted against the level of gene expression (Fig 1). The normalized fractions of IS in the three datasets were identical for genes that were poorly expressed; however, the fraction of the IS in highlyexpressed genes was higher in the in vitro PBMC dataset than in either the pre-ART or the on-ART datasets. The in vivo datasets also showed that there is selection against proviruses in either orientation in highly-expressed genes. Although an overall opposite-the-gene orientation bias has been previously reported for on-ART samples [21,29], the fact that the magnitude of the selective effect depends on the level of expression of the genes had not been reported. Nor was it previously known that there is a weaker selection against proviruses in highly-expressed genes that are oriented opposite to the gene.
It has been proposed that latent, intact, proviruses within genes may be more likely to be expressed than those in extragenic regions, leading to death of the cell and accounting for a bias for defectiveness for proviruses in genes [24], although the net effect is not large and counterexamples have been reported [48]. Although the vast majority of proviruses are defective, that type of selection could also apply to cells with defective proviruses if the defective proviruses expressed epitopes that are recognized by CTL, for which there is conflicting evidence [49][50][51]. Given the recent results showing that only a small fraction of cells, containing either intact or defective proviruses, express unspliced viral RNA [25], we think it more likely that the selection against cells with proviruses in highly-expressed genes is due to the effects of the provirus on the expression of the host gene. In the case of proviruses in the same orientation as the gene, these effects would involve the insertion of viral sequences containing signals for transcription initiation, splicing, polyadenylation, etc., interfering with its expression. Certain endogenous proviruses of mice are also known to affect the expression of the host gene in which they reside in both ways [52,53]. The effect on gene expression is limited to the allele in which the provirus is inserted, possibly explaining why the strongest selection is seen for proviruses that are integrated in the most highly-expressed genes. Proviruses integrated in the opposite orientation to a gene could also reduce its expression by transcriptional interference; i.e., collisions between RNA polymerases transcribing in opposite directions [54]. However, this effect on gene expression requires LTR-driven transcription and is therefore likely to be weaker for (largely nonexpressed) proviruses in the opposite orientation to the gene. In support of the interpretation that the effect is on the level of expression of the host gene, the effect is due to a weak selection on a large number of genes, rather than strong selection in a small number of genes. The observation that there is a much weaker relationship between gene expression and provirus loss in the pre-ART than in the on-ART samples (compare Fig 1, panels C and D) also supports this conclusion.
It has been known for nearly 40 years that modification of gene structure and expression by retroviral DNA integration can cause cancer in animals [55,56]. Proviruses are known to induce oncogenic modification of genes by the insertion of a promoter or an enhancer, which leads to the overexpression of a host oncogene and/or a truncation of the encoded protein [57]. Similar mechanisms likely affect the expression of the three genes (BACH2, MKL2,and STAT5B) previously identified as targets for HIV provirus mediated preferential expansion or survival of infected T cells in humans [21,23,28,36,58]. In the present study, we did a comprehensive search for additional genes in which an HIV provirus insertion could cause similar posititve selection for infected T cells. Host genes in which a provirus could help the cell proliferate or survive were identified by comparing the IS frequencies in individual genes in the on-ART dataset and the PBMC dataset. Genes were identified based on finding a large increase in the integration frequency in vivo, as well as selection for proviruses in the same orientation as the gene, and clustering of the proviruses in one or a few introns. A complete list of genes with at least one IS in any of the datasets, sorted by thes criteria can be found in S2 Data.
Using the same criteria, we found three additional genes in which proviruses contributed to the growth and survival of the cell: MKL1, IL2RB, and MYB. A fourth gene, POU2F1, an OCT-1 binding transcrition factor, may also belong to this group. Because we began with a large on-ART dataset, this set of seven likely represents a complete, or nearly complete, set of the host genes which are good targets for HIV DNA integration (>10 IS in our PBMC dataset), and in which enhancement of growth and/or survival of the host cell by an integrated HIV provirus is likely to play an important role in HIV persistence. As described in Results, there were only 10 (out of 33,336) unique ISs in MYB in the on-ART dataset, all in the same orientation as the gene. While it is possible that there are other poorly-targeted genes in which HIV proviral integration would have similar effects, any gene that has not yet been identified must be a very poor target for HIV integration. Of the 20,207 genes in our modified database, only about half (11,422) had one or more IS in the PBMC dataset, and only about one quarter (5,159) had more than 10. Any gene we might have missed would have had even fewer activating proviruses than MYB and POU2F1. If either MYB or POU2F1 was excluded from the group, there would be only a small decrease in the number of cells in which a provirus was selected (10 out of the 398-2.5%) proviruses integrated in the seven genes that are oriented in the same direction as the gene). Similarly, adding a gene to the list that had fewer than 10 proviruses would make only a very minor difference in the numbers, and would not have affected the conclusions.
Proviruses integrated in one of these seven genes comprised, together, approximately 1.7% of the total unique proviruses. Many of the proviruses in these genes are defective (Hill, S. and Maldarelli, F, unpublished observations); there are no published reports of an intact provirus in any of these genes. If clonally expanded cells are counted, the cells with proviruses in the seven genes comprise about 2.2% of the total infected cells. Thus, it is unlikely that the cells that have expanded because they have a provirus that drives their proliferation and/or survival are an important part of the viral reservoir, which is, by definition, composed of intact proviruses.
We and others have previously pointed out that a large fraction of the HIV proviruses in on-ART samples are integrated in genes/oncogenes that play important roles in the proliferation and survival of cells [21,23,59]. However, as the data we present here clearly show, this preference is a result of the propensity of HIV proviruses to integrate in expressed genes, and is not due to post-integration selection for cells that have proviruses integrated in oncogenes other than the ones that have already been discussed. For all other oncogenes, the fraction of the IS in the gene does not increase when the PBMC data and the on-ART data are compared, showing the absence of strong post-integration selection in vivo for proviruses in other oncogenes. It is noteworthy that selection for integration in STAT3 was not observed (S2 Fig), although such integration has been found to promote growth of T cells in vitro [60].
As previously mentioned, proviruses of oncogenic retroviruses can modify the expression of host genes [56,57], and it is likely that HIV proviruses can use similar mechanisms to alter host gene expression. In BACH2, STAT5B, and IL2RB, the selected proviruses are in introns upstream of the start site for translation, and probably provide a new promoter for the gene, and a new upstream leader sequence that is derived from the provirus [28]. Proviruses integrated in MKL1 and MKL2 interrupt the protein coding region, which, if the mechanism of activation is promoter insertion, would lead to the expression of truncated proteins that lack amino terminal domains present in the normal proteins. In the case of MYB, the proviruses are in a large 3' intron and are likely to cause premature polyadenylation, resulting in a protein product that lacks the correct 40 C-terminal amino acids. The inserted proviruses would also lead to the synthesis of an mRNA lacking the long 3' noncoding region of MYB mRNA, which has been reported to contain signals that promote rapid RNA degradation [61]. These are all mechanisms that oncogenic retroviruses use to modify expression of host genes [56].
The cells in which an HIV provirus has integrated into one of the seven oncogenes have a growth and/or survival advantage, but they are not cancer cells. However, as with cancer, proviral induced misexpression of the host target gene product may promote cell division that would override the normal growth controls. Alternatively, the altered protein may provide a survival advantage, by, for example, suppressing apoptotic signaling, which is a normal part of the T cell immune response. Of the seven genes in which proviruses led to positive selection, only three were in cells that were clonally amplified to a greater extent than average for all cells in the on-ART samples (Fig 5C). It is possible that this difference reflects two different types of positive selection: Cells in which the provirus is in STAT5B, MKL2, and MYB may have increased potential for growth and cells with proviruses in BACH2, MKL1, IL2RB, and POU2F1 may have better long-term survival due to a reduction in apoptotic signaling.
Clonal amplification is the most striking feature of persistent proviruses on-ART and is responsible for the majority-perhaps all-of the infected cells that persist on long-term ART, whether they contain intact or defective proviruses. We show here that provirus-driven clonal expansion accounts only for a small minority of the clonally expanded cells, which means that forces independent of the location and structure of the provirus cause the majority of the clonal expansion that characterizes infected cells in individuals on ART. Thus, in most cases, the proviruses only provide useful markers for the cells in expanded clones. In support of this conclusion, we found that the extent of clonal expansion is independent of the location of the provirus, whether it is in a gene or intergenic, the expression level of the host gene, and the orientation of the provirus relative to the gene (Fig 5). Even when integration is in centromereassociated alphoid repeat DNA, which is believed to be highly inimical to proviral expression [42,62], the extent of amplification was not different from what was found for all of the proviruses taken together. We therefore conclude that proviral-driven expression of "cancer genes" plays only a minor role in the clonal expansion of infected cells, and that other mechanisms, probably antigen-driven or homeostatic expansion, account for the proliferation and persistence of the large majority of clones of infected cells in individuals on ART, especially those with intact proviruses.
The differences in the distributions of the proviruses in the PBMC infected in vitro and in the pre-and on-ART samples, although highly significant, are small (Fig 2). This result implies that the location of the proviruses in the host genome has only a small effect on the survival of an infected cell. Because cells that carry expressed proviruses are preferentially lost on ART, we conclude that the location of a provirus in the host genome has little effect on the expression of the provirus in long-lived cells. We show that there is a modest loss of cells with proviruses in highly expressed genes, and the loss is greater for the provirus in the same transcriptional orientation as the gene. However, as we have already discussed, this selection is more likely to be due to the effects of the proviruses on the expression of the host genes than the other way around. Turning the problem around, if the cells that carry proviruses that are not expressed have a survival advantage, and if the location of the proviruses in the genome was a substantial factor in whether the proviruses were expressed, we would have expected to see a larger shift in the distribution of the IS on ART. Although there is, with time on ART, a very slow and erratic selective loss of relatively intact or inducible proviruses [49,63,64] which leads to an enrichment in the fraction of defective proviruses [47], there is no evidence that the location of the provirus plays a role in this selection. Our conclusion that the location of an HIV provirus in the genome has little, if any, effect on its expression in infected individuals is supported by data, obtained from cells that were infected with HIV-based vectors in vitro, that the distributions, in the host genome, of the vector proviruses that are and are not expressed are quite similar [62].
There are limitations on the interpretations that can be made based on the data we have presented. For one, the analysis of the behavior of HIV infected cells and the proviruses they carry are limited by the depth of sampling that can be performed. A 100 ml blood sample contains, on average, about 10 8 CD4+ target cells, about 10 −4 of all the CD4+ T cells present in the body. Assuming that the cells are uniformly distributed, and approximately 1000 IS are obtained from such a sample, we can expect to recognize a clone of infected cells if it contains >~10 5 cells [14]. A clone of this size would be at least 16 generations removed from its originally infected progenitor cell. Although it is possible, by obtaining more IS, to identify clones smaller than 10 5 cells, as a practical matter it is not possible to identify clones much smaller than 10 4 cells. Although it is likely that the large majority-if not all-of the infected cells in patients on long-term ART are in clones of cells that have divided multiple times since they were initially infected [65], our hypothesis cannot be demonstrated experimentally.
Second, our assays monitor only IS, not the structure of the appended proviruses, putting a limitation on our ability to ask whether the results we have presented would differ if we had separately analyzed only the intact, potentially infectious, proviruses. It is now possible, to a limited extent, to determine both the integration site and the structure of the linked provirus [24,26]. There are reports of differences in the behavior of cells with intact and defective proviruses during long-term ART in the form of a preferential loss of cells that carry intact proviruses, and that intact proviruses are more commonly outside of genes [24,[66][67][68]. Testing of these proposals will require analysis of much larger numbers of proviruses and their linked IS than is possible with current technology.
In conclusion, the distribution of HIV proviruses, even after years of complete virological suppression, in which the majority of the infected surviving cells are many generations removed from the originally infected cells, still closely resembles the distribution of the proviruses at the time the cells were initially infected, represented by the proviral distribution in PBMC acutely infected in vitro. In addition to a modest modification to the initial distribution by a positive selection for proviruses integrated in the seven genes we have listed, the distribution is also modified by selection against cells with proviruses integrated in highly-expressed genes. There is also extensive clonal amplification; however, more than 98% of the amplification results from factors in which the proviruses and their locations in the genome play no apparent role in the behavior, growth, or survival of the infected cells.

Ethics statement
PBMC were isolated from blood drawn from HIV-infected donors participating in one of a number of clinical studies under the aegis of the HIV DRP, the University of Pittsburgh, and Stellenbosch University as described in the references to S1 Table. Both donors provided written informed consent and the donation protocol was approved by the University of Pittsburgh Institutional Review Board.
Donors and cells. PBMC were isolated from 120 mL of blood from two anonymous healthy, HIV negative donors and depleted of CD8+ T cells using Dynabeads (ThermoFisher). Approximately 30 x 10 6 cells were placed in 30ml of R10 medium, were treated with or without 0.5ug/mL of phytohemagglutinin (PHA) [69], and were incubated at 37˚C for 2 days.
Aliquots of 10 7 cells, were pelleted and resuspended in 2mL of fresh RPMI-1640 medium, with or without PHA, supplemented with 10% FBS and containing 10 7 IU/ml HIV-1 BAL [70] for a multiplicity of infection of 1 infectious unit/cell as assayed in GHOST cells [71]. After 2 hours, the cells were washed and incubated at 37˚C for a further 2 days. 10 7 cells from each flask were pelleted, resuspended in 1ml fresh cryopreservation medium, and stored at -80˚C. Except for a ca. 8-fold decline in the number of IS obtained, there was very little difference in expression of the genes or in the distribution of the IS in the PHA stimulated and unstimulated cells; (see S6 Fig).
Integration site analysis. The distribution of HIV IS in all PBMC samples was determined by linker-mediated nested PCR of fragmented DNA using LTR and linker-specific primers and paired-end Illumina sequencing as described [21]. Raw data were analyzed using the bioinformatic pipeline described by Wells et al [31], which includes steps designed to remove artifacts due to mispriming, cross contamination, etc. such as those found in Cohn et al. [31,35]. When we compared the gene-by-gene distribution of the PBMC IS from the two donors (using the Excel CORREL function), the correlation coefficient (r) was 0.98, and, for all analyses, the two datasets were combined, giving 384,295 total IS. All IS data have been posted on the Retroviral Integration Database (RID) at https://rid.ncifcrf.gov/ [72], and can be found using the PubMed ID of this paper. All data are also present in the Microsoft Excel document that can be found in the "ISA" sheet of S1 Data, along with all the IS and RNAseq data used in this study, The PBMC dataset has also been used as a comparator for a study on SIV IS distribution in the macaque model [29].
The IS analysis pipeline used for this study reports the site (at the end of the 3' LTR) in the hg19 human sequence database, the orientation of each provirus relative to the host chromosome, and the number of differently sheared flanking host DNA sequences (or breakpoints) associated with each distinct IS. The breakpoint count indicates the relative number of clonally-expanded cells containing the same provirus. For most analyses, the clonally expanded IS data were collapsed so that only a single unique IS was used. Since our analysis involves simultaneous amplification of IS using primers for both 3' and 5' LTRs, it is possible that we are slightly overcounting the number of clonally amplified proviruses. This overcount would not affect any of our conclusions, since most analyses (Figs 1-5) used collapsed data and, in the analysis shown in Fig 6, it would have affected all counts equally, leaving the conclusion intact.
Although most current human genome data are now reported in the hg38 format, most of the integration site data, including the published patient data used in this study, were collected from many early studies using hg19. To be consistent, we chose hg19 for all the integration site analysis and hence the RNAseq data also in hg19. We are in the process of transferring the data in S1 Data to hg38, and will make it available when we do so.
To map IS to centromeric regions, fastq files generated from the Illumina platform were processed and analyzed using our in-house standard pipelines for HIV integration sites analysis, with specific modifications to accommodate centromere mapping. Reads properly trimmed and filtered were subjected to centromere alignment using BLAT without -ooc filter to GRChg38 (released Dec. 2013) centromere assembly obtained from the University of California Santa Cruz genome browser (https://genome.ucsc.edu) as a reference. Reads mapped to centromeres were further merged and filtered to identify real integration sites. Manual inspections were used to finalize the mapping results.
Gene locations were obtained from the RefSeq set [34]. For counting purposes, in cases of overlapping genes, the 5'-most gene was truncated, and genes entirely inside other genes were removed so that all sequences in host genes were assigned to a single gene. This correction was necessary to avoid double counting of IS where there are overlapping genes.
The maps shown in Figs 3-5 and S3-S4, were generated using a custom-made Microsoft Excel workbook, as described in S2 Fig, which also contains all the data used in this paper and allows the user to select up to 5 datasets at a time, and choose a specific gene, chromosome, or chromosomal region to be viewed. The program and all the IS and RNAseq data used herein are available in S1 Data. P values for most comparisons were calculated using a binomial or Poisson distribution or Fisher's exact test, as indicated.
RNA-seq analysis. Total nucleic acid was extracted from one million cells each of the CD8-depleted PBMC infected with HIV from both donors with or without PHA stimulation. RNAseq was carried out using an Illumina TruSeq mRNA library kit with 200ng of poly(A) selected RNA as input. Libraries were pooled and sequenced on Illumina NextSeq500 with TruSeq V2 chemistry paired end reads x 75bp, yielding~100 million reads per sample. Sequencing was performed at NCI CCR Sequencing Core Facility. RNAseq data analysis was performed using the standard RNAseq workflow in CLC Genomics Workbench V12. Reads were mapped to human genome hg19 and Ensembl genes annotation v74 was used to calculate gene expression values as transcripts per million reads (TPM). Values obtained from the two donors were averaged. Gene expression levels in the PHA-stimulated, infected PBMC ranged from 0.5 to 12725 TPM. Genes with < 0.5 TPM are referred to as "unexpressed". Only protein coding genes and lncRNA genes were included for combined gene expression and IS analysis.