Epigenomic characterization of latent HIV infection identifies latency regulating transcription factors

Transcriptional silencing of HIV in CD4 T cells generates a reservoir of latently infected cells that can reseed infection after interruption of therapy. As such, these cells represent the principal barrier to curing HIV infection, but little is known about their characteristics. To further our understanding of the molecular mechanisms of latency, we characterized a primary cell model of HIV latency in which infected cells adopt heterogeneous transcriptional fates. In this model, we observed that latency is a stable, heritable state that is transmitted through cell division. Using Assay of Transposon-Accessible Chromatin sequencing (ATACseq) we found that latently infected cells exhibit greatly reduced proviral accessibility, indicating the presence of chromatin-based structural barriers to viral gene expression. By quantifying the activity of host cell transcription factors, we observe elevated activity of Forkhead and Kruppel-like factor transcription factors (TFs), and reduced activity of AP-1, RUNX and GATA TFs in latently infected cells. Interestingly, latency reversing agents with different mechanisms of action caused distinct patterns of chromatin reopening across the provirus. We observe that binding sites for the chromatin insulator CTCF are highly enriched in the differentially open chromatin of infected CD4 T cells. Furthermore, depletion of CTCF inhibited HIV latency, identifying this factor as playing a key role in the initiation or enforcement of latency. These data indicate that HIV latency develops preferentially in cells with a distinct pattern of TF activity that promotes a closed proviral structure and inhibits viral gene expression. Furthermore, these findings identify CTCF as a novel regulator of HIV latency.

Introduction HIV infection continues to be a major global health problem, with 37 million infected individuals and approximately one million deaths per year [1]. HIV infection can be potently controlled by combination antiretroviral therapy (ART), with treatment suppressing viral loads to undetectable levels and allowing HIV-infected patients to live lives of roughly normal lifespan. Nevertheless, persistent health complications in treated patients, as well as side effects of ART, make the development of a cure for HIV a high priority [2][3][4][5]. Moreover, rapid rebound of viremia after treatment interruption demonstrates the persistence of HIV-infected cells, even after long-term ART. Indeed, longitudinal studies to estimate the half-life of this reservoir indicate that greater than 72 years of treatment would be necessary for elimination of all infected cells [6,7]. The mechanisms of HIV persistence are not completely understood, but likely involve several factors. First, HIV is able to establish a latent infection, characterized by the presence of a transcriptionally silent provirus, in a subset of host cells [8,9]. Sporadic reactivation of these cells may occur continuously, and leads to viral rebound if treatment is interrupted [10]. The precise location of this latently-infected reservoir is debated, but is likely distributed across a number of CD4 T cell subsets, including long-lived memory cells, thereby explaining the long persistence of infection [11][12][13]. Second, HIV-infected cells can undergo homeostatic expansion in vivo during ART, further replenishing the pool of infected cells [14][15][16]. Understanding the molecular mechanisms of how latency is established and maintained will be critically important to developing strategies to prevent or eliminate the latent reservoir. Certain cell types, such as resting memory CD4 T cells, are a suboptimal environment for HIV transcription, due to limiting availability of transcription factors required for HIV gene expression, including NF-κB, AP-1 and P-TEFb [17][18][19][20]. Stochastic variation in the levels of the viral Tat protein during infection may also contribute to latency [21]. Furthermore, covalent modification of provirus-associated histones by histone-modifying enzymes such as histone deacetylases (HDACs) or histone methyl transferases (HMTs), such as the PRC2 or HUSH complexes, can have a repressive effective on viral transcription [22][23][24][25][26], and their role in the maintenance of latency is strongly supported by evidence of latency reversal in vivo [27].
We have previously established, using a primary CD4 T cell model of HIV latency, that latency can occur in diverse host cell environments, but occurs preferentially in cells that express markers of quiescent central memory T cells (Tcm), and that exhibit high proliferative potential [28]. These results indicate that the establishment of latency is influenced by the intrinsic biological program of the host cell. However, the mechanistic details of how specific host cell environments or phenotypes impact the initiation or maintenance of latency are unknown. Gene activity can be regulated by changes in the structure and accessibility of chromatin. These changes can be mediated by chromatin remodeling complexes that are recruited to sites of active transcription by transcription factors (TFs) and result in the removal or repositioning of nucleosomes near transcription start sites (TSS) [29]. Furthermore, enzymes that add or remove covalent modifications to histones tails can create a histone "code" that affects the structure of the chromatin and creates docking sites for additional regulators [30]. Changes in chromatin structure and accessibility can then allow greater access to the promoter by core transcriptional regulators including RNA polymerase, thereby facilitating transcription. To further investigate our observations, we sought to characterize primary CD4 cells in which latency has become established by defining chromatin-based characteristics of these cells. The results revealed an association of the latency phenotype with a distinct pattern of chromatin accessibility and reduced accessibility of the HIV genome. Furthermore, we identify a set of cellular transcription factors with differentially accessibly binding sites in latently infected cells, which may play a role in influencing the course of HIV transcriptional silencing and the entry into or maintenance of the latent state. In particular, we investigate and confirm the role of CTCF during HIV latency, implicating this protein as novel latency-regulating factor.

HIV-infected cells enter a stable, heritable state of latency in a cell culture model
We have previously established a cell culture model of HIV latency (Fig 1A). In this model, primary CD4 T cells are activated through TCR stimulation, then infected with a reporter HIV strain that encodes a destabilized eGFP gene (herein referred to as HIV-GFP) [28,31]. Actively infected (GFP+) cells are purified by flow sorting at 2dpi, and then cultured for up to 12 weeks. During this time period, viral gene expression progressively diminishes, and a subset of the cells become GFP-(latently infected), while the remaining cells exhibit variegated levels of viral gene expression. To examine the stability of the viral gene expression phenotype, we resorted this mixed culture at 12wpi into GFP+ (actively infected) and GFP-(latently infected) (Fig 1B) and cultured these cells independently for six additional days. Notably, the infected cells retained their viral gene expression level, that is GFP+ cells remained GFP+, and GFPcells remained GFP-, for this time period indicating that the level of viral gene expression is a relatively stable property of the infected cells (Fig 1C). We also examined the proviral copy number in the GFP-cells by quantitative PCR, and observed an average of almost exactly one copy per cell (1.05), indicating that these cells retained proviral DNA. To look at the effect of cellular stimulation on viral gene expression in latently infected cells, we isolated the latently infected (GFP-) cells by flow sorting, and then restimulated them through their TCR for 3 days, monitoring GFP expression over time (Fig 1D). At the time of stimulus removal, the latently infected cells had potently upregulated GFP expression and a large fraction of the cells (90%) had become GFP+. Interestingly, after removal of the TCR stimulus, the majority of reactivated cells then rapidly lost viral gene expression and returned to latency (GFP-) within 14 days. A residual population of around 20% of the reactivated cells retained a low level of GFP expression after this single round of reactivation. To further investigate this finding, we stimulated these cells three additional times over a total period of 48 days (Fig 1E). Similar to the initial restimulation, after each subsequent stimulation the latently infected cells rapidly upregulated GFP expression, then returned to latency with highly consistent kinetics after stimulus removal. Specifically, the cells returned to a largely GFP-phenotype by 7d post stimulation. Notably, during this period of the time the cells also underwent substantial cell division and cell numbers increased by~170 fold (Fig 1F). This finding suggests that latency is a stable, intrinsic property of infected cells that is rapidly re-established after viral reactivation, and can be transmitted to daughter cells during cell division. This conclusion is consistent with previous studies showing rapid reestablishment of latency after stimulation of infected cells with histone deacetylase inhibitors [32].

Latently infected cells exhibit differentially accessible chromatin
The stable, heritable character of HIV latency in this model system lead us to hypothesize that epigenomic or chromatin-based changes in infected cells were associated with reversible but recurring viral silencing. To examine whether the establishment of latency in our model system was associated with specific changes in chromatin accessibility, both within the HIV genome and within the host cell genome, we performed Assay of Transposon-Accessible Chromatin sequencing (ATACseq) on infected cells. ATACseq uses a hyperactive Tn5 transposon to probe and identify regions of open chromatin [33,34]. To compare cells with latent infection to those with active viral transcription, we sorted HIV-GFP infected cells at 12 weeks after infection based on viral gene expression into GFP+ or GFP-populations, representing cells from the upper and lower 25% of GFP expression intensity respectively, and generated ATACseq libraries from the sorted cells. To ensure biological reproducibility, we analyzed infected cells derived from two independent donors. These data were then aligned to a combined reference genome including the human genome and HIV. Quality control analyses showed nucleosomal laddering of fragment sizes and enrichment of reads at transcription start sites (TSS), indicating high quality data (Fig 2A and 2B). Replicate analysis of cells derived from the same donor clustered closely together by principal component analysis (PCA), indicating high reproducibility (S1 Fig).
Next, we compared accessibility of chromatin from the GFP+ and GFP-infected cells in greater detail by calculating differential accessibility for each of the accessibility peaks present in the samples. These data showed that GFP-and GFP+ cells have distinct chromatin accessibility patterns, with several thousand differentially open peaks between latently infected and actively infected cells (Fig 2C). Specifically, using a false discovery rate (FDR) cutoff value of 0.05, we observed 5021 peaks that were significantly more open in GFP+ cells, and 4068 peaks that were more open in GFP-cells, out of 277992 overall peaks detected across all samples ( Fig  2C). These data thus indicate that viral latency in this system is correlated with a specific pattern of overall genomic accessibility.
Notably, a high proportion of differentially accessible peaks (72%) were associated with genes, indicating that the data likely reflect dynamic mechanisms regulating gene expression. To further validate the ATACseq data, we also analyzed matched RNAseq profiles from these infected cells. Importantly, we observed that ATACseq peaks were highly enriched in genes that displayed differential transcriptional levels between the samples (P = 8.17x10 -10 ),

Fig 1. HIV infected CD4 T cells enter a stable, heritable state of viral latency in vitro. A.
Schematic overview of primary CD4 T cell HIV latency model. B. Flow cytometry plot showing a representative gating of actively infected (GFP+) and latently infected (GFP-) cells. C. GFP+ and GFP-cells were flow sorted from the HIV-GFP infected population and cultured for 6 days post sorting. Viral gene expression was measured at days 0, 3, and 6 post sorting, and the percentage of cells in the GFP+ gate was measured. Each bar represents the average of three independent replicates. D. Sorted infected GFP-cells at 12wpi were reactivated using αCD3/CD28 beads. At 24h, unstimulated cells (U) and stimulated cells (+αCD3/CD28) were analyzed by flow cytometry and the percentage of GFP+ cells analyzed. The reactivated cells were then removed from the activating beads and cultured for an additional 14 days (+14d), then analyzed again by flow cytometry. E. Sorted GFP-cells from an infected population were serially stimulated through their TCR (+αCD3/CD28) and GFP expression was monitored over time by flow cytometry. Shaded areas indicate times during which αCD3/CD28 beads were added to the culture to stimulate the cells. Each datapoint represents the average of two independent replicates. F. Cell numbers were counted at selected timepoints during the serial stimulation shown in E and total cell numbers are shown in millions of cells (M).
https://doi.org/10.1371/journal.ppat. 1009346.g001 indicating that changes in the chromatin accessibility were associated with transcriptional regulation (S1 Table). Direct comparison of the magnitude of individual peak fold changes to fold changes in transcript levels show low correlation (r = 0.123) for genes with significant changes to both transcript level and peak accessibility, likely reflecting the non-linear relationship between changes in chromatin accessibility and transcription (S2 Fig). 471 genes exhibited both a significant change to transcript level and a change in accessibility (S2 Table). This list included several genes with known roles in HIV latency and transcription (eg RUNX1 and RELB), but also many genes that have not previously been associated with HIV latency. We also examined the phenotypic identity of these cells by comparing the transcriptional profiles of the infected cells to reference datasets of primary immune cell types [35], and observed that these cells most closely resembled resting memory CD4 T cells, the natural host cell type for latent HIV (S3 Fig).

The HIV genome exhibits reduced accessibility in latently infected cells
Next we aligned the ATACseq reads to the HIV genome to determine whether HIV latency is associated with differential accessibility of the provirus. For both GFP+ and GFP-infected cells the HIV genome exhibits major accessibility peaks surrounding the 5' transcription start site (TSS) at nucleotide (nt) 455, and within the 3' LTR (Fig 3A). Downstream of the TSS peak, after nt 800, the virus genome becomes significantly less accessible, consistent with the hypothesis that chromatin structural barriers proximal to TSS impede processive viral transcription [36,37]. We also observed an overall increasing gradient of openness across the 3' half of virus in actively infected cells, leading to the large peak in the 3'LTR region. (Fig 3A). Overall we observed significantly reduced accessibility of the HIV provirus in the latently infected cells relative to the actively infected cells (2.45 fold, FDR = 2.87x10 -13 ). To identify specific regions within the HIV genome with differential accessibility, we subdivided the HIV genome into 73bp bins and individually calculated the fold change and false discovery rate for each bin ( Fig  3B). This analysis showed significantly reduced accessibility across the viral genome in latency, but a particularly large difference towards the 3' end of the provirus. This finding is consistent with the hypothesis that physical barriers limiting the accessibility of the provirus to factors that regulate transcription likely contribute to latency.

Latency reversal is associated with changes in accessibility in the host cell and viral genomes
We next examined the impact of latency reversing agents (LRAs) on latently infected cells. We stimulated HIV-GFP infected cells at 12wpi with three different LRAs, each with a different mechanism of action-vorinostat (histone deacetylase inhibitor), AZD5582 (SMAC mimetic), and prostratin (PKC agonist) for 24 hr. We have previously examined the transcriptomic impact of these three LRAs in primary CD4 T cells [38]. At 24h, we measured the response of viral gene expression to the LRAs by flow cytometry. Vorinostat and AZD5582 caused modest increases (~5-10%) in the percent GFP+ cells in the culture, indicating weak latency reversal. By contrast, prostratin caused a more substantial increase in viral gene expression (~25%) (Fig  4A).
We then performed ATACseq on the whole population of LRA stimulated cells, as well as on cells that had been exposed to vehicle (DMSO 0.1%) only. We first examined the impact of LRA stimulation on accessibility of the HIV genome. Notably, all three LRAs caused an increase in fraction of reads that mapped to the HIV genome, although trend was most pronounced for prostratin (Fig 4B and 4C). Specifically, vorinostat and AZD5582 caused 1.47 fold (FDR = 1.49x10 -2 ) and 1.57 fold (FDR = 8.81x10 -2 ) increases in proviral accessibility respectively, while prostratin caused a 2.22 fold increase, (FDR = 1.87x10 -8 ). This observation suggests that latency reversal is associated with reopening of the closed latent provirus, and that the potency of LRAs is related to the magnitude of proviral reopening that they induce. To examine changes in viral accessibility after LRA stimulation in more detail, we subdivided the HIV genome into 73bp bins and individually calculated the fold change and false discovery rate for each bin (Fig 4D). Interestingly, each LRA induced a distinct pattern of changes to proviral accessibility. Prostratin and AZD5582 caused significantly increased accessibility mainly towards the 5' end of the virus, while vorinostat-induced reopening occurred mainly towards the 3' end of the virus. (Fig 4D). The biological basis for the differential behavior of the LRAs in this analysis is unknown, but suggests the existence of previously unappreciated mechanisms of interaction between each agent and the provirus.
By alignment of data from the LRA stimulated cells to the human genome, we were able to identify regions of cellular chromatin for which accessibility was modulated by each agent.  PCA analysis of accessibility peak data from the four conditions showed that replicate samples stimulated with a given LRA tended to cluster together (S4 Fig). Prostratin and AZD5582 stimulated cells formed distinct clusters separated from each other and vorinostat by PC2. Vorinostat and control vehicle (0.1% DMSO) stimulated cells overlapped in the plot indicating a more limited impact of vorinostat on the accessibility of the host cells. Prostratin had the most dramatic effect on the cells, consistent with its known role as a potent modulator of cellular transcription. Analysis of individual peak dynamics by volcano plot also revealed that each LRA induced a distinct pattern of changes to the accessible chromatin of infected cells (Fig 5). Prostratin had the most pronounced impact and caused numerous accessibility peaks to open and to close. Specifically, 33723 peaks were more open, while 22458 peaks more closed after 24h prostratin stimulation. By contrast, vorinostat caused much more modest changes to host cell chromatin accessibility, with 2326 peaks more open and 3323 peaks more closed. Interestingly, the effects of AZD5582 on chromatin accessibility were highly asymmetric, with 11362 peaks more open and 4945 peaks more closed. This observation is consistent with our previous observation of the asymmetric effect of this compound on transcript levels [38]. Similar to our earlier observations, we found that LRA-induced changes in ATACseq peak accessibility were highly enriched in genes that displayed differential RNA levels after LRA stimulation (S1 Table and  Overall these data indicate that the ATACseq data accurately reflects mechanisms that regulate the transcriptional response to LRAs.

Differential accessibility of specific transcription factor binding sites is associated with latency and latency reversal
To further investigate the regulation of HIV latency, we next examined the enrichment of specific transcription factor (TF) binding sites in the differentially accessible chromatin peaks identified by analysis of the ATACseq data. Since active transcription factors typically promote local opening of chromatin surrounding their binding sites, this information can be used to infer the activity of specific TFs in cells of interest. To investigate this, we analyzed the set of variable peaks using a motif enrichment method, HOMER [39]. This method analyzes defined DNA regions for statistical enrichment of known TF binding site sequences, relative to a background reference sequence. First, we compared the TF site enrichment in the DNA sequences of differential accessible peaks between actively infected (GFP+) and latently infected (GFP-) cells, using the non-differentially accessible peaks as a background baseline (FDR>0.2). From this, we derived a list of transcription factors with binding sites that were enriched in the peaks that were differentially open between latently infected and actively infected cells (Tables 1 and S3 and S4). We identified 406 potential TF binding sequences that were preferentially enriched (P <0.05) in GFP+ cells and 252 sequences that were enriched in GFP-cells. Notably, many of the most highly enriched motifs were variants of consensus binding sites for members of large TF families. Since the binding sites of the individual members in these families are often very similar, precise disambiguation of which specific TFs are active in these cells is difficult to achieve from this data alone. Nevertheless, from examination of the 20 most enriched binding site motifs for each condition, several clear trends were noticeable from the data. Actively infected (GFP+) cells displayed strongly elevated activity for three TF families-AP-1 related TFs, Runt-related (RUNX) TFs, and GATA TFs. Latently infected (GFP-) cells, by contrast, displayed elevated activity of Kruppel-like TFs (KLFs), and Forkhead (FOX) TFs. Notably, several members of these TF families have known roles in regulating HIV gene expression indicating that the results are biologically valid. In particular, AP-1 has well characterized binding sites within the HIV promoter, and has been demonstrated to promote HIV transcription [40][41][42]. These results are also consistent with our previous scRNAseq findings in this  model, which found elevated and FOXP1 expression in latently infected cells, and elevated GATA3 expression in actively infected cells [28]. To further validate these observations, we also compared our ATACseq data to existing ChIPseq datasets from the Jurkat T cell line [43] for three selected transcription factors (RUNX1, GATA3, MYB) identified by our motif enrichment analysis as exhibiting increased binding site accessibility in actively infected cells.
Notably, we observed that peaks with increased accessibility in actively infected (GFP+) cells were significantly enriched with ChIPseq-confirmed binding sites for these TFs, relative to unchanged peaks (S5 Fig and S8 Table).

LRAs induce opening and closing of TF binding sites
To examine whether LRA stimulation triggers changes in the accessibility of bindings sites for specific TFs, we also performed HOMER analysis on the differentially open peaks after LRA stimulation. Each LRA induced a distinct set of TF binding sites to open after stimulation (Tables 2 and S5-S7). For AZD5582, the most strongly enriched TF binding site was NFκB, consistent with this compound's known ability to trigger non canonical NFκB activation [38]. Other AZD5582-induced TF binding sites included PITX1, SMAD3 and RUNX1. For vorinostat, the most highly induced TF binding site was NKX2.1, followed Nanog, THRb and SMAD3. Prostratin induced the opening of binding sites for numerous TFs, most potently for members of the AP-1 family (BATF, FRA1, FRA2, JUNB), suggesting that activation of AP-1 could be a key mechanisms of latency reversal for prostratin and other PKC agonists. We also observed that the three LRAs induced increased accessibility in regions that were enriched for ChIPseq-validated binding sites for the transcription factors we had previously examined (GATA3, RUNX1 and MYB) (S8 Table and S5 Fig). Specifically, AZD5582 increased accessibility of GATA3, RUNX1 and MYB ChipSeq sites, prostratin increased accessibility for GATA3 and RUNX1 sites, while vorinostat increased accessibility for MYB sites (S8 Table).

CTCF regulates HIV latency
We also examined enrichment of TF binding sites present in the set of differentially open peaks of GFP+ and GFP-cells through a different approach-using the overall genome as a background reference baseline ( Table 3). Although many of the TFs binding site motifs identified by this approach were also identified by our previous analysis, several notable differences were apparent. Interestingly, the transcription factor with the highest degree of enrichment in the differentially open peaks between GFP+ and GFP-cells relative to the overall genome was the insulator protein CTCF. An overall set of 366 CTCF binding sites were differentially accessible between these populations, with 306 CTCF binding sites more accessible in GFP+ cells, and 60 sites more accessible in GFP-cells (Fig 6A). Furthermore, CTCF binding sites were also highly represented within differentially accessible peaks after LRA stimulation (Fig 6B).
We also compared our data to ChIPseq datasets for CTCF [43] and confirmed that the differentially accessible peaks were highly enriched for ChIPseq-confirmed CTCF binding sites ( S5  Fig and S8 Table). These data suggest that widespread changes in accessibility for CTCF binding sites are associated with HIV latency and latency reversal. These changes occur in both directions, although more CTCF sites were accessible during active infection than in latency. CTCF is a large, ubiquitously expressed regulator of gene expression with an 11 zinc finger central domain and N and C termini that interact with numerous protein partners [44]. The human genome contains 15,000-40,000 CTCF binding sites [44,45], and CTCF homodimerization drives the formation of large topologically-associated DNA domains (TADs) by mediating chromatin loop structures [46]. These loops serve to insulate domains of transcriptional regulation by disconnecting genes from nearby enhancer elements, and by restricting lateral spread of domains of both inhibitory and activating histone modifications [47,48]. Interestingly, CTCF is known to play an important role in regulating transcriptional latency for a number of other viruses, including HSV, KSHV and HPV [49,50]. However, CTCF is not known to have a role in regulation of HIV latency. To investigate the possible role of CTCF in HIV latency we examined the impact of CTCF depletion on viral gene expression in HIV latency models. First, we transduced N6 cells, a latently infected T cell line model that contains

PLOS PATHOGENS
a transcriptionally silent HIV reporter virus that encodes murine heat shock antigen (HSA) [28], with lentiviruses that express an shRNA targeting CTCF or a scrambled control shRNA. Western blot analysis of CTCF confirmed specific reduction in CTCF expression in the cells transduced with the CTCF-targeting shRNA (Fig 7A). At two weeks post transduction, we measured viral reactivation by HSA expression on the surface of the transduced cells using flow cytometry. Notably, we observed potent re-expression of HSA in the CTCF-depleted cells, indicating that CTCF expression is required to maintain repression of HIV transcription in this cell line (Fig 7B and 7C). We next examined the impact of CTCF depletion in our primary CD4 T cell latency model. We activated CD4 T cells through TCR stimulation for 48h, then infected with HIV-GFP. At 3dpi, we purified infected (GFP+) cells, then nucleofected the infected cells with Cas9/sgRNA ribonucleoprotein complexes targeting the viral protein Tat, CTCF or with a non-targeting control (Fig 7D). Quantitative PCR one week after nucleofection confirmed roughly 50% reduction of CTCF expression in the targeted cells (Fig 7D). The infected cells were then cultured for four weeks and the percentage of latently infected (GFP-) cells over time measured by flow cytometry (Figs 7F and S6). Tat targeted cells, as expected, demonstrated a more rapid decline in viral gene expression, confirming the sensitivity of this system to perturbations in regulators of viral transcription (Fig 7F). Notably, CTCF depleted cells exhibited a reduced level of HIV latency beginning at 2wpi, and this effect became more prominent at 3wpi and 4wpi. Overall these results demonstrate a role for CTCF in the establishment or maintenance of HIV latency and highlight a previously unknown mechanism of transcriptional repression for HIV.

Discussion
The presence of a latently infected reservoir is recognized as a primary barrier to curing HIV infection [51]. As such, the mechanisms by which latent infection is established and maintained are of critical importance. By characterizing these mechanisms we may identify strategies to either induce viral gene expression leading to infected cell clearance ("shock and kill)", or induce permanent silencing of the proviruses ("block and lock") [52,53]. Viral gene expression is regulated by host cell transcription factors and epigenetic mechanisms. Dynamic changes and cell-to-cell variation in these factors likely leads to specific cellular environments that favor or disfavor latency. However, viral transcription can also be influenced by stochastic fluctuations in the Tat transcription factors early in infection [21,54]. Nevertheless, the ability of LRAs targeting host cell factors to induce viral reactivation in both animal models and in human clinical studies demonstrates the value of targeting host cell pathways to manipulate latency [22,38,55]. Thus, latency initiation and reversal both likely result from a combination of stochastic and deterministic processes intrinsic to both the virus and the host cell. These processes yield a diverse population of persistently infected cells, with intermittent viral expression and re-entry into quiescence, undergoing both proliferation and death. The rare nature of latently infected cells and the lack of markers for their purification make direct observation of cellular events in infected cells from patients extremely difficult. In this regard, model systems have been invaluable in the discovery and validation of new targets. In this study, we have discovered, using a primary cell model system, that viral latency is a stable, heritable phenotype that is rapidly re-established after stimulation and transmitted to daughter cells during cell division. These findings expand the understanding of epigenetic mechanisms likely contribute to the maintenance of latency through cell division, an observation that has significant implications for the maintenance of the HIV reservoir in vivo through clonal expansion. Epigenetic memory of HIV latency through cell division and T cell development are likely to explain the presence of expanded latently infected clones within diverse T cell subset identities present in infected individuals on therapy. The rapid and repetitive re- establishment of latency after stimulation also has implications for clearance strategies against reactivated latently infected cells in vivo. These results suggest that a temporally limited window of antigen presentation will likely occur after LRA dosing before latency is reestablished and antigen expression wanes. An important caveat to consider is that our model system involves several steps of in vitro manipulation which could influence the phenotype or behavior of these cells. These steps include activation with TCR-stimulating beads, and extensive coculture with H80 cells to permit long term CD4 T cell survival in culture. Our model system also involves the use of a replication defective reporter strain of HIV, which may behave differently from clinical isolates of HIV. Nevertheless, a comparison of the transcriptional profiles of our model cells indicated that these cells closely resemble resting memory CD4 T cells, the natural host for latent HIV.
By analysis of infected cells with ATACseq, we have also discovered that HIV latency is preferentially established in cells that adopt a specific global chromatin structure after activation and return to a resting state. This finding suggests that latency is intricately connected to cell-intrinsic chromatin dynamics, and that these dynamics could represent an important target for preventing latency establishment, reversing latency once established, and preventing a return to latency following latency reversal. The specific mechanisms by which different patterns of overall chromatin accessibility are regulated in these infected cells as latency is established are unclear, and will require further investigation. We have also demonstrated that accessibility of the viral genome is significantly reduced in latently infected cells relative to actively infected cells. This finding confirms previous results in cell line models showing the appearance of restrictive chromatin structures associated with entry of HIV into latency [56]. The closed nature of the provirus likely impedes recruitment of transcription factors essential for initiating viral transcription, and processive RNA polymerase transcription, thereby contributing to latency. Notably, three LRAs with distinct mechanisms of action caused significant re-opening of the provirus, indicating that re-opening of HIV is likely a key feature of latency reversal by LRAs.
An important determinant of proviral chromatin state and expression that cannot be determined from our ATAC-seq data is the influence of the viral integration site. Specifically, the transcription state of neighboring genes and the activity of nearby enhancers will likely modulate the chromatin state and expression of the provirus [57]. HIV frequently integrates near clusters of endogenous enhancer elements [58], and studies in Jurkat cells have shown a clear influence of local endogenous enhancers on HIV expression [57]. While we were able to recover some ATAC-seq fragments that spanned virus-host cell junctions, revealing individual integration sites, these fragments were rare, and the information too sparse to draw significant conclusions about the influence of viral integration site on the proviral chromatin state. Additional experimentation using integrated single cell ATAC-seq and integration site analysis may be useful for resolving this question. New methods, for example, that permit parallel analysis of lentiviral integration sites and host cell chromatin accessibility in single or bulk populations of T cells will help to clarify the relationship between proviral expression, integration site and local chromatin structure [59].
It is also important to note that during clinical infection, infected cells with high viral gene expression are rapidly killed due to a combined effect of viral cytopathic effect and immune pressure. The surviving cells are thus selected to exhibit low residual viral gene expression. Recent studies of individual infected cells demonstrate that a minor population of cells from patients receiving ART express viral transcripts [60], and that these cells can undergo clonal expansion to sustain the reservoir [15,16,60]. HIV infection also frequently results in defective proviruses containing inactivating deletions, and these mutant proviruses constitute the majority of HIV DNA in people receiving ART [61]. For single round reporter viruses such the one used for this study, we expect the frequency of mutated proviruses to be low, particularly since we initially sort for infected cells with active viral gene expression. Furthermore, our bioinformatic pipeline only mapped intact viral DNA fragments and did not consider fragments with deletions. Future studies of the epigenomic characteristics of provirus may derive insight from also considering defective viruses.
This study also provides new insight into the mechanisms of latency by identifying a set of host cell TFs whose binding site accessibility correlates with viral gene expression. Specifically, by comparing the differentially open regions of chromatin in the latent or actively infected cells to the overall set of open regions across infected CD4 T cells, active viral expression was correlated with elevated accessibility of binding sites for AP-1, GATA, and RUNX TFs, while latency was associated with elevated accessibility of binding sites for Forkhead (FOX) and Kruppel-like factor (KLF) TFs (Fig 8). Importantly, this list contains several TFs that have well-established roles in regulating HIV transcription, indicating the biological relevance of the findings. AP-1, for example has long been identified as a regulator of HIV gene expression [40][41][42]. AP-1 is a family of heterodimeric basic leucine zipper TFs that recognize a consensus binding site (TGA G/C TCA), and are potently activated by stimuli that promote HIV gene expression, such as phorbol esters. There are several AP-1 binding sites in the HIV LTR, and these have been shown to be important for HIV transcription and replication, suggesting that the correlation between AP-1 activity and viral gene expression in our model is likely through direct activity of AP-1 at the HIV LTR. Interestingly one AP-1 site has recently been shown to regulate latent infection [40]. GATA TFs have also previously been identified as regulators of HIV transcription. GATA3, in particular, has been shown to bind the HIV LTR and promote HIV gene expression [62]. Notably, we have previously reported that elevated GATA3 transcript levels correlates with active HIV transcription in this model system [28]. From the list of TFs with enriched activity in latently infected (GFP-) cells, FOXP1 was also previously reported by our group as exhibiting elevated expression in latently infected CD4 T cells [28]. This TF is associated with a quiescent central memory T cell (Tcm) phenotype, consistent with the hypothesis that cells that adopt a quiescent Tcm phenotype are more prone to latent infection [63]. Interestingly, a related Forkhead TF, FOXO1, has also recently been reported to promote HIV latency [64,65]. In addition to these known HIV-regulating TFs, several TFs with no known role in HIV transcription were identified by our study, and defining their role in latency will likely lead to the discovery of new mechanisms of transcriptional repression for HIV. However, it is also important to note that changes in TF binding site accessibility do not necessarily reflect differential TF activity, and additional analyses, such as ChIPseq, may be needed to directly confirm differential binding in latently infected cells.
We also examined enrichment of TF motifs in the differentially open chromatin of infected cells relative to the overall human genome. Interestingly, by this comparison, the differentially open chromatin regions of infected CD4 T cells were highly enriched with binding sites for CTCF, indicating that possible dynamic changes to chromatin looping may distinguish latently infected cells from actively infected cells. Furthermore, by selective knockdown of CTCF in a T cell line model of latency (N6) and in a primary cell latency model, we demonstrated that CTCF expression is required for maintaining HIV silencing in HIV infected T cells. These results thus identify CTCF as a novel regulator of HIV latency. Notably, previous studies have shown that CTCF plays a key role in regulating latency for herpesviruses [49,50]. In these cases, dynamic CTCF binding to cognate sites within the viral genomes directly regulates switching between lytic and latent gene expression programs. It is unknown if CTCF binds directly to the HIV genome. As such, the precise mechanism by which CTCF regulates HIV latency is unknown and will require further investigation. CTCF has several known molecular functions that could potentially affect HIV gene expression. In particular, CTCF binding to DNA mediates chromatin looping and the formation of topologically associated domains (TADs), and serves to insulate genes from adjacent enhancer elements and zones of histone modifications [46]. It is conceivable that, in many infected cells, HIV gene expression is impacted by insulation of the HIV promoter from nearby enhancer elements and that disruption of the insulator function of CTCF allows longer-range promoter/enhancer interactions to occur that promote viral reactivation. Alternatively, many latent proviruses may be trapped in TADs that are enriched for repressive histone modifications such as H3K9me3 and H3K27me3, and removal of the CTCF boundary allows replacement of these heterochromatic histone modifications with activating marks. CTCF can also regulate sub-nuclear localization of chromatin regions and mediate interactions with the nuclear lamina [66]. As such, it is also possible that CTCFs ability to regulate latency is determined by spatially restricting access of the provirus to activating transcription factors within the nucleus.
It is also unclear what might regulate differential binding activity of CTCF within our latency model system. CTCF is ubiquitously expressed in mammalian cells, and CTCF transcripts were not differentially expressed between the latently infected (GFP-) and actively infected (GFP+) cells. It is possible that CTCF binding is regulated by expression or differential binding of an interacting partner. CTCF contains three major structural domains-a central DNA binding domain with 11 zinc fingers, and long N and C terminal domains. All three domains interact with an array of binding partners, including Y-box binding protein 1 (YBX1), Chd4 and Snf2h [44]. CTCF itself can be regulated by phosphorylation, sumoylation and ADP ribosylation [67]. CpG methylation of CTCF target sites can also regulate CTCF binding [68]. It is also possible, given CTCFs fundamental role in transcriptional regulation, that CTCFs role in repressing HIV transcription is indirect, and mediated through an as yet unidentified factor whose expression or activity is regulated by CTCF. The precise mechanism behind CTCFs role in latency will need to be clearly defined. No therapeutic strategies for targeting CTCF currently exist, but the results we describe in this manuscript suggest that inhibition of CTCF activity or a pathway regulated by CTCF could serve as a strategy for impacting HIV latency clinically.

Viruses and cell culture
Stocks of HIV-GFP were generated by co-transfection of 293T cells with the pNL4-3-Δ6-dreGFP plasmid (generous gift of Robert Siliciano) and the packaging plasmids PAX2-GagPol) and MD2-VSVG, using Mirus LT1 reagent. At 2 days post transfection virus was harvested from the supernatant and clarified by low speed centrifugation, followed by filtration through a 45uM filter. shRNA expressing lentivirus plasmids targeting CTCF were purchased from Genecopeia (HSH060478-LVRU6MP) and viral particles generated as for HIV-GFP. Latently infected N6 cells are a clone of Jurkat cells that contain a full length clone of the NL4-3 strain of HIV in which the Nef open reading frame has been replaced by an HSA-P2A-Luciferase cassette (generous gift from David Irlbeck, Viiv Healthcare).

Primary cell latency model
Primary CD4 T cells were isolated from fresh whole blood (purchased from Gulf Coast Regional Blood Center) by Ficoll isolation of peripheral blood mononuclear cells (PBMCs), then magnetic negative selection of CD4 T cells using a CD4 enrichment kit (Stem Cell). Total primary CD4 T cells were activated using anti-CD3/CD28 beads (Thermo Fisher) at a ratio of one bead per cell for 2 days, then infected with HIV-GFP viral supernatant by spinocculation for 2hrs at 600g with 4ug/mL polybrene. The infected cells were then resuspended in fresh RPMI with IL-2 (Peprotech) at 100U/mL and incubated for two days before actively infected (GFP+) cells were sorted using a FACSAria flow sorter (Becton Dickson). The purified infected cells (GFP+) were then maintained at 1-2M/mL with fresh media and IL-2 added every 2-3 days for 2 weeks. From 2wpi, cells were co-cultured with H80 cells (provided by D Bigner, Duke University) in the presence of 20U/mL IL-2 for up to 12wpi. During co-culture the cells were fed with fresh media and IL-2 every 2-3 days, and moved to a fresh flask of H80 feeder cells once a week. For latency reversing agent stimulation, AZD5582, vorinostat and prostratin were generous gifts from David Irlbeck (Viiv Healthcare), and were reconstituted in DMSO at 10mM, before dilution into media to working concentrations of 250nM.

ATACseq library construction and sequencing
Infected cultures of CD4 T cells were stained with the viability dye Zombie Violet (ZV, Biolegend) and 50,000 live (ZV-) cells were sorted into 5mL polypropylene tubes using a FACSAria flowsorter (Becton Dickson) per replicate. The sorted cells were then washed with phosphate buffered saline (PBS) and resuspended in 50ul TD buffer (Illumina) with 0.01% digitonin and 2.5ul TDE Tagmentation enzyme (Illumina). The cells were then agitated in a thermomixer at 37C, 300rpm for 30mins to allow transposon-mediated tagmentation to occur. Tagmented DNA was then purified a Minelute purification kit (Qiagen). DNA fragments were PCR amplified using dual barcoded primers and re-purified using a Minelute kit followed by a 1.5x final cleanup step with Ampure SPRI beads (Agencourt). Libraries were quantified by Qubit DNA assay (Invitrogen) and size distribution of the library was determined using a High Sensitivity DNA BioAnalyzer chip (Agilent). Samples were sequenced using an Illimuna Hiseq2500 to a depth of 50-150M reads with 150bp paired end sequencing.

ATAC-seq bioinformatic analysis pipeline
Sample fastq files were combined by read end and validated with FastQC v0.11.8. Combined fastq read end pair files were aligned with BWA_MEM v0.7.17 [69] against the GRCh38 genome with an added HIV chromosome sequence. BWA is a fast and accurate aligner with which we have significant experience. It generates high-quality alignments while still allowing for soft-clipping and alignment across HIV (non-clonal) integration sites (with the -Y flag). Aligned bam files were sorted and optical duplicates were flagged with Picard Tools v2.21.1 [70]. Picard tools and SAMtools v1.9 [70] were used to verify the quality of the alignments. Filtering prior to read counting was done by SAMtools to remove duplicate, unmapped, and unpaired reads. HIV-aligned reads were duplicated into separate bam files and merged by group with SAMtools for visualizing in IGV [71]. Read counting was done using csaw v1.16.1 [72], with parameters set to count reads with map quality greater than or equal to 10 and inferred fragment length less than or equal to 2000. Read ends were counted separately and combined into bins of 73 bp windows (~1/2 nucleosome width) tiled across the entire genome. Reads were also counted into 5000 bp bins, generating a second count matrix at lower resolution. The csaw function filterWindows() was used to calculate the log2 signal to noise of all 73 bp windows relative to the median of the large windows counts, scaled by relative window sizes. Low signal windows less than or equal to log2(2.5) over background were dropped. Per sample TMM normalization factors were calculated using the csaw normFactors() function, retaining potential global changes in accessibility. The normalized noise-filtered count matrices were analyzed by PCA using the prcomp() function of R 3.5 [73]. Review of the PCA results indicated the differential accessibility model for GFP+ vs GFP-data should include donor as a cofactor. For latency reversing agents treated cells, four replicate experiments were carried out using cells from a single donor and a simple treatment-control model for each. Differential accessibility was calculated using edgeR v3.23.4 [74], with global dispersion factors from esti-mateDisp(). Models were fit with glmQLFit() using robust = TRUE, and the significance calculated using glmQLFTest(). Finally, successive windows were merged into clusters across valleys of up to one bin width using mergeWindows() and combineTests() from csaw. The generated clusters from each modeled comparison were plotted as volcano plots visualizing the FDRs and the mean fold change (calculated as the mean fold change of all the cluster's windows). Clusters were assigned category labels if they overlapped a matching HG38 annotated region, e.g. a "promoter" or "enhancer", as provided by annotatr v1.8.0.
Transcription factor binding site enrichment for each modeled comparison was performed using findMotifsGenome.pl from HOMER v4.10 [75]. Input bed files were generated from the csaw cluster results, separately for the significantly up and significantly down clusters (FDR < = 0.1), and for the undifferentiated regions (FDR > 0.2). Using the included default set of transcription factors, HOMER was run to identify known TF binding sites enriched in the significant bed files relative to the undifferentiated background. Separate runs were performed examining the middle 50, the middle 200, and all bases of the significantly up and significantly down cluster regions. Results from the top hits were manually examined for relevant biological associations with HIV. Additionally, HOMER was used to compare the significantly different regions against the genome as a whole. A bed file was generated containing all clusters and analyzed with findMotifsGenome.pl using the -find parameter to locate all CTCF binding sites. Clusters overlapping CTCF sites were extracted and their differential accessibilities were plotted as volcano plots.

RNAseq analysis
RNA was extracted from infected CD4 T cells using an RNEasy plus kit (Qiagen), and eluted into RNase free water. RNA was quantified by Qubit assay (ThermoFisher) and integrity checked using a BioAnalyzer RNA nano kit (Agilent). RNA was then used for construction of expression libraries using a SMART-seq low input RNA kit (Takara). Libraries were quantified by Qubit and sequenced on an Illumina Hiseq2500 sequencer. Alignment to the Hg38 reference genome was carried out with Star v 2.4.2a [76]. Aligned bam files were sorted and optical duplicates were flagged with Picard Tools v1.86 [70]. Picard tools and SAMtools v1.9 [67] were used to verify the quality of the alignments. Filtering prior to read counting was done by SAMtools to remove duplicate, unmapped, and unpaired reads. Gene transcript counts were obtained from the filtered aligned reads using Salmon v 0.6 [77] and differential gene expression results were quantified using DESeq2 v 1.22.2 [78].

Comparative RNAseq-ATACseq analysis
For each data set, genes with differential expression results were split into two groups, those with significant and large differences (adjusted p-value < 0.1; abs fold change > 1.5x) differences versus all others. Independently, genes with overlapping differential accessibility clusters were split into two groups, those with associated clusters whose differential accessibility is significant (FDR < 0.1), large (abs fold change > 1.5x), and are changing in the same direction (up or down) versus those with either no significant and large difference in accessibility or those with mixed directionality. For genes with both differential expression and differential accessibility data, R 4.0 [73] was used to: (1) perform a Fisher's exact test ("fisher.test" function) for evaluating the significance of the relationship between expression and accessibility changes, (2) to obtain the expected distributions of paired categories ("chi.test" function) for evaluating the directionality of the relationship, and (3) to obtain the linear correlation ("cor" function) between the (log2) of the gene expression fold change and the (log2) of the largest fold change of any associated cluster.

Comparative ChIPseq-ATACseq analysis
ChipSeq data from a human CD4 T cell line [43] in the form of reads per million per 50bp bins were dowloaded and converted from hg19 to hg38 using rtracklayer V1.48.0 [79] for the RUNX1 (GSM1975912), GATA3 (GSM1975913), MYB (GSM1519635), and CTCF (GSM1689152) transcription factors. Each cluster was assigned a score for each transcription factor based on all chip read bins overlapped = log2 (1 + sum of bin per million counts). Evaluation of the difference in ChIP score means between the significantly more open clusters (FDR < 0.1 and fold change > 1.5x) and non-significantly different clusters (FDR > 0.2, any fold change) was done using the t-test function of R 4.0 [73] to perform a Welch Two Sample t-test.

Antibodies/Western blots
To extract protein, cells were washed in phosphate-buffered saline (PBS) then lysed in RIPA buffer (Sigma) supplemented with Complete protease inhibitors (Roche), before centrifugation to remove cell debris. Protein concentrations were then quantified using Bradford assay reagent (BioRad), and 40ug of protein per sample was run on a 4-12% Tris Glycine polyacrylamide gel (Invitrogen). The gel was then transferred to a PVDF membrane. The membranes were then incubated in Tris-buffered saline with 0.1% Tween 20 (TBST) with a primary detection antibody for 2-hrs, before washing with TBST. Primary antibodies used were anti-CTCF (clone D31H2, Cell Signaling) and beta-actin (ab49900, Abcam). Primary antibody staining was followed by incubation with an HRP-linked anti-rabbit antibody. Bands were the imaged by incubation of the membrane with ECL chemiluminescence reagents (ThermoFisher) and imaging on a ChemiDoc MP imager (BioRad). For flow cytometry for murine Heat Shock Antigen (HSA), cells were washed in PBS before staining with anti-HSA-APC (Biolegend) at 1:200 dilution for 30mins at 4C, before washing in PBS and analysis using a Fortessa flow cytometer (Becton Dickson).

Quantitative PCR
Quantitative PCR for CTCF was performed using Taqman probes/primers (Thermo Scientific). Specific primer probe sets used were Hs00902016_m1 for CTCF, and Hs01060665_g1 for beta-actin. Reactions were run using primer/probes at 1:20 dilution and Taqman FastVirus one-step reaction mix (Thermo Scientific), and amplified using an Applied Biosystems QS3 cycler. Expression level was determined using cycle threshold relative to beta-actin and represented in arbitrary units.

infected (GFP+) cells and latently infected (GFP-) cells were compared to
ChIPseq datasets for RUNX1, GATA3, MYB, and CTCF [35]. For each accessible peak, a ChIPseq signal score was calculated for the TFs indicated. Box plots of the summed peak cluster ChIPseq scores are shown for peaks that exhibited significantly increased accessibility in GFP+ cells (pink) versus those that showed no significant change (blue). P values for differences between the samples was calculated by Welch two sample t-test (S8 Table; Table. Enrichment of ATACseq peaks in differentially expressed genes. For each experimental comparison, the number of cellular genes associated with a significant change for its RNA transcript and/or an ATACseq peak (fold change>1.5, P adj <0.1), were calculated. Statistical enrichment of ATACseq peak changes in genes with changing transcript levels was calculated using Fisher's exact test (lower right panel). (DOC) S2 Table. Genes with altered transcript and ATACseq peak levels. Genes exhibiting a significant change to both transcript level and accessibility in actively infected cells (GFP+) vs latently infected cells (GFP-) are shown (fold change>1.5, P adj <0.1). Increased RNA levels or accessibility in actively infected cells (GFP+) are designated as changing in the "up" direction. Turner, Edward P. Browne.