RSM, BFB, ARWS, CCB, JRE, and FDB conceived and designed the experiments. RSM, BFB, ARWS, PS, HC, CCB, JRE, and FDB performed the experiments. RSM, BFB, ARWS, PS, HC, CCB, HCB, JRE, and FDB analyzed the data. RSM, BFB, CCB, JRE, and FDB wrote the paper.
The authors have declared that no conflicts of interest exist.
The completion of the human genome sequence has made possible genome-wide studies of retroviral DNA integration. Here we report an analysis of 3,127 integration site sequences from human cells. We compared retroviral vectors derived from human immunodeficiency virus (HIV), avian sarcoma-leukosis virus (ASLV), and murine leukemia virus (MLV). Effects of gene activity on integration targeting were assessed by transcriptional profiling of infected cells. Integration by HIV vectors, analyzed in two primary cell types and several cell lines, strongly favored active genes. An analysis of the effects of tissue-specific transcription showed that it resulted in tissue-specific integration targeting by HIV, though the effect was quantitatively modest. Chromosomal regions rich in expressed genes were favored for HIV integration, but these regions were found to be interleaved with unfavorable regions at CpG islands. MLV vectors showed a strong bias in favor of integration near transcription start sites, as reported previously. ASLV vectors showed only a weak preference for active genes and no preference for transcription start regions. Thus, each of the three retroviruses studied showed unique integration site preferences, suggesting that virus-specific binding of integration complexes to chromatin features likely guides site selection.
Retroviruses have potential for gene therapy only if they do not activate endogenous genes. Of three tested retroviral vectors, ASLV showed no preference for integration into human transcription start regions.
Retroviral replication requires reverse transcription of the viral RNA genome and integration of the resulting DNA copy into a chromosome of the host cell. A topic of long-standing interest has been the chromosomal and nuclear features dictating the location of integration target sites (reviewed in
With the availability of the complete human genome sequence, large-scale sequence-based surveys of integration sites have become possible (
The origins of the 3,127 integration sites studied are summarized in
The human chromosomes are shown numbered. HIV integration sites from all datasets in Table 1 are shown as blue “lollipops”; MLV integration sites are shown in lavender; and ASLV integration sites are shown in green. Transcriptional activity is shown by the red shading on each of the chromosomes (derived from quantification of nonnormalized EST libraries, see text). Centromeres, which are mostly unsequenced, are shown as grey rectangles.
Three integration site datasets were newly determined in this study. Integration by an ASLV vector was analyzed in 293T-TVA cells, which are human 293T cells engineered to express the subgroup A avian retrovirus receptor. Integration by an HIV-based vector was characterized in two types of primary human cells, peripheral blood mononuclear cells (PBMCs) and IMR90 lung fibroblasts. Several previously described datasets were also subjected to further analysis in parallel—HIV integration sites in three transformed cell lines (SupT1 [
The use of restriction enzymes to cleave cellular DNA during the cloning of integration sites could potentially introduce a bias in favor of isolating integration events closer to restriction sites. Previous work suggested that integration site surveys were not strongly biased. In one study, an experimental control based on integration in vitro indicated that the cloning and analytical methods used did not detectably bias the conclusions (
In this study we have added a computational method to address possible biased isolation. Each integration site was paired with ten sites in the human genome randomly selected in silico that were constrained to be the same distance from a restriction site of the type used for cloning as the experimentally determined integration site. Statistical tests were then carried out by comparing each experimentally determined integration site to the ten matched random control sites. In this way any bias due to the placement of restriction sites in the human genome was accounted for in the statistical analysis. All the collections of integration sites were analyzed in this manner, including data previously published in (
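The matched-control idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a single chromosome, a pre-computed sorted list of restriction site coordinates, and that "same distance" means the same number of base pairs on either side of a randomly chosen restriction site. The function names and the toy coordinates are hypothetical.

```python
import bisect
import random


def distance_to_nearest_site(position, restriction_sites):
    """Distance in bp from `position` to the closest restriction site.

    `restriction_sites` must be a non-empty, sorted list of coordinates.
    """
    i = bisect.bisect_left(restriction_sites, position)
    # Only the neighbours on either side of the insertion point can be closest.
    neighbours = restriction_sites[max(i - 1, 0):i + 1]
    return min(abs(position - site) for site in neighbours)


def matched_controls(position, restriction_sites, chrom_length, n=10, rng=None):
    """Draw `n` random control positions that lie the same distance from a
    restriction site as the experimental integration site does.

    Assumption: the control may sit on either side of the chosen site.
    """
    rng = rng or random.Random()
    d = distance_to_nearest_site(position, restriction_sites)
    controls = []
    attempts = 0
    while len(controls) < n:
        attempts += 1
        if attempts > 1000 * n:
            raise RuntimeError("could not place matched controls")
        site = rng.choice(restriction_sites)
        candidate = site + rng.choice((-d, d))
        if 0 <= candidate < chrom_length:
            controls.append(candidate)
    return controls


# Toy example with made-up coordinates: the site at 5,300 is 300 bp from
# the restriction site at 5,000, so every control is 300 bp from some site.
sites = [1000, 5000, 9000]
controls = matched_controls(5300, sites, chrom_length=10000, rng=random.Random(0))
```

Because every control shares the experimental site's distance to a restriction site, any bias from where restriction sites fall in the genome affects experimental and control sites alike, so it drops out of the comparison.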
For HIV the frequency of integration in transcription units ranged from 75% to 80%, while the frequency for MLV was 61% and for ASLV was 57%. For comparison, about 45% of the human genome is composed of transcription units (using the Acembly gene definition). Analysis using the different catalogs of human genes suggests that somewhat different fractions of the human genome are transcribed, and new information indicates that an unexpectedly large fraction of the human genome may be transcribed into noncoding RNAs (
We next assessed the placement of integration sites within genes and intergenic regions. A previous study revealed that integration by MLV is favored near transcription start sites, but no such bias was seen for HIV (
Genes or intergenic regions were normalized to a common length and then divided into ten intervals to allow comparison. The number of integration sites in each interval was divided by the number of matched random control sites and the value plotted. A value of one indicates no difference between the experimental sites and the random controls. Viruses and cell types studied are as marked above each graph. The direction of transcription within each gene is from left to right. Note that our normalization method de-emphasizes favored MLV integration events just upstream of gene 5′ ends (outside transcription units), as reported by
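The normalization described in this legend can be illustrated as follows. The sketch assumes each site has already been assigned to a gene with known start, end, and strand, and that the control count is scaled by the number of controls per site (ten in this study) so that a ratio of one means no difference. That scaling is an assumption made here to match the legend's "value of one", and all names are hypothetical.

```python
from collections import Counter


def interval_index(position, gene_start, gene_end, strand="+", n_intervals=10):
    """Map a position inside a gene to one of `n_intervals` equal-width bins,
    numbered in the direction of transcription (bin 0 is the 5' end)."""
    fraction = (position - gene_start) / (gene_end - gene_start)
    if strand == "-":
        # Flip minus-strand genes so transcription always runs left to right.
        fraction = 1.0 - fraction
    # A position exactly at the 3' end belongs to the last bin.
    return min(int(fraction * n_intervals), n_intervals - 1)


def interval_ratios(experimental_bins, control_bins,
                    n_intervals=10, controls_per_site=10):
    """Ratio of experimental sites to expected sites in each interval.

    Expected = control count / controls_per_site (assumed scaling).
    Intervals with no control sites yield None.
    """
    observed = Counter(experimental_bins)
    control = Counter(control_bins)
    ratios = []
    for i in range(n_intervals):
        expected = control[i] / controls_per_site
        ratios.append(observed[i] / expected if expected else None)
    return ratios
```

Applying `interval_index` to intergenic regions works the same way, with the region boundaries in place of the gene start and end.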
Transcriptional profiling analysis was carried out in some of the cell types studied, allowing the influence of transcriptional activity on integration site selection to be assessed. Transcriptional profiling was carried out on infected cells so that the data reflected the known influence of infection on cellular gene activity (
The median expression level (average difference value) of genes hosting integration events was consistently higher than the median of all genes assayed on the microarrays. The ratios (targeted genes/all genes assayed) for HIV ranged from 1.6 to 3.0, indicating that integration targeting in human primary cells (PBMC and IMR90) favored active genes, as shown previously for transformed cell lines (
Expression levels were assayed using Affymetrix HU-95Av2 or HU-133A microarrays and scored by the average difference value as defined in the Affymetrix Microarray Suite 4.1 software package. All the genes assayed by the chip were divided into eight “bins” according to their relative level of expression (the leftmost bin in each panel is lowest expression levels and the rightmost the highest). Genes that hosted integration events were then distributed into the same bins, summed, and expressed as a percent of the total. The y-axis indicates the percent of all genes in the indicated bin.
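The binning in this legend can be sketched as below. The sketch assumes the eight bins hold equal numbers of genes ranked by expression value; the legend does not state how bin boundaries were chosen, so that is an assumption, and the function names are hypothetical.

```python
def expression_bins(expression, n_bins=8):
    """Assign every assayed gene to a bin by expression rank.

    `expression` maps gene name -> expression value. Bin 0 holds the
    lowest-expressed genes and bin n_bins - 1 the highest.
    """
    ranked = sorted(expression, key=expression.get)
    n = len(ranked)
    return {gene: min(rank * n_bins // n, n_bins - 1)
            for rank, gene in enumerate(ranked)}


def percent_per_bin(host_genes, bins, n_bins=8):
    """Percent of integration-hosting genes falling in each expression bin.

    `host_genes` may list a gene more than once if it hosted several
    integration events; genes absent from the array are skipped.
    """
    counts = [0] * n_bins
    for gene in host_genes:
        if gene in bins:
            counts[bins[gene]] += 1
    total = sum(counts)
    return [100.0 * c / total if total else 0.0 for c in counts]
```

If integration were indifferent to expression, each of the eight bins would hold about 12.5% of the hosting genes; a skew toward the right-hand bins indicates a preference for active genes.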
All three HIV datasets showed reduced integration in the most highly expressed category of genes analyzed, suggesting that although transcription favors integration, very high level transcription may actually be less favorable (
We next investigated the effects of cell-type-specific transcription on integration site selection. For this analysis we used only the three HIV integration site datasets for which we had transcriptional profiling data from infected cells (i.e., SupT1, PBMC, and IMR90), to allow us to control for the effects of infection on transcription. Pairwise comparisons of the microarray datasets for the three cell types showed that the correlation coefficients ranged only from 0.64 to 0.79, indicating that transcriptional activity indeed differed among cell types. We reasoned that since active transcription favors integration, then the genes targeted by integration should on average be more highly expressed in the cell type that hosted the integration event than in either of the other two. Statistical analysis (
Genes hosting integration events by the HIV vector were analyzed for their expression levels in transcriptional profiling data from IMR90, PBMC, and SupT1 cells. For each gene hosting an integration event, the expression values from the three cell types were then ranked lowest (red), medium (orange), and highest (yellow). The values were summed and displayed separately for each set of integration sites: (A) IMR90 sites, (B) PBMC sites, and (C) SupT1 sites. In each case there was a significant trend for the cell type hosting the integration events to show the highest expression values relative to the other two (
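The ranking described in this legend can be sketched as a simple tally. This is an illustrative reading of the legend rather than the authors' code: it assumes exactly three cell types, skips genes lacking a value in any of them, and breaks ties by the order in which the cell types are supplied. The function name and data layout are hypothetical.

```python
def rank_tally(host_genes, expression_by_cell_type):
    """For each gene hosting an integration event, rank the three cell
    types by that gene's expression and count how often each cell type
    comes out lowest, medium, or highest.

    `expression_by_cell_type` maps cell type -> {gene: expression value}.
    """
    labels = ("lowest", "medium", "highest")
    tally = {cell: dict.fromkeys(labels, 0) for cell in expression_by_cell_type}
    for gene in host_genes:
        # Skip genes not assayed in all three cell types.
        if not all(gene in expr for expr in expression_by_cell_type.values()):
            continue
        ordered = sorted(expression_by_cell_type,
                         key=lambda cell: expression_by_cell_type[cell][gene])
        for label, cell in zip(labels, ordered):
            tally[cell][label] += 1
    return tally
```

Running the tally separately on the IMR90, PBMC, and SupT1 integration site sets gives one panel each; the trend reported in the legend corresponds to the hosting cell type accumulating the most "highest" counts in its own panel.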
We next analyzed factors influencing the placement of integration sites at the chromosomal level, taking into account both gene density and expression (see
All data were quantified in 2-Mb intervals. The top line shows summed EST data documenting the “transcriptional intensity” for each chromosomal interval (data from
Statistical analysis was carried out comparing integration frequencies to (1) gene density or (2) transcriptional intensity, as measured by the EST counts. All analyses incorporated a comparison to the matched random control set of integration sites. Each type of vector showed a significant positive correlation with gene density (HIV,
Thus, the analysis of transcriptional activity in the context of chromosomal location revealed significant effects of transcription on MLV and ASLV integration. This is in contrast to the study based on transcriptional profiling alone, in which the effect was not statistically significant; however, a similar trend was evident and the general conclusions were similar (see
Two lines of evidence indicated that the chromosomal regions favorable for integration can be subdivided into favorable and unfavorable segments. In the first study, a computational analysis was carried out to determine the length of the chromosomal segments yielding the best fit between transcriptional intensity and integration intensity. The sizes of the chromosomal regions analyzed were varied systematically from 25 kb to 32,000 kb, and the statistical significance determined for the correlations. This analysis revealed that the segment length yielding the best correlation was comparatively short, around 100–250 kb, the length of one or a few human genes. These conclusions held for HIV, ASLV, and MLV (
An analysis of integration frequency near CpG islands also indicated substructure within regions favorable for integration. CpG islands are chromosomal regions enriched in the rare CpG dinucleotide. These regions commonly correspond to gene regulatory regions containing clustered transcription factor binding sites—consequently, CpG islands are more frequent in gene-rich regions. Previously
The viral vectors and cell types studied are indicated by color. A value of one indicates no bias, less than one indicates disfavored integration, and more than one indicates favored integration. The x-axis (from plus or minus 1 kb to 50 kb) indicates distance from the edge of a CpG island in either direction along the genome. The statistical analysis specifically removed the favorable effects of being in a gene and being in a region containing expressed genes to highlight the effects of CpG islands alone. When effects of gene density and activity are left in, HIV integration goes from disfavored at short distances (less than 1 kb) to favored at longer distances (more than 10 kb). This is because at longer distances the association with genes is significant—many CpG islands are within 10 kb of a gene, and genes are favored targets for HIV integration. To carry out this analysis, the numbers of experimentally determined and matched control sites were fitted according to whether they were near a CpG island, whether they were in genes, and the level of the expression density variable. Each variable contributes a “multiplier” for the ratio of the number of experimental to control sites. The multiplier for “near CpG island” is shown (
High gene density in the human chromosomes is known to correlate with several other features, including high levels of gene expression, high densities of CpG islands, the occurrence of Giemsa-light chromosomal bands, and high G/C content (
A statistical model was constructed to examine the relative contributions to integration intensity of (1) gene density, (2) gene activity, (3) proximity to CpG islands, (4) G/C content, and (5) location within genes (
Human chromosome numbers are indicated at the bottom of the figure. The number of integration events detected in each chromosome was divided by the number expected from the matched random control. The line at one indicates the bar height expected if the observed number of integration events matched the expected number. Higher bars indicate favored integration, lower bars, disfavored integration. Most of the cell types studied were from human females; too little data were available for the Y chromosome for meaningful analysis.
We report that ASLV, MLV, and HIV have quite different preferences for integration sites in the human chromosomes. HIV strongly favors active genes in primary cells as well as in transformed cell lines. MLV favors integration near transcription start regions and favors active genes only weakly. ASLV shows the weakest bias toward integration in active genes and no favoring of integration near transcription start sites. We expect that these same patterns will be seen for MLV and ASLV integration in different human cell types, because all four HIV datasets yielded similar results, though more data on additional cell and tissue types will be helpful to further evaluate the generality.
One of the earliest models for chromosomal influences on integration targeting proposed that condensed chromatin in inactive regions disfavored integration, thereby concentrating integration in more open active chromatin (
However, it seems unlikely that relative accessibility is the only feature directing integration site selection, because HIV, ASLV, and MLV each show such distinctive target sequence preferences. Studies of the Ty retrotransposons of yeast, close relatives of retroviruses, have revealed that interactions with bound chromosomal proteins can tether the Ty integration machinery to chromosomes and thereby direct integration to nearby sites (
The analysis of chromosomal regions favored for integration also suggested a role for locally bound proteins. Chromosomal regions enriched in active genes were generally favorable, but further analysis revealed interleaved favorable and unfavorable regions. Statistical tests indicated that favorable regions were typically short (100–250 kb), and for HIV these were interspersed with unfavorable regions near CpG islands. CpG islands are thought to be regulatory regions that bind distinctive sets of transcription factors. Thus, a simple model to explain targeting is that a distinctive set of sequence-specific DNA-binding proteins bound at or near CpG islands disfavor HIV integration, while proteins bound in active transcription units are favorable. For MLV, the proteins bound at CpG islands instead favor integration.
For ASLV, it is possible that the viral integration machinery does not interact with factors bound in or near genes, explaining the more random distribution of integration sites in the genome. Such a pattern might have evolved to minimize disruption to the host cell chromosomes due to integration. Another possibility, however, is that ASLV does have stricter target site preferences during normal integration in chicken cells, but the targeting system does not function properly in the human cells studied here. According to this idea, putative chicken chromosomal proteins normally bind ASLV integration complexes and direct integration, but the homologous human proteins may be too different to interact properly. It should be possible to investigate this possibility by characterizing ASLV integration in chicken cells, now that the draft chicken genome sequence is completed (
One consequence of the above findings is that integration will differ from tissue to tissue as a consequence of cell-type-specific transcription. To assess effects of tissue-specific transcription, we analyzed HIV integration in three different cell types (SupT1, PBMC, and IMR90). Transcriptional profiling data showed that transcription was significantly different among the three. This allowed an analysis of integration targeting, which showed that highly expressed genes particular to each tissue were favored for integration in that tissue. However, the magnitude of the tissue-specific biases on integration was modest, probably because most of the cellular transcriptional program appears to be common among cell types (
Additional mechanisms could also contribute to targeting. For example, we and others have detected statistically significant biases in integration frequency in whole chromosomes that do not appear to be fully explained by gene density or gene activity (
Our data indicate that ASLV has integration site preferences that may make it attractive as a vector for human gene therapy. MLV-based vectors have the unfavorable preference for integration near transcription start sites (
Each oligonucleotide is described by its name, sequence (written 5′ to 3′), and use, in that order.
To produce HIV vector particles, 293T cells were cotransfected with three plasmids: one encoding the HIV vector segment (p156RRLsinPPTCMVGFPWPRE) (
PBMCs were separated from human blood using a Ficoll gradient (Amersham Biosciences, Little Chalfont, United Kingdom). PBMCs (1 × 10^7) prestimulated with PHA and IL-2, or IMR-90 cells (passage #36) at 30%–50% confluency (1–2 × 10^6 cells), were infected with the HIV-based vector at an MOI of 10 (60 ng p24 per 5 × 10^5 cells). The vector was added to the cells with DEAE-dextran at a final concentration of 5 μg/ml. Forty-eight hours after infection, the cells were pelleted. For RNA isolation, cells were resuspended in 250 μl of PBS and 750 μl of TRIzol and frozen in liquid nitrogen. To determine the extent of infection, cells were analyzed by flow cytometry. For ASLV, supernatant containing RCASBP(A)GFP particles was added to 293T-TVA cells (293T 0.8 cells; a gift from John Young, Salk Institute) at 30%–50% confluency. Forty-eight hours post infection, green fluorescence was seen in approximately 30% of the cells, as determined by examination of the cultures with a fluorescence microscope. DNA was harvested at this point (DNeasy, Qiagen, Valencia, California, United States). RNA from infected cells was also isolated at 48 h post infection (TRIzol) and stored at −80 °C until used for transcriptional profiling analysis. RNA was isolated from infected cell cultures, and samples from each were used for hybridization on one Affymetrix (Santa Clara, California, United States) microarray.
HIV integration sites were cloned by ligation-mediated PCR essentially as described in
Transcriptional profiling was carried out using Affymetrix microarrays as described in
A detailed description of the statistical methods used is presented in Protocols
The GenBank (
We thank members of the Bushman laboratory for helpful discussions, and Dr. John Young for the 293T-TVA cells. This work was supported by National Institutes of Health grants AI52845 and AI34786, the James B. Pendleton Charitable Trust, the Berger Foundation, Robin and Frederic Withington (grant to FDB), and the Fritz B. Burns Foundation (grant to JRE). ARWS was supported by a grant from the Deutsche Forschungsgemeinschaft.
Abbreviations: ASLV, avian sarcoma-leukosis virus; HIV, human immunodeficiency virus; MLV, murine leukemia virus; PBMC, peripheral blood mononuclear cell