Retroviral DNA Integration: ASLV, HIV, and MLV Show Distinct Target Site Preferences

The completion of the human genome sequence has made possible genome-wide studies of retroviral DNA integration. Here we report an analysis of 3,127 integration site sequences from human cells. We compared retroviral vectors derived from human immunodeficiency virus (HIV), avian sarcoma-leukosis virus (ASLV), and murine leukemia virus (MLV). Effects of gene activity on integration targeting were assessed by transcriptional profiling of infected cells. Integration by HIV vectors, analyzed in two primary cell types and several cell lines, strongly favored active genes. An analysis of the effects of tissue-specific transcription showed that it resulted in tissue-specific integration targeting by HIV, though the effect was quantitatively modest. Chromosomal regions rich in expressed genes were favored for HIV integration, but these regions were found to be interleaved with unfavorable regions at CpG islands. MLV vectors showed a strong bias in favor of integration near transcription start sites, as reported previously. ASLV vectors showed only a weak preference for active genes and no preference for transcription start regions. Thus, each of the three retroviruses studied showed unique integration site preferences, suggesting that virus-specific binding of integration complexes to chromatin features likely guides site selection.

The first model is a baseline with no terms in it. The second model has only a single term indicating whether an insertion or a matching site is in an Acembly gene. As is evident from the substantial decrement in the deviance (and the associated small p-value from the likelihood ratio test) due to this one term, being in a gene has a marked effect on integration. The intensity for integration is 2.84 times as great at a locus in a gene as at a locus that is not in a gene. The third model allows the effect of being in a gene to differ among the six data sets, and it is apparent that there are differences. To get further detail on these differences, pairwise comparisons among the data sets are performed (using Wald tests). These are summarized in the following table, in which the 'log.ratio' column gives the logarithm of the ratio of integration intensity in the first data set divided by that in the second listed data set. As is evident, loci in genes are relatively less attractive as integration sites in the 'ASLV' data than in the other data sets.
Model 4 adds a term for whether an insertion or a matching site is in an Acembly exon, which results in a statistically significant decrement in the deviance. The intensity for integration is 1.486 times as great at a locus in an exon as at a locus that is not in an exon.
Finally, Model 5 allows for differences among the effects of being in an exon in the six data sets, but no differences are apparent.
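The model comparisons above rest on likelihood ratio tests: the drop in deviance between two nested models is referred to a chi-square distribution whose degrees of freedom equal the number of added terms. The following is a minimal Python sketch of that calculation, not the code used for the analysis. The function names and the example deviances are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    tail = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)  # first series term
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + total


def likelihood_ratio_test(deviance_small, deviance_large, extra_terms):
    """Return (deviance drop, p-value) for a smaller model nested in a larger one."""
    drop = deviance_small - deviance_large
    return drop, chi2_sf(drop, extra_terms)


# Hypothetical deviances, for illustration only.
drop, p_value = likelihood_ratio_test(1000.0, 990.0, extra_terms=1)
print(f"deviance drop = {drop:.1f}, p = {p_value:.4g}")
```

A large drop in deviance for few added terms gives a small p-value, which is the pattern described for the 'in gene' term.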

Positioning in or near genes
In this section we examine whether the position of a locus relative to the start of the coding region of a gene influences integration. We begin with a model that uses 'feature width', the distance from the last gene boundary to the next one. This quantity is studied because it forms the denominator of the 'distance to start' measure. For insertions that are in genes, that measure gives the fraction of distance from the coding start site to the insertion site; for insertions that are not in genes, it gives the distance to the nearest gene if that gene is transcribed in the direction leading away from the insertion.
The analysis of deviance table compares three models: the null model, which allows regions in genes to differ according to the data set from which they came; a model that adds log(feature distance); and a model that allows the log(feature distance) terms to vary according to data set. As is evident, most of the improvement in model fit is achieved in passing from model 1 to model 2. A similar picture is obtained by using a somewhat richer model for feature width, namely one that uses B-splines for log(feature distance) with two interior knots, as the analogous analysis of deviance table shows.
The next table considers the distance from/to the start of transcription. For insertions that are in genes, this distance is the distance from the coding start site to the insertion site divided by the length of the gene. For insertions that are not in genes, it is the distance to the nearest gene divided by the distance between genes if that gene is transcribed in the direction leading away from the insertion. Otherwise the distance to the nearest gene divided by the distance between genes is used. As is evident, there are statistically significant reductions in the deviance in each step. The following table uses B-splines with two interior knots for start distance; again it is evident that there are statistically significant reductions in the deviance in each step.
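As an illustration of the 'start distance' measure for insertions that fall inside a gene, the sketch below computes the fraction of the gene traversed from its start to the insertion site, taking strand into account. It is a sketch under stated assumptions: the function name, the coordinate convention, and the use of the annotated gene boundaries as the coding start and end are hypothetical, and the intergenic case is deliberately omitted.

```python
def start_distance_in_gene(position, gene_start, gene_end, strand):
    """Fractional distance from the gene start to an insertion inside the gene.

    Coordinates are on the forward strand with gene_start < gene_end.
    Returns a value in [0, 1]; small values lie near the 5' end.
    """
    if not gene_start <= position <= gene_end:
        raise ValueError("position is not inside the gene")
    length = gene_end - gene_start
    if length <= 0:
        raise ValueError("gene must have positive length")
    if strand == "+":
        return (position - gene_start) / length
    if strand == "-":
        # On the reverse strand the 5' end is at gene_end.
        return (gene_end - position) / length
    raise ValueError("strand must be '+' or '-'")


# Hypothetical gene spanning 1000..2000 with an insertion at 1100.
print(start_distance_in_gene(1100, 1000, 2000, "+"))
print(start_distance_in_gene(1100, 1000, 2000, "-"))
```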
Thus, the distance from (or to) the start site affects the integration intensity and does so differently in the different data sets. Here are the pairwise comparisons between the data sets for the start distance. Note that all of the comparisons with MLV are statistically significant, three of the ASLV comparisons are statistically significant, and none of the other pairwise comparisons are statistically significant. However, it is worth noting that the actual reduction in deviance is generally small; in part this is a consequence of there being little influence on integration in most of the data sets.
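Pairwise comparisons of this kind are Wald tests: an estimated log ratio is divided by its standard error and referred to the standard normal distribution. The Python sketch below shows the arithmetic, together with the conversion of a log ratio into an intensity ratio with an approximate 95 percent confidence interval. The function names are hypothetical, and the standard error in the example is invented for illustration; only the ratio 2.84 comes from the text.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95 percent normal quantile


def wald_test(log_ratio, se):
    """Return (z statistic, two-sided p-value) for a log ratio and its standard error."""
    z = log_ratio / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


def intensity_ratio_ci(log_ratio, se, z_crit=Z_95):
    """Return (ratio, lower, upper) on the intensity-ratio scale."""
    return (
        math.exp(log_ratio),
        math.exp(log_ratio - z_crit * se),
        math.exp(log_ratio + z_crit * se),
    )


# log(2.84) is the in-gene effect quoted earlier; se = 0.05 is hypothetical.
z, p = wald_test(math.log(2.84), 0.05)
ratio, lower, upper = intensity_ratio_ci(math.log(2.84), 0.05)
print(f"z = {z:.2f}, p = {p:.3g}, ratio = {ratio:.2f} ({lower:.2f}, {upper:.2f})")
```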

Analysis of Deviance
Since it is of interest to determine whether ASLV shares the preference of MLV for integrating into the 5' end of genes, we examine the empirical distribution of integration sites by forming a barplot for each data set in which the relative intensity of integration is plotted for 10 intervals of 'start distance'. It appears that the integration intensity in the 5' end of a gene is somewhat elevated in the ASLV data, but not by nearly so much as in the MLV data. We test this directly by fitting a model that includes, for each data set, an indicator variable for whether a site is in a gene and has a 'start distance' of less than 0.1. Here is a table of results for this model. The 'se' column gives the standard error of the logarithm of the relative intensity for integration. The effect for ASLV does not reach the conventional 0.05 significance level, and the preference for integration near the 5' end of a gene appears markedly lower than that for MLV. Here are the pairwise comparisons:

Gene Density
In this section the effects of the local density of genes and of expressed genes are studied. The Ensembl genes and EST collection described in Versteeg et al. [2] are used. For every insertion site, a region around that site is searched for genes and for expressed genes (i.e. those having EST counts greater than zero). The local expression score is given by $d_{\text{score}} = \frac{1}{w}\sum_i \min(c_i, 200)$, where $c_i$ is the EST count of gene $i$ in the region and $w$ is the width of the region; i.e. the EST count is trimmed at 200 and the sum of all truncated counts in the region is divided by its width. In addition, a score for high EST counts is given by $d_{\text{high}} = \frac{1}{w}\sum_i \max(c_i - 200, 0)$. Note that $d_{\text{score}} + d_{\text{high}}$ is just the EST count divided by the region width. The motivation for decomposing it into several pieces is that the distribution of EST counts has a very long upper tail, and there was a suspicion that the impact of a single EST with a very high count would be much less than that of a number of ESTs whose total was equally high.
As it turns out, the resulting scores are often zero and also have rather long upper tails. Preliminary analysis suggested forming a zero-one indicator variable to flag those sites at which the score is zero, together with a quantitative variable that is the logarithm of the score for non-zero scores. When the original score is zero, the quantitative variable takes the median of those log scores in its place.
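The trimming at 200 and the zero-indicator encoding described above can be sketched in Python as follows. This is an illustrative sketch rather than the original analysis code: the function names and example counts are hypothetical, and only the threshold of 200 and the median substitution come from the text.

```python
import math
from statistics import median

EST_CAP = 200  # trimming threshold stated in the text


def expression_scores(est_counts, region_width):
    """Return (d_score, d_high) for the EST counts of genes in one region."""
    d_score = sum(min(c, EST_CAP) for c in est_counts) / region_width
    d_high = sum(max(c - EST_CAP, 0) for c in est_counts) / region_width
    return d_score, d_high


def encode_scores(scores):
    """Encode each score as (zero indicator, log score).

    Zero scores receive the median of the logs of the non-zero scores.
    """
    logs = [math.log(s) for s in scores if s > 0]
    if not logs:
        raise ValueError("at least one non-zero score is required")
    fill = median(logs)
    return [(1, fill) if s == 0 else (0, math.log(s)) for s in scores]


# Hypothetical region of 500 kb containing three genes.
d_score, d_high = expression_scores([0, 50, 1000], 500_000)
print(d_score, d_high, d_score + d_high)
print(encode_scores([0.0, 1.0, 100.0]))
```

Keeping the indicator and the log score as separate regressors lets the model fit the many zero scores without letting them distort the slope estimated from the non-zero scores.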
The following analysis of deviance table shows the effects of gene density in a 500 kilobase region surrounding each site. The null model has effects for the genes that differ according to data source, the next model has an indicator for zero score and the quantitative (log-score) variable included, and the final model allows the effects of the indicator and quantitative score to vary according to data source. As is evident, the bulk of the decrease in deviance is in the first step. Still the second step does attain statistical significance, indicating differences among the data sources with respect to the effect of gene density. Here is the analogous analysis of deviance table using B-splines with two interior knots for the quantitative scores. Again the bulk of the decrease in deviance is in the first step, although the second step is also statistically significant. Perhaps it is also worth noting that the use of the B-splines results in a modest improvement in the deviance over using the original score.

The following table gives the analysis of deviance for $d_{\text{countexpressed}}$. Notice that the total decrement in deviance is substantially greater than that seen for gene density per se.

Here is the table for the expression score, $d_{\text{score}}$:

Cytobands
Here the effect of being in a G-band is studied. The cytoband data are coded as 'Gscore', which assigns the values 0, 0.25, 0.5, 0.75, and 1.00 to the codes 'gneg', 'gpos25', 'gpos50', 'gpos75', and 'gpos100', respectively. The point of departure is a model that includes the effect of expression 'density' and data-set-specific effects of being in a gene. The analysis of deviance table shows that the incremental effect of accounting for 'Gscore' after taking account of whether an insertion is in a gene or an 'expression dense' region is statistically significant and that the Gscore effect varies among the data sets. However, the magnitude of the decrease in deviance is rather small, which suggests that the effects are modest.

CpG Islands
Wu et al. [3] noted an eight-fold difference in insertion in regions within ±1 kb of CpG islands. Using the annotated locations of the CpG islands from http://genome.ucsc.edu/goldenPath/14nov2002/database/cpgIsland.txt.gz we determined whether the insertion site was within ±1 kb, within ±5 kb, or within ±10 kb of a CpG island.
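Determining whether a site lies within a given window of a CpG island amounts to finding the distance to the nearest annotated island. A minimal Python sketch is given below. It assumes that the islands for one chromosome have already been parsed into sorted, non-overlapping (start, end) pairs; the function names and the example coordinates are hypothetical, and the parsing of the annotation file is not shown.

```python
from bisect import bisect_left


def distance_to_nearest_island(position, islands):
    """Distance in bp from position to the nearest island (0 if inside one).

    islands: list of (start, end) pairs on one chromosome, sorted by start
    and non-overlapping. Returns None if the list is empty.
    """
    starts = [start for start, _ in islands]
    i = bisect_left(starts, position)
    best = None
    # Only the island starting before the position and the next one can be nearest.
    for j in (i - 1, i):
        if 0 <= j < len(islands):
            start, end = islands[j]
            if start <= position <= end:
                d = 0
            elif position < start:
                d = start - position
            else:
                d = position - end
            best = d if best is None else min(best, d)
    return best


def near_cpg_flags(position, islands, windows_bp=(1_000, 5_000, 10_000)):
    """Map each window size to True if the site lies within that distance of an island."""
    d = distance_to_nearest_island(position, islands)
    return {w: d is not None and d <= w for w in windows_bp}


# Hypothetical islands and insertion site.
islands = [(10_000, 11_000), (50_000, 50_500)]
print(near_cpg_flags(12_500, islands))
```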
Here is the analysis of deviance table for regions within 1 kilobase of a CpG island (or in the island). The point of departure is a model that includes the effect of expression 'density' and data-set-specific effects of being in a gene. Analogous analysis of deviance tables are given for regions within 5 kilobases, within 10 kilobases, and within 50 kilobases of a CpG island (or in the island).
Here is a plot of the relative intensity of integration (after accounting for the effects of being in a gene and the expression density) based on the regression coefficients above. The 'error bar' drawn with each colored bar indicates the range of the 95 percent confidence interval. Error bars that do not cross the horizontal line for relative intensity = 1.0 indicate preference for (or avoidance of) sites near CpG islands.
[Plot panels: CpG ± 1 kb, CpG ± 5 kb, CpG ± 10 kb, CpG ± 25 kb, CpG ± 50 kb]
Evidently the effects are generally strongest near the CpG islands and tend to be in the direction of suppressing integration for the HIV data sets while increasing it for ASLV and MLV.

GC content
Using the annotations of GC content from http://genome.ucsc.edu/goldenPath/14nov2002/database/gcPercent.txt.gz we determined the GC content of the region surrounding the insertion site. Here is the analysis of deviance table taking a model that includes the effect of expression 'density' and data-set-specific effects of being in a gene as the point of departure:
[...] of the variables at a time is only about half of the value obtained for dropping all of them. This is due to correlation among regressor variables, particularly gene density and expression density, whose joint effects account for roughly one third of the deviance explained by all variables.
The following table gives a somewhat more detailed view of these results. The proportion of deviance accounted for by the model that includes all terms in each of the cell lines is given by the 'fit.all' column, while each of the 'drop' columns gives the proportion of deviance accounted for by all terms but the one that is listed.
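The 'fit.all' and 'drop' columns can be read as proportions of the null deviance removed by a model. The Python sketch below illustrates that bookkeeping under the assumption that the proportion is (null deviance minus model deviance) divided by null deviance; the source does not state the formula, the function names are hypothetical, and all deviances shown are invented for illustration.

```python
def proportion_explained(null_deviance, model_deviance):
    """Fraction of the null deviance accounted for by a model."""
    return (null_deviance - model_deviance) / null_deviance


def drop_one_summary(null_deviance, full_deviance, drop_deviances):
    """Build one table row: 'fit.all' plus one 'drop.<term>' entry per omitted term.

    drop_deviances maps a term name to the deviance of the model without it.
    """
    row = {"fit.all": proportion_explained(null_deviance, full_deviance)}
    for term, deviance in drop_deviances.items():
        row[f"drop.{term}"] = proportion_explained(null_deviance, deviance)
    return row


# Invented deviances for a single data set.
row = drop_one_summary(
    null_deviance=1000.0,
    full_deviance=800.0,
    drop_deviances={"gene.density": 810.0, "expression.density": 820.0},
)
for name, value in row.items():
    print(f"{name}: {value:.3f}")
```

When regressors are correlated, each 'drop' value stays close to 'fit.all' because the remaining terms absorb much of the omitted term's effect, which is consistent with the remark about gene density and expression density.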