Retroviral DNA Integration: Viral and Cellular Determinants of Target-Site Selection

Retroviruses differ in their preferences for sites for viral DNA integration in the chromosomes of infected cells. Human immunodeficiency virus (HIV) integrates preferentially within active transcription units, whereas murine leukemia virus (MLV) integrates preferentially near transcription start sites and CpG islands. We investigated the viral determinants of integration-site selection using HIV chimeras with MLV genes substituted for their HIV counterparts. We found that transferring the MLV integrase (IN) coding region into HIV (to make HIVmIN) caused the hybrid to integrate with a specificity close to that of MLV. Addition of MLV gag (to make HIVmGagmIN) further increased the similarity of target-site selection to that of MLV. A chimeric virus with MLV Gag only (HIVmGag) displayed targeting preferences different from those of both HIV and MLV, further implicating Gag proteins, as well as IN, in targeting. We also report a genome-wide analysis indicating that MLV, but not HIV, favors integration near DNase I–hypersensitive sites (i.e., within ±1 kb), and that HIVmIN and HIVmGagmIN also favored integration near these features. These findings reveal that IN is the principal viral determinant of integration specificity; they also reveal a new role for Gag-derived proteins, and strengthen models for integration targeting based on tethering of viral IN proteins to host proteins.


Introduction
In this document, I examine the association of integration siting with various genomic features.
The numbers are shown below:

Origin.of.data.set
MLV-Burgess     MLVPuro
        917         544

The distribution of the relative frequency of insertions across the chromosomes is given in this barplot:
Are there chromosomes that are particularly favored for integration by one group over the other? This was tested for statistical significance. The test used the likelihood ratio statistic for the logistic regression model (reviewed in [1]), as implemented by the glm function of R with the binomial family. The null hypothesis tested is that the ratio of true integration events in the two groups is constant across all chromosomes. This test attains a p-value of 1.6095e-05.
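As a rough illustration of the test described above, the following sketch computes a likelihood ratio statistic for a table of counts with two groups and K chromosomes. It relies on the fact that, when chromosome is the only predictor, the likelihood ratio statistic of the binomial logistic regression equals the G statistic of the 2 x K contingency table. The report itself used R's glm; the Python function names and the toy counts below are hypothetical and are not the report's data.

```python
import math


def lr_statistic(counts_a, counts_b):
    """Likelihood ratio (G) statistic for a 2 x K table of site counts.

    counts_a[k] and counts_b[k] are the numbers of sites from the two
    groups on chromosome k. Returns (statistic, degrees_of_freedom).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both groups need one count per chromosome")
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each group needs at least one site")
    total = total_a + total_b
    stat = 0.0
    used = 0  # chromosomes that carry at least one site
    for a, b in zip(counts_a, counts_b):
        n = a + b
        if n == 0:
            continue
        used += 1
        for observed, group_total in ((a, total_a), (b, total_b)):
            if observed > 0:
                # Expected count if the group ratio were the same everywhere.
                expected = n * group_total / total
                stat += 2.0 * observed * math.log(observed / expected)
    return stat, used - 1


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term = total = 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite sum of odd powers of sqrt(x).
    root = math.sqrt(x)
    term = root
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(root / math.sqrt(2.0)) + 2.0 * density * total


# Hypothetical counts for three chromosomes (not the report's data).
stat, df = lr_statistic([30, 12, 8], [15, 20, 9])
print(stat, df, chi2_sf(stat, df))
```

A small p-value from chi2_sf would indicate, as in the report, that the ratio of sites between the two groups varies across chromosomes.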

Acembly Genes
Here we examine the relative preference that integration events in the two groups have for genes. In the following plot we show the relative frequency of integrations in genes according to the 'Acembly' annotation. The bars grouped over the label "In Gene" give the relative frequency of integration events (compared to control sites) between bases located within Acembly gene annotations, while the bars over the label "Not in Gene" give the relative frequency of integration events (compared to control sites) between bases not located within Acembly gene annotations. Is there a difference in the tendency for insertions to occur in genes? A formal test of significance yields a p-value of 0.12085.
In the following plot we show the relative frequency of insertions in exons according to the 'Acembly' annotation. The bars grouped over the label "In Exon" give the relative frequency of integration events (compared to control sites) between bases located in exons according to the Acembly annotation, while the bars over the label "Not in Exon" give the relative frequency of integration events (compared to control sites) between bases not located in exons according to the Acembly gene annotation.
Here is the table of coefficients of the log ratio of intensities along with their standard errors, z statistics, and p-values:

                 coef       se         z         p
(Intercept)  -0.63800   0.0929   -6.8700  6.62e-12
in.gene       0.17800   0.1180    1.5100  1.32e-01
in.exon      -0.00828   0.1780   -0.0466  9.63e-01

The model on which these coefficients are based includes terms for whether the site is in a gene or not. Thus, the effect shown as 'in.exon' is net of that due to being in a gene. Note that in the barplot above the 'Not in Exon' bars include both introns and intergenic regions, so the impression given by the table may differ from that given by the barplot.
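The z statistics and p-values in a coefficient table of this kind can be checked from the coefficients and standard errors alone. The sketch below assumes the usual Wald construction (z = coef / se, with a two-sided p-value from the standard normal distribution); the function name wald_test is hypothetical, and the inputs are the rounded values from the Acembly table above, so small differences from the printed p-values are expected.

```python
import math


def wald_test(coef, se):
    """Return (z, two-sided p-value, ratio) for a log-ratio coefficient."""
    z = coef / se
    # Two-sided tail area of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    # Exponentiating a log-ratio coefficient gives the ratio itself.
    return z, p, math.exp(coef)


# Rounded coefficients and standard errors from the table above.
rows = {
    "(Intercept)": (-0.63800, 0.0929),
    "in.gene": (0.17800, 0.1180),
    "in.exon": (-0.00828, 0.1780),
}
for name, (coef, se) in rows.items():
    z, p, ratio = wald_test(coef, se)
    print(f"{name:12s} z={z:8.3f} p={p:.3g} ratio={ratio:.3f}")
```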

refGenes
Here we examine the relative preference that insertions of the two types have for genes. In the following plot we show the relative frequency of insertions in genes according to the 'refGene' annotation. Is there a tendency for insertions to occur in genes? A formal test of significance yields a p-value of 0.65712.
In the following plot we show the relative frequency of insertions in exons according to the 'refGene' annotation. The model on which these coefficients are based includes terms for whether the site is in a gene or not. Thus, the effect shown as 'in.exon' is net of that due to being in a gene. Note that in the barplot above the 'Not in Exon' bars include both introns and intergenic regions, so the impression given by the table may differ from that given by the barplot.

ensGenes
Here we examine the relative preference that insertions of the two types have for genes. In the following plot we show the relative frequency of insertions in genes according to the 'ensGene' annotation. Is there a tendency for insertions to occur in genes? A formal test of significance yields a p-value of 0.28032.
In the following plot we show the relative frequency of insertions in exons according to the 'ensGene' annotation. The model on which these coefficients are based includes terms for whether the site is in a gene or not. Thus, the effect shown as 'in.exon' is net of that due to being in a gene. Note that in the barplot above the 'Not in Exon' bars include both introns and intergenic regions, so the impression given by the table may differ from that given by the barplot.

genScan Genes
Here we examine the preference that insertions have for genes. In the following plot we show the relative frequency of insertions in genes according to the 'genScan' annotation. Is there a tendency for insertions to occur in genes? A formal test of significance yields a p-value of 0.8604.
In the following plot we show the relative frequency of insertions in exons according to the 'genScan' annotation. The model on which these coefficients are based includes terms for whether the site is in a gene or not. Thus, the effect shown as 'in.exon' is net of that due to being in a gene. Note that in the barplot above the 'Not in Exon' bars include both introns and intergenic regions, so the impression given by the table may differ from that given by the barplot.

uniGenes
Here we examine the preference that insertions have for genes. In the following plot we show the relative frequency of insertions in genes according to the 'uniGene' annotation. Is there a tendency for insertions to occur in genes? A formal test of significance yields a p-value of 0.76307.
In the following plot we show the relative frequency of insertions in exons according to the 'uniGene' annotation. The model on which these coefficients are based includes terms for whether the site is in a gene or not. Thus, the effect shown as 'in.exon' is net of that due to being in a gene. Note that in the barplot above the 'Not in Exon' bars include both introns and intergenic regions, so the impression given by the table may differ from that given by the barplot.

CpG Island Neighborhoods
Here we study the effect of being in the neighborhood of CpG islands. Following Wu et al. [2], who found that the neighborhoods within ±1kb of CpG islands are enriched for MLV insertions, we study such neighborhoods.
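The neighborhood definition can be made concrete with a small sketch. It assumes that CpG islands are given as closed (start, end) intervals on a single chromosome and that a site counts as "in the neighborhood" when it lies inside an island or within the chosen number of bases of either edge. The function names and coordinates are hypothetical and are not taken from the report.

```python
def in_neighborhood(site, islands, window):
    """True if site lies in an island or within `window` bases of one."""
    return any(start - window <= site <= end + window for start, end in islands)


def fraction_in_neighborhood(sites, islands, window):
    """Fraction of sites that fall in the island neighborhoods."""
    if not sites:
        raise ValueError("need at least one site")
    hits = sum(in_neighborhood(site, islands, window) for site in sites)
    return hits / len(sites)


# Hypothetical islands and integration sites on one chromosome.
islands = [(10_000, 11_000), (200_000, 201_500)]
sites = [9_500, 12_500, 150_000, 202_000]

# The same window widths as the subsections that follow (1 kb to 50 kb).
for window in (1_000, 5_000, 10_000, 25_000, 50_000):
    print(window, fraction_in_neighborhood(sites, islands, window))
```

Comparing this fraction between integration sites and control sites, for each window width, gives the kind of contrast shown in the plots of this section.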

1 kilobase neighborhoods
The following plot shows the effect of being in or within ±1kb of a CpG island:

5 kilobase neighborhoods
The following plot shows the effect of being in or within ±5kb of a CpG island:

10 kilobase neighborhoods
The following plot shows the effect of being in or within ±10kb of a CpG island:
A formal test of significance comparing the difference attains a p-value of 0.024048.

25 kilobase neighborhoods
The following plot shows the effect of being in or within ±25kb of a CpG island:
A formal test of significance comparing the difference attains a p-value of 8.0113e-06.

50 kilobase neighborhoods
The following plot shows the effect of being in or within ±50kb of a CpG island:

Gene Density, Expression 'Density', and CpG Island Density
In this section the association with gene density is examined. The 'genes' that are counted are the genes represented on the microarray. In addition, we count the number of such genes expressed at various levels. The levels are:

low.ex   Count genes whose expression is in the upper half and divide by number of bases
med.ex   Count genes whose expression is in the upper 1/8th and divide by number of bases
high.ex  Count genes whose expression is in the upper 1/16th and divide by number of bases

The terms low.ex, med.ex, and high.ex are used as abbreviations in what follows. The abbreviation dens is used to indicate gene density as number of genes per base.
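The definitions of dens, low.ex, med.ex, and high.ex can be sketched as follows. The report does not state how a gene is assigned to a window or how the expression cutoffs are computed, so this sketch makes two assumptions: each gene is represented by a single position and is counted when that position falls in a window centered on the site, and the cutoffs are taken by rank over all genes on the array. All names and values are hypothetical.

```python
def upper_cutoff(expression_values, fraction):
    """Smallest expression value that is still in the top `fraction` by rank."""
    ordered = sorted(expression_values, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return ordered[k - 1]


def window_densities(site, genes, width):
    """Densities (genes per base) in a window of `width` bases around `site`.

    genes is a list of (position, expression) pairs for the arrayed genes.
    """
    all_expression = [expr for _, expr in genes]
    cutoffs = {
        "low.ex": upper_cutoff(all_expression, 1 / 2),
        "med.ex": upper_cutoff(all_expression, 1 / 8),
        "high.ex": upper_cutoff(all_expression, 1 / 16),
    }
    half = width // 2
    in_window = [expr for pos, expr in genes if site - half <= pos <= site + half]
    result = {"dens": len(in_window) / width}
    for name, cutoff in cutoffs.items():
        result[name] = sum(expr >= cutoff for expr in in_window) / width
    return result


# Hypothetical array of 16 genes, one every 1000 bases, expression 1..16.
genes = [(1000 * i, float(i)) for i in range(1, 17)]
print(window_densities(8000, genes, 10_000))
```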

25 kiloBase Window
In the barplot that follows we examine the association of insertion sites with gene density in a 25 kilobase window surrounding each locus. More such plots will follow, and the method of their construction is always to try to divide the data according to the deciles of density. However, the distribution of density is often very skewed, and often even the 90th percentile is zero. In that case, the barplots simply show the sites for which the density is zero and those for which it is non-zero. If there are fewer than ten groups of bars, then the groupings contain ten percent of the sites each, except for the leftmost grouping, which contains all of the remaining sites.
Also note that the title of the plot contains clues as to its content; the prefix indicates the type of variable studied while the suffix indicates the window width in the number of bases. The p-value given is the result of fitting a cubic polynomial to the gene density values.
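The grouping rule described above (deciles where possible, collapsing to fewer groups when many sites share the same density) might be sketched as follows. The percentile convention and the function names are assumptions; the report does not give the exact rule used for its plots.

```python
from bisect import bisect_left


def decile_breaks(values):
    """Distinct values found at the 10th, 20th, ..., 90th percentiles."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value")
    # Index of the smallest value with at least k/10 of the data at or below it.
    cuts = [ordered[(n * k + 9) // 10 - 1] for k in range(1, 10)]
    return sorted(set(cuts))


def assign_groups(values):
    """Group index for each value; tied deciles collapse into the leftmost group."""
    breaks = decile_breaks(values)
    return [bisect_left(breaks, value) for value in values]


# Hypothetical, very skewed densities: 95 zeros and 5 non-zero values.
densities = [0.0] * 95 + [1e-6, 2e-6, 3e-6, 4e-6, 5e-6]
groups = assign_groups(densities)
print(sorted(set(groups)))
```

With these toy values every decile is zero, so only two groups remain: sites with zero density and sites with non-zero density.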
The following expression data and probe set were used for this report:

[1] "HeLa_exp_data-HU133a"

Here are the results for expression density. First, we count just genes that are in the upper half.

50 kiloBase Window
In the barplot that follows we examine the association of insertion sites with expression density in a 50 kilobase window surrounding each locus. First, we count just the number of genes represented on the chip. Here are the results for expression density. First, we count just genes that are in the upper half.

100 kiloBase Window
In the barplot that follows we examine the association of insertion sites with expression density in a 100 kilobase window surrounding each locus. First, we count just the number of genes represented on the chip.

250 kiloBase Window
In the barplot that follows we examine the association of insertion sites with expression density in a 250 kilobase window surrounding each locus. First, we count just the number of genes represented on the chip.

16 megaBase Window
In the barplot that follows we examine the association of insertion sites with expression density in a 16 megabase window surrounding each locus. First, we count just the number of genes represented on the chip.

32 megaBase Window
In the barplot that follows we examine the association of insertion sites with expression density in a 32 megabase window surrounding each locus. First, we count just the number of genes represented on the chip. Now we count genes in the upper 1/8th:

Category limits

           lower   category          upper
1   1.551484e-07    group.1   4.047619e-07
2   4.047619e-07    group.2   5.375000e-07
3   5.375000e-07    group.3   6.435583e-07
4   6.435583e-07    group.4   7.867708e-07
5   7.867708e-07    group.5   9.072917e-07
6   9.072917e-07    group.6   1.078542e-06
7   1.078542e-06    group.7   1.315626e-06
8   1.315626e-06    group.8   1.793056e-06
9   1.793056e-06    group.9   2.168327e-06
10  2.168327e-06   group.10   4.529922e-06

The next plot studies the distance to the nearest boundary between a gene and a non-gene region. The distance is expressed as a fraction of the length of the region. Thus, '0.25' means that the distance from the site to the nearest boundary is one quarter of the total width of the region.
This plot studies the effect of nearness to the beginning of a transcript. For sites in genes, it is the distance to the start of the gene divided by the width of the gene. For other sites it is the distance from the site to the nearer gene if that gene boundary is also a transcription starting point. Locations near '0' are relatively near the beginning of transcription, while those near '1' are near the termination of the transcript.
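For sites inside genes, the relative position along the transcript can be sketched as below. The sketch covers only the in-gene case and assumes that the transcription start is the lower coordinate on the '+' strand and the higher coordinate on the '-' strand; this strand handling, the function name, and the coordinates are assumptions, since the report does not spell them out.

```python
def relative_position_in_gene(site, gene_start, gene_end, strand):
    """Distance from the transcription start as a fraction of gene width.

    gene_start < gene_end are genomic coordinates; strand is '+' or '-'.
    Values near 0 are close to the start of transcription, values near 1
    are close to its termination.
    """
    width = gene_end - gene_start
    if width <= 0:
        raise ValueError("gene_end must be greater than gene_start")
    if not gene_start <= site <= gene_end:
        raise ValueError("site is not inside the gene")
    if strand == "+":
        return (site - gene_start) / width
    if strand == "-":
        return (gene_end - site) / width
    raise ValueError("strand must be '+' or '-'")


# Hypothetical gene spanning bases 1000 to 2000.
print(relative_position_in_gene(1250, 1000, 2000, "+"))
print(relative_position_in_gene(1250, 1000, 2000, "-"))
```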

GC content
Here we study the effect of GC content on insertion. The GC content is taken from the Human Genome Draft at GoldenPath from the

Cytobands
Here we study the association of cytoband with insertion intensity. The data are obtained from http://genome.ucsc.edu/goldenPath/hg17/database/cytoBand.txt.gz. A formal test of significance attains a p-value of 0.080486.