HTLV-1 Integration into Transcriptionally Active Genomic Regions Is Associated with Proviral Expression and with HAM/TSP

Human T-lymphotropic virus type 1 (HTLV-1) causes leukaemia or chronic inflammatory disease in ∼5% of infected hosts. The level of proviral expression of HTLV-1 differs significantly among infected people, even at the same proviral load (proportion of infected mononuclear cells in the circulation). A high level of expression of the HTLV-1 provirus is associated with a high proviral load and a high risk of the inflammatory disease of the central nervous system known as HTLV-1-associated myelopathy/tropical spastic paraparesis (HAM/TSP). But the factors that control the rate of HTLV-1 proviral expression remain unknown. Here we show that proviral integration sites of HTLV-1 in vivo are not randomly distributed within the human genome but are associated with transcriptionally active regions. Comparison of proviral integration sites between individuals with high and low levels of proviral expression, and between provirus-expressing and provirus non-expressing cells from within an individual, demonstrated that frequent integration into transcription units was associated with an increased rate of proviral expression. An increased frequency of integration sites in transcription units in individuals with high proviral expression was also associated with the inflammatory disease HAM/TSP. By comparing the distribution of integration sites in human lymphocytes infected in short-term cell culture with those from persistent infection in vivo, we infer the action of two selective forces that shape the distribution of integration sites in vivo: positive selection for cells containing proviral integration sites in transcriptionally active regions of the genome, and negative selection against cells with proviral integration sites within transcription units.


Introduction
In this document, I examine the association of integration siting with various genomic features.
The numbers are shown below:
Origin.of.data.set   MRC-inVivo   Meekings-inVivo
Count                3110         313
The distribution of relative frequency of insertions across the chromosomes is given in this barplot:
Are there chromosomes that are particularly favored for integration by one group over the other? This was tested for statistical significance. The test used the likelihood ratio statistic for the logistic regression model (reviewed in [1]) as implemented by the glm function of R using the binomial family. The null hypothesis tested is that the ratio of true integration events in the two groups is constant across all chromosomes. This test attains a p-value of 0.0011709.
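As a rough illustration of the chromosome test, the sketch below computes a likelihood ratio statistic from per-chromosome counts. It assumes the logistic model has one term per chromosome and is compared with an intercept-only model; in that case the deviance difference equals the G statistic of the 2 x K count table. The function names and the example counts are hypothetical, and this is not the report's R code.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: start from Q(1/2, x/2) = erfc(sqrt(x/2)), then add recurrence terms
    total = math.erfc(math.sqrt(half))
    for j in range((df - 1) // 2):
        total += math.exp(-half) * half ** (j + 0.5) / math.gamma(j + 1.5)
    return total

def lr_test_constant_ratio(counts_a, counts_b):
    """counts_a[i], counts_b[i]: sites of each group on chromosome i.

    Assumes every chromosome listed has at least one site in total.
    Returns (statistic, degrees of freedom, p-value).
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    g = 0.0
    for a, b in zip(counts_a, counts_b):
        row = a + b
        for observed, group_total in ((a, total_a), (b, total_b)):
            if observed > 0:
                expected = row * group_total / grand
                g += 2.0 * observed * math.log(observed / expected)
    df = len(counts_a) - 1
    return g, df, chi2_sf(g, df)

# Hypothetical counts for two chromosomes, not taken from the data sets
stat, df, p = lr_test_constant_ratio([30, 10], [10, 30])
```

A small p-value indicates that the ratio between the two groups varies from chromosome to chromosome.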

Acembly Genes
Here we examine the relative preference that integration events in the two groups have for genes. In the following plot we show the relative frequency of integrations in genes according to the 'Acembly' annotation. The bars grouped over the label "In Gene" give the relative frequency of integration events (compared to control sites) between bases located within Acembly gene annotations, while those over the label "Not in Gene" give the relative frequency of integration events (compared to control sites) between bases not located within Acembly gene annotations. Is there a difference in the tendency for insertions to occur in genes? A formal test of significance yields a p-value of 0.89336.
In the following plot we show the relative frequency of insertions in exons according to the 'Acembly' annotation. The bars grouped over the label "In Exon" give the relative frequency of integration events (compared to control sites) between bases located in exons according to the Acembly annotation, while those over the label "Not in Exon" give the relative frequency of integration events (compared to control sites) between bases not located in exons according to the Acembly gene annotation. The model on which these coefficients are based includes terms for whether the site is in a gene or not. Thus, the effect shown as 'in.exon' is net of that due to being in a gene. Note that in the barplot above the 'Not in Exon' bars include both introns and intergenic regions, so the impression given by the table may differ from that for the barplot.
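The difference between the 'in.exon' coefficient and the barplot can be seen in a small sketch. It assumes a model with one indicator for being in a gene and one for being in an exon, where exon sites carry both indicators; with three site classes and three parameters the fitted coefficients are simply differences of observed log odds. The name 'in.gene', the function names, and the counts are hypothetical.

```python
import math

def log_odds(true_sites, control_sites):
    """Log ratio of true sites to control sites in one class."""
    return math.log(true_sites / control_sites)

def nested_coefficients(counts):
    """counts maps 'intergenic', 'intron', 'exon' to (true, control) counts."""
    inter = log_odds(*counts["intergenic"])
    intron = log_odds(*counts["intron"])
    exon = log_odds(*counts["exon"])
    return {
        "intercept": inter,
        "in.gene": intron - inter,   # intron versus intergenic
        "in.exon": exon - intron,    # exon versus intron, net of being in a gene
    }

# Hypothetical counts, for illustration only
counts = {"intergenic": (40, 60), "intron": (50, 35), "exon": (10, 5)}
coefs = nested_coefficients(counts)

# What a two-bar plot would suggest: exon versus everything else pooled
pooled_true = counts["intergenic"][0] + counts["intron"][0]
pooled_control = counts["intergenic"][1] + counts["intron"][1]
barplot_contrast = log_odds(*counts["exon"]) - log_odds(pooled_true, pooled_control)
```

With these counts the pooled contrast is larger than the 'in.exon' coefficient, because part of the apparent exon effect is the effect of being in a gene at all.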

refGenes
Here we examine the relative preference that insertions of the two types have for genes. In the following plot we show the relative frequency of insertions in genes according to the 'refGene' annotation. Is there a tendency for insertions to occur in genes? A formal test of significance yields a p-value of 0.80619.
In the following plot we show the relative frequency of insertions in exons according to the 'refGene' annotation. The model on which these coefficients are based includes terms for whether the site is in a gene or not. Thus, the effect shown as 'in.exon' is net of that due to being in a gene. Note that in the barplot above the 'Not in Exon' bars include both introns and intergenic regions, so the impression given by the table may differ from that for the barplot.

ensGenes
Here we examine the relative preference that insertions of the two types have for genes. In the following plot we show the relative frequency of insertions in genes according to the 'ensGene' annotation. Is there a tendency for insertions to occur in genes? A formal test of significance yields a p-value of 0.93144.
In the following plot we show the relative frequency of insertions in exons according to the 'ensGene' annotation. The model on which these coefficients are based includes terms for whether the site is in a gene or not. Thus, the effect shown as 'in.exon' is net of that due to being in a gene. Note that in the barplot above the 'Not in Exon' bars include both introns and intergenic regions, so the impression given by the table may differ from that for the barplot.

genScan Genes
Here we examine the preference that insertions have for genes. In the following plot we show the relative frequency of insertions in genes according to the 'genScan' annotation. Is there a tendency for insertions to occur in genes? A formal test of significance yields a p-value of 0.37354.
In the following plot we show the relative frequency of insertions in exons according to the 'genScan' annotation. The model on which these coefficients are based includes terms for whether the site is in a gene or not. Thus, the effect shown as 'in.exon' is net of that due to being in a gene. Note that in the barplot above the 'Not in Exon' bars include both introns and intergenic regions, so the impression given by the table may differ from that for the barplot.

uniGenes
Here we examine the preference that insertions have for genes. In the following plot we show the relative frequency of insertions in genes according to the 'uniGene' annotation. Is there a tendency for insertions to occur in genes? A formal test of significance yields a p-value of 0.97151.
In the following plot we show the relative frequency of insertions in exons according to the 'uniGene' annotation. The model on which these coefficients are based includes terms for whether the site is in a gene or not. Thus, the effect shown as 'in.exon' is net of that due to being in a gene. Note that in the barplot above the 'Not in Exon' bars include both introns and intergenic regions, so the impression given by the table may differ from that for the barplot.

oncogenes
Here we examine the preference that insertions have for oncogenes. In the following plot we show the relative frequency of insertions within 50kb of an oncogene 5' end. The plot suggests a tendency for insertions to occur near oncogenes. A formal test of significance yields a p-value of 0.055964, which is marginal and falls just short of the conventional 0.05 level.
Here is the table of coefficients of the log ratio of intensities for true insertion sites versus control insertion sites along with their standard errors, z statistics, and p-values for each data set:

CpG Island Neighborhoods
Here we study the effect of being in the neighborhood of CpG islands. Following Wu et al. [2], who found that the neighborhoods within ±1kb of CpG islands are enriched for MLV insertions, we study such neighborhoods.
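A minimal sketch of the neighborhood definition follows, assuming CpG islands are given as (start, end) intervals on the same chromosome as the site. The function name, the interval representation, and the coordinates are hypothetical and only illustrate what "in or within ±N kb" means.

```python
def near_island(position, islands, flank):
    """True if position lies in, or within `flank` bases of, any island.

    islands: iterable of (start, end) pairs with start <= end,
    all on the same chromosome as `position`.
    """
    return any(start - flank <= position <= end + flank for start, end in islands)

# Hypothetical island and site, for illustration only
islands = [(10000, 10500)]
site = 8000
flags = {flank: near_island(site, islands, flank)
         for flank in (1000, 5000, 10000, 25000, 50000)}
```

A site 2 kb upstream of an island is outside the ±1kb neighborhood but inside the ±5kb one, so the same site can change class as the window widens.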

1 kilobase neighborhoods
The following plot shows the effect of being in or within ±1kb of a CpG island:
A formal test of significance comparing the difference attains a p-value of 6.4805e-08.

5 kilobase neighborhoods
The following plot shows the effect of being in or within ±5kb of a CpG island:
A formal test of significance comparing the difference attains a p-value of 1.0465e-10.

10 kilobase neighborhoods
The following plot shows the effect of being in or within ±10kb of a CpG island:

25 kilobase neighborhoods
The following plot shows the effect of being in or within ±25kb of a CpG island:
A formal test of significance comparing the difference attains a p-value of 1.2228e-05.

50 kilobase neighborhoods
The following plot shows the effect of being in or within ±50kb of a CpG island: A formal test of significance comparing the difference attains a p-value of 0.00019175.

Gene Density, Expression 'Density', and CpG Island Density
In this section the association with gene density is examined. The 'genes' that are counted are the genes represented on the microarray. In addition, we count the number of such genes expressed at various levels. The levels are:
low.ex: Count genes whose expression is in the upper half and divide by number of bases
med.ex: Count genes whose expression is in the upper 1/8th and divide by number of bases
high.ex: Count genes whose expression is in the upper 1/16th and divide by number of bases
The terms low.ex, med.ex, and high.ex are used as abbreviations in what follows. The abbreviation dens is used to indicate gene density as number of genes per base.
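The density definitions can be sketched as follows. The sketch assumes each gene is reduced to a single coordinate with one expression value, that the window is centred on the site, and that the three levels are read as the upper half, upper 1/8, and upper 1/16 of expression values. All names and numbers are hypothetical.

```python
def expression_cutoff(values, upper_fraction):
    """Smallest expression value that still falls in the top `upper_fraction`."""
    ranked = sorted(values, reverse=True)
    keep = max(1, int(len(ranked) * upper_fraction))
    return ranked[keep - 1]

def density(site, genes, width, min_expression=None):
    """Genes per base in a window of `width` bases centred on `site`.

    genes: list of (position, expression) pairs on the same chromosome.
    If min_expression is given, only genes at or above it are counted.
    """
    half = width / 2
    n = sum(1 for pos, expr in genes
            if abs(pos - site) <= half
            and (min_expression is None or expr >= min_expression))
    return n / width

# Hypothetical genes, for illustration only
genes = [(100, 1.0), (200, 5.0), (300, 9.0), (5000, 7.0)]
cut_half = expression_cutoff([expr for _, expr in genes], 1 / 2)
dens = density(200, genes, 400)                               # all genes on the chip
low_ex = density(200, genes, 400, min_expression=cut_half)    # upper half only
```

The same function gives the other two levels by passing 1 / 8 or 1 / 16 to expression_cutoff.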

25 kiloBase Window
In the barplot that follows we examine the association of insertion sites with gene density in a 25 kilobase window surrounding each locus. More such plots will follow and the method of their construction is always to try to divide the data according to the deciles of density. However, it often happens that there is a very skewed distribution of density and often even the 90th percentile is zero. In that case, the barplots simply show the sites for which the density is zero and those for which it is non-zero. If there are fewer than ten groups of bars, then the groupings contain ten percent of the sites each except for the leftmost grouping which will contain all of the remaining sites.
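The grouping rule described above can be sketched as follows. This is one plausible reading, assuming cut points are taken at the empirical deciles and that tied cut points are merged, so that a heavily zero-inflated variable collapses to fewer groups. The function name and the example values are hypothetical.

```python
from bisect import bisect_left

def decile_groups(values):
    """Assign each value a group index based on the distinct decile cut points.

    values must be non-empty. Group 0 is the leftmost group and absorbs
    all sites at or below the lowest distinct cut point.
    """
    ranked = sorted(values)
    n = len(ranked)
    # k-th decile: smallest value with at least k/10 of the data at or below it
    cuts = sorted({ranked[(n * k + 9) // 10 - 1] for k in range(1, 10)})
    # Group index = number of cut points strictly below the value
    return [bisect_left(cuts, v) for v in values]

# Hypothetical densities: 95% zeros gives just "zero" and "non-zero" groups
skewed = decile_groups([0] * 19 + [5])
# Half zeros: one large leftmost group, then roughly 10% of sites per group
mixed = decile_groups([0] * 5 + [1, 2, 3, 4, 5])
```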
Also note that the title of the plot contains clues as to its content; the prefix indicates the type of variable studied while the suffix indicates the window width in the number of bases. The p-value given is the result of fitting a cubic polynomial to the gene density values.
The following expression data and probe set were used for this report: Here are the results for expression density. First, we count just genes that are in the upper half.

50 kiloBase Window
In the barplot that follows we examine the association of insertion sites with expression density in a 50 kilobase window surrounding each locus. First, we count just the number of genes represented on the chip. Here are the results for expression density. First, we count just genes that are in the upper half.

100 kiloBase Window
In the barplot that follows we examine the association of insertion sites with expression density in a 100 kilobase window surrounding each locus. First, we count just the number of genes represented on the chip. Here are the results for expression density. First, we count just genes that are in the upper half.

250 kiloBase Window
In the barplot that follows we examine the association of insertion sites with expression density in a 250 kilobase window surrounding each locus. First, we count just the number of genes represented on the chip. Here are the results for expression density. First, we count just genes that are in the upper half.

500 kiloBase Window
In the barplot that follows we examine the association of insertion sites with expression density in a 500 kilobase window surrounding each locus. First, we count just the number of genes represented on the chip. Here are the results for expression density. First, we count just genes that are in the upper half.

1 megaBase Window
In the barplot that follows we examine the association of insertion sites with expression density in a 1 megabase window surrounding each locus. First, we count just the number of genes represented on the chip. Here are the results for expression density. First, we count just genes that are in the upper half.

2 megaBase Window
In the barplot that follows we examine the association of insertion sites with expression density in a 2 megabase window surrounding each locus. First, we count just the number of genes represented on the chip. Here are the results for expression density. First, we count just genes that are in the upper half.

4 megaBase Window
In the barplot that follows we examine the association of insertion sites with expression density in a 4 megabase window surrounding each locus. First, we count just the number of genes represented on the chip. Here are the results for expression density. First, we count just genes that are in the upper half.

8 megaBase Window
In the barplot that follows we examine the association of insertion sites with expression density in an 8 megabase window surrounding each locus. First, we count just the number of genes represented on the chip. Here are the results for expression density. First, we count just genes that are in the upper half.

16 megaBase Window
In the barplot that follows we examine the association of insertion sites with expression density in a 16 megabase window surrounding each locus. First, we count just the number of genes represented on the chip. Here are the results for expression density. First, we count just genes that are in the upper half.

32 megaBase Window
In the barplot that follows we examine the association of insertion sites with expression density in a 32 megabase window surrounding each locus. First, we count just the number of genes represented on the chip. Here are the results for expression density. First, we count just genes that are in the upper half.
The next plot studies the distance to the nearest boundary between a gene and a non-gene region. The distance is expressed as a fraction of the length of the region. Thus, '0.25' means that the distance from the site to the nearest boundary is one quarter of the total width of the region.
This plot studies the effect of nearness to the beginning of a transcript. For sites in genes, it is the distance to the start of the gene divided by the width of the gene. For other sites it is the distance from the site to the nearer gene if that gene boundary is also a transcription starting point. Locations near '0' are relatively near the beginning of transcription, while those near '1' are near the termination of the transcript.
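The two fractional distances for a site inside a gene can be sketched as below. The sketch assumes genomic coordinates with gene_start < gene_end and handles strand explicitly, which the text does not spell out; the function names and coordinates are hypothetical. Sites outside genes are not covered.

```python
def boundary_fraction(site, region_start, region_end):
    """Distance to the nearest region boundary as a fraction of region width."""
    width = region_end - region_start
    return min(site - region_start, region_end - site) / width

def transcript_fraction(site, gene_start, gene_end, strand="+"):
    """Position along the transcript: 0 at the start of transcription, 1 at its end.

    Strand handling is an assumption: on the minus strand the transcript
    starts at the larger genomic coordinate.
    """
    width = gene_end - gene_start
    if strand == "+":
        return (site - gene_start) / width
    return (gene_end - site) / width

# Hypothetical gene spanning 1000..2000, for illustration only
near_start = transcript_fraction(1250, 1000, 2000, "+")
near_end = transcript_fraction(1250, 1000, 2000, "-")
edge = boundary_fraction(1750, 1000, 2000)
```

The boundary fraction is symmetric and never exceeds 0.5, while the transcript fraction runs from 0 to 1 and depends on the direction of transcription.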

GC content
Here we study the effect of GC content on insertion. The GC content is taken from the Mouse Genome Draft at GoldenPath from the table Following the plot is a table of fitted coefficients based on splitting the GC percent data at the median.
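A median split of this kind can be sketched as follows, assuming the cut is the median of the pooled GC percent values of true and control sites and that the coefficient is the log odds ratio between the high and low halves. Both choices are assumptions, as are the names and the values used here.

```python
import math
from statistics import median

def gc_median_split(true_gc, control_gc):
    """Split GC percent at the pooled median and compare the two halves.

    Returns (cut, coefficient), where the coefficient is the log odds ratio
    of being a true site in the high-GC half versus the low-GC half.
    Assumes both halves contain at least one true and one control site.
    """
    cut = median(list(true_gc) + list(control_gc))

    def counts(values):
        high = sum(1 for v in values if v > cut)
        return high, len(values) - high

    t_high, t_low = counts(true_gc)
    c_high, c_low = counts(control_gc)
    coef = math.log((t_high / c_high) / (t_low / c_low))
    return cut, coef

# Hypothetical GC percent values, for illustration only
cut, coef = gc_median_split([55, 60, 58, 40], [38, 42, 41, 57])
```

A positive coefficient means true sites are over-represented in the high-GC half relative to control sites.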

Cytobands
Here we study the association of cytoband with insertion intensity. The data are obtained from http://genome.ucsc.edu/goldenPath/hg17/database/cytoBand.txt.gz. A formal test of significance attains a p-value of 0.044387.