Deciphering the Code for Retroviral Integration Target Site Selection

Upon cell invasion, retroviruses generate a DNA copy of their RNA genome and integrate retroviral cDNA within host chromosomal DNA. Integration occurs throughout the host cell genome, but target site selection is not random. Each subgroup of retrovirus is distinguished from the others by attraction to particular features on chromosomes. Despite extensive efforts to identify host factors that interact with retrovirion components or chromosome features predictive of integration, little is known about how integration sites are selected. We attempted to identify markers predictive of retroviral integration by exploiting Precision-Recall methods for extracting information from highly skewed datasets to derive robust and discriminating measures of association. ChIPSeq datasets for more than 60 factors were compared with 14 retroviral integration datasets. When compared with MLV, PERV or XMRV integration sites, strong association was observed with STAT1, acetylation of H3 and H4 at several positions, and methylation of H2AZ, H3K4, and K9. By combining peaks from ChIPSeq datasets, a supermarker was identified that localized within 2 kB of 75% of MLV proviruses and detected differences in integration preferences among different cell types. The supermarker predicted the likelihood of integration within specific chromosomal regions in a cell-type specific manner, yielding probabilities for integration into proto-oncogene LMO2 identical to experimentally determined values. The supermarker thus identifies chromosomal features highly favored for retroviral integration, provides clues to the mechanism by which retrovirus integration sites are selected, and offers a tool for predicting cell-type specific proto-oncogene activation by retroviruses.


Introduction
Retroviruses and retrotransposons are of profound importance to eukaryotic biology, evolution, and medicine. These retroelements constitute at least 40% of the mass of mammalian genomes [1] and 75% of the maize genome [2]. When retroelements are transcribed they remodel eukaryotic genomes by generating a cDNA and integrating it into locations scattered throughout the host cell genome [3,4]. By doing so, retroelements have the potential to influence local gene expression or to promote recombination and generate deletion mutations [5][6][7]. In some cases they act in trans to catalyze retrotransposition of cellular RNAs, generating pseudogenes or new exons within existing genes [8,9]. Since retrotransposon enhancer elements influence local gene expression, and retrotransposon silencing can vary from cell to cell, it has been proposed that retrotransposons contribute to the phenotypic variation that distinguishes genetically identical individuals [10]. Additionally, it has been suggested that programmed release from retroelement silencing accompanies metazoan development and leads to hypermutation in complex somatic tissues like the brain [11,12].
Among retroelements, retroviruses have received much attention, in part due to their association with human disease. Basic studies concerning retroviral replication have greatly advanced understanding of the biochemistry of retrotransposition [4,13]. A tetramer of the viral integrase protein (IN) [14] cleaves the ends of the viral cDNA to produce recessed 39OH and free CA dinucleotides at the terminus of each long terminal repeat (LTR) [15]. IN catalyzes nucleophilic attack of host chromosomal DNA by the two free 39-OH viral DNA ends, resulting in covalent attachment of the retroviral DNA strands to the host DNA [16][17][18]. The remaining free ends of the viral DNA are then repaired by host enzymes [19][20][21].
Study of HIV-1, the retrovirus that causes AIDS, has led to the development of drugs that block retrotransposition and alter progression to AIDS [22,23]. Attempts to develop better therapies for HIV-1 would benefit from a deeper understanding of the integration mechanism. Gene therapy vectors based on another retrovirus, MLV, dramatically rescued children from a lifethreatening illness, but a large percentage of the patients suffered from insertional activation of proto-oncogenes [24][25][26][27][28]. This lethal complication further emphasizes the need to better understand retroviral integration site selection in host chromosomal DNA.
Retroviruses establish proviruses at sites throughout the host cell genome, but integration is not random. Some regions are favored hundreds of times over others [29,30]. For some retroviruses, transcribed regions are preferred [31,32], though high-level, concurrent transcription at a given target gene inhibits integration [33]. Nucleosome-bearing DNA is targeted more efficiently than free DNA in vitro [34][35][36][37] perhaps because the integration machinery preferentially targets bent DNA [38]. Indeed, highthroughput sequencing experiments analyzing over 40,000 HIV-1 integration sites in cells show periodic distribution on predicted nucleosome positions, consistent with favored integration into outward-facing DNA major grooves in chromatin [39].
The retrotransposition mechanism, and integration site selection on a genomic scale, differs considerably from one class of retrovirus to another. HIV-1 infects non-dividing cells [40,41] and integrates preferentially into transcriptionally active genes, all along the length of the gene [32,42,43]. In contrast, MLV integration requires mitosis [41,44] and has a tendency to localize near promoters, 20% of the time within 2 kB of transcriptional start sites [31,42]. Retroviral capsid (CA) is sufficient to determine whether a given virus infects non-dividing cells [45,46] but both CA and IN contribute to integration site selection: an HIV-1 vector in which IN-coding sequences and a fragment of gag encompassing CA were replaced by the homologous MLV sequences exhibits the retrotransposition behavior of MLV [43].
Of the many host factors reported to interact with retroviral CA or IN [47][48][49][50][51][52], the lentiviral IN-interacting protein PSIP1/ LEDGF/p75 [53][54][55] is the most informative regarding integration site selection. LEDGF promotes the infectivity of HIV-1 and related lentiviruses and influences integration site selection [56][57][58][59] perhaps by acting as a physical tether directing integration to the chromosomal sites this protein naturally occupies. In support of this model, fusion of heterogeneous chromatin binding domains to the part of LEDGF that binds IN redirected the site of HIV-1 integration [60][61][62]. The mechanism by which gammaretroviruses such as MLV preferentially target promoter regions is unknown.
We attempted to identify chromatin features predictive of retroviral integration site selection by exploiting ChIPSeq datasets. Compared to previous methods, this technology has brought profiles of human DNA binding factors and histone epigenetic modifications closer to genome-wide saturation [63][64][65][66][67][68]. Over 60 ChIPSeq datasets were compared with 14 retroviral integration data sets in order to develop tools for predicting viral integration sites throughout the genome with maximal predictive power.
The proviruses in the datasets used here ( Table 2) were cloned from host genomic DNA using restriction enzymes, each of which has the potential to introduce a bias [79]. Therefore, as described in the literature [42,43,78,80], each integration site was matched to ten control sites designed to exhibit the same bias as the experimental set: control sites were placed the equivalent distance from randomly chosen recognition sites of the restriction enzyme that was used to clone the provirus (see Methods). No distortion of the results by the control datasets was evident, in that identical values for provirus association with a given chromatin feature were obtained using 10 different randomly-generated control datasets.
Integration datasets are generally compared with control datasets using Fisher's exact test and reported as the p-value [42,43,77,80]. Since significance determination is dependent upon dataset size, these measures can be easily conflated, generating

Author Summary
When HIV-1, murine leukemia virus (MLV), or other retroviruses infect a cell, the virus generates a DNA copy of the viral RNA genome and ligates the cDNA within host chromosomal DNA. This integration reaction occurs at sites throughout the host cell genome, but little is known about how integration sites are selected. We attempted to identify markers predictive of retroviral integration by comparing the genome-wide binding sites for more than 60 factors with 14 retroviral integration datasets. We borrowed Precision-Recall methods from the Information Retrieval field for extracting information from highly skewed datasets such as these. For MLV and other gammaretroviruses, strong association was observed with STAT1, acetylation of H3 and H4 at several positions, and methylation of H2AZ, H3K4, and K9. We generated a supermarker by combining high scoring markers. The supermarker localized within 2 kB of 75% of MLV proviruses and predicted the likelihood of integration within specific chromosomal regions in a cell-type specific manner. This study identified chromosomal features highly favored for retroviral integration. It also provides clues to the mechanism by which retrovirus integration sites are selected, and offers a tool for predicting cell-type specific proto-oncogene activation by retroviruses.
extraordinarily low p-values and making it difficult to compare the importance of two factors [78]. Receiver operating characteristic area methods (ROC) have also been used to identify associations [78,80,81], but these methods also have drawbacks when it comes to discriminating between markers for retroviral integration. With the datasets used in these studies, the number of true negatives (control sites not associated with the marker) is considerably higher than the number of false positives (control sites associated with the marker). Given that the false positive rate = false positives / [false positives+true negatives], two markers which differ by as much as 10-fold in terms of the number of false positives will fail to be differentiated from one another using ROC [82].
To address the problems associated with the analysis of these highly skewed data sets, we borrowed the concepts of Precision and Recall from the field of Information Retrieval [82][83][84]. In the context of this discussion, Precision is defined as the number of experimentally-determined integration sites associated with a marker divided by the sum of all associated experimental and all associated control sites (see Methods). Recall is the number of marker-associated experimental integration sites divided by all experimental integration sites. The F b score, a convenient way to aggregate Precision and Recall, is the weighted harmonic mean of the two measures [85]. Usual values for b are 0.5, 1 or 2 [86]. To limit the influence of true negatives in the analysis of these skewed datasets, we emphasized Precision over Recall by setting b = 0.5. The F score tracks better with statistical significance when b = 0.5, than 1 or 2 (see the comparison of results using different values for b, as well as with other metrics, described below, as well as Text S1). Moreover we normalized the number of false positives with respect to the number of experimental integration sites so as to make the F score independent of control sample size. For the analysis here, markers with F scores between 0.5 and 1 were considered to be associated with integration sites.
To visualize genome-wide association of proviruses with potential markers, chromosome projection mandalas were developed ( Figure 1A, see Methods). Each dot on the mandala represents a retroviral integration site with the following polar coordinates: angular distance corresponds to genomic location on the indicated chromosome; radial distance from the contour of the circle is the distance in nucleotides from the nearest site of the marker in question, log-scaled from 0 to 1 megabase.

Association of retroviral integration sites with ChIPSeq datasets
Currently, the best chromosomal marker for retroviral integration site selection is the association of CpG islands and transcription start sites (CpG+TSS) with gammaretroviruses [31,43,71]. By examining published datasets for MLV, 21 to 27% of integration sites fall within 2 kB (wi2kB) of CpG+TSS, with probabilities ,3610 222 to ,4610 242 (Table 3). Despite these extremely low p-values, F scores calculated for these datasets fall between 0.36 to 0.51 (Table 3 and Figure 1E), indicating that CpG+TSS is not a powerful predictor of MLV integration sites. Stronger association with CpG+TSS was observed with porcine endogenous retrovirus, PERV (50% wi2kB; p,10 2250 ; F score 0.72), and xenotropic MuLV-related virus, XMRV (33% wi2kB; p,10 246 ; F score 0.58), two viruses from the same gammaretrovirus family as MLV (Table 3 and Figure 2). No significant association with CpG/TSS was observed for proviruses generated by non-gammaretroviruses, including HIV-1, for which the F score was 0.11 (Table 3, Figure 3), or with ASLV, HTLV, or Foamy virus (Table 3, Figure S1).
ChIPSeq datasets for 60 chromatin-associated factors (Table 1) were compared with 14 provirus datasets for MLV, PERV, XMRV, HIV-1, HTLV-1, ASLV, Foamy virus, and HIV/MLV chimeras ( Table 2). Acetylation of H3 and H4 at several positions, and methylation of H2AZ, H3K4, and K9, were strongly associated with gammaretroviral integration sites, all with F scores .0.80 (Figures 1 and 2, Table 3 and Tables S1 and S2). H3K4me3 in particular was strongly associated with MLV integration sites (68% wi2kB; p,10 2324 ; F score 0.83) and with the integration sites of PERV (60% wi2kB; p,10 2350 ; F score 0.82) and XMRV (64% wi2kB; p,10 2170 ; F score 0.81) (Figures 1  and 2, Table 3). The effect of window size on the F score was examined for factors strongly associated with MLV and the other gammaretroviruses. Interestingly, the F score was maximal when it was calculated using a window of +/22 kB for proviruses flanking the sites of these chromatin features ( Figure 4).
In contrast to the gammaretroviruses, HIV-1 integration sites were not associated with H3K4me3 (9% wi2kB; p.0.05; F score 0.21)( Figure 3 and Table 3). Among the markers for which ChIPSeq datasets were available from HeLa cells, H3K4me1 had the strongest association with HIV-1 proviruses (48% wi2kB; p,10 231 ; F score 0.6), though H3K4me1 was the sole chromatin marker that yielded F score values greater than 0.5 across all queried viruses (Table 3, Table S3). H3K4me3, and other chromatin modifications linked to transcriptionally active promoters [64,[87][88][89], were reported to be associated with HIV proviruses when a window of 50 kB flanking the proviruses was considered [81,90]. This could be explained by the fact that HIV-1 proviruses localize to active transcription units with equal distribution along the length of the genes [32,42,43], and that the size of the average transcription unit is on the order of tens of kilobases.
To examine this further, the F score for HIV-1 versus H3K4me3 in HeLa cells was plotted as a function of window size ( Figure 5). For comparison, a similar plot was generated for a hypothetical marker at the TSS of transcribed genes in HeLa cells, taking into account the length of these genes, and considering a uniform distribution of proviruses on each gene. For both H3K4me3 and the hypothetical TSS marker, the F score plateaued at a window size of 20 kB, the median gene length. Thus if the window size is large enough to encompass the TSS and  half of the gene length, the F score becomes significant. This could explain the window-size dependence of HIV-1 association with H3K4me3.
We also analyzed an integration site map for an HIV-1 vector in which IN-encoding pol sequences and part of gag were replaced by homologous sequences from MLV [45]. It was shown previously that substitution of these two viral components from MLV is sufficient to change the integration site preference of HIV-1, such that it targets TSS with a frequency like MLV [43]. Replacement with these MLV genes was sufficient for HIV-1 proviruses to associate with methylated histones (65% wi2kB, p,10 2182 , F score 0.82) in a manner that was indistinguishable from MLV ( Figure 3).
Attempts to detect a protein-protein interaction between STAT1 and MLV IN were unsuccessful. STAT1-deficient cell lines, either Stat1 2/2 mouse embryonic fibroblasts [91], HeLa cells with stable STAT1 knockdown using lentiviral vectors [92], or well-characterized, STAT1 mutant, HT1080 cells [93], were challenged with MLV and, as a control, HIV-1. No clear defect associated with STAT1-deficiency was detected when MLV infectivity was compared with HIV-1 (data not shown). These results suggest that STAT1 itself is not directly responsible for MLV integration site preference but that its chromatin preferences resemble those of MLV.

The F score is robust and highly discriminating
The stability of the F score for H3K4me3, an excellent marker, and for TSS/CpG, a poor marker, was examined as the size of a dataset containing 588 MLV proviruses [43] was decreased. The ratio of the size of the provirus dataset with respect to the control dataset was fixed at ten. While the p-value varied enormously as the size of the provirus dataset decreased, the F score was constant for both H3K4me3 and TSS/CpG over the full range from 50 to 500 proviruses ( Figure 6A). The size of the provirus dataset was then fixed at 588 [43] and the F score was plotted versus the ratio (from 0.1 to 10) of the experimental and control datasets. Under these conditions the F score for either factor was constant except for a small increase when the ratio of the experimental to control datasets decreased below 0.3 ( Figure 6B). The p-value for H3K4me3 changed markedly with the change in ratio of the datasets. Thus, while the p-value is strongly biased by the size of the provirus dataset or by the ratio of experimental to control sites, the F score is a remarkably stable measure. Similar stability was observed for the F score of all markers as compared to all proviral integration datasets (data not shown).
As demonstrated for the F score ( Figure 6), the area under the curve (AUC) ROC method used previously to evaluate markers associated with retroviral integration sites [78,80,81] is a robust measure that is insensitive to dataset size. Like the F score, AUC(ROC) also works well to assess markers that are weakly or moderately associated with integration sites (Text S1). But, as demonstrated for the highly associated marker H3K4me3, AUC(ROC) does not respond to the increase in false positives that is expected with increasing window size ( Figure 7A). Moreover, this insensitivity to false positives leads AUC(ROC) to overestimate the association of markers that are more common in the genome. Consequently, AUC ranks markers differently from statistical significance, as shown in Figure 8 and discussed in more detail in Text S1. In contrast, the p-value and the F 0.5 score incorporate an adjustment for the increase in false positives as window size increases, and both measures achieve a maximal value at a window size of 2 kB ( Figure 7A). A standard regression plot shows that the F 0.5 score tracks with the p-value almost perfectly (R 2 = 0.97), whereas the AUC(ROC) diverges considerably (R 2 = 0.37) ( Figure 7B). The F 0.5 score and the p-value adjust similarly for the increasing number of false positives. Indeed among a set of measures that included F 0.5 , F 1 , F 2 , Area Under Curve (AUC), Area Under Precision/Recall (AUPR), Odds Ratio (OR), Shannon Mutual Information (SMI), and Difference of Proportions (DOP), the F 0.5 score showed the strongest link with statistical significance (see Methods). We analyzed one of the MLV integration dataset in HeLa cells [43] (the same results were obtained using the other HeLa dataset [31]) and the MLV integration dataset in CD4+ T cells [71]. The strength of association of 9 significant markers (in terms of p-value) from HeLa cells, and 31 significant markers from CD4+ T cell, was assessed. Markers were ranked according to each of the above methods and the results of each were compared with the ranking obtained using significance 2log(p value). This was done by fixing the matched control data set size at 10-times the experimental dataset size and using window sizes of 2, 5, 10, and 20 kilobases. Results for the analysis are reported in Table 4 and in Text S1.
Several conclusions can be drawn from this analysis. Concerning markers that were highly associated with proviruses, the ranking yielded by the F 0.5 score closely tracked with significance (Table 4). By increasing the weight of recall over precision by increasing the beta value (F 1 or F 2 ) the F score tracked less well with significance (it was the F 0.5 score that was used throughout this manuscript). The SMI also tracked well, but, unlike the F score, the results with this method vary with dataset size (see Text S1). The AUC, OR, AUPR, and DOP were clearly not as good as the F 0.5 score.
Concerning markers that are moderately or weakly associated with proviruses (Text S1), the ranking based on the F 0.5 score was Figure 2. Chromosome projection mandala and F score calculated within 2 kB for the indicated markers (columns) versus the indicated proviruses (rows). The source of the provirus datasets is listed (see Table 2 and the text) and N indicates the number of proviruses considered for each analysis. MLV [31] proviruses were cloned from HeLa cells, XMRV proviruses from DU145, and PERV proviruses from HEK 293. H3K4me3 and STAT1 ChIPSeq datasets were from HeLa (see Table 1 and text). The F score and the percentage of proviruses within 2 kB are presented under each mandala. doi:10.1371/journal.pcbi.1001008.g002 similar to that obtained by significance, AUC, AUPR, OR, or DOP (Table 4). SMI scored less well for these markers. Figure 8 visualizes the deviation of AUC, AUPR or F 0.5 from significance. Red squares indicate cases in which the ranking calculated by the specified metric differs from the rank obtained by significance.
All results indicate that, for the datasets evaluated here, the F 0.5 score is a superior measure at discriminating among factors for differences in magnitude of association with genomic sites of integration. . Chromosome projection mandala and F score calculated within 2 kB for the indicated markers (columns) versus the indicated proviruses (rows). All proviruses were cloned from HeLa cells ( Table 2 and text). H3K4me3 and STAT1 ChIPSeq datasets were from HeLa cells (Table 1). N indicates the number of specific proviral integrations considered for each analysis. The F score and the percentage of proviruses within 2 kB are presented under each mandala. doi:10.1371/journal.pcbi.1001008.g003

Generation of a supermarker for retrovirus integration
Given the effectiveness of the F score for identifying and ranking individual factors associated with retrovirus integration site selection, markers with the best F scores were combined in an attempt to generate a supermarker (see Methods for more details).
An estimate of the probability of proviral integration into the host genome (P(V)) was derived based on the genomic distribution of combinations of ChIPSeq peaks for the best scoring markers with respect to particular experimental provirus datasets. The resulting probability mass function (at base-pair resolution) is where V is the set of proviral integration sites, F j is the F score associated with each marker M j , for the set of peaks C j . x is the physical position on chromosomal DNA and K is a normalization constant. From this composite distribution, the peaks with the largest amplitude were identified, and the subset of peaks yielding the maximal F score in the test dataset was defined as the supermarker peak set. Two strategies were used to validate the supermarker procedure. First we calculated the supermarker and the relative peak set on each single proviral dataset and then we evaluated the association with the remaining datasets. The second strategy was a  (Table 5 and Table  S5). Further, we compared the strength of association of the supermarker peak set for gammaretroviral datasets to the performance of the Random Forest machine learning algorithm [94]. The two methods obtained superimposable results (Table S6, see Methods for details).
With respect to MLV integration in HeLa cells, H3K4me1, H3K4me3, H3K9ac and STAT1 were the markers with the best F scores (.0.80)(Table S1 and S2). Examination of the ChIPSeq peaks derived from all combinations of these five candidates revealed that the best supermarker was generated by combining H3K4me3, H3K4me1, and H3K9ac (75% wi2kb; p,10 2284 ; F score 0.87) (Figure 9 and Table 5). Figure 9A shows the distribution of supermarker density and MLV integration sites across the human genome, with an expansion of chromosome 1 to help visualize detail in Figure 9B. The Pearson correlation for the supermarker density and MLV integration site density across the whole genome was 0.75 (p = 0, with both functions averaged over a non-overlapping 10 kB window). Figure 9C shows the correlation for chromosome 1 in isolation. As with the single marker H3K4me3, the supermarker yields a maximal F score using a window size of 2 kB (Figure 4).
Inclusion of STAT1 in the HeLa supermarker increased the number of false positives over the number of true positives and thus decreased the composite F score. This suggests that any information carried by STAT1 is contained within the other markers.

The F score detects differences between cell types
The F scores reported here (Tables 3 and 4) were calculated using ChIPSeq and provirus datasets that were matched for cell type. In a previous report, when AUC(ROC) was used to evaluate epigenetic marks mapped in T cells, the correlation with proviruses cloned from T cells was no greater than the correlation with proviruses cloned from other target cell types such as the human embryonic kidney cell line HEK 293 or the fibrosarcoma cell line HT1080 [90]. Differences due to experimental error were in fact greater than differences due to cell type [90].
To determine if the F score has the ability to discriminate between cell types, MLV provirus data sets from HeLa and CD4 + T cells were compared with the supermarker for each of these cell types, in all combinations. As mentioned above, when an MLV provirus dataset obtained from infection of HeLa cells [43] was compared with the supermarker from HeLa cell ChIPSeq data, very strong association was observed (75% wi2kB; p,10 2284 ; F score 0.87) ( Table 5 and Figure 10). When the same provirus dataset was compared with the supermarker derived from CD4 + T cell ChIPSeq data the strength of the association was much decreased (32% wi2kB; p,10 257 ; F score 0.61) ( Table 5 and Figure 10). The same pattern was seen for the chimera HIVmINmGag, for which association with the supermarker in HeLa cells (70% wi2kB; p,10 2263 ; F score 0.86)( Table 5 and Figure 10) was much greater than association with the supermarker in CD4+ T cells (27% wi2kB; p,10 224 ; F score 0.56) ( Table 5 and Figure 10). The opposite pattern was also seen in that MLV proviruses cloned from CD4+ T cells [71] were strongly associated with the supermarker derived in these cells (71% wi2kB; p,10 2112 ; F score 0.84) ( Table 5 and Figure 10), and less well associated with the supermarker from HeLa cells (39% wi2kB; p,10 242 ; F score 0.67) ( Table 5 and Figure 10).
A similar analysis was attempted with provirus datasets for the gammaretroviruses XMRV and PERV ( Table 5). The XMRV provirus data was obtained in the human prostate cancer cell line DU145 [76] and ChiPSeq datasets are not available for these cells. Despite the mismatched cell lines, when the XMRV dataset from DU145 cells was compared with the epigenetic markers mapped in HeLa cells strong correlation was observed with the supermarker (66% wi2kB; p,10 2190 ; F score 0.83). When the supermarker was derived from CD4 + T cell data, the association with XMRV was much less significant (41% wi2kB; p,10 285 ; F score 0.70). Similarly, the PERV provirus dataset cloned from HEK 293 cells was better associated with the supermarker from HeLa cells (66% wi2kB; p,10 2350 ; F score 0.83) than from CD4+ T cells (51% wi2kB; p,10 2350 ; F score 0.75).
To understand why some mismatched cell comparisons gave higher F scores than others, CD4+ T cells, HeLa, DU145, Jurkat, HEK 293, and CD34+ hematopoietic stem cells were clustered based on global gene expression profiles (http://www.ncbi.nlm. nih.gov/geo). The resulting dendrogram ( Figure S2) demonstrated that the cells clustered into two groups, one consisting of HeLa, DU146, and HEK 293 cells, and the other CD4+ T cells, Jurkat cells, and CD34+ cells. Based on expression profiles DU145 cells are more similar to HeLa cells than to CD4+ T cells, offering an explanation for the higher F score when XMRV was compared with HeLa.
Use of the supermarker to predict the likelihood of integration at specific loci within specific cell types As a first step towards examining the utility of the supermarker in the context of published clinical or experimental data, supermarker density was examined in proto-oncogenes that have been activated by retroviral insertion. 20 SCID-X1 patients were successfully treated with autologous bone marrow CD34+ hematopoietic stem cells transduced ex-vivo with an MLV vector expressing the therapeutic gene IL2RG. 5 of these patients developed T cell leukemia and 4 possessed insertional mutations from the MLV vector at LMO2 [24][25][26][27][28], a T cell oncogene [95].

integration in HeLa cells (A) or in CD4+ T cells (B) were ranked by Area Under Curve (AUC), Area Under Precision and Recall Curve (AUPR)
, or using the F 0.5 score. The rankings obtained by these methods were compared with the ranking obtained by the Fisher's exact test: each crosslink between markers in the grid represents a comparison. Red squares indicate when the ranking calculated by the specified metric disagrees with the ranking calculated by significance. Markers were arranged in order of decreasing significance (from left to right). doi:10.1371/journal.pcbi.1001008.g008 The fifth patient had a provirus near CCND2, another lymphoid oncogene [96] that encodes cyclin D2. When ChIPSeq datasets from HeLa cells were used to generate the supermarker, no high probability sites were identified near the promoters of LMO2 or CCND2 (Figure 11). For LMO2 the nearest sites in HeLa cells were .150 kbp upstream and .200 kbp downstream of the TSS. For CCND2 the nearest sites in HeLa were .800 kbp upstream and .50 kbp downstream of the TSS.
Sufficient ChIPSeq datasets to generate a supermarker were not available for CD34 + hematopoietic stem cells. Given the relative similarity of the transcription profile ( Figure S2) we used the supermarker data generated from CD4 + T cells. The F score when crossing from CD34+ cells to CD4+ cells decreases from 0.85 to 0.78 (57% wi2kb, p,10 2102 ), but is much better than when using HeLa cell data (38% wi2kB; p,10 248 ; F score 0.66).
With respect to the LMO2 TSS a very prominent supermarker peak was observed at 21730 bp ( Figure 11A). Based on the probability of the supermarker we estimate that 1 out of 10 5 MLV proviruses would target this gene in CD34+ cells or CD4+ T cells, as compared to a much less frequent 1 out of 10 7 MLV proviruses in HeLa cells. Nearly identical probabilities were calculated based on experiments in which MLV proviruses were cloned from T cell lines and HeLa cells [97]. These authors observed a hotspot for MLV integration located between 21740 to 23000 of the LMO2 promoter within CD4+ T cells but not within HeLa. Though experimental data for calculating the probability of integration into CCND2 is not available, it is interesting that multiple, highprobability supermarkers are located wi2kB of the promoter ( Figure 11B).

Discussion
Here we attempted to identify epigenetic markers predictive of retroviral integration site selection. To this end, the growing body of ChIP-Seq and retroviral integration datasets was exploited.
Borrowing from the field of information retrieval, we derived a measure, the F score, that allowed us to identify and rank candidate markers for association with proviruses. Covalent modification of histone H3, most prominently H3K4me1, H3K4me3, and H3K9ac, as well as binding sites for the transcription factor STAT1, were tightly linked to proviruses from MLV, XMRV, and PERV. The F score also permitted us to combine factors to generate a supermarker that predicted 75% of integration sites with precision and with specificity for integration site preference within a given cell type. The ChIPSeq datamining approach used here identified markers for gammaretroviral integration site selection that are superior to any markers previously reported.

Advantages of the F score
Prior to this study, the best predictor for retroviral integration site selection was the association of TSS/CpG with gammaretroviruses such as MLV [31,43,71]. Given a window of 2 kB, TSS/CpG predicts 21 to 27% of MLV integration sites. But even this modest prediction comes with the cost of a high background rate (low precision) and consequently a borderline F score (0.51 under the best conditions). In contrast, H3K4me3 predicts 63 to 68% of MLV integration sites with high precision (F score 0.84). H3K4me1 predicts 90% of MLV integration sites but, in isolation, this marker has a higher background rate (F score 0.78) due to the larger size of the H3K4me1 ChIPSeq dataset (300,000 binding sites for H3K4me1 versus 70,000 for H3K4me3).
Previous studies have reported the same histone modifications as markers associated with integration sites [81,90]. The Precision-Recall methods used here have been shown to be better suited than ROC when negative results far exceed positive ones [82]. Precision-Recall methods have been shown to perform better than ROC in a number of other areas in biology, including the prediction of functional residues within proteins [98] or predicting the function of genes [99]. In our case, the resolution offered by the Precision-Recall-based F score allowed us to rank markers according to statistical significance (Text S1). Then, by ranking markers with respect to their F score, we were able to combine them to generate a supermarker which predicts 75% of MLV integration sites wi2kB with very high precision (F score 0.87). It will certainly be important to find an explanation for the remaining 25% of integration sites not accounted for by the markers identified here.

Significance of the supermarker
The supermarker was used here to predict the probability of gammaretroviral integration into a specific locus, in a cell-type specific manner (Figure 11). Our in silico probability estimates for integration near a particular proto-oncogene, LMO2, were nearly identical to the probabilities calculated from experimental data [97], and even concurred with respect to the cell-type specificity of the experimentally determined probability. Additional experimental confirmation of supermarker predictions is called for but the case of LMO2 suggests that the supermarker is indeed the first powerfully predictive tool for retroviral integration site selection. A supermarker generated from cell-type-specific ChIPSeq data for a Matched means that the supermarker was calculated using proviruses cloned from the same cell type as the ChIPSeq dataset. In the case of XMRV and PERV, proviruses were cloned from a cell type that is similar to the ChIPSeq dataset, according to the transcriptional profile (see text and Figure S2). doi:10.1371/journal.pcbi.1001008.t005 handful of markers has the potential to transform how decisions are made concerning clinical gene-therapy trials.
The calculations here were based on distinct datasets from multiple sources (Tables 1 and 2). It is possible that by generating matched datasets, i.e., integration datasets and ChiPSeq datasets from identical cells and by the same laboratory, or by combining ChIPSeq data for new factors in new combinations, the ability of the supermarker to predict integration sites will be improved even further. On the other hand, STAT1, a powerful marker in isolation, increased the false positive rate and decreased the F score. In addition to the ChIPSeq datasets in Table 1, we checked if the F score was improved by examining other previously reported features, including GC content, AT content, putative consensus sequences for integration or transcription factors [80,100]. When a window of 2kB was considered, these features failed to yield a significant F score (all were #0.5) for all of the retroviral provirus datasets, and these factors considerably lessened the F score when combined with the highly associated markers (Table S7).

Mechanistic implications
The strength of the associations with H3K4me3, H3K4me1, and H3K9ac indicates that gammaretroviral integration is not a quasi-random process, but rather, a deterministic process that follows the epigenetic histone code. Though some of these histone modifications are linked to transcriptionally active promoters [64,[87][88][89], the link to transcription per se seems not to be relevant since 60 to 70% of supermarker loci are not associated with TSS/ CpG. Consistent with this point, our supermarker is highly associated with the LMO2 promoter in CD4+ T cells, but not in HeLa cells, and these cell-type-specific differences in marker binding do not correlate with differential LMO2 expression in these cells [97]. The 2 kB window maxima for the F score of the supermarker is intriguing and suggests that it is a physical property of chromatin that is favored for integration by gammaretroviruses, perhaps linked to the position of the supermarker relative to nucleosomes or bent DNA [34,[36][37][38].
The factors constituting the supermarker, along with the other histone modifications listed in Tables S1 and S2 that are also associated with MLV integration, suggest a mechanistic link between gammaretroviral integration and chromatin-associated complexes with H3K4 methyltransferase and histone acetyltransferase activity. H3K4 methylation is clearly linked with histone acetylation, in that promoters which are methylated are much more likely to become acetylated [65] and knockdown of WDR5, a factor required for H3K4 methylation [101] leads to altered histone acetylation [65,102]. Methylation may recruit chromatin remodeling complexes [103,104], the methylated histone may be bound by the acetylases [105], or acetylases may be components of the methylase complex itself [101]. CBP/p300 is associated with H3K4 methyltranferase activity in vivo [106,107]. ChIPSeq data on acetyltransferases shows a weak but significant association between CBP and MLV integrations in CD4+ T cells (F score 0.68, Table S4). Interestingly, combination of CBP and p300 leads to an aggregated F score of 0.75. Thus, any of these chromatin associated factors, methylated histones, methylases, chromatin remodeling complexes or acetylases are candidates for gammaretroviral IN-binding factors. Interestingly, HIV-1 IN associates with, and is acetylated by, p300 [108] but the p300 ChIPSeq binding profile was not associated with the HIV-1 proviral datasets (F score 0.34).  Gammaretrovirus association with STAT1 Though very strong association was observed when any of the gammaretroviruses were compared with STAT1 binding sites, adding this transcription factor to the supermarker did not improve the F score. This is perhaps because any retroviral targeting information derived from STAT1 binding sites is already present in the modified histone H3. 90 to 95% of the STAT1 binding sites are in fact within 2 kB of the nearest H3K4me1 site. Our attempts to detect STAT1 binding to MLV IN, or to see effects of STAT1 disruption on MLV infectivity were unsuccessful. Taken together it seems likely that STAT1 itself is not mechanistically involved in gammaretrovirus integration. More likely, STAT1 homes to chromosomal regions that are also preferred targets for integration by these viruses. STAT1 has a complex relationship with the histone acetylase CBP/p300. Acetylation of histones is required for STAT1-mediated transcription [89,109] but STAT1 itself binds CBP/p300 [110] and is also acetylated and this contributes to its inactivation [111].

HIV integration site selection
The best single marker for HIV-1 in HeLa cells, H3K4me1, predicted 48% of proviruses wi2kB but with only moderate precision (F score 0.60). Using the F score we were able to detect a stronger association of HIV-1 with H3K4me1 in CD4+ T cells (57% wi2kB, p,10 271 , F score 0.73) but combining markers in an attempt to generate a supermarker failed to improve the F score. The associations that were observed may be related to HIV-1's propensity to integrate along the length of transcriptionally active genes [81,90]. Association with histone modifications at active promoters may be detected given short enough gene-length, or a wide-enough window around the provirus ( Figure 5). Either way, we were unable to identify a marker capable of predicting HIV-1 integration site selection wi2kB. Perhaps the HIV-1 IN-interacting protein PSIP1/LEDGF/p75 [53][54][55] would be such a factor. Though binding sites have been reported for LEDGF [112], this dataset is limited to 1% of the human genome and cannot be used for a genome-wide association study. LEDGF influences HIV-1 integration site selection in that its disruption causes a shift away from transcriptional units and towards CpG-rich sequences [56,58,59]. Nonetheless, these are relatively general effects and LEDGF binding sites may fail to give resolution down to a window of 2 kB. It appears that integration site selection by HIV-1 is mechanistically quite different than for the gammaretroviruses.

Retrovirus integration site datasets and generation of controls
The analysis of integration sites was based on the published integration datasets in Table 2. In the analysis performed here, to control for possible bias introduced during the cloning of the integration sites, 10 control sites in the human genome were generated for each integration site, as previously described [42,43,78,80]. These control, in silico-generated sites were used to calculate the significance and the F score (see below).

CpG island and transcription start sites
These genomic features were obtained from Annotated Genome version hg18 for human (http://genome.ucsc.edu/). CpG island and transcription start sites were combined into single datasets for determining association with retrovirus integration sites.

ChIPSeq datasets
ChIPSeq peaks were derived from published ChIPSeq datasets (Table 1) with a robust and fast algorithm, F-Seq [113] running with default parameters and standard Poisson statistics. We recalculated the peaks even when the peak set was already available to confirm the reproducibility of published procedures.

Statistical analysis
Two-sided Fisher exact test (or x 2 approximation when appropriate) was used to evaluate statistical significance. All pvalues were Bonferroni corrected for multiple testing. p-values ,0.01 were considered significant.
To measure marker performance with respect to a given retroviral integration dataset, we used the F b -score (van Rijsbergen 1979). It is defined as the b-weighted harmonic mean of Precision P~t p t p zf p and Recall R~t p t p zf n , that is : where tp is the number of actual integration sites within 2 kB from a specified factor; tn is the number of control datapoints (generated in silico as described above) .2 kB from a specified factor; fp is the number of control datapoints within 2 kB from a specified factor and fn is the number of actual integration sites .2 kB from a specified factor. We set b = 0.5 to give more weight to Precision than to Recall. This balances type I and type II errors by adjusting for the high rate of False Positives (fp) inherent in the examination of large datasets for genome-wide binding sites according to statistical significance (Text S1). Moreover, to overcome the limitation of standard statistical methods we normalized fp with respect to the number of actual integration sites. The normalized F 0:5 -score is finally F 0:5~1 :25t p 1:25t p z0:25f n zf p V C with V and C being, respectively, the number of effective and control integration sites. The resulting F score is almost constant with respect to the size and ratio of experimental and control datasets (Figure 7). It is worth noting that a null-predictor yielding f p~C (i.e. a marker composed of all bases in the genome) gives P = 0.5 and R = 1, resulting in an F score%0.5. A marker is considered significant if the F score lies between 0.5 and 1.0.

Marker ranking and metric comparison
Different metrics can be used to measure the association between proviruses and given markers. We opted to identify the metric among F 0.5 , F 1 , F 2 , Area Under Curve (AUC), Area Under Precision/Recall (AUPR), Odds Ratio (OR), Shannon Mutual Information (SMI), and Difference of Proportions (DOP) that best agrees with statistical significance. The association between markers and proviruses was measured according to each of the above-mentioned metrics. Then the markers were ranked by comparing the measure associated to the i-esim marker with that associated with the j-esim marker and filling in an NXN matrix M for each measure. Formally where X is one of the considered metrics. As a reference, a similar matrix was built using the p-value (significance) obtained by Fisher's exact test, defined for the i-esim marker as S i~{ log (p i ). Thus A simple measure of similarity between metric X and reference S was calculated by D(X ,S)~X i,j 1{DM S ½i,j{M X ½i,jD N 2 (sum spans over all matrices elements). Observe that 0ƒDƒ1.

Generation of a supermarker
The mass probability functions p(V = i) or p(M = i) are defined as the probability of a provirus V or a marker M to be localized at a given genomic location defined as i;(chromosome, position). p(V = i) is estimated from the linear combination of mass probability functions for candidate markers, that is Coefficient p j measures the goodness of fit of the marker M j and it seems reasonable to write p j as a function of the related F score.
Indeed the probability of integration P(V) can be written as . . . , with respect to a set of markers M 1 ,M 2 ,…,M N . Adding these equations we get the mixture model Now, from (1) and b~1 2 we have A first order approximation of (2) is then where K is a normalization constant. Eventually we set p j~K F j N and the resulting new probability mass function is The marker mass density p(M j~i ) was modeled as the sum of Gaussian functions centered on ChIPSeq peaks, with the variance set as the average size of the peak regions, as determined by the Fseq algorithm [113]. In this way we minimized the potential bias that can arise by summing ChIPSeq densities obtained over different experimental conditions. Briefly, each marker probability density function was written as where C is the peak set of the marker M. This function (3) summarizes the properties of all the markers and can be interpreted as a new ChIPSeq density. Indeed it contains all markers associated and not associated peaks. To reduce the number of false positives we applied a thresholding procedure similar to that used to filter raw ChIPSeq data in a training set of experimental and control integration sites. The peaks of function (3) were ranked with respect to their amplitude and the F score is recalculated on the training set as a function of the number of peaks. We define the supermarker M* as the marker set that yields a maximal F score.
The supermarker density function is finally written as where C* is the reduced peak set. To validate the model, we adopted two strategies. First we calculated the supermarker and the relative reduced peak set on each single proviral dataset and then we evaluated the association with the remaining datasets. The second strategy was a standard 10-fold cross-validation applied to each single dataset.

Machine learning
To validate the effectiveness of the supermarker peak set, we trained RandomForest [94], a machine learning algorithm, with the same set of markers composing the supermarker. Our datasets are extremely imbalanced and this results in a classifier with an high misclassification error for predicting the minority class (i.e. the experimental dataset) as shown in Table S6. In order to correct for that, RandomForest can be tuned by an additional parameter, classwt, that can be used to assign priors to the classes (experimental and control) to minimize the misclassification error and improve the performance. We adopted a 10-fold crossvalidation procedure by correcting the priors in the training set.
Interestingly, the maximum achievable F score and the number of associated integration sites wi2kb match almost exactly with the F score and wi2kb that we obtained with our supermarker procedure. We consider this as further evidence of the effectiveness of the supermarker.

Position specific scoring matrix (PWM)
PWM for retroviruses and human transcription factors was borrowed from [80] and from the JASPAR database (jaspar. cgb.ki.se).

Computation
All computation and graphics were done with ad-hoc Python scripts with the support of the motility library for PWM calculations (cartwheel.caltech.edu/ motility), Matplotlib library for graphical and scientific computing (matplotlib.sourceforge.net) and the Random Forest implementation on R environment (http://cran. r-project.org/web/packages/randomForest/).

Graphic representation of data
Chromosome projection mandalas (Figure 1) represent the distribution across of the genome of binding sites for a specific factor or histone modification on the circumference of a circle. Each dot represents a retroviral integration site with the following polar coordinates: angular distance corresponds to genomic location on the indicated chromosome; radial distance from the contour of the circle is the log-scaled distance in nucleotides from the closest marker site. Diagrams have been set to visualize proviruses located between 0 and 1 megabase. Proviruses located more than 1 megabase from the nearest marker accumulate at the center of the mandala. Figure S1 Chromosome projection mandala and F0.5 score calculated within 2 kB for the indicated markers (columns) versus the indicated proviruses (rows). ASLV and HTLV1 proviruses were cloned from HeLa cells, the Foamy virus from CD34+ hematopoietic stem cells (Table 2 and text). H3K4me3 and STAT1 ChIPSeq datasets were from HeLa cells (Table 1). N indicates the number of specific proviral integrations considered for each analysis. The F0.5 score and the percentage of proviruses within 2 kB are presented under each mandala.