Recurrent Targeted Genes of Hepatitis B Virus in the Liver Cancer Genomes Identified by a Next-Generation Sequencing–Based Approach

Integration of the viral DNA into host chromosomes was found in most of the hepatitis B virus (HBV)–related hepatocellular carcinomas (HCCs). Here we devised a massive anchored parallel sequencing (MAPS) method using next-generation sequencing to isolate and sequence HBV integrants. Applying MAPS to 40 pairs of HBV–related HCC tissues (cancer and adjacent tissues), we identified 296 HBV integration events corresponding to 286 unique integration sites (UISs) with precise HBV–Human DNA junctions. HBV integration favored chromosome 17 and preferentially integrated into human transcript units. HBV targeted genes were enriched in GO terms: cAMP metabolic processes, T cell differentiation and activation, TGF beta receptor pathway, ncRNA catabolic process, and dsRNA fragmentation and cellular response to dsRNA. The HBV targeted genes include 7 genes (PTPRJ, CNTN6, IL12B, MYOM1, FNDC3B, LRFN2, FN1) containing IPR003961 (Fibronectin, type III domain), 7 genes (NRG3, MASP2, NELL1, LRP1B, ADAM21, NRXN1, FN1) containing IPR013032 (EGF-like region, conserved site), and three genes (PDE7A, PDE4B, PDE11A) containing IPR002073 (3′, 5′-cyclic-nucleotide phosphodiesterase). Enriched pathways include hsa04512 (ECM-receptor interaction), hsa04510 (Focal adhesion), and hsa04012 (ErbB signaling pathway). Fewer integration events were found in cancers compared to cancer-adjacent tissues, suggesting a clonal expansion model in HCC development. Finally, we identified 8 genes that were recurrent target genes by HBV integration including fibronectin 1 (FN1) and telomerase reverse transcriptase (TERT1), two known recurrent target genes, and additional novel target genes such as SMAD family member 5 (SMAD5), phosphatase and actin regulator 4 (PHACTR4), and RNA binding protein fox-1 homolog (C. elegans) 1 (RBFOX1). Integrating analysis with recently published whole-genome sequencing analysis, we identified 14 additional recurrent HBV target genes, greatly expanding the HBV recurrent target list. This global survey of HBV integration events, together with recently published whole-genome sequencing analyses, furthered our understanding of the HBV–related HCC.


Introduction
Chronic human hepatitis B virus (HBV) infection leads to a range of liver diseases, from chronic hepatitis and liver cirrhosis to hepatocellular carcinoma (HCC). HBV infects approximately 2 billion people worldwide with 350 million suffering from chronic infection. HBV infection causes 500,000 to 1.2 million deaths annually, 320,000 of which are attributable to HCC [1]. Chronic HBV infection has been found to play a causal role in the development HCC from many epidemiological studies [2]. In addition, integrated HBV DNA sequences and episomal HBV genomes have been found in 85-90% of HBV-related HCCs [3].
The role of HBV integration in the development of HCC is rather complicated and yet not fully elucidated. HBV DNA integration was found to be distributed throughout different chromosomal sites in the host genome [4]. These integrations could induce chromosome changes, genome instability, or changes in the expression of human genes. HBV integration events were reported to be associated with chromosome fragile sites or repetitive sequences, and usually followed by local rearrangement, all of which relate to a higher genomic instability [5]. Furthermore, recent studies using genetic approaches and microarray technologies showed that HBV-related HCCs displayed higher rates of chromosomal alterations than HCCs of other risk factors [6]. HBV insertion was found to target the retinoic acid receptor-beta (RARB) gene [7] or the human cyclin A2 (CCNA2) [8], and generate chimeric oncogenic proteins.
A comprehensive analysis of HBV insertion sites targeting new genes would be desirable and would allow us to understand the mechanism of HCC development and to identify novel therapeutic targets. Towards this comprehensive goal, a moderate highthroughput approach using Alu-PCR approach has identified 68 cellular flanking sequences (seven repetitive or unidentified sequences, 42 cellular genes, and 19 sequences potentially coding for unknown proteins) for HBV integration in the HCC genomes [3]. The targeted genes belong to distinct pathways: calcium signaling related genes, 60s ribosomal protein encoding genes, and platelet derived growth factor and mixed lineage leukemia encoding genes. Recurrent HBV insertions nearby the telomerase reverse transcriptase (TERT) have also been reported [3,9,10].
Recently, many novel approaches have been developed to identify unique integration sites (UISs) for retrovirus in highthroughput manner using the next-generation sequencing (NGS) technologies including novel nonrestrictive approaches [11][12][13], which avoided preferential amplification and biased identification of UISs for the previous methods that use specific restriction enzymes.
More recently, complete genomic sequencing has been used to identify HBV integration genome-wide, taking advantage of the advance in next-generation sequencing technologies. Sung et al. conducted WGS at more than 30 fold coverage for 81 HBV positive HCCs and identified 399 HBV integration events using a criteria of more than 2 read pairs with close mapping positions linking an end of hg19 to an end of HBV, which resulted in the identification of an average of 4.9 HBV integration events per individual patient [14]. In another recently publication, Jiang et al. sequenced 4 HCC patients at 80X coverage and identified 255 HBV integration sites, which would generate at about 64 HBV integrations per individual patient [15]. Considering that WGS is still expensive and cost prohibitive for sequencing large number of samples, we have developed, in this study, an alternative method that integrates ligation-mediated PCR with Illumina's paired-end (PE) adapters for amplification and deep sequencing. Here we reported the analysis of 286 unique integration sites (UISs) from 40 pairs of HBV-related HCC and tumor-adjacent tissues, which resulted in the identification of about 7 HBV integrations per individual. We were able to further estimate the abundance of the insertion sites because our method used random fragmentation. Our analysis provided a global profile of HBV DNA integration for HCC. An integrated analysis with recently published whole genome sequencing analyses of HBV-related HCCs [14][15][16] provided us with 14 additional recurrent HBV target genes, thus greatly expanding the HBV recurrent target list.

Development of an NGS-based approach for genomewide analysis of HBV integration
We developed a massive anchored parallel sequencing (MAPS) approach to isolate and sequence integrants. The method combines thoughtful designs integrating ligation-mediated PCR with Illumina's PE adapters for amplification and deep sequencing. The MAPS approach stems from the intelligent integration of adapters used in Illumina library preparation with adapters used in LM-PCR, developed for HIV integration site analysis [17]. Illumina uses adapters for flow cell surface annealing, amplification and sequencing while LM-PCR uses adapters (or linkers) as primer matching sites for amplification of unknown nearby sequences. Adopting designing principles for linkers in established genome walking studies [18], we designed a functional adapter building upon Illumina's PE adapter sequence. The PE 2 Walking Adapter was partially complementary and 'Y' shaped, with a 'T' overhanging on the blunt end ( Figure 1A).
A schematic outline of the MAPS is illustrated in Figure 1B, and all MAPS primers are listed in Table S1. Genomic DNA was sheared by sonication, resulting in fragment sizes ranging from 1 kb to 100 bp. The ligation of PE 2 Walking Adapter to the fragments generates an intermediate library with two important characteristics: first, one adapter of the single strand template is identical to Illumina's PE PCR Primer 2 (PE 2); second, the adapter sequence is long enough to suppress unspecific amplification due to single-primer PCR effect [19] during the nested PCR. Targeted sequences were further enriched and full length Illumina PE sequencing adapters were introduced by PE PCR Primer 1 and 2 (PE 1 and PE 2). The library fragments between sizes of 500-600 bp were selected for sequencing.

DNA bar coding and Illumina PE sequencing of HBV integrations
To make the MAPS approach more efficient, we introduced barcodes to primers for multiplex analysis ( Figure 1, Table S2 and S3). In particular, the 6 nt barcode (with a common 'T' overhang) was introduced through the barcoded PE 2 Walking Adapters. The 5 nt barcode on Read 1 was introduced by barcoded primers through the secondary PCR.
We applied our MAPS approach in identifying HBV integrations in 40 pairs (cancer and adjacent tissues) of HBV-positive HCCs. According to the previous screen study [3], viral half of chimeric HBV-human DNA was preferentially truncated at a region between 1500-2000 base pairs on the HBV (numbering from the hypothetical EcoRI site of the HBV subtype adw). Therefore, a nested pair of primers in the gene hepatitis B virus x (HBX) of HBV was designed based on 4 HBV complete genomes (AY800389.1, AY800390.1, AY800391.1, and AY800392.1) isolated from the same local hospital where we collected the HCC tissues. The nested HBX primer (1583-1602), given the preference of viral truncation during integration, was able to generate chimeric HBV-human amplicons suited to the MAPS approach described above. Two negative controls were included, one with the human genomic DNA from an HBV-negative individual, and the other with the above plus purified HBV

Author Summary
Integration of the hepatitis B virus (HBV) into the human liver cells was found in most of the related hepatocellular carcinomas (HCCs). Here, taking the recent advances in high-throughput sequencing, we devised an efficient and cost-effective method that we named massive anchored parallel sequencing (MAPS) method, to conduct a global survey of HBV integration events in 40 pairs of HBV-related HCC tissues (cancer and adjacent tissues). We identified 286 unique integration sites (UISs) with precise HBV-Human DNA junctions. We identified a higher number of HBV integration events in cancer adjacent tissues than in HCC tissues, suggesting a clonal expansion process during HCC development. We also found that fibronectin and its related genes (fibronectin type III-like fold domain containing genes) were frequently targeted by HBV. Fibronectin is a protein produced abundantly by the liver cells and also serves as a linker in the extracellular matrix. Our findings might suggest a role for the disruption of fibronectin and associated cellular matrix in HBV related liver cancers. We also identified 14 additional recurrent HBV target genes, greatly expanding the HBV recurrent target list. This study would add significantly to our understanding of HCC development.  genomic DNA (i.e. a mixture of HBV DNA with the human genomic DNA, but no integration).
After NGS analysis, sequences from both ends were obtained. The Read 1 sequence starts from the same position after the anchored nested HBX primer; and the Read 2 sequence, which was obtained from the opposite end, differs in the alignment to the human genome due to the DNA random fragmentation from sonication ( Figure 2A). We relied on the DNA shear point deduced in Read 2 to identify putative integrations. To be conservative, integration was determined as authentic if there is a Read 2 sequence present at the corresponding viral-cellular junction.

Identification of global HBV integrations in the human genome and their distribution features
In the end, we identified 296 HBV DNA integrations (Table S4 and Table 1). 10 of which are paired, identical integrations found in both cancer and their adjacent tissues of the same patient, therefore there were 286 UISs. Among them, 42 sites were found in cancers, while 254 sites were found in adjacent tissues, which is about 6 times more than that found in cancers.
We randomly picked 28 putative insertions and used a standardized nested-PCR assay, we confirmed 27 out of 28 putative insertions (96%) ( Table S5). Our confirmation rate is similar to what Sung et al. reported for the HBV insertions from whole genome sequencing, which they randomly selected 32 breakpoints at the 6 affected genes for PCR analysis in 22 samples and were able to successfully validate 82% of these integration sites [14]. To further validate those putative sites, we sequenced two amplified integrant PCR products (HBV integration in ERBB4 and GRM8) by the Sanger method. BLAST analysis revealed that both sequences extend from the HBX gene to the identified sites in the human genome ( Figure 2B and 2C). We also examined the length of DNA fragments in the sequencing library. The lengths of the integrants (without the adapter and barcode sequences) clustered around 350-450 bp ( Figure 2D).
We annotated insertions the nearest RefSeq genes within 10 kb from the HBV-Human junction points (Table S4) and this annotation was used in the other analyses in this manuscript. However, we also provided the annotations to within 150 kb in the Table S4 as previous studies showed that the HBV genome includes enhancer elements capable of activating promoters even 150 kb away, in a position and orientation independent manner [9,20].
We compared the genes targeted by HBV from our study to three recently published studies. Fujimoto et al. only identified 6 genes (ADAM5P, FRAS1, THRB, HNF1A, FAM18B2, TTLL9) in addition to TERT (their supplementary table 13) from 11 HBVrelated HCCs in 23 integration events [16]. Only the TERT gene is in common with our HBV targeted gene list. Sung et al. identified 399 HBV integration events from 81 HBV positive HCCs (from their supplementary table S3). As they did not indicate what distance they allowed from the HBV integration to the start or end site of genes, we re-annotated their list with the same criterion we used in our annotation to within 10 kb of genes, and then performed the comparison. Jiang et al. reported the identification of 255 HBV integration events in three HBV positive HCCs [15]. We also re-annotated Jiang et al.'s list to within 10 kb of the gene for the comparison. Comparing our list (Table 1) to Sung et al.'s list, we found that there are 9 genes in common ( Table 2). Comparing our list to Jiang et al.'s list, we found that they are four genes in common. Two genes, FN1 (fibronectin 1) and MLL4 (Myeloid/lymphoid or mixed-lineage leukemia 4) were found in all three studies (Table 2). Except for We also checked all the HBV integration events against the miRNA clusters using the genomic coordinates of all miRNAs in the miRBase (www.mirbase.org). We found one integration event at ChrX:108297754 that maps 18 nucleotides upstream of miRNA has-mir-6087. Has-mir-6087 was recently identified in human ES cells [21].
We also mapped the position of each UIS to chromosomes and generated a superimposed diagram illustrating integration of UISs to within cytogenetic bands of the human genome ( Figure 3). To analyze whether the HBV integrations had any preferences over transcript units and individual chromosomes, a set of 8015 computer-simulated random integrations was generated as the control. We calculated the frequency of HBV integration within each chromosome, showing that chromosome 17 was favored target for HBV integration (Figure 4). 60.5% of the HBV integrations (178/294) landed in RefSeq genes (Table S4). This is significantly different from the result of the random dataset, of which only 42.68% landed in genes reflecting the feature of the human genomes (Pearson's chi-squared test, P,0.0001).

Functional annotations of the HBV targeted genes
To understand the function of the genes targeted by HBV, we performed Gene Ontology (GO) analysis using the GO miner program. We found that they are enriched in 3 GO terms related to cAMP metabolic processes and several GO terms related to Tcell differentiation and activation involved in immune response (Table S6). Interestingly, GO terms related to GO:0034661 ncRNA catabolic process, GO:0031050 dsRNA fragmentation, GO:0070918 production of small RNA involved in gene silencing by RNA, and GO:0071359 cellular response to dsRNA were also enriched (Table S6). These suggest that the HBV targeted host genes related to responding, or to defending or eliminating the virus such as the immune response and degradation of RNAs. We also found that GO terms related to TGF beta receptor pathway were enriched (Table S6).

Semi-quantification of HBV insertion sites based on tags covering the integration sites
The abundance of the UISs was quantified by the count of different shear points near the junction, a method that was also utilized by Gillet et al. [11]. In addition, we tested the quantitative potential of the UISs abundance by a serial dilution PCR, and found that the number of unique tags covering an integration site correlated with the abundance of the integration determined by     serial dilution PCR, both for an individual sample ( Figure 5A) and across different samples ( Figure 5B).
To examine technical reproducibility of the quantification, we compared duplicate results using same DNA samples but sheared separately. The correlation co-efficiency of the number of unique tags covering the common integration sites, even between two technical replicate experiments with slight modifications of primers used (one with the single 5 nt barcode and the other with the paired barcodes), was very high (R 2 = 0.9278; Figure 5C). We also randomized the total reads into 39 subsets and plotted a rarefaction curve showing the number of putative integrations identified with accumulating subsets. The curve plateaus in Figure 5D suggested that further sequencing would not yield many additional integration sites. The reliability of quantification was also reflected by the consistency in the number of unique tags covering the opposite integration junctions from the same integrations (data not shown).
Fewer UISs were identified in HCC comparing to adjacent tissues supporting the theory of clonal expansion in HCC development As each integration event provides a unique genetic marker for a cell, we can assess the clonal expansion of HBV integrated hepatocytes using UISs. We found that the cancer tissues showed clonal expansion compared to tumor-adjacent tissues, as judged by the number of shear sites near individual integration ( Figure 6, Mann-Whitney test, p,0.0001). Furthermore, the number of UISs (median = 1) found in cancer was less than that of the tumoradjacent tissue (median = 7, Figure 6B, Mann-Whitney test, p,0.0001), consistent with the results from previous studies [3,22]. Interestingly, there was no difference in the UIS abundance between clones inserted in the vicinity of genes and clones that were not (Mann-Whitney test, p = 0.8576) (Data not shown).

Recurrent HBV integration into 8 human host genes including TERT and FN1 genes
We annotated the 286 UISs with RefSeq genes and found that 8 genes were targeted by HBV integration for more than once (Table 3) [14]. FN1 and TERT were found as recurrent in both our and Sung et al.'s studies. The TERT gene was also reported by Fujimoto as a recurrent targeted gene of HBV [16]. If we consider the matches of our genes to the previous identified HBV targeted genes as recurrent, there are additional 9 recurrent genes ( Table 2).
To understand the consequences of HBV integration and the exact genomic features that is disrupted by the HBV integration, we created a BED format file to be used as custom track for UCSC genome browser to display the full list of HBV UIS with the chromosome integration junction (a 100 nucleotides were added to the junction position to create the ChromEnd column in the BED file) and with the total number of sequence reads covering the junction (Table S9). The positions of the recurrent integration sites in relationship to the gene structures were displayed using UCSC genome browser custom track ( Figure S1). All 6 unique integrations targeting TERT inserted at the gene's promoter region. In particular, five HBV integrations in TERT located at the upstream the transcription start site, and one located in the 59 UTR (Figure 7), similar to the HBV integrations target TERT    Figure S1). We performed RT-PCR on the available samples. For TERT gene, we compared one pair of samples where the only the cancer tissue harbors the integration, another pair of samples where both the HCC and the adjacent tissues harbor the HBV integration, and 6 pairs of samples where none of the tissues harbor the HBV integration. We found that expression ratios of TERT gene in HCC tissues to the tumor-adjacent tissues were 18.7, 1.0, and 0.33 (median of 6) respectively ( Figure S2A). This is consistent with Sung et al.'s data showing HBV integration increase the expression of TERT gene [14]. This may due to the fact that HBV integrations to the TERT gene that we identified were all to the promoter regions.
We also conducted real time RT-PCR of the FN1 genes in three pairs of samples that were available. In two pairs of samples where FN1 integration was found only in the adjacent tissues, one showed no significant change (1.5 fold only), and the other showed reduction in the adjacent tissues harboring FN1 integration ( Figure  S2B). For the third pair of samples where FN1 integration was found in the cancer tissue only, the expression of FN1 was increased in the cancer tissue ( Figure S2B). This might suggest that, for FN1 integration event, the effects on the FN1 might be position dependent or context (tumor vs. adjacent normal) dependent. The expression ratios of FN1 in the cancer to adjacent tissues in samples without HBV integration also varied greatly ( Figure S2B, right panel). There was no significant association between the HBV integration in the FN1 gene and the expression changes (overall T-test P = 0.506 comparing samples with or  without HBV integration, regardless of the HBV integrations were in the adjacent or cancer tissues) ( Figure S2B, right panel). This is consistent with Sung et al.'s observation that there was no statistically significant difference for the FN1 expression between samples with or without HBV integration (P = 0.42), as shown in their figure 4b [14].
We also conducted RT-PCR with the available pairs of samples for other recurrent HBV targeted gene including ARHGEF12, CYP2C8, PHACTR4, PLXNA4, and RBFOX1. For ARH-GEF12, the integration was found in the tumor-adjacent tissues in two pairs of available samples, and both integrations increased the expression of ARHGEF12 in the tumor-adjacent samples (7.4 and 1.8 fold respectively, tumor-adjacent to tumor ratios), which is statistically significant different from ARHGEF12 expression in 9 pairs of samples without ARHGEF12 integration (the average expression ratio of normal to tumor tissues was 1.1, T-test P = 0.04) ( Figure S2C).
RT-PCR analysis of two HBV integrations into the PHACTR4 gene in the tumor-adjacent tissues revealed that the integration increased the expression 35.8 fold in one case, but did not increase the expression in another case (only 1.44 fold), compared with the matched cancer tissues that did not harbor the integration (data not shown). RT-PCR analysis of HBV integration to the PLXNA4 and CYP2C8 did not reveal significant changes in gene expression (data not shown).
For the RBFOX1 gene, HBV integration events reduced the expression of RBFOX1 in the tumor-adjacent tissues comparing to the cancer tissue in two match pairs of samples. In additional 6 pairs of samples without HBV integration in RBFOX1 that we analyzed, RBFOX1 gene were over expressed in 5 of the 6 pairs, but was greatly under expressed in one of the 6 pairs (maybe an outlier) ( Figure S2D). Thus, the overall P value is not significant (P = 0.9).
Considering that a HBV target identified both in our study and the whole genome sequencing analyses of HCCs [14][15][16] is a recurrent event, we conducted an integrative analysis and identified a total of 20 recurrent HBV targeted genes (Table 4). Among them, 15 genes were either first identified in our studies or derived from the integrative analysis of our data with the recently published whole genome sequencing analysis.

PCR-based methods including Alu-PCR [23], LM-PCR [24]
and RS-PCR [10] were employed for isolating HBV integrants, but with biased amplification and limited sequencing capacity. Our massive anchored parallel sequencing (MAPS) approach avoided restriction bias by random fragmentation and takes full advantage of NGS with dual-functional adapters for comprehensive profile of HBV DNA integrations. Our approach is complementary to or an alternative approach to recent reports of applications of whole genome sequencing in the identification of HBV integration events [14][15][16].
We identified 286 UISs from 40 pairs of HBV-related HCC and tumor-adjacent tissues (Table S4), averaging at about 7 HBV integrations per individual. Recently, Sung et al. conducted whole genome sequencing at more than 30 fold coverage for 81 HBV positive HCCs and identified 399 HBV integration events using a criteria of more than 2 read pairs with close mapping positions linking an end of hg19 to an end of HBV, which resulted in the identification of an average of 4.9 HBV integration events per individual patient [14]. In another recently publication, Jiang et al. sequenced four HCC patients (only three are HBV positives) at 80X coverage and identified 255 HBV integration sites, which would generate at about 85 HBV integrations per HBV positive HCC patient [15]. Comparing the number of HBV integration per individual, our findings that the 7 HBV integrations per individual in our analysis is similar to the 4.9 HBV integrations per individual in Sung et al.'s study [14]. However, this might not be surprising as we sequenced two tissues (HCC and adjacent tissues) per individual while Sung et al. [14] only sequenced one tissue sample per individual. We found that adjacent tissues also harbor HBV integration events. However, the average numbers of HBV integration in our study and in Sung et al.'s analysis (about 7 and 5 respectively) are far off from the average number of HBV integrations (about 85) in Jiang et al.'s study. It is worth noting that, in the same issue of Nature Genetics, Fujimoto et al. found 23 HBV breakpoints using a criteria of 3 or more supporting readpairs from 11 hepatitis B virus (HBV)-related HCCs with about 40X coverage whole genome sequencing, averaging at only about 2 HBV breakpoints per sample [16]. Further investigation would be necessary to resolve the discrepancy. There might be many potential reasons: first, there might be great HCC heterogeneities. Different subclasses of HCC, not yet defined, could have been used in the studies. Secondly, different HBV strains or subgenotypes might have different capability of integration. Future  Table 3. Recurrent HBV targeted genes identified by the MAPS approach.  analysis of the correlation between HBV subtypes and their integration frequencies would be necessary to address the question. Thirdly, different technologies might have different false positive and false negative HBV integration identification rates. The Complete Genomics' technology was used in Jiang et al.'s study [15] but the Illumina's platform was used in Sung et al.'s study [14]. The false positives might be derived from different methods of library construction and ligation procedures (which might generate chimeric ligation events). The false negative rate might be related to the sequence depth, sequence quality, the mapping parameters for the sequence reads and so on. To control the false positive identification, we made the efforts to use two controls: one negative genomic DNAs (without HBV integration) and another with a mixture of free HBV DNA and human genomic DNAs. The latter would control for chimeric ligation events. Using our analysis pipeline and criteria, we did not find any HBV integration events in either of the controls. Fourthly, the criteria to call an HBV integration event might be another factor affecting the number of HBV events called. Sung et al. and Jiang et al. called an HBV integration events if there are supported by at least two paired-end reads [14] or two chimeric reads [15]. Fujimoto et al. [16] and we used the criteria of at least 3 mapped read pairs. We found more (about 6 times) HBV integration events in adjacent tissues (254 of 296 events, about 86%) than those in cancer tissues (42 of 296, about 14%). The median number of UISs in cancers was 1, whereas it is much higher at 7 in our tumor-adjacent tissues, consistent with previous reports showing nearly one insertion per HCC nodule [3,25] and several (less than 10) integrations in chronic hepatitis patients [22,26]. However, our observation is different from Sung et al.'s study where they found that HBV integration is observed more frequently in the tumors (86.4%) than in adjacent liver tissues (30.7%) [14]. We do not yet know the reason of this difference. We postulate that definition of adjacent tissues might be different in the two studies. In our study, the tumor-adjacent tissues might be closer to the cancerous tissues than the tumor-adjacent tissues in Sung et al.'s study [14]. In other word, one could postulate that the tumor-adjacent tissues closer to the cancer tissues might already contain precursor lesions while the tumor-adjacent tissues further away might be normal liver tissues. Therefore, we have adopted the term 'tumor-adjacent' tissue instead of the term normal tissue or the 'adjacent nontumor' tissue in this manuscript.
In addition to the lower number of UISs in cancer tissues described above, the HBV insertional frequencies in cancer tissues were statistically larger than that in cancer-adjacent tissues (p,0.0001), which was also observed by Jiang et al. in their study [15]. Taking together, this might suggest a higher level of clonal expansion of cancer hepatocytes in liver cancer tissues compared to cancer-adjacent tissues, which was also suggested by Jiang et al.'s whole genome sequencing of 4 hepatocellular carcinoma patients [15]. As suggested by a previous study [3], it seems that viral integration occurred during HBV chronic infection before HCC Figure 8. Illustration of the integrations in FN1. The sequences were representative Read 2 tags that cover the junction points of the HBV integration to the FN1 gene. HBV DNA was shown in lowercase, and the Human DNA was in uppercase and underlined. The numbered boxes represent nearby exons. The nucleotide position at the human and HBV were also indicated (based on GenBank AY800389.1 and Human hg19 assemble). doi:10.1371/journal.pgen.1003065.g008 initiation. Hepatocytes with HBV integration underwent certain rounds of expansion during chronic hepatitis; some of the cells participated in the formation of so called focal proliferative lesions and gained growth advantages, resulting in clonal expansions during tumorigenesis.
We also compared our list with the list of common CISs for HCC-associated genes identified in a transposon-based (Sleeping Beauty) insertional mutagenesis screening in mice [27]. There are no common genes between the lists. However, two genes in the mouse CISs belong to the same family members with two genes in our list: zbtb20 in mouse vs. ZBTB16 in human, and slc25a13 in mouse vs. SLC2A13 in human. This prompted us to analyze the two lists to see if they have common pathways. Using the DAVID program (david.abcc.ncifcrf.gov/) to analyze the pathways enriched in both lists, we found that they shared three enriched (P,0.05) pathways, which are the renal cell carcinoma pathway, focal adhesion pathway, and the ErbB signaling pathway (data not shown).
Once, integrations of virus into host genomic DNA was thought to be totally random [28]. More recent studies, however, suggested that integration of various viruses into human genome has distinct preferences [29][30][31]. The profile of HBV insertion sites in this study also demonstrated that HBV integration preferentially landed in transcription units and specific chromosomes. Using a randomly generated in silico integration site dataset, we demonstrated that viral integration by HBV favored transcription units (P,0.0001) and chromosome 17. Preference for HBV integration to chromosomes 11 and 17 has been reported in a previous study [4] and a preference for chromosome 3 has been reported in chronic hepatitis tissues without HCC by Alu-PCR [26].
A possible oncogenic contribution of HBV DNA integration involves the insertion of viral DNA into cellular genomic regulatory regions or coding regions, resulting in modification of gene expression (cis-activation) or production of structurally and functionally aberrant cellular or hybrid proteins. Previously identified targeted genes include the myc family oncogenes [32,33], CCNA2 [8,34,35], the RARB [7,36], the mevalonate kinase (MK) [37,38], the carboxypeptidase N (CPN2) [39], sarco/ endoplasmic reticulum ralcium ATPase (SERCA) [40] and TERT [9]. Whether any oncogenic proteins such as chimeric HBV-host gene fusion proteins would be generated from the 286 UISs remains to be investigated. We found that 8 target genes were inserted by HBV for more than once (Table 3), including ARHGEF12, CYP2C8, FN1, PHACTR4, RBFOX1, SMAD5 and TERT. Previously, TERT and MLL4 (myeloid/lymphoid or mixed-lineage leukemia 4) were identified as recurrent HBV targeted genes [10,16,41]. Here we identified additional 8 genes as recurrent target of HBV integration, expanding the list of human host genes targeted by recurrent HBV integration. Further additional 9 recurrent genes were identified if we consider the matches of our HBV targeted genes to the previously identified HBV targeted genes [14,15], expanding the novel recurrent target genes to 14 (Table 4, identified under our study or integrated analysis).
Fibronectin 1 is a high-molecular weight (,440 kDa) glycoprotein of the extracellular matrix and contain binding domains for binding to cell surface integrins and other extracellular matrix components such as collagen, fibrin, actin and heparan sulfate proteoglycans (e.g. syndecans), and it plays a major role in cell adhesion, motility, opsonization, as well as in wound healing and embryonic development [42]. Altered fibronectin expression, degradation, and organization have been implicated in cancer and fibrosis [43]. We identified five HBV integration sites in the FN1 genes: four integrations into the intron 5, 26, 27 and 29, and the fifth into the exon 15 ( Figure 8). Whether these HBV integrations could alter FN1 expression or had any causative role in HCC carcinogenesis remains to be determined. HBV integration into FN1 exon 15 could either create a chimeric FN1-HBV fusion protein or could generate a truncated FN1 protein missing the exon 16-46, which would shorten the transcript from 8449 nucleotides (NM 002026) to less than 2565 nucleotide (exon 15 ends at nucleotide 2565). The protein product would shorten from 2355 AA to less than 766 AA, which would reduce the number of fibronectin type-III repeats from 16 in FN1 to about 1.5 in the truncated version. Furthermore, the truncated FN1 would miss the heparin-binding and syndecan-binding domains. Sung et al. recently also reported 5 HBV integrations into the FN1 gene; however, all are to the introns [14].
Using PCR, we were able to confirm the HBV genomic integration into FN1 using the genomic DNAs used for MAPS library preparation and sequencing. However, when we attempted to take a second section of the tissue to prepare RNA and to perform RT-PCR to identify chimeric transcripts that could be potentially generated from the integration, we were not able to identify chimeric HBV-FN1 RNA transcripts. It could be possible that integration prevented transcription or due to different section of the tissues were used as tumors usually show great heterogeneity. Further experimentation would be necessary to figure out the consequence of the HBV integration into the FN1 region.
Considering that 7 genes (PTPRJ, CNTN6, IL12B, MYOM1, FNDC3B, LRFN2, FN1)) contain the IPR003961: Fibronectin, type III domain (Table S6), we speculate that fibronectin type III repeat or like fibronectin type III repeat fold domain might play a role in HCC. Zhang et al. recently showed that up-regulated miR-143 transcribed by nuclear factor kappa B (NFKB) promotes the invasion of HBV-HCC by repressing expression of fibronectin type III domain containing 3B (FNDC3B) [44]. Further investigation is warranted. However, we believe that would be the task for a future paper, as it will take considerable effort and time to investigate the mechanism involved. When Tomlins et al. first identified recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer in a science publication [45], it took the field several years to figure out the mechanism involved [46].
We used quantitative RT-PCR to assess the effects of HBV integration on the targeted gene expression in available matched pairs of tumor-adjacent and tumor samples. We found that HBV integration into the TERT gene increases its expression ( Figure  S2A). This is consistent with Sung et al.'s data showing HBV integration increase the expression of TERT gene [14], and might also reflect the fact that HBV integrations to the TERT gene that we identified were all in the promoter regions. In addition, we also found that HBV integration to the ARHGEF12 gene in the tumor-adjacent tissues increased their expression ( Figure S2C). However, for the rest of recurrent targeted genes, the effects on the expression are more complicated: they either showed no effects or had sample-specific effects. Our observation on the effect of HBV integration to its targeted genes is similar to what Sung et al. observed, where they found that three of the six HBV targeted genes showed changed expression level, but another three of the six targeted genes showed no statistically significant changes in gene expression (Figure 4b in [14]).
To determine the sensitivity of our MAPS approach, we tested to see if we can identify the same HBV integrations in data from replicate runs generated during our optimization of the protocol. The two replicate experiments were performed using the same pool of DNAs of 6 pairs of samples plus two controls, but differ slightly in the modifications of primers used: the first was with the single 5 nt barcode and the second with the paired barcodes. For the first replicate, we identified 54 HBV integration events, and for the second replicate, we identified 68 HBV integration events. 40 HBV integration events are in common (data not shown). We identified more HBV integration events for the second data sets as we had a deeper sequencing coverage (a total of 30,761,630 reads vs. 24,886,345 in the first dataset). A total of 82 HBV integration events were identified combining both replicates. 14 were uniquely identified in the first replicate. Therefore, it is possible that we might miss 14 of 82 (17%) of the HBV integration events if we only sequenced 30 million per 14 samples (i.e. a sequence depth of about 2.2 million reads per sample). In a random sampling analysis ( Figure 5D), we showed that indeed the curve for the number of HBV integration events identified per million of additional sequence reads starts to become flat at 10 million reads (for 14 samples, this would correspond to an average at about 700 K per sample), but it continues to rise, although slowly. As the average sequence depth is about 2.2-2.5 million reads per sample in our analysis, we would miss at least 17% of the HBV integration events. It is worth to point out that our sensitivity is not 100%, but our specificity is high as we controlled for the false positive identification using two controls: one negative genomic DNAs (without HBV integration) and another with a mixture of free HBV DNA and human genomic DNAs. The latter would control for chimeric ligation events. Using our analysis pipeline and criteria, we did not find any HBV integration events in either of the controls.
There are 47 HBV integration events that could not be annotated (Table S4). We searched them against the human repeat sequence using the RepeatMasker program (www. repeatmasker.org). Seven of the integrations were found in the repeat regions including LINE, LTR/ERV1 and SINE/Alu repeats (Table S11).
In summary, this study highlights these novel findings: 1) higher number of HBV integration events were found in cancer adjacent tissues than in HCC tissues, suggesting a clonal expansion process during HCC development; 2) fibronectin and its associated genes including fibronectin type III-like fold domain containing genes are frequently targeted by HBV DNA integration, and 3) 14 novel recurrent HBV targeted genes were identified when combined our list with recently published HBV targeted gene lists [14][15][16], greatly expanding the recurrent HBV human genome integration gene list. This study, together with recent global survey of HBV integration events [14][15][16] provides a good foundation for further experimentation and understanding the mechanism of HBVrelated HCC.

Ethics statement
Informed consent was obtained from each patient and the ethics committee of the First Affiliated Hospital, College of Medicine, and Zhejiang University approved all aspects of the study.

Isolation and identification of HBV DNA integrations
The study population comprised 40 HBV-positive HCCs (Table  S10).
Two micrograms of the genomic DNA in a volume of 60 ml of Elution Buffer (EB, Qiagen) were fragmented by Bioruptor (Diagenode Inc.). The operating conditions were the following: ice-water batch refreshed every 5 minutes of sonication, intensity level H, 5 sec power on and 5 sec power off, a total of 25-35 minutes of sonication depends on the results of gel electrophoresis. Sonication resulted in the sizes of DNA fragments ranging from 1000 bp to 100 bp. DNA ends were then end-repaired using 15 units of T4 DNA polymerase (New England Biolabs, NEB), 5 units of DNA polymerase I Klenow fragment (NEB), 50 units of T4 polynucleotide kinase (NEB) and 0.8 mM of dNTP (NEB) in T4 DNA ligase buffer (NEB) at 20uC during 30 min. DNA fragments was then purified using a Qiaquick PCR purification kit (Qiagen) and eluted in 30 ml of EB. Addition of an adenosine at the 39 ends of the DNA was performed by adding 0.2 mM of dATP (TaKaRa) and 15 units of Klenow Fragment 39 to 59 exo-(NEB) in NEB2 buffer (NEB) at 37uC for 30 min. DNA was then purified using a Qiaquick PCR purification kit and eluted in 30 ml of EB. Twenty-six different linkers were constructed, each one with a specific 6 bp tag (Table S2) to mark each DNA sample before amplification.
Two rounds of PCR were performed to amplify target sequences (integrants) using KOD-Plus-Neo DNA polymerase (Toyobo Co., Ltd.). A first round of PCR was performed with the HBX1 primer and the walking primer PE 2.1 for 25 cycles. Target sequences were amplified in 50 ml reaction volume containing 1X PCR Buffer for KOD-Plus-Neo DNA polymerase, 1.5 mM MgSO4, 200 mM each of dATP, dCTP, dGTP and dTTP, 300 nM each of the HBX primer and the walking primer, 500 ng of the walking library and 0.02 units/ml KOD-Plus-Neo DNA polymerase. Amplification condition comprised of a 2 minutes activation step at 94uC followed by 25 cycles of 95uC for 15 seconds, and 68uC for 1 min. A second round of PCR was performed with the PE1-Barcode-HBx2 primer and the walking primer PE 2.2. One micro liter of the first round PCR product was used as the template in a 50 ml reaction. All components were at the same concentrations as the first round PCR, and cycling conditions were the same as well. Reactions of nested PCR were diluted 10 fold in sterile distilled water.
Five microliters of the diluted product were used to enrich target templates while introducing the full-length Illumina PE sequencing adapters. Enriched libraries were then purified by gel electrophoresis separately. For each sample, a gel slice containing the DNA material in around 500 bp was excised with clean scalpels. Libraries were purified with the QIAquick Gel Extraction Kit (Qiagen Inc.), and then the 12 barcoded samples were pooled together with two control libraries at equimolar ratios. These two negative control libraries were used to help evaluate this method. HBV-free genomic DNAs isolated from the glioblastoma cell line U251, either mixed with HBV Genomic DNA or alone, were subjected to the MASP approach described above. Pooled PE sequencing libraries were quantified with Qubit (Invitrogen Inc.). 95695 or 78678 cycles of PE sequencing reactions were performed on an Illumina GA II according the standard protocol.

Double barcodes for multiplex analysis
Six or sixteen pairs (cancer and adjacent tissues) of HBVpositive hepatocellular carcinomas (HCCs) were analyzed in each multiplex lane using DNA bar coding, a 5 nt barcode on Read 1 and a 6 nt barcode on Read 2. The barcoded primers were listed in Table S2 and S3. Illumina image analysis algorithms use the first 4 or 5 cycles of pictured images to identify the positions of individual cluster (11). In MAPS, Read 1 ends sequences extended from specific nest primers (HBX secondary primer) are of low-diversity. If there is no barcode bases added before the primer, the position determining base signals on the image tend to be severely skewed (data not shown), which compromises much of the sequencing capacity. As a result, we used the 5 nt barcode in PE1-Barcode-HBX2 primers to analyze pooled DNA samples.
Multiple parallel nested PCR reactions provide an ideal scenario for migration of amplicons between samples. Barcoded HBX primers crosstalk with integrant templates from other samples, and the crosstalk products were simultaneously sequenced. In this study, however, insertions were assigned to original samples by comparing their frequencies in each sample (data not shown). To secure sequence assignment and suppress PCR contaminations, we additionally introduced the 6 nt barcode to Read 2. Using the paired barcodes, we eliminated potential contaminations and assign insertions to their samples with confidence.

Sequencing data analysis and insertion identification
Images deconvolution and quality values calculation were performed using the Goat module of the Illumina pipeline. Read 1 sequences of Illumina PE sequencing were assigned to individuals according to the 5 nt barcodes using SplitFastQIndex. py from http://bioinf.eva.mpg.de/multiplex/. HBV containing sequences were filtered from non-HBV sequences using SOAP (Short Oligonucleotide Analysis Package) (http://soap.genomics. org.cn/) matching with all HBV genomes references we downloaded from NCBI. The corresponding Read 2 sequences with the correct 6 nt barcode of each sample were picked out using the SplitFastQIndex.py script. The selected Read 2 sequences were then aligned with HBV genomes to filter out HBV derived sequences by SOAP. The remaining sequences were aligned with Human Genome to map potential insertions.
HBV containing sequences from Read 1 were identified by aligning them to all HBV genome references from the GenBank database isolated in China. The corresponding Read 2 was aligned to the HBV genomes to filter out HBV derived sequences. The remaining Read 2 was aligned to the human genome (build hg19) to map their locations for putative insertion site identification.
We calculated the frequency and coverage of the insertion sites. Read 2 sequences were viewed as redundant copies of unique tags, if they mapped the same coordinates of the Human Genome. The coverage is the average copy numbers of all the unique tags. To be conservative, we considered different tags offset by less than 2 nucleotides at the ends as one effective unique tag. We also set the criteria that a putative integration should have at least three effective unique tags. Based on the sequencing data from the negative controls, we selected candidate integration site when there is more than 10 fold of overall coverage, and 2/3 of the unique tags have the coverage over 3 have 10 fold of overall coverage. We viewed those putative integrations as authentic if there is Read 2 sequence covering the integration junction point.

Validation by nested PCR assays
The first round of the PCR was performed with the HBX1 primer and a site-specific first round primer for 25 cycles, with 50 ng of the genomic DNA as templates. The 20 ml reaction contained 1X PCR Buffer for KOD-Plus-Neo DNA polymerase, 1.5 mM MgSO4, 200 mM each of dATP, dCTP, dGTP and dTTP, 200 nM each of HBX primer and site-specific primer, and 0.02 units/ml KOD-Plus-Neo DNA polymerase. Thermal conditions were consisted of a 2 minutes activation step at 94uC followed by 25 cycles of 95uC for 15 seconds, and 68uC for 1 min. 1 ml of the products were diluted by adding 49 ml sterile distilled water to each. 0.5 ml of the dilution was subjected to the second round, 30 cycles of 10 ml reactions with the nest site-specific primer and HBX 2 primer (ACTTCGCTTCACCTCTG-CACGT). Concentrations and thermal condition were similar to the first round; except that extension time at 68uC was reduced to 45 sec. Final products were visualized on 2% gels.

Semi-quantitative analysis of UIS abundance
The relative abundance of a given UIS was calculated from the number of amplicons of different length (i.e. different shear site). Semi-quantitative PCR was performed to evaluate the quantification within a single sample and among multiplex samples. Duplicate experiments were compared to examine the consistency of the quantification. To evaluate the sequencing depth for quantitative analysis, we also did a rarefaction analysis of raw sequence data using a random sampling approach.

Generation of random dataset
To determine if there were favored chromosomes for HBV integration in vivo, we generated a comparative random set of 8,015 genomic positions in silico. A set of 10,000 random numbers was generated by Perl script, between 1 and 3,095,677,412, the size of the human genome. These random numbers were then converted into chromosomal positions. We retrieved 35 nt sequences that ended with these positions. We then mapped those sequences to the Human Genome using SOAP, resulting in 8,015 unique matches.

Rarefaction analysis
We randomized the total sequence reads 10 times into 39 size bins to evaluate saturation of the insertion pool by current MAPS strategy. As the total reads was 30,761,630, the largest size of randomized sets was set to 30 million (M). The first 9 smallest sizes of subset range from 0.1 M to 0.9 M, at an interval of 0.1M. The other 30 subsets range from 1 M to 30 M, at an interval of 1M. We determined the numbers of putative integrations in all the randomized subsets according to the criteria above. The results were plotted using Sigma Plot (version 11.0; SPSS Inc.). Regression analysis was also performed to fit the rarefaction curve into the double rectangular hyperbola curve model.

Mapping and functional annotation of UIS
RefSeq genes or its vicinity (150 kb upstream of the transcript start sites and 150 kb downstream of the transcript stop sites) inserted by HBV DNA were determined by home-written Perl scripts. The nearest RefSeq gene was selected for each integration site, and was referred as HBV integration target gens in the text. We used the Idiographica (http://www.ncrna.org/idiographica/) to generate a high-quality ideogram of human chromosomes with HBV integration sites. We performed Gene Ontology (GO) analysis using the GO miner program (http://discover.nci.nih. gov/gominer/index.jsp). Additional Functional annotation of the target genes tagged by integration was performed using DAVID (http://david.abcc.ncifcrf.gov/) Bioinformatics Resources. Categories of KEGG PATHWAY, and INTERPRO were included for further interpretation.

Real-time PCR assay
Total RNAs were extracted using mirVana miRNA Isolation Kit (Ambion, Life Technologies) according to the manufactory's instruction on total RNA extraction. The concentration and quality of RNAs were determined by Nanodrop and gel electrophoresis. RT-PCRs were performed with M-MLV Reverse Transcriptase (Promega) using 2 mg RNAs, and real-time PCRs were performed using SYBR Premix Ex Taq II (TaKaRa) in 20 ml in duplicates.

Statistical analysis
Statistical tests were performed using SAS 9.2. The Shapiro-Wilk test or the Kolmogorov-Smirnov test was used for determining normality of data. The Mann-Whitney test was performed to analyze the difference in UISs abundance and viral integration frequency between tumor and non-tumorous groups. Integration rate in RefSeq genes compared to random was tested using the Pearson's chi-squared test. A p-value ,0.05 was considered as significant for all the tests.