Exogene: A performant workflow for detecting viral integrations from paired-end next-generation sequencing data

Zachary Stephens; Daniel O’Brien; Mrunal Dehankar; Lewis R. Roberts; Ravishankar K. Iyer; Jean-Pierre Kocher

doi:10.1371/journal.pone.0250915

Abstract

The integration of viruses into the human genome is known to be associated with tumorigenesis in many cancers, but the accurate detection of integration breakpoints from short read sequencing data is made difficult by human-viral homologies, viral genome heterogeneity, coverage limitations, and other factors. To address this, we present Exogene, a sensitive and efficient workflow for detecting viral integrations from paired-end next generation sequencing data. Exogene’s read filtering and breakpoint detection strategies yield integration coordinates that are highly concordant with long read validation. We demonstrate this concordance across 6 TCGA Hepatocellular carcinoma (HCC) tumor samples, identifying integrations of hepatitis B virus that are also supported by long reads. Additionally, we applied Exogene to targeted capture data from 426 previously studied HCC samples, achieving 98.9% concordance with existing methods and identifying 238 high-confidence integrations that were not previously reported. Exogene is applicable to multiple types of paired-end sequence data, including genome, exome, RNA-Seq and targeted capture.

Citation: Stephens Z, O’Brien D, Dehankar M, Roberts LR, Iyer RK, Kocher J-P (2021) Exogene: A performant workflow for detecting viral integrations from paired-end next-generation sequencing data. PLoS ONE 16(9): e0250915. https://doi.org/10.1371/journal.pone.0250915

Editor: Zechen Chong, University of Alabama at Birmingham, UNITED STATES

Received: April 14, 2021; Accepted: July 8, 2021; Published: September 22, 2021

Copyright: © 2021 Stephens et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: We have used whole-genome and whole-exome sequencing data from the TCGA-LIHC project, which is accessible at this url: https://portal.gdc.cancer.gov/projects/TCGA-LIHC. PacBio long reads supporting human/viral integration sites are available on SRA under BioProject ID PRJNA741814.

Funding: This study was funded by the Mayo Clinic Center for Individualized Medicine. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The integration of viruses into the human genome has been extensively studied and is central to the etiology of many prominent diseases [1, 2]. The link between viral integration and tumorigenesis in humans was established in the 1960s [3, 4], and since then there has been increasing experimental evidence associating viral integrations with human cancers. Examples include human papilloma virus and cervical cancers [5], hepatitis B viruses (HBV) and liver cancer [6], herpes and Epstein-Barr viruses and lymphoma [7, 8], among others [9–11]. Over the last decade, next-generation sequencing (NGS) technologies have accelerated the study of viral integration, enhancing our understanding of virus-associated tumor development and enabling the study of viral integration on genome-wide scales. These studies have found many associations between viral integration and host genome instability, e.g. regions surrounding integration sites exhibiting increased mutation rates, copy number alterations, or aberrant gene expression [12–14]. Additionally, it has been observed that viral integrations in tumor samples are often enriched near genes with known associations to cancer, including MLL4 [15], MYC [16], and TERT [17].

Despite the clinical utility of identifying viral integrations, the sensitive detection of human/viral junctions from NGS data is made challenging by several factors. These include sequence similarities between host and viral genomes, integrated virus segments that differ from available reference genomes, and the limited number of validated integration sites in publicly available samples that can be used to assess detection accuracy. Several software applications for detecting viral integrations have been recently reviewed [18, 19], with each tool generally starting from unmapped reads or from reads mapped to a combined human + viral reference database. The tuning of read filtering and breakpoint detection strategies is crucial for the efficient extraction of informative reads, particularly when working with tumor samples where the number of reads supporting an integration may be limited. Additionally, these methods must be computationally efficient to be useful in practice, and must scale to the size and complexity of large sequencing datasets.

To address these challenges, we present Exogene, a new workflow for reporting viral integration sites from paired-end sequencing data. Exogene is computationally efficient and can identify integration coordinates from paired-end whole-genome sequencing (WGS), whole-exome sequencing (WES), RNA-Seq, or targeted capture data. Exogene builds upon our previous methodology HGT-ID [20], with new preprocessing, alignment, and filtering strategies to improve breakpoint precision.

We demonstrate Exogene’s ability to identify viral integration sites in 6 samples (5 WES, and 1 WGS + WES) from the TCGA Liver Hepatocellular Carcinoma (HCC) project. We show that the coordinates reported by Exogene are highly concordant with those found in a long read validation set. We demonstrate an improvement in accuracy over HGT-ID, attributable to Exogene’s improved extraction of informative read pairs. Additionally, we demonstrate Exogene’s applicability to targeted capture data by processing 426 HCC tumor/normal pairs from a previous study, achieving 98.9% concordance with existing results and augmenting them with 238 novel high-confidence integrations. Exogene’s runtime scales with input file size, and can process a 100× coverage WGS BAM (∼ 470 GB) within 12 hours (4 CPUs, 32GB memory).

Exogene is distributed as a Docker container, and is available at github.com/zstephens/exogene.

Materials and methods

Exogene takes as input a BAM file, or alternatively paired FASTQ files, and produces an output report of all detected integrations, including breakpoints, quality metrics and visualizations (Fig 1).

Fig 1. Overview of Exogene workflow.

https://doi.org/10.1371/journal.pone.0250915.g001

Exogene begins by aligning the input reads to a collection of 1,628 viral reference sequences that are included with the workflow. This is performed using BWA MEM in single-end mode. From the resulting BAM file we enumerate the names of all reads that mapped to a virus with an alignment length of at least K. By default Exogene uses K = 30, but for shorter reads it may be necessary to reduce this value. Because this first step maps all input reads solely to viral references, its output will likely contain human DNA reads that mapped to a viral reference only because of human/viral sequence similarity. We have found that the vast majority of these reads are either low-complexity or originate from regions of the human genome that we identified as having similar sequence content to one of the viruses. To address this, aligned reads are annotated for low-complexity sequence using Dustmasker [21]: if > D% of a read’s length is masked then it is discarded. By default Exogene uses D = 70%. The reads are then tested for similarity to a collection of decoy and transcriptome sequences (including exon-exon junctions), and reads are discarded if > T% of their length matches in a single alignment. By default Exogene uses T = 90%.
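The three read-level filters described above can be summarized in a short sketch. The Python below is illustrative only and is not taken from the Exogene source: the ReadEvidence record and its field names are hypothetical, and the per-read quantities are assumed to have been computed upstream by BWA MEM, Dustmasker, and the decoy/transcriptome alignment. Only the default thresholds (K = 30, D = 70, T = 90) come from the text.

```python
from dataclasses import dataclass


@dataclass
class ReadEvidence:
    """Hypothetical per-read summary; field names are assumptions."""
    name: str
    read_length: int
    viral_aln_length: int       # aligned bases in the viral alignment
    masked_bases: int           # bases masked as low-complexity
    best_decoy_aln_length: int  # longest single decoy/transcriptome match


def passes_filters(read, k=30, d=70.0, t=90.0):
    """Return True if the read survives all three filters."""
    # Filter 1: viral alignment must span at least K bases.
    if read.viral_aln_length < k:
        return False
    # Filter 2: discard if more than D% of the read is low-complexity.
    if 100.0 * read.masked_bases / read.read_length > d:
        return False
    # Filter 3: discard if more than T% of the read matches a decoy
    # or transcriptome sequence in a single alignment.
    if 100.0 * read.best_decoy_aln_length / read.read_length > t:
        return False
    return True


def retained_read_names(reads, **thresholds):
    """Names of reads that pass all filters, in input order."""
    return [r.name for r in reads if passes_filters(r, **thresholds)]
```

Note that the comparisons for D and T are strict, following the "> D%" and "> T%" wording, so a read masked over exactly 70% of its length is kept in this sketch.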

Two FASTQ files are then constructed by extracting, from the original BAM/FASTQ, all read pairs in which at least one mate aligned to a virus and passed all filters. These reads are then aligned using BWA MEM in paired-end mode to the combined human+viral reference (human reference build GRCh38). BWA is run with the -Y input option so that large soft-clipped segments are recorded as supplementary alignments.

SAM records are extracted from this alignment if their associated read pair has at least one alignment to human and at least one alignment to virus. Possible integration coordinates are identified from soft-clipping in human alignments, and the specific virus is inferred from viral alignments (either from a supplementary alignment of the read containing the soft-clip, or from the primary alignment of its mate). If no clipping is present then we only have the discordant mapping as evidence of integration. In this case integration coordinates are estimated based on the position of the human alignment and fragment length statistics provided by BWA. In the event that one or more of the reads are multi-mapped, that is, aligned at multiple positions with mapping quality 0, a “representative” alignment is chosen for each read (see S1 Fig).
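To make the soft-clip step concrete, the following sketch derives candidate breakpoint coordinates from the POS and CIGAR fields of a single SAM record aligned to human. It is a simplified illustration rather than Exogene's implementation: the function name, the min_clip parameter, and the convention of reporting the aligned base adjacent to the clip are all assumptions made here.

```python
import re

# CIGAR operations as defined in the SAM format.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def softclip_breakpoints(pos, cigar, min_clip=1):
    """Return [(side, coordinate)] for soft clips of at least min_clip bases.

    pos is the 1-based leftmost aligned reference position (SAM POS).
    The coordinate is the aligned base next to the clipped segment.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    # Hard clips carry no sequence, so ignore them when looking at read ends.
    core = [(n, op) for n, op in ops if op != "H"]
    found = []
    if core and core[0][1] == "S" and core[0][0] >= min_clip:
        found.append(("left", pos))
    if len(core) > 1 and core[-1][1] == "S" and core[-1][0] >= min_clip:
        found.append(("right", pos + ref_len - 1))
    return found


if __name__ == "__main__":
    print(softclip_breakpoints(1000, "30S70M"))     # [('left', 1000)]
    print(softclip_breakpoints(1000, "70M30S"))     # [('right', 1069)]
    print(softclip_breakpoints(500, "20S60M20S"))   # [('left', 500), ('right', 559)]
```

A read clipped on its left end places the junction at the first aligned base, while a read clipped on its right end places it at the last aligned base, which is why the reference-consuming length of the alignment is needed.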

Detected integrations are clustered by position and each cluster is summarized with its predicted integration coordinate, supporting read count, and quality metrics such as breakpoint variance and read mapping quality (MAPQ) distribution. If desired, the user can opt to include weakly supported integrations in the final output report. These are integrations flagged as:

Low read count: fewer than N_s soft-clipped reads, fewer than N_d discordant reads. By default N_s = 2, N_d = 5.
Low MAPQ: supporting reads were aligned with mapping quality 0. This filter only applies to reads mapped to human. It is expected that viral alignments may have low mapping quality because our viral database contains many highly similar sequences for certain viruses.
Uncertain coordinate: integration position is in a large repetitive region, or in regions with high sequence similarity to viral references.
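The clustering and flagging described above might be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: the clustering distance max_gap is hypothetical, the low read count flag is interpreted as requiring both counts to fall below their thresholds, and the low MAPQ flag is interpreted as applying when every supporting human alignment has mapping quality 0. The uncertain coordinate flag is omitted because it depends on region annotations.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical single-read evidence for an integration."""
    chrom: str
    pos: int    # estimated integration coordinate on the human reference
    kind: str   # "softclip" or "discordant"
    mapq: int   # mapping quality of the human alignment


def cluster_calls(calls, max_gap=500):
    """Group calls on the same chromosome whose sorted positions are
    within max_gap bp of the previous call (max_gap is an assumption)."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.pos)):
        last = clusters[-1][-1] if clusters else None
        if last and last.chrom == call.chrom and call.pos - last.pos <= max_gap:
            clusters[-1].append(call)
        else:
            clusters.append([call])
    return clusters


def flag_cluster(cluster, n_s=2, n_d=5):
    """Return the list of weak-support flags for one cluster."""
    n_soft = sum(c.kind == "softclip" for c in cluster)
    n_disc = sum(c.kind == "discordant" for c in cluster)
    flags = []
    # Assumed interpretation: both kinds of evidence are below threshold.
    if n_soft < n_s and n_disc < n_d:
        flags.append("low_read_count")
    # Assumed interpretation: all human alignments have MAPQ 0.
    if all(c.mapq == 0 for c in cluster):
        flags.append("low_mapq")
    return flags
```

With the defaults N_s = 2 and N_d = 5, a cluster supported by two soft-clipped reads is not flagged for read count in this sketch, whereas a cluster with a single discordant pair is.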

Viral references

Exogene uses a database of 1,628 viral reference sequences. A majority of the sequences were sourced from Virus-Host DB (https://www.genome.jp/virushostdb/), which compiles sequences from RefSeq, GenBank, EBI, UniProt, ViralZone, and published literature. We augmented the set with specific genomes of interest sourced from specialized databases; most notably, additional strains of herpes and HPV (sourced from GenBank), and additional strains of HBV (genotypes A-H and various recombinants sourced from HBVdb [22]). We include multiple strains of certain viruses to increase the likelihood of extracting reads originating from viral genomes that may differ from the available reference sequences.

Long read validation

To evaluate Exogene’s performance we compared its results to long reads sequenced from the same samples. DNA was extracted from frozen liver tumor tissue of 6 individuals from the TCGA Liver Hepatocellular Carcinoma project. Short reads were obtained from TCGA, including 1 WGS (barcode TCGA-DD-A1EL) and 6 WES (barcodes TCGA-DD-AACV, TCGA-DD-AAD0, TCGA-DD-AADL, TCGA-DD-AADU, TCGA-DD-AADV, and TCGA-DD-A1EL). The sequencing was performed at the Human Genome Sequencing Center (HGSC) at Baylor College of Medicine. Paired-end DNA sequence libraries were prepared following standard HGSC protocols (www.hgsc.bcm.edu/sites/default/files/documents/Illumina_Barcoded_Paired-End_Capture_Library_Preparation.pdf).

Long reads were sequenced at Mayo Clinic on a PacBio Sequel II, following the standard protocols for Continuous Long Reads (CLR) and high-fidelity Circular Consensus Sequences (HiFi/CCS) reads (www.pacb.com/wp-content/uploads/SMRTbell-Library-Preparation-for-High-Fidelity-Long-Read-Sequencing-Customer-Training.pdf). A 10kb fragment size was targeted for the HiFi reads, which were processed using the CCS application in SMRT Link v7.0 and required a minimum predicted accuracy of 99.9% per read.

Integration sites were identified in the long reads by aligning them to the combined human + viral references using pbmm2, a fork of the popular minimap2 aligner [23]. Reads with alignments to both human and viral sequences were extracted, and the positions of the soft-clipped coordinates were used to validate Exogene’s reported integration sites.
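One way to carry out such a comparison is to pair each short-read integration with the nearest long-read breakpoint on the same chromosome and summarize the offsets. The sketch below is an assumption about how this could be done, not a description of the authors' scripts: the function names and the max_dist matching window are hypothetical.

```python
from statistics import mean, pstdev


def nearest_offsets(short_calls, long_calls, max_dist=1000):
    """For each short-read call (chrom, pos), return the absolute distance
    in bp to the nearest long-read breakpoint on the same chromosome,
    or None when no breakpoint lies within max_dist."""
    by_chrom = {}
    for chrom, pos in long_calls:
        by_chrom.setdefault(chrom, []).append(pos)
    offsets = []
    for chrom, pos in short_calls:
        best = min((abs(pos - p) for p in by_chrom.get(chrom, [])), default=None)
        offsets.append(best if best is not None and best <= max_dist else None)
    return offsets


def summarize(offsets):
    """Mean and population standard deviation of the matched offsets."""
    matched = [d for d in offsets if d is not None]
    if not matched:
        return None
    return mean(matched), pstdev(matched)
```

Calls without a long-read partner are kept as None rather than dropped, so that sites lacking long-read support remain visible in the comparison.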

Results

We processed short reads from TCGA-DD-A1EL WGS through both Exogene and HGT-ID and enumerated all HBV integration sites that were also found in long reads (Table 1). On average, Exogene’s integration coordinates differed from long reads by 1.6 bp (std. 3.6 bp). HGT-ID differed by 175 bp (std. 102 bp). In addition to integration coordinates and quality metrics, Exogene produces figures showing the intersection of evidence at each integration site (example in Fig 2).

Fig 2. Comparison of evidence for HBV integration at chr7:72,027,703 from Exogene and HGT-ID.

Shaded regions indicate breakpoint ranges as inferred from read fragment lengths and orientations; darker shades indicate greater support.

https://doi.org/10.1371/journal.pone.0250915.g002

Table 1. Overview of HBV integration sites in TCGA-DD-A1EL.

https://doi.org/10.1371/journal.pone.0250915.t001

At 14/16 sites, both Exogene and HGT-ID reported an integration corroborated by long reads. At the remaining 2 sites, Exogene reported integrations that were missed by HGT-ID. We note that these 2 integrations were reported in repetitive regions of the genome near genes RGPD1 and FLJ360000. The short reads that support these integrations were all aligned with mapping quality 0, indicating that they map equally well to other locations and thus the reported integration coordinate is likely not unique. The long reads, however, were aligned with high mapping quality, suggesting that the integrations are not false positives and that the repetitive elements in which they are located are longer than the short reads but shorter than the long reads.

Computational performance

The A1EL WGS BAM was approximately 470 GB in size, which Exogene completed processing in 12 hours of runtime (4 threads, 48 CPU hours in total). HGT-ID completed in 26 hours (4 threads, 41 CPU hours in total). Note that Exogene does not require an aligned BAM as input, so if we were starting with FASTQ files HGT-ID would require additional computational time to first align the reads. A majority of Exogene’s runtime is spent in the initial BWA alignment to viral references. Subsequent steps complete quickly because the subset of read pairs with viral alignments that pass all read filters is generally small compared to the size of the original input BAM/FASTQ.

Additional WES samples

Next we processed 6 WES samples with Exogene and identified 18 HBV integration sites with long read support (Table 2). HGT-ID was not included in this comparison as it only supports WGS and RNA-Seq input data. At 15/18 sites, Exogene reported integration coordinates within 2 bp of coordinates identified in long reads. Across all 18 sites, Exogene’s reported coordinates differed from long reads by 11.6 bp on average (std. 35.8 bp). Noteworthy integration sites include the TERT promoter, which is well known to be associated with HCC. Integrations were also reported in ADARB2, RALYL, and URI1, which have been associated with liver tumor development [24–26].

Table 2. HBV integration sites in 6 WES samples.

https://doi.org/10.1371/journal.pone.0250915.t002

Exogene applied to targeted capture

To further validate Exogene, we applied it to short read targeted capture data sequenced for a previous study on HBV integrations in liver tumors [12]. For this study the authors designed sequence-capture probes for 8 strains of HBV, which they used to extract and sequence viral integration sites from liver tissue. The authors used the HIVID pipeline [27] to identify 4199 HBV integrations across 426 tumor/normal pairs. 707 of the 4199 integrations (16.8%) reported by HIVID were located in centromeres, telomeres, or other large repetitive regions of the genome where unique coordinates cannot be reliably inferred (i.e., regions where reads supporting a particular integration coordinate would align equally well to other positions in the reference genome). Thus we consider only the 3492 integrations not reported in such regions.

We ran Exogene on each of the 426 tumor/normal pairs using paired-end FASTQ data hosted on the Sequence Read Archive [28] under project accession PRJNA298941. Exogene reported 3454/3492 (98.9%) of the integrations identified by HIVID. The full table of reported integrations is provided in S1 File. The average processing time for each sample was 20 minutes, and each used up to 6 GB of memory.

Of these 3454 concordant calls, 3265 were supported by soft-clipped reads, and the remaining 189 had only discordant read pairs as evidence. Of the 3265 concordant calls with soft-clipped evidence, 2861 (87.6%) of the integration coordinates reported by Exogene were identical to those reported by HIVID. Integrations with non-identical coordinates between the two workflows differed by 48 bp on average. The coverage depth and mapping quality varied substantially in reads extracted by Exogene (Figs 3 and 4). That is, very few reads with high mapping quality were extracted at certain sites identified by HIVID as having an HBV integration. 1277/3454 (37.0%) concordant integrations had more than half of their supporting reads aligned with mapping quality 0, 780 of which were supported entirely by such reads. We note that these low-confidence integrations tend to occur in clusters, often near low-complexity regions. Genes most affected by this include HERC2, CCDC144, SNORD3, and SLBP.
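The counts and percentages reported for the targeted capture comparison can be cross-checked with a few lines of arithmetic. The snippet below uses only the figures stated in the text; the variable names and the pct helper are introduced here for illustration.

```python
# Counts taken directly from the text.
total_hivid = 4199          # integrations reported by HIVID
in_repetitive = 707         # HIVID calls in large repetitive regions
concordant = 3454           # HIVID calls also reported by Exogene
with_softclip = 3265        # concordant calls with soft-clipped evidence
identical_coord = 2861      # soft-clip calls with identical coordinates
majority_mapq0 = 1277       # concordant calls with > half of reads at MAPQ 0


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


considered = total_hivid - in_repetitive        # 3492
discordant_only = concordant - with_softclip    # 189

print(pct(in_repetitive, total_hivid))      # 16.8
print(pct(concordant, considered))          # 98.9
print(pct(identical_coord, with_softclip))  # 87.6
print(pct(majority_mapq0, concordant))      # 37.0
```

Each derived value agrees with the figure quoted in the text, including the 189 concordant calls supported only by discordant read pairs.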

Fig 3. Concordance rate of Exogene and HIVID calls as a function of minimum coverage.

https://doi.org/10.1371/journal.pone.0250915.g003

Fig 4. Concordance rate of Exogene and HIVID calls as a function of minimum allowable percentage of reads aligned with mapping quality 0.

https://doi.org/10.1371/journal.pone.0250915.g004

Exogene reported additional HBV integrations that were not found in the HIVID results. Based on the distributions in Figs 3 and 4, we identified 238 novel integrations supported by at least 100 uniquely aligned reads. While these novel integrations are not enriched in any particular genomic region, a number of them hit introns of genes associated with HCC, including WDHD1, THSD4, and KIF20A.

From this comparison, we conclude that Exogene is effective on targeted capture data, achieving high concordance with the HIVID pipeline. Exogene’s annotations potentially reduce false positives in regions of poor mappability or human-viral sequence homology by flagging integrations in these regions as low confidence. The novel integrations identified by Exogene are potentially valuable for future study.

Discussion

Previously, many authors seeking to validate integration sites either compared against previous analyses of the same dataset [29, 30] or against PCR experiments on a limited number of sites [31]. Previous reviews have used simulated data to compare accuracy across methods [18, 19], but this approach is limited in its applicability to real samples which have additional complexities such as recombinant viral strains, confounding structural variation (including virus-mediated rearrangements), and sequencing biases that simulation tools do not replicate.

In addition to these strategies, another approach for validating integration sites is via intersecting results from multiple analyses on the same sample across different sequencing protocols or sequencing platforms. Long reads from ‘third-generation’ sequencers, such as those from PacBio or Oxford Nanopore, are attractive for this validation due to their increased ability to anchor large structural variation and to span repetitive genomic regions.

Using integrations identified from PacBio long reads as a baseline set, we compared results from Exogene and HGT-ID on one WGS sample with many integrations. We observed that on average, the breakpoints reported by Exogene were significantly closer to those in long reads than the breakpoints reported by HGT-ID (Table 1). This is largely attributable to Exogene’s improved extraction of soft-clipped reads, which provide evidence for breakpoints at specific coordinates (as opposed to discordant read pairs, which support a range of possible breakpoint positions). Conversely, HGT-ID extracts most of its evidence from discordant read pairs and reports the average of their ranges as the final breakpoint. We attribute Exogene’s improved extraction of soft-clipped reads to three main factors: 1) The initial alignment to viral references only, instead of a combined human + viral FASTA. This ensures that reads of viral origin that would be preferentially aligned to human reference sequence due to homologies are retained for further analysis. 2) Instead of discarding reads with multiple alignments or alignments to blacklisted regions, we include them in reporting but flag them as low confidence. 3) Improved logic for choosing representative alignments in cases where reads are multi-mapped or have multiple supplementary alignments.

We observed similarly high concordance in the 6 WES samples, where at nearly every site the HBV integration coordinates reported by Exogene were very close to those found in long reads. There is only one site (near gene PRAG1) where the coordinates differ substantially. This is attributable to it being the only site where Exogene could not extract soft-clipped reads. When Exogene’s only source of evidence is discordant read pairs, the reported coordinate is estimated from alignment orientation and fragment length statistics (in a similar manner as HGT-ID).
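The loss of precision with discordant-only evidence follows from the geometry of a read pair. As a rough sketch, assuming standard forward/reverse paired-end orientation and a known upper bound on fragment length, the junction can only be placed within a range beyond the end of the human alignment. The function names and the simple slack model below are assumptions for illustration and do not reproduce Exogene's or HGT-ID's actual estimator.

```python
def discordant_breakpoint_range(aln_start, aln_end, is_reverse,
                                mate_length, max_fragment):
    """Return a (low, high) 1-based inclusive range that could contain the
    human/viral junction for a human-aligned read whose mate is viral.

    aln_start, aln_end : reference span of the human alignment
    is_reverse         : True if the human read aligned to the reverse strand
    mate_length        : length of the viral mate
    max_fragment       : assumed upper bound on fragment length
    """
    aln_len = aln_end - aln_start + 1
    # Bases of the fragment not covered by either mate.
    slack = max(0, max_fragment - aln_len - mate_length)
    if is_reverse:
        # Fragment extends to the left of a reverse-strand read.
        return (aln_start - slack, aln_start)
    # Fragment extends to the right of a forward-strand read.
    return (aln_end, aln_end + slack)


def range_midpoint(rng):
    """Single-coordinate estimate taken as the middle of the range."""
    return (rng[0] + rng[1]) // 2
```

For a 100 bp forward-strand read at positions 1000 to 1099 with a 100 bp mate and a 500 bp fragment bound, the range spans 1099 to 1399, so a midpoint estimate can be far from the true junction, whereas a soft-clipped read pins the junction to a specific base.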

Usability

Workflows for identifying viral integrations typically leverage multiple third-party bioinformatics tools, sometimes requiring specific system configurations or laborious installation procedures. Additionally, it has been our experience that existing workflows exhibit poor stability or that resource requirements make running them prohibitive. This has been commented on by other authors, who have excluded comparisons with certain tools due to an inability to successfully apply them to their samples [20, 29, 30].

To facilitate ease of use we make Exogene available as a Docker container which can be downloaded and run immediately, without requiring users to install third-party software (other than Docker itself) or to obtain specific versions of other resources.

Conclusion

Exogene is an efficient and sensitive workflow for detecting viral integrations in human WGS, WES, RNA-Seq, and targeted capture paired-end sequencing data. We demonstrated Exogene’s accuracy via comparisons with long read validation for 6 HCC tumor samples, and demonstrated its applicability to targeted capture data by applying it to 426 previously studied tumor/normal pairs. Exogene’s read filtering and breakpoint detection strategies improve upon our previous workflow, yielding high confidence integration site coordinates. Exogene is freely available at github.com/zstephens/exogene. Additionally, we have made Exogene available as a Docker container to facilitate ease of use.

Supporting information

S1 Fig. Exogene logic for selecting representative alignments for multi-mapped reads.

https://doi.org/10.1371/journal.pone.0250915.s001

(TIFF)

S1 File. All HBV integrations in targeted capture samples.

https://doi.org/10.1371/journal.pone.0250915.s002

(TSV)

References

1. White MK, Pagano JS, Khalili K. Viruses and human cancers: a long road of discovery of molecular paradigms. Clinical microbiology reviews. 2014;27(3):463–481. pmid:24982317
2. Pagano JS, Blaser M, Buendia MA, Damania B, Khalili K, Raab-Traub N, et al.; Elsevier. Infectious agents and cancer: criteria for a causal relation. Seminars in cancer biology. 2004;14(6):453–471. pmid:15489139
3. Henle G, Henle W, Clifford P, Diehl V, Kafuko GW, Kirya BG, et al. Antibodies to Epstein-Barr virus in Burkitt’s lymphoma and control groups. Journal of the National Cancer Institute. 1969;43(5):1147–1157. pmid:5353242
4. Nonoyama M, Kawai Y, Pagano J. Detection of Epstein-Barr virus DNA in human tumors. Bibliotheca Haematologica. 1975;40:577–583. pmid:169825
5. Mincheva A, Gissmann L, Zur Hausen H. Chromosomal integration sites of human papillomavirus DNA in three cervical cancer cell lines mapped by in situ hybridization. Medical microbiology and immunology. 1987;176(5):245–256. pmid:2821369
6. Azam F, Koulaouzidis A. Hepatitis B virus and Hepatocarcinogenesis: Concise Review. Annals of hepatology. 2008;7(2):125–129. pmid:18626429
7. Daibata M, Taguchi T, Taguchi H, Miyoshi I. Integration of human herpesvirus 6 in a Burkitt’s lymphoma cell line. British journal of haematology. 1998;102(5):1307–1313. pmid:9753061
8. Gulley ML, Raphael M, Lutz CT, Ross DW, Raab-Traub N. Epstein-barr virus integration in human lymphomas and lymphoid cell lines. Cancer. 1992;70(1):185–191. pmid:1318776
9. Syrjänen S. Human papillomavirus (HPV) in head and neck cancer. Journal of clinical virology. 2005;32:59–66.
10. Fan H. A new human retrovirus associated with prostate cancer. Proceedings of the National Academy of Sciences. 2007;104(5):1449–1450. pmid:17244700
11. Derse D, Crise B, Li Y, Princler G, Lum N, Stewart C, et al. Human T-cell leukemia virus type 1 integration target sites in the human genome: comparison with those of other retroviruses. Journal of virology. 2007;81(12):6731–6741. pmid:17409138
12. Zhao LH, Liu X, Yan HX, Li WY, Zeng X, Yang Y, et al. Genomic and oncogenic preference of HBV integration in hepatocellular carcinoma. Nature communications. 2016;7(1):1–10.
13. Jiang Z, Jhunjhunwala S, Liu J, Haverty PM, Kennemer MI, Guan Y, et al. The effects of hepatitis B virus integration into the genomes of hepatocellular carcinoma patients. Genome research. 2012;22(4):593–601. pmid:22267523
14. Tamori A, Yamanishi Y, Kawashima S, Kanehisa M, Enomoto M, Tanaka H, et al. Alteration of gene expression in human hepatocellular carcinoma with integrated hepatitis B virus DNA. Clinical cancer research. 2005;11(16):5821–5826. pmid:16115921
15. Saigo K, Yoshida K, Ikeda R, Sakamoto Y, Murakami Y, Urashima T, et al. Integration of hepatitis B virus DNA into the myeloid/lymphoid or mixed-lineage leukemia (MLL4) gene and rearrangements of MLL4 in human hepatocellular carcinoma. Human mutation. 2008;29(5):703–708. pmid:18320596
16. Popescu N, Zimonjic D. Chromosome-mediated alterations of the MYC gene in human cancer. Journal of cellular and molecular medicine. 2002;6(2):151–159. pmid:12169201
17. Nault JC, Zucman-Rossi J. TERT promoter mutations in primary liver tumors. Clinics and research in hepatology and gastroenterology. 2016;40(1):9–14. pmid:26336998
18. Chen X, Kost J, Li D. Comprehensive comparative analysis of methods and software for identifying viral integrations. Briefings in bioinformatics. 2019;20(6):2088–2097. pmid:30102374
19. Sulovari A, Li D. VIpower: Simulation-based tool for estimating power of viral integration detection via high-throughput sequencing. Genomics. 2019;112(1):207–211. pmid:30710609
20. Baheti S, Tang X, O’Brien DR, Chia N, Roberts LR, Nelson H, et al. HGT-ID: an efficient and sensitive workflow to detect human-viral insertion sites using next-generation sequencing data. BMC bioinformatics. 2018;19(1):271. pmid:30016933
21. Morgulis A, Gertz EM, Schäffer AA, Agarwala R. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. Journal of Computational Biology. 2006;13(5):1028–1040. pmid:16796549
22. Hayer J, Jadeau F, Deleage G, Kay A, Zoulim F, Combet C. HBVdb: a knowledge database for Hepatitis B Virus. Nucleic acids research. 2013;41(D1):D566–D570. pmid:23125365
23. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–3100. pmid:29750242
24. Toh TB, Lim JJ, Chow EKH. Epigenetics of hepatocellular carcinoma. Clinical and translational medicine. 2019;8(1):13. pmid:31056726
25. Wang X. Identification and characterization of stemness-related genes (RALYL and S100A10) in the development and progression of hepatocellular carcinoma. HKU Theses Online (HKUTO). 2019.
26. Tsuchiya H, Amisaki M, Takenaga A, Honjo S, Fujiwara Y, Shiota G. HBx and c-MYC cooperate to induce URI1 expression in HBV-related hepatocellular carcinoma. International journal of molecular sciences. 2019;20(22):5714.
27. Li W, Zeng X, Lee NP, Liu X, Chen S, Guo B, et al. HIVID: an efficient method to detect HBV integration using low coverage sequencing. Genomics. 2013;102(4):338–344. pmid:23867110
28. Leinonen R, Sugawara H, Shumway M, Collaboration INSD. The sequence read archive. Nucleic acids research. 2010;39(suppl_1):D19–D21. pmid:21062823
29. Xia Y, Liu Y, Deng M, Xi R. Detecting virus integration sites based on multiple related sequencing data by VirTect. BMC medical genomics. 2019;12(1):19. pmid:30704462
30. Tennakoon C, Sung WK. BATVI: fast, sensitive and accurate detection of virus integrations. BMC bioinformatics. 2017;18(3):101–111. pmid:28361674
31. Ho DW, Sze KM, Ng IO. Virus-Clip: a fast and memory-efficient viral integration site detection tool at single-base resolution with annotation capability. Oncotarget. 2015;6(25):20959. pmid:26087185
