VirusFinder: Software for Efficient and Accurate Detection of Viruses and Their Integration Sites in Host Genomes through Next Generation Sequencing Data

Next generation sequencing (NGS) technologies allow us to explore virus interactions with host genomes that lead to carcinogenesis or other diseases; however, this effort is largely hindered by the dearth of efficient computational tools. Here, we present a new tool, VirusFinder, for the identification of viruses and their integration sites in host genomes using NGS data, including whole transcriptome sequencing (RNA-Seq), whole genome sequencing (WGS), and targeted sequencing data. VirusFinder’s unique features include the characterization of insertion loci of virus of arbitrary type in the host genome and high accuracy and computational efficiency as a result of its well-designed pipeline. The source code as well as additional data of VirusFinder is publicly available at http://bioinfo.mc.vanderbilt.edu/VirusFinder/.


INTRODUCTION
VirusFinder is a software tool for efficient and accurate identification of viruses and their integration sites (if present) in host genomes (e.g. the human genome) from next generation sequencing (NGS) data, including whole genome sequencing (WGS), whole transcriptome sequencing (RNA-Seq), or targeted sequencing data. VirusFinder follows a three-step procedure: (1) preprocessing, (2) virus detection, and (3) virus integration site detection.
Step (2) can be skipped if the sequence of the virus being examined is provided as an input parameter of VirusFinder. If the virus type is unknown, however, VirusFinder will run all the three steps sequentially on the input data. Figure 1 shows the flowchart of virus identification using VirusFinder.

Source code
We used Perl to implement VirusFinder. The following table provides a summary of all our Perl scripts, among which VirusFinder.pl is the interface of VirusFinder. VirusFinder.pl prepares input data for each of its three steps and processes their output after they terminate. Scripts detect_virus.pl and detect_integration.pl correspond to the steps (2) and (3) Table 2 lists third-party tools required by VirusFinder. Users should specify the full paths to these tools in a configuration file (see subsection 3.2 for a detailed description of the configuration file). VirusFinder also needs several Perl modules, e.g. threads.pm, to make it work properly. The script sys_check.pl in Table 1 can help you identify modules that need to be installed in your system.

Input parameters of VirusFinder.pl
VirusFinder.pl prints the following help information (Figure 2) if it is run with no argument or with the argument "-h". The help information indicates that a configuration file is required by VirusFinder as a mandatory input. We will use entire subsection 3.2 below to discuss the configuration file due to its importance.
Besides the configuration file, user can specify/provide the sequence of the virus under study (in FASTA format) to VirusFinder, if the virus that infected the sample is already known. As VirusFinder has the capability to determine the correct type of the virus in the sample, the option "-v" is not required. However, because virus detection is time-consuming, the offer of virus sequence to VirusFinder can bring significant timesaving.
All the intermediate and final results of VirusFinder are stored in the directory specified by the option 'o'. If the output directory is not provided, current working directory will be used.

Configuration file
The configuration file allows users to specify the full paths to their sequencing data and the third-party tools required by VirusFinder. An example configuration file is provided in Figure 3, which shows different input parameters are defined as "variable=value" pairs.  The first three variables in the above configuration file specify the paths to the NGS data to be analyzed. The input of VirusFinder can either be raw sequencing reads (in FASTQ format) or an alignment file (in BAM format). All the rest of the variables except "mailto" are mandatory.

OUTPUT
Under the working directory of VirusFinder, three files, 'virus.txt', 'contig.txt' and 'integration-sites.txt', which contain the final results, will be created upon the termination of the method. These files as well as the intermediate files generated by VirusFinder are introduced in subsections below.

Intermediate results
VirusFinder keeps all intermediate results so that it does not have to restart the whole process from scratch if the server, to which it is assigned to run, fails. The drawback of this design is that users may have to manually delete the file that is improperly created by VirusFinder as a result of system failure.
Typically, each run of VirusFinder creates three folders, 'step1', 'step2', and 'step3', under the directory specified by users. Each of these three folders stores the intermediate files of the corresponding step of VirusFinder. For WGS data with very high coverage, e.g., 120×, the size of the intermediate files in the folder 'step1' can be close to 0.5TB. So we remind users to delete them after the analysis process terminates.

Output of virus detection
At the end of step (2), i.e., the virus detection step, two resultant files are copied out of the subdirectory 'step2' to be placed in the working directory of VirusFinder. These two tab-delimited files, named 'virus.txt' and 'contig.txt' respectively, contain candidate viruses identified by VirusFinder and the corresponding contigs mapped to the virus sequences.
The candidate viruses in the file 'virus.txt' are sorted based on the alignment quality of the contigs that are mapped to them. The top-ranking virus candidate is in the first row of this file. Table 3 below shows the top three candidate virus sequences identified from our simulation data, which can be downloaded from our website (http://bioinfo.mc.vanderbilt.edu/VirusFinder/). All these three candidates indicate the existence of HPV-16 virus in the sample and HPV-16 is exactly what we inserted into this simulation data initially. One contig, comp4_c0_seq1, map to all these three candidate sequences. The length of the contig and the number of reads fallen on it are also provided in the file 'virus.txt'.
It is worth mentioning that at the end of step (2), VirusFinder.pl extracts the sequence of the top-ranking virus, i.e. 'gi_310698439_ref_NC_001526.2__Human_papillomavirus_type_16' in the case of our simulated data, from the virus DB and store it as a file named 'results-virus-top1.fa' under the folder 'step2'. This file is used as a separate pseudo-chromosome 'chrVirus' to be concatenated with the reference human chromosome for the detection of virus integration sites in step (3). Another file 'contig.txt' includes all high quality non-human contigs. Table 4 shows the first contig in the 'contig.txt' file created for our simulation data.

Output of integration site detection
The third file, integration-sites.txt, reports the positions of all detected virus insertion sites. In our simulation data, we plugged one mutated copy of the reference HPV-16 virus at the position chr1:24020700 of the human genome. The following table provides the content of the integration-sites.txt file created by VirusFinder for this simulation data. The virus integration site in Table 5 involves two breakpoints, one at position 24,020,709 of the human chromosome 1 and another at position 1 of the HPV-16 virus sequence. The last column indicates how many NGS reads support this detection. The word pair means the number of paired-end reads with one end mapped to the human genome and another end aligned to the HPV-16 virus sequence. The word softclipping means the number of reads that actually harbor the integration breakpoint within themselves.
It may be worth mentioning that VirusFinder's predictions of virus integration sites are very close to the real positions. On most WGS samples, the virus integration loci predicted by VirusFinder are only several base pairs difference from the real ones. However, VirusFinder's ability to predict virus insertion is affected by sequencing coverage, read quality, read mapping, difference between the human genome and virus sequence under study, etc. This is the case not only to VirusFinder, but also true to other virus