VarSCAT: A computational tool for sequence context annotations of genomic variants

doi:10.1371/journal.pcbi.1010727

Fig 1.

The illustration of sequence contexts of a genomic variant.

The example shows the reference and sample sequences where a dinucleotide GC is deleted at “chr_example”, position 123. The deletion is located in a short tandem repeat region consisting of the dinucleotide repeat motif GC with a copy number of seven. The GC content (GC%) of the short tandem repeat region is 100%. Because of the short tandem repeat, the deletion has multiple possible representations that all lead to an equivalent change. This issue is also known as breakpoint ambiguity, which means that the exact breakpoint of a variant cannot be confidently identified. The Human Genome Variation Society (HGVS) recommends describing different types of variants with specific rules and formats. The left and right flanking bases of the deletion are marked based on these equivalent deletions on the sequence. There is also a single nucleotide substitution located in the 3’ direction of this deletion. The distance between the two variants is 4 bp, which is also determined from these equivalent deletions. The mutated sequence can be determined by considering all the variants within the region, which in the above example are one single nucleotide variant (SNV) and one deletion (DEL). The variant annotations related to the sequence contexts are shown in bold text.
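The breakpoint ambiguity described in this caption can be illustrated with a minimal Python sketch. It is not VarSCAT's implementation: the function name `equivalent_deletion_starts`, the 0-based coordinates, and the toy sequence (a GC repeat with seven copies between invented flanks) are assumptions made for illustration only.

```python
from __future__ import annotations


def equivalent_deletion_starts(ref: str, start: int, length: int) -> tuple[int, int]:
    """Return the leftmost and rightmost 0-based start positions of all
    deletions of `length` bases that give the same mutated sequence."""
    left = start
    # Shifting left by one is equivalent when the base entering the deleted
    # window equals the base leaving it on the right.
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    right = start
    # Shifting right by one is equivalent when the base leaving the window
    # on the left equals the base entering it on the right.
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return left, right


def apply_deletion(ref: str, start: int, length: int) -> str:
    """Delete `length` bases from `ref` beginning at 0-based `start`."""
    return ref[:start] + ref[start + length:]


# Toy reference: hypothetical flanks around a GC repeat with seven copies.
toy_ref = "AT" + "GC" * 7 + "TA"
leftmost, rightmost = equivalent_deletion_starts(toy_ref, start=6, length=2)
mutated = {apply_deletion(toy_ref, s, 2) for s in range(leftmost, rightmost + 1)}
```

In this toy case every start position between `leftmost` and `rightmost` produces the same mutated sequence, so `mutated` holds a single string. The flanking bases of the ambiguous region sit immediately outside that interval.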


Table 1.

Examples of tools which have functions for annotating the sequence context of genomic variants, including breakpoint ambiguity, flanking bases of genomic variants, wildtype/mutated sequences, HGVS nomenclature, nearby variants and short tandem repeats (STR).

Note that the listed tools may not have been designed primarily for annotating the sequence context of genomic variants. Many of them also offer diverse functions for other purposes.


Fig 2.

VarSCAT workflow.

The input VCF file is first passed to the variant normalization module of VarSCAT, which extracts the essential information, including the positions, reference/alternative alleles, identifiers, and genotypes of variants. This module splits any multiallelic variants into biallelic variants and then normalizes all input variants so that they are parsimonious and left-aligned. The output of the variant normalization module is passed to an adjacent sequence annotation module and a tandem repeat (TR) annotation module. The adjacent sequence annotation module can be used to annotate the breakpoint ambiguities, flanking bases of variants, wildtype/mutated DNA sequences, HGVS nomenclature, distances between adjacent variants, and custom annotations. The tandem repeat annotation module can annotate tandem repeat regions of input variants.
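The normalization step can be sketched as follows. This is a minimal illustration of the widely used trim-and-extend procedure for making a variant parsimonious and left-aligned, not VarSCAT's own code. The function names, the 1-based `pos` convention, and the toy sequence are assumptions, and the sketch ignores genotype fields and variants that would need to be extended past the start of the sequence.

```python
def split_multiallelic(pos, ref, alts):
    """Split a comma-separated ALT field into biallelic (pos, ref, alt) records.
    Genotype re-coding is deliberately left out of this sketch."""
    return [(pos, ref, alt) for alt in alts.split(",")]


def normalize(ref_seq, pos, ref, alt):
    """Return a parsimonious, left-aligned (pos, ref, alt); `pos` is 1-based."""
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    while True:
        changed = False
        # Trim a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
        if not changed:
            break
    # Trim shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy sequence with a GC repeat; a GC deletion written at its rightmost position.
toy_seq = "ATGCGCGCTA"
left_aligned = normalize(toy_seq, 6, "CGC", "C")
trimmed_snv = normalize(toy_seq, 3, "GC", "GT")
```

In the toy example the deletion shifts to the start of the repeat, anchored on the preceding T, and the padded substitution is reduced to a single-base record.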


Fig 3.

The benchmarking results of the VarSCAT tandem repeat annotation module.

Benchmarking was performed by annotating small variants in perfect STR regions on chromosome 1 of (a) GIAB HG002 and (b) GIAB HG006. The GATK ‘TandemRepeat’ function is an annotation method that directly takes a VCF file as input; TRF and RepeatMasker, taken from the UCSC Genome Browser’s ‘Simple Repeats’ and ‘RepeatMasker’ tracks, represent a ready-made STR annotation approach; Krait is an annotation method for detecting perfect STRs with a reference genome.
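To make the notion of a perfect STR concrete, the sketch below scans a sequence for uninterrupted motif repeats. It does not reproduce the algorithm of VarSCAT or of any benchmarked tool: the function name `find_perfect_strs` and the thresholds `max_motif=6` and `min_copies=4` are assumptions chosen for illustration.

```python
def find_perfect_strs(seq, max_motif=6, min_copies=4):
    """Return (start, end, motif, copies) for perfect repeats in `seq`.
    Coordinates are 0-based and end-exclusive."""
    hits = []
    i, n = 0, len(seq)
    while i < n:
        best = None  # (motif, copies) covering the longest span from i
        for m in range(1, max_motif + 1):
            motif = seq[i:i + m]
            if len(motif) < m:
                break  # not enough sequence left for this motif size
            copies = 1
            while seq[i + copies * m:i + (copies + 1) * m] == motif:
                copies += 1
            span = copies * m
            # Strict '>' keeps the shortest motif when spans tie.
            if copies >= min_copies and (best is None or span > best[1] * len(best[0])):
                best = (motif, copies)
        if best:
            motif, copies = best
            end = i + copies * len(motif)
            hits.append((i, end, motif, copies))
            i = end
        else:
            i += 1
    return hits


def in_str(pos, hits):
    """True if 0-based `pos` falls inside any reported repeat."""
    return any(start <= pos < end for start, end, _, _ in hits)


toy_hits = find_perfect_strs("AT" + "GC" * 7 + "TA")
```

A variant can then be flagged as lying in an STR region by checking its position against the reported intervals with `in_str`.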


Fig 4.

The benchmarking results of VarSCAT and UPS-indel for annotating indels with breakpoint ambiguity.

Benchmarking was performed with indels of (a) GIAB HG002, (b) GIAB HG003, (c) GIAB HG004, (d) GIAB HG005, (e) GIAB HG006, (f) GIAB HG007, (g) Platinum Genomes NA12877, and (h) Platinum Genomes NA12878. Venn Diagrams were used to show the concordance of indel annotations between the tools. The numbers are the counts of indels annotated by each tool.


Fig 5.

Proportions of ambiguous breakpoint indels and indels located in duplicates.

The analysis was performed with indels from the ClinVar database, eight high-confidence human individual germline small variant sets, and one semi-random indel set. The proportions of deletions and insertions in different categories are shown separately: (a) the proportion of ambiguous breakpoint indels and (b) the proportion of indels in duplicate. The numbers on the right of each bar are the numbers of ambiguous breakpoint indels and indels in duplicate for each indel set, respectively.


Fig 6.

Proportions of small variants and indels in STR regions in different human superpopulations in the 1000 Genomes Project.

(a) The proportions of small variants in STR regions and (b) the proportions of small indels in STR regions. The numbers at the top of each boxplot are the average numbers of variants or indels in the STR regions of each superpopulation. African: n = 671; American: n = 348; East Asian: n = 515; European: n = 522; South Asian: n = 492.


Fig 7.

Proportions of small variants in STR and not in STR regions shared by human superpopulations in the 1000 Genomes Project.

(a) The proportion of small variants in STR regions and (b) not in STR regions shared by superpopulations.


Fig 8.

Proportions of different types of small variants in the STR and not in STR regions.

(a) The proportions of deletions, insertions, and single nucleotide variants (SNVs) in STR regions and (b) not in STR regions. The analysis was performed on high-confidence small variant sets from two individuals from Platinum Genomes and six individuals from GIAB. The numbers on the right of each bar show the numbers of small variants in STR regions or not in STR regions for the individuals. The percentages on the right of each bar denote the proportions of small variants in STR regions or not in STR regions among all variants for the individuals.


Fig 9.

Indel calling results of GATK HaplotypeCaller and VarScan with the whole exome sequencing data of GIAB HG002.

(a) The indel calling results of GATK HaplotypeCaller and (b) VarScan. The true positive (TP), false positive (FP), and false negative (FN) indel calls were evaluated using hap.py. The numbers on top of each bar show the total number of indels in each category. The indels in STR regions and not in STR regions are marked with different colors.
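TP, FP, and FN counts such as these are commonly summarized as precision, recall, and F1. The sketch below shows that arithmetic, stratified by STR membership. The function name and the counts are placeholders invented for illustration and are not values from the figure.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard summary metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Placeholder counts (NOT from the paper), split by STR membership.
example_counts = {
    "in_STR": {"tp": 90, "fp": 10, "fn": 30},
    "not_in_STR": {"tp": 95, "fp": 5, "fn": 5},
}
summary = {name: precision_recall_f1(**c) for name, c in example_counts.items()}
```

Computing the metrics per stratum makes it easy to see whether a caller's errors concentrate among the indels in STR regions.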
