TEvarSim: A genome simulator for transposable element (TE) variants

doi:10.1371/journal.pcbi.1013933

Fig 1.

Software workflow of TEvarSim.

TEvarSim consists of four Python modules. The first module (TErandom.py, TEreal.py, and TEpan.py) creates TE insertion sequences and BED files containing the coordinates of TE insertions and deletions. Known TE deletions are used by both TErandom.py and TEreal.py. Users must select exactly one of the three parallel methods (TErandom, TEreal, or TEpan) per simulation run. The second module (Simulate.py) generates TE variant-containing genomes and corresponding VCF files based on the simulated TE variants. The third module (Readsim.py) produces short- and long-read sequencing data from the simulated genomes. The fourth module (Compare.py) compares the simulated VCF files with the output of TE variant genotyping tools to evaluate detection accuracy.
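The step from BED coordinates to a variant-containing genome can be illustrated with a minimal sketch. This is not the code of Simulate.py; the record layout, the function name `apply_te_variants`, and the use of 0-based half-open coordinates are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical record layout (not TEvarSim's actual format):
# (start, end, kind, sequence), 0-based and half-open as in BED.
# kind "INS": insert `sequence` at `start` (end == start).
# kind "DEL": remove reference[start:end] (sequence is ignored).
Variant = Tuple[int, int, str, str]


def apply_te_variants(reference: str, variants: List[Variant]) -> str:
    """Return a copy of `reference` with all insertions and deletions applied."""
    genome = reference
    # Work from the rightmost variant to the leftmost so that an edit never
    # shifts the coordinates of variants that are still to be applied.
    for start, end, kind, seq in sorted(variants, key=lambda v: v[0], reverse=True):
        if kind == "INS":
            genome = genome[:start] + seq + genome[start:]
        elif kind == "DEL":
            genome = genome[:start] + genome[end:]
        else:
            raise ValueError(f"unknown variant type: {kind}")
    return genome


reference = "AAAACCCCGGGGTTTT"
variants = [(4, 8, "DEL", ""), (12, 12, "INS", "NNN")]
simulated = apply_te_variants(reference, variants)
```

Applying edits right to left keeps every coordinate expressed in reference space, which is also how the variants would be written to a VCF file. The sketch assumes the variants do not overlap.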


Table 1.

Comparison of five TE simulation tools.


Fig 2.

Comparison with an existing tool.

(A) Comparison of TE-containing genomes simulated by SimulaTE and TEvarSim, showing identical outputs. (B) Runtime comparison between SimulaTE and TEvarSim using a single CPU and thread.


Fig 3.

Validation of simulated genomes.

(A) Comparison of TE variants recorded in the VCF file (left) versus those extracted from CIGAR strings in the BAM file after minimap2 alignment (right). A discrepancy in one TE deletion is highlighted with an arrow. (B) When the sequence of the TE deletion in question was mapped to both GRCh38-Chr21 and the simulated genome, it aligned fully to GRCh38-Chr21 but not to the expected location in the simulated genome, indicating that the deletion was successfully introduced. This sequence had four multi-mapped positions on the reference and three on the simulated genome, likely contributing to the minimap2 mapping artifact.
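Extracting structural variants from CIGAR strings, as in panel A, can be sketched as follows. This is a simplified illustration rather than the authors' validation script: the function name `large_indels_from_cigar`, the 50 bp minimum length, and the 0-based alignment start are assumptions.

```python
import re
from typing import List, Tuple

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def large_indels_from_cigar(
    ref_start: int, cigar: str, min_len: int = 50
) -> List[Tuple[str, int, int]]:
    """Return (type, reference position, length) for each I or D of at least min_len bp.

    `ref_start` is the 0-based reference position where the alignment begins.
    """
    events = []
    ref_pos = ref_start
    for length_str, op in CIGAR_OP.findall(cigar):
        length = int(length_str)
        if op in "ID" and length >= min_len:
            events.append(("INS" if op == "I" else "DEL", ref_pos, length))
        # Only M, D, N, = and X advance the position on the reference.
        if op in "MDN=X":
            ref_pos += length
    return events


events = large_indels_from_cigar(100, "1000M300I500M250D200M10I5M")
```

The resulting positions and lengths can then be matched against the records in the simulated VCF file. As the caption notes, repeats that map to several locations can make the aligner report a variant differently from how it was simulated, so a mismatch here is not by itself evidence of a simulation error.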


Fig 4.

Validation of sequence variation.

(A) Multiple sequence alignment showing that random sequence variations, including SNPs, INDELs, polyA tails, and truncations, were correctly introduced into TE consensus sequences. (B) Multiple sequence alignment demonstrating the expected sequence diversity of the same TE insertion across different individual genomes.
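The kinds of variation listed in panel A can be illustrated with a short sketch that derives one TE copy from a consensus sequence. It is not TEvarSim's implementation: the function name `mutate_te`, all rates and limits, the restriction to 1 bp INDELs, and the choice of truncating at the 5' end are assumptions chosen for illustration.

```python
import random


def mutate_te(
    consensus: str,
    rng: random.Random,
    snp_rate: float = 0.02,
    indel_rate: float = 0.005,
    max_polya: int = 20,
    max_truncation: float = 0.3,
) -> str:
    """Return a diverged copy of `consensus` (all parameters are illustrative)."""
    bases = "ACGT"
    # Truncation: drop a random-length prefix (assumed to be the 5' end).
    cut = rng.randint(0, int(len(consensus) * max_truncation))
    out = []
    for base in consensus[cut:]:
        r = rng.random()
        if r < snp_rate:
            # SNP: substitute a different base.
            out.append(rng.choice([b for b in bases if b != base]))
        elif r < snp_rate + indel_rate:
            if rng.random() < 0.5:
                continue  # 1 bp deletion: skip this base
            out.append(base + rng.choice(bases))  # 1 bp insertion after this base
        else:
            out.append(base)
    # PolyA tail of random length appended at the 3' end.
    return "".join(out) + "A" * rng.randint(0, max_polya)


consensus = "ACGTTGCA" * 10
# Different seeds stand in for different individual genomes.
copies = [mutate_te(consensus, random.Random(seed)) for seed in range(3)]
```

Passing an explicit `random.Random` instance makes each copy reproducible. Calling the function repeatedly on the same consensus with different seeds yields the sort of per-individual diversity that panel B displays.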
