Fig 1.
Software workflow of TEvarSim.
TEvarSim consists of four Python modules. The first module (TErandom.py, TEreal.py, and TEpan.py) creates TE insertion sequences and BED files containing the coordinates of TE insertions and deletions; known TE deletions are used by both TErandom.py and TEreal.py, and users must select exactly one of the three parallel methods (TErandom, TEreal, or TEpan) per simulation run. The second module (Simulate.py) generates TE variant-containing genomes and the corresponding VCF files from the simulated TE variants. The third module (Readsim.py) produces short- and long-read sequencing data from the simulated genomes. The fourth module (Compare.py) compares the simulated VCF files with the output files of TE variant genotyping tools to evaluate detection accuracy.
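The comparison step of the workflow can be illustrated with a minimal sketch. It is not the Compare.py implementation: the function name `evaluate_calls`, the tuple layout, the 50-bp matching window, and the greedy one-to-one matching are all assumptions made for illustration.

```python
def evaluate_calls(truth, calls, window=50):
    """Score detected TE variants against simulated ones.

    truth, calls: lists of (chrom, pos, svtype) tuples (hypothetical layout).
    A call counts as a true positive if an unmatched truth variant of the
    same type lies on the same chromosome within `window` bp (assumed rule).
    """
    unmatched = list(truth)
    tp = 0
    for chrom, pos, svtype in calls:
        for t in unmatched:
            if t[0] == chrom and t[2] == svtype and abs(t[1] - pos) <= window:
                unmatched.remove(t)  # each truth variant is matched at most once
                tp += 1
                break
    fp = len(calls) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if calls else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for this example only.
truth = [("chr21", 1000, "INS"), ("chr21", 5000, "DEL"), ("chr21", 9000, "INS")]
calls = [("chr21", 1010, "INS"), ("chr21", 5000, "DEL"), ("chr21", 20000, "INS")]
metrics = evaluate_calls(truth, calls)
```

In practice the truth and call lists would be parsed from the simulated VCF and the tool's output; the matching tolerance is a design choice that affects the reported accuracy.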
Table 1.
Comparison of five TE simulation tools.
Fig 2.
Comparison with an existing tool.
(A) Comparison of TE-containing genomes simulated by SimulaTE and TEvarSim, showing identical outputs. (B) Runtime comparison between SimulaTE and TEvarSim using a single CPU and thread.
Fig 3.
Validation of simulated genomes.
(A) Comparison of TE variants recorded in the VCF file (left) versus those extracted from CIGAR strings in the BAM file after minimap2 alignment (right). A discrepancy in one TE deletion is highlighted with an arrow. (B) When the sequence of the TE deletion in question was mapped to both GRCh38-Chr21 and the simulated genome, it aligned fully to GRCh38-Chr21 but not to the expected location in the simulated genome, indicating that the deletion was successfully introduced. This sequence had four multi-mapped positions on the reference and three on the simulated genome, likely contributing to the minimap2 mapping artifact.
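The extraction of variants from CIGAR strings described in panel A can be sketched as follows. This is an illustrative helper, not the code used for the figure: the function name `indels_from_cigar`, the 50-bp minimum length, and the example CIGAR string are assumptions.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(ref_start, cigar, min_len=50):
    """Return (type, ref_pos, length) for insertions/deletions >= min_len.

    ref_start is the 0-based leftmost reference position of the alignment.
    min_len is an assumed threshold for filtering out small indels.
    """
    events = []
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I" and length >= min_len:
            events.append(("INS", ref_pos, length))
        elif op == "D" and length >= min_len:
            events.append(("DEL", ref_pos, length))
        # Only M, D, N, = and X consume reference bases.
        if op in "MDN=X":
            ref_pos += length
    return events


# Invented alignment: one 300-bp insertion and one 6000-bp deletion.
events = indels_from_cigar(1000, "500M300I200M6000D100M2I50M")
```

The resulting events can then be compared with the records in the simulated VCF; a variant that is present in the VCF but absent from the CIGAR-derived list, as with the deletion marked by the arrow, may reflect a mapping artifact rather than a simulation error.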
Fig 4.
Validation of sequence variation.
(A) Multiple sequence alignment showing that random sequence variations, including SNPs, INDELs, polyA tails, and truncations, were correctly introduced into TE consensus sequences. (B) Multiple sequence alignment demonstrating the expected sequence diversity of the same TE insertion across different individual genomes.
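As a rough illustration of how such variation might be introduced into a consensus sequence, consider the sketch below. It is not TEvarSim's implementation: the function name `mutate_te`, all rates and limits, the choice of 5' truncation, and the restriction to 1-bp INDELs are assumptions.

```python
import random


def mutate_te(consensus, rng, snp_rate=0.02, indel_rate=0.005,
              max_truncation=0.3, polya_max=20):
    """Return a mutated copy of a TE consensus sequence (hypothetical model)."""
    # Truncate up to max_truncation of the sequence from the 5' end.
    cut = rng.randint(0, int(len(consensus) * max_truncation))
    out = []
    for base in consensus[cut:]:
        r = rng.random()
        if r < snp_rate:
            # SNP: substitute a different base.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif r < snp_rate + indel_rate:
            if rng.random() < 0.5:
                continue  # 1-bp deletion: skip this base
            out.append(base)
            out.append(rng.choice("ACGT"))  # 1-bp insertion after this base
        else:
            out.append(base)
    # Append a polyA tail of random length.
    return "".join(out) + "A" * rng.randint(0, polya_max)


# Invented consensus; a fixed seed makes the result reproducible.
consensus = "GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCA" * 5
copy_a = mutate_te(consensus, random.Random(1))
copy_b = mutate_te(consensus, random.Random(2))
```

Calling the function with different seeds yields different copies of the same insertion, analogous to the per-individual diversity shown in panel B.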
Table 2.
Performance metrics of two randomly selected TE detection tools.
Fig 5.
Radar charts of performance metrics for MELT and xTea, evaluated on TEvarSim-simulated TE-containing short-read and long-read datasets, respectively.
Table 3.
Comparison of MELT performance across genomic variation parameter sets in TEvarSim simulations.
Table 4.
Example computational resource consumption.