Table 1.
Comparison of the assembly accuracy, measured with precision (%) and the number of matching transcripts, of StringTie2 and Scallop2 using different annotations as the reference.
In each combination (of dataset, aligner, genome build, and annotation), the two metrics are averaged over all samples in the dataset. The symbol 〈〉 indicates that one method scores higher on one metric but lower on the other; > indicates that StringTie2 outperforms Scallop2 on both metrics, while < indicates that Scallop2 outperforms StringTie2 on both metrics. The three columns under raw counts give the number of samples in each category when comparing raw precision and recall. Samples in the 〈〉 category are further compared using the adjusted precision and reassigned to either the > or the < category accordingly; the resulting counts are shown in the two columns under adjusted.
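The raw categorization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `categorize`, the tuple layout, and the sample values are all hypothetical, and treating ties as the mixed category is an assumption. The adjusted-precision step is not shown because its definition is not given in this caption.

```python
from collections import Counter

MIXED = "〈〉"  # one method is higher on one metric but lower on the other

def categorize(st_precision, st_matching, sc_precision, sc_matching):
    """Compare StringTie2 (st_*) with Scallop2 (sc_*) on one sample.

    Assumption: ties on either metric fall into the mixed category.
    """
    if st_precision > sc_precision and st_matching > sc_matching:
        return ">"   # StringTie2 better on both metrics
    if st_precision < sc_precision and st_matching < sc_matching:
        return "<"   # Scallop2 better on both metrics
    return MIXED

# Made-up values: (StringTie2 precision, StringTie2 matching,
#                  Scallop2 precision,   Scallop2 matching)
samples = [
    (42.0, 9000, 38.5, 8700),
    (35.1, 8200, 39.4, 8600),
    (44.3, 7900, 40.2, 8400),
]
raw_counts = Counter(categorize(*s) for s in samples)
```

For a fixed reference annotation, recall is the number of matching transcripts divided by the number of annotated transcripts, so comparing recall or matching counts gives the same ordering.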
Fig 1.
Difference in assembly accuracy when evaluated with the RefSeq and Ensembl annotations, both on the GRCh38 genome build.
Each arrow represents a sample, pointing from the accuracy evaluated with RefSeq annotation to that with Ensembl annotation. The subfigures correspond to the four combinations of dataset (EN10 or HS7) and aligner (HISAT2 or STAR).
Fig 2.
Difference in assembly accuracy when evaluated with the RefSeq and Ensembl annotations, both on the T2T-CHM13 genome build.
Each arrow represents a sample, pointing from the accuracy evaluated with RefSeq annotation to that with Ensembl annotation. The subfigures correspond to the four combinations of dataset (EN10 or HS7) and aligner (HISAT2 or STAR).
Fig 3.
Jaccard similarity of four pairs of annotations (one pair per subfigure) at the level of boundary, junction, and intron-chain.
Fig 4.
The distribution of Jaccard similarities of all paired genes in each pair of compared annotations at the level of boundary, junction, and intron-chain.
The three dashed lines in each subfigure mark the Jaccard similarity at the 25th, 50th, and 75th percentile of the total frequency.
Table 2.
Number of multi-exon transcripts in each biotype in the Ensembl annotation, and how many of them are also annotated in RefSeq (GRCh38 build).
The first column lists biotypes defined by the Ensembl annotation; the second column lists the number of multi-exon transcripts in each biotype; the third and the fourth columns give the number and the percentage of multi-exon transcripts in each biotype that are also annotated in the RefSeq annotation.
Fig 5.
Comparison of the matching transcripts assembled by StringTie2 and Scallop2 in the five largest biotypes using the Ensembl annotation as the reference.
The barplot reports the average number over all samples in each dataset. The five biotypes are: protein_coding (pc), retained_intron (ir), lncRNA (lnc), processed_transcript (pt), and nonsense_mediated_decay (nmd). These five biotypes account for 98.6% of the total transcripts in Ensembl.
Fig 6.
Comparison of the transcripts assembled by Scallop2 and StringTie2 that are annotated by Ensembl but not by RefSeq in each of the five largest biotypes.
The barplot reports the average number over all samples in each dataset. The five biotypes are: retained_intron (ir), protein_coding (pc), lncRNA (lnc; long non-coding RNA), processed_transcript (pt), and nonsense_mediated_decay (nmd).
Table 3.
Comparison of the relative change (%) in precision and in the number of matching transcripts after filtering out transcripts with intron retentions from the StringTie2 and Scallop2 assemblies, evaluated with different annotations as the reference.
Numbers are averaged over all samples in each dataset.
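As a small worked illustration of the quantity reported here, the sketch below computes a relative change in percent and averages it over samples. It assumes the usual definition, (after − before) / before × 100, and assumes that the per-sample changes are averaged (rather than taking the change of the averaged metrics); neither detail is stated in this caption. The function names and the numbers are hypothetical.

```python
def relative_change_pct(before, after):
    """Relative change in percent; positive means the metric increased."""
    if before == 0:
        raise ValueError("relative change is undefined when 'before' is 0")
    return (after - before) / before * 100.0

def mean_relative_change_pct(pairs):
    """Average the per-sample relative changes (assumed order of operations)."""
    changes = [relative_change_pct(before, after) for before, after in pairs]
    return sum(changes) / len(changes)

# Made-up (before filtering, after filtering) precision values for two samples.
precision_pairs = [(40.0, 50.0), (32.0, 36.0)]
avg_change = mean_relative_change_pct(precision_pairs)
```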
Fig 7.
Pipeline for evaluating the accuracy of the compared assemblers.
Fig 8.
A toy example illustrating the Jaccard similarity of two annotations T1 and T2 at the level of boundary, junction, and intron-chain.
Genes and transcripts from the same annotation are colored the same. Identical boundaries between two annotations are marked with vertical dashed lines. We have JB(T1, T2) ≔ |B(T1) ∩ B(T2)| / |B(T1) ∪ B(T2)| = 5/6, JJ(T1, T2) ≔ |J(T1) ∩ J(T2)| / |J(T1) ∪ J(T2)| = 5/8, and JC(T1, T2) ≔ |C(T1) ∩ C(T2)| / |C(T1) ∪ C(T2)| = 1/3.
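The three similarities can be sketched in a few lines of Python. The Jaccard formula itself is as stated above; how B, J, and C are extracted from transcripts is an assumption here, since the caption does not define them: boundaries are taken to be intron endpoints, junctions to be (intron start, intron end) pairs, and intron-chains to be the ordered tuple of introns of a multi-exon transcript. All names and coordinates are hypothetical, a single chromosome and strand is assumed, and the toy data is not the example drawn in the figure.

```python
def introns(exons):
    """Introns of one transcript, given its exons as sorted (start, end) pairs."""
    return [(left[1], right[0]) for left, right in zip(exons, exons[1:])]

def boundaries(annotation):
    """Assumed B(T): all intron endpoints across transcripts."""
    return {pos for exons in annotation for intron in introns(exons) for pos in intron}

def junctions(annotation):
    """Assumed J(T): all distinct introns across transcripts."""
    return {intron for exons in annotation for intron in introns(exons)}

def intron_chains(annotation):
    """Assumed C(T): intron-chains of multi-exon transcripts."""
    return {tuple(introns(exons)) for exons in annotation if len(exons) > 1}

def jaccard(a, b):
    """|a ∩ b| / |a ∪ b|; returns 0.0 for two empty sets (an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy annotations: each transcript is a list of exons.
T1 = [[(100, 200), (300, 400), (500, 600)]]
T2 = [[(100, 200), (300, 400), (500, 650)],
      [(100, 200), (500, 600)]]

j_boundary = jaccard(boundaries(T1), boundaries(T2))        # 4/4
j_junction = jaccard(junctions(T1), junctions(T2))          # 2/3
j_chain = jaccard(intron_chains(T1), intron_chains(T2))     # 1/2
```

In this toy data the similarity drops from boundary to junction to intron-chain, because each level requires a larger structure to match exactly.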
Fig 9.
An illustrative example of the three criteria used to define transcripts with intron retentions.
Identical boundaries are marked with vertical dashed lines. Transcript t2 satisfies criterion 1 (it has lower abundance than t1, and its first exon spans the second intron of t1). Transcript t3 satisfies criterion 2 (it has lower abundance than t1 and its last exon spans the fourth intron of t1). Transcript t4 satisfies criterion 3 (it has lower abundance than t1 and its second exon fully covers the second intron of t1).
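The three criteria can be sketched as a single check of one transcript against a higher-abundance reference transcript. This is only an illustration under stated assumptions, not the authors' implementation: "spans" and "fully covers" are both interpreted as the exon interval containing the whole intron, no boundary-matching requirement is imposed, and the function names, coordinates, and abundances are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]

def introns(exons: List[Interval]) -> List[Interval]:
    """Introns of a transcript, given its exons as sorted (start, end) pairs."""
    return [(left[1], right[0]) for left, right in zip(exons, exons[1:])]

def covers(exon: Interval, intron: Interval) -> bool:
    """Assumed meaning of 'spans' / 'fully covers': exon contains the intron."""
    return exon[0] <= intron[0] and intron[1] <= exon[1]

def retention_criterion(exons: List[Interval], abundance: float,
                        ref_exons: List[Interval], ref_abundance: float
                        ) -> Optional[int]:
    """Return 1, 2, or 3 for the criterion satisfied, or None if none applies."""
    if abundance >= ref_abundance:
        return None  # all three criteria require lower abundance than the reference
    ref_introns = introns(ref_exons)
    last = len(exons) - 1
    for idx, exon in enumerate(exons):
        if any(covers(exon, intron) for intron in ref_introns):
            if idx == 0:
                return 1  # first exon spans an intron of the reference
            if idx == last:
                return 2  # last exon spans an intron of the reference
            return 3      # an internal exon fully covers an intron
    return None

# Toy transcripts loosely mirroring the description above (made-up coordinates).
t1 = [(100, 200), (300, 400), (500, 600), (700, 800), (900, 1000)]
t2 = [(350, 600), (700, 800), (900, 1000)]              # first exon over intron 2
t3 = [(100, 200), (300, 400), (500, 600), (700, 1000)]  # last exon over intron 4
t4 = [(100, 200), (300, 600), (700, 800), (900, 1000)]  # second exon over intron 2

labels = [retention_criterion(t, 2.0, t1, 10.0) for t in (t2, t3, t4)]
```

A filter built on this check would drop any assembled transcript for which some higher-abundance transcript yields a non-`None` result.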