Transcript assembly and annotations: Bias and adjustment

doi:10.1371/journal.pcbi.1011734

Table 1.

Comparison of the assembly accuracy, measured with precision (%) and the number of matching transcripts, of StringTie2 and Scallop2 using different annotations as the reference.

In each combination (of dataset, aligner, genome build, and annotation), the two metrics are averaged over all samples in the dataset. The symbol 〈〉 indicates that one method scores higher on one metric but lower on the other; > indicates that StringTie2 outperforms on both metrics, and < indicates that Scallop2 outperforms on both. The three "raw" count columns give the number of samples in each category when raw precision and recall are compared. Samples in the 〈〉 category are then compared using the adjusted precision and reassigned to either the > or the < category; the resulting counts are shown in the two "adjusted" columns.
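
The per-sample categorization described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `compare_sample`, the toy numbers, and the treatment of ties (counted as mixed) are assumptions, and the adjusted-precision step is left out because the caption does not define it.

```python
from collections import Counter

def compare_sample(prec_st, match_st, prec_sc, match_sc):
    """Categorize one sample from raw precision and matching-transcript counts.

    Returns ">" if StringTie2 is higher on both metrics, "<" if Scallop2 is
    higher on both, and "<>" (standing in for the caption's mixed symbol)
    otherwise. Ties fall into the mixed category, which is an assumption.
    """
    if prec_st > prec_sc and match_st > match_sc:
        return ">"
    if prec_st < prec_sc and match_st < match_sc:
        return "<"
    return "<>"

# Hypothetical samples: (StringTie2 precision, StringTie2 matches,
#                        Scallop2 precision,  Scallop2 matches)
samples = [
    (40.0, 9000, 38.0, 8500),  # StringTie2 higher on both
    (35.0, 9500, 42.0, 9000),  # mixed: each method wins one metric
    (30.0, 7000, 33.0, 7600),  # Scallop2 higher on both
]

# Number of samples per category, analogous to the "raw" count columns.
counts = Counter(compare_sample(*s) for s in samples)
print(dict(counts))
```

Samples that land in the mixed category are the ones the table resolves with the adjusted precision.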

Fig 1.

Difference in assembly accuracy when evaluated with the RefSeq and Ensembl annotations, both from the GRCh38 genome build.

Each arrow represents a sample, pointing from the accuracy evaluated with RefSeq annotation to that with Ensembl annotation. The subfigures correspond to the four combinations of dataset (EN10 or HS7) and aligner (HISAT2 or STAR).

Fig 2.

Difference in assembly accuracy when evaluated with the RefSeq and Ensembl annotations, both from the T2T-CHM13 genome build.

Each arrow represents a sample, pointing from the accuracy evaluated with RefSeq annotation to that with Ensembl annotation. The subfigures correspond to the four combinations of dataset (EN10 or HS7) and aligner (HISAT2 or STAR).

Fig 3.

The Jaccard similarity of 4 pairs of annotations (4 subfigures) at the level of boundary, junction, and intron-chain.

Fig 4.

The distribution of Jaccard similarities of all paired genes in each pair of compared annotations at the level of boundary, junction, and intron-chain.

The three dashed lines in each subfigure mark the 25th, 50th, and 75th percentiles of the Jaccard similarity distribution.

Table 2.

Number of multi-exon transcripts per biotype in the Ensembl annotation, and how many of them are also present in the RefSeq annotation (GRCh38 build).

The first column lists biotypes defined by the Ensembl annotation; the second column lists the number of multi-exon transcripts in each biotype; the third and the fourth columns give the number and the percentage of multi-exon transcripts in each biotype that are also annotated in the RefSeq annotation.
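
The four columns described above could be tallied along the following lines. This is a sketch under stated assumptions: a transcript is treated as "also annotated in RefSeq" when its chromosome, strand, and intron chain match exactly, which the caption does not spell out, and the function name `biotype_overlap` and the toy records are hypothetical.

```python
from collections import defaultdict

def biotype_overlap(ensembl, refseq_keys):
    """Tally multi-exon transcripts per Ensembl biotype.

    ensembl: iterable of (biotype, key) pairs, where key identifies a
             multi-exon transcript as (chrom, strand, intron_chain).
    refseq_keys: set of keys built the same way from RefSeq.
    Returns {biotype: (total, shared, percent_shared)}.
    """
    total = defaultdict(int)
    shared = defaultdict(int)
    for biotype, key in ensembl:
        total[biotype] += 1
        if key in refseq_keys:  # assumed matching rule: identical intron chain
            shared[biotype] += 1
    return {b: (total[b], shared[b], 100.0 * shared[b] / total[b]) for b in total}

# Hypothetical toy data: intron chains are tuples of (start, end) introns.
ensembl = [
    ("protein_coding", ("chr1", "+", ((200, 300), (400, 500)))),
    ("protein_coding", ("chr1", "+", ((200, 500),))),
    ("retained_intron", ("chr1", "+", ((400, 500),))),
]
refseq_keys = {("chr1", "+", ((200, 300), (400, 500)))}

table = biotype_overlap(ensembl, refseq_keys)
for biotype, (n_total, n_shared, pct) in table.items():
    print(biotype, n_total, n_shared, f"{pct:.1f}%")
```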

Fig 5.

Comparison of the matching transcripts assembled by StringTie2 and Scallop2 in the five largest biotypes using the Ensembl annotation as the reference.

The average number over all samples in each dataset is reported in the barplot. The 5 biotypes are: protein_coding (pc), retained_intron (ir), lncRNA (lnc), processed_transcript (pt), and nonsense_mediated_decay (nmd). These five biotypes account for 98.6% of the total transcripts in Ensembl.

Fig 6.

Comparison of the transcripts assembled by Scallop2 and StringTie2 that are annotated by Ensembl but not by RefSeq in each of the five largest biotypes.

The average number over all samples in each dataset is reported in the barplot. The 5 biotypes are: retained_intron (ir), protein_coding (pc), lncRNA (lnc, long non-coding RNA), processed_transcript (pt), and nonsense_mediated_decay (nmd).

Table 3.

Relative change (%) in precision and in the number of matching transcripts after filtering out transcripts with intron retentions from the StringTie2 and Scallop2 assemblies, evaluated with different annotations as the reference.

Numbers are averaged over all samples in each dataset.
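
A relative change in percent is conventionally computed against the pre-filtering value. The sketch below assumes that convention, which the caption does not state explicitly; the function name and the example numbers are hypothetical.

```python
def relative_change_pct(before, after):
    """Relative change in percent, measured against the value before filtering."""
    return 100.0 * (after - before) / before

# Hypothetical sample: filtering raises precision and drops a few matches.
precision_change = relative_change_pct(before=40.0, after=44.0)
matching_change = relative_change_pct(before=9000, after=8820)
print(f"precision: {precision_change:+.1f}%, matching: {matching_change:+.1f}%")
```

Under this reading, a positive value for precision combined with a small negative value for matching transcripts means the filter removed mostly non-matching transcripts.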

Fig 7.

Pipeline for evaluating the accuracy of the compared assemblers.

Fig 8.

A toy example illustrating the Jaccard similarity of two annotations T₁ and T₂ at the level of boundary, junction, and intron-chain.
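
A comparable toy computation can be written out in a few lines. The definitions used here are assumptions, since the caption does not give them: a junction is one intron as a (start, end) pair, a boundary is a single intron start or end coordinate, and an intron-chain is the ordered tuple of a transcript's introns. The coordinates are made up and are not those of the figure.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 for two empty sets (assumption)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def introns(exons):
    """Introns between consecutive exons, each exon given as (start, end)."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]

def features(annotation):
    """Collect boundary, junction, and intron-chain sets from multi-exon transcripts."""
    boundaries, junctions, chains = set(), set(), set()
    for exons in annotation:
        chain = tuple(introns(exons))
        if not chain:  # single-exon transcripts have no introns
            continue
        chains.add(chain)
        junctions.update(chain)
        for start, end in chain:
            boundaries.update((start, end))
    return boundaries, junctions, chains

# Two hypothetical annotations on one chromosome and strand.
T1 = [[(100, 200), (300, 400), (500, 600)], [(100, 200), (500, 600)]]
T2 = [[(100, 200), (300, 400), (500, 600)], [(100, 200), (300, 450), (500, 600)]]

b1, j1, c1 = features(T1)
b2, j2, c2 = features(T2)
boundary_sim = jaccard(b1, b2)   # 4 shared of 5 boundaries
junction_sim = jaccard(j1, j2)   # 2 shared of 4 junctions
chain_sim = jaccard(c1, c2)      # 1 shared of 3 intron-chains
print(boundary_sim, junction_sim, chain_sim)
```

In this toy case the similarity drops from boundary to junction to intron-chain, because each level requires agreement on a larger unit.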

Fig 9.

An illustrative example for the three criteria used to define transcripts with intron retentions.

Identical boundaries are marked with vertical dashed lines. Transcript t₂ satisfies criterion 1 (it has lower abundance than t₁, and its first exon spans the second intron of t₁). Transcript t₃ satisfies criterion 2 (it has lower abundance than t₁ and its last exon spans the fourth intron of t₁). Transcript t₄ satisfies criterion 3 (it has lower abundance than t₁ and its second exon fully covers the second intron of t₁).
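
The three cases in this caption can be approximated with a simplified check. This sketch assumes that "spans" and "fully covers" both mean the exon contains the whole intron, and it ignores the shared-boundary conditions indicated by the dashed lines, so it is looser than the actual criteria. The function names, the transcript representation, and the coordinates are hypothetical.

```python
def introns(exons):
    """Introns between consecutive exons, each exon given as (start, end)."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]

def retention_criterion(t, ref):
    """Return 1, 2, or 3 if transcript t looks like an intron retention of ref.

    1: the first exon of t contains an intron of ref
    2: the last exon of t contains an intron of ref
    3: an internal exon of t contains an intron of ref
    None: t is not less abundant than ref, or no exon contains an intron.
    """
    if t["abundance"] >= ref["abundance"]:
        return None
    ref_introns = introns(ref["exons"])
    last = len(t["exons"]) - 1
    for idx, (exon_start, exon_end) in enumerate(t["exons"]):
        covers = any(exon_start <= i_start and i_end <= exon_end
                     for i_start, i_end in ref_introns)
        if covers:
            if idx == 0:
                return 1
            if idx == last:
                return 2
            return 3
    return None

# Hypothetical transcripts loosely following the caption's t1..t4.
t1 = {"abundance": 50.0,
      "exons": [(100, 200), (300, 400), (500, 600), (700, 800), (900, 1000)]}
t2 = {"abundance": 5.0, "exons": [(350, 600), (700, 800), (900, 1000)]}
t3 = {"abundance": 5.0, "exons": [(100, 200), (300, 400), (500, 600), (700, 1000)]}
t4 = {"abundance": 5.0, "exons": [(100, 200), (300, 600), (700, 800), (900, 1000)]}

print([retention_criterion(t, t1) for t in (t2, t3, t4)])
```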
