Fig 1.
An overview of the experimental (a) and informatics (b) components of the ToFU pipeline used to generate transcript isoforms.
Fig 2.
Long, high-quality, consensus sequences accurately benchmark transcript diversity.
a, Length distributions of full-length (FL) input reads, high-quality CCS reads, and ToFU transcript sequences. b, Histogram of percent nucleotide identity of ToFU transcript sequences aligned to the reference genome. c, Cumulative histogram of the number of reference annotations for which a ToFU transcript either completely covers every annotated junction (transcript-covered) or only partially covers the annotated gene (loci-covered). Reference annotations that were not assayed (blue stack) are also shown. d, Distribution of distinct isoforms per locus for the reference annotation and the ToFU transcript set. e, Illumina short-read coverage (grey) and junction support (red lines; the associated numbers indicate the Illumina reads that support each splice junction) aligned along the reference annotated transcript (blue) for a glycosyl hydrolase gene, with 120 distinct PacBio isoforms aligned below (splice junctions are shown in red and exon sequences in green). f, An enlarged view of the region between two starts in 2e.
Fig 3.
Evaluating short-read transcript reconstruction against ToFU transcripts.
a, Percentage of ToFU transcripts recovered by three different short-read assembly methods. The isoform frequency shows whether a ToFU transcript is recovered by exactly 0, 1, 2, or all 3 of the assemblers. b, Number of assembled transcripts validated by ToFU transcripts. A transcript is validated as an exact match of a ToFU transcript if it shares exactly the same number of exons and donor-acceptor sites. c, Fraction of ToFU transcripts recovered (sensitivity) by each short-read assembler as a function of isoform complexity. d, Fraction of assembled transcripts validated (specificity) by ToFU as a function of isoform complexity. Isoform complexity is determined by the number of ToFU isoforms at each locus.
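The exact-match criterion in panel b and the two fractions in panels c and d can be illustrated with a minimal sketch. It assumes each transcript is given as a list of (start, end) exon coordinates on the same chromosome and strand; the function names and the coordinate representation are hypothetical and are not taken from the ToFU software. Under this criterion two single-exon transcripts always match, since neither has a donor-acceptor site.

```python
from typing import List, Tuple

Exon = Tuple[int, int]          # (start, end) of one exon; representation is an assumption
Transcript = List[Exon]         # exons of one transcript on a single chromosome and strand


def junction_chain(transcript: Transcript) -> Tuple[Tuple[int, int], ...]:
    """Return the ordered (donor, acceptor) pairs between consecutive exons."""
    exons = sorted(transcript)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def match_key(transcript: Transcript):
    """Key used for exact matching: exon count plus donor-acceptor sites."""
    return (len(transcript), junction_chain(transcript))


def is_exact_match(a: Transcript, b: Transcript) -> bool:
    """True if both transcripts have the same exon count and the same splice sites.

    Transcript start and end positions are deliberately ignored.
    """
    return match_key(a) == match_key(b)


def fraction_matched(queries: List[Transcript], targets: List[Transcript]) -> float:
    """Fraction of query transcripts that exactly match at least one target."""
    if not queries:
        return 0.0
    target_keys = {match_key(t) for t in targets}
    hits = sum(match_key(q) in target_keys for q in queries)
    return hits / len(queries)


# Toy data (illustrative only, not from the paper).
tofu = [[(100, 200), (300, 400)], [(100, 200), (350, 400)]]
assembled = [[(90, 200), (300, 450)], [(100, 250), (300, 400)]]

recovered = fraction_matched(tofu, assembled)   # fraction of ToFU transcripts recovered (panel c)
validated = fraction_matched(assembled, tofu)   # fraction of assembled transcripts validated (panel d)
```

In the toy data, the first assembled transcript differs from the first ToFU transcript only at its outer ends, so it still counts as an exact match, while the second has a shifted donor site and does not.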
Fig 4.
The genome-wide presence of polycistronic mRNAs.
a, Short reads (Illumina) aligned to a cluster of tandem reference genes (Annotation, 3 tandem genes on the first row). The number of supporting short reads for each junction is indicated. Polycistronic transcripts (ToFU) are shown in green and non-polycistronic transcripts in grey. b, A comparison of transcription termination signals. The sequence composition profiles (upper panel for A-content and lower panel for U-content) before the polyadenylation sites are shown for different classes of ORFs. pORF1 is the upstream ORF and pORF2 is the downstream ORF, while nORF stands for non-polycistronic mRNAs. The y-axis shows the frequency of a specific nucleotide averaged over 200 randomly sampled polycistronic mRNAs or non-polycistronic controls; dotted lines mark the expected frequency (0.25) if all four bases are equally likely. Arrows denote NUE (upper panel) and FUE (lower panel), respectively. For this figure, only polycistronic transcripts with exactly two ORFs are plotted. A genome-wide analysis of the base composition of termination signals for all transcribed loci is shown in Fig B in S2 File. c, The independent expression levels of ORFs within polycistronic RNAs. ORF numbers indicate their order in the transcript (5’ to 3’). d, Polycistronic transcripts are likely a feature unique to Agaricomycetes. The top plot shows the total number of adjacent ORF pairs within polycistronic transcripts from P. crispa that have a conserved gene configuration in related species. The numbers on the x-axis are species ordered by increasing evolutionary distance. The bottom heatmap shows the conservation of each individual pair of ORFs. Red indicates the presence of a homologous gene pair in the species.
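The composition profiles in panel b amount to a per-position nucleotide frequency computed over a set of sequences aligned at the polyadenylation site. The following is a minimal sketch under stated assumptions: the input is a list of equal-length RNA sequences taken upstream of the site, and the function name and the toy sequences are hypothetical. The window length and the random sampling of 200 transcripts used for the figure are not reproduced here.

```python
from typing import List


def base_frequency_profile(sequences: List[str], base: str) -> List[float]:
    """Per-position frequency of `base` across equal-length aligned sequences."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    base = base.upper()
    n = len(sequences)
    # For each alignment column, count how many sequences carry the base.
    return [
        sum(seq[i].upper() == base for seq in sequences) / n
        for i in range(length)
    ]


# Toy upstream windows (illustrative only, not data from the paper).
windows = ["AAUG", "AUUC", "ACUA", "GAUA"]

a_profile = base_frequency_profile(windows, "A")  # analogous to the upper (A-content) panel
u_profile = base_frequency_profile(windows, "U")  # analogous to the lower (U-content) panel
EXPECTED_UNIFORM = 0.25  # dotted-line baseline when all four bases are equally likely
```

Positions where a profile rises well above the 0.25 baseline are the kind of enrichment that the arrows in the panel point to.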
Table 1.
Polycistronic transcripts identified in several fungal transcriptomes.