Figure 1.
Schematic illustration MULTo file generation.
(A) We defined the minimum unique length (MUL) of a genomic coordinate as the length of the shortest starting oligonucleotide at that coordinate that is needed to be unique. To find the MUL value, Fasta files with artificial “reads” of different lengths were iteratively created from whole chromosome fasta files and mapped to the genome using bowtie. When the minimum length needed for uniqueness was found, this value was stored in a binary file. In this example, position 3000091 was unique at 33 base pairs but not at 32, i.e. we have a MUL value of 33. (B) Exemplifying that MUL values can be retrieved from arbitrary regions in just a few lines of code.
Figure 2.
Uniqueness in the transcriptome.
(A, B) We calculated the proportion of unique positions for each transcript, both for single reads and paired-end fragments (mean 500 nt), and then plotted how many transcripts have a certain proportion of unique positions. The y-axis represents the proportion of all transcripts that satisfies the given condition. (A) Gene-level uniqueness of all RefSeq transcripts. (B) Transcript-level uniqueness for all transcripts from multi-isoform genes. (C) Positional plot of the uniqueness proportion across all coding transcripts. We calculated the number of reads of a specific length that passes through each position, and determined what proportion of these were unique. Since transcripts differ in length, we binned positions together so that each region (upstream, downstream, coding sequence, 5′ and 3′UTR) had the same number of bins for each transcript. The x-axis represents coordinate bins across transcripts.
Figure 3.
Effects of uniqueness normalization on expression level.
(A) Histogram showing how uniqueness compensation using MULTo affects the RPKM values at different read lengths. The x-axis show the difference in gene expression between uniqueness compensated and uncompensated expression levels. (B) RPKM values for FTH1 before and after uniqueness normalization. (C) Read coverage and uniqueness profile across FTH1 for 25 nt reads. Uniqueness density was calculated as the proportion unique reads aligning to each genomic coordinate.
Figure 4.
Comparison of uniqueness compensation methods for RNA-Seq.
Scatter plots showing how gene expression values (RPKM) are affected by uniqueness compensation for transcripts as a function of increasing proportion unique positions. (A) Uniqueness compensation with MULTo corrected transcript lengths are close to optimal compensation line (y = 1/x) (B) ERANGE uniqueness compensation. (C) Cufflinks uniqueness compensation. (D,E) MA-plots between MUL and ERANGE uniqueness compensation (D) and between MUL and cufflinks uniqueness compensation (E) showing how gene expression differences correlate to the gene expression average. Short transcripts were colored red.
Figure 5.
Uniqueness profiles within genomic regions.
The proportions of unique positions within different regions were calculated for read lengths in the range 20–255 nts. (A) Proportion unique positions in whole genome, within RefSeq genes, intergenic regions, known p300 binding sites, proximal promoters and CpG islands. (B) Proportion unique positions within different parts of genes; exons, introns and UTRs. (C,D) Difference in proportion unique positions between the regular and bisulfite converted genome. The y-axis in (C) and (D) represents the uniqueness proportion in bisulfite genome subtracted from that in the regular genome. The vertical dashed line marks 35 nucleotide reads.