Pervasive Transcription of the Human Genome Produces Thousands of Previously Unidentified Long Intergenic Noncoding RNAs

doi:10.1371/journal.pgen.1003569

Figure 1.

The human intergenic transcriptome.

(A) 85.2% of the genome has evidence of transcription, with RNA-seq reads mapping directly to 78.9% of genomic sequence. The remaining genomic coverage consists of known genes, spliced ESTs and spliced cDNAs. The grey circle represents the portion of the genome (83.4%) that is uniquely mappable with RNA-seq reads. (B) Distribution of expression levels for protein coding (NM gene) exons, introns and intergenic regions. Highly expressed regions have a larger fraction of base calls at higher read depths. Protein coding gene exons have the highest proportion of bases with high read depth, while introns and intergenic regions have relatively more bases of low read depth, though each contains many highly expressed regions. Base calls = (# of genomic positions at a specific read depth) × (read depth). (C) Most intergenic transcription is outside of annotated noncoding RNA genes. The fraction of intergenic base calls within RefSeq noncoding RNA genes (NR genes) is compared with the fraction in other intergenic regions. In (A–C), only uniquely mappable portions of the genome are considered (see Methods).
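
The base-call definition in (B) can be illustrated with a minimal Python sketch. It assumes per-position read depths are already available as a list; the function name `base_call_distribution` and the example depths are hypothetical, not taken from the paper's Methods.

```python
from collections import Counter

def base_call_distribution(depths):
    """Return {read_depth: fraction of all base calls at that depth}.

    Base calls at depth d = (# of positions with depth d) * d,
    following the definition given in the caption.
    """
    # Positions with zero depth contribute no base calls, so skip them.
    positions_at_depth = Counter(d for d in depths if d > 0)
    base_calls = {d: n * d for d, n in positions_at_depth.items()}
    total = sum(base_calls.values())
    if total == 0:
        return {}
    return {d: calls / total for d, calls in sorted(base_calls.items())}

# Hypothetical per-position read depths for a small region.
example_depths = [0, 1, 1, 2, 2, 2, 10]
print(base_call_distribution(example_depths))
```

In this toy region a single position at depth 10 accounts for more than half of all base calls, which is why highly expressed regions dominate the high-depth end of the distribution.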


Figure 2.

Discovery of lincRNAs.

(A) Discovery of lincRNAs consisted of de novo assembly of transcripts from RNA-seq data and compilation of annotated and putative noncoding RNAs (see Methods), followed by a series of filters designed to remove all known and novel protein coding transcripts and non-lincRNA noncoding RNAs. Only intergenic noncoding transcripts at least 200 nucleotides in length and expressed at a level of at least one copy per cell were ultimately annotated as lincRNAs. (B) Analysis of ribosomal profiling data reveals that the lincRNA catalog is composed of noncoding transcripts. The maximum 30 bp window ratio of HeLa ribosomal/RNA-seq reads [22] is plotted for exons of lincRNAs, 3′ UTRs and coding sequences (CDS). *P<2.2E-16; whiskers extend ±1.5 times the interquartile range and dots represent outliers. (C) Computational analysis of the lincRNAs reveals a lack of protein coding capacity. The cumulative distributions of PhyloCSF [40] scores for lincRNAs and RefSeq NM genes are plotted. Higher scores correspond to higher predicted coding capacity.
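
The "maximum 30 bp window ratio" in (B) could be computed along the following lines. This is only a sketch under stated assumptions: the inputs are per-base read counts for one exon, windows without RNA-seq reads are skipped, and the name `max_window_ratio` is hypothetical. The paper's Methods may normalize or handle empty windows differently.

```python
def max_window_ratio(ribo, rna, window=30):
    """Largest ribosome/RNA-seq read ratio over all full-length windows.

    ribo, rna: per-base read counts along one exon (assumed inputs).
    Returns None if the exon is shorter than the window or if no
    window contains any RNA-seq reads.
    """
    if len(ribo) != len(rna):
        raise ValueError("coverage tracks must have equal length")
    if len(ribo) < window:
        return None
    ribo_sum = sum(ribo[:window])
    rna_sum = sum(rna[:window])
    best = None
    for start in range(len(ribo) - window + 1):
        if start > 0:
            # Slide the window one base to the right.
            ribo_sum += ribo[start + window - 1] - ribo[start - 1]
            rna_sum += rna[start + window - 1] - rna[start - 1]
        if rna_sum > 0:  # assumption: skip windows with no RNA-seq signal
            ratio = ribo_sum / rna_sum
            if best is None or ratio > best:
                best = ratio
    return best
```

Taking the maximum over windows is a permissive choice: a transcript scores highly if any 30 bp stretch shows ribosome occupancy, so consistently low values are comparatively strong evidence against translation.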


Figure 3.

LincRNAs possess features inconsistent with transcriptional noise.

(A) ChIP-seq and RNA-seq data from IMR90 cells [28], [29] were analyzed for lincRNAs and RefSeq NM genes. *P = 4.01E-7, ** P = 4.52E-9, *** P = 2.43E-14, **** P<2.2E-16; P = 0.137 for lincRNA H3K9me3; whiskers extend to ±1.5 times the interquartile range or the most extreme data point. (B) LincRNA FPKM values in polyA+ specific and polyA− specific RNA-seq libraries in H9 ESCs and HeLa cells [46] were compared. Transcripts with RNA-seq reads in all four datasets and with FPKM>1 in at least one of the two fractions for each cell type were analyzed (16,819 NM genes and 127 lincRNAs). Individual lincRNA and NM gene ratios of FPKMs in polyA+/polyA− fractions are plotted. Pearson correlation coefficient for lincRNAs = 0.622 (P = 5.551E-15) and for NM genes = 0.702 (P<2.2E-16). (C) The maximally conserved 50 bp windows in each NM gene, lincRNA, and repetitive element (nonconserved control sequences) were determined. The maximally conserved 50 bp windows of 12 functional human lincRNAs are indicated for comparison.
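
The comparison in (B) can be sketched as a correlation between per-transcript polyA+/polyA− FPKM ratios in two cell types. The caption does not state whether ratios were log-transformed, so this sketch uses raw ratios as an assumption. The function names and FPKM values are hypothetical.

```python
import math

def polya_ratios(fpkm_plus, fpkm_minus):
    """Per-transcript polyA+/polyA- FPKM ratios (all values must be > 0)."""
    return [p / m for p, m in zip(fpkm_plus, fpkm_minus)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical FPKM values for three transcripts in each cell type.
h9_ratios = polya_ratios([4.0, 2.0, 9.0], [2.0, 2.0, 3.0])
hela_ratios = polya_ratios([8.0, 3.0, 10.0], [2.0, 1.5, 2.0])
print(pearson(h9_ratios, hela_ratios))
```

A high correlation between cell types indicates that a transcript's polyadenylation status is reproducible rather than random, which is the point the panel makes against transcriptional noise.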


Figure 4.

LincRNAs are enriched for trait-associated SNPs.

The number of trait-associated SNPs within RefSeq NM gene exons, lincRNA exons, or background loci (nonexpressed intergenic sequence) per tested SNP in genome wide association studies is compared (see Methods). *P = 0.0173, **P<2.2E-16; error bars represent 95% binomial proportion confidence interval.
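
The error bars are 95% binomial proportion confidence intervals. The caption does not say which interval was used, so the sketch below uses the Wilson score interval as an assumption. The counts in the usage example are hypothetical.

```python
import math

def wilson_interval(successes, trials, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default 95%).

    successes: e.g. trait-associated SNPs in a region class.
    trials: e.g. tested SNPs in that region class.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require 0 <= successes <= trials and trials > 0")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    ) / denom
    return centre - half_width, centre + half_width

# Hypothetical counts: 50 trait-associated SNPs among 1000 tested SNPs.
low, high = wilson_interval(50, 1000)
print(f"{low:.4f} to {high:.4f}")
```

The Wilson interval behaves better than the simple normal approximation when the proportion is close to zero, which is the typical situation for SNP-per-tested-SNP rates.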


Table 1.

Datasets used for chromatin modification analysis.
