Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++

doi:10.1371/journal.pcbi.1001025

Figure 1.

Overview of GERP++.

(1) For each position of the multiple alignment, we compute the conservation score in rejected substitutions (RS) by subtracting the estimated evolutionary rate from the neutral rate. The neutral rate is computed by removing species gapped at that position from the phylogenetic tree and summing the branch lengths of the resulting projected tree; the evolutionary rate is estimated by computing the maximum likelihood rescaling of the projected tree. (2) Given position-specific conservation scores, we generate a set of candidate elements. (3) For each candidate element, we compute a p-value representing the probability of observing a segment of equal length and equal or greater score under the null model. We then select a non-overlapping set of elements in order of increasing p-value.
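The scoring in step (1) and the selection in step (3) can be illustrated with a minimal Python sketch. This is not the GERP++ implementation: the tuple-based tree encoding and the function names are assumptions made for illustration. The maximum likelihood rescaling is not reproduced, so the estimated rate is passed in as a number. The projected tree is taken to be the minimal subtree spanning the ungapped species, which is also an assumption.

```python
from typing import List, Set, Tuple

# Toy tree encoding (an assumption of this sketch): a node is
# (label, branch_length), where label is a species name for a leaf
# or a list of child nodes for an internal node.

def leaves(node) -> Set[str]:
    """Collect the species names at the tips of the tree."""
    label, _ = node
    if isinstance(label, str):
        return {label}
    names: Set[str] = set()
    for child in label:
        names |= leaves(child)
    return names

def _walk(node, kept: Set[str], n_kept: int) -> Tuple[int, float]:
    """Return (kept leaves below node, summed length of spanning branches)."""
    label, length = node
    if isinstance(label, str):
        below, total = (1 if label in kept else 0), 0.0
    else:
        below, total = 0, 0.0
        for child in label:
            b, t = _walk(child, kept, n_kept)
            below += b
            total += t
    # A branch belongs to the projected tree only if kept species
    # lie on both sides of it.
    if 0 < below < n_kept:
        total += length
    return below, total

def neutral_rate(tree, gapped: Set[str]) -> float:
    """Sum branch lengths of the tree projected onto ungapped species."""
    kept = leaves(tree) - gapped
    return _walk(tree, kept, len(kept))[1]

def rs_score(tree, gapped: Set[str], estimated_rate: float) -> float:
    """Rejected substitutions: neutral rate minus estimated rate."""
    return neutral_rate(tree, gapped) - estimated_rate

def select_elements(candidates: List[Tuple[int, int, float]]):
    """Greedily keep non-overlapping (start, end, p_value) candidates,
    visiting them in order of increasing p-value (inclusive coordinates)."""
    chosen: List[Tuple[int, int, float]] = []
    for start, end, p in sorted(candidates, key=lambda c: c[2]):
        if all(end < s or start > e for s, e, _ in chosen):
            chosen.append((start, end, p))
    return sorted(chosen)
```

With a three-species toy tree, gapping one species shrinks the neutral rate, so the same estimated rate yields a smaller score. In the selection step, a candidate that overlaps an already accepted element with a lower p-value is dropped.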

Figure 2.

Per-chromosome constraint intensity.

(A) Mean RS score for all alignment positions where an evolutionary rate was computed. Note the elevated average score for chromosome X. (B) Fraction of each chromosome that falls into predicted constrained elements. Light green bars show the fraction of the entire chromosome, while dark green bars show the fraction adjusted for regions where no rate computation was performed and no elements could span (see Methods).

Figure 3.

Estimating detectable constraint.

The red curve represents the number of bases within predicted constrained elements as a function of the false positive cutoff parameter. The blue curve represents the number of predicted bases minus the expected number of false positive bases, also as a function of the false positive cutoff.
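The relationship between the two curves is a simple subtraction, sketched below in Python. The function name and the input layout are assumptions for illustration. The caption does not say how the expected number of false positive bases is obtained, so it is taken as an input here.

```python
from typing import List, Tuple

def net_curve(
    points: List[Tuple[float, int, int]]
) -> List[Tuple[float, int]]:
    """For each (cutoff, predicted_bases, expected_false_positive_bases)
    triple, return (cutoff, predicted minus expected false positives).
    predicted_bases corresponds to the red curve, the result to the blue."""
    return [(cutoff, predicted - expected_fp)
            for cutoff, predicted, expected_fp in points]
```

Calling `net_curve` on a list of per-cutoff triples returns the points of the blue curve in the same order as the input.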

Figure 4.

Relationship between CEs and known functional elements.

(A) Mean rejected substitution scores for the entire human genome, constrained elements predicted by GERP++, and known annotated exons, introns, and UTRs. (B) Breakdown of constrained element positions by region type.

Table 1.

Fraction of functional regions covered by constrained elements on a nucleotide level.

Figure 5.

Distributions (smoothed histograms) of 3-periodicity bias for known exons (red), introns (green), CEs that overlap exons (orange), and CEs not overlapping exons (blue).

Table 2.

Mean 3-periodicity bias for different types of regions.

Figure 6.

GERP++ vs phastCons predictions.

(A) Mean length (left), number (middle), and total length (right) of constrained elements predicted by GERP++ (blue) and phastCons (yellow). (B) Nucleotide-level fraction of annotated exons, introns, UTRs, and noncoding RNA genes covered by GERP++ (blue) and phastCons (yellow) predictions. (C, D) Histograms of the number of distinct predicted phastCons (yellow, C) and GERP++ (blue, D) constrained elements overlapping each annotated coding exon. Note the difference in scale on the y-axis. (E) A constrained region slightly over 200 base pairs in length that contains a known exon, as annotated by GERP++ (black track labeled ‘GERP++’) and phastCons (purple track labeled ‘Mammal El’). Note how phastCons fragments the exon into multiple CE predictions.

Figure 7.

Mean distribution of PolII binding sites by number of overlapping CEs over 9 ENCODE PolII ChIP experiments, for GERP++ and phastCons.
