Figure 1.
Phylogenetic hidden Markov model used by phastBias.
The model consists of four states: neutral evolution with no gBGC, neutral evolution with gBGC, evolutionary conservation with no gBGC, and evolutionary conservation with gBGC. gBGC is assumed to influence nucleotide substitution rates and patterns only on the lineage leading to a designated target genome (human or chimpanzee in this study). The model generalizes the phylo-HMM used by phastCons for prediction of evolutionarily conserved elements [28]. The state transition probabilities are defined by four parameters. See Methods and Table 1 for details.
Table 1.
Summary of HMM parameters.
Figure 2.
Power and accuracy for simulated data.
The plot shows true positive rates (TPR; fraction of true gBGC bases correctly predicted) and positive predictive values (PPV; fraction of predicted bases in true gBGC tracts) as a function of tract length. Results are shown for two sets of simulations, one assuming strong gBGC and the other assuming weaker gBGC (see Methods). Both the power (as measured by the TPR) and the accuracy (as measured by the PPV) of gBGC detection depend strongly on tract length. At shorter lengths (less than 3000 bp) power also depends strongly on the strength of gBGC, while accuracy does not. Both TPR and PPV are fairly high (80% or more) for tracts longer than 1 kb that have experienced strong gBGC, and for tracts longer than 1.6 kb that have experienced weaker gBGC.
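The per-base definitions of TPR and PPV given in this caption can be illustrated with a minimal sketch. The interval coordinates and the helper names (`to_base_set`, `tpr_ppv`) are hypothetical and are not taken from the phastBias software; tracts are assumed to be half-open `(start, end)` intervals on a single sequence.

```python
def to_base_set(intervals):
    """Expand half-open (start, end) intervals into a set of base positions."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def tpr_ppv(true_tracts, predicted_tracts):
    """Return (TPR, PPV) measured per base.

    TPR = true gBGC bases that were predicted / all true gBGC bases
    PPV = predicted bases inside true tracts / all predicted bases
    """
    true_bases = to_base_set(true_tracts)
    pred_bases = to_base_set(predicted_tracts)
    overlap = len(true_bases & pred_bases)
    tpr = overlap / len(true_bases) if true_bases else float("nan")
    ppv = overlap / len(pred_bases) if pred_bases else float("nan")
    return tpr, ppv


# Hypothetical example: one simulated 1 kb tract and one shifted prediction.
true_tracts = [(1000, 2000)]
predicted_tracts = [(1200, 2100)]
tpr, ppv = tpr_ppv(true_tracts, predicted_tracts)
print(f"TPR = {tpr:.3f}, PPV = {ppv:.3f}")
```

Expanding intervals into sets keeps the sketch short; for genome-scale data an interval-based overlap computation would be more appropriate.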
Table 2.
Summary of predicted gBGC tracts.
Figure 3.
Length distribution of predicted human gBGC tracts.
The predicted tracts average 1,018 bp in length, with a median value of 788 bp. The length distribution is roughly geometric except for a deficiency of short tracts (less than 600 bp) and a slight excess of long tracts. The deficiency of short tracts is typical for predictions based on a hidden Markov model and most likely primarily reflects limitations of power in this range. Nevertheless, the full distribution suggests that phastBias can identify tracts ranging from a few hundred to several thousand bases in length.
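As a rough check on the "roughly geometric" description, the sketch below compares the reported median with the median of a pure geometric distribution that has the same mean. It assumes a single constant per-base probability of leaving the gBGC state, which is a simplification for illustration and not the parameterization fitted in the study.

```python
import math

# Values reported in the caption.
mean_length = 1018      # bp
observed_median = 788   # bp

# Assumption: tract length is geometric on {1, 2, ...} with a constant
# per-base exit probability, so that the mean equals 1 / exit_prob.
exit_prob = 1.0 / mean_length

# Median of that geometric distribution: smallest k with P(L <= k) >= 0.5.
geometric_median = math.ceil(math.log(0.5) / math.log(1.0 - exit_prob))

print(f"geometric median for mean {mean_length} bp: {geometric_median} bp")
print(f"observed median: {observed_median} bp")
```

Under this assumption the geometric median is roughly 706 bp, below the observed 788 bp, which is in line with the deficiency of short tracts noted above.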
Figure 4.
Genomic distribution of predicted human and chimpanzee gBGC tracts.
Both human (blue) and chimpanzee (red) gBGC tracts are found throughout the genome, but tend to cluster and fall near telomeres. Chimpanzee gBGC tracts are displayed at the corresponding aligned positions in the human genome. The dense cluster of gBGC tracts near the centromere of chromosome 2 is the site of the fusion of two ancestral chromosomes on the human lineage. This region is telomeric in chimpanzee and was telomeric for much of human evolution. As illustrated by the magnified section of chromosome 1, human and chimpanzee tracts often occur in similar regions, but rarely overlap.
Table 3.
Recombination rates in gBGC tracts.
Figure 5.
Human polymorphism data indicate an ongoing preference for the fixation of G and C alleles in the predicted gBGC tracts.
(A) W→S changes in gBGC tracts have significantly higher derived allele frequencies than S→W changes in tracts. This plot is based on data for the YRI population from the 1000 Genomes Project [33]. Results for other populations were similar (data not shown). (B) The -norm, a measure of the degree of W→S bias in polymorphism data [17], is significantly higher in gBGC tracts than in the entire genome or in GC-matched control regions (see Methods). Recombination hotspots also show elevated values, but much less so than the predicted tracts. The -norm for human polymorphisms in “ortho-tracts” mapped from the chimpanzee genome is slightly elevated but significantly lower than that for human gBGC tracts. This is consistent with the lower human recombination rate in chimpanzee tracts compared to human tracts (Table 3). A similar species-specific skew in derived allele frequencies is seen in chimpanzee gBGC tracts (Figure S12). The error bars indicate 95% confidence intervals.
Table 4.
Enrichment for disease-associated regions.
Figure 6.
Illustration of genome browser track.
(A) UCSC Genome Browser screen shot focused on the LMAN1 gene (hg18.chr18:55,148,088–55,177,461). This region contains a predicted gBGC tract (black bar, second track from top); the “wiggle” track below shows the posterior probability of gBGC at each site computed by phastBias. The gBGC tract overlaps an exon of the gene (blue bar at top; adjacent chevrons indicate introns), a human accelerated region (2×HAR.23; short black bar), and a known missense variant from dbSNP (rs146465318; black tick mark). The phyloP-based conservation track (“Mammal Cons”) shows that phastBias can predict tracts that span both conserved and nonconserved regions. The phastBias track is available at http://genome-mirror.bscb.cornell.edu (hg18 assembly). Notably, this region has an elevated recombination rate (2.5 cM/Mb; not shown). (B) The multiple sequence alignment for a portion of the gBGC tract (hg18.chr18:55,171,469–55,171,548) illustrates the characteristic signature of gBGC. This interval has nine human-specific W→S substitutions over 80 nucleotides, four of which fall within the exon. Positions in other species that match the human sequence are indicated with a period.
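The W→S signature described in panel B can be tallied with a short sketch. It uses a naive column-by-column comparison of an inferred ancestral sequence with the human sequence, which is an assumption made for illustration; the study itself relies on a phylogenetic model rather than pairwise counting. The sequences and the function name `count_substitutions` are hypothetical.

```python
WEAK = {"A", "T"}     # W: weak (two hydrogen bonds)
STRONG = {"G", "C"}   # S: strong (three hydrogen bonds)


def count_substitutions(ancestral, derived):
    """Classify differences between two aligned sequences of equal length.

    Returns counts of W>S, S>W, and other (W>W or S>S) substitutions.
    Columns containing gaps or ambiguous characters are skipped.
    """
    if len(ancestral) != len(derived):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"W>S": 0, "S>W": 0, "other": 0}
    valid = WEAK | STRONG
    for anc, der in zip(ancestral.upper(), derived.upper()):
        if anc == der or anc not in valid or der not in valid:
            continue
        if anc in WEAK and der in STRONG:
            counts["W>S"] += 1
        elif anc in STRONG and der in WEAK:
            counts["S>W"] += 1
        else:
            counts["other"] += 1
    return counts


# Hypothetical toy alignment (not the LMAN1 sequence).
ancestral = "ATGCAT"
human = "GTGTAC"
counts = count_substitutions(ancestral, human)
print(counts)
```

A strong excess of W>S over S>W counts on the target lineage, as in the nine W→S substitutions over 80 nucleotides reported here, is the pattern that gBGC is expected to produce.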