Effects of cis and trans Genetic Ancestry on Gene Expression in African Americans

Variation in gene expression is a fundamental aspect of human phenotypic variation. Several recent studies have analyzed gene expression levels in populations of different continental ancestry and reported population differences at a large number of genes. However, these differences could largely be due to non-genetic (e.g., environmental) effects. Here, we analyze gene expression levels in African American cell lines, which differ from previously analyzed cell lines in that individuals from this population inherit variable proportions of two continental ancestries. We first relate gene expression levels in individual African Americans to their genome-wide proportion of European ancestry. The results provide strong evidence of a genetic contribution to expression differences between European and African populations, validating previous findings. Second, we infer local ancestry (0, 1, or 2 European chromosomes) at each location in the genome and investigate the effects of ancestry proximal to the expressed gene (cis) versus ancestry elsewhere in the genome (trans). Both effects are highly significant, and we estimate that 12±3% of all heritable variation in human gene expression is due to cis variants.

relative to the proportion of genetic variation in AA that is attributable to genome-wide ancestry, which is equal to $(0.16)(0.14/0.50)^2 = 0.0125$ (since the standard deviation of genome-wide ancestry in AA is 0.14 as compared to 0.50 for CEU+YRI).

Mixture model simulations.
To compare theoretical predictions to our actual data, we simulated a mixture model in which an underlying variable x is drawn from a mixture of two distributions and a noise variable y is subsequently added to produce an observed variable x+y.
Intuitively, x+y corresponds to differences observed in CEU vs. YRI, and x corresponds to heritable differences that are validated in AA. Let N(0,V) denote a normal distribution with mean 0 and variance V. We define x ~ N(0,c/p) with probability p or x is identically zero with probability 1-p, and define y ~ N(0,1-c). Under this model, the coefficient for x+y predicting x (or x plus independent noise) is equal to c. Setting p = 0.50 and c = 0.43, we estimated the regression coefficient for x+y predicting x when restricting to the top 10% of values of |x+y|.
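The mixture model described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis: the sample size, the random seed, the function names, and the use of an ordinary least-squares slope with intercept as "the regression coefficient" are all assumptions made here.

```python
import numpy as np

def simulate_mixture(n, p, c, rng):
    """Draw x from the two-component mixture and return (x, x + y)."""
    nonzero = rng.random(n) < p                       # component indicator
    x = np.where(nonzero, rng.normal(0.0, np.sqrt(c / p), n), 0.0)
    y = rng.normal(0.0, np.sqrt(1.0 - c), n)          # noise, variance 1 - c
    return x, x + y

def slope(predictor, response):
    """Least-squares slope (with intercept) of response on predictor."""
    return float(np.cov(predictor, response)[0, 1] / np.var(predictor, ddof=1))

rng = np.random.default_rng(0)                        # arbitrary seed (assumption)
p, c = 0.50, 0.43                                     # settings from the text
x, observed = simulate_mixture(200_000, p, c, rng)    # n is an assumption

# Coefficient for x+y predicting x over all simulated points (theory: c).
full_slope = slope(observed, x)

# Same coefficient restricted to the top 10% of |x+y|.
cutoff = np.quantile(np.abs(observed), 0.90)
top = np.abs(observed) >= cutoff
top_slope = slope(observed[top], x[top])

print(f"all points: {full_slope:.3f}   top 10% of |x+y|: {top_slope:.3f}")
```

Because Var(x) = p(c/p) = c and Var(x+y) = c + (1-c) = 1, the unrestricted coefficient should be close to c. The restricted coefficient need not equal c, since the largest values of |x+y| are drawn disproportionately from the non-zero component of the mixture.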
Limited sample size of AA data precludes analysis excluding CEU and YRI
We determined that, due to limited sample size and the relatively low variability in genome-wide ancestry among AA individuals (standard deviation of 14% as compared to 50% for CEU+YRI; note that effective sample size scales with the square of this quantity), the AA data contain too much sampling noise for an analysis excluding CEU and YRI to be useful. As a demonstration of this, we computed the previously described statistic $1-\pi_0$, which estimates the proportion of causal data points from an observed distribution of P-values, to try to infer the fraction of common SNPs whose frequency varies with continental ancestry based on genotype data from the 89 AA samples. Based on standard $F_{ST}$-based models, it is commonly believed that 100% of common SNP frequencies vary with continental ancestry to at least a small extent. However, based on genotype data from the 89 AA samples, which strongly replicate genetic differences between CEU and YRI (c = 0.96 above), the value of $1-\pi_0$ was equal to 28%, which is much lower than 100%. The statistic $1-\pi_0$ represents a lower bound which has proven useful in a variety of contexts, but our analysis shows that this lower bound may not be very informative in data sets of limited sample size, in which causal data points may have P-values that are not statistically significant. This observation also applies to the use of this statistic, or other lower bounds, to estimate the proportion of genes with population differences in gene expression. On the other hand, our validation analyses, which analyze AA data in conjunction with CEU and YRI data, are not affected by the limited sample size of the AA data (see Materials and Methods).
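The behavior of this lower bound under low power can be illustrated with simulated P-values. The text does not specify how the statistic was computed, so the sketch below assumes a common fixed-threshold estimator, in which the fraction of P-values above a cutoff lambda is divided by (1 - lambda). The estimator choice, lambda = 0.5, the number of tests, and the effect size are all hypothetical.

```python
import math
import numpy as np

def pi0_estimate(pvalues, lam=0.5):
    """Fixed-threshold estimate of the null proportion (assumed estimator)."""
    pvalues = np.asarray(pvalues, dtype=float)
    return min(1.0, float(np.mean(pvalues > lam)) / (1.0 - lam))

def two_sided_pvalues(z):
    """Two-sided normal P-values, P = erfc(|z| / sqrt(2))."""
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

rng = np.random.default_rng(0)     # arbitrary seed (assumption)
m = 20_000                         # hypothetical number of tests
effect = 0.5                       # hypothetical weak effect, in z-score units

# Every test is truly non-null, but each has little power.
z = rng.normal(effect, 1.0, m)
one_minus_pi0_weak = 1.0 - pi0_estimate(two_sided_pvalues(z))

# Reference case: pure nulls have uniform P-values.
one_minus_pi0_null = 1.0 - pi0_estimate(rng.random(m))

print(f"all non-null, weak effects: 1 - pi0 = {one_minus_pi0_weak:.2f}")
print(f"all null:                   1 - pi0 = {one_minus_pi0_null:.2f}")
```

In this toy setting the true proportion of non-null tests is 100%, yet the estimate falls far below that, which is the same qualitative pattern as the 28% value reported for the AA genotype data.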

Variation in local ancestry across the genome
We computed local ancestry estimates $\gamma_{gs}$ for sample s at gene g as described in the main text.
The mean ± SD of $\gamma_{gs}$ across samples s and genes g was 21 ± 29%. The standard deviation matched the theoretical expectation for a sample with 21% genome-wide ancestry, which has 2 European copies with probability $(0.21)^2$, 1 European copy with probability $2(0.21)(0.79)$, or 0 European copies with probability $(0.79)^2$. We also computed the average ancestry $\gamma_g$ (i.e., the average across samples s of $\gamma_{gs}$) for each gene g. The mean ± SD of $\gamma_g$ across genes g was 21 ± 3%, as expected under a binomial model (variance equal to $(0.21)(0.79)/(2 \cdot 89)$). Values of $\gamma_g$ ranged from 13% to 31%, but these deviations from the mean of 21% were not statistically significant after applying a Bonferroni correction (either for 4,015 genes tested, or for hundreds of independent loci based on ancestry block sizes on the order of 10 Mb).
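The two standard deviations quoted above can be checked with a short calculation. The sketch assumes that each sample carries two chromosomes drawn independently with a 21% European probability and that the per-gene average is taken over the 89 AA samples. The normal approximation for the most extreme gene and the figure of 300 independent loci (standing in for "hundreds") are illustrative assumptions.

```python
import math

p_eur = 0.21       # mean European ancestry (from the text)
n_samples = 89     # number of AA samples (from the text)

# Local ancestry is 0, 0.5, or 1 for 0, 1, or 2 European copies.
probs = {
    0.0: (1 - p_eur) ** 2,
    0.5: 2 * p_eur * (1 - p_eur),
    1.0: p_eur ** 2,
}
mean_local = sum(v * q for v, q in probs.items())
var_local = sum((v - mean_local) ** 2 * q for v, q in probs.items())
sd_local = math.sqrt(var_local)                    # per-sample SD

# SD of the per-gene average across samples: p(1-p) / (2 * n_samples).
sd_gene_mean = math.sqrt(var_local / n_samples)

# Two-sided normal P-value for the most extreme reported gene (31%).
z_max = (0.31 - p_eur) / sd_gene_mean
p_max = math.erfc(z_max / math.sqrt(2.0))

n_loci = 300                                       # hypothetical locus count
bonferroni_threshold = 0.05 / n_loci

print(f"SD of local ancestry:  {sd_local:.3f}")
print(f"SD of per-gene mean:   {sd_gene_mean:.4f}")
print(f"P-value at 31%: {p_max:.2g} vs threshold {bonferroni_threshold:.2g}")
```

Under these assumptions the per-sample standard deviation is about 29% and the per-gene standard deviation about 3%, and the most extreme per-gene value does not pass the Bonferroni threshold even with the smaller number of tests.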
Relationship between $c$, $c_{\mathrm{cis}}$ and $c_{\mathrm{trans}}$.
The variation in local (cis) ancestry $\gamma_{gs}$ (standard deviation = 29%; see above) is considerably larger than the variation in genome-wide (trans) ancestry $\theta_s$ (standard deviation = 14%; see Results). In fact, fixing g and letting s vary, we can view $\gamma_{gs}$ as binomially sampled from $\theta_s$, i.e., $\gamma_{gs}$ is equal to $\theta_s$ plus sampling noise. We confirmed this by computing for each g the regression coefficient (across samples s) for $\theta_s$ predicting $\gamma_{gs}$. The average value of this regression coefficient (averaging across g) was equal to 0.97. Under the assumption that $\gamma_{gs}$ is equal to $\theta_s$ plus sampling noise $\sigma_{gs}$, and that the magnitude of $c_{\mathrm{cis}} a_g \sigma_{gs}$ is small relative to the overall noise variance $\nu_{gs}$ of gene expression level $e_{gs}$, we would expect $c \approx c_{\mathrm{cis}} + c_{\mathrm{trans}}$, since
\[
\begin{aligned}
e_{gs} &= c_{\mathrm{cis}} a_g \gamma_{gs} + c_{\mathrm{trans}} a_g \theta_s + \nu_{gs} \\
       &= c_{\mathrm{cis}} a_g (\theta_s + \sigma_{gs}) + c_{\mathrm{trans}} a_g \theta_s + \nu_{gs} \\
       &\approx (c_{\mathrm{cis}} + c_{\mathrm{trans}}) a_g \theta_s + \nu_{gs}.
\end{aligned}
\]
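The argument above can be illustrated with a small simulation for a single gene. All numerical settings here are hypothetical: genome-wide ancestry is drawn from a beta distribution matched to a mean of 21% and a standard deviation of 14%, the cis and trans coefficients and the noise level are arbitrary, and the sample size is made very large so that the slopes are estimated precisely (the actual data have 89 samples).

```python
import numpy as np

rng = np.random.default_rng(1)            # arbitrary seed (assumption)
n = 200_000                               # illustrative; far more than 89

# Genome-wide ancestry theta_s: beta distribution with the stated moments.
mean_theta, sd_theta = 0.21, 0.14
k = mean_theta * (1 - mean_theta) / sd_theta**2 - 1
theta = rng.beta(mean_theta * k, (1 - mean_theta) * k, n)

# Local ancestry gamma_gs: two chromosomes sampled binomially from theta_s.
gamma = rng.binomial(2, theta) / 2.0

def slope(predictor, response):
    """Least-squares slope (with intercept) of response on predictor."""
    return float(np.cov(predictor, response)[0, 1] / np.var(predictor, ddof=1))

# Hypothetical coefficients and expression difference for one gene.
c_cis, c_trans, a_g = 0.1, 0.3, 1.0
nu = rng.normal(0.0, 0.5, n)              # residual expression noise
e = c_cis * a_g * gamma + c_trans * a_g * theta + nu

slope_gamma_on_theta = slope(theta, gamma)      # expected near 1
slope_e_on_theta = slope(a_g * theta, e)        # expected near c_cis + c_trans

print(f"theta predicting gamma:  {slope_gamma_on_theta:.3f}")
print(f"a_g*theta predicting e:  {slope_e_on_theta:.3f}")
```

Since the sampling noise in local ancestry is uncorrelated with genome-wide ancestry, regressing expression on genome-wide ancestry alone recovers a coefficient close to the sum of the cis and trans coefficients, as in the approximation above.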