Genetic Background Drives Transcriptional Variation in Human Induced Pluripotent Stem Cells

Human iPS cells have been generated using a diverse range of tissues from a variety of donors using different reprogramming vectors. However, these cell lines are heterogeneous, which presents a limitation for their use in disease modeling and personalized medicine. To explore the basis of this heterogeneity we generated 25 iPS cell lines under normalised conditions from the same set of somatic tissues across a number of donors. RNA-seq data sets from each cell line were compared to identify the majority contributors to transcriptional heterogeneity. We found that genetic differences between individual donors were the major cause of transcriptional variation between lines. In contrast, residual signatures from the somatic cell of origin, so called epigenetic memory, contributed relatively little to transcriptional variation. Thus, underlying genetic background variation is responsible for most heterogeneity between human iPS cell lines. We conclude that epigenetic effects in hIPSCs are minimal, and that hIPSCs are a stable, robust and powerful platform for large-scale studies of the function of genetic differences between individuals. Our data also suggest that future studies using hIPSCs as a model system should focus most effort on collection of large numbers of donors, rather than generating large numbers of lines from the same donor.


Fragment count and FPKM calculation
For each gene in the gene annotation, we counted the number of sequenced fragments of which one or other sequenced end lay within an annotated ENSEMBL gene exon. Let Y ij be the fragment count of the gene j (j = 1, . . . , L) for a sample i (i = 1, . . . , N). We had a total L = 55, 804 genes from our gene annotation and a sample size of N = 46. We calculated log 2 FPKM (fragments per kilobase of exon per million fragments mapped), y ij , for sample i at gene j as follows: where l j is the mean spliced transcript length in kilobase of gene j calculated from the gene annotations and Y i = ∑ L j=1 Y ij /10 6 is the total fragment count in megabase for the sample i.

GC correction
We corrected for varying amplification efficiency of different GC contents using the method described in [2]. We first calculated mean GC content for each gene, which is mean G/C base counts over mean cDNA length from all possible transcripts of a gene in our annotation. Then we assigned all genes to 200 approximately equally sized bins {B 1 , . . . , B 200 } based on the GC content. Let S il = ∑ j∈B l Y ij be the number of fragments in bin l from sample i. For each bin, for each sample, we calculated the log 2 relative enrichment, F il , of fragments in each GC bin, such that F il = log 2 S il /S ·l S i· /S ·· , where S ·l = ∑ i S il , S i· = ∑ l S il and S ·· = ∑ i,l S il . For each sample, we fitted a smoothing spline to the plot of F il against the mean GC content for the bin. We used the R function smooth.spline with a smoothing parameter of 1. The result showed significant GC effect on each sample (Supplementary Figure 18). LettingF il be the predicted value of the smoothing spline for bin l in sample i, we set c ij = F il , where c ij is the predicted log 2 over/under-representation of fragment count of gene j ∈ B l in sample i. Then the normalised FPKM was obtained bỹ For simplicity, we calculated FPKMs based on genes instead of exons.

Variance component analysis using a linear mixed model
Although hierarchical clustering illustrated major sources of transcriptional variations between adult and stem cells or between tissues in adult cells, more subtle sources of variation, such as tissue of origin variation in iPSCs or ES/iPS variation, were difficult to discern. To provide quantitative information on the relative importance of different sources of variation we employed the following variance components analysis using a linear mixed model. We assumed that the normalised log 2 FPKMs for gene j,ỹ j = (ỹ 1j , . . . ,ỹ Nj ) , can be modelled as a linear combination of fixed and normally distributed random effects, such that , where Z 1 , . . . , Z 5 are design matrices of 0/1 values, in each of which ith row specifies the category that the sample i belongs to, and I d indicates the d-dimensional identity matrix.
The first term in the model was an intercept, b 1j . Because the primary goal of this part of our analysis was to quantify variation rather than estimate mean expression levels, we estimated the parameters of the distribution from which the means were drawn, rather than the means themselves. The intercept term captures three properties of the distribution of gene expression levels, subdivided into two groups for adult and stem cells: the grand means of gene expression across all genes, corresponding to the two elements of the vector β 1 , the variances of gene expression across all genes corresponding to the first and second diagonal elements of the two-dimensional variance matrix D 1 , δ 2 11 and δ 2 12 and the covariances in expression levels corresponding to the off-diagonal elements of the variance matrix D 1 , δ 2 12 . The second term, b 2j , was specific to only stem cells and captured variation between iPS and ES cells (variance parameter δ 2 2 ) The third term, b 3j , captured variation between the different adult somatic cells (variance parameter δ 2 31 ) and between different somatic tissues of origin for iPSCs (variance parameter δ 2 31 ). The fourth term, b 4j captured the variation between individuals. We were specifically interested in whether between-individual variation differed between adult and iPS cells and so we assigned different variance parameters for each, δ 2 41 and δ 2 42 for adult and stem cells, respectively. The final term, b 5j , captured variation between different sequencing batches. We assumed sequencing batch was independent of cell or tissue type, and so we introduced a common variance parameter δ 2 5 for this term. The fixed effect parameters β 2 , . . . , β 5 captured grand means of expression levels under specific conditions (cell type, tissue type, etc.). However, because of our unbalanced study design, the fixed effect parameters are not uniquely determined and their biological interpretability is therefore limited. We note that estimates of the variance parameters are still uniquely determined in an unbalanced sampling design.

Scaling of gene expression levels
A standard feature of gene expression data is a greater variance in expression in lowly expressed genes (see e.g., [8]). To capture this aspect of our data, all variance parameters δ = (δ 11 , δ 12 , δ 13 , δ 21 , δ 22 , δ 31 , δ 32 , δ 41 , δ 42 , δ 5 , σ) were scaled by precision for each gene j, which is identically and independently Gamma distributed. In our data, the posterior mean of τ j given by (2) clearly showed that the higher the expression level is the higher the precision is (Supplementary Figure 19). Technically, the assumption that τ follows a Gamma distribution provided two advantages. First, this limited the range of τ j in (0, ∞) thereby avoiding parameter uncertainty. Second, the Gamma distribution is a conjugate prior distribution forỹ j given τ j , such that , the compound fixed effect vector β = (β 1 , . . . , β 5 ) and the combined design matrix Z = (Z 1 , . . . , Z 5 ), so that the marginal distribution ofỹ j is the multivariate student t distributioñ with mean Zβ, variance V and the ν degrees of freedom. The multivariate student t distribution is convenient for parameter estimation using an EM algorithm.

Parameter estimation
We used a standard EM algorithm [10] to estimate all parameters θ = {β, δ, ν}, where τ j (j = 1, . . . , L) are thought to be unobserved variables. Because the Gamma distribution on τ j is a conjugate prior forỹ j given τ j , the posterior distribution of τ j givenỹ j is also a Gamma distribution such that Therefore the posterior means can be obtained without any numerical integration in each E-step, where ψ[·] is the digamma function.
In the M-step, we alternatively maximised a series of Q-functions using a Newton-Raphson method with the posterior means of τ j and log τ j , such that with respect to the parameters β, δ and ν. Although the mean Zβ is not of interest here, it is required to estimate δ and ν. Because Z is essentially degenerate, β is not uniquely determined and, therefore, to maximise the likelihood function, we use a generalised inverse matrix A + = A + AA + to obtain the maximum likelihood estimator

Heteroscedastic model
Our initial model assumed that the residual error has the same variance for all factors in the model (homoscedasticity). We next extended this model to allow the residuals in different cell types to take different variances, known as heteroscedasticity.We incorporated a variety of different error parameters on each ε ij according to the cell/tissue type of the sample i, including Fig. 1d in the main text.

Estimation of Primary eQTL SNP variation
To further dissect the individual variance into genetic and non-genetic variances, we introduced an additional covariate in (1), which is the primary eQTL SNP found in gEUVADIS project [15]. Because the SNP genotype is only available for iPS lines but not for ES lines in our study, we modelled the FPKM for 25 iPS lines, such that where b 1j is a scalar mixed intercept drawn from N (α 1 , δ 2 1 /τ j ) and 1 is a vector of all 1s. Following b 3j , . . . , b 5j and Z 3 , . . . , Z 5 are the mixed effects and the design matrices defined as before. We don't have the component b 2j because we have no ES lines in the model. The vectorg j is the normalised SNP genotype at the primary eQTL for the gene j. We first estimated the population allele frequencyπ j from the SNP genotypes of CEU and GBR populations in the 1000 Genomes Project and then normalised the raw SNP genotype g j of ours bỹ becomes an unbiased, positive semi-definite estimator for the kinship matrix [14].
There is a problem when we compare the variance explained by the component with others, because the variance is essentially different for each individual i at gene j; One solution would be to calculate the overall transcriptional variation given the set of eQTL SNPs. By integrating out τ j from (3), such that we have the averaged variance across all genes where the diagonal elementK ii → 1/2 as L → ∞ for an outbred individual. The averaged variance is compatible with the averaged residual variance Therefore the overall variance explained for the set of eQTL SNPs is given byσ 2 g /(σ 2 g +σ 2 ).

Differential Expression Analysis
Although our variance components analysis suggested relatively small effects of tissue of origin, this could mask subtle effects at important individual genes. We next sought to identify genes whose expression in iPSCs more closely resembled their somatic progenitors or which appeared to be significantly diverged from both ESCs and somatic cells. We performed differential expression analysis using two different approaches; one is using DESeq method [8] and a novel hierarchical model similar to that introduced in [9]. For both these methods, we analysed different tissues (skin fibroblast, keratinocyte and EPC) independently.
Our analysis compared mean expression levels of all cell types (adult cell, iPSC and ESC) simultaneously. We modelled the fragment count Y ij for sample i at a gene j using a negative binomial distribution, such that with the cell type specific means and the offset K ij = 2 c ij Y i correcting GC content and total fragment count bias. Note that the over-dispersion parameter θ j was treated in different way for the two approaches as described in subsequent sections. Then we introduced the null hypothesis where all cell types share the same mean expression level: compared with the following four alternative hypotheses: where the superscript "A", "I" or "E" stands for adult cell, iPS cell or ES cell, respectively.

Gene selection
We filtered out low or unexpressed genes in advance by only including those genes with mean FPKM> 1 in at least one cell type (A, I or E). We also filtered out genes whose biotypes defined by Ensembl are snRNA, snoRNA, misc RNA, miRNA and rRNA, because, those expression levels cannot be correctly captured by our RNA extraction protocol. Finally, because read alignments appeared to be unreliable in many known pseudogenes, these were also removed from our annotation.

P-value method using DESeq
To identify individual genes of interest, we used gene-by-gene P-value tests. We employed DE-Seq [8] to estimate the over-dispersion parameter θ j across all genes under the null hypothesis where all cell types share the same mean expression level. We performed hypothesis testing for each of the four alternative hypotheses in (4) and selected the most significant hypothesis that gives the minimum P-value. We used nbinomGLMTest implemented in the DESeq package to obtain the P-values.
Because we selected the minimum P-value from four alternative hypotheses, our P-value distribution under the null was slightly skewed towards 0. To correct for this, we estimated an empirical false discovery rate (FDR) from one million permuted data sets, where sample labels were permuted with respect to the expression data, the same differential expression tests were performed and the minimum p-value for the four alternative hypotheses was selected. The genome-wide p-value thresholds corresponding to a 5% FDRs were then estimated separately for each tissue of origin from the empirical distributions as: Fibroblasts/F-iPSCs -p<0.015; Keratinocyte/K-iPSCs -p<0.009;EPCs/E-iPSCs -p<0.004).

Hierarchical model
Although the P-value method gave us a rough estimate of the proportion of each alternative hypothesis that different gene will fall into, estimating these proportions is based on an arbitrary FDR threshold, such as 5%. Therefore, to estimate the proportions of genes falling into either the null or alternative hypotheses across all genes, we employed the following hierarchical model similar to that proposed in [9].
We first introduced probability Π 0 for the invariant expression (null hypothesis) and Π 1 = 1 − Π 0 for differential expression (alternative hypotheses). The probability of observing fragment counts for all samples at gene j, Y j = (Y 1j , . . . , Y Nj ), was defined as where P 0 j denotes the probability of the fragment counts given that there is no expression difference among the three cell types and P 1 j denotes the probability of the fragment counts given that the gene is differentially expressed between the three cell types. Given that the gene j is differentially expressed, the probability of the observed fragment counts can be given by where P 1 jk is the probability of the fragment counts observed under the alternative hypothesis k and π k is the prior probability that a gene belongs the alternative hypothesis k such that ∑ 4 k=1 π k = 1. Overall, the likelihood was given by and the mixture parameters Π = {Π 0 , Π 1 } and π = {π k } 4 k=1 can be estimated by using a standard EM algorithm [10]. Note that π k in our model does not depend on j which suggests that the prior distribution is essentially non-informative.
The probability P 0 j and {P 1 jk } 4 k=1 were assumed to be given by the following negative binomial distribution with mean parameter log λ ij = x i β j , where effect size β j was defined according to the null and alternative hypotheses, such as Complex with a compatible design vector x i indicating the cell type for sample i (x i = 1 for the invariant expression), where the superscript "A", "I" or "E" stands for adult cell, iPS cell or ES cell, respectively and the superscripts "I/E", "A/E" or "A/I" denotes shared mean for two different cell types (i.e., I/E stands for λ I j = λ E j ). This definition to that in (4), the mean expression level λ * j was reparametrized in the logarithmic scale (e.g., log λ A j = log λ I j = β A/I j and log λ E j = β E j for a transcriptional memory gene) so as to introduce prior distributions on β j as well as θ j ; where N (µ, Σ) denotes a normal distribution with mean µ and variance-covariance matrix Σ and G(α, β) denotes a gamma distribution with shape parameter α and rate parameter β. The joint probability was then given by In order to calculate P 0 j and {P 1 jk } 4 k=1 , we needed to integrate out β j and θ j from the equation. There was no closed form for the negative binomial distribution unlike the normal gamma mixture in [9]. Therefore we used the Laplace approximation such that Here the maximum likelihood estimator of β j and the maximum modified profile likelihood estimator of θ j such thatβ were obtained by using a quasi-Newton method. The hyperparameters Σ, κ and δ were not integrated out from the joint probability in (5), instead, we assigned Σ = 10 3 I and κ = δ = 10 −3 suggesting the prior distributions were almost non-informative compared with the range ofβ j andθ j across all genes (data not shown).

Gene ontology analysis
Genes that were significantly differentially expressed (either transcriptional memory or aberrant reprogramming) at a 5% FDR, and showed a more than a 2-fold difference in expression between ESCs and IPSCs were selected for Gene Ontology analysis. All expressed genes in each tissue, their derived iPSCs or the ESCs were used as a background. For example, in F-iPSCs, any gene that was defined as expressed in either fibroblasts, F-iPSCs or ESCs was used as the background. All GO analysis was performed using the topGO R package [12]

Transposable element transcription
We downloaded annotated LINE and LTR repetitive elements as defined by the UCSC genome browser. We removed all elements that overlapped an annotated transcribed region as defined in Ensembl GRCh37 assemby 69. Using these criteria we identified 533,254 and 305,339 LINE and LTR elements, respectively, in the human genome. We further removed any elements whose average mapability for 75bp fragments was less than 1, as defined in the wgEncode-CrgMapabilityAlign75mer annotation available from UCSC and counted the total number of fragments that overlapped each annotated repetitive element in our data.

Allele-specific Expression Analysis
We next focused on the effects of varying genetic background in our iPSCs. Our initial variance component analysis suggested that differences between individuals had a more significant impact on genome-wide gene expression in iPSCs than factors such as somatic tissue of origin. However, the number of individuals in our data set was small, a standard expression quantitative trait loci (eQTL) mapping experiment (see e.g., [15]) was not possible. Therefore we performed the allele-specific expression analysis [2] to identify genes whose expression levels are affected by individual genetic variations and showed that similar effects could be observed in the iPSCs that were previously observed in adult human tissues [15].
For this analysis, we genotyped all our individuals (S2, S4, S5 and S7) except for H9 human ES sample using Illumina HumanOmni2.5-8 BeadChip, followed by imputation with the 1000 Genomes Project data and haplotype phasing using Beagle software [5]. For each individual, we identified all heterozygous exonic SNPs and counted the numbers of fragments separately for each of the two different alleles at the SNP loci in our RNA-seq samples.
Let m ijk and n ijk be the allele-specific fragment count for the mutant allele and the total fragment count at SNP k (k = 1, . . . , L j ) in an exonic region of gene j for the iPSC sample i (i = 1, . . . , M), where i stands for the iPSC RNA-seq sample identifier for an individual (not the identifier for the four individuals in our study). For each gene j, we explicitly modelled the allelic imbalance π jk at heterozygous SNP k using a beta-binomial distribution with an over dispersion parameter θ j across all iPSC samples from one individual, such that m ijk ∼ BB(n ijk ; π jk , θ j ), where β j is the common effect size of allelic imbalance for gene j across all SNPs and h jk = 1 mutant allele on maternal haplotype, −1 mutant allele on paternal haplotype.
Note that maternal/paternal haplotype was arbitrarily determined during the phasing. Then we asked whether the allelic imbalance is significant at gene j in terms of the likelihood ratio P-value between the following two hypotheses: The parameter estimation followed by hypothesis testing was performed for each gene, independently. The over dispersion parameter θ j was estimated by using the modified profile likelihood [4] given the maximum likelihood estimator of β j , rather than using the maximum likelihood estimator. Because of the small sample size, the maximum likelihood estimator would tend to underestimate the over-dispersion, thereby we would overestimate the number of significant genes with a same FDR threshold.

Comparison with eQTL analysis in Lappalainen et al.
For subsequent allele-specific expression analysis, we re-analyzed the RNA-seq data in [15] to identify genes with eQTLs in lymphoblastoid cell lines. We downloaded the raw fastq files for 162 CEU and GBR HapMap individuals at http://www.geuvadis.org/web/geuvadis. Reads were aligned and FPKMs were calculated with GC correction as described previously. We performed principal component (PC) analysis to detect unknown confounding factors in the experiment. Then, for each gene, we performed a linear regression of the FPKM values on the first 8 PCs, and replaced the FPKM values with their residuals in that regression as described in [2]. We finally performed cis-eQTL mapping for each gene using a single linear regression with a SNP genotype located within 200Kb from both ends of the gene. The SNP with the minimum P-value for each gene was selected as the cis-eQTL SNP (referred to as eSNP in this text). The high-expression (+) and low-expression (−) alleles at the eSNP were defined by the sign of regression coefficient. To estimate FDR, we performed permutation of phenotypes as described in [2] and defined eQTL genes and eSNPs by setting study-wide false-discovery rate (FDR) to 5%. In Figure 2b in the main text, we asked SNP genotypes for our individual at the eSNP and classified those into homozygote of high-expression alleles (+/+), heterozygote (+/−) and homozygote of low-expression alleles (−/−). Then, for genes with an eQTL at an FDR of 5%, we plotted the normalised FPKMỹ ij against the classified genotypes.
In Figure 2c in the main text, we identified all individuals heterozygous for the eSNPs (+/−). Then, in these individuals we classified the "maternal" haplotype into the high-expression haplotype (+) or low-expression haplotype (−) according to the high/low-expression allele on the haplotype. Finally, for genes with an eQTL at an FDR of 5%, we plotted logit −1β j against the high/low-expression haplotype.