Host Genetic Variants and Gene Expression Patterns Associated with Epstein-Barr Virus Copy Number in Lymphoblastoid Cell Lines

Lymphoblastoid cell lines (LCLs) are commonly used in molecular genetics, supplying DNA for the HapMap and 1000 Genomes Projects, used to test chemotherapeutic agents, and informing the basis of a number of population genetics studies of gene expression. The process of transforming human B cells into LCLs requires the presence of Epstein-Barr virus (EBV), a double-stranded DNA virus which through B-cell immortalisation maintains an episomal virus genome in every cell of an LCL at variable copy numbers. Previous studies have reported that EBV alters host-gene expression and EBV copy number may be under host genetic control. We performed a genome-wide association study of EBV genome copy number in LCLs and found the phenotype to be highly heritable, although no individual SNPs achieved a significant association with EBV copy number. The expression of two host genes (CXCL16 and AGL) was positively correlated and expression of ADARB2 was negatively correlated with EBV copy number in a genotype-independent manner. This study shows an association between EBV copy number and the gene expression profile of LCLs, and suggests that EBV copy number should be considered as a covariate in future studies of host gene expression in LCLs.


Introduction
Epstein-Barr virus (EBV) is a ubiquitous human gammaherpesvirus. Following primary infection EBV establishes lifelong persistent infection through latent infection of memory B cells where the virus genome is transcriptionally silent [1,2]. Reactivation from latency is required for the production of infectious EBV, with such lytic EBV replication being under the control of host and virus factors. In particular, terminal differentiation of memory B cells into plasma cells can lead to EBV lytic reactivation [3]. The mechanisms of host induction of EBV lytic replication are incompletely understood, but periodic shedding of EBV in saliva [4] and variation in saliva virus load between people [5] suggest host genetic variation may contribute to EBV lytic cycle induction. Lymphoblastoid cell lines (LCLs) are human B cells immortalised in vitro by EBV and are a useful model of latent infection of B cells. Previous studies on LCLs have shown that when multiple LCLs are derived from the same individual, inter-individual variation in EBV copy number in LCLs is greater than intraindividual variation [6]. A study of the impact of EBV copy number on the gene expression profiles of 198 HapMap LCLs reported that expression of 125 human genes was significantly correlated with EBV copy number [7]. A comparison of Epstein-Barr virus copy number in 62 adult and paediatric LCLs found considerable inter-individual variation in EBV copy number that correlated with expression of immediate-early viral lytic genes BRLF1 and BZLF1, suggesting that spontaneous lytic reactivation is the cause of high EBV genome copy numbers in a subset of LCLs. After the addition of acyclovir, a drug which inhibits viral reactivation, Davies et al. showed EBV genome copy numbers fall in LCLs, and return to previous high levels after the removal of acyclovir [8]. This suggests that spontaneous lytic reactivation may be under the control of cell-intrinsic factors. When the viral gene expression profiles of LCLs were compared, using RNAseq data from multiple experiments from different laboratories, Arvey et al. [9] reported two major EBV gene expression profiles: latency type III and a lytic pattern of expression. There is evidence of both BZLF1 expression and virus particle production in some LCLs [10]. We have therefore hypothesised that high EBV copy number in LCLs is the result of poor host cell control of the EBV latentlytic cycle switch, and may be under the control of host genetic factors.
Genome-wide association studies have been successfully used to identify the host genes involved in the pathogenesis of infectious disease [11,12,13,14,15]. Two genome-wide association studies of genetic control of antibodies to herpesviruses have been performed. A study of EBV antibody titres in ,2000 individuals identified 15 loci exceeding genome-wide significance associated with either the quantitative or discrete trait of antibody titre [16]. By contrast, a similarly-sized study of cytomegalovirus (CMV) antibody response, a betaherpesvirus which also establishes lifelong latent infection in humans, did not find any genome-wide significant associations [17]. Other studies of host genetic response to herpesvirus infection and lytic reactivation have been limited to family linkage [18] and candidate gene [19] studies of herpes simplex virus-induced disease, and small studies of susceptibility to infection of chickens with an avian herpesvirus, Marek's disease virus [20,21]. As yet, there have been no attempts to characterise common human genetic polymorphisms associated with cellintrinsic response to EBV infection.
Here we describe a study to identify human genetic variants associated with Epstein-Barr virus genome copy number in the HapMap [22] and 1000 Genomes [23] LCLs, incorporating sequencing and genotyping data from the HapMap and 1000 Genomes projects. We also investigate differences in gene expression associated with EBV genome copy number using publicly available gene expression data for a subset of the HapMap 3 samples [24].

Relative EBV copy number in LCLs and between-population comparisons
We determined the relative EBV genome copy number for 915 LCLs from the HapMap and 1000 Genomes populations using quantitative PCR (qPCR). The qPCR assay had an 8 log 10 dynamic range from 1610 9 to 1610 2 copies/reaction ( Figure S1) and an analytical sensitivity of 100 copies/reaction. The PCR efficiency was 99.4%. The samples assayed included individuals from 12 populations (Table 1); and EBV copy numbers of all samples were within the dynamic ranged of the qPCR. Across all populations the mean relative (to single copy host gene) EBV copy number per LCL is 23.36 with a range of 16.07-29.02 (SD62.03), corresponding to an absolute range of 1 copy per cell to 350 copies per cell (Figure 1 A). We examined the trait of EBV copy number in different populations (Figure 1 B), and found significant differences in the mean EBV copy number between the populations (ANOVA p,2.2610 216 ). Interestingly, apparently similar ethnic groups in different geographical areas have different mean EBV genome copy numbers. The Denver Han Chinese (CHD) and Beijing Han Chinese (CHB) populations have statistically significantly different means (difference = 23.23, 95% CI 24.34-22.12), p = 1.17610 28 ), although the CHB and Japanese from Tokyo (JPT) do not differ from one another significantly. Some European ancestry populations also differ from one another in their mean EBV copy number: Toscani from Italy (TSI) differ from CEU (European ancestry in Utah, USA) (difference = 21.63, 95% CI 22.52-20.74, p = 2610 27 ).

Association testing
Using the entire sample set we performed a genome wide association study for common host genetic variants associated with LCL EBV copy number. To determine if the EBV that immortalised the LCLs was the laboratory strain B95.8 rather than outgrowth of spontaneous LCLs with wild type EBV we assembled the EBV genomes from 77 CEU and YRI LCLs. We looked for the presence of the deletion specific for the B95.8 genome, which was present in every LCL studied. The frequency of spontaneous LCLs containing wild type EBV, and/or LCLs immortalised with B95.8 and co-infected with wild type EBV, is therefore less than 1 in 77.
After sample-level quality control, genome-wide sequence and genotype data were available for 899 samples from the 1000 Genomes Phase I and HapMap Phase III consensus releases. Mixed-effects modelling [25] was used to test each variant individually for association with EBV copy number in LCLs. Samples without full human genome sequence data were imputed using 1000 Genomes information [26]. Variants were taken forward to association analysis if they were observed to vary in all 12 populations studied with a frequency of .1% and a minimum imputation quality of 0.9. This created a set of 1.6610 6 SNPs in common between the samples. The sample size of phenotyped LCLs with genotype or whole genome sequence information available is 899 individuals, including 798 unrelated individuals, once offspring of 101 trios were removed. The statistical power of the study to detect a variant explaining a given proportion of the total trait variance was calculated using GWApower [27]. Our study had an 80% power to detect a variant explaining 4.7% of the total variance in relative EBV copy number. The overall distribution of P values showed little evidence of genomic inflation (l = 0.98), consistent with the null hypothesis and suggesting that mixed-effects modelling was able to correct for high familial relatedness and population structure ( Figure 2 A and B).
We estimated the proportion of variance in EBV copy number within the dataset explained by the set of ,1.6 million common genetic variants using GCTA, in 677 unrelated individuals to be 0.65 (SE 60.38; p = 0.04). We also calculated the heritability of EBV copy number in 101 trios present within the 1000 Genomes dataset, where EBV copy number was available for all trio members. The parental mid-point EBV copy number was regressed against the offspring EBV copy number, giving a heritability estimate of 34% (SE 611, p = 0.0028).
No variants passed the genome-wide significance threshold (P, 5610 28 , Figure 2 A), although 98 SNPs achieved a genome wide significance of ,1610 25 and are suggestive of a possible association (Table S2 in File S1). Of these the top SNP, rs10959089 (P = 1.17610 26 , beta = 0.52 for the minor C allele), was located in the first intron of the gene PTPRD (protein tyrosine phosphatase receptor delta) (Figure 2 C).

Association testing of variants implicated in EBV infection, immune response and disease by previous studies
48 SNPs and small structural variants have been previously reported to influence EBV traits such as acquisition risk, antibody response, or EBV-positive disease risk. 28 of these SNPs were included in our association study ( Table 2). Two SNPs had P values of nominal significance (rs2516049, p = 0.01; rs1052536, p = 0.03). It is therefore not possible to link these variants to the phenotype of relative EBV copy number in LCLs.

Epstein-Barr virus gene copy number, host gene expression and eQTL analysis in LCLs
Microarray gene expression data was available for 466 unrelated individuals from 8 populations [24]. A linear regression was performed for these individuals between 21,800 gene transcripts and EBV genome copy number. A statistically significant positive correlation was found between EBV relative copy number and the expression levels of two genes: CXCL16 (chemokine (C-X-C motif) ligand 16) and AGL (amylo-alpha-1, 6glucosidase, 4-alpha-glucanotransferase), and a statistically significant negative correlation between EBV relative copy number and ADARB2 (adenosine deaminase, RNA-specific, B2) expression ( Figure 3; Table 3). Transcripts with suggestive P values (P. 5610 23 ) are included in Table S3 in File S1. Evidence for the effect of EBV genome copy number on eQTL results was not observed for any of these genes; the correlation did not occur in a genotype-dependent manner. QTL mapping using EBV as a phenotype did not reveal any statistically significant SNPs located in or near these genes.

Discussion
This is the largest study to investigate the impact of host genetic factors on Epstein-Barr virus genome copy number in lymphoblastoid cell lines. As inhibitors of the EBV lytic cycle effectively reduce EBV copy number in LCLs this suggests the higher copy numbers are the result of lytic replication induction [8]. Our study therefore also represents a proxy phenotype for the spontaneous switch from latent to lytic EBV replication in LCLs. We identified the EBV strain infecting a sample of the LCLs, quantified relative EBV genome copy number in 915 LCLs from the HapMap and 1000 Genomes study, performed a genome-wide association study of EBV genome copy number, estimated the heritability of EBV genome copy number in parent-offspring trios, and examined the relationship between EBV genome copy number and human gene expression in a subset of LCLs. Using high-coverage sequence data from the 1000 Genomes Pilot project for 77 CEU and YRI LCLs, we estimate that the rate of LCLs containing wild-type or non-B-95.8-strain EBV is less than 1 in 77. This is consistent with the findings of Santpere and colleagues [25], who detected unambiguous evidence of EBV coinfection or wild-type infection in 10 out of 929 LCLs from the 1000 Genomes Project dataset. These 10 LCLs were not included in our analysis, therefore competition between viral strains was unlikely to be a factor influencing copy number in our study.
Our results, of relative EBV copy number varying between individuals, and within and between populations are consistent with previous studies [6,8] of LCLs of unreported ancestry, where EBV genome copy number was a trait that varied more significantly between LCLs than between different passages or sub-cultures of the same LCL. Davies et al. [8] suggest that the variation in EBV copy number is controlled by cell-intrinsic factors. In our analysis, it remained unclear whether genetic or cell-intrinsic factors caused the differences in mean EBV copy number between the Denver Han Chinese (CHD) and Beijing Han Chinese (CHB) populations. While it has been noted that the CHD LCLs grow more slowly than the Asian (ASN) LCLs (which include the CHB LCLs) [28], another study [29] found no statistically significant relationship between LCL growth rate and EBV copy number. We conclude that although cell intrinsic factors are still the most plausible explanation, multiple genetic variants, each with a small effect of EBV copy number, also likely play a role.
The strongest association signal in the genome-wide association study was an intronic SNP, located in gene PTPRD. PTPRD functions in cellular signalling, is located on the cell surface, and has been identified as a tumour suppressor. Mutations within PTPRD have been associated with several cancers: glioblastoma multiforme and head and neck squamous cell carcinomas [30], and clear cell renal carcinoma [31]. The paralogous gene PTPRK (protein tyrosine phosphatase receptor kappa) interacts with EBV. When PTPRK is over-expressed in EBV-infected Hodgkin lymphoma cells, survival of these cells decreases; when PTPRK expression is knocked down by RNAi, survival increases; suggesting a role for PTPRK in tumour suppression. The EBV gene EBNA1 targets Smad2, a protein that regulates PTPRK expression. By decreasing the half-life of Smad2, PTPRK is downregulated in turn [32]. PTPRC (protein tyrosine phosphatase receptor C) has recently been associated with herpes simplex encephalitis susceptibility in mice [33]. Therefore although not genome wide significant, polymorphisms in PTPRD are in a biologically plausible gene.
Changes in the relative EBV copy number of LCLs have an impact on the gene expression profiles of those LCLs in a genotype-independent manner. Our analysis of genes differentially expressed between LCLs identified three genes whose expression positively (CXCL16 and AGL) or negatively (ADARB2) correlated with relative EBV copy number. CXCL16 is regulated by the microRNA CMVmiR-M23-2 of another human herpesvirus, CMV [34]. Reptar [35] predicts that CXCL16 is targeted by seven EBV microRNAs (ebv-miR-BART17-5p, ebv-miR-BART21-3p, ebv-miR-BART21-5p, ebv-miR-BART22, ebv-miR-BART3*, ebv-miR-BART5* and ebv-miR-BART7). It is also interesting to note that CXCL16 is a chemokine which has recently been associated with disease activity in multiple sclerosis [36] and mouse models of experimental autoimmune encephalomyelitis [37]. EBV microRNAs are also predicted to interact with AGL and ADARB2 [35]. It is likely that in a larger sample of LCLs, further genes would show significant correlations with relative EBV copy number. It is therefore important to control for the effects of EBV copy number in gene expression studies utilising large samples of LCLs. Other studies of the impact of EBV on the gene expression patterns of LCLs include a study by Choy et al., which reported that ,15% of genes to have at least 5% of their variance correlated with EBV copy number [7].
In a study of semi-quantitative antibody response to EBV gene EBNA1, Rubicz et al. [38] were able to identify 15 SNPs which were associated with EBV antibody response at genome-wide significant level. In contrast, Kuparinen et al. [17] performed a GWAS for CMV antibody response and did not identify any genome-wide significant SNPs. It is possible that a different genetic architecture underlies the host response to control of EBV copy number in LCLs than that of EBV antibody response. Additionally  all of the genes associated with the EBV antibody response were within the HLA system, a region of the genome this study was not well-powered to interrogate given the differences in HLA allele frequencies between populations and our criteria that a variant must be present in all 12 populations studied for SNP inclusion. Our finding that the 1.6 M SNPs studied in this GWAS could collectively explain 65% (se = 38%) of the variance in relative EBV copy number, while no single SNP reached genome-wide significance, suggests that many variants of small effect play a collective role within LCLs. Similarly, GWAS of genetic resistance to HIV-1 infection [15] could not find any common variants (with the exception of CCR5-delta 32) which were protective, while GWAS of host control of HIV-1 viral load found a number of genome-wide significant loci associated with lower HIV-1 viral load set points and slower progression to AIDS [12,39]. The power to identify host genetic variation of virus infection traits may greatly depend on the trait under study and the sample sizes.
We estimated the heritability of relative EBV copy number, based on data from 101 parent-child trios to be 34% (se 611%, p = 0.003). Other studies have found EBV anti-EBNA1 antibody response to be 68% heritable when considered as a discrete trait (seropositive versus seronegative) [38] and discrete anti-VCA IgG antibody response to be 32-48% heritable [40]. Infectious mononucleosis concordance rates in twins were estimated to be 12% between monozygotic twins and 6% between dizygotic twins [41]. Therefore, host genetic factors appear to play a variable but significant role in symptomatic response to primary EBV infection, the adaptive immune response to EBV latency, and the cellintrinsic control of EBV latency although none of the variants previously identified were significantly associated with EBV copy number in our analysis.
This study has established that relative EBV copy number within LCLs is very much a complex trait, which has a significant heritable component. Our results suggest that many genetic variants of effect size less than 4.7% of variance in relative EBV copy number exist, but which this study was not statistically powered to detect. This study also focused on 1.6 million single nucleotide polymorphisms which were common to every population studied. Because SNPs which did not vary in all 12 populations studied were excluded from this analysis due to potential population stratification, we cannot rule out the effect on relative EBV copy number of SNPs which were only variable in a subset of the populations studied. We also cannot exclude the effects of rare variants of large effect on relative EBV copy number in LCLs, as these were not studied here. It is also possible that structural variants which are poorly tagged by common SNPs may play a role. To identify the genetic factors that underpin EBV copy number, a significant increase in sample size is necessary that will become possible within on-going large-scale sequencing and genotyping projects. However, we do find that even in a relatively modest sample size, EBV copy number is correlated with LCL host-gene expression patterns, in a host genotype-independent manner. Studies using larger samples of LCLs to study host gene expression profiles may find EBV-associated changes in LCLs  generate false-positive results, unless EBV copy number is controlled for.

Ethics
No primary human tissue was used in this study. Details of this project were sent to the Coriell Cell Repository to be passed on the relevant Community Advisory Groups for HapMap participants.

Samples
Separated from peripheral blood as part of the HapMap and 1000 Genomes Projects, LCLs and LCL-derived DNA were provided to Wellcome Trust Sanger Institute by the Coriell Cell Repository. DNA from 915 HapMap lymphoblastoid cell lines was obtained from Coriell Institute for Medical Research, representing 12 populations. A summary of the composition of the sample group that provided the LCLs is provided in Table 1. Cell line BCBL-1 was used as a calibrator for quantitative PCR. It is a Kaposi Sarcoma Herpesvirus-positive, EBV-negative body cavityderived primary effusion lymphoma cell line, of B cell origin [42].

Quantification of relative EBV copy number per cell
Quantification of relative EBV copy number was performed using quantitative PCR (qPCR). An artificial gene (GeneArt, Life Technologies) was designed based on the EBV BALF5 sequence for use as a positive control in qPCR, but containing an artificially inserted sequence to distinguish it from wild-type BALF5, with the sequence: CCCTGTTTATCCGATGGAATGACGGCGCATTTC-TCGTGCGTGTACACCGTCTCGAGTATGACTGGTTCCAATT-GACAAGCTGGGTCGTAGACATGGAAGTCCAGAGGGCTTC-CG. Quantitative PCR was performed on an Agilent MxPro 3005 machine using the QuantiTect Multiplex PCR NoROX Kit (Qiagen). The PCR reagents were: 2.5 ml nuclease-free water (Qiagen), 12.5 ml QuantiTect Multiplex PCR No ROX mastermix (Qiagen), 2 ml BALF5 primer mix (10 pmol/ml of each primer and 0.75 pmol/ml probe [43]), 2 ml GAPDH primer mix (2.5 pmol/ml of each primer and 2 pmol/ml probe [44]), 1 ml ROX (diluted 1:10 in 10 mM Tris-HCL (Life Technologies)), and 5 ml template DNA or positive control. EBV qPCR primer and probe sequences were from Kimura, 1999 [43]. GAPDH primer and probe sequences were from Pardieu, 2010 [44]. PCR conditions were as follows: 95uC for 15 minutes, followed by 45 cycles of 94uC for 60 seconds, 57uC for 30 seconds and 72uC for 30 seconds. Fluorescence data was collected during the annealing step.
The 22DDCT method [45] was used for relative quantification of target gene abundance (target gene BALF5, endogenous control gene GAPDH). Gene copy numbers in LCLs were normalised against the BCBL-1 cell line [46]. Data analysis was performed using MXPro v4.10 qPCR software (Agilent Technologies).

Heritability analysis
Within the individuals with relative EBV copy number data available, there were 101 trios with EBV copy number information available for mother, father and offspring. The family structure was taken from the 1000 Genomes pedigree files. They were drawn from four populations -CEU, IBS, MXL and YRI. The mid-parental average for the phenotype (relative EBV copy number) was calculated and the child's phenotype regressed against the mid-parental phenotype. The regression gives an estimate of narrow sense heritability of the trait and its associated P value.

Imputation and association testing
The statistical power to identify a genetic variant was calculated using GWApower [27]. Imputation of genotyped samples to 1000 Genomes Phase 1 was performed using IMPUTE2 [26]. This increased the number of SNPs in common between the samples from 600,000 to 37 million. These imputed variants were then subjected to quality control in PLINK [47], namely removing: SNPs where Hardy-Weinburg equilibrium was P,1610 25 in at least one of the 1000 Genomes populations (n = 8,754,733); SNPs which were monomorphic among phenotyped samples (n = 8,689,910); SNP with a missingness .0.01 (n = 9,808,778) and all SNPs with MAF ,0.01 (n = 16,881,983). SNPs which were called discordantly in samples where sequencing and genotyping information were both available were excluded from further analysis. As EBV copy number varies significantly between different populations, we performed additional QC in order to account for potential stratification in the association analysis: removing any SNP if HWE P,0.01 in two or more populations, and removing any SNP that was monomorphic in at least one of the populations. This left a final set of 1,595,489 SNPs which were polymorphic in every population studied. Association analysis was performed using a linear mixed model implemented in FaST-LMM-Select [25].

Transcriptional profiles
Microarray expression data from Stranger et al. [24] (''RE-DUCED'' dataset) was obtained for 466 unrelated individuals with relative EBV copy numbers. Expression data is available on http://www.ebi.ac.uk/arrayexpress/(Series Accession Number E-MTAB-264 and E-MTAB-198.processed.1). For each individual, correlation between expression levels of individual gene transcripts and EBV copy number was determined via linear regression. Pvalues of correlation with EBV copy number were obtained for each gene transcript (21,800 in total). Transcripts with p-value lower than 5610 26 were considered significantly correlated with EBV copy number. Transcripts were mapped back to the genes they correspond to according to the array design file (''A-MEXP-930.adf.txt'') available on http://www.ebi.ac.uk/arrayexpress/ experiments/E-MTAB-264/ eQTL analysis Effect of EBV relative copy number on the identification of expression quantitative loci (eQTLs) was tested in PLINK via linear association model with or without EBV copy number used as a covariate. Expression data included normalized log 2 quantitative gene expression measurements for the 21,800 probes (18,226 unique autosomal genes) from 466 unrelated individuals of HapMap Phase III assayed on the Illumina Sentrix Human-6 Expression BeadChip [24]. SNP genotypes were as described for the QTL analysis. SNPs within a 2 Mb window around a gene locus were defined as eQTLs. Correction for population structure was performed using principal component analysis (PCA). Difference in effect sizes of eQTLs identified in the two association studies (with or without EBV copy number as a covariate) was determined via paired t-test across all tested SNPs (1.6 million). Resulting p-values of the difference in effect sizes were plotted for each transcript across all 22 non-sex chromosomes. Figure S1 Standard curve of BALF5 quantitative PCR primer set dilution series, which amplifies the EBV polymerase gene BALF5. The efficiency of the BALF5 qPCR assay was 99.4% (compared to GAPDH), r2 = 0.998.

(DOCX)
File S1 Supplementary tables. Table S1 in File S1 shows the results of a Tukey honest significant difference test of mean EBV copy number between populations. Table S2 in File S1 summarises the results of the EBV copy number GWAS, including all SNPs associated with EBV copy number with P,5610 25 . Table S3 in File S1 shows changes in gene expression (microarray) correlated with EBV copy number with suggestive P values (P .