Exploration of a Polygenic Risk Score for Alcohol Consumption: A Longitudinal Analysis from the ALSPAC Cohort

Background: Uncertainty remains about the true extent to which alcohol consumption causes a number of health outcomes. Genetic variants, or combinations of variants built into a polygenic risk score (PGRS), can be used in an instrumental variable framework to assess causality between a phenotype and disease outcome of interest, a method known as Mendelian randomisation (MR). We aimed to identify genetic variants involved in the aetiology of alcohol consumption, and develop a PGRS for alcohol.
Methods: Repeated measures of alcohol consumption from mothers and their offspring were collected as part of the Avon Longitudinal Study of Parents and Children. We tested the association between 89 SNPs (identified from either published GWAS data or from functional literature) and repeated measures of alcohol consumption, separately in mothers (from ages 28–48) and offspring (from ages 15–21) who had ever reported drinking. We modelled log units of alcohol using a linear mixed model and calculated beta coefficients for each SNP separately. Cross-validation was used to determine an allelic score for alcohol consumption, and the AVENGEME algorithm was employed to estimate the variance of the trait explained.
Results: Following correction for multiple testing, one SNP (rs1229984) showed evidence for association with alcohol consumption (β = -0.177, SE = 0.042, p < 0.0001) in the mothers. No SNPs showed evidence for association in the offspring after correcting for multiple testing. The optimal allelic score was generated using p-value cut-offs of 0.5 and 0.05 for the mothers and offspring, respectively. These scores explained 0.3% and 0.7% of the variance, respectively.
Conclusion: Our PGRS explains a modest amount of the variance in alcohol consumption, and larger sample sizes would be required to use our PGRS in an MR framework.


Introduction
Alcohol is a leading preventable cause of ill health in Europe [1], with Europeans accounting for more than a quarter of the total worldwide alcohol consumption (while making up 15% of the global population) [2]. Despite this, uncertainty remains about the true extent to which alcohol consumption in the general population causes a number of health outcomes, including type 2 diabetes [3] and cardiovascular disease [4], mainly because of bias in conventional epidemiological studies.
Genetic variants can be used in an instrumental variable framework to improve evidence on causality between an exposure and disease outcome of interest, a method known as Mendelian randomisation (MR) [5]. Details of the rationale and assumptions of MR have been discussed in detail elsewhere [6]. In brief, the allocation of genetic variants is random at conception, therefore the frequency of those variants associated with an exposure of interest should be approximately the same in groups of individuals with different confounding factors. Furthermore, as genotype is determined at conception, it cannot be susceptible to reverse causation [5,7], which is particularly problematic when studying long-term effects of alcohol use ("sick-quitter effect" [3,8]). Holmes et al (2014) [9] used rs1229984 (a genetic variant in ADH1B) in an MR framework to examine the causal impact of alcohol on cardiovascular disease in European populations while other examples can be found in East Asian populations [10][11][12]. Polygenic risk scores (PGRS) can also be used in an MR framework, accounting for a greater proportion of the variance in the exposure phenotype of interest, thus increasing power. Use of PGRS in MR can avoid or alleviate weak instrument bias [13], which is a common problem in MR. Furthermore, in the instance that researchers wish to use a PGRS in a large cohort that has genotyping availability but no GWAS data, a finite number of SNPs from the PGRS can be easily and cost effectively genotyped for that purpose.
There are clear advantages of using PGRS in MR studies of alcohol consumption, however to date there are no known variants robustly associated with alcohol drinking in populations of European origin, other than the relatively rare ADH1B used by Holmes et al [9]. This is despite estimates of heritability for alcohol use disorders and consumption reaching approximately 50% at their peak [14][15][16][17], and linkage and genome wide association studies (GWAS) suggesting a variety of potential loci that might be implicated [18][19][20]. The majority of GWAS for alcohol phenotypes focus on dependence rather than heaviness of use [21][22][23][24], and among the top findings are often alcohol dehydrogenase genes (ADH) and aldehyde dehydrogenase (ALDH2) [25][26][27], which have also been reported in candidate gene studies of metabolic reactions following ingestion [28].
We therefore aimed to identify genetic variants likely to play a role in the aetiology of alcohol consumption. Our end goal was to develop a polygenic risk score for average alcohol consumption in the general population, which could be specific to consumption and explain a larger proportion of the variance than the known ADH1B variant. We used a multi-step approach, by [a] Identifying genetic variants that could plausibly be associated with alcohol consumption from genome wide association studies (GWAS) and the functional literature, [b] Estimating their association with alcohol consumption in mothers (heritability is estimated to be higher after college years) and offspring from the Avon Longitudinal Study of Parents and Children (ALSPAC), and [c] Creating PGRSs (based on the initial set of SNPs) and estimating the proportion of variance explained for both mothers and offspring's phenotypes. We fitted both cross-sectional and longitudinal models, thus taking advantage of the repeated measures of alcohol consumption available at different time points in life to minimise noise in the definition/reporting of alcohol use.

Study Population
Data were taken from the Avon Longitudinal Study of Parents and Children (ALSPAC), a longitudinal study situated in South West England. ALSPAC recruited 14,541 pregnant women between 1991 and 1992, with 14,062 live births resulting from these pregnancies. Comparison with the 1991 census shows the sample was broadly representative of the British population [29]. Both mothers and offspring have been followed up with a series of questionnaires, clinics and lab-based assessments over the past 25 years, which has allowed for a wide range of phenotypic and biological measures to be collected. Ethical approval was obtained from the ALSPAC Ethics and Law Committee and the Local Research Ethics Committees. Further information on the recruitment process is available elsewhere [29][30][31]. The study website contains details of all data through a searchable data dictionary [32].
We used data from 10 postal questionnaires completed by the mothers in the cohort over 18 years (ranging from a mean age of 28 at baseline to a mean age of 48 at 18 years post pregnancy) and questionnaires completed by ALSPAC offspring over 6 years (ranging from age 15 to 21 years). All individuals who had available genetic data (outlined below) and answered alcohol related questions at these time points were included in the analysis (Mother N = 1609 to 3912; Offspring N = 2604 to 7989, Table 1).

Phenotypic Measures
Weekly alcohol consumption (units). In all alcohol related questions used in this research, participants were informed that "one drink referred to ½ pint of beer/cider, a small (125ml) glass of wine or a single (25ml) measure of spirit", with each of these drinks containing approximately one UK unit of alcohol. Weekly alcohol consumption was treated as a continuous measure of UK units.
Mothers' weekly alcohol consumption between zero and three years post pregnancy and at six years post pregnancy was calculated using the question "How often have you drunk alcoholic drinks". Participants selected one of the following responses: "Never"; "Less than 1 glass a week"; "At least 1 glass a week"; "1-2 glasses every day"; "At least 1-9 glasses every day"; "At least 10 glasses every day". At zero years post pregnancy, this question related to the amount of alcohol consumed before the current pregnancy. At four years post pregnancy and between seven and 12 years post pregnancy, weekly alcohol consumption was calculated from self-reported beers/ciders, wines, spirits, other alcohol or low alcohol beverages consumed on each day of the previous week. Weekly alcohol consumption at 18 years post pregnancy was calculated by multiplying the number of days the mother generally drank by the number of drinks consumed on a typical drinking day (S1 Table).
Offspring's weekly alcohol consumption between ages 15 and 21 years was calculated by multiplying the frequency at which the child drank by the number of drinks consumed on a normal drinking day (S2 Table).
Potential covariates. Mothers' covariates included: Cigarettes per day treated as a continuous measure at one, two, three and six years post pregnancy; age in years; social class (III manual skilled, IV and V unskilled manual or casual workers or those who rely on state for their income/I and II professional occupations and managerial and technical occupations and III non-manual skilled workers); highest level of education (certificate of secondary education/ vocational qualification/O level/A level/Degree); cannabis, antidepressant, amphetamine and opiate use in the past year (No/Yes) at one, two, three and six years post pregnancy.

Genetic Measures
ALSPAC offspring were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andMe subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG). Following quality control (individual call rate > 0.97, SNP call rate > 0.95, MAF > 0.01, HWE > 1E-7, cryptic relatedness within mothers and within offspring IBD < 0.1, non-European clustering individuals removed), 9237 offspring and 8196 mothers were retained with 477482 SNP genotypes in common between them. SNPs were flipped to the forward strand.
Table 1 (partial rows as extracted). Baseline 7425 3.6 (6.1) 3.5 (0.5-3.5); 18 years (Q) 2604 6.0 (6.5); All mother 67988 3.7 (6.5) 1.1 (0-4) 24 34.0 (6.8).
Table 1 notes. 1 All offspring's time points represent the age at which the questionnaire/clinic was administered. 2 All mothers' time points correspond to the amount of time since the end of the first pregnancy (i.e. pregnancy enrolled into ALSPAC); baseline corresponds to a questionnaire administered at enrolment into the study that reflects alcohol use before pregnancy. (C) = data collected during a clinic session; (Q) = data collected using a postal questionnaire.
To identify SNPs with some a priori evidence of association with alcohol consumption, we searched the NHGRI-EBI GWAS catalogue [33] for published GWAS of alcohol-related phenotypes, using the following search terms: "Alcohol consumption"; "Alcohol dependence"; and "Alcoholism". 68 SNPs were associated with at least one of these phenotypes at a P value < 1.0 × 10⁻⁵. We supplemented this list with a search for candidate gene or functional studies of alcohol consumption, identifying a further 23 SNPs, bringing the total number of SNPs to 91 (note that three SNPs identified from functional studies had already been identified by the GWAS search). We then extracted genotype data for the 91 SNPs from ALSPAC. 31 SNPs were directly genotyped. For the remaining 60 SNPs, imputed genotypes with imputation r² > 0.8 were available (two SNPs with imputation r² < 0.8 were excluded). 89 SNPs were therefore included in the analyses. A full summary of selected SNPs is provided in S3 Table.

Statistical Analysis
All analyses were conducted using Stata13 [34]. We tested for association between weekly alcohol consumption and the 89 SNPs in mothers and offspring separately. For alcohol consumption, an abundance of zeros was expected, since no consumption would be reported both by those offspring and mothers who never drink, and by offspring and mothers who had not drunk in the preceding week. To analyse cross-sectional and repeated measures of these data we log-transformed units and focussed analysis only on those who had ever reported drinking (i.e. dropping non-drinkers). We modelled log units using a linear mixed model and calculated the beta coefficient of each SNP as a function of the number of copies of the minor allele. The outcome was also assessed cross-sectionally using a log-linear regression at each time point separately. In each model, we adjusted for age and controlled for population stratification using the first 10 principal components. To test for pleiotropic effects, we examined the associations between the 89 SNPs and 48 potential confounders detailed above.
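One way to write the repeated measures model described above is sketched below. The text specifies only "a linear mixed model" adjusted for age and 10 principal components; the random intercept per individual, and all symbol names, are assumptions made for illustration. Here i indexes individuals, j indexes time points, and G_i is the number of copies of the minor allele (0, 1 or 2).

```latex
\log(\text{units}_{ij}) = \beta_0 + \beta_{\text{SNP}}\, G_i + \beta_{\text{age}}\, \text{age}_{ij}
  + \sum_{k=1}^{10} \gamma_k\, \text{PC}_{ik} + u_i + \varepsilon_{ij},
\qquad u_i \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)
```

Under this form, the reported beta coefficient for each SNP corresponds to \beta_{\text{SNP}}, the change in log units per additional copy of the minor allele.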
We used the Bonferroni method to correct for multiple testing. Evidence for association was taken at p = 0.00056 (0.05/89) for repeated measures analyses and p = 0.000037 (0.05/(number of time points × 89)) for the cross-sectional analyses.
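The two thresholds can be checked with a few lines of arithmetic. In this sketch the number of time points (15) is an assumption: it is inferred from the 10 mothers' and 5 offspring questionnaires mentioned elsewhere in the text, and it is the value that yields the quoted cross-sectional threshold.

```python
# Bonferroni-corrected significance thresholds (illustrative check).
ALPHA = 0.05
N_SNPS = 89          # number of SNPs tested, from the text
N_TIME_POINTS = 15   # ASSUMPTION: 10 mothers' + 5 offspring questionnaires

# Repeated measures analyses: one test per SNP.
repeated_threshold = ALPHA / N_SNPS

# Cross-sectional analyses: one test per SNP per time point.
cross_sectional_threshold = ALPHA / (N_TIME_POINTS * N_SNPS)

print(f"repeated measures: {repeated_threshold:.5f}")
print(f"cross-sectional:   {cross_sectional_threshold:.6f}")
```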
To determine a PGRS for alcohol consumption, we randomly separated the individuals into 80% training and 20% discovery sets. Repeated measures of log units were modelled in the training set, separately for each of the 89 SNPs with beta coefficients, their standard error and corresponding p-values recorded. Using p-value thresholds of 0.01, 0.05, 0.1, 0.2, 0.4 and 0.5 for inclusion, we then created a weighted PGRS for each threshold. These scores were used to predict repeated measures of log-units in the 20% discovery set, with R-squared recorded for the score corresponding to each p-value threshold. The process was repeated five times, and the p-value threshold with the highest R-squared was taken as optimal. This was done independently for mothers and offspring data. Finally, the AVENGEME [35] algorithm was employed to estimate variance of the trait (i.e. alcohol units consumed in a week) explained by the PGRSs.
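The threshold selection procedure can be illustrated with a small, self-contained sketch. It is not the authors' Stata code: it uses simulated genotypes and a single cross-sectional phenotype with ordinary least squares instead of the repeated measures mixed model, it averages R-squared over the five random splits (the text does not say how the five repeats were combined), and it does not reproduce AVENGEME. All function names and simulation settings are hypothetical.

```python
import math
import random
from statistics import NormalDist

THRESHOLDS = [0.01, 0.05, 0.1, 0.2, 0.4, 0.5]  # p-value cut-offs listed in the text


def ols_slope(x, y):
    """Regress y on one predictor x; return (beta, se, two-sided p).

    The p-value uses a normal approximation to the t statistic.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    rss = sum((yi - my - beta * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    p = 2 * (1 - NormalDist().cdf(abs(beta / se)))
    return beta, se, p


def r_squared(x, y):
    """Squared Pearson correlation; 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return 0.0 if sxx == 0 or syy == 0 else sxy ** 2 / (sxx * syy)


def simulate(n, m, rng):
    """Toy data: m SNPs coded 0/1/2; the first three SNPs lower the phenotype."""
    mafs = [rng.uniform(0.1, 0.5) for _ in range(m)]
    effects = [-0.15 if j < 3 else 0.0 for j in range(m)]
    geno = [[(rng.random() < f) + (rng.random() < f) for f in mafs]
            for _ in range(n)]
    pheno = [sum(e * g for e, g in zip(effects, row)) + rng.gauss(0, 1)
             for row in geno]
    return geno, pheno


def choose_threshold(geno, pheno, rng, repeats=5):
    """Repeat a random 80/20 split; return the cut-off with the best mean R-squared."""
    n, m = len(geno), len(geno[0])
    mean_r2 = {t: 0.0 for t in THRESHOLDS}
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(0.8 * n)
        train, test = idx[:cut], idx[cut:]
        y_train = [pheno[i] for i in train]
        y_test = [pheno[i] for i in test]
        # Per-SNP association in the 80% set gives the weights and p-values.
        fits = [ols_slope([geno[i][j] for i in train], y_train) for j in range(m)]
        for t in THRESHOLDS:
            keep = [(j, fits[j][0]) for j in range(m) if fits[j][2] < t]
            # Weighted score: sum of beta * allele count over retained SNPs.
            score = [sum(b * geno[i][j] for j, b in keep) for i in test]
            mean_r2[t] += r_squared(score, y_test) / repeats
    best = max(mean_r2, key=mean_r2.get)
    return best, mean_r2


rng = random.Random(2024)
geno, pheno = simulate(1000, 89, rng)
best, mean_r2 = choose_threshold(geno, pheno, rng)
print("chosen p-value threshold:", best)
```

Because the weights are estimated in one subset and the score is evaluated in the other, the R-squared for each cut-off is not inflated by fitting and testing on the same individuals.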
To demonstrate a possible use of the PGRS, we tested the association between our PGRS on proxy measures for cardiovascular disease [9] which were available both in mothers and in their offspring, namely HDL cholesterol, systolic blood pressure and diastolic blood pressure. These were modelled against the two PGRS while controlling for age at measurement.

Sensitivity Analysis
The above statistical methods were used to conduct the following sensitivity analyses using the mothers' data: (a) excluding individuals who were pregnant at completion of the questionnaire; and (b) excluding weekly alcohol consumption measures at four, seven, eight and 12 years post pregnancy in the mothers, as these questions were phrased differently from the other time points (S1 Table).

ALSPAC Mothers
For the ALSPAC mothers, there were 67988 questionnaire responses to units consumed, between the ages of 28 and 48 years. Alcohol consumption increased over the course of the questionnaires and with age (Table 1).
In the repeated measures analysis, six SNPs had a p value < 0.05. Following correction for multiple testing, one SNP (rs1229984) showed evidence for association with alcohol consumption (β = -0.177, SE = 0.042, p = 0.00002) (Table 2 and S4 Table). In the cross-sectional analyses, 27 SNPs had a p value < 0.05 at a minimum of one time point. The top ranked SNP was rs1229984 with the alcohol consumption variable measured at 12 years post pregnancy (difference in units per week for each additional copy of the minor allele = -0.326, SE = 0.084, P = 0.0001). This SNP also showed evidence for association (multiple testing threshold) at baseline (difference in units per week for each additional copy of the minor allele = -0.19, SE = 0.051, P = 0.0002) and was consistently negatively associated with alcohol consumption across all time points (Fig 1 and S5 Table). None of the mothers' covariates showed evidence for association with any of the individual SNPs following correction for multiple testing (S6 Table).

ALSPAC Offspring
Units of alcohol consumed were measured 17998 times over 5 questionnaires during adolescence for the ALSPAC offspring, from age 15 to 20.5. The average number of units peaked at 13 per week at age 18 years, dropping to 8 per week by age 21 years (Table 1).
In the repeated measures analysis, six SNPs had a p value < 0.05. Following correction for multiple testing, no SNPs showed evidence for association (Table 2 and S4 Table). In cross-sectional analyses, 28 SNPs had a p value < 0.05 at a minimum of one time point. The top ranked SNP in the offspring was rs2228093 with the alcohol consumption variable measured at age 18 years (difference in units per week for each additional copy of the minor allele = -0.105, SE = 0.036, P = 0.004); however, this did not meet the p value threshold for multiple testing (S5 Table). In contrast to the ALSPAC mothers, rs1229984 did not pass multiple testing thresholds for alcohol consumption in repeated measures analysis and did not show a consistent pattern across all time points tested (Fig 1). None of the offspring's covariates showed evidence for association with any of the individual SNPs following correction for multiple testing (S6 Table).

Sensitivity Analysis
When excluding pregnant women and data from questionnaires at four, seven, eight and 12 years post pregnancy, rs1229984 remained the only SNP associated with weekly alcohol consumption (excluding pregnant women: difference in units per week for each additional copy of the minor allele = -0.159, SE = 0.042, P = 0.0001; excluding questionnaires: difference in units per week for each additional copy of the minor allele = -0.149, SE = 0.039, P = 0.0001) (S7 Table).

Polygenic Risk Score
The best allelic score was generated using p value cut-offs of 0.5 and 0.05 for the mothers and offspring, respectively, which resulted in the inclusion of 42 SNPs (out of 89) for mothers and 6 SNPs (out of 89) for offspring. For the ALSPAC mothers, the variance in units of alcohol per week explained was 0.3% (95% CI 0.13% to 0.76%). For the ALSPAC offspring, the variance explained was 0.66% (95% CI 0.22% to 1.3%). Neither PGRS showed evidence for association with any of the confounders, i.e. there was no evidence to suggest that the PGRS violates the second assumption of instrumental variables (that the instrument is independent of the confounders of the original exposure-outcome association), which would make it invalid as an instrumental variable to proxy for alcohol intake (S6 Table).
When modelling the effect of our PGRS on cardiovascular disease risk factors, there was evidence for association between the offspring PGRS and offspring diastolic blood pressure in the expected direction (beta = 1.61, 95% CI 0.25 to 2.97, p = 0.020). However, our estimates of the association between the offspring PGRS and both HDL cholesterol and systolic blood pressure provided no strong evidence of association (HDL cholesterol: beta = 0.06, 95% CI -0.03 to 0.14, p = 0.177; systolic blood pressure: beta = 1.44, 95% CI -0.84 to 3.72, p = 0.216). Similarly, there was no statistical evidence for association between the mothers' PGRS and any of the cardiovascular disease risk factors (HDL cholesterol: beta = 0.39, 95% CI -0.10 to 0.17, p = 0.565; systolic blood pressure: beta = -0.68, 95% CI -7.05 to 5.69, p = 0.835; diastolic blood pressure: beta = 2.14, 95% CI -2.34 to 6.62, p = 0.350).

Discussion
The aim of this study was to develop a polygenic risk score for alcohol consumption, with a view to using it to assess the causal impact of alcohol on health related outcomes such as cardiovascular disease. Literature searches of published GWAS and functional studies identified 89 candidate SNPs that had previously shown some evidence of association with alcohol-related phenotypes. Using repeated measures analysis of alcohol behaviour over the course of a 20 year period, we found strong evidence that rs1229984 plays a role in alcohol consumption, confirming previous results [9]. This SNP was associated with a decrease of 0.84 units of alcohol per week, on average. It was found to be associated in cross-sectional analyses of questionnaires measured 20 years apart, and effect estimates were stronger in the repeated measures analysis, strengthening the evidence that it relates to alcohol consumption throughout the life course. The PGRS derived through cross-validation only explained a modest proportion of the variance in alcohol consumption (0.3% for mothers and 0.66% for the offspring in our sample).
The score could in principle be used to conduct MR analyses, for example in the field of cardiovascular disease (CVD), although large sample sizes would be required. The British Heart Foundation estimates that there are 7 million people living with CVD in the UK (~10% of the population) [36]. If the odds ratio for alcohol consumption on CVD incidence was 0.75 (OR for CVD mortality used) [37], we would require a sample size of 126,500 (assuming a 1:1 ratio, 80% power, alpha = 0.05 and R² = 0.3% from the mothers' PGRS result), or 54,200 (assuming a 1:1 ratio, 80% power, alpha = 0.05 and R² = 0.7% from the offspring's PGRS result) [38]. Similarly, the incidence odds ratio for coronary heart disease (CHD) is 0.71 [37], with 2.3 million people in the UK living with CHD. To perform an MR analysis examining the effect of alcohol consumption on CHD, we would need 89,300 individuals using the mothers' PGRS and 38,300 individuals using the offspring's PGRS. The required sample sizes are much larger than those in our sample; therefore, our tests of association between the PGRSs and proxy measures of cardiovascular disease are underpowered. As such, we cannot be certain that the lack of association represents a true null result.
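The quoted sample sizes can be approximated with a commonly used asymptotic formula for MR with a binary outcome, N = (z_{1-alpha/2} + z_{power})² / (R² × ln(OR)² × p(1 - p)), where p is the proportion of cases. Whether this is the exact calculation behind [38] is an assumption; the sketch below simply shows that, under the stated inputs and after rounding up to the nearest 100, it gives figures matching those in the text. The function name is hypothetical.

```python
from math import ceil, log
from statistics import NormalDist


def mr_sample_size(r2, odds_ratio, case_fraction=0.5, alpha=0.05, power=0.80):
    """Approximate total N for an MR analysis of a binary outcome.

    ASSUMPTION: standard asymptotic approximation, not confirmed as the
    method used in the source. r2 is the proportion of exposure variance
    explained by the instrument; odds_ratio is the assumed causal OR.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z ** 2 / (r2 * log(odds_ratio) ** 2 * case_fraction * (1 - case_fraction))


def round_up_100(n):
    """Round a sample size up to the next multiple of 100."""
    return ceil(n / 100) * 100


for label, odds_ratio in [("CVD", 0.75), ("CHD", 0.71)]:
    for score, r2 in [("mothers", 0.003), ("offspring", 0.007)]:
        n = round_up_100(mr_sample_size(r2, odds_ratio))
        print(f"{label}, {score} PGRS (R2 = {r2:.1%}): N = {n:,}")
```

The formula makes the dependence on instrument strength explicit: required N is inversely proportional to R², which is why the offspring score (0.7%) needs fewer than half the participants required for the mothers' score (0.3%).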

Strengths and limitations
ALSPAC is a well characterised birth cohort with repeated measures of alcohol use, which have been used in several other studies of substance use [39][40][41]. Moreover, the available data were collected on mothers and their offspring over the course of 20 years. As such, they are an excellent resource to investigate alcohol behaviour over time. The nature of this dataset allowed for the use of repeated measures to strengthen the phenotype. Furthermore, the wealth of additional data allowed for detailed sensitivity analyses and examining of a wide range of potential covariates to test for pleiotropic effects of the alcohol variants and the derived PGRSs. An additional strength comes from the way our PGRS was constructed. By taking a limited number of SNPs that have previously shown some evidence of association with alcohol behaviours (either from GWAS or functional literature) we have developed a PGRS that has a reduced number of SNPs compared to the numbers that might be required if using p value cut offs from whole GWAS. The advantage to this approach is that the resulting PGRS is less likely to have pleiotropic effects than one from a deep GWAS list. Furthermore, this finite number of SNPs would therefore be more cost effective to genotype and could, therefore, be feasibly used in a study that does not have access to genome wide data.
There are also some limitations which need to be considered when interpreting the results of this study. First, the set of SNPs identified through searches came from GWAS analyses of alcohol dependence [21][22][23][24] and so might not show an association with our phenotype (units of alcohol per week). Meanwhile, the functional literature reports the role of genetic variants in metabolism but does not extend the effects of these genes to alcohol consumption. Second, alcohol consumption questions were not uniform over time; however, sensitivity analyses excluding data from a different version of the questionnaire returned similar results. Third, our data on alcohol consumption are based on self-report and so may be subject to misclassification. However, there are currently no reliable biological alternatives for alcohol use in a general population sample [42], with current biomarkers only being able to identify long term heavy use [43,44]. One might expect to find that the direction of bias differs in the two populations of mothers and offspring, as mothers might underreport their use (negatively impacting estimates), while offspring might over-report their consumption (positively impacting estimates) [45][46][47]. Fourth, there was loss to follow up, with greater proportions of missing data in later questionnaires, which reduced statistical power and could lead to selection bias if alcohol consumption is related to the loss to follow up. This drop in sample size also meant that stratifying the offspring analysis by gender would reduce the power; however, we did adjust for gender in this analysis. Furthermore, we were unable to examine the association between SNPs and alcohol consumption in adult males (i.e. fathers) as their genetic data were not available. Finally, we found no suitable independent cohort study with life course alcohol consumption data for testing PGRS performance and hence we used cross-validation in ALSPAC. However, it has previously been reported that sample sizes such as those used in this analysis are adequate when using two separate 'training' and 'testing' samples [48].

Findings in relation to other research
Burgess and colleagues suggested that variants with known biology are better for use in MR studies [49]. The underlying biology of some of the SNPs (those selected from functional literature) included in our analysis is known, and linked to changes in alcohol metabolism, and as such, would be better for use in a PGRS MR analysis. One SNP (rs1229984 in ADH1B) was consistently identified as being associated with alcohol consumption in ALSPAC mothers. This SNP has previously been used in an MR framework, after Holmes and colleagues validated it as a genetic instrument by providing solid evidence for association with various alcohol phenotypes (including units of alcohol per week) in a sample of >200,000 participants [9]. In their estimate, carriers of the minor allele consumed 17.2% fewer units per week than noncarriers, which is very similar to our result of 0.177 fewer log units per week (equivalent to 16.2% fewer units per week).
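The conversion from a coefficient on the log scale to a percentage difference can be checked directly. This sketch assumes natural logarithms were used for the transformation, which is consistent with the 16.2% figure quoted above.

```python
import math

# Per-allele effect of rs1229984 on log units per week (mothers, from the text).
beta_log_units = -0.177

# A difference of beta on the natural-log scale is a ratio of exp(beta)
# on the original scale, i.e. a percentage change of (exp(beta) - 1) * 100.
percent_change = (math.exp(beta_log_units) - 1) * 100

print(f"{percent_change:.1f}% change in units per week per minor allele")
```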
The set of SNPs included in the mothers PGRS and offspring PGRS were different, possibly due to age and gender effects. Previous literature has suggested that the heritability of alcohol consumption changes across the life course. Estimates start to increase at the age of 15 years and peak in the mid-20's [14][15][16][17]. It is therefore possible that the offspring are so young that their genetic potential to abuse or avoid alcohol is not yet fully expressed. Conversely, the age of the ALSPAC mothers at baseline ranged from 14 to 46 years (mean = 28 years), with these individuals being followed up for 20 years. Since the mothers' longitudinal analyses cover a wide range of ages across the life course, it is not possible to make assumptions about the impact of age on the PGRS composition or the proportion of the variance it explains, in relation to the offspring's PGRS. Additionally, gender differences may also have a role if there are systematic differences in alcohol consumption by gender. However, stratifying by gender here would reduce power in the analysis.
PGRS have previously been used in an MR framework to evaluate the causal effect of a number of traits/exposures, with proportions of the variance explained in the trait in the range of 1.5-3% (e.g. BMI: 1.5%-2.5% [50][51][52], type 2 diabetes: ~2% [53], schizophrenia: ~3% [54]). However, the proportion of the variance explained by our PGRS is comparable to the variance of age of onset of alcohol consumption explained by a previously reported PGRS [55]. In our analysis, the variation explained by the initial 89 SNPs selected was estimated to be between 0.13% and 0.76% for the ALSPAC mothers. A previous PGRS for tobacco (cigarettes smoked per day) was shown to explain 0.4-0.5% of the variance in glasses of alcohol per week [56], which is comparable in magnitude to the variance explained by our PGRS for alcohol consumption. This provides additional evidence that some genetic risk factors are shared between substances, suggesting that incorrect effect estimates could be introduced through pleiotropy. However, the lack of evidence for association between our PGRSs and potential confounders (including tobacco and other drug use) suggests minimal pleiotropy. These comparisons are limited by design differences (i.e. genome wide analysis in previous literature compared to the candidate gene approach here). However, in theory, our selection process identified a priori candidates and we would therefore expect a higher proportion of the variance to be explained in this analysis. This highlights how little we know about the genetic contribution to alcohol consumption.

Conclusion
The PGRSs developed in our analyses explained a modest proportion of the variance in alcohol consumption for both ALSPAC mothers and offspring. For future MR analyses examining the causal effects of drinking alcohol in the general population, the mothers' PGRS reported here is most likely a more suitable genetic proxy as it is based on a breadth of ages, although one limitation to this discovery sample is the inclusion of women only. Very large sample sizes, such as those from multi-study consortia, would be required if these PGRSs were to be used as genetic instruments in MR analyses.