Examination of Candidate Exonic Variants for Association to Alzheimer Disease in the Amish

Alzheimer disease (AD) is the most common cause of dementia. As with many complex diseases, the identified variants do not explain the total expected genetic risk that is based on heritability estimates for AD. Isolated founder populations, such as the Amish, are advantageous for genetic studies as they overcome heterogeneity limitations associated with complex population studies. We determined that Amish AD cases harbored a significantly higher burden of the known risk alleles compared to Amish cognitively normal controls, but a significantly lower burden when compared to cases from a dataset of unrelated individuals. Whole-exome sequencing of a selected subset of the overall study population was used as a screening tool to identify variants located in the regions of the genome that are most likely to contribute risk. By then genotyping the top candidate variants from the known AD genes and from linkage regions implicated previous studies in the full dataset, new associations could be confirmed. The most significant result (p = 0.0012) was for rs73938538, a synonymous variant in LAMA1 within the previously identified linkage peak on chromosome 18. However, this association is specific to the Amish and did not generalize when tested in a dataset of unrelated individuals. These results suggest that additional risk variation in the Amish remains to be identified and likely resides outside of the classical protein coding gene regions.


Introduction
Alzheimer disease (AD) is the most common cause of dementia, the global loss of cognitive ability beyond the normal changes associated with aging. The prevalence of AD for individuals aged 85 years and older is 32%, and the number of people with AD is predicted to triple by 2050 [1]. The World Health Organization lists AD as the 4 th leading cause of death in high-income countries [2]. AD is generally categorized as early onset at age 65 or below or late onset (LOAD) after the age of 65. Despite the high prevalence and associated death, much is unknown about the cause and pathogenesis of this neurodegenerative disorder.
Most genetic studies evaluating LOAD risk are performed in general population studies, introducing analysis and interpretation problems due to heterogeneity. To further the understanding of this disease, we studied the genetically isolated Amish communities of Ohio and Indiana to identify additional genetic variants that contribute to disease risk. The population bottleneck that occurs when a small group of individuals establishes a separate subpopulation and creates a founder effect. The random drift that occurs in this new subpopulation may change disease prevalence, reduce effective population size, alter allele frequencies and change patterns of linkage disequilibrium. In particular, the Amish are more genetically homogenous because members of these communities marry within their culture, thus limiting the amount of new genetic variation introduced from the general population. Additionally, due to their strict lifestyles, environmental exposures are more homogenous. For example, the older Amish have generally led an agricultural lifestyle, achieved similar levels of education, and consumed similar diets. These factors make the Amish populations advantageous for genetic studies by further controlling for non-genetic heterogeneity.
Since these isolated populations differ from the general population, the specific variants identified through general population studies may not be present in the same frequency or have the same effect in the Amish. Conversely, since the Amish are a genetic subset of the general European Caucasian population, it is expected that the same genes and pathways implicated in the Amish will also be associated and confer risk in the general population. By studying the genetics of these isolated populations, the limitations introduced by the heterogeneity in broad population studies can be overcome and variants or genes that help explain the missing heritability of LOAD can be identified.
If the known risk alleles do not contribute the same risk in the Amish, it is hypothesized that the Amish cases will have a significantly lower burden of risk alleles when compared to the LOAD cases from a dataset of unrelated individuals. However, if all or a subset of the known genetic risk alleles do contribute to disease risk in the Amish, the Amish cases should tend to have a significantly higher genetic risk score than the Amish cognitively normal controls.
To identify additional exonic variation harbored by the Amish that may contribute to disease risk in this isolated population, this study used whole-exome sequencing of a selected subset of the overall study population as a screening tool to identify variants harbored in the regions of the genome that are most likely to contribute risk. By then genotyping the top candidate variants from this screen in the full dataset, there is more power to detect an association between the variant and phenotype of interest (Fig. 1). It is hypothesized that the top candidate exonic variants will be associated with LOAD risk in the Amish.

Study populations and clinical data
There were two significant waves of immigration that established the Amish communities in the United States, the first wave occurred in the 1700s with Swiss Anabaptists settling in Pennsylvania and the second in the 1800s when individuals immigrated to Ohio and Indiana from Europe and the Pennsylvanian settlements [23]. The full dataset for which samples have been collected is comprised of individuals from the Amish communities in Adams, Elkhart and La-Grange Counties in Indiana and Holmes County in Ohio. Individuals were ascertained through public directories, public notices, and referrals from previously enrolled participants. Over 30% of the Amish populations over the age of 65 have been contacted and 87% of these individuals have consented to participate in the study. The Modified-Mini Mental Status (3MS) exam was used to screen individuals during the initial interviews [24]. Information from these baseline screens and additional cognitive testing were used to generate a consensus diagnosis according to the National Institute of Neurological and Communicative Disorders and Stroke (NINCDS) and the Alzheimer's Disease and Related Disorders Association (ADRDA) criteria [25]. Methods for ascertainment were reviewed and approved by the individual Institutional Review Boards of the respective institutions. Sample collection, DNA extraction, cognitive testing and affection statuses derived from the consensus diagnoses followed procedures detailed in Individuals were selected from the full Amish dataset for whole-exome sequencing. The variants identified from these data were used to screen three classes of variants, genes that are very near or contain GWAS hits or carry early-onset mutations, genes within four candidate linkage regions implicated by previous studies and genes that harbor variants that occur uniquely in cases or in controls. The top variants from these three classes were then genotyped in the full dataset and case-control association was performed.
previous studies conducted in these populations [26]. The full Amish dataset comprises 1,119 individuals sampled.
The Anabaptist Genealogy Database (AGDB) combined and digitized genealogy records and books that include multiple individuals and families [27]. This resource generated an allconnecting pedigree consisting of over 5000 Amish members. From this pedigree, the relationship status and degree of relatedness can be determined for all individuals in the full dataset.
As a comparison dataset for the genetic burden and variant association analyses, cases and cognitively normal controls ascertained from a general clinical population were studied [18]. A collaborative study between researchers at the University of Miami and Vanderbilt University has ascertained Caucasian individuals affected with LOAD unique from the Amish populations also studied. These individuals have been diagnosed with probable or definite AD according to NINCDS-ADRDA criteria with an age of onset greater than 60. To make these diagnoses, documentation or a clinical history of significant cognitive impairment was present. Age-and gender-matched cognitively healthy controls were ascertained from the same regions and had a documented 3MS or MMSE score in the normal range. As the Amish are founded from European immigrants, this European-American dataset of unrelated individuals is of similar ancestry.

Ethics Statement
This research study has been continually approved by the Institutional Review Board at Vanderbilt University. This sub-committee determined the study poses minimal risk to participants and approved the Consent Form, Application for Human Research and the Protocol. Written consent was given by participants for their information and history to be used in this study.

Analysis of genetic risk score
To determine if the known genetic etiology of LOAD also impacts LOAD in the Amish, the total genetic risk score using known LOAD risk alleles was calculated and compared across affection groups. Twenty-one SNPs that reached genome-wide significance in the most recent LOAD genome-wide association study (GWAS) were genotyped, 17 of which passed the QC for the follow-up genotyping phase, in the full Amish dataset (Table 1) [16]. Previous genotype data for APOE were used [26]. The weighted genetic risk score was calculated by multiplying the number of risk alleles at each marker by the weight (proportional to its published effect size) for that marker, and then summed across all markers [16]. In addition to comparing across Amish affection statuses, cases and cognitively normal controls ascertained from a general clinical population were compared (Table 2). Moreover, Amish LOAD cases were compared to unrelated LOAD cases and Amish cognitively normal controls to unrelated cognitively normal controls. Logistic regression was performed (R, version 3.0.2) to determine if total genetic risk score was correlated with affection status or population dataset.

Selection for sequencing
From the larger dataset of 1119 Amish individuals, 176 individuals, 59 AD cases, 68 unaffected controls and 49 unknowns, were selected for whole-exome sequencing for several different studies (LOAD, Parkinson's disease and age-related macular degeneration). Individuals were chosen for the LOAD using the following prioritization (a) large sibships with both affected and unaffected individuals, (b) close relatives of sibships in (a), (c) APOE genotype, and (d) members of subpedigrees with high linkage results from previous studies [26]. We hypothesized that these 59 AD cases are most likely to harbor unidentified variants that confer risk to LOAD.

Sample preparation and exome sequencing
Paired-end whole-exome sequencing was performed on DNA extracted from blood. Two sequencing sites, the Genome Sciences Resource at Vanderbilt University Medical Center and the sequencing core of the Center for Genomic Technology at the Hussman Institute for Human Genomics (HIHG) at the University of Miami Miller School of Medicine, performed

Processing of raw sequences and calling of variants
Sequence processing consisted of aligning reads, removing duplicates, realigning around local indels, recalibrating quality scores and calling of variants. Using BWA (version 0.6.2), raw sequences reads were aligned to the UCSC hg19 human reference genome. Picard tools (version 1.74) was used in the process of marking duplicates. All steps performed in the Genome Analysis Tool Kit (GATK, version 2.1-10) followed the best practices available at the time of processing, these consisted of local realignment around indels, base recalibration, multi-sample variant calling (using the UnifiedGenotyper), and variant recalibration. The reference bundle was downloaded from GATK and used for all processing steps. Data management was performed using vcftools (version 0.1.9) for additional QC steps. Samples with an average depth less than 30 (n = 9), samples with a concordance rate less than 90% with previous genotyping (n = 3), and samples with discordant genders between the sequencing and the genders recorded in the clinical data (n = 2) were removed. There were a total of 170,849 variants called across the 162 whole-exomes (53 LOAD cases, 65 cognitively normal controls and 44 of unclear/unknown status) that passed the above QC measures. Of these variants, 153,272 passed the processing filter of a minimum phred-scaled quality threshold of 10. To control for missing data, variants with a calling efficiency of less than 80% were removed from analysis. Only biallelic markers were analyzed as the software used to test for association while correcting for the pedigree structure is restricted to biallelic markers. After this QC, 162 individuals and 79,203 exonic variants were analyzed (Table 3). This QCed dataset was 99.1% concordant for 8,268 exonic variants overlapping with previous genotyping and individuals were sequenced at an average depth of 58.60 ± 13.53.

Prioritization of identified variants
To overcome low power due to small sample size in the initial screening population, variants from three classes of genes were prioritized for follow-up analysis in the full dataset. Class one included 26 genes previously implicated in LOAD through GWAS and early-onset mutations [14][15][16][17][18][19]. Class two genes resided in four previously identified candidate linkage regions in this Amish dataset [26]. Class three genes harbored variants that occurred uniquely in cases or in controls. There was no overlap between the previously implicated AD genes and genes under the four linkage peaks. A total of 56 variants (25 in AD genes, 30 in linkage regions, and 1 unique to cognitively normal controls) were identified from the sequencing data and thus were prioritized for genotyping in the full Amish dataset (S1 and S2 Tables). The criteria for prioritization were a nominally significant association p-value (< 0.01) in the sequencing data or because the variant was not present in three catalogs of human variation (dbSNP build 137, ESP 6500 release, and 1000 Genomes April 2012 release).

Genotyping of selected variants
Fifty-four of the prioritized variants were genotyped in the full Amish dataset using three pools designed for the Sequenom iPLEX Gold assay on the MassARRAY platform. This technology is based on a single-base primer extension reaction coupled with mass spectrometry. The remaining two variants were genotyped via pre-designed TaqMan assays that contain allele-specific primers and fluorescent probes. In addition to the variants identified from the sequencing experiments, the 21 GWAS hits used for the genetic risk score analysis were genotyped in these pools in both the Amish and the unrelated datasets. Seven variants (two GWAS hits and five sequencing variants) failed genotyping. Of the remaining 70 variants, two failed to validate and thus were monomorphic in the larger dataset. Additionally, two variants with low efficiency and one multiallelic marker were dropped from analyses. One variant had low concordance with the sequencing data and genotypes were manually called from the cluster plots. This resulted in 48 sequencing variants and 17 GWAS hits passing these QC measures (Table 4). A total of 1,119 unique Amish samples were genotyped for the 77 variants. Eighty-three samples were dropped due to a genotyping efficiency below 95%. Two individuals were dropped from analysis for low concordance between follow-up genotyping and the sequencing data. To calculate kinship coefficients to adjust for relatedness, individuals not currently in the AGDB and those who were not in the subsequent all-connecting pedigree were removed. This resulted in 921 samples passing all QC measures (Table 5).

Analysis of single variant in cases versus controls
Case-control association in the Amish was performed using the Modified Quasi-Likelihood Score (MQLS) test, which corrects for the relatedness of individuals [28]. This program uses the kinship coefficient, a measure of relatedness between two individuals, to account for the pedigree structure. Additionally, this method allows for the inclusion of samples with unknown or unclear affection status, increasing the overall sample size. Type 1 error rates for the method are not inflated when used for the Amish [29]. This association test was used for both stages of the study, analysis of sequence variants and follow-up genotyping. A conservative Bonferroni correction for the number of tests performed in each stage was used to determine the threshold for the level of significance. To generalize the results of any significant association in the follow-up genotyping phase, logistic regression was performed in PLINK (version 1.07) with APOE as a covariate to test for association in the unrelated dataset.

Power studies
Power is dependent on a number of variables, including sample size, allele frequency and effect size. The small sample size of the sequencing dataset (162 exomes passing QC measures) is likely to be too small even to detect an association for a common allele with a moderate effect size. For example, if 162 unrelated cases and an equal number of controls were sequenced or genotyped for a variant with a minor allele frequency of 5% and an OR of 2, the power to detect an association is only 34.7% if the type I error rate is 0.05. This estimate assumes individuals are unrelated and therefore is an overestimate of the power in this population of related individuals. By genotyping the prioritized variants in over 1,100 samples, the power limitations of the screening population may be overcome and associations may be detected. Previous studies in this Amish population investigated this software's power to detect associations [29]. For dominant and additive models, there was greater than 90% power to detect an association at p < 0.05 when the simulated odds ratio (OR) was at least 2 and the minor allele frequency was held constant at 0.2. For genome-wide data, the Bonferroni-corrected p-value is traditionally 5 x 10-8. If the OR is 5, there was 90% power to detect an association for dominant and additive models, but this power dropped significantly, less than 5%, if the OR was less than or equal to 2.
In the unrelated dataset, there was at least 90% probability to detect an association, if present, when the effect size was at least 1.25 with a type I error probability of 0.05. The association program used in the Amish, MQLS, does not calculate an OR or effect size for the variant being tested so this power calculation is an estimate that may vary based on the true effect size.

Analysis of genetic risk score
Total genetic risk score was calculated for each individual in the study population to evaluate the genetic contribution of known risk loci in this population (Fig. 2). Amish cases harbored a significantly higher burden of the known risk alleles compared to Amish cognitively normal controls (logistic regression, p = 1.01 x 10 -6 ). As expected, the unrelated cases also had a significantly higher burden when compared to the unrelated cognitively normal controls (p < 2 x 10 -16 ). When compared to unrelated cases, Amish cases had a significantly lower burden of known risk alleles (p = 1.60 x 10 -7 ). Cognitively normal Amish controls were not different from the unrelated controls (p = 0.71). The difference between Amish LOAD cases (μ = 0.45 E4 risk alleles) and unrelated cases (μ = 0.82) was even greater when APOE was evaluated independent of the GWAS hits (p = 9.76 x 10 -8 ).

Case-control analysis of sequencing data
To focus further analysis on the variants most likely to contribute genetic risk to LOAD in the Amish, we sequenced 59 cases, 68 controls, and 49 unknowns most likely to harbor unidentified variants that confer risk to LOAD. The exomes sequenced harbored 155 variants in the known AD genes ( Table 6). The most significant association p-value among these genes was 0.0098 for position 10,054,789 on chromosome 19 in ABCA7. This missense variant was not present in the three catalogs of human variation queried. Within the candidate linkage regions, 557 variants were identified and the most significant p-value was 0.00017 on chromosome 3 ( Table 7). After correcting for the total number of variants tested in these three classes, no variant reached an experiment-wide level of significance.
In addition, and as a secondary screen, single variant case-control analysis was performed for all 79,203 exonic sequencing variants to test for possible association with LOAD in the Amish. The most significant p-value was 1.25 x 10 -6 for positions 102,762,544 on chromosome 10 and 91,503,598 on chromosome 15. Thirteen additional exonic variants had p-values less than 1 x 10 -4 ( Table 8). None of these reach classical levels of genome-wide significance when corrected for multiple comparisons.

Case-control analysis of selected variants
To verify the sequence variants and to evaluate them in the full dataset, the 56 most significant candidate variants from the known AD genes and implicated linkage regions were genotyped and tested for association with LOAD in the Amish (S3 Table). No variant passed the significance threshold when corrected for 48 tests (p < 0.00104). The most significant result (p = 0.0012) was for rs73938538 (MAF 0.087), a synonymous variant in LAMA1 within the linkage peak on chromosome 18. No other variant was significant at a threshold of p < 0.05. Seven of the 48 markers had a p-value less than 0.1 (Table 9). To determine if the rs73938538 association generalized in a dataset of unrelated cases and controls, the variant was genotyped in 473 LOAD affected individuals and 498 cognitively normal controls. When the variant was tested for association with LOAD, it failed to replicate (logistic regression with APOE as a covariate, p = 0.28). In this unrelated dataset, the minor allele frequency (MAF) in cases was 0.081 and 0.094 in controls, which is the opposite direction of effect of the minor allele in the Amish. This result was consistent in a large consortium-derived meta-analysis of 74,046 individuals investigating 7,055,881 genotyped and imputed SNPs (p = 0.21) [16].

Discussion
The results of the genetic risk score analysis indicate that the common variants so far implicated by GWAS in European Caucasian populations explain a smaller proportion of genetic risk in the Amish than in the general population. The results of the genetic burden analysis of APOE only support the previous association of the E4 allele with LOAD in the Amish [26]. In the Amish from Elkhart, LaGrange and Holmes Counties, the APOE E4 allele has a frequency of 0.18 in cases, but in cases from the general Caucasian population this risk allele frequency is 0.42 [12,26]. This allele frequency disparity may in part explain the increase in difference in genetic burden between cases from the two datasets when only APOE was analyzed, but additional factors are likely to contribute as well. Although there is evidence that prevalence of dementia in the Amish may be somewhat lower than in the general population, the small sample size in these studies generates a very large confidence interval [30,31]. Our experience suggests that the prevalence is likely to be greater than these published reports. Further, these results indicate that genetic variation other than those already described is responsible for at least some of the Amish dementia.
Since Amish cases did have a higher burden when compared to cognitively normal controls from the same population, it can be assumed these known risk loci do explain some of the expected genetic effects. The concordance of the genetic risk scores between the general population cognitively normal controls and the Amish controls indicates that the Amish cognitively normal population is similar to their general population counterparts and is consistent with their shared ancestry.
A synonymous variant in LAMA1, rs73938538, is associated with LOAD in the Amish just below experiment-wide significance. While this association did not generalize in the unrelated dataset, a potential relationship between this variant and risk for LOAD is supported by the relevant function of this gene to LOAD pathophysiology. LAMA1 encodes the laminin alpha subunit. Laminin is a major functional component of the basement membrane of many tissues, including the endothelium of blood vessel walls, and different isoforms may contribute to vascular homeostasis [32]. The alpha1 subunit of laminin is expressed in the basal lamina of blood vessels in the central nervous system, mostly confined to capillary walls [33]. There is strong evidence to suggest the etiology of LOAD may include cerebrovascular dysregulation and that the neuronal degeneration is secondary to this dysregulation [34,35]. This synonymous variant encodes for a more common valine codon (GTG) which has a frequency of 2.91 in highly expressed human genes and 2.78 in all human genes than the referent allele does (GTT) which has frequencies of 1.12 and 1.11, respectively [36,37]. The association of the synonymous variant rs73938538 with LOAD in the Amish suggests that inefficient translation or abnormal cotranslational folding of a protein important for cerebrovascular homeostasis and dysregulation may contribute to the underlying pathology and degeneration in this isolated population. This suggests LAMA1 as a candidate gene for follow-up studies to further explore the relevance and consequences of the association of this variant and LOAD.
In the QCed sequence dataset comprised of 53 cases and 65 cognitively normal controls, the power to detect an association, if present, was limited by the allele frequency and the effect size of the variant. For a rare variant with a MAF of 1%, the probability of detecting an association was 90% if the OR was at least 26, an extremely large effect size. If the MAF was 3%, this study was at least 90% powered to detect an association if the OR was 11. If the variant was more common with a MAF of 5%, the sequence dataset was at least 90% powered to detect an association with a variant with an OR of 8. Because of this low power in the sequence dataset, the top candidate variants were selected for genotyping in the larger Amish dataset comprised of 126 cases and 503 controls. In this dataset, the study was at least 90% powered to detect an association with a variant that had a MAF of 1% and an OR of 8. It was at least 90% powered to detect an association if the MAF was 3% and the OR of the variant was 4. For a common variant with a MAF of 5%, the study was at least 90% powered to detect an association if the OR was 3.2. These power calculations are based upon a type I error rate of 0.05 and assume the samples are unrelated. The samples studied in the Amish datasets are related to one another, so these calculations are only estimates of the true power in these two datasets. Thus out data indicate that the Amish do not carry low frequency exonic variants contributing to their risk of dementia.
The lack of generalization in the unrelated dataset may be due to several reasons. First, the association detected in the Amish population may be a false positive and therefore not a true association. If this is true, the association should not be detected in any other study population or dataset. However, there was at least 90% probability to detect an association, if present in the unrelated dataset, if the true effect size is between 0.125 and 0.5 for the given sample size (n = 971) with a type I error of 0.05. Second, the association may true and detected because the Amish have a more homogeneous background. Third, the phenotype-genotype correlation may have arisen separately in the Amish after the founding of the population and could therefore be unique to this genetically isolated population. Fourth, the association with this variant and LOAD may be the result of an interaction with another genetic variation or a component of the environment that is unique to the Amish culture or way of life. If this interacting factor was untested or unaccounted for in this study, and therefore not reproduced in the unrelated dataset, the association may not be detected.
Six genes had multiple variants in the follow-up genotyping. In a large dataset, gene-burden analysis might be illuminating. However, the relatively small size of the Amish dataset and the low frequency of the observed variants conspire to severely underpower such an analysis.
The genetic risk score results suggest that the known LOAD risk loci explain a smaller proportion of the genetic risk in the Amish than in the general population. The targeted association results suggest that additional exonic variation in associated LOAD genes and regions implicated by previous linkage studies does not contribute risk to LOAD in the Amish, beyond the possible association with LAMA1. Other areas of the genome, intronic regulatory elements, epigenetic modifications, or previously unassociated genes, may be harboring variation that confers susceptibility in the Amish but were not interrogated by this study. Additional studies examining the non-exonic variation of known AD genes and the candidate linkage regions, as well as other portions of the genome, are likely to identify new variation that confers susceptibility to developing LOAD in the Amish.
Supporting Information S1