A hierarchical Bayesian model to predict APOE4 genotype and the age of Alzheimer’s disease onset

In this work we use a hierarchical Bayesian paradigm to introduce a theoretical framework for estimating an individual’s apolipoprotein E ε4 (APOE4) genotype, which heavily influences both the age of onset and the probability of acquiring Alzheimer’s disease (AD). This estimate is based solely on the individual’s family history. The APOE4 genotype estimate is then combined with a number of known factors that influence AD onset to produce a function that estimates the probability of AD onset as a function of age. We have made our Alzheimer’s predictive tool available online at http://www.alzheimerspredictor.com.


Introduction
Alzheimer's disease (AD) is a devastating age-related neurodegenerative disease. Its devastation lies in its ability to impair the memory and cognitive function of people who have lived to an elderly age by having avoided or conquered many other mortal diseases such as cancer, cardiovascular and cerebrovascular diseases, and respiratory infections. This cognitive impairment requires the allocation of extensive familial and societal resources to care for the afflicted individual [1]. Many people are interested in obtaining genetic testing to determine their likelihood of suffering from AD, and a variety of corporations have begun to offer genetic testing services to fulfil the demand. The results of these tests have not been demonstrated to cause significant emotional harm to subjects [2]. Genetic testing involves the collection of blood or other genetic material to detect mutations in the APP, PSEN1, or PSEN2 genes in the case of early onset AD (EOAD) or the APOE-ε4 (APOE4) allele in the case of late onset AD (LOAD) [3]. While these genetic tests are very accurate in determining genotype, they are costly and generally fail to predict the age of onset of AD because they do not correct for other factors such as a history of Type 2 Diabetes Mellitus (T2DM) or traumatic brain injury (TBI). A number of factors have been demonstrated to affect the probability of AD and the age of onset. Major factors that influence age of AD onset include T2DM [4,5], TBI [6,7], education level [8], and race [9]. Additionally, previous studies have found observational links between age of onset and a variety of foods, mental activity, and physical exercise [10].
A number of other reports have successfully demonstrated the use of various data to predict the onset of AD. Tierney et al. used neuropsychological tests to predict AD within 5 years with a sensitivity and specificity above 70% [11]. Callahan et al. used clinical memory test scores and imaging biomarkers to derive a predictive model for AD [12]. Macdonald et al. derived a mathematical model of AD based on APOE status [13], based largely on the earlier work by Farrar et al. [14].
In this work, we propose a theoretical framework for predicting AD onset by applying a hierarchical Bayesian model to a subject's familial history to predict their APOE4 genotype, the major genetic risk factor for LOAD. First, the model considers the familial history of the subject dating back two generations, i.e. parents and grandparents. Then, following the APOE4 genotype estimate, known hazard ratios are applied to determine a mathematical model of the probability of AD as a function of age, correcting for the covariates sex, race, diabetic history, traumatic brain injury, physical activity, diet, and education level. Important limitations of this approach are addressed in the Discussion section of this report.
Our novel AD prediction model provides a theoretical predictive tool that corrects for various factors not considered in existing genetic tests without the costs associated with genetic tests. Our model is accessible online at http://www.alzheimerpredictor.com for general interest purposes.

Predicting the APOE-ε4 genotype
To obtain the most information about the APOE4 genotype, we began by estimating the genotype of each grandparent using Bayes' theorem [15] to determine the conditional probability that, given a particular age of AD onset, the person has an APOE4 +/+, +/-, or -/- genotype. The APOE4 genotype frequencies within the general population of 86.5% for -/-, 9% for +/- and 4.5% for +/+ [16] formed the prior probabilities for our calculations.
The conditional probability of an individual having a certain APOE genotype based on the age of AD onset, between 45 and 95 years of age, P(Genotype|AGE), was solved using Bayes' theorem,

$$P(\mathrm{Genotype} \mid \mathrm{AGE}) = \frac{P(\mathrm{AGE} \mid \mathrm{Genotype}) \cdot P(\mathrm{Genotype})}{P(\mathrm{AGE})}$$

where P(AGE|Genotype) is the probability of having AD at a specific age based on the genotype, P(Genotype) is the prior probability of having a specific APOE4 genotype, and P(AGE) = Σ (P(AGE|Genotype) · P(Genotype)). For example, the conditional probability of a subject having an APOE4 +/+ genotype is given by the equation:

$$P(\mathrm{Genotype}_{++} \mid \mathrm{AGE}) = \frac{P(\mathrm{AGE} \mid \mathrm{Genotype}_{++}) \cdot P(\mathrm{Genotype}_{++})}{P(\mathrm{AGE})}$$

where P(Genotype_{++}) is the prior probability of an individual having an APOE4 +/+ genotype, which is equal to the frequency of the APOE4 +/+ genotype in the population (i.e. 4.5%).
Within the population, the mean age of onset is 68 ± 8.2 years for APOE4 +/+ individuals, 76 ± 8.2 years for +/- individuals, and 84 ± 8.2 years for -/- individuals [3,16]. Assuming a Gaussian distribution for the age of AD onset, the probability of an individual with a given genotype, and hence a mean onset age of MeanOnsetAge ± σ, having the onset of AD symptoms at age AGE was calculated as

$$P(\mathrm{AGE} \mid \mathrm{Genotype}_g) = \frac{1}{\sigma_g \sqrt{2\pi}} \exp\left(-\frac{(\mathrm{AGE} - \mathrm{MeanOnsetAge}_g)^2}{2\sigma_g^2}\right)$$

where σ_g is the standard deviation of the age of onset, AGE is the age of onset, and MeanOnsetAge_g is the mean age of AD onset. The subscript g indicates a generic genotype. To calculate the probability for a given genotype, the values for that genotype are used. For example, to calculate the probability for the APOE4 +/+ genotype, P(AGE|Genotype_{++}), σ_{++} and MeanOnsetAge_{++} are used. P(AGE) is given by the equation:

$$P(\mathrm{AGE}) = \sum_g P(\mathrm{AGE} \mid \mathrm{Genotype}_g) \cdot \mathrm{PopFreq}_g$$

where PopFreq_g is the population frequency of genotype g. These equations are used to calculate the conditional probability of an individual having a given APOE4 genotype as a function of the age of onset of AD symptoms.
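The Bayesian update described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the online tool: the mean onset ages, the standard deviation, and the population frequencies are the values quoted in the text, while the function and variable names are assumptions introduced here. Returning the prior when no onset age is known is also an assumption, based on how the model is described as handling relatives of unknown AD status.

```python
import math

# Values quoted in the text (mean onset age in years, shared SD, population priors).
MEAN_ONSET = {"+/+": 68.0, "+/-": 76.0, "-/-": 84.0}
SIGMA = 8.2
POP_FREQ = {"+/+": 0.045, "+/-": 0.09, "-/-": 0.865}


def gaussian_pdf(age, mean, sigma):
    """Gaussian density, used here as the likelihood P(AGE | Genotype)."""
    z = (age - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def genotype_posterior(onset_age, prior=POP_FREQ):
    """Return P(Genotype | AGE) for each genotype via Bayes' theorem.

    If onset_age is None (AD status unknown), the prior is returned unchanged;
    this fallback is an assumption of the sketch.
    """
    if onset_age is None:
        return dict(prior)
    # Numerator of Bayes' theorem for each genotype.
    joint = {g: gaussian_pdf(onset_age, MEAN_ONSET[g], SIGMA) * prior[g]
             for g in prior}
    # P(AGE) is the sum of the numerators over all genotypes.
    p_age = sum(joint.values())
    return {g: joint[g] / p_age for g in joint}


if __name__ == "__main__":
    for age in (60, 75, 90):
        print(age, genotype_posterior(age))
```

Passing a different `prior` allows the same function to be reused for later generations, where the prior comes from the parents rather than from the population frequencies.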
We used a hierarchical Bayesian methodology to estimate the posterior probability of each genotype for each subsequent generation. The probability of each genotype for progeny was calculated by multiplying the probabilities of the genotypes of the parents by factors obtained from Mendelian statistics (e.g. ¼, ½, ¼ for the +/+, +/-, and -/- offspring of +/- × +/- parents) [17,18]. The sum of probabilities for each possible genotype was calculated and was used as the prior probabilities for the next generation.
For example, the prior probability of a subject who developed AD at a specific age having an APOE4 +/+ genotype, regardless of whether or not the parents had AD, is given by:

$$P(\mathrm{Genotype}_{++}) = P_{Pat}(++) \cdot P_{Mat}(++) + \tfrac{1}{2}\left[P_{Pat}(++) \cdot P_{Mat}(+-) + P_{Pat}(+-) \cdot P_{Mat}(++)\right] + \tfrac{1}{4} P_{Pat}(+-) \cdot P_{Mat}(+-)$$

where the subscripts "Pat" and "Mat" denote the previous generation's posterior genotype probabilities on the paternal and maternal side, respectively, and the factors 1, ½, and ¼ are the Mendelian probabilities that both parents transmit the ε4 allele. Although we calculated the probability of the subject having any of the rare autosomal dominant genotypes (APP, PSEN1, PSEN2), the very small genetic frequency of these genes resulted in a negligible posterior probability at most relevant ages, and autosomal dominant genetics was therefore omitted from our model.
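The Mendelian step can be illustrated with a short Python sketch. It assumes that each parent's genotype is summarized by three probabilities and that the two parents are independent; the function name `child_prior` and the dictionary layout are illustrative choices, not taken from the original tool.

```python
from itertools import product

GENOTYPES = ("+/+", "+/-", "-/-")

# Probability that a parent of a given genotype transmits the APOE4 (+) allele.
P_TRANSMIT_PLUS = {"+/+": 1.0, "+/-": 0.5, "-/-": 0.0}


def child_prior(paternal, maternal):
    """Combine two parental genotype distributions into the child's prior.

    paternal, maternal: dicts mapping each genotype to its probability.
    Every pair of parental genotypes is weighted by its joint probability
    and by the Mendelian chance of producing each offspring genotype.
    """
    prior = {g: 0.0 for g in GENOTYPES}
    for g_pat, g_mat in product(GENOTYPES, repeat=2):
        weight = paternal[g_pat] * maternal[g_mat]
        t_pat = P_TRANSMIT_PLUS[g_pat]
        t_mat = P_TRANSMIT_PLUS[g_mat]
        prior["+/+"] += weight * t_pat * t_mat
        prior["+/-"] += weight * (t_pat * (1.0 - t_mat) + (1.0 - t_pat) * t_mat)
        prior["-/-"] += weight * (1.0 - t_pat) * (1.0 - t_mat)
    return prior


if __name__ == "__main__":
    carrier = {"+/+": 0.0, "+/-": 1.0, "-/-": 0.0}
    # Two heterozygous parents give the familiar 1/4, 1/2, 1/4 split.
    print(child_prior(carrier, carrier))
```

Applying the function once to each pair of grandparents and then once to the resulting parental distributions (after any Bayesian update for a parent's own age of onset) gives the two-generation chain described in the text.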
A baseline hazard function was calculated by fitting the cumulative risk of AD to a Gompertz function, which is known to fit AD onset probability well [13]. The Gompertz function is characterized by three coefficients, A, k, and x. To apply the probabilities of each APOE genotype to the Gompertz function, each coefficient is replaced by its probability-weighted sum over the genotypes, where

$$\sum A \cdot P = A_{+/+} \cdot P_{+/+} + A_{+/-} \cdot P_{+/-} + A_{-/-} \cdot P_{-/-}$$

and Σ k · P and Σ x · P follow a similar pattern. P_{+/+}, P_{+/-}, and P_{-/-} are the probabilities of the respective genotypes based on the Bayesian analysis (e.g. P_{+/+} = P(Genotype_{++}|AGE)). The coefficients A, k, and x are determined by fitting the cumulative AD symptom probability function to a Gompertz function (Table 1).
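The weighting of the Gompertz coefficients can be sketched as follows. Two things in this sketch are assumptions and should not be read as the authors' values or exact formula: the parameterization A · exp(-exp(-k · (AGE - x))) is one common form of the Gompertz function, chosen here only because it uses three coefficients named A, k, and x, and the per-genotype coefficients are hypothetical placeholders because Table 1 is not reproduced in this text.

```python
import math

# HYPOTHETICAL per-genotype coefficients (A, k, x); the fitted values are in
# Table 1 of the paper, which is not available here.
HYPOTHETICAL_COEFFS = {
    "+/+": (0.9, 0.15, 68.0),
    "+/-": (0.5, 0.12, 76.0),
    "-/-": (0.2, 0.10, 84.0),
}


def weighted_coefficients(coeffs, probs):
    """Return (sum A*P, sum k*P, sum x*P) over the genotypes."""
    return tuple(
        sum(coeffs[g][i] * probs[g] for g in probs) for i in range(3)
    )


def gompertz(age, A, k, x):
    """Assumed Gompertz form: A is the asymptote, k the rate, x the shift."""
    return A * math.exp(-math.exp(-k * (age - x)))


def genotype_weighted_risk(age, probs, coeffs=HYPOTHETICAL_COEFFS):
    """Cumulative risk at a given age using probability-weighted coefficients."""
    A, k, x = weighted_coefficients(coeffs, probs)
    return gompertz(age, A, k, x)


if __name__ == "__main__":
    population = {"+/+": 0.045, "+/-": 0.09, "-/-": 0.865}
    for age in (60, 70, 80, 90):
        print(age, round(genotype_weighted_risk(age, population), 4))
```

Under this assumed form the curve levels off at A, which matches the qualitative behaviour described for the risk function: a plateau below 100% reflects the chance that an individual never develops AD.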

Gompertz probability function
Following the determination of the probability of the subject having each of the possible APOE4 genotypes and fitting these data to a Gompertz function, we applied previously determined hazard ratios to modify the Gompertz age of AD onset function. Since the hazard ratios were acquired from different sources, we adjusted them so that the "mean" hazard ratio of the population is 1. For a given factor, the adjusted hazard ratio, HR_adj, was determined from the reported hazard ratio, HR, and PopFreq, the percentage of the population with the given phenotype. The Gompertz function, adjusted for estimated APOE genotype, was then multiplied by a factor equal to the sum of the hazard ratios to yield a final factored risk function. Justification and limitations of this methodology are addressed in the Discussion section.
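One way to rescale hazard ratios so that their population mean is 1 is shown below. This is a sketch under stated assumptions, not the authors' exact equation: it assumes that each factor is split into mutually exclusive levels whose population fractions sum to 1, and that the adjustment divides each hazard ratio by the population-weighted mean. The example numbers are hypothetical and are not taken from the paper's tables.

```python
def adjust_hazard_ratios(levels):
    """Rescale hazard ratios so the population-weighted mean equals 1.

    levels: dict mapping a phenotype level to (hazard_ratio, population_fraction).
    Assumed formula: HR_adj = HR / sum(HR_i * PopFreq_i).
    """
    total_fraction = sum(frac for _, frac in levels.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("population fractions must sum to 1")
    mean_hr = sum(hr * frac for hr, frac in levels.values())
    return {name: hr / mean_hr for name, (hr, _) in levels.items()}


if __name__ == "__main__":
    # HYPOTHETICAL example: a two-level factor with a 10% prevalence.
    diabetes = {"no T2DM": (1.0, 0.9), "T2DM": (1.5, 0.1)}
    print(adjust_hazard_ratios(diabetes))
```

After this rescaling, an "average" member of the population carries a hazard ratio of 1 for the factor, so applying the adjustment leaves the baseline curve unchanged at the population level.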

Validation
Our model was validated using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Correction for diet and exercise was excluded from the validation because these data were not available from the ADNI database. Only participants with a known APOE status and having sufficient data collected for analysis were included in the data analysis. The probability of AD symptoms was calculated using our tool from data supplied from AD participants (n = 250) and age-matched healthy controls (n = 1751). Mean "Risk of AD" at the current age was calculated for healthy controls and Mean "Risk of AD" at the time of AD onset was calculated for AD patients within the dataset.

Results
We begin our analysis by graphing the conditional probability, P(Genotype|AGE), of a subject having a given APOE4 genotype based on their age of onset (Fig 1). The APOE4 +/+ genotype is associated with an earlier onset than the +/- and -/- genotypes; however, it occurs much less frequently in the population than those genotypes. An early AD onset is therefore associated with the +/+ genotype, but the probability that the subject has the APOE4 +/+ genotype decreases rapidly with reported age of onset given its low prevalence in the population. Additionally, early AD onset is associated with the autosomal dominant genes APP, PSEN1 and PSEN2. Given the genetic infrequency of these genes, the probability that a subject has an autosomal dominant genotype is still quite low, even when AD symptoms appear early. Conversely, the APOE4 -/- genotype is far more common in the population and is associated with later AD onset, if AD onset occurs at all. Table 2 contains a prediction of APOE4 genotype for a hypothetical individual to serve as an example.
In Table 2, the estimated APOE4 genotype is noted at the bottom of the cell for each individual and is calculated in accordance with the methodology in the Methods section. Onset is the age of AD onset; the field is blank if there is no history of AD. Age is the current age of the individual or the age at death, although this variable is not used in the analysis.
Once the APOE4 genotype is estimated, a "baseline" age of onset Gompertz function can be plotted based on the previously calculated parameters A, k, and x, weighted in accordance with the probability of the individual having each of the +/+, +/-, and -/- genotypes (Fig 2).
Next, additional risk factors were added to modify the amplitude of the Gompertz AD risk function. The individual in the example above is a white male with a high school education, a history of diabetes, and no traumatic brain injury. He exercises regularly and eats a typical "American" diet. His hazard ratios are shown in Table 3.
The Gompertz AD onset function is multiplied by a factor of 1.22 to obtain a factored AD onset risk function (Fig 2). Table 4 contains a list of the most well-established factors that affect AD onset with their corresponding hazard ratios and references to the appropriate literature.
Fig 2. Gompertz AD onset risk function for the general population ("baseline"), the subject's estimated APOE status ("genetic risk"), and risk factoring in known risk factors ("factored risk"). Notice that the risk increases steadily until approximately age 80, at which point the risk grows at a slower rate. This is because there is a probability that this individual will never have AD. The baseline risk is calculated by applying the model to the "average" person, who has APOE4 genotype probabilities equal to those of the population. https://doi.org/10.1371/journal.pone.0200263.g002
Finally, we validated our theoretical model against actual data acquired from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. AD patients (n = 250, μ = 74.0 ± 7.0) were age matched to healthy controls (n = 1751, μ = 76.0 ± 7.0). Data from these participants were entered into the AD prediction calculator, and the probability of AD at the participant's current age (for healthy controls) or at the age of AD onset (for AD participants) was calculated. The mean theoretical probability of displaying AD symptoms was 12.2% for healthy controls and 15.9% for AD patients.

Discussion
In this work, we provide a theoretical framework based on a hierarchical Bayesian model for estimating an individual's APOE4 genotype and, further, for estimating the probability of AD onset as a function of age based on a number of known factors that affect AD onset. AD is a progressive disease and is heavily dependent on APOE4 genotype [2,16,22]; however, a number of modifiable lifestyle factors are also implicated in its onset and progression [10]. We can apply this approach in a hierarchical manner to estimate the genotype of an individual based on the age of AD onset of his or her parents. In this case, the parents' posterior probability of a given genotype is used as the prior probability for the subject's genotype estimation. This methodology can then be applied to subsequent generations. In our model we only go back two generations from the subject to estimate the subject's genotype, because it is rare for subjects to be familiar with the cognitive status of their great-grandparents. Additionally, cognitive history beyond two generations has little impact on the genotype estimation of a subject. It is worth noting that if the subject is not familiar with the AD status of a parent or grandparent, the model simply assigns that relative an APOE4 genotype estimate equal to the prior probability.
The first strength of our work is that it provides a low-cost, rough estimate of AD onset that is accessible to members of the general public and shows concerned individuals which modifiable lifestyle factors affect the risk of AD. Secondly, the Bayesian approach to estimating APOE4 genotype is a novel method of estimating a specific genotype and could be adapted to other diseases, for example to the prediction of BRCA genotypes for individuals with a family history of breast cancer.
In addition to these strengths, our work has a number of important limitations:
1. We added the hazard ratios from different factors (i.e. education, diet, etc.) to obtain a cumulative hazard ratio. In reality, different hazards act in a confounding, not an additive, fashion. However, we found no individuals in our dataset who had both of the greatest risk factors for AD (T2DM and TBI). The result of this limitation is that in the highly improbable case where an individual has a high probability of the APOE4 +/+ genotype, low educational attainment, and a history of T2DM and TBI, the predicted probability of AD onset would exceed 100% at a certain age. Whether a subject with multiple co-morbidities and such a poor prognosis would actually live to this theoretically certain age of AD onset is a matter of conjecture.
2. Because this model relies heavily on the APOE4 genotype prediction, this methodology only estimates AD onset, not the onset of dementia of different aetiologies such as vascular dementia.
3. AD has overlapping diagnostic criteria with other aetiologies of dementia, and differentiating AD is somewhat problematic. For this reason, an individual may have been incorrectly diagnosed with AD, thereby influencing the APOE4 genotype estimation of his or her progeny.
4. This model makes no prediction of EOAD onset.
The difference between AD onset risk for healthy individuals (12.2%) vs the risk of AD for people who displayed AD symptoms (15.9%) highlights the highly sporadic nature of LOAD. Despite these limitations, this work can still serve as a theoretical framework for future studies with sufficient resources to conduct large scale clinical trials to validate the hazards for different factors and provide a more accurate tool for the prediction of AD onset.
Supporting information
S1 File. The raw data used in this report are provided as a Microsoft Excel file. (XLSX)