Artificial intelligence and leukocyte epigenomics: Evaluation and prediction of late-onset Alzheimer’s disease

We evaluated the utility of leucocyte epigenomic biomarkers for Alzheimer’s Disease (AD) detection and for elucidating its molecular pathogenesis. Genome-wide DNA methylation analysis was performed using the Infinium MethylationEPIC BeadChip array in 24 late-onset AD (LOAD) and 24 cognitively healthy subjects. Data were analyzed for AD prediction using six Artificial Intelligence (AI) methodologies, including Deep Learning (DL), followed by Ingenuity Pathway Analysis (IPA). We identified 152 significantly (FDR p<0.05) differentially methylated intragenic CpGs in 171 distinct genes in AD patients compared to controls. All AI platforms accurately predicted AD with AUCs ≥0.93 using 283,143 intragenic and 244,246 intergenic/extragenic CpGs. DL had an AUC = 0.99 using intragenic CpGs, with both sensitivity and specificity being 97%. High AD prediction was also achieved using intergenic/extragenic CpG sites (DL achieved an AUC = 0.99 with 97% sensitivity and specificity). Epigenetically altered genes included CR1L & CTSV (abnormal morphology of cerebral cortex), S1PR1 (CNS inflammation), and LTB4R (inflammatory response). These genes have been previously linked with AD and dementia. The differentially methylated genes CTSV & PRMT5 (ventricular hypertrophy and dilation) are linked to cardiovascular disease and are of interest given the known association between impaired cerebral blood flow, cardiovascular disease, and AD. We report a novel, minimally invasive approach using peripheral blood leucocyte epigenomics and AI analysis to detect AD and elucidate its pathogenesis.


Introduction
Alzheimer's Disease (AD) is the most common form of age-related dementia, accounting for 60-80% of such cases [1]. The disorder causes a wide range of significant mental and physical disabilities, with profound behavioral changes and progressive impairment of social skills. Globally in 2015, nearly 47 million individuals suffered from AD and it is projected that 75 million will be affected by 2030, with a further rise to 131 million by 2050 [2]. The World Health Organization has therefore declared AD a global health priority [3].
AD is a complex disorder influenced by environmental and genetic factors [4,5]. Many studies have investigated the genetic basis for both early-onset AD (EOAD) and late-onset AD (LOAD) [6,7]. Genome-wide association studies (GWAS) [8] have identified several LOAD-associated risk loci [9]. Changes in proliferation in peripheral blood leukocytes, including T-lymphocytes [10], B-lymphocytes [11], polymorphonuclear leucocytes [12], monocytes, and macrophages [13], have been reported. DNA methylation plays an important role in Alzheimer's disease [14][15][16]. CpG-based biomarker analyses of leukocyte DNA methylation have been used for the early detection of many diseases, including, in our recently published work, the brain disorders cerebral palsy [17], autism [18], and concussion [19]. However, the genome-wide blood DNA methylation-based molecular mechanisms that contribute to the pathogenesis of AD remain largely unknown.
Artificial Intelligence (AI) is rapidly transforming modern life in areas as diverse as face recognition and robotics. Machine Learning (ML) is a branch of AI that focuses on computers learning and adapting from a set of data with which they have been presented. ML involves learning by computers that requires no or only minimal explicit programming by humans. An area of interest, given the geometric expansion of medical data, is the use of ML for the detection and diagnosis of various diseases [20]. ML has been reported to be superior to conventional statistical approaches for prediction, such as logistic regression and Cox proportional hazard model-based analysis [21], when interrogating mega-data. Challenges with classical statistical techniques include, but are not limited to, the requirement for an assumption of independence between predictors and the risk of overfitting and collinearity when a large number of variables are analyzed. Deep Learning (DL) is the latest developing branch of ML. DL uses multi-layered neural networks, modeled after neural networks in the brains of animals, to learn essential tasks. Thus, with minimal or no explicit human programming (unsupervised), the computer can learn intricate patterns from complex data matrices. When subsequently exposed to a new data set, it can classify and make precise predictions based on past experiences. With DL, between the input layer (raw data) and the output layer (i.e. the completed task, e.g. group classification) of 'neurons,' there are multiple hidden layers that enhance the ability to handle tasks of increasing complexity. DL more closely mimics the intellectual function of the cerebral cortex. There is an increasing interest in using DL in the analysis of biologic big-data such as genomics [22,23] to understand and accurately predict diseases. We have recently published AI/ML-based analyses of epigenomic [17] and metabolomic [24][25][26] data for accurate disease prediction.
In the present study, we used DL and other commonly used ML platforms combined with genome-wide DNA methylation analysis of leucocyte DNA for AD detection/prediction. The term 'prediction' is used here in a cross-sectional as opposed to a temporally longitudinal sense, since the samples were not obtained before the development of AD. To further explore the molecular mechanisms of LOAD, we used Ingenuity Pathway Analysis (IPA).

Materials and methods
Institutional Review Board (IRB) approval was provided by William Beaumont Hospital, Royal Oak MI, USA (IRB#2014-038). Written consent was obtained from all participants and their legally authorized representatives when applicable. The diagnosis of AD in these live subjects was made using the published NINCDS-ADRDA criteria [27]. Demographic and clinical data were extracted from the medical records (S1 Table) and compared between AD and control groups. Genomic DNA was extracted from whole blood samples using the Gentra Puregene Blood Kit (Qiagen) according to the manufacturer's protocol. Approximately 500 ng of genomic DNA was extracted from each of the 48 samples, which were subsequently bisulfite converted using the EZ DNA Methylation-Direct Kit (Zymo Research, Orange, CA) per the manufacturer's protocol and processed according to Illumina protocols. Bisulfite conversion was performed in a PCR cycling protocol (16 cycles of 95°C for 30 sec and 50°C for 60 min) and then held at 4°C.

Genome-wide methylation scan using the Infinium MethylationEPIC array BeadChips
The Infinium MethylationEPIC array (Illumina, Inc., California, USA) contains probes for >850,000 CpGs per sample. All 48 samples were processed together to minimize batch effects. Further details are given in the Supplementary Methods, which also include validation results using pyrosequencing along with primer sequences.

Statistical and bioinformatic analysis
Differential methylation was determined by comparing the β-values per individual nucleotide at each cytosine 'CpG' locus between AD subjects and controls. The p-value for the methylation difference between AD and control groups at each locus was calculated as previously described [28]. Probes associated with X and Y chromosomes were removed to negate any bias caused by gender differences. Further detailed statistical and bioinformatic analyses are described in the Supplementary section.
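The FDR thresholding applied to per-CpG p-values can be illustrated with a minimal sketch. It assumes the Benjamini-Hochberg procedure, which this text does not name explicitly (the actual method is given in [28]); the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (assumed FDR method)."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical per-CpG p-values, for illustration only.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]
fdr_p = benjamini_hochberg(raw_p)
significant = [q < 0.05 for q in fdr_p]
```

In this toy input, two loci with raw p < 0.05 no longer pass once the correction is applied, which is the behaviour the FDR threshold is meant to enforce.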

Artificial Intelligence (AI) analysis
AI analysis was performed as previously described by our group [29], using a combination of CpG sites from different genes. A total of six different AI platforms including Deep Learning (DL) were evaluated. Each CpG locus used as a marker displayed significant differential methylation in AD, defined as FDR p-value <0.05. The methylation β-values were logged and autoscaled using their standard deviation before quantile normalization to minimize sample-to-sample differences. Standard techniques were used with DL, including adjustments by the program of weights (strength of the connection between 'neurons') and biases (an additional parameter or constant) and backpropagation, all of which help to optimize the accuracy of the output or results. A softmax classifier was used to assign new labels to the samples. To tune the parameters of the DL model, the h2o package in R was used [30,31]. For the sake of comparison, standard logistic regression algorithms for AD prediction were also performed and are detailed later in the manuscript.
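The preprocessing steps named above (log transform, autoscaling, quantile normalization) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the natural log, mean-centering before division by the standard deviation, per-CpG scaling, and the handling of ties are all assumptions, and the function names and data are hypothetical.

```python
import math
import statistics

def log_autoscale(values):
    """Log-transform one CpG's beta-values, then center and divide by the SD."""
    logged = [math.log(v) for v in values]          # natural log is an assumption
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)
    return [(x - mean) / sd for x in logged]

def quantile_normalize(samples):
    """Map every sample (a list of per-CpG values) onto one shared distribution."""
    n_features = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [statistics.mean([s[k] for s in sorted_samples])
                 for k in range(n_features)]
    normalized = []
    for s in samples:
        ranks = sorted(range(n_features), key=lambda i: s[i])
        out = [0.0] * n_features
        for k, i in enumerate(ranks):
            out[i] = reference[k]                   # ties are not averaged here
        normalized.append(out)
    return normalized

# Hypothetical beta-values: rows are CpGs, columns are three samples.
cpg_by_sample = [[0.20, 0.40, 0.80],
                 [0.10, 0.30, 0.90],
                 [0.50, 0.55, 0.60]]
scaled = [log_autoscale(row) for row in cpg_by_sample]
samples = [list(col) for col in zip(*scaled)]       # transpose: one list per sample
normalized = quantile_normalize(samples)
```

After the last step every sample shares the same set of values and differs only in which CpG carries which value, which is how quantile normalization reduces sample-to-sample differences.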

Other machine learning algorithms
We compared the performance of DL to five other commonly used machine learning algorithms: Support Vector Machine (SVM), Generalized Linear Model (GLM), Prediction Analysis for Microarrays (PAM), Random Forest (RF), and Linear Discriminant Analysis (LDA) [30,32]. A comprehensive explanation of the AI methodology is provided in the Supplementary Section.

Bootstrapping
We also performed bootstrapping as an alternative to 10-fold cross-validation (CV) and compared the new results with those based on 10-fold CV. The bootstrap method involves iteratively resampling a dataset with replacement. Instead of estimating a statistic only once on the complete data, the estimate is computed many times, each time on a re-sampling (with replacement) of the original sample. We repeated this re-sampling 100 times and averaged the results.
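The resampling scheme described above can be sketched in a few lines. This is a simplified illustration that bootstraps a summary statistic over fixed per-sample outcomes rather than refitting a model on each resample; the function names, the seed, and the toy data are hypothetical.

```python
import random

def accuracy(flags):
    """Fraction of samples classified correctly (1 = correct, 0 = incorrect)."""
    return sum(flags) / len(flags)

def bootstrap_average(values, statistic, n_resamples=100, seed=0):
    """Average a statistic over resamples drawn with replacement."""
    rng = random.Random(seed)                       # fixed seed for repeatability
    n = len(values)
    estimates = []
    for _ in range(n_resamples):
        # Draw n items with replacement from the original sample.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    return sum(estimates) / n_resamples

# Hypothetical outcomes for 48 subjects: 44 correct, 4 incorrect.
flags = [1] * 44 + [0] * 4
estimate = bootstrap_average(flags, accuracy)
```

The averaged estimate stays close to the value computed once on the full sample, and the spread of the individual resample estimates indicates how stable that value is.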

Results
A total of 24 LOAD subjects and 24 cognitively healthy controls were used in this study. Selected clinical and demographic characteristics were compared between AD and control groups (S1 Table). There were no significant differences in age, gender, and common cardiovascular diseases between groups. There was a higher percentage of females in both the study and control groups, consistent with LOAD demographics; however, gender was not significantly (p = 0.53) different between groups. The MMSE (mini-mental status exam) is a psychological test commonly administered to screen for AD. As expected, the MMSE test score was significantly lower in the AD than in the control group (p = 1.54 × 10^-7). A comparison of the methylation profiles between AD and control subjects revealed 152 differentially methylated intragenic CpG sites (FDR p<0.05 and fold change ≥1.5) associated with 171 unique genes. We validated two randomly chosen CpGs by pyrosequencing and confirmed the top-ranking hits in the whole blood DNA of our cohort samples. These analyses yielded methylation data similar to those from the Illumina Infinium MethylationEPIC arrays, indicating that the initial methylation changes were not artifacts. Thirty-three intragenic CpG sites met the stringent GWAS p-value threshold, i.e. p < 5 × 10^-8 (Table 1). A total of 17 separate intragenic CpG sites had moderate to good individual predictive accuracy (AUC ≥ 0.75) for AD detection based on methylation levels. An additional 119 CpG markers displaying significant methylation differences (FDR p-value <0.05) between AD and controls are presented in S2 Table. Both hyper- (66.4%) and hypomethylation (33.6%) were observed among intragenic CpG sites in the AD cases.
A prior report found significant differential methylation of intergenic/extragenic sites in the leukocyte genome in AD [33], which correlated with performance on the MMSE. Based on this, we also evaluated the methylation changes in intergenic/extragenic CpG sites for AD prediction. Highly significant differences in CpG methylation were observed for multiple intergenic/extragenic sites throughout the genome. This was observed when using different thresholds to define statistical significance: a total of 1524 intergenic/extragenic CpGs with FDR p-value <0.05 and 103 intergenic/extragenic CpGs using a stringent threshold (p < 5 × 10^-8) were identified [34]. The top 25 intergenic/extragenic markers for AD prediction using the different statistical thresholds mentioned above are listed in Tables 2 and 3.
Principal Component Analysis (PCA) and Partial Least Square Discriminant Analyses (PLS-DA) confirmed significant segregation of AD cases from controls using intragenic CpG methylation markers (Fig 1). Permutation testing indicated that the separation observed between the AD and control groups was highly statistically significant (p < 5 × 10^-8) and not likely due to chance.
For most of our analyses, conventional statistical tools were used to first identify high-performing individual markers as indicated by AUC or FDR p-value thresholds, and these subsets of markers were then subjected to AI analyses. This approach has the advantage of reducing AI computing time and therefore costs. Prior publications suggest, however, that ML approaches might be superior to conventional statistical methods such as logistic regression analysis for group discrimination and risk prediction [35]. Thus, direct AI analysis of the entire CpG data-space may improve AD prediction.
Using the direct AI analysis approach improved the predictive accuracy. Direct analysis of 283,143 individual intragenic CpG markers improved predictive accuracy (Table 4), as did direct analysis of 244,246 intergenic (extragenic) CpGs (Table 5). Almost all ML platforms yielded a high predictive accuracy with an AUC ≥0.93. In the case of Deep Learning, using direct analysis of the intragenic markers, we observed an AUC = 0.992 with both sensitivity and specificity ≥97% for AD prediction (Table 4). For the intergenic (extragenic) markers, direct AI analysis (Table 5) yielded an AUC = 0.999 for DL with both sensitivity and specificity of 97.5% for AD prediction. Our findings suggest that direct AI analysis of the raw methylation data could perform as well as or even further improve predictive

PLOS ONE
performance compared to analysis based on high-performing individual CpG loci determined by conventional statistical approaches (see below). As noted above, we looked at the predictive performance of AI-based analysis of DNA methylation levels in intragenic and intergenic/extragenic CpG sites using individual markers that achieved different significance thresholds for AD prediction. High predictive accuracies were also achieved with these CpG markers using the significance threshold FDR p-value <0.05 (S3 and S4 Tables), followed by the stringent significance threshold p-value < 5 × 10^-8 (S5 and S6 Tables). DL appears to perform slightly better than the other ML platforms; however, much larger case numbers would be required to assess this definitively. Increasing the number of predictors to 10 or 20 CpG loci did not appear to meaningfully improve predictive performance over the use of only 5 predictors. Similarly, bootstrapping (1,000 samplings) yielded essentially similar results.
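The performance measures reported throughout (AUC, sensitivity, specificity) can be computed as in the following minimal sketch. The AUC is calculated as the probability that a randomly chosen case receives a higher score than a randomly chosen control, which is equivalent to the area under the ROC curve; the 0.5 decision threshold, the function names, and the scores are hypothetical.

```python
def auc(scores, labels):
    """AUC as P(case score > control score); tied pairs count one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))

def sensitivity_specificity(scores, labels, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are called AD."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    return tp / n_cases, tn / n_controls

# Hypothetical predicted probabilities: three AD cases (1), three controls (0).
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
area = auc(scores, labels)
sens, spec = sensitivity_specificity(scores, labels)
```

Unlike sensitivity and specificity, the AUC does not depend on the choice of threshold, which is why it is the headline measure when comparing platforms.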

Logistic regression analysis
We further investigated the performance of conventional logistic regression for comparison purposes. The methylation status of a combination of CpG markers (cg04515524, cg00613827, cg02356786, and cg07509935) was a good predictor of AD. The model was fitted on the logit scale, logit(P) = log(P/(1 − P)), where P is Pr(y = 1|x). AI-based analysis, and in particular DL, was superior to conventional regression analysis (Tables 4 and 5, S3-S6 Tables). Overall, these results appear to support the robustness of blood-based epigenomic markers for AD prediction.
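For reference, a logistic regression on the four CpG markers named above has the general form below. The coefficients b_0 to b_4 are placeholders, since the fitted values are not given in this text, and x_1 to x_4 are assumed to denote the methylation levels at cg04515524, cg00613827, cg02356786, and cg07509935, respectively.

```latex
\operatorname{logit}(P) = \ln\frac{P}{1 - P}
  = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4,
\qquad
P = \Pr(y = 1 \mid x) = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4)}}
```

Here y = 1 denotes an AD case, so P is the model's estimated probability of AD given the methylation profile x.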

Network and pathway analyses results
The network and pathway analysis based on intragenic epigenomic markers identified significantly enriched canonical pathways. The molecular pathways that were found to be statistically significantly over-represented were Cardiac Hypertrophy Signaling, Sirtuin Signaling, FGF Signaling, Wnt/β-catenin Signaling, and Neuregulin Signaling (S7 Table). The over-represented disease pathways were Abnormal morphology of the cerebral cortex, Gliosis, Hydrocephalus, Morphology of nervous system, Ventricular hypertrophy, Dilated cardiomyopathy, and Inflammatory response (S8 Table). The related gene (Fig 2) and disease pathways (Fig 3) are depicted. S9 Table provides a summary of genes that were significantly differentially methylated and plausibly linked to AD development.
To evaluate the correlation between leukocyte methylation and gene expression in the brain, we matched our results with the study of Miller et al. [36], which reported the genes that were differentially expressed in the CA1 and CA3 regions of the brain from AD patients. We found that 13 genes differentially expressed in the CA1 and CA3 regions of the brain in that study [36] were significantly differentially methylated in circulating leukocytes. These were CCDC3, CPS1, ERMAP, FAM84B, MIB2, PTPRC, SARM1, SEC11A, TRIM6, and TXNIP, found to be differentially expressed in the CA1 region, and ADM, ANKS1B, and LANCL1, differentially expressed in the CA3 region [36]. Among these, CPS1 is involved in ammonia uptake in the urea cycle [37], PTPRC is a microglia-expressed gene [38], SARM1 is involved in axon degeneration, which is a factor observed in AD [39], TXNIP is linked to neuroprotective function [40], ANKS1B regulates hippocampal synaptic transmission [41], and LANCL1 is required for normal neuronal function [42]. We also compared our methylation results with a previous study evaluating differentially methylated genes in leukocyte blood samples of mono- and dizygotic twins [43]. These twin pairs were discordant for AD. Twenty-two of those differentially methylated genes were also found to be significantly differentially methylated in our study. The direction of methylation change, i.e. increased versus decreased, was similar in that study and the current one for the following genes: C5orf38, CDK20, CREB5, CTSV, DISC1, ELOVL4, FGF22, HOXC12, IGSF21, IGSF9B, IRX4, MAF, S1PR1, STX8, TBX2, and TSHZ3. For the genes ASCL2, FAM124B, FAM174B, KIF19, KIF26A, and WSCD1, both studies found significant methylation changes in the leukocyte DNA of AD cases, but the direction of the methylation change was discordant between the studies [43].

Discussion
Dementia represents a looming global health crisis. The problem is expected to worsen with an anticipated explosion in the aged population in the future [44]. The direct health care costs, along with intangible costs, are burdensome at an estimated $550 billion annually [45]. The inpatient hospital cost for individuals 65 years and over with Alzheimer's and other dementias is greater than 3 times that of similarly aged individuals without dementia, with nursing home facility costs greater than 20 times that of the latter group [46]. Despite the current absence of curative therapy, the justification for biomarker development remains compelling. Early detection of AD is needed to ensure early interventions that could potentially mitigate disease severity and also give families time to better prepare for the care of such individuals. With a very active drug pipeline, early detection will be needed to identify appropriate candidates for these trials. Finally, early detection and resulting intervention to slow disease progression could minimize time spent with severe dementia and promote the preservation of cognitive function for as long as possible. This would be beneficial for quality of life [47] and for health care cost considerations. AD is a slowly developing disorder, which enhances the feasibility of achieving these objectives.

Consistent with the call for the integration of breakthrough technologies (systems biology, genomics, big data science, and blood-based markers) to advance precision medicine objectives in AD [48], we combined AI analysis with leukocyte epigenomic data for AD prediction. Using raw intragenic CpG markers alone, we achieved a highly accurate prediction of AD using ML-based techniques. All the AI platforms achieved an AUC ≥0.93 using leukocyte epigenomic data. In the case of Deep Learning, we obtained an AUC = 0.99 with 97% sensitivity and specificity values. Additionally, we achieved high predictive accuracy using intergenic/extragenic CpG sites alone for AD detection. The use of conventional clinical predictors and MMSE did not improve performance further.
AI is superior to conventional statistical tools for the analysis of big data generated by omics analysis [17,49]. It is a powerful tool for discriminating and classifying groups. It can identify multiple markers, each with limited individual predictive capability, which when combined achieve excellent discriminating performance. To minimize the chances of overfitting, strategies such as RF were used (see Supplementary Methods). For the sake of comparison, we also investigated the predictive performance of conventional logistic regression. Employing cross-validation techniques, regression analysis yielded good predictive accuracy for AD based on methylation markers, AUC (95% CI) = 0.85 (0.74-0.96), but less than that of AI. This, however, further supports the robustness of the leukocyte epigenomic markers for AD detection.
Currently, a range of imaging and cerebrospinal fluid (CSF) markers continues to be deployed in the clinical and research diagnosis and evaluation of AD. These include CT, MRI, and PET imaging of the brain and CSF amyloid and tau levels. A systematic review of imaging biomarkers revealed that the most commonly utilized antemortem diagnostic tests have achieved moderate to good diagnostic accuracy [50]. The expense, and in some cases the invasive nature, of these tests precludes their use in the general aged population. Psychological testing including the MMSE, the most widely used cognitive test, might not be readily available in many primary care settings where the majority of elderly patients receive clinical care. Further, the MMSE was found on meta-analysis to have only modest accuracy for ruling out dementia when deployed in community or primary care settings [51]. Based on all these considerations, there remains a need for accurate biological screening tests in a low to moderate risk setting.
While not a requirement, an important collateral benefit of an ideal biomarker, beyond predictive accuracy, is the ability to help elucidate disease pathogenesis. We identified altered CpG methylation in several individual genes (CR1L, MYC, NRG1, LMNA, ELOVL4, MYB, AGPAT1, and NSG1) previously reported to play a role in AD. Single nucleotide polymorphisms in these genes increase AD risk by affecting the formation of neurofibrillary tangles, neuronal apoptosis, and neuronal vesicle trafficking in AD (S7 Table) [52][53][54][55][56][57][58][59][60]. Further, IPA found enrichment of several pathways involved in brain and neuronal development and in brain and cardiovascular function, such as abnormal morphology of cerebral cortex, gliosis, morphology of the nervous system, inflammatory response, cardiac ventricular hypertrophy, and dilated cardiomyopathy (Figs 2 and 3 and S5-S7 Tables).
AD appears to primarily affect the medial temporal cortex of the brain and both AD and aging affect the inferior parietal lobe and dorsolateral prefrontal cortex regions of the brain [61]. The accumulation of a significant volume of neurofibrillary tangles in the neocortical region is a hallmark of AD development [62]. We found significant epigenetic changes in genes (CR1L, CTSV, APAF1, and SS18L1) responsible for cerebral cortical morphology.
Microglia are immune cells residing in the brain. Proliferation and hypertrophy of these cells (gliosis) occur in response to CNS damage. Gliosis can lead to neuroinflammation and induce tau pathology thus accelerating neurodegeneration. In the case of AD, amyloid-β plaque deposition aggravates gliosis [63]. Our pathway analysis suggested a relationship between abnormal methylation and increased gliosis in AD. S1PR1 and MYC genes were hypermethylated in our study. The S1PR1 gene is involved in CNS inflammation [64] and the MYC gene in astrogliosis and inflammatory response [65].
We also found an over-representation of molecular pathways, including cardiac hypertrophy signaling and Wnt signaling, in AD. Vascular disease is strongly associated with negative effects on cognition [66]. Left ventricular hypertrophy is reported to be an independent risk factor for dementia [67]. We identified genes involved in cardiac hypertrophy signaling that displayed altered methylation in the AD group. Polymorphisms of the ADRA2B gene have been linked to cerebrovascular disorders [68]. The FGF18 and FGF22 genes are known to play a role in heart development and physiological processes [69] while the MYC gene is implicated in angiogenesis, cardiomyogenesis, apoptosis, oxidative stress response and plays a major role in initiating and maintaining cardiac hypertrophy and contractility [70]. In our study, these genes were found to be significantly differentially methylated and further support an important link between cardiovascular function and AD.
The Wnt/β-catenin signaling pathway is one possible link between cardiovascular disease and dementia. Wnt signaling is critical for the developmental processes in multiple organs including that of the heart. The pathway is reactivated in many post-natal cardiac disorders [71]. The activation of Wnt signaling has a neuroprotective effect while inhibition promotes neurodegeneration [72]. Downregulated Wnt/β-catenin signaling is associated with AD [73]. Wnt/β-catenin signaling genes such as MYC, SOX14, and WNT9B were found to be hypermethylated in the study.
A limitation of our study was the relatively small sample size. We also performed bootstrapping to confirm the stability of our estimates (see Supplemental Methods section). This slightly increased the performance estimates for 4 platforms, including DL, while slightly decreasing the performance of 2 AI platforms. We intend to perform follow-up validation studies in a larger cohort of patients. Despite the study size, we demonstrated highly significant methylation changes in circulating leukocytes in AD. Highly accurate AD prediction was observed using an AI platform and different marker combinations. Also, while expression studies were not performed in this particular analysis, several CpG site methylation differences in AD cases versus controls were greater than 5-10%. This level of methylation difference has been noted to correlate with changes in corresponding gene expression [74]. In addition, we found evidence of significant methylation changes in some leukocyte genes that have been previously reported to be differentially expressed in AD brains [36]. These findings also help to validate our data.
While significant epigenetic changes were also identified in the intergenic/extragenic sites, we are currently unable to report the specific mechanisms of their contribution to AD pathogenesis, as these sites have not been linked to particular genes. It is known, however, that intergenic/extragenic sites can exert long-range influence and control gene function.
Overfitting can be a challenge with AI analysis. To avoid overfitting in the DL model, strategies including the use of regularization parameters, dropout, and control of the input-dropout ratio were used and are detailed in the Supplemental Methods section. For the other AI platforms, several parameters were used to tune the models and to overcome the overfitting problem: number of trees for RF, classification cost for SVM, and threshold amount for shrinking toward the centroid for PAM.
Another limitation of the study is that we were not able to eliminate the possibility that some of the observed epigenetic changes were due to co-morbidities such as schizophrenia, bipolar disorder, or epilepsy. Given the age of the study subjects, co-morbidities are the norm rather than the exception in AD. We did not, however, identify significant differences in the frequency of these disorders between our AD and control groups. We did not have access to the medications of our study group. The study included a higher percentage of females in both the case and control groups. This is consistent with the distinct gender-based demographics of the disorder, and there was no significant difference in the gender ratios of the case and control groups. Further, we removed all probes associated with X and Y chromosomes to minimize gender bias. We excluded any CpGs in close association (0 to 10 bp distance) with single nucleotide polymorphisms to avoid genetic mutational association with the methylation changes. Finally, no information on APOE gene mutation status was available for this particular cohort, as this is not routinely obtained in the assessment of our clinical patients.
A significant strength of our study is the novelty, i.e. the use of blood leukocytes to accurately detect AD and also for interrogating the pathogenesis of AD. Leukocyte samples are easily obtained, raising the prospect of a minimally invasive and potentially affordable technique for investigation of the mechanisms, detection, as well as longitudinal monitoring of AD. The potential value of methylation changes in blood leukocytes for the detection of brain disorders including schizophrenia has been previously reported [75,76]. Of interest, we did find overlap in some of the genes that were significantly differentially methylated in AD in our study and a prior report of leukocyte DNA methylation variation in twins discordant for AD [43]. This provides further validation to the use of leukocyte methylation for the investigation of AD.
In summary, we have performed genome-wide methylation analysis in blood leucocytes and identified significant methylation changes in genes, gene networks, and disease pathways that were previously known or suspected to play an important role in AD. Significant methylation changes were also found in intergenic i.e. extragenic sites. Using AI techniques, highly accurate leukocyte epigenomic prediction of AD was reported for the first time to the authors' knowledge. The results could potentially advance the precision medicine objectives that have been outlined for AD [48]. Our work provides evidence in support of the view that epigenetic factors may play a pivotal role in AD development. Further validation studies using a larger number of subjects are necessary to confirm and expand on our findings.