Factors associated with the onset of Alzheimer's disease: Data mining in the French nationwide discharge summary database between 2008 and 2014

Introduction Identifying modifiable risk factors for Alzheimer’s disease (AD) is critical for research. Data mining may be a useful tool for finding new AD-associated factors. Methods We included all patients over 49 years of age who were hospitalized in France both in 2008 (without dementia) and in 2014. The dependent variable was a diagnosis of AD or AD dementia in 2014. We recoded the diagnoses of hospital stays (in ICD-10) into 137 explanatory variables. To avoid overweighting the "age" variable, we divided the population into 7 sub-populations by 5-year age group. Results We analyzed 1,390,307 patients in the PMSI in 2008 and 2014: 55,997 patients had coding for AD or AD dementia in 2014 (4.04%). AD in 2014 was associated with about 20 variables, including male sex, stroke, diabetes mellitus, mental retardation, bipolar disorder, intoxication, Parkinson disease, depression, anxiety disorders, alcohol, undernutrition, falls, and 3 less explored variables: intracranial hypertension (odds ratio [95% confidence interval]: 1.16 [1.12–1.20] in the 70–80 years group), psychotic disorder (OR: 1.09 [1.07–1.11] in the 70–75 years group) and epilepsy (OR: 1.06 [1.05–1.07] after 70 years). Discussion An analysis of 137 variables in the PMSI identified some well-known risk factors for AD and highlighted a possible association with intracranial hypertension, which merits further investigation. Better knowledge of these associations could lead to better identification of at-risk patients and better prevention of AD, in order to reduce its impact.



Introduction
In 2015, the global prevalence of dementia was re-estimated at 46 million people based on data from the Global Burden of Disease Study [1]. This number is projected to exceed 115 million by 2050 [2]. Alzheimer's dementia (AD) accounts for 60 to 70% of dementias [3,4]. There is a long presymptomatic period of about 15 years between biochemical changes in the brain and the development of AD [5,6]. About one-third of AD cases can be attributed to a modifiable cause [7].
The search for modifiable risk factors is a critical issue in dementia research: recent reviews of systematic reviews and meta-analyses have examined about 80 risk factors for AD [8,9]. Data mining is another tool for finding new associated factors.
Our aim was to determine factors associated with the occurrence of AD by using data mining in the database of all hospital stays in France (PMSI).

Study design
We used the PMSI database (presented below). All inpatient stays in 2008 and 2014 were included.

Ethics statement
Approval from the French data protection agency (CNIL) was obtained to conduct the present study; the data were obtained through the Technical Agency for Information on Hospital Care (ATIH), in accordance with current legislation. Studies assessing the accuracy of diagnosis coding by medical chart review are authorized by the Lille University Hospital ethics committee.

Data source
The PMSI database (Programme de Médicalisation des Systèmes d'Information) is the exhaustive French nationwide hospital discharge database [10]. The database used in our study comprises all inpatient stays in nonprofit and for-profit acute care hospitals (medicine, surgery and obstetrics), excluding psychiatric hospitals and rehabilitation care centers. It includes administrative data (admission and discharge dates and modes), demographic data (age, gender, geographic area), diagnoses encoded in ICD-10 [11], medical procedures encoded with the French medical classification for clinical procedures (CCAM: Classification Commune des Actes Médicaux) [12], and other pieces of information [13]. This information is anonymized and can be reused for research purposes [14].
The database comprises 23,781,314 inpatient stays in 2008 and 27,087,492 in 2014.

Inclusion and exclusion criteria
We included all patients present in both the 2008 and 2014 PMSI databases who were over 49 years of age. We excluded patients with dementia in 2008. Encoding rules for dementia and related diseases were defined in 2006 [15]. In accordance with those rules, inpatient stays having one of the following codes in 2008 were excluded (ICD-10 codes in brackets): AD (G30*, 4 codes), AD dementia (F00*, 84 codes), vascular dementia (F01*, 126 codes), other dementia (F02*, 120 codes), unspecified dementia (F03*, 20 codes), or mild cognitive impairment (F067*, 2 codes).
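The exclusion rule above amounts to prefix matching on ICD-10 codes. The following is a minimal sketch of that idea, not the study's actual code: it assumes codes are stored without dots (as "F067" is written above) and that exclusion is applied at the patient level. The function names, the data layout and the sample stays are hypothetical.

```python
# Prefixes taken from the exclusion criteria listed above.
DEMENTIA_PREFIXES = ("G30", "F00", "F01", "F02", "F03", "F067")


def has_dementia_code(codes):
    """Return True if any ICD-10 code starts with an excluded prefix."""
    return any(code.upper().startswith(DEMENTIA_PREFIXES) for code in codes)


def eligible_patients(stays_2008):
    """Return the ids of patients with no dementia-related code in 2008.

    stays_2008: iterable of (patient_id, [ICD-10 codes]) pairs, one per stay
    (hypothetical layout, assumed for illustration).
    """
    seen, excluded = set(), set()
    for patient_id, codes in stays_2008:
        seen.add(patient_id)
        if has_dementia_code(codes):
            excluded.add(patient_id)
    return seen - excluded


# Hypothetical example stays
stays = [
    ("p1", ["I10", "E119"]),
    ("p2", ["F001", "I10"]),  # AD dementia: patient excluded
    ("p3", ["F067"]),         # mild cognitive impairment: patient excluded
    ("p1", ["J189"]),
]
print(sorted(eligible_patients(stays)))
```

Matching on prefixes rather than on an explicit list of the 356 individual codes keeps the rule short, at the cost of also matching any code later added under the same prefix.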

Dependent and explanatory variables
The dependent variable was AD in 2014, defined as AD (G30*) or AD dementia (F00*).
Sex and age (in 2008) were explanatory variables available in the PMSI database. We created a "longitude" variable and a "latitude" variable from the prefectures of the departments where patients were hospitalized in 2008 (excluding the overseas departments and territories).
Diagnoses of inpatient stays in 2008 were recoded into binary variables after mapping the ICD-10 and the CCAM. Of the 40,109 ICD-10 codes, 11,768 were coded into 130 binary variables of interest (based on a literature review); of the 8,982 CCAM codes, 320 were coded into 10 binary variables. A total of 137 different variables were tested. The same code could correspond to several binary variables (for example "tuberculous meningitis" to "meningitis", "tuberculosis", "bacterial infection"). We aggregated data (ICD-10 and CCAM) from several hospital stays for the same patient.
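The recoding and per-patient aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the study's mapping: the three prefixes and variable names in `CODE_TO_VARIABLES` are hypothetical examples (the actual mapping covers 11,768 ICD-10 codes), and codes are assumed to be stored without dots.

```python
from collections import defaultdict

# Hypothetical excerpt of a prefix-to-variables mapping. One code may feed
# several binary variables, as in the tuberculous meningitis example above.
CODE_TO_VARIABLES = {
    "A170": ("meningitis", "tuberculosis", "bacterial_infection"),
    "E11": ("diabetes",),
    "G40": ("epilepsy",),
}


def variables_for(code):
    """Return every binary variable whose prefix matches the given code."""
    matched = set()
    for prefix, names in CODE_TO_VARIABLES.items():
        if code.startswith(prefix):
            matched.update(names)
    return matched


def recode(stays):
    """Aggregate all stays of a patient into one set of positive variables.

    stays: iterable of (patient_id, [codes]) pairs, one per stay.
    A variable is positive if it was coded during at least one stay.
    """
    flags = defaultdict(set)
    for patient_id, codes in stays:
        patient_flags = flags[patient_id]  # registers patients with no match
        for code in codes:
            patient_flags |= variables_for(code)
    return dict(flags)


# Hypothetical example: two stays for p1, one for p2
stays = [("p1", ["A170"]), ("p1", ["E119"]), ("p2", ["I10"])]
result = recode(stays)
print(sorted(result["p1"]))
```

Storing the set of positive variables per patient is equivalent to a row of 0/1 indicators; any variable absent from the set is taken as 0.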
In each stratum, we ranked the 20 explanatory variables most associated with the onset of AD in 2014, using the "importance" value (varImpPlot function) of the random forest algorithm, based on the Breiman and Cutler Fortran code (package 'randomForest', version 4.6-12) [16]. Random Forest produced 20 classification trees (ntree) on a random fraction of the data, with 2 variables tested (mtry) at each split.
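The strata referred to here are the 5-year age groups. A minimal sketch of assigning a patient to a stratum is given below; the cut-points (50-55 up to 75-80, then an open-ended last band) are an assumption inferred from the age groups named in the results, and the function name is hypothetical.

```python
def age_band(age_2008):
    """Map an integer age in 2008 to a 5-year band label.

    Assumed cut-points: 50-55, 55-60, ..., 75-80, then "80+" (7 bands).
    Ages below 50 are placed in the first band.
    """
    if age_2008 >= 80:
        return "80+"
    lower = max(50, (age_2008 // 5) * 5)
    return f"{lower}-{lower + 5}"


# Hypothetical ages
for age in (49, 63, 72, 91):
    print(age, age_band(age))
```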
We then looked for interactions between these variables using decision trees by age group (package 'rpart' for Recursive Partitioning and Regression Trees, version 4.1-10) [17].
Finally, in each age group, we built a multivariate logistic regression model using a stepwise procedure (sequential replacement), with the 20 explanatory variables most associated with the onset of Alzheimer's disease in 2014 by importance (randomForest), as well as age, sex, longitude and latitude. The results of the logistic regression were expressed as odds ratios (OR) with 95% confidence intervals. Statistics were computed using R version 3.3.2 [18].
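For reference, the standard relation between a fitted logistic regression coefficient and the reported odds ratio can be written as below. The symbols are generic notation rather than the study's own, and the interval shown is the usual Wald interval, which is an assumption about how the confidence intervals were computed.

```latex
\operatorname{logit} P(\mathrm{AD}_{2014} = 1 \mid x) = \beta_0 + \sum_{j} \beta_j x_j,
\qquad
\mathrm{OR}_j = e^{\hat\beta_j},
\qquad
95\%\ \mathrm{CI}_j = \left[\, e^{\hat\beta_j - 1.96\,\mathrm{SE}(\hat\beta_j)},\; e^{\hat\beta_j + 1.96\,\mathrm{SE}(\hat\beta_j)} \,\right]
```

Here $x_j$ is the $j$-th explanatory variable retained in the model for a given age group, so an OR above 1 indicates a positive association with AD coding in 2014 and an OR below 1 an inverse association.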

Characteristics of the population
We analyzed 1,390,307 patients in the PMSI in 2008 and 2014, without dementia in 2008 and aged 49 years or over on January 1, 2008 (Fig 1). The patients included were 66.7 ± 10.45 years of age on average. The main characteristics of patients in 2008 are described in Table 1.
In our population, 55,997 patients had coding for AD or AD dementia in 2014 (4.04%).

Multivariate models
In our models, some variables were significantly associated with the occurrence of AD in 2014 (Table 2). The variables changed according to patient age and included psychotic disorder (in the 65-70 and 70-75 years groups), intracranial hypertension (in the 70-75 and 75-80 years groups) and epilepsy (in the 70-75, 75-80 and over 80 years groups). Some appeared more age-related, such as hemorrhagic stroke in the 70-75 years group; mental retardation and undernutrition in the 75-80 years group; and depression and falls in the over 80 years group.
We also identified variables associated with the absence of AD coding: cancer, carcinoma in situ and benign tumor, diverticulosis, inflammation, rheumatoid arthritis, psoriasis, obesity, osteoarthritis, ischemic and non-ischemic heart disease.

Discussion
Analysis of 137 variables concerning 1.4 million patients aged over 49 years, included in the PMSI with a 6-year follow-up, revealed statistically significant associations between the onset of AD and about 20 explanatory variables. Some are well described in the literature (stroke, diabetes, female, alcohol, depression, etc.) [19], while others are still little explored (intracranial hypertension, epilepsy, etc.). Our study shows associations with a temporality criterion. These associations must be interpreted with caution. On the one hand, the dependent variable is the coding of a hospital diagnosis of AD: thus, some pathologies may be associated with a higher or lower probability of diagnosis given the modalities of the stay (neurology or geriatrics stay, colonoscopic follow-up, ambulatory surgery, etc.). On the other hand, the explanatory variables may be interpreted as risk factors (increase in neural lesions), precipitating factors (earlier diagnosis) or confounding factors (common ground, early symptoms). For example, falling after age 80 can be a risk factor (head injury), a warning sign of AD or a confounding factor (diabetic neuropathy, Parkinson's disease, stroke, etc.); falling is also a cause of hospitalization in geriatrics, where the work-up will likely include a cognitive assessment.
In our study, 55,997 patients had AD in 2014 (4.04%). The rate of AD in our study increases with age, as found in other studies: in France, the rates are about 6% of patients over 65, 18% of patients over 75 and up to 40% beyond 85 years (versus 7.1%, 11.5% and 14.9%, respectively, in our study) [20,21]. The main factors associated with AD in our study change according to patient age. Some are well described, while others are still little explored.
In the literature, we find a similar association regarding diabetes [22][23][24][25], alcohol abuse [26], BMI < 18 kg/m² [24,27], heart failure [28], depression [22,29,30], bipolar disorder [31], mental retardation or low level of education [22,26]. The prevalence of psychotic disorders or anxiety disorders in AD has been estimated at about 34-40% [32,33]. Intoxications may be a confounding factor with psychiatric disorders or reflect attempted suicide [34,35]. The link between epilepsy and AD is described but poorly understood; the prevalence of dementia is estimated to be between 8.1 and 17.5% in patients with epilepsy, and the prevalence of epilepsy is estimated to be between 1 and 9% in patients with dementia [36].
For the first time, we show a link between intracranial hypertension and Alzheimer's disease in the 60-65 age group and then in the 70 to 80 age group. This may seem surprising because intracranial hypertension rarely occurs in elderly patients due to age-related cerebral atrophy (including chronic subdural hematoma). Nevertheless, the hypothesis of a link between intracranial hypertension and Alzheimer's disease has already been formulated. Indeed, normal pressure hydrocephalus and head injuries (e.g. in boxers) can be accompanied by anatomopathological lesions similar to those of Alzheimer's disease [37,38]; repeated episodes of intracranial hypertension (during head injuries or conditions such as heart failure, sleep apnea syndrome or chronic obstructive pulmonary disease) may be a contributing, precipitating or triggering factor in Alzheimer's disease [39,40]. Several variables were not associated with the onset of AD in our study, unlike in some studies in the literature, such as age-related hearing loss [41], hypertension [22,25,42], hypercholesterolemia [25,43], Helicobacter pylori infection [44], Chlamydia trachomatis infection [45], head injury [46], obesity [22,24,27] or essential tremor [47].

Table 2. Odds ratios [95% confidence interval] in multivariate models: factors associated (in red) and factors inversely associated (in blue) with the onset of Alzheimer's disease in 2014 (stronger associations at +/-5% are in bold).
The main strength of our work is the sample size, with over 55,000 AD patients in 2014 for whom we have reliable data recorded prospectively 6 years earlier. Data mining techniques and the large sample size make it possible to study a large number of variables and to raise new hypotheses about risk factors. They also make it possible to confirm associations already described in certain sub-populations (according to age).
Our study has several limitations. The main limitations of reusing the PMSI are the impossibility of returning to the source data and the quality of coding. The use of the PMSI for activity-based pricing can also lead to overcoding of certain pathologies and undercoding of diagnoses that have no bearing on pricing. Coding is the responsibility of the clinician and can sometimes be approximate: for example, a patient may have a family history of cancer without it being coded in the database (little relevance for pricing); and in the case of intracranial hypertension, we cannot verify papilledema findings or lumbar puncture pressure measurements. Nevertheless, it is unlikely that there is a differential bias in favour of better or worse coding of intracranial hypertension or family history of cancer in patients rehospitalized with AD coding 6 years later.
Concerning the quality of the dependent variable, there is a strong correlation between a clinical ante-mortem diagnosis and a post-mortem diagnosis [48,49]. In our study, the ICD-10 diagnosis of AD may be of questionable and variable accuracy, or may even be confounded by delirium in some cases, given that most data points came from short-term hospital stays. Nevertheless, we have shown in preliminary studies that the diagnosis of AD is more reliable in the PMSI in 2014 than in previous years, probably in connection with the proposal for new NINCDS-ADRDA criteria [50][51][52].
Our study is a correlation analysis and not a retrospective cohort: we did not account for competing risks (attrition, loss to follow-up, death, etc.). This selection strategy tends to create spurious negative associations between diagnoses of chronic conditions in 2008 and AD in 2014, since patients with neither chronic condition in 2008 nor AD in 2014 are more likely to be missing from the analysis set.
We used the PMSI database excluding psychiatric hospitals and rehabilitation care centers: this could also have led to selection bias and may explain the low rate of psychiatric disorders in our population.
We used the PMSI database in 2008 and 2014. We opted for this simple two-time-point design for several reasons: first, it was allowed by the number of patients in our study; second, the diagnosis of AD is more reliable in 2014 than in previous years [50][51][52], so incorporating the diagnosis of AD in 2012 or 2013 would probably have decreased the quality of this dependent variable; third, our objective was to identify risk factors rather than early symptoms, so we opted for the extreme years (2008 and 2014) allowed by the accreditation giving access to PMSI data at the time of analysis.
We chose a minimum age of 49 years in 2008 (55 years in 2014) because AD is rare before 55 years of age and then mainly concerns familial cases. Since genetic data are not available in the short-stay PMSI database, we preferred to avoid including these cases. Moreover, our maximum follow-up is 6 years, which does not really allow us to determine the first symptoms before the prodromal phase of 15 to 20 years.
Due to the extremely high impact of age on the onset of AD, we divided our sample into 5-year age classes. This allowed us to identify the main explanatory variables according to different times in life and clinical situations: psychotic disorders after 60 years; intracranial hypertension, epilepsy or undernutrition after 70 years. Nevertheless, the use of 5-year age intervals could produce spurious associations between diseases or conditions (such as AD) with strongly age-dependent prevalences (e.g. a spurious positive association with a condition such as undernutrition, which is increasingly prevalent with older age). The multifactorial nature of AD adds the complexity of many confounding variables that cannot be adjusted for in the PMSI. It may be more informative to perform an age-stratified analysis on all data, testing for an effect of each condition on the probability of AD and for an interaction between the condition and age.
We have identified pathologies inversely associated with AD diagnosis 6 years later: cancer, family history of cancer, inflammation, rheumatoid arthritis, psoriasis, etc. Some of these associations are cited in the literature, such as the inverse relationship with cancer [59,60] or rheumatoid arthritis [61][62][63]. As explained above, they seem to be reasons for follow-up, i.e. iterative hospitalizations motivated mainly by the initial pathology, not necessarily leading to a coding of AD. A neuroprotective effect of inflammation could also be considered [64].
In conclusion, an analysis of 137 variables in the PMSI identified some well-known risk factors for AD, and highlighted a possible association with intracranial hypertension. Better knowledge of associations could lead to better targeting (identifying) at-risk patients, and better prevention of AD, in order to reduce its impact.