
A population-based study exploring phenotypic clusters and clinical outcomes in stroke using unsupervised machine learning approach

  • Ralph K. Akyea ,

    Contributed equally to this work with: Ralph K. Akyea, Stephen F. Weng, Nadeem Qureshi

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Visualization, Writing – original draft, Writing – review & editing

    Ralph.Akyea1@nottingham.ac.uk

    ‡ These authors are joint senior authors on this work.

    Affiliation PRISM Research Group, Centre for Academic Primary Care, School of Medicine, University of Nottingham, Nottingham, United Kingdom

  • George Ntaios,

    Roles Writing – review & editing

    Affiliation Department of Internal Medicine, Faculty of Medicine, School of Health Sciences, University of Thessaly, Larissa, Greece

  • Evangelos Kontopantelis,

    Roles Writing – review & editing

    Affiliations Division of Population Health, Health Services Research and Primary Care, School of Health Sciences, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre (MAHSC), The University of Manchester, Manchester, United Kingdom, Division of Informatics, Imaging and Data Sciences, School of Health Sciences, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre (MAHSC), The University of Manchester, Manchester, United Kingdom

  • Georgios Georgiopoulos,

    Roles Writing – review & editing

    Affiliation School of Biomedical Engineering and Imaging Sciences, St Thomas Hospital, King’s College London, London, United Kingdom

  • Daniele Soria,

    Roles Writing – review & editing

    Affiliation School of Computing, University of Kent, Canterbury, United Kingdom

  • Folkert W. Asselbergs,

    Roles Writing – review & editing

    Affiliations Amsterdam University Medical Centers, Department of Cardiology, University of Amsterdam, Amsterdam, The Netherlands, Health Data Research UK and Institute of Health Informatics, University College London, London, United Kingdom

  • Joe Kai,

    Roles Funding acquisition, Writing – review & editing

    Affiliation PRISM Research Group, Centre for Academic Primary Care, School of Medicine, University of Nottingham, Nottingham, United Kingdom

  • Stephen F. Weng ,

    Contributed equally to this work with: Ralph K. Akyea, Stephen F. Weng, Nadeem Qureshi

    Roles Conceptualization, Funding acquisition, Methodology, Supervision, Writing – review & editing

    ‡ These authors are joint senior authors on this work.

    Affiliation PRISM Research Group, Centre for Academic Primary Care, School of Medicine, University of Nottingham, Nottingham, United Kingdom

  • Nadeem Qureshi

    Contributed equally to this work with: Ralph K. Akyea, Stephen F. Weng, Nadeem Qureshi

    Roles Conceptualization, Funding acquisition, Supervision, Writing – review & editing

    ‡ These authors are joint senior authors on this work.

    Affiliation PRISM Research Group, Centre for Academic Primary Care, School of Medicine, University of Nottingham, Nottingham, United Kingdom

Abstract

Individuals developing stroke have varying clinical characteristics, demographics, and biochemical profiles. This heterogeneity in phenotypic characteristics can impact on cardiovascular disease (CVD) morbidity and mortality outcomes. This study uses a novel clustering approach to stratify individuals with incident stroke into phenotypic clusters and evaluates the differential burden of recurrent stroke and other cardiovascular outcomes. We used linked clinical data from primary care, hospitalisations, and death records in the UK. A data-driven clustering analysis (kamila algorithm) was used in 48,114 patients aged ≥ 18 years with incident stroke from 1-Jan-1998 to 31-Dec-2017 and no prior history of serious vascular events. Cox proportional hazards regression was used to estimate hazard ratios (HRs) for subsequent adverse outcomes for each of the generated clusters. Adverse outcomes included coronary heart disease (CHD), recurrent stroke, peripheral vascular disease (PVD), heart failure, CVD-related mortality, and all-cause mortality. Four distinct phenotypes with varying underlying clinical characteristics were identified in patients with incident stroke. Compared with cluster 1 (n = 5,201, 10.8%), the risk of composite recurrent stroke and CVD-related mortality was higher in the other 3 clusters (cluster 2 [n = 18,655, 38.8%]: hazard ratio [HR], 1.07; 95% CI, 1.02–1.12; cluster 3 [n = 10,244, 21.3%]: HR, 1.20; 95% CI, 1.14–1.26; and cluster 4 [n = 14,014, 29.1%]: HR, 1.44; 95% CI, 1.37–1.50). Similar trends in risk were observed for the composite of recurrent stroke and all-cause mortality and for recurrent stroke alone. However, results were not consistent for subsequent risk of CHD, PVD, heart failure, CVD-related mortality, and all-cause mortality.
In this proof of principle study, we demonstrated how a heterogeneous population of patients with incident stroke can be stratified into four relatively homogeneous phenotypes with differential risk of recurrent and major cardiovascular outcomes. This offers an opportunity to revisit the stratification of care for patients with incident stroke to improve patient outcomes.

Author summary

Using an unsupervised machine learning cluster analysis approach, adult patients with incident stroke were grouped into four clinically meaningful phenotypic clusters based on their demographic, biochemical, comorbidity, and prescribed medication profiles at the time of incident stroke. The findings of this study highlight the significant heterogeneity that exists within patients with incident stroke with respect to subsequent cardiovascular morbidity and mortality outcomes. This offers an opportunity to revisit the stratification of care for patients with incident stroke to improve patient outcomes and highlights the potential to target modifiable characteristics in clusters for more targeted preventive intervention.

Introduction

Stroke is a leading cause of death and disability globally with a substantial economic cost due to treatment and post-stroke care [1]. Patients at time of incident stroke have varied clinical characteristics, demographics, and biochemical profiles. This heterogeneity in characteristics at time of incident stroke impacts on cardiovascular morbidity and mortality outcomes [2]. Phenotyping (subgrouping) people after incident stroke, in terms of the risk of various cardiovascular outcomes, could help direct better care to individuals with the poorest prognosis. Such care could include intensive secondary prevention strategies, including the use of novel medications such as proprotein convertase subtilisin/kexin type 9 (PCSK9) inhibitors and colchicine, in patients at very high risk of adverse cardiovascular morbidity and mortality outcomes.

Cluster analysis, a hypothesis-free unsupervised machine learning data-driven approach, has been widely used to analyse clinical data to identify new phenotypic subgroups of complex and heterogeneous diseases including obstructive sleep apnoea [3], asthma [4,5], chronic obstructive pulmonary disease, chronic heart failure [6], dilated cardiomyopathy [7], sepsis [8], Parkinson’s disease [9], breast cancer [10], and diabetes [11]. This approach does not include outcome data, and may be less biased in its results, especially when using retrospectively collected data. Clustering of clinical data may, therefore, be helpful in identifying subgroups of patients with incident stroke and generating new hypotheses. Efforts to determine such phenotypic groups in patients with incident stroke remain limited.

Using a large population-based cohort of adult patients with incident stroke, the objectives of this study are: (i) to identify patterns in linked primary and secondary clinical data and cluster patients based on phenotypic similarities; (ii) to assess the association between phenotypic clusters and subsequent recurrent stroke or CVD-related mortality, recurrent stroke or all-cause mortality, coronary heart disease (CHD), recurrent stroke, peripheral vascular disease (PVD), heart failure, CVD-related mortality, and all-cause mortality.

Methods

Study design and data source

This retrospective population-based cohort study used the UK Clinical Practice Research Datalink (CPRD) GOLD database of anonymised longitudinal primary care electronic health records [12], linked to secondary care hospitalisation data (Hospital Episode Statistics [HES]) [13], national mortality data (Office for National Statistics [ONS]) [14], and social deprivation data (Index of Multiple Deprivation [IMD] 2015) [15]. Patients included in the CPRD GOLD database, from a network of general practices across the UK, are representative of the UK general population in terms of sex, age, and ethnicity [12].

Study population

We identified a cohort of patients with incident non-fatal stroke in either primary care (CPRD GOLD) or secondary care (HES) between 1 January 1998 and 31 December 2017. Details about this cohort were previously reported [16]. Patients with a record of coronary heart disease (CHD), peripheral vascular disease (PVD), or heart failure before the incident stroke event were excluded. Patients were followed from the date of incident stroke diagnosis until they developed a major adverse cardiovascular event (MACE), died, ceased contributing data, or reached the last data collection date of their practice. The study flow diagram is shown in Fig 1.

Outcomes

The primary outcome was a composite of recurrent stroke or CVD-related mortality event recorded after incident stroke from across the linked data sources (CPRD, HES or ONS registry). The secondary outcomes included: CHD, recurrent stroke, PVD, heart failure, CVD-related mortality, all-cause mortality, and the composite of recurrent stroke or all-cause mortality.

Subsequent outcomes within 30 days were considered to be representing or relating to the incident stroke event [16]. Analyses were, therefore, restricted to patients with subsequent outcomes occurring after 30 days of incident stroke.
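The 30-day restriction described above can be sketched as a simple filter. This is a minimal illustration in Python rather than the code used in the study: the `Patient` record, its field names, and the toy values are hypothetical, and treating day 30 itself as inside the exclusion window is an assumption about how "within 30 days" was applied.

```python
from dataclasses import dataclass
from typing import Optional

LANDMARK_DAYS = 30  # assumption: outcomes up to and including day 30 are excluded


@dataclass
class Patient:
    patient_id: str                       # hypothetical identifier
    days_to_first_outcome: Optional[int]  # None if no subsequent outcome was recorded


def eligible_for_analysis(patient: Patient) -> bool:
    """Keep patients whose first subsequent outcome, if any, falls after day 30."""
    days = patient.days_to_first_outcome
    return days is None or days > LANDMARK_DAYS


# Toy cohort, not study data.
cohort = [Patient("a", 12), Patient("b", 30), Patient("c", 31), Patient("d", None)]
analysed = [p.patient_id for p in cohort if eligible_for_analysis(p)]
print(analysed)
```

Patients with no recorded outcome stay in the cohort and are censored later; only those with an early outcome are dropped.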

Potential candidate variables for phenotyping

Based on availability in the electronic health records and established association with CVD, 336 candidate variables were selected. These included demographic data, vital signs, biochemical parameters, comorbid conditions, and prescribed medications (S1 Table). For vital signs and biochemical test results, the most recent values/records within 24 months before incident stroke were extracted. A prescription within 12 months before incident stroke was considered as a medication prescribed. All comorbid conditions were defined based on the latest record of a comorbid condition any time before incident stroke. All code lists used have been published and are available for download [17,18].

Data processing

The variable distributions and missingness were first assessed. Multiple imputation by chained equations was used to account for missing data (S1 Fig, S2 Table). Ten imputed datasets were generated, using all available covariates and all the outcomes, although outcomes were not imputed [19,20]. The imputed datasets were pooled into a single dataset using Rubin’s rules [21]. A high number of dimensions from a dataset with many variables/features is associated with a loss of meaningful differentiation between similar and dissimilar individuals, the ‘curse of dimensionality’ [22]. To improve the cluster analysis process and performance, feature selection was carried out to reduce collinearity, conditional dependence, and noise that increases variance. Feature selection was based on two widely used data-driven feature selection methods (Boruta [23] and Least Absolute Shrinkage and Selection Operator (Lasso) regression [24]; S2 Fig) and clinical expert consensus. An expert group of clinicians from both primary care (Consultant General Practitioners: NQ, JK) and secondary care (Stroke Medicine Consultants/Specialists: GN, GG) were independently consulted to attain consensus on which variables to select for the cluster analysis. Clinical expert consensus was defined as 75% (3 out of 4) agreement among the clinical experts on each variable. In total, 49 variables were rated important by the clinical experts and by at least 1 of the 2 data-driven methods (S1 Table). After evaluating correlation among the 49 selected variables using the mixedCor and Lares functions in R for mixed-type data (S3 Fig and S4 Fig), we excluded 10 highly correlated variables based on clinical judgement/importance. The remaining 39 variables (Box 1) were used for the cluster analysis.
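The selection rule described above (expert consensus plus support from at least one data-driven method) can be written as a short predicate. The sketch below is illustrative only: the variable names, votes, and Boruta/Lasso flags are made up, and the function is a hypothetical helper, not part of the study's pipeline.

```python
EXPERT_THRESHOLD = 3  # 3 of 4 experts, i.e. 75% agreement


def is_selected(expert_votes, boruta_keep, lasso_keep):
    """Apply the rule: expert consensus AND at least one data-driven method."""
    consensus = sum(expert_votes) >= EXPERT_THRESHOLD
    data_driven = boruta_keep or lasso_keep
    return consensus and data_driven


# Hypothetical candidates: (votes from 4 experts, kept by Boruta, kept by Lasso).
candidates = {
    "age":   ([True, True, True, True], True, True),
    "sbp":   ([True, True, True, False], False, True),
    "var_x": ([True, True, False, False], True, True),    # fails expert consensus
    "var_y": ([True, True, True, True], False, False),    # no data-driven support
}

selected = [name for name, args in candidates.items() if is_selected(*args)]
print(selected)
```

In this toy input only the first two variables pass, since each of the others fails one of the two conditions.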

Box 1. Phenotypic domains and phenotypic variables used for cluster analysis

Phenotypic clustering

The prediction strength method by Tibshirani and Walther, 2005 [25], as implemented in the kamila function, and the Elbow method were used to select the optimal number of clusters (S5 Fig). The kamila algorithm for mixed data clustering (S1 Text) was implemented to identify distinct patient phenotypic clusters. To ensure robustness of the clusters identified, 1,000 initialisations (that is, random starting points) were carried out. A plot of the clusters against the principal component analysis (PCA) dimensions was generated (S6 Fig).
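To make the prediction strength criterion concrete: for a candidate number of clusters, a test split is clustered directly and is also assigned to clusters derived from a training split. Within each test cluster, one computes the share of point pairs that the training-derived assignment also keeps together, and the prediction strength is the minimum of these shares. The study relied on the kamila function in R for this step. The Python function below is a stand-alone sketch with toy labels, and skipping single-member clusters is a simplifying assumption.

```python
from itertools import combinations


def prediction_strength(test_labels, train_based_labels):
    """Minimum, over test clusters, of the share of pairs kept together.

    test_labels: cluster labels from clustering the test split directly.
    train_based_labels: labels for the same points when assigned to
        clusters derived from the training split.
    """
    members_by_cluster = {}
    for idx, label in enumerate(test_labels):
        members_by_cluster.setdefault(label, []).append(idx)

    shares = []
    for members in members_by_cluster.values():
        pairs = list(combinations(members, 2))
        if not pairs:  # assumption: single-member clusters are skipped
            continue
        kept = sum(train_based_labels[i] == train_based_labels[j] for i, j in pairs)
        shares.append(kept / len(pairs))
    if not shares:
        raise ValueError("no cluster has at least two members")
    return min(shares)


# Toy labels: one point in test cluster 0 is placed elsewhere by the training clusters.
ps = prediction_strength([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
print(round(ps, 3))
```

A value close to 1 for a given number of clusters suggests that the cluster structure is reproduced across splits; lower values suggest it is not.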

Using the h2o package (http://www.h2o.ai), a gradient boosting model was applied to identify and rank the key covariates (candidate variables) that predict each of the identified phenotypic clusters. For each cluster, membership was coded as 1 (belonging to the cluster) or 0 (belonging to another cluster). SHAP (SHapley Additive exPlanations) was used to assess the discriminative influence of the variables for each of the identified clusters [26].
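The one-versus-rest coding described above amounts to building one binary target per cluster. The sketch below shows only that coding step in Python, with made-up labels; the helper name is hypothetical and the gradient boosting and SHAP steps themselves are not reproduced here.

```python
def one_vs_rest_targets(cluster_labels, clusters=(1, 2, 3, 4)):
    """Build one 0/1 target vector per cluster from the cluster assignments."""
    return {
        cluster: [1 if label == cluster else 0 for label in cluster_labels]
        for cluster in clusters
    }


# Toy cluster assignments for five patients.
targets = one_vs_rest_targets([1, 3, 2, 4, 1])
print(targets[1])
```

Each target vector would then serve as the outcome of a separate binary classifier, which is consistent with one AUC being reported per cluster.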

Statistical analysis

For each cluster, descriptive characteristics were provided, reporting proportion (%) for categorical variables and mean (SD) or median (IQR) for continuous variables. Kruskal-Wallis and chi-squared tests were used to compare across clusters, for continuous and categorical data, respectively.

The association between phenotypic clusters and adverse cardiovascular morbidity and mortality outcomes was assessed using a Cox proportional hazards regression model. The hazard ratio (HR) for each phenotypic group is presented with 95% confidence intervals (CI) and corresponding p-values. Cumulative incidence plots were derived and differences between phenotypic groups were assessed by the log-rank test. All statistical analyses were performed using Stata SE version 17 (StataCorp LP) and R version 4.1.0. An alpha level of 0.05 was used.
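As a reminder of how a hazard ratio and its 95% confidence interval relate to a Cox model coefficient, the sketch below exponentiates the coefficient and its Wald limits. It assumes a normal approximation on the log scale (a standard choice, though the paper does not state the interval type), and the helper names are hypothetical. The example input is the HR of 1.20 (95% CI, 1.14–1.26) reported in this paper for cluster 3 versus cluster 1.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def se_from_ci(lower, upper):
    """Back out the standard error of log(HR) from a reported 95% CI."""
    return (math.log(upper) - math.log(lower)) / (2 * Z_95)


def hr_with_ci(log_hr, se):
    """Return (HR, lower, upper) from a Cox coefficient and its standard error."""
    return (
        math.exp(log_hr),
        math.exp(log_hr - Z_95 * se),
        math.exp(log_hr + Z_95 * se),
    )


# Reported interval for cluster 3 versus cluster 1 (primary outcome).
se = se_from_ci(1.14, 1.26)
hr, lower, upper = hr_with_ci(math.log(1.20), se)
print(round(hr, 2), round(lower, 2), round(upper, 2))
```

Rebuilding the interval from the implied standard error returns the reported limits to two decimal places, which is a quick consistency check on published estimates.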

Ethics approval and consent to participate

Ethical approval for this study was obtained from the Independent Scientific Advisory Committee (ISAC)–study protocol number 19_023R. De-identified (anonymised) patient data was obtained from the CPRD hence this study was exempt from obtaining informed consent from patients.

Results

Clinical characteristics among phenotypic clusters

We identified 68,642 patients aged ≥18 years with any incident non-fatal stroke event between 1998 and 2017. A total of 20,528 (29.9%) patients with subsequent clinical outcomes occurring within 30 days of the incident stroke event were excluded, as these outcomes were considered to be related to the incident stroke event [16]. Cluster analysis was performed in the remaining 48,114 patients. Four phenotypic clusters with significant differences in clinical characteristics were identified. The identified clusters were numbered from 1 to 4 in ascending order of the overall incidence of the primary outcome, the composite of recurrent stroke or CVD-related mortality. Table 1 describes and compares the clinical characteristics among the phenotypic clusters.

Table 1. Characteristics of study population at time of incident stroke according to cluster membership (n = 48,114).

https://doi.org/10.1371/journal.pdig.0000334.t002

The plots of the clusters are shown with the principal component analysis (PCA) dimensions in S6 Fig. The cluster profiles are summarised in Box 2.

Variable importance for clusters

The supervised gradient boosting model used to identify key covariates (candidate variables) that predict the respective phenotypic cluster had excellent prediction accuracy, with an area under the receiver operating characteristic curve (AUC) of 0.985, 0.982, 0.974, and 0.970 for clusters 1, 2, 3 and 4, respectively. The most common variables for predicting the respective phenotypic clusters were age at incident stroke, blood pressure, hypertension, LDL cholesterol, and potency of prescribed statin (Fig 2).

Fig 2. Plot showing the clinical parameters which are the core of each phenotypic cluster.

aki: acute kidney injury; dbp: diastolic blood pressure; dm_eye_comp: diabetic ophthalmic complications; sbp: systolic blood pressure; gfr: glomerular filtration rate; hb: haemoglobin; hdl: high-density lipoprotein cholesterol; ldl: low-density lipoprotein cholesterol; hba1c: glycated haemoglobin; nonRH_aortic: non-rheumatic aortic valve disorder; smi: severe mental illness; tg: triglyceride; tia: transient ischaemic attack. SHAP summary plot combines feature/variable importance with feature effects. Each point on the summary plot is a Shapley value for an individual. The position on the y-axis is determined by the feature and on the x-axis by the Shapley value. The colour represents the value from low to high. The features are ordered according to importance.

https://doi.org/10.1371/journal.pdig.0000334.g002

Association with subsequent clinical outcomes

During the median follow-up time of 12.60 years (IQR, 7.60–16.97 years), there was a total of 24,588 (51.1%) composite recurrent stroke or CVD-related mortality outcome events. The occurrence of recurrent stroke or CVD-related mortality differed across the 4 phenotypic clusters: cluster 1 had the lowest incidence rate (15.13 per 100 person-years; 95% CI, 14.54–15.74), while cluster 4 had the highest incidence rate (23.17 per 100 person-years; 95% CI, 22.67–23.69). The risk of subsequent recurrent stroke or CVD-related mortality was significantly increased in cluster 2 (hazard ratio (HR), 1.07; 95% CI, 1.02–1.12), cluster 3 (HR, 1.20; 95% CI, 1.14–1.26), and cluster 4 (HR, 1.44; 95% CI, 1.37–1.50), when compared with cluster 1. Similar incidence rate and hazard ratio trends were observed for the composite of subsequent recurrent stroke or all-cause mortality (cluster 2: HR, 1.07; 95% CI, 1.03–1.12; cluster 3: HR, 1.32; 95% CI, 1.26–1.37; cluster 4: HR, 1.54; 95% CI, 1.48–1.60) and for recurrent stroke (cluster 2: HR, 1.10; 95% CI, 1.05–1.16; cluster 3: HR, 1.12; 95% CI, 1.06–1.18; cluster 4: HR, 1.25; 95% CI, 1.19–1.32).

Different trends in incidence rate and hazard ratios were observed, however, for subsequent CHD, PVD, heart failure, CVD-related and all-cause mortality outcomes–Fig 3 and Table 2. When compared with cluster 1, the risk of subsequent CHD events was significantly decreased in the other 3 clusters (cluster 2: HR, 0.49; 95% CI: 0.44–0.55; cluster 3: HR, 0.64; 95% CI, 0.56–0.73; cluster 4: HR, 0.55; 95% CI, 0.49–0.63). A similar decreased risk in the other 3 clusters when compared to cluster 1 was observed for risk of subsequent PVD.

Fig 3. Incidence rate for the subsequent adverse outcomes by the identified phenotypic clusters.

https://doi.org/10.1371/journal.pdig.0000334.g003

Table 2. Subsequent major adverse outcomes after incident stroke by phenotypic clusters.

https://doi.org/10.1371/journal.pdig.0000334.t004

For risk of subsequent heart failure, CVD-related mortality and all-cause mortality, cluster 2 had a significantly decreased risk when compared to cluster 1, while clusters 3 and 4 had a significantly increased risk (Table 2). The occurrence of subsequent cardiovascular morbidity and mortality outcomes across the different phenotypic clusters is presented as Kaplan-Meier plots in Fig 4.

Fig 4. Kaplan-Meier plots for subsequent clinical outcomes stratified by phenotypic clusters.

A: Recurrent stroke and CVD-related mortality (log-rank p<0.0001); B: Recurrent stroke and all-cause mortality (log-rank p<0.0001); C: Recurrent stroke (log-rank p<0.0001); D: Coronary heart disease (log-rank p<0.0001); E: Peripheral vascular disease (log-rank p<0.0001); F: Heart failure (log-rank p<0.0001); G: Cardiovascular-related mortality (log-rank p<0.0001); H: All-cause mortality (log-rank p<0.0001).

https://doi.org/10.1371/journal.pdig.0000334.g004

Discussion

This population-based study exploring phenotypic characteristics of patients with incident stroke using a data-driven cluster analysis approach identified four clinically meaningful patient clusters based on the phenotypic characteristics at time of incident stroke. There was a varied relationship between the identified phenotypic clusters and subsequent risk of adverse cardiovascular morbidity and mortality outcomes.

In our study, four distinct and clinically meaningful phenotypic clusters were identified. Smoking, a strong independent modifiable risk factor for cardiovascular morbidity and mortality outcomes [27], was most prevalent in clusters 1 and 2. A preventative strategy that communicates the risks of smoking and the benefits of quitting to patients in these clusters could be an effective means to promote smoking cessation and reduce the risk of subsequent adverse events [28]. With the exception of cluster 2, the other 3 clusters had a high prevalence of multiple long-term conditions as well as CVD risk factors at time of incident stroke. Patients with incident stroke have been shown to commonly have pre-existing long-term conditions [29]. To optimally manage the possible atherogenic effect of these comorbid conditions and reduce the risk of subsequent cardiovascular morbidity and mortality outcomes, both non-pharmacological (that is, lifestyle modification [30,31]) and pharmacological (antihypertensives for blood pressure management [32]; lipid-lowering medications such as statins for cholesterol management [33]; antidiabetics for blood sugar control [30]; and antiplatelets/anticoagulants to manage arrhythmia [34]) strategies need to be prioritised in line with clinical guidelines [35]. Frequent monitoring/reviews to ensure treatment targets are being met are important [36]. Age, a non-modifiable risk factor, was a key factor for patient cluster membership. Among older adults (typical of cluster 4), the incidence of aortic disease, PVD and venous thromboembolism increases as age-related alterations in vascular structure and function are compounded by the longer exposure to CVD risk factors [37].

Clustering is a common approach used to analyse large datasets, to identify both the number of subgroups in the data and the attributes of each subgroup, as has been done in this study. Data analysed in real applications, including healthcare (from electronic health records), are mostly characterised by a mix of continuous and categorical variables. More common approaches that have been applied to mixed data include converting the variables to a single data type, by either coding the categorical variables as numbers or dummy coding them, and then applying standard distance-based methods designed for continuous variables, such as k-means, to the transformed data [38,39]. Continuous variables have also been converted to categorical variables using interval-based bucketing [40,41]. Similarities that may have been observed in the original data may be lost when the data are transformed in such ways [40]. The kamila clustering algorithm has, however, been reported to handle high imbalance between continuous and categorical data better than the other methods it was compared with [40,42]. From a computational perspective, the kamila algorithm has also been reported to offer the best performance and time efficiency among the compared algorithms when dealing with large datasets (in relation to both observations and variables) in the setting of heterogeneous data, as was the situation in our study [40,42].

Strengths and limitations

To our knowledge, this is the first data-driven cluster analysis aimed at identifying stroke phenotypes in a well-characterised, large population-based cohort of adults with any incident stroke. This allows us to cover a large range of stroke phenotypes. Most importantly, we had a comprehensive linked database with a broad spectrum of clinical data, with many of these variables being explored in cluster analysis for the first time.

There are, however, limitations of this study worth considering. First and foremost, the study was not meant to propose a new classification for stroke, because the clusters are likely to vary according to patient characteristics and available data. These results serve to underscore the need for novel multidimensional stroke classification approaches for improving patient care. Furthermore, they aim to generate hypotheses for future studies that will integrate clinical and biological data in patients, with the goal of improving the care of patients with stroke. With immense advancement in machine learning, cluster analysis can be performed in a large number of ways [42,43]. However, the knowledge and experience of the relevant experts remain the best judge in the interpretation of findings from cluster analysis, hence the involvement of a diverse group of clinical specialists, clinical researchers, and data experts in our study. The presence of missing data is a common occurrence in clinical research using electronic health records collected as part of routine care. For example, laboratory tests are typically requested only when considered necessary for a patient’s health condition. Similarly, information on BMI or smoking status may not be consistently recorded, leading to potential bias in patterns of data completeness. To address this issue, multiple imputation by chained equations, as outlined in the methods section, was used to handle missing data in our study, which is the preferred option under any missingness mechanism [19,20].

Implications

Cluster analysis is well suited to address the multidimensional complexity of disease conditions with considerable heterogeneity such as stroke. Population-based cluster analysis could provide further understanding of disease patterns. Additionally, patients could be phenotyped and allocated to specific clusters that could be associated with different risks for various outcomes. Different treatment strategies or interventions could be targeted at specific phenotypic clusters, based on available evidence on risk and possible response. Future clinical trial design could also focus on high-risk clusters or on specific aspects within a cluster.

Conclusions

Using an unsupervised learning data-driven cluster analysis on a broad spectrum of baseline clinical data of patients with incident stroke, we identified four phenotypic and clinically meaningful clusters with respect to risk of subsequent major adverse outcomes. These findings highlight the significant heterogeneity that exists within patients with incident stroke with respect to subsequent adverse outcomes. This offers an opportunity to revisit the stratification of care for patients with incident stroke to improve patient outcomes. Further exploration in different patient cohorts and populations is needed.

Supporting information

S1 Fig. All clinical variables with missing values.

https://doi.org/10.1371/journal.pdig.0000334.s002

(DOCX)

S3 Fig. Plot of correlation matrix of 49 selected variables.

https://doi.org/10.1371/journal.pdig.0000334.s004

(DOCX)

S4 Fig. Ranked cross-correlation plot of 49 selected variables.

https://doi.org/10.1371/journal.pdig.0000334.s005

(DOCX)

S6 Fig. Principal component analysis (PCA) plots.

https://doi.org/10.1371/journal.pdig.0000334.s007

(DOCX)

S1 Table. Overview of all variables and the in- or exclusion at the various data processing steps.

https://doi.org/10.1371/journal.pdig.0000334.s008

(DOCX)

S2 Table. Observed versus imputed values after multiple imputation for all clinical variables with missing data.

https://doi.org/10.1371/journal.pdig.0000334.s009

(DOCX)

Acknowledgments

We thank the practices that contributed to the CPRD GOLD.

References

  1. 1. Rajsic S, Gothe H, Borba HH, Sroczynski G, Vujicic J, Toell T, et al. Economic burden of stroke: a systematic review on post-stroke care. Eur J Heal Econ. 2019;20: 107–134. pmid:29909569
  2. 2. Prosser J, MacGregor L, Lees KR, Diener HC, Hacke W, Davis S. Predictors of early cardiac morbidity and mortality after ischemic stroke. Stroke. 2007;38: 2295–2302. pmid:17569877
  3. 3. Joosten SA, Hamza K, Sands S, Turton A, Berger P, Hamilton G. Phenotypes of patients with mild to moderate obstructive sleep apnoea as confirmed by cluster analysis. Respirology. 2012;17: 99–107. pmid:21848707
  4. 4. Haldar P, Pavord ID, Shaw DE, Berry MA, Thomas M, Brightling CE, et al. Cluster analysis and clinical asthma phenotypes. Am J Respir Crit Care Med. 2008;178: 218–224. pmid:18480428
  5. 5. Siroux V, Basagan X, Boudier A, Pin I, Garcia-Aymerich J, Vesin A, et al. Identifying adult asthma phenotypes using a clustering approach. Eur Respir J. 2011;38: 310–317. pmid:21233270
  6. 6. Ahmad T, Pencina MJ, Schulte PJ, O’Brien E, Whellan DJ, Piña IL, et al. Clinical implications of chronic heart failure phenotypes defined by cluster analysis. J Am Coll Cardiol. 2014;64: 1765–1774. pmid:25443696
  7. 7. Verdonschot JAJ, Merlo M, Dominguez F, Wang P, Henkens MTHM, Adriaens ME, et al. Phenotypic clustering of dilated cardiomyopathy patients highlights important pathophysiological differences. Eur Heart J. 2021;42: 162–174. pmid:33156912
  8. 8. Seymour CW, Kennedy JN, Wang S, Chang CCH, Elliott CF, Xu Z, et al. Derivation, Validation, and Potential Treatment Implications of Novel Clinical Phenotypes for Sepsis. J Am Med Assoc. 2019;321: 2003–2017. pmid:31104070
  9. 9. Fereshtehnejad SM, Romenets SR, Anang JBM, Latreille V, Gagnon JF, Postuma RB. New clinical subtypes of Parkinson disease and their longitudinal progression a prospective cohort comparison with other phenotypes. JAMA Neurol. 2015;72: 863–873. pmid:26076039
  10. 10. Soria D, Garibaldi JM, Ambrogi F, Green AR, Powe D, Rakha E, et al. A methodology to identify consensus classes from clustering algorithms applied to immunohistochemical data from breast cancer patients. Comput Biol Med. 2010;40: 318–330. pmid:20106472
  11. 11. Ahlqvist E, Storm P, Käräjämäki A, Martinell M, Dorkhan M, Carlsson A, et al. Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables. Lancet Diabetes Endocrinol. 2018;6: 361–369. pmid:29503172
  12. 12. Herrett E, Gallagher AM, Bhaskaran K, Forbes H, Mathur R, van Staa T, et al. Data Resource Profile: Clinical Practice Research Datalink (CPRD). Int J Epidemiol. 2015;44: 827–836. pmid:26050254
  13. 13. NHS Digital. Hospital Episode Statistics (HES). In: NHS Digital [Internet]. 2019 [cited 21 Jun 2019]. Available: https://digital.nhs.uk/data-and-information/data-tools-and-services/data-services/hospital-episode-statistics
  14. 14. Office for National Statistics. Deaths Registration Data. In: ONS [Internet]. 2018 [cited 21 Jun 2019]. Available: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths
  15. 15. Department of Communities and Local Government. English Indices of Deprivation 2015. 2015 [cited 10 Jul 2016] pp. 1–11. Available: https://www.gov.uk/government/statistics/english-indices-of-deprivation-2015
  16. 16. Akyea RK, Vinogradova Y, Qureshi N, Patel RS, Kontopantelis E, Ntaios G, et al. Sex, Age, and Socioeconomic Differences in Nonfatal Stroke Incidence and Subsequent Major Adverse Outcomes. Stroke. 2021;52: 396–405. pmid:33493066
  17. 17. Kuan V, Denaxas S, Gonzalez-Izquierdo A, Direk K, Bhatti O, Husain S, et al. A chronological map of 308 physical and mental health conditions from 4 million individuals in the English National Health Service. Lancet Digit Heal. 2019;1: e63–e77. pmid:31650125
  18. CPRD @ Cambridge. Codes Lists (GOLD). [cited 6 Mar 2021]. Available: https://www.phpc.cam.ac.uk/pcu/research/research-groups/crmh/cprd_cam/codelists/v11/
  19. Royston P. Multiple imputation of missing values: Update of ice. Stata J. 2005;5: 527–536.
  20. Kontopantelis E, White IR, Sperrin M, Buchan I. Outcome-sensitive multiple imputation: A simulation study. BMC Med Res Methodol. 2017;17: 1–13. pmid:28068910
  21. Rubin DB. Multiple imputation for nonresponse in surveys. Wiley; 1987. https://doi.org/10.1002/9780470316696
  22. Altman N, Krzywinski M. The curse(s) of dimensionality. Nat Methods. 2018;15: 399–400. pmid:29855577
  23. Kursa MB, Rudnicki WR. Feature selection with the Boruta package. J Stat Softw. 2010;36: 1–13.
  24. Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological). 1996. pp. 267–288.
  25. Foss AH, Markatou M. kamila: Clustering mixed-type data in R and Hadoop. J Stat Softw. 2018;83: 1–44.
  26. Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2: 56–67. pmid:32607472
  27. Mons U, Müezzinler A, Gellert C, Schöttker B, Abnet CC, Bobak M, et al. Impact of smoking and smoking cessation on cardiovascular events and mortality among older adults: Meta-analysis of Individual participant data from prospective cohort studies of the CHANCES consortium. BMJ. 2015;350: 18. pmid:25896935
  28. Duncan MS, Freiberg MS, Greevy RA, Kundu S, Vasan RS, Tindle HA. Association of Smoking Cessation with Subsequent Risk of Cardiovascular Disease. JAMA. 2019;322: 642–650. pmid:31429895
  29. Gallacher KI, Batty GD, McLean G, Mercer SW, Guthrie B, May CR, et al. Stroke, multimorbidity and polypharmacy in a nationally representative sample of 1,424,378 patients in Scotland: Implications for treatment burden. BMC Med. 2014;12: 1–9. pmid:25280748
  30. Kernan WN, Ovbiagele B, Black HR, Bravata DM, Chimowitz MI, Ezekowitz MD, et al. Guidelines for the prevention of stroke in patients with stroke and transient ischemic attack: A guideline for healthcare professionals from the American Heart Association/American Stroke Association. Stroke. 2014;45: 2160–2236. pmid:24788967
  31. Billinger SA, Arena R, Bernhardt J, Eng JJ, Franklin BA, Johnson CM, et al. Physical activity and exercise recommendations for stroke survivors: A statement for healthcare professionals from the American Heart Association/American Stroke Association. Stroke. 2014;45: 2532–2553. pmid:24846875
  32. Arima H, Chalmers J, Woodward M, Anderson C, Rodgers A, Davis S, et al. Lower target blood pressures are safe and effective for the prevention of recurrent stroke: The PROGRESS trial. J Hypertens. 2006;24: 1201–1208. pmid:16685221
  33. Fulcher J, O’Connell R, Voysey M, Emberson J, Blackwell L, Mihaylova B, et al. Efficacy and safety of LDL-lowering therapy among men and women: Meta-analysis of individual data from 174 000 participants in 27 randomised trials. Lancet. 2015;385: 1397–1405. pmid:25579834
  34. Gent M. A randomised, blinded, trial of clopidogrel versus aspirin in patients at risk of ischaemic events (CAPRIE). Lancet. 1996;348: 1329–1339. pmid:8918275
  35. Kleindorfer DO, Towfighi A, Chaturvedi S, Cockroft KM, Gutierrez J, Lombardi-Hill D, et al. 2021 Guideline for the prevention of stroke in patients with stroke and transient ischemic attack: A guideline from the American Heart Association/American Stroke Association. Stroke. 2021;52: E364–E467. pmid:34024117
  36. National Institute for Health and Care Excellence. Multimorbidity: clinical assessment and management. NICE; 2016 [cited 1 Oct 2021]. Available: https://www.nice.org.uk/guidance/ng56
  37. Miller AP, Huff CM, Roubin GS. Vascular disease in the older adult. J Geriatr Cardiol. 2016;13: 727–732. pmid:27899936
  38. Dougherty J, Kohavi R, Sahami M. Supervised and Unsupervised Discretization of Continuous Features. Mach Learn Proc 1995. 1995; 194–202.
  39. Hennig C, Liao TF. How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification. J R Stat Soc Ser C Appl Stat. 2013;62: 309–369.
  40. Foss A, Markatou M, Ray B, Heching A. A semiparametric method for clustering mixed data. Mach Learn. 2016;105: 419–458.
  41. Ichino M, Yaguchi H. Generalized Minkowski Metrics for Mixed Feature-Type Data Analysis. IEEE Trans Syst Man Cybern. 1994;24: 698–708.
  42. Preud’homme G, Duarte K, Dalleau K, Lacomblez C, Bresso E, Smaïl-Tabbone M, et al. Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark. Sci Rep. 2021;11: 1–14. pmid:33603019
  43. McLachlan GJ. Cluster analysis and related techniques in medical research. Stat Methods Med Res. 1992;1: 27–48. pmid:1341650