Abstract
Genetic factors contribute to 60-70% of the variability in rheumatoid arthritis (RA). However, few studies have used genetic variants to predict RA risk. This study aimed to enhance RA risk prediction by leveraging single nucleotide polymorphisms (SNPs) through machine-learning algorithms, utilizing Women’s Health Initiative data. We developed four predictive models: 1) a model based on common RA risk factors, 2) model 1 plus polygenic risk scores (PRS) with principal components, 3) model 1 plus SNPs after feature reduction, and 4) model 1 plus SNPs after kernel principal component analysis. Each model was assessed using logistic regression (LR), random forest (RF), eXtreme Gradient Boosting (XGBoost), and support vector machine (SVM). Performance metrics included the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive and negative predictive values (PPV and NPV), and F1-score. The fourth model, integrating SNPs with XGBoost, outperformed all other models. In addition, the XGBoost model that combines genomic data with conventional phenotypic predictors significantly enhanced predictive accuracy, achieving the highest AUC of 0.90 and an F1-score of 0.83. The DeLong test confirmed significant differences in AUC between this model and the others (p-values < 0.0001), particularly highlighting its efficacy in utilizing complex genetic information. These findings emphasize the advantage of combining in-depth genomic data with advanced machine learning for RA risk prediction. The robust performance of the XGBoost model, which integrated both conventional risk factors and individual SNPs, demonstrates its potential as a tool in personalized medicine for complex diseases like RA. This approach offers a more nuanced and effective RA risk assessment strategy, underscoring the need for further studies to extend it to broader applications.
Author summary
In this study, we explored the role of genetic factors, which account for 60-70% of rheumatoid arthritis (RA) risk variability, by utilizing genetic data, specifically single nucleotide polymorphisms (SNPs). Using Women’s Health Initiative data, we developed four machine-learning models to predict RA risk. These models ranged from one based on common RA risk factors to models incorporating individual SNPs. The most effective model employed the eXtreme Gradient Boosting (XGBoost) method, combining SNPs with conventional risk factors, significantly enhancing predictive accuracy. This model outperformed the others, achieving the best values of key metrics such as the area under the curve (AUC) and F1-score. Our findings underscore the potential of integrating detailed genetic information with machine learning in predicting RA risk, marking a significant advancement in personalized medicine, especially for postmenopausal women. This approach paves the way for more tailored healthcare strategies.
Citation: Xu Y, Wu Q (2025) Using machine learning and single nucleotide polymorphisms for improving rheumatoid arthritis risk Prediction in postmenopausal women. PLOS Digit Health 4(4): e0000790. https://doi.org/10.1371/journal.pdig.0000790
Editor: Henry Horng-Shing Lu, National Yang Ming Chiao Tung University, TAIWAN
Received: February 7, 2024; Accepted: February 17, 2025; Published: April 9, 2025
Copyright: © 2025 Xu, Wu. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data were obtained from dbGaP and are available at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000200.v12.p3
Funding: This work is supported by the National Institute on Minority Health and Health Disparities (R21MD013681 awarded to QW), the National Institute on Aging (R01AG080017 awarded to QW), and the National Institute of General Medical Sciences (P20GM121325-8463 awarded to QW). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Rheumatoid arthritis (RA) is a chronic autoimmune disease posing significant global health challenges, affecting about 1% of the global population and leading to substantial morbidity and mortality [1,2]. As of 2020, the age-standardized global prevalence rate of RA was approximately 208.8 cases per 100,000 population [3], showing an increasing trend, especially in females. This prevalence contributes to substantial work disability, affecting around 35% of individuals with RA [4], thereby imposing considerable burdens on individuals, families, and communities.
Early and accurate diagnosis of RA is critical for effective management. Yet, it remains challenging due to overlapping symptoms with other types of arthritis and a limited window for effective treatment intervention [5]. Traditional risk factors for RA, such as age, race/ethnicity, physical activity, smoking status, and body mass index (BMI), have been well-documented [6–9]. However, the integration of genetic information for RA risk prediction remains underexplored.
Advancements in genome-wide association studies (GWAS) have shed light on genetic variants associated with RA, accounting for a significant portion of the variation in RA liability [10]. Our prior research, utilizing a polygenic risk score (PRS) derived from a pruning and thresholding method, indicated a higher RA risk in individuals with a high PRS [11]. Yet, this approach, assuming linear additive effects of genetic variants, may not reflect the complexity of RA’s genetic underpinnings.
In the present study, we explore the potential of machine learning (ML) – a branch of artificial intelligence that enables predictive modeling from complex datasets [12]. ML’s application in medical fields like oncology has shown promise in diagnosis, recurrence, and prognosis predictions [13,14]. This study aims to harness ML’s power using single nucleotide polymorphisms (SNPs) to develop a comprehensive predictive model for RA. We will compare the performance of logistic regression, random forest, eXtreme Gradient Boosting, and support vector machine algorithms in RA risk prediction against traditional phenotype or PRS-based methods. This endeavor could revolutionize RA’s early diagnosis and risk stratification, providing insights into its genetic architecture and facilitating personalized treatment approaches.
By focusing on this innovative approach, the present study aligns with the digital health paradigm, aiming to contribute significantly to personalized medicine and the management of RA, particularly in postmenopausal women who represent a high-risk group. This research thus addresses a critical gap in RA diagnosis and management, situating itself at the forefront of digital health and ML applications in chronic disease management.
Results
A total of 12,028 participants were included in this study, 1,304 with and 10,724 without RA. The mean (SD) age of participants was 61.7 (7.5) and 62.5 (7.4) years in participants with and without RA, respectively, while the mean (SD) BMI was 30.6 (5.5) and 29.1 (5.4) in the two groups. Significant age and BMI differences were observed between participants with and without RA (Table 1). The distributions of race and physical activity also differed significantly between the two groups, as shown in Table 1.
Fig 1 shows the AUCs of the four models under each of the four algorithms (LR, RF, XGBoost, and SVM). The models using individual SNPs (models 3 and 4) outperformed model 1 (phenotype only) and model 2 (phenotype with PRS) across the four algorithms. The DeLong test (Table 2) showed that the AUCs of models 3 and 4 differed significantly from those of models 1 and 2 in all four algorithms (all p-values ≤ 0.01). Model 3 performed better than models 1 and 2, with a significantly higher AUC than either of them (DeLong test, p-value ≤ 0.01). Fig 2 presents the sensitivity, specificity, PPV, NPV, and F1-score of the four models under the four algorithms. Model 3 had a higher F1-score than models 1 and 2, and among the four ML algorithms, model 3 with XGBoost had the best F1-score (0.76). With this model, important SNPs can be identified by variable importance, and the ten highest-ranking SNPs in model 4 with XGBoost are shown in Fig 3.
*LR: logistic regression; RF: random forest; XGBoost: eXtreme Gradient Boosting; SVM: support vector machine; PRS: polygenic risk score; PCs: principal components; FeatureWiz: uses Searching for Uncorrelated List of Variables and XGBoost for feature reduction; KPCA: kernel principal component analysis; FPR: false positive rate; TPR: true positive rate.
*LR: logistic regression; RF: random forest; XGBoost: eXtreme Gradient Boosting; SVM: support vector machine; PPV: positive predictive value; NPV: negative predictive value; AUC: area under the curve.
Fig 1 indicates that model 4 had the highest AUC among the four models. The DeLong test shows significant differences in AUC between model 4 and model 3 (all p-values < 0.0001) in all algorithms except RF. Fig 2 shows that model 4 had the highest F1-score of all models. Among the four algorithms, model 4 with XGBoost performed best, achieving an F1-score of 0.83 with corresponding sensitivity, specificity, PPV, and NPV of 0.73, 0.72, 0.95, and 0.25, respectively. The AUC and the other metrics indicate that XGBoost performed better in RA prediction than LR, RF, and SVM.
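For readers less familiar with these metrics, the following minimal sketch shows how sensitivity, specificity, PPV, NPV, and F1-score relate to the counts in a confusion matrix. The function names and the example counts are hypothetical and are not taken from the study; only the reported PPV (0.95) and sensitivity (0.73) of model 4 with XGBoost come from the text above.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


def f1_from_ppv_and_sensitivity(ppv, sensitivity):
    """F1-score is the harmonic mean of PPV and sensitivity."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# Hypothetical counts, for illustration only (not study data).
example = classification_metrics(tp=80, fp=10, tn=90, fn=20)

# Harmonic mean of the PPV and sensitivity reported for model 4 with XGBoost.
reported_f1 = f1_from_ppv_and_sensitivity(ppv=0.95, sensitivity=0.73)
print(round(reported_f1, 2))
```

Because the F1-score depends only on PPV and sensitivity, it can stay high even when specificity or NPV is modest, which is worth keeping in mind when comparing models on F1-score alone.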
Moreover, integrating conventional risk factors (Model 1) and SNP data with KPCA (Model 4) into the XGBoost model significantly improved its performance. This integrated approach achieved an AUC of 0.90, outperforming both Model 1 and Model 4 (p-values < 0.0001). The model also demonstrated robust results, with an F1-score of 0.87, sensitivity of 0.83, specificity of 0.79, a positive predictive value (PPV) of 0.97, and a negative predictive value (NPV) of 0.33.
Discussion
This study investigated the utility of incorporating genomic information with ML algorithms for predicting RA risk in postmenopausal women. Our findings indicate that ML models using high-dimensional genomic data outperformed models using phenotype or PRS data. Additionally, XGBoost outperformed the other algorithms in RA prediction among postmenopausal women. Moreover, integrating genomic information with traditional phenotypic predictors using the XGBoost model significantly enhanced its predictive accuracy, achieving an F1-score of 0.87 and an AUC of 0.90.
Several GWAS have identified hundreds of SNPs associated with RA. Some studies observed that a family history of RA significantly increases the risk of developing the disease [15], indicating that using genetic variants as predictors may help predict an individual’s RA risk. PRS has been used to summarize information from genetic variants as a single value for estimating the risk of a specific disease. However, PRS often has limited predictive power because the effect sizes of genetic variants are small. Thus, PRS may not adequately capture the polygenic liability of RA. In one of our previous studies, a significant association between PRS and RA was observed but with a small effect size, so the PRS might not perform well in risk prediction in practice. In contrast, treating each SNP as an individual predictor might be beneficial: PRS assumes that variant effects are additive, whereas treating SNPs as individual predictors does not rely on such an assumption. This method might therefore be more robust for disease prediction in personalized medicine.
ML has been used in several studies to predict complex disease phenotypes using SNPs, including RA prediction. A recent study used electronic health records with a deep learning model to accurately predict RA disease activity [16]. While ML has been used for RA prediction in previous studies, our study is the first to utilize SNPs as predictors with ML algorithms to predict RA risk. We found that models using individual SNPs as predictors performed better than models using phenotype or PRS only. Our results suggest that ML models using high-dimensional genomic data can significantly improve RA risk prediction over models using phenotype or PRS data. ML models can potentially capture interaction effects between SNPs and other risk factors, thus capturing more information and improving RA prediction. In addition, we utilized KPCA for dimensionality reduction [17]. Although principal component analysis (PCA) is more commonly used in practice, KPCA can handle nonlinear data, while PCA can only perform linear dimensionality reduction [18]. Several studies utilized KPCA for feature reduction and observed that it could effectively capture the underlying structure and information [17,19]. The current study found that the KPCA model outperformed the FeatureWiz model.
Our study acknowledges several limitations. The focus on postmenopausal women in the United States may limit the broader applicability of our findings, underscoring the need for further research across diverse populations to enhance the model’s generalizability in RA risk prediction. Additionally, due to limitations in data access, only a narrow set of phenotype information (e.g., history of live birth) was utilized in the study. Broader phenotype risk factors should be explored in future research. Moreover, using randomly split training and test datasets, though effective in mitigating overfitting, might not capture the complexities of real-world data [20]. Hence, future studies should employ cross-validation and external validation to augment the model’s robustness and practical applicability. A notable limitation is the inability of our model to differentiate between RA subtypes due to the constraints of the WHI data, which is a critical aspect, considering the varied clinical manifestations and treatment responses of seropositive and seronegative RA. Future research should include subtype-specific data to develop more accurate models, paving the way for personalized risk predictions and treatment strategies. Addressing these limitations will enhance our model’s predictive accuracy and clinical relevance, making it a more effective tool for personalized RA risk assessment.
Our findings highlight a significant advancement in RA risk prediction through integrating genomic information with ML algorithms, notably outperforming models based on phenotype or PRS data. XGBoost proved to be the most effective algorithm in postmenopausal women. This research provides crucial insights into using ML and genomic data for predicting complex diseases like RA, with implications extending to clinical and public health realms. Notably, combining conventional clinical risk factors and KPCA-transformed SNPs in an XGBoost model substantially improved predictive performance. This finding suggests that our model holds promise for enhancing risk prediction in clinical scenarios, potentially offering benefits in more accurate patient stratification and personalized medicine. The improved predictive accuracy of our model offers healthcare providers a tool for more efficient identification of high-risk individuals, enabling earlier and more customized interventions. Moreover, these methods could enhance public health strategies, leading to better-targeted screening and resource allocation for RA management. Continued research is necessary to extend these methods to a broader range of populations, ultimately enriching their applicability. The long-term value of these approaches lies in their potential to transform personalized medicine, especially in the risk assessment and management of RA.
Materials and methods
Design and participants
This study analyzed data from the Women’s Health Initiative (WHI), a longitudinal study conducted among postmenopausal women at 40 clinical centers across the United States. The study design has been previously described [21]. Briefly, 161,808 women were recruited for one or more of three clinical trials or an observational study. Participants received physical examinations yearly or every three years, and additional data were collected through questionnaires via mail or telephone. Each participating institution obtained Institutional Review Board approval for its consent forms. Data for this study were obtained from the database of Genotypes and Phenotypes with the approval of the institutional review board at the University of Nevada, Las Vegas (UNLV). The data were fully anonymized before we accessed them, and the UNLV IRB waived the requirement for informed consent. Access to the data was granted through the database of Genotypes and Phenotypes (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000200.v12.p3).
The current study included participants from three WHI sub-studies: the Women’s Health Initiative Memory Study, Population Architecture using Genomics and Epidemiology, and Genomics and Randomized Trials Network. Participants were excluded if they had arthritis at baseline or participated in the hormone replacement therapy or vitamin D trials, as these factors may impact RA [22,23]. This study conforms to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines, and the STROBE checklist can be found in S1 Checklist.
Genotype information
Genotyping was performed using the Illumina or Affymetrix 6.0 Array Set Platform. Genotype imputation was conducted on the Sanger Imputation Server using the Haplotype Reference Consortium reference panel and the Positional Burrows–Wheeler Transform imputation algorithm [24]. Summary statistics from the most powerful RA-related genome-wide association studies (GWAS) were used for SNP extraction [25]. Quality control steps described by Choi et al. were performed using PLINK 1.9 [26]. SNPs with a minor allele frequency below 0.01, a low p-value in the Hardy-Weinberg equilibrium test, or a high missingness rate across subjects were filtered out. Pruning was performed with a window size of 200 variants, sliding across the genome with a step size of 50 variants at a time. SNPs with a linkage disequilibrium (LD) r² higher than 0.25 were filtered out, and individuals with a first- or second-degree relative in the sample were removed. Each SNP was coded as AA = 0, Aa = 1, aa = 2, implying that each additional copy of the minor allele increases the risk by the same amount. Our prior study determined an optimal threshold for PRS calculation. The current study utilized this optimal threshold to compare the predictive performance of the model using SNPs as predictors with that of the model using PRS. An individual’s PRS was derived as:

$$\mathrm{PRS} = \sum_{j} \beta_j X_j$$

where $X_j$ is the number of risk alleles (0, 1, or 2) for variant $j$, and $\beta_j$ is the effect size of variant $j$, obtained from the GWAS summary statistics. The PRS was derived based on our previous study [11]. More specifically, seven candidate scores were derived using different p-value thresholds (0.5, 0.4, 0.3, 0.2, 0.1, 0.05, and 0.001); the optimal PRS was selected based on the maximum area under the receiver operating characteristic curve (AUC) when predicting observed RA cases, and it was subsequently used for further analyses.
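The PRS definition and the p-value thresholding step described above can be illustrated with a minimal sketch. The SNP identifiers, effect sizes, p-values, and helper names below are hypothetical and are not taken from the study; the sketch also assumes that a variant is kept when its GWAS p-value is strictly below the threshold, which the text does not specify.

```python
def select_variants(gwas_rows, p_threshold):
    """Keep variants whose GWAS p-value is below the given threshold (assumed rule)."""
    return [row for row in gwas_rows if row["p"] < p_threshold]


def polygenic_risk_score(genotype, variants):
    """Weighted sum of risk-allele counts: PRS = sum_j beta_j * X_j."""
    score = 0.0
    for variant in variants:
        dosage = genotype[variant["snp"]]
        if dosage not in (0, 1, 2):
            raise ValueError("risk-allele count must be 0, 1, or 2")
        score += variant["beta"] * dosage
    return score


# Hypothetical GWAS summary statistics (illustration only).
gwas = [
    {"snp": "rsA", "beta": 0.20, "p": 1e-8},
    {"snp": "rsB", "beta": -0.10, "p": 0.03},
    {"snp": "rsC", "beta": 0.05, "p": 0.30},
]

# Hypothetical risk-allele counts for one individual.
genotype = {"rsA": 2, "rsB": 1, "rsC": 1}

# One candidate score per p-value threshold listed in the text.
thresholds = [0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.001]
candidate_scores = {
    t: polygenic_risk_score(genotype, select_variants(gwas, t)) for t in thresholds
}
for t, score in candidate_scores.items():
    print(f"p < {t}: PRS = {score:.2f}")
```

In a full analysis, each candidate score would be computed for every participant, and the threshold giving the highest AUC would be retained, as described above.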
Phenotype information
Well-established risk factors for RA were used as predictors in the current study, including age, race/ethnicity, physical activity, smoking status, and BMI [6–9]. Except for BMI, this information was collected by questionnaire (more details: https://www.whi.org/formList). The participants in the current study self-reported race, including Caucasian, African American, Hispanic, American Indian/Alaska Native, Asian, and American Indian. Physical activity is normally expressed in metabolic equivalent of task (MET) units, and the energy expenditure from recreational physical activity (MET-hours/week) was calculated based on questions about the participant’s usual exercise. The 2008 Physical Activity Guidelines for Americans suggested a minimum of 150 min/week of moderate-intensity exercise, which equals 7.5–14.9 MET-h/week. In this study, we categorized physical activity into four levels: no exercise (0); less than the guideline (0.1–7.4 MET-h/week); meeting the guideline (7.5–14.9 MET-h/week); and exceeding the guideline (≥15 MET-h/week) [7]. For smoking status, women were classified as current, former, or never-smokers (participants who had not smoked 100 cigarettes in their lifetime). Former smokers were those who answered ‘yes’ to the question ‘Have you smoked 100 cigarettes in your life?’ but ‘no’ to ‘Do you smoke cigarettes now?’ Current smokers were those who answered ‘yes’ to both questions [9]. BMI was calculated from measured weight and height. Individuals were categorized as underweight (BMI<18.5), normal weight (18.5≤BMI<25), overweight (25≤BMI<30), and obese (BMI≥30) [6]. Weight was measured in kg on a balance beam scale with the participant not wearing shoes. Height was measured to the nearest 0.1 cm using a wall-mounted stadiometer.
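The categorization rules above can be written down compactly, as in the following sketch. The function names and labels are hypothetical; the sketch also assumes that physical activity is a continuous value, so any value above 0 and below 7.5 MET-h/week is treated as "below guideline" and any value from 7.5 up to but not including 15 as "meeting guideline".

```python
def physical_activity_level(met_hours_per_week):
    """Map recreational energy expenditure (MET-h/week) to the four study levels."""
    if met_hours_per_week < 0:
        raise ValueError("MET-hours/week cannot be negative")
    if met_hours_per_week == 0:
        return "no exercise"
    if met_hours_per_week < 7.5:
        return "below guideline"
    if met_hours_per_week < 15:
        return "meeting guideline"
    return "exceeding guideline"


def bmi_category(weight_kg, height_cm):
    """Compute BMI (kg/m^2) from measured weight and height and categorize it."""
    bmi = weight_kg / (height_cm / 100) ** 2
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal weight"
    if bmi < 30:
        return "overweight"
    return "obese"


def smoking_status(smoked_100_cigarettes, smokes_now):
    """Classify smoking status from the two questionnaire answers."""
    if not smoked_100_cigarettes:
        return "never"
    return "current" if smokes_now else "former"


# Hypothetical participant, for illustration only.
print(physical_activity_level(9.0))   # meeting guideline
print(bmi_category(70.0, 170.0))      # normal weight (BMI about 24.2)
print(smoking_status(True, False))    # former
```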
Statistical analysis
The dataset was divided into training (80%) and testing (20%) sets for model development and performance evaluation. Because the dataset was imbalanced, under-sampling was applied [27], a method that deletes majority-class examples from the dataset so that the numbers of examples in the different classes become balanced. Hybrid feature selection, which combines different feature selection methods in a multi-step process, was used to address the dimensionality challenge caused by including genome-wide SNPs as individual predictors [28]. LASSO, a penalized regression method [29], was utilized for SNP selection, and 4774 SNPs were retained based on their non-zero coefficients. Two methods were employed for dimension reduction. The first is FeatureWiz, a quick and effective technique that uses the Searching for Uncorrelated List of Variables (SULOV) algorithm and XGBoost to reduce features. The second is kernel principal component analysis (KPCA) [30], which can construct nonlinear mappings that maximize the variance in the data. Four models were developed for RA prediction: model 1 (using the most common RA risk factors), model 2 (model 1 with PRS and five principal components), model 3 (model 1 with SNPs after feature reduction with FeatureWiz), and model 4 (model 1 with SNPs after feature extraction with KPCA). Four algorithms (LR, RF, XGBoost, and SVM) were used for each model, and five-fold cross-validation was performed with hyperparameters tuned using random and grid search. Parameters were tuned based on the model characteristics. Performance was assessed using the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive and negative predictive values, and F1-score. The flowchart of the current study is shown in Fig 4. The DeLong nonparametric test was used to assess differences in performance between models. In the DeLong test, the z-score is calculated to assess the null hypothesis $H_0: \mathrm{AUC}_1 = \mathrm{AUC}_2$:

$$z = \frac{\mathrm{AUC}_1 - \mathrm{AUC}_2}{\sqrt{\mathrm{Var}(\mathrm{AUC}_1) + \mathrm{Var}(\mathrm{AUC}_2) - 2\,\mathrm{Cov}(\mathrm{AUC}_1, \mathrm{AUC}_2)}}$$
The z-score measures the standardized difference between the two AUCs, taking into account their variances and covariance. The ten most important SNPs in model 4 were evaluated [31]. The K-nearest neighbor algorithm was used to impute missing values in the phenotype information: missing values were replaced with the mean values of the nearest neighbors, using the default number of neighbors (N=5). All statistical tests were two-sided, and statistical significance was set at p < 0.05. All statistical analyses were performed in Python version 3.8.15 using NumPy, Pandas, imblearn, Matplotlib, Seaborn, Scikit-learn, Xgboost, and Featurewiz [32–39].
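As an illustration of the DeLong comparison described above, the following is a minimal pure-Python sketch for two models scored on the same test set. It is not the code used in the study: the function names, labels, and scores are hypothetical, and the implementation uses simple nested loops rather than an optimized algorithm. The variances and covariance of the two AUCs are estimated from the per-case and per-control placement values, and the two-sided p-value is taken from the standard normal distribution.

```python
import math


def _psi(x, y):
    """Kernel of the Mann-Whitney statistic: 1 if x > y, 0.5 if tied, else 0."""
    if x > y:
        return 1.0
    if x == y:
        return 0.5
    return 0.0


def _placements(pos, neg):
    """Return the AUC and the placement values for cases (v10) and controls (v01)."""
    v10 = [sum(_psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(_psi(x, y) for x in pos) / len(pos) for y in neg]
    auc = sum(v10) / len(v10)
    return auc, v10, v01


def _cov(a, mean_a, b, mean_b):
    """Sample covariance of two equally long lists with known means."""
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


def delong_test(y_true, scores_1, scores_2):
    """Two-sided DeLong test for two correlated AUCs on the same samples."""
    pos_idx = [i for i, y in enumerate(y_true) if y == 1]
    neg_idx = [i for i, y in enumerate(y_true) if y == 0]
    m, n = len(pos_idx), len(neg_idx)

    auc1, v10_1, v01_1 = _placements(
        [scores_1[i] for i in pos_idx], [scores_1[i] for i in neg_idx]
    )
    auc2, v10_2, v01_2 = _placements(
        [scores_2[i] for i in pos_idx], [scores_2[i] for i in neg_idx]
    )

    var1 = _cov(v10_1, auc1, v10_1, auc1) / m + _cov(v01_1, auc1, v01_1, auc1) / n
    var2 = _cov(v10_2, auc2, v10_2, auc2) / m + _cov(v01_2, auc2, v01_2, auc2) / n
    cov12 = _cov(v10_1, auc1, v10_2, auc2) / m + _cov(v01_1, auc1, v01_2, auc2) / n

    var_diff = var1 + var2 - 2 * cov12
    if var_diff <= 0:
        raise ValueError("variance of the AUC difference is not positive")

    z = (auc1 - auc2) / math.sqrt(var_diff)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return auc1, auc2, z, p_value


# Hypothetical labels (1 = RA, 0 = no RA) and predicted scores from two models.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores_model_a = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
scores_model_b = [0.6, 0.4, 0.7, 0.3, 0.5, 0.8, 0.2, 0.45]

auc_a, auc_b, z, p_value = delong_test(y_true, scores_model_a, scores_model_b)
print(f"AUC A = {auc_a:.3f}, AUC B = {auc_b:.3f}, z = {z:.3f}, p = {p_value:.4f}")
```

The covariance term matters because both models are evaluated on the same participants; ignoring it would treat the two AUCs as independent and generally misstate the significance of their difference.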
Acknowledgments
We sincerely thank the original WHI study investigators and the invaluable participants for their pivotal contributions to advancing women’s health research. We also express our gratitude to the National Institutes of Health (NIH) and the database of Genotypes and Phenotypes (dbGaP) for granting access to analyze the WHI data. This work reflects our independent analysis and interpretation and does not represent the views of other parties associated with the WHI study. We sincerely appreciate the collective efforts and contributions of all institutions, collaborators, and teams involved in the WHI study. Part of Dr. Qing Wu’s work was conducted at the University of Nevada, Las Vegas.
References
- 1. Cross M, Smith E, Hoy D, Carmona L, Wolfe F, Vos T, et al. The global burden of rheumatoid arthritis: estimates from the global burden of disease 2010 study. Ann Rheum Dis. 2014;73(7):1316–22. pmid:24550173
- 2. Aletaha D, Smolen JS. Diagnosis and Management of Rheumatoid Arthritis: A Review. JAMA. 2018;320(13):1360–72. pmid:30285183
- 3. Black RJ, Cross M, Haile LM, Culbreth GT, Steinmetz JD, Hagins H, et al. GBD 2021 Rheumatoid Arthritis Collaborators. Global, regional, and national burden of rheumatoid arthritis, 1990-2020, and projections to 2050: a systematic analysis of the Global Burden of Disease Study 2021. Lancet Rheumatol. 2023;5(10):e594–610. pmid:37795020
- 4. Allaire S, Wolfe F, Niu J, Lavalley MP. Contemporary prevalence and incidence of work disability associated with rheumatoid arthritis in the US. Arthritis Rheum. 2008;59(4):474–80. pmid:18383413
- 5. Senthelal S, Li J, Ardeshirzadeh S, Thomas MA. Arthritis. StatPearls. 2022; Available from: https://www.ncbi.nlm.nih.gov/books/NBK518992/
- 6. Abuhelwa AY, Hopkins AM, Sorich MJ, Proudman S, Foster DJR, Wiese MD. Association between obesity and remission in rheumatoid arthritis patients treated with disease-modifying anti-rheumatic drugs. Sci Rep. 2020;10(1):18634. pmid:33122725
- 7. Xu Y, Wu Q. Prevalence trend and disparities in rheumatoid arthritis among US adults, 2005–2018. J Clin Med. 2021;10:3289.
- 8. van Nies JAB, Tsonaka R, Gaujoux-Viala C, Fautrel B, van der Helm-van Mil AHM. Evaluating relationships between symptom duration and persistence of rheumatoid arthritis: does a window of opportunity exist? Results on the Leiden early arthritis clinic and ESPOIR cohorts. Ann Rheum Dis. 2015;74(5):806–12. pmid:25561360
- 9. Chang K, Yang SM, Kim SH, Han KH, Park SJ, Shin JI. Smoking and rheumatoid arthritis. Int J Mol Sci. 2014;15(12):22279–95. pmid:25479074
- 10. Uhlig T, Moe RH, Kvien TK. The burden of disease in rheumatoid arthritis. Pharmacoeconomics. 2014;32:841–51.
- 11. Xu Y, Wu Q. Genome-wide polygenic risk score for rheumatoid arthritis prediction in postmenopausal women. J Gene Med. 2024;26(1):e3659. pmid:38282146
- 12. Plana D, Shung DL, Grimshaw AA, Saraf A, Sung JJY, Kann BH. Randomized Clinical Trials of Machine Learning Interventions in Health Care: A Systematic Review. JAMA Netw Open. 2022;5(9):e2233946. pmid:36173632
- 13. Howard FM, Kochanny S, Koshy M, Spiotto M, Pearson AT. Machine Learning-Guided Adjuvant Treatment of Head and Neck Cancer. JAMA Netw Open. 2020;3(11):e2025881. pmid:33211108
- 14. Yuan Q, Cai T, Hong C, Du M, Johnson BE, Lanuti M, et al. Performance of a Machine Learning Algorithm Using Electronic Health Record Data to Identify and Estimate Survival in a Longitudinal Cohort of Patients With Lung Cancer. JAMA Netw Open. 2021;4(7):e2114723. pmid:34232304
- 15. Deane KD, Demoruelle MK, Kelmenson LB, Kuhn KA, Norris JM, Holers VM. Genetic and environmental risk factors for rheumatoid arthritis. Best Practice & Research Clinical Rheumatology. 2017;31(1):3–18.
- 16. Norgeot B, Glicksberg BS, Trupin L, Lituiev D, Gianfrancesco M, Oskotsky B, et al. Assessment of a Deep Learning Model Based on Electronic Health Record Data to Forecast Clinical Outcomes in Patients With Rheumatoid Arthritis. JAMA Netw Open. 2019;2(3):e190606. pmid:30874779
- 17. Cao LJ, Chua KS, Chong WK, Lee HP, Gu QM. A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine. Neurocomputing. 2003;55(1–2):321–36.
- 18. Anowar F, Sadaoui S, Selim B. Conceptual and empirical comparison of dimensionality reduction algorithms (PCA, KPCA, LDA, MDS, SVD, LLE, ISOMAP, LE, ICA, t-SNE). Computer Science Review. 2021;40:100378.
- 19. Mohammed NN, Mohammed CJ. Enhanced Determination of Gene Groups Based on Optimal Kernel PCA with Hierarchical Clustering Algorithm. 2021 55th Annual Conference on Information Sciences and Systems (CISS). 2021:1–5.
- 20. Dobbin KK, Simon RM. Optimally splitting cases for training and testing high dimensional classifiers. BMC Med Genomics. 2011;4:31. pmid:21477282
- 21. Anderson G, Cummings S, Freedman LS, Furberg C, Henderson M, Johnson SR, et al. Design of the Women’s Health Initiative clinical trial and observational study. The Women’s Health Initiative Study Group. Control Clin Trials. 1998;19(1):61–109. pmid:9492970
- 22. Orellana C, Saevarsdottir S, Klareskog L, Karlson EW, Alfredsson L, Bengtsson C. Postmenopausal hormone therapy and the risk of rheumatoid arthritis: results from the Swedish EIRA population-based case-control study. Eur J Epidemiol. 2015;30(5):449–57. pmid:25762170
- 23. Kostoglou-Athanassiou I, Athanassiou P, Lyraki A, Raftakis I, Antoniadis C. Vitamin D and rheumatoid arthritis. Ther Adv Endocrinol Metab. 2012;3(6):181–7. pmid:23323190
- 24. Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, et al. Next-generation genotype imputation service and methods. Nat Genet. 2016;48(10):1284–7. pmid:27571263
- 25. Okada Y, Wu D, Trynka G, Raj T, Terao C, Ikari K, et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature. 2014;506(7488):376–81. pmid:24390342
- 26. Choi SW, Mak TS-H, O’Reilly PF. Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc. 2020;15(9):2759–72. pmid:32709988
- 27. Fujiwara K, Huang Y, Hori K, Nishioji K, Kobayashi M, Kamaguchi M, et al. Over- and Under-sampling Approach for Extremely Imbalanced and Small Minority Data Problem in Health Record Analysis. Front Public Health. 2020;8:178. pmid:32509717
- 28. Pudjihartono N, Fadason T, Kempa-Liehr AW, O’Sullivan JM. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Front Bioinform. 2022;2:927312. pmid:36304293
- 29. Ayers KL, Cordell HJ. SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genet Epidemiol. 2010;34(8):879–91. pmid:21104890
- 30. Savic V, Larsson EG, Ferrer-Coll J, Stenumgaard P. Kernel Methods for Accurate UWB-Based Ranging With Reduced Complexity. IEEE Trans Wireless Commun. 2016;15(3):1783–93.
- 31. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics. 1988;44(3):837.
- 32. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585(7825):357–62. pmid:32939066
- 33. McKinney W. Data Structures for Statistical Computing in Python. Proceedings of the Python in Science Conference. 2010:56–61.
- 34. Hunter JD. Matplotlib: A 2D Graphics Environment. Comput Sci Eng. 2007;9(3):90–5.
- 35. Waskom M. seaborn: statistical data visualization. JOSS. 2021;6(60):3021.
- 36. Buitinck L, Louppe G, Blondel M, Pedregosa F, Müller AC, Grisel O, et al. API design for machine learning software: experiences from the scikit-learn project. 2013 [cited 2 Oct 2024]. Available: https://arxiv.org/abs/1309.0238v1
- 37. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016;13-17-August-2016: 785–794.
- 38. GitHub – AutoViML/featurewiz.
- 39. Lemaître G, Nogueira F, Aridas CK. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of machine learning research. 2016.