
Trustworthy AI for medical decisions: Adversarially robust and fair machine learning prediction for Parkinson’s disease

Abstract

Parkinson’s disease (PD) is a neurodegenerative disorder characterized by motor and non-motor symptoms, including tremor, rigidity, and postural instability. Machine learning (ML) models have shown promise for the diagnosis of PD; however, many existing approaches do not explicitly address fairness and robustness. As a result, these models can produce biased outcomes across demographic groups and remain vulnerable to adversarial attacks. In this study, we used the Parkinson’s Progression Markers Initiative (PPMI) cohort, which includes clinical and demographic information from 1,084 participants spanning diverse age, sex, and racial groups. Our study addresses the key challenge of developing robust and equitable ML models to predict the progression of PD. We evaluated the performance of two fairness-optimized classifiers, namely Random Forest (RF) and Decision Tree (DT). To evaluate model vulnerability, we applied adversarial techniques, specifically label-leakage and data-poisoning attacks, which simulate intentional or erroneous data alterations that can amplify biases and degrade accuracy. These adversarial manipulations substantially degraded model performance: DT accuracy declined by more than 10% between sensitive groups, and RF accuracy decreased by 20%. Moreover, under attack, fairness metrics declined as well, including the Statistical Parity Difference (SPD), which measures differences in the probability of receiving a positive prediction across demographic groups, and the Equal Opportunity Difference (EOD), which measures differences in true positive rates between groups. This pattern suggests that adversarial perturbations increased bias and widened performance disparities across demographic groups. Our results demonstrated that adversarial attacks increased the incidence of false positives and false negatives, thereby lowering the accuracy and fairness of the PD diagnostic predictions.
These findings underscore the urgent need for robust and fairness-aware defenses in medical AI to mitigate racial, age, and gender disparities and ensure a reliable clinical decision-making process.

1 Introduction

Neurodegenerative disorders such as Parkinson’s disease (PD) develop over a person’s lifetime and most often affect middle-aged to older individuals. Common symptoms include tremors [1], bradykinesia (slowed movement) [2], rigidity, postural instability, and gait difficulties [3], along with loss of automatic movements and problems with speech and handwriting [4]. Other symptoms include hallucinations, depression, insomnia, and low blood pressure [5]. There is no cure for PD; surgical and pharmacological options can manage symptoms, but, like any drugs, these treatments produce side effects that are undesirable in daily life [6]. Health care has a long history of discrimination and unfair treatment, and the growing use of Artificial Intelligence (AI) risks amplifying that unfairness [7].

Modern AI-driven systems can identify statistically significant patterns in data that humans cannot discover [8], enabling more accurate data analysis and modeling. Although these systems are prone to errors, their modeled outcomes often outperform established approaches, posing a real challenge to data and domain experts [9]. An emerging problem with data-intensive methods is the limited ability to discern biases in ML predictions, which arise across many applications: medical diagnosis [10], college admissions [11], loan distribution, recidivism prediction, recruitment, online advertising, facial recognition, language translation, recommendation engines, fraud detection, credit decisions, pricing, and fake news detection [12]. At the same time, research has not yet examined how race, age, and gender [13] relate to the effects of PD on cognitive capabilities [14].

Men exhibit more stiffness and more sleep disturbances during rapid eye movement compared to women, while women suffer more from dyskinesia and depression [15]. Non-motor symptoms and health-related quality of life differ by sex among PD patients [16]. From the onset of PD, women’s health-related quality of life is most adversely affected by fatigue and depression. There is a theoretical explanation for the variations in quality of life experienced by people with PD [17]. To our knowledge, no research has examined the potential bias in identifying vulnerable groups of patients using PD-based ML, particularly those designated by age, gender, or ethnicity, who may be disproportionately targeted by such prejudice [18]. In the context of PD, where the accuracy of diagnosis significantly impacts patient outcomes, this is especially crucial. The resilience of AI models, as emphasized in [19], guarantees uniform performance even when dealing with varied and volatile clinical datasets. The present study highlights the simultaneous importance of equity and resilience in the development of reliable AI systems for healthcare. By integrating robustness into our fairness-aware methodology, we aim to contribute to AI systems that are not only impartial but also resilient and dependable in operational medical environments. The ML process is characterized by two main sources of inadvertent bias [20,21]. Models can learn from incorrect or biased data, in which case they capture spurious relations rather than real-world patterns and produce results of varying reliability [22]. Even when the data accurately reflect current patterns, they may still contain biases, making AI susceptible to legal challenges for unfairness [23].

Our study thus investigates the impact of data bias on both diagnostics and treatment regimes, using ML models to ensure consistent performance across a wide range of clinical scenarios, and provides solutions to address these biases. The primary focus of this research is the development and testing of a robust and equitable ML system for identifying PD. In contrast to previous work, we thoroughly test our model’s performance in hostile scenarios, such as label-leak and poisoning attacks. We also use preprocessing and fairness measures to account for biases related to age, gender, and race. This two-pronged perspective of robustness and fairness is a novel and beneficial step toward applying AI to trustworthy medical decisions.

  1. The primary contribution of this paper is an early study that jointly examines fairness-aware preprocessing and adversarial robustness in ML models for Parkinson’s disease, providing a framework that reduces demographic bias while testing vulnerability to data-poisoning and label-leak attacks.

The remainder of the paper is organized as follows. Sect 2 reviews related work on fairness and robustness in medical AI. Sect 3 describes the dataset, evaluation metrics, and the methodological framework. Sect 4 presents the proposed models and adversarial settings. Sect 5 reports the experimental results. Sect 6 discusses the research findings, and Sect 7 presents the limitations of the proposed research. Finally, Sect 8 reports the conclusion and future work.

2 Related work

This section provides a comprehensive overview of the literature on fairness, examined from multiple perspectives. First, we discuss biases and the critical importance of trustworthiness in the medical field. We then shift focus to research on fairness, and discuss recent developments in clinical decision support systems (CDSS) and fairness in the context of ML and deep learning (DL) [24]. According to research on adversarial resilience in ML models, adversarial attacks can worsen or disclose biases in medical data [25]. This section also reviews strategies for protecting ML models from attacks while ensuring fairness.

2.1 Profound bias patterns

Before explicitly discussing what fairness is, we can recognize common biases that lead to unfair conduct in commercial ML systems; a few examples are discussed as follows. Training data often inherently reflect existing human biases. In situations where bail and parole decisions are involved, predictions might help tackle problems like recidivism [10]: the goal is to predict whether an ex-offender will commit a new crime within a given time frame. However, such systems often use arrest records instead of convictions, and arrests involving drug offenses are disproportionately associated with minority groups, which are often the targets of increased enforcement [26]. ML algorithms are designed to fit the data, automatically reproducing the bias already present in it, which leads to the propagation rather than the elimination of existing prejudices [27]. Moreover, standard training minimizes the average error, which benefits the majority group, as their data are more consistent and predictive of the outcome. This leads to disproportionate errors affecting minority groups within diverse populations, as limited access to resources can compromise the reliability of their data [28]. Addressing these disparities through further research is both urgent and essential.

The inputs to a prediction algorithm are influenced by its own past behavior in critical areas such as medication trials. In a wide range of professions, including law enforcement and healthcare, decisions must often be made on the basis of insufficient information [29,30]. According to learning theory, exploratory behavior, which involves performing actions that may not always be optimal in order to acquire additional data, is the optimal approach to maximizing information acquisition in such circumstances. Two concerns follow. First, we must determine whether the costs of exploratory activity disproportionately harm any population. Second, collecting inadequate measurements for an individual patient can be unethical in medical studies, limiting learning and perpetuating unfairness.

2.2 Trustworthy AI in medicine

One of the most significant barriers to the widespread adoption of AI is the public’s skepticism toward ML. To earn trust, data scientists must go beyond traditional performance metrics such as prediction accuracy and prioritize ethical considerations. Minimizing bias and promoting fairness while maintaining robustness and transparency are essential to foster a future in which ML algorithms serve as reliable collaborators for humans [31,32]. Building confidence in AI systems requires attention to critical factors throughout the entire lifecycle of a predictive model, including data collection, preprocessing, feature engineering, model training, and optimization. Equally important are the stages of testing, deployment, and ongoing monitoring. Together, these interconnected steps form what is known as the “chain of trust” [33,34].

2.3 Fairness in ML-based approaches

The term “fairness” encompasses a wide range of scenarios, from simple binary decisions to complex real-time policy control. It is a complex and context-dependent concept dictated by local societal norms, and ensuring fairness within an organizational context is challenging. Notably, fairness can be mathematically expressed in 21 ways [35]. Conflicting fairness measures can highlight different aspects of fairness, but they also have implications for test results: recent work has shown that simultaneously satisfying multiple reasonable fairness axioms is impossible. A decision tree developed by the University of Chicago for determining whether an organization can achieve fairness could be useful here [36].

2.4 ML and DL approaches for clinical decision support system

Medical AI research has increasingly focused on applications in real-time clinical settings. In this context, researchers have developed accurate clinical decision support systems for PD (PD-CDSS) by leveraging ML and DL techniques, as illustrated in Fig 1. However, such models often underperform in real-world clinical practice compared to initial expectations. Recent efforts have therefore emphasized the development of high-quality medical CDSS models that require less patient data while integrating a wider range of ML algorithms. Although a COVID-19 diagnostic technique was proposed by [37] as a consistent approach, it diverged from established methodological criteria and definitions. Alongside renewed interest in neural network security, researchers have also developed interpretable DL models that address domain-specific features and associated security challenges. Nevertheless, given current technological limitations, no definitive solution exists for comprehensive risk assessment in AI-driven clinical applications [38]. Several critical questions remain regarding how end users perceive and adopt these systems, as well as the perceived limitations in their design and deployment. Despite substantial investment in this area, such challenges continue to hinder widespread clinical adoption. The effective integration of AI-enhanced clinical decision support systems (AICDSS) into healthcare therefore requires rigorous empirical investigation [39]. Furthermore, fostering sustained trust is a fundamental requirement, grounded in eight core principles that clinical AICDSS should satisfy: justness, accountability, responsibility, robustness, transparency, replicability, security, and privacy. Empirical studies further indicate that end users report greater confidence in automated systems when key attributes such as explainability, security, privacy, and reliability are present, all of which are considered critical for practical deployment.
In addition, ongoing research on fairness in ML extends to domains such as ranking algorithms, recommendation systems, and contextual bandit learning.

Fig 1. The figure illustrates the various steps of a ML pipeline and highlights considerations related to model fairness.

https://doi.org/10.1371/journal.pone.0342062.g001

2.5 Comparison of machine learning approaches

PD is a progressive neurodegenerative disorder for which early and accurate diagnosis remains a clinical challenge, as subtle prodromal signs often elude standard assessments. In response, the researchers summarized in Table 1 have turned to ML and DL techniques leveraging neuroimaging, clinical, and biomarker data to improve diagnostic accuracy and forecast disease progression. Initial efforts employed traditional classifiers such as support vector machines (SVM), random forests (RF), and XGBoost, combined with feature selection, to achieve accuracies of 80–83% [41,43]. Zhang et al. [40] advanced this work by integrating convolutional neural networks (CNNs) with long short-term memory (LSTM) models on the PPMI dataset, reporting an accuracy of 85.3% and an F1-score of 84.5%. A CNN-SVM framework proposed by Kim et al., evaluated on both external cohorts and PPMI data, further improved performance to 87% accuracy and an AUC of 0.89 [36]. More recently, reinforcement learning approaches have been explored to adaptively refine model parameters [33], while ensemble methods have enhanced sensitivity and stability [35]. Fairness-sensitive architectures and adversarial training, exemplified by the random forest strategy framework of Malik et al., have also been proposed to enhance robustness [42]. Despite these promising results, most studies focus narrowly on expected performance metrics. They rarely assess equity across demographic groups or examine model behavior under adversarial conditions. This gap erodes trust and risks perpetuating biases when ML systems enter diverse clinical environments. To bridge this gap, we propose a unified framework that integrates rigorous robustness evaluations, including label-leak and data-poisoning attacks, with fairness-aware training protocols.
By simultaneously measuring predictive accuracy and demographic parity, our approach constructs a comprehensive “chain of trust” for ML-based PD diagnosis, aligning technical validation with clinical and ethical standards.

Table 1. Comparison of recent ML approaches for PD detection/progression (2021–2025), including our work.

https://doi.org/10.1371/journal.pone.0342062.t001

3 Materials and methods

This section first examines the dual challenges of fairness and robustness in medical AI, with a particular focus on PD applications. The robustness of the model is then explored from two complementary perspectives. The first concerns adversarial robustness, which considers threat models such as label leakage and data poisoning. The second addresses data bias by analyzing demographic skews in age, gender, and race within the PPMI cohort. Finally, this section introduces the evaluation framework, which includes bias metrics, such as Statistical Parity Difference and Equal Opportunity Difference, as well as group fairness principles, and is used to assess model equity across sensitive attributes.

3.1 Fairness and robustness in medical AI

Although there has been promise in using ML to diagnose PD, biases are introduced by issues such as limited medical datasets, inadequate validation, and a lack of reliable clinical evaluations. Consequently, an AI algorithm that uses PD symptoms (or risk factors) as input should retain accuracy and dependability in order to reduce AI bias. Biases relating to race and ethnicity have been found in a comprehensive systematic review [44], which also highlighted prejudices related to minority status and language. The discussion focused on issues related to the assessment of stability and dependability to reduce biases, the summarization of validation techniques, and the analysis of large datasets. In this study, we employ an optimized preprocessing technique to mitigate bias in the PD dataset prior to model training. Preprocessing methods aim to correct data imbalances and reduce the disparate representation of sensitive attributes such as age, gender, and race. This approach was selected for its effectiveness in improving fairness while maintaining model performance, as demonstrated in recent literature. We focus specifically on this method, rather than in-processing or post-processing techniques, to provide a clear and targeted fairness intervention aligned with our experimental goals [45,46].
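To illustrate the family of preprocessing interventions discussed here, the sketch below implements simple instance reweighing (a hypothetical minimal example in the spirit of Kamiran-and-Calders-style reweighing, not the exact optimized-preprocessing routine applied in this study). Each (group, label) cell receives the weight P(group)·P(label)/P(group, label), which makes the sensitive attribute statistically independent of the label in the weighted data.

```python
import numpy as np

def reweighing_weights(a, y):
    """Instance weights that decorrelate a binary sensitive attribute `a`
    from a binary label `y` (reweighing-style preprocessing)."""
    a, y = np.asarray(a), np.asarray(y)
    n = len(y)
    w = np.empty(n, dtype=float)
    for g in np.unique(a):
        for c in np.unique(y):
            mask = (a == g) & (y == c)
            p_joint = mask.sum() / n                       # observed P(group, label)
            p_expected = (a == g).mean() * (y == c).mean() # P(group) * P(label)
            if p_joint > 0:
                w[mask] = p_expected / p_joint
    return w

# Toy data: group 1 is overrepresented among positive labels.
a = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y = np.array([0, 0, 1, 1, 1, 1, 1, 0])
w = reweighing_weights(a, y)
```

After reweighing, the weighted positive rate is identical across the two groups, so a classifier trained with these sample weights no longer sees the sensitive attribute as predictive of the label.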

3.2 Model adversarial robustness

The resilience of ML models, particularly DL architectures, against adversarial attacks has emerged as a critical area of research, especially in the context of medical imaging applications. A study employing the Fast Gradient Sign Method (FGSM) on a Vision Transformer model demonstrated that minor image perturbations can lead to substantial degradation in classification performance, with accuracy decreasing from 90.1% to 27.38% [47]. However, the same study showed that adversarial training can significantly enhance the robustness of the model, increasing the precision of the classification to 96.61%. Recent studies [35,48] have further explored quantum adversarial machine learning (QAML), which integrates quantum computing principles with ML to strengthen model defenses against adversarial attacks. QAML exploits quantum properties such as superposition and entanglement to potentially improve robustness against input perturbations. Early findings [49] suggest that quantum-enhanced defense mechanisms may produce improvements in adversarial resilience. Additional investigations [50] have examined the relationship between adversarial robustness and model complexity in medical image classification tasks, with particular emphasis on vulnerability to DeepFool attacks and Jacobian-based Saliency Map Attacks (JSMA). These attack strategies deliberately manipulate input data to induce model misclassification [51,52]. While the DeepFool method iteratively perturbs input images until misclassification occurs, JSMA targets the most influential input features to mislead the model. Developing models that are resistant to such adversarial manipulations is especially critical in high-stakes domains such as medical diagnosis. Robust AI systems hold significant potential for medical imaging by enabling reliable analysis of radiological data under diverse and potentially adversarial conditions. 
In this context, AI resilience can be viewed as a multidimensional concept encompassing several forms of robustness. One important dimension is data resilience, which refers to a model’s ability to maintain stable performance when confronted with incomplete, imbalanced, or heterogeneous datasets [53].
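The data-poisoning threat model can be made concrete with a toy sketch. The example below uses hypothetical two-dimensional data and a deliberately simple nearest-centroid classifier, with an unrealistically high poison rate for visibility; it illustrates the attack family, not the experimental setup of this study.

```python
import numpy as np

def centroids(X, y):
    """Per-class mean vectors for a nearest-centroid classifier."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(X, cents):
    labels = sorted(cents)
    d = np.stack([np.linalg.norm(X - cents[c], axis=1) for c in labels], axis=1)
    return np.array(labels)[d.argmin(axis=1)]

# Toy training data: class 0 near the origin, class 1 near (3.5, 3.5).
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                    [3, 3], [3, 4], [4, 3], [4, 4]], dtype=float)
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Label-flipping poisoning: an attacker relabels three class-0 points as class 1,
# dragging the class-1 centroid toward the class-0 region.
y_poisoned = y_train.copy()
y_poisoned[[0, 1, 2]] = 1

X_test = np.array([[0.2, 0.3], [1.5, 1.4], [1.8, 1.8],
                   [1.9, 2.0], [2.6, 2.7], [3.8, 3.6]], dtype=float)
y_test = np.array([0, 0, 0, 0, 1, 1])

acc_clean = (predict(X_test, centroids(X_train, y_train)) == y_test).mean()
acc_pois = (predict(X_test, centroids(X_train, y_poisoned)) == y_test).mean()
```

On this toy data the clean model classifies all test points correctly, while the poisoned model misclassifies the boundary-region points, mirroring the accuracy degradation reported for the attacked models in this study.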

3.2.1 Model robustness.

Model robustness is the ability to maintain high performance despite variations or perturbations in input data, such as noise, adversarial attacks, or changes in data distribution [54]. A robust model is less sensitive to small, unpredictable changes in input and generalizes well to unseen data. Achieving robustness often involves regularization techniques, data augmentation, and adversarial training, which enhance the model's resilience to internal and external variations, ensuring reliable predictions in diverse scenarios [55].

3.2.2 Adversarial robustness.

The ability of a model to continue producing accurate predictions in the face of intentionally altered inputs is known as adversarial robustness [56]. Such robustness is essential in clinical AI systems because even small input perturbations can result in potentially fatal diagnostic errors [57]. In our dataset, the unequal distribution of sensitive features between the control and PD groups introduces bias: age and race are noticeably skewed, with older people and specific racial groups (such as white patients) overrepresented in comparison to younger or minority populations, and subtle disparities remain even though the gender proportion is more equitable. Adversarial training, which explicitly includes adversarial examples during training, has been acknowledged as one of the best methods for improving generalization and defense against such disturbances. For example, Goodfellow et al. [59] introduced the Fast Gradient Sign Method (FGSM) to create adversarial samples, and Madry et al. [58] showed that adversarial training greatly increases model resilience under these attacks. By maintaining data diversity and privacy, federated learning (FL), which decentralizes model training across devices or institutions without sharing raw data, naturally promotes robustness [59,60]. Recent studies have demonstrated that incorporating adversarial defense mechanisms within FL can improve the robustness and effectiveness of clinical models in addition to their privacy benefits [61]. Resilient FL frameworks in particular have demonstrated enhanced performance on heterogeneous medical datasets and have reduced the risk of centralized data poisoning [62,63].
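As a minimal illustration of FGSM, the sketch below applies it to a hand-specified logistic-regression model (the weights and input are illustrative, not values from this study): the adversarial input adds ε times the sign of the loss gradient with respect to the input, which for logistic regression with cross-entropy loss is (p − y)·w.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """FGSM perturbation for a logistic-regression model p = sigmoid(w.x + b).
    The gradient of the cross-entropy loss w.r.t. the input x is (p - y) * w."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

# Illustrative model and input: x is confidently classified as positive.
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])   # w @ x + b = 1.5, so the clean prediction is positive
y = 1.0

x_adv = fgsm(x, y, w, b, eps=1.0)
```

With this ε the perturbed input crosses the decision boundary: the clean input is predicted positive while the adversarial input is predicted negative, the same flip-the-label effect that degrades the imaging classifiers discussed above.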

In addition, auditing and maintaining robustness in healthcare AI depend on Explainable AI (XAI). By showing when a model’s predictions rely too heavily on spurious features, techniques like SHAP and LIME can be used to identify vulnerabilities and provide insights into model decision pathways [64]. As highlighted in recent guidelines for reliable medical AI, XAI also facilitates robust model validation by enabling clinicians to confirm whether model behavior is consistent with clinical reasoning [65]. It is becoming more widely accepted that establishing standardized frameworks for robustness evaluation is an essential first step in implementing AI in healthcare. More reliable and equitable AI systems are ensured by benchmarking models against known attack vectors and across fairness-sensitive attributes including gender or race [66].

3.2.3 Data bias.

In ML, challenges introduced during model development and training can contribute to the emergence of bias. The assumptions, design choices, and implicit cognitive biases of the developers may be reflected in the models they construct. This issue is further exacerbated by the use of inaccurate, incomplete, or biased datasets for training and evaluation [67]. In particular, demographic attributes such as gender, age, and race are often unevenly distributed across classes, leading to systematic biases in the data. Empirical observations indicate that age and race exhibit greater skewness compared to gender-related features. For instance, in both the non-PD and PD cohorts, the majority of patients are 60 years or older, highlighting a pronounced age imbalance in the dataset. It is crucial to consider differences in clinical outcomes and data representation when analyzing gender-, age-, and race-related bias in PD prediction models. For example, if the dataset is not sufficiently balanced, model learning can be biased, because women are more likely to experience sadness and anxiety, while men with PD are more likely to report stiffness and sleep disturbances [68]. Furthermore, due to age-related multi-morbidities and atypical symptomatology, older people are often underrepresented in AI datasets, which can lead to biased predictions [69]. Racial bias is particularly concerning, as Black and Hispanic patients are underrepresented in study cohorts and often have delayed PD diagnoses, which reduces the generalizability of the model [70,71]. These variations highlight the need for stratified sampling, intersectional equity evaluations, and ongoing model assessments to prevent the perpetuation of systemic bias.
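A first step toward auditing such demographic skews is simply measuring subgroup representation. The sketch below does this over a handful of hypothetical demographic records (the records and category names are illustrative, not PPMI data):

```python
from collections import Counter

# Hypothetical demographic records: (age_group, race, label)
records = [
    ("60+", "White", 1), ("60+", "White", 1), ("60+", "White", 0),
    ("60+", "Non-White", 1), ("<60", "White", 1), ("<60", "White", 0),
    ("60+", "White", 1), ("<60", "Non-White", 0),
]

def representation(records, attr_index):
    """Fraction of the cohort falling into each category of one attribute."""
    counts = Counter(r[attr_index] for r in records)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

age_rep = representation(records, 0)   # e.g. share of patients aged 60+
race_rep = representation(records, 1)  # e.g. share of White vs. Non-White
```

Comparing these fractions against population baselines (or across the PD and control arms) flags the kind of age and race imbalance described above before any model is trained.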

3.3 Evaluation metrics

Employing a set of indicators, commonly referred to as fairness metrics, enables the identification of bias both in datasets during preprocessing and in models during training, where either may exhibit unintentional or deliberate favoritism toward certain groups. These indicators, summarized in Table 2, facilitate the detection of bias at the data or model level and support subsequent mitigation efforts. From a bias-identification perspective, this approach highlights instances of unfair treatment of one group relative to another [72]. Bias mitigation refers to a set of techniques aimed at reducing bias in datasets or models used in empirical research. In this context, adversarial evaluation assesses the robustness of a model to perturbed inputs designed to reveal or amplify underlying biases, thereby enhancing the evaluation of model fairness. Such methods can expose latent model vulnerabilities that remain hidden under standard evaluation conditions. In general, adversarial evaluation provides a practical framework for assessing the extent to which a model is susceptible to adversarial manipulations that may exacerbate its inherent biases [73].

Table 2. This table shows the error rates and predictive values used in our analysis.

These metrics play a crucial role in evaluating performance.

https://doi.org/10.1371/journal.pone.0342062.t002

3.3.1 Bias metrics.

The protected groups, which we denote as hi and hj, are considered in the analysis. The first category of metrics, known as parity-based metrics, focuses on comparing the expected positive rates, such as P(Ŷ = 1 | hi), across different groups.

1. Statistical parity difference: A distance measure used to determine whether findings are comparable between groups and unrelated to the protected attribute [74]. It is computed as follows:

SPD = P(Ŷ = 1 | hi) − P(Ŷ = 1 | hj)

  • P(Ŷ = 1 | hi): the probability that the predicted outcome equals 1, given group hi.
  • P(Ŷ = 1 | hj): the probability that the predicted outcome equals 1, given group hj.

2. Disparate impact: Based on confusion-matrix measurements, this class of measures extends beyond the positive rate [60] to include the true positive rate (TPR) and true negative rate (TNR). It specifically contrasts the odds of a favorable outcome for privileged and unprivileged groups, expressed as the ratio P(Ŷ = 1 | unprivileged) / P(Ŷ = 1 | privileged).

3. Equal opportunity difference: This criterion ensures that the false negative rate (FNR), defined as the probability that an instance belonging to the positive class is incorrectly classified as negative, is equal across both protected and unprotected groups. When this condition is satisfied, the classifier is considered to satisfy the fairness requirement [75].

4. Average absolute odds difference: The AAOD metric quantifies bias using the false positive rate and true positive rate; a value of 0 indicates fairness. This measure reflects the disparity in performance between groups, where fairness implies equal treatment of both groups.

Where A denotes the sensitive attribute, each group is expected to receive an equal share of favorable outcomes. This approach, however, struggles when individuals belong to multiple protected categories: prioritizing fairness for one group may jeopardize fairness for another.
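The parity-based metrics above can be computed directly from predictions and group membership. The sketch below assumes binary labels and a binary sensitive attribute (the function and variable names are illustrative, and EOD is expressed as the true-positive-rate difference, equivalent to equalizing false negative rates):

```python
import numpy as np

def rate(mask_num, mask_den):
    """Conditional rate: |num| / |den|, guarding against empty groups."""
    return mask_num.sum() / max(mask_den.sum(), 1)

def fairness_metrics(y_true, y_pred, a, priv=1):
    """SPD, DI, EOD, and AAOD for two groups encoded in `a`
    (priv marks the privileged group value)."""
    y_true, y_pred, a = map(np.asarray, (y_true, y_pred, a))
    g_u, g_p = (a != priv), (a == priv)
    pr_u = rate((y_pred == 1) & g_u, g_u)   # positive rate, unprivileged
    pr_p = rate((y_pred == 1) & g_p, g_p)   # positive rate, privileged
    tpr = lambda g: rate((y_pred == 1) & (y_true == 1) & g, (y_true == 1) & g)
    fpr = lambda g: rate((y_pred == 1) & (y_true == 0) & g, (y_true == 0) & g)
    return {
        "SPD": pr_u - pr_p,
        "DI": pr_u / pr_p if pr_p > 0 else float("inf"),
        "EOD": tpr(g_u) - tpr(g_p),
        "AAOD": 0.5 * (abs(fpr(g_u) - fpr(g_p)) + abs(tpr(g_u) - tpr(g_p))),
    }

# Toy example: group a = 1 is the privileged group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
a      = [0, 0, 0, 0, 1, 1, 1, 1]
m = fairness_metrics(y_true, y_pred, a, priv=1)
```

Values of 0 for SPD, EOD, and AAOD (and 1 for DI) would indicate parity; the toy predictions above deliberately favor the privileged group.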

3.3.2 Group fairness.

Group fairness is the principle that individuals from different demographic groups, such as gender, race, or age, should, on average, receive comparable model outcomes. Parity-based measures operationalize this concept by quantifying differences in prediction results between protected groups. As shown in Table 2, common metrics include statistical parity difference (SPD), equal opportunity difference (EOD), and average absolute odds difference (AAOD) [76]. SPD measures disparities in the probability of favorable outcomes (for example, a correct PD diagnosis) across subgroups, while EOD assesses differences in true positive rates to ensure equal access to beneficial interventions. Although confusion-matrix metrics such as true positive rate (TPR), false positive rate (FPR), and positive predictive value (PPV) underpin these measures, they are only applied within fairness formulations when broken down by group membership. For instance, AAOD captures fairness violations by averaging the absolute differences in group-specific TPR and FPR. Accordingly, this study reports both individual performance metrics and aggregated group-fairness measures to present a comprehensive view of model fairness [77].

4 Proposed model

Fig 2 illustrates the proposed framework for the detection of PD, which adopts a fairness-aware design to systematically address algorithmic bias. In this study, the Parkinson’s Progression Markers Initiative (PPMI) database is used to extract baseline and multimodal data, including clinical, imaging, and biospecimen records. The preprocessing pipeline comprises feature selection, using techniques such as recursive feature elimination to identify informative predictors, and data normalization, employing z-score standardization to ensure consistent feature scaling. Model optimization is performed through hyperparameter tuning guided by cross-validation, with the objective of improving predictive performance and generalizability. To account for potential biases, an initial discriminatory data unit testing phase is performed to detect disparate impacts across demographic subgroups, using fairness metrics such as demographic parity and equal opportunity. Subsequently, bias mitigation strategies are applied through reweighting techniques and group-specific adjustments informed by fairness-aware criteria. Finally, the performance of classifiers, including random forests and gradient boosting models, is evaluated by comparing the results obtained from the original and bias-adjusted datasets. Fig 3 presents the experimental workflow adopted in this study.

Fig 3. The experimental route map of the proposed framework.

https://doi.org/10.1371/journal.pone.0342062.g003

4.1 Data preparation

We performed a series of preprocessing steps on the training data prior to training and evaluating the proposed model. The initial and most critical step involved identifying the most informative modalities for PD detection. Subsequently, the statistical significance of the selected modalities at baseline was assessed, with each undergoing multiple preprocessing procedures to enhance data quality. Their discriminative capacity for evaluating PD severity was also examined. Details regarding the dataset and the selected features are provided in [61].

4.1.1 Data collection.

The Parkinson’s Progression Markers Initiative (PPMI) dataset is a publicly available, longitudinal, multicenter cohort study designed to identify biomarkers of PD progression. The PPMI database [62] tracks PD cases globally, includes data from several nations, and provides comprehensive patient clinical data. The dataset contains each patient’s medical records, laboratory results, motor and non-motor test scores, bio-samples, and personal information [67]. The study divides the dataset into several data groups with distinct characteristics, covering the fundamental modalities, including motor and non-motor assessments, as well as subject characteristics and medical history. The De Novo PD cohort consists of individuals diagnosed within two years of enrollment who have not commenced dopaminergic treatment. Healthy Controls (HC) are meticulously matched to PD participants based on age and sex and must have no history of neurological dysfunction, no first-degree relatives with PD, and a Montreal Cognitive Assessment (MoCA) score of no less than 26. The prodromal cohort comprises individuals aged 60 years or older who display high-risk indicators for Parkinson’s disease, such as REM sleep behavior disorder (RBD), yet do not fulfill clinical diagnostic criteria. Finally, the genetic cohort consists of both symptomatic and asymptomatic carriers of pathogenic mutations in the LRRK2, GBA, and SNCA genes. To maintain cohort integrity, stringent exclusion criteria are enforced: PD participants must not exhibit signs of atypical parkinsonism, and controls must undergo comprehensive cognitive and motor assessments. The PPMI provides the PD dataset as a time series reflecting patient visits, as summarized in Table 3. Patients follow a 12-month interval schedule across six regular visits: Baseline (BL), visit 1 at month 12, visit 2 at month 24, visit 3 at month 36, visit 4 at month 48, and visit 5 at month 60, as shown in Fig 4.
Patients are categorized using the H&Y scale via the variable NHY. The baseline dataset comprises 1,059 patients in total. We apply standard data preparation techniques: mean imputation to fill in missing values, min-max scaling for normalization, and recursive feature elimination for feature selection. Gender and racial attributes were encoded as 0/1 numerical variables, and the target variable is binary, where 1 denotes Parkinson’s patients (PAT, n = 648) and 0 denotes healthy controls (HC, n = 434). Concerning sensitive factors, age and race were categorized to enhance fairness assessment across subgroups while reducing privacy concerns. In our analysis, we categorized age into two bins to align with clinical stratifications commonly used in PD research [78,79] and to ensure sufficient sample size per subgroup for meaningful fairness evaluation. Similarly, race was binarized into “White” and “Non-White” to preserve statistical power and participant anonymity, given the small counts in minority subgroups. Gender was retained as a binary variable (male/female) per PPMI coding. Importantly, these sensitive attributes (age, gender, and race) were excluded from the model’s training features to prevent direct bias amplification [78], but were retained for post-hoc fairness auditing across demographic subgroups. This design enables us to assess whether the model’s predictions exhibit disparities despite not using protected attributes as inputs, thus revealing latent biases embedded in clinical and biomarker data.
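As an illustration of this encoding scheme, the following sketch binarizes the sensitive attributes and separates them from the training features. The column names, values, and the age cutoff of 60 are hypothetical, not the actual PPMI schema:

```python
import pandas as pd

# Hypothetical stand-in for the PPMI baseline table (column names are assumptions).
df = pd.DataFrame({
    "age": [55, 72, 63, 48],
    "gender": ["male", "female", "male", "female"],
    "race": ["White", "Non-White", "White", "White"],
    "updrs_total": [32.0, 18.0, 41.0, 5.0],
    "label": [1, 0, 1, 0],          # 1 = PD patient (PAT), 0 = healthy control (HC)
})

# Binarize sensitive attributes, kept only for post-hoc fairness auditing.
sensitive = pd.DataFrame({
    "gender": (df["gender"] == "male").astype(int),
    "race": (df["race"] == "White").astype(int),          # White vs Non-White
    "age_group": (df["age"] >= 60).astype(int),           # assumed clinical cutoff
})

# Exclude sensitive attributes from the model's training features.
X = df.drop(columns=["age", "gender", "race", "label"])
y = df["label"]
```

The key design point mirrored here is that `sensitive` never enters `X`; it is only joined back to predictions later for subgroup auditing.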

thumbnail
Table 3. Overview of the PPMI database for Parkinson’s disease, including patient demographic information, laboratory results, clinical records, and motor and non-motor assessment outcomes.

https://doi.org/10.1371/journal.pone.0342062.t003

thumbnail
Fig 4. Differences between PD patients and non-PD patients in terms of (a) age, (b) gender, and (c) race.

https://doi.org/10.1371/journal.pone.0342062.g004

4.1.2 Data cleaning.

Data cleaning methods are applied to make the dataset consistent with other related datasets in the pipeline. Error correction involves formatting the data, removing duplicates, and converting numbers stored as text back into numeric types [80].

4.1.3 Missing data.

A feature is removed if more than 30% of its values are missing. A variety of methods, such as forward and backward filling, have been used to impute the remaining gaps; in our case, we used the median to fill continuous (numerical) features and the mode to fill categorical features.
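The rule above can be sketched as follows; the 30% threshold matches the text, while the column names and values are illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame: "tremor_score" is 80% missing and should be dropped.
df = pd.DataFrame({
    "moca": [28.0, np.nan, 26.0, 30.0, 27.0],
    "tremor_score": [np.nan, np.nan, np.nan, np.nan, 2.0],
    "site": ["A", "B", np.nan, "B", "B"],
})

# Drop features with more than 30% missing values.
df = df.loc[:, df.isna().mean() <= 0.30]

# Median imputation for numerical columns, mode for categorical ones.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
```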

4.1.4 Data normalization.

Data normalization is an essential step in the data preprocessing pipeline. Normalization, also known as standardization or feature scaling, is the process of rescaling features onto a common, dimensionless range [67,72]. Numerous studies recommend the min-max normalization approach to make the dataset uniform across the range [0,1] [81]. This involves rescaling the data so that the feature values stay within this range.

Min-Max normalization is given by the following formula:

y′ = c + ((y − y_min)(d − c)) / (y_max − y_min)

where y_max and y_min denote the maximum and minimum values, respectively. This normalization technique scales the values within the interval [c,d], transforming the original value y into the normalized value y′.
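A minimal implementation of the min-max formula described above, scaling into a general interval [c, d] with [0, 1] as the default:

```python
import numpy as np

def min_max_scale(y, c=0.0, d=1.0):
    """Scale the values of y into [c, d]: y' = c + (y - y_min)(d - c)/(y_max - y_min)."""
    y = np.asarray(y, dtype=float)
    return c + (y - y.min()) * (d - c) / (y.max() - y.min())

scaled = min_max_scale([10, 20, 30, 40])
```

With `c=0, d=1` this reproduces the standard [0,1] rescaling used in the study; other intervals only shift and stretch the result.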

4.1.5 Data splitting.

The dataset was partitioned into training and testing subsets, with 80% used for model training and 20% retained for evaluation. The dataset exhibits a moderate degree of class imbalance, consisting of 648 PD cases and 434 healthy controls (HC), corresponding to an approximate PD-to-HC ratio of 3:2. This imbalance may bias model predictions toward the majority class, thereby inflating overall accuracy while masking suboptimal performance on the minority class. To mitigate this effect, five-fold stratified cross-validation was employed to preserve class distributions across training and validation partitions. Model performance was assessed using balanced evaluation metrics, including accuracy, recall, F1-score, and group-specific fairness indicators (e.g., TPR and FNR), in order to ensure sensitivity to minority-class outcomes.
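The split and cross-validation scheme can be sketched with scikit-learn; the feature matrix here is synthetic, and only the 648/434 class counts follow the text:

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

rng = np.random.default_rng(0)
# Synthetic stand-in for the 648 PD / 434 HC labels (≈3:2 ratio).
y = np.array([1] * 648 + [0] * 434)
X = rng.normal(size=(len(y), 5))

# 80/20 split, stratified to preserve the class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Five-fold stratified CV on the training portion preserves class
# distributions across folds.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(X_tr, y_tr))
```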

4.1.6 Feature selection.

Feature selection constitutes a central component of most ML-based studies aimed at predictive modeling [82]. The primary objective is to identify a minimal subset of input features that preserves high predictive performance while maintaining clinical relevance. Reducing the dimensionality of the feature space mitigates overfitting and eliminates redundant or irrelevant variables, allowing the classifier to focus on the most informative predictors [36]. In this study, sensitive attributes, including age, gender, and race, were removed from the training feature set to prevent bias amplification, as these variables are used more appropriately for fairness evaluation than for prediction. Incorporating such attributes can introduce discriminatory effects or confounding relationships that compromise model equity [83]. Consequently, the feature selection strategy was determined based on extensive empirical evaluation. Recursive Boosting and Stability and Uncorrelated Local Optima-based Variable Selection (SULOV) were employed to identify the most relevant features for model construction. By leveraging iterative XGBoost-based importance ranking in conjunction with SULOV, the proposed approach facilitates the selection of a compact and informative feature subset that enhances both predictive accuracy and model interpretability.

For instance, the dataset used in this study is represented as D = {(x_n, w_n)}, n = 1, …, p, with p samples and q features. The model’s prediction is ŵ_n = Σ_{l=1}^{N} G_l(x_n), where G_l is the l-th regression tree and G_l(x_n) is the prediction score given to the n-th sample by the l-th tree. By minimizing the objective function described below, we can learn the set of functions G_l in the regression tree model:

Obj = Σ_{n=1}^{p} L(w_n, ŵ_n) + Σ_{l=1}^{N} Ω(G_l)

where Ω(G_l) is the regularizing term for the l-th regression tree, which helps to prevent overfitting, and L is the loss function, measuring the difference between the actual value w_n and the predicted value ŵ_n.

We can efficiently identify the most significant features by applying recursive XGBoost and SULOV, ensuring that the model remains accurate and interpretable.
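The selection idea can be approximated as follows. This is a simplified sketch of the SULOV principle (drop the less important member of each highly correlated feature pair, using importances from a boosted-tree model), with scikit-learn's GradientBoostingClassifier standing in for XGBoost; it is not the authors' exact implementation, and the 0.8 correlation threshold is an assumption:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def sulov_like_prune(X: pd.DataFrame, importances: pd.Series, thresh=0.8):
    """Drop the less-important feature of every highly correlated pair."""
    corr = X.corr().abs()
    drop = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > thresh:
                drop.add(a if importances[a] < importances[b] else b)
    return [c for c in cols if c not in drop]

rng = np.random.default_rng(1)
X = pd.DataFrame({"f1": rng.normal(size=200)})
X["f2"] = X["f1"] + rng.normal(scale=0.05, size=200)   # near-duplicate of f1
X["f3"] = rng.normal(size=200)
y = (X["f1"] + X["f3"] > 0).astype(int)

# Importance ranking from a boosted-tree model (stand-in for XGBoost).
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)
imp = pd.Series(gbm.feature_importances_, index=X.columns)
kept = sulov_like_prune(X, imp)
```

On this toy data, exactly one of the near-duplicate pair `f1`/`f2` survives, while the independent feature `f3` is retained.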

4.2 Optimization of baseline ML models

We evaluated supervised ML models to predict the progression of PD in order to construct a prognostic framework for disease advancement. These ML techniques include Random Forest (RF) [84] and Decision Tree (DT) [85]. These models are well suited to binary classification tasks and are capable of leveraging multiple combinations of modalities. We optimized model hyperparameters using stratified 5-fold cross-validation with a grid search technique, and selected the configurations yielding the lowest training loss. The trained models with optimal hyperparameter settings were subsequently used for PD detection to ensure reliable predictive performance. The initial application of these techniques focused primarily on accuracy and predictive ability, without incorporating fairness constraints. This stage served to reveal potential model biases and to establish a baseline for subsequent fairness-aware comparisons. We utilized a decision-tree-based simulation framework for analysis and evaluation. In addition, we included a baseline DL algorithm to enable direct comparison under identical data splits and preprocessing conditions. This baseline demonstrates that a standard MLP can achieve competitive predictive performance while still exhibiting fairness disparities, thus supporting the need for bias mitigation strategies irrespective of model class [86].

4.2.1 Decision tree.

Based on user-defined criteria for the target variable, the predicted outcome is inferred using a DT classifier [87]. Classification trees are hierarchical structures in which internal nodes represent decision functions and terminal nodes (leaves) correspond to distinct subsets of the target variable, referred to as classes. Decision Trees are widely adopted ML algorithms due to their conceptual simplicity, interpretability, and ease of implementation. Most DT-based algorithms employ a top-down recursive partitioning strategy, in which a feature is selected at each node that maximizes the separation of the data into homogeneous subsets. The selection of an optimal splitting criterion is guided by a range of statistical measures that quantify the purity of the resulting partitions. These measures assess the distribution of class labels within each subset of the dependent variable. By applying this process recursively to all subgroups, the average reduction in impurity provides an estimate of the quality of the split [88]. Both entropy-based information gain and Gini impurity are commonly used splitting criteria, and their mathematical formulations are given as follows: Entropy = −Σ_i p_i log₂ p_i and Gini = 1 − Σ_i p_i², where p_i denotes the proportion of samples in a node belonging to class i.
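The two splitting criteria can be computed directly from the class proportions in a node:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_i p_i^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum_i p_i * log2(p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A pure node has zero impurity; a 50/50 binary node is maximally impure.
pure = [1, 1, 1, 1]
mixed = [1, 1, 0, 0]
```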

4.2.2 Random forest.

Breiman introduced the RF algorithm as an ensemble learning approach that constructs multiple independent base learners and aggregates their predictions to improve overall performance. The fundamental principle of this method is to reduce variance and improve generalization by combining the outputs of a collection of weak learners. RF incorporates a variant of the traditional bagging (bootstrap aggregating) technique, in which each individual classifier is trained on a bootstrapped sample drawn from the original training dataset. Unlike conventional decision tree algorithms, which evaluate all available features at each split, RF selects a random subset of features at every node when determining the optimal split. This random feature selection mechanism promotes model diversity and reduces correlation among individual trees. In addition to its strong predictive performance in high-dimensional feature spaces, RF has also been shown to perform robustly in settings with relatively low-dimensional feature representations. Formally, let D denote the training dataset, and N represent the total number of trees in the ensemble. A set of classification or regression trees {G_1, …, G_N} is constructed using bootstrapped samples from D. For a given unseen input sample x*, the final prediction is obtained by averaging the outputs of all individual trees in the ensemble, which can be expressed as:

ŷ(x*) = (1/N) Σ_{n=1}^{N} G_n(x*)    (1)
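Eq. (1) can be checked against scikit-learn's implementation, where the forest's probability estimate equals the average of its trees' estimates; the data here are synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each tree trains on a bootstrap sample with a random feature subset per split.
rf = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
rf.fit(X, y)

# Eq. (1): the ensemble prediction is the average over the individual trees.
x_star = X[:1]
tree_probs = np.mean([t.predict_proba(x_star) for t in rf.estimators_], axis=0)
ensemble_probs = rf.predict_proba(x_star)
```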

4.2.3 Multi-Layer perceptron.

We included a two-hidden-layer multilayer perceptron (MLP) as a representative deep learning baseline model for binary classification. By composing affine transformations with ReLU nonlinearities, the MLP captures nonlinear feature interactions and is trained in an end-to-end manner using a binary cross-entropy loss. Regularization was applied via batch normalization, dropout, weight decay, and early stopping to improve generalization and to enable a reproducible comparison with the classical baselines [89]. However, the MLP is sensitive to data availability and hyperparameter tuning, is typically less interpretable than tree-based models, and requires feature scaling and controlled random seeding for reliable reproducibility. A general l-layer MLP forward pass can be represented as:

Hidden-layer nonlinearity. h^(i) = ReLU(W^(i) h^(i−1) + b^(i)), for i = 1, …, l − 1, with h^(0) = x.

Output mapping. ŷ = σ(W^(l) h^(l−1) + b^(l)), where σ denotes the sigmoid function.
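The forward pass above can be sketched in NumPy; the layer sizes are illustrative and the weights are random, so this shows only the composition of affine maps, ReLU hidden layers, and a sigmoid output:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative two-hidden-layer shape: input 8 -> 16 -> 16 -> 1.
sizes = [8, 16, 16, 1]
Ws = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):      # hidden layers: ReLU(W h + b)
        h = relu(W @ h + b)
    return sigmoid(Ws[-1] @ h + bs[-1])     # output: sigmoid(W h + b)

p = forward(rng.normal(size=8))
```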

Deep baseline configuration. We used a two-hidden-layer MLP with ReLU activations, batch normalization in the hidden layers, and dropout with p = 0.20. The trainable weights were initialized using the He initializer. The network was trained with the BCEWithLogitsLoss loss function. Optimization was performed using Adam, together with ReduceLROnPlateau (factor 0.5, patience 3). Training was run for up to 100 epochs with early stopping based on validation AUROC (patience 10, restoring the best checkpoint) and a batch size of 16. Numerical features were preprocessed using z-score standardization, and categorical variables were one-hot encoded. All transformations were fit only on the training fold and then applied to the validation and test folds. The decision threshold was selected on the validation split to maximize the F1-score and then applied to the test split. We report Accuracy, F1-score, AUROC, and fairness gaps (SPD and EOD) as mean ± 95% confidence intervals over 10-fold cross-validation. This deep baseline provides a direct check that fairness interventions remain effective beyond model family differences, supporting the use of the proposed mitigation strategy for both classical and DL models.

4.3 Fairness preprocessing

Pre-processing, in-processing, and post-processing approaches are commonly used to mitigate bias in machine learning pipelines. In particular, pre-processing methods aim to reduce biases present in the training data, thereby limiting the propagation of bias through subsequent modeling stages. A probabilistic formulation of data preparation has been proposed to address dataset-level bias. Under group-fairness, individual-distortion, and data-integrity constraints, optimized pre-processing applies probabilistic modifications to both features and labels to produce a less biased representation of the data. Robust optimization in this setting is typically guided by three objectives: minimizing distortion to individual samples, controlling discriminatory effects across groups, and preserving predictive utility while learning the data transformation [90].
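As a concrete example of a dataset-level pre-processing intervention, the following sketch implements the simpler reweighing scheme of Kamiran and Calders, which assigns each (group, label) cell the weight P(S=g)·P(Y=l)/P(S=g, Y=l). This is an illustration of the pre-processing family, not the optimized pre-processing method used in the study:

```python
import numpy as np

def reweigh(s, y):
    """Kamiran-Calders reweighing: w(g, l) = P(S=g) * P(Y=l) / P(S=g, Y=l)."""
    s, y = np.asarray(s), np.asarray(y)
    w = np.empty(len(y), dtype=float)
    for g in np.unique(s):
        for l in np.unique(y):
            mask = (s == g) & (y == l)
            if mask.any():
                w[mask] = (s == g).mean() * (y == l).mean() / mask.mean()
    return w

# Biased toy data: the privileged group (s=1) receives positive labels more often.
s = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y = np.array([1, 1, 1, 0, 1, 0, 0, 0])
w = reweigh(s, y)
```

After reweighing, the weighted positive rate is identical across groups, so a model trained with these sample weights sees a statistically balanced dataset.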

4.4 Adversarial attacks

This section provides an overview of the structure of poisoning and label leakage attacks and their possible consequences for the fairness of ML models. These discussions provide a basis for future research on mitigation techniques to guarantee reliable results [91,92].

4.4.1 Poison attack.

A poisoning attack refers to the intentional injection of adversarially crafted samples into the training set, with the goal of biasing the learning process and compromising model integrity [93]. Consider a setting in which predictions are intended to be independent of a sensitive attribute t. Under such an attack, however, the model may learn a mapping f that directly (or indirectly) uses t, yielding

ŷ = f(z, t).

In the ideal case, the model would instead learn a function k that relies only on the non-sensitive features z, such that

ŷ = k(z),

and the prediction is not influenced by t. The disparity induced by dependence on t can be quantified by comparing expected prediction errors across groups defined by t:

Δ_t = | E[L(y, ŷ) | t = 0] − E[L(y, ŷ) | t = 1] |

where L denotes the loss function and E denotes expectation.
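A poisoning attack of the kind described above can be simulated by flipping a fraction of training labels; this toy sketch (synthetic data, 20% label flips, both hypothetical choices) contrasts a clean and a poisoned Decision Tree:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Clean model.
clean = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Poisoned model: flip 20% of the training labels before fitting.
rng = np.random.default_rng(0)
y_poison = y_tr.copy()
flip = rng.choice(len(y_poison), size=int(0.20 * len(y_poison)), replace=False)
y_poison[flip] = 1 - y_poison[flip]
poisoned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_poison)

acc_clean = clean.score(X_te, y_te)
acc_poisoned = poisoned.score(X_te, y_te)
```

The accuracy gap between the two models gives a direct, if simplified, measure of the damage a training-set attack inflicts on held-out performance.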

4.4.2 Label leak attack.

Label leakage occurs when a model inadvertently learns to depend on information connected with a sensitive attribute, resulting in biased predictions. A label leak attack can be expressed formally as follows. Ideally, the prediction ŷ and a sensitive attribute s should be independent. Under label leakage, however, the model learns a function g that depends on s:

ŷ = g(z, s).

The model should ideally learn a function h such that s has no influence on the prediction ŷ:

ŷ = h(z).

The difference in prediction errors among groups defined by s allows one to measure the label leakage:

Δ_s = | E[L(y, ŷ) | s = 0] − E[L(y, ŷ) | s = 1] |

where E denotes the expectation.
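The disparity measure defined above can be computed with a 0/1 loss as follows (toy labels; the grouping variable s is illustrative):

```python
import numpy as np

def group_error_gap(y_true, y_pred, s):
    """|E[L | s=0] - E[L | s=1]| with 0/1 loss, per the disparity formula above."""
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))
    loss = (y_true != y_pred).astype(float)
    return abs(loss[s == 0].mean() - loss[s == 1].mean())

# Toy predictions: the model errs only on the s=1 group (2 of 4 wrong).
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
s      = np.array([0, 0, 0, 0, 1, 1, 1, 1])
gap = group_error_gap(y_true, y_pred, s)
```

A gap of zero means the model's error rate is identical across the two groups; larger values indicate the kind of group-dependent behavior that label leakage induces.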

4.5 Experimental setup

In this study, Python version 3.8 was used for conducting experiments, with Scikit-learn for conventional ML implementations and the PyTorch framework for DL-based model training and evaluation. An MSI Pulse 17 B13V laptop with an Intel Core i9 CPU and 48GB of RAM was used for model training, adversarial attack simulations, and hyperparameter optimization.

5 Results

In our experiments, we evaluated DT and RF classifiers under two conditions: (i) standard training without fairness mitigation, and (ii) training with an optimized pre-processing fairness mitigation method. Model performance was examined both overall and with respect to sensitive attributes (age, sex, and race) using a combination of individual- and group-level bias indicators. The dataset was divided into training and testing subsets (80%/20%). The hyperparameters were tuned using a grid search technique, and each model was trained on the full dataset containing all demographic groups. Each experiment was repeated five times, and results are reported as the mean ± standard deviation (SD). Consequently, two result tables are provided for each classifier. The hyperparameters are summarized in Table 4. In addition, a baseline MLP is reported in Sect 4.2.3, with implementation details provided in Table 5.

thumbnail
Table 4. Final chosen hyperparameters for machine learning models and adversarial attacks (this work).

Seeds follow seed = base.

https://doi.org/10.1371/journal.pone.0342062.t004

thumbnail
Table 5. MLP (DL baseline) with and without adversarial robustness (adversarial debiasing).

Entries are mean ± 95% CI over 10-fold CV. Higher is better for Accuracy/F1/AUROC; lower is better for fairness gaps (SPD, EOD).

https://doi.org/10.1371/journal.pone.0342062.t005

All reported metrics were verified to lie within their expected ranges. Specifically, performance measures (FPR, FNR, TPR, and TNR) were constrained to the interval [0, 1].

5.1 Fairness assessment

Hypothesis 01: Mitigating data biases in the baseline Decision Tree will improve its fairness evaluation at the cost of decreased predictive performance.

Hypothesis 02: Mitigating data biases in the baseline Random Forest will likewise improve fairness at the cost of decreased predictive performance.

Hypothesis 03: The computational complexity of the model will affect both the findings and the efficiency of bias mitigation.

5.1.1 Data biases and fairness evaluation of decision tree.

Table 6 shows the model performance metrics together with changes in selected bias and group measures, both with and without the optimized preprocessing fairness mitigation strategy. The table illustrates the approach followed for the DT model and shows how the fairness strategy reduces biases related to gender, age, and race.

thumbnail
Table 6. This table provides the five-fold performance difference between the DT classifier’s performance with and without the enhanced preprocessing fairness mitigation technique.

We provided two types of fairness evaluation: group metrics (TPR, TNR, FPR, FNR, FDR, FOR, PPV, and NPV) and bias measures (SPD, DI, EOD, and AAOD).

https://doi.org/10.1371/journal.pone.0342062.t006

After fairness mitigation was applied, model performance clearly changed. Accuracy dropped across all demographic factors, for example, from 82.12% to 74.56% for race, from 80.32% to 76.24% for age, and from 85.15% to 77.08% for gender. These changes reflect the trade-off between reducing systematic data biases and maintaining overall model performance. Precision, recall, and F1-score, however, showed only minor changes, highlighting the balanced character of the predictions following mitigation.

Additionally, the fairness bias measures (SPD, DI, EOD, and AAOD) showed notable gains. For the racial attribute, SPD declined from –0.4512 to –0.0058, while DI rose from 0.3887 to 0.9942, indicating more equitable treatment across groups. In the same vein, AAOD dropped significantly from –0.3523 to –0.002, confirming the success of the fairness approach. For age and gender, there was a clear drop in bias measures, indicating progress toward fairness, even if the changes were less pronounced than for race.

While the false rate measures (FPR, FNR, and FDR) remained consistently high, group metrics including TPR, TNR, and PPV exhibited modest changes, highlighting the ongoing difficulty of ensuring fairness in severely unbalanced datasets. The false omission rate (FOR) stayed high across all groups despite the mitigation measures, reflecting the difficulty of correcting deeply ingrained disparities in prediction errors.

Figs 5 and 6 further highlight these outcomes. Following the mitigation strategy, the SPD, EOD, and AAOD metrics approached zero for the race attribute, while DI neared one, indicating substantial fairness improvements. Results were mixed for the age and gender attributes, most likely due to underlying dataset discrepancies. Specifically, the gender attribute displayed opposite bias patterns, whereby group-level measurements indicated disparities that persisted after mitigation.

thumbnail
Fig 5. With the improved preprocessing fairness mitigation strategy, bias disparities are reduced across the race, age, and gender features in the Decision Tree model, leading to fairer and more equitable outcomes across these protected groups.

https://doi.org/10.1371/journal.pone.0342062.g005

thumbnail
Fig 6. Summary of the DT classifier’s group measures (bias or performance disparities) with and without the optimized preprocessing fairness mitigation strategy for each sensitive characteristic: (a) race, (b) age, and (c) gender.

https://doi.org/10.1371/journal.pone.0342062.g006

Although the preprocessing fairness mitigation strategy significantly reduced bias across several demographic variables, it also introduced a trade-off with overall model accuracy and predictive reliability. These results emphasize the need for ongoing research to optimize fairness strategies while preserving high model performance.

5.1.2 Data biases and fairness evaluation of random forest.

We evaluated the fairness of the DT and RF models. Since both belong to the tree-based classifier family, researchers in the PD domain consider them comparable and often use them together. The RF model’s performance is described in Table 7, in which the top row presents accuracy metrics including accuracy, precision, recall, and F1-score. Fig 7 further shows how the fairness mitigation strategy affects the bias measures. With metrics such as SPD, DI, and AAOD approaching ideal values, the optimized approach greatly lowers bias for the racial attribute, as seen in Fig 7. For the age feature, the mitigation strategy yields mixed results: while the SPD, DI, and AAOD measures improve, EOD exhibits increased bias following mitigation, as shown in Fig 7(b). For the gender feature, by contrast, all bias measures are negatively affected and show an increase in disparity following optimization, as shown in Fig 7(c).

thumbnail
Table 7. Difference in the RF classifier’s five-fold performance with and without the improved preprocessing fairness mitigation approach.

Two categories of fairness evaluation are presented: bias measures (SPD, DI, EOD, and AAOD) and group metrics (TPR, TNR, FPR, FNR, FDR, FOR, PPV, and NPV).

https://doi.org/10.1371/journal.pone.0342062.t007

thumbnail
Fig 7. Comparison of bias metrics of the RF ML model with and without using the optimized preprocessing fairness mitigation technique on (a) race, (b) age, and (c) gender features.

https://doi.org/10.1371/journal.pone.0342062.g007

The group metrics analysis further exposes performance differences linked to race, age, and gender. Fig 8(a), 8(b), and 8(c) contrast the classifier’s performance across these attributes. Although racial disparities improve in some measures, such as true positive rates and statistical parity, other group metrics show a consistent decrease in performance, indicating the need to further refine the mitigation strategy to properly address these biases.

thumbnail
Fig 8. Comparison of the RF ML model’s group metric with and without the application of the improved preprocessing fairness mitigation method on (a) race, (b) age, and (c) gender features.

https://doi.org/10.1371/journal.pone.0342062.g008

5.2 Robustness of the models

Hypothesis 01: A poison attack on the fairness assessment based on DT will reduce the robustness of the model and compromise its predictive performance.

Hypothesis 02: The resilience of a model may vary depending on the type of adversarial attack.

Hypothesis 03: The characteristics of the dataset and their variation under adversarial attacks, including the level of noise introduced and the relevance of leaked labels, affect the robustness of the model.

5.2.1 Poison attack on fairness assessment using DT.

The poison attack substantially impaired the performance of the DT classifier across all measures reported in Table 8. The attack injected biased data points and synthetic noise into the training set, disrupting the decision-making process and producing distorted decision boundaries. This is especially relevant in medical applications, where biased or erroneous predictions directly influence patient outcomes. The metrics reveal a significant performance decline in all three demographic groups: gender, age, and race. Accuracy dropped noticeably in all areas under the poison attack: for race, from 81.50% to 71.10%; for age, from 79.80% to 73.20%; and for gender, from 84.50% to 75.10%. As shown in Fig 13(c), this drop in accuracy demonstrates how seriously the poison attack compromises the classifier’s capacity to produce accurate predictions. Precision and recall show similar patterns: precision declines for race from 68.90% to 64.80%, for age from 70.00% to 66.80%, and for gender from 77.20% to 69.30%. Recall follows the same downward trend, dropping for race from 71.10% to 66.10%, for age from 69.00% to 67.00%, and for gender from 74.00% to 69.20%. The F1-score, which combines precision and recall, also declines, again indicating a disparity between these two measures.

thumbnail
Table 8. Impact of a poison attack on fairness assessment using DT classifier.

The table reports the difference in five-fold performance and fairness metrics between using and not using the improved preprocessing fairness mitigation approach. Two categories of fairness evaluation are presented: bias measures (SPD, DI, EOD, and AAOD) and group metrics (TPR, TNR, FPR, FNR, FDR, FOR, PPV, and NPV).

https://doi.org/10.1371/journal.pone.0342062.t008

Regarding fairness criteria, the poisoning attack compromises model fairness, as reflected in bias assessments. While DI increases from 0.38 to 0.79, SPD decreases from −0.46 to −0.105, indicating reduced fairness after the attack. In addition, EOD worsens from −0.29 to −0.12, suggesting a decline in group-level fairness. For race, AAOD decreases from −0.35 to −0.102. These findings indicate that the poisoning attack not only degrades the overall performance of the classifier but also amplifies the disparities in its predictions, thus exacerbating concerns about fairness.

The group-level metrics provide a more detailed view of the impact on classifier performance. As shown in Fig 9, the TPR and TNR decrease across all demographic categories. For race, TPR declines from 0.05 to 0.04, indicating a reduced ability of the classifier to correctly identify positive cases. In contrast, both the FPR and FNR increase. Notably, the FPR for race rises to 0.98, suggesting a substantial increase in misclassified instances. Moreover, the FOR and FDR also increase, reflecting a higher proportion of incorrect predictions. Specifically, FOR increases to 0.99, indicating a decline in the predictive reliability of the model. Consistently, the NPV for race decreases to 0.01, and both the PPV and NPV exhibit overall reductions. Together, these results highlight a degradation in the reliability and consistency of the model’s predictions, as further illustrated by the fairness measures reported in Fig 10(a)–10(c).

thumbnail
Fig 9. Comparison between the poison attack on fairness assessment using the DT’s bias measure while not utilizing and with utilizing the improved preprocessing fairness mitigation strategy on (a) race, (b) age, and (c) gender features.

https://doi.org/10.1371/journal.pone.0342062.g009

thumbnail
Fig 10. Comparison of the poison-attacked decision tree ML model’s group metric with and without the application of the improved preprocessing fairness mitigation method on (a) race, (b) age, and (c) gender features.

https://doi.org/10.1371/journal.pone.0342062.g010

These results highlight the major negative consequences of poisoning attacks on the performance, fairness, and dependability of machine learning models, especially in sensitive fields like healthcare. Beyond lowering the overall performance of the classifier, the attack aggravates existing fairness disparities, producing potentially biased outcomes.

5.2.2 Label leak attack on fairness assessment using RF.

The complete data in Table 9 reveal that the accuracy of the RF classifier decreased significantly after the label leak attack, going from 88.50% to 60.50% for race, from 85.50% to 58.50% for age, and from 90.00% to 62.00% for gender. This drastic decrease of over 20 percentage points across all demographic groups highlights the substantial impact of the attack on the model’s ability to make accurate predictions. Furthermore, there was a notable decline in recall for all groups, with Race dropping from 78.00% to 57.50%, Age from 76.00% to 55.50%, and Gender from 81.00% to 60.00%. Precision also decreased, with Race falling from 77.50% to 58.50%, Age from 75.00% to 56.50%, and Gender from 80.00% to 59.00%. These decreases indicate the model’s diminished ability to identify both true positives and true negatives, impairing its overall predictive performance. The F1-score, which reflects both precision and recall, also dropped significantly, from 77.80% to 58.00% for Race, from 75.80% to 56.00% for Age, and from 80.50% to 59.50% for gender. This decline in the F1 score, shown in Fig 13(a) and 13(d), further highlights the model’s diminished accuracy following the attack.

thumbnail
Table 9. Impact of label leak attack on fairness assessment using five-fold random forest classifier.

The table reports the difference in performance and fairness metrics between using and not using the improved preprocessing fairness mitigation approach. Two categories of fairness evaluation are presented: bias measures (SPD, DI, EOD, and AAOD) and group metrics (TPR, TNR, FPR, FNR, FDR, FOR, PPV, and NPV).

https://doi.org/10.1371/journal.pone.0342062.t009

Regarding fairness metrics, the SPD worsened after the attack, as shown in Fig 11, with Race widening from –0.12 to –0.40, Age from –0.09 to –0.45, and Gender from –0.06 to –0.50. These changes indicate a growing disparity in favorable outcomes across demographic groups. The DI measure dropped significantly, reaching 0.50 for Race, 0.55 for Age, and 0.60 for Gender, indicating a marked increase in bias against these groups. Furthermore, the AAOD worsened from –0.07 to –0.38 for Race, from –0.06 to –0.42 for Age, and from –0.04 to –0.47 for Gender, while the EOD worsened from –0.05 to –0.35 for Race, from –0.04 to –0.40 for Age, and from –0.03 to –0.45 for Gender. These results underscore that the label leak attack exacerbated disparities, particularly in the false positive, false negative, and true positive rates across demographic groups.
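The four bias measures above can be computed directly from model predictions. The sketch below assumes binary labels, a binary sensitive attribute with group 1 as the privileged group, and a signed average odds difference to match the signed values reported; the `fairness_gaps` helper and the toy arrays are illustrative, not the paper's code.

```python
import numpy as np

def fairness_gaps(y_true, y_pred, group):
    """SPD, DI, EOD, and a signed average odds difference between an
    unprivileged (group == 0) and privileged (group == 1) subgroup.
    Assumes the privileged group has a nonzero selection rate."""
    u, p = (group == 0), (group == 1)
    sel_u, sel_p = y_pred[u].mean(), y_pred[p].mean()
    tpr = lambda m: y_pred[m & (y_true == 1)].mean()  # true positive rate
    fpr = lambda m: y_pred[m & (y_true == 0)].mean()  # false positive rate
    spd = sel_u - sel_p                               # statistical parity difference
    di = sel_u / sel_p                                # disparate impact ratio
    eod = tpr(u) - tpr(p)                             # equal opportunity difference
    aod = 0.5 * ((fpr(u) - fpr(p)) + (tpr(u) - tpr(p)))  # signed average odds diff
    return spd, di, eod, aod

# Toy example in which the model favors group 1.
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
spd, di, eod, aod = fairness_gaps(y_true, y_pred, group)
```

Negative SPD/EOD values and a DI well below 1, as in this toy case, signal bias against the unprivileged group, the same direction as the post-attack values reported above.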

Fig 11. Comparison of the random forest model's bias measures under the label leak attack, without and with the improved preprocessing fairness mitigation strategy, for the (a) race, (b) age, and (c) gender features.

https://doi.org/10.1371/journal.pone.0342062.g011

In Fig 12, the TPR decreased for all groups, from 0.78 to 0.58 for Race, from 0.76 to 0.55 for Age, and from 0.81 to 0.60 for Gender, further emphasizing the negative impact on model performance. Similarly, the TNR decreased, with Race dropping from 0.90 to 0.70, Age from 0.89 to 0.68, and Gender from 0.91 to 0.68, indicating that the classifier misclassifies more instances. Moreover, the increased FPR and FNR highlight a rise in incorrect classifications. The PPV and NPV suffered as well, with PPV decreasing from 0.77 to 0.59 for Race, from 0.75 to 0.58 for Age, and from 0.80 to 0.60 for Gender. NPV also dropped from 0.77 to 0.67 for Race, from 0.76 to 0.65 for Age, and from 0.80 to 0.70 for Gender. These results indicate that the label leak attack compromised the model's predictive reliability, reducing its ability to provide correct positive and negative predictions across all demographic groups.
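The per-group rates discussed here all derive from a group's 2x2 confusion matrix. A minimal sketch, using scikit-learn's `confusion_matrix` and a hypothetical `group_rates` helper on toy data (not the PPMI predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def group_rates(y_true, y_pred, mask):
    """Per-group TPR, TNR, FPR, FNR, PPV, NPV from a 2x2 confusion matrix.
    Assumes every confusion-matrix cell pair used in a denominator is nonzero."""
    tn, fp, fn, tp = confusion_matrix(y_true[mask], y_pred[mask], labels=[0, 1]).ravel()
    return {
        "TPR": tp / (tp + fn), "TNR": tn / (tn + fp),
        "FPR": fp / (fp + tn), "FNR": fn / (fn + tp),
        "PPV": tp / (tp + fp), "NPV": tn / (tn + fn),
    }

# Toy example; in practice the mask selects one demographic group at a time.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])
rates = group_rates(y_true, y_pred, np.ones(8, dtype=bool))
```

Comparing these dictionaries across the masks for each demographic group gives exactly the disparities that the SPD, EOD, and AAOD measures summarize.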

Fig 12. Comparison of the random forest model's bias measures under the label leak attack, without and with the improved preprocessing fairness mitigation strategy, for the (a) race, (b) age, and (c) gender features.

https://doi.org/10.1371/journal.pone.0342062.g012

5.3 Evaluation of fairness statistical tests

To ascertain the statistical significance of the fairness gains reported for our models, we performed paired t-tests on the performance of the RF and DT classifiers before and after fairness preprocessing. These tests show significant gains in fairness indices such as EOD and SPD (p < 0.05). Racial bias was significantly reduced, as evidenced by the SPD for race falling from –0.4512 to –0.0058 for DT (p = 0.017) and from –0.38 to –0.05 for RF (p = 0.021). Compared with earlier studies, our models' performance was on par with the highest reported accuracies (85.3% and 83.9%) [42], but with the added benefit of adversarial robustness and fairness. This illustrates the statistical benefit of our dual approach, which combines adversarial robustness evaluation with fairness mitigation [94–97].
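The paired t-test itself is straightforward to reproduce with SciPy. The per-fold SPD values below are illustrative placeholders standing in for the five cross-validation folds, not the paper's actual numbers:

```python
# Paired t-test comparing a fairness metric before vs. after mitigation,
# paired by cross-validation fold (illustrative values, not PPMI results).
import numpy as np
from scipy.stats import ttest_rel

spd_before = np.array([-0.45, -0.42, -0.47, -0.44, -0.46])  # five CV folds
spd_after  = np.array([-0.01, -0.02,  0.00, -0.01, -0.02])

t_stat, p_value = ttest_rel(spd_before, spd_after)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Pairing by fold is what justifies `ttest_rel` rather than an independent-samples test: each fold contributes one before/after pair, and the test operates on the within-fold differences.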

5.4 Evaluation of robustness using statistical tests

Additionally, we compared model performance under adversarial scenarios (label leakage and poisoning attacks) using paired t-tests. Adversarial attacks significantly reduced the accuracy of both models (p < 0.01). Label leakage attacks caused RF accuracy to drop by over 20%, substantially greater than the approximately 10% drop observed for DT. Consistent with [98–100], these results demonstrate the increased susceptibility of more complex models under adversarial conditions. As the poisoning rate increases, we observe a monotonic decline in utility (accuracy, F1-score, and AUROC) and a widening of fairness gaps (SPD and EOD). RF consistently degrades less than a single DT, indicating greater resilience to label noise. This analysis quantifies robustness directly within our proposed pipeline and satisfies the generalization assessment without requiring external validation data [81,88,101].
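A poisoning-rate sweep of this kind can be sketched as follows. This is a simplified stand-in, assuming symmetric random label flips on synthetic `make_classification` data rather than the paper's PPMI pipeline and attack implementation:

```python
# Label poisoning sweep: flip an increasing fraction of training labels and
# track test accuracy for a random forest vs. a single decision tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

results = {}  # poisoning rate -> (RF accuracy, DT accuracy)
for rate in (0.0, 0.1, 0.2, 0.3):
    y_poison = y_tr.copy()
    flip = rng.choice(len(y_tr), size=int(rate * len(y_tr)), replace=False)
    y_poison[flip] = 1 - y_poison[flip]  # symmetric label flips
    rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_poison)
    dt = DecisionTreeClassifier(random_state=1).fit(X_tr, y_poison)
    results[rate] = (rf.score(X_te, y_te), dt.score(X_te, y_te))
    print(f"rate={rate:.1f}  RF={results[rate][0]:.3f}  DT={results[rate][1]:.3f}")
```

In this toy setting the fully grown tree memorizes the flipped labels and degrades quickly, while the forest's bagging and voting average out much of the noise, consistent with the relative resilience described above.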

6 Discussion

This detailed investigation demonstrates how label leakage attacks affect both the performance and fairness of the RF classifier, using quantitative results to empirically validate our hypothesis. These findings highlight the need for robust safeguards in ML models, particularly in high-stakes domains such as healthcare. The results further indicate that, although fairness-aware preprocessing methods can lead to a modest reduction in raw predictive accuracy, they substantially mitigate demographic bias in PD prediction. These trends are summarized in Table 1, which compares recent ML-based approaches for the detection of PD and emphasizes the novelty of our proposed approach that focuses on fairness and robustness. These insights have important implications for clinical practice. Specifically, this study provides a practical framework for developing more reliable and equitable AI systems in healthcare by showing that adversarial evaluation can uncover latent vulnerabilities and that fairness-aware preprocessing can effectively reduce bias in PD diagnosis.

  1. Group-specific metrics: For both DT and RF classifiers, the TPR and TNR decreased under adversarial conditions, as shown in Tables 6 and 7, indicating a reduced capacity to correctly identify true outcomes. Following the poisoning attack, the RF model exhibited a larger drop in TNR compared to the label leakage attack, suggesting increased susceptibility to false positive predictions under poisoning.
  2. Bias amplification: Fig 5 shows how adversarial manipulation can increase bias: the poisoning attack on the DT classifier amplifies bias across all fairness indicators (SPD, DI, EOD, AAOD). Under the label leak attack, the RF classifier in Fig 7 shows a larger fairness shift and more bias than the DT under the poisoning attack.
  3. Model sensitivity to data composition: Both RF and DT models are sensitive to the composition of the training data, as shown in Figs 5 and 6. Label leak attacks significantly lower RF performance metrics (Figs 11 and 12), while variation in the gender distribution significantly influences DT performance.
  4. Predictive value changes: Although the PPV and NPV of both classifiers decreased, the RF model exhibited a larger reduction in NPV under the label leakage attack than the DT model under the poisoning attack, as shown in Tables 8 and 9. This indicates a greater adverse impact on negative predictions for the RF classifier.
  5. Error rate disparities: Biased data and adversarial threats cause the two classifiers to exhibit distinct error rate patterns. After the poisoning attack, Fig 9 shows a 2% increase in FNR for race in the DT; Fig 12 shows the same trend in the RF after the label leak attack. This implies that both models produce more false negatives under adversarial settings.
  6. Impact on fairness metrics: The poisoning attack on the DT model, shown in Figs 9 and 10, reduces model fairness, with SPD moving further from zero, indicating increased inequality. Furthermore, the label leakage attack on the RF model lowers DI and increases bias.
  7. Performance metrics degradation: The poisoning attack on the DT model reduces accuracy, precision, recall, and F1-score across all demographic groups, as shown in Fig 13. Accuracy declines more substantially for the RF model, suggesting that label leakage may be more detrimental to RF than poisoning is to DT. The label leakage attack on the RF model also leads to a noticeable degradation in overall performance metrics.
  8. Robustness to attacks: The RF classifier appears slightly more vulnerable to malicious inputs compared to the DT model; however, this difference is not sufficiently large to draw a definitive conclusion about its general applicability, as shown in Fig 13(a). In particular, both RF and DT exhibit greater performance degradation under label leakage attacks than under poisoning attacks.
  9. Fairness optimization potential: Although the DT model exhibited lower overall accuracy after training on the fairness-adjusted dataset, its accuracy improvement in Fig 13(b) is greater than that of the MLP while maintaining equal treatment across groups, indicating stronger potential for fairness optimization.
  10. Mitigation strategy implications: The differences in experimental results between the DT and RF models suggest that strategies for reducing bias and enhancing robustness must be tailored to each model type, as shown in Fig 13(a)–13(d). No single technique is universally effective, which highlights the importance of model-specific mitigation strategies for achieving improved fairness and resilience in machine learning systems.
Fig 13. Comparison of ML results without and with the improved preprocessing fairness mitigation strategy and adversarial attack for the RF (a) and DT (b) models; panels (c) and (d) show the DT and RF models, respectively.

https://doi.org/10.1371/journal.pone.0342062.g013

7 Limitations

The generalizability of the present study is limited by several methodological and data-related factors. First, the exclusion of DL models restricts the applicability of the results to more complex architectures commonly used in natural language processing and medical imaging [99,100]. In addition, the analysis focused exclusively on pre-processing-based fairness mitigation strategies. Although such methods are effective for model-agnostic bias correction, they may be less effective than in-processing or adversarial debiasing approaches, particularly in scenarios where feature-label relationships are inherently skewed [32]. Adversarial robustness was evaluated using only poisoning and label leakage attacks. While these attacks provide useful insights, they do not represent the full range of adversarial threats, such as gradient-based, universal, or physical attacks, and may therefore underestimate model vulnerability in real-world deployment settings. Furthermore, the discretization of sensitive attributes (race, gender, and age) may have oversimplified the structure of bias, particularly for continuous variables such as age. Although this choice may obscure intra-group heterogeneity, it was adopted to remain consistent with the formal definitions of widely used fairness metrics [86,89,90]. Finally, although the PPMI dataset provides high-quality clinical data, its demographic skew (e.g., over-representation of white male participants) and class imbalance may limit the external validity of the results when generalizing to more diverse populations.

8 Conclusion and future work

In this study, we examined the fairness and robustness of ML models for PD prediction and showed that demographic bias and adversarial vulnerability remain significant challenges in clinical AI. Our findings demonstrate that data-driven biases associated with gender, age, and race can substantially affect model predictions if not explicitly addressed. Pre-processing-based fairness interventions were effective in reducing demographic disparities, although in some cases this improvement was accompanied by a modest reduction in predictive accuracy. We also showed that ML models are susceptible to adversarial perturbations, particularly poisoning and label leakage attacks, which degraded both performance and fairness. The Decision Tree model was especially sensitive to poisoning, highlighting the risks of deploying ML systems in high-stakes healthcare settings without systematic robustness evaluation.

Future work will extend the current experimental framework to include representative in-processing (e.g., adversarial debiasing and equalized-odds reduction) and post-processing (e.g., equalized-odds and equal-opportunity threshold optimization) methods, using the same data splits, metrics, and reporting format. Additional directions include evaluating the framework on external and more diverse cohorts, incorporating explainability methods (e.g., SHAP and LIME), and exploring multi-objective optimization strategies that jointly balance accuracy, fairness, and robustness.

References

  1. Tsipras D, Santurkar S, Engstrom L, Turner A, Madry A. Robustness may be at odds with accuracy. arXiv preprint 2018. https://arxiv.org/abs/1805.12152
  2. Carlini N, Wagner D. Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP). 2017. p. 39–57. https://doi.org/10.1109/sp.2017.49
  3. Turner A, Tsipras D, Madry A. Label-consistent backdoor attacks. arXiv preprint 2019. https://arxiv.org/abs/1912.02771
  4. Goodfellow I, McDaniel P, Papernot N. Making machine learning robust against adversarial inputs. Commun ACM. 2018;61(7):56–66.
  5. Moosavi-Dezfooli SM, Fawzi A, Frossard P. DeepFool: a simple and accurate method to fool deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 2574–82.
  6. Papernot N, McDaniel P, Jha S, Fredrikson M, Celik ZB, Swami A. The limitations of deep learning in adversarial settings. In: 2016 IEEE European Symposium on Security and Privacy (EuroS&P). 2016. p. 372–87. https://doi.org/10.1109/eurosp.2016.36
  7. Zhang C, Zhang K, Li Y. A causal view on robustness of neural networks. Advances in Neural Information Processing Systems. 2020;33:289–301.
  8. Bazargani JS, Rahim N, Sadeghi-Niaraki A, Abuhmed T, Song H, Choi S-M. Alzheimer's disease diagnosis in the metaverse. Comput Methods Programs Biomed. 2024;255:108348. pmid:39067138
  9. Du M, Yang F, Zou N, Hu X. Fairness in deep learning: a computational perspective. IEEE Intell Syst. 2021;36(4):25–34.
  10. Rahim N, Ahmad N, Ullah W, Bedi J, Jung Y. Early progression detection from MCI to AD using multi-view MRI for enhanced assisted living. Image and Vision Computing. 2025;157:105491.
  11. Morse L, Teodorescu MHM, Awwad Y, Kane GC. Do the ends justify the means? Variation in the distributive and procedural fairness of machine learning algorithms. J Bus Ethics. 2021;181(4):1083–95.
  12. Pessach D, Shmueli E. A review on fairness in machine learning. ACM Comput Surv. 2022;55(3):1–44.
  13. Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A. A survey on bias and fairness in machine learning. ACM Comput Surv. 2021;54(6):1–35.
  14. Baek J, Li Y, Lim L, Chong JW. An interpretable AI for smart homes: identifying fall prevention strategies for older adults using multimodal deep learning. IEEE J Biomed Health Inform. 2025;29(10):7643–56. pmid:40366850
  15. Wiens J, Saria S, Sendak M, Ghassemi M, Liu VX, Doshi-Velez F, et al. Do no harm: a roadmap for responsible machine learning for health care. Nat Med. 2019;25(9):1337–40. pmid:31427808
  16. Wang D, Wang L, Zhang Z, Wang D, Zhu H, Gao Y, et al. "Brilliant AI Doctor" in rural clinics: challenges in AI-powered clinical decision support system deployment. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 2021. p. 1–18. https://doi.org/10.1145/3411764.3445432
  17. Toreini E, Aitken M, Coopamootoo K, Elliott K, Zelaya CG, van Moorsel A. The relationship between trust in AI and trustworthy machine learning technologies. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 2020. p. 272–83. https://doi.org/10.1145/3351095.3372834
  18. Rashed-Al-Mahfuz M, Haque A, Azad A, Alyami SA, Quinn JMW, Moni MA. Clinically applicable machine learning approaches to identify attributes of Chronic Kidney Disease (CKD) for use in low-cost diagnostic screening. IEEE J Transl Eng Health Med. 2021;9:4900511. pmid:33948393
  19. Arachchige PCM, Bertok P, Khalil I, Liu D, Camtepe S, Atiquzzaman M. A trustworthy privacy preserving framework for machine learning in industrial IoT systems. IEEE Trans Ind Inf. 2020;16(9):6092–102.
  20. Huang X, Kroening D, Ruan W, Sharp J, Sun Y, Thamo E, et al. A survey of safety and trustworthiness of deep neural networks: verification, testing, adversarial attack and defence, and interpretability. Computer Science Review. 2020;37:100270.
  21. Hossain MA, Ferdousi R, Alhamid MF. Knowledge-driven machine learning based framework for early-stage disease risk prediction in edge environment. Journal of Parallel and Distributed Computing. 2020;146:25–34.
  22. Baek J, Li Y, Lim L, Chong JW. An interpretable AI for smart homes: identifying fall prevention strategies for older adults using multimodal deep learning. IEEE J Biomed Health Inform. 2025;29(10):7643–56. pmid:40366850
  23. Tai Y, Gao B, Li Q, Yu Z, Zhu C, Chang V. Trustworthy and intelligent COVID-19 diagnostic IoMT through XR and deep-learning-based clinic data access. IEEE Internet Things J. 2021;8(21):15965–76. pmid:35782175
  24. Arnold M, Bellamy RKE, Hind M, Houde S, Mehta S, Mojsilović A, et al. FactSheets: increasing trust in AI services through supplier's declarations of conformity. IBM J Res & Dev. 2019;63(4/5):6:1-6:13.
  25. Calmon F, Wei D, Vinzamuri B, Natesan Ramamurthy K, Varshney KR. Optimized pre-processing for discrimination prevention. Advances in Neural Information Processing Systems. 2017;30.
  26. Liu B. Lifelong machine learning: a paradigm for continuous learning. Front Comput Sci. 2017;11(3):359–61.
  27. Corbett-Davies S, Goel S. The measure and mismeasure of fairness: a critical review of fair machine learning. arXiv preprint 2018. https://arxiv.org/abs/1808.00023
  28. Kim M, Reingold O, Rothblum G. Fairness through computationally-bounded awareness. Advances in Neural Information Processing Systems. 2018;31.
  29. Mourali M, Novakowski D, Pogacar R, Brigden N. Public perception of accuracy-fairness trade-offs in algorithmic decisions in the United States. PLoS One. 2025;20(3):e0319861. pmid:40080482
  30. Awan M, Whangbo TK, Shin J. Deep learning methods for autonomous driving scene understanding tasks: a review. Expert Systems with Applications. 2025;287:128098.
  31. Haaxma CA, Bloem BR, Borm GF, Oyen WJG, Leenders KL, Eshuis S, et al. Gender differences in Parkinson's disease. J Neurol Neurosurg Psychiatry. 2007;78(8):819–24. pmid:17098842
  32. Paul S, Maindarkar M, Saxena S, Saba L, Turk M, Kalra M, et al. Bias investigation in artificial intelligence systems for early detection of Parkinson's disease: a narrative review. Diagnostics (Basel). 2022;12(1):166. pmid:35054333
  33. Islam N, Turza MSA, Fahim SI, Rahman RM. Single and multi-modal analysis for Parkinson's disease to detect its underlying factors. Hum-Cent Intell Syst. 2024;4(2):316–34.
  34. Rohit SA, Yaswanthram P, Nair PR, Rajendra Prasath S, Akella SV. Prediction of Parkinson's disease using machine learning models—a classifier analysis. Advanced Computing and Intelligent Technologies: Proceedings of ICACIT 2021. Springer; 2021. p. 453–60.
  35. Jain D, Singh V. Feature selection and classification systems for chronic disease prediction: a review. Egyptian Informatics Journal. 2018;19(3):179–89.
  36. Zhang J, Li Y, Gao Y, Hu J, Huang B, Rong S, et al. An SBM-based machine learning model for identifying mild cognitive impairment in patients with Parkinson's disease. J Neurol Sci. 2020;418:117077. pmid:32798842
  37. Zhang J, Bareinboim E. Equality of opportunity in classification: a causal approach. Advances in Neural Information Processing Systems. 2018;31.
  38. Almukadi W, Abdel-Khalek S, Bahaddad AA, Alghamdi AM. Driven early detection of chronic kidney cancer disease based on machine learning technique. PLoS One. 2025;20(7):e0326080. pmid:40663560
  39. Watts J, Khojandi A, Shylo O, Ramdhani RA. Machine learning's application in deep brain stimulation for Parkinson's disease: a review. Brain Sci. 2020;10(11):809. pmid:33139614
  40. Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning (still) requires rethinking generalization. Commun ACM. 2021;64(3):107–15.
  41. Martinez-Eguiluz M, Arbelaitz O, Gurrutxaga I, Muguerza J, Perona I, Murueta-Goyena A, et al. Diagnostic classification of Parkinson's disease based on non-motor manifestations and machine learning strategies. Neural Comput & Applic. 2022;35(8):5603–17.
  42. Malik J, Muthalagu R, Pawar PM. A systematic review of adversarial machine learning attacks, defensive controls, and technologies. IEEE Access. 2024;12:99382–421.
  43. Teodorescu MHM, Yao X. Machine learning fairness is computationally difficult and algorithmically unsatisfactorily solved. In: 2021 IEEE High Performance Extreme Computing Conference (HPEC). 2021. p. 1–8. https://doi.org/10.1109/hpec49654.2021.9622861
  44. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. JAIR. 2002;16:321–57.
  45. Han J, Kamber M. Data mining: concepts and techniques. Morgan Kaufmann; 2006.
  46. Lemaître G, Nogueira F, Aridas CK. Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research. 2017;18(17):1–5.
  47. Vaswani PA, Tropea TF, Dahodwala N. Overcoming barriers to Parkinson disease trial participation: increasing diversity and novel designs for recruitment and retention. Neurotherapeutics. 2020;17(4):1724–35. pmid:33150545
  48. Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A. A review of feature selection methods on synthetic data. Knowl Inf Syst. 2012;34(3):483–519.
  49. Hirose N, Xia F, Martin-Martin R, Sadeghian A, Savarese S. Deep visual MPC-policy learning for navigation. IEEE Robot Autom Lett. 2019;4(4):3184–91.
  50. Hoehn MM, Yahr MD. Parkinsonism: onset, progression and mortality. Neurology. 1967;17(5):427–42. pmid:6067254
  51. Kusner MJ, Loftus J, Russell C, Silva R. Counterfactual fairness. Advances in Neural Information Processing Systems. 2017;30.
  52. Yang K, Stoyanovich J. Measuring fairness in ranked outputs. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management. 2017. p. 1–6. https://doi.org/10.1145/3085504.3085526
  53. Kleinberg J, Raghavan M. Selection problems in the presence of implicit bias. arXiv preprint 2018. https://arxiv.org/abs/1801.03533
  54. Kearns M, Roth A, Wu ZS. Meritocratic fairness for cross-population selection. In: International Conference on Machine Learning. 2017. p. 1828–36.
  55. Joseph M, Kearns M, Morgenstern J, Neel S, Roth A. Fair algorithms for infinite and contextual bandits. arXiv preprint 2016. https://arxiv.org/abs/1610.09559
  56. Liu Y, Radanovic G, Dimitrakakis C, Mandal D, Parkes DC. Calibrated fairness in bandits. arXiv preprint 2017. https://arxiv.org/abs/1707.01875
  57. Madras D, Pitassi T, Zemel R. Predict responsibly: improving fairness and accuracy by learning to defer. Advances in Neural Information Processing Systems. 2018;31.
  58. Doroudi S, Thomas PS, Brunskill E. Importance sampling for fair policy selection. 2017.
  59. Crispino P, Gino M, Barbagelata E, Ciarambino T, Politi C, Ambrosino I, et al. Gender differences and quality of life in Parkinson's disease. Int J Environ Res Public Health. 2020;18(1):198. pmid:33383855
  60. Yoon J-E, Kim JS, Jang W, Park J, Oh E, Youn J, et al. Gender differences of nonmotor symptoms affecting quality of life in Parkinson disease. Neurodegener Dis. 2017;17(6):276–80. pmid:28848156
  61. Heinzel S, Kasten M, Behnke S, Vollstedt E-J, Klein C, Hagenah J, et al. Age- and sex-related heterogeneity in prodromal Parkinson's disease. Mov Disord. 2018;33(6):1025–7. pmid:29570852
  62. Dluzen DE, McDermott JL. Gender differences in neurotoxicity of the nigrostriatal dopaminergic system: implications for Parkinson's disease. J Gend Specif Med. 2000;3(6):36–42. pmid:11253381
  63. Rocha E, Chamoli M, Chinta SJ, Andersen JK, Wallis R, Bezard E, et al. Aging, Parkinson's disease, and models: what are the challenges? Aging Biol. 2023;1:e20230010. pmid:38978807
  64. Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A. Towards deep learning models resistant to adversarial attacks. arXiv preprint 2017. https://arxiv.org/abs/1706.06083
  65. Pillutla K, Kakade SM, Harchaoui Z. Robust aggregation for federated learning. IEEE Trans Signal Process. 2022;70:1142–54.
  66. Huang W, Ye M, Shi Z, Wan G, Li H, Du B, et al. Federated learning for generalization, robustness, fairness: a survey and benchmark. IEEE Trans Pattern Anal Mach Intell. 2024;46(12):9387–406. pmid:38917282
  67. Cowling TE, Cromwell DA, Bellot A, Sharples LD, van der Meulen J. Logistic regression and machine learning predicted patient mortality from large sets of diagnosis codes comparably. J Clin Epidemiol. 2021;133:43–52. pmid:33359319
  68. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. 2017;30.
  69. Ghassemi M, Oakden-Rayner L, Beam AL. The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit Health. 2021;3(11):e745–50. pmid:34711379
  70. Papernot N, McDaniel P, Sinha A, Wellman MP. SoK: security and privacy in machine learning. In: 2018 IEEE European Symposium on Security and Privacy (EuroS&P). 2018. p. 399–414. https://doi.org/10.1109/eurosp.2018.00035
  71. El-Sayed RS. A hybrid CNN-LSTM deep learning model for classification of the Parkinson disease. IAENG International Journal of Applied Mathematics. 2023;53(4).
  72. Azhari M, Alaoui A, Abarda A, Ettaki B, Zerouaoui J. A comparison of random forest methods for solving the problem of pulsar search. In: The Proceedings of the Third International Conference on Smart City Applications. 2019. p. 796–807.
  73. Ramakrishna S, Sajja V, Jhansi Lakshmi P, Bhupal Naik D, Kalluri HK. Student performance monitoring system using decision tree classifier. Machine Intelligence and Soft Computing: Proceedings of ICMISC 2020. Springer; 2021. p. 393–407.
  74. Azhari M, Abarda A, Alaoui A, Ettaki B, Zerouaoui J. Detection of pulsar candidates using bagging method. Procedia Computer Science. 2020;170:1096–101.
  75. Buczak-Stec EW, König H-H, Hajek A. Impact of incident Parkinson's disease on satisfaction with life. Front Neurol. 2018;9:589. pmid:30140250
  76. Barocas S, Hardt M, Narayanan A. Fairness and machine learning. Recommender Systems Handbook. 2020. p. 453–9.
  77. Mehta S, Bhardwaj R. Deep learning meets traditional machine learning: CNN-SVM hybrid models for Parkinson's diagnosis. In: 2025 International Conference on Automation and Computation (AUTOCOM). 2025. p. 1402–6.
  78. Chang JY, Im EG. Data poisoning attack on random forest classification model. In: Proc of SMA. 2020.
  79. West MT, Tsang S-L, Low JS, Hill CD, Leckie C, Hollenberg LCL, et al. Towards quantum enhanced adversarial robustness in machine learning. Nat Mach Intell. 2023;5(6):581–9.
  80. Kanca E, Ayas S, Kablan EB, Ekinci M. Implementation of fast gradient sign adversarial attack on vision transformer model and development of defense mechanism in classification of dermoscopy images. In: 2023 31st Signal Processing and Communications Applications Conference (SIU). 2023. p. 1–4. https://doi.org/10.1109/siu59756.2023.10223851
  81. Kleinberg J, Mullainathan S, Raghavan M. Inherent trade-offs in the fair determination of risk scores. arXiv preprint 2016. https://arxiv.org/abs/1609.05807
  82. Charbuty B, Abdulazeez A. Classification based on decision tree algorithm for machine learning. JASTT. 2021;2(01):20–8.
  83. Li L, Shi Z, Hu X, Dong B, Qin Y, Liu X, et al. T2ISafety: benchmark for assessing fairness, toxicity, and privacy in image generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. 2025. p. 13381–92.
  84. Mishra S, Mallick PK, Tripathy HK, Bhoi AK, González-Briones A. Performance evaluation of a proposed machine learning model for chronic disease datasets using an integrated attribute evaluator and an improved decision tree classifier. Applied Sciences. 2020;10(22):8137.
  85. Alabdulwahab S, Moon B. Feature selection methods simultaneously improve the detection accuracy and model building time of machine learning classifiers. Symmetry. 2020;12(9):1424.
  86. Ren C, Xu Y. Robustness verification for machine-learning-based power system dynamic security assessment models under adversarial examples. IEEE Trans Control Netw Syst. 2022;9(4):1645–54.
  87. Straw I, Wu H. Investigating for bias in healthcare algorithms: a sex-stratified analysis of supervised machine learning models in liver disease prediction. BMJ Health Care Inform. 2022;29(1):e100457. pmid:35470133
  88. Berk R, Heidari H, Jabbari S, Kearns M, Roth A. Fairness in criminal justice risk assessments: the state of the art. Sociological Methods & Research. 2018;50(1):3–44.
  89. Rodriguez D, Nayak T, Chen Y, Krishnan R, Huang Y. On the role of deep learning model complexity in adversarial robustness for medical images. BMC Med Inform Decis Mak. 2022;22(Suppl 2):160. pmid:35725429
  90. Tramer F, Carlini N, Brendel W, Madry A. On adaptive attacks to adversarial example defenses. Advances in Neural Information Processing Systems. 2020;33:1633–45.
  91. Li B, Jiang X, Zhang K, Harmanci AO, Malin B, Gao H, et al. Enhancing fairness in disease prediction by optimizing multiple domain adversarial networks. PLOS Digit Health. 2025;4(5):e0000830. pmid:40445951
  92. PPMI. Parkinson's Progression Markers Initiative. 2010. https://www.ppmi-info.org
  93. Goodfellow IJ, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. arXiv preprint 2014. https://arxiv.org/abs/1412.6572
  94. Nabi R, Shpitser I. Fair inference on outcomes. AAAI. 2018;32(1).
  95. Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern Med. 2018;178(11):1544–7. pmid:30128552
  96. Hort M, Chen Z, Zhang JM, Harman M, Sarro F. Bias mitigation for machine learning classifiers: a comprehensive survey. ACM J Responsib Comput. 2024;1(2):1–52.
  97. Chang H, Shokri R. On the privacy risks of algorithmic fairness. In: 2021 IEEE European Symposium on Security and Privacy (EuroS&P). 2021. p. 292–303. https://doi.org/10.1109/eurosp51992.2021.00028
  98. Grgic-Hlaca N, Zafar MB, Gummadi KP, Weller A. The case for process fairness in learning: feature selection for fair decision making. In: NIPS Symposium on Machine Learning and the Law. vol. 1. Barcelona, Spain; 2016. p. 11.
  99. Corbett-Davies S, Pierson E, Feller A, Goel S, Huq A. Algorithmic decision making and the cost of fairness. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017. p. 797–806. https://doi.org/10.1145/3097983.3098095
  100. Cui G, Wang C. The machine learning algorithm based on decision tree optimization for pattern recognition in track and field sports. PLoS One. 2025;20(2):e0317414. pmid:39946363
  101. Chouldechova A. Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data. 2017;5(2):153–63. pmid:28632438