Suboptimal capability of individual machine learning algorithms in modeling small-scale imbalanced clinical data of local hospital

In recent years, artificial intelligence (AI) has shown promising applications in various scientific domains, including biochemical analysis research. However, the effectiveness of AI in modeling small-scale, imbalanced datasets remains an open question in such fields. This study explores the capabilities of eight basic AI algorithms, including ridge regression, logistic regression, random forest regression, and others, in modeling a small, imbalanced clinical dataset (total n = 387, class 0 = 27, class 1 = 360) comprising biochemical blood test records of patients with multiple wasp stings (MWS). Through rigorous evaluation using k-fold cross-validation and comprehensive scoring, we found that none of the models could effectively model the data. Even after fine-tuning the hyperparameters of the best-performing models, the results remained below acceptable thresholds. The study highlights the challenges of applying AI to small-scale datasets with imbalanced groups in biochemical or clinical research and emphasizes the need for novel algorithms tailored to small-scale data. The findings also call for further exploration into techniques such as transfer learning and data augmentation, and they underline the importance of understanding the minimum dataset scale required for effective AI modeling in biochemical contexts.


Introduction
Due to strong performance in generative and representative modeling for scientific problems, the application of Artificial Intelligence for Science (AI4Science) has become increasingly attractive in many traditional research fields [1][2][3]. This trend is driven by the unprecedented ability of AI algorithms to uncover complex patterns, make predictions, and generate new insights from vast and intricate datasets. In biochemical and bioinformatical studies in particular, many notable works have emerged in recent years, encompassing a wide range of applications including drug discovery [4,5], protein structure analysis [6,7], traditional medicines [8][9][10], genetic sequence analysis [11], disease prediction [12,13], and pandemic management [14]. For instance, deep learning techniques have been employed to predict protein structures with remarkable accuracy, revolutionizing the field of structural biology [7]. Similarly, machine learning models such as random forests and support vector machines have been instrumental in identifying potential biomarkers for various diseases, thereby aiding early diagnosis and personalized treatment [11,15]. The convergence of AI technologies with traditional biochemical and bioinformatical methods is thus heralding a new era in scientific exploration and discovery, promising to reshape the landscape of biomedical research in the coming years.
However, many of these AI models require large-scale datasets, or big data, for sufficient training and efficient generalization [16]. The necessity for vast amounts of data usually stems from the complexity of the models and the intricate relationships they are designed to capture [17]. In the context of biochemical and bioinformatical studies, this often means needing detailed and comprehensive molecular structures, genetic sequences, or clinical records. Therefore, while big data has enabled unprecedented advancements in these fields, it also presents significant challenges. The collection, preprocessing, and management of such extensive datasets can be resource-intensive and time-consuming [17]. Moreover, issues related to data privacy, security, and standardization further complicate the utilization of big data in AI-driven research [18]. The requirement for large-scale data also raises concerns about the applicability of AI models in scenarios where only limited data are available; this is frequently the case in biochemical or clinical research. For clinicians and researchers in a local or regional hospital with limited patient records, this problem becomes even more severe.
Thus, a pressing question remains: what types of AI algorithms can be effectively applied to model small-scale biochemical data? In other words, is it feasible to directly employ individual basic AI models to achieve satisfactory modeling outcomes? Addressing this question is crucial, as not all research scenarios have access to large datasets. The ability to glean meaningful insights from limited data is vital for expanding the applicability of AI in scientific research. If basic AI models prove inadequate for the task, it would be prudent to exercise caution in relying solely on them. Such shortcomings warrant a deeper investigation into the root causes and may necessitate the development of innovative solutions, such as generative models [15,19,20], transfer learning [21,22], or the use of large language models [23].
To explore this, we adopted a small dataset collected from clinical practice, comprising the biochemical blood test (BBT) results and biophysical conditions of patients with multiple wasp stings (MWS). The size of this dataset is modest (n = 387 after preparation), reflecting a common challenge in specialized medical research. We examined the modeling capability of eight fundamental machine learning algorithms with this small-scale dataset, aiming to assess their performance and identify potential strategies for effective modeling in data-constrained environments of biochemical research.

Materials and methods
In this study, anonymized BBT results and biophysical conditions from 408 individuals diagnosed with MWS were analyzed, covering records from June 2016 to May 2023 (data accessed on 30 May 2023). After an initial assessment of the unprocessed data, 21 patient records were omitted due to noticeable mistakes (such as indecipherable content) or improper entries. Within those 21 records, 6 patients were admitted to the hospital as a result of anaphylactic shock from wasp stings. Among them, 4 patients received initial treatment at local clinics near the sting location, and 2 patients were attended to in an ambulance. Since the shock symptoms had ameliorated by the time of hospital admission, these cases were not considered in the comparative analysis.
Within the dataset, each observation consists of 21 attributes: 18 results from BBT (utilized to assess biochemical conditions, such as ALT, CK, etc.) and 3 physical conditions, specifically the counts of wasp stings on the head, limbs, and trunk. Accompanying these features, a label is assigned to delineate the ultimate clinical outcome (either survival or death). The structure of the dataset and the full names of the BBT items can be found in Fig 1. The objective of the modeling process is to enable AI algorithms to discern the patterns within a subset of these features (the training set) and subsequently predict the label for previously unseen data (the test or validation set).
To achieve this goal, the dataset was partitioned into two distinct segments: a training set and a test set. The training set was employed to instruct the models, a process also known as supervised learning, while the test set was reserved for the models to forecast the labels for features not encountered during the training phase. To reduce the uncertainty of randomly choosing the training and test sets, k-fold cross-validation was implemented to provide a robust evaluation of the models [24]. This technique involves dividing the data into 'k' subsets, or folds (each containing 387/k samples), and iteratively training the model on 'k-1' folds (i.e., 387*(k-1)/k samples) while validating on the remaining fold. This process ensures a more comprehensive assessment of the model's performance across different subsets of the data. Following the training and validation stages, the model's performance was assessed using the Mean Squared Error (MSE), the R2 score, and other scoring metrics. To account for the imbalanced label categories, stratified sampling was adopted in the k-fold process, so that each fold contains the same ratio of minority labels as the whole dataset. The entire workflow of this procedure is depicted in Fig 2.
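A minimal sketch of the stratified k-fold procedure described above, using Scikit-learn's StratifiedKFold; the synthetic feature matrix X and label vector y merely stand in for the clinical data, which is not reproduced here, and the model choice is illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(387, 21))         # 21 attributes per patient, synthetic
y = np.array([0] * 27 + [1] * 360)     # imbalanced labels, as in the study

k = 5
skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    # each fold preserves the 27:360 minority-to-majority label ratio
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(f"mean accuracy over {k} folds: {np.mean(scores):.3f}")
```

Because the split is stratified, every validation fold contains roughly 27/k minority samples, which is what makes the per-fold specificity estimates reported later meaningful.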
A total of eight AI regression models were utilized in this study: ridge regression, logistic regression, random forest (RF) regression, gradient boosting (GB) regression, Gaussian Naïve Bayes (GNB), linear Support Vector Machine (SVM), Radial Basis Function (RBF) kernel SVM, and multi-layer perceptron regression (an artificial neural network, ANN). These models were individually trained and assessed in a systematic manner to identify the best-performing ones. We first used the models to perform a classification task, setting a threshold (0.5) to convert all calculated probabilities into class labels; to keep the evaluation fair, no threshold-moving techniques were involved in the current study. After evaluating the classification performance of the models, we also performed regression tasks to judge the models comprehensively. Subsequently, the hyperparameters of the selected models were meticulously tuned to further refine and optimize their modeling capabilities. All models were compiled using the Scikit-learn package of Python [25], with default settings (using package-defined hyperparameters).
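The "model bag" of eight estimators with package-defined defaults, and the fixed 0.5 threshold, can be sketched as follows; the dictionary keys are informal labels, not an official naming.

```python
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPRegressor

# The eight models, instantiated with Scikit-learn's default hyperparameters
models = {
    "ridge": Ridge(),
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestRegressor(),
    "gradient_boosting": GradientBoostingRegressor(),
    "gaussian_nb": GaussianNB(),
    "linear_svm": SVC(kernel="linear", probability=True),
    "rbf_svm": SVC(kernel="rbf", probability=True),
    "ann": MLPRegressor(max_iter=500),
}

def to_class(p, threshold=0.5):
    """Binarize continuous outputs at the fixed threshold (no threshold moving)."""
    return (np.asarray(p) >= threshold).astype(int)
```

Each entry is then trained and cross-validated in turn, with `to_class` applied to regressor outputs before computing classification metrics.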
To fine-tune the models, we identified several key hyperparameters that significantly affect model performance (e.g., the maximum depth of the random forest, which can influence overfitting), according to the Scikit-learn package. These parameters were then systematically adjusted with the same data while continuously monitoring relevant performance metrics. When multiple hyperparameters were involved, their impact on performance was visualized using color maps. Conversely, when only a single hyperparameter required tuning, we plotted its relationship with the performance metrics using line graphs.
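A two-hyperparameter sweep of the kind visualized with color maps might look like the following sketch for the random forest (max depth versus number of estimators); the data and the particular grid values are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(387, 21))          # synthetic stand-in for the BBT data
y = np.array([0] * 27 + [1] * 360).astype(float)

depths = [2, 4, 8]
n_trees = [50, 100]
grid = np.zeros((len(depths), len(n_trees)))
for i, d in enumerate(depths):
    for j, n in enumerate(n_trees):
        model = RandomForestRegressor(max_depth=d, n_estimators=n, random_state=0)
        # mean cross-validated R2 score for this hyperparameter pair
        grid[i, j] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
# 'grid' can then be rendered as a color map, e.g. with matplotlib's imshow
```

A single-parameter sweep reduces this to one loop, with the scores plotted as a line graph against the parameter values.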

Results and discussion
A significant characteristic of this dataset is the imbalance between the number of negative labels (death, n = 27) and positive labels (survival, n = 360), with most patients surviving after treatment. This imbalance reflects a common scenario in general biochemical research, where results often exhibit an uneven distribution. Such label imbalance poses a significant challenge for machine learning algorithms, potentially leading to a model biased towards the majority class, in this case survivors. Such a bias could undermine the model's sensitivity in identifying the minority class, which is often the critical group in medical prognosis. Furthermore, the skewed distribution may compromise the model's generalizability to other patient cohorts with a more balanced outcome distribution. To address these issues, techniques such as oversampling the minority class or employing cost-sensitive learning algorithms are often recommended [26]. However, these methods rely on a certain level of prior knowledge, and there has been little discussion of clinical datasets, where introducing any prior knowledge should be treated with far more caution. We will show in the following part that failure to account for label imbalance can result in misleading performance metrics and limit the clinical applicability of the model.
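As a concrete example of the cost-sensitive approach mentioned above, Scikit-learn's `class_weight="balanced"` option reweights errors inversely to class frequency; this sketch uses synthetic stand-in data and introduces no prior knowledge beyond the label frequencies themselves.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(387, 21))         # synthetic stand-in for the BBT features
y = np.array([0] * 27 + [1] * 360)     # 27 deaths vs 360 survivors

# 'balanced' sets each class weight to n_samples / (n_classes * class_count)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

weights = 387 / (2 * np.bincount(y))   # the weights 'balanced' computes
print(weights)  # class 0 (death) is weighted ~13x more than class 1
```

Note that this reweighting assumes the observed class ratio is representative of deployment conditions, which is exactly the kind of prior assumption the text cautions about for clinical data.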
It's worth noting that the issue extends beyond mere algorithmic bias.In a clinical setting, a model biased towards predicting survival could lead to inadequate resource allocation, as the healthcare system may underestimate the number of patients requiring intensive care.Moreover, the ethical implications of such a bias cannot be ignored, as it could result in unequal treatment of patients based on flawed predictions.We have further considered the real-world implications of label imbalance in clinical settings.For diseases where the survival rate is naturally high or low, it's crucial that machine learning models are robust enough to handle such imbalances without losing predictive power.This is especially pertinent in resource-constrained healthcare systems where accurate prognosis is vital for effective resource allocation.
We initially evaluated the accuracy of the models using k-fold cross-validation. Fig 3 illustrates the relationship between the average accuracy and its variance (shaded regions) against the k values for the eight models. As depicted in the figure, ridge regression and linear SVM produced the most desirable outcomes, characterized by the highest accuracies. Conversely, GB and ANN demonstrated the worst performance, with the lowest accuracies and the highest variances. However, all models' accuracies are higher than 0.8, a relatively acceptable value (note that these values are only valid for this dataset). A model with high accuracy but low specificity may nonetheless lead to unnecessary treatments or interventions, thereby straining healthcare resources. Moreover, in the context of managing patients with MWS, the accurate prediction of True Negatives (TN) takes on heightened significance for optimizing patient outcomes. Accurate TN predictions serve as a crucial decision point for healthcare providers, enabling the targeted allocation of medical resources right from the moment of a patient's admission. By correctly identifying these TN cases, healthcare systems can proactively deploy interventions, therapies, or additional monitoring, thereby substantially improving the likelihood of patient survival. In essence, a high rate of TN predictions not only conserves valuable healthcare resources but also serves as a pivotal factor in enhancing patient survival rates.
To evaluate the classification capability for the TN, we adopted the specificity score, which can be defined as

Specificity = True Negative (TN) / (True Negative (TN) + False Positive (FP))    (5)
We conducted an evaluation of model specificity employing k-fold cross-validation, the results of which are presented in Fig 7. This figure delineates the relationship between average specificities and their associated variances across varying k-values for the eight models under consideration. Notably, the RBF SVM model exhibited suboptimal performance, manifesting both the lowest specificity (equal to 0) and the highest misclassification among the models. In contrast, GNB emerged as the most effective in terms of specificity. However, it is important to note that even the best-performing model achieved a specificity of only approximately 0.6. Detailed insights into the classification performance for both TN and TP are available in the confusion matrices displayed in Fig 8. A closer examination reveals a pronounced rate of misclassification for instances with a true label of 0 (died). In summary, the elevated values of accuracy, precision, and F1 score observed in our models can be attributed to the models' propensity to predict most instances as positive labels. Given the scarcity of negative labels in the dataset, the cost of misclassifying them is minimal, leading to inflated values for these metrics. However, the low specificity underscores the models' limitations in correctly identifying negative cases. These findings collectively highlight the significant impact of label imbalance in small-scale datasets on the performance of machine learning algorithms, cautioning against sole reliance on commonly used metrics like accuracy, precision, and F1 score for model evaluation.
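Specificity (Eq 5) can be computed directly from a confusion matrix; the labels below are illustrative values, not the study's data, chosen to show how a majority-class bias inflates accuracy while specificity stays low.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # three true negatives (died)
y_pred = np.array([0, 1, 1, 1, 1, 1, 1, 1])  # model favors the positive class

# for binary labels, ravel() yields (TN, FP, FN, TP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(specificity, accuracy)  # 0.333... vs 0.75
```

Here accuracy looks respectable while only one of three negatives is caught, mirroring the divergence between Fig 3 and Fig 7 described above.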
Having extensively evaluated the performance of various classification models in terms of their ability to accurately identify TP and TN, we also need to evaluate the performance of the regression modeling. The objective of this subsequent analysis is to delve into the continuous variables that may provide more detailed information about the probabilities of patient outcomes or patient survival rates, thereby providing a more nuanced understanding of the factors at play during the healthcare process.
In assessing the regression performance, we began with the MSE via k-fold cross-validation. The R2 scores, reflecting the variance in the dependent variable explained by the independent variables, were next on our evaluation list. Fig 11 indicates that only the RF and RBF SVM models achieved R2 scores above zero, implying a reasonable fit for regression tasks. In contrast, GB and ridge regression, which performed adequately in classification, saw a decline in regression validation.
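The regression-side scoring (MSE and R2, as in Figs 9-11) can be reproduced in a few lines via cross-validation; again, X and y are synthetic stand-ins for the clinical data, and the model choice is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(387, 21))
y = np.array([0] * 27 + [1] * 360).astype(float)  # labels treated as targets

model = RandomForestRegressor(random_state=0)
# Scikit-learn maximizes scores, so MSE is exposed as its negative
mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
print(f"MSE = {mse:.3f}, R2 = {r2:.3f}")
```

An R2 at or below zero, as observed for most models in Fig 11, means the regressor explains no more variance than simply predicting the mean label.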
Hyperparameter tuning for RF and RBF SVM is detailed in Fig 12, targeting key parameters such as max depth, number of estimators, gamma, and C. Yet their peak performance did not cross the 0.5 mark, questioning their predictive reliability beyond random chance. This suggests a possible mismatch between model complexity and the nuanced nature of the studied clinical data.
For the GNB model, despite its classification promise, fine-tuning of the prior ratio and variance smoothing did not yield effective modeling of the dataset, as shown in Fig 13. This difficulty with small, imbalanced datasets may stem from GNB's sensitivity to the underlying data distribution.
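The GNB fine-tuning can be sketched as a sweep over the two hyperparameters named above, `priors` and `var_smoothing`; the particular grid values and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(387, 21))
y = np.array([0] * 27 + [1] * 360)

results = {}
for p0 in [0.1, 0.3, 0.5]:           # assumed prior probability of class 0
    for vs in [1e-9, 1e-6, 1e-3]:    # variance smoothing
        gnb = GaussianNB(priors=[p0, 1 - p0], var_smoothing=vs)
        results[(p0, vs)] = cross_val_score(gnb, X, y, cv=5).mean()

best = max(results, key=results.get)
print("best (prior_0, var_smoothing):", best)
```

The resulting score grid is what the color map in Fig 13 visualizes; note that fixing `priors` injects exactly the kind of prior knowledge the discussion flags as delicate for clinical data.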
The results suggest that the inherent complexity of the clinical data, possibly due to its multifaceted nature and the imbalance in the dataset, poses significant challenges for modeling. The small sample size and the disproportionate label distribution likely contribute to the models' inability to generalize effectively, which is a common hurdle in clinical data analysis where outcomes are often skewed towards one class. This underscores the need for more sophisticated or tailored modeling approaches that can handle such data intricacies. Compared with other studies, although the accuracies of the models in this study match contemporary work, the specificity in particular can only reach values around 0.4, which is below the values reported in the references [27,28]. However, it should be noted that the referred studies used large-scale, balanced datasets, which likely benefits modeling capability.
In the context of this study, the exploration of alternative methodologies may provide avenues for further investigation. Techniques such as transfer learning [29], data augmentation [20,30], and synthetic data generation [31] could potentially be employed to leverage existing data and knowledge for new tasks. Transfer learning, for instance, allows a model trained on one task to be adapted for a second, related task, potentially mitigating the need for extensive data in the target domain. Data augmentation techniques, which create variations of the existing data, and synthetic data generation, which creates entirely new data samples, could also enhance the training process.
However, these advanced techniques often still necessitate a certain level of prior knowledge or a sufficient size of the initial dataset for effective implementation. The balance between data sufficiency and model complexity is a nuanced challenge, particularly in the field of biochemical research, where data may be imbalanced, scarce, or expensive to obtain for a local regional hospital. Moreover, understanding the boundary of the dataset scale and determining the precise quantity of data required for training an AI model in this specific domain remains an open question. This issue underscores the need for further study to optimize model performance.

This work, while comprehensive, has certain limitations. First, we deliberately refrained from integrating prior knowledge into our modeling process to better mirror real-world scenarios. However, it is generally acknowledged that incorporating prior knowledge can significantly enhance model performance in various tasks. Incorporating clinically informed prior knowledge, in particular, may prove beneficial in future iterations of this study. Second, our study utilized a single-center dataset, which may inherently carry biases due to regional variations. Such biases could potentially limit the generalizability of our findings.
Furthermore, the field of AI is evolving at an unprecedented pace, with new models and methodologies emerging continuously. This study, by focusing on the evaluation of eight fundamental algorithms, may not encompass the full spectrum of advancements in the field. Such a focus, while providing valuable insights, also presents an inherent limitation. Future research could benefit from exploring a wider array of cutting-edge techniques, such as clinically informed rescaling, to capture the rapidly changing landscape of AI technology more comprehensively.

Conclusion
In summary, this study employed a small clinical dataset with imbalanced classes, encompassing both biochemical and physical features, to scrutinize the capabilities of eight fundamental AI algorithms in supervised learning tasks. Despite the promising advances in AI, our findings demonstrate that all eight models with default settings failed to effectively model the data in validation processes. Even after fine-tuning the hyperparameters of the best-performing models, the results remained below acceptable thresholds. The results reveal that many AI algorithms, as they stand today, may not be the panacea for all challenges in biochemical research.
These findings prompt a critical reflection on the current state of AI in the field and suggest that more concerted efforts are needed.Future work may focus on preparing large-scale datasets, exploring techniques like transfer learning, data augmentation, or synthetic data generation, or developing novel algorithms specifically tailored to small-scale, imbalanced datasets.

Fig 2.
Fig 2. The process of analyzing the clinical biochemical and physical data for evaluating the AI models. After data pre-preparation, the data were fed into one selected model. Each model in the model bag was trained and k-fold cross-validated. Several best-performing models were then fine-tuned to obtain the best capability. https://doi.org/10.1371/journal.pone.0298328.g002

Fig 3.
Fig 3. The relationship between the average accuracies (the higher the better) and their variance against the k values for the eight models, from (a) ridge regression to (h) ANN. The titles above the subfigures indicate the models' names. K values also indicate the size of the validation set: the smaller the validation set, the higher the k value. All models show a good mean accuracy (>0.8). https://doi.org/10.1371/journal.pone.0298328.g003

Fig 4. Fig 5.
Fig 4. The recalls of all the AI models with different training samples (the higher the better). The titles above the subfigures indicate the models' names. K values also indicate the size of the validation set: the smaller the validation set, the higher the k value. All models show a good mean recall score (>0.8). https://doi.org/10.1371/journal.pone.0298328.g004

Fig 6.
Fig 6. The F1 scores of all the AI models with different training samples (the higher the better). The titles above the subfigures indicate the models' names. K values also indicate the size of the validation set: the smaller the validation set, the higher the k value. All F1 scores are reasonably high because of the high accuracy, precision, and recall. https://doi.org/10.1371/journal.pone.0298328.g006

Fig 9 plots the average MSE against k values for the eight models, showing ridge regression and linear SVM with the highest MSEs, indicative of suboptimal performance. RF, GB, and RBF SVM, however, registered the lowest MSEs, with RF and GB exhibiting the least variance, suggesting more stable predictions. Similar results can be found in Fig 10, which shows the mean absolute errors (MAE) of each model.

Fig 7. Fig 8.
Fig 7. The specificity scores of all the AI models with different training samples (the higher the better). All models' specificities are lower than the acceptable threshold (normally 0.6, indicating that 60% of true negative labels can be successfully identified). https://doi.org/10.1371/journal.pone.0298328.g007

Fig 9. Fig 10.
Fig 9. The mean squared errors (MSE) of all the AI models with different training samples. The regression modeling is complementary to the classification. The titles above the subfigures indicate the models' names. K values also indicate the size of the validation set: the smaller the validation set, the higher the k value. All MSEs are below 0.5, which is in accordance with the classification tasks. https://doi.org/10.1371/journal.pone.0298328.g009

Fig 11. Fig 12.
Fig 11. The R2 scores of all the AI models with different training samples. Because the models cannot effectively model the patterns of true negative labels, the R2 scores are below the acceptable threshold (normally 0.5, indicating a much better performance than random guessing). https://doi.org/10.1371/journal.pone.0298328.g011

Fig 13.
Fig 13. The fine-tuning of the GNB model (the one with the highest specificity score in the k-fold cross-validation tests) regarding the R2 scores. Unfortunately, the GNB's R2 is below 0. The hyperparameters for the Gaussian NB model are the prior ratio and the var smoothing. By changing these two parameters, different R2 scores can be extracted and shown in the color map. https://doi.org/10.1371/journal.pone.0298328.g013