Table 1.
Risk factors for intestinal parasitic infections grouped by categories such as demographic, socioeconomic, health, environmental, and hematological factors.
Table 2.
Risk factors for each infection outcome based on univariate (U) and multivariate (M) logistic regression (α = 0.05) and the feature selection methods InfoGain (IG), ReliefF (ReF), Joint Mutual Information (JMI), and Minimum Redundancy Maximum Relevance (MRMR).
Robustly selected features are those that appeared in the top 20 in 95% of 100 feature selection runs for each method. Risk factors are ranked according to their frequency of occurrence across the three approaches. Upward arrows indicate a significant odds ratio greater than 1, and downward arrows indicate a significant odds ratio less than 1. Arrows marked with * lost statistical significance after Benjamini-Hochberg p-value adjustment.
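The robustness criterion in this caption can be illustrated with a minimal sketch. It assumes that "robust" means a feature appears in the top 20 of at least 95% of the repeated runs, which is one reading of the caption. The function name `robust_features` and the toy rankings are hypothetical and are not taken from the study.

```python
from collections import Counter


def robust_features(runs, top_k=20, min_fraction=0.95):
    """Return features found in the top_k of at least min_fraction of runs.

    runs: list of ranked feature lists (best feature first), one per run.
    This is an illustrative reading of the caption, not the authors' code.
    """
    if not runs:
        return []
    counts = Counter()
    for ranking in runs:
        # Count each feature at most once per run.
        counts.update(set(ranking[:top_k]))
    n_runs = len(runs)
    return sorted(f for f, c in counts.items() if c / n_runs >= min_fraction)


# Toy example with hypothetical feature names and top_k=2 for brevity:
# "a" is in the top 2 of all 100 runs, "b" of 95 runs, "c" of only 5 runs.
toy_runs = [["a", "b", "c"]] * 95 + [["a", "c", "b"]] * 5
selected = robust_features(toy_runs, top_k=2)
```

Here `selected` keeps "a" and "b" and drops "c", since only the first two reach the 95% threshold.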
Fig 1.
Heatmaps illustrating the accuracy scores for different feature selection and classifier combinations using SMOTE, for infection by (a) any parasite, (b) any helminth, (c) protozoan, and (d) any STH.
Green indicates high accuracy, while red indicates low accuracy. Feature selection options include all features or the top 20 features selected through Joint Mutual Information (JMI-20), Minimum Redundancy Maximum Relevance (MRMR-20), InfoGain (IG-20), and ReliefF (ReF-20). Classifiers include Logistic Regression (LR), Random Forests (RF), Support Vector Machines (SVM), and XGBoost (XGB).
Fig 2.
Receiver operating characteristic (ROC) curves for different classifiers using the best feature selection method and the best hyperparameters for each infection outcome with SMOTE, for infection by (a) any parasite, (b) any helminth, (c) protozoan, and (d) any STH.
Classifiers include Logistic Regression (LR), Random Forests (RF), Support Vector Machines (SVM), and XGBoost (XGB). The blue dashed line represents the ROC curve for a random guess. The area under the curve (AUC) scores are (a) LR: 0.52, RF: 0.56, SVM: 0.50, XGB: 0.54, (b) LR: 0.47, RF: 0.51, SVM: 0.50, XGB: 0.48, (c) LR: 0.70, RF: 0.73, SVM: 0.50, XGB: 0.58, and (d) LR: 0.54, RF: 0.62, SVM: 0.50, XGB: 0.52.
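To help interpret the AUC values, the following minimal sketch computes AUC as the probability that a randomly chosen positive sample is scored above a randomly chosen negative sample, counting ties as one half. The function name `roc_auc` and the toy labels and scores are hypothetical. This is an illustration of the metric, not the evaluation code used in the study.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 class labels; scores: matching classifier scores.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): three of the four positive-negative pairs are ordered correctly.
auc_toy = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# A scorer that gives every sample the same score cannot rank at all.
auc_constant = roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])
```

Under this definition, a value of 0.5 corresponds to the random-guess diagonal, and values below 0.5 mean negatives tend to be ranked above positives.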
Table 3.
The top five rules based on association rule learning with SMOTE for each infection outcome.
For each infection, the five rules with the highest lift values are chosen and sorted. The combinations of risk factors specified on the left lead to the given infection.
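As a reminder of how rules are ranked by lift, the sketch below computes the lift of a rule "antecedent → consequent" as support(antecedent and consequent) divided by the product of the individual supports. The function name `lift`, the item names, and the toy records are hypothetical and do not come from the study data.

```python
def lift(records, antecedent, consequent):
    """Lift of the rule antecedent -> consequent over a list of item sets.

    lift = P(A and C) / (P(A) * P(C)); values above 1 mean the antecedent
    co-occurs with the consequent more often than expected under independence.
    """
    a, c = set(antecedent), set(consequent)
    n = len(records)
    n_a = sum(1 for r in records if a <= r)
    n_c = sum(1 for r in records if c <= r)
    n_ac = sum(1 for r in records if (a | c) <= r)
    if n_a == 0 or n_c == 0:
        return 0.0  # rule is undefined on this data; treated as no association
    return (n_ac * n) / (n_a * n_c)


# Toy records (hypothetical items): each set lists the factors present for one person.
toy_records = [
    {"factor_x", "infected"},
    {"factor_x", "infected"},
    {"factor_y"},
    {"factor_y", "infected"},
]
lift_x = lift(toy_records, {"factor_x"}, {"infected"})
lift_y = lift(toy_records, {"factor_y"}, {"infected"})
```

In this toy example `lift_x` exceeds 1 and `lift_y` falls below 1, so a ranking by lift would place the first rule higher.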