Fig 1.
Overview of the machine learning pipeline.
A count matrix undergoes pre-processing, including normalization and filtering. The data are then randomly split into training (60%), validation (20%), and test (20%) sets, independently for each cell type. The training set is used to train the models. The validation set provides an initial check of the trained models’ accuracy and is used to tune each model’s hyperparameters. Once the hyperparameters are optimized, the test set is run through each model, and the distribution of F-beta scores across all clusters is used to compare the models.
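The per-cell-type 60/20/20 split described in this caption could be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_per_cell_type`, the rounding rule, and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def split_per_cell_type(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) lists of cell indices.

    Cells are shuffled and split within each cell type, so every type
    contributes roughly 60/20/20 of its cells to the three sets.
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility

    # Group cell indices by their cell-type label.
    by_type = defaultdict(list)
    for idx, label in enumerate(labels):
        by_type[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_type):
        cells = by_type[label]
        rng.shuffle(cells)
        n_train = round(fractions[0] * len(cells))
        n_val = round(fractions[1] * len(cells))
        train += cells[:n_train]
        val += cells[n_train:n_train + n_val]
        test += cells[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy labels: 10 cells of type "A" and 20 cells of type "B".
labels = ["A"] * 10 + ["B"] * 20
train, val, test = split_per_cell_type(labels)
```

Splitting within each cell type keeps small clusters represented in all three sets, which a single global shuffle would not guarantee.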
Table 1.
Machine learning hyperparameters.
Fig 2.
Classification performance using two feature selection methods (CV vs BIN filtering) and five machine learning methods.
For feature selection using the coefficient of variation (CV), the filtering thresholds from left to right are 0.52, 1.5, 2.5, 3.5, and 4.5. For the binary score (BIN), the filtering thresholds from left to right are 0.15, 0.10, 0.05, and 0.01. The resulting number of genes for each threshold is listed below the threshold labels. Each of these feature sets was supplied to five different machine learning methods (LightGBM, Neural Network, SVM, Logistic Regression, Random Forest), which were trained on the training data. F-beta was calculated as a measure of classification accuracy.
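CV-based filtering might look like the sketch below. It assumes that CV is the population standard deviation divided by the mean of a gene's normalized expression, and that genes at or above the threshold are kept; neither detail is stated in the caption. The gene names and values are made up for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    mean = statistics.fmean(values)
    if mean == 0:
        return 0.0
    return statistics.pstdev(values) / mean


def select_genes_by_cv(expression, threshold):
    """Keep genes whose CV is at or above the threshold.

    expression: dict mapping gene name -> list of expression values across cells.
    """
    return [gene for gene, values in expression.items()
            if coefficient_of_variation(values) >= threshold]


# Hypothetical expression values for three genes across four cells.
expression = {
    "geneA": [0.0, 0.0, 0.0, 8.0],  # expressed in one cell only: high CV
    "geneB": [2.0, 2.0, 2.0, 2.0],  # constant expression: CV of 0
    "geneC": [1.0, 3.0, 1.0, 3.0],  # moderate variation
}
selected = select_genes_by_cv(expression, threshold=1.5)
```

Under this assumption, raising the threshold shrinks the feature set, which matches the decreasing gene counts one would expect from left to right in the figure.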
Fig 3.
Classification performance between the default and optimal hyperparameter settings.
The 3.5 CV feature set was used, and models were produced using the default and optimal hyperparameter settings. F-beta was calculated as a measure of classification accuracy. Log2 size is log base 2 of the cluster size. The p-value shown for each method is from a Wilcoxon signed-rank test comparing validation F-beta scores under the default and optimal settings.
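For readers unfamiliar with the test, a paired Wilcoxon signed-rank comparison can be sketched in plain Python as below. This version uses the normal approximation with a tie correction and no continuity correction, which is an assumption; the caption does not say which implementation or variant the authors used, and the approximation is rough for small samples. The score lists are invented.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test using the normal approximation.

    Returns (W, p), where W is the smaller of the positive and negative
    rank sums of the paired differences x[i] - y[i].
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Mean and tie-corrected variance of W under the null hypothesis.
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, p


# Hypothetical per-cluster validation F-beta scores (not from the paper).
default_scores = [0.70, 0.65, 0.80, 0.55, 0.90, 0.60]
optimal_scores = [0.75, 0.72, 0.79, 0.66, 0.93, 0.68]
w, p = wilcoxon_signed_rank(default_scores, optimal_scores)
```

The test is paired because each cluster contributes one score under each setting, so only the per-cluster differences matter.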
Fig 4.
Model performance on training, validation, and test datasets.
A CV threshold of 2.5 was used for Multinomial Logistic Regression and Neural Networks, while a threshold of 3.5 was used for all other models. F-beta was calculated as a measure of classification accuracy for the training, validation, and test datasets. Log2 size is log base 2 of the cluster size. Differences between these distributions indicate the degree of overfitting.
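One simple way to quantify the overfitting this figure shows is the per-cluster gap between training and test F-beta, tabulated against log2 cluster size. The sketch below assumes that summary; the function name, cluster names, and numbers are hypothetical, and the paper itself compares full score distributions.

```python
import math


def overfitting_summary(cluster_sizes, train_scores, test_scores):
    """Return per-cluster rows with log2 size and train-minus-test F-beta gap.

    All three arguments are dicts keyed by cluster name. Rows are sorted by
    log2 cluster size, smallest cluster first.
    """
    rows = []
    for cluster, size in cluster_sizes.items():
        rows.append({
            "cluster": cluster,
            "log2_size": math.log2(size),
            "gap": train_scores[cluster] - test_scores[cluster],
        })
    return sorted(rows, key=lambda row: row["log2_size"])


# Hypothetical clusters: a small one that overfits and a large one that does not.
cluster_sizes = {"c_large": 1024, "c_small": 16}
train_scores = {"c_large": 0.95, "c_small": 1.0}
test_scores = {"c_large": 0.90, "c_small": 0.5}
rows = overfitting_summary(cluster_sizes, train_scores, test_scores)
```

A gap near zero means the model performs as well on unseen cells as on the training cells; a large positive gap suggests overfitting.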
Fig 5.
Classification performance for seven supervised machine learning models with optimal feature selection and hyperparameters.
F-beta was calculated as a measure of classification accuracy. See S1 Table in S1 File for optimal hyperparameter settings.
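For reference, the F-beta score combines precision and recall as (1 + beta^2) * precision * recall / (beta^2 * precision + recall). The sketch below computes it from confusion counts for one cluster. The value of beta is not given in this caption, so it is left as a parameter, and the counts are illustrative.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta score from true positive, false positive, and false negative counts.

    beta < 1 weights precision more heavily; beta > 1 weights recall more heavily.
    Returns 0.0 when there are no true positives.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts for one cluster: precision 0.8, recall 0.5.
score_f1 = f_beta(tp=8, fp=2, fn=8, beta=1.0)
score_precision_weighted = f_beta(tp=8, fp=2, fn=8, beta=0.5)
```

With precision higher than recall, as in this toy case, a beta below 1 gives a higher score than F1 because it rewards precision more.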
Fig 6.
Model performance on the kidney dataset.
Four models (Binary Logistic Regression, Multinomial Logistic Regression, Neural Networks, and LightGBM) were built using the optimal hyperparameters tuned on the MTG dataset. Eight CV thresholds (1.0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, and 7.5) were applied for feature selection. The red horizontal line indicates the median F-beta value at the best CV threshold for a given method on the MTG dataset.
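Choosing the best CV threshold by median F-beta, as the red reference line implies, can be sketched as follows. The helper name and the scores are hypothetical, and the sketch assumes "best" means the highest median across clusters.

```python
import statistics


def best_threshold_by_median(scores_by_threshold):
    """Return (threshold, median) for the threshold with the highest median F-beta.

    scores_by_threshold: dict mapping CV threshold -> list of per-cluster F-beta scores.
    """
    medians = {threshold: statistics.median(scores)
               for threshold, scores in scores_by_threshold.items()}
    best = max(medians, key=medians.get)
    return best, medians[best]


# Hypothetical per-cluster F-beta scores at three CV thresholds for one method.
scores_by_threshold = {
    1.0: [0.5, 0.6, 0.7],
    2.5: [0.8, 0.9, 0.7],
    3.5: [0.6, 0.95, 0.65],
}
best_threshold, best_median = best_threshold_by_median(scores_by_threshold)
```

The median is less sensitive than the mean to a few poorly classified clusters, which may be why it serves as the reference value here.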