Table 1.
Accession numbers of HVR I sequences retrieved from the GenBank database.
Fig 1.
Framework for machine learning.
Data are collected from the database and pre-processed with techniques such as one-hot encoding, which transform the data and improve their quality. The resulting data are split into a training set and a testing set. The training set is used to train the ML model, while the testing set is used to evaluate the model's predictions.
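The workflow in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy sequences, the labels, the 25% test fraction, and all function names are hypothetical, and a simple nearest-centroid classifier stands in for the classifiers compared in the tables (SVM, LDA, QDA, random forest) so that the example needs only the standard library.

```python
import random

ALPHABET = "ACGT"  # assumed nucleotide alphabet; other symbols encode as all zeros


def one_hot(seq):
    """Flatten a DNA sequence into a vector with four 0/1 entries per base."""
    vec = []
    for base in seq.upper():
        vec.extend(1 if base == ref else 0 for ref in ALPHABET)
    return vec


def train_test_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle sample indices and hold out a fraction of them for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(len(idx) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


def fit_centroids(X, y):
    """Stand-in 'model': the mean encoded vector of each class."""
    sums, counts = {}, {}
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0] * len(vec))
        for j, v in enumerate(vec):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))
    return min(centroids, key=dist)


# Hypothetical toy data: equal-length sequences with two made-up class labels.
sequences = ["AAAAAAAA", "AAAAAAAC", "AAAAACAA", "AACAAAAA",
             "GGGGGGGG", "GGGGGGGT", "GGGGTGGG", "GTGGGGGG"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X = [one_hot(s) for s in sequences]
X_train, y_train, X_test, y_test = train_test_split(X, labels)

model = fit_centroids(X_train, y_train)
predictions = [predict(model, vec) for vec in X_test]
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Real HVR I sequences vary in length, so they would need to be aligned or padded to a common length before one-hot encoding yields vectors of equal size.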
Table 2.
Accession numbers of HVR I sequences retrieved from GenBank for dataset 2.
Table 3.
AMOVA showing genetic variation.
Table 4.
Pairwise fixation index (FST) values of population differentiation due to genetic structure and p-values.
Table 5.
Summary of the diversity and neutrality indices calculated for population groups.
Table 6.
Comparison of 5-fold CV accuracy measures on the dataset.
Table 7.
Confusion matrix table of the PCA-SVM test performed on the dataset without PCA and 5-fold CV.
Table 8.
5-fold CV accuracy measures on dataset 2.
Table 9.
Comparison of machine learning models using one-hot encoding and BoW, without PCA.
Table 10.
Comparison of machine learning models using one-hot encoding and BoW, with PCA.
Fig 2.
Confusion matrix results generated with one-hot encoding, BoW and PCA on the dataset.
Numbers 0, 1 and 2 on the x- and y-axes represent the African, Asian and Caucasian race groups, respectively. The values in the matrix denote the number of correct and incorrect predictions made by the classifiers: (a) Support vector machine, (b) Linear discriminant analysis, (c) Quadratic discriminant analysis and (d) Random forest.
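How such a matrix is tallied from true and predicted labels can be sketched as follows. The label vectors are invented for illustration and are not taken from the study; the sketch also assumes that rows hold the true class and columns the predicted class, which the caption does not specify.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """Count predictions per (true class, predicted class) pair.

    Assumed layout: rows are true classes, columns are predicted classes.
    """
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical labels using the caption's coding: 0, 1 and 2 for the three groups.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)

# Diagonal entries are correct predictions; off-diagonal entries are errors.
correct = sum(cm[i][i] for i in range(3))
accuracy = correct / len(y_true)

for row in cm:
    print(row)
print(f"accuracy: {accuracy:.3f}")
```

Under this layout, each row sum is the number of test samples in that group, so per-group recall is the diagonal entry divided by its row sum.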
Fig 3.
Confusion matrix results generated with one-hot encoding and BoW, without PCA, on the dataset.
Numbers 0, 1 and 2 on the x- and y-axes represent the African, Asian and Caucasian race groups, respectively. The values in the matrix denote the number of correct and incorrect predictions made by the classifiers: (a) Support vector machine, (b) Linear discriminant analysis, (c) Quadratic discriminant analysis and (d) Random forest.
Table 11.
Machine learning algorithms with one-hot encoding and PCA using Python.
Table 12.
Machine learning algorithms with one-hot encoding and BoW, without PCA, using Python.
Table 13.
Comparison of machine learning models using one-hot encoding and BoW, without PCA, on dataset 2.
Table 14.
Comparison of machine learning models using one-hot encoding and BoW, with PCA, on dataset 2.
Table 15.
Accuracy measures on a new independent dataset without PCA.
Table 16.
Accuracy measures on a new independent dataset with PCA.
Table 17.
Race group classification accuracy (%) results from Python and WEKA.