The application of machine learning to predict genetic relatedness using human mtDNA hypervariable region I sequences

doi:10.1371/journal.pone.0263790

Table 1.

Accession numbers of HVR I sequences retrieved from the GenBank database.

Fig 1.

Framework for machine learning.

Data are collected from the database and undergo pre-processing, such as one-hot encoding, to transform the data and improve their quality. The resulting data are split into a training set and a testing set. The training set is used to train the ML model, while the testing set is used to evaluate the model and make predictions.
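
The pre-processing and splitting steps in this framework can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the four-letter alphabet, the all-zero handling of ambiguous bases, the 80/20 split ratio, and the helper names `one_hot_encode` and `train_test_split` are all assumptions made for the example, and the sequences are made-up toy strings.

```python
import random

ALPHABET = "ACGT"  # assumption: only the four unambiguous bases are encoded


def one_hot_encode(sequence):
    """Flatten a DNA string into a 0/1 vector, four positions per base."""
    vector = []
    for base in sequence.upper():
        # Bases outside ALPHABET (e.g. N or gaps) become an all-zero group.
        vector.extend(1 if base == symbol else 0 for symbol in ALPHABET)
    return vector


def train_test_split(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into training and testing sets."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(len(samples) * test_fraction)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


# Toy sequences and labels for illustration only (not data from the paper).
sequences = ["ACGT", "ACGA", "TTGA", "TCGA", "ACTT"]
labels = [0, 0, 1, 1, 2]

encoded = [one_hot_encode(s) for s in sequences]
train, test = train_test_split(encoded, labels, test_fraction=0.2)
```

Each sequence of length n becomes a vector of length 4n, so all sequences must be aligned to the same length before a classifier can be trained on them.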

Table 2.

Accession numbers of HVR I sequences retrieved from GenBank for dataset 2.

Table 3.

AMOVA showing genetic variation.

Table 4.

Pairwise fixation index (F_ST) values of population differentiation due to genetic structure and p-values.

Table 5.

Summary of the diversity and neutrality indices calculated for population groups.

Table 6.

Comparison of 5-fold CV accuracy measures on the dataset.

Table 7.

Confusion matrix table of the PCA-SVM test performed on the dataset without PCA and 5-fold CV.

Table 8.

5-fold CV accuracy measures on dataset 2.

Table 9.

Comparison of machine learning models using one-hot encoding and BoW, without PCA.

Table 10.

Comparison of machine learning models using one-hot encoding and BoW, with PCA.

Fig 2.

Confusion matrix results generated with one-hot encoding, BoW and PCA on the dataset.

Numbers 0, 1 and 2 on the X and Y axes represent the African, Asian and Caucasian race groups, respectively. The values in the matrix denote the number of correct and incorrect predictions made by each classifier: (a) Support vector machine, (b) Linear discriminant analysis, (c) Quadratic discriminant analysis and (d) Random forest.
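
To make the reading of such a matrix concrete, the sketch below tallies a three-class confusion matrix from true and predicted labels. It assumes rows index the true class and columns the predicted class (the caption does not state which axis is which), and the label lists are invented for illustration, not results from the paper.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes=3):
    """Count predictions: matrix[t][p] is how often class t was predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of predictions on the diagonal (correct classifications)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Hypothetical labels: 0, 1 and 2 stand for the three groups in the caption.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred)
acc = accuracy(cm)
```

Diagonal entries count correct predictions; every off-diagonal entry counts one kind of misclassification.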

Fig 3.

Confusion matrix results generated with one-hot encoding and BoW, without PCA, on the dataset.

Numbers 0, 1 and 2 on the X and Y axes represent the African, Asian and Caucasian race groups, respectively. The values in the matrix denote the number of correct and incorrect predictions made by each classifier: (a) Support vector machine, (b) Linear discriminant analysis, (c) Quadratic discriminant analysis and (d) Random forest.
