Trainable High Resolution Melt Curve Machine Learning Classifier for Large-Scale Reliable Genotyping of Sequence Variants

doi:10.1371/journal.pone.0109094

Figure 1.

Gblocks output.

The blue highlight underneath marks the region that passes the selection criteria under the chosen parameters; this region is considered a candidate primer site.

Figure 2.

Classification results with varied parameters.

A) KNN classification results as the number of neighbors, k, was varied from 1 to 7. The plot shows the average accuracy for each k; k = 1 and k = 2 gave the best performance. B) PCA-LDA classification results as the number of eigenvectors was varied. The PCA-LDA classifiers were tested with dimensionality reduction to between one and seven eigenvectors. The plot shows that accuracy was highest with six eigenvectors.
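
As a rough illustration of the sweep over k described in panel A, the following minimal sketch scores a k-nearest-neighbor classifier for k = 1 to 7. The toy curves, the Euclidean distance, the leave-one-out evaluation, and the function names `knn_predict` and `loo_accuracy` are all assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Majority vote among the k training curves closest to `query`.

    Ties go to the label that appears first among the nearest neighbors.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: classify each curve using all the others."""
    correct = 0
    for i, (curve, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if knn_predict(train, curve, k) == label:
            correct += 1
    return correct / len(data)

# Hypothetical toy "melt curves" (three sampled points each), two classes.
data = [
    ([1.00, 0.90, 0.10], "A"), ([1.00, 0.85, 0.12], "A"),
    ([0.98, 0.88, 0.09], "A"), ([1.00, 0.92, 0.11], "A"),
    ([1.00, 0.50, 0.05], "B"), ([0.99, 0.48, 0.06], "B"),
    ([1.00, 0.52, 0.04], "B"), ([1.00, 0.47, 0.05], "B"),
]

# Sweep k from 1 to 7, mirroring the range shown in the figure.
accuracy_by_k = {k: loo_accuracy(data, k) for k in range(1, 8)}
for k, acc in accuracy_by_k.items():
    print(f"k={k}: accuracy={acc:.2f}")
```

With real data the curves would be much longer vectors, and the evaluation scheme should match whatever validation protocol the study actually used.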

Figure 3.

Illustration of the ensemble binary classifiers.

Each binary classifier differentiates two classes, and a score is tallied for each serotype. For each SVM classifier, each class consists of 9 melt curves from 9 different conditions. The predicted serotype is the one with the highest score.
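
The voting scheme in this caption can be sketched as follows. This is only an illustration: a nearest-centroid rule stands in for the SVM used in the paper, and the serotype labels, toy curves, and function names (`make_binary_classifier`, `train_ensemble`, `predict`) are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math

def centroid(curves):
    """Point-wise mean of a list of equal-length curves."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]

def make_binary_classifier(label_a, curves_a, label_b, curves_b):
    """Stand-in for a two-class SVM: pick the class with the closer centroid."""
    center_a, center_b = centroid(curves_a), centroid(curves_b)
    def classify(curve):
        if math.dist(curve, center_a) <= math.dist(curve, center_b):
            return label_a
        return label_b
    return classify

def train_ensemble(training):
    """Build one binary classifier for every pair of classes."""
    return [
        make_binary_classifier(a, training[a], b, training[b])
        for a, b in combinations(sorted(training), 2)
    ]

def predict(ensemble, curve):
    """Each classifier casts one vote; the class with the most votes wins."""
    scores = Counter(classifier(curve) for classifier in ensemble)
    return scores.most_common(1)[0][0]

# Hypothetical training curves; the paper uses 9 curves per class.
training = {
    "s1": [[1.0, 0.90, 0.10], [1.0, 0.88, 0.12]],
    "s2": [[1.0, 0.60, 0.10], [1.0, 0.62, 0.08]],
    "s3": [[1.0, 0.30, 0.05], [1.0, 0.32, 0.06]],
}
ensemble = train_ensemble(training)
predicted = predict(ensemble, [1.0, 0.61, 0.09])
print(predicted)
```

With N classes this pairwise design needs N(N-1)/2 binary classifiers, so the ensemble grows quadratically with the number of serotypes.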

Table 1.

List of target DNA sequences.

Table 2.

List of 7 primer pairs used to differentiate 92 serotypes of S. pneumoniae.

Figure 4.

Predicted melt curves of serotype 1 with the first primer set across 9 different conditions.

The predicted melt curves were generated using uMelt under 9 different conditions, covering all combinations of [Na+ K+] at 47 mM, 50 mM, and 53 mM with [Mg2+] at 1.4 mM, 1.5 mM, and 1.6 mM.
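
The 9 conditions are the Cartesian product of the two concentration lists, which can be enumerated as in the short sketch below. The variable names are illustrative, and the sketch only lists the conditions; it does not call uMelt.

```python
from itertools import product

# Concentrations quoted in the caption, in mM.
monovalent_mM = [47, 50, 53]    # [Na+ K+]
magnesium_mM = [1.4, 1.5, 1.6]  # [Mg2+]

# Every pairing of one monovalent and one Mg2+ concentration.
conditions = list(product(monovalent_mM, magnesium_mM))
for monovalent, magnesium in conditions:
    print(f"[Na+ K+] = {monovalent} mM, [Mg2+] = {magnesium} mM")
```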

Figure 5.

Accuracy of different classifiers under different conditions.

The horizontal axis shows the Na+, K+ and Mg2+ concentrations used to generate the predicted curves. The vertical axis shows accuracy in percent. Each curve, labeled in the legend, represents the performance of a different classifier.

Table 3.

Average accuracy of the classifier under different Na+, K+ and Mg2+ concentrations.

Figure 6.

Experimental melt curves of six DNA sequences with different numbers of ‘CG’ sites.

Melt curves of six synthetic DNA sequences from two duplicate experiments run on different days. Colors identify the sequences as shown in the legend. The fully methylated sequence, with 10 ‘CG’ sites, is shown in dark blue. Two ‘CG’ sites were then changed to ‘TG’ to give the next target with 8 ‘CG’ sites, and so on until all ‘CG’ sites had been changed to ‘TG’, giving the sequence with 0 ‘CG’ sites (non-methylated), shown in light blue.
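
The stepwise series of six sequences can be illustrated with the sketch below. The template sequence is a made-up placeholder, not one of the paper's targets, and converting the sites from left to right is an assumption, since the caption does not say which two sites were changed at each step.

```python
import re

def convert_cg_sites(sequence, n_sites):
    """Change the first `n_sites` 'CG' dinucleotides to 'TG'."""
    if n_sites == 0:
        # re.sub treats count=0 as "replace all", so handle it explicitly.
        return sequence
    return re.sub("CG", "TG", sequence, count=n_sites)

# Hypothetical template containing exactly 10 'CG' sites.
template = "ACGT" * 10

# Convert 0, 2, 4, ..., 10 sites to obtain the six sequences.
series = [convert_cg_sites(template, n) for n in range(0, 11, 2)]
remaining_cg = [sequence.count("CG") for sequence in series]
print(remaining_cg)
```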
