Somnotate: A probabilistic sleep stage classifier for studying vigilance state transitions
Fig 2
Automated sleep stage classification by Somnotate exceeds manual accuracy.
(A) Somnotate was trained and tested, in a hold-one-out fashion, on six 24-hour data sets. Using a consensus annotation based on at least three manual annotations, the accuracy of the classifier was compared to the accuracy of individual manual annotations (n = 25 manual annotations from 13 experienced sleep researchers). (B) The confusion matrix for individual manual annotations compared to the manual consensus (left), for the automated classifier compared to the manual consensus (middle), and the difference between these two confusion matrices (right). (C) Comparison of state occupancies between the automated and manual consensus annotations. (D) State transition probabilities in the automated annotation, normalised to the state transition probabilities in the manual consensus annotation. (E) Cumulative frequency plot showing the duration of the differences between the automated annotation and the manual consensus. Note that the manual annotation had a temporal resolution of 4 s (vertical dashed line), whereas the automated classification was performed at a temporal resolution of 1 s. (F) Venn diagram of the time points at which the automated annotation and manual consensus differed. (G) Excluding samples for which Somnotate is uncertain improves accuracy. Classifier accuracy was compared between cases when all samples were included (‘All data’) and when 5.5% of samples were removed because the likelihood of the predicted state dropped below 0.995 (‘High certainty’). The plot indicates mean ± standard deviation, and p-values are derived from a Wilcoxon signed rank test. (H) Somnotate was trained on six 24-hour data sets and then tested on a 12-hour data set, which had been independently annotated by ten experienced sleep researchers (as in Fig 1). The accuracy of the annotation by Somnotate was compared to consensus annotations generated from different numbers of manual annotations. Error bars indicate standard deviation. P-values are derived from a Wilcoxon signed rank test.
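
The comparisons in panels B and G can be illustrated with a minimal sketch. It assumes that each annotation is a per-sample list of state labels and that the classifier reports a likelihood for each predicted state. The function names, the state labels and the toy data are hypothetical and are not taken from the Somnotate code base; only the 0.995 threshold comes from the caption.

```python
from collections import Counter

# Hypothetical state labels, used for illustration only.
STATES = ("awake", "NREM", "REM")


def accuracy(predicted, consensus):
    """Fraction of samples where the prediction matches the consensus."""
    if len(predicted) != len(consensus):
        raise ValueError("annotations must have the same length")
    if not predicted:
        raise ValueError("annotations must not be empty")
    matches = sum(p == c for p, c in zip(predicted, consensus))
    return matches / len(predicted)


def confusion_matrix(predicted, consensus, states=STATES):
    """Row-normalised confusion matrix.

    Rows are consensus states, columns are predicted states.
    """
    counts = Counter(zip(consensus, predicted))
    matrix = []
    for true_state in states:
        row_total = sum(counts[(true_state, s)] for s in states)
        matrix.append(
            [counts[(true_state, s)] / row_total if row_total else 0.0
             for s in states]
        )
    return matrix


def high_certainty_subset(predicted, consensus, likelihood, threshold=0.995):
    """Drop samples whose predicted-state likelihood is below the threshold.

    Returns the retained predictions, the retained consensus labels,
    and the fraction of samples that was removed.
    """
    keep = [i for i, p in enumerate(likelihood) if p >= threshold]
    removed_fraction = 1 - len(keep) / len(likelihood)
    return (
        [predicted[i] for i in keep],
        [consensus[i] for i in keep],
        removed_fraction,
    )


# Toy data (made up for illustration, not results from the paper).
consensus = ["awake", "awake", "NREM", "NREM", "NREM", "REM", "REM", "awake"]
predicted = ["awake", "NREM", "NREM", "NREM", "NREM", "REM", "NREM", "awake"]
likelihood = [0.999, 0.60, 0.998, 0.999, 0.997, 0.996, 0.70, 0.999]

all_data_accuracy = accuracy(predicted, consensus)
kept_pred, kept_cons, removed = high_certainty_subset(
    predicted, consensus, likelihood
)
high_certainty_accuracy = accuracy(kept_pred, kept_cons)
matrix = confusion_matrix(predicted, consensus)
```

In this toy example the two misclassified samples are also the two low-likelihood samples, so removing them raises the accuracy. The real relationship between likelihood and error is the one shown in panel G, not the one in this sketch.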