Quantifying influence of human choice on the automated detection of Drosophila behavior by a supervised machine learning algorithm

doi:10.1371/journal.pone.0241696

Fig 1.

Variability of human annotation of Drosophila social behaviors.

(A). Schematic of workflow and evaluations performed in this study. Movies of a pair of Drosophila adults were annotated both by human observers and by machine-learning-based automated classifiers. Inter-observer variability was quantified (B-D) and the performance of human and machine annotations was subsequently compared (Fig 3). The effects of the diversity of training movies (Fig 4) and features (Fig 5) were also quantified. (B-D). Summary of human annotations for wing-extension (B), lunge (C), and headbutt (D) behaviors. The total number of annotated behavioral bouts and frequency are categorized according to interaction type in (B₁-D₁). The distributions of human score combinations are shown in 4-by-4 grids with pseudocolor representing relative abundance (scale bars on the right of each grid) (B₂-D₂), and are also broken down according to whether bouts were counted by one or two observers (B₃-D₃) and by combined score (B₄-D₄).

Table 1.

Complete descriptions of the movies annotated by human observers.

Fig 2.

Frame-based z-score distributions and their variances according to behavior labels.

Distribution of frame-based feature z-scores (A₁, B₁, C₁) and variances (A₂, B₂, C₂) according to human annotations for wing extension (A), lunge (B), and headbutt (C). Z-score distributions are plotted as boxplots, where a thick bar represents the median, a box represents the 25th and 75th percentiles, and whiskers represent the 0.5th and 99.5th percentiles. Features calculated from the relative positions of the 2 flies (relative features) are shown in brown.
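
As an illustration of the summary statistics named in this caption, the following is a minimal sketch of how per-frame feature values could be converted to z-scores and reduced to the median, box, and whisker percentiles. The function names are hypothetical, and the normalization (population standard deviation over all frames of one feature) and the linear-interpolation percentile are assumptions, not the authors' documented procedure.

```python
import statistics


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation.

    Assumption: normalization uses all frames of a single feature.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant feature carries no spread; report zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


def percentile(sorted_values, q):
    """Percentile q (0 to 100) of pre-sorted values, by linear interpolation."""
    if not sorted_values:
        raise ValueError("percentile of an empty sequence is undefined")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def box_summary(values):
    """Median, 25th/75th percentile box, and 0.5th/99.5th percentile whiskers."""
    s = sorted(values)
    return {
        "whisker_low": percentile(s, 0.5),
        "q25": percentile(s, 25),
        "median": percentile(s, 50),
        "q75": percentile(s, 75),
        "whisker_high": percentile(s, 99.5),
    }
```

Calling `box_summary(z_scores(frames))` separately on the frames of each human label would give one box per label, which is the grouping the caption describes.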

Fig 3.

Human confidence and JAABA confidence are correlated.

(A-I) Precision (A, D, G) and recall (B, E, H) of fully trained classifiers for wing extensions (A, B), lunges (D, E), and headbutts (G, H) are shown for varied JAABA score thresholds, as indicated at the top of the figure. For recall, detected bouts were binned according to human combined scores of 1 to 6. In (C) (wing extensions), (F) (lunges), and (I) (headbutts), the distributions of average JAABA scores for true-positive (green), false-negative (pink), and false-positive (gray) bouts are shown as both violin plots (see Materials and methods for definitions) and box plots. As references, distributions of positive (light green) and negative (crimson) training bouts are shown at right, and the median values for positive and negative training bouts are shown by even and uneven broken lines, respectively. n.s. p > 0.05, * p < 0.05, ** p < 0.01, *** p < 0.001 by Kruskal-Wallis one-way ANOVA and post-hoc Mann-Whitney U-test. (J). Kruskal-Wallis p-value distributions of shuffled (open circles) and observed (filled circles) data sets across human combined scores. (K). Recall rates across human combined scores for shuffled and observed data sets at a JAABA score threshold = 0.1 (observed data sets are replotted from (B), (E), and (H)). Average and 95% confidence intervals for shuffled data are shown in light colors.
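
The threshold sweep in panels A-I can be sketched as follows. This is a simplified illustration, not the study's evaluation code: it assumes each candidate bout is already reduced to a pair of (average JAABA score, human combined score), that a human combined score of 0 marks a bout no observer annotated, and that a bout counts as detected when its score is at or above the threshold. The function names are hypothetical.

```python
from collections import defaultdict


def precision_recall(bouts, threshold):
    """Precision and recall for (jaaba_score, human_combined_score) pairs.

    Assumption: human_combined_score == 0 means no observer annotated the bout.
    Returns None for a metric whose denominator is zero.
    """
    tp = fp = fn = 0
    for jaaba_score, human_score in bouts:
        detected = jaaba_score >= threshold
        human_positive = human_score > 0
        if detected and human_positive:
            tp += 1
        elif detected:
            fp += 1
        elif human_positive:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


def recall_by_human_score(bouts, threshold):
    """Recall computed separately for each human combined score (1 to 6)."""
    detected = defaultdict(int)
    total = defaultdict(int)
    for jaaba_score, human_score in bouts:
        if human_score > 0:
            total[human_score] += 1
            if jaaba_score >= threshold:
                detected[human_score] += 1
    return {score: detected[score] / total[score] for score in sorted(total)}


# Made-up example bouts: (average JAABA score, human combined score).
example_bouts = [(0.9, 6), (0.4, 3), (0.05, 1), (0.3, 0), (-0.2, 0), (0.7, 6)]
for t in (0.1, 0.5):
    print(t, precision_recall(example_bouts, t), recall_by_human_score(example_bouts, t))
```

Raising the threshold in this toy data removes the false positive but also drops a human-annotated bout, which is the precision and recall trade-off the panels plot.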

Table 2.

Annotations of false positives by the wing extension classifier.

Table 3.

Annotations of false positives by the lunge classifier.

Table 4.

Annotations of false positives by the headbutt classifier.

Fig 4.

Classifier performance improves as diversity of training frames increases.

Precision (A, D, G), false-positive rates (B, E, H), and recall (C, F, I) of classifiers for wing extensions (A-C), lunges (D-F), and headbutts (G-I) are plotted according to the types of training movies used (indicated by the circles below A, D, G). False-positive rates are shown separately for the evaluating movie types indicated on the right. “All other” movies include fruitless mutants as indicated in S1 Table. Gray bars with broken outlines (A, B, D, E, G, H) and broken lines (C, F, I) represent the mean and 95% confidence intervals of the classifiers trained with frames downsampled proportional to the ratio of the training frames from a single type of movie (left-most bars on A, D, G) to the entire number of training frames. Note that the 95% confidence intervals are generally very small. Also, recall for wing-extension and lunge classifiers with downsampled training frames is very similar to that for fully trained classifiers. Precision and recall for classifiers trained with “all movies” (shown in gray) are replotted from Fig 3.
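
The downsampling control described in this caption can be sketched as below. This is an illustration under stated assumptions: frames are drawn uniformly at random without replacement, and the 95% confidence interval uses a normal approximation (1.96 standard errors of the mean), which may differ from the authors' method. The function names and the example metric values are hypothetical.

```python
import math
import random
import statistics


def downsample(frames, fraction, rng):
    """Draw a random subset of training frames without replacement.

    `fraction` stands for the ratio of single-movie-type training frames
    to all training frames (an assumption about how the ratio is applied).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(frames) * fraction))
    return rng.sample(frames, k)


def mean_and_ci95(values):
    """Mean and normal-approximation 95% confidence interval of the mean."""
    mean = statistics.fmean(values)
    if len(values) < 2:
        return mean, (mean, mean)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    half_width = 1.96 * sem
    return mean, (mean - half_width, mean + half_width)


rng = random.Random(0)  # fixed seed so the subset is reproducible
subset = downsample(list(range(100)), 0.25, rng)
print(len(subset))

# Hypothetical precision values from classifiers trained on repeated subsets.
print(mean_and_ci95([0.80, 0.90, 1.00]))
```

In practice each random subset would be used to train one classifier, and the metric of interest from those repeated runs would be passed to `mean_and_ci95`.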

Fig 5.

Performance of classifiers changes when some rules or features are removed.

Precision (A, D, G), false-positive rates (B, E, H), and recall (C, F, I) of classifiers for wing extensions (A-C), lunges (D-F), and headbutts (G-I) are plotted according to the features not available for training on JAABA (shown below A, D, G). False-positive rates are shown separately for the evaluating movie types indicated on the right. Precision and recall for classifiers trained with all features (shown in gray) are replotted from Fig 3.

Table 5.

Weights given to features in each classifier (ascending order).
