A fast and accurate zebra finch syllable detector

doi:10.1371/journal.pone.0181992

Fig 1.

The spectrogram of the song of the bird “lny64”, used as an example throughout this paper.

This image was made by superposing the spectra of our 2818 aligned songs. Our example detection points, , are shown as red lines, with example recognition regions of 30 ms × 1–8 kHz marked as rectangles.
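
The caption does not say how the spectra were combined. One plausible reading of "superposing" is averaging the power spectrograms of the time-aligned songs, bin by bin. The sketch below assumes that reading, equal-length pre-aligned recordings, and a Hann-windowed short-time FFT. The function names, `n_fft`, and `hop` are hypothetical and are not taken from the paper.

```python
import numpy as np

def stft_power(x, n_fft=256, hop=64):
    """Power spectrogram (frames x frequency bins) of a 1-D signal, Hann-windowed."""
    if len(x) < n_fft:
        raise ValueError("signal shorter than one FFT frame")
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def superposed_spectrogram(songs, n_fft=256, hop=64):
    """Assumed 'superposition': mean of the power spectrograms of aligned, equal-length songs."""
    specs = [stft_power(np.asarray(s, dtype=float), n_fft, hop) for s in songs]
    return np.mean(specs, axis=0)
```

Averaging rather than summing keeps the image scale independent of the number of songs. Features that are consistent across renditions stay sharp, and rendition-specific noise is smoothed out.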

Table 1.

For each bird, detectors were trained for the specified timepoints, using 1000 songs and 1000 non-song samples of the same length.

The test set consisted of the remaining songs and an equal number of non-songs.
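The balanced split described above can be sketched as follows. Training uses 1000 songs and 1000 non-songs. The test set uses every remaining song plus the same number of previously unused non-songs. The function name, the random permutation, and the error check are illustrative assumptions, not the authors' code.

```python
import numpy as np

def split_dataset(n_songs, n_nonsongs, n_train=1000, seed=0):
    """Hypothetical index split: n_train songs/non-songs for training,
    remaining songs plus an equal number of unused non-songs for testing."""
    rng = np.random.default_rng(seed)
    songs = rng.permutation(n_songs)
    nonsongs = rng.permutation(n_nonsongs)
    n_test = n_songs - n_train
    if n_test < 0 or n_nonsongs < n_train + n_test:
        raise ValueError("not enough samples for a balanced split")
    return {
        "train_song": songs[:n_train],
        "train_nonsong": nonsongs[:n_train],
        "test_song": songs[n_train:],                         # all remaining songs
        "test_nonsong": nonsongs[n_train:n_train + n_test],   # same count, disjoint from training
    }
```

For lny64's 2818 songs, this yields 1818 test songs and 1818 test non-songs.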

Fig 2.

Each plot shows one network output unit’s responses to all 2818 presentations of lny64’s song shown in Fig 1.

We show only the syllables , , and , and we do not show the (absent) responses to non-song presentations. The horizontal axis is time relative to the beginning of the aligned song, and the vertical axis is an index for the 2818 individual song presentations. The grey shading shows the audio amplitude of song Y at time T. Detection events on training songs are shown in cyan, and detections on unseen test songs in red. To convey intra-song variability, songs have been stably sorted by the time of their detection events; thus, each of the three detection graphs shows the songs in a different order.
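A stable sort orders songs by detection time while keeping songs with equal times in their original order. The sketch below uses Python's built-in `sorted`, which is guaranteed stable. Placing songs with no detection (`None`) last is an assumption made for illustration. The caption does not say how such songs were ordered.

```python
def stable_order(detection_times):
    """Song indices sorted by detection time; ties keep their original order.
    Songs with no detection (None) go last, also in original order (assumption)."""
    return sorted(
        range(len(detection_times)),
        key=lambda i: (detection_times[i] is None,
                       detection_times[i] if detection_times[i] is not None else 0.0),
    )
```

For example, `stable_order([0.31, 0.29, None, 0.29, 0.30])` returns `[1, 3, 4, 0, 2]`. Songs 1 and 3 tie at 0.29 and keep their original relative order. Because each output unit produces its own detection times, sorting separately per unit gives each graph its own song order.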

Fig 3.

Accuracy variability over 100 different training runs for each of the test detection points.

Each dot shows the test-set accuracy of one independently trained detector. Because the horizontal positions have been randomised slightly so that same-valued measurements do not occlude one another, the test syllable is also indicated by colour. The means are given in Table 2.

Table 2.

Mean values for the detection accuracies shown in Fig 3.

Fig 4.

Timing varies as the FFT frame interval changes.

Here we show results for the ideal detector and for the LabVIEW and Swift+serial implementations, for the constructed δ-syllable and for trigger of lny64’s song. The lines show latency; error bars show the standard deviation (jitter). Points have been shifted horizontally slightly for clarity; the original positions (FFT frame intervals) are 0.5, 1, 1.5, 2, and 4 ms.
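
Latency and jitter, as used in this caption, can be computed from paired timestamps. Latency is the mean delay between the target time and the detector's output. Jitter is the standard deviation of that delay. This is a minimal sketch. The function name is hypothetical. Using the sample standard deviation (`statistics.stdev`) rather than the population form is also an assumption, since the caption does not specify which was used.

```python
import statistics

def latency_jitter(detected_ms, target_ms):
    """Return (latency, jitter) in ms: mean and sample standard deviation
    of per-trial delays (detected - target). Needs at least two trials."""
    if len(detected_ms) != len(target_ms):
        raise ValueError("detected and target lists must have equal length")
    delays = [d - t for d, t in zip(detected_ms, target_ms)]
    return statistics.mean(delays), statistics.stdev(delays)
```

Repeating this calculation for each FFT frame interval gives one latency point and one jitter error bar per interval, as in the plot.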