Fig 1.
Block diagram of the NMF_CC parameterization.
NMF_CC coefficients are computed by applying the Discrete Cosine Transform (DCT) to the audio magnitude spectrum filtered with the NMF-based filter bank. The final set of acoustic features, which is the input of the Support Vector Machine (SVM)-based classifier module, is obtained by performing temporal feature integration on these short-time parameters.
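The short-time step of this pipeline can be sketched as follows. This is a minimal illustration only, not the paper's exact implementation: the filter bank W, the number of coefficients and the small flooring constant are assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def nmf_cc(mag_spectrum, W, n_coeffs=13):
    """Illustrative sketch of short-time NMF_CC coefficients.

    mag_spectrum: (n_bins,) magnitude spectrum of one analysis frame.
    W: (n_bins, n_filters) NMF-derived filter bank (hypothetical input).
    """
    # Filter-bank energies: project the spectrum onto the NMF filters.
    energies = W.T @ mag_spectrum
    # Log compression followed by the DCT, as in standard cepstral analysis.
    log_energies = np.log(energies + 1e-10)
    return dct(log_energies, type=2, norm='ortho')[:n_coeffs]
```

Replacing W with a mel-scaled triangular filter bank in the same routine yields conventional MFCC-style coefficients.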
Fig 2.
Block diagram of the NMF-based acoustic models building process.
The Spectral Basis Vectors (SBVs) of the i-th class, Wi, are found by applying the NMF algorithm to the training audio data of that class. Finally, the SBVs of all classes are concatenated to form a single set of SBVs, Wbs.
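The per-class basis learning and concatenation can be sketched as below, assuming scikit-learn's NMF as a stand-in for the paper's NMF algorithm; the number of basis vectors per class and the initialization are illustrative choices, not values from the paper.

```python
import numpy as np
from sklearn.decomposition import NMF

def build_class_sbvs(class_spectrograms, n_basis=10, seed=0):
    """Illustrative sketch: learn Spectral Basis Vectors (SBVs) per class
    and concatenate them into a single basis Wbs.

    class_spectrograms: list with one (n_bins, n_frames) non-negative
    magnitude spectrogram per class (that class's training frames).
    """
    bases = []
    for V in class_spectrograms:
        model = NMF(n_components=n_basis, init='nndsvda',
                    max_iter=300, random_state=seed)
        Wi = model.fit_transform(V)    # (n_bins, n_basis) SBVs of class i
        bases.append(Wi)
    # Wbs: (n_bins, n_classes * n_basis), SBVs of all classes side by side.
    return np.concatenate(bases, axis=1)
```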
Fig 3.
Block diagram of the H_CC (upper part), MFCC/NMF_CC (lower part) feature extraction processes and their combination.
The short-time features are obtained by applying the Discrete Cosine Transform (DCT) to the audio magnitude spectrum that is filtered with the conventional mel-scaled (MFCC) or NMF-based filter bank (NMF_CC). The final set of acoustic features, which is the input of the Support Vector Machine (SVM)-based classifier module, is obtained by performing temporal feature integration on the combination of the H_CC and MFCC/NMF_CC short-time parameters.
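The combination and temporal-integration stage can be sketched as follows. The mean and standard deviation statistics used here are a common choice for temporal feature integration and are an assumption; the captions do not specify which statistics the paper uses.

```python
import numpy as np

def integrate_features(hcc, cc):
    """Illustrative sketch: combine H_CC and MFCC/NMF_CC short-time
    features, then apply mean + standard-deviation temporal integration.

    hcc: (n_frames, d1) H_CC short-time feature matrix.
    cc:  (n_frames, d2) MFCC or NMF_CC short-time feature matrix.
    Returns a single fixed-length vector per audio segment, suitable
    as input to an SVM-based classifier.
    """
    frames = np.hstack([hcc, cc])  # frame-level combination
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
```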
Table 1.
Composition of the database used in the experiments.
The number of audio files and syllables per bird species are indicated.
Table 2.
Fixed and random effects for the linear mixed model that measures the effect of the parameterization used on the accuracy.
The front-ends under consideration are MFCC (intercept), LDA1, LDA2, NMF_CC, H_CC + G_NMF and the combinations MFCC + H_CC + G_NMF and NMF_CC + H_CC + G_NMF. The inclusion of the first derivatives is indicated with the suffix “+ Δ”. For the fixed effect (parameterization), the classification rate (%) estimates, standard errors and t-values are shown. Statistical significance is marked with *** (p < 10−4), ** (p < 10−3) and * (p < 0.05). For the random effects (bird species and fold), variances and standard deviations are reported.
Fig 4.
Confusion matrices [%] for two parameterization schemes: (a) MFCC (baseline); (b) NMF_CC + H_CC + G_NMF + Δ (proposed combination).
The columns and rows correspond to the correct class and the hypothesized one, respectively. Different colors indicate audio recordings belonging to species of the same genus.
Table 3.
Execution time (s), average execution time per frame (ms) and number of features for the MFCC, NMF_CC and H_CC + G_NMF parameterizations and their combinations.
Fig 5.
Frequency responses of the auditory filter bank used in the feature extraction process: (a) Triangular mel-scaled filter bank (U); (b) Filter bank determined by NMF (W).
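Purely for illustration, the triangular mel-scaled filter bank U of panel (a) can be constructed as below; the number of filters, FFT size and sampling rate are assumed values, not those used in the paper.

```python
import numpy as np

def mel_filter_bank(n_filters=26, n_fft=512, sr=16000):
    """Illustrative sketch of a triangular mel-scaled filter bank U.

    Returns U with shape (n_filters, n_fft // 2 + 1), one triangular
    filter per row, with edges equally spaced on the mel scale.
    """
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Filter edge frequencies, equally spaced on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)

    U = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):           # rising slope
            U[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling slope
            U[i - 1, k] = (right - k) / max(right - center, 1)
    return U
```

The NMF-based filter bank W of panel (b), by contrast, is learned from the training data rather than fixed by a perceptual frequency scale.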
To improve the readability of the figure, different colors have been used to represent adjacent filters.
Fig 6.
Spectral Basis Vectors (SBVs) for the twelve bird species sounds.
To improve the readability of the figure, different colors have been used to represent adjacent spectral basis vectors.
Fig 7.
Examples of vocalization spectrograms from the following different bird species: (a) Aramides cajanea; (b) Rupornis magnirostris; (c) Piranga olivacea; (d) Piranga rubra; (e) Crypturellus cinereus; and (f) Crypturellus soui.
The first two examples illustrate that, although the sounds of some species have significant low-frequency content, as speech does (a), their spectral characteristics are in general very different from those of speech (b); (c) and (d) show the acoustic similarity between sounds of different species, which may cause errors in the classification process; and (e) and (f) contain examples of very noisy spectrograms, which are very likely to be misclassified.