Machine Learning Algorithms for Automatic Classification of Marmoset Vocalizations

Automatic classification of vocalization type could potentially become a useful tool for acoustic the monitoring of captive colonies of highly vocal primates. However, for classification to be useful in practice, a reliable algorithm that can be successfully trained on small datasets is necessary. In this work, we consider seven different classification algorithms with the goal of finding a robust classifier that can be successfully trained on small datasets. We found good classification performance (accuracy > 0.83 and F1-score > 0.84) using the Optimum Path Forest classifier. Dataset and algorithms are made publicly available.


Introduction
The common marmoset (Callithrix jacchus) is a species of arboreal New World monkeys native to the northeast region of Brazil. This species is becoming an increasingly important primate model of a number of human diseases, as well as for basic research in neuroscience [1,2] and genetics [3]. Examples of the species' use as a disease model come from multiple sclerosis [4], herpes virus [5] and tuberculosis [6] research. Some of the reasons for this popularity are that marmosets have a similar disease susceptibility profile to humans, are relatively easy to handle, have a high reproductive rate, and that important genetic and neuroscience research tools already exist [7]. In particular, marmosets are an excellent model for the neurophysiological study of vocal communication [8]. A pubmed search on "Callithrix jacchus" shows between 113 and 164 publications per year during the last 10 years, with most of the publications in biomedicine. Given the widespread use of marmosets in laboratories, methods to reliably monitor health and behavior in captive colonies are of a high priority.
As typical for Neotropical arboreal primates, marmosets rely heavily on acoustic communication. This together with cooperative breeding and the complex social organization that follows, means that marmosets have a large vocal repertoire, with 9-13 different types of vocalizations reported [9][10][11][12][13]. The vocalizations produced by a group of marmosets provide a rich source of information about their activities and well-being [14]. However, although informative, it is practically untenable to manually track the vocalizations of a colony over any longer period of time. Manually classifying calls require expert knowledge and daily hours of work, making it too time-consuming and prone to inconsistencies among researchers. Thus, a reliable automatic method for call identification is necessary.
Much work has been done on the automatic classification of animal vocalizations, especially bird song [15][16][17][18][19], but also mammalian [20][21][22] and amphibian calls [23][24][25]. However, only a few studies have addressed non-human primates (hereafter primates). Pertinent to the current study, there are only three other studies in which different primate vocalizations were analyzed. Mielke and Zuberbühler (2013) used artificial neural networks (ANNs) with Mel-Frequency Cepstral Coefficients features to classify blue monkey call types (Cercopithecus mitis stuhlmanni). Furthermore, they predicted caller identity from the alarm call, and identified the blue monkey alarm call among the alarm calls of other sympatric species [26].
Pozzi, Gamba and Giacoma (2010) also used ANNs, but with hand-designed features derived from fundamental frequency and formants to classify call type among seven distinct types made by the black lemur (Eulemur macaco) [27]. In a later study, the same group used the long grunt (a vocalization included in the repertoire of all lemurs) and similar analytical methods to classify species among the five species in the Eulemur genus [28]. Therefore, to our knowledge, there is no previous work on classifying vocalization types from Neotropical primates.
In contrast, several studies have explored the related question of how specific to the caller are the acoustic properties of individual vocalizations, that is, how well caller identity can be predicted from a given call. These studies have all relied on manually designed features and linear discriminant analysis (LDA). Particularly relevant are Jones et al. (1993) and Miller et al. (2010), who both studied the common marmoset's Phee call to classify caller individual [29,30]. Both studies found that the call is highly caller-specific. Similar studies have been done on Japanese macaque (Macaca fuscata) [31], blue monkey [32], ring-tailed lemur (Lemur catta) [33], and cotton-top tamarin (Saguinus oedipus) [34].
Algorithms for the automatic classification of vocalizations learn the mapping from input (call features) to label (call type). Therefore, a dataset with labeled calls is necessary to train the algorithms. In general, classification performance increases with the amount of training data. However, in ethology, large sets of labeled data are often hard to obtain. The amount of data may be limited because data collection is labor-intensive, or the data of interest are inherently scarce because they are produced by animals passing through transitory learning or developmental stages. In the latter case, the amount of possible data is strictly limited. Thus, a method that achieves high accuracy with a relatively small number of labeled call exemplars is highly desirable.
To address the need for automatic call classification given the aforementioned constraints, we compared seven different types of classification algorithms with the goal of finding a reliable method.
To extract acoustic features, we used Linear Predictive Coding (LPC), a method commonly used for speech processing [35]. For classification purposes, we used seven different algorithms:

Dataset description
The subjects were five captive-born adult common marmosets (two females and three males), housed at the Instituto do Cérebro, Universidade Federal do Rio Grande do Norte. The marmosets where housed socially in two wire mesh enclosures (1.20 x 1.50 x 2.45 m), enriched with tree branches, ropes, plants, hammocks and nesting tubes. The animals were fed twice daily with fresh and dried fruit, nuts, egg and chicken, and had ad libitum access to water. The colony was maintained outdoors protected by a roof allowing daily sunbaths in natural light. The animals were housed in compliance with SISBIO permit 18394, and the experiment was approved by the ethics committee of Universidade Federal do Rio Grande do Norte with CEUA permit 11/2016. No animal was sacrificed at the end of the experiment.
A directional microphone (ECM-CG50 Pro Shotgun Microphone, Sony, Tokyo, Japan) was placed at a distance of 10 cm above the home cage and connected to a computer. The microphone signal was streamed to a computer where a custom-written Python script segmented the incoming signal. The signal was bandpass filtered between 4 and 10 kHz, and when the amplitude exceeded a threshold a segment beginning 0.5 s before first threshold crossing and ending 0.5 s after the last threshold crossing was saved. Sound was sampled at 44.1 kHz.
From the raw recordings, we selected and manually labeled 27-30 exemplars per class of marmoset vocalization. We attempted to cover the marmoset's vocal repertoire of approximately nine [36] to 13 [37] distinct vocalizations. The number of exemplars of each of the 11 types investigated in this work is listed in Table 1, and spectrograms of representative exemplars of each type are shown in Fig 1. All vocalizations were produced spontaneously, that is, without any intervention from the experimenter. The recordings were done over a period of two months. The dataset is available at https://osf.io/yqpvk/ and http://neuro.ufrn.br/data/ marmosetvocalizations.

Feature extraction
The success of a signal classification system depends on the choice of features used to characterize the raw signals. In this work, we used LPC, a method commonly used for analysis and compression of speech and animal vocalizations [35,38]. The linear prediction filter coefficients were used as input features to the classification algorithms. Those are the coefficients of an n th -order linear finite impulse response filter that predicts the current value of the vocalization from past samples [39]. The number of features extracted from each call was set to 20 after experimenting with filter orders from 10 to 25, in steps of 5.

Classification algorithms
Optimum-Path Forest. The OPF classifier models the problem of pattern recognition as a graph partition task, in which a predefined set of samples from each class (i.e. prototypes)  compete for minimal path cost to the rest of the samples. This results in a collection of optimum-path trees rooted at the prototype nodes, building an optimum-path forest considering from all training samples. Test samples are classified through incrementally evaluating the optimum paths from the prototypes, as though they were part of the forest, and assigning the labels of the most strongly connected roots. The notion of optimum-path connectivity comes from the minimization of a path-cost function [40]. An OPF classifier can be designed as long as we use a smooth path-cost function. Although there are two different versions of the supervised OPF classifier [41][42][43], in this paper we make use of the former and most widely used approach, as described below.
The OPF with complete graph was first proposed by Papa et al. [41]. Later on, Papa et al. [42] presented an improved version with a more efficient classification step. In this section, we present the OPF algorithm as described by those authors, using the same formalism.
Let D ¼ D 1 [ D 2 be a dataset partitioned into a training (D 1 ) and a test (D 2 ) set. In addition, let G ¼ ðV; E; wÞ be the graph originated from D 1 , such that V ¼ D 1 and E stands for an adjacency relation that defines a full connectedness graph, that is, a graph where each pair of nodes is connected to each other. The arcs are weighted by the distance between their corresponding nodes. Each graph node s 2 V is modeled as an n-dimensional feature vector, and w (s,t) is a weight (distance) between two graph nodes x i and x j used to weight the arc hs; ti 2 E. Mathematically, w is a function that takes two graph nodes and returns the distance between them, that is, w ! V Â V : < þ . In this work, a node is the set of n features extracted from a particular marmoset vocalization.
As aforementioned, the training step of the OPF classifier aims at building an optimumpath forest rooted in a set of prototype samples. Let S be that set, such that S V. A very common and easy way to obtain S would be though a random sampling over the training samples. However, two requirements need to be met: (i) each class should be represented by, at least, one prototype node, and (ii) the prototype's distribution should cover different regions of the feature space. These requirements make the naïve approach too time-consuming. In order to circumvent this problem, Papa et al. [41] proposed to place prototypes in the regions more prone to errors, that is, nearby the frontier of the classes. The idea is to compute a Minimum Spanning Tree (MST) over the training set, and then mark the connected samples from different classes as prototypes. We say that S Ã is an optimum set of prototypes when the training step minimizes the classification errors in D 1 [44]. The optimum prototypes are the closest samples of the MST with different labels in D 1 .
Given the prototypes, the next step is to find the smallest path-cost from the prototypes to the remaining training samples in S Ã to the remaining nodes in V=S Ã . In this work, we adopted the same path-cost function as Papa et al. [41,42], which computes the maximum arc-weight along a path, as follows: þ1 otherwise: where π Á hs, ti stands for the concatenation of path π and the arc hs, ti. A path is defined as a sequence of distinct and adjacent samples. After computing the optimum-path forest, unseen samples from D 2 can be classified. For each test sample, the classification step consists of connecting the sample to all training nodes, evaluating which training sample offers the optimum-path cost C according to f max defined in Eq 1, that is: Let s Ã be the training sample that satisfies the above equation. Then, test sample t will be assigned the same label as sample s Ã . For the current experiments we used the LibOPF library available at https://github.com/LibOPF/LibOPF. Bayesian Classifier. A Bayesian Classifier estimates the probability that a given vocalization belongs to a certain class. This probability can be derived from Bayes' Theorem [45]: where p(x|ω i ) denotes the probability of observing feature vector x given the class ω i , P(ω i ) is the prior probability of class ω i , and p(x) is the probability of x. In order to estimate p(x|ω i ) we assumed that the likelihood function is Gaussian, and could thus estimate its parameters from the dataset [46]. Multilayer Artificial Neural Network. An MLP classifier is a feedforward neural network composed of several neuron layers aiming to solve multiclass problems [47]. The input to each layer is a weighted sum of the output from the previous layer. The number of neurons in the first layer equals the number of features of the input, while the number of neurons in the last layer is equal to the number of classes. The neural network assigns a feature vector extracted from a vocalization x to the class ω q if the q-th output neuron has the highest activation. We used the MLP implementation from scikit-learn [48], with two hidden layers of eight and 16 neurons. The network was trained using backpropagation and the Limited-memory BFGS optimization algorithm [49] to update the weights. The learning rate was set to 0.001. Support Vector Machines. While the learning of MLP is based on the principle of empirical risk minimization, the SVM induction process is rooted in the principle of structural risk minimization [50][51][52], aiming at establishing an optimal discriminative function between two classes of patterns while accomplishing the trade-off between generalization and overfitting. The SVM training algorithm constructs the optimal hyperplane separating the two classes [50]. In order to extend from linear to nonlinear classification the kernel trick is used [51], where kernel functions nonlinearly map input data into high-dimensional feature spaces in a computationally-efficient manner.
For classification problems with multiple classes, two approaches are commonly used for binary SVMs, one-against-one and one-against-all [53]. Both strategies tend to lead to similar results in terms of classification accuracy, but the former, which was the one adopted here usually requires shorter training time, although incurring a higher number of binary decompositions.
k-Nearest Neighbors. k-NN is a very simple algorithm that works well in many different applications. In contrast to the OPF, the k-NN uses all training samples as prototypes.
The k-NN requires an input parameter k setting the number of neighbors that contribute to the classification of a sample [55,56]. In order to classify a test sample t, the majority label in a region (e.g. sphere or hypercube) containing k training samples and centered at t, determines t's label. Note that for k = 1, the testing sample t is classified as the class of the closest training sample. For the current experiments, we defined the value of k as the best value of a grid-search in the range 1; b m 5 c Â Ã in steps of two; where m is the number of training samples.

Logistic regression.
The logistic regression classifier is essentially a one-layer artificial neural network where the weighted input features are feed to the logistic function. Here, we trained the classifier under a one-against-all scheme, regularized by the L2 norm of the classifier weights. We used the LIBLINEAR [57] implementation of logistic regression.
AdaBoost. AdaBoost, short for Adaptive Boosting, is a meta-classifier that trains an ensemble of weak classifiers. It iteratively trains classifiers on the same dataset while adjusting the weights of incorrectly classified samples such that subsequent classifiers focus more on those difficult samples. The resulting ensemble classifier tends to be less susceptible to over-fitting than other classifiers, but it is also known to be sensitive to noisy data and outliers [58]. Here, we used the AdaBoost-SAMME.R algorithm [59] from the from scikit-learn package [48].

Statistical evaluation metrics
In regard to the recognition rate, we used an accuracy measure proposed by Papa et al. [41], which is similar to the Kappa index [60], but more restrictive. If, for example, there are two classes of vocalizations with very different sizes and a classifier always assigns the label of the largest class, the average number of correct assignments will be deceivingly high. A better accuracy measure should take into account the high error rate of the smallest class. The accuracy used here is measured by taking into account that classes may have different sizes in D 2 . Let us define: and where K stands for the number of classes, jD i 2 j concerns the number of samples in D 2 that come from vocalization class i, and FP i and FN i stand for the numbers of false positives and false negatives for class i, respectively. That is, FP i is the number of samples from other vocalization classes that were classified as being from the class i in D 2 , and FN i is the number of samples from the class i that were incorrectly classified as being from other classes in D 2 . The error terms e i,1 and e i,2 are then used to define the total error from class i: Finally, the accuracy Acc is then defined as follows: Sensitivity (Se), often called recall, is the ratio of the number of correctly classified vocalizations from a given class and the total number of vocalizations in that class (including misclassified vocalizations): where TP and FN are the number of vocalizations from a given class that were correctly or incorrectly classified, respectively.
Positive predictive value (PPV), often called precision, is the ratio between the correctly classified vocalizations from a given class and the total number of vocalizations classified as pertaining to that class: where FP denotes the number of vocalizations incorrectly classified as belonging to the considered class. Finally, as a more global performance metric, we calculated the averaged F 1 -score for all eight classes. The F 1 -score for a given class is calculated as the harmonic mean of the Se and PPV values for that class: These four metrics allow us to reliably evaluate the performance of the classification algorithms considered in this work. The performance metrics are reported as the averages over 100 repetitions of classifier training and testing. All training and testing of the classification algorithms was done on a computer with an Intel i7 5500U processor with 8GB of RAM using Linux as operational system. A Python script to reproduce the results is available at https://github.com/ kalleknast/call_class.

Results
In order to find a robust method for the automatic classification of marmoset vocalizations, we compared the classification performance of seven different algorithms. In addition, we further explored different configurations of both OPF and SVM. OPF was tested using the following distance metrics: Euclidean, Manhattan, Canberra, Chi-Square and Bray-Curtis, and SVM was tested with linear, radial basis function and polynomial kernels. The goal was to find an algorithm that could be successfully trained on small sets of primate vocalization data. For this reason, we split our original dataset into training sets of increasing sizes, ranging from 10% to 90% of the original dataset. We found good performance of all algorithms except AdaBoost, Naive Bayes, SVM when using linear and polynomial kernels, and OPF when using Chi-Square, Bray-Curtis and Canberra distance metrics (see Table 2 and Fig 2). These poorly performing algorithms were excluded from further consideration. SVM, k-NN and OPF using Euclidean and Manhattan distances performed similarly, and well above chance (accuracy % 0.5 and F 1score % 0.5) using as little as 10% of the original dataset, and reaching an accuracy around 0.8 when trained on 90% of the data. However, in spite of the good performance, Fig 2 shows that classification accuracy keeps improving with training set size, suggesting that adding even more training data would further improve performance. The OPF algorithm configured with the Euclidean distance metric was selected for further analysis since it yielded better or comparable classification performance on both the smallest and the largest datasets (top accuracy % 0.83 and F 1 -score % 0.84). It is parameter free and thus, easy to train, and computation time is an order of magnitude less than for SVM and k-NN (see Table 2). Table 3 presents the results as a confusion matrix.
We opted for a hierarchical strategy to investigate how well all 11 types of vocalization in the dataset could be classified. Such approach was performed through sorting the Phee and Tsik classes into three and two sub-classes, respectively. The Phee class was sorted into Phee-2, Phee-3 and Phee-4 depending on how many separate whistles the call contained (Fig 1(d), 1(e) and 1(f), respectively). The Tsik call was sorted into the sub-classes Tsik and Tsik-ek depending on the presence of the harmonic "ek" component following the "tsik" in the Tsik-ek calls [37] (Fig 1(i) and 1(j), respectively). Table 4 presents a confusion matrix of the results from the Phee calls using OPF with Euclidean distance metric. Overall, the classification accuracy (56%) was not quite as good as when classifying the eight principal classes. Phee-2 was correctly classified in 70% of the cases, and Phee-3 was classified at 46%, where the remaining 64% were misclassified as Phee-2 15% or Phee-3 39%. Finally, Phee-4 was accurately classified in 52% of the samples, where the remaining 48% were misclassified as Phee-3 34% or Phee-2 15%. Classification of the Phee sub-class from the eight principal classes resulted in a compounded accuracy of 58% for Phee-2, 38% for Phee-3, and 42% for Phee-4, since accuracy for the principal Phee class was 83% (see Tables 3 and 4). Table 5 presents a confusion matrix of the results using OPF with Euclidean distance metric for Tsik vocalizations. The overall accuracy was 83%, where the accuracy for sub-class Tsik was 91%, and that of Tsik-ek was 76%. The compounded accuracies for the sub-classes Tsik and Tsik-ek were 70% and 58%, respectively (see Tables 3 and 5).

Discussion
In this work, we presented methods for the automatic classification of commonly occurring vocalizations of the common marmoset. The provided dataset can be used for acoustic analysis, further algorithm development and playback experiments. The method presented should enable the online monitoring of vocal activity in colonies of captive marmosets, so as to provide valuable information about the colony's health and well-being. Further, the method allows for interactive experimental designs, in which different actions can be triggered depending on the vocal behavior of the subjects.   Since recording and manually labeling data is both labor-intensive and time-consuming, the most important factor when selecting a classification algorithm is how well it performs on a small amount of data. Fig 2 demonstrates clear advantages for the k-NN, SVM and OPF (using Euclidean and Manhattan distances) algorithms. These algorithms perform best on both the least and the greatest amount of data.
Another factor to consider is how easy an algorithm is to use. Most algorithms have hyperparameters that require careful optimization for good performance. The standard way of doing this is to repeatedly re-train the algorithm on a manually specified range of hyperparameter values, each time evaluating classification performance on data that were not included among the training data. Both the k-NN and SVM algorithms require such hyperparameter optimization, whereas the OPF algorithm is parameterless, making it easier to use.
Under some circumstances, the time required for classification can become important. Algorithms that do not require much computational resources are valuable when real-time classification is necessary, and especially when the computations are performed on small single board computers or embedded devices with limited capacity. Examples of this are on-site portable audio acquisition and analysis [61], or home-cage vocal conditioning systems [62]. The OPF algorithm requires an order of magnitude less time than comparably performing k-NN and SVM algorithms and is thus suitable for these goals.
The proposed methods are mainly suited for laboratory conditions where audio can be recorded under consistent conditions, and with high signal-to-noise ratio. Such conditions are unlikely to be available under field conditions. Robust classification within field conditions would probable require much improved pre-processing before the classification step.