Semi-Supervised Active Learning for Sound Classification in Hybrid Learning Environments

Coping with the scarcity of labeled data is a common problem in sound classification tasks. Approaches for classifying sounds are commonly based on supervised learning algorithms, which require labeled data; such data is often scarce, which leads to models that do not generalize well. In this paper, we combine confidence-based Active Learning and Self-Training with the aim of minimizing the need for human annotation in sound classification model training. The proposed method pre-processes the instances that are ready for labeling by calculating their classifier confidence scores; candidates with low scores are delivered to human annotators, and those with high scores are automatically labeled by the machine. We demonstrate the feasibility and efficacy of this method in two practical scenarios: pool-based and stream-based processing. Extensive experimental results indicate that our approach requires significantly fewer labeled instances to reach the same performance in both scenarios compared to Passive Learning, Active Learning and Self-Training. A reduction of 52.2% in human-labeled instances is achieved in both the pool-based and stream-based scenarios on a sound classification task considering 16,930 sound instances.


Introduction
Sound classification is a relatively recent topic in the audio analysis research community when compared to speech and music analysis. Yet, it has a wide range of applications such as multimedia data search, context awareness and activity detection [1][2][3][4], security surveillance [5,6], military interest tracking [7], assistive devices for independent living [8], healthcare monitoring [9,10], among others.
In Table 1, we show an overview of state-of-the-art research in sound classification. Two main features characterize this area of research. Firstly, statistical classifiers and fully supervised learning algorithms are the most common approaches to sound classification. This means that large amounts of training data (typically labeled by human annotators) are required to create robust classification systems. Secondly, prototypical databases with fewer than 10,000 instances are employed in most cases. Indeed, although the largest database mentioned in Table 1 comprises as many as 10,500 instances, the average size of each sound class is as small as 100 instances. In comparison with automatic speech recognition research, where typical corpora comprise hundreds of hours of transcribed speech, annotated data in sound classification is scarce. Therefore, there is a gap between the need for sufficient labeled data for training robust models and the scarcity of annotated corpora.
While the development of web technology has allowed free access to vast amounts of sound media data for research usage, the shortage of labeled data remains an important issue that compromises the development of robust sound classification systems, which in turn limits their performance in practical scenarios [12][13][14]. To the best of our knowledge, even the largest environmental sound database so far, ESC-US [15], contains only a limited number of labeled instances (2,000 instances) and a large number of unlabeled instances (250,000 instances). This situation can be attributed to the burdensome and costly annotation process, which requires assigning a predefined label to each of the various sound samples and is especially critical for large databases [15]. Given this scenario, it is of extreme importance to develop techniques that allow sound classification systems to be built from databases with only partial human annotations available. This issue is addressed in this paper, and our proposal to overcome the above-mentioned limitations is to combine Active Learning (AL) and Semi-Supervised Learning (SSL). With this approach, we target real-use scenarios whereby machines are required to make sense of the acoustic world surrounding them in meaningful ways by learning autonomously (SSL), through interacting with humans (AL), and by continuously adapting to a specific environment. Additionally, this approach reduces the need for human-labeled data in the development of robust sound classification systems.
Table 1. Overview of state-of-the-art research in sound classification. For features, BoAP: bag-of-audio-phrases descriptor, UFL: unsupervised feature learning, E: energy, SF: spectral features, ZCR: zero-crossing rate, TFB-ED: triangle filter bank and eigen-decomposition, MFCC: mel-frequency cepstral coefficients, STE: subband temporal envelopes; for classifiers, SVM: support vector machines, RF: random forest, KFDA: kernel Fisher discriminant analysis, HMM: hidden Markov models; for learning methods, FS: fully supervised learning.

The best of two worlds: AL and SSL
AL [16] is a Machine Learning technique that aims at achieving greater accuracy with fewer training labels by (actively) choosing the data from which it learns. In contrast with the most commonly used Passive Learning (PL) techniques, which randomly select instances from data pools to be labeled, AL algorithms select those instances that are the 'most informative' (with respect to a given measure function), and subsequently query a human or machine annotator for labeling. The informativeness of an instance concerns its potential to improve the model's performance when it is used for training. There are various strategies by which the informativeness of unlabeled samples can be assessed (as detailed in the next section), and the effectiveness of AL has been shown in typical classification tasks such as automatic speech recognition [17], multimedia retrieval [18], speech emotion recognition [19], among others. As a result of employing a certainty-based AL query strategy, especially when it comes to large-scale raw data collection, a considerable number of unlabeled instances will be left out because of their high confidence scores (i.e., low informativeness). Here, we propose to further exploit this remaining set of instances (which are not selected for the human to label) with a traditional SSL method.
These instances, and their corresponding labels automatically annotated by the machine classifier, will be added to the human-labeled set to create a new, larger training set. As a result, we will combine AL and SSL methods to reduce the amount of human-labeled data. Specifically, human annotators are required to label only those instances with the lowest certainty as determined by the AL algorithm, while the remaining instances (those with the highest certainty) are automatically labeled by a machine annotator. Then, both groups of instances are fused and used to re-train the classifier. We will refer to this approach as Semi-Supervised Active Learning (SSAL) throughout this paper. The effectiveness of SSAL in reducing the amount of data to be labeled by human annotators will be validated in a sound database with a size of 16,930 instances.
The major contribution of this work is the application of a hybrid method combining AL and SSL in the field of sound classification, which is of extreme importance to the field given the scarcity of labeled data and the need to minimise the costs associated with human annotations. Furthermore, we provide a detailed operationalization of the proposed method in two target scenarios: pool-based (all data is available at once) and stream-based (a practical scenario whereby instances are gathered sequentially from actual distributions) scenarios.

Related work
Active Learning
One of the most promising approaches proposed in the literature to efficiently exploit unlabeled data for model development is AL [20][21][22]. By estimating the informativeness of the unlabeled instances, AL selects for annotation only those with high potential to improve the model's performance. There are various strategies by which such informativeness can be assessed (aka query strategies), and, according to the different types of feedback considered, at least three categories can be generalized from previous work [16]: 1) certainty-based sampling, 2) query-by-committee, 3) expected error reduction. In the first type of strategy, the model (or active learner) determines the certainty of the predictions on unlabeled data based on a previously trained model, and queries an annotator for the labeling of those with the least certain classification. This is perhaps the most commonly used query strategy. For instance, it has been applied in text classification [22], automatic speech recognition [17], speech emotion classification [19], audio retrieval [23], among others. The second type of strategy (query-by-committee) involves two or more classifiers and the selection of those instances about which the various models disagree the most, which are then delivered for human annotation. This strategy can also be employed in regression tasks by measuring disagreement as the variance among the committee members [24]. The third type of strategy (expected error reduction) is a decision-theoretic approach that aims to estimate how much the model's generalization error is likely to be reduced. The instances estimated to have a high impact on the model's expected error are selected for human annotation. This strategy has been adopted for text classification with Naive Bayes models [25], where it led to a dramatic improvement over certainty-based and query-by-committee strategies. 
Unfortunately, the expected error reduction method is also, in most cases, the most computationally expensive [16]. The effectiveness of AL and the various query strategies has been shown in typical classification tasks [16,19,22,23,24,25].

Semi-Supervised Learning
Similarly to AL, the goal of SSL techniques is to exploit the availability of unlabeled data for model training and improvement. Two broad categories of SSL have been investigated to date: self-training [26] and co-training [27,28]. Self-training is a technique that automatically annotates unlabeled data by using a pre-existing model trained on a smaller set of labeled data. Usually, those instances of the unlabeled data set that are predicted with the highest degree of confidence are added to the training set (together with the respective labels), and the classifier is re-trained with the new (larger) set. This procedure is then repeated iteratively until a certain target performance is achieved (or until no more unlabeled candidate data is available). This approach is very attractive and useful for enhancing the robustness of existing classifiers, because it does not require the intervention of human annotators [29,30]. The effectiveness of self-training has been demonstrated in various areas, including spoken language understanding [31], handwritten digit and text classification [32], and sound event classification [33].
Another set of algorithms with the potential to exploit unlabeled data pools is multi-view learning [30,34,35]. Multi-view learning techniques focus on improving the learning process by training different models for the same task concurrently, but using different feature sets (aka "views") [16]. Co-training is one of the earliest schemes for multi-view learning proposed in the literature. In this method, two models are initially trained with two distinct feature sets of the same labeled data set. Then, the most confident predictions of each model on the unlabeled data are added to the training set, so that the two models train each other. The algorithm relies on three assumptions or conditions: (a) sufficiency: each "view" is sufficient for classification on its own, (b) compatibility: the target functions in both "views" predict the same labels for co-occurring features with high probability, and (c) conditional independence: the "views" are conditionally independent given the class label [27].
Combining Active and Semi-Supervised Learning
AL strategies can greatly reduce the time-consuming and expensive human labeling work and lead to excellent performance improvements [16]. Nevertheless, AL is still inadequate for situations in which obtaining a large amount of human annotations is impractical (or not possible at all), so that the annotation effort needs to be minimized. Given that SSL also aims at using unlabeled data in an efficient way, but without the intervention of human annotators, it is natural to think about combining both techniques. Indeed, various examples can be found in the literature and are summarized in Table 2. One of the first works exploring combinations of AL and SSL algorithms was reported in [36]. Later, [34] proposed a variant of the query-by-committee method, known as co-testing. In this method, two classifiers were trained separately on two different views (similarly to co-training), and the unlabeled instances on which the classifiers disagree the most ('contention points') were selected for human annotation. Co-testing was then combined with co-training using an expectation maximization (co-EM) algorithm to automatically label instances that showed a low disagreement between the two classifiers. The combined method proposed in [34] clearly outperformed co-EM, general co-testing and co-training in the classification of Web pages and pictures. [37] also achieved significant performance improvements in image retrieval by combining co-testing and co-training, compared to either method alone. Certainty-based AL has also been used alongside self-training to significantly reduce the human labeling effort in spoken language understanding [31] and natural language processing [38]. In the work presented in this paper, we combine certainty-based AL and self-training in tandem for sound classification.

Active Learning in two scenarios
In this paper we adopt a certainty-based AL approach. Moreover, we consider two target scenarios: a pool-based scenario and a stream-based scenario. The first scenario covers situations where a large pool of unlabeled data can be gathered at once (the most common case in previous work; cf. Table 2). In this case, before deciding which instances should be selected in each training iteration, every instance in the pool can be evaluated in terms of its informativeness. The second scenario fits a practical setting in which unlabeled instances are gathered sequentially from actual distributions (e.g., an online sound processing system). In this case, the (active) learner decides whether to keep or discard each instance individually. Unlike the pool-based scenario, the stream-based scheme is more appropriate for situations in which memory or processing power may be limited (e.g., mobile and embedded devices) [16].
A detailed description of the AL strategies used in this paper is shown in Tables 3 and 4. In both strategies we start with a small set of labeled instances S l for training an initial classifier M. With this classifier, we estimate the confidence scores Cs for the instances that are candidates for labeling. In the pool-based scenario, the entire pool of unlabeled instances S u is evaluated, and only those instances with confidence scores equal to or lower than the predefined threshold th a are selected for human annotation. In the stream-based scenario, the instances are analyzed sequentially and selections are made individually. At each iteration, the buffer B is sent to a human annotator as soon as it is filled with instances whose confidence scores are lower than the predefined threshold th a . The threshold th a is determined by the human labeling resources available or by the performance of the current classifier.
Table 3. Certainty-based Active Learning algorithm in a pool-based scenario.
Input:
Classify each instance in S u using classifier M and calculate the confidence score C for each instance.
Select those instances with Cs that are equal to or lower than threshold th a , and submit them to human annotation.
Refer to the new labeled set as S new .
Re-train classifier M using new S l .
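The two selection rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `select_from_pool` and `stream_buffers` and the `confidence` callable are hypothetical, and confidence scores are assumed to lie in [0, 1].

```python
from typing import Callable, Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def select_from_pool(confidences: Sequence[float], th_a: float) -> List[int]:
    """Pool-based rule: indices of instances with confidence <= th_a."""
    return [i for i, c in enumerate(confidences) if c <= th_a]


def stream_buffers(
    stream: Iterable[T],
    confidence: Callable[[T], float],
    th_a: float,
    buffer_size: int,
) -> Iterator[List[T]]:
    """Stream-based rule: keep instances with confidence < th_a and
    hand over the buffer for human annotation each time it is full."""
    buffer: List[T] = []
    for instance in stream:
        # Each instance is accepted or discarded individually.
        if confidence(instance) < th_a:
            buffer.append(instance)
            if len(buffer) == buffer_size:
                yield buffer
                buffer = []
    if buffer:
        # Last, only partially filled buffer.
        yield buffer
```

Because `stream_buffers` is a generator, the caller can re-train the classifier between two yielded buffers; if `confidence` reads from that classifier, later instances are scored with the updated model.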

Semi-supervised Learning
As mentioned, in order to further reduce the need for human annotation and enhance the classification performance, we complement the AL phase with self-training. A detailed description of this strategy is presented in Table 5. First, we train an initial model M using an initial (small) set of human-labeled data S l . Then, we classify the unlabeled instances S u and calculate the confidence scores (as defined later in this paper). Finally, we select those unlabeled instances with confidence scores equal to or greater than a given threshold th s , and add them (together with the respective machine-annotated labels) to the training set for the next iteration.
Table 4. Certainty-based Active Learning algorithm in a stream-based scenario.
Table 5 (doi:10.1371/journal.pone.0162075.t005).
Classify every instance in S u using classifier M and calculate the corresponding confidence score C.
Select those instances with Cs that are equal to or higher than threshold th s , and label them with corresponding predicted categories.
Refer to the machine-labeled set as S new .
Re-train classifier M using the new set S l .
Until model training converges/unlabeled data is unavailable
There are two parameters that need to be set in this strategy: the confidence threshold th s and the size of the initial human-labeled data set |S l |. Regarding the first, which defines the amount of unlabeled data to be selected at each iteration of the algorithm, we have to find a compromise between the impact of adding noisy instances (low th s ) and adding less informative ones (high th s ). Regarding the second, we have to consider that if the set is too small the initial model will have a high classification error rate, and if the set is too large no improvement over the initial model can be expected because there is nothing left to be learned. In this paper, we optimize these parameters as described in the experimental section.
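The self-training loop described in this section can be sketched as follows. This is only an illustration under stated assumptions: `self_train`, `toy_train`, and `toy_predict` are hypothetical names, and the one-dimensional nearest-centroid classifier is a toy stand-in for the classifier used in the paper, chosen so that the sketch runs without external libraries.

```python
from typing import Callable, Dict, List, Tuple

Labeled = List[Tuple[float, str]]
Model = Dict[str, float]


def toy_train(data: Labeled) -> Model:
    """Toy stand-in classifier: one centroid (mean) per class."""
    sums: Dict[str, Tuple[float, int]] = {}
    for x, y in data:
        s, n = sums.get(y, (0.0, 0))
        sums[y] = (s + x, n + 1)
    return {y: s / n for y, (s, n) in sums.items()}


def toy_predict(model: Model, x: float) -> Tuple[str, float]:
    """Return (label, confidence in [0, 1]); assumes at least two classes."""
    dists = sorted((abs(x - m), y) for y, m in model.items())
    (d_near, label), (d_far, _) = dists[0], dists[1]
    total = d_near + d_far
    return label, (1.0 if total == 0 else d_far / total)


def self_train(
    labeled: Labeled,
    unlabeled: List[float],
    train: Callable[[Labeled], Model],
    predict: Callable[[Model, float], Tuple[str, float]],
    th_s: float,
    max_iter: int = 100,
) -> Tuple[Model, Labeled, List[float]]:
    """Add machine-labeled instances with confidence >= th_s, then re-train."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = train(labeled)
    for _ in range(max_iter):
        confident: Labeled = []
        rest: List[float] = []
        for x in unlabeled:
            label, conf = predict(model, x)
            if conf >= th_s:
                confident.append((x, label))
            else:
                rest.append(x)
        if not confident:
            break  # nothing passes the threshold: stop iterating
        labeled.extend(confident)
        unlabeled = rest
        model = train(labeled)
    return model, labeled, unlabeled
```

With a low th s more instances are added per iteration but wrong machine labels become more likely; with a high th s fewer, less informative instances are added, which mirrors the compromise discussed above.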

Combining Active and Semi-supervised Learning
As discussed above, active and semi-supervised learning share the common goal of reducing the human annotation effort by means of selective data sampling. Moreover, they share the same criterion for data sampling: the confidence score. The difference is that they achieve their goals from opposite 'ends': active learning samples data with low classifier confidence, while semi-supervised learning samples data with high confidence. Thus, it comes naturally to combine them for more efficient model learning. Our proposed approach is as follows.
By using two given confidence thresholds th ssaL and th ssaH , the candidate instances that are evaluated for labeling can be split into two subsets: one containing instances whose confidence scores are lower than th ssaL , and another containing instances whose confidence scores are equal to or higher than th ssaH . The former subset is selected for human labeling, and the latter for machine labeling. This approach can be referred to as Semi-Supervised Active Learning (SSAL), since it combines in tandem the standard fully supervised AL strategy with a bootstrapping SSL strategy (i.e., self-training). SSAL is formally described in Tables 6 and 7 for pool-based and stream-based scenarios, respectively.
Table 6. Semi-Supervised Active Learning in a pool-based scenario.
Input:
S l : small set of labeled instances
Classify every instance in S u using classifier M and calculate the corresponding confidence score C.
Select instances with Cs lower than th ssaL from S u and submit them to human annotation.
Refer to the new labeled set as S a new .
Select those instances with Cs equal to or higher than th ssaH , and add the corresponding predicted labels.
Refer to the machine-labeled set as S s new .
In the pool-based scenario, at every learning iteration, we incrementally extend the initial training set with a set of human-labeled instances (those with confidence scores lower than the threshold th ssaL ) and a variable number of machine-labeled instances (those with confidence scores equal to or higher than the threshold th ssaH ). As can be observed from Table 6, there are twice as many model re-training operations in each learning iteration compared to the individual AL and self-training approaches. In our approach, we first re-train the model with the human-labeled data set S a new (AL phase), and then produce the machine-labeled data set S s new (SSL phase). This design aims at improving the quality of the data set S s new by making use of a model previously trained with reliable (human) labels. This is very important for the SSL phase, since having the model trained first with reliable annotations from the AL phase decreases the amount of noisy data (instances with potentially wrong labels assigned), and thus helps avoid the performance deterioration that can occur in the SSL phase. The same approach for avoiding noisy data is adopted in the stream-based scenario (see Table 7). Additionally, we continuously fill the buffer B with new instances. Once the buffer is full, the two confidence thresholds th ssaL and th ssaH are applied for data splitting.
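One pool-based SSAL iteration, with the AL phase preceding the SSL phase as described above, might look like the following sketch. All names (`ssal_iteration`, `oracle`, `toy_train`, `toy_predict`) are hypothetical: `oracle` stands in for the human annotator, and the one-dimensional nearest-centroid classifier is a toy replacement for the real classifier so that the example is self-contained.

```python
from typing import Callable, Dict, List, Tuple

Labeled = List[Tuple[float, str]]
Model = Dict[str, float]


def toy_train(data: Labeled) -> Model:
    """Toy stand-in classifier: one centroid (mean) per class."""
    groups: Dict[str, List[float]] = {}
    for x, y in data:
        groups.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in groups.items()}


def toy_predict(model: Model, x: float) -> Tuple[str, float]:
    """Return (label, confidence in [0, 1]); assumes at least two classes."""
    dists = sorted((abs(x - m), y) for y, m in model.items())
    (d_near, label), (d_far, _) = dists[0], dists[1]
    total = d_near + d_far
    return label, (1.0 if total == 0 else d_far / total)


def ssal_iteration(
    labeled: Labeled,
    pool: List[float],
    train: Callable[[Labeled], Model],
    predict: Callable[[Model, float], Tuple[str, float]],
    oracle: Callable[[float], str],
    th_low: float,
    th_high: float,
) -> Tuple[Model, Labeled, Labeled, Labeled, List[float]]:
    """Run one AL phase followed by one SSL phase on the pool."""
    labeled = list(labeled)
    model = train(labeled)

    # AL phase: low-confidence instances go to the human annotator.
    human_set: Labeled = []
    remaining: List[float] = []
    for x in pool:
        if predict(model, x)[1] < th_low:
            human_set.append((x, oracle(x)))
        else:
            remaining.append(x)
    labeled.extend(human_set)
    model = train(labeled)  # first re-training, on reliable labels

    # SSL phase: the re-trained model labels high-confidence instances.
    machine_set: Labeled = []
    rest: List[float] = []
    for x in remaining:
        label, conf = predict(model, x)
        if conf >= th_high:
            machine_set.append((x, label))
        else:
            rest.append(x)
    labeled.extend(machine_set)
    model = train(labeled)  # second re-training
    return model, labeled, human_set, machine_set, rest
```

The two calls to `train` inside one iteration correspond to the doubled number of re-training operations noted above; instances whose confidence falls between the two thresholds stay in the pool for later iterations.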

Database and Acoustic Features
For the purpose of this work, we use the FindSounds database (http://www.findsounds.com/types.html, accessed on 25 July 2011), which provides a large amount of varied real-life sounds already categorized. In order to better suit our study and avoid very unbalanced class distributions, we discarded those categories with only a few instances (insects, with 7 subsets,
Table 7. Semi-Supervised Active Learning in a stream-based scenario.
Input:
S l : small set of labeled instances
S u : large stream of unlabeled instances
M: initial classifier trained by S l
B: fixed buffer
th ssaL , th ssaH : confidence thresholds
Do
Classify current instance from S u using classifier M and calculate its confidence score C.
Retain current instance in buffer B.
if Buffer B is full
Select those instances with Cs lower than th ssaL from B and submit them to human annotation.
Refer to the human-labeled set as S a new .
*Re-train classifier M using the new set S l , and re-classify the remaining instances in B.
Automatically label those instances with Cs higher than th ssaH in B with predicted labels.
In total, there are 16,930 sound instances in our database with durations ranging from 1 to 10 seconds, which correspond to (approximately) 15 hours of environmental sounds. All sound files were converted into raw 16 bit encoding, mono-channel, and 16 kHz sampling rate, as various formats and rates were used in the original versions retrieved from the web. The details of the database and categories used are shown in Table 8. Throughout this paper we will refer to the database as FINDSOUNDS. (The whole database together with the corresponding labels can be downloaded for research and academic purposes from https://www.dropbox.com/sh/nmw4ef7ma5ok8df/AACnx63TtkrwXyHyiJ0FpSw8a?dl=0.)
In order to evaluate the effectiveness of the new method proposed in this paper, we adopted the baseline audio feature set used in the Audio/Visual Emotion Challenge (AVEC) 2012. This feature set comprises 1,841 features that result from a systematic combination of 25 energy- and spectral-related low-level descriptors (LLDs) with 42 functionals, 6 voicing-related LLDs with 32 functionals, 25 delta coefficients of energy/spectral-related LLDs with 23 functionals, 6 delta coefficients of voicing-related LLDs with 19 functionals, and 10 voiced/unvoiced durational features (for full details on the feature set please refer to [39]). All features and functionals were extracted with the OpenSMILE toolkit [40].

Experiments and Results
In this section, we describe a series of experiments conducted with the purpose of empirically investigating the effectiveness of three learning methods in the context of sound classification: 1) certainty-based AL; 2) SSL; and 3) our proposed method, SSAL.

Experimental Setup
For every experiment presented in this paper, we run a 10-fold cross validation (the split is 90% for training, 10% for testing) to obtain stable estimates of the algorithm's performance. As the evaluation metric we compute the unweighted average recall (UAR), i.e., the sum of the accuracies per class divided by the number of classes, without consideration of the number of instances per class. In the figures below, the UARs over the 10 rounds are shown along with standard deviation bars. All experiments use the FINDSOUNDS corpus introduced in the previous section. In order to deal with the imbalance between the number of instances in each category (or class distributions), we employ data oversampling in the training set in order to add more instances belonging to the less represented classes. Oversampling is performed in WEKA [41] using the Synthetic Minority Over-sampling Technique (SMOTE) [42] (WEKA default settings are used). Specifically, SMOTE oversamples by creating "synthetic" examples for the minority class. It takes each minority class sample and produces synthetic examples making use of its k nearest neighbors within the minority class. Depending upon the amount of oversampling required, neighbors from the k nearest neighbors are randomly chosen. Our experimental setup uses 5 nearest neighbors. Synthetic samples are generated in the following way: take the difference between the feature vector (sample) under consideration and its nearest neighbor, multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This approach effectively forces the decision region of the minority class to become more general.
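The UAR metric and the SMOTE interpolation step described above can be illustrated with the following sketch. It is a simplified stand-in written for this explanation, not the WEKA implementation used in the experiments: the function names are hypothetical, and `smote` picks a random base sample for each synthetic example rather than iterating over every minority sample.

```python
import random
from typing import List, Optional, Sequence


def unweighted_average_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of the per-class recalls, ignoring class sizes."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def smote_sample(x: Sequence[float], neighbor: Sequence[float], rng: random.Random) -> List[float]:
    """x + gap * (neighbor - x), with one random gap in [0, 1)."""
    gap = rng.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


def smote(
    minority: List[List[float]],
    n_new: int,
    k: int = 5,
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Create n_new synthetic minority samples (needs >= 2 minority samples)."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Sort the other minority samples by squared Euclidean distance to x.
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbor = rng.choice(others[:k])  # one of the k nearest neighbors
        synthetic.append(smote_sample(x, neighbor, rng))
    return synthetic
```

Each synthetic sample lies on the line segment between a minority sample and one of its neighbors, which is why the minority decision region becomes more general instead of merely denser at the existing points.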
As classifier we use Support Vector Machines (SVM) [43] with linear kernels and pairwise multi-class discrimination, trained with sequential minimal optimization (as implemented in the WEKA framework [41]). SVMs are supervised learning models based on the concept of decision hyperplanes that define decision boundaries, i.e., hyperplanes in a multidimensional space that separate sets of elements based on class memberships. The output value of an SVM is the distance of a specific point from the separating hyperplane, whereas a central aspect of our AL approach is the calculation of confidence scores. There are various parametric and nonparametric approaches to convert these distances to probability estimates within the range [0, 1]. In this work, we employed the parametric logistic regression method proposed in [44], which is one of the most frequently used approaches to transform the output distances of SVMs into (pseudo) probabilistic values [23,45,46]. This method assumes that the posterior probability takes the form of a sigmoid function with parameters A and B:
$$P(y \mid f(x)) = \frac{1}{1 + \exp(A f(x) + B)} \quad (1)$$
which maps the value f(x) into probability estimates P(y|f(x)). For each instance, the sum of the posterior probabilities over all classes is equal to 1. This probability indicates the classifier's confidence in the predicted label. We then define the confidence score of x on the basis of this probability (Eq (2)).
Additionally, in the context of pool-based AL and of the AL phase in the SSAL experiments, instead of using a threshold mechanism for data splitting as described in Tables 3, 6 and 7, we select the 500 instances with the lowest confidence scores for human annotation in each learning round. For stream-based AL as described in Table 4, we set the instance buffer size to 500 for the sake of consistency. The reason is to fix the number of human-labeled instances in each learning iteration, which gives a unified basis for comparing the performance of the different learning methods.
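As a rough illustration of how an SVM output distance can be turned into a confidence score, the sketch below uses the commonly used sigmoid form 1 / (1 + exp(A f(x) + B)) with parameters A and B. Two points are assumptions of this sketch rather than statements from the text: the function names are hypothetical, and the confidence score is taken to be the posterior probability of the predicted (most probable) class. The fitting of A and B and the pairwise coupling of class probabilities are not shown.

```python
import math
from typing import Dict, Tuple


def platt_probability(f: float, A: float, B: float) -> float:
    """Map an SVM output f to a pseudo-probability via a sigmoid.

    A and B are assumed to have been fitted beforehand on held-out data;
    A is typically negative so that larger f gives a higher probability.
    """
    return 1.0 / (1.0 + math.exp(A * f + B))


def confidence_score(posteriors: Dict[str, float]) -> Tuple[str, float]:
    """Assumed definition: posterior of the most probable class.

    `posteriors` maps each class label to its probability; the values
    are expected to sum to 1.
    """
    return max(posteriors.items(), key=lambda kv: kv[1])
```

For example, `confidence_score({"dog": 0.7, "cat": 0.2, "car": 0.1})` picks the label "dog" with a confidence of 0.7; an instance like this would be a candidate for machine labeling under a threshold of 0.6 but would not pass a threshold of 0.8.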

Confidence Scores Evaluation and Distribution
The learning methods proposed in this paper are based on two assumptions. First, the confidence scores (cf. Eq (2)) are good indicators of the classifier's output certainty level. This is essential to ensure that the instances with the lowest classification certainty (low confidence scores) are selected to be delivered for human annotation, and the instances with high classification certainty (high confidence scores) are directly added to training data set with labels automatically given by the machine annotator. Second, only a small portion of the unlabeled instances are classified with low certainty, otherwise human effort cannot be dramatically reduced.
Before starting our experiments, it is relevant to evaluate whether these two assumptions are in fact supported. To do so, we train an SVM classifier with 500 and 5,000 instances. As shown in Fig 1, an increase in the UAR of the classifier is matched by an increase in the confidence scores. Moreover, when the classifier is trained with more labeled instances, the confidence scores tend to better reflect the classifier's UAR. Hence, the classifier confidence scores seem to reflect well the classifier's certainty level regarding the corresponding classification results. In relation to the second assumption, as shown in Fig 2, the majority of unlabeled instances are classified with high confidence values. It is also evident that the classifier initially trained with more labeled instances tends to classify more unlabeled instances with higher confidence levels. Therefore, only a small portion of the unlabeled data is classified with low certainty.

Active Learning Experiments
In the certainty-based AL pool-based scenario, we use the same set of 500 samples preselected in the previous section to train the initial classifier. Then, in order to study the evolution of classification performance, we incrementally select, and manually label, 500 instances per iteration from the pool of remaining data (14,737 instances) for model re-training until all data is labeled. The learning curves (UAR vs. number of instances added) for the AL method are shown in Fig 3. Additionally, we also show the results for a passive learning (PL) method (i.e., instances are randomly selected for labeling) for the sake of comparison. As can be observed from Fig 3, the AL method effectively reduces the amount of human annotations needed to achieve a given UAR. For instance, the PL method reaches its top classification UAR (up to statistical significance) of 68.5% when using 11,500 instances (75.5% of the total number of instances in the data pool), while the AL approach reaches the same UAR with 43.5% less labeled data (6,500 instances). The best UAR (up to statistical significance) with AL, 69.3%, is achieved with only 7,500 manually labeled instances (49.2% of the total number of instances in the data pool), and is statistically significantly higher than that of PL (p-value = 0.0326, two-sample Kolmogorov-Smirnov test).
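The fixed-size query used in these pool-based experiments (the 500 least confident instances per round, instead of a threshold) can be sketched as follows; the function name is hypothetical and the confidence scores are assumed to be given as a plain sequence.

```python
from typing import List, Sequence


def lowest_confidence_batch(confidences: Sequence[float], batch_size: int = 500) -> List[int]:
    """Return the indices of the batch_size least confident pool instances."""
    # sorted() is stable, so ties keep their original pool order.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return order[:batch_size]
```

If fewer than `batch_size` instances remain in the pool, all of them are returned, which matches the handling of the last iteration.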
In order to simulate the stream-based scenario, we continuously sample instances from the candidate set, one by one, in a random fashion. We decide to accept or discard the selected instance immediately after sampling. Those with confidence scores lower than the given threshold are accepted and added to the buffer. As soon as the buffer is full (500 instances), the selected instances are delivered for human annotation, and finally added to the training data set (together with the respective labels). The model is then re-trained and the same process is repeated. In most cases, however, the buffer cannot be filled up in the last iteration; the instances selected so far are then still manually labeled and used for model training. Based on the analysis of the confidence score distribution shown earlier in Fig 2, which shows that only a few instances fall in the interval between 0.0 and 0.4, we decided to test five different values of the threshold th a : 0.5, 0.6, 0.7, 0.8, and 0.9. Additionally, for the sake of comparison, we also tested the PL method, whereby instances are randomly selected (which can be considered as a stream-based AL process with 1.0 as confidence threshold). The results are shown in Fig 4. From Fig 4, we can see that the AL approach with any of the five threshold levels leads to better classification performance with a smaller amount of labeled instances (compared to the PL approach). Furthermore, AL with a lower threshold performs better than with a higher threshold, which indicates that selecting instances that are more informative can lead to better performance with less annotation effort. However, a lower threshold also means a larger amount of discarded unlabeled instances, which is why the learning curves with lower thresholds stop earlier: fewer instances are used for training. Therefore, the value of the threshold should be carefully tuned according to the specific application. 
Quantitatively, in the best-case scenario, to achieve the top classification UAR (up to statistical significance) of PL (68.5%, with 11,500 instances labeled), the AL method with a threshold of 0.9 requires only 6,500 instances to be annotated (43.5% fewer than PL). Therefore, AL efficiently reduces the need for human annotation while achieving the same performance as PL.
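The reported reduction follows directly from the two instance counts. The short check below reproduces it; the helper name `relative_reduction` is introduced here for illustration only.

```python
def relative_reduction(baseline, reduced):
    """Percentage by which `reduced` falls below `baseline`."""
    return 100.0 * (baseline - reduced) / baseline


pl_labels, al_labels = 11_500, 6_500   # instance counts quoted in the text
print(round(relative_reduction(pl_labels, al_labels), 1))  # 43.5
```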

Semi-supervised Learning Experiments
In this section, we evaluate the SSL method described in Table 5. Four initial training data sizes (i.e., 500, 1,000, 2,000, and 5,000) and six thresholds th_s (i.e., 0.6, 0.7, 0.8, 0.9, 0.95, and 1.0) are considered here. Note that with a threshold of 1.0, no machine-labeled instances are added to the initial training data set. Additionally, in each case, the learning iterations continue until no more unlabeled data is available.
The classification UAR figures for the different tests are depicted in Fig 5. As can be seen, the best UAR with 500 human-labeled instances is achieved with a threshold of 0.95, while for the other initial training set sizes the best UARs are achieved with a threshold of 0.8. This result may indicate that using less data to train the initial classifier requires a higher confidence threshold in order to guarantee the quality of machine labeling. With more data to train the initial classifier, the UAR of the classifier is likely to increase, and lower confidence thresholds seem to ensure the informativeness of the instances.
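The role of the threshold in one self-training iteration can be sketched as below. This is a simplified illustration, not the procedure of Table 5: the function name, the tuple layout, and the example labels are hypothetical, and the strict comparison (confidence above the threshold) is an assumption chosen so that a threshold of 1.0 adds no machine-labeled instances, as stated in the text.

```python
def self_training_split(predictions, threshold):
    """Split unlabeled instances into machine-labeled and still-unlabeled.

    predictions: list of (instance_id, predicted_label, confidence) tuples.
    Instances with confidence strictly above `threshold` receive the
    predicted label; all others stay unlabeled for the next iteration.
    """
    machine_labeled = [(i, y) for i, y, c in predictions if c > threshold]
    remaining = [i for i, _, c in predictions if c <= threshold]
    return machine_labeled, remaining


# Hypothetical predictions for three unlabeled instances.
preds = [("a", "dog", 0.97), ("b", "rain", 0.62), ("c", "door", 0.85)]
print(self_training_split(preds, 0.8))  # ([('a', 'dog'), ('c', 'door')], ['b'])
print(self_training_split(preds, 1.0))  # ([], ['a', 'b', 'c'])
```

A lower threshold moves more instances into the machine-labeled set per iteration, which is where noisy labels can enter the training data.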

Semi-supervised Active Learning Experiments
The effectiveness of the active and semi-supervised learning methods has been evaluated separately in the previous two sections. Both methods showed advantages in boosting the initial classification performance while reducing manual labeling effort. In this section, we focus on assessing the combination of the two learning methods (the new method proposed in this paper) for both pool-based and stream-based scenarios.
In the pool-based scenario, we use the same 500 instances as in the previous active learning experiments for initial model training, and then incrementally select new instances from the remaining pool (14,737 instances) for either human or machine annotation. Specifically, in each round 500 instances are selected for human labeling and a variable number of instances with confidence scores above a given threshold are selected for machine labeling. In the last iteration, once fewer than 500 instances are available for selection, human annotators label them all for model re-training. Fig 6 shows the classification performance of the SSAL method with a threshold of 0.95, as well as that of the AL and PL methods. As can be observed in Fig 6, the SSAL method achieves a classification UAR similar to that of AL (69.4% for SSAL vs. 69.3% for AL), and outperforms PL by about 0.9 percentage points (69.4% for SSAL vs. 68.5% for PL; p-value = 0.0173, two-sample Kolmogorov-Smirnov test). Moreover, the classification performance curve for SSAL stops earlier than the other two, since a larger number of instances are labeled at each iteration. In order to achieve the best performance of the PL method (68.5%; 11,500 human-labeled instances), SSAL requires only 5,500 human-labeled instances, 52.2% fewer than PL and 15.4% fewer than AL (6,500).
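One round of the pool-based SSAL selection described above can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: the name `ssal_pool_round` is hypothetical, and the order of operations (human batch chosen first, machine labeling applied to the rest) is an assumption of this sketch.

```python
def ssal_pool_round(confidences, threshold=0.95, human_batch=500):
    """Split one round of pool instances between human and machine annotation.

    confidences: dict mapping instance_id -> classifier confidence.
    Returns (human_ids, machine_ids). If fewer than `human_batch` instances
    are left, all of them go to the human annotators.
    """
    ranked = sorted(confidences, key=confidences.get)   # least confident first
    if len(ranked) < human_batch:
        return ranked, []
    human_ids = ranked[:human_batch]
    machine_ids = [i for i in ranked[human_batch:] if confidences[i] > threshold]
    return human_ids, machine_ids


# Toy pool; a human batch of 2 stands in for the 500 used per round.
toy_pool = {"a": 0.99, "b": 0.30, "c": 0.96, "d": 0.55, "e": 0.80}
print(ssal_pool_round(toy_pool, threshold=0.95, human_batch=2))
# (['b', 'd'], ['c', 'a'])
```

Instance "e" (confidence 0.80) is neither uncertain enough for the human batch nor confident enough for machine labeling, so it stays in the pool for a later round.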
In order to evaluate the impact of the confidence threshold on SSAL in the pool-based scenario, we tested three values: 0.60, 0.80, and 0.95. The results are shown in Fig 7. With a threshold of 0.60, many of the selected instances are labeled by the machine and the classification performance is worse than in the other two cases. A threshold of 0.80 leads to a classification performance curve similar to that of 0.95, but its curve stops earlier, at a lower performance level, because more instances are delivered to the machine for annotation. Therefore, a threshold of 0.95 is preferred in our experiments. Furthermore, these tests indicate that the tuning of the threshold level is critical for the optimization of the learning process.
In the stream-based scenario, we once more started with 500 instances for the training of the initial model. In order to simulate a steady stream of incoming data, we randomly sampled new instances, one at a time, from the remaining set (14,737 instances) until the buffer was full (1,000 instances). At this point, we selected the 500 instances with the lowest confidence scores for human annotation, and the 100 instances with the highest confidence scores for machine annotation. [...] PL approaches. In particular, for the same number of human-labeled instances (6,000 instances), SSAL leads to a 10.0% increase in UAR (up to statistical significance) relative to AL (p-value = 0.0446, two-sample Kolmogorov-Smirnov test). Moreover, it reaches the best performance of PL (68.5%) with 52.2% less human effort (i.e., using only 5,500 labeled instances).
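The split of a full buffer in the stream-based SSAL scenario can be sketched as below. The function name `ssal_stream_split` is hypothetical, and what happens to the instances that are neither human- nor machine-labeled is not specified in this passage, so the sketch simply returns them as a third group.

```python
def ssal_stream_split(buffer_conf, n_human=500, n_machine=100):
    """Split a full buffer: lowest confidence to humans, highest to the machine.

    buffer_conf: dict mapping instance_id -> classifier confidence.
    Returns (human_ids, machine_ids, leftover_ids).
    """
    ranked = sorted(buffer_conf, key=buffer_conf.get)   # least confident first
    human_ids = ranked[:n_human]
    rest = ranked[n_human:]
    n_m = min(n_machine, len(rest))
    machine_ids = rest[len(rest) - n_m:]                # most confident of the rest
    leftover_ids = rest[:len(rest) - n_m]
    return human_ids, machine_ids, leftover_ids


# Toy buffer of ten instances; sizes 5 and 1 stand in for 500 and 100.
toy_buffer = {i: i / 10 for i in range(10)}
print(ssal_stream_split(toy_buffer, n_human=5, n_machine=1))
# ([0, 1, 2, 3, 4], [9], [5, 6, 7, 8])
```

With the sizes used in the experiments (a buffer of 1,000, of which 500 go to human and 100 to machine annotation), 400 instances per buffer fall into the third group.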
In Table 9, we summarize the best performances (up to statistical significance) for all methods evaluated (SSAL, AL, and PL) in the pool-based and stream-based scenarios, as well as the number of human-labeled instances needed to achieve that performance. Specifically, in each learning iteration, AL and the AL phase of SSAL select 500 instances for human annotation in both scenarios, the SSL phase of pool-based SSAL selects all instances with confidence scores higher than 0.95 for machine annotation, and the SSL phase of stream-based SSAL selects the 100 instances with the highest confidence scores for machine annotation. As can be observed, SSAL effectively reduces the human labeling effort.
Table 9. Best performances up to statistical significance achieved using semi-supervised active learning (SSAL), active learning (AL), and passive learning (PL) in pool-based and stream-based scenarios, as well as the number of human-labeled instances (#HLI) needed to achieve that performance.
Fig 8. Learning curves for semi-supervised active learning (in each round 500 instances with the lowest confidence scores are selected for human annotation, and 100 instances with the highest confidence scores are selected for machine annotation), active learning, and passive learning in the stream-based scenario. doi:10.1371/journal.pone.0162075.g008

Conclusion
In this paper, we proposed to combine Active Learning and Self-Training in tandem, with the aim of bridging the gap between the desire for sufficient amounts of training data and the scarcity of labeled data in the context of sound classification. In this method, we exploited human and machine labeling with the goal of minimizing the human labeling effort: humans were asked to selectively label those instances that the machine was most uncertain about, and the machine automatically labeled those instances that it could predict with a high confidence level. In order to evaluate the certainty of the labels predicted by the machine annotator, we used a classifier confidence score to determine the informativeness of the labeled instances, which, as demonstrated, is a good indicator of the classifier's certainty about the classification results.
Our proposed method was evaluated on a database with 16,930 instances in both pool-based and stream-based scenarios. Furthermore, we compared our method to Active Learning, Self-Training and Passive Learning. Results show that Active Learning requires significantly less human-labeled data than Passive Learning to achieve the same UAR, and that Semi-Supervised Active Learning outperforms both of these methods in terms of classification performance and the number of human-labeled instances necessary to achieve such performance. In both the pool-based and stream-based scenarios, the Semi-Supervised Active Learning approach allowed us to reduce by 52.2% the amount of human annotation necessary to achieve the best performance of all other methods tested.
While demonstrating the effectiveness of our method, it also became evident that, for a successful application of Semi-Supervised Active Learning, the tuning of the confidence threshold is crucial. As we have shown, performance deterioration can occur due to the inclusion of noisy machine-labeled data in the training set. Also, if too many instances are machine-labeled, the classifier performance may never reach a satisfactory level, given that very few instances are left for human labeling (considered to be more reliable). Therefore, an optimization process that searches for an appropriate threshold is fundamental for the application of Semi-Supervised Active Learning. This tuning is certainly task-specific, as it will depend on the complexity of the classification problem (and the respective confidence levels) and on the objectivity of the ground truth or gold standard (which affects the quality of the labels). While the current fixed-threshold strategy may not be suitable for other classification tasks, one can refer to [47], [48] and the references therein for more sophisticated thresholding and selection criteria that balance the trade-off between asking for human labels and accepting machine labels.
Finally, while in this paper we demonstrated the effectiveness of Semi-Supervised Active Learning in largely reducing the need for human annotation in the context of sound classification, the non-task-specific nature of the proposed algorithm means that our method can also be applied to other classification scenarios. In particular, this methodology fits applications in hybrid learning environments where the machine is required to continuously increase and adapt its knowledge about the acoustic environment, as well as to learn in cooperation with humans.

Author Contributions
Conceptualization: WH BS.