Continuous robust sound event classification using time-frequency features and deep learning

The automatic detection and recognition of sound events by computers is a requirement for a number of emerging sensing and human-computer interaction technologies. Recent advances in this field have been achieved by machine learning classifiers working in conjunction with time-frequency feature representations. This combination has achieved excellent accuracy for classification of discrete sounds. The ability to recognise sounds under real-world noisy conditions, called robust sound event classification, is an especially challenging task that has attracted recent research attention. Another aspect of real-world conditions is the classification of continuous, occluded or overlapping sounds, rather than classification of short isolated sound recordings. This paper addresses the classification of noise-corrupted, occluded, overlapped, continuous sound recordings. It first proposes a standard evaluation task for such sounds based upon a common existing method for evaluating isolated sound classification. It then benchmarks several high performing isolated sound classifiers to operate with continuous sound data by incorporating an energy-based event detection front end. Results are reported for each tested system using the new task, to provide the first analysis of their performance for continuous sound event detection. In addition it proposes and evaluates a novel Bayesian-inspired front end for the segmentation and detection of continuous sound recordings prior to classification.


Introduction
Sound event classification requires a trained system, when presented with an unknown sound, to correctly identify the class of that sound. Robust sound event classification specifically introduces real-world complications into the classification task, most notably interfering acoustic noise, sounds occluded by overlap and event detection. Recent years have seen a significant can be found at http://dx.doi.org/10.17504/protocols.io.iw5cfg6.

Funding: This work was supported by the Huawei Innovation Research Program under Machine Hearing and Perception Project Contract No. YB2012120147. Huawei Technologies provided support in the form of salaries for author WX, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific role of this author is articulated in the 'author contributions' section.

Competing interests:
We have the following interests: Wei Xiao is employed by Huawei Technologies Duesseldorf GmbH, and has been contributing time, effort and practical advice to this research in an effort to help ensure its eventual success. While he is employed by the same company as the funder, he is a coauthor on the basis of his defined intellectual input to this work in the area of signal processing methods, analytical techniques specifically related to feature dimension reduction. There are no patents, products in development or marketed products to declare. This does not alter our adherence to all the PLOS ONE policies on sharing data and materials, as detailed online in the guide for authors.
As an example, in smart cities or in automated surveillance of public spaces, a computer could infer events from audible information using audio sensors that are lower cost, require less networking bandwidth, consume less power, and are potentially more robust and less easily obscured by weather, dust or pollution than video sensors. They also have the ability to sense non-line-of-sight events and are likely to enjoy a lower computational burden for automated processing than moving image data. When used in a future smart city environment, networked audio sensors could be deployed city-wide at relatively low cost. At the very least, automated audio event detection could alert city staff to view appropriate video footage; at best it could trigger automated responses appropriate to the inferred events. The same is true of smart-home environments, or in security monitoring. As a human-computer interfacing aid, machine hearing allows a speech-based dialogue system to react to auditory events in a similar way to humans. Reactions could range from pausing dialogue in response to sounds and repeating words obscured by sounds, to reacting appropriately to sounds as diverse as alarms, laughter, sneezes, screams, smashing glass, dog barks and car horns. In fact there are many identifiable everyday sounds that, during a conversation, one would normally expect both conversing parties to react to. For truly natural speech dialogue between human and computer, the computer should be expected to react to similar events as a human, and this implies machine hearing capabilities.

Continuous robust audio event detection task
The evaluation task. The evaluation task used in this paper builds upon the standard isolated sound evaluation task first reported by Dennis et al. [12,13]. The advantage of having a standard evaluation is that it is repeatable by others, and eases the comparison of results when other authors make use of the same method to evaluate their research [11,14,15]. The task uses freely available sound recordings from the Real World Computing Partnership (RWCP) Sound Scene Database in Real Acoustic Environments [20], with robustness evaluation performed by mixing these sounds with background noises from the NOISEX-92 database at several signal-to-noise ratio (SNR) levels.
For the 'traditional' isolated sound evaluation, 50 sound classes, each comprising 80 recordings, are selected from the RWCP database. All sounds have both lead-in and lead-out silence sections and have no added noise. For each class, 50 randomly-selected files are used for training, with the remaining 30 reserved for evaluation. When cross-verifying, different selections of files are made. The arrangement and procedure for the isolated sound evaluation task can be found at http://www.lintech.org/machine_hearing with baseline code at [31].
Evaluation is performed separately and reported separately for clean sounds and those corrupted by additive noise. Noise-corrupted tests use four background noise environments selected from the NOISEX-92 database, namely "Destroyer Control Room", "Speech Babble", "Factory Floor 1" and "Jet Cockpit 1". These environments were chosen as described by Dennis [12] to be realistic examples of non-stationary noise with predominantly low-frequency components.
To evaluate noisy conditions, one of the four NOISEX-92 recordings is randomly selected, a random starting point is identified within the noise file, and the noise is then added sample-wise to the sound file. SNR is calculated over the entire noise and sound file in each case. Four separate test databases are created: one for clean sounds (i.e. no added noise), and three for noise mixtures with SNRs of 20, 10 and 0 dB.
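The mixing step described above can be sketched as follows. This is a minimal illustration and not the published procedure: the function name `mix_at_snr`, the use of NumPy, and the choice to measure noise power over the selected excerpt (rather than over the whole noise file) are assumptions made for the example.

```python
import numpy as np

def mix_at_snr(sound, noise, snr_db, rng):
    """Add a randomly positioned noise excerpt to `sound` at the requested SNR (dB)."""
    sound = np.asarray(sound, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(sound):
        raise ValueError("noise recording must be at least as long as the sound")
    # Random starting point within the noise file.
    start = rng.integers(0, len(noise) - len(sound) + 1)
    excerpt = noise[start:start + len(sound)]
    # Assumption: power is measured over the whole sound and the whole excerpt.
    p_sound = np.mean(sound ** 2)
    p_noise = np.mean(excerpt ** 2)
    # Gain chosen so that 10*log10(p_sound / power of scaled noise) equals snr_db.
    gain = np.sqrt(p_sound / (p_noise * 10.0 ** (snr_db / 10.0)))
    # Sample-wise addition of the scaled noise.
    return sound + gain * excerpt
```

Calling the function once per file with `snr_db` set to 20, 10 and 0 would give the three noisy conditions, with the unmodified files forming the clean condition.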
For evaluation of continuous robust audio event detection, a new standard task is defined using the same auditory data as discussed above. Specifically, 100 separate 60 second sound vectors are created. Into each sound vector, 15 randomly selected instances from the 1500 test files (i.e. 30 examples from each of the 50 classes) are added at random positions. Finally, background noise is added in the normal way at the specified SNRs.
There are thus four testing databases (clean, 20, 10 and 0 dB) each comprising a set of 100 different 60 s evaluation recordings. This process is illustrated in Fig 1, while a visualisation of one of the 100 recordings generated through this process is given in Fig 2, showing the times during which each of the 15 randomly selected sounds (chosen from the 50 classes) are present within the recording.
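The construction of one continuous recording can be sketched as below. The function name `build_recording`, the caller-supplied sampling rate `fs`, and the choice to draw the 15 events without replacement are assumptions for illustration; the text specifies only the counts, the duration and the random placement. Events are summed into the vector, so events that happen to coincide overlap rather than overwrite each other.

```python
import numpy as np

def build_recording(events, fs, rng, duration_s=60.0, n_events=15):
    """Place randomly chosen events at random positions in a silent vector."""
    total = int(round(duration_s * fs))
    recording = np.zeros(total)
    annotations = []  # (event index, onset sample, offset sample)
    # Assumption: events are drawn without replacement from the test pool.
    chosen = rng.choice(len(events), size=n_events, replace=False)
    for idx in chosen:
        event = np.asarray(events[idx], dtype=float)
        onset = int(rng.integers(0, total - len(event) + 1))
        # Additive placement, so overlapping events are mixed together.
        recording[onset:onset + len(event)] += event
        annotations.append((int(idx), onset, onset + len(event)))
    return recording, annotations
```

Repeating this 100 times and then adding background noise at each SNR would yield the four test databases; the returned annotations provide the ground truth against which detections are scored.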
All of the test parameters and settings are summarised in Table 1, and the details of the files and steps required to create the test databases have been published and are available at http://dx.doi.org/10.17504/protocols.io.iw5cfg6.
Performance is assessed in terms of precision and recall. Precision P computes the proportion of all detected sounds that are of the correct class. This score evaluates how accurate the classification decisions are, but does not evaluate the performance of the detection process since it does not account for sound events that were not detected (and hence not classified). Recall R, by contrast, computes the proportion of detected sound events out of the total number of sound events. As is common in the literature, we make use of an F-measure to combine these, $F_1 = 2(P^{-1} + R^{-1})^{-1}$, and will use this in particular to explore trade-offs between precision and recall.
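As a worked illustration of these measures, the sketch below computes P, R and the F-measure from simple counts. The function name and the counts are hypothetical, and it assumes that recall is scored on correctly classified detections, which is the usual convention but is not spelled out above.

```python
def precision_recall_f1(n_correct, n_detected, n_events):
    """Return (P, R, F1) from counts of correct detections, detections and true events."""
    precision = n_correct / n_detected if n_detected else 0.0
    # Assumption: recall counts only correctly classified detections.
    recall = n_correct / n_events if n_events else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    # Harmonic mean: F1 = 2 / (1/P + 1/R)
    f1 = 2.0 / (1.0 / precision + 1.0 / recall)
    return precision, recall, f1
```

For example, 12 correct detections out of 14 detections in a recording containing 15 events gives P = 12/14, R = 12/15 and F1 = 24/29, or roughly 0.83.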

Classifiers
This section will separately describe the following classifiers: MFCC with HMM [13], then SIF with SVM [14], SIF with DNN [14] and SIF with CNN [15] using energy-based event detection criteria. Finally, the Bayesian Inference Criteria (BIC) segmentation detector will be described.

MFCC-HMM
MFCC features are extracted from 10 ms analysis frames with a 50% overlap. The first 12 MFCCs are concatenated with their frame-wise differential (Δ) and second differential (ΔΔ). A separate hidden Markov model (HMM) is then trained for each class in the evaluation data set. For continuous sound testing, the Viterbi algorithm is used to explore all possible state sequences to decode the observed test file feature sequences, obtaining the most probable model explanation.

SIF with SVM, DNN and CNN
This section describes the spectrogram image feature (SIF) as used with the various classifiers. The structures of the feature extraction and classification stages are compared in Fig 3, in particular for the DNN and CNN [14,15]. The diagram shows the formation of the spectrogram and energy information into a matrix which is denoised and then formed into features. The DNN feature vector is formed from a rectangular region that is reshaped into a vector prior to classification by the DNN on the left, and is identical to that used in the SVM system (not shown). The CNN classifier on the right preserves the rectangular shape of the region as its input feature map. In each case, the classifier output is a set of K class probabilities. The energy and BIC detectors are used to select the time domain regions that form the input into Fig 3.
SIF. The SIF feature begins with a linear scaled and normalised spectrogram constructed from highly overlapped and windowed frames of length $w_s$ samples. For frame index $F$, $\delta$ is the advance between frames, in samples, and $w(n)$ defines a $w_s$-point Hamming window. Spectrogram $f_F(k)$ is then downsampled in frequency into $B$ bins by averaging over $B_0 = \lfloor w_s / 2B \rfloor$ samples. The resulting average spectra are then stacked to form an overlapped spectrogram $S$. To provide context, a history of up to $D$ consecutive spectral lines (i.e. $m = 0 \ldots D - 1$) is concatenated to populate a $BD + 1$ dimension feature vector $\mathbf{v}$, which is augmented by a scalar energy measure, one per frame. Feature vector $\mathbf{v}$ comprises elements $v(i)$, with the scalar energy metric held in element $v(BD)$. This captures frame energy, which is useful based on the hypothesis that very low energy frames are likely to be less discriminative to sound classification than higher energy frames. $\mathbf{v}$ is thus the input to the classifier feature extraction stage, with a dimensionality of only $DB + 1$.
In practice, several values of B, D and δ were tested and subsequently fixed to a system which balances efficiency with consistent performance, having B = 24, D = 30 and δ = 16. Each SIF analysis frame spans 16 ms time duration with an 8 ms overlap between frames, and thus we observe that this method primarily operates by classifying short-time spectral characteristics. The final image dimensionality is thus DB + 1 = 721.
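The SIF construction can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: the function name `sif_features`, the frame length `w_s = 256`, the omission of the normalisation and denoising steps, and the definition of frame energy as the sum of squared bin values are all assumptions. Only B = 24, D = 30 and δ = 16 come from the text.

```python
import numpy as np

def sif_features(signal, w_s=256, delta=16, B=24, D=30):
    """Return an array of shape (B*D + 1, n_vectors) of stacked spectral features."""
    signal = np.asarray(signal, dtype=float)
    window = np.hamming(w_s)                      # w_s-point Hamming window
    n_frames = 1 + (len(signal) - w_s) // delta
    B0 = w_s // (2 * B)                           # spectral lines averaged per bin
    spec = np.empty((n_frames, B))
    for F in range(n_frames):
        frame = signal[F * delta:F * delta + w_s] * window
        mag = np.abs(np.fft.rfft(frame))[:w_s // 2]   # linear magnitude spectrum
        # Downsample in frequency by averaging groups of B0 lines into B bins.
        spec[F] = mag[:B * B0].reshape(B, B0).mean(axis=1)
    # Assumed energy measure: sum of squared bin values in each frame.
    energy = (spec ** 2).sum(axis=1)
    feats = np.empty((B * D + 1, n_frames - D + 1))
    for n, F in enumerate(range(D - 1, n_frames)):
        history = spec[F - D + 1:F + 1][::-1]     # m = 0 ... D-1, newest line first
        feats[:B * D, n] = history.reshape(-1)
        feats[B * D, n] = energy[F]               # element v(BD) holds the energy
    return feats
```

With the default settings each column has 24 × 30 + 1 = 721 elements, matching the dimensionality quoted above. The signal must be long enough to provide at least D frames.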
SVM. An input feature vector is denoted $\mathbf{v}_i$. With a linear kernel, SVM solves the primal optimisation, Eq (7), for the normal vector to the hyperplane, $\mathbf{w}$; $\xi$ are slack variables which are used to define an acceptable tolerance, and $c > 0$ is a regularisation constant. $\psi(\mathbf{v}_i)$ maps $\mathbf{v}_i$ to a higher dimensionality. Since $\mathbf{w}$ typically has high dimensionality [21], for computational efficiency we usually solve the related dual problem, Eq (8), with $\mathbf{e} = [1, \ldots, 1]^T$ being a vector of all ones and $Q$ a positive semi-definite matrix. Having solved Eq (8), the optimal $\mathbf{w}$ follows from the primal-dual relationship, and the decision function becomes the sign of $\mathbf{w}^T \psi(\mathbf{v}_i) + b$ from Eq (7), which is easily computed from the dual solution. The SVM input feature vector was scaled and mapped to a [−1, +1] input range prior to training and testing, where $u(i)$ denotes the ith element of unscaled input vector $\mathbf{u}$ and $v(i)$ represents the ith element of the scaled feature vector $\mathbf{v}$.
This is implemented using LIBSVM [21] with which alternative kernels are easily evaluated. Tested kernels were linear

SVM system parameters: Development testing revealed that best performance was achieved overall using a linear kernel $\mathbf{v}_i^T \mathbf{v}_j$ with regularisation constant c = 32. γ was estimated by the LIBSVM toolkit and set to 0.03. This is close to the default (i.e. 1/N = 0.02) but resulted in slightly improved performance. All parameters were fixed globally (i.e. maintained as constant for all classes) over the K(K − 1)/2 binary models required to partition the results into K classes using one-against-one models. Majority voting was applied to contiguous frames to determine overall classification score for a particular region.
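The one-against-one arrangement and the region-level vote can be illustrated with a short sketch. The helper names are hypothetical and no SVM training is shown; the code only enumerates the class pairs that need a binary model and takes a majority vote over per-frame labels.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_pairs(K):
    """List every unordered class pair that needs its own binary model."""
    return list(combinations(range(K), 2))

def majority_vote(frame_labels):
    """Return the most frequent label among the contiguous frames of a region."""
    return Counter(frame_labels).most_common(1)[0][0]
```

For K = 50 classes, `one_vs_one_pairs(50)` contains 50 × 49 / 2 = 1225 pairs, and a region whose frames are labelled 3, 7, 3, 3, 9 would be attributed to class 3.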
We evaluated systems with 50 and 51 classes. The latter reserved a single class for 'no sound' analysis frames; however, performance was found to be very poor, most likely due to the lack of a positive energy signal to discriminate against (i.e. the classifier was effectively being trained on the absence of something rather than the presence of something). Thus, the systems evaluated in this paper have 1225 binary classifiers yielding K = 50 class outputs.
DNN. Let $w_{ji}$ represent the weight between the ith visible and the jth hidden unit, so that the weights together form the matrix $W$. In a Gaussian-Bernoulli restricted Boltzmann machine (RBM), every visible unit $v_i$ adds a parabolic offset to the energy function, governed by $\sigma_i$, which is generally predetermined rather than derived from the data. The Gaussian-Bernoulli RBM energy function $E_{gb}(\mathbf{v}, \mathbf{h})$, Eq (11), is described in [22]. The Gaussian-Bernoulli RBM model parameters are thus $\theta_{gb} = \{W, b_h, b_v, \sigma^2\}$. The energy function of the Bernoulli-Bernoulli RBM, $E_{bb}(\mathbf{v}, \mathbf{h})$ in Eq (12), is computed similarly, but does not require $\sigma_i$ given the binary nature of input nodes. Bernoulli-Bernoulli RBM model parameters are thus $\theta_{bb} = \{W, b_h, b_v\}$. Given an energy function $E(\mathbf{v}, \mathbf{h})$ defined as in either Eq (11) or Eq (12), the joint probability associated with configuration $(\mathbf{v}, \mathbf{h})$ is proportional to $e^{-E(\mathbf{v}, \mathbf{h})}$, normalised by a partition function $Z$.
Pre-training: RBM model parameters θ are typically estimated from training data in a maximum likelihood sense using contrastive divergence (CD) [23]. This algorithm updates hidden nodes h by stepping through a Gibbs Markov chain with early termination, given visible nodes v and previously updated h. Layer 1 hidden nodes are trained first based on the input feature vector (from training data). The states of the trained hidden units then become the visible data for training layer 2, and the process repeats to produce multiple trained layers of RBMs. These are then stacked to produce the DNN.
Fine-tuning: A softmax output labelling layer of K units is appended to the pre-trained stack of RBMs [24]. The function of the layer is to convert the Bernoulli distributed outputs in the final layer into a multinomial distribution. If $p(k|\mathbf{h}_L; \theta_L)$ is the probability of the DNN classifying final output layer states $\mathbf{h}_L$ into the k-th class, then this probability is given by the softmax function of Eq (14), where $\theta_L = \{\theta^1_{gb}, \theta^2_{bb}, \ldots, \theta^L_{bb}\}$ are the trained model parameters for the entire L-layer DNN. Back propagation (BP) is then used to train the stacked network, including the softmax class layer, based on minimising the cross entropy error, $C = -\sum_{k=1}^{K} c_k \log p(k|\mathbf{h}; \theta_L)$, between the true class label, c, and that predicted by the softmax layer.
DNN system parameters: The DNN classifier is implemented using the winning structure as defined in the authors' previous work [14], which is a five layer network of the form 721 − 210 − 210 − 50. Training uses a dropout of 0.1 (the proportion of weights fixed during each training batch, in order to prevent over-training), a mini-batch training size of 100 and up to 1000 training epochs. Momentum is 0 and the learning rate begins at 10, then drops to 5 after 100 epochs, 2 after 400 and 1 after 800 [14].
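The stepped learning rate can be written as a small lookup. This is only a sketch of the schedule as stated; the function name is hypothetical, and treating epochs as zero-based so that the rate changes at epoch indices 100, 400 and 800 is an assumption about the boundary convention.

```python
def learning_rate(epoch):
    """Stepped learning rate for a zero-based epoch index (boundary handling assumed)."""
    if epoch < 100:
        return 10.0
    if epoch < 400:
        return 5.0
    if epoch < 800:
        return 2.0
    return 1.0
```

A training loop of up to 1000 epochs would query this function once per epoch before processing its mini-batches of 100 examples.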
As with SVM, the DNN classifier has 50 output classes, one for each sound. Again, the benefit of an additional 'no sound' class was explored and found to be detrimental in practice. The consequence of this is that the DNN (and SVM) systems are forced to assign each analysis frame to one of 50 classes, with no way to indicate absence of sound, i.e. they are doing sound classification rather than sound detection. Thus a separate means of detecting the absence or presence of sound is necessary. In general, two methods are described in this paper, the first being a short-time energy detector described in the following subsection and the second being a novel BIC method discussed later.

CNN
Convolutional neural networks (CNNs) are multi-layer neural networks typically consisting of several pairs of convolution layers and subsampling layers plus a set of fully connected output layers. While the large number of layers and degree of connectivity describes a network that is high in complexity, weights are shared within layers to reduce the number of parameters that require training. Despite this simplification, CNNs share the need for relatively large amounts of training data with DNNs, and yet have been shown to outperform DNNs in several fields including image processing [25,26] and ASR [27,28].
A spectrogram of sound events is essentially an image of different time-frequency patterns, many of which exhibit local relationships but only weak absolute locality, i.e. recognisable sounds may appear at different times and in slightly different frequency ranges. CNNs have been shown able to classify image data well [25,26] and are insensitive to pattern placement within an image (thanks to the convolution and subsampling steps), and thus are potentially well suited to sound event classification from two dimensional time-frequency spectrogram input. In this application, the CNN feature map is constructed from spectrogram and energy information as shown in Fig 3. As with multi-layer perceptrons (MLPs), CNNs can be trained by gradient descent using back-propagation. Since units in the same feature map share the same parameters, the gradient of a shared weight is simply computed as the sum of the shared parameter gradients.
In general, for a convolutional layer $l$, we form the jth output map as $x^l_j = f\left(\sum_{i \in M_j} x^{l-1}_i * k^l_{ij} + b^l_j\right)$, where $x^{l-1}_i$ is the ith input map, $k^l_{ij}$ denotes the kernel that is applied, and $M_j$ is one of a selection of input maps [29]. The subsampling layer is simpler, $x^l_j = f\left(\beta^l_j \downarrow(x^{l-1}_i) + b^l_j\right)$, with $\downarrow(\cdot)$ representing sub-sampling and $\beta$ and $b$ being biases. After repeating convolutional and subsampling layer pairs, the output is formed by what is effectively a dual layer (or deeper) MLP. The size of the MLP input layer is determined by the total number of nodes in the final CNN subsampling layer, while the size of the MLP output layer is determined by the number of classes.
CNN system parameters: The CNN classifier is implemented based on the method presented in [15], except that the classification is performed on all detected energy points rather than just three per file. Each energy point triggers a set of six overlapping analysis frames that are downsampled to a resolution of 52 × 40 and then fed to the input layer of the CNN. The five layers comprising the CNN then consist of a 5 × 5 kernel convolution layer with output map size 6 followed by a 2 : 1 subsampling layer, then a second 5 × 5 kernel convolution layer with output map size 12 and a final 2 : 1 subsampling layer. The output layer feeds a two-layer fully interconnected MLP that has 50 output classes, yielding K output probabilities as per Eq (14).
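The feature map sizes implied by this structure can be traced with a few lines of arithmetic. The sketch assumes unpadded ('valid') convolutions with unit stride and non-overlapping 2 : 1 subsampling, none of which is stated explicitly above; the function name is hypothetical.

```python
def cnn_map_sizes(height=52, width=40, kernel=5, maps=(6, 12), pool=2):
    """Trace (maps, height, width) through alternating convolution and subsampling layers."""
    sizes = [(1, height, width)]
    for n_maps in maps:
        # Assumed 'valid' convolution: each side shrinks by kernel - 1.
        height, width = height - kernel + 1, width - kernel + 1
        sizes.append((n_maps, height, width))
        # 2:1 subsampling halves each side.
        height, width = height // pool, width // pool
        sizes.append((n_maps, height, width))
    return sizes
```

Under these assumptions the sizes run 52 × 40, 48 × 36, 24 × 18, 20 × 14 and finally 10 × 7 across 12 maps, so the MLP input layer would have 12 × 10 × 7 = 840 nodes.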

Energy detector
The energy detector uses both instantaneous peak energy and short-time energy criteria to detect candidate frames for sound classification. Specifically, if $E_F$ is the energy of frame $F$, then if $E_F$ exceeds a threshold and $E_F > E_{F-i} : E_{F+i}$ where $i = -2D \ldots 2D$, the current frame and its context are selected for classification.
For the experimental results presented in this paper, the threshold is simply set to the mean energy of all $N_F$ frames, i.e. $\frac{1}{N_F} \sum_{F} E_F$ ($E_F$ has been pre-calculated as $v(BD)$ for SIF features), leading to a large number of potential trigger positions, limited only by the temporal criteria.
If an experimental evaluation comprises $N_F$ analysis frames in total, the effect of the energy detector is to reduce the number of frames to be classified to $N'_F$ where $N'_F < N_F$. This means that the array of features, originally of dimension $[BD + 1, N_F]$, is then reduced to dimension $[BD + 1, N'_F]$ prior to classification. The classifier will then output dimension $[K, N'_F]$ classification probabilities.
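A minimal sketch of such a detector is given below. It interprets the peak criterion as requiring a frame to exceed every other frame within ±2D frames, and it simply truncates the comparison window at the ends of the recording; both points, along with the function name, are assumptions.

```python
import numpy as np

def energy_detector(energy, D=30):
    """Return indices of frames that exceed the mean energy and are local peaks."""
    energy = np.asarray(energy, dtype=float)
    threshold = energy.mean()          # mean energy over all N_F frames
    span = 2 * D                       # context of +/- 2D frames
    triggers = []
    for F in range(len(energy)):
        lo, hi = max(0, F - span), min(len(energy), F + span + 1)
        # Neighbouring frames in the context window, excluding frame F itself.
        neighbours = np.delete(energy[lo:hi], F - lo)
        is_peak = neighbours.size == 0 or energy[F] > neighbours.max()
        if energy[F] > threshold and is_peak:
            triggers.append(F)
    return triggers
```

The returned indices select the $N'_F$ columns of the feature array that are passed on to the classifier.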

Bayesian inference detector
The BIC approach attempts to partition an input array into two parts that have more similar statistical distributions within each part than between parts. Given a search window z, which we construct from contiguous features, two hypotheses are considered. $H_0$ is that z is distributed according to a single Gaussian model, and $H_1$ is that z is distributed according to two Gaussian models and can thus be separated into two different models x and y [30]. We next define $\Delta$BIC, where $N$, $N_x$ and $N_y = N - N_x$ are the window lengths of models z, x and y, $d$ is the feature dimension and $S_z$, $S_x$, $S_y$ are covariance matrices of the feature estimates from each respective window. For the results presented here, we use a fixed model complexity penalty λ = 1.0, and model the Gaussians on 39 dimension features comprising MFCC, ΔMFCC and ΔΔMFCC, computed frame-wise [30]. We exhaustively compute ΔBIC for all possible partitionings within the set. In each case, if max(ΔBIC) > 0, then hypothesis $H_1$ is true and t = argmax(ΔBIC) marks a separation point, whereas if max(ΔBIC) ≤ 0, then hypothesis $H_0$ is true and there is no partition in window z. The process repeats, iteratively splitting windows until either all remaining windows are best represented by a single Gaussian distribution, or the length of a remaining window is smaller than the minimum allowed for classification.
In practice, z spans 200 overlapping SIF analysis windows with a very large overlap of 199 (i.e. 16 ms) so that initial BIC segment sizes are 1.608 s in duration. Each split window is then subjected to the energy detector as usual, to obtain a detection point (with its usual backwards-forward context) within each window. This implies at least one classification result for every window, meaning that every BIC error automatically contributes a classification error.
As with the energy detector, the Bayesian inference detector similarly reduces the number of frames of features for classification. We can again denote this as having dimension $[BD + 1, N'_F]$ prior to classification, although the number and identity of frames chosen using the two detection methods will of course differ.
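A single splitting step can be sketched as follows. The ΔBIC expression used here is the widely used full-covariance Gaussian form, $\frac{N}{2}\log|S_z| - \frac{N_x}{2}\log|S_x| - \frac{N_y}{2}\log|S_y| - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\log N$, which is assumed to correspond to the definition in [30] but is not taken from it. The function names and the `margin` parameter (the minimum number of frames on each side of a split, which must exceed d for the covariances to be invertible) are also hypothetical.

```python
import numpy as np

def delta_bic(z, t, lam=1.0):
    """Delta-BIC for splitting window z (frames x dims) into z[:t] and z[t:]."""
    z = np.asarray(z, dtype=float)
    N, d = z.shape
    x, y = z[:t], z[t:]

    def logdet(a):
        # Log-determinant of the sample covariance of the rows of `a`.
        cov = np.atleast_2d(np.cov(a, rowvar=False))
        return np.linalg.slogdet(cov)[1]

    # Model complexity penalty for one extra full-covariance Gaussian.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(N)
    gain = 0.5 * (N * logdet(z) - len(x) * logdet(x) - len(y) * logdet(y))
    return gain - penalty

def best_split(z, margin, lam=1.0):
    """Return (t, score) for the best split, or (None, score) if H0 is retained."""
    candidates = range(margin, len(z) - margin + 1)
    scores = [delta_bic(z, t, lam) for t in candidates]
    best = int(np.argmax(scores))
    if scores[best] > 0:
        return candidates[best], scores[best]
    return None, scores[best]
```

Applying `best_split` recursively to each resulting segment, until no positive ΔBIC remains or a segment becomes too short, would give the iterative segmentation described above.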

Background probability scaling and thresholding
When either the energy or BIC detectors are used, the result is a sequence of $N'_F$ candidate frames for classification, which are input to the feature extraction block shown in Fig 3. Each frame $F$ is classified separately by the DNN or CNN to derive a set of posterior probabilities, $p(k|\theta)$, for trained model θ from Eq (14), where k = 1…K.
Contemporary sound classification algorithms tend to expect isolated sound events, typically arranged with one sound occurrence per file. Given $N_F$ analysis frames in a recording, each classified separately, the overall classification is computed by looking at all classes over all $N_F$ frames. Either the posterior probabilities for each class are simply summed over all frames to find the class with highest aggregate score, or the probabilities are first scaled by the frame energy prior to summation [14]. Neither method works well for continuous sounds, due to the uncertainty regarding start and end positions of sounds and the case where no sounds are present but the classifiers are forced to choose. Certain classes are inherently more noise-like, so that classifying NOISEX-92 background noise in the absence of foreground sounds results in persistent misclassifications into a small number of classes. It is thus necessary to normalise the output probabilities.
Given classification probability $p(k, n)$ for class k in frame n, we obtain the long term average classifier output probability over $N_F$ frames, $\bar{p}(k) = \frac{1}{N_F} \sum_{n} p(k, n)$, for all classes k = 1…K. Now, instead of attributing each frame to $\arg\max_k p(k, n)$ and then attributing the entire recording to the class which wins the highest number of frames as in non-continuous systems [14], we will instead determine the winning class for each classification region as the one that has the highest probability compared to the mean posterior probability, $\max_k \left[ p(k, n) - \chi\,\bar{p}(k) \right]$, where χ accounts for the degree to which background noise triggers individual classifiers. Testing trained classifiers in the presence of noise alone reveals that several sound classes have an inherent similarity to some periods of background noise. In a system which classifies a segment of audio based directly on the highest posterior probability, noise is therefore often misattributed to noise-like classes, causing misclassification. However, the difference between actual sounds and background noise is twofold. Firstly, actual sounds cause continuously high probabilities from their matching class, whereas spurious noise triggers are sporadic and usually of much shorter duration. Secondly, actual sounds, even in high levels of noise, exhibit a higher probability score from their matching class compared to the background probabilities by other classes. We thus introduced $p_{TH}$ as a probability threshold that balances the trade-off between false-positive and false-negative classifications, and χ to account for background noise triggering. In practice a χ of 0.2 was sufficient to prevent background noise triggering, and this was fixed for the remaining tests. The probability threshold $p_{TH}$ is then varied to plot receiver operating curves (ROC), allowing us to explore the performance of different detectors.
Neither parameter is tuned independently for each tested system, as discussed below; however, it is expected that careful adjustment of $p_{TH}$ using a development data set would yield optimal values for each system.
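The scaling and thresholding can be sketched per frame as below. The way $p_{TH}$ is applied, accepting a frame only when its scaled winning score exceeds the threshold, is an assumption made for illustration, as are the function name and the use of `None` to mark rejected frames; χ = 0.2 and $p_{TH}$ = 0.7 are values mentioned in the text.

```python
import numpy as np

def scaled_decision(probs, chi=0.2, p_th=0.7):
    """probs: (K, N) posteriors. Return a class index or None for each of N frames."""
    probs = np.asarray(probs, dtype=float)
    # Long-term average output probability of each class over all frames.
    p_bar = probs.mean(axis=1, keepdims=True)
    # Subtract the scaled background level of each class.
    scaled = probs - chi * p_bar
    winners = scaled.argmax(axis=0)
    scores = scaled.max(axis=0)
    # Assumed use of the threshold: reject frames whose best score is too low.
    return [int(k) if s > p_th else None for k, s in zip(winners, scores)]
```

For instance, a frame with a posterior of 0.9 for a class whose long-term mean is 0.4 scores 0.9 − 0.2 × 0.4 = 0.82 and is accepted, whereas a frame whose best posterior is 0.4 is rejected at this threshold.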

Results and discussion
This section will first present the performance of each of the classifiers and features for the 'traditional' task of classifying isolated sound files according to the standard evaluation task, then evaluate the same classifiers for continuous classification. We will explore the baseline classification using an energy detector, then evaluate the use of probability scaling and thresholding, both with and without the BIC detector. Finally, we will explore the influence of the probability threshold $p_{TH}$ on performance.
Table 2 presents the classification accuracy by HMM with MFCC features, and by SVM, DNN and CNN using SIF features. The systems are each evaluated in different levels of NOISEX-92 background noise. The mean result is computed over all noise conditions to provide a single measure of the performance of each system for comparison. From these results it is clear that MFCC-HMM performs best in noise-free conditions ('clean'), but degrades rapidly with increasing acoustic noise. None of the SIF-based methods perform quite as well as MFCC-HMM in noise-free conditions, but all are able to maintain performance with only small degradation as noise levels increase. The ASR-inspired MFCC-HMM method is thus the least noise-robust method. SIF-CNN appears most capable for the 20 dB and 10 dB conditions, which are likely to encompass the main range of realistic deployment scenarios, while SVM maintains a slight advantage in the highly noisy 0 dB environment. By mean performance, the SIF-CNN system performs best, followed by SIF-SVM and then SIF-DNN. The comparatively good performance of the CNN classifier in noise echoes the results of other research [15].

Continuous sound results
Having established an isolated sound classification benchmark for each of these systems, we now aim to evaluate performance for the continuous task; however, we first perform a series of experiments to assess the trade-off between recall and precision achieved by adjusting the probability threshold $p_{TH}$. Results shown in Table 3 are the recall, precision and $F_1$ score for the mean performance over all noise types (i.e. clean, 20 dB, 10 dB and 0 dB SNR) for three systems, and for a range of $p_{TH}$ settings.
The first system is a straightforward implementation of the SIF-CNN baseline system using an energy detector to trigger classification regions and a majority vote of classifier outputs. We can see that the best $F_1$ is achieved when $p_{TH}$ is 0.8 or 0.7; however, precision is maximised at a higher $p_{TH}$ and recall is maximised at a lower threshold.
The second system applies the background probability scaling and thresholding methods, such that the classification outputs within a detection region are normalised with respect to the mean classification output probabilities as discussed above. The effect of this is to improve the peak $F_1$ score, and slightly increase precision, at the expense of recall. This is to be expected because it will naturally result in more selective classification regions (hence increasing precision), at the expense of additional false negatives (hence affecting recall). Again the best $F_1$ score is achieved at a $p_{TH}$ of around 0.7 to 0.8, whereas the best precision and recall are at the extremes of the table. Clearly, the $p_{TH}$ setting is operating as a tradeoff between the two conflicting demands of better recall and better precision.
The final system uses the BIC separation method at the front-end prior to the energy detector and probability scaling/thresholding. The results reveal that the optimum $p_{TH}$ for overall $F_1$ score is now lower at about 0.5. Interestingly, while precision has improved substantially over other methods, recall is slightly degraded. The final combined $F_1$ score achieves over 80% accuracy.
Table 4 now presents results for continuous detection and classification for several systems in different levels of noise and overall, with $p_{TH}$ fixed to 0.7. According to the results in Table 3, $p_{TH}$ = 0.7 was the best value for the baseline system but is slightly sub-optimal for the proposed SIF-CNN/BIC method. Further experimentation using a development data set would be required to determine an optimal $p_{TH}$ for each system, and this may reasonably be expected to further enhance the SIF-CNN/BIC results. In the following section, different $p_{TH}$ settings will be evaluated to determine a receiver operating curve (ROC) response.
The results in Table 4 show that all of the tested deep neural learning systems outperform the HMM in all but the recall of clean sounds (a task at which the MFCC-HMM system excels with almost 95% performance). This confirms results for isolated sound classification systems reported elsewhere [13,14].
The results also confirm the good performance of CNNs, especially for the important noise-corrupted tests. More surprisingly, SVM performance is highly competitive with the CNN system in all cases, more so than the DNN in fact. When comparing these results to the isolated sound classification performance, it appears that the SVM classifier is better able to accomplish detection (i.e. distinguishing presence versus absence of sound) than the CNN. Contrasting the SIF-CNN and SIF-CNN/BIC results, it seems that the BIC segmentation method performs better than the energy detector in general, apart from slightly lower recall due to the more selective nature of the segmentation. The proposed SIF-CNN/BIC system achieves the best combined $F_1$ score, as well as the best precision for all noise conditions. Comparing the continuous classification precision to the isolated sound recognition accuracy, it is notable that apart from the MFCC-HMM system, the evaluated techniques degrade by less than 10% in accuracy for clean sounds, but by as much as 50 to 60% at 0 dB SNR. The implication is that the detection process is less noise robust than the underlying classification process.
To better visualise the process, Fig 4 plots spectrograms of a 9.6 s long segment of one test recording. The upper spectrogram is without additional noise, whereas the one below it is the same region with noise added at an SNR of 0 dB. For clarity, this segment only contains two sounds, and these are visible not only in the spectrograms but also in the frame-by-frame energy plot. Vertical lines in the spectrograms are drawn to indicate BIC segmentation markers in each case, with more segmentations occurring in the noisy case.
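For readers wishing to reproduce a noisy condition such as the 0 dB example, the sketch below shows one common way of scaling a noise recording so that the mixture reaches a target SNR. It is an illustration only: it assumes that signal and noise powers are measured as mean squared amplitude over the whole segment, which may differ from the mixing procedure used for the evaluation task, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Mean squared amplitude of a sample sequence."""
    return sum(x * x for x in samples) / len(samples)


def noise_gain_for_snr(signal: Sequence[float], noise: Sequence[float],
                       snr_db: float) -> float:
    """Gain to apply to the noise so that signal power / noise power = 10^(snr_db/10)."""
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        raise ValueError("noise has zero power and cannot be scaled")
    return math.sqrt(mean_power(signal) / (p_noise * 10.0 ** (snr_db / 10.0)))


def mix_at_snr(signal: Sequence[float], noise: Sequence[float],
               snr_db: float) -> List[float]:
    """Add scaled noise to the signal (assumes both have the same length)."""
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    gain = noise_gain_for_snr(signal, noise, snr_db)
    return [s + gain * n for s, n in zip(signal, noise)]


# Toy data: a sinusoid as the "sound" and an alternating sequence as the "noise".
toy_signal = [math.sin(0.1 * i) for i in range(1000)]
toy_noise = [0.05 * (-1) ** i for i in range(1000)]
noisy = mix_at_snr(toy_signal, toy_noise, snr_db=0.0)
```

At 0 dB the scaled noise carries the same mean power as the sound itself, which is consistent with the two events being much harder to distinguish in the lower spectrogram.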
To explore further, Fig 5 uses the same example to visualise the classification probabilities. The figure shows the actual sound classes that are present (top), the classifier output probabilities (middle) and re-plots the corresponding spectrograms (bottom). The noisy example (right hand side) evidently exhibits far more spurious classification points than the clean recording (left hand side), but in both cases several classes are continuously active. The influence of these is countered by the background probability scaling and thresholding process.

Probability threshold and tradeoffs
Fig 6 displays an ROC plot of recall against precision for the three systems, namely the SIF-CNN baseline, probability scaled and BIC methods. Each of these is evaluated in terms of mean F 1 score over all noise conditions. This evaluation is performed for a range of probability thresholds to adjust the trade-off point between recall and precision. What is clear from the graph is that the background probability scaled system outperforms the baseline, and in turn the proposed SIF-CNN/BIC method outperforms the background probability scaled method.
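The trade-off traced by varying the probability threshold can be sketched as follows. This is a simplified illustration rather than the scoring procedure used for Fig 6: it assumes that each candidate detection has already been marked as correct or incorrect against the ground truth and that no true event is detected more than once. The toy scores and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Detection = Tuple[float, bool]  # (classifier probability, matches a true event?)


def precision_recall(detections: Sequence[Detection], n_true_events: int,
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when only detections scoring >= threshold are kept."""
    kept = [correct for score, correct in detections if score >= threshold]
    true_positives = sum(kept)
    # With nothing kept there are no false alarms, so precision is taken as 1.0.
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / n_true_events if n_true_events else 0.0
    return precision, recall


def sweep_thresholds(detections: Sequence[Detection], n_true_events: int,
                     thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each candidate threshold."""
    return [(t,) + precision_recall(detections, n_true_events, t)
            for t in thresholds]


# Toy detections for a recording containing 5 true events (illustrative values only).
toy_detections = [(0.95, True), (0.90, True), (0.80, False), (0.75, True),
                  (0.60, False), (0.55, True), (0.30, False)]
curve = sweep_thresholds(toy_detections, n_true_events=5,
                         thresholds=[0.5, 0.7, 0.9])
for threshold, precision, recall in curve:
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, raising the threshold removes false alarms before it removes correct detections, so precision rises while recall falls. A curve that lies further towards high precision and high recall at every threshold corresponds to a better system.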

Conclusion and future work
Classification of sounds in potential future deployment scenarios will require robust approaches that work in the presence of interfering acoustic noise, with sounds that may be occluded or overlapping, and which can operate continuously with no prior knowledge of the start and end times of sounds. This paper has extended three state-of-the-art machine-learning based sound event classification methods to the continuous case: these methods have previously only been evaluated for classification of isolated sounds or those having known starting and ending times. This paper has additionally proposed a standard evaluation task for overlapping continuous sounds, based upon the commonly-used evaluation task for isolated sounds. This has been used to evaluate the robustness of the various techniques. As other authors develop their own continuous sound event classification algorithms, it is hoped that they will adopt the same evaluation criteria, since the task consists of easily available data and presents a realistic deployment scenario.
In this paper, all evaluated methods use energy-based criteria to detect candidate onset positions for sounds, while a Bayesian inference criterion has been developed specifically for the CNN classifier, and shown to yield a performance improvement. Results show that classification performance reduces by an average of approximately 20 to 30% (in terms of precision) between the isolated and continuous cases, with by far the largest degradation occurring at the highest noise levels, implying that the detection process is inherently less noise robust than the classification process. Other researchers may therefore expect to obtain good performance gains in future by separating the detection and classification tasks and optimising each independently, and by exploring the effect of tuning parameters such as p TH , χ, λ, , as well as the number of CNN layers, outputmap and subsampling parameters.