Birdsong Denoising Using Wavelets

Automatic recording of birdsong is becoming the preferred way to monitor and quantify bird populations worldwide. Programmable recorders allow recordings to be obtained at all times of day and year for extended periods of time. Consequently, there is a critical need for robust automated birdsong recognition. One prominent obstacle to achieving this is low signal to noise ratio in unattended recordings. Field recordings are often very noisy: birdsong is only one component in a recording, which also includes noise from the environment (such as wind and rain), other animals (including insects), and human-related activities, as well as noise from the recorder itself. We describe a method of denoising using a combination of the wavelet packet decomposition and band-pass or low-pass filtering, and present experiments that demonstrate an order of magnitude improvement in noise reduction over natural noisy bird recordings.


Introduction
More than 13% (1,373) of bird species are vulnerable or in danger of extinction from causes such as deforestation, introduction of alien species, and global climate change (International Union for the Conservation of Nature Red Data List, 2014). In order to conserve bird populations, wildlife managers require accurate information about species presence and population estimates derived from monitoring programmes. Although birds are hard to spot visually even when the observers are in the correct place, they are more vocal than other terrestrial vertebrates and therefore birdsong is usually the most direct way for humans to detect them. With the development of acoustic recorders that can be left in the field for extensive periods of time capturing all songs, including rare ones, traditional call count surveys are being replaced by the collection of terabytes of data, which can be collected relatively cheaply and easily with limited human involvement.
The permanent storage of this acoustic data brings the advantage of being able to listen to the songs and to view their spectrograms again and again, improving the accuracy of both species recognition and call counting. However, this work is still largely manual, requiring spectrogram reading and listening, which makes it a costly approach that requires well-trained individuals; it reportedly takes an expert approximately one hour to scan the spectrogram of ten hours of recording [1] (depending on the quality of the recordings, species being [...] the oscillogram one. The frequency representation provides information about the frequency components that comprise the signal, but not about when those frequencies occur. Converting the waveform into the frequency domain is performed by the Fourier transform, which represents the signal as a weighted combination of sine and cosine waves at different frequencies. The Fourier transform is invertible, meaning that processing can be performed in the frequency domain and then transformed back into the time domain, for example to enable the sound to be played. Birdsong is transferred into the frequency domain by applying the Discrete Fourier Transform (DFT); in practice the Fast Fourier Transform (FFT), which is a computationally efficient algorithm for the DFT, is used. Fig 2(a) shows a non-stationary signal. During the first 100 ms the frequency of the signal is 20 Hz, during the second 100 ms the frequency doubles, and it doubles again during the last 100 ms. The right of the figure shows the power spectrum, which plots the energy per time unit (power) against the frequency components, and which clearly shows the basic frequencies of the original signal. Thus the power spectrum is a good representation of sound, summarising its periodic structure. However, it is suitable only for stationary signals, while most signals in the real world are transient (non-stationary).
The reason for this is that the signal is assumed to be infinite in time, and choosing a short time window has the effect of causing aliasing, where signal from outside the chosen range affects the appearance inside the range.
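The limitation of the power spectrum can be illustrated with a small sketch. It builds a signal like the one described for Fig 2(a) (20 Hz, 40 Hz, and 80 Hz, each for 100 ms) and a second signal with the same tones in reverse order, then compares their power spectra. The sample rate of 1000 Hz, the function names, and the use of a direct (slow) DFT instead of an FFT are assumptions made to keep the example self-contained.

```python
import cmath
import math

SAMPLE_RATE = 1000  # Hz; assumed for this illustration
SEGMENT = 100       # samples in each 100 ms segment


def three_tone_signal(freqs_hz):
    """Concatenate three 100 ms sine segments with the given frequencies."""
    return [math.sin(2 * math.pi * freqs_hz[i // SEGMENT] * i / SAMPLE_RATE)
            for i in range(3 * SEGMENT)]


def power_spectrum(signal):
    """Return |X[k]|^2 for k = 0..N/2 using a direct (slow) DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def power_at(spectrum, freq_hz, n_samples):
    """Look up the power in the DFT bin closest to freq_hz."""
    return spectrum[round(freq_hz * n_samples / SAMPLE_RATE)]


rising = power_spectrum(three_tone_signal([20, 40, 80]))
falling = power_spectrum(three_tone_signal([80, 40, 20]))
for f in (20, 40, 80):
    # The two columns agree: the spectrum does not record the order of the tones
    print(f, round(power_at(rising, f, 300)), round(power_at(falling, f, 300)))
```

Both signals carry the same power at 20, 40, and 80 Hz, so the power spectrum alone cannot distinguish a rising sequence from a falling one.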
Segmenting the entire signal into small, fixed-size time windows and then calculating frequency components from these windows is a common practice based on the assumption that the signal is stationary over a short duration. Careful use of windows that decay to zero at the edges of their range, together with overlapping the windows, enables the Short-Time Fourier Transform (STFT) to be used, and this is the basis of the spectrogram. First, the power spectrum of each window is calculated and rotated 90° clockwise, and the amplitude is replaced by a greyscale. The complete spectrogram is generated by stacking the images of subsequent windows appropriately. Provided that the time windows are short enough that the frequency components are stable within each window, this provides a faithful representation of the frequency components of the data against time. However, it comes at a cost, since estimating frequencies accurately requires time: frequency resolution can only be achieved at the cost of time resolution and vice versa. The result is that larger windows are required for low frequencies, but the STFT, which uses a single window size, cannot accommodate this. This led us to consider wavelets as a representation of birdsong, as we shall discuss after we consider the types of noise that are present in birdsong recordings.
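The windowing procedure can be sketched as follows. This is a minimal illustration, not the spectrogram code used for the figures: the Hann window, the 50-sample window with 50% overlap, the 1000 Hz sample rate, and the direct DFT are all assumptions chosen for brevity.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window; it starts at zero and tapers towards both edges."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_power(signal, window_size, hop):
    """Power spectrum of each overlapping, Hann-windowed frame (direct DFT)."""
    window = hann(window_size)
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(window_size)]
        column = []
        for k in range(window_size // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / window_size)
                        for i, x in enumerate(frame))
            column.append(abs(coeff) ** 2)
        frames.append(column)
    return frames


def dominant_frequencies(frames, sample_rate, window_size):
    """Frequency (Hz) of the strongest bin in each frame."""
    return [max(range(len(col)), key=col.__getitem__) * sample_rate / window_size
            for col in frames]


SAMPLE_RATE = 1000  # Hz; assumed
# 20 Hz, then 40 Hz, then 80 Hz, each lasting 100 ms
signal = [math.sin(2 * math.pi * (20, 40, 80)[i // 100] * i / SAMPLE_RATE)
          for i in range(300)]
frames = stft_power(signal, window_size=50, hop=25)
dominant = dominant_frequencies(frames, SAMPLE_RATE, window_size=50)
print(dominant)
```

Frames that lie entirely inside one segment report that segment's frequency, so the time ordering is recovered. The price is visible in the bin spacing: a 50 ms window resolves frequency only in steps of 20 Hz, and frames that straddle a boundary mix two tones.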

Bird Recording and Noise
Until recently, manual (attended) recording was the method of choice for recording birdsong. This generally enables the capture of good quality close-range songs provided the recordist has the skills not only to tune and handle the recorder, but also a good knowledge of the bird being recorded and how to approach it closely. The advent of waterproof programmable recorders with good battery life and high recording capacity has enabled a new form of birdsong recording, enabling ecologists to collect every sound in the forest (or other area of interest) without disturbing the birds or requiring groups of experts to perform call counting in the field. However, recordings made in natural environments are highly susceptible to a variety of noises. During attended recordings, some noise can be controlled by careful screening, but in automatic recording this is impossible.

Types of Noise
The sounds that can be heard can be categorised into three broad types: biophony, geophony, and anthrophony [12]. Biophony refers to any sound produced by biological agents: in the forest major biophonies are birds, insects, frogs/toads, and mammals. Because we are only interested in acoustic activity of birds, all other biological sounds are categorised as noise; with recordings targeted at particular bird species, even other birdsong is regarded as noise. Geophony refers to all non-biological, natural sounds in the environment such as wind and its effect on trees, rain, thunder, and running water. Field recordings are always blended with these geophonies. Anthrophony refers to all sound generated from human-made machines such as aircraft, vehicles, wind turbines, and the recording device itself: there is always some microphone and recorder hum. Collectively, these noises contaminate all acoustic data to a greater or lesser extent, see Fig 3(a) and 3(b). The problems of noise are both that it can mask the signal of the bird call, and also transform it so that it looks different, making it hard to identify. While there is some research on features that are invariant to noise, meaning that they look the same even in noisy data, they are not general, and we will not consider them further here.
We distinguish denoising of a signal, which is principally the removal or filtering of consistent noise, from source separation, which identifies that several birds are calling simultaneously and separates the signal into individual birds. We do not consider source separation further in this paper; [13] provides a survey of approaches to the problem, but notes that very few of the methods have been shown to work for real-world signals.
There is a theory of noise in digital signal processing (see, for example, [14]), which characterises noise according to its properties into the following types:
White noise has equal energy at all frequencies, meaning that the power spectrum is flat. In practice, noise is only white over a limited range of frequencies (Fig 3(e)). While not all white noise is Gaussian, natural white noise can often be modelled as such.
Coloured noise shows a non-uniform power spectrum, with the energy generally decreasing as the frequency $f$ increases. Common types of coloured noise include pink (power $\propto 1/f$) and brown (power $\propto 1/f^2$).
Impulsive noise refers to sudden click-like sounds that last for a very short period of time (milliseconds), such as switching noise. An ideal impulse generates a horizontal line in the power spectrum because these sharp pulses contain all frequencies equally.
Narrow-band noise, such as microphone hum, occupies a small range of frequencies.
Transient noise is a burst of noise that occurs for some time and then disappears.
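The difference between white and coloured noise can be checked numerically. The sketch below is an illustration under assumptions, not part of the experiments: it draws Gaussian white noise, approximates brown noise as the running sum of that white noise (which gives power roughly proportional to $1/f^2$), and compares the mean power in a low and a high frequency band. The band limits, the seed, and the direct DFT are arbitrary choices; pink noise is omitted for brevity.

```python
import cmath
import math
import random


def power_spectrum(signal):
    """Return |X[k]|^2 for k = 0..N/2 using a direct (slow) DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def band_power_ratio(signal, low_bins, high_bins):
    """Mean power in the low-frequency bins divided by that in the high bins."""
    power = power_spectrum(signal)
    low = sum(power[k] for k in low_bins) / len(low_bins)
    high = sum(power[k] for k in high_bins) / len(high_bins)
    return low / high


rng = random.Random(0)  # fixed seed so the sketch is repeatable
white = [rng.gauss(0.0, 1.0) for _ in range(256)]

# Brown noise approximated as the running sum of white noise
brown, total = [], 0.0
for sample in white:
    total += sample
    brown.append(total)

LOW_BINS = range(1, 9)       # lowest non-zero frequencies
HIGH_BINS = range(100, 128)  # frequencies near the Nyquist limit
white_ratio = band_power_ratio(white, LOW_BINS, HIGH_BINS)
brown_ratio = band_power_ratio(brown, LOW_BINS, HIGH_BINS)
print(white_ratio, brown_ratio)
```

For white noise the ratio stays near one, reflecting a flat spectrum, whereas for brown noise the low band dominates by a large factor.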
An important property of any sound is whether or not it is stationary, i.e., whether its properties remain substantially unchanged over time. Most noise in natural recordings is at least quasi-stationary, being geophonic in nature. However, birdsong is not stationary (i.e., it is transient) since it is generally short-lived and varies quickly. This difference between the properties of the noise and signal enables noise reduction techniques to be applied.

Noise Filtering
Noise filtering is the most common approach to dealing with noisy recordings. Traditional signal processing, based on electronics, uses two basic filters, low-pass and high-pass, which allow frequencies respectively below and above a pre-defined cut-off frequency to pass through, and attenuate the rest. Combining a low-pass filter and a high-pass filter gives a band-pass filter. If the noise occupies high frequencies while the bird of interest sings low frequency songs then this would be sufficient to eliminate noise, but since the spectra of the noise and the signal overlap, this is not the case. After band-pass filtering, the high frequency and low frequency noise components have been removed successfully, but all the noise in the range of the bird's song frequency (visible as grey background) is still there, confirming that this basic filtering is not sufficient to recover birdsong. Further, birds have different call categories in different frequency bands. For example, the kakapo (Strigops habroptilus) generates two types of vocalisation: booming, which is a very low frequency call, and chinging, which is a relatively high frequency call. Designing a common filter to clean hours of kakapo recordings is impossible because the two call types do not share the same frequency range.
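The limitation of band-pass filtering can be seen in a small sketch. It uses an idealised "brick-wall" filter that zeroes DFT bins outside the pass band; this is an assumption made for simplicity, since the text does not specify a filter design, and the tone frequencies and sample rate are invented for the example.

```python
import cmath
import math


def dft(x):
    """Direct (slow) discrete Fourier transform."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * i / n)
                for k in range(n)).real / n
            for i in range(n)]


def band_pass(signal, sample_rate, low_hz, high_hz):
    """Zero every DFT bin whose frequency lies outside [low_hz, high_hz]."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n  # fold negative frequencies
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    return idft(spectrum)


def tone(freq_hz, amplitude, n, sample_rate):
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


RATE, N = 1000, 200  # assumed sample rate (Hz) and number of samples
song = tone(100, 1.0, N, RATE)       # stands in for the bird call
in_band = tone(120, 0.5, N, RATE)    # noise inside the call's band
low_noise = tone(10, 0.8, N, RATE)   # e.g. wind
high_noise = tone(400, 0.8, N, RATE)
recording = [sum(v) for v in zip(song, in_band, low_noise, high_noise)]

filtered = band_pass(recording, RATE, low_hz=80, high_hz=150)
residual = max(abs(f - (s + b)) for f, s, b in zip(filtered, song, in_band))
print(residual)
```

The filtered output equals the song plus the in-band noise: the 10 Hz and 400 Hz components are gone, but the 120 Hz component passes through untouched.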
Another traditional approach is the Wiener filter, which generates an estimate of the desired or target random (Gaussian) process based on linear time-invariant filtering, minimising the mean square error between the estimated signal and the desired signal, under the assumption that the signal and noise are stationary and that spectral information is available [14]. These assumptions do not hold for birdsong, so we do not consider it further here.

Wavelets
We explained earlier that the Fourier transform, while commonly used in birdsong analysis, is not really suitable because of the tradeoff between temporal resolution and frequency resolution. An alternative is the wavelet transform, which is a relatively recent development in signal processing [15], although it has been invented independently in fields as diverse as mathematics, quantum analysis, and electrical engineering [16]. Wavelets have been applied in many areas, such as data compression, feature detection, and signal denoising [17].
In the Fourier transform the signal is mapped into a basis of sine and cosine waves. The wavelet transform also uses a basis, but the basis elements are scale-invariant, meaning that they look the same at all scales, and they are localised in space. The upshot is that in the wavelet representation different window sizes can be used to see the signal at different resolutions; an analogy would be viewing a forest and its trees at the same time. If we need to see the whole forest we have to see it at a large scale and then we can capture global features. In order to see the trees, we have to zoom in and focus on a tree. Zooming more allows us to see leaves. We can see the forest, trees and even leaves by using different scales. Compared with the fixed windows of the STFT, the wavelet representation is more flexible (allowing large windows for low frequencies and small windows for higher frequencies), which is important for broad spectrum non-stationary signals such as birdsong.
There are several choices of basis features (referred to as mother wavelets) $\psi$, and unfortunately the best mother wavelet for a particular application needs to be determined experimentally. Fig 4(c)-4(e) shows examples of some mother wavelets, including the simplest, the Haar wavelet, which is a discontinuous step function. While the discontinuity can be a disadvantage in some domains, including birdsong, it is beneficial for those that exhibit sudden transitions, like machine failure [18]. In order to construct other elements of the wavelet basis the mother wavelet is scaled and translated by factors $a$ and $b$ using:
$$\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\,\psi\left(\frac{t-b}{a}\right) \qquad (1)$$
Parameter $a \neq 0$ determines the amount of stretching or compression of the mother wavelet (depending on whether $a$ is greater than or less than 1). Therefore, when $a$ is small, high frequency components are introduced to the wavelet family; in return those wavelets can capture high frequencies of the signals. In the same manner, when $a$ is large, low frequency components are introduced to the wavelet family and help to capture low frequency signals. Parameter $b$ determines the amount of shifting of the wavelet along the horizontal axis: $b > 0$ shifts the wavelet to the right, while $b < 0$ shifts it to the left. Therefore, parameter $b$ specifies the onset of that wavelet. Fig 4(f) illustrates the effect of $a$ and $b$ with respect to a given mother wavelet. Accordingly, wavelets are defined by the wavelet function (mother wavelet) and scaling function (also called the father wavelet). The scaled wavelets are known as daughter wavelets.
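Scaling and translation can be made concrete with the Haar wavelet. The sketch below assumes the standard form $\psi_{a,b}(t) = \psi((t-b)/a)/\sqrt{|a|}$ and a Haar mother wavelet supported on $[0, 1)$; the function names are illustrative only.

```python
import math


def haar(t):
    """Haar mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0


def daughter(t, a, b, mother=haar):
    """Scaled and translated wavelet psi((t - b) / a) / sqrt(|a|)."""
    if a == 0:
        raise ValueError("the scale a must be non-zero")
    return mother((t - b) / a) / math.sqrt(abs(a))


# With a = 2 and b = 2 the wavelet is stretched to cover [2, 4):
# positive on [2, 3), negative on [3, 4), zero outside.
for t in (1.0, 2.5, 3.5, 4.5):
    print(t, daughter(t, a=2.0, b=2.0))
```

Here $a = 2$ doubles the width of the wavelet (lowering the frequency it responds to) and $b = 2$ moves its onset from 0 to 2; the factor $1/\sqrt{|a|}$ keeps the energy of the wavelet unchanged.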

Wavelet Packet Decomposition
When wavelets are applied to a discrete signal, low-pass and high-pass filters are used, splitting the data into a low frequency (approximation) part and a high frequency (detail) part. These filtered representations of the data can then be analysed again by a wavelet with smaller scale by creating a new daughter wavelet, typically at half the scale. One modelling choice that can be made is whether to reanalyse both the approximation and detail parts of the signal, or just the approximation coefficients. We choose to analyse both, in what is known as the wavelet packet decomposition [19]. It leads to a tree of wavelet decompositions, as shown in Fig 4(g), and provides a rich spectral analysis, since there are $2^N$ leaves at the base of the tree when there are $N$ levels.
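The tree structure can be sketched with the Haar filters, which are the simplest low-pass and high-pass pair. This is an illustration only: the experiments use other wavelets, and the function names and test signal are assumptions.

```python
import math

SQRT_HALF = 1.0 / math.sqrt(2.0)


def haar_split(signal):
    """One filtering step: return (approximation, detail), each half as long."""
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * SQRT_HALF for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * SQRT_HALF for i in pairs]
    return approx, detail


def wavelet_packet(signal, levels):
    """Return a list of levels; level j holds 2**j nodes of coefficients.

    Unlike the ordinary wavelet transform, both the approximation and the
    detail node are split again at every level.
    """
    tree = [[list(signal)]]
    for _ in range(levels):
        next_level = []
        for node in tree[-1]:
            approx, detail = haar_split(node)
            next_level.extend([approx, detail])
        tree.append(next_level)
    return tree


signal = [math.sin(0.3 * i) for i in range(16)]
tree = wavelet_packet(signal, levels=3)
print([len(level) for level in tree])  # number of nodes at each level
```

With three levels the base of the tree has $2^3 = 8$ leaves, each holding two coefficients of the 16-sample input. Because the Haar filters are orthonormal, the total energy of the leaves equals that of the original signal.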
However, the question of how many levels to use in the tree still remains. This question is often answered experimentally, but since we want a method that can work unaided on birdsong, we need to find a computational approach. We have approached this by considering how much information about the signal is contained in the approximation at each node, reasoning that nodes that do not contain information are representing the noise, and so should be discarded. In the field of information theory, Shannon entropy provides the standard measure of uncertainty or disorder in a system [20], and this is connected to the amount of information contained in a given signal [21].
The entropy $S$ of a set of probabilities $p_i$ is calculated as (using the convention that $0 \log 0 = 0$):
$$S = -\sum_i p_i \log p_i \qquad (2)$$
where $p_i$ is the probability of the $i$th state in the state space. In wavelets, we used a slightly different version of this Shannon entropy:
$$S = -\sum_i s_i^2 \log\left(s_i^2\right) \qquad (3)$$
where $s_i$ is the $i$th sample of the signal [22,23]. The idea of using entropy for wavelets is that when the entropy is small, the accuracy of the selected wavelet basis is higher [23]. We used this computation at each node to choose whether or not to retain a node, and stopped growing the tree at the point where all of the nodes that contained noise had been removed by this computation, meaning that the signal was fully described.
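The two entropy measures can be written down directly. The sketch below assumes natural logarithms and the non-normalised form $-\sum_i s_i^2 \log s_i^2$ for the wavelet version; the function names are illustrative.

```python
import math


def shannon_entropy(probabilities):
    """Classical Shannon entropy, with the convention 0 * log 0 = 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def wavelet_entropy(coefficients):
    """Non-normalised Shannon entropy of a node's coefficients."""
    total = 0.0
    for s in coefficients:
        energy = s * s
        if energy > 0:  # skip zeros: 0 * log 0 is taken to be 0
            total -= energy * math.log(energy)
    return total


print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: log(4)
print(shannon_entropy([1.0, 0.0]))                # certain outcome: zero
print(wavelet_entropy([0.5, 0.5]))
```

Note that the wavelet version is not bounded below by zero: coefficients with squared magnitude greater than one contribute negative terms, so the values are only meaningful when compared between nodes of the same signal.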

Previous Uses of Wavelets for Bioacoustic Denoising
The use of wavelets for noise reduction, referred to as denoising, is still an emerging technique in digital signal processing. While there are some examples of denoising in other audio signal domains such as partial discharge (PD) signals [24-26], music [27], speech [28], and phonocardiography [29,30], their use in bioacoustic denoising is still uncommon. In addition, the two studies we know of that used wavelets for denoising animal sounds did not use natural noise, but added noise manually to their recordings. [31] denoised West Indian manatee (Trichechus manatus latirostris) vocalisations with added boat noise, while [32] attempted to denoise vocalisations of the ortolan bunting (Emberiza hortulana), rhesus monkey (Macaca mulatta), and humpback whale (Megaptera novaeangliae), with added white noise.
However, wavelets have been used for birdsong recognition: Selin et al. [33,34] used the wavelet packet decomposition to extract features from birdsongs from eight species. Interestingly, in [34] they added noise filtering via either a low pass filter or an adaptive filter bank with eight uniformly spaced frequency bands. These filtered signals were also analysed by wavelets and compared for recognition accuracy with the unfiltered version. In addition, Chou et al. [35] used wavelets to represent birdsong in conjunction with Mel Frequency Cepstral Coefficients (MFCCs) for recognition of 420 bird species; but the dataset in their experiment was very limited, with only one recording per species (half of each birdsong file for training and the remaining for testing).

Our Algorithm
To summarise our approach to birdsong denoising, we took the following steps, which are discussed further next:
1. Find a suitable mother wavelet.
2. Find the most suitable decomposition level based on the Shannon entropy.
3. Apply the wavelet transform to the noisy signal to produce the noisy wavelet coefficients.
4. Determine the appropriate threshold to best remove the noise based on the Shannon entropy.
5. Invert the wavelet transform of the retained wavelet coefficients to obtain the denoised signal.
6. Apply a suitable ordinary band-pass or low-pass filter where possible to remove any noise left outside the frequency range of the signal.
Selecting the Mother Wavelet. Choosing an appropriate mother wavelet is the key to the successful estimation of the noiseless signal. One approach is to visually compare the shapes of the mother wavelets and small portions of the signal, choosing the wavelet that best matches the signal [25]. However, given that we want the method to work with a wide variety of different bird calls, eyeball selection is not sufficient.
Another approach is based on the correlation between the given signal and its denoised signal [36], reasoning that if two signals are strongly linked they should have high correlation. Therefore, we can expect that the optimum wavelet maximises the correlation between the initial signal and the denoised signal. We can compare the correlation under different wavelets and pick the wavelet that generates the highest correlation. Accordingly, we analysed the correlation given by different wavelets including the Daubechies wavelets (dbN, where N represents the order) and the Discrete Meyer wavelet (dmey). Initial experiments showed that the dmey wavelet generated the highest correlation. For instance, db2 (0.9950) was better than db1 (0.9884), db6 (0.9970) was better than db2, db10 (0.9971) was better than db6, and dmey (0.9973) was better than db10. Then, we investigated the spectrograms of the denoised examples in order to see the actual improvement of the songs. Visual inspection (for example Fig 5) also confirmed, for a selection of different birdsongs, that the dmey wavelet (Fig 4(e)) successfully denoised the songs without distorting them, and so we used it for the rest of our experiments.
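The selection rule amounts to computing one correlation per candidate wavelet and taking the maximum. The sketch below assumes the Pearson correlation coefficient, which the text does not state explicitly, and uses the correlation values reported above; the helper names are illustrative.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_wavelet(scores):
    """Name of the wavelet with the highest correlation score."""
    return max(scores, key=scores.get)


# Correlations reported in the text for one set of initial experiments
reported = {"db1": 0.9884, "db2": 0.9950, "db6": 0.9970,
            "db10": 0.9971, "dmey": 0.9973}
print(best_wavelet(reported))

# A denoised signal that is a scaled copy of the original correlates perfectly
original = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
print(correlation(original, [0.5 * v for v in original]))
```

In practice each score would be obtained by denoising the same recording with a different wavelet and passing the noisy and denoised signals to `correlation`. The reported differences between the higher-order wavelets are small, which is why the visual check of the spectrograms was also used.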
Selecting the Best Decomposition Level. Because we used Shannon entropy to choose the decomposition level, different birdsongs will produce trees of different depths: less complex birdsong will have small trees, while more complex birdsong will require larger trees. In fact, even within single types of call, different depths of tree can be seen. We therefore ran the depth selection algorithm on every birdsong individually; while this is computationally expensive, it does lead to significantly better results. Methods to speed up this approach will be investigated in future work. So far we found that the top-down approach (start with a small tree with level 1 and expand it based on the Shannon entropy) is more efficient than the bottom-up approach (start with a big tree and shrink it); therefore we used the top-down calculation here. Starting from level 1, decomposition was continued until the maximum entropy of a parent node (at level L) was lower than the maximum entropy of its child nodes (at level L+1). At that point the decomposition was stopped, and the best decomposition level was determined as L.
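One possible reading of the top-down stopping rule is sketched below. It is an interpretation under assumptions, not the published code: it uses Haar filters for brevity, the non-normalised Shannon entropy, and compares the largest node entropy at level L with the largest node entropy at level L+1.

```python
import math

SQRT_HALF = 1.0 / math.sqrt(2.0)


def haar_split(signal):
    """One filtering step: return (approximation, detail)."""
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * SQRT_HALF for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * SQRT_HALF for i in pairs]
    return approx, detail


def wavelet_entropy(coefficients):
    """Non-normalised Shannon entropy, with 0 * log 0 taken as 0."""
    return -sum(s * s * math.log(s * s) for s in coefficients if s != 0)


def best_level(signal, max_level):
    """Grow the packet tree from level 1 until the entropy starts to rise."""
    current = list(haar_split(signal))  # the two nodes at level 1
    level = 1
    while level < max_level:
        children = []
        for node in current:
            children.extend(haar_split(node))
        parent_max = max(wavelet_entropy(node) for node in current)
        child_max = max(wavelet_entropy(node) for node in children)
        if parent_max < child_max:
            break  # splitting further increases the entropy: stop at this level
        current, level = children, level + 1
    return level


impulse = [2.0] + [0.0] * 7
constant = [1.0] * 8
print(best_level(impulse, max_level=3), best_level(constant, max_level=3))
```

For the impulse, the Haar filters spread the energy over more coefficients at level 2, so the entropy rises and the search stops at level 1. For the constant signal the entropy never rises and the search runs to the maximum level allowed.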
Selecting the Threshold. Each node in the decomposition tree is represented by its wavelet coefficients, and the 'impurity' of those nodes can be calculated using (Shannon) entropy. Noisy nodes are then eliminated by applying a threshold to each node. There are two forms of thresholding: hard thresholding and soft thresholding. In hard thresholding, sometimes called the 'keep or kill' method [37], coefficients are set to zero if they are below a previously defined threshold and are otherwise left unchanged. In contrast, soft thresholding also sets coefficients below the threshold to zero, but additionally shrinks the remaining coefficients towards zero by the threshold value rather than cutting off sharply at the threshold. Soft thresholding therefore provides a continuous mapping, and in our case it demonstrated better noise reduction without information loss, yielding a high SnNR (this term will be defined in the section on evaluation metrics) in initial experiments. Therefore, we used soft thresholding here.
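The two thresholding rules differ only in what happens to the coefficients that survive. A minimal sketch, using the standard definitions of hard and soft thresholding and illustrative function names:

```python
import math


def hard_threshold(coefficients, threshold):
    """'Keep or kill': zero small coefficients, leave the others unchanged."""
    return [c if abs(c) > threshold else 0.0 for c in coefficients]


def soft_threshold(coefficients, threshold):
    """Zero small coefficients and shrink the others towards zero."""
    result = []
    for c in coefficients:
        magnitude = abs(c) - threshold
        result.append(math.copysign(magnitude, c) if magnitude > 0 else 0.0)
    return result


coefficients = [3.0, -0.5, 1.5, -2.0]
print(hard_threshold(coefficients, 1.0))  # [3.0, 0.0, 1.5, -2.0]
print(soft_threshold(coefficients, 1.0))  # [2.0, 0.0, 0.5, -1.0]
```

Hard thresholding jumps from 0 to the full coefficient value as a coefficient crosses the threshold, whereas the soft rule grows continuously from 0, which is the continuity property referred to in the text.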
The challenge of setting the threshold remains, however: ideally, the selected threshold should achieve satisfactory noise removal without significant information loss. If the selected threshold is too high, then it removes too many nodes from the tree, resulting in a denoised signal with missing information, while if the threshold is unnecessarily low, it does not remove all the noisy nodes, resulting in a signal that still contains noise. There is no globally optimal threshold, and so we again selected it based on analysis of each birdsong. As mentioned previously, many types of noise can be approximated as having a Gaussian distribution, and this is more obvious in the high frequency parts of the spectrum. We therefore computed the standard deviation of the lowest level detail coefficients in the tree, and used 4.5 standard deviations as the threshold, which should cover 99.99% of the noise [26].
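Steps 3 to 5 of the algorithm can be sketched end to end on a toy signal. The sketch makes several simplifying assumptions that differ from the experiments: it uses the Haar wavelet rather than dmey, an ordinary wavelet decomposition (approximation branch only) rather than the full packet tree, a fixed depth, and it takes the noise estimate from the finest-scale detail coefficients. The test signal and seed are invented.

```python
import math
import random
import statistics

SQRT_HALF = 1.0 / math.sqrt(2.0)


def haar_split(signal):
    """Forward step: return (approximation, detail)."""
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * SQRT_HALF for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * SQRT_HALF for i in pairs]
    return approx, detail


def haar_merge(approx, detail):
    """Inverse step: rebuild the signal from approximation and detail."""
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) * SQRT_HALF, (a - d) * SQRT_HALF])
    return signal


def soft(c, threshold):
    """Soft-threshold a single coefficient."""
    return math.copysign(max(abs(c) - threshold, 0.0), c)


def denoise(signal, levels, n_sigma=4.5):
    """Transform, soft-threshold the detail coefficients, and invert."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_split(approx)
        details.append(detail)
    # Noise level estimated from the finest-scale detail coefficients (assumed)
    threshold = n_sigma * statistics.pstdev(details[0])
    for detail in reversed(details):
        approx = haar_merge(approx, [soft(c, threshold) for c in detail])
    return approx


def mse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
clean = [1.0] * 32 + [-1.0] * 32
noisy = [c + rng.gauss(0.0, 0.1) for c in clean]
denoised = denoise(noisy, levels=3)
print(mse(noisy, clean), mse(denoised, clean))
```

For this piecewise-constant signal every detail coefficient of the clean signal is zero, so shrinking the detail coefficients can only move the result closer to the clean signal. Real birdsong has signal energy in the detail coefficients too, which is why the choice of threshold matters.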

Experimental Evaluation
In this section we compare our wavelet-based algorithm with traditional band-pass or low-pass filtering. We introduce our dataset, and the metrics that we use to compare the results, before demonstrating the results.

Datasets
Primary Dataset. Initially, three manually generated pure sound examples (one impulsive 'click' sound and two tonal combinations) were used to examine the performance of the proposed method against white and different coloured noises. These examples were separately polluted with different levels of these noises manually and then denoised to eliminate the noise.
Secondly, songs of two endangered and one relatively common New Zealand bird species were considered: North Island brown kiwi, kakapo, and ruru (Ninox novaeseelandiae). Most of the recordings were collected using automated recorders, but a few were recorded manually. Most ruru and kiwi calls were obtained by the authors, while some ruru and all kakapo calls came from other sources (see the Acknowledgements). The spectrogram patterns of these species are shown in Fig 1. Birdsongs were segmented manually into syllable level components (e.g., Fig 6). The dataset (available at http://avianz.massey.ac.nz) contained a total of 700 syllables from seven basic call types, 100 of each (Table 1). These recordings were polluted with different types and levels of noise during recording. The noise was mainly wide-band; sometimes it was concentrated at low frequencies (for example due to wind and aeroplane noise), and at other times at high frequencies and/or in narrow bands (for example due to insects such as crickets and weta).
Secondary Dataset. We tested our algorithm on a secondary data set because our primary data set did not cover all possible spectrogram patterns we expect to see in recordings collected in the wild. The songs in this data set were mostly collected by the authors using manual recorders, but the selected recordings include significant amounts of noise. The kaka and tui songs were recorded using automated recorders by others. The eight species in this dataset (see Table 2) comprise seven song birds and one parrot, which have complex songs and great song diversity. We used whole songs instead of syllable level components. Five noisy song examples of each species were used, except for hihi; this species has very short songs and therefore we used ten examples. Further, we were interested to see the performance of this technique over unsegmented recordings. Therefore, we denoised five series of consecutive calls from each call type from each species mentioned in the primary data set. Then we compared the calls in the denoised series to their respective segmented calls.
Another concern when denoising birdsong is the effect of overlapping bird calls. To test this issue, we selected ten examples of recordings that contained overlapping songs from different combinations of species. Examples include overlapped male kiwi-female kiwi, male kiwi-ruru trill, male kiwi-more-pork, two of more-pork-trill, two of male kiwi-female kiwi-more-pork, tui-more-pork, robin-tui, and kakapo chinging-mottled petrels. Again the dataset is available at http://avianz.massey.ac.nz.

Evaluation Metrics
The main measurement of true interest in denoising is the Signal-to-Noise Ratio (SNR), which can be calculated by dividing the power of the signal (S) by the power of noise (N), as given in Eq 4, which is in units of decibels (dB):
$$\mathrm{SNR} = 10 \log_{10}\left(\frac{S}{N}\right) \qquad (4)$$
The higher the value of the SNR, the less noisy the signal.
Table 1 footnote: We use the common names given by researchers for the different types of calls.
Table 2. List of species introduced to the secondary dataset and their song characteristics.
The challenge for real-world applications such as birdsong is that the signal and noise are not actually known because they are together in the recording. This means that computing S and N separately is not possible. Under the assumption that the noise is relatively stationary, we estimated the power of the pure noise by isolating parts of the recording without birdsong, which should theoretically be silent, and modified Eq 4 to give the SnNR:
$$\mathrm{SnNR} = 10 \log_{10}\left(\frac{S+N}{N}\right) \qquad (5)$$
where S + N is the power in the initial signal. By comparing this computation with the denoised version we can see how effective the denoising is. Fig 6 illustrates the calculation. Notice that to be able to calculate the initial noise and denoised noise we segmented the recording leaving a small period of silence at the beginning and/or end of the bird call. A comparison of the original SNR and the respective SnNR is shown in Fig 7, where noise and signal are known.
If we recall that the noise is approximately Gaussian, a second possible metric is to measure its statistical properties, particularly its variance, reasoning that successful denoising should substantially reduce the variance of the noise. We used the same segments of 'pure' noise in the signal as were used to estimate the power of the noise in the SnNR to compute the variance of the noise before and after denoising, terming this measure the success ratio:
$$\text{success ratio} = \log\left(\frac{\mathrm{var}(N_b)}{\mathrm{var}(N_a)}\right) \qquad (6)$$
where $N_b$ is the initial noise and $N_a$ is the noise after denoising. If the success ratio is greater than 0, it implies that song denoising has been successful.
A third possibility is to calculate the Peak Signal to Noise Ratio (PSNR), a widely used objective quality metric in image and video processing [38,39]. PSNR looks only at the peak value the signal can reach and the mean-squared error between the reference and the test signals. Here we used a modified PSNR [40] to compare noise reduced songs with their original noisy version:
$$\mathrm{PSNR} = 20 \log_{10}\left(\frac{\mathrm{MAX}_{sig}}{\sqrt{\mathrm{MSE}}}\right) \qquad (7)$$
where $\mathrm{MAX}_{sig}$ is the maximum value of the reference signal and MSE is the mean-squared error. In this calculation we maintained the noisy song as the reference and its recovered song as the test. PSNR will be relatively lower if the song is less cleaned and higher if the song is well cleaned.
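The three metrics are straightforward to compute once a noise-only segment has been isolated. The sketch below is a hedged illustration: it assumes SnNR is $10\log_{10}((S+N)/N)$, that the success ratio uses the natural logarithm of the variance ratio, and that the modified PSNR is $20\log_{10}(\mathrm{MAX}_{sig}/\sqrt{\mathrm{MSE}})$ with the peak taken as the largest absolute sample. The function names and toy inputs are invented.

```python
import math


def power(x):
    """Mean squared amplitude of a segment."""
    return sum(v * v for v in x) / len(x)


def variance(x):
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def snnr_db(recording, noise_only):
    """SnNR in dB: power of signal plus noise over power of a noise-only part."""
    return 10 * math.log10(power(recording) / power(noise_only))


def success_ratio(noise_before, noise_after):
    """Log of the variance ratio; positive when denoising reduced the noise."""
    return math.log(variance(noise_before) / variance(noise_after))


def psnr_db(reference, test):
    """PSNR in dB, with the noisy song as reference and the recovered as test."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    peak = max(abs(r) for r in reference)
    return 20 * math.log10(peak / math.sqrt(mse))


call_segment = [2.0, -2.0, 2.0, -2.0]   # toy segment containing the call
noise_segment = [1.0, -1.0, 1.0, -1.0]  # toy 'silent' segment
print(snnr_db(call_segment, noise_segment))
print(success_ratio(call_segment, noise_segment))
print(psnr_db([1.0, -1.0, 1.0, -1.0], [0.9, -0.9, 0.9, -0.9]))
```

Because $S + N$ can never be smaller than $N$ when both are measured on the same recording, the SnNR is non-negative in expectation, which matches the small positive values reported for noisy recordings.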

Results
We implemented our algorithm in Matlab using the Wavelet Toolbox, which is a comprehensive toolbox for wavelet analysis. The code is available at: http://avianz.massey.ac.nz. As an initial experiment, white noise, pink noise, and brown noise were added to selected tonal and impulse sounds separately as a percentage of the strength of the signal. These noisy examples were cleaned using the proposed denoising approach (steps 1-5 only; without filtering), and the calculated SNR and SnNR of noisy and recovered songs are plotted in Fig 7. Here we can calculate the SNR of the noisy examples perfectly because we know the actual noise added as well as the pure signal. A comparison of conventional SNR and SnNR is illustrated in Fig 7(a), confirming that both metrics perform almost equally. The same figure also shows that even in the presence of high levels of white noise, denoising using our approach is very successful. Fig 7(b) reveals that the proposed denoising approach can deal well with pink noise, but not to the extent of white noise. However, denoising brown noise still remains a challenge, as shown in Fig 7(c). This is because of its strong non-Gaussianity.
Each call example in both primary and secondary datasets was treated with three approaches: band-pass or low-pass filtering alone (F), wavelets alone (D), and wavelets and band-pass or low-pass filtering (DF). In the case of filtering, the frequency bands were selected according to Tables 1 and 2. Fig 8 (S1 Audio) demonstrates that our algorithm removed most of the noise from the birdsong while preserving most of the song information. Success is visually clear from the spectrograms; for example, if we consider Fig 8(a), almost all the background grey colour (caused by noise) in the original kiwi whistle has been eliminated, while the five original harmonics are still present without distortion after denoising.
We examined each example individually, both visually and aurally, to confirm whether it was improved after denoising, and found that all the calls were significantly improved. The improvement in the sound quality of the songs was successfully reflected by the SnNR and success ratio (Table 3 and Fig 9). The overall SnNR improved from 0.667 to 3.506, an improvement of more than 5 times, while SnNR improved only up to 1.526 after conventional filtering. Success ratios after filtering alone and with denoising were 1.071 and 2.170 respectively. Parallel to this, PSNR increased from 10.428 to 10.694. While we have included PSNR (Eq 7) in our results, we do not believe that it is a particularly useful measure. The numerator uses the maximum amplitude (hence the 'peak' in the name), which does not change when the signal is denoised, while the denominator is the root mean square of the error, which is small. This leads to a less sensitive measurement. For example, denoising alone always generated the highest PSNR because the oscillogram changed less after denoising than after filtering (see Fig 8), leading to a comparatively small MSE. However, these results altogether confirm that our wavelet denoising approach performs well for birdsong. Even for the very low frequency kakapo booming the denoising was still better with wavelets. On the other hand, in the case of less noisy bird calls, there was no significant information loss after denoising (see Fig 8(g)).
As discussed under Selecting the Best Decomposition Level, our automated method selects the appropriate decomposition level based on the complexity of the given signal. We classified the complexity of a song by examining the spectrogram pattern and listening to the sound; songs with more harmonics and a wider frequency range were considered more complex (for example, in the case of ruru, trill calls are rather complex compared to the narrow-band more-pork calls). The results confirm that there is a relationship between the best decomposition level and the complexity of birdsongs: if we order the calls from simplest to most complex, the order is kakapo booming, more and pork, kakapo chinging, kiwi female, kiwi male, and finally ruru trilling, and this order can be seen in the depth of the tree (WMDL) in the last column of Table 3.

Extensions
Our method achieved impressive noise removal for the birdsongs of the species we considered in the secondary data set. Table 4 and Fig 9 show that the overall SnNR after treatment (2.758) was more than seven times the initial SnNR (0.353). Some examples of these songs are presented in Fig 10 (S2 Audio).
While the main aim of our approach was to denoise individual bird calls, we also considered two extensions: denoising a series of bird calls in a sequence without segmenting them, and denoising a signal comprising two or more overlapping bird calls. Table 5 and Fig 11 compare the results of denoising unsegmented series of bird calls with those for the corresponding segmented calls. For the unsegmented recordings, the mean SnNR values of the initial, filtered, and denoised songs were 0.548, 1.204, and 7.326 respectively. This success was confirmed by further analysis of the spectrograms and sound quality; for example, Fig 10(d) shows a denoised version of a series of kakapo chinging. These results indicate that denoising unsegmented long recordings is also possible and performs nearly as well as denoising the isolated calls.
The method worked very well even when presented with more than two overlapping birdsongs: the combination of birdsongs was retained while the noise was significantly reduced (Fig 12, S3 Audio). This is also reflected in the evaluation metrics. For the ten examples described at the end of the Section Secondary Dataset, the overall SnNR improved from 0.556 to 5.222 (more than 9 times) after denoising and band-pass filtering, compared to 1.652 (less than 3 times) after band-pass filtering alone and 0.634 after denoising alone. The success ratio confirmed this, showing a greater improvement after denoising and band-pass filtering (2.994) than after filtering alone (1.261) or denoising alone (0.169). As usual, PSNR was highest with denoising alone (28.543), while filtering (12.818) and the combination of denoising and filtering (13.205) gave relatively low PSNR because of the increase in MSE.

Discussion
The spectrogram has been the basis for much of birdsong analysis for decades, but it suffers from a fundamental tradeoff between temporal and frequency resolution because it is based on the Fourier transform. The more modern approach of wavelet analysis does not suffer from this tradeoff. However, there have been surprisingly few published studies in the field of birdsong recognition where wavelet analysis is used [32][33][34]. In this paper we have investigated the use of wavelets for denoising automatically recorded birdsong. Denoising is becoming progressively more important as larger numbers of automatic recorders are deployed worldwide, recording not just birdsong, but every other noise in the environment. Whether these recordings are analysed automatically or manually, there is a need to reduce the extraneous noise in the recordings. We have demonstrated that wavelets are very good at removing stationary noise from the recordings without distorting the birdsongs. Even though some of the background noise (such as that from other animals) is not stationary, there is a substantial amount of stationary noise in recordings collected from nature. Therefore the applicability of this method to cleaning natural acoustic recordings is high. Further, much (although not all) natural noise is white or pink, and wavelets work well for removing it. Both the success ratio and the modified SNR (SnNR) are useful measures of noise reduction. However, PSNR turned out to be a less reliable measure of the success of noise reduction for audio signals. This is one step towards our ultimate goal of automatically recognising birdsong by algorithm, and so our real aim is not the reproduction of a perfectly noiseless birdsong, but the removal of noise without damage to the signal, so that features of the song can be computed and used as input to other algorithms.
In practice, the major reason for a low recall rate or sensitivity (the percentage of songs retrieved from the total number of songs in the recording), as well as a low song recognition rate, is the noise associated with the recordings: noise mixed with birdsong tends to hide song information [41,42]. Therefore, cleaning the recordings prior to call detection and segmentation would improve any method of song recognition. In addition, we have demonstrated that our method allows impressive reproduction of denoised birdsong for use by biologists. This reproduction capability also provides more flexibility to extract any preferred features for classification and recognition, in contrast to the case in [34].
The major challenge of using this method to clean long field recordings is its high computational cost: it requires significant computer memory and time. The complete analysis of approximately 2 minutes of calls (the segmented versions in Table 5) took just over 10 minutes on a 2.4 GHz quad core i7. This can be improved through a compiled implementation of the method, rather than the general research-focused Matlab code that we have used here. In addition, we have demonstrated our approach on fairly long recordings, so we are confident that this method can be used to clean those too. Noise removal from the original recordings rather than from extracted isolated songs is important in both semi-manual and automated recognition. However, it is important to note that the level of noise, its nature, and the strength of the song can significantly affect the result of wavelet denoising. For example, we observed that denoising tended to remove both signal and noise when presented with very faint calls embedded in a high level of noise (calls that are hardly visible in the spectrogram). We observed the same when we processed a North Island robin song mixed with strong noise at high frequencies (Fig 13, S4 Audio). Interestingly, in this case, down-sampling saved the birdsong. If we initially apply low-pass filtering to filter out the frequencies beyond the bird's frequency range, we end up with a signal that contains the birdsong and non-Gaussian noise. This means that when we filter out the high frequency noise, the signal still has capacity for high frequencies unless we down-sample it. Therefore, wavelet denoising cannot remove the remaining noise because it is non-Gaussian. In contrast, down-sampling restricts the signal's frequency range, and automatically removes high frequency noise. Generally, birds produce vocalisations within the range of human hearing.
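The effect of down-sampling on the available frequency range can be sketched as follows. The Python example below is a minimal illustration, not the processing applied to the robin recording: the sampling rate, tone frequencies, filter length, and cutoff are all assumed values, and the function names are hypothetical. It low-pass filters a two-tone test signal below the new Nyquist frequency and then keeps every fourth sample, so that the output can only represent frequencies up to 2 kHz.

```python
import cmath
import math

def lowpass_taps(cutoff_hz, fs_hz, num_taps=101):
    """Hamming-windowed sinc low-pass FIR filter with unit gain at 0 Hz."""
    fc = cutoff_hz / fs_hz
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]

def downsample(x, fs_hz, factor):
    """Low-pass below the new Nyquist frequency, then keep every factor-th sample."""
    new_fs = fs_hz / factor
    taps = lowpass_taps(0.8 * new_fs / 2.0, fs_hz)  # cutoff at 80% of new Nyquist
    last_start = len(x) - len(taps)
    out = [sum(t * x[start + j] for j, t in enumerate(taps))
           for start in range(0, last_start + 1, factor)]
    return out, new_fs

def tone_amplitude(x, freq_hz, fs_hz):
    """Amplitude of one sinusoidal component, measured with a single-bin DFT."""
    acc = sum(v * cmath.exp(-2j * math.pi * freq_hz * i / fs_hz)
              for i, v in enumerate(x))
    return 2.0 * abs(acc) / len(x)

fs = 16000.0                       # assumed original sampling rate
song_hz, noise_hz = 1000.0, 6500.0 # toy "song" tone and high frequency "noise" tone
x = [math.sin(2.0 * math.pi * song_hz * i / fs) +
     math.sin(2.0 * math.pi * noise_hz * i / fs) for i in range(4100)]

y, new_fs = downsample(x, fs, 4)
song_amp = tone_amplitude(y, song_hz, new_fs)
# Without the low-pass step, the 6500 Hz tone would alias to 1500 Hz at 4000 Hz sampling.
alias_amp = tone_amplitude(y, 1500.0, new_fs)
print("new Nyquist frequency (Hz):", new_fs / 2.0)
print("song amplitude:", round(song_amp, 3), "alias amplitude:", round(alias_amp, 4))
```

The low-pass step matters in this sketch: discarding samples without it would fold the high frequency tone back into the retained band rather than removing it.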
The dedicated recording devices we normally use in the field are also made to capture audible frequencies, but not ultrasound (> 20 kHz) or infrasound (< 20 Hz). However, many species produce sounds that are outside the range of both humans and recorders. For example, species like the kakapo, North African Houbara bustard (Chlamydotis undulata undulata), and bittern (Botaurus lentiginosus) generate boomings that are very low frequency signals [43]. These low frequency signals fall near or below the threshold of human hearing (20 Hz). Birds have relatively greater hearing sensitivity than humans. For example, pigeons (Columbidae) have exceptional low-frequency (infrasound) perception [44]. However, current recording devices fail to fully capture these exceptional bird vocalisations. Accordingly, parallel to the development of birdsong recognisers there is a need to improve recorders and recording techniques. In this study we have concentrated on birdsong, but automatic recorders also capture the sounds of other animals. In New Zealand for example, where introduced predators are responsible for the endangered status of many bird and reptile species, automatic recordings could be used to monitor populations of these introduced animals. Our denoising technique would work well to prepare recordings for identification and estimation of abundance of species such as stoats (Mustela erminea), feral cats (Felis catus), rats (Rattus spp) and dogs (Canis familiaris) whose calls are high frequency. Song detection from long recordings and segmentation is another sub-topic in the field of birdsong recognition, especially when it comes to practical use. The segmentation method used to isolate the bird songs has a huge influence on both the recognition rate and the recall rate of a recogniser. Conventional energy-based segmentation of the waveform can easily skip faint songs in the recordings, mainly as a result of overlapping noise or the distance between the bird and the recorder.
On the other hand, time-domain approaches of this type fail to separate bird songs from the background noise because they consider only the energy, commonly applying a threshold to cut off low-energy sections. This increases the number of false positives if the recogniser also fails to recognise the noise and discard it. We speculate that wavelet coefficients could be used to perform the segmentation in a more sophisticated manner. Separation of sound sources is another concept that we did not consider here. Clearly it is not easy to separate sound sources in naturally recorded overlapping songs, because the sounds are not linearly mixed even if we assume that they are, and the number of receivers (microphones) is always less than the number of sound sources. While the current study focused mainly on removing stationary noise, it is essential to devise methods to tackle transient noise; this will be more challenging because birdsongs are also transient.
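The limitation of conventional energy-based segmentation discussed above can be illustrated with a short Python sketch. This is a generic frame-energy threshold detector written for illustration, not a method evaluated in this paper; the synthetic recording, frame length, and threshold are assumed values and the function names are hypothetical.

```python
import math
import random

def frame_energies(x, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [sum(v * v for v in x[i:i + frame_len]) / frame_len
            for i in range(0, len(x) - frame_len + 1, frame_len)]

def segment_by_energy(x, frame_len, threshold):
    """Return (start, end) sample ranges whose frame energy exceeds the threshold."""
    energies = frame_energies(x, frame_len)
    segments, start = [], None
    for k, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = k                       # a segment begins
        elif energy <= threshold and start is not None:
            segments.append((start * frame_len, k * frame_len))
            start = None                    # the segment ends
    if start is not None:
        segments.append((start * frame_len, len(energies) * frame_len))
    return segments

rng = random.Random(2)
fs = 8000
recording = [rng.gauss(0.0, 0.05) for _ in range(fs)]   # one second of background noise
for i in range(2000, 3000):                             # loud call close to the recorder
    recording[i] += 1.0 * math.sin(2.0 * math.pi * 1000.0 * i / fs)
for i in range(6000, 7000):                             # faint call from a distant bird
    recording[i] += 0.05 * math.sin(2.0 * math.pi * 1000.0 * i / fs)

segments = segment_by_energy(recording, frame_len=500, threshold=0.01)
print(segments)
```

In this toy recording only the loud call is returned; the faint call between samples 6000 and 7000 is missed because its energy is close to that of the noise floor. Lowering the threshold enough to catch it would bring the threshold close to the noise energy, which is the route by which false positives arise.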
Future work will extend the use of wavelets to address the aforementioned gaps in this research area. We are currently investigating different feature extraction methods, including MFCC [45] and wavelet coefficients, as well as potential machine learning algorithms and similarity measures for recognition and classification of birdsongs; it is important to determine the best combination of features that are strong enough to represent the birdsongs uniquely. The final goal is to develop a non-species-specific, robust, and user-friendly automated platform for ecologists to automatically process natural field recordings collected using any recorder.