Exploiting deep neural network and long short-term memory methodologies in bioacoustic classification of LPC-based features

  • Cihun-Siyong Alex Gong ,

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Validation, Writing – original draft

    alex.mlead@gmail.com

    Affiliations Department of Electrical Engineering, Chang Gung University, Taoyuan, Taiwan, Department of Ophthalmology, Chang Gung Memorial Hospital, Linkou Branch, Taoyuan City, Taiwan

  • Chih-Hui Simon Su,

    Roles Methodology, Writing – review & editing

    Affiliation Department of Electrical Engineering, Chang Gung University, Taoyuan, Taiwan

  • Kuo-Wei Chao,

    Roles Data curation, Investigation

    Affiliation Department of Mechanical Engineering, National Cheng Kung University, Tainan, Taiwan

  • Yi-Chu Chao,

    Roles Data curation, Validation

    Affiliation Department of Public Health, National Taiwan University, Taipei, Taiwan

  • Chin-Kai Su,

    Roles Data curation, Validation

    Affiliation Fudan High School, Taoyuan, Taiwan

  • Wei-Hang Chiu

    Roles Validation

    Affiliation Fudan High School, Taoyuan, Taiwan

Abstract

This research describes the recognition and classification of the acoustic characteristics of amphibians using two deep learning methods, the deep neural network (DNN) and long short-term memory (LSTM), for biological applications. First, original data are collected from 32 species of frogs and 3 species of toads commonly found in Taiwan. Second, two digital filtering algorithms, linear predictive coding (LPC) and Mel-frequency cepstral coefficient (MFCC), are each used to extract amphibian bioacoustic features and construct the datasets. In addition, principal component analysis (PCA) is applied to reduce the dimensionality of the training datasets. Next, the amphibian bioacoustic features are classified using DNN and LSTM. The PyTorch platform with a GPU processor (NVIDIA GeForce GTX 1050 Ti) performs the calculation and recognition of the acoustic feature classification results. Based on the two above-mentioned algorithms, the sound feature datasets are classified, and the results are summarized in several tables and graphs. The classification results for the different bioacoustic features are verified and discussed in detail. This research seeks the best combination of recognition and classification algorithms across all experimental processes.

1. Introduction

In nature, communication between animals entails the transmission of specific information between individuals of one or different species to invoke specific behaviors [1]. Considerable work has therefore focused on the study of animal behavior based on acoustic feature analysis [2, 3], and even abiotic signals have been studied. Several adaptive theories and analytical methods are available to extract hidden information conveyed by any sound [4]. For example, the sound of human breathing, the release of vibration energy from objects, or abnormal sounds from a moving automobile may implicitly indicate an underlying problem [5, 6]. Different acoustic characteristics represent dynamic behavior characteristics under actual conditions. The sound characteristics of each animal reflect the actual state of animal behavior and thus reveal information about different behaviors [7]. The sound information communicated by a large number of animals can be automatically and systematically measured and monitored in nature.

By collecting and analyzing the characteristics of animal communication sounds of different species, this research provides a more beneficial and convenient way to monitor the dynamic behavior of specific animal species, avoiding time-consuming manual monitoring and analysis [8]. Bioacoustic monitoring technology is very effective in identifying existing species, especially species for which limited data are available [9]. Many well-known studies have established that acoustic signal data can be effectively collected and that features can be identified through digital filtering [10, 11]. Signal comparison and recognition in bioacoustics includes manual recognition by well-trained listeners and classification by multi-channel spectrogram observation. Detection based on collected signals depends on sensor signal measurement and acquisition together with classifier algorithms such as machine learning. Well-trained professional observers can distinguish subtle spectrogram features and can then identify relevant sound features in the surrounding environment [12]. Time series classification has emerged as a popular artificial intelligence research topic.

Most supervised and unsupervised algorithms are typically applied to dynamic time series signals [13]. Automatic animal sound detection and recognition from audio recordings is becoming an emerging topic in bioacoustics [14]. Technically speaking, bioacoustic feature extraction and classification, after data collection and processing, produce meaningful feature information and provide a better method to measure ecosystem changes [15]. A research project conducted at the Academia Sinica Biodiversity Research Center [16] collected and analyzed audio field signals in forests, thereby constructing characteristic sound field training datasets for forest environments. Unlike [16], the algorithms presented in this study are new approaches applied to more samples.

Artificial intelligence (AI) techniques have been widely applied in many fields such as image recognition, speech recognition, characteristic signal models, deduction and reasoning, and data mining to solve problems that are otherwise addressed using traditional calculation methods. Implementation challenges include difficult characteristic classification [17]. Nowadays, big data is a major application area of AI, in which huge amounts of data are classified algorithmically to identify more practical optimization decision models. Machine learning classification and recognition methods from AI are then applied to obtain optimal prediction performance [18]. Appropriate machine learning techniques can be applied to acoustic datasets to facilitate model training and to obtain prediction solutions with optimal adaptive calculations and minimal errors. In the iterative process of machine learning model training, the loss function is minimized with respect to the weights so that the trained prediction model most closely approximates an ideal solution [19, 20]. In summary, this research focuses on a basic application of artificial intelligence: features are extracted from the original signals through filtering calculations, and the resulting feature spectrum datasets are classified and recognized using machine learning techniques.

Machine learning (ML) techniques can deduce a system’s optimal model solution from large datasets and simultaneously perform large-volume data analysis and classification. The model is trained from known datasets, and testing data are used to extract the most suitable prediction solution [21]. ML provides data modeling techniques that complement traditional statistical methods [22]. Among modern algorithms, deep learning (DL) has attracted widespread attention for its ability to train from large datasets [23]. The present research selected characteristic sounds of 35 amphibian species and used a novel digital speech algorithm to perform digital filtering analysis of the sound characteristics. Increasing demand for big data collection and advances in computer processing speed have driven the use of deep learning techniques in practical applications in many fields. In the field of speech recognition, convolutional neural networks (CNN) [24–26], deep neural networks (DNN) [27], long short-term memory (LSTM) [28] and other machine learning methods have been widely used as classification algorithms in recent years. This article introduces DNN and LSTM and discusses the solution of the classification problem for bioacoustic features in practical applications. In bioacoustic digital filtering, both linear predictive coding (LPC) and Mel-frequency cepstral coefficient (MFCC) digital speech algorithms can distinguish characteristic speech signals. These two popular filters are widely used in digital speech signal processing [29, 30], especially in feature extraction of speech signals [31]. Principal component analysis (PCA), a mainstream dimensionality reduction algorithm, is applied to the large sound feature datasets to reduce dimensionality and computational load and thereby obtain better recognition and classification performance. Prior to applying image processing or audio feature algorithms, many studies first reduce the dimensionality of big data features to reduce computational complexity and overhead. PCA is commonly used for dimensionality reduction in the field of audio signal processing. It not only speeds up learning on the datasets but also selects the most effective feature data for further analysis [32].

Adaptive learning with DNNs has become a major breakthrough in acoustic speech recognition [33, 34]. The DNN is a classification algorithm often applied to very large amounts of data, and it is used here to develop the proposed experimental framework for bioacoustic classification. The computation performed by a neural network is governed by a set of numerical variables called weights. We seek to optimize the neural network’s performance by finding the optimal weights. Based on the multi-layer network connection architecture, we calculate the approximate optimal solution at each node of the network. After training, the neural network serves as an automatic iterative structure that maps the selected input to the required output [35].

In recent years, the long short-term memory (LSTM) algorithm has been increasingly applied to continuous sequential speech signal processing [36, 37]. LSTM is a modified recurrent neural network (RNN) that can store information from previous inputs for a long time [38]. It addresses the problems of vanishing and exploding gradients that arise in long-sequence training and memory retention [39]. All RNNs have feedback loops in the recurrent layer that help store information in "memory" over time. However, standard RNNs can be difficult to train on problems that require learning long-term dependencies. The gradient of the loss function decays exponentially over time (a phenomenon called the vanishing gradient problem), making a typical RNN difficult to train. For this reason, the RNN is modified to include a memory cell that can maintain information over time. The most widely used modified RNN is the LSTM, which uses a set of gates to control when information enters the memory, thus addressing the vanishing or exploding gradient problem [40]. In this study, animal acoustic features are classified using the PyTorch platform in Python, and we analyze the performance of the two previously mentioned algorithms, combined with principal component analysis, in terms of calculation time and performance. We then select the most suitable recognition and classification structure for this dataset. Later in the article we discuss the influence of principal component analysis on deep neural networks and long short-term memory, and further infer the respective advantages of the two calculation methods.

2. Theoretical description

2.1. Linear Predictive Coding (LPC) method

The linear predictive coding (LPC) method for digital speech states that a sample L[k] can be approximately expressed as a linear combination of the previous P samples [41], that is, L[k] ≈ a_1 L[k−1] + a_2 L[k−2] + … + a_P L[k−P]. The combination coefficients {a_m}, m = 1, 2, …, P, are called the linear prediction coefficients. The basic structure of the LPC algorithm model is illustrated in Fig 1.

Fig 1. Speech production model based on the LPC method.

https://doi.org/10.1371/journal.pone.0259140.g001

The LPC model expresses each sample as a linear combination of past output and input samples [42]. (1) where A_j and B_l are the prediction coefficients, G is the gain value, and u[k] represents the unknown input signal.

The z-transform T(z) of the signal L[k] is expressed as [43]: (2)

The transfer function H(z) is the ratio of the filter output to the filter input in the z-domain and is given by:

(3)

Fig 2 shows the process from collecting the original amphibian signals to constructing the bioacoustic feature datasets. Using the LPC digital filtering algorithm, we extract features from the original acoustic signals of every amphibian species, adjust the linear prediction coefficients to create multiple filtering effects, and collect the feature spectral values of every species to construct the training datasets.

Fig 2. Construction of the bioacoustic feature datasets based on LPC.

https://doi.org/10.1371/journal.pone.0259140.g002
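The linear prediction idea of this section can be illustrated with a minimal Python sketch that estimates the coefficients a_1..a_P using the autocorrelation method and the Levinson-Durbin recursion. The source does not state which estimation procedure was used, so the recursion, the function names (`autocorrelation`, `lpc_coefficients`, `synthetic_ar2`) and the synthetic test signal are assumptions made purely for illustration.

```python
import random


def synthetic_ar2(n_samples, seed=0):
    """Hypothetical test signal: x[k] = 1.3*x[k-1] - 0.6*x[k-2] + noise."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n_samples):
        x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))
    return x[2:]


def autocorrelation(signal, max_lag):
    """Return r[0..max_lag], where r[m] = sum_k signal[k] * signal[k - m]."""
    n = len(signal)
    return [sum(signal[k] * signal[k - m] for k in range(m, n))
            for m in range(max_lag + 1)]


def lpc_coefficients(signal, order):
    """Estimate a_1..a_P so that L[k] ~ sum_m a_m * L[k - m].

    Uses the Levinson-Durbin recursion on the autocorrelation sequence
    and returns (coefficients, residual prediction error energy).
    """
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)  # a[0] is unused; a[m] multiplies L[k - m]
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient of stage i.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error


if __name__ == "__main__":
    coeffs, err = lpc_coefficients(synthetic_ar2(5000), order=2)
    print("estimated coefficients:", coeffs)
```

On the synthetic signal, an order-2 fit should recover coefficients close to the generating values 1.3 and −0.6. In the study itself the order P is varied from 22 to 100, which this sketch would accept through the `order` argument.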

2.2. Mel-Frequency Cepstral Coefficient (MFCC) method

This study is inspired by the feature classification experiments in [16]. The method in [16] uses the MFCC digital filtering algorithm to extract features from the original acoustic signals of every amphibian species. It adjusts the pre-emphasis coefficient to create multiple filtering effects, collects the feature spectral values, and constructs the training datasets. Fig 3 shows the architecture of the MFCC.
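The role of the pre-emphasis coefficient can be sketched as follows. The source does not give the pre-emphasis formula, so this example assumes the common first-order form y[n] = x[n] − α·x[n−1]; the function name `pre_emphasis` is hypothetical.

```python
def pre_emphasis(signal, alpha):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample passes through.

    Assumed first-order high-pass form; `alpha` plays the role of the
    pre-emphasis coefficient that the text varies between settings.
    """
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


if __name__ == "__main__":
    slow = [1.0, 1.0, 1.0, 1.0]    # low-frequency (constant) content
    fast = [1.0, -1.0, 1.0, -1.0]  # high-frequency (alternating) content
    print(pre_emphasis(slow, 0.9))
    print(pre_emphasis(fast, 0.9))
```

Under this assumption, a larger coefficient suppresses slowly varying content more strongly and boosts rapidly varying content, which is one way that changing the coefficient yields different filtering effects from the same recording.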

2.3. Deep Neural Network (DNN) method

A DNN provides strong feature classification and is suitable for high-complexity mappings. The basic structure of a neural network transforms the input into the desired output. Inputs are represented by input nodes, and outputs by output nodes. The layers between the input and output are called hidden layers. The number of layers is not fixed, and deep networks typically use many. The general function of each neuron in a neural network is described as follows [44].

(4)

In fact, various neural networks can be constructed, depending on how the neurons are connected. Fig 4 shows the first machine learning classifier, the DNN, which performs feature classification on the datasets constructed with the digital filters.

Fig 4. DNN structure consisting of many hidden layers.

In the experiment, four DNN structures with different numbers of hidden layers are constructed for classification. The input layer takes features of length 10240, and the output layer generates 35 prediction targets.

https://doi.org/10.1371/journal.pone.0259140.g004
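The per-neuron computation described in this section can be sketched in a few lines of Python. The sketch assumes the usual weighted-sum-plus-bias form followed by the sigmoid activation that the experiments report using; the names `neuron` and `dense_layer`, and the toy weights, are illustrative and not taken from the source.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Assumed neuron function: sigmoid of the weighted input sum plus bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def dense_layer(inputs, weight_rows, biases):
    """One fully connected layer: one neuron per row of weights."""
    return [neuron(inputs, row, b) for row, b in zip(weight_rows, biases)]


if __name__ == "__main__":
    x = [1.0, 2.0]
    hidden = dense_layer(x, [[0.5, -0.25], [0.1, 0.2]], [0.0, -0.5])
    print(hidden)
```

Stacking several such layers, with the output of one serving as the input of the next, gives the multi-layer structure of Fig 4. In the experiments the first layer would receive 10240 inputs and the last would produce 35 outputs.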

2.4. Long Short-Term Memory (LSTM) method

The LSTM architecture is designed to solve the vanishing gradient problem and is the first tool to introduce a gating mechanism. The modern LSTM architecture is shown in Fig 5.

Fig 5. Structure of a modern LSTM unit and its layers.

As described for Fig 4, the input layer takes features of length 10240, and the output layer generates 35 prediction targets.

https://doi.org/10.1371/journal.pone.0259140.g005

Mathematically, the LSTM structure is defined as [45]: (5) (6) (7) (8) (9) (10)

i_t, f_t, c_t and o_t are four gates, used respectively for the input, forget, cell and output. The gate values are calculated from a linear combination of the current input x_t and the previous state h_{t−1}, passed through the sigmoid activation function. The updated candidate z_t is calculated from a linear combination of x_t and h_{t−1}, passed through the tanh activation function. The cell state of the previous time step, c_{t−1}, is modified to obtain the cell state of the current time step, c_t, and this process does not directly involve multiplication by any weight factor. The output gate determines how the values of the hidden units are updated [46]. As with the aforementioned DNN method, the datasets constructed by the digital filters are passed in this experiment to the second machine learning classifier, long short-term memory (LSTM), to perform feature classification.
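The gate computations described above can be sketched as a single time step of a scalar LSTM cell. The sketch follows the widely used standard formulation, which may differ in detail from the equations cited in [45]; the parameter names (`w_*`, `u_*`, `b_*`) are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic activation used by the three gates."""
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x_t, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell (standard formulation, assumed).

    `p` maps hypothetical parameter names to values: w_* weights the
    current input, u_* weights the previous hidden state, b_* is a bias.
    """
    i_t = sigmoid(p["w_i"] * x_t + p["u_i"] * h_prev + p["b_i"])    # input gate
    f_t = sigmoid(p["w_f"] * x_t + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o_t = sigmoid(p["w_o"] * x_t + p["u_o"] * h_prev + p["b_o"])    # output gate
    z_t = math.tanh(p["w_z"] * x_t + p["u_z"] * h_prev + p["b_z"])  # candidate
    # The cell update only scales and adds; no weight multiplies c_prev.
    c_t = f_t * c_prev + i_t * z_t
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


if __name__ == "__main__":
    params = {f"{k}_{g}": 0.1 for k in "wub" for g in "ifoz"}
    h, c = 0.0, 0.0
    for x in [0.5, -0.2, 0.8]:
        h, c = lstm_step(x, h, c, params)
    print(h, c)
```

Because the previous cell state is only scaled by the forget gate and then added to, gradients can pass through many time steps without being repeatedly multiplied by a weight matrix, which is the property the text relies on.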

2.5. Principal Component Analysis (PCA) method

The number of principal components is less than or equal to the number of original variables. The main concept of this transformation is that the first principal component has the largest possible variance [43]. A matrix M must be defined that maps each vector x_i in the feature dimension to the corresponding vector y_i in the lower dimension, so that y_i = M^T x_i. The scatter matrix calculated from the feature-dimension vectors can be expressed as [43]: (11) where the mean vector is calculated over the feature dimension. Let the scatter matrix calculated from the low-dimensional vectors be F_u; it is related to the feature-dimension scatter matrix F_v by F_u = M^T F_v M.

The transformation matrix M is optimized to maximize the variance of each element of the transformed vector. This variance is maximized subject to a normalization constraint on M. This can be solved by the Lagrangian method as follows.

(12)
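The PCA procedure outlined above can be illustrated with a small pure-Python sketch that builds the scatter matrix and finds its leading eigenvector by power iteration. Power iteration is used here only as a simple stand-in for solving the constrained maximization; the source does not say how the eigenvectors were computed, and all function names are hypothetical.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter_matrix(rows):
    """Sum over samples of (x - mean)(x - mean)^T."""
    mu = mean_vector(rows)
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for row in rows:
        centered = [x - m for x, m in zip(row, mu)]
        for a in range(d):
            for b in range(d):
                s[a][b] += centered[a] * centered[b]
    return s


def first_principal_component(rows, iterations=200):
    """Unit-length direction of largest variance, found by power iteration."""
    s = scatter_matrix(rows)
    d = len(s)
    v = [1.0] * d  # arbitrary start; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(s[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # data without any spread: no direction to find
            return v
        v = [x / norm for x in w]
    return v


def project(row, component):
    """Scalar projection of one vector onto a component (y = m^T x)."""
    return sum(m * x for m, x in zip(component, row))


if __name__ == "__main__":
    data = [[-2.0, -4.0], [-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
    pc1 = first_principal_component(data)
    print(pc1, [project(r, pc1) for r in data])
```

In the toy data every point lies on the line y = 2x, so the recovered direction is proportional to (1, 2) and a single component preserves all of the variance. The experiments keep 200 such components instead of the original 10240 features.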

2.6. Optimizer function of neural networks

The Adam algorithm combines momentum with an adaptive per-parameter update by exponentially smoothing both quantities. It also directly corrects the bias that arises in exponential smoothing when the running estimate is unrealistically initialized to zero [47]. Let X_t be the exponentially averaged value associated with the t-th parameter w_t. This value is updated in a manner similar to RMSProp, with a decay parameter ρ in the range 0 to 1 [47].

(13)

At the same time, an exponentially smoothed value of the gradient is maintained, whose t-th component is denoted F_t. This smoothing uses a separate decay parameter ρ_f.

(14)

Adaptive Moment Estimation optimizer (Adam) is widely used because it combines the advantages of many optimizers and is quite competitive [47]. It is used here as an optimizer function for deep neural networks (DNN) and long short-term memory (LSTM).
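A minimal sketch of one Adam update for a single scalar parameter is given below. It follows the commonly published form of the algorithm, reusing the symbols X, F, ρ and ρ_f from the text; the default decay values 0.999 and 0.9 and the small constant `eps` are conventional choices assumed here, not values reported in the source.

```python
import math


def adam_step(w, grad, state, lr=0.00002, rho=0.999, rho_f=0.9, eps=1e-8):
    """Return the updated parameter after one Adam step.

    `state` holds the step counter t, the smoothed squared gradient X,
    and the smoothed gradient F; it is modified in place.
    """
    state["t"] += 1
    state["X"] = rho * state["X"] + (1.0 - rho) * grad * grad
    state["F"] = rho_f * state["F"] + (1.0 - rho_f) * grad
    # Bias correction compensates for X and F starting at zero.
    x_hat = state["X"] / (1.0 - rho ** state["t"])
    f_hat = state["F"] / (1.0 - rho_f ** state["t"])
    return w - lr * f_hat / (math.sqrt(x_hat) + eps)


if __name__ == "__main__":
    # Minimize the toy loss (w - 3)^2, whose gradient is 2 * (w - 3).
    w, state = 0.0, {"t": 0, "X": 0.0, "F": 0.0}
    for _ in range(500):
        w = adam_step(w, 2.0 * (w - 3.0), state, lr=0.01)
    print(w)
```

With bias correction, the very first step has magnitude close to the learning rate regardless of the gradient scale, which is part of what makes the optimizer robust to poorly scaled features.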

3. Experimental methods and verification

3.1. Raw data information of anurans

Roughly speaking, the experiment is divided into four main steps: collection of animal bioacoustic data, digital speech signal processing of the characteristics, classification, and recognition [48]. Fig 6 shows the structure of the experimental process [16, 49]. Table 1 lists the 35 amphibians for which bioacoustic data were collected. The bioacoustic datasets are available at http://learning.froghome.org/D/index.html. The signal sampling rate is 44100 Hz, and each sound file captures about 20 seconds of time series data. Prior to processing, we first obtain the original amphibian audio, as shown in Fig 7.

Fig 6. The structure of the experimental process for anuran bioacoustic classification.

https://doi.org/10.1371/journal.pone.0259140.g006

Fig 7. Raw data collected for the first 4 anurans, Rhacophorus taipeianus, Rhacophorus arvalis, Fejervarya limnocharis and Lithobates catesbeianus, each plotted over a duration of approximately 20 seconds.

https://doi.org/10.1371/journal.pone.0259140.g007

3.2. Bioacoustic filtering processing

The LPC and MFCC filtering algorithms convert each signal from an ordinary time series into a bioacoustic spectral feature, as shown in Figs 8 and 9 for LPC and Figs 10 and 11 for MFCC. The feature datasets are built from 35 types of amphibians, each with 40 sets of LPC coefficients. The P value of the linear prediction filter ranges from 22 to 100 in steps of 2, giving 40 values per species and a total of 35 × 40 = 1400 feature spectra. The feature length selected for each spectrum is 10240, so the experimental feature spectrum dataset takes the form of a 1400×10240 matrix, as shown in Fig 12, and constitutes a multi-label multi-class dataset. In the same way, the MFCC method uses 40 pre-emphasis coefficients for each of the 35 categories to construct feature datasets. The pre-emphasis coefficients range from 0.22 to 1 in steps of 0.02. This again yields 1400 feature spectra, each with a feature length of 10240.
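The dataset dimensions quoted above can be checked with a short calculation. The sketch below only enumerates the filter settings and derives the matrix shape; the variable and function names are illustrative.

```python
NUM_SPECIES = 35
FEATURE_LENGTH = 10240

# LPC orders P = 22, 24, ..., 100 (step 2).
lpc_orders = list(range(22, 101, 2))

# MFCC pre-emphasis coefficients 0.22, 0.24, ..., 1.00 (step 0.02).
pre_emphasis_values = [round(0.22 + 0.02 * i, 2) for i in range(40)]


def dataset_shape(num_species, settings, feature_length):
    """Rows = one feature spectrum per (species, setting) pair."""
    return (num_species * len(settings), feature_length)


if __name__ == "__main__":
    print(len(lpc_orders), dataset_shape(NUM_SPECIES, lpc_orders, FEATURE_LENGTH))
    print(len(pre_emphasis_values),
          dataset_shape(NUM_SPECIES, pre_emphasis_values, FEATURE_LENGTH))
```

Both parameter sweeps contain 40 settings, so each dataset has 35 × 40 = 1400 rows of 10240 features, matching the 1400×10240 matrix described in the text.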

Fig 8. The spectrum diagram of anuran bioacoustic features filtered through the LPC algorithm with P coefficient equal to 60, including Rhacophorus taipeianus, Rhacophorus arvalis, Fejervarya limnocharis, Lithobates catesbeianus, Babina adenopleura, Microhyla ornata, Rana longicrus, Hoplobatrachus rugulosus, Hylarana taipehensis, Pelophylax plancyi, Polypedates megacephalus, Pseudoamolops sauteri, Odorrana swinhoana, Rana okinavana and Rana guentheri.

https://doi.org/10.1371/journal.pone.0259140.g008

Fig 9. The spectrum diagram of anuran bioacoustic features filtered through the LPC algorithm with P coefficient equal to 60, including Microhyla butleri, Microhyla heymonsi, Micryletta steinegeri, Kaloula pulchra, Limnonectes fujianensis, Rana latouchii, Fejervarya cancrivora, Buergeria japonica, Buergeria otai, Buergeria robusta, Kurixalus eiffingeri, Kurixalus idiootocus, Polypedates braueri, Rhacophorus aurantiventris, Rhacophorus moltrechti, Rhacophorus prasinatus, Khirixalus wangi, Bufo bankorensis, Duttaphrynus melanostictus and Hyla chinensis.

https://doi.org/10.1371/journal.pone.0259140.g009

Fig 10. The spectrum diagram of anuran bioacoustic features filtered through the MFCC algorithm with pre-emphasis coefficient equal to 0.9, including Rhacophorus taipeianus, Rhacophorus arvalis, Fejervarya limnocharis, Lithobates catesbeianus, Babina adenopleura, Microhyla ornata, Rana longicrus, Hoplobatrachus rugulosus, Hylarana taipehensis, Pelophylax plancyi, Polypedates megacephalus, Pseudoamolops sauteri, Odorrana swinhoana, Rana okinavana and Rana guentheri.

https://doi.org/10.1371/journal.pone.0259140.g010

Fig 11. The spectrum diagram of anuran bioacoustic features filtered through the MFCC algorithm with pre-emphasis coefficient equal to 0.9, including Microhyla butleri, Microhyla heymonsi, Micryletta steinegeri, Kaloula pulchra, Limnonectes fujianensis, Rana latouchii, Fejervarya cancrivora, Buergeria japonica, Buergeria otai, Buergeria robusta, Kurixalus eiffingeri, Kurixalus idiootocus, Polypedates braueri, Rhacophorus aurantiventris, Rhacophorus moltrechti, Rhacophorus prasinatus, Khirixalus wangi, Bufo bankorensis, Duttaphrynus melanostictus and Hyla chinensis.

https://doi.org/10.1371/journal.pone.0259140.g011

Fig 12. The label establishment of 35 anuran datasets through bioacoustic spectral features filtered by an LPC algorithm.

The label in the first column, X_YY, indicates the X-th anuran with linear prediction coefficient equal to YY. MFCC uses similar data labeling and data model construction methods to generate features of length 10240 corresponding to the 40 pre-emphasis coefficients. Each of the two datasets is divided into two parts in the machine learning classification stage: the experiment randomly selects 70% of the data for training, with the remaining 30% used for testing.

https://doi.org/10.1371/journal.pone.0259140.g012
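The labeling scheme and the random 70/30 split described for Fig 12 can be sketched as follows. The species numbering from 1 to 35, the fixed random seed and the function names are assumptions made for illustration; the source does not specify how the random selection was implemented.

```python
import random


def build_labels(num_species=35, orders=range(22, 101, 2)):
    """Labels of the form X_YY: species index X and LPC setting YY."""
    return [f"{s}_{p}" for s in range(1, num_species + 1) for p in orders]


def split_indices(n_samples, test_ratio=0.3, seed=0):
    """Shuffle sample indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


if __name__ == "__main__":
    labels = build_labels()
    train_idx, test_idx = split_indices(len(labels))
    print(len(labels), len(train_idx), len(test_idx))
    print(labels[0], labels[-1])
```

With 1400 samples this yields 980 training rows and 420 testing rows. A purely random split does not guarantee that every species is equally represented in both parts, which is worth keeping in mind when reading per-class results.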

3.3. Results of classification and identification

For category recognition, the DNN and LSTM are used in this experiment to train on the bioacoustic feature datasets. PyTorch is a very popular computing platform; here it runs the feature data classification on a parallel GPU processor, using “Adam” as the optimizer function. In the experimental process, PCA, a method for reducing the dimensionality of the sound spectrum datasets, is applied to compare the effectiveness of each algorithm’s architecture, with the number of principal components set to 200.

There are four important parameter settings. The number of iterations is set to 1000, the learning rate to 0.00002, and the batch size to 1400; the training process for this model is an iterative operation that calculates and updates the neural network weights. The ratio of randomly selected validation data is 0.3, which means that 30% of the dataset is randomly selected for testing and serves as the basis for verifying the model. The LPC and MFCC features are then classified with the two deep learning classifiers mentioned previously.
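The four settings can be collected in a small configuration sketch, which also makes one consequence explicit. Interpreting the "number of iterations" as the number of passes over the training data is an assumption, as are the dictionary keys and the helper name `updates_per_epoch`.

```python
import math

# Values taken from the text; the key names are illustrative.
TRAIN_CONFIG = {
    "iterations": 1000,        # assumed to mean passes over the training data
    "learning_rate": 0.00002,
    "batch_size": 1400,
    "validation_ratio": 0.3,
}


def updates_per_epoch(n_train, batch_size):
    """Number of weight updates needed to see every training sample once."""
    return math.ceil(n_train / batch_size)


if __name__ == "__main__":
    n_total = 1400
    n_train = n_total - round(n_total * TRAIN_CONFIG["validation_ratio"])
    per_epoch = updates_per_epoch(n_train, TRAIN_CONFIG["batch_size"])
    print(n_train, per_epoch, per_epoch * TRAIN_CONFIG["iterations"])
```

Because the batch size is at least as large as the training portion of the 1400 samples, each pass amounts to a single full-batch update under this reading, so the run would perform 1000 weight updates in total.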

The first classifier used in this study is the deep neural network. We construct four different DNN models for classification. Table 2 shows the four types of deep neural network models. Models 1 through 4 have 12, 16, 20 and 24 hidden layers, respectively. Every neural network here uses the sigmoid activation function, and the input consists of features of length 10240. The output layer has 35 prediction targets.

Table 3 shows the LPC and MFCC feature classification results of the DNN structures from Table 2. For the LPC datasets, applying PCA before classification increases accuracy while reducing the training period. Figs 13(A), 14(A), 15(A) and 16(A) respectively show the loss function of the LPC-DNN-12-layer, LPC-DNN-16-layer, LPC-DNN-20-layer and LPC-DNN-24-layer models, while Figs 13(B), 14(B), 15(B) and 16(B) show the classification process following PCA. Similarly, Figs 17–20 show the corresponding illustrations to Figs 13–16 for the MFCC filtering algorithm. The LPC and MFCC feature datasets obtain different feature classification results. Compared with the LPC-DNN model, the MFCC-DNN model presents a smoother gradient descent. Introducing the PCA dimensionality reduction method smooths the gradient descent for both the LPC-PCA-DNN and MFCC-PCA-DNN models. However, the accuracy score of the MFCC-PCA-DNN model is slightly lower than that of the MFCC-DNN model: the change in performance for the 12-, 16-, 20- and 24-layer models is -0.3%, -0.1%, -0.2% and -1.2%, respectively. This result shows that introducing the PCA method has no obvious benefit for the MFCC feature datasets. In addition, as the number of hidden layers of the DNN increases, the accuracy score on the LPC feature datasets decreases, while the MFCC accuracy remains relatively stable. Increasing the number of hidden layers thus has a greater impact on the LPC model than on the MFCC model.

Fig 13.

(a) The graph shows the performance of LPC-DNN-12-layer model; (b) The graph shows the performance of LPC-PCA-DNN-12-layer model.

https://doi.org/10.1371/journal.pone.0259140.g013

Fig 14.

(a) The graph shows the performance of LPC-DNN-16-layer model; (b) The graph shows the performance of LPC-PCA-DNN-16-layer model.

https://doi.org/10.1371/journal.pone.0259140.g014

Fig 15.

(a) The graph shows the performance of LPC-DNN-20-layer model; (b) The graph shows the performance of LPC-PCA-DNN-20-layer model.

https://doi.org/10.1371/journal.pone.0259140.g015

Fig 16.

(a) The graph shows the performance of LPC-DNN-24-layer model; (b) The graph shows the performance of LPC-PCA-DNN-24-layer model.

https://doi.org/10.1371/journal.pone.0259140.g016

Fig 17.

(a) The graph shows the performance of MFCC-DNN-12-layer model; (b) The graph shows the performance of MFCC-PCA-DNN-12-layer model.

https://doi.org/10.1371/journal.pone.0259140.g017

Fig 18.

(a) The graph shows the performance of MFCC-DNN-16-layer model; (b) The graph shows the performance of MFCC-PCA-DNN-16-layer model.

https://doi.org/10.1371/journal.pone.0259140.g018

Fig 19.

(a) The graph shows the performance of MFCC-DNN-20-layer model; (b) The graph shows the performance of MFCC-PCA-DNN-20-layer model.

https://doi.org/10.1371/journal.pone.0259140.g019

Fig 20.

(a) The graph shows the performance of MFCC-DNN-24-layer model; (b) The graph shows the performance of MFCC-PCA-DNN-24-layer model.

https://doi.org/10.1371/journal.pone.0259140.g020

Table 3. Training results of DNN models and PCA-DNN models.

https://doi.org/10.1371/journal.pone.0259140.t003

Nevertheless, adding redundant hidden layers to a DNN is sometimes unnecessary, which means that datasets of different sizes each have their own experimentally best parameter sets and appropriate structures. The impact of PCA on classification effectiveness is clearly revealed in the test results. For the LPC feature datasets, applying PCA not only reduces the time needed for model training but also increases the smoothness of the loss function. For the MFCC feature datasets, it is counterproductive. Moreover, within an appropriate range of neural network structures, classification effectiveness increases with the number of hidden layers.

The second neural network method used in this experiment is the long short-term memory (LSTM) algorithm. The experiment uses different LSTM architectures, all based on two hidden layers, with 200, 300, 500 and 700 hidden neurons, respectively, and compares each with and without PCA. Table 4 lists the accuracy and training times of the four hidden-layer configurations with the LPC and MFCC datasets; the LSTM network label layer = 2×200 indicates 2 hidden layers containing 200 hidden neurons. Figs 21(A), 22(A), 23(A) and 24(A) show the classification process with the LPC datasets, while Figs 21(B), 22(B), 23(B) and 24(B) show the classification process after adding the PCA method. Similarly, Figs 25–28 show the corresponding illustrations to Figs 21–24 for the MFCC filtering algorithm. In addition, Figs 29 and 30 present the long-term prediction of the LSTM algorithm for the LPC and MFCC feature datasets, respectively. The training set and test set make up 80% and 20% of the datasets, respectively. The reduced training time highlights the impact of PCA on LSTM calculations. The loss function with the LPC datasets shows that PCA produces a smoother gradient descent process. In terms of time, PCA has a key impact on enhancing the advantages of LSTM algorithms. For the LSTM model, the accuracy on the LPC feature dataset increases with the number of hidden neurons. Introducing the PCA method increases the accuracy score and reduces the training time, with the structures from 200 to 700 hidden neurons showing increases of 8.5%, 1.5%, 0.5% and 0.2%, respectively. However, despite the significant decrease in the training period for the MFCC-PCA-LSTM, the accuracy on the MFCC feature datasets is slightly reduced, with the structures from 200 to 700 hidden neurons showing changes in performance of -1.0%, -0.7%, -0.5% and -0.2%, respectively. In other words, the MFCC-LSTM model can achieve a considerable degree of accuracy. In addition, as the number of hidden neurons increases, the accuracy on the LPC feature dataset gradually improves, while that on the MFCC feature dataset remains relatively unchanged. It can also be inferred that the number of hidden neurons affects the accuracy score of the LPC model.

Fig 21.

(a) The graph shows the performance of LPC-LSTM-2×200 model; (b) The graph shows the performance of LPC-PCA-LSTM-2×200 model.

https://doi.org/10.1371/journal.pone.0259140.g021

Fig 22.

(a) The graph shows the performance of LPC-LSTM-2×300 model; (b) The graph shows the performance of LPC-PCA-LSTM-2×300 model.

https://doi.org/10.1371/journal.pone.0259140.g022

Fig 23.

(a) The graph shows the performance of LPC-LSTM-2×500 model; (b) The graph shows the performance of LPC-PCA-LSTM-2×500 model.

https://doi.org/10.1371/journal.pone.0259140.g023

Fig 24.

(a) The graph shows the performance of LPC-LSTM-2×700 model; (b) The graph shows the performance of LPC-PCA-LSTM-2×700 model.

https://doi.org/10.1371/journal.pone.0259140.g024

Fig 25.

(a) The graph shows the performance of MFCC-LSTM-2×200 model; (b) The graph shows the performance of MFCC-PCA-LSTM-2×200 model.

https://doi.org/10.1371/journal.pone.0259140.g025

Fig 26.

(a) The graph shows the performance of MFCC-LSTM-2×300 model; (b) The graph shows the performance of MFCC-PCA-LSTM-2×300 model.

https://doi.org/10.1371/journal.pone.0259140.g026

Fig 27.

(a) The graph shows the performance of MFCC-LSTM-2×500 model; (b) The graph shows the performance of MFCC-PCA-LSTM-2×500 model.

https://doi.org/10.1371/journal.pone.0259140.g027

Fig 28.

(a) The graph shows the performance of MFCC-LSTM-2×700 model; (b) The graph shows the performance of MFCC-PCA-LSTM-2×700 model.

https://doi.org/10.1371/journal.pone.0259140.g028

Fig 29.

(a) Long-term prediction with the LPC feature datasets for a predictive coefficient of 50; (b) long-term prediction with the LPC feature datasets for a predictive coefficient of 90.

https://doi.org/10.1371/journal.pone.0259140.g029

Fig 30.

(a) Long-term prediction with the MFCC feature datasets for a pre-emphasis coefficient of 0.5; (b) long-term prediction with the MFCC feature datasets for a pre-emphasis coefficient of 0.9.

https://doi.org/10.1371/journal.pone.0259140.g030

Table 4. Training results of LSTM models and PCA-LSTM models.

https://doi.org/10.1371/journal.pone.0259140.t004

For the datasets constructed in this experiment, different neural network configurations will have different effects, and PCA increases the difference in performance, especially with LPC datasets. A significant performance improvement implies that, at the practical application level, this feature dataset faces many unexpected external factors.
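To make the dimensionality reduction step concrete, the following sketch projects a small feature matrix onto its leading principal components. It is a simplified illustration, not the implementation used in the experiments: it relies on power iteration with deflation in plain Python, the function name `pca_reduce` is hypothetical, and the toy data are invented.

```python
import math
import random


def pca_reduce(rows, k, iters=500, seed=0):
    """Project rows onto their first k principal components.

    Uses power iteration with deflation on the sample covariance matrix,
    which is adequate for a small illustration only.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]

    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        v_norm = math.sqrt(sum(x * x for x in v))
        v = [x / v_norm for x in v]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in the deflated matrix
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(d) for j in range(d))
        components.append(v)
        # Deflation: remove the direction just found before the next pass.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]

    scores = [[sum(r[j] * c[j] for j in range(d)) for c in components]
              for r in centred]
    return scores, components


# Toy 3-D features whose variance lies along a single direction (invented data).
toy = [[float(x), 2.0 * x, 1.0] for x in range(6)]
scores, components = pca_reduce(toy, k=1)
print(len(toy[0]), "features reduced to", len(scores[0]))
```

A smaller input dimension means fewer weights in the first network layer, which is consistent with the shorter training times reported when PCA is applied.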

This article specifically discusses the efficiency and calculation time of several models, and further analyzes the best algorithm combination. Table 5 shows the average score of the k-fold cross validation. Figs 31–34 present the confusion matrices obtained for the LPC and MFCC feature datasets with the DNN and PCA-DNN, respectively. Figs 35–38 show the confusion matrices obtained for the LPC and MFCC feature datasets with the LSTM and PCA-LSTM, respectively. Table 6 lists the four specific algorithm combinations. In terms of accuracy, all provide high-precision recognition. Different deep learning algorithms have different configuration architectures, along with different accuracy scores and training times. In addition, Fig 39 shows that, compared with the DNN model, the LSTM model produces very fast gradient descent convergence within 300 epochs, and the fastest gradient descent is found in the MFCC-LSTM model, which can converge within 200 epochs.
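As a reminder of how a confusion matrix is built and read, the sketch below tallies true against predicted labels and derives overall accuracy and per-class recall. The labels are invented for three hypothetical classes, and the function names are illustrative.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """Recall of each class: diagonal entry divided by its row total."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Toy labels for three hypothetical classes (not data from the paper).
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
for row in cm:
    print(row)
print("accuracy:", accuracy(cm))
print("recall per class:", per_class_recall(cm))
```

Off-diagonal entries identify which classes are confused with one another, information that a single accuracy figure hides.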

Fig 31. The confusion matrix of DNN-12-layer model with LPC datasets.

https://doi.org/10.1371/journal.pone.0259140.g031

Fig 32. The confusion matrix of PCA-DNN-12-layer model with LPC datasets.

https://doi.org/10.1371/journal.pone.0259140.g032

Fig 33. The confusion matrix of DNN-12-layer model with MFCC datasets.

https://doi.org/10.1371/journal.pone.0259140.g033

Fig 34. The confusion matrix of PCA-DNN-12-layer model with MFCC datasets.

https://doi.org/10.1371/journal.pone.0259140.g034

Fig 35. The confusion matrix of LSTM-2×200 model with LPC datasets.

https://doi.org/10.1371/journal.pone.0259140.g035

Fig 36. The confusion matrix of PCA-LSTM-2×200 model with LPC datasets.

https://doi.org/10.1371/journal.pone.0259140.g036

Fig 37. The confusion matrix of LSTM-2×200 model with MFCC datasets.

https://doi.org/10.1371/journal.pone.0259140.g037

Fig 38. The confusion matrix of PCA-LSTM-2×200 model with MFCC datasets.

https://doi.org/10.1371/journal.pone.0259140.g038

Fig 39. The loss function of network structures, LPC-DNN, LPC-PCA-DNN, LPC-LSTM, LPC-PCA-LSTM, MFCC-DNN, MFCC-PCA-DNN, MFCC-LSTM and MFCC-PCA-LSTM, as the epoch increases.

A 20-layer structure is selected for the DNN, while 1200 hidden neurons with 2 layers are set for the LSTM. The MFCC-LSTM model appears to need only 100 epochs for the loss function to converge completely, which also shortens the training time.

https://doi.org/10.1371/journal.pone.0259140.g039
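The criterion used to judge when a loss curve has converged is not stated, so the sketch below shows one possible definition: the first epoch after which the loss changes by less than a tolerance for a given number of consecutive epochs. The tolerance, the patience value, the function name `convergence_epoch` and the synthetic curves are all assumptions made for illustration.

```python
import math


def convergence_epoch(losses, tol=1e-3, patience=10):
    """Index of the first loss value in a flat stretch of the curve.

    A stretch counts as flat when `patience` consecutive epoch-to-epoch
    changes are all smaller than `tol`. Returns None if none is found.
    """
    run = 0
    for epoch in range(1, len(losses)):
        if abs(losses[epoch] - losses[epoch - 1]) < tol:
            run += 1
            if run >= patience:
                return epoch - patience
        else:
            run = 0
    return None


# Synthetic loss curves (not data from the paper): fast and slow exponential decay.
fast = [math.exp(-e / 20) for e in range(300)]
slow = [math.exp(-e / 60) for e in range(300)]

print("fast curve flattens at epoch", convergence_epoch(fast))
print("slow curve flattens at epoch", convergence_epoch(slow))
```

The reported epoch depends strongly on the chosen tolerance and patience, so comparisons between models are only meaningful when the same settings are used for every curve.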

Table 5. Average score of 5-fold cross validation results of proposed models.

https://doi.org/10.1371/journal.pone.0259140.t005
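The averaging behind a 5-fold cross validation score can be sketched as follows. The splitter and the majority-class baseline are hypothetical stand-ins for training and testing a real classifier, and the labels are invented.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_val_score(evaluate, n_samples, k=5, seed=0):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores), scores


labels = [0] * 14 + [1] * 6  # toy labels, not data from the paper


def majority_baseline(train_idx, test_idx):
    """Hypothetical stand-in for a classifier: always predict the majority class."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)


mean_score, fold_scores = cross_val_score(majority_baseline, len(labels), k=5)
print("fold scores:", fold_scores)
print("mean score:", mean_score)
```

Replacing `majority_baseline` with a function that trains a model on the training indices and returns its test accuracy gives the kind of averaged score reported in a cross validation table.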

Table 6. Loss functions between PCA-DNN model and PCA-LSTM model.

https://doi.org/10.1371/journal.pone.0259140.t006

This study is inspired by the feature classification experiments in [16]. The method in [16] uses the MFCC digital filtering algorithm to extract features from the original acoustic signals of each amphibian species. It adjusts the pre-emphasis coefficients to create multiple filtering effects, collects the feature spectral values, and constructs the training datasets. Two widely used deep learning algorithms (DNN and LSTM) are applied to the classification model. The DSP algorithm used for feature extraction in [16] is MFCC, whereas this study investigates both LPC and MFCC. The platform is also different: Matlab is used in [16], whereas Python with Pytorch has been chosen for this study. With regard to classification, MLP and SVM are used in [16], whereas DNN and LSTM are used in this study. Moreover, this work includes 20 more types of sound samples.

4. Conclusions

This research applies two algorithm architectures, DNN and LSTM, to the feature classification of amphibian sounds through the bioacoustic spectrum. The machine learning structure used is the key to determining feature extraction and classification recognition performance. Available sound data are first collected for analysis, and the LPC and MFCC algorithms are applied for digital filtering of the data. The characteristic acoustic spectrum values obtained from filtering are then collected and aggregated to construct the respective synthetic datasets. The DNN and LSTM classifiers are evaluated with different numbers of hidden layers, parameters, and function settings to analyze their effect and determine the optimal algorithm combination. The experimental results are presented in graphs and tables. Strikingly different classification results are obtained using the GPU with the adaptive moment estimation (Adam) optimizer. The results show that the PCA algorithm can effectively reduce dataset dimensionality and provides improved classification and recognition performance with LPC datasets. However, for MFCC datasets, there is no obvious benefit to introducing the PCA method, which shows that PCA has a greater impact on LPC datasets than on MFCC datasets. In short, deep learning neural networks have been shown to be applicable to the processing and analysis of big data models, and can achieve reasonable classification results in identifying specific species through effective classifier algorithms and training models with reasonable characteristics. Based on the research data and analytical results in this study, it is concluded that MFCC-LSTM not only possesses high precision, but also offers greater benefit in reducing model training time.

Future research can focus on applying other modern machine learning methods and algorithms. The widespread use of acoustic features would establish a key milestone in the improvement of modern technologies. The experiments presented here focus on the classification of animal acoustic features, but these techniques can be further used in the detection of abnormal sounds in human physiology, which would present a significant development in the use of sound analysis for medical diagnosis [50, 51].

References

  1. Penar W., Magiera A., Klocek C. Applications of bioacoustics in animal ecology. Ecol. Complex. 2020, 43.
  2. Xie J., Indraswari K., Schwarzkopf L., Towsey M., Zhang J., Roe P. Acoustic classification of frog within-species and species-specific calls. Appl. Acoust. 2018, 131, 79–86.
  3. Qian K., Zhang Z., Baird A., Schuller B. Active learning for bird sound classification via a kernel-based extreme learning machine. J. Acoust. Soc. Am. 2017, 142, 1796–1804. pmid:29092546
  4. Chao K. W., Chao Y. C., Su C. K., Hu N. Z., Chiu W. H. Using machine learning method to identify for frog classification. IEEE Eurasia Conf. IOT, Comm. Eng., Yunlin, Taiwan, 3–6 Oct. 2019, IEEE.
  5. Wu J. D., Bai M. R., Su F. C., Huang C. W. An expert system for the diagnosis of faults in rotating machinery using adaptive order-tracking algorithm. Expert Syst. Appl. 2009, 36, 5424–5431.
  6. Li J., Qu W. Aero-engine Sensor Fault Diagnosis Based on Convolutional Neural Network. 37th Chi. Ctrl. Conf., Wuhan, China, 25–27 July 2018, IEEE.
  7. Luque A., Romero-Lemos J., Carrasco A., Barbancho J. Non-sequential automatic classification of anuran sounds for the estimation of climate-change indicators. Expert Syst. Appl. 2018, 95, 248–260.
  8. Thakur A., Thapar D., Rajan P., Nigam A. Deep metric learning for bioacoustic classification: overcoming training data scarcity using dynamic triplet loss. J. Acoust. Soc. Am. 2019, 146, 534–547. pmid:31370640
  9. Noda Arencibia J. J., Travieso C. M., Sánchez-Rodríguez D., Dutta M. K., Vyas G. Automatic classification of frogs calls based on fusion of features and SVM. Eighth Int. Conf. Contemp. Computing, Noida, India, 20–22 Aug. 2015, IEEE.
  10. Strout J., Rogan B., Seyednezhad S. M., Smart K., Bush M., Ribeiro E. Anuran call classification with deep learning. IEEE Int. Conf. Acoust., Speech Signal Process., New Orleans, LA, 5–9 Mar. 2017, IEEE.
  11. Xie J., Towsey M., Zhang J., Roe P. Investigation of acoustic and visual features for frog call classification. J. Signal Process. Syst. 2019, 92, 23–36.
  12. Blumstein D. T., Mennill D. J., Clemins P., Girod L., Yao K., Patricelli G., et al. Acoustic monitoring in terrestrial environments using microphone arrays: applications, technological considerations and prospectus. J. Appl. Ecol. 2011, 48, 758–767.
  13. Gharehbaghi A., Lindén M. A deep machine learning method for classifying cyclic time series of biological signals using time-growing neural network. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 4102–4115. pmid:29035230
  14. Narasimhan R., Fern X. Z., Raich R. Simultaneous segmentation and classification of bird song using CNN. IEEE Int. Conf. Acoust., Speech Signal Process., New Orleans, LA, 5–9 Mar. 2017, IEEE.
  15. Souza L. S., Gatto B. B., Fukui K. Classification of bioacoustic signals with tangent singular spectrum analysis. IEEE Int. Conf. Acoust., Speech Signal Process., Brighton, UK, 12–17 May 2019, IEEE.
  16. Chao K. W., Hu N. Z., Chao Y. C., Su C. K., Chiu W. H. Implementation of artificial intelligence for classification of frogs in bioacoustics. MDPI Symmetry 2019, 11.
  17. Peng W., Gao W., Liu J. AI-enabled massive devices multiple access for smart city. IEEE Internet Things J. 2019, 6, 7623–7634.
  18. Zhao Y., Yan B., Li Z., Wang W., Wang Y., Zhang J. Coordination between control layer AI and on-board AI in optical transport networks [Invited]. J. Opt. Commun. Netw. 2019, 12, A49–A57.
  19. Tu Y. H., Du J., Lee C. H. Speech enhancement based on teacher-student deep learning using improved speech presence probability for noise-robust speech recognition. IEEE/ACM Trans. Audio, Speech, and Language Process. 2019, 27, 2080–2091.
  20. Wang Y., Wei Z., Yang J. Feature trend extraction and adaptive density peaks search for intelligent fault diagnosis of machines. IEEE Trans. Industr. Inform. 2018, 15, 105–115.
  21. Dua S., Du X. Data Mining and Machine Learning in Cybersecurity, 1st ed., Auerbach Publications: Boston, U.S., 2011.
  22. Valletta J. J., Torney C., Kings M., Thornton A., Madden J. Applications of machine learning in animal behaviour studies. Animal Behaviour 2017, 124, 203–220.
  23. Zhong M., LeBien J., Campos-Cerqueira M., Dodhia R., Ferres J. L., Velev J. P., et al. Multispecies bioacoustic classification using transfer learning of deep convolutional neural networks with pseudo-labeling. Applied Acoustics 2020, 166.
  24. Lee K. H., Kim D. H. Design of a convolutional neural network for speech emotion recognition. Int. Conf. Inform. Comm. Tech. Conv., Jeju, Korea (South), 21–23 Oct. 2020, IEEE.
  25. Abbasi A. N., He M. Convolutional neural network with PCA and batch normalization for hyperspectral image classification. Int. Geosci. Rem. Sens. Symp., Yokohama, Japan, 28 July–2 Aug. 2019, IEEE.
  26. Abdel-Hamid O., Mohamed A. r., Jiang H., Deng L., Penn G., Yu D. Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio, Speech, Language Process. 2014, 22, 1533–1545.
  27. Sharan R. V., Moir T. J. Robust acoustic event classification using deep neural networks. Inform. Sci. 2017, 396, 24–32.
  28. Kao C. C., Sun M., Wang W., Wang C. A comparison of pooling methods on LSTM models for rare acoustic event classification. Int. Conf. Acoust., Speech and Signal Process., Barcelona, Spain, 4–8 May 2020, IEEE.
  29. Chowdhury A., Ross A. Fusing MFCC and LPC features using 1D triplet CNN for speaker recognition in severely degraded audio signals. IEEE Trans. Inf. Forensics Secur. 2019, 15, 1616–1629.
  30. Chakrasali S., Bilembagi U., Indira K. Formants and LPC analysis of kannada vowel speech signals. 3rd IEEE Int. Conf. Recent Trends Elec. Inform. Comm. Tech., Bangalore, India, 18–19 May 2018, IEEE.
  31. Dixit A., Vidwans A., Sharma P. Improved MFCC and LPC algorithm for bundelkhandi isolated digit speech recognition. Int. Conf. Electr., Elec., and Optim. Techniques, Chennai, India, 3–5 Mar. 2016, IEEE.
  32. Zhang X., Ren X. Two dimensional principal component analysis based independent component analysis for face recognition. Int. Conf. Multimed. Technol., Hangzhou, China, 26–28 July 2011.
  33. Lozano-Diez A., Zazo R., Toledano D. T., Gonzalez-Rodriguez J. An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition. PLOS ONE 2017. pmid:28796806
  34. Babu K. A., Ramkumar B. Automatic recognition of fundamental heart sound segments from PCG corrupted with lung sounds and speech. IEEE Access 2020, 8, 179983–179994.
  35. Hanrahan G. Artificial Neural Networks in Biological and Environmental Analysis, 1st ed., CRC Press: Boca Raton, U.S., 2011.
  36. Dai S., Li L., Li Z. Modeling vehicle interactions via modified LSTM models for trajectory prediction. IEEE Access 2019, 7, 38287–38296.
  37. Jyotishi D., Dandapat S. An LSTM-based model for person identification using ECG signal. IEEE Sens. Letter 2020, 4.
  38. Zazo R., Lozano-Diez A., Gonzalez-Dominguez J., Toledano D. T., Gonzalez-Rodriguez J. Language identification in short utterances using long short-term memory (LSTM) recurrent neural networks. PLOS ONE 2016. pmid:26824467
  39. Swedia E. R., Mutiara A. B., Subali M., Ernastuti. Deep learning long-short term memory (LSTM) for Indonesian speech digit recognition using LPC and MFCC feature. 3rd Int. Conf. Inform. Comput., Palembang, Indonesia, 17–18 Oct. 2018.
  40. Manaswi N. K. Deep Learning with Applications using Python, 1st ed., Apress: New York, USA, 2018.
  41. Camastra F., Vinciarelli A. Machine Learning for Audio, Image and Video Analysis, Springer: London, U.K., 2015.
  42. Wang F., Xu W. A comparison of algorithms for the calculation of LPC coefficients. IEEE Int. Conf. Inform. Sci., Elec. Elect. Eng., Sapporo, Japan, 26–28 April 2014.
  43. Gopi E. S. Digital Speech Processing using Matlab, 1st ed., Springer: Berlin, Germany, 2013.
  44. Skansi S. Introduction to Deep Learning from Logical Calculus to Artificial Intelligence, 1st ed., Springer: Berlin, Germany, 2018.
  45. Hajiaghayi M., Vahedi E. Code failure prediction and pattern extraction using LSTM networks. IEEE 5th Int. Conf. Big Data Computing Ser. Appl., Newark, CA, USA, 4–9 April 2019.
  46. Melin P., Castillo O., Kacprzyk J. Nature-Inspired Design of Hybrid Intelligent Systems, Springer: London, U.K., 2017.
  47. Aggarwal C. C. Neural Networks and Deep Learning, A Textbook, 1st ed., Springer: Berlin, Germany, 2018.
  48. Gong C. A., Su C. S., Tseng K. H. Implementation of machine learning for fault classification on vehicle power transmission system. IEEE Sens. J. 2020, 20, 15163–15176.
  49. Huang C. J., Yang Y. J., Yang D. X., Chen Y. J. Frog classification using machine learning techniques. Expert Syst. Appl. 2009, 36, 3737–3743.
  50. Mewada H. K., Patel A. V., Hassaballah M., Alkinani M. H., Mahant K. Spectral–spatial features integrated convolution neural network for breast cancer classification. MDPI Sensors 2020, 20. pmid:32842640
  51. Ahlstrom C., Liljefeldt O., Hult P., Ask P. Heart sound cancellation from lung sound recordings using recurrence time statistics and nonlinear prediction. IEEE Signal Process. Letters 2005, 12, 812–815.