Benchmarking of eight recurrent neural network variants for breath phase and adventitious sound detection on a self-developed open-access lung sound database—HF_Lung_V1

A reliable, remote, and continuous real-time respiratory sound monitor with automated respiratory sound analysis ability is urgently required in many clinical scenarios—such as in monitoring disease progression of coronavirus disease 2019—to replace conventional auscultation with a handheld stethoscope. However, a robust computerized respiratory sound analysis algorithm for breath phase detection and adventitious sound detection at the recording level has not yet been validated in practical applications. In this study, we developed a lung sound database (HF_Lung_V1) comprising 9,765 audio files of lung sounds (duration of 15 s each), 34,095 inhalation labels, 18,349 exhalation labels, 13,883 continuous adventitious sound (CAS) labels (comprising 8,457 wheeze labels, 686 stridor labels, and 4,740 rhonchus labels), and 15,606 discontinuous adventitious sound labels (all crackles). We conducted benchmark tests using long short-term memory (LSTM), gated recurrent unit (GRU), bidirectional LSTM (BiLSTM), bidirectional GRU (BiGRU), convolutional neural network (CNN)-LSTM, CNN-GRU, CNN-BiLSTM, and CNN-BiGRU models for breath phase detection and adventitious sound detection. We also conducted performance comparisons between the LSTM-based and GRU-based models, between unidirectional and bidirectional models, and between models with and without a CNN. The results revealed that these models exhibited adequate performance in lung sound analysis. In terms of F1 scores and areas under the receiver operating characteristic curve, the GRU-based models outperformed the LSTM-based models in most of the defined tasks. Furthermore, all bidirectional models outperformed their unidirectional counterparts. Finally, the addition of a CNN improved the accuracy of lung sound analysis, especially in the CAS detection tasks.


Introduction
Respiration is vital for the normal functioning of the human body. Therefore, clinical physicians are frequently required to examine respiratory conditions. Respiratory auscultation (Bohadana et al., 2014; Goettel & Herrmann, 2019; Sarkar et al., 2015) using a stethoscope has long been a crucial first-line physical examination. The chestpiece of a stethoscope is usually placed on a patient's chest or back for lung sound auscultation or over the patient's tracheal region for tracheal sound auscultation. During auscultation, breath cycles can be inferred, which help clinical physicians evaluate the patient's respiratory rate. In addition, pulmonary pathologies are suspected when the frequency or intensity of respiratory sounds changes or when adventitious sounds, including continuous adventitious sounds (CASs) and discontinuous adventitious sounds (DASs), are identified (Bohadana et al., 2014; Goettel & Herrmann, 2019; Pramono et al., 2017). Patients with coronavirus disease 2019 exhibit adventitious sounds; hence, auscultation may be a useful approach for disease diagnosis (Raj et al., 2020) and disease progression tracking. However, auscultation performed using a conventional handheld stethoscope involves some limitations (Sovijärvi et al., 1997). First, the interpretation of auscultation results substantially depends on the subjectivity of the practitioners. Even experienced clinicians might not have high consensus rates in their interpretations of auscultatory manifestations (Berry et al., 2016; Grunnreis, 2016). Second, auscultation is a qualitative analysis method. Comparing auscultation results between individuals and quantifying the sound change by reviewing historical records are difficult tasks. Third, prolonged continuous monitoring of respiratory sound is almost impractical.
To overcome the aforementioned limitations, computerized methods for respiratory sound recording and analysis based on traditional signal processing and machine learning have been proposed and reviewed (Gurung et al., 2011; Huq & Moussavi, 2012; Mesaros et al., 2016; Pasterkamp et al., 1997; Pramono et al., 2017). With the advent of the deep learning era, studies have developed novel deep learning-based methods for respiratory sound analysis. However, many such studies have focused only on distinguishing healthy participants from participants with respiratory disorders (Chambres et al., 2018; Demir et al., 2020; Hosseini et al., 2020; Perna & Tagarelli, 2019; Pham et al., 2020) and distinguishing various types of normal breathing sounds from adventitious sounds (Acharya & Basu, 2020; Aykanat et al., 2017; Bardou et al., 2018; Chen et al., 2019; Grzywalski et al., 2019; Kochetov et al., 2018; Li et al., 2016). Only a few studies (Hsiao et al., 2020; Jácome et al., 2019; Liu et al., 2017; Messner et al., 2018) have explored the use of deep learning for detecting breath phases and adventitious sounds. Moreover, most previous studies on computerized lung sound analysis have been limited by insufficient data. As of this writing, the largest reported respiratory sound database is the ICBHI 2017 Challenge database (Rocha et al., 2017), which comprises 6,898 breath cycles and 10,775 events of wheezes and crackles acquired from 126 individuals.
Data size plays a major role in the creation of a robust and accurate deep learning-based respiratory sound analysis algorithm (Hestness et al., 2017; Sun et al., 2017). Accordingly, the first aim of the present study was to establish a large and open-access respiratory sound database for training such algorithms for the detection of breath phases and adventitious sounds, mainly focusing on lung sounds. The second aim was to conduct a benchmark test on the established lung sound database by using eight recurrent neural network (RNN)-based models. RNNs (Elman, 1990) are effective for time-series analysis; long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) and gated recurrent unit (GRU; Cho et al., 2014) networks, which are two RNN variants, exhibit superior performance to the original RNN model. However, whether LSTM models are superior to GRU models (and vice versa) in many applications, particularly in respiratory sound analysis, is inconclusive. Bidirectional RNN models (Graves & Schmidhuber, 2005; Schuster & Paliwal, 1997) can transfer not only past information to the future but also future information to the past; these models consistently exhibit superior performance to unidirectional RNN models in many applications (Khandelwal et al., 2016; Linchuan Li et al., 2016; Parascandolo et al., 2016) as well as in breath phase and crackle detection (Messner et al., 2018). However, whether bidirectional RNN models outperform unidirectional RNN models in CAS detection has yet to be determined. Furthermore, the convolutional neural network (CNN)-RNN structure has been proven to be suitable for heart sound analysis (Deng et al., 2020), lung sound analysis (Acharya & Basu, 2020), and other tasks (Linchuan Li et al., 2016; Zhao et al., 2018). Nevertheless, the application of the CNN-RNN structure in respiratory sound detection has yet to be fully investigated.
Benchmarking can demonstrate the reliability and quality of a database; it can also be applied to investigate the performance of the RNN variants in respiratory sound analysis.
In summary, the aims of this study are outlined as follows:
◼ Establish the largest open-access lung sound database as of this writing, HF_Lung_V1 (https://gitlab.com/techsupportHF/HF_Lung_V1).
◼ Conduct a performance comparison between LSTM and GRU models, between unidirectional and bidirectional models, and between models with and without a CNN in breath phase and adventitious sound detection based on lung sound data.
◼ Discuss factors influencing model performance.

Data sources and patients
The lung sound database was established using two sources. The first source was a database (TSECC) used in a datathon in Taiwan; the second source comprised lung sound recordings collected from 18 residents of a respiratory care ward (RCW) or respiratory care center (RCC). The study was performed in accordance with the Helsinki Declaration and its later amendments or comparable ethical standards.
All patients were Taiwanese and older than 20 years. Descriptive statistics regarding the patients' demographic data, major diagnoses, and comorbidities are presented in Table 1; however, this information is missing for the patients in the TSECC database. Moreover, all 18 RCW/RCC residents were under mechanical ventilation.

Sound recording
Breathing lung sounds were recorded using two devices: (1) a commercial electronic stethoscope (Littmann 3200, 3M, Saint Paul, Minnesota, USA) and (2) a customized multichannel acoustic recording device (HF-Type-1) that supports the connection of eight electret microphones.
The signals collected by the HF-Type-1 device were transmitted to a tablet (Surface Pro 6, Microsoft, Redmond, Washington, USA; Fig. 1). Breathing lung sounds were collected at the eight locations L1 to L8. All lung sounds in the TSECC database were collected using the Littmann 3200 device only, with 15.8-s recordings obtained sequentially from L1 to L8 (Fig. 2b; Littmann 3200). One round of recording with the Littmann 3200 device entails a recording of lung sounds from L1 to L8.
The TSECC database was composed of data obtained from one to three rounds of recording with the Littmann 3200 device for each patient.
We recorded the lung sounds of the 18 RCW/RCC residents by using both the Littmann 3200 device and the HF-Type-1 device. The Littmann 3200 recording protocol was in accordance with that used in the TSECC database, except that data from four to five rounds of lung sound recording were collected instead. The HF-Type-1 device was used to record breath sounds at L1, L2, L4, L5, L6, and L8. One round of recording with the HF-Type-1 device entails a synchronous and continuous recording of breath sounds for 30 min ( Fig. 2b; HF-Type-1). However, the recording with the HF-Type-1 device was occasionally interrupted; in this case, the recording duration was <30 min.
Voluntary deep breathing was not mandated during the recording of lung sounds. The statistics of the recordings are listed in Table 2. With the Littmann 3200 device, sounds were recorded sequentially from L1 to L8, and all recordings were subsequently truncated to 15 s. With the HF-Type-1 device, sounds at L1, L2, L4, L5, L6, and L8 were recorded simultaneously, and each 2-min signal was subsequently truncated to generate new 15-s audio files.

Audio file truncation
In this study, the standard duration of an audio signal used for inhalation, exhalation, and adventitious sound detection was 15 s. This duration was selected because a 15-s signal contains at least three complete breath cycles, which are adequate for a clinician to reach a clinical conclusion.

Table 2
Statistics of recordings and labels of the HF_Lung_V1 database.

Because each audio file generated by the Littmann 3200 device had a length of 15.8 s, we cropped the final 0.8 s from each file (Fig. 2b; Littmann 3200). Moreover, we used only the first 15 s of each 2-min signal in the audio files generated by the HF-Type-1 device (Fig. 2b; HF-Type-1). Table 2 presents the number of truncated 15-s recordings and their total duration.

Data labeling
Because the data in the TSECC database contain only classification labels indicating whether a CAS or DAS exists in a recording, we labeled all sound recordings at the event level. Two board-certified respiratory therapists (NJL and YLW) and one board-certified nurse (WLT), with 8, 3, and 13 years of clinical experience, respectively, were recruited to label the start and end points of inhalation (I), exhalation (E), wheeze (W), stridor (S), rhonchus (R), and DAS (D) events in the recordings. They labeled the sound events by listening to the recorded breath sounds while simultaneously observing the corresponding patterns on a spectrogram, using customized labeling software (Hsu et al., 2021). The labelers were asked not to label sound events if they could not clearly identify the corresponding sound or if an incomplete event at the beginning or end of an audio file caused difficulty in identification. BFH held regular meetings to ensure that the labelers maintained good agreement on the labeling criteria, judged by the mean pseudo-κ value (Jácome et al., 2019) on a few samples. When developing the artificial intelligence (AI) detection models, we combined the W, S, and R labels to form CAS labels (C). Moreover, the D labels comprised only crackles, which were not differentiated into coarse or fine crackles. The labelers were asked to label the period containing crackles rather than each single explosive sound (generally shorter than 25 ms) of a crackle. Each recording was annotated by only one labeler; thus, the labels do not represent perfect ground truth.
However, we used the labels as ground-truth labels for model training, validation, and testing. The statistics of the labels are listed in Table 2.

Framework
The inhalation, exhalation, CAS, and DAS detection framework developed in this study is displayed in Fig. 3. The prominent advantage of the research framework is its modular design.
Specifically, each unit of the framework can be tested separately, and the algorithms in different parts of the framework can be modified to achieve optimal overall performance. Moreover, the output of some blocks can be used for multiple purposes. For instance, the spectrogram generated by the preprocessing block can be used as the input of a model or for visualization in the user interface for real-time monitoring.

Preprocessing
We processed the lung sound recordings at a sampling frequency of 4 kHz. First, to eliminate the 60-Hz electrical interference and part of the heart sound noise, we applied a high-pass filter with a filter order of 10 and a cut-off frequency of 80 Hz. The filtered signals were then processed using the short-time Fourier transform (STFT) with a Hanning window of size 256 and a hop length of 64; no additional zero-padding was applied. Thus, a 15-s sound signal was transformed into a corresponding spectrogram of size 938 × 129. To obtain the spectral information of the lung sounds, we extracted the following features (Chamberlain et al., 2016; Messner et al., 2018):
◼ Spectrogram: We extracted 129-bin log-magnitude spectrograms.
◼ Mel frequency cepstral coefficients (MFCCs): We extracted 20 static coefficients, 20 delta coefficients (Δ), and 20 acceleration coefficients (Δ²). We used 40 mel bands within a frequency range of 0-4,000 Hz. The frame width used to calculate the delta and acceleration coefficients was set to 9, resulting in a 60-bin vector per frame.
After extracting the aforementioned features, we concatenated them to form a 938 × 193 feature matrix. Subsequently, we conducted min-max normalization on each feature. The values of the normalized features ranged between 0 and 1.
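The spectrogram branch of this preprocessing chain can be sketched as follows. This is a minimal numpy/scipy illustration rather than the authors' implementation: the Butterworth filter design, the centered framing convention, and the epsilon guards are assumptions. Concatenating the 60-bin MFCC features (e.g., computed with librosa) would extend the 938 × 129 spectrogram to the 938 × 193 feature matrix described above.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS, N_FFT, HOP = 4000, 256, 64  # 4 kHz sampling, Hanning 256, hop 64

def preprocess(signal):
    """High-pass filter then min-max-normalized log-magnitude spectrogram."""
    # order-10 high-pass at 80 Hz (Butterworth design is an assumption)
    sos = butter(10, 80, btype="highpass", fs=FS, output="sos")
    x = sosfiltfilt(sos, signal)
    # centered framing: pad N_FFT//2 on each side, then slide with hop 64
    x = np.pad(x, N_FFT // 2)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    window = np.hanning(N_FFT)
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    log_spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    # min-max normalization per frequency bin, mapping features to [0, 1]
    lo, hi = log_spec.min(axis=0), log_spec.max(axis=0)
    return (log_spec - lo) / (hi - lo + 1e-10)

feat = preprocess(np.random.randn(15 * FS))
print(feat.shape)  # (938, 129)
```

A 15-s signal (60,000 samples) with this centered framing yields exactly the 938 frames and 129 frequency bins stated above.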

Deep learning models
We investigated the performance of eight RNN-based models, namely the LSTM, GRU, bidirectional LSTM (BiLSTM), bidirectional GRU (BiGRU), CNN-LSTM, CNN-GRU, CNN-BiLSTM, and CNN-BiGRU models. We used Adam as the optimizer in the benchmark models, and we set the initial learning rate to 0.0001 with a step decay (0.2×) when the validation loss did not decrease for 10 epochs. The learning process stopped when no improvement occurred over 50 consecutive epochs.
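The learning-rate policy above can be captured in a small framework-agnostic helper; this is a sketch with illustrative names, and in practice a library callback (such as a reduce-on-plateau scheduler plus early stopping) would typically be used instead.

```python
class PlateauScheduler:
    """Step-decay LR schedule with early stopping, mirroring the policy
    described in the text: initial LR 1e-4, 0.2x decay after every 10
    epochs without validation-loss improvement, stop after 50 stagnant
    epochs. Names and interface are illustrative assumptions."""

    def __init__(self, lr=1e-4, decay=0.2, decay_patience=10, stop_patience=50):
        self.lr, self.decay = lr, decay
        self.decay_patience, self.stop_patience = decay_patience, stop_patience
        self.best = float("inf")   # best validation loss seen so far
        self.stagnant = 0          # epochs since last improvement

    def step(self, val_loss):
        """Call once per epoch; returns (lr, keep_training)."""
        if val_loss < self.best:
            self.best, self.stagnant = val_loss, 0
        else:
            self.stagnant += 1
            if self.stagnant % self.decay_patience == 0:
                self.lr *= self.decay  # step decay on plateau
        return self.lr, self.stagnant < self.stop_patience
```

For example, after 10 consecutive epochs without improvement the learning rate drops from 1e-4 to 2e-5, and after 50 such epochs training stops.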

Postprocessing
The prediction vectors obtained using the adopted models can be further processed for different purposes. For example, we can transform the prediction result from frames to time for real-time monitoring. The breathing duration of most humans lies within a certain range; we considered this fact in our study. Accordingly, when the prediction results obtained using the models indicated that two consecutive inhalation events occurred within a very small interval, we checked the continuity of these two events and decided whether to merge them, as illustrated in the bottom panel of Fig. 4a.
For example, when the interval between the ith and jth events was smaller than T s, we computed the difference in frequency between their energy peaks (|f_i − f_j|). Subsequently, if the difference was below a given threshold P, the two events were merged into a single event. In the experiment, T was set to 0.5 s, and P was set to 25 Hz. After the merging process, we further assessed whether a burst event existed. If the duration of an event was shorter than 0.05 s, the event was deleted.
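The merging and burst-removal steps can be sketched as follows; the event and peak-frequency representations are assumptions about the interface, not the paper's code.

```python
def postprocess(events, peak_freqs, T=0.5, P=25.0, min_dur=0.05):
    """Merge near-adjacent events and drop bursts, as described above.

    events      -- list of (start, end) times in seconds, sorted by start
    peak_freqs  -- energy-peak frequency (Hz) of each event
    Returns the cleaned list of (start, end) events.
    """
    merged, freqs = [], []
    for (s, e), f in zip(events, peak_freqs):
        # merge when the gap is < T and the peak-frequency difference < P;
        # the merged event keeps the earlier event's peak frequency
        if merged and s - merged[-1][1] < T and abs(f - freqs[-1]) < P:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
            freqs.append(f)
    # delete burst events shorter than min_dur
    return [(s, e) for s, e in merged if e - s >= min_dur]
```

For instance, two inhalation events at 0.0-1.0 s and 1.2-2.0 s with peaks 100 Hz and 110 Hz are merged into one 0.0-2.0 s event, while a 0.02-s burst is removed.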

Dataset arrangement and cross-validation
We adopted fivefold cross-validation in the training dataset to train and validate the models.
Moreover, we used an independent testing dataset to test the performance of the trained models.
According to our preliminary experience, the acoustic patterns of the breath sounds collected from one patient at different auscultation locations or between short intervals had many similarities. To avoid potential data leakage caused by our methods of collecting and truncating the breath sound signals, we assigned all truncated recordings collected on the same day to only one of the training, validation, or testing datasets; this is because these recordings might have been collected from the same patient within a short period. The statistics of the datasets are listed in Table 3. We used only audio files containing CASs and DASs to train and test their corresponding detection models.
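The day-level grouping used to avoid leakage can be illustrated with a simple split helper. This is a sketch under assumptions: the 80/20 fraction, seed, and function name are arbitrary, and the study actually used fivefold cross-validation plus an independent test set rather than a single split.

```python
import random

def group_split(recordings, train_frac=0.8, seed=0):
    """Assign whole recording days to train or test so that no day is
    split across sets (preventing leakage from same-patient recordings).

    recordings -- list of (day_id, recording_id) pairs
    Returns (train, test) lists of the same pairs.
    """
    days = sorted({day for day, _ in recordings})
    rng = random.Random(seed)
    rng.shuffle(days)
    cut = int(len(days) * train_frac)
    train_days = set(days[:cut])
    train = [r for r in recordings if r[0] in train_days]
    test = [r for r in recordings if r[0] not in train_days]
    return train, test

recs = [(day, i) for day in range(10) for i in range(5)]
train, test = group_split(recs)
```

Every recording from a given day lands in exactly one set, so recordings likely made from the same patient within a short period never straddle the train/test boundary.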

Table 3
Statistics of the datasets and labels of the HF_Lung_V1 database.

Pramono et al. (2017) clearly defined classification and detection at the segment, event, and recording levels. In this study, we performed two tasks. The first task involved detection at the segment level. The acoustic signal of each lung sound recording was transformed into a spectrogram, whose temporal resolution depended on the window size and overlap ratio of the STFT. These parameters were fixed such that each spectrogram was a matrix of size 938 × 129. Thus, each recording contained 938 time segments (time frames), and each time segment was automatically labeled (Fig. 5b) according to the ground-truth event labels (Fig. 5a) assigned by the labelers. The output of the prediction process was a sequential prediction matrix (Fig. 5c).

The second task entailed event detection at the recording level. After completing the sequential prediction (Fig. 5c), we assembled the time segments associated with the same label into a corresponding event (Fig. 5e) and derived the start and end times of each assembled event. The Jaccard index (JI; Jácome et al., 2019) was used to determine whether an AI inference result correctly matched the ground-truth event. For an assembled event to be designated a true positive (TP) event (orange horizontal bars in Fig. 5e), the corresponding JI value had to be greater than 0.5. If the JI was between 0 and 0.5, the assembled event was designated a false negative (FN) event (yellow horizontal bars in Fig. 5e), and if the JI was 0, the event was designated a false positive (FP) event (black horizontal bars in Fig. 5e). A true negative (TN) event cannot be defined in the event detection task.
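The JI-based event matching can be sketched as follows; the interval representation and function names are illustrative assumptions.

```python
def jaccard(a, b):
    """Temporal Jaccard index of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def classify_event(pred, truths):
    """Label a predicted event TP/FN/FP by its best JI against the
    ground-truth events, using the thresholds described in the text:
    JI > 0.5 -> TP, 0 < JI <= 0.5 -> FN, JI == 0 -> FP."""
    best = max((jaccard(pred, t) for t in truths), default=0.0)
    if best > 0.5:
        return "TP"
    return "FN" if best > 0.0 else "FP"
```

For example, a predicted event at 0.0-1.0 s that barely overlaps a ground-truth event at 0.9-2.0 s has JI = 0.05 and is counted as FN, whereas a prediction with no overlap at all is counted as FP.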

Task definition and evaluation metrics
The performance of the models was evaluated using the F1 score; segment detection was additionally evaluated using the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). In addition, the mean absolute percentage error (MAPE) of event detection was derived. The accuracy, positive predictive value (PPV), sensitivity, specificity, and F1 score of the models are presented in Appendix A.

Hardware and software
We trained the baseline models on an Ubuntu 18.04 server.

Table 4 presents the F1 scores used to compare the eight LSTM- and GRU-based models. When a CNN was not added, the GRU models outperformed the LSTM models by 0.7%-9.5% in terms of F1 scores. However, among the models with a CNN, neither the CNN-GRU and CNN-BiGRU models nor the CNN-LSTM and CNN-BiLSTM models consistently achieved higher F1 scores.

LSTM versus GRU models
According to the ROC curves presented in Fig. 6a-d, the GRU-based models outperformed the LSTM-based models in all compared pairs, except for one pair, in terms of DAS segment detection (AUC of 0.891 for CNN-BiLSTM vs 0.889 for CNN-BiGRU).

Table 4
Comparison of F1 scores between LSTM-based models and GRU-based models. The bold values indicate the higher F1 score between the compared pairs of models.

Unidirectional versus bidirectional models
As presented in Table 5, the bidirectional models outperformed their unidirectional counterparts in all the defined tasks by 0.4%-9.8% in terms of the F1 scores, even when the bidirectional models had fewer trainable parameters after model adjustment.

Models with CNN versus those without CNN
According to Table 6, the models with a CNN outperformed those without a CNN in 26 of the 32 compared pairs.
Table 5

Comparison of F1 scores between the unidirectional and bidirectional models. The bold values indicate the higher F1 score between the compared pairs of models. SIMP indicates that the number of trainable parameters was adjusted.

The models with a CNN exhibited higher AUC values than did those without a CNN (Fig. 6a-d), except that BiGRU had a higher AUC value than did CNN-BiGRU in inhalation detection (0.963 vs 0.961), GRU had a higher AUC value than did CNN-GRU in exhalation detection (0.886 vs 0.883), and BiGRU had a higher AUC value than did CNN-BiGRU in exhalation detection (0.911 vs 0.899).
Moreover, compared with the LSTM, GRU, BiLSTM, and BiGRU models, the CNN-LSTM, CNN-GRU, CNN-BiLSTM, and CNN-BiGRU models exhibited flatter and lower MAPE curves over a wide range of threshold values in all event detection tasks (Fig. 7a-d).

Table 6
Comparison of F1 scores between models without and with a CNN. The bold values indicate the higher F1 score between the compared pairs of models.

Benchmark results
According to the F1 scores presented in Table 4, among models without a CNN, the GRU and BiGRU models consistently outperformed the LSTM and BiLSTM models in all defined tasks.
However, the GRU-based models did not have superior F1 scores among models with a CNN.
Regarding the ROC curves and AUC values (Fig. 6a-d), the GRU-based models outperformed the LSTM-based models in all but one task. Accordingly, we conclude that GRU-based models perform slightly better than LSTM-based models in lung sound analysis. Previous studies have also compared LSTM- and GRU-based models (Chung et al., 2014; Khandelwal et al., 2016; Shewalkar, 2018). Although no concrete conclusion can be drawn regarding whether LSTM-based models are superior to GRU-based models (or vice versa) in general, GRU-based models have been reported to outperform LSTM-based models in terms of computation time (Khandelwal et al., 2016; Shewalkar, 2018).
As presented in Table 5, the bidirectional models outperformed their unidirectional counterparts in all defined tasks, a finding that is consistent with several previously obtained results (Graves & Schmidhuber, 2005;Khandelwal et al., 2016;Messner et al., 2018;Parascandolo et al., 2016).
A CNN can facilitate the extraction of useful features and enhance the prediction accuracy of RNN-based models. The benefits engendered by a CNN are particularly vital in CAS detection. For the models with a CNN, the F1 score improvement ranged from 26.0% to 30.3% and the AUC improvement ranged from 0.067 to 0.089 in the CAS detection tasks. Accordingly, we can infer that considerable information used in CAS detection resides in the local positional arrangement of the features. Thus, a two-dimensional CNN facilitates the extraction of the associated information.
Notably, CNN-induced improvements in model performance in the inhalation, exhalation, and DAS detection tasks were not as high as those observed in the CAS detection tasks. The MAPE curves ( Fig. 7a-d) reveal that a model with a CNN has more consistent predictions over various threshold values.
One reason for the imperfect performance was discussed in our previous study (Hsiao et al., 2020). Another reason is that an exhalation label is not always available following an inhalation label in our data. Finally, we did not specifically control the recording conditions; for example, we did not ask patients to perform voluntary deep breathing or keep ambient noise down. The factors influencing model performance are discussed further in the next section.

Factors influencing model performance
The benchmark performance of the proposed models may have been influenced by the following factors: (1) unusual breathing patterns; (2) imbalanced data; (3) low signal-to-noise ratio (SNR); (4) noisy labels, including class and attribute noise, in the database; and (5) sound overlapping. Fig. 8 displays most of the breath patterns present in the HF_Lung_V1 database. Fig. 8a illustrates the general pattern of a breath cycle in the lung sounds, in which the ratio of inhalation to exhalation durations is approximately 2:1 and an expiratory pause is noted (Pramono et al., 2017; Sarkar et al., 2015). Fig. 8b presents a frequent condition under which an exhalation is not completely heard by the labelers. However, because we did not ask the subjects to breathe in a controlled manner during recording, many unusual breath patterns might have been recorded, such as patterns caused by shallow breathing, fast breathing, and apnea as well as those caused by double triggering of the ventilator (Thille et al., 2006) and air trapping (Blanch et al., 2005; Miller et al., 2014). These unusual breathing patterns might confuse the labeling and learning processes and result in poor testing results.
The developed database contains imbalanced numbers of inhalation and exhalation labels (34,095 and 18,349, respectively) because not every exhalation was heard and labeled. In addition, the proposed models may possess the capability of learning the rhythmic rise and fall of breathing signals but not the capability of learning acoustic or texture features that can distinguish an inhalation from an exhalation. This may explain the models' poor performance in exhalation detection. However, these models are suitable for respiratory rate estimation and apnea detection as long as appropriate inhalation detection is achieved. Furthermore, for all labels, the summation of the event duration was smaller than that of the background signal duration (at a ratio of approximately 1:2.5 to 1:7). This phenomenon can be regarded as foreground-background class imbalance (Oksuz et al., 2020) and will be addressed in future studies.
Most of the sounds in the established database were not recorded while the patients performed deep breathing; thus, the signal quality was not maximized. However, training models with such nonoptimal data increases their adaptability to real-world scenarios. Moreover, the SNR may be reduced by noise, such as human voices; music; sounds from bedside monitors, televisions, air conditioners, fans, and radios; sounds generated by mechanical ventilators; electrical noise generated by touching or moving the parts of acoustic sensors; and friction sounds generated by the rubbing of two surfaces together (e.g., rubbing clothes against the skin). A poor SNR of audio signals can lead to difficulties in labeling and prediction tasks. The features of some noise types are considerably similar to those of adventitious sounds. The poor performance of the proposed models in CAS detection can be partly attributed to the noisy environment in which the lung sounds were recorded. In particular, the sounds generated by ventilators caused numerous FP events in the CAS detection tasks. Thus, additional effort is required to develop a superior preprocessing algorithm that can filter out influential noise or to identify a strategy to ensure that models focus on learning the correct CAS features. Furthermore, the integration of active noise-canceling technology (Wu et al., 2020) or noise suppression technology (Emmanouilidou et al., 2017) into respiratory sound monitors can help reduce the noise in auscultatory signals.
The sound recordings in the HF_Lung_V1 database were labeled by only one labeler; thus, some noisy labels, including class and attribute noise, may exist in the database (Zhu & Wu, 2004).
These noisy labels are attributable to (1) differences in the labelers' hearing abilities, which can cause differences in the labeled duration; (2) the absence of clear criteria for differentiating between target and confusing events; (3) individual human errors; (4) a tendency to not label events located close to the beginning and end of a recording; and (5) confusion caused by unusual breath patterns and poor SNRs. However, deep learning models exhibit high robustness to noisy labels (Rolnick et al., 2017).
Accordingly, we are currently working toward establishing better ground-truth labels.
Breathing generates CASs and DASs under abnormal respiratory conditions. This means that the breathing sound, CAS, and DAS might overlap with one another during the same period. This sound overlapping, along with the data imbalance, leads the CAS and DAS detection models to learn to read the rise and fall of the breathing energy and to falsely identify an inhalation or exhalation as a CAS or DAS. Such FP detections were observed in our benchmark results. In the future, strategies must be adopted to address the problem of sound overlap.

Conclusion
We established a large open-access lung sound database, HF_Lung_V1. We also investigated the performance of eight RNN-based models in inhalation, exhalation, CAS, and DAS detection on the HF_Lung_V1 database. We determined that the bidirectional models outperformed the unidirectional models in lung sound analysis. Furthermore, the addition of a CNN to these models further improved their performance.
Future studies can develop more accurate respiratory sound analysis models. First, highly accurate ground-truth labels should be established. Second, researchers should investigate the performance of RNN-based models containing state-of-the-art convolutional layers. Third, regional CNN variants can be adopted in lung sound analysis if the labels are expanded to two-dimensional bounding boxes (Jácome et al., 2019). Fourth, wavelet-based approaches, empirical mode decomposition, and other methods that can extract different features should be investigated (Pramono et al., 2017;Pramono et al., 2019). Finally, respiratory sound monitors should be equipped with the capability of tracheal breath sound analysis (Wu et al., 2020).

Conflicts of Interest
The authors declare that they have no conflicts of interest relevant to this research.