
Scene-dependent sound event detection based on multitask learning with deformable large kernel attention convolution

  • Haiyue Zhang,

    Roles Conceptualization, Investigation, Methodology, Writing – original draft, Writing – review & editing

    Affiliation School of Information Science and Technology, North China University of Technology, Beijing, China

  • Menglong Wu,

    Roles Software, Supervision, Writing – review & editing

    wumenglong@126.com

    Affiliation School of Information Science and Technology, North China University of Technology, Beijing, China

  • Xichang Cai,

    Roles Data curation, Software

    Affiliation School of Information Science and Technology, North China University of Technology, Beijing, China

  • Wenkai Liu

    Roles Supervision, Visualization, Writing – review & editing

    Affiliation School of Information Science and Technology, North China University of Technology, Beijing, China

Abstract

Sound event detection (SED) and acoustic scene classification (ASC) are closely related tasks in environmental sound analysis. Given the interrelationship between sound events and scenes, some previous studies have proposed using multitask learning (MTL) to jointly analyze SED and ASC. However, these MTL methods are generally based on hard parameter-sharing, which exchanges sound event and scene features only through the low-level network. Such approaches struggle to balance the complex interrelationships between SED and ASC, and limit the feature sharing and information flow between tasks during training. To address these challenges, this study proposes a novel multitask network based on a residual multi-level feature extraction (R-MFE) framework, which jointly analyzes the SED and ASC tasks and utilizes scene information to improve the performance of sound event detection. In addition, this study designs the D-LKAC attention module, which combines the advantages of self-attention mechanisms and convolution to capture global and local features. To further enhance SED performance, this study introduces the MS-conv module, which captures audio details from multiple dimensions. The proposed MTL method is evaluated on the TUT Acoustic Scenes 2016/2017 and TUT Sound Events 2016/2017 datasets. Experimental results indicate that our approach outperforms state-of-the-art techniques, improving the F-score by 6.44 percentage points.

1. Introduction

In recent years, environmental sound analysis has received increasing attention. It has shown great potential in various application scenarios such as life recording systems [1], surveillance systems [2,3], abnormal detection systems [4], and biomonitoring systems. Sound event detection (SED) [5] and acoustic scene classification (ASC) [6] are two key tasks in the field of environmental sound analysis. SED aims to detect and categorize events in audio recordings, including "keystrokes," "car driving," or "alarm sounds"; the task requires both identifying these events and estimating their start and end times. Meanwhile, ASC aims to distinguish and classify scene category information from audio recordings, covering typical environments such as "office," "coffee shop," or "grocery store." These scenes often consist of a complex mix of multiple sound events, forming an intricate and realistic acoustic environment. Recent advances in deep learning have gradually made it the mainstream method for solving complex problems of this kind [7–13].

In fact, acoustic scenes and sound events are closely related: many sound events are highly correlated with certain acoustic scenes. For example, sound events such as "keyboard typing," "mouse-clicking," and "people talking" occur far more commonly in acoustic scenes like "office" than in scenes like "park." Scene information can therefore be used to exclude sound events that are unlikely to occur in a given scene, thereby improving the accuracy of sound event detection. Given the inherent relationship between sound events and scenes, some studies have proposed using multitask learning (MTL) [14] to jointly analyze the SED and ASC tasks. Multitask learning has shown broad potential in many fields [15–20] and has gradually become a mainstream method for environmental sound analysis [21–25]. For instance, Liang et al. [23] introduced a novel weakly supervised framework that combines audio tagging (AT) and SED, demonstrating strong performance on both tasks. Jung et al. [24] proposed a comprehensive system that simultaneously addresses AT, ASC, and SED. Furthermore, Hou et al. [25] introduced a relation matrix for the joint analysis of SED and ASC, using probabilistic relationships to improve the performance of ASC.

Recently, Imoto et al. [26,27] utilized sound event information to enhance ASC performance within a Bayesian generative model. Afterward, neural networks based on the MTL approach [28–31] were presented to jointly analyze SED and ASC. Methods combining MTL with soft scene labels [32,33] were proposed to enhance the performance of the SED task; unlike conventional MTL methods, this approach allows more accurate modeling of the connections between specific scenes and events. In addition, Liang et al. [34] explored an attention-based MTL method to extract and establish shared and independent representations of scenes and events. Komatsu et al. [35] used a unidirectional conditional loss from scene to event to combine scene and event information. Tsubaki et al. [36] proposed an MTL framework with weak labels to jointly analyze SED and ASC. Nada et al. [37] implemented dynamic weight adjustment using a multi-focal loss and then introduced a dynamic weight averaging strategy [38]. Subsequently, Hou et al. [39] suggested an approach for the collaborative modeling of scenes and events; this method makes full use of the connection between sound events and scenes, enabling parallel processing of the SED and ASC tasks. Although the MTL techniques mentioned above can improve performance, they generally employ hard parameter-sharing strategies. Such an approach relies only on the basic feature-sharing mechanism of the MTL framework to extract common features, neglecting the continuous interaction and information flow between tasks, and often struggles to effectively balance the complex relationships between them.

To address these issues, inspired by [40,41], this study proposes a novel multitask network designed to model the relationship between acoustic scenes and sound events, while leveraging scene information to enhance SED performance. Different from earlier multitask networks based on hard parameter-sharing [42–44], the proposed MTL method distinctly separates shared and task-specific experts for SED and ASC, which reduces the interference of harmful parameters between tasks and alleviates the performance conflicts caused by task diversity in conventional multitask networks. In addition, the introduction of gated networks allows for the fusion of more abstract representations, enabling dynamic adjustment of parameter weights between shared and task-specific information. Furthermore, the network employs a multi-level feature extraction strategy to capture audio features from multiple dimensions, which further improves the efficiency of information transfer.

The main contributions of this study are as follows:

  1. This study proposes a multitask network based on the R-MFE framework. The model is designed to model the relationship between acoustic scenes and sound events, and leverages scene information to enhance the performance of sound event detection.
  2. This study designs a D-LKAC module. This module combines the advantages of convolution and attention mechanisms to effectively capture both global and local features in audio sequences. Moreover, compared to traditional attention mechanisms, D-LKAC can dynamically focus on adjacent time-frequency bands, capturing richer feature information.
  3. To further improve the performance of the SED task, this study introduces the MS-conv module, which captures audio features more comprehensively. The proposed MTL method is compared with existing joint analysis methods for SED and ASC, and the results show that the proposed method outperforms the current state-of-the-art techniques.

This paper is structured as follows: Section 2 discusses conventional methods. Section 3 provides a detailed description of the proposed method. Section 4 discusses the experimental setup and findings. The paper concludes with Section 5.

2. Conventional methods

In many previous studies, SED and ASC have typically been analyzed separately. However, sound events and scenes are closely linked and often co-occur, as shown in Table 1. Acoustic scene information can therefore benefit sound event detection. Based on this idea, some researchers have proposed joint analyses of SED and ASC [29–32] using MTL, as shown in Fig 1. The architecture of these methods includes shared layers and independent branch networks for each task.

Table 1. The presence/absence relationship of sound events and scenes.

https://doi.org/10.1371/journal.pone.0322002.t001

The shared layer uses a convolutional neural network (CNN), while the task-specific branches for event detection and scene classification employ bidirectional gated recurrent unit (BiGRU) and CNN, respectively.

The loss function in the conventional MTL method usually utilizes a weighted sum approach to train multi-task models for SED and ASC. This study adopts the same approach. For the ASC task, the model parameters are optimized using the cross-entropy (CE) loss function. The loss function is defined as follows:

L_{ASC} = -\sum_{n=1}^{N} s_n \log \hat{s}_n    (1)

where \hat{s}_n is the network output, N is the number of acoustic scene categories, and s_n represents the target scene labels.

On the other hand, because sound events may overlap along the time axis, this study uses the binary cross-entropy (BCE) loss function to train the model parameters for multi-label classification. The loss function is expressed as:

L_{SED} = -\sum_{t=1}^{T} \sum_{m=1}^{M} \left[ e_{t,m} \log \hat{e}_{t,m} + (1 - e_{t,m}) \log (1 - \hat{e}_{t,m}) \right]    (2)

where \hat{e}_{t,m} represents the network output, e_{t,m} corresponds to the target event labels, T denotes the number of time frames, and M is the count of sound event categories.

In this study, the loss functions of SED and ASC are linearly combined with constant weights \lambda_{SED} and \lambda_{ASC}. During the training phase, the weights of the two loss functions are optimized to improve classification accuracy. The joint loss function L(θ) is represented as follows:

L(\theta) = \lambda_{SED} L_{SED} + \lambda_{ASC} L_{ASC}    (3)
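As an illustration, the weighted combination of Eqs. (1)–(3) can be sketched in a few lines of NumPy; the label arrays, shapes, and probability values below are hypothetical toy inputs, not taken from the paper's experimental setup:

```python
import numpy as np

def asc_ce_loss(s_hat, s):
    """Eq. (1): cross-entropy over N scene classes; s is a one-hot target."""
    return -np.sum(s * np.log(s_hat + 1e-12))

def sed_bce_loss(e_hat, e):
    """Eq. (2): binary cross-entropy over T frames x M events (multi-label)."""
    return -np.sum(e * np.log(e_hat + 1e-12)
                   + (1 - e) * np.log(1 - e_hat + 1e-12))

def joint_loss(s_hat, s, e_hat, e, lam_sed=0.9, lam_asc=0.1):
    """Eq. (3): weighted sum of the SED and ASC losses."""
    return lam_sed * sed_bce_loss(e_hat, e) + lam_asc * asc_ce_loss(s_hat, s)

# Toy example: N=4 scenes, T=2 frames, M=3 event classes (hypothetical values).
s = np.array([0., 1., 0., 0.])                        # target scene (one-hot)
s_hat = np.array([0.1, 0.7, 0.1, 0.1])                # predicted scene posteriors
e = np.array([[1., 0., 1.], [0., 0., 1.]])            # frame-level event labels
e_hat = np.array([[0.8, 0.2, 0.9], [0.1, 0.1, 0.7]])  # predicted activities
loss = joint_loss(s_hat, s, e_hat, e)
```

Setting λSED = 0.9 and λASC = 0.1 reproduces the weighting the experiments in Section 4 report as optimal.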

3. Proposed method

Conventional MTL methods often lack information flow between tasks and have difficulty balancing the complex interrelationships between SED and ASC. To solve this problem, this study proposes a novel MTL network, as illustrated in Fig 2. The network consists of the R-MFE framework, shared experts, and task-specific experts. In addition, the number of shared experts, task-specific experts, and expert units in the R-MFE framework can be flexibly adjusted as required.

The parameter settings in this study are based on the optimal results obtained from multiple experiments. Specifically, five shared experts were established, each comprising three layers of MS-conv, to extract feature information common to sound events and scenes. Meanwhile, five task-specific experts were configured for SED and ASC, respectively. Each task-specific expert employs the same architecture, consisting of one MS-conv layer and two D-LKAC blocks. The subsequent subsections describe each block in detail.

3.1 R-MFE framework

The R-MFE framework employs a multi-level feature extraction technique to model interactions between experts. This approach starts by extracting information from the lower-level expert networks and progressively separates task-specific parameters at higher levels. The framework includes two feature-extraction layers and task-specific tower structures for SED and ASC at the top. The bottom extraction layer processes raw audio features directly, while the second layer handles the outputs of the gating units in the lower layer. The task-specific tower networks at the top of the model consist primarily of a sequence of linear layers responsible for task categorization. Furthermore, to maximize the flexibility of knowledge sharing, residual connections were introduced to model sharing and task-specific learning separately. Each extraction layer of the R-MFE framework comprises SED task experts, ASC task experts, and shared experts responsible for information sharing. The shared and task-specific expert modules are distinctly separated, each consisting of multiple subnetworks, and the number of expert units in each module can be adjusted as a hyperparameter.

In the R-MFE framework, shared and task-specific experts selectively fuse features through a gating unit. The gating unit uses the input as a selector and employs softmax as the activation function to compute a weighted sum of the selected expert vectors, which constitutes the output of the expert networks. The gating unit is built upon a feedforward network, and its output for task z can be expressed as:

g^{z}(x) = w^{z}(x) S^{z}(x)    (4)

where x denotes the input feature, w^{z}(x) represents the weighting function, and S^{z}(x) is the matrix consisting of vectors from the shared experts and the task-specific experts for task z. The weighting vector for task z is computed by a linear transformation followed by softmax, expressed as follows:

w^{z}(x) = \mathrm{softmax}(W_{g}^{z} x)    (5)

where W_{g}^{z} \in \mathbb{R}^{(m_z + m_s) \times d} denotes the parameter matrix, with d as the dimensionality of the input features, m_z as the number of task-specific experts for task z, and m_s as the number of shared experts. Finally, the prediction result for task z is denoted as:

y^{z}(x) = t^{z}(g^{z}(x))    (6)

where t^{z} is the tower network for task z.
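The gating computation of Eqs. (4)–(6) amounts to a softmax-weighted sum of expert output vectors. A minimal NumPy sketch follows; the dimensions d, m, and the expert output size are hypothetical, chosen only to make the shapes concrete:

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D vector."""
    v = v - v.max()
    ev = np.exp(v)
    return ev / ev.sum()

def gate_output(x, S, W_g):
    """Eqs. (4)-(5): weight the (m_z + m_s) expert vectors in S by a
    softmax over a linear transform of the input x.
    x: (d,) input feature; S: (m, d_out) expert outputs; W_g: (m, d)."""
    w = softmax(W_g @ x)   # (m,) selector weights, Eq. (5)
    return w @ S           # weighted sum of expert vectors, Eq. (4)

rng = np.random.default_rng(0)
d, d_out, m = 8, 16, 5                # hypothetical sizes
x = rng.standard_normal(d)            # input feature
S = rng.standard_normal((m, d_out))   # stacked expert outputs
W_g = rng.standard_normal((m, d))     # gating parameter matrix
fused = gate_output(x, S, W_g)        # (d_out,) fed to the tower, Eq. (6)
```

Because the weights are a softmax over the input itself, each task's gate can dynamically re-balance shared versus task-specific experts per example, which is the mechanism the R-MFE framework relies on.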

3.2 Multi-scale convolution

The detailed structure of the multi-scale convolution (MS-conv) module is illustrated in Fig 3. This MS-conv module utilizes three sets of parallel convolutions to extract features from log-mel spectra. The convolutions of each scale complement each other, enhancing the model’s ability to capture diverse features. Furthermore, the traditional n × n convolutions are substituted with 1×n and n×1 convolutions to save computational time, where n denotes the convolution kernel size.

In the MS-conv block, the first set of parallel CNN branches consists of a single 1×1 convolutional layer. The second set of parallel CNN branches comprises three convolutional layers with kernel sizes of 1×1, 1×3, and 3×1, respectively. Similarly, the third set of parallel CNN branches contains three convolutional layers with kernel sizes of 1×1, 1×5, and 5×1. Each set of parallel convolutional structures undergoes batch normalization after each convolutional operation to normalize the inputs, followed by applying the ReLU activation function for nonlinear transformation. Finally, the outputs from these parallel convolution groups converge at the maxpooling layer for downsampling, which aims to reduce feature dimensions and enhance the model’s computational efficiency.
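The saving from replacing an n×n convolution with a 1×n plus n×1 pair can be checked with a quick weight count; the channel numbers below are hypothetical and biases are ignored:

```python
def conv_params(kh, kw, c_in, c_out):
    """Number of weights in a kh x kw convolution (bias ignored)."""
    return kh * kw * c_in * c_out

c_in, c_out, n = 64, 64, 5                   # hypothetical channels, n = 5
full = conv_params(n, n, c_in, c_out)        # dense n x n kernel
factored = (conv_params(1, n, c_in, c_out)   # 1 x n stage
            + conv_params(n, 1, c_out, c_out))  # n x 1 stage
# With c_in == c_out, 5x5 costs 25*c*c weights vs 10*c*c for 1x5 + 5x1.
```

For the kernel sizes used in the MS-conv branches (n = 3 and n = 5), the factorized pair uses 6/9 and 10/25 of the dense kernel's weights, respectively, which is where the stated computational saving comes from.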

3.3 Deformable large kernel attention convolution

3.3.1 Overall architecture.

Inspired by references [45–48], this study designed a deformable large kernel attention convolution (D-LKAC) block, as depicted in Fig 4. The block comprises a deformable large kernel attention (D-LKA) module, a convolution module, and residual connections. The D-LKA module is an attention mechanism combined with deformable convolution [49] and plays a central role in the block. Similar to a self-attention mechanism, it can extract global feature information, effectively overcoming the limitations of CNNs in processing global information. The convolution module includes a symmetrical pair of 1×1 convolution layers that adjust the channel count and depth-wise convolution layers that extract local feature information. This design reduces computational complexity and enhances overall efficiency.

Compared to methods relying solely on CNNs or Transformers [50], the block merges the benefits of attention and convolution, simultaneously capturing global and local features. Unlike the widely used Conformer [45] model, our approach dynamically focuses on adjacent time-frequency bands and can thus capture more diverse and richer relevant information than conventional attention mechanisms. Furthermore, the residual connections improve information flow between the network's shallow and deep layers, reducing information loss during transmission.

3.3.2 Deformable large kernel attention.

Inspired by earlier research in image classification and detection [51,52], this study designed the D-LKA module, illustrated in purple in Fig 4. The D-LKA module combines deformable convolution with depth-wise convolution, depth-wise dilation convolution, and 1×1 convolution, capturing global features at relatively low computational cost with few parameters. Specifically, the D-LKA module decomposes a standard K×K convolution into three steps. First, it gathers local information from the feature map using a (2d−1)×(2d−1) deformable depth-wise convolution (Deform-DW Conv). Second, it applies a deformable depth-wise dilation convolution (Deform-DW-D-Conv) with dilation rate d and kernel size ⌈K/d⌉×⌈K/d⌉ to expand the receptive field, thereby capturing long-range dependencies without adding parameters. Finally, it captures channel-wise relationships with a 1×1 convolution. The module output is computed as:

Attn = Conv_{1×1}(Deform-DW-D-Conv(Deform-DW-Conv(F)))    (7)

Output = Attn ⊗ F    (8)

where F is the input feature map, Attn is the attention map, and ⊗ denotes element-wise multiplication.
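Under the decomposition described above, the receptive field and per-channel weight count of the two depth-wise stages can be verified with a short script. The stride-1 composition rule rf = k1 + d·(k2 − 1) is assumed, and the deformable sampling offsets and the element-wise product with F are omitted for simplicity:

```python
import math

def lka_receptive_field(K, d):
    """Effective receptive field of a (2d-1)x(2d-1) depth-wise conv
    followed by a ceil(K/d) x ceil(K/d) depth-wise conv with dilation d."""
    k1 = 2 * d - 1
    k2 = math.ceil(K / d)
    return k1 + d * (k2 - 1)   # stride-1 composition rule

def lka_dw_weights(K, d):
    """Per-channel kernel weights of the two depth-wise stages,
    versus a dense K x K depth-wise kernel."""
    k1 = 2 * d - 1
    k2 = math.ceil(K / d)
    return k1 * k1 + k2 * k2, K * K

# e.g. K=21, d=3: a 5x5 DW conv plus a 7x7 DW-D conv (dilation 3)
# covers a 23x23 field with 25 + 49 = 74 weights per channel vs 441.
```

This is why the decomposition matches (and slightly exceeds) the coverage of a large K×K kernel at a fraction of its parameter count.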

4. Experiments

4.1 Experimental conditions

4.1.1 Dataset and evaluation metrics.

This study conducted experiments on the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016/2017 datasets [53,54]. The combined dataset contains audio recordings from four distinct acoustic environments: "city center," "home," "office," and "residential area," totaling 266 minutes. Detailed information about the dataset is available in [55]. The audio recordings cover 25 categories of sound events, each tightly linked to its corresponding acoustic environment. The presence/absence relationship between these events and scenes is detailed in Table 1, where "1" represents presence and "0" represents absence.

This study uses Macro-Fscore, Micro-Fscore, and Error Rate as evaluation metrics. The Micro-Fscore is computed by aggregating counts across all classes, whereas the Macro-Fscore is obtained by calculating the F-score for each class individually and then averaging these scores. The F-score is computed as:

F\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (9)

Precision and Recall are calculated as follows:

Precision = \frac{TP}{TP + FP}    (10)

Recall = \frac{TP}{TP + FN}    (11)

where TP, FP, and FN represent the total number of true positive, false positive, and false negative sound segments, respectively, over all time intervals and acoustic events.
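The distinction between the Micro- and Macro-Fscore can be made concrete with per-class segment counts; the counts below are hypothetical:

```python
def f_score(tp, fp, fn):
    """Eqs. (9)-(11): F-score from segment counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f(counts):
    """counts: list of (tp, fp, fn) per event class.
    Micro pools counts over classes; macro averages per-class F-scores."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f_score(tp, fp, fn)
    macro = sum(f_score(*c) for c in counts) / len(counts)
    return micro, macro

# Toy counts for three event classes (hypothetical numbers).
counts = [(8, 2, 2), (1, 3, 5), (5, 5, 5)]
micro, macro = micro_macro_f(counts)
```

Because the macro score averages per-class F-scores, it is sensitive to rare classes, which is why the Macro-Fscore gains reported in Section 4.2 indicate more balanced recognition across events.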

The Error Rate (ER) is calculated as:

ER = \frac{\sum_{k} S(k) + \sum_{k} D(k) + \sum_{k} I(k)}{\sum_{k} N(k)}    (12)

where the substitution errors S(k), deletion errors D(k), and insertion errors I(k) are computed as follows:

S(k) = \min(FN(k), FP(k))    (13)

D(k) = \max(0, FN(k) - FP(k))    (14)

I(k) = \max(0, FP(k) - FN(k))    (15)

Here, k denotes the time frame index, N(k) indicates the number of sound events present in time frame k, and FN(k) and FP(k) denote the numbers of false negative and false positive events in frame k.
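Putting Eqs. (12)–(15) together, a minimal frame-wise implementation looks like the sketch below; the event label sets are hypothetical:

```python
def error_rate(ref, hyp):
    """Eqs. (12)-(15): segment-based error rate. ref and hyp are lists of
    per-frame sets of active event labels."""
    S = D = I = N = 0
    for r, h in zip(ref, hyp):
        fn = len(r - h)        # events missed in this frame, FN(k)
        fp = len(h - r)        # spurious events in this frame, FP(k)
        S += min(fn, fp)       # substitutions, Eq. (13)
        D += max(0, fn - fp)   # deletions, Eq. (14)
        I += max(0, fp - fn)   # insertions, Eq. (15)
        N += len(r)            # active reference events, N(k)
    return (S + D + I) / N if N else 0.0

# Hypothetical 3-frame example with event labels "a", "b", "c".
ref = [{"a", "b"}, {"a"},      {"c"}]
hyp = [{"a"},      {"a", "b"}, {"b"}]
er = error_rate(ref, hyp)
```

Note that ER can exceed 1.0 when insertions outnumber the reference events, so lower is better and 0.0 indicates a perfect match.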

4.1.2 Training setup.

This study employed 64-dimensional log-mel energies as input features, calculated for each 40 ms frame with 50% overlap. All experiments were conducted on an NVIDIA RTX 3090 GPU. Dropout regularization with a rate of 0.15 was applied after each neural network layer to improve the model's generalization ability. Furthermore, this study employs an adaptive thresholding method [56], tuned for each sound event, to improve the precision of the SED task when predicting the presence of sound events. Table 2 details the hyper-parameter settings.
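For reference, a 40 ms window with 50% overlap implies a 20 ms hop. A small sketch of the resulting framing, assuming a 44.1 kHz sample rate (the sample rate is an assumption, not stated above):

```python
def frame_count(num_samples, sr, win_ms=40, overlap=0.5):
    """Number of full analysis frames for win_ms windows with the
    given fractional overlap (hop = win * (1 - overlap))."""
    win = int(sr * win_ms / 1000)
    hop = int(win * (1 - overlap))
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop

sr = 44100                               # assumed sample rate
win = int(sr * 0.040)                    # samples per 40 ms window
hop = win // 2                           # 50% overlap -> 20 ms hop
frames_per_second = frame_count(sr, sr)  # frames covering 1 s of audio
```

Each frame then yields one 64-bin log-mel vector, so the frame count fixes the time resolution T of the event labels in Eq. (2).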

4.2 Experimental results

4.2.1 Overall performances.

This study evaluates the performance of single-task and MTL approaches for SED and ASC using varying weighting coefficients. To focus on the SED task, the weight λSED for SED was set higher than λASC for ASC. Moreover, to ensure a fair comparison, we trained the SED and ASC tasks independently on networks consistent with the MTL architecture: "Single SED" refers to the network that combines the shared layers with the SED branch, while "Single ASC" refers to the network that combines the shared layers with the ASC branch.

Table 3 shows the comparison between single-task and conventional MTL methods. The results demonstrate that MTL methods significantly improve the F-scores of the SED and ASC tasks compared to single-task methods. Specifically, when λSED is set to 0.9 and λASC to 0.1, conventional MTL methods achieve increases of 2.43% in Micro-Fscore and 0.89% in Macro-Fscore on the SED task. These findings suggest that sound events and scenes mutually enhance each other; in particular, information from acoustic scenes can improve detection performance in SED tasks. However, conventional MTL methods have certain limitations compared with our proposed method.

Table 3. Performance of conventional methods for SED and ASC.

https://doi.org/10.1371/journal.pone.0322002.t003

Table 4 shows the comparison between our proposed MTL and single-task methods. The results demonstrate that our MTL method achieves optimal performance when λSED is set to 0.9 and λASC to 0.1. Compared to conventional MTL models, our method shows increases of 7.87% and 21.8% in the Micro-Fscore and Macro-Fscore for the SED task, with particularly significant improvements in Macro-Fscore. The Macro-Fscore is calculated by averaging the F-scores of the individual sound events, reflecting the model's balance in recognizing various events. These results validate the efficacy of our MTL method: unlike the conventional MTL framework, the proposed method clearly separates shared and task-specific experts, minimizing interference from harmful parameters between tasks. Furthermore, the multi-level feature extraction strategy facilitates information transfer between tasks and improves overall performance.

Table 4. Performance of the proposed method for SED and ASC.

https://doi.org/10.1371/journal.pone.0322002.t004

4.2.2 Detailed investigation of SED.

To further investigate SED performance, we evaluated the classification results for 25 types of sound events, as detailed in Table 5. The results show that our proposed MTL approach improves the F-scores for many sound events, notably "(object) banging," "car," and "people talking." The improvement is attributed to the close connection between these sound events and the specific scene "residential area," as shown in Table 1. This finding suggests that the proposed MTL method can effectively mine and exploit the strong correlations between acoustic scenes and events.

Table 5. Average segment-based F-scores for individual sound events.

https://doi.org/10.1371/journal.pone.0322002.t005

Due to data imbalance, some sound classes, such as "fan," "rustling," "squeaking," and "breathing," have relatively few training samples, resulting in limited performance improvements for these classes. To address this issue, future work will involve several enhancements. For example, data augmentation techniques such as noise addition, time stretching, and frequency shifting will be used to increase the representation of these categories in the training data. Additionally, weighted loss functions will be explored, assigning higher weights to underrepresented classes during training to improve learning for imbalanced categories.

In summary, the MTL approach surpasses single-task and conventional methods, achieving higher F-scores and reduced ER in various sound events. The findings emphasize the importance of acoustic scenes to enhance the accuracy of SED in acoustic environment analysis.

4.2.3 Ablation studies.

In this section, a series of ablation studies was conducted to evaluate the effectiveness of each module in the proposed MTL network. Table 6 shows the detailed results. Using the conventional MTL model described in Section 2 as the baseline, we sequentially evaluated the contributions of the R-MFE framework, MS-conv block, and D-LKAC block to SED performance.

Experimental results indicate that introducing the R-MFE framework enhances SED performance, increasing the Micro-Fscore by 0.51% and the Macro-Fscore by 15.45%. Notably, even without the MS-conv and D-LKAC blocks, the R-MFE framework still outperforms conventional MTL methods. This demonstrates that the framework effectively sustains ongoing interaction and information flow across tasks and more deeply explores the complex relationships between acoustic scenes and events, promoting joint optimization. With the introduction of the MS-conv block, SED performance further increased by 3%, highlighting the crucial role of MS-conv in capturing subtle features of the audio data. The addition of the D-LKAC module enhanced performance by a further 3.62%, showing its strong dynamic adaptability: the module efficiently captures both global and local features from audio sequences, thereby enhancing SED performance.

4.2.4 Comparison with other methods.

To assess the superiority of the proposed MTL method, this study compared it with existing joint analysis methods for SED and ASC. Table 7 shows the detailed comparison. The compared methods follow the hard parameter-sharing MTL model described in Section 2; Section 1 describes the design principles and characteristics of each method.

Table 7. Comparison with state-of-the-art methods in SED.

https://doi.org/10.1371/journal.pone.0322002.t007

Experiments demonstrate that our proposed MTL approach surpasses current methods on the TUT 2016/2017 datasets. Specifically, the F-score for the SED task reached 56.93%, which is 6.44 percentage points higher than the previous best model [34]. This improvement is attributed to the proposed model's ability to effectively balance the complex interactions of the SED and ASC tasks while maintaining information flow throughout training. Furthermore, the MS-conv and D-LKAC blocks enable the model to capture more subtle event features. These results confirm the proposed MTL model's superiority, demonstrating excellent coordination and generalization in capturing the complex interactions between SED and ASC.

5. Conclusion

This study proposes a multitask network based on the R-MFE framework, aimed at exploring the relationship between acoustic scenes and events and utilizing scene information to enhance the performance of sound event detection. The proposed method overcomes the limitations of information interaction and flow between tasks in conventional MTL approaches and effectively balances the complex interrelationships between the SED and ASC tasks. Moreover, this study introduces the D-LKAC attention module, which captures both global and local contextual features and, by dynamically focusing on adjacent time-frequency bands, extracts richer feature information than conventional attention mechanisms. To further optimize the performance of the SED task, the MS-conv module is designed to capture audio detail features from multiple dimensions. We conducted experiments on the TUT 2016/2017 datasets to evaluate the performance of SED and ASC. Experimental results indicate that the proposed method outperforms single-task learning and conventional MTL approaches, achieving a 6.44 percentage point improvement in F-score over current state-of-the-art methods. These experiments confirm the effectiveness of the proposed MTL method and show that scene information significantly enhances SED performance.

References

  1. 1. Stork JA, Spinello L, Silva J, Arras KO. Audio-based human activity recognition using Non-Markovian Ensemble Voting. 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication. 2012. p. 509–14. https://doi.org/10.1109/roman.2012.6343802
  2. 2. Harma A, McKinney MF, Skowronek J. Automatic Surveillance of the Acoustic Activity in Our Living Environment. 2005 IEEE International Conference on Multimedia and Expo. Amsterdam, The Netherlands: IEEE; 2005. p. 634–7. https://doi.org/10.1109/ICME.2005.1521503
  3. 3. Ntalampiras S, Potamitis I, Fakotakis N. On acoustic surveillance of hazardous situations. 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. Taipei, Taiwan: IEEE; 2009. p. 165–8. https://doi.org/10.1109/ICASSP.2009.4959546
  4. 4. Chan C-F, Eric W. An abnormal sound detection and classification system for surveillance applications. 2010 18th European Signal Processing Conference. IEEE; 2010. p. 1851–5.
  5. 5. Podwinska Z, Sobieraj I, Fazenda BM, Davies WJ, Plumbley MD. Acoustic event detection from weakly labeled data using auditory salience. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2019. p. 41–5.
  6. 6. Ding B, Zhang T, Liu G, Wang C. Hierarchical classification for acoustic scenes using deep learning. Appl Acoustics. 2023;212:109594.
  7. 7. Khan AA, Laghari AA, Baqasah AM, Bacarra R, Alroobaea R, Alsafyani M, et al. BDLT-IoMT—a novel architecture: SVM machine learning for robust and secure data processing in Internet of Medical Things with blockchain cybersecurity. J Supercomput. 2024;81(1).
  8. 8. Khan AA, Yang J, Laghari AA, Baqasah AM, Alroobaea R, Ku CS, et al. BAIoT-EMS: Consortium network for small-medium enterprises management system with blockchain and augmented intelligence of things. Eng Appl Artif Intel. 2025;141:109838.
  9. Khan AA, Laghari AA, Baqasah AM, Alroobaea R, Almadhor A, Sampedro GA, et al. Blockchain-enabled infrastructural security solution for serverless consortium fog and edge computing.
  10. Ayub Khan A, Chen Y-L, Hajjej F, Ahmed Shaikh A, Yang J, Soon Ku C, et al. Digital forensics for the socio-cyber world (DF-SCW): a novel framework for deepfake multimedia investigation on social media platforms. Egypt Inform J. 2024;27:100502.
  11. Ayub Khan A, Laghari AA, Baqasah AM, Alroobaea R, Reddy Gadekallu T, Avelino Sampedro G, et al. ORAN-B5G: a next-generation open radio access network architecture with machine learning for beyond 5G in industrial 5.0. IEEE Trans Green Commun Netw. 2024;8(3):1026–36.
  12. Khan AA, Laghari AA, Alroobaea R, Baqasah AM, Alsafyani M, Bacarra R, et al. Secure remote sensing data with blockchain distributed ledger technology: a solution for smart cities. IEEE Access. 2024;12:69383–96.
  13. Ayub Khan A, Dhabi S, Yang J, Alhakami W, Bourouis S, Yee PL. B-LPoET: a middleware lightweight Proof-of-Elapsed Time (PoET) for efficient distributed transaction execution and security on Blockchain using multithreading technology. Comput Electr Eng. 2024;118:109343.
  14. Caruana R. Multitask learning. Mach Learn. 1997;28:41–75.
  15. Ditthapron A, Lammert AC, Agu EO. Multitask deep learning methods for improving human context recognition from low sampling rate sensor data. IEEE Sensors J. 2023;23(16):18821–31.
  16. Feng Z, Wu F, Zhao L. A multitask electronic nose data processing model based on transformer encoder. IEEE Sensors J. 2024;24(5):6482–9.
  17. Heng H, Li S, Li P, Lin Q, Chen Y, Zhang L. MTSTR: Multi-task learning for low-resolution scene text recognition via dual attention mechanism and its application in logistics industry. PLoS One. 2023;18(12):e0294943. pmid:38085712
  18. Liao K, Peng X. Underwater image enhancement using multi-task fusion. PLoS One. 2024;19(2):e0299110. pmid:38408101
  19. Zhao M, Wang L, Jiang Z, Li R, Lu X, Hu Z. Multi-task learning with graph attention networks for multi-domain task-oriented dialogue systems. Knowl-Based Syst. 2023;259:110069.
  20. Wang Y, Dong L, Li Y, Zhang H. Multitask feature learning approach for knowledge graph enhanced recommendations with RippleNet. PLoS One. 2021;16(5):e0251162. pmid:33989299
  21. Krause DA, Mesaros A. Binaural signal representations for joint sound event detection and acoustic scene classification. 2022 30th European Signal Processing Conference (EUSIPCO). IEEE; 2022. p. 399–403.
  22. de Benito-Gorrón D, Zmolikova K, Toledano DT. Analysis and interpretation of joint source separation and sound event detection in domestic environments. PLoS One. 2024;19(7):e0303994. pmid:38968280
  23. Liang Y, Long Y, Li Y, Liang J, Wang Y. Joint framework with deep feature distillation and adaptive focal loss for weakly supervised audio tagging and acoustic event detection. Digital Signal Process. 2022;123:103446.
  24. Jung J, Shim H, Kim J, Yu H-J. DCASENet: an integrated pretrained deep neural network for detecting and classifying acoustic scenes and events. ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2021. p. 621–5.
  25. Hou Y, Kang B, Van Hauwermeiren W, Botteldooren D. Relation-guided acoustic scene classification aided with event embeddings. 2022 International Joint Conference on Neural Networks (IJCNN). IEEE; 2022. p. 1–8.
  26. Imoto K, Shimauchi S. Acoustic scene analysis based on hierarchical generative model of acoustic event sequence. IEICE Trans Inf Syst. 2016;E99.D(10):2539–49.
  27. Imoto K, Ono N. Acoustic topic model for scene analysis with intermittently missing observations. IEEE/ACM Trans Audio Speech Lang Process. 2019;27(2):367–82.
  28. Igarashi A, Imoto K, Komatsu Y, Tsubaki S, Hario S, Komatsu T. How information on acoustic scenes and sound events mutually benefits event detection and scene classification tasks. 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE; 2022. p. 7–11.
  29. Bear HL, Nolasco I, Benetos E. Towards joint sound scene and polyphonic sound event recognition. arXiv preprint. 2019.
  30. Tonami N, Imoto K, Yamanishi R, Yamashita Y. Joint analysis of sound events and acoustic scenes using multitask learning. IEICE Trans Inf Syst. 2021;E104.D(2):294–301.
  31. Tonami N, Imoto K, Niitsuma M, Yamanishi R, Yamashita Y. Joint analysis of acoustic events and scenes based on multitask learning. 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE; 2019. p. 338–42.
  32. Imoto K, Tonami N, Koizumi Y, Yasuda M, Yamanishi R, Yamashita Y. Sound event detection by multitask learning of sound events and scenes with soft scene labels. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2020. p. 621–5.
  33. Leng Y, Zhuang J, Pan J, Sun C. Multitask learning for acoustic scene classification with topic-based soft labels and a mutual attention mechanism. Knowl-Based Syst. 2023;268:110460.
  34. Liang H, Ji W, Wang R, Ma Y, Chen J, Chen M. A scene-dependent sound event detection approach using multi-task learning. IEEE Sensors J. 2022;22(18):17483–9.
  35. Komatsu T, Imoto K, Togami M. Scene-dependent acoustic event detection with scene conditioning and fake-scene-conditioned loss. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2020. p. 646–50.
  36. Tsubaki S, Imoto K, Ono N. Joint analysis of acoustic scenes and sound events with weakly labeled data. 2022 International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE; 2022. p. 1–5.
  37. Nada K, Imoto K, Iwamae R, Tsuchiya T. Multitask learning of acoustic scenes and events using dynamic weight adaptation based on multi-focal loss. 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE; 2021. p. 1156–60.
  38. Nada K, Imoto K, Tsuchiya T. Joint analysis of acoustic scenes and sound events based on multitask learning with dynamic weight adaptation. Acoust Sci Tech. 2023;44(3):167–75.
  39. Hou Y, Kang B, Mitchell A, Wang W, Kang J, Botteldooren D. Cooperative scene-event modelling for acoustic scene classification. IEEE/ACM Trans Audio Speech Lang Process. 2023.
  40. Li D, Zhang Z, Yuan S, Gao M, Zhang W, Yang C, et al. AdaTT: adaptive task-to-task fusion network for multitask learning in recommendations. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023. p. 4370–9.
  41. Tang H, Liu J, Zhao M, Gong X. Progressive layered extraction (PLE): a novel multi-task learning (MTL) model for personalized recommendations. Proceedings of the 14th ACM Conference on Recommender Systems. 2020. p. 269–78.
  42. Cao Y, Iqbal T, Kong Q, An F, Wang W, Plumbley MD. An improved event-independent network for polyphonic sound event localization and detection. arXiv; 2021. Available from: http://arxiv.org/abs/2010.13092
  43. Misra I, Shrivastava A, Gupta A, Hebert M. Cross-stitch networks for multi-task learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 3994–4003.
  44. Ma J, Zhao Z, Yi X, Chen J, Hong L, Chi EH. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. London, United Kingdom: ACM; 2018. p. 1930–9. https://doi.org/10.1145/3219819.3220007
  45. Gulati A, Qin J, Chiu C-C, Parmar N, Zhang Y, Yu J, et al. Conformer: convolution-augmented Transformer for speech recognition. 2020.
  46. Zhuang X, Liu F, Hou J, Hao J, Cai X. Modality attention fusion model with hybrid multi-head self-attention for video understanding. PLoS One. 2022;17(10):e0275156. pmid:36201513
  47. Kim K, Wu F, Peng Y, Pan J, Sridhar P, Han KJ, et al. E-Branchformer: Branchformer with enhanced merging for speech recognition. 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE; 2023. p. 84–91.
  48. Roshan M, Rawat M, Aryan K, Lyakso E, Mekala AM, Ruban N. Linguistic based emotion analysis using softmax over time attention mechanism. PLoS One. 2024;19(4):e0301336. pmid:38625932
  49. Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, et al. Deformable convolutional networks. Proceedings of the IEEE International Conference on Computer Vision. 2017. p. 764–73.
  50. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2017. Available from: https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
  51. Lau KW, Po L-M, Rehman YAU. Large separable kernel attention: rethinking the large kernel attention design in CNN. Expert Syst Appl. 2024;236:121352.
  52. Guo M-H, Lu C-Z, Liu Z-N, Cheng M-M, Hu S-M. Visual attention network. Comp Visual Med. 2023;9(4):733–52.
  53. Mesaros A, Heittola T, Virtanen T. TUT database for acoustic scene classification and sound event detection. 2016 24th European Signal Processing Conference (EUSIPCO). IEEE; 2016. p. 1128–32.
  54. Mesaros A, Heittola T, Diment A, Elizalde B, Shah A, Vincent E, et al. DCASE 2017 challenge setup: tasks, datasets and baseline system. DCASE 2017-Workshop on Detection and Classification of Acoustic Scenes and Events. 2017.
  55. Imoto K. Dataset. Available from: https://www.ksuke.net/dataset
  56. Xu Y, Kong Q, Wang W, Plumbley MD. Surrey-CVSSP system for DCASE2017 challenge task4. arXiv preprint arXiv:170900551. 2017.
  57. Bai J, Yin H, Wang M, Shi D, Gan W-S, Chen J. AudioLog: LLMs-powered long audio logging with hybrid token-semantic contrastive learning. arXiv. 2023.
  58. Nam H, Kim S-H, Ko B-Y, Park Y-H. Frequency dynamic convolution: frequency-adaptive pattern recognition for sound event detection. Interspeech 2022. ISCA; 2022. p. 2763–7. https://doi.org/10.21437/Interspeech.2022-10127
  59. Zhang H, Li S, Min X, Yang S, Zhang L. Conformer-based sound event detection with data augmentation. 2022 International Conference on Knowledge Engineering and Communication Systems (ICKES). Chickballapur, India: IEEE; 2022. p. 1–7. https://doi.org/10.1109/ICKECS56523.2022.10060191
  60. Nam H, Kim S-H, Min D, Park Y-H. Frequency & channel attention for computationally efficient sound event detection. arXiv. 2023.