Abstract
Oral translation in English serves as a critical conduit for international communication and cultural exchange. However, the prevalent variations in pronunciation and the rapid pace of spoken language currently impede the efficacy of synchronous translation methods. To improve the quality and efficiency of synchronous oral translation, this paper explores the integration of cross-modal semantic understanding and synchronous enhancement specifically for English oral translation. The exploration begins with the construction of a cross-modal translation scenario. Subsequently, the text sequence derived from this process is amalgamated with the original speech features via Bidirectional Encoder Representations from Transformers (BERT). Cross-modal information is exploited, and the self-attention mechanism in the Transformer is optimized via linear transformation to achieve context-aware understanding of the transcribed oral text. Finally, dynamic time warping (DTW) is integrated to enhance real-time synchronization between speech and text, thereby improving translation fluency. Experimental results reveal that, when compared to the existing bilingual attention neural machine translation (NMT) model and the context-aware NMT model, the model proposed in this study yields an average bilingual evaluation understudy (BLEU) score that is 9.3% and 26.9% higher, respectively. Furthermore, its synchronization speed surpasses that of the other two models by 17.9% and 16.8%, respectively. These findings suggest that the fusion model, which incorporates context-awareness and an attention mechanism in cross-modal translation, can significantly elevate the quality and efficiency of English oral translation, offering a novel approach to the synchronous translation of spoken English.
Citation: Wang Y (2025) Enhancing English oral translation through cross-modal learning and synchronous optimization. PLoS One 20(8): e0329381. https://doi.org/10.1371/journal.pone.0329381
Editor: Y-h. Taguchi, Chuo University, JAPAN
Received: October 19, 2024; Accepted: July 15, 2025; Published: August 18, 2025
Copyright: © 2025 Yan Wang. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data supporting the findings of this study are partially sourced from publicly available datasets, including LibriSpeech, TED talk dataset, and Common Voice database. For proprietary reasons, some data are not publicly available. Requests for data access can be made to the Institutional Data Access Committee at Anhui Wenda University of Information Engineering. Email: bgs@wenda.edu.cn.
Funding: This work was supported by the Construction of the evaluation system of ideological and political teaching in College English courses under CIPP model. The role of funders in this study is mainly to provide financial support for the construction of an evaluation system for ideological and political teaching in college English courses (CIPP model). Funders are involved in study design and project implementation.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
In the era of globalization, the importance of international communication has surged, particularly in realms such as international business, academic exchanges, and daily interactions, where the need for oral translation is increasingly evident [1,2]. However, contemporary translation algorithms often struggle to accurately interpret the contextual relationships and correlations among multiple modalities when handling multimodal information. As the field of information technology rapidly evolves, substantial advancements have been observed in natural language processing (NLP) [3]. The incorporation of context-awareness technology enhances the model’s ability to comprehend the overarching meaning of a sentence, while attention mechanisms bolster the model’s capacity to capture the sentence’s core information. In the domain of oral translation, integrating these elements and effectively leveraging multimodal information can lead to innovative cross-modal semantic understanding and simultaneous enhancement techniques. This holds significant theoretical and practical implications for fulfilling the growing demand for high-quality, real-time translation and advancing the practical utilization of intelligent translation technology.
This article explores the application of cross-modal semantic understanding and synchronous enhancement to English oral translation, with a view to enhancing both translation quality and efficiency and meeting the demands of real-time translation. The study integrates context-awareness and attention mechanisms to achieve optimization in multimodal scenarios. When compared with the bilingual attention neural machine translation (NMT) model and the context-aware NMT model, our model demonstrates superior performance in translation accuracy, fluency, synchronization speed, and consistency. The average bilingual evaluation understudy (BLEU) score of our model surpasses that of the other two models by 9.3% and 26.9%, respectively. Further, our model’s average perplexity is 9.2% and 15% lower than that of the two baseline models, respectively, and it also has a lower average grammar error rate. In terms of synchronization speed, our model’s average token processing speed per second exceeds that of the other two models by 17.9% and 16.8%, respectively. As for translation consistency, the mean translation edit rate (TER) of our model across the test subsets reaches 15.6%, compared to 18.9% and 20.2% for the bilingual attention NMT model and the context-aware NMT model, respectively. Our findings suggest that a fusion model based on context-awareness and attention mechanisms in cross-modal translation can enhance the real-time performance of English oral translation tasks while ensuring translation quality.
The research contributions of this article include:
- (1) Cross-modal semantic understanding:
This paper presents a novel cross-modal semantic understanding mechanism that integrates visual and auditory data. This integration addresses the limitations of single-modal information in current models and mitigates the issues of reduced translation accuracy due to variations in accent and speech speed. Consequently, it enhances the comprehension of oral content in cross-modal contexts.
- (2) Integration of context awareness and attention mechanisms:
The integration of context-aware techniques enhances the model’s ability to capture long-range dependencies, thereby augmenting its contextual sensitivity. The attention mechanism, refined through linear transformation, permits a more flexible allocation of attention weights. This flexibility enables superior capture of long-range dependencies within the input sequence, ultimately improving the translation quality of complex oral expressions.
- (3) Synchronization enhancement:
This article employs the Dynamic Time Warping (DTW) algorithm to facilitate real-time fusion of information during the processing of multimodal inputs, thereby synchronizing speech and text in real time. This approach enhances both the performance and accuracy of the translation. Furthermore, the model dynamically adjusts its dependence on various modal information when managing oral translation tasks.
2. Related work
Current English translation models may yield inaccuracies when handling rare vocabulary or intricate grammatical structures [4, 5]. To enhance translation efficacy, Khan S. N. integrated rule-based translation methods with artificial neural networks, creating an environment for efficient pattern matching and example learning. This integration led to the implementation of Neural Machine Translation (NMT), which achieved an n-gram BLEU score of 0.5903 across approximately 500 Hindi test sentences, thereby indicating superior translation quality [6]. Saini S. proposed a novel machine translation system, influenced by the sequential adaptive memory cognitive model. This system utilized a sequence-to-sequence learning method to comprehend and generate speech and language, reflecting the structure of the cerebral neocortex region. The experimental results confirmed that this method could deliver satisfactory, high-quality translation results [7]. In an effort to enhance grammatical accuracy in translation, Wang X. developed a fundamental English grammar analyzer for English-Chinese translation, improving upon the generalized maximum likelihood ratio algorithm. A character mapping function was employed to automatically recognize sentence boundaries, and same-level clauses were used to outline the grammatical structure of sentences. This facilitated the transition of proverbs from the original sentences to the target sentences. The effectiveness of this method was subsequently validated through example sentence analysis [8]. Koul N introduced an RNN (Recurrent Neural Network)-based NMT model to mitigate the challenges associated with translating long and atypical sentences. This recurrent neural network was trained using sequence-to-sequence examples, thereby addressing uncommon scenarios such as extended sentences and rare lexical items. During the training phase, specific weighting factors were employed to account for the nuances of unusual words.
Upon experimental evaluation, Koul N’s model demonstrated a notable 10% improvement in translation accuracy compared with a naive Bayes classifier [9]. While contemporary research has undeniably bolstered the precision of English translation, challenges persist, particularly with single-modality inputs and inadequate contextual comprehension.
Cross-modal translation, which amalgamates information from diverse modalities, notably enhances translation accuracy [10,11]. By integrating context-awareness and leveraging multi-head attention mechanisms, the dependency relationships among sentences can be delineated more effectively [12,13]. To mitigate the gradient problem, Kang L introduced an NMT model based on bilingual attention, with experimental outcomes indicating its superior performance over mainstream Transformer models [14]. Premjith B employed long short-term memory (LSTM) and bidirectional RNNs to construct an NMT system, adeptly addressing the challenge of translating extended sentences [15]. The context-based NMT algorithm proposed by Sharma V. K harnessed large unlabeled corpora and small bilingual parallel corpora for translation, with experimental evaluations revealing its enhanced precision compared to baseline methods [16]. Ramadani R. A facilitated cross-modal translation by converting audio into text, processing it, and aligning it with a sign language system; tests affirmed a commendable level of accuracy [17]. While cross-modal translation improves context comprehension and translation quality by analyzing multimodal data, it still struggles to meet real-time requirements for synchronous translation across modalities. Most existing cross-modal translation algorithms are rooted in neural networks. Although they handle everyday translation tasks proficiently, they struggle to fully comprehend intricate syntactic and semantic connections, which impairs translation quality. Moreover, regarding the handling of rare words, many models currently lack the capability to effectively augment the dictionary, leading to mistranslations or omissions.
3. English oral translation with the integration of context-awareness and attention mechanism
3.1. Multimodal data integration
Multimodal data fusion is a process that amalgamates data from diverse sources and formats for all-encompassing analysis [18]. This paper focuses on the integration of speech and text data. The initial step involves utilizing Long Short-Term Memory (LSTM) to scrutinize the audio features of spoken English and to recognize the corresponding oral content. Subsequently, these recognized results are depicted as probability distributions, paving the way for modal fusion. Upon decoding the speech, the textual information is transformed into coherent text data, a process illustrated in Fig 1:
The collected oral English audio sequence is represented as A = (a_1, a_2, ..., a_T), where T is the length of the sequence. Mel frequency cepstrum coefficients are used for feature extraction to obtain the feature sequence X = (x_1, x_2, ..., x_T). The input gate i_t, forget gate f_t, and output gate o_t at each time step are computed with x_t as input. The formulas are [19,20]:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)  (1)

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)  (2)

o_t = σ(W_o x_t + U_o h_{t−1} + b_o)  (3)
Table 1 presents the definitions of the variables in Formulas (1) to (3):
Using the LSTM gating mechanism, the candidate cell state, cell state, and hidden state at the current time can be calculated through the formulas [21]:

c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)  (4)

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t  (5)

h_t = o_t ⊙ tanh(c_t)  (6)
Table 2 presents the definitions of the variables in Formulas (4) to (6):
Audio files undergo preprocessing, after which their features are extracted. These features are then fed into the LSTM network, where speech processing is executed based on the gating mechanisms above. The LSTM transfers feature information to the fully connected layer, endowing the classification output with the requisite nonlinear transformation capability. Subsequently, the probability distribution of characters that might occur at each time point is predicted by transforming the results of the fully connected layer via Softmax. This enables the connectionist temporal classification (CTC) loss function to decode the probability distribution of characters and generate a text sequence.
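As a concrete illustration of the gating mechanism described above, the following is a minimal NumPy sketch of a single LSTM time step; the parameter names (W_i, U_i, b_i, etc.) follow Formulas (1) to (6), but the shapes and the parameter-dictionary layout are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step: gates (1)-(3), then cell/hidden updates (4)-(6)."""
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])      # input gate
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])      # forget gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])      # output gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                                # new cell state
    h_t = o_t * np.tanh(c_t)                                          # new hidden state
    return h_t, c_t
```

In practice the hidden state h_t would feed the fully connected layer and Softmax described above; a framework such as PyTorch would supply the full recurrent loop and CTC loss.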
The English oral translation system is enhanced by integrating visual information, specifically lip movements, as an additional input modality alongside speech and text data. Utilizing lip-reading technology, both speech and text inputs can be processed. This allows for the use of visual information to decode alterations in mouth shape, thereby providing supplementary clues particularly beneficial in instances where the speech signal is unclear or disrupted by noise interference.
BERT, a pre-trained language model, possesses a robust capability for processing and understanding natural language texts [22,23]. In this study, BERT is employed to integrate the derived text sequence with the original speech features, thereby creating a unified modal representation. To more effectively harness the cross-modal information and the long-term relationships within each modality, a joint feature representation is adopted for the vectors. This process encompasses embedding data from three modalities and amalgamating multiple vectors from a singular modality to produce a comprehensive multimodal vector, as depicted in Fig 2.
In Fig 2, it is postulated that corresponding features are extracted from N modal data and subsequently transformed via a feature extraction network. The single-mode features of the text, sound, and visual modalities are each mapped through a linear layer, converting the output vectors of the three single-mode features to a higher dimension. Ultimately, the feature vectors from all modalities are projected into the global feature space [24,25]. In Fig 2, N linear layers, one per modality, are employed to transform the N feature vectors.
In order to ensure a distinct embedding for each modality, the BERT vector embedding simulates the process by embedding all the information from each individual modality, symbolized by E_m. During the fusion of feature vectors, all information pertaining to the modality is accepted, and the features of that modality are max-pooled. This process is illustrated in Fig 3:
According to Fig 3, after performing the maximum pooling operation, the most significant feature of each modality is taken as the initial value of the fused feature:

z_m = MaxPool(x_{m,1}, x_{m,2}, ..., x_{m,L_m})  (7)

The sequence representation of the input features after fusion of the feature vectors is:

Z = [z_1, z_2, ..., z_N]  (8)
In order to process multimodal data efficiently, the positions of each modality are encoded separately using a modal embedding, and corresponding models are subsequently established. During positional embedding, a modal encoding method is employed to ensure that the input feature sequence is distinguishable among different modalities. For N modal data, each modality is recognized by the modal code of N positions, and the coding dimension of each modal position is d. The positional modal encoding sequence of the fused vector is represented as:

P = [p_1, p_2, ..., p_N]  (9)
The feature vectors of each modality contain both semantic and modal information. On this basis, a unified modal representation is defined by the formula:

U = Z + P + E  (10)

where E = [E_1, E_2, ..., E_N] denotes the modality embeddings.
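To make the fusion pipeline concrete, here is a minimal NumPy sketch, assuming three toy modalities with arbitrary feature dimensions; the per-modality linear layers, max pooling, and additive positional/modality encodings follow the description above, but all weights and encodings are random or constant placeholders rather than learned values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                       # shared embedding dimension
modalities = {              # toy per-modality feature sequences: (seq_len, feat_dim)
    "text":   rng.normal(size=(5, 4)),
    "audio":  rng.normal(size=(7, 6)),
    "visual": rng.normal(size=(3, 2)),
}

fused = []
for m_idx, (name, feats) in enumerate(modalities.items()):
    W = rng.normal(size=(feats.shape[1], d))  # per-modality linear layer (placeholder weights)
    proj = feats @ W                          # map features into the shared d-dim space
    z = proj.max(axis=0)                      # max pooling over the modality's sequence
    pos = np.full(d, float(m_idx))            # toy positional modal encoding
    emb = np.full(d, 0.1 * m_idx)             # toy modality embedding
    fused.append(z + pos + emb)               # unified representation of this modality

U = np.stack(fused)                           # shape: (N_modalities, d)
```

In the actual model these projections would be trainable layers and the result would be consumed by BERT, but the flow of shapes is the same.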
3.2. Construction of context-awareness model and optimization of attention mechanism
The Transformer model boasts flexibility and scalability in the field of Natural Language Processing (NLP) [26,27]. The resulting unified modal expression is input into the Transformer network. On this basis, the self-attention mechanism in Transformer is utilized to perform context-awareness and understanding of oral-transcribed text. By establishing a contextual model to determine each part of the text sequence, omissions and non-standard grammar can be recognized.
The context-awareness model is predicated on the architecture of the Transformer encoder, illustrated in Fig 4.
In Fig 4, the Add&Norm function incorporates residual connections and Layer-Normalization operations. The role of residual connections is to mitigate the loss of gradient during the backpropagation process. On the other hand, Layer-Normalization serves as a normalized method capable of discerning feature differences within a singular sample, thereby enhancing the stability of the model. In the context of feedforward neural networks, fully connected layers and ReLU activation functions are employed.
The Transformer encoder’s primary role is to convert the input into a hidden layer, a mathematical representation encompassing natural language sequences [28]. The decoder subsequently maps this hidden layer back to a natural language sequence, allowing the semantic model to address complex problems effectively [29,30]. In accordance with Bayes’ rule, given the context encoding sequence c and the initial target vectors, the distribution of the target sequence y = (y_1, ..., y_T) is broken down into the product of the conditional distributions of each target vector:

p(y_1, ..., y_T | c) = ∏_{t=1}^{T} p(y_t | y_1, ..., y_{t−1}, c)  (11)
The limitations of self-attention mechanisms in modeling high-dimensional input vectors necessitate the use of multi-head attention [31,32]. The input features X are multiplied by the matrix coefficients W_1^Q, W_1^K, W_1^V and mapped into semantic space 1, as shown in Formula (12):

Q_1 = X W_1^Q,  K_1 = X W_1^K,  V_1 = X W_1^V  (12)

Here, X represents the input features. On this basis, the Attention operation is performed and expressed as [33]:

head_1 = Attention(Q_1, K_1, V_1) = softmax(Q_1 K_1^T / √d_k) V_1  (13)

head_1 represents a single vector sequence whose length is the same as that of X. By using the matrix coefficients W_2^Q, W_2^K, W_2^V to transform the input features X into semantic space 2, the Attention operation is repeated to obtain head_2. Similarly, head_3, ..., head_h can be obtained.

Finally, by concatenating the heads, a complete multi-head attention representation is obtained:

MultiHead(X) = Concat(head_1, head_2, ..., head_h) W^O  (14)
In multimodal scenarios, each modality’s data is subjected to numerous Attention operations, potentially impacting the model’s learning efficiency and effectiveness. This paper employs a linear transformation to mix the multiple attention heads and aggregate them, thereby enhancing the efficiency and capacity of the attention representation. This approach is depicted in Fig 5.
The linear transformation incorporates an additional linear layer into the traditional attention computation, thereby enhancing attention efficiency. During the calculation of queries, keywords, and values, this paper employs a weighted matrix for linear transformation alongside the original word embedding vectors. This approach augments the model’s capacity to capture input features, rendering the attention mechanism more versatile in weight distribution and more effective in identifying long-distance correlations within sequences. Through strategic adjustments to the weight matrix in linear transformation, an adaptive modification of attention allocation strategies under varying input conditions can be accomplished. Consequently, the representation ability and generalization effect of the model are significantly improved.
Each modality’s input data undergoes a linear transformation to generate multiple attention heads, which are subsequently weighted and combined into an integrated representation. This sparse, weighted combination addresses the redundancy common in conventional multi-head computation and avoids unnecessary calculations, thereby enhancing the efficiency of multimodal data processing.
The output vector is ultimately derived by merging the results of all attention heads through an additional linear transformation matrix. This “multi-head” architecture not only bolsters the capture of intricate patterns but also improves computational efficiency via parallel processing. Building on this, the paper employs layer normalization and residual connections to ensure robust learning and stable performance of the self-attention components. Consequently, the model can concentrate more effectively on pertinent information without being disturbed by inter-modal noise.
Firstly, after obtaining Q_i and K_i, the scaled dot product under the i-th head is calculated and represented as S_i. The calculation method is shown in Formula (15) [34,35]:

S_i = Q_i K_i^T / √d_k  (15)

The parameter matrix Λ is multiplied with the calculation result so that each dot product contains the attention information of all heads, projecting it into a shared space; λ_{ij} is used to represent an element of this parameter matrix. The linear transformation is calculated in Formula (16):

Ŝ_i = Σ_{j=1}^{h} λ_{ij} S_j  (16)

In Formula (16), λ_{ij} represents a parameter in the matrix Λ. The mixed scores are normalized using Softmax and then multiplied by the value matrix to obtain the head attention. All the head attentions are concatenated and multiplied by the matrix W^O to obtain the mixed multi-head attention. Formula (17) shows the calculation process:

head_i = softmax(Ŝ_i) V_i,  MixedAtt = Concat(head_1, ..., head_h) W^O  (17)

In Formula (17), softmax(Ŝ_i) represents the value after spatial normalization; V_i represents the value matrix; head_i represents the head attention. Mixed head attention correlates information from multiple attention heads, thereby improving the translation quality of the model.
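The mixed multi-head computation can be sketched as follows in NumPy; the function name, the head-mixing matrix `Lmbda`, and all shapes are illustrative assumptions rather than the paper’s code, but the structure follows the three steps of scaled dot products, linear head mixing, and softmax-weighted values with an output projection.

```python
import numpy as np

def mixed_multihead_attention(X, Wq, Wk, Wv, Lmbda, Wo):
    """Multi-head attention whose per-head logits are linearly mixed across heads."""
    h = Wq.shape[0]
    d_k = Wq.shape[-1]
    # scaled dot-product logits per head: S_i = Q_i K_i^T / sqrt(d_k)
    S = np.stack([(X @ Wq[i]) @ (X @ Wk[i]).T / np.sqrt(d_k) for i in range(h)])
    # linear transformation: each head's logits become a weighted mix of all heads'
    S_mixed = np.einsum("ij,jnm->inm", Lmbda, S)
    # softmax over keys (numerically stabilized)
    e = np.exp(S_mixed - S_mixed.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)
    # weight the values per head, then concatenate and project
    heads = [A[i] @ (X @ Wv[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ Wo
```

A trainable version would register Wq, Wk, Wv, Lmbda, and Wo as model parameters; the einsum line is the only step added relative to standard multi-head attention.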
3.3. Synchronous enhancement algorithm
Dynamic Time Warping (DTW) serves as a metric for gauging the similarity between two time series, allowing them to be matched [36]. The DTW algorithm identifies the optimal alignment path between two sequences even when their timing is not strictly synchronized; it essentially seeks the path between the two sequences that minimizes the cumulative distance of the entire route [37]. In a cross-modal translation context, there often exists a pronounced temporal mismatch between speech and text, since variations in speaking pace cause the same content to stretch or contract in duration. Conventional fixed-window alignment techniques struggle to handle such inherent timing inconsistencies, whereas the DTW algorithm enables precise positioning by searching for the best possible path under temporal shifts. The method formulates a two-dimensional cost matrix in which each entry denotes the distance between corresponding positions of the two sequences; using dynamic programming, it then finds the shortest cumulative path from start to end. Employing the DTW algorithm for data alignment ensures meticulous mapping and transformation of information across modalities during conversion. Moreover, the robust adaptability of the DTW algorithm makes it well suited to both linear temporal relationships and the intricate temporal structures inherent in multimodal data within cross-modal translation tasks.
This article addresses the issue of processing time series with varying lengths by utilizing the Dynamic Time Warping (DTW) algorithm for time series alignment. This is achieved through the development of a context-aware mechanism and the optimization of an attention mechanism, which facilitate real-time synchronization between speech and text. Consequently, the fluency of translation is enhanced. The process is illustrated in Fig 6.
Assuming that the oral audio feature sequence is X = (x_1, ..., x_n) and the text feature sequence is Y = (y_1, ..., y_m), the Euclidean distance is used to calculate the distance matrix D between X and Y:

D(i, j) = ||x_i − y_j||_2  (18)

The cumulative distance matrix C is then calculated:

C(i, j) = D(i, j) + min{C(i−1, j), C(i, j−1), C(i−1, j−1)}  (19)

Starting from the bottom-right corner of the matrix C, the optimal path P* is found through backtracking:

P* = argmin_P Σ_{(i,j)∈P} D(i, j)  (20)

The principle of backtracking is to move along the direction of the minimum value:

(i, j) → argmin{C(i−1, j−1), C(i−1, j), C(i, j−1)}  (21)
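The cost-matrix construction and backtracking steps above can be sketched as a plain NumPy implementation of classical DTW; variable names are illustrative, and a production version would use an optimized library routine.

```python
import numpy as np

def dtw_align(X, Y):
    """Align feature sequences X (n, d) and Y (m, d); return total cost and path."""
    n, m = len(X), len(Y)
    # pairwise Euclidean distance matrix D(i, j) = ||x_i - y_j||
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    # cumulative cost matrix with an inf border for the boundary conditions
    C = np.full((n + 1, m + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            C[i, j] = D[i - 1, j - 1] + min(C[i - 1, j], C[i, j - 1], C[i - 1, j - 1])
    # backtrack from the bottom-right corner along the minimum cumulative cost
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([C[i - 1, j - 1], C[i - 1, j], C[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return C[n, m], path[::-1]
```

For identical sequences the optimal path is the diagonal with zero cost; when one sequence stretches (a speaker slowing down), the path repeats indices of the shorter sequence instead.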
The conventional Dynamic Time Warping (DTW) method exhibits reduced computational efficiency when applied to long time series. This article uses Locality Sensitive Hashing (LSH) to preprocess speech sequences, mapping them into a low-dimensional space to improve matching efficiency. Firstly, using the LSH algorithm, the dimensionality of X is reduced, and H = {h_1, h_2, ..., h_K} is defined as a set of hash functions. In the low-dimensional space, the audio sequence data X′ is expressed as:

X′ = H(X) = (h_1(X), h_2(X), ..., h_K(X))  (22)

On this basis, feature extraction is performed on the data processed by LSH to capture temporal dependencies in the sequence. The oral audio feature sequence X′ is matched with the text feature sequence Y. The aligned sequence A is:

A = DTW(X′, Y)  (23)
The matching results of the DTW algorithm are then integrated with the translation module. The context-aware model, combined with the optimized linear-transformation-based attention mechanism, jointly influences the final translation decision. Translation is executed over the aligned sequence, enabling automatic generation of the target-language text.
4. English oral translation experiment
4.1. Experimental data
The translation model in this article is shown in Fig 7.
This study utilizes LibriSpeech as the experimental dataset, which provides both text and audio formats. It comprises approximately 1000 hours of 16 kHz read English speech, meticulously segmented and normalized. The speech is divided into audio files of roughly 10 seconds each, accompanied by text annotations. Prior to experimental analysis, both audio and text data from the original dataset undergo preprocessing. The audio files are edited to remove noisy sections, while the text data is stripped of special characters and punctuation, retaining only standard text. All text is lowercased and segmented into sub-word units to create a vocabulary.
To simulate a diverse spoken language environment and ensure model adaptability to dialects and intricate structures, additional processing is performed on the experimental dataset. Extra oral English data is sourced from the TED talk dataset, with the supplementary corpus featuring distinctly marked Indian English, Australian English, Irish English, Scottish English, and Southern English. Conversational data from the Common Voice database is also included. To reflect real-world usage scenarios, the Switchboard corpus provides additional voice data from everyday phone conversations. Texts from the Penn Treebank database and COCA Complex Sentences that contain nested clauses, split sentences, and inverted sentences undergo speech conversion, representing 12% of the corpus. Moreover, the CELEX corpus is employed to embed specialized terms for high-frequency applications into various texts, such as news reports, academic speeches, and classroom discussions, thereby generating new speech samples. The refined dataset comprises 1200 hours of speech samples, with the dataset feature breakdown detailed in Table 3.
The combination of these datasets provides a large and diverse data source, which benefits the training of the English oral translation model. The LibriSpeech dataset forms the base, while the other datasets supply diversity in dialects, conversational style, complex grammar, and specialized vocabulary. This large-scale dataset makes the translation model more accurate, fluent, and robust across different oral translation scenarios.
To mitigate potential translation bias and safeguard speech data privacy, this study employs the Word Embedding Association Test to detect gender, race, and other biased vocabulary in the training samples. Where such hidden biases are detected, techniques such as semantic substitution and data class balancing are applied to reduce bias in the model output. Furthermore, a voice anonymization algorithm based on Voice Masking prevents unauthorized disclosure of users’ voice identities, ensuring that speakers cannot be identified from the voice data while the speech content remains recognizable. Moreover, differential privacy techniques are utilized to further secure the data.
The oral data collected is automatically segmented into approximately 10-second intervals using the Librosa library in Python. These segmented audio clips are then annotated with text using the original audio paired with subtitles. For consistency with the LibriSpeech data, all audio files are converted to the WAV format. This processed oral data is subsequently integrated with the LibriSpeech dataset, resulting in a comprehensive corpus encompassing both formal reading and oral expressions. The entire corpus is categorized into 10 subsets based on specific spoken language scenarios. The composition of the final experimental data is detailed in Table 4.
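The fixed-length segmentation step can be sketched as follows; this is a simplified stand-in for the Librosa-based pipeline described above (no silence trimming or subtitle pairing), assuming 16 kHz mono input, and the function name is illustrative.

```python
import numpy as np

def segment_audio(samples, sr=16000, seg_seconds=10.0):
    """Split a mono waveform into fixed-length clips; the last clip may be shorter.

    In the actual pipeline each clip would also be paired with its subtitle text
    and written out as a WAV file for consistency with LibriSpeech.
    """
    seg_len = int(sr * seg_seconds)  # samples per ~10-second clip
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]
```

For example, a 25-second recording at 16 kHz yields two full 10-second clips and one trailing 5-second clip.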
To assess the efficacy of the model presented in this paper, we compare it with existing cross-modal translation research using experimental data from Table 4. Specifically, we evaluate its performance against two baseline models: the bilingual attention-based NMT model [14] and the context-aware NMT model [16]. The bilingual attention-based NMT model employs an encoder-decoder architecture that dynamically adjusts focus during cross-modal translation by computing attention weights between the source and target languages. In contrast, the context-aware NMT model integrates multimodal information to form context representations, leveraging context windows and guiding generation during decoding through context vectors.
In our experiments, we utilize a rigorous experimental setup to guarantee the reliability and validity of the model. For training, we meticulously curated a comprehensive dataset encompassing LibriSpeech, TED talk dataset, Common Voice database, Switchboard corpus, Penn Treebank database, COCA Complex Sentences, and CELEX corpus. Each dataset underwent preprocessing to eliminate noise and extraneous data, and the audio data was partitioned into roughly 10-second segments, accompanied by text annotations. Model parameters were initialized using the Xavier initialization method to preserve the variance of the activations and gradients. Training was executed using an Adam optimizer with a learning rate of 0.001, for 50 epochs with a batch size of 32, with the training data shuffled at each epoch to maintain diversity. The model’s learning was guided by a combination of cross-entropy loss and attentional loss. Post-initial training, we fine-tuned the model via domain adaptation, employing domain-specific subsets, data enhancement techniques, and iterative improvements involving multiple training and evaluation cycles. Experiments were performed on a high-performance computing cluster featuring NVIDIA 3090 GPUs, multi-core CPUs, and 512GB of RAM. The software environment comprised the PyTorch deep learning framework and various Python libraries such as NumPy, SciPy, pandas, Librosa, NLTK, and SpaCy for data and text processing tasks. The parameter settings of the model in this study are detailed in Table 5.
In the experiment, we tuned the main hyperparameters to achieve the best model performance. The learning rate and batch size were selected via grid and random search within specific value ranges. Specifically, the learning rate was searched between 0.0001 and 0.01, with 0.001 ultimately chosen as a balance between convergence speed and model stability. Similarly, the batch size was searched between 16 and 64, with 32 chosen as a balance between memory consumption and training efficiency. Meanwhile, we monitored model performance on the validation set and selected the hyperparameters that produced the lowest validation loss and the highest translation accuracy. Following these parameter settings, the model was trained and tested on each subset, and the same parameter settings were used for the bilingual attention NMT model and the context-aware NMT model. As the experimental results show, the model converges after 50 training epochs, with the BLEU score increasing gradually before settling at a stable value. Moreover, the proposed model has a lower error rate and grammar error rate, which indicates good translation fluency. Its synchronization speed and translation consistency are also better than those of the baseline models.
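The selection procedure described above can be sketched as a simple grid search over the stated ranges. The validation function below is a hypothetical stand-in: in the actual experiments it would train the model with each configuration and return the validation-set loss.

```python
import math

# Candidate values drawn from the searched ranges (0.0001–0.01, 16–64)
learning_rates = [0.0001, 0.0005, 0.001, 0.005, 0.01]
batch_sizes = [16, 32, 64]

def validation_loss(lr, bs):
    """Hypothetical stand-in for a full training-and-validation run;
    a real run would return the model's loss on the validation set."""
    return abs(math.log10(lr) + 3.0) + abs(bs - 32) / 32.0

# Pick the configuration with the lowest validation loss
best = min(((lr, bs) for lr in learning_rates for bs in batch_sizes),
           key=lambda cfg: validation_loss(*cfg))
print(best)  # → (0.001, 32)
```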
In this study, we strictly adhered to ethical guidelines for data use. All experimental data were sourced from publicly available academic databases, such as LibriSpeech and the TED Talk Dataset, and their use complied with the licensing agreements of the data providers. Additionally, to protect user privacy, we employed speech anonymization algorithms to ensure that speech data could not be identified as belonging to specific individuals. Furthermore, we conducted a Word Embedding Association Test to identify and address any potential biased vocabulary in the training data, thereby reducing bias in the model’s outputs. All these measures were implemented to ensure the ethical integrity of the research and the legitimacy of the data.
4.2. Evaluation indicators
- (1) Translation accuracy
The BLEU score is employed to evaluate the accuracy of translation, using the similarity between the generated translation results and the reference translation as the primary evaluation criterion. A high BLEU score suggests a substantial agreement between the model’s interpretation effect and the reference translation [38,39]. The calculation formula is as follows:
\[ \mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \tag{24} \]
Among them, there is:
\[ BP = \begin{cases} 1, & c > r \\ e^{\,1 - r/c}, & c \le r \end{cases} \tag{25} \]
In Formulas (24) and (25), \(BP\) is the length penalty factor; \(w_n\) and \(p_n\) represent the weight and precision of the n-gram, respectively; \(c\) is the length of the candidate translation; \(r\) is the reference translation length closest to \(c\).
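Given the n-gram precisions, the BLEU score can be computed directly from this definition. The sketch below is a minimal illustration with made-up precision values; real evaluations would use a standard implementation such as NLTK or sacreBLEU:

```python
import math

def bleu(precisions, weights, c, r):
    """BLEU: brevity penalty BP times the weighted geometric
    mean of the n-gram precisions p_n."""
    bp = 1.0 if c > r else math.exp(1.0 - r / c)   # length penalty factor
    if any(p == 0 for p in precisions):
        return 0.0                                  # guard against log(0)
    return bp * math.exp(sum(w * math.log(p)
                             for w, p in zip(weights, precisions)))

# Hypothetical 1- to 4-gram precisions with uniform weights w_n = 0.25
score = bleu([0.8, 0.6, 0.45, 0.35], [0.25] * 4, c=20, r=18)
```

Since the candidate length `c` exceeds the closest reference length `r`, the brevity penalty is 1 and the score is just the weighted geometric mean of the precisions.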
- (2) Translation fluency
The fluency of translation is assessed using two metrics: perplexity and grammar error rate. Perplexity, in particular, serves as an indicator of the fluency and naturalness of the text [40]. On the other hand, the grammar error rate quantifies the ratio of grammatical errors to the total word count. The formula for computing perplexity is as follows:
\[ \mathrm{PPL} = \exp\left( -\frac{1}{N} \log P(S) \right) \]
where \(N\) represents the total number of words in the sentence, and \(\log P(S)\) represents the logarithmic probability of the sentence.
- (3) Synchronization speed
The synchronization speed is quantified using the metric “tokens per second,” which denotes the number of tokens processed each second. A higher value for this indicator corresponds to a more rapid synchronization speed in oral translation.
- (4) Translation consistency
Translation consistency is quantified using the translation edit rate (TER), which assesses the editing distance between the translation output of each model and the corresponding reference translation. This metric indicates the frequency of required edits in relation to the total number of words in the reference translation.
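A simplified TER can be sketched as the word-level edit distance normalized by the reference length. Note this is an illustrative simplification: full TER also counts phrase shifts as single edits, which is omitted here for brevity.

```python
def ter(hypothesis, reference):
    """Word-level edit distance (insert/delete/substitute) divided by
    the number of words in the reference translation."""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming Levenshtein distance over words
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(hyp)][len(ref)] / len(ref)

score = ter("the cat sat on mat", "the cat sat on the mat")
print(round(score, 3))  # → 0.167: one edit over six reference words
```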
- (5) METEOR
Compared to BLEU, METEOR focuses more on the semantic similarity between the translation and the reference translation, rather than strict lexical matching. It can better reflect the quality of the translation, especially in terms of grammar changes, vocabulary usage, and other aspects. Its calculation formula is as follows:
\[ \mathrm{METEOR} = F_{mean}\,(1 - Penalty), \qquad F_{mean} = \frac{10PR}{R + 9P}, \qquad Penalty = 0.5\left(\frac{\#chunks}{\#matches}\right)^{3} \]
where \(P\) and \(R\) are the unigram precision and recall, \(\#chunks\) is the number of contiguous matched chunks, and \(\#matches\) is the number of matched unigrams.
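Given the unigram precision, recall, and chunk statistics, the standard METEOR score can be computed as follows (a minimal sketch with hypothetical match counts):

```python
def meteor(precision, recall, chunks, matches):
    """METEOR = F_mean * (1 - penalty): harmonic mean with recall weighted
    9x, discounted by a fragmentation penalty based on the chunk count."""
    if matches == 0:
        return 0.0
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# Hypothetical: 8 matched unigrams falling into 2 contiguous chunks
score = meteor(0.8, 0.8, chunks=2, matches=8)
```

Fewer chunks for the same number of matches means the matched words appear in longer contiguous runs, so the fragmentation penalty, and hence the score, rewards fluent word order as well as lexical overlap.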
4.3. Experimental results
- (1) Comparison of translation accuracy
The variation of BLEU model under the number of training rounds in this article is shown in Fig 8.
As illustrated in Fig 8, the BLEU score experiences a rapid ascent from 12.4 to 41.2 during the initial phases of training, thereby indicating the model’s swift adaptation to fundamental translation patterns. This early progress can largely be attributed to the model’s capacity for identifying prevalent language patterns from its original random state and for learning relatively straightforward syntactic and lexical alignment relationships. However, a deceleration in the growth of the BLEU score is evident from the 16th epoch onwards, with the rate of improvement gradually diminishing. This suggests that the model incrementally masters basic translation rules and starts addressing more intricate syntactic structures and semantic relationships. Overall, the model exhibits high training efficiency, achieving performance convergence within a confined number of training iterations.
The BLEU scores of the model in this article are compared with the bilingual attention NMT model and the context-aware NMT model in each test subset, and its ability to handle complex language structures and long-term dependencies is explored. The results are shown in Fig 9.
As illustrated in Fig 9, the model presented in this article outperforms both the bilingual attention Neural Machine Translation (NMT) model and the context-aware NMT model across all test subsets in terms of BLEU score. Specifically, the average BLEU score for the model in this article is approximately 30.7, compared to 28.1 and 24.2 for the other two models, respectively. This represents an improvement of about 9.3% and 26.9% over these models, respectively. The bilingual attention NMT model struggles with effective integration of multimodal information, leading to issues such as incomplete oral translation. Similarly, the context-aware NMT model tends to overlook details during long sequence translations. In contrast, the model in this article harmoniously merges speech and text data, thereby leveraging a broad spectrum of information resources and enhancing translation comprehension for each test set, ultimately resulting in more accurate translations.
- (2) Comparison of translation fluency
In the analysis of translation fluency, the perplexity results and grammar error rates of the various models are compared, as shown in Fig 10 and Table 6.
The perplexity comparison illustrated in Fig 10 demonstrates that the model articulated in this article consistently outperforms its counterparts across all subsets, exhibiting notably lower perplexity results. A comparative analysis of the average perplexity reveals that the model discussed in this paper yields an average perplexity of approximately 11.9, whereas the translation models founded on bilingual attention NMT and context-aware NMT yield average perplexities of 13.1 and 14.0, respectively. This indicates that the model in this article is superior by margins of 9.2% and 15%, respectively. The attention mechanism has been optimized through linear transformation in this article, thereby focusing attention across various text segments and assigning appropriate weights during translation to enhance text coherence. The relatively limited capacity of both the bilingual attention and context-aware NMT models to extrapolate distant information could potentially compromise the coherence and fluency of the translation.
In Table 6, the model presented in this article boasts an average grammar error rate of 2.5%, compared to the averages of 3.5% and 3.9% for the other two models. These comparative results indicate that the grammar accuracy of the translations produced by our model is superior. Our model consistently generates translations that are both natural and linguistically appropriate while maintaining fluency.
- (3) Comparison of synchronization speed
In evaluating synchronization speed, we utilize the metric of “tokens per second” to quantify the number of tokens processed per second by each model within the test set. The conclusive results are depicted in Fig 11.
In Fig 11, a pronounced disparity is observed in the processing speed of tokens per second among the models evaluated in the test set. The model presented in this study processes an average of 265.7 tokens per second. In contrast, the bilingual attention and the context-aware NMT model process an average of 225.4 and 227.5 tokens per second, respectively. Notably, during oral translation tasks, the synchronization speed of the proposed model surpasses that of the other two models by 17.9% and 16.8%, respectively. While the bilingual attention and the context-aware NMT models typically operate in a serial mode characterized by limited parallelism—with each time step’s output being contingent on the preceding step’s state—the model in this study employs the DTW method to minimize the cumulative distance between two sequences. This strategy effectively reduces conversion delays and expedites the synchronization process, making its translation synchronization speed optimal across each test subset.
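The DTW alignment referred to above minimizes the cumulative distance between the speech and text feature sequences. Its core recurrence can be sketched as follows; this is a simplified one-dimensional version over hypothetical feature values, not the study's actual alignment code:

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping: the minimal cumulative |a_i - b_j| over all
    monotonic alignments of the two sequences."""
    n, m = len(a), len(b)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],       # stretch sequence a
                                 d[i][j - 1],       # stretch sequence b
                                 d[i - 1][j - 1])   # advance both
    return d[n][m]

# A sequence aligns to a time-stretched copy of itself at zero cost,
# which is why DTW tolerates differing speech and text pacing
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))  # → 0.0
```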
- (4) Translation consistency
In the analysis of translation consistency, TER is used to compare and verify the translation effects of various models. The lower the TER, the higher the consistency of the translation. The final comparison results are shown in Fig 12.
In Fig 12, the average TER (Translation Edit Rate) of the model presented in this paper across each test subset is 15.6%. In comparison, the average TER values for the bilingual attention NMT (Neural Machine Translation) model and the context-aware NMT model are 18.9% and 20.2%, respectively. This suggests that the translations produced by the model in this study align more closely with the reference translations, indicating superior translation quality. The model in question employs a multi-head attention mechanism and captures the semantic nuances of spoken language across modalities using a context-awareness mechanism. This approach facilitates a deeper understanding of the meaning within the broader context. On the other hand, the bilingual attention NMT model encounters challenges in effectively incorporating contextual information, especially in longer sentences. Moreover, the attention mechanism of the context-aware NMT model has a limited capacity for expression, leading to suboptimal consistency performance.
- (5) Comparison of METEOR
The METEOR results of various models in the same dataset are compared, as shown in Table 7:
As indicated by the METEOR comparison results in Table 7, the model discussed in this paper achieves a mean METEOR score of 55.8 across various test subsets. This surpasses the mean METEOR scores of 53.7 and 51.4 achieved by the bilingual attention NMT and context-aware NMT models respectively. In a detailed comparison against these other two model types, the model in this paper demonstrates an increase in the mean METEOR score by 3.9% and 8.6% respectively. These results suggest that the model examined in this paper performs more effectively in acknowledging context awareness and incorporating differing modal information. As such, it enhances the comprehension and translation of multimodal information, particularly demonstrating increased robustness and accuracy when faced with diverse and complex inputs.
- (6) User feedback
This study employs a block-based self-attention mechanism, optimized for addressing computational efficiency issues in long sequence translation. The input speech signal is segmented into fixed 1.5-second windows, with the linear transformation method calculating the local attention for each window. To reduce algorithmic complexity, sparse attention is applied at a global scale. The model’s performance is validated in real-world scenarios by deploying it within a real-time English oral translation system used in business conferences to translate speeches from multiple parties simultaneously. In order to evaluate the model’s performance, data is collected on the inference time pre- and post-optimization as well as user assessments of translation accuracy and fluency. A group of 10 users who participated in the conference were asked to rate the model on a scale from 1–10, with higher scores indicating superior translation performance. The results are presented in Table 8.
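The block-based local attention described above can be sketched as follows. This is a NumPy illustration under assumed dimensions: the frame count, feature size, and window length are hypothetical stand-ins for the fixed 1.5-second windows used in the system, and the global sparse-attention step is omitted:

```python
import numpy as np

def block_self_attention(x, window):
    """Scaled dot-product self-attention computed independently within
    fixed-size windows, reducing cost from O(T^2) to O(T * window)."""
    t, d = x.shape
    out = np.zeros_like(x)
    for start in range(0, t, window):
        blk = x[start:start + window]                   # frames in one window
        scores = blk @ blk.T / np.sqrt(d)               # local attention scores
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax per row
        out[start:start + window] = weights @ blk
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(96, 16))   # e.g. 96 speech frames of 16-dim features
y = block_self_attention(x, window=24)
print(y.shape)  # → (96, 16): same shape, each frame attends only locally
```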
The user feedback results presented in Table 8 demonstrate that the optimization strategy, which is based on the block self-attention mechanism, significantly diminishes the time complexity of the model and substantially enhances the processing efficiency for extended speech inputs. Considering the average time per sentence, the model’s inference time has decreased by approximately 38.5% compared to pre-optimization, thereby improving real-time performance. In terms of user ratings, the average value indicates a substantial enhancement in both translation accuracy and fluency following optimization.
Based on the aforementioned considerations, we conducted interference testing on the model to ascertain its robustness in real-world settings. We introduced office environment noise, quantified by the signal-to-noise ratio (SNR), to the input speech. Concurrently, other human voices were integrated into the speech sample to assess the model’s capability in isolating the primary speech signal and subsequently translating it. Metrics such as the BLEU score, METEOR, and user subjective evaluation were employed to gauge the model’s performance. The findings are presented in Table 9.
The comparative results before and after interference testing indicate a slight decrease in the BLEU score, METEOR, and user subjective evaluation of the model, following the addition of environmental noise and speech interference. However, the translation performance remains relatively robust, suggesting that the model’s translation performance is resilient.
- (7) Comparison of advanced methods
In order to thoroughly assess the performance of the proposed model, this study employs a selection of cutting-edge and recent models for comparative experiments.
FairseqS2T is a speech-to-text translation model, leveraging the capabilities of the Transformer architecture, and demonstrates robust context capture capabilities.
ESPnet-ST is a sequence-to-sequence learning-based speech translation model, adept at translating various language pairs.
Utilizing this foundation, we conducted longitudinal comparisons among a variety of models. We employed newly collected translation data, which was gathered every three months, as well as the accumulated speech data, to train the respective models. The performance of these models was evaluated based on their BLEU, TER, and METEOR scores, using the same test dataset for each. The outcomes of these evaluations are presented in Table 10.
In Table 10, from a horizontal comparison standpoint, the model in this article demonstrates BLEU, TER, and METEOR scores of 30.7, 15.6%, and 55.7 respectively when using the first set of data. In contrast, FairseqS2T yields BLEU, TER, and METEOR scores of 29.8, 16.3%, and 54.1 respectively, while ESPnet-ST records 29.6, 16.6%, and 53.7 respectively. From a longitudinal view, as more temporal data is introduced, the model in this paper shows marked improvements in BLEU, TER, and METEOR evaluations, effectively capturing a greater diversity of speech and translation details and enhancing translation accuracy and fluency. Both horizontal and vertical comparative experiments indicate that the model in this paper surpasses leading-edge models in terms of both BLEU and METEOR metrics, thereby showcasing exemplary translation performance.
- (8) Experimental analysis of statistical tests
In order to strengthen the significance of the results, we conducted an experimental analysis of statistical tests, the results of which are shown in Table 11.
Table 11 presents the BLEU scores, 95% confidence intervals, and p-values for the model discussed in this paper, juxtaposed against the bilingual attention NMT model and the context-aware NMT model across five test subsets. The findings indicate that the model’s BLEU scores consistently surpass those of the other two models in all test subsets. Moreover, with p-values all below 0.05, the enhancement in BLEU scores is statistically significant. Additionally, the 95% confidence intervals for the BLEU score differences do not encompass zero, further bolstering the assertion that the observed improvements are not attributable to chance. These results suggest that the model introduced in this paper is superior in capturing the semantic nuances of spoken language and generating precise translations. The incorporation of p-values and confidence intervals lends greater validity to the experimental results, offering a more robust evaluation of the model’s performance.
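Confidence intervals of this kind are commonly obtained by paired bootstrap resampling over the test subsets. The sketch below illustrates the idea with hypothetical per-subset BLEU scores, not the actual values from Table 11:

```python
import random

def paired_bootstrap_ci(ours, baseline, n_resamples=10_000, seed=0):
    """95% bootstrap confidence interval for the mean BLEU difference;
    an interval excluding zero indicates a significant improvement."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ours, baseline)]
    means = sorted(
        sum(rng.choice(diffs) for _ in diffs) / len(diffs)
        for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]

# Hypothetical per-subset BLEU scores: our model vs. one baseline
ours     = [31.2, 30.1, 30.9, 30.4, 30.9]
baseline = [28.5, 27.6, 28.3, 27.9, 28.2]
low, high = paired_bootstrap_ci(ours, baseline)
print(low > 0)  # → True: the interval excludes zero
```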
5. Discussion
In the experimental analysis, this article conducts comparisons from four aspects: translation accuracy, translation fluency, synchronization speed, and translation consistency.
- (1) Translation accuracy: from the comparison results of translation accuracy, the BLEU scores of the model in this article in various test subsets are generally higher than those of bilingual attention NMT model and context-aware NMT model. This represents that by integrating context-awareness and attention mechanisms across modalities, the meaning of the original text can be more accurately captured in oral translation generation, effectively improving translation quality;
- (2) Translation fluency: from the comparison results of translation fluency, the model in this article has lower perplexity and grammar error rate, resulting in smoother generated translation text with fewer grammar errors. Compared to the limitations of bilingual attention and context-aware NMT model in handling long sequences, the model in this article uses multimodal feature extraction and fusion to make the translation results smoother and more natural;
- (3) Synchronization Speed: A comparative analysis reveals that the model proposed in this study processes an average of 265.7 tokens per second. This is approximately 17.9% and 16.8% faster than the bilingual attention and context-aware NMT models, respectively. Such results suggest that our model can generate translation outcomes more swiftly, thereby reducing latency and enhancing user experience during oral translation tasks;
- (4) Translation Consistency: Upon comparing the translation consistency results, the model in this study achieves a mean TER of 15.6%. This value is notably lower than those obtained by bilingual attention and context-aware NMT models. Such results suggest that the discrepancy between the translations produced by our model and the reference translations is minimal, indicating superior translation consistency.
Despite the model’s performance enhancements in several areas, it maintains certain limitations. First, in specific situations like pronounced accents or overlapping speech, the model’s performance might not be at its best. Although the model has been trained using a substantial volume of experimental data—encompassing a variety of accents such as Indian, Australian, Irish, Scottish, and Southern English—it might grapple with recognizing particularly heavy accents. Moreover, overlapping speech can introduce noise and interference, thereby challenging the model’s precise distinction and processing of individual speech segments. Future studies will aim to bolster the model’s efficiency in these demanding contexts to boost its resilience and versatility. Secondly, the computational demands of this model are considerable. Its multimodal architecture augments translation efficacy by merging speech and text data, but this also elevates computational intricacy and resource usage. The model’s training necessitates substantial computational resources, encompassing top-tier GPUs, multi-core CPUs, and vast memory capacities. This might restrict the model’s utility in settings with limited resources for language acquisition. Lastly, the model’s training is contingent upon high-caliber corpora. The existing data’s diversity and richness could, to a degree, constrain the model’s universality. A comprehensive analysis is still needed to gauge the model’s performance in intricate translation scenarios.
To address the limitations highlighted above, several avenues can be explored in future research. Firstly, further optimization of the model architecture and algorithms is crucial to reduce computational costs. Incorporating lightweight neural network designs and utilizing model compression techniques can substantially decrease the resource demands of the model without compromising translation quality. Secondly, enhancing the model’s capability to manage heavy accents and overlapping speech is essential. This can be achieved by expanding the diversity of accent data in the training corpus and refining the model’s speech recognition and separation functionalities. Moreover, increasing the variety of training data and employing data augmentation methods can bolster the model’s generalization capacity and adaptability to intricate translation contexts. Investigating novel multimodal fusion strategies and attention mechanisms might offer innovative approaches to elevate the model’s translation efficacy. Lastly, improving the model’s interpretability is vital, as it allows researchers to gain deeper insights into its decision-making process, facilitating continuous optimization and enhancement.
In conclusion, while the model introduced in this study represents a step forward in English interpretation, there remains significant potential for further advancements. Through a comprehensive comparative analysis with existing research and an honest assessment of our model’s limitations, we hope to pave the way for future studies and foster the ongoing evolution of English interpretation technology.
6. Conclusion
In response to the growing demand for high-quality English oral translation, this paper examines the utilization and enhancement of cross-modal semantic understanding and synchronous improvement in this field. By harnessing and synthesizing multimodal data from both audio and text sources, and by integrating context-awareness and attention mechanisms, the study not only boosts the accuracy and fluency of translations but also elevates the synchronization rate between audio and text in oral translation. This leads to a more consistent translation output. While the research offers novel insights and theoretical value for the intelligent evolution of English oral translation, it is not without limitations. The model’s training is contingent upon top-tier corpora. The diversity and abundance of the existing datasets might constrain the model’s generalization capacity. Furthermore, the model’s efficacy in intricate translation scenarios warrants a deeper analysis. The LibriSpeech and TED Talk fusion dataset employed in this study encompasses a broad spectrum of situations. However, real-world speech data often entails challenges such as noise and specialized jargon. The model’s performance in these multifaceted spoken language translation contexts requires additional refinement. Future studies might explore the integration of corpus development and application scenario expansion with data augmentation techniques to further enrich model learning and training via stylistic variations and other means, thereby continually advancing the quality of English oral translation.
References
- 1. Wang JP. Translation activities on oral translation and written translation. Applied Translation. 2021;15(2):1–11.
- 2. Carr S, Sun S. Ensuring accuracy and quality for oral examinations in translation. Assessment & Evaluation in Higher Education. 2022;47(6):830–42.
- 3. Galassi A, Lippi M, Torroni P. Attention in Natural Language Processing. IEEE Trans Neural Netw Learn Syst. 2021;32(10):4291–308. pmid:32915750
- 4. Ashengo YA, Aga RT, Abebe SL. Context based machine translation with recurrent neural network for English–Amharic translation. Machine Translation. 2020;35(1):19–36.
- 5. Fan A, Bhosale S, Schwenk H, Ma Z, El-Kishky A, Goyal S, et al. Beyond english-centric multilingual machine translation. J Machine Learning Res. 2020;22(107):1–48.
- 6. Khan SN, Usman IA. A model for english to urdu and hindi machine translation system using translation rules and artificial neural network. Int Arab J Inf Technol. 2019;16(1):125–31.
- 7. Saini S, Sahula VA. A novel model based on sequential adaptive memory for English–Hindi translation. Cognitive Computation and Systems. 2021;3(2):142–53.
- 8. Wang X. Translation correction of English phrases based on optimized GLR algorithm. J Intelligent Systems. 2021;30(1):868–80.
- 9. Koul N, Manvi SS. A proposed model for neural machine translation of Sanskrit into English. Int J Information Technol. 2021;13(1):375–81.
- 10. Wu Y. The medial turn and multimodal translation studies. J Foreign Languages. 2021;44(1):115–23.
- 11. Li X, Tang Q. A visualized overview of domestic research on multimodal translation: Status quo, problems and suggestions. J University of Sci Technol Beijing (Social Sciences Edition). 2021;37(5):534–42.
- 12. Peng Y, Qi J. Reinforced cross-media correlation learning by context-aware bidirectional translation. IEEE Transactions on Circuits Systems for Video Technol. 2019;30(6):1718–31.
- 13. Hernández A, Amigó JM. Attention mechanisms and their applications to complex systems. Entropy (Basel). 2021;23(3):283. pmid:33652728
- 14. Kang L, He S, Wang M, Long F, Su JB. Bilingual attention based neural machine translation. Appl Intelligence. 2023;53(4):4302–15.
- 15. Premjith B, Kumar MA, Soman KP. Neural machine translation system for English to Indian language translation using MTIL parallel corpus. J Intelligent Syst. 2019;28(3):387–98.
- 16. Sharma VK, Mittal N, Vidyarthi A. Context-based translation for the out of vocabulary words applied to hindi-english cross-lingual information retrieval. IETE Technical Rev. 2022;39(2):276–85.
- 17. Ramadani RA, Sudarma M, Dharma Putra I, Giriantari IAD. A new technology on translating Indonesian spoken language into Indonesian sign language system. Int J Electric Comput Eng. 2021;11(4):3338–46.
- 18. Liu J, Ding X, Luo X. Survey of multimodal deep learning. Appl Res Comput. 2020;37(6):1601–13.
- 19. Sulistyo DA, Wibawa AP, Prasetya DD, Ahda FAI. LSTM-Based Machine Translation for Madurese-Indonesian. J Appl Data Sci. 2023;4(3):189–99.
- 20. Yuan Q, Dai Y, Li G. Exploration of English speech translation recognition based on the LSTM RNN algorithm. Neural Computing and Applications. 2023;35(36):24961–70.
- 21. Khan A, Sarfaraz AR. RNN-LSTM-GRU based language transformation. Soft Computing. 2019;23(24):13007–24.
- 22. Liu H, Zhang Z, Wang Y. A Review on Main Optimization Methods of BERT. Data Analysis and Knowledge Discovery. 2021;5(1):3–15.
- 23. Wang Y, Lin M, Li Y, Zhao J. Research on Embedded Text Topic Model Based on BERT. J Comp Engineering & Applications. 2023;59(1):169–79.
- 24. Wang H, Li X, Li Z, Ma C, Ren Z, Yang D. Survey of multimodal pre-training models. J Computer Applications. 2023;43(4):991–1004.
- 25. Wu Y, Li H, Yao T, He X. A survey of multimodal information processing frontiers: Application, fusion and pre-training. J Chinese Information Processing. 2022;36(5):1–20.
- 26. Li Y, Shan Y, Liu Z, Che C, Zhong Z. Transformer fast gradient method with relative positional embedding: a mutual translation model between English and Chinese. Soft Computing. 2023;27(18):13435–43.
- 27. Li H, Tuo X. Research on an English translation method based on an improved transformer model. J Intelligent Syst. 2022;31(1):532–40.
- 28. Badawi SA. A transformer-based neural network machine translation model for the kurdish sorani dialect. UHD J Sci Technol. 2023;7(1):15–21.
- 29. Chen C, Li M, Luo Q. A transformer-based unified neural network for quality estimation of machine translation. J Chinese Information Processing. 2021;35(6):47–54.
- 30. Zhang Y, Liu J, Song D, Huang D. Incorporating data enhancement and diverse decoding into neural machine translation. Journal of Xiamen University (Natural Science). 2021;60(4):670–4.
- 31. Soydaner D. Attention mechanism in neural networks: where it comes and where it goes. Neural Computing and Applications. 2022;34(16):13371–85.
- 32. Liu HI, Chen WL. X-transformer: a machine translation model enhanced by the self-attention mechanism. Appl Sci. 2022;12(9):4502–16.
- 33. Fu P, Liu D, Yang H. LAS-transformer: An enhanced transformer based on the local attention mechanism for speech recognition. Information. 2022;13(5):250–62.
- 34. Wang P. A study of an intelligent algorithm combining semantic environments for the translation of complex English sentences. Journal of Intelligent Systems. 2022;31(1):623–31.
- 35. Bian Y, Li J, Zhao Y. Model optimization of English intelligent translation based on outlier detection and machine learning. Soft Computing. 2023;27(14):10297–303.
- 36. Csillik O, Belgiu M, Asner GP, Kelly M. Object-based time-constrained dynamic time warping classification of crops using Sentinel-2. Remote Sensing. 2019;11(10):1257–82.
- 37. Cheng Q, Li J. Improved dynamic time warping for similar metrics and trajectory clustering. Electronic Measurement Technology. 2021;44(23):1–5.
- 38. Rivera-Trigueros I. Machine translation systems and quality assessment: a systematic review. Language Resources and Evaluation. 2022;56(2):593–619.
- 39. Salmani B. Dynamic assessment of translation quality: A new approach to translator development. Teaching English Language. 2023;17(1):335–58.
- 40. Shukla M, Goyal I, Gupta B, Sharma J. A comparative study of ChatGPT, Gemini, and Perplexity. International Journal of Innovative Research in Computer Science & Technology. 2024;12(4):10–5.