
Performance of machine translators in translating French medical research abstracts to English: A comparative study of DeepL, Google Translate, and CUBBITT



Non-English speaking researchers may find it difficult to write articles in English and may be tempted to use machine translators (MTs) to facilitate their task. We compared the performance of DeepL, Google Translate, and CUBBITT for the translation of abstracts from French to English.


We selected ten abstracts published in 2021 in two high-impact bilingual medical journals (CMAJ and Canadian Family Physician) and used nine metrics of Recall-Oriented Understudy for Gisting Evaluation (ROUGE-1 recall/precision/F1-score, ROUGE-2 recall/precision/F1-score, and ROUGE-L recall/precision/F1-score) to evaluate the accuracy of the translation (scores ranging from zero to one [= maximum]). We also used the fluency score assigned by ten raters to evaluate the stylistic quality of the translation (ranging from ten [= incomprehensible] to fifty [= flawless English]). We used Kruskal-Wallis tests to compare the medians between the three MTs. For the human evaluation, we also examined the original English text.


Differences in medians were not statistically significant for the nine metrics of ROUGE (medians: min-max = 0.5246–0.7392 for DeepL, 0.4634–0.7200 for Google Translate, 0.4815–0.7316 for CUBBITT, all p-values > 0.10). For the human evaluation, CUBBITT tended to score higher than DeepL, Google Translate, and the original English text (median = 43 for CUBBITT, vs. 39, 38, and 40, respectively, p-value = 0.003).


The three MTs performed similarly when tested with ROUGE, but CUBBITT was slightly better than the other two using human evaluation. Although we only included abstracts and did not evaluate the time required for post-editing, we believe that French-speaking researchers could use DeepL, Google Translate, or CUBBITT when writing articles in English.


The dominance of English as the publishing language can penalize non-English speaking researchers seeking to share their work, as the stylistic quality of articles can have an impact on their likelihood of being published and/or cited by other publications [1]. In order to improve their chances of publication in English-language journals, researchers often rely on professional translation services to improve the style of their articles before submission [1]. However, these services are expensive, use translators who are not necessarily experts in the field, and are time-consuming, which often greatly delays the submission of articles [1].

Machine translators (MTs) are increasingly used in everyday life [2, 3]. Indeed, thanks to neural networks, the quality of translation has greatly improved in recent decades [4–6], and MTs do not require advanced computer skills. They are also used in medicine, for example to translate electronic medical records and to improve patient management in clinical practice, with mixed results [3, 7–15]. For example, Taira et al. assessed the use of Google Translate for translating commonly used Emergency Department discharge instructions into seven languages [9]. While the overall meaning was retained in 82.5% of translations, accuracy rates varied across languages, and the study concluded that Google Translate should not be relied upon for patient instructions due to inconsistency in translation quality. In another study, Turner et al. assessed the feasibility of using Google Translate followed by human post-editing to translate public health materials from English to Chinese [10]. The results showed that common machine translation errors and challenges in post-editing led to lower-quality translations, suggesting the need for improvements in machine translation and post-editing processes before routine use in public health practice. However, a previous study by the same research team suggested that Google Translate followed by post-editing could yield translations of comparable quality in a more efficient and cost-effective manner for English to Spanish [11]. Blind ratings by two bilingual public health professionals indicated that human translation and machine translation followed by human post-editing were overall equivalent: 33% preferred human translation, 33% preferred machine translation followed by human post-editing, and 33% found both translations to be of equal quality.
According to the authors, these divergent results between the two studies are linked to significant differences between English and Chinese, for example in syntactic structures. Khoong et al. also found marked differences between Spanish and Chinese when using Google Translate for translations of emergency department discharge instructions [12]. Among the 100 sets of patient instructions containing 647 sentences, Google Translate accurately translated 92% of sentences into Spanish and 81% into Chinese. A minority of the inaccuracies in the translations had the potential for clinically significant harm.

Only a few studies have evaluated the use of MTs in academic research, and they mainly focused on the extraction of relevant data from non-English articles [16–18]. For example, Balk et al. compared Google Translate’s ability to translate non-English language studies for systematic reviews in five languages and found variations in accuracy [16]. Spanish translations demonstrated the highest correct extraction rate (93% of items correctly extracted more than half the time), followed by German and Japanese (89%), French (85%), and Chinese (78%). According to the authors, caution is advised when using machine translation, as there is a trade-off between achieving comprehensive reviews and the potential for translation-related errors.

The objective of the current study was to assess the performance of three MTs, namely DeepL, Google Translate, and CUBBITT, in translating medical abstracts from French to English. We aimed to compare the accuracy of translations using nine metrics of Recall-Oriented Understudy for Gisting Evaluation (ROUGE), while also considering the stylistic quality through human evaluation. This study addressed the challenges faced by non-English speaking medical researchers and explored the practicality of using machine translation in this context. By testing our hypothesis that MTs may exhibit variations in translating medical research, we aimed to provide valuable insights for French-speaking researchers seeking to publish in English-language journals.


Selection of abstracts and machine translators (MTs)

We selected the two most prestigious general medical journals (according to the 2020 Journal Citation Reports impact factor) that translate all (Canadian Family Physician, impact factor = 3.3) or some (CMAJ, impact factor = 8.3) of the abstracts of published articles into French. We limited this preliminary study to general medical journals and did not include medical specialty or basic science journals that may use more technical language. We selected high-impact journals in a bilingual English/French country (Canada) to ensure that the French abstracts included in the study were of high quality.

We randomly extracted ten articles published in 2021 with abstracts available in French, five published in CMAJ (abstracts #1 to #5) and five in Canadian Family Physician (abstracts #6 to #10). We included ten articles in the study to obtain a variety of topics and study designs. Taken together, these ten abstracts contained 12,153 words in total.

Then, in spring 2022, we selected all MTs that allowed free translation of at least 5,000 characters from French to English. Three MTs met these criteria: DeepL, Google Translate, and CUBBITT (Charles University Block-Backtranslation-Improved Transformer Translation). At the time of the study, DeepL was free up to 5,000 characters, with 26 languages available for translation; Google Translate was also free up to 5,000 characters, with over 100 languages supported; CUBBITT had no character limit, but only six languages were available, including French and English.

Selection of metrics to evaluate the accuracy of the translation

We selected nine metrics of Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [1921], namely ROUGE-1 recall/precision/F1-score, ROUGE-2 recall/precision/F1-score, and ROUGE-L recall/precision/F1-score.

ROUGE-N measures the number of identical n-grams between the text generated by a translator and a reference text considered the gold standard. An n-gram is a sequence of n consecutive words: a unigram (1-gram) is a single word, and a bigram (2-gram) is two consecutive words. The reference is a human-made optimal result. Thus, for ROUGE-1 and ROUGE-2, we measure the match rate of unigrams and bigrams, respectively, between the translated text and the reference. ROUGE-L measures the longest sequence of words that appears in the same order in both the translated text and the reference. The idea behind this metric is that a longer shared sequence indicates greater similarity between the two versions.

ROUGE-N and ROUGE-L are each evaluated using three metrics. The recall metric divides the number of matching n-grams (for ROUGE-N), or the length of the longest common word sequence (for ROUGE-L), by the total number of n-grams in the reference. It verifies that the translated text captures all the information contained in the reference. The precision metric is calculated in almost the same way, except that the denominator is the number of n-grams in the translated text rather than in the reference. It checks that the translator does not produce irrelevant words. Finally, the F1-score combines recall and precision into an overall measure of translation accuracy.
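The definitions above can be sketched in a few lines of Python. This is a minimal illustration of the formulas, not the implementation used in the study; real evaluations typically rely on an existing ROUGE package.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams (tuples of n consecutive words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    """ROUGE-N recall, precision, and F1 between two whitespace-tokenized texts."""
    cand, ref = candidate.split(), reference.split()
    # Counter intersection clips repeated n-grams to the smaller count.
    overlap = sum((ngrams(cand, n) & ngrams(ref, n)).values())
    recall = overlap / max(len(ref) - n + 1, 1)       # divide by n-grams in reference
    precision = overlap / max(len(cand) - n + 1, 1)   # divide by n-grams in candidate
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return recall, precision, f1

def lcs_len(a, b):
    """Length of the longest common subsequence (word level), used by ROUGE-L."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """ROUGE-L recall, precision, and F1 based on the longest common subsequence."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    recall, precision = lcs / len(ref), lcs / len(cand)
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return recall, precision, f1
```

For example, comparing "the cat sat" with the reference "the cat ran" gives a ROUGE-1 recall and precision of 2/3 (two of three unigrams match) and a ROUGE-2 recall and precision of 1/2 (one of two bigrams matches).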

Table 1 shows how to calculate the nine metrics using an example for the translated text and an example for the reference text. Implementing these metrics is easy in Python [20]. There are no recognized criteria defining the ROUGE scores above which an MT can be considered accurate. These measures are mainly used to compare MTs with each other, knowing that the higher the scores, the higher the accuracy of the translation. The main drawback of ROUGE is that it measures syntactic rather than semantic matches. Thus, if two sequences have the same meaning but use different words to express it, the scores could be relatively low. For this reason, we also included a human evaluation of translation performance, by analyzing the fluency score. This score assesses whether the text contains errors that native speakers would not have made or, more simply, whether the text is written in good English [22]. The best way to evaluate fluency is to use a multi-point fluency scale, with anchor text for each value [22]:

How do you judge the fluency of the translation?

  • 5 → Flawless English
  • 4 → Good English
  • 3 → Non-Native English
  • 2 → Disfluent English
  • 1 → Incomprehensible

ROUGE and fluency have been used in a large number of studies to evaluate texts. Koto et al. identified 106 studies using ROUGE and 45 studies examining fluency [23].

Table 1. Calculation of the different metrics of Recall-Oriented Understudy for Gisting Evaluation (ROUGE) using an example for the translated text and an example for the reference text.

Data collection

We translated the ten abstracts from French to English using the three selected tools. The ten original abstracts in English and the versions obtained after translation by the three MTs are available elsewhere. Then, we evaluated the accuracy of the translation using the nine metrics of ROUGE, taking the original English abstract as reference text.

We also asked ten native English speakers (five women and five men) to rate the fluency of the abstracts, including the original version, using the multi-point fluency scale. We added up the scores of the ten raters to get the overall score, ranging from 10 to 50. All study raters were acquaintances of the investigators, with a scientific background. In detail, there were five physicians and a fifth-year medical student, two non-governmental organization (NGO) workers, a data scientist, and a manager. To avoid biasing the human evaluation, the raters were told that all versions were authored by translators in training, and the order of the versions was different for each abstract.
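The rating setup described above can be sketched as follows. The seed and the rating values are hypothetical, for illustration only; only the structure (randomized version order per abstract, overall score as the sum of ten ratings) comes from the study.

```python
import random

# Each abstract's four versions are presented to raters in a different
# random order, so raters cannot infer which version is which.
versions = ["original", "DeepL", "Google Translate", "CUBBITT"]
rng = random.Random(2022)  # fixed seed is illustrative, for reproducible orders
order_per_abstract = {i: rng.sample(versions, k=4) for i in range(1, 11)}

# Overall fluency score for one version of one abstract: the sum of the
# ten raters' 1-5 ratings, so it ranges from 10 to 50.
ratings = [4, 5, 4, 3, 4, 5, 4, 4, 3, 5]  # hypothetical ratings by the ten raters
overall_fluency = sum(ratings)
```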

Statistical analyses

The nine metrics of ROUGE and the fluency score were first recorded separately for each version of each abstract. For ROUGE, the number of texts analyzed was 30 (i.e., the ten abstracts for each of the three MTs), whereas for the fluency score this number was 40, since the original versions of the ten abstracts were also analyzed. The results were then summarized using medians and interquartile ranges (IQRs) for each of the MTs for ROUGE (n = 3), and for each of the MTs and the original version for the fluency score (n = 4). We summarized the results using medians and IQRs because the data did not follow a normal distribution. We used Kruskal-Wallis tests to assess whether the differences in medians were statistically significant. The assumption of a similar distribution shape for all groups was met. When there were significant differences between groups, we used Dunn tests, with adjustments for multiple comparisons (Sidak), to identify the specific groups that differed from each other.
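As an illustration of the main test, the Kruskal-Wallis H statistic can be computed by hand. This is a minimal sketch without tie correction; the study's analyses were run in Stata, and in Python one would normally use scipy.stats.kruskal (which handles ties and returns a p-value), followed by a Dunn post-hoc test from a package such as scikit-posthocs.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction): rank the pooled
    observations, then compare the groups' rank sums."""
    pooled = sorted(x for g in groups for x in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # assumes all values distinct
    n = len(pooled)
    rank_sum_term = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)
```

For three groups of three observations with fully separated values, e.g. [1, 2, 3], [4, 5, 6], and [7, 8, 9], the statistic works out to H = 7.2, which would then be compared to a chi-squared distribution with 2 degrees of freedom.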

We also extracted the number of abstracts with the highest score for the nine metrics of ROUGE for DeepL, Google Translate, and CUBBITT respectively. We did the same for the fluency score, for DeepL, Google Translate, CUBBITT, and the reference text respectively.

Finally, for the human evaluation, we examined the inter-rater agreement between the ten raters for the reference text and the versions translated by DeepL, Google Translate, and CUBBITT. We used the ‘kappaetc’ command in Stata to calculate the quadratic-weighted agreement coefficients (percent agreement and Gwet’s AC) [24, 25]. Statistical significance was set at a two-sided p-value of ≤0.05. All analyses were performed with Stata 15.1 (College Station, USA).
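The quadratic-weighted observed agreement underlying these coefficients can be sketched as follows. Note that this shows only the weighting step: Gwet's AC additionally corrects the observed agreement for chance agreement, which Stata's kappaetc handles.

```python
from itertools import combinations

def quadratic_weight(a, b, k_min=1, k_max=5):
    """Quadratic agreement weight for two ordinal ratings on a 1-5 scale:
    1 for identical ratings, decreasing with the squared distance."""
    return 1 - ((a - b) ** 2) / ((k_max - k_min) ** 2)

def weighted_agreement(ratings_by_rater):
    """Quadratic-weighted observed agreement, averaged over all rater pairs
    and all rated texts. ratings_by_rater: one list of ratings per rater,
    all lists aligned on the same texts."""
    total = count = 0
    for r1, r2 in combinations(ratings_by_rater, 2):
        for a, b in zip(r1, r2):
            total += quadratic_weight(a, b)
            count += 1
    return total / count
```

For example, two raters who score two texts (5, 4) and (4, 4) disagree by one point on the first text, giving a weight of 1 − 1/16 = 0.9375 there and 1.0 on the second, for an average weighted agreement of 0.96875.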

Ethical considerations

Since this study did not involve the collection of personal health-related data, it did not require ethical review, according to current Swiss law.


The raw data (i.e., the scores obtained for the nine metrics of ROUGE and the fluency score) are available in the Open Science Framework.

Tables 2 (for ROUGE) and 3 (for the fluency score) summarize these data using medians and IQRs, as well as the number of abstracts with the highest score. Fig 1 presents the data in graphical form. ROUGE median scores were higher for DeepL than for CUBBITT, and higher for CUBBITT than for Google Translate, except for ROUGE-2 F1 and ROUGE-2 recall, for which scores were higher for Google Translate than for CUBBITT. However, none of these differences was statistically significant (medians ranging from 0.5246 to 0.7392 for DeepL, from 0.4634 to 0.7200 for Google Translate, and from 0.4815 to 0.7316 for CUBBITT, all p-values > 0.10). For the overall fluency score (the sum of the ten raters’ scores), CUBBITT tended to score higher than DeepL, Google Translate, and the original English text (median = 43 for CUBBITT, vs. 39 for DeepL, 38 for Google Translate, and 40 for the original English text, p-value = 0.003). The difference in median score was borderline significant between CUBBITT and the reference text (p-value = 0.05), whereas it was statistically significant between CUBBITT and DeepL (p-value = 0.03) and between CUBBITT and Google Translate (p-value = 0.001). All ten abstracts received individual scores ranging from 3 to 5 (no score was below 3), even for Google Translate, which achieved the lowest overall median score.

Fig 1. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and fluency median scores.

Data are presented for the three machine translators (MTs) separately for ROUGE, and for the three MTs and the reference text for the fluency score. Median scores for ROUGE are presented as percentages.

Table 2. Median score (IQR) and number of abstracts with the highest score for the nine Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics used to evaluate translations of ten abstracts of medical scientific articles by three machine translators (DeepL, Google Translate, and CUBBITT).

Table 3. Median fluency score (IQR) and number of abstracts with the highest score.

This score was used to assess the style of ten original English abstracts (= reference text) and the versions translated by DeepL, Google Translate, and CUBBITT.

ROUGE scores were highest for six to nine abstracts for DeepL (depending on the score considered), one to three abstracts for CUBBITT, and zero to two abstracts for Google Translate. Fluency scores were highest for eight abstracts for CUBBITT, two abstracts for DeepL, and zero abstracts for Google Translate and the reference text.

Finally, the inter-rater agreement between the ten raters was high (Table 4). The raters agreed on more than 97% of the abstracts (p-values < 0.001), and the chance-corrected Gwet’s agreement coefficients were high (p-values < 0.001).

Table 4. Inter-rater agreement between the ten native English-speaking raters who evaluated the style of ten original English abstracts (= reference text) and the versions translated by DeepL, Google Translate, and CUBBITT.


Main findings

We compared the performance of three MTs (i.e., DeepL, Google Translate, and CUBBITT) for translating medical research abstracts from French to English. In this preliminary study, we evaluated five abstracts published in CMAJ and five abstracts published in Canadian Family Physician. We found that the three MTs performed similarly when tested with ROUGE, but CUBBITT was slightly better than the other two using human evaluation. We also found that for human evaluation CUBBITT tended to perform better than the original English text.

Comparison with existing literature

MTs are increasingly used in medicine, particularly to translate electronic medical records and improve patient care [3, 7–15]. Current evidence also suggests that they are relatively reliable for extracting data from non-English articles in systematic reviews [16, 17]. However, there is little data available on their effectiveness in academic research. Using a subjective evaluation method, Takakusagi et al. investigated the accuracy of DeepL in translating an entire medical article from Japanese into English [26]. The authors compared the original Japanese article with the English version, which was back-translated into Japanese by medical translators. They found that the overall accuracy was high, with an average match rate of 94%. However, the accuracy varied between sections of the article, with the ’Results’ section showing the highest accuracy (100%) and the ’Materials and Methods’ section showing the lowest accuracy (89%). The authors limited their analysis to the accuracy of the meanings and did not assess the stylistic quality of the translation.

In our study, we found that DeepL, Google Translate, and CUBBITT are three effective tools for accurately and fluently translating abstracts of medical articles from French into English. Surprisingly, CUBBITT scored higher in terms of fluency than the original English abstracts published in two high-impact English-language journals. These results are, however, in line with a recent study conducted by the developers of CUBBITT, which showed that the quality of translations done with this tool approached that of professional translators in terms of fluency [27].

Unlike in the biomedical sciences, a large amount of data on machine translation is available in the fields of educational linguistics, second language studies, and foreign language education. Two review articles have recently been published [28, 29]. These papers summarized the key concepts, insights, and findings, organizing them around questions such as how learners use MTs, what instructors and learners think about MTs, and how MTs affect language learning. Students have diverse opinions concerning the appropriateness, reliability, and ethical considerations of machine translation tools [30–33]. Learners generally hold favorable views of machine translation, believing it has the potential to assist their learning and enhance the quality of their second language writing. However, these positive perceptions are counterbalanced by concerns about machine translation accuracy, an understanding of its limitations, and conflicting interpretations of what constitutes ethical behavior. The literature exploring the potential advantages of machine translation in language learning has not produced definitive findings. However, it suggests two potential trends: MTs might serve as a valuable resource for improving learners’ metalinguistic understanding [34–37], and they can aid students in achieving better results in translation and second language writing tasks [38, 39]. Some of these studies focused on the use of Google Translate [40–42], yet neither review included studies that made direct comparisons between MTs.

To our knowledge, few comparative studies are available in the literature. Hidalgo-Ternero assessed the performance of Google Translate and DeepL in translating Spanish idiomatic expressions into English, including both common and less frequent variants, with a focus on whether these idioms were presented in continuous or discontinuous forms [43]. The study found that Google Translate and DeepL performed well in accurately translating high-frequency idiomatic expressions, achieving an average accuracy rate of 86% and 89%, respectively. However, they struggled to detect and translate lower-frequency phraseological variants of these idioms, indicating limitations in handling less common idiomatic expressions. Focusing on human post-editing efforts, another study compared the performance of three MTs for translating Cochrane plain language health information from English to Russian [13]. The authors found that Google Translate performed best, slightly better than DeepL, while Microsoft Translator performed less well.

Our study was not designed to estimate the amount of time needed by researchers for post-editing (i.e., the time needed to correct the text after it has been translated into English by the MT). Given the results obtained for the fluency score, however, post-editing should be relatively quick. Indeed, even without post-editing, the evaluators judged the stylistic quality of the translation to be better (for CUBBITT) or almost as good (for DeepL and Google Translate) as that of the original text.

This preliminary study included only abstracts, which are generally written to be more accessible and more quickly “digested” than full articles. We did not evaluate the performance of MTs with full articles. Translation tools often lack specialized medical terminology, which can make them unsuitable for translating highly specialized medical articles. Further studies evaluating the performance of MTs for full articles and for various disciplines are therefore needed. However, we believe that non-English-speaking researchers who do not wish to rely on the services of professional translators (e.g., because of their cost) may find it worthwhile to use DeepL, Google Translate, or CUBBITT for some of their work that is not highly specialized. Indeed, the time spent on post-editing after using these MTs would probably be far outweighed by the time they would otherwise spend translating scientific articles themselves or writing them directly in English.

Strengths and limitations

Our study has several strengths. We incorporated a dual assessment approach, combining quantitative ROUGE metrics and qualitative fluency evaluations by native English speakers. This ensures a comprehensive evaluation of machine translation tools, providing a nuanced understanding of both syntactic and semantic aspects of the translations. In addition, focusing on medical texts, our method aligns with practical scenarios faced by non-English-speaking researchers. By evaluating tools in a domain-specific context, our approach offers insights directly applicable to researchers in the medical field, enhancing the relevance of the study. Finally, the inclusion of raters with varied scientific backgrounds enhances the robustness of the fluency assessment. This diversity ensures a broad perspective on the quality of translations, considering the expectations and language nuances across different professional domains.

However, our study also has some weaknesses. First, we included only French abstracts published in two general medical journals. It is not certain that the results would have been similar for full articles, other languages, and/or other journals. The selection of two bilingual journals introduces a potential limitation, as the study’s outcomes rely on the quality of the French abstracts published in these journals, which were themselves translated from the English originals. While the stylistic quality of these versions was generally deemed good or excellent by the raters, it is essential to acknowledge the influence of the initial French abstracts on the translation process. Second, although ROUGE is a validated instrument that is often used to evaluate the performance of MTs, it does not measure semantic matches. If two sequences have the same meaning but use different words to express it, the assigned score could be relatively low. Third, only ten abstracts were included in the study and only ten raters were recruited for the human evaluation. We included only ten abstracts because it was important for the evaluators to carefully assess the four versions of each abstract (the original version and the versions from DeepL, Google Translate, and CUBBITT), and this was a time-consuming task. Future studies may consider including a larger sample to obtain more robust results. Finally, we selected abstracts from the year 2021 to ensure that the texts were current and reflected the latest developments in medicine. Future studies may encompass a broader time frame to examine variations over the years.


Our study provides a thorough examination of the performance of MTs—DeepL, Google Translate, and CUBBITT—in the specific context of translating medical research abstracts from French to English. This focused evaluation contributes to a nuanced understanding of the applicability of these tools in the medical domain. We not only assessed the accuracy of translations using established metrics but also delved into the fluency of the translated text. Our study aims to highlight the practical utility of MTs for non-English-speaking researchers in medicine.

We found that the three MTs performed similarly when tested with ROUGE, but CUBBITT was slightly better than the other two using human evaluation. We also found that in terms of stylistic quality CUBBITT tended to perform better than the original English text.

Although the study was limited to the analysis of abstracts published in general medical journals and did not evaluate the time required for post-editing, we believe that French-speaking researchers could benefit from using DeepL, Google Translate, or CUBBITT to translate articles written in French into English. Further studies would be needed to evaluate the performance of MTs with full articles and languages other than French.


  1. Ramírez-Castañeda V. Disadvantages in preparing and publishing scientific papers caused by the dominance of the English language in science: The case of Colombian researchers in biological sciences. PLOS ONE. 2020 Sep 16;15(9):e0238372. pmid:32936821
  2. Vieira LN, O’Sullivan C, Zhang X, O’Hagan M. Machine translation in society: insights from UK users. Lang Resour Eval. 2023 Jun 1;57(2):893–914.
  3. Dew KN, Turner AM, Choi YK, Bosold A, Kirchhoff K. Development of machine translation technology for assisting health communication: A systematic review. J Biomed Inform. 2018 Sep;85:56–67. pmid:30031857
  4. Hirschberg J, Manning CD. Advances in natural language processing. Science. 2015 Jul 17;349(6245):261–6. pmid:26185244
  5. Song R. Analysis on the Recent Trends in Machine Translation. Highlights Sci Eng Technol. 2022 Nov 10;16:40–7.
  6. Mondal SK, Zhang H, Kabir HMD, Ni K, Dai HN. Machine translation and its evaluation: a study. Artif Intell Rev. 2023 Sep 1;56(9):10137–226.
  7. Soto X, Perez-de-Viñaspre O, Labaka G, Oronoz M. Neural machine translation of clinical texts between long distance languages. J Am Med Inform Assoc. 2019 Dec 1;26(12):1478–87. pmid:31334764
  8. Randhawa G, Ferreyra M, Ahmed R, Ezzat O, Pottie K. Using machine translation in clinical practice. Can Fam Physician. 2013 Apr;59(4):382–3. pmid:23585608
  9. Taira BR, Kreger V, Orue A, Diamond LC. A Pragmatic Assessment of Google Translate for Emergency Department Instructions. J Gen Intern Med. 2021 Nov;36(11):3361–5. pmid:33674922
  10. Turner AM, Dew KN, Desai L, Martin N, Kirchhoff K. Machine Translation of Public Health Materials From English to Chinese: A Feasibility Study. JMIR Public Health Surveill. 2015;1(2):e17. pmid:27227135
  11. Turner AM, Bergman M, Brownstein M, Cole K, Kirchhoff K. A comparison of human and machine translation of health promotion materials for public health practice: time, costs, and quality. J Public Health Manag Pract. 2014;20(5):523–9. pmid:24084391
  12. Khoong EC, Steinbrook E, Brown C, Fernandez A. Assessing the Use of Google Translate for Spanish and Chinese Translations of Emergency Department Discharge Instructions. JAMA Intern Med. 2019 Apr;179(4):580–2. pmid:30801626
  13. Ziganshina LE, Yudina EV, Gabdrakhmanov AI, Ried J. Assessing Human Post-Editing Efforts to Compare the Performance of Three Machine Translation Engines for English to Russian Translation of Cochrane Plain Language Health Information: Results of a Randomised Comparison. Informatics. 2021 Mar;8(1):9.
  14. Dahal SB, Aoun M. Exploring the Role of Machine Translation in Improving Health Information Access for Linguistically Diverse Populations. Adv Intell Inf Syst. 2023 Apr 3;8(2):1–13.
  15. Herrera-Espejel PS, Rach S. The Use of Machine Translation for Outreach and Health Communication in Epidemiology and Public Health: Scoping Review. JMIR Public Health Surveill. 2023 Nov 20;9:e50814. pmid:37983078
  16. Balk EM, Chung M, Chen ML, Chang LKW, Trikalinos TA. Data extraction from machine-translated versus original language randomized trial reports: a comparative study. Syst Rev. 2013 Nov 7;2:97. pmid:24199894
  17. Jackson JL, Kuriyama A, Anton A, Choi A, Fournier JP, Geier AK, et al. The Accuracy of Google Translate for Abstracting Data From Non-English-Language Trials for Systematic Reviews. Ann Intern Med. 2019 Nov 5;171(9):677–9. pmid:31357212
  18. Zulfiqar S, Wahab MF, Sarwar MI, Lieberwirth I. Is Machine Translation a Reliable Tool for Reading German Scientific Databases and Research Articles? J Chem Inf Model. 2018 Nov 26;58(11):2214–23. pmid:30358403
  19. Lin CY, Och FJ. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (ACL ’04). Barcelona, Spain: Association for Computational Linguistics; 2004. p. 605–es.
  20. Briggs J. The Ultimate Performance Metric in NLP [Internet]. Medium. 2021 [cited 2023 Nov 5]. Available from:
  21. Lin CY. ROUGE: A Package for Automatic Evaluation of Summaries. In: Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics; 2004. p. 74–81.
  22. Olive J, Christianson C, McCary J. Handbook of natural language processing and machine translation: DARPA global autonomous language exploitation. New York: Springer; 2011.
  23. Koto F, Baldwin T, Lau JH. FFCI: A Framework for Interpretable Automatic Evaluation of Summarization. J Artif Intell Res. 2022 Apr 29;73.
  24. Klein D. Implementing a General Framework for Assessing Interrater Agreement in Stata. Stata J. 2018 Dec 1;18(4):871–901.
  25. Gwet KL. Handbook of inter-rater reliability: the definitive guide to measuring the extent of agreement among raters. Fourth edition. Gaithersburg, MD: Advanced Analytics, LLC; 2014. 410 p.
  26. Takakusagi Y, Oike T, Shirai K, Sato H, Kano K, Shima S, et al. Validation of the Reliability of Machine Translation for a Medical Article From Japanese to English Using DeepL Translator. Cureus. 2021 Sep;13(9):e17778.
  27. Popel M, Tomkova M, Tomek J, Kaiser Ł, Uszkoreit J, Bojar O, et al. Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nat Commun. 2020 Sep 1;11(1):4381. pmid:32873773
  28. Jolley JR, Maimone L. Thirty Years of Machine Translation in Language Teaching and Learning: A Review of the Literature. L2 Journal. 2022;14(1).
  29. Urlaub P, Dessein E. Machine translation and foreign language education. Front Artif Intell. 2022;5:936111. pmid:35937139
  30. Jin L, Deifell E. Foreign Language Learners’ Use and Perception of Online Dictionaries: A Survey Study. 2013;9(4).
  31. Maimone L, Jolley J. Free Online Machine Translation: Use and Perceptions by Spanish Students and Instructors. [cited 2023 Nov 5]; Available from:
  32. 32. Larson-Guenette J. “It’s just reflex now”: German Language Learners’ Use of Online Resources. Unterrichtspraxis Teach Ger. 2013;46(1):62–74.
  33. 33. Niño A. Exploring the use of online machine translation for independent language learning. Res Learn Technol [Internet]. 2020 Sep 4 [cited 2023 Nov 5];28. Available from:
  34. 34. Benda J. Google Translate in the EFL Classroom: Taboo or Teaching Tool? Writ Pedagogy. 2013;5(2):317–32.
  35. 35. Enkin E, Mejías-Bikandi E. Using online translators in the second language classroom: Ideas for advanced-level Spanish. Lat Am J Content Lang Integr Learn [Internet]. 2016 Jun 11 [cited 2023 Nov 5];9(1). Available from:
  36. 36. Niño A. Evaluating the use of machine translation post-editing in the foreign language class. Comput Assist Lang Learn. 2008 Feb 1;21(1):29–49.
  37. 37. Correa M. Leaving the “peer” out of peer-editing: Online translators as a pedagogical tool in the Spanish as a second language classroom. Lat Am J Content Lang Integr Learn. 2014 Apr 30;7(1):1–20.
  38. 38. Garcia I, Pena MI. Machine Translation-Assisted Language Learning: Writing for Beginners. Comput Assist Lang Learn. 2011;24(5):471–87.
  39. 39. Fredholm K. Effects of online translation on morphosyntactic and lexical-pragmatic accuracy in essay writing in Spanish as a foreign language. In: CALL Design: Principles and Practice—Proceedings of the 2014 EUROCALL Conference, Groningen, The Netherlands [Internet].; 2014 [cited 2023 Nov 5]. p. 96–101. Available from:
  40. 40. Kol S, Schcolnik M, Spector-Cohen E. Google Translate in Academic Writing Courses? EuroCALL Rev. 2018 Sep 30;26(2):50–7.
  41. 41. Fredholm K. Efectos del traductor de Google sobre la diversidad léxica: el desarrollo de vocabulario entre estudiantes de español como lengua extranjera: Effects of Google translate on lexical diversity: vocabulary development among learners of Spanish as a foreign language. Rev Nebrija Lingüíst Apl Enseñ Leng. 2019 Apr 29;13(26):98–117.
  42. 42. Lee SM. The impact of using machine translation on EFL students’ writing. Comput Assist Lang Learn. 2020 Mar 3;33(3):157–75.
  43. 43. Hidalgo Ternero C. Google Translate vs. DeepL: analysing neural machine translation performance under the challenge of phraseological variation. MonTI Monogr Trad E Interpret. 2021 Jan 13;154–77.