
A reproducible experimental survey on biomedical sentence similarity: A string-based method sets the state of the art

Registered Report Protocol

24 Mar 2021: Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A (2021) Protocol for a reproducible experimental survey on biomedical sentence similarity. PLOS ONE 16(3): e0248663. https://doi.org/10.1371/journal.pone.0248663

Abstract

This registered report introduces the largest, and for the first time, reproducible experimental survey on biomedical sentence similarity with the following aims: (1) to elucidate the state of the art of the problem; (2) to solve some reproducibility problems preventing the evaluation of most current methods; (3) to evaluate several unexplored sentence similarity methods; (4) to evaluate for the first time an unexplored benchmark, called Corpus-Transcriptional-Regulation (CTR); (5) to carry out a study on the impact of the pre-processing stages and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (6) to address the lack of software and data reproducibility resources for methods and experiments in this line of research. Our reproducible experimental survey is based on a single software platform, which is provided with a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results. In addition, we introduce a new aggregated string-based sentence similarity method, called LiBlock, together with eight variants of current ontology-based methods, and a new pre-trained word embedding model trained on the full-text articles in the PMC-BioC corpus. Our experiments show that our novel string-based measure establishes the new state of the art in sentence similarity analysis in the biomedical domain and significantly outperforms all the methods evaluated herein, with the only exception of one ontology-based method. Likewise, our experiments confirm that the pre-processing stages, and the choice of the NER tool for ontology-based methods, have a very significant impact on the performance of the sentence similarity methods. We also detail some drawbacks and limitations of current methods, and highlight the need to refine the current benchmarks. Finally, a notable finding is that our new string-based method significantly outperforms all state-of-the-art Machine Learning (ML) models evaluated herein.

Introduction

Measuring semantic similarity between sentences is an important task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining, among others. For instance, the estimation of the degree of semantic similarity between sentences is used in text classification [1–3], question answering [4, 5], evidence sentence retrieval to extract biological expression language statements [6, 7], biomedical document labeling [8], biomedical event extraction [9], named entity recognition [10], evidence-based medicine [11, 12], biomedical document clustering [13], prediction of adverse drug reactions [14], entity linking [15], document summarization [16, 17] and sentence-driven search of biomedical literature [18], among other applications. In the question answering task, Sarrouti and El Alaoui [4] build a ranking of plausible answers by computing the similarity scores between each biomedical question and the candidate sentences extracted from a knowledge corpus. Allot et al. [18] introduce LitSense [18], a system to retrieve the most similar sentences in the BioC biomedical corpus [19], which is based on the comparison of the user query with all sentences in the aforementioned corpus. Likewise, the relevance of the research in this area is endorsed by the proposal of recent conference series, such as SemEval [20–25] and BioCreative/OHNLP [26], and studies based on sentence similarity measures, such as the work of Aliguliyev [16] in automatic document summarization, which shows that the performance of these applications depends significantly on the sentence similarity measures used.

The aim of any semantic similarity method is to estimate the degree of similarity between two textual semantic units, such as words, phrases, sentences, short texts, or documents, as perceived by a human being. Unlike sentences in the general language, whose vocabulary and syntax are limited in both extension and complexity, most sentences in the biomedical domain draw on a huge specialized vocabulary made up of all sorts of biological and clinical terms, in addition to innumerable acronyms, which are combined in complex lexical and syntactical forms.

Currently, there are several papers in the literature that experimentally evaluate multiple methods on biomedical sentence similarity. However, they are either theoretical or have a limited scope and cannot be reproduced. For instance, Kalyan et al. [27], Khattak et al. [28], and Alsentzer et al. [29] introduce theoretical surveys on biomedical word and sentence embeddings with a limited scope. On the other hand, the experimental surveys introduced by Sogancioglu et al. [30], Blagec et al. [31], Peng et al. [32], and Chen et al. [33], among other authors, cannot be reproduced because of the lack of source code and data to replicate both methods and experiments, or the lack of a detailed definition of their experimental setups. For instance, Sogancioglu et al. [30] provide the BIOSSES evaluation dataset evaluated in this work, as well as a demo application and the source code used in their biomedical sentence similarity dataset (https://tabilab.cmpe.boun.edu.tr/BIOSSES/About.html); however, they provide neither the MetaMap [34] annotation tool and the versions of the UMLS ontology subsets MeSH [35] and OMIM [36] needed to reproduce the ontology-based measures, nor the Open Access Subset of PubMed Central (http://www.ncbi.nlm.nih.gov/pmc/) dataset used in their training stage. Blagec et al. [31] introduce a comprehensive experimental survey of biomedical sentence similarity measures, providing the detailed hyper-parameters used for training the models, as well as some code and data to allow the training and evaluation of their methods (https://github.com/kathrinblagec/neural-sentence-embedding-models-for-biomedical-applications); however, they provide neither the post-processed biomedical dataset used in their training phase, nor the pre-trained models. Peng et al. [32] provide the pre-trained models and the pre-processed dataset used to train the models (https://github.com/ncbi-nlp/BLUE_Benchmark), but they do not provide detailed information about the pre-processing of the dataset. Finally, Chen et al. [33] provide the pre-trained models (https://github.com/ncbi-nlp/BioSentVec) but provide neither detailed information about the data used for training the models nor information on the pre-processing stage. Therefore, it is not possible to reproduce their results in our experiments. Likewise, there are other recent works whose results need to be confirmed. For instance, Tawfik and Spruit [37] experimentally evaluate a set of pre-trained language models, whilst Chen et al. [38] propose a system to study the impact of a set of similarity measures on a Deep Learning ensemble model, which is based on a Random Forest model [39].

The main aim of this work is to introduce a comprehensive and very detailed reproducible experimental survey of methods on biomedical sentence similarity to elucidate the state of the art of the problem by implementing our previous registered report protocol [40]. Our experiments are based on our implementation and evaluation of all methods analyzed herein in a new common software platform based on an extension of the Half-Edge Semantic Measures Library (HESML) [41, 42], called HESML for Semantic Textual Similarity (HESML-STS) (http://hesml.lsi.uned.es). All our experiments have been recorded into a Docker virtualization image that is provided as supplementary material together with our software [43] and a detailed reproducibility protocol [44] and dataset [43] to allow the easy replication of all our methods, experiments, and results. This work is based on our previous experience developing reproducible research in a series of publications in the area, such as the experimental surveys on word similarity introduced in [45–48], whose reproducibility protocols and datasets [49, 50] are detailed and independently confirmed in two companion reproducible papers [41, 51], and a reproducible benchmark on semantic measures libraries for the biomedical domain [42]. Finally, we refer the reader to our previous work [40] for a very detailed review of the literature on sentence similarity measures, which is omitted here because of the lack of space and to avoid repetition.

Main motivations and research questions

Our main motivation is the lack of a comprehensive and reproducible experimental survey on biomedical sentence similarity that allows the state of the art of the problem to be set out in a sound and reproducible way, as detailed in our previous registered report protocol [40]. Our main research questions are as follows:

  1. RQ1 Which methods get the best results on biomedical sentence similarity?
  2. RQ2 Is there a statistically significant difference between the best-performing methods and the remaining ones?
  3. RQ3 What is the impact of the biomedical Named Entity Recognition (NER) tools on the performance of the methods on biomedical sentence similarity?
  4. RQ4 What is the impact of the pre-processing stage on the performance of the methods on biomedical sentence similarity?
  5. RQ5 What are the main drawbacks and limitations of current methods on biomedical sentence similarity?

A second motivation is implementing a set of unexplored methods based on adaptations from other methods proposed for the general language domain. A third motivation is the evaluation in the same software platform of the three known benchmarks on biomedical sentence similarity reported in the literature as follows: the Biomedical Semantic Similarity Estimation System (BIOSSES) [30] and Medical Semantic Textual Similarity (MedSTS) [52] datasets, as well as the evaluation for the first time of the Microbial Transcriptional Regulation (CTR) [53] dataset in a sentence similarity task, despite it having been previously evaluated in other related tasks, such as the curation of gene expressions from scientific publications [54]. A fourth motivation is a study on the impact of the pre-processing stage and NER tools on the performance of the sentence similarity methods, such as that done by Gerlach et al. [55] for stop-words in a topic modeling task. And finally, our fifth motivation is the lack of reproducibility software and data resources on this task, which allow an easy replication and confirmation of previous methods, experiments, and results in this line of research, as well as encouraging the development and evaluation of new sentence similarity methods.

Definition of the problem and contributions

The two main research problems tackled in this work are the design and implementation of a large and reproducible experimental survey on sentence similarity measures for the biomedical domain, and the evaluation of a set of unexplored methods based on adaptations from previous methods used in the general language domain. Our main contributions are as follows: (1) the largest, and for the first time, reproducible experimental survey on biomedical sentence similarity; (2) the first collection of self-contained and reproducible benchmarks on biomedical sentence similarity; (3) the evaluation of a set of previously unexplored methods, such as a new string-based sentence similarity method, based on Li et al. [56] and Block distance [57], eight variants of the current ontology-based methods from the literature based on the work of Sogancioglu et al. [30], and a new pre-trained Word Embedding (WE) model based on FastText [58] and trained on the full-text of articles in the PMC-BioC corpus [19]; (4) the evaluation for the first time of an unexplored benchmark, called CTR [53]; (5) the study on the impact of the pre-processing stage and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; (6) the integration for the first time of most sentence similarity methods for the biomedical domain into the same software library, called HESML-STS, which is available both on Github (https://github.com/jjlastra/HESML) and in a reproducible dataset [43]; (7) a detailed reproducibility protocol together with a collection of software tools and datasets provided as supplementary material to allow the exact replication of all our experiments and results; and finally, (8) an analysis of the drawbacks and limitations of the current state-of-the-art methods.

The rest of the paper is structured as follows. First, we introduce a collection of new sentence similarity methods evaluated here for the first time. Next, we describe a detailed experimental setup for our experiments on biomedical sentence similarity and introduce our experimental results. Then, we discuss our results and answer the research questions detailed above. Subsequently, we introduce our conclusions and future work. Finally, we introduce three appendices with supplementary material as follows. S1 Appendix introduces all statistical significance results of our experiments, whilst S2 Appendix introduces all data tables reporting the performance of all methods with all pre-processing configurations evaluated herein, and the S3 Appendix introduces a reproducibility protocol detailing a set of step-by-step instructions to allow the exact replication of all our experiments, which is published at protocols.io [44].

The new sentence similarity methods

This section introduces a new string-based sentence similarity method based on the aggregation of the Li et al. [56] similarity and Block distance [57] measures, called LiBlock, as well as eight new variants of the ontology-based methods proposed by Sogancioglu et al. [30], and a new pre-trained word embedding model based on FastText [58] and trained on the full-text of the articles in the PMC-BioC corpus [19].

The new LiBlock string-based method

Two key advantages of the family of string-based methods are as follows. Firstly, they can be computed very efficiently because they do not require the use of external knowledge or pre-trained models, and secondly, they obtain competitive results, as shown in Table 8. However, the string-based methods do not capture the semantics of the words in the sentence, which prevents them from recognizing semantic relationships between words, such as synonymy and meronymy, among others. In contrast, the family of ontology-based methods captures the semantic relationships between words in a sentence pair and obtains state-of-the-art results in the sentence similarity task for the biomedical domain, as shown in Table 8. However, the effectiveness of ontology-based methods depends on the lexical coverage of the ontologies and the ability to automatically recognize the underlying concepts in sentences by using Named Entity Recognition (NER) and Word Sense Disambiguation (WSD) tools, whose coverage and performance could be limited in several application domains. In fact, the NER task is still an open problem [59] in the biomedical domain because of the vast biomedical vocabulary and the complex lexical and syntactic forms found in the biomedical literature. In comparison, the methods based on pre-trained word embedding models provide a broader lexical coverage than the ontology-based ones and obtain better results. However, the methods based on word embeddings do not significantly outperform all ontology-based measures in a word similarity task [48], in addition to requiring a large corpus for training, a complex training phase, and more computational resources than the families of string-based and ontology-based methods.

To overcome the drawbacks and limitations of the string-based and ontology-based methods detailed above, we propose here a new aggregated string-based measure called LiBlock, denoted by simLiBk henceforth, which is based on the combination of a similarity measure derived from the Block distance [57] and an adaptation of the ontology-based similarity measure introduced by Li et al. [56] that removes the use of ontologies, such as WordNet [60] or the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) [61]. The LiBlock similarity measure obtains its best results in combination with the cTAKES NER tool [62], which allows the detection of synonyms of CUI concepts. Nevertheless, the LiBlock method obtains competitive results with respect to the state-of-the-art methods with no use, either implicit or explicit, of an ontology, as detailed in Table 12.

The simLiBk method detailed in Eq (1) is defined by the linear aggregation of an adaptation of the Li et al. [56] measure, called simLiAd (Eq (3)), and a similarity measure derived from the Block distance measure [57], called simBk (Eq (2)). Let LΣ be the set of word sequences over a universal alphabet Σ; the simLiBk function returns a value between 0 and 1 which indicates the similarity score between two input sentences, as defined in Eq (1). The simBk function is based on the computation of the word frequencies fr(wi, sj) for each input sentence s1 and s2 and their concatenation s1 + s2, as detailed in Eq (2). The auxiliary function fr(wi, sj) returns the frequency of a word wi in the word sequence sj, whilst the function fr(wi, s1 + s2) returns the number of occurrences of the word wi in the concatenation of the two word sequences, denoted by s1 + s2. On the other hand, the simLiAd function takes the two word sets obtained by invoking the σ function (Eq (5)) on the sentences s1 and s2, and then computes the cosine similarity of the two binary semantic vectors obtained by invoking the φ function (Eq (4)) on the σ(s1) and σ(s2) word sets. Finally, the simLiBk score is defined by either the linear combination of simBk and simLiAd, as detailed in Eq (1), or simBk if simLiAd is 0.

A walk-through example.

Algorithm 1 details the step-by-step procedure to compute the simLiBk function, whilst Fig 1 shows the pipeline for calculating the LiBlock similarity score defined in Eq 1, as well as an example for illustrating an end-to-end calculation of the simLiBk similarity score of two sentences.

Fig 1. This figure details the workflow for computing the new LiBlock measure and an example illustrating a use case of the workflow following the steps defined in Algorithm 1.

https://doi.org/10.1371/journal.pone.0276539.g001

Algorithm 1 LiBlock sentence similarity measure for two input pre-processed sentences.

1: function simLiBlock(s1, s2)       ⊳ s1, s2 are word sequences ∈ LΣ
2:  S1 ← σ(s1)       ⊳ word set of sentence 1
3:  S2 ← σ(s2)       ⊳ word set of sentence 2
4:  D ← S1 ∪ S2       ⊳ construct the dictionary D
5:  b1 ← φ(S1)       ⊳ construct the semantic binary vector b1
6:  b2 ← φ(S2)       ⊳ construct the semantic binary vector b2
7:  scoreLiAd ← simLiAd(b1, b2)       ⊳ compute the adapted Li (LiAd) similarity
8:  scoreBk ← simBk(s1, s2)       ⊳ compute the Block distance similarity
9:  scoreLiBk ← simLiBk(scoreLiAd, scoreBk)       ⊳ compute the LiBlock similarity
10:  return scoreLiBk
11: end function

(Eqs (1)–(5): definitions of the simLiBk, simBk, and simLiAd similarity functions and of the auxiliary φ and σ functions described above.)
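To make the preceding definitions concrete, the following Python sketch mirrors Algorithm 1 under stated assumptions: it operates on already pre-processed token sequences, uses the standard normalized Block (L1) distance for simBk, and aggregates simBk and simLiAd with equal weights, since the exact coefficients of the linear combination are those given by Eq (1); all function names are illustrative and do not correspond to the HESML-STS API.

```python
from collections import Counter
from math import sqrt

def sim_block(s1, s2):
    """Block (L1) distance similarity between two token sequences (cf. Eq (2))."""
    f1, f2 = Counter(s1), Counter(s2)
    total = sum(f1.values()) + sum(f2.values())              # occurrences in s1 + s2
    l1 = sum(abs(f1[w] - f2[w]) for w in set(f1) | set(f2))  # Block (Manhattan) distance
    return 1.0 - l1 / total if total > 0 else 0.0

def sim_li_adapted(s1, s2):
    """Adapted Li et al. similarity: cosine of the binary vectors over D = S1 U S2 (cf. Eqs (3)-(5))."""
    set1, set2 = set(s1), set(s2)                        # sigma(s1), sigma(s2)
    dictionary = sorted(set1 | set2)                     # D = S1 U S2
    b1 = [1 if w in set1 else 0 for w in dictionary]     # phi(S1)
    b2 = [1 if w in set2 else 0 for w in dictionary]     # phi(S2)
    dot = sum(x * y for x, y in zip(b1, b2))
    norm = sqrt(sum(b1)) * sqrt(sum(b2))
    return dot / norm if norm > 0 else 0.0

def sim_liblock(s1, s2, weight=0.5):
    """LiBlock score: aggregation of sim_block and sim_li_adapted (cf. Eq (1)).
    The equal-weight aggregation is an assumption of this sketch."""
    score_li = sim_li_adapted(s1, s2)
    score_bk = sim_block(s1, s2)
    return score_bk if score_li == 0 else weight * score_bk + (1 - weight) * score_li

# Toy usage with two pre-processed (tokenized, lower-cased) sentences
print(sim_liblock("the cell line was treated with the drug".split(),
                  "the cell line was exposed to the drug".split()))
```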

The eight new variants of current ontology-based methods

The current family of ontology-based methods for biomedical sentence similarity proposed by Sogancioglu et al. [30] is based on the ontology-based semantic similarity between the words and concepts within the sentences to be compared. Thus, this latter family of methods defines a framework in which we can design new variants by exploring other word similarity measures. For this reason, we propose here the evaluation of a set of new ontology-based sentence similarity measures based on two different unexplored notions as follows: (1) the evaluation of state-of-the-art word similarity measures from the general domain [48] not yet evaluated in the biomedical domain; and (2) the evaluation of several ontology-based word similarity measures based on a recent and very efficient shortest-path algorithm, called the Ancestors-based Shortest-Path Length (AncSPL) algorithm [42], which is a fast approximation of Dijkstra's algorithm [63] for taxonomies introduced with the first HESML version for the biomedical domain [42].

Thus, we propose here the evaluation of the combination of the WBSM and UBSM methods with the path-based word similarity methods as follows: WBSM-Rada (M7); WBSM-cosJ&C (M9); WBSM-coswJ&C (M10); WBSM-Cai (M11); UBSM-Rada (M12); UBSM-cosJ&C (M14); UBSM-coswJ&C (M15); and UBSM-Cai (M16). The detailed information about these latter methods is shown in Table 3.

The new pre-trained word embedding model

Current sentence similarity methods based on the evaluation of pre-trained embedding models are mostly trained using the PubMed Central (PMC) Open Access dataset (https://www.ncbi.nlm.nih.gov/labs/pmc/) or the Medical Information Mart for Intensive Care (MIMIC-III) clinical notes [64]. However, as far as we know, there are no models in the literature trained on the full text of the articles in the PMC-BioC corpus [19]. Therefore, we propose evaluating a new FastText [58] word embedding model trained on the aforementioned BioC corpus. FastText overcomes one significant limitation of other methods, such as word2vec [65] and GloVe [66], which ignore the morphology of words by assigning a single vector to each word in the vocabulary. For a more detailed review of the family of word embedding methods, we refer the reader to the recent reproducible survey by Lastra-Díaz et al. [48]. The configuration parameters for training this model are detailed in Table 4, and all the necessary information and resources for evaluating it are available in our reproducibility dataset [43], as detailed in Table 6.
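As an illustration of this training setup, the following sketch trains a FastText skip-gram model with gensim on a plain-text sentence corpus. The corpus path and all hyper-parameter values are placeholders for this example; the configuration actually used to train the model evaluated in this work is the one reported in Table 4.

```python
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# Placeholder path: one pre-processed (tokenized, lower-cased) sentence per line,
# extracted from the full-text articles of the PMC-BioC corpus.
corpus = LineSentence("pmc_bioc_sentences.txt")

# Hyper-parameter values are illustrative only; see Table 4 for the actual configuration.
model = FastText(
    sentences=corpus,
    vector_size=200,   # embedding dimension
    window=5,          # context window size
    min_count=5,       # ignore infrequent words
    sg=1,              # skip-gram architecture
    min_n=3, max_n=6,  # character n-gram range (FastText sub-word units)
    workers=8,
    epochs=5,
)
model.save("fasttext_skipgram_bioc.model")
```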

The reproducible experimental survey

This section introduces a detailed experimental setup to evaluate and compare all the sentence similarity methods for the biomedical domain proposed in our primary work [40], together with the new methods introduced herein. The main aims of our experiments are as follows: (1) the evaluation of most of the known methods for biomedical sentence similarity on the three biomedical datasets shown in Table 1, all implemented on the same software platform; (2) the evaluation of a set of new sentence similarity methods adapted from their definitions for the general-language domain; (3) the evaluation of a new sentence similarity method called LiBlock introduced in this work, eight variants of the current ontology-based methods from the literature based on the work of Sogancioglu et al. [30], and a new word embedding model based on FastText and trained on the full text of the articles in the PMC-BioC corpus [19]; (4) the setting out of the state of the art of the problem in a sound and reproducible way; (5) the replication and independent confirmation of previously reported methods and results; (6) a study on the impact of different pre-processing configurations on the performance of the sentence similarity methods; (7) a study on the impact of different Named Entity Recognition (NER) tools, such as MetaMap [34] and the clinical Text Analysis and Knowledge Extraction System (cTAKES) [62], on the performance of the sentence similarity methods; (8) the evaluation for the first time of the CTR [53] dataset; (9) the identification of the main drawbacks and limitations of current methods; and finally, (10) a detailed statistical significance analysis of the results.

Table 1. Benchmarks on biomedical sentence similarity evaluated in this work.

https://doi.org/10.1371/journal.pone.0276539.t001

Selection of methods

The criteria for the selection of the sentence similarity methods evaluated herein are as follows: (a) all the methods that have been evaluated on the BIOSSES and MedSTS datasets; (b) a selection of methods that have not been evaluated in the biomedical domain yet; (c) a collection of new variants or adaptations of methods previously proposed for the general or biomedical domain, which are evaluated for the first time in this work, such as the WBSM-cosJ&C [30, 42, 46, 67], WBSM-coswJ&C [30, 42, 46, 67], WBSM-Cai [30, 42, 68], UBSM-cosJ&C [30, 42, 46, 67], UBSM-coswJ&C [30, 42, 46, 67], and UBSM-Cai [30, 42, 68] methods detailed in Tables 3 and 4; and (d) a new string-based method based on Li et al. [56] introduced in this work. For a more detailed description of the selection criteria of the methods, we refer the reader to our registered report protocol [40].

Tables 2 and 3 detail the configuration of the string-based measures and ontology-based measures that are evaluated here, respectively. Both WBSM and UBSM methods are evaluated in combination with the following word and concept similarity measures: Rada et al. [69], Jiang&Conrath [70], and three state-of-the-art unexplored measures, called cosJ&C [42, 46], coswJ&C [42, 46], and Cai et al. [42, 68]. The word similarity measure which reports the best results is used to evaluate the COM method [30, 69]. Table 4 details the sentence similarity methods based on the evaluation of pre-trained character, word, and Sentence Embedding (SE) models that are evaluated in this work. Finally, Table 5 details the pre-trained language models that are evaluated in our experiments.

Table 2. Detailed setup for the string-based sentence similarity measures which are evaluated in this work.

All the string-based measures follow the implementation of Sogancioglu et al. [30], who use the Simmetrics library [71]. The LiBlock method proposed herein is an adaptation from Li et al. [56] combined with a string-based measure, as detailed in the previous section.

https://doi.org/10.1371/journal.pone.0276539.t002

Table 3. Detailed setup for the ontology-based sentence similarity measures evaluated in this work.

The evaluation of the methods using the Rada [69], coswJ&C [46], and Cai [68] word similarity measures uses a reformulation of the original path-based measures based on the new Ancestors-based Shortest-Path Length (AncSPL) algorithm [42].

https://doi.org/10.1371/journal.pone.0276539.t003

Table 4. Detailed setup for the sentence similarity methods based on pre-trained character, word (WE) and sentence (SE) embedding models evaluated herein.

https://doi.org/10.1371/journal.pone.0276539.t004

Table 5. Detailed setup for the sentence similarity methods based on pre-trained language models evaluated in this work.

https://doi.org/10.1371/journal.pone.0276539.t005

Pre-processing methods evaluated in this study

The pre-processing stage aims to ensure a fair comparison of the methods that are evaluated in a single end-to-end pipeline. To achieve this goal, the pre-processing stage normalizes and decomposes the sentences into a series of tokens so that all the methods evaluate the same sequence of words. The selection criteria for the pre-processing components have been conditioned by the following constraints: (a) the pre-processing methods and tools used by state-of-the-art methods; and (b) the availability of resources and software tools. Fig 2 details all the possible combinations of pre-processing configurations that are evaluated in this work. String, word and sentence embedding, and ontology-based methods are evaluated using all the available configurations except the WordPieceTokenizer [91], which is specific to BERT-based methods. Thus, BERT-based methods are evaluated using different char filtering, lower-casing normalization, and stop-word removal configurations. We use the Pearson and Spearman correlation metrics, together with their harmonic score values, to determine the impact of the different pre-processing configurations on the performance of the methods evaluated herein. However, we select the best overall pre-processing configuration using the average harmonic scores, which are also used to answer the remaining research questions.

Fig 2. Detail of the pre-processing configurations that are evaluated in this work.

(*) WordPieceTokenizer [91] is used only for BERT-based methods [30, 31, 34, 62, 91–94, 99].

https://doi.org/10.1371/journal.pone.0276539.g002

Most methods receive as input the sequences of words making up the sentences to be compared. The process of splitting sentences into words can be carried out by tokenizers, such as the well-known general-domain Stanford CoreNLP tokenizer [92], which is used by Blagec et al. [31], or the biomedical-domain BioCNLPTokenizer [93]. On the other hand, the use of lexicons instead of tokenizers for sentence splitting would be inefficient because of the vast general and biomedical vocabulary. Besides, it would not be possible to provide a fair comparison of the methods because the pre-trained language models do not have identical vocabularies.

The tokenized words that constitute the sentence, named tokens, are usually pre-processed by removing special characters, lower-casing, and removing stop words. To analyze all the possible combinations of token pre-processing configurations from the literature, we replicate for each method the pre-processing configurations used by other authors, such as Blagec et al. [31] and Sogancioglu et al. [30], and we also evaluate all the pre-processing configurations that have not been evaluated yet. We also study the impact of not removing special characters and stop words from the tokens, and of not normalizing them using lower-casing.
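The following minimal Python sketch illustrates the kind of token pre-processing switches compared in this study (character filtering, lower-casing, and stop-word removal); the whitespace tokenizer, the tiny stop-word list, and the function names are simplified placeholders rather than the actual components (e.g., the Stanford CoreNLP or BioCNLP tokenizers) used in our experiments.

```python
import re

# Small illustrative stop-word list; the experiments use the lists from the literature.
STOP_WORDS = {"the", "of", "and", "a", "an", "in", "to", "is", "are", "was", "were", "with"}

def preprocess(sentence, char_filtering=True, lower_casing=True, remove_stop_words=True):
    """Apply one token pre-processing configuration to a raw sentence."""
    tokens = sentence.split()                      # stand-in for CoreNLP/BioC tokenizers
    if char_filtering:                             # remove punctuation marks and special characters
        tokens = [re.sub(r"[^0-9A-Za-z\-]", "", t) for t in tokens]
        tokens = [t for t in tokens if t]
    if lower_casing:
        tokens = [t.lower() for t in tokens]
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens

# One sentence evaluated under two different configurations
raw = "The BRCA1 gene is associated with breast cancer."
print(preprocess(raw))                                                 # full pre-processing
print(preprocess(raw, lower_casing=False, remove_stop_words=False))    # keep casing and stop words
```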

Ontology-based sentence similarity methods estimate the similarity of a sentence pair by exploiting the ‘is-a’ relationships between the concepts in an ontology. Therefore, the evaluation of any ontology-based method requires a set of concept-annotated sentence pairs. The aim of the biomedical NER tools is to automatically recognize biomedical entities, such as diseases or drugs, in pieces of raw text. We evaluate the impact of the three most broadly used biomedical NER tools on the performance of the sentence similarity methods, as follows: (a) MetaMap [34], (b) cTAKES [62], and (c) MetaMap Lite [94]. The MetaMap tool [34] is used by the UBSM and COM methods [30] to recognize in the sentences concepts from the Unified Medical Language System (UMLS) [95], which is the standard compendium of biomedical vocabularies. Likewise, we use the default configuration of MetaMap restricted to the UMLS sources SNOMED-CT and MeSH implemented by HESML V1R5 [42, 96], which is defined by the following features: (i) the use of all available semantic types; (ii) the MedPost part-of-speech tagger [97]; and (iii) the MetaMap Word-Sense Disambiguation (WSD) module. We also evaluate cTAKES [62] because it has been shown to be a robust and reliable tool to recognize biomedical entities [98]. Given the high computational cost of MetaMap in evaluating large text corpora, Demner-Fushman et al. [94] introduced a lighter MetaMap version, called MetaMap Lite, which provides a real-time implementation of the basic MetaMap annotation capabilities without a large degradation of its performance.

Due to the large number of possible combinations of each pre-processing dimension, such as named entity recognizers, tokenizers, or char filtering methods, we have evaluated the pre-processing combinations of each dimension by defining a fixed pre-processing configuration for the rest of the dimensions, except for the string-based methods, whose computational performance is high enough not to cause a significant variation in the running time of the experiments.

Detailed workflow of our experiments

Fig 3 shows the workflow for running the experiments implemented in this work. Given an input dataset, such as BIOSSES [30], MedSTS [52], or CTR [53], the first step is to pre-process all the sentences, as shown in Fig 4. For each sentence pair (s1, s2) in the dataset, the pre-processing stage is divided into five steps as follows: (1.a) named entity recognition of UMLS [95] concepts, using different state-of-the-art NER tools, such as MetaMap [34] or cTAKES [62]; (1.b) tokenization of the sentences, using well-known tokenizers, such as the Stanford CoreNLP tokenizer [92], the BioCNLPTokenizer [93], or the WordPieceTokenizer [91] for BERT-based methods; (1.c) lower-case normalization; (1.d) character filtering, which allows the removal of punctuation marks or special characters; and finally, (1.e) the removal of stop words, following the different approaches evaluated by other authors, such as Blagec et al. [31] or Sogancioglu et al. [30]. Once each dataset is pre-processed (step 1 in Fig 3), the aim of step 2 is to calculate the similarity score between each pair of sentences in the dataset to produce a raw output file containing all raw similarity scores, one score per sentence pair. Finally, an R-language script is used in step 3 to process the raw similarity files and produce the final human-readable tables reporting the Pearson and Spearman correlation values shown in Table 8, as well as the statistical significance of the results and any other supplementary data tables required by our study on the impact of the pre-processing and NER tools reported in appendices A and B, respectively.

Fig 3. Detailed workflow implemented by our experiments for pre-processing the input sentences, calculating the raw similarity scores, and post-processing the results obtained in the evaluation of the biomedical datasets.

This workflow generates a collection of raw and processed data files.

https://doi.org/10.1371/journal.pone.0276539.g003

Fig 4. Detailed sentence pre-processing workflow implemented in our experiments.

The pre-processing stage takes an input sentence and produces a pre-processed sentence as output. (*) The named entity recognizers are only evaluated for the ontology-based methods.

https://doi.org/10.1371/journal.pone.0276539.g004

Finally, we also evaluate all the pre-processing combinations for each family of methods to study the impact of the pre-processing methods on the performance of the sentence similarity methods, with the only exception of the BERT-based methods. The pre-processing configurations of the BERT-based methods are only evaluated in combination with the WordPiece Tokenizer [91] because it is required by the current BERT implementations.

Evaluation metrics

The evaluation metrics used to compare the performance of the methods analyzed are the following: (1) the Pearson correlation, denoted by r in Eq (6); (2) the Spearman rank correlation, denoted by ρ in Eq (7); and (3) the harmonic score, denoted by h in Eq (8). The Pearson correlation evaluates the linear correlation between two random samples, whilst the Spearman rank correlation is rank-invariant and evaluates the monotonic relationship between two random samples, and the harmonic score allows comparing sentence similarity methods by using a single weighted score based on their performance in Pearson and Spearman correlation.

(6) r(X, Y) = cov(X, Y) / (σX · σY)

(7) ρ(X, Y) = r(rank(X), rank(Y))

(8) h = (2 · r · ρ) / (r + ρ)
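A minimal sketch of these three metrics using SciPy, assuming two equally sized vectors of human and estimated similarity scores; the harmonic score is computed as the harmonic mean of r and ρ, following Eq (8), and the toy values are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluation_metrics(human_scores, method_scores):
    """Pearson r (Eq 6), Spearman rho (Eq 7), and harmonic score h (Eq 8)."""
    human = np.asarray(human_scores, dtype=float)
    method = np.asarray(method_scores, dtype=float)
    r, _ = pearsonr(human, method)      # linear correlation
    rho, _ = spearmanr(human, method)   # rank (monotonic) correlation
    h = 2 * r * rho / (r + rho)         # harmonic mean of both correlation values
    return r, rho, h

# Toy example: human similarity scores vs. scores estimated by a method
print(evaluation_metrics([0.1, 0.4, 0.8, 0.9], [0.2, 0.35, 0.7, 0.95]))
```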

Statistical significance of the results

We use the well-known t-Student test to carry out a statistical significance analysis of the results of the evaluation of the methods on the three biomedical datasets shown in Table 1. In order to compare the overall performance of the semantic measures that are evaluated in our experiments, we use the harmonic score average over all datasets. The statistical significance of the results is evaluated using the p-values resulting from the t-Student test for the mean difference between the harmonic score values reported by each pair of semantic measures in all datasets. The p-values are computed using a one-sided t-Student distribution on two paired random sample vectors made up of the harmonic (h) score values obtained in the evaluation of the three aforementioned datasets. Our null hypothesis, denoted by H0, is that the difference in the average performance between each pair of compared sentence similarity methods is 0, whilst the alternative hypothesis, denoted by H1, is that their average performance is different. For a 5% level of significance, this means that if the p-value is greater than or equal to 0.05, we cannot reject the null hypothesis. Otherwise, we can reject H0 with an error probability of less than the p-value. In this latter case, we say that the first sentence similarity method obtains a statistically significantly higher value than the second one, or that the former significantly outperforms the latter.
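The test described above can be sketched as follows with SciPy, assuming two paired vectors of harmonic scores (one value per dataset for each of the two methods being compared) and a one-sided alternative; the score values are illustrative.

```python
from scipy.stats import ttest_rel

# Harmonic (h) scores of two methods on the same collection of datasets (paired samples).
h_method_a = [0.84, 0.79, 0.81, 0.77, 0.83]   # illustrative values
h_method_b = [0.80, 0.76, 0.79, 0.74, 0.82]

# Paired Student's t-test: H0 is that the mean difference is 0; the one-sided
# alternative tested here is that method A performs better on average than method B.
t_stat, p_value = ttest_rel(h_method_a, h_method_b, alternative="greater")

alpha = 0.05
if p_value < alpha:
    print(f"Method A significantly outperforms method B (p = {p_value:.4f})")
else:
    print(f"No significant difference at the 5% level (p = {p_value:.4f})")
```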

Uniform size datasets for our statistical significance analysis. The scarcity of datasets for this problem and the notable size difference among datasets, varying from 100 to 1,068 sentence pairs, make it impossible to study the statistical significance of the results with an adequate sample size and to carry out a fair and unbiased comparison of the results. It is a known fact [48] that the statistical distribution of the Pearson and Spearman correlation values reported by any semantic similarity measure can vary significantly with the dataset size, which means that the statistical distribution of the harmonic score obtained for small subsets of a large dataset such as MedSTS is not the same as that obtained for the whole dataset, as shown in Fig 5a. Fig 5a shows the histogram plots for the harmonic score obtained by the LiBlock measure [M4] in evaluating the sentence similarity of 10,000 different equal-size subsets of sentence pairs extracted from the MedSTS dataset for four different subset sizes: 100, 300, 600, and 900 sentence pairs. Fig 5a shows that the harmonic score follows a different normal distribution for each subset size, whose normality is subsequently confirmed by the Q-Q plot shown in Fig 5b and the Shapiro-Wilk (p-value = 0.123) and Chi-square (p-value = 0.317) tests for the sample of harmonic score values for subsets of size 100. Thus, the correlation values derived from MedSTS (1,068 pairs) could bias our results and violate the underlying hypothesis of the t-Student test, which requires that the data follow the same normal distribution. This potential risk of degradation of our significance analysis is increased by the fact that we only have 3 datasets of different sizes (100; 1,068; 170). For this reason, we have divided the MedSTS dataset into 10 parts, considered as independent datasets, to perform the study of the statistical significance of the results. Thus, we have artificially obtained 12 datasets of 100 to 200 pairs of sentences to build the vectors of harmonic score values used in the computation of the p-values. This set of datasets allows us to obtain the p-values to compare the statistical significance between the different measures, but does not affect the processed results in Table 8. All the necessary resources for obtaining both Table 8 and the table containing all the p-values reported in S1 Appendix are publicly available in the reproducibility dataset and the companion Lab Protocol article currently in preparation, as detailed in Table 6.

Fig 5.

Figure (a) shows the histogram plots for the harmonic score obtained by the LiBlock measure [M4] in evaluating the sentence similarity of 10,000 different equal-size subsets of sentence pairs extracted from the MedSTS dataset. Each histogram plot represents the frequency distribution of 10,000 samples of the harmonic score for subsets of sentence pairs with sizes 100, 300, 600, and 900. Figure (b) shows the Q-Q plot normality test for the harmonic score obtained for a random subset of size 100, along with the p-values reported by the Shapiro-Wilk and Chi-square normality tests.

https://doi.org/10.1371/journal.pone.0276539.g005

Table 6. Supplementary material and reproducibility resources of this work.

https://doi.org/10.1371/journal.pone.0276539.t006

Bonferroni correction for multiple hypothesis testing. Our discussion introduces some conclusions derived from the evaluation of multiple pairwise hypothesis tests to elucidate the statistical significance of the outperformance of one baseline similarity measure over a family of methods. In these latter cases, we define a set of null hypotheses {H1, …, Hm} stating that the pairwise mean difference between the harmonic score obtained by one baseline measure and each of the remaining methods in the same family is 0. To reduce the family-wise type I error (false positives) derived from our multiple comparisons [100], we apply a Bonferroni correction to evaluate the statistical significance of the multiple hypothesis tests involved in those conclusions in which one baseline sentence similarity measure is compared with a family of methods. For each single conclusion comparing one baseline measure with other methods, we define a corrected null-hypothesis rejection threshold αc = α/m, where α is equal to 0.05 for a 5% level of significance and m is the number of pairwise comparisons (uncorrected p-values). Thus, the null hypothesis is only rejected if the p-values are lower than αc when multiple pairwise hypotheses are tested.
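A short sketch of the Bonferroni-corrected decision rule described above; the uncorrected p-values are illustrative.

```python
# Uncorrected p-values from m pairwise comparisons of one baseline measure
# against the other methods of the same family (illustrative values).
p_values = [0.001, 0.020, 0.004, 0.047, 0.0004]

alpha = 0.05                       # 5% family-wise level of significance
m = len(p_values)                  # number of pairwise comparisons
alpha_c = alpha / m                # Bonferroni-corrected rejection threshold

for i, p in enumerate(p_values, start=1):
    decision = "reject H0" if p < alpha_c else "cannot reject H0"
    print(f"comparison {i}: p = {p:.4f}, alpha_c = {alpha_c:.4f} -> {decision}")
```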

Statistical performance analysis of the best methods

In order to answer the RQ5 research question, we study how well the sentence similarity methods estimate the degree of semantic similarity between two sentences by analyzing the deviation of their estimated values with respect to the human similarity scores. We want to analyze why the methods do well or badly on specific sentence pairs to provide an explanation for this behaviour, as well as to identify the main drawbacks and limitations of the current state-of-the-art methods. To carry out this performance analysis, we analyze the statistics of the similarity error function Esim of the methods defined in Eq (9), which measures the deviation of the normalized similarity value estimated by each method from the normalized human similarity score. We only use some sentences extracted from the BIOSSES dataset for this analysis because this dataset has no licensing restrictions on its use, which allows us to reproduce its sentences here, unlike MedSTS. We could have also used CTR because it has no licensing restrictions; however, CTR has not been previously used in this sentence similarity task.
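A minimal Python sketch of this error analysis, including the PDF estimation used in the methodology below (our experiments call the R "density" function; here a Gaussian kernel density estimate plays the same role). The sign convention Esim = method score - human score and the score values are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Normalized human scores and method scores for a set of sentence pairs (illustrative values).
human = np.array([0.10, 0.35, 0.55, 0.80, 0.90, 0.25, 0.60])
method = np.array([0.20, 0.30, 0.70, 0.75, 0.95, 0.40, 0.50])

# Similarity error (cf. Eq 9); the sign convention method - human is an assumption of this sketch.
e_sim = method - human

# Kernel density estimate of the error distribution (analogue of R's "density" function).
pdf = gaussian_kde(e_sim)
grid = np.linspace(-1.0, 1.0, 5)
print("mean similarity error:", e_sim.mean())
print("PDF values on a coarse grid:", pdf(grid))

# Sentence pairs with the lowest and highest absolute similarity error |Esim|.
print("lowest |Esim| pair index:", int(np.argmin(np.abs(e_sim))))
print("highest |Esim| pair index:", int(np.argmax(np.abs(e_sim))))
```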

Our methodology to conduct the performance analysis is detailed below:

  1. Selection of the best-performing method from each family of methods.
  2. Estimation of the Probability Density Function (PDF) of the Esim function for the evaluation of the selected best-performing methods in each dataset by calling the “density” function provided by the R statistical package.
  3. Selection of the sentences based on their similarity error in the BIOSSES dataset:
    3.1. The sentences with the lowest and highest absolute similarity error |Esim| for each method are extracted.
    3.2. Each sentence selected in the step above is pre-processed using the best pre-processing configuration for each method.
    3.3. The resulting pre-processed sentences and the statistical information of the similarity scores are analyzed in the Discussion section.
Software implementation

We have developed a new sentence similarity measures library for the biomedical domain, called HESML-STS, which is based on HESML V1R5 [41, 42], as detailed in Table 6. All our experiments are generated by running the HESMLSTSclient and HESMLSTSImpactpre-processingclient programs, which generate a raw output file in comma-separated value format (*.csv) for each dataset detailed in Table 1. The raw output files contain the raw similarity values returned by each sentence similarity method in the evaluation of the degree of similarity between sentences. The final results for the Pearson and Spearman correlation and the harmonic values detailed in Table 8 are automatically generated by running an R-language script file on the collection of raw similarity files, which also generates all the tables reported in appendices A and B provided as supplementary material. All tables are written in both LaTeX and comma-separated value (*.csv) formats. For a more detailed description of the protocol for running our experiments, we refer the reader to the protocol [44] detailed in S3 Appendix.

We implemented a parser for loading pre-trained embedding models based on FastText [58] and other word embedding models [78–82], which are efficiently evaluated as sentence similarity measures in HESML by implementing the averaging Simple Word Embedding (SWEM) approach introduced by Shen et al. [101]. However, the software replication required to evaluate sentence embedding and BERT-based language models is extremely complex and out of the scope of this work. For this reason, these models are evaluated using the original software artifacts used to generate the aforementioned pre-trained models. Thus, we implemented a collection of Python wrappers for evaluating the available models by using the provided software artifacts as follows: (1) Sent2vec-based models [33] are evaluated using the Sent2vec library [84]; (2) Flair models [77] are evaluated using the flairNLP framework [77]; and (3) USE models [83] are evaluated using the open-source platform TensorFlow [102]. All BERT-based pre-trained models are evaluated using the open-source bert-as-a-service library [103].
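The averaging SWEM approach mentioned above can be sketched as follows, assuming a simple dictionary of pre-trained word vectors: each sentence vector is the average of the vectors of its in-vocabulary tokens, and the sentence similarity is the cosine between the two sentence vectors. The toy vectors and function names are illustrative only.

```python
import numpy as np

def sentence_vector(tokens, word_vectors, dim):
    """Average of the word vectors of the tokens found in the model (SWEM averaging)."""
    found = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(found, axis=0) if found else np.zeros(dim)

def sentence_similarity(s1, s2, word_vectors, dim):
    """Cosine similarity between the averaged sentence vectors."""
    v1 = sentence_vector(s1, word_vectors, dim)
    v2 = sentence_vector(s2, word_vectors, dim)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom > 0 else 0.0

# Toy 3-dimensional "pre-trained" model; real experiments load FastText/word2vec vectors.
dim = 3
word_vectors = {"gene": np.array([0.9, 0.1, 0.0]),
                "expression": np.array([0.7, 0.3, 0.1]),
                "protein": np.array([0.8, 0.2, 0.1]),
                "binds": np.array([0.1, 0.9, 0.2])}

print(sentence_similarity(["gene", "expression"], ["protein", "binds"], word_vectors, dim))
```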

Reproducing our benchmarks

For the sake of reproducibility, we introduce a detailed reproducibility protocol on protocols.io [44] that is based on a reproducibility dataset [43] containing all the software and data necessary to allow the exact replication of all our experiments and results. Our reproducibility protocol is mainly based on a Docker-based image that includes a pre-installation of all the necessary software and the Java source code and binary files of our benchmark program, which is provided as supplementary material in our reproducibility dataset [43] and DockerHub (https://hub.docker.com/repository/docker/alicialara/hesml_v2r1). Our source code files are tagged on Github with a permanent tag named “Release_HESML_V2R1” (https://github.com/jjlastra/HESML/releases/tag/Release_HESML_V2R1).

In addition, we plan to submit a Lab Protocol article under preparation [44] (https://collections.plos.org/collection/lab-protocols), which will provide a detailed description of the publicly available reproducibility dataset [43] and a very detailed reproducibility protocol [44] to allow the exact replication of all our methods, experiments, and results. We also plan to submit another article [104], currently in preparation, to introduce the new HESML-STS software library integrated into the latest HESML V2R1 version [105], together with a set of reproducible benchmarks on semantic measures libraries for biomedical sentence similarity. However, our reproducibility dataset already allows the full and exact replication of all our experiments by completing the licensing requirements of the UMLS databases and the aforementioned NER tools with the National Library of Medicine (NLM) of the United States (https://www.nlm.nih.gov/databases/umls.html#license_request).

Table 6 details all the reproducibility resources provided as supplementary material with this work. Our benchmarks are implemented using the Java 8, Python 3, and R programming languages, and thus they can be reproduced on any Java-compliant or Docker-compliant platform, such as Windows, macOS, or any Linux-based system.

Results obtained

Table 7 shows the selected pre-processing configuration of each method for obtaining its best-performing results, whilst Table 8 shows the results obtained in the evaluation of all methods on the three biomedical datasets evaluated herein using their best pre-processing configurations. Table 9 shows the comparison of results for the highest (best) and lowest (worst) average harmonic score values for the best-performing method of each family, shown in blue in Table 8 and defined as the method obtaining the highest average harmonic score. Furthermore, Table 10 shows the results obtained in our study on the impact of the NER tools on the performance of the sentence similarity methods in the evaluation of the MedSTS dataset [52]. Table 11 shows the harmonic and average harmonic scores obtained in the evaluation of the three biomedical datasets, as well as the resulting p-values comparing the NER tools for each ontology-based method. Table 12 shows the results obtained in the evaluation of the LiBlock method on the three biomedical datasets using its best pre-processing configuration and annotating the sentences with all the NER tool combinations. In addition, the aforementioned table details the resulting p-values comparing the best-performing LiBlock-NER combination with the other NER tools. Tables 13–16 show the raw input sentence pairs and their corresponding pre-processed versions for which the best-performing methods obtain the lowest and highest similarity error (Esim) in the BIOSSES dataset [30]. Table 17 details the statistical information for the best-performing methods of each family in the evaluation of the three biomedical datasets evaluated in this study. Finally, Fig 6 shows the Probability Density Function (PDF) of the similarity error obtained by the best-performing methods of each family in the evaluation of the BIOSSES, MedSTS, and CTR datasets, respectively.

Fig 6. Probability Density Function (PDF) and mean value of the similarity error (Esim) obtained by the best-performing methods in the evaluation of each dataset as follows: (a) BIOSSES, (b) MedSTS, and (c) CTR.

https://doi.org/10.1371/journal.pone.0276539.g006

S1 Appendix shows the p-values resulting from comparing all the methods using their best pre-processing configurations as detailed in Table 8, which allows us to study the statistical significance of the results, as detailed in the Discussion section. In addition, S2 Appendix shows the experimental results regarding the impact of the pre-processing configurations on all the methods evaluated here; the best configuration has been used to determine the final scores for each method. Finally, S3 Appendix details the protocol for reproducing all the experiments evaluated in this paper, which is also published on protocols.io [44].

Table 7. Best-performing pre-processing configurations used to evaluate the methods compared in this work as reported in Table 8, derived from our cross-evaluation of each method with the pre-processing configurations shown in Fig 2 (see S2 Appendix).

(*) COM (M17) uses the best configuration of the WBSM-Rada (M7) and UBSM-Rada (M12) methods for computing the similarity scores.

https://doi.org/10.1371/journal.pone.0276539.t007

Table 8. Pearson (r), Spearman (ρ), harmonic (h), and harmonic average (AVG) scores obtained by each sentence similarity method evaluated herein in the three biomedical sentence similarity benchmarks arranged by families.

All reported values were obtained using the best pre-processing configurations detailed in Table 7. The results in bold show the best scores, whilst the results in blue show the best average harmonic score for each family.

https://doi.org/10.1371/journal.pone.0276539.t008

Table 9. Comparison of results for the “best” and the “worst” pre-processing configurations for the best-performing methods of each family in Table 8.

The last column shows the t-Student p-values comparing the best and worst configurations.

https://doi.org/10.1371/journal.pone.0276539.t009

Table 10. Pearson (r), Spearman (ρ), and harmonic (h) values obtained in our experiments from the evaluation of the ontology-based similarity methods detailed below on the full MedSTS [52] dataset for each NER tool.

https://doi.org/10.1371/journal.pone.0276539.t010

Table 11. Harmonic score obtained by each combination of a sentence similarity method with a NER tool in the evaluation of the three sentence similarity datasets.

The p-values shown in this table are obtained by using the method for building uniform size datasets detailed above. The last column shows the p-values corresponding to the t-Student test comparing the performance of each combination with the best pair in each group.

https://doi.org/10.1371/journal.pone.0276539.t011

Table 12. Pearson (r) and Spearman (ρ) correlation values, harmonic score (h), and harmonic average (AVG) score obtained by the LiBlock method in combination with each NER tool using the best pre-processing configuration detailed in Table 7.

In addition, the last column (p-val) shows the p-values for the comparison of the LiBlock method combined with cTAKES against the remaining NER combinations.

https://doi.org/10.1371/journal.pone.0276539.t012

Table 13. Raw and pre-processed sentence pairs with the lowest (L) and highest (H) similarity error Esim for the LiBlock (M4) method, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the method.

https://doi.org/10.1371/journal.pone.0276539.t013

Table 14. Raw and pre-processed sentence pairs with the lowest (L) and highest (H) similarity error Esim for the COM (M17) method, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the method.

We show the raw and pre-processed sentence pairs evaluated by the WBSM and UBSM similarity methods that make up the COM method. The UBSM method uses the cTAKES NER tool.

https://doi.org/10.1371/journal.pone.0276539.t014

Table 15. Raw and pre-processed sentence pairs with the lowest (L) and highest (H) similarity error Esim for the BioWordVecint (M26) method, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the method.

https://doi.org/10.1371/journal.pone.0276539.t015

Table 16. Raw and pre-processed sentence pairs with the lowest (L) and highest (H) similarity error Esim for the ouBioBERT (M47) method, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the method.

https://doi.org/10.1371/journal.pone.0276539.t016

Table 17. Comparison of the mean, minimum, and maximum values of the normalized human similarity scores (Human) and the estimated values returned by the best-performing methods of each family in the evaluation of the three biomedical datasets.

https://doi.org/10.1371/journal.pone.0276539.t017

Discussion

Comparison of string-based methods

LiBlock (M4) obtains the highest average harmonic score among the family of string-based methods and significantly outperforms all of them. This conclusion can be drawn by looking at the average column in Table 8 for this group of methods and checking the p-values reported in Table A.1 in S1 Appendix. Table A.1 in S1 Appendix shows that LiBlock obtains p-values lower than αc = 0.05/5 (0.01) when it is compared with all the string-based methods, such as Block Distance (p-value = 0.000), Jaccard (p-value = 0.000), QGram (p-value = 0.000), Overlap Coefficient (p-value = 0.000), and Levenshtein (p-value = 0.000).

LiBlock (M4) obtains the highest Pearson correlation value in the BIOSSES and MedSTS datasets among the family of string-based methods, whilst Block Distance (M3) obtains the highest Pearson correlation in the CTR dataset. This conclusion can be drawn by looking at the results for the first group of methods detailed in Table 8.

LiBlock (M4) obtains the highest Spearman correlation value in all datasets among the family of string-based methods. This conclusion can be drawn by looking at the results for the first group of methods detailed in Table 8.

LiBlock (M4) obtains the highest harmonic score in all datasets among the family of string-based methods. This conclusion can be drawn by looking at the results for the first group of methods detailed in Table 8.

Comparison of ontology-based methods

COM (M17) obtains the highest average harmonic score among the family of ontology-based methods and significantly outperforms all of them, with the sole exception of WBSM-Rada (M7). This conclusion can be drawn by looking at the average column in Table 8 for the second group of methods and checking the p-values shown in Table A.1 in S1 Appendix. Table A.1 in S1 Appendix shows that COM obtains a p-value lower than αc = 0.05/10 (0.005) when it is compared with all ontology-based methods, with the only exception of WBSM-Rada (M7) (p-value = 0.088).

COM (M17) obtains the highest Pearson correlation value in the BIOSSES and CTR datasets among the family of ontology-based methods, whilst the WBSM-Rada (M7) method obtains the highest Pearson correlation value in the MedSTS dataset. This conclusion can be drawn by looking at the second group of methods in Table 8.

COM (M17) obtains the highest Spearman correlation value in the BIOSSES dataset among the family of ontology-based methods, whilst WBSM-Rada (M7) and UBSM-Rada (M12) do so in the MedSTS and CTR datasets, respectively. This conclusion can be drawn by looking at the second group of methods in Table 8.

COM (M17) obtains the highest harmonic score in the BIOSSES and CTR datasets among the family of ontology-based methods, whilst WBSM-Rada (M7) does so in the MedSTS dataset. This conclusion can be drawn by looking at the second group of methods detailed in Table 8.

Comparison of embedding methods

BioWordVecint (M26) obtains the highest average harmonic score among the family of embedding methods detailed in Table 4, but it does not significantly outperform all of them. This conclusion can be drawn by looking at the third group of methods in Table 8 and checking the p-values reported in Table A.1 in S1 Appendix, which shows that BioWordVecint (M26) obtains p-values higher than αc = 0.05/15 (0.003) when it is compared with the FastText-SkGr-BioC (M33) and Flair (M18) embedding methods.

BioWordVecint (M26) obtains the highest Pearson correlation value in the BIOSSES and MedSTS datasets among the family of embedding methods, whilst the Newman-Griffisword2vec_sgns (M22) model does so in the CTR dataset. This conclusion can be drawn by looking at the results for the third group of methods detailed in Table 8.

BioWordVecint (M26) obtains the highest Spearman correlation in the BIOSSES and MedSTS datasets among the family of embedding methods, whilst the Newman-Griffisword2vec_sgns (M22) model does so in the CTR dataset. This conclusion can be drawn by looking at the results for the third group of measures detailed in Table 8.

BioWordVecint (M26) obtains the highest harmonic score in the BIOSSES and MedSTS datasets among the family of embedding methods, whilst the Newman-Griffisword2vec_sgns (M22) model does so in the CTR dataset. This conclusion can be drawn by looking at the results for the third group of measures detailed in Table 8.
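
To make the comparison of the embedding family more concrete, the sketch below illustrates a common unsupervised recipe for scoring sentence similarity with a pre-trained word embedding model such as BioWordVecint: average the word vectors of each sentence and compare the resulting vectors with the cosine. The tiny vocabulary and vector values are purely illustrative and do not correspond to any of the evaluated models.

```python
# Hedged sketch: sentence similarity from averaged word vectors (toy vocabulary).
import numpy as np

toy_vectors = {                      # stand-in for a model like BioWordVec (dimensions reduced)
    "gene":     np.array([0.8, 0.1, 0.3]),
    "mutation": np.array([0.7, 0.2, 0.4]),
    "protein":  np.array([0.6, 0.3, 0.2]),
}

def sentence_vector(tokens, vectors):
    known = [vectors[t] for t in tokens if t in vectors]   # ignore out-of-vocabulary words
    return np.mean(known, axis=0) if known else np.zeros(3)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

s1 = sentence_vector(["gene", "mutation"], toy_vectors)
s2 = sentence_vector(["protein", "mutation"], toy_vectors)
print(round(cosine(s1, s2), 3))
```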

Comparison of BERT-based methods

OuBioBERT (M47) obtains the highest average harmonic score among the family of BERT-based methods. However, it does not significantly outperform all of them. This conclusion can be drawn by looking at the last group of methods in Table 8 and checking the p-values reported in Table A.1 in S1 Appendix, which shows that ouBioBERT obtains p-values higher than αc = 0.05/16 (0.003) when it is compared with several BERT-based methods, such as BioBERT Large 1.1 (p-value = 0.224) and PubMedBERT (abstracts+full text) (p-value = 0.101), among others.

NCBI-BlueBERT Large PubMed (M40) obtains the highest Pearson correlation value in the BIOSSES dataset among the family of BERT-based methods, whilst the NCBI-BlueBERT Base PubMed + MIMIC-III (M41) and the ouBioBERT (M47) models do so in the MedSTS and the CTR datasets, respectively. This conclusion can be drawn by looking at the last group of measures detailed in Table 8.

ouBioBERT (M47) obtains the highest Spearman correlation value in the BIOSSES dataset among the family of BERT-based methods, whilst SciBERT (M43) and NCBI-BlueBERT Base PubMed (M39) do so in the MedSTS and CTR datasets, respectively. These conclusions can be drawn by looking at the last group of measures detailed in Table 8.

ouBioBERT (M47) obtains the highest harmonic score in the BIOSSES dataset among the family of BERT-based methods, whilst SciBERT (M43) and NCBI-BlueBERT Base PubMed (M39) do so in the MedSTS and CTR datasets, respectively. This conclusion can be drawn by looking at the last group of measures detailed in Table 8.
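
For context on how a pre-trained BERT model can be used in an unsupervised setting, the following hedged sketch mean-pools the last hidden states of a public biomedical BERT checkpoint and compares the resulting sentence vectors with the cosine; the checkpoint name is an assumed public identifier, and mean pooling is one common choice rather than necessarily the exact configuration used in our experiments.

```python
# Hedged sketch: unsupervised sentence similarity from a pre-trained biomedical BERT model.
# The checkpoint name is an assumed public identifier used only for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

name = "seiya/oubiobert-base-uncased"          # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(sentence):
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (1, tokens, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)         # (1, tokens, 1)
    return (hidden * mask).sum(1) / mask.sum(1)           # mean over real (non-padding) tokens

v1 = embed("EGFR mutations predict response to erlotinib.")
v2 = embed("KRAS mutations confer resistance to anti-EGFR therapy.")
print(torch.cosine_similarity(v1, v2).item())
```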

Comparison of all methods

LiBlock (M4) obtains the highest average harmonic score among all the methods evaluated herein, and significantly outperforms all the methods based on language models. However, there is no statistically significant difference in performance with respect to the embedding methods Flair (M18) and BioWordVecint (M26), or the ontology-based methods COM (M17) and WBSM-Rada (M7). This conclusion can be drawn by looking at the average column in Table 8 and checking the p-values reported in Table A.1 in S1 Appendix, which shows that LiBlock obtains p-values higher than αc = 0.05/16 (0.003) when it is compared with the embedding-based methods Flair (M18) and BioWordVecint (M26), and p-values higher than αc = 0.05/11 (0.004) when it is compared with the ontology-based methods COM (M17) and WBSM-Rada (M7).

BioWordVecint (M26) obtains the highest Pearson correlation values in the BIOSSES dataset among all methods evaluated here, whilst WBSM-Rada (M7) and Newman-Griffisword2vec_sgns (M22) do so in the MedSTS and CTR datasets, respectively. This conclusion can be drawn by looking at the bold values detailed in Table 8.

LiBlock (M4) obtains the highest Spearman correlation value in the BIOSSES and MedSTS datasets among all methods evaluated here, whilst Newman-Griffisword2vec_sgns (M22) does so in the CTR dataset. These conclusions can be drawn by looking at the bold values detailed in Table 8.

LiBlock (M4) obtains the highest harmonic score in the BIOSSES dataset among all methods evaluated here, whilst WBSM-Rada (M7) and Newman-Griffisword2vec_sgns (M22) do so in the MedSTS and CTR datasets, respectively. This conclusion can be drawn by looking at the bold values detailed in Table 8.

COM (M17) obtains the second highest average harmonic score among all methods evaluated here, and it significantly outperforms all the methods based on language models. However, it does not significantly outperform all the embedding-, ontology-, or string-based methods. This conclusion can be drawn by looking at the bold values detailed in Table 8 and checking the p-values reported in Table A.1 in S1 Appendix, which shows that COM obtains p-values lower than αc = 0.05/17 (0.002) when it is compared with all the methods based on language models, but p-values higher than αc = 0.05/6 (0.008), αc = 0.05/11 (0.004), and αc = 0.05/16 (0.003), respectively, when it is compared with the string-, ontology-, and embedding-based methods.

Non-ML-based methods versus ML-based ones

The string-based method LiBlock (M4) obtains a higher average harmonic score than all the embedding-based methods in all datasets. Moreover, it significantly outperforms all methods based on embedding models, with the only exceptions of Flair (M18) and BioWordVecint (M26). This conclusion can be drawn by looking at the average column in Table 8 and checking the p-values reported in Table A.1 in S1 Appendix, which shows that LiBlock obtains p-values lower than αc = 0.05/16 (0.003) when it is compared with all the embedding-based methods except for BioWordVecint (p-value = 0.003) and Flair (p-value = 0.027).

All string-based methods obtain a higher average harmonic score than all the BERT-based methods considering all datasets, with the only exception of the Levenshtein distance (M5). However, the string-based methods do not significantly outperform all the BERT-based methods. This conclusion can be drawn by looking at the average column in Table 8 and checking the p-values reported in Table A.1 in S1 Appendix, which shows that the string-based methods QGram (M1), Jaccard (M2), Block distance (M3), Levenshtein distance (M5), and Overlap coefficient (M6) obtain p-values higher than αc = 0.05/17 (0.002) when they are compared with all the BERT-based methods.

The ontology-based methods COM (M17), WBSM-Rada (M7), and UBSM-Rada (M12) obtain a higher average harmonic score than all the embedding-based methods considering all datasets. However, they do not significantly outperform all embedding-based methods. This conclusion can be drawn by looking at the average column in Table 8 and checking the p-values reported in Table A.1 in S1 Appendix, which shows that the ontology-based methods COM (M17), WBSM-Rada (M7), and UBSM-Rada (M12) obtain p-values higher than αc = 0.05/16 (0.003) when they are compared with all the embedding-based methods.

The ontology-based methods UBSM-Rada (M12), WBSM-Rada (M7), COM (M17), and UBSM-coswJ&C (M15) obtain a higher average harmonic score than all the BERT-based methods. Moreover, the ontology-based methods UBSM-Rada (M12), WBSM-Rada (M7), and COM (M17) significantly outperform all the BERT-based methods. This conclusion can be drawn by looking at the average column in Table 8 and checking the p-values reported in Table A.1 in S1 Appendix, which shows that UBSM-Rada (M12), WBSM-Rada (M7), and COM (M17) obtain p-values lower than αc = 0.05/17 (0.002) when they are compared with all the BERT-based methods.

All embedding methods obtain a higher average harmonic score than all BERT-based methods, with the only exceptions of Flair (M18), BioConceptVecglove (M25), BioConceptVecfastText (M30) and USE (M31). This conclusion can be drawn by looking at the last column in Table 8.

BioWordVecint (M26) obtains a higher average harmonic score than all the BERT-based methods considering all datasets and significantly outperforms all of them, with the only exception of NCBI-BlueBERT Base PubMed + MIMIC-III (M41). This conclusion can be drawn by looking at the average column in Table 8 and checking the p-values reported in Table A.1 in S1 Appendix, which shows that the BioWordVecint (M26) method obtains p-values lower than αc = 0.05/17 (0.002) when it is compared with all the BERT-based methods, except for NCBI-BlueBERT Base PubMed + MIMIC-III (p-value = 0.002).

Impact of the NER tools on the ontology-based methods

This section analyzes the impact of the NER tools on the performance of the sentence similarity methods, and studies the overall impact of the NER configurations. Table 10 shows the results obtained on the performance of NER tools for the sentence similarity methods evaluated in the MedSTS dataset [52], whilst Table 11 shows the harmonic and average harmonic scores, as well as the p-values which result from comparing the harmonic score of the best-performing NER tool for each ontology-based method in the three datasets with the harmonic scores obtained by the other two NER tools.

MetamapLite obtains the highest Pearson, Spearman, and harmonic scores for the MedSTS dataset in combination with UBSM-J&C (M13), UBSM-cosJ&C (M14), UBSM-coswJ&C (M15), and UBSM-Cai (M16), whilst cTAKES obtains the highest Pearson, Spearman, and harmonic scores for the MedSTS dataset in combination with UBSM-Rada (M12) and COM (M17). This conclusion can be drawn by looking at the results shown in Table 10.

cTAKES obtains the highest average harmonic score for the three datasets in combination with UBSM-Rada (M12), UBSM-coswJ&C (M15) and COM (M17) methods, whilst MetamapLite obtains the highest average harmonic score for the three datasets in combination with UBSM-J&C (M13), UBSM-cosJ&C (M14) and UBSM-Cai (M16). This conclusion can be drawn by looking at the harmonic scores of the NER tools in Table 11.

cTAKES combined with COM (M17) obtains the best-performing results of ontology-based methods for the three datasets. This conclusion can be drawn by looking at the average harmonic scores column shown in Table 11.

cTAKES is the best-performing tool in combination with the UBSM-Rada (M12), UBSM-coswJ&C (M15), and COM (M17) methods in the three datasets, and it significantly outperforms MetamapLite and Metamap for the two former methods. However, there is no statistically significant difference with respect to the Metamap tools when cTAKES is combined with the COM (M17) method. This conclusion can be drawn by looking at the average harmonic scores and p-values shown in Table 11, which are lower than αc = 0.05/2 (0.025).

MetamapLite is the best-performing tool in combination with the UBSM-J&C (M13), UBSM-cosJ&C (M14), and UBSM-Cai (M16) methods in the three datasets, and it significantly outperforms cTAKES and Metamap. This conclusion can be drawn by looking at the average harmonic scores and p-values shown in Table 11, which are lower than αc = 0.05/2 (0.025).

The choice of the best NER tool for each method significantly impacts their performance in most cases. This conclusion follows from the conclusions above.

Answering RQ3.

Our results show that the ontology-based methods obtain their best performance in the task of biomedical sentence similarity when they use either MetamapLite or cTAKES. Thus, Metamap should not be used in combination with any of the ontology-based methods evaluated here in this task. Likewise, the results and p-values reported in Table 11 show that, in most cases, there is a significant difference in the performance of each ontology-based method according to the NER tool used. The conclusions above confirm that the selection of the NER tool significantly impacts the performance of the sentence similarity methods using it.

Impact of the NER tools on the new LiBlock measure

This section analyzes the impact of the NER tools on the new simLiBk similarity measure. Table 12 shows the results obtained by the simLiBk measure in the three biomedical datasets using its best pre-processing configuration, and annotating the sentences with all the combinations of NER tools. In addition, the aforementioned table details the p-values resulting from comparing the best-performing LiBlock-NER combination with the combinations based on the other two NER tools.

LiBlock-cTAKES obtains the highest average harmonic score for the three datasets among the LiBlock-NER combinations. However, it does not significantly outperform LiBlock with no use of a NER tool. This conclusion can be drawn by looking at the average column in Table 12 and checking the p-values in the last column. This conclusion is especially relevant because it shows that, in the case of the LiBlock measure, there is no statistically significant difference between using a NER tool like cTAKES and not using it. We conjecture that this result could have two explanations: firstly, the inability of LiBlock to capture semantic relationships beyond synonymy, and secondly, the current limitations of cTAKES in recognizing all mentions of biomedical entities.

LiBlock-cTAKES obtains the highest Pearson correlation value in the BIOSSES dataset among all LiBlock-NER combinations, whilst LiBlock with no use of a NER tool obtains the highest Pearson correlation value in the MedSTS and CTR datasets. This conclusion can be drawn by looking at the results detailed in Table 12.

LiBlock-cTAKES obtains the highest Spearman correlation value in the BIOSSES and MedSTS datasets among the LiBlock-NER combinations, whilst LiBlock-cTAKES and LiBlock-MetamapLite tie for the highest Spearman correlation value in the CTR dataset. This conclusion can be drawn by looking at the results detailed in Table 12.

LiBlock-cTAKES obtains the highest harmonic score in the BIOSSES and MedSTS datasets among the LiBlock-NER combinations, whilst LiBlock-MetamapLite obtains the highest harmonic score in the CTR dataset. This conclusion can be drawn by looking at the results detailed in Table 12.
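
The sketch below illustrates the idea behind the LiBlock-NER combinations discussed above: mentions recognized by a NER tool are mapped onto their UMLS CUI codes so that acronyms and synonyms collapse onto the same token before the string overlap is computed. The toy annotation dictionary and the overlap measure are illustrative simplifications, not the exact LiBlock definition.

```python
# Hedged sketch: CUI normalization of NER-annotated mentions before a string overlap score.
toy_ner = {                                   # stand-in for cTAKES / MetamapLite annotations
    "nsclc": "C0007131",
    "non-small cell lung carcinoma": "C0007131",
    "kras": "C1537502",
    "k-ras": "C1537502",
}

def normalize(tokens):
    return [toy_ner.get(t, t) for t in tokens]    # replace recognized mentions by CUI codes

def token_overlap(a, b):                          # simple set overlap, a proxy for LiBlock-style measures
    a, b = set(normalize(a)), set(normalize(b))
    return len(a & b) / len(a | b) if a | b else 0.0

s1 = ["k-ras", "mutations", "in", "nsclc"]
s2 = ["kras", "alterations", "in", "non-small cell lung carcinoma"]
print(round(token_overlap(s1, s2), 3))            # synonyms and acronyms now overlap via CUIs
```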

Impact of the remaining pre-processing stages

This section analyzes the impact of each pre-processing step on the performance of the sentence similarity methods, except for the NER tools already analyzed in the previous section. Finally, we study the overall impact of the pre-processing configurations.

Impact of tokenization.

The family of string-based methods obtains its best-performing results either by splitting the sentence on the spaces between words or using the Stanford CoreNLP tokenizer. This conclusion can be drawn by looking at Table 7, which summarizes the pre-processing tables detailed in S2 Appendix.

The family of ontology-based methods obtains its best-performing results in combination with the Stanford CoreNLP tokenizer. This conclusion can be drawn by looking at Table 7.

The family of methods based on embedding obtains its best-performing results in combination with the Stanford CoreNLP tokenizer, with the only exception of Flair (M18). This conclusion can be drawn by looking at Table 7.

No method based on strings, ontologies, or embedding obtains its best-performing results in combination with the BioCNLPTokenizer. This conclusion can be drawn by looking at Table 7. Thus, the BioCNLPTokenizer should not be used in combination with any method in the abovementioned families in the task of biomedical sentence similarity. On the other hand, we recall that all BERT-based methods evaluated herein can only be used in combination with the WordPiece Tokenizer [91] based on a subword segmentation algorithm, because it is required by the current BERT implementations.

All families of methods show a strong preference for a specific tokenizer, with the only exception of the string-based one. This conclusion can be drawn from previous conclusions that confirm the preference of the methods based on ontologies and embedding for the CoreNLP tokenizer, and the mandatory use of the WordPiece tokenizer by the family of BERT-based methods.
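
The contrast between a word-level split and the mandatory WordPiece segmentation of the BERT-based methods can be illustrated with the short sketch below; the checkpoint name is one public biomedical BERT model chosen only for illustration.

```python
# Hedged sketch: whitespace tokenization versus WordPiece sub-word segmentation.
from transformers import AutoTokenizer

sentence = "Erlotinib inhibits EGFR signalling in NSCLC cells."
whitespace_tokens = sentence.split()              # word-level units

# Assumed public checkpoint name, used only to obtain a WordPiece vocabulary.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
wordpiece_tokens = tokenizer.tokenize(sentence)   # sub-word units, e.g. pieces prefixed with '##'

print(whitespace_tokens)
print(wordpiece_tokens)
```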

Impact of character filtering.

The family of string-based methods obtains its best-performing results by using either the BIOSSES char-filtering method or the default method which removes the punctuation marks and special symbols from the sentences, with the only exception of the Levenshtein distance method (M5), which does not remove special characters. This conclusion can be drawn by looking at Table 7, which summarizes the pre-processing tables detailed in S2 Appendix.

All ontology-based methods obtain their best-performing results in combination with the BIOSSES char-filtering method. This conclusion can be drawn by looking at Table 7.

Most embedding methods obtain their best-performing results in combination with the default char filtering method. However, Flair (M18), BioWordVec (M26,M27), and BioSentVec (M32) do better with BIOSSES char-filtering. This conclusion can be drawn by looking at Table 7.

The BERT-based methods do not show a noticeable preference pattern for a specific char filtering method, obtaining their best-performing results with the BIOSSES, Blagec2019, or the default one. This conclusion can be drawn by looking at Table 7.

Impact of stop-words removal.

All string-based methods obtain their best-performing results in combination with the NLTK2018 stop-word list, with the only exception of the Levenshtein distance (M5). This conclusion can be drawn by looking at Table 7, which summarizes the pre-processing tables detailed in S2 Appendix.

All ontology-based methods obtain their best-performing results in combination with the NLTK2018 stop-word list, with the only exceptions of WBSM-J&C (M8) and WBSM-cosJ&C (M9), which do not remove stop words. This conclusion can be drawn by looking at Table 7.

The methods based on embedding do not show a noticeable preference pattern for a specific stop-word list, obtaining their best-performing results by using the stop-word list of BIOSSES, NLTK2018, or none at all. This conclusion can be drawn by looking at Table 7.

The methods based on language models do not show a noticeable preference pattern for a specific stop-word list, obtaining their best-performing results by using the stop-word list of BIOSSES, NLTK2018, or none at all. This conclusion can be drawn by looking at Table 7.

The best-performing results for the methods based on strings or ontologies show a noticeable preference for the use of the NLTK2018 stop-words list. This conclusion can be drawn by looking at Table 7.

Impact of lower-casing.

Only 10 of the 50 methods evaluated in this work obtain their best performance without converting words to lowercase at the sentence pre-processing stage. This conclusion can be drawn by looking at Tables 7 and 8, and the pre-processing tables detailed in S2 Appendix. Moreover, these ten aforementioned methods obtain a low performance in our experiments, with the sole exception of the BioNLP2016win30 (M29) pre-trained model, which obtains the third best Spearman correlation value in the CTR dataset. Thus, our experiments confirm that the lower-casing normalization of the sentences positively impacts the performance of the methods, and it should be considered as the default option in any biomedical sentence similarity task.

We conjecture that lower-casing improves the performance of the families of string-based and ontology-based methods because it improves the exact comparison of words. On the other hand, we also conjecture that the impact of lower-casing the sentences on the families of methods based on embedding and language models strongly depends on the pre-processing methods used in their training.

Overall impact of pre-processing.

To study the overall impact of the pre-processing stage on the performance of the sentence similarity methods, we selected the configuration reporting the highest (best) and lowest (worst) average harmonic score values for each method, as shown in Table 9. These configurations were selected from a total of 1081 pre-processing configurations reported in S2 Appendix.

The best-performing methods of each family show a statistically significant difference in performance between their best and worst pre-processing configurations. This conclusion can be drawn by looking at the average (AVG) and the p-values in Table 9.

Answering RQ4.

Our results and the conclusions above show that the pre-processing configurations significantly impact the performance of the sentence similarity methods, and thus, they should be specifically defined for each method. All families of methods show a strong preference for a specific tokenizer, with the sole exception of the string-based one. In addition, the BioCNLPTokenizer does not contribute to the best-performing configuration of any method evaluated here. The family of string-based methods shows a preference pattern for using either the BIOSSES or default char filtering method, whilst all ontology-based methods use the BIOSSES char filtering method, and most embedding methods use the default char filtering method. However, BERT-based methods do not show a noticeable preference pattern for a specific char filtering method. On the other hand, the families of string and ontology-based methods show a noticeable preference pattern for the use of the NLTK2018 stop-words list, whilst the families of embedding- and BERT-based methods do not show a noticeable pattern. Finally, the experiments confirm that the lower-casing normalization of the sentences positively impacts the performance of the methods, and it should be considered as the default option in any biomedical sentence similarity task.
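
The following minimal sketch summarizes how such a per-method pre-processing configuration can be expressed; the stop-word list and the character filtering rule are simplified placeholders for the NLTK2018 list and the BIOSSES filter, and the whitespace tokenizer stands in for any of the tokenizers compared above.

```python
# Hedged sketch of a configurable pre-processing pipeline mirroring the stages analysed above:
# lower-casing, character filtering, tokenization, and stop-word removal.
import re

STOPWORDS = {"the", "of", "in", "and", "a", "to"}     # placeholder for the NLTK2018 list

def preprocess(sentence, lowercase=True, char_filter=True, remove_stopwords=True):
    if lowercase:
        sentence = sentence.lower()
    if char_filter:                                   # drop punctuation marks and special symbols
        sentence = re.sub(r"[^\w\s-]", " ", sentence)
    tokens = sentence.split()                         # whitespace tokenizer; CoreNLP could be plugged in here
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

print(preprocess("The EGFR-mutant tumours respond to erlotinib, in most cases."))
```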

The new state of the art

We establish the new state of the art to answer our RQ1 and RQ2 questions as follows.

The LiBlock (M4) method sets the new state of the art for the sentence similarity task in the biomedical domain (see Table 8), being the best overall performing method to tackle this task. Moreover, LiBlock significantly outperforms all the methods based on language models. However, LiBlock cannot significantly outperform the ontology-based methods COM (M17) and WBSM-Rada (M7), and the embedding-based methods Flair (M18) and BioWordVecint (M26) (see S1 Appendix). Thus, LiBlock is a convincing but non-definitive winner among the biomedical sentence similarity methods evaluated here.

The COM (M17) method sets the new state of the art among the family of ontology-based methods for biomedical sentence similarity, being the best-performing method of this family in this task (see Table 8). Moreover, COM significantly outperforms all methods based on language models (see S1 Appendix).

BioWordVecint (M26) sets the new state of the art among the family of methods based on pre-trained embedding models, being the best-performing method of this family in this task (see Table 8). However, BioWordVecint does not significantly outperform all the remaining methods in the same family (see S1 Appendix).

OuBioBERT (M47) sets the new state of the art among the family of methods based on pre-trained BERT models, being the best-performing method of this family in this task (see Table 8). However, OuBioBERT is unable to significantly outperform all the remaining methods from the same family (see S1 Appendix).

Finally, our results show that our new string-based method, called LiBlock (M4), obtains the best overall results, despite not capturing the semantic information of the sentences. This is a very notable finding because it contradicts the common belief that ontology-based methods, which integrate word and concept semantics, should outperform non-semantic methods in this similarity task. A second and very interesting finding is that our non-semantic and non-ML LiBlock method significantly outperforms state-of-the-art methods based on BERT language models [86] in an unsupervised context. This latter finding is very remarkable because LiBlock is easy to implement and evaluate, very efficient (2635 sentence pairs per second with no use of a NER tool), and requires neither large text resources nor complex algorithms for its training and evaluation, which is a very clear advantage in the biomedical sentence similarity task.

Answering RQ1 and RQ2.

The string-based method LiBlock (M4) obtains the highest average harmonic score in all datasets, and significantly outperforms the remaining string-based methods, as well as all methods based on language models, and all the ontology-based methods with the only exceptions of COM (M17) and WBSM-Rada (M7). In addition, LiBlock obtains the highest Spearman correlation values in the BIOSSES and MedSTS datasets, which contain 100 and 1068 sentence pairs respectively.

Main drawbacks and limitations of current methods

This section analyzes the behaviour of the best-performing methods in each family of sentence similarity methods to answer our RQ5. The best-performing methods of each family, according to the harmonic average value reported in Table 8, are LiBlock (M4), COM (M17), BioWordVecint (M26), and OuBioBERT (M47).

String and ontology-based methods underestimate, on average, the human similarity value in the BIOSSES and CTR datasets, whilst their average similarity error is close to 0 in the MedSTS dataset. This conclusion can be drawn by looking at the average similarity error values and the mean error values shown in Fig 6 together with the mean values shown in Table 17. LiBlock and COM obtain mean error values of -0.021 and -0.001 in MedSTS, as shown in Fig 6b. On the other hand, both methods report a mean similarity score much lower than the mean of the Human normalized score in the BIOSSES and CTR datasets and a mean similarity score close to the Human normalized score in the MedSTS dataset, as shown in Table 17.

The methods based on embedding and language models overestimate, on average, the human similarity value in the three datasets. This conclusion can be drawn by looking at the average similarity error values and the mean error values shown in Fig 6, together with the mean similarity values shown in Table 17. The two aforementioned families of methods report a mean similarity score much higher than the mean of the Human normalized score in the three datasets, as shown in Table 17.

String and ontology-based methods share a similar underestimation behavior, in contrast to the overestimation behaviour shown by the methods based on embedding and language models, which is very noticeable in the three datasets. This conclusion can be drawn by looking at the minimum and maximum similarity values columns in Table 17, and the plots of the probability error distribution function for the three datasets in Fig 6. For instance, in spite of the human similarity scores being in the range of 0 to 1 in the BIOSSES dataset, as shown in Table 17, the string and ontology-based methods report similarity scores in the range of 0 to 0.596, whilst the methods based on embedding and language models report similarity scores in the range of 0.582 to 0.987.
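
The sketch below shows how the similarity error statistics discussed here can be derived, assuming Esim is defined as the difference between the normalized method score and the normalized human score for each sentence pair; the values are illustrative, not taken from the datasets.

```python
# Hedged sketch of the similarity error analysed in Fig 6 and Tables 13-17, assuming
# Esim = normalized method score - normalized human score (illustrative values only).
import numpy as np

human  = np.array([0.00, 0.25, 0.60, 1.00])       # normalized human scores
method = np.array([0.10, 0.55, 0.70, 0.80])       # normalized method estimations

e_sim = method - human                            # negative mean -> underestimation on average
print("mean error:", round(float(e_sim.mean()), 3))
print("min/max method score:", float(method.min()), float(method.max()))
```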

String and ontology-based methods tend to obtain their best results in sentences with a Human normalized score close to 0, whilst the methods based on embedding and language models obtain their best results in sentences with a Human normalized score close to 1. This conclusion can be drawn by looking at Tables 13–16. On the other hand, string and ontology-based methods tend to obtain their worst results in sentences with a Human normalized score close to 1, whilst the methods based on embedding and language models obtain their worst results in sentences with a Human normalized score close to 0.

None of the methods for semantic similarity of sentences in the biomedical domain evaluated here uses an explicit syntactic analysis or syntax information to obtain the similarity value. We conjecture that syntactic analysis would improve the performance in some cases. For instance, the sentences s1 and s2 with the highest Esim in Table 13 show an implicit relation between the concepts “miRNA” and “oncogenesis”, which should increase the final semantic similarity score of the sentences. However, none of the methods evaluated here considers and rewards these semantic relationships, because their recognition demands some form of syntactic analysis. On the one hand, string and ontology-based methods consider the concepts in a sentence as bags of words, whilst on the other hand, the methods based on embedding and language models implicitly consider the structure of the sentences but not the explicit relationships between their related parts.

Our results show that the family of string-based methods benefits from the high frequency of overlapping words in the sentences of the current biomedical datasets, whilst such methods are not able to deal properly with sentences that are semantically similar but do not exhibit a word overlapping pattern. The main advantages of the string-based methods are as follows: (1) they are able to obtain high correlation values without needing external resources for their training or evaluation; (2) they are fast and efficient; and (3) they require low computational resources. However, string-based methods are unable to capture the semantics of the words in the sentence, which prevents them from recognizing semantic relationships such as synonymy, meronymy, and morphological variants. On the other hand, the use of NER tools in combination with string-based methods is a good option to integrate at least the capability of recognizing synonyms, as shown by LiBlock-cTAKES (M4).

Ontology-based methods strongly depend on the lexical coverage of the ontologies and the ability to recognize automatically the underlying concepts in sentences. Our results show that the ontology-based methods are able to properly estimate a similarity score when used either with a dataset with high word overlapping, or with NER and WSD tools that find all possible entities to properly calculate the similarity between sentences. The main advantages of ontology-based methods are that they are fast and require low computational resources. However, the effectiveness of the ontology-based methods depends on the lexical coverage of the ontologies and the ability of the NER and WSD tools to recognize the underlying concepts in sentences, whose coverage and performance could be limited in several application domains.

The LiBlock (M4) string-based method and the COM (M17) ontology-based method use a NER tool in the pre-processing stage to recognize the biomedical entities (UMLS CUI codes) present in the input sentences. The objective of annotating entities in the semantic similarity task is to identify and disambiguate biomedical concepts in order to provide semantic information to the sentences. LiBlock uses the NER tool to normalize and disambiguate the underlying concepts in a sentence, unifying acronyms and synonyms of the same concept under the same CUI code and thus creating an overlap between concepts, whilst the ontology-based methods also exploit the similarity between concepts within the ontologies.

The biomedical NER tools evaluated in this work are unable to correctly identify and disambiguate many biomedical concepts, due to the use of acronyms and different morphological variations, among other causes. For example, the CUI concepts “KRAS gene” (C1537502), “BRAF gene” (C0812241), and “RAF1 gene” (C0812215) in the sentences s1 and s2 with the highest Esim obtained by the COM (M17) method in Table 14 appear as “K-ras”, “Braf”, “c-Raf”, and “Craf”. However, cTAKES is unable to recognize these latter morphological variants of the same biomedical concepts. A second example is the word “act” in the sentence “Consequently miRNAs have been demonstrated to act either as oncogenes […]”, which is wrongly recognized as the entity “Activated clotting time measurement” (C0427611), rather than as a verb, in the sentence s1 with the highest Esim in Table 13. Finally, a third example is the acronym “NSCLC”, which denotes the concept “Non-Small Cell Lung Carcinoma” (C0007131) and is not recognized in its plural variant “NSCLCs” in the sentence s2 with the highest Esim from Table 14.

The methods based on pre-trained embedding and language models provide a broader lexical coverage than the ontology-based methods, and they do not need NER or WSD tools to find intrinsic semantic relationships between the words in the sentences. However, these methods need large corpora for their training, as well as a complex training phase and more computational resources than the methods from the string-based and ontology-based families. Moreover, our experiments show that these methods tend to estimate higher similarity values than those estimated by a human being in the three datasets. In most cases, the aforementioned methods report similarity scores that tend towards 1, which indicates that the semantics extracted from the sentences is not sufficient to compute a correct similarity score. For instance, the sentences s1 and s2 with the highest Esim from Tables 15 and 16 show similarity values close to 1, even though the sentences have neither word overlapping nor similar concepts, and the human similarity score is 0 in both cases. Lastly, BERT-based methods are trained for downstream tasks using a supervised approach, and they do not perform well in an unsupervised context.

Answering RQ5.

String-based methods capture neither the word semantics within the sentences nor the semantic relationships between words, such as synonymy and meronymy, and their effectiveness mainly relies on the word overlapping frequency in the sentences. The LiBlock method uses a NER tool to normalize and disambiguate the underlying concepts in a sentence, but unfortunately, it does not significantly outperform LiBlock with no use of a NER tool, which could have two explanations: firstly, the inability of LiBlock to capture semantic relationships beyond synonymy; secondly, the current limitations of cTAKES in recognizing all mentions of biomedical entities. On the other hand, ontology-based methods use NER and WSD tools to recognize the underlying concepts in the sentences, which these tools are not able to correctly identify and disambiguate in many cases. In addition, ontology-based methods require external resources to capture the semantic information from the sentences, which limits their lexical coverage. Thus, ontology-based methods require both high word overlapping and high recognition coverage of named entities to properly estimate the similarity between sentences. In comparison, the methods based on pre-trained embedding and language models need large corpora for training, a complex training phase, and considerable computational resources to calculate the similarity between sentences. Moreover, those methods tend to obtain high similarity scores in most cases, which may penalize them in a balanced dataset and in a real environment. Finally, BERT-based methods are trained for downstream tasks using a supervised approach, and they do not perform well in an unsupervised context.

Comparison of running times

Table 18 details the running times reported by the best-performing methods of each family, as well as the number of sentence pairs per second that each method computes on average for the three datasets evaluated herein. The experiments were executed on a desktop computer with an AMD Ryzen 7 5800X CPU (16 cores), 64 GB of RAM, and a 2 TB SSD disk. In all cases, the running time includes the pre-processing time for each method. The string-based method Block Distance (M3) obtains the lowest running times because it does not need complex mechanisms or pre-trained models to calculate the similarity between sentences. On the other hand, the BERT-based methods obtain the worst results, mainly due to their pre-processing stage, which uses the WordPiece tokenization method.

Table 18. Running times in milliseconds (ms) and the average number of sentence pairs per second (sent/sec) reported by the best-performing method of each family in the evaluation of the 1339 sentence pairs that comprise the three datasets.

(*) The running times of the LiBlock method are reported for both the NER and no-NER versions, showing that the method is much more efficient without a NER tool, despite the fact that there is no statistically significant difference in the results between both pre-processing configurations.

https://doi.org/10.1371/journal.pone.0276539.t018
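
As an illustration of how the sentence-pairs-per-second figures in Table 18 can be derived, the sketch below times an arbitrary similarity function over a list of sentence pairs, pre-processing included; the toy overlap function only stands in for the evaluated methods.

```python
# Hedged sketch: measuring throughput (sentence pairs per second) of a similarity function.
import time

def pairs_per_second(similarity_fn, sentence_pairs):
    start = time.perf_counter()
    for s1, s2 in sentence_pairs:
        similarity_fn(s1, s2)                     # includes any per-pair pre-processing
    elapsed = time.perf_counter() - start
    return len(sentence_pairs) / elapsed if elapsed > 0 else float("inf")

# Usage with a trivial token-overlap function and two toy pairs:
toy = [("mirnas act as oncogenes", "mirnas behave as tumour suppressors"),
       ("egfr mutations", "kras mutations")]
overlap = lambda a, b: len(set(a.split()) & set(b.split())) / len(set(a.split()) | set(b.split()))
print(round(pairs_per_second(overlap, toy), 1))
```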

Inconsistent results in the calculation of the statistical significance matrix

Despite the artificial enlargement of the datasets used to calculate the statistical significance of the results, we have identified an inconsistent result concerning the comparison of the p-values of the LiBlock (M4) method with the WBSM-Rada (M7) and UBSM-Rada (M12) methods. Table 8 shows that the UBSM-Rada (M12) method obtains a higher average harmonic score than WBSM-Rada (M7). However, on the artificial datasets, the comparison of UBSM-Rada (M12) with LiBlock (M4) shows a significant difference, whilst the comparison of WBSM-Rada (M7) with LiBlock (M4) does not. We conjecture that this problem could be solved by increasing the number of datasets created for this task, which would increase the sample size and yield more consistent results.
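
A minimal sketch of the kind of paired comparison underlying this analysis is shown below, assuming that the harmonic scores of two methods are compared across a collection of (possibly artificially generated) datasets with a paired t-test; the scores and the test choice are illustrative, not the exact procedure used to build Table A.1.

```python
# Hedged sketch, not the exact procedure used in the survey: comparing two methods by a
# paired t-test on their harmonic scores over the same collection of datasets.
from scipy.stats import ttest_rel

harmonic_a = [0.72, 0.68, 0.75, 0.70, 0.73]   # placeholder scores of method A on five datasets
harmonic_b = [0.69, 0.66, 0.74, 0.71, 0.70]   # placeholder scores of method B on the same datasets
t_stat, p_value = ttest_rel(harmonic_a, harmonic_b)
print(round(p_value, 3))                      # larger samples yield more stable p-values
```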

Conclusions and future work

We have introduced the largest, detailed, and for the first time, reproducible experimental survey on biomedical sentence similarity reported in the literature. Our work also introduces a collection of self-contained and reproducible benchmarks on biomedical sentence similarity based on the same software platform, called HESML-STS, which has been especially developed for this work, being provided as part of the new HESML V2R1 version that is publicly available [105]. We provide a detailed reproducibility protocol [44] and dataset [43] to allow the exact replication of all our experiments, methods, and results. In addition, we introduce a new aggregated string-based sentence similarity method called LiBlock, together with eight variants of the ontology-based methods introduced by Sogancioglu et al. [30], and a new pre-trained word embedding model based on FastText [58] and trained on the full-text of the articles in the PMC-BioC corpus [19]. We also evaluate for the first time the CTR [53] dataset in a benchmark on biomedical sentence similarity.

The string-based LiBlock (M4) measure sets the new state of the art for the sentence similarity task in the biomedical domain and significantly outperforms all the methods of every family evaluated here, with the only exceptions of the Flair (M18), BioWordVecint (M26), COM (M17), and WBSM-Rada (M7) methods. However, our data analysis shows that, at least on the three datasets evaluated herein, there is no statistically significant difference between the performance of the LiBlock (M4) method using cTAKES and using no NER tool at all. Thus, using the LiBlock method without any NER tool could be a competitive and much more efficient solution for high-throughput applications.

Concerning the impact of the Named Entity Recognition (NER) tools, our results confirm that the choice of the best NER tool for each method significantly impacts its performance. MetamapLite [94] and cTAKES [62] set the best-performing configurations for the family of ontology-based methods, whilst Metamap [34] was not the best performer for any method.

Our experiments confirm that the pre-processing stage has a very significant impact on the performance of the sentence similarity methods evaluated here, and yet this aspect has neither been studied nor reported in the literature. Thus, the selection of the proper configuration for each sentence similarity method should be confirmed experimentally. However, our experiments suggest some default configurations to make these decisions, such as the use of lower-casing normalization, some specific char filtering methods, and some specific tokenizers with the sole exception of BioCNLPTokenizer. Finally, the families of string and ontology-based methods show a noticeable preference pattern for the use of the NLTK2018 stop-words list. For a detailed description of the best pre-processing configurations, we refer the readers to our discussion.

String-based methods capture neither the semantics of the words in the sentence nor the semantic relationships between words, and their effectiveness relies on the word overlapping frequency in the sentences. Ontology-based methods use Named Entity Recognition (NER) and Word Sense Disambiguation (WSD) tools to recognize the underlying concepts in the sentences, and they require external resources to capture the semantic information from the sentences, which limits their lexical coverage. In addition, they require either high word overlapping or high recognition coverage of named entities in order to properly calculate the similarity between sentences. On the other hand, the methods based on pre-trained embedding and language models need a large corpus for training, a complex training phase, and considerable computational resources to calculate the similarity between sentences. Moreover, these methods tend to obtain high similarity scores in most cases, which may penalize them in a balanced dataset and in a real environment. Finally, BERT-based methods are trained for downstream tasks using a supervised approach, and they do not perform well in an unsupervised context.

Our experiments suggest that the current benchmarks do not cover all the language features that characterize the biomedical domain, such as the frequent use of acronyms and semantic relationships such as synonymy and meronymy. In addition, current benchmarks have a very limited sample size, which makes the analysis of the results difficult. We conjecture that LiBlock, COM, and UBSM-Rada perform well because there is a noticeable overlap of terms that may benefit these methods over the others reported in the literature. Furthermore, Chen et al. [106] highlight the need to improve and create new benchmarks from different perspectives to reflect the multifaceted notion of sentence similarity. Therefore, we find a strong need to improve the existing benchmarks for the task of semantic similarity of sentences in the biomedical domain.

As part of our forthcoming activities, we plan to evaluate the new sentence similarity methods introduced herein in a benchmark for the general language domain. In addition, we will study the evaluation of sentence similarity methods in an extrinsic task, such as semantic medical indexing [107] or summarization [108]. We also consider the evaluation of further pre-processing configurations, such as biomedical NER systems based on recent Deep Learning techniques [10], and extending our experiments and research to the multilingual scenario by integrating multilingual biomedical NER systems like Cimind [109]. Finally, we plan to evaluate some recent biomedical concept embeddings based on MeSH [35], which have not been evaluated in the sentence similarity task yet.

Supporting information

S1 Appendix. The statistical significance results.

We provide a series of tables reporting the p-values for each pair of methods evaluated in this work as supplementary material.

https://doi.org/10.1371/journal.pone.0276539.s001

(PDF)

S2 Appendix. The pre-processing raw output files.

We provide all the pre-processing raw output tables for the experiments evaluated herein as supplementary material.

https://doi.org/10.1371/journal.pone.0276539.s002

(PDF)

S3 Appendix. A reproducibility protocol and dataset on the biomedical sentence similarity.

We provide the reproducibility protocol published at protocols.io [44] as supplementary material to allow the exact replication of all our experiments, methods, and results.

https://doi.org/10.1371/journal.pone.0276539.s003

(PDF)

Acknowledgments

We are grateful to Gizem Sogancioglu and Kathrin Blagec for kindly answering our questions to allow us to replicate their methods and experiments, to Fernando González and Juan Corrales for setting up our reproducibility dataset, and to Hongfang Liu and Yanshan Wang for providing us with the MedSTS dataset. UMLS CUI codes, the SNOMED-CT US ontology, and the MeSH thesaurus were used in our experiments by courtesy of the National Library of Medicine of the United States. Finally, we thank David Pritchard for checking the use of English in our manuscript.

References

  1. 1. Tafti AP, Behravesh E, Assefi M, LaRose E, Badger J, Mayer J, et al. bigNN: An open-source big data toolkit focused on biomedical sentence classification. In: 2017 IEEE International Conference on Big Data (Big Data); 2017. p. 3888–3896.
  2. 2. Kim S, Kim W, Comeau D, Wilbur WJ. Classifying gene sentences in biomedical literature by combining high-precision gene identifiers. In: Proc. of the 2012 Workshop on Biomedical Natural Language Processing; 2012. p. 185–192.
  3. 3. Chen Q, Panyam NC, Elangovan A, Davis M, Verspoor K. Document triage and relation extraction for protein-protein interactions affected by mutations. In: Proc. of the BioCreative VI Workshop. vol. 6; 2017. p. 52–51.
  4. 4. Sarrouti M, Ouatik El Alaoui S. A passage retrieval method based on probabilistic information retrieval model and UMLS concepts in biomedical question answering. J Biomedical Informatics. 2017;68:96–103. pmid:28286031
  5. 5. Kosorus H, Bögl A, Küng J. Semantic Similarity between Queries in QA System using a Domain-specific Taxonomy. In: ICEIS (1); 2012. p. 241–246.
  6. 6. Ravikumar KE, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database. 2017;2017(1). pmid:28365720
  7. 7. Rastegar-Mojarad M, Komandur Elayavilli R, Liu H. BELTracker: evidence sentence retrieval for BEL statements. Database. 2016;2016. pmid:27173525
  8. 8. Du J, Chen Q, Peng Y, Xiang Y, Tao C, Lu Z. ML-Net: multi-label classification of biomedical texts with deep neural networks. J Am Med Inform Assoc. 2019;26(11):1279–1285. pmid:31233120
  9. 9. Liu H, Hunter L, Kešelj V, Verspoor K. Approximate subgraph matching-based literature mining for biomedical events and relations. PLoS One. 2013;8(4):e60954. pmid:23613763
  10. 10. Hahn U, Oleynik M. Medical Information Extraction in the Age of Deep Learning. Yearb Med Inform. 2020;29(1):208–220. pmid:32823318
  11. 11. Kim SN, Martinez D, Cavedon L, Yencken L. Automatic classification of sentences to support Evidence Based Medicine. BMC Bioinformatics. 2011;12 Suppl 2:5. pmid:21489224
  12. 12. Hassanzadeh H, Groza T, Nguyen A, Hunter J. A supervised approach to quantifying sentence similarity: with application to evidence based medicine. PLoS One. 2015;10(6):e0129392. pmid:26039310
  13. 13. Boyack KW, Newman D, Duhon RJ, Klavans R, Patek M, Biberstine JR, et al. Clustering more than two million biomedical publications: comparing the accuracies of nine text-based similarity approaches. PLoS One. 2011;6(3):e18029. pmid:21437291
  14. 14. Dey S, Luo H, Fokoue A, Hu J, Zhang P. Predicting adverse drug reactions through interpretable deep learning framework. BMC Bioinformatics. 2018;19(Suppl 21):476. pmid:30591036
  15. 15. Lamurias A, Ruas P, Couto FM. PPR-SSM: personalized PageRank and semantic similarity measures for entity linking. BMC Bioinformatics. 2019;20(1):534. pmid:31664891
  16. 16. Aliguliyev RM. A new sentence similarity measure and sentence based extractive technique for automatic text summarization. Expert Syst Appl. 2009;36(4):7764–7772.
  17. 17. Shang Y, Li Y, Lin H, Yang Z. Enhancing biomedical text summarization using semantic relation extraction. PLoS One. 2011;6(8):e23862. pmid:21887336
  18. 18. Allot A, Chen Q, Kim S, Vera Alvarez R, Comeau DC, Wilbur WJ, et al. LitSense: making sense of biomedical literature at sentence level. Nucleic Acids Res. 2019;. pmid:31020319
  19. 19. Comeau DC, Wei CH, Islamaj Doğan R, Lu Z. PMC text mining subset in BioC: about three million full-text articles and growing. Bioinformatics. 2019;. pmid:30715220
  20. 20. Agirre E, Cer D, Diab M, Gonzalez-Agirre A. Semeval-2012 task 6: A pilot on semantic textual similarity. In: * SEM 2012: The First Joint Conference on Lexical and Computational Semantics–Volume 1: Proc. of the main conference and the shared task, and Volume 2: Proc. of the Sixth International Workshop on Semantic Evaluation (SemEval 2012). ACL; 2012. p. 385–393.
  21. 21. Agirre E, Cer D, Diab M, Gonzalez-Agirre A, Guo W. * SEM 2013 shared task: Semantic textual similarity. In: Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 1: Proc. of the Main Conference and the Shared Task: Semantic Textual Similarity. vol. 1. ACL; 2013. p. 32–43.
  22. 22. Agirre E, Banea C, Cardie C, Cer D, Diab M, Gonzalez-Agirre A, et al. Semeval-2014 task 10: Multilingual semantic textual similarity. In: Proc. of the 8th international workshop on semantic evaluation (SemEval 2014). ACL; 2014. p. 81–91.
  23. 23. Agirre E, Banea C, Cardie C, Cer D, Diab M, Gonzalez-Agirre A, et al. Semeval-2015 task 2: Semantic textual similarity, english, spanish and pilot on interpretability. In: Proc. of the 9th international workshop on semantic evaluation (SemEval 2015). ACL; 2015. p. 252–263.
  24. 24. Agirre E, Banea C, Cer D, Diab M, others. Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016;.
  25. 25. Cer D, Diab M, Agirre E, Lopez-Gazpio I, Specia L. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada: Association for Computational Linguistics; 2017. p. 1–14.
  26. 26. Wang Y, Afzal N, Liu S, Rastegar-Mojarad M, Wang L, Shen F, et al. Overview of the BioCreative/OHNLP Challenge 2018 Task 2: Clinical Semantic Textual Similarity. Proc of the BioCreative/OHNLP Challenge. 2018;2018.
  27. 27. Kalyan KS, Sangeetha S. SECNLP: A survey of embeddings in clinical natural language processing. J Biomed Inform. 2020;101:103323. pmid:31711972
  28. 28. Khattak FK, Jeblee S, Pou-Prom C, Abdalla M, Meaney C, Rudzicz F. A survey of word embeddings for clinical text. Journal of Biomedical Informatics: X. 2019;4:100057. pmid:34384583
  29. 29. Alsentzer E, Murphy J, Boag W, Weng WH, Jindi D, Naumann T, et al. Publicly Available Clinical BERT Embeddings. In: Proc. of the 2nd Clinical Natural Language Processing Workshop. Minneapolis, Minnesota, USA: Association for Computational Linguistics; 2019. p. 72–78.
  30. 30. Sogancioglu G, Öztürk H, Özgür A. BIOSSES: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics. 2017;33(14):49–58. pmid:28881973
  31. 31. Blagec K, Xu H, Agibetov A, Samwald M. Neural sentence embedding models for semantic similarity estimation in the biomedical domain. BMC Bioinformatics. 2019;20(1):178. pmid:30975071
  32. 32. Peng Y, Yan S, Lu Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. In: Proc. of the 18th BioNLP Workshop and Shared Task. Florence, Italy: Association for Computational Linguistics; 2019. p. 58–65.
  33. 33. Chen Q, Peng Y, Lu Z. BioSentVec: creating sentence embeddings for biomedical texts. In: 2019 IEEE International Conference on Healthcare Informatics (ICHI). IEEE; 2019. p. 1–5.
  34. 34. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc. 2010;17(3):229–236. pmid:20442139
  35. 35. Abdeddaïm S, Vimard S, Soualmia LF. The MeSH-Gram Neural Network Model: Extending Word Embedding Vectors with MeSH Concepts for Semantic Similarity. In: Ohno-Machado L, Séroussi B, editors. MEDINFO 2019: Health and Wellbeing e-Networks for All—Proceedings of the 17th World Congress on Medical and Health Informatics. vol. 264 of Studies in Health Technology and Informatics. IOS Press; 2019. p. 5–9.
  36. 36. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research. 2005;33(suppl1):D514–D517. pmid:15608251
  37. 37. Tawfik NS, Spruit MR. Evaluating Sentence Representations for Biomedical Text: Methods and Experimental Results. J Biomed Inform. 2020; p. 103396. pmid:32147441
  38. 38. Chen Q, Du J, Kim S, Wilbur WJ, Lu Z. Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records. BMC Medical Informatics and Decision Making. 2020;20(1):73. pmid:32349758
  39. 39. Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32.
  40. 40. Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. Protocol for a reproducible experimental survey on biomedical sentence similarity. PLoS One. 2021;16(3):e0248663. pmid:33760855
  41. 41. Lastra-Díaz JJ, García-Serrano A, Batet M, Fernández M, Chirigati F. HESML: a scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset. Information Systems. 2017;66:97–118.
  42. 42. Lastra-Díaz JJ, Lara-Clares A, Garcia-Serrano A. HESML: a real-time semantic measures library for the biomedical domain with a reproducible survey. BMC Bioinformatics. 2022;23(23). pmid:34991460
  43. 43. Lara-Clares A, Lastra Diaz JJ, Garcia Serrano A. Reproducible experiments on word and sentence similarity measures for the biomedical domain; 2022. e-cienciaDatos, v1. https://doi.org/10.21950/EPNXTR.
  44. 44. Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. A reproducibility protocol and dataset on the biomedical sentence similarity; 2022. Protocols.io, v1. https://www.protocols.io/view/a-reproducibility-protocol-and-dataset-on-the-biom-b5ckq2uw.
  45. Lastra-Díaz JJ, García-Serrano A. A new family of information content models with an experimental survey on WordNet. Knowledge-Based Systems. 2015;89:509–526.
  46. Lastra-Díaz JJ, García-Serrano A. A novel family of IC-based similarity measures with a detailed experimental survey on WordNet. Engineering Applications of Artificial Intelligence. 2015;46:140–153.
  47. Lastra-Díaz JJ, García-Serrano A. A refinement of the well-founded Information Content models with a very detailed experimental survey on WordNet. Technical Report TR-2016-01. ETSI Informática, Universidad Nacional de Educación a Distancia (UNED); 2016. http://e-spacio.uned.es/fez/view/bibliuned:DptoLSI-ETSI-Informes-Jlastra-refinement.
  48. Lastra-Díaz JJ, Goikoetxea J, Hadj Taieb MA, García-Serrano A, Ben Aouicha M, Agirre E. A reproducible survey on word embeddings and ontology-based methods for word similarity: Linear combinations outperform the state of the art. Engineering Applications of Artificial Intelligence. 2019;85:645–665.
  49. Lastra-Díaz JJ, García-Serrano A. WordNet-based word similarity reproducible experiments based on HESML V1R1 and ReproZip; 2016. Mendeley Data, v1. http://doi.org/10.17632/65pxgskhz9.1.
  50. Lastra-Díaz JJ, Goikoetxea J, Hadj Taieb MA, García-Serrano A, Aouicha MB, Agirre E. Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity. Data in Brief. 2019;26:104432. pmid:31516953
  51. Lastra-Díaz JJ, Goikoetxea J, Hadj Taieb M, García-Serrano A, Ben Aouicha M, Agirre E, et al. A large reproducible benchmark of ontology-based methods and word embeddings for word similarity. Information Systems. 2021;96:101636.
  52. Wang Y, Afzal N, Fu S, Wang L, Shen F, Rastegar-Mojarad M, et al. MedSTS: a resource for clinical semantic textual similarity. Language Resources and Evaluation. 2018; p. 1–16.
  53. Lithgow-Serrano O, Gama-Castro S, Ishida-Gutiérrez C, Mejía-Almonte C, Tierrafría VH, Martínez-Luna S, et al. Similarity corpus on microbial transcriptional regulation. Journal of Biomedical Semantics. 2019;10(1):8. pmid:31118102
  54. Lithgow-Serrano O, Gama-Castro S, Ishida-Gutiérrez C, Collado-Vides J. L-Regulon: A novel soft-curation approach supported by a semantic enriched reading for RegulonDB literature. bioRxiv. 2020.
  55. Gerlach M, Shi H, Amaral LAN. A universal information theoretic approach to the identification of stopwords. Nature Machine Intelligence. 2019;1(12):606–612.
  56. Li Y, McLean D, Bandar ZA, O’Shea JD, Crockett K. Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Trans Knowl Data Eng. 2006;18(8):1138–1150.
  57. Krause EF. Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Courier Corporation; 1986.
  58. Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics. 2017;5:135–146.
  59. Song B, Li F, Liu Y, Zeng X. Deep learning methods for biomedical named entity recognition: a survey and qualitative comparison. Brief Bioinform. 2021;22(6). pmid:34308472
  60. Miller GA. WordNet: A Lexical Database for English. Communications of the ACM. 1995;38(11):39–41.
  61. Donnelly K. SNOMED-CT: The advanced terminology and coding system for eHealth. Studies in Health Technology and Informatics. 2006;121:279–290. pmid:17095826
  62. Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17(5):507–513. pmid:20819853
  63. Dijkstra EW. A note on two problems in connexion with graphs. Numerische Mathematik. 1959;1(1):269–271.
  64. Johnson AEW, Pollard TJ, Shen L, Lehman LWH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035. pmid:27219127
  65. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst. 2013;26.
  66. Pennington J, Socher R, Manning C. GloVe: Global vectors for word representation. In: Proc. of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2014. p. 1532–1543.
  67. Sánchez D, Batet M, Isern D. Ontology-based information content computation. Knowledge-Based Systems. 2011;24(2):297–303.
  68. Cai Y, Zhang Q, Lu W, Che X. A hybrid approach for measuring semantic similarity based on IC-weighted path distance in WordNet. Journal of Intelligent Information Systems. 2017; p. 1–25.
  69. Rada R, Mili H, Bicknell E, Blettner M. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics. 1989;19(1):17–30.
  70. Jiang JJ, Conrath DW. Semantic similarity based on corpus statistics and lexical taxonomy. In: Proc. of International Conference Research on Computational Linguistics (ROCLING X); 1997. p. 19–33.
  71. Chapman S, Norton B, Ciravegna F. Armadillo: Integrating knowledge for the semantic web. In: Proceedings of the Dagstuhl Seminar in Machine Learning for the Semantic Web; 2005. p. 90.
  72. Ukkonen E. Approximate string-matching with q-grams and maximal matches. Theor Comput Sci. 1992;92(1):191–211.
  73. Jaccard P. Nouvelles recherches sur la distribution florale. Bull Soc Vaud Sci Nat. 1908;44:223–270.
  74. Manning CD, Schütze H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press; 1999.
  75. Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady. 1966;10:707–710.
  76. Lawlor LR. Overlap, Similarity, and Competition Coefficients. Ecology. 1980;61(2):245–251.
  77. Akbik A, Blythe D, Vollgraf R. Contextual String Embeddings for Sequence Labeling. In: Proc. of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics; 2018. p. 1638–1649.
  78. Pyysalo S, Ginter F, Moen H, Salakoski T, Ananiadou S. Distributional semantics resources for biomedical text processing. Proc of LBM. 2013; p. 39–44.
  79. Chen Q, Lee K, Yan S, Kim S, Wei CH, Lu Z. BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale. PLOS Computational Biology. 2020;16(4):1–18. pmid:32324731
  80. Newman-Griffis D, Lai A, Fosler-Lussier E. Insights into Analogy Completion from the Biomedical Domain. In: BioNLP 2017. Vancouver, Canada: Association for Computational Linguistics; 2017. p. 19–28.
  81. Zhang Y, Chen Q, Yang Z, Lin H, Lu Z. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci Data. 2019;6(1):52. pmid:31076572
  82. Chiu B, Crichton G, Korhonen A, Pyysalo S. How to Train good Word Embeddings for Biomedical NLP. In: Proc. of the 15th Workshop on Biomedical Natural Language Processing. Berlin, Germany: Association for Computational Linguistics; 2016. p. 166–174.
  83. Cer D, Yang Y, Kong Sy, Hua N, Limtiaco N, St John R, et al. Universal Sentence Encoder for English. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Brussels, Belgium: Association for Computational Linguistics; 2018. p. 169–174.
  84. Pagliardini M, Gupta P, Jaggi M. Unsupervised Learning of Sentence Embeddings Using Compositional n-Gram Features. In: Proc. of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics; 2018. p. 528–540.
  85. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2019;36(4):1234–1240.
  86. Devlin J, Chang M, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Burstein J, Doran C, Solorio T, editors. Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT (Long and Short Papers). Minneapolis, MN, USA: Association for Computational Linguistics; 2019. p. 4171–4186. Available from: https://doi.org/10.18653/v1/n19-1423.
  87. Beltagy I, Lo K, Cohan A. SciBERT: A Pretrained Language Model for Scientific Text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics; 2019. p. 3615–3620.
  88. Huang K, Altosaar J, Ranganath R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv e-prints. 2019; p. arXiv:1904.05342.
  89. Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. arXiv e-prints. 2020; p. arXiv:2007.15779.
  90. Wada S, Takeda T, Manabe S, Konishi S, Kamohara J, Matsumura Y. A pre-training technique to localize medical BERT and to enhance biomedical BERT. arXiv e-prints. 2020; p. arXiv:2005.07202.
  91. Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv e-prints. 2016; p. arXiv:1609.08144.
  92. Manning C, Surdeanu M, Bauer J, Finkel J, Bethard S, McClosky D. The Stanford CoreNLP natural language processing toolkit. In: Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. ACL; 2014. p. 55–60.
  93. Comeau DC, Islamaj Doğan R, Ciccarese P, Cohen KB, Krallinger M, Leitner F, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database. 2013;2013:bat064. pmid:24048470
  94. Demner-Fushman D, Rogers WJ, Aronson AR. MetaMap Lite: an evaluation of a new Java implementation of MetaMap. J Am Med Inform Assoc. 2017;24(4):841–844. pmid:28130331
  95. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(Database issue):D267–D270. pmid:14681409
  96. Lastra-Díaz JJ, Lara-Clares A, Garcia-Serrano A. HESML V1R5 Java software library of ontology-based semantic similarity measures and information content models; 2020. e-cienciaDatos, v1. https://doi.org/10.21950/1RRAWJ.
  97. Smith L, Rindflesch T, Wilbur WJ. MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics. 2004;20(14):2320–2321. pmid:15073016
  98. Reátegui R, Ratté S. Comparison of MetaMap and cTAKES for entity extraction in clinical notes. BMC Med Inform Decis Mak. 2018;18(Suppl 3):74. pmid:30255810
  99. Bird S, Klein E, Loper E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc.; 2009.
  100. Ludbrook J. Multiple comparison procedures updated. Clinical and Experimental Pharmacology & Physiology. 1998;25(12):1032–1037. pmid:9888002
  101. Shen D, Wang G, Wang W, Min MR, Su Q, Zhang Y, et al. Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics; 2018. p. 440–450.
  102. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, et al. TensorFlow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association; 2016. p. 265–283.
  103. Xiao H. bert-as-service; 2018. https://github.com/hanxiao/bert-as-service.
  104. Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. HESML Java software library of semantic similarity measures for the biomedical domain. To be submitted. 2020.
  105. Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. HESML V2R1 Java software library of semantic similarity measures for the biomedical domain; 2022. e-cienciaDatos, v2. https://doi.org/10.21950/AQLSMV.
  106. Chen Q, Rankine A, Peng Y, Aghaarabi E, Lu Z. Benchmarking Effectiveness and Efficiency of Deep Learning Models for Semantic Textual Similarity in the Clinical Domain: Validation Study. JMIR Medical Informatics. 2021;9(12):e27386. pmid:34967748
  107. Couto FM, Krallinger M. Proposal of the First International Workshop on Semantic Indexing and Information Retrieval for Health from Heterogeneous Content Types and Languages (SIIRH). In: Advances in Information Retrieval. Springer International Publishing; 2020. p. 654–659.
  108. Mishra R, Bian J, Fiszman M, Weir CR, Jonnalagadda S, Mostafa J, et al. Text summarization in the biomedical domain: a systematic review of recent research. J Biomed Inform. 2014;52:457–467. pmid:25016293
  109. Cabot C, Darmoni S, Soualmia LF. Cimind: A phonetic-based tool for multilingual named entity recognition in biomedical texts. J Biomed Inform. 2019;94:103176. pmid:30980962