A reproducible experimental survey on biomedical sentence similarity: A string-based method sets the state of the art

This registered report introduces the largest, and for the first time, reproducible experimental survey on biomedical sentence similarity with the following aims: (1) to elucidate the state of the art of the problem; (2) to solve some reproducibility problems preventing the evaluation of most current methods; (3) to evaluate several unexplored sentence similarity methods; (4) to evaluate for the first time an unexplored benchmark, called Corpus-Transcriptional-Regulation (CTR); (5) to carry out a study on the impact of the pre-processing stages and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (6) to bridge the lack of software and data reproducibility resources for methods and experiments in this line of research. Our reproducible experimental survey is based on a single software platform, which is provided with a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results. In addition, we introduce a new aggregated string-based sentence similarity method, called LiBlock, together with eight variants of current ontology-based methods, and a new pre-trained word embedding model trained on the full-text articles in the PMC-BioC corpus. Our experiments show that our novel string-based measure establishes the new state of the art in sentence similarity analysis in the biomedical domain and significantly outperforms all the methods evaluated herein, with the only exception of one ontology-based method. Likewise, our experiments confirm that the pre-processing stages, and the choice of the NER tool for ontology-based methods, have a very significant impact on the performance of the sentence similarity methods. We also detail some drawbacks and limitations of current methods, and highlight the need to refine the current benchmarks. 
Finally, a notable finding is that our new string-based method significantly outperforms all state-of-the-art Machine Learning (ML) models evaluated herein.


May 19, 2022

Introduction
Measuring semantic similarity between sentences is an important task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining, among others. For instance, the estimation of the degree of semantic similarity between sentences is used in text classification [1][2][3], question answering [4,5], evidence sentence retrieval to extract biological expression language statements [6,7], biomedical document labeling [8], biomedical event extraction [9], named entity recognition [10], evidence-based medicine [11,12], biomedical document clustering [13], prediction of adverse drug reactions [14], entity linking [15], document summarization [16,17] and sentence-driven search of biomedical literature [18], among other applications. In the question answering task, Sarrouti and El Alaomi [4] build a ranking of plausible answers by computing the similarity scores between each biomedical question and the candidate sentences extracted from a knowledge corpus. Allot et al. [18] introduce a system called Litsense, which retrieves the most similar sentences in the BioC biomedical corpus [19] by comparing the user query with all sentences in the aforementioned corpus. Likewise, the relevance of the research in this area is endorsed by the proposal of recent conference series, such as SemEval [20][21][22][23][24][25] and BioCreative/OHNLP [26], and works based on sentence similarity measures, such as the work of Aliguliyev [16] in automatic document summarization, which shows that the performance of these applications depends significantly on the sentence similarity measures used.
The aim of any semantic similarity method is to estimate the degree of similarity between two textual semantic units, such as words, phrases, sentences, short texts, or documents, as perceived by a human being. Unlike sentences from the language in general use, whose vocabulary and syntax are limited both in extension and complexity, most sentences in the biomedical domain draw on a huge specialized vocabulary made up of all sorts of biological and clinical terms, in addition to an uncountable list of acronyms, which are combined in complex lexical and syntactic forms.
Nowadays, there exist several works in the literature that experimentally evaluate multiple methods on biomedical sentence similarity. However, they are either theoretical or have a limited scope and cannot be reproduced. For instance, Kalyan et al. [27], Khattak et al. [28], and Alsentzer et al. [29] introduce theoretical surveys on biomedical embeddings with a limited scope. On the other hand, the experimental surveys introduced by Sogancioglu et al. [30], Blagec et al. [31], Peng et al. [32], and Chen et al. [33], among other authors, cannot be reproduced because of the lack of source code and data to replicate both methods and experiments, or the lack of a detailed definition of their experimental setups. Likewise, there are other recent works whose results need to be confirmed. For instance, Tawfik and Spruit [34] experimentally evaluate a set of pre-trained language models, whilst Chen et al. [35] propose a system to study the impact of a set of similarity measures on a Deep Learning ensemble model, which is based on a Random Forest model [36].
The main aim of this work is to introduce a comprehensive and very detailed reproducible experimental survey of methods on biomedical sentence similarity to elucidate the state of the problem by implementing our previous registered report protocol [37]. Our experiments are based on our software implementation and evaluation of all methods analyzed herein in a common and new software platform based on an extension of the Half-Edge Semantic Measures Library (HESML) [38,39], called HESML for Semantic Textual Similarity (HESML-STS). All our experiments have been recorded in a Docker virtualization image that is provided as supplementary material together with our software [40] and a detailed reproducibility protocol [41] and dataset [42] to allow the easy replication of all our methods, experiments, and results. This work is based on our previous experience developing reproducible research in a series of publications in the area, such as the experimental surveys on word similarity introduced in [43][44][45][46], whose reproducibility protocols and datasets [47,48] are detailed and independently confirmed in two companion reproducible papers [38,49], and a reproducible benchmark on semantic measures libraries for the biomedical domain [39]. Finally, we refer the reader to our previous work [37] for a very detailed review of the literature on sentence similarity measures, which is omitted herein for reasons of space and to avoid redundancy.

Main motivations and research questions
Our main motivation is the lack of a comprehensive and reproducible experimental survey on biomedical sentence similarity that allows setting the state of the problem in a sound and reproducible way, as detailed in our previous registered report protocol [37]. Our main research questions are as follows:
RQ1 Which methods get the best results on biomedical sentence similarity?
RQ2 Is there a statistically significant difference between the best-performing methods and the remaining ones?
RQ3 What is the impact of the biomedical Named Entity Recognition (NER) tools on the performance of the methods on biomedical sentence similarity?
RQ4 What is the impact of the pre-processing stage on the performance of the methods on biomedical sentence similarity?
RQ5 What are the main drawbacks and limitations of current methods on biomedical sentence similarity?
A second motivation is implementing a set of unexplored methods based on adaptations from other methods proposed for the general language domain. A third motivation is the evaluation in the same software platform of the three known benchmarks on biomedical sentence similarity reported in the literature as follows: the Biomedical Semantic Similarity Estimation System (BIOSSES) [30] and Medical Semantic Textual Similarity (MedSTS) [50] datasets, as well as the evaluation for the first time of the Microbial Transcriptional Regulation (CTR) [51] dataset in a sentence similarity task, despite it having been previously evaluated in other related tasks, such as the curation of gene expressions from scientific publications [52]. A fourth motivation is a study on the impact of the pre-processing stage and NER tools on the performance of the sentence similarity methods, such as that done by Gerlach et al. [53] for stop-words in a topic modeling task. And finally, our fifth motivation is the lack of reproducible software and data resources for this task, which would allow an easy replication and confirmation of previous methods, experiments, and results in this line of research, as well as encourage the development and evaluation of new sentence similarity methods.

Definition of the problem and contributions
The two main research problems tackled in this work are the design and implementation of a large and reproducible experimental survey on sentence similarity measures for the biomedical domain, and the evaluation of a set of unexplored methods based on adaptations from previous methods used in the general language domain. Our main contributions are as follows: (1) the largest, and for the first time, reproducible experimental survey on biomedical sentence similarity; (2) the first collection of self-contained and reproducible benchmarks on biomedical sentence similarity; (3) the evaluation of a set of previously unexplored methods, such as a new string-based sentence similarity method, based on Li et al. [54] and Block distance [55], eight variants of the current ontology-based methods from the literature based on the work of Sogancioglu et al. [30], and a new pre-trained Word Embedding (WE) model based on FastText [56] and trained on the full-text of articles in the PMC-BioC corpus [19]; (4) the evaluation for the first time of an unexplored benchmark, called CTR [51]; (5) the study on the impact of the pre-processing stage and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; (6) the integration for the first time of most sentence similarity methods for the biomedical domain into the same software library, called HESML-STS, which is available both on GitHub and in a reproducible dataset [42]; (7) a detailed reproducibility protocol together with a collection of software tools and datasets provided as supplementary material to allow the exact replication of all our experiments and results; and finally, (8) an analysis of the drawbacks and limitations of the current state-of-the-art methods.
The rest of the paper is structured as follows. First, we introduce a collection of new sentence similarity methods evaluated herein for the first time. Next, we describe a detailed experimental setup for our experiments on biomedical sentence similarity and introduce our experimental results. Then, we discuss our results and answer the research questions detailed above. Subsequently, we introduce our conclusions and future work. Finally, we introduce three appendices with supplementary material as follows. Appendix A introduces all statistical significance results of our experiments, whilst Appendix B introduces all data tables reporting the performance of all methods with all pre-processing configurations evaluated herein, and Appendix C introduces a reproducibility protocol detailing a set of step-by-step instructions to allow the exact replication of all our experiments, which is published at protocols.io [41].

The new sentence similarity methods
This section introduces a new string-based sentence similarity method based on the aggregation of the Li et al. [54] similarity and Block distance [55] measures, called LiBlock, as well as eight new variants of the ontology-based methods proposed by Sogancioglu et al. [30], and a new pre-trained word embedding model based on FastText [56] and trained on the full-text of the articles in the PMC-BioC corpus [19].

The new LiBlock string-based method
Two key advantages of the family of string-based methods are as follows. Firstly, they can be very efficiently computed because they do not require the use of external knowledge or pre-trained models, and secondly, they obtain competitive results as shown in table 8. However, the string-based methods do not capture the semantics of the words in the sentence, which prevents them from recognizing semantic relationships between words, such as synonymy and meronymy, among others. On the other hand, the family of ontology-based methods captures the semantic relationships between words in a sentence pair and obtains state-of-the-art results in the sentence similarity task for the biomedical domain, as shown in table 8. However, the effectiveness of ontology-based methods depends on the lexical coverage of the ontologies and the ability to recognize automatically the underlying concepts in sentences by using Named Entity Recognition (NER) and Word Sense Disambiguation (WSD) tools, whose coverage and performance could be limited in several application domains. In particular, the NER task is still an open problem [57] in the biomedical domain because of the vast biomedical vocabulary and the complex lexical and syntactic forms found in the biomedical literature. In contrast, the methods based on pre-trained word embedding models provide a broader lexical coverage than the ontology-based ones and obtain better results. However, the methods based on word embeddings do not significantly outperform all ontology-based measures in a word similarity task [46], in addition to requiring a large corpus for training, a complex training phase, and more computational resources than the families of string-based and ontology-based methods.
To overcome the drawbacks and limitations of the string-based and ontology-based methods detailed above, we propose here a new aggregated string-based measure called LiBlock and denoted by $sim_{LiBk}$ henceforth, which is based on the combination of a similarity measure derived from the Block Distance [55] and an adaptation of the ontology-based similarity measure introduced by Li et al. [54] that removes the use of ontologies, such as WordNet [58] or Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) [59]. The LiBlock similarity measure obtains the best results in combination with the cTAKES NER tool [60], which allows the detection of synonyms of CUI concepts. Nevertheless, the LiBlock method obtains competitive results compared with the state-of-the-art methods with no use, either implicit or explicit, of an ontology, as detailed in table 12.
The $sim_{LiBk}$ method detailed in equation (1) is defined by the linear aggregation of an adaptation of the Li et al. [54] measure, called $sim_{LiAd}$ (3), and a similarity measure derived from the Block Distance measure [55], called $sim_{Bk}$ (2). Let $L_\Sigma$ be the set of word sequences in a universal unseen alphabet $\Sigma$; the $sim_{LiBk}$ function returns a value between 0 and 1 which indicates the similarity score between two input sentences, as defined in equation (1). The $sim_{Bk}$ function is based on the computation of the word frequencies $fr(w_i, s_j)$ for each input sentence $s_1$ and $s_2$ and their concatenation $s_1 + s_2$, as detailed in equation (2). The auxiliary function $fr(w_i, s_j)$ returns the frequency of a word $w_i$ in the word sequence $s_j$, whilst the function $fr(w_i, s_1 + s_2)$ returns the number of occurrences of the word $w_i$ in the concatenation of the two word sequences, denoted by $s_1 + s_2$. On the other hand, the $sim_{LiAd}$ function takes the two word sets obtained by invoking the $\sigma$ function (5) on the sentences $s_1$ and $s_2$, and then computes the cosine similarity of the two binary semantic vectors obtained by invoking the $\phi$ function (4) on the word sets $\sigma(s_1)$ and $\sigma(s_2)$. Finally, the $sim_{LiBk}$ score is defined by either the linear combination of $sim_{Bk}$ and $sim_{LiAd}$, as detailed in equation (1), or $sim_{Bk}$ if $sim_{LiAd}$ is 0.
A walk-through example. Algorithm 1 details the step-by-step procedure to compute the $sim_{LiBk}$ function, whilst figure 1 shows the pipeline for calculating the LiBlock similarity score defined in equation (1), as well as an example illustrating an end-to-end calculation of the $sim_{LiBk}$ similarity score of two sentences.
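The verbal description above can be turned into a minimal Python sketch. It is an illustration under stated assumptions, not the HESML-STS implementation: the normalisation of the Block Distance (one minus the summed frequency differences divided by the total word count of the concatenation) and the weight `lam` of the linear combination are assumptions, since equations (1) and (2) are not reproduced here. The function names are hypothetical. The token sequences are the pre-processed sentences of the walk-through example in figure 1.

```python
from collections import Counter
from math import sqrt


def sim_block(s1, s2):
    """Block-distance similarity from word frequencies (assumed normalisation)."""
    f1, f2 = Counter(s1), Counter(s2)
    total = len(s1) + len(s2)  # sum of fr(w, s1 + s2) over all words
    if total == 0:
        return 0.0
    distance = sum(abs(f1[w] - f2[w]) for w in set(f1) | set(f2))
    return 1.0 - distance / total


def sim_li_adapted(s1, s2):
    """Cosine similarity of two binary vectors built over the joint word set."""
    set1, set2 = set(s1), set(s2)          # sigma: word set generator
    joint = sorted(set1 | set2)            # joint word set D
    v1 = [1 if w in set1 else 0 for w in joint]  # phi applied to the first word set
    v2 = [1 if w in set2 else 0 for w in joint]  # phi applied to the second word set
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = sqrt(sum(v1)) * sqrt(sum(v2))
    return dot / norm if norm else 0.0


def sim_liblock(s1, s2, lam=0.5):
    """Linear combination of both scores; lam is a hypothetical weight."""
    bk = sim_block(s1, s2)
    li = sim_li_adapted(s1, s2)
    if li == 0:
        return bk  # fall back to the Block similarity alone
    return lam * bk + (1 - lam) * li


# Pre-processed sentences from the walk-through example (step 1 of figure 1).
s1 = ["c0280089", "formation", "mice", "oncogenic", "c1537502",
      "requires", "formation", "craf", "c0812241"]
s2 = ["oncogenic", "activity", "mutant", "c1537502", "appears",
      "dependent", "functional", "craf", "c0812241"]

print(sim_block(s1, s2), sim_li_adapted(s1, s2), sim_liblock(s1, s2))
```

Under these assumptions the two sentences share four of the thirteen words in the joint set, so the Block similarity is 1 - 10/18 and the adapted Li score is 4/sqrt(8 * 9). The combined value depends entirely on the assumed weight and should not be read as the score reported in figure 1.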
Algorithm 1 LiBlock sentence similarity measure for two input pre-processed sentences.
2: $S_1 \leftarrow \sigma(s_1)$ ▷ word set of sentence 1, where $\sigma$ is the word set generator defined in equation (5)
[...]

Figure 1. Walk-through example. Input sentence fragment: "[...] on functional Craf but not on Braf."
step 1: $s_1 \leftarrow$ {c0280089, formation, mice, oncogenic, c1537502, requires, formation, craf, c0812241}; $s_2 \leftarrow$ {oncogenic, activity, mutant, c1537502, appears, dependent, functional, craf, c0812241}
step 2: $S_1 \leftarrow$ {c0280089, formation, mice, oncogenic, c1537502, requires, craf, c0812241}
step 3: $S_2 \leftarrow$ {oncogenic, activity, mutant, c1537502, appears, dependent, functional, craf, c0812241}
step 4: $D \leftarrow$ {c0280089, formation, mice, oncogenic, c1537502, requires, craf, c0812241, activity, mutant, appears, dependent, functional}

The eight new variants of current ontology-based methods
The current family of ontology-based methods for biomedical sentence similarity proposed by Sogancioglu et al. [30] is based on the ontology-based semantic similarity between the words and concepts within the sentences to be compared. Thus, this latter family of methods defines a framework in which we can design new variants by exploring other word similarity measures. For this reason, we propose herein the evaluation of a set of new ontology-based sentence similarity measures based on two different unexplored notions as follows: (1) the evaluation of state-of-the-art word similarity measures from the general domain [46] not evaluated in the biomedical domain yet; and (2) the evaluation of several ontology-based word similarity measures based on a recent and very efficient shortest-path algorithm, called Ancestors-based Shortest-Path Length (AncSPL) [39], which is a fast approximation of Dijkstra's algorithm [61] for taxonomies that is introduced with the first HESML version for the biomedical domain [39]. Thus, we propose here the evaluation of the combination of the WBSM and UBSM methods with the path-based word similarity methods as follows: WBSM-Rada (M7); WBSM-cosJ&C (M9); WBSM-coswJ&C (M10); WBSM-Cai (M11); UBSM-Rada (M12); UBSM-cosJ&C (M14); UBSM-coswJ&C (M15); and UBSM-Cai (M16). Detailed information about these latter methods is shown in table 3.

The new pre-trained word embedding model
Current sentence similarity methods based on the evaluation of pre-trained embedding models are mostly trained using the PubMed Central (PMC) Open Access dataset or the Medical Information Mart for Intensive Care (MIMIC-III) clinical notes [62]. However, as far as we know, there are no models in the literature trained on the full text of the articles in the PMC-BioC corpus [19]. Therefore, we propose evaluating a new FastText [56] word embedding model trained on the aforementioned BioC corpus. FastText overcomes one significant limitation of other methods, such as word2vec [63] and GloVe [64], which ignore the morphology of words by assigning a distinct vector to each word in the vocabulary. For a more detailed review of the family of word embedding methods, we refer the reader to the recent reproducible survey by Lastra-Díaz et al. [46]. The configuration parameters for training this model are detailed in table 4, and all the necessary information and resources for evaluating it are available in our reproducibility dataset [42], as detailed in table 6.
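To illustrate how FastText accounts for word morphology, the following sketch extracts the character n-grams of a word wrapped in boundary symbols, which is the subword decomposition described in the FastText paper. The function name is hypothetical, and the default lengths of 3 to 6 characters are the library defaults, assumed here for illustration; they are not necessarily the values used to train the model evaluated in this work (see table 4).

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>'."""
    token = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped token.
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


# Two related gene names share several subword units.
shared = set(char_ngrams("craf")) & set(char_ngrams("braf"))
print(sorted(shared))
```

Because a word vector is built from the vectors of its n-grams, morphologically related or out-of-vocabulary terms can still receive meaningful representations, which a model with one vector per whole word cannot provide.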

The reproducible experimental survey
This section introduces a detailed experimental setup to evaluate and compare all the sentence similarity methods for the biomedical domain proposed in our primary work [37], together with the new methods introduced herein. The main aims of our experiments are as follows: (1) the evaluation of most known methods for biomedical sentence similarity on the three biomedical datasets shown in table 1, implemented in the same software platform; (2) the evaluation of a set of new sentence similarity methods adapted from their definitions for the general-language domain; (3) the evaluation of a new sentence similarity method called LiBlock introduced in this work, eight variants of the current ontology-based methods from the literature based on the work of Sogancioglu et al. [30], and a new word embedding model based on FastText and trained on the full-text of articles in the PMC-BioC corpus [19]; (4) the setting of the state of the art of the problem in a sound and reproducible way; (5) the replication and independent confirmation of previously reported methods and results; (6) a study on the impact of different pre-processing configurations on the performance of the sentence similarity methods; (7) a study on the impact of different Named Entity Recognition (NER) tools, such as MetaMap [65] and clinical Text Analysis and Knowledge Extraction System (cTAKES) [60], on the performance of the sentence similarity methods; (8) the evaluation for the first time of the CTR [51] dataset; (9) the identification of the main drawbacks and limitations of current methods; and finally, (10) a detailed statistical significance analysis of the results.
Table 1. Biomedical datasets evaluated in this work.
Dataset | Sentence pairs | File name
BIOSSES [30] | 100 | BIOSSESNormalized.tsv
MedSTS [50] | 1,068 | MedStsFullNormalized.tsv
CTR [51] | 170 | CTRNormalized averagedScore.tsv

Selection of methods
The criteria for the selection of the sentence similarity methods evaluated herein are as follows: (a) all the methods that have been evaluated in BIOSSES and MedSTS datasets; (b) a selection of methods that have not been evaluated in the biomedical domain yet; (c) a collection of new variants or adaptations of methods previously proposed for the general or biomedical domain, which are evaluated for the first time in this work, such as the WBSM-cosJ&C [30,39,44,66], WBSM-coswJ&C [30,39,44,66], WBSM-Cai [30,39,67], UBSM-cosJ&C [30,39,44,66], UBSM-coswJ&C [30,39,44,66], and UBSM-Cai [30,39,67] methods detailed in tables 3 and 4; and (d) a new string-based method based on Li et al. [54] introduced in this work. For a more detailed description of the selection criteria of the methods, we refer the reader to our registered report protocol [37]. Tables 2 and 3 detail the configuration of the string-based measures and ontology-based measures that are evaluated herein, respectively. Both WBSM and UBSM methods are evaluated in combination with the following word and concept similarity measures: Rada et al. [68], Jiang&Conrath [69], and three state-of-the-art unexplored measures, called cosJ&C [39,44], coswJ&C [39,44], and Cai et al. [39,67]. The word similarity measure which reports the best results is used to evaluate the COM method [30,68]. Table 4 details the sentence similarity methods based on the evaluation of pre-trained character, word, and Sentence Embedding (SE) models that are evaluated in this work. Finally, table 5 details the pre-trained language models that are evaluated in our experiments.
Table 2. Detailed setup for the string-based sentence similarity measures which are evaluated in this work. All the string-based measures follow the implementation of Sogancioglu et al. [30], who use the Simmetrics library [70]. The LiBlock method proposed herein is an adaptation from Li et al. [54] combined with a string-based measure, as detailed in the previous section.
ID | Method | Description
M2 | Jaccard [72,73] | $sim(a, b) = \frac{|a \cap b|}{|a \cup b|}$, where $a$ and $b$ are the sets of words of the first and second sentence, respectively.
M4 | LiBlock (this work) | LiBlock method (see eq. 1) annotated with CUI concepts and using cTAKES combined with the Block Distance [55] method using its best pre-processing configuration.
M5 | Levenshtein distance [74] | Measures the minimal-cost number of insertions, deletions, and replacements needed for transforming the first sentence into the second one. Insertion, deletion, and substitution costs are set to 1.
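As an illustration of two of the string-based measures in table 2, the sketch below implements the Jaccard similarity over word sets and the Levenshtein distance with unit costs. It is a plain re-implementation for clarity and does not call the Simmetrics library; the function names are hypothetical, the value returned for two empty sentences is an assumed convention, and the Levenshtein function returns the raw edit distance rather than a normalised similarity score.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two tokenized sentences."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    # Assumed convention: two empty sentences get a similarity of 0.
    return len(set_a & set_b) / len(union) if union else 0.0


def levenshtein(a, b):
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


print(jaccard(["craf", "braf"], ["braf", "kras"]))
print(levenshtein("craf", "braf"))
```

Both functions accept any sequences, so the Levenshtein distance can be computed either over characters or over tokens, depending on how the sentences are passed in.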
Most methods receive as input the sequences of words making up the sentences to be compared. The process of splitting sentences into words can be carried out by [...]
[Table 5 fragment: M44 ClinicalBERT [87]; M45 PubMedBERT [88] (abstracts); detail column: BERT [85] trained on PubMed abstracts]
On the other hand, the use of lexicons instead of tokenizers for sentence splitting would be inefficient because of the vast general and biomedical vocabulary. Besides, it would not be possible to provide a fair comparison of the methods because the pre-trained language models do not share identical vocabularies. The tokenized words that make up the sentence, named tokens, are usually pre-processed by removing special characters, lower-casing, and removing the stop words. To analyze all the possible combinations of token pre-processing configurations from the literature, we replicate for each method those pre-processing configurations used by other authors, such as Blagec et al. [31] and Sogancioglu et al. [30], and we also evaluate all the pre-processing configurations that have not been evaluated yet. We also study the impact of the pre-processing configurations by not removing special characters and stop words from the tokens, nor normalizing them using lower-casing.
Ontology-based sentence similarity methods estimate the similarity between two sentences by exploiting the 'is-a' relationships between the concepts in an ontology. Therefore, the evaluation of any ontology-based method receives a set of concept-annotated pairs of sentences. The aim of the biomedical NER tools is to recognize automatically biomedical entities in pieces of raw text, such as diseases or drugs. We evaluate the impact of the three most widely used biomedical NER tools on the performance of the sentence similarity methods, as follows: (a) MetaMap [65], (b) cTAKES [60], and (c) MetaMap Lite [93]. The MetaMap tool [65] is used by the UBSM and COM methods [30] for recognizing in the sentences the concepts of the Unified Medical Language System (UMLS) [94], which is the standard compendium of biomedical vocabularies. Likewise, we use the default configuration of MetaMap restricted to the UMLS sources of SNOMED-CT and MeSH implemented by HESML V1R5 [39,95], which is defined by the following features: (i) the use of all available semantic types; (ii) the MedPost Part-of-speech tagger [96]; and (iii) the MetaMap Word-Sense Disambiguation (WSD) module. We also evaluate cTAKES [60] because it has been shown to be a robust and reliable tool to recognize biomedical entities [97]. Motivated by the high computational cost of MetaMap in evaluating large text corpora, Demner-Fushman et al. [93] introduce a lighter MetaMap version, called MetaMap Lite, which provides a real-time implementation of the basic MetaMap annotation capabilities without a large degradation of its performance.
Due to the large number of possible combinations of each pre-processing dimension, such as Named Entity Recognizers, tokenizers or char filtering methods, we have evaluated the pre-processing combinations of each dimension by defining a fixed pre-processing configuration for the rest of dimensions, except for the string-based methods, whose performance is high enough to not cause a significant variation in the running time of the experiments.

Detailed workflow of our experiments
Figure 3 shows the workflow for running the experiments implemented in this work. Given an input dataset, such as BIOSSES [30], MedSTS [50], or CTR [51], the first step is to pre-process all the sentences, as shown in figure 4. For each sentence pair $(s_1, s_2)$ in the dataset, the pre-processing stage is divided into five stages as follows: (1.a) named entity recognition of UMLS [94] concepts, using different state-of-the-art NER tools, such as MetaMap [65] or cTAKES [60]; (1.b) tokenization of the sentences, using well-known tokenizers, such as the Stanford CoreNLP tokenizer [91], BioCNLPTokenizer [92], or WordPieceTokenizer [90] for BERT-based methods; (1.c) lower-case normalization; (1.d) character filtering, which allows the removal of punctuation marks or special characters; and finally, (1.e) the removal of stop-words, following different approaches evaluated by other authors like Blagec et al. [31] or Sogancioglu et al. [30]. Once each dataset is pre-processed (step 1 in figure 3), the aim of step 2 is to calculate the similarity score between each pair of sentences in the dataset to produce a raw output file containing all raw similarity scores, one score per sentence pair. Finally, an R-language script is used in step 3 to process the raw similarity files and produce the final human-readable tables reporting the Pearson and Spearman correlation values shown in table 8, as well as the statistical significance of the results and any other supplementary data table required by our study on the impact of the pre-processing and NER tools reported in appendices A and B, respectively. Moreover, we evaluate all the pre-processing combinations for each family of methods to study the impact of the pre-processing methods on the performance of the sentence similarity methods, with the only exception of the BERT-based methods. The pre-processing configurations of the BERT-based methods are only evaluated in combination with the WordPiece Tokenizer [90] because it is required by the current BERT implementations.
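The token-level stages (1.b) to (1.e) of the workflow above can be sketched as a configurable function. This is an illustrative sketch only: whitespace splitting stands in for the tokenizers named in the text, the stop-word list is a toy example, the NER stage (1.a) is omitted, and the function and parameter names are hypothetical.

```python
import re
from itertools import product

# Toy stop-word list for illustration; not one of the lists evaluated in the paper.
STOP_WORDS = {"the", "of", "in", "on", "but", "not", "to", "a"}


def preprocess(sentence, lowercase=True, filter_chars=True, remove_stop_words=True):
    """Apply the token-level pre-processing stages in the order (1.b) to (1.e)."""
    tokens = sentence.split()                      # (1.b) stand-in tokenizer
    if lowercase:
        tokens = [t.lower() for t in tokens]       # (1.c) lower-case normalization
    if filter_chars:
        # (1.d) drop punctuation and special characters, then empty tokens
        tokens = [re.sub(r"[^\w\s]", "", t) for t in tokens]
        tokens = [t for t in tokens if t]
    if remove_stop_words:
        # (1.e) stop-word removal, case-insensitive
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens


# Enumerate every on/off combination of the three optional stages.
configurations = list(product([False, True], repeat=3))
example = "It depends on functional Craf, but not on Braf."  # hypothetical sentence
for lower, chars, stops in configurations:
    print(lower, chars, stops, preprocess(example, lower, chars, stops))
```

Running a similarity method once per configuration, as in the loop above, is one simple way to measure how much each pre-processing choice changes the resulting scores.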

Evaluation metrics
The evaluation metrics used to compare the performance of the methods analyzed are the following: (1) the Pearson correlation, denoted by $r$ in equation (6); (2) the Spearman rank correlation, denoted by $\rho$ in equation (7); and (3) the harmonic score, denoted by $h$ in equation (8). The Pearson correlation evaluates the linear correlation between two random samples, whilst the Spearman rank correlation is rank-invariant and evaluates the monotonic relationship between two random samples, and the harmonic score allows comparing sentence similarity methods by using a single weighted score based on their performance in Pearson and Spearman correlation.
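The three metrics can be sketched in plain Python as follows. The Pearson and Spearman functions follow their standard textbook definitions, with tied values receiving average ranks. The harmonic score is assumed here to be the harmonic mean of $r$ and $\rho$, since equation (8) is not reproduced in this section; all function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equally sized samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def harmonic_score(r, rho):
    """Assumed definition: harmonic mean of Pearson and Spearman values."""
    return 2 * r * rho / (r + rho)


human = [1, 2, 3, 4, 5]          # hypothetical gold-standard scores
method = [1, 4, 9, 16, 25]       # hypothetical method scores
r, rho = pearson(human, method), spearman(human, method)
print(r, rho, harmonic_score(r, rho))
```

In this toy example the method scores are a monotonic but non-linear function of the human scores, so the Spearman correlation reaches 1 while the Pearson correlation stays below it, which is why reporting both values and a combined score is informative.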
Finally, we use the well-known Student's t-test to carry out a statistical significance analysis of the results of the evaluation of the methods in the three biomedical datasets shown in table 1. In order to compare the overall performance of the semantic measures that are evaluated in our experiments, we use the harmonic score average in all datasets. The statistical significance of the results is evaluated using the p-values resulting from the t-test for the mean difference between the harmonic score values reported by each pair of semantic measures in all datasets. The p-values are computed using a one-sided Student's t-distribution on two paired random sample vectors made up of the harmonic ($h$) score values obtained in the evaluation of the three aforementioned datasets. Our null hypothesis, denoted by $H_0$, is that the difference in the average performance between each pair of compared sentence similarity methods is 0, whilst the alternative hypothesis, denoted by H [...]

Statistical performance analysis of the best methods
In order to answer the RQ5 research question, we study how well the sentence similarity methods estimate the degree of semantic similarity between two sentences by analyzing the deviation of their estimated values from the human similarity scores. We want to analyze why the methods perform well or poorly on specific sentence pairs in order to explain this behaviour, as well as to identify the main drawbacks and limitations of the current state-of-the-art methods.
To carry out this performance analysis, we analyze the statistics of the similarity error function $E_{sim}$ of the methods defined in equation (9). We only use some sentences extracted from the BIOSSES dataset for this analysis because this dataset has no licensing restrictions on its use, which allows us to reproduce their sentences herein, unlike MedSTS. On the other hand, we could have also used CTR because it has no licensing restrictions; however, CTR has not been previously used in this sentence similarity task.
Our methodology to conduct the performance analysis is detailed below:
1. Selection of the best-performing method from each family of methods.
2. Estimation of the Probability Density Function (PDF) of the E_sim function for the evaluation of the selected best-performing methods in each dataset by calling the "density" function provided by the R statistical package.
3. Selection of the sentences based on their similarity error in the BIOSSES dataset:
3.1 the sentences with the lowest and highest absolute similarity error |E_sim| for each method are extracted.
3.2 each sentence selected in the step above is pre-processed using the best pre-processing configuration for each method.
3.3 the resulting pre-processed sentences and the statistical information of the similarity scores are analyzed in the Discussion section.
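Steps 2 and 3.1 of the methodology above can be illustrated with a minimal Python sketch. It is not the R script used in our experiments. It assumes that the similarity error of a sentence pair is the estimated score minus the human normalized score, and it approximates the density estimation with a Gaussian kernel and a rule-of-thumb bandwidth in the spirit of the defaults of the R "density" function. The function names and all score values are hypothetical.

```python
import math
import statistics


def similarity_errors(predicted, human):
    """Similarity error per sentence pair (assumed: predicted minus human score)."""
    return [p - h for p, h in zip(predicted, human)]


def extreme_pairs(errors):
    """Indices of the pairs with the lowest and highest absolute error."""
    order = sorted(range(len(errors)), key=lambda i: abs(errors[i]))
    return order[0], order[-1]


def gaussian_kde(errors, grid):
    """Gaussian kernel density estimate of the errors, evaluated on grid.

    Assumes the errors are not all identical, otherwise the bandwidth is 0.
    """
    n = len(errors)
    q1, _, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    spread = min(statistics.stdev(errors), (q3 - q1) / 1.34)
    bandwidth = 0.9 * spread * n ** (-0.2)  # rule-of-thumb bandwidth
    norm = n * bandwidth * math.sqrt(2.0 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - e) / bandwidth) ** 2) for e in errors) / norm
        for x in grid
    ]


# Hypothetical normalized scores in [0, 1] for six sentence pairs.
human = [0.10, 0.35, 0.50, 0.65, 0.80, 0.95]
predicted = [0.12, 0.20, 0.45, 0.40, 0.70, 0.55]

errors = similarity_errors(predicted, human)
best, worst = extreme_pairs(errors)
grid = [i / 100.0 for i in range(-100, 101)]  # errors lie in [-1, 1]
density = gaussian_kde(errors, grid)

print("mean similarity error:", round(statistics.mean(errors), 3))
print("lowest |error| at pair", best, "and highest at pair", worst)
```

A negative mean error in this sketch corresponds to a method that underestimates the human scores on average, and a positive one to a method that overestimates them.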

Software implementation
We have developed a new sentence similarity measures library for the biomedical domain called HESML-STS, which is based on HESML V1R5 [38,39], as detailed in table 6. All our experiments are generated by running the HESMLSTSclient and HESMLSTSImpactpre-processingclient programs, which generate a raw output file in comma-separated file format (*.csv) for each dataset detailed in table 1. The raw output files contain the raw similarity values returned by each sentence similarity method in the evaluation of the degree of similarity between sentences. The final results for the Pearson and Spearman correlation, and the harmonic values detailed in table 8, are automatically generated by running an R-language script file on the collection of raw similarity files, which also generates all the tables reported in appendices A and B provided as supplementary material. All tables are written both in LaTeX and comma-separated (*.csv) formats. For a more detailed description of the protocol for running our experiments, we refer the reader to the protocol [41] detailed in appendix C.
We implemented a parser for loading pre-trained embedding models based on FastText [56] and other word embedding models [77][78][79][80][81], which are efficiently evaluated as sentence similarity measures in HESML by implementing the averaging Simple Word EMbeddings (SWEM) approach introduced by Shen et al. [99]. On the other hand, the software replication required to evaluate sentence embeddings and BERT-based language models is extremely complex and out of the scope of this work. For this reason, these models are evaluated using the original software artifacts used to generate the aforementioned pre-trained models. Thus, we implemented a collection of Python wrappers for evaluating the available models by using the provided software artifacts as follows: (1) Sent2vec-based models [33] are evaluated using the Sent2vec library [83]; (2) Flair models [76] are evaluated using the flairNLP framework [76]; and (3) USE models [82] are evaluated using the open source platform TensorFlow [100]. All BERT-based pre-trained models are evaluated using the open source bert-as-a-service library [101].
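The averaging SWEM approach mentioned above can be illustrated with a minimal Python sketch. It is not the HESML implementation, which is written in Java. It assumes that the sentence vector is the mean of the vectors of the in-vocabulary tokens and that two sentence vectors are compared with the cosine similarity. The function names and the three-dimensional toy vectors are hypothetical.

```python
import math


def swem_average(tokens, embeddings):
    """Mean vector of the in-vocabulary tokens (unknown tokens are skipped)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors of the same dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def swem_similarity(tokens1, tokens2, embeddings):
    """Similarity of two tokenized sentences; 0.0 if either has no known token."""
    v1 = swem_average(tokens1, embeddings)
    v2 = swem_average(tokens2, embeddings)
    if v1 is None or v2 is None:
        return 0.0
    return cosine(v1, v2)


# Hypothetical toy vectors; real pre-trained models have hundreds of dimensions.
toy_embeddings = {
    "gene": [1.0, 0.0, 0.0],
    "protein": [0.8, 0.6, 0.0],
    "mutation": [0.0, 1.0, 0.0],
    "cell": [0.0, 0.0, 1.0],
}

s1 = ["gene", "mutation"]
s2 = ["protein", "mutation"]
s3 = ["cell"]

print(swem_similarity(s1, s2, toy_embeddings))
print(swem_similarity(s1, s3, toy_embeddings))
```

Because the averaging step discards word order, any two sentences with the same bag of in-vocabulary tokens obtain a similarity of 1 in this sketch.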

Reproducing our benchmarks
For the sake of reproducibility, we introduce a detailed reproducibility protocol at protocols.io [41] that is based on a reproducibility dataset [42] containing all the software and data necessary to allow the exact replication of all our experiments and results. Our reproducibility protocol is mainly based on a Docker-based image 4 that includes a pre-installation of all the necessary software and the Java source code and binary files of our benchmark program. Our source code files are tagged in GitHub with a permanent tag named "SentenceSimilarityBenchmark" 5 .
In addition, we plan to submit a Lab Protocol 6 article under preparation [102], which will provide a detailed description of the publicly available reproducibility dataset [42] and a very detailed reproducibility protocol [41] to allow the exact replication of all our methods, experiments, and results. We also plan to submit a research article under preparation [103] to introduce the new HESML-STS software library integrated into the latest HESML V2R1 version, together with a set of reproducible benchmarks on semantic measures libraries for biomedical sentence similarity. The new HESML V2R1 release will be made publicly available soon, once we have appropriately separated the configurations requiring software restricted by third-party licenses, such as the cTAKES and Metamap NER tools, from the rest of the project. However, our reproducibility dataset allows the full and exact replication of all our experiments by fulfilling the licensing requirements of the National Library of Medicine (NLM) of the United States 7 for the UMLS databases and the aforementioned NER tools.
Table 6 details all the reproducibility resources provided as supplementary material with this work. Our benchmarks are implemented using the Java 8, Python 3 and R programming languages, and thus, they can be reproduced in any Java-compliant or Docker-compliant platform, such as Windows, MacOS, or any Linux-based system.

Table 6. Supplementary material and reproducibility resources of this work.

Material | Description
Reproducibility dataset [42] | All raw input and output data files, pre-trained model files, and a long-term reproducibility image based on Docker, which is publicly available in the Spanish Dataverse Network 8
Reproducibility protocol [41] | Raw step-by-step instructions to download the required resources and reproduce the experiments evaluated in this work
Lab Protocol article [102] (under preparation) | Data and methods article introducing a very detailed description of our experiments, datasets, and reproducibility protocol to allow the independent replication of our experiments and results
HESML-STS software library (integrated into HESML V2R1) | Release of the new HESML-STS library. This library is based on the previous HESML V1R5 version [38,39] published in GitHub 9 and the Spanish Dataverse Network [42] under a CC BY-NC-SA 4.0 license.
HESML V2R1 software release (under preparation) | Release of the new HESML V2R1 version, which will be published soon. This new release will be based on the previous HESML V1R5 version, including the new HESML-STS software package that has been developed for this work, after managing all the licensing restrictions of the NER tools.
HESML-STS software paper [103] (under preparation) | Software article introducing our sentence similarity library, called HESML-STS, together with some benchmarks under preparation.

Results obtained
Table 7 shows the selected pre-processing configuration of each method for obtaining their best-performing results, whilst table 8 shows the results obtained in the evaluation of all methods in the three biomedical datasets evaluated herein by using their best pre-processing configurations. Table 9 shows the comparison of results for the highest (best) and lowest (worst) average harmonic score values for the best-performing method of each family shown in blue in table 8, which are defined by the method obtaining the highest average harmonic score. Furthermore, table 10 shows the results obtained in our study on the impact of NER tools on the performance of the sentence similarity methods in the evaluation of the MedSTS dataset [50]. Table 11 shows the harmonic and average harmonic scores obtained in the evaluation of the three biomedical datasets, as well as the resulting p-values comparing the NER tools for each ontology-based method. Table 12 shows the results obtained in the evaluation of the LiBlock method in the three biomedical datasets by using its best pre-processing configuration, and annotating the sentences with all the NER tools combinations. In addition, the aforementioned table details the resulting p-values comparing the best-performing LiBlock-NER combination with the other NER tools. Tables 13, 14, 15, and 16 show the raw input sentence pairs and their corresponding pre-processed versions in which the best-performing methods obtain the lowest and highest similarity error (E_sim) in the BIOSSES dataset [30]. Table 17 details the statistical information for the best-performing methods of each family in the evaluation of the three biomedical datasets evaluated herein. Finally, figure 5 shows the Probability Density Function (PDF) of the similarity error obtained by the best-performing methods of each family in the evaluation of the BIOSSES, MedSTS, and CTR datasets respectively.
On the other hand, appendix A shows the resulting p-values comparing all the methods using their best pre-processing configuration as detailed in table 8, which allows us to study the statistical significance of the results, as detailed in the Discussion section. In addition, appendix B shows the experimental results on the impact of pre-processing configurations in all the methods evaluated herein, whose best configuration has been used to determine the final scores for each method. Finally, appendix C details the protocol for reproducing all the experiments evaluated herein, which is also published in protocols.io [41].
LiBlock (M4) obtains the highest Spearman correlation value in all datasets among the family of string-based methods. This conclusion can be drawn by looking at the results for the first group of methods detailed in table 8.
LiBlock (M4) obtains the highest harmonic score in all datasets among the family of string-based methods. This conclusion can be drawn by looking at the results for the first group of methods detailed in table 8.

Comparison of Ontology-based methods
COM (M17) obtains the highest average harmonic score among the family of ontology-based methods and significantly outperforms all of them, with the only exception of WBSM-Rada (M7). This conclusion can be drawn by looking at the average column in table 8 for the second group of methods and checking the p-value shown in table A.1 for the comparison of COM (M17) with WBSM-Rada (M7) (p-value=0.088).
COM (M17) obtains the highest Pearson correlation value in the BIOSSES and CTR datasets among the family of ontology-based methods, whilst the WBSM-Rada (M7) method obtains the highest Pearson correlation value in the MedSTS dataset. This conclusion can be drawn by looking at the second group of methods in table 8.
COM (M17) obtains the highest Spearman correlation value in the BIOSSES dataset among the family of ontology-based methods, whilst WBSM-Rada (M7) and UBSM-Rada (M12) do it in the MedSTS and CTR datasets, respectively. This conclusion can be drawn by looking at the second group of methods in table 8.
COM (M17) obtains the highest harmonic score in the BIOSSES and CTR datasets among the family of ontology-based methods, whilst WBSM-Rada (M7) does it in the MedSTS dataset. This conclusion can be drawn by looking at the second group of methods detailed in table 8.

Comparison of embeddings methods
BioWordVec int (M26) obtains the highest average harmonic score in all datasets among the family of embedding methods detailed in table 4, and significantly outperforms all of them. This conclusion can be drawn by looking at the third group of methods in table 8 and checking the p-values reported in table A.1, which compare the harmonic score values obtained by the BioWordVec int (M26) method with the rest of methods from the same family, such as FastText-SkGr-BioC (p-value=0.032), BioWordVec ext (p-value=0.007), and BioSentVec (p-value=0.022), among others.
BioWordVec int (M26) obtains the highest Pearson correlation value in the BIOSSES and MedSTS datasets among the family of embedding methods, whilst the Newman-Griffis word2vec sgns (M22) model does it in the CTR dataset. This conclusion can be drawn by looking at the results for the third group of methods detailed in table 8.
BioWordVec int (M26) obtains the highest Spearman correlation in the BIOSSES and MedSTS datasets among the family of embedding methods, whilst the Newman-Griffis word2vec sgns (M22) model does it in the CTR dataset. This conclusion can be drawn by looking at the results for the third group of measures detailed in table 8.
BioWordVec int (M26) obtains the highest harmonic score in the BIOSSES and MedSTS datasets among the family of embedding methods, whilst the Newman-Griffis word2vec sgns (M22) model does it in the CTR dataset. This conclusion can be drawn by looking at the results for the third group of measures detailed in table 8.

Comparison of BERT-based methods
ouBioBERT (M47) obtains the highest average harmonic score among the family of BERT-based methods. However, it does not significantly outperform all of them. This conclusion can be drawn by looking at the last group of methods in table 8 and checking the p-values reported in table A.1. Table A.1 shows that ouBioBERT obtains p-values higher than 0.05 when it is compared with many BERT-based methods, such as BioBERT Large 1.1 (p-value=0.224) and PubMedBERT (abstracts+full text) (p-value=0.101), among others.
NCBI-BlueBERT Large PubMed (M40) obtains the highest Pearson correlation value in the BIOSSES dataset among the family of BERT-based methods, whilst the NCBI-BlueBERT Base PubMed + MIMIC-III (M41) and the ouBioBERT (M47) models do it in the MedSTS and the CTR datasets, respectively. This conclusion can be drawn by looking at the last group of measures detailed in table 8.
ouBioBERT (M47) obtains the highest Spearman correlation value in the BIOSSES dataset among the family of BERT-based methods, whilst SciBERT (M43) and NCBI-BlueBERT Base PubMed (M39) do it in the MedSTS and CTR datasets, respectively. This conclusion can be drawn by looking at the last group of measures detailed in table 8.
ouBioBERT (M47) obtains the highest harmonic score in the BIOSSES dataset among the family of BERT-based methods, whilst SciBERT (M43) and NCBI-BlueBERT Base PubMed (M39) do it in the MedSTS and CTR datasets, respectively. This conclusion can be drawn by looking at the last group of measures detailed in table 8.

Comparison of all methods
LiBlock (M4) obtains the highest average harmonic score for all the methods evaluated herein, and significantly outperforms all the methods based on embeddings and language models. However, there is no statistically significant difference in performance with the ontology-based methods COM (M17) and WBSM-Rada (M7). This conclusion can be drawn by looking at the average column in table 8 and checking the p-values reported in table A.1, which compare the harmonic score obtained by the LiBlock method with the COM (p-value=0.121) and WBSM-Rada (p-value=0.098) methods.
BioWordVec int (M26) obtains the highest Pearson correlation value in the BIOSSES dataset among all methods evaluated herein, whilst WBSM-Rada (M7) and Newman-Griffis word2vec sgns (M22) do it in the MedSTS and CTR datasets, respectively. This conclusion can be drawn by looking at the bold values detailed in table 8.
LiBlock (M4) obtains the highest Spearman correlation value in the BIOSSES and MedSTS datasets among all methods evaluated herein, whilst Newman-Griffis word2vec sgns (M22) does it in the CTR dataset. This conclusion can be drawn by looking at the bold values detailed in table 8.
LiBlock (M4) obtains the highest harmonic score in the BIOSSES dataset among all methods evaluated herein, whilst WBSM-Rada (M7) and Newman-Griffis word2vec sgns (M22) do it in the MedSTS and CTR datasets, respectively. This conclusion can be drawn by looking at the bold values detailed in table 8.
COM (M17) obtains the second highest average harmonic score among all methods evaluated herein, and it is able to outperform significantly all methods with the only exceptions of LiBlock (M4) and WBSM-Rada (M7). This conclusion can be drawn by looking at the bold values detailed in table 8 and checking the p-values reported in table A.1.
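The kind of significance test behind the p-values discussed in this section can be illustrated with a minimal Python sketch. It is not the R script used in our experiments, and its numbers are not taken from table 8 or table A.1. It assumes that the harmonic score is the harmonic mean of the Pearson and Spearman correlation values, and that the one-sided alternative is that the first method obtains a higher mean harmonic score than the second. Because three datasets give two degrees of freedom, the sketch uses the closed-form Student's t cumulative distribution function for that case. The function names and all correlation values are hypothetical.

```python
import math
import statistics


def harmonic_score(pearson, spearman):
    """Harmonic mean of the Pearson and Spearman correlations (assumed definition)."""
    return 2.0 * pearson * spearman / (pearson + spearman)


def paired_one_sided_p_value(scores_a, scores_b):
    """p-value for the alternative mean(a - b) > 0 with three paired scores.

    Assumes the three paired differences are not all identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    if len(diffs) != 3:
        raise ValueError("the closed-form CDF below only holds for 2 degrees of freedom")
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(3.0))
    cdf = 0.5 + t / (2.0 * math.sqrt(2.0 + t * t))  # Student's t CDF for df = 2
    return 1.0 - cdf


# Hypothetical (Pearson, Spearman) values of two methods in three datasets.
method_a = [(0.80, 0.78), (0.75, 0.70), (0.60, 0.62)]
method_b = [(0.74, 0.72), (0.70, 0.66), (0.57, 0.60)]

h_a = [harmonic_score(r, rho) for r, rho in method_a]
h_b = [harmonic_score(r, rho) for r, rho in method_b]
p_value = paired_one_sided_p_value(h_a, h_b)

print("average harmonic scores:", statistics.mean(h_a), statistics.mean(h_b))
print("one-sided p-value:", p_value)
```

With only three paired observations the test has very little power, which is consistent with several of the comparisons above not reaching significance despite a visible difference in the average harmonic score.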
at the results shown in table 10.
cTAKES obtains the highest average harmonic score for the three datasets in combination with the UBSM-Rada (M12), UBSM-coswJ&C (M15) and COM (M17) methods, whilst MetamapLite obtains the highest average harmonic score for the three datasets in combination with UBSM-J&C (M13), UBSM-cosJ&C (M14) and UBSM-Cai (M16). This conclusion can be drawn by looking at the harmonic scores of the NER tools in table 11.
cTAKES combined with COM (M17) obtains the best-performing results of ontology-based methods for the three datasets. This conclusion can be drawn by looking at the average harmonic scores column shown in table 11.
cTAKES is the best-performing tool in combination with the UBSM-Rada (M12), UBSM-coswJ&C (M15), and COM (M17) methods in the three datasets, and significantly outperforms MetamapLite and Metamap for the two former methods. However, there is no statistically significant difference with respect to the Metamap tools when it is combined with the COM (M17) method. This conclusion can be drawn by looking at the average harmonic scores and p-values shown in table 11.
MetamapLite is the best-performing tool in combination with the UBSM-J&C (M13), UBSM-cosJ&C (M14), and UBSM-Cai (M16) methods in the three datasets, and significantly outperforms cTAKES and Metamap. This conclusion can be drawn by looking at the average harmonic scores and p-values shown in table 11.
The choice of the best NER tool for each method significantly impacts its performance in most cases. This conclusion follows from the conclusions above.
Answering RQ3. Our results show that the ontology-based methods obtain their best performance in the task of biomedical sentence similarity when they use either MetamapLite or cTAKES. Thus, Metamap should not be used in combination with any of the ontology-based methods evaluated herein in this task. Likewise, the results and p-values reported in table 11 show that there is a significant difference in the performance of each ontology-based method according to the NER tool used in most cases. The conclusions above confirm that the selection of the NER tool significantly impacts the performance of the sentence similarity methods using it.

Impact of the NER tools on the new LiBlock measure
This section analyzes the impact of the NER tools on the new sim_LiBk similarity measure. Table 12 shows the results obtained by the sim_LiBk measure in the three biomedical datasets using its best pre-processing configuration, and annotating the sentences with all the combinations of NER tools. In addition, the aforementioned table details the resulting p-values comparing the best-performing LiBlock-NER combination with the combinations based on the other two NER tools.
LiBlock-cTAKES obtains the highest average harmonic score for the three datasets among the LiBlock-NER combinations. However, it does not significantly outperform LiBlock with no use of a NER tool. This conclusion can be drawn by looking at the average column in table 12 and checking the p-values in the last column. This conclusion is especially relevant because it shows that there is no statistically significant difference between using a NER tool like cTAKES or not using it in the case of the LiBlock measure. We conjecture that this conclusion could have two causes: firstly, the inability of LiBlock to capture semantic relationships beyond synonymy, and secondly, the current limitations of cTAKES in recognizing all mentions of biomedical entities.
LiBlock-cTAKES obtains the highest Pearson correlation value in the BIOSSES dataset among all LiBlock-NER combinations, whilst LiBlock with no use of a NER tool obtains the highest Pearson correlation value in the MedSTS and CTR datasets. This conclusion can be drawn by looking at the results detailed in table 12.
LiBlock-cTAKES obtains the highest Spearman correlation value in the BIOSSES and MedSTS datasets among the LiBlock-NER combinations, whilst LiBlock-cTAKES and LiBlock-MetamapLite obtain the highest Spearman correlation value in the CTR dataset. This conclusion can be drawn by looking at the results detailed in table 12.
LiBlock-cTAKES obtains the highest harmonic score in the BIOSSES and MedSTS datasets among the LiBlock-NER combinations, whilst LiBlock-MetamapLite obtains the highest harmonic score in the CTR dataset. This conclusion can be drawn by looking at the results detailed in table 12.

Impact of the remaining pre-processing stages
This section analyzes the impact of each pre-processing step on the performance of the sentence similarity methods, except for the NER tools already analyzed in the previous section. Finally, we study the overall impact of the pre-processing configurations.

Impact of tokenization
The family of string-based methods obtains its best-performing results either by splitting the sentence at the white spaces between words or by using the Stanford CoreNLP tokenizer. This conclusion can be drawn by looking at table 7, which summarizes the pre-processing tables detailed in Appendix B.
The family of ontology-based methods obtains its best-performing results in combination with the Stanford CoreNLP tokenizer. This conclusion can be drawn by looking at table 7.
The family of methods based on embeddings obtains its best-performing results in combination with the Stanford CoreNLP tokenizer, with the only exception of Flair (M18). This conclusion can be drawn by looking at table 7.
No method based on strings, ontologies, or embeddings obtains its best-performing results in combination with the BioCNLPTokenizer. This conclusion can be drawn by looking at table 7. Thus, the BioCNLPTokenizer should not be used in combination with any method in the former families in the task of biomedical sentence similarity. On the other hand, we recall that all BERT-based methods evaluated herein can only be used in combination with the WordPiece Tokenizer [90] based on a subword segmentation algorithm, because it is required by the current BERT implementations.
All families of methods show a strong preference for a specific tokenizer, with the only exception of the string-based one. This conclusion can be drawn from the previous conclusions, which confirm the preference of the methods based on ontologies and embeddings for the CoreNLP tokenizer, and the mandatory use of the WordPiece tokenizer by the family of BERT-based methods.

Impact of character filtering
The family of string-based methods obtains its best-performing results by using either the BIOSSES char-filtering method or the default method, which removes the punctuation marks and special symbols from the sentences, with the only exception of the Levenshtein distance method (M5), which does not remove special characters. This conclusion can be drawn by looking at table 7, which summarizes the pre-processing tables detailed in Appendix B.
All ontology-based methods obtain their best-performing results in combination with the BIOSSES char-filtering method. This conclusion can be drawn by looking at table 7.
Most embedding methods obtain their best-performing results in combination with the default char-filtering method. However, Flair (M18), BioWordVec (M26, M27), and BioSentVec (M32) obtain their best-performing results with the BIOSSES char-filtering method. This conclusion can be drawn by looking at table 7.
The BERT-based methods do not show a noticeable preference pattern for a specific char-filtering method, obtaining their best-performing results with the BIOSSES, Blagec2019, or the default one. This conclusion can be drawn by looking at table 7.

Impact of stop-words removal
All string-based methods obtain their best-performing results in combination with the NLTK2018 stop-word list, with the only exception of the Levenshtein distance (M5). This conclusion can be drawn by looking at table 7, which summarizes the pre-processing tables detailed in Appendix B.
All ontology-based methods obtain their best-performing results in combination with the NLTK2018 stop-word list, with the only exceptions of WBSM-J&C (M8) and WBSM-cosJ&C (M9), which do not remove stop words. This conclusion can be drawn by looking at table 7.
The methods based on embeddings do not show a noticeable preference pattern for a specific stop-word list, obtaining their best-performing results by using the stop-word list of BIOSSES, NLTK2018, or none. This conclusion can be drawn by looking at table 7.
The methods based on language models do not show a noticeable preference pattern for a specific stop-word list, obtaining their best-performing results by using the stop-word list of BIOSSES, NLTK2018, or none. This conclusion can be drawn by looking at table 7.
The best-performing results for the methods based on strings or ontologies show a noticeable preference for the NLTK2018 stop-word list. This conclusion can be drawn by looking at table 7.

Impact of lower-casing
Only 10 of the 50 methods evaluated in this work obtain their best performance by avoiding converting words to lowercase at the sentence pre-processing stage. This conclusion can be drawn by looking at tables 7 and 8, and the pre-processing tables detailed in Appendix B. Moreover, these ten methods obtain a low performance in our experiments, with the only exception of the BioNLP2016 win30 (M29) pre-trained model, which obtains the third best Spearman correlation value in the CTR dataset. Thus, our experiments confirm that the lower-casing normalization of the sentences positively impacts the performance of the methods, and it should be considered as the default option in any biomedical sentence similarity task.
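The pre-processing stages analyzed in this section can be pictured as a small configurable pipeline, as in the minimal Python sketch below. It is not the HESML-STS implementation. It assumes a fixed order of stages (lower-casing, character filtering, white-space tokenization, and stop-word removal) and a tiny hypothetical stop-word list that is neither the NLTK2018 nor the BIOSSES list. The function and parameter names are hypothetical.

```python
import string

# Hypothetical stop-word list used only for illustration.
TOY_STOP_WORDS = {"the", "of", "in", "a", "is", "to", "and"}

# Maps every ASCII punctuation mark to a white space.
PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(sentence, lower_case=True, filter_chars=True, stop_words=TOY_STOP_WORDS):
    """Returns the list of tokens of a sentence for one pre-processing configuration."""
    if lower_case:
        sentence = sentence.lower()
    if filter_chars:
        sentence = sentence.translate(PUNCTUATION_TO_SPACE)
    tokens = sentence.split()  # white-space tokenization
    if stop_words:
        tokens = [t for t in tokens if t.lower() not in stop_words]
    return tokens


sentence = "The role of K-ras in NSCLC."

# One configuration with every stage enabled and one with every stage disabled.
print(preprocess(sentence))
print(preprocess(sentence, lower_case=False, filter_chars=False, stop_words=None))
```

Evaluating a method under every combination of such switches, and keeping the combination with the highest score, mirrors the way the best pre-processing configuration of each method is selected in table 7.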
We conjecture that lower-casing improves the performance of the string-based and ontology-based methods because it improves the exact comparison of words. On the other hand, we also conjecture that the impact of lower-casing the sentences on the families of methods based on embeddings and language models strongly depends on the pre-processing methods used in their training.
task (see table 8). However, ouBioBERT is unable to outperform significantly all remaining methods from the same family (see Appendix A.1).
Finally, our results show that our new string-based method, called LiBlock (M4), obtains the best overall results, despite not capturing the semantic information of the sentences. This is a very noticeable finding because it contradicts a common belief that the ontology-based methods integrating word and concept semantics should outperform the non-semantic methods in this similarity task. A second and very noticeable finding is that our non-semantic and non-ML LiBlock method is able to outperform significantly state-of-the-art methods based on large ML models trained with the most recent and advanced word embeddings [46] and BERT language models [85] in an unsupervised context. This finding is very remarkable because LiBlock is easy to implement and evaluate, very efficient (2635 sentence pairs per second with no use of a NER tool), and it requires neither large text resources nor complex algorithms for its training and evaluation, which is a very clear advantage in the biomedical sentence similarity task.
Answering RQ1 and RQ2. The string-based method LiBlock (M4) obtains the highest average harmonic score in all datasets, and significantly outperforms the remaining string-based methods, as well as all methods based on embeddings and BERT language models, and all the ontology-based methods with the only exceptions of COM (M17) and WBSM-Rada (M7). In addition, LiBlock obtains the highest Spearman correlation values in the BIOSSES and MedSTS datasets, which contain 100 and 1068 sentence pairs respectively.

Main drawbacks and limitations of current methods
This section analyzes the behaviour of the best-performing methods in each family of sentence similarity methods to answer our RQ5. The best-performing methods of each family, according to the harmonic average value reported in table 8, are LiBlock (M4), COM (M17), BioWordVec int (M26), and ouBioBERT (M47).
String and ontology-based methods underestimate on average the human similarity value in the BIOSSES and CTR datasets, whilst their average similarity error is close to 0 in the MedSTS dataset. This conclusion can be drawn by looking at the average similarity error values and the mean error values shown in figure 5 together with the mean values shown in table 17. LiBlock and COM obtain mean error values of -0.021 and -0.001 in MedSTS, as shown in figure 5.b. On the other hand, both methods report a mean similarity score much lower than the mean of the Human normalized score in the BIOSSES and CTR datasets and a mean similarity score close to the Human normalized score in the MedSTS dataset, as shown in table 17.
The methods based on embeddings and language models overestimate on average the human similarity value in the three datasets. This conclusion can be drawn by looking at the average similarity error values and the mean error values shown in figure 5, together with the mean similarity values shown in table 17. The two aforementioned families of methods report a mean similarity score much higher than the mean of the Human normalized score in the three datasets, as shown in table 17.
String and ontology-based methods share a similar underestimation behaviour, in opposition to the overestimation behaviour shown by the methods based on embeddings and language models, which is very noticeable in the three datasets. This conclusion can be drawn by looking at the minimum and maximum similarity values columns in table 17, and the plots of the probability error distribution function for the three datasets in figure 5. For instance, although the human similarity scores are in the range of 0 to 1 in the BIOSSES dataset, as shown in table 17, the string and ontology-based methods report similarity scores in the range of 0 to 0.596, whilst the methods based on embeddings and language models report similarity scores in the range of 0.582 to 0.987.
String and ontology-based methods tend to obtain their best results in sentences with a Human normalized score close to 0, whilst the methods based on embeddings and language models obtain their best results in sentences with a Human normalized score close to 1. This conclusion can be drawn by looking at tables 13, 14, 15 and 16. On the other hand, string and ontology-based methods tend to obtain their worst results in sentences with a Human normalized score close to 1, whilst the methods based on embeddings and language models obtain their worst results in sentences with a Human normalized score close to 0.
None of the methods for semantic similarity of sentences in the biomedical domain evaluated herein use an explicit syntactic analysis or syntax information to obtain the similarity value.We conjecture that syntactic analysis would improve the performance in some cases.For instance, the sentences s1 and s2 with highest E sim in table 13 shows an implicit relation between the concepts "miRNA" and "oncogenesis", which should increase the final semantic similarity score of the sentences.However, none of the methods evaluated herein consider and reward these semantics relationships because its recognition demands some form of syntactic analysis.On the one hand, string and ontology-based methods consider the concepts in a sentence as bags of words, whilst on the other hand the methods based on embeddings and language models implicitly consider the structure of the sentences but not the relationships between the parts of the sentences that are related.
Our results show that the family of string-based methods is rewarded by the high frequency of overlapping words in the sentences of the current biomedical datasets, whilst the former methods are not able to deal properly with sentences that are semantically different but not exhibit a word overlapping pattern.The main advantages of the string-based methods are as follows: (1) they are able to obtain high correlation values without the need of using external resources for their training or evaluation; (2) they are fast and efficient; and finally; (3) they require low computational resources.However, string-based methods are unable to capture the semantics of the words in the sentence, which prevent them from recognizing semantic relationships, such as synonymy, meronymy and morphological variants.On the other hand, the use of NER tools in combination with string-based methods is a good option to integrate at least the capability of recognizing synonyms, as shown by LiBlocK-CTakes (M4).
Ontology-based methods strongly depend on the lexical coverage of the ontologies and the ability to automatically recognize the underlying concepts in sentences. Our results show that the ontology-based methods are able to properly estimate a similarity score when they are evaluated in a dataset with either high word overlapping or NER and WSD tools that find all possible entities needed to properly calculate the similarity between sentences. The main advantages of ontology-based methods are that they are fast and require low computational resources. However, the effectiveness of the ontology-based methods depends on the lexical coverage of the ontologies and the ability of the NER and WSD tools to recognize the underlying concepts in sentences, whose coverage and performance could be limited in several application domains.
The LiBlock (M4) string-based method and the COM (M17) ontology-based method use a NER tool in the pre-processing stage to recognize the biomedical entities (UMLS CUI codes) present in the input sentences. The objective of annotating entities in the semantic similarity task is the identification and disambiguation of biomedical concepts to provide semantic information to sentences. LiBlock uses the NER tool to normalize and disambiguate the underlying concepts in a sentence, unifying different concepts with acronyms and synonyms in the same CUI code and creating an overlap between concepts, whilst ontology-based methods also make use of the similarity between concepts within the ontologies.
The biomedical NER tools evaluated in this work are unable to correctly identify and disambiguate many biomedical concepts due to the use of acronyms and different morphological variations, among others. For example, the CUI concepts "KRAS gene" (C1537502), "BRAF gene" (C0812241), and "RAF1 gene" (C0812215) in the sentences s1 and s2 with the highest $E_{sim}$ obtained by the COM (M17) method in table 14 appear as "K-ras", "Braf", "c-Raf" and "Craf". However, cTAKES is unable to recognize these latter morphological variants of the same biomedical concepts. A second example is the word "act" in the sentence "Consequently miRNAs have been demonstrated to act either as oncogenes [...]", which is wrongly recognized as the entity "Activated clotting time measurement" (C0427611), rather than as a verb, in the sentence s1 with the highest $E_{sim}$ in table 13. And finally, a third example is the acronym "NSCLC", which denotes the concept "Non-Small Cell Lung Carcinoma" (C0007131) and is not recognized in the plural variant "NSCLCs" in the sentence s2 with the highest $E_{sim}$ from table 14.
The methods based on pre-trained embeddings and language models provide a broader lexical coverage than the ontology-based methods, and do not need NER or WSD tools to find intrinsic semantic relationships between the words in the sentences. However, these latter methods need a large corpus for their training, as well as a complex training phase and more computational resources than the string-based and ontology-based families of methods. In addition, our experiments show that those methods tend to estimate higher similarity values than those estimated by a human being in the three datasets. In most cases, the aforementioned methods report similarity scores that tend to 1, which indicates that the semantics obtained from the sentences is not sufficient to correctly compute a similarity score. For instance, the sentences s1 and s2 with the highest $E_{sim}$ from tables 15 and 16 show similarity values close to 1, although the sentences have neither word overlapping nor similar concepts, and the human similarity score is 0 in both cases. Finally, BERT-based methods are trained for downstream tasks, using a supervised approach, and do not perform well in an unsupervised context.
Answering RQ5. String-based methods capture neither the word semantics within the sentences nor the semantic relationships between words, such as synonymy and meronymy, and their effectiveness mainly relies on the word overlapping frequency in the sentences. The LiBlock method uses the NER tool to normalize and disambiguate the underlying concepts in a sentence, but unfortunately, it does not significantly outperform LiBlock with no use of a NER tool, which could be caused by two reasons: firstly, the incapability of LiBlock to capture semantic relationships beyond synonymy, and secondly, the current limitations of cTAKES to recognize all mentions of biomedical entities. On the other hand, ontology-based methods use NER and WSD tools to recognize the underlying concepts in the sentences, which are not able to correctly identify and disambiguate these concepts in many cases. In addition, they require external resources to capture the semantic information from the sentences, which limits their lexical coverage. Thus, ontology-based methods require both high word overlapping and high recognition coverage of named entities to properly estimate the similarity between sentences. In turn, the methods based on pre-trained embeddings and language models need a large corpus for training, a complex training phase, and considerable computational resources to calculate the similarity between sentences. Moreover, those methods tend to obtain high similarity scores in most cases, which may penalize them in a balanced dataset and in a real environment. Finally, BERT-based methods are trained for downstream tasks, using a supervised approach, and do not perform well in an unsupervised context.

Comparison of running times
Table 18 details the running time reported by the best-performing methods of each family, as well as the average number of sentence pairs per second computed by each method for the three datasets evaluated herein. The experiments were executed on a desktop computer with an AMD Ryzen 7 5800x CPU (16 cores), 64 GB of RAM and a 2 TB SSD disk. In all cases, the running time also comprises the pre-processing time for each method. The string-based method Block Distance (M3) obtains the lowest running times because it does not need complex mechanisms or pre-trained models to calculate the similarity between sentences. On the other hand, the BERT-based methods obtain the worst results mainly due to their pre-processing stage, which uses the WordPiece tokenization method.
Table 18. Running times in milliseconds (ms) and average sentence pairs per second (sent/sec) reported by the best-performing method of each family of methods in the evaluation of the 1339 sentence pairs that make up the three datasets. (*) The LiBlock method reports the running times in both NER and noNER versions, showing that the efficiency of the method with no NER tool is much higher, despite the fact that there is no statistically significant difference in the results between both pre-processing configurations.
Inconsistent results in the calculation of the statistical significance matrix. Despite the artificial increase in the number of datasets used to calculate the statistical significance of the results, we have identified an inconsistent result with respect to the comparison of the p-values of the LiBlock (M4) method with the WBSM-Rada (M7) and UBSM-Rada (M12) methods. Table 8 shows that the UBSM-Rada method (M12) has a higher average harmonic score than WBSM-Rada (M7). However, when building the artificial datasets, the p-value comparing UBSM-Rada (M12) with LiBlock (M4) shows a significant difference, while the one comparing WBSM-Rada (M7) with LiBlock (M4) shows a non-significant difference. We conjecture that this problem could be solved by increasing the number of datasets created for this task, which would increase the sample size and yield more consistent results.

Conclusions and future work
We have introduced the largest, detailed, and for the first time, reproducible experimental survey on biomedical sentence similarity reported in the literature. Our work also introduces a collection of self-contained and reproducible benchmarks on biomedical sentence similarity based on the same software platform, called HESML-STS, which has been especially developed for this work, being provided as part of the new HESML V2R1 version that will be made publicly available soon. We provide a detailed reproducibility protocol [41] and dataset [42] to allow the exact replication of all our experiments, methods, and results. In addition, we introduce a new aggregated string-based sentence similarity method called LiBlock, together with eight variants of the ontology-based methods introduced by Sogancioglu et al. [30], and a new pre-trained word embedding model based on FastText [56] and trained on the full text of the articles in the PMC-BioC corpus [19]. We also evaluate for the first time the CTR [51] dataset in a benchmark on biomedical sentence similarity.
The string-based LiBlock (M4) measure sets the new state of the art for the sentence similarity task in the biomedical domain and significantly outperforms all the methods evaluated herein, with the only exception of the COM (M17) and WBSM-Rada (M7) ontology-based methods. However, our data analysis shows that, at least with the three datasets evaluated herein, there is no statistically significant difference between the performance of the LiBlock (M4) method using the cTAKES NER tool and its performance without any NER tool. Thus, using the LiBlock method without any NER tool could be a competitive and much more efficient solution for high-throughput applications.
Concerning the impact of the Named Entity Recognition (NER) tools, our results confirm that the choice of the best NER tool for each method significantly impacts its performance. MetamapLite [93] and cTAKES [60] set the best-performing configurations for the family of ontology-based methods, whilst Metamap [65] sets the best-performing option for none.
Our experiments confirm that the pre-processing stage has a very significant impact on the performance of the sentence similarity methods evaluated herein, although this fact has been neither studied nor reported in the literature. Thus, the selection of the proper configuration for each sentence similarity method should be confirmed experimentally. However, our experiments suggest some default configurations to make these decisions, such as the use of lower-casing normalization, some specific char filtering methods, and some specific tokenizers, with the only exception of BioCNLPTokenizer. Finally, the families of string and ontology-based methods show a noticeable preference for the NLTK2018 stop-words list. For a detailed description of the best pre-processing configurations, we refer the readers to our discussion.
String-based methods capture neither the semantics of the words in the sentence nor the semantic relationships between words, and their effectiveness relies on the word overlapping frequency in the sentences. Ontology-based methods use Named Entity Recognition (NER) and Word Sense Disambiguation (WSD) tools to recognize the underlying concepts in the sentences and require external resources to capture the semantic information from the sentences, which limits their lexical coverage. In addition, they require either high word overlapping or high recognition coverage of named entities in order to properly calculate the similarity between sentences. On the other hand, the methods based on pre-trained embeddings and language models need a large corpus for training, a complex training phase, and considerable computational resources to calculate the similarity between sentences. Moreover, these methods tend to obtain high similarity scores in most cases, which may penalize them in a balanced dataset and in a real environment. Finally, BERT-based methods are trained for downstream tasks, using a supervised approach, and do not perform well in an unsupervised context. Our experiments suggest that the current benchmarks do not cover all the language features that characterize the biomedical domain, such as the frequent use of acronyms and rhetorical expressions like synonymy, meronymy, etc. In addition, current benchmarks have a very limited sample size, which hinders the analysis of results. We conjecture that LiBlock, COM, and UBSM-Rada perform well because there is a noticeable overlap of terms that may benefit these methods over the others reported in the literature. Furthermore, Chen et al. [104] highlight the need to improve and create new benchmarks from different perspectives, to reflect the multifaceted notion of the similarity of sentences. Therefore, we found a strong need for improving existing benchmarks for the task of semantic similarity of sentences in the biomedical domain.
As forthcoming activities, we plan to publish our new software release HESML V2R1, including the HESML-STS software package developed for this work. We also plan to evaluate the new sentence similarity methods introduced herein in a benchmark for the general language domain. In addition, we will study the evaluation of the sentence similarity methods in an extrinsic task, such as semantic medical indexing [105] or summarization [106]. We also consider the evaluation of further pre-processing configurations, such as biomedical NER systems based on recent Deep Learning techniques [10], or extending our experiments and research to the multilingual scenario by integrating multilingual biomedical NER systems like Cimind [107]. Finally, we plan to evaluate some recent biomedical concept embeddings based on MeSH [108], which have not been evaluated in the sentence similarity task yet.

Fig 1. This figure details the workflow for computing the new LiBlock measure and an example illustrating a use case of the workflow following the steps defined in algorithm 1.
$\frac{|a \cap b|}{\min(|a|, |b|)}$, where $a$ and $b$ are the sets of words of the first and second sentence, respectively.
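The set-overlap expression above can be illustrated with a minimal sketch. It assumes lower-cased whitespace tokens as a stand-in for the pre-processing configurations evaluated in this work, and the function name `overlap_coefficient` is hypothetical; it is not the HESML-STS implementation.

```python
def overlap_coefficient(sentence_a: str, sentence_b: str) -> float:
    """Return |a ∩ b| / min(|a|, |b|) for the word sets of two sentences."""
    # Assumption: lower-casing and whitespace splitting replace the real tokenizer.
    a = set(sentence_a.lower().split())
    b = set(sentence_b.lower().split())
    if not a or not b:
        # Convention chosen for this sketch: empty input yields no overlap.
        return 0.0
    return len(a & b) / min(len(a), len(b))


if __name__ == "__main__":
    print(overlap_coefficient("the kras gene", "kras gene mutation is frequent"))
```

Dividing by the size of the smaller set means that a short sentence fully contained in a longer one reaches the maximum value of 1.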

Fig 2. Detail of the pre-processing configurations that are evaluated in this work. (*) WordPieceTokenizer [90] is used only for BERT-based methods.

Fig 3. Detailed workflow implemented by our experiments for pre-processing the input sentences, calculating the raw similarity scores, and post-processing the results obtained in the evaluation of the biomedical datasets. This workflow generates a collection of raw and processed data files.

Fig 5. Probability Density Function (PDF) and mean value of the similarity error ($E_{sim}$) obtained by the best-performing methods in the evaluation of each dataset as follows: (a) BIOSSES, (b) MedSTS, and (c) CTR.

Table 1. Benchmarks on biomedical sentence similarity evaluated in this work.

Table 5. Detailed setup for the sentence similarity methods based on pre-trained language models evaluated in this work.
For a 5% level of significance, it means that if the p-value is greater than or equal to 0.05, we cannot reject the null hypothesis $H_0$. Otherwise, we can reject $H_0$ with an error probability of less than the p-value. In this latter case, we say that a first sentence similarity method obtains a statistically significantly higher value than the second one, or that the former significantly outperforms the latter.
Uniform size datasets for our statistical significance analysis. The scarcity of the datasets and the notable size difference among them, varying from 100 to 1,068 sentence pairs, prevent us both from studying the statistical significance of the results with an adequate sample size and from carrying out a fair comparison of the results. For this reason, we have divided the MedSTS dataset into 10 parts, considered as independent datasets, to perform the study of the statistical significance of the results. Thus, we have artificially obtained 12 datasets of 100 to 200 pairs of sentences. This set of datasets allows us to obtain the p-values comparing the statistical significance between the measures, but does not modify the processed results from table 8. All the necessary resources for obtaining both table 8 and the table containing all the p-values reported in Appendix A are publicly available in the reproducibility dataset and the companion Lab Protocol article under preparation, as detailed in table 6.
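The following minimal sketch illustrates how a dataset of 1,068 sentence pairs could be divided into 10 parts of nearly uniform size, together with the 5% decision rule stated above. The function names, the seeded random shuffle, and the placeholder sentence pairs are assumptions made for illustration; the text does not state how the parts were actually drawn.

```python
import random


def split_into_parts(pairs, n_parts=10, seed=0):
    """Split a list of sentence pairs into n_parts chunks of near-equal size."""
    shuffled = list(pairs)
    # Assumption: a seeded shuffle, so that the split is repeatable.
    random.Random(seed).shuffle(shuffled)
    base, extra = divmod(len(shuffled), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts receive one additional pair each.
        size = base + (1 if i < extra else 0)
        parts.append(shuffled[start:start + size])
        start += size
    return parts


def is_significant(p_value, alpha=0.05):
    """Reject the null hypothesis only when the p-value is below alpha."""
    return p_value < alpha


if __name__ == "__main__":
    # Placeholder pairs standing in for the 1,068 MedSTS sentence pairs.
    dataset = [(f"s1_{i}", f"s2_{i}") for i in range(1068)]
    print([len(part) for part in split_into_parts(dataset)])
```

With 1,068 pairs and 10 parts, every part holds 106 or 107 pairs, which falls within the range of 100 to 200 pairs mentioned above.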

Table 7. Best-performing pre-processing configurations used to evaluate the methods compared in this work as reported in table 8, which are derived from our cross-evaluation of each method with the pre-processing configurations shown in figure 2 (see Appendix B). (*) COM (M17) uses the best configuration of the WBSM-Rada (M7) and UBSM-Rada (M12) methods for computing the similarity scores.

Table 8. Pearson (r), Spearman (ρ), harmonic (h), and harmonic average (AVG) scores obtained by each sentence similarity method evaluated herein in the three biomedical sentence similarity benchmarks arranged by families. All reported values were obtained using the best pre-processing configurations detailed in table 7. The results in bold show the best scores whilst results in blue color show the best average harmonic score for each family.
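As a hedged illustration of the scores named in this caption, the sketch below assumes that the harmonic score (h) is the harmonic mean of the Pearson and Spearman correlations, and that AVG is the arithmetic mean of h over the datasets. Neither definition is given in this caption, so both are assumptions, and the function names and input values are hypothetical.

```python
def harmonic_score(r: float, rho: float) -> float:
    """Harmonic mean of the Pearson (r) and Spearman (rho) correlations.

    Assumption: h = 2 * r * rho / (r + rho), which is only meaningful
    when both correlations are positive.
    """
    if r + rho == 0:
        return 0.0  # convention chosen for this sketch
    return 2.0 * r * rho / (r + rho)


def average_harmonic(scores):
    """Arithmetic mean of the harmonic score over (r, rho) pairs, one per dataset."""
    return sum(harmonic_score(r, rho) for r, rho in scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical correlation values, not taken from table 8.
    print(average_harmonic([(0.8, 0.8), (0.5, 1.0)]))
```

Under this assumption, h penalizes a method whose two correlation values disagree, because the harmonic mean lies closer to the smaller of the two.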

Table 9. Comparison of results for the "best" and the "worst" pre-processing configurations for the best-performing methods of each family in table 8. The last column shows the t-Student p-values comparing the best and worst configurations.

Table 10. Pearson (r), Spearman (ρ) and harmonic (h) values obtained in our experiments from the evaluation of ontology similarity methods detailed below in the MedSTS full [50] dataset for each NER tool.

Table 11. Harmonic score obtained by each combination of a sentence similarity method with a NER tool in the evaluation of the three sentence similarity datasets. The p-values shown in this table are obtained by using the method for building uniform size datasets detailed above. The last column shows the p-values corresponding to the t-Student test comparing the performance of each combination with the best pair in each group.

Table 12. Pearson (r) and Spearman (ρ) correlation values, harmonic score (h), and harmonic average (AVG) score obtained by the LiBlock method in combination with each NER tool using the best pre-processing configuration detailed in table 7. In addition, the last column (p-val) reports the p-values for the comparison of the LiBlock method with cTAKES and the remaining NER combinations.

Table 13. Raw and pre-processed sentence pairs with the lowest (L) and highest (H) similarity error $E_{sim}$, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the LiBlock (M4) method.

Table 14. Raw and pre-processed sentence pairs with the lowest (L) and highest (H) similarity error $E_{sim}$, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the COM (M17) method. We show the raw and pre-processed sentence pairs evaluated by the WBSM and UBSM similarity methods that make up the COM method. The UBSM method uses the cTAKES NER tool.

Table 15. Raw and pre-processed sentence pairs with the lowest (L) and highest (H) similarity error $E_{sim}$, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the BioWordVec int (M26) method.

Table 16. Raw and pre-processed sentence pairs with the lowest (L) and highest (H) similarity error $E_{sim}$, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the OuBioBert (M47) method.

Table 17. Comparison of the mean, minimum and maximum values of the normalized human similarity scores (Human) and the estimated values returned by the best-performing methods of each family in the evaluation of the three biomedical datasets.
Pearson correlation value in the BIOSSES and MedSTS datasets among the family of string-based methods, whilst Block Distance (M3) obtains the highest Pearson correlation in the CTR dataset. This conclusion can be drawn by looking at the results for the first group of methods detailed in table 8.