A reproducible experimental survey on biomedical sentence similarity: A string-based method sets the state of the art

doi:10.1371/journal.pone.0276539

Fig 1.

This figure details the workflow for computing the new LiBlock measure, together with an example use case that follows the steps defined in Algorithm 1.

Table 1.

Benchmarks on biomedical sentence similarity evaluated in this work.

Table 2.

Detailed setup for the string-based sentence similarity measures which are evaluated in this work.

All the string-based measures follow the implementation of Sogancioglu et al. [30], who use the Simmetrics library [71]. The LiBlock method proposed herein is an adaptation from Li et al. [56] combined with a string-based measure, as detailed in the previous section.

Table 3.

Detailed setup for the ontology-based sentence similarity measures evaluated in this work.

The evaluation of the methods using the Rada [69], coswJ&C [46], and Cai [68] word similarity measures uses a reformulation of the original path-based measures based on the new Ancestors-based Shortest-Path Length (AncSPL) algorithm [42].

Table 4.

Detailed setup for the sentence similarity methods based on pre-trained character, word (WE) and sentence (SE) embedding models evaluated herein.

Table 5.

Detailed setup for the sentence similarity methods based on pre-trained language models evaluated in this work.

Fig 2.

Detail of the pre-processing configurations that are evaluated in this work.

(*) WordPieceTokenizer [91] is used only for BERT-based methods [30, 31, 34, 62, 91–94, 99].

Fig 3.

Detailed workflow implemented by our experiments for pre-processing the input sentences, calculating the raw similarity scores, and post-processing the results obtained in the evaluation of the biomedical datasets.

This workflow generates a collection of raw and processed data files.

Fig 4.

Detailed sentence pre-processing workflow implemented in our experiments.

The pre-processing stage takes an input sentence and produces a pre-processed sentence as output. (*) The named entity recognizers are only evaluated in ontology-based methods.

Fig 5.

Figure (a) shows the histograms of the harmonic score obtained by the LiBlock measure [M4] in evaluating the sentence similarity of 10,000 different equal-size subsets of sentence pairs extracted from the MedSTS dataset. Each histogram represents the frequency distribution of 10,000 samples of the harmonic score for subsets of sentence pairs of sizes 100, 300, 600, and 900. Figure (b) shows the Q-Q plot normality test for the harmonic score obtained for a random subset of size 100, along with the p-values reported by the Shapiro-Wilk and Chi-square normality tests.
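The subset-sampling procedure described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the synthetic data, and the use of Pearson correlation as a stand-in score are all assumptions, and the sketch assumes that each subset is drawn without replacement.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation; used here only as a stand-in for the harmonic score."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def subset_score_distribution(pairs, score_fn, subset_size, n_subsets, seed=0):
    """Score n_subsets random equal-size subsets of (gold, predicted) pairs.

    Assumption: every subset is sampled without replacement from `pairs`.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_subsets):
        subset = rng.sample(pairs, subset_size)
        gold = [g for g, _ in subset]
        predicted = [p for _, p in subset]
        scores.append(score_fn(gold, predicted))
    return scores


# Synthetic (hypothetical) data: 1,000 gold scores plus noisy predictions.
data_rng = random.Random(42)
gold_scores = [data_rng.uniform(0.0, 1.0) for _ in range(1000)]
pairs = [(g, g + data_rng.gauss(0.0, 0.2)) for g in gold_scores]

# One distribution per subset size; 200 subsets keeps the demo fast
# (the caption reports 10,000 subsets per size).
distributions = {
    size: subset_score_distribution(pairs, pearson, size, n_subsets=200)
    for size in (100, 300, 600, 900)
}
```

Each list in `distributions` could then be plotted as a histogram or passed to a normality test such as Shapiro-Wilk.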

Table 6.

Supplementary material and reproducibility resources of this work.

Fig 6.

Probability Density Function (PDF) and mean value of the similarity error (E_sim) obtained by the best-performing methods in the evaluation of each dataset as follows: (a) BIOSSES, (b) MedSTS, and (c) CTR.
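As a rough illustration of the quantity summarized here, the sketch below computes a per-pair similarity error and its mean. The caption does not define E_sim, so the code assumes it is the absolute difference between the normalized human score and the normalized method score; the function name and the sample values are hypothetical.

```python
from statistics import mean


def similarity_errors(human, method):
    """Per-pair error, ASSUMED to be |human - method| on normalized scores."""
    return [abs(h - m) for h, m in zip(human, method)]


# Made-up normalized scores in [0, 1] for five sentence pairs.
human = [0.0, 0.25, 0.5, 0.75, 1.0]
method = [0.1, 0.25, 0.9, 0.7, 0.8]

errors = similarity_errors(human, method)
mean_error = mean(errors)

# Indices of the pairs with the lowest and highest error.
lowest = min(range(len(errors)), key=errors.__getitem__)
highest = max(range(len(errors)), key=errors.__getitem__)
```

Under this assumption, `lowest` and `highest` identify the sentence pairs on which a method agrees most and least with the human judgment.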

Table 7.

Best-performing pre-processing configurations used to evaluate the methods compared in this work as reported in Table 8, derived from our cross-evaluation of each method with the pre-processing configurations shown in Fig 2 (see S2 Appendix).

(*) COM (M17) uses the best configuration of the WBSM-Rada (M7) and UBSM-Rada (M12) methods for computing the similarity scores.

Table 8.

Pearson (r), Spearman (ρ), harmonic (h), and harmonic average (AVG) scores obtained by each sentence similarity method evaluated herein in the three biomedical sentence similarity benchmarks arranged by families.

All reported values were obtained using the best pre-processing configurations detailed in Table 7. The results in bold show the best scores whilst results in show the best average harmonic score for each family.
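The three metrics named in this caption can be sketched in plain Python. This is an illustrative sketch rather than the evaluation code of the paper: the helper names and the sample scores are made up, and the harmonic score is assumed to be the harmonic mean of the Pearson and Spearman correlations, h = 2rρ / (r + ρ).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def harmonic_score(r, rho):
    """ASSUMED definition: harmonic mean of r and rho."""
    return 2 * r * rho / (r + rho)


# Made-up human judgments and method outputs for five sentence pairs.
human = [0.0, 1.0, 2.0, 3.0, 4.0]
method = [0.1, 0.3, 0.2, 0.7, 0.9]

r = pearson(human, method)
rho = spearman(human, method)
h = harmonic_score(r, rho)
```

The harmonic mean always lies between its two inputs and sits closer to the smaller one, so a method needs both a high r and a high ρ to obtain a high h.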

Table 9.

Comparison of results for the “best” and the “worst” pre-processing configurations for the best-performing methods of each family in Table 8.

The last column shows the p-values of the Student's t-test comparing the best and worst configurations.

Table 10.

Pearson (r), Spearman (ρ) and harmonic (h) values obtained in our experiments from the evaluation of ontology similarity methods detailed below in the MedSTS_full [52] dataset for each NER tool.

Table 11.

Harmonic score obtained by each combination of a sentence similarity method with a NER tool in the evaluation of the three sentence similarity datasets.

The p-values shown in this table are obtained by using the method for building uniform-size datasets detailed above. The last column shows the p-values of the Student's t-test comparing the performance of each combination with the best pair in each group.

Table 12.

Pearson (r) and Spearman (ρ) correlation values, harmonic score (h), and harmonic average (AVG) score obtained by the LiBlock method in combination with each NER tool using the best pre-processing configuration detailed in Table 7.

In addition, the last column (p-val) shows the p-values for the comparison of the LiBlock method with cTAKES and the remaining NER combinations.

Table 13.

Raw and pre-processed sentence pairs obtaining the lowest (L) and highest (H) similarity error E_sim, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the LiBlock (M4) method.

Table 14.

Raw and pre-processed sentence pairs obtaining the lowest (L) and highest (H) similarity error E_sim, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the COM (M17) method.

We show the raw and pre-processed sentence pairs evaluated by the WBSM and UBSM similarity methods that make up the COM method. The UBSM method uses the cTAKES NER tool.

Table 15.

Raw and pre-processed sentence pairs obtaining the lowest (L) and highest (H) similarity error E_sim, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the BioWordVec_int (M26) method.

Table 16.

Raw and pre-processed sentence pairs obtaining the lowest (L) and highest (H) similarity error E_sim, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the OuBioBert (M47) method.

Table 17.

Comparison of the mean, minimum and maximum similarity scores of the Normalized Human similarity scores (Human) and the estimated values returned by the best-performing methods of each family in the evaluation of the three biomedical datasets.

Table 18.

This table shows the running times in milliseconds (ms) and the average number of sentence pairs per second (sent/sec) reported by the best-performing method of each family in the evaluation of the 1339 sentence pairs that comprise the three datasets.

(*) The running times of the LiBlock method are reported for both its NER and noNER versions. The method is much more efficient with no NER tool, even though there is no statistically significant difference in the results between the two pre-processing configurations.
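The relation between the two units reported in this table is simple arithmetic, sketched below. The function name and the elapsed time are hypothetical; only the total of 1339 sentence pairs comes from the caption.

```python
def pairs_per_second(n_pairs, elapsed_ms):
    """Convert a total running time in milliseconds into pairs per second."""
    return n_pairs * 1000.0 / elapsed_ms


TOTAL_PAIRS = 1339       # sentence pairs across the three datasets
example_ms = 2678.0      # hypothetical running time, NOT a reported value

rate = pairs_per_second(TOTAL_PAIRS, example_ms)
```

With this made-up running time the rate works out to about 500 sentence pairs per second.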
