Fig 1.
Workflow for computing the new LiBlock measure, together with an example use case that follows the steps defined in Algorithm 1.
Table 1.
Benchmarks on biomedical sentence similarity evaluated in this work.
Table 2.
Detailed setup for the string-based sentence similarity measures which are evaluated in this work.
All the string-based measures follow the implementation of Sogancioglu et al. [30], who use the Simmetrics library [71]. The LiBlock method proposed herein is an adaptation from Li et al. [56] combined with a string-based measure, as detailed in the previous section.
Table 3.
Detailed setup for the ontology-based sentence similarity measures evaluated in this work.
The evaluation of the methods using the Rada [69], coswJ&C [46], and Cai [68] word similarity measures uses a reformulation of the original path-based measures based on the new Ancestors-based Shortest-Path Length (AncSPL) algorithm [42].
Table 4.
Detailed setup for the sentence similarity methods based on pre-trained character, word (WE) and sentence (SE) embedding models evaluated herein.
Table 5.
Detailed setup for the sentence similarity methods based on pre-trained language models evaluated in this work.
Fig 2.
Detail of the pre-processing configurations that are evaluated in this work.
(*) WordPieceTokenizer [91] is used only for BERT-based methods [30, 31, 34, 62, 91–94, 99].
Fig 3.
Detailed workflow implemented by our experiments for pre-processing the input sentences, calculating the raw similarity scores, and post-processing the results obtained in the evaluation of the biomedical datasets.
This workflow generates a collection of raw and processed data files.
Fig 4.
Detailed sentence pre-processing workflow that is implemented in our experiments.
The pre-processing stage takes an input sentence and produces a pre-processed sentence as output. (*) The named entity recognizers are only evaluated in ontology-based methods.
Fig 5.
Figure (a) below shows the histogram plots for the harmonic score obtained by the LiBlock measure (M4) in evaluating the sentence similarity of 10,000 different equal-size subsets of sentence pairs extracted from the MedSTS dataset. Each histogram plot represents the frequency distribution of 10,000 samples of the harmonic score for subsets of sentence pairs of sizes 100, 300, 600, and 900. Figure (b) shows the Q-Q plot normality test for the harmonic score obtained for a random subset of size 100, along with the p-values reported by the Shapiro-Wilk and Chi-square normality tests.
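The resampling procedure behind these histograms can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that the harmonic score is the harmonic mean of the Pearson (r) and Spearman (ρ) correlations, that subsets are drawn without replacement, and it uses synthetic scores in place of the MedSTS data. All function names are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5  # undefined if either sequence is constant


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def harmonic_score(human, method):
    """ASSUMPTION: h = 2 * r * rho / (r + rho); undefined if r + rho == 0."""
    r, rho = pearson(human, method), spearman(human, method)
    return 2 * r * rho / (r + rho)


def sample_harmonic_scores(human, method, subset_size, n_samples, seed=0):
    """Harmonic score of n_samples random equal-size subsets of the pairs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        subset = rng.sample(range(len(human)), subset_size)
        scores.append(harmonic_score([human[i] for i in subset],
                                     [method[i] for i in subset]))
    return scores


# Synthetic stand-in data (NOT the MedSTS scores): noisy copies of human scores.
data_rng = random.Random(42)
human = [data_rng.random() for _ in range(1000)]
method = [min(1.0, max(0.0, h + data_rng.gauss(0, 0.1))) for h in human]

scores = sample_harmonic_scores(human, method, subset_size=100, n_samples=200)
print(len(scores), round(mean(scores), 3))
```

The list returned by `sample_harmonic_scores` is what a histogram or a normality test would then be applied to. The sketch uses 200 samples only to keep the run short.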
Table 6.
Supplementary material and reproducibility resources of this work.
Fig 6.
Probability Density Function (PDF) and mean value of the similarity error (Esim) obtained by the best-performing methods in the evaluation of each dataset as follows: (a) BIOSSES, (b) MedSTS, and (c) CTR.
Table 7.
Best-performing pre-processing configurations used to evaluate the methods compared in this work as reported in Table 8, derived from our cross-evaluation of each method with the pre-processing configurations shown in Fig 2 (see S2 Appendix).
(*) COM (M17) uses the best configuration of the WBSM-Rada (M7) and UBSM-Rada (M12) methods for computing the similarity scores.
Table 8.
Pearson (r), Spearman (ρ), harmonic (h), and harmonic average (AVG) scores obtained by each sentence similarity method evaluated herein in the three biomedical sentence similarity benchmarks arranged by families.
All reported values were obtained using the best pre-processing configurations detailed in Table 7. The results in bold show the best scores, whilst the other highlighted results show the best average harmonic score for each family.
Table 9.
Comparison of results for the “best” and the “worst” pre-processing configurations for the best-performing methods of each family in Table 8.
The last column shows the p-values of the Student's t-test comparing the best and worst configurations.
Table 10.
Pearson (r), Spearman (ρ), and harmonic (h) values obtained in our experiments from the evaluation of the ontology-based similarity methods detailed below on the MedSTSfull [52] dataset for each NER tool.
Table 11.
Harmonic score obtained by each combination of a sentence similarity method with a NER tool in the evaluation of the three sentence similarity datasets.
The p-values shown in this table are obtained using the method for building uniform-size datasets detailed above. The last column shows the p-values of the Student's t-test comparing the performance of each combination with the best pair in each group.
Table 12.
Pearson (r) and Spearman (ρ) correlation values, harmonic score (h), and harmonic average (AVG) score obtained by the LiBlock method in combination with each NER tool using the best pre-processing configuration detailed in Table 7.
In addition, the last column (p-val) shows the p-values for the comparison of the LiBlock method combined with cTAKES against the remaining NER combinations.
Table 13.
Raw and pre-processed sentence pairs with the lowest (L) and highest (H) similarity error Esim, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the LiBlock (M4) method.
Table 14.
Raw and pre-processed sentence pairs with the lowest (L) and highest (H) similarity error Esim, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the COM (M17) method.
We show the raw and pre-processed sentence pairs evaluated by the WBSM and UBSM similarity methods that make up the COM method. The UBSM method uses the cTAKES NER tool.
Table 15.
Raw and pre-processed sentence pairs with the lowest (L) and highest (H) similarity error Esim, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the BioWordVecint (M26) method.
Table 16.
Raw and pre-processed sentence pairs with the lowest (L) and highest (H) similarity error Esim, together with their corresponding normalized human similarity score (Human) and the normalized similarity value (Method) estimated by the OuBioBert (M47) method.
Table 17.
Comparison of the mean, minimum, and maximum values of the normalized human similarity scores (Human) and of the estimated values returned by the best-performing methods of each family in the evaluation of the three biomedical datasets.
Table 18.
Running times in milliseconds (ms) and average number of sentence pairs per second (sent/sec) reported by the best-performing method of each family in the evaluation of the 1339 sentence pairs that comprise the three datasets.
(*) The LiBlock method reports the running times in both NER and noNER versions showing that the efficiency of the method with no NER tool is much higher, despite the fact that there is no statistically significant difference in the results between both pre-processing configurations.
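The relation between the two reported quantities, running time in milliseconds and sentence pairs per second, can be sketched as follows. This is an illustrative Python example only: the helper names, the toy similarity function, and the 2000 ms timing are hypothetical and are not taken from the reported results.

```python
import time


def pairs_per_second(n_pairs, elapsed_ms):
    """Throughput in sentence pairs per second from a running time in ms."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return n_pairs / (elapsed_ms / 1000.0)


def time_method(similarity, sentence_pairs):
    """Score every pair and return (scores, wall-clock running time in ms)."""
    start = time.perf_counter()
    scores = [similarity(a, b) for a, b in sentence_pairs]
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return scores, elapsed_ms


def toy_similarity(a, b):
    """Placeholder token-overlap (Jaccard) score, NOT one of the evaluated methods."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


# Hypothetical timing: 1339 pairs processed in 2000 ms.
print(pairs_per_second(1339, 2000.0))

pairs = [("the cell divides", "the cell splits")] * 1339
scores, elapsed_ms = time_method(toy_similarity, pairs)
print(len(scores), pairs_per_second(len(pairs), elapsed_ms) > 0)
```

Wall-clock timings of this kind depend on the hardware and on whether pre-processing steps, such as a NER tool, are included in the timed region.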