Replacing non-biomedical concepts improves embedding of biomedical concepts

doi:10.1371/journal.pone.0322498

Fig 1.

Schematic of the approach: This schematic illustrates the entire workflow of the project.

The process begins with initial text preprocessing using the marea software to obtain the PM corpus [6]. The PM corpus is then processed through non-biomedical concept replacement, resulting in the WN corpus. To assess the concept replacement proposal fairly, both the PM and WN corpora are embedded using the same text-embedding algorithm (Word2Vec in our experiments, chosen for its broad usage and relative simplicity). Pairwise distances between sets of related biomedical concepts in the embedded PM corpus are then compared to those in the embedded WN corpus.
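The comparison step of this workflow can be illustrated with a minimal sketch. It assumes cosine distance as the distance measure and uses toy two-dimensional vectors; the names `cosine_distance`, `mean_pairwise_distance`, `pm_embedding`, and `wn_embedding` are hypothetical and are not taken from the paper's code.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(embedding, concept_set):
    """Mean distance over all pairs of concepts in the set that have a vector."""
    vectors = [embedding[c] for c in concept_set if c in embedding]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two embedded concepts")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy vectors standing in for Word2Vec output (assumed values, not real data).
pm_embedding = {"geneA": [1.0, 0.0], "geneB": [0.0, 1.0], "geneC": [1.0, 1.0]}
wn_embedding = {"geneA": [1.0, 0.2], "geneB": [0.8, 1.0], "geneC": [1.0, 1.0]}
concept_set = ["geneA", "geneB", "geneC"]

pm_mean = mean_pairwise_distance(pm_embedding, concept_set)
wn_mean = mean_pairwise_distance(wn_embedding, concept_set)
print(f"PM mean distance: {pm_mean:.4f}, WN mean distance: {wn_mean:.4f}")
```

A smaller mean for one corpus indicates that the related concepts sit closer together in that embedding, which is the sense in which the two corpora are compared here.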


Fig 2.

Text Transformation Pipeline: An example of the multi-stage text transformation pipeline applied to a sample abstract (PMID: 30609739).

The process begins with the original text, followed by biomedical entity recognition and standardization using PubTator, which replaces medical terms and their synonyms with standardized identifiers (e.g., MeSH IDs). The text is then processed by MAREA, which simplifies and prepares it for machine learning by retaining standardized biomedical terms and ensuring consistent tokenization. In the final stage, non-biomedical synonyms are replaced using WordNet to further refine the embeddings. This figure illustrates the transformation applied across 30 million abstracts.


Fig 3.

Non-biomedical word replacement algorithm: This algorithm outlines the process for replacing non-biomedical words in a corpus using WordNet.


Fig 4.

Illustration of the text transformation process before and after synonym replacement: The process begins with a sample initial text segment (‘Sample text before synonym replacement’), followed by a word frequency count (‘Counter’).

This count generates a ‘Vocabulary List by Frequency’, which informs the final modified text (‘Sample text after synonym replacement’). The procedure exemplifies the algorithm’s systematic approach to replacing non-biomedical tokens.
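The count-then-replace procedure described in this caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's algorithm: `synonym_groups` is a hypothetical stand-in for WordNet synonym sets, `protected` is a hypothetical set of standardized biomedical tokens that must not be altered, the rule "map every synonym to the most frequent member of its group" is assumed, and the replacement threshold is not modeled.

```python
from collections import Counter


def replace_synonyms(tokens, synonym_groups, protected):
    """Map each non-protected token to the most frequent synonym in its group."""
    counts = Counter(tokens)  # word frequency count over the sample text
    canonical = {}
    for group in synonym_groups:
        # Only synonyms that occur in the text and are not biomedical tokens.
        present = [w for w in group if w in counts and w not in protected]
        if len(present) < 2:
            continue  # nothing to merge
        # Most frequent member wins; ties are broken alphabetically.
        target = min(present, key=lambda w: (-counts[w], w))
        for word in present:
            canonical[word] = target
    return [canonical.get(t, t) for t in tokens]


# Toy input: CONCEPT_1 is a placeholder for a standardized biomedical identifier.
tokens = "large tumor big tumor large effect CONCEPT_1 huge effect".split()
synonym_groups = [{"large", "big", "huge"}, {"tumor", "CONCEPT_1"}]
protected = {"CONCEPT_1"}

replaced = replace_synonyms(tokens, synonym_groups, protected)
print(" ".join(replaced))
```

Because `CONCEPT_1` is protected, the second group contains only one eligible word and is left untouched, so only the non-biomedical synonyms are collapsed.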


Table 1.

Comparison of mean interconcept distance for embedding with WordNet synonym replacement (WN) and without (PM). The initial number of unique concepts in the total corpus was 3,018,918. The table summarizes results for different thresholds () and categories of concept/gene sets (M, B, K, G, P). Columns: : Replacement threshold; replaced: Unique Replaced Concepts; Category: M = MeSH, B = Biocarta, K = KEGG, G = GP(bp), P = PID; # sets: Number of concept/gene sets in the categories; #Concepts: Number of concept vectors in the category; WN better: The count and percentage of concept/gene sets for which the mean interconcept distance was smaller for WN than for PM (“winners” are shown in bold); PM better: Analogous to “WN better” but for PM.
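The “WN better” and “PM better” columns amount to a tally over per-set mean distances. The sketch below shows one way such a tally could be computed; the function name `tally_winners` and the numeric values are made up for illustration and do not reproduce any result from the table.

```python
def tally_winners(pm_means, wn_means):
    """Count the sets for which each corpus has the smaller mean distance."""
    shared = sorted(set(pm_means) & set(wn_means))
    if not shared:
        raise ValueError("no concept/gene sets in common")
    wn_better = sum(1 for s in shared if wn_means[s] < pm_means[s])
    pm_better = sum(1 for s in shared if pm_means[s] < wn_means[s])
    total = len(shared)
    return {
        "n_sets": total,
        "wn_better": wn_better,
        "wn_better_pct": 100.0 * wn_better / total,
        "pm_better": pm_better,
        "pm_better_pct": 100.0 * pm_better / total,
    }


# Illustrative mean interconcept distances per set (toy values, not real data).
pm_means = {"set_1": 0.52, "set_2": 0.40, "set_3": 0.61, "set_4": 0.33}
wn_means = {"set_1": 0.47, "set_2": 0.42, "set_3": 0.55, "set_4": 0.30}

summary = tally_winners(pm_means, wn_means)
print(summary)
```

Ties are counted for neither side in this sketch, so the two percentages need not sum to 100.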


Fig 5.

Comparative analysis of WordNet replacement impact on data distribution.

Figure (a) presents the Empirical Cumulative Distribution Functions (ECDFs), showcasing the cumulative frequency distribution before and after WordNet replacement, while Figure (b) illustrates the corresponding Empirical Q-Q Plot, detailing the quantile comparison between the original and the WordNet-replaced datasets. The close alignment of data points with the reference line in the Q-Q Plot and the overlap of the ECDF curves suggest minimal distributional deviation post-replacement.
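For readers unfamiliar with the two diagnostics, the following sketch shows how an ECDF and the points of an empirical Q-Q plot can be computed with the Python standard library. The function names `ecdf` and `qq_points` and the sample values are assumptions made for illustration; the caption does not specify how the published plots were produced.

```python
import statistics
from bisect import bisect_right


def ecdf(sample):
    """Return F where F(x) is the fraction of observations less than or equal to x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cumulative(x):
        return bisect_right(ordered, x) / n

    return cumulative


def qq_points(sample_a, sample_b, n_quantiles=4):
    """Pair matching quantiles of two samples; these are the Q-Q plot points."""
    qa = statistics.quantiles(sample_a, n=n_quantiles, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n_quantiles, method="inclusive")
    return list(zip(qa, qb))


# Toy samples standing in for values before and after replacement (assumed).
before = [1.0, 2.0, 3.0, 4.0, 5.0]
after = [1.1, 2.0, 2.9, 4.2, 5.0]

f_before = ecdf(before)
points = qq_points(before, after)
print(f_before(3.0), points)
```

Points lying near the line y = x in the Q-Q plot, together with overlapping ECDF curves, are what the caption refers to as minimal distributional deviation.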


Fig 6.

Comparative analysis of WN and PM methodologies: Figure (a) displays the bar chart comparing WN and PM across five distinct concept sets (Methods), highlighting the number of concept sets where the cluster mean distance is significantly lower, indicative of superior embeddings.

‘Significant’ designates those with a statistically significant difference in cluster mean distances, while ‘All Comparisons’ encompasses the entire dataset. Figure (b) illustrates the spread of mean distances within the PID Gene Sets, detailing the variance and central tendency across 194 gene sets. ‘Significant’ encompasses gene sets with notable mean distance variations between ‘PM’ and ‘WN’, and ‘All comparisons’ includes all evaluated gene sets.


Table 2.

Comparison of window size for embedding with WordNet synonym replacement (WN) and without (PM). While Table 1 compared the effects of different values of using a window size of 10, this table shows the results for three different window sizes at a =. Abbreviations are the same as for Table 1.
