
Empowering entity synonym set generation using flexible perceptual field and multi-layer contextual information

  • Subin Huang,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation School of Computer and Information, Anhui Polytechnic University, Wuhu, Anhui, China

  • Daoyu Li ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    lidaoyu@ahpu.edu.cn

    Affiliation School of Computer and Information, Anhui Polytechnic University, Wuhu, Anhui, China

  • Chengzhen Yu,

    Roles Conceptualization, Investigation, Validation

    Affiliation School of Computer and Information, Anhui Polytechnic University, Wuhu, Anhui, China

  • Junjie Chen,

    Roles Conceptualization, Investigation, Software, Validation

    Affiliation School of Computer and Information, Anhui Polytechnic University, Wuhu, Anhui, China

  • Qing Zhou,

    Roles Conceptualization, Investigation, Validation

    Affiliation School of Computer and Information, Anhui Polytechnic University, Wuhu, Anhui, China

  • Sanmin Liu

    Roles Conceptualization, Funding acquisition

    Affiliation School of Computer and Information, Anhui Polytechnic University, Wuhu, Anhui, China

Abstract

Automatic generation of entity synonyms plays a pivotal role in various natural language processing applications, such as search engines, question-answering systems, and taxonomy construction. Previous research on generating entity synonym sets has typically relied on approaches that involve sorting and pruning candidate entities or solving the problem in a two-stage manner (i.e., initially identifying pairs of synonyms and subsequently aggregating them into sets). Nevertheless, these approaches tend to disregard global entity information and are susceptible to error propagation issues. This paper introduces an innovative approach to generating entity synonym sets that leverages a flexible perception mechanism and multi-layer contextual information. Firstly, to determine whether to incorporate new candidate entities into synonym sets, the approach integrates a neural network classifier with a flexible perceptual field. Within the classifier, the approach builds a three-layer interactive network, and connects the entity layer, set layer, and sentence layer to the same embedding space to extract synonym features. Secondly, we introduce a dynamic-weight-based algorithm for synthesizing entity synonym sets, leveraging a neural network classifier trained to generate entity synonym sets from the candidate entity vocabulary. Finally, extensive experimental results on three public datasets demonstrate that our approach outperforms other comparable approaches in generating entity synonym sets.

Introduction

An entity synonym set consists of names or terms that convey similar or identical meanings. For instance, {“USA”, “U.S.”, “United States”} is a synonym set, where each term refers to the same nation [1–3]. The identification of entity synonym sets has become a crucial undertaking, offering substantial advantages for various downstream applications, including search engines [4–6], question answering systems [7–9], and taxonomy construction [10–12]. For example, when analyzing the query “The capital of the U.S. is D.C.”, an intelligent system must accurately interpret “U.S.” as “United States” and “D.C.” as “Washington.” This understanding is critical for satisfying the informational demands of users [13].

Currently, most research on generating entity synonym sets falls into two main categories: ranking and pruning approaches [14–16] and two-stage methods [17–19].

  • Ranking and pruning approaches. Candidate terms are ranked according to their probability of referring to the same entity, and the ranked terms are then pruned into a set of entity synonyms. By treating each term in the glossary as a query, the approach ultimately outputs a complete set of entity synonyms derived from the glossary.
  • Two-stage task approaches. These approaches divide the process of generating synonym sets into two cohesive steps. Initially, a synonym detection technique is employed to identify candidate synonym pairs of terms. Subsequently, a synonym set generation algorithm is utilized to combine these pairs to form a synonym set.

However, the above approaches, while valid, suffer from the following two drawbacks:

  • Ignoring the global information of candidate entities. The ranking and pruning approach employs a rank-and-prune strategy, which first ranks candidate terms according to their probability of referring to the same entity. Subsequently, the sorted list is pruned to create entity synonym sets. However, these approaches handle each candidate separately and independently calculate the probability of its reference to the queried entity, thereby overlooking the global information shared among candidate entities. Essentially, harnessing such global information could significantly enhance the quality of synonym set discovery.
  • Accumulation of error propagation. Two-stage approaches include the initial extraction of entity synonym pairs, which are then combined into entity synonym sets. However, these approaches rely solely on the training data in the first stage and fail to fully utilize training signals in the second stage. Furthermore, during the organization of entity synonym sets, the identified entity synonym pairs are usually fixed, lacking a feedback mechanism between the first and second stages. This absence of feedback can result in the accumulation of errors.

This paper proposes a framework named EnSynFields to overcome the previously mentioned limitations in generating entity synonym sets. EnSynFields improves the extraction of sentence-layer global information for entity synonym sets, providing flexible field awareness for their automated generation. Specifically, EnSynFields consists of the following main components:

  • Three-layer interaction field network. We build a three-layer interaction field network, encompassing the entity layer, set layer, and sentence layer, to capture the interaction of global information among the candidate entities. A bidirectional propagation mechanism is proposed to learn entity feature representations across the three-layer interaction fields, thereby helping to alleviate the potential noise caused by relying on a single network layer.
  • Flexible perceptual field neural network classifier. We build a flexible perceptual field neural network classifier to determine whether to incorporate new candidate entities into synonym sets. This classifier jointly models entities, sets, and sentences, learning to represent entity synonym sets based on the three-layer interaction field.
  • Dynamic-weight-based set generation algorithm. We propose a dynamic-weight-based set generation algorithm for generating entity synonym sets. This algorithm utilizes a weighted cross-entropy loss function to balance the distribution of samples among the generated sets, thereby mitigating error propagation stemming from the flexible perceptual field neural network classifier.

This study makes the following contributions:

  • By considering multi-layer contextual information, our approach jointly learns the representations of entities and synonym sets within a network learning framework. This is achieved by encoding contextual information at three levels: the entity layer, set layer, and sentence layer.
  • A flexible perceptual field neural network classifier is proposed for the holistic modeling of entity synonym sets. Integrated with a dynamic-weight-based set generation algorithm, it effectively constructs new synonym sets.
  • Numerous experiments were conducted on three real-world datasets to assess the efficacy of our proposed EnSynFields framework. Experimental results demonstrate that our approach outperforms existing methods, validating its effectiveness.

In the subsequent sections, this paper is structured as follows: The “Related Work” section provides a concise overview of existing techniques for synonym set generation and highlights the key features of our proposed approach. The “Methodology” section details our proposed framework. The “Experiments” section presents experimental results and an in-depth analysis. Finally, the “Conclusion” section summarizes the key findings of this study and outlines future research directions.

Related work

This section explores two correlated approaches in the generation of entity synonym sets: the ranking and pruning techniques, and the two-stage task strategies.

Ranking and pruning approaches

Ranking and pruning approaches involve associating a given query term with an entity. Initially, candidate terms are evaluated based on their likelihood of denoting the same entity, followed by pruning the ranked list to generate the final synonym set. Viewing each vocabulary term as a query, these approaches ultimately result in synonym sets for all entities in the vocabulary [13].

Ranking-based approaches aim to improve the quality of generated synonyms by ranking them according to their relevance to the original word. Qu et al. [20] introduced a machine learning-based sorting framework to rank generated synonyms by learning weights, thereby improving their relevance. In addition, Yu et al. [21] proposed a corpus-based ranking approach that leverages a large-scale corpus to learn the distribution of synonyms for more accurate ranking.

The goal of the pruning technique is to eliminate low-quality or irrelevant candidates from the generated synonyms. Zhang et al. [22] developed a pruning approach based on syntactic and semantic information to filter out non-contextualized synonyms, highlighting the significance of pruning in synonym generation. In addition, Qu et al. [23] proposed a pruning strategy based on an attention mechanism, where the model automatically focuses on the most relevant synonyms, thereby reducing redundancy and irrelevant words in the generated set.

Two-stage task approaches

Two-stage task approaches implement a pair of sequential subtasks to discern sets of synonyms for entities. Initially, they focus on developing a model for synonym prediction, which evaluates the synonymous nature of proposed pairs of strings. Subsequently, these approaches utilize an algorithm for synonym expansion in conjunction with the previously mentioned model to generate synonym sets. Typically, these two-stage task approaches excel in discerning semantic connections among candidate strings, thereby efficiently collating synonymous terms [2, 13].

Ren and Cheng [17] developed an approach that combines a heterogeneous graph-based data model with a ranking algorithm to discover synonyms in web text. Their approach incorporates string names, key structured attributes, subqueries, and web page connections to compile an expanded set of synonyms. Shen et al. [18] developed a technique that combines context-based feature selection with a ranking-based model for extracting synonym sets from text corpora. In a separate study, Shen et al. [13] introduced an approach for creating entity synonym sets, designing a classifier to validate string pair synonymy and employing an expansion algorithm to refine these sets.

Huang et al. [19] introduced an approach for forming Chinese entity synonym sets, comprising both extraction and refinement stages. In the initial phase, they employed a combination of direct, pattern-based, and neural network mining techniques to gather potential Chinese entity synonyms. In the subsequent refinement phase, they applied semantic rules, coupled with field-specific and similarity-based filtering, to enhance the accuracy of the extracted Chinese entity synonym sets.

Related work discussion

Previously, we examined two approaches for synonym set discovery. This section provides an in-depth discussion of the salient aspects of our proposed approach and revisits the approaches previously addressed. The ranking and pruning approach can improve the quality of the generated set of synonyms; however, an overly aggressive ranking and pruning process may lead to the erroneous exclusion of some useful synonyms, thereby reducing the coverage of the synonym set. In addition, approaches that rely excessively on statistical and corpus information may not perform well when dealing with field-specific or low-resource languages, as such information may be insufficiently detailed. Most two-stage approaches are supervised, requiring a labeled dataset of synonymous words. Nevertheless, the availability of labeled synonym datasets is not consistently assured, and the development of such datasets can be a costly endeavor. This limitation hinders the scalability and broader applicability of these approaches.

This paper introduces an approach for the efficient generation of entity synonym sets, leveraging a flexible perceptual field and layered contextual insights. A three-layer interaction field network is constructed to learn entity representations in a mutually reinforcing manner, mitigating the noise introduced by rich contextual information. Specifically, to determine whether to include new candidate entities in synonym sets, the approach incorporates a neural network classifier with a flexible perceptual field. In the classifier, a three-layer interaction field network is employed to link entities, entity synonym sets, and entity sentences, encoding high-order context from a flexible perceptual field to extract synonym features. To overcome the previously identified constraints, this study presents an advanced dynamic-weight-based algorithm for generating synonym sets, utilizing a finely trained neural network classifier to extract these sets from a selection of candidate entities. Thorough experiments conducted on a variety of real-world datasets demonstrate the superiority of our approach over similar methods, confirming its effectiveness.

Methodology

This section introduces key concepts related to our research and outlines the structure of our proposed approach. Additionally, it provides an in-depth analysis of this structure, exploring each of its components in detail.

Problem formulation

Entity. Any distinct thing, object, or concept that exists independently or can be considered separately in reality. It may appear as either a single word or a phrase.

Entity synonym. A term (i.e., word or phrase) that has the same semantic meaning as another term in the same text corpus but differs in surface form.

Set of entity synonyms. A set containing entity names that either convey similar meanings or refer to an identical entity.

Knowledge base. A knowledge base is a collection of entities encompassing numerous facts. This paper specifically focuses on a particular type of fact: entity synonyms. These synonyms serve as initial seeds for discovering previously unknown synonym terms.

Set-layer context. A set-layer context is a semantic association within an interaction field network. If entity mention M links to a synonym set in this network, the synonym set is deemed the set-layer context of entity mention M. Similarly, a link between one synonym set and another synonym set within this network designates the latter as the set-layer context of the former.

Flexible perceptual fields. Given a text corpus C, flexible perceptual fields are sets of top-k sentences with the highest relevance scores from C. The relevance scores are computed based on entity perceptual fields within the interaction field network, utilizing the BM25 (Best Matching 25) information retrieval approach.

Problem Definition. Given a text corpus T, a knowledge base K, and a vocabulary V (a list of candidate entities) generated from T, our task aims to extend entity synonyms from T and V based on flexible perceptual fields and multiple layers of contextual information.

Overview of framework

As shown in Fig 1, the EnSynFields framework comprises three key components: a three-layer interaction field network, a flexible perceptual field neural network classifier, and a dynamic-weight-based set generation algorithm.

  1. Three-layer interaction field network: An interactive field is constructed to capture global information among candidate entities. The interaction field network includes contextual information from the entity layer, set layer, and sentence layer, helping to mitigate potential noise arising from individual fields.
  2. Flexible perceptual field neural network classifier: This classifier is developed to assess whether new candidate entities should be incorporated into the current entity set. It features a network with a flexible perceptual field, enabling the learning of additional contextual characteristics.
  3. Dynamic-weight-based set generation algorithm: A dynamic-weight-based algorithm for expanding entity synonym sets is devised. It resamples generated data and employs a weighted cross-entropy loss function to balance the distribution across various sets, thereby improving the efficiency of entity synonym set expansion.

Building a three-layer interaction field network

As depicted in Fig 2, the three-layer interaction field network consists of the entity-layer field f(MM), the set-layer field f(SS), and the sentence-layer field f(TT). The three-layer interaction field network preserves contextual information at the entity layer, set layer, and sentence layer, offering flexible receptive fields for additional encoding of entities, sets, and sentences.

Fig 2. Three-layer interactive field network architecture.

https://doi.org/10.1371/journal.pone.0321381.g002

Data preparation.

To build a synonym set generative model, we begin with data preparation, focusing on collecting training and validation data. We collected the data using the following approach.

  • Entity data acquisition: The existing named entity recognition tool [24] is used to identify entities mentioned in the text corpus, i.e., entity references.
  • Entity set data acquisition: An established entity linker, such as DBpedia Spotlight [25], is utilized to associate entity references in the corpus with corresponding entities in the knowledge base. The set of synonym seeds from linked corpora serves as a distant supervision signal, aiding in the discovery of additional sets of entity synonyms from raw text corpora that are not present in the knowledge base.
  • Sentence set data acquisition: The BM25 information retrieval algorithm [26] is employed to retrieve the five most relevant sentences from the original text corpus. This number was chosen as it balances providing sufficient context with maintaining computational efficiency. These selected sentences are scored based on factors such as word frequency, sentence length, and sentence frequency, and subsequently utilized as the sentence set for entities.

Building entity-layer field network.

The goal of building an entity-layer field network is to establish entity-to-entity semantic relationships. As shown in Fig 3, the entity mentions M serve as the nodes of the network. To capture entity-to-entity semantic relationships, we construct the entity-to-entity semantic field network f(MM) using the following rules:

  • Rule 1: If entity mention m_j is the nearest entity mention to m_i in the text corpus, then we construct the network using the top-k contextual information of entities m_i and m_j; the resulting network of m_i and m_j is f(m_i m_j).
  • Rule 2: If m_i and m_j appear in the same sentence in the text corpus, then we construct the network using the top-k contextual information of entities m_i and m_j.

The entity-layer field network implicitly encodes semantic information among entities, facilitating the comprehensive capture of semantic relationships within this specific layer.
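
The two rules above can be sketched as a small graph-construction pass. The (mention, sentence_id) input format and the undirected-edge representation below are assumptions for illustration, not the paper's implementation:

```python
from collections import defaultdict

def build_entity_layer_edges(mentions):
    """Build entity-layer edges from (mention, sentence_id) pairs listed
    in corpus order.

    Rule 1: link each mention to its nearest neighbour in corpus order.
    Rule 2: link mentions that co-occur in the same sentence.
    """
    edges = set()
    # Rule 1: adjacent mentions in corpus order are nearest neighbours.
    for (m1, _), (m2, _) in zip(mentions, mentions[1:]):
        if m1 != m2:
            edges.add(tuple(sorted((m1, m2))))
    # Rule 2: co-occurrence within one sentence.
    by_sentence = defaultdict(list)
    for m, sid in mentions:
        by_sentence[sid].append(m)
    for ms in by_sentence.values():
        for i in range(len(ms)):
            for j in range(i + 1, len(ms)):
                if ms[i] != ms[j]:
                    edges.add(tuple(sorted((ms[i], ms[j]))))
    return edges

edges = build_entity_layer_edges([
    ("U.S.", 0), ("D.C.", 0), ("United States", 1),
])
```

Edges are stored as sorted tuples so that each semantic relationship is recorded once regardless of direction.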

Building set-layer field network.

We construct a set-layer field network to capture set-layer contextual information. As illustrated in Fig 4, the set-layer field network consists of numerous candidate entities. To capture set-to-set semantic relationships, we construct the set-to-set semantic field network f(SS) using the following steps:

  • First, if the entity mentions m_i and m_j are synonymous, then we construct a set {m_i, m_j} to form a set field; for example, S_i and S_j are two set fields.
  • Second, if the entity mentions m_i and m_j are nearest neighbors or co-occur in the entity-layer field network, then we construct a set-layer field network and consider their sets S_i and S_j to be related; the resulting network of S_i and S_j is f(S_i S_j).

The set-layer field network implicitly provides contextual information at the set layer, ensuring that the generated synonyms are intricately connected to their preceding and following texts, thereby enhancing the coherence and relevance of synonym set generation.
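
One way to realize the first construction step, merging synonymous mention pairs into disjoint set fields, is a union-find pass over the seed pairs. The helper below is a hypothetical sketch under that assumption, not the paper's implementation:

```python
def group_synonym_pairs(pairs):
    """Union-find: merge synonymous mention pairs into disjoint seed sets."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in pairs:
        union(a, b)
    # Collect members of each connected component into one set field.
    sets = {}
    for x in parent:
        sets.setdefault(find(x), set()).add(x)
    return [frozenset(s) for s in sets.values()]

seed_sets = group_synonym_pairs([("USA", "U.S."), ("U.S.", "United States")])
```

Because synonymy is transitive across seed pairs, the two overlapping pairs collapse into a single set field.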

Building sentence-layer field network.

We employ the BM25 [26] information retrieval algorithm to evaluate the degree of match between sentences and entity mentions. Additionally, we consider factors such as entity mention frequency, sentence length, and sentence frequency to determine the contextual information of the sentence-layer field network f(TT).

Specifically, we first calculate the term frequency (TF) of the entity mention in each sentence; TF indicates the significance of the entity mention in the context of the sentence. Second, we calculate the inverse sentence frequency (ICF) of the entity mention; ICF denotes the rarity of the entity mention term, i.e., its importance across the whole set of sentences. Third, the BM25 score is calculated based on TF, ICF, and other parameters of the entity mentions. The formula for the BM25 score is as follows:

(1) Score(m, s) = ICF(m) · [TF(m, s) · (k_1 + 1)] / [TF(m, s) + k_1 · (1 − b + b · |s| / avgsl)]

where ICF(m) indicates the inverse sentence frequency of the entity mention term m, TF(m, s) indicates the frequency of the entity mention term in sentence s, k_1 and b are moderation parameters, |s| indicates the length of the sentence, and avgsl indicates the average sentence length. Finally, ranking is executed by sorting the text sentences based on the BM25 score, with higher-scoring sentences placed higher in the results. The top-k sentences with the highest scores are selected for the sentence-layer field network.
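
The sentence scoring and top-k selection described above can be sketched as follows. This is a minimal sketch: the tokenized-sentence input format and the parameter values k1 = 1.5, b = 0.75 are assumptions for illustration:

```python
import math

def bm25_scores(query_terms, sentences, k1=1.5, b=0.75):
    """Score tokenized sentences against an entity mention's terms with BM25.

    The inverse *sentence* frequency (ICF) plays the role IDF plays for
    documents; k1 and b are the usual moderation parameters.
    """
    n = len(sentences)
    avg_len = sum(len(s) for s in sentences) / n
    # sentence frequency of each query term
    sf = {t: sum(1 for s in sentences if t in s) for t in query_terms}
    scores = []
    for s in sentences:
        score = 0.0
        for t in query_terms:
            tf = s.count(t)
            if tf == 0:
                continue
            icf = math.log((n - sf[t] + 0.5) / (sf[t] + 0.5) + 1)
            score += icf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(s) / avg_len))
        scores.append(score)
    return scores

sentences = [["united", "states", "capital"], ["france", "paris"], ["united", "nations"]]
scores = bm25_scores(["united"], sentences)
top_k = sorted(range(len(sentences)), key=lambda i: -scores[i])[:2]
```

Sentences containing the mention receive positive scores, while unrelated sentences score zero and are filtered out of the top-k sentence field.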

We construct the sentence-layer field network to capture sentence-layer contextual information. As illustrated in Fig 5, the sentence-layer field network consists of numerous candidate sentences. To capture sentence-to-sentence semantic relationships, we construct the sentence-to-sentence associated semantic field network using the following procedures:

  • First, we use the BM25 information retrieval algorithm to obtain the top-k sentences with the highest correlation as the sentence fields of the entities.
  • Second, if the entity mentions m_i (where m_i ∈ S_i) and m_j (where m_j ∈ S_j) are associated in the set-layer field network, then we construct a sentence-layer field network and consider that their sentence fields T_i and T_j are connected.

The sentence-layer field network employing the BM25 algorithm effectively filters and ranks candidate sentences. This ensures that the chosen contextual information is highly pertinent to the entity’s context, offering a more precise reflection of the entity’s usage and meaning in diverse contexts. Consequently, the generated synonym set becomes enriched with semantic details, enhancing its practical applicability and accuracy.

Flexible perceptual field neural network classifier

As shown in Fig 6, following data preparation and the construction of the three-layer interactive field network, we construct a flexible perceptual field neural classifier, denoted as f(S,m), which determines whether the synonym set S should contain the candidate entity m.

Fig 6. The architecture of the flexible perceptual field neural network classifier.

https://doi.org/10.1371/journal.pone.0321381.g006

Flexible perceptual field neural network classifier architecture.

The fundamental objective of the flexible perceptual field neural network classifier is to assess whether a candidate entity m is suitable for inclusion in the synonym set s.

A previous study [27] demonstrated that the crux of learning synonym sets lies in the replacement invariance of the term classifier for synonym sets when new terms are introduced. They initially utilized a set tagger to ascertain the scores of the original synonym set and then determined the scores of the newly formed synonym set (i.e., the set to which new terms are added) using the same set tagger. To translate the disparity between these two scores into a probability, they applied an objective function involving sigmoid units. However, this approach solely utilizes entity embedding features to acquire the synonym signal, neglecting contextual information about entity synonymy between the entity set layer and the sentence layer.

Based on the three-layer interaction field network, this paper builds a flexible perceptual field neural classifier to holistically model entities, sets, and sentences. As shown in Fig 6, the lower part of the figure illustrates the overall framework of classifier f(S,m), which mainly includes multiple parallel raters Score(x). The upper part of the figure shows the specific architecture of each rater Score(x), which takes set x as input, passes it through the scoring system, and ultimately produces a quality score Q(x). This score Q(x) reflects the integrity and consistency of set x.

As shown in Fig 6, given a synonym set S, a candidate entity m, and the three-layer interaction field network, we construct the flexible perceptual field neural network classifier through the following steps:

  • First, the sets S and S ∪ {m} are input to the set scorer to obtain two scores, Q(S) and Q(S ∪ {m}), respectively. The difference between them is calculated, denoted as follows:

    (2) D(S, m) = Q(S ∪ {m}) − Q(S)

    where Q(S) is the quality score of the input synonym set S, and Q(S ∪ {m}) represents the quality score after adding the candidate entity m into the synonym set S.
  • Second, based on the three-layer interactive field network, we learn the entity representation in a mutually reinforcing way and obtain a flexible perceptual field entity representation. The flexible perceptual field sets S_f and S_f ∪ {m_f} are input to the set scorer to obtain two scores, Q(S_f) and Q(S_f ∪ {m_f}), respectively. The difference between them is calculated, denoted as follows:

    (3) D_f(S_f, m_f) = Q(S_f ∪ {m_f}) − Q(S_f)

    where Q(S_f) is the quality score of the input flexible perceptual field synonym set S_f, and Q(S_f ∪ {m_f}) is the quality score after adding the flexible perceptual field candidate entity m_f to the synonym set S_f.
  • Third, the sum of these two difference scores is computed, and a sigmoid function is applied to transform it into a probability. To determine whether the candidate entity should be added to the synonym set, the probability is defined as:

    (4) P(m ∈ S) = σ(D(S, m) + D_f(S_f, m_f))

    where σ is the nonlinear sigmoid function, D(S, m) is the difference in quality scores after adding the candidate entity, and D_f(S_f, m_f) is the corresponding difference in the flexible perceptual field.

We employ log-loss as the loss function to train our classifier f(S,m). The loss L is defined as follows:

(5) L = −[y · log P(m ∈ S) + (1 − y) · log(1 − P(m ∈ S))]

where y = 1 if m belongs to the synonym set S, and y = 0 otherwise. Next, we describe the quality scorer architecture Q(·).
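
Eqs. (2)–(4) and the log-loss can be sketched as follows. This is a minimal sketch with a toy quality scorer Q (set quality taken as set size, purely an assumption for illustration); in the framework itself Q is the neural set scorer:

```python
import math

def membership_probability(Q, S, m, S_f, m_f):
    """P(m in S) = sigmoid(D(S, m) + D_f(S_f, m_f)); cf. Eqs. (2)-(4).

    Q is any set-quality scorer; S_f and m_f are the flexible perceptual
    field counterparts of the set S and the candidate entity m.
    """
    d = Q(S | {m}) - Q(S)          # Eq. (2): quality gain from adding m
    d_f = Q(S_f | {m_f}) - Q(S_f)  # Eq. (3): gain in the flexible field
    return 1.0 / (1.0 + math.exp(-(d + d_f)))  # Eq. (4): sigmoid of the sum

def log_loss(y, p, eps=1e-12):
    """Eq. (5): y = 1 if the entity truly belongs to the set, else 0."""
    p = min(max(p, eps), 1 - eps)  # clip for numerical stability
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Toy quality scorer (an assumption): set quality is just set size,
# so adding any new entity yields a quality gain of 1 in each field.
Q = len
p = membership_probability(Q, {"USA"}, "U.S.", {"USA"}, "U.S.")
```

With both difference scores equal to 1, the probability is sigmoid(2) ≈ 0.88, comfortably above the 0.5 threshold used later.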

Scorer architecture.

As shown in Fig 6, following Shen et al. [13], the scorer architecture consists of an embedding layer, a transformer, a summation operation, and a post-transformer network. The scoring architecture operates as follows:

  • First, given a set of items Z = {z_1, z_2, ..., z_n}, the set scorer initially feeds each item z_i into an embedding layer to derive its corresponding embedding vector, denoted as emb(z_i).
  • Second, employing the embedding transformer (a fully connected neural network with two hidden layers), we transform each original embedding vector into a new item representation ẑ_i.
  • Third, a summation operation is performed on all the transformed item representations to yield the set representation: S(Z) = Σ_i ẑ_i.
  • Last, the set representation S(Z) is fed into a post-transformer (a fully connected neural network with three hidden layers) that produces the ultimate set quality score.
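
The four steps above (embed, transform per item, sum, post-transform) can be sketched in NumPy. Layer sizes, the random initialization, and the ReLU nonlinearity are assumptions for illustration; the point is that the sum pooling makes the score permutation-invariant over the set's items:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, weights):
    """Fully connected net: ReLU between layers, linear final layer."""
    for i, (W, b) in enumerate(weights):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def make_weights(sizes):
    return [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
            for a, b in zip(sizes, sizes[1:])]

class SetScorer:
    """Embed -> per-item transform (2 hidden layers) -> sum -> post net (3 hidden layers)."""
    def __init__(self, vocab, dim=16):
        self.emb = {w: rng.standard_normal(dim) for w in vocab}
        self.item_net = make_weights([dim, 32, 32, dim])    # 2 hidden layers
        self.post_net = make_weights([dim, 32, 32, 32, 1])  # 3 hidden layers
    def __call__(self, items):
        z = np.stack([mlp(self.emb[w], self.item_net) for w in items])
        pooled = z.sum(axis=0)  # permutation-invariant set representation
        return float(mlp(pooled, self.post_net)[0])

scorer = SetScorer(["a", "b", "c"])
```

Because the per-item transform is shared and the pooling is a sum, scoring {"a", "b"} and {"b", "a"} yields the same quality score.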

Dynamic-weight-based set generation algorithm

In the preceding study [27], a set generation algorithm was introduced, leveraging a set scorer to generate synonym sets from the candidate entity vocabulary V. The algorithm employed the set scorer to compute a probability P to determine whether a new candidate entity should be added to an existing synonym set. If P is greater than the threshold a, then the entity is added to that set; otherwise, a new set is created and added to the pool of sets, denoted as Pool.

However, the study [28] exclusively relies on entity embedding features to capture synonym signals, neglecting the contextual information of entity synonyms between the set layer and the sentence layer. Furthermore, with an increasing number of sets, the set generation algorithm may face an issue of imbalanced set generation, wherein the model tends to exhibit bias toward larger sets. This issue could potentially result in a reduction in accuracy during synonym set generation.

To deal with the above issue, we design a dynamic-weight-based set generation algorithm. The algorithm solves the issue of imbalance in set generation and improves its accuracy. For sets with fewer synonyms, dynamic weights can increase their influence in training, making the model focus more on important features that might be obscured by lower-frequency sets.

To address the challenge of set imbalance, the dynamic-weight-based set generation algorithm employs a data sampling technique and introduces a weighted cross-entropy function. The algorithm aims to rebalance the distribution of samples across different sets by adjusting set weights. By assigning distinct weights to different sets, the function is tailored to guide the model toward effectively discerning sets with fewer entities. The algorithm proposes calculating the weight for each set based on the number of entities in different synonym sets. Specifically, it utilizes the inverse class frequency as weights, i.e., computing the total number of entities divided by the number of entities in each set. The algorithm fine-tunes the model by assigning higher weights to sets with fewer instances, which mitigates the effect of set imbalance. Details are provided below:

(6) w_S = H / (|S| + ε)

where H is the total number of entities, |S| is the number of synonymous entities belonging to set S, and ε is a small constant added to prevent division by zero.

The set weights are then incorporated through a weighted cross-entropy function, which multiplies the cross-entropy of each sample by its corresponding set weight and averages over all N set samples. Details are given below:

(7) L_w = −(1/N) · Σ_{i=1}^{N} w_i · [y_i · log(p_i) + (1 − y_i) · log(1 − p_i)]

where w_i represents the weight of the set to which the i-th entity belongs, y_i stands for the true label, and p_i denotes the predicted probability.

This dynamic weight assignment aims to ameliorate the issue of imbalance by automatically adjusting the weights of various sets during each training iteration. As a result, the candidate entities in each iteration are predominantly compared with synonym sets bearing larger weights. In this research, we directly set the probability threshold a as 0.5 and explore its impact on clustering performance in subsequent analyses.
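
The inverse-class-frequency weights and the weighted cross-entropy described above can be sketched as follows. This is a minimal sketch; the (weight, label, probability) triple format is an assumption for illustration:

```python
import math

def set_weights(pool, eps=1.0):
    """Eq. (6): inverse class frequency, w_S = H / (|S| + eps)."""
    H = sum(len(S) for S in pool)          # total number of entities
    return [H / (len(S) + eps) for S in pool]

def weighted_cross_entropy(samples, eps=1e-12):
    """Eq. (7): samples are (weight, true_label, predicted_probability)."""
    total = 0.0
    for w, y, p in samples:
        p = min(max(p, eps), 1 - eps)      # clip for numerical stability
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(samples)

weights = set_weights([{"a", "b", "c"}, {"d"}])  # smaller set gets larger weight
```

With four entities in total, the three-entity set receives weight 4/4 = 1.0 while the singleton receives 4/2 = 2.0, so mistakes on the smaller set cost more during training.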

Algorithm 1 Dynamic-weight-based set generation algorithm.

Require (1) flexible perceptual field neural network classifier f(S,m); (2) vocabulary of candidate entities V = {v_1, v_2, ..., v_n}; (3) probability threshold a.

Ensure Entity synonym set pool S = {S_1, S_2, ..., S_k}, where S_1 ∪ S_2 ∪ ... ∪ S_k = V, and S_i ∩ S_j = ∅ for i ≠ j

  Create the first cluster S_1 = {v_1} and add it to the entity synonym set pool S

  for each remaining candidate entity v_i in V do

    best_score ← 0; best_j ← −1

    total_entities ← total number of entities across all sets in S

    for each set S_j in S, visited in descending order of weight(S_j) = total_entities / (|S_j| + 1) do

      if f(S_j, v_i) > best_score then

        best_score ← f(S_j, v_i); best_j ← j

      end if

    end for

    if best_score > a then

      add v_i to S_{best_j}

    else

      create a new set {v_i} and append it to S

    end if

  end for

  return S

As shown in Algorithm 1, the input of the algorithm comprises three parts: (1) a flexible perceptual field neural network classifier f(S, m), (2) a vocabulary of candidate entities V, and (3) a probability threshold a. The output of the algorithm is an entity synonym set pool S clustered from the vocabulary V.

Specifically, the algorithm enumerates each candidate entity e_i in V once and maintains a pool S containing all identified sets. Within the pool, dynamic weight allocation adjusts the weight of each set at every iteration, so that the candidate entity entered in each iteration is preferentially compared with sets of larger weight (i.e., those containing fewer entities). For each candidate entity e_i, we apply the flexible perceptual field neural network classifier to compute the probability of adding the entity into each identified set in S.

If the highest such probability exceeds the threshold a, e_i is added to the corresponding set. Otherwise, a new set is created from this candidate entity and added to the set pool. This iterative process continues until the entire vocabulary has been traversed once, after which the algorithm returns all detected synonym sets.
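The procedure above can be sketched as follows. Here `classifier(current_set, entity)` stands in for the flexible perceptual field classifier f(S, m); it, the threshold default, and the weight ordering are assumptions for illustration, not the paper's implementation:

```python
def generate_synonym_sets(vocab, classifier, threshold=0.5):
    """Greedy single-pass set generation with dynamic set weights."""
    pool = [[vocab[0]]]                      # first cluster S_1 = {e_1}
    for entity in vocab[1:]:
        total = sum(len(s) for s in pool)
        # dynamic weights: smaller sets receive larger weights (Eq. 6),
        # so they are visited first
        order = sorted(pool, key=lambda s: total / (len(s) + 1), reverse=True)
        best_score, best_set = 0.0, None
        for s in order:
            score = classifier(s, entity)
            if score > best_score:
                best_score, best_set = score, s
        if best_score > threshold:
            best_set.append(entity)          # join the best-matching set
        else:
            pool.append([entity])            # otherwise start a new set
    return pool
```

For example, with a toy classifier that scores 1.0 when the entity shares its first letter with a set and 0.0 otherwise, `["apple", "avocado", "banana"]` clusters into `[["apple", "avocado"], ["banana"]]`.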

Experiments

In this section, to evaluate the effectiveness of the EnSynFields approach in generating synonym sets, we perform experiments on three real-world public datasets. Initially, we outline the experimental setup, followed by the presentation of experimental results. Finally, we delve into a comprehensive analysis of each component of EnSynFields, showcasing several specific case studies.

Experimental design

We commence by delineating the dataset, the comparative approaches, and the model variables. Additionally, we outline the experimental setup proposed for EnSynFields.

Datasets.

We evaluated EnSynFields on three real-world public benchmark datasets.

Wiki. Wiki contains approximately 100,000 articles extracted from Wikipedia, comprising a total of 6,839,331 sentences. The Freebase knowledge base (https://developers.google.com/freebase?hl=zh-cn) is utilized to enrich and annotate the Wiki dataset.

NYT. NYT consists of 118,664 news articles from The New York Times (2013), totaling 3,002,123 sentences. The Freebase knowledge base is employed to annotate the NYT dataset.

PubMed. PubMed includes approximately 1.5 million abstracts of research papers from PubMed (https://www.ncbi.nlm.nih.gov/pubmed), amounting to 15,051,203 sentences. The UMLS knowledge base (https://www.nlm.nih.gov/research/umls/) is applied to enhance and structure the PubMed dataset.

DBpedia Spotlight [25] serves as the entity linker for the Wiki and NYT datasets, while PubMed utilizes PubTator [29]. Additionally, knowledge bases such as Freebase [30] and UMLS [31] are employed. For test set creation, a subset of linked entities was randomly selected, with the remaining entities reserved for training. The details of the datasets are summarized in Table 1.

Approaches of comparison.

Kmeans. Kmeans is an unsupervised clustering algorithm [32]. This approach takes candidate entity embeddings as input to the algorithm, and the output is the set of detected synonyms. It is used to cluster synonym sets within the entity vocabulary.

Louvain. Louvain is an unsupervised algorithm for discovering the structure of network communities based on modularity optimization [33]. This approach constructs a graph of candidate entities, where each candidate entity represents a node. The cosine similarity between candidate entity embeddings is then calculated, and an edge is added to the graph if the similarity exceeds a predefined threshold.
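The graph-construction step of this baseline can be sketched as follows; the toy 2-d embeddings and the 0.8 threshold are illustrative assumptions (the Louvain community detection itself is omitted):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_graph(embeddings, threshold=0.8):
    """Return edges (i, j) between entities whose similarity exceeds threshold."""
    n = len(embeddings)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if cosine(embeddings[i], embeddings[j]) > threshold
    ]
```

The resulting edge list would then be fed to a community detection routine to obtain candidate synonym sets.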

SVM+Louvain. SVM+Louvain is a two-stage supervised approach [34], where the initial stage uses SVM to predict synonym pairs, and the second stage involves applying the predicted synonym pairs to construct a graph, which is then processed using the Louvain algorithm to obtain a synonym set.

SetExpan+Louvain. SetExpan+Louvain is a two-stage supervised approach, where the initial stage uses SetExpan (a weakly supervised set expansion algorithm) [34] to find the K nearest neighbors of each candidate entity in a vocabulary list and then constructs a k-NN graph. In the second stage, the Louvain algorithm is applied to this graph to obtain the set of synonyms.

COP-Kmeans. COP-Kmeans is a semi-supervised variant of Kmeans, combining the COP and Kmeans algorithms for clustering [35]. It utilizes constraint information to guide the clustering process, improving accuracy and stability. The approach sets the oracle number K of clusters for each dataset and converts the training synonyms into pairwise constraints.

SynSetMine. SynSetMine is a two-stage supervised approach [13], where the initial stage trains a classifier. In the second stage, a set generation algorithm is employed to generate a set of synonyms.

EnSynFields. EnSynFields is our proposed approach for generating entity synonym sets based on flexible perceptual fields. It combines multi-layer contextual information with a dynamic-weight-based algorithm to develop synonym sets from the candidate entity lexicon.

Hyper-parameter setting.

For entity embedding, we refer to the 50-dimensional entity embedding approach published by Shen et al. [13]. To ensure a fair comparison across datasets, we fix the dimensionality of the associative item embedding at 50. To tune the model hyper-parameters, we employ a five-fold cross-validation approach.

For EnSynFields, we use two hidden-layer embedding transformers of sizes 50 and 250 for the embedded feature layer and three post-hidden-layer transformers of sizes 250, 500, and 250. We apply the Adam optimization algorithm with an initial learning rate of 0.001 to optimize the EnSynFields model. Additional model hyper-parameters are provided in Table 2.
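For reference, a single-parameter sketch of the Adam update rule used here, with the paper's initial learning rate of 0.001; the default betas, epsilon, and the toy gradient in the test are standard assumptions, not values reported in the paper:

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta at step t (t >= 1)."""
    m = b1 * m + (1 - b1) * grad           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)              # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

At the first step the bias correction makes the update magnitude approximately equal to the learning rate, regardless of the gradient scale.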

Evaluation metrics.

Following previous research, we adopt three standard clustering metrics as the evaluation metrics for this experiment.

ARI: ARI is a prevalent metric for gauging the similarity of two cluster assignments. A given model predicts a cluster assignment denoted by C1, and the true cluster assignment is denoted by C2. ARI is calculated as follows:

(8) $RI = \dfrac{TP + TN}{\binom{N}{2}}$

(9) $ARI = \dfrac{RI - \mathbb{E}[RI]}{\max(RI) - \mathbb{E}[RI]}$

where TP denotes the number of element pairs assigned to the same cluster in both C1 and C2, while TN represents the number of element pairs assigned to different clusters in both C1 and C2. FP denotes the number of element pairs assigned to the same cluster in C1 but not in C2, and FN represents the number of element pairs assigned to the same cluster in C2 but not in C1. The total number of element pairs is given by $\binom{N}{2}$, where $N$ is the total number of elements, and $\mathbb{E}[RI]$ is the expected Rand index under random assignment.

FMI: FMI is another similarity measure that evaluates the agreement between two cluster assignments, using precision and recall to calculate similarity. Details are as follows:

(10) $FMI = \dfrac{TP}{\sqrt{(TP + FP)(TP + FN)}}$

where TP denotes the number of element pairs correctly assigned to the same cluster in both C1 and C2, FP represents the number of element pairs assigned to the same cluster in C1 but not in C2, and FN denotes the number of element pairs assigned to the same cluster in C2 but not in C1.

NMI: NMI computes normalized mutual information between two cluster distributions, which consists of mutual information (MI) and information entropy (IE) as shown below:

(11) $NMI(X, Y) = \dfrac{I(X, Y)}{\sqrt{H(X)\,H(Y)}}$

where I(X,Y) is the mutual information (MI) between X and Y, and H(X) and H(Y) are the information entropy (IE) of X and Y, respectively.
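The three metrics above can be computed directly from two flat label assignments; the following is a minimal pure-Python illustration under the standard definitions (not the paper's evaluation code), with C1 as `pred` and C2 as `true`:

```python
import math
from collections import Counter
from math import comb, log

def pair_counts(true, pred):
    """TP/FP/FN/TN over all element pairs for two label assignments."""
    n = len(true)
    tp = fp = fn = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_t, same_p = true[i] == true[j], pred[i] == pred[j]
            tp += same_t and same_p
            fp += same_p and not same_t
            fn += same_t and not same_p
    return tp, fp, fn, comb(n, 2) - tp - fp - fn

def ari(true, pred):
    tp, fp, fn, _ = pair_counts(true, pred)
    total = comb(len(true), 2)
    sum_t, sum_p = tp + fn, tp + fp        # same-cluster pairs in C2, C1
    expected = sum_t * sum_p / total       # chance-level agreement
    max_index = (sum_t + sum_p) / 2
    return (tp - expected) / (max_index - expected)

def fmi(true, pred):
    tp, fp, fn, _ = pair_counts(true, pred)
    return tp / math.sqrt((tp + fp) * (tp + fn))

def nmi(true, pred):
    n = len(true)
    joint = Counter(zip(true, pred))
    px, py = Counter(true), Counter(pred)
    mi = sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())
    hx = -sum(c / n * log(c / n) for c in px.values())
    hy = -sum(c / n * log(c / n) for c in py.values())
    return mi / math.sqrt(hx * hy)
```

All three metrics reach 1.0 when the two assignments are identical up to a relabeling of clusters.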

Experimental results

In this study, we conduct experiments to evaluate the generation of entity synonym sets in terms of both efficiency and effectiveness. To fully understand the experimental results, this section is divided into two main parts: clustering performance and flexible perceptual field neural network classifier prediction performance.

Clustering performance.

Table 3 presents a comparison of clustering performance, where the best results are underlined. From the results, we observe that EnSynFields outperforms the baseline approaches across three publicly available datasets.

Table 3. Performance comparison of entity synonym set generation.

https://doi.org/10.1371/journal.pone.0321381.t003

As shown in Table 3, the unsupervised approaches Kmeans and Louvain exhibit lower performance than EnSynFields, mainly because they cannot leverage labeled signals, which limits their effectiveness when such labels are available. SetExpan+Louvain demonstrates lower performance compared to EnSynFields but significantly outperforms Kmeans and Louvain. This is because SetExpan+Louvain is a two-stage supervised algorithm that utilizes K-nearest neighbor information along with Louvain’s algorithm.

In contrast to the Kmeans algorithm, COP-Kmeans integrates additional supervised data from the training set, which enhances its performance through more informed clustering decisions. We observe that SVM+Louvain performs significantly worse than the other approaches, suggesting that using SVM to capture supervised information results in poor performance for synonym set generation. The primary drawback of SVM+Louvain stems from the fact that its learning model lacks a holistic view of the set and relies solely on pairwise similarity.

SynSetMine outperforms the other baselines but still falls short of EnSynFields. This is because SynSetMine only considers semantic relations at the entity layer, ignoring the broader global entity context, which could substantially improve the quality of synonym set discovery. The above analysis suggests that the flexible perceptual field neural network classifier and the dynamic-weight-based set generation algorithm enhance the effectiveness of entity synonym set generation.

Flexible perceptual field neural network classifier performance.

To evaluate neural network classifiers with flexible perceptual fields, the F1 scores of EnSynFields and SynSetMine are compared using datasets from NYT and PubMed. Fig 7(a) shows these scores for varying negative sample sizes based on the NYT dataset. Observations reveal that EnSynFields consistently achieves a higher F1 score compared to SynSetMine. As depicted in Fig 7(b), this trend is evident across various training periods on the PubMed dataset, with EnSynFields maintaining superior F1 scores throughout.

Fig 7. Comparing F1 scores across different negative sampling sizes and epochs.

https://doi.org/10.1371/journal.pone.0321381.g007

The above results show that the flexible perceptual field neural network classifier captures entities, sets, and sentences as a whole, and integrates them into the dynamic-weight-based set generation algorithm. This integration effectively enhances the model’s predictive ability for synonym sets.

Semantic similarity evaluation.

To further evaluate the semantic consistency of the generated synonym sets, we incorporate BERT-based cosine similarity as an additional metric. Unlike clustering metrics, which evaluate structural accuracy, BERT similarity quantifies the semantic coherence of words within the same synonym set.
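A set-level coherence score of this kind can be sketched as the mean pairwise cosine similarity of the embeddings within one synonym set; the toy 2-d vectors below stand in for real BERT embeddings (an assumption for illustration):

```python
import math

def set_coherence(embeddings):
    """Mean pairwise cosine similarity of the vectors within one synonym set."""
    # Normalize each vector once, so cosine similarity is a plain dot product.
    unit = [
        [x / math.sqrt(sum(c * c for c in v)) for x in v] for v in embeddings
    ]
    pairs = [
        sum(a * b for a, b in zip(unit[i], unit[j]))
        for i in range(len(unit))
        for j in range(i + 1, len(unit))
    ]
    return sum(pairs) / len(pairs)
```

A score near 1.0 indicates that all words in the set point in nearly the same direction in embedding space, i.e., high semantic coherence.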

Table 4 presents the BERT cosine similarity scores across datasets, alongside their corresponding NMI scores for comparison.

Table 4. Comparison of BERT cosine similarity and NMI scores across datasets.

https://doi.org/10.1371/journal.pone.0321381.t004

We observe a positive correlation between BERT similarity and NMI scores across datasets. Specifically, PubMed, which exhibits the highest NMI (96.3%), also achieves the highest BERT similarity (0.923), while NYT, with the lowest NMI (91.5%), obtains a relatively lower similarity score of 0.877.

This suggests that our model not only performs well in clustering structure but also preserves strong semantic relationships within synonym sets. However, despite these high similarity scores, some errors still occur, particularly when contextually related but semantically distinct words are grouped together.

Model analysis

This section examines how various parameters, such as hidden layer sizes, embedding dimension, and threshold a, influence performance from multiple perspectives.

Analysis of different hidden layer sizes.

To assess the impact of hidden layer size on the scorer architecture more comprehensively, we evaluate a range of sizes. For the embedding transformer of the scorer architecture, the hidden layer sizes are 200, 250, 300, 350, and 400. For the post-hidden-layer transformer, the sizes are 200×2, 250×2, 300×2, 350×2, and 400×2. Fig 8 shows the ARI and FMI of the different scorer architectures across the three datasets. The performance improvement of each architecture is noticeable when the hidden layer size transitions from 200 to 250; however, performance generally declines as the hidden layer size increases further. We therefore consider embedding transformer hidden layer sizes of 200 to 250, and post-hidden-layer transformer sizes of 400 to 500, adequate for capturing the essential synonymy signals required for identifying entity synonym relationships.

Fig 8. Results of different layer sizes for scorer architecture.

https://doi.org/10.1371/journal.pone.0321381.g008

Effect of dimension of embeddings.

In representation learning, the size of embeddings significantly impacts the efficacy of machine learning models. We explored this by fixing all other hyper-parameters and varying the embedding dimension across 25, 50, 75, 100, 125, 150. Fig 9 illustrates the influence of embedding size on the performance of SynSetMine and EnSynFields using NYT and Wiki datasets.

Fig 9. Effect of embedding dimension on the performance of SynSetMine and EnSynFields.

https://doi.org/10.1371/journal.pone.0321381.g009

We observe that EnSynFields consistently outperforms SynSetMine, and if the dimensionality is not large enough, the model suffers from underfitting, leading to convergence difficulties. For example, the model performs poorly when the embedding dimension is set to 25. Performance improves gradually when the dimension exceeds 50. Moreover, the experimental results indicate that our model remains stable across varying embedding sizes.

Effect of different probability threshold a.

Algorithm 1 outlines our set generation approach, which uses a probability threshold a to decide whether a candidate entity joins an existing set. A higher threshold makes the algorithm more conservative, leading to the creation of more sets.

This experiment focuses on examining the impact of this hyper-parameter on the experimental outcomes. We execute the set generation algorithm using a flexible perceptual field neural network classifier and vary the threshold to observe its effects. Fig 10 displays the results.

Initially, we observe that clustering effectiveness remains stable across different thresholds, with values between 0.4 and 0.6 often being more suitable. Furthermore, a threshold of 0.5 emerges as a robust choice, consistently yielding favorable outcomes across different datasets.

Case studies

Table 5 presents example outputs generated by EnSynFields; the displayed entity synonym sets are randomly selected from the approach's output. We observe that our approach can generate entity synonym sets of different types and domains. For example, "film" and "movie" are correctly predicted as synonyms, as are "infant" and "baby". However, our approach is not perfect and sometimes produces incorrect results; for instance, predicting "purse" and "medical doctor" as synonyms is incorrect. This failure case suggests that our model may struggle to distinguish contextually similar but semantically different words. Future improvements could involve integrating more fine-grained contextual embeddings to mitigate such errors.

Table 5. Entity synonym set output examples (G denotes the ground truth, and O denotes the output results of our approach).

https://doi.org/10.1371/journal.pone.0321381.t005

Conclusion

This paper presents an approach to extract entity synonym sets from text corpora. We construct a three-layer interaction field network to capture contextual information across the entity layer, set layer, and sentence layer. To determine whether a candidate entity belongs to a synonym set, we develop a flexible perceptual field neural network classifier and incorporate it into a dynamic-weight-based algorithm for detecting new entity synonym sets.

We evaluated the proposed approach on three distinct real-world synonym set datasets and compared it with several contemporary state-of-the-art approaches. The empirical findings indicate that our approach is more effective at entity synonym set generation, surpassing the performance of the most advanced existing techniques.

In subsequent research endeavors, our focus will be on investigating the potential of leveraging entity attribute information to augment the efficacy of entity synonym generation approaches. Additionally, we plan to design a multimodal strategy aimed at distinguishing near-sense entities from synonymous entities, thereby enhancing the precision of augmented entity synonym sets.

References

  1. Shen J, Qiu W, Shang J, Vanni M, Ren X, Han J. SynSetExpan: an iterative framework for joint entity set expansion and synonym discovery. In: Webber B, Cohn T, He Y, Liu Y, editors. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16–20, 2020. Association for Computational Linguistics; 2020. p. 8292–307.
  2. Huang S, Luo X, Huang J, Qin W, Gu S. Neural entity synonym set generation using association information and entity constraint. In: 2020 IEEE International Conference on Knowledge Graph, ICKG 2020, Online, August 9–11, 2020; 2020. p. 321–8.
  3. Yang Y, Yin X, Yang H, Fei X, Peng H, Zhou K, et al. KGSynNet: a novel entity synonyms discovery framework with knowledge graph. In: Jensen CS, Lim E, Yang D, Lee W, Tseng VS, Kalogeraki V, et al., editors. Database Systems for Advanced Applications - 26th International Conference, DASFAA 2021, Taipei, Taiwan, April 11–14, 2021, Proceedings, Part I. vol. 12681 of Lecture Notes in Computer Science. Springer; 2021. p. 174–190.
  4. Yin D, Hu Y, Tang J, Jr TD, Zhou M, Ouyang H, et al. Ranking relevance in Yahoo search. In: Krishnapuram B, Shah M, Smola AJ, Aggarwal CC, Shen D, Rastogi R, editors. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13–17, 2016. ACM; 2016. p. 323–32.
  5. Al-Ubaydli M. Using search engines to find online medical information. PLoS Med. 2005;2(9).
  6. Zhou G, Liu Y, Liu F, Zeng D, Zhao J. Improving question retrieval in community question answering using world knowledge. In: Rossi F, editor. IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, August 3–9, 2013. IJCAI/AAAI; 2013. p. 2239–45.
  7. Gu S, Luo X, Wang H, Huang J, Wei Q, Huang S. Improving answer selection with global features. Exp Syst Appl. 2021;38(1). https://doi.org/10.1016/j.eswa.2021.113123
  8. Bakhshi M, Nematbakhsh M, Mohsenzadeh M, Rahmani A. SParseQA: sequential word reordering and parsing for answering complex natural language questions over knowledge graphs. Knowl-Based Syst. 2022;235:107626.
  9. Li X, Alazab M, Li Q, Yu K, Yin Q. Question-aware memory network for multi-hop question answering in human-robot interaction. arXiv preprint. 2021. Available at: https://arxiv.org/abs/2104.13173
  10. Xiao K, Bai Y, Wang Z. Extracting prerequisite relations among concepts from the course descriptions (SEKEEO-RN). Int J Softw Eng Knowl Eng. 2022;32(4):503–23.
  11. Huang S, Luo X, Huang J, Guo Y, Gu S. An unsupervised approach for learning a Chinese IS-A taxonomy from an unstructured corpus. Knowl Based Syst. 2019;182.
  12. Shen J, Shen Z, Xiong C, Wang C, Wang K, Han J. TaxoExpan: self-supervised taxonomy expansion with position-enhanced graph neural network. In: Huang Y, King I, Liu T, van Steen M, editors. WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20–24, 2020. ACM/IW3C2; 2020. p. 486–97.
  13. Shen J, Lyu R, Ren X, Vanni M, Sadler BM, Han J. Mining entity synonyms with efficient neural set generation. In: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press; 2019. p. 249–56.
  14. Turney PD. Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. In: Raedt LD, Flach PA, editors. Machine Learning: EMCL 2001, 12th European Conference on Machine Learning, Freiburg, Germany, September 5–7, 2001, Proceedings. vol. 2167 of Lecture Notes in Computer Science. Springer; 2001. p. 491–502.
  15. Nakashole N, Weikum G, Suchanek FM. PATTY: a taxonomy of relational patterns with semantic types. In: Tsujii J, Henderson J, Pasca M, editors. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, July 12–14, 2012, Jeju Island, Korea. ACL; 2012. p. 1135–45.
  16. Chakrabarti K, Chaudhuri S, Cheng T, Xin D. A framework for robust discovery of entity synonyms. In: Yang Q, Agarwal D, Pei J, editors. The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, Beijing, China, August 12–16, 2012. ACM; 2012. p. 1384–92.
  17. Ren X, Cheng T. Synonym discovery for structured entities on heterogeneous graphs. In: Gangemi A, Leonardi S, Panconesi A, editors. Proceedings of the 24th International Conference on World Wide Web Companion, WWW 2015, Florence, Italy, May 18–22, 2015 - Companion Volume. ACM; 2015. p. 443–53.
  18. Shen J, Wu Z, Lei D, Shang J, Ren X, Han J. SetExpan: corpus-based set expansion via context feature selection and rank ensemble. In: Ceci M, Hollmen J, Todorovski L, Vens C, Dzeroski S, editors. Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18–22, 2017, Proceedings, Part I. vol. 10534 of Lecture Notes in Computer Science. Springer; 2017. p. 288–304.
  19. Huang S, Qin W, Zhao S, Gu S. An automatic approach for extracting Chinese entity synonyms from encyclopedias. In: ICBDT 2020: 3rd International Conference on Big Data Technologies, Qingdao, China, September, 2020; 2020. p. 182–7.
  20. Qu M, Ren X, Zhang Y, Han J. Weakly-supervised relation extraction by pattern-enhanced embedding learning. In: Proceedings of the 2018 World Wide Web Conference on World Wide Web. ACM; 2018. p. 1257–66.
  21. Yu J, Lu W, Xu W, Tang Z. Entity synonym discovery via multiple attentions. In: Semantic Technology - 9th Joint International Conference, JIST 2019, Hangzhou, China, November 25–27, 2019, Proceedings. vol. 12032 of Lecture Notes in Computer Science. Springer; 2019. p. 271–86.
  22. Zhang W, Huang Z, Wang Y, Wan X. Enhancing Chinese entity recognition by unsupervised clustering of synonym sets. In: Proceedings of the 10th International Conference on Natural Language Processing and Knowledge Engineering. IALP; 2016. p. 380–8.
  23. Qu M, Ren X, Han J. Automatic synonym discovery with knowledge bases. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13–17, 2017. ACM; 2017. p. 997–1005.
  24. Zhang C, Li Y, Du N, Fan W, Yu PS. Entity synonyms discovery via multi-piece bilateral context matching. In: IJCAI; 2020.
  25. Daiber J, Jakob M, Hokamp C, Mendes PN. Improving efficiency and accuracy in multilingual entity extraction. In: Sabou M, Blomqvist E, Noia TD, Sack H, Pellegrini T, editors. I-SEMANTICS 2013 - 9th International Conference on Semantic Systems, I-SEMANTICS ’13, Graz, Austria, September 4–6, 2013. ACM; 2013. p. 121–4.
  26. Robertson SE, Walker S, Jones S. ACM Transactions on Information Systems; 1996. p. 288–311.
  27. Huang S, Luo X, Huang J, Qin W, Gu S. Neural entity synonym set generation using association information and entity constraint. In: Chen E, Antoniou G, editors. 2020 IEEE International Conference on Knowledge Graph, ICKG 2020, Online, August 9–11, 2020. IEEE; 2020. p. 321–8.
  28. Huang S, Luo X, Huang J. A bilateral context and filtering strategy-based approach to Chinese entity synonym set expansion. Complex Intell Syst. 2023:1–21.
  29. Wei C-H, Kao H-Y, Lu Z. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res. 2013;41(Web Server issue):W518-22. https://doi.org/10.1093/nar/gkt441 PMID: 23703206
  30. Bollacker K, Evans C, Paritosh P, Sturge T, Taylor J. Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data; 2008. p. 1247–50.
  31. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(Database issue):D267-70. https://doi.org/10.1093/nar/gkh061 PMID: 14681409
  32. Blondel VD, Guillaume J, Lambiotte R. Fast unfolding of communities in large networks: 15 years later. arXiv preprint. 2023. Available at: https://arxiv.org/abs/2311.06047
  33. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
  34. Wagstaff K, Cardie C, Rogers S, Schrödl S. Constrained K-means clustering with background knowledge. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28–July 1, 2001; 2001. p. 577–84.
  35. Hsu Y, Lv Z, Kira Z. Learning to cluster in order to transfer across domains and tasks. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, Conference Track Proceedings; 2018.