## Abstract

Topic models are widely used to discover the latent representation of a set of documents. The two canonical models are latent Dirichlet allocation (LDA) and Gaussian latent Dirichlet allocation (GLDA): the former uses multinomial distributions over words, and the latter uses multivariate Gaussian distributions over pre-trained word embedding vectors, as the latent topic representations. Compared with latent Dirichlet allocation, Gaussian latent Dirichlet allocation is limited in the sense that it does not capture the polysemy of a word such as “bank.” In this paper, we show that Gaussian latent Dirichlet allocation can recover the ability to capture polysemy by introducing a hierarchical structure in the set of topics that the model can use to represent a given document. Our Gaussian hierarchical latent Dirichlet allocation significantly improves polysemy detection compared with Gaussian-based models and provides more parsimonious topic representations compared with hierarchical latent Dirichlet allocation. Our extensive quantitative experiments show that our model also achieves better topic coherence and held-out document predictive accuracy over a wide range of corpora and word embedding vectors. Our model learns the underlying topic distribution and hierarchical structure among topics simultaneously, which can be further used to understand the correlation among topics. Moreover, the added flexibility of our model does not necessarily increase the time complexity compared with GLDA and the correlated Gaussian topic model (CGTM), which makes our model a good competitor to GLDA.

**Citation:** Yoshida T, Hisano R, Ohnishi T (2023) Gaussian hierarchical latent Dirichlet allocation: Bringing polysemy back. PLoS ONE 18(7): e0288274. https://doi.org/10.1371/journal.pone.0288274

**Editor:** Kathiravan Srinivasan, Vellore Institute of Technology: VIT University, INDIA

**Received:** June 13, 2022; **Accepted:** June 24, 2023; **Published:** July 12, 2023

**Copyright:** © 2023 Yoshida et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** Our data and code are available at https://github.com/hisanor471/ghlda. One dataset cannot be shared publicly due to the provider’s terms of service. However, the other two are open datasets, and the corpora used are included in our code. The Reuters dataset can be purchased from Refinitiv’s Machine Readable News service, as stated in the manuscript, at https://www.refinitiv.com/en/financial-news-services/machine-readable-news. The exact contact e-mail address varies by region, so please use the "Request Details" section on the above website. For clarity, we have included the headlines of the news stories in the above repository. Interested readers can obtain precisely the same data by using those headlines.

**Funding:** The authors received no specific funding for this work.

**Competing interests:** The authors have declared that no competing interests exist.

## 1 Introduction

Topic models are widely used to identify the latent representation of a set of documents. Since latent Dirichlet allocation (LDA) [1] was introduced, topic models have been used in a wide variety of applications. Recent work includes the analysis of legislative text [2], the detection of malicious websites [3], the analysis of white papers [4], and the analysis of the narratives of dermatological disease [5]. The modular structure of LDA, and of graphical models in general [6], has made it possible to create various extensions to the plain vanilla version. Significant works include the correlated topic model (CTM), which incorporates the correlation among topics that co-occur in a document [7]; hierarchical LDA (hLDA), which jointly learns the underlying topics and the hierarchical relational structure among topics [8]; and the dynamic topic model, which models the time evolution of topics [9].

LDA uses multinomial distributions over words, whereas Gaussian LDA (GLDA) [10] uses multivariate Gaussian distributions over a pre-trained word embedding to represent the underlying topics. Using the word embedding vector space representation, GLDA has the added benefit of incorporating semantic regularities in a language, which results in increasing coherency [11–13] of topics [10]. Recent developments of this line of research include correlated Gaussian topic models (CGTM) [14], which add a correlational structure to the topics used in a document; the work of [15], which replaces the Gaussian distribution with a von Mises–Fisher distribution; and the latent concept topic model [16], which redefines each topic as the distribution over latent concepts, where the latent concept is modeled as a multivariate Gaussian distribution over the word embeddings.

A crucial shortcoming of GLDA and CGTM is that they fail to detect the polysemy of a term, such as “bank,” which LDA and hLDA capture well [17]. LDA is a mixed membership model with no mutual exclusivity constraint that restricts the assignment of words to one topic only [17]. As we show in the current paper, the delicate balance in the collapsed Gibbs sampler of LDA [18] between the term that captures the probability of a word under a topic and the term that captures the probability of a topic given a document makes it possible to capture polysemy. However, although GLDA and CGTM are also mixed membership models with no mutual exclusivity constraint, the probability of a word under a topic is characterized by a multivariate distribution that outweighs the term that reflects the likelihood of a topic given a document. Hence, mutual exclusivity is likely to be unintentionally recovered, and the ability to detect polysemy is lost.

In this paper, we show that the ability to capture polysemy in GLDA-type models can be recovered by restricting the set of topics that can be used to represent a given document. One parsimonious implementation of such a restriction can be achieved by incorporating a hierarchical topic structure, as in hLDA [1, 8]. In our Gaussian hLDA, topics that can be used in a document are restricted by a path of topics that are learned jointly from the data. Instead of assigning a topic to each word position in a document, we assign levels that describe the position of the path from which the word was sampled.

At first glance, our model may seem to have a price to pay in terms of time complexity because of the added complexity of the model. However, because we do not need to sample from the entire set of topics for each word position in a document, the time complexity of our model does not necessarily worsen compared with GLDA and CGTM. Moreover, our model has the benefit of capturing polysemy in addition to being able to learn a compact hierarchical structure that shows the relationships among topics. Additionally, as in hLDA [8], Bayesian nonparametric techniques can also be used, thereby making it possible to determine the hierarchical tree structure more flexibly.

Other works also combine topic modeling and word embeddings. [19] used information from the word similarity graph to achieve more coherent topics. [20] modified the likelihood of the model by combining information from pre-trained word embeddings with a log-linear function. Instead of using pre-trained word embedding vectors, some works have attempted to learn word embeddings and topics from the corpus jointly. The embedded topic model [21] uses the inner product between a word embedding and a topic embedding as the natural parameter that governs the multinomial distribution and learns the two representations from the corpus simultaneously. In [22], the model was further extended to incorporate the time evolution of the topic embeddings. The Wasserstein topic model [23] unifies topic modeling and word embedding using the framework of Wasserstein learning. Compared with these models, we leave the word embedding vectors as they are and enrich the topic co-occurrence structure of a document to adapt to the corpus of interest.

Our contributions are summarized as follows:

- We propose the Gaussian hLDA, which significantly improves the capture of polysemy compared with GLDA and CGTM.
- Our model jointly learns the topics, in addition to the hierarchical structure, and characterizes the relationship among the topics. The hierarchical structure can also be used to analyze the correlation structure among topics.
- The hierarchical tree structure can be estimated in a flexible manner using the nested Chinese restaurant process [8].
- Even though our model is far more expressive than GLDA and CGTM, the time complexity does not necessarily worsen compared with that of those two models.
- We show that our model exhibits a more parsimonious representation of topics than hLDA.
- Using three real-world corpora and three different pre-trained word embedding vectors, we show that our model outperforms state-of-the-art models both in terms of the held-out predictive likelihood and topic coherence.
- Although simple and relatively straightforward, our work shows how a simple model modification drastically improves its performance.
- Our code is available online (https://github.com/hisanor471/ghlda).

## 2 Notation

We briefly summarize the mathematical notation used throughout the paper. *D* denotes the number of documents in a corpus, *V* denotes the number of unique words in the corpus, *K* denotes the number of topics, *M* denotes the dimension of the word embedding vector, and *L* denotes the depth of the maximum level of the hierarchy. Lower-case letters (e.g., *d*, *v*, and *k*) denote a specific document, word, or topic. *θ*_{d,k} denotes the probability of topic *k* for document *d* and, *φ*_{k,v} denotes the probability of word *v* in topic *k*. *N*_{d} denotes the number of words in a document. For each word position *n* in document *d*, *z*_{d,n} denotes either the topic or level assignment for that word position and *w*_{d,n} denotes the word that appears in word position *n* for document *d*. Furthermore, *c*_{d} denotes a path assignment to document *d* and *l*_{d} denotes the level distribution in *d*. For path and level assignments, a topic is uniquely defined as shown in Fig 1. denotes the number of word positions in *d* that are assigned to either topic (LDA, GLDA, CGTM) or level *i* (hLDA, GhLDA), excluding *z*_{d,n}. is defined similarly counting the number of word positions above level *i*. denotes the number of word positions in the entire corpus with word *v* and topic *k*, excluding *z*_{d,n}. *GEM*(*m*, *b*) denotes the Griffiths, Engen, and McCloskey distribution [24], which is used to define a prior level distribution among a path, and *m* and *b* denote hyperparameters that control the stick-breaking process. *nCRP*(*γ*) represents the nested Chinese restaurant process [8], where *γ* denotes a hyperparameter that controls the probability of a new branch that emerges in the current tree (i.e., the parameter that controls the likelihood of the blue rectangle being chosen in Fig 1). *Dir*(*α*) represents a Dirichlet distribution and *Mult* represents a multinomial distribution, where *α* denotes a hyperparameter vector. 
*N*(*μ*, Σ) and denote a normal distribution and multivariate distribution with mean vector *μ* and covariance matrix Σ, respectively. denotes a normal inverse Wishart distribution with hyperparameters *u*, Ψ, *v*, *κ*, where *u* denotes a vector, Ψ denotes a matrix, and *v* and *κ* denote positive real values. Furthermore, *κ*_{k} = *κ* + |*s*_{k}|, *v*_{k} = *v* + |*s*_{k}|, , , where *s*_{k} denotes the set of indicators of word positions that is assigned to topic *k* and denotes the mean vector among the indicators in *s*_{k}.
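For reference, the posterior parameters of the normal inverse Wishart distribution mentioned above can be written out as follows. This is a sketch of the standard conjugate update for a Gaussian likelihood, not a verbatim reproduction of the paper's definitions; the symbols $u_k$, $\Psi_k$, and $\bar{w}_k$ are assumed names for the posterior mean parameter, the posterior scale matrix, and the mean embedding of the word positions in *s*_{k}.

```latex
\kappa_k = \kappa + |s_k|, \qquad
v_k = v + |s_k|, \qquad
u_k = \frac{\kappa\, u + |s_k|\, \bar{w}_k}{\kappa_k},

\Psi_k = \Psi
  + \sum_{i \in s_k} (w_i - \bar{w}_k)(w_i - \bar{w}_k)^{\top}
  + \frac{\kappa\, |s_k|}{\kappa_k}\, (\bar{w}_k - u)(\bar{w}_k - u)^{\top}
```

The last term shrinks the empirical mean toward the prior mean *u*, with a weight that depends on how many word positions are currently assigned to topic *k*.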

## 3 Related work

### 3.1 Gaussian latent Dirichlet allocation

The generative process of LDA and GLDA can be written similarly, and we focus on the GLDA case. GLDA uses word embedding vectors to characterize words in a document. We define *D*, *N*_{d}, *K*, *θ*_{d}, *z*_{d,n} precisely, as summarized in the previous section. Instead of considering *w*_{d,n} as an indicator that denotes a word, as in LDA, we consider it as a vector from a pre-trained word embedding. The generative process is summarized as follows:

- (1) For all topics *k*, sample (*μ*_{k}, Σ_{k}) ∼ *NIW*(*u*, Ψ, *v*, *κ*).
- (2) For each document *d*,
  - (a) sample topic proportion *θ*_{d} ∼ *Dir*(*α*); and
  - (b) for each word position in *d*, sample topic assignments *z*_{d,n} ∼ *Mult*(*θ*_{d}) and words from *w*_{d,n} ∼ *N*(*μ*_{z_{d,n}}, Σ_{z_{d,n}}).

The collapsed Gibbs sampler of GLDA can be written as (1)

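LDA is recovered by replacing “(*μ*_{k}, Σ_{k}) ∼ *NIW*(*u*, Ψ, *v*, *κ*)” in (1) with “*φ*_{k} ∼ *Dir*(*β*),” “*w*_{d,n} ∼ *N*(*μ*_{z_{d,n}}, Σ_{z_{d,n}})” in (2)(b) with “*w*_{d,n} ∼ *Mult*(*φ*_{z_{d,n}}),” and the second term in the sampler with the collapsed multinomial word probability of LDA [18].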
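For orientation, the collapsed Gibbs update of GLDA is commonly written in the form below, following [10]. This is a sketch under assumed notation rather than a verbatim copy of Eq (1): $n_{d,k}^{-(d,n)}$ is an assumed symbol for the number of word positions in *d* assigned to topic *k* excluding *z*_{d,n}, $n_{k,v}^{-(d,n)}$ is the analogous corpus-level count for word *v* and topic *k*, $t_{\nu}(x \mid \mu, \Sigma)$ is an assumed symbol for a multivariate Student-t density with $\nu$ degrees of freedom, and $u_k$, $\Psi_k$ are assumed names for the normal inverse Wishart posterior parameters computed without *w*_{d,n}.

```latex
% GLDA: document-topic term times Student-t posterior predictive
p(z_{d,n} = k \mid z_{-(d,n)}, w, H) \;\propto\;
  \bigl(n_{d,k}^{-(d,n)} + \alpha_k\bigr)\;
  t_{v_k - M + 1}\!\Bigl(w_{d,n} \,\Bigm|\, u_k,\;
    \tfrac{\kappa_k + 1}{\kappa_k\,(v_k - M + 1)}\,\Psi_k\Bigr)

% LDA: the second term becomes the smoothed word-topic ratio
p(z_{d,n} = k \mid z_{-(d,n)}, w, H) \;\propto\;
  \bigl(n_{d,k}^{-(d,n)} + \alpha_k\bigr)\;
  \frac{n_{k,v}^{-(d,n)} + \beta}{\sum_{v'} n_{k,v'}^{-(d,n)} + V\beta}
```

In both cases the first factor depends only on the document and the second only on the topic, which is the decomposition the later discussion of mutual exclusivity relies on.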


### 3.2 Correlated Gaussian topic model

CGTM [14] is an extension of GLDA that incorporates correlation among topics used in a document, similar to CTM [7]. The generative process is summarized as follows:

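- (1) For all topics *k*, sample (*μ*_{k}, Σ_{k}) ∼ *NIW*(*u*, Ψ, *v*, *κ*).
- (2) To model the correlation among topics, sample the mean *μ*_{c} and covariance Σ_{c} of the document-level Gaussian prior.
- (3) For all documents *d*,
  - (a) sample *η*_{d} ∼ *N*(*μ*_{c}, Σ_{c});
  - (b) transform *η*_{d} to a topic proportion vector *θ*_{d} using a softmax function, *θ*_{d,k} = exp(*η*_{d,k}) / Σ_{j} exp(*η*_{d,j}); and
  - (c) for all word positions in *d*, sample topic assignments *z*_{d,n} ∼ *Mult*(*θ*_{d}) and resulting words from *w*_{d,n} ∼ *N*(*μ*_{z_{d,n}}, Σ_{z_{d,n}}).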

CGTM can be estimated by alternately sampling *η*_{d} and the topic assignments *z*_{d,n} for each word position. The sampling of *η*_{d} is rather involved: it introduces an additional auxiliary variable λ_{d} and requires sampling from a Polya–Gamma distribution [25, 26]. After *η*_{d} (and therefore *θ*_{d}) is sampled, the topic assignments *z*_{d,n} are sampled using
(2)
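Given the description above, the per-word update of CGTM plausibly takes the following form. This is a sketch, not a verbatim copy of Eq (2): it assumes that the word term is the same Student-t posterior predictive as in GLDA, and it reuses the assumed symbols $t_{\nu}(x \mid \mu, \Sigma)$, $u_k$, and $\Psi_k$ for the multivariate Student-t density and the normal inverse Wishart posterior parameters.

```latex
\theta_{d,k} = \frac{\exp(\eta_{d,k})}{\sum_{j=1}^{K} \exp(\eta_{d,j})},
\qquad
p(z_{d,n} = k \mid \eta_d, z_{-(d,n)}, w, H) \;\propto\;
  \theta_{d,k}\;
  t_{v_k - M + 1}\!\Bigl(w_{d,n} \,\Bigm|\, u_k,\;
    \tfrac{\kappa_k + 1}{\kappa_k\,(v_k - M + 1)}\,\Psi_k\Bigr)
```

Written this way, the only change relative to the GLDA update is the document-level factor, which is now the softmax-transformed $\eta_d$ rather than a smoothed count.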

### 3.3 Hierarchical latent Dirichlet allocation

The goal of hLDA is to identify topics and hierarchical relationships among the topics simultaneously from the corpus. Words in a document are drawn from the restricted set of topics that are characterized using paths from the hierarchical topic structure. Because of the hierarchical tree structure, topics in the upper level are used more frequently and thus capture more general terms than the lower level. To learn the hierarchical structure more flexibly, hLDA [8] uses the nested Chinese restaurant process as the prior distribution that defines the hierarchy over topics. The generative process is summarized as follows:

- (1) For all topics *k*, sample *φ*_{k} ∼ *Dir*(*β*).
- (2) For each document *d*,
  - (a) sample a path assignment *c*_{d} ∼ *nCRP*(*γ*);
  - (b) sample a distribution over levels in the path, *l*_{d} ∼ *GEM*(*m*, *b*); and
  - (c) for all word positions in *d*, first choose the level assignments *z*_{d,n} ∼ *Mult*(*l*_{d}) and then the resulting words from the topic at that level in the path, *w*_{d,n} ∼ *Mult*(*φ*_{c_{d}[z_{d,n}]}).


In hLDA, we need to sample both the path assignments for all documents and the level assignments for all word positions. The Gibbs sampling algorithm is similar to that used in GhLDA, so we omit it here.
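The two priors in the generative process above can be illustrated with a minimal Python sketch. It assumes a truncated stick-breaking construction for *GEM*(*m*, *b*) with sticks drawn from Beta(*mb*, (1 − *m*)*b*), where the last level absorbs the remaining mass, and a nested Chinese restaurant process in which a new branch is opened with weight *γ*. The function names and the dictionary-based tree are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def sample_level_distribution(m, b, depth, rng):
    """Truncated GEM(m, b): sticks V_i ~ Beta(m*b, (1-m)*b), i = 1..depth."""
    sticks = rng.beta(m * b, (1.0 - m) * b, size=depth)
    sticks[-1] = 1.0  # assumption: the deepest level takes the leftover mass
    # Mass remaining before each level: 1, (1-V_1), (1-V_1)(1-V_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
    return sticks * remaining

def sample_ncrp_path(children, gamma, depth, rng):
    """Draw a root-to-leaf path; `children[node]` maps child id -> doc count.

    The tree is updated in place. Node 0 is the root, and new nodes receive
    the next unused integer id.
    """
    path = [0]
    for _ in range(depth - 1):
        kids = children.setdefault(path[-1], {})
        ids = list(kids)
        # Existing branches are weighted by their counts, a new branch by gamma
        weights = np.array([kids[i] for i in ids] + [gamma], dtype=float)
        pick = rng.choice(len(weights), p=weights / weights.sum())
        if pick == len(ids):
            new_id = 1 + sum(len(v) for v in children.values())
            kids[new_id] = 0
            ids.append(new_id)
        kids[ids[pick]] += 1
        path.append(ids[pick])
    return path

rng = np.random.default_rng(0)
levels = sample_level_distribution(m=0.5, b=100.0, depth=4, rng=rng)
tree = {}
paths = [sample_ncrp_path(tree, gamma=0.1, depth=4, rng=rng) for _ in range(20)]
```

With a small *γ*, most documents share a few paths, which is what restricts each document to a compact set of topics.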

## 4 Gaussian hierarchical latent Dirichlet allocation

### 4.1 Mutual exclusivity

The problem with GLDA and CGTM can be clarified by considering the sampling equations of GLDA (i.e., Eq 1) and CGTM (i.e., Eq 2). Two observations are worth mentioning. First, the only difference between Eqs 1 and 2 is the first term on the right-hand side of each equation, which corresponds to the probability of a topic given a document (Eq 1) and the probability of a topic given a document with correlation (Eq 2).

Second, although the first term on the right-hand side of the sampling equation can vary at most in the order of among the topics, the second term is a multivariate probability density function that can vary much more widely. The order of variability of the distribution among the topics widens when the data points in the word embedding that we want to cluster are multimodal, thereby ensuring each centroid of the Gaussian mixture to be placed in distinct positions in the word embedding space. Similar words in a word embedding space tend to cluster together, which makes word embeddings far from unimodal. This condition results in the second term outweighing the first term, and mutual exclusivity is likely to be unintentionally recovered in GLDA and CGTM.

### 4.2 Gaussian hierarchical latent Dirichlet allocation

To create a mixed membership model with no mutual exclusivity constraint, even when multivariate Gaussian distributions are used, we need to go beyond merely sampling topic assignments for each word position in the corpus and restrict the set of topics that can be used to represent a given document. By doing so, when a topic such as “finance, bank, loan” appears in a document, we can only use a particular topic such as “banks, ratio, interest,” without being able to sample from all the available topics. This restriction ensures that mutual exclusivity is not imposed and, as a bonus, can be used to capture the correlation among topics. One straightforward approach to add this constraint is via hierarchical topic modeling, as in [1, 8]. In the hierarchical construction, topics are ordered according to the level of abstraction from top to bottom. Path *c*_{d} is used to characterize the topics that can be used in a document *d*, and each word position in a document has a level assignment *z*_{d,n} that captures the level at which the word is sampled.

The generating process of GhLDA is as follows:

- (1) For all topics *k*, sample (*μ*_{k}, Σ_{k}) ∼ *NIW*(*u*, Ψ, *v*, *κ*).
- (2) For each document *d*,
  - (a) sample a path assignment *c*_{d} ∼ *nCRP*(*γ*);
  - (b) sample a distribution over the levels of the path, *l*_{d} ∼ *GEM*(*m*, *b*); and
  - (c) for all word positions in *d*, first choose the level assignments *z*_{d,n} ∼ *Mult*(*l*_{d}) and then the resulting words from the topic at level *z*_{d,n} in the path, *w*_{d,n} ∼ *N*(*μ*_{c_{d}[z_{d,n}]}, Σ_{c_{d}[z_{d,n}]}).


### 4.3 Gibbs sampling algorithm

We need to sample both the path assignments for all documents *d* and the level assignments *z*_{d,n} for all word positions. The Gibbs sampling algorithm is as follows:

- (1) For each document *d*, first sample path assignment *c*_{d} ∼ *p*(*c*_{d}|*w*, *c*_{−d}, *z*, *H*) *p*(*w*_{d}|*c*, *w*_{−d}, *z*, *H*); and
- (2) for all word positions in *d*, sample level assignments *p*(*z*_{d,n}|*z*_{−(d, n)}, *c*, *w*, *H*) ∝ *p*(*z*_{d,n}|*z*_{d,−n}, *H*) *p*(*w*_{d,n}|*z*, *c*, *w*_{−(d, n)}, *H*),

where *H* is the set of hyperparameters in the model. The probability of a path is the product of the prior on paths defined by *nCRP*(*γ*) (i.e., *p*(*c*_{d}|*w*, *c*_{−d}, *z*, *H*)) [8], and the probability of a word given a specific path, which is (3) where denotes the number of documents assigned to path *c*, excluding *d*, *s*_{c[l]} denotes the set of word positions assigned to topic *c*[*l*], *t*_{l} denotes the set of word positions assigned to level *l* in *d*, and Γ_{d} denotes the multivariate gamma function. The probability of a level is defined as (4)
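As a point of reference for the level update, the prior factor *p*(*z*_{d,n}|*z*_{d,−n}, *H*) under a *GEM*(*m*, *b*) stick-breaking prior has the closed form below, as in hLDA [8]. This is a sketch under assumed notation, not a verbatim copy of Eq (4): $\#[\cdot]$ counts the word positions in *d*, excluding *n*, that satisfy the stated condition.

```latex
p(z_{d,n} = i \mid z_{d,-n}, m, b)
  = \frac{m\,b + \#[z_{d,-n} = i]}{b + \#[z_{d,-n} \ge i]}
    \prod_{j=1}^{i-1}
    \frac{(1 - m)\,b + \#[z_{d,-n} > j]}{b + \#[z_{d,-n} \ge j]}
```

The first ratio is the posterior mean of the stick at level *i*, and the product is the posterior mean of the mass passed down by every level above it. In a Gaussian model, this prior would then be multiplied by the posterior predictive density of *w*_{d,n} under the topic at level *i* of path *c*_{d}.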

The qualitative characteristics of LDA, hLDA, GLDA, CGTM, and GhLDA are summarized in Table 1. Pruning implies the necessity to prune highly frequent words, such as stop words, from the corpus. Whereas LDA fails to provide interpretable topics without pruning, all the other models handle this with ease. The column name “Polysemy” implies the ability to capture polysemy. Further qualitative analysis of GLDA and CGTM is described in Section 5. Correlation implies capturing the co-occurrence of topics in a document and embedding means the use of pre-trained word embedding vectors.

### 4.4 Complexity analysis

We compare the running time complexity of all the models. Because hLDA, CGTM, and GhLDA include steps that require us to sample document-level parameters using all the words that appear in a document, we focus on the running time complexity to sample all assignments for a given document *d*. Table 2 summarizes the time complexities. Each sampling step in GLDA requires us to evaluate the determinant and inverse of the posterior covariance matrix, which is cubic. However, as indicated by [10], this can be reduced to *O*(*M*^{2}) using the Cholesky decomposition of a covariance matrix. Because each word position has *K* topics to consider, and there are *N*_{d} words in a document, the total time complexity of GLDA is *O*(*N*_{d}*KM*^{2}). LDA does not require us to calculate the inverse of the posterior covariance matrix, which makes the time complexity *O*(*N*_{d}*K*). For each document, CGTM requires the sampling of document-level parameters *η*_{d} and λ_{d}. This step adds another *O*(*K*^{3}) to the complexity.
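The density evaluation discussed above can be sketched in Python as follows. The example assumes that the posterior predictive of a Gaussian topic with a normal inverse Wishart prior is a multivariate Student-t distribution, and it obtains both the log-determinant and the quadratic form from a single Cholesky factor. The function names and the arguments `u_k`, `psi_k`, `kappa_k`, and `v_k` are hypothetical. For brevity the factor is recomputed from scratch here, which is cubic in *M*; the *O*(*M*^{2}) cost quoted from [10] relies on updating the factor incrementally, which this sketch does not implement.

```python
import numpy as np
from math import lgamma, log, pi

def student_t_logpdf(x, mean, scale, dof):
    """Log-density of a multivariate Student-t with scale matrix `scale`."""
    dim = x.shape[0]
    chol = np.linalg.cholesky(scale)  # scale = chol @ chol.T
    # Solve chol @ y = x - mean, so that y @ y is the quadratic form.
    # np.linalg.solve is a general solver; a triangular solve would be cheaper.
    y = np.linalg.solve(chol, x - mean)
    quad = float(y @ y)
    # log|scale| is twice the sum of the log-diagonal of the factor
    log_det = 2.0 * float(np.log(np.diag(chol)).sum())
    return (lgamma((dof + dim) / 2.0) - lgamma(dof / 2.0)
            - 0.5 * dim * log(dof * pi)
            - 0.5 * log_det
            - 0.5 * (dof + dim) * float(np.log1p(quad / dof)))

def posterior_predictive_logpdf(x, u_k, psi_k, kappa_k, v_k):
    """Student-t predictive implied by NIW posterior parameters (assumed form)."""
    dim = x.shape[0]
    dof = v_k - dim + 1.0
    scale = psi_k * (kappa_k + 1.0) / (kappa_k * dof)
    return student_t_logpdf(x, u_k, scale, dof)

# One-dimensional sanity check: with one degree of freedom this is a Cauchy
cauchy_at_zero = student_t_logpdf(np.array([0.0]), np.array([0.0]), np.eye(1), 1.0)
```

Working in log space avoids underflow when the density is evaluated for many topics and then normalized.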

Compared with these models, GhLDA first evaluates the posterior predictive probability for all paths. The straightforward calculation results in *O*(*PLM*^{2}), where *P* denotes the number of paths and *L* denotes the maximum depth among all paths. However, by exploiting the tree structure, we can reduce the calculation to *O*(*KM*^{2}). After sampling the path, GhLDA proceeds to sample levels for each word position in a document. Because each path has at most *L* topics, sampling level assignments for all words in a document takes *O*(*N*_{d}*LM*^{2}). Adding both steps leads to *O*(*KM*^{2} + *N*_{d}*LM*^{2}) in total. Similar arguments can be used to calculate the time complexity of hLDA, which is *O*(*K* + *N*_{d}*L*).

A few points are worth mentioning. All the models that use word embedding vectors are much slower than their plain counterparts because of the additional step of computing the Cholesky decomposition. However, comparing GLDA and GhLDA, we can see that GhLDA does not necessarily increase the time complexity compared with GLDA. If *K* + *N*_{d}*L* ≤ *N*_{d}*K*, the time complexity of GhLDA is lower than that of GLDA. This is indeed a reasonable scenario. For instance, assume that there are 100 words in a document *d* (i.e., *N*_{d} = 100). Whereas GLDA with *K* = 20 leads to *N*_{d} × *K* = 2,000, GhLDA with the branch structure of [1, 1, 4, 4] (i.e., *K* = 22 and *L* = 4) results in *K* + *N*_{d}*L* = 422. Admittedly, this argument does not take into account the number of iterations required for collapsed Gibbs sampling to converge. However, it still highlights the fact that the time complexity of GhLDA is not necessarily worse than that of GLDA.
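The arithmetic in the worked example above can be reproduced with a few lines of Python. The sketch assumes that each entry of the branch structure gives the number of children per node at the corresponding level, so that the number of topics at each level is a running product; the helper name `nodes_per_level` is hypothetical. The common factor *M*^{2} is omitted because it appears in both costs.

```python
from itertools import accumulate
from operator import mul

def nodes_per_level(branching):
    """Number of topics at each level of a tree with the given branch structure."""
    return list(accumulate(branching, mul))

branching = [1, 1, 4, 4]
level_sizes = nodes_per_level(branching)   # [1, 1, 4, 16]
k_tree = sum(level_sizes)                  # 22 topics in the tree
depth = len(branching)                     # L = 4
n_d = 100                                  # words in the document

glda_cost = n_d * 20                       # N_d * K with K = 20 -> 2000
ghlda_cost = k_tree + n_d * depth          # K + N_d * L -> 422
```

The gap widens as documents get longer, because the per-word cost of GhLDA scales with the path depth *L* rather than with the total number of topics.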

## 5 Experiments

### 5.1 Datasets

We conducted experiments using three open datasets, which were all included in our source code. One of the datasets (i.e., Wikipedia) was assembled particularly for the bank polysemy capturing task. We summarize the datasets below.

- The Wikipedia dataset, abbreviated as Wiki in the table, is a dataset particularly assembled for the bank polysemy capturing task. The corpus was created from DBpedia-2016 long abstract data [27]. Each long abstract in the DBpedia dataset has several labels that are attached to classify each article. We focused on the following six categories: “Rivers,” “Banks/Financial,” “Military,” “Law,” “Mathematical,” and “Football.” We sampled evenly from these categories to create a corpus of 6,000, of which 5,000 were used for training and 1,000 for testing. The main feature of this dataset is the inclusion of the “Rivers” and “Banks/Financial” categories. By randomly sampling from these categories, we created a corpus that used “bank” both as a financial institution and a steep place near a river. We used words that appeared more than 50 times in the corpus, and did not remove stop words, as in hLDA [8]. We further focused on words that appeared in all the pre-trained word embeddings described below.
- Amazon review data is a dataset of gathered ratings and review information [28] (The entire dataset is available at http://jmcauley.ucsd.edu/data/amazon/). We sampled evenly from the following five categories: “Electronics,” “Video Games,” “Home and Kitchen,” “Sports and Outdoors,” and “Movies and TV,” and created a corpus of 6,000, of which 5,000 were used for training and 1,000 for testing. The other settings were the same as above.
- Reuters data is a news dataset web-scraped from Reuters news. We collected 6,000 news stories during the period Jan 2016 to Feb 2016, of which 5,000 were used for training and 1,000 for testing. The other settings were the same as above.

For pre-trained word embedding vectors, we used the GloVe (50 dimension) [29], word2vec (300 dimension) [30], and fasttext (300 dimension) [31] word embedding vectors. Hence, in total, we had nine settings for models using word embeddings.

### 5.2 Settings

We compared GhLDA with LDA, hLDA [8], GLDA [10], and CGTM [14]. For the topic coherence and predictive held-out likelihood experiments, the number of topics for LDA, GLDA, and CGTM was fixed to 40. For our qualitative analysis, we also considered the case of 20 topics.

The hyperparameters that governed the topic distributions were set to *α* = 0.1, *β* = 0.1 for LDA, and *v* = 0.1, *κ* = 0.1, Ψ_{glove} = 50 * *I*, Ψ_{word2vec} = 40 * *I*, Ψ_{fasttext} = 20 * *I* for GLDA and CGTM, where *I* denotes an identity matrix. We ran the sampler for 50 epochs for these models, where one epoch was equal to sampling all the word positions in the corpus once. The hyperparameters controlling *GEM* and *nCRP* were set to *m* = 0.5, *b* = 100, *γ* = 0.1 similar to [8]. The initial tree structure of hLDA and GhLDA was set to [1, 1, 4, 4], where each number corresponds to the number of branches at each level. In hLDA, *η* was set to vary among the levels as [2, 1, 0.5, 0.25]. A similar strategy was used in GhLDA, where we adjusted Ψ to vary among the levels in the ratio [1, 0.8, 0.6, 0.4], where the top level was identical to GLDA. We truncated the tree at level four, as in [8]. For GhLDA, we further ran the sampler without adding any leaves for five epochs. For the initial level assignments, half of the assignments were chosen by dividing the cumulative distribution function of word frequency into four segments and assigning from top to bottom according to the segments. The other half was chosen randomly. These additional steps were performed to stabilize the learning of the Gaussian mixture components. We ran the sampler for 100 epochs.

### 5.3 Capturing polysemy

We compare the models’ ability to capture polysemy, paying particular attention to the term “bank(s),” using the Wikipedia dataset. We use GloVe as a case study; the other word embeddings provide similar results. First, as shown in Table 3, in topics trained using GLDA with *K* = 20, topic 10 included terms related to finance, such as “financial,” “banking,” and “central,” and topic 13 contained terms related to the river, such as “creek,” “lake,” and “water.” However, not a single “bank” or “banks” that appeared in the corpus was assigned to the river topic (i.e., topic 13), and all these words were assigned to the finance topic (i.e., topic 10). Similar observations were made, even when *K* was increased to 40. In this case, we could see terms related to finance, such as “financial,” “market,” and “management,” in topic 21, and terms related to the river, such as “river,” “creek,” and “flows,” in topic 0. However, topic 0 also contained financial terms, such as “investment,” “credit,” and “exchange;” hence, the topic was inappropriately mixed. This observation implies that although increasing the number of topics softens the constraint on mutual exclusivity, it does not improve the ability to capture polysemy. Similar observations hold for CGTM as well.

By contrast, GhLDA can capture the polysemy of “bank(s).” As shown in Fig 2, path [0-1-2-6] is related to the river and path [0-1-3-10] is related to finance. Although all uses of “bank” in the Wikipedia dataset were assigned to topic 1 because of the high frequency of the word, the meaning could be discerned from the path assignment. Moreover, “banks” in the dataset were assigned to the correct topic (i.e., either topic 3 or 6) in terms of the label of the documents (i.e., we utilized the “Rivers” and “Banks/Financial” categories explained in the dataset section). Hence, we observe that GhLDA can distinguish “bank(s)” polysemy.

The inability to capture polysemy in GLDA is further illustrated using low-dimensional representations. Fig 3 shows each word’s assignment of topics 10 and 13 in addition to their two-dimensional representation using t-SNE [32]. We can see that “bank(s)” is far apart from terms related to the river, and “bank(s)” is never assigned to the topic about rivers. As a comparison, Fig 4 shows the two-dimensional representation of GhLDA. We can see that although “banks” is surrounded by terms that relate to finance, “banks,” which is located in the upper right of the figure, is also assigned to a path that refers to rivers, showing that GhLDA can capture polysemy. Similar observations hold for other words, such as “law” and “order” (i.e., paths [0-1-5-15] and [0-1-5-21]), as well.

We further note that, because our corpus does not exclude highly frequent terms as in [8], LDA cannot capture polysemy well (i.e., Table 3) because the topics are contaminated with stop words, and it is difficult to distinguish the difference between topics.

### 5.4 Comparison with hLDA

In this section, we mainly focus on the difference between GhLDA and hLDA. As shown in Fig 5, the main difference lies in the hierarchical structure learned by the two models. On the Wikipedia dataset, hLDA estimates 54 paths and 83 topics, whereas GhLDA estimates 10 paths and 16 topics, which shows that hLDA tends to have a higher number of paths and topics than GhLDA. Both GhLDA and hLDA have paths for finance (e.g., [0-1-3-10], [0-1-4-46], [0-1-45-65]) and the river (e.g., [0-1-2-6]), thus capturing the polysemy of words (i.e., Table 1). However, too many paths in hLDA cause crucial redundancy. For instance, there are seven paths related to finance that sometimes have no apparent distinction between them (e.g., [0-1-45-65] and [0-1-5-51]). This kind of redundant topic appears in topic models with a relatively large number of topics. We will show in the next section that this redundancy hurts the coherence of topics.

### 5.5 Topic coherence

We calculated the topic coherence score [11, 13, 33] using Palmetto [12] to check how coherently each model generates topics. We computed the average topic coherence score using the basic pointwise mutual information (PMI) measure, focusing on the top 10 words. Table 4 summarizes the results. First, we see that models using word embeddings tend to outperform their counterparts without word embeddings (LDA and hLDA). Among the models that use word embeddings, GhLDA was the best model, except on the Reuters dataset.
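To make the PMI measure concrete, the following Python sketch averages the pairwise PMI over the top words of a topic. It is a simplified illustration and not Palmetto's implementation: probabilities are estimated here from document-level co-occurrence in a toy corpus, the smoothing constant `eps` is an assumed value, and every top word is assumed to occur in at least one document.

```python
import math
from itertools import combinations

def average_pmi(top_words, documents, eps=1e-12):
    """Mean pairwise PMI of `top_words`, using document co-occurrence counts."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for first, second in combinations(top_words, 2):
        joint = prob(first, second) + eps  # eps keeps the logarithm finite
        scores.append(math.log(joint / (prob(first) * prob(second))))
    return sum(scores) / len(scores)

toy_corpus = [
    ["river", "bank", "water"],
    ["bank", "loan"],
    ["river", "water"],
    ["loan", "interest"],
]
coherent = average_pmi(["river", "water"], toy_corpus)
incoherent = average_pmi(["river", "loan"], toy_corpus)
```

Word pairs that never co-occur in the reference counts receive a strongly negative score, which is one reason why topics built around newly emerging word combinations can be penalized by this measure.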

However, the topics learned from the Reuters dataset using GhLDA were by no means worse than their GLDA and CGTM counterparts. For instance, in GhLDA-word2vec, there were topics such as “trump, republican, coal, party, workers, house, debate, school, bill, bankruptcy,” which reflects the news that Trump made a promise to coal miners during his campaign, and “vehicles, water, vw, flint, safety, cars, emissions, volkswagen, detroit, filed,” which reflects the news about Volkswagen’s diesel cars and the tap water problem in Flint. Although these news topics were widely reported during the period in which the news dataset was collected, the PMIs of the topics were -1.36 and -2.79, respectively, which shows the limitations of Palmetto for evaluating new combinations of words correctly.

Furthermore, even though they connect to real-world news, neither Trump nor Volkswagen appeared in the top 15 words of the 40 topics learned from GLDA-word2vec and CGTM-word2vec. Topics in GLDA were much more general, such as “rate, dollar, assets, buy, goal, drop” and “government, end, federal, chinese, countries,” and do not take into account the word co-occurrence patterns of the corpus that we wish to analyze. As the Reuters examples suggest, even when the underlying word embedding is not in line with the corpus, the added flexibility of our model identifies critical topics that both GLDA and CGTM fail to identify. This observation further highlights the benefit of our model.

### 5.6 Quantitative comparison

We further used the predictive held-out likelihood to quantitatively compare our models, as in [8]. We evaluated the probability of the held-out dataset using the 1,000 test documents described in the dataset section. [8] used the harmonic mean [34] to evaluate the held-out likelihood. However, [35, 36] showed that the harmonic mean method is biased. Hence, we used the left-to-right sequential sampler [36], which estimates the quantity given in Eq (5). To make a fair comparison of the models, we evaluated it for all the models using the topic assignments derived from each model and assessed the likelihood.

Since we could not compare models with different embeddings, we focused on comparing LDA with hLDA and the best-performing model for each word embedding. Table 5 summarizes the results. First, comparing hLDA with LDA, the former is the clear winner. Secondly, CGTM beat GLDA significantly because of the additional correlation structure. Finally, GhLDA is the best model, outperforming all the other models for each word embedding.
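As an illustration of the left-to-right idea behind the held-out likelihood estimates above, the following is a minimal sketch for a plain LDA model with a fixed topic-word matrix. It is not the authors' implementation and does not cover the Gaussian or hierarchical models. The function name, the argument names (`phi`, `alpha`, `num_particles`), and the toy inputs are assumptions made for this example, and every column of `phi` is assumed to contain at least one positive entry.

```python
import numpy as np


def left_to_right_log_likelihood(doc, phi, alpha, num_particles=20, seed=0):
    """Estimate log p(doc | phi, alpha) word by word with particles.

    doc   : list of word ids
    phi   : (num_topics, vocab_size) topic-word probabilities (assumed fixed)
    alpha : (num_topics,) Dirichlet prior on document-topic proportions
    """
    rng = np.random.default_rng(seed)
    phi = np.asarray(phi, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    num_topics = phi.shape[0]
    alpha_sum = alpha.sum()

    # z[r, n]: topic of word n in particle r; counts[r, t]: topic usage so far.
    z = np.zeros((num_particles, len(doc)), dtype=int)
    counts = np.zeros((num_particles, num_topics))

    log_lik = 0.0
    for n, w in enumerate(doc):
        p_n = 0.0
        for r in range(num_particles):
            # Resample the topics of all earlier words in this particle.
            for m in range(n):
                counts[r, z[r, m]] -= 1
                probs = phi[:, doc[m]] * (counts[r] + alpha)
                z[r, m] = rng.choice(num_topics, p=probs / probs.sum())
                counts[r, z[r, m]] += 1
            # Joint probability of word n and each topic given earlier topics.
            joint = phi[:, w] * (counts[r] + alpha) / (n + alpha_sum)
            p_n += joint.sum()
            # Extend the particle with a topic for word n.
            z[r, n] = rng.choice(num_topics, p=joint / joint.sum())
            counts[r, z[r, n]] += 1
        log_lik += np.log(p_n / num_particles)
    return log_lik


# Toy example (hypothetical numbers): two topics over a three-word vocabulary.
phi_toy = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
ll_toy = left_to_right_log_likelihood([0, 0, 2, 1], phi_toy, alpha=[0.5, 0.5])
```

With a single topic, the topic proportions cancel and the estimate reduces to the sum of the log word probabilities under that topic, which gives a simple sanity check for the sketch.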

## 6 Conclusion

In this paper, we proposed Gaussian hLDA, which significantly improves the capture of polysemy compared with GLDA and CGTM. Our model learns the underlying topic distribution and hierarchical structure among topics simultaneously, which can be further used to understand the correlation among topics. We demonstrated the validity of our approach using three real-world datasets.

More recent deep learning models, such as transformer-based models [37], capture polysemy using self-attention and demonstrate high performance in many tasks. Nevertheless, there are several avenues for future work based on this paper. Combining word embeddings with topic models is just one of the ways in which these two domains could interact. Embeddings created from knowledge graphs or even causal bank [38] are other ways such an intersection could occur. Our hybrid modeling approach contributes to creating new modeling approaches in these areas. Moreover, network modeling itself is another exciting area where our idea might be helpful. Network modeling has a long tradition on both sides of the spectrum. It might be interesting to see how the ideas in this manuscript could be extended to network modeling. We leave this work for the future.

## Acknowledgments

We would like to thank Tsutomu Watanabe, Takaaki Ohnishi, and Hiroshi Iyetomi for vigorous discussions.

## References

- 1. Blei DM, Jordan MI, Griffiths TL, Tenenbaum JB. Hierarchical Topic Models and the Nested Chinese Restaurant Process. In: Proceedings of the 16th International Conference on Neural Information Processing Systems. NIPS’03. Cambridge, MA, USA: MIT Press; 2003. p. 17–24.
- 2. O’Neill J, Robin C, O’Brien L, Buitelaar P. An Analysis of Topic Modelling for Legislative Texts. In: ASAIL@ICAIL; 2016.
- 3. Wen S, Zhao Z, Yan H. Detecting Malicious Websites in Depth through Analyzing Topics and Web-Pages. In: Proceedings of the 2nd International Conference on Cryptography, Security and Privacy. ICCSP 2018. New York, NY, USA: Association for Computing Machinery; 2018. p. 128–133.
- 4. Bongini P, Osborne F, Pedrazzoli A, Rossolini M. A topic modelling analysis of white papers in security token offerings: Which topic matters for funding? Technological Forecasting and Social Change. 2022;184:122005.
- 5. Obot N, O’Malley L, Nwogu I, Yu Q, Shi WS, Guo X. From Novice to Expert Narratives of Dermatological Disease. In: 2018 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops); 2018. p. 131–136.
- 6. Lauritzen SL. Graphical Models. Oxford University Press; 1996.
- 7. Blei DM, Lafferty JD. Correlated Topic Models. In: Proceedings of the 18th International Conference on Neural Information Processing Systems. NIPS’05. Cambridge, MA, USA: MIT Press; 2005. p. 147–154.
- 8. Blei DM, Griffiths TL, Jordan MI. The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies. J ACM. 2010;57(2).
- 9. Blei DM, Lafferty JD. Dynamic Topic Models. In: Proceedings of the 23rd International Conference on Machine Learning. ICML’06. New York, NY, USA: Association for Computing Machinery; 2006. p. 113–120.
- 10. Das R, Zaheer M, Dyer C. Gaussian LDA for Topic Models with Word Embeddings. In: ACL (1). The Association for Computer Linguistics; 2015. p. 795–804.
- 11. Newman D, Lau JH, Grieser K, Baldwin T. Automatic Evaluation of Topic Coherence. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. HLT’10. Stroudsburg, PA, USA: Association for Computational Linguistics; 2010. p. 100–108.
- 12. Röder M, Both A, Hinneburg A. Exploring the Space of Topic Coherence Measures. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. WSDM’15. New York, NY, USA: ACM; 2015. p. 399–408.
- 13. Chang J, Gerrish S, Wang C, Boyd-Graber JL, Blei DM. Reading Tea Leaves: How Humans Interpret Topic Models. In: Bengio Y, Schuurmans D, Lafferty JD, Williams CKI, Culotta A, editors. Advances in Neural Information Processing Systems 22. Curran Associates, Inc.; 2009. p. 288–296.
- 14. Xun G, Li Y, Zhao WX, Gao J, Zhang A. A Correlated Topic Model Using Word Embeddings. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. IJCAI’17. AAAI Press; 2017. p. 4207–4213.
- 15. Batmanghelich K, Saeedi A, Narasimhan K, Gershman S. Nonparametric Spherical Topic Modeling with Word Embeddings. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Berlin, Germany: Association for Computational Linguistics; 2016. p. 537–542.
- 16. Hu W, Tsujii J. A Latent Concept Topic Model for Robust Topic Inference Using Word Embeddings. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Berlin, Germany: Association for Computational Linguistics; 2016. p. 380–386.
- 17. Steyvers M, Griffiths T. Probabilistic Topic Models. In: Landauer T, McNamara D, Dennis S, Kintsch W, editors. Latent Semantic Analysis: A Road to Meaning; 2006.
- 18. Griffiths TL, Steyvers M. Finding scientific topics. Proceedings of the National Academy of Sciences. 2004;101(Suppl. 1):5228–5235. pmid:14872004
- 19. Petterson J, Buntine W, Narayanamurthy SM, Caetano TS, Smola AJ. Word Features for Latent Dirichlet Allocation. In: Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A, editors. Advances in Neural Information Processing Systems 23. Curran Associates, Inc.; 2010. p. 1921–1929.
- 20. Nguyen DQ, Billingsley R, Du L, Johnson M. Improving Topic Models with Latent Feature Word Representations. Transactions of the Association for Computational Linguistics. 2015;3:299–313.
- 21. Dieng AB, Ruiz FJR, Blei DM. Topic Modeling in Embedding Spaces; 2019.
- 22. Dieng AB, Ruiz FJR, Blei DM. The Dynamic Embedded Topic Model; 2019.
- 23. Xu H, Wang W, Liu W, Carin L. Distilled Wasserstein Learning for Word Embedding and Topic Modeling. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in Neural Information Processing Systems 31. Curran Associates, Inc.; 2018. p. 1716–1725.
- 24. Pitman J. Combinatorial stochastic processes. vol. 1875 of Lecture Notes in Mathematics. Berlin: Springer-Verlag; 2006.
- 25. Polson N, Scott J, Windle J. Bayesian Inference for Logistic Models Using Polya-Gamma Latent Variables. Journal of the American Statistical Association. 2012;108.
- 26. Makalic E, Schmidt D. High-Dimensional Bayesian Regularised Regression with the Bayesreg Package; 2016.
- 27. Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Ives Z. DBpedia: A Nucleus for a Web of Open Data. In: Proceedings of the 6th International The Semantic Web and 2nd Asian Conference on Asian Semantic Web Conference. ISWC’07/ASWC’07. Berlin, Heidelberg: Springer-Verlag; 2007. p. 722–735.
- 28. McAuley J, Targett C, Shi Q, van den Hengel A. Image-Based Recommendations on Styles and Substitutes. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR’15. ACM; 2015. p. 43–52.
- 29. Pennington J, Socher R, Manning CD. Glove: Global Vectors for Word Representation. In: EMNLP. vol. 14; 2014. p. 1532–1543.
- 30. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed Representations of Words and Phrases and their Compositionality. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in Neural Information Processing Systems 26; 2013. p. 3111–3119.
- 31. Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606. 2016.
- 32. van der Maaten L, Hinton GE. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research. 2008;9:2579–2605.
- 33. Churchill R, Singh L. The evolution of topic modeling. ACM Computing Surveys. 2022;54(10s):1–35.
- 34. Kass RE, Raftery AE. Bayes Factors. Journal of the American Statistical Association. 1995;90(430):773–795.
- 35. Wallach HM, Murray I, Salakhutdinov R, Mimno D. Evaluation Methods for Topic Models. In: Proceedings of the 26th Annual International Conference on Machine Learning. ICML’09. New York, NY, USA: Association for Computing Machinery; 2009. p. 1105–1112.
- 36. Buntine WL. Estimating Likelihoods for Topic Models. In: Zhou ZH, Washio T, editors. ACML. vol. 5828 of Lecture Notes in Computer Science. Springer; 2009. p. 51–64.
- 37. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; 2018. Available from: http://arxiv.org/abs/1810.04805.
- 38. Li Z, Ding X, Liu T, Hu JE, Van Durme B. Guided Generation of Cause and Effect. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20; 2020.