Exploratory mapping of tumor associated macrophage nanoparticle article abstracts using an eLDA topic modeling machine learning approach

doi:10.1371/journal.pone.0304505

Fig 1.

Diagram of the scoping review methodology used to obtain the dataset.

Flow chart outlining the number of records at each stage, from article identification through screening to inclusion in the final dataset used in the eLDA topic model. Of the 28,217 articles identified, 854 met all screening criteria and were included in the topic model. Ninety-five newly identified records were used in the Living Review model. Adapted from [19].


Fig 2.

Characterizing the dataset.

[a] Distribution of the number of authors per document in the dataset. [b] Number of documents in the dataset over time. Gray lines indicate important advancements in the field. [c] Most common journals in the dataset. [d] Distribution of journal topics included in the dataset. Breakdown of the journal scopes within [e] physical and engineering journals and [f] biological journals.


Fig 3.

Workflow for creation of eLDA topic model.

The document set first underwent multiple stages of preprocessing: tokenization, in which documents were divided into lists of tokens; stop word removal, in which common words were removed; and stemming, in which word suffixes were removed. The topic model was then created using the ensemble LDA algorithm [4, 16]. This algorithm trains multiple topic models [4] and clusters the resulting topics using a version of the DBSCAN algorithm to differentiate between stable and noisy topics [16]. The model was built with the Gensim ensembleLDA module [15].
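The three preprocessing stages named in this caption can be illustrated with a minimal Python sketch. It uses only the standard library, and the stop word list and suffix-stripping rule are toy assumptions standing in for whatever stop word list and stemmer the study actually used; it is not the authors' pipeline.

```python
import re

# Illustrative stop word list (an assumption; the study's actual list is not given here).
STOP_WORDS = {"the", "of", "in", "and", "a", "to", "were", "was", "for", "with", "by"}

# Toy suffix list standing in for a real stemmer such as Porter's (an assumption).
SUFFIXES = ("ing", "es", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(document):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [stem(t) for t in remove_stop_words(tokenize(document))]


example = "Nanoparticles were targeting the macrophages in tumors"
tokens = preprocess(example)
print(tokens)  # ['nanoparticl', 'target', 'macrophag', 'tumor']
```

Token lists of this kind would typically be converted to a bag-of-words corpus before being passed to a topic model; that step, and the ensemble training and clustering, are handled by the Gensim module cited in the caption and are not reproduced here.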


Fig 4.

Creating the ensemble LDA topic model.

[A] Intertopic Distance Map of LDA topics. The size of each circle represents the relative number of articles within that topic, while the distance between circles represents the similarity between topics. The marginal topic distribution indicates the percentage of the dataset that belongs to each topic. [B] Most Salient Terms. These are the thirty most frequent terms used to create the topic model. This figure was generated using the pyLDAvis library for topic model visualization in Python [25].


Fig 5.

Dataset by word categories and topic names.

[a] Word clouds by topic. Word size was determined by word frequency within the dataset, while word color was determined by the expert-assigned category. [b] Heat map of word categories within each topic by weight, determined by frequency within each topic.


Fig 6.

Living scoping review.

[a] Schematic of the process used to generate a living scoping review. [b] Table comparing the percentages of papers assigned to each topic in the original dataset and in the newly acquired papers that form the Living Review. [c] Word cloud of the most significant words by topic for papers included in the Living Review. Word size was determined by word frequency within the dataset, while word color was determined by category.
