Text mining of CHO bioprocess bibliome: Topic modeling and document classification

doi:10.1371/journal.pone.0274042

Fig 1.

Overview of the CHO bibliome processing and analysis workflow.

Fig 2.

LDA topic categorization of the CHO bibliome: distribution of bioprocessing documents across the nine topics discovered by the LDA model.

Fig 3.

Comparison between automatically generated LDA topics and manually assigned categories.

(A) Distribution of human-annotated categories among computer-generated LDA topics. (B) Distribution of the top four LDA topics in manual categories. (C) Distribution of the top four manual categories in LDA topics.

Fig 4.

LDA topics with term probability.

(A) The top 30 most frequent terms from the nine LDA topics, with weights. (B) Visualization of the topic modeling results using pyLDAvis. The left panel shows the semantic topic space, where each circle is a single topic and its size represents the topic's importance in the model. The proximity of two circles reflects the semantic similarity of their concepts. The right panel shows the 30 most salient terms for Topic-4. The terms (red bars) are listed in descending order of probability, and the blue bars show each term's frequency across the whole corpus (i.e., each pair of overlaid bars represents both the corpus-wide frequency and the topic-specific frequency of a term). When a term's red bar is almost as long as its blue bar, the term is salient and almost exclusive to that topic.
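
The red-versus-blue bar comparison described in (B) can be read as a simple ratio: a term's count within one topic divided by its count in the whole corpus. The following is a minimal sketch of that reading, not the computation pyLDAvis itself performs; the function name `topic_exclusivity` and all counts are hypothetical and chosen only for illustration.

```python
def topic_exclusivity(topic_freq, corpus_freq):
    """Return, per term, the share of its corpus-wide count that falls in one topic.

    A value near 1.0 corresponds to a red bar almost as long as the blue bar.
    """
    ratios = {}
    for term, in_topic in topic_freq.items():
        total = corpus_freq.get(term, 0)
        if total > 0:  # skip terms with no corpus-wide count
            ratios[term] = in_topic / total
    return ratios


# Hypothetical counts, invented for illustration (not taken from the paper).
corpus_freq = {"glycosylation": 200, "cell": 5000, "sialic": 80}
topic_freq = {"glycosylation": 190, "cell": 500, "sialic": 80}

ratios = topic_exclusivity(topic_freq, corpus_freq)

# List terms from most to least topic-exclusive.
for term in sorted(ratios, key=ratios.get, reverse=True):
    print(f"{term}: {ratios[term]:.2f}")
```

In this toy data a generic term such as "cell" has a low ratio because most of its occurrences fall outside the topic, while "sialic" occurs only within it.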

Table 1.

Logistic regression utilizing binary representation of terms.

Table 2.

Logistic regression utilizing tf-idf with all terms and chosen top terms.

Table 3.

Logistic regression utilizing under-sampling and tf-idf with chosen top terms (fraction of majority set = 0.1).
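
As a rough illustration of under-sampling, the sketch below keeps every minority-class document and a random fraction of the majority-class documents. It assumes that "fraction of majority set = 0.1" means retaining 10% of the majority class, which the caption does not spell out; the function name `undersample_majority`, the binary 0/1 labels, and the toy data are all hypothetical.

```python
import random


def undersample_majority(labels, fraction, seed=0):
    """Return sorted indices of all minority items plus a random
    `fraction` of the majority items (binary 0/1 labels assumed)."""
    counts = {0: labels.count(0), 1: labels.count(1)}
    majority = max(counts, key=counts.get)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    # Keep at least one majority item so the class does not vanish.
    keep = max(1, round(len(majority_idx) * fraction))
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    sampled = rng.sample(majority_idx, keep)
    return sorted(minority_idx + sampled)


# Toy labels: 100 majority documents (0) and 10 minority documents (1).
labels = [0] * 100 + [1] * 10
kept = undersample_majority(labels, fraction=0.1)
print(len(kept), sum(labels[i] for i in kept))
```

The returned indices would then select the rows of the tf-idf matrix and the labels used to fit the classifier.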
