Fig 1.
General workflow for grouping articles into topics and then assessing the validity of the topic and paper assignments.
The analysis comprises four overarching steps: an initial barcode search, a paper search based on the barcode results, a topic model built from the retrieved papers, and validation of the topic model with a random forest model.
Fig 2.
Plate diagram displaying the conditional probabilistic dependencies in Latent Dirichlet Allocation.
The diagram depicts the core of the LDA method. Plate diagrams are graphical representations of probabilistic models with conditional dependence between random variables [37,38]. The boxes are “plates” that represent repeated entities: documents for the outer plate and words for the inner plate. The Greek symbols denote the model's priors and distributions, which are not observed directly, while the Latin letters K, M, and N denote the numbers of topics, documents, and words per document. α is the Dirichlet prior parameter on the per-document topic distributions [39]. β is the Dirichlet prior parameter on the per-topic word distributions, and φ is the word distribution for a single topic k (one of K topics). Θ is the topic distribution for a particular document m (one of M documents), and Z is the topic assigned to the n-th word in document m. These two pathways converge at W, the specific word. W is the only observed variable; every other variable in this process is latent, that is, inferred without supervision.
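The generative process that the plate diagram encodes can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used in the study. The vocabulary, the symmetric priors, the fixed document length N, and the function names are all assumptions made for the example.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocab, K, M, N, alpha, beta, seed=0):
    """Generate M documents of N words each from the LDA generative process.

    alpha and beta are symmetric Dirichlet priors (an assumption; LDA also
    allows asymmetric priors and documents of varying length).
    """
    rng = random.Random(seed)
    V = len(vocab)
    # phi[k]: word distribution for topic k, drawn from Dirichlet(beta)
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]
    docs = []
    for _ in range(M):
        # theta: topic distribution for this document, drawn from Dirichlet(alpha)
        theta = sample_dirichlet([alpha] * K, rng)
        words = []
        for _ in range(N):
            z = sample_categorical(theta, rng)   # latent topic of this word (Z)
            w = sample_categorical(phi[z], rng)  # observed word (W)
            words.append(vocab[w])
        docs.append(words)
    return phi, docs


# Hypothetical toy vocabulary, chosen only for illustration.
phi, docs = generate_corpus(["soil", "root", "wheat", "forest"],
                            K=3, M=5, N=8, alpha=0.5, beta=0.5)
```

Fitting LDA runs this process in reverse: only the words in `docs` are given, and the topic distributions and per-word topic assignments are inferred.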
Table 1.
Classic confusion matrix to visually analyze the classification performance of an algorithm.
Fig 3.
Examination of positive hit papers.
In panel A, papers are graphed according to the genomic region, if any, mentioned in their text, regardless of whether the paper was retrieved via the ITS or 18S region. “Ds” indicates that the paper did not mention a genomic region (i.e., the paper was found via the metabarcoding search and dataset but may not link back to the dataset itself, which can indicate that the paper was attached to the dataset after publication). Finally, papers are grouped by the environment they primarily discussed. In panel B, a reader assigned the overall environment of each paper. Panel C indicates the publisher, and panel D the date of publication, of the papers used in this study.
Fig 4.
Sampling locations were derived for each record.
If coordinates for sample sites were provided in the original paper, those were used. If they were not, the centroid of the most precise geographic entity was used. Panel A shows these data with the symbol size scaled to the number of samples indicated at each location. Panel B shows a kernel density of sample locations. Panel C shows the substrate class at each sample location. Together these locations show a global distribution of C. neoformans. These locations likely reflect sampling bias. Maps were constructed from vector base layers from Natural Earth [45].
Fig 5.
Counts of the most common words across positive and negative hit papers.
Stopwords are excluded. Our results show that “crop”, “miner”, “manag-”, “forest-”, “litter”, “fertil-”, “wheat”, “contamin-”, “amend-”, and “root” have the highest association probabilities in the assignment of articles from groups of positive or negative hit papers to topics. The figure is a sample representation of the most popular words found across the text corpus that were statistically relevant after filtering for occurrences. This reveals a potential association between the most common words and environments that contain soil and rhizosphere. Only words with a count above 1000 are shown.
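The counting step behind such a figure can be sketched as follows. This is a minimal illustration, not the study's pipeline. The stopword list, the regular-expression tokenizer, and the function name are assumptions, and the stemming that produces truncated forms such as “manag-” is omitted.

```python
import re
from collections import Counter

# Illustrative stopword list only; the list used in the study is not given.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "is", "was", "were"}


def count_words(texts, stopwords=STOPWORDS, threshold=1000):
    """Count non-stopword tokens across texts; keep words with count > threshold."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    # most_common() orders the result from most to least frequent
    return {word: n for word, n in counts.most_common() if n > threshold}


# Toy usage with a low threshold so that the tiny example returns something.
frequent = count_words(["The soil and the root", "Soil of the forest"], threshold=1)
```

With the default `threshold=1000`, only words occurring more than 1000 times in the corpus are kept, matching the cutoff stated in the caption.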
Fig 6.
Coherence score as a function of the number of topics for each model.
Coherence and perplexity were calculated for models with different numbers of topics. Coherence is highest at three topics, which indicates that three topics should be used in the model.
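The caption does not state which coherence measure was used. As one possibility, the sketch below computes UMass coherence for a topic's top words from document co-occurrence counts and then selects the number of topics with the highest score. The choice of UMass, the function names, and the scores in the usage line are assumptions for illustration, not results from the study.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic.

    top_words: the topic's words, ordered from most to least probable.
    documents: tokenized documents (lists of words).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # skip words that never occur in the corpus
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / denom)
    return score


def best_num_topics(coherence_by_k):
    """Return the number of topics with the highest coherence score."""
    return max(coherence_by_k, key=coherence_by_k.get)


# Hypothetical scores, for illustration only.
chosen_k = best_num_topics({2: -1.5, 3: -0.9, 4: -1.2})
```

A model-level score is usually the mean of the per-topic coherences, computed once for each candidate number of topics.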
Fig 7.
Word count and word weight for the top lemmatized (truncated) words in each topic, plotted with the matplotlib library.
The most common words in each topic can give insight into what the potential topics are. Topic labels must be inferred by the researchers, together with subject matter experts, from the text corpus given to the model and the words present in each topic.
Fig 8.
Confusion matrix from the random forest classifier, with a heatmap overlay for added visualization.
The top left square indicates the number of true positives. Moving clockwise, the next square shows the false negatives, with the true negatives below it. The last square, in the bottom left corner, shows the false positives.
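The layout described in this caption, and the summary metrics that follow from it, can be sketched in plain Python. This is a minimal illustration with hypothetical labels, not the classifier output from the study. The function names are assumptions.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Return [[TP, FN], [FP, TN]]: rows are actual classes, columns predicted."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return [[tp, fn], [fp, tn]]


def summary_metrics(matrix):
    """Accuracy, precision, recall, and F1 from a [[TP, FN], [FP, TN]] matrix."""
    (tp, fn), (fp, tn) = matrix
    total = tp + fn + fp + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / total, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
matrix = confusion_matrix(actual, predicted)
metrics = summary_metrics(matrix)
```

Note that scikit-learn's `confusion_matrix` sorts labels in ascending order, so for 0/1 labels its default output places true negatives in the top left. Reproducing the layout above requires passing the positive label first in its `labels` argument.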
Fig 9.
Receiver operating characteristic (ROC) curves with area under the curve (AUC), generated by Scikit-plot.
By default, the plotting routine draws three kinds of ROC curve: the curve for each class, the micro-average ROC curve, and the macro-average ROC curve. The micro-average shows that the model predicts better than a random assignment of papers to topics.
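The difference between the per-class, macro-average, and micro-average summaries can be sketched without any plotting library. The example below is an illustration under stated assumptions, not Scikit-plot's code. It computes AUC as the probability that a randomly chosen positive scores higher than a randomly chosen negative, takes the macro-average as the unweighted mean of the per-class AUCs, and computes the micro-average by pooling all one-vs-rest decisions. The labels and scores are hypothetical.

```python
def binary_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def averaged_auc(y_true, y_score, classes):
    """Return (per-class AUCs, macro-average AUC, micro-average AUC).

    y_score[i][j] is the predicted probability that sample i is in classes[j].
    """
    indicators = {c: [1 if y == c else 0 for y in y_true] for c in classes}
    columns = {c: [row[j] for row in y_score] for j, c in enumerate(classes)}
    per_class = {c: binary_auc(indicators[c], columns[c]) for c in classes}
    macro = sum(per_class.values()) / len(classes)
    # Micro-average: pool every (sample, class) decision into one binary problem
    flat_labels = [v for c in classes for v in indicators[c]]
    flat_scores = [v for c in classes for v in columns[c]]
    micro = binary_auc(flat_labels, flat_scores)
    return per_class, macro, micro


# Hypothetical two-class example, for illustration only.
y_true = [0, 1, 1, 0]
y_score = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.6, 0.4]]
per_class, macro, micro = averaged_auc(y_true, y_score, classes=[0, 1])
```

An AUC of 0.5 corresponds to random assignment, so a micro-average above 0.5 is what supports the claim that the classifier performs better than chance.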
Table 2.
Summary of accuracy, precision, recall, F1-score, and ROC AUC score from the random forest classifier.