LDA2Net: Digging under the surface of COVID-19 scientific literature topics via a network-based approach

During the COVID-19 pandemic, the scientific literature related to SARS-CoV-2 has been growing dramatically. These publications encompass a varied set of topics, ranging from vaccination to protective equipment efficacy, as well as lockdown policy evaluations. As a result, the development of automatic methods that allow an in-depth exploration of this growing literature has become a relevant issue, both to identify the topical trends of COVID-related research and to zoom in on its sub-themes. This work proposes a novel methodology, called LDA2Net, which combines topic modelling and network analysis to investigate topics under their surface. More specifically, LDA2Net exploits the frequencies of consecutive word pairs (i.e., bigrams) to build the network structures underlying the hidden topics extracted from large volumes of text by Latent Dirichlet Allocation (LDA). Results are promising and suggest that the efficacy of the topic model is magnified by the network-based representation. In particular, such enrichment is noticeable when it comes to displaying and exploring the topics at different levels of granularity.
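As a toy illustration of the consecutive-word-pair counting that underlies the proposed networks, the following Python sketch (our own illustrative code, not taken from the paper) counts directed bigram edges in a tokenized document:

```python
from collections import Counter

def bigram_edges(tokens):
    """Count directed edges (w1 -> w2) from consecutive word pairs."""
    return Counter(zip(tokens, tokens[1:]))

doc = "vaccine efficacy depends on vaccine dose and vaccine efficacy trials".split()
edges = bigram_edges(doc)
# ("vaccine", "efficacy") occurs twice, so that edge gets weight 2
```

Each edge weight is simply the empirical frequency of the ordered word pair in the document, which is the deterministic ingredient LDA2Net adds on top of the LDA output.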

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.
Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement.

Funding Statement: G.M., M.W., and C.S. acknowledge financial support from the European Union Horizon 2020 project ISEED (Grant Agreement No. 960366). C.S. also acknowledges financial support from PON R&I 2014-2020 (FSE REACT-EU) and from the EU project MUHAI (Grant Agreement No. 951846).
The funders had no role in relation to the data, content and opinions expressed in the article.

To Reviewers
We thank the reviewers for their critical assessment of our work. In the following, we address their concerns point by point. The reviewers' comments have been listed below in italic text, followed by our responses in the standard format. Also, for each comment, if we have modified the manuscript's contents or the appendices, we show the changes in a table after each reply, displaying the original and the updated version.
Reviewer 1

Q 1.1 Your references are out-of-date. You should cite up-to-date studies.
- NET-LDA: a novel topic modeling method based on semantic document similarity
- Concept-LDA: Incorporating Babelfy into LDA for aspect extraction
- Combining semantic graph and probabilistic topic models for discovering coherent topics

Reply: We thank the reviewer for this remark. We added the suggested references (Allahyari et al., 2019; Ekinci and Omurca, 2020; Ekinci and İlhan Omurca, 2020), as well as other recent works that exploit domain knowledge from the semantic web to improve the coherence and interpretability of inferred topics, such as Yao et al. (2017), to the Related Work section of the paper. For the convenience of the reviewer, we show the changes in the manuscript here below:

Page & Line: p. 3, l. 92
The proposed topic model enrichment [...]

Q 1.2 For example, alpha, beta... There is no information about the realization of LDA.
Reply: We thank the reviewer for this suggestion. We have, accordingly, provided further clarifications about the model estimation choices and the parameters used in the estimation process in the Materials and Methods section and in Appendix A.2. In particular, as concerns the alpha and beta parameters, both have been estimated by using the topicmodels R library, which provides an interface to the C code for LDA models by Blei et al. (2003) and the C++ code for fitting LDA models using Gibbs sampling by Xuan-Hieu Phan et al. (https://gibbslda.sourceforge.net/). We do not impose any constraints on the values of these parameters. The alpha parameter value of the LDA model with 120 topics, estimated using Gibbs Sampling, is 0.4166667, while the beta parameter matrix of size [34563x120], containing the logs of the probabilities of words for each topic, is now available in .csv format at the following link: https://drive.google.com/file/d/1awk3UTy7tyJXhKNSve8YQSWdfWQujC4O/view?usp=sharing. In addition, we added a new figure (Fig. 14) displaying the average values of the four criteria used for choosing K.
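The selection of K described in this reply (several seeded runs per candidate K, selection criteria averaged across runs, best K chosen) can be sketched in Python. The fit and scoring functions below are hypothetical stand-ins for the actual LDA fits and the four ldatuning criteria; only the averaging-and-selection loop mirrors the procedure we describe:

```python
import random

def fit_lda(k, seed):
    """Hypothetical stand-in for fitting an LDA model with a given seed."""
    return {"k": k, "rng": random.Random(seed * 1000 + k)}

def criterion(model):
    """Hypothetical selection score (lower is better): distance from an
    assumed optimum near K = 120, plus seed-dependent noise."""
    return abs(model["k"] - 120) / 10 + model["rng"].uniform(-1.0, 1.0)

def choose_k(candidates, n_seeds=5):
    """Average the criterion over several seeded runs and pick the best K."""
    avg = {k: sum(criterion(fit_lda(k, s)) for s in range(n_seeds)) / n_seeds
           for k in candidates}
    return min(avg, key=avg.get)

best_k = choose_k([100, 110, 120, 130, 140])
```

Averaging over seeds damps the run-to-run variability of Gibbs sampling before comparing candidate values of K.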
For the convenience of the reviewer, we show the changes in the manuscript and appendices here below:

Page & Line: p. 4, l. 134
Previous Version: This section outlines the basic concepts about Latent Dirichlet Allocation (LDA) needed for understanding our approach. As LDA analysis is not the crux of this work, we refer the reader to Blei et al. (2003) [...] For each K ∈ {…, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140} we estimated 5 LDA models using the Gibbs Sampling approach with different seeds and 1300 iterations (with n.burnin.iter = 300). Then, four criteria (Arun et al., 2010; Griffiths and Steyvers, 2004; Deveaud et al., 2014; Cao et al., 2009) were employed to choose the optimal value of K. Based on the average values of the considered criteria, K = 120 appeared to be a good candidate number of topics. Finally, we resumed the best run of the LDA estimation obtained for K = 120, running 3000 additional iterations to ensure convergence.
New Version: To select the number of topics K for the LDA model, we employed the method proposed by Nikita (2016) and the R library ldatuning. For each K ∈ {…, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140} we estimated multiple LDA models with different seeds and 1300 iterations (with n.burnin.iter = 300), using the Gibbs Sampling approach. Then, four criteria (Arun et al., 2010; Griffiths and Steyvers, 2004; Deveaud et al., 2014; Cao et al., 2009) were employed to choose the optimal value of K. Based on the average values of the four considered criteria (see Fig. 14), K = 120 appeared to be a good candidate number of topics. Finally, we resumed the best run of the LDA estimation obtained for K = 120, running 3000 additional iterations to ensure convergence. The LDA models, and their parameters, have been estimated using the topicmodels R library, which interfaces with the C code developed by Blei et al. (2003) [...]

Reply: We thank the reviewer for raising this important point. In order to understand the work as a whole, we would like to stress that the main goal of this project is not the topic modeling per se, but to propose a novel way to enrich topic models with bigrams, by using the proposed deterministic technique. In fact, our method can be applied to the outcomes of any estimated topic model (LDA, STM, CTM) that employs a bag-of-words approach (i.e., based on unigrams only), to translate words-by-topic distributions into topic graphs. With this in mind, in this work we did not compare the performance of different topic models, because we aimed at showing the effectiveness of our approach in the topic modeling scope in general. Therefore, we used LDA as a reference tool for topic modeling in general, given that LDA is the most common approach for this type of task. Nevertheless, in future works, we intend to show how our topic model enrichment technique can be easily applied to other topic model methods, such as correlated topic models (CTM), structural topic models (STM), and others, as
suggested by the reviewer, as well as providing a comparative evaluation. As regards the evaluation of the extracted topics, we did not perform this stage since our project goal is not to measure how interpretable the topics are to humans (the clustering quality) but rather to show how our method makes the extracted topics more interpretable to humans. Paradoxically, topics with a low coherence metric may only further support our point. Besides, LDA is an unsupervised machine learning algorithm, which means certain accuracy metrics cannot be directly applied, since they depend on the notion of true classes for a datum (namely the ground truth, which we do not have).

Page & Line: p. 3, l. 101
Previous Version: Even though the method is here applied to an LDA model, it can be implemented with other topic models that can be estimated on words (unigrams), such as CTM Blei and Lafferty (2006) or STM Roberts et al. (2013). One of the critical differences with previously mentioned topic-modeling approaches is that, in this work, the enrichment of the topic model occurs after topics are inferred using unigrams. That means bigrams are later utilized to explore and summarize the syntagmatic structure of topics inferred with LDA at a deeper level.
New Version: Even though the method is here applied only to the LDA model, it can be easily implemented with other topic models, such as correlated topic models (CTM) Blei and Lafferty (2006) or structural topic models (STM) Roberts et al. (2013). That means our approach can be cast into any topic model framework based on the bag-of-words paradigm to transform the topic's distributions of unigrams into topic graphs. It is worth noting that the project's primary focus is not on topic modelling per se but on enhancing topic modelling. That is possible through our novel deterministic method, based on bigrams, which takes place after the topic inference stage (i.e., after the unigrams-by-topic distributions are estimated). In other words, bigrams are used to enrich an already estimated topic model and to summarize the syntagmatic structure of inferred topics.
For the sake of demonstration, in this work, we will be utilizing only LDA as a basic illustrative example, precisely chosen for its simplicity and widespread use.
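To make the transformation concrete, here is a minimal Python sketch of turning a unigram-by-topic distribution into a weighted, directed topic graph. The edge weighting used here (corpus bigram count times the two endpoint word-topic probabilities) is our illustrative choice, not necessarily the exact LDA2Net formula:

```python
from collections import Counter

def topic_graph(phi, corpus_tokens):
    """Combine a word-by-topic distribution `phi` (word -> probability)
    with corpus bigram counts into a weighted, directed topic graph.
    Edges whose endpoints are not in the topic's vocabulary are dropped."""
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return {(w1, w2): n * phi[w1] * phi[w2]
            for (w1, w2), n in bigrams.items()
            if w1 in phi and w2 in phi}

phi = {"vaccine": 0.05, "efficacy": 0.03, "dose": 0.02}
tokens = "vaccine efficacy of a second vaccine dose".split()
graph = topic_graph(phi, tokens)
# resulting edges: ("vaccine", "efficacy") and ("vaccine", "dose")
```

Because the bigram counts are empirical and the word-topic probabilities come from an already estimated model, this step is fully deterministic, which is the point stressed in the reply above.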
Reviewer 2

Q 2.1 The manuscript does not provide additional insightful analysis apart from the topic analysis.
That significantly restricts the contribution of the work. The topic analysis has some degree of randomness due to the initialization, and the outputs are based on statistics. Therefore, the manual interpretation and identification of the topic meanings are essential. Given the specific meaning of the target topic, Covid-19 scientific literature, the missing part is very important to distinguish the manuscript from some experiment reports.
Reply: We thank the reviewer for pointing this out. As concerns the additional insightful analysis, we agree this part is missing. Yet, we would like to clarify that our paper's goal is methodological rather than directly related to the medical domain or COVID-19. Actually, the choice of the specific case study and corpus was made strategically to serve this methodological purpose. The richness, variety, and complexity of the medical literature related to COVID-19 made it an ideal (and very demanding) setting to apply the proposed method and see if the results were meaningful from an expert's perspective. We acknowledge that, from a medical application viewpoint, our work does not go beyond a detailed analysis of the network-enriched topics and subtopics identified in the CORD-19 open research dataset but, as we stressed above, the focus of our study is only on developing, exploring and validating the methodology itself. About the randomness of results, a distinction must be made. Our topic enrichment, based on empirical counts of bigrams in each corpus document, does not involve an element of randomness on its own. On the other hand, an element of randomness derives from the estimated topic model (LDA) and its initialization. In other words, the element of randomness is inherited by, but not originated in, the proposed topic-enrichment method. We also know the LDA algorithm is robust and tends to converge regardless of the random initialization, so we were not concerned about this matter. In any case, we will better specify and clarify that question in the text. Given that, the randomness aspect might be mitigated, for example, by using non-random topic modelling techniques, such as Structural Topic Modeling with spectral initialization, and that may be an interesting direction for future work. Finally, recalling that the main novelty of this project relies on its original topic-enrichment methodology, we believe that this very facet distinguishes our work from many
others based on the same corpus but focusing on medical aspects and related findings.

Page & Line: p. 2, l. 34
To better comprehend the ultimate goal of this project, it is essential to stress that the leading idea is not to study topic modeling per se but to introduce a novel way to enhance topic model results and facilitate their interpretation.

Q 2.2 i) The filtering method only removes the articles uploaded before December 31, 2019. Is it a reasonable method? Although the rest of the papers concern SARS and MERS, are they all related to Covid-19?
The inclusion and exclusion criteria of the paper filtering are necessary to enhance the trustworthiness of the paper.
ii) You may reference the following papers, which focus on literature surveying:
- Information fusion and artificial intelligence for smart healthcare: a bibliometric study
- A bibliometric review of soft computing for recommender systems and sentiment analysis

Reply: We sincerely appreciate the Reviewer's suggestions for additional references and his thoughtful engagement with the corpus filtration process. Regarding the filtering method applied to the CORD-19 dataset, we opted to remove from the corpus only those articles that were published before December 31, 2019, the official recognition date of the COVID-19 epidemic by China. This approach was chosen to adopt a conservative stance, ensuring that the selected articles post-dated the outbreak's formal acknowledgment. While we recognise that some papers published after that date may also address topics such as SARS and MERS, we can assure that the CORD-19 corpus was designed to contain only works related to COVID-19 and SARS-CoV-2. We firmly believe that the creators of the CORD-19 corpus, including the White House medical experts and a coalition of leading research groups, possess a level of expertise in the medical domain that we could not surpass in terms of paper selection. Therefore, the conservative criteria applied through our minimal paper filtering process are precisely aimed at minimising the risks related to more stringent filtering choices, which could impact the representativeness of the CORD-19 corpus in our work. We are grateful for the opportunity to clarify this point.

Page & Line: p. 2, l. 38
Actually, the choice of the CORD-19 corpus was made strategically to serve such a purpose. The richness, variety, and complexity of medical literature related to COVID-19 made it an ideal (and very demanding) setting to apply the proposed method and see if the results were meaningful from an expert's perspective. However, it can be exploited for healthcare literature surveying tasks Chen et al. (2023) [...] out. The resulting dataset contains 398818 abstracts. This conservative filtering approach was adopted to ensure that the selected articles post-dated the outbreak's formal acknowledgment and to maintain the representativeness of the CORD-19 corpus in our work.
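The conservative date filter can be sketched as follows. The `publish_time` field name and the ISO date format follow the CORD-19 metadata conventions, but the snippet is an illustration rather than our actual preprocessing code:

```python
from datetime import date

CUTOFF = date(2019, 12, 31)  # official recognition date of the epidemic by China

def keep(article):
    """Retain only articles published on or after the cutoff date."""
    return date.fromisoformat(article["publish_time"]) >= CUTOFF

articles = [
    {"title": "A SARS retrospective", "publish_time": "2018-05-01"},
    {"title": "A COVID-19 abstract", "publish_time": "2020-03-15"},
]
kept = [a for a in articles if keep(a)]
```

Filtering on the publication date alone, rather than on topical keywords, is what makes the criterion conservative: no post-outbreak article is excluded.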
Q 2.3 Is it correct to claim the paper proposes an automatic topic labeling? Topic labeling should be the identification and interpretation of the identified topics (word clusters). In my understanding, the paper only discovers the word clusters, but does not provide an automatic method for topic meaning identification. The authors are required to justify this point or revise the manuscript.
Reply: We thank the reviewer for highlighting this issue, and we appreciate the opportunity to clarify our automatic labelling approach. While our method does indeed focus on word cluster identification within topics, it goes beyond mere clustering or grouping of words. Our method enables the discovery of subtopic networks, which are clusters of words with specific association structures that reflect the semantic and syntagmatic structures inherent in the corpus. Furthermore, the proposed labelling method generates a frequency-weighted set of label candidates through a statistical random walk technique applicable to weighted and directed graphs, such as our subtopic networks. This automated graph-based approach does not incorporate other information and is clearly not based on semantic AI approaches for reasoning and meaning identification. Instead, it leverages graph properties, like the interconnectedness of words within the reconstructed topic and subtopic networks, to generate label candidates of varying lengths. It is important to note that our method focuses on statistical associations and structural relationships, not semantic or meaning inference. While the method doesn't engage in reasoning, it does provide an automated means of suggesting labels based on the analyzed structural properties of subtopic networks. We acknowledge the distinction, suggested by the Reviewer, between topic labelling and topic meaning identification, and our method aligns only with the former. We further clarified this aspect in the manuscript to ensure a comprehensive understanding of the approach's limitations and capabilities.
Page & Line: p. 7, l. 241
New Version: The proposed heuristic does not incorporate information external to the corpus and is not based on semantics. Instead, it leverages graph properties, like the interconnectedness of words within the subtopic networks, to generate label candidates of varying lengths. It is important to note that our method focuses on statistical associations and structural relationships, not meaning inference. While the method doesn't engage in reasoning, it does provide an automated means of suggesting labels based on the analyzed structural properties of word networks.
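A weighted random walk over a directed word graph, of the kind used to generate frequency-weighted label candidates, can be sketched in Python. This is an illustrative version under our own assumptions; the details of the actual procedure may differ:

```python
import random

def label_candidates(edges, start, max_len=3, n_walks=200, seed=0):
    """Run short weighted random walks on a directed word graph and
    return label candidates ranked by how often each walk occurred.
    `edges` maps a word to a dict of successor words and edge weights."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_walks):
        walk, node = [start], start
        while len(walk) < max_len and edges.get(node):
            succ = edges[node]
            words = list(succ)
            # choose the next word with probability proportional to edge weight
            node = rng.choices(words, weights=[succ[w] for w in words])[0]
            walk.append(node)
        label = " ".join(walk)
        counts[label] = counts.get(label, 0) + 1
    return sorted(counts.items(), key=lambda kv: -kv[1])

edges = {"vaccine": {"efficacy": 3.0, "dose": 1.0}, "efficacy": {"trial": 2.0}}
candidates = label_candidates(edges, "vaccine")
```

Since heavier edges are visited more often, the most frequent walks surface the most strongly associated word sequences as label candidates, without any semantic reasoning.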
Q 2.4 I am also curious about the mapping and agreement between the identified topics and the 120 topics in the dataset (again, are all the 120 topics concerning Covid-19?)

Reply: We thank the reviewer for highlighting this issue. We wish to clarify that the decision to estimate a topic model with 120 topics is not directly tied to the identification by experts of 120 topics related to Covid-19 within the corpus. Rather, it stems from our methodological choice of using the four criteria proposed by Nikita (2016) to guide the selection of the number of topics in the LDA model we estimate. For each of the 120 topics hence inferred by the LDA model, we engaged a domain expert to provide representative titles/labels that capture the essence of the topic's contents. It is important to note that we did not disclose the automatically generated labels assigned by our method to the topics and subtopics during this process. This was done to ensure that the expert's judgments remained unbiased.

Errata Corrige
Corrections and other minor changes.
Page & Line: p. 3, l. 70
Previous Version: However, less attention has been devoted to addressing LDA's limitations and related models concerning their "bag-of-words" approach, neglecting word order. LDA returns a (weighted) list of words for each topic, disregarding the short-distanced semantic information in the word arrangement sequences.
New Version: However, less attention has been devoted to addressing the limitations of LDA and related models, concerning their "bag-of-words" approach, which neglects word order. These models return a (weighted) list of words for each topic, disregarding the short-distanced semantic information in the word arrangement sequences.
Q 1.3 You evaluate the obtained results in many ways. But besides these, you should evaluate your extracted topics in terms of topic coherence, tf-idf coherence, F-score, precision, and recall. You should compare your proposed topic model with LDA, STM, and CTM.

[...] containing the logarithms of word probabilities for each topic, can be accessed through the following GitHub repository: https://github.com/carlosantagiustina/underthesurfaceofCOVID19topics