Knowledge reuse in software projects: Retrieving software development Q&A posts based on project task similarity

Software developers need to cope with a massive amount of knowledge throughout the typical life cycle of modern projects. This knowledge includes expertise related to the software development phases (e.g., programming, testing) using a wide variety of methods and tools, including development methodologies (e.g., waterfall, agile), software tools (e.g., Eclipse), programming languages (e.g., Java, SQL), and deployment strategies (e.g., Docker, Jenkins). However, there is no explicit integration of these various types of knowledge with software development projects so that developers can avoid having to search over and over for similar and recurrent solutions to tasks and reuse this knowledge. Specifically, Q&A sites such as Stack Overflow are used by developers to share software development knowledge through posts published in several categories, but there is no link between these posts and the tasks developers perform. In this paper, we present an approach that (i) allows developers to associate project tasks with Stack Overflow posts, and (ii) recommends which Stack Overflow posts might be reused based on task similarity. We analyze an industry dataset, which contains project tasks associated with Stack Overflow posts, looking for the similarity of project tasks that reuse a Stack Overflow post. The approach indicates that when a software developer is performing a task, and this task is similar to another task that has been associated with a post, the same post can be recommended to the developer and possibly reused. We believe that this approach can significantly advance the state of the art of software knowledge reuse by supporting novel knowledge-project associations.
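The recommendation idea in the abstract can be illustrated with a minimal sketch. It assumes task descriptions are compared with Jaccard similarity over word tokens; this measure, the `threshold` value, the function names, and the post identifiers are all hypothetical choices made for illustration and are not taken from the manuscript, which evaluates its own set of distance and similarity algorithms.

```python
import re


def tokenize(text):
    """Lowercase a task description and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_posts(new_task, curated_tasks, threshold=0.3):
    """Return (post_id, score) pairs for curated tasks similar to new_task.

    curated_tasks: list of (task_description, post_id) pairs, i.e. tasks a
    developer has already associated with a Stack Overflow post.
    threshold: hypothetical cut-off chosen for this sketch only.
    """
    new_tokens = tokenize(new_task)
    scored = [
        (post_id, jaccard(new_tokens, tokenize(description)))
        for description, post_id in curated_tasks
    ]
    hits = [(post_id, score) for post_id, score in scored if score >= threshold]
    # Most similar curated task first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical curated associations (task description, associated post).
curated = [
    ("Fix date parsing in report export", "post_a"),
    ("Upgrade MongoDB driver", "post_b"),
]
result = recommend_posts("Fix date parsing in invoice export", curated)
```

Here the new task shares most of its tokens with the first curated task, so only `post_a` is returned as a reuse candidate. A different similarity measure could be swapped in by replacing `jaccard`.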

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.
2. We noted in your submission details that a portion of your manuscript may have been presented or published elsewhere.
[Main results of precision and accuracy are published in Melo, G., Oliveira, T., Alencar, P., & Cowan, D. (2019). Retrieving curated stack overflow posts from project task similarities. In International Conference on Software Engineering Knowledge Engineering (pp. 415-418).] Please clarify whether this [conference proceeding or publication] was peer-reviewed and formally published. If this work was previously peer-reviewed and published, in the cover letter please provide the reason that this work does not constitute dual publication and should be included in the current manuscript.
>Answer: (This answer is also in the Cover Letter, as requested.) A preliminary and shorter version (4 pages) of this paper was peer-reviewed and formally published in the International Conference on Software Engineering Knowledge Engineering (SEKE 2019). That paper has been extended to 35 pages in several ways. First, the process model description is now detailed enough to support an exact and accurate replication of the proposed model. Second, a completely new and unpublished systematic mapping study was introduced. This systematic mapping study answers research questions regarding current proposals that associate Stack Overflow with the development environment. Third, the experimental studies have been extended with additional distance and similarity algorithm calculations. The paper has also been significantly enhanced with additional details about the background, related work, case studies, and the analysis of the results. Given that the first publication was a short paper, we believe that the findings presented in this paper will appeal to PLOS ONE readers and the wider academic community. Our findings will allow readers to accurately reproduce our proposed model, as we have included details about the implementation. In addition, the novel systematic mapping study advances the state of the art by providing integrated information regarding current proposals that associate Stack Overflow and software development.
Reviewer #1: This paper focused on software knowledge reuse on Stack Overflow based on project task similarity. Two research questions were conducted.
There are some issues that the authors might consider: (1) The manuscript suffers from how to generalize the proposed approach to other datasets or other domains. This paper focused on a domain-specific problem. The article fails to provide insights and give generalized suggestions to more diversified readers in this regard.

>Answer:
We have added the following paragraph to the manuscript, in the Discussion Section.
"With respect to the application of the proposed approach in other domains, we argue knowledge bases can be a useful asset in diverse domains such as language studies, health or mathematics. We believe these and other domains could use similar approaches to discover and reuse knowledge through automated tools and methods. Although the approach presented in our paper focuses on software development, the general principles of the proposed approach could be applied to knowledge bases in many additional domains. Of course, new approaches would require data from the experts using these knowledge bases. For example, in the case of health, physicians could take advantage of specific health information already curated by other experts that have used those knowledge bases. The approach could use clinical guideline tasks instead of software development tasks, and instead of Stack Overflow, the approach in health care could use a medical knowledge base Q&A website such as medhelp.org. The proposed approach could also be applied to knowledge bases in domains other than health, such as mathematics (math.stackexchange.com) and law (avvo.com)."
Regarding our dataset, although we have relied on one dataset in the software engineering domain, the dataset is diverse: it covers multiple projects (five) of the company over an extended period of time (around 7 years), not only one project. The projects in the dataset have different characteristics, such as (1) a legacy project with few maintenance tasks, (2) a new project that uses modern technologies for front-end, back-end and database (Ruby on Rails, NodeJS and MongoDB), (3) the company's main project, which by 2018 had been in production for over 8 years, and (4) a project created to coordinate the migration of the application server. We have added this project characterization to the manuscript. This dataset, which considers multiple projects, covers a diversity of cases in the software engineering domain.
(2) Some figures and tables are not clear and are not friendly for readability. The authors should consider using vector graphics such as eps and pdf.

>Answer:
We have converted all the figures to .eps for improved quality.
(3) Some recently related works should be included.
LinkLive: discovering Web learning resources for developers from Q&A discussions
Learning to answer programming questions with software documentation through social context embedding
Leveraging Official Content and Social Context to Recommend Software Documentation
To Do or Not To Do: Distill crowdsourced negative caveats to augment api documentation

>Answer:
We have added the recommended references in the Related Work Section.