Peer Review History

Original Submission
January 14, 2022
Decision Letter - Liviu-Adrian Cotfas, Editor

PONE-D-22-01277
A multilingual perspective on existing text classification methods
PLOS ONE

Dear Dr. Gasparetto,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please check the reviewers' comments below.

Please submit your revised manuscript by Apr 30 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Liviu-Adrian Cotfas

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. 

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: No

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The paper promises to present a "multilingual" perspective on existing text classification methods but then goes on to concentrate mostly on the languages Italian and French in comparison to English. Irrespective of the remainder of the paper, this seems like an arbitrary choice - given that there are thousands of languages and many dozen spoken at least by as many people as Italian or French, one has to wonder if the authors would propose to publish a similar paper on each random pair of two languages in the same fashion.

The paper provides quite a bit of useful information to anyone interested to get started with text classification (TC) on these two languages and also gives a quick overview over TC methods which can be useful to people getting started in that area.

However, the paper does not really present any valuable scientific research questions worthy of publication in a journal. The authors list their contributions as 1) an "analysis" of TC procedures, focusing on text representation and how it affects the compatibility of these models with lower-resourced languages; 2) a showcase of a variety of datasets for Italian TC, and similarly for French; 3) quantitative results on "understudied" TC tasks.

Starting with 2), the paper does give a good overview of Italian/French and multilingual TC datasets, but while this is useful, there is not much scientific insight to be gained. No research questions are answered with regard to that contribution, and there are now ample online websites and search tools that will help find that list of datasets without too much effort. The authors mention, for a number of different TC tasks, whether there are fewer datasets for Italian or French than for English, but not much further information or analysis is presented.

The "analysis of TC procedures" is really a very shallow overview of the main TC methods and tasks, such as would be part of an introductory course on NLP. The overview does not really contain any additional information on how specifically some of these methods may perform differently for different languages from a theoretical point of view, nor does it point to literature where such questions have been looked into. There is no discussion of how properties of languages such as vocabulary size, morphology, word segmentation, script used, etc. could impact different algorithms in different ways. This is of course even harder with only two languages in focus, both of which are quite similar to English in these respects.

There is a whole section on "computational resources" without any indication how this relates to the topic of the paper.

For contribution 3) the authors apply a number of TC methods to similar datasets for topic labelling and news classification in the three languages and then compare the results. The authors point out some difficulties and inconsistencies with some of the datasets but do not analyse the impact of that on the results.

The distribution of classes is not reported, nor compared between languages. The choice of algorithms seems arbitrary and is not motivated by any relation to analysing the impact of language choice on algorithm performance. Each method is run 4 times with different random data splits, but no other sources of randomness are mentioned (e.g., random initialization or optimization order for NN methods). The variance of the results is not reported, nor is there an indication of which differences between results could be considered significant.

Similar comparisons between different algorithms have been performed numerous times before with much more detailed analysis on what the reasons for the performance differences could be.

The comparison of results between languages is done only very superficially, and there is some speculation about possible causes, but there is no actual analysis or in-depth study of any of those potential causes, let alone an attempt to distinguish the impact of the language itself from the impact of the specific corpus chosen for that language.

Overall, I do not think the paper provides useful research insights that would justify publication in a journal. I also cannot see how the paper could be sufficiently improved by revising it.

Reviewer #2: In our opinion, the paper covers an interesting gap: a good recent survey on the text classification problem. The fact that it is a multilingual one, covering Italian, French and English, is a plus.

Regarding the references, I think that you could add more results: for the Reuters database, the works of Fabrizio Sebastiani (e.g., "Machine Learning in Automated Text Categorization"), or the work of Rusu and Dinu ("Rank Distance Aggregation as a Fixed Classifier Combining Rule for Text Categorization", CICLing 2010).

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Johann Petrak

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Revision 1

Dear PLOS ONE Editors and Reviewers,

We would like to thank the reviewers for their feedback. We have carefully considered your requests and suggestions, and addressed each and every one of them.

First off, in response to the editor's request, we have reviewed PLOS ONE's style requirements, and have done our best to update the manuscript to precisely meet them.

In response to prompts (a) and (b), we note that the Reuters Corpora utilized in our experiments are owned by NIST, which acts as their sole distributor. In particular, researchers are asked to sign an agreement stating that 'the display, reproduction, transmission, distribution or publication of the information is prohibited' and that 'summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is not possible to reconstruct the information from these summaries'. These restrictions are documented at https://trec.nist.gov/data/reuters/reuters.html. On the other hand, we have made the extracted Wikipedia dumps available, together with information on how to use them. The code repository is available at https://gitlab.com/distration/dsi-nlp-publib, which contains a link to the cloud folder storing the datasets.

We provide below a response to the received feedback and an explanation of all major changes introduced. To facilitate the revision process, we provide a copy of the manuscript annotated in red and blue, marking removed and added content, respectively.

We note that the manuscript has undergone significant changes; we believe such extensive additions were necessary to address the critical issues pointed out by the reviewers, reviewer 1 in particular. We now present the addressed issues point-by-point.

Reviewer 1 pointed out that presenting the paper as "multilingual" might be misleading, as it covers only a small fraction of the thousands of languages in the world. We have therefore overhauled our work to put more emphasis on the fact that our aim was to gauge how well various classification methods could be adapted to a real-world scenario with limited resources, and that our perspective was mainly on the Italian language.

We also agree that the analysis of TC procedures was too shallow and not related closely enough to language-specific issues. To improve it, we have split it into three more organized sections. The Preprocessing section describes in more detail the preprocessing operations related to language, with a large focus on text segmentation; indeed, this is the largest addition to the manuscript. The other two sections largely replace and expand the old 'TC procedures' section. The Text representation section goes into more detail about how text is projected into feature space, as we underline the importance of proper text representation in any NLP procedure, especially from a linguistic point of view. Lastly, we have added a briefer section providing some insight into the Classification step of the pipeline.

We have clarified that the 'computational resources' section describes a severe bottleneck to experimentation with recent language models, which is not a linguistic issue per se but has a direct impact on the possibility of evaluating these models. In fact, our limited resources were one of the reasons why we could not perform more extensive experimentation. We hope that, by improving the previous sections and specifying this section in more detail, it is clearer that computational resource requirements pose daunting issues for the adoption of NLP methods in other languages.

Though we have slightly revised it, we find ourselves in disagreement with the points raised about the analysis of datasets. This research was meant to highlight the scarcity of downstream, task-specific datasets for the Italian language, which could only be demonstrated with empirical data. We also disagree with the point that such datasets can be found without much effort, as shown by their scarce adoption in the literature. In our experience, many of these datasets are scattered around the web and cannot be reliably accessed through a single search tool. Moreover, we have found multiple references to existing datasets, only to discover that they had been retired, made private or otherwise rendered unavailable. Nonetheless, we hope that this section reads more smoothly after the revisions made to the other parts of the manuscript.

We agree with the critiques of the experimental section and have added several considerations to the analysis. We have added variance statistics to the performance results, as well as the label distributions for all datasets. Details about initialization and optimization procedures were already present in the supplemental material, which has nevertheless been further expanded to explain the reasoning behind the choice of methods and to clarify other technical details.

Lastly, Reviewer 2 asked us to add some references with regard to the analyzed Reuters database, which we have done.

We confirm that neither the manuscript nor any parts of its content are currently under consideration or published in another journal. All authors have approved the manuscript and agree with its submission to PLOS ONE.

Sincerely yours,

Andrea Gasparetto (corresponding author)

andrea.gasparetto@unive.it

Attachments
Attachment
Submitted filename: rebuttal.pdf
Decision Letter - Sathishkumar V E, Editor

A survey on text classification: practical perspectives on the Italian language

PONE-D-22-01277R1

Dear Dr. Gasparetto,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible, and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Sathishkumar V E

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #3: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #3: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #3: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #3: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: Yes: Usha Moorthy

**********

Formally Accepted
Acceptance Letter - Sathishkumar V E, Editor

PONE-D-22-01277R1

A survey on text classification: practical perspectives on the Italian language

Dear Dr. Gasparetto:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sathishkumar V E

Academic Editor

PLOS ONE

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.