Automatic authorship attribution in Albanian texts

Arta Misini; Ercan Canhasi; Arbana Kadriu; Endrit Fetahi

doi:10.1371/journal.pone.0310057

Peer Review History

Original Submission July 9, 2024
25 Jul 2024 Decision Letter - Muhammad Afzaal, Editor
PONE-D-24-28247
Automatic Authorship Attribution in Albanian texts
PLOS ONE
Dear Dr. Misini,
Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.
Please submit your revised manuscript by Sep 08 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.
Please include the following items when submitting your revised manuscript:
* A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.
If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future.
For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.
We look forward to receiving your revised manuscript.
Kind regards,
Muhammad Afzaal, PhD
Academic Editor
PLOS ONE
Journal Requirements:
When submitting your revision, we need you to address these additional requirements.
1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf
2. In your Methods section, please include additional information about your dataset and ensure that you have included a statement specifying whether the collection and analysis method complied with the terms and conditions for the source of the data.
3. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.
4. Please note that your Data Availability Statement is currently missing the direct link to access each database. If your manuscript is accepted for publication, you will be asked to provide these details on a very short timeline.
We therefore suggest that you provide this information now, though we will not hold up the peer review process if you are unable.
5. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes
Reviewer #2: Yes
2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes
3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available.
If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
Reviewer #2: Yes
4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: Yes
Reviewer #2: Yes
5. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #1: This study presents a statistically robust investigation into the authorship attribution of the low-resource language Albanian through a strategic employment of multiclass classification techniques and machine learning approaches on chosen sets of linguistic features. The subsections of the article are neatly organized with an elaborated description of the analysis procedure and a clear presentation of experimental results. The insights derived from the results are also interesting. However, several points need to be addressed:
1. Clarification of Results in the Abstract Section: The authors are recommended to summarize concrete results in the abstract section, specifying the most suitable set of linguistic features for authorship attribution and the best machine learning approaches for Albanian texts, as proposed in the research questions.
2. Possible Effect of “Average Number of Words (Tokens) per Author” on ML Algorithm Performance: In terms of data size for authorship attribution tasks, 10,000 words per author is considered the “reliable minimum for an authorial set” (Burrows, 2007).
As presented in Table 2, “Authorship Statistics,” the average number of tokens per author in both genres is far too limited when measured against this standard. The effect of factors such as “number of candidate authors” and “data size” has been fully acknowledged by studies such as Luyckx & Daelemans (2011). Could the performance of authorship attribution in the present study also be influenced by factors like “author set size”?
Reviewer #2: This paper introduced the extended Albanian authorship attribution corpus (A3C-e) and the most effective linguistic features and the most effective machine learning models. It took considerable effort to build the Albanian authorship attribution corpus, consisting of the newsroom text and scanned books. Extensive experiments were also conducted to identify the most effective linguistic features and machine learning models. However, there is still some room for improvement:
1. As an authorship attribution corpus, it’s better to include a breakdown of samples by author. For example, how many samples are collected from one author? How are the samples distributed by authors?
2. It’s better to detail the imperative and reflective programming paradigms that were used to extract features. Are there any open-source tools used? Will the code for feature extraction be released publicly or not?
3. For deep learning methods, BERT multilingual also supports Albanian. It’s better to include BERT in the deep learning methods.
4. It’s better to define F1, that is, the equation for F1 calculation.
6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: Yes: Xiao Shanshan
Reviewer #2: No
[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
https://doi.org/10.1371/journal.pone.0310057.r001
Revision 1
10 Aug 2024 Author Response
Response to Review Comments (Manuscript number: PONE-S-24-36426)
We would like to express our sincere gratitude to you for the time and effort you have dedicated to reviewing our manuscript, titled "Automatic Authorship Attribution in Albanian texts," ID PONE-S-24-36426. We appreciate your constructive feedback, which has been invaluable in improving the quality and clarity of our work. Furthermore, we appreciate that you assigned reviewers to our work in such a short time. We have carefully considered each of your suggestions and have made the necessary revisions to address the points raised. Our responses to your comments are detailed below.
======================================================================
Academic Editor
1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf
Response: Done!
2. In your Methods section, please include additional information about your dataset and ensure that you have included a statement specifying whether the collection and analysis method complied with the terms and conditions for the source of the data.
Response: Thank you for your valuable feedback. In response to this comment, we have revised the Experiments section to include additional details about the dataset and a compliance statement (subsection 4.1 Data preprocessing). This study's data collection and analysis adhered to the terms and conditions of the data sources.
3. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work.
Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.
Response: Thank you for bringing the PLOS ONE code-sharing guidelines to our attention. We acknowledge the importance of making our author-generated code available to facilitate reproducibility and reuse. We are committed to sharing the code supporting our findings and will make it publicly available upon manuscript publication. Meanwhile, we need more time to clean and document the code to meet the best practices outlined by PLOS ONE. We value your understanding and will ensure the code is prepared and shared without restrictions upon publication.
4. Please note that your Data Availability Statement is currently missing the direct link to access each database. If your manuscript is accepted for publication, you will be asked to provide these details on a very short timeline. We therefore suggest that you provide this information now, though we will not hold up the peer review process if you are unable.
Response: Done! The DOI is 10.5281/zenodo.12699563
5. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.
Response: Thank you for your thorough review and for highlighting the importance of maintaining an accurate and up-to-date reference list. We have carefully reviewed our reference list to ensure that it is complete and correct.
To verify the status of the cited papers, we searched the Retraction Watch Database and confirmed that none of the papers in our reference list have been retracted. As such, no changes were necessary to the reference list. In addition, we have included new references related to the BERT model to support our analysis, as below:
BertAA: BERT fine-tuning for Authorship Attribution
Cross-domain authorship attribution using pre-trained language models
BERT: Pre-training of deep bidirectional transformers for language understanding
Unsupervised cross-lingual representation learning at scale
RoBERTa: A robustly optimized BERT pretraining approach
Automated authorship attribution using advanced signal classification techniques
A two level learning model for authorship authentication
======================================================================
Response to Reviewer #1 comments
General comments: This study presents a statistically robust investigation into the authorship attribution of the low-resource language Albanian through a strategic employment of multiclass classification techniques and machine learning approaches on chosen sets of linguistic features. The subsections of the article are neatly organized with an elaborated description of the analysis procedure and a clear presentation of experimental results. The insights derived from the results are also interesting. However, several points need to be addressed.
Response: Thank you for your positive and encouraging feedback on our study. We are pleased that you found the organization, description of the analysis procedure, and presentation of the experimental results to be clear and insightful. We appreciate your recognition of our efforts and the robustness of our investigation into Albanian authorship attribution. Regarding the specific points that need to be addressed, we have carefully considered each of your comments and have made the necessary revisions to the manuscript.
Comment 1: Clarification of Results in the Abstract Section: The authors are recommended to summarize concrete results in the abstract section, specifying the most suitable set of linguistic features for authorship attribution and the best machine learning approaches for Albanian texts, as proposed in the research questions.
Response: Thank you for your valuable feedback. We agree that summarizing the concrete results in the abstract will enhance the scope and clarity of our study. In response, we have revised the abstract to include specific findings, highlighting the most effective set of linguistic features and the best machine-learning approaches for Albanian authorship attribution.
Comment 2.1: Possible Effect of “Average Number of Words (Tokens) per Author” on ML Algorithm Performance: In terms of data size for authorship attribution tasks, 10,000 words per author is considered the “reliable minimum for an authorial set” (Burrows, 2007). As presented in Table 2, “Authorship Statistics,” the average number of tokens per author in both genres is far too limited when measured against this standard.
Response: Thank you for pointing out this important consideration. There seems to be a misunderstanding regarding the calculation of the average number of tokens per author. In our study, we calculated the average number of tokens per author per sample, not the total number of words per author.
The formulas used for calculating the average number of tokens per author (per sample) are as follows:
avg no. of words (per sample) = (total no. of words) / (total no. of samples)
avg no. of samples per author = (total no. of samples) / (total no. of authors)
avg no. of tokens per author (per sample) = (avg no. of words (per sample)) / (total no. of authors)
To address this and provide clarity, we have revised the dataset statistics table (Table 2) and included additional rows that provide the minimal total number of words per author, the maximal total number of words per author, and the average total number of words per author. With these updates, we believe we provide a clearer and more comprehensive overview of the dataset, addressing the concern regarding the minimum number of words per author.
Comment 2.2: The effect of factors such as “number of candidate authors” and “data size” has been fully acknowledged by studies such as Luyckx & Daelemans (2011). Could the performance of authorship attribution in the present study also be influenced by factors like “author set size”?
Response: To address any potential concerns, we have revised our dataset statistics table to provide a clearer representation of the data distribution. This revision aims to indicate the sufficiency of our dataset for the analysis conducted. This clarification illustrates that our dataset meets the standards for a reliable authorial set, thereby mitigating concerns regarding the data size per author. Regarding the “author set size,” we acknowledge that a larger number of candidate authors can increase the complexity of the classification task. We conducted our experiments with a carefully selected number of candidate authors, ensuring a diverse range of writing styles and genres, as well as maintaining a sufficient amount of data per author for effective analysis.
However, we are also considering expanding the dataset and further investigating the impact of "author set size" in future work to provide a more comprehensive analysis of its effects on authorship attribution performance.
======================================================================
Response to Reviewer #2 comments
General comments: This paper introduced the extended Albanian authorship attribution corpus (A3C-e) and the most effective linguistic features and the most effective machine learning models. It took considerable effort to build the Albanian authorship attribution corpus, consisting of the newsroom text and scanned books. Extensive experiments were also conducted to identify the most effective linguistic features and machine learning models. However, there is still some room for improvement.
Response: Thank you for your positive feedback on our paper and for recognizing the efforts involved in building the extended Albanian authorship attribution corpus (A3C-e) and identifying the most effective linguistic features and machine learning methods. We appreciate your acknowledgment of the extensive experiments conducted in this study. We understand that there is always room for improvement, and we have carefully considered your specific suggestions for enhancing the manuscript. Below, we address each of your comments and outline the revisions made to our work.
Comment 1: As an authorship attribution corpus, it’s better to include a breakdown of samples by author. For example, how many samples are collected from one author? How are the samples distributed by authors?
Response: As part of our authorship attribution corpus, we have included a detailed breakdown of samples by author (Figure 2) in the revised manuscript, specifying the number of samples collected from each author and how they are distributed.
Comment 2.1: It’s better to detail the imperative and reflective programming paradigms that were used to extract features.
Response: To calculate the complete set of extracted features, we utilized a pipeline using the globals() function and the inspect module. Combining these two tools' functionality gave us a toolset for programmatically examining and manipulating objects in Python code. The process involved executing the globals() function to obtain a dictionary of the current global symbol table and then using inspect to retrieve information about the user-defined functions that were available during the execution of the Python code. In this way, we inspected the current scope of the Python code and retrieved information about the user-defined functions that we manually programmed. A screenshot of a section of the code may be found below: [screenshot not reproduced here]
Comment 2.2: Are there any open-source tools used?
Response: We want to clarify that all the features extracted for this study were carefully and manually engineered and programmed. We did not rely on any open-source tools for feature extraction. Instead, the custom scripts we developed ensure that the features are tailored specifically to the unique characteristics of the Albanian language and the requirements of the authorship attribution analysis.
Comment 2.3: Will the code for feature extraction be released publicly or not?
Response: We are committed to transparency and reproducibility and will make our code available upon the publication of our work.
Comment 3: For deep learning methods, BERT multilingual also supports Albanian. It’s better to include BERT in the deep learning methods.
Response: Thank you for your insightful suggestion. We appreciate the recommendation to include BERT-multilingual to provide a more comprehensive evaluation of deep learning methods for authorship attribution in Albanian. BERT-multilingual and XLM-RoBERTA-base have been incorporated into our experimental setup.
The manuscript now includes a comparison of their performance with other models (Table 8), demonstrating their effectiveness for Albanian texts.
Comment 4: It’s better to define F1, that is, the equation for F1 calculation.
Response: Thank you for pointing this out. In response to this comment, we have added a definition of the F1 score, including the equation for its calculation, to the manuscript. This will ensure clarity and provide a precise understanding of how the F1 score is computed.
Attachment: Submitted filename: response to reviewers.pdf
https://doi.org/10.1371/journal.pone.0310057.r002
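The response to Comment 2.1 above describes discovering hand-written feature functions through globals() and the inspect module, but the code itself does not appear in this record. A minimal sketch of that idea follows. It is not the authors' code: the three feature functions, the `feat_` naming prefix, and the helper names are all hypothetical and chosen only for illustration.

```python
import inspect


# Hypothetical feature functions (illustrative only, not from the manuscript).
def feat_avg_word_length(text: str) -> float:
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def feat_comma_rate(text: str) -> float:
    return text.count(",") / len(text) if text else 0.0


def feat_type_token_ratio(text: str) -> float:
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def collect_feature_functions(namespace: dict, prefix: str = "feat_") -> dict:
    """Return the plain Python functions in `namespace` whose names start with `prefix`."""
    return {
        name: namespace[name]
        for name in sorted(namespace)
        if name.startswith(prefix) and inspect.isfunction(namespace[name])
    }


def extract_features(text: str, namespace: dict) -> dict:
    """Apply every discovered feature function to `text`."""
    functions = collect_feature_functions(namespace)
    return {name: fn(text) for name, fn in functions.items()}


# globals() returns the current module's symbol table as a dictionary.
sample = "Dita ishte e bukur, dhe dielli ndriçonte."
features = extract_features(sample, globals())
```

With this layout, adding a feature only requires defining another function with the assumed prefix; the lookup picks it up without a hand-maintained registry.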
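The response to Comment 4 refers to an F1 equation added to the manuscript, which is not reproduced in this record. For orientation, the standard per-class definition is F1 = 2 * precision * recall / (precision + recall). The sketch below computes it for a multiclass setting such as authorship attribution. The macro averaging and the toy labels are assumptions made for illustration; this record does not state which averaging scheme the authors used.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Per-class F1 = 2 * P * R / (P + R), from true and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this author, but wrongly
            fn[truth] += 1  # missed the true author
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels standing in for three candidate authors.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
scores = f1_per_class(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
```

Macro averaging weights every author equally regardless of how many samples each has; a weighted or micro average would behave differently on an imbalanced corpus.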
20 Aug 2024 Decision Letter - Muhammad Afzaal, Editor
Automatic Authorship Attribution in Albanian texts
PONE-D-24-28247R1
Dear Dr. Misini,
We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.
Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.
An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.
If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible, no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.
Kind regards,
Muhammad Afzaal, PhD
Academic Editor
PLOS ONE
Additional Editor Comments (optional):
Reviewers' comments:
https://doi.org/10.1371/journal.pone.0310057.r003
Formally Accepted
27 Aug 2024 Acceptance Letter - Muhammad Afzaal, Editor
PONE-D-24-28247R1
PLOS ONE
Dear Dr. Misini,
I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.
At this stage, our production department will prepare your paper for publication. This includes ensuring the following:
* All references, tables, and figures are properly cited
* All relevant supporting information is included in the manuscript submission
* There are no issues that prevent the paper from being properly typeset
If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.
Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.
If we can help with anything else, please email us at customercare@plos.org.
Thank you for submitting your work to PLOS ONE and supporting open access.
Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Muhammad Afzaal
Academic Editor
PLOS ONE
https://doi.org/10.1371/journal.pone.0310057.r004

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.