Peer Review History
| Original Submission: March 1, 2020 |
|---|
PONE-D-20-06032

American is to depression as Irish is to alcoholism: Artificial Intelligence in mental health and the biases of language based models.

PLOS ONE

Dear Dr. Straw,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. The reviewers are in favor of asking the authors for a revised manuscript with technical and communication upgrades. Please address their comments in whole.

We would appreciate receiving your revised manuscript by May 3, 2020, 11:59 PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that, if applicable, you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:
Please note that, while forming your response, if your article is accepted you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Christopher M. Danforth
Academic Editor
PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please include in your appendix the date when the systematic literature search was carried out.

3. Thank you for stating the following in the Competing Interests section: "I have read the journal's policy and the authors of this manuscript have the following competing interests: Isabel Straw has been an unpaid intern with the e-Health company "Neuroflow" as part of her Masters in Public Health (MPH) program. While this has been entirely separate from this research project, the timing of the internship overlaps with the period of time taken to write this article. Chris Callison-Burch declares no competing interests."

Please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials, by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials." (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests). If there are restrictions on sharing of data and/or materials, please state these.
Please note that we cannot proceed with consideration of your article until this information has been declared. Please include your updated Competing Interests statement in your cover letter; we will change the online submission form on your behalf.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party) those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous.
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I like the paper and the extent to which it describes how and where biases may appear at the various stages of a machine learning model. It is important to keep in mind that a model learns from the data it is fed, and it is our responsibility to make sure the "reality" that the data represents does not favor some groups more than others! The paper does a good job of getting that point across. That being said, there are a few points I would like to highlight:

- The authors do not go into the details of the word embeddings they used for the analogy and similarity tasks.
- The dimensions of word embeddings play an important role. It would be helpful to see more details and a comparison across various embedding dimensions.
- In the analogy tasks, the authors do not take into account the capitalization of words, which may change the results. For example, on line 317 the analogy shown is "American is to Depression as Irish is to Alcoholism", while in the equation in the table under line 317 the vectors are for lowercased words.
- Section 3.2 presents some interesting results about individuals' risk to themselves and their risk to others. The graphs capture the biases with respect to religion, race, gender, and nationality, but the details of how the biases were captured are not clear. It would be helpful to have more details so that the results are reproducible.
- Finally, it would be interesting to see whether the biases hold in embeddings trained on domain-specific data such as clinical notes and PubMed articles, as mentioned in section 4.1.4.

Overall, the paper presents an important case for highlighting the biases in language models; data scientists and medical experts need to collaborate to build models that do not encode and further propagate the biased reality that data represents today!

Reviewer #2: This article provides an overview of applications of word embeddings to psychology studies. It highlights the biases that can be involved in such methods and cautions practitioners to consider specific points in the scientific process at which bias can be introduced. This is an important message for both the NLP and the psychology research communities, and I think that this literature review is of great value. The article would be greatly improved, and its message would be much stronger, if some technical concerns were first addressed. With these revisions, I believe this would be a very strong contribution. My major concerns are the following:

1. What training data was used for the word2vec and GloVe models? This needs to be specified, particularly since one of the core arguments of the paper concerns biases in the training data. I'm guessing that these are large, pre-trained models (but I don't know for sure); in that case, does this actually mimic the setting in which you expect word embeddings to be used for psychology studies? In other words, wouldn't you expect these studies to train their own models on their particular datasets of interest (perhaps social media, as mentioned earlier in the paper)? Citing Google Colab isn't enough, since I can't reproduce your work with only this knowledge.

2. What significance tests were done for the results in Tables 3.1.1-3.1.5 and in Graphs 1-4? How stable are these results?
If you chose not to perform significance tests (perhaps because the model is very large and pre-trained), this should be explained.

3. I'm having trouble understanding Graphs 1-4. Why are the x and y axes showing the same measurements? What is the purpose of representing the results in this way? The diagonal lines don't mean anything, so perhaps a 2D line or a barplot with error bars would be a better representation.

4. For Tables 3.1.1-3.1.5, why did you choose to use analogies rather than showing the most similar words for each demographic group or using a test like WEAT (Caliskan et al., 2016)? How were the seed term pairings decided (e.g. why is "depression" matched with "american" rather than "irish"?), and were the seed terms allowed to be returned in the results?

5. What papers were used in this literature review? The query terms are provided, but the paper titles are not. Listing them would be helpful in understanding the results of the literature review.

6. The scope of the paper should be clarified in a few places to emphasize that this paper focuses specifically on word embedding models and not on other types of NLP techniques.

Some minor suggested changes:

Line 87: change "human emotions" to "linguistic representations of human emotions".

Line 128: I think there should be another paragraph here about how the medical data previously described is used to train machine learning models, and so these models might also be biased. This is the central motivation of the paper, so I think it should be explained clearly and strongly in the introduction.

Line 136: Reddit, Facebook, and Twitter aren't the usual training sets for the embeddings used in the usual, big, pre-trained models. They are often used in computational social science research.

Line 144: This information is in the Appendix, but I think it's worth briefly mentioning in the text at least which venues the data was pulled from (e.g.
ArXiv, …).

Lines 183-5: I think you could spell this out more clearly: "Models trained on data written mostly by white men will not accurately represent people in other demographic groups."

Line 197: The distributional hypothesis is used in many models outside of deep learning. It's not a contribution of, or unique to, deep learning. Nor is the concept of finding distances between word vectors a contribution of, or unique to, deep learning.

Line 208: Does that research emphasize the model or the training data? There is some work showing that such biases can actually be magnified by the model, but my summary would be that most work focuses on the effects of biased training data (the bias is introduced through the data, and the model simply encodes that bias).

Section 2.2: I think this section could be extended and could include more references. It's not totally clear what "data interpretation" actually means. Later you describe both understanding model results and labeling training data as "data interpretation", and I need more explanation to buy the combination of these separate tasks.

Line 343: These two graphs are supposed to validate each other, but the word order changes significantly between them. For example, in the second graph "sister" is now much more similar to "brother", "father", and "uncle" than to "mum" or "aunt".

Line 368: What is data "expression"? I think this section and its corollary in the literature review should be labeled "training data" or something clearer.

Lines 409-10: Embeddings trained on PubMed and clinical notes are indeed better at representing medical terms, but they are not necessarily better at representing the conversational and social data that is cited in the paper and is a prime use case for NLP models for psychology.

Line 410: I would add "use training data produced by balanced demographic groups" to the recommendations in this section (or, if you disagree, explain why).
Domain-specific data addresses part of this concern, but even within a domain, we also want our data to be produced by balanced demographic groups.

Section 4.2.2: This section should also call out the extensive weaknesses of these "debiasing" methods. See, for example: Gonen, Hila, and Yoav Goldberg. 2019. "Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But Do Not Remove Them." arXiv [cs.CL]. http://arxiv.org/abs/1903.03862.

Section 4.3.1: What are reflexivity statements? Some more detail here would be useful.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Sandhya Gopchandani

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.
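For readers of this review history unfamiliar with the analogy task both reviewers discuss, the underlying vector arithmetic can be sketched as follows. This is a minimal illustration with hypothetical three-dimensional toy vectors (real word2vec/GloVe embeddings have hundreds of dimensions), and it also shows why Reviewer #1's point about capitalization matters: embedding vocabularies key "American" and "american" separately, so lookups must match the training-time casing.

```python
import math

# Hypothetical 3-d toy vectors; real pre-trained embeddings are 100-300 dimensional.
vectors = {
    "american":   [0.9, 0.1, 0.3],
    "irish":      [0.8, 0.2, 0.4],
    "depression": [0.1, 0.9, 0.2],
    "alcoholism": [0.0, 1.0, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c, vecs):
    """'a is to b as c is to ?': return the word nearest to
    vec(b) - vec(a) + vec(c), excluding the three input words."""
    target = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    candidates = {w: v for w, v in vecs.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# "american is to depression as irish is to ?"
print(analogy("american", "depression", "irish", vectors))

# Capitalization matters: this toy vocabulary (like many lowercased
# training corpora) has no entry for "American", only "american".
print("American" in vectors)
```

Whether the seed terms are allowed back into the candidate set (Reviewer #2's question 4) is exactly the `w not in (a, b, c)` exclusion above; dropping it can change which word an analogy "returns".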
| Revision 1 |
British is to depression as Irish is to alcoholism: Artificial Intelligence in mental health and the biases of language based models.

PONE-D-20-06032R1

Dear Dr. Straw,

We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you'll receive an e-mail detailing the required amendments. When these have been addressed, you'll receive a formal acceptance letter and your manuscript will be scheduled for publication.

From an editorial point of view, we request that you change the manuscript's title to 'Artificial Intelligence in mental health and the biases of language-based models', i.e. to remove 'British is to depression as Irish is to alcoholism', as we feel this phrase could be perceived to be pejorative. Please make this change in your manuscript file and in the submission system alongside the other technical amendments.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up to date. If you have any billing-related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible, and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Christopher M. Danforth
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the "Comments to the Author" section, enter your conflict of interest statement in the "Confidential to Editor" section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party) those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: The authors have satisfactorily addressed all of my comments except significance testing. I'm still concerned about the lack of significance tests / error bounds, especially given the types of plots included in the draft. Claiming a linguistic association between a gender or sexuality and a mental health disorder should include some sense of the stability of these results, and since this paper is introducing these concepts to the mental health field, it's important to emphasize such tests and educate about their use. I disagree that it's not possible for the authors to perform some kind of stability test; for example, they could re-calculate the vector distances across permutations of the training dataset (or perhaps re-train the model on some medically relevant dataset). Alternatively, the authors could use one of the many bias tests, like WEAT, that include permutation tests of the seed terms. Indeed, other papers also present untested results in the fashion presented in this paper, but that doesn't mean they were correct in doing so.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No
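The WEAT-style stability check Reviewer #2 recommends can be sketched in a few lines: shuffle the group labels on the seed-term similarity scores many times and ask how often a label-shuffled split produces a gap at least as large as the observed one. The similarity values below are hypothetical illustrations, not numbers from the manuscript.

```python
import random

# Hypothetical cosine similarities between one mental-health term and the
# seed words of two demographic groups (illustrative values only).
group_a = [0.42, 0.38, 0.45, 0.40]
group_b = [0.25, 0.30, 0.22, 0.28]

def permutation_p_value(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference of mean similarities.
    Shuffling the group labels mirrors how WEAT-style tests permute seed terms."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_perm

p = permutation_p_value(group_a, group_b)
print(f"permutation p-value: {p:.4f}")
```

With only four seed terms per group there are just 70 distinct label splits, so the smallest attainable p-value is about 0.03; larger seed sets give the test more resolution, which is one reason seed-term choice (Reviewer #2's question 4 in the first round) matters.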
| Formally Accepted |
PONE-D-20-06032R1

Artificial Intelligence in mental health and the biases of language based models.

Dear Dr. Straw:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff
on behalf of Dr. Christopher M. Danforth
Academic Editor
PLOS ONE
Open letter on the publication of peer review reports
PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.
We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.
Learn more at ASAPbio.