Peer Review History

Original Submission: February 6, 2020
Decision Letter - Xia Li, Editor

PONE-D-20-03484

A novel computational approach for predicting complex phenotypes by deriving their gene expression signatures from public data

PLOS ONE

Dear Dr. Ivanov,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Aug 23 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Xia Li, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Acknowledgments Section of your manuscript:

[This work was supported by the UK Dementia Research Institute which receives its funding from DRI Ltd, funded by the UK Medical Research Council, Alzheimer's Society and Alzheimer's Research UK. The project was also part-funded by the European Regional Development Fund through the Welsh Government.]

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

 [The author(s) received no specific funding for this work.]

Additional Editor Comments (if provided):

According to the reviewers' comments, my final decision is "Major" revision.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript is well written and understandable. However, I think it should be made clearer in the abstract and within the text that both ArrayExpress and Expression Atlas were used, along with an explanation of why the Expression Atlas was used.

Reviewer #2: Since the data were restricted to starvation and sterility, I would suggest that the title reflect this, as "phenotype" is a very broad category. It is also likely that the spectrum would vary between phenotypes, e.g. colour or height versus binary cases like sterility. The data also cover only Drosophila, so it would be safer to narrow the coverage in the title and the rest of the writing. By all means discuss applicability across multiple organisms and phenotypes in the Discussion, bearing in mind the limitations. It would be good to discuss applicability to different ranges of phenotypes. There is also a claim of capturing environmental influences, which is not really reflected in the results. I would suggest limiting the claims and overall applicability; otherwise a lot of work should be done to substantiate those claims (multiple organisms, multiple ranges of phenotypes, etc.).

Reviewer #3: The paper by Dobril K. Ivanov et al. demonstrated a combination of linear mixed effect models and principal components analyses approach for integrating gene expression data for specific phenotypes from independent labs widely available in public repositories. As a proof-of-concept study to show the promising in-silico approach for inferring phenotypic characteristics behind the published gene-expression data resulting from genetic or environmental perturbations, the data they selected is excellent for testing their methods and hypothesis. The results they presented are partial but convincing. I would like to see this approach be further explored, tested, and improved by more computational researchers. Along this line, since there are still some ambiguous or arbitrary steps or parameters in the study, I recommend that this paper only be accepted after the following major comments/issues are resolved:

The two represented phenotypes (starvation-sensitive and sterile) were nicely investigated and presented. But in the current workflow, there are a few steps that involve manual curation, like the selection of expression data and the representative GO terms. To show the application's universality, please add at least one additional test phenotype of interest for which the related experiments and the significantly enriched GO terms are selected by predefined rules or algorithms.

The current flow diagram in Fig. 1 is quite confusing, at least to me. Please consider separating the signature finding and phenotype prediction into two parts. Meanwhile, the paper already discusses the complexities of the relation between gene-expression changes and phenotypes and describes the limitations of the current approach. If possible, I would like to see an evaluation part at the end of the signature-finding section to evaluate whether the selected phenotype could be significantly represented or identified by the expression profiles of the currently available microarray experiments.

In the cross-validation section, the paper states "the AUC was calculated using the class (control/mutant) probabilities derived from the randomForest package, using the top 200 genes from the molecular signature (based on the p-values from the logistic regression)". How was this 200 determined, and does the choice of top-N influence the AUCs? I would like to know the results for a series of Ns to have a better understanding of this step. I would also prefer a more rigorous or data-driven way to determine the number of signature genes for different phenotypes.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Revision 1

Response to Reviewers

RE: PONE-D-20-03484R1

"A novel computational approach for predicting complex phenotypes by deriving their gene expression signatures from public data" by Ivanov et al.

We would like to express our gratitude to the reviewers for their time and constructive comments on our manuscript, especially during these difficult times.

We have amended the main text of the manuscript and the supplementary materials, taking into account and addressing all of the reviewers' comments. We believe that this has improved the manuscript.

We believe that the detailed response and major revision of the manuscript will be satisfactory to the reviewers and the Editor and that the manuscript will be accepted for publication in PLOS ONE.

Reviewer #1:

The manuscript is well written and understandable. However, I think it should be made clearer in the abstract and within the text that both ArrayExpress and Expression Atlas were used, along with an explanation of why the Expression Atlas was used.

Response:

We are pleased that the Reviewer found the manuscript well written and understandable. We have amended the abstract to state that the publicly available gene-expression data and the data for new experiments were derived from EBI's ArrayExpress and Expression Atlas, respectively. We have also explained why the Expression Atlas was used: it provides normalised gene-expression values and contrasts for individual experiments. Without this, we would have needed to normalise all the raw microarray CEL data files within EBI's ArrayExpress, which would have required a substantial amount of time and resources. The text (section "Predicting freely available experiments for the presence of both phenotypes") has been amended to explain why we used the normalised data within Expression Atlas and not the raw data in ArrayExpress. We think that this explanation makes the applied methodology clearer.

Reviewer #2:

Since the data were restricted to starvation and sterility, I would suggest that the title reflect this, as "phenotype" is a very broad category. It is also likely that the spectrum would vary between phenotypes, e.g. colour or height versus binary cases like sterility. The data also cover only Drosophila, so it would be safer to narrow the coverage in the title and the rest of the writing. By all means discuss applicability across multiple organisms and phenotypes in the Discussion, bearing in mind the limitations. It would be good to discuss applicability to different ranges of phenotypes. There is also a claim of capturing environmental influences, which is not really reflected in the results. I would suggest limiting the claims and overall applicability; otherwise a lot of work should be done to substantiate those claims (multiple organisms, multiple ranges of phenotypes, etc.).

Response:

We thank the reviewer for the constructive criticism and helpful suggestions. We have amended the title along with the manuscript body to reflect the use case of two phenotypes and that the approach was only tested in Drosophila.

The new title is: "A novel computational approach for predicting complex phenotypes in Drosophila (starvation-sensitive and sterile) by deriving their gene expression signatures from public data".

The Abstract and Introduction were amended throughout to be more specific that we have generated gene-expression signatures for two phenotypes in Drosophila. We completely agree that it is safer to narrow the applicability. We have also amended the Discussion to be more precise in the range of claims. To this effect, the Discussion was amended to always specify that the results and claims refer to the two specific Drosophila phenotypes. We also included a new paragraph in the Discussion section regarding different ranges of phenotypes and more specifically binary vs. continuous measures of phenotypes.

Throughout the text we used the term environmental perturbation in a very limited scope. We do not claim that the derived gene-expression results for the two phenotypes capture environmental influences. Rather, we were interested in whether genetic (e.g. gene knock-out) or environmental perturbations can induce the two tested phenotypes. By environmental perturbation we mean, for example, a chemical compound or another type of intervention that has induced the phenotypes under investigation. For example, an Expression Atlas experiment (E-MTAB-3546; 3-week reproductive diapause under cold conditions (11 °C)) was predicted to exhibit the sterile phenotype with a mean mutant probability of 91% across the individual mutants within that experiment. The cold conditions are an environmental perturbation that induces the reproductive diapause. In this specific instance we are able to confidently predict the phenotypic manifestation.

Reviewer #3:

The paper by Dobril K. Ivanov et al. demonstrated a combination of linear mixed effect models and principal components analyses approach for integrating gene expression data for specific phenotypes from independent labs widely available in public repositories. As a proof-of-concept study to show the promising in-silico approach for inferring phenotypic characteristics behind the published gene-expression data resulting from genetic or environmental perturbations, the data they selected is excellent for testing their methods and hypothesis. The results they presented are partial but convincing. I would like to see this approach be further explored, tested, and improved by more computational researchers. Along this line, since there are still some ambiguous or arbitrary steps or parameters in the study, I recommend that this paper only be accepted after the following major comments/issues are resolved:

1. The two represented phenotypes (starvation-sensitive and sterile) were nicely investigated and presented. But in the current workflow, there are a few steps that involve manual curation, like the selection of expression data and the representative GO terms. To show the application's universality, please add at least one additional test phenotype of interest for which the related experiments and the significantly enriched GO terms are selected by predefined rules or algorithms.

2. The current flow diagram in Fig. 1 is quite confusing, at least to me. Please consider separating the signature finding and phenotype prediction into two parts.

3. Meanwhile, the paper already discusses the complexities of the relation between gene-expression changes and phenotypes and describes the limitations of the current approach. If possible, I would like to see an evaluation part at the end of the signature-finding section to evaluate whether the selected phenotype could be significantly represented or identified by the expression profiles of the currently available microarray experiments.

4. In the cross-validation section, the paper states "the AUC was calculated using the class (control/mutant) probabilities derived from the randomForest package, using the top 200 genes from the molecular signature (based on the p-values from the logistic regression)". How was this 200 determined, and does the choice of top-N influence the AUCs? I would like to know the results for a series of Ns to have a better understanding of this step. I would also prefer a more rigorous or data-driven way to determine the number of signature genes for different phenotypes.

Response:

We thank the reviewer for the helpful and thorough comments and suggestions. We are pleased that the Reviewer would like to see this approach further explored, tested, and improved by more computational researchers, and we hope this will happen once the study is in the public domain.

We have addressed all of the comments below:

1. The reviewer suggested adding one additional phenotype to test the general applicability of the methods. While we completely agree that adding more phenotypes would improve the overall methods and design, this would take a considerable amount of time and effort. The overall goal of the presented work was to see whether it was possible to implement a set of methods (a combination of linear mixed effect models, GO terms and principal components) that could be used to predict complex phenotypes using gene-expression data. We also agree that there is scope for automating the generation of the molecular signatures, although this would require a large amount of time and effort and was beyond the scope of our original hypothesis, i.e. that it is possible to predict complex phenotypes using gene-expression data in Drosophila.

This study is a proof-of-concept, which we make clear throughout the manuscript, and this is why we feel that adding an additional phenotype is beyond the scope of this manuscript.

2. We agree that Figure 1 was confusing, and we have separated the signature finding and phenotype prediction into two parts to make it clearer.

3. Yes, to make the evaluation part clearer, we have amended the Discussion section to include a more detailed discussion of whether a particular phenotype can be represented or identified/predicted. We further describe that some phenotypes may not be well predicted by changes in gene expression and could be better represented or predicted using other types of data, for example methylation. In addition, the type of leave-one-out cross-validation that we perform tests precisely the question raised by the reviewer. That is, we leave out one whole experiment (for example, all the controls/mutants that are part of the crol experiment), create the molecular signature with the rest of the controls/mutants, and test whether we can predict the controls/mutants that were left out. This ensures that the overall AUC reflects the ability of the selected microarray experiments to predict the phenotype of interest. Of course, to gain a better understanding of which types of phenotypes, and how many, can be predicted with gene-expression data, we need to investigate a large number of such phenotype and gene-expression combinations. All of the above is described in detail in the Discussion section as suggested by the Reviewer.

4. We agree with the reviewer that the selection of the 200 genes to predict and calculate the AUC is relatively arbitrary, and that is why we performed a series of additional experiments, which are included in the revised manuscript. We selected a range of numbers of top genes and performed the leave-one-out cross-validation for each of these for all principal components. This ranged from 50 to 3,000 genes (15 different numbers of top genes). Overall, these were 120 leave-one-out cross-validations and AUCs for the starvation-sensitive and sterile phenotypes respectively. These new results are summarised in detail in supplementary figures 6 and 7 (Figs S6 and S7). Furthermore, we have amended the Materials and Methods and Results sections to describe these analyses (sections "Leave-one-out cross-validation" and "Determining the number of PCs for unwanted variation" in the Materials and Methods and Results, respectively). For the starvation-sensitive phenotype there was little difference when varying the number of top genes, although there was a trend for a higher number of top genes to deliver a higher AUC. For the sterile phenotype the opposite trend was noted: fewer genes resulted in a better AUC. This could potentially be caused by the size of the transcriptional network responsible for the phenotype; for example, it has been previously reported that starvation stress resistance involves a transcriptional response of ~25% of the genome in Drosophila. Nevertheless, this is speculation, and a large number of phenotypes would need to be examined to assess this and gain further understanding. All of the above is described in detail in the Results section and the data are summarised in the Supplementary materials.
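
The leave-one-experiment-out scheme and the sweep over the number of top genes described in points 3 and 4 can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' pipeline: the manuscript ranks genes by logistic-regression p-values and takes class probabilities from the R randomForest package, whereas this sketch substitutes a mean-difference ranking and a signed-sum score so that it runs with the standard library alone. It also ignores the principal-component step. The data are synthetic, and all names (`make_toy_data`, `rank_genes`, `score`, `leave_one_experiment_out_auc`) are hypothetical.

```python
import random
from statistics import mean


def make_toy_data(n_experiments=6, per_group=4, n_genes=100, n_signal=10, seed=0):
    """Synthetic samples: mutants (label 1) are shifted in the first n_signal genes."""
    rng = random.Random(seed)
    samples = []
    for e in range(n_experiments):
        for label in (0, 1):
            for _ in range(per_group):
                expr = [rng.gauss(0, 1) for _ in range(n_genes)]
                if label == 1:
                    for g in range(n_signal):
                        expr[g] += 2.0
                samples.append({"experiment": f"exp{e}", "label": label, "expr": expr})
    return samples


def auc(labels, scores):
    """Rank-based AUC: chance that a random mutant outscores a random control."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_genes(samples):
    """Toy stand-in for the p-value ranking: order genes by |mutant - control| mean."""
    n_genes = len(samples[0]["expr"])
    diffs = []
    for g in range(n_genes):
        mut = mean(s["expr"][g] for s in samples if s["label"] == 1)
        ctl = mean(s["expr"][g] for s in samples if s["label"] == 0)
        diffs.append(mut - ctl)
    order = sorted(range(n_genes), key=lambda g: -abs(diffs[g]))
    return order, diffs


def score(sample, genes, diffs):
    """Toy stand-in for a class probability: higher means more mutant-like."""
    return sum(sample["expr"][g] * (1 if diffs[g] > 0 else -1) for g in genes)


def leave_one_experiment_out_auc(samples, top_n):
    """Hold out each whole experiment in turn, rebuild the signature, pool the scores."""
    labels, scores = [], []
    for held_out in sorted({s["experiment"] for s in samples}):
        train = [s for s in samples if s["experiment"] != held_out]
        test = [s for s in samples if s["experiment"] == held_out]
        order, diffs = rank_genes(train)  # signature built without the held-out experiment
        genes = order[:top_n]
        for s in test:
            labels.append(s["label"])
            scores.append(score(s, genes, diffs))
    return auc(labels, scores)


# Sweep the signature size, as in the top-N analysis described above.
samples = make_toy_data()
sweep = {n: leave_one_experiment_out_auc(samples, n) for n in (5, 10, 50, 100)}
for n, value in sweep.items():
    print(f"top {n:>3} genes: AUC = {value:.3f}")
```

Holding out a whole experiment, rather than single samples, keeps every sample from the same laboratory and batch out of the training set, so the pooled AUC reflects prediction on an experiment the signature has never seen.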

Attachment
Submitted filename: Response_reviewers_Ivanov_predict_phenos_gene_expression.docx
Decision Letter - Xia Li, Editor

A novel computational approach for predicting complex phenotypes in Drosophila (starvation-sensitive and sterile) by deriving their gene expression signatures from public data

PONE-D-20-03484R1

Dear Dr. Ivanov,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Xia Li, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

My final decision is also "Accept".

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

Reviewer #4: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: N/A

Reviewer #3: Yes

Reviewer #4: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: I am satisfied that the comments are addressed in the various sections (abstract, introduction and discussion). The environmental influence part could be clarified further.

Reviewer #3: The authors have clarified most of the questions I raised in my previous review. I am glad to see the updated manuscript with a more focused title to reflect the study and clearer descriptions of the pipeline and the evaluation part.

Reviewer #4: I have a suggestion, should the authors wish to consider it. The brackets in the title can be avoided with something like this:

A novel computational approach for predicting complex phenotypes in starvation-sensitive and sterile Drosophila by deriving their gene expression signatures from public data

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Reviewer #3: No

Reviewer #4: No

Formally Accepted
Acceptance Letter - Xia Li, Editor

PONE-D-20-03484R1

A novel computational approach for predicting complex phenotypes in Drosophila (starvation-sensitive and sterile) by deriving their gene expression signatures from public data

Dear Dr. Ivanov:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Xia Li

Academic Editor

PLOS ONE

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.