Peer Review History
| Original Submission October 18, 2020 |
|---|
|
PONE-D-20-32745 Detecting fabrication in large-scale molecular omics data PLOS ONE Dear Dr. Bradshaw, Thank you for submitting your manuscript to PLOS ONE and apologies for the extended reviewing time. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. The worth of your approach is unquestionable, yet several aspects of your manuscript lack depth and maturity, as noted by both reviewers. For example, the applicability of the method and its limitations are not addressed. Another example is the introduction of Benford's law, which raised many questions in the review process. Please submit your revised manuscript by April 12, 2021. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols We look forward to receiving your revised manuscript. Kind regards, Frederique Lisacek Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly Reviewer #2: Partly ********** 2. Has the statistical analysis been performed appropriately and rigorously? 
Reviewer #1: Yes Reviewer #2: No ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The authors present an interesting approach to detect falsification in big datasets by means of Machine Learning and a well-known feature, Benford digit preferences. 
- In the "Methods" section, the authors should make it explicit (or much clearer) that they are comparing TWO approaches based on ML: the first one using the actual copy-number values (not really raw data) as inputs and the second using extracted features (digit frequency) as inputs (not doing so clearly may confuse readers when they reach the corresponding results section).
- In the ML Training section, the authors mention 6 ML methods they have chosen to evaluate but they don't offer any insight into the reasons behind such a choice: Why did they select those 6 ML methods? Were they already known for performing well for this kind of data/problem? Were they expected to perform better than other methods? Were they selected to represent a diverse-enough palette of ML methods?
- In addition to this, the authors should provide a short description and some literature pointers about these methods as they may allow the reader to better understand what these methods do (if not how). E.g., saying that Random Forest is an ensemble decision-tree based method, or that KNN is based on proximity/similarity in the input space and does not actually perform any "learning"... and so on for GBC, NB, MLP, and SVM.
- In the Benford-like Digit Preferences section, the authors should mention that this operation is a relatively simple “feature extraction” operation (I mean, it's not trivial, nor logic, but simple). So, it is this well-informed feature extraction which allows ML models to improve their predictive performance. In addition, they should assess how far the sole use of that feature reduces the prediction problem to a simple classification problem where ML is not really necessary. Would a much simpler method produce similar results?
- In the ML with quantitative data section, the authors mention that they "evaluated the model on simple accuracy". I don't think that using simple accuracy is very informative.
In the context of fake detection it is important to assess how many false negatives and false positives are produced. I would propose the use of F1 metrics as much more informative (thus adequate) than accuracy. In addition, it should be expected that “real-world” data would have a very different distribution of fake-real samples, making accuracy even less adequate (or predictive).
- The authors say that SVM and MLP performed poorly. It is a bit surprising that these 2 methods exhibited such poor performance, but the authors don't elaborate on this: have they tried to investigate why? Could it be due to poor configuration efforts? I feel it was too "easy" to simply exclude them from further analysis.
- What's the meaning of the red asterisks present in Figures 2 and 3? It was not possible for me to figure it out. Such unexplained information may confuse readers.
- In the ML with limited data section, the authors mention they downsampled data, but they don't mention for which kind of fake-generation method that was done, although it seems it was done for resampling. Stating this would take very little space and would make the experiment easier to follow.
- The Discussion section is far too short and doesn't explore the possible implications of the proposed work nor the potential limits and reaches of the method. Among the pertinent issues that I would expect to be discussed are the following:
1. As mentioned before, I would expect the authors to report and analyse the figures concerning the False Positives and False Negatives. Even at such high accuracy values, it would be good to know if the methods under evaluation would more easily miss fake data or produce false positives.
2. In the same sense it would be interesting to determine how the methods perform when the amount of falsified data is different (either higher or lower).
What happens if the distribution of falsified data in the test set is (drastically) different from the one in the training set? Finally, the authors could also (optionally) consider discussing the risk that their method could be used as a "predictor" in an "adversarial attack" approach, allowing the creation of fake data that would be detected as valid by this detector.
Reviewer #2:
### General comments
The article presents an evaluation of different machine-learning approaches to detect fraudulent data, and evaluates their performances on artificial fraud data generated according to three different models: random number generation, resampling and data imputation. The article addresses an important problem for life sciences, but in its current state the evaluation suffers from several weaknesses that should be handled before publication. In particular:
1. The three models used to generate fake data are not justified in a convincing way. Is there any reason to believe that they correspond to actual frauds? If so, examples should be provided. If not, the relevance of the evaluation is questionable. Would it be possible to apply the method to actual fraud data that has been published, detected (and supposedly retracted)?
2. Normally, a comparative evaluation of supervised classification methods requires tuning the parameters of each of them, which was not done here. In R, there are methods to fine-tune all the classical supervised classification methods (I guess similar methods exist for other languages like Python). I strongly recommend using them, identifying the optimal parameters for each method, and redoing the whole performance analysis. The comparison is worthless without this.
3. The main approach defended in the manuscript is to replace the actual measurements (real and fake data) by the first two decimal digits.
The idea relies on Frank Benford's law, according to which the frequency distribution of leading digits from real-life sets of numerical data does not follow a uniform distribution, contrary to what might be expected. This law is invoked like a magical trick in this context: the manuscript does not provide any explanation about the reasons for this law, and it does not indicate why it would apply to the CNV data analysed here. This should be clarified. For example, it is known that one situation in which Benford's law works is for long right-tailed distributions (which is for example the case of gene expression data). The article should at least provide a histogram of the distribution of the real values and discuss its adequacy to Benford's law. Besides, if this is the main idea, the actual distribution of leading digits should be displayed on some figure, for the real and fake data.
4. There is no indication about the usability of the method in real-life conditions. How could the ML programs be trained for a real dataset? Would you recommend generating specific fake data for each one? What about the generalization power of the approach? What would the method give if it were applied to a large collection of actual published data? Would some of these data sets be qualified as fraud?
In summary, I think that the paper addresses an important issue in data science (with applications to life sciences), but in its current state it is not convincing, because of methodological weaknesses in the evaluation of performances, and because there is no indication of the relevance of the models used to generate fake data. I however think these limitations could be addressed in a revised version of the manuscript.
### Specific comments
Line 31. "When asked if their colleagues had fabricated data, positive response rates rose to 14-19%" This question is imprecise and thus the answer impossible to interpret.
Does it mean that 14-19% of the researchers personally know colleagues who fabricated data, or that they are aware of published articles where data fabrication was demonstrated (and the articles thus retracted), or that they have a general awareness of the fact that data fabrication happens?
Line 57. "Frank Benford observed in a compilation of 20,000 numbers that the first digit did not follow a uniform distribution as one may anticipate" It would be useful to explain the reason for this surprising behavior, especially since it is the basis of one of your fraud detection methods.
Line 65. Section "Methods" The computing environment should be described, in particular the language and libraries used for the analysis. I guess all this could be found on the github repository, but we have no guarantee on the long-term sustainability of a github repository, so the minimal information should be provided in the Methods section, as recommended for scientific publications.
Line 81. "Three different methods of varying sophistication are used for fabrication: random number generation, resampling with replacement and imputation" Is there any example of actual (demonstrated) frauds that use this kind of data generation? If yes, citations should be provided. If not, it calls into question the practical relevance of the evaluation.
Line 86, section Real Data. This section should describe the dimensions of the real data set (number of features). The info comes below, but it is expected to be found here. Has the real data been published? If so, could you provide the reference of the publication, the data repository and the accession number? Could you also provide the URL of the CPTAC portal mentioned in this section?
Line 103. "Then we iteratively nullified 10% of the data and imputed these NAs with missForrest until every value has been imputed" What is the principle of this method?
Do you impute the values based on the neighboring cells in the rows (samples), columns (features), or both? This matters since the imputation should reflect the likely method used by people who generate fraud data. Moreover, the way the imputation is done is likely to affect the machine-learning performances.
L122, section "Machine learning training". It would be good to compute the performance
L96, "For every gene locus, we first find the maximum and minimum values observed in the original data. A new sample is then fabricated by randomly picking a value within this gene specific range" and further L158. "the random data clusters far from the real data" Do you mean you used a uniform distribution to generate random numbers? If so, it is not surprising that these fake samples cluster far away from the real data and other fake data. Why did you use such a model rather than some random number model closer to the data? For example, a multivariate normal model whose parameters (correlation matrix) have been estimated on the real data. This would be a much more relevant way to generate random numbers.
L182. The abbreviations are missing for several methods (NB, RF), whereas they are used in the text and figures.
L196. The theoretical baseline accuracy is 66% according to the training/testing class sizes. It would be worth checking empirically the untrained performances of the different ML methods, by computing the accuracy in an "untrained" mode, i.e. by randomly permuting the training and testing labels. In principle this should return accuracies of ~66%, but there are sometimes tricky issues, so it is worth testing it. It would also be useful to plot the baseline + untrained performances on the accuracy box plots.
L182. The parameters used for each ML method should be provided (either here or in the Material and Methods section).
L190. 
The accuracy is not a sufficient parameter to evaluate the performances of a 2-group classifier aiming at detecting one particular case (declare as "positives" the fake samples). For each method, you should compute the sensitivity and false positive rate. The results of the different methods could be displayed on a classical Sn / FPR plot (in addition, if you tune some quantitative parameters you could draw a ROC curve).
L 194. "SVM and MLP performed poorly compared to other classification methods". I suspect this comes from the fact that you let all the methods run with their default parameters. In particular, SVM results vary hugely depending on the choice of the kernel, and the optimal kernel is a case-by-case affair, so you should absolutely test the performance of the different kernels (linear, radial, polynomial, sigmoid). Actually, a comparative evaluation requires tuning the parameters of each ML method, which was not done here. In R, there are methods to fine-tune all the classical supervised classification methods. I strongly recommend using them and redoing the whole performance analysis.
L225. "One challenge for machine learning in our data is that the number of features (~17,000) far exceeds the number of samples (75). We therefore explored ways to reduce or transform the feature set, and also to make the feature set more general and broadly applicable." This is a very strange motivation for using a digit preference approach, which looks a bit like a magical trick in this context. If the goal is to reduce the over-dimensionality of the feature space, a first and obvious option would have been to train the classifiers on the first components (this is a very classical approach). Another possibility would be to test any classical method for feature selection.
L225. Why are there 17,000 features in the original dataset? There are ~50,000 genes in the current annotations of the Human genome.
L232 "the decimal of each gene expression value". 
Are we speaking of CNV or transcriptome?
L245. "Converting all measured variables to digit frequencies circumvents this problem. For instance, if you had a data set of CNA and transcriptomic data a machine learning model could not train and test on both of these. " I don't see any reason for this claim. If both CNV and expression data are real numbers (which is the case) they can perfectly well be combined in a feature matrix to feed the ML methods. The fact that their range and distributions would differ might pose a problem for some methods, but most if not all of the methods you used are equipped to handle variables with different data ranges. And in any case you did not check whether the data distribution fits the assumptions underlying the different methods. "The features in these datasets would differ in the number of features and what these features represent. " This makes no sense. If you combine CNV and expression data for the same samples, the number of features (genes) should in principle be the same for the two datasets, and they should thus be balanced (which, in any case, would not even be a prerequisite for combining them). In addition, all the ML methods are classically used to analyse features representing different things (e.g. size, weight, fat content, protein content,...); this is the essence of multivariate analysis.
L254. "over the 50 trails" Did you mean "trials"?
L283. "Surprisingly, these top performing models (GBC and Random Forest) do not drop below 95% accuracy until they have less than 20 gene-features." Why is this surprising? This simply reflects the fact that the fake data is simple to detect, which may come from the way they are generated. What I find surprising here is that you can learn something from the distribution of the first two digits (i.e. 10 x 10 numbers) computed from only 20 features. This means that each pair of digits is expected to be found 0.2 times in the data. 
I am thus very skeptical about this result, and I suspect there is a trick somewhere. I would suggest that you check how the distribution of digits evolves (separately for the real and for each fake data) as you reduce the dimension of the feature space, and see whether there is some bias.
L281. "In the 100 gene-feature trial, both Naive Bayes and KNN have a significant drop in performance" This drop in performance on Figure 4 may be a visual artifact resulting from the arbitrary numbers of features chosen for your analysis: you increase the number of features by steps of 10 until 100, then you jump from 100 to 500, then to 1000, 2000 and you increase by steps of 2000. If the goal is to display the impact of N over both the small and large ranges, you would do better to use an XY plot with a logarithmic X axis. Also, it would be worth exploring the region between 100 and 500, since this is the place where you claim to observe a drop. I would recommend adding a measurement of the performances with 200, 300, 400 features, respectively.
********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: Yes: Jacques van Helden (ORCID 0000-0002-8799-8584) [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] 
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. |
| Revision 1 |
|
PONE-D-20-32745R1 Detecting fabrication in large-scale molecular omics data PLOS ONE Dear Dr. Bradshaw, Thank you for submitting your manuscript to PLOS ONE and for your patience. This manuscript is very well received and the stakes are rather high if such work is not given the attention it deserves. The selection of fair and expert reviewers is the main reason for the delay. This expertise is rare. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. The first round of reviews revealed issues that you have attended to and the new reviewer spotted a last issue regarding the processing of real datasets that you need to consider. This is rather minor in terms of effort on your part and will be major in terms of impact. Please submit your revised manuscript by Nov 01, 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Frederique Lisacek Academic Editor PLOS ONE Journal Requirements: Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. 
If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: All comments have been addressed Reviewer #3: (No Response) ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #3: Partly ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #3: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #3: Yes ********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. 
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #3: Yes ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #1: 2nd review
- Response 4: "the exploration of alternative methods is outside of the scope of our manuscript" --> I was expecting a comment about that possibility rather than a deep exploration. One could discuss this issue based, for example, on the fact that k-NN (the least ML-like of the methods presented) also improves drastically (significantly :-) its predictive performance.
- A comment I didn't think of in the first review that could enrich the discussion concerns the potential of using a similar approach, but in an unsupervised (or semi-supervised) manner, for detecting "anomalies" in datasets so as to flag potential falsifications (as done in fraud or cyberattack contexts) without having a training set of already known fake-data strategies. (For future work, not to be addressed now)
- Finally, I haven't addressed the responses to the comments from the 2nd reviewer as I find he or she would be best placed to judge their quality.
Minor comments
- Line 131: per-se instead of per-say
- Line 233: we tested FIVE different methods (not six)
- Line 247: "the remaining four models" --> Not clear which are the "remaining" models or even why that word is used here.
Reviewer #3: I love the idea behind this paper. Fraud is a very significant problem within scientific research, particularly for increasingly data-rich subject areas, and should be a concern for all of us in this community. 
Taking steps to develop tools to detect fraud is a key pillar of addressing this issue, alongside good Open Science practices that ensure transparency and replicability throughout the research chain (including in peer review!). Unfortunately, however, I have some significant concerns about this manuscript as it stands, and I'm not convinced it's ready to be published. I outline those concerns below. First though, I would like to strongly encourage the authors to continue developing this manuscript, despite the significant review and publication delays since the first BioRxiv preprint. I have no doubt this will be a valuable piece of work in this important area.
Primary concerns:
1. The authors use three different mechanisms for generating fake data: random number generation, resampling and imputation. These are implemented with the aim of approximating the real data as accurately as possible. It is far from clear to me that any of these strategies reflect strategies scientists would actually use to fake data; however, they are as reasonable as any other strategies given the lack of an evidence base on this. The way they are used, however, is where my concern lies. What motivation would someone have to fake data that reflects the real data and doesn't generate any clear 'result'? What the authors have here is a model that detects simulated data with the same characteristics as the current data. I suppose it is possible that scientists might want to fake (simulate) extra samples with similar statistical characteristics to their real data so as to inflate the sample number in their experiments, but it seems far more likely to me that scientists would try to fake data to generate a result. For this paper, the simplest fake result to add to the data would be a shift in the CNV value to higher or lower values for specific subsets of samples or (more likely) for specific genes within specific subsets of samples. 
This more realistic test would be simple to implement within the three methods used here. To summarise: in order to be convinced that these methods are useful, I want to see their performance on a real-world dataset with a CNV-treatment result in it, with different types of faked data (global up-/down- & specific gene up-/down-) added to either enhance/deplete the significance of the result, or to add new results to the data. I'd also like to see how models trained on the fake data with these signals in them perform; retraining of the models here would probably require a more nuanced investigation for the training in order to avoid training the models just to recognise the up-/down-regulation, rather than the other characteristics of the fake data.
2. For the model trained on the two decimal digits, the models are essentially being trained to detect data that doesn't obey Benford's law. The authors haven’t demonstrated that the machine learning models outperform the far simpler process of making the appropriate histogram, fitting a curve based on Benford's law to it, and seeing if you get a decent fit (with a KS test, for example). It's possible that the ML models outperform this simple test, but the authors need to do this comparison to motivate the use of the more complex and opaque ML algorithms.
3. Figures 2 & 3 contain boxplots suggesting that some of the models have zero variation in their performance across different data subsets. In some cases this is because the classifiers apparently have perfectly good/bad accuracy (which I am deeply suspicious of and seems too good to be true) but in some cases it's perfectly consistent accuracy performance (e.g. Fig 2 panel C). These results seem to be in disagreement with Figure 4, which suggests that the average accuracy performance never reaches 100% for any of the models. Something isn't right here. 
The authors need to carefully inspect their methods, reconcile these figures, and either convincingly justify the perfectly good/bad/consistent performance or (more likely) fix the bug that’s causing these.
Detail comments:
1. Line 69/70. I think the readers would benefit from adding some clarity on the limitations of Benford's law here. In particular, it's only really valid for data that spans several orders of magnitude, and for data where the upper/lower limits are not tightly bounded.
2. Line 86. I disagree with the statement that "making up data is always wrong"; a bit more nuance is needed here. Firstly, simulating data has a long history of being informative in many areas of science. Secondly, there is a grey area here around imputed data, and particularly the imputation of missing data. It is, for example, commonplace to model a covariate in order to impute missing values in this data, and then to use this covariate data - including the imputed data - in a second model which leads to interpretation. From a certain perspective (my perspective, for example!) this could be seen as 'making up data' that has a direct impact on results/conclusions (depending on the scale of the missing data). This is certainly not widely considered wrong or inappropriate.
3. Line 111-113: does this mean that some of the samples are represented twice in the 150 sample dataset, albeit with 10% imputed data? How do we know that the ML models are learning to separate the fake samples from the real based on the imputed data signal, rather than needing both a duplicated sample and the imputed data signal? If you added samples that aren't in the original data, with a 10% imputation, would the performance of the models be as good?
4. Line 166: "Machine learning cannot…" I know what you're getting at, but this is not well worded. I think you want to say something like: "Trained ML models are restricted to data that conform to the model input specifications (i.e.
the same number of input features, for example)."
5. Line 168: I think it would be worth noting here that the generalizability of this model comes with a cost - it will only work for data where Benford's law should be valid, which is certainly not all datasets.
6. Line 177: "tiddy-verse" should be "tidyverse".
7. Line 233: "six" should be five - this needs checking throughout the paper, since the number of models used has changed through the review process.
8. Line 247: "The remaining four models…". I think this region of text has been re-ordered quite a bit during the review process and it doesn't make much sense now, since we haven't had the results for the fifth model yet at this point. I think the authors need to give this section a careful read and make sure it flows sensibly.
9. Line 286: "[29149684]" I think this should be a reference?
10. Line 289: "While Benford's law…" The shift to use the decimal point digits rather than the leading digit is necessary because of the constraint that Benford's law works (best) for numbers spanning several orders of magnitude. This is not the case for the first digit in the CNV data, but this value is usually a non-zero value so the first and second digits necessarily span orders of magnitude. This is a dataset-specific approach, though. In a dataset comprised mainly of numbers between 0 and 0.09, you would need to use the third and fourth decimal point digits. This would be worth illuminating here.
11. Line 301-307: "Machine learning typically…" repetition of previous text and discussion. I think this can be removed.
12. Line 339. I'm not very surprised at the reasonable performance with as few as 10 genes here. 10 genes x 75 samples = 750 datapoints. This is plenty to build a histogram to compare with the Benford's law curve (the equivalent of which is what the ML models are learning to do).
13. Line 353: I think "per data point" should be "per sample" here.
14. 
Figure 2: I can't really see the details of this figure well - the resolution is quite low in the PDF embedding. It would be useful to explain the components of the box plot (median, quartiles, indents, etc) for those not familiar with boxplots.
15. Supp. Fig. 3. This figure is really useful (I'd put it in the main paper) but it's a nightmare to read because it's very busy. I suggest that the authors split the figure into four facet panels with one dataset per panel.
**********
7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: Yes: Carlos Peña-Reyes
Reviewer #3: Yes: Dr Nicholas Schurch
[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. |
| Revision 2 |
|
Detecting fabrication in large-scale molecular omics data PONE-D-20-32745R2 Dear Dr. Bradshaw, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Frederique Lisacek Academic Editor PLOS ONE Additional Editor Comments (optional): Reviewers' comments: |
| Formally Accepted |
|
PONE-D-20-32745R2 Detecting fabrication in large-scale molecular omics data Dear Dr. Bradshaw: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Frederique Lisacek Academic Editor PLOS ONE |
Open letter on the publication of peer review reports
PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.
We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.
Learn more at ASAPbio.