Peer Review History

Original Submission: August 17, 2023
Decision Letter - Sathishkumar Veerappampalayam Easwaramoorthy, Editor

PONE-D-23-25703
Predicting malaria outbreak in The Gambia using machine learning techniques
PLOS ONE

Dear Dr. AJADI,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 14 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Sathishkumar Veerappampalayam Easwaramoorthy

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Thank you for stating the following financial disclosure: 

"This work was supported by the Deanship of Research Oversight and Coordination at King Fahd University of Petroleum and Minerals."

  

Please state what role the funders took in the study.  If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." 

If this statement is not correct you must amend it as needed. 

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

4. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

5. We note that [Figure 1] in your submission contain [map/satellite] images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a. You may seek permission from the original copyright holder of Figure 1 to publish the content specifically under the CC BY 4.0 license.  

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

The following resources for replacing copyrighted map figures may be helpful:

USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/

The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/

Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html

NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/

Landsat: http://landsat.visibleearth.nasa.gov/

USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/#

Natural Earth (public domain): http://www.naturalearthdata.com/

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper presents a study on comparing eight machine learning algorithms to predict malaria outbreaks in each district in The Gambia using historical meteorological data.

The model inputs and outputs should be clearly stated.

The 4 climatic variables are shown, but what about the other variables? Are the climatic variables monthly averages or yearly averages? Some clarifications should be given.

Reviewer #2: Despite the relevance of the topic, the article in its current form would need too many adjustments for a major revision. Below are some examples of issues.

The abstract is confusing. For instance, the authors state, "An early warning system that can accurately forecast malaria outbreaks years in advance would be helpful to policymakers to put in measures in reducing morbidity and mortality rate." However, this statement is a justification or conclusion rather than contextualization. This type of argument could be used in the introduction as a justification.

Additionally, the authors present some future research directions in the abstract. I recommend the authors to discuss them in the conclusion.

I recommend avoiding short sentences like "It also performs the worst in specificity analysis."

In the introduction, if possible, could you provide specific statistics on malaria in the Gambia?

The authors could clarify the statement on machine learning and large datasets in the introduction. In the literature, especially for healthcare, there is evidence of the relevance of using machine learning and small datasets.

The authors should state the specific machine learning algorithms they experimented with in the introduction. Additionally, they should justify the choice of such algorithms.

In the introduction, the authors should discuss the challenges of forecasting malaria outbreaks using machine learning.

In Related Works, the authors state, "by combining and aggregating the two datasets." Which are the two datasets?

In Related Works, the authors state, "Extreme gradient boosting (XGBoost) performed better than the other model with 96.26% accuracy." What models?

There is a sentence with no end in the related works section: "Similarly, using clinical data from PubMed abstracts from 1956 to 2019,"

The authors should review the acronym definition. For instance, some acronyms are defined in the incorrect part of the text (in the first mention).

The related works section should include a comprehensive discussion on limitations of previous studies. This discussion would clarify the contribution of the new study.

The following sentence is confusing: "proposed a hybrid classification and regression model to predict the disease outbreak using data on the malaria outbreak,"

Fig. 1 is not cited and explained in the text.

As "several studies [8,22] have shown malaria incidence is influenced by climatic factors such as rainfall, temperature, and humidity," what is the specific contribution of your study?

Did the authors need ethics committee approval to handle the data?

The article must include a more convincing justification for choosing the ML algorithms.

The authors state, "All the models were built using 70% of the data as the training set and the remaining 30% as the testing set." Did you apply holdout or k-fold cross-validation (or both)?

The article needs to include more justification for the choice of performing oversampling.

The article needs to include more justification for the choice of performance metrics.

It needs to be clarified if the authors applied a method for hyperparameter tuning.

The authors should present a more detailed discussion of the specificity results. Are they not relevant to your scenario?

The quality of all figures is poor.

----------------------

The grammar and spelling need substantial revision. Additionally, clarity and readability need significant improvement. For instance:

However, [9] also reported --> However, Kalipe et al. [9] also reported

Please, for all references with the format, for instance "[13] predicts malaria", change for Author [13] or Author et al. [13].

this paper aim --> this paper aims

maturity, therefore --> maturity. Therefore,

Support vector machine (SVM) --> support vector machine (SVM)

multilayered perceptron --> multilayer perceptron

support vector --> SVM?

Reviewer #3: Cite current related works.

Justification for data normalization is needed.

The mathematics of the ML models are needed.

Explain upsampling in the context of training and test data (cross-validation).

Why was 10-fold validation used?

Explain the concept of classification versus regression.

Discuss and link with related works.

How can the dataset be validated?

The features of the data should be shown (summary statistics).

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Revision 1

Reply to comments by Academic Editor on “Predicting malaria outbreak in The Gambia using machine learning techniques” (PONE-D-23-25703).

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.

Thank you for your message. We appreciate the guidance regarding file naming requirements for PLOS ONE. We have thoroughly reviewed our manuscript and taken the necessary steps to ensure that it complies with all the specified requirements, including those related to file naming.

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work.

We appreciate your diligence in ensuring transparency and reproducibility in the publication process. In our study, we used built-in functions from the caret library in R for model training and analysis. These functions are part of the R environment and were not custom-authored by us, so we have no specific code snippets to provide for these operations. We have added the following statement to the Methods section; it can be found on page 4.

“In conducting the analysis for prediction, we utilized a robust set of tools and software to ensure accuracy and reliability. All the statistical procedures conducted in this study were implemented using the most recent version of R, a widely recognized open-source statistical tool known for its versatility and extensive data analysis capabilities. Specifically, the predictive modeling and analysis were performed using the caret package \\cite{kuhn2008building} in R. The package provides a comprehensive framework for building, training, and evaluating predictive models, making it particularly suitable for our study's objectives.”

3. Thank you for stating the following financial disclosure:

"This work was supported by the Deanship of Research Oversight and Coordination at King Fahd University of Petroleum and Minerals."

Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

This work was supported by the Deanship of Research Oversight and Coordination at King Fahd University of Petroleum and Minerals. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

4. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available.

The clinical dataset utilized in this study, containing monthly records of malaria cases, deaths, and population size for each district from January 2013 to December 2021, was obtained from the health management information system (HMIS), Directorate of Planning and Information under the Ministry of Health in The Gambia. Access to this dataset is subject to legal and ethical restrictions, and as such, it cannot be made publicly available. However, interested parties may request access to the data by contacting the Directorate of Planning and Information, Ministry of Health, where the data was originally sourced. The program’s manager can be reached on this email: Abdoulie52000@yahoo.com

The meteorological data, including temperature, rainfall, and relative humidity, were collected from the Department of Water Resources, Meteorological Division under the Ministry of Fisheries, Water Resources, and National Assembly Matters in The Gambia. Monthly average readings from nine weather stations across the country were used in the analysis. Unfortunately, due to legal and ethical considerations, we are unable to share the raw meteorological dataset. However, interested researchers can obtain access to this data by contacting the Department of Water Resources, Meteorological Division. The office can be contacted using this email: info@mofwr.gov.gm

We acknowledge the importance of transparency in research, and while we cannot provide unrestricted public access to the datasets, we are committed to facilitating access within the bounds of legal and ethical constraints.

5. We note that [Figure 1] in your submission contain [map/satellite] images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

Thank you for bringing this to our attention. We would like to clarify that the map of The Gambia showing major meteorological stations, presented in Figure 1, was generated using R software as an original creation to illustrate the geographical distribution of meteorological stations. No copyrighted material or proprietary data sources were used in the creation of this figure.

We fully understand and respect PLOS' copyright policies and would appreciate further clarification on the specific aspect of the figure that has raised concerns. We are committed to compliance and are open to making any necessary modifications to ensure adherence to copyright guidelines.

Reply to comments by reviewers on “Predicting malaria outbreak in The Gambia using machine learning techniques” (PONE-D-23-25703).

The authors are thankful to the Editor and the anonymous reviewers for providing an opportunity to further improve our paper. The paper has been revised to address all the suggested points. Point-by-point replies to all of the referees' comments follow (replies are written in bold, and the changes in the manuscript are highlighted in yellow).

Reply to reviewer 1

Reviewer #1: This paper presents a study on comparing eight machine learning algorithms to predict malaria outbreaks in each district in The Gambia using historical meteorological data.

The model inputs and outputs should be clearly stated.

Response: Thank you for your suggestion. The model inputs and outputs are now explicitly outlined in Table 1, in the 'Summary of the Dataset' section under 'Results'.

The 4 climatic variables are shown, but what about the other variables? Are the climatic variables monthly averages or yearly averages? Some clarifications should be given.

Response: Thank you for the suggestion. Based on your feedback, we have incorporated a dedicated subsection titled "Climatic Variables and Malaria Incidence Patterns" in the revised version. This new subsection discusses and presents results related to the climatic variables, emphasizing their influence on predicting malaria outbreaks in The Gambia. The choice to highlight climatic variables aligns with the study's overarching goal of addressing the challenges in forecasting malaria outbreaks, where the complex interplay of factors, including climatic conditions, calls for a targeted exploration to improve predictive modelling accuracy. The climatic variables are monthly averages; the text on page 5 has been adjusted to state this.

Reply to reviewer 2

Reviewer #2: Despite the relevance of the topic, the article in its current form would need too many adjustments for a major revision. Below are some examples of issues.

The abstract is confusing. For instance, the authors state, "An early warning system that can accurately forecast malaria outbreaks years in advance would be helpful to policymakers to put in measures in reducing morbidity and mortality rate." However, this statement is a justification or conclusion rather than contextualization. This type of argument could be used in the introduction as a justification.

Response: Thank you for your suggestion. The statement has been removed from the abstract and added to the Introduction.

Additionally, the authors present some future research directions in the abstract. I recommend the authors to discuss them in the conclusion.

Response: Thank you. We have added it in the conclusion.

“This research can be enhanced in the future through the implementation of a hybridized ensemble approach in our machine learning models. By integrating various methodologies, the resulting model is poised to achieve increased reliability and robustness. This step aligns with the evolving landscape of machine learning techniques and ensures our predictive models remain at the forefront of predictive modelling.

Moreover, expanding the scope of our analysis to include additional factors such as treated mosquito nets, indoor residual spray coverage, and other pertinent predictors could contribute significantly to the comprehensive understanding of malaria dynamics in The Gambia. These elements, though not addressed in the current study, represent crucial components that warrant consideration in future investigations.

In essence, this study lays the groundwork for subsequent studies to build upon, incorporating advanced methodologies and expanding the array of variables considered.”

I recommend avoiding short sentences like "It also performs the worst in specificity analysis."

Response: We have made the necessary correction.

In the introduction, if possible, could you provide specific statistics on malaria in the Gambia?

Response: Yes. We have added statistics on malaria in The Gambia to the Introduction. Please see lines 10-20.

The authors could clarify the statement on machine learning and large datasets in the introduction. In the literature, especially for healthcare, there is evidence of the relevance of using machine learning and small datasets.

Response: Thank you for the suggestion. We have revised the introduction to provide clarification on machine learning's relevance to both large and small datasets in the healthcare literature. Please find the changes in line no. 25-30.

The authors should state the specific machine learning algorithms they experimented with in the introduction. Additionally, they should justify the choice of such algorithms.

Response: We have addressed your suggestion by explicitly stating the machine learning algorithms used in the introduction. Furthermore, we have provided justification for the selection of these algorithms. All changes can be found in lines 41-46.

In the introduction, the authors should discuss the challenges of forecasting malaria outbreaks using machine learning.

Response: We have added the challenges of forecasting malaria outbreaks using machine learning in the Introduction.

In Related Works, the authors state, "by combining and aggregating the two datasets." Which are the two datasets?

Response: We appreciate your observation. In the updated manuscript, we have explicitly mentioned that the two datasets refer to historical meteorological data and records of malaria cases. Please find these changes in the Related Works section, lines 60-63.

In Related Works, the authors state, "Extreme gradient boosting (XGBoost) performed better than the other model with 96.26% accuracy." What models?

Response: Thank you for the observation. The XGBoost model was compared with various machine learning models such as K-nearest neighbor, Naïve Bayes, and support vector machine, among others. We observed that XGBoost performed better than the other models with 96% accuracy.

There is a sentence with no end in the related works section: "Similarly, using clinical data from PubMed abstracts from 1956 to 2019,"

Response: Thank you for the observation. It has been updated.

The authors should review the acronym definition. For instance, some acronyms are defined in the incorrect part of the text (in the first mention).

Response: Thank you for the suggestion. We have carefully reviewed and adjusted all the acronym definitions.

The related works section should include a comprehensive discussion on limitations of previous studies. This discussion would clarify the contribution of the new study.

Response: Thank you for the insight. We have updated the Related Works section as per your suggestion; the changes can be found in lines 118-125.

The following sentence is confusing: "proposed a hybrid classification and regression model to predict the disease outbreak using data on the malaria outbreak,"

Response: Thank you for noticing. We have clarified it.

Fig. 1 is not cited and explained in the text.

Response: Thank you for bringing this to our attention. We would like to clarify that the map of The Gambia showing major meteorological stations, presented in Figure 1, was generated using R software as an original creation to illustrate the geographical distribution of meteorological stations. No copyrighted material or proprietary data sources were used in the creation of this figure. We have cited the package that we employed.

As "several studies [8,22] have shown malaria incidence is influenced by climatic factors such as rainfall, temperature, and humidity," what is the specific contribution of your study?

Response: The specific contribution of our study has been added to the “Dataset used” section on page 5.

“The specific contribution of our study lies in utilizing two distinct datasets (historical meteorological and clinical datasets) spanning nine years (January 2013 to December 2021) to predict malaria outbreaks in each district of The Gambia. While previous studies have highlighted the influence of climatic factors such as rainfall, temperature, and humidity on malaria incidence \\cite{thomson2005use,ceesay2010continued}, our approach integrates machine learning techniques to provide a more accurate and district-specific prediction of malaria outbreaks. This extends the existing knowledge by leveraging advanced analytical methods for a targeted and nuanced understanding of the impact of climatic conditions on malaria transmission at a local level.”

Did the authors need ethics committee approval to handle the data?

Response: The dataset utilized in our study was provided by the relevant authorities, specifically the Ministry of Health and the Department of Water Resources. As the data involved anonymized and aggregated information and was obtained through official channels, no formal ethical approval was required for its usage. We ensured strict adherence to data protection regulations and maintained the confidentiality and privacy of the information throughout the study.

The article must include a more convincing justification for choosing the ML algorithms.

Response: Thank you for your suggestion. We have given a general justification for the use of the ML algorithms in the “Prediction models” subsection on page 6.

The authors state, "All the models were built using 70% of the data as the training set and the remaining 30% as the testing set." Did you apply holdout or k-fold cross-validation (or both)?

Response: We utilized a combination of both methods. Initially, the holdout method was employed by partitioning the data into a training set (70%) and a testing set (30%) for initial validation. While training the model exclusively on the 70% training set, we implemented 10-fold cross-validation to mitigate the risk of overfitting.
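As an illustration only, the combination described in this response (a 70/30 holdout split followed by 10-fold cross-validation inside the training portion) could be sketched as below. This is not the authors' code: the function names `holdout_split` and `k_fold_indices`, the random seed, and the 1,000-row dataset size are hypothetical choices made for the example.

```python
import random

def holdout_split(n_rows, train_fraction=0.7, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * n_rows))
    return indices[:cut], indices[cut:]

def k_fold_indices(train_indices, k=10):
    """Yield (fit, validate) index lists; each training row is validated exactly once."""
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        validate = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validate

# Hypothetical dataset of 1,000 rows: 700 for training, 300 held out for testing.
train_idx, test_idx = holdout_split(1000)

# The held-out test rows never enter the cross-validation folds.
cv_folds = list(k_fold_indices(train_idx, k=10))
```

The point of the design is that the 30% test set is set aside before any fold is formed, so cross-validation scores and the final test score come from disjoint rows.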

The article needs to include more justification for the choice of performing oversampling.

Response: We have added the justification. It can be found on page 9. Thank you.

“By implementing up-sampling in this manner, the training data becomes more representative of the underlying distribution of outbreak and no outbreak cases. This contributes to a more balanced learning process during model training, allowing the machine learning algorithm to better capture patterns within the minority class. It's essential to note that up-sampling is a training-specific technique and does not directly influence the distribution of classes in the testing dataset or during cross-validation.

In summary, the use of up-sampling in the training dataset helps mitigate the challenges posed by class imbalance, fostering improved model generalization and predictive accuracy during the training phase while preserving the natural distribution of classes in the testing dataset and during cross-validation.”
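A minimal sketch of random up-sampling applied to the training set only, assuming simple duplication of minority-class rows with replacement. The manuscript excerpt does not state which up-sampling routine was used, so the helper `upsample_minority` and the toy data are hypothetical.

```python
import random
from collections import Counter

def upsample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing rows with replacement from the existing members of the class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy training set; the 87/13 split mirrors the imbalance reported elsewhere in this response.
train_rows = [[i] for i in range(100)]
train_labels = ["no outbreak"] * 87 + ["outbreak"] * 13

balanced_rows, balanced_labels = upsample_minority(train_rows, train_labels)
class_counts = Counter(balanced_labels)
```

Only `train_rows` is passed through the function; a test set would be left untouched so that it keeps its natural class distribution.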

The article needs to include more justification for the choice of performance metrics.

Response: We have added the justification, which can be found on page 8.

“These selected performance metrics collectively offer a thorough evaluation of the models, capturing aspects of accuracy, sensitivity, specificity, and discriminatory capability. Their inclusion ensures a multifaceted assessment aligned with the diverse requirements of malaria outbreak prediction.”
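To make concrete why accuracy alone is not enough on imbalanced data, the sketch below computes accuracy, sensitivity, and specificity from a confusion matrix, treating "outbreak" as the positive class. The function names and the toy labels are hypothetical; the 87/13 class split is the one reported elsewhere in this response.

```python
def confusion_counts(y_true, y_pred, positive="outbreak"):
    """Count true/false positives and negatives, treating `positive` as the class of interest."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def safe_ratio(numerator, denominator):
    """Return 0.0 instead of raising when a class is absent."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(y_true, y_pred, positive="outbreak"):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": safe_ratio(tp, tp + fn),  # share of real outbreaks that were detected
        "specificity": safe_ratio(tn, tn + fp),  # share of non-outbreaks correctly cleared
    }

# A trivial model that always predicts "no outbreak" on an 87/13 split.
y_true = ["no outbreak"] * 87 + ["outbreak"] * 13
always_negative = ["no outbreak"] * 100
baseline = classification_metrics(y_true, always_negative)
# accuracy is 0.87 and specificity is 1.0, yet sensitivity is 0.0
```

The trivial baseline looks strong on accuracy and specificity while missing every outbreak, which is why sensitivity has to be reported alongside them.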

It needs to be clarified if the authors applied a method for hyperparameter tuning.

Response: We performed hyperparameter tuning using the grid search method and have described it in the manuscript on page 12.

“To optimize the performance of each predictive model, hyperparameter tuning was conducted using the grid search method. This approach involves systematically searching through a predefined set of hyperparameter values to identify the combination that yields the optimal model performance.

For instance, the parameter $k$ in the K-Nearest Neighbors (KNN) classifier was explored across the values $k \in \{3,5,7,9,11\}$, with the grid search determining $k=3$ as the optimal choice. Similar ranges and search procedures were applied to the hyperparameters of other classifiers.”
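A minimal sketch of grid search over $k$ for a KNN classifier, scored by 10-fold cross-validated accuracy. The grid is the one quoted above; everything else (the helper names, the synthetic two-feature data, Euclidean distance, and accuracy as the selection criterion) is an assumption made for illustration and is not taken from the manuscript.

```python
import random
from collections import Counter

K_GRID = [3, 5, 7, 9, 11]  # candidate values quoted in the response

def knn_predict(train_X, train_y, point, k):
    """Majority vote among the k training points closest to `point` (squared Euclidean distance)."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], point)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(X, y, k, n_folds=10):
    """Mean validation accuracy of k-NN over n_folds interleaved folds."""
    fold_scores = []
    for f in range(n_folds):
        val_ids = set(range(f, len(X), n_folds))
        fit_X = [x for i, x in enumerate(X) if i not in val_ids]
        fit_y = [t for i, t in enumerate(y) if i not in val_ids]
        hits = sum(knn_predict(fit_X, fit_y, X[i], k) == y[i] for i in val_ids)
        fold_scores.append(hits / len(val_ids))
    return sum(fold_scores) / n_folds

def grid_search_k(X, y, grid=K_GRID):
    """Score every candidate k and return the best one together with all scores."""
    scores = {k: cv_accuracy(X, y, k) for k in grid}
    best_k = max(scores, key=scores.get)  # ties resolve to the earliest k in the grid
    return best_k, scores

# Synthetic data: two Gaussian clusters standing in for the two classes.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(60)]
y = [0] * 60 + [1] * 60

best_k, scores = grid_search_k(X, y)
```

Which $k$ wins depends entirely on the data, so the value selected on this synthetic set says nothing about the $k=3$ reported by the authors.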

The authors should present a more detailed discussion of the specificity results. Are they not relevant to your scenario?

Response: The specificity results are relevant. We have added a detailed explanation of the specificity results on page 12.

The quality of all figures is poor.

----------------------

The grammar and spelling need substantial revision. Additionally, clarity and readability need significant improvement. For instance:

However, [9] also reported --> However, Kalipe et al. [9] also reported

For all references written in the format "[13] predicts malaria", please change to Author [13] or Author et al. [13].

this paper aim --> this paper aims

maturity, therefore --> maturity. Therefore,

Support vector machine (SVM) --> support vector machine (SVM)

multilayered perceptron --> multilayer perceptron

support vector --> SVM?

Response: Thank you for your suggestion. We have addressed the issues raised and have also improved the grammar.

Reviewer #3: Cite current related works.

Response: Thank you for your feedback. We have incorporated some additional references to current related works in the revised manuscript.

“In Senegal, Diao et al. \cite{diao2023generalized} conducted a significant study in malaria forecasting. They formulated a generalized linear model based on Poisson and negative binomial regression models, considering climatic variables, insecticide-treated bed-nets distribution, Artemisinin-based combination therapy, and historical malaria incidence. The study demonstrated the efficacy of the Poisson regression model and addressed issues of over-forecasting through the saturation of rainfall. In Rajasthan, India, Singh et al. \cite{singh2023leveraging} proposed a hybrid ML algorithm (P2CA $-$ PSO $-$ ANN) for malaria outbreak prediction in districts like Barmer, Bikaner, and Jodhpur. Using meteorological variables, data fusion, and P2CA, the model achieved accurate predictions, outperforming benchmarks. It shows promise as an early warning system based solely on meteorological data.”

Justification for data normalization is needed.

Response: Thank you for your suggestion. We have added the justification for using data normalization on page 5.

“The dataset underwent a comprehensive normalization process to enhance the effectiveness of machine learning algorithms. It is a critical preprocessing step that plays a pivotal role in enhancing the quality and effectiveness of machine learning algorithms. In this study, normalization was applied to numerical independent variables using the min-max normalization technique. This process transforms features into a shared range, mitigating the impact of larger numeric values dominating the model's learning process. Normalization becomes particularly crucial in scenarios where the scale of numerical features varies significantly. Without normalization, features with larger numeric values may exert undue influence on the learning algorithm, potentially overshadowing the contributions of smaller-scale features. The rationale behind data normalization lies in its ability to create a level playing field for numerical features, ensuring that each contributes proportionally to the learning process. This contributes to the overall robustness and reliability of the machine learning models employed in predicting malaria outbreaks. The normalization formula employed in this study, as depicted in Eq. \ref{eq:1}, ensures that all numerical variables are scaled to values between 0 and 1.”
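The referenced equation is not reproduced in this response. The sketch below therefore assumes the standard min-max form, x' = (x - min) / (max - min), and the function name, the feature names, and the values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale a numeric feature to [0, 1] using (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant feature carries no scale information; map it to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

# Two illustrative features on very different scales.
rainfall_mm = [0.0, 12.5, 250.0, 90.0]
temperature_c = [24.0, 31.0, 28.5, 36.0]

scaled_rainfall = min_max_normalize(rainfall_mm)
scaled_temperature = min_max_normalize(temperature_c)
```

After scaling, both features span the same range, so a distance-based learner such as KNN no longer lets the rainfall column dominate simply because its raw values are larger.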

The mathematics of the ML models are needed.

Response: Thank you for your feedback. We have added some of the mathematics behind each of the ML models.

Explain upsampling in the context of training and test data (cross-validation).

Response: Thank you for your suggestion. We have added a detailed explanation of upsampling in the context of training and test data (cross-validation) on page 9.

“In the context of training and testing data, the initial random split of the dataset into these sets highlighted a significant class imbalance, particularly regarding "outbreak" and "no outbreak" cases, as depicted in Fig \ref{fig:outbreak proportion}. Notably, 87\% of both the training and testing datasets consisted of "no outbreak" cases.”

Why was 10-fold validation used?

Response: We have added the reason for using 10-fold cross-validation on page 9.

“The choice of 10-fold cross-validation was motivated by its effectiveness in providing a robust evaluation of model performance. By dividing the training data into ten folds, our models underwent comprehensive training and testing iterations, promoting thorough exposure to diverse data patterns. This approach mitigates overfitting, enhances the reliability of performance estimates, and ensures a resource-efficient use of the available data. The decision to repeat the 10-fold cross-validation five times further strengthens the statistical significance of our performance evaluations.”

Explain the concept of classification versus regression.

Response: Thank you for your valuable suggestion. We recognize the importance of clarifying the concepts of classification and regression in the context of our paper. In response, we have provided a brief explanation in the “Prediction models” section on page 9 to enhance clarity and comprehension.

“Chosen for their widespread use and high predictive accuracy, these models are versatile and commonly employed in both classification and regression tasks. In the context of machine learning, classification involves categorizing instances into predefined classes, such as predicting whether a district is likely to experience a malaria outbreak. On the other hand, regression predicts continuous numerical values. Our study primarily focuses on a classification problem.”

Discuss and link with related works.

Response: Thank you for your suggestion. We have updated the discussion as per your suggestion.

“Our approach aligns with studies worldwide, such as Kalipe et al. \cite{kalipe2018predicting}, which successfully employed various machine learning techniques in predicting malaria outbreaks in Visakhapatnam, India. Similarly, Zinszer et al. \cite{zinszer2015forecasting} and Lee et al. \cite{lee2021machine} emphasized the significance of combining environmental and clinical indicators, demonstrating enhanced accuracy in malaria forecasts. These studies underscore the importance of considering diverse variables for robust predictive models, supporting our integration of climatic and non-climatic factors.

Comparing specific models, our results align with Adamu et al. \cite{adamu2021malaria}, where Random Forest (RF) emerged as the best model. However, it's noteworthy that the optimal model may vary depending on the dataset and geographic location. Our study adds nuance to this understanding, as both XGBoost and C5.0 Decision Trees (C5.0 DT) consistently outperformed other models, providing a well-balanced and robust predictive capacity for malaria outbreak prediction in The Gambia.”

How can the dataset be validated?

Response: We have added how it can be validated in the “Data cleaning and preprocessing” section on pages 5 and 6.

The features of the data should be shown (summary statistics).

Response: Thank you for your suggestion. We have added summary statistics of the dataset in Table 1 on page 10.

Attachments
Submitted filename: Response to the reviewers.docx
Decision Letter - Sathishkumar Veerappampalayam Easwaramoorthy, Editor

Predicting malaria outbreak in The Gambia using machine learning techniques

PONE-D-23-25703R1

Dear Dr. AJADI,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Sathishkumar Veerappampalayam Easwaramoorthy

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Formally Accepted
Acceptance Letter - Sathishkumar Veerappampalayam Easwaramoorthy, Editor

PONE-D-23-25703R1

PLOS ONE

Dear Dr. Ajadi,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sathishkumar Veerappampalayam Easwaramoorthy

Academic Editor

PLOS ONE

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.