Peer Review History

Original Submission
August 22, 2020
Decision Letter - Zhihan Lv, Editor

PONE-D-20-23992

Machine Learning-Based E-Commerce Platform Repurchase Customer Prediction Model

PLOS ONE

Dear Dr. Ho,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 07 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Zhihan Lv, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

3. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ

4. Please ensure that you refer to Figures 5 and 6 in your text; if the manuscript is accepted, production will need these references to link the reader to the figures.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (1) In many places in the article, "XGBoost" has been written as "xgboost"; this kind of error should not appear.

(2) In the introduction, the review of the related literature is not rich enough; please add more references.

(3) "The article first merges a single model, and uses the model fusion algorithm to fuse the prediction results of a single model." This sentence is unclear; part of the content appears to be missing.

(4) The article mentions the prediction model based on the decision tree algorithm more than once, but the authors do not elaborate on this part.

(5) What is "SSP"? What is its relationship to the stable fluctuation mode? Please explain this to the reader.

(6) Please standardize the various abbreviations in the "evaluation criteria", such as TP, FP, etc.

Reviewer #2: The abstract is too cumbersome, which makes it difficult for readers to quickly and accurately grasp the central idea of the article.

Formulas should be numbered and center-aligned in a uniform format.

The description of the data source is not detailed enough. Please define attributes such as the naming and size of the data source.

I am very interested in the content of Figure 1 and Figure 2, which analyze and summarize the experimental results of the single models. Please elaborate on this part.

The overall format of the article is not uniform and appears very messy.

Please complete the conclusions in detail.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Revision 1

Author’s responses to Reviewer 1

━━━━━━━━━━━━━━━━━━━◆━━━━━━━━━━━━━━━━━

Comment:

Cons 1. In many places in the article, "XGBoost" has been written as "xgboost"; this kind of error should not appear.

Response 1: Thank you for your comment. Your comment is very inspiring. We have modified the content of this part.

Cons 2. In the introduction, the review of the related literature is not rich enough; please add more references.

Response 2: Thank you for your comment. We have improved the number and content of references.

References

[1] Elbeltagi, I., & Agag, G. (2016). E-retailing ethics and its impact on customer satisfaction and repurchase intention. Internet Research, 26(1), 288-310.

[2] Yang, S., Lu, Y., Chau, P. Y. K., & Gupta, S. (2016). Role of channel integration on the service quality, satisfaction, and repurchase intention in a multi-channel online-cum-mobile retail environment. International Journal of Mobile Communications, 15(1), 1-25.

[3] Tarofder, A. K., Nikhashemi, S. R., Azam, S. M. F., Selvantharan, P., & Haque, A. (2016). The mediating influence of service failure explanation on customer repurchase intention through customers satisfaction. International Journal of Quality & Service Sciences, 8(4), 516-535.

[4] Fazal-E-Hasan, S. M., Ahmadi, H., Kelly, L., & Lings, I. N. (2018). The role of brand innovativeness and customer hope in developing online repurchase intentions. Journal of Brand Management, 26(2), 1-14.

[5] Chen, C.-C. V., & Chen, C.-J. (2017). The role of customer participation for enhancing repurchase intention. Management Decision, 55(3), 547-562.

[6] Jean, N., Burke, M., Xie, M., Davis, W. M., Lobell, D. B., & Ermon, S. (2016). Combining satellite imagery and machine learning to predict poverty. Science, 353(6301), 790.

[7] Holzinger, A. (2016). Interactive machine learning for health informatics: when do we need the human-in-the-loop?. Brain Informatics, 3(2), 119-131.

[8] Wang, J.-X., Wu, J.-L., & Xiao, H. (2016). A physics-informed machine learning approach for reconstructing Reynolds stress modeling discrepancies based on DNS data. Physical Review Fluids, 2(3), 1-22.

[9] Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., & Chouvarda, I. (2017). Machine learning and data mining methods in diabetes research. Computational & Structural Biotechnology Journal, 15(C), 104-116.

[10] Elgohary, A., Boehm, M., Haas, P. J., Reiss, F. R., & Reinwald, B. (2017). Compressed linear algebra for large-scale machine learning. VLDB Journal, 9(12), 1-26.

Cons 3. "The article first merges a single model, and uses the model fusion algorithm to fuse the prediction results of a single model." This sentence is unclear, such as the lack of part of the content.

Response 3: Thank you for your comment. We have changed this sentence.

In this paper, we first train the single models and then use a model fusion algorithm to fuse their prediction results.

Cons 4. The article mentions the prediction model based on the decision tree algorithm more than once, but the authors do not elaborate on this part.

Response 4: Thank you for your comment. We have completed the content of this part in "2.1 E-Commerce User Behavior Prediction Model Based on Decision Tree Algorithm".

Cons 5. What is "SSP"? What is its relationship to the stable fluctuation mode? Please explain this to the reader.

Response 5: Thank you for your comment. We have modified the content of this part.

Cons 6. Please standardize the various abbreviations in the "evaluation criteria", such as TP, FP, etc.

Response 6: Thank you for your comment. We have modified the content of this part.

3.3 Evaluation Method

AUC (area under the curve) is the area under the ROC (receiver operating characteristic) curve. The X-axis of the ROC curve is the FPR (false positive rate) and the Y-axis is the TPR (true positive rate). Both values are calculated from the confusion matrix, which is constructed as follows. For a two-class problem, each sample is either positive or negative, and the classifier's prediction yields four cases:

True positive (TP): a positive sample correctly predicted as positive by the model.

False positive (FP): a negative sample incorrectly predicted as positive by the model.

False negative (FN): a positive sample incorrectly predicted as negative by the model.

True negative (TN): a negative sample correctly predicted as negative by the model.

FPR and TPR are calculated from the confusion matrix as follows:

FPR = FP / (FP + TN) = FP / n, where n is the number of negative samples;

TPR = TP / (TP + FN) = TP / p, where p is the number of positive samples.

However, in many cases the ROC curve alone does not clearly indicate which classification algorithm is more effective, whereas the AUC, as a single numerical value, can directly and intuitively evaluate the quality of the classifier.

The AUC is the probability that, given a randomly chosen positive sample and a randomly chosen negative sample, the classifier ranks the positive sample ahead of the negative one. The larger the AUC value, the better the classification performance.

Usually we use the F1 value, F1 = 2 × precision × recall / (precision + recall), to measure the prediction accuracy of the two-class problem.

Here precision = TP / (TP + FP) is the ratio of correctly predicted positive samples to all samples predicted as positive, and recall = TP / (TP + FN) is the ratio of correctly predicted positive samples to all positive samples in the original training set.
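
To make these definitions concrete, the following is a minimal sketch (not taken from the manuscript) of how FPR, TPR, precision, F1, and AUC can be computed with scikit-learn; the labels and scores below are illustrative placeholders.

```python
# Illustrative sketch (not from the manuscript): computing the evaluation
# metrics described above from true labels and predicted scores.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])              # 1 = repurchase, 0 = no repurchase
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1])
y_pred = (y_score >= 0.5).astype(int)                     # hard predictions at a 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)                                      # FP / n
tpr = tp / (tp + fn)                                      # TP / p (recall)
precision = tp / (tp + fp)
f1 = 2 * precision * tpr / (precision + tpr)
auc = roc_auc_score(y_true, y_score)                      # threshold-free ranking quality
print(f"FPR={fpr:.3f} TPR={tpr:.3f} precision={precision:.3f} F1={f1:.3f} AUC={auc:.3f}")
```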

We hope the modifications that we have made can meet your requirements. Thank you again for your effort on our article.

Author’s responses to Reviewer 2

━━━━━━━━━━━━━━━━━━━◆━━━━━━━━━━━━━━━━━

Comment 1. The abstract is too cumbersome, which makes it difficult for readers to quickly and accurately grasp the central idea of the article.

Response 1: Thank you for your comment. Your comment is very inspiring.

Abstract: In recent years, China's e-commerce industry has developed rapidly and the scale of its various segments has continued to expand, giving rise to service-oriented enterprises in e-commerce transactions and information technology. This paper analyzes the shortcomings and challenges of traditional online shopping behavior prediction methods and proposes an online shopping behavior analysis and prediction system. The paper chooses the linear logistic regression model and the decision-tree-based XGBoost model. After optimizing the models, it is found that the nonlinear model can make better use of the features and obtain better prediction results. In this paper, we first train the single models and then use a model fusion algorithm to fuse their prediction results, in order to avoid the shortcomings of the linear model and the over-fitting of the decision tree model. The results show that the fused model improves further on the single models. Finally, two sets of contrast experiments prove that the algorithm selected in this paper can effectively filter the features, which simplifies the complexity of the model to a certain extent and improves the classification accuracy of machine learning. The XGBoost hybrid model based on p/n samples is simpler than a single model, less prone to over-fitting, and therefore more robust.

Comment 2. Formulas should be numbered and center-aligned in a uniform format.

Response 2: Thank you for your comment again. We have numbered the formulas and center-aligned them in a uniform format.

Comment 3. The description of the data source is not detailed enough. Please define attributes such as the naming and size of the data source.

Response 3: Thank you for your comment again. We have already explained the source of the data in more detail.

3.1 Data Source

According to the 7-day window size, the pre-processed data set is divided into several data subsets at 1-day intervals, and the historical data of each window is then converted, through feature extraction, into (user id, item id) sample data, so that the different windows yield multiple sample subsets. Finally, the "uniform down-sampling" method is used to sample the subset of each window at a positive-to-negative ratio of 1:9, which mitigates the imbalance between positive and negative samples.

From the obtained sample subsets, the data of 10 windows are extracted as the final training data. After sampling, the number of data samples per window is approximately 70,000.

Since the validity of the algorithm selected in this paper needs to be verified, the sample sets of the ten windows are divided into two categories: the sample sets before feature selection and the sample sets after feature selection, whose samples have 110 and 56 dimensions, respectively. For convenience of representation, the 10 window sample sets before feature selection are named, from small to large, the training sample sets s_i (1 ≤ i ≤ 10), and the sample sets after feature selection are named s'_i (1 ≤ i ≤ 10).
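
As an illustration of the windowing and down-sampling procedure described above, here is a minimal sketch; the 7-day window, 1-day interval, and 1:9 ratio follow the text, while the column names (user_id, item_id, timestamp, label) and the data-frame layout are assumptions made for the example.

```python
# Illustrative sketch (assumed column names and layout): split the pre-processed
# behaviour log into 7-day windows at 1-day intervals and uniformly down-sample
# the negative class to a positive-to-negative ratio of 1:9.
import pandas as pd

def window_samples(log: pd.DataFrame, start: pd.Timestamp,
                   window_days: int = 7, neg_per_pos: int = 9, seed: int = 0) -> pd.DataFrame:
    """One window of (user_id, item_id) samples with down-sampled negatives.

    `log` is assumed to have columns user_id, item_id, timestamp, label
    (label = 1 if the pair led to a repurchase in the target period, else 0).
    """
    end = start + pd.Timedelta(days=window_days)
    window = log[(log["timestamp"] >= start) & (log["timestamp"] < end)]

    pos = window[window["label"] == 1]
    neg = window[window["label"] == 0]
    neg = neg.sample(n=min(len(neg), neg_per_pos * len(pos)), random_state=seed)
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)  # shuffle

# Example usage (assuming `log` and `start_date` exist):
#   subsets = [window_samples(log, start_date + pd.Timedelta(days=i)) for i in range(10)]
```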

Comment 4. I am very interested in the content of Figure 1 and Figure 2, which analyze and summarize the experimental results of the single models. Please elaborate on this part.

Response 4: Thank you for your comment again. We have modified the content of Figure 1 and Figure 2.

Based on the training of the model, stepwise regression with the AUC criterion is used to screen the features, and the model is optimized according to the changes in the indicators. The effect of different positive-to-negative sampling ratios on the AUC of the test set is shown in Figure 1, and the accuracy is shown in Table 2.

When the number of training samples increases, the added samples are mostly negative. If the model only learns how to classify negative samples, it will still score well on the test set: a classifier that simply labels every sample as negative also achieves good accuracy. Therefore, we mainly rely on the AUC value to evaluate the model. Observing the trend of the AUC values, the fluctuation of the AUC is not very pronounced, but compared with the other sampling ratios, the model trained at a 1:3 positive-to-negative ratio performs best on the test set. When the ratio is tilted further toward negative samples, the AUC falls below that of the 1:3 ratio, because as the amount of data increases, the complexity of model training grows and the results deteriorate. The parameters of the logistic regression algorithm are obtained by maximum likelihood estimation, whose objective is to maximize the log-likelihood of correctly classifying every sample, regardless of whether it belongs to the majority or the minority class; the algorithm is therefore not well suited to class-imbalance problems.
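
As a concrete illustration of this comparison, the sketch below (not the authors' code) trains logistic regression on training sets down-sampled at different positive-to-negative ratios and evaluates each on a fixed test set by AUC; the data arrays and the ratio grid are assumptions.

```python
# Illustrative sketch (assumed data): compare test-set AUC of logistic regression
# trained at different positive-to-negative down-sampling ratios, as in Figure 1.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def downsample(X, y, neg_per_pos, seed=0):
    """Keep all positives and at most `neg_per_pos` times as many negatives."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    neg = rng.choice(neg, size=min(len(neg), neg_per_pos * len(pos)), replace=False)
    idx = np.concatenate([pos, neg])
    return X[idx], y[idx]

def auc_for_ratio(X_train, y_train, X_test, y_test, neg_per_pos):
    Xs, ys = downsample(X_train, y_train, neg_per_pos)
    model = LogisticRegression(max_iter=1000).fit(Xs, ys)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Example usage (assuming X_train, y_train, X_test, y_test exist):
#   for r in (1, 3, 5, 7, 9):
#       print(f"1:{r} -> AUC {auc_for_ratio(X_train, y_train, X_test, y_test, r):.4f}")
```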

(2) Model fusion experiment results

The AUC value of the fusion model is computed iteratively over a large number of manually set weight values. The logistic regression model, being a linear model, does not predict well on its own, so its contribution to the fusion model is not very large, whereas XGBoost has a greater impact in training the fusion model. Several groups of typical weights are shown in Table 3.
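
The weighting scheme can be sketched as follows, assuming the fusion is a convex combination of the two models' predicted probabilities and the weight is chosen by scanning candidate values for the best validation AUC; the weight grid, hyperparameters, and variable names are illustrative rather than the authors' settings.

```python
# Illustrative sketch (assumed hyperparameters and data): fuse logistic regression
# and XGBoost scores with a single weight w and keep the weight that maximises
# the validation AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def fuse_and_select(X_train, y_train, X_val, y_val):
    lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    xgb = XGBClassifier(n_estimators=200, max_depth=4).fit(X_train, y_train)

    p_lr = lr.predict_proba(X_val)[:, 1]
    p_xgb = xgb.predict_proba(X_val)[:, 1]

    best_w, best_auc = 0.0, -np.inf
    for w in np.linspace(0.0, 1.0, 21):       # weight on the XGBoost score
        auc = roc_auc_score(y_val, w * p_xgb + (1.0 - w) * p_lr)
        if auc > best_auc:
            best_w, best_auc = w, auc
    return best_w, best_auc
```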

Comment 5. The overall format of the article is not uniform and appears very messy.

Response 5: Thank you for your comment. We have made uniform changes to the format of the full text.

Comment 6. Please complete the conclusions in detail.

Response 6: Thank you for your comment. We have improved the content of the conclusion.

5.Conclusions

This paper analyzes and studies the shortcomings and challenges of traditional online shopping behavior prediction methods and proposes an online shopping behavior analysis and prediction system. By analyzing customer behavior data, the system obtains the purchase behavior rules contained in the data and stores the discovered rule knowledge in a knowledge base. Based on the customer's real-time browsing behavior, the knowledge in the knowledge base, and the customer's personalized attributes, the system predicts customer buying behavior trends in real time.

The paper selects the linear logistic regression model and the decision-tree-based XGBoost model. After optimizing the models, it is found that the nonlinear model can make better use of the features and obtain better prediction results. We then study the fusion of the individual models: in order to avoid the shortcomings of the linear model and the over-fitting of the decision tree model, a model fusion algorithm is used to fuse the prediction results of the single models, and the prediction results are further improved over the single models.

Finally, two sets of contrast experiments prove that the algorithm selected in this paper can effectively filter the features, which simplifies the complexity of the model to a certain extent and improves the classification accuracy of machine learning. The XGBoost hybrid model based on p/n samples is simpler than a single model, less prone to over-fitting, and therefore more robust.

We hope the modifications that we have made can meet your requirements. Thank you again for your effort on our article.

Attachments
Attachment
Submitted filename: response.docx
Decision Letter - Zhihan Lv, Editor

Machine Learning-Based E-Commerce Platform Repurchase Customer Prediction Model

PONE-D-20-23992R1

Dear Dr. Ho,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Zhihan Lv, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: After the revision, the quality of the article has clearly improved, and its points of innovation are also very clear.

Reviewer #2: The author of the article has given accurate explanations for the contents that were not modified. The paper format has also been improved.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Formally Accepted
Acceptance Letter - Zhihan Lv, Editor

PONE-D-20-23992R1

Machine Learning-Based E-Commerce Platform Repurchase Customer Prediction Model

Dear Dr. Ho:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Zhihan Lv

Academic Editor

PLOS ONE

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.