A new framework based on feature modeling and ensemble learning to predict query performance

A query optimizer attempts to predict a performance metric, typically the elapsed time of a query. Doing this precisely would impose significant overhead on the core engine to collect the statistics the optimizer needs. Machine learning is increasingly being used to improve query performance prediction by incorporating regression models. To predict the response time of a query, most query performance approaches rely on DBMS optimizer statistics and the cost estimate of each operator in the query execution plan, with a focus on resource utilization (CPU, I/O). Modeling query features is thus a critical step in developing a robust query performance prediction model. In this paper, we propose a new framework based on query feature modeling and ensemble learning to predict query performance, and we use this framework as a query performance predictor simulator to optimize the query features that influence performance. In query feature modeling, we propose five dimensions along which query features are modeled: syntax, hardware, software, data architecture, and historical performance logs. These features form the basis of the training datasets for the performance prediction model, which employs ensemble learning. Ensemble learning equips the query performance predictor to handle missing values and to control overfitting via regularization. The experimental work section describes how the proposed framework is applied in practice. The training dataset in this paper is made up of performance data logs from various real-world environments. The outcomes were compared to show the difference between the actual and predicted performance of the proposed prediction model. Empirical work shows the effectiveness of the proposed approach compared to related work.


1.
Did the authors concentrate on the optimization of joins and other complex queries, such as nested subqueries, in their experiment?
The following is mentioned in the training dataset metrics section on page 10, and the query syntax has been added to the supplementary material. The queries used in the experiment were all complex, with subqueries and complex joins. The idea was to use complex queries from real-world implementations [the reference project was for a telecommunications company on a big data platform (Cloudera environment) that handles streaming data]. The features are derived from the query syntax statement specifications, hardware specifications, software specifications, data architecture specifications, and historical performance logs. Feature modeling is used to extract input features from the various types of queries.
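As a concrete illustration of feature modeling across the five dimensions, the sketch below builds one training record from a query and its environment. This is a minimal Python sketch under stated assumptions: the field names and the syntax-derived counts are illustrative, not the paper's actual schema.

```python
# Minimal sketch of a query feature record covering the five proposed
# dimensions: syntax, hardware, software, data architecture, and
# historical performance logs. Field names are illustrative assumptions.
def build_feature_record(query_sql, env):
    sql = query_sql.upper()
    return {
        # Syntax dimension: simple counts derived from the SQL text
        "num_joins": sql.count(" JOIN "),
        "num_subqueries": sql.count("SELECT") - 1,
        # Hardware dimension
        "cpu_cores": env["cpu_cores"],
        "ram_gb": env["ram_gb"],
        # Software dimension
        "engine": env["engine"],
        "os": env["os"],
        # Data architecture dimension
        "storage_type": env["storage_type"],
        # Historical performance logs dimension
        "avg_past_runtime_sec": env["avg_past_runtime_sec"],
    }

record = build_feature_record(
    "SELECT a FROM t1 JOIN t2 ON t1.id = t2.id "
    "WHERE a IN (SELECT b FROM t3)",
    {"cpu_cores": 16, "ram_gb": 64, "engine": "Hadoop", "os": "Linux",
     "storage_type": "SSD", "avg_past_runtime_sec": 42.5},
)
# record["num_joins"] == 1 and record["num_subqueries"] == 1
```

Records like this, one per query execution, would form the rows of the training dataset fed to the ensemble model.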
An example of the features extracted from query syntax is shown in the attached figure (day vs. performance).
The syntax of all queries used in the experimental work is attached and will be uploaded as supplementary material (Queries_Env.xlsx).

2.
It would be better for the readers if the authors presented the parameters of the proposed NN model in a table (epochs, architecture info, and so on).
The following has been added to the "XGBoost algorithm parameters" section on page 11.

3.
What is the optimal value of 'k' in equation 7 that the authors have considered?
The following has been added to the "ensemble learning XGBoost algorithm" section on page 7. XGBoost takes an iterative approach to determining the best k. The equation shows the number of iterations XGBoost should run to achieve the smallest possible error; the number of rounds is passed to XGBoost as an input parameter. The ensemble technique takes a cleverer approach because it combines many iterations to produce the final model. Rather than training all the models separately, boosting trains models in succession, with each new model trained to correct the mistakes made by the previous ones. Models are added sequentially until no further improvement can be made. So, rather than specifying an exact number, we provide the algorithm with a set of parameters that determine the optimal number of trees (number of estimators).
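The round-selection behavior described above can be sketched in plain Python: boosting adds one weak learner (here a regression stump) per round and stops once the validation error has not improved for a fixed number of rounds, so the best round count k is discovered rather than specified. This is a simplified stand-in for XGBoost's number-of-estimators/early-stopping mechanism, not its actual implementation.

```python
import statistics

def fit_stump(xs, residuals):
    # Best single-split regression stump on 1-D inputs: try every
    # threshold and keep the one with the smallest squared error.
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        if not left or not right:
            continue
        lm, rm = statistics.mean(left), statistics.mean(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def boost(xs, ys, val_xs, val_ys, max_rounds=50, lr=0.5, patience=5):
    # Gradient boosting on residuals with early stopping: k is found
    # iteratively, as in the response above.
    preds = [0.0] * len(xs)
    val_preds = [0.0] * len(val_xs)
    best_err, best_k = float("inf"), 0
    for k in range(1, max_rounds + 1):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
        val_preds = [p + lr * stump(x) for p, x in zip(val_preds, val_xs)]
        err = sum((y - p) ** 2 for y, p in zip(val_ys, val_preds)) / len(val_ys)
        if err < best_err - 1e-9:
            best_err, best_k = err, k       # still improving
        elif k - best_k >= patience:
            break                           # early stopping
    return best_k, best_err

xs = list(range(1, 9))
ys = [2.0, 2.0, 2.0, 2.0, 10.0, 10.0, 10.0, 10.0]
best_k, best_err = boost(xs, ys, xs, ys)
```

Here `best_k` plays the role of the optimal k: the round after which additional trees no longer reduce the validation error.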

4.
In the section "FEATURES ENGINEERING", the authors have discussed a few preprocessing stages, such as categorical data resolving and feature removal. Did the authors consider any feature selection algorithms, such as filter methods or wrapper methods, for picking/dropping features?
The following has been added to the proposed framework section on page 5. The embedded feature selection of XGBoost is used in the proposed flow diagram; as a result, the feature engineering step is fed by the feature importance output. Label encoding was also applied: we propose label encoding, which converts categorical features into numbers. The list of categorical features includes the deployment environment, encoded as (1 for cloud, 2 for on-premises); the operating system environment, encoded as (1 for Linux, 2 for Windows); the data storage type, encoded as (2 for SSD); the data transformation engine, encoded as (1 for Hadoop, 2 for Oracle, and 3 for Greenplum); and the query number, encoded as 1 through 10, because we have 10 queries for data transformation.
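The label encoding described above can be expressed directly as lookup tables. The sketch below uses the exact integer codes listed in the response; the function and field names themselves are illustrative assumptions.

```python
# Label encoding for the categorical features listed above, using the
# integer codes given in the response. Field/function names are
# illustrative, not the paper's exact identifiers.
ENCODINGS = {
    "deployment_environment": {"cloud": 1, "on-premises": 2},
    "operating_system": {"Linux": 1, "Windows": 2},
    "data_storage_type": {"SSD": 2},
    "data_transformation_engine": {"Hadoop": 1, "Oracle": 2, "Greenplum": 3},
}

def encode_row(row):
    # Replace each categorical value with its integer label; numeric
    # fields (e.g. the query number, 1-10) pass through unchanged.
    return {k: ENCODINGS[k][v] if k in ENCODINGS else v
            for k, v in row.items()}

encoded = encode_row({
    "deployment_environment": "cloud",
    "operating_system": "Linux",
    "data_storage_type": "SSD",
    "data_transformation_engine": "Greenplum",
    "query_number": 7,
})
# encoded == {"deployment_environment": 1, "operating_system": 1,
#             "data_storage_type": 2, "data_transformation_engine": 3,
#             "query_number": 7}
```

After this step every feature is numeric, which is what the XGBoost regressor expects as training input.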

5.
It is advised to compare the authors' proposed work with other standard works, especially the works discussed in section two (Related Work).
The following has been added to the discussion section on page 14.

DISCUSSION
The key lessons learned from the proposed approach and the other approaches in the related work section are shown in Table 6. Number of features: most of the approaches in the related work section limit the number of features used to train the prediction model in order to improve model accuracy. The proposed approach does not limit the features; instead, it proposes a generalized approach to feature modeling that enriches the prediction model's input training dataset. The impact of feature standardization is that the features used for training the prediction model are standardized despite differences in the query's environment specifications in terms of hardware, software, and data architecture. So, defining a feature domain and then extracting a list of features from each domain generalizes the approach to standardizing the features. This step of standardizing the input features for the query prediction model is not considered in any of the related works; all of them depend on the optimizer statistics to extract the logs and the features that impacted query performance according to the historical execution plans.

Please cite the following works to enhance the feature engineering related works.

1.
The authors should emphasize the difference between other methods to further clarify the position of this work.
This has been answered in the response to Reviewer-1, comment 5.

2.
The wide range of applications needs to be addressed in the Introduction.
The following is mentioned in the introduction section on page 1. Many factors influence query performance, such as data stores, data architecture, environment specifications, and query syntax.

3.
The objective of the research should be clearly defined in the last paragraph of the introduction section.
The following is mentioned at the end of the introduction section on page 2.
This paper aims to propose a query performance prediction framework that can be used to estimate query performance. The proposed framework can then be used as a query performance simulator to improve query syntax, optimize the allocated environment resources, and enhance the data architecture features used to execute the query. The main contribution is to model all the query features that influence query execution, feeding the ensemble learning technique with all the query-impactful features, in order to develop a query performance prediction model. Section two discusses the related background and research motivation and explains the challenges encountered with the traditional approach; sections three, four, and five introduce the proposed approach, the experimental work, the results, and the discussion. Finally, section six summarizes this work.

4.
Add the advantages of the proposed system in one quoted line for justifying the proposed approach in the Introduction section.
This has been answered in the response to Reviewer-2, comment 3.

5.
The motivation for the present research would be clearer with a more direct link to the importance of choosing the authors' own method.
The following has been added at the end of the related work section, on page 5, as per the reviewer's comment.
The challenges of previous approaches, which relied solely on hardware specification features (such as CPU, I/O, and allocated RAM) to develop query performance prediction models, were highlighted in the related work. The previously proposed models are not generalized enough to be used as a performance simulator to enhance the query features that influence performance. Furthermore, the proposed hybrid models for determining which regression algorithm will fit the distributed data space and produce the lowest error can be seen as workarounds, but they are not optimal. As a consequence, we need to employ a more sophisticated regression algorithm.