Fig 1.
Program code transformation and operator distribution by job manager in Apache Flink framework.
Fig 2.
Workload spikes: a) Daily spikes occur in the mornings or evenings. b) Weekly spikes occur on weekdays, from Monday evening until Saturday morning. c) Seasonal spikes can occur around the Christmas and New Year holiday season. d) Unplanned spikes can occur at any time of the year.
Fig 3.
System architecture: The system administrator and users push requests to the system through its event and data gathering component, which forwards them to the brokering agent.
The broker passes the requests to the workload analyzer, which in turn passes the analysis results to the workload prediction model. The prediction results are checked against the cluster usage and SLA constraints and then passed to the topology generator, which generates a topology. If the resulting topology meets the SLA requirements while accounting for the predicted cluster usage, the execution plan is selected; otherwise the topology is refined and the process is repeated.
Fig 4.
Logical and physical execution plans represented as directed acyclic dataflow graphs.
Fig 5.
Program code transformation and operator distribution by job manager in Apache Flink framework.
Fig 6.
Data flowchart for predicting hourly, daily, and weekly drop-offs for selected zones or the entire city in the TLC dataset.
Table 1.
Features applied to assess the credit risk in the German credit dataset.
Fig 7.
Synthetic workload spikes: The moving-average results show their occurrence over 24 hours, and the prediction extends three hours into the near future.
The moving average smooths the data so that the prediction algorithm can fit different spikes: daily, weekly, seasonal, and unplanned. The spike prediction adapts to the ups and downs in the workload.
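The smoothing step described above can be sketched as a simple sliding-window mean; the function name and the synthetic workload values below are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

def moving_average(series, window):
    """Smooth a workload series with a sliding-window mean,
    damping short spikes before the predictor is fitted."""
    buf = deque(maxlen=window)  # keeps only the last `window` samples
    out = []
    for x in series:
        buf.append(x)
        out.append(sum(buf) / len(buf))
    return out

# Synthetic spiky workload: a burst in the middle of a flat baseline.
workload = [10, 10, 100, 10, 10]
print(moving_average(workload, 3))  # → [10.0, 10.0, 40.0, 40.0, 40.0]
```

Larger windows (e.g. the 60, 30, and 10 compared in Fig 8) damp spikes more strongly but also lag further behind sudden workload changes.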
Fig 8.
Number of requests vs. time; the prediction is shown for moving-average window sizes of 60, 30, and 10.
Fig 9.
Comparison of the prediction accuracy of the MA, ARIMA, and ARIMA-SVR models.
Table 2.
Linear regression results.
Fig 10.
Prediction with the upper and lower bounds of a 95% confidence interval.
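A 95% interval like the one in Fig 10 can be sketched from the spread of past forecast residuals; this is a minimal illustration assuming roughly normally distributed residuals (z = 1.96), not the paper's exact procedure, and the residual values are made up.

```python
import statistics

def prediction_interval(residuals, point_forecast, z=1.96):
    """Return (lower, upper) bounds of an approximate 95% interval
    around a point forecast, using the residuals' standard deviation."""
    sd = statistics.pstdev(residuals)  # population std. dev. of past errors
    return point_forecast - z * sd, point_forecast + z * sd

# Hypothetical past forecast errors and a point forecast of 100 requests.
lo, hi = prediction_interval([-2.0, 1.0, 0.5, -1.0, 1.5], 100.0)
print(lo, hi)  # bounds bracket the forecast, e.g. ~97.4 and ~102.6
```

The interval widens with noisier residuals, which is why unplanned spikes produce much looser bounds than regular daily patterns.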
Fig 11.
Normalized error, based on absolute error, for a moving-average window of 60.
Fig 12.
Use case scenario: The topology generator has produced a default dataflow graph with a map operator of parallelism 3 and a reduce operator of parallelism 4.
The system increases the parallelism of the reduce task to 5 in order to achieve higher QoS. Adapted from [10].
Table 3.
Cluster configuration.
Fig 13.
Pseudo-code of the hash join algorithm: one small and one large table are joined by building a multimap over the small table, enabling hash-based lookup.
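The build-and-probe scheme in Fig 13 can be sketched as follows; the function name, table layouts, and sample rows are illustrative assumptions, not the paper's code.

```python
from collections import defaultdict

def hash_join(small, large, key_small, key_large):
    """Hash join: build a multimap on the smaller table, then stream
    the larger table and probe the multimap by hashed key."""
    # Build phase: a multimap (key -> list of rows) tolerates duplicate keys.
    index = defaultdict(list)
    for row in small:
        index[row[key_small]].append(row)
    # Probe phase: one hash lookup per row of the large table.
    joined = []
    for row in large:
        for match in index.get(row[key_large], []):
            joined.append({**match, **row})
    return joined

# Hypothetical tables: small dimension table joined to a larger fact table.
customers = [{"cid": 1, "name": "Ada"}, {"cid": 2, "name": "Bob"}]
orders = [{"cid": 1, "item": "disk"}, {"cid": 1, "item": "cpu"}]
print(hash_join(customers, orders, "cid", "cid"))
```

Building the multimap over the smaller input is what makes the build size (varied from 1 GB to 7 GB in Fig 14) the dominant factor in memory use, while probe cost grows with the constant 10 GB side.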
Fig 14.
Execution time vs. build size: execution time of the hash join with one data set varied from 1 GB to 7 GB while the other is kept constant at 10 GB.
Fig 15.
Comparison of the default parallelism, TRS parallelism, and ARIMA+SVR TRS parallelism under the incoming workload, using the hash-map implementation with varying data-set sizes.