Fig 1.
Program code transformation and operator distribution by job manager in Apache Flink framework.
Fig 2.
Workload spikes: a) Daily spikes occur in the mornings or evenings. b) Weekly spikes occur on weekdays, from Monday evening until Saturday morning. c) Seasonal spikes can occur around the Christmas and New Year holiday season. d) Unplanned spikes can occur at any time of the year.
Fig 3.
System architecture: The system administrator and users push requests to the system through its event and data gathering component, which forwards them to the brokering agent.
The broker passes the requests to the workload analyzer, which in turn passes the analysis results to the workload prediction model. The prediction results are checked against the cluster usage and SLA constraints and then passed to the topology generator, which generates a topology. If the resulting topology meets the SLA requirements while accounting for the predicted cluster usage, the execution plan is selected; otherwise the topology is refined and the process is repeated.
Fig 4.
Logical and physical execution plans represented as directed acyclic dataflow graphs.
Fig 5.
Program code transformation and operator distribution by job manager in Apache Flink framework.
Fig 6.
Data flowchart for predicting hourly, daily, and weekly drop-offs for selected zones or the entire city in the TLC dataset.
Table 1.
Features applied to assess the credit risk in the German credit dataset.
Fig 7.
Synthetic workload spikes: The moving-average results show their occurrence over 24 hours, and the prediction extends three hours into the near future.
The moving average smooths the data so that the prediction algorithm can fit different spikes: daily, weekly, seasonal, and unplanned. The spike prediction adapts to the ups and downs in the workload.
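The smoothing step described above can be sketched as a simple sliding-window mean; the function name and the synthetic workload values below are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

def moving_average(series, window):
    """Smooth a workload series with a sliding-window mean,
    damping short spikes before the predictor is fitted."""
    buf = deque(maxlen=window)  # keeps only the last `window` samples
    out = []
    for x in series:
        buf.append(x)
        out.append(sum(buf) / len(buf))
    return out

# Synthetic spiky workload: a burst in the middle of a flat baseline.
workload = [10, 10, 100, 10, 10]
print(moving_average(workload, 3))  # → [10.0, 10.0, 40.0, 40.0, 40.0]
```

Larger windows (e.g. the 60, 30, and 10 compared in Fig 8) damp spikes more strongly but also lag further behind sudden workload changes.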
Fig 8.
Number of requests vs. time; the prediction is shown for moving-average window sizes of 60, 30, and 10.
Fig 9.
Comparison of the prediction accuracy of the MA, ARIMA, and ARIMA-SVR models.
Table 2.
Linear regression results.
Fig 10.
Prediction with the upper and lower bounds of a 95% confidence interval.
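A 95% interval like the one in Fig 10 can be sketched from the spread of past forecast residuals; this is a minimal illustration assuming roughly normally distributed residuals (z = 1.96), not the paper's exact procedure, and the residual values are made up.

```python
import statistics

def prediction_interval(residuals, point_forecast, z=1.96):
    """Return (lower, upper) bounds of an approximate 95% interval
    around a point forecast, using the residuals' standard deviation."""
    sd = statistics.pstdev(residuals)  # population std. dev. of past errors
    return point_forecast - z * sd, point_forecast + z * sd

# Hypothetical past forecast errors and a point forecast of 100 requests.
lo, hi = prediction_interval([-2.0, 1.0, 0.5, -1.0, 1.5], 100.0)
print(lo, hi)  # bounds bracket the forecast, e.g. ~97.4 and ~102.6
```

The interval widens with noisier residuals, which is why unplanned spikes produce much looser bounds than regular daily patterns.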
Fig 11.
Normalized error, based on absolute error, for a moving-average window of 60.
Fig 12.
Use case scenario: The topology generator has produced a default dataflow graph with a map operator of parallelism 3 and a reduce operator of parallelism 4.
The system increases the parallelism of the reduce task to 5 in order to achieve higher QoS. Adapted from [10].
Table 3.
Cluster configuration.
Fig 13.
Pseudo-code of the hash join algorithm: one small and one large table are joined by building a multimap over the small table, enabling hash-based lookup.
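The build-and-probe scheme in Fig 13 can be sketched as follows; the function name, table layouts, and sample rows are illustrative assumptions, not the paper's code.

```python
from collections import defaultdict

def hash_join(small, large, key_small, key_large):
    """Hash join: build a multimap on the smaller table, then stream
    the larger table and probe the multimap by hashed key."""
    # Build phase: a multimap (key -> list of rows) tolerates duplicate keys.
    index = defaultdict(list)
    for row in small:
        index[row[key_small]].append(row)
    # Probe phase: one hash lookup per row of the large table.
    joined = []
    for row in large:
        for match in index.get(row[key_large], []):
            joined.append({**match, **row})
    return joined

# Hypothetical tables: small dimension table joined to a larger fact table.
customers = [{"cid": 1, "name": "Ada"}, {"cid": 2, "name": "Bob"}]
orders = [{"cid": 1, "item": "disk"}, {"cid": 1, "item": "cpu"}]
print(hash_join(customers, orders, "cid", "cid"))
```

Building the multimap over the smaller input is what makes the build size (varied from 1 GB to 7 GB in Fig 14) the dominant factor in memory use, while probe cost grows with the constant 10 GB side.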
Fig 14.
Execution time vs. build size: execution time of the hash join with one data set varied from 1 GB to 7 GB while the other is kept constant at 10 GB.
Fig 15.
Comparison of the default parallelism, TRS parallelism, and ARIMA+SVR TRS parallelism under the incoming workload, using the hash-map implementation with varying data-set sizes.