Data transformation based optimized customer churn prediction model for the telecommunication industry

Data transformation (DT) is a process that transforms the original data into a form that supports a particular classification algorithm and helps to analyze the data for a specific purpose. To improve the prediction performance, we investigated various data transformation methods. This study is conducted in a customer churn prediction (CCP) context in the telecommunication industry (TCI), where customer attrition is a common phenomenon. We propose a novel approach that combines data transformation methods with machine learning models for the CCP problem. We conducted our experiments on publicly available TCI datasets and assessed the performance in terms of widely used evaluation measures (e.g., AUC, precision, recall, and F-measure). We present comprehensive comparisons to affirm the effect of the transformation methods. The comparison results and statistical tests show that most of the proposed data transformation based optimized models improve the performance of CCP significantly. Overall, this manuscript presents an efficient and optimized CCP model for the telecommunication industry.


Introduction
Over the last few decades, the telecommunication industry (TCI) has witnessed enormous growth and development in terms of technology, level of competition, number of operators, new products and services, and so on. However, because of extensive competition, saturated markets, a dynamic environment, and attractive and lucrative offers, the TCI faces serious customer churn issues, which are considered a formidable problem in this regard [39]. In a competitive market, where customers can choose among numerous service providers, they can easily switch services and even service providers. Such customers are referred to as churned customers [39] with respect to the original service provider.
The three main generic strategies to generate more revenue in an industry are (i) increasing the retention period of existing customers, (ii) acquiring new customers, and (iii) up-selling to existing customers [35]. In fact, customer retention is believed to be the most profitable strategy, as customer turnover severely hits the company's income and increases its marketing expenses [1].
Churn is an inevitable result of a customer's long-term dissatisfaction with the company's services. Complete withdrawal from a service (provider) on the part of a customer does not happen in a day; rather, the dissatisfaction of the customer, grown over time and exacerbated by the lack of attention from the service provider, culminates in such a drastic step. To prevent this, the service provider must work on the limitations (as perceived by the customers) in its services to retain the aggrieved customers. Thus it is highly beneficial for a service provider to be able to identify a customer as a potential churned customer early. In this context, non-churn customers are those who are reluctant to move from one service provider to another, in contrast to churn customers.
If a telephone company (TELCO) can predict that a customer is likely to churn, then it can potentially cater targeted offerings to that customer to reduce their dissatisfaction, increase their engagement, and thus potentially retain them. This has a clear positive impact on revenue. Additionally, customer churn adversely affects the company's reputation and branding. As such, churn prediction is a very important task, particularly in the telecom sector. To this end, TELCOs generally maintain detailed records of their customers to understand their standing and to anticipate how long they will continue using the services. Since the expense of acquiring new customers is relatively high [27,17], TELCOs nowadays principally focus on retaining their long-term customers rather than acquiring new ones. This makes churn prediction essential in the telecom sector [25,36]. With the above backdrop, in this paper we revisit the customer churn prediction (CCP) problem as a binary classification problem in which all customers are partitioned into two classes, namely, Churn and Non-Churn.

Brief Literature review
The problem of CCP has been tackled using various approaches including machine learning models, data mining methods, and hybrid techniques. Several Machine Learning (ML) and data mining approaches (e.g., Rough set theory [1,5], Naïve Bayes and Bayesian network [26], Decision tree [21,11], Logistic regression [11], RotBoost [23], Support Vector Machine (SVM) [31], Genetic algorithm based neural network [30], AdaBoost Ensemble learning technique [22], etc.) have been proposed for churn prediction in the TCI using customer relationship management (CRM) data. Notably, CRM data is widely used in prediction and classification problems [19]. A detailed literature review considering all these works is beyond the scope of this paper; however, we briefly review some of the most relevant papers below.
Brandusoiu et al. [7] presented a data mining based approach for prepaid customer churn prediction. To reduce data dimension, the authors applied Principal Component Analysis (PCA). Three machine learning classifiers were used here, namely, Neural Networks (NN), Support Vector Machine (SVM), and Bayes Networks (BN) to predict churn customers. He et al. [18] proposed a model based on Neural Networks (NN) in order to tackle the CCP problem in a large Chinese TELCO that had about 5.23 million customers. Idris et al. [24] proposed a technique combining genetic programming with AdaBoost to model the churn problem in the TCI. Huang et al. [20] studied the problem of CCP in the big data platform. The aim of the study was to show that big data significantly improves the performance of churn prediction using Random Forest classifier.
Makhtar et al. [28] proposed a rough set theory based model for churn prediction in TELCO. Amin et al. [2] on the other hand focused on tackling the data imbalance issue in the context of CCP in TELCO and compared six unique sampling strategies for oversampling. Burez et al. [8] also studied the issue of unbalanced datasets in churn prediction models and conducted a comparative study for different methods for tackling the data imbalance issue. Hybrid strategies have also been used for processing massive amount of customer information together with regression techniques that provide effective churn prediction results [32]. On the other hand, Etaiwi et al. [13] showed that their Naïve Bayes model was able to beat SVM in terms of precision, recall, and F-measure.
To the best of our knowledge, an important limitation in this context is that most of the methods in the literature have been evaluated on a single dataset. Also, the impact of data transformation methods on CCP models has not been investigated deeply. There are various DT methods, such as Log, Rank, Z-score, Discretization, Min-max, Box-cox, and Arcsine. Among these, researchers have broadly used the Log, Z-score, and Rank DT methods in different domains (e.g., software metrics normality and maintainability [38] [37], defect prediction [16], dimensionality reduction [16], etc.). To the best of our knowledge, there is only one work in the literature where DT methods have been applied in the context of CCP in TELCO [3], and it leveraged only two DT methods (namely, Log and Rank) and a single classifier (Naïve Bayes). Therefore, there is considerable room for improvement in this context, which we address in this work.

Our Contributions
This paper makes the following key contributions: • We develop customer churn prediction models that leverage various data transformation (DT) methods and various optimized machine learning algorithms. In particular, we combine six different DT methods with eight different optimized classifiers to develop a number of models to handle the CCP problem. The DT methods we utilized are: Log, Rank, Box-cox, Z-score, Discretization and Weight-of-evidence (WOE). The classification algorithms we used include K-Nearest Neighbor (KNN), Naïve Bayes (NB), Logistic Regression (LR), Random Forest (RF), Decision Tree (DTree), Gradient Boosting (GB), Feed-Forward Neural Networks (FNN) and Recurrent Neural Networks (RNN).
• We have conducted extensive experiments on three different publicly available datasets and evaluated our models using various information retrieval metrics, such as AUC, precision, recall and F-measure. Our models achieved promising results, and we conclusively found that the DT methods have a positive impact on CCP models.
• We also conduct statistical tests to check whether our findings are statistically significant or not. Our results clearly indicate that the impact of DT methods on the classifiers is not only positive but also statistically significant.

Datasets
We use three publicly available benchmark datasets (referred to as Dataset-1, 2 and 3 henceforth) that are broadly used for the CCP problem in the telecommunication area. Table 1 describes these three datasets.

Data preprocessing
We apply the following essential data preprocessing steps: • We ignore the sample IDs and/or descriptive texts which are used only for informational purposes.
• Redundant attributes are removed.
• Missing numerical values are replaced with zero (0) and missing categorical values are treated as a separate category.
• We normalize the categorical values (such as 'yes' or 'no', 'true' or 'false') into 0s and 1s where each value represents the corresponding category [5]. Label encoder is used to normalize the categorical attributes.
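The preprocessing steps above can be sketched as follows with pandas and scikit-learn. This is an illustrative helper, not the study's actual code; the `preprocess` function, the `id_cols` parameter, and the column names used in the usage example are our own assumptions:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def preprocess(df, id_cols=None):
    """Sketch of the preprocessing: drop IDs, drop redundant (duplicate)
    attributes, impute missing values, and label-encode categoricals."""
    df = df.copy()
    if id_cols:
        # Drop identifier / descriptive columns used only for information.
        df = df.drop(columns=id_cols)
    # Remove redundant attributes (columns duplicating another column).
    df = df.loc[:, ~df.T.duplicated()]
    for col in df.columns:
        if df[col].dtype == object:
            # Missing categorical values become their own category,
            # then every category is mapped to an integer code.
            df[col] = df[col].fillna("missing")
            df[col] = LabelEncoder().fit_transform(df[col])
        else:
            # Missing numerical values are replaced with zero.
            df[col] = df[col].fillna(0)
    return df
```

For example, a frame with a hypothetical `id` column, a categorical `plan` column ('yes'/'no') and a numeric `mins` column would come out fully numeric with no missing values.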

Data Transformation (DT) Methods
Data transformation refers to the application of a deterministic mathematical function to each point in a data set. Table 2 provides a description of the DT methods leveraged in our research.

Table 2: Description of the DT methods leveraged in our research.

Log: Each variable x is replaced with log(x), where the base of the logarithm is left to the analyst [37] [29] [15]. In this study we use the natural logarithm and, since feature values may contain zeros, a constant 1 is added:
x' = ln(x + 1), (1)
where x is the value of any feature variable of the original dataset.

Rank: Following [37], the initial values of every feature in the original dataset are transformed into ten (10) ranks, using each 10th percentile of the given feature's values:
x' = k if Q_{k-1} < x <= Q_k, k = 1, ..., 10, (2)
where Q_k is the (k × 10)-th percentile of the corresponding feature, with Q_0 = -∞ and Q_10 = +∞.

Box-Cox: A lambda-based power transformation [37] [15] that transforms non-normally distributed feature values towards a normal distribution:
x' = (x^λ - 1) / λ, λ ≠ 0, (3)
where λ is configurable by the analyst in the range -5 to +5, and x is the given value of any feature of the original dataset. In this study, we used λ = 0.5.

Z-score: Indicates the distance of a data point from the mean in units of standard deviation [9]:
z = (x - μ) / σ, (4)
where x is the given value of any feature of the original dataset, and μ and σ are the feature's mean and standard deviation.

Discretization: A binning technique [14]. For continuous variables, four widely used discretization techniques are K-means, equal width, equal frequency, and decision tree based discretization. We used the equal-width discretization technique, which is a very simple method. Provided x_min is the minimum of a selected feature and x_max is the maximum, the bin width Ω is computed as
Ω = (x_max - x_min) / b, (5)
and the technique generates b bins with boundaries at x_min + i × Ω, where i = 1, 2, ..., (b - 1) and b is a parameter chosen by the analyst.

Weight-of-evidence (WOE): A binning- and logarithm-based transformation [33]. In most cases, WOE alleviates skewness in the data distribution. It is the natural logarithm (ln) of the ratio of the distribution of the good events (1) to the distribution of the bad events (0):
WOE = ln(Distribution of churn customers / Distribution of non-churn customers). (6)
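For illustration, the transformations of Table 2 can be sketched in NumPy as below. The function names are ours, and details such as percentile boundary handling and the WOE smoothing constant are illustrative assumptions; the paper's own implementation may differ:

```python
import numpy as np

def log_transform(x):
    # Eq. (1): ln(x + 1) to handle zero feature values.
    return np.log1p(x)

def rank_transform(x):
    # Eq. (2): ten ranks from the 10th, 20th, ..., 90th percentiles.
    qs = np.percentile(x, np.arange(10, 100, 10))
    return np.searchsorted(qs, x, side="left") + 1

def box_cox(x, lam=0.5):
    # Eq. (3): power transform; requires positive values when lam != 0.
    return (x ** lam - 1) / lam if lam != 0 else np.log(x)

def z_score(x):
    # Eq. (4): distance from the mean in standard-deviation units.
    return (x - x.mean()) / x.std()

def equal_width_bins(x, b=10):
    # Eq. (5): b equal-width bins of width (x_max - x_min) / b.
    width = (x.max() - x.min()) / b
    return np.clip(((x - x.min()) // width).astype(int), 0, b - 1)

def woe(x_binned, y):
    # Eq. (6): per-bin ln of churn vs. non-churn distributions;
    # a small eps (our choice) avoids division by zero for empty bins.
    eps = 1e-6
    out = np.zeros_like(x_binned, dtype=float)
    for bin_id in np.unique(x_binned):
        churn = ((x_binned == bin_id) & (y == 1)).sum() / max((y == 1).sum(), 1)
        non_churn = ((x_binned == bin_id) & (y == 0)).sum() / max((y == 0).sum(), 1)
        out[x_binned == bin_id] = np.log((churn + eps) / (non_churn + eps))
    return out
```

Each function maps a feature vector to its transformed counterpart; only `woe` additionally needs the class labels, since it is a supervised transformation.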

Evaluation Measures
The confusion matrix is generally used to assess the overall performance of a predictive model. For the CCP problem, the individual components of the confusion matrix are defined as follows: (i) True Positives (TP): correctly predicted churn customers; (ii) True Negatives (TN): correctly predicted non-churn customers; (iii) False Positives (FP): non-churn customers mispredicted as churn customers; and (iv) False Negatives (FN): churn customers mispredicted as non-churn customers. We use the following popular evaluation measures for comparing the performance of the models.
Precision: Mathematically, precision can be expressed as:
Precision = TP / (TP + FP).
The probability of detection (POD) / Recall: POD, or recall, is a valid choice of evaluation metric when we want to capture as many true churn customers as possible. Mathematically, POD can be expressed as:
POD = TP / (TP + FN).
The probability of false alarm (POF): The value of POF should be as small as possible (in the ideal case, POF is 0). We use POF for measuring incorrect churn predictions. Mathematically, POF can be defined as:
POF = FP / (FP + TN).
The area under the curve (AUC): Both POF and POD are used to measure the AUC [37] [4]; a higher AUC value indicates better model performance. Mathematically, AUC can be expressed as:
AUC = (1 + POD - POF) / 2.
F-Measure: The F-measure is the harmonic mean of precision and recall; it is needed when we want to seek a balance between the two. A perfect model has an F-measure of 1. The mathematical formula of the F-measure is:
F-measure = (2 × Precision × POD) / (Precision + POD).
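A minimal sketch of these measures computed from the confusion-matrix counts. Note that the `auc` line uses the single-point approximation (1 + POD - POF) / 2 built from POD and POF, which may differ from a full ROC-curve integration:

```python
def churn_metrics(tp, tn, fp, fn):
    """Evaluation measures from the confusion matrix components."""
    precision = tp / (tp + fp)
    pod = tp / (tp + fn)        # recall / probability of detection
    pof = fp / (fp + tn)        # probability of false alarm
    auc = (1 + pod - pof) / 2   # single-point approximation from POD and POF
    f_measure = 2 * precision * pod / (precision + pod)
    return precision, pod, pof, auc, f_measure
```

For instance, a model that finds 8 of 10 churners (TP=8, FN=2) while raising 1 false alarm among 10 non-churners (FP=1, TN=9) yields POD = 0.8 and POF = 0.1.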

Optimized CCP models
The baseline classifiers used in our research are presented in Table 3. To examine the effect of the DT methods, we apply them on the original datasets and subsequently, on the transformed data, we train our CCP models with multiple machine learning classifiers (KNN, NB, LR, RF, DTree, GB, FNN and RNN) listed in Table 3.

Validation method and steps
In all our experiments, the classifiers of the CCP models were trained and tested using 10-fold cross-validation on the three different datasets described in Table 1. First, a RAW data based CCP model was constructed without leveraging any of the DT methods on any features of the original datasets. In this case, we did not apply any feature selection steps either. However, we used the best hyper-parameters for the classifiers.
Subsequently, we applied a DT method on each attribute of the dataset and retrained our models on the transformed dataset. We experimented with each of the DT methods listed in Table 2. For each DT based model, we also used a feature selection and optimization procedure, which is described in the following section.
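As an illustration of this validation protocol, a 10-fold cross-validation run with scikit-learn might look like the following. The synthetic data and the choice of F1 scoring and logistic regression are stand-ins for the paper's actual datasets and classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a churn dataset (the real datasets are in Table 1).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# 10-fold cross-validation: each fold serves once as the test split.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=10, scoring="f1")
print(round(scores.mean(), 3))
```

The mean of the ten per-fold F1 scores gives the figure reported for one classifier/DT combination.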

Feature Selection and Optimization
We have a set of hyper-parameters, and we aim to find the combination of their values that optimizes the objective function. For tuning the hyper-parameters, we applied grid search [34]. Figure 1 illustrates the overall flowchart of our proposed optimized CCP model. First, we applied the necessary preprocessing steps on the datasets. Then, the DT methods (Log, Rank, Box-cox, Z-score, Discretization, and WOE) were applied thereon. Next, we used the univariate feature selection technique to select the highest-scoring features from each dataset (we selected the top 80 features for dataset-1 and the top 15 features for both dataset-2 and dataset-3). We applied grid search to find the best hyper-parameters for the individual classification algorithms. Finally, 10-fold cross-validation was employed to train and validate the models.
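The feature selection and grid search steps can be chained in a scikit-learn pipeline, as sketched below. The synthetic data, the KNN parameter grid, and the univariate score function (`f_classif`) are illustrative assumptions; k = 15 mirrors the setting used for datasets 2 and 3:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in dataset with 30 candidate features.
X, y = make_classification(n_samples=200, n_features=30, random_state=0)

pipe = Pipeline([
    # Univariate feature selection: keep the top-15 scoring features.
    ("select", SelectKBest(f_classif, k=15)),
    ("clf", KNeighborsClassifier()),
])

# Grid search over classifier hyper-parameters with 10-fold CV.
grid = GridSearchCV(pipe, {"clf__n_neighbors": [3, 5, 7]}, cv=10)
grid.fit(X, y)
print(grid.best_params_)
```

Placing the selector inside the pipeline ensures feature selection is refit on each training fold, avoiding leakage into the validation folds.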

Stability measurement tests
We used the Friedman non-parametric statistical test (FMT) [12] to examine the reliability of the findings and whether the improvements achieved by the DT based classification models are statistically significant. The Friedman test is a non-parametric statistical test for analyzing and finding differences in treatments across multiple attempts [12]. It does not assume any particular distribution of the data. The Friedman test ranks all the methods, ranking the classifiers independently for each dataset; a lower rank indicates a better performer. We performed the Friedman test on the F-measure results. Here, the null hypothesis (H0) states: "there is no difference among the performances of the CCP models". In our experiments, the test was carried out at the significance level α = 0.05.
Subsequently, the post hoc Holm test is conducted to perform paired comparisons with respect to the best performing DT model. In particular, when the null hypothesis is rejected, we use the post hoc Holm test to compare the models pairwise. We performed Holm's post hoc comparison for α = 0.05 and α = 0.10.
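A sketch of this two-stage procedure with SciPy follows. The F-measure values are invented for illustration, and the Wilcoxon signed-rank test is used here as the paired test underlying the Holm step-down correction, which may differ from the exact post hoc procedure used in the paper:

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical F-measure scores: one list per DT method, one entry per setting.
raw = [0.63, 0.78, 0.81, 0.70, 0.65]
zsc = [0.80, 0.77, 0.80, 0.79, 0.78]
woe = [0.82, 0.79, 0.83, 0.81, 0.80]

# Stage 1: Friedman test across all methods (H0: no performance difference).
stat, p = friedmanchisquare(raw, zsc, woe)

# Stage 2: Holm step-down correction for paired comparisons against the
# best method (WOE): smallest p-value is compared to alpha/m, next to
# alpha/(m-1), and so on.
pvals = sorted([(wilcoxon(woe, other).pvalue, name)
                for other, name in [(raw, "RAW"), (zsc, "Z-SCORE")]])
alpha = 0.05
for i, (pv, name) in enumerate(pvals):
    adjusted_alpha = alpha / (len(pvals) - i)
    print(name, "rejected" if pv <= adjusted_alpha else "not rejected")
```

With real results, the rows would be the per-dataset (or per-fold) F-measures of each DT based model.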

DT methods and Data Distribution
Data transformation attempts to change the data from one representation to another to enhance its quality, with the goal of enabling analysis of certain information for specific purposes. To find out the impact of the DT methods on the datasets, data skewness and data normality measurement tests were performed on the three different datasets, and the results are visualized through Q-Q (quantile-quantile) plots [4,37].
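A minimal sketch of such a skewness and normality check with SciPy. The exponential sample is a synthetic stand-in for a skewed churn feature, and rendering the actual Q-Q plot would additionally require matplotlib:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed feature, standing in for a raw churn attribute.
feature = rng.exponential(scale=2.0, size=500)

# Skewness before and after a log-style transformation: a value near 0
# indicates a roughly symmetric (more normal-looking) distribution.
print(round(stats.skew(feature), 2), round(stats.skew(np.log1p(feature)), 2))

# Q-Q data against the normal distribution; r close to 1 indicates the
# ordered sample quantiles track the theoretical normal quantiles well.
(osm, osr), (slope, intercept, r) = stats.probplot(feature, dist="norm")
```

Plotting `osm` against `osr` (plus the fitted line) reproduces the kind of Q-Q plot shown in Figures 11-13.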

Coding and Experimental Environment
All experiments were conducted on a machine having Windows 10, 64-bit system with Intel Core i7 3.6GHz processor, 24GB RAM, and 500GB HD. All codes were implemented with Python 3.7. Jupyter Notebook was used for coding. All data and code are available at the following link: https://github.com/joysana1/Churnprediction.
Figure 1: Flowchart of the Optimized CCP model using data transformation methods.

Results
The impact of the DT methods on all 8 classifiers (through rigorous experimentation on the 3 benchmark datasets) is illustrated in Figures 2 through 9. Each of these figures illustrates the performance comparison (in terms of AUC, precision, recall, and F-measure) between the RAW data based CCP model and the DT method based CCP models for all three datasets (please check Table 4 for a map for understanding the figures, and Table 7 for the detailed results).

Results on Dataset 1
The performance of the baseline classifiers (referred to as RAW in the figures) on dataset 1 is quite poor across all the metrics: the best performer in terms of F-measure is NB, with a value of only 0.636. Interestingly, not all DT methods performed better than RAW. However, the performance of WOE is consistently better than RAW across all classifiers. In a few cases, of course, some other DT methods are able to outperform WOE: for example, across all combinations on Dataset 1, the best individual performance is achieved by FNN with Z-SCORE, with a staggering F-Measure of 0.917. In terms of AUC as well, the most consistent performer is WOE, with the best value achieved for FNN (0.802).

Results on Dataset 2
Interestingly, the performance of some baseline classifiers on Dataset 2 is quite impressive, particularly in terms of AUC. For example, both DTree and GB (RAW versions) achieved an AUC of more than 0.82; the F-Measure was also acceptable, particularly for GB (0.78).
Among the DT methods, again, WOE performs most consistently in terms of F-Measure, albeit with the caveat that for DTree and GB it performs slightly worse than RAW. In fact, surprisingly, for GB the best performer is RAW; for DTree, however, Z-SCORE is the winner, very closely followed by BOX-COX.

Results on Dataset 3
On Dataset 3 as well, the performance of DTree and GB in RAW mode is quite impressive: for DTree, the AUC and F-Measure values are 0.84 and 0.727, respectively, and for GB these are even better: 0.86 and 0.809, respectively. Again, the performance of WOE is the most consistent, except in the cases of DTree and GB, where it is beaten by RAW. The overall winner is GB with the LOG transformation, which registers an AUC of 0.864 and an F-Measure of 0.818.

Statistical test results

The computed Friedman test statistic exceeded the critical value (12.59), so the decision is to reject the null hypothesis (H0). Subsequently, the post hoc Holm test revealed significant differences among the DT methods. Figure 10 illustrates the results of Holm's test as a heat map; a p-value ≤ 0.05 was considered evidence of significance. Figure 10 shows that the performance of WOE is significantly different from that of the other DT methods, except for Z-SCORE. Table 6 reflects the post hoc comparisons for α = 0.05 and α = 0.10. When the p-value of a test is smaller than the significance level (α = 10% or 5%), Holm's procedure rejects the null hypothesis. Evidently, the WOE DT based models are found to be significantly better than the other models.

Impact of the DT methods on Data Distribution
The Q-Q plots are shown in Figures 11, 12 and 13 for Dataset-1, Dataset-2 and Dataset-3, respectively. Since we found that the WOE and Z-Score DT methods perform better than the RAW (without DT) method (see the Friedman rank Table 5), we generated Q-Q plots only for the RAW, WOE, and Z-Score methods. In each Q-Q plot, the first 3 features of the respective dataset are shown. From the Q-Q plots, it is observed that after transformation by the WOE DT method, we achieved less skewness (i.e., the data became more normally distributed). Normally distributed data is beneficial for the classifiers [4,10]. Similar behavior is also observed for Z-SCORE.

Discussion
From the comparative analysis and statistical tests, it is evident that DT methods have a great impact on improving the CCP performance in TELCO. A few prior works (e.g., [38], [37], and [3]) also studied the effect of DT methods but in a limited scale and did not consider the optimization issues. We on the other hand conducted a comprehensive study considering six DT methods and eight machine learning classifiers on three different benchmark datasets. The performance of the DT based classifiers have been investigated in terms of AUC, precision, recall, and F-measure.
The data transformation techniques have shown great promise in improving the quality of the data distribution in general. Specifically, in our experiments, the WOE method improved the data normality, which in turn provided a clear positive impact on the prediction performance for customer churn prediction (Figures 11-13).
The comparative analyses involving the RAW based and DT based CCP models clearly suggest the potential of DT methods in improving CCP performance (Figures 2 through 9). In particular, our experimental results strongly suggest that the WOE method contributed the most towards improving the performance, albeit with the exception of the DTree and GB classifiers on datasets 2 and 3. While the performance of WOE in these cases was satisfactory, it failed to outperform the RAW based models. We hypothesize that this is due to the binning step within the WOE method. Moreover, those two datasets are unbalanced, and the DTree and GB classifiers might interpret the binned values as encoding an ordering that does not actually exist.
From Table 5 we notice that WOE is the best ranked method, with a rank value of 2.4167. The post hoc comparison heat map (Figure 10) and Table 6 reflect how WOE outperforms the other methods. As the Friedman test rejects the null hypothesis (H0) and the post hoc Holm analysis advocates the WOE method's supremacy, it is clear that DT methods improve the customer churn prediction performance significantly for the telecommunication industry. Therefore, to construct a successful CCP model, we recommend selecting the best classifiers (LR, FNN) together with the WOE data transformation method.

Conclusion
Predicting customer churn is one of the most important factors in business planning for TELCOs. To improve the churn prediction performance, we investigated six different data transformation methods, namely, Log, Rank, Box-cox, Z-score, Discretization, and Weight-of-evidence. We used eight different machine learning classifiers: K-Nearest Neighbor (KNN), Naïve Bayes (NB), Logistic Regression (LR), Random Forest (RF), Decision Tree (DTree), Gradient Boosting (GB), Feed-Forward Neural Networks (FNN), and Recurrent Neural Networks (RNN). For each classifier, we applied a univariate feature selection method to select the top ranked features and used grid search for hyper-parameter tuning. We evaluated our methods in terms of AUC, precision, recall, and F-measure. The experimental outcomes indicate that, in most cases, the data transformation methods enhance the data quality and improve the prediction performance. To support our experimental results, we performed the Friedman non-parametric statistical test and the post hoc Holm statistical analysis, which confirmed that the Weight-of-evidence and Z-score DT based CCP models perform better than the RAW data based CCP model. To test the robustness of our DT-augmented CCP models, we performed our experiments on both a balanced dataset (dataset-1) and unbalanced datasets (dataset-2 and dataset-3). CCP remains a hard and swiftly evolving problem for competitive businesses in general and for telecommunication companies in particular. Future research may yield better results on other datasets with multiple classifiers. Another future direction is to extend this study with other types of data transformation approaches and classifiers. Our proposed model can also be tested on other telecom datasets to examine the generalization of our results at a larger scale.
Last but not the least, work can be done to extend our approach to customer churn datasets from other business sectors to study the generalization of our claim across business domains.
Figure 11: The Q-Q plots for the WOE and Z-Score DT methods and without DT on dataset-1. Figure 12: The Q-Q plots for the WOE and Z-Score DT methods and without DT on dataset-2.