An improved advertising CTR prediction approach based on the fuzzy deep neural network

Combining a deep neural network with fuzzy theory, this paper proposes an advertising click-through rate (CTR) prediction approach based on a fuzzy deep neural network (FDNN). In this approach, a fuzzy Gaussian-Bernoulli restricted Boltzmann machine (FGBRBM) is first applied to the raw input data from advertising datasets. Next, fuzzy restricted Boltzmann machines (FRBMs) are used to construct the fuzzy deep belief network (FDBN) in an unsupervised, layer-by-layer manner. Finally, fuzzy logistic regression (FLR) is utilized to model the CTR. The experimental results show that the proposed FDNN model outperforms several baseline models in terms of both data representation capability and robustness on noisy advertising click-log datasets.


Introduction
Internet advertising is regarded as an effective advertising communication approach due to its strong targeted communication ability. Therefore, increasing numbers of researchers from industry and academia have investigated internet advertising, and it has become an important source of income for internet companies.
The cost per click (CPC) model [1] is one of the most common payment models in internet advertising. Approximately 66% of advertising transactions depend on the CPC because the CPC can more accurately reveal the conversion rate compared with the other models [2]. In the CPC model, the click-through rate (CTR) is a significant index for measuring the effect of advertisement placement.
To address this task, Chapelle et al. propose a machine learning framework that uses maximum entropy to implement a logistic regression model and can handle billions of samples and hundreds of millions of parameters. A two-phase feature selection algorithm is provided to reduce the need for domain expertise: a generalized mutual information method is used to select the feature groups that are used in the model, and feature hashing is then used to regulate the size of the model [3]. McMahan  CTR value [12]. These methods are state-of-the-art solutions that address the CTR prediction task with deep learning technology. In these methods, a deep architecture model is usually used to automatically extract the features of advertising click-log data, and a regression model is subsequently trained to model the advertising CTR. Because a deep architecture is used to learn the features, these methods do not need to depend on prior knowledge or human labor, and they can effectively fit complex nonlinear mapping relations in advertising datasets. However, when the dataset contains noisy data such as missing values and outliers, these methods may exhibit performance degradation. A common strategy for dealing with this problem is to identify and delete records with outliers during data preprocessing. However, deleting entire records discards much valuable and clean information. The other common strategy is to replace missing values and outliers with "0" or "NULL"; however, the records with outliers are then preserved in the training set, which also affects the accuracy of CTR prediction.
Uncertainties not only exist in the data themselves but also occur at each phase of big data processing. For instance, the collected data may be created by faulty sensors or provided by not fully informed customers, and the outputs of specific artificial intelligence algorithms also contain uncertainties. In these cases, fuzzy set techniques can be among the most efficient tools for handling various types of uncertainty [13]. Because the parameters of the above deep architecture models are constants, and the learning processes of the parameters are constrained to a relatively small space, these methods have insufficient representation capability and relatively weak robustness when the training data have been corrupted by uncertainties. Meanwhile, fuzzy set techniques are more efficient when used in association with other decision-making techniques, such as probability and neural networks, because each type of technique exhibits its own strengths in representing and handling information granularity [13].
Due to the lack of a comprehensive understanding of noisy advertising datasets, it is often difficult for these methods to ensure that the extracted features capture the optimal information for predicting the CTR. These deficiencies affect the accuracy and fitness of these methods. Building on the advantages of fuzzy set techniques, this paper proposes a novel CTR prediction method based on the fuzzy deep neural network (FDNN) to address noisy advertising datasets.
The main contributions of this paper can be summarized as follows: 1. This paper proposes an FDNN model in which FDBN is used to automatically extract abstract and complicated features from raw advertising datasets without any artificial intervention or prior knowledge; fuzzy logistic regression is subsequently used to model the CTR.
2. Fuzzy set techniques are introduced into the Gaussian-Bernoulli Restricted Boltzmann Machine (GBRBM), Restricted Boltzmann Machine (RBM) and logistic regression models to construct basic components of FDNN, and corresponding learning algorithms are presented in detail.
3. We conduct extensive experiments on real-world datasets to demonstrate the effectiveness and efficiency of the proposed FDNN model. Several baseline models are also compared and discussed. The experimental results show the superiority of the proposed method: on the first dataset, the proposed method achieves competitive performance compared with the state-of-the-art models in terms of both data representation capability and robustness.
A fuzzy number y (fuzzy set) on a real domain can be defined as a triangular fuzzy number if its membership function μ_y(x): ℝ → [0, 1] is equal to [14]

μ_y(x) = (x − l)/(m − l) if l ≤ x ≤ m;  (u − x)/(u − m) if m ≤ x ≤ u;  0 otherwise,

and it is shown in S1 Fig, where l ≤ m ≤ u; l and u stand for the lower and upper values of the support of fuzzy number y (fuzzy set), respectively; and m is the most likely value (middle value), that is, when x = m, x fully belongs to y. The triangular fuzzy number can be denoted by (l, m, u). The support of y is the set of elements {x ∈ ℝ | l < x < u}. When l = m = u, it is a non-fuzzy number.
Definition 2. α-Cuts of a Fuzzy Set
The α-cut of fuzzy set y, represented by y[α], is defined as y[α] = {x ∈ Ω | μ_y(x) ≥ α} = [y_L, y_R], where 0 < α ≤ 1, y_L is the lower value of y[α] and y_R is the upper value of y[α] [15,16].
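As a concrete illustration (a minimal sketch of the two definitions above, not code from the paper; the function names are our own), the membership degree and the α-cut of a triangular fuzzy number (l, m, u) can be computed as:

```python
def tri_membership(x, l, m, u):
    # Membership degree of x in the triangular fuzzy number (l, m, u),
    # assuming l < m < u; the degree peaks at 1 when x == m.
    if x <= l or x >= u:
        return 0.0
    if x <= m:
        return (x - l) / (m - l)
    return (u - x) / (u - m)

def tri_alpha_cut(l, m, u, alpha):
    # Alpha-cut [y_L, y_R] = {x | membership(x) >= alpha}, for 0 < alpha <= 1.
    assert 0 < alpha <= 1
    return (l + alpha * (m - l), u - alpha * (u - m))
```

For example, the α = 1 cut of (1, 2, 4) collapses to the single point 2, while the α = 0.5 cut is the interval [1.5, 3].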

Fuzzy function. Definition 3. Fuzzy Function
Extended from a real-valued function f: Y = f(x, θ), the fuzzy function f̃ can be defined as Ỹ = f̃(x, θ̃) [16], where θ and θ̃ are the parameters of the functions f and f̃, respectively [16,17], and Ỹ is the fuzzy output set [16,18].
Definition 4. α-Cuts of a Fuzzy Function
For a continuous function f, the α-cut of Ỹ, namely Ỹ[α] = [Y₁[α], Y₂[α]], can be expressed as [16]

Y₁[α] = min{f(x, θ) | θ ∈ θ̃[α]},  Y₂[α] = max{f(x, θ) | θ ∈ θ̃[α]}.

It is infeasible to calculate the membership function μ_Ỹ(y) using formulas (2) and (3) because this requires maximization and minimization of the original function. This paper adopts α-cuts and interval arithmetic to solve this problem.
The membership function μ_Ỹ(y) of a fuzzy function can be obtained by interval arithmetic because the intervals θ̃[α] are easy to calculate. However, when f is very complex, interval arithmetic becomes NP-hard; therefore, this paper introduces a defuzzification approach to handle this problem.
Defuzzification of the fuzzy function. The centroid approximate solution [19] is employed to defuzzify the fuzzy function Ỹ = f̃(x, θ̃) in this paper.
The centroid of fuzzy function f̃(x, θ̃) can be denoted by f_c(x, θ̃) [19]:

f_c(x, θ̃) = ∫ y μ_Ỹ(y) dy / ∫ μ_Ỹ(y) dy.    (6)

However, directly calculating formula (6) is difficult because it involves integrals. The centroid can instead be approximated through many α-cuts of the fuzzy function in discrete form. θ̃ is a vector of fuzzy numbers, and an α-cut of θ̃ can be denoted as θ̃[α] = [θ_L, θ_R], where θ_L and θ_R are the lower and upper bounds, respectively, of the interval with respect to α. When x is nonnegative, f(x, θ) is a monotonically decreasing function with respect to the parameters θ. Thus, according to the interval arithmetic principle, Definition 3 and formula (5), the α-cut of f̃(x, θ̃) is given by

f̃(x, θ̃)[α] = [f(x, θ_R), f(x, θ_L)].

The approximate centroid of fuzzy function f̃(x, θ̃) can then be obtained by averaging the bounds of these α-cuts [20]. Although all the α-cuts are bounded intervals, this paper takes only the special case where α_i = 1 into consideration. Let θ̃ = [θ_L, θ_R]. Formula (8) can then be expressed as [20]

f_c(x, θ̃) ≈ (f(x, θ_L) + f(x, θ_R)) / 2.

Therefore, the fuzzy function Ỹ can be expressed by this centroid.

Advertising CTR prediction based on the fuzzy deep neural network
Basic fuzzy components and corresponding learning algorithms. Inspired by the above fuzzy theory and methods [16-20], we first provide a detailed introduction to several basic components that have been modified using fuzzy techniques, together with their corresponding learning algorithms.
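For the special case α = 1 used throughout the paper, the centroid approximation reduces to averaging the function at the two parameter bounds. A minimal sketch (our own naming, not the paper's code):

```python
def centroid_defuzz(f, x, theta_L, theta_R):
    # Approximate centroid of the fuzzy function f(x, theta~) using only the
    # alpha = 1 cut: f_c(x, theta~) ~= (f(x, theta_L) + f(x, theta_R)) / 2.
    return 0.5 * (f(x, theta_L) + f(x, theta_R))
```

For a linear f such as f(x, θ) = θx at x = 2 with bounds θ_L = 1 and θ_R = 3, the centroid is (2 + 6)/2 = 4, i.e., f evaluated at the midpoint parameter.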
A. FRBM and Its Learning Algorithm
Fuzzy restricted Boltzmann machine (FRBM) is a symmetric neural network with binary nodes that is based on an energy model. It contains a set of visible binary nodes v ∈ {0, 1}^D and another set of hidden binary nodes h ∈ {0, 1}^F. There are no connections between different nodes of the same hidden layer or between different nodes of the same visible layer. Fuzzy parameters are employed to govern the FRBM model. The structure of FRBM is illustrated in the supporting figures. The fuzzy energy function of FRBM can be expressed as [20]

E(x, h; θ̃) = −∑_i b̃_i x_i − ∑_j c̃_j h_j − ∑_{i,j} h_j W̃_ij x_i.

In this formula, θ̃ = {b̃, c̃, W̃}, b̃ and c̃ are the offsets, and W̃_ij is the connection weight between the ith visible node and the jth hidden node. According to the free energy function F(x; θ) = −log ∑_h e^{−E(x,h;θ)}, the fuzzy free energy function is expressed as F̃(x, θ̃) = −log ∑_h e^{−E(x,h;θ̃)}. If F̃(x, θ̃) is used directly to define the probability, it is difficult to calculate the fuzzy probability, fuzzy maximum likelihood and fuzzy objective function in the optimization. Therefore, F̃(x, θ̃) needs to be defuzzified so that the optimization becomes a regular maximum-likelihood optimization. This paper adopts the centroid method to defuzzify F̃(x, θ̃). The centroid of F̃(x, θ̃) is expressed as F_c(x, θ̃), and the probability can be defined as

P(x) = e^{−F_c(x, θ̃)} / ∑_x e^{−F_c(x, θ̃)}.

This paper selects the negative log-likelihood as the objective function of FRBM:

L(θ̃, D) = −∑_{x∈D} log P(x).

In the formula, D denotes the training dataset. The learning algorithm of FRBM finds the parameters θ̃ that minimize the objective function L(θ̃, D) (min_θ̃ L(θ̃, D)). The paper adopts the centroid approximation method to defuzzify F̃(x, θ̃):

F_c(x, θ̃) ≈ (F(x, θ_L) + F(x, θ_R)) / 2.

The gradient of L(θ̃, D) with respect to θ_L can be expressed as

∂L(θ̃, D)/∂θ_L = ∑_{x∈D} ( ∂F_c(x, θ̃)/∂θ_L − E_P[∂F_c(x, θ̃)/∂θ_L] ),

where E_P(·) is the expectation over the target probability distribution P; the gradient with respect to θ_R is obtained correspondingly. Because it is difficult to calculate the expectation E_P[∂F_c(x, θ̃)/∂θ], an approximation approach is employed. After defuzzifying the objective function, Gibbs sampling [21] is adopted to sample from these conditional distributions.
For FRBM, the fuzzy conditional probabilities can be expressed as

P̃(h_j = 1 | x) = g(c̃_j + ∑_i W̃_ij x_i),  P̃(x_i = 1 | h) = g(b̃_i + ∑_j W̃_ij h_j),

where g(·) is the sigmoid function. The α-cuts of the fuzzy conditional probabilities are described as

P̃(h_j | x)[α] = [P_L(h_j | x), P_R(h_j | x)],  P̃(x_i | h)[α] = [P_L(x_i | h), P_R(x_i | h)].

In these formulas, P_L(h_j | x), P_R(h_j | x), P_L(x_i | h) and P_R(x_i | h) are the conditional probabilities with respect to the lower bounds and upper bounds of the parameters. Consequently,

P_L(h_j = 1 | x) = g(c_j^L + ∑_i W_ij^L x_i),  P_R(h_j = 1 | x) = g(c_j^R + ∑_i W_ij^R x_i),

and

P_L(x_i = 1 | h) = g(b_i^L + ∑_j W_ij^L h_j),  P_R(x_i = 1 | h) = g(b_i^R + ∑_j W_ij^R h_j).

In these formulas, W_ij^L and W_ij^R are the lower bound and upper bound of the connection weight, b_i^L and b_i^R are the lower bound and upper bound of the visible bias, and c_j^L and c_j^R are the lower bound and upper bound of the hidden bias [20].
According to (15), (16), (17) and the expectation estimation method of [20,22], the gradients of L(θ̃, D) with respect to the fuzzy parameters of FRBM can be obtained, where P_c(x) is the centroid probability. The CD-1 algorithm [22] is employed to obtain the updating rules for the parameters (θ_L and θ_R), approximating the expectation E_P[∂F_c(x, θ̃)/∂θ] [20,22]. The above description is summarized as Algorithm 1.
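A compact NumPy sketch of one CD-1 step for an FRBM with interval (lower/upper-bound) parameters may look as follows. This is our simplified reading of the procedure: conditional probabilities are taken as the centroid (average) of the lower- and upper-bound networks, and the same approximate CD-1 gradient is applied to both bounds; the paper's exact per-bound gradients in Algorithm 1 differ in detail.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def frbm_cd1_step(x, p, lr=0.1, rng=None):
    # One CD-1 update on a single binary visible vector x.
    # p holds lower/upper bounds of the fuzzy weights and biases.
    rng = rng or np.random.default_rng(0)
    # Centroid conditionals: average of lower- and upper-bound networks.
    ph0 = 0.5 * (sigmoid(x @ p["WL"] + p["cL"]) + sigmoid(x @ p["WR"] + p["cR"]))
    h0 = (rng.random(ph0.shape) < ph0).astype(float)        # sample hidden
    px1 = 0.5 * (sigmoid(h0 @ p["WL"].T + p["bL"]) + sigmoid(h0 @ p["WR"].T + p["bR"]))
    x1 = (rng.random(px1.shape) < px1).astype(float)        # reconstruct visible
    ph1 = 0.5 * (sigmoid(x1 @ p["WL"] + p["cL"]) + sigmoid(x1 @ p["WR"] + p["cR"]))
    # CD-1 gradient: positive phase minus negative (reconstruction) phase.
    gW = np.outer(x, ph0) - np.outer(x1, ph1)
    for k in ("WL", "WR"):
        p[k] = p[k] + lr * gW
    for k, g in (("bL", x - x1), ("bR", x - x1), ("cL", ph0 - ph1), ("cR", ph0 - ph1)):
        p[k] = p[k] + lr * g
    return p
```

In a full training run this step would be applied over minibatches for several epochs, with the bound pairs initialized slightly apart (e.g., W ∓ a small margin).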

B. FGBRBM and Its Learning Algorithm
The fuzzy energy function of FGBRBM extends that of FRBM to real-valued (Gaussian) visible nodes:

E(v, h; θ̃) = ∑_i (v_i − b̃_i)² / (2σ_i²) − ∑_j c̃_j h_j − ∑_{i,j} h_j W̃_ij v_i / σ_i,

where σ_i denotes the standard deviation of v_i in the ith dimension of the Gaussian visible node. Correspondingly, for FGBRBM, the fuzzy conditional probability distribution for obtaining the nodes of the hidden layer from the nodes of the visible layer can be described as

P̃(h_j = 1 | v) = g(c̃_j + ∑_i W̃_ij v_i / σ_i),

where g(·) denotes the sigmoid function. The formula for reconstructing the nodes of the visible layer from the nodes of the hidden layer can be expressed as

P̃(v_i | h) = N(b̃_i + σ_i ∑_j W̃_ij h_j, σ_i²).

In this formula, N(μ, σ²) is a Gaussian probability density function, where μ is the mean value and σ² denotes the variance.
The α-cuts of the fuzzy conditional probabilities can be expressed as

P̃(h_j | v)[α] = [P_L(h_j | v), P_R(h_j | v)],  P̃(v_i | h)[α] = [v_i^L, v_i^R],

where P_L(h_j | v) and P_R(h_j | v) are the conditional probabilities with respect to the lower and upper bounds of the parameters governing the FGBRBM, and v_i^L and v_i^R are the probabilities of the normal distribution. Consequently,

P_L(h_j = 1 | v) = g(c_j^L + ∑_i W_ij^L v_i / σ_i),  P_R(h_j = 1 | v) = g(c_j^R + ∑_i W_ij^R v_i / σ_i),

and

v_i^L ~ N(b_i^L + σ_i ∑_j W_ij^L h_j, σ_i²),  v_i^R ~ N(b_i^R + σ_i ∑_j W_ij^R h_j, σ_i²),

where W_ij^L and W_ij^R are the lower and upper bounds of the connection weight, respectively; b_i^L and b_i^R are the lower and upper bounds of the visible node bias, respectively; and c_j^L and c_j^R are the lower and upper bounds of the hidden node bias, respectively.
The CD-1 algorithm can also be used to learn the parameters of the fuzzy Gaussian-Bernoulli RBM model. Before training, the input data of the whole training set are first normalized so that each dimension of the input data of the FGBRBM model obeys the standard normal distribution, i.e., the mean value is 0 and the variance is 1. When the formula P̃(v_i | h, θ̃) is calculated, σ = 1. In the process of reconstructing the nodes of the visible layer, this paper uses the probabilities of the normal distribution rather than binarized data. The above description is summarized as Algorithm 2.
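The normalization step can be sketched as follows (standard zero-mean, unit-variance scaling; `normalize` is our own helper name):

```python
import numpy as np

def normalize(X):
    # Standardize each input dimension to mean 0 and variance 1, as the
    # FGBRBM assumes before fixing sigma = 1 during training.
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0          # guard against constant columns
    return (X - mu) / sd
```

The same per-dimension means and standard deviations computed on the training set would then be reused to scale validation and test data.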

C. FLR and Its Learning Algorithm Used for Modeling the Click Through Rate
The logistic regression (LR) model [24] is a classic model in the field of advertising CTR prediction. The LR model is described in S4 Fig. The activation function of the node in the output layer is a sigmoid function [25]. The output value of the node (value of the activation function) h_θ′(x′) is the probability that a user will click on the advertisement; it can also be described as P(y = 1 | x′; θ′). Given a sample (x′_d, y_d) in the training set D = {(x′_1, y_1), (x′_2, y_2), ..., (x′_N, y_N)}, the probability value of the advertising CTR prediction can be obtained in terms of logistic regression and described by formula (25):

P(y_d = 1 | x′_d; θ′) = h_θ′(x′_d) = g(θ′ᵀ x′_d) = 1 / (1 + e^{−θ′ᵀ x′_d}).    (25)

In this formula, θ′ denotes the vector of parameters of the standard logistic regression model; a refers to the features of the advertisement, u denotes the features of the users, and c represents the features of the context environment. These three feature groups compose a vector, denoted x′_d, which is fed into the logistic regression model for CTR prediction. For a single sample (x′_d, y_d), the probability of correctly predicting the advertising click-through rate can be described as

P(y_d | x′_d; θ′) = h_θ′(x′_d)^{y_d} (1 − h_θ′(x′_d))^{1 − y_d}.

This paper selects the likelihood function as the objective function. In formula (26), the click label y_d denotes the expected output and h_θ′(x′_d) is the real output value of the neuron node in the output layer (the output of the activation function). For the fuzzy logistic regression model, the objective function J(θ̃′) is obtained by replacing h_θ′ with the centroid of the fuzzy outputs,

h_c(x′_d) = (h_{θ′_L}(x′_d) + h_{θ′_R}(x′_d)) / 2.

According to formulas (5) and (7), the gradient of the objective function of FLR with respect to the fuzzy parameter θ̃′ can be expressed in terms of the lower and upper bounds θ′_L and θ′_R. The above description is summarized as Algorithm 3.

Algorithm 3. The learning algorithm of FLR
Input: (x′_d, y_d), a sample of the training set; ε, the learning rate.
1. Obtain ∂J_c(θ̃′)/∂θ′_L by means of formula (29) based on the input data x′_d;
2. Obtain ∂J_c(θ̃′)/∂θ′_R by means of formula (29) based on the input data x′_d;
3. The parameters can be updated according to the following rules:
θ′_L ← θ′_L − ε ∂J_c(θ̃′)/∂θ′_L,  θ′_R ← θ′_R − ε ∂J_c(θ̃′)/∂θ′_R.
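Under our reading of Algorithm 3, with a cross-entropy objective and the centroid prediction h = (h_{θ′_L} + h_{θ′_R})/2, one update step can be sketched as follows (the names and the exact gradient form are our simplification, not the paper's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def flr_step(x, y, theta_L, theta_R, lr=0.1):
    # One stochastic-gradient step of fuzzy logistic regression.
    hL, hR = sigmoid(theta_L @ x), sigmoid(theta_R @ x)
    h = 0.5 * (hL + hR)                            # centroid CTR prediction
    dJ_dh = (h - y) / max(h * (1.0 - h), 1e-12)    # cross-entropy derivative
    # Chain rule through each bound; each bound contributes half of h.
    theta_L = theta_L - lr * dJ_dh * 0.5 * hL * (1.0 - hL) * x
    theta_R = theta_R - lr * dJ_dh * 0.5 * hR * (1.0 - hR) * x
    return theta_L, theta_R
```

After a step on a clicked sample (y = 1), the predicted click probability for that sample increases, as expected for gradient descent on the cross-entropy loss.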

Advertising CTR prediction model FDNN. A. Construction of Advertising CTR Prediction Model FDNN
This section describes the proposed fuzzy deep neural network (FDNN) model for advertising CTR prediction and its learning algorithm in detail. The architecture of the method and its process of predicting the CTR are illustrated in S5 Fig. First, the fuzzy deep belief network (FDBN), constituted by FGBRBM and FRBM, is used to automatically learn discriminative features from the raw input data of an advertising dataset; these features are able to capture internal explanatory information that is hidden in the raw data, amplify the information that is important for discrimination and weaken irrelevant information [26].
Next, FLR is adopted to model the CTR prediction and map the predicted CTR value to the range between 0 and 1.
The raw input data of advertising CTR prediction are the fields that include the corresponding advertisement features, users' features, and context features, which are extracted from the advertisement click log. These fields contain not only numerical data but also multi-category data; therefore, the raw input data of advertising CTR prediction are composed of vectors with different data types.
Because FRBM only models input with binary variables (0, 1), this paper adopts the fuzzy Gaussian-Bernoulli RBM (FGBRBM) to address the input data of advertising CTR prediction.
Then, FRBM can be used to construct FDBN through layer-by-layer stacking; the detailed steps can be found in [22].
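The layer-by-layer stacking can be sketched generically as follows; `train_rbm` stands in for one CD-1 training run of an FGBRBM (first layer) or FRBM (subsequent layers), and all names here are our own:

```python
def train_fdbn(X, hidden_sizes, train_rbm):
    # Greedy layer-wise pre-training: train one (F)RBM, feed its hidden-layer
    # activations to the next (F)RBM, and stack the trained layers into an FDBN.
    layers, reps = [], X
    for n_hidden in hidden_sizes:
        rbm = train_rbm(reps, n_hidden)
        layers.append(rbm)
        reps = rbm.transform(reps)   # hidden representation becomes next input
    return layers
```

Each trained layer only needs to expose a `transform` mapping visible activations to hidden activations, so the sketch is agnostic to the specific RBM variant used per layer.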
In real applications, advertising CTR prediction is a binary classification or regression task; therefore, this paper uses fuzzy logistic regression to model user click behavior on advertisements. After the input data vector x = (a, u, c) = (x_1, x_2, ..., x_n) is processed by FDBN, the vector x′, which is composed of the binary values in the last hidden layer of FDBN, is fed into the fuzzy logistic regression model for CTR prediction. The advantage of the proposed method is that fuzzy set techniques are introduced and the uncertainties in the relationships between nodes located in adjacent layers are taken into consideration. Nodes in adjacent layers of FDNN often interact in uncertain ways. Because the parameters that represent the relationships between nodes in adjacent layers are fuzzy numbers, the learning process of the fuzzy parameters is extended to a relatively wider space, and this advantage is reflected in the fitness of the joint probability distribution. Combined with the merits of deep learning, the proposed method demonstrates superior performance in coping with noisy data.

B. Construction and Training of the Advertising CTR Prediction Model FDNN
The process of constructing FDNN is illustrated in S6 Fig, and the detailed steps are as follows:
Step 1: Layer-by-layer training of the FRBMs
First, the vectors constituted by the related data of textual advertising click logs are regarded as the input of FGBRBM, and the CD-1 algorithm is used to train FGBRBM; in this way, the parameter θ_1 of FGBRBM is obtained. Based on the input vectors and the parameter θ_1, the hidden nodes h_1 of FGBRBM are obtained.
Second, the parameter θ_2 of the current FRBM can be obtained when h_1 is regarded as the input data for the CD-1 algorithm. Based on h_1 and θ_2, the hidden nodes h_2 of the current FRBM can be obtained.
Third, the previous process is repeated until reaching the fixed layer.
Step 2: Constructing the FDBN by means of FRBMs
According to the training procedure of FRBM in Step 1, the weight values of the FRBMs are connected from bottom to top to construct the FDBN. In this FDBN, the connection between the top two layers is undirected, and the other connections are directed from top to bottom [22].
Step 3: Constructing the FDNN used for advertising CTR prediction by means of the FDBN
After an FLR model is added on top of the FDBN, the probability that the user clicks the advertisement is equal to the output value of the node of the output layer. At this point, the network becomes the initially trained fuzzy deep neural network (FDNN).
Step 4: Fine-tuning the parameters of FDNN
After FDNN is initially trained layer by layer, a fine-tuning process needs to be conducted. In contrast to the unsupervised learning approach in the initial training, the fine-tuning is implemented using supervised learning. Based on the weight values of the network that were obtained from the initial training, error backpropagation [27,29] and stochastic gradient descent are used to update the weight values and biases of the neural network. After fine-tuning, the process of training FDNN is complete. The output of FDNN is in the form of a probability: p(click | a, u, c) = p(y_d = 1 | x′_d, θ′).
The process of fine-tuning the parameters of FDNN contains two phases.
1) Forward calculation of the input signal: When labeled data (x_d, y_d) of click logs are given, that is, the combined feature vector x = (a, u, c) = (x_1, x_2, ..., x_n) is given, the jth node in the lth layer (l = 2, ..., L_n) of the FDNN deep neural network is defined as follows: L_n denotes the total number of layers in the FDNN network; s_l denotes the number of nodes in the lth layer; w^l_ji denotes the connection weight between the jth node in the lth layer and the ith node in the (l − 1)th layer; b^l_j denotes the bias of the jth node in the lth layer; net^l_j = b^l_j + ∑_{i=1}^{s_{l−1}} w^l_ji o^{l−1}_i denotes the weighted-sum input of the jth node in the lth layer; and o^l_j denotes the output value of the activation function of the jth node in the lth layer. The selected activation function is the sigmoid function.
In the forward calculation, the input signal is transmitted from the input layer to the output layer. The input value of the jth node in the second layer of FDNN (l = 2) can be described as net²_j = b²_j + ∑_{i=1}^{n} w²_ji x_i. When the input of node j is transformed by the activation function, its output value can be described as o²_j = g(net²_j). Therefore, the input value of the jth node in the lth layer of FDNN (l = 3, ..., L_n) can be described as net^l_j = b^l_j + ∑_{i=1}^{s_{l−1}} w^l_ji o^{l−1}_i, and its corresponding output value can be described as o^l_j = g(net^l_j). In particular, the output value of the jth node in the L_nth layer of FDNN can be described as o^{L_n}_j = g(net^{L_n}_j).
2) Backpropagation of the error signal: This paper regards the cross-entropy as the objective function [25], described by formula (31). The local gradient is refined using backpropagation; in this process, the error signal between the output value and the labeled signal is transmitted from the output layer back to the input layer.
In this formula, y_d is the label value of the sample, with y_d ∈ {0, 1}.
The gradient descent method is adopted to learn the model. This paper first defines the residual error δ^l_j [28] of the jth node in the lth layer as the partial derivative of the objective function J with respect to the weighted-sum input net^l_j of node j. With the cross-entropy objective and sigmoid output, the residual error δ^{L_n}_j of the jth neuron node of the L_nth layer can be calculated by formula (32):

δ^{L_n}_j = o^{L_n}_j − y_d.    (32)

The residual error δ^l_j of the jth neuron node of the lth layer (l = 2, ..., L_n − 1) can be calculated by formula (33) according to the chain rule [29]:

δ^l_j = o^l_j (1 − o^l_j) ∑_{i=1}^{s_{l+1}} w^{l+1}_ij δ^{l+1}_i,    (33)
where w^{l+1}_ij denotes the connection weight between the ith node of the (l + 1)th layer and the jth node of the lth layer.
According to formulas (2) and (10), the gradient of the cost function J(W, b) with respect to the parameters w^l_ji and b^l_j can be calculated by formulas (34), (35), (36) and (37).
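The two fine-tuning phases can be sketched end to end as generic cross-entropy/sigmoid backpropagation consistent with the residual definitions above (our own sketch, not the paper's Algorithm 4; with cross-entropy and a sigmoid output node, the output residual simplifies to o − y):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    # Phase 1: net^l = b^l + W^l o^{l-1}, o^l = g(net^l), layer by layer.
    outs = [x]
    for W, b in zip(Ws, bs):
        outs.append(sigmoid(W @ outs[-1] + b))
    return outs

def backward(outs, y, Ws):
    # Phase 2: residuals delta^{L_n} = o^{L_n} - y (cross-entropy + sigmoid),
    # delta^l = o^l (1 - o^l) * (W^{l+1})^T delta^{l+1};
    # gradients dJ/dW^l = delta^l (o^{l-1})^T and dJ/db^l = delta^l.
    deltas = [outs[-1] - y]
    for l in range(len(Ws) - 1, 0, -1):
        o = outs[l]
        deltas.insert(0, (Ws[l].T @ deltas[0]) * o * (1.0 - o))
    gWs = [np.outer(d, o) for d, o in zip(deltas, outs[:-1])]
    return gWs, deltas
```

A finite-difference check on a tiny network confirms that the analytic weight gradients match the numerical ones.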
The above process is described in Algorithm 4.
3. The parameters can be updated according to the following rules:

Experiment datasets and environment
Description of dataset. This paper uses the Criteo Display Advertising Challenge dataset, which is provided by the Criteo company on the machine learning website Kaggle for its advertising CTR competition [30]. Every advertisement log record contains the set of features of the displayed advertisement, which is composed of 13 numerical features and 26 categorical features. The detailed formats of these features are given in S1 Table. The class Label is used to indicate whether the advertisement was clicked: if the advertisement was clicked, the value of Label is 1; otherwise, the value of Label is 0. The categorical features are desensitized using hash mapping; therefore, their real meanings cannot be recovered.
To adequately evaluate the performance of this model and prevent interference from factors such as data timeliness, we randomly shuffle the order of the raw samples in the experiment. Meanwhile, to avoid over-fitting, the 5-fold cross-validation method [31] is adopted in the training process of all experiments.
Experiment environment. Six high-performance workstations are used for the experiments. The hardware and software configurations of the experimental environment are shown in S2 Table.

Results and analysis

Performance evaluation metric of models
AUC. This paper uses the area under the receiver operating characteristic curve (AUC) [33,34] as the main performance evaluation metric for advertising CTR prediction. The confusion matrix is shown in S3 Table [35,36]. The formulas TPR = TP/(TP + FN) and FPR = FP/(FP + TN) are used to calculate the TPR and FPR values of the confusion matrix to obtain a coordinate pair (point). These points make up the receiver operating characteristic (ROC) curve [37]. The AUC is the total area under the ROC curve. The larger the AUC is, the more accurate the advertising CTR prediction is, and the better the effect achieved by the advertisement placement is.
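Equivalently, the AUC can be computed as the probability that a randomly chosen clicked sample is ranked above a randomly chosen non-clicked one (a standard identity; the sketch below is our own, not the paper's code):

```python
def auc(labels, scores):
    # Rank-statistic form of the area under the ROC curve: the fraction of
    # (positive, negative) pairs ranked correctly, counting ties as 1/2.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives AUC = 1.0, and a single mis-ranked pair out of four lowers it to 0.75.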
Log-loss. This paper selects the logarithmic loss function "log-loss" [38,39] as an auxiliary performance evaluation metric for advertising CTR prediction:

log-loss = −(1/L) ∑_{i=1}^{L} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ],

where L is the number of samples, y_i is the true label (0 or 1), and ŷ_i is the predicted probability. The value of log-loss reflects the similarity between the CTR predicted by the model and the real CTR. The smaller the value of log-loss is, the more accurate the advertising CTR prediction is.
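A direct implementation of this metric, with the conventional clipping of predictions away from 0 and 1 so the logarithms stay finite (a minimal sketch, not the paper's code):

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    # Average negative log-likelihood of the true labels under the
    # predicted click probabilities.
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

For instance, predicting 0.5 on every sample yields a log-loss of log 2 ≈ 0.693 regardless of the labels.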

Experiments design and results analysis
Configuration optimization and performance analysis of FDNN and DBNLR. A. Feature Pre-processing
FDNN is similar to DBNLR in architecture; their difference lies in the components of the model. Therefore, this paper selects DBNLR as a baseline model.
In the experiment on the FDNN and DBNLR models, the numerical features of the raw dataset are directly adopted, and these categorical features are expanded to binary vectors using one-hot representation.
Although the method of one-hot representation is straightforward and effective, it makes the feature vectors sparse and high-dimensional. In the deep learning models, when the dimensionality of the features is large, the scale of the parameters of the model becomes large, which leads to a sharp increase in training time. Moreover, the hardware will be a bottleneck in the training process. Therefore, this paper adopts a feature hashing strategy (signed hashing trick) [40] to reduce the dimensionality of these sparse binary feature vectors. Finally, the joint vector that is composed of numerical features and categorical features and has been processed by the feature hashing strategy is regarded as the input of the two deep architecture models.
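A minimal sketch of the signed hashing trick on one-hot categorical tokens (we use Python's hashlib md5 here for a stable illustration; the reference implementation [40] and the experiments may use a different hash such as MurmurHash3):

```python
import hashlib

def signed_hash_features(tokens, dim):
    # Map each categorical token to a bucket in [0, dim) and a sign in
    # {-1, +1}; signed collisions then tend to cancel in expectation,
    # keeping the reduced representation approximately unbiased.
    vec = [0.0] * dim
    for t in tokens:
        h = int(hashlib.md5(t.encode("utf-8")).hexdigest(), 16)
        idx = h % dim
        sign = 1.0 if (h >> 64) & 1 == 0 else -1.0
        vec[idx] += sign
    return vec
```

Each record's categorical tokens collapse into a fixed-length dense vector, so the input dimensionality no longer grows with the vocabulary size.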
Input vectors can be obtained when the data of the Display Advertising Challenge dataset are preprocessed using the feature preprocessing methods described above. In this experiment, after preprocessing, the dimensionality of the input vector is 1746088. Therefore, the number of nodes in the visible layer (input layer) is 1746088.
B. The Parameter Settings for Training FDNN and DBNLR
These two models are developed on the Theano framework [41] in Python. In all experiments, the maximum number of learning epochs in the training process is set to a sufficiently large value (800), and the Early-Stopping strategy is adopted to automatically stop training once the best validation score is obtained. Furthermore, this strategy also helps to avoid over-fitting.
The steps of the Early-Stopping strategy are as follows: 1. Split the dataset into a training set and a validation set; 2. When the training of every epoch has been finished, calculate the validation error (loss) in the validation dataset; 3. When the validation error of the validation dataset no longer decreases, record the number of epochs and the current validation error value. If the validation error of the validation dataset does not increase after continuation, the training process is terminated, the best validation score is obtained, and optimization is complete. Meanwhile, the optimal parameter estimation values are obtained.
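The steps above can be sketched as a generic training loop (the names and the `patience` window are our own; the paper only states that training stops once the validation error stops improving):

```python
def train_with_early_stopping(train_epoch, val_loss, max_epochs=800, patience=10):
    # Stop once the validation loss has failed to improve for `patience`
    # consecutive epochs; report the best epoch and its validation loss.
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch(epoch)
        loss = val_loss()
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best
```

In practice one would also snapshot the model parameters at each new best epoch so the final model corresponds to the reported best validation score.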
In the initial training process of FDBN and DBNLR, the gradient descent method with minibatch is used for training on these samples and each minibatch includes 2000 training samples. For FGBRBM (or GBRBM), the learning rate is 0.01, while for FRBM (or RBM) with the binary variable, the learning rate is 0.1. The decay factor of the weight value is 0.0002.
In the fine-tuning process of FDNN and DBNLR, the gradient descent method with minibatch is used for training on these samples and each minibatch includes 4000 training samples. The decay factor of the weight value is 0 and the learning rate is set as 0.005.
Dropout denotes the probability that a node is retained in the neural network [42,43]. A reasonable sparse dropout strategy can strengthen the robustness of a neural network, and dropout is utilized here to reduce over-fitting and the complexity of these two deep learning models. This paper sets the dropout rate to 0.5, 0.6, 0.7, 0.8 and 0.9 in the testing experiments. Under the same configuration structure, the optimal dropout rates for FDNN and DBNLR are 0.5 and 0.6, respectively.

C. Optimization of the number of hidden layers and hidden nodes
Currently, there is no method for quickly setting the number of hidden layers and the number of hidden nodes in each layer. Therefore, extensive experiments and experience are used to search for the optimal structure. According to past experience, when the number of nodes in the visible layer is large, the hidden layers need to perform "dimensionality reduction-dimensionality addition-dimensionality reduction"; when the number of nodes in the visible layer is small, the hidden layers only need to perform "dimensionality addition-dimensionality reduction". In general, the number of hidden nodes is increased or decreased by multiples. By comparing the performance of many experiments with different configurations (different numbers of nodes in each hidden layer and different numbers of hidden layers), the configuration with relatively optimal performance can be obtained.
The tested configurations are listed in S4 Table. As presented in S7 and S8 Figs, in the experiments for training FDNN, increasing the number of hidden layers initially improves the performance of the proposed method; however, when the number of hidden layers is more than 3, the performance degrades. This is because FDNN does not adequately capture feature interactions when the number of hidden layers is small, whereas when the number of hidden layers and the number of nodes in the hidden layers exceed a certain scale, cross-features interfere with and degrade the accuracy of CTR prediction. The over-fitting phenomenon is then observed, and the algorithm spends more time learning the parameters.
After many experiments have been completed, a configuration with relatively optimal performance is obtained: Conf11, the number of hidden layers is 3; the number of nodes of the first hidden layer is 170; the number of nodes of the second hidden layer is 1700; and the number of nodes of the third hidden layer is 17. This paper chooses this configuration as the final structure of FDNN. Corresponding AUC and log-loss on the test dataset can be seen in S9 and S10 Figs.
In the experiments for training DBNLR, relatively optimal performance is obtained with the following configuration: Conf7, in which the number of hidden layers is 3, the first hidden layer has 150 nodes, the second has 1500 nodes, and the third has 15 nodes. This paper chooses this configuration as the final structure of DBNLR. The corresponding AUC and log-loss on the test dataset are shown in S9 and S10 Figs.
We observe that the AUC curves of the DBNLR model and the FDNN model are interleaved. However, under the same configuration, the overall performance of DBNLR is worse than that of FDNN, and the same holds for the log-loss index. This is because fuzzy set techniques are introduced in the proposed FDNN model, whose components are the fuzzy RBM, fuzzy GBRBM, and fuzzy LR. Many records in the Display Advertising Challenge dataset contain outliers and missing values, which imply considerable uncertainty; fuzzy set techniques can take this uncertainty into consideration and handle it efficiently. Therefore, the FDNN model performs better than DBNLR for CTR prediction in the experiment.
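As a minimal illustration of the fuzzy set machinery (not the paper's FRBM or FGBRBM themselves, whose parameters are fuzzy numbers), a triangular membership function assigns each observed value a graded degree of membership rather than a crisp 0/1 decision, which is how noisy or uncertain inputs can be handled gracefully:

```python
def triangular_membership(x, a, b, c):
    """Membership degree of x in the triangular fuzzy number (a, b, c),
    where b is the peak and a, c bound the support."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# A noisy observation near the peak still receives high membership,
# while an extreme outlier is simply assigned degree 0.
degrees = [triangular_membership(x, 0.0, 1.0, 2.0) for x in (0.9, 1.0, 5.0)]
```

An outlier thus contributes little to the model rather than either corrupting training or forcing the record to be deleted, which is the intuition behind FDNN's robustness claim.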
Configuration optimization of other baseline models.
A. LR Solution
CTR prediction in a real advertising system involves a large number of samples. The logistic regression (LR) model can appropriately handle a large number of features and can be trained rapidly, so the LR model is selected as a baseline model in this paper. This paper selects LIBLINEAR [44] to implement L2-regularized logistic regression, and the stochastic gradient method (SG) is adopted to optimize the parameters of LR. In the LR solution, the numerical features are used directly, and the categorical features are first represented with one-hot encoding and then mapped into the feature space with MurmurHash3. After many experiments, this paper adopts the following parameters: η = 0.2, λ = 0.

B. FM Solution
For advertising CTR prediction, Factorization Machines (FMs) combine the advantages of Support Vector Machines with factorization models and can characterize the basic feature interactions in large amounts of advertising data; therefore, this paper selects the FM model as a baseline model [45]. This paper adopts LIBFM to implement factorization machines [46,47], and the stochastic gradient method (SG) is adopted to optimize the parameters of FM. The numerical features are used directly, and the categorical features are first represented with one-hot encoding. After many experiments, this paper adopts the following parameters: λ = 40, k = 30.
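The second-order FM score fitted by LIBFM above can be computed efficiently with Rendle's pairwise-interaction identity; this small NumPy sketch (with arbitrary toy parameters, not LIBFM's learned ones) shows the model form:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM score: y = w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j.

    The pairwise term uses the O(n*k) identity
      0.5 * sum_f [ (sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2 ],
    avoiding the explicit O(n^2) double loop.
    x: (n,) features; w0: bias; w: (n,) linear weights; V: (n, k) latent factors.
    """
    pairwise = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    return w0 + x @ w + pairwise
```

Because every feature i carries an embedding row V_i, a cross weight exists for feature pairs that never co-occur in training, which is the key advantage over explicitly parameterized second-order LR.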
C. GBDT+FM Solution
The Gradient Boosted Decision Tree (GBDT) is a nonlinear model based on the boosting concept of ensemble learning. GBDT has strong feature-extraction and nonlinear fitting abilities, it can flexibly handle the various types of data in the advertising dataset, including continuous and discrete values, and it is relatively resistant to over-fitting. Therefore, this paper selects the GBDT+FM solution as a baseline model. Based on the work of Y. Juan et al., this paper implements the GBDT+FM solution for performance comparison and analysis. First, in Preprocessing-A, all numerical features are used directly and all categorical features are represented with one-hot encoding. Then, these features are fed into the decision trees in GBDT to generate GBDT features. This paper selects XGBoost to implement the GBDT [48]. In Preprocessing-A, the number of decision trees in GBDT is 30 and the depth is 7; thirty features are generated for each impression. The following process, Preprocessing-B, generates the input features for FM. In Preprocessing-B, numerical features (I1-I13) with values greater than 2 are transformed by v ↦ ⌊log(v)²⌋; categorical features (C1-C26) that appear fewer than 10 times are transformed into a special value; and the GBDT features are included directly. Therefore, each record contains 69 features: 13 numerical features, 26 categorical features, and 30 GBDT features. These three groups of features are then hashed into 1 M dimensions using the hashing trick. Finally, the processed features are fed into the FM model for CTR prediction. The parameters of the FM model that achieve the relatively optimal results are λ = 40, k = 100.
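The two-stage idea above can be sketched with scikit-learn's `GradientBoostingClassifier` standing in for XGBoost, and plain logistic regression standing in for the FM stage; the data and sizes are synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Stage 1 (Preprocessing-A): 30 trees of depth 7, as in the setup above.
gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=7, random_state=0)
gbdt.fit(X, y)

# Each impression is described by the index of the leaf it reaches in
# every tree: one categorical "GBDT feature" per tree.
leaves = gbdt.apply(X)[:, :, 0]                 # shape (200, 30)
leaf_feats = OneHotEncoder().fit_transform(leaves)

# Stage 2: the one-hot leaf features feed a second-stage predictor
# (LogisticRegression here stands in for the FM used in the paper).
second = LogisticRegression(max_iter=1000).fit(leaf_feats, y)
```

Each tree acts as a learned nonlinear bucketizer, so the second-stage linear/FM model receives high-order feature interactions it could not discover on its own.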
Experimental results and analysis.
According to S9 and S10 Figs, the proposed FDNN method achieves the best performance and LR the worst. During model training, it can be observed that the logistic regression model trains rapidly, but its effectiveness is only average. One reason is that when LR is used to predict CTR, it requires experts with sophisticated experience to implement feature engineering by hand. The other reason is that the simple structure of LR restrains its representation ability. These two disadvantages mean that LR cannot learn complex feature interactions and cannot handle outliers and missing values well. Therefore, LR shows relatively poor performance in the experiment.
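For reference, the two indices used throughout this comparison can be computed with scikit-learn on toy predictions (the labels and scores below are illustrative only, not experimental results):

```python
from sklearn.metrics import log_loss, roc_auc_score

y_true = [1, 0, 1, 0, 1]             # toy click labels
y_pred = [0.9, 0.2, 0.7, 0.4, 0.6]   # toy predicted CTRs

auc = roc_auc_score(y_true, y_pred)  # higher is better; 1.0 = perfect ranking
ll = log_loss(y_true, y_pred)        # lower is better; penalizes miscalibration
```

AUC measures only the ranking of clicked above non-clicked impressions, while log-loss also penalizes poorly calibrated probabilities, which is why the two indices are reported together.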
According to S9 and S10 Figs, FM is superior in performance to LR. The reason is that FM extends the idea of LR: it adds second-order fitting on the basis of first-order fitting and can automatically learn cross features between any two features from different dimensions, with the cross features expressed as embedding vectors. Compared with LR, the FM model requires less manual intervention and achieves higher efficiency. Therefore, FM shows better representation ability than LR in the experiment. However, FM still cannot learn more complex feature interactions because of its relatively simple structure.
In this experiment, the GBDT+FM solution outperforms the FM model. The GBDT+FM solution obtains good prediction results, which are only slightly inferior to those of the proposed method. GBDT can learn useful high-order feature interactions because it has strong nonlinear fitting ability and robustness to hyperparameters; thus, it can efficiently extract discriminative features and cross-features from the advertising dataset and shows better performance than the FM model in our CTR prediction experiment.
Compared with the GBDT+FM solution, the proposed method achieves more than a 0.09% improvement in terms of AUC (0.01% in terms of log-loss) on the dataset. The reasons are as follows. First, the quality of the data exerts a great influence on model performance. Because the dataset contains much noisy data, the effectiveness of the GBDT+FM solution is bottlenecked. In contrast, the proposed FDNN model, endowed with fuzzy techniques, can capture more complex high-order feature interactions of the advertising dataset, which contains many outliers and missing values. Second, the GBDT+FM solution is a stacked combined model trained without backpropagation, and from a theoretical perspective such a model is not always better at improving CTR prediction than the proposed model, which is trained end to end with backpropagation. Therefore, the proposed method demonstrates relatively good performance in terms of both data representation capability and prediction accuracy. In a real production environment for an advertising system, even a small improvement in advertising CTR is likely to result in a significant increase in financial income for the internet company and an improved user experience.

Discussion
This paper proposes a fuzzy deep neural network (FDNN) to address the problem of advertising CTR prediction. The network's performance is compared with those of several baseline models in real advertisement datasets. Due to the fuzziness that is introduced into the proposed model, it can learn the most useful high-order feature interactions and demonstrates good performance in terms of both data representation capability and robustness in advertisement datasets with noise.
Big data brings new perspectives to these fields, along with challenges in processing and discovering valuable information. Traditional artificial intelligence techniques can no longer effectively extract and organize discriminative information directly from raw advertisement data [49]. This paper provides a basis for the further application of fuzzy deep neural networks in advertising CTR prediction. Future work will focus on exploring additional unsupervised feature learning and deep architecture models to improve CTR prediction for internet advertising.
In addition, in the big data era, a real production environment for advertising CTR prediction requires that the advertising system return prediction results within milliseconds or even microseconds. Determining how to implement low-latency, high-efficiency ad-hoc analysis to predict and return results over a big dataset is a great challenge for big data processing systems. Big data systems with stream computing, such as Spark Streaming, Storm, and Flink, can perform analysis and queries in real time. However, because of limited memory capacity and the loss of raw historical data in such systems, ad-hoc analysis and queries over a big dataset cannot be implemented. Therefore, exploring CTR prediction solutions based on streaming big data processing technology is another future direction of this work.