A CTR prediction model based on session interest

Click-through rate (CTR) prediction has become an active research direction in the field of advertising, and building an effective CTR prediction model is important. However, most existing models ignore the fact that a behavior sequence is composed of sessions: user behaviors are highly correlated within each session and largely unrelated across sessions. In this paper, we focus on users' multiple session interests and propose a hierarchical model based on session interest (SIHM) for CTR prediction. First, we divide the user's sequential behavior into sessions. Then, we employ a self-attention network to obtain an accurate interest representation for each session. Since different session interests may be related to each other or follow a sequential pattern, we next utilize a bidirectional long short-term memory network (BLSTM) to capture the interaction of different session interests. Finally, an attention-based LSTM (A-LSTM) is used to aggregate the session interests with respect to the target ad and measure the influence of each session interest. Experimental results show that the model performs better than other models.


Introduction
Click-through rate (CTR) prediction is a critical problem for ads and items in many applications such as online advertising and recommender systems [1,2]. It aims to estimate the probability that a user will click on a recommended item. The cost-per-click (CPC) [3] model is often used in advertising systems, where the accuracy of CTR prediction directly influences the final revenue. In many recommendation systems, the goal is to maximize the number of clicks, so recommended items can be ranked by estimated CTR.
It is important for CTR prediction to find feature interactions based on user behavior. However, most models fail to capture the user interest behind user behavior, which has an important influence on CTR prediction. In fields with rich internet-scale user behavior data, such as online advertising, user sequential behaviors reflect users' evolving interests. Some researchers overlook the intrinsic structure of these behavior sequences: a sequence is made up of multiple sessions, where a session is a list of user behaviors that occur within a given time frame. User behavior within each session is highly homogeneous, while behavior across sessions is heterogeneous. Grbovic et al. [4] proposed the session division principle that a new session starts after a time interval of more than 30 minutes. For example, a user may mainly browse shoes in the first half hour (session 1) and browse watches in the second half hour (session 2). In fact, a user usually has a clear and unique intent within a session, but the interest often changes when the user starts a new session.
Based on the above observation, we propose a hierarchical model based on session interest (SIHM) for CTR prediction, which uses multiple historical sessions to model the user's sequential behavior in the CTR prediction task. In the session division module, we naturally divide the user's sequential behavior into sessions. In the session interest extractor module, we apply a self-attention mechanism with bias coding to model each session; the self-attention mechanism captures the internal relationship between behaviors in each session. Since different session interests may be related to each other or follow a sequential pattern, we choose a bidirectional long short-term memory (BLSTM) network [5] to model the dependency between session interests in the session interacting module. Auxiliary tasks with a deep supervision strategy are employed to supervise the hidden states, which helps the model learn a more interest-encoded latent representation and enforces the hidden states to capture session interest. Because different session interests have different effects on the target item, we utilize an attention mechanism to achieve local activation and use an LSTM to aggregate the session interests with respect to the target ad to obtain the final representation of the behavior sequence.
The main contributions of this paper are as follows:
1. The user behavior within each session is highly homogeneous, while behavior across sessions is heterogeneous. We focus on users' multiple session interests and propose a hierarchical model based on session interest (SIHM) for CTR prediction, which yields more expressive interest representations and more accurate prediction results.
2. To effectively capture session interest, we devise a session interest extractor module and divide the user's sequential behavior into sessions. We employ a self-attention network to obtain an accurate interest representation for each session, and auxiliary tasks with a deep supervision strategy are employed to guide the hidden states. We use BLSTM to capture the interaction of different session interests, and then an attention-based LSTM (A-LSTM) aggregates the session interests with respect to the target ad in the session interacting module.
3. The experimental results demonstrate that our proposed model achieves clear improvements over other models. In addition, we explore the impact of key parameters, which verifies the validity of the SIHM model.
This work is organized as follows. In Section 2 we discuss the related work; Section 3 introduces the detailed architecture of the proposed SIHM model; Section 4 verifies the prediction effectiveness of the proposed model; and Section 5 summarizes this paper and discusses directions for future work.

Related work
Many models have been proposed for CTR prediction, which is usually treated as a binary classification problem. Logistic regression (LR) [6] is a linear model widely used in industry, and some researchers have built models based on LR [7] for CTR prediction. Jiang et al. [8] introduced a model named SAE-LR to extract abstract features and obtained better performance than LR. The advantages of linear models are simplicity and portability, but they are weak at capturing feature interactions. To overcome this limitation, the Factorization Machine (FM) [9] and its variants [10] are used to capture feature interactions. Field-aware factorization machines (FFM) introduce field-aware latent vectors to capture feature interactions. Liu et al. [11] proposed the FPENN model, which combines field-aware embedding and high-order feature interactions.
However, most of these models use shallow structures that have limited representation power for feature interactions.
Recently, thanks to their powerful ability of feature representation, deep neural networks have achieved great success in many research fields such as computer vision [12,13], image identification [14,15], and natural language processing [16,17]. Therefore, different kinds of deep neural networks have been applied to CTR prediction. Chen et al. [18] combined the powerful data representation and feature extraction capability of Deep Belief Nets with the simplicity of traditional logistic regression models. Zhang et al. [19] proposed the Factorization Machine based Neural Network (FNN), which uses FM to pre-train the embedding layer of a feed-forward neural network. The DeepFM model [20] uses FM to replace the wide part, with both components sharing the same input; it is considered one of the more advanced models in the field of CTR estimation. The Product-based Neural Network (PNN) model [21] is used for user response prediction; it utilizes a product layer to obtain feature interactions. Zhou et al. [22] proposed the DGRU model, which integrates DeepFM and GRU to improve prediction accuracy. The Feature Generation Convolutional Neural Network (FGCNN) [23] model was introduced to address feature interaction; it leverages the strength of CNNs to generate local patterns and recombines them to generate new features. Huang et al. [24] introduced a new model based on the Deep & Cross Network [25] that obtains better feature interactions, and an improved Cross Network [26] further replaces the cross vector with a cross matrix to make it more expressive. Convolutional Neural Networks (CNN) and Graph Convolutional Networks (GCN) have also been explored for feature interaction modeling. The Convolutional Click Prediction Model (CCPM) [27] performs convolution, pooling, and nonlinear activation repeatedly to generate arbitrary-order feature interactions. However, CCPM can only learn part of the feature interactions between adjacent features, since it is sensitive to the field order.
FGCNN improves CCPM by introducing a recombination layer to model non-adjacent features [28], and then combines the new features generated by the CNN with the raw features for the final prediction. Early deep CTR models alleviate human effort in feature engineering by incorporating simple MLPs.
In practical applications, different predictors usually have different predictive capabilities, and features that contribute more to the prediction results should be given greater weights. As is well known, the attention mechanism [29] is powerful at distinguishing the importance of features. Zhang et al. [30] proposed a novel framework called the Multi-Scale and Multi-Channel neural network (MSMC) to learn feature importance and feature semantics for enhancing CTR prediction. Wang et al. [31] improved FM with an attention mechanism to find the different importance of different features. Zhang et al. [32] proposed a deep CTR prediction model based on the attention mechanism, which makes use of the user's historical behavior. The High-order Attentive Factorization Machine (HoAFM) model [33] was proposed based on FM to determine the different importance of co-occurring features at the granularity of dimensions.
In addition to feature interactions, user interest also affects prediction results. Constructing a model that captures the user's dynamic and evolving interests from sequential behavior has been widely proven effective in CTR prediction tasks. The Deep Interest Network (DIN) model [34] captures user interest from historical behavior based on a DNN. The Deep Interest Evolution Network (DIEN) [35], proposed on the basis of DIN, not only obtains user interest features but also captures the evolution process of interest. The concept of a session often appears in sequential recommendation but is rarely seen in CTR prediction tasks. Session-based recommendation achieves good results by modeling the user's dynamically evolving interest: a personalized interest attention graph neural network (PIA-GNN) was proposed for session-based recommendation, using an attention mechanism to capture the user's purpose in the current session [36]; Zhang et al. [37] analyzed the current session information from multiple aspects and improved user satisfaction; and session-based recommendation [38] is often used to match user preferences based on session information. However, most existing studies for CTR prediction ignore that sequences are composed of sessions. Based on all these perspectives, we introduce a hierarchical model based on session interest (SIHM) to obtain better CTR prediction results.

Material and methods
We describe the SIHM model in this section. We first introduce feature representation and embedding in Section 3.1. Next, Section 3.2 illustrates the session division module. Then, we describe the session interest extractor module in Section 3.3 and the session interacting module in Section 3.4. Finally, we present the overall architecture of the SIHM model in Section 3.5.

Feature representation and embedding
We use four groups of features (User Profile, Scene Profile, Target Ad, and User Behavior) as input to the model. All four groups affect the CTR, but the user behavior features have the most important influence on the prediction results, and we mainly capture user interest from them. Each feature group is encoded as a sparse vector and then embedded into a dense representation.

Session division module
We divide the user behavior sequence X into sessions S to obtain the user's session interests. The k-th session is S_k = [b_1; ...; b_i; ...; b_T] ∈ R^(T×d_model), where T is the number of behaviors in each session and b_i is the user's i-th behavior in the current session. Following Grbovic et al.'s method, we place user behaviors more than 30 minutes apart into separate sessions.
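The 30-minute rule described above can be sketched as follows. This is a minimal illustration; the function and variable names are ours, not from the paper.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # Grbovic et al.'s 30-minute division rule

def split_sessions(behaviors):
    """Split a time-ordered list of (timestamp, item) pairs into sessions.

    A new session starts whenever two consecutive behaviors are more than
    30 minutes apart. Names here are illustrative only.
    """
    sessions, current, prev_time = [], [], None
    for ts, item in behaviors:
        if prev_time is not None and ts - prev_time > SESSION_GAP:
            sessions.append(current)  # close the previous session
            current = []
        current.append(item)
        prev_time = ts
    if current:
        sessions.append(current)
    return sessions
```

For the shoes-then-watch example above, browsing shoes around 10:00 and a watch at 11:00 would yield two sessions.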

Session interest extractor module
Behaviors within the same session are closely related to each other, while a few random behaviors in a session do not represent the session's true interest. We use a multi-head self-attention mechanism [39] to capture the inner relationship between behaviors in the same session and reduce the impact of irrelevant behaviors.
Multi-head self-attention can capture relationships in different representation subspaces. We write S_k = [S_k1; ...; S_kn; ...; S_kN], where S_kn ∈ R^(T×d_n) is the n-th head of S_k, N is the number of heads, and d_n = (1/N) d_model. The output of head n can be calculated as

head_n = Attention(S_kn W^Q, S_kn W^K, S_kn W^V) = softmax((S_kn W^Q)(S_kn W^K)^T / sqrt(d_n)) (S_kn W^V),

where W^Q, W^K, W^V are weight matrices. A feedforward network then further improves the nonlinear ability:

I_k = Avg(FNN(Concat(head_1, ..., head_N) W^O)),

where W^O is the weight matrix, FNN(·) is the feedforward neural network, Avg(·) is average pooling, and I_k is the user's k-th session interest.
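The extractor can be sketched in miniature as below. For brevity this sketch uses a single head with identity projections in place of the learned W^Q, W^K, W^V, and omits the feedforward layer; those simplifications are ours, not the paper's.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def self_attention_pool(S):
    """Scaled dot-product self-attention over one session S (a T x d list
    of lists), followed by average pooling into one interest vector I_k."""
    T, d = len(S), len(S[0])
    scale = math.sqrt(d)
    attended = []
    for q in S:
        # attention weights of behavior q over all behaviors in the session
        a = softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in S])
        attended.append([sum(w * v[j] for w, v in zip(a, S)) for j in range(d)])
    # average pooling over the T attended vectors -> session interest
    return [sum(row[j] for row in attended) / T for j in range(d)]
```

A real implementation would project into N subspaces of dimension d_model/N and concatenate the heads before the FNN.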

Session interest interacting module
We apply the BLSTM module to model the dependency between different session interests. Each LSTM unit [40,41] maintains a memory c_t at time t, with an input gate i_t (weight matrices W_xi, W_hi, W_ci), a forget gate f_t (weight matrices W_xf, W_hf, W_cf), and an output gate o_t (weight matrices W_xo, W_ho, W_co). The output h_t of the LSTM unit is

h_t = o_t ⊙ tanh(c_t),

where o_t is the output gate that modulates the amount of memory content exposure. The output gate is calculated as

o_t = σ(W_xo x_t + W_ho h_(t-1) + W_co c_t + b_o),

where σ is the logistic sigmoid function. The memory cell c_t is updated by forgetting irrelevant memory information and then adding a new memory state c̃_t:

c_t = f_t ⊙ c_(t-1) + i_t ⊙ c̃_t,

where the new memory state is defined as

c̃_t = tanh(W_xc x_t + W_hc h_(t-1) + b_c).

The forget gate f_t controls how much of the existing memory is forgotten, and the input gate i_t controls how much of the new memory content is added to the memory unit. The gates are computed as follows:

f_t = σ(W_xf x_t + W_hf h_(t-1) + W_cf c_(t-1) + b_f),
i_t = σ(W_xi x_t + W_hi h_(t-1) + W_ci c_(t-1) + b_i).

In the bidirectional architecture, there are two layers of hidden nodes from two separate LSTM encoders, which capture the dependencies in the two directions.
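The LSTM update described above can be sketched for a single scalar step as follows. Scalar weights and the omission of biases and peephole terms (W_ci, W_cf, W_co) are simplifications for illustration; the paper's unit operates on vectors with full weight matrices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    """One scalar LSTM step. W maps gate names to scalar weights:
    'xi','hi' (input gate), 'xf','hf' (forget gate),
    'xo','ho' (output gate), 'xc','hc' (candidate memory)."""
    i = sigmoid(W["xi"] * x_t + W["hi"] * h_prev)          # input gate
    f = sigmoid(W["xf"] * x_t + W["hf"] * h_prev)          # forget gate
    o = sigmoid(W["xo"] * x_t + W["ho"] * h_prev)          # output gate
    c_tilde = math.tanh(W["xc"] * x_t + W["hc"] * h_prev)  # new memory state
    c = f * c_prev + i * c_tilde                           # memory update
    h = o * math.tanh(c)                                   # hidden output
    return h, c
```

A bidirectional layer simply runs two such chains, one over the session-interest sequence in each direction, and combines their hidden states.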
The hidden state h_t can capture the dependency between session interests. However, session interests related to the target ad have a greater impact on whether the user will click it, so the weights of the session interests need to be reassigned with respect to the target ad. We apply an attention mechanism with an LSTM to model the representation of session interests and the target ad. Fig 1 shows the framework of the applied attention mechanism with LSTM (A-LSTM).
Here I'_t is the input of the A-LSTM and h'_t is its hidden state; the final interest state is h'_T. The attention function is formulated as

a_t = exp(X_I W_I h_t) / Σ_j exp(X_I W_I h_j),

where W_I has the corresponding shape and the attention score a_t reflects the relationship between the target ad X_I and the t-th session interest. We use the A-LSTM to account for the influences between the session interests and the target ad:

I'_t = a_t ⊙ h_t,

where h_t denotes the t-th hidden state, I'_t denotes the input of the second LSTM module, and ⊙ is the scalar-vector product.
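The reweighting step can be sketched as follows. For brevity the bilinear matrix W_I is taken as the identity, so the score reduces to a dot product; that substitution is ours, not the paper's.

```python
import math

def attention_weights(target, hidden_states):
    """Softmax attention between a target-ad embedding and BLSTM hidden states.

    Returns the scores a_t and the scalar-vector products a_t * h_t, which
    serve as inputs I'_t to the second (A-)LSTM. W_I is assumed identity.
    """
    scores = [sum(ti * hi for ti, hi in zip(target, h)) for h in hidden_states]
    m = max(scores)
    es = [math.exp(s - m) for s in scores]
    z = sum(es)
    a = [e / z for e in es]
    weighted = [[a_t * hj for hj in h] for a_t, h in zip(a, hidden_states)]
    return a, weighted
```

Hidden states aligned with the target ad receive larger weights and thus dominate the final interest state h'_T.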

The overall architecture of SIHM model
The structure of the SIHM model is shown in Fig 2. In the feature representation and embedding module, we use an embedding layer to transform informative features into dense vectors. To obtain the session sequence, we divide user behavior sequences into sessions in the session division module. In the session interest extractor module, we employ multi-head self-attention to reduce the influence of unrelated behaviors and capture the inner relationship between behaviors in the same session. In the session interest interacting module, we use BLSTM to model the interaction between session interests, and at the same time use the A-LSTM to model the representation of session interests with respect to the target ad. In the prediction module, the embeddings of sparse features and the captured session interests are concatenated and then fed into an MLP. Finally, the softmax function is used to obtain the probability that the user clicks on the ad.
The loss for the auxiliary task can capture more interest representation, and it also enforces the states of the BLSTM module to effectively learn the user interests. Let I_i denote the clicked interest sequence and Î_i the negative sample sequence, where I_i[t] denotes the t-th clicked vector of user i, Î_i[t] represents the corresponding negative sample at step t, and T denotes the number of the user's behaviors. The loss for the auxiliary task can be defined as

L_aux = -(1/N) Σ_i Σ_t [log σ(h_t, I_i[t+1]) + log(1 - σ(h_t, Î_i[t+1]))],

where σ is the sigmoid activation function and h_t is the t-th hidden state of the BLSTM network. The overall loss function is a negative log-likelihood function combined with the auxiliary loss:

L = -(1/N) Σ_(x,y)∈D [y log p(x) + (1 - y) log(1 - p(x))] + α L_aux,

where D denotes the training set of size N, p(x) denotes the probability that the user clicks on an ad, and α denotes the hyper-parameter used to balance the interest representation and the prediction of the CTR.

Experiments setting
Datasets. In this section, we conduct experiments on four datasets: Books and Electronics from the Amazon dataset [42], and two public datasets, Avazu and Criteo. The dataset statistics are shown in Table 1. Each dataset is randomly divided into three parts: a training set (80%), a validation set (10%) for tuning hyper-parameters, and a test set (the remaining 10%).
Evaluation metrics. We use three evaluation metrics in our experiments: AUC (Area Under the ROC Curve), Logloss, and RMSE (Root Mean Square Error). AUC is the area under the ROC curve [43], which is used to evaluate the performance of a two-class classifier; the larger the AUC, the better the model performs. Logloss measures the distance between predictions and labels in a binary classification problem; the smaller the Logloss, the better the model performs. RMSE [44] is defined as

RMSE = sqrt((1/|T|) Σ_(i,t)∈T (y_i^t - ŷ_i^t)^2),

where y_i^t is the observed score, ŷ_i^t is the predicted value, and T is the testing set. As with Logloss, smaller values are better.
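The two metrics with closed forms can be sketched directly. The pairwise AUC below is the standard rank-based formulation (ties counted as 0.5), not code from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error over the test set, as in the formula above."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)

def auc(y_true, p_pred):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranker scores AUC = 1.0, while random scoring averages 0.5.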
Parameter settings. We set the size of the hidden state in the LSTM to 48. Learning rates of 10^-4, 10^-3, 10^-2, and 10^-1 are tested. Also, different numbers of neurons per layer, from 100 to 800, are evaluated.

Comparisons with different models
This section compares the SIHM model with some of the most advanced current models for CTR prediction. Fig 3 shows the results of the different models, and Tables 2 and 3 show the Logloss and RMSE values, respectively. The following aspects can be noted from the comparison.
1. PNN introduces a product layer between the embedding layer and the fully-connected layer, and uses neural networks to learn feature interactions automatically. However, the model ignores low-order feature interactions, which are also important for CTR, so PNN does not achieve better performance.
2. DeepCross is a model that automatically combines features to produce superior predictions.

3. AFM is a CTR prediction model that can distinguish the importance of different feature interactions. As is well known, different feature interactions have different usefulness for the results. AFM performs better, which verifies that the attention mechanism can enhance model performance.
4. ADI captures interest-evolving processes from user behaviors and achieves higher prediction accuracy. However, the SIHM model still performs better than the others, as it uses multiple historical sessions to model the user's sequential behavior in the CTR prediction task. We can see that the SIHM model, based on session interest, improves accuracy on all datasets.

Sensitivity analysis of the model parameters
We investigate the influence of different parameters in the SIHM model, such as the number of epochs, the number of neurons per layer, and the dropout rate β.
Dropout is the probability of a neuron being kept in the network. We explore values of β from 0.1 to 0.7. In Fig 4, we can see that the SIHM model performs better when β is properly set (from 0.4 to 0.7). However, as β continues to increase, the performance of SIHM shows a downward trend. We set β to 0.5 in our experiments.

Conclusion
In this paper, we propose a hierarchical model based on session interest (SIHM) for CTR prediction. To obtain session interests, we divide the user's sequential behavior into sessions and design a session interest extractor module, in which a self-attention network is employed to obtain an accurate interest representation for each session. At the same time, auxiliary tasks with a deep supervision strategy are employed to help the hidden states encode the session interests. We use BLSTM to capture the interaction of different session interests, and then an attention-based LSTM (A-LSTM) aggregates the session interests with respect to the target ad to measure their influences. Finally, the embeddings of sparse features and the captured session interests are concatenated and fed into an MLP. The experiments demonstrate that the model achieves consistent improvements over state-of-the-art models. In the future, we will combine text features with image features [45] to build a CTR prediction model.