Using Dynamic Multi-Task Non-Negative Matrix Factorization to Detect the Evolution of User Preferences in Collaborative Filtering

Predicting what items will be selected by a target user in the future is an important function for recommendation systems. Matrix factorization techniques have been shown to achieve good performance on temporal rating-type data, but little is known about temporal item selection data. In this paper, we developed a unified model that combines Multi-task Non-negative Matrix Factorization and Linear Dynamical Systems to capture the evolution of user preferences. Specifically, user and item features are projected into latent factor space by factoring co-occurrence matrices into a common basis item-factor matrix and multiple factor-user matrices. Moreover, we represented both within and between relationships of multiple factor-user matrices using a state transition matrix to capture the changes in user preferences over time. The experiments show that our proposed algorithm outperforms the other algorithms on two real datasets, which were extracted from Netflix movies and Last.fm music. Furthermore, our model provides a novel dynamic topic model for tracking the evolution of the behavior of a user over time.


Introduction
Personalized recommender systems are widely used in e-commerce, such as by Amazon and Netflix, to identify interesting products for their customers. Recommendation systems are often based on Collaborative Filtering (CF) [1][2][3], which relies only on past user feedback data, e.g., users' previous transactions or item ratings. One type of user feedback data used in the CF literature is the explicit form of user-ratings, e.g., a 1-5 score. In this scenario, one of the challenges of CF of rating-type data is how to predict the missing values of the user-item score matrix effectively and efficiently, which has been explored widely [4][5][6][7]. However, in real-world scenarios most feedback is not explicit but implicit. Implicit feedback is tracked automatically, such as by monitoring clicks, view times, purchases, and other user activity. A common type of behavior in implicit feedback is user and item co-occurrence data reflecting user preferences, which we denote as selection data. There are many application scenarios in which a user can select the same item more than once, and we denote the behaviors in these scenarios as dynamic selection data. In the latter scenario, the data are usually divided into time steps, and the data in different time steps reflect different user preferences. Considering the dynamic selection data, an urgent and interesting question emerges: given the past behavior of the users when selecting items, how can the items that will be selected by the target user in the next time period be predicted?
Tracking user preferences is a key issue in the task of prediction. A main approach to model user preference is to use latent factor models, e.g., latent semantic models [8][9][10] and matrix factorization models [4,6], which learn a latent feature/factor vector for each user and each item in the dataset such that the inner product of these features minimizes an explicit or implicit cost function. This approach can also be considered as an example of co-clustering, where one cluster represents the item's latent factor and the other represents the user's latent factor. Fig 1(a) shows the result of performing topic modeling on user preferences without temporal considerations. The items selected by users u 1 , u 2 and u 3 are clustered according to topics 1 and 2 based on the weights (i.e., instances) of edges between users and items. The cluster of items is similar to the cluster of topic concepts. In fact, a user selects an item according to his preference, which is similar to words belonging to a document according to its topic. The topic concept clusters were derived using Probabilistic Latent Semantic Analysis (PLSA) [11] and Latent Dirichlet Allocation (LDA) [12]. These models exploit co-occurrence patterns of words in documents to unearth semantically meaningful probabilistic clusters of words. However, in practice, user preferences can change over time; therefore, this problem requires a method that is capable of modeling the temporal dependence of selections, i.e., a model that will allow us to provide the edge weights across time steps and preserve the topic distributions at different time steps. Such model could combine the temporal selections and construct dependencies between different time steps. See the scenario shown in Fig 1(b) as an example. We commonly refer to this type of models as temporal latent factor models.
In general, there are two types of temporal latent factor models for CF: temporal probabilistic topic models, such as the Dynamic Topic Model (DTM) [13], and dynamic latent factorbased matrix factorization approaches [5,7,14,15]. However, both of these models have shortcomings. For example, DTM does not consider the evolution of user behavior, which is the main assumption in our work. Specifically, we assume that the topic-item distributions remain static over time but the preferences of the users change over time. Considering the dynamic changes in the preferences of users, Sahooet et al. proposed a temporal topic-user model [16], that extended from static latent factors representing users' preferences in an Aspect Model [9] to dynamic latent factors with a Hidden Markov Model (HMM). Although this model considers dynamic changes in users' preferences, it cannot predict the state of the user every time because it usually places most of the probability mass over a few states [16]. In contrast, the matrix factorization approaches have always been focused on predicting users' ratings for items rather than users' selection data. As noted above, in the case of temporal rating, the user can only rate an item once, but in reality, a user can select the same item many times as the user's preference for the item changes over time. Therefore, these two types of models cannot be applied for modeling user item selection.
Fortunately, in recent years, Non-negative Matrix Factorization (NMF) has been extensively researched as a parts-based non-negative dictionary learning method [17]. Specifically, the non-negative constrain of NMF in latent factors space can be used to model a user's preferences. In fact, many researchers have adopted NMF and its variants to solve CF [18][19][20]. Historically, PLSA and NMF were developed independently, but researchers later proved that PLSA solves the problem of NMF with KL I-divergence [21,22]. To extend NMF to a dynamic latent factor model for selection data, Chua et al. [23] presented an algorithm called Dynamic Matrix Factorization (DMF), which combines NMF and Linear Dynamical Systems (LDS). However, DMF is not a unified model: it first uses NMF to generate an item-factor matrix as a global basis matrix for the observed data X Ã and then uses LDS to generate a state-transition matrix that governs the evolution of the factor-user matrices.
Motivated by DMF, we model dynamic selection data using a unified model that combines multi-task non-negative matrix factorization and a state transition matrix derived from LDS. Multi-task learning has become more popular because this method encourages learning tasks in parallel using a shared representation, which helps for each task learn better by using the other tasks' information [24][25][26]. The multi-task NMF is a special case of non-negative tensor factorization (NTF) [27,28], which is more flexible and useful in practice because it attempts to estimate only one common factor in the form of a basis matrix and some coefficient matrices over time [29]. In other words, the accumulative selection matrix X Ã is factorized into a common basis matrix for discovering item-factor and multiple factor-user matrices whose temporal relationships are governed by a transition matrix. The main contributions of this paper can be summarized as follows: 1. Our model provides a new method to embed common item factors and temporal user factors information into a unified model, in which user preference is tracked by a state transition matrix and every user has his/her own preference's evolution profile.
2. Our work, which recovers the evolution of latent users' factors can be interpreted as a special case of the dynamic topic model.
3. Compared with state-of-the-art methods, the proposed approach demonstrates superior performance in the prediction of future selection behavior.
The remainder of this paper is organized as follows: The 'Analysis' section introduces our model as well as the advantages and physical interpretations of the model. The multiplicative update algorithms for our model are also derived in this section. Experimental results from real world data are demonstrated and analyzed in the 'Results' section. Finally, conclusions are drawn in the 'Discussion' section.

Dynamic Multi-task NMF for CF
Here, we formulate the problem by focusing on the dynamic selection of data. Given K time periods {t 1 , t 2 , . . ., t K }, in every time period, the users provide selection information about their preferences for known items, i.e., there are matrices X k = {X 1 , X 2 , . . ., X K } for K time periods. In each X k , X k, ij = x represents a user i selecting x instances of an item j at time k. The accumulative selection matrix X Ã is obtained by X Ã = ∑ 1:K X k . The task of the time-sensitive recommender systems is to predict the items that a user will select in a future time period K + 1 given all of the selection matrices from the past K time periods.
In NMF, the goal is to find entrywise non-negative matrices W and H such that where the function L is a suitable lost function and the X is the observation data matrix. The function L can be Euclidian distance or divergence [17,30]. In our model, the selection matrix where M is the number of items, N is the number of users, and D represents the size of dimension of the latent factors space. The item-factor matrix W represents the projection from items space to latent factors space, whereas the factor-user matrix H represents the coefficients for a user's preferences for corresponding items [31]. We follow the classical assumption that the rows of the item-factor matrix W T i and the columns of the factor-user matrix H j follow Gaussian prior distributions [4][5][6], as defined below: Although the observation data X ij are integers, the residuals X ij À W T i H j should follow Gaussian distributions. Therefore, we define the conditional distribution over the observed data as where σ is the prior standard variance of the selection data. A direct and simple way to solve dynamic selection prediction is to perform NMF independently for each time step. However, this process reduces the predictive power because the lower ranking factors are not related across different time steps. In fact, we assume that item factors evolve very slowly and can be considered constant over time, whereas each user factors changes over time [32,33], i.e., given K related input data matrices X 1 , X 2 , . . ., X K , K coefficient matrices H 1 , H 2 , . . ., H K and a common basis matrix W can be factorized. According to the structure illustrated in Fig 2, we have a unified probabilistic graphical model, as shown in Fig 3. As shown in Fig 3, a classical multi-task NMF model can be used. Multi-task NMF was first proposed in [34] for extracting common gene profiles. In our model, multi-task NMF can be seen as a combination of several strongly related NMF tasks which share the common item-factors matrix defined below: where H k, j is the preference of user j during the k th time period.  Let us consider the relationship between H k − 1,j and H k, j , which represent the changes in a user's preferences from time period k − 1 to k. We adopt the idea of LDS [14,23], which represents the mapping of latent factors from time step k − 1 to k using a state transition matrix A k with added Gaussian noise defined as follows: where σ h is the prior standard variance of the factor-user latent variable and the matrix A k is different for each time period. LDS cannot ensure the non-negativity of each H k , which is required by NMF. To avoid negativity, we propose the DMNMF model to combine multi-task NMF and LDS into a unified model using the log of the posterior distribution over the item-factor matrix, factor-user matrices and state transition matrix, as defined below: where Const. is a constant that does not depend on the parameters. For the sake of simplicity, we fix A to avoid over-fitting and to capture interesting and important trends in the period.
Maximizing the log-posterior with hyperparameters (i.e., the observation noise variance and prior variances) is equivalent to minimizing the sum-of-squared-errors cost function with quadratic regularization terms: where C is the cost function, l ¼ s 2 =s 2 h , and k:k 2 F denotes the Frobenius norm. The first item in the cost function represents the minimization of the errors between the observed data and the recovered data using the NMF K times. The second item represents the minimization of the errors that occur while estimating the transition matrix over K − 1 transitions.
Once the state transition matrix A is obtained, the coefficient matrix of the next time period H K + 1 can be obtained by H K + 1 = A H K . Subsequently, we obtain the prediction function X K Here, X K + 1, ij = 1 indicates that item j most likely would be selected by user i in the time period K + 1.

Algorithm for DMNMF
Several algorithms have been proposed to solve the NMF problem, including multiplicative update (MU) [35], alternating least squares (ALS) [36], and projected gradient (PG) method [37], among others. Following [38], we adopt the MU algorithm to solve the optimization problem (7). To provide the MU algorithm for DMNMF, we first define the notations of two matrix operations ⊛ and , which represent element-wise multiplication and division, respectively. Let us consider the Karush-Kuhn-Tucker (KKT) conditions of (7), i.e., where and Substituting (Eq 17) into (Eq 14), we have Then, the MU rule for W is derived, i.e., Likewise, we can obtain the MU rule for A via (Eq 15) and (Eq 18).
There are three derived cases for H k . The first case is for k = 1, the second case for k 2 [2, K − 1], and the third case is for k = K Similarly, we can obtain the MU rule for H k via (Eq 16) and Eqs (22)(24) (23).
The proof is given in the appendix. Based on the fact that C is non-increasing, the convergence of the MU algorithm is guaranteed. The detailed steps of the MU algorithm for the recommender are listed in Algorithm 1.

Output:
Recommendation set X. end for 10: until the maximum number of iterations has been reached, or the change of cost function (7)

Results
In this section, we validate the effectiveness of DMNMF by comparing with NMF, HMM-CF and DMF-IA using two datasets, i.e., Netflix [39] and Last.fm [40], obtained from real-world data. Below, we discuss how the training sets and test sets are constructed and how to measure the performances of the algorithms before reporting the results. Each algorithm is trained on data up to a certain time period K. These algorithm tasks of algorithms are then used to predict what each user will select in time period K + 1.

Data Set
Netflix Dataset. Netflix has made available a dataset containing over 100 million ratings, containing 17,770 movies and approximately 480,000 users. The dataset consists of users' ratings for movies along with the timestamp of the rating and spans 86 months. Using this dataset, we predict which movies a target user will rate in a given test period. We define a month as the time period and select 24 months as the time span of the dataset, which divides the temporal training sets into 23 matrices and uses the 24 th month as the test set. For a test set, we consider predicting which movies the user will rate as predicting which movies he/she will watch. To keep the training dataset size manageable, we construct matrices V train by reading the first 10,000 movie record files and selecting users who have rated at least 500 movies. The process for creating the training dataset is outlined as follows, which results in a dataset with 1,015 users and 10,000 movies.
Last.fm Dataset. Last.fm is an Internet-based personalized radio station and music recommendation system. When the users of the service listen to music through a supported music player, Last.fm collects their music listening behavior. The data is used by Last.fm to make personalized music recommendations for their online radio station. This dataset contains time stamped records of users' music listening activity. It has 992 users and 177,000 artists. The process of constructing the Last.fm training and test datasets is similar to that of Netflix.
We denote the sparsity of the dataset using the formula sparsity ¼ where N is the total number of users, M is the total number of items and Item i is the number of item i that is selected by users. Summaries of the two training set of two datasets are shown on Table 1.

Comparison
In this experiment, the following methods are compared: 1. NMF: As a static algorithm, NMF is the baseline for experiments. We feed NMF using an accumulated matrix X Ã , which sums up the frequencies of user-selected items from all time. After matrix completion by multiplying the two factorized matrices, the top-n sorted elements in the reconstructive matrix X with values greater than zero are recommended.
2. HMM-CF: The estimated HMM-CF with data observed up to k can be used to compute the latent class distribution for each user in time period k + 1 and then compute the distribution over the observation of articles in time period k + 1. The probability that the item i will be observed in k + 1 can be computed as Then, the items that are most likely to be observed in period k + 1 can be recommended to the user [16]. HMM-CF assumes that the distribution of how many items will be selected by a certain user in a month is a specific negative binomial distributions (NBD); therefore, the recommended number taken from the NBD for each user is a parameter, that is arbitrarily set by selecting each user's top 5 or top 10 highest scoring items to recommended.
3. DMF-IA: According to the literature [23], DMF-IA is the variant of DMF that has the best performance. We first generate a basis matrix W using NMF; then, we use Kalman filtering to generate the fixed dynamic matrix A. Finally, the top-n sorted elements in the reconstructive matrix X, X = W A V K with values greater than zero are recommended.

DMNMF:
We run the DMNMF algorithm as described above. In DMNMF, the sorted elements in the reconstructive matrix X = W A H K with values are greater than the threshold are recommended.
The performances of all of the algorithms are measured by their precision and recall scores. Only the items that user i actually selected in the time period K + 1 are considered correct recommendations. The precision, P, of the algorithm is the fraction of the recommended set that is correct and is defined as below: where L is the number of all users who are in the recommended set, pItems i refers to the number of items recommended to user i, hitItems i refers to the number of items that are actually selected from the recommended set. The recall, R, of the algorithm is the fraction of the correct set that is recommended and is defined as below: where B is the number of all users who in the correct set and bItems i refers to the number of items that the user i actually selected from the correct set. If more items are recommended, the precision will decrease, but the recall will increase. The harmonic mean is called the F1 score, see (Eq 31). The higher the F1 score is, the better the prediction performance is [41].
To reduce the effect of randomness, we repeat these trials 500 times and compare the algorithms based on their average performances, as shown in Table 2. DMNMF significantly outperforms the other algorithms mainly because it captures the temporal changes in user preference in the unified model.
We are not only interested in precision and recall measures at a threshold or top-N quality of the recommended items but also the quality of the algorithms over the entire test dataset whose ranking score is produced by the algorithms. A receiver operating characteristic (ROC) curve is an intuitive way to compare multiple algorithms [42]. To draw an ROC curve, we split the sorted items score every 5 percentage and calculate the fraction of incorrect items recommended (false positive rate) and the fraction of correct items recommended (true positive rate). An ROC curve is a two-dimensional depiction of the prediction performance. We calculate the area-under-the-curve (AUC) to reduce the ROC performance to a single scalar value [43]. To generate confidence intervals for the AUC, we generate each ROC point 10 times and calculate the 95% confidence interval of the AUC using the method in [44]. According to Table 2, the average ROCs and the AUCs are shown in Fig 4. Fig 4 shows that DMNMF outperforms the other algorithms.
To obtain the runtime of these algorithms in Fig 4, when running the Netflix example, our new model takes 629s, which is comparable to HMM-CF (713s) and slightly slower than NMF (273s) and DMF-IA(395s).

Parameter Learning for DMNMF
In our algorithm, there are three parameters: the latent factor sizes D, regularization parameter λ and threshold τ in the DMNMF model. To choose appropriate and robust parameters for the model, we propose a strategy to estimate a stable scope of parameters for the time-serial datasets. Similar to the idea of n-fold cross-validation, we split the original dataset into 12 timecontinuous subsets for parameter validation. One subset contains 11 continuous month data as training sets and the following one month data as validation set. For example, for the first  Dynamic Multi-Task NMF for Recommender subset, the data set from the first month to the 11 th month are used as the training sets and the 12 th month is treated as the validation set, and for the second subset, the data set from the second to the 12 th month are used as the training sets ant the 13 th month is used as validation, and so on. As the original dataset is spanned of 24 months, we obtained 12 validation sets. The performances were then averaged over these validation sets.
For the sake of conciseness, we use the Netflix dataset as an example to demonstrate how the parameters were determined. Following the settings in the BPMF (Bayesian Probabilistic Matrix Factorization) model [7], which was run on the Netflix dataset, we set the D over the interval [10,100] with step sizes of 10. As show in Fig 5(a), the predictive performance achieved the optimal performance when D = 40. As the number of latent factors increased, overfitting occurred.
The role of regulation λ is to balance the multi-task NMF term and LDS term in the cost function. Similar to [6], l ¼ s 2 =s 2 h , indicating the ratio of the standard deviation of the observation data to the prior standard deviation of the factorized factor-user. Here, we choose λ from 0.001 to 0.01 with a step size of 0.001. As shown in Fig 5(b), λ is optimal at 0.005.
The threshold parameter τ controls the number of recommended items. The smaller the τ is, the larger the number of recommended items is. For example, when τ = 0.01, the number of predicting items is 202533 and the number of hitting items is 85813. Although the precision of the algorithm is 42.37%, each user is recommended over 200 items, which is called over-recommended in the recommender. Whereas when τ = 0.04, the number of predicting items is 22298 and the number of hitting items is 2898. Although the precision decreases substantially, only 22 items on average are recommended to each user, which results in the recommender providing a better user experience. We choose τ from 0.01 to 0.1 with a step size of 0.01. As shown in Fig 5(c), the threshold is optimal at 0.03. The confidence intervals of the precision and recall are omitted from Fig 5(c) for the sake of clarity.
The parameter learning for the Last.fm dataset is very similar to that of the Netflix dataset, which is shown in Fig 6. We investigate the convergence of our DMNMF. Fig 7 shows the convergence curve of DMNMF and its cost function. The values of the cost function C with factorization ranks of 10, 20 and 30 were plotted. As shown in Fig 7, the non-increasing nature of C is obvious, and it drops very fast after a few iterations. However, the DMNMF model cannot ensure that the global minimum of the cost function is obtained. Therefore, there may be multiple local minimums, which depends on the initial points. Nevertheless, our numerical experiments showed that different initial values generate very similar results, which implies that the initial value might have only a small impact on the performance of the algorithm. Because the DMNMF cannot ensure that the global minimum is obtained when starting from a random initial condition, we examine the effects of the initial points on the final optimal solution. We randomly chosen the initial points 500 times to obtain the optimal solution of the DMNMF model. After initializing W, H 1 , . . ., H K and A with random numbers in [0, 1], we obtain the optimal factored matrices corresponding to the minimum value of the object function C.

Case Study
A main feature of the DMNMF formulation is the use of dynamic matrix A to capture the evolution of a user's latent factors h i, k from one time step to the next time step. The latent state at k is given by h i, k = Ah i, k−1 . The k th factor in h i, k is derived from the dot product of the k th row of A and h i, k−1 . The largest value in the k th row of A plays an important role in accounting for the value of the k th latent factor in h i, k . The state transition matrix of the two real datasets is shown in Fig 8. In Fig 8, the darker the state, the larger the value of the element in A. To further illustrate the effectiveness of our DMNMF model, a case study is demonstrated using the Netflix dataset. For the sake of the clear visualization of the state transition matrix, we choose a dimension size of 15. We randomly select a user whose ID is 1371451(we name him 'John') and explain the evolution of his latent factors from the first month to the 23 th month (Jan. 2004 to Nov. 2005). We extract a certain column corresponding to John from every factor efficient H k and combine these into a matrix, then, we plot the matrix as an evolution of the preference profile that belongs to John. As shown in Fig 9, the larger the value of the element in the column, the darker the block and the more likely the corresponding preference factor. As time passes, the user's changing preferences are represented by the largest element in the column. We find that John became active in selecting movies in the first 3 months and was dormant for a while. Then John became active in the 23 rd month, which is consistent with his behavior on rating movies in the training set.
We regard a factor as a 'topics' describing the user's preference regarding certain properties of movies and select certain movies from corresponding topics that appear more black than others in Fig 9. For example, the top 10 largest elements (movies) are extracted from the corresponding column in the basis matrix W, as shown in Table 3. Factor 2, factor 7, factor 11 can be interpreted as 'Action', 'Romantic comedies', 'Family', respectively. We demonstrate an example in which John shifted his preference from 'Action' to 'Romantic comedies' from Jan. 2004 to Nov. 2005. The W 2,1 entry in the 1 st column of the preference graph has the second highest value of 1.73, whereas the others have a mean value of 0.12; thus, we infer that John was interested in topic 2. Likewise, the W 7,23 entry in the 23 rd column of the preference graph has the highest value of 2.09, which indicates that John was interested in topic 7.

Discussion
In collaborative filtering, item selection prediction is applied more widely than item rating prediction. This paper proposes an effective unified model called DMNMF to discover the latent factors behind users' selection behaviors and capture the transition of user preference in latent factor space. We develop an MU algorithm to solve DMNMF. Experimental results on popular CF databases demonstrate that our proposed algorithm outperforms NMF, HMM and DMF as well as their extensions. As noted in the results section, our model has a larger time cost than the traditional NMF and DMF-IA models, mainly due to the longer time required for each iteration in our model compared with the others. Finally, our model can be used as a novel form of dynamic topic models for tracking the evolution of user preferences over time. In the future, we will develop a hybrid method that integrates the tracking of evolution of user preferences and the similarities between user preferences.
where S ¼ Qðh t ðkÞ Þ À ðW T W þ lA T A þ lI Þ ð37Þ To prove S is semipositive definite, consider the matrix: M ab ðh t ðkÞ Þ ¼ h t ðkÞ;a ðSÞ ab h t which is just a rescaling of the components of S. Then S is semipositive definite if and only if M is, and we denote