Abstract
The trajectory of a user's continuous online access, which manifests as a sequence of dynamic behaviors during online purchases, constitutes fundamental behavioral data. However, a comprehensive computational method for measuring trajectory similarity and thoroughly analyzing user behavior remains elusive. Analyzing user behavior sequences requires balancing detail with data reduction while addressing challenges such as excessive spatial complexity and potential null results in predictions. This study addresses two critical aspects: first, it evaluates similarity in the time dimension of user behavior sequence clustering; second, it introduces a frequent sub-trajectory mining algorithm that emphasizes the order of user visits for trajectory analysis and prediction. We employ a variable-order Markov model to manage the growth of the probability matrix in forecasts. Additionally, we improve prediction accuracy by aggregating the time spent on specific web pages.
Citation: Wang X, Liu D-F (2025) Pattern mining and prediction techniques for user behavioral trajectories in e-commerce. PLoS One 20(5): e0320772. https://doi.org/10.1371/journal.pone.0320772
Editor: Takayuki Mizuno, National Institute of Informatics, JAPAN
Received: October 7, 2024; Accepted: February 25, 2025; Published: May 16, 2025
Copyright: © 2025 Wang, Liu. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data relevant to this study can be accessed from the zenodo repository at https://doi.org/10.5281/zenodo.15064002. The key code for this study is available on GitHub at https://github.com/wx88dfl/Pattern-Mining-and-Prediction-Methods-for-User-Behavior-Trajectories-in-E-Commerce.
Funding: This work was supported by the High-Level Talents Scientific Research Startup Project of Nanjing Vocational College of Information Technology (Grant No. YB20220602).
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Fueled by the Internet, big data, and cloud computing, we have entered an era of rapid information proliferation. The extensive use of these technologies greatly enriches information resources and elevates the societal significance of information in everyday life. From the inception of e-mail to today's rapid growth of social media, the behavioral patterns of Internet users have undergone a significant transformation, and the Internet has penetrated all facets of professional, commercial, and everyday life. Nevertheless, this exponential expansion of information also presents novel obstacles. While users derive considerable convenience from the Internet, they also face information overload. Identifying one's true interests within the vast amount of available information has grown progressively harder, a phenomenon called "selection phobia." Information providers now face the pressing challenge of differentiating themselves amid intense competition and effectively capturing and retaining consumers' attention. The advent of prediction systems offers a novel approach to this issue. Analyzing multi-dimensional data, including users' behavior, knowledge, and interests, enables personalized content prediction that can achieve mutually beneficial outcomes for users and information producers. The prediction system operates in conjunction with the main website and achieves its purpose by thoroughly analyzing user data.
The implementation of prediction algorithms on e-commerce platforms such as Taobao, Jingdong, and Dangdang has shown impressive outcomes, greatly enhancing operational profitability, and has garnered considerable interest from both industry and academia. Prediction algorithms, the fundamental components of prediction systems, have consistently been a focal point of investigation. While conventional prediction algorithms relying on data statistics may meet customers' requirements to some degree, they are inadequate for the increasingly individualized needs of today's consumers.
The emergence of community e-commerce presents novel prospects and obstacles for the advancement of prediction systems. Community e-commerce differs from conventional e-commerce platforms by prioritizing the individually tailored requirements of customers. Its user groups are more stable and exhibit more frequent purchasing patterns. This necessitates that the prediction system can more precisely capture the individualized attributes of customers and offer more refined prediction services. The limited data nature of community e-commerce renders typical prediction algorithms based on big data ineffective. Therefore, it is necessary to investigate novel algorithms and approaches to address the tailored prediction requirements of community e-commerce.
This study aims to investigate the present state and difficulties of a personalized prediction system for community e-commerce, examine the distinctions between it and a conventional e-commerce platform prediction system, and provide potential remedies. The prediction system will assume a more prominent position in community e-commerce, yielding an enhanced consumer purchasing experience and concurrently generating increased commercial value for enterprises.
Analyzing personalized consumer interest is essential in the e-commerce industry for enhancing marketing effectiveness and user satisfaction. Bhatnagar et al. [1] investigated how user interest manifests as users navigate e-commerce websites and established the temporal scope of that interest. Their results indicate that personalized marketing is most effective when implemented during the window of interest, particularly for users who reach an e-commerce platform via a search engine; this window is considerably longer than for users arriving through an online banner advertisement. Nevertheless, such studies often treat user interests as fixed, disregarding their dynamic, real-time character. To capture real-time user interest, Ding et al. [2] related users' inherent utility to purchase probability by analyzing the hierarchy of web pages. Assuming a positive correlation between the level of user interest and the likelihood of purchase, a hierarchical Bayesian model was employed to determine the real-time interest of individual users, who were classified into two states: high intent and low intent.
Furthermore, Li and Ding [3] developed a linear utility function model to determine that an individual user's real-time interest is influenced not only by previous browsing behavior but also by the combined impact of web pages and marketing stimuli. Alternatively, Tam and Ho [4] contended that the contact between an individual consumer and an e-commerce platform can serve as a persuasive force to alter the consumers’ implicit interests and impact decision-making based on the business objectives of the online retailer. A novel web page personalization approach is proposed, incorporating preference matching levels, page suggestion set sizes, and sorting cues. These elements enable e-commerce merchants to manipulate customization to cater to individual consumers’ unique cognitive requirements.
The mining and analysis of customer purchase records in enterprise information systems is a crucial method for uncovering the underlying knowledge of consumer behavior. Utilizing such data, researchers can develop models to forecast consumer behavior and offer decision-making assistance to businesses. Liu et al. [5] employed diverse machine learning models to assess over 100 features from new-consumer data collected during Taobao's 'Double Eleven' period, ultimately achieving rather accurate prediction results. Their study highlights the significance of forecasting consumer counts for large-scale e-commerce events and illustrates the application of machine learning to understanding and predicting consumer behavior during such events. Li and Zhang et al. [6] examined the repeat purchase intention of new customers and showed that integrating the SMOTE algorithm with the Random Forest algorithm yields superior prediction performance compared to a single method, emphasizing the capacity of algorithm fusion to enhance prediction accuracy.
In contrast, Shen et al. [7] employed a highly explanatory tree model derived from the Alibaba e-commerce platform to forecast the characteristics of recurring customers. The researchers developed the prediction model and provided a comprehensive explanation of the model, including feature importance, partial dependency graph, and decision rule parameters. The significance of this study is in its ability to create both predictive outcomes and a comprehensive comprehension of the results, playing a critical role in formulating corporate strategy.
The advent of sequential recommendation algorithms has presented novel prospects for developing personalized recommender systems. Unlike conventional recommendation algorithms, sequence recommendation algorithms consider the temporal aspect of user behavior and the significance of each interaction. They aim to uncover the user's implicit interaction intentions from sequenced past behavior, thus capturing the user's dynamics and potential interests within a specific timeframe. Research on sequence recommendation algorithms is mostly categorized into statistical machine learning and deep learning [8]. Classic sequence recommendation systems, such as Markov chain-based recommendation, use the user's history sequence to forecast the next item they would click on. Rendle et al. [9] integrated first-order Markov chains with matrix factorization (MF) to represent users' first-order sequence characteristics. He et al. [10] addressed the limitations of the first two models by integrating similarity computation with higher-order Markov chains.
Conversely, deep learning-based sequence recommendation approaches began with the multilayer perceptron (MLP) and progressively incorporated neural network models that had proven successful in other fields into recommendation algorithms [11]. Convolutional neural networks (CNNs), recurrent neural networks (RNNs), graph neural networks (GNNs), and attention mechanisms are representative examples. For instance, the model in [12], built on Text-CNN [13], strengthens the connections between item features by combining convolutional outputs along different orientations. The GRU4Rec [14] and GRU4Rec+ [15] models are built upon Gated Recurrent Units (GRUs) [16], which infer a user's probable intention by analyzing their interaction sequences and predicting their future output.
Furthermore, Chang et al. [17] employed metric learning to reconstruct the user's feature representation. The SASRec [18] algorithm is founded on the self-attention mechanism, effectively utilizing the user's interest weights in the current interaction. The Ali team's BERT4Rec [19] is built upon the bidirectional Transformer [20], showcasing robust processing and computation capacity for vast data. Similarity-based prediction methods improve prediction efficiency and reduce system processing by dividing users into separate groups and developing a trajectory prediction model for each group [21,22]. These approaches' primary focus is determining the distance between trajectories efficiently. Examples of such methods include Dynamic Time Warping (DTW) [23], Longest Common Subsequence (LCSS) [24], Edit Distance [25], and Euclidean Distance [26]. Agrawal et al. [27] performed comprehensive experimental investigations of current trajectory distance algorithms. Meanwhile, Liu et al. [28] employ social propagation theory to categorize users based on the similarity of their mobile trajectories and develop a second-order Markov trajectory prediction model. Quehl and colleagues [29] introduce a trajectory prediction approach that integrates trajectory similarity with heuristics to determine decision-making parameters automatically; this method is distinguished by its minimal spatio-temporal complexity.
In the study domain of user behavior trajectory prediction, existing methods depend on data-driven analytics to make predictions by analyzing user patterns in trajectory sequences. Nevertheless, these approaches have several evident limitations:
- Limitations of user behavior analysis: Current approaches frequently lack comprehensiveness and scope in evaluating user behavior, limiting accurate comprehension of it. At the same time, they disregard the significance of semantic and temporal characteristics, which are essential for enhancing prediction accuracy.
- User clustering complexity: The computation technique of similarity in user clustering is highly intricate. Current approaches typically necessitate the human specification of feature dimensions for various websites, which not only lacks universality but also presents challenges in terms of automation.
- Prediction accuracy limitations: Current approaches often fail to include users’ past trajectories in the user behavioral context while predicting new trajectories, significantly decreasing prediction accuracy.
- Challenges in computational efficiency and results: Several current approaches for forecasting user behavioral trajectories encounter issues related to significant computational resources and incorrect prediction outcomes, which greatly impact the overall effectiveness of these systems.
In response to these difficulties, this paper presents a set of novel approaches designed to enhance the precision and effectiveness of predicting user behavior trajectories:
- In-depth multi-dimensional feature analysis: This work explores the geographical, temporal, and semantic characteristics of user behavioral trajectory data and presents a formal model for representing user trajectory sequences.
- Advanced clustering algorithm: This paper presents an innovative clustering technique that utilizes clickstreams and custom events to enhance comprehension and analysis of user clicking activity. The overall dataset is first subjected to density-based clustering in the clustering procedure. Subsequently, the clustering results of the resulting multiple periods are combined based on the origin of the points in the clustering results. This integration ensures the consistency of the coding rules, enabling direct comparison and computation of the clusters. Furthermore, by partitioning user sessions into distinct session clusters, we may effectively discern user behavior patterns more precisely.
- Trajectory prediction method based on Hidden Markov Model: We present a Hidden Markov Model-based trajectory prediction approach to address the limitations of current methods in handling semantic information in trajectory data. This model combines spatial and semantic variables to increase the accuracy of trajectory prediction.
- We exploited the similarity clustering technique of trajectory points and the transformation algorithm of common sub-trajectories to extract the ideal hidden state of the model effectively.
- Utilizing semantic properties, we made predictions about the regions visited by users by analyzing the semantic characteristics of mobile trajectories. Additionally, we verified the relationship between users’ clicking behavior on web pages and their purchase intention.
- By integrating trajectory point and area types prediction models, we enhance the precision of predicting the user's next destination, greatly improving the forecast's accuracy.
The paper is organized as follows. Section 2 is devoted to foundational principles. In Section 3, we evaluate clustering techniques for page-type sequences. Section 4 introduces a time-space semantics-based approach for trajectory prediction. Section 5 illustrates experimental analysis. The paper ends with the conclusions and future work in Section 6.
Methods
This study involves the behavioral analysis of users and the application of data mining techniques.
- The data used in this study is derived from publicly available datasets, ensuring that all research activities are conducted within the scope of regulated consent.
- Anonymized information is utilized to conduct the research, ensuring the privacy and confidentiality of the data subjects.
- This research does not pose any harm to human participants and does not involve sensitive personal information or commercial interests. The study adheres to the principles outlined in the Declaration of Helsinki. Furthermore, it complies with the ethical exemption requirements specified in the “Ethical Review Measures for Life Science and Medical Research Involving Humans” promulgated in China, qualifying it for an exemption from ethical review.
2. Foundational principles
In the context of web pages, a clickstream record refers to a comprehensive series of sequentially arranged pages that a user loads while visiting a website or application. It includes access details for each page, such as the page name, load time, and dwell time, as well as overall information such as the order in which the pages load, the number of pages, the access time, and the total time.
2.1 User behavior trajectories.
Given the clickstream data acquired without altering the application source code, the user session record based on clickstream can be abstracted in the following manner.
Abstracted user session record:

Definition 1 (user behavior trajectory sequence). Let

$S_T = \langle (p_1, t_1), (p_2, t_2), \ldots, (p_n, t_n) \rangle$

denote the sequence of user behavioral trajectories in the period $T$, where $p_i$ is the $i$th page viewed by the user during that period, $t_i$ is the duration of time spent on that page, and the sequence generally satisfies $\sum_{i=1}^{n} t_i \le |T|$. $p_n = \mathrm{Null}$ describes an empty page, indicating the termination of the user's access.

Deduction: Dynamic pattern recognition of a user behavior trajectory sequence is primarily achieved by selecting suitable statistical indicators and determining an accurate time-observation granularity. Changes in the user's behavioral patterns are then described and calculated over

$\mathcal{S} = \{ S_{T_1}, S_{T_2}, \ldots, S_{T_m} \},$

which denotes a compilation of sequences characterizing user behavior trajectories over consecutive periods.
The definition provided above refers to the page as a loaded object for analysis. We classify page types in the context of online shopping, as defined in Table 1:
The type marker * indicates that pages of specific product categories are nested under the type page. The type sequence of the corresponding pages is therefore defined as

$TS_T = \langle (c_1, t_1), (c_2, t_2), \ldots, (c_n, t_n) \rangle,$

where $T$ denotes the period of the page-type behavioral trajectory sequence, $c_i$ represents the $i$th page type visited by the user in that period, and $t_i$ is the time spent on the visited page, satisfying $\sum_{i=1}^{n} t_i \le |T|$. Using the type-sequence representation of pages narrows down the number of page-type sequences and simultaneously reduces the complexity of the subsequent analysis.
Preprocessing of the type sequence of the page
User behavior data needs to be preprocessed. The processing rules for the type sequence of the page are shown below:
- 1). If the dwell time $t_i$ of a page in the type sequence $\langle (c_1, t_1), \ldots, (c_i, t_i), \ldots, (c_n, t_n) \rangle$ is less than a threshold $t_{\min}$, the user probably clicked on the page by mistake or is not interested in it, and the page should be removed from the type sequence. In the experiment, the threshold $t_{\min}$ can be set according to actual needs, for example 15 seconds;
- 2). If the dwell time $t_i$ is greater than a threshold $t_{\max}$, the page probably failed to load or the user has left, so the page should likewise be removed from the type sequence. In the experiment, $t_{\max}$ can be set according to actual needs, for example 30 minutes;
- 3). If the type sequence is longer than $N$ (pages may repeat), the page sequence was likely generated aimlessly by the user; such a sequence has no analytical value and is removed from the data. In the experiment, $N$ is set to 30. When there are duplicate sub-page sequences, they can be represented as super nodes to simplify the type sequence of the page. An example of super-node merging is shown in Fig 1, where the three page nodes within the dashed line form the common duplicate page-type sequence and are merged into super node W.
A null end page is included in the user session to determine the residence time of the last page. The load time of that page is then considered as the end time of the series. The page's type sequence can be split, with the option to exclude irrelevant page types and extract the pertinent page types and dwell periods from the existing page's type sequence into a new page's type sequence based on the product type of the page where the order was made.
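The preprocessing rules above can be sketched in Python. The pair representation and function name are illustrative, not taken from the paper's code; the thresholds follow the examples in the text (15 seconds, 30 minutes, N = 30).

```python
# Preprocessing of page-type sequences (Section 2.1): rules 1-3.
# A sequence is a list of (page_type, dwell_seconds) pairs.
T_MIN = 15         # rule 1: shorter dwells are treated as misclicks
T_MAX = 30 * 60    # rule 2: longer dwells suggest a failed load or departure
MAX_LEN = 30       # rule 3: longer sequences are considered aimless

def preprocess(seq):
    """Return the cleaned sequence, or None if the whole sequence is discarded."""
    cleaned = [(p, t) for p, t in seq if T_MIN <= t <= T_MAX]  # rules 1 and 2
    if len(cleaned) > MAX_LEN:                                  # rule 3
        return None
    return cleaned

sample = [("home", 3), ("list", 40), ("detail", 120), ("cart", 2000)]
print(preprocess(sample))  # the 3 s and 2000 s pages are dropped
```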
2.2 Similarity between type sequences
Without considering dwell time, our focus is solely on the order of consecutive visits to pages. Computing the similarity between two distinct access sequences is then equivalent to computing the similarity between two strings, which can be accomplished via the longest common subsequence (LCS). Common substrings can be classified into two groups: contiguous and non-contiguous; a non-contiguous substring (a subsequence) is derived by extracting items from the original sequence without requiring adjacency. Hence, the conventional dynamic programming method applies to finding the non-contiguous maximal common substring, i.e., the LCS. Consider two page type sequences represented as follows:
$A = \langle (a_1, t^A_1), \ldots, (a_m, t^A_m) \rangle, \qquad B = \langle (b_1, t^B_1), \ldots, (b_l, t^B_l) \rangle.$

Split each page type sequence into two parts, i.e., a page type sequence and a dwell time sequence:

$A_p = \langle a_1, \ldots, a_m \rangle,\ T^A = \langle t^A_1, \ldots, t^A_m \rangle; \qquad B_p = \langle b_1, \ldots, b_l \rangle,\ T^B = \langle t^B_1, \ldots, t^B_l \rangle.$

From the two page type sequences, the public page type sequence is extracted as their longest common subsequence,

$C = \mathrm{LCS}(A_p, B_p) = \langle c_1, \ldots, c_k \rangle,$

and the corresponding dwell times on the common pages are denoted $t^A_i$ and $t^B_i$ for each $c_i$.

We introduce the notion of access depth: the access depth of a page type is the serial number of its first visit in the sequence, and later visits to the same page type keep the depth of the first visit. For instance, in the sequence $\langle A, B, C, C \rangle$, A's access depth is 1, B's is 2, C's is 3, and C's access depth remains 3 on the second instance. The access depths on the common pages are denoted $d^A_i$ and $d^B_i$.

The temporal similarity of each visit to a public page is

$sim_t(i) = \frac{\min(t^A_i, t^B_i)}{\max(t^A_i, t^B_i)},$

and the similarity in depth for each visit to a public page is

$sim_d(i) = \frac{\min(d^A_i, d^B_i)}{\max(d^A_i, d^B_i)}.$

Apparently, $0 < sim_t(i) \le 1$ and $0 < sim_d(i) \le 1$: if two users stay on the same page for a similar amount of time at a similar depth, their interest in that page is closer (i.e., the similarity is closer to 1). Therefore, the similarity of two different users A and B on the same visited page type can be defined as

$sim(c_i) = sim_t(i) \cdot sim_d(i).$

Finally, letting $D_A$ and $D_B$ be the sums of the depths of the original page type sequences, the total similarity is defined as

$Sim(A, B) = \frac{\sum_{i=1}^{k} (d^A_i + d^B_i) \cdot sim(c_i)}{D_A + D_B}.$
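The similarity computation can be sketched end-to-end in Python. This is a minimal reading of Section 2.2: the LCS is found by standard dynamic programming, access depth is the serial number of a page type's first visit, and the depth-weighted normalization by the summed depths of the original sequences is one plausible interpretation of the (partly lost) original formula, not a definitive reproduction.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    pairs, i, j = [], m, n
    while i and j:                       # backtrack through the DP table
        if a[i - 1] == b[j - 1]:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def depths(pages):
    """Access depth: serial number of the first visit to each page type."""
    seen, d = {}, []
    for p in pages:
        seen.setdefault(p, len(seen) + 1)
        d.append(seen[p])
    return d

def similarity(seq_a, seq_b):
    """seq_*: list of (page_type, dwell_seconds) pairs; returns a value in [0, 1]."""
    pa, ta = [p for p, _ in seq_a], [t for _, t in seq_a]
    pb, tb = [p for p, _ in seq_b], [t for _, t in seq_b]
    da, db = depths(pa), depths(pb)
    common = lcs_pairs(pa, pb)
    if not common:
        return 0.0
    total = 0.0
    for i, j in common:
        sim_t = min(ta[i], tb[j]) / max(ta[i], tb[j])   # temporal similarity
        sim_d = min(da[i], db[j]) / max(da[i], db[j])   # depth similarity
        total += (da[i] + db[j]) * sim_t * sim_d        # depth-weighted contribution
    return total / (sum(da) + sum(db))                  # normalize by summed depths

a = [("home", 30), ("list", 60), ("detail", 120)]
b = [("home", 25), ("detail", 100)]
print(round(similarity(a, b), 3))
```

By construction the measure is 1 for identical sequences and shrinks as common pages are visited at different depths or dwell times.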
3. Clustering techniques for sequences of page types
To streamline cluster coding, which is computationally demanding and time-consuming, this section explores clustering schemes that further reduce the complexity of sequence analysis. Datasets from various periods are first merged into a unified dataset, and density-based clustering is applied to the whole. The clustering results are then split according to the originating period of each point. If the number of points in a cluster obtained after this separation is lower than the minimum cluster size used in the cluster analysis, the cluster label is replaced with Noise. In this way, the sequences can be grouped and partitioned into several independent clusters, simplifying their examination. Nevertheless, the conventional DBSCAN (Density-Based Spatial Clustering of Applications with Noise) method grows clusters primarily by considering the density of the surrounding neighborhood, i.e., the number of other objects in the vicinity.
The algorithm has two primary issues: (1) sequence similarity cannot be quantified using a conventional distance function and requires modification; (2) sequences sharing multiple sub-sequences may be mutually similar, making it difficult to assign them to classification sub-clusters. This paper therefore introduces an enhanced DBSCAN algorithm with the following modifications:
- (1). Substitute the ε-neighborhood with an r-neighborhood, where r (0 ≤ r < 1) is the threshold for sequence similarity; the r-neighborhood of a core object O is the set of objects whose similarity to O is greater than r. The similarity may be computed using the formula defined in Section 2.2.
- (2). The type sequence of the page is pre-processed by splitting the category sequence, ensuring that each sequence is placed in the cluster with the highest similarity and tightening the cluster formation conditions. The initial criterion for cluster formation is the number of elements in the cluster: if it is too small, the cluster is classified as Noise. Since the present study takes the sequence as the processing unit and aims at a higher processing granularity, the criterion for cluster merging is adjusted: if the number of common elements in the r-neighborhoods of two clusters $C_m$ and $C_n$ is less than λ times the size of the smaller cluster, i.e., less than $\lambda \cdot \min(|C_m|, |C_n|)$, the two clusters are not merged.
Clustering algorithm for different time periods based on cluster analysis

Input: time-period datasets $PTS_1, \ldots, PTS_i, \ldots, PTS_n$; similarity threshold r; merging threshold λ
1: Consolidate $PTS_1, \ldots, PTS_i, \ldots, PTS_n$ into the overall dataset PTS
2: Perform the clustering calculation RDBSCAN(PTS, r, λ)
3: for each cluster j of the overall dataset PTS do
4:  for each sequence s in cluster j do
5:   if s originates from one of $PTS_1, \ldots, PTS_i, \ldots, PTS_n$ then
6:    label s with cluster j in its originating dataset
7: for each dataset $PTS_1, \ldots, PTS_i, \ldots, PTS_n$ and each of its clusters m do
8:  if the number of points in cluster m is less than MinPts then
9:   label cluster m as Noise
Returns: clustering results for each time-period dataset
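The per-period relabeling in steps 3–9 can be sketched as follows. The data layout (label-to-member mapping, period identifiers) and the function name are illustrative assumptions, not the paper's code.

```python
from collections import defaultdict

def split_by_period(clusters, min_pts):
    """clusters: {label: [(period_id, seq), ...]} -> {period_id: {label: [seq, ...]}}.
    Per-period fragments smaller than min_pts are demoted to the 'Noise' label."""
    per_period = defaultdict(dict)
    for label, members in clusters.items():
        for period, seq in members:               # steps 3-6: split by origin
            per_period[period].setdefault(label, []).append(seq)
    result = {}
    for period, labelled in per_period.items():
        kept, noise = {}, []
        for label, seqs in labelled.items():
            if len(seqs) < min_pts:               # step 8: too few points
                noise.extend(seqs)                # step 9: demote to Noise
            else:
                kept[label] = seqs
        if noise:
            kept["Noise"] = noise
        result[period] = kept
    return result
```

This keeps the cluster coding of the merged run consistent across periods, so clusters from different periods can be compared directly by label.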
The clustering algorithm is as follows:
RDBSCAN(PTS, r, λ) algorithm

Inputs: overall dataset PTS, similarity threshold r, merge threshold λ
1: Mark all objects in PTS as not visited
2: Define S as a cluster set and initialize S as empty
3: for each unvisited object p in PTS
4:  mark p as visited
5:  if the r-neighborhood of p contains at least MinPts objects
6:   create a new cluster $C_i$ and add p and the objects in p's r-neighborhood to $C_i$
7:   add $C_i$ to S
8: end for
9: for every pair of clusters $C_m$ and $C_n$ in S (m ≠ n)
10:  if the common elements of $C_m$ and $C_n$ number more than λ times the smaller of $|C_m|$ and $|C_n|$
11:   then merge $C_m$ and $C_n$ into a new cluster C and add C to S
12:  else keep $C_m$ and $C_n$ as two separate clusters
13: end for
14: Output S
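The pseudocode above can be turned into a runnable sketch. This is a simplified reading: `sim` stands for the sequence similarity of Section 2.2 (any symmetric similarity in [0, 1] works for illustration), `min_pts` is MinPts passed explicitly, and objects are addressed by index.

```python
def rdbscan(objects, sim, r, lam, min_pts):
    """Cluster `objects` using similarity function `sim`; r is the similarity
    threshold, lam the merge threshold, min_pts the minimum neighborhood size.
    Returns a list of clusters, each a set of object indices."""
    clusters, visited = [], set()
    for i, p in enumerate(objects):
        if i in visited:
            continue
        visited.add(i)
        # r-neighborhood: indices of objects whose similarity to p exceeds r
        neigh = {j for j, q in enumerate(objects) if sim(p, q) > r}
        if len(neigh) >= min_pts:        # p qualifies as a core object
            clusters.append(neigh | {i})
            visited |= neigh
    # merge any two clusters whose overlap exceeds lam times the smaller one
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                overlap = clusters[a] & clusters[b]
                if len(overlap) > lam * min(len(clusters[a]), len(clusters[b])):
                    clusters[a] |= clusters.pop(b)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

With a toy similarity `sim(p, q) = max(0, 1 - |p - q|)` over scalar "objects", nearby values fall into the same cluster and the merge step leaves well-separated groups untouched.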
After executing the above algorithm, the resulting clusters are named; after naming, clusters are merged when the interval between the end time of a sequence in one cluster and the start time of the next is less than SE. Each trajectory is then serialized and formalized into the following structure:

$Tr = \langle (w_1, \Delta t_1), (w_2, \Delta t_2), \ldots, (w_n, \Delta t_n) \rangle,$

where $w_i$ is the renaming of the cluster in which the sequence is located, referred to as a trajectory node in the trajectory sequence, and $\Delta t_i$ is the period of the interval between neighboring nodes of the sequence, i.e., the difference between the end time of one node and the start time of the next.
4. A time-space semantics-based approach for trajectory prediction
4.1 Frequent trajectories generation.
This approach is founded on trajectory sequences, which requires integrating the trajectory points within the sequences together with the semantic and temporal characteristics of the scene during merging. We first introduce the notion of support in a trajectory sequence.

Definition 2 (Support). The support of a trajectory node in a dataset of n trajectories is the number of trajectories that contain that node.
Frequent trajectory tree generation is defined as follows
To build a frequent trajectory tree, it is necessary to create tree-like storage by systematically traversing the past trajectory data.
The nodes’ attributes can be adjusted to meet the user's specifications, and the time intervals between adjacent nodes are indicated on the tree structure of the nodes. The tree structure is constructed by traversing all trajectory data sets once. The specific structure construction process is as follows:
- 1). Starting from the beginning of a trajectory, its nodes are turned into branches of the tree; when a node of the trajectory serves as a common node for multiple trajectories, it is treated as a bifurcation node in the tree;
- 2). During tree construction, the node name, support, and pointer of each node are continually recorded. When a node from a trajectory is constructed in the tree for the first time, the header pointer is set to that node; each later occurrence of the same node is linked, in construction order, to the next occurrence of that node in the tree;
- 3). The interval times in the trajectory are labeled on the corresponding branches of the tree, in sequence order.
As an illustrative example, consider a sequence of three trajectories. The tree trajectories are arranged in series, and a table is established on the periphery of the tree structure: its first column holds the point's name, the second column the support value, and the third column the pointer. The example is shown in Fig 2, where the three trajectories are merged and reorganized into a tree structure.
Frequent pattern mining: Frequent trajectory mining is performed by partitioning the trajectories into three collections: candidate trajectories Can[], frequent trajectories F[], and final trajectories R[]. First, all trajectory points whose support exceeds the minimum support level are identified. Second, all sub-trajectories that meet the specified criteria at their endpoints are identified and combined into Can[]. Third, every combined sub-trajectory in Can[] is verified against the minimum support criterion and, if it passes, transferred to F[]. Fourth, the sub-trajectories in F[] are verified again under the user's time constraints, and those satisfying the requirements are included in R[].
The approach described in this section utilizes an array to hold the conditional tree. During the traversal of the tree structure, the sub-trajectories obtained are stored in the array, enhancing the mining efficiency. The conventional algorithm for mining frequent trajectory patterns only computes the highest frequency sub-trajectories. However, this algorithm cannot satisfy the need to increase the frequent sub-trajectories during dynamic trajectory prediction. Therefore, this method is enhanced to generate frequent sub-trajectories within a specified frequency range tailored to various requirements.
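The tree construction and mining can be sketched as follows. This is a simplified illustration, not the paper's implementation: the Can[]/F[]/R[] staging and the time constraints are collapsed into a single counting pass over contiguous sub-trajectories, and the `Node` layout (name, support, children) mirrors the tree-with-support structure of Fig 2 without the header-pointer table.

```python
from collections import defaultdict

class Node:
    """Prefix-tree node: name, support count, and child nodes keyed by name."""
    def __init__(self, name):
        self.name, self.support, self.children = name, 0, {}

def build_tree(trajectories):
    """One pass over the trajectory data builds the shared-prefix tree;
    a node's support is the number of trajectories passing through it."""
    root = Node("root")
    for traj in trajectories:
        node = root
        for point in traj:
            node = node.children.setdefault(point, Node(point))
            node.support += 1
    return root

def frequent_subtrajectories(trajectories, min_sup):
    """Return contiguous sub-trajectories whose support is at least min_sup."""
    counts = defaultdict(int)
    for traj in trajectories:
        # count every contiguous sub-trajectory at most once per trajectory
        subs = {tuple(traj[i:j]) for i in range(len(traj))
                for j in range(i + 1, len(traj) + 1)}
        for s in subs:
            counts[s] += 1
    return {s: c for s, c in counts.items() if c >= min_sup}

trajs = [["A", "B", "C"], ["A", "B", "D"], ["B", "C"]]
freq = frequent_subtrajectories(trajs, min_sup=2)
```

In this example the shared prefix A→B is stored once in the tree with support 2, and mining with `min_sup=2` surfaces (A, B) and (B, C) while discarding the infrequent (A, B, C).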
4.2 Calculation of state transfer matrix.
Building on the analysis above, the frequent sub-trajectories and the trajectory points they contain make it feasible to calculate the likelihood that the user reaches the intended destination, based on the state transfer matrix within the trajectory. This section addresses the computation of that state transfer matrix.
A Markov process of higher order: This work adopts a higher-order Markov process to address the variability in the probability of a trajectory reaching the current node, which may be influenced by factors beyond the current node position. In contrast to the first-order Markov process, the higher-order Markov process considers the correlation between the positions of the k preceding historical nodes visited by the user and the target node to be visited. Although this replicates the user's visit trajectory more accurately, considering the trajectory of historical visited nodes raises two challenges: first, the size of the transfer matrix increases exponentially; second, the training data set may not contain a given history, resulting in null prediction results. Thus, this study employs a variable-order Markov process to forecast outcomes from histories of varying lengths. Considering the last two positions yields more transfer combinations, so second-order Markov modeling requires a larger transfer probability matrix. The probability of the user's next access position when predicting the access trajectory via second-order Markov modeling can be specified as:
P(s_k | s_i, s_j) = N(s_i, s_j, s_k) / Σ_m N(s_i, s_j, s_m)

where the numerator N(s_i, s_j, s_k) denotes the number of passages in the training data set from node s_i through node s_j to node s_k, and the denominator sums, over every node s_m, the number of passages that start at node s_i and pass through node s_j to node s_m.
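This count ratio can be estimated directly from the training trajectories. The sketch below is a minimal, illustrative estimator; the function and variable names are assumptions, not the paper's implementation.

```python
from collections import defaultdict

def second_order_transitions(trajectories):
    """Estimate P(next | prev2, prev1) as a ratio of observed triple counts
    to pair counts, matching the count-ratio formula in the text."""
    triple = defaultdict(int)
    pair = defaultdict(int)
    for traj in trajectories:
        for i in range(len(traj) - 2):
            a, b, c = traj[i], traj[i + 1], traj[i + 2]
            triple[(a, b, c)] += 1
            pair[(a, b)] += 1
    # Return an estimator: 0.0 when the (prev2, prev1) context was never seen.
    return lambda a, b, c: triple[(a, b, c)] / pair[(a, b)] if pair[(a, b)] else 0.0

p = second_order_transitions([["A", "B", "C"], ["A", "B", "C"], ["A", "B", "D"]])
# p("A", "B", "C") == 2/3
```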
Given the potential for null prediction outcomes in the conventional Markov model, we propose a variable-order Markov model with partial matching rules. The method uses the frequent-tree structure to address the sparsity of the transfer probability matrix, limits the number of model orders by imposing an upper bound on the order value, and incorporates an escape mechanism to handle null prediction probabilities. Under the escape mechanism, for an observed sub-trajectory π, the candidate next nodes are split into two sets: A(π), the nodes that have been observed immediately after π in the training data, and B(π), the nodes that have not. Part of the probability mass is allocated to an escape event so that nodes in B(π) still receive a non-zero probability through the lower-order context. Writing N(πx) for the number of times node x follows π and N(π) for the total number of continuations of π, the probabilities may be expressed as:

P(x | π) = N(πx) / (N(π) + |A(π)|), for x ∈ A(π)

P(esc | π) = |A(π)| / (N(π) + |A(π)|)

P(x | π) = P(esc | π) · P(x | π′), for x ∈ B(π)

where π′ denotes the suffix of π, i.e., the lower-order context obtained by dropping the earliest node of π. The remaining probabilities are defined in the same way, recursively, at each lower order.
During the model learning phase, the tree structure T is first built from the frequent trajectory data in the training set. If the maximum order of the model is set to N, the height of the tree is N + 1. Analogous to a trie, each node carries a label recording the frequency of its occurrence in the trajectories, and the branches of the tree correspond to the trajectory sequences. The calculation is executed recursively, at each step using the exact position reached in the context and escaping to the next-lower order when necessary.
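The variable-order fallback with an escape mechanism can be sketched as follows. This is a simplified, PPM-style illustration that rebuilds its counts on every call; the names and the exact probability assignment are assumptions, not the paper's trained model.

```python
from collections import defaultdict

def ppm_predict(trajectories, context, node, max_order=2):
    """Variable-order prediction with an escape mechanism: fall back to
    shorter contexts when the current context was never followed by `node`,
    multiplying in the escape probability at each level."""
    follow = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for order in range(1, max_order + 1):
            for i in range(len(traj) - order):
                ctx = tuple(traj[i:i + order])
                follow[ctx][traj[i + order]] += 1

    prob = 1.0
    for order in range(min(max_order, len(context)), 0, -1):
        ctx = tuple(context[-order:])
        nexts = follow[ctx]
        total = sum(nexts.values())
        if node in nexts:
            return prob * nexts[node] / (total + len(nexts))
        if total:  # escape to the next-lower order
            prob *= len(nexts) / (total + len(nexts))
    return 0.0

trajs = [["A", "B", "C"], ["A", "B", "C"], ["B", "D"]]
# With context ["A", "B"], node "C" is observed after the full context,
# so no escape is needed: P = 2 / (2 + 1)
```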
4.3 Forecasting the trajectory of shop categories.
Category prediction using Markov models: Given a user's visits over n periods, a prediction is generated about the type of shop visited in the next period using a Markov model. Once a user accesses a web page, the choice of the subsequent page may be influenced by the tags and ribbon settings on the current page: the next page is typically opened by clicking a link on the current one.
Given a sequence of trajectory points s_1, s_2, …, s_n, where each point s_i may cover m_i commodity types accessed by the user, the number of possible shop-type sequences is:

m_1 × m_2 × … × m_n

That is, multiplying the numbers of commodity types at each point enumerates all the shop-type sequences that are consistent with the trajectory the user traverses.
Shop Type Weighting Calculation

Input: sequence of candidate trajectories T; predicted set of possible commodity types H

Output: weights W of shop types for the candidate trajectory sequences

1: W ← 0 // initialization
2: for each trajectory t in T do // candidate trajectory sequences
3:   for each node i in t do
4:     if c_i in H(i) then // determine if the shop type covered by the candidate location is in the returned result
5:       W(c_i) ← W(c_i) + 1 // add the shop type weighting for the node
6:     end if
7:   end for
8: end for
9: return W

The type of shop contained in the i-th node is denoted by c_i.
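A direct translation of this weighting procedure might look as follows. The encoding of nodes as (node, shop_type) pairs and the per-position predicted-type sets are assumptions made for illustration.

```python
from collections import defaultdict

def shop_type_weights(candidates, predicted_types):
    """Weight each shop type by how often it appears at candidate trajectory
    nodes whose type is among the predicted commodity types for that position.
    candidates: list of trajectories, each a list of (node, shop_type) pairs.
    predicted_types: dict mapping position index -> set of predicted types."""
    weights = defaultdict(float)
    for traj in candidates:
        for i, (node, shop_type) in enumerate(traj):
            if shop_type in predicted_types.get(i, set()):
                weights[shop_type] += 1.0
    return dict(weights)

cands = [[("p1", "food"), ("p2", "apparel")],
         [("p1", "food"), ("p3", "food")]]
pred = {0: {"food"}, 1: {"food", "apparel"}}
w = shop_type_weights(cands, pred)
# w == {"food": 3.0, "apparel": 1.0}
```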
4.4 Candidate trajectory prediction algorithm
A spatial-semantic prediction approach for hidden Markov page trajectories is developed by selecting the hidden states, solving for the optimal hidden-state sequence, and predicting the type of shop to be visited. Each node in the trajectory sequence computes its visit probability after merging the shop types. A parameter α controls the ratio between the shop-type weight and the node visit probability in the resulting output; α ranges from 0 to 1, combining the two components as:

score(v) = α · W(v) + (1 − α) · P(v)

where W(v) is the normalized shop-type weight of candidate node v and P(v) is its visit probability.
Line 1 of the algorithm initializes the parameters, and lines 2–3 determine the length of the trajectory sequence. On line 5, the Viterbi algorithm decodes the optimal sequence of hidden states for the given trajectory sequence. All nodes that may be visited next are then traversed, their visit probabilities are calculated, and the nodes are ranked on lines 6–9. Finally, the algorithm returns the candidate nodes.
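The Viterbi decoding step on line 5 can be illustrated with a standard implementation. The matrices and state indices below are illustrative toy values, not the paper's trained model.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Decode the most likely hidden-state sequence for an observation
    sequence obs, given initial probabilities pi, transition matrix A,
    and emission matrix B (standard Viterbi with backtracking)."""
    n_states = len(pi)
    T = len(obs)
    delta = np.zeros((T, n_states))      # best path probability ending in each state
    psi = np.zeros((T, n_states), dtype=int)  # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
path = viterbi([0, 1], pi, A, B)  # most likely hidden states for the observations
```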
5. Experimental analysis
An e-commerce application was chosen as the test target; the App was developed by a Shenzhen-based information technology business. The data-collection function was built into the App's source code, which was recompiled and deployed on 28 test devices. Twenty-eight users were assigned to carry out focused testing on different business modules within the application. The clickstream data are analyzed at the granularity of individual pages within the App, which are categorized by type. The pages users access are gathered and transformed into page trajectories on a monthly cycle, and these page trajectories are then converted into page-type trajectories based on the page types. The total number of converted page-type trajectories is 1358. The neighborhood density threshold is defined as the natural logarithm of N, where N is the number of trajectories. The similarity threshold is set to 0.7, and the merge threshold to 1/2.
The ratio inside a cluster is determined based on the count of page-type sequences included in the cluster.
The clustering clusters are initially encoded at the character level to streamline the mining and analysis of page-type sequences.
The App can be categorized into four modules based on function: shopping, communication, service, and personal center. During the experiment, several users tested several modules simultaneously. The distribution of users' tests is depicted in Fig 3.
Additionally, the clusters have been renamed to streamline the sequence, as indicated in Table 2.
The 1358 page-type trajectories were grouped into 16 clusters according to each user's trajectory. When the interval between two page-type trajectories was less than 1 hour, the trajectory sequence was reconstructed using clusters as nodes. The resulting trajectory sequence consisted of 348 trajectories.
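The 1-hour reconstruction rule can be sketched as follows; the input format of timestamped cluster visits and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta

def rebuild_sequences(visits, gap=timedelta(hours=1)):
    """Merge consecutive cluster visits into one trajectory sequence when the
    time gap between them is under the threshold, starting a new sequence
    otherwise. visits: (timestamp, cluster) pairs sorted by time."""
    sequences, current = [], []
    last_time = None
    for ts, cluster in visits:
        if last_time is not None and ts - last_time >= gap:
            sequences.append(current)
            current = []
        current.append(cluster)
        last_time = ts
    if current:
        sequences.append(current)
    return sequences

v = [(datetime(2024, 1, 1, 10, 0), "A"),
     (datetime(2024, 1, 1, 10, 30), "B"),
     (datetime(2024, 1, 1, 12, 0), "C")]
# visits within an hour of each other merge: [["A", "B"], ["C"]]
```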
This subsection describes the evaluation metrics and the comparison algorithms.
This work presents a prediction approach that calculates the likelihood of visiting each trajectory node. The forecast results are based on the top k candidate nodes with the highest probability. The trials employ the following evaluation measures [30] to assess the reliability of the prediction approach and validate its effectiveness:
- (1). P@1 represents the likelihood that the top-ranked candidate position is correct.
- (2). P@6 represents the likelihood that the correct position is among the first six candidate nodes generated by the prediction model.
- (3). Mean Reciprocal Rank (MRR) represents the average of the reciprocal ranks of the obtained results: MRR = (1/n) Σ_{i=1}^{n} 1/rank_i, where rank_i is the position of the correct node in the i-th ranked candidate list.
Mean Absolute Error (MAE) was not used as an evaluation metric in the tests because the trajectory prediction approach presented in this paper outputs the labeled values of trajectory nodes rather than precise spatial coordinates, making this measure unsuitable for the research conducted here.
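A minimal computation of the P@k and MRR metrics described above might look as follows (illustrative data; rank is 1-based and a miss contributes zero to MRR).

```python
def evaluate(rankings, truths, k=6):
    """Compute P@1, P@k, and MRR from ranked candidate lists.
    rankings: one ranked candidate list per query; truths: the correct node
    for each query."""
    p1 = pk = rr = 0.0
    for ranked, truth in zip(rankings, truths):
        if ranked and ranked[0] == truth:
            p1 += 1
        if truth in ranked[:k]:
            pk += 1
            rr += 1.0 / (ranked.index(truth) + 1)
    n = len(truths)
    return p1 / n, pk / n, rr / n

ranks = [["A", "B", "C"], ["B", "A", "C"], ["C", "D", "A"]]
truth = ["A", "A", "X"]
# P@1 = 1/3, P@6 = 2/3, MRR = (1 + 1/2 + 0) / 3 = 0.5
```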
Representative trajectory prediction algorithms include the Mobility Markov Chain (MMC), the Markov model, and the Hidden Markov Model. The usefulness of the trajectory prediction method described in this paper is validated through experimental comparison against such methods.
This section compares the proposed trajectory prediction algorithm STS with six benchmark methods:
- (1). The Hidden Markov Model (HMM) is a well-referenced approach in trajectory prediction [31].
- (2). The LHMM technique is well-suited for the modeling and analysis of spatio-temporal trajectory data [32].
- (3). MMC: the Mobility Markov Chain, capable of simulating the movements of mobile users [33].
- (4). Temporal Predictive Analysis (TPA) considers the time-related characteristics of the trajectory and can provide more precise forecasts of paths inside the campus [34].
- (5). STM: a prediction method based on trajectory similarity [30].
- (6). HST-LSTM: a trajectory prediction technique based on long short-term memory neural networks [35].
The present study proceeds to conduct an empirical investigation of the suggested trajectory prediction model:
- We examine how various trajectory lengths impact the model's predictive performance.
- We investigate the influence of the number and size of frequent sub-trajectories on the prediction performance.
- We validate an experimental prediction model incorporating temporal, spatial, and semantic information.
The effects are compared against the benchmark algorithms.
5.1 Influence of trajectory length on empirical findings.
Within the context of analyzing trajectory sequences of varying lengths with the Hidden Markov Model, this subsection examines the impact of trajectory length on the prediction accuracy of the Markov model. A total of 348 trajectory sequences were generated, with most trajectory lengths concentrated between 4 and 11. Table 3 shows the impact of trajectory length on the results.
The assessment metrics of the experiment remained rather consistent regardless of the variation in trajectory length: P@1, P@6, and MRR stayed within 13%–14%, 36%–41%, and 26%–29%, respectively. The stability of the Markov model's forecast performance across trajectory sequences of varying lengths is attributed to its independence from the complete historical trajectory, which aligns with the inherent nature of Markov models.
5.2 Impact of frequent sub-trajectories on prediction accuracy.
Frequent sub-trajectories partially represent user preferences and encapsulate group movement patterns, so they significantly influence users' click trajectories. Hence, this study validates the correlation between the size and number of frequent sub-trajectories and the accuracy of model predictions. The investigations involve frequent sub-trajectories with sizes ranging from 3 to 6. The experimental results are depicted in Fig 4.
Fig 4 demonstrates that the model's predictive capability improves as the number of frequent sub-trajectories increases. This can be attributed to the reduction in the number of states in the prediction method's transfer probability matrix and the increase in matrix density, which mitigates the issue of null prediction results. The results further indicate that predictive capability is better when the tests employ many shorter sub-trajectories, specifically of length 3: as the length of the sub-trajectories grows, the number of trajectories in the historical data set matching a sub-trajectory pattern of that length decreases, and this lack of matching data leads to null results. The accuracy of the prediction approach is thus influenced by both the size and the number of frequent sub-trajectories. In the subsequent experiments on the real data set described in this study, the number of frequent sub-trajectories is set to 20 and their length to 3.
5.3 Analysis of the accuracy of shop-type forecasts.
The order in which a user browses different types of shops on a website indicates, from a deeper semantic standpoint, their preferences and purchase intentions. Here, we validate the accuracy of the Markov model in predicting the semantic information of the shop types visited by mobile users. The experiments are trained and evaluated on the generated shop category data. Among the 348 entries in the prediction results, most trajectory sequences had between 4 and 11 trajectory points, accounting for approximately 90% of all trajectories; trajectories shorter than 4 reflect movement patterns too simple to be of reference value.
The empirical results indicate that the Markov model achieves shop-type prediction accuracies ranging from 13% to 14%, 47% to 49%, and 31% to 32% for the assessment measures P@1, P@6, and MRR, respectively. Thus, it can be inferred that consumers' browsing conduct during purchasing exhibits a certain level of intentionality, and including shop type in the prediction can be expected to enhance the precision of the model's trajectory forecast. The results of training and predicting shop-type sequences of different lengths are shown in Table 4.
Having successfully validated the impact of trajectory length, length and number of frequent sub-trajectories, start time, and dwell time on the prediction accuracy of browsing trajectories, as well as the prediction accuracy of shop types using Markov modeling, this subsection will now proceed to validate the accuracy of the trajectory prediction method taking into account shop types. Fig 5 depicts the outcomes of altering the semantic weighting parameter α on the prediction results.
Panels (A), (B), and (C) of Fig 5 illustrate how the prediction results change with variations in the parameter α for three different trajectory lengths.
The figure demonstrates that the prediction accuracy of the proposed method increases progressively as the weighting coefficient α increases, reaching its maximum at α = 0.8, beyond which accuracy decreases to some degree. Hence, by considering the specific category of shop the user intends to visit, the pool of potential shops can be narrowed and the significance of each shop enhanced, which improves the precision of trajectory prediction. Table 5 presents the experimental comparison of the method in this section with the benchmark methods HMM, LHMM, MMC, HST-LSTM, TPA, and STM.
Statistical analysis in Table 5 reveals that the HMM and LHMM prediction techniques exhibit the lowest performance, possibly because these two benchmarks do not solve for the optimal hidden states but simply treat trajectory points as hidden states, leading to poor forecast accuracy. The Markov-model-based prediction algorithms perform better, with the MMC and TPA algorithms achieving prediction accuracies of 41.6% and 45.6%, respectively, on the P@6 measure. After 260 training rounds, the HST-LSTM model reaches stable prediction precision, achieving accuracies of 14.77%, 37.88%, and 26.04% on the metrics P@1, P@6, and MRR, respectively. The trajectory prediction accuracy of the long short-term memory network thus surpasses the Hidden Markov Model approach, although its margin over the Markov-model-based methods is not substantial.
Furthermore, our proposed trajectory prediction method exhibits markedly higher prediction accuracy. Among the approaches above, the trajectory-similarity-based method STM also achieves high prediction accuracy; however, the algorithm in this paper significantly improves on STM for the evaluation metrics P@1, P@6, and MRR. This is because the proposed method addresses the sparsity of the transfer probability matrix through frequent sub-trajectories, solves the problem of null returned predictions, and enriches the representation of user behavioral patterns by introducing dwell time. Consequently, the trajectory prediction effect is improved.
6. Conclusion
The research area of user behavior prediction for e-commerce mainly depends on data-driven analysis, developing predictions by analyzing user movement patterns in trajectory sequences. Nevertheless, current approaches have notable deficiencies, such as inadequate analysis of user behavior, complexity in user clustering, limited prediction accuracy, and difficulties with computational efficiency.
This paper presents a set of novel strategies to address these difficulties. The first step was a thorough examination of multi-dimensional characteristics. We extensively investigated trajectory data's geographical, temporal, and semantic aspects and compared them with the most advanced prediction techniques already available. Using a density-based clustering approach, we provide a clustering algorithm that utilizes clickstreams and custom events to partition user sessions into distinct clusters. This method aims to enhance the accuracy of identifying user behavior patterns. Furthermore, we present a way to predict trajectories using Hidden Markov Models. This method combines spatial and semantic characteristics and effectively uncovers the best-hidden state of the model by using a similarity clustering algorithm for trajectory points and a transformation algorithm for frequent sub-trajectories.
The next phase of our research endeavors will involve a comprehensive exploration of the following aspects:
- Construction of Long-Term Forecasting Models: Present models mostly concentrate on predicting short-term trajectories without extensive study on long-term or macro-trajectory prediction. Our objective is to create novel models that can effectively capture and forecast patterns in user activity over extended durations.
- Development of hybrid models: Our study aims to generate hybrid models that integrate the sensitivity of probabilistic statistical models with the high accuracy of deep learning models to improve the precision and robustness of predictions.
- Enhancement of real-time and accuracy: We will refine our algorithms to enhance real-time predictions and accuracy, particularly in intricate dynamic settings, for real-time application scenarios.
References
- 1. Bhatnagar A, Sen A, Sinha AP. Providing a window of opportunity for converting eStore visitors. Inform Syst Res. 2017;28(1):22–32.
- 2. Ding AW, Li S, Chatterjee P. Learning user real-time intent for optimal dynamic web page transformation. Inform Syst Res. 2015;26(2):339–59.
- 3.
Li C, Jiang Z. A hybrid news recommendation algorithm based on user's browsing path. In: Lee R, editor. Computer and Information Science. ICIS 2016: Proceedings of the 2016 IEEE/ACIS 15th International Conference on Computer and Information Science; 2016 Jun 26-29; Okayama, Japan. Piscataway: IEEE; 2016. p. 1–6.
- 4. Tam KY, Ho SY. Web personalization as a persuasion strategy: an elaboration likelihood model perspective. Inform Syst Res. 2005;16(3):271–91.
- 5.
Liu G, Nguyen TT, Zhao G, et al. Repeat buyer prediction for commerce. In: Krishnapuram B, Shah M, Smola AJ, Aggarwal CC, Shen D, Rastogi R, editors. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016 Aug 13-17; San Francisco, CA, USA. New York: ACM; 2016. p. 155–64.
- 6. Zhang L, Li Y, Wen X. A study on predicting repeat purchase intention of new consumers. Data Anal Knowl Discov. 2018;2(11):10–8.
- 7. Shen Y, Xu X, Cao J. Reconciling predictive and interpretable performance in repeat buyer prediction via model distillation and heterogeneous classifiers fusion. Neural Comput Appl. 2020;32(13):9495–508.
- 8.
Page L, Brin S, Motwani R, et al. The PageRank citation ranking: bringing order to the web. Technical report. Stanford (CA): Stanford University, Stanford InfoLab; 1998. Report No.: 1999-66. Available from: http://ilpubs.stanford.edu:8090/422/.
- 9.
Rendle S, Freudenthaler C, Schmidt-Thieme L. Factorizing personalized Markov chains for next-basket recommendation. In: Rappa M, Jones P, Freire J, Chakrabarti S, editors. Proceedings of the 19th International Conference on World Wide Web; 2010 Apr 26-30; Raleigh, NC, USA. New York: ACM; 2010. p. 811–20.
- 10.
He R, McAuley J. Fusing similarity models with Markov chains for sparse sequential recommendation. In: Cao L, Zhang C, Joachims T, Webb GI, Margineantu DD, Williams G, editors. Proceedings of the IEEE 16th International Conference on Data Mining (ICDM); 2016 Dec 12-15; Barcelona, Spain. Piscataway: IEEE; 2016. p. 191–200.
- 11.
Zhou K, Yu H, Zhao WX, et al. Filter-enhanced MLP is all you need for sequential recommendation. In: Candan KS, Ionescu B, Goel A, editors. Proceedings of the ACM Web Conference 2022; 2022 Apr 25-29; Lyon, France. New York: ACM; 2022. p. 2388–99.
- 12.
Tang J, Wang K. Personalized Top-N sequential recommendation via convolutional sequence embedding. In: Chang Y, Zhai C, Liu Y, editors. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM); 2018 Feb 5-9; Marina Del Rey, CA, USA. New York: ACM; 2018. p. 565–73.
- 13.
Kim Y. Convolutional neural networks for sentence classification. In: Moschitti A, Pang B, Daelemans W, editors. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2014 Oct 25-29; Doha, Qatar. Stroudsburg: Association for Computational Linguistics; 2014. p. 1746–51.
- 14. Chung J, Gulcehre C, Cho KH, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555; 2014.
- 15.
Hidasi B, Karatzoglou A, Baltrunas L, et al. Session-based recommendations with recurrent neural networks. In: Proceedings of the 4th International Conference on Learning Representations (ICLR); 2016. p. 1–10.
- 16.
Hidasi B, Karatzoglou A. Recurrent neural networks with top-k gains for session-based recommendations. In: Guo Y, Farooq F, editors. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM); 2018 Oct 22-26; Torino, Italy. New York: ACM; 2018. p. 843–52.
- 17.
Chang J, Gao C, Zheng Y, et al. Sequential recommendation with graph neural networks. In: Hagen M, Verberne S, Macdonald C, Seifert C, Balog K, Nørvåg K, Setty V, editors. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR); 2021 Jul 11-15; Virtual Event, Canada. New York: ACM; 2021. p. 378–87.
- 18.
Kang WC, McAuley J. Self-attentive sequential recommendation. In: Yang Q, Zhou Z-H, editors. Proceedings of the IEEE International Conference on Data Mining (ICDM); 2018 Nov 17-20; Singapore. Piscataway: IEEE; 2018. p. 197–206.
- 19.
Sun F, Liu J, Wu J, et al. BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In: Zhu W, Tao D, editors. Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM); 2019 Nov 3-7; Beijing, China. New York: ACM; 2019. p. 1441–50.
- 20. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inform Proces Syst. 2017;30:5998–6008.
- 21. Luceri L, Braun T, Giordano S. Analyzing and inferring human real-life behavior through online social networks with social influence deep learning. Appl Netw Sci. 2019;4(1):34:1–34:25.
- 22. Giannotti F, Nanni M, Pedreschi D, Pinelli F, Renso C, Rinzivillo S, et al. Unveiling the complexity of human mobility by querying and mining massive trajectory data. The VLDB J. 2011;20(5):695–719.
- 23.
Gmati FE, Chakhar S, Chaari WL, et al. Shape-based representation and abstraction of time series data along with a dynamic time shape wrapping as a dissimilarity measure. In: Proceedings of the 27th International Conference on Automation and Computing (ICAC); 2021 Sep 2-4; Portsmouth, United Kingdom. Piscataway: IEEE; 2021. p. 1–8.
- 24. Khan R, Ali I, Altowaijri SM, Zakarya M, Ur Rahman A, Ahmedy I, et al. LCSS-based algorithm for computing multivariate data set similarity: a case study of Real-Time WSN Data. Sensors (Basel). 2019;19(1):166. pmid:30621241
- 25.
Yang J, Zhang Y, Hu H, et al. A hierarchical index structure for region-aware spatial keyword search with edit distance constraint. In: Pei J, Manolopoulos Y, Sadiq S, Li J, editors. Proceedings of the 24th International Conference on Database Systems for Advanced Applications (DASFAA); 2019 Apr 22-25; Chiang Mai, Thailand. Cham: Springer; 2019. p. 591–608.
- 26. Lin X, Jiang J, Ma S. One-pass trajectory simplification using the synchronous euclidean distance. VLDB J. 2019;28(6):897–921.
- 27.
Agrawal R, Faloutsos C, Swami AN. Efficient similarity search in sequence databases. In: Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms (FODO); 1993 Oct 13-15; Chicago, Illinois, USA. Berlin: Springer; 1993. p. 69–84.
- 28. Liu Z, Hu L, Wu C, et al. A novel trajectory similarity-based approach for location prediction. Int J Distrib Sens Netw. 2016;12(11):79–93.
- 29.
Quehl J, Hu H, Taş S, et al. How good is my prediction? Finding a similarity measure for trajectory prediction evaluation. In: Proceedings of the 20th International Conference on Intelligent Transportation Systems (ITSC); 2017 Oct 16-19; Yokohama, Japan. Piscataway: IEEE; 2017. p. 1–6.
- 30. Wang P, Wu S, Zhang H, Lu F. Indoor location prediction method for shopping malls based on location sequence similarity. IJGI. 2019;8(11):517.
- 31.
Mathew W, Raposo R, Martins B. Predicting future locations with hidden Markov models. In: Proceedings of the 2012 ACM Conference on Ubiquitous Computing (UbiComp); 2012 Sep 5-8; Pittsburgh, Pennsylvania, USA. New York: ACM; 2012. p. 911–8.
- 32.
Li Q, Lau HC. A layered hidden Markov model for predicting human trajectories in a multi-floor building. In: Proceedings of the 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT); 2015 Dec 6-9; Singapore. Piscataway: IEEE; 2015. p. 344–51.
- 33.
Gambs S, Killijian MO, Del Prado Cortez MN. Show me how you move and I will tell you who you are. In: Al-Shaer E, Keromytis AD, Shmatikov V, editors. Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Security and Privacy in GIS and LBS; 2010 Nov 2; San Jose, California, USA. New York: ACM; 2010. p. 34–41.
- 34.
Wang B, Hu Y, Shou G, et al. Trajectory prediction in campus based on Markov chains. In: Wang Y, Yu G, Zhang Y, Han Z, Wang G, editors. Proceedings of the International Conference on Big Data Computing and Communications; 2016; Bali, Indonesia. Cham: Springer; 2016. p. 145–54.
- 35. Kong D, Wu F. HST-LSTM: a hierarchical spatial-temporal long-short term memory network for location prediction. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI); 2018. p. 2341–7.