Abstract
Knowledge tracing can reveal students’ level of knowledge in relation to their learning performance. Recently, numerous machine learning algorithms have been proposed to implement knowledge tracing and have achieved promising outcomes. However, most previous approaches cannot cope with long-sequence time-series prediction, which is more valuable than the short-sequence prediction extensively utilized in current knowledge-tracing studies. In this study, we propose a long-sequence time-series forecasting pipeline for knowledge tracing that leverages both time stamps and exercise sequences. Firstly, we introduce a bidirectional LSTM model to compute the embeddings of exercise-answering records. Secondly, we incorporate both the students’ exercise records and the time stamps into a vector for each record. Next, a sequence of vectors is taken as input for the proposed Informer model, which utilizes the probability-sparse self-attention mechanism. Note that the probability-sparse self-attention module can address the quadratic computational complexity issue of the canonical encoder-decoder architecture. Finally, we integrate temporal information and individual knowledge states to predict the answers to a sequence of target exercises. To evaluate the performance of the proposed LSTKT model, we conducted comparison experiments with state-of-the-art knowledge tracing algorithms on publicly available datasets. The model demonstrates quantitative improvements over existing models: on the Assistments2009 dataset, it achieved an accuracy of 78.49% and an AUC of 78.81%; on the Assistments2017 dataset, it reached an accuracy of 74.22% and an AUC of 72.82%; and on the EdNet dataset, it attained an accuracy of 68.17% and an AUC of 70.78%.
Citation: Gao A, Liu Z (2025) Long sequence temporal knowledge tracing for student performance prediction via integrating LSTM and informer. PLoS One 20(9): e0330433. https://doi.org/10.1371/journal.pone.0330433
Editor: Muhammad Anwar, University of Education, PAKISTAN
Received: November 4, 2024; Accepted: July 31, 2025; Published: September 9, 2025
Copyright: © 2025 Gao, Liu. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The datasets used in this study are publicly available and can be accessed from ScienceDB (Science Data Bank), a public data repository. The specific dataset is hosted at: Direct access link: https://www.scidb.cn/detail?dataSetId=0729371b805640069bef21874d0450d5. Digital Object Identifier (DOI): https://doi.org/10.57760/sciencedb.27918. This dataset includes three processed knowledge tracing datasets: Assistments 2009, Assistments 2017, and EdNet, and is distributed under the CC BY 4.0 license, allowing for free access, reuse, and distribution with appropriate attribution.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
For the last few years, a large number of students have had to rapidly switch to online learning, which has caused significant changes in the learning environment and learning methods [1]. In a typical online learning platform such as Massive Open Online Courses (MOOCs) [2], it is difficult for teachers to monitor each student’s learning progress in real time, as in traditional classrooms. Knowledge tracing (KT) models [3,4], as one fundamental component of an online learning platform, can provide personalized learning resources and paths based on each student’s learning activity and performance while improving learning efficiency. Teachers can use KT algorithms to track students’ learning progress and understanding level in real time, enabling them to grasp students’ knowledge states in a timely manner. By analyzing students’ historical learning data, KT can generate predictions of students’ future learning performance, thereby helping teachers and educational institutions develop intervention measures in advance and prevent students from falling behind in learning. However, it is still difficult to directly evaluate the capabilities of one specific KT method. Instead, the proficiency level of a student’s knowledge is typically assessed by the KT models through forecasting the student’s next exercise outcome based on their historical responses to exercises.
Most existing KT models focus on short-sequence prediction but fail to model long-term learning trajectories, which are critical for understanding sustained knowledge mastery and designing timely interventions. In addition, these models often overlook temporal dynamics and struggle with computational inefficiency when handling long sequences. Bearing the above-mentioned analysis in mind, this study proposes a novel KT pipeline, dubbed the Long Sequence Temporal Knowledge Tracing Method for student performance prediction (LSTKT). The goal of this model is to accurately track the learner’s level of knowledge by combining the Bidirectional Long Short-Term Memory (Bi-LSTM) [25] with the Informer model [26]. The proposed method uses a Bi-LSTM network to record how students’ learning activities change over short periods of time. The Informer network, on the other hand, models long-term sequence data and extracts deep temporal features. To be specific, the input layer of our model can receive data from students interacting with the online platforms, including question-answer sequences and timestamps. We then exploit Long Short-Term Memory (LSTM) networks to process input sequences and capture short-term dynamic changes in the students’ knowledge states.
Additionally, the Informer model receives the LSTM layer’s output as input, utilizing the self-attention mechanism to handle long-sequence data and extract long-term dependencies. Using feature fusion, we combine the short-term features captured by the LSTM layer with the long-term features of the Informer layer to form a comprehensive feature representation. One or more fully connected layers map the fused features to the output space in the final part of the proposed model. This allows the model to predict the probability of correctness for a long sequence of students’ answers to the exercises. The Informer model employs a probabilistic sparse self-attention mechanism, which can greatly reduce computation while still maintaining long-term temporal information. The entire model is trained with the cross-entropy loss function and the Adam optimizer, while the back-propagation algorithm updates the weighting parameters.
In general, this work’s contributions can be summarized as follows:
- This study proposes a novel KT model that comprehensively leverages long-sequence time-series information captured from students’ exercising records.
- To implement the proposed algorithm, both the bidirectional LSTM and Informer models are leveraged.
- Extensive experimental results on the publicly available datasets demonstrate the superiority of the proposed approach over state-of-the-art methods.
We have structured the remaining sections of the paper as follows. The next section describes the works related to this study. The Materials and Methods section provides a comprehensive explanation of the proposed methodology for resolving the long-sequence time-series knowledge tracing task. The subsequent section reports the results of the comparison experiments, along with details of the leveraged datasets, the evaluation metrics, and the hyper-parameter settings. A discussion of the outcomes of the proposed approach follows. Finally, the conclusion is provided, which also discusses future research directions.
Related work
Traditional knowledge tracing methods
The literature has proposed numerous KT methods over the past few decades. Early KT algorithms are based on simple machine learning models, such as the random forest [5], the Bayesian network [6], and the Markov model [7]. For instance, Bayesian KT (BKT) [8] represents an early effort to model students’ varied knowledge states during knowledge acquisition. This study denotes unobserved states of students’ knowledge acquisition as hidden variables in a Hidden Markov Model (HMM), which can update the posterior probabilities of students’ knowledge states using a Bayesian network based on the responses to exercises. Accordingly, the parameters of the Bayesian and HMM models, such as the probability of mastering a Knowledge Concept (KC) [9], can be calculated. Meanwhile, it also incorporates the probabilities of changing knowledge states, such as from not mastered to mastered. The experimental results on three benchmarks demonstrate that their constructed model can represent knowledge concepts better than state-of-the-art semantic networks. In the work of [10], a higher-order Item Response Theory (IRT) model approximates students’ initial knowledge states as their one-dimensional overall proficiency while integrating the estimated difficulty and classification of each skill to estimate the probability of mastering a skill before practicing it. The skill-wise knowledge is then exploited for tracing the probabilities of learning, guessing, and slipping. The overall accuracy of this model on real data from an algebra tutor is 87.13%.
Deep learning-based knowledge tracing methods
Recently, plenty of researchers have leveraged deep learning models to address KT tasks. For instance, Piech et al. [11] pioneered this direction of study and demonstrated the capability of deep learning models for KT. In this study, the utility of Recurrent Neural Networks (RNNs) [12] was explored to model students’ learning activities. The RNN-based models achieved superior outcomes over previous techniques without domain knowledge, attaining areas under the ROC curve (AUC) of 0.82, 0.85, and 0.86 on the simulated-5, Khan math, and ASSISTments datasets, respectively. They can also capture complicated representations of students’ knowledge. The adoption of Neural Networks (NNs) [13] results in substantial enhancements in prediction performance on a wide range of KT datasets. To make use of the associations between knowledge points in KT, Duan et al. [14] used Generative Adversarial Networks (GANs) to yield knowledge relationship representations by integrating multiple knowledge associations. In their proposed models, they used gated recurrent units to generate the students’ knowledge states. Meanwhile, they exploited an attention-based technique to learn the coefficients of various knowledge associations. The outcomes on the ASSISTments (0.776, 0.701, 0.746), Junyi (0.882, 0.829, 0.844), EdNet (0.891, 0.809, 0.824), and KDDCup (0.726, 0.605, 0.791) datasets achieved superior performance over the state of the art in terms of AUC, F1 score, and accuracy, respectively. To address the shortcomings of the Deep-IRT model, which combines IRT with a deep learning model for KT, Tsutsumi et al. [15] proposed an updated Deep-IRT that can model students’ responses to an item using two different networks, namely the student network and the item network. This work also presented a novel hyper-network architecture that balances both current and past data samples of students’ knowledge states.
Results on six datasets demonstrate that this model can improve the prediction accuracy by about 2.0%. Furthermore, graph-based deep learning models, such as [16,17], have been proposed for KT. In the work of [16], the authors proposed a heterogeneous graph-based KT method by introducing spatial-temporal evolution, in which knowledge state changes can be tracked along both the spatial and temporal dimensions. The authors leveraged the hierarchical structure to provide ample exercise representations within the knowledge space. They integrated the content, concepts, and challenges into the presented heterogeneous graph. The proposed graph network exploited both a spatial and a temporal updating module. Experiments on three datasets show the superior performance of this model in terms of performance prediction. For instance, by using the heterogeneous graph embedding, the AUC and accuracy of the graph knowledge tracing model [18] on the ASSISTments2015 dataset can be changed from 0.6901 and 0.7269 to 0.6973 and 0.7263. Sun et al. [19] proposed a KT method called the weighted heterogeneous graph-based three-view contrastive learning framework, inspired by data-driven paradigms. Generally, the researchers exploited three different encoders to complement each other and obtain exercise embeddings. In particular, it assumes that the semantic information of more complex, diverse, and subsequent exercises on a heterogeneous graph can be exploited to learn useful representations. Finally, they adopted a meta-path-based positive sample choice strategy and a joint contrastive loss to yield optimal prediction performance. This method achieved an AUC and accuracy of 0.7927 and 0.7429 on Assistments2009, 0.7728 and 0.7043 on Assistments2017, 0.7663 and 0.7388 on EdNet, and 0.8933 and 0.8545 on Statics2011.
The study of [72] proposes a Structure-aware Inductive Knowledge Tracing model (SINKT) that introduces Large Language Models (LLMs) to construct a heterogeneous graph of concepts and questions, which can predict student responses by integrating student knowledge states and question representations. It achieved state-of-the-art performance on four datasets while addressing data sparsity and cold-start issues in inductive KT tasks. The work of [73] presents an Alternate Autoregressive KT framework (AAKT) that treats KT as a generative process. It represents knowledge states via autoregressive encodings on question–response sequences and incorporates educational and exercise details as inputs. This study outperformed baselines on four datasets in terms of prediction metrics.
Recently, deep learning-based architectures have obtained promising performance in KT. However, there are still several challenges. First of all, most of the existing studies focus on short-sequence time-series student performance prediction, such as predicting the student’s answer to the next question [3]. In contrast, long-sequence time-series forecasting is more valuable than short-term time-series prediction [20]. On one hand, analyzing student performance over an extended period allows KT models to make more accurate predictions about students’ performance. Note that each student’s performance prediction is also a long sequence, such as predicting a group of answers for the next few questions. On the other hand, it can provide a holistic view of a student’s learning trajectory, revealing long-term trends and patterns for students to master knowledge [21]. Meanwhile, long-sequence data samples help KT algorithms identify individualized learning paths and offer tailored resources and recommendations. Moreover, long-term data can inform the design of more effective long-term learning interventions. Intuitively, teachers should not make hasty evaluations of students’ knowledge mastery based on the students’ short-term performance [22]. Instead, most teachers would pay more attention to their performance over a long period of time [23]. Based on the long sequence of time-series exercise-answering records, the teacher can comprehensively take the students’ performance tendencies into consideration. Using longitudinal tracking [24] to assess the stability of a student’s knowledge state can determine whether concepts are truly mastered. In addition, deep learning-based algorithms have been extensively employed in various prediction-related tasks, including renewable energy [63], material performance [64], and biomedicine [65–69].
To be specific, we can divide the current deep learning-based KT methods into five categories, including memory structures, attention mechanisms, graph representation learning, textual features, and forgetting features.
Memory structures.
This type of deep learning model was inspired by memory-augmented neural networks [27]. These KT models have been enhanced by increasing more powerful memory structures, like key-value memory [28], for dynamically extracting knowledge states at a fine-grained level, including the mastery extent of every single skill. For instance, the work of [29] proposes a deep learning model-based KT model, namely Sequential Key-Value Memory Networks (SKVMN). This model exploits recurrent modeling and memory capacity for examining student learning behaviors. Sun et al. [30] propose an exercising record representation algorithm that integrates the features of learning activities along with the learning abilities, thereby enhancing the performance of KT.
Attention mechanisms.
Inspired by the transformer model [31] and its applications in the natural language processing area, a variety of attention mechanisms have been presented in deep learning KT models for capturing the relationships among exercises and the students’ knowledge status. For instance, Ghosh, Heffernan, and Lan [32] propose the Attentive Knowledge Tracing (AKT) model, which uses flexible attention-based neural network models with a series of components inspired by cognitive and psychometric models. AKT employs a monotonic attention mechanism relating the students’ future responses to previous answers, as well as the similarity between the responses. Pandey and Srivastava [33] propose a Relation-aware self-attention model for KT (RKT). RKT uses a self-attention layer to incorporate contextual information, which integrates both the exercise associations of textual content and students’ performance, as well as the forgetting behavior, by using an exponentially decaying kernel function. Choi et al. [34] propose a transformer-based model for KT, named Separated Self-Attention Neural Knowledge Tracing (SAINT). SAINT is organized as an encoder-decoder structure, which processes the exercise and response sequences separately. Jiang et al. [35] propose an adaptive heterogeneous graph embedding module to fully make use of the latent information within a graph. In the meantime, they designed two encoders to capture students’ engagement and knowledge, respectively. Shin et al. [36] propose SAINT+, an updated version of the SAINT model. SAINT+ is also a transformer-based KT model that addresses exercises and students’ responses separately. Meanwhile, SAINT+ has an encoder-decoder architecture, in which the encoder deals with a set of exercise embeddings and the decoder copes with the corresponding response embeddings.
Moreover, SAINT+ uses two temporal embeddings to represent the elapsed time (the time taken to answer a question) and the lag time (the time period between neighboring learning activities). The work of [71] proposes a Fine-Grained KT model (FGKT) to capture actual differences among exercises and in prior knowledge, which obtains exercise representations via knowledge concepts and designs an attention mechanism for prior-knowledge relevance.
Graph representation learning.
Inspired by the representation capability of graph models like Graph Neural Networks (GNNs) [37], graph-based KT models have been proposed to leverage the rich structural information from graphs to flexibly model associations between questions and skills. Tong et al. [38] propose a KT framework dubbed Structure-based Knowledge Tracing (SKT), which uses the multiple associations in knowledge to capture the influence propagation among KCs. The SKT framework considers not only the temporal influence on the exercise sequence but also the spatial impact on knowledge. The authors of [39] propose a GNN-based KT method. It casts the knowledge state as a graph, which reformulates the KT task into a time-series node-level classification problem in the GNN model. The work of [70] proposes Knowledge Structure-aware Graph-Attention Networks (KSGAN), which utilize improved Graph Attention Networks for effective exercise representation by leveraging knowledge structure. It incorporates representation optimization into the loss function to address data sparsity.
Textual features.
Since the exercises might contain rich information about the skills required by the corresponding questions and the associations between questions, a set of deep learning-based KT models has exploited textual features from questions to track the knowledge states of the students. For instance, Liu et al. [40] propose a holistic approach to students’ performance prediction. To implement performance prediction, an Exercise-Enhanced Recurrent Neural Network (EERNN) is leveraged by exploring the student’s exercise records and the content of the exercises. In the EERNN model, each student’s state is turned into a vector and tracked with an RNN, while a bidirectional LSTM is used to extract the embedding of each exercise. In addition, the EERNNM model exploits both the Markov model and the attention mechanism. Su et al. [41] also propose an EERNN framework for student performance prediction by using students’ exercise records and the exercises themselves. A bidirectional LSTM is used to learn each exercise representation while tracking students’ knowledge states. Yin et al. [42] propose a pre-training model called QuesNet for generating question representations. QuesNet uses a unified framework to aggregate questions and the heterogeneous inputs into a vector. It consists of a hierarchical architecture to better describe the questions in an unsupervised manner.
Forgetting features.
Inspired by the learning curve theory, several recent deep learning KT models incorporate forgetting features to model students’ forgetting behaviors for KT. For instance, Chen et al. [43] propose an explanatory probabilistic model to implement the KT proficiency of students over time by leveraging educational priors. In this model, each student is denoted as a time-series knowledge vector in a unified knowledge space. Given the student knowledge vector, the learning curve and forgetting curve are taken as priors to capture the change in students’ temporal proficiency. To be specific, a probabilistic matrix factorization framework is leveraged for combining student and exercise priors. Abdelrahman and Wang [44] propose a KT model called the Deep Graph Memory Network (DGMN). DGMN incorporates the forget-gating mechanism into the attention module for capturing forgetting behaviors during the KT process. In addition, this model has the ability to extract associations between KCs from a latent graph with the students’ evolving knowledge states.
Notably, most existing approaches are predominantly designed for short-sequence prediction, while the critical need for long-sequence time-series forecasting in education remains underexplored. Long-sequence modeling is essential for understanding sustained learning trends, designing personalized interventions, and evaluating long-term knowledge mastery. Bearing this issue in mind, this study addresses these gaps through a novel framework that: 1) fuses short-term and long-term temporal modeling to capture both immediate knowledge dynamics and extended learning patterns; 2) integrates timestamp information with exercise sequences to enhance the model’s sensitivity to the temporal context of learning events; and 3) employs efficient long-sequence forecasting techniques to overcome the computational limitations of traditional attention mechanisms. By prioritizing long-sequence prediction, this study fills a key void in existing KT research. It also offers a more comprehensive and practically relevant solution for tracking student performance over extended periods.
Materials and methods
Problem definition
Assume the presence of a collection of exercises on an internet-based educational platform. The pupils participating on this platform are required to respond to several exercises. The KT model is utilized to monitor the students’ changing knowledge states by employing the exercise-answering sequence of students. Due to the intricate nature of learning as a cognitive process, it is challenging to accurately represent students’ true knowledge states over an extended duration. Alternatively, the KT models typically use students’ learning records to anticipate their responses to future exercises for the purpose of implementing KT evaluation.
Let us consider a sequence of exercises and answers by a student U, denoted as X = {x1, x2, …, xn}. Here, xi = (ei, Ci, ai, ti) represents the exercise-answering record at step i with the time stamp ti. It should be noted that ei represents the exercise that the student answers, and ai indicates the label of the related response, with 1 indicating a correct response and 0 indicating an incorrect response. n represents the total number of steps, and Ci represents the collection of KCs that correspond to ei; it is possible for one exercise to be related to numerous KCs. The KT task is to predict the response an+1 of student U to the exercise en+1 at the next step, n + 1, given the exercise-answering records X = {x1, x2, …, xn}, which consist of the sequence of learning activities that occurred before step n + 1.
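To make the notation concrete, the records can be sketched as a simple Python data structure; the class and field names below are illustrative, not part of the proposed model:

```python
from dataclasses import dataclass

@dataclass
class ExerciseRecord:
    """One exercise-answering record x_i = (e_i, C_i, a_i, t_i)."""
    exercise_id: int      # e_i: the exercise the student answered
    concepts: frozenset   # C_i: the KCs associated with e_i (possibly several)
    answer: int           # a_i: 1 = correct, 0 = incorrect
    timestamp: float      # t_i: time stamp of the interaction

# A student's history X = {x_1, ..., x_n}; the KT task is to predict a_{n+1}.
history = [
    ExerciseRecord(12, frozenset({3}), 1, 1000.0),
    ExerciseRecord(15, frozenset({3, 7}), 0, 1060.0),
]
```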
Overall structure of the proposed pipeline
This study proposes an innovative long-sequence time-series forecasting pipeline for knowledge tracing. As shown in Fig 1, the specific steps are as follows:
- Exercise Answering Record Embedding: The Bi-LSTM model is employed to process students’ exercise answering records. The Bi-LSTM can effectively capture the context information in the sequence and transform the exercise answering records into meaningful feature representations, providing a basis for subsequent analysis.
- Feature Vector Construction: Integrate students’ exercise records and the corresponding timestamps into a vector for each record. This integration method combines time information with exercise information, enabling the model to consider both the temporal order and specific content of students’ learning behaviors, and reflecting students’ learning states more comprehensively.
- Prediction Using the Informer Model: The constructed vector sequence is used as the input of the Informer model. The Informer model adopts a probability-sparse self-attention mechanism, effectively solving the quadratic computational complexity problem in the traditional encoder-decoder architecture. This allows the model to calculate efficiently while accurately capturing long-distance dependencies when processing long-sequence data, enhancing the prediction performance.
- Integrating Information for Prediction: By integrating temporal information and students’ individual knowledge states, predictions for a series of target exercise answers are finally achieved. This approach that takes multiple factors into account makes the prediction results better reflect students’ real knowledge mastery.
In the remainder of this section, we first present a detailed explanation of the proposed deep learning architecture, including a description of the learning task and the probability attention mechanism suggested for knowledge tracing. During the initial stage, we employ a Bi-LSTM model to generate a robust knowledge state representation by promoting the connections between students’ knowledge states as they evolve both forward and backward. Next, the Informer network takes the output of the Bi-LSTM and generates the prediction results. The proposed Informer model incorporates a probabilistic attention mechanism as a substitute for the conventional self-attention module. An attention distillation operation is then utilized to enhance dominant attention scores in the stacked layers and significantly reduce the overall spatial complexity. Finally, a generative decoder obtains long-sequence exercise-answer predictions in a single forward step, thereby eliminating cumulative errors in the inference process. The structure of the suggested approach is depicted in Fig 2.
Embedding layers: Initially, each word in the text is converted into a numerical vector using a pre-trained word embedding model, such as GloVe [45]. Input to LSTM: the numerical vectors from the word embeddings serve as the input to the LSTM model; each vector represents a single time step in the sequence, and the sequence length equals the number of words in the input text.
Bi-LSTM model as the embedding layers
The proposed approach employs the Bi-LSTM model as the fundamental framework. Since the introduction of the first LSTM model by Hochreiter and Schmidhuber [46], this type of deep learning algorithm has been widely used in various natural language processing (NLP) tasks and has shown promising outcomes. Multiple studies in this field have shown that LSTM-based models are efficient in handling tasks associated with textual material. In addition to the exercise embeddings, the time stamps of the exercise-answering records are also included as input for the Bi-LSTM network.
This study presents a Bi-LSTM model for generating embeddings of exercise-answering data. The proposed Bi-LSTM model typically comprises an embedding layer, dropout layer, bidirectional LSTM layer, attention layer, and output layer. The Bi-LSTM model includes multiple attention modules.
Embedding layer.
Let us consider a set of sentences denoted as S = {s1, s2, …, sm}, where each sentence si consists of a sequence of words {w1, w2, …, wk}. Each sentence can be represented as a sequence of word embeddings (v1, v2, …, vk), which then forms the embedding layer of the Bi-LSTM network. The word embeddings utilized in this investigation are produced via the pre-trained GloVe embeddings and possess a standardised dimension of 100.
Dropout layer.
The dropout layer [47] in the proposed Bi-LSTM models serves the objective of alleviating the issue of over-fitting [48] in deep learning architectures.
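A minimal NumPy sketch of (inverted) dropout, which rescales the surviving activations so that the expected activation is unchanged at inference time:

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout(x, p, training=True):
    """Inverted dropout: zero each unit with probability p during training and
    rescale the survivors by 1/(1-p); at inference the input passes through."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones(10000)
y = dropout(x, 0.5)   # roughly half zeros, survivors scaled to 2.0
```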
Bi-LSTM layer.
A typical Bi-LSTM model consists of two consecutive LSTM layers [46], which function in opposite directions to handle knowledge states in both the forward and backwards ways. The suggested model consists of a total of two hidden layers of Bi-LSTM. The forward LSTM and backwards LSTM components in a Bi-LSTM model can be mathematically represented as:
- Forward LSTM (as shown in Eq (1)):

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$ (1)

where xt denotes the present input, ht−1 represents the previous hidden state, Wf and Uf are the matrices used for weighting in the forget gate, and bf is the bias term used. The input gate is formulated in Eq (2):

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$ (2)

where the function $\sigma(x) = 1/(1+e^{-x})$ defines the sigmoid function, the variables Wi and Ui represent the weighting matrices for the input gate, and bi indicates the corresponding bias.

In the next layer, the candidate cell state is computed as formulated in Eq (3):

$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$ (3)

where the weighting matrices used in the feature extraction process are denoted as Wc and Uc, and the corresponding bias is indicated as bc. It is worth noting that the function tanh(x) can be equivalently represented as $(e^x - e^{-x})/(e^x + e^{-x})$.

The output gate is shown in Eq (4):

$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$ (4)

where the variables Wo and Uo denote the weighting matrices employed in the output gate, whilst bo denotes the corresponding bias.

The cell state is updated as shown in Eq (5):

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ (5)

where $\odot$ denotes the Hadamard (element-wise) product operator.

The hidden state is then obtained as shown in Eq (6):

$h_t = o_t \odot \tanh(c_t)$ (6)

The LSTM model [46] represents the forget gate, input gate, and output gate as Eq (1), Eq (2), and Eq (4), respectively, while Eq (3) is employed to demonstrate the process of generating the candidate embeddings.

- Backward LSTM, which processes the sequence in the reverse direction with its own parameters (denoted with primes):

$f'_t = \sigma(W'_f x_t + U'_f h'_{t+1} + b'_f)$ (7)

$i'_t = \sigma(W'_i x_t + U'_i h'_{t+1} + b'_i)$ (8)

$\tilde{c}'_t = \tanh(W'_c x_t + U'_c h'_{t+1} + b'_c)$ (9)

$o'_t = \sigma(W'_o x_t + U'_o h'_{t+1} + b'_o)$ (10)

$c'_t = f'_t \odot c'_{t+1} + i'_t \odot \tilde{c}'_t$ (11)

$h'_t = o'_t \odot \tanh(c'_t)$ (12)

The equations labelled Eq (7) through Eq (12) represent the forget gate, input gate, feature extraction module, output gate, cell update, and hidden state in the backward pass, where $h'_{t+1}$ and $c'_{t+1}$ denote the backward hidden and cell states from the subsequent time step.
Furthermore, the combined hidden state of the Bi-LSTM model is acquired by concatenating the forward and backward hidden states, as seen in Eq (13):

$h_t^{bi} = [h_t ; h'_t]$ (13)

where $[\cdot\,;\cdot]$ denotes vector concatenation.
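Eqs (1)–(13) can be sketched directly in NumPy; the gate-parameter layout and toy dimensions below are illustrative, not the trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step following Eqs (1)-(6); W, U, b map gate names to parameters."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate, Eq (1)
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate, Eq (2)
    c_hat = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell, Eq (3)
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate, Eq (4)
    c = f * c_prev + i * c_hat                                # cell update, Eq (5)
    h = o * np.tanh(c)                                        # hidden state, Eq (6)
    return h, c

def bilstm(xs, params_fwd, params_bwd, d_h):
    """Run forward and backward passes and concatenate states as in Eq (13)."""
    h, c, fwd = np.zeros(d_h), np.zeros(d_h), []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, *params_fwd)
        fwd.append(h)
    h, c, bwd = np.zeros(d_h), np.zeros(d_h), []
    for x_t in reversed(xs):
        h, c = lstm_step(x_t, h, c, *params_bwd)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]

rng = np.random.default_rng(3)
d_x, d_h = 4, 3                               # toy input and hidden sizes

def gate_params():
    W = {g: rng.normal(size=(d_h, d_x)) for g in "fico"}
    U = {g: rng.normal(size=(d_h, d_h)) for g in "fico"}
    b = {g: np.zeros(d_h) for g in "fico"}
    return W, U, b

xs = [rng.normal(size=d_x) for _ in range(5)]
states = bilstm(xs, gate_params(), gate_params(), d_h)   # five 2*d_h vectors
```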
Output layer.
The softmax activation function was employed to produce the prediction label for each input embedding. Mathematically, the formulation is expressed as Eq (14):
$y_{bilstm} = \mathrm{softmax}(o)$ (14)
where $y_{bilstm}$ reflects the outcome of the proposed Bi-LSTM models, whereas $o$ denotes the output of the last hidden vector of the Bi-LSTM models.
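A minimal sketch of the softmax of Eq (14), applied to a hypothetical final hidden vector `o`; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something the paper specifies:

```python
import math

def softmax(o):
    """Map a hidden vector o to a probability distribution (Eq (14))."""
    m = max(o)                      # shift for numerical stability
    exps = [math.exp(v - m) for v in o]
    s = sum(exps)
    return [e / s for e in exps]
```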
Informer network
The proposed Informer model has three important characteristics. First, its probability-sparse attention mechanism offers significantly reduced computational complexity and memory use compared to the traditional transformer model [31]. Second, a distilling operation is employed to emphasise dominant components by utilising a cascading layer structure. Third, the generative-style decoder can forecast the entire long-sequence output in a single forward operation, rather than step by step. The time stamp of each exercise-answering record is included in the input of the proposed Informer network.
Probability attention mechanism.
The Informer, as shown in Fig 3, utilises the probability attention mechanism indicated in Fig 4. Attention mechanisms have been extensively employed in NLP tasks over the past few decades, and their integration into neural networks has shown promising outcomes in machine translation and text comprehension. An attention module can be used to emphasise the components of the context that are crucial to the machine learning task at hand. The study conducted by Vaswani et al. [31] made major advancements in the attention mechanism. This study uses a multi-head probability attention module, which consists of multiple heads, to reduce the computational complexity of the canonical attention module described in [31].
The variable L represents the input length of the Conv1d operation, and k is the number of feature maps produced in each attention module.
The self-attention mechanism described by Vaswani et al. [31] takes a tuple input of query, key, and value and performs a scaled dot-product operation, as shown in Eq (15):
$\mathcal{A}(Q, K, V) = \mathrm{softmax}(QK^{\top}/\sqrt{d})\,V$ (15)
where $Q \in \mathbb{R}^{L_Q \times d}$, $K \in \mathbb{R}^{L_K \times d}$, $V \in \mathbb{R}^{L_V \times d}$, and $d$ denotes the dimension of the input.
The original attention mechanism, as demonstrated in [31], has proven effective in predicting time-series data. Nevertheless, it is plagued by quadratic computational complexity and excessive memory utilization. To tackle these issues, a substantial body of research has developed alternative encoder-decoder architectures. For example, in the probability-sparse self-attention module described in [26], only the most influential queries participate in the attention computation (as shown in Eq (16)):
$\mathcal{A}(Q, K, V) = \mathrm{softmax}(\bar{Q}K^{\top}/\sqrt{d})\,V$ (16)
where $\bar{Q}$ represents a sparse matrix with the same dimensions as matrix $Q$, containing only the most important queries according to the sparsity measurement described in [26]. The multi-head architecture allows this probability attention mechanism to produce distinct sparse query-key pairs for each head, hence preventing significant loss of information.
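The query-selection idea can be illustrated with a small plain-Python sketch. The max-minus-mean sparsity measurement follows [26]; taking the mean of V as the output of the non-selected ("lazy") queries is the convention of that formulation, and the exhaustive score computation here is for clarity only (the real module samples keys to stay sub-quadratic):

```python
import math

def _softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def prob_sparse_attention(Q, K, V, u):
    """ProbSparse self-attention sketch (Eq (16)): only the u queries with
    the largest sparsity measurement M(q, K) = max_j(s_j) - mean_j(s_j),
    where s = q.K^T / sqrt(d), perform full attention; the remaining
    queries output the mean of V."""
    d = len(Q[0])
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
              for q in Q]
    sparsity = [max(s) - sum(s) / len(s) for s in scores]
    top = sorted(range(len(Q)), key=lambda i: sparsity[i], reverse=True)[:u]
    mean_v = [sum(col) / len(V) for col in zip(*V)]
    out = [list(mean_v) for _ in Q]
    for i in top:                       # full attention for active queries
        w = _softmax(scores[i])
        out[i] = [sum(wj * vj[dim] for wj, vj in zip(w, V))
                  for dim in range(len(V[0]))]
    return out
```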
Distilling operation.
With the inclusion of the probability attention mechanism, the encoder's feature map contains redundant combinations of the value V. The distilling operation is employed to enhance the prominent characteristics in the feature map and produce a concentrated feature map in the subsequent layer, as depicted in Fig 4. This operation reduces the time dimension of the inputs and may be expressed mathematically as shown in Eq (17):
$X_{j+1}^{t} = \mathrm{MaxPool}\big(\mathrm{ELU}\big(\mathrm{Conv1d}([X_{j}^{t}]_{AB})\big)\big)$ (17)
where $[\,\cdot\,]_{AB}$ represents the attention block, Conv1d(·) denotes the one-dimensional convolutional operation with a kernel width of 5 along the temporal dimension, and ELU(·) refers to the activation function described by Clevert et al. [49]. The max-pooling layer, with a stride of 2, halves the size of $X^{t}$, resulting in a significant reduction in memory use. The self-attention distilling layers progressively shrink one layer at a time, like a pyramid, as depicted in Fig 4. The final hidden representation of the encoder is obtained by combining the outputs of all the stacks.
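A single-channel toy version of the distilling step of Eq (17) (Conv1d with kernel width 5 and zero padding, ELU, max-pooling with stride 2) might look as follows; the real layer operates on multi-channel feature maps with learned kernels:

```python
import math

def elu(x, alpha=1.0):
    """ELU activation of Clevert et al.: x for x > 0, alpha*(e^x - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def distill(seq, kernel):
    """Conv1d (kernel width 5, zero padding) -> ELU -> max-pool with
    stride 2, which halves the time dimension as in Eq (17)."""
    assert len(kernel) == 5
    padded = [0.0, 0.0] + list(seq) + [0.0, 0.0]
    conv = [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(seq))]
    act = [elu(v) for v in conv]
    return [max(act[i:i + 2]) for i in range(0, len(act) - 1, 2)]
```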
Generative decoder.
The decoder employed in the proposed methodology is influenced by the research of Vaswani et al. [31], since it comprises two multi-head attention layers. Meanwhile, the generative method is utilized to achieve long-sequence prediction. The input of the decoder is expressed as shown in Eq (18):
$X_{de}^{t} = \mathrm{Concat}(X_{token}^{t}, X_{0}^{t}) \in \mathbb{R}^{(L_{token}+L_{y}) \times d_{model}}$ (18)
where $X_{token}^{t}$ is a matrix of size $L_{token} \times d_{model}$ and represents the start token, while $X_{0}^{t}$ is a matrix of size $L_{y} \times d_{model}$ and serves as a zero-filled placeholder for the anticipated sequence. The masked probability attention module prevents each position from attending to future positions, thereby avoiding auto-regressive error accumulation. The Informer network's ultimate output is obtained through a linear layer whose size $d_y$ depends on whether univariate or multivariate prediction is performed.
Results
This section provides an overview of the datasets utilized, the assessment metrics employed, the experimental conditions, and the outcomes of the suggested methodology.
Dataset
The details of the public datasets used in this study are available in Table 1: (1) Assistments Data Mining Competition 2009 (Assistments2009), accessible at https://sites.google.com/site/assistmentsdata/home/; (2) Assistments Data Mining Competition 2017 (Assistments2017), accessible at https://sites.google.com/view/assistmentsdatamining/data-mining-competition-2017; (3) EdNet, accessible at https://github.com/riiid/ednet. The Assistments2009 dataset was collected via the ASSISTments online tutoring platform during the 2009-2010 school year. Assistments2017 was derived from the ASSISTments data mining competition held in 2017. EdNet is a substantial dataset available in [50], comprising 130 million records from 0.78 million students. To streamline the experiments, this study exclusively uses the data of 4,000 students. The absolute time of each student answering record is used; these timestamps, in standard date-time format, identify the order of learning events. Before model input, the timestamps were normalized into the range [0, 1] using Min-Max normalization. For all three datasets, 80% of the data is used for training and the remaining 20% for testing.
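The timestamp normalization described above amounts to the following (timestamps are assumed to be numeric, e.g. Unix seconds):

```python
def min_max_normalize(timestamps):
    """Min-Max normalization of absolute answer timestamps into [0, 1]."""
    lo, hi = min(timestamps), max(timestamps)
    span = hi - lo
    if span == 0:  # degenerate case: all events at the same instant
        return [0.0 for _ in timestamps]
    return [(t - lo) / span for t in timestamps]
```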
Dataset pre-processing
Before feeding the data into the proposed model for knowledge tracing, comprehensive dataset pre-processing was essential.
Data cleaning.
We meticulously addressed potential issues in the original datasets, including missing values and outliers. Our handling of missing values was driven by the characteristics of the data: for numerical features such as answering time, mean imputation was used when the proportion of missing values was low.
Data encoding.
Given the abundance of categorical data in our datasets, such as question types and student identities, and the fact that computers operate on numerical data, we implemented multiple encoding techniques. For categorical variables with a finite number of unordered categories, One-Hot Encoding was used to convert each category into a binary vector for accurate computer recognition.
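A minimal sketch of the one-hot encoding step; sorting the categories is an illustrative choice to make the mapping deterministic, not something the paper prescribes:

```python
def one_hot_encode(values):
    """One-hot encode categorical values (e.g. question types):
    each distinct category maps to a binary indicator vector."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1
        vectors.append(vec)
    return categories, vectors
```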
Implementation settings
For hardware, we used 4 NVIDIA GeForce RTX 4090 GPUs; each machine is equipped with 32 GB of RAM, and the CPU is an Intel Xeon Gold 6338, which is well suited to the associated data-processing tasks.
In the training process, we employed a series of strategies alongside the enhanced Adam optimizer [51]. The optimizer was used with an epsilon value of 1e-9 to maintain numerical stability, automatic mini-batching to process large datasets efficiently, and a dropout rate of 0.1 for each model to introduce regularization and prevent complex co-adaptations on the training data. To further address overfitting, we implemented early stopping (with patience set to 10 in our experiments), monitoring the validation loss during training and halting the process if it ceases to decrease, which would indicate that the model is memorizing the training data rather than learning generalizable patterns. Additionally, we employed L2 regularization, which adds a penalty on large weights to the loss function, encouraging the model to keep smaller weights and thus reducing overfitting. Each model was trained with its own number of epochs and learning rate, tailored to its architecture and complexity.
The embeddings for the models were produced using GloVe, a pre-trained word embedding model known for its effectiveness in capturing semantic relationships. The embeddings were generated in a 200-dimensional space and trained for 200 iterations with a learning rate of 0.001 to ensure convergence. In the experiments, hyperparameter optimization for the LSTKT model was conducted via grid search, a robust method for exploring predefined parameter ranges; the key parameters tested included the dropout rate, the learning rate, and the number of Bi-LSTM hidden layers. In general, the hyper-parameters used in this work are listed in Table 2.
Note that for the Assistments2009 dataset, each epoch took 12.4 minutes to complete; for the Assistments2017 dataset, each epoch took around 18.9 minutes; and for the EdNet dataset, each epoch took 21.6 minutes.
In the training process of this study, the presentation and analysis of the model’s performance are of great significance. Fig 5 depicts the accuracy and loss curves over 200 epochs, which effectively mirror the model’s training process.
As shown in Fig 5, at the commencement of training, the model’s accuracy hovers around 50%. As the training unfolds, the accuracy initially surges. The addition of random noise to the data gives rise to minor fluctuations, emulating the real-world intricacies during model training. In the later phase, the growth rate decelerates and eventually plateaus at approximately 92.5%, signifying that the model has gradually assimilated the data characteristics and its performance has achieved stability. Concerning the loss curve, at the start of training, the loss value stands as high as 0.5. Initially, the loss value drops precipitously. Owing to the random noise incorporated, the curve exhibits small fluctuations. As training progresses, the rate of decline gradually slackens until it ultimately stabilizes at 0.01. This indicates that the disparity between the model’s predicted values and the true values is steadily diminishing, and the model’s fitting efficacy is continuously improving. Overall, these two curves comprehensively illustrate the entire training process of the model, from its initial state of instability to gradual convergence and performance enhancement.
Evaluation metrics
Typically, we use accuracy and AUC [52] as the evaluation metrics. The mathematical expression of accuracy is provided in Eq (19):
$\mathrm{Accuracy} = (TP + TN)/(TP + TN + FP + FN)$ (19)
where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. To obtain the AUC value, as mentioned by Zhu et al. [52], TPR and FPR are computed for each possible threshold, as shown in Eq (20) and Eq (21):
$TPR = TP/(TP + FN)$ (20)
$FPR = FP/(FP + TN)$ (21)
Then, the points (FPR, TPR) are plotted with FPR on the x-axis and TPR on the y-axis, and the AUC value is approximated by a numerical integration method.
Furthermore, to assess the effectiveness of the suggested method in predicting long sequences, both mean squared error (MSE) and mean absolute error (MAE) are employed.
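The four metrics can be computed as follows; the AUC routine sweeps every distinct score as a threshold and integrates the resulting ROC points with the trapezoidal rule, one simple instance of the numerical integration mentioned above:

```python
def accuracy(y_true, y_pred):
    """Eq (19): (TP + TN) / (TP + TN + FP + FN) for binary labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Sweep thresholds, compute TPR (Eq (20)) and FPR (Eq (21)) at each,
    then integrate the ROC curve with the trapezoidal rule."""
    P = sum(y_true)
    N = len(y_true) - P
    points = []
    for thr in sorted(set(scores)) + [float("inf")]:
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        points.append((fp / N, tp / P))
    points.sort()
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def mse(y_true, y_pred):
    """Mean squared error for long-sequence prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error for long-sequence prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```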
Comparison experimental results
Comparing with the KT methods.
To evaluate the efficacy of the proposed model for short-term forecasting, we employed state-of-the-art algorithms in the comparison testing. The state-of-the-art approaches include BKT [8], KTM [53], DKT [11], DKVMN [54], SAKT [55], AKT [32], CoKT [56], GKT [39], CoSKT [57], GRKT [74], and FlucKT [75]. The comparative findings of the proposed models on the dataset can be seen in Table 3.
The Bi-LSTM and informer model demonstrated superior performance, achieving an accuracy of 78.49% and an AUC of 78.81% for the Assistments2009 dataset, an accuracy of 74.22% and an AUC of 72.82% for the Assistments2017 dataset, and an accuracy of 68.17% and an AUC of 70.78% for the EdNet dataset. The ROC curves are provided in Fig 6.
In addition, the superiority of the proposed strategy compared to the most advanced techniques is demonstrated in Fig 7.
The proposed strategy has demonstrated higher performance compared to other methods in terms of accuracy and AUC on three distinct datasets.
Overall, deep learning techniques described in the literature have consistently outperformed shallow learning models for short-term prediction. This demonstrates the benefit of using deep learning models to capture the internal structure of students' knowledge states in knowledge tracing applications. Nevertheless, shallow learning techniques may still hold significant worth in some applications; for example, KTM [53] demonstrates competitive accuracy and AUC values on the three datasets. Moreover, the graph-based approach exemplified by GKT [39] incorporates the connections between KCs to construct a graph, and GKT is capable of producing outcomes similar to those of the proposed method. Hence, graph-based prior information could be utilized to enhance the proposed model further.
Comparing with the predicting methods.
In order to assess the effectiveness of the model in predicting extended sequences, we employed state-of-the-art algorithms in our comparative tests. The advanced algorithms now used are ARIMA [58], Prophet [59], LSTMa [60], LSTnet [61], and DeepAR [62]. The comparison outcomes of the proposed models on the dataset may be located in Table 4.
As demonstrated in Table 4, the proposed approach has achieved superior performance over the state-of-the-art techniques. In addition, DeepAR also achieves promising outcomes in terms of accuracy and AUC across the three datasets, which can mainly be attributed to its matching deep learning architecture.
Ablation study
To delve deeper into the individual contributions and significance of each component within the proposed model, we conducted a comprehensive series of ablation studies. These investigations are pivotal for understanding the distinct roles and the cumulative impact of the various elements that constitute our model. The LSTKT model, which stands at the core of our research, was systematically deconstructed and its structure was alternately substituted with two distinct design configurations. These alternative designs were crafted to isolate the effects of specific architectural choices and to evaluate their individual merits and drawbacks. The modifications were meticulously documented and visually represented in Fig 8, which provides a clear comparative framework.
The findings of comparing the LSTKT models with two distinct architectures are presented in Fig 9.
To further validate the effectiveness of the proposed model’s parameter settings, we conducted the ablation studies on two crucial hyperparameters: the early stopping and the dropout rate. Early stopping is a crucial technique in preventing overfitting during the training process. By varying the patience parameter, its impact on the model’s performance was evaluated. In general, the values of 5, 10, and 15 were tested. This approach not only helped to fine-tune the training process but also provided insights into how the model’s generalization ability was affected by different stopping points. The results of this ablation study demonstrated that an early stop setting with a patience of 10 led to the best performance in terms of the model’s ability to handle long-sequence time-series prediction for knowledge tracing, as shown in Fig 10.
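The early-stopping rule with a patience parameter can be sketched as follows; `train_with_early_stopping` is an illustrative helper operating on a recorded loss curve, not part of the released code:

```python
def train_with_early_stopping(epoch_losses, patience=10):
    """Stop when the validation loss has not improved for `patience`
    consecutive epochs; returns the epoch index at which training halts
    (or the last epoch if the rule never triggers)."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(epoch_losses):
        if loss < best:
            best = loss      # improvement: reset the patience counter
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch  # patience exhausted: halt here
    return len(epoch_losses) - 1
```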
In addition, dropout is a regularization technique that helps prevent overfitting by randomly deactivating neurons during training. We tested dropout rates of 0.0, 0.1, 0.2, 0.3, and 0.4. As presented in Table 5, the optimal dropout rate of 0.1, as used in the main experiments, effectively mitigated overfitting while maintaining the model’s learning capacity, leading to better generalization performance.
Discussion
The proposed deep learning model, which integrates Bi-LSTM and Informer networks, presents a significant advancement in the field of knowledge tracing. The integration of bidirectional LSTM with Informer networks has yielded a model that is not only capable of capturing the intricate temporal dynamics of learning but also adept at generalizing across various educational contexts. The bidirectional flow of information in Bi-LSTM allows for a more nuanced understanding of a student’s learning trajectory, while the Informer network’s optimized complexity ensures that the model remains computationally feasible. This balance between depth and efficiency is crucial for practical applications in educational technology. The proposed approach outperforms existing models in terms of predictive accuracy and AUC. The main findings highlight the model’s ability to process large volumes of educational data without compromising on the granularity of analysis. This is particularly important in educational settings where long-term and accurate feedback can significantly influence learning outcomes.
Despite the promising results, the study still has the following limitations. Firstly, the model’s performance is highly reliant on the quality and representativeness of the input data. Poor-quality data or data that fails to adequately represent real-world scenarios may significantly degrade the model’s performance, thus affecting the reliability and practicality of the research results. Secondly, although the complexity of the Informer network has been reduced to some extent, there remains room for further optimization. Especially when dealing with the vast and diverse datasets in real-world educational environments, the current network architecture may not be able to handle them efficiently, and further improvements are required to meet the needs of large-scale data processing.
Conclusion
In this study, a novel deep learning approach has been proposed that synergistically combines Bi-LSTM and Informer networks for knowledge tracing. The proposed method leverages a Bi-LSTM network to meticulously extract sequential information from students’ records, capturing both forward and backward dependencies within the learning process. Subsequently, the refined pipeline feeds the output of the Bi-LSTM into an Informer network, which has been optimized to reduce computational complexity significantly while maintaining its efficacy in handling long-range dependencies.
The integration of Bi-LSTM and Informer networks has demonstrated a substantial improvement in the accuracy of knowledge tracing models. The bidirectional nature of Bi-LSTM provides a comprehensive understanding of the learning sequence, while the Informer network, with its reduced complexity, efficiently processes this information to predict knowledge levels and learning outcomes. One of the key advancements in this approach is the reduction of the Informer network’s complexity. By optimizing the probability self-attention mechanism, the model achieves a balance between performance and computational efficiency, making it more suitable for large-scale applications.
The proposed model opens avenues for further research and development. Future work could focus on enhancing the model’s interpretability to provide educators with actionable insights into students’ learning trajectories. In addition, exploring the integration of multimodal data, such as combining textual, behavioral, and physiological data, could offer a more holistic view of the learning process. Furthermore, investigating adaptive learning strategies based on the model’s predictions could lead to personalized learning experiences that are dynamically tailored to each student’s needs.
References
- 1. O’Dea XC, Stern J. Virtually the same?: online higher education in the post Covid-19 era. Br J Educ Technol. 2022;53(3):437–42. pmid:35600417
- 2. Cakiroglu U, Ozkan A, Cevi̇k I, Kutlu D, Kahyar S. What motivate learners to continue a professional development program through Massive Open Online Courses (MOOCs)?: a lens of self-determination theory. Educ Inf Technol. 2023;29(6):7027–51.
- 3. Abdelrahman G, Wang Q, Nunes B. Knowledge tracing: a survey. ACM Comput Surv. 2023;55(11):1–37.
- 4. Bai Y, Zhao J, Wei T, Cai Q, He L. A survey of explainable knowledge tracing. arXiv preprint 2024.https://arxiv.org/abs/2403.07279
- 5. Gao W, Xu F, Zhou Z-H. Towards convergence rate analysis of random forests for classification. Artificial Intelligence. 2022;313:103788.
- 6. Kowalewska A, Urbaniak R. Measuring coherence with Bayesian networks. Artif Intell Law. 2022;31(2):369–95.
- 7. Gao X, Deng Y. Inferable dynamic Markov model to predict interference effects. Engineering Applications of Artificial Intelligence. 2023;117:105512.
- 8. Corbett AT, Anderson JR. Knowledge tracing: modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction. 2005;4:253–78.
- 9. Yu H, Zhao W, Zhao Q. Distributed representation learning and intelligent retrieval of knowledge concepts for conceptual design. Advanced Engineering Informatics. 2022;53:101649.
- 10. Xu Y, Mostow J. In: Proceedings of the 6th International Conference on Educational Data Mining, Memphis, Tennessee, USA. 2013. p. 356–7.
- 11. Piech C, Bassen J, Huang J, Ganguli S, Sahami M, Guibas LJ, et al. Deep knowledge tracing. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, Montreal, Quebec, Canada. 2015. p. 505–13.
- 12. Krusen M, Ernst F. Very fast digital 2D rigid motion estimation directly on continuous k-space data using an RNN. Biomedical Signal Processing and Control. 2024;87:105413.
- 13. Wang J, Fu J. Synchronization control of Markovian complex neural networks with time-varying delays. Springer; 2024. https://doi.org/10.1007/978-3-031-47835-2
- 14. Duan Z, Dong X, Gu H, Wu X, Li Z, Zhou D. Towards more accurate and interpretable model: fusing multiple knowledge relations into deep knowledge tracing. Expert Systems with Applications. 2024;243:122573.
- 15. Tsutsumi E, Guo Y, Kinoshita R, Ueno M. Deep knowledge tracing incorporating a hypernetwork with independent student and item networks. IEEE Trans Learning Technol. 2024;17:951–65.
- 16. Yang H, Hu S, Geng J, Huang T, Hu J, Zhang H, et al. Heterogeneous graph-based knowledge tracing with spatiotemporal evolution. Expert Systems with Applications. 2024;238:122249.
- 17. Han D, Kim D, Kim M, Han K, Yi MY. Temporal enhanced inductive graph knowledge tracing. Appl Intell. 2023;53(23):29282–99.
- 18. Nakagawa H, Iwasawa Y, Matsuo Y. Graph-based knowledge tracing: modeling student proficiency using graph neural network. In: IEEE/WIC/ACM International Conference on Web Intelligence. 2019. p. 156–63. https://doi.org/10.1145/3350546.3352513
- 19. Sun J, Du S, Liu Z, Yu F, Liu S, Shen X. Weighted heterogeneous graph-based three-view contrastive learning for knowledge tracing in personalized e-learning systems. IEEE Trans Consumer Electron. 2024;70(1):2838–47.
- 20. Chen Z, Ma M, Li T, Wang H, Li C. Long sequence time-series forecasting with deep learning: a survey. Information Fusion. 2023;97:101819.
- 21. Alnasyan B, Basheri M, Alassafi M. The power of deep learning techniques for predicting student performance in virtual learning environments: a systematic literature review. Computers and Education: Artificial Intelligence. 2024;6:100231.
- 22. Cohen J, Goldhaber D. Building a more complete understanding of teacher evaluation using classroom observations. Educational Researcher. 2016;45(6):378–87.
- 23. Affuso G, Zannone A, Esposito C, Pannone M, Miranda MC, De Angelis G, et al. The effects of teacher support, parental monitoring, motivation and self-efficacy on academic performance over time. Eur J Psychol Educ. 2022;38(1):1–23.
- 24. Chavez-Rivera AD, Inostroza-Nieves Y, Hemal K, Chen W. Longitudinal study. Translational Surgery. Elsevier; 2023. p. 223–6. https://doi.org/10.1016/b978-0-323-90300-4.00074-4
- 25. Manna T, Anitha A. Hybridization of rough set–wrapper method with regularized combinational LSTM for seasonal air quality index prediction. Neural Comput & Applic. 2023;36(6):2921–40.
- 26. Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H, et al. Informer: beyond efficient transformer for long sequence time-series forecasting. arXiv preprint 2023.
- 27. Collier M, Beel J. Implementing neural turing machines. Lecture Notes in Computer Science. Springer; 2018. p. 94–104. https://doi.org/10.1007/978-3-030-01424-7_10
- 28. Miller A, Fisch A, Dodge J, Karimi A-H, Bordes A, Weston J. Key-value memory networks for directly reading documents. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. https://doi.org/10.18653/v1/d16-1147
- 29. Abdelrahman G, Wang Q. Knowledge tracing with sequential key-value memory networks. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019. p. 175–84. https://doi.org/10.1145/3331184.3331195
- 30. Sun X, Zhao X, Li B, Ma Y, Sutcliffe R, Feng J. Dynamic key-value memory networks with rich features for knowledge tracing. IEEE Trans Cybern. 2022;52(8):8239–45. pmid:33531331
- 31. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA. 2017. p. 5998–6008.
- 32. Ghosh A, Heffernan N, Lan AS. Context-aware attentive knowledge tracing. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020. p. 2330–9. https://doi.org/10.1145/3394486.3403282
- 33. Pandey S, Srivastava J. RKT: relation-aware self-attention for knowledge tracing. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2020. p. 1205–14. https://doi.org/10.1145/3340531.3411994
- 34. Choi Y, Lee Y, Cho J, Baek J, Kim B, Cha Y, et al. Towards an appropriate query, key, and value computation for knowledge tracing. In: Proceedings of the Seventh ACM Conference on Learning @ Scale. 2020. p. 341–4. https://doi.org/10.1145/3386527.3405945
- 35. Jiang H, Xiao B, Luo Y, Ma J. A self-attentive model for tracing knowledge and engagement in parallel. Pattern Recognition Letters. 2023;165:25–32.
- 36. Shin D, Shim Y, Yu H, Lee S, Kim B, Choi Y. SAINT+: integrating temporal features for EdNet correctness prediction. In: LAK21: 11th International Learning Analytics and Knowledge Conference. 2021. p. 490–6. https://doi.org/10.1145/3448139.3448188
- 37. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. 2017.
- 38. Tong S, Liu Q, Huang W, Hunag Z, Chen E, Liu C, et al. Structure-based knowledge tracing: an influence propagation view. In: 2020 IEEE International Conference on Data Mining (ICDM). 2020. p. 541–50. https://doi.org/10.1109/icdm50108.2020.00063
- 39. Nakagawa H, Iwasawa Y, Matsuo Y. Graph-based knowledge tracing: modeling student proficiency using graph neural networks. Web Intelligence. 2021;19(1–2):87–102.
- 40. Liu Q, Huang Z, Yin Y, Chen E, Xiong H, Su Y, et al. EKT: exercise-aware knowledge tracing for student performance prediction. IEEE Trans Knowl Data Eng. 2021;33(1):100–15.
- 41. Su Y, Liu Q, Liu Q, Huang Z, Yin Y, Chen E, et al. Exercise-enhanced sequential modeling for student performance prediction. AAAI. 2018;32(1):2435–43.
- 42. Yin Y, Liu Q, Huang Z, Chen E, Tong W, Wang S, et al. QuesNet. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019. p. 1328–36. https://doi.org/10.1145/3292500.3330900
- 43. Chen Y, Liu Q, Huang Z, Wu L, Chen E, Wu R, et al. Tracking knowledge proficiency of students with educational priors. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 2017. p. 989–98. https://doi.org/10.1145/3132847.3132929
- 44. Abdelrahman G, Wang Q. Deep graph memory networks for forgetting-robust knowledge tracing. IEEE Trans Knowl Data Eng. 2022;1–13.
- 45. Manasa P, Malik A, Batra I. Detection of twitter spam using GLoVe vocabulary features, bidirectional LSTM and convolution neural network. SN COMPUT SCI. 2024;5(2):206.
- 46. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. pmid:9377276
- 47. ElAdel A, Zaied M, Amar CB. Fast DCNN based on FWT, intelligent dropout and layer skipping for image retrieval. Neural Netw. 2017;95:10–8. pmid:28843091
- 48. Li H, Rajbahadur GK, Lin D, Bezemer C-P, Jiang ZM. Keeping deep learning models in check: a history-based approach to mitigate overfitting. IEEE Access. 2024;12:70676–89.
- 49. Clevert D, Unterthiner T, Hochreiter S. Fast and accurate deep network learning by exponential linear units (ELUs). In: Conference Track Proceedings, San Juan, Puerto Rico. 2016.
- 50. Choi Y, Lee Y, Shin D, Cho J, Park S, Lee S, et al. EdNet: a large-scale hierarchical dataset in education. Lecture Notes in Computer Science. Springer; 2020. p. 69–73. https://doi.org/10.1007/978-3-030-52240-7_13
- 51. Shahade AK, Walse KH, Thakare VM, Atique M. Multi-lingual opinion mining for social media discourses: an approach using deep learning based hybrid fine-tuned smith algorithm with adam optimizer. International Journal of Information Management Data Insights. 2023;3(2):100182.
- 52. Zhu H, Liu S, Xu W, Dai J, Benbouzid M. Linearithmic and unbiased implementation of DeLong’s algorithm for comparing the areas under correlated ROC curves. Expert Systems with Applications. 2024;246:123194.
- 53. Vie J-J, Kashima H. Knowledge tracing machines: factorization machines for knowledge tracing. AAAI. 2019;33(01):750–7.
- 54. Zhang J, Shi X, King I, Yeung D-Y. Dynamic key-value memory networks for knowledge tracing. In: Proceedings of the 26th International Conference on World Wide Web. 2017. p. 765–74. https://doi.org/10.1145/3038912.3052580
- 55. Pandey S, Karypis G. A self attentive model for knowledge tracing. In: Proceedings of the 12th International Conference on Educational Data Mining, EDM 2019, Montréal, Canada. 2019.
- 56. Long T, Qin J, Shen J, Zhang W, Xia W, Tang R, et al. Improving knowledge tracing with collaborative information. In: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 2022. p. 599–607. https://doi.org/10.1145/3488560.3498374
- 57. Zhang C, Ma H, Cui C, Yao Y, Xu W, Zhang Y, et al. CoSKT: a collaborative self-supervised learning method for knowledge tracing. IEEE Trans Learning Technol. 2024;17:1476–88.
- 58. Chai SH, Lim JS, Yoon H, Wang B. A novel methodology for forecasting business cycles using ARIMA and neural network with weighted fuzzy membership functions. Axioms. 2024;13(1):56.
- 59. Wang W, He N, Chen M, Jia P. Freight rate index forecasting with Prophet model based on multi-dimensional significant events. Expert Systems with Applications. 2024;249:123451.
- 60. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings, San Diego, CA, USA, 2015.
- 61. Wang D, Chen C. Spatiotemporal self-attention-based LSTNet for multivariate time series prediction. International Journal of Intelligent Systems. 2023;2023(1).
- 62. Ferenczi A, Bădică C. Prediction of Ethereum gas prices using DeepAR and probabilistic forecasting. Journal of Information and Telecommunication. 2023;8(1):18–32.
- 63. Abdelsattar M, Ismeil MA, Zayed MMAA, Abdelmoety A, Emad-Eldeen A. Assessing machine learning approaches for photovoltaic energy prediction in sustainable energy systems. IEEE Access. 2024;12:107599–615.
- 64. Ahmed E-E, Azim MA, Abdelsattar M, AbdelMoety A. Utilizing machine learning and deep learning for enhanced supercapacitor performance prediction. Journal of Energy Storage. 2024;100(Part B):113556.
- 65. Raza A, Uddin J, Zou Q, Akbar S, Alghamdi W, Liu R. AIPs-DeepEnC-GA: predicting anti-inflammatory peptides using embedded evolutionary and sequential feature integration with genetic algorithm based deep ensemble model. Chemometrics and Intelligent Laboratory Systems. 2024;254:105239.
- 66. Akbar S, Ullah M, Raza A, Zou Q, Alghamdi W. DeepAIPs-Pred: predicting anti-inflammatory peptides using local evolutionary transformation images and structural embedding-based optimal descriptors with self-normalized BiTCNs. J Chem Inf Model. 2024;64(24):9609–25. pmid:39625463
- 67. Ullah M, Akbar S, Raza A, Khan KA, Zou Q. TargetCLP: clathrin proteins prediction combining transformed and evolutionary scale modeling-based multi-view features via weighted feature integration approach. Brief Bioinform. 2024;26(1):bbaf026. pmid:39844339
- 68. Shahid, Hayat M, Alghamdi W, Akbar S, Raza A, Kadir RA, et al. pACP-HybDeep: predicting anticancer peptides using binary tree growth based transformer and structural feature encoding with deep-hybrid learning. Sci Rep. 2025;15(1):565. pmid:39747941
- 69. Ullah M, Akbar S, Raza A, Zou Q. DeepAVP-TPPred: identification of antiviral peptides using transformed image-based localized descriptors and binary tree growth algorithm. Bioinformatics. 2024;40(5):btae305. pmid:38710482
- 70. Mao S, Zhan J, Li J, Jiang Y. Knowledge structure-aware graph-attention networks for knowledge tracing. Knowledge Science, Engineering and Management. 2022:309–21.
- 71. Mao S, Zhan J, Wang Y, Jiang Y. Improving knowledge tracing via considering two types of actual differences from exercises and prior knowledge. IEEE Trans Learning Technol. 2023;16(3):324–38.
- 72. Fu L, Guan H, Du K, Lin J, Xia W, Zhang W, et al. SINKT: a structure-aware inductive knowledge tracing model with large language model. In: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM ’24). 2024. p. 632–42.
- 73. Zhou H, Rong W, Zhang J, Sun Q, Ouyang Y, Xiong Z. AAKT: enhancing knowledge tracing with alternate autoregressive modeling. IEEE Trans Learning Technol. 2025;18:25–38.
- 74. Cui J, Qian H, Jiang B, Zhang W. Leveraging pedagogical theories to understand student learning process with graph-based reasonable knowledge tracing. arXiv preprint. 2024. https://arxiv.org/abs/2406.12896
- 75. Hou M, Li X, Guo T, Liu Z, Tian M, Luo R, et al. Cognitive fluctuations enhanced attention network for knowledge tracing. AAAI. 2025;39(13):14265–73.