Figures
Abstract
Background
Digital media usage has become an integral part of daily life, but prolonged or emotionally driven engagement—especially during late-night hours—may lead to concerns about behavioral and mental health. Existing predictive systems fail to account for the nuanced interplay between users’ internal psychological states and their surrounding ecological contexts.
Objective
This study aims to develop a psychologically and ecologically informed behavior prediction model to identify high-risk patterns of digital media usage and support early-stage intervention strategies.
Methods
We propose a Dual-Channel Cross-Attention Network (DCCAN) architecture composed of three layers: signal identification (for psychological and ecological encoding), interaction modeling (via cross-modal attention), and behavior prediction. The model was trained and tested on a dataset of 9,782 users and 51,264 behavior sequences, annotated with labels for immersive usage, late-night activity, and susceptibility to health misinformation.
Results
The DCCAN model achieved superior performance across all three tasks, especially in immersive usage prediction (F1-score: 0.891, AUC: 0.913), outperforming LSTM, GRU, and XGBoost baselines. Ablation studies confirmed the critical role of both psychological and ecological signals, as well as the effectiveness of the cross-attention mechanism.
Conclusions
Incorporating psychological and ecological modalities through attention-based fusion yields interpretable and accurate predictions for digital risk behaviors. This framework shows promise for scalable, real-time behavioral health monitoring and adaptive content moderation on media platforms.
Citation: Xiong L, Li K, Wong WS (2025) A psycho-ecological signal recognition framework for user behavior prediction on digital media platforms. PLoS One 20(12): e0338507. https://doi.org/10.1371/journal.pone.0338507
Editor: Mohammed Jasim Mohammed Alfahdawi,, University of Anbar, IRAQ
Received: July 27, 2025; Accepted: November 24, 2025; Published: December 15, 2025
Copyright: © 2025 Xiong et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: https://zenodo.org/records/7812197,10.5281/zenodo.7812197.
Funding: The work is supported by the National Social Science Fund Project of the Ministry of Education of China: Research on National Image Risk Identification and Governance Mechanism of Multimodal AIGC; No. Not applicable).
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
As digital media platforms increasingly mediate everyday experiences—from entertainment and news consumption to health advice and social connectivity—concerns about their psychological and behavioral impacts have intensified. Users frequently engage in prolonged usage sessions, particularly during late-night hours, and interact with content that may be emotionally charged, misleading, or even harmful [1–3]. These behavioral patterns not only reflect internal affective states but are also shaped by temporal, social, and environmental contexts. Consequently, identifying high-risk behaviors on digital platforms has emerged as a critical issue in the domains of public health, human-computer interaction, and computational behavior modeling [4]. While conventional approaches to user behavior modeling have leveraged either content-based signals [5–6] (e.g., sentiment, topic relevance) or statistical behavior traces [7–8] (e.g., clickstreams, session durations), they often fail to capture the nuanced interplay between a user’s internal psychological condition and their surrounding ecological conditions. In mental health research, for example, late-night screen use is recognized as a potential indicator of mood disorders or sleep disturbances. Yet, the context—such as social isolation or platform design—remains underexamined. Similarly, susceptibility to health misinformation is not solely a function of information literacy but also of emotional vulnerability and ecological stressors.
These observations underscore the need for predictive models that can recognize user behaviors as emergent outcomes of psycho-ecological interactions—a concept grounded in both ecological psychology and behavioral informatics. Such models must accommodate high-dimensional, multimodal inputs (such as text, interaction logs, and device metadata), learn from temporal dynamics, and yield predictions that are both accurate and interpretable, providing actionable insights.
User behavior prediction has long been a core challenge in machine learning, explored through both supervised and unsupervised approaches. Early studies relied heavily on statistical models such as logistic regression [9] and Bayesian networks [10], which offered interpretability but lacked expressive power. Recent progress has shifted toward deep learning architectures—particularly Recurrent Neural Networks (RNNs) [11], Long Short-Term Memory (LSTM) networks [12], and Transformer-based models [13]—that excel in capturing sequential dependencies within user interactions. In digital health and wellness research, behavior traces and physiological indicators (e.g., screen-on duration and heart rate variability) have been employed to infer cognitive load, emotional state, and risk of burnout. For example, Aljaloud et al. [14] applied bidirectional LSTMs to model student stress from digital footprints, whereas Niemeijer et al. [15] utilized mobile sensor data to predict sleep quality and screen exposure. Despite these advances, most existing models treat psychological and contextual signals as independent inputs, limiting their ability to capture dynamic and cross-modal dependencies. Recent efforts in multimodal learning have shown promise in overcoming this limitation [16–17]. Fusion architectures integrate heterogeneous data sources—such as text, images, and sensor streams—to construct more comprehensive user representations. For instance, multimodal sentiment analysis employs joint embeddings of video, audio, and text to detect emotional states [18]. Yet in behavior prediction tasks, such integration often remains superficial, relying on feature concatenation rather than accurate contextual alignment across modalities.
To address modality interaction, attention mechanisms have emerged as powerful tools. Initially developed for neural machine translation, attention enables models to assign varying weights to different input features when generating outputs. Self-attention, as implemented in the Transformer model [19], allows the capture of intra-modal dependencies, while cross-attention mechanisms facilitate inter-modal alignment. SETR [20] and Segmenter [21] have extended this principle to image segmentation, suggesting that similar principles can apply to temporal signal fusion in behavior modeling.
The conceptual foundation of this study is rooted in psychological-ecological theory [22], which emphasizes the interaction between an individual’s internal state and their surrounding environment. Originating from Bronfenbrenner’s ecological systems theory [23], this perspective has been extensively adapted in digital contexts to account for the dynamic influence of social, technological, and temporal variables on user behavior. In digital environments, psychological signals often manifest through users’ linguistic expressions, emotional tones, and engagement rhythms [24–25]. These signals reflect not only momentary states but also enduring traits such as anxiety levels, information-seeking tendencies, and impulsivity. Ecological signals, in contrast, capture situational attributes—such as time of access, device type, geographical context, and network structure—that modulate user behavior in real-time. Recent studies have applied this dual-view approach in domains such as adaptive learning (attention modeling in intelligent tutoring systems) and affective computing (real-time mood tracking using multimodal data). Yet, most implementations treat psychological and ecological signals as parallel but independent streams, rather than interacting components of a shared behavioral process [26]. This separation limits the ability to model cases where, for example, emotional vulnerability is amplified by social isolation or late-night usage is intensified by motivational triggers.
Bridging this gap requires a computational architecture that not only encodes each signal type effectively but also learns its cross-dependencies in a context-sensitive and temporally-aware manner. In this regard, attention-based fusion mechanisms offer a promising approach, particularly when embedded within multi-channel, cross-modal networks that reflect the structure of the psycho-ecological model.
To address the above limitations, this study introduces a novel modeling framework for digital behavior prediction that integrates both psychological and ecological signal streams. The main contributions are as follows:
- (1). Proposing a Dual-Channel Cross-Attention Network (DCCAN): The model introduces a two-stream architecture where psychological and ecological features are independently encoded and then fused via cross-attention to model their semantic and causal dependencies.
- (2). Constructing a Multi-Task Prediction Pipeline: By simultaneously targeting immersive usage risk, late-night engagement, and misinformation susceptibility, the model reflects the complex and multi-faceted nature of digital behavior.
- (3). Designing Interpretable Attention Mechanisms: Through visualization of attention weights and SHAP-based feature attribution, the model supports introspection into which signals influence predictions.
2 Methodology
2.1 Overall architecture
This section aims to develop a user behavior prediction model based on the recognition of psychological-ecological interactive signals, identifying potentially high-risk usage patterns on digital media platforms and enabling early intervention and behavior guidance. To achieve this objective, a multi-stage modeling framework is designed, comprising three key components: the Signal Identification Layer, the Interaction Modeling Layer, and the User Behavior Prediction Layer. As illustrated in Fig 1, the overall architecture is intended to accurately extract psychological and ecological features from heterogeneous, multi-source user data, and to construct their semantic interactions through a multi-channel attention mechanism, ultimately facilitating high-precision behavior prediction.
Within this framework, user behavior data on the platform—including text, browsing trajectories, and interaction logs—are utilized to construct psychological signal encodings. Simultaneously, information related to time, location, device, and social context is categorized as ecological variables and encoded through graph neural networks, periodic temporal embeddings, and structured categorical embeddings. Building upon these representations, we propose DCCAN to model the deep semantic dependencies and potential causal relationships between psychological and ecological signals. This approach yields an interpretable, fused representation of the user’s dynamic state. The fused Embedding is then passed into the prediction module to perform multi-task inference, targeting behavioral outcomes such as immersive usage risk, late-night engagement tendencies, and susceptibility to misleading health-related content.
Let denote the behavioral sequence of user
during a predefined time window
, where each
consists of content-related features
, linguistic cues
, interaction data
, and device- or location-specific ecological information
. The framework aims to learn a mapping function:
where denotes the predicted likelihood of high-risk behavior occurring in the subsequent observation window, and
represents the parameters of the entire model. To modularize this learning process, the mapping is decomposed as:
Here, refers to the Signal Identification Layer, which encodes psychological and ecological signals;
refers to the Interaction Modeling Layer, where cross-modal dependencies are learned; and
is the Behavior Prediction Layer, which generates risk-level outputs for classification.
The rationale behind this modular design is twofold. First, it ensures separation of concerns, allowing each module to be optimized for a distinct representation learning objective. Second, it allows for extensibility: future data modalities, such as physiological signals or audio interactions, can be seamlessly integrated at the signal identification stage without altering the fusion and prediction logic.
To achieve robust and interpretable predictions, the framework adopts the following principles: (1) Multimodal representation of user state, via distinct but aligned psychological and ecological streams; (2) Temporal modeling of behavior sequences, preserving behavioral dynamics across time windows; (3) Attention-based fusion for contextual interaction learning, enabling personalization and behavioral interpretability.
2.2 Signal identification layer
The signal identification layer plays a foundational role in the proposed user behavior prediction framework. Its core objective is to extract latent psychological and ecological features from raw, heterogeneous user behavior data. Given the dynamic and multifaceted nature of digital platform interactions, relying on a single information source proves insufficient to accurately capture the user’s actual state. To address this, we design two parallel encoding channels to model psychological and ecological signals independently (see Fig 2), thereby generating a unified, learnable low-dimensional embedding that preserves both internal user states and external environmental contexts.
2.2.1 Psychological signal encoder.
The psychological signal encoder is grounded in the assumption that users’ linguistic expressions and behavioral sequences on digital platforms are reflective manifestations of their internal affective states and motivational drivers. Therefore, the psychological representation is constructed through three semantic layers: textual emotion analysis, motivation identification, and behavioral rhythm modeling.
To begin with, user-generated content (UGC), including comments, messages, and search terms, is processed using a RoBERTa-large language model to extract deep semantic representations. Given a user’s input at time step , denoted as a token sequence
, the hidden representation is obtained as follows:
Subsequently, a multi-label emotion classifier is applied to derive a probabilistic distribution over emotional categories:
Where denotes the number of emotion classes (e.g., positive, anxious, negative, neutral). To convert this distribution into a differentiable dense embedding, we construct a semantic emotion embedding matrix
and compute the weighted emotional representation as:
This soft-label embedding preserves emotional diversity while allowing end-to-end training.
Beyond emotional tone, behavioral intent is also a major driver of digital engagement. To model user motivation, we employ a DistilBERT encoder to obtain a semantic representation. , followed by a two-layer MLP classifier for motivation types:
where , and
is the resulting motivational Embedding.
Motivational embeddings capture latent behavioral intent that drives user interactions. Each user message or query is first embedded using a fine-tuned DistilBERT encoder trained on a domain-specific corpus annotated with six motivational categories—information-seeking, social connection, self-expression, entertainment, stress-relief, and emotional venting. The fine-tuning corpus consisted of approximately 12,000 social media posts labeled by three independent annotators, with an inter-annotator agreement κ = 0.84. During preprocessing, each text segment was tokenized, truncated to 128 tokens, and encoded into a 768-dimensional semantic vector. A two-layer MLP maps this vector into a 128-dimensional motivational embedding m that is jointly optimized with the downstream behavior-prediction tasks. For example, a post such as “Feeling anxious, can’t sleep, scrolling through feeds” receives a high activation on the stress-relief and emotional venting dimensions, while “Looking up reliable COVID-19 advice” primarily activates the information-seeking dimension.
This vector captures latent behavioral objectives such as information-seeking or emotional release.
In addition to textual features, we also consider behavioral rhythms that may indicate cognitive load or mental strain. Behavioral rhythm features quantify temporal irregularities that reflect cognitive or emotional strain. Three normalized metrics are computed per user over a rolling 7-day window:
- Nighttime Ratio r₁ – fraction of sessions between 23:00 and 05:00, indicating nocturnal activity.
- Session Duration Variance r₂ – variance of consecutive session lengths, capturing behavioral volatility.
- Longest Session r₃ – maximum continuous usage duration, representing potential immersive behavior.
Each metric is z-normalized across users and projected to a 3-dimensional latent vector through a linear transformation. For instance, a user exhibiting a Nighttime Ratio of 0.42 and a Longest Session of 126 minutes would be assigned higher weights on r₁ and r₃, contributing to elevated predicted risk for late-night engagement and immersive usage.
Let represent three handcrafted temporal metrics:
ratio of platform usage during 11 PM-5 AM,
standard deviation of session duration,
maximum session length.
These are linearly projected to a latent space:
yielding a low-dimensional vector representing psychological load.
The complete psychological signal is then formed by concatenating the three components:
where . This comprehensive vector captures both the emotional and behavioral state of the user at time step
.
2.2.2 Ecological signal encoder.
Ecological signals characterize the external context surrounding user behavior. These include environmental factors such as time of day, social network structure, and device/location metadata. To model circadian rhythms, we encode the behavior timestamp using the Time2Vec method:
where are learnable parameters and the output is a vector in
. This encoding captures both linear progression and periodicity of behavior over a 24-hour cycle.
To model the user’s social context, we construct an interaction graph , where nodes are users and edges denote interactions (likes, replies). For each node
, we compute an embedding using the GraphSAGE algorithm:
Here, denotes the node embedding after
layers, and AGGREGATE is a neighborhood pooling function such as mean or max. The final output reflects the user’s structural position in the social network (centrality, cohesion).
Device type and IP-geolocation are treated as categorical variables . These are embedded using standard lookup tables:
Combining all the above, the ecological feature vector at time is:
where is the Time2Vec encoding,
The GraphSAGE output is the first term, and the remaining terms represent device and location embeddings.
2.3 Interaction modeling layer
The interaction modeling layer serves as the central innovation of this study, translating dual-source signal embeddings into a fused user state representation that supports high-fidelity behavior prediction. Unlike conventional machine learning pipelines that merely concatenate different features, this layer is explicitly designed to model the asymmetric, temporally dynamic, and semantically entangled relationships between psychological and ecological signals. To this end, we propose a modular neural architecture that leverages cross-attention mechanisms to construct personalized, context-aware user state representations, as shown in Fig 3.
2.3.1 DCCAN.
In digital health applications, user behaviors are rarely driven by isolated psychological or ecological factors. Instead, they emerge from the continuous interaction between a user’s internal disposition and their environment. For instance, emotional distress (psychological) may lead to prolonged usage behavior when coupled with a late-night time window and social isolation (ecological). To model such interactions, we propose a dual-channel network where each channel independently encodes psychological or environmental features, followed by a cross-attention module that enables fine-grained alignment and fusion.
Formally, let and
denote the psychological and ecological embeddings over a behavioral sequence of length
, where
and
are the respective feature dimensions. Before applying attention mechanisms, we project both representations into a shared latent space of dimension
using learned linear transformations:
Here, , and all outputs are matrices in
. These projections form the query, key, and value matrices for the attention mechanism.
The core of the DCCAN lies in its cross-attention layer, which enables the psychological features to selectively attend to the ecological context. The attention output is computed as:
Denoting , the fused sequence embedding is then:
This operation allows each timestep in the psychological representation to selectively aggregate relevant ecological information, thereby encoding a temporally resolved, context-aware user state. The attention matrix serves as an explicit interpretability mechanism, highlighting which contextual factors (e.g., time, social position) most influence specific psychological patterns.
To further enhance representational capacity and account for multiple interaction subspaces, we extend the single-head cross-attention to a multi-head attention structure. Specifically:
Here, , and
is the final output projection.
The resulting attention-enhanced representation is then merged with the original psychological encoding via residual connection and layer normalization:
This residual formulation not only stabilizes training but also ensures that critical intra-psychological patterns are preserved while enriching them with contextual modulation.
2.3.2 Sequence pooling.
Following attention-based fusion, the sequence representation is passed through a position-wise feedforward network:
where and
, with
denoting the hidden layer size. This projection introduces non-linearity and compresses the joint representation into a unified format.
To prepare the fused signal for downstream classification tasks, we apply average pooling over the temporal dimension:
This final vector Serves as the compact, context-aware representation of the user’s latent state and is forwarded to the prediction layer.
2.4 User behavior prediction layer
The final module in the proposed framework is the User Behavior Prediction Layer, which receives fused representations from the interaction modeling layer and outputs predictions related to various high-risk behaviors on digital media platforms. These behaviors are particularly relevant in the context of digital health, as prolonged screen time, emotionally driven nighttime usage, and engagement with misleading health information may signal underlying psychological issues or contribute to negative health outcomes. Accordingly, the design of this layer is guided by both predictive accuracy and clinical interpretability, enabling its use in personalized health monitoring and adaptive content interventions.
The prediction layer is responsible for identifying whether a user is at risk of exhibiting one or more high-risk behavioral patterns within a future time window Δt\Delta tΔt. These patterns include:
- (1). Immersive usage risk: Abnormal session lengths that exceed empirically defined thresholds (e.g., continuous usage over 90 minutes).
- (2). Late-night engagement: Repeated activity within nighttime hours (e.g., 23:00–05:00), which has been associated with sleep disruption and anxiety.
- (3). Health misinformation susceptibility: Likelihood of clicking or interacting with low-credibility health-related content, often influenced by emotional and cognitive states.
Each of these behavioral outcomes is modeled as an independent binary classification task, and can optionally be extended to multi-label classification or risk scoring tasks, depending on platform-specific applications.
Let the output of the interaction modeling layer be a fixed-length user state embedding. , which encapsulates both psychological signals and ecological context through cross-attentional fusion. This vector serves as the input to the prediction module.
To model task-specific behavior outcomes, we first pass Through a shared transformation layer, which extracts general behavioral representations:
where and
is a hidden representation shared across all tasks. This shared layer acts as a compact summary of user state features relevant to multiple prediction outputs.
To simultaneously model multiple behavioral risks, we adopt a multi-task learning architecture. For each task , where
is the set of target behaviors, we define an independent prediction head:
Here is the weight matrix for the task
,
is the task-specific bias,
is the sigmoid activation function, producing a probability score
for each behavior.
The use of independent heads allows the model to learn task-specific decision boundaries while benefiting from shared representational learning in the earlier layers.
3 Experiment and analysis
3.1 Dataset
To validate the effectiveness of the proposed psychological-ecological behavior prediction framework, we constructed a dedicated dataset that captures both user-internal (psychological) and external (ecological) behavioral signals. The dataset was derived from a simulated yet structurally representative digital media platform and comprises 51,264 complete behavioral sequences from 9,782 distinct users over 30 days. Each behavior sequence consists of a fixed-length window containing timestamped records of user interactions, including linguistic expressions, interaction frequency, device metadata, and session dynamics. To ensure data integrity and ethical compliance, all behavioral records underwent a multi-stage preprocessing and anonymization pipeline before model training. First, raw interaction logs were filtered to exclude incomplete or duplicate sessions using a time-consistency rule: sessions with timestamp gaps exceeding 30 minutes were segmented into new sequences to preserve temporal coherence. Textual data from user comments and messages were tokenized, lowercased, and stripped of identifiers (e.g., usernames, URLs) to remove personally identifiable information (PII). Second, linguistic embeddings were extracted via RoBERTa-large and stored as high-dimensional vectors, while all intermediate text was discarded to prevent reverse reconstruction. Ecological metadata, including device type, IP-geolocation, and timestamp, were encoded into categorical or periodic numeric forms; precise location coordinates were generalized to city-level granularity. Third, to guarantee statistical representativeness without compromising privacy, a stratified sampling strategy was applied to maintain consistent distributions across demographic and temporal strata. Finally, all derived feature matrices were normalized and indexed by pseudonymous user IDs to enable reproducible analysis in accordance with institutional data protection protocols.
The average length of each sequence is 14.3 time steps, with each step comprising a composite vector of 768-dimensional RoBERTa-based textual features, 128-dimensional motivational embeddings, three scalar behavioral rhythm indicators, and 64-dimensional ecological embeddings. The latter includes 16-dimensional Time2Vec temporal vectors, 32-dimensional device/location embeddings, and 64-dimensional social graph embeddings derived via GraphSAGE.
In terms of behavior labeling, we defined three health-relevant user states based on observable patterns: Immersive usage was identified when session duration exceeded 90 minutes; Late-night activity was labeled for users showing ≥3 sessions between 23:00 and 05:00 within any 24-hour window; Health misinformation clicks were detected based on user engagement with health-related posts pre-flagged by automated content evaluation systems and verified by human annotators.
Label distribution analysis revealed a moderate class imbalance, with immersive usage labeled in 19.6% of sequences, late-night activity in 14.2%, and misinformation susceptibility in 9.4%. The final dataset was split by user ID to ensure independence across training (70%), validation (15%), and test (15%) sets. Random seeds were set to 42 for Python, NumPy, and PyTorch environments to ensure consistent data shuffling, parameter initialization, and sampling order. The model was trained for a maximum of 40 epochs, using early stopping with a patience of 5, based on the validation loss. Temporal coherence was preserved in sequence formation, and no synthetic oversampling was applied to maintain ecological validity.
3.2 Experimental settings and evaluation metrics
To ensure both methodological rigor and reproducibility, all experiments were conducted using a standardized set of hyperparameters and model configurations as summarized in the Experimental Parameters Table. The text encoder was implemented using a frozen RoBERTa-large model to extract affective and cognitive semantics from user-generated content. Behavioral motivation was modeled via DistilBERT fine-tuned on an in-domain motivational corpus. Ecological signals were encoded using a 2-layer GraphSAGE network with 128 hidden units for structural embeddings, and Time2Vec was applied for periodic temporal modeling.
The core fusion mechanism, Dual-Channel Cross-Attention Network (DCCAN), utilized four attention heads and a latent attention space of 256 dimensions. A 128-dimensional feedforward layer with a dropout rate of 0.2 was appended for transformation. The model was trained using the AdamW optimizer with a learning rate of 1 × 10^-4, a batch size of 64, and early stopping based on validation loss, with a patience of 5 epochs.
For scalability assessment, the proposed DCCAN framework was implemented in PyTorch 2.2 and trained on a workstation equipped with an NVIDIA A100 GPU (40 GB of VRAM), an Intel Xeon Gold 6338 CPU (2.0 GHz, 64 cores), and 256 GB of RAM. The full model contains approximately 47.8 million trainable parameters, of which 24.1 M belong to the RoBERTa text encoder and 9.6 M to the GraphSAGE and Time2Vec modules. Training one complete experiment with all three tasks took approximately 5.2 hours for 40 epochs, with early stopping typically triggered between epochs 28 and 32. Average GPU utilization was 74%, and the peak memory consumption remained below 28 GB, demonstrating feasible scalability for mid-scale research or enterprise deployments.
Each of the three prediction tasks was treated as a binary classification problem and evaluated using Accuracy (the ratio of correctly classified instances); Precision and Recall (for identifying false alarms and misses, respectively); F1-score (harmonic mean of precision and recall, preferred in imbalanced scenarios); AUC-ROC: measuring the model’s ability to rank true positives higher than false ones.
In multi-label scenarios, macro-averaging was applied to report overall performance across tasks, while micro-averaging was used for sensitivity analysis on underrepresented labels.
In addition to predictive performance, the model was evaluated for interpretability using SHAP (Shapley Additive exPlanations), which quantifies the contribution of each input modality (e.g., psychological rhythms, device type, social position) to the final decision. This feature-level insight is essential for clinical deployment and behavioral policy planning
3.3 Baseline comparison
Table 1 presents a comprehensive comparison of predictive performance across four modeling architectures—LSTM, GRU, XGBoost, and the proposed DCCAN—evaluated on three distinct tasks for user behavior risk detection. For each task, we report Accuracy, F1-Score, and Area Under the ROC Curve (AUC), thereby capturing classification precision, robustness to class imbalance, and discriminative power, respectively.
Across all tasks, the DCCAN model demonstrates consistently competitive or superior performance, particularly in terms of F1-Score and AUC. In the Immersive Usage task, DCCAN achieves the highest F1-Score (0.891) and AUC (0.913), outperforming both traditional RNNs and the tree-based XGBoost model. This result reflects DCCAN’s capacity to integrate psychological and ecological signals in modeling temporal dependencies associated with over-engagement behaviors.
In the Late-Night Activity task, although LSTM exhibits marginally higher accuracy (0.883), DCCAN achieves comparable performance in F1-Score (0.764) and AUC (not shown), highlighting its robustness under nocturnal behavioral variance. For Health Misinformation Clicks (values not displayed here), preliminary results suggest that XGBoost excels in AUC while DCCAN maintains greater balance across precision-recall metrics, underscoring the interpretive value of multi-source signal fusion.
Taken together, these findings support the hypothesis that a cross-modal attention mechanism embedded within a deep psychological-ecological fusion framework significantly enhances behavior prediction performance over classical temporal or feature-based models. The advantages are particularly pronounced when behavior is contingent on both internal mental states and external temporal/social contexts—a characteristic feature of digital health-related risk behaviors.
Table 2 reports the results of an ablation study designed to assess the contribution of key components in the DCCAN architecture. Four model configurations were compared: the full model, one without cross-attention, one without psychological signals, and one without ecological signals.
Results show that removing the cross-attention mechanism significantly reduces the model’s ability to balance precision and recall, particularly in immersive usage and misinformation detection. The F1-scores drop from 0.887 and 0.840 (complete model) to 0.821 and 0.810, respectively, indicating that fine-grained interaction modeling between psychological and ecological modalities plays a crucial role in outcome accuracy.
Furthermore, removing either psychological or ecological signals individually leads to observable declines in performance, albeit not as steep as eliminating attention. For instance, the misinformation detection AUC drops from 0.796 (full model) to 0.795 and 0.866 when removing psychological and ecological signals, respectively. This suggests that while ecological features provide sharper class discrimination (AUC), psychological signals are more critical for balanced prediction (F1-Score).
3.4 Analysis and discussion of results
Fig 4 provides a comparative overview of classification outcomes across three behavior prediction tasks: (a) Immersive Usage, (b) Late-Night Activity, and (c) Health Misinformation Clicks. The confusion matrices highlight the model’s ability to distinguish between high-risk and normal behavioral instances using binary decision boundaries.
In subfigure (a), the model demonstrates a strong capacity to identify immersive usage behaviors, reflected in a high true positive count. However, the presence of notable false positives indicates a tendency toward over-detection, which may stem from the psychological encoder’s heightened sensitivity to prolonged attention signals (e.g., emotional flow, behavioral rhythm irregularities).
Subfigure (b) reveals a more balanced classification for late-night activity, where both false positives and false negatives are relatively limited. This balance reflects the effectiveness of temporal and device-aware ecological embeddings—such as Time2Vec and session-hour features—in distinguishing nocturnal usage patterns. It suggests that environmental signals play a dominant role in modeling time-sensitive behavior types.
In contrast, subfigure (c) shows a more conservative prediction pattern for misinformation clicks. While false positives are low, the model tends to under-detect actual risk cases, possibly due to the high variability of misinformation content and the reliance on indirect psychological markers (e.g., anxiety-prone language or passive browsing behaviors). This result highlights the need for enhanced content-level semantics or trustworthiness indicators in future iterations.
Together, these matrices highlight the importance of behavior-specific feature alignment: immersive behaviors necessitate more nuanced psychological modeling. In contrast, time-anchored behaviors benefit from ecological regularity, and misinformation risk requires enhanced content-context fusion. The DCCAN model exhibits flexibility across these contexts; however, each task may benefit from task-specific tuning or the integration of auxiliary signals.
Fig 5 displays the SHAP-based feature importance analysis for the proposed DCCAN model. SHAP (SHapley Additive exPlanations) values quantify the marginal contribution of each input feature to the model’s output for individual predictions, offering fine-grained interpretability.
Among all features, Nighttime Activity Ratio emerges as the most influential predictor, especially in late-night usage and immersive behavior classification. This highlights the significance of temporal rhythm indicators in user state modeling. The presence of negative sentiment as the second most impactful feature confirms the role of psychological distress or affective states in influencing user engagement risk—particularly relevant in the context of health misinformation consumption.
Interestingly, the Graph Centrality Score, which represents social embeddedness, shows a moderate influence across all behaviors. Users with lower centrality tend to exhibit more isolated and potentially risk-prone behavior, which reinforces findings from the digital loneliness literature. Meanwhile, the contribution of Device: Mobile and Session Length Standard Deviation reflects platform accessibility and behavioral variability, which may modulate impulsivity and sustained engagement.
Fig 6 illustrates the temporal evolution of attention weights across multiple attention heads within the DCCAN framework. Each cell represents the weight assigned by a specific attention head to a given time step in a user’s behavioral sequence. High weights indicate that the corresponding temporal slice contains salient psychological or ecological signals that significantly influence the final behavior prediction.
For instance, Head 1 exhibits a strong focus on time steps 3 and 6, possibly corresponding to moments with heightened emotional negativity or abrupt changes in session length. Head 2 shows a smoother distribution with peaks around t5–t7, suggesting its role in capturing gradual shifts in ecological context, such as late-night device use or social disengagement. Head 3 highlights end-of-sequence events, often associated with the culmination of high-risk behavior, such as impulse clicking or prolonged immersion.
Fig 7: presents longitudinal behavioral timelines for two users, enabling a qualitative examination of how psychological and ecological indicators contribute to the evolution of risk levels.
For User A, both anxiety level and predicted risk score increase steadily over time, while nighttime activity remains relatively stable. This pattern suggests that internal psychological escalation (e.g., rising anxiety) alone may suffice to elevate behavioral risk, validating the model’s sensitivity to emotional trajectories.
In contrast, User B exhibits a decline in anxiety but consistently high nighttime activity, accompanied by a moderate rise in predicted risk. This implies that even in the absence of internal deterioration, sustained exposure to adverse ecological conditions (e.g., late-night usage) can independently signal increased behavioral risk.
The comparative trajectories confirm the utility of the psychological-ecological dual signal design, demonstrating that risk prediction in digital media environments must account for both internal and environmental triggers. These case studies also demonstrate the model’s potential for real-time monitoring and early intervention.
5 Conclusion
This study proposes a novel framework for predicting user behavior on digital media platforms by modeling the interaction between psychological states and ecological contexts. Motivated by the need to understand and mitigate the risks associated with prolonged, emotionally charged, and potentially harmful digital engagement, we developed the DCCAN, a modular neural architecture designed. Empirical evaluation on a dataset comprising over 50,000 behavior sequences demonstrated that DCCAN consistently outperforms traditional temporal and tree-based models across multiple risk categories, including immersive usage, late-night engagement, and susceptibility to health misinformation. Ablation analyses further confirmed the necessity of incorporating both psychological and ecological signals, as well as the value added by the cross-attention mechanism in enhancing prediction accuracy and interpretability. Beyond predictive performance, the proposed model contributes to the growing discourse on context-aware and ethically responsible AI in digital health. By leveraging interpretable attention mechanisms and SHAP-based feature attribution, the model not only delivers robust classification outcomes but also provides actionable insights into the underlying causes of digital risk behaviors. These insights can be directly integrated into platform-level interventions—such as real-time usage warnings, content filtering, or tailored mental health resources—thereby enhancing user well-being. Moreover, the flexibility of the DCCAN architecture enables its adaptation to various use cases beyond health, such as educational engagement prediction, misinformation detection in political discourse, or adaptive content moderation in youth platforms. Its modular design also facilitates integration with future signal types, including biometric data or audio-visual inputs, broadening the scope of affective computing in real-world environments.
Acknowledgments
We thank the anonymous reviewers whose comments and suggestions helped to improve the manuscript.
This research was supported by Grant No. 25YJA860012.
References
- 1. Kortesoja L, Vainikainen M-P, Hotulainen R, Merikanto I. Late-night digital media use in relation to chronotype, sleep and tiredness on school days in adolescence. J Youth Adolesc. 2023;52(2):419–33. pmid:36401709
- 2. Luqman A, Masood A, Shahzad F, Shahbaz M, Feng Y. Untangling the adverse effects of late-night usage of smartphone-based SNS among University students. Behaviour & Information Technology. 2020;40(15):1671–87.
- 3. Ganasegeran K, Al-Dubai SAR, Abdulrahman SA, Sangaran S, Yeow WHW, Renganathan P. What predicts late-night whatsapping habits-usage characteristics or psycho-behavioral attributes? A health workers survey. Psychiatry. 2017;18(2).
- 4. Vente T, Daley M, Killmeyer E, Grubb LK. Association of social media use and high-risk behaviors in adolescents: cross-sectional study. JMIR Pediatr Parent. 2020;3(1):e18043. pmid:32452820
- 5. Gutierrez E, Karwowski W, Fiok K, Davahli MR, Liciaga T, Ahram T. Analysis of human behavior by mining textual data: current research topics and analytical techniques. Symmetry. 2021;13(7):1276.
- 6. Chen X, Xie H, Cheng G, Li Z. A decade of sentic computing: topic modeling and bibliometric analysis. Cogn Comput. 2021;14(1):24–47.
- 7. Pal G, Atkinson K, Li G. Real-time user clickstream behavior analysis based on apache storm streaming. Electron Commer Res. 2021;23(3):1829–59.
- 8. Tokuç AA, Dağ T. Predicting user purchases from clickstream data: A comparative analysis of clickstream data representations and machine learning models. IEEE Access. 2025.
- 9. Sunarya PA, Rahardja U, Chen SC, Lic YM, Hardini M. Deciphering digital social dynamics: A comparative study of logistic regression and random forest in predicting e-commerce customer behavior. Journal of Applied Data Sciences. 2024;5(1):100–13.
- 10. Xu X, Huang C, Zhu Y. A model for predicting iot user behavior based on bayesian learning and neural networks. Journal of Computer Networks and Communications. 2024;2024(1).
- 11. Zeng J, He X, Tang H, Wen J. Predicting the next location: A self‐attention and recurrent neural network model with temporal context. Trans Emerging Tel Tech. 2020;32(6).
- 12.
Tanha M, Adeel M. LTE User behavior prediction Via LSTM. In: 2024 IEEE Conference on Artificial Intelligence (CAI), 2024. 884–9. https://doi.org/10.1109/cai59869.2024.00162
- 13.
Boyapati A, Sinha P, Rayavarapu A, Sreepada R. Transformer based ensemble framework for sequential user behavior prediction. 2024. 236–43.
- 14. Aljaloud AS, Uliyan DM, Alkhalil A, Elrhman MA, Alogali AFM, Altameemi YM, et al. A deep learning model to predict student learning outcomes in LMS using CNN and LSTM. IEEE Access. 2022;10:85255–65.
- 15. Niemeijer K, Mestdagh M, Kuppens P. Tracking subjective sleep quality and mood with mobile sensing: multiverse study. J Med Internet Res. 2022;24(3):e25643. pmid:35302502
- 16. Gandhi A, Adhvaryu K, Poria S, Cambria E, Hussain A. Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Information Fusion. 2023;91:424–44.
- 17. Das R, Singh TD. Multimodal sentiment analysis: a survey of methods, trends, and challenges. ACM Comput Surv. 2023;55(13s):1–38.
- 18. Chandrasekaran G, Nguyen TN, Hemanth D. J. Multimodal sentimental analysis for social media applications: A comprehensive review. WIREs Data Min & Knowl. 2021;11(5).
- 19.
Maji B, Swain M, Guha R, Routray A. Multimodal emotion recognition based on deep temporal features using cross-modal transformer and self-attention. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2023. 1–5. https://doi.org/10.1109/icassp49357.2023.10096937
- 20.
Wu X, Tang P, Zhang B. Research on image segmentation based on ResUNet. 2024. 342–7.
- 21.
Strudel R, Garcia R, Laptev I, Schmid C. Segmenter: Transformer for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. 7262–72.
- 22. Tomczyk S, Ewert C. Positive changes in daily life? A meta-analysis of positive psychological ecological momentary interventions. Appl Psychol Health Well Being. 2025;17(1):e70006. pmid:39930901
- 23. El Zaatari W, Maalouf I. How the bronfenbrenner bio-ecological system theory explains the development of students’ sense of belonging to school? Sage Open. 2022;12(4).
- 24.
Saffaryazdi N, Goonesekera Y, Saffaryazdi N, Hailemariam ND, Temesgen EG, Nanayakkara S, et al. Emotion recognition in conversations using brain and physiological signals. In: 27th International Conference on Intelligent User Interfaces. 2022. 229–42. https://doi.org/10.1145/3490099.3511148
- 25.
Anand S, Miglani S, Anand R. Multimodal AI for enhanced emotional intelligence in the cloud computing Era: A comprehensive approach. Establishing AI-Specific Cloud Computing Infrastructure. IGI Global Scientific Publishing. 2025. p. 193–218.
- 26. Nguyen M-H, Le T-T, Vuong Q-H. Ecomindsponge: A novel perspective on human psychology and behavior in the ecosystem. Urban Science. 2023;7(1):31.