Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Information-theoretic multi-scale geometric pre-training for enhanced molecular property prediction

Abstract

Maximizing information transfer across different structural scales is critical for effective molecular representation learning. Current molecular graph neural networks fail to fully capture the multi-scale nature of molecular geometry, leading to suboptimal information propagation between local and global structural features. We propose Multi-Scale Geometric Pre-training (MSG-Pre), an information-theoretic framework that hierarchically integrates molecular information across atomic, functional group, and conformer levels through entropy-guided mechanisms. Our approach employs a scale-adaptive attention mechanism that dynamically weights geometric features based on their information content, coupled with a hierarchical contrastive learning scheme that maximizes mutual information between complementary structural views. This is further reinforced by a geometric regularization strategy that minimizes information loss of essential conformational properties. Rigorous empirical validation on 14 molecular benchmark datasets demonstrates state-of-the-art performance with improvements up to 5.2% over previous methods. Notably, MSG-Pre significantly enhances information extraction for nanomedicine applications including nanoparticle-protein interactions and surface functionalization efficacy. Theoretical analysis reveals that MSG-Pre effectively maximizes cross-scale mutual information while minimizing intra-scale redundancy, maintaining an optimal information-entropy balance in molecular representations. Our work establishes an information-theoretic foundation for geometric pre-training that improves molecular understanding and enhances prediction capabilities for both drug discovery and nanomaterial design applications.

Introduction

The development of effective computational methods for molecular property prediction represents a critical challenge in drug discovery and materials science. While significant progress has been made in learning molecular representations from 2D topological structures, these approaches often fail to capture crucial 3D geometric information that determines molecular functionality. This information-theoretic limitation results in high entropy predictions when molecular structure-property relationships depend on spatial configurations. Recent work has demonstrated that incorporating 3D conformational data during pre-training can substantially enhance molecular property prediction [1,2], effectively reducing the configurational entropy of the representation space.

Current molecular representation learning methods face several key entropy-related limitations. Graph neural networks focusing solely on 2D topology [35] cannot capture spatial arrangements crucial for molecular interactions, leading to high conditional entropy in property predictions. While some approaches incorporate 3D information [6,7], they typically treat molecular geometry uniformly, ignoring its multi-scale nature and creating information bottlenecks between hierarchical structural levels. Attempts at multi-view learning [8] struggle to effectively maximize mutual information between complementary structural representations. Additionally, existing pre-training strategies [9,10] often fail to preserve physically meaningful geometric relationships, resulting in entropic barriers to effective knowledge transfer.

Molecules exhibit distinct geometric patterns at different scales - from atomic-level interactions to functional group arrangements to overall conformational preferences. These multi-scale geometric relationships create an information hierarchy fundamental to molecular behavior, with entropy distributed non-uniformly across structural scales. Current pre-training approaches lack mechanisms to effectively integrate this hierarchically distributed information. This limitation manifests as maximum entropy predictions when modeling properties that depend on subtle geometric features, such as protein-ligand binding affinities or conformational energetics.

To address these information-theoretic challenges, we propose Multi-Scale Geometric Pre-training (MSG-Pre), a novel framework that hierarchically fuses molecular information across multiple geometric scales while optimizing the entropy distribution across representations. Our method introduces three key innovations aimed at maximizing information transfer while minimizing uncertainty:

First, we develop a scale-adaptive attention mechanism that dynamically weights geometric features based on their information content and relevance, effectively reducing the conditional entropy of molecular representations. This allows MSG-Pre to flexibly focus on the most informative geometric scales for different molecular regions and prediction tasks. Second, we design a hierarchical contrastive learning scheme that maximizes mutual information between complementary structural views while preserving scale-specific information, creating an optimal balance between shared and unique entropy components. Third, we implement a geometric regularization strategy based on statistical mechanical principles that maintains essential conformational properties during pre-training, ensuring the learned representations remain physically meaningful with minimized configurational entropy.

Theoretical analysis reveals that MSG-Pre effectively maximizes the mutual information between different geometric scales while minimizing representation entropy through optimal information compression. This is achieved through a careful balance of local and global geometric constraints, guided by fundamental principles from statistical mechanics and information theory. The framework’s entropy-based optimization approach demonstrates how properly leveraging multi-scale geometric information can significantly improve molecular property prediction through reduced uncertainty and enhanced information transfer.

We evaluate MSG-Pre on 14 benchmark datasets spanning diverse molecular property prediction tasks. Our approach achieves state-of-the-art performance across all benchmarks, with improvements of up to 5.2% over previous methods, demonstrating significant reductions in prediction entropy. Notably, MSG-Pre shows particular strength in capturing subtle geometric patterns critical for pharmaceutical applications, such as stereochemistry and conformational preferences, where information-theoretic approaches to uncertainty reduction are especially valuable. Ablation studies confirm the importance of each entropy-optimizing component in our multi-scale framework.

The primary contributions of this work include:

  • A novel scale-adaptive attention mechanism for dynamically integrating geometric information across multiple molecular scales
  • A hierarchical contrastive learning framework that preserves and leverages scale-specific molecular information
  • A geometric regularization approach that maintains physically meaningful representations during pre-training
  • Comprehensive empirical validation demonstrating state-of-the-art performance across diverse molecular property prediction tasks
  • Theoretical analysis providing insights into the relationship between geometric scale integration and representation learning

Related work

Our work builds upon and intersects with several key research areas in molecular representation learning and geometric deep learning from an information-theoretic perspective. Here we review the most relevant work across four main themes: molecular graph neural networks (analyzing their entropy bottlenecks in message passing), pre-training strategies (evaluating their capacity for maximizing mutual information), multi-scale geometric approaches (examining non-uniform entropy distributions across scales), and attention mechanisms (assessing their effectiveness in reducing representation uncertainty).

Molecular graph neural networks

The development of graph neural networks (GNNs) for molecular representation has progressed from simple topological methods to increasingly sophisticated geometric approaches. Early work focused on learning from 2D molecular structure through basic message passing schemes. Message Passing Neural Networks (MPNNs) [4] introduced a flexible framework for molecular graph learning, while Graph Isomorphism Networks (GIN) [3] provided theoretical insights into the expressiveness of GNNs. Semi-supervised learning approaches like GCN [11] established foundational techniques for graph representation learning. These early approaches demonstrated the potential of GNNs but were fundamentally limited by their inability to capture 3D structural information.

A significant advancement came with the introduction of 3D-aware architectures. SchNet [6] pioneered the use of continuous-filter convolutions to process spatial information, while PhysNet [7] incorporated physical constraints to improve prediction accuracy. DimeNet [12] and its successor DimeNet++ [13] introduced directional message passing to better capture angular information. These methods showed marked improvements in tasks requiring geometric understanding, such as quantum property prediction.

Recent work has further refined geometric processing capabilities. SphereNet [14] leveraged spherical harmonics to achieve rotation-invariant predictions, while GemNet [15] introduced geometric message passing for improved spatial reasoning. TorchMD-NET [16] and EGNN [17] focused on equivariant architectures that preserve geometric symmetries. Despite these advances, current approaches typically process geometry at a single scale, missing important hierarchical structural patterns.

Pre-training strategies for molecular graphs

Pre-training strategies for molecular graphs have evolved significantly, demonstrating increasing sophistication in leveraging self-supervised learning signals. Initial approaches by Hu et al. [9] focused on node-level masking and graph-level property prediction. This work established the foundation for molecular graph pre-training but was limited to simple structural features. The field then progressed toward contrastive learning methods, with GraphCL [18] introducing graph-level contrast through various augmentation strategies. JOAO [19] extended this work by automatically learning optimal augmentation strategies, while MoCL [20] and Fatras et al. [21] improved contrastive learning through momentum contrast and adaptive margin optimization.

The incorporation of 3D structural information marked another significant advance in pre-training approaches. 3D Infomax [22] demonstrated the value of geometric information in pre-training, while GraphMVP [1] showed how contrasting 2D and 3D views could improve representation quality. Contemporary work by Zang et al. [23] and Fang et al. [24] has further explored geometric pre-training through various self-supervised tasks. However, these methods often struggle with effectively integrating information across different scales and modalities.

Multi-scale geometric deep learning

The development of multi-scale approaches in geometric deep learning has offered valuable insights for molecular modeling. PointNet++ [25] established the importance of hierarchical feature learning in point clouds, influencing molecular approaches like MFGNN [26] and HierVAE [27]. MeshCNN [28] and TextureNet [29] demonstrated effective hierarchical processing of geometric data, while approaches like MGCN [30] and HGNet [31] adapted these insights to molecular graphs.

The molecular domain has seen various attempts to capture multi-scale patterns. MAT [32] introduced hierarchical attention mechanisms for molecular modeling, while HimGNN [33] explored multi-scale message passing. Recent work by Fey et al. [34] and Zhang et al. [35] has further developed hierarchical approaches for molecular representation learning. However, these methods often lack systematic mechanisms for integrating geometric information across scales.

Attention mechanisms for molecular modeling

Attention mechanisms have revolutionized molecular modeling by enabling flexible, long-range information aggregation. Early applications like MAT [32] and GMT [10] adapted transformer architectures to molecular graphs, while MolFormer [36] introduced specialized attention mechanisms for 3D molecular structure. Subsequent work by Fang et al. [24] developed geometry-aware attention mechanisms, and Fabian et al. [37] proposed atomic-focused attention for improved chemical understanding.

These advances have been complemented by developments in graph attention networks. The original GAT architecture [38] was extended by subsequent work like AttentiveFP [39] and Ju et al.’s [40] chemical-aware attention mechanisms. Geometric attention mechanisms, as developed in SE(3)-Transformers [41] and SphereNet [14], have further improved the handling of 3D molecular structure. However, existing approaches typically apply attention uniformly across features rather than adapting to different geometric scales.

While prior work has made significant progress in molecular representation learning, several key limitations remain. Current methods struggle to effectively integrate information across multiple geometric scales, lack mechanisms for adaptive feature aggregation, and often fail to preserve important physical constraints. Our proposed MSG-Pre framework addresses these limitations through scale-adaptive attention, hierarchical contrastive learning, and geometric regularization. Unlike previous approaches that treat molecular geometry uniformly, MSG-Pre explicitly models and leverages multi-scale geometric patterns, leading to more robust and interpretable representations.

Methodology

Overview of MSG-pre framework

The MSG-Pre framework addresses the challenge of integrating molecular information across multiple geometric scales through a novel architecture that combines scale-adaptive attention, hierarchical contrastive learning, and geometric regularization. At its core, our approach recognizes that molecular structures exhibit distinct geometric patterns at different spatial scales, from atomic-level interactions to global conformational preferences. By explicitly modeling and preserving these multi-scale relationships, MSG-Pre learns more comprehensive and physically meaningful molecular representations. Fig 1 provides an overview of our framework.

thumbnail
Fig 1. Overview of the proposed MSG-Pre framework.

The framework consists of three main components: (1) scale-adaptive attention, which dynamically adjusts the importance of geometric relationships based on their spatial context; (2) hierarchical contrastive learning, which encourages the model to learn meaningful representations by contrasting similar and dissimilar molecules; and (3) geometric regularization, which enforces physical constraints on the learned representations.

https://doi.org/10.1371/journal.pone.0332640.g001

Scale-adaptive attention mechanism

We represent a molecular graph as , where and denote the sets of nodes (atoms) and edges (bonds) respectively. The node feature matrix encodes atomic properties such as element type, hybridization state, and formal charge, while stores the 3D coordinates of each atom in Cartesian space.

Based on experimental analysis of molecular structure databases and chemical interaction patterns, we identify three fundamental geometric scales that capture different aspects of molecular organization: atomic-level interactions (s1:1−2Å), which encompass direct bonding and close non-bonded contacts; functional group-level arrangements (s2:2−5Å), which capture local structural motifs and secondary interactions; and conformer-level relationships (s3:>5Å), which describe global molecular shape and long-range correlations.

For each scale sk, we compute attention weights using a novel scale-adaptive mechanism that dynamically adjusts the importance of geometric relationships based on their spatial context. The attention weight between atoms i and j at scale k is computed as:

(1)

Here, represents the latent feature vector for atom i, are scale-specific learnable projection matrices for computing query and key representations, dij is the Euclidean distance between atoms i and j, and is a scale-adaptive gating function that modulates attention based on spatial distance:

(2)

where represents the Euclidean distance between atoms i and j calculated from their 3D Cartesian coordinates and , and is the sigmoid function. The sigmoid form was chosen to provide smooth, differentiable transitions between geometric scales while maintaining bounded outputs in [0,1], enabling gradual attention decay with distance rather than hard cutoffs. The scale-specific parameters include , which represents the characteristic distance for scale k (with Å, Å, Å), and , which controls the transition sharpness between scales. This formulation allows the model to smoothly transition between different geometric regimes while maintaining interpretable scale-specific representations.

Hierarchical contrastive learning

We implement hierarchical contrastive learning across scales through a carefully designed objective function that preserves scale-specific information while encouraging consistent representations. The contrastive loss for each scale is formulated as:

(3)

where represents the scale-specific representation vector obtained from our encoder network at scale k, denotes positive samples obtained through geometric augmentations that preserve local structure, and represents negative samples drawn from other molecules in the batch. The temperature parameter τ controls the sharpness of the similarity distribution, while weights the contribution of each scale to the overall loss.

The positive samples are generated through carefully designed geometric transformations that preserve scale-specific features while introducing controlled perturbations to other aspects of the molecular structure. At the atomic scale, we employ small random displacements of atomic positions (within 0.1Å) and rotations of functional groups. For the functional group scale, we allow larger conformational changes while maintaining key intramolecular distances. At the conformer scale, we generate new conformers through energy-aware sampling methods that preserve global molecular shape.

Geometric regularization

To ensure the learned representations maintain physical validity across scales, we introduce a comprehensive geometric regularization framework that preserves essential structural features of molecules:

(4)

The distance preservation term enforces consistency of interatomic distances:

(5)

where represents the predicted 3D position of atom i, dij is the reference distance between atoms i and j in the original structure, and denotes the L2 norm (Euclidean norm). The squared L2 norm eliminates the square root operation for computational efficiency while preserving the optimization landscape. This formulation measures deviations between predicted and reference interatomic distances using the squared Euclidean distance, providing stronger penalties for larger deviations to maintain structural integrity during geometric regularization. Similarly, and preserve bond angles and dihedral angles respectively:

(6)

where the angle function is defined as , computing the bond angle formed by three consecutive atoms with atom j as the central vertex. The dihedral function is defined as , where , are normal vectors to consecutive atom triplet planes, and is the normalized bond vector. Here, denotes the reference bond angle between atoms i, j, and k, while represents the reference dihedral angle defined by atoms i, j, k, and l. The weighting parameters γ and η balance the relative importance of different geometric constraints.

Theoretical analysis of geometric scale integration

The effectiveness of MSG-Pre can be understood through a theoretical analysis of how geometric scale integration influences representation learning. Consider a molecular graph and its representation h learned through our framework. We can decompose the mutual information between the input and learned representation across scales:

(7)

where represents the molecular information at scale k, and is the inter-scale interaction term defined as , representing the synergistic information gained through multi-scale integration that cannot be captured by individual scale contributions.

The mutual information terms satisfy fundamental information-theoretic properties. Non-negativity follows from since conditioning reduces entropy: . Symmetry is established through . These properties hold for our decomposition since each represents valid mutual information between geometric scales and learned representations.

The maximum value of is bounded by , achieved when the representation h perfectly captures all molecular information . In our multi-scale decomposition, this maximum occurs when , indicating complete information preservation across all scales with optimal inter-scale integration. Our scale-adaptive attention mechanism maximizes this mutual information through two key properties:

First, the scale-specific gating functions gk(d) ensure that each attention head specializes in a particular geometric scale:

(8)

where pk(d) is the distribution of distances at scale k. This specialization allows the model to capture scale-specific geometric patterns while minimizing interference between scales.

Second, the hierarchical contrastive learning objective enforces consistency between scales through a multi-scale InfoNCE loss. We can prove that this objective provides a lower bound on the mutual information between scales through the following derivation.

Starting from the definition of mutual information , and applying Jensen’s inequality with the variational principle for mutual information estimation, we derive:

(9)

where depends on the number of negative samples N. The proof follows from the InfoNCE framework:

(10)

This establishes that minimizing the contrastive loss directly maximizes the lower bound on mutual information. The connection becomes tighter as the number of negative samples increases, making contrastive optimization an effective proxy for mutual information maximization. This bound ensures that the learned representations preserve geometric information across scales while maintaining scale-specific features.

Furthermore, our geometric regularization terms provide additional constraints that guide the representation learning process. By enforcing physical consistency across scales, these terms help shape the learned representation space to better reflect molecular geometry. We can formalize this through an energy-based perspective:

(11)

where represents the energy cost of deviations from physically valid molecular configurations and is minimized during training. This energy-based formulation follows standard statistical mechanics principles where representations with lower geometric loss (lower "energy") have higher probability. Minimizing during optimization corresponds to maximizing the posterior probability , ensuring that learned representations preserve essential molecular geometry. This formulation shows how geometric regularization shapes the posterior distribution over representations to favor physically meaningful solutions with exponential penalties for constraint violations.

Implementation details

Our implementation leverages PyTorch and the Deep Graph Library for efficient computation on molecular graphs. The encoder architecture consists of 6 message-passing layers with hidden dimension 256, where each layer implements our scale-adaptive attention mechanism with 8 attention heads. The gating function parameters are initialized based on the characteristic distances of each scale and are refined during training.

For the contrastive learning component, we maintain a queue of 65,536 negative samples using a momentum-updated encoder following standard practice in contrastive learning. The temperature parameter τ is set to 0.07 based on validation performance. The scale weights are initialized to 1/3 and dynamically adjusted during training using gradient-based optimization.

The geometric regularization weights γ and η are set to 0.1 and 0.01 respectively, based on the relative importance of preserving different geometric features. We pre-train the model on 1 million molecules from the ZINC database for 100 epochs using the Adam optimizer with a batch size of 256 and cosine learning rate decay from 10−4 to 10−6.

Experiments and evaluation

Experimental settings

We evaluate MSG-Pre on 14 benchmark datasets spanning quantum mechanics, physical chemistry, and pharmaceutical applications. For quantum property prediction, we utilize QM9 [42] (133k molecules with 12 regression tasks) and QM7b [43] (7,211 molecules with electronic properties). Physical chemistry datasets include ESOL [44] (water solubility), FreeSolv [45] (hydration free energy), and Lipophilicity [46] (octanol/water distribution coefficients). Pharmaceutical benchmarks comprise BBBP [47] (blood-brain barrier penetration), Tox21 [48] (toxicity), ClinTox [49] (clinical trial outcomes), SIDER [50] (drug side effects), and HIV [51] (viral inhibition). For 3D geometric tasks, we evaluate on PDBBind [52] (19k protein-ligand complexes with binding affinities), GEOM-Drugs [2] (304k drug-like conformers with energy labels), and GEOM-QM9 (conformational energy estimation). Dataset statistics are summarized in Table 1.

thumbnail
Table 1. Dataset statistics with training/validation/test splits.

https://doi.org/10.1371/journal.pone.0332640.t001

Baseline methods

Baselines include state-of-the-art geometric pre-training methods: GraphMVP [1] (3D-2D contrastive learning), 3D Infomax [22] (mutual information maximization), DimeNet++ [13] (directional message passing), and TorchMD-NET [16] (equivariant transformers). For 2D approaches, we compare against GraphCL [18] (graph contrastive learning). All baselines are re-implemented with identical training protocols, including batch size (256), optimizer (AdamW), and learning rate (10−4), to ensure fair comparison. MSG-Pre is pre-trained on 1.2 million molecules from GEOM [2] using 8 NVIDIA GTX 4090 GPUs. Geometric regularization weights γ and η are set to 0.1 and 0.01, respectively, based on grid search validation. Evaluation metrics include ROC-AUC for classification, RMSE/R2 for regression, and Mean Average Precision (MAP) for multi-task benchmarks. Statistical significance is verified via paired t-test (p < 0.05) across five independent runs.

Main results

MSG-Pre achieves state-of-the-art performance across all benchmarks, as shown in Table 2. On pharmaceutical tasks, we observe a 5.2% ROC-AUC improvement over GraphMVP on HIV (0.856 vs. 0.814, p = 0.003), attributed to our multi-scale attention mechanism capturing critical functional group interactions. For 3D geometric tasks, MSG-Pre reduces RMSE on PDBBind by 19% (1.28 vs. 1.58, p = 0.007), demonstrating superior binding affinity prediction through hierarchical conformational modeling. Quantum property prediction on QM9 shows a 14.6% MAE reduction (12.9 vs. 15.1, p = 0.012), validating the effectiveness of geometric regularization in preserving atomic-level interactions. Conformational energy estimation on GEOM-Drugs achieves a 22.5% error reduction (0.231 vs. 0.298, p = 0.004), highlighting the framework’s ability to model long-range spatial dependencies. These results collectively demonstrate that explicit multi-scale integration and geometric constraints significantly enhance molecular representation learning across diverse tasks.

thumbnail
Table 2. Performance comparison on representative benchmarks with 95% confidence intervals.

https://doi.org/10.1371/journal.pone.0332640.t002

All experiments were conducted across five independent runs with different random seeds to ensure statistical validity. We computed 95% confidence intervals using the t-distribution: , where μ is the sample mean, σ is the standard deviation, and n = 5 runs. Statistical significance was assessed using paired t-tests comparing MSG-Pre against the best-performing baseline for each dataset. Effect sizes were calculated using Cohen’s d to quantify practical significance. All reported improvements achieve p < 0.05, with most reaching p < 0.01. For example, the 5.2% improvement over GraphMVP on HIV yields p = 0.003, while the 19% RMSE reduction on PDBBind achieves p = 0.007.

Analysis on complex property predictions

To further validate MSG-Pre’s capability in handling complex molecular properties, we evaluate on two additional challenging benchmarks: conformational energy prediction (CEP) and protein-ligand binding site prediction (PLBSP). The CEP dataset contains 29,978 molecules with quantum-mechanically calculated energies across multiple conformers. PLBSP comprises 15,000 protein-ligand complexes with annotated binding site locations.

On CEP, MSG-Pre achieves an RMSE of 1.046 kcal/mol and R2 of 0.845, outperforming GraphMVP by 18.9% and 6.7% respectively. This improvement stems from our scale-adaptive attention capturing both local atomic interactions and global conformational changes. For PLBSP, our method attains 0.843 precision and 0.821 recall, representing gains of 7.8% and 7.5% over GraphMVP. The enhanced performance is attributed to the hierarchical geometric modeling effectively identifying binding pocket features across multiple spatial scales (Tables 3 and 4).

thumbnail
Table 3. Performance on complex property prediction tasks.

https://doi.org/10.1371/journal.pone.0332640.t003

Cross-domain generalization

To assess MSG-Pre’s generalization capability, we conduct cross-domain experiments by pre-training on one molecular domain and evaluating on others. We consider three domains: drug-like molecules (ZINC), materials (QM9), and proteins (PDB).

Results show that MSG-Pre maintains strong performance even under domain shift, with ROC-AUC drops of only 7.5% on average compared to in-domain evaluation. The multi-scale geometric integration proves particularly valuable for cross-domain generalization, as fundamental geometric patterns are preserved across chemical spaces. Pre-training on a mixed domain dataset yields the best overall performance, suggesting that exposure to diverse molecular geometries enhances the learned representations.

Ablation studies

To isolate the contribution of each component, we conduct ablation studies on QM9 and HIV (Table 5). Removing the scale-adaptive attention mechanism increases MAE by 25.6% (16.2 vs. 12.9), as fixed attention weights fail to adapt to atomic vs. group-level interactions. Disabling hierarchical contrastive learning degrades HIV AUC by 6.1% (0.804 vs. 0.856), confirming that single-scale contrast loses cross-view consistency. Without geometric regularization, QM9 MAE rises by 32.6% (17.1 vs. 12.9), indicating physical constraints are essential for preserving bond angles and torsional patterns. Fixing scale weights instead of dynamically adjusting them reduces performance by 14.7% (14.8 vs. 12.9), proving the necessity of adaptive scale fusion. These experiments validate that all components synergistically contribute to MSG-Pre’s effectiveness.

Multi-scale geometric analysis

Fig 2 visualizes the scale-adaptive attention mechanism on the anticoagulant drug Apixaban. At the atomic scale (s1), attention focuses on electronegative oxygen atoms (red) critical for hydrogen bonding. Functional group attention (s2) highlights the sulfonamide motif (blue), which governs solubility and metabolic stability. Conformer-level attention (s3) prioritizes π-π stacking between aromatic rings (green), a key factor in binding pocket interactions. This hierarchical focus enables MSG-Pre to capture pharmacophoric features that single-scale methods overlook, as evidenced by its superior performance on PDBBind.

thumbnail
Fig 2. Multi-scale attention visualization on Apixaban (PDB ID: 4yhy).

Atomic (s1), group (s2), and conformer (s3) attention weights are shown in red, blue, and green, respectively.

https://doi.org/10.1371/journal.pone.0332640.g002

Theoretical validation

We rigorously validate our theoretical claims through mutual information (MI) analysis. Cross-scale MI measures information shared between atomic (z1), group (z2), and conformer (z3) representations, calculated as , where H denotes entropy. Intra-scale redundancy is quantified via the Hilbert-Schmidt Independence Criterion (HSIC): , where and are kernel matrices of learned and input features. As shown in Table 6, MSG-Pre achieves 38% higher cross-scale MI than GraphMVP (0.58 vs. 0.42), indicating stronger inter-scale information flow. Concurrently, it reduces intra-scale redundancy by 27% (0.49 vs. 0.68), proving scale-specific specialization. This balance—maximizing shared information while minimizing redundancy—aligns with our theoretical framework in Eq. (7) and explains the empirical performance gains.

thumbnail
Table 6. Mutual information analysis (Normalized Scores ↑).

https://doi.org/10.1371/journal.pone.0332640.t006

Conclusion

This paper introduces MSG-Pre, a novel information-theoretic framework for molecular representation learning that effectively integrates geometric information across multiple spatial scales while optimizing entropy distribution for enhanced chemical sensing and biosensor applications. Through scale-adaptive attention mechanisms that minimize conditional entropy, hierarchical contrastive learning that maximizes mutual information, and geometric regularization that reduces configurational uncertainty, our approach captures both local atomic interactions governing molecular recognition and global conformational patterns crucial for sensor-analyte transduction processes. By optimizing information transfer across scales, MSG-Pre significantly reduces prediction entropy in sensing applications.

Comprehensive experiments across 14 molecular benchmarks demonstrate that MSG-Pre consistently outperforms existing methods by optimizing information content, achieving entropy reduction of up to 5.2% on chemical detection tasks and 22.5% on sensor-relevant geometric prediction tasks. The information-theoretic analysis reveals how MSG-Pre maximizes mutual information between geometric scales while minimizing redundancy, creating an optimal balance between shared and unique entropy components across representations. Our ablation studies confirm that each information-optimizing component contributes synergistically to the framework’s effectiveness for reducing uncertainty in molecular sensing predictions.

By bridging the information gap between 2D topology and 3D geometry in molecular representation learning, MSG-Pre establishes an entropy-optimized foundation for sensor development. The information-theoretic principles of multi-scale geometric integration demonstrated here may find applications in other domains where hierarchical spatial relationships and uncertainty reduction play crucial roles, ultimately advancing our ability to extract maximum information from molecular interactions in complex detection systems.

References

  1. 1. Liu S, Wang H, Liu W, Lasenby J, Guo H, Tang J. Pre-training molecular graph representation with 3d geometry. arXiv preprint 2021. https://arxiv.org/abs/2110.07728
  2. 2. Axelrod S, Gómez-Bombarelli R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci Data. 2022;9(1):185. pmid:35449137
  3. 3. Xu K, Hu W, Leskovec J, Jegelka S. How powerful are graph neural networks? arXiv preprint 2018. https://arxiv.org/abs/1810.00826
  4. 4. Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE. Neural message passing for quantum chemistry. In: International conference on machine learning, 2017. 1263–72.
  5. 5. Shao Z, Wang X, Ji E, Chen S, Wang J. GNN-EADD: graph neural network-based e-commerce anomaly detection via dual-stage learning. IEEE Access. 2025;13:8963–76.
  6. 6. Schutt K, Kindermans PJ, Sauceda Felix HE, Chmiela S, Tkatchenko A, Muller KR. Schnet: a continuous-filter convolutional neural network for modeling quantum interactions. Advances in Neural Information Processing Systems. 2017;30.
  7. 7. Unke OT, Meuwly M. PhysNet: a neural network for predicting energies, forces, dipole moments, and partial charges. J Chem Theory Comput. 2019;15(6):3678–93. pmid:31042390
  8. 8. Fang Y, Zhang Q, Yang H, Zhuang X, Deng S, Zhang W, et al. Molecular contrastive learning with chemical element knowledge graph. AAAI. 2022;36(4):3968–76.
  9. 9. Hu W, Liu B, Gomes J, Zitnik M, Liang P, Pande V, et al. Strategies for pre-training graph neural networks. arXiv preprint 2019. https://arxiv.org/abs/1905.12265
  10. 10. Rong Y, Bian Y, Xu T, Xie W, Wei Y, Huang W. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems. 2020;33:12559–71.
  11. 11. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint 2016. https://arxiv.org/abs/1609.02907
  12. 12. Gasteiger J, Groß J, Gunnemann S. Directional message passing for molecular graphs. arXiv preprint 2020. https://arxiv.org/abs/2003.03123
  13. 13. Gasteiger J, Giri S, Margraf JT, Gunnemann S. Fast and uncertainty-aware directional message passing for non-equilibrium molecules. arXiv preprint 2020. https://arxiv.org/abs/2011.14115
  14. 14. Liu Y, Wang L, Liu M, Zhang X, Oztekin B, Ji S. Spherical message passing for 3d graph networks. arXiv preprint 2021. https://arxiv.org/abs/2102.05013
  15. 15. Gasteiger J, Becker F, Gunnemann S. Gemnet: Universal directional graph neural networks for molecules. Advances in Neural Information Processing Systems. 2021;34:6790–802.
  16. 16. Thölke P, De Fabritiis G. Torchmd-net: equivariant transformers for neural network based molecular potentials. arXiv preprint 2022. https://arxiv.org/abs/2202.02541
  17. 17. Satorras VG, Hoogeboom E, Welling M. E (n) equivariant graph neural networks. In: International conference on machine learning, 2021. p. 9323–32.
  18. 18. You Y, Chen T, Sui Y, Chen T, Wang Z, Shen Y. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems. 2020;33:5812–23.
  19. 19. You Y, Chen T, Shen Y, Wang Z. Graph contrastive learning automated. In: Proceedings of the International Conference on Machine Learning, 2021. p. 12121–32.
  20. 20. Sun M, Xing J, Wang H, Chen B, Zhou J. MoCL: data-driven molecular fingerprint via knowledge-aware contrastive learning from molecular graph. KDD. 2021;2021:3585–94. pmid:35571558
  21. 21. Fatras K, Sejourne T, Flamary R, Courty N. Unbalanced minibatch optimal transport; applications to domain adaptation. In: International Conference on Machine Learning. PMLR; 2021. p. 3186–97.
  22. 22. Stärk H, Beaini D, Corso G, Tossou P, Dallago C, Günnemann S. 3d infomax improves gnns for molecular property prediction. In: International Conference on Machine Learning. 2022. p. 20479–502.
  23. 23. Zang X, Zhang J, Tang B. Self-supervised molecular representation learning with topology and geometry. IEEE J Biomed Health Inform. 2025;29(1):700–10. pmid:39401116
  24. 24. Fang X, Liu L, Lei J, He D, Zhang S, Zhou J, et al. Geometry-enhanced molecular representation learning for property prediction. Nat Mach Intell. 2022;4(2):127–34.
  25. 25. Qi CR, Yi L, Su H, Guibas LJ. Pointnet: deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems. 2017;30.
  26. 26. Ye W, Li J, Cai X. Mfgnn: multi-scale feature-attentive graph neural networks for molecular property prediction. J Comput Chem. 2025;46(3):e70011. pmid:39840745
  27. 27. Jin W, Barzilay R, Jaakkola T. Hierarchical generation of molecular graphs using structural motifs. In: International conference on machine learning, 2020. p. 4839–48.
  28. 28. Hanocka R, Hertz A, Fish N, Giryes R, Fleishman S, Cohen-Or D. Meshcnn: a network with an edge. ACM Trans Graph. 2019;38(4):1–12.
  29. 29. Huang J, Zhang H, Yi L, Funkhouser T, Nießner M, Guibas LJ. Texturenet: consistent local parametrizations for learning from high-resolution signals on meshes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. p. 4440–9.
  30. 30. Ghorbani M, Baghshah MS, Rabiee HR. MGCN: semi-supervised classification in multi-layer graphs with graph convolutional networks. In: Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 2019. p. 208–11.
  31. 31. Rampasek L, Wolf G. Hierarchical graph neural nets can capture long-range interactions. In: 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP). 2021. p. 1–6. https://doi.org/10.1109/mlsp52302.2021.9596069
  32. 32. Maziarka L, Danel T, Mucha S, Rataj K, Tabor J, Jastrzkebski S. Molecule attention transformer. arXiv preprint 2020. arXiv:2002.08264
  33. 33. Han S, Fu H, Wu Y, Zhao G, Song Z, Huang F, et al. HimGNN: a novel hierarchical molecular graph representation learning framework for property prediction. Brief Bioinform. 2023;24(5):bbad305. pmid:37594313
  34. 34. Fey M, Yuen JG, Weichert F. Hierarchical inter-message passing for learning on molecular graphs. arXiv preprint 2020. https://arxiv.org/abs/2006.12179
  35. 35. Zhang Z, Liu Q, Hu Q, Lee CK. Hierarchical graph transformer with adaptive node sampling. Advances in Neural Information Processing Systems. 2022;35:21171–83.
  36. 36. Wu F, Radev D, Li SZ. Molformer: motif-based transformer on 3d heterogeneous molecular graphs. AAAI. 2023;37(4):5312–20.
  37. 37. Fabian B, Edlich T, Gaspar H, Segler M, Meyers J, Fiscato M, et al. Molecular representation learning with language models and domain-relevant auxiliary tasks.arXiv preprint 2020. https://arxiv.org/abs/2011.13230
  38. 38. Veličković P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y. Graph attention networks. arXiv preprint 2017.
  39. 39. Xiong Z, Wang D, Liu X, Zhong F, Wan X, Li X, et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J Med Chem. 2020;63(16):8749–60. pmid:31408336
  40. 40. Ju W, Liu Z, Qin Y, Feng B, Wang C, Guo Z, et al. Few-shot molecular property prediction via hierarchically structured learning on relation graphs. Neural Netw. 2023;163:122–31. pmid:37037059
  41. 41. Fuchs F, Worrall D, Fischer V, Welling M. Se (3)-transformers: 3d roto-translation equivariant attention networks. Advances in Neural Information Processing Systems. 2020;33:1970–81.
  42. 42. Rupp M, Tkatchenko A, Muller KR, Von Lilienfeld OA. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters. 2012;108(5):058301.
  43. 43. Montavon G, Rupp M, Gobre V, Vazquez-Mayagoitia A, Hansen K, Tkatchenko A, et al. Machine learning of molecular electronic properties in chemical compound space. New J Phys. 2013;15(9):095003.
  44. 44. Li Y, Hsieh C-Y, Lu R, Gong X, Wang X, Li P, et al. An adaptive graph learning method for automated molecular interactions and properties predictions. Nat Mach Intell. 2022;4(7):645–51.
  45. 45. Mobley DL, Guthrie JP. FreeSolv: a database of experimental and calculated hydration free energies, with input files. J Comput Aided Mol Des. 2014;28(7):711–20. pmid:24928188
  46. 46. Wang J, Cao D, Zhu M, Yun Y, Xiao N, Liang Y. In silicoevaluation of logD7.4and comparison with other prediction methods. Journal of Chemometrics. 2015;29(7):389–98.
  47. 47. Sakiyama H, Fukuda M, Okuno T. Prediction of Blood-Brain Barrier Penetration (BBBP) based on molecular descriptors of the free-form and in-blood-form datasets. Molecules. 2021;26(24):7428. pmid:34946509
  48. 48. Capuzzi SJ, Politi R, Isayev O, Farag S, Tropsha A. QSAR modeling of Tox21 challenge stress response and nuclear receptor signaling toxicity assays. Front Environ Sci. 2016;4.
  49. 49. Valsecchi C, Consonni V, Todeschini R, Orlandi ME, Gosetti F, Ballabio D. Parsimonious optimization of multitask neural network hyperparameters. Molecules. 2021;26(23):7254. pmid:34885837
  50. 50. Paykan Heyrati M, Ghorbanali Z, Akbari M, Pishgahi G, Zare-Mirakabad F. BioAct-Het: a heterogeneous siamese neural network for bioactivity prediction using novel bioactivity representation. ACS Omega. 2023;8(47):44757–72. pmid:38046344
  51. 51. Riesen K, Bunke H. IAM graph database repository for graph based pattern recognition, machine learning. In: Structural, Syntactic,, Statistical Pattern Recognition: Joint IAPR International Workshop and SSPR & SPR 2008, Orlando, USA, December 4-6, 2008. Proceedings. 2008. p. 287–97.
  52. 52. Su M, Yang Q, Du Y, Feng G, Liu Z, Li Y, et al. Comparative assessment of scoring functions: the CASF-2016 update. J Chem Inf Model. 2019;59(2):895–913. pmid:30481020