Abstract
Open-world object detection (OWOD) has become a crucial paradigm for advancing intelligent perception systems, as it requires not only accurate recognition of known categories but also autonomous discovery and continuous learning of emerging unknown categories in dynamic environments. However, existing methods often suffer from shallow cross-modal interaction and rigid reasoning mechanisms, making them unable to cope with the continuous emergence of new categories and the dynamic changes in modality reliability in open-world environments. To address this, we propose Hyper-MDR, an open-world multimodal reasoning framework based on dynamic hypergraphs and meta-strategy optimization. First, to achieve hypergraph enhancement, image regions, text, and semantic prototypes are treated as nodes, while a gating network dynamically generates hyperedges under semantic and spatial constraints, thereby modeling the high-order dependencies of vision–language–semantic triplets and enabling deep multimodal fusion at the topological level. Second, a hierarchical hypergraph convolutional network is designed to facilitate knowledge propagation between known and unknown categories. Finally, a meta-policy gradient-based adaptive controller is proposed, which dynamically adjusts feature fusion weights, propagation depth, and attention topology based on the detection state and historical trajectories. Experimental results on the OWOD dataset show that the proposed method achieves an accuracy of 76.8%, providing a new paradigm for open-world multimodal perception that integrates semantic depth and adaptability.
Citation: Shi J, Huang X, Yuan L (2026) Hyper-MDR: An open-world multimodal reasoning framework based on dynamic hypergraph and meta-strategy optimization. PLoS One 21(2): e0342169. https://doi.org/10.1371/journal.pone.0342169
Editor: Francesco Bardozzo, Università degli Studi di Salerno, ITALY
Received: September 14, 2025; Accepted: January 19, 2026; Published: February 17, 2026
Copyright: © 2026 Shi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data are available from https://github.com/jianshi1224.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
As artificial intelligence systems evolve from closed environments to open, dynamic scenarios, object detection tasks are facing a paradigm shift from recognizing the known to perceiving the unknown [1]. Open-world object detection requires models to not only identify predefined categories but also discover and gradually learn from emerging unknown instances during continuous operation. This places higher demands on the depth of semantic understanding and the adaptability of perception systems [2,3]. The fusion of multimodal data offers a new path to alleviate annotation scarcity and enhance semantic generalization. However, achieving deep, structured semantic interaction between modalities remains a key bottleneck hindering performance improvement [4].
Current mainstream open-world object detection methods rely on unimodal features or shallow attention mechanisms for cross-modal fusion, making it difficult to capture high-order joint dependencies between visual regions, text descriptions, and abstract semantic concepts [5,6]. Such methods typically treat multimodal information as independent feature streams, ignoring their potential group associations during the fusion process, resulting in fragmented semantic representations [7]. Such fragile fusion significantly reduces robustness in the face of modal noise. Visual-modality noise primarily stems from feature distortion caused by low light, occlusion, and motion blur, while language-modality noise stems from text descriptions that are ambiguous, missing, or only weakly correlated with image content. Existing solutions lack the ability to model these noise sources, making it difficult to reach robust decisions when modality reliability changes dynamically [8]. Furthermore, existing inference architectures generally employ a fixed number of layers and static weights, lacking the ability to dynamically adjust fusion strategies and inference depth based on input content [9–11]. This makes it difficult for them to adapt to the complex demands of continuously evolving category distributions and dynamically changing modality reliability in open environments. In recent years, some studies have attempted to introduce Hyper-GNNs to model multimodal high-order associations and have initially explored meta-learning mechanisms for policy initialization, which have, to a certain extent, improved the integrity of cross-modal representations and cross-task transfer capabilities [12,13].
However, existing approaches still have significant shortcomings. Hypergraph structures are mostly statically constructed: their hyperedge generation relies on predefined rules or fixed similarity thresholds, making it impossible to adjust the topology in real time based on input semantics. Implementing dynamic hypergraphs raises two core difficulties: first, how to design a learnable hyperedge generation mechanism that can capture the complex co-occurrence patterns of visual–linguistic–semantic triples; second, how to achieve sparse and adaptive evolution of the hypergraph structure while preserving computational efficiency [14,15]. In addition, the message passing process of Hyper-GNNs is fixed, so the inference depth cannot be adjusted according to semantic uncertainty. Meta-policy optimization often focuses on task-level adaptation, fails to couple with the model's internal reasoning process, and lacks fine-grained control over feature fusion paths and topological structures [16]. These limitations mean that models still exhibit reasoning rigidity and lagging adaptation when faced with complex open-world scenarios such as the emergence of unknown categories and modality mismatch [17].
To address the above problems, this paper proposes a dynamic collaborative framework that deeply integrates Hyper-GNNs and meta-policy optimization. It realizes the dynamic construction of cross-modal semantic clusters through a learnable heterogeneous hypergraph, and designs a hierarchical message passing mechanism to enhance topological reasoning over unseen categories. Furthermore, a meta-policy controller coupled with the reasoning process is constructed. The controller takes the current detection state as input, and the number of GNN propagation layers it outputs directly determines the number of rounds of Hyper-GNN message passing, thereby regulating the reasoning depth in real time. Feature fusion weights dynamically weight multimodal node features before message passing, directly influencing the source of information aggregation. This design prevents the meta-policy from acting as an isolated parameter generator; instead, it forms a closed loop of observation, decision-making, and execution within the Hyper-GNN message passing process. This approach not only overcomes the static nature of existing hypergraph modeling and the coarse granularity of existing meta-policy applications, but also significantly improves unknown-category discovery and semantic consistency in open-world continuous learning scenarios, providing a more autonomous and adaptable solution for multimodal dynamic perception.
The main contributions are summarized as follows:
- A learnable heterogeneous hypergraph is designed to dynamically capture cross-modal co-occurrence patterns without relying on manually predefined hyperedges.
- A hierarchical propagation strategy is developed to enhance high-order reasoning and improve unknown-category discovery.
- A meta-policy controller is introduced to adapt fusion weights and inference depth based on detection states.
- By tightly integrating Hyper-GNN reasoning with meta-policy optimization, our framework provides a more autonomous and flexible solution, effectively discovering unknown categories, maintaining semantic consistency, and adapting to continuously evolving multimodal environments.
2. Related work
2.1. Open-world object detection
Open-World Object Detection (OWOD) is an emerging paradigm that requires models to detect unknown instances alongside recognized categories and to progressively learn these new categories in subsequent stages [18]. Joseph et al. [19] proposed the pioneering framework ORE, which builds upon Faster R-CNN by introducing feature-comparison clustering and an energy-based discrimination mechanism. This approach identifies and labels high-confidence, unmatched candidate regions as "unknown," effectively mitigating confusion between known and unknown categories while achieving leading performance in incremental detection tasks. Subsequently, Wu et al. [20] introduced UC-OWOD, which further supports distinguishing multiple unknown categories rather than lumping them into a single "unknown" class, thereby revealing the semantic generalization limitations of earlier OWOD approaches. Recent OWOD methods such as PROB [21] and Decoupled PROB [22] refine unknown-object detection by estimating class probability bounds and decoupling classification from localization, respectively. These approaches provide strong performance on the COCO-based OWODB benchmark and establish widely adopted metrics for unknown recall and unknown mAP, making them essential baselines for evaluating open-world systems.
In recent years, researchers have attempted to leverage semantic knowledge from vision-language models to address this shortcoming. Vision-Language Distillation (ViLD) [23] achieves this by distilling semantic representations from pre-trained image-text models like CLIP into detectors, aligning visual regions with textual descriptions to enable recognition of unseen categories [24]. Concurrently, Zhou et al. [25] expanded detector label spaces using image-level supervision or large-scale word embeddings, eliminating the need for per-class annotated bounding boxes. Despite progress in zero-shot detection and semantic transfer, existing OWOD systems struggle to consistently assign accurate labels to unseen instances in open environments and cope with continuously expanding category spaces.
2.2. Multimodal fusion in vision-language models
Modern detectors increasingly integrate textual information (e.g., category names, attributes, or descriptions) to enable open-world and open-vocabulary detection. CLIP-based models use a dual encoder to align image and text features in a joint space, where the visual embeddings are trained to match the textual embeddings of the category labels. This represents a relatively shallow cross-modal interaction, as each modality is processed independently and only combined during the prediction phase (e.g., by comparing region and text feature vectors) [26]. In contrast, recent architectures pursue deeper fusion through cross-modal interaction. For example, Grounded Language-Image Pre-training (GLIP) [27] unifies object detection and phrase localization by introducing cross-attention layers throughout the detector backbone, achieving language-aware region representations and improving fine-grained localization. Similarly, Kamath et al. [28] proposed the MDETR model, which jointly processes image and text tokens, allowing iterative attention modules to reason about visual content conditioned on language. Modern grounded models such as CLIP [29], GLIP [30], and MDETR [31] excel at aligning visual regions with textual descriptions, yet they require full cross-modal training and fine-tuning of text encoders. These constraints limit their suitability for incremental open-world settings, where new categories emerge and previous data cannot be revisited. Hyper-MDR differs in that it treats the pretrained text encoder as fixed and performs multimodal reasoning entirely through the hypergraph and meta-policy modules. This design enables seamless integration into CLIP, GLIP, and MDETR without retraining their language components. The experimental results further show that Hyper-MDR improves unknown-category detection while keeping known-class performance stable.
Despite these advances, existing vision-language fusion techniques face significant limitations. Many methods still rely on shallow alignment and lack richer interactions between vocabulary and visual context. Deep fusion models address this issue but also introduce new sensitivities: detectors can become brittle when one modality is noisy or misaligned, e.g., irrelevant text descriptions can lead to erroneous attention mechanisms [32]. Recent work has highlighted the challenges posed by modal noise and misalignment, which hinder robust integration [33]. Furthermore, deeply fused Transformers are computationally expensive and require careful balancing (e.g., through gating or modulation) to prevent one modality from dominating. Ongoing research is exploring dynamic fusion strategies, such as adaptive cross-modal attention mechanisms, to downweight unreliable signals and enhance robustness. Achieving strong multimodal collaboration without sacrificing reliability remains a significant challenge, which is particularly important in open-world detection, where user-provided text can often be noisy or incomplete. Beyond open-world detection, recent biomedical imaging studies have highlighted the importance of cross-modal explainability for enhancing reliability and mitigating hallucinations. Bardozzo et al. [34] propose a cross-explainable GAN that enforces structural consistency between real and imaginary channels in Fourier ptychographic microscopy and validates perceptual fidelity with clinicians. Although developed for a different domain, this work shares our goal of stabilizing multimodal reasoning by ensuring cross-channel coherence. This perspective is complementary to Hyper-MDR, where the meta-policy controller regulates reasoning depth and multimodal fusion in response to uncertainty.
2.3. Hypergraph neural networks
Beyond recognizing individual objects, open-world multimodal understanding often requires reasoning over multiple entities, across modalities, or over time. Hypergraph neural networks (HGNNs) have emerged as powerful tools for modeling such high-order relationships. Unlike traditional graphs, hyperedges can connect more than two nodes, enabling the joint modeling of group interactions [35,36]. This property is especially valuable for inherently multi-entity or multimodal tasks, such as collective interactions in multi-agent systems or linking image regions with multiple textual attributes. For example, recent studies in autonomous driving leverage hypergraphs to represent group-wise vehicle interactions, showing that hyperedges more naturally encode simultaneous influences than pairwise connections. Similarly, in multimodal scenarios, hypergraphs can connect visual nodes, textual descriptions, and contextual signals, thereby facilitating joint cross-modal reasoning. Early attempts demonstrate that incorporating hypergraph structures enhances relational reasoning in complex environments, with dynamic hypergraph networks in multi-agent reinforcement learning adaptively clustering agents into hyperedges to capture emergent cooperation or competition.
Despite their promise, applying Hyper-GNNs in open-world multimodal contexts introduces several challenges. Most existing approaches assume a static hypergraph topology, relying on predefined hyperedges and fixed message passing schemes that cannot adapt to the emergence of new entities or evolving relations [37]. Although recent work has explored end-to-end hypergraph structure learning to prune spurious edges or add implicit connections, these methods still operate under relatively stable node sets and remain computationally expensive for online updates [38]. In addition, traditional Hyper-GNNs often use uniform or static propagation weights, which limits their flexibility in heterogeneous multimodal environments. Deeper hypergraph models also suffer from over-smoothing beyond one or two layers, while dynamic hyperedge construction raises efficiency concerns for large-scale or real-time settings.
Hypergraph reasoning has also been explored in object detection, for example in Hyper-YOLO, which models higher-order visual correlations but remains restricted to unimodal imagery and relies on fixed or heuristic hyperedge definitions. Recent unified open-vocabulary and open-world frameworks, such as OW-OVD [39] and OW-VAP [40], have begun to combine unknown-category discovery with language grounding, yet they typically require fine-tuning or adaptation of the text encoder during training. These factors limit their applicability in incremental open-world settings where unseen categories continually emerge. In contrast, the proposed Hyper-MDR framework introduces heterogeneous hypergraph reasoning with dynamic hyperedge selection and operates as a strict plug-in module that uses frozen text encoders. All multimodal interactions occur within the learned hypergraph and the meta-policy controller without modifying the underlying vision-language model. This design addresses the limitations of prior Hyper-GNNs by enabling adaptive topology, efficient reasoning, and compatibility with existing detectors, thereby providing a more suitable foundation for open-world multimodal detection.
3. Method
This section provides an overview of the proposed Hyper-MDR framework, followed by detailed descriptions of its core components including multimodal node construction, heterogeneous hypergraph reasoning, meta-policy controlled adaptive inference, and unknown-category energy scoring.
3.1. Construction of a cross-modal heterogeneous hypergraph
In open-world multimodal perception tasks, traditional graph structures are limited in their ability to capture complex interactions across visual, linguistic, and semantic modalities. To address this issue, we construct a learnable cross-modal heterogeneous hypergraph (CM-HHG) to explicitly model high-level semantic associations. Formally, the hypergraph is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. The node set $\mathcal{V} = \mathcal{V}_v \cup \mathcal{V}_l \cup \mathcal{V}_s$ consists of three heterogeneous subsets: visual nodes $\mathcal{V}_v$, derived from the region-proposal box features of the Faster Region-based Convolutional Neural Network (Faster R-CNN); language nodes $\mathcal{V}_l$, composed of word vectors generated by the Contrastive Language–Image Pre-training (CLIP) [34] text encoder; and semantic prototype nodes $\mathcal{V}_s$, representing predefined or clustered category semantic centers. Together, these three subsets constitute a heterogeneous node set in the joint semantic space.
The construction of the hyperedge set $\mathcal{E}$ aims to capture the co-occurrence patterns of multi-body semantic clusters. For any hyperedge $e \in \mathcal{E}$, the set of nodes $\mathcal{V}_e \subseteq \mathcal{V}$ connected to it satisfies the dual constraints of semantic consistency and spatial alignment. Define the semantic similarity metric between nodes $i$ and $j$ as

$$s_{\mathrm{sem}}(i, j) = \cos(x_i, x_j),$$

and introduce the spatial alignment score

$$s_{\mathrm{spa}}(i) = \mathrm{IoU}(b_i, b_{\mathrm{anc}}),$$

where $b_{\mathrm{anc}}$ is the text-space anchor generated based on saliency analysis. Finally, the probability of hyperedge generation is determined by a learnable gating network:

$$p_e = \sigma\!\left(\mathrm{MLP}\big([\bar{x}_e \,\|\, m_e]\big)\right),$$

where $m_e$ encodes the node modal combination type and $\sigma(\cdot)$ is the sigmoid function. Only when $p_e > \tau$ is the hyperedge activated and participates in subsequent reasoning. The node and hyperedge attribute definitions are shown in Table 1.
To enhance the dynamic adaptability of hypergraphs, we design a sparse attention-based strategy. Specifically, the attribution degree of a node $i$ to a hyperedge $h$ is calculated as

$$\alpha_{i,h} = \mathrm{softmax}_h\!\left(\frac{q_i^{\top} k_h}{\sqrt{d}}\right),$$

where $q_i$ denotes the query vector of node $i$, $k_h$ is the key vector of hyperedge $h$, and $d$ is the feature dimension used for scaling. This formulation follows the scaled dot-product attention mechanism, ensuring that the attribution weights are normalized across candidate hyperedges. To control computational complexity, only the top-$K$ hyperedges associated with each node are retained.
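As a concrete illustration, the attribution-and-sparsification step above can be sketched in NumPy as follows; the function and variable names are ours, not from the released code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_hyperedge_attribution(node_q, edge_k, k_keep):
    """Attribute nodes to hyperedges with scaled dot-product attention,
    then keep only the top-K hyperedges per node to sparsify the topology.

    node_q: (N, d) query vectors, one per node
    edge_k: (E, d) key vectors, one per candidate hyperedge
    k_keep: number of hyperedges retained per node
    """
    d = node_q.shape[-1]
    scores = node_q @ edge_k.T / np.sqrt(d)        # (N, E) attention logits
    attrib = softmax(scores, axis=-1)              # normalize over candidate edges
    # zero out all but the k_keep strongest attributions per node
    drop = np.argsort(attrib, axis=-1)[:, :-k_keep]
    sparse = attrib.copy()
    np.put_along_axis(sparse, drop, 0.0, axis=-1)
    return sparse
```

Each row of the returned matrix keeps at most $K$ nonzero attributions, which is what bounds the incidence matrix to $O(|\mathcal{V}|K)$ entries in the complexity analysis of Section 3.7.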
Furthermore, to address scenarios with missing modalities, we introduce a virtual node completion mechanism. When modality-specific information is unavailable, the node embedding is replaced by a weighted combination of its original feature $x_i$ and its semantic category prototype $p_{c(i)}$:

$$\tilde{x}_i = \lambda\, x_i + (1 - \lambda)\, p_{c(i)},$$

where $\lambda$ is a learnable parameter that balances the contribution of the observed feature and the prototype representation. This mechanism allows the hypergraph to remain functional under incomplete modality conditions, thereby improving its robustness in open-world settings.
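A minimal sketch of this completion rule, with `lam` standing in for the learnable balance parameter:

```python
import numpy as np

def complete_node(x_obs, prototype, lam, missing):
    """Virtual-node completion sketch: when a modality is missing, blend
    the observed feature with its semantic category prototype.
    `lam` is a placeholder for the learnable balance parameter."""
    if not missing:
        return np.asarray(x_obs)
    return lam * np.asarray(x_obs) + (1.0 - lam) * np.asarray(prototype)
```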
3.2. Hierarchical hypergraph convolutional network
To effectively propagate high-level semantics across modalities, we design a Hierarchical Hypergraph Convolutional Network (Hyper-GNN) with a bidirectional message passing mechanism. This network performs two stages of information aggregation at each layer: (i) from nodes to hyperedges, and (ii) from hyperedges back to nodes.
In the first stage, each hyperedge $e$ receives information from its connected nodes $\mathcal{V}_e$, which are fused using a gated aggregation function:

$$z_e = \mathrm{AGG}\big(\{x_i : i \in \mathcal{V}_e\}\big),$$

where $\mathrm{AGG}$ takes the form of a weighted sum:

$$z_e = \sum_{i \in \mathcal{V}_e} g_i \odot x_i.$$

Here, the gating vector $g_i$ is generated by learnable parameters, distinguishing the contributions of different nodes.

In the second stage, each node $i$ receives updates from all hyperedges $\mathcal{E}_i$ it belongs to and updates its representation:

$$x_i^{(l+1)} = u_i \odot x_i^{(l)} + (1 - u_i) \odot \tanh(m_i), \qquad u_i = \sigma\big(W_u\, [x_i^{(l)} \oplus m_i]\big),$$

where $m_i$ aggregates the messages of the hyperedges in $\mathcal{E}_i$, $\sigma$ is the sigmoid function, $\oplus$ denotes concatenation, and $\odot$ element-wise multiplication. This gated recurrent-style update retains historical states, controls the influx of new information, and mitigates oversmoothing.
To enhance model expressiveness, we introduce a modality-aware weight matrix that dynamically adjusts parameters according to the node type:

$$W(i) = \sum_{m \in \{v, l, s\}} \mathbb{1}[\tau(i) = m]\, W_m,$$

where $\mathbb{1}[\cdot]$ is an indicator function selecting the corresponding modality and $\tau(i)$ denotes the modality type of node $i$.
Finally, to alleviate gradient vanishing in deep propagation, inter-layer skip connections are employed. The final output representation is obtained as

$$x_i^{\mathrm{out}} = \sum_{l=0}^{L} \beta_l\, x_i^{(l)},$$

where the $\beta_l$ are learnable weights normalized via softmax. The final embedding $x_i^{\mathrm{out}}$ serves as the input feature for multimodal fusion and target detection. The Hyper-GNN message passing process is shown in Fig 1.
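The two-stage update can be sketched as follows. The degree normalization, the gate parameterization, and the parameter shapes (`W_gate` of shape (d, d), `W_upd` of shape (2d, d)) are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hypergraph_layer(X, H, W_gate, W_upd):
    """One round of bidirectional hypergraph message passing (sketch).

    X: (N, d) node features; H: (N, E) binary incidence matrix
    (H[i, e] = 1 when node i belongs to hyperedge e).
    Stage 1: nodes -> hyperedges via a gated weighted sum.
    Stage 2: hyperedges -> nodes via a gated recurrent-style update
    that retains part of the previous state to mitigate over-smoothing.
    """
    gate = sigmoid(X @ W_gate)                           # per-node gating vectors
    deg_e = H.sum(axis=0)[:, None] + 1e-8                # hyperedge degrees (E, 1)
    Z = (H.T @ (gate * X)) / deg_e                       # (E, d) hyperedge messages
    deg_v = H.sum(axis=1, keepdims=True) + 1e-8          # node degrees (N, 1)
    M = (H @ Z) / deg_v                                  # (N, d) messages to nodes
    u = sigmoid(np.concatenate([X, M], axis=1) @ W_upd)  # update gate from [x, m]
    return u * X + (1.0 - u) * np.tanh(M)                # gated blend of old and new
```

Stacking $L$ such layers and softmax-weighting their outputs reproduces the skip-connected readout described above.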
3.3. Multimodal Feature Alignment and Initialization
To ensure that nodes from different modalities effectively interact in a unified semantic space, we adopt cross-modal contrastive learning (Cross-Modal CL) prior to hypergraph construction. Given a batch of image–text pairs $\{(I_b, T_b)\}_{b=1}^{B}$, we extract global embeddings $v_b$ and $t_b$. The cross-modal similarity matrix is then defined as

$$S_{bj} = \frac{\cos(v_b, t_j)}{\tau},$$

where $\tau$ is the temperature coefficient. Optimization is performed using a symmetric contrastive loss:

$$\mathcal{L}_{\mathrm{CL}} = -\frac{1}{2B} \sum_{b=1}^{B} \left[ \log \frac{\exp(S_{bb})}{\sum_{j} \exp(S_{bj})} + \log \frac{\exp(S_{bb})}{\sum_{j} \exp(S_{jb})} \right],$$

which encourages positive image–text pairs to be closer in the embedding space while pushing negative pairs apart.
After alignment, the initial features of hypergraph nodes are constructed as follows. For each visual node, we apply Region of Interest (RoI)-Align to extract regional features from the global representation, followed by a projection layer:

$$x_i^{v} = W_v\, \mathrm{RoIAlign}(F, b_i).$$

For each language node, we directly take the CLIP embedding of the corresponding word after tokenization:

$$x_j^{l} = \mathrm{CLIP}_{\mathrm{text}}(w_j).$$

For each semantic prototype node, we compute the average embedding of the visual and textual features associated with category $c$:

$$p_c = \frac{1}{|\mathcal{S}_c|} \sum_{x \in \mathcal{S}_c} x.$$

Finally, all node features are stabilized through Layer Normalization, ensuring scale invariance across modalities:

$$x \leftarrow \mathrm{LayerNorm}(x).$$
This initialization strategy ensures that all three types of nodes (visual, language, and semantic prototypes) are embedded in a shared, semantically aligned space, providing a solid foundation for subsequent hypergraph construction and reasoning. The image information parsing process is shown in Fig 2.
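The symmetric contrastive objective can be sketched as a standard InfoNCE-style loss; the normalization and batching details here are assumptions for illustration:

```python
import numpy as np

def symmetric_contrastive_loss(v, t, tau=0.07):
    """Symmetric InfoNCE-style contrastive loss over a batch of global
    image embeddings v and text embeddings t, both of shape (B, d);
    tau is the temperature. Matching pairs sit on the diagonal."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    S = v @ t.T / tau                                  # (B, B) similarity matrix

    def ce_diag(logits):
        # cross-entropy with the matching pair (diagonal entry) as target
        m = logits.max(axis=1, keepdims=True)          # stabilize the log-sum-exp
        log_z = np.log(np.exp(logits - m).sum(axis=1)) + m[:, 0]
        return (log_z - np.diag(logits)).mean()

    return 0.5 * (ce_diag(S) + ce_diag(S.T))           # image->text + text->image
```

Minimizing this loss pulls each image embedding toward its paired text embedding and pushes it away from the other texts in the batch, and symmetrically for text.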
3.4. Meta-policy controller architecture
To enable dynamic adaptation during inference, we design a meta-policy controller based on Long Short-Term Memory (LSTM). Its objective is to generate the optimal policy action $a_t$ conditioned on the current observation state $s_t$. The state is defined as

$$s_t = [\bar{c}_t,\; H_t,\; u_t,\; a_{t-1}],$$

where $\bar{c}_t$ denotes the mean confidence score of current detections, $H_t$ is the classification entropy, and $u_t$ represents the response strength of unknown-class predictions. The term $a_{t-1}$ is the action taken at the previous time step. Together, these components capture both the model's current uncertainty and the dynamics of the open-world environment. The hidden state is updated via

$$(h_t, c_t) = \mathrm{LSTM}(s_t, h_{t-1}, c_{t-1}),$$

where $h_t$ and $c_t$ are the hidden and cell states, respectively.

The output layer maps $h_t$ to a discrete action space. Each action $a_t$ consists of four dimensions: the fusion weights $\alpha$ and $\beta$, the number of GNN propagation layers $L$, and the number of attention heads $H$. Each dimension is produced by a dedicated fully connected head:

$$\pi(\cdot \mid s_t) = \mathrm{softmax}(W_{\pi}\, h_t + b_{\pi}).$$

The final action is obtained either by sampling or by selecting the maximum-probability outcome.
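A minimal sketch of the controller, assuming a single LSTM cell and one discrete head per action dimension; all shapes, the `n_bins` discretization, and the parameter initialization are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MetaPolicyController:
    """Sketch of the LSTM-based meta-policy controller: the state vector
    [mean confidence, entropy, unknown response, previous action] passes
    through an LSTM cell, and four linear heads emit distributions over
    discrete choices for alpha, beta, depth L, and attention heads H."""

    def __init__(self, state_dim, hidden=16, n_bins=4):
        cat = state_dim + hidden
        # one stacked weight matrix for the i, f, o, g gates
        self.W = rng.normal(scale=0.1, size=(cat, 4 * hidden))
        self.heads = [rng.normal(scale=0.1, size=(hidden, n_bins))
                      for _ in range(4)]  # alpha, beta, L, H
        self.h = np.zeros(hidden)
        self.c = np.zeros(hidden)

    def step(self, state, greedy=True):
        sig = lambda x: 1.0 / (1.0 + np.exp(-x))
        z = np.concatenate([state, self.h]) @ self.W
        i, f, o, g = np.split(z, 4)
        self.c = sig(f) * self.c + sig(i) * np.tanh(g)   # cell-state update
        self.h = sig(o) * np.tanh(self.c)                # hidden-state update
        probs = [softmax(self.h @ Wh) for Wh in self.heads]
        if greedy:
            return [int(p.argmax()) for p in probs]      # max-probability action
        return [int(rng.choice(len(p), p=p)) for p in probs]  # sampled action
```

In practice each discrete bin would be mapped to a concrete value (e.g., a fusion weight or a layer count); the mapping is omitted here.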
The controller is optimized using meta-policy gradients. For each task batch, the policy generates trajectories $\tau = (s_1, a_1, \ldots, s_T, a_T)$. The corresponding task reward $R(\tau)$ is computed, and the policy parameters $\theta$ are updated according to

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau}\!\left[\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\,\big(R(\tau) - b(s_t)\big)\right],$$

where $b(s_t)$ is the baseline function used to reduce variance.
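For a single softmax head, the per-step REINFORCE-with-baseline gradient with respect to the logits has the closed form $(\text{onehot}(a) - p)$ scaled by the advantage, which a short sketch makes explicit (names are ours, for illustration only):

```python
import numpy as np

def reinforce_grad(logits, action, reward, baseline):
    """Gradient of the REINFORCE loss -(R - b) * log pi(a|s) with respect
    to the logits of one softmax policy head.
    Uses d log pi(a)/d logits = onehot(a) - p."""
    p = np.exp(logits - logits.max())
    p = p / p.sum()                    # softmax policy probabilities
    grad_logp = -p.copy()
    grad_logp[action] += 1.0           # onehot(action) - p
    return -(reward - baseline) * grad_logp
```

Note that when the advantage $R - b$ is zero the gradient vanishes, which is exactly why a good baseline reduces variance without biasing the update.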
To stabilize training and explicitly guide the controller toward open-world objectives, we design a composite reward function that balances accuracy, discovery ability, efficiency, and stability. Specifically, the reward is defined as

$$R = \lambda_1 R_{\mathrm{acc}} + \lambda_2 R_{\mathrm{disc}} + \lambda_3 R_{\mathrm{eff}} + \lambda_4 R_{\mathrm{stab}},$$

where $R_{\mathrm{acc}}$ encourages high detection accuracy on known categories by rewarding higher mAP and penalizing classification errors, $R_{\mathrm{disc}}$ improves novel-class discovery by rewarding high unknown recall and unknown mAP, $R_{\mathrm{eff}}$ regularizes computational overhead by penalizing excessive GFLOPs, and $R_{\mathrm{stab}}$ prevents unstable policy oscillations by enforcing smooth transitions between consecutive actions. This multi-term reward design ensures that the meta-policy controller not only adapts reasoning depth and fusion strategies to input uncertainty, but also maintains semantic consistency, computational efficiency, and robust performance in dynamic open-world environments.
3.5. Dynamic inference path generation mechanism
Based on the action output by the meta-controller, the system dynamically generates an inference path. The fusion weights $\alpha$ and $\beta$ combine the multimodal features via a weighted sum:

$$f = \alpha\, f_v + \beta\, \phi(f_l),$$

where $\phi$ is the alignment mapping function. These weights are dynamically adjusted by the controller based on the modality reliability of the current input; for example, when the image is blurred, the weight of the language modality is automatically increased. The number of GNN propagation layers $L$ directly determines the number of Hyper-GNN iterations: the model performs $L$ rounds of layered convolution. A larger $L$ results in a wider spread of features across the hypergraph and more complete semantic aggregation, but also increases computational overhead. This mechanism enables adaptive computation with "shallow reasoning in simple scenarios and deep reasoning in complex scenarios." The number of attention heads $H$ controls the parallelism of the multi-head attention module and affects the breadth of feature interactions. Ultimately, the reasoning path is defined by both $L$ and $H$, forming a dynamic computational graph. To ensure the stability of policy switching, an action smoothing loss is introduced:

$$\mathcal{L}_{\mathrm{smooth}} = \big\| a_t - a_{t-1} \big\|_2^2.$$
This smoothing loss prevents the policy from oscillating wildly between adjacent samples. The dynamic inference path selection is shown in Fig 3.
Circle nodes represent input modalities, whose size is adjusted according to the modal weight; larger nodes have higher weights. Square nodes represent feature representations in the GNN layer. Black dashed lines represent possible feature transfer paths, which are dynamically generated based on the modal weights. The red solid line represents the main inference path selected by the controller.
3.6. Open-world unknown class detection and pseudo-label generation
A fundamental challenge of open-world object detection is that the model must autonomously discover and continuously learn unknown instances in the environment without relying on predefined class labels. Traditional closed-set detectors typically adopt a fixed confidence threshold to filter background, but this strategy struggles to distinguish unknown foreground objects from low-confidence known classes, often leading to a large number of unknown objects being misclassified as background or noise. To address this issue, we propose an unknown class detection and pseudo-label generation mechanism that integrates uncertainty quantification with hypergraph-based semantic propagation, aiming to achieve high-precision unknown discovery and semantically consistent incremental learning.
During inference, for each detected instance $x$, the unknown-class score is defined as

$$U(x) = 1 - \max_{c} p(c \mid x),$$

where a lower maximum class probability indicates a higher likelihood of being unknown. Combined with the energy function $E(x)$, dual discrimination is performed: when $U(x) > \tau_u$ and $E(x) > \tau_e$ both hold, the instance is considered unknown.
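The dual-criterion test can be sketched as follows; the threshold values `tau_u` and `tau_e` are illustrative placeholders, not the tuned values from the paper:

```python
import numpy as np

def unknown_flags(probs, tau_u=0.75, tau_e=0.3):
    """Dual-criterion unknown detection (sketch).

    probs: (N, C) post-softmax class probabilities per detected instance.
    An instance is flagged unknown when its unknown score
    U = 1 - max_c p(c|x) exceeds tau_u AND its probability-based energy
    E = -log max_c p(c|x) exceeds tau_e. Thresholds are placeholders."""
    p_max = probs.max(axis=1)
    U = 1.0 - p_max                    # unknown-class score
    E = -np.log(p_max + 1e-12)         # post-softmax energy (see Section 3.6)
    return (U > tau_u) & (E > tau_e)
```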
To further transform unknown clusters into learnable novel categories, we design a pseudo-label generation algorithm via hypergraph propagation (PL-HGP). The key idea is to leverage the constructed cross-modal heterogeneous hypergraph to backpropagate semantic information from cluster centers to all nodes within the cluster, thereby producing semantically consistent pseudo-labels. Specifically, spectral clustering is first applied to all unknown instance features, generating $K$ clusters, with each cluster center serving as a new semantic prototype:

$$p_k = \frac{1}{|\mathcal{C}_k|} \sum_{x \in \mathcal{C}_k} x,$$

where $\mathcal{C}_k$ is the set of samples belonging to the $k$-th cluster. The new prototype is connected to existing nodes in the hypergraph, and pseudo-labels are generated through Hyper-GNN backpropagation:

$$\hat{y}_i = \arg\max_{k}\, \mathrm{sim}(x_i, p_k),$$

where $\mathrm{sim}(\cdot,\cdot)$ measures semantic similarity between the instance feature $x_i$ and the candidate prototypes $p_k$.
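A compact sketch of the prototype-and-assignment step, with precomputed cluster assignments standing in for the spectral-clustering output and cosine similarity assumed as the `sim` function:

```python
import numpy as np

def pseudo_labels(feats, assign, n_clusters):
    """PL-HGP sketch: compute a prototype per cluster of unknown
    instances, then assign each instance the label of its most similar
    prototype (cosine similarity). `assign` stands in for the
    spectral-clustering output; hypergraph propagation is omitted."""
    protos = np.stack([feats[assign == k].mean(axis=0)
                       for k in range(n_clusters)])          # cluster centers
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sim = f @ p.T                                            # (N, K) cosine similarity
    return sim.argmax(axis=1)                                # pseudo-label per instance
```

In the full method, the prototype nodes would additionally be wired into the hypergraph so that shared hyperedges keep the pseudo-labels semantically coherent with the known categories.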
The advantage of this mechanism is that it does not simply use cluster IDs as labels. Instead, it leverages the hypergraph structure to ensure that the generated pseudo-labels are semantically coherent. For example, members of a cluster designated "new vehicle" would be linked to known categories like "car" and "truck" in the hypergraph through shared semantic hyperedges like "transportation" and "four-wheeled," thus ensuring semantic consistency for the new category. Classical energy-based open-set and OOD detection typically defines the energy of a sample $x$ through a logit-level log-sum-exp operator:

$$E(x) = -\log \sum_{c} \exp\big(f_c(x)\big).$$

Open-world object detection, however, differs from standard OOD settings in two important aspects:
- (1). The OWOD detectors operate on region-level classification heads where logits are normalized across multiple object queries and later fused with geometric and proposal features. Under such conditions, raw logits often exhibit large variance across spatial proposals, making classical logit-based energy unstable when applied directly to region-wise outputs.
- (2). The hypergraph propagation in Hyper-MDR further aggregates multimodal cues prior to classification, and the softmax probabilities after propagation provide a calibrated representation of semantic ambiguity. For these reasons, we define the unknown energy using post-softmax probabilities:

$$E(x) = -\log \max_{c} p(c \mid x),$$

where a lower maximum class confidence directly reflects higher semantic uncertainty. This probability-based formulation empirically yields more stable unknown-detection signals within OWOD pipelines, particularly when combined with dynamic hypergraph reasoning.
3.7. Complexity and convergence analysis
The proposed Hyper-MDR framework maintains favorable scalability due to its bounded-degree and bounded-depth design. In hypergraph construction, each node connects to at most $K$ hyperedges, resulting in an incidence matrix with $O(|\mathcal{V}| K)$ nonzero entries. Consequently, both construction and Hyper-GNN message passing scale linearly with the number of nodes, with a per-layer cost of $O(|\mathcal{V}| K d)$. Since the meta-controller restricts the propagation depth to $L \le L_{\max}$, the total inference cost remains bounded and practical for large-scale open-world detection.
For spectral clustering in pseudo-label generation, we adopt sparse eigen solvers (Lanczos/IRAM), yielding near-linear complexity in the number of unknown samples. The subsequent clustering and hypergraph propagation steps are also linear under fixed parameters. This ensures that pseudo-label generation remains computationally tractable even as new categories emerge dynamically. Regarding convergence, the recurrent-style gated Hyper-GNN update is stabilized by residual connections and LayerNorm, and finite-step convergence is guaranteed by the controller-imposed depth cap. Spectral clustering converges under standard eigenvalue tolerance, and the controller further prevents oscillatory behavior by smoothing consecutive actions. Together, these mechanisms ensure that Hyper-MDR achieves both theoretical scalability and stable convergence in dynamic multimodal environments.
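The bounded-degree argument can be checked numerically. The sketch below (with illustrative values of N, E, and k, which are not taken from the paper) builds a random incidence matrix whose node degrees are capped at k and verifies the O(N·k) sparsity bound that makes each propagation layer linear in N:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
N, E, k = 1000, 200, 8   # nodes, hyperedges, max hyperedges per node (illustrative)

rows, cols = [], []
for v in range(N):
    deg = rng.integers(1, k + 1)            # each node joins at most k hyperedges
    for e in rng.choice(E, size=deg, replace=False):
        rows.append(v); cols.append(e)
H = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(N, E))

# Sparsity bound: nnz(H) <= N * k, so one propagation layer of the form
# X' = H @ (H.T @ X) costs O(N * k * d) rather than O(N * E * d).
assert H.nnz <= N * k
X = rng.standard_normal((N, 16))
X_next = H @ (H.T @ X)                      # one (unnormalized) message-passing step
assert X_next.shape == (N, 16)
```

With the depth cap L_max from the meta-controller, total cost is at most L_max such layers, matching the linear-in-N claim above.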
4. Experiments
4.1. Dataset
To systematically evaluate the effectiveness and generalization ability of the proposed method in open-world object detection tasks, we conducted comprehensive experiments on the Open World Object Detection (OWOD) benchmark [19]. The benchmark is constructed from the MS-COCO 2017 training and validation sets and follows a standard open-world protocol that simulates the dynamic emergence of novel categories in real-world scenarios. Specifically, the 60 most frequent categories are selected as known classes for initial supervised training, while the remaining categories are retained as unknown classes during testing. This setup enables the evaluation of a model’s ability to detect and incrementally learn novel categories without retraining from scratch.
To enhance multimodal representation, we augment the visual data with corresponding textual descriptions. The textual inputs are derived from original annotations, MS-COCO-style captions, and semantic descriptions generated by a pre-trained language model. After manual verification and denoising, these textual features are used as language input to ensure semantic reliability. Owing to its large-scale coverage, diverse object distributions, and high-quality annotations, OWOD has become one of the most widely adopted benchmarks for open-world object detection. Representative multimodal examples are illustrated in Fig 4.
To construct the textual modality, we augment the visual OWOD dataset with multimodal annotations. Specifically, we utilize the COCO caption annotations as the primary textual descriptions, each image being associated with 5 human-written captions. To enrich semantic diversity, we further employ a large-scale pre-trained language model to generate auxiliary descriptions (e.g., object attributes, contextual relations), followed by manual verification and denoising to ensure semantic reliability. For each known category, its textual name and verified descriptions are encoded by the CLIP text encoder to form language nodes and semantic prototype nodes in the hypergraph. Importantly, to strictly follow the OWOD protocol, only known-category descriptions from the training set are used for hypergraph construction, while no unseen class information is introduced during training, thereby preventing data leakage and ensuring fair evaluation.
4.2. Implementation details
We adopt the AdamW optimizer with an initial learning rate of . The batch size is set to 16, and the model is trained for a total of 36 epochs. The learning rate is decayed by a factor of 0.1 at the 24th and 32nd epochs. For hypergraph construction, the activation threshold of the gated network is set to 0.7, which controls the generation of hyperedges. In cross-modal contrastive learning, the temperature coefficient is set to 0.07 as the scaling factor for similarity calculation. The maximum number of GNN propagation layers is 5, allowing the Hyper-GNN to dynamically adjust its reasoning depth.
For the meta-policy controller, the policy is updated every 5 training batches to regulate fusion strategies and propagation depth. In the pseudo-label generation process, the confidence threshold is fixed at 0.9, ensuring that only high-confidence predictions are retained as pseudo-labels. In addition, spectral clustering is employed to group unknown instances into 20 clusters, with each cluster serving as a candidate for a new semantic category. For statistical reliability, all main quantitative results in this paper are reported as the mean and standard deviation over three different random seeds; all tables follow this convention to ensure consistent reporting across known-class metrics, unknown-class metrics, and harmonic-mean results.
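The pseudo-label generation step described above can be sketched as follows. `generate_pseudo_labels` is a hypothetical helper, and the demo clusters synthetic 2-D features into 3 groups purely to keep the example small (the paper applies the same recipe to detector features with 20 clusters):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def generate_pseudo_labels(features, confidences, tau=0.9, n_clusters=3):
    """Keep only high-confidence unknown proposals (tau = 0.9 in the paper),
    then group them into candidate new categories via spectral clustering.
    The paper uses 20 clusters; 3 is used here only for the demo."""
    keep = confidences >= tau
    kept = features[keep]
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="nearest_neighbors",
        n_neighbors=10, random_state=0,
    ).fit_predict(kept)
    return keep, labels

# Three well-separated synthetic "unknown" groups with random confidences.
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [8, 8], [-8, 8]])
feats = np.vstack([c + rng.normal(0, 0.5, (40, 2)) for c in centers])
conf = rng.uniform(0.85, 1.0, size=len(feats))
keep, labels = generate_pseudo_labels(feats, conf)
assert labels.max() + 1 == 3 and len(labels) == keep.sum()
```

Each resulting cluster would then be attached to the hypergraph so its members share pseudo-label semantics, as described in Sect. 3.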
To ensure strict compliance with the OWOD protocol and to avoid label leakage across tasks, the text augmentation pipeline is designed with a three-stage safety mechanism consisting of prompt design, automatic filtering, and manual verification.
- (1). Prompt construction
All LLM prompts are restricted to high-level visual attributes and contextual descriptions, and they explicitly exclude any category names or synonyms corresponding to classes appearing in later OWOD tasks. The prompts focus only on generic object properties such as color, texture, shape, material, and scene context.
- (2). Automatic filtering
The generated descriptions are passed through a two-step filtering process. First, any text containing tokens that match the WordNet synsets of future-task categories is removed. Second, CLIP-based semantic similarity is used to detect near-synonyms or paraphrases of unseen categories; descriptions whose similarity to any future class name exceeds a threshold of 0.25 are discarded.
- (3). Manual verification.
The remaining descriptions are manually screened to ensure that no future category names or recognizable semantic cues are present. This final review step guarantees that no text contains explicit or implicit references to unseen classes.
- (4). Ablation without generated text.
To validate that the LLM-generated descriptions neither leak information nor artificially inflate performance, we additionally include a “no text augmentation” ablation. This experiment uses only the original text descriptions provided by the OWOD benchmark. The results show that the proposed method still improves both known- and unknown-class detection performance, confirming that the LLM-generated descriptions introduce no semantic leakage and serve only as auxiliary regularization.
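The two automatic filtering steps can be sketched as below. `filter_descriptions` and `toy_sim` are illustrative stand-ins (the actual pipeline matches WordNet synsets and uses CLIP text-encoder similarities), and the class names are invented for the demo:

```python
def filter_descriptions(descriptions, future_class_terms, clip_sim, sim_thresh=0.25):
    """Two-step leakage filter (illustrative stand-in for the paper's pipeline).
    Step 1: drop any text containing a token matching a future-task class name
            or one of its synonyms (the paper uses WordNet synsets).
    Step 2: drop texts whose similarity to any future class name exceeds the
            0.25 threshold; `clip_sim` is a pluggable text-similarity function."""
    survivors = []
    for text in descriptions:
        tokens = set(text.lower().replace(".", "").split())
        if tokens & future_class_terms:             # step 1: token match
            continue
        if any(clip_sim(text, c) > sim_thresh       # step 2: semantic match
               for c in future_class_terms):
            continue
        survivors.append(text)
    return survivors

# Toy similarity standing in for CLIP text-encoder cosine similarity.
def toy_sim(text, cls):
    return 0.9 if cls[:4] in text.lower() else 0.1

future = {"zebra", "giraffe"}
texts = ["A striped four-legged animal.",       # kept (no leak under toy_sim)
         "A zebra crossing the road.",          # removed by token match
         "A tall giraffe-like silhouette."]     # removed by similarity
print(filter_descriptions(texts, future, toy_sim))  # ['A striped four-legged animal.']
```

In the real pipeline the surviving texts still pass through the manual verification stage before being encoded into language nodes.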
4.3. Evaluation metric
To rigorously evaluate performance in open-world object detection, we adopt eight complementary metrics covering known-category accuracy, unknown-category discovery, balanced performance, incremental stability, policy responsiveness, computational efficiency, and multimodal robustness. The definitions are provided below for clarity.
- (1). Known-Class Detection: mAP@0.5
Mean Average Precision at IoU = 0.5 is used to measure detection accuracy on known classes: mAP@0.5 = (1/|C|) Σ_{c ∈ C} AP_c, where C is the set of known classes and AP_c is the average precision of class c at an IoU threshold of 0.5.
- (2). Unknown-Class Detection: Unknown Recall
Unknown Recall evaluates the ability to correctly identify unknown objects: U-Recall = TP_u / (TP_u + FN_u), where TP_u and FN_u count unknown instances that are detected and missed, respectively.
- (3). Unknown-Class Detection Quality: Unknown mAP
Unknown mAP@0.5 quantifies localization and classification quality for unknown predictions, computed analogously to the known-class mAP but over unknown instances, with pseudo-labels generated according to the OWOD protocol serving as ground truth.
- (4). Balanced Performance: Harmonic Mean (H-Mean)
To capture the trade-off between known-category retention and unknown-category discovery, we compute: H-Mean = 2 · mAP_known · U-Recall / (mAP_known + U-Recall).
- (5). Incremental Stability: Forgetting Ratio
To quantify the degradation of previously learned classes across tasks, we define: F = (A_old^init − A_old^final) / A_old^init, where A_old^init and A_old^final denote the accuracy on previously learned classes before and after learning new tasks.
- (6). Policy Responsiveness: Strategy Adaptation Latency
This metric measures the time elapsed from input arrival to the meta-policy controller producing its full action vector. Lower latency indicates more responsive dynamic reasoning.
- (7). Computational Efficiency: GFLOPs
GFLOPs reports the floating-point operations required for a single forward pass, reflecting the model’s suitability for real-time or resource-constrained settings.
- (8). Multimodal Fusion Robustness: Attention Entropy
Attention entropy measures the balance of modality contributions: H_att = − Σ_m w_m log w_m, where w_m denotes the fusion weight of modality m. Higher entropy indicates more balanced multimodal integration, while lower entropy suggests reliance on a single modality.
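Two of these metrics are simple enough to compute directly. The sketch below evaluates the harmonic mean on the paper's headline numbers and the attention entropy on assumed fusion-weight vectors (the weight values are illustrative):

```python
import math

def h_mean(known_map, unknown_recall):
    """Harmonic mean of known-class mAP and Unknown Recall (both in %)."""
    return 2 * known_map * unknown_recall / (known_map + unknown_recall)

def attention_entropy(weights):
    """H_att = -sum_m w_m * log(w_m) over modality fusion weights (sum to 1).
    Higher entropy => more balanced multimodal integration."""
    return -sum(w * math.log(w) for w in weights if w > 0)

# The paper's headline numbers: 76.8% known mAP, 65.1% Unknown Recall.
print(round(h_mean(76.8, 65.1), 1))             # 70.5, close to the reported 70.6

balanced = attention_entropy([1/3, 1/3, 1/3])   # maximal for three modalities
skewed = attention_entropy([0.9, 0.05, 0.05])   # one modality dominates
assert balanced > skewed
```

Note that the exact entropy scale depends on the number of fusion components and the logarithm base used, which the paper does not specify; the ordering (balanced > skewed) is what the metric captures.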
5. Results and analysis
5.1. Comparison with state-of-the-art methods
We first evaluate Hyper-MDR on the OWOD dataset. The results show that the proposed dynamic hypergraph modeling and meta-policy control jointly improve detection accuracy for known categories while substantially strengthening the discovery and identification of unknown categories. Since PROB, Decoupled PROB, and OW-OVD report the standard U-Recall and unknown mAP metrics and represent state-of-the-art OWOD systems on COCO-based OWODB splits, we include them for a fair and comprehensive comparison. Table 2 shows that Hyper-MDR achieves significant advantages over existing methods. For known-category detection, it reaches a Known mAP@0.5 of 76.8%, outperforming Decoupled PROB (75.1%), OW-OVD (74.3%), and OW-VAP (72.5%), indicating robust retention of known-class recognition accuracy. For unknown-category detection, Hyper-MDR achieves an Unknown mAP@0.5 of 39.5%, clearly surpassing PROB (28.6%) and Decoupled PROB (32.8%), while maintaining an Unknown Recall of 65.1%, a significant advantage in novel-category discovery and recognition. These results indicate that, through dynamic hypergraph modeling and the meta-policy control mechanism, the model achieves a favorable balance between semantic consistency and generalization.
In terms of the comprehensive metric, Hyper-MDR achieves an H-mean of 70.6%, the highest among all methods, reflecting a balanced performance across known and unknown categories. In terms of efficiency, its computational overhead is 150 GFLOPs, lower than PROB (180 GFLOPs), OW-OVD (220 GFLOPs), and Decoupled PROB (170 GFLOPs), indicating superior computational efficiency and deployment feasibility while maintaining high detection accuracy. Overall, these results confirm the effectiveness of Hyper-MDR in open-world object detection, showing that its design strikes a sound balance between accuracy, generalization, and efficiency.
To ensure consistency across evaluation criteria, we also report the Forgetting Ratio in our ablation study (Table 3). Since most prior works do not provide this metric, it is omitted from Table 2 for fairness in direct comparison. For methods where Forgetting Ratio is unavailable, we denote the value as “–” if the column is included. This design allows Table 2 to reflect standard benchmarks while Table 3 emphasizes the incremental stability analysis unique to Hyper-MDR.
5.2. Ablation experiment
To further validate the contribution of each key module to overall performance, we conduct ablation experiments on the OWOD dataset, as shown in Table 3. The complete Hyper-MDR model achieves 76.8% mAP@0.5, 65.1% Unknown Recall, and 70.6% H-mean, while maintaining the lowest forgetting rate (8.9%), demonstrating the best overall performance. Removing the hypergraph structure lowers Unknown Recall to 55.4% and H-mean to 62.2%, indicating that the dynamic hypergraph plays a core role in capturing cross-modal high-order relationships and enhancing unknown-category discovery. Removing the meta-policy controller disrupts the model’s balance, raising the forgetting rate to 12.1% and lowering H-mean to 66.0%, showing that the meta-policy mitigates forgetting by dynamically adjusting reasoning depth and fusion strategies. Removing cross-modal contrastive learning causes Unknown Recall to fall to 53.2%, the forgetting rate to rise to 16.7%, and H-mean to drop to 59.8%, confirming the critical importance of cross-modal feature alignment for semantic consistency and generalization. Finally, removing the pseudo-label generation module yields an Unknown Recall of only 51.5% and an H-mean of 61.0%, indicating the indispensable role of the pseudo-label propagation mechanism in driving semantic gains for new categories and enabling continuous learning.
Overall, the ablation results demonstrate that each component of Hyper-MDR contributes substantially to performance. Cross-modal contrastive learning and the dynamic hypergraph structure are particularly crucial for discovering unknown categories, while the meta-policy controller and the pseudo-label generation mechanism play vital roles in maintaining stable incremental learning and mitigating catastrophic forgetting.
5.3. Meta-strategy dynamic analysis
To verify the meta-policy controller’s adaptive adjustment across different open scenarios, we designed five typical scenarios, screened corresponding samples from the full-scenario test set, and analyzed the mean and stability of the core actions output by the controller. The results, shown in Table 4, clearly demonstrate the scene-adaptive behavior of the controller: in blurry-image scenarios, the visual weight is reduced from 0.72 (clear images) to 0.35, while the language weight is increased to 0.65, strengthening textual semantic guidance to compensate for feature distortion caused by visual noise. In scenarios where unknown classes emerge, the GNN propagation depth reaches 4.2 layers, reinforcing semantic associations between unknown and known classes through deeper message passing, and the number of attention heads increases to 6.2, widening feature interaction to capture more potential semantic cues. In the modality-mismatch scenario (no text), the visual weight rises to 0.95, ensuring detection continuity by reducing dependence on the missing modality; the standard deviation of the action is only 0.03, demonstrating the stability of the strategy.
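Since the learned controller's internals are not enumerated here, the following rule-based stand-in only illustrates the scene-to-action tendencies reported in Table 4; the scene names, branch conditions, and exact action values are assumptions for the sketch, not the paper's policy:

```python
def controller_action(scene):
    """Illustrative rule-based stand-in for the learned meta-policy controller.
    Returns (visual_weight, language_weight, gnn_depth, attn_heads), mirroring
    the tendencies reported in Table 4; the exact mapping is an assumption."""
    if scene == "blurry":                  # distrust degraded visual features
        return 0.35, 0.65, 3, 4
    if scene == "unknown_emergence":       # deepen propagation, widen attention
        return 0.55, 0.45, 4, 6
    if scene == "missing_text":            # fall back to the visual modality
        return 0.95, 0.05, 2, 4
    return 0.72, 0.28, 2, 4                # clear image: default operating point

v, l, depth, heads = controller_action("blurry")
assert l > v                               # language compensates for blur
assert controller_action("unknown_emergence")[2] >= 4
```

In Hyper-MDR itself this mapping is produced by a policy network trained with meta-policy gradients over detection states and historical trajectories rather than by fixed rules.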
The distribution of strategy adjustment delays under different scenarios was statistically analyzed, as shown in Fig 5. The results indicate that latency remains relatively low for clear images and scenes with dense known classes, demonstrating that the controller can quickly adjust its policy when semantics are explicit and information is sufficient. In contrast, latency increases significantly for blurry images, scenes with emergent unknown classes, and modality mismatches, reflecting the higher computational cost required for policy optimization in uncertain or complex environments. Further analysis shows that the average latency is highest in scenarios with emergent unknown classes, suggesting that these cases pose the greatest challenge to system responsiveness. Overall, the meta-policy controller is able to adaptively adjust its response speed according to scenario complexity, achieving rapid responses in simple conditions while performing deeper reasoning in complex situations.
5.4. Verification of feature alignment and modal balance
To evaluate the effectiveness of cross-modal feature interaction in the Hyper-MDR framework, we first analyzed feature space alignment using t-Distributed Stochastic Neighbor Embedding (t-SNE). As shown in Fig 6, before alignment, the dispersion of known-class visual and textual features was (0.91, 0.88) and (0.88, 0.92), respectively. The low overlap between visual and textual clusters indicated poor semantic consistency across modalities. After applying cross-modal contrastive learning (Cross-Modal CL), the dispersion decreased to (0.77, 0.77) and (0.81, 0.79), showing higher overlap between visual and textual features of the same class. These results demonstrate that symmetric contrastive loss effectively forces semantically similar multimodal features to cluster in a unified space, thereby enhancing cross-modal cluster consistency and reducing intra-class dispersion. This alignment establishes a solid foundation for subsequent hypergraph-based high-order correlation modeling.
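The symmetric contrastive objective behind this alignment can be sketched in NumPy as a two-directional InfoNCE loss with the temperature of 0.07 from Sect. 4.2; the embedding dimensions and data below are synthetic, and the function name is an illustrative assumption:

```python
import numpy as np

def symmetric_contrastive_loss(vis, txt, tau=0.07):
    """Symmetric InfoNCE over matched visual/text embeddings (tau = 0.07).
    Row i of `vis` and `txt` are assumed to describe the same instance."""
    vis = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = vis @ txt.T / tau                  # pairwise cosine similarities
    labels = np.arange(len(vis))
    def ce(z):                                  # cross-entropy with diagonal targets
        z = z - z.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))    # vision->text and text->vision

rng = np.random.default_rng(0)
txt = rng.standard_normal((8, 32))
# Well-aligned pairs (visual ~ text) should incur a much lower loss
# than randomly paired embeddings.
aligned = symmetric_contrastive_loss(txt + 0.01 * rng.standard_normal((8, 32)), txt)
random_pairs = symmetric_contrastive_loss(rng.standard_normal((8, 32)), txt)
assert aligned < random_pairs
```

Minimizing this loss pulls same-instance visual and textual features together, which is exactly the reduction in intra-class dispersion observed in the t-SNE analysis.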
In addition to alignment, we further analyzed modal fusion balance using attention entropy time-series evaluation. This metric quantifies the distribution of multimodal weights and reflects whether one modality dominates the decision process. Higher entropy values indicate more balanced utilization of multimodal information, while lower values suggest over-reliance on a single modality. The analysis results confirm that Hyper-MDR maintains a stable and balanced cross-modal interaction, ensuring robust feature fusion and preventing modality bias, which is essential for open-world detection scenarios involving noisy or missing inputs.
The temporal analysis of modal fusion attention entropy is presented in Fig 7. The results show that the Hyper-MDR framework achieves highly balanced modal fusion across diverse scenarios. In clear-image conditions, attention entropy remains stable at approximately 1.8, reflecting reliable visual modality contributions and a relatively balanced weight distribution. In blurry-image scenarios, the entropy initially drops to around 1.0 for the first 20 samples before gradually recovering to 1.8, indicating that the system adaptively increases the contribution of the textual modality to compensate for degraded visual information. When encountering emergent unknown classes, the entropy stabilizes at a higher level of about 2.2, suggesting that the system assigns more balanced weights across modalities to cope with semantic uncertainty. Across the entire test set, the attention entropy distribution has a mean of 1.83 and a median of 1.86, demonstrating that the system maintains overall balanced multimodal fusion while flexibly adjusting weights under challenging scenarios.
5.5. Performance analysis of unknown class detection and pseudo-label generation
A fundamental challenge in open-world object detection is accurately identifying instances of unknown classes and generating semantically consistent pseudo-labels, as this directly determines the model’s adaptability to open environments. To address this, the proposed PL-HGP module leverages a hypergraph topology to backpropagate the semantics of cluster centers to all nodes within the cluster, thereby alleviating issues of unknown class–background confusion and discontinuous pseudo-label semantics. For evaluation, we compare the proposed method against fixed-threshold filtering and K-means clustering. The results of unknown class detection accuracy are presented in Fig 8.
As can be seen, the ROC curve of Hyper-MDR (PL-HGP) significantly outperforms fixed-threshold filtering and K-means clustering, with the entire curve lying closer to the upper-left corner. The corresponding AUC reaches 0.962, indicating that the method substantially improves the true-positive rate while controlling the false-positive rate, demonstrating stronger unknown-class detection capability. In contrast, the fixed-threshold filtering curve is lower, with an AUC of 0.847: it tends to miss unknown classes at high confidence thresholds, while lower thresholds increase the risk of false detections. K-means clustering, with an AUC of 0.889, falls between the two, suggesting that relying solely on feature distance while ignoring semantic associations makes it difficult to fully distinguish unknown classes.
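The ROC/AUC evaluation protocol can be reproduced on synthetic unknown-energy scores; the distributions below are illustrative, not the paper's data. A higher AUC means unknown instances are ranked above known ones more reliably:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical evaluation: higher "unknown energy" should rank unknown
# instances above known ones; AUC summarizes this ranking quality
# independently of any single decision threshold.
rng = np.random.default_rng(0)
known_energy = rng.normal(-0.95, 0.03, 500)    # confident known detections
unknown_energy = rng.normal(-0.55, 0.15, 200)  # ambiguous unknown detections
y_true = np.r_[np.zeros(500), np.ones(200)]    # 1 = unknown
scores = np.r_[known_energy, unknown_energy]
auc = roc_auc_score(y_true, scores)
assert auc > 0.9                               # well-separated score distributions
```

This threshold-free view explains why fixed-threshold filtering underperforms: a single cut-off cannot trade off the two error types across operating points.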
The results of pseudo-label quality and the anti-forgetting capability in continuous learning are presented in Fig 9. In terms of pseudo-labeling accuracy, PL-HGP achieved an average accuracy of 93.5%, compared with 80.3% for K-means clustering and 69.6% for fixed-threshold filtering. The superiority of PL-HGP is particularly evident for semantically complex clusters, such as novel vehicle categories. This advantage arises from the hypergraph’s ability to utilize shared semantic hyperedges (e.g., vehicle–four-wheel–mobile) to link unknown clusters with related known classes, thereby ensuring semantic coherence in the generated pseudo-labels. Furthermore, during continuous learning, the forgetting rate of initial known classes under PL-HGP was only 6.9% at stage 5, indicating that semantically consistent pseudo-labels effectively reduce interference with the feature space of known classes and mitigate catastrophic forgetting.
Fig 10 presents the detection results for both known and unknown objects. The visual outputs, including bounding boxes, highlight the advantages of the proposed Hyper-MDR framework. By leveraging high-level semantic modeling and meta-strategy dynamic reasoning within a cross-modal heterogeneous hypergraph, the model not only achieves higher detection accuracy for known objects but also effectively identifies unknown objects while reducing false positives.
In addition to the standard qualitative results, we further conducted a detailed failure analysis to better understand the limitations of Hyper-MDR. Three representative types of failure cases were identified. First, in scenes with heavy occlusion or truncation, the model sometimes assigns overly high uncertainty and incorrectly predicts known objects as unknown due to insufficient visible cues for reliable hypergraph propagation. Second, when two visually similar instances belong to semantically different categories, ambiguous multimodal correlations may be propagated to neighboring nodes, leading to inaccurate semantic grouping. Third, under extremely low-resolution or noisy conditions, the meta-policy controller occasionally allocates insufficient reasoning depth, which results in suboptimal predictions for both known and unknown categories. These observations help clarify the conditions under which the model is most challenged and indicate future opportunities for improving robustness and adaptive reasoning.
5.6. Computational complexity
In the practical deployment of open-world object detection, computational efficiency and real-time responsiveness are core engineering metrics that directly determine applicability on edge devices. We therefore report per-module GFLOPs statistics and measure the efficiency gains of dynamic inference as scene complexity varies. The results are shown in Fig 11.
Hyper-MDR’s computational complexity strikes a balance between accuracy and efficiency, with a total GFLOPs of 36.0 for a single forward inference pass. Visual feature extraction (Faster R-CNN) accounts for the largest portion (50.6%, 18.2 GFLOPs), while Hyper-GNN hypergraph convolution accounts for 21.7% (7.8 GFLOPs). The meta-policy controller accounts for only 4.2% (1.5 GFLOPs), demonstrating the extremely low computational overhead of the dynamic inference mechanism. In terms of dynamic reasoning efficiency, low-complexity scenarios (known class ratio > 80%) achieve the highest gain, at 38.2%. This is because the meta-policy controller automatically reduces the number of GNN propagation layers and attention heads, reducing inefficient computation. It is worth noting that the 36 GFLOPs reported here correspond only to the incremental reasoning modules introduced by Hyper-MDR beyond the backbone detector. For fair comparison with prior open-world detection methods, Table 2 reports the end-to-end inference cost, including backbone feature extraction and detection heads, which amounts to 150 GFLOPs. This distinction highlights that while the overall pipeline remains competitive in efficiency, the additional overhead from Hyper-MDR itself is lightweight and well-suited for deployment in resource-constrained scenarios. Although the experiments are conducted on the OWOD benchmark, the proposed framework maintains favorable scalability due to its bounded hypergraph degree, controller-regulated propagation depth, and lightweight dynamic inference cost. The total overhead introduced by Hyper-MDR is only 36 GFLOPs beyond the backbone, indicating that the method can be extended to larger datasets or higher-resolution inputs without prohibitive computational growth. 
In addition, the adaptive reasoning mechanism reduces computation for simple inputs while using deeper inference only when necessary, which provides a clear path toward real-time deployment in resource-constrained or latency-sensitive environments.
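As a quick arithmetic check, the module shares reported in this subsection follow directly from the per-module GFLOPs and the 36.0 GFLOPs total:

```python
# Sanity-check the reported GFLOPs breakdown of the Hyper-MDR-specific modules
# (total 36.0 GFLOPs; shares as stated in Sect. 5.6).
modules = {
    "visual_feature_extraction": 18.2,   # reported as 50.6%
    "hyper_gnn_convolution": 7.8,        # reported as 21.7%
    "meta_policy_controller": 1.5,       # reported as 4.2%
}
total = 36.0
shares = {k: round(100 * v / total, 1) for k, v in modules.items()}
print(shares)  # {'visual_feature_extraction': 50.6, 'hyper_gnn_convolution': 21.7, 'meta_policy_controller': 4.2}
```

The remaining ~23.5% of the budget corresponds to the modules not itemized in the text (e.g., cross-modal fusion and detection heads).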
6. Discussion
The experimental results comprehensively demonstrate the effectiveness of Hyper-MDR in addressing the core challenges of open-world object detection. The model achieves 76.8% known mAP@0.5, outperforming prior approaches while also significantly improving novel-category discovery with 39.5% unknown mAP@0.5 and 65.1% Unknown Recall. These gains highlight the value of modeling high-order multimodal correlations through dynamic hypergraphs. Ablation results further confirm the role of each component: removing cross-modal contrastive learning reduces Unknown Recall from 65.1% to 53.2%, and disabling the hypergraph structure drops H-mean from 70.6% to 62.2%, while excluding the meta-policy controller increases the forgetting rate from 8.9% to 12.1%. Together, these findings show that every module contributes substantially to semantic alignment, unknown-category detection, and incremental stability. A notable advantage of Hyper-MDR is its adaptive reasoning capability. The meta-policy controller dynamically adjusts propagation depth, fusion weights, and interaction breadth according to scenario complexity, enabling the model to balance accuracy and efficiency. Despite this adaptivity, Hyper-MDR maintains a modest computational cost of 150 GFLOPs, making it practical for deployment in resource-constrained or edge environments. Latency analysis further shows that the controller responds rapidly under standard conditions and increases reasoning depth only when encountering ambiguous or unknown instances.
However, limitations remain. The pseudo-labeling pipeline currently relies on spectral clustering, which may introduce additional overhead as the number of unknown categories increases. Although multimodal fusion reduces sensitivity to noise, bias from unreliable modalities cannot be completely eliminated. Future work may therefore focus on integrating stronger foundation vision–language models to improve multimodal robustness, designing more efficient clustering strategies to reduce the computational overhead of unknown-class expansion, and further lowering the forgetting rate to support long-term continuous learning.
7. Conclusion
In conclusion, this paper presented Hyper-MDR, a dynamic open-world object detection framework that unifies cross-modal hypergraph reasoning with meta-policy–driven adaptive inference. By jointly modeling visual, linguistic, and semantic prototype nodes and regulating the reasoning depth through a policy controller, the framework achieves more consistent and adaptive detection in complex open-world scenarios. Experiments on the OWOD benchmark demonstrate clear improvements over recent methods in both known-category detection and unknown-category discovery, while maintaining strong computational efficiency and stability during incremental learning. The pseudo-label propagation module further supports reliable category expansion by enhancing semantic consistency across new samples. Overall, these results highlight the effectiveness of integrating hypergraph-guided multimodal reasoning with adaptive meta-policy optimization. Future work will explore real-time extensions of the framework, more efficient mechanisms for discovering emerging categories, and lifelong learning strategies to support continuous scaling to broader open-world environments.
References
- 1. Kaur R, Singh S. A comprehensive review of object detection with deep learning. Digit Signal Process. 2023;132:103812.
- 2. Edozie E, Shuaibu AN, John UK, Sadiq BO. Comprehensive review of recent developments in visual object detection based on deep learning. Artif Intell Rev. 2025;58(9).
- 3. Mittal P. A comprehensive survey of deep learning-based lightweight object detection models for edge devices. Artif Intell Rev. 2024;57(9).
- 4. Xu S, Zhang M, Song W, Mei H, He Q, Liotta A. A systematic review and analysis of deep learning-based underwater object detection. Neurocomputing. 2023;527:204–32.
- 5. Jiang W, Zhang Z, Xiong Q, et al. A survey of object detection based on deep learning. In: Fourth International Conference on Advanced Algorithms and Neural Networks (AANN 2024). SPIE; 2024. pp. 450–61.
- 6. Zhang J, Huang J, Jin S, Lu S. Vision-language models for vision tasks: a survey. IEEE Trans Pattern Anal Mach Intell. 2024;46(8):5625–44. pmid:38408000
- 7. Shinde G, Ravi A, Dey E. A survey on efficient vision‐language models. Wiley Interdiscip Rev: Data Mining Knowl Discov. 2025;15(3):e70036.
- 8. Yu X, Wang Y, Sun H. Open-vocabulary object detection survey. Procedia Comput Sci. 2025;266:62–71.
- 9. Zhang X, Wang H, Dong H. A survey of deep learning-driven 3D object detection: sensor modalities, technical architectures, and applications. Sensors (Basel). 2025;25(12):3668. pmid:40573555
- 10. Mousa-Pasandi M, Liu T, Massoud Y, Laganière R. RGB-LiDAR fusion for accurate 2D and 3D object detection. Mach Vis Appl. 2023;34(5).
- 11. Scicluna A, Le Gentil C, Sutjipto S, et al. Towards robust perception for assistive robotics: an RGB-Event-LiDAR dataset and multi-modal detection pipeline. In: 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE). IEEE; 2024. pp. 920–25.
- 12. Mirlach J, Wan L, Wiedholz A. R-livit: A lidar-visual-thermal dataset enabling vulnerable road user focused roadside perception. arXiv preprint. 2025.
- 13. Li C, Li T, Hu X, et al. DVHGNN: multi-scale dilated vision HGNN for efficient vision recognition. In: Proceedings of the Computer Vision and Pattern Recognition Conference. 2025. pp. 20158–68.
- 14. Lei M, Wu Y, Li S, et al. SoftHGNN: soft hypergraph neural networks for general visual recognition. arXiv preprint arXiv:2505.15325. 2025.
- 15. Feng Y, Huang J, Du S. Hyper-YOLO: when visual object detection meets hypergraph computation. IEEE Trans Pattern Anal Mach Intell. 2024.
- 16. Cui Z, Dai Y, Duan Y. Joint object detection and multi-object tracking based on hypergraph matching. Appl Sci. 2024;14(23).
- 17. Yu N, Wang C, Qiao Y. Hypergraph-based asynchronous event processing for moving object classification. J Shanghai Jiaotong Univ (Science). 2024:1–10.
- 18. Rong Y, Nong L, Liang Z, Huang Z, Peng J, Huang Y. Hypergraph position attention convolution networks for 3d point cloud segmentation. Appl Sci. 2024;14(8):3526.
- 19. Joseph KJ, Khan S, Khan FS. Towards open world object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. pp. 5830–40.
- 20. Wu Z, Lu Y, Chen X, et al. UC-OWOD: unknown-classified open world object detection. In: European Conference on Computer Vision. Cham: Springer Nature Switzerland; 2022. pp. 193–210.
- 21. Zohar O, Wang KC, Yeung S. PROB: probabilistic objectness for open world object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. pp. 11444–53.
- 22. Inoue R, Tsuchiya M, Yasui Y. Decoupled PROB: decoupled query initialization tasks and objectness-class learning for open world object detection. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE; 2025. pp. 8207–16.
- 23. Gu X, Lin TY, Kuo W. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint. 2021.
- 24. Ma Z, Jiang Z, Zhang H. Hyperspectral image classification using feature fusion hypergraph convolution neural network. IEEE Trans Geosci Remote Sensing. 2022;60:1–14.
- 25. Zhou X, Girdhar R, Joulin A, et al. Detecting twenty-thousand classes using image-level supervision. In: European Conference on Computer Vision. Cham: Springer Nature Switzerland; 2022. pp. 350–68.
- 26.
Zang Y, Li W, Zhou K, et al. Open-vocabulary detr with conditional matching. European Conference on Computer Vision. Cham: Springer Nature Switzerland; 2022. pp. 106–22.
- 27.
Li L H, Zhang P, Zhang H, et al. Grounded language-image pre-training. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. pp. 10965–75.
- 28.
Kamath A, Singh M, LeCun Y, et al. Mdetr-modulated detection for end-to-end multi-modal understanding. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. pp. 1780–90.
- 29. Guo F, Wang Y, Qi H, Jin W, Zhu L, Sun J. Multi-view distillation based on multi-modal fusion for few-shot action recognition (CLIP-MDMF). Knowl-Based Syst. 2024;304:112539.
- 30. Li G, Zhang M, Zheng X, et al. Multimodal inplace prompt tuning for open-set object detection. Proceedings of the 32nd ACM International Conference on Multimedia. 2024. pp. 8062–71.
- 31. Chu SY, Lee MS. MT-DETR: robust end-to-end multimodal detection with confidence fusion. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023. pp. 5252–61.
- 32. Rajora A, Gupta S, Kundu S. Cross-aligned fusion for multimodal understanding. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE; 2025. pp. 5730–40.
- 33. Dong B, Jiang C, Liu X, Deng Y, Huang L. Theoretical characterization and modal directivity investigation of the interaction noise for a small contra-rotating fan. Mech Syst Signal Process. 2020;135:106362.
- 34. Bardozzo F, Fiore P, Valentino M, Bianco V, Memmolo P, Miccio L, et al. Enhanced tissue slide imaging in the complex domain via cross-explainable GAN for Fourier ptychographic microscopy. Comput Biol Med. 2024;179:108861. pmid:39018884
- 35. Sheng Z, Huang Z, Chen S. Kinematics-aware multigraph attention network with residual learning for heterogeneous trajectory prediction. J Intell Connect Veh. 2024;7(2):138–50.
- 36. Sheng Z, Xu Y, Xue S, Li D. Graph-based spatial-temporal convolutional network for vehicle trajectory prediction in autonomous driving. IEEE Trans Intell Transport Syst. 2022;23(10):17654–65.
- 37. Cai D, Song M, Sun C, et al. Hypergraph structure learning for hypergraph neural networks. IJCAI; 2022. pp. 1923–9.
- 38. Yan J, Feng Y, Ying S, et al. Hypergraph dynamic system. The Twelfth International Conference on Learning Representations. 2024.
- 39. Xi X, Huang Y, Luo R, et al. OW-OVD: unified open world and open vocabulary object detection. Proceedings of the Computer Vision and Pattern Recognition Conference. 2025. pp. 25454–4.
- 40. Xi X, Fu X, Wang W, et al. OW-VAP: visual attribute parsing for open world object detection. International Conference on Machine Learning. PMLR; 2025. pp. 68190–207.