
Needle in a haystack: Coarse-to-fine alignment network for moment retrieval from large-scale video collections

  • Lingwen Meng ,

    Roles Conceptualization, Funding acquisition, Investigation, Project administration, Supervision, Writing – review & editing

    menglingwen_gpg@163.com

    Affiliation Electric Power Research Institute of Guizhou Power Grid Co. Ltd, Guiyang, China

  • Fangyuan Liu,

    Roles Data curation, Investigation, Methodology, Software, Validation, Writing – original draft

    Affiliation Electric Power Research Institute of Guizhou Power Grid Co. Ltd, Guiyang, China

  • Mingyong Xin,

    Roles Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Visualization, Writing – original draft

    Affiliation Electric Power Research Institute of Guizhou Power Grid Co. Ltd, Guiyang, China

  • Siqi Guo,

    Roles Investigation, Methodology, Software, Validation, Visualization, Writing – original draft

    Affiliation Electric Power Research Institute of Guizhou Power Grid Co. Ltd, Guiyang, China

  • Fu Zou

    Roles Software, Validation, Visualization, Writing – original draft

    Affiliation Electric Power Research Institute of Guizhou Power Grid Co. Ltd, Guiyang, China

Abstract

Moment retrieval from large-scale video collections aims to search for and localize the temporal boundary of a video moment from a collection of numerous videos according to a given natural language query. Existing methods for moment retrieval in a single video are too time-consuming to scale directly to this task due to their sophisticated network architectures. In this paper, we decompose the original problem into two mutually boosting subtasks, video retrieval from video collections and moment retrieval in a single video, and propose the coarse-to-fine alignment network (CFAN), which includes a video alignment module, a cross-modal interaction module and a flow of multi-level coarse-to-fine alignment information. Through the interaction of the multi-level information from the two subtasks, our method makes full use of the global contextual information in videos and the fine-grained alignment information between videos and queries. We perform extensive experiments on three public datasets, ActivityNet Captions, Charades-STA and DiDeMo, and the evaluation results demonstrate the effectiveness of the proposed CFAN method.

Introduction

Video retrieval with natural language [1–6], which aims to search for the most relevant video in a large collection of videos, and moment retrieval in video [7–10], where the goal is to localize the temporal boundary of a target moment in a single video, have received significant attention in recent years. Despite the advancements, several limitations persist in prior works. When provided with a textual query, users expect the retrieval system not only to identify videos of interest but also to exclude irrelevant content and pinpoint the most semantically relevant moment accurately. In this paper, we study the task of moment retrieval from large-scale video collections, a natural and essential extension of prior tasks, which aims to identify a video moment from a large collection of videos according to a given natural language query, as the example in Fig 1 shows.

Fig 1. Example of moment retrieval from large-scale video collections.

https://doi.org/10.1371/journal.pone.0320661.g001

Localizing the temporal boundary of a moment from video collections while considering both efficiency and accuracy is much more challenging than prior tasks. A simple way is to scale those methods of moment retrieval in a single video [3–5, 11–13] to large video collections and generate a confidence score for the predicted moment in each video. However, due to the sophisticated network architectures of those methods, generating candidate moments and their corresponding confidence scores video by video is expensive, time-consuming and unrealistic. A smarter way is to efficiently search for the most relevant video before moment retrieval, but this is greatly limited by video-level alignment and ignores the benefits of fine-grained alignment. Another way is to break a video into a short sequence of clips and align the clips of the target moment to the given textual query with a clip-alignment loss [14]. Despite faster retrieval, coarsely aligning the clips of the target moment with the query ignores the global contextual information of the video, leading to insufficient understanding of video contents. Also, treating different clips in the same video separately raises two crucial challenges, semantic misalignment and structural misalignment [12], which hinders accurate retrieval.

To tackle these challenges and achieve an optimal balance between speed and accuracy, we propose the Coarse-to-Fine Alignment Network (CFAN). This approach decomposes the original problem into two interrelated and mutually enhancing subtasks: video retrieval from a video collection and moment retrieval within a single video. Instead of merely selecting a video for moment prediction or directly searching for the target clip in a clip database [14], we first efficiently retrieve a small candidate set of videos by learning a shared visual-semantic space for video alignment. We then design an advanced cross-modal interaction mechanism to refine the fine-grained alignment between candidate proposals, video frames, and the query. The video alignment step also serves as auxiliary guidance to enhance this process. The fine-grained alignment further acts as a guided gating mechanism, emphasizing key content relevant to the query and refining the learned visual-semantic space. By integrating multi-level coarse-to-fine alignment information across the two subtasks, our method fully leverages both the global contextual information of videos and the detailed correspondence between frames and the query. This comprehensive interaction enables more accurate and efficient retrieval.

Specifically, to learn the common visual-semantic space, we devise a video alignment module where the multi-head self-attention mechanism [15] is plugged into the trainable generalized Vector of Locally Aggregated Descriptors (VLAD) layer [16] to learn the spatio-temporal descriptors, and the sums of residuals from the different cluster centers are aggregated by mean pooling to obtain the visual-semantic embeddings for the query and the video. To explore the fine-grained alignment among the candidate proposals, frames and the query, we propose a cross-modal interaction module, including attention aggregation, a cross gate guided by video alignment and a BiGRU [17], to obtain the cross-modal representations for frame and proposal alignment. Moreover, the frame alignment information and the hardest negative samples for video alignment are applied to fine-tune the video alignment module and improve the learned visual-semantic space for correlation re-estimation. In summary, our contributions can be summarized as follows:

  • We decompose the original retrieval task into two mutually boosting subtasks, which considers both the global contextual information of videos and the fine-grained information between videos and queries.
  • We propose a novel and effective coarse-to-fine alignment network including a video alignment module, a cross-modal interaction module, and the interaction of multi-level coarse-to-fine alignment information for moment retrieval from large-scale video collections, which can be trained in an end-to-end manner.
  • We perform sufficient experiments on three public datasets: ActivityNet Captions [18], Charades-STA [7] and DiDeMo [8], and validate the effectiveness of our CFAN method.

Related work

Video retrieval with natural language

With a natural language query, video retrieval aims to retrieve a specific video from a candidate set of videos. Most methods [3–5, 19–23] maximize the similarity score between a video and its corresponding caption while minimizing the score between negative pairs by encoding both video and text into a common visual-semantic space. For fine-grained video and query encoding, Xu et al. [3] propose a compositional semantics language model with a dependency-tree structure and a deep video model to extract visual features. Otani et al. [4] leverage web image search results to disambiguate fine-grained visual concepts in the query sentence and compute the sentence embeddings. Yu et al. [20] propose a trainable high-level concept word detector as a useful semantic prior and develop an attention mechanism that selectively focuses on the detected concept words and fuses them with the word encoding. Mithun et al. [5] and Miech et al. [22] both utilize multi-modal features (e.g. motion, audio) from a video for more robust video understanding. Shen et al. [24] use contrastive learning and a Transformer model to effectively exploit the long-range dependency between video and text in the cross-modal video-text retrieval task, thereby improving the accuracy and efficiency of retrieval. Zhang et al. [25] propose an asymmetric co-attention network for video clip and text alignment, which effectively handles the information asymmetry between video and text through a specially designed contrastive loss function, achieving excellent performance on multiple benchmark datasets. The Hierarchical Sequence Embedding (HSE) [21] exploits both low-level and high-level correspondences in hierarchically semantic spaces, and the Dual Encoding Network [23] proposes multi-level encodings, including global, local and temporal patterns, in both videos and sentences to learn better shared representations.

In this paper, we extend video retrieval to moment retrieval from large-scale video collections and leverage the fine-grained alignment to improve the performance of video retrieval.

Moment retrieval in video

According to the given textual query, moment retrieval aims to identify the temporal boundary of the most semantically matching moment in the video. Early methods [7–9] sample moment candidates with multi-scale sliding windows, map their visual features and the textual features of the query into a joint semantic space and maximize the similarity score of each positive moment-query pair. For better understanding and modeling of both query and video, Liu et al. [26] propose a language-temporal attention network that encodes the temporal context information to comprehend query descriptions. Xu et al. [11] employ an early fusion approach to generate clip proposals and further consider video captioning as an auxiliary task to learn better representations. Zhang et al. [12] devise an iterative graph adjustment network to exploit the graph-structured moment relations in videos. Chen et al. [27] propose a cross-modal semantic alignment and contrastive learning approach to improve the accuracy and efficiency of video moment retrieval; by enhancing the semantic alignment between video and text, the model achieves more accurate moment localization. Gao et al. [28] introduce a temporal localization method based on an attention mechanism, which achieves accurate retrieval of video moments through fine-grained temporal modeling. Zhao et al. [29] use a graph network to build the context representation in the video, which improves the effect of moment retrieval through the combination of semantic and structural information. Zhang et al. [13] propose a multi-head self-attention mechanism to capture the long-range dependencies in videos and a graph network to exploit the syntactic dependencies in the queries. Some methods [30–32] also study this task in a weakly-supervised setting, which requires only video-level annotations for training.

As natural extensions of moment retrieval in video, Zhang et al. [33] propose the self-attention interaction localizer (SAIL) to localize unseen activities in video via an image query. Yuan et al. [34] propose a novel graph convolved video thumbnail pointer (GTP) to dynamically select and concatenate multiple video clips from an original video via a textual query. Escorcia et al. [14] devise a Clip Alignment with Language (CAL) model to localize the relevant moment in a large collection of videos rather than one single video. In this paper, we also study the task of moment retrieval in video collections, which is more practical and useful than moment retrieval in a single video.

Proposed method

Problem statement and decomposition

Given a natural language query $q = \{w_i\}_{i=1}^{n_q}$, our goal is to search for a video from a video collection $\mathcal{V}$ and further localize the temporal boundary of the moment that is the most relevant to the query, where $w_i$ is the word embedding of the i-th word in the query, $n_q$ is the total number of words in the query, $v = \{f_i\}_{i=1}^{n_v}$ is a video in the collection $\mathcal{V}$, $f_i$ is the pre-extracted feature of the i-th frame of v, and $n_v$ is the total number of frames of the video.

To address the issues of prior works mentioned above, achieve a good speed-accuracy trade-off and make use of both global and fine-grained information, we decompose the original problem into two mutually boosting subtasks: video retrieval from video collections and moment retrieval in a single video. Following this idea, we devise an effective and novel coarse-to-fine alignment network, as shown in Fig 2. Specifically, we first develop a video alignment module to retrieve a small candidate set of videos efficiently by learning a common visual-semantic space. We then develop a sophisticated cross-modal interaction module to explore the fine-grained alignment among the candidate proposals, frames and the query, where the visual-semantic embeddings learned before can be employed as extra guidance information. Moreover, we leverage the fine-grained frame alignment and the hardest negative samples for video alignment to improve the common visual-semantic space and boost the video alignment module.

Fig 2. The overview framework of the coarse-to-fine alignment network.

The CFAN consists of a video alignment module to learn the common visual-semantic space and a cross-modal interaction module to explore the fine-grained alignment among frames, proposals and the query. Also, multi-level coarse-to-fine alignment information flows between the modules to make full use of both the global contextual information and the fine-grained alignment information.

https://doi.org/10.1371/journal.pone.0320661.g002

Video alignment

To learn a common visual-semantic space for video alignment, we devise a dual video alignment module where the multi-head self-attention mechanism [15] is plugged into the trainable generalized Vector of Locally Aggregated Descriptors (VLAD) layer, also named NetVLAD [16], to learn the spatio-temporal descriptors, and the sums of residuals from the different cluster centers are aggregated by mean pooling to obtain the visual-semantic embeddings for the query and the video.

Multi-head self attention. The self-attention mechanism is able to learn the global interaction between each pair of items in a sequence, while the multi-head setting ensures a sufficient understanding of complex information. Specifically, we first employ the multi-head self-attention mechanism to absorb the global contextual information for query and video modeling. The contextual representations $H^v$ of video $v$ can be computed as follows:

$$H^v = [\mathrm{head}_1; \dots; \mathrm{head}_h]\,W^O, \quad \mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(F^v W_i^Q)(F^v W_i^K)^\top}{\sqrt{d_k}}\right) F^v W_i^V \quad (1)$$

where $F^v = \{f_i\}_{i=1}^{n_v}$ denotes the frame features of $v$, $W_i^Q$, $W_i^K$, $W_i^V$ and $W^O$ are trainable projection matrices, $h$ is the number of heads, and $d_k$ is the dimension of each head.

NetVLAD. Given the encoded sequence $H = \{h_i\}_{i=1}^{n}$ and the trainable cluster centers $C = \{c_j\}_{j=1}^{k}$, where $c_j \in \mathbb{R}^d$ and $k$ is the number of cluster centers, to obtain the spatio-temporal descriptors, the trainable VLAD accumulates the residuals between local descriptors and multiple cluster centers by a differentiable and soft assignment, denoted by

$$d_j = \sum_{i=1}^{n} a_i(j)\,(h_i - c_j), \quad a_i(j) = \frac{\exp(w_j^\top h_i + b_j)}{\sum_{j'=1}^{k}\exp(w_{j'}^\top h_i + b_{j'})} \quad (2)$$

where $d_j \in \mathbb{R}^d$ is the spatio-temporal descriptor for the $j$-th cluster center, $a_i = (a_i(1), \dots, a_i(k))$ is the soft assignment vector of descriptor $h_i$ for the $k$ cluster centers, and $a_i(j)$ is the assignment of descriptor $h_i$ to the $j$-th cluster center.

The visual-semantic embedding $f^v$ of the video can be aggregated by mean pooling of the spatio-temporal descriptors $\{d_j\}_{j=1}^{k}$. Similarly, we can obtain the visual-semantic embedding $f^q$ of the query by the dual operation. With $f^q$ and $f^v$, we can build a candidate set of videos by selecting the top-K relevant videos based on the cosine similarity of the two embeddings.
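For illustration, the soft-assignment aggregation of Eq (2) followed by mean pooling can be sketched in dependency-free Python with toy dimensions; the assignment weights and biases here are hypothetical stand-ins for the trainable NetVLAD parameters, not the paper's actual values:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def netvlad_embed(descriptors, centers, assign_w, assign_b):
    """Accumulate residuals between local descriptors and cluster centers
    with a soft assignment, then mean-pool the per-center residual sums
    into a single visual-semantic embedding."""
    k, dim = len(centers), len(centers[0])
    residual_sums = [[0.0] * dim for _ in range(k)]
    for x in descriptors:
        # soft assignment of descriptor x to each of the k centers
        a = softmax([sum(w * xi for w, xi in zip(row, x)) + b
                     for row, b in zip(assign_w, assign_b)])
        for j in range(k):
            for d in range(dim):
                residual_sums[j][d] += a[j] * (x[d] - centers[j][d])
    # mean pooling over the k per-center descriptors
    return [sum(residual_sums[j][d] for j in range(k)) / k for d in range(dim)]
```

In the actual model, a batched PyTorch implementation with learnable parameters would replace these nested loops.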

Cross-modal interaction

To explore the fine-grained alignment among the proposals, frames and the query, we propose a cross-modal interaction module, including attention aggregation, a cross gate guided by the visual-semantic embeddings and a BiGRU [17], to obtain the cross-modal representations for further alignment. Similarly to the video alignment module, we can obtain the global contextual representations for video and query by multi-head self attention, denoted as $H^v = \{h_i^v\}$ and $H^q = \{h_j^q\}$.

Attention aggregation. We employ a standard attention mechanism to aggregate the contextual representations of the query for each frame, denoted by

$$\alpha_{ij} = \frac{\exp\big((h_i^v)^\top W_a\, h_j^q\big)}{\sum_{j'=1}^{n_q}\exp\big((h_i^v)^\top W_a\, h_{j'}^q\big)}, \quad g_i = \sum_{j=1}^{n_q} \alpha_{ij}\, h_j^q \quad (3)$$

where $W_a$ is a trainable matrix, $\alpha_{ij}$ is the attention score between $h_i^v$ and $h_j^q$, and $g_i$ is the aggregated result.
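A minimal sketch of this per-frame attention aggregation, assuming a plain dot-product score in place of the learned scoring function:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def aggregate_query(frame_vec, query_vecs):
    """For one frame, attend over the query word representations and
    return their weighted sum (the aggregated query context)."""
    scores = [sum(f * q for f, q in zip(frame_vec, qv)) for qv in query_vecs]
    alphas = softmax(scores)
    dim = len(query_vecs[0])
    return [sum(a * qv[d] for a, qv in zip(alphas, query_vecs)) for d in range(dim)]
```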

Cross gate. As an extension of the ordinary cross gate [35], we introduce the visual-semantic embeddings of query and video, $f^q$ and $f^v$, as high-level information for guidance and further fusion, denoted by

$$\tilde{g}_i = g_i \odot \sigma\big(W_g[h_i^v; f^v] + b_g\big), \quad \tilde{h}_i^v = h_i^v \odot \sigma\big(W_v[g_i; f^q] + b_v\big) \quad (4)$$

where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, $[\cdot;\cdot]$ denotes concatenation, and $\tilde{g}_i$ and $\tilde{h}_i^v$ are the gated representations of query and video respectively. Taking the concatenation of $\tilde{h}_i^v$ and $\tilde{g}_i$ as the input of the BiGRU, we can obtain the final cross-modal representations $o = \{o_i\}_{i=1}^{n_v}$.
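The guided cross gate can be sketched as follows; the weight layout (concatenating the cross-modal input with the high-level embedding) is our assumption about the parameterization, not necessarily the exact one used in the paper:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cross_gate(h_v, g_q, f_q, f_v, W_v, W_g):
    """Gate the frame representation h_v and the aggregated query
    representation g_q against each other, with the high-level
    visual-semantic embeddings f_q / f_v as extra guidance.
    W_v and W_g are hypothetical weight matrices whose rows have
    length len(g_q) + len(f_q) and len(h_v) + len(f_v)."""
    gate_v = [sigmoid(sum(w * x for w, x in zip(row, g_q + f_q))) for row in W_v]
    gate_q = [sigmoid(sum(w * x for w, x in zip(row, h_v + f_v))) for row in W_g]
    gated_v = [h * g for h, g in zip(h_v, gate_v)]
    gated_q = [q * g for q, g in zip(g_q, gate_q)]
    return gated_v, gated_q
```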

Proposal alignment. Based on the cross-modal representations $o$, we sample a fixed number of candidate proposals according to a set of ratios $r = \{r_1, \dots, r_{n_r}\}$ at each time step and score all of them in one single pass, where $n_r$ is the number of candidate proposals at each time step. Through dense sampling, the set of candidate proposals can be denoted as $\{(t_{ij}^s, t_{ij}^e)\}$, where $t_{ij}^s$ and $t_{ij}^e$ are the temporal boundaries of the $j$-th proposal at the $i$-th time step. The scoring for proposal alignment can be denoted as follows:

$$c_i = \sigma(W_p\, o_i + b_p) \quad (5)$$

where $W_p$ and $b_p$ are trainable parameters, $\sigma$ is the sigmoid function, and $c_i \in \mathbb{R}^{n_r}$ is the vector of alignment scores at the $i$-th time step.
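The dense multi-scale proposal sampling can be sketched as below; the rounding and boundary handling are our assumptions for illustration:

```python
def sample_proposals(num_steps, ratios):
    """Densely sample multi-scale candidate proposals ending at each
    time step; proposals that would start before the video begins
    are treated as illegal and dropped."""
    proposals = []
    for t in range(num_steps):
        for r in ratios:
            length = max(1, round(r * num_steps))
            start = t - length + 1
            if start >= 0:
                proposals.append((start, t))
    return proposals
```

For example, a 4-step video with ratios 0.25 and 0.5 yields proposals of length 1 and 2 at every legal position.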

Frame alignment. In a similar way, for frame alignment we compute a query-relevance score for each frame according to the cross-modal representations $o$, denoted by

$$s_i = \sigma(w_f^\top o_i + b_f) \quad (6)$$

where $w_f$ and $b_f$ are trainable parameters, $\sigma$ is the sigmoid function, and $s_i$ is the alignment score for the $i$-th frame of the video.

Guided gating. The fine-grained frame alignment can not only be regarded as an auxiliary task to enhance proposal alignment, but can also be leveraged as guided gating information for the predicted moment to highlight the key content relevant to the query and weaken the background content in the video. The query-aware video can be computed as follows:

$$\tilde{f}_i = s_i \cdot f_i \quad (7)$$

where $f_i$ is the visual feature of the $i$-th frame and $\tilde{f}_i$ is the visual feature of the $i$-th query-aware frame.
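A minimal sketch of this gating, which simply re-weights each frame feature by its frame-alignment score:

```python
def guided_gating(frame_feats, frame_scores):
    """Re-weight each frame feature vector by its frame-alignment
    score, highlighting query-relevant frames and suppressing
    background frames."""
    return [[s * x for x in feat] for feat, s in zip(frame_feats, frame_scores)]
```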

Then the query-aware video is employed as training data to fine-tune the video alignment module and improve the learned visual-semantic space. We compute the similarity score of video and query based on both the original and the improved common space, given by

$$S(v, q) = \cos(f^v, f^q) + \cos(\hat{f}^v, \hat{f}^q) \quad (8)$$

where $\hat{f}^v$ and $\hat{f}^q$ are the improved visual-semantic embeddings of videos and queries given by the fine-tuned video alignment module.

Training

In this section, we describe the training strategy and the loss function we devised. To provide a more intuitive explanation of our method, we present the training pseudocode of the proposed CFAN in Algorithm 1.

Algorithm 1. Training process of the proposed CFAN

Video alignment loss. We first adopt a video alignment loss that minimizes the distance between positive pairs of video and query while maximizing the distance between negative pairs, based on the bidirectional max-margin ranking loss [36], denoted by

$$\mathcal{L}_v = \sum_{(v,q)} \big[\Delta + S(v, q^-) - S(v, q)\big]_+ + \big[\Delta + S(v^-, q) - S(v, q)\big]_+ \quad (9)$$

where $v^-$ and $q^-$ represent a negative video for $q$ and a negative query for $v$ respectively, $S(v, q)$ is the cosine similarity of the visual-semantic embeddings of the given query and video, $[\cdot]_+ = \max(\cdot, 0)$, and $\Delta$ is the margin.
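A plain-Python sketch of the bidirectional max-margin ranking loss over a batch similarity matrix, where diagonal entries are the matched video-query pairs; the margin value here is illustrative:

```python
def ranking_loss(sim, margin=0.2):
    """Bidirectional max-margin ranking loss over an n x n similarity
    matrix whose diagonal entries score the matched video-query pairs."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # query j as a negative query for video i
            total += max(0.0, margin + sim[i][j] - sim[i][i])
            # video j as a negative video for query i
            total += max(0.0, margin + sim[j][i] - sim[i][i])
    return total / (n * (n - 1))
```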

Proposal alignment loss. During training, we determine the label of a candidate proposal according to its temporal IoU with the target moment. Moreover, to keep the numbers of positive and negative proposals in a fixed ratio, we mark some positive proposals with lower IoU as negative and assign zero to their labels. Given a video v, the proposal alignment loss is defined as a binary cross-entropy:

$$\mathcal{L}_p = -\frac{1}{n_v\, n_r}\sum_{i=1}^{n_v}\sum_{j=1}^{n_r} \big[y_{ij}\log c_{ij} + (1 - y_{ij})\log(1 - c_{ij})\big] \quad (10)$$

where $c_{ij}$ and $y_{ij}$ are respectively the confidence score and the discretized label for the $j$-th proposal at the $i$-th time step.
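The IoU-based labeling and scoring loss can be sketched as a binary cross-entropy; the positive threshold is an illustrative value, and the fixed positive/negative ratio trick described above is omitted for brevity:

```python
import math

def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end) pairs."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def proposal_loss(proposals, scores, target, pos_thresh=0.5):
    """Binary cross-entropy where a proposal is labeled positive iff
    its temporal IoU with the target moment reaches the threshold."""
    total = 0.0
    for p, s in zip(proposals, scores):
        y = 1.0 if temporal_iou(p, target) >= pos_thresh else 0.0
        s = min(max(s, 1e-7), 1.0 - 1e-7)  # clamp for numerical stability
        total -= y * math.log(s) + (1.0 - y) * math.log(1.0 - s)
    return total / len(proposals)
```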

Frame alignment loss. For frame alignment, we hope that higher scores are assigned to frames within the target moment and lower scores to frames outside it. Given a video v, the frame alignment loss is computed as follows:

$$\mathcal{L}_f = -\frac{1}{n_v}\sum_{i=1}^{n_v} \big[y_i\log s_i + (1 - y_i)\log(1 - s_i)\big] \quad (11)$$

where $y_i = 1$ if the $i$-th frame lies within the target moment and $y_i = 0$ otherwise.

We eventually employ a multi-task loss considering multi-level coarse-to-fine alignment to train our CFAN model, denoted by

$$\mathcal{L} = \mathcal{L}_v + \mathcal{L}_p + \lambda\,\mathcal{L}_f \quad (12)$$

where $\lambda$ is the trade-off hyper-parameter.

Hardest negative samples. Besides leveraging the fine-grained alignment information through the guided gating mechanism, we also introduce the hardest negative examples for video alignment to strengthen the training data [37] when fine-tuning the video alignment module for correlation re-estimation. The loss function is a modified version of the video alignment loss that selects the negative video and the negative query from the hardest negative samples in the minibatch instead of randomly selecting a negative sample.
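Hardest-negative selection within a minibatch can be sketched as picking, for each matched pair, the highest-scoring non-matching row and column entries of the similarity matrix:

```python
def hardest_negatives(sim):
    """For each matched pair (i, i) in an n x n similarity matrix,
    return the index of the most similar non-matching query (row-wise)
    and the most similar non-matching video (column-wise)."""
    n = len(sim)
    neg_query = [max((sim[i][j], j) for j in range(n) if j != i)[1]
                 for i in range(n)]
    neg_video = [max((sim[j][i], j) for j in range(n) if j != i)[1]
                 for i in range(n)]
    return neg_video, neg_query
```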

During evaluation, we can build a candidate set efficiently based on the cosine similarity, since the visual-semantic embeddings over all videos can be pre-computed, and select a video according to the more accurate similarity score that considers both the global contextual information and the fine-grained alignment information for further moment retrieval. Moreover, since the fine-grained but time-consuming re-estimation and moment retrieval are limited to the small candidate set, our method effectively achieves a balance between speed and accuracy.
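The coarse retrieval step, ranking pre-computed video embeddings by cosine similarity to the query embedding, can be sketched as:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def top_k_videos(query_emb, video_embs, k):
    """Rank pre-computed video embeddings by cosine similarity to the
    query embedding and return the indices of the top-k candidates."""
    order = sorted(range(len(video_embs)),
                   key=lambda i: cosine(query_emb, video_embs[i]),
                   reverse=True)
    return order[:k]
```

Only the videos in this small candidate set then go through the expensive fine-grained re-estimation.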

Experiments

In this section, we first introduce the datasets we used, the implementation details of our method and the evaluation criteria, and then compare our method with some existing state-of-the-art methods. Next, we conduct several ablation experiments to explore the impact of different steps of our algorithm. We also provide some quantitative results of retrieval to prove the effectiveness of our method.

Datasets

Our experiments are conducted on the ActivityNet Captions, DiDeMo, and Charades-STA datasets. They are publicly available datasets widely used for video understanding and video-text retrieval tasks.

ActivityNet Captions [18]. The ActivityNet Captions dataset connects over 20,000 untrimmed videos from the ActivityNet [38] dataset to temporally annotated sentences. Each sentence describes an event occurring within a unique segment of the video. On average, each video contains 3.65 temporally localized sentences, leading to a total of approximately 100,000 sentences. The length of each sentence averages 13.48 words and covers 36 seconds or about 31% of the video’s total duration. These sentences, when considered together, account for 94.6% of the entire video length. Furthermore, 10% of the temporal descriptions overlap, reflecting the co-occurrence of events. The dataset emphasizes action-centric descriptions, with a higher prevalence of verbs and pronouns compared to other datasets such as Visual Genome [18].

DiDeMo [8]. The DiDeMo dataset comprises over 10,000 25-30 second personal videos sourced from the YFCC100M dataset [39]. It includes 41,206 moment-query pairs split into training (33,005), validation (4,180), and test (4,021) subsets. Each video is segmented into 5-second intervals, with moments consisting of one or more consecutive segments. The dataset is specifically designed to localize moments with natural language descriptions. The descriptions in DiDeMo are verified to ensure they refer to specific moments in the video, making it one of the largest and most diverse video-language datasets for temporal localization. The dataset contains 26,892 moments, with descriptions provided by multiple annotators. The videos focus on personal activities, with detailed annotations including camera movements and time transitions.

Charades-STA [7]. The Charades-STA dataset builds upon the Charades dataset [40] and includes 12,408 moment-query pairs for training and 3,720 for testing. Charades originally provides video-level descriptions, but Charades-STA adds clip-level temporal annotations. A semi-automatic method was developed to generate these annotations: long sentences were split into sub-sentences, and temporal annotations were assigned to these sub-sentences by matching keywords with activity categories. Each sub-sentence is associated with a specific time span. In total, Charades-STA contains 13,898 clip-sentence pairs for training, 4,233 for testing, and 1,378 complex sentence queries for testing. The dataset focuses on household activities, and most descriptions follow a syntactic pattern, with sub-sentences connected by conjunctions like “then,” “while,” and “and.”

The primary quantitative information regarding the aforementioned three datasets is presented in Table 1.

Table 1. The details of the ActivityNet Captions, DiDeMo and Charades-STA datasets. The table includes the number of videos, number of queries, average video length and average query length.

https://doi.org/10.1371/journal.pone.0320661.t001

Implementation details

We train the CFAN model in an end-to-end manner using the PyTorch framework with four NVIDIA 3090 GPUs. Specifically, we apply pre-trained GloVe word embeddings [41] to extract initial textual features for each query. For ActivityNet Captions and Charades-STA, we employ a pre-trained 3D-ConvNet [42] to extract initial visual features and use PCA to reduce the feature dimension for each video. For DiDeMo, we follow the prior work [8] that uses the pre-trained VGG [43] to extract RGB features and a competitive activity recognition model [44] to extract optical flow features. The RGB features and optical flow features are fused by concatenation. For the model setting, we set the hidden size d of the model to 512. The number of clusters for query and video in NetVLAD is set to 16. To sample the candidate moments at each time step, the sample ratio is set separately for ActivityNet Captions, Charades-STA and DiDeMo, and illegal moments are removed from the candidate set. The trade-off hyper-parameter is set to 0.5. The whole CFAN model is trained by the Adam optimizer with learning rate 0.0006 for training and 0.0004 for fine-tuning. The size of the candidate set is set to 5.

Evaluation criteria

Following [7], we adopt the “R@n, IoU=m” accuracy as the evaluation metric of moment retrieval in video collections, where the average recall over all test queries is computed by determining whether one of the top-n predicted moments has Intersection over Union (IoU) larger than m. Moreover, we also compute “Recall@K” as the criterion for sentence-to-video retrieval, where the average recall over all test queries is determined by whether one of the top-K returned videos is the target video. “R@n, IoU=m” can be regarded as a criterion for moment-level retrieval evaluation, while “Recall@K” is a criterion for video-level retrieval evaluation. Note that the DiDeMo dataset contains multiple temporal annotations from different annotators for each description. Following [14], the predicted moment must have IoU larger than the specified m with at least two ground-truth moments.
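The “R@n, IoU=m” metric can be sketched as follows, with moments as (start, end) pairs; for simplicity this sketch compares against a single ground-truth moment per query, omitting the DiDeMo multi-annotator variant:

```python
def recall_at_n_iou(predictions, ground_truths, n, m):
    """'R@n, IoU=m': fraction of queries for which at least one of the
    top-n predicted moments has temporal IoU greater than m with the
    ground-truth moment."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0
    hits = sum(1 for preds, gt in zip(predictions, ground_truths)
               if any(iou(p, gt) > m for p in preds[:n]))
    return hits / len(ground_truths)
```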

Performance comparison

We compare our CFAN method with several existing methods to verify its effectiveness.

Chance. [14] For chance, moments across all videos are sampled and returned based on a uniform distribution.

Moment Prior. [14] The moment prior method samples a video from a uniform distribution and returns a moment based on the moment frequency prior [8].

MCN. [8] The MCN method for single video moment retrieval is scaled to the large-scale video collections by enumerating all the candidate moments exhaustively and returning the moment with the highest score.

CAL. [14] To retrieve a specific moment from video collections, the CAL method splits videos into clips, builds a clip database and minimizes the squared Euclidean distance between the moment’s visual features and the language feature.

M-DETR. [45] MDETR is an end-to-end modulated detector that conditions object detection on raw text queries, integrating text and image modalities early in its transformer-based architecture. Pre-trained on 1.3M text-image pairs, it achieves remarkable performance on tasks like phrase grounding and referring expression comprehension while effectively addressing the long-tail problem in object categories through few-shot fine-tuning.

UniVTG. [46] UniVTG is a unified framework that consolidates diverse video temporal grounding tasks and label types into a single model, enabling large-scale pretraining, zero-shot generalization, and flexible adaptation across various VTG tasks such as moment retrieval, highlight detection, and video summarization.

QD-DETR. [47] QD-DETR is a query-dependent detection transformer designed for video moment retrieval and highlight detection, enhancing query-video relevance by explicitly injecting query context via cross-attention and training on negative query-video pairs to improve saliency estimation. It features an input-adaptive saliency predictor and achieves satisfying results.

The overall performance evaluation results of our CFAN method and other baselines on the ActivityNet Captions, DiDeMo and Charades-STA datasets are shown in Table 2.

Table 2. Performance evaluation results on the ActivityNet Captions, DiDeMo and Charades-STA datasets. We show the results with the metric “R@n, IoU=m” where n ∈ {1, 10, 100}, m ∈ {0.5, 0.7}.

https://doi.org/10.1371/journal.pone.0320661.t002

Compared with all the other baselines, our method achieves substantial improvements across all datasets, which demonstrates the effectiveness of our method, including the interaction of multi-level coarse-to-fine alignment and the CFAN framework for moment retrieval from video collections. In particular, the results of “R@1, IoU=0.5” and “R@1, IoU=0.7” increase by 100%–780% over the prior CAL method for our task. Note that, due to the large number of candidate moments, the evaluation results are low for all the baselines based on clip retrieval, especially Chance, indicating the difficulty of our task and further illustrating the effectiveness of our method.

As the results in Table 2 show, our method also significantly outperforms the CAL method by a large margin on the ActivityNet Captions dataset, especially on the criteria “R@100, IoU=0.5” and “R@100, IoU=0.7”, which indicates the superiority of our method on this dataset. As shown in Table 1, there are more videos in the ActivityNet Captions dataset, and each video is much longer than in the other two datasets, leading to a large number of candidate clips for the CAL method. Searching for a specific clip relevant to the query in such a big clip database is extremely difficult and is susceptible to noise from clips of other videos. Instead, our method first retrieves a candidate set of videos based on the high-level visual-semantic space, and further leverages the fine-grained alignment information of moment retrieval for correlation re-estimation, which greatly reduces the noise from clips of other videos and further speeds up the retrieval.

Moreover, we can find that the overall evaluation results on Charades-STA are lower than on the other two datasets, and our method only obtains a smaller absolute improvement over CAL for “R@1, IoU=0.5” and “R@1, IoU=0.7”. This is because there are more similar videos that describe the same human activity in a similar sentence pattern, leading to more noise in the retrieval of the target video. Also, longer sentences can provide more information, while the average length of descriptions in the Charades-STA dataset is shorter than in the other two datasets, as shown in Table 1.

Ablation study

To verify the effectiveness of each module in our CFAN model and of the interaction of multi-level coarse-to-fine alignment information, we next conduct several ablation experiments. Specifically, we modify our method to generate the following ablation models:

CFAN(w/o. RE). We compute the similarity score without using the improved visual-semantic embeddings for correlation re-estimation.

CFAN(w/o. HN). Instead of selecting negative videos and queries from the hardest samples, we randomly choose a negative sample to fine-tune the video alignment module for correlation re-estimation.

CFAN(w/o. FA). We remove the frame alignment loss, so the frame alignment information is not used for proposal alignment enhancement or guided video gating.

CFAN(w/o. CG). Instead of integrating the high-level visual embeddings into the cross gate, we apply the ordinary cross gate without extra guidance.
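The hardest-negative selection ablated in CFAN(w/o. HN) can be sketched as in-batch mining with a max-of-hinges ranking loss in the spirit of VSE++ [37]. This is an illustrative sketch under our own assumptions (margin value, function names); CFAN(w/o. HN) corresponds to replacing the per-row/column maximum below with a randomly chosen negative.

```python
import numpy as np

def hardest_negative_loss(sim, margin=0.2):
    """sim: (B, B) similarity matrix between B videos and their B queries,
    where sim[i, i] is the matched (positive) pair. Returns the bidirectional
    max-of-hinges triplet loss using the hardest in-batch negatives."""
    B = sim.shape[0]
    pos = np.diag(sim)                      # similarity of matched pairs
    mask = np.eye(B, dtype=bool)
    neg = np.where(mask, -np.inf, sim)      # exclude the positive pairs
    hard_v = neg.max(axis=1)                # hardest negative query per video
    hard_q = neg.max(axis=0)                # hardest negative video per query
    loss_v = np.maximum(0.0, margin + hard_v - pos)
    loss_q = np.maximum(0.0, margin + hard_q - pos)
    return (loss_v + loss_q).mean()
```

Mining the hardest negative concentrates the gradient on the most confusing video-query pairs in the batch, which is why removing it (random negatives) weakens the video-level alignment in the ablation results below.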

The evaluation results of the ablation models on the ActivityNet Captions and DiDeMo datasets are shown in Tables 3 and 4, respectively. All the ablation models still achieve better performance than the baselines, again demonstrating the effectiveness of the overall framework. Compared with the other ablation models, CFAN(w/o. RE) achieves the worst performance on both the “Recall@1” and “Recall@10” criteria, indicating that the improved common visual-semantic space is helpful for re-estimating the correlation between video and query. The full model outperforms CFAN(w/o. HN), which demonstrates that incorporating the hardest negative samples can effectively boost the video-level alignment. The results of CFAN(w/o. FA), without the frame alignment loss, also show a decrease in performance, demonstrating that the frame alignment information can effectively boost the proposal alignment and the video alignment for more accurate retrieval. Moreover, CFAN(w/o. CG) achieves worse performance than the full model, indicating that the high-level visual embeddings can serve as guidance for fine-grained alignment. The evaluation results of CFAN(w/o. FA) and CFAN(w/o. CG) verify the effectiveness of the interaction of multi-level coarse-to-fine alignment information.

Table 3. Evaluation results of ablation study on the ActivityNet Captions dataset. n ∈ {1, 10}, m ∈ {0.5, 0.7} and K ∈ {1, 10}

https://doi.org/10.1371/journal.pone.0320661.t003

Table 4. Evaluation results of ablation study on the DiDeMo dataset. n ∈ {1, 10}, m ∈ {0.5, 0.7} and K ∈ {1, 10}

https://doi.org/10.1371/journal.pone.0320661.t004

Limitations and future work

The proposed method has two main limitations. First, it relies heavily on high-quality labeled training data, which is costly and labor-intensive to acquire. Second, its task-specific feature extraction and alignment strategies are sensitive to distributional shifts in video content and language expressions, limiting generalizability across domains. Future work could explore self-supervised or weakly supervised learning to reduce dependence on annotations and enhance domain adaptation techniques for greater robustness. Additionally, improving query comprehension and fine-grained alignment mechanisms could help address challenges in handling complex queries and capturing subtle video details.

Conclusion

In this paper, we study the task of moment retrieval from large-scale video collections, which aims to search for and localize the temporal boundary of a moment within a collection of numerous videos according to a given textual query. To make full use of both the global contextual information of videos and the fine-grained alignment information between videos and queries, we decompose the original problem into two mutually boosting subtasks: video retrieval from video collections and moment retrieval in a single video, and propose the coarse-to-fine alignment network (CFAN), which leverages multi-level coarse-to-fine alignment information. Extensive experiments on the ActivityNet Captions, DiDeMo and Charades-STA datasets demonstrate the effectiveness of the proposed method.

Acknowledgments

The authors are grateful to the editor and reviewers for their meticulous evaluation of the paper and the invaluable recommendations they offered to improve the overall quality of the manuscript.

References

  1. 1. Li Y, Tang G, Luo L, Zhang T, Yang T. Research on storage and retrieval of massive GPS and video surveillance data in cloud environment. Power Syst Big Data 2022;25(05):85–92.
  2. 2. Xiao Z, Wen J, Wang J, Xu G, Zhou T. Retrieval method of defect text for power equipment based on knowledge graph and entropy weight. Power Syst Big Data 2023;26(12):62–72.
  3. 3. Xu R, Xiong C, Chen W, Corso JJ. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In: Proceedings of the twenty-ninth AAAI conference on artificial intelligence; 2015.
  4. 4. Otani M, Nakashima Y, Rahtu E, Heikkilä J, Yokoya N. Learning joint representations of videos and sentences with web image search. In: European conference on computer vision; 2016. p. 651–67.
  5. 5. Mithun NC, Li J, Metze F, Roy-Chowdhury AK. Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In: Proceedings of the 2018 ACM on international conference on multimedia retrieval; 2018. p. 19–27. doi: https://doi.org/10.1145/3206025.3206064
  6. 6. Tay Y, Dehghani M, Tran VQ, Garcia X, Wei J, Wang X. UL2: Unifying language learning paradigms. arXiv. 2023.
  7. 7. Gao J, Sun C, Yang Z, Nevatia R. TALL: Temporal activity localization via language query. In: Proceedings of the IEEE international conference on computer vision; 2017. p. 5267–75.
  8. 8. Hendricks A, Wang O, Shechtman E, Sivic J, Darrell T, Russell B. Localizing moments in video with natural language. In: Proceedings of the IEEE international conference on computer vision; 2017. p. 5803–12.
  9. 9. Hendricks L, Wang O, Shechtman E, Sivic J, Darrell T, Russell B. Localizing moments in video with temporal language. In: Proceedings of the 2018 conference on empirical methods in natural language processing; 2018. p. 1380–90.
  10. 10. Spokoiny V. Mixed Laplace approximation for marginal posterior and Bayesian inference in error-in-operator model. Available from: https://arxiv.org/abs/2305.09336; 2023.
  11. 11. Xu H, He K, Plummer BA, Sigal L, Sclaroff S, Saenko K. Multilevel language and vision integration for text-to-clip retrieval. AAAI 2019;33(01):9062–9.
  12. 12. Zhang D, Dai X, Wang X, Wang Y, Davis L. MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2019. p. 1247–57.
  13. 13. Zhang Z, Lin Z, Zhao Z, Xiao Z. Cross-modal interaction networks for query-based moment retrieval in videos. In: ACM SIGIR; 2019. p. 655–64.
  14. 14. Escorcia V, Soldan M, Sivic J, Ghanem B, Russell B. Temporal localization of moments in video collections with natural language. arXiv preprint arXiv:1907.12763; 2019.
  15. 15. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A, et al. Attention is all you need. Adv Neural Inform Process Syst. 2017;30:5998–6008.
  16. 16. Arandjelovic R, Gronat P, Torii A, Pajdla T, Sivic J. NetVLAD: CNN architecture for weakly supervised place recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR); 2016. p. 5297–307. doi: https://doi.org/10.1109/cvpr.2016.572
  17. 17. Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS 2014 workshop on deep learning, December 2014; 2014.
  18. 18. Krishna R, Hata K, Ren F, Fei-Fei L, Niebles C. Dense-captioning events in videos. In: Proceedings of the IEEE international conference on computer vision; 2017. p. 706–15.
  19. 19. Yu Y, Ko H, Choi J, Kim G. Video captioning and retrieval models with semantic attention. arXiv. 2016.
  20. 20. Yu Y, Ko H, Choi J, Kim G. End-to-end concept word detection for video captioning, retrieval, and question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 3165–73.
  21. 21. Zhang B, Hu H, Sha F. Cross-modal and hierarchical modeling of video and text. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 374–90.
  22. 22. Miech A, Laptev I, Sivic J. Learning a text-video embedding from incomplete and heterogeneous data. arXiv preprint. 2018.
  23. 23. Dong J, Li X, Xu C, Ji S, He Y, Yang G. Dual encoding for zero-example video retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2019. p. 9346–55.
  24. 24. Shen X, Huang Q, Lan L, Zheng Y. Contrastive transformer cross-modal hashing for video-text retrieval. In: Larson K, editor. Proceedings of the thirty-third international joint conference on artificial intelligence, IJCAI-24. International Joint Conferences on Artificial Intelligence Organization; 2024. p. 1227–35. doi: https://doi.org/10.24963/ijcai.2024/136
  25. 25. Panta L, Shrestha P, Sapkota B, Bhattarai A, Manandhar S, Sah AK. Cross-modal contrastive learning with asymmetric co-attention network for video moment retrieval; 2023. Available from: https://arxiv.org/abs/2312.07435
  26. 26. Liu M, Wang X, Nie L, Tian Q, Chen B, Chua T. Cross-modal moment localization in videos. In: Proceedings of the 2018 ACM multimedia conference; 2018. p. 843–51.
  27. 27. Papaderos P, Östlin G, Breda I. Bulgeless disks, dark galaxies, inverted color gradients, and other expected phenomena at higher z. A&A. 2023;673:A30. doi: https://doi.org/10.1051/0004-6361/202245769
  28. 28. Lu W. Clifford algebra Cl(0,6) approach to beyond the standard model and naturalness problems. Int J Geom Methods Mod Phys. 2023;21(05).
  29. 29. Wang C, Erfani S, Alpcan T, Leckie C. OIL-AD: An anomaly detection framework for sequential decision sequences. arXiv. 2024.
  30. 30. Mithun N, Paul S, Roy-Chowdhury A. Weakly supervised video moment retrieval from text queries. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2019. p. 11592–601.
  31. 31. Tan R, Xu H, Saenko K, Plummer B. wMAN: Weakly-supervised moment alignment network for text-based video segment retrieval. arXiv preprint. 2019.
  32. 32. Gao M, Davis LS, Socher R, Xiong C. WSLLN: Weakly supervised natural language localization networks. arXiv preprint. 2019.
  33. 33. Zhang Z, Zhao Z, Lin Z, Song J, Cai D. Localizing unseen activities in video via image query. In: Proceedings of the 28th international joint conference on artificial intelligence. AAAI Press; 2019. p. 4390–96.
  34. 34. Yuan Y, Ma L, Zhu W. Sentence specified dynamic video thumbnail generation. In: Proceedings of the 27th ACM international conference on multimedia; 2019. p. 2332–40.
  35. 35. Feng Y, Ma L, Liu W, Zhang T, Luo J. Video re-localization. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 51–66.
  36. 36. Socher R, Karpathy A, Le QV, Manning CD, Ng AY. Grounded compositional semantics for finding and describing images with sentences. TACL. 2014;2:207–18.
  37. 37. Faghri F, Fleet D, Kiros J, Fidler S. VSE++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612; 2017.
  38. 38. Caba-Heilbron F, Escorcia V, Ghanem B, Niebles CJ. Activitynet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR); 2015. p. 961–70.
  39. 39. Thomee B, Shamma D, Friedland G, Elizalde B, Ni K, Poland D, et al. YFCC100M: The new data in multimedia research. arXiv preprint. 2015.
  40. 40. Sigurdsson GA, Varol G, Wang X, Farhadi A, Laptev I, Gupta A. Hollywood in homes: Crowdsourcing data collection for activity understanding. arXiv:1604.01753. 2016.
  41. 41. Pennington J, Socher R, Manning C. Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP); 2014. p. 1532–43.
  42. 42. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. Learning spatiotemporal features with 3D convolutional networks. In: 2015 IEEE international conference on computer vision (ICCV). 2015. p. 4489–97. doi: https://doi.org/10.1109/iccv.2015.510
  43. 43. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint. 2014.
  44. 44. Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, et al. Temporal segment networks: Towards good practices for deep action recognition. In: European conference on computer vision. Springer; 2016. p. 20–36.
  45. 45. Kamath A, Singh M, LeCun Y, et al. MDETR–modulated detection for end-to-end multi-modal understanding. arXiv preprint arXiv:2104.12763; 2021.
  46. 46. Lin K, Zhang P, Chen J, Pramanick S, Gao D. UniVTG: Towards unified video-language temporal grounding. In: Proceedings of the IEEE international conference on computer vision; 2023. p. 2794–804.
  47. 47. Moon W, Hyun S, Park S, et al. Query-dependent video representation for moment retrieval and highlight detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2023. p. 23023–33.