UFM: Unified feature matching pre-training with multi-modal image assistants

Yide Di; Yun Liao; Hao Zhou; Kaijun Zhu; Qing Duan; Junhui Liu; Mingyu Lu

doi:10.1371/journal.pone.0319051

Abstract

Image feature matching, a foundational task in computer vision, remains challenging for multimodal image applications, often necessitating intricate training on specific datasets. In this paper, we introduce a Unified Feature Matching pre-trained model (UFM) designed to address feature matching challenges across a wide spectrum of modal images. We present Multimodal Image Assistant (MIA) transformers, finely tunable structures adept at handling diverse feature matching problems. UFM exhibits versatility in addressing both feature matching tasks within the same modal and those across different modals. Additionally, we propose a data augmentation algorithm and a staged pre-training strategy to effectively tackle challenges arising from sparse data in specific modals and imbalanced modal datasets. Experimental results demonstrate that UFM excels in generalization and performance across various feature matching tasks. The code will be released at: https://github.com/LiaoYun0x0/UFM.

Citation: Di Y, Liao Y, Zhou H, Zhu K, Duan Q, Liu J, et al. (2025) UFM: Unified feature matching pre-training with multi-modal image assistants. PLoS ONE 20(3): e0319051. https://doi.org/10.1371/journal.pone.0319051

Editor: Paulo Eduardo Teodoro, Federal University of Mato Grosso do Sul, BRAZIL

Received: October 22, 2024; Accepted: January 26, 2025; Published: March 31, 2025

Copyright: © 2025 Di et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The datasets analyzed during the current study are available from the following public domain resources: https://github.com/hpatches, http://www.ok.sc.e.titech.ac.jp/INLOC/, https://data.ciirc.cvut.cz/public/projects/2020VisualLocalization/Aachen-Day-Night/, https://mediatum.ub.tum.de/1474000, https://cs.nyu.edu/silberman/datasets/nyudepthv2.html, http://matthewalunbrown.com/nirscene/nirscene.html, https://github.com/AmberHen/WHU-OPT-SAR-dataset, http://www.ti.uni-bielefeld.de/html/people/ddiffert/databases uvg.html, https://brainweb.bic.mni.mcgill.ca/brainweb/.

Funding: This work was supported by the National Natural Science Foundation of China under Grant 61976124 and 62372077 to M.L. This work was also supported by the Scientific Research Fund of Yunnan Provincial Education Department under Grant 2021J0007 to Y.L. and Q.D.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Feature matching, as a fundamental task in computer vision, serves to establish correspondences between local features in different images, facilitating the advancement of various downstream applications such as image fusion [1,2], image stitching [3,4], and 3D reconstruction [5,6], among others. With the proliferation of computer vision applications, the demand for precise feature matching of diverse multi-modal images has grown significantly. When researchers address the task of specific modal images, they are required to identify the appropriate feature matching method tailored to the corresponding modal image. This process typically involves utilizing substantial amounts of relevant training data, leading to substantial resource consumption. Thus, the development of a unified pre-trained comprehensive model for feature matching across a wide range of modals becomes increasingly imperative.

The concept of pretraining-fine-tuning methods initially emerged in the realm of natural language processing [7,8]. Given its remarkable performance, this approach was swiftly extended to the field of computer vision [9,10] and multimodal applications [11,12], offering a means to achieve more efficient results with reduced resources and time. Nevertheless, within the domain of multi-modal image feature matching, a comprehensive and effective unified framework has been notably absent, limiting the ability of researchers to make rapid advancements in multi-modal image feature matching through fine-tuning on a pre-trained foundational model.

Feature matching tasks can be broadly categorized into two types: matching features within images of the same modal and matching features across images of different modals. As illustrated in Fig 1, feature matching tasks for images of the same modal often involve significant variations in sensor positions and angles. For instance, the left image might capture the front of a building while the right image could be taken from the left side, exhibiting notable disparities in time, location, and angle between the two images. The primary objective in matching features within images of the same modal is to precisely align the poses of objects depicted in different pictures. Conversely, in the context of feature matching tasks involving images of different modals, the sensor positions typically exhibit minimal differences, serving to leverage the complementary information provided by the distinct modals present in the images.

Fig 1. Illustration of feature matching of UFM. When working with specific data, the pre-trained backbone is frozen, and only the corresponding modal assistants need to be fine-tuned for feature matching. The multi-modal assistants contain both same-modal assistants and different-modal matching assistants.

https://doi.org/10.1371/journal.pone.0319051.g001

Previous studies have introduced numerous techniques [13–15] for facilitating feature matching within images of the same modal, along with several methods [16,17] dedicated to achieving feature matching across specific pairs of modals. While some approaches [18,19] have attempted to address feature matching across multimodal images [20], their applicability is often restricted to specific image modals, necessitating extensive training for specialized tasks.

In this paper, we introduce the Unified Pre-trained Feature Matching (UFM) model, designed to facilitate multi-modal image feature matching across a wide range of image modals. UFM not only enables feature matching within images of the same modal but also supports feature matching across various modals. This capability is achieved through the utilization of a Multi-Modal Image Assistant (MIA) Transformer. By incorporating a diverse set of modal assistants to aid the feedforward network within the standard transformer, the MIA Transformer can effectively capture specific modal information. Furthermore, the model leverages cross-modal shared self-attention and cross-attention mechanisms to capture inter-modal information. The modal assistants encompass two distinct types: those designed for the same image modal and those tailored for different image modals. Leveraging its modeling flexibility, the MIA Transformer, equipped with shared parameters, can be repurposed for various tasks.

To address the disparities in data distribution among different modal image data and to compensate for the scarcity of specific modal image data, we have devised a data augmentation methodology to expand the datasets of various modal images. Additionally, UFM incorporates a staged pre-training strategy, commencing with an initial pre-training phase on the same modal data and followed by subsequent pre-training on a dataset inclusive of various modals. Given that the feature matching data for images of the same modal significantly outweighs the corresponding data for different modals, this staged pre-training strategy significantly augments the data available for pre-training, thereby enhancing the generalization capabilities of UFM. Experimental results attest to the exceptional performance of UFM in both the feature matching tasks of images from the same modal and the feature matching tasks of images from different modals.

Our main contributions are as follows:

We introduce a unified pre-trained model for multi-modal image feature matching, denoted as UFM, which is adaptable to a wide array of modal image feature matching tasks.
We present the Multi-Modal Image Assistant (MIA) Transformer, a novel framework that captures specific modal information through modal assistants. Furthermore, MIA Transformer enhances the efficacy of feature matching for different modal images by facilitating cross-modal feature fusion.
Our proposed method demonstrates the ability to handle feature matching tasks for images of the same modal with significant disparities in sensor poses, as well as for feature matching tasks between images of different modals.
Leveraging data augmentation techniques, we expand the datasets of modal images with limited data, while the staged pre-training strategy significantly enhances the effectiveness of pre-training for multi-modal image feature matching.

2 Related work

In the realm of feature matching, methods can be categorized into two types: detector-based and detector-free. Beyond methodological distinctions, tasks in this domain can also be classified into two broad categories: conventional feature matching tasks and multi-modal image feature matching tasks.

Detector-based feature matching. The detector-based method unfolds in three key stages: feature detection, feature description, and feature matching. In the feature detection and description stage, interest points with descriptors are generated, followed by the establishment of point-to-point correspondences through a suitable matching algorithm. Detector-based methods can be further categorized into handcrafted descriptor methods and deep learning descriptor methods. Among the prominent handcrafted descriptor methods is SIFT [21], renowned for its simplicity and efficiency, making it versatile across various computer vision tasks. Subsequent works [22–24] have iteratively improved upon the SIFT algorithm, enhancing its performance. Deep learning descriptor methods leverage convolutional neural networks to glean deep features and capture nonlinear expressions, uncovering valuable hidden information. Sarlin et al. [25] introduced SuperGlue, featuring a flexible attention-based context aggregation mechanism capable of jointly reasoning about the underlying 3D scene and feature assignment. Jiang et al. [26] proposed GLMNet, a graph learning-matching network utilizing graph convolutional networks to adaptively learn optimal graphs for the graph matching task.

Detector-free feature matching. The detector-free method eliminates the need for extracting interest points, opting instead to directly extract features in image pairs through transformers. This category can be further divided into semi-dense matching methods and dense matching methods. Semi-dense matching methods typically employ a coarse-to-fine matching process, achieving final precise matching results on 1/2-size feature maps. Notably, a pioneering contribution is the LoFTR algorithm by Sun et al. [27], which introduces global receptive fields in Transformers to produce semi-dense matches, particularly excelling in low-texture regions. Subsequent algorithms [28–32] have iteratively built upon and improved LoFTR as a baseline. Dense matching methods, on the other hand, extract all matches between views directly on the original image, aiming to estimate each matched pair of pixels. Truong et al. introduced PDC-Net [33] and PDC-Net+ [34], formulating bias estimates in a probabilistic manner and pairing proposed feature correspondences with deterministic estimates via a mixture model. Edstedt et al. presented DKM [35] and RoMa [36]. While DKM achieves superior matching accuracy by utilizing depthwise separable large kernels and local correlations with stacked feature maps as input, RoMa combines the strengths of semi-dense and dense matching methods. It designs a coarse-to-fine dense matching framework and achieves state-of-the-art matching accuracy. Although dense matching methods are more accurate than their semi-dense counterparts, they require more computational resources and training time. Weighing these factors, we adopt a semi-dense matching framework in designing UFM.

Multi-modal image feature matching. The significant disparities among sensors in multi-modal images pose challenges for conventional matching methods, rendering them less applicable for multi-modal image feature matching. Consequently, numerous scholars have introduced algorithms specifically tailored for this purpose. Zhu et al. [37] presented R2FD2, a detector-based multi-modal image feature matching algorithm. Initially, they employed the reproducible feature detector MALG to identify interest points and subsequently utilized the feature descriptor RMLG for feature representation and matching. Hu et al. [19] proposed a multi-scale structural feature transform (MSFT) method designed for multi-modal image matching. This approach detects scale-invariant feature points based on the Gaussian difference image pyramid of the phase congruency map, addressing challenges arising from nonlinear radiation distortion. In the detector-free domain, Di et al. [18] introduced FeMIP, a multi-modal feature matching algorithm. Employing a coarse-to-fine approach, FeMIP achieves accurate feature matching. Notably, it incorporates a policy gradient method to address issues related to the discreteness of matching. While there exist several commendable multi-modal image feature matching algorithms, they often exhibit limitations in terms of the types of multi-modal images they are suited for. Furthermore, these algorithms may require substantial training on specific datasets and may not be well-suited for addressing feature matching challenges within images of the same modal. Therefore, the development of a unified feature matching algorithm, accommodating the majority of modal images, is of utmost importance.

Pre-training – fine-tuning. As the transformer paradigm has evolved, the pretraining-fine-tuning approach has become instrumental in advancing the state of the art across diverse domains such as natural language processing [38,39], computer vision [40,41], and multi-modal tasks [42,43]. Zaken et al. [44] introduced BitFit, a sparse fine-tuning method tailored for natural language processing. Leveraging the modification of bias terms exclusively, this method demonstrates remarkable performance on extensive training data. Sohn et al. [45] presented a method for learning vision transformers through generative knowledge transfer. Their approach incorporates a novel prompt design that strategically places learnable tokens within image token sequences. Radford et al. [11] proposed CLIP, a multi-modal pretraining-fine-tuning model that exhibits effective knowledge transfer across tasks without the need for dataset-specific training. Despite the notable success of the pre-training-fine-tuning technique in various domains, its application in the realm of multi-modal image feature matching has been limited. Consequently, we have devised a unified large model, UFM, for multi-modal image feature matching, achieving comprehensive pre-training on extensive multi-modal image data.

3 Methodology

Given an image pair from most modals, the Unified Feature Matching (UFM) approach can effectively derive its image pair representation using a Multimodal Image Assistant (MIA) Transformer network. The UFM method exhibits remarkable versatility, enabling it to address feature matching tasks not only within the same modal but also across different modals, even when the sensor positions vary significantly. Additionally, in handling the feature matching tasks of diverse modal images, the process of cross-modal feature fusion can further enhance the overall effectiveness of the approach.

3.1 Multi-modal image assistants transformer

Similar to previous methods [18,27,28], UFM adopts a coarse-to-fine semi-dense matching approach for feature matching. The input image pairs are processed by the FPN network to generate features at 1/8 and 1/2 of the input size. The 1/8-size features are trained using the augmented GT matrix as labels, yielding coarse matching results at the patch level. Subsequently, the 1/2-size features are precisely matched with the coarse matching results to derive the final matching results at the pixel level. Distinguishing itself from other methods, UFM introduces a novel architecture, departing from the conventional use of transformers in coarse and fine matching stages.

In this study, we propose a novel Multi-modal Image Assistant (MIA) Transformer tailored for feature matching tasks, as depicted in Fig 2. The UFM model encompasses all the processes involved in image enhancement, feature extraction, coarse matching, and fine matching. The MIA Transformer is utilized in both the coarse and fine matching stages. The MIA Transformer integrates a multi-modal image assistant with a generic feedforward network. The Unified Feature Matching (UFM) framework comprises a total of L layers. Given the output vector G_(l-1) from the generic feedforward network of the preceding layer and the output vector A_(l-1) from the assistant feedforward network of the previous layer, the approach leverages multi-head self-attention and cross-attention (MSCA) mechanisms, shared across modals, to align the content of a pair of images. The input vector for each layer can be computed as follows:

Fig 2. The overview of UFM and the MIA transformer. The UFM model encompasses all the processes involved in image enhancement, feature extraction, coarse matching, and fine matching. The MIA Transformer is utilized in both the coarse and fine matching stages.

https://doi.org/10.1371/journal.pone.0319051.g002

(1)

where LN is short for layer normalization. MIA_FFN is designed to select the appropriate assistant from a collection of multiple modal assistants for processing the input vector. Notably, it encompasses two distinct types of modal assistants: those tailored for the same modal and those dedicated to different modals. When the input consists of image vectors from the same modal, the corresponding assistant associated with that modal is employed for encoding the images. Conversely, in cases where the input includes image vectors from diverse modals, such as optical-SAR, the optical and SAR assistants are utilized to encode the respective modal vectors at the underlying Transformer layer. Subsequently, the optical-SAR assistant is employed at the top layer to capture more comprehensive modal interactions. The output vector V_l for each layer can be computed as follows:

(2)
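The assistant selection performed by MIA_FFN can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the assistant naming scheme ("X-X" for a same-modal assistant, "X-Y" for a cross-modal one), and the parameter top_layers (the number of top layers that use the cross-modal assistant) are assumptions made for illustration only.

```python
from typing import List, Tuple


def select_assistants(layer: int, num_layers: int, top_layers: int,
                      modal_a: str, modal_b: str) -> Tuple[str, str]:
    """Return the assistant names applied to image a and image b at a 0-indexed layer."""
    if not 0 <= layer < num_layers:
        raise ValueError("layer index out of range")
    if modal_a == modal_b:
        # Same-modal pair: the assistant of that modal is used at every layer.
        name = f"{modal_a}-{modal_a}"
        return name, name
    if layer < num_layers - top_layers:
        # Lower layers: each image is encoded by the assistant of its own modal.
        return f"{modal_a}-{modal_a}", f"{modal_b}-{modal_b}"
    # Top layers: a shared cross-modal assistant models the modal interactions.
    pair = f"{modal_a}-{modal_b}"
    return pair, pair


def routing_table(num_layers: int, top_layers: int,
                  modal_a: str, modal_b: str) -> List[Tuple[str, str]]:
    """List the assistants used at every layer for one image pair."""
    return [select_assistants(layer, num_layers, top_layers, modal_a, modal_b)
            for layer in range(num_layers)]
```

For example, routing_table(9, 3, "optical", "SAR") assigns the optical and SAR assistants to the first six layers and the optical-SAR assistant to the last three. The value 3 is purely illustrative and is not taken from the paper.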

3.2 Data augmentation

UFM necessitates a comprehensive pre-training process involving images from all modals. However, significant discrepancies often exist in the availability of image data across different modals. For instance, optical images commonly offer a wealth of data, whereas long-infrared images may be relatively scarce. To address the nonuniformity in data distribution among different modals and to mitigate the scarcity of data in certain modals, this paper introduces a data augmentation technique. The primary objective of this technique is to achieve a more balanced data distribution for the image dataset of each modal and enhance the generalizability of UFM.

As shown in Fig 3, given an image pair Image_a and Image_b, a sequence of data augmentation procedures is initially applied. Subsequently, these enriched image pairs are utilized to produce a comprehensive pixel matching label GT_matrix. The augmentation process includes mirroring, flipping, and rotating the input images, significantly enhancing the diversity within the dataset. In addition, the processed image is randomly cropped. Finally, random noise is added to the cropped block map, and some pixels are randomly masked [20].

Fig 3. Data augmentation is applied both geometrically and in terms of intensity. Geometrically, the images are mirrored, flipped, rotated, and randomly cropped. For intensity augmentation, random noise is added, and random masking is applied. Finally, a square matrix (GT matrix) is used to represent the correspondence of matching points between the two images. The GT matrix is a square matrix of N × N dimensions. GT(i, j) represents the element of the i-th row and the j-th column in the GT matrix. The shown input image pairs take optical and SAR image pairs as an example.

https://doi.org/10.1371/journal.pone.0319051.g003

The cropped pair of images is defined as I_a and I_b. The two images have the same size, with height h and width w. Each image is divided into N image patches. The patch coordinates can be defined as , where ; and p is the size of the patch. A random mask is applied to between 20% and 40% of these image patches. The location of the random mask is defined as M and can be expressed as

(3)

For positions that are not masked, the central points of the patches in I_a can be defined as , which can be calculated as

(4)

The coordinates of the points in Image_b corresponding to the are defined as . They can be obtained by running a series of data augmentation operations (Mirror, flip, rotate, etc.) in reverse for coordinates . Then the corresponding patch coordinates of them can also be extracted as

(5)

where [o] denotes o rounded down to the nearest integer.

Then we define a GT matrix to represent the matching of the image patches after data augmentation. I_a partially overlaps with I_b, so the corresponding patch coordinates may be inside or outside of I_b. If is in image I_b, then

(6)

where GT(i, j) = 1 indicates that the unmasked i-th patch in I_a matches the unmasked j-th patch in I_b.
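The construction of the GT matrix can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: patches are indexed in row-major order, the patch centre is taken as the geometric centre of the patch, and a_to_b is a hypothetical callable that maps a point of I_a to its location in I_b (it stands in for the coordinate mapping derived from the augmentation operations).

```python
import random
from typing import Callable, List, Set, Tuple

Point = Tuple[float, float]


def random_mask(num_patches: int, rng: random.Random) -> Set[int]:
    """Pick a random set of patch indices covering 20% to 40% of all patches."""
    ratio = rng.uniform(0.2, 0.4)
    return set(rng.sample(range(num_patches), int(ratio * num_patches)))


def build_gt_matrix(h: int, w: int, p: int,
                    a_to_b: Callable[[Point], Point],
                    mask_a: Set[int], mask_b: Set[int]) -> List[List[int]]:
    """Build the N x N matrix with GT[i][j] = 1 when patch i of I_a matches patch j of I_b."""
    rows, cols = h // p, w // p
    n = rows * cols
    gt = [[0] * n for _ in range(n)]
    for i in range(n):
        if i in mask_a:
            continue  # masked patches in I_a receive no label
        r, c = divmod(i, cols)
        # Centre of patch i in I_a (assumed to be the geometric centre).
        x, y = (c + 0.5) * p, (r + 0.5) * p
        xb, yb = a_to_b((x, y))
        # A point mapped outside I_b has no corresponding patch.
        if not (0 <= xb < cols * p and 0 <= yb < rows * p):
            continue
        # Round down to obtain the patch coordinates in I_b.
        j = int(yb // p) * cols + int(xb // p)
        if j not in mask_b:
            gt[i][j] = 1
    return gt
```

With a horizontal mirror, a_to_b = lambda pt: (w - pt[0], pt[1]), every unmasked patch of I_a is matched to the patch in the mirrored column of the same row.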

3.3 Pre-training

As presented in Fig 4, our proposed staged pre-training strategy aims to enhance the image matching model’s performance by leveraging a large-scale image dataset from the same modality. The pre-training is divided into 3 stages: (1) pre-train the general FFN, (2) pre-train all X-X assistants, and (3) pre-train all X-Y assistants. In stage 1, given that optical images provide a rich source of feature matching data, we initially conduct pre-training on multi-head attention and the generic feedforward network (FFN) using a substantial collection of pure optical images. In stage 2, we pre-train feature matching for all of the same-modal image pairs (X-X). The pre-training of all modal images (X-X, Y-Y, Z-Z, ...) in stage 2 can be performed in parallel. We freeze the multi-head self- and cross-attention at this stage, which greatly improves the efficiency of training. In stage 3, we further pre-train the feature matching of all cross-modal images (X-Y) on the basis of stage 2. At this stage, we adjust all the attention and the corresponding FFNs to maximize cross-modal matching. The three FFNs (X-X, Y-Y, X-Y) corresponding to the two modal images are pre-trained simultaneously.

Fig 4. Illustration of pre-training. The pre-training of NIR and SAR images is taken here as an example. Pre-training consists of 3 stages: (1) pre-train the general FFN, (2) pre-train all X-X assistants, and (3) pre-train all X-Y assistants.

https://doi.org/10.1371/journal.pone.0319051.g004

A notable consideration is the adoption of frozen attention blocks, as previously introduced in [46,47], to potentially enhance the pre-training for the same modal. Consequently, during the pre-training phase for the same modal, we keep the parameters of the multi-head attention and of the other modal assistants frozen, and train only the assistant FFN tailored to the specific modal under consideration. When training feature matching assistants for different modals, pre-training is conducted separately for each modal so that the parameters are tuned accordingly. Notably, in scenarios involving different modals, such as Opt-SAR, the parameters for multi-head attention are not frozen, facilitating a more comprehensive adaptation to the diverse modals.
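The freezing schedule of the three stages can be summarized in a short sketch. The group names ("attention", "generic_ffn", "X-X", "X-Y") and the function are hypothetical. The text does not state whether the generic FFN is updated in stage 3; the sketch assumes it stays frozen.

```python
from typing import Set, Tuple


def trainable_groups(stage: int, modals: Tuple[str, ...] = ()) -> Set[str]:
    """Return the parameter groups that are updated in a given pre-training stage."""
    if stage == 1:
        # Stage 1: attention and generic FFN are trained on optical pairs.
        return {"attention", "generic_ffn"}
    if stage == 2:
        # Stage 2: only the assistant of the current modal X is trained.
        (x,) = modals
        return {f"{x}-{x}"}
    if stage == 3:
        # Stage 3: attention plus the three assistants of the pair (X, Y).
        # Assumption: the generic FFN remains frozen here.
        x, y = modals
        return {"attention", f"{x}-{x}", f"{y}-{y}", f"{x}-{y}"}
    raise ValueError("stage must be 1, 2 or 3")
```

Because the stage 2 groups of different modals are disjoint and the shared attention is frozen, the stage 2 runs do not interfere with one another, which is consistent with running them in parallel.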

Conventional general multi-modal image matching methods are typically trained solely on limited image data from various modals, posing a challenge in achieving optimal results. In contrast, datasets containing images of the same modal are comparatively more accessible. Leveraging the staged pre-training strategy enables the model to undergo initial pre-training on a same-modal dataset, followed by subsequent pre-training on datasets encompassing different modals. This approach significantly augments the volume of data available for pre-training and substantially enhances the model’s overall generalization capacity.

By incorporating this staged pre-training strategy, the model can effectively leverage the advantages of the larger same-modal datasets during the initial training phase. Subsequently, through continued pre-training on diverse-modal datasets, the model can gain a more comprehensive understanding of the variations and nuances across different modals, thus improving its adaptability and performance across a broader range of modals.

In the specific pre-training process of feature matching, UFM is also trained in two stages: coarse matching and fine matching. In coarse matching, dual-softmax is used to obtain the matching probabilities, and the coarse matching loss can be expressed as

(7)

where GT_i,j denotes the GT matrix, P(i, j) represents the probability of the correct matching and n is the number of feature points.
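The dual-softmax operation can be sketched as follows. The matching probability is the product of a softmax over each row and a softmax over each column of the score matrix, as in LoFTR-style matchers. The loss shown here, a negative log-likelihood averaged over the positions where the GT matrix equals 1, is an assumed form for illustration and may differ from Eq (7) in its normalization.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax of a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dual_softmax(scores: Matrix) -> Matrix:
    """P(i, j) = softmax over row i at j times softmax over column j at i."""
    row_sm = [softmax(row) for row in scores]
    col_sm = [softmax(list(col)) for col in zip(*scores)]  # col_sm[j][i]
    return [[row_sm[i][j] * col_sm[j][i] for j in range(len(scores[0]))]
            for i in range(len(scores))]


def coarse_loss(scores: Matrix, gt: List[List[int]]) -> float:
    """Assumed loss: mean negative log-probability over the GT matches."""
    probs = dual_softmax(scores)
    terms = [-math.log(probs[i][j])
             for i, row in enumerate(gt)
             for j, label in enumerate(row) if label == 1]
    return sum(terms) / len(terms) if terms else 0.0
```

A score matrix with a dominant diagonal yields probabilities close to 1 on the diagonal and a small loss when the GT matrix is the identity.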

Following [49], we use the epipolar loss. The epipolar constraint states that j^T F i = 0 holds if i and j are truly matched, where F is the fundamental matrix and Fi can be interpreted as the epipolar line in I_b corresponding to i. The epipolar loss is defined as the distance between the predicted corresponding position and the ground-truth epipolar line:

(8)

where h_1→2(i) is the predicted correspondence in I_b for the point i in I_a, and dist(·, ·) is the distance between a point and a line.
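The point-to-line distance in the epipolar loss can be computed as in the following sketch. The function names are illustrative, and F is assumed to be given as a 3 × 3 nested list acting on homogeneous pixel coordinates (x, y, 1).

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Line = Tuple[float, float, float]


def epipolar_line(F: List[List[float]], i: Point) -> Line:
    """Return the line (a, b, c) = F * (x, y, 1), meaning a*x' + b*y' + c = 0 in I_b."""
    x, y = i
    a, b, c = (F[r][0] * x + F[r][1] * y + F[r][2] for r in range(3))
    return a, b, c


def point_line_distance(pt: Point, line: Line) -> float:
    """Euclidean distance between a 2D point and a line a*x + b*y + c = 0."""
    a, b, c = line
    return abs(a * pt[0] + b * pt[1] + c) / math.hypot(a, b)


def epipolar_loss(F: List[List[float]], i: Point, predicted_j: Point) -> float:
    """Distance between the predicted match in I_b and the epipolar line of i."""
    return point_line_distance(predicted_j, epipolar_line(F, i))
```

For a camera that translates purely along the x-axis, F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]] and the epipolar line of a point (x, y) is the horizontal line y' = y, so the loss reduces to the vertical offset of the predicted match.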

The epipolar loss itself only encourages the predicted match to lie on the epipolar line rather than close to the ground truth correspondence. We also need to introduce a cycle consistency loss to encourage the forward and backward mapping of a point to be spatially close to itself:

(9)

For a pair of enhanced images I_a and I_b, the dense feature descriptors extracted by MIA transformer are defined as M₁ and M₂. To compute the correspondence for a query point i in I_a, we correlate the feature descriptor at i, denoted by M₁(i), with all of M₂. A 2D pixel location distribution of I_b is obtained, indicating the probability corresponding to each location and i in I_a. The probability distribution can be expressed as . A single 2D match can then be computed as the expectation of this distribution:

(10)

where y varies over the pixel grid of I_b.
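The expectation in Eq (10) and the total variance used as an uncertainty measure can be sketched as follows. The sketch assumes that the probability distribution is a softmax over the correlation scores between M₁(i) and every location of M₂. The function name and the nested-list layout corr[row][col] are illustrative.

```python
import math
from typing import List, Tuple


def expected_match(corr: List[List[float]]) -> Tuple[Tuple[float, float], float]:
    """Return the expected (x, y) location in I_b and the total variance of the distribution."""
    peak = max(max(row) for row in corr)
    # Softmax over all pixel locations (assumed normalization).
    weights = [[math.exp(v - peak) for v in row] for row in corr]
    total = sum(sum(row) for row in weights)
    cells = [(r, c, weights[r][c] / total)
             for r in range(len(corr)) for c in range(len(corr[0]))]
    ex = sum(prob * c for _, c, prob in cells)
    ey = sum(prob * r for r, _, prob in cells)
    # Total variance: sum of the variances along x and y.
    var = sum(prob * ((c - ex) ** 2 + (r - ey) ** 2) for r, c, prob in cells)
    return (ex, ey), var
```

A sharply peaked correlation map gives an expected location at the peak with a variance near zero, whereas a flat map gives the image centre with a large variance, which is why the variance can serve to down-weight unreliable points.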

The loss at each point is re-weighted using the total variance σ²(i) as an uncertainty measure. The loss function for fine matching is the weighted sum of the epipolar and cycle consistency loss of the n sampled query points:

(11)

The overall loss function Loss is composed of Loss_c and Loss_f, which can be expressed as

(12)

Fig 5. Fine-tuning on same-modal feature matching tasks. The X-FFN and Y-FFN represent the assistants of any two kinds of pre-trained different modal images in the second stage of Fig 4. The fine-tuning of the X-modal image and the fine-tuning of the Y-modal image are independent of each other.

https://doi.org/10.1371/journal.pone.0319051.g005

3.4 Fine-tuning on different feature matching tasks

After comprehensive pre-training, UFM only requires fine-tuning on the corresponding dataset to achieve excellent matching results. For instance, when dealing with a brand new multimodal dataset, other methods typically need to be pre-trained on the entire training set. In contrast, UFM usually requires only about 1/10 of the training data for fine-tuning to deliver superior matching performance. This fine-tuning process includes both within-modal feature matching and cross-modal image feature matching. This critical stage ensures that the model effectively adapts to the unique characteristics and complexities of the matching task.

Fine-tuning of feature matching for the same modal. As illustrated in Fig 5, UFM facilitates the fine-tuning of image feature matching for individual modals. During the fine-tuning process for the same modal, UFM selectively freezes the parameters associated with the generic feedforward network (FFN) and multi-head attention, focusing solely on refining the assistant relevant to the corresponding modal. This modal assistant works in conjunction with the general FFN, utilizing a residual structure. By employing this strategy, the model can capitalize on the robust capabilities of the general FFN while making specific adjustments tailored to the data characteristics of the current modal. This approach not only enhances the model’s adaptability and performance within the same modal but also significantly reduces the computational resources required for training, ensuring a more efficient and effective fine-tuning process.

Fig 6. Fine-tuning on different-modal feature matching tasks. The X-FFN and Y-FFN represent the assistants of any two kinds of pre-trained different modal images in the second stage of Fig 4. The X-Y FFN represents the assistant of the pre-trained different modal images in the third stage of Fig 4.

https://doi.org/10.1371/journal.pone.0319051.g006

Fine-tuning of feature matching across different modals. Illustrated in Fig 6, the fine-tuning process for image feature matching across different modals in UFM involves two distinct stages. In the first stage, the assistant FFNs from two modals aid the generic FFN in extracting features. In the second stage, the assistant FFNs for modal matching assist the general FFN in executing feature matching. Throughout both stages, the parameters of the multi-head attention and general FFN remain frozen, and only the modal-specific assistant is subject to fine-tuning. Specifically, the first stage encompasses L-M layers, while the second stage comprises M layers. The total number of layers corresponds to that used for fine-tuning image feature matching within the same modal. This two-stage strategy not only leverages the specialized modal assistant to facilitate feature extraction but also enhances the effect of feature matching across different modals through cross-modal feature fusion. By integrating these complementary stages, UFM effectively optimizes the feature matching process, ensuring improved performance and enhanced adaptability across diverse modals.

Fig 7. Examples of the image pairs of the same modal. (a) The HPatches dataset. (b) The InLoc dataset. (c) The Aachen Day-Night v1.1 dataset.

https://doi.org/10.1371/journal.pone.0319051.g007

4 Experiments

We pre-train the UFM on large-scale multimodal image data and evaluate the models qualitatively and quantitatively on different feature matching tasks.

4.1 Pre-training setup

Our pre-training data mainly consists of these datasets: MegaDepth [50], ScanNet [51], YFCC100M [52], OSCD [53], MRSI [54], MRSIs [37], DIRSIG [55], MS-SAR LCZ [56], Retina [57], BrainWeb [58], VIS–NIR [59], WxBS [60] and image–paint [61]. There are about 2 million image pairs in the pre-training data. After pre-training, we only need to use about 1/10 of the training set of the corresponding dataset for fine-tuning.

The model consists of 9 transformer layers with a hidden size of 768. The model is trained with the AdamW optimizer with α = 0.9 and β = 0.98. The learning rate is 1 × 10⁻⁴. The pre-training of multi-modal image feature matching takes about a week on 8 NVIDIA Tesla V100 32GB GPUs.

4.2 Evaluation on same-modal image feature matching

In the evaluation of same-modal feature matching, we mainly verify feature matching between images with large camera pose differences. As shown in Fig 7, we compare different methods on the HPatches [62], InLoc [63] and Aachen Day-Night v1.1 [64] datasets. InLoc and Aachen Day-Night v1.1 are test-only datasets and provide no training data. We fine-tuned on the HPatches dataset using 1/10 of the training data.

As shown in Table 1, we evaluate UFM and the compared algorithms on homography estimation using the HPatches benchmark and report the proportion of accurate predictions, that is, those with an average corner error distance of less than 1, 3 and 5 pixels. Bold numbers in the table indicate the best result for each metric. UFM achieves the best results on most metrics, with particularly strong results under the Illumination setting.
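The corner-error metric can be illustrated with a short sketch. It assumes the common convention of warping the four image corners with the estimated and the ground-truth homography and averaging the four distances; the function names and the toy homographies are hypothetical and are not taken from the authors' code.

```python
import math


def warp_point(H, x, y):
    """Apply a 3x3 homography (nested lists) to the point (x, y)."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def mean_corner_error(H_est, H_gt, width, height):
    """Average distance between the four image corners warped by H_est and H_gt."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    total = 0.0
    for x, y in corners:
        ex, ey = warp_point(H_est, x, y)
        gx, gy = warp_point(H_gt, x, y)
        total += math.hypot(ex - gx, ey - gy)
    return total / len(corners)


def accuracy_at(errors, thresholds=(1, 3, 5)):
    """Fraction of image pairs whose mean corner error is below each threshold."""
    return {t: sum(e < t for e in errors) / len(errors) for t in thresholds}


def translation(tx):
    """Toy homography: a pure horizontal shift of tx pixels."""
    return [[1.0, 0.0, tx], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


identity = translation(0.0)
# Four toy estimates that are off by 0.5, 2, 4 and 10 pixels respectively.
errors = [mean_corner_error(translation(tx), identity, 640, 480)
          for tx in (0.5, 2.0, 4.0, 10.0)]
print(accuracy_at(errors))  # {1: 0.25, 3: 0.5, 5: 0.75}
```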

Table 1. Evaluation on HPatches for homography estimation.

https://doi.org/10.1371/journal.pone.0313772.t001

To verify the visual localization capability of UFM, we also estimated the 6-DOF pose of a given image with respect to the corresponding 3D scene model. We evaluated different approaches on long-term visual localization benchmarks [65]. The focus is on benchmarking indoor scene changes and day/night changes. As shown in Tables 2 and 3, UFM is very competitive on both Aachen Day-Night v1.1 and InLoc datasets. UFM surpasses the compared methods under most metrics.

Table 2. Visual localization evaluation on the Aachen Day-Night v1.1.

https://doi.org/10.1371/journal.pone.0313772.t002

Table 3. Visual localization evaluation on the InLoc benchmark.

https://doi.org/10.1371/journal.pone.0313772.t003

Fig 8. Examples of image pairs of different modals. (a) The SEN12MS dataset. a1. Optical a2. SAR a3. NIR a4. SWIR (b) The RGB-NIR Scene dataset. b1. RGB b2. Near-Infrared (c) The WHU-OPT-SAR dataset. c1. Optical c2. SAR (d) The Optical-SAR dataset. d1. Optical d2. SAR (e) The BrainWeb dataset. e1. T1 e2. T2 (f) The NYU-Depth V2 dataset. f1. Depth f2. RGB (g) The UV/Green Image dataset. g1. UV g2. Green.

https://doi.org/10.1371/journal.pone.0319051.g008

4.3 Estimation on different-modals images feature matching

In the evaluation of feature matching between images of different modals, seven datasets were used for testing. As shown in Fig 8, these are the SEN12MS dataset [66], the RGB-NIR Scene dataset [59], the WHU-OPT-SAR dataset [67], the Optical-SAR dataset [68], the BrainWeb dataset [58], the NYU-Depth V2 dataset [69] and the UV-Green dataset [70]. We fine-tuned on these datasets using 1/10 of the training data. UFM can handle most multi-modal feature matching problems. The experiments mainly cover optical, SAR, NIR, SWIR, depth, UV, green and medical images.

Multiple modals of the same scene. In the SEN12MS dataset, multiple images correspond to different modals of the same scene. We perform MMA evaluation and homography estimation on images from different modals of SEN12MS. Except for UFM, all algorithms are fully trained. The MMA results are shown in Fig 9, where the average matching accuracy of each method is computed for pixel error thresholds ranging from 1 to 10. A curve that lies higher and further to the left indicates better feature matching performance. The homography estimation results are shown in Table 4. We report the area under the cumulative curve (AUC) of the corner error up to thresholds of 3, 5 and 10 pixels; higher values indicate better feature matching. UFM outperforms the compared algorithms in the vast majority of cases, indicating that it is highly competitive in matching images of different modals of the same scene.
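Both metrics can be sketched in a few lines. The code assumes that per-match pixel errors (for MMA) and per-pair corner errors (for AUC) are already available, and it uses the trapezoid-rule convention that is common in public matching benchmarks; the authors' exact evaluation script may differ. The function names and sample errors are hypothetical.

```python
def mma_curve(errors, thresholds=range(1, 11)):
    """Fraction of matches whose pixel error is within each threshold (1 to 10 px)."""
    n = len(errors)
    return [sum(e <= t for e in errors) / n for t in thresholds]


def error_auc(errors, threshold):
    """Normalized area under the cumulative error curve up to `threshold` pixels."""
    errs = sorted(errors)
    n = len(errs)
    xs, ys = [0.0], [0.0]
    for i, e in enumerate(errs):
        if e >= threshold:
            break
        xs.append(e)
        ys.append((i + 1) / n)  # recall after including this error
    # Extend the curve horizontally up to the threshold.
    xs.append(float(threshold))
    ys.append(ys[-1])
    area = sum((xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2.0
               for k in range(1, len(xs)))
    return area / threshold


# Toy per-match errors in pixels for one image pair.
match_errors = [0.5, 1.5, 2.5, 12.0]
curve = mma_curve(match_errors)
print(curve)

# Toy per-pair corner errors in pixels.
corner_errors = [0.8, 2.0, 4.5, 7.0, 15.0]
for t in (3, 5, 10):
    print(t, round(error_auc(corner_errors, t), 3))
```

For a dataset-level MMA curve, the per-pair curves would typically be averaged over all test pairs.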

Fig 9. MMA estimation on multi-modal images in SEN12MS dataset.

https://doi.org/10.1371/journal.pone.0319051.g009

Table 4. Homography estimation on different modal images in the SEN12MS dataset.

https://doi.org/10.1371/journal.pone.0313772.t004

Multiple modals of different scenes. In practice, multi-modal feature matching usually involves different scenes. We therefore tested feature matching on multi-modal images of various scenes. Except for UFM, all algorithms are fully trained. The MMA evaluation and homography estimation results are shown in Fig 10 and Table 5, respectively. Although it is only fine-tuned, UFM outperforms the other algorithms in most cases. These experiments indicate that UFM offers good feature matching performance and generalization while saving substantial computing resources.

4.4 Visualization of image feature matching

The images of the same modal. To compare the matching results of different methods more intuitively, Fig 11 draws the matches with less than 1 pixel error as lines. To save space, we show only the four best-performing methods. UFM produces more correct matches than the other methods, which qualitatively indicates that it performs well on feature matching of same-modal images.

The images of different modals. The multi-modal feature matching results of the four best-performing methods are presented in Fig 12. The datasets from top to bottom are: SEN12MS dataset, RGB-NIR Scene dataset, WHU-OPT-SAR dataset, Optical-SAR dataset, BrainWeb dataset, NYU-Depth V2 dataset and UV-Green dataset. To compare the matching results of different methods clearly, we rotate or crop the images of several modals. The matches with less than 1 pixel error are represented by lines. UFM obtains more correct matches than the other methods, which qualitatively indicates its competitiveness in multi-modal image feature matching.

4.5 Estimation on the distance error test of feature matching

Although Sect 4.4 shows the matches with a pixel error of less than 1 as lines, it cannot show the specific error value of each match. To evaluate the matching accuracy of different methods more objectively, we calculated the average distance error of the matching points whose horizontal and vertical pixel errors are both less than 1. The total error is computed from both the horizontal and vertical pixel errors, so its value may exceed 1. In this experiment, the horizontal distance error is denoted H_rmse, the vertical distance error V_rmse, and the total distance error on the image HV_rmse. All errors are measured in pixels. H_rmse, V_rmse and HV_rmse are computed as:

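$$H_{rmse} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i^{X} - x_i^{Y}\right)^2} \tag{13}$$

$$V_{rmse} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i^{X} - y_i^{Y}\right)^2} \tag{14}$$

$$HV_{rmse} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left[\left(x_i^{X} - x_i^{Y}\right)^2 + \left(y_i^{X} - y_i^{Y}\right)^2\right]} \tag{15}$$

where $(x_i^{X}, y_i^{X})$ denotes the coordinates of the i-th matching point in the X-modal image, $(x_i^{Y}, y_i^{Y})$ denotes the coordinates of the corresponding matching point in the Y-modal image, and N denotes the total number of matched points.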
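As a rough illustration, the three errors can be computed as below. The sketch assumes the usual root-mean-square form suggested by the names H_rmse, V_rmse and HV_rmse, assumes that both point sets are expressed in a common, registered frame, and keeps only matches whose horizontal and vertical errors are both below 1 pixel, as described in the text. The function name and sample coordinates are hypothetical.

```python
import math


def distance_errors(pts_x, pts_y, max_axis_error=1.0):
    """Return (H_rmse, V_rmse, HV_rmse) in pixels for paired matching points.

    pts_x and pts_y are equally long lists of (x, y) coordinates in the X-modal
    and Y-modal image. Matches with an error of max_axis_error or more along
    either axis are discarded before averaging (an assumption of this sketch).
    """
    kept = [(ax - bx, ay - by)
            for (ax, ay), (bx, by) in zip(pts_x, pts_y)
            if abs(ax - bx) < max_axis_error and abs(ay - by) < max_axis_error]
    n = len(kept)
    if n == 0:
        raise ValueError("no matches below the per-axis error limit")
    sq_h = sum(dx * dx for dx, _ in kept)  # squared horizontal errors
    sq_v = sum(dy * dy for _, dy in kept)  # squared vertical errors
    return (math.sqrt(sq_h / n),
            math.sqrt(sq_v / n),
            math.sqrt((sq_h + sq_v) / n))


# Two sub-pixel matches and one gross outlier that is filtered out.
pts_x = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
pts_y = [(0.6, 0.8), (10.6, 10.8), (9.0, 5.0)]
h_rmse, v_rmse, hv_rmse = distance_errors(pts_x, pts_y)
print(round(h_rmse, 3), round(v_rmse, 3), round(hv_rmse, 3))
```

The example also shows why the total error can reach or exceed 1 pixel even though each axis error stays below 1.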

Fig 10. MMA estimation on different multi-modal image datasets. The modal of the images: (1) RGB-NIR, (2) Optical-SAR, (3) Optical-SAR, (4) T1-T2, (5) RGB-Depth, (6) UV-Green.

https://doi.org/10.1371/journal.pone.0319051.g010

Table 5. Homography estimation on different datasets.

https://doi.org/10.1371/journal.pone.0313772.t005

As shown in Table 6, UFM achieves the smallest distance error in most cases, outperforming the other methods. The experiments further demonstrate that UFM not only identifies the largest number of matching points with an error of less than 1 pixel, but also keeps the error of those matching points very small.

4.6 Evaluation of computational cost

To objectively evaluate the computational cost of the UFM algorithm, we compare the matching speed and resource requirements across different methods. In this experiment, the matching methods are tested on the RGB-NIR Scene dataset using an RTX 3090 GPU and an Intel i7-11700 processor. Matching speed is measured in frames per second (FPS), and resource usage is assessed in terms of storage requirements (MB). A higher FPS indicates faster inference, while lower storage values reflect reduced resource usage. As shown in Table 7, UFM achieves the fastest matching speed. Among the methods compared, UFM, LoFTR, and FeMIP are semi-dense matching techniques, while the remaining methods are sparse matching approaches. Generally, semi-dense methods offer higher matching accuracy but tend to demand more resources. Notably, UFM requires the least resources among the semi-dense methods.
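A matching-speed measurement of this kind can be sketched as follows. The helper `measure_fps` and the stand-in matcher are hypothetical; a real measurement would call the actual matcher on RGB-NIR image pairs, and on a GPU the device would need to be synchronized before each clock reading because kernels run asynchronously.

```python
import time


def measure_fps(match_fn, image_pairs, warmup=2):
    """Return the number of image pairs processed per second by match_fn."""
    pairs = list(image_pairs)
    if not pairs:
        raise ValueError("need at least one image pair")
    # Warm-up runs are excluded from timing (caches, lazy initialization).
    for a, b in pairs[:warmup]:
        match_fn(a, b)
    start = time.perf_counter()
    for a, b in pairs:
        match_fn(a, b)
    elapsed = time.perf_counter() - start
    return len(pairs) / max(elapsed, 1e-12)  # guard against a zero reading


def dummy_matcher(a, b):
    """Stand-in for a real matcher: a dot product over two flat 'images'."""
    return sum(x * y for x, y in zip(a, b))


toy_pairs = [([float(i)] * 1000, [1.0] * 1000) for i in range(20)]
fps = measure_fps(dummy_matcher, toy_pairs)
print(f"{fps:.1f} pairs per second")
```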

Fig 11. Feature matching of same-modal images. (a) SuperGlue (b) MatchFormer (c) LoFTR (d) UFM. The matches with less than 1 pixel error are represented by lines.

https://doi.org/10.1371/journal.pone.0319051.g011

Fig 12. Feature matching of different-modal images: (a) MatchosNet (b) LoFTR (c) FeMIP, and (d) UFM. The datasets from top to bottom are: SEN12MS dataset, RGB-NIR Scene dataset, WHU-OPT-SAR dataset, Optical-SAR dataset, BrainWeb dataset, NYU-Depth V2 dataset and UV-Green dataset. The images of the SEN12MS dataset, RGB-NIR Scene dataset and WHU-OPT-SAR dataset are rotated, and the images of the Optical-SAR dataset and the NYU-Depth V2 dataset are cropped. The matches with less than 1 pixel error are represented by lines.

https://doi.org/10.1371/journal.pone.0319051.g012

Table 6. The distance error test of different methods on different datasets.

https://doi.org/10.1371/journal.pone.0313772.t006

Table 7. Evaluation of matching speed and resources on the RGB-NIR Scene dataset.

https://doi.org/10.1371/journal.pone.0313772.t007

4.7 Ablation study

To fully assess the role of different modules in UFM, seven variants were designed. In the 7th variant, pre-training is omitted, and the model is trained using the entire training dataset. As shown in Tables 8 and 9, we conduct ablation experiments on feature matching using images from the same modal and from different modals, respectively. The results of these experiments demonstrate that UFM outperforms all variants in feature matching. The combination of data augmentation and the MIA transformer is particularly effective, with the Assistant FFN in the MIA transformer having the most significant impact. The Generic FFN has a lesser effect on the matching performance. Data augmentation, Assistant FFN, and fine-tuning have a substantial influence on feature matching across different modals, while their impact on same-modal feature matching is relatively smaller. Although the 7th variant is trained on the full dataset, it still underperforms compared to the complete UFM, highlighting the necessity of the pre-training and fine-tuning mechanisms we have designed.

Table 8. Ablation experiments on feature matching of same-modal images. A. Data Augmentation, B. Generic FFN, C. Assistant FFN, D. Pre-training, E. Fine-tuning. (1)–(7) are seven variants of UFM. (8) is the full UFM.

https://doi.org/10.1371/journal.pone.0313772.t008

Table 9. Ablation experiments on feature matching of different-modal images. A. Data Augmentation, B. Generic FFN, C. Assistant FFN, D. Pre-training, E. Fine-tuning. (1)–(7) are seven variants of UFM. (8) is the full UFM.

https://doi.org/10.1371/journal.pone.0313772.t009

4.8 Limitation

Although UFM has been shown to deliver good matching results in most cases, it also has some limitations. As illustrated in Fig 13, UFM struggles to achieve accurate matching when dealing with multi-modal images that have very few textures. This is a common challenge for feature matching methods, and other approaches also face difficulties in such scenarios. While UFM has proven to be highly generalizable, its pre-training becomes less effective when confronted with unseen modal images. In such cases, as with other methods, it is necessary to train the model on the entire dataset to achieve accurate matching results.

Fig 13. Feature matching on images with very little texture.

https://doi.org/10.1371/journal.pone.0319051.g013

5 Conclusion

This paper introduces UFM, a Unified Feature Matching model designed for fine-tuning across a broad range of modal images using a shared Multimodal Image Assistant (MIA) Transformer. MIA Transformers, serving as multimodal assistants, enhance a generic Feedforward Network (FFN) by encoding modal-specific information. The shared self-attention and cross-attention mechanisms enable improved feature matching across different modals. Through the incorporation of data augmentation and staged pre-training, UFM demonstrates significantly enhanced pre-training effectiveness on multimodal images. Experimental results showcase UFM’s capability to achieve excellent performance in diverse feature matching tasks.

Future work aims to enhance UFM further by expanding the pre-training dataset and integrating image data from less common modals. Consideration is given to transitioning from a semi-dense matching framework to a dense matching framework to elevate matching accuracy. Plans also include scaling up the model size used in UFM pretraining. Additionally, ongoing research explores downstream tasks of image feature matching, with an emphasis on integrating the unified feature matching algorithm into diverse applications.

References

1. Kulkarni SC, Rege PP. Pixel level fusion techniques for SAR and optical images: a review. Inform Fusion. 2020;59:13–29.
2. Deng S, Deng L, Wu X, Ran R, Hong D, Vivone G. PSRT: Pyramid shuffle-and-reshuffle transformer for multispectral and hyperspectral image fusion. IEEE Trans Geosci Rem Sens 2023;61(1):1–15.
3. Peng Z, Ma Y, Zhang Y, Li H, Fan F, Mei X. Seamless UAV hyperspectral image stitching using optimal seamline detection via graph cuts. IEEE Trans Geosci Rem Sens 2023;61(1):1–13.
4. Zhang Z, Yang X, Xu C. Natural image stitching with layered warping constraint. IEEE Trans Multimedia. 2023;25:329–38.
5. Yin W, Zhang J, Wang O, Niklaus S, Chen S, Liu Y, et al. Towards accurate reconstruction of 3D scene shape from a single monocular image. IEEE Trans Pattern Anal Mach Intell 2023;45(5):6480–94. pmid:36197868
6. Song S, Truong KG, Kim D, Jo S. Prior depth-based multi-view stereo network for online 3D model reconstruction. Pattern Recogn. 2023;136:109198.
7. Yenkikar A, Babu CN. AirBERT: a fine-tuned language representation model for airlines tweet sentiment analysis. IDT 2023;17(2):435–55.
8. He Y, Zhang Q, Wang S, Chen Z, Cui Z, Guo Z-H, et al. Predicting the sequence specificities of DNA-binding proteins by DNA fine-tuned language model with decaying learning rates. IEEE/ACM Trans Comput Biol Bioinform 2023;20(1):616–24. pmid:35389869
9. Deng Y, Karam LJ. Frequency-tuned universal adversarial attacks on texture recognition. IEEE Trans Image Process. 2022;31:5856–68. pmid:36054395
10. Ellahyani A, Jaafari IE, Charfi S, Ansari ME. Fine-tuned deep neural networks for polyp detection in colonoscopy images. Pers Ubiquit Comput 2022;27(2):235–47.
11. Radford A, Kim J, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning. 2021;139:8748–63.
12. Jia C, Yang Y, Xia Y, Chen Y, Parekh Z, Pham H. Scaling up visual and vision-language representation learning with noisy text supervision. Proceedings of the 38th International Conference on Machine Learning. 2021;139:4904–16.
13. Giang K, Song S, Jo S. TopicFM: Robust and Interpretable Topic-Assisted Feature Matching. Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence. 2023;2447–55.
14. Cai Y, Li L, Wang D, Li X, Liu X. HTMatch: An efficient hybrid transformer based graph neural network for local feature matching. Signal Processing. 2023;204:108859.
15. Luo H, Xie T, Wang A, Dai K, Cao C, Zhao L. CorMatcher: a corners-guided graph neural network for local feature matching. Expert Syst Appl. 2024;258:125190.
16. Xu W, Yuan X, Hu Q, Li J. SAR-optical feature matching: a large-scale patch dataset and a deep local descriptor. Int J Appl Earth Obs Geoinform. 2023;122:103433.
17. Di Y, Liao Y, Zhu K, Zhou H, Zhang Y, Duan Q. MIVI: multi-stage feature matching for infrared and visible image. The Visual Computer. 2023;1–13.
18. Di Y, Liao Y, Zhou H, Zhu K, Zhang Y, Duan Q, et al. FeMIP: detector-free feature matching for multimodal images with policy gradient. Appl Intell 2023;53(20):24068–88.
19. Hu M, Sun B, Kang X, Li S. Multiscale structural feature transform for multi-modal image matching. Inform Fusion. 2023;54–54.
20. Lu Y, Lu G. SuperThermal: matching thermal as visible through thermal feature exploration. IEEE Robot Autom Lett 2021;6(2):2690–7.
21. Tang G, Liu Z, Xiong J. Distinctive image features from illumination and scale invariant keypoints. Multimed Tools Appl 2019;78(16):23415–42.
22. Rublee E, Rabaud V, Konolige K, Bradski G. ORB: An efficient alternative to SIFT or SURF. Proc IEEE Int Conf Comput Vis. 2011;2564–71.
23. Zhu M, Song H, Xu J, Jiang X, Zhang Y, Ma J, et al. Introgression of ZmCPK39 in maize hybrids enhances resistance to gray leaf spot disease without compromising yield. Mol Breed 2025;45(3):28. pmid:40013268
24. Li J, Xu W, Shi P, Zhang Y, Hu Q. LNIFT: Locally normalized image for rotation invariant multimodal feature matching. IEEE Trans Geosci Rem Sens 2022;60(1):1–14.
25. Sarlin P, DeTone D, Malisiewicz T, Rabinovich A. SuperGlue: Learning feature matching with graph neural networks. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020;4937–46. https://openaccess.thecvf.com/content_CVPR_2020/html/Sarlin_SuperGlue_Learning_Feature_Matching_With_Graph_Neural_Networks_CVPR_2020_paper.html
26. Jiang B, Sun P, Luo B. GLMNet: Graph learning-matching convolutional networks for feature matching. Pattern Recogn. 2022;121:108167.
27. Shen Z, Sun J, Wang Y, He X, Bao H, Zhou X. Semi-dense feature matching With transformers and its applications in multiple-view geometry. IEEE Trans Pattern Anal Mach Intell 2023;45(6):7726–38. pmid:36409815
28. Fisher AN, Stinson DA, Kalajdzic A, Dupuis HE, Lowey EE, Desgrosseilliers E, et al. “A recipe for disaster?”: Female-Breadwinner relationships threaten heterosexual scripts. Sex Roles. 2025;91(3):16. pmid:39990977
29. Huang D, Chen Y, Liu Y, Liu J, Xu S, Wu W, et al. Adaptive Assignment for Geometry Aware Local Feature Matching. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023;5425–34.
30. Mok TCW, Chung ACS. Affine medical image registration with coarse-to-fine vision transformer. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022;20803–12.
31. Xie T, Dai K, Wang K, Li R, Zhao L. DeepMatcher: A deep transformer-based network for robust and accurate local feature matching. Exp Syst Appl. 2024;237(Part A):121361.
32. Dai K, Xie T, Wang K, Jiang Z, Li R, Zhao L. OAMatcher: an overlapping areas-based network for accurate local feature matching. CoRR 2023.
33. Truong P, Danelljan M, Gool LV, Timofte R. Learning accurate dense correspondences and when to trust them. IEEE Conference on Computer Vision and Pattern Recognition. 2021;5714–24.
34. Truong P, Danelljan M, Timofte R, Van Gool L. PDC-Net+: Enhanced Probabilistic Dense Correspondence Network. IEEE Trans Pattern Anal Mach Intell 2023;45(8):10247–66. pmid:37027599
35. Edstedt J, Athanasiadis I, Wadenbäck M, Felsberg M. DKM: dense kernelized feature matching for geometry estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023;75–75.
36. Edstedt J, Sun Q, Bokman G, Wadenback M, Felsberg M. RoMa: revisiting robust losses for dense feature matching. CoRR. 2023;abs.
37. Zhu B, Yang C, Dai J, Fan J, Qin Y, Ye Y. R₂FD₂: fast and robust matching of multimodal remote sensing images via repeatable feature detector and rotation-invariant feature descriptor. IEEE Trans Geosci Remote Sensing. 2023;61:1–15.
38. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North. 2019;4186–4186.
39. Dong L, Yang N, Wang W, Wei F, Liu X, Wang Y, et al. Unified language model pre-training for natural language understanding and generation. Adv Neural Inform Process Syst. 2019;32:13042–54.
40. Yang S, Lei X. Reciprocal causation relationship between rumination thinking and sleep quality: a resting-state fMRI study. Cogn Neurodyn 2025;19(1):41. pmid:39991016
41. Bao H, Dong L, Piao S, Wei F. BEiT: BERT pre-training of image transformers. The Tenth International Conference on Learning Representations. 2022;2022. https://openreview.net/forum?id=p-BhZSz59o4
42. Chen Y, Li L, Yu L, Kholy AE, Ahmed F, Gan Z, et al. UNITER: UNiversal Image-TExt Representation Learning. Vedaldi A, Bischof H, Brox T, Frahm J, editors. Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX. vol. 12375 of Lecture Notes in Computer Science. Springer; 2020;104–120. Available from: doi: https://doi.org/10.1007/978-3-030-58577-8_7.
43. Su W, Zhu X, Cao Y, Li B, Lu L, Wei F, et al. VL-BERT: Pre-training of generic visual-linguistic representations. OpenReview.net. 2020.
44. Muresan S, Nakov P, Villavicencio A. Fine-tuning for transformer-based masked language-models. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2022;9–9.
45. Sohn K, Chang H, Lezama J, Polania L, Zhang H, Hao Y, et al. Visual prompt tuning for generative transfer learning. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023;19840–51.
46. Bao H, Wang W, Dong L, Liu Q, Mohammed OK, Aggarwal K, et al. VLMo: unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems. 2022;35, Abstract.
47. Wang W, Bao H, Dong L, Bjorck J, Peng Z, Liu Q, et al. Image as a foreign language: BEIT pretraining for vision and vision-language tasks. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023;19175–86.
48. DeTone D, Malisiewicz T, Rabinovich A. SuperPoint: self-supervised interest point detection and description. 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018;224–36.
49. Wang Q, Zhou X, Hariharan B, Snavely N. Learning feature descriptors using camera pose supervision. In: Vedaldi A, Bischof H, Brox T, Frahm J, eds. Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part I. vol. 12346 of Lecture Notes in Computer Science. Springer; 2020;757–774.
50. Li Z, Snavely N. MegaDepth: learning single-view depth prediction from internet photos. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018;2041–50.
51. Dai A, Chang AX, Savva M, Halber M, Funkhouser TA, Nießner M. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017. 2017;2432–43.
52. Thomee B, Shamma D, Friedland G, Elizalde B, Ni K, Poland D. YFCC100M: the new data in multimedia research. Commun ACM 2016;59(2):64–73.
53. Daudt RC, Saux BL, Boulch A, Gousseau Y. Urban change detection for multispectral earth observation using convolutional neural networks. 2018 IEEE International Geoscience and Remote Sensing Symposium, IGARSS 2018. 2018;2115–8.
54. Yao Y, Zhang Y, Wan Y, Liu X, Yan X, Li J. Multi-modal remote sensing image matching considering co-occurrence filter. IEEE Trans Image Process. 2022;31:2584–97. pmid:35286258
55. Nilosek D, Walvoord DJ, Salvaggio C. Assessing geoaccuracy of structure from motion point clouds from long-range image collections. Opt Eng 2014;53(11):113112.
56. Hong D, Gao L, Yokoya N, Yao J, Chanussot J, Du Q, et al. More diverse means better: multimodal deep learning meets remote-sensing imagery classification. IEEE Trans Geosci Remote Sensing 2021;59(5):4340–54.
57. Ma J, Zhao J, Jiang J, Zhou H, Guo X. Locality preserving matching. Int J Comput Vis 2018;127(5):512–31.
58. Collins DL, Zijdenbos AP, Kollokian V, Sled JG, Kabani NJ, Holmes CJ, et al. Design and construction of a realistic digital brain phantom. IEEE Trans Med Imaging 1998;17(3):463–8. pmid:9735909
59. Brown M, Susstrunk S. Multi-spectral SIFT for scene category recognition. CVPR 2011. 2011;184–184.
60. Mishkin D, Matas J, Perdoch M, Lenc K. WxBS: wide baseline stereo generalizations. Proceedings of the British Machine Vision Conference 2015. 2015;12.1-12.12.
61. Shrivastava A, Malisiewicz T, Gupta A, Efros AA. Data-driven visual similarity for cross-domain image matching. ACM Trans Graph 2011;30(6):1–10.
62. Balntas V, Lenc K, Vedaldi A, Mikolajczyk K. HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017;3852–61.
63. Taira H, Okutomi M, Sattler T, Cimpoi M, Pollefeys M, Sivic J, et al. InLoc: indoor visual localization with dense matching and view synthesis. IEEE Trans Pattern Anal Mach Intell 2021;43(4):1293–307. pmid:31722474
64. Zhang Z, Sattler T, Scaramuzza D. Reference pose generation for long-term visual localization via learned features and view synthesis. Int J Comput Vis 2021;129(4):821–44. pmid:34720404
65. Toft C, Maddern W, Torii A, Hammarstrand L, Stenborg E, Safari D, et al. Long-term visual localization revisited. IEEE Trans Pattern Anal Mach Intell 2022;44(4):2074–88. pmid:33074802
66. Roßberg T, Schmitt M. Estimating NDVI from Sentinel-1 SAR data using deep learning. IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium. 2022;5–5.
67. Li X, Zhang G, Cui H, Hou S, Wang S, Li X, et al. MCANet: a joint semantic segmentation framework of optical and SAR images for land use classification. Int J Appl Earth Obs Geoinformation. 2022;106:102638.
68. Liao Y, Di Y, Zhou H, Li A, Liu J, Lu M, et al. Feature matching and position matching between optical and SAR with local deep feature descriptor. IEEE J Sel Top Appl Earth Observations Remote Sensing. 2022;15:448–62.
69. Silberman N, Hoiem D, Kohli P, Fergus R. Indoor segmentation and support inference from RGBD images. In: Fitzgibbon AW, Lazebnik S, Perona P, Sato Y, Schmid C, eds. Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V. vol. 7576 of Lecture Notes in Computer Science. Springer; 2012. p. 746–760.
70. Differt D, Möller R. Spectral skyline separation: extended landmark databases and panoramic imaging. Sensors (Basel) 2016;16(10):1614. pmid:27690053
71. Dusmanu M, Rocco I, Pajdla T, Pollefeys M, Sivic J, Torii A, et al. D2-Net: a trainable CNN for joint description and detection of local features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019;8092–101.
72. Revaud J, de Souza CR, Humenberger M, Weinzaepfel P. R2D2: reliable and repeatable detector and descriptor. In: Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada; 2019;12405–12415. https://proceedings.neurips.cc/paper/2019/hash/3198dfd0aef271d22f7bcddd6f12f5cb-Abstract.html.
73. Luo Z, Zhou L, Bai X, Chen H, Zhang J, Yao Y, et al. ASLFeat: learning local features of accurate shape and localization. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020;6588–97
74. Zhou Q, Sattler T, Leal-Taixé L. Patch2Pix: epipolar-guided pixel-level correspondences. IEEE Conference on Computer Vision and Pattern Recognition. 2021;4669–78.
75. Sarlin P, Unagar A, Larsson M, Germain H, Toft C, Larsson V, et al. Back to the feature: learning robust camera localization from pixels to pose. IEEE Conference on Computer Vision and Pattern Recognition. 2021;3247–57.
76. Li X, Wang S, Zhao Y, Verbeek J, Kannala J. Hierarchical scene coordinate classification and regression for visual localization. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020;11980–9.
77. Sarlin P, Cadena C, Siegwart R, Dymczyk M. From coarse to fine: robust hierarchical localization at large scale. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019;12716–25.
78. Melekhov I, Brostow GJ, Kannala J, Turmukhambetov D. Image stylization for robust features. CoRR. 2020.
79. Zhou Y, Fan H, Gao S, Yang Y, Zhang X, Li J, et al. Retrieval and localization with observation constraints. 2021 IEEE International Conference on Robotics and Automation (ICRA). 2021;5237-5244.
80. Jiang W, Trulls E, Hosang J, Tagliasacchi A, Yi KM. COTR: Correspondence Transformer for Matching Across Images. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021;6187-6197.
81. Han X, Leung T, Jia Y, Sukthankar R, Berg AC. MatchNet: unifying feature and metric learning for patch-based matching. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015;3279–86.
82. Balntas V, Riba E, Ponsa D, Mikolajczyk K. Learning local feature descriptors with triplets and shallow convolutional neural networks. Proceedings of the British Machine Vision Conference 2016. 2016;119.
83. Mishchuk A, Mishkin D, Radenovic F, Matas J. Working hard to know your neighbor’s margins: local descriptor learning loss. Advances in Neural Information Processing Systems. 2017;30:4826–37.

