Abstract
Image feature matching, a foundational task in computer vision, remains challenging for multimodal image applications, often necessitating intricate training on specific datasets. In this paper, we introduce a Unified Feature Matching pre-trained model (UFM) designed to address feature matching challenges across a wide spectrum of modal images. We present Multimodal Image Assistant (MIA) transformers, finely tunable structures adept at handling diverse feature matching problems. UFM exhibits versatility in addressing both feature matching tasks within the same modal and those across different modals. Additionally, we propose a data augmentation algorithm and a staged pre-training strategy to effectively tackle challenges arising from sparse data in specific modals and imbalanced modal datasets. Experimental results demonstrate that UFM excels in generalization and performance across various feature matching tasks. The code will be released at: https://github.com/LiaoYun0x0/UFM.
Citation: Di Y, Liao Y, Zhou H, Zhu K, Duan Q, Liu J, et al. (2025) UFM: Unified feature matching pre-training with multi-modal image assistants. PLoS ONE 20(3): e0319051. https://doi.org/10.1371/journal.pone.0319051
Editor: Paulo Eduardo Teodoro, Federal University of Mato Grosso do Sul, BRAZIL
Received: October 22, 2024; Accepted: January 26, 2025; Published: March 31, 2025
Copyright: © 2025 Di et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The datasets analyzed during the current study are available from the following public domain resources: https://github.com/hpatches, http://www.ok.sc.e.titech.ac.jp/INLOC/, https://data.ciirc.cvut.cz/public/projects/2020VisualLocalization/Aachen-Day-Night/, https://mediatum.ub.tum.de/1474000, https://cs.nyu.edu/silberman/datasets/nyudepthv2.html, http://matthewalunbrown.com/nirscene/nirscene.html, https://github.com/AmberHen/WHU-OPT-SAR-dataset, http://www.ti.uni-bielefeld.de/html/people/ddiffert/databases uvg.html, https://brainweb.bic.mni.mcgill.ca/brainweb/.
Funding: This work was supported by the National Natural Science Foundation of China under Grant 61976124 and 62372077 to M.L. This work was also supported by the Scientific Research Fund of Yunnan Provincial Education Department under Grant 2021J0007 to Y.L. and Q.D.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Feature matching, as a fundamental task in computer vision, serves to establish correspondences between local features in different images, facilitating the advancement of various downstream applications such as image fusion [1,2], image stitching [3,4], and 3D reconstruction [5,6], among others. With the proliferation of computer vision applications, the demand for precise feature matching of diverse multi-modal images has grown significantly. When researchers address the task of specific modal images, they are required to identify the appropriate feature matching method tailored to the corresponding modal image. This process typically involves utilizing substantial amounts of relevant training data, leading to substantial resource consumption. Thus, the development of a unified pre-trained comprehensive model for feature matching across a wide range of modals becomes increasingly imperative.
The concept of pretraining-fine-tuning methods initially emerged in the realm of natural language processing [7,8]. Given its remarkable performance, this approach was swiftly extended to the field of computer vision [9,10] and multimodal applications [11,12], offering a means to achieve more efficient results with reduced resources and time. Nevertheless, within the domain of multi-modal image feature matching, a comprehensive and effective unified framework has been notably absent, limiting the ability of researchers to make rapid advancements in multi-modal image feature matching through fine-tuning on a pre-trained foundational model.
Feature matching tasks can be broadly categorized into two types: matching features within images of the same modal and matching features across images of different modals. As illustrated in Fig 1, feature matching tasks for images of the same modal often involve significant variations in sensor positions and angles. For instance, the left image might capture the front of a building while the right image could be taken from the left side, exhibiting notable disparities in time, location, and angle between the two images. The primary objective in matching features within images of the same modal is to precisely align the poses of objects depicted in different pictures. Conversely, in the context of feature matching tasks involving images of different modals, the sensor positions typically exhibit minimal differences, serving to leverage the complementary information provided by the distinct modals present in the images.
Previous studies have introduced numerous techniques [13–15] for facilitating feature matching within images of the same modal, along with several methods [16,17] dedicated to achieving feature matching across specific pairs of modals. While some approaches [18,19] have attempted to address feature matching across multimodal images [20], their applicability is often restricted to specific image modals, necessitating extensive training for specialized tasks.
In this paper, we introduce the Unified Pre-trained Feature Matching (UFM) model, designed to facilitate multi-modal image feature matching across a wide range of image modals. UFM not only enables feature matching within images of the same modal but also supports feature matching across various modals. This capability is achieved through the utilization of a Multi-Modal Image Assistant (MIA) Transformer. By incorporating a diverse set of modal assistants to aid the feedforward network within the standard transformer, the MIA Transformer can effectively capture specific modal information. Furthermore, the model leverages cross-modal shared self-attention and cross-attention mechanisms to capture inter-modal information. The modal assistants encompass two distinct types: those designed for the same image modal and those tailored for different image modals. Leveraging its modeling flexibility, the MIA Transformer, equipped with shared parameters, can be repurposed for various tasks.
To address the disparities in data distribution among different modal image data and to compensate for the scarcity of specific modal image data, we have devised a data augmentation methodology to expand the datasets of various modal images. Additionally, UFM incorporates a staged pre-training strategy, commencing with an initial pre-training phase on the same modal data and followed by subsequent pre-training on a dataset inclusive of various modals. Given that the feature matching data for images of the same modal significantly outweighs the corresponding data for different modals, this staged pre-training strategy significantly augments the data available for pre-training, thereby enhancing the generalization capabilities of UFM. Experimental results attest to the exceptional performance of UFM in both the feature matching tasks of images from the same modal and the feature matching tasks of images from different modals.
Our main contributions are as follows:
- We introduce a unified pre-trained model for multi-modal image feature matching, denoted as UFM, which is adaptable to a wide array of modal image feature matching tasks.
- We present the Multi-Modal Image Assistant (MIA) Transformer, a novel framework that captures specific modal information through modal assistants. Furthermore, MIA Transformer enhances the efficacy of feature matching for different modal images by facilitating cross-modal feature fusion.
- Our proposed method demonstrates the ability to handle feature matching tasks for images of the same modal with significant disparities in sensor poses, as well as for feature matching tasks between images of different modals.
- Leveraging data augmentation techniques, we expand the datasets of modal images with limited data, while the staged pre-training strategy significantly enhances the effectiveness of pre-training for multi-modal image feature matching.
2 Related work
In the realm of feature matching, methods can be categorized into two types: detector-based and detector-free. Beyond methodological distinctions, tasks in this domain can also be classified into two broad categories: conventional feature matching tasks and multi-modal image feature matching tasks.
Detector-based feature matching. The detector-based method unfolds in three key stages: feature detection, feature description, and feature matching. In the feature detection and description stage, interest points with descriptors are generated, followed by the establishment of point-to-point correspondences through a suitable matching algorithm. Detector-based methods can be further categorized into handcrafted descriptor methods and deep learning descriptor methods. Among the prominent handcrafted descriptor methods is SIFT [21], renowned for its simplicity and efficiency, making it versatile across various computer vision tasks. Subsequent works [22–24] have iteratively improved upon the SIFT algorithm, enhancing its performance. Deep learning descriptor methods leverage convolutional neural networks to glean deep features and capture nonlinear expressions, uncovering valuable hidden information. Sarlin et al. [25] introduced SuperGlue, featuring a flexible attention-based context aggregation mechanism capable of jointly reasoning about the underlying 3D scene and feature assignment. Jiang et al. [26] proposed GLMNet, a graph learning-matching network utilizing graph convolutional networks to adaptively learn optimal graphs for the graph matching task.
Detector-free feature matching. The detector-free method eliminates the need for extracting interest points, opting instead to directly extract features in image pairs through transformers. This category can be further divided into semi-dense matching methods and dense matching methods. Semi-dense matching methods typically employ a coarse-to-fine matching process, producing the final precise matches on feature maps at 1/2 of the input resolution. A pioneering contribution is the LoFTR algorithm by Sun et al. [27], which uses the global receptive field of Transformers to produce semi-dense matches and excels particularly in low-texture regions. Subsequent algorithms [28–32] have built upon and improved LoFTR as a baseline. Dense matching methods, on the other hand, extract all matches between views directly on the original image, aiming to estimate every matched pair of pixels. Truong et al. introduced PDC-Net [33] and PDC-Net+ [34], formulating bias estimates in a probabilistic manner and pairing proposed feature correspondences with deterministic estimates via a mixture model. Edstedt et al. presented DKM [35] and RoMa [36]. DKM achieves superior matching accuracy by utilizing depthwise separable large kernels and local correlations with stacked feature maps as input, while RoMa combines the strengths of semi-dense and dense matching methods in a coarse-to-fine dense matching framework and achieves state-of-the-art matching accuracy. Although dense matching methods are more accurate than their semi-dense counterparts, they require more computational resources and training time. Weighing these factors, we adopt a semi-dense matching framework in designing UFM.
Multi-modal image feature matching. The significant disparities among sensors in multi-modal images pose challenges for conventional matching methods, rendering them less applicable for multi-modal image feature matching. Consequently, numerous scholars have introduced algorithms specifically tailored for this purpose. Zhu et al. [37] presented R2FD2, a detector-based multi-modal image feature matching algorithm. Initially, they employed the reproducible feature detector MALG to identify interest points and subsequently utilized the feature descriptor RMLG for feature representation and matching. Hu et al. [19] proposed a multi-scale structural feature transform (MSFT) method designed for multi-modal image matching. This approach detects scale-invariant feature points based on the Gaussian difference image pyramid of the phase congruency map, addressing challenges arising from nonlinear radiation distortion. In the detector-free domain, Di et al. [18] introduced FeMIP, a multi-modal feature matching algorithm. Employing a coarse-to-fine approach, FeMIP achieves accurate feature matching. Notably, it incorporates a policy gradient method to address issues related to the discreteness of matching. While there exist several commendable multi-modal image feature matching algorithms, they often exhibit limitations in terms of the types of multi-modal images they are suited for. Furthermore, these algorithms may require substantial training on specific datasets and may not be well-suited for addressing feature matching challenges within images of the same modal. Therefore, the development of a unified feature matching algorithm, accommodating the majority of modal images, is of utmost importance.
Pre-training – fine-tuning. As the transformer paradigm has evolved, the pretraining-fine-tuning approach has become instrumental in advancing the state of the art across diverse domains such as natural language processing [38,39], computer vision [40,41], and multi-modal tasks [42,43]. Zaken et al. [44] introduced BitFit, a sparse fine-tuning method tailored for natural language processing. Leveraging the modification of bias terms exclusively, this method demonstrates remarkable performance on extensive training data. Sohn et al. [45] presented a method for learning vision transformers through generative knowledge transfer. Their approach incorporates a novel prompt design that strategically places learnable tokens within image token sequences. Radford et al. [11] proposed CLIP, a multi-modal pretraining-fine-tuning model that exhibits effective knowledge transfer across tasks without the need for dataset-specific training. Despite the notable success of the pre-training-fine-tuning technique in various domains, its application in the realm of multi-modal image feature matching has been limited. Consequently, we have devised a unified large model, UFM, for multi-modal image feature matching, achieving comprehensive pre-training on extensive multi-modal image data.
3 Methodology
Given an image pair from most modals, the Unified Feature Matching (UFM) approach can effectively derive its image pair representation using a Multimodal Image Assistant (MIA) Transformer network. The UFM method exhibits remarkable versatility, enabling it to address feature matching tasks not only within the same modal but also across different modals, even when the sensor positions vary significantly. Additionally, in handling the feature matching tasks of diverse modal images, the process of cross-modal feature fusion can further enhance the overall effectiveness of the approach.
3.1 Multi-modal image assistants transformer
Similar to previous methods [18,27,28], UFM adopts a coarse-to-fine semi-dense matching approach for feature matching. The input image pairs are processed by a feature pyramid network (FPN) to generate feature maps at 1/8 and 1/2 of the input resolution. The 1/8-resolution features are trained using the augmented GT matrix as labels, yielding coarse matching results at the patch level. Subsequently, the 1/2-resolution features are matched within the coarse matching results to derive the final matching results at the pixel level. Unlike these methods, UFM departs from the conventional use of transformers in the coarse and fine matching stages and introduces a novel architecture.
In this study, we propose a novel Multi-modal Image Assistant (MIA) Transformer tailored for feature matching tasks, as depicted in Fig 2. The UFM model encompasses all the processes involved in image enhancement, feature extraction, coarse matching, and fine matching. The MIA Transformer is utilized in both the coarse and fine matching stages. The MIA Transformer integrates a multi-modal image assistant with a generic feedforward network. The Unified Feature Matching (UFM) framework comprises a total of L layers. Given the output vector G_(l-1) from the generic feedforward network of the preceding layer and the output vector A_(l-1) from the assistant feedforward network of the previous layer, the approach leverages multi-head self-attention and cross-attention (MSCA) mechanisms, shared across modals, to align the content of a pair of images. The input vector for each layer can be computed as follows:
where LN is short for layer normalization. MIA_FFN is designed to select the appropriate assistant from a collection of multiple modal assistants for processing the input vector. Notably, it encompasses two distinct types of modal assistants: those tailored for the same modal and those dedicated to different modals. When the input consists of image vectors from the same modal, the corresponding assistant associated with that modal is employed for encoding the images. Conversely, in cases where the input includes image vectors from diverse modals, such as optical-SAR, the optical and SAR assistants are utilized to encode the respective modal vectors at the underlying Transformer layer. Subsequently, the optical-SAR assistant is employed at the top layer to capture more comprehensive modal interactions. The output vector V_l for each layer can be computed as follows:
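Independently of the exact layer equations, the assistant routing described in this section can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function names, the ReLU activation, the random weights, the way the generic FFN and the assistant are summed in a residual branch, and the number of top layers (`top_layers`, standing in for the cross-modal layers) are all assumptions made for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def make_ffn(rng, dim, hidden):
    """Random weights for a two-layer feedforward network (illustrative only)."""
    return rng.normal(0.0, 0.02, (dim, hidden)), rng.normal(0.0, 0.02, (hidden, dim))

def ffn(x, weights):
    """Two-layer feedforward network; the ReLU activation is an assumption."""
    w1, w2 = weights
    return np.maximum(x @ w1, 0.0) @ w2

def select_assistants(modal_a, modal_b, layer, num_layers, top_layers):
    """Pick the assistant key for each image of the pair at a given layer."""
    if modal_a == modal_b:
        # Same-modal pair: the assistant of that modal handles both images.
        return modal_a, modal_a
    if layer >= num_layers - top_layers:
        # Top layers of a cross-modal pair: one shared cross-modal assistant.
        pair = "-".join(sorted((modal_a, modal_b)))
        return pair, pair
    # Lower layers of a cross-modal pair: each image uses its own modal assistant.
    return modal_a, modal_b

def mia_ffn(x_a, x_b, keys, generic, assistants):
    """Generic FFN plus the selected assistant, added residually (assumed form)."""
    outputs = []
    for x, key in zip((x_a, x_b), keys):
        h = layer_norm(x)
        outputs.append(x + ffn(h, generic) + ffn(h, assistants[key]))
    return tuple(outputs)
```

With 9 layers and a hypothetical `top_layers=3`, an optical-SAR pair would use the separate optical and SAR assistants in layers 0 to 5 and the shared optical-SAR assistant in layers 6 to 8.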
3.2 Data augmentation
UFM necessitates a comprehensive pre-training process involving images from all modals. However, significant discrepancies often exist in the availability of image data across different modals. For instance, optical images commonly offer a wealth of data, whereas long-infrared images may be relatively scarce. To address the nonuniformity in data distribution among different modals and to mitigate the scarcity of data in certain modals, this paper introduces a data augmentation technique. The primary objective of this technique is to achieve a more balanced data distribution for the image dataset of each modal and enhance the generalizability of UFM.
As shown in Fig 3, given an image pair Image_a and Image_b, a sequence of data augmentation procedures is first applied, and the augmented pair is then used to produce a pixel-level matching label, the GT matrix. The augmentation includes mirroring, flipping, and rotating the input images, which significantly increases the diversity of the dataset. The processed images are then randomly cropped. Finally, random noise is added to the cropped images, and some pixels are randomly masked [20].
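The cropped pair of images is denoted $I_a$ and $I_b$. Both images have the same size, with height $h$ and width $w$. Each image is divided into $N$ image patches. The patch coordinates can be defined as $(x_k, y_k)$, $k = 1, \dots, N$, where
$$N = \frac{h}{p} \times \frac{w}{p}; \qquad x_k \in \{0, \dots, \tfrac{w}{p} - 1\}, \quad y_k \in \{0, \dots, \tfrac{h}{p} - 1\},$$
and $p$ is the size of the patch. We apply a random mask to these image patches, covering between 20% and 40% of them. The location of the random mask is defined as $M$ and can be expressed as
$$M \subset \{1, \dots, N\}, \qquad 0.2\,N \le |M| \le 0.4\,N.$$
For positions that are not masked, the central points of the patches in $I_a$ can be defined as $(c^x_k, c^y_k)$, which can be calculated as
$$(c^x_k, c^y_k) = \left(p\,x_k + \tfrac{p}{2},\; p\,y_k + \tfrac{p}{2}\right), \qquad k \notin M.$$
The coordinates of the points in Image_b corresponding to $(c^x_k, c^y_k)$ are defined as $(\hat{c}^x_k, \hat{c}^y_k)$. They can be obtained by running the series of data augmentation operations (mirror, flip, rotate, etc.) in reverse on the coordinates $(c^x_k, c^y_k)$. The corresponding patch coordinates can then be extracted as
$$(\hat{x}_k, \hat{y}_k) = \left(\left[\hat{c}^x_k / p\right],\; \left[\hat{c}^y_k / p\right]\right),$$
where $[o]$ means rounding $o$ down.
We then define a GT matrix to represent the matching of the image patches after data augmentation. $I_a$ only partially overlaps $I_b$, so the corresponding patch coordinates may lie inside or outside $I_b$. If $(\hat{x}_i, \hat{y}_i)$ is in image $I_b$, then
$$GT(i, j) = 1 \quad \text{for the patch } j \text{ of } I_b \text{ with } (x_j, y_j) = (\hat{x}_i, \hat{y}_i),$$
where $GT(i, j) = 1$ indicates that the unmasked $i$th patch in $I_a$ matches the unmasked $j$th patch in $I_b$.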
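The construction of the patch-level label can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: patches are indexed in row-major order, the geometric relation between the two crops is supplied by a hypothetical callable `map_a_to_b` (standing in for the reversed augmentation chain), and all function names are invented for the example.

```python
import numpy as np

def patch_centres(h, w, p):
    """(x, y) centre of every p x p patch, in row-major patch order."""
    rows, cols = np.meshgrid(np.arange(h // p), np.arange(w // p), indexing="ij")
    return np.stack([cols.ravel() * p + p / 2, rows.ravel() * p + p / 2], axis=1)

def random_patch_mask(n, rng, low=0.2, high=0.4):
    """Boolean mask hiding a random 20% to 40% of the n patches."""
    k = int(round(rng.uniform(low, high) * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=k, replace=False)] = True
    return mask

def build_gt_matrix(h, w, p, map_a_to_b, mask_a, mask_b):
    """GT[i, j] = 1 when unmasked patch i of I_a lands in unmasked patch j of I_b."""
    cols = w // p
    centres_a = patch_centres(h, w, p)
    centres_in_b = map_a_to_b(centres_a)  # hypothetical coordinate mapping
    gt = np.zeros((len(centres_a), len(centres_a)), dtype=np.uint8)
    for i, (x, y) in enumerate(centres_in_b):
        inside = 0 <= x < w and 0 <= y < h
        if mask_a[i] or not inside:
            continue  # masked in I_a, or mapped outside I_b: no match
        j = int(y // p) * cols + int(x // p)  # round down to a patch index
        if not mask_b[j]:
            gt[i, j] = 1
    return gt
```

For example, with a 32 x 32 crop, 8-pixel patches, and a horizontal mirror as the mapping, the top-left patch of I_a is matched to the top-right patch of I_b.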
3.3 Pre-training
As presented in Fig 4, our proposed staged pre-training strategy aims to enhance the image matching model’s performance by leveraging large-scale image datasets from the same modality. The pre-training is divided into three stages: (1) pre-train the general FFN, (2) pre-train all X-X assistants, and (3) pre-train all X-Y assistants. In stage 1, given that optical images provide a rich source of feature matching data, we pre-train the multi-head attention and the generic feedforward network (FFN) on a substantial collection of purely optical images. In stage 2, we pre-train feature matching for all same-modal image pairs (X-X). The pre-training for the individual modals (X-X, Y-Y, Z-Z, ...) in stage 2 can be performed in parallel. We freeze the multi-head self- and cross-attention at this stage, which greatly improves training efficiency. In stage 3, we further pre-train feature matching for all cross-modal image pairs (X-Y) on the basis of stage 2. At this stage, we adjust the attention and the corresponding FFNs to maximize cross-modal matching performance. The three FFNs (X-X, Y-Y, X-Y) corresponding to the two modals are pre-trained simultaneously.
A notable consideration is the adoption of the concept of frozen attention blocks, as previously introduced in [46,47], to potentially enhance the pre-training for the same modal. Consequently, during the pre-training phase for the same modal, we retain the parameters of the multi-head attention and other modal assistants, exclusively training the assistant FFN tailored to the specific modal under consideration. In the process of training feature matching assistants for different modals, it is essential to conduct separate pre-training for each modal, ensuring that the parameters are fine-tuned accordingly. Notably, in scenarios involving different modals, such as Opt-SAR, the parameters for multi-head attention are not frozen, facilitating a more comprehensive adaptation to the diverse modals.
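The freezing rules of the three stages can be summarised as a small lookup, sketched here in plain Python. The parameter-group names are hypothetical, and keeping the generic FFN frozen after stage 1 is an assumption; the text only states explicitly that attention is frozen in stage 2 and unfrozen in stage 3.

```python
def trainable_groups(stage, modal_x=None, modal_y=None):
    """Return the set of parameter groups that receive gradients in a stage."""
    if stage == 1:
        # Stage 1: attention and the generic FFN, trained on optical images.
        return {"attention", "generic_ffn"}
    if stage == 2:
        # Stage 2: only the assistant of one modal (X-X); everything else frozen.
        if modal_x is None:
            raise ValueError("stage 2 needs the modal being pre-trained")
        return {f"assistant/{modal_x}"}
    if stage == 3:
        # Stage 3: attention plus the three assistants X-X, Y-Y and X-Y.
        if modal_x is None or modal_y is None or modal_x == modal_y:
            raise ValueError("stage 3 needs two different modals")
        pair = "-".join(sorted((modal_x, modal_y)))
        return {"attention", f"assistant/{modal_x}",
                f"assistant/{modal_y}", f"assistant/{pair}"}
    raise ValueError("stage must be 1, 2 or 3")

def freeze_plan(all_groups, stage, modal_x=None, modal_y=None):
    """Map every parameter group to True (trainable) or False (frozen)."""
    active = trainable_groups(stage, modal_x, modal_y)
    return {group: group in active for group in all_groups}
```

Because stage 2 touches a single assistant per modal, the per-modal runs share no trainable parameters, which is what allows them to proceed in parallel.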
Conventional general multi-modal image matching methods are typically trained solely on limited image data from various modals, posing a challenge in achieving optimal results. In contrast, datasets containing images of the same modal are comparatively more accessible. Leveraging the staged pre-training strategy enables the model to undergo initial pre-training on a same-modal dataset, followed by subsequent pre-training on datasets encompassing different modals. This approach significantly augments the volume of data available for pre-training and substantially enhances the model’s overall generalization capacity.
By incorporating this staged pre-training strategy, the model can effectively leverage the advantages of the larger same-modal datasets during the initial training phase. Subsequently, through continued pre-training on diverse-modal datasets, the model can gain a more comprehensive understanding of the variations and nuances across different modals, thus improving its adaptability and performance across a broader range of modals.
In the pre-training for feature matching itself, UFM is also trained in two stages: coarse matching and fine matching. In coarse matching, dual-softmax is used to obtain the matching probabilities, and the coarse matching loss can be expressed as
$$Loss_c = -\frac{1}{n} \sum_{i,j} GT_{i,j} \log P(i, j),$$
where $GT_{i,j}$ denotes the GT matrix, $P(i, j)$ represents the probability of the correct matching, and $n$ is the number of feature points.
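Dual-softmax matching and a log-likelihood loss over the GT matrix can be sketched as follows. This is an illustrative NumPy version under assumptions: the temperature value, the normalisation by the number of ground-truth matches, and the function names are not taken from the paper.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_softmax(feat_a, feat_b, temperature=0.1):
    """Matching probability: softmax over rows times softmax over columns."""
    sim = feat_a @ feat_b.T / temperature  # (N_a, N_b) similarity scores
    return softmax(sim, axis=0) * softmax(sim, axis=1)

def coarse_loss(prob, gt, eps=1e-9):
    """Negative log-likelihood of the ground-truth matches (assumed form)."""
    n = max(int(gt.sum()), 1)  # number of ground-truth matches
    return float(-(gt * np.log(prob + eps)).sum() / n)
```

A probability is high only when patch i prefers j among all patches of the second image and j prefers i among all patches of the first, which is why the product of the two softmaxes suppresses one-sided matches.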
Following [49], we use the epipolar loss. The epipolar constraint states that $j^{T} F i = 0$ holds if $i$ and $j$ are truly matched, where $F$ is the fundamental matrix and $F i$ can be interpreted as the epipolar line corresponding to $i$ in $I_b$. The epipolar loss is defined as the distance between the predicted corresponding position and the ground-truth epipolar line:
$$Loss_{ep}(i) = \mathrm{dist}\left(h_{1 \to 2}(i),\, F i\right),$$
where $h_{1 \to 2}(i)$ is the predicted correspondence in $I_b$ for the point $i$ in $I_a$, and $\mathrm{dist}(\cdot, \cdot)$ is the distance between a point and a line.
The epipolar loss by itself only encourages the predicted match to lie on the epipolar line, not to be close to the ground-truth correspondence. We therefore also introduce a cycle consistency loss that encourages the forward-backward mapping of a point to be spatially close to itself:
$$Loss_{cy}(i) = \left\| h_{2 \to 1}\left(h_{1 \to 2}(i)\right) - i \right\|_2.$$
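Both terms are simple geometric quantities, as the following NumPy sketch shows. The function names are hypothetical, and the forward and backward mappings are passed in as plain callables rather than being produced by a network.

```python
import numpy as np

def to_homogeneous(points):
    """Append a 1 to every (x, y) point."""
    points = np.asarray(points, dtype=float)
    return np.hstack([points, np.ones((len(points), 1))])

def epipolar_distance(pts_a, pred_b, fundamental):
    """Distance from each predicted point in I_b to the epipolar line F @ i."""
    lines = to_homogeneous(pts_a) @ np.asarray(fundamental, dtype=float).T
    numerator = np.abs((to_homogeneous(pred_b) * lines).sum(axis=1))
    return numerator / np.hypot(lines[:, 0], lines[:, 1])

def cycle_error(pts_a, forward, backward):
    """Distance between each point and its forward-then-backward mapping."""
    pts_a = np.asarray(pts_a, dtype=float)
    return np.linalg.norm(backward(forward(pts_a)) - pts_a, axis=1)
```

For a rectified pair, where epipolar lines are horizontal, the epipolar distance reduces to the vertical offset between a point and its predicted match, which makes the function easy to sanity-check.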
For a pair of augmented images $I_a$ and $I_b$, the dense feature descriptors extracted by the MIA Transformer are defined as $M_1$ and $M_2$. To compute the correspondence for a query point $i$ in $I_a$, we correlate the feature descriptor at $i$, denoted by $M_1(i)$, with all of $M_2$. This yields a 2D distribution over the pixel locations of $I_b$, giving the probability that each location corresponds to $i$ in $I_a$. The probability distribution can be expressed as
$$p(y \mid i, M_1, M_2) = \frac{\exp\left(M_1(i)^{T} M_2(y)\right)}{\sum_{y'} \exp\left(M_1(i)^{T} M_2(y')\right)}.$$
A single 2D match can then be computed as the expectation of this distribution:
$$h_{1 \to 2}(i) = \sum_{y} y \; p(y \mid i, M_1, M_2),$$
where $y$ varies over the pixel grid of $I_b$.
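The expectation step is a soft-argmax over the pixel grid. A minimal NumPy sketch, assuming a plain softmax over descriptor dot products and using hypothetical names, also returns the total variance of the distribution, which can serve as the uncertainty of the match.

```python
import numpy as np

def expected_match(desc_i, desc_map_b):
    """Soft-argmax match for one query descriptor.

    desc_i:     (C,) descriptor of the query point in I_a.
    desc_map_b: (H, W, C) dense descriptors of I_b.
    Returns the expected (x, y) location in I_b and the total variance.
    """
    h, w, c = desc_map_b.shape
    logits = desc_map_b.reshape(-1, c) @ desc_i
    prob = np.exp(logits - logits.max())
    prob /= prob.sum()  # softmax over all pixel locations of I_b
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    mean = prob @ grid  # expectation of the location
    variance = float(prob @ ((grid - mean) ** 2).sum(axis=1))
    return mean, variance
```

A sharply peaked distribution gives a small variance and a match near the peak, whereas a flat distribution collapses to the image centre with a large variance, so the variance is a natural signal for down-weighting unreliable points.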
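The loss at each point is re-weighted using the total variance $\sigma^2(i)$ as an uncertainty measure. The loss function for fine matching is the weighted sum of the epipolar and cycle consistency losses of the $n$ sampled query points:
$$Loss_f = \sum_{i=1}^{n} \frac{1}{\sigma(i)} \left[ Loss_{ep}(i) + Loss_{cy}(i) \right],$$
where $Loss_{ep}(i)$ and $Loss_{cy}(i)$ denote the epipolar loss and the cycle consistency loss at point $i$.
The overall loss function $Loss$ is composed of $Loss_c$ and $Loss_f$, which can be expressed as
$$Loss = Loss_c + Loss_f.$$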
3.4 Fine-tuning on different feature matching tasks
After comprehensive pre-training, UFM only requires fine-tuning on the corresponding dataset to achieve excellent matching results. For instance, when dealing with a brand new multimodal dataset, other methods typically need to be trained on the entire training set, whereas UFM usually requires only about 1/10 of the training data for fine-tuning to deliver superior matching performance. Fine-tuning covers both within-modal and cross-modal image feature matching, and it ensures that the model adapts to the specific characteristics and complexities of the matching task.
Fine-tuning of feature matching for the same modal. As illustrated in Fig 5, UFM facilitates the fine-tuning of image feature matching for individual modals. During the fine-tuning process for the same modal, UFM selectively freezes the parameters associated with the generic feedforward network (FFN) and multi-head attention, focusing solely on refining the assistant relevant to the corresponding modal. This modal assistant works in conjunction with the general FFN, utilizing a residual structure. By employing this strategy, the model can capitalize on the robust capabilities of the general FFN while making specific adjustments tailored to the data characteristics of the current modal. This approach not only enhances the model’s adaptability and performance within the same modal but also significantly reduces the computational resources required for training, ensuring a more efficient and effective fine-tuning process.
Fine-tuning of feature matching across different modals. Illustrated in Fig 6, the fine-tuning process for image feature matching across different modals in UFM involves two distinct stages. In the first stage, the assistant FFNs from two modals aid the generic FFN in extracting features. In the second stage, the assistant FFNs for modal matching assist the general FFN in executing feature matching. Throughout both stages, the parameters of the multi-head attention and general FFN remain frozen, and only the modal-specific assistant is subject to fine-tuning. Specifically, the first stage encompasses L-M layers, while the second stage comprises M layers. The total number of layers corresponds to that used for fine-tuning image feature matching within the same modal. This two-stage strategy not only leverages the specialized modal assistant to facilitate feature extraction but also enhances the effect of feature matching across different modals through cross-modal feature fusion. By integrating these complementary stages, UFM effectively optimizes the feature matching process, ensuring improved performance and enhanced adaptability across diverse modals.
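The per-layer layout of fine-tuning can be written down as a small plan, sketched below in plain Python. The group names and the concrete value used for M in the usage example are assumptions; the text only specifies that attention and the general FFN stay frozen, that the first L-M layers use the two modal assistants, and that the last M layers use the cross-modal assistant.

```python
def finetune_plan(num_layers, top_layers, modal_a, modal_b=None):
    """List, per layer, which groups are frozen and which assistants are tuned."""
    frozen = ["attention", "generic_ffn"]  # frozen in every layer
    same_modal = modal_b is None or modal_a == modal_b
    plan = []
    for layer in range(num_layers):
        if same_modal:
            # Same-modal fine-tuning: one assistant in all L layers.
            trainable = [f"assistant/{modal_a}"]
        elif layer < num_layers - top_layers:
            # First L-M layers: each modal's assistant helps feature extraction.
            trainable = [f"assistant/{modal_a}", f"assistant/{modal_b}"]
        else:
            # Last M layers: the cross-modal assistant helps feature matching.
            pair = "-".join(sorted((modal_a, modal_b)))
            trainable = [f"assistant/{pair}"]
        plan.append({"layer": layer, "frozen": frozen, "trainable": trainable})
    return plan
```

Calling `finetune_plan(9, 3, "optical", "sar")` would give six layers that tune the optical and SAR assistants followed by three layers that tune the optical-SAR assistant, with the same total depth as the same-modal case.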
4 Experiments
We pre-train the UFM on large-scale multimodal image data and evaluate the models qualitatively and quantitatively on different feature matching tasks.
4.1 Pre-training setup
Our pre-training data mainly consists of these datasets: MegaDepth [50], ScanNet [51], YFCC100M [52], OSCD [53], MRSI [54], MRSIs [37], DIRSIG [55], MS-SAR LCZ [56], Retina [57], BrainWeb [58], VIS–NIR [59], WxBS [60] and image–paint [61]. There are about 2 million image pairs in the pre-training data. After pre-training, we only need to use about 1/10 of the training set of the corresponding dataset for fine-tuning.
The model consists of a 9-layer transformer with a hidden size of 768. It is trained with the AdamW optimizer with α = 0.9 and β = 0.98. The learning rate is 1 × 10^-4. The pre-training for multi-modal image feature matching takes about a week on 8 NVIDIA Tesla V100 32GB GPUs.
4.2 Evaluation on same-modal image feature matching
In the feature matching evaluation of images from the same modal, we mainly verify feature matching between images with large camera pose differences. As shown in Fig 7, we compare the feature matching of different methods on the HPatches [62], InLoc [63] and Aachen Day-Night v1.1 [64] datasets. The InLoc and Aachen Day-Night v1.1 datasets are used for testing only and provide no training data. We fine-tuned on the HPatches dataset using 1/10 of the training data.
As shown in Table 1, we evaluate UFM and the compared algorithms on homography estimation using the HPatches benchmark and report the proportion of accurate predictions with an average corner error of less than 1, 3, and 5 pixels. Bold numbers in the table indicate the best result for each metric. Compared with these strong algorithms, UFM achieves the best results on most metrics and performs especially well on the Illumination metric.
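The corner-error metric can be made concrete with a short NumPy sketch. It assumes the common protocol of warping the four image corners with the estimated and the ground-truth homography and averaging the distances; the function names are hypothetical and the benchmark's own evaluation script may differ in details.

```python
import numpy as np

def warp_points(homography, points):
    """Apply a 3x3 homography to an array of (x, y) points."""
    points = np.asarray(points, dtype=float)
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    warped = homogeneous @ np.asarray(homography, dtype=float).T
    return warped[:, :2] / warped[:, 2:3]

def mean_corner_error(h_est, h_gt, width, height):
    """Average distance between the image corners warped by both homographies."""
    corners = np.array([[0, 0], [width - 1, 0],
                        [width - 1, height - 1], [0, height - 1]], dtype=float)
    diff = warp_points(h_est, corners) - warp_points(h_gt, corners)
    return float(np.linalg.norm(diff, axis=1).mean())

def accuracy_at_thresholds(errors, thresholds=(1, 3, 5)):
    """Share of image pairs whose corner error is below each pixel threshold."""
    errors = np.asarray(errors, dtype=float)
    return {t: float((errors < t).mean()) for t in thresholds}
```

In this form, each image pair contributes one corner error, and the reported value at a threshold is simply the fraction of pairs that fall below it.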
To verify the visual localization capability of UFM, we also estimated the 6-DOF pose of a given image with respect to the corresponding 3D scene model. We evaluated different approaches on long-term visual localization benchmarks [65]. The focus is on benchmarking indoor scene changes and day/night changes. As shown in Tables 2 and 3, UFM is very competitive on both Aachen Day-Night v1.1 and InLoc datasets. UFM surpasses the compared methods under most metrics.
4.3 Evaluation on different-modal image feature matching
In the evaluation of feature matching between images of different modals, seven datasets were used for testing. As shown in Fig 8, these are the SEN12MS dataset [66], RGB-NIR Scene dataset [59], WHU-OPT-SAR dataset [67], Optical-SAR dataset [68], BrainWeb dataset [58], NYU-Depth V2 dataset [69] and UV-Green dataset [70]. We fine-tuned on these datasets using 1/10 of the training data. UFM can handle most multi-modal feature matching problems. This experiment mainly covers optical images, SAR images, NIR images, SWIR images, depth images, UV images, green images and medical images.
Multiple modals of the same scene. In the SEN12MS dataset, multiple images correspond to different modals of the same scene. We perform mean matching accuracy (MMA) evaluation and homography estimation on image pairs from different modals of SEN12MS. Except for UFM, which is only fine-tuned, all algorithms are fully trained. The MMA results are shown in Fig 9. In this experiment, the average matching accuracy of each method is calculated for pixel error thresholds ranging from 1 to 10. An MMA curve that lies higher and further to the left indicates better feature matching performance. The results of the homography estimation are shown in Table 4. We report the area under the cumulative curve (AUC) of the corner error up to thresholds of 3, 5, and 10 pixels. The higher the AUC, the better the feature matching. The experimental results show that UFM outperforms the compared algorithms in the vast majority of cases, indicating that it is highly competitive in feature matching of different modal images of the same scene.
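Both metrics can be computed from per-match or per-pair pixel errors, as in the following NumPy sketch. It assumes the usual definitions (MMA as the share of matches within a threshold, averaged over image pairs, and AUC as the normalised area under the cumulative error curve up to a threshold); the function names are hypothetical and the paper's evaluation code may differ.

```python
import numpy as np

def mma_curve(errors_per_pair, thresholds=range(1, 11)):
    """Mean matching accuracy at each pixel threshold, averaged over pairs."""
    return np.array([
        np.mean([(np.asarray(e) <= t).mean() for e in errors_per_pair])
        for t in thresholds
    ])

def error_auc(errors, thresholds=(3, 5, 10)):
    """Normalised area under the cumulative error curve up to each threshold."""
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.concatenate([[0.0], errors])
    recall = np.concatenate([[0.0], recall])
    result = {}
    for t in thresholds:
        last = np.searchsorted(errors, t)
        x = np.concatenate([errors[:last], [t]])
        y = np.concatenate([recall[:last], [recall[last - 1]]])
        # Trapezoidal integration, written out to avoid version-specific helpers.
        area = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)
        result[t] = float(area / t)
    return result
```

An AUC of 1 at a threshold means every corner error is zero, and an AUC of 0 means no error falls below the threshold, so the metric rewards small errors more than a plain accuracy at the same threshold does.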
Multiple modals of the different scenes. In practice, multi-modal feature matching usually has to handle many different scenes. We therefore tested feature matching on multi-modal images of various scenes. All algorithms except UFM are fully trained. The MMA and homography estimation results are shown in Fig 10 and Table 5, respectively. Even with fine-tuning only, UFM outperforms the other algorithms in most cases. These experiments indicate that UFM offers good feature matching performance and generalization while saving considerable computing resources.
4.4 Visualization of image feature matching
The images of the same modal. To compare the matching results of different methods more intuitively, Fig 11 draws the matches with less than 1 pixel error as lines. To save space, we only show the results of the four best-performing methods. Compared with the other methods, UFM yields more correct matching lines, which qualitatively indicates that UFM performs well on feature matching between images of the same modal.
The images of different modals. The multi-modal feature matching results of the four best-performing methods are presented in Fig 12. The datasets from top to bottom are: SEN12MS dataset, RGB-NIR Scene dataset, WHU-OPT-SAR dataset, Optical-SAR dataset, BrainWeb dataset, NYU-Depth V2 dataset and UV-Green dataset. To compare the matching results of different methods clearly, we rotate and crop the images of many modals. The matches with less than 1 pixel error are drawn as lines. Compared with the other methods, UFM obtains more correct matches, which qualitatively indicates that UFM is competitive in multi-modal image feature matching.
4.5 Estimation on the distance error test of feature matching
Although Sect 4.4 draws lines between matched points whose pixel error is less than 1, it cannot show the specific error value of each match. To evaluate the matching accuracy of the different methods more objectively, we calculated the average distance error of the matching points whose horizontal and vertical pixel errors are both less than 1. The total error is computed from both the horizontal and vertical pixel errors, so its value may exceed 1. In this experiment, the horizontal distance error is denoted Hrmse, the vertical distance error Vrmse, and the total distance error on the image HVrmse. All errors are measured in pixels. Hrmse, Vrmse, and HVrmse are computed as:
where the two coordinate pairs are those of each matching point in the X-modal image and in the Y-modal image, respectively, and N denotes the total number of matched points.
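Read as root-mean-square errors, as the names Hrmse, Vrmse and HVrmse suggest, the three quantities can be written in the following form. This is a sketch under that assumption: the symbols $(x_i^{X}, y_i^{X})$ and $(x_i^{Y}, y_i^{Y})$ for the $i$-th matched point in the X-modal and Y-modal images, with both points expressed in a common coordinate frame, are notation introduced here for illustration.

```latex
\begin{align}
H_{rmse}  &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i^{X}-x_i^{Y}\right)^{2}}, \\
V_{rmse}  &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i^{X}-y_i^{Y}\right)^{2}}, \\
HV_{rmse} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left[\left(x_i^{X}-x_i^{Y}\right)^{2}+\left(y_i^{X}-y_i^{Y}\right)^{2}\right]}.
\end{align}
```

Under this form, $HV_{rmse}^{2} = H_{rmse}^{2} + V_{rmse}^{2}$, so when both per-axis errors are below 1 pixel the total error can approach $\sqrt{2}$, which is consistent with the remark that its value may exceed 1.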
As shown in Table 6, UFM achieves the smallest distance error in most cases. The experiments further demonstrate that UFM not only identifies the largest number of matching points with an error of less than 1 pixel, but also keeps the error of those matching points very small.
4.6 Evaluation of computational cost
To objectively evaluate the computational cost of the UFM algorithm, we compare the matching speed and resource requirements across different methods. In this experiment, the matching methods are tested on the RGB-NIR Scene dataset using an RTX 3090 GPU and an Intel i7-11700 processor. Matching speed is measured in frames per second (FPS), and resource usage is assessed in terms of storage requirements (MB). A higher FPS indicates faster inference, while lower storage values reflect reduced resource usage. As shown in Table 7, UFM achieves the fastest matching speed. Among the methods compared, UFM, LoFTR, and FeMIP are semi-dense matching techniques, while the remaining methods are sparse matching approaches. Generally, semi-dense methods offer higher matching accuracy but tend to demand more resources. Notably, UFM requires the least resources among the semi-dense methods.
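Matching speed in frames per second can be measured along the following lines. This is a minimal sketch rather than the benchmarking script used in the experiments: `match_fn` is a hypothetical callable that matches one image pair, and the warm-up count is an arbitrary choice.

```python
import time

def measure_fps(match_fn, image_pairs, warmup=2):
    """Return the number of image pairs matched per second by `match_fn`."""
    # Warm-up runs are excluded from timing so that one-off setup costs
    # (memory allocation, caching) do not distort the measurement.
    for img_a, img_b in image_pairs[:warmup]:
        match_fn(img_a, img_b)

    # Note: on a GPU, kernels run asynchronously, so the device would need to
    # be synchronised before reading the clock (for example with
    # torch.cuda.synchronize() in PyTorch); this sketch times CPU-side calls.
    start = time.perf_counter()
    for img_a, img_b in image_pairs:
        match_fn(img_a, img_b)
    elapsed = time.perf_counter() - start

    if elapsed == 0:
        return float("inf")
    return len(image_pairs) / elapsed
```

Averaging over many pairs, rather than timing a single pair, reduces the influence of timer resolution and run-to-run variation on the reported FPS.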
4.7 Ablation study
To fully assess the role of different modules in UFM, seven variants were designed. In the 7th variant, pre-training is omitted, and the model is trained using the entire training dataset. As shown in Tables 8 and 9, we conduct ablation experiments on feature matching using images from the same modal and from different modals, respectively. The results of these experiments demonstrate that UFM outperforms all variants in feature matching. The combination of data augmentation and the MIA transformer is particularly effective, with the Assistant FFN in the MIA transformer having the most significant impact. The Generic FFN has a lesser effect on the matching performance. Data augmentation, Assistant FFN, and fine-tuning have a substantial influence on feature matching across different modals, while their impact on same-modal feature matching is relatively smaller. Although the 7th variant is trained on the full dataset, it still underperforms compared to the complete UFM, highlighting the necessity of the pre-training and fine-tuning mechanisms we have designed.
4.8 Limitation
Although UFM has been shown to deliver good matching results in most cases, it also has some limitations. As illustrated in Fig 13, UFM struggles to achieve accurate matching when dealing with multi-modal images that have very few textures. This is a common challenge for feature matching methods, and other approaches also face difficulties in such scenarios. While UFM has proven to be highly generalizable, its pre-training becomes less effective when confronted with unseen modal images. In such cases, as with other methods, it is necessary to train the model on the entire dataset to achieve accurate matching results.
5 Conclusion
This paper introduces UFM, a Unified Feature Matching model designed for fine-tuning across a broad range of modal images using a shared Multimodal Image Assistant (MIA) Transformer. MIA Transformers, serving as multimodal assistants, enhance a generic Feedforward Network (FFN) by encoding modal-specific information. The shared self-attention and cross-attention mechanisms enable improved feature matching across different modals. Through the incorporation of data augmentation and staged pre-training, UFM demonstrates significantly enhanced pre-training effectiveness on multimodal images. Experimental results showcase UFM’s capability to achieve excellent performance in diverse feature matching tasks.
Future work will enhance UFM further by expanding the pre-training dataset and integrating image data from less common modals. We are also considering a transition from the semi-dense matching framework to a dense matching framework to improve matching accuracy, and we plan to scale up the model used in UFM pre-training. In addition, ongoing research explores downstream tasks of image feature matching, with an emphasis on integrating the unified feature matching algorithm into diverse applications.
References
- 1. Kulkarni SC, Rege PP. Pixel level fusion techniques for SAR and optical images: a review. Inform Fusion. 2020;59:13–29.
- 2. Deng S, Deng L, Wu X, Ran R, Hong D, Vivone G. PSRT: Pyramid shuffle-and-reshuffle transformer for multispectral and hyperspectral image fusion. IEEE Trans Geosci Rem Sens 2023;61(1):1–15.
- 3. Peng Z, Ma Y, Zhang Y, Li H, Fan F, Mei X. Seamless UAV hyperspectral image stitching using optimal seamline detection via graph cuts. IEEE Trans Geosci Rem Sens 2023;61(1):1–13.
- 4. Zhang Z, Yang X, Xu C. Natural image stitching with layered warping constraint. IEEE Trans Multimedia. 2023;25:329–38.
- 5. Yin W, Zhang J, Wang O, Niklaus S, Chen S, Liu Y, et al. Towards accurate reconstruction of 3D scene shape from a single monocular image. IEEE Trans Pattern Anal Mach Intell 2023;45(5):6480–94. pmid:36197868
- 6. Song S, Truong KG, Kim D, Jo S. Prior depth-based multi-view stereo network for online 3D model reconstruction. Pattern Recogn. 2023;136:109198.
- 7. Yenkikar A, Babu CN. AirBERT: a fine-tuned language representation model for airlines tweet sentiment analysis. IDT 2023;17(2):435–55.
- 8. He Y, Zhang Q, Wang S, Chen Z, Cui Z, Guo Z-H, et al. Predicting the sequence specificities of DNA-binding proteins by DNA fine-tuned language model with decaying learning rates. IEEE/ACM Trans Comput Biol Bioinform 2023;20(1):616–24. pmid:35389869
- 9. Deng Y, Karam LJ. Frequency-tuned universal adversarial attacks on texture recognition. IEEE Trans Image Process. 2022;31:5856–68. pmid:36054395
- 10. Ellahyani A, Jaafari IE, Charfi S, Ansari ME. Fine-tuned deep neural networks for polyp detection in colonoscopy images. Pers Ubiquit Comput 2022;27(2):235–47.
- 11. Radford A, Kim J, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning. 2021;139:8748–63.
- 12. Jia C, Yang Y, Xia Y, Chen Y, Parekh Z, Pham H. Scaling up visual and vision-language representation learning with noisy text supervision. Proceedings of the 38th International Conference on Machine Learning. 2021;139:4904–16.
- 13. Giang K, Song S, Jo S. TopicFM: Robust and Interpretable Topic-Assisted Feature Matching. Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence. 2023;2447–55.
- 14. Cai Y, Li L, Wang D, Li X, Liu X. HTMatch: An efficient hybrid transformer based graph neural network for local feature matching. Signal Processing. 2023;204:108859.
- 15. Luo H, Xie T, Wang A, Dai K, Cao C, Zhao L. CorMatcher: a corners-guided graph neural network for local feature matching. Expert Syst Appl. 2024;258:125190.
- 16. Xu W, Yuan X, Hu Q, Li J. SAR-optical feature matching: a large-scale patch dataset and a deep local descriptor. Int J Appl Earth Obs Geoinform. 2023;122:103433.
- 17. Di Y, Liao Y, Zhu K, Zhou H, Zhang Y, Duan Q. MIVI: multi-stage feature matching for infrared and visible image. The Visual Computer. 2023;1–13.
- 18. Di Y, Liao Y, Zhou H, Zhu K, Zhang Y, Duan Q, et al. FeMIP: detector-free feature matching for multimodal images with policy gradient. Appl Intell 2023;53(20):24068–88.
- 19. Hu M, Sun B, Kang X, Li S. Multiscale structural feature transform for multi-modal image matching. Inform Fusion. 2023;54–54.
- 20. Lu Y, Lu G. SuperThermal: matching thermal as visible through thermal feature exploration. IEEE Robot Autom Lett 2021;6(2):2690–7.
- 21. Tang G, Liu Z, Xiong J. Distinctive image features from illumination and scale invariant keypoints. Multimed Tools Appl 2019;78(16):23415–42.
- 22. Rublee E, Rabaud V, Konolige K, Bradski G. ORB: An efficient alternative to SIFT or SURF. Proc IEEE Int Conf Comput Vis. 2011;2564–71.
- 23. Zhu M, Song H, Xu J, Jiang X, Zhang Y, Ma J, et al. Introgression of ZmCPK39 in maize hybrids enhances resistance to gray leaf spot disease without compromising yield. Mol Breed 2025;45(3):28. pmid:40013268
- 24. Li J, Xu W, Shi P, Zhang Y, Hu Q. LNIFT: Locally normalized image for rotation invariant multimodal feature matching. IEEE Trans Geosci Rem Sens 2022;60(1):1–14.
- 25. Sarlin P, DeTone D, Malisiewicz T, Rabinovich A. SuperGlue: Learning feature matching with graph neural networks. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020;4937–46. https://openaccess.thecvf.com/content_CVPR_2020/html/Sarlin_SuperGlue_Learning_Feature_Matching_With_Graph_Neural_Networks_CVPR_2020_paper.html
- 26. Jiang B, Sun P, Luo B. GLMNet: Graph learning-matching convolutional networks for feature matching. Pattern Recogn. 2022;121:108167.
- 27. Shen Z, Sun J, Wang Y, He X, Bao H, Zhou X. Semi-dense feature matching With transformers and its applications in multiple-view geometry. IEEE Trans Pattern Anal Mach Intell 2023;45(6):7726–38. pmid:36409815
- 28. Fisher AN, Stinson DA, Kalajdzic A, Dupuis HE, Lowey EE, Desgrosseilliers E, et al. “A recipe for disaster?”: Female-Breadwinner relationships threaten heterosexual scripts. Sex Roles. 2025;91(3):16. pmid:39990977
- 29. Huang D, Chen Y, Liu Y, Liu J, Xu S, Wu W, et al. Adaptive Assignment for Geometry Aware Local Feature Matching. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023;5425–34.
- 30. Mok TCW, Chung ACS. Affine medical image registration with coarse-to-fine vision transformer. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022;20803–12.
- 31. Xie T, Dai K, Wang K, Li R, Zhao L. DeepMatcher: A deep transformer-based network for robust and accurate local feature matching. Expert Syst Appl. 2024;237(Part A):121361.
- 32. Dai K, Xie T, Wang K, Jiang Z, Li R, Zhao L. OAMatcher: an overlapping areas-based network for accurate local feature matching. CoRR 2023.
- 33. Truong P, Danelljan M, Gool LV, Timofte R. Learning accurate dense correspondences and when to trust them. IEEE Conference on Computer Vision and Pattern Recognition. 2021;5714–24.
- 34. Truong P, Danelljan M, Timofte R, Van Gool L. PDC-Net+: Enhanced Probabilistic Dense Correspondence Network. IEEE Trans Pattern Anal Mach Intell 2023;45(8):10247–66. pmid:37027599
- 35. Edstedt J, Athanasiadis I, Wadenbäck M, Felsberg M. DKM: dense kernelized feature matching for geometry estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023;75–75.
- 36. Edstedt J, Sun Q, Bokman G, Wadenback M, Felsberg M. RoMa: revisiting robust losses for dense feature matching. CoRR. 2023;abs.
- 37. Zhu B, Yang C, Dai J, Fan J, Qin Y, Ye Y. R2FD2: fast and robust matching of multimodal remote sensing images via repeatable feature detector and rotation-invariant feature descriptor. IEEE Trans Geosci Remote Sensing. 2023;61:1–15.
- 38. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North. 2019;4186–4186.
- 39. Dong L, Yang N, Wang W, Wei F, Liu X, Wang Y, et al. Unified language model pre-training for natural language understanding and generation. Adv Neural Inform Process Syst. 2019;32:13042–54.
- 40. Yang S, Lei X. Reciprocal causation relationship between rumination thinking and sleep quality: a resting-state fMRI study. Cogn Neurodyn 2025;19(1):41. pmid:39991016
- 41. Bao H, Dong L, Piao S, Wei F. BEiT: BERT pre-training of image transformers. The Tenth International Conference on Learning Representations. 2022;2022. https://openreview.net/forum?id=p-BhZSz59o4
- 42. Chen Y, Li L, Yu L, Kholy AE, Ahmed F, Gan Z, et al. UNITER: UNiversal Image-TExt Representation Learning. In: Vedaldi A, Bischof H, Brox T, Frahm J, eds. Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX. vol. 12375 of Lecture Notes in Computer Science. Springer; 2020;104–120. https://doi.org/10.1007/978-3-030-58577-8_7
- 43. Su W, Zhu X, Cao Y, Li B, Lu L, Wei F, et al. VL-BERT: Pre-training of generic visual-linguistic representations. OpenReview.net. 2020.
- 44. Muresan S, Nakov P, Villavicencio A. Fine-tuning for transformer-based masked language-models. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2022;9–9.
- 45. Sohn K, Chang H, Lezama J, Polania L, Zhang H, Hao Y, et al. Visual prompt tuning for generative transfer learning. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023;19840–51.
- 46. Bao H, Wang W, Dong L, Liu Q, Mohammed OK, Aggarwal K, et al. VLMo: unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems. 2022;35.
- 47. Wang W, Bao H, Dong L, Bjorck J, Peng Z, Liu Q, et al. Image as a foreign language: BEIT pretraining for vision and vision-language tasks. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023;19175–86.
- 48. DeTone D, Malisiewicz T, Rabinovich A. SuperPoint: self-supervised interest point detection and description. 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018;224–36.
- 49. Wang Q, Zhou X, Hariharan B, Snavely N. Learning feature descriptors using camera pose supervision. In: Vedaldi A, Bischof H, Brox T, Frahm J, eds. Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part I. vol. 12346 of Lecture Notes in Computer Science. Springer; 2020;757–774.
- 50. Li Z, Snavely N. MegaDepth: learning single-view depth prediction from internet photos. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018;2041–50.
- 51. Dai A, Chang AX, Savva M, Halber M, Funkhouser TA, Nießner M. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017. 2017;2432–43.
- 52. Thomee B, Shamma D, Friedland G, Elizalde B, Ni K, Poland D. YFCC100M: the new data in multimedia research. Commun ACM 2016;59(2):64–73.
- 53. Daudt RC, Saux BL, Boulch A, Gousseau Y. Urban change detection for multispectral earth observation using convolutional neural networks. 2018 IEEE International Geoscience and Remote Sensing Symposium, IGARSS 2018. 2018;2115–8.
- 54. Yao Y, Zhang Y, Wan Y, Liu X, Yan X, Li J. Multi-modal remote sensing image matching considering co-occurrence filter. IEEE Trans Image Process. 2022;31:2584–97. pmid:35286258
- 55. Nilosek D, Walvoord DJ, Salvaggio C. Assessing geoaccuracy of structure from motion point clouds from long-range image collections. Opt Eng 2014;53(11):113112.
- 56. Hong D, Gao L, Yokoya N, Yao J, Chanussot J, Du Q, et al. More diverse means better: multimodal deep learning meets remote-sensing imagery classification. IEEE Trans Geosci Remote Sensing 2021;59(5):4340–54.
- 57. Ma J, Zhao J, Jiang J, Zhou H, Guo X. Locality preserving matching. Int J Comput Vis 2018;127(5):512–31.
- 58. Collins DL, Zijdenbos AP, Kollokian V, Sled JG, Kabani NJ, Holmes CJ, et al. Design and construction of a realistic digital brain phantom. IEEE Trans Med Imaging 1998;17(3):463–8. pmid:9735909
- 59. Brown M, Susstrunk S. Multi-spectral SIFT for scene category recognition. CVPR 2011. 2011;184–184.
- 60. Mishkin D, Matas J, Perdoch M, Lenc K. WxBS: wide baseline stereo generalizations. Proceedings of the British Machine Vision Conference 2015. 2015;12.1-12.12.
- 61. Shrivastava A, Malisiewicz T, Gupta A, Efros AA. Data-driven visual similarity for cross-domain image matching. ACM Trans Graph 2011;30(6):1–10.
- 62. Balntas V, Lenc K, Vedaldi A, Mikolajczyk K. HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017;3852–61.
- 63. Taira H, Okutomi M, Sattler T, Cimpoi M, Pollefeys M, Sivic J, et al. InLoc: indoor visual localization with dense matching and view synthesis. IEEE Trans Pattern Anal Mach Intell 2021;43(4):1293–307. pmid:31722474
- 64. Zhang Z, Sattler T, Scaramuzza D. Reference pose generation for long-term visual localization via learned features and view synthesis. Int J Comput Vis 2021;129(4):821–44. pmid:34720404
- 65. Toft C, Maddern W, Torii A, Hammarstrand L, Stenborg E, Safari D, et al. Long-term visual localization revisited. IEEE Trans Pattern Anal Mach Intell 2022;44(4):2074–88. pmid:33074802
- 66. Roßberg T, Schmitt M. Estimating NDVI from Sentinel-1 SAR data using deep learning. IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium. 2022;5–5.
- 67. Li X, Zhang G, Cui H, Hou S, Wang S, Li X, et al. MCANet: a joint semantic segmentation framework of optical and SAR images for land use classification. Int J Appl Earth Obs Geoinformation. 2022;106:102638.
- 68. Liao Y, Di Y, Zhou H, Li A, Liu J, Lu M, et al. Feature matching and position matching between optical and SAR with local deep feature descriptor. IEEE J Sel Top Appl Earth Observations Remote Sensing. 2022;15:448–62.
- 69. Silberman N, Hoiem D, Kohli P, Fergus R. Indoor segmentation and support inference from RGBD images. In: Fitzgibbon AW, Lazebnik S, Perona P, Sato Y, Schmid C, eds. Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V. vol. 7576 of Lecture Notes in Computer Science. Springer; 2012. p. 746–760.
- 70. Differt D, Möller R. Spectral skyline separation: extended landmark databases and panoramic imaging. Sensors (Basel) 2016;16(10):1614. pmid:27690053
- 71. Dusmanu M, Rocco I, Pajdla T, Pollefeys M, Sivic J, Torii A, et al. D2-Net: a trainable CNN for joint description and detection of local features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019;8092–101.
- 72. Revaud J, de Souza CR, Humenberger M, Weinzaepfel P. R2D2: reliable and repeatable detector and descriptor. In: Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada; 2019;12405–12415. https://proceedings.neurips.cc/paper/2019/hash/3198dfd0aef271d22f7bcddd6f12f5cb-Abstract.html.
- 73. Luo Z, Zhou L, Bai X, Chen H, Zhang J, Yao Y, et al. ASLFeat: learning local features of accurate shape and localization. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020;6588–97.
- 74. Zhou Q, Sattler T, Leal-Taixé L. Patch2Pix: epipolar-guided pixel-level correspondences. IEEE Conference on Computer Vision and Pattern Recognition. 2021;4669–78.
- 75. Sarlin P, Unagar A, Larsson M, Germain H, Toft C, Larsson V, et al. Back to the feature: learning robust camera localization from pixels to pose. IEEE Conference on Computer Vision and Pattern Recognition. 2021;3247–57.
- 76. Li X, Wang S, Zhao Y, Verbeek J, Kannala J. Hierarchical scene coordinate classification and regression for visual localization. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020;11980–9.
- 77. Sarlin P, Cadena C, Siegwart R, Dymczyk M. From coarse to fine: robust hierarchical localization at large scale. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019;12716–25.
- 78. Melekhov I, Brostow GJ, Kannala J, Turmukhambetov D. Image stylization for robust features. CoRR. 2020.
- 79. Zhou Y, Fan H, Gao S, Yang Y, Zhang X, Li J, et al. Retrieval and localization with observation constraints. 2021 IEEE International Conference on Robotics and Automation (ICRA). 2021;5237–44.
- 80. Jiang W, Trulls E, Hosang J, Tagliasacchi A, Yi KM. COTR: Correspondence Transformer for Matching Across Images. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021;6187–97.
- 81. Xufeng H, Leung T, Jia Y, Sukthankar R, Berg AC. MatchNet: unifying feature and metric learning for patch-based matching. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015;3279–86.
- 82. Balntas V, Riba E, Ponsa D, Mikolajczyk K. Learning local feature descriptors with triplets and shallow convolutional neural networks. Proceedings of the British Machine Vision Conference 2016. 2016;119.
- 83. Mishchuk A, Mishkin D, Radenovic F, Matas J. Working hard to know your neighbor’s margins: local descriptor learning loss. Advances in Neural Information Processing Systems. 2017;30:4826–37.