Picture semantic similarity search based on bipartite network of picture-tag type

Searching similar pictures for a given picture is an important task in numerous applications, including image recommendation systems, image classification and image retrieval. Previous studies mainly focused on content similarity, measuring similarity via visual features such as color and shape, and few of them paid enough attention to semantics. In this paper, we propose a link-based semantic similarity search method, namely PictureSim, for effectively searching similar pictures by building a picture-tag network. The picture-tag network is built from "description" relationships between pictures and tags, in which tags and pictures are treated as nodes, and relationships between pictures and tags are regarded as edges. We then design a TF-IDF-based model to remove noisy links, so that traversals of these links can be reduced. We observe that "similar pictures contain similar tags, and similar tags describe similar pictures", which is consistent with the intuition of SimRank. Consequently, we utilize the SimRank algorithm to compute similarity scores between pictures. Compared with content-based methods, PictureSim can effectively search similar pictures semantically. Extensive experiments on real datasets demonstrate the effectiveness and efficiency of PictureSim.


Introduction
Searching similar pictures for a given picture is an important task in numerous applications. Typical examples include medical image classification [1], image forgery detection [2], image recommendation systems [3] and image cluster analysis [4], in which pictures play an important role. Traditional picture similarity search methods compute similarities based on visual features, namely Content-Based Image Retrieval (CBIR), including color and shape, e.g., GIST [5], SIFT [6] and SURF [7]. [8] aggregates local deep features to produce compact global descriptors. [9] proposes FTS for fractal-based local search. [10] proposes VWIaC and FIbC to build smaller and larger codebooks for salient objects within pictures. [11] proposes Iterative Search (IS) to search similar pictures effectively, which extracts knowledge from similar pictures to compensate for information missing from the feature extraction process. [12] measures similarities between pictures by using distributed environments and LSH in a distributed scheme. [13] employs an improvement of the D-index method to reduce queries on a dynamic network. SimRank* [28] remedies the "zero-similarity" problem of SimRank, enriching semantics without suffering from increased computational overhead. PRSim [29] builds on the main concepts of SLING, leveraging the power-law structure of graphs to answer SimRank queries efficiently, and establishes a connection between SimRank and personalized PageRank. UniWalk [30] calculates similarities between objects based on Monte Carlo sampling, and can directly locate the top-k similar vertices for any single source via R sampling paths originating from that source. SimPush [31] speeds up query processing by identifying a small number of nodes, then computing statistics and performing residue pushes from these nodes. These measures have been applied in numerous applications, such as spam detection [32], web page ranking [33] and citation analysis [34].
Table 1 summarizes several picture similarity search methods, including content-based and link-based. Compared with the latest content-based metrics, link-based similarity measures could capture the semantic information of pictures based on a picture-tag network, while content-based methods mainly focus on searching similar pictures in visual features, which might neglect the expected similar pictures and deviate from the user's intention. Moreover, the intuition of link-based methods is that "two pictures are similar if they are related to similar pictures", which could search underlying similar pictures. For example, picture A is similar to picture B, and picture A is similar to picture C, so picture B is similar to picture C.
In this paper, we propose a link-based picture semantic similarity search method, namely PictureSim, for effectively searching similar pictures by building a picture-tag network. We first build a picture-tag network based on "description" relationships between pictures and tags, and then exploit the object-to-object relationships [36,37] in the picture-tag network. The intuition behind PictureSim is that "similar pictures contain similar tags, and similar tags describe similar pictures", which is consistent with the intuition of SimRank. Consequently, we adopt the SimRank model [21] to compute similarity scores, which helps to find underlying similar pictures semantically. Our main contributions are as follows.
• We build a picture-tag network by "description" relationships between pictures and tags. Initially, tags and pictures are treated as nodes, and relationships between pictures and tags are regarded as edges. Then, we propose a TF-IDF-based method to remove the noisy links by setting a threshold, which could measure whether a tag has good classification performance.
• We propose a link-based picture similarity search algorithm, namely PictureSim, for effectively searching similar pictures semantically, which considers the context structure of the network to find underlying similar pictures, and which can respond to user queries in a timely manner.
• We ran a comprehensive set of experiments on Nipic datasets and ImageNet datasets. Our results show that PictureSim achieves semantic similarity search between pictures, which produces a better correlation with human judgments compared with content-based methods.

Methods
In this section, we present a framework for top-k picture semantic similarity search, which is divided into two stages. The first stage builds a picture-tag network from "description" relationships between pictures and tags, in which pictures and tags are regarded as nodes and the relationships between them as edges; we then remove noisy links based on the TF-IDF model, in which a few uninformative tags are removed. In the second stage, we use the SimRank algorithm to search the top k most similar pictures for a given picture. Compared with content-based methods, PictureSim can achieve semantic similarity by building a picture-tag network, while content-based methods can only achieve visual similarity, and users usually judge similarities based on semantics rather than visual features.

Problem definition
For subsequent discussions, we first define the top-k picture semantic similarity search. Definition 1 (Top-k picture semantic similarity search). Given a picture-tag network, a query picture q in the network, and a positive integer k < n, the top-k picture semantic similarity search finds the k pictures most similar to q in terms of semantics, ranked in descending order of similarity.

Network building
Definition of picture-tag network. Tags are descriptive keywords used to discriminate objects. For example, web tags are a way to organize Internet content: they help users classify and describe content for web retrieval. The purpose of tag generation is to find the semantic information of a given object, and many approaches to generating tags have therefore been developed, including user annotation and machine generation. For example, Oriol et al. [38] proposed an attention mechanism that maps each word of a picture's generated description to a certain area of the picture. Therefore, there is semantic information between a tag and a picture, which provides an important guarantee for semantic similarity computation. The review network [39], an extension of [38], can learn the annotations and initial states for the decoder steps. In the picture-tag network, tags can fully express the semantic information of pictures, which helps search similar pictures semantically. The picture-tag network is defined as:

G = (V, E), where V = V_P ∪ V_T, and V_P and V_T represent the sets of pictures and tags respectively; E denotes the set of edges of "description" relationships between pictures and tags, and an edge e = (p, t) ∈ E indicates that picture p ∈ V_P is described by tag t ∈ V_T.
In a picture dataset, a "description" relationship between a picture and a tag builds a link, and all of the "description" relationships together build a picture-tag network. Fig 1 is a toy picture-tag network: pictures and tags are treated as nodes, and "description" relationships between pictures and tags are treated as edges. Fig 1 shows that the first picture is described by several tags, including "antique decoration", "wooden finish", "showcase", etc. A link between the first picture and "antique decoration" represents a "description" relationship, and a "description" relationship and a "be described" relationship exist simultaneously. Similarly, a tag can describe several pictures; for example, "showcase" describes three pictures. These tags can fully illustrate the semantic information of pictures, which correlates better with human judgments in similarity search.
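The construction above can be sketched in a few lines. This is a minimal illustration, assuming the dataset arrives as (picture, tag) "description" pairs; the function name build_picture_tag_network and the sample identifiers are ours, not the paper's:

```python
from collections import defaultdict

def build_picture_tag_network(descriptions):
    """Build a bipartite picture-tag network from "description" pairs.

    descriptions: iterable of (picture, tag) pairs, one per edge.
    Returns (out_neighbors, in_neighbors): the tags describing each
    picture, and the pictures described by each tag.
    """
    out_neighbors = defaultdict(set)  # picture -> tags that describe it
    in_neighbors = defaultdict(set)   # tag -> pictures it describes
    for picture, tag in descriptions:
        out_neighbors[picture].add(tag)
        in_neighbors[tag].add(picture)
    return out_neighbors, in_neighbors

# Toy data mirroring Fig 1: "showcase" describes all three pictures.
pairs = [("p1", "antique decoration"), ("p1", "wooden finish"),
         ("p1", "showcase"), ("p2", "showcase"),
         ("p2", "shopwindow"), ("p3", "showcase")]
O, I = build_picture_tag_network(pairs)
```

The two adjacency maps O and I correspond directly to the out-neighbor and in-neighbor notation |O(p)| and |I(t)| used in the TF-IDF discussion below.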
Removing noisy links. Noisy links are links to tags that cannot effectively discriminate pictures when computing similarities. They not only affect search results but also incur expensive time and space overhead, so they need to be removed. Term Frequency-Inverse Document Frequency (TF-IDF) [40] is a promising method for finding noisy links. It is a statistical method that assesses whether a tag is important for a picture: if a tag describes a picture and rarely appears in the descriptions of other pictures, the tag has good discrimination ability. Term frequency (TF) indicates how often a tag t appears in a picture p, and is defined as TF(t, p) = n_{t,p} / |O(p)|, where n_{t,p} is the number of times tag t is associated with picture p, and we set n_{t,p} as 1.
No duplicate tag t appears in the description of a picture p, because tags differ from ordinary text, and |O(p)| is the number of out-neighbors of picture p. The inverse document frequency (IDF) measures the universal importance of a tag, and is defined as IDF(t) = log(|n_p| / |I(t)|), where |n_p| is the total number of pictures in the dataset and |I(t)| is the number of in-neighbors of tag t. Based on TF and IDF, the TF-IDF value for tag t and picture p is defined as TF-IDF(t, p) = TF(t, p) × IDF(t). Intuitively, a tag has good recognition performance if it has a high TF-IDF value, and tags with lower TF-IDF should be removed to avoid affecting the results. We therefore remove noisy links with low TF-IDF according to a threshold δ, defined as δ = (max − min) × h + min, where h ∈ (0, 1) and max and min are the maximum and minimum TF-IDF values. In a picture-tag network, links whose TF-IDF values are lower than δ are removed before similarity computation.
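The pruning step can be sketched as follows. This is a hedged illustration of the TF and IDF definitions above and of the threshold δ = (max − min) × h + min, under the paper's assumption n_{t,p} = 1; the function name prune_noisy_links is our own:

```python
import math

def prune_noisy_links(out_neighbors, in_neighbors, h=0.8):
    """Keep only links whose TF-IDF reaches δ = (max - min) * h + min.

    out_neighbors: picture -> set of tags (O(p)).
    in_neighbors:  tag -> set of pictures (I(t)).
    Returns the surviving (picture, tag) edges.
    """
    n_pictures = len(out_neighbors)  # |n_p|
    tfidf = {}
    for p, tags in out_neighbors.items():
        for t in tags:
            tf = 1.0 / len(tags)  # n_{t,p} = 1, divided by |O(p)|
            idf = math.log(n_pictures / len(in_neighbors[t]))
            tfidf[(p, t)] = tf * idf
    lo, hi = min(tfidf.values()), max(tfidf.values())
    delta = (hi - lo) * h + lo
    return {edge for edge, w in tfidf.items() if w >= delta}
```

A tag that describes every picture gets IDF = log(1) = 0 and is always pruned, matching the intuition that such a tag has no discrimination ability.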

Similarity model
Link-based approaches search similar pictures semantically by building a picture-tag network, and SimRank [21] can be regarded as one of the most attractive methods, because it considers not only direct in-links among nodes but also indirect in-links. SimRank is a general model that can be applied in any similarity search field, and it is suitable for bipartite networks. There are other link-based similarity measures, such as PageSim [41], P-Rank [22] and SimRank* [28]. P-Rank enriches SimRank by jointly encoding both in- and out-link relationships into structural similarity computation; since the picture-tag network is bipartite, SimRank and P-Rank are equivalent on it. PageSim and SimRank* consider paths of unequal length to search similar pictures, whereas PictureSim only considers paths of equal length.
PictureSim uses SimRank to compute similarity in a picture-tag network. Our key observation is that "similar pictures contain similar tags, and similar tags describe similar pictures". As shown in Fig 1, the similarity score between the first picture and itself is 1, and similarly for "showcase". Clearly, the three pictures are similar: all have the tag "showcase", and the reason we can conclude that the three pictures are similar is that they are all described by "showcase". The first picture is described by "wooden finish" while the second picture is described by "shopwindow", and these are similar tags in the sense that they describe similar pictures.
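The mutual recursion above can be made concrete with a naive iterative SimRank on the bipartite network: each picture-picture score averages the scores of their tag pairs, each tag-tag score averages the scores of their picture pairs, and both are damped by the decay factor c. This is a minimal sketch for small networks, not the optimized algorithm the paper ultimately uses:

```python
def simrank_bipartite(O, I, c=0.8, iterations=5):
    """Naive SimRank iteration on a picture-tag bipartite network.

    O: picture -> set of tags, I: tag -> set of pictures, c: decay factor.
    Returns similarity maps over picture pairs and tag pairs.
    """
    pics, tags = list(O), list(I)
    s_pic = {(a, b): 1.0 if a == b else 0.0 for a in pics for b in pics}
    s_tag = {(a, b): 1.0 if a == b else 0.0 for a in tags for b in tags}
    for _ in range(iterations):
        new_pic = {}
        for a in pics:
            for b in pics:
                if a == b:
                    new_pic[(a, b)] = 1.0
                elif O[a] and O[b]:
                    total = sum(s_tag[(u, v)] for u in O[a] for v in O[b])
                    new_pic[(a, b)] = c * total / (len(O[a]) * len(O[b]))
                else:
                    new_pic[(a, b)] = 0.0
        new_tag = {}
        for a in tags:
            for b in tags:
                if a == b:
                    new_tag[(a, b)] = 1.0
                elif I[a] and I[b]:
                    total = sum(s_pic[(u, v)] for u in I[a] for v in I[b])
                    new_tag[(a, b)] = c * total / (len(I[a]) * len(I[b]))
                else:
                    new_tag[(a, b)] = 0.0
        s_pic, s_tag = new_pic, new_tag
    return s_pic, s_tag
```

For two pictures described by the same single tag, the score converges to c after one iteration, since the tag's self-similarity is 1.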

On-line query processing
Based on Eq (6), the similarities between pictures can be computed in the off-line stage. A straightforward method to find the top k similar pictures is to choose the k most similar pictures based on the pre-computed similarity scores, then sort and return them. Though this saves time in the on-line stage, expensive operations are required in the off-line stage, which involve O(n²) space cost and O(l·d²·n²) time cost at the l-th iteration, where n is the number of nodes in the network and d is the average degree of the nodes; we set l from 1 to 7 in terms of time and cost overhead. Therefore, the computation becomes inefficient when the picture-tag network grows large. Fortunately, there are many optimization techniques for SimRank similarity search, e.g., TopSim [35], Par-SR [26] and ProbeSim [27], which search similar pictures without any preprocessing; a typical example is TopSim. TopSim focuses on computing exact SimRank efficiently: it uses the neighborhood to describe the structural context of a node, then merges certain random walk paths by maintaining a similarity map at each step. Therefore, PictureSim optimizes the efficiency of SimRank with the TopSim algorithm without any preprocessing, which requires O(d^{2l}) time cost in the on-line stage.
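The straightforward pre-computation variant described above reduces, at query time, to sorting the stored scores. A sketch, with our own function name top_k_similar and a pair-keyed score map as the assumed storage format:

```python
def top_k_similar(query, sim, k=10):
    """Return the top-k pictures most similar to `query`, descending.

    sim: dict mapping (picture_a, picture_b) -> pre-computed SimRank score.
    Ties are broken by picture id for a deterministic ranking.
    """
    candidates = [(b, s) for (a, b), s in sim.items()
                  if a == query and b != query]
    candidates.sort(key=lambda x: (-x[1], x[0]))  # score desc, then id
    return candidates[:k]
```

In practice a heap (e.g. heapq.nlargest) would avoid sorting all candidates, but as the section notes, the sorting overhead is negligible next to the similarity computation itself.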

Results
In this section, experimental results are reported on real datasets. Experiments were run on a 2.3 GHz Intel(R) Core i5 CPU with 8 GB of main memory. All algorithms were implemented in Java using Eclipse Java 2018.

Datasets and evaluation
In the experiments, we extract picture-tag networks from the Nipic dataset (http://www.nipic.com/index.html) and the ImageNet dataset (http://www.image-net.org/) to evaluate our approach. Nipic contains 37,221 pictures, 58,623 tags and 610,440 "description" relationships. The parameter h is set to 0.8 to remove noisy links if not specified explicitly, after which 283,079 links remain. We select the sub-dataset ILSVRC-2012 from ImageNet, which contains 50,000 pictures, 1,000 tags and 50,000 "description" relationships.
We implemented four contrast algorithms to evaluate effectiveness: the SimRank algorithm [21] and three content-based algorithms, namely Minkowski Distance (MD) [42], Histogram Intersection (HI) [43] and Relative Deviation (RD). We use TopSim [35] to improve the efficiency of SimRank, which only needs to find candidates from the local neighborhood without traversing the entire network; TopSim is used on a homogeneous network in [35], and we use it on a heterogeneous network. The decay factor c of SimRank is set to 0.8. MD defines a family of distances that measure the distance between points. In HI, each feature set is mapped to a multi-resolution histogram that preserves each feature's distinctness at the finest level. RD judges similarities by calculating relative deviation.
In the dataset, we randomly pick 20 pictures to test the effectiveness of different algorithms for the top-k query with k = 50. Effectiveness is evaluated by Mean Average Precision (MAP), which is formally defined as MAP = (Σ_{q=1}^{Q} AveP(q)) / Q, where Q is the number of query pictures and AveP(q) is the average precision of query picture q. MAP scores are computed according to similarity levels, which are set at six levels: 0 (dissimilar), 0.2 (potentially similar), 0.4 (marginally similar), 0.6 (moderately similar), 0.8 (highly similar) and 1 (completely similar). The similarity levels are labeled by people, which serves as a gold standard because the semantic similarity of pictures is judged by users' understanding of them.

Table 2 shows the MAP scores of different metrics on Nipic, where PictureSim sets l to 5. The MAP scores of PictureSim are clearly higher than those of traditional content-based methods across different k. For example, at k = 15, PictureSim achieves an average MAP of 0.599, while RD and HI yield an average MAP of 0.119. This is because PictureSim computes similarity scores from the structure of the context in the picture-tag network, while traditional content-based approaches consider visual features, which often fail to reflect the semantic information in the user's mind.

Fig 2(a) shows the MAP scores with varying l in Nipic, which clearly illustrates the effect of l in PictureSim. We observe that the MAP scores increase slowly as l increases from 1 to 5, because PictureSim considers not only direct in-links among nodes but also indirect in-links. After l = 5, the MAP scores become stable and PictureSim converges to a stable state, so the returned rankings become stable empirically after the fifth iteration. Fig 2(b) shows the MAP scores of PictureSim with varying k in Nipic. The MAP scores gradually decrease as k increases, achieving an average MAP of 0.718 at k = 5.
This is because pictures with higher similarity scores rank higher in the returned list. Generally, users are only interested in the top 10 similar pictures for a given picture, so PictureSim can satisfy the user's intention. Fig 3(a) shows the MAP scores of PictureSim with varying h: the MAP scores drop continuously and the curves reach the bottom at h = 0.9. This is because more noisy links are removed as h becomes large; however, some useful links may also be removed as h increases, and consequently the MAP scores decrease. The curve at l = 1 is an exception: its MAP scores are relatively stable from h = 0.1 to 0.7 and then decrease evidently as h increases, because at l = 1 only direct in-links among nodes are considered, while the other curves consider both direct and indirect in-links. Similar results can be found in Fig 3(b), which shows the MAP scores of PictureSim with varying h where k = 10; the result can be explained similarly, since the change in MAP is similar to Fig 3(a).
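The MAP metric used throughout these experiments can be sketched as follows. The paper does not specify how AveP(q) handles the six graded similarity levels, so treating any level at or above a cut-off as relevant is our assumption, as are the function names:

```python
def average_precision(ranked_levels, threshold=0.5):
    """AveP(q) for one query: precision averaged at each relevant hit.

    ranked_levels: human-labeled similarity levels (0..1) of the
    returned pictures, in rank order. Items with level >= threshold
    are treated as relevant (an assumption; the paper uses six
    graded levels without stating a cut-off).
    """
    hits, precisions = 0, []
    for rank, level in enumerate(ranked_levels, start=1):
        if level >= threshold:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(all_queries, threshold=0.5):
    """MAP = (sum over the Q queries of AveP(q)) / Q."""
    return (sum(average_precision(q, threshold) for q in all_queries)
            / len(all_queries))
```

For a returned list with relevant pictures at ranks 1 and 3, AveP is (1/1 + 2/3) / 2, and MAP simply averages this quantity over the 20 query pictures.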

Nipic
To compare the performance of different algorithms from the user's perspective in Nipic, including semantics, color and shape, we calculate the MAP scores of the top 10 similar pictures, shown in Table 3, where l = 5 in PictureSim. Obviously, PictureSim has relatively higher MAP scores than traditional content-based metrics in terms of semantics, while the comparison methods have relatively higher MAP scores in terms of shape and color. This is because we pay more attention to whether the tags fully express the semantics of pictures rather than visual features, and color and shape often fail to fully express the semantic information of a picture.

Fig 4(a) shows the running time with varying l in Nipic, in which the running time increases slowly before l = 5 and rapidly after l = 5. This is because PictureSim also considers indirect in-links when searching similar pictures, and it needs to traverse more paths as l increases. Fortunately, PictureSim converges rapidly at l = 5, as shown in Fig 2(a), which shows the good performance of the proposed approach. Fig 4(b) shows the running time with varying k in Nipic. We observe that the running time remains almost stable as k increases, which indicates that the time overhead does not change with k. This is because the running time is affected by the sorting rather than the similarity calculation, and the sorting overhead is almost negligible compared with the computational overhead. The running time fluctuates significantly at l = 7 due to the instability of the machine.

Fig 5(a) shows the running time with varying h in Nipic. We observe that the running time decreases as h increases from 0.1 to 0.9: it drops rapidly from h = 0 to 0.6, and afterward decreases slowly as h increases. With a larger h, more noisy links are removed from the picture-tag network, which indicates that the efficiency can be significantly improved after h = 0.6. So we set h to 0.8 if not specified in other experiments. Fig 5(b) shows the running time with varying h in Nipic.
The figure illustrates that the running time decreases as h increases, except for the curve of l = 1, which remains stable, since at l = 1 we only consider direct in-links among nodes and the time change is minor as h increases. At the same l, the larger the network, the longer the running time; the reason is that PictureSim iteratively calculates the similarities between pictures, which makes the running time increase evidently.

Fig 6(a) shows the MAP scores of PictureSim with varying l in ImageNet. We observe that the MAP scores of PictureSim are relatively lower than those on Nipic. The reason is that each picture is described by only one tag, which fails to fully express the semantic information of the picture. Moreover, the MAP scores fluctuate irregularly as l increases, including the rankings of top 5, top 10 and top 15, because PictureSim finds all similar pictures at l = 1 with the same similarity scores, and the returned rankings differ only due to the sort algorithm. Fig 6(b) shows the MAP scores of PictureSim with varying k in ImageNet. The results are similar to Fig 2(b), but the curve fluctuates more than in Fig 2(b), and the difference between maximum and minimum is smaller than on Nipic; the reason is as mentioned above.

Fig 7(a) shows the running time with varying l in ImageNet, in which the running time increases as l increases. But the time overhead is very small: ImageNet takes 0.004 s at l = 7, while Nipic needs 7.3 s. This is because each picture is described by only one tag, so the picture-tag network of ImageNet is very sparse. Fig 7(b) shows the running time with varying k in ImageNet, where l = 1, 3, 5, 7. The result is similar to Fig 4(b), and the reason is as mentioned above, but the fluctuation of the curve is relatively evident compared with Nipic: because the time overhead on ImageNet is very small, the sorting overhead is relatively evident.

Fig 8(a) shows the running time with varying n in ImageNet: the running time slowly increases at l = 1, 3, 5 and obviously increases at l = 7. Because PictureSim iteratively computes similarity scores, the running time increases obviously with l as the network becomes larger; and because the ImageNet network is very sparse, the time overhead is very small. Fig 8(b) shows the running time with varying n in Nipic. The figure illustrates that the running time slowly increases with n at l = 1, 3, 5, and obviously increases at l = 7, especially as n varies from 30,000 to 37,200. The reason is that the picture-tag network becomes denser as n increases, so more paths are traversed as l increases, which takes more time to obtain similar pictures. When the network is large, the time overhead increases exponentially with l.

Conclusion
This paper proposes a semantic similarity search method, namely PictureSim, for effectively searching similar pictures by building a picture-tag network. Compared with content-based methods, PictureSim can effectively and efficiently search similar pictures, producing a better correlation with human judgments. Empirical studies on real datasets demonstrate the effectiveness and efficiency of the proposed approach. Since PictureSim is proposed for searching semantically similar pictures, future work will extend our approach to other datasets for effectively searching similar objects in other fields. Moreover, PictureSim requires O(d^{2l}) time cost, and the number of paths increases exponentially with path length, which makes the computation expensive in terms of time and space and cannot support fast similarity search over large networks. So we will focus on reducing the computational overhead to ensure timely responses in large networks.