Automating the search for a patent’s prior art with a full text similarity search

doi:10.1371/journal.pone.0212103

Fig 1.

Illustration of the presented novel approach to the search for a patent’s prior art.

First, a dataset of patent applications is obtained from a patent database by starting from a few manually selected seed patents and recursively including the patent applications they cite. Then, the patent texts are transformed into feature vectors, and the similarity between two documents is computed from these vectors. Finally, patents that are very similar to a new target patent application are returned as possible prior art. An appropriate similarity measure for this process should assign high similarity scores to related patents (e.g. where one patent was cited in the search report of the other) and low scores to unrelated (randomly paired) patents. We compare different similarity measures by using the AUC score to quantify the overlap between the similarity score distribution of related patent pairs and that of randomly paired patents.
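
The evaluation idea in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the four toy texts, the pair lists, and the helper names `bow_vector`, `cosine_similarity`, and `auc` are hypothetical, and raw word counts stand in for whatever term weighting the paper actually uses.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lowercase the text, split it into word tokens, and count them."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def auc(related_scores, random_scores):
    """Fraction of (related, random) score pairs in which the related
    pair scores higher; ties count as one half."""
    wins = 0.0
    for r in related_scores:
        for u in random_scores:
            if r > u:
                wins += 1.0
            elif r == u:
                wins += 0.5
    return wins / (len(related_scores) * len(random_scores))


# Hypothetical toy corpus (assumption): A/B and C/D play the role of
# cited pairs, the cross pairs play the role of randomly paired patents.
texts = {
    "A": "rotor blade for a wind turbine with a reinforced blade root",
    "B": "wind turbine rotor blade comprising a reinforced root section",
    "C": "method for brewing coffee using a pressurized water chamber",
    "D": "coffee brewing apparatus with a pressurized chamber for water",
}
vectors = {key: bow_vector(text) for key, text in texts.items()}

related_pairs = [("A", "B"), ("C", "D")]
random_pairs = [("A", "C"), ("B", "D"), ("A", "D"), ("B", "C")]

related_scores = [cosine_similarity(vectors[p], vectors[q]) for p, q in related_pairs]
random_scores = [cosine_similarity(vectors[p], vectors[q]) for p, q in random_pairs]

toy_auc = auc(related_scores, random_scores)
print(f"AUC on toy data: {toy_auc:.2f}")
```

An AUC of 1.0 means the two score distributions do not overlap at all, while 0.5 means the similarity measure cannot tell related pairs from random ones. The pairwise loop is quadratic in the number of scores, which is fine for a sketch; a rank-based computation would be the usual choice for larger datasets.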

Table 1.

Evaluation results on the cited/random dataset.

Fig 2.

Distributions of cosine similarity scores.

Similarity scores for the patent pairs are computed using bag-of-words (BOW) feature vectors generated either from the full texts (left) or from the claims sections only (right). The y-axis scale is irrelevant and was therefore omitted.
