Fig 1.
Illustration of the presented approach to searching for a patent’s prior art.
First, a dataset of patent applications is obtained from a patent database by starting from a few manually selected seed patents and recursively including the patent applications they cite. Then, the patent texts are transformed into feature vectors, and the similarity between two documents is computed from these feature vectors. Finally, patents considered very similar to a new target patent application are returned as possible prior art. An appropriate similarity measure for this process should assign high similarity scores to related patents (e.g., where one patent was cited in the search report of the other) and low scores to unrelated (randomly paired) patents. We compare different similarity measures by using the AUC score to quantify the overlap between the similarity score distributions of related pairs and of randomly paired patents.
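The evaluation idea in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the four toy texts are invented placeholders rather than real patents, the BOW vectors use raw term counts with whitespace tokenization (the actual feature weighting may differ), and the helper names `bow_vector`, `cosine_similarity`, and `auc_score` are hypothetical.

```python
import math
from collections import Counter


def bow_vector(text):
    """Bag-of-words vector: lowercase, split on whitespace, count terms."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def auc_score(related_scores, random_scores):
    """Fraction of (related, random) score pairs in which the related
    pair scores higher; ties count as one half."""
    wins = 0.0
    for r in related_scores:
        for u in random_scores:
            if r > u:
                wins += 1.0
            elif r == u:
                wins += 0.5
    return wins / (len(related_scores) * len(random_scores))


# Invented toy texts standing in for patent documents (not real data).
docs = {
    "A": "spinal implant screw fixation rod",
    "B": "bone screw for spinal fixation rod",
    "C": "wireless antenna signal receiver",
    "D": "antenna array for wireless signal",
}
vectors = {doc_id: bow_vector(text) for doc_id, text in docs.items()}

# Assumed labels: "related" pairs play the role of cited pairs,
# "random" pairs the role of randomly paired patents.
related_pairs = [("A", "B"), ("C", "D")]
random_pairs = [("A", "C"), ("B", "D")]

related_scores = [cosine_similarity(vectors[x], vectors[y]) for x, y in related_pairs]
random_scores = [cosine_similarity(vectors[x], vectors[y]) for x, y in random_pairs]

auc = auc_score(related_scores, random_scores)
print(f"AUC = {auc:.2f}")
```

An AUC of 1 means the two score distributions do not overlap at all, whereas 0.5 means the similarity measure separates related from random pairs no better than chance.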
Table 1.
Evaluation results on the cited/random dataset.
Fig 2.
Distributions of cosine similarity scores.
Similarity scores for the patent pairs are computed using BOW feature vectors generated either from the full texts (left) or from the claims sections only (right). The y-axis scale is irrelevant and was therefore omitted.
Table 2.
Confusion matrix for the dataset subsample.
Table 3.
Correlations between labels and similarity scores on the dataset subsample.
Fig 3.
Score correlation for the patent with ID US20150018885.
A false negative (ID US20110087291) caught by the cosine similarity is circled in gray.
Table 4.
Summary of evaluation results.