Fig 1.
Our method identifies authors of anonymous scientific manuscripts by leveraging both the information contained in the text and the citations.
We encode the main text using DistilBERT [5] and combine this encoding with a feature vector extracted from the cited references. The encodings are subsequently fused by a two-layer classification MLP. It outputs the log-likelihoods that the given anonymous paper has been (co-)authored by each of the more than 2000 authors included in our novel dataset.
Table 1.
Summary of the datasets used in this work.
Fig 2.
Our proposed network architecture consists of two separate feature encoders for the different input modalities followed by an MLP network with a logit output layer.
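The architecture described in this caption can be sketched as a toy forward pass. This is a minimal illustration and not the authors' implementation: fusion by concatenation, the ReLU activation, the random initialisation, all dimensions, and the names `fuse_and_classify`, `init_layer`, `text_vec` and `ref_vec` are assumptions, and the two input vectors are random stand-ins for the DistilBERT encoding and the reference feature vector.

```python
import random

def linear(x, weights, bias):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def fuse_and_classify(text_vec, ref_vec, hidden, output):
    """Concatenate both encodings (assumed fusion) and apply a two-layer MLP."""
    fused = text_vec + ref_vec            # list concatenation of the two modalities
    h = relu(linear(fused, *hidden))      # first MLP layer (ReLU is an assumption)
    return linear(h, *output)             # second layer: one logit per author

rng = random.Random(0)
TEXT_DIM, REF_DIM, HIDDEN_DIM, NUM_AUTHORS = 8, 4, 16, 5  # toy sizes, all assumed

hidden = init_layer(TEXT_DIM + REF_DIM, HIDDEN_DIM, rng)
output = init_layer(HIDDEN_DIM, NUM_AUTHORS, rng)

text_vec = [rng.gauss(0, 1) for _ in range(TEXT_DIM)]  # stand-in for the text encoding
ref_vec = [rng.gauss(0, 1) for _ in range(REF_DIM)]    # stand-in for reference features

logits = fuse_and_classify(text_vec, ref_vec, hidden, output)
# Rank candidate authors from most to least likely according to the logits.
ranking = sorted(range(NUM_AUTHORS), key=lambda a: logits[a], reverse=True)
```

Because a paper can have several co-authors, a ranking over all authors is kept here rather than a single arg-max prediction.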
Table 2.
Comparison of our DistilBERT (“Content”) architecture with other methods on the most common authorship attribution benchmark datasets.
Fig 3.
The two plots visualize the results presented in Table 3.
On the left, the ‘non-C’ datasets, which contain only the first 512 words of each paper, are used; on the right, the full paper is used. Although the authorship attribution (AA) accuracy degrades with an increasing number of authors, our approach retains an impressive 73.4% for 2070 authors.
Table 3.
This table summarizes the authorship identification accuracy (in %) of our method on the test split of the different arXiv datasets.
On the largest dataset D100-C our approach achieves 73.4% correct authorship attribution.
Table 4.
Comparison of training times on an Nvidia Quadro RTX 8000 GPU for the best model from Table 3. The last column reports the increase in accuracy when the full document (C) is used.
Table 5.
Results on the trimmed datasets with a decreasing number of papers per author available for training and testing.
Even at less than 1/8th of the original data, the network still retains 75% of its performance on the D200-C dataset. When the number of authors is increased 20 times with only 25 papers per author (D50T25-C), the performance drops by a mere 6 percentage points.
Fig 4.
The histogram on the left shows the attribution accuracy distribution for the three datasets D100-C, D200-C and D300-C.
The boxplot on the right presents the summary statistics corresponding to the histogram on the left. Both diagrams show that the accuracy of our method is relatively consistent across the authors in the dataset.
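As a rough illustration of how an accuracy distribution across authors can be obtained, the sketch below computes one accuracy value per author from true and predicted labels. It assumes a single ground-truth author per paper, which simplifies the (co-)authorship setting; the function name `per_author_accuracy` and the toy labels are hypothetical.

```python
from collections import defaultdict

def per_author_accuracy(true_authors, predicted_authors):
    """Fraction of each author's papers that were attributed to that author."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(true_authors, predicted_authors):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return {author: hits[author] / totals[author] for author in totals}

# Toy labels, purely illustrative (not taken from the datasets in this work).
true_authors = ["a", "a", "b", "b", "b", "c"]
predicted = ["a", "b", "b", "b", "b", "a"]
accuracy = per_author_accuracy(true_authors, predicted)
```

The resulting per-author values are the kind of quantity that a histogram or boxplot can then summarize.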
Fig 5.
The boxplot shows the attribution accuracy of our method as a function of the dataset size (color coding) and the number of papers per author (groups on the x-axis).
As intuitively expected, more samples per author increase the attribution accuracy. Interestingly, for a given number of papers per author (e.g. 100) one can see that an increased overall dataset size (e.g. D100-C vs. D200-C) yields higher attribution accuracy.
Table 6.
Accuracy [%] for the metrics (1)-(4).
For a more fine-grained analysis, we present the results on the entire dataset (overall) as well as on the subsets of papers with one author (single author) or several authors (multiple authors).
Table 7.
Ablation of the learning rate for 10 epochs.