Fig 1.
Workflow for comparing Text 1 and Text 2.
Step I: Perform exact binomial testing for each word, measuring the fit of the word’s occurrences to a binomial allocation model. Step II: Conduct Higher Criticism (HC) on the per-word binomial allocation p-values and use it as an index of discrepancy between the texts. HC assesses the global significance of the p-values by comparing their z-scores to the uniform empirical process. Words associated with p-values smaller than the HC threshold are considered to provide meaningful discrimination between Text 1 and Text 2.
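The two steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made here for illustration: the function names (`exact_binom_pvalue`, `word_pvalues`, `higher_criticism`) are hypothetical; the allocation probability for a word is taken as the share of Text 1 among all remaining tokens; the HC statistic uses one common standardization of the sorted p-values; and the search is restricted to the smallest fraction `gamma` of p-values. The pure-Python binomial test is intended for small counts only.

```python
import math
from collections import Counter


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binom_pvalue(k, n, p):
    """Two-sided exact p-value: total mass of outcomes no more likely than k."""
    observed = binom_pmf(k, n, p)
    cutoff = observed * (1 + 1e-9)  # small tolerance for floating-point ties
    total = sum(m for m in (binom_pmf(j, n, p) for j in range(n + 1)) if m <= cutoff)
    return min(1.0, total)


def word_pvalues(tokens1, tokens2):
    """Step I: one binomial allocation p-value per word (assumed null model)."""
    c1, c2 = Counter(tokens1), Counter(tokens2)
    n1, n2 = len(tokens1), len(tokens2)
    pvals = {}
    for word in set(c1) | set(c2):
        k1, k2 = c1[word], c2[word]
        rest1, rest2 = n1 - k1, n2 - k2  # tokens other than this word
        if rest1 + rest2 == 0:
            continue  # no other tokens, so the null probability is undefined
        p_null = rest1 / (rest1 + rest2)  # assumption: share of Text 1 in the rest
        pvals[word] = exact_binom_pvalue(k1, k1 + k2, p_null)
    return pvals


def higher_criticism(pvals, gamma=0.4):
    """Step II: HC statistic, HC threshold p-value, and the words at or below it."""
    ranked = sorted(pvals.items(), key=lambda kv: kv[1])
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two p-values")
    # Search only the smallest gamma-fraction of p-values; never reach i == n.
    limit = min(n - 1, max(1, int(gamma * n)))
    best_hc, best_i = -math.inf, 1
    for i in range(1, limit + 1):
        frac = i / n
        p_i = ranked[i - 1][1]
        z = math.sqrt(n) * (frac - p_i) / math.sqrt(frac * (1 - frac))
        if z > best_hc:
            best_hc, best_i = z, i
    threshold = ranked[best_i - 1][1]
    selected = [w for w, p in ranked if p <= threshold]
    return best_hc, threshold, selected


if __name__ == "__main__":
    text1 = ["x"] * 40 + ["a", "b", "c", "d", "e"] * 10
    text2 = ["a", "b", "c", "d", "e"] * 10 + ["y"] * 40
    hc, threshold, words = higher_criticism(word_pvalues(text1, text2))
    print(hc, threshold, sorted(words))
```

In this toy input, the words that occur in only one of the two texts receive very small p-values and fall below the HC threshold, while the evenly shared words do not.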
Fig 2.
Examined biblical data displayed using the HC-discrepancy values.
Each point corresponds to a chapter and is labeled accordingly; it indicates the chapter's HC-discrepancy with respect to each of the three corpora (D, DtrH, and P). For the purpose of validation, only the convex hull of the chapters was colored, based on the ground-truth attribution (yellow for D, blue for DtrH, and pine green for P).
Fig 3.
Discriminating lemmas for each corpus (D, DtrH, and P) are presented in three graphs (left to right: D, DtrH, P). Each graph lists 20 lemmas, in order of importance, that discriminate one corpus from the union of the other two.
Table 1.
p-values of ground-truth chapters for the authorship attribution task.
Table 2.
Summary of the discriminating words of the eight chapters for which the algorithm and the opinion of biblical scholars do not match.
Table 3.
Likelihood values of the authorship attribution of the additional texts.