On the Reconstruction of Text Phylogeny Trees: Evaluation and Analysis of Textual Relationships

Over the history of mankind, textual records change. Sometimes this happens due to mistakes during transcription, sometimes on purpose, as a way to rewrite facts and reinterpret history. There are several classical cases, such as the logarithmic tables, and the transmission of antique and medieval scholarship. Today, text documents are largely edited and redistributed on the Web. Articles on news portals and collaborative platforms (such as Wikipedia), source code, posts on social networks, and even scientific publications or literary works are some examples in which textual content can be subject to changes in an evolutionary process. In this scenario, given a set of near-duplicate documents, it is worthwhile to find which one is the original and the history of changes that created the whole set. Such functionality would have immediate applications in news tracking services, detection of plagiarism, textual criticism, and copyright enforcement, for instance. However, this is not an easy task, as textual features pointing to the documents' evolutionary direction may not be evident and are often dataset dependent. Moreover, side information, such as time stamps, is neither always available nor reliable. In this paper, we propose a framework for reliably reconstructing text phylogeny trees, and for seamlessly exploring new approaches in a wide range of text-reuse scenarios. We employ distinct combinations of dissimilarity measures and reconstruction strategies within the proposed framework, and evaluate each approach with extensive experiments, including a set of artificial near-duplicate documents with known phylogeny, and a set of documents collected from Wikipedia, whose modifications were made by Internet users. We also present results from qualitative experiments in two different applications: text plagiarism and reconstruction of evolutionary trees for manuscripts (stemmatology).


Dataset construction
In this section, we present more details regarding the parameters and procedure used to construct the Synthetic and Wikipedia datasets. For both datasets, the text files, the ground truth, the phylogenies reconstructed by our algorithm, and the tree structures are available at http://www.recod.ic.unicamp.br/ oikawa/datasets.html#text.

Synthetic dataset
In our algorithm for generating the trees in the synthetic dataset, we used weights such that the operations are chosen with probability proportional to these weights. The weights, set empirically, were as listed below. When one of the word operations is chosen, only one word is modified at a time, counting as a single editing operation. When an operation is performed on a sentence, multiple words are affected, so the total count of editing operations equals the length (in words) of the sentence. In addition, to simulate sentence insertion, it is necessary to cache some sentences from the original text document, so that when this operation is chosen, sentences are available that preserve the text's coherence; the proportion of held-out sentences was set to 0.2. The operations are performed until the editing limit L is reached, which varies in the interval I = {5%, 10%, ..., 50%}, in steps of 5%. Algorithm 1 (Experiments section) summarizes the construction of our synthetic dataset.
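The generation procedure can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the operation names, the weight values, and the placeholder tokens are hypothetical stand-ins for the empirically tuned configuration described above.

```python
import random

# Illustrative operation weights -- hypothetical stand-ins for the
# empirically tuned weights listed in the text.
OPERATIONS = {
    "insert_word": 0.30,
    "remove_word": 0.30,
    "substitute_word": 0.20,
    "remove_sentence": 0.10,
    "insert_sentence": 0.10,
}

def edit_document(words, held_out_sentences, limit_fraction, seed=0):
    """Apply weighted random edits until the editing budget L is spent.

    Word operations count as one edit; sentence operations count one edit
    per word they touch. `held_out_sentences` caches sentences taken from
    the original document (a 0.2 proportion in the text above) so that
    inserted sentences keep the text coherent.
    """
    rng = random.Random(seed)
    ops, weights = zip(*OPERATIONS.items())
    budget = int(len(words) * limit_fraction)  # editing limit L
    words, edits = list(words), 0
    while edits < budget and words:
        op = rng.choices(ops, weights=weights)[0]
        i = rng.randrange(len(words))
        if op == "insert_word":
            words.insert(i, "<word>")          # placeholder token
            edits += 1
        elif op == "remove_word":
            del words[i]
            edits += 1
        elif op == "substitute_word":
            words[i] = "<word>"
            edits += 1
        elif op == "remove_sentence":
            span = min(10, len(words) - i)     # stand-in for a sentence boundary
            del words[i:i + span]
            edits += span                      # one edit per removed word
        elif op == "insert_sentence" and held_out_sentences:
            sent = rng.choice(held_out_sentences)
            words[i:i] = sent
            edits += len(sent)                 # one edit per inserted word
    return words
```

Applying `edit_document` repeatedly, each time to a document already in the tree, yields the parent-child structure of a synthetic phylogeny.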

Wikipedia dataset
Regarding the articles in the Wikipedia dataset, we collected them from the Featured Articles category as available up to March 2014, disregarding changes to references, tables of contents, and images. Furthermore, as mentioned in Section 4.1.2, in our Wikipedia dataset we considered not only linear trees, but also trees with branchings, obtained by applying the revision revert algorithm (Algorithm 2 in the manuscript). To give a general idea of the structure of the trees in this dataset, we analyzed the number of linear and non-linear trees it contains, as shown in Table A. It is important to notice that, in this dataset, the number of linear trees depends on the number of nodes: the higher the number of nodes, the higher the chance that there is a reversion, and therefore a branching, in the tree. Thus, as shown in Table A, there is a large number of linear trees with 10 nodes, and this number decreases as the number of nodes increases; among the 304 trees with 400 nodes each, there is only one linear tree. Furthermore, in Table B, we report the number of children per node over the entire dataset, to show how the nodes are distributed across branchings. Although most nodes have only one child, there is some variation; for instance, one node has 38 children in a tree with 400 documents. We included this additional information about the construction of the dataset in the manuscript's supplementary material, along with the dataset's description.
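The branching behavior induced by reverts can be illustrated with a small sketch. This is a hypothetical reconstruction of the revert idea, not the paper's Algorithm 2: a revision whose content is identical to an earlier revision is treated as a revert, so subsequent revisions branch from the earlier node instead of extending the chain.

```python
def build_tree_with_reverts(revisions):
    """Build a phylogeny with branchings from an ordered revision history.

    Sketch of the revert idea (hypothetical, not the manuscript's exact
    Algorithm 2): if revision i is identical to an earlier revision j, it
    is treated as a revert, and the next revision attaches to node j
    rather than to the end of the chain. Returns a dict mapping each kept
    revision index to its parent index (None for the root).
    """
    parent = {0: None}
    seen = {revisions[0]: 0}      # content -> earliest node holding it
    tip = 0                       # node the next revision attaches to
    for i in range(1, len(revisions)):
        content = revisions[i]
        if content in seen:       # revert: jump back to the old node
            tip = seen[content]
            continue              # duplicate content adds no new node
        parent[i] = tip
        seen[content] = i
        tip = i
    return parent
```

For the history `["a", "b", "a", "c"]`, revision 2 reverts to revision 0, so revision 3 becomes a second child of node 0, producing a branching.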
We have also performed two additional experiments to provide more details about the characteristics of the Wikipedia dataset. In Table C, we present the average edit distance, in number of words, between parent and child nodes for |T | ∈ {10, 20, 30, 40, 50}. With these results, we intended to show that the similarity between the documents in the dataset is very high, making the inference of the phylogeny trees more difficult. The editing percentage in the last column gives the average distance between parent and child nodes, normalized by the number of words in the parent's document (to compensate for effects on the score when the documents do not have similar lengths).
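The normalized measure can be computed as below. This is a straightforward word-level Levenshtein distance divided by the parent's length; it is a sketch of the normalization described above, and the function names are ours.

```python
def word_edit_distance(parent_words, child_words):
    """Levenshtein distance over word tokens (insert/delete/substitute cost 1),
    computed with the standard two-row dynamic program."""
    m, n = len(parent_words), len(child_words)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if parent_words[i - 1] == child_words[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # delete a parent word
                         cur[j - 1] + 1,     # insert a child word
                         prev[j - 1] + cost) # keep or substitute
        prev = cur
    return prev[n]

def editing_percentage(parent_text, child_text):
    """Edit distance normalized by the parent's length in words,
    as in the last column of Table C."""
    p, c = parent_text.split(), child_text.split()
    return word_edit_distance(p, c) / len(p)
```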
To complement the results in Table C, we show in Table D, for each tree size, the number of edges that were edited according to the editing interval in the first column. For instance, for trees with 30 nodes (|T |=30), 93.126% of the edges have an editing amount between 0.0% and 0.5%, while only 0.197% of the edges have an editing amount between 95.0% and 100.0%.

Experimental Results
In this section, we present additional experiments performed on the Synthetic and Wikipedia datasets. Table E shows the number of times (%) that the reconstructed root or one of its neighbors was the actual root (Neighbor probability), for all dissimilarity functions combined with the minimum-cost heuristic, considering L = 50%. These results complement those shown in Table 1 of the submitted manuscript. In Figure A, we show the average results (over tree sizes) for the progressive editing limit case of the synthetic dataset, for all dissimilarity functions combined with the minimum-cost heuristic. Among all approaches, the reconstruction using tf-idf word-based 1-, 2-, 3-grams presented the best results, followed by the normalized edit distance. The unnormalized edit distance presented slightly better results than NCD up to L = 35%, when document length begins to become a relevant issue as the transformations grow more intense. These results also show the importance of normalization: while the remaining, normalized methods maintained robustness after reaching L = 35%, the results of the unnormalized edit distance started to deteriorate rapidly. Nonetheless, the reconstruction of the undirected tree was nearly perfect over most of the L-range for all dissimilarity functions, as evidenced by the Indirect Edges metric (Figure A(a)). Figure A complements the graphics shown in Figure 6 of the submitted manuscript.
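The Normalized Compression Distance mentioned above can be sketched as follows, using zlib as a stand-in compressor (the compressor actually used in the experiments is not specified in this section):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance with zlib as the compressor C:

        NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))

    where C(s) is the length in bytes of the compressed string s.
    Values near 0 indicate near-duplicates; values near 1 indicate
    unrelated documents.
    """
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because the compressed size of a document concatenated with itself barely exceeds that of the document alone, `ncd(a, a)` is much smaller than `ncd(a, b)` for unrelated texts, which is what makes the measure usable as a dissimilarity function.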

Minimum-cost heuristic
In Figure B, we show a detailed comparison over tree sizes, considering the best performing dissimilarity function, tf-idf. The plots reveal a small variation of accuracy with respect to tree size, with Directed Edges (Figure B(b)) being the metric with the most disperse results, improving as the tree size increases. This is not unexpected: as the Roots metric is roughly constant over tree sizes, and Depth is close to 1, we expect proportionately fewer wrong edges in larger trees.
To finish the analysis of the synthetic case with progressive editing limit, we present the results considering editing limit L = 50% in Table F, divided by tree size. In this case, tf-idf also outperformed the other methods in all tree sizes and metrics.
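One simple way to realize a minimum-cost reconstruction over a dissimilarity matrix is a Prim-style greedy tree growth. This is a hypothetical sketch under our own assumptions, not the manuscript's exact heuristic: starting from a chosen root, it repeatedly attaches the unplaced document whose dissimilarity to some placed document is smallest.

```python
def min_cost_tree(dissim, root):
    """Greedy (Prim-style) tree reconstruction over a dissimilarity matrix.

    Hypothetical sketch of a minimum-cost heuristic: `dissim[i][j]` is the
    (possibly asymmetric) dissimilarity from candidate parent i to
    candidate child j. Starting from `root`, repeatedly attach the cheapest
    edge from a placed node to an unplaced one. Returns a parent list with
    parent[root] == None.
    """
    n = len(dissim)
    parent = [None] * n
    placed = {root}
    while len(placed) < n:
        i, j = min(
            ((i, j) for i in placed for j in range(n) if j not in placed),
            key=lambda e: dissim[e[0]][e[1]],
        )
        parent[j] = i
        placed.add(j)
    return parent
```

With an asymmetric dissimilarity, trying each node as the root and keeping the tree of lowest total cost would also yield a root estimate, which is the quantity evaluated by the Roots metric.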
For the synthetic case with mixed editing limit, the results are also promising, showing the robustness of the proposed approach when reconstructing trees composed of a non-homogeneous combination of mildly (L ≤ 25%) or heavily (25% < L ≤ 50%) modified documents. In agreement with the progressive editing limit case, the best results were achieved by tf-idf word-based 1-, 2-, 3-grams, as presented in Table G. Tables F and G complement the corresponding results in the submitted manuscript. Table I shows the results using Random Forests for the synthetic and the Wikipedia datasets. In comparison to SVMs, we had similar performance on the synthetic data, and obtained improvements on the Wikipedia trees.

Supervised Machine Learning
The accuracy for the Roots metric was above 45% for all tree sizes, along with moderate improvements in the Directed Edges, Ancestry, and Depth metrics. These results complement those presented in Table 5 of the submitted manuscript.
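A supervised formulation of the direction problem can be sketched as follows. The feature set here is hypothetical, chosen only for illustration (the manuscript's actual descriptors are not reproduced in this section): each candidate edge is described by length statistics and by the asymmetry of a dissimilarity function, and the resulting vectors would then be fed to a classifier such as an SVM or a Random Forest.

```python
def edge_features(parent_text, child_text, d):
    """Feature vector for classifying the direction of an edge.

    Hypothetical feature set, not the paper's exact descriptors:
    - word-count difference between child and parent,
    - word-count ratio,
    - asymmetry of the dissimilarity function d in both directions
      (for asymmetric measures, editing tends to be cheaper in the
      true parent-to-child direction).
    """
    p, c = parent_text.split(), child_text.split()
    return [
        len(c) - len(p),
        len(c) / max(len(p), 1),
        d(parent_text, child_text) - d(child_text, parent_text),
    ]
```

Given labeled edges from trees with known phylogeny, one vector per edge in each direction would form the training set for the Random Forest experiments reported in Table I.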