Table 1.
Test data used in our study.
Table 2.
Training data used in our study.
Table 3.
Recent approaches to cognate detection.
A plus “+” indicates that the algorithm meets the requirement; a minus “-” indicates that it does not. ML (multilingual) refers to the ability of an algorithm to identify cognate words across more than two languages at the same time. RQ (requirements) refers to additional requirements beyond the raw word list data, such as reference phylogenies or extensive training data. FA (free availability) means that the method has a usable public implementation.
Fig 1.
Workflows for automatic cognate detection.
In LingPy, cognate detection is treated as a hierarchical clustering task. After distances or similarities between word pairs have been determined (A), a hierarchical clustering algorithm is applied to the resulting matrix and terminates when a given threshold is reached (B). Similarity networks instead start from a graph representation of the similarity or distance matrix (C). In a first step, edges whose score exceeds a given threshold are removed from the graph (D). In a second step, state-of-the-art algorithms for community detection are used to partition the graph into groups of cognate words (E).
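To make the two workflows concrete, the following is a minimal sketch, not LingPy's internal code: it clusters a small hypothetical distance matrix once by cutting a dendrogram at a threshold (A-B) and once by thresholding a similarity network and partitioning it (C-E). Here networkx's greedy modularity communities stand in for the Infomap algorithm; the matrix and threshold are invented for illustration.

```python
# Sketch of the two cognate clustering workflows (not LingPy's code).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical pairwise distances between five words (0 = identical).
dist = np.array([
    [0.0, 0.2, 0.8, 0.9, 0.3],
    [0.2, 0.0, 0.7, 0.8, 0.4],
    [0.8, 0.7, 0.0, 0.1, 0.9],
    [0.9, 0.8, 0.1, 0.0, 0.8],
    [0.3, 0.4, 0.9, 0.8, 0.0],
])
THRESHOLD = 0.5  # illustrative cutoff, not a tuned value

# (A-B) Hierarchical clustering: build a dendrogram from the condensed
# distance matrix and cut it at the threshold.
clusters = fcluster(linkage(squareform(dist), method="average"),
                    t=THRESHOLD, criterion="distance")
print("hierarchical clusters:", clusters)

# (C-E) Similarity network: keep an edge for every pair whose distance
# falls below the threshold, then partition the graph with a
# community-detection algorithm (a stand-in for Infomap).
G = nx.Graph()
G.add_nodes_from(range(len(dist)))
for i in range(len(dist)):
    for j in range(i + 1, len(dist)):
        if dist[i, j] < THRESHOLD:
            G.add_edge(i, j, weight=1 - dist[i, j])
communities = greedy_modularity_communities(G, weight="weight")
print("network communities:", [sorted(c) for c in communities])
```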
Table 4.
Cognate detection algorithms in LingPy.
Columns show the performance of cognate identification for the given wordforms in the International Phonetic Alphabet (IPA). The algorithms are the Turchin, Edit Distance, Sound Class Algorithm (SCA), and LexStat methods. Italic numbers indicate false positives (forms incorrectly identified as cognate) and bold numbers indicate false negatives (forms incorrectly identified as non-cognate) in comparison with the Gold Standard.
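As a rough illustration of how these four methods can be run, the following LingPy sketch clusters a word list with each method in turn. The file name wordlist.tsv, the thresholds, and the output column names (ref) are illustrative assumptions, not the tuned values from the study.

```python
# Sketch: running the four cognate detection methods in LingPy on a
# hypothetical word list (file name and thresholds are illustrative).
from lingpy import LexStat

lex = LexStat('wordlist.tsv')

# Turchin: consonant-class matching, no threshold required.
lex.cluster(method='turchin', ref='turchinid')

# Normalized edit distance with a flat threshold.
lex.cluster(method='edit-dist', threshold=0.75, ref='editid')

# Sound-class-based (SCA) alignment distances.
lex.cluster(method='sca', threshold=0.45, ref='scaid')

# LexStat: the language-specific scorer must be computed first.
lex.get_scorer(runs=10000)
lex.cluster(method='lexstat', threshold=0.60, ref='lexstatid')
```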
Table 5.
Preliminaries for B-Cubed score calculation.
Cognate clusters, cluster sizes, and cluster intersections for a fictitious test analysis of the five words from Fig 1, compared to a gold standard.
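From these preliminaries, B-Cubed precision and recall are computed per word from the overlap between the word's test cluster and its gold cluster, then averaged and combined into an F-score. The following is a minimal sketch; the cluster labels in the usage example are hypothetical, not the values from the table.

```python
# Minimal sketch of B-Cubed precision, recall, and F-score, computed
# from per-word cluster labels for a test analysis and a gold standard.
def bcubed(test, gold):
    n = len(test)
    precision = recall = 0.0
    for i in range(n):
        # Words sharing word i's cluster in the test and gold partitions.
        t = {j for j in range(n) if test[j] == test[i]}
        g = {j for j in range(n) if gold[j] == gold[i]}
        overlap = len(t & g)
        precision += overlap / len(t)
        recall += overlap / len(g)
    precision, recall = precision / n, recall / n
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

# Hypothetical clusterings of five words:
print(bcubed(test=[1, 1, 2, 2, 1], gold=[1, 1, 2, 2, 3]))
```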
Fig 2.
Determining the best thresholds for the methods.
The y-axis shows the B-Cubed F-scores averaged over all training sets, and the x-axis shows the threshold for the five methods we tested. Infomap shows the best results on average, while Edit Distance performs worst. Dots in the plots indicate the mean for each sample, and triangular symbols indicate the peak.
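The threshold search itself amounts to a simple grid sweep: cluster the training data at each candidate threshold, score against the gold standard, and keep the threshold with the best mean B-Cubed F-score. A sketch follows; cluster_at and bcubed_f are placeholders for a clustering routine and the scorer sketched above, and the grid is an assumed search range.

```python
# Sketch of the threshold search over the training sets.
import numpy as np

def best_threshold(training_sets, cluster_at, bcubed_f,
                   grid=np.arange(0.05, 1.0, 0.05)):
    """training_sets: iterable of (data, gold) pairs.
    cluster_at(data, t): returns a clustering at threshold t.
    bcubed_f(test, gold): returns the B-Cubed F-score."""
    mean_scores = []
    for t in grid:
        fs = [bcubed_f(cluster_at(data, t), gold)
              for data, gold in training_sets]
        mean_scores.append(sum(fs) / len(fs))  # mean over training sets
    best = int(np.argmax(mean_scores))
    return grid[best], mean_scores[best]
```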
Table 6.
Results of the training analysis to identify the best thresholds.
Bold numbers indicate best values.
Table 7.
General results of the test analysis.
Fig 3.
Individual test results (B-Cubed F-Scores).
The figure shows the individual results of all algorithms, based on B-Cubed F-scores, for each of the datasets. A red triangle marks the worst result for a given subset, and a yellow star marks the best result. Except on the Uralic data, our new Infomap approach always performs best, while the Turchin approach performs worst in four out of six tests.
Fig 4.
Partial and non-partial cognate relations.
The word for “moon” is mono-morphemic in Germanic languages, while it is usually a compound in Chinese dialects, in which the first element means “moon” proper and the second often originally meant “shine” or “glance”. The different cognate relations among the morphemes of the Chinese words make it impossible to give a binary assessment of the cognacy of the four words.
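One way to see why a binary judgement fails is to code cognacy at the morpheme level, with one cognate-set ID per morpheme. The sketch below uses invented word forms and arbitrary IDs purely for illustration of the coding scheme, not data from the study.

```python
# Hypothetical morpheme-level ("partial") cognate coding for "moon".
# Each word is a list of cognate-set IDs, one per morpheme.
words = {
    "Germanic (mono-morphemic)":  [1],      # the "moon" morpheme alone
    "Chinese dialect A":          [1, 2],   # "moon" + "shine"
    "Chinese dialect B":          [1, 3],   # "moon" + "glance"
}

# Two words are partially cognate if they share at least one
# morpheme-level ID; a single yes/no label cannot express this.
def partially_cognate(a, b):
    return bool(set(words[a]) & set(words[b]))
```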
Fig 5.
Distribution of true positives, false positives, true negatives, and false negatives.
Fig 6.
Comparing false positives and false negatives in the Chinese data.
The figure compares the numbers of false positives and false negatives, measured as pairwise scores, for the Turchin method and our Infomap approach across all pairs of language varieties in the Chinese test set. The upper triangle of each heatmap shows the number of false positives, while the lower triangle shows the number of false negatives.
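The pairwise counts behind such a heatmap cell can be derived directly from the cognate-set assignments: for every pair of words from the two varieties, compare the predicted decision with the gold standard. A minimal sketch follows, with hypothetical function and variable names (pred and gold map word IDs to cognate-set IDs).

```python
# Sketch: pairwise false positives / false negatives for one pair of
# language varieties, given predicted and gold cognate-set IDs.
from itertools import product

def pairwise_fp_fn(words_a, words_b, pred, gold):
    """words_a, words_b: word IDs of the two varieties (same concept).
    pred, gold: dicts mapping a word ID to its cognate-set ID."""
    fp = fn = 0
    for wa, wb in product(words_a, words_b):
        predicted = pred[wa] == pred[wb]
        truth = gold[wa] == gold[wb]
        if predicted and not truth:
            fp += 1   # predicted cognate, gold standard disagrees
        elif truth and not predicted:
            fn += 1   # true cognate pair that the method missed
    return fp, fn
```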