Table 1.
Test data used in our study.
Table 2.
Training data used in our study.
Table 3.
Recent approaches to cognate detection.
A plus “+” indicates that the algorithm meets the requirement; a minus “-” indicates that it does not. ML (multilingual) refers to the ability of an algorithm to identify cognate words across more than two languages at the same time. RQ (requirements) refers to additional requirements beyond the raw word list data, such as reference phylogenies or extensive training data. FA (free availability) means that the method has a usable public implementation.
Fig 1.
Workflows for automatic cognate detection.
In LingPy, cognate detection is treated as a hierarchical clustering task. After distances or similarities between word pairs have been determined (A), a hierarchical clustering algorithm is applied to the resulting matrix and terminates when a given threshold is reached (B). Similarity networks instead start from a graph representation of the similarity or distance matrix (C). In a first step, edges whose score exceeds a given threshold are removed from the graph (D). In a second step, state-of-the-art algorithms for community detection are used to partition the graph into groups of cognate words (E).
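To make the two workflows concrete, the following is a minimal sketch, not LingPy's internal code: it clusters a small hypothetical distance matrix once by cutting a dendrogram at a threshold (A-B) and once by thresholding a similarity network and partitioning it (C-E). Here networkx's greedy modularity communities stand in for the Infomap algorithm; the matrix and threshold are invented for illustration.

```python
# Sketch of the two cognate clustering workflows (not LingPy's code).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical pairwise distances between five words (0 = identical).
dist = np.array([
    [0.0, 0.2, 0.8, 0.9, 0.3],
    [0.2, 0.0, 0.7, 0.8, 0.4],
    [0.8, 0.7, 0.0, 0.1, 0.9],
    [0.9, 0.8, 0.1, 0.0, 0.8],
    [0.3, 0.4, 0.9, 0.8, 0.0],
])
THRESHOLD = 0.5  # illustrative cutoff, not a tuned value

# (A-B) Hierarchical clustering: build a dendrogram from the condensed
# distance matrix and cut it at the threshold.
clusters = fcluster(linkage(squareform(dist), method="average"),
                    t=THRESHOLD, criterion="distance")
print("hierarchical clusters:", clusters)

# (C-E) Similarity network: keep an edge for every pair whose distance
# falls below the threshold, then partition the graph with a
# community-detection algorithm (a stand-in for Infomap).
G = nx.Graph()
G.add_nodes_from(range(len(dist)))
for i in range(len(dist)):
    for j in range(i + 1, len(dist)):
        if dist[i, j] < THRESHOLD:
            G.add_edge(i, j, weight=1 - dist[i, j])
communities = greedy_modularity_communities(G, weight="weight")
print("network communities:", [sorted(c) for c in communities])
```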
Table 4.
Cognate detection algorithms in LingPy.
Columns show the performance of cognate identification for the given wordforms in the International Phonetic Alphabet (IPA). The algorithms are the Turchin, Edit Distance, Sound Class Algorithm (SCA), and LexStat methods. Italic numbers indicate false positives (forms incorrectly identified as cognate) and bold numbers indicate false negatives (forms incorrectly identified as non-cognate) in comparison with the Gold Standard.
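As a rough illustration of how these four methods can be run, the following LingPy sketch clusters a word list with each method in turn. The file name wordlist.tsv, the thresholds, and the output column names (ref) are illustrative assumptions, not the tuned values from the study.

```python
# Sketch: running the four cognate detection methods in LingPy on a
# hypothetical word list (file name and thresholds are illustrative).
from lingpy import LexStat

lex = LexStat('wordlist.tsv')

# Turchin: consonant-class matching, no threshold required.
lex.cluster(method='turchin', ref='turchinid')

# Normalized edit distance with a flat threshold.
lex.cluster(method='edit-dist', threshold=0.75, ref='editid')

# Sound-class-based (SCA) alignment distances.
lex.cluster(method='sca', threshold=0.45, ref='scaid')

# LexStat: the language-specific scorer must be computed first.
lex.get_scorer(runs=10000)
lex.cluster(method='lexstat', threshold=0.60, ref='lexstatid')
```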
Table 5.
Preliminaries for B-Cubed score calculation.
Cognate clusters, cluster sizes, and cluster intersections for a fictitious test analysis of the five words from Fig 1, compared to a gold standard.
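From these preliminaries, B-Cubed precision and recall are computed per word from the overlap between the word's test cluster and its gold cluster, then averaged and combined into an F-score. The following is a minimal sketch; the cluster labels in the usage example are hypothetical, not the values from the table.

```python
# Minimal sketch of B-Cubed precision, recall, and F-score, computed
# from per-word cluster labels for a test analysis and a gold standard.
def bcubed(test, gold):
    n = len(test)
    precision = recall = 0.0
    for i in range(n):
        # Words sharing word i's cluster in the test and gold partitions.
        t = {j for j in range(n) if test[j] == test[i]}
        g = {j for j in range(n) if gold[j] == gold[i]}
        overlap = len(t & g)
        precision += overlap / len(t)
        recall += overlap / len(g)
    precision, recall = precision / n, recall / n
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

# Hypothetical clusterings of five words:
print(bcubed(test=[1, 1, 2, 2, 1], gold=[1, 1, 2, 2, 3]))
```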
Fig 2.
Determining the best thresholds for the methods.
The y-axis shows the B-Cubed F-scores averaged over all training sets, and the x-axis shows the threshold for the five methods we tested. Infomap shows the best results on average, while Edit Distance performs worst. Dots in the plots indicate the mean for each sample, and triangular symbols indicate the peak.
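The threshold search itself amounts to a simple grid sweep: cluster the training data at each candidate threshold, score against the gold standard, and keep the threshold with the best mean B-Cubed F-score. A sketch follows; cluster_at and bcubed_f are placeholders for a clustering routine and the scorer sketched above, and the grid is an assumed search range.

```python
# Sketch of the threshold search over the training sets.
import numpy as np

def best_threshold(training_sets, cluster_at, bcubed_f,
                   grid=np.arange(0.05, 1.0, 0.05)):
    """training_sets: iterable of (data, gold) pairs.
    cluster_at(data, t): returns a clustering at threshold t.
    bcubed_f(test, gold): returns the B-Cubed F-score."""
    mean_scores = []
    for t in grid:
        fs = [bcubed_f(cluster_at(data, t), gold)
              for data, gold in training_sets]
        mean_scores.append(sum(fs) / len(fs))  # mean over training sets
    best = int(np.argmax(mean_scores))
    return grid[best], mean_scores[best]
```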
Table 6.
Results of the training analysis to identify the best thresholds.
Bold numbers indicate best values.
Table 7.
General results of the test analysis.
Fig 3.
Individual test results (B-Cubed F-Scores).
The figure shows the individual results of all algorithms, based on B-Cubed F-scores, for each of the datasets. A red triangle marks the worst result for a given subset, and a yellow star marks the best result. Except on the Uralic data, our new Infomap approach always performs best, while the Turchin approach performs worst in four out of six tests.
Fig 4.
Partial and non-partial cognate relations.
The word for “moon” is mono-morphemic in Germanic languages, while it is usually a compound in Chinese dialects, in which the first element means “moon” proper and the second often originally meant “shine” or “glance”. The different cognate relations among the morphemes of the Chinese words make it impossible to give a binary assessment of the cognacy of the four words.
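One way to see why a binary judgement fails is to code cognacy at the morpheme level, with one cognate-set ID per morpheme. The sketch below uses invented word forms and arbitrary IDs purely for illustration of the coding scheme, not data from the study.

```python
# Hypothetical morpheme-level ("partial") cognate coding for "moon".
# Each word is a list of cognate-set IDs, one per morpheme.
words = {
    "Germanic (mono-morphemic)":  [1],      # the "moon" morpheme alone
    "Chinese dialect A":          [1, 2],   # "moon" + "shine"
    "Chinese dialect B":          [1, 3],   # "moon" + "glance"
}

# Two words are partially cognate if they share at least one
# morpheme-level ID; a single yes/no label cannot express this.
def partially_cognate(a, b):
    return bool(set(words[a]) & set(words[b]))
```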
Fig 5.
Distribution of true positives, false positives, true negatives, and false negatives.
Fig 6.
Comparing false positives and false negatives in the Chinese data.
The figure compares the numbers of false positives and false negatives, measured as pairwise scores, for the Turchin method and our Infomap approach across all pairs of language varieties in the Chinese test set. The upper triangle of each heatmap shows the number of false positives, while the lower triangle shows the number of false negatives.
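The pairwise counts behind such a heatmap cell can be derived directly from the cognate-set assignments: for every pair of words from the two varieties, compare the predicted decision with the gold standard. A minimal sketch follows, with hypothetical function and variable names (pred and gold map word IDs to cognate-set IDs).

```python
# Sketch: pairwise false positives / false negatives for one pair of
# language varieties, given predicted and gold cognate-set IDs.
from itertools import product

def pairwise_fp_fn(words_a, words_b, pred, gold):
    """words_a, words_b: word IDs of the two varieties (same concept).
    pred, gold: dicts mapping a word ID to its cognate-set ID."""
    fp = fn = 0
    for wa, wb in product(words_a, words_b):
        predicted = pred[wa] == pred[wb]
        truth = gold[wa] == gold[wb]
        if predicted and not truth:
            fp += 1   # predicted cognate, gold standard disagrees
        elif truth and not predicted:
            fn += 1   # true cognate pair that the method missed
    return fp, fn
```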