On the Accuracy of Language Trees

doi:10.1371/journal.pone.0020109

Figure 1.

Ethnologue resolution power.

This map represents the Ethnologue resolution power in the different world locations. Red areas correspond to regions where the Ethnologue classification is completely binary, i.e., it corresponds to a tree in which each internal node has exactly two child nodes. Yellow areas correspond to fully unspecified trees, featuring only a star structure. Grey areas are those for which no data are present in the databases we consider to reconstruct language trees. Asterisks mark regions that include more than one language family (the list of such families is reported in File S1).
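
The two extreme cases named in this caption (fully binary and star) can be told apart with a few lines of code. The sketch below is illustrative only: it assumes a classification is stored as nested Python tuples with language names as string leaves, which is a hypothetical representation and not the format of the Ethnologue data.

```python
def internal_nodes(node):
    """Yield every internal node of a nested-tuple tree (leaves are strings)."""
    if isinstance(node, str):
        return
    yield node
    for child in node:
        yield from internal_nodes(child)


def classify(tree):
    """Label a rooted tree as 'binary', 'star', or 'partially resolved'."""
    nodes = list(internal_nodes(tree))
    # Completely binary: each internal node has exactly two child nodes.
    if all(len(n) == 2 for n in nodes):
        return "binary"
    # Star: a single internal node with every leaf attached to it.
    if len(nodes) == 1:
        return "star"
    return "partially resolved"


# Hypothetical toy classifications of four languages.
print(classify((("a", "b"), ("c", "d"))))
print(classify(("a", "b", "c", "d")))
print(classify((("a", "b"), "c", "d")))
```

A family made of only two languages is both binary and a star under these definitions; the sketch reports it as binary.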

Figure 2.

Top: Statistics of the ASJP database.

(left panel) Fraction-rank plot: for each word in the word lists of the Automated Similarity Judgement Project (ASJP), we measured the fraction of languages containing it. The plot reports this fraction vs. its rank. In the -items lists in the ASJP database, only meanings are shared by almost of the languages for each family. (right panel) Ranked fraction of pairs of languages sharing each specific word vs. rank. Again only meanings are shared by almost of the pairs of languages.

Bottom: Statistical measures on the ABVD database.

(left panel) Fraction-rank plot: for each word in the word lists of the Austronesian Basic Vocabulary Database (ABVD), we measured the fraction of languages containing it. The plot reports this fraction vs. its rank. (right panel) Ranked fraction of pairs of languages sharing each specific word vs. rank. For the sake of a rough comparison we also report the same quantities measured on the Austronesian family of the ASJP database. The ASJP includes words up to a maximum of almost of the languages, whereas in the ABVD the percentage of coverage is at least of for almost all the words in the list. Restricted to the most shared words, the ASJP database features a slightly larger coverage than the ABVD database.
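
The two ranked quantities described in this caption can be sketched as follows. This is a minimal illustration on invented data: the `word_lists` dictionary, the language names, and the meanings are hypothetical and are not taken from ASJP or ABVD.

```python
from itertools import combinations

# Hypothetical toy data: language -> set of meanings attested in its word list.
word_lists = {
    "lang_A": {"I", "you", "water", "stone"},
    "lang_B": {"I", "you", "water"},
    "lang_C": {"I", "water"},
    "lang_D": {"I", "you"},
}


def language_fractions(word_lists):
    """Ranked fraction of languages whose list contains each meaning."""
    n = len(word_lists)
    meanings = set().union(*word_lists.values())
    frac = {m: sum(m in words for words in word_lists.values()) / n
            for m in meanings}
    return sorted(frac.values(), reverse=True)


def pair_fractions(word_lists):
    """Ranked fraction of language pairs in which both lists contain each meaning."""
    pairs = list(combinations(word_lists.values(), 2))
    meanings = set().union(*word_lists.values())
    frac = {m: sum(m in a and m in b for a, b in pairs) / len(pairs)
            for m in meanings}
    return sorted(frac.values(), reverse=True)


print(language_fractions(word_lists))  # [1.0, 0.75, 0.75, 0.25]
print(pair_fractions(word_lists))      # [1.0, 0.5, 0.5, 0.0]
```

Plotting each returned list against its index (the rank) gives curves of the kind shown in the left and right panels respectively.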

Figure 3.

Robinson-Foulds and Quartet Distance: errors due to a displacement of a couple of subtrees.

The trees and are different because of the swap of the subtrees A and B. While computing the distance between and , the Robinson-Foulds distance detects all the edges in the path as errors, regardless of the size of the subtrees attached to them. The number of wrong butterfly quartets counted as errors by the Quartet Distance is expressed by : the QD thus depends on the size of the subtrees.
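
To make the two measures concrete, the sketch below computes the standard Robinson-Foulds distance (number of bipartitions found in only one of the two trees) and a brute-force Quartet Distance (number of four-leaf subsets resolved differently). It assumes trees are given as nested tuples with string leaves and are read as unrooted; the toy trees are hypothetical and are not the ones drawn in the figure.

```python
from itertools import combinations


def leaves(node):
    """Return the set of leaf labels below a nested-tuple node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def splits(tree):
    """Non-trivial bipartitions, each stored as the side without a reference leaf."""
    all_leaves = leaves(tree)
    ref = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        for child in node:
            side = leaves(child)
            if 1 < len(side) < len(all_leaves) - 1:
                found.add(side if ref not in side else all_leaves - side)
            visit(child)

    visit(tree)
    return found


def robinson_foulds(t1, t2):
    """Count bipartitions present in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))


def quartet_topology(tree_splits, quartet):
    """Return the 2|2 resolution of a quartet (a butterfly), or None for a star."""
    q = frozenset(quartet)
    for side in tree_splits:
        inside = q & side
        if len(inside) == 2:
            return frozenset([inside, q - inside])
    return None


def quartet_distance(t1, t2):
    """Count quartets whose induced topology differs between the two trees."""
    s1, s2 = splits(t1), splits(t2)
    return sum(quartet_topology(s1, q) != quartet_topology(s2, q)
               for q in combinations(sorted(leaves(t1)), 4))


# Hypothetical toy trees differing by the swap of leaves "b" and "c".
t1 = (("a", "b"), "c", ("d", "e"))
t2 = (("a", "c"), "b", ("d", "e"))
print(robinson_foulds(t1, t2), quartet_distance(t1, t2))  # 2 2
```

The brute-force quartet loop scales with the fourth power of the number of leaves, so it is only meant for small illustrative trees.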

Figure 4.

Non-binary nodes: biases of errors.

The standard Robinson-Foulds distance and the Quartet Distance are biased when comparing binary trees with non-binary classifications. The difference between tree and is that shows a finer-grained classification. The two trees, however, are not conflicting, since is simply a refinement of the classification . The RF distance will count every internal edge (blue ones in ) of this refinement as an error, since they are not in . The QD will count every quartet including the blue edges as an error, since all these quartets are stars in . The generalized measures we introduce correctly give a null score between and in the example.
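
One way to obtain a null score for a pure refinement is sketched below. It rests on two assumptions that may not match the exact definitions used in the article: the generalized Robinson-Foulds count keeps only the bipartitions of the inferred tree that are incompatible with the classification, and the generalized quartet score only considers quartets that are butterflies in the classification, normalized by their number. Trees are hypothetical nested tuples with string leaves.

```python
from itertools import combinations


def leaves(node):
    """Return the set of leaf labels below a nested-tuple node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def splits(tree):
    """Non-trivial bipartitions, each stored as the side without a reference leaf."""
    all_leaves = leaves(tree)
    ref = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        for child in node:
            side = leaves(child)
            if 1 < len(side) < len(all_leaves) - 1:
                found.add(side if ref not in side else all_leaves - side)
            visit(child)

    visit(tree)
    return found


def compatible(s1, s2, all_leaves):
    """Two bipartitions can coexist in one tree iff some pair of sides is disjoint."""
    c1, c2 = all_leaves - s1, all_leaves - s2
    return any(not (x & y) for x in (s1, c1) for y in (s2, c2))


def conflicting_splits(inferred, classification):
    """Assumed generalized RF count: inferred splits that contradict the classification."""
    all_leaves = leaves(classification)
    ref_splits = splits(classification)
    return sum(any(not compatible(s, r, all_leaves) for r in ref_splits)
               for s in splits(inferred))


def quartet_topology(tree_splits, quartet):
    """Return the 2|2 resolution of a quartet (a butterfly), or None for a star."""
    q = frozenset(quartet)
    for side in tree_splits:
        inside = q & side
        if len(inside) == 2:
            return frozenset([inside, q - inside])
    return None


def generalized_quartet_distance(inferred, classification):
    """Assumed generalized QD: fraction of classification butterflies resolved differently."""
    inf, ref = splits(inferred), splits(classification)
    butterflies = different = 0
    for q in combinations(sorted(leaves(classification)), 4):
        ref_topology = quartet_topology(ref, q)
        if ref_topology is None:
            continue  # star in the classification: never counted as an error
        butterflies += 1
        different += quartet_topology(inf, q) != ref_topology
    return different / butterflies if butterflies else 0.0


# Hypothetical example: `fine` refines `coarse`, `wrong` contradicts it.
coarse = (("a", "b", "c"), ("d", "e", "f"))
fine = ((("a", "b"), "c"), ("d", ("e", "f")))
wrong = ((("a", "d"), "c"), ("b", ("e", "f")))
print(len(splits(fine) ^ splits(coarse)))          # standard RF: 2
print(conflicting_splits(fine, coarse))            # 0
print(generalized_quartet_distance(fine, coarse))  # 0.0
print(conflicting_splits(wrong, coarse))           # 2
```

In this toy case the standard Robinson-Foulds distance penalizes the two extra edges of the refinement, while both assumed generalized scores are zero; a tree that genuinely contradicts the classification still receives a positive score.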

Table 1.

Accuracy of the reconstructions as measured with the Generalized Robinson-Foulds (GRF).

Table 2.

Accuracy of the reconstructions as measured with the Generalized Quartet Distance (GQD).

Figure 5.

Accuracy histograms as measured with the Generalized Robinson-Foulds score (GRF).

For each continent and for the whole world we report the histograms of the GRF as measured over all the families spread over each specific region. We considered here only the FastSBiX algorithm, which performs slightly better than the competing algorithms, with both the LDN (2) (right panel) and the LDND (4) (left panel) definitions of distance. The histograms are always peaked near zero, meaning that the error rate is always very low, but the variances are quite large. These distributions do not discriminate between the performance of inference using the LDN (2) and the LDND (4) definitions of distance.

Figure 6.

Accuracy histograms as measured with the Generalized Quartet Distance (GQD).

For each continent and for the whole world we report the histograms of the GQD as measured over all the families spread over each specific region. We considered here only the FastSBiX algorithm, which performs slightly better than the competing algorithms, with both the LDN (2) (right panel) and the LDND (4) (left panel) definitions of distance. The histograms are always peaked near zero, meaning that the error rate is always very low. Moreover, the distributions for the LDN-inferred trees display larger variances than the LDND ones, which means that the latter definition performs better when inferring language trees with a distance-based approach. The overall variances are smaller than those in Figure 5.

Figure 7.

Worldwide accuracy of the inferred language trees.

This map represents the level of accuracy of the FastSBiX algorithm on several language families throughout the world. The colors code the values of the Generalized Quartet Distance (GQD) between the trees inferred with the FastSBiX algorithm and the LDND definition of distance, for each language family included in the ASJP database, and the corresponding Ethnologue classifications. The GQD is normalized with the corresponding random value (see text for details). Blue regions correspond to language families for which the inferred trees strongly agree with the Ethnologue classification, whereas red regions correspond to poorly reconstructed language families. Yellow marks the families for which a random reconstruction would get a GQD score of zero, meaning that the Ethnologue classification has a null resolution (the corresponding tree is a star). Grey areas are those for which no data are present in the databases adopted for the reconstruction. Asterisks mark regions that include more than one family of languages. See File S1 for the analogous maps obtained with different algorithms and different definitions of the distance between languages.

Figure 8.

Role of the word-list completeness and coverage.

(left) The Generalized Robinson-Foulds (GRF) score between the inferred trees and the corresponding Ethnologue classification for the Austronesian family, vs. the number of most shared words, for both the ASJP and the ABVD databases. The inset reports the behaviour of , the effective number of most shared words, defined as follows. For each list is the sum of the values of for all the meanings in the list. In this way quantifies the effective number of most shared meanings. There is a strong correlation between and for . For does not increase anymore in the ASJP database. This explains why the GRF does not decrease for for the ASJP database. (right) The Generalized Quartet Distance (GQD) between the inferred trees and the corresponding Ethnologue classification for the Austronesian family, vs. the number of most shared words, for both the ASJP and the ABVD databases. The inset reports the behaviour of the Coverage, which measures the degree of alignment of the word lists for the different languages considered, vs. (see text for details about the definition of Coverage). Again there is a strong correlation between the Coverage and . The distance-based algorithm used is FastSBiX with the LDN definition of distance.
