
Fig 1.

Unequal representation of letters (1-grams) occurring at different positions in words in written English.

The probability of occurrence of the 26 letters of the English alphabet in the Mieliestronk corpus, comprising about 58000 unique words of the English language (see Methods for details), at (a) any position, (b) the left terminal position (i.e., at the beginning) and (c) the right terminal position (i.e., at the end) of a word. The distribution is more heterogeneous for (c), indicating that only a few letters occur with high frequency at the right terminal position of a word, whereas letters occur with relatively more egalitarian frequencies at the left terminal position (b). This difference is illustrated in the Lorenz curve (d), which compares the cumulative distribution function of the occurrence probability of the different letters at any (solid curve), left terminal (dash-dotted curve) and right terminal (dashed curve) position of a word. The thin broken diagonal line (line of perfect equality) corresponds to a perfectly uniform distribution; deviation from it indicates the extent of heterogeneity of the letter occurrence probability distribution. This heterogeneity is measured by the Gini index, defined as the ratio of the area between the line of perfect equality and the observed Lorenz curve to the area between the lines of perfect equality and perfect inequality (viz., the horizontal line).
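The Gini index described in this caption can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names `position_counts` and `gini` are hypothetical, the Lorenz area is approximated with the trapezoidal rule, and letters that never occur at a given position are assumed to enter the distribution with probability zero (the caption does not say how such letters are treated).

```python
from collections import Counter

def position_counts(words, position="any"):
    """Count letters at 'any', 'left' (first) or 'right' (last) position."""
    counts = Counter()
    for w in words:
        if not w:
            continue  # skip empty strings
        if position == "left":
            counts[w[0]] += 1
        elif position == "right":
            counts[w[-1]] += 1
        else:
            counts.update(w)
    return counts

def gini(counts, alphabet):
    """Gini index of the occurrence probabilities over a fixed alphabet.

    Assumption: letters absent from `counts` contribute probability zero.
    """
    values = sorted(counts.get(s, 0) for s in alphabet)  # ascending order
    total, n = sum(values), len(values)
    if total == 0 or n == 0:
        return 0.0
    cum = area = 0.0
    for v in values:
        prev, cum = cum, cum + v / total   # Lorenz curve ordinates
        area += (prev + cum) / (2 * n)     # trapezoid of width 1/n
    # Area between the equality line and the Lorenz curve, divided by 1/2
    return 1.0 - 2.0 * area
```

With this definition a perfectly uniform distribution gives 0, and a distribution concentrated on a single letter out of n gives 1 - 1/n.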


Fig 2.

Unequal representation of signs (1-grams) occurring at different positions in words in corpora written using different languages and writing systems.

The Lorenz curves in the 24 panels (corresponding to all the scripts analyzed here except English, which is shown in Fig 1) show the differences in the cumulative distribution function of the occurrence probability of signs at left terminal position (blue, dash-dot curve), right terminal position (purple, dashed curve) and at any position (red, solid curve) of a word written in a particular script. The thin broken diagonal line corresponds to a perfectly uniform distribution, deviation from which indicates the extent of heterogeneity of sign occurrence distributions. This is measured in terms of the Gini index, the corresponding values at the left terminal (L), right terminal (R) and any position (A) for a script being indicated in each panel.


Fig 3.

Asymmetry in the sign occurrence probability distributions at the left and right terminal positions of words in different languages correlates with the directions in which they are read.

The normalized difference of the Gini indices, ΔG = 2(G_L − G_R)/(G_L + G_R) (filled circles), which measures the relative heterogeneity between the occurrences of different signs in the terminal positions of words of a language, is shown for a number of different written languages (arranged in alphabetical order) that span a variety of possible writing systems, from alphabetic (e.g., English) and syllabic (e.g., Japanese kana) to logographic (Chinese) [see text for details]. Here G_L and G_R denote the Gini indices at the left and right terminal positions, respectively. All languages that are conventionally read from left to right (or rendered in that format in the databases used here) show a negative value for ΔG, while those read right to left exhibit positive values. The horizontal thick bars superposed on the circles represent the 95% bootstrap confidence interval for the estimated values of ΔG. To verify the significance of the empirical values, they are compared with the corresponding ΔG (diamonds) calculated using an ensemble of 1000 randomized versions of each database (obtained through multiple realizations of random permutations of the signs occurring in each word; see Materials and Methods for details), the ranges of fluctuations being indicated by error bars. Along with the set of known languages, ΔG measured for a corpus of undeciphered inscriptions from the Indus Valley Civilization (2600–1900 BCE) is also shown (bottom row).
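The quantities in this caption (ΔG, the randomized ensemble and the bootstrap interval) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' procedure: all function names are hypothetical, each word is treated as a sequence of signs, absent signs are assumed to have probability zero, and the confidence interval is a simple percentile bootstrap over words, which may differ from the bootstrap variant used in the paper.

```python
import random
from collections import Counter

def gini(values):
    """Gini index from non-negative counts (trapezoidal Lorenz area)."""
    values = sorted(values)
    total, n = sum(values), len(values)
    if total == 0:
        return 0.0
    cum = area = 0.0
    for v in values:
        prev, cum = cum, cum + v / total
        area += (prev + cum) / (2 * n)
    return 1.0 - 2.0 * area

def delta_g(words, alphabet):
    """Normalized Gini difference 2(G_L - G_R)/(G_L + G_R)."""
    left = Counter(w[0] for w in words if w)
    right = Counter(w[-1] for w in words if w)
    g_l = gini([left.get(s, 0) for s in alphabet])
    g_r = gini([right.get(s, 0) for s in alphabet])
    return 2 * (g_l - g_r) / (g_l + g_r) if g_l + g_r > 0 else 0.0

def shuffle_within_words(words, rng):
    """Randomly permute the signs inside each word (word lengths preserved)."""
    out = []
    for w in words:
        signs = list(w)
        rng.shuffle(signs)
        out.append(signs)
    return out

def null_delta_g(words, alphabet, n_rand=1000, seed=0):
    """Delta G for an ensemble of randomized versions of the word list."""
    rng = random.Random(seed)
    return [delta_g(shuffle_within_words(words, rng), alphabet)
            for _ in range(n_rand)]

def bootstrap_ci(words, alphabet, n_boot=1000, seed=0):
    """Assumed percentile bootstrap: resample words with replacement."""
    rng = random.Random(seed)
    stats = sorted(delta_g(rng.choices(words, k=len(words)), alphabet)
                   for _ in range(n_boot))
    return stats[int(0.025 * (n_boot - 1))], stats[int(0.975 * (n_boot - 1))]
```

For a left-to-right script one would expect `delta_g` on the real word list to be negative and to lie outside the spread of `null_delta_g`, whose values should scatter around zero because shuffling removes any left-right asymmetry.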


Fig 4.

The observed asymmetry between heterogeneity of letter occurrence probability in left and right terminal positions is significant when the database is sufficiently large.

Gini index differential ΔG for the left and right terminal letter (1-gram) distributions calculated using a set of N words, shown as a function of N. Empirical results are shown for random samples (drawn without replacement) from the Mieliestronk corpus comprising about 58000 unique words of the English language, each data point (circles) being the average over 10^3 samples of size N. For each empirical sample, a corresponding randomized sample is created by randomly permuting the letters in each of the N words, and each data point for the randomized set (squares) represents an average over the randomizations of the 10^3 samples of size N. With increasing N, the empirical distribution becomes distinguishable from the randomized set (which, by construction, should not have any left-right asymmetry). The error bars indicate the standard deviation over the different samples.
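The sample-size scan described in this caption might look like the sketch below. It is only an illustration: `scan_sample_sizes` and its parameters are hypothetical names, one randomized counterpart is generated per empirical sample, absent letters are assumed to have probability zero, and the population standard deviation is used for the error bars, none of which is specified in the caption.

```python
import random
import statistics
from collections import Counter

def gini(values):
    """Gini index from non-negative counts (trapezoidal Lorenz area)."""
    values = sorted(values)
    total, n = sum(values), len(values)
    if total == 0:
        return 0.0
    cum = area = 0.0
    for v in values:
        prev, cum = cum, cum + v / total
        area += (prev + cum) / (2 * n)
    return 1.0 - 2.0 * area

def delta_g(words, alphabet):
    """Normalized Gini difference 2(G_L - G_R)/(G_L + G_R)."""
    left = Counter(w[0] for w in words if w)
    right = Counter(w[-1] for w in words if w)
    g_l = gini([left.get(s, 0) for s in alphabet])
    g_r = gini([right.get(s, 0) for s in alphabet])
    return 2 * (g_l - g_r) / (g_l + g_r) if g_l + g_r > 0 else 0.0

def scan_sample_sizes(words, alphabet, sizes, n_samples=1000, seed=0):
    """Map each N to (mean, std) of Delta G for empirical and shuffled samples."""
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        emp, rnd = [], []
        for _ in range(n_samples):
            sample = rng.sample(words, n)  # N words, without replacement
            emp.append(delta_g(sample, alphabet))
            # Randomized counterpart: permute the letters within each word
            shuffled = [rng.sample(list(w), len(w)) for w in sample]
            rnd.append(delta_g(shuffled, alphabet))
        results[n] = (statistics.mean(emp), statistics.pstdev(emp),
                      statistics.mean(rnd), statistics.pstdev(rnd))
    return results
```

Here `words` must be a list with at least `max(sizes)` entries. Plotting the two means with their standard deviations against N would reproduce the kind of comparison the caption describes.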
