Statistical Analysis of the Indus Script Using n-Grams | PLOS One

Figure 1.

An example of an Indus seal.
It shows the three typical components: the Indus script at the top, a field symbol (an animal) in the middle, and a decorated object at the bottom left (Copyright Harappa Archaeological Research Project/J.M. Kenoyer, Courtesy Dept. of Archeology and Museums, Govt. of Pakistan). Since the script here is embossed on a seal, it is to be read from left to right, whereas on sealings, which are impressions of the seal, it is read from right to left. The seals are typically between 1 and 2 square inches in size.

Figure 2.

Frequency distribution of individual signs in the EBUDS corpus.
The five most common signs are shown alongside the frequency bars. The relative frequency distribution does not change significantly between the EBUDS and M77 corpora.

Figure 3.

Rank-ordered frequency distribution of signs ƒ_r plotted against the rank r for the EBUDS corpus.
The data can be fit well by the Zipf-Mandelbrot law, log ƒ_r = a − b log(r + c). For c = 0 and b = 1, this reduces to Zipf's Law, log ƒ_r = a − log r. Both these laws are used to fit the frequency distribution of words in linguistic corpora. Our fitted values are a = 15.39, b = 2.59 and c = 44.47. For English (the Brown Corpus), a = 12.43, b = 1.15 and c = 100 [16].
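
As a rough illustration of the fitted curve, the sketch below evaluates the Zipf-Mandelbrot form log ƒ_r = a − b log(r + c) with the parameter values quoted in the caption. The function name is hypothetical, and the use of the natural logarithm is an assumption, since the caption does not state the base.

```python
import math

def zipf_mandelbrot_log_freq(rank, a, b, c):
    """Return log f_r = a - b * log(rank + c).

    Assumption: natural logarithm; the caption does not give the base.
    """
    return a - b * math.log(rank + c)

# Fitted parameters quoted in the caption.
EBUDS = {"a": 15.39, "b": 2.59, "c": 44.47}
BROWN = {"a": 12.43, "b": 1.15, "c": 100.0}

if __name__ == "__main__":
    for rank in (1, 10, 100):
        print(rank,
              zipf_mandelbrot_log_freq(rank, **EBUDS),
              zipf_mandelbrot_log_freq(rank, **BROWN))
```

With c = 0 and b = 1 the same function gives Zipf's Law, a − log r, so the two laws share one code path.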

Figure 4.

Cumulative frequency distribution of all signs, only text beginners, and only text enders in the EBUDS corpus.
Approximately 69 signs account for 80% of the corpus. The script has a large number of signs which are used infrequently. The cumulative distributions for text beginners and text enders show an asymmetry, with only 23 signs accounting for 80% of all text enders, while 82 signs account for 80% of all text beginners. This is clear evidence of an underlying logic in the sign usage.
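
The coverage figures quoted above can be computed from a cumulative count over rank-ordered signs. The sketch below is a minimal version on toy data; the helper names and the toy texts are hypothetical, and texts are assumed to be stored as lists of sign numbers in reading order.

```python
from collections import Counter

def signs_needed(sign_occurrences, coverage=0.8):
    """Smallest number of most-frequent signs covering `coverage` of all occurrences."""
    counts = Counter(sign_occurrences)
    total = sum(counts.values())
    running = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= coverage * total:
            return n
    return len(counts)

def beginners(texts):
    """First sign of every non-empty text (assumes reading order)."""
    return [text[0] for text in texts if text]

def enders(texts):
    """Last sign of every non-empty text (assumes reading order)."""
    return [text[-1] for text in texts if text]

# Toy texts, not taken from any corpus.
toy_texts = [[1, 2, 9], [1, 3, 9], [1, 4, 9], [2, 3, 9], [3, 9], [4, 2, 8]]

if __name__ == "__main__":
    print("beginners:", signs_needed(beginners(toy_texts)))
    print("enders:", signs_needed(enders(toy_texts)))
```

In the toy data the enders are concentrated on one sign while the beginners are spread out, which mirrors the kind of asymmetry the caption describes.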

Table 1.

Perplexity and the n-gram cross entropy H_n(Q,P) for the EBUDS corpus.
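
For readers unfamiliar with these quantities, the sketch below computes a bigram cross entropy in bits per predicted token and the corresponding perplexity, 2 raised to the cross entropy. It assumes a smoothed model stored as a nested dictionary `bigram[a][b]` with `#` and `$` as start and end tokens; the data layout, the function names, and the choice to count the end token in the normalisation are assumptions, not details taken from the table.

```python
import math

def cross_entropy(bigram, texts):
    """Average negative log2 probability per predicted token under a bigram model.

    Assumes every transition in `texts` has non-zero probability (smoothed model).
    """
    log_prob, n_tokens = 0.0, 0
    for text in texts:
        padded = ["#"] + list(text) + ["$"]
        for a, b in zip(padded, padded[1:]):
            log_prob += math.log2(bigram[a][b])
            n_tokens += 1
    return -log_prob / n_tokens

def perplexity(bigram, texts):
    """Perplexity is 2 raised to the cross entropy (in bits)."""
    return 2.0 ** cross_entropy(bigram, texts)

# Toy model in which every allowed transition has probability 0.5.
toy_bigram = {
    "#": {"A": 0.5, "B": 0.5},
    "A": {"B": 0.5, "$": 0.5},
    "B": {"A": 0.5, "$": 0.5},
}

if __name__ == "__main__":
    print(perplexity(toy_bigram, [["A", "B"]]))
```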

Figure 5.

Bigram probability P(b|a) for a random distribution with no correlations amongst the signs (above) and for the EBUDS corpus (below).
Horizontal lines in the upper matrix imply that the conditional probability of a sign b following a sign a is equal to the probability of sign b itself. The bigram probability P(b|a) after Witten-Bell smoothing is shown in the lower plot. The difference between the two matrices indicates the presence of correlations in the texts.

Figure 6.

Probability P(a|#) of a sign a following the start token # (text beginners) and probability P($|a) of sign a preceding the end token $ (text enders).
This is extracted from the bigram matrix with Witten-Bell smoothing. Text beginners with a significant probability are more numerous than text enders at the same threshold of probability.

Figure 7.

Conditional probability plots for text beginners a = 267, 391 and 293 followed by sign b, and for text enders b preceded by sign a, from the bigram matrix P(b|a) with Witten-Bell smoothing.
Text beginners are more selective in terms of the number of signs which can follow them than text enders, which can have a large number of signs preceding them.

Figure 8.

Conditional probability plots for sign b following text beginners a = 99 and a = 123.
The number of signs following the signs 99 and 123 is greater than the number of signs following text beginners 267, 391 and 293 (Fig. 7).

Table 2.

Significant sign pairs from the log-likelihood ratio (LLR) measure of association for bigrams.
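
One common form of the log-likelihood ratio for a pair (a, b) is the G² statistic on a 2×2 contingency table of bigram counts. The sketch below implements that form; whether the table uses exactly this variant is an assumption, and the function name and argument names are hypothetical.

```python
import math

def llr(c_ab, c_a, c_b, n):
    """G^2 log-likelihood ratio for the pair (a, b).

    c_ab: count of the bigram (a, b)
    c_a:  count of bigrams whose first sign is a
    c_b:  count of bigrams whose second sign is b
    n:    total number of bigrams
    """
    observed = [
        [c_ab, c_a - c_ab],
        [c_b - c_ab, n - c_a - c_b + c_ab],
    ]
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row[i] * col[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2

if __name__ == "__main__":
    print(llr(10, 20, 50, 100))  # counts consistent with independence
    print(llr(20, 20, 20, 100))  # a and b always occur together
```

A pair whose observed count matches the count expected under independence scores zero; larger values indicate stronger association.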

Figure 9.

Examples of texts generated by the bigram model.
The texts are to be read from the right to the left. Some of the texts generated by the model occur in the corpus.
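
Generating a text from a bigram model amounts to a random walk on the Markov chain from the start token to the end token. The sketch below shows the idea on a toy model; the nested-dictionary layout, the `#` and `$` tokens as dictionary keys, the length cap, and the function name are assumptions for illustration. Signs are produced in reading order, so displaying them right to left is left to the caller.

```python
import random

def generate_text(bigram, rng, max_len=20):
    """Sample signs from P(b|a), starting at '#' and stopping at '$'."""
    text, current = [], "#"
    for _ in range(max_len):
        successors = bigram[current]
        nxt = rng.choices(list(successors), weights=list(successors.values()))[0]
        if nxt == "$":
            break
        text.append(nxt)
        current = nxt
    return text

# Toy model, not estimated from any corpus.
toy_bigram = {
    "#": {"A": 0.7, "B": 0.3},
    "A": {"B": 0.6, "$": 0.4},
    "B": {"A": 0.2, "$": 0.8},
}

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(generate_text(toy_bigram, rng))
```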

Figure 10.

A section of the trigram matrix.
Trigram conditional probability P(c|ab), with a = 336, the most frequent triplet being 336,89,211 (circled in the plot). This gives the trigram conditional probability of all strings of the form 336,b,c.

Table 3.

Significant sign triplets from the log-likelihood ratio (LLR) measure of association for trigrams.

Table 4.

The entropy and mutual information of the EBUDS corpus.
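
As a reminder of the standard definitions, the sketch below computes the Shannon entropy of a distribution and the mutual information between adjacent signs from bigram counts, both in bits. The function names and the toy counts are hypothetical, and the table may use a different normalisation or logarithm base.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy in bits: -sum p log2 p."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(pair_counts):
    """Mutual information in bits between the first and second sign of a bigram.

    pair_counts maps (a, b) -> count of the bigram (a, b).
    """
    n = sum(pair_counts.values())
    first, second = Counter(), Counter()
    for (a, b), count in pair_counts.items():
        first[a] += count
        second[b] += count
    return sum(
        (count / n) * math.log2(count * n / (first[a] * second[b]))
        for (a, b), count in pair_counts.items()
        if count > 0
    )

if __name__ == "__main__":
    independent = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 1}
    correlated = {(0, 0): 2, (1, 1): 2}
    print(mutual_information(independent), mutual_information(correlated))
```

Mutual information is zero when adjacent signs are independent and grows as the first sign becomes more informative about the second.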

Figure 11.

Suggested restoration of signs missing from texts.
The last column lists the suggested restorations in decreasing order of probability (Left to Right).
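
Under a bigram model, the probability of a text with a candidate sign x placed between neighbours a and b depends on x only through the product P(x|a) P(b|x), so candidates can be ranked by that product. The sketch below shows this ranking on a toy model; the function name, the dictionary layout, and the toy probabilities are assumptions, and the procedure actually used for the figure may differ in detail.

```python
def restore(bigram, left, right, candidates):
    """Rank candidate signs for a gap between `left` and `right`.

    The score of x is P(x|left) * P(right|x); higher is more probable.
    """
    scores = {
        x: bigram.get(left, {}).get(x, 0.0) * bigram.get(x, {}).get(right, 0.0)
        for x in candidates
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy model, not estimated from any corpus.
toy_bigram = {
    "A": {"B": 0.5, "C": 0.5},
    "B": {"D": 0.9, "$": 0.1},
    "C": {"D": 0.1, "$": 0.9},
}

if __name__ == "__main__":
    print(restore(toy_bigram, "A", "D", ["B", "C"]))
```

For a gap at the start or end of a text, the same function can be called with the start or end token as the missing neighbour, provided those tokens are keys in the model.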

Figure 12.

Suggested restoration of doubtfully read signs in the texts of M77 corpus.
The last column lists the suggested restorations in decreasing order of probability (left to right). Signs marked with an asterisk at the top right are the doubtfully read signs that are restored using the bigram model.

Figure 13.

The most probable texts of length 4, 5 and 6 predicted by the model.
Note that exact instances of the predicted texts are present in the corpus for the 4-sign and 5-sign texts. For the 6-sign text, the same sequence, but with two insertions, is found in the corpus.

Figure 14.

Sensitivity of the bigram model taking all signs under 90% area of the cumulative probability curve as true positives.
The five plots are for five different sets of test and training sets of EBUDS as given in Table 5.

Table 5.

Mean sensitivity (in %), with standard deviation, of the model for each of the five test sets P1, P2, P3, P4 and P5.

Table 6.

Major conclusions.

Figure 15.

Text length distributions in the different corpora used in the analysis.
The raw corpus (M77) contains four instances of outliers, texts of length n = 2 and n = 3 which occur in unusually large numbers. Keeping only single occurrences of these removes the sharp maximum around n = 2 in the raw corpus. The corpus free of the outliers is then reduced again to keep only unique occurrences of the texts. This gives the M77-unique corpus. Finally, damaged, illegible and multi-line texts are removed to give the EBUDS corpus. Texts of length n = 3 and n = 5 are most frequent in this corpus.

Figure 16.

The conditional probability P(b|a = 2) from the maximum likelihood estimate (above) and from Witten-Bell smoothing (below).
The maximum likelihood estimate assigns zero probabilities to unseen sign pairs and results in a non-ergodic Markov chain. The Witten-Bell smoothing algorithm reduces the probabilities of the seen sign pairs and distributes the reduction over unseen sign pairs. This gives an ergodic Markov chain. The square roots of the conditional probabilities are plotted in each case to highlight the probabilities of unseen sign pairs.
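
The redistribution described above can be sketched with one common textbook formulation of Witten-Bell smoothing for bigrams. For a context a with N following tokens, T distinct following signs, and Z = V − T unseen signs out of a vocabulary of V, seen pairs get c(a,b)/(N + T) and each unseen pair gets T/(Z(N + T)). This particular variant, the function name, and the toy data are assumptions; the exact variant behind the figure is not stated in the caption.

```python
from collections import Counter, defaultdict

def witten_bell(texts, vocab):
    """Return smoothed P(b|a) as nested dicts, for all a and b in `vocab`.

    Assumes every sign appearing in `texts` is listed in `vocab`.
    """
    counts = defaultdict(Counter)
    for text in texts:
        for a, b in zip(text, text[1:]):
            counts[a][b] += 1
    v = len(vocab)
    probs = {}
    for a in vocab:
        n = sum(counts[a].values())   # tokens seen after a
        t = len(counts[a])            # distinct signs seen after a
        z = v - t                     # signs never seen after a
        if n == 0:
            # Context never observed: fall back to a uniform distribution.
            probs[a] = {b: 1.0 / v for b in vocab}
        elif z == 0:
            # Every sign was seen after a: nothing to redistribute.
            probs[a] = {b: counts[a][b] / n for b in vocab}
        else:
            probs[a] = {
                b: counts[a][b] / (n + t) if counts[a][b] > 0 else t / (z * (n + t))
                for b in vocab
            }
    return probs

# Toy data, not taken from any corpus.
toy_texts = [["A", "B"], ["A", "B"], ["A", "C"]]
toy_vocab = ["A", "B", "C", "D"]

if __name__ == "__main__":
    print(witten_bell(toy_texts, toy_vocab)["A"])
```

In the toy example the maximum likelihood estimate of P(B|A) is 2/3; after smoothing it drops to 2/5, and the unseen pairs (A, A) and (A, D) each receive non-zero probability, so every row still sums to one.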
