Probing the Statistical Properties of Unknown Texts: Application to the Voynich Manuscript

While statistical physics methods have proved useful for unveiling many patterns in large corpora, no comprehensive investigation has been performed on the interdependence between syntactic and semantic factors. In this study we propose a framework for determining whether a text (e.g., one written in an unknown alphabet) is compatible with a natural language and, if so, to which language it could belong. The approach is based on three types of statistical measurements: first-order statistics of word properties in a text, the topology of complex networks representing texts, and intermittency concepts in which the text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify the dependency of the different measurements on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidates for keywords of the Voynich Manuscript, which could be helpful in the effort of deciphering it. Because we were able to identify statistical measurements that depend more on the syntax than on the semantics, the framework may also serve for text analysis in language-dependent applications.

Methods from statistics, statistical physics, and artificial intelligence have increasingly been used to analyze large volumes of text for a variety of applications [1][2][3][4][5][6][7], some of which are related to fundamental linguistic and cultural phenomena. Examples of studies on human behaviour are the analysis of mood change in social networks [1] and the identification of literary movements [3]. Other applications of statistical natural language processing techniques include the development of statistical techniques to improve the performance of information retrieval systems [8], search engines [9], machine translators [10,11] and automatic summarizers [12]. Evidence of the success of statistical techniques for natural language processing is the superiority of current corpus-based machine translation systems in comparison to their counterparts based on the symbolic approach [13].
The methods for text analysis we consider can be classified into three broad classes: (i) those based on first-order statistics, where data on classes of words are used in the analysis, e.g., the frequency of words [14]; (ii) those based on metrics from networks representing text [3,4,6,7,15]; (iii) those using intermittency concepts and time-series analysis for texts [4,5]. One of the major advantages inherent in these methods is that no knowledge about the meaning of the words or the syntax of the languages is required. Furthermore, large corpora can be processed at once, thus allowing one to unveil hidden text properties that would not be probed in a manual analysis, given the limited processing capacity of humans. The obvious disadvantages are related to the superficial nature of the analysis, for even simple linguistic phenomena such as lexical disambiguation of homonymous words are very hard to treat. Another limitation of these statistical methods is the need to identify the representative features for the phenomena under investigation, since many parameters can be extracted from the analysis but there is no rule to determine which are really informative for the task at hand. Most significantly, in a statistical analysis one may not even be sure whether the sequence of words in the dataset represents a meaningful text at all. For testing whether an unknown text is compatible with natural language, one may calculate measurements for this text and several others of a known language, and then verify if the results are statistically compatible. However, there may be variability among texts of the same language, especially owing to semantic issues.
In this study we combine measurements from the three classes above and propose a framework to determine the importance of these measurements in investigations of unknown texts, regardless of the alphabet in which the text is encoded. The statistical properties of words and books were obtained for comparative studies involving the same book (New Testament) in 15 languages and distinct pieces of text written in English and Portuguese. The purpose of this type of comparison was to identify the features capable of distinguishing a meaningful text from its shuffled version (where the position of the words is randomized), and then determine the proximity of pieces of text.
As an application of the framework, we analyzed the famous Voynich Manuscript (VMS), which has remained indecipherable in spite of a century of attempts by renowned cryptographers. This manuscript dates back to the 15th century, was possibly produced in Italy, and was named after Wilfrid Voynich, who bought it in 1912. In the analysis we make no attempt to decipher the VMS, but we have been able to verify that it is compatible with natural languages, and have even identified important keywords, which may provide a useful starting point toward deciphering it.

II. DESCRIPTION OF THE MEASUREMENTS
The analysis involves a set of steps going beyond the basic calculation of measurements, as illustrated in the workflow in Fig. 1. Some measurements are averaged in order to obtain a measurement at the text level from the measurement at the word level. In addition, a comparison with values obtained after randomly shuffling the text is performed to assess to which extent text structure is reflected in the measurements.
A. Raw measurements

First-order statistics
The simplest measurements obtained are the vocabulary size M, which is the number of distinct words in the text, and the frequency of word i (number of appearances), denoted by N_i. The heterogeneity of the contexts surrounding words was quantified with the so-called selectivity measurement [16]. If a word is strongly selective then it always co-occurs with the same adjacent words. Mathematically, the selectivity of a word i is s_i = 2N_i/t_i, where t_i is the number of distinct words that appear immediately beside (i.e., before or after) i in the text.
A language-dependent feature is the number of different words (types) that appear at least once as two identical tokens in a row in the text. In some languages this repetition is rather unusual, but in others it may occur with a reasonable frequency (see Sec. III and Figure 3). In this paper, the number of such repeated bigrams is denoted by B.
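These first-order statistics are straightforward to obtain from a token sequence. A minimal sketch in Python (the whitespace tokenization and the tiny example text are illustrative only):

```python
from collections import Counter, defaultdict

def first_order_stats(tokens):
    """Vocabulary size M, frequencies N_i, selectivity s_i = 2*N_i/t_i
    (t_i = number of distinct words adjacent to i), and B = number of
    word types that occur at least once immediately repeated."""
    freq = Counter(tokens)                      # N_i
    neighbours = defaultdict(set)               # distinct adjacent words of each type
    repeated = set()
    for a, b in zip(tokens, tokens[1:]):
        neighbours[a].add(b)
        neighbours[b].add(a)
        if a == b:
            repeated.add(a)                     # contributes to B
    selectivity = {w: 2 * freq[w] / len(nb) for w, nb in neighbours.items()}
    return len(freq), freq, selectivity, len(repeated)

M, freq, s, B = first_order_stats("the cat sat on the the mat".split())
```

Here 'the' occurs 3 times beside 4 distinct neighbours (including itself, via the repeated bigram), so s = 2·3/4 = 1.5 and B = 1.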

Network characterization
Complex networks have been used to characterize texts [3,4,6,7,15], where the nodes represent words and links are established based on word co-occurrence, i.e., a link between two nodes is established if the corresponding words appear adjacent at least once in the text. In most applications of co-occurrence networks, the stopwords [27] are removed and the remaining words are lemmatized [28]. Here, we decided not to do this because in unknown languages it is impossible to derive lemmatized word forms or to identify stopwords. To characterize the structure and organization of the networks, the following topological metrics of complex networks were calculated (more details are given in the Supplementary Information (SI)): • We quantify degree correlations, i.e., the tendency of nodes of a certain degree to be connected to nodes with similar degree (the degree of a node is the number of links it has to other nodes), with the Pearson correlation coefficient r, thus distinguishing assortative (r > 0) from disassortative (r < 0) networks.
• The so-called clustering coefficient, C_i, is given by the fraction of closed triangles of a node, i.e., the number of actual connections between the neighbours of a node divided by the possible number of connections between them. The global clustering coefficient C is the average over the local coefficients of all nodes.
• The average shortest path length, L_i, is the length of the shortest path between nodes i and j, averaged over all possible j's. In text networks it measures the relevance of words according to their distance to the most frequent words [4].
• The diameter d corresponds to the maximum shortest path, i.e., the maximum distance on the network between any two nodes.
• We also characterized the topology of the networks through the analysis of motifs, i.e., connectivity patterns expressed in terms of small building blocks (or subgraphs) [17]. We define m_Y as the number of occurrences of motif Y in the network. The motifs employed in the current paper are displayed in Figure 2.
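The network construction and two of the metrics above can be sketched in pure Python. This is an illustration, not the paper's implementation: self-loops from repeated bigrams are ignored, and r is computed as the Pearson correlation between the degrees at the two ends of every edge:

```python
from itertools import combinations
from math import sqrt

def cooccurrence_graph(tokens):
    """Undirected co-occurrence network: one node per word type, an edge
    whenever two (distinct) words are adjacent at least once."""
    adj = {}
    for a, b in zip(tokens, tokens[1:]):
        if a != b:                      # ignore self-loops from repeated bigrams
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj

def assortativity(adj):
    """Pearson correlation r of the degrees at the two ends of every edge
    (both orientations): r > 0 assortative, r < 0 disassortative."""
    xs, ys = [], []
    for u, nbrs in adj.items():
        for v in nbrs:
            xs.append(len(nbrs))
            ys.append(len(adj[v]))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def clustering(adj):
    """Global clustering coefficient C: average of the local coefficients."""
    cs = []
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            cs.append(0.0)
            continue
        links = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
        cs.append(2 * links / (k * (k - 1)))
    return sum(cs) / len(cs)

# a star-shaped toy network: 'a' linked to 'b', 'c' and 'd'
adj = cooccurrence_graph("a b a c a d".split())
```

A star is the textbook disassortative case (every edge joins the high-degree hub to a degree-1 leaf, so r = -1) and contains no triangles (C = 0).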

Intermittency
The fact that words are unevenly distributed along texts has been used to detect keywords in documents [5,18,19]. Since bursty words appear concentrated in portions of the text, in contrast to others that are distributed homogeneously along the text, words with different functions can be distinguished.
The intermittency was calculated using the concept of recurrence times, which has been used to quantify the burstiness of time series. In the case of documents, the time series of a word is obtained by counting the number of words (representing time) between successive appearances of the considered word. For example, the recurrence times for the word 'the' in the previous sentence are T_1 = 4, T_2 = 10, and T_3 = 11. If N_i is the frequency of word i, its time series is composed of the elements {T_1, T_2, ..., T_{N_i - 1}}. Because the time T_f until the first occurrence and the time T_l after the last occurrence are not otherwise considered, an additional element T_{N_i} is defined as T_{N_i} = T_f + T_l. Note that with the inclusion of T_{N_i} in the time series, the average over all N_i values is ⟨T⟩_i = N/N_i, where N is the total number of tokens in the text. Then, to compute the heterogeneity of the distribution of word i in the text, we obtained the intermittency I_i as the ratio between the standard deviation and the mean of the recurrence times,

I_i = σ(T)/⟨T⟩_i = sqrt(⟨T²⟩_i − ⟨T⟩_i²) / ⟨T⟩_i.    (1)

Words distributed by chance have I_i ≈ 1 (for N_i ≫ 1), while bursty words have I_i > 1. Words with N_i < 5 were neglected since they lack statistics.
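Assuming intermittency is the coefficient of variation σ(T)/⟨T⟩ of the recurrence times, with the boundary element T_f + T_l folded in as described, a minimal sketch is:

```python
from math import sqrt

def intermittency(tokens, word):
    """I = sigma(T)/<T> over the recurrence times of `word`, with the
    boundary element T_N = T_f + T_l appended so that <T> = N/N_i."""
    N = len(tokens)
    pos = [i for i, w in enumerate(tokens) if w == word]
    if len(pos) < 2:
        return None
    times = [b - a for a, b in zip(pos, pos[1:])]
    times.append(pos[0] + 1 + (N - 1 - pos[-1]))    # T_f + T_l
    mean = sum(times) / len(times)                  # equals N / N_i
    var = sum((t - mean) ** 2 for t in times) / len(times)
    return sqrt(var) / mean

# a perfectly periodic word: all recurrence times equal, zero burstiness
toks = ["x"] * 12
for p in (2, 5, 8, 11):
    toks[p] = "w"
```

For the periodic toy text every recurrence time is 3 (including the boundary term), so the mean is N/N_i = 12/4 = 3 and I = 0.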

B. From word to text measurements
Many of the measurements defined in the previous Section are attributes of a word i. For our aims here it is essential to compare different texts. The easiest and most straightforward choice is to assign to a piece of text the average value of each measurement X_i computed over all M words in the text, ⟨X⟩ = M^{-1} Σ_i X_i. This was done for L, C, I, k and s. One potential limitation of this approach is that the same weight is attributed to each word, regardless of its frequency in the text. To overcome this, we also calculated another metric, X*, obtained as the average over the η most frequent words, i.e., X* = η^{-1} Σ_i X_i, where the sum runs over the η most frequent words. Here, we chose η = 50. Finally, because s is known to have a distribution with long tails [16], we also computed the exponent γ_s of the power law P(s) ∝ s^{-γ_s}, for which the maximum-likelihood methodology described in [20] was used.
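The exponent γ_s can be estimated with the continuous maximum-likelihood formula of Clauset, Shalizi and Newman [20], γ = 1 + n / Σ ln(s_i/s_min). A sketch with s_min fixed by hand (the full method of [20] also selects s_min from the data):

```python
from math import log

def powerlaw_exponent(values, s_min):
    """Continuous MLE of gamma for P(s) ~ s^(-gamma):
    gamma = 1 + n / sum(ln(s_i/s_min)) over the tail s_i >= s_min."""
    tail = [v for v in values if v >= s_min]
    return 1.0 + len(tail) / sum(log(v / s_min) for v in tail)
```

Applied to samples drawn through the inverse CDF s = s_min (1-u)^{-1/(γ-1)} with γ = 2.5, the estimator recovers the exponent closely.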

C. Comparison to shuffled texts
Since we are interested in measurements capable of distinguishing a meaningful text from its shuffled version, each of the measurements ⟨X⟩ and X* described above was normalized by the average obtained over 10 texts produced by a word-shuffling process, i.e., randomizing the word order while preserving the word frequencies. If μ(⟨X(R)⟩) and σ(⟨X(R)⟩) are respectively the average and the standard deviation over the 10 realizations of shuffled texts, the normalized measurement X and the uncertainty ε(X) related to X are:

X = ⟨X⟩ / μ(⟨X(R)⟩),    (2)
ε(X) = σ(⟨X(R)⟩) / μ(⟨X(R)⟩).    (3)

Normalization by the shuffled text is useful because it permits comparing each measurement with a null model. Hence, a measurement provides significant information only if its normalized value X is not ε(X)-close to X = 1. Moreover, the influence of the vocabulary size M on the other measurements tends to be minimized.
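The normalization against the shuffled null model can be sketched as follows (`measure` is any text-level measurement; the fixed seed is only for reproducibility of the illustration):

```python
import random
from statistics import mean, stdev

def normalize_by_shuffled(tokens, measure, n_shuffles=10, seed=0):
    """Return (X, eps): the measurement divided by its average over
    shuffled realizations, and the relative uncertainty sigma/mu."""
    rng = random.Random(seed)
    raw = measure(tokens)
    null = []
    for _ in range(n_shuffles):
        t = tokens[:]
        rng.shuffle(t)                  # preserves word frequencies
        null.append(measure(t))
    mu, sigma = mean(null), stdev(null)
    return raw / mu, sigma / mu

# sanity check: vocabulary size is invariant under shuffling, so X = 1 exactly
x_norm, eps = normalize_by_shuffled("the cat sat on the mat".split(),
                                    lambda t: len(set(t)))
```

Any frequency-preserved quantity (like the vocabulary size used here) normalizes to exactly 1 with zero uncertainty, which is why such measurements carry no information about word order.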

III. VARIABILITY ACROSS LANGUAGES AND TEXTS
The measurements described in Section II vary from text to text due to the syntactic properties of the language. In a given language, there is also an obvious variation among texts on account of stylistic and semantic factors. Thus, in a first approximation one may assume that variations of a measurement X across texts occur in two dimensions. Let X_{t,l} denote the value of X for text t written in language l. If we had access to the complete matrix X_{t,l}, i.e., if all possible texts in every possible language could be analyzed, we could simply compare a new text t to the full variation of the measurements X_{t,l} in order, e.g., to determine with which languages λ the text is compatible. In practice, we can at best have some rows and columns filled, and therefore additional statistical tests are needed in order to characterize the variation of specific measurements. We denote by P(X_{t,l=λ}) the distribution of measurement X across different texts in a fixed language l = λ, and by P(X_{t=τ,l}) the distribution of X for a fixed text t = τ written in various languages. Accordingly, μ(P) and σ(P) represent the expectation and the variation of the distribution P. For concreteness, Fig. 3 illustrates the distribution of X = B (number of duplicated bigrams) for the three sets of texts we use in our analysis: 15 books in Portuguese, 15 books in English, and 15 versions of the New Testament in different languages (see SI for details). We consider also the average ⟨X⟩ and the standard deviation σ(X) of X computed over different books (e.g., each of the three sets of 15 books) and the correlation R_M between X and the vocabulary size M of the book. Table I shows the values of ⟨X⟩, σ(X) and R_M of all measurements in each of the three sets of books. In order to obtain further insights on the dependence of these measurements on language (syntax) and text (semantics), next we perform additional statistical analyses to identify measurements that are more suitable to target specific problems.

FIG. 3: Distribution of X = B for the New Testament (black circles), English (red circles) and Portuguese (blue circles) texts. The average ⟨X⟩ for each of the three sets of texts is represented as dashed lines.

A. Distinguishing books from shuffled sequences
Our first aim is to identify measurements capable of distinguishing between natural and shuffled texts, which will be referred to as informative measurements. For instance, for X = B in Fig. 3 all values are much smaller than 1 in all three sets of texts, indicating that this measurement takes smaller values in natural texts than in shuffled texts.

TABLE I: Verification of which measurements satisfy conditions ζ1, ζ2, ζ'2 and ζ3. R_M is the Pearson correlation between X and the vocabulary size M. We assume that ζ1, ζ2, ζ'2 and ζ3 are satisfied, respectively, when ρ = 0.00 %, υ_{t=new,l} > υ_{t,l=λ}, ι(υ_{t=τ,l}) ∩ ι(υ_{t,l=λ}) ≤ 0.05 ι(υ_{t=τ,l}) ∪ ι(υ_{t,l=λ}), and c(X_{t=new,l=λ}, P(X_{t,l=λ})) > 0.05. Measurements satisfying a condition for all three sets of texts are marked with a filled circle (•).

In order to quantify the distance of a set of values {X} from X = 1, we define the quantity ρ(X = 1, {X}) as the proportion of elements in the set {X} for which X = 1 lies within the interval X ± ε(X), where ε(X) arises from fluctuations due to the randomness of the shuffling process as defined in eq. (3). This leads to condition ζ1:

ζ1: X is informative if ρ(X = 1, {X}) ≈ 0,

where {X} is a set of values of X obtained over different texts or languages, and |{X}| is the number of elements in this set. We now discuss the results obtained applying ζ1 (with ρ(X = 1, {X}) = 0) to all three sets of texts in our database for each of the measurements described in Sec. II. Measurements which satisfied ζ1 are indicated by a • in Tab. I.
Several of the network measurements (d, L, C, k* and the motifs m_C, m_E and m_K) do not fully satisfy ζ1. Consequently they cannot be used to distinguish a manuscript from its shuffled version. This finding is rather surprising, because some of these measurements were proven useful to grasp subtleties in texts, e.g., for author recognition [4]. In the latter application, however, the networks representing text did not contain stopwords and the texts were lemmatized. The averaging over the 50 most frequent words seems to be essential to satisfy ζ1 for the clustering coefficient and for the shortest paths (note that C* and L* are informative while C and L are not). This means that the informativeness of these quantities is concentrated in the most frequent words. For the degree, the opposite effect occurs, i.e., k is informative and k* is not. The informativeness of intermittency (I and I*) may be explained by the fact that, by construction, I_i ≈ 1 in shuffled texts (see Sec. II A 3), whereas in natural texts many words tend to appear clustered in specific regions, yielding I_i > 1. The selectivity s is also strongly affected by the shuffling process. Words in shuffled texts tend to be less selective, which yields an increase in γ_s [16] (i.e., very selective words occur very sporadically) and a decrease in s and s*. The selectivity is related to the effect of word consistency [29] (see Ref. [21]), which was verified to be common in English, especially for very frequent words. The number of repeated bigrams B is also informative, which means that in natural languages it is unlikely that the same word is immediately repeated (when compared with random texts). As for the informative motifs, m_A, m_D, m_F, m_G, m_I, m_J, m_L and m_M rarely occur in natural-language texts (X < 1), while motif m_B was the only measurement taking values both above and below 1. The emergence of this motif therefore appears to depend on the syntax, being very rare in Xhosa, Vietnamese, Swahili, Korean, Hebrew and Arabic.

B. Dependence on style and language
We are now interested in investigating which text measurements are more dependent on the language than on the style of the book, and vice versa. Measurements depending predominantly on the syntax are expected to have larger variability across languages than across texts. On the other hand, measurements depending mainly on the story (semantics) being told are expected to have larger variability across texts in the same language [30]. The variability of the measurements was computed with the coefficient of variation υ = σ(X)/⟨X⟩, where σ(X) and ⟨X⟩ represent respectively the standard deviation and the average computed over the books in the set {X}. Thus, we may assume that X satisfies condition ζ2, i.e., that it is more dependent on the language (syntax) than on the style (semantics), if υ_{t=τ,l} > υ_{t,l=λ}.
The results for the measurements satisfying conditions ζ2 and ζ'2 are shown in Tab. I. Measurements satisfying these conditions serve to examine the dependency on the syntax or on the style/semantics. The vocabulary size M and the network measurements r, L, L*, C, k and k* are more dependent on syntax than on semantics. The measurements derived from the selectivity (γ_s, s and s*) are also strongly dependent on the language. With regard to the motifs, five of them satisfy ζ2 and ζ'2: m_B, m_C, m_H, m_K and m_M. Remarkably, I and I* are the only measurements with low values of υ_{t=new,l}/υ_{t,l=λ}. Reciprocally, the only measurement which statistically significantly violated ζ2 (i.e., satisfied ζ'2) was I*. This confirms that the average intermittency of the most frequent words is more dependent on the style than on the language.

C. On the representativeness of measurements
The practical implementation of our general framework quantifies the variation across languages using a single book (the New Testament). This was done because of the lack of books available in a large number of languages. In order for this approach to work, it is essential to determine whether fluctuations across different languages are representative of the fluctuations observed across different books. We now try to identify the measurements X whose values for a single book in a specific language λ (X_{t=new,l=λ}) are compatible with those of other books in the same language (X_{t,l=λ}). To this end we define the compatibility c(X, P) of X_{t=new,l=λ} with P(X_{t,l=λ}). The distribution P was obtained with Parzen-window interpolation [23] using a Gaussian kernel; more precisely, P was constructed by adding Gaussian distributions centered around each X observed over different texts in a fixed language λ. The compatibility c(X, P) is then computed as the probability, under P, of observing a value at least as far from the median of P as X:

c(X, P) = ∫_{|x − X_median| ≥ |X − X_median|} P(x) dx,    (4)

where X_median is the median of P(X). For practical purposes, we consider that X_{t=new,l=λ} is compatible with the other books written in the same language λ if ζ3 is fulfilled: ζ3: X_{t=new,l} is a representative measurement of the language λ if c(X_{t=new,l=λ}, P(X_{t,l=λ})) > 0.05.
The representativeness of the measurements computed for the New Testament was checked using the distribution P(X) obtained from the sets of books written in Portuguese and English. The standard deviation employed in the Parzen method was the worst deviation between English and Portuguese, i.e., σ = min{σ_pt, σ_en}. The measurements satisfying ζ3 for both the English and Portuguese datasets are displayed in the last column of Tab. I. With regard to the network measurements, only L, L*, C and C* are representative, suggesting that they are weakly dependent on the variation of style (obviously assuming the New Testament as a reference). In addition, I, I*, B, γ_s, s* and m_L turned out to be representative measurements.

IV. CASE STUDY: THE VOYNICH MANUSCRIPT (VMS)
So far we have introduced a framework for identifying the dependency of different measurements on the language and the story of different books. We now investigate to what extent the measurements we identified as relevant can provide information in the analysis of single texts. The Voynich Manuscript (VMS), named after the book dealer Wilfrid Voynich who bought the book in the early 20th century, is a 240-page folio that dates back to the 15th century. Its mysterious aspect has captivated people's attention for centuries. Indeed, the VMS has been studied by professional cryptographers, remaining a challenge to scholars and decoders [24,25], and is currently included among the six most important ciphers [24]. The various hypotheses about the VMS can be summarized in three categories: (i) a sequence of words without a meaningful message; (ii) a meaningful text written originally in an existing language which was coded (and possibly encrypted) in the Voynich alphabet; and (iii) a meaningful text written in an unknown (possibly constructed) language. While it is impossible to investigate all these hypotheses systematically, here we perform a number of statistical analyses which aim at clarifying the feasibility of each of these scenarios. To address point (i) we analyze shuffled texts. To address point (ii) we consider 15 different languages, including the artificial language Esperanto, which allows us to touch on point (iii) too. We do not consider the effect of encryption of the text.
The statistical properties of the VMS were obtained to try and answer the questions posed in Tab. II, which required checking which measurements would lead to statistically significant results. To check whether a given text is compatible with its shuffled version, X computed in texts written in natural languages should always be far from X = 1, and therefore only informative measurements are able to answer question Q1. To test whether a text is consistent with some natural language (question Q2), the texts employed as a basis for comparison (i.e., the New Testament) should be representative of the language. Accordingly, condition ζ3 must be satisfied when selecting suitable measurements to answer Q2. ζ2 and ζ'2 must be satisfied by measurements suitable to answer Q3, because the variance in style within a language should be small if one wishes to determine the most similar language. Otherwise, an outlier text in terms of style could be taken as belonging to another language. An analogous reasoning applies to selecting measurements to identify the closest style. Finally, note that the answers to Q3 and Q4 depend on a comparison with the New Testament in our dataset. Hence, suitable measurements must fulfill condition ζ3 in order to ensure that the measurements computed for the New Testament are representative of the language.
A. Is the VMS distinguishable from its shuffled text?
Before checking the compatibility of the VMS with shuffled texts, we verified whether Q1 can be accurately answered in a set of books written in Portuguese and English, henceforth referred to as the test dataset (see SI-Tab. 3). A given test text was considered as not shuffled if the interval from X − ε(X) to X + ε(X) does not include X = 1. To quantify the distance of a text from its shuffled version, we defined the distance D:

D = |X − 1| / ε(X),    (5)

which quantifies how many ε's the value X is from X = 1. As one should expect, the values of X computed in the test dataset for λ = pt and λ = en (see SI-Tab. 4) indicate that no text is compatible with its shuffled version, because D > 1, which means that the interval from X − ε(X) to X + ε(X) does not include X = 1.
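Reading eq. (5) as D = |X − 1|/ε(X), the decision rule is a one-liner:

```python
def shuffle_distance(x_norm, eps):
    """D = |X - 1| / eps(X): how many uncertainties the normalized
    measurement lies from the shuffled-text value X = 1 (eq. 5)."""
    return abs(x_norm - 1.0) / eps

def looks_natural(x_norm, eps):
    """A text is taken as incompatible with its shuffled version if D > 1,
    i.e. the interval X +/- eps(X) excludes 1."""
    return shuffle_distance(x_norm, eps) > 1.0
```

For example, a normalized value X = 0.4 with ε = 0.1 lies D = 6 uncertainty intervals away from the null model.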
Once the methodology appropriately classified the texts in the test dataset as incompatible with their shuffled versions, we are in a position to apply it to the VMS. The values of X for the VMS, denoted as X_VMS, in Tab. III indicate that the VMS is not compatible with shuffled texts: for all measurements but one (C*), the interval from X_VMS − ε(X_VMS) to X_VMS + ε(X_VMS) does not include X = 1, suggesting that the word order in the VMS is not established by chance. The property of the VMS that is most distinguishable from shuffled texts was determined quantitatively using the distance D_VMS from eq. (5). Tab. III shows the largest distances for intermittency (I and I*) and network measurements (k and L*). Because intermittency is strongly affected by stylistic/semantic aspects and network measurements are mainly influenced by syntactic factors, we take these results to mean that the VMS is not compatible with shuffled, meaningless texts.

B. Is the VMS compatible with a text in natural languages?
The compatibility with natural languages was checked by comparing the suitable measurements for the VMS with those for the New Testament written in 15 languages. Similarly to the analysis of compatibility with shuffled texts, we validated our strategy in the test dataset as follows. The compatibility with natural texts was computed using eq. (4), where P was computed from the New Testament dataset. The standard deviation of each Gaussian representing a book in the test dataset should be proportional to the variation of X across different texts, and therefore we used the worst σ between English and Portuguese. The values displayed in SI-Tab. 5 reveal that all books are compatible with natural texts, as one should expect. Therefore we have good indications that the proposed strategy is able to properly decide whether a text is compatible with natural languages.

TABLE II: The conditions that must be fulfilled by the measurements for answering each of the questions posed. For Q1, X should not be close to X = 1, because X ≈ 1 in shuffled texts. In the case of Q3, it is desirable that there is no intersection between the measurements computed for books belonging to different languages; therefore ζ2 and ζ'2 should be fulfilled. To find the closest style, the measurement must be strongly dependent on style, i.e., only ζ'2 should be fulfilled. Finally, if a question involves a comparison of the unknown manuscript with the New Testament, then it requires that the measurements employed are representative; therefore Q2, Q3 and Q4 require the fulfillment of condition ζ3.

Questions                                              ζ1   ζ2   ζ'2   ζ3
Q1  Is the text compatible with its shuffled version?  •
Q2  Is the text compatible with a natural language?                    •
Q3  Which language is closer to the manuscript?             •    •     •
Q4  Which style is closer to the manuscript?                     •     •
The distance from the VMS to the natural languages was estimated by obtaining the compatibility c(X_VMS, P(X_{t=new,l})) (see eq. (4)). In this case, P was constructed by adding Gaussian distributions centered around each X observed in the New Testament over the different languages λ. The distribution P for three measurements is illustrated in Fig. 4. The values of c(X_VMS, P(X_{t=new,l})) displayed in Tab. IV confirm that the VMS is compatible with natural languages for most of the measurements suitable to answer Q2. The exceptions were B and I*. A large B is a particular feature of the VMS: the number of duplicated bigrams is much greater than expected by chance, unlike in natural languages. I* is higher for the VMS than typically observed in natural languages (see Fig. 4(a)), even though the absolute intermittency of the most frequent words in the VMS is not far from that of natural languages. Since the intermittency I is related to the large-scale distribution of a (key)word in the text, we speculate that the reason for these observations may be the fact that the VMS is a compendium of different topics. Except for I* and B, the measurements computed for the VMS are consistent with those expected for texts written in natural languages.
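A sketch of the compatibility test, under our reading of eq. (4) as the two-tailed probability of falling at least as far from the median of the Parzen estimate as the observed value (`centers` are the per-language values of X, `sigma` the kernel width):

```python
from math import erf, sqrt

def parzen_cdf(x, centers, sigma):
    """CDF of a Parzen (Gaussian-kernel) density estimate."""
    return sum(0.5 * (1 + erf((x - c) / (sigma * sqrt(2))))
               for c in centers) / len(centers)

def compatibility(x, centers, sigma):
    """Two-tailed probability, under the Parzen estimate, of a value at
    least as far from the estimate's median as x."""
    lo, hi = min(centers) - 10 * sigma, max(centers) + 10 * sigma
    for _ in range(200):                # bisect the CDF to locate the median
        mid = (lo + hi) / 2
        if parzen_cdf(mid, centers, sigma) < 0.5:
            lo = mid
        else:
            hi = mid
    median = (lo + hi) / 2
    d = abs(x - median)
    return (parzen_cdf(median - d, centers, sigma)
            + 1 - parzen_cdf(median + d, centers, sigma))
```

A value at the median is maximally compatible (c = 1), while a value several kernel widths away yields c far below the 0.05 threshold of ζ3.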

TABLE IV: Compatibility c(X_VMS, P(X_{t=new,l})) for each suitable measurement X.

X    r     L     L*    C     C*    I     I*    B     s*    γ_s
c    0.14  0.62  0.99  0.96  0.05  0.39  0.00  0.00  0.09  0.12

C. Which language/style is closer to the VMS?
We address this question in full generality, but we shall show that, with the limited dataset employed, we cannot obtain a faithful prediction of the language of a manuscript. Given a text τ, we identify the most similar language according to the following procedure. We first calculate the Euclidean distance (using the z-normalized values of the measurements suitable to answer Q3 in Tab. II) between the book under analysis and the versions of the New Testament. Let R_{λ,τ} be the ranking obtained by language λ in the text τ. Given a set of texts T written in the same language, this procedure yields a list of R_{λ,τ} for each τ ∈ T. In this case, it is useful to combine the different R_{λ,τ} by considering the product of the normalized ranks,

δ_λ = Π_{τ ∈ T} R_{λ,τ} / |T|,    (6)

where |T| is the number of texts in the database T. This choice is motivated by the fact that R_{λ,τ}/|T| corresponds to the probability of achieving by chance a rank as good as R_{λ,τ}, so that δ_λ in eq. (6) corresponds to the probability of obtaining such a ranking by chance in every single case. By ranking the languages according to δ_λ we obtain a list of the best candidates for the language of the texts in T.
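The rank-product combination of eq. (6) can be sketched as follows (`rank_table` maps each candidate language to its ranks over the test texts; the language names are illustrative):

```python
from math import prod

def delta(ranks, size):
    """delta_lambda = product over texts of R_{lambda,tau} / size (eq. 6);
    smaller values mean the language ranks consistently well."""
    return prod(r / size for r in ranks)

def best_languages(rank_table, size):
    """rank_table: {language: [rank obtained in each test text]}.
    Returns languages sorted from most to least plausible."""
    return sorted(rank_table, key=lambda lam: delta(rank_table[lam], size))
```

For instance, a language ranked 1st and 2nd over two texts (δ = 2/225 for size 15) beats one ranked 3rd and 1st (δ = 3/225).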
In our control experiments with |T| = 15 known texts, we verified that the measurements suitable to answer Q3 led to results for the books in Portuguese and English of our dataset which do not always coincide with the correct language. In the case of the Portuguese test dataset, Portuguese was the second-best language (after Greek), while in the English dataset the most similar languages were Greek and Russian, with English only in sixth place. Even though the most similar language did not match the language of the books, the δ_λ obtained were significantly better than chance (p-value = 1.0 × 10^-7 and 4.3 × 10^-5 in the English and Portuguese test sets, respectively).
The reason why the procedure above was unable to predict the correct language is directly related to the use of only one example (a version of the New Testament) for each language, whereas robust classification methods use many examples for each class. Hence, finding the most similar language to the VMS will require further efforts, with the analysis of as many books as possible representing each language; this will be a challenge, since there are not many texts widely translated into many languages.

D. Keywords of the VMS
One key problem in information sciences is the detection of important words, as they offer clues about the content of a text. In the context of decryption, the identification of keywords may help guide the deciphering process, because cryptographers can focus their attention on the most relevant words. Traditional techniques are based on the analysis of frequency, such as the widely used term frequency-inverse document frequency (tf-idf) [14]. Basically, it assigns a high relevance to a word if the word is frequent in the document under analysis but not in the other documents of the collection. The main drawback of this approach is the requirement of a set of representative documents in the same language. Obviously, this restriction makes it impossible to apply tf-idf to the VMS, since there is only one document written in this "language". Another possibility would be to use entropy-based methods [5,18] to detect keywords. However, the application of all these methods to cases such as the VMS is limited because they typically require the manuscript to be arranged in partitions, such as chapters and sections, which are not easily identified in the VMS.
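For reference, a minimal sketch of the tf-idf scoring described above, in its smoothing-free textbook form (not necessarily the exact variant of Ref. [14]), applied to a toy tokenized corpus:

```python
import math
from collections import Counter

def tf_idf(documents):
    """Score each word w in each document d by term frequency times
    inverse document frequency: tf-idf(w, d) = tf(w, d) * log(D / df(w)),
    where D is the number of documents and df(w) the number of
    documents that contain w."""
    D = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    scores = []
    for doc in documents:
        tf = Counter(doc)
        scores.append({w: tf[w] * math.log(D / doc_freq[w]) for w in tf})
    return scores
```

A word appearing in every document gets idf = log(1) = 0, i.e. zero relevance, which is exactly why the method needs a collection of comparable documents and cannot be applied to a single isolated manuscript such as the VMS.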
To overcome this problem, we use the fact that keywords show high intermittency inside a single text [5,19]. This feature can therefore play the role traditionally played by the inverse document frequency (idf). In the spirit of the tf-idf analysis, we define the relevance Ω_i of word i as proportional to both the intermittency I_i and the frequency N_i:

Ω_i = (I_i − 1) √(log N_i).  (7)

Note that, owing to the factor (I_i − 1), words with I_i ≈ 1 receive low values of Ω even if they are very frequent. There are other methods for detecting keywords that rely on the analysis of the uneven distribution of words [26], but we decided not to use them because they generate better results for short texts, which is not the case of the VMS. For small texts and small frequencies, corrections to our definition of intermittency should be used; see Ref. [26], which also contains alternative methods for the computation of keywords from intermittency. In order to validate Ω, we applied Eq. (7) to the New Testament in Portuguese, English and German. An inspection of Tab. V for Portuguese, English and German indicates that representative words have been captured, such as the characters "Pilates", "Herod", "Isabel" and "Maria" and important concepts of the biblical background such as "nasceu" (was born), "céus"/"himmelreich" (heavens), "heuchler" (hypocrite), "demons" and "sabbath". In the right column of Tab. V we present the list of words obtained for the VMS through the same procedure, which are natural candidates for keywords.
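A minimal sketch of Eq. (7), under the assumption that the intermittency I_i is measured as the ratio σ/μ of the gaps between successive occurrences of a word (with a wrap-around gap so that the gaps sum to the text length); the function and variable names are illustrative:

```python
import math
import statistics

def intermittency(positions, text_length):
    """Burstiness I = sigma/mu of the gaps between successive
    occurrences of a word, treating the text cyclically so that
    the gaps sum to the text length."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    gaps.append(text_length - positions[-1] + positions[0])  # wrap-around gap
    mu = statistics.mean(gaps)
    sigma = statistics.pstdev(gaps)
    return sigma / mu

def relevance(tokens):
    """Keyword score Omega_i = (I_i - 1) * sqrt(log N_i), Eq. (7):
    high for words that are both frequent and unevenly spread."""
    positions = {}
    for pos, w in enumerate(tokens):
        positions.setdefault(w, []).append(pos)
    omega = {}
    for w, pos in positions.items():
        if len(pos) < 2:
            continue  # intermittency is not defined for a single occurrence
        I = intermittency(pos, len(tokens))
        omega[w] = (I - 1) * math.sqrt(math.log(len(pos)))
    return omega
```

A word occurring in a few tight bursts has gaps with large dispersion (I > 1) and receives a positive Ω, while an evenly spread word of the same frequency has nearly constant gaps (I ≈ 0 or I ≈ 1) and receives a low or negative Ω.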
V. CONCLUSION

In this paper we have developed the first steps towards a statistical framework to determine whether an unknown piece of text, recognized as such by the presence of a sequence of symbols organized in "words", is a meaningful text, and which language or style is closest to it. The framework encompasses a statistical analysis of individual words and then of entire books, using three types of measurements, namely metrics obtained from first-order statistics, metrics from networks representing texts, and the intermittency properties of words in a text. We identified a set of measurements capable of distinguishing between real texts and their shuffled versions, which we referred to as informative measurements. With further comparative studies involving the same text (the New Testament) in 15 languages and distinct books in English and Portuguese, we could also find metrics that depend on the language (syntax) to a larger extent than on the story being told (semantics). Therefore, these measurements might be employed in language-dependent applications. Significantly, the analysis was based entirely on statistical properties of words and did not require any knowledge of the meaning of the words or even of the alphabet in which the texts were encoded.
The use of the framework was exemplified with the analysis of the Voynich Manuscript, with the final conclusion that it differs from a random sequence of words and is compatible with natural languages. Even though our approach is not aimed at deciphering the VMS, it was capable of providing keywords that could be helpful for decipherers in the future.

FIG. 1: Illustration of the procedures performed to obtain a measurement X of each book.

FIG. 2: Illustration of the 13 motifs comprising three nodes used to analyze the structure of text networks.

FIG. 4: Distribution of measurements for the New Testament compared with the measurement obtained for the VMS (dotted line). The measurements are (a) X = I* (intermittency of the most frequent words); (b) X = r (assortativity); and (c) X = L (average shortest path length). While in (a) the VMS is not compatible with natural languages, in (b) and (c) the compatibility was verified, since c(X_VMS, P) > 0.05.

FIG. 5: List of keywords (marked with *) found for the New Testament in (a) Portuguese; (b) English; and (c) German. The intermittency term refers to (I_i − 1) and the frequency term to √(log N_i) in Eq. (7). Note that keywords are characterized by high intermittency and frequency terms.

TABLE III: Values of X for the Voynich Manuscript considering only the informative measurements (i.e., the measurements satisfying ζ_1). Apart from C*, all measurements point to the VMS being different from shuffled texts.

TABLE IV: Compatibility of the VMS with natural languages.

TABLE V: Keywords of the New Testament (English, Portuguese and German) and of the VMS using Eq. (7).