Graphemic-phonetic diachronic linguistic invariance of the frequency and of the Index of Coincidence as cryptanalytic tools

Languages have inherent characteristics that make them distinct entities within their phyla and families. Messages written in any language and later encrypted by cryptographic systems do not lose all of these characteristics: some aspects remain that help the cryptanalyst to recover the messages without knowing the decryption keys. To characterize the languages we will consider the frequencies of their graphemic and phonetic units and the Index of Coincidence, tools of fundamental utility in the field of Cryptography. We will analyze their diachronic invariance, that is, their survival over time within one language, and their ability to discriminate between languages. To do so, we will examine a total of 101 languages, from which 261 texts have been taken. The texts are very diverse in style and date, covering a wide linguistic and temporal spectrum from the 6th century BC to the present day.


Introduction
Cryptography is the applied science that designs and implements information protection systems by transforming original messages in readable language into encrypted messages that are impossible to decipher without the decryption key, although they remain susceptible to cryptanalytic attacks.
Natural languages differ from each other in their lexicon, syntax, morphology and phonology, among other aspects. The encrypted texts that the cryptanalyst must deal with have undergone the alterations introduced by the encryption, which hide the information of the original message. However, as far as Cryptography is concerned, languages also differ in their individual units, their graphemes or unigrams, as well as in their bigrams, trigrams, etc., and thus can be distinguished from each other [1]. This is, in short, something that Jaakub ibn Ishak al-Kindi (800-870) had already discovered and that was independently rediscovered by Leon Battista Alberti (1404-1472) in 1466, as published in 'De Cifris' ('On Encryption') [2] [3] [4] [5] [6].
Cryptography divides its methods into transposition and substitution systems. Transposition takes the original message and changes the order of its units or minimum elements, be they characters or other symbols in which the message has been encoded. Thus, if the characters are converted to numbers in some numerical base, for example base 2 or binary (bits), whether through ASCII, Extended ASCII, BCD, Unicode, etc., transposition amounts to rearranging the bits (0 and 1). It is, therefore, an algebraic permutation. Substitution, on the other hand, takes the minimum units of the source text, or sets of minimum units, and replaces them with other symbols or elements.
Even today's advanced cryptographic systems (the symmetric cryptography of DES or AES, and the asymmetric cryptography of RSA or Elliptic Curves, among others) are still based on transposition and substitution, whether using S-boxes, permutation functions, shifts, rotations or modular operations in an algebraic group. That is why transposition and substitution are linked to the concepts of confusion and diffusion, which Claude E. Shannon (1916-2001) defined as the basis of Cryptography [7]. Classic systems, based on the graphemic units of languages, have not been used in their traditional form since the second half of the 20th century, owing to the introduction of binary digital systems and the greater numerical processing capacity of modern computers. Nevertheless, we cannot underestimate them, because some of them remain unsolved. These include systems using multiple substitution or homophony [8]; some nomenclators, which mix substitutions, the insertion of nulls, homophony and substitution dictionaries for common names, such as Manchester in 1783 [9] or Van Gelder in 1809 [10]; and one of the four cryptograms housed in the 'Kryptos' sculpture, created by Herbert James Sanborn, Jr. (1945-) in 1990 and placed at the entrance of the headquarters of the Central Intelligence Agency (CIA) in Langley (Virginia). The sculpture is based on classic ciphers (one of transposition and two of polyalphabetic substitution, all three solved), and its fourth cryptogram is still pending decipherment [11]. As we say, classic methods are the basis of all current systems [12].
That said, the first step in solving a cryptosystem is to identify the original language of the clear or readable text that was encrypted, so basic tools are needed to infer the language even from the text manipulated by the cipher system. The two fundamental ones are the Frequency of the minimum units and the Index of Coincidence.
We begin by studying the frequency of the basic linguistic elements. On the one hand, these are the graphemic units or characters of a language, the letters, taken from languages written in romanized script or from transcriptions. On the other hand, they are the phonetic units, a new aspect in our study which has not previously been considered, although it is useful and complementary, as we will see. Such an analysis would have been necessary, for instance, to decipher the messages sent during World War II by the United States Army with the help of Navajo Indians, who transmitted military texts in their own language [4].
The study of frequency is fundamental for the cryptographic methods of transposition (groups, series, column/row, grid, etc.), since altering the order of the elements does not modify how often each unit appears in the ciphertext. This is the case even in methods as complex as the grid of Girolamo Cardano (1501-1576). In substitution systems, on the other hand, the frequencies keep their numerical values and change only their assignments, which pass on to other symbols. This holds at least for monoalphabetic substitution systems, where a source element is always replaced throughout the encryption process by the same target element. In polyalphabetic systems the frequency differences are reduced and the values tend to homogenize and disperse.
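As a minimal sketch of this invariance, the following Python snippet checks that a transposition leaves every letter count untouched and that a monoalphabetic substitution only relabels the counts. The sample plaintext, the key "lemon" and the helper name sorted_counts are assumptions chosen for illustration; they are not material from the study.

```python
from collections import Counter
import random
import string


def sorted_counts(text):
    """Return the letter counts in decreasing order, ignoring which letter owns each count."""
    return sorted(Counter(text).values(), reverse=True)


# Hypothetical sample plaintext (lowercase letters only), used just for illustration.
plaintext = "cryptographyistheappliedsciencethatprotectsinformation"
alphabet = string.ascii_lowercase
rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Transposition: the same letters in a different order.
letters = list(plaintext)
rng.shuffle(letters)
transposed = "".join(letters)

# Monoalphabetic substitution: one fixed permutation of the alphabet.
permuted = list(alphabet)
rng.shuffle(permuted)
substituted = plaintext.translate(str.maketrans(alphabet, "".join(permuted)))

# Polyalphabetic (Vigenere-style) substitution with a hypothetical key.
key = "lemon"
vigenere = "".join(
    alphabet[(alphabet.index(c) + alphabet.index(key[i % len(key)])) % 26]
    for i, c in enumerate(plaintext)
)

# Transposition: per-letter counts are identical.
print(Counter(plaintext) == Counter(transposed))
# Monoalphabetic substitution: same count values, assigned to other symbols.
print(sorted_counts(plaintext) == sorted_counts(substituted))
# Polyalphabetic substitution: compare the largest counts by eye.
print(sorted_counts(plaintext)[:5], sorted_counts(vigenere)[:5])
```

The last line only prints the largest counts side by side; how strongly a polyalphabetic cipher flattens them depends on the text and the key, so no particular output is claimed here.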
The method of comparison or distance [13] between the frequency values of each gram (grapheme or phoneme), or multigram, will be of the Euclidean distance type (L2-norm), although, since each unit is estimated in one dimension, it is actually a Manhattan distance (L1-norm, with a Minkowski index of p = 1):

$$\sum_{i=1}^{n} \left| \mathrm{BaseLanguage} - \mathrm{language}_i \right|$$

This formula must be understood in the sense that each language in our set will be compared with the remaining languages, for the different frequencies of both graphemes and phonemes, as a subtraction in absolute value.
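The distance just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, frequencies are expressed as percentages, only alphabetic characters are counted, and a gram absent from one profile is treated as having frequency zero.

```python
from collections import Counter


def frequency_profile(text):
    """Relative frequency, in percent, of each alphabetic gram in the text."""
    grams = [c for c in text.lower() if c.isalpha()]
    total = len(grams)
    if total == 0:
        return {}
    return {g: 100.0 * k / total for g, k in Counter(grams).items()}


def manhattan_distance(profile_a, profile_b):
    """Sum of absolute frequency differences over the union of grams (L1-norm)."""
    grams = set(profile_a) | set(profile_b)
    return sum(abs(profile_a.get(g, 0.0) - profile_b.get(g, 0.0)) for g in grams)


# Two texts with the same letters in a different order are at distance zero.
print(manhattan_distance(frequency_profile("Roma"), frequency_profile("amor")))
```

Because the profiles are dictionaries keyed by gram, the same function applies to phoneme sequences once the text has been transcribed into a list of phonetic symbols.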
Similarly to frequency, we will study the value of the Index of Coincidence (I.C.), or Kappa value (K) [2] [14], which will help us to determine the language to which a certain text belongs, even if it has been encrypted by transposition or monoalphabetic substitution, the most common classic encryption types [2] [4] [6] [15]. Furthermore, the value of the I.C. will allow us to check whether we are dealing with a polyalphabetic substitution cipher: a low value suggests a polyalphabetic cipher and even allows the length of the key used in the system to be calculated. For a random text over an alphabet of 26 characters the value of the I.C. is a mere 0.0385, far from the value of roughly double that arises when some graphemic units are more common than others, as is usual in any language.
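As a worked check of the random-text value quoted above, and assuming the usual long-text approximation of the I.C. as the sum of the squared gram probabilities (an assumption of this note, with c denoting the alphabet size and p_i the probability of the i-th gram), a uniform alphabet of 26 characters gives:

```latex
\[
\mathrm{I.C.}_{\text{random}} \approx \sum_{i=1}^{c} p_i^{2}
  = c \left( \frac{1}{c} \right)^{2} = \frac{1}{c},
\qquad c = 26 \;\Rightarrow\; \frac{1}{26} \approx 0.0385 .
\]
```

Any departure from uniform probabilities increases the sum of squares, which is why natural-language texts score higher than random ones.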
The value of the Index of Coincidence will be calculated in a general way, without using the number of units of the language in question as a normalizing factor, in accordance with the following expression, where 'n' is the total number of grams in our whole set or sequence and 'f' is the number of occurrences of each gram, with the index running from one to the maximum value 'm' of different grams:

$$\mathrm{I.C.} = \frac{\sum_{i=1}^{m} f_i \,(f_i - 1)}{n\,(n - 1)}$$

The difference with respect to the aforementioned frequency distance is that, for the Index of Coincidence, the frequency of occurrence of pairings of these units is calculated, hence the order of the units in the sequence is irrelevant. Thus, it is to be expected that the value of the I.C. is not as discriminatory among languages and texts, as we will be able to verify, although we will also analyze its diachronic invariance.
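A minimal Python sketch of this calculation follows. It assumes the standard un-normalized definition, the sum of f(f - 1) over the distinct grams divided by n(n - 1), which matches the description of 'n', 'f' and 'm' in the text; the function name and the random sample are hypothetical.

```python
from collections import Counter
import random
import string


def index_of_coincidence(grams):
    """Un-normalized I.C.: sum of f * (f - 1) over distinct grams, divided by n * (n - 1)."""
    counts = Counter(grams)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("at least two grams are needed")
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))


# The order of the grams does not matter, only their counts.
print(index_of_coincidence("abab") == index_of_coincidence("aabb"))

# A long uniformly random text over 26 letters should land close to 1/26.
rng = random.Random(0)
random_text = rng.choices(string.ascii_lowercase, k=20000)
print(round(index_of_coincidence(random_text), 4))
```

The argument can be any sequence of hashable symbols, so the same function works for a string of graphemes or a list of phonemes.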
For the two languages considered in most detail here, Latin and Castilian Spanish, we will analyze the graphemic units (characters) and phonetic units (phonemes) to see their behavior over time. Subsequently we will compare them with the other languages studied, for which we will give the conclusive results, and in less detail we will do the same with about another hundred languages. In that way we will study whether there is a diachronic invariance in the graphemic and phonetic frequency of languages.

Selected languages
From the total of phyla and languages [16] we will consider the Indo-European, Uralic, Altaic and Caucasian phyla, present on the great continent of Eurasia. In addition, we will study the Afro-Asiatic, Nilo-Saharan, Niger-Congo and Khoisan phyla from the African continent and the Arabian peninsula. Alongside them we must also consider the Austronesian Phylum, which extends from Oceania, Southeast Asia, Polynesia and the Pacific Islands to the island of Madagascar, and from which we will take only the Malagasy language as a sample.
We will leave the following aside in this study: the Dravidian languages, from India, Sri Lanka, Nepal and Pakistan; the Eskimo-Aleut, Chukotko-Kamchatkan and Yeniseian languages, from Canada and the U.S.A. as well as the Asian Steppe, Russia and China; the Austroasiatic languages, from the East and Southeast of Asia, China and India; the Hmong-Mien Phylum and the Sino-Tibetan Phylum, from China and Tibet; the Na-Dene Phylum and the Algonquian, Chimakuan, Wakashan, Salishan, Siouan, Caddoan, Keresan, Iroquoian languages, as well as the Hokan Phylum and Muskogean, Penutian, Tanoan, Misumalpan, Yuki, Mayan, Oto-Manguean, Uto-Aztecan, Mixe-Zoquean, Huave, Totonacan, Chipaya-Uru, Gulf, Ge-Pano-Carib, Ecuadorian, Tunican, Chibchan-Paezan, Yok, Yanomaman, Mura or Andean languages, among others, present in all the extension from the north to the south of the American continent; the Papuan and Andamanese languages, from Papua New Guinea and Indonesia; and finally, the Australian languages, from the continent and islands of Oceania.
On the other hand, among languages not yet classified, we will take Basque and Etruscan into account [16].
In Table 1 we detail the number of languages and texts selected and analyzed for each Phylum.
The selected languages [12] are as follows (with the number of texts selected for each one in brackets).
The selection of texts, a total of 261 [12], covers a wide spectrum from the 6th century BC to the 21st century, with the following temporal distribution: 1 text from the 6th century BC; 2 texts from the 3rd-2nd centuries BC; 2 texts from the 2nd century BC-the 1st century AD; 1 text from the 1st-2nd centuries; 1 text from the 2nd century; 1 text from the 2nd-4th centuries; 1 text from the 3rd-4th centuries; 1 text from the 4th century; 1 text from the 4th-5th centuries; 1 text from the 6th century; 1 text from the 7th century; 1 text from the 9th century; 1 text from the 11th century; 1 text from the 11th-12th centuries; 5 texts from the 12th century; 1 text from the 12th-13th centuries; 1 text from the 12th-14th centuries; 19 texts from the 13th century; 5 texts from the 13th-14th centuries; 2 texts from the 13th-15th centuries; 1 text from the 13th-17th centuries; 19 texts from the 14th century; 5 texts from the 14th-15th centuries; 26 texts from the 15th century; 1 text from the 15th-16th centuries; 1 text from the 15th-19th centuries; 28 texts from the 16th century; 10 texts from the 17th century; 2 texts from the 18th century; 1 text from the 18th-21st centuries; 22 texts from the 19th century; 57 texts from the 20th century; 4 texts from the 20th-21st centuries; 35 texts from the 21st century.
Of the 261 texts, 177 are in prose and 84 in verse, in order to cover the linguistic spectrum of the languages more fully.

Graphemes
For each language, the set of its graphemic elements has been considered. All works were taken in their original spelling, regardless of later versions that 'updated' the texts. In many cases the languages use a romanized script. By the 14th century a romanized orthography, or extensions on a Latin basis, was already available for many of the languages that we have considered, as well as for Celtic or Germanic languages, among others, in addition to graphemes of their own that are not of Latin origin. The same applies to languages written in Cyrillic. These differences were collected in the list of graphical symbols of each language in the particular study for each one [12].
In some cases the linguistic material did not have a latinized or romanized base, and we have occasionally used literary works transliterated into a romanized alphabet, sometimes in critical editions by specialists in those languages. In other cases we have made the transliterations ourselves.
Nevertheless, for the Indo-European, the Altaic, and the Caucasian, as well as the Greek Family of the Indo-European Phylum, and for the Semitic Family of the Afro-Asiatic Phylum, we have used their own orthographic signs of their alphabets as well as the romanized transliteration. We have done the same for languages such as Croatian and its Glagolitic and Cyrillic alphabet, or with Abkhaz, Chechen and Ukrainian, together with Cyrillic alphabets. In the case of Georgian we have again considered its own alphabet and its latinized transliteration. In the Armenian Family, for the language that gives it its name, we have counted again on its own alphabet and its transliteration to the romanized alphabet. Finally, for the Afro-Asiatic Phylum we have had to transliterate Coptic and Hebrew from their own alphabet, as well as Arabic, the latter language that will influence the Persian of the Indo-Iranian Subphylum, which we have analyzed both in its own alphabet and in its corresponding romanized transliteration.
In the case of Mongolian we did not have to consider the alphabets of its original works, since the material that was used in critical editions had already been transliterated.
For the set of phyla present on the African continent, the Austronesian Phylum, Niger-Congo, Nilo-Saharan, Khoisan and Afro-Asiatic (without considering Maltese, written in the romanized alphabet in the examples that have been considered; the aforementioned Arabic, Hebrew and Coptic), all their languages have been transliterated during the 19th and 20th centuries, predominantly, often following criteria close to those of the phonemes.
And finally, for Etruscan and Punic, two languages in use two millennia ago, although written with their own alphabets, we have been working with specialized critical works and their already-made romanized transliterations.

Phonemes
In all of the languages that we have selected, we have matched each graphemic unit, or group of units, with its corresponding phonetic unit or units by means of their transcription. Having discussed the differences between alphabets, their formation throughout history and their modifications, we now move within the same domain to the sounds, the phonemes, the minimum units of the phonetic set of each language, a task which is not easy for many languages.
We have followed the characterization of the International Phonetic Association (IPA), which serves as a common framework for all of the languages while homogenizing and normalizing the transcriptions [18].
The phonetic analysis is a novelty in our cryptological study, compared to that of the usual grapheme, for the characterization of a language.
The examination of languages and texts has yielded a total of 206 different graphemic units (Table 2). The number of different phonetic units is about 183 (Table 3).
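The grapheme-to-phoneme correspondence described in this section can be illustrated with a minimal Python sketch. It encodes only a handful of the Latin rules given in the notes to Table 4 (reading the <c> rule as applying before <e>, <i>, <ae> or <oe>), assumes /k/ for <q>, and reuses the grapheme itself as a placeholder symbol for every letter it has no rule for. The function name transcribe_latin is hypothetical, and this is not the authors' transcription procedure.

```python
from collections import Counter


def transcribe_latin(word):
    """Very partial grapheme-to-phoneme mapping for a lowercase Latin word."""
    digraphs = {"ph": "f", "gn": "ɲ", "th": "tʰ"}
    phonemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]            # current grapheme plus the next one
        following = word[i + 1:i + 3]   # up to two graphemes after the current one
        if pair in digraphs:
            phonemes.append(digraphs[pair])
            i += 2
        elif pair == "qu":
            phonemes.extend(["k", "w"])  # assumption: <q> is /k/, <u> after <q> is /w/
            i += 2
        elif word[i] == "c":
            soft = following[:1] in ("e", "i") or following in ("ae", "oe")
            phonemes.append("tʃ" if soft else "k")
            i += 1
        elif word[i] == "h":
            i += 1                       # <h> is mute outside the digraphs above
        else:
            phonemes.append(word[i])     # placeholder: grapheme reused as phoneme symbol
            i += 1
    return phonemes


# Phoneme counts for a few sample words (hypothetical input, not from the corpus).
sample = ["philosophia", "caelum", "agnus", "quod"]
counts = Counter(p for w in sample for p in transcribe_latin(w))
print(counts.most_common(3))
```

Feeding the resulting phoneme lists into a frequency count gives the phonetic analogue of the graphemic frequency profile.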

Latin and Spanish (a detailed analysis)
Among the 101 languages we will take the case of Latin and Spanish or Castilian to highlight their selection of texts in greater detail and study the variations in the graphemic and phonetic values of their Frequency and their Index of Coincidence over time.
While an average of 2.58 texts per language has been taken, for each of these two languages we have gathered 10 texts. The stability of the Frequency and I.C. values found for Latin and Spanish, as will be verified, justified a smaller number of texts for the rest of the languages (usually 3 or 2 texts per language).

Latin
Latin is the language that gives rise to the entire Romance Family, although it is itself situated in the Indo-European Phylum, Italic Subphylum, Latino-Faliscan Family. It is a unique case, as the origin of a whole linguistic family is perfectly known [16] [19]. Its geographical extension corresponded to the entire Roman Empire in the West during the Middle Ages, although limited to political and educated circles, and it was the language used in the courts and chancelleries of all Europe and in the palatine and cathedral universities of the nascent states, as well as the liturgical language of the Catholic Church and the usual language of religion.
For our analysis we will take 10 texts in prose and verse covering a wide range of the literary history of the Latin language, from the 1st century to the 14th [20] [21] [22].
Selection of texts. I-II centuries. Text in verse. We will begin with a work in verse by Decimus Iunius Iuvenalis (ca. 60-ca. 130), in particular his 'Satura I' ('Satire I'), which we can place between the end of the 1st century and the beginning of the 2nd century of our era [23] [24].
IV century. Text in prose. The following text comprises the main pericopes of Saint Luke's Gospel in which the Virgin Mary is the main protagonist, namely: the Presentation of the Lord in the Temple (Lk 2,21-39), the Annunciation (Lk 1,26-38), the Visitation (Lk 1,39-56) and the Birth in Bethlehem (Lk 2,1-20). The chosen text is the version of the 'Vulgate' made by Saint Ieronimus (ca. 340-420) at the end of the 4th century [25] [26] [27]. Although there have been many other versions over the following centuries, even with interpolations and copyist errors, we will make use of a critical version, the so-called 'Nova Vulgata' ('New Vulgate') from the year 1974 by José Mª Valente Bover, S.I.
IV-V centuries. Text in prose. We will take our next text from a classic author who was widely read in the Middle Ages, Saint Aurelius Augustinus Hipponensis (354-430), who also had a profound knowledge of this language [30]. We will consider his work 'Confessionum' ('Confessions'), written between the years 397 and 401, specifically fragments I to X of Book I [31] [32].
VI century. Text in verse. We will take a series of poems by the poet Venantius Fortunatus (ca. 530-ca. 600) [22], such as the hymn dedicated to the Holy Cross, 'Hymnus in Honore Sanctae Crucis' ('Hymn in Honor of the Holy Cross'), another one to the Church of Paris, 'De Ecclesia Parisiaca' ('On the Church of Paris'), and the famous 'Pange, Lingua' ('Sing, my Tongue'), to the Body of Christ, all of them from the second half of the 6th century [33].
VII century. Text in prose. From Saint Isidorus Hispalensis (ca. 556-636) [34] we will select Book IV, which deals with Medicine ('De Medicina'), of his encyclopaedia 'Etymologiarum Libri Viginti Sive Origines' ('Twenty Books on Etymologies or Origins'), written during the last 20 years of his life [35].
XI-XII centuries. Text in verse. From the anonymous chants of the 'Carmina Burana' ('Songs from Beuern'), satirical and amorous goliard ballads, we will take chant number 60, of amorous style, which can be dated to around the 11th and 12th centuries [36] [37].
XII century. Text in verse. We will take the hymn 'Ad Vesperas' ('For Vespers') and the 'Hymnus Gratiarum post Epulas' ('Thanksgiving Hymn After the Banquet') by Petrus Abelardus (1079-1142) [38], a leading intellectual figure of the 12th century, written between the years 1130 and 1135 [22].
XIII century. Text in prose. From a preacher who was very famous throughout Europe in his time, the Franciscan Saint Antonius Patavinus, O.F.M. (1195-1231) [39], we will take a sermon for the feast of Candlemas, 'In Purificatione Beatae Mariae Virginis' ('In the Purification of the Blessed Virgin Mary'), from the 13th century [40].
XIII century. Text in prose. Our next literary work is a fragment of the theological composition most studied and read during the Middle Ages and, therefore, a basis for discussions and arguments and a model for understanding and expressing the Latin language for centuries. This is the 'Summa Theologiae' ('Compendium of Theology') by Saint Thomas Aquinas, O.P. (1225-1274) [41]. We will select a fragment of part I, in particular question II in its proemium and articles I to III (I q. II a. I-III), which we know were written between 1266 and 1268 [42].
XIV century. Text in verse. We will choose our last text from the poet Francesco Petrarca (1304-1374) [43]. It is one of the songs written in Latin in his 'Bucolicum Carmen' ('Bucolic Song Book'), composed between 1346 and 1357, specifically the first eclogue, 'Parthenias' ('Virgin') [44].
Graphemic-phonetic correspondences. After carrying out the linguistic analysis of the 10 selected texts, in both their graphemic and phonological aspects [45], our results regarding their correspondence are represented in Table 4.
Table 4. Latin language. Latin alphabet (lowercase letters) and phonetic transcription (IPA symbols).
LATIN ALPHABETIC CHARACTERS (LOWERCASE)
The consonant <c> followed by <e>, <i>, <ae> or <oe> adopts the /tʃ/ sound. In other cases it is pronounced /k/. The set <gn> takes the sound /ɲ/. In other cases <g> becomes /g/ and <n> becomes /n/. The set <ph> adopts the sound /f/. In other cases <p> is /p/ and <h> is phonetically mute. The block <th> becomes /tʰ/. In other cases <t> becomes /t/. The vowel <u> preceded by <q> becomes /w/. In other cases <u> takes the sound /ʊ/. The consonant <v> preceding <g>, <q> or <s> takes the /w/ sound. In other cases it becomes /v/.

Spanish (Castilian)
Our selection from the vast literary history of Spanish will be, as for Latin, a total of 10 texts, taking works in prose and verse from the 13th to the 20th century for our analysis [47] [48] [49] [50] [51] [52].
Selection of texts. XIII century. Text in verse. We will start with a text from the 13th century selected from the work of Gonzalo de Berceo (ca. 1199-ca. 1255) [47], in particular the first three miracles of 'Milagros de Nuestra Señora' ('Miracles of Our Lady') (ca. 1246-1252): 'La casulla de San Ildefonso' ('Saint Ildefonso's Chasuble'), 'El sacristán fornicario' ('The Fornicator Sacristan') and 'El clérigo y la flor' ('The Clergyman and the Flower'), stanzas 47 to 115 [53] [54].
XIV century. Text in prose. For the 14th century we will take the work 'Libro de los Enxiemplos del Conde Lucanor et de Patronio' ('Book of the Examples of Count Lucanor and Patronio') by the Infante Juan Manuel (1282-1348), which we take as an example of prose written around 1330 [47] [55]. We will select enxiemplo I, in its folios 6 to 8 [56].
XIV century. Text in verse. As a poetic text from the 14th century we will choose the 'Libro del Buen Amor' ('The Book of Good Love') by Juan Ruiz, Archpriest of Hita (died ca. 1351), dated around 1330 or 1343 [47] [57]. We will consider the verses running from 'Del Ave María de Santa María' ('Ave Maria of Santa Maria') through the following four blocks of 'Cantica de loores de Santa María' ('Song of Praise of Santa Maria'), that is, verses 1661 to 1689 [58].
XV century. Text in prose. From Fernando de Rojas (ca. 1476-1541) [47] [61] we will take his work 'Libro de Calixto y Melibea y de la Puta Vieja Celestina' ('Book of Calixto and Melibea and the Old Whore Celestina') (1499), in particular folios 8 to 24, Act I, an example of prose from the 15th century [62].
XVII century. Text in prose. Moving on to the 17th century, we will select 'El ingenioso hidalgo don Quixote de la Mancha' ('The Ingenious Hidalgo Don Quixote of La Mancha') by Miguel de Cervantes Saavedra (1547-1616) [49] [63], published in 1605, chapter I, which comprises folios 1v to 4v [64].
XVIII century. Text in verse. We will go on to the 18th century with Tomás de Iriarte y Nieves Ravelo (1750-1791) [65], from whom we will take his poetic work, more specifically, from the genre of the fable, his 'Fábulas literarias' ('Literary Fables'), from 1782. We will specifically select the first four: 'El elefante y otros animales' ('The Elephant and Other Animals'), 'El gusano de seda y la araña' ('The Silkworm and the Spider'), 'El oso, la mona y el cerdo' ('The Bear, the Monkey and the Pig') and 'La abeja y los zánganos' ('The Bee and the Drones') [66].
Graphemic-phonetic correspondences. Vowels bearing an acute accent have been reduced to their characters without the suprasegmental sign, since it does not imply a change in sound and does not provide differential aspects in our analysis. This occurs in the vowels {<á>, <é>, <í>, <ó>, <ú>} and in the consonant <ý>. We also have the case of {<ï>, <ü>}, where the diaeresis has a diacritical sense, marking either the breaking of a diphthong or a phonetic feature; we will group these together in our phonetic transcription, but the diaeresis will be eliminated in the analysis of the graphemes.
Thus, the set of characters and phonological elements, in their correspondence, from the previous texts, where we notice that <k> and <w> do not appear, is shown in Table 5 [74] [75]:

Persistence in time of a language
We do not intend to deal with the variation of languages over time in this study, as it is a very complex subject with many more aspects to take into account than we can collect and analyze here. Even a single language is rich and complex enough for such a treatment to fall outside the aim of this paper [76]. However, we will make some notes on various aspects of some of the tools used in Cryptology, namely the Frequency and the Index of Coincidence, applied to both graphemes and phonemes.

LATIN ALPHABETIC CHARACTERS (LOWERCASE)
Notes to Table 5. The consonant <c> preceding <e> or <i> takes the sound /θ/. In the remaining cases, it is pronounced /k/. The set <ch> adopts the /tʃ/ sound. The graphemic group <gü> is pronounced /gu/. The <gue> and <gui> blocks are pronounced /ge/ and /gi/, respectively. For the rest <gu> remains as /gw/. The same consonant <g> preceding <e> or <i> remains as /x/. In all other cases the sound is /g/. The set <ll> changes to /ʎ/. On other occasions the consonant <l> becomes /l/. The <ng> group changes to /ŋ/. In all other cases <n> takes the phoneme /n/. The grapheme <q> appears linked to the vowel <u>, where the <qu> set preceding <e> or <i> remains as /k/; before <a> as /ku/; it does not exist before <o>; and finally, in the rest of the cases <qu> turns into /ku/, and there are no other possible situations than those reviewed here for the consonant <q>. The <rr> set and <r> at the beginning of the word take the sound /r/. Between vowels and following <b>, <t> or <p>, the grapheme <r> changes to /ɾ/. In all other cases <r> takes the phoneme /r/.
We can obtain a series of statistics for these 45 values, among them the Standard Deviation σ. It is clear that the most distant ones are the 4th, 6th and 9th texts: the 4th comprises the texts in verse by Venantius Fortunatus, 'Hymnus in Honore Sanctae Crucis', 'De Ecclesia Parisiaca' and 'Pange, Lingua', all from the second half of the 6th century; the 6th, the anonymous songs of the 'Carmina Burana', from the 11th-12th centuries; and the 9th, the fragments of the 'Summa Theologiae' by Thomas Aquinas, O.P., written between 1266 and 1268.
On the other hand, the 5th and 8th texts are the ones with the lowest total distance, corresponding respectively to the text taken from 'De Medicina', part of the 'Etymologiarum Libri Viginti Sive Origines' by Isidorus Hispalensis, written between 616 and 636, and the sermon 'In Purificatione Beatae Mariae Virginis' by Antonius Patavinus, O.F.M., from the 13th century.
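The pairwise comparison behind these rankings can be sketched in Python as follows. Ten texts give C(10, 2) = 45 unordered pairs, which is where the 45 distance values come from. The function names and the three toy profiles are hypothetical and are not data from the study; the total distance of a text is taken here to be the sum of its distances to all the other texts, which is an assumption about how the ranking was computed.

```python
from itertools import combinations


def manhattan(p, q):
    """L1 distance between two frequency profiles given as dicts."""
    grams = set(p) | set(q)
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in grams)


def pairwise_distances(profiles):
    """Map each unordered pair of text labels to its L1 distance."""
    return {
        (a, b): manhattan(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


def total_distance(profiles):
    """For each text, the sum of its distances to every other text."""
    totals = {label: 0.0 for label in profiles}
    for (a, b), d in pairwise_distances(profiles).items():
        totals[a] += d
        totals[b] += d
    return totals


# Hypothetical toy profiles (percent frequencies), not data from the study.
profiles = {
    "text1": {"a": 50.0, "e": 30.0, "i": 20.0},
    "text2": {"a": 40.0, "e": 35.0, "i": 25.0},
    "text3": {"a": 30.0, "e": 30.0, "i": 40.0},
}
totals = total_distance(profiles)
closest = min(totals, key=totals.get)   # text with the lowest total distance
furthest = max(totals, key=totals.get)  # text with the highest total distance
print(closest, furthest)
```

With ten real profiles the same two calls to min and max would single out the texts that sit nearest to and farthest from the rest of the group.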
By comparing the average value of the Latin language with all of the texts, including those in Latin itself (a total of 261 texts; 10 in Latin), we obtain the results shown in Fig 3, which highlights the Latin texts in red versus the rest, in blue. The 10 closest texts are the Latin ones, in the following order: {Latin5, Latin8, Latin10, Latin7, Latin2, Latin1, Latin3, Latin9, Latin6, Latin4}. After these, the nearest text is quite far away, with a value of 24.2, about 9 points farther than the previous value, the 4th Latin text, with a distance of 15.
By observing the I.C. differential of all the texts, we obtain the graph in Fig 4, in which we have again highlighted in red the texts in Latin versus the rest, in blue.
We can see that the I.C. range of the texts in Latin is broad enough to encompass a multitude of other languages and texts, so it does not serve as a fine discriminant as the distances of the frequency of each graphemic unit did, as we have seen before. However, it is a result that helps to filter around 50% of the languages and their texts.
We can find a great similarity in this case again, clearly due to a very high correspondence between graphemic and phonetic units in the same language. By calculating all the Manhattan distances (L1) of the 10 texts paired together we get (ordered from least to greatest): {8.4, 9.8, 11.1, 11.3, 11.5, 11.6, 12. [...]}. As we expected, and due to the high correspondence between letters and phonetic units in Latin, the results are almost identical to the graphemic case, the furthest being the 6th, 4th and 9th texts. On the other hand, as in the previous case, the 5th and 8th texts are the closest ones.
It is clearly appreciable that the I.C. range of the texts in Latin is broad enough to include a large number of other languages and texts, so it is not suitable for making a fine discrimination between texts or languages, as opposed to the distances in Frequency, although it manages to filter around 60% of texts and languages.

Spanish (Castilian)
We can note a great resemblance between the different texts in spite of the passing of time. By calculating the Manhattan distances (L1) for all the pairs of texts, we obtain (ordered from lowest to highest): {10.1, 10.5, 11.0, 11.0, 11.1, 11.4, 11.8, 11.9, 12.4, 12.4, 12.9, 13.0, 13. [...]}. The furthest texts are the 2nd and the 3rd ones, corresponding respectively to the prose work by the Infante Juan Manuel, 'Libro de los Enxiemplos del Conde Lucanor et de Patronio', from around 1330, and the fragments in verse of the 'Libro del Buen Amor' by Juan Ruiz, the Archpriest of Hita, dated around 1330 or 1343.
The closest one of them all is the 6th, which corresponds to 'El ingenioso hidalgo don Quixote de la Mancha', from 1605, by Miguel de Cervantes Saavedra, very closely followed by the 8th and the 9th texts, corresponding respectively to 'Espadas como labios' (1932), by Vicente Aleixandre y Merlo, and to 'La sombra del ciprés es alargada' (1948), by Miguel Delibes Setién.
By comparing the average value of the Spanish language with all of the texts, including those in Spanish itself (a total of 261 texts; 10 in Spanish), we obtain the results shown in Fig 9 [...] case of Latin. Furthermore, as we have said before, we must not forget that several of the texts with greater values correspond to literary examples that are clearly or highly consonantal, which increases the comparative distance.
Index of Coincidence. Analyzing the I.C. for our 10 texts, which does not take into account the order of the elements but only the frequency of the graphemes, we obtain the following data: {0.0759, 0.0765, 0.0737, 0.0762, 0.0767, 0.0751, 0.0750, 0.0735, 0.0755, 0.0743}.
The Mean is μ = 0.0752, the Mean Deviation Dm = 0.0009 and the Standard Deviation σ = 0.0011. These data indicate that the I.C. values, compared to the Frequency distances, lie much closer to the Mean μ.
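As an illustration of why the I.C. depends only on the frequencies and not on the order of the units, the following minimal sketch uses the standard Friedman form, IC = Σ n_i(n_i − 1) / (N(N − 1)). This normalisation is an assumption here, since the section above does not restate the formula, and the `graphemes` tokenizer and the sample sentence are hypothetical.

```python
from collections import Counter


def index_of_coincidence(units):
    """Probability that two units drawn without replacement are equal.

    Assumed (standard) form: sum(n_i * (n_i - 1)) / (N * (N - 1)).
    """
    counts = Counter(units)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("need at least two units")
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def graphemes(text):
    """Hypothetical tokenizer: alphabetic characters only, lowercased."""
    return [ch for ch in text.lower() if ch.isalpha()]


# Hypothetical sample; any text can be used here.
sample = "en un lugar de la mancha de cuyo nombre no quiero acordarme"
units = graphemes(sample)

ic_original = index_of_coincidence(units)
# Reordering the units leaves the counts, and hence the I.C., unchanged.
ic_reordered = index_of_coincidence(sorted(units))
```

Because only the counts enter the formula, any transposition of the text, and any monoalphabetic substitution, yields the same value.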
Fig 10 shows the difference between the Index of Coincidence of each text and the mean value, highlighting the texts in Spanish in red versus the others, in blue.
The I.C. range of the texts in Castilian or Spanish is broad enough to encompass a large number of other languages and texts, as happened with Latin, and therefore it is not useful as a fine discriminator, unlike the distances between the frequencies of the graphemic units, although it filters out around 70% of the languages and texts.
As we have seen before for Latin, and due to the strong correspondence between graphemes and phonemes, although there is a lesser degree of bijectivity in Castilian, we also have a similar relation of distances between texts in this case. If we calculate the Manhattan distances (L1) for the 10 texts paired together we get the following 45 results (ordered from lowest to highest): {8.5, 8.7, 9.6, 10.4, 10.7, 11.5, 11.9, 11.9, 12. The most distant texts are the 2nd and 3rd ones, just as in the graphemic case. The nearest ones are the 8th, 6th and 9th, very similar to the case of the graphemic units.
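The pairwise comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions: profiles are taken as percentage frequencies over a fixed inventory of units (the magnitudes reported above suggest percentage units, but the text does not state this), and the helper names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequency_profile(units, inventory):
    """Percentage frequency of each unit of `inventory` (assumed scaling)."""
    counts = Counter(u for u in units if u in inventory)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unit of the inventory occurs in the text")
    return [100.0 * counts[u] / total for u in inventory]


def manhattan(p, q):
    """Manhattan (L1) distance between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def pairwise_distances(profiles):
    """All unordered pairs, sorted from lowest to highest distance."""
    pairs = combinations(range(len(profiles)), 2)
    return sorted(manhattan(profiles[i], profiles[j]) for i, j in pairs)


# Hypothetical toy texts over a three-letter inventory.
inventory = "abc"
toy_texts = ["aabbc", "abccc", "aaabc"]
toy_profiles = [frequency_profile(t, inventory) for t in toy_texts]
toy_distances = pairwise_distances(toy_profiles)
```

With n texts there are n(n − 1)/2 unordered pairs, which gives the 45 distances for 10 texts mentioned above.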
If we then compare the average value of Spanish with all the texts (261 texts; 10 in Spanish), the values obtained are shown in Fig 12, again marking the Spanish texts in red versus the others, in blue: [...] Galician-Portuguese3, Galician-Portuguese1, Ladino1, Galician-Portuguese2, Romani4, Tuscan2, Tuscan1, Balearic1, Venetian2, Venetian1}. As we can see, here the 10 best texts are all the Castilian ones. In addition, the order is quite close to that of the graphemic units. We can also observe a certain separation in distance between the Castilian group and the texts that follow, with a gap of about 12 units to the 11th text, something that did not occur in the case of the graphemic units. The set of texts between the 11th and the 20th is largely similar to that found before: Galician-Portuguese, Balearic, Ladino and Venetian coincide, Tuscan and Romani appear here, whereas Catalan and French, which were present before, are absent, due to the differences between graphemes and phonemes. In any case, there is sufficient coherence, with their own dissimilarities, between graphemes and phonemes. [...] find equality in this group with the results that we have obtained for the graphemic case, explained by the differences between graphemes and phonemes.
Index of Coincidence. Analyzing the I.C. for our 10 Spanish texts, which does not take into account the order of the elements but only the frequency of the phonemes, we obtain the following values: {0.0763, 0.0788, 0.0757, 0.0773, 0.0782, 0.0768, 0.0764, 0.0743, 0.0768, 0.0759}.
The Mean is μ = 0.0767, the Mean Deviation Dm = 0.0009 and the Standard Deviation σ = 0.0012. In this case too we can observe that the Index of Coincidence, compared to the Frequency distance, has its values very close to the Mean.
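The summary statistics quoted for the Spanish texts can be checked directly from the listed values. The sketch below assumes that the Mean Deviation is the mean absolute deviation about the mean and that σ is the population standard deviation; both are assumptions, since the text does not define them, and the function name is hypothetical.

```python
from statistics import fmean, pstdev

# I.C. values of the 10 Spanish texts as listed in this section.
ic_graphemes = [0.0759, 0.0765, 0.0737, 0.0762, 0.0767,
                0.0751, 0.0750, 0.0735, 0.0755, 0.0743]
ic_phonemes = [0.0763, 0.0788, 0.0757, 0.0773, 0.0782,
               0.0768, 0.0764, 0.0743, 0.0768, 0.0759]


def summarize(values):
    """Return (mean, mean absolute deviation, population std deviation)."""
    mu = fmean(values)
    mean_dev = fmean([abs(v - mu) for v in values])
    return mu, mean_dev, pstdev(values)


mu_g, dm_g, sd_g = summarize(ic_graphemes)
mu_p, dm_p, sd_p = summarize(ic_phonemes)
```

Under these assumptions the computed values agree with the reported ones to four decimal places. For the phonemes, the sample form of the standard deviation would give roughly 0.0013 instead of 0.0012, which suggests that the population form was used.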
We show the I.C. differential value of all texts in Fig 13. We can clearly see that the I.C. range of the texts in Spanish or Castilian is broad enough to include a multitude of other languages and texts, although it manages to filter out about 70% of the languages and texts, whose values lie far from it. Once again, this is the reason for not considering this variable a fine discriminator when it comes to filtering texts and languages, unlike the Frequency distance.
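One way to read the filtering percentage is as the share of other texts whose I.C. falls outside the band spanned by the language's own texts. The sketch below assumes this min-max criterion, which the text does not state explicitly; the function name and the I.C. values of the other texts are hypothetical.

```python
def filtered_share(own_ic, other_ic):
    """Share of `other_ic` values lying outside [min(own_ic), max(own_ic)]."""
    low, high = min(own_ic), max(own_ic)
    outside = [v for v in other_ic if v < low or v > high]
    return len(outside) / len(other_ic)


# Phonemic I.C. values of the 10 Spanish texts, as listed in this section.
spanish_ic = [0.0763, 0.0788, 0.0757, 0.0773, 0.0782,
              0.0768, 0.0764, 0.0743, 0.0768, 0.0759]
# Hypothetical I.C. values for texts in other languages.
other_ic = [0.0650, 0.0700, 0.0760, 0.0770, 0.0820]

share = filtered_share(spanish_ic, other_ic)
```

Texts inside the band cannot be excluded by the I.C. alone, which is why the Frequency distance is used as the finer discriminator.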

The rest of the languages
After the intra- and inter-linguistic analysis for Latin and Spanish, we will try to corroborate this persistence over time with the remaining languages analyzed. We will do so only for the frequencies, of both graphemic and phonetic units, since the Index of Coincidence is less discriminatory. In these other cases fewer texts have been selected than for Latin and Spanish, since the diachronic strength has already been observed in the texts of both languages. However, given the wide temporal range covered, they will give us an appropriate idea of whether or not the diachronic invariance is maintained in a large number of languages from diverse linguistic phyla.
If we take the average value of each language in its graphemes (and, separately, in its phonemes) and calculate the Frequency distance to each of the texts in the same language, the greatest of these values will be our reference. If that value is compared with the other languages and their texts, texts of other languages fall below this maximum, and could therefore pass as texts of that language, in only fifteen languages, which we consider next; the rest remain perfectly clear and distinguishable from other languages.
Galician-Portuguese. Within the graphemic range of Galician-Portuguese are included the texts Castilian4 and Castilian5, the maximum value of Galician-Portuguese being 14.4. Both languages, Spanish (Castilian) and Galician-Portuguese, belong to the Romance Family, Western Subfamily, Gallo-Iberian-Romance Group and Ibero-Romance Subgroup. It is a low value, explained by the proximity between these languages, all the more so as it is due to a text in verse of the 13th century (ca. 1269-1280), two Cantigas by Alfonso X the Wise (1221-1284), Galician-Portuguese1 [77], versus Castilian4, the text in [...]
French-Provençal. Within the French-Provençal graphemic range are included the Balearic1, French1, French2 and French3 texts, the maximum value of French-Provençal being 21.8. The proximity between these languages (especially with French), all of them of the Gallo-Iberian-Romance Group, explains this overlap, even more so considering the temporal range of the texts. French-Provençal2 is the text that maximizes the distance; it corresponds to a few verses of ´La Piedmontoize en vers Bressan´ (´Piedmontese in Verse of Bressan´), dated 1619, by Bernardin Uchard (1575-1624) [81]. The other texts in French-Provençal are French-Provençal1, a text in prose by Marguerite d'Oingt (ca. 1240-1310), ´Speculum´ (´Mirror´), from around 1294 [82]; and French-Provençal3, by Jean-Baptiste Cerlogne (1826-1910), ´La Bataille di Vatse a Vertosan´ (´The Battle of Cows of Vertosan´), of the year 1858 [83]. We have already spoken about the Balearic1 text. French1 is a set of verses of the poem ´Li tornoiemenz Antecrit´ (´The Antichrist's Tournament´), a work written between 1234 and 1240 by Huon de Méry, an author who can be placed around the first half of the 13th century [84]. French2 is a text in prose by Jehan Froissart (ca. 1337-ca. 1405), from his ´Chroniques´ (´Chronicles´), which cover events from the year 1327 to 1400 [85]. French3 is a text in verse by the poet François Villon (ca. 1431-ca. 1463), from his work ´Grant Testament´ (´The Great Testament´), written in 1461 [86].
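The range test applied to each language in this section (mean profile of the language, maximum distance to its own texts as the reference, and a search for foreign texts that fall below it) can be sketched as follows. This is a minimal sketch under stated assumptions: profiles are lists of percentage frequencies over a shared inventory, the Frequency distance is taken to be the Manhattan distance, and all function names and data are hypothetical.

```python
def manhattan(p, q):
    """Manhattan (L1) distance between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def mean_profile(profiles):
    """Component-wise average of several frequency profiles."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def overlapping_texts(own, others):
    """Return the language's reference distance and the foreign texts below it.

    own    -- dict: text name -> profile, texts of the language under study
    others -- dict: text name -> profile, texts of every other language
    """
    center = mean_profile(list(own.values()))
    reference = max(manhattan(center, p) for p in own.values())
    inside = sorted(name for name, p in others.items()
                    if manhattan(center, p) < reference)
    return reference, inside


# Hypothetical two-unit profiles (percentages) for illustration only.
own_texts = {"LangA1": [60.0, 40.0], "LangA2": [50.0, 50.0]}
other_texts = {"LangB1": [52.0, 48.0], "LangC1": [20.0, 80.0]}

reference, inside = overlapping_texts(own_texts, other_texts)
```

A language whose own texts are very heterogeneous produces a large reference distance and therefore admits many foreign texts, which is the effect discussed for several of the languages below.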
Scottish-Gaelic. Within the range of the graphemic units of Scottish-Gaelic are included the Irish-Gaelic2 and Irish-Gaelic3 texts, the maximum value of Scottish-Gaelic being a high 32.8. Gaelic influence gives rise to this overlap, with such a distance value set by the Scottish-Gaelic2 text, a work in verse of the first half of the 16th century, ´Leabhar Deathan Lios Mòir´ (´The Book of the Dean of Lios Mòir´), written between 1512 and 1542, the selected passage having been compiled by the bard Allan Mac Royre (15th-16th centuries) [87]. On the other hand, Scottish-Gaelic1 corresponds to a document of 1408, a charter of rights, ´Brontanas Mhic Dhomhuill nan Eilein´ (´Letter of Rights of Mhic Dhomhuill, Lord of the Islands´), from Mhic Dhomhuill (14th century-1423) to Bhrian Bhicaire Mhagaodh (14th-15th centuries) [88] [89]. As for Scottish-Gaelic3, it is a selection of Marian texts from St. Luke's Gospel in the 1807 version of the Bible by the Scottish Promoting Society for the Diffusion of Christianity, ´Am Bìoball Gàidhlig´ (´Scottish Gaelic Bible´) [90]. Irish-Gaelic2 corresponds to a text in verse by the bard Gofraidh Fionn Ó Dálaigh (1320-1387), the poem ´A chros thall ar an dtulaigh´ (´Oh, Cross, up there on the hill´), from the end of the 14th century, although we are not sure of the exact date [91]. The Irish-Gaelic3 text is in verse, by the bard Tadhg Óg Ó hUiginn (died 1448), the poem ´Cia do-ghéabhainn Go Gráinne´ (´Whom could I send Gráinne?´), from not much before the year 1440 [92].
Icelandic. Within the graphemic range of Icelandic falls the Norwegian2 text, the maximum value of Icelandic being a high 29.9. The text that maximizes the distance from the rest of the selection is Icelandic2, a work in verse of Eddic poetry, the ´Völuspá´ (´Prophecy of the Sibyl´), of uncertain date and anonymous authorship, which we take from the ´Codex Regius. GKS 2365 4to´ (ca. 1270) [93], a value that places within its range Norwegian2, a text in prose from the ´Saga of Þiðrek´, ´Þiðrekssaga´, a 13th-century work of unknown authorship about Dietrich von Bern, the name used in the northern half of Europe for the king of the Ostrogoths known to the Romans as Theodericus Magnus (454-526) [94]. The other two Icelandic texts considered are Icelandic1, the ´Ynglinga Saga´ (´Saga of the Ynglinga´), from the ´Snorra Edda´ (´Snorri's Eddas´), circa 1225 to 1250, by Snorri Sturluson (ca. 1178-1241) [95]; and Icelandic3, a set of texts from St. Luke's Gospel in the ´Nýja Testamenti´ (´New Testament´) of the year 1540, translated by Oddur Gottskálksson (ca. 1515-1556) [96].
The overlap found for the graphemes also appears in the analysis of distances for the phonemes, where the text that remains within the range of Icelandic is again Norwegian2, the maximum value of Icelandic again being a high 25.5. Note that both Icelandic and Norwegian are languages of the Germanic Family, Northern Subfamily and Scandinavian Group, and that during the 12th and 13th centuries the languages that we differentiate today as Icelandic, Danish, Swedish and Norwegian had not yet separated, which would explain this overlap.
Norwegian. If we now consider the Norwegian language, its graphemic range includes the texts Icelandic1, Icelandic3, Danish2 and Danish3, the maximum value of Norwegian being a high 34.6. The text in this language with such a high differential is Norwegian3, a text in prose by Absalon Pedersson Beyer (1528-1575) from his work ´Oration om M: Geble´ (´Prayer for Master Geble´), dated 1571 [97]. The other texts analyzed in Norwegian are Norwegian1, a text in prose from about 1250, the ´Konungs Skuggsjá´ (´Royal Mirror´), by an anonymous author [98]; and Norwegian2, just mentioned in the preceding study. The Icelandic texts that fall within its range are the aforementioned Icelandic1 and Icelandic3. The Danish texts are Danish2, a text in verse by Hans Christensen Sthen (1544-1610), from his hymn work ´En Liden Haandbog´ (´Little Manual´), from 1578, his ´Dialogus eller Samtale imellem it Menniske oc Døden´ (´Dialogue between Man and Death´) [99]; and Danish3, a text in verse by Anders Sørensen Vedels (1542-1616), the ballad ´Den anden Vise om Herr Marsk Stig´ (´The Second Ballad of Mr. Marsk Stig´), summarized in 1591 [100].
The analysis of the phonemes shows that only the Icelandic1 text falls within the Norwegian band, defining the membership better than in the graphemic case, the maximum value of Norwegian still being a high 31.6. As mentioned above, these languages, not only Icelandic and Norwegian but also Danish, belong to the same Group, Scandinavian, and until relatively few centuries ago they shared a very similar, mixed and united history, up to Norway's independence from Denmark in 1814, which was followed by a renaissance of Norwegian as its own distinct language.
German. The only text that falls within the German graphemic range is LowGerman1, the maximum value of German being 19.9, not too high, which is explained by the fact that both languages, which belong to the Germanic Family, Western Subfamily and Continental Group, present similarities, even more so at times when both languages were being defined. The text marking the maximum distance is German2, a text by Oswald von Wolkenstein (ca. 1376-1445), the lyric piece ´Ain burger und ain hofman´ (´The Bourgeois and the Courtier´), from around 1425 [101]. The other texts in German are German1, which corresponds to ´Alexius´ (´Alexius´), which we can place in the 13th century, by Wirzeburc ich Kuonrât (ca. 1225-1287) [102]; and German3, a series of fragments from St. Luke's Gospel in the translation of the Bible by Martin Luther (1483-1546) of the year 1545, ´Biblia Das ist, Die gantze heilige Schrifft, Deudsch´ (´The Bible, that is, the Complete Sacred Writings in German´) [103]. For its part, LowGerman1 is a text in verse, ´Sachsenspiegel´ (´The Mirror of Saxony´), from around 1220-1232, a legal book by Eike von Repgow (ca. 1180-ca. 1232) [104].
Low German. The texts Dutch1 and Dutch2 are found within the range of the graphemic units of Low German, the maximum value of Low German being 19.5. LowGerman1 is the text that gives this maximum distance. Both Low German and Dutch are languages of the Germanic Family, Western Subfamily, Continental Group and Lower Subgroup, hence their kinship, especially in the centuries in which they took shape. The other works in Low German are a fragment in prose of ´Der grosse Seelentrost´ (´Great Relief for the Soul´), from about the middle of the 14th century, by an unknown author, which we refer to as LowGerman2 [105]; and LowGerman3, some fragments in verse of ´Reynke de Vos´ (´The Fox Reynke´), from 1498, a translation into Low German made in that year of the Dutch work of 1487 by Hinrek van Alkmaar (15th century) [106]. The Dutch texts are Dutch1, a text in verse of unclear authorship which we can date to around 1374, ´Beatrijs´ (´Beatrijs´) [107]; and Dutch2, another text in verse by the poet Anthonis de Roovere (ca. 1430-1482), a fragment of ´Lof vanden heyjligen Sacramente´ (´Praised be the Holy Sacrament´), dated around 1456 [108].
Albanian. Within the graphemic range of this language are the texts French1, German3, Dutch2, English1, English3, Frisian1, Frisian2 and Frisian3, the maximum value of Albanian being a very high 44.1, an overlap that does not occur in the phonetic units, which resolves this apparent resemblance in graphemes. The Albanian texts selected differ so much from each other that they distort the values, hence the overlap with languages of families outside the Albanian Family, such as the Romance Family or the Germanic Family. This is because we have taken one of the oldest known texts, which we call Albanian1, a small rudimentary glossary of 46 terms, dated 1496, with some names, numbers and phrases in Albanian written down by a German merchant, Arnold von Harfit (1471-1505), on a trip to Albania [109] [110]; Albanian2 is a fragment of a ´Meshari´ (´Missal´), by Gjon Buzukut (16th century), dated 1555 [111]; and Albanian3, a mixture of prose and verse, is a series of fragments of ´E mbsuame e krështerë´ or ´Christian Doctrine´, of the year 1592, a Catechism by the Orthodox priest Lekë Matrënga (1567-1619) [112] based on the doctrine book by the Jesuit Diego de Ledesma, S.I., ´Docttrina Christiana´ (1572).
Romani. Within the range of the graphemic units of Romani we find the texts Tuscan1, Tuscan2, Lombardo1, Venetian2, Napolitan1, Ladino1, Galician-Portuguese1, Castilian1, Castilian3, Castilian5, Castilian6, Castilian7, Castilian8, Castilian9, Castilian10, Croatian2, Swedish1 and Wolof1, the maximum value of Romani being a very high 39.1. These results are due to the discordance between the Romani average and the Romani1 text, a vocabulary taken from a list, the ´Winchester Confessions´, a confession made by a certain Walter Hindes (16th-17th centuries), a Londoner who learned the language through contact with Romani gipsy groups in London, when he was held in Winchester prison in 1616 [113]. This even leads languages of the Indo-European Phylum to overlap with the Niger-Congo Phylum (Wolof), something absurd. The analysis of the phonetic units concentrates the data better, leaving only the texts Tuscan1, Tuscan2 and Ladino1 within the range of Romani, the maximum value of Romani being a still very high 37.2.
Romani has had a geographically very dispersed literary history for centuries, and still has today, in diverse countries with their own different languages, from the United Kingdom or Finland to Romania, Yugoslavia or Nepal, passing through Spain or France; it is highly dialectal and full of loanwords adapted to its own linguistics, while at the same time strongly traditional and oral. Romani has undergone a multitude of variations, changes and adaptations from its linguistic origins through the dissemination and nomadism of these peoples throughout Europe and part of Asia.
In this case we have a text in written British Romani, and therefore linked to the Northern European branch (Polish, Russian, Scandinavian, Baltic, Germanic and Gallic areas), with all the peculiarities that it may have versus other branches of the same language, such as the branch of Eastern and Slavic Europe (Czech, Slovak, Hungarian, Austrian and Balkan areas). However, due to its antiquity, we have taken it as the first text, with the particularity that it is analyzed twice: on the one hand, as it was written at that time in the confession, and on the other, as the second text, Romani2, this time in the spelling of Welsh Romani as it was articulated grammatically in the 20th century [113]. In this case, Romani2, the values for graphemes/phonemes are a much better 34.5/31.2, respectively.
On the other hand, Romani3 is a tale from Romanian areas of Hungary, ´O Rom taj o Beng´ (´The Romani and the Devil´), compiled in 1984 from its oral tradition [114]; Romani4 consists of the Marian pericopes of St. Luke, already analyzed in other languages, from the New Testament in the 1984 translation of the Jesus Army Church, ´E Lashi Viasta´ (´New Testament´) [115].
Awdjila. In this language the texts that fall within its graphemic range are Tashelhiyt1, Tashelhiyt2, Djerba3, Ghadames2 and Tagargrent3, the maximum value of Awdjila being a very high 49.2. This very high value, which means that other languages of the Berber Family remain within its range of frequency distances, is due to the use of a first text, Awdjila1, the oldest written compilation, made between 1932 and 1933 by Fernando Zanon (died 20th century), of only six verses, which causes this discordance; furthermore, the other two texts are transliterations by the same hand [116]. Thus, Awdjila2 and Awdjila3 are short stories, ´Abú-dabăr u ámz̩a´ (´Abú-dabăr and the Ogre´) and ´Sîdi H̩ámed ĕz-Zarrûq´ (´Mr. H̩ámed ĕz-Zarrûq´), both collected by Umberto Paradisi (1878-1933) between 1959 and 1960 [117]. In the case of phonemes there is also overlap, since within the band of Awdjila fall the texts Tashelhiyt1, Tashelhiyt2, Ghadames2, Tagargrent1, Tagargrent2, Tagargrent3, Figuig1, Tarifiyt2, Tarifiyt3 and Chaouia1, the maximum value of Awdjila being a very high 48.7, although they are texts in the nearby Berber languages of the Eastern and Northern Subfamily.
Tagargrent. A great many texts, a total of 53 and even from several phyla, fall within the graphemic range of the frequency distance of Tagargrent, because of the high maximum distance between the texts used, a value of 50.6, so this information is not relevant. The value is explained by Tagargrent3, a text in prose, a report on life at home taken from inhabitants of the area of Ouargla (Algeria), collected by Jean Delheure (1911-2001) and Maurice Jardon (died 1956) between 1941 and 1961 [118]. Its transliteration is very different in graphemes (and phonemes) from that of the other two texts selected from this language. Thus, Tagargrent1 is a text in verse that includes several songs collected between 1885 and 1887 by René Basset (1855-1924) [119]; and Tagargrent2 is a tale collected in 1906, ´Tanfoust n baba Brahim´ (´Tale of father Brahim´), by Samuel Biarnay (1879-1918) [120]. There is also overlap in the case of phonemes, although much lower than in the graphemic case, because within the Tagargrent range fall the texts Awdjila2, Tarifiyt1 and Tarifiyt3, this time only Berber languages, the maximum value of the phonetic Tagargrent being a high 40.3.

Tarifiyt.
In the Tarifiyt language, although only in the analysis of the phonemes, since characteristics with considerable confidence that, as well as of the Index of Coincidence, of the graphemic and phonetic units of the languages, characteristics that can be used to identify them from the rest.
In a few languages we have found overlaps with texts from other languages. The main explanations for the differences observed in our analysis are the use of literary works very disparate in time, some even written while the language was taking shape and was still very close to other languages; the different transliterations of texts in highly oral languages; and the small number of units in the selected material. These divergences would be resolved with a more meticulous examination of each particular case, but the results nevertheless reaffirm the thesis of the diachronic invariance of languages in their minimal graphemic and phonetic units, and their use as differentiating patterns (better in Frequency than in I.C.) in contrast to the rest.

Conclusions
Throughout history, cryptanalysts facing a ciphertext usually did so during periods of conflict, so the languages used by the contenders, and thus the base language of the original text, were known. Moreover, working with a particular language meant studying its synchronic properties, setting aside a temporal, diachronic analysis, which was of little usefulness in those circumstances.
The underlying open question was whether the two main tools of classical Cryptography, the analysis of the Frequency and the Index of Coincidence of the graphemes, persist over time as linguistic invariants and can be used as distinguishing elements in contrast to the rest of languages.
To answer it we have analyzed a wide range of languages, a total of 101, from the geographical area of Western Eurasia and Africa, which includes the Indo-European, Uralic, Altaic, Caucasian, Afro-Asiatic, Nilo-Saharan, Niger-Congo, Khoisan and Austronesian phyla, as well as two unclassified languages, with a total of 261 texts dating from the 6th century BC to the 21st century, examining the frequency characteristics and the Index of Coincidence of their graphemic units, which we have extended to the study of phonemes, a novel aspect [12].
Thus, beyond the already available cryptological knowledge of a couple of dozen European languages, we present results and conclusions of considerable linguistic scope, which we wish to extend in the future to the Asian phyla and families not yet studied and to those of the American continent and Oceania.
Cryptographers assumed that texts written in different languages, despite being encrypted, maintained some of their own patterns, such as the Index of Coincidence and even the frequency of the elements, the letters, at least in about a dozen languages, mainly of European origin, from the Indo-European Phylum. This assumption was due not so much to a systematic and rigorous study as to experience in the deciphering of texts throughout history.
In this paper we have exhaustively analyzed, for a wide range of languages, around a hundred, from nine phyla, whether the frequency of letters and the Index of Coincidence are invariant for each language and distinguish it from others, even nearby languages.
On the other hand, whether the passage of time substantially modifies these invariants had never been analyzed, since it was not necessary in the historical contexts of war and espionage typical of cryptography and cryptanalysis; we take this diachronic study into account here.
In this study we make clear that languages are very robust over time and that the Frequency and the I.C. of the letters remain stable despite the passage of centuries.
Another novelty of our study, never previously considered, is to apply to phonemes the same analysis carried out on letters or graphemes, reaching similar conclusions.
Sometimes we have encrypted military or civil documents, texts on different themes, colophons, etc., of different lengths and made several centuries ago, where it is not easy to guess the source language of the cipher, and it becomes necessary to apply the conclusions obtained here. One case where these conclusions have been applied is the study of a text inscribed in a wood carving from around the 14th and 15th centuries, with a total of almost two hundred letters, which turned out to be a text of proto-Berber origin [130].
The results show that there has been a high diachronic continuity and stability of the language over the centuries. The Index of Coincidence, by its own mathematical definition, is not a very fine discriminating tool, although it is highly stable and closely centered on its average value. Regarding the Frequency, the combined, dual analysis of graphemes and phonemes helps to bring the characteristics of the language together, strengthening its own distinctive aspects and discriminating one language from others. Languages have a very strong invariant for their differentiation in the distance of the frequencies of their minimal units (graphemes/phonemes).