Comparing phonological and orthographic networks: A multiplex analysis

The complexity of natural language can be explored by means of multiplex analyses at different scales, from single words to groups of words or sentences. Here, we plan to investigate a multiplex word-level network, which comprises an orthographic and a phonological network defined in terms of distance similarity. We systematically compare basic structural network properties to determine similarities and differences between the two layers, as well as their combination in a multiplex configuration. As a natural extension of our work, we plan to evaluate the preservation of the structural network properties and information-based quantities from the following perspectives: (i) presence of similarities across 12 natural languages from 4 linguistic families (Romance, Germanic, Slavic and Uralic), (ii) an increase in corpus size from $10^4$ to $50 \times 10^3$ words, and (iii) robustness of the networks. Our preliminary findings reinforce the idea of common organizational properties among natural languages. Once concluded, this study will contribute to the characterization of similarities and differences between the orthographic and phonological perspectives of language networks at the word level.


Introduction
Many studies of the complexity of natural language have pointed out that language is the manifestation of different levels of complex organization [1][2][3][4], ranging from semantics [5] to syntax [6,7] or even emotional components [8]. Of particular interest are the applications of network science to language organization, where these levels of complexity may be explored by means of single-layer [9,10] and multilayer graphs [11,12]. A number of studies have reported emergent organizational properties in language based on semantic associations, orthographic similarities [13] and phonetics [14,15]. In many of these networks, the connectivity (the number of neighbors of a given node) is found to follow a distribution with a tail, which can be short or long. For instance, Arbesman et al. [16] reported that, for phonological networks, the degree distribution is well described by a truncated power law in several languages. For orthographic networks, Trautwein et al. [13] reported that the distribution of connectivities for the mental lexicon of elementary-level students has a power-law tail and that the network exhibits the small-world property.
Despite the variety of network-based characterizations of language, only a limited number of studies have incorporated its multilayer aspects. Here, we consider a bi-layer approach to the analysis of orthographic and phonological language networks. Our procedure maps words onto a two-layer network in which nodes are words and two words are connected when an appropriate distance similarity between them falls within a threshold. In general, the distance similarity between two strings, A and B, can be defined as the minimum number of edit operations needed to transform A into B. In our study we will use the Damerau-Levenshtein (DL) distance as a proxy for the similarity between two words. It is recognized that, for many natural languages, there is no biunivocal (one-to-one) correspondence between how a word is spelled and how it is pronounced; that is, there is no biunivocal correspondence between graphemes and phonemes. This mismatch appears in situations such as homography (a letter corresponds to two phonemes), digraphy (two letters correspond to one phoneme, or vice versa) and heterography (one phoneme corresponds to two or more letters), among others. When comparing orthographic and phonological networks, an important question is whether the local and global connectivity patterns exhibit similarities, and what kinds of differences can be identified, particularly in the context of psycholinguistic studies. Such studies suggest that the mechanisms acting on cognitive processes such as word recognition and retrieval differ markedly from the orthographic organization.
This is a Registered Report and may have an associated publication; please check the article page on the journal site for any related articles.

Proposed hypothesis and research plan
Our study is based on the premise that the network representations of the orthographic and phonological layers capture the most representative features of each. Our study focuses on the following three research questions:
• What are the characteristics of multiplex orthographic-phonological language networks?
• Do the connectivity patterns of orthographic and phonological networks reveal similarities and differences between them?
• How does orthographic structure vary in relation to phonological patterns across several natural languages?
There is sufficient evidence that phonological and grammatical networks exhibit both common properties and differences. We shall evaluate properties both locally and globally to show the differences between the layers, and quantify them in the bi-layer (multiplex) network. To strengthen our study, we initially intend to carry out the analysis in four natural languages (Spanish, English, German and Russian) with a corpus of $10^4$ words each. The plan for a secondary stage contemplates two extensions: (i) increasing the corpus size from $10^4$ to $50 \times 10^3$ words and (ii) expanding the analysis to 12 languages belonging to 4 different linguistic families (Germanic, Romance, Slavic and Uralic).

Methods
The study of complex networks has been extended to systems for which multiplex modelling is more suitable. In a multiplex network, the same set of nodes appears in every layer, and each layer has its own set of connections among them. A number of real-world and simulated multilayer networks have been studied in contexts such as finance and economics [17][18][19], social systems [20,21], synchronization [22] and linguistics [12].
In this study, we plan to analyze the multiplex language network that consists of an orthographic network and a phonological network (see Fig 1 for a schematic representation). For the orthographic network, we construct a word-level network where nodes are words and a link between two nodes is defined if the DL distance, described later, is smaller than or equal to a threshold value ℓ. Similarly, a phonological network $G^{[P]} = (V^{[P]}, E^{[P]})$ is constructed, where the nodes represent words translated into the International Phonetic Alphabet (IPA) and edges are defined if the DL distance is smaller than or equal to a given threshold ℓ. To generate a multiplex language network at the word level, the orthographic and phonological networks are combined to form a two-layer word-level network, denoted by $G^{[\alpha]}$. Here, the adjacency matrix of the multiplex network is given by $a^{[\alpha]}_{ij}$, where $a^{[\alpha]}_{ij} = 1$ indicates that there is a link between node (word) i and node (word) j in layer α. More formally, the adjacency matrix associated with each layer is defined as:

$$a^{[\alpha]}_{ij} = \Theta\left(\ell - d\left(w^{[\alpha]}_i, w^{[\alpha]}_j\right)\right) - \delta_{ij},$$

where Θ(·) represents the Heaviside function (taken here with Θ(0) = 1, so that pairs at distance exactly ℓ are linked), $\delta_{ij}$ is the Kronecker delta (which removes self-links) and $d(w^{[\alpha]}_i, w^{[\alpha]}_j)$ is the DL distance between word i and word j in layer α.

Fig 1. Construction of the multiplex language network. Schematic illustration of the construction of a multiplex language network for English based on orthographic-distance and phonological-distance similarity networks. In the orthographic and phonological layers, nodes are words and there is a link if the Damerau-Levenshtein distance is smaller than or equal to a given threshold ℓ. Notice that words in the phonological layer were translated into the International Phonetic Alphabet before the DL distance was calculated. (PLOS ONE, https://doi.org/10.1371/journal.pone.0245263.g001)
Regarding the distance condition between two words, as mentioned in the Introduction, the distance similarity between two strings A and B can be defined as the minimum number of edit operations needed to transform A into B. These operations are: (1) substitute a character in A with a different character, (2) insert a character into A, (3) delete a character from A, and (4) transpose two adjacent characters of A. The Damerau-Levenshtein (DL) distance is then defined as the length of the optimal (shortest) edit sequence. By comparison, the Levenshtein distance is the length of the shortest sequence of substitutions, insertions and deletions needed to transform string A into string B; that is, it does not count a transposition as a single operation. In our analysis, we use a threshold value ℓ on the DL distance to define a link between two words.
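To make the link rule concrete, the following is a minimal Python sketch of the distance and of the construction of one layer. It is an illustration under stated assumptions, not the authors' implementation: it uses the restricted (optimal string alignment) variant of the DL distance, since the text does not say which variant is used, and the names `dl_distance`, `build_layer` and `forms` are hypothetical.

```python
from itertools import combinations


def dl_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Assumption: the restricted variant is used here for simplicity.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def build_layer(forms, threshold):
    """Build one layer as an adjacency dict {word: set of neighbor words}.

    `forms` maps each word to the string compared in this layer: its
    spelling (orthographic layer) or its IPA transcription (phonological).
    """
    adj = {word: set() for word in forms}
    for u, v in combinations(forms, 2):
        if dl_distance(forms[u], forms[v]) <= threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj
```

The pairwise loop is quadratic in the number of words, so a corpus of $10^4$ words requires about $5 \times 10^7$ distance evaluations per layer and threshold.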

Databases
The corpora were constructed from written texts (books) freely available at Project Gutenberg (www.gutenberg.org). The texts were pre-processed to remove function words, stop words and punctuation marks. The titles of the texts and the resulting corpora are described in https://doi.org/10.6084/m9.figshare.12735380.v4 [23]. The final corpora contain $10^4$ words each, together with their translation into the International Phonetic Alphabet, for four languages (transliterated with the Python library epitran, under Python 3.6.8).
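The pre-processing step can be sketched as follows. This is only an illustration: the source does not state how words were selected from the texts, so keeping the first distinct word types in order of appearance is an assumption, and the function name `build_corpus` and the stop-word set passed to it are hypothetical.

```python
import re


def build_corpus(text, stop_words, max_words):
    """Return up to `max_words` distinct lowercase words from `text`.

    Punctuation, digits and underscores are dropped by the tokenizer;
    words listed in `stop_words` are skipped. Assumption: word types are
    kept in order of first appearance.
    """
    # [^\W\d_]+ matches runs of Unicode letters (Latin, Cyrillic, ...)
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    corpus = {}  # dict preserves insertion order and removes duplicates
    for token in tokens:
        if len(corpus) >= max_words:
            break
        if token in stop_words or token in corpus:
            continue
        corpus[token] = None
    return list(corpus)
```

For example, `build_corpus("The cat, the DOG and the cat!", {"the", "and"}, 10)` keeps only the two content words.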

Topological properties of single-layer and multiplex networks
Our initial analysis focuses on the basic topological characteristics of the two individual networks, and then proceeds to investigate similarities and differences between the two layers. The single-layer measures (for a network with N nodes) in a multiplex network that have been initially evaluated are [24]:
• Density. The density of a layer α, $\rho^{[\alpha]}$, is given as:

$$\rho^{[\alpha]} = \frac{2\, m^{[\alpha]}}{N(N-1)},$$

where $m^{[\alpha]}$ is the number of actual connections within the layer α.
• Degree distribution. The degree $k^{[\alpha]}_i$ of a node i is the number of links outgoing from (or incoming to) that node,

$$k^{[\alpha]}_i = \sum_{j} a^{[\alpha]}_{ij}.$$

The degree distribution for layer α is then defined as the fraction of nodes in the network with degree k,

$$P^{[\alpha]}(k) = \frac{n^{[\alpha]}_k}{N},$$

where $n^{[\alpha]}_k$ is the number of nodes with degree k.
• Clustering coefficient. Measures the degree of transitivity in connectivity among the nearest neighbors of a node i within the layer α. $C^{[\alpha]}_i$ is calculated as [25]:

$$C^{[\alpha]}_i = \frac{2\, E^{[\alpha]}_i}{k^{[\alpha]}_i \left(k^{[\alpha]}_i - 1\right)},$$

where $E^{[\alpha]}_i$ is the number of links between the $k^{[\alpha]}_i$ neighbors of the node i within the layer α.
• Average nearest-neighbor degree. Measures the average degree of the neighbors of a node [25]. The quantity $\bar{k}^{[\alpha]}_{nn,i}$ is calculated as:

$$\bar{k}^{[\alpha]}_{nn,i} = \frac{1}{k^{[\alpha]}_i} \sum_{j} a^{[\alpha]}_{ij}\, k^{[\alpha]}_j.$$

• Modularity. Let $c^{[\alpha]}_i$ be the community associated with the node i within the layer α, where $c^{[\alpha]}_i \in \{1, 2, \ldots, P\}$, with P a natural number. The modularity $Q^{[\alpha]}$ of a given layer α is given by [24]:

$$Q^{[\alpha]} = \frac{1}{2 m^{[\alpha]}} \sum_{i,j} \left( a^{[\alpha]}_{ij} - \frac{k^{[\alpha]}_i k^{[\alpha]}_j}{2 m^{[\alpha]}} \right) \delta\left(c^{[\alpha]}_i, c^{[\alpha]}_j\right),$$

where δ is the Kronecker delta. We use the Louvain algorithm [26] to perform a greedy optimization of the modularity.
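As a rough illustration of the single-layer measures listed above (density, degree distribution, clustering coefficient, average nearest-neighbor degree and modularity), the sketch below computes them for one undirected layer stored as an adjacency dict `{node: set of neighbors}`. The function names and the data layout are assumptions made for this example. The modularity function only evaluates Q for a given partition; it does not perform the Louvain optimization.

```python
from collections import Counter


def density(adj):
    """Fraction of possible undirected links that are present."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) / 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def degree_distribution(adj):
    """Map each degree k to the fraction of nodes having that degree."""
    counts = Counter(len(nb) for nb in adj.values())
    return {k: c / len(adj) for k, c in sorted(counts.items())}


def clustering(adj, i):
    """Local clustering coefficient of node i (0 if degree < 2)."""
    nb = adj[i]
    k = len(nb)
    if k < 2:
        return 0.0
    # each link among neighbors is seen twice, once from each endpoint
    links = sum(1 for u in nb for v in adj[u] if v in nb) / 2
    return 2 * links / (k * (k - 1))


def avg_neighbor_degree(adj, i):
    """Average degree of the neighbors of node i (0 if isolated)."""
    nb = adj[i]
    return sum(len(adj[j]) for j in nb) / len(nb) if nb else 0.0


def modularity(adj, community):
    """Modularity Q of a given partition {node: community label}."""
    two_m = sum(len(nb) for nb in adj.values())
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] == community[j]:
                a_ij = 1 if j in adj[i] else 0
                q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m
```

Averaging `clustering` and `avg_neighbor_degree` over all nodes gives layer-level summaries of the kind usually reported in a table of topological metrics.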
To gain further insight, we plan to characterize structural network properties and information-based quantities from the following perspectives: (i) presence of similarities across 4 linguistic families (Romance, Germanic, Slavic and Uralic), (ii) an increase in corpus size from $10^4$ to $50 \times 10^3$ words, and (iii) robustness of the networks. Regarding (i), we will analyze to what extent the topological properties of the single-layer and multiplex networks exhibit similarities and differences, quantified by means of correlation measures and information-theory-based metrics, for 12 natural languages belonging to 4 linguistic families. To reinforce the characterization of the grouping patterns of the nodes, we will consider multilayer community detection algorithms [27] to determine the presence of clusters across layers. These procedures will help us understand the local and global network properties of the orthographic-phonological variations across several languages. With respect to (ii), we plan to increase the size of the corpus to $50 \times 10^3$ words for all languages in our study. The results for this size will allow us to test the validity of our preliminary results for $10^4$ words, and to evaluate the concordance of our findings with previous results. Concerning (iii), the robustness of the single-layer and multiplex networks will be evaluated by means of two well-recognized strategies: random removal of a fraction of nodes and edges, and directed attacks [28]. Moreover, all the calculations in our study will be repeated on a randomized version of the networks.
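The robustness analysis in (iii) could be sketched as follows for node removal. This is a minimal example under stated assumptions: robustness is tracked through the relative size of the largest connected component, the directed attack removes nodes in decreasing order of their initial degree without recomputing degrees, and edge removal is not covered. All function names are hypothetical, and a layer is assumed to be stored as an adjacency dict `{node: set of neighbors}`.

```python
import random


def giant_component_size(adj, removed=frozenset()):
    """Size of the largest connected component, ignoring `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:  # depth-first traversal of one component
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def removal_curve(adj, order):
    """Relative giant-component size after each successive removal."""
    n = len(adj)
    removed = set()
    curve = [giant_component_size(adj) / n]
    for node in order:
        removed.add(node)
        curve.append(giant_component_size(adj, removed) / n)
    return curve


def random_failure_order(adj, fraction, seed=0):
    """Random selection of a fraction of the nodes (reproducible)."""
    nodes = list(adj)
    random.Random(seed).shuffle(nodes)
    return nodes[: int(fraction * len(nodes))]


def targeted_attack_order(adj, fraction):
    """Highest-degree nodes first (degrees taken from the intact network)."""
    nodes = sorted(adj, key=lambda u: len(adj[u]), reverse=True)
    return nodes[: int(fraction * len(nodes))]
```

Comparing `removal_curve(adj, random_failure_order(adj, f))` with `removal_curve(adj, targeted_attack_order(adj, f))` shows how much faster the giant component shrinks under attack than under random failure.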

Initial analyses for 4 natural languages and $10^4$ words
We have started our analyses with 4 languages (Spanish, English, German and Russian), using corpora of $10^4$ words each. Table 1 summarizes the results for the basic structural properties of the orthographic network, the phonological network and the multiplex network. These preliminary results indicate that there are common topological properties at local and global scales. Interestingly, the average clustering for Spanish in the phonological layer with ℓ = 2 is concordant with the value reported for phonological networks [16], where the authors used a different corpus and assumed an edge between words if they differ by a single phoneme or sound segment.

Table 1 notes. Topological metrics of the orthographic network and the phonological network. Here we present the average values of the degree ($k_i$), clustering ($c_i$) and nearest-neighbor degree ($k_{nn,i}$). We observe that the density, $\bar{k}$, $\bar{c}$ and $\bar{k}_{nn}$ increase with ℓ for the four languages and the two layers, with some similarities, as occurs for $\bar{c}$ in both layers at distances ℓ = 2, 3. The modularity and the average cluster size exhibit opposite trends: while the modularity decreases as ℓ increases, the average cluster size increases because a larger number of nodes tends to be connected to a giant component.

To better understand the connectivity patterns in both layers, we construct the degree distribution for threshold values of the DL distance ranging from 1 to 3. Fig 2 shows the degree distributions of $G^{[O]}$ and $G^{[P]}$ for Spanish, German, English and Russian and DL distances from 1 to 3. It is visually apparent that, for the 4 languages, as the DL distance increases, the distributions change from an approximately exponential regime (ℓ = 1) to a combination of exponential and power-law behavior (ℓ ≥ 2). It is likely that the best fit would be obtained with a truncated power-law function, which has been suggested as a fit for phonological networks [16]. In our initial estimation of the best fit of the distributions, we only consider the power-law behavior at the tails, $P(k) \sim k^{-\gamma}$, where γ is the exponent that characterizes the connectivities. For instance, for a DL distance ℓ = 3 and the phonological layer, the estimated exponent (γ = 1.12) of the power-law degree distribution is concordant with the value reported in [16] for Spanish. Additional tests are needed to obtain a better description of the distributions, and of the behavior of the other topological metrics as a function of the degree.
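One possible way to estimate the tail exponent γ is sketched below. The text does not specify the fitting procedure, so this example swaps in a commonly used approximate maximum-likelihood estimator for discrete power-law tails, $\gamma \approx 1 + n \left[ \sum_i \ln \frac{k_i}{k_{\min} - 1/2} \right]^{-1}$, where the sum runs over the n degrees with $k_i \geq k_{\min}$. The function name `tail_exponent` and the choice of `k_min` are assumptions; in practice the cutoff has to be selected separately.

```python
import math


def tail_exponent(degrees, k_min):
    """Approximate maximum-likelihood exponent of a power-law tail.

    Only degrees >= k_min are used. The estimator is an approximation
    for discrete data and assumes the tail really is a power law.
    """
    if k_min < 1:
        raise ValueError("k_min must be at least 1")
    tail = [k for k in degrees if k >= k_min]
    if len(tail) < 2:
        raise ValueError("not enough observations in the tail")
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum
```

A fit of this kind only describes the tail; comparing it against a truncated power law or an exponential would require an additional model-selection step.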

Proposed timeline
The proposed study requires at most 3 months to complete (starting December 1, 2020). We plan to build the corpora of 12 new languages and enlarge the existing ones to $50 \times 10^3$ words. This stage is planned to conclude within a month, after which we will immediately carry out the pre-processing and the translation of all the corpora into the International Phonetic Alphabet. We will then proceed with the calculation of the metrics of the orthographic, phonological and multiplex networks. Finally, we plan to finish the data interpretation and draft the final manuscript in the following two months.