Fig 1.
Toy co-occurrence network and pseudograph.
(A) shows the co-occurrence network based on Sentences 1 to 4. We draw a directed edge between all adjacent word pairs. The unrestricted path from elephants to equities is highlighted in yellow. The unrestricted distance between the two words is 6. The pseudograph of the same toy corpus is shown in (B). Each sentence is represented by a differently colored path on the network. A restricted path from elephants to equities does not exist. The box highlights the motif ‘a lot of’. It is marked by a fan-in of edges at its start (indicated by the sharp drop in the leftward extension probability PL) and a fan-out at ‘of’ (indicated by the sharp drop in the rightward extension probability PR). A new node for the significant motif is created in (C) and the edges are routed through it. (D) identifies an equivalence class within a window Lw = 4. An equivalence class ‘supernode’ is created and the edges are routed through this new node. However, unlike the motif nodes, the new equivalence class node does not reduce distances between the terminal word nodes that lie on either side of it.
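The difference between unrestricted and restricted paths in (A) and (B) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The two sentences in TOY_SENTENCES are hypothetical stand-ins for Sentences 1 to 4, which are not reproduced in this caption, so the distances differ from those in the figure. The sketch also assumes that a restricted path must stay within a single sentence; the function names are illustrative only.

```python
from collections import defaultdict, deque

# Hypothetical toy corpus (NOT the Sentences 1 to 4 of the figure).
TOY_SENTENCES = [
    "elephants eat a lot of grass",
    "traders buy a lot of equities",
]

def build_cooccurrence(sentences):
    """Directed graph with an edge between every pair of adjacent words."""
    graph = defaultdict(set)
    for sentence in sentences:
        words = sentence.lower().split()
        for left, right in zip(words, words[1:]):
            graph[left].add(right)
    return graph

def unrestricted_distance(graph, source, target):
    """Shortest directed path length, free to switch between sentences (BFS)."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path exists

def restricted_distance(sentences, source, target):
    """Shortest distance along one sentence only (assumed meaning of 'restricted')."""
    best = None
    for sentence in sentences:
        words = sentence.lower().split()
        starts = [i for i, w in enumerate(words) if w == source]
        ends = [j for j, w in enumerate(words) if w == target]
        for i in starts:
            for j in ends:
                if j > i and (best is None or j - i < best):
                    best = j - i
    return best  # None when no single sentence contains source before target

graph = build_cooccurrence(TOY_SENTENCES)
print(unrestricted_distance(graph, "elephants", "equities"))
print(restricted_distance(TOY_SENTENCES, "elephants", "equities"))
```

In this hypothetical corpus the unrestricted path crosses from the first sentence to the second through the shared words ‘a lot of’, whereas no restricted path exists because no single sentence contains both words.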
Fig 2.
The bar graphs in the sub-plots show the motif densities (number of motifs divided by the number of original terminal words) and the line plots chart the decrease in the number of tokens as more motifs are embedded in the network. (A) gives the results for the real corpora (USEC, SAC, and BC) while (B), (C), and (D) show the difference between each real corpus, its shuffled equivalent, and its POS-shuffled equivalent.
Table 1.
The measurements are density ρ, average degree 〈k〉, clustering coefficient C, assortativity r, average minimum unrestricted distances 〈min(dur)〉 (i.e. distances along unrestricted paths as in Fig 1A), average minimum restricted distances 〈min(dr)〉, and average mean restricted distances 〈mean(dr)〉 (i.e. distances along restricted paths as in Fig 1B). For scrambled (appended with -S) and POS-scrambled (appended with -PS) corpora, the values are given only to the precision that is not affected by fluctuations in the random scrambling.
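As a rough illustration of the first two measurements, the following Python sketch computes the density and average degree of a directed network from its edge list. The definitions used here are common textbook conventions and are assumptions, not necessarily those used for this table: density is taken as E / (N(N − 1)) and average degree as the mean of in-degree plus out-degree, 2E / N. The function name is hypothetical.

```python
def density_and_mean_degree(edges):
    """Density and mean total degree of a directed graph without self-loops.

    Assumed conventions (may differ from the table's):
      density     = E / (N * (N - 1))
      mean degree = 2 * E / N   (in-degree plus out-degree, averaged over nodes)
    """
    edge_set = {(u, v) for u, v in edges if u != v}  # drop duplicates and self-loops
    nodes = {n for edge in edge_set for n in edge}
    n_nodes, n_edges = len(nodes), len(edge_set)
    if n_nodes < 2:
        return 0.0, 0.0
    density = n_edges / (n_nodes * (n_nodes - 1))
    mean_degree = 2 * n_edges / n_nodes
    return density, mean_degree

# Three nodes, three directed edges.
rho, k_mean = density_and_mean_degree([("a", "b"), ("b", "c"), ("a", "c")])
print(rho, k_mean)
```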
Fig 3.
Shrinking distances in the USEC and ranked word frequencies.
(A) shows how 〈mean(dr)〉 decreases in the USEC and how it compares to null models set at different cost parameters ΓNode. (B) shows the relative word frequencies f(r) against rank r and highlights the stop word cutoff in each corpus. The cutoffs mark sudden drops in the ranked word frequencies.
Table 2.
For levels 1 and 2, we present the top 3 POS templates and stop word templates, respectively, ranked by Z-score. The Penn Treebank [60] POS tags are used here. Beyond level 2, there were no regularly used templates. For each level, we also give the number of template types used by the motifs in the corpus and the number of template types that appeared in randomly extracted motifs.