Fig 1.
Structural decomposition of the character 照.
Primitive characters appear as characters in their own right, whereas primitive components do not. The primitive component 灬 is an abbreviated form of the primitive character 火. The parameter r is the SUBTLEX-CH usage frequency rank of the character. Pronunciations are given in pinyin romanization. Note that each character is assigned only a single meaning, even though most actually possess a range of broadly related meanings.
Fig 2.
Usage frequency versus number of unique components for the 1000 most common Chinese characters.
This plot shows the weak relationship between character usage frequency and complexity, the latter represented by the number of unique components used to construct the character. Usage frequency is normalized to 1.0 over the whole usage frequency data set, which encompasses more characters than are shown in this plot. The six characters illustrated are the most common in each column. Note that the number of unique components is not the same as visual complexity: the characters 我 and 说 have similar visual complexity (they are composed of similar numbers of strokes), but 我 is conceptually simpler. In the OLS character decomposition, 我 is composed of two relatively complex primitive components, 手 and 戈, whereas 说 is composed of four components.
Fig 3.
Measures of learning efficiency.
The curves A and B represent two different learning curves. For each curve, the final learning efficiency Λf is the cumulative usage frequency for a specific cumulative learning cost C0, and the integral learning efficiency 〈Λ〉 is the average cumulative usage frequency between the origin and C0. Curve A has higher Λf but lower 〈Λ〉. Illustrated values for 〈Λ〉 are approximate.
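The two measures defined in this caption can be sketched in Python. This is a minimal illustration, not the authors' code: the function name `learning_efficiency`, its arguments, and the treatment of the learning curve as a step function (usage frequency is credited only once a character's full cost has been paid) are all assumptions made here.

```python
def learning_efficiency(costs, freqs, c0):
    """Return (final, integral) learning efficiency up to cumulative cost c0.

    costs, freqs: per-character learning cost and usage frequency, listed in
    learning order (hypothetical inputs). The learning curve is assumed to be
    a step function of cumulative cost.
    """
    cum_cost = 0.0   # cumulative learning cost paid so far
    cum_freq = 0.0   # cumulative usage frequency gained so far
    area = 0.0       # area under the learning curve from 0 to the current cost

    for cost, freq in zip(costs, freqs):
        next_cost = cum_cost + cost
        if next_cost > c0:
            break  # this character cannot be finished within the budget
        # The curve stays flat at cum_freq while this character is being learned.
        area += cum_freq * (next_cost - cum_cost)
        cum_cost, cum_freq = next_cost, cum_freq + freq

    # Flat tail between the last completed character and c0.
    area += cum_freq * (c0 - cum_cost)

    final_efficiency = cum_freq        # height of the curve at c0
    integral_efficiency = area / c0    # average height between 0 and c0
    return final_efficiency, integral_efficiency


final, integral = learning_efficiency([1, 1, 2], [0.5, 0.25, 0.125], c0=4)
print(final, integral)
```

Under these assumptions, two orders that reach the same cumulative usage frequency at C0 can still differ in integral efficiency, because the order that gains frequency earlier accumulates more area under its curve.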
Fig 4.
Illustration of the topological sort algorithm.
The ordered list is processed from low to high centrality (right to left in the figure). Once 的 is reached, its components are checked in turn. 白 is found to lie to the right of 的 and so is repositioned to its left. Likewise 勺 is found to the right of 的 and is similarly repositioned. 勺 is positioned to the right of 白 because it has lower centrality. The centralities used in this figure are for illustrative purposes only.
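The repositioning step described in this caption can be sketched as follows. This is a reconstruction from the caption alone and may differ from the authors' implementation: the function name `topological_reorder`, the dictionary inputs, and the example centralities are hypothetical, and the component network is assumed to be acyclic with every component present in the centrality table.

```python
def topological_reorder(centrality, components):
    """Reorder characters so that every component precedes the characters using it.

    centrality: dict mapping character -> centrality (illustrative values).
    components: dict mapping character -> list of its direct components.
    Assumes an acyclic network in which every component has a centrality.
    """
    # Start from high to low centrality (left to right).
    order = sorted(centrality, key=centrality.get, reverse=True)

    i = len(order) - 1  # process from the right, i.e. from low to high centrality
    while i >= 0:
        char = order[i]
        # Components currently lying to the right of the character.
        late = [c for c in components.get(char, ()) if order.index(c) > i]
        # Among moved components, lower centrality ends up further right.
        late.sort(key=centrality.get, reverse=True)
        for c in late:
            order.remove(c)  # removals are right of i, so char stays at index i
        order[i:i] = late    # reinsert immediately to the left of char
        # Continue with the element just left of char; this is the last moved
        # component if any were moved, so their own components get checked too.
        i = i + len(late) - 1
    return order


example = topological_reorder(
    {"的": 10, "不": 8, "白": 5, "勺": 2},
    {"的": ["白", "勺"]},
)
print(example)
```

In this toy example 白 and 勺 both move to the left of 的, with 勺 to the right of 白 because its centrality is lower, matching the behavior described in the caption.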
Fig 5.
A network where our algorithm does not return the optimal character order.
A hypothetical network where the integral learning efficiency of the order generated by the algorithm is lower than that of another possible order. Letters represent Chinese characters (for example, E is a compound character formed from primitives A and B) and the numbers are centralities. Both orders have identical final learning efficiencies.
Fig 6.
The first 85 characters of our optimized learning order.
Taken together these characters have a cumulative usage frequency of 0.42.
Fig 7.
The black and green curves were created using the OLS character decompositions and the two different learning order algorithms. The Yan et al. algorithm was optimized up to a cumulative learning cost of C0 = 4000. The blue curve uses the HR network with Heisig and Richardson’s fixed character order. Learning efficiencies are presented in Table 1.
Fig 8.
Usage frequencies for the first 85 characters.
The gray, green and blue bars correspond to the black, green and blue curves in Fig 7. Dark bars represent primitives and light bars represent compounds.
Table 1.
The number of characters learned N, final learning efficiency Λf, and integral learning efficiency 〈Λ〉 for reference cumulative learning costs of C0 = 500 and C0 = 1500. The Yan et al. algorithm was optimized up to a cumulative learning cost of C0 = 4000.
Fig 9.
Measures of character clustering.
The top panel shows the average distance, in number of characters, to the closest preceding component. The bottom panel shows the average distance, in number of characters, to another character that shares a component. Curves were generated with a fixed cumulative learning cost of C0 = 4000. Averages below 250 characters are not shown because the averages fluctuate wildly in this region.
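One plausible reading of the top-panel measure is sketched below. The caption does not specify the exact averaging procedure, so the details are assumptions: the function name `mean_distance_to_preceding_component` is hypothetical, the distance is taken as the difference in list positions, and characters with no preceding component (for example primitives) are excluded from the average.

```python
def mean_distance_to_preceding_component(order, components):
    """Average gap, in list positions, between each character and its
    closest component that appears earlier in the learning order.

    order: list of characters in learning order.
    components: dict mapping character -> list of its direct components.
    Characters without any preceding component are skipped (an assumption).
    """
    position = {ch: i for i, ch in enumerate(order)}
    distances = []
    for i, ch in enumerate(order):
        gaps = [
            i - position[c]
            for c in components.get(ch, ())
            if c in position and position[c] < i
        ]
        if gaps:
            distances.append(min(gaps))  # closest preceding component only
    if not distances:
        return None  # no character has a preceding component
    return sum(distances) / len(distances)


print(mean_distance_to_preceding_component(
    ["白", "勺", "的", "不"], {"的": ["白", "勺"]}
))
```

A smaller value indicates that compounds tend to be introduced soon after at least one of their components.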
Fig 10.
Learning curves for characters and words.
The green curves correspond to HSK word lists for levels 1 to 4 (shorter curve) and 1 to 6 (longer curve). The yellow curves correspond to word lists generated from two levels of beginner readers. All curves were created using the OLS character decompositions.