Optimizing the Learning Order of Chinese Characters Using a Novel Topological Sort Algorithm

doi:10.1371/journal.pone.0163623

Fig 1.

Structural decomposition of the character 照.

Primitive characters appear as characters in their own right, whereas primitive components do not. The primitive component 灬 is an abbreviated form of the primitive character 火. The parameter r is the SUBTLEX-CH usage frequency rank of the character. Pronunciations are given in pinyin romanization. Note that each character is assigned only a single meaning, even though most actually possess a range of broadly related meanings.

Fig 2.

Usage frequency versus number of unique components for the 1000 most common Chinese characters.

This plot shows the weak relationship between character usage frequency and complexity, the latter represented by the number of unique components used to construct the character. Usage frequency is normalized to 1.0 over the whole usage frequency data set, which encompasses more characters than shown in this plot. The six characters illustrated are the most common in each column. Note that the number of unique components is not the same as visual complexity: the characters 我 and 说 have similar visual complexity (they are composed of similar numbers of strokes), but 我 is conceptually simpler. In the OLS character decomposition, 我 is composed of two relatively complex primitive components, 手 and 戈, whereas 说 is composed of four.

Fig 3.

Measures of learning efficiency.

The curves A and B represent two different learning curves. For each curve, the final learning efficiency Λ_f is the cumulative usage frequency for a specific cumulative learning cost C₀, and the integral learning efficiency 〈Λ〉 is the average cumulative usage frequency between the origin and C₀. Curve A has higher Λ_f but lower 〈Λ〉. Illustrated values for 〈Λ〉 are approximate.
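
The two measures can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the function name, the per-character cost and frequency lists, and the treatment of the learning curve as a step function (a character's usage frequency is credited only once its full cost has been paid) are all hypothetical choices, not details taken from the caption.

```python
def learning_efficiency(costs, freqs, c0):
    """Return (final, integral) learning efficiency up to cumulative cost c0.

    costs[i] and freqs[i] are the hypothetical learning cost and normalized
    usage frequency of the i-th character in the learning order.
    """
    cum_cost = 0.0
    final = 0.0   # cumulative usage frequency reached at c0
    area = 0.0    # area under the step-shaped learning curve on [0, c0]
    for cost, freq in zip(costs, freqs):
        cum_cost += cost
        if cum_cost > c0:
            break  # this character is not finished within the budget
        final += freq
        # Once learned, the character contributes freq for the rest of the budget.
        area += freq * (c0 - cum_cost)
    # The integral efficiency is the average height of the curve over [0, c0].
    return final, area / c0


# Made-up numbers: order A saves its most frequent character for last,
# order B learns frequent characters first.
curve_a = learning_efficiency([1, 1, 1, 1], [0.1, 0.1, 0.1, 0.5], c0=4)
curve_b = learning_efficiency([1, 1, 1, 1], [0.4, 0.2, 0.1, 0.05], c0=4)
```

With these made-up numbers, order A ends with the higher final efficiency but the lower integral efficiency, which is the same qualitative situation as curves A and B in the figure.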

Fig 4.

Illustration of the topological sort algorithm.

The ordered list is processed from low to high centrality (right to left in the figure). Once 的 is reached, its components are checked in turn. 白 is found to lie to the right of 的 and so is repositioned to its left. Likewise 勺 is found to the right of 的 and is similarly repositioned. 勺 is positioned to the right of 白 because it has lower centrality. The centralities used in this figure are for illustrative purposes only.
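
The repositioning step described in this caption can be sketched in Python as follows. This is a minimal reading of the caption, not the authors' implementation: the function name, the dictionary-based inputs, and the centrality values are hypothetical, and the sketch assumes the component network has no cycles.

```python
def centrality_topological_sort(centrality, components):
    """Order characters by centrality while keeping components before compounds.

    centrality: dict mapping character -> centrality (illustrative values).
    components: dict mapping compound character -> list of its components.
    """
    # Start with the highest centrality on the left.
    order = sorted(centrality, key=centrality.get, reverse=True)
    i = len(order) - 1  # process from low to high centrality (right to left)
    while i >= 0:
        char = order[i]
        # Components that currently lie to the right of this character.
        late = [c for c in components.get(char, ()) if order.index(c) > i]
        # Among moved components, higher centrality goes further left.
        late.sort(key=centrality.get, reverse=True)
        for c in late:
            order.remove(c)
        order[i:i] = late  # reposition them immediately to the left of char
        # Continue with the item now directly to the left of char, so that
        # moved components get their own components checked in turn.
        i += len(late) - 1
    return order


# Illustrative centralities only, as in the figure.
centrality = {"的": 0.9, "一": 0.8, "白": 0.3, "勺": 0.1}
components = {"的": ["白", "勺"]}
result = centrality_topological_sort(centrality, components)
```

In this toy input, 白 and 勺 both start to the right of 的 and are moved to its left, with 勺 to the right of 白 because of its lower centrality. The repeated `order.index` calls keep the sketch short at the expense of speed; a position map would be preferable for a full character set.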

Fig 5.

A network where our algorithm does not return the optimal character order.

A hypothetical network where the integral learning efficiency of the order generated by the algorithm is lower than that of another possible order. Letters represent Chinese characters (for example, E is a compound character formed from primitives A and B) and the numbers are centralities. Both orders have identical final learning efficiencies.

Fig 6.

The first 85 characters of our optimized learning order.

Taken together these characters have a cumulative usage frequency of 0.42.

Fig 7.

Learning curves.

The black and green curves were created using the OLS character decompositions and the two different learning order algorithms. The Yan et al. algorithm was optimized up to a cumulative learning cost of C₀ = 4000. The blue curve uses the HR network with Heisig and Richardson’s fixed character order. Learning efficiencies are presented in Table 1.

Fig 8.

Usage frequencies for the first 85 characters.

The gray, green and blue bars correspond to the black, green and blue curves in Fig 7. Dark bars represent primitives and light bars represent compounds.

Table 1.

Learning curve parameters.

The number of characters learned N, final learning efficiency Λ_f, and integral learning efficiency 〈Λ〉 for reference cumulative learning costs of C₀ = 500 and C₀ = 1500. The Yan et al. algorithm was optimized up to a cumulative learning cost of C₀ = 4000.

Fig 9.

Measures of character clustering.

The top panel shows the average distance, in number of characters, to the closest preceding component. The bottom panel shows the average distance, in number of characters, to another character which shares a component. Curves were generated with a fixed cumulative learning cost of C₀ = 4000. Averages below 250 characters are not shown because in this region the averages fluctuate wildly.
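
The two distances can be made concrete with a small Python sketch. It rests on assumptions the caption does not settle: the function names are hypothetical, letters stand in for characters, the bottom-panel distance is taken to be the distance to the nearest other character sharing a component, and the averages are taken over the whole order rather than as the running averages plotted in the figure.

```python
def mean_distance_to_closest_component(order, components):
    """Average gap between a compound and its nearest preceding component."""
    pos = {ch: i for i, ch in enumerate(order)}
    gaps = []
    for ch in order:
        before = [pos[c] for c in components.get(ch, ())
                  if c in pos and pos[c] < pos[ch]]
        if before:
            gaps.append(pos[ch] - max(before))
    return sum(gaps) / len(gaps) if gaps else None


def mean_distance_to_sharing_character(order, components):
    """Average gap between a compound and the nearest character sharing a component."""
    pos = {ch: i for i, ch in enumerate(order)}
    gaps = []
    for ch, parts in components.items():
        if ch not in pos:
            continue
        shared = [abs(pos[ch] - pos[other])
                  for other, other_parts in components.items()
                  if other != ch and other in pos and set(parts) & set(other_parts)]
        if shared:
            gaps.append(min(shared))
    return sum(gaps) / len(gaps) if gaps else None


# Hypothetical network: E, F and G are compounds, the rest are primitives.
order = ["A", "B", "C", "E", "F", "D", "G"]
components = {"E": ["A", "B"], "F": ["B", "C"], "G": ["A"]}
component_gap = mean_distance_to_closest_component(order, components)
sharing_gap = mean_distance_to_sharing_character(order, components)
```

Smaller values of either measure indicate tighter clustering: compounds sit closer to their components, or to other characters built from the same components.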

Fig 10.

Learning curves for characters and words.

The green curves correspond to HSK word lists for levels 1 to 4 (shorter curve) and 1 to 6 (longer curve). The yellow curves correspond to word lists generated from two levels of beginner readers. All curves were created using the OLS character decompositions.
