A Complex Network Approach to Stylometry

doi:10.1371/journal.pone.0136076

Table 1.

Pre-processing steps applied to construct networks from texts.

Fig 1.

Word adjacency model.

Example of a word adjacency network created from the poem “In the Middle of the Road”, by Carlos Drummond de Andrade. The pre-processing steps performed to generate this network are shown in Table 1.
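
The construction of a word adjacency network can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the pre-processing here is assumed to be only lowercasing and a simple regular-expression tokenizer (the actual steps are those listed in Table 1), and the function name `word_adjacency_network` and the sample sentence are hypothetical.

```python
import re
from collections import defaultdict


def word_adjacency_network(text):
    """Build a directed, weighted word adjacency network.

    Returns a dict where network[u][v] is the number of times
    word v immediately follows word u in the text.
    Assumption: pre-processing is reduced to lowercasing and
    regex tokenization; Table 1 may list further steps.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    edges = defaultdict(lambda: defaultdict(int))
    # Link every pair of consecutive words.
    for u, v in zip(tokens, tokens[1:]):
        edges[u][v] += 1
    return {u: dict(vs) for u, vs in edges.items()}


# Illustrative sentence (not the poem shown in the figure).
network = word_adjacency_network("the cat sat on the mat the cat ran")
for word, successors in network.items():
    print(word, "->", successors)
```

Words that only appear at the end of the text have no outgoing edges and therefore do not appear as keys in this representation.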

Fig 2.

Accessibility in a toy network.

Probabilities of transition from node 1 considering h = 2 steps for two distinct configurations of links. The first configuration considers only the red edges and the second one considers both red and blue edges. Note that, in the first configuration, the probability of reaching any node at the second level is the same, so the accessibility equals the number of second-level nodes. When blue edges are included, nodes 7 and 9 tend to receive more visits than the other nodes, according to the considered probabilities. For this reason, the effective number of accessed nodes drops.
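
The effect described in this caption can be reproduced with a small sketch. It assumes that accessibility is the exponential of the Shannon entropy of the h-step transition probabilities and that transitions follow a simple random walk over directed edges. The node layout below is hypothetical and is not taken from the figure; only the qualitative behavior (nodes 7 and 9 receiving more visits once extra edges are added) mirrors the caption.

```python
import math


def transition_probabilities(adj, start, h):
    """Probability of being at each node after exactly h random-walk steps."""
    probs = {start: 1.0}
    for _ in range(h):
        nxt = {}
        for node, p in probs.items():
            neighbors = adj.get(node, [])
            for nb in neighbors:
                # Each outgoing edge is followed with equal probability.
                nxt[nb] = nxt.get(nb, 0.0) + p / len(neighbors)
        probs = nxt
    return probs


def accessibility(adj, start, h):
    """Effective number of nodes reached after h steps: exp(entropy)."""
    probs = transition_probabilities(adj, start, h)
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return math.exp(entropy)


# Hypothetical toy layouts (assumed, not the figure's exact topology).
red_only = {1: [2, 3, 4], 2: [5, 6], 3: [7, 8], 4: [9, 10]}
red_and_blue = {1: [2, 3, 4], 2: [5, 6, 7], 3: [7, 8, 9], 4: [9, 10]}

acc_red = accessibility(red_only, 1, 2)
acc_both = accessibility(red_and_blue, 1, 2)
probs_both = transition_probabilities(red_and_blue, 1, 2)
print(acc_red, acc_both)
```

With uniform probabilities the value equals the number of reachable nodes; any imbalance in the visit probabilities lowers it.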

Fig 3.

Intermittence of words.

Profile of the spatial distribution of the words “long” (N_i = 44 and I_i = 1.02) and “Hobson” (N_i = 45 and I_i = 3.40) in the book “Adventures of Sally”, by P. G. Wodehouse. Because “Hobson” is unevenly distributed along the book, this word takes a high intermittency value.
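
A rough way to quantify this unevenness is sketched below. It assumes that intermittency is the coefficient of variation (standard deviation divided by mean) of the gaps between consecutive occurrences of a word; the exact definition used in the paper, including any boundary handling, may differ. The function name and the position lists are illustrative and are not data from the book.

```python
import statistics


def intermittency(positions):
    """Coefficient of variation of the gaps between consecutive occurrences.

    positions: sorted token indices where the word occurs.
    Assumption: no special treatment of the text boundaries.
    """
    if len(positions) < 2:
        raise ValueError("need at least two occurrences")
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)


# Illustrative positions: one evenly spread word, one bursty word.
even_positions = [0, 100, 200, 300, 400]
bursty_positions = [0, 1, 2, 3, 400]

i_even = intermittency(even_positions)
i_bursty = intermittency(bursty_positions)
print(i_even, i_bursty)
```

A word that recurs at regular intervals scores close to zero, while a word concentrated in a few passages scores well above one.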

Fig 4.

Hybrid classifier.

Example of classification based on the hybrid classifier. In the top panel, the network r_? might assume two possible classes: c₁ and c₂. An example for each of these classes is provided (see networks r₁ and r₂). In the central panel, the decision boundary obtained for λ = 0.15 is shown. Because λ takes a low value in this case, the decision is mainly based on the number of shared nodes (words). As a consequence, r_? is classified as belonging to class c₂. For higher values of λ, the topological features of the texts take over. In the bottom panel, r_? is classified as belonging to class c₁ because r_? and r₁ are topologically similar.
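
The role of λ can be illustrated with a small nearest-neighbor sketch. The combination rule below (a convex combination of a topological distance and a word-overlap distance) is an assumption made for illustration, since the caption does not state the formula; the Jaccard distance is a stand-in for "number of shared words", and all names, feature values and word sets are hypothetical.

```python
import math


def jaccard_distance(words_a, words_b):
    """1 minus the fraction of shared words (stand-in traditional measure)."""
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def hybrid_distance(a, b, lam):
    """Assumed rule: lam weights topology, (1 - lam) weights shared words."""
    d_topo = math.dist(a["topology"], b["topology"])
    d_words = jaccard_distance(a["words"], b["words"])
    return lam * d_topo + (1.0 - lam) * d_words


def classify(unknown, references, lam):
    """Return the label of the reference closest to the unknown text."""
    return min(references, key=lambda c: hybrid_distance(unknown, references[c], lam))


# Hypothetical texts: topology vectors are assumed to be scaled to [0, 1].
references = {
    "c1": {"topology": (0.20, 0.30), "words": {"a", "b", "c", "d"}},
    "c2": {"topology": (0.90, 0.80), "words": {"e", "f", "g", "h"}},
}
unknown = {"topology": (0.25, 0.30), "words": {"e", "f", "g", "x"}}

label_low = classify(unknown, references, lam=0.15)
label_high = classify(unknown, references, lam=0.85)
print(label_low, label_high)
```

In this toy setup the unknown text shares most of its words with the c₂ example but resembles the c₁ example topologically, so the predicted class flips as λ grows.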

Fig 5.

Tiebreaker classifier.

Example of classification based on the tiebreaker classifier. In the left panel, the gray instances are the test instances that should be classified with the class labels red circle or blue asterisk. In this case, traditional attributes were used. Note that the square is significantly far from the decision boundary (dashed line). Therefore, the classification of this instance does not demand the use of topological features because Δ ≥ θ in Eq 16. In contrast, the pentagon is located on the decision boundary. Because Δ < θ in this case, topological attributes are used to perform the classification. According to the topological attributes (see right panel), the test instance represented by the pentagon is classified as belonging to the blue class.
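
The decision rule described in this caption can be sketched as below. Eq 16 is not reproduced here, so the definition of Δ is an assumption: it is taken to be the gap between the two highest class scores produced by the traditional classifier. The function name and the score values are hypothetical.

```python
def tiebreaker_classify(traditional_scores, topological_scores, theta):
    """Use traditional scores unless their margin is below theta.

    Assumption: delta is the difference between the two best
    traditional class scores. When delta < theta the decision is
    deferred to the topological scores.
    """
    ranked = sorted(traditional_scores.values(), reverse=True)
    delta = ranked[0] - ranked[1]
    scores = traditional_scores if delta >= theta else topological_scores
    return max(scores, key=scores.get)


theta = 0.2

# "Square": far from the boundary, so traditional attributes decide.
square = tiebreaker_classify(
    {"red": 0.9, "blue": 0.1}, {"red": 0.4, "blue": 0.6}, theta
)

# "Pentagon": on the boundary, so topological attributes decide.
pentagon = tiebreaker_classify(
    {"red": 0.5, "blue": 0.5}, {"red": 0.3, "blue": 0.7}, theta
)
print(square, pentagon)
```

Setting θ to zero reduces the scheme to the traditional classifier alone, while a very large θ hands every decision to the topological attributes.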

Table 2.

List of books employed in the authorship recognition task.

Fig 6.

Accuracy rate for the hybrid classifier.

Relative accuracy rate as a function of the topological weight (λ) obtained with the k-nearest neighbors method. The attributes employed were: (a)-(c): intermittence; (d)-(f): stopwords; (g)-(i): characters. The values of the parameter k were: k = 3 in (a), (d) and (g); k = 4 in (b), (e) and (h); and k = 5 in (c), (f) and (i). As it turns out, the accuracy rates improve when traditional methods are combined with the technique based on the topological analysis of complex networks.

Table 3.

Accuracy rates obtained with the hybrid classifier based on the kNN technique.

Table 4.

Accuracy rates obtained with the hybrid classifier based on the SVM, RFO and MLP methods.

Fig 7.

Accuracy rate for the tiebreaker classifier.

Relative accuracy rate as a function of the threshold (θ) obtained with the tiebreaker algorithm applied to the k-nearest neighbors method. The attributes employed were: (a)-(c): intermittence; (d)-(f): stopwords; (g)-(i): characters. The values of the parameter k were: k = 3 in (a), (d) and (g); k = 4 in (b), (e) and (h); and k = 5 in (c), (f) and (i). As it turns out, the accuracy rates improve when traditional methods are combined with the technique based on the topological analysis of complex networks.

Table 5.

Accuracy rates obtained with the tiebreaker classifier based on the kNN technique.

Table 6.

Accuracy rates obtained with the tiebreaker classifier based on the SVM, RFO and MLP methods.

Fig 8.

Authorship recognition.

Discriminability of authors obtained with two topological features of complex networks modelling texts. Note that, using only two features, it was possible to separate, e.g., Alger from Melville. According to the information gain criterion, the most relevant network features for the authorship identification task were the standard deviation of the accessibility computed at the third level (Ω(Δα^{(h = 3)}) = 1.07) and the standard deviation of the average neighborhood degree (Ω(Δk⁽ⁿ⁾) = 1.05).

Table 7.

Best relative accuracy rate Γ_max obtained with the hybrid classifier based on the k-nearest neighbors method.

Table 8.

Best relative accuracy rate Γ_max obtained with the tiebreaker algorithm for the style identification task.

Fig 9.

Style identification.

Projection of the Brown dataset using topological features of networks modeling texts. Linear discriminant analysis [68] was employed to generate the visualization. Note that the variability of the documents classified as imaginative prose is lower than the variability of style observed for informative documents.
