Analysis of Web Spam for Non-English Content: Toward More Effective Language-Based Classifiers

doi:10.1371/journal.pone.0164383

Table 1.

Descriptions of the features used in our study of the effect of page language on the spam detection rate.

Note that the numbers are per page.

Fig 1.

Cumulative distribution function (CDF) for different features in both data sets.

Table 2.

Results obtained after applying the feature selection algorithms to both data sets.

Table 3.

Performance of the decision tree classifier using different sets of features (where S = spam and NS = non-spam).

Table 4.

Confusion matrix obtained by the decision tree classifier using different sets of features (where S = spam and NS = non-spam).

Table 5.

Performance measurement indices.

Fig 2.

Percentages of the collected URLs in each Google Trends category.

Fig 3.

Process flow employed for collecting and building our web spam corpus.

Fig 4.

Distribution of the URL categories in the data set.

Fig 5.

Distribution of positive and negative URLs for different manually labeled categories.

Fig 6.

System sequence diagram.

Table 6.

Descriptions of the detection features.

Fig 7.

Cumulative distribution function (CDF) for features 2, 5, 6, and 7 in the spam, borderline, and non-spam categories.

Fig 8.

Probability density function (PDF) for features 2, 5, 6, and 7 in the spam, borderline, and non-spam categories.

Fig 9.

Distributions of features 1, 3, and 4 in the spam, borderline, and non-spam categories.

Fig 10.

Probability density functions P_n and P_s for different combinations of features, where n denotes non-spam pages (in red) and s denotes spam pages (in green).

Table 7.

Parameters used in the decision tree, Bayesian network, support vector machine (SVM), and multilayer neural network methods (see Part II of the WEKA Manual for descriptions of the various algorithms used in our study [40]).
