Table 1.
Descriptions of the features used in our study of the effects of page language on the spam detection rate.
Note that the numbers are per page.
Fig 1.
Cumulative distribution function (CDF) for different features in both data sets.
Table 2.
Results obtained after applying the feature selection algorithms to both data sets.
Table 3.
Performance of the decision tree classifier using different sets of features (where S = spam and NS = non-spam).
Table 4.
Confusion matrix obtained by the decision tree classifier using different sets of features (where S = spam and NS = non-spam).
Table 5.
Performance measurement indices.
Fig 2.
Percentages of the collected URLs in each Google Trends category.
Fig 3.
Process flow employed for collecting and building our web spam corpus.
Fig 4.
Distribution of the URL categories in the data set.
Fig 5.
Distribution of positive and negative URLs for different manually labeled categories.
Fig 6.
System sequence diagram.
Table 6.
Descriptions of the detection features.
Fig 7.
Cumulative distribution function (CDF) for features 2, 5, 6, and 7 in the spam, borderline, and non-spam categories.
Fig 8.
Probability density function (PDF) for features 2, 5, 6, and 7 in the spam, borderline, and non-spam categories.
Fig 9.
Distributions of features 1, 3, and 4 in the spam, borderline, and non-spam categories.
Fig 10.
Probability density functions P_n and P_s for different combinations of features, where n denotes non-spam pages (in red) and s denotes spam pages (in green).
Table 7.
Parameters used in the decision tree, Bayesian network, support vector machine (SVM), and multilayer neural network methods (see Part II of the WEKA Manual for descriptions of the various algorithms used in our study [40]).
Table 8.
Classification accuracy for three classes.
Table 9.
Classification accuracy for two classes.
Table 10.
Confusion matrix for three-class classifiers.
Table 11.
Confusion matrix for two-class classifiers.