Analysis of Web Spam for Non-English Content: Toward More Effective Language-Based Classifiers

Web spammers aim to obtain higher ranks for their web pages by including spam content that deceives search engines into listing their pages in search results even when the pages are unrelated to the search terms. Search engines continue to develop new web spam detection mechanisms, but spammers also improve their tools to evade detection. In this study, we first explore the effect of the page language on spam detection features and demonstrate how the best set of detection features varies according to the page language. We also study the performance of Google Penguin, a newly developed anti-web spamming technique used in the Google search engine. Using spam pages in Arabic as a case study, we show that, unlike for similar English pages, Google's anti-spamming techniques are ineffective against a high proportion of Arabic spam pages. We then explore multiple detection features for spam pages to identify an appropriate set of features that yields a high detection accuracy compared with the integrated Google Penguin technique. In order to build and evaluate our classifier, as well as to help researchers conduct consistent measurement studies, we collected and manually labeled a corpus of Arabic web pages, including both benign and spam pages. Furthermore, we developed a browser plug-in that uses our classifier to warn users about spam pages after they click on a URL and to filter spam pages out of search engine results. Using Google Penguin as a benchmark, we provide an illustrative example to show that language-based web spam classifiers are more effective for capturing spam content.


Introduction
Web spamming (or spamdexing) is a process for illegitimately increasing the search rank of a web page with the aim of attracting more users to visit the target page by injecting synthetic content into the page [1,2]. Web spamming can degrade the accuracy of search engines greatly if this content is not detected and filtered out from the search results [3][4][5]. In general, spammers aim to illegally enhance the search engine ranks of their spam pages, which might lead to user frustration, information pollution, and distortion of the search results, thereby affecting the entire information search process.
Black hat search engine optimization (SEO) techniques are generally used to create web spam pages. For example, in content-based web spamming, spammers stuff spam keywords into the target page by listing them in the HTML tags (e.g., META tags) or by using an invisible font. In addition, scraper techniques are used where the spam content is simply a replica of another popular site [6][7][8]. Search engines prohibit these deception techniques because they lead to misleading search results [9].
Some web ranking algorithms give higher ranks to pages that can be reached from other highly ranked web pages, and black hat SEO methods exploit this feature to increase the ranks of spam pages [5][10][11][12][13]. For example, in the cookie stuffing method, the user's browser receives a third-party cookie after visiting a spam page with an affiliate site, so the cookie stuffer is credited with a commission when the user later visits the affiliate site and completes a qualifying transaction. Moreover, with a page cloaking mechanism, a search engine crawler receives different content from the spam page compared with that displayed in the end-user's browser, where the aim is to deliver advertisements or malicious content that is partially or completely irrelevant to what the user searched for. Another link-based tactic is the link farm, where a set of pages are linked with each other.
Site mirroring is another black hat SEO method, which exploits the fact that many search engines grant higher ranks to pages that contain search keywords in the URL. Thus, spammers can create multiple sites with various URLs but similar content. Further, web spammers can create pages that redirect the user's browser to a different page that contains the spam content in order to evade detection by search engines [10].
Due to the success of email anti-spam tools based on machine learning, we consider that these techniques might also be effective for detecting web spamming. Typically, high detection accuracy and a low false positive rate are the main properties required for detection tools based on machine learning methods. This is particularly important for detecting spam pages and ensuring that benign web sites are not penalized.
Search engines enhance their anti-spamming techniques continuously. For example, Google developed their latest algorithm (called Penguin) in 2012 and they have continued updating it to lower the search engine ranks of web sites that use black hat SEO or that violate Google Webmaster Guidelines [21,22]. Google's latest web spam report urges publishers to verify the contents of their pages via the Search Console. In fact, Google sent over 4.3 million emails to webmasters during 2015 alone to warn them of identified spam-like content and to give them a chance of reconsideration [23].
The effectiveness of the Google Penguin algorithm is affected by the text language used in the page examined [24]. Several web spam detection features have been proposed but to the best of our knowledge, the effect of the language on these detection features has not been examined previously. In addition, to the best of our knowledge, the performance of the Google Penguin algorithm at detecting web spam pages that contain text in languages other than English has not been evaluated.
This study significantly extends our earlier conference paper [25,26], where the data set is expanded and updated, a new release of Google Penguin is explored, new spamming detection algorithms are introduced, and their results are presented. This study makes the following main contributions.
1. We show how and why the distributions of selected detection features differ according to the page language. We used English and Arabic as the languages in our case studies.
2. COLLECTING AN ARABIC WEB SPAM DATA SET. We collected and manually labeled a corpus containing both benign and spam pages with Arabic content. We used this corpus to evaluate our proposed machine learning-based classifier and we have also made the corpus available for use by the research community in this domain.
3. ANALYSIS OF DETECTION FEATURES AND DEVELOPMENT OF A NOVEL CLASSIFIER. Using Arabic pages in a case study, we showed how to identify a set of web spam detection features with satisfactory detection accuracy. Employing supervised machine learning techniques, we then built a classifier for detecting web pages that contain spam content and showed that it yielded better accuracy compared with the Google Penguin algorithm.
4. CONSTRUCTION OF A BROWSER ANTI-WEB SPAM PLUG-IN. Using our proposed classifier, we developed a browser plug-in to warn the user before accessing web spam pages (i.e., after clicking on a link from the search results). The plug-in is also capable of filtering out spam pages from the search engine results.
The remainder of this paper is organized as follows. Section 2 presents our analysis of how the page language affects the detection rate for web spam using a set of classifiers. Section 3 describes the collection and labeling process for our data set. Section 4 illustrates our system architecture and design. Section 5 explains the feature extraction and selection process. Section 6 presents the proposed classifier and evaluations of its accuracy. Section 7 discusses the meaning and implications of our main findings, and Section 8 presents related research. Finally, we give our conclusions in Section 9.

Data Sets
Two web spam data sets were used in this study. First, we used UK-2011 [27], which is a subset of the WEBSPAM-UK2007 data set [28]. The UK-2011 data set was labeled by volunteers and each page is flagged as either "spam" or "non-spam." Second, we used an extended Arabic web spam data set [29], which included spam and non-spam Arabic pages (this data set was collected and labeled during the period from April 2011 to August 2011).
We used Wahsheh's web spam detection features [30] (see Table 1). We employed the J48 classifier, which is a Weka (version 3.7.6) implementation of the C4.5 decision tree classifier (decision trees are statistical machine learning algorithms that use a greedy top-down process to select attributes at nodes in the tree and divide the samples into subsets based on the values of these attributes). Cross-validation, a model evaluation method used to estimate how well a classifier generalizes to an independent data set, was used to ensure that each instance in the data set had an equal probability of appearing in either the training or testing sets. We performed 10-fold cross-validation: the data set was divided into 10 chunks and the classifier was trained 10 times, with a different chunk held out as the testing set each time. For the decision tree classifier, overfitting was addressed by using a pruning technique, where the tree nodes that were less significant for classifying the data set instances were removed from the tree (we set the minimum number of instances to two).
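The 10-fold procedure described above can be illustrated with a minimal sketch. It is not the Weka implementation used in the study: the function name `k_fold_indices`, the shuffling seed, and the round-robin assignment of instances to folds are all assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs so that every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # hypothetical seed, for repeatability
    folds = [indices[i::k] for i in range(k)]     # k roughly equal chunks
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [idx
                     for fold_no, fold in enumerate(folds)
                     if fold_no != held_out
                     for idx in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_indices(100, k=10):
        print(len(train_idx), len(test_idx))
```

Each instance appears in exactly one testing chunk and in the training set of the other nine runs, which is the property the paragraph above relies on.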

Results and Analysis
We started our analysis by studying the selected detection features in both data sets. Fig 1 shows the probability density function (PDF) for different features in both data sets. A random sample of 1,500 web pages was plotted to keep the figure legible (compared with 3,688 pages in data set (1) and 9,988 in data set (2)). According to the cumulative distribution function (CDF) for feature 2 in Fig 1A, almost 60% of the Arabic non-spam pages contained fewer than 270 words, whereas less than 15% of the Arabic spam pages had fewer than 270 words. The figure shows that Arabic spam pages tended to have more words compared with Arabic non-spam pages. In addition, the CDFs for the number of words in Arabic non-spam pages and English pages were very similar. The same observation can be made based on Fig 1B and 1C, but there was more variation among them. In fact, most of the features exhibited greater variation between spam and non-spam pages in the Arabic data set compared with the UK data set. Furthermore, Fig 1B shows that Arabic spam pages tended to have shorter word lengths, where almost 80% of the Arabic spam pages had an average word length of six characters, whereas only 40% of the Arabic non-spam pages had an average word length of six characters. In terms of the number of characters per meta-element, as shown in Fig 1C, Arabic spam pages usually had more characters (80% had more than 400 characters) compared with Arabic non-spam pages (20% had more than 400 characters). Furthermore, Fig 1D shows that Arabic pages usually had more images compared with English pages, particularly in spam pages.
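Statements such as "almost 60% of the pages contained fewer than 270 words" are readings of an empirical CDF at a threshold. The following sketch shows that computation under stated assumptions: the function name `fraction_below` is hypothetical and the word counts are invented placeholder values, not data from either data set.

```python
from bisect import bisect_left


def fraction_below(values, threshold):
    """Empirical CDF read strictly below the threshold: the share of values < threshold."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    return bisect_left(ordered, threshold) / len(ordered)


if __name__ == "__main__":
    # Hypothetical per-page word counts, used only to show the call.
    non_spam_word_counts = [120, 180, 250, 300, 900]
    spam_word_counts = [200, 400, 650, 800, 1200, 1500, 2000]
    print(fraction_below(non_spam_word_counts, 270))
    print(fraction_below(spam_word_counts, 270))
```

Comparing the two fractions at the same threshold gives the kind of spam versus non-spam contrast discussed for Fig 1A.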
First, we used all 11 detection features to build the classifiers. Most of the Arabic web spam pages used more obvious spamming tactics compared with those in English, so the detection rate (DR) for English spam pages was lower than that for Arabic spam pages. We then selected different sets of features using the following feature selection algorithms implemented in Weka: CfsSubsetEval, PrincipalComponents, ConsistencySubsetEval, and FilteredSubsetEval. Brief descriptions of these algorithms and the results obtained from their execution are shown in Table 2. The CfsSubsetEval algorithm considers the individual predictive ability of every feature as well as the degree of redundancy among the features in order to evaluate the value of a subset of features. PrincipalComponents performs principal components analysis and transforms the data. Based on the results obtained by these algorithms, we selected the following sets as training scenarios for the classifier: {1, 5, 7, 11}, {1, 5, 8, 9}, {1, 5, 7, 10, 11}, and all 11 features. Tables 3 and 4 show the performance of each set of features using the classifier described above, reported in terms of the performance measurement indices listed in Table 5 and the confusion matrix obtained by the classifier.
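The trade-off between predictive ability and redundancy that CfsSubsetEval evaluates is commonly expressed as the correlation-based feature selection (CFS) merit score. The sketch below assumes that standard formulation; it does not reproduce Weka's internals (its correlation measure or its search strategy), and the function name `cfs_merit` and its inputs are hypothetical.

```python
from math import sqrt


def cfs_merit(feature_class_corr, feature_feature_corr):
    """CFS merit of a feature subset.

    feature_class_corr: one correlation with the class per feature in the subset.
    feature_feature_corr: correlations for every pair of features in the subset
                          (empty when the subset holds a single feature).
    """
    k = len(feature_class_corr)
    if k == 0:
        raise ValueError("the subset must contain at least one feature")
    mean_class = sum(feature_class_corr) / k
    mean_pair = (sum(feature_feature_corr) / len(feature_feature_corr)
                 if feature_feature_corr else 0.0)
    # High class correlation raises the merit; high inter-feature correlation lowers it.
    return k * mean_class / sqrt(k + k * (k - 1) * mean_pair)
```

Under this score, adding a feature that is perfectly correlated with one already in the subset does not raise the merit, whereas adding an equally predictive but uncorrelated feature does.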

Limitations in Existing Data Sets
We found that the distributions of a selected set of features varied according to the underlying language used in the page examined. In addition, for both data sets, the results obtained by the classifiers showed that only a few common features yielded similar results. However, the significance of several of the remaining features varied according to the language used in the page examined. The effect of language was due partly to the use of a similar set of web spamming techniques for a given language. It is important to note that these data sets are fairly old and they do not represent the current techniques of new spammers. In addition, given that the original contents of the web pages of the two data sets were not available, we could not examine other spam detection features (i.e., other than the 11 features provided within the two data sets). Furthermore, the method used to collect the web pages in these data sets did not take into account that spammers target specific search engines in order to obtain higher ranks for their web pages in the search engine results and increase the number of hits. To overcome these limitations, we decided that a new data set must be collected carefully and made available.

Table 5 (measurement indices and their descriptions):
Detection rate (DR): Ratio of the number of correctly classified samples relative to the total number of samples.
Error rate (ER): Ratio of the number of incorrectly classified samples relative to the total number of samples.
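The detection rate (DR) and error rate (ER) defined in this section can be computed directly from paired actual and predicted labels. This is a minimal sketch; the function name and the label strings are illustrative assumptions, and the sample labels in the usage lines are invented.

```python
def detection_and_error_rate(actual, predicted):
    """Return (DR, ER): the shares of correctly and incorrectly classified samples."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("label lists must be non-empty and of equal length")
    total = len(actual)
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / total, (total - correct) / total


if __name__ == "__main__":
    # Hypothetical labels, used only to show the call.
    actual = ["spam", "spam", "non-spam", "non-spam"]
    predicted = ["spam", "non-spam", "non-spam", "non-spam"]
    print(detection_and_error_rate(actual, predicted))
```

By construction DR + ER = 1, so reporting both is a consistency check rather than two independent measurements.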

Building an Arabic web spam corpus
In order to overcome the limitations described in the previous section, we followed a three-step process to collect a data set of Arabic pages, including both benign and spam web pages. First, we collected the top Arabic search keywords for the period from January 2004 to October 2012 on the Google Trends website. We then queried the Google search engine using the collected search keywords. The URLs of the top 50 result pages for each search keyword were then stored, thereby obtaining a total of 8,168 distinct domain names. Fig 2 shows the percentages of the URLs collected for each category in Google Trends. We note that the number of search keywords in a given category affected the corresponding percentage.
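The step that reduces the stored result URLs to distinct domain names can be sketched as follows. The input layout (a mapping from keyword to ranked result URLs), the function name `distinct_domains`, and the stripping of a leading "www." are assumptions for illustration; the normalization actually applied when building the corpus is not stated here.

```python
from urllib.parse import urlsplit


def distinct_domains(results_by_keyword, top_n=50):
    """Collect the distinct host names among the top_n result URLs of each keyword."""
    domains = set()
    for urls in results_by_keyword.values():
        for url in urls[:top_n]:
            host = urlsplit(url).hostname   # lower-cased host, or None if absent
            if not host:
                continue
            if host.startswith("www."):     # assumed normalization step
                host = host[4:]
            domains.add(host)
    return domains


if __name__ == "__main__":
    # Hypothetical search results, used only to show the call.
    sample = {
        "keyword-1": ["http://www.example.com/a", "https://example.com/b"],
        "keyword-2": ["http://news.example.org/x"],
    }
    print(sorted(distinct_domains(sample)))
```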
We identified multiple types of pages with malware and phishing content, where each URL was examined using six security scanners (these scanners were provided by selected antivirus vendors): 1) Sucuri SiteCheck scanner; 2) McAfee SiteAdvisor scanner; 3) Google Safe Browsing scanner; 4) Norton scanner; and 5) Sophos scanner (with Yandex ranking). The scanners examined every visible web page in the entire domain of a given URL. This scanning process was beneficial for studying the relationships between existing vulnerabilities, malicious content, and web spam [31]. The scanning results were then stored in a database (see Fig 3).
Finally, the URLs were labeled manually by several raters. Each link was classified into one of four categories: i) spam class; ii) borderline class; iii) benign class; and iv) unknown class. The raters were given a set of guidelines for labeling web spam pages (e.g., see [32]). A web application was utilized by the raters to view and rate the data set's links so every link was classified by at least one rater. Fig 4A shows the distribution of classes (i.e., non-spam, borderline, and spam) according to the raters. It should be noted that almost 26% of the Google search results were flagged as either the spam class (10%) or borderline class (16%), although the new update to the Penguin algorithm has been in place for several months.
Many spammers aim to compromise the machines of users and there was a clear correlation between spamming and the existence of web vulnerabilities, as shown in Fig 4B and 4C. We note that 15% of the URLs with positive results from the Sucuri scanner (i.e., containing malware and flagged as malicious) were manually labeled as spam, whereas 9% of the URLs with negative results were labeled as spam. Similarly, the URLs flagged as borderline represented (1) 13% of the Sucuri scanner-negative URLs and (2) 30% of the Sucuri scanner-positive URLs. However, non-spam URLs represented more than 78% of the negative URLs and 55% of the positive URLs. Similar observations can be made for the sites scanned by the McAfee tool, as shown in Fig 4C, which indicates that spamming seems to be a preferred tool for attackers. Fig 5A, 5B and 5C illustrate the distributions of our three classes among Google Trends categories. The distribution is divided into two sets, malicious and benign, according to the URL classification by the Sucuri scanner. The arts & entertainment, beauty & fitness, and online communities categories were the most common for web spammers. Furthermore, we note that the numbers of positive and negative URLs according to the Sucuri scanner were proportional to those in the spam and borderline classes, unlike the non-spam class.

System Architecture and Design
The system comprises two major components: (i) a back-end server and (ii) a browser plug-in. The plug-in represents the connection between the back-end server and the browser (see Fig 6). After the browser plug-in captures the URL (either clicked on or entered in the web browser address bar by the user), the URL is sent by the plug-in to the back-end server, which then extracts the values of the detection features from the URL and flags it as either benign or spam.
If a page is flagged as spam, the plug-in blocks it and displays a pop-up dialog box to warn the user of spam content. The user has the option to proceed and browse the spam page. The plug-in maintains a cache with a blacklist and whitelist, so only new URLs are examined by the back-end server. A database containing all the requests received from the plug-ins is also maintained by the back-end server, which serves as a local cache lookup mechanism to speed up the retrieval process.
The plug-in was implemented for the Chrome browser using standard web techniques, such as HTML, CSS, and JavaScript, and JavaScript Object Notation (JSON) is used for lightweight data interchange between the browser and the back-end server. The back-end server uses Apache Tomcat as a web server, MySQL as a database server, and JavaServer Pages (JSP) as a server-side programming technology. In the back-end server, jsoup is used as a Java library for HTML and XML document parsing and feature extraction. Most computations are performed on the server side, which maintains a cache containing both the blacklist and whitelist, so the waiting time tends to be very short compared with the loading time for the pages examined. Furthermore, the back-end server can easily be scaled up or down to serve the number of requests. The back-end server can also be used to collect crash reports from the plug-in, which may help to improve new releases.
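The cache-then-classify flow described in this section can be sketched as follows. The actual plug-in is written in JavaScript and the back end in JSP, so this Python version only illustrates the control flow; the class name `SpamGate`, the `classify_remote` callable standing in for the back-end request, and the verdict strings are all hypothetical.

```python
class SpamGate:
    """Answer from the local blacklist/whitelist cache and ask the back end only for new URLs."""

    def __init__(self, classify_remote):
        # classify_remote: callable taking a URL and returning "spam" or "benign".
        # It stands in for the request to the back-end server (an assumption).
        self.classify_remote = classify_remote
        self.blacklist = set()
        self.whitelist = set()

    def check(self, url):
        if url in self.blacklist:
            return "spam"
        if url in self.whitelist:
            return "benign"
        verdict = self.classify_remote(url)
        if verdict == "spam":
            self.blacklist.add(url)
        else:
            self.whitelist.add(url)
        return verdict


if __name__ == "__main__":
    gate = SpamGate(lambda url: "spam" if "casino" in url else "benign")
    print(gate.check("http://example.com/casino"))  # first call reaches the stand-in back end
    print(gate.check("http://example.com/casino"))  # second call is served from the cache
```

Keeping both lists means a repeated URL never triggers a second back-end request, which is the behaviour the short waiting time depends on.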

Feature Selection and Extraction
Feature selection and extraction are crucial steps in the construction of a classifier. Several previous studies have proposed detection features that minimize the intra-class variability and maximize the inter-class variability (e.g., [33][34][35][36][37][38]). In general, the use of raw data for classification leads to classifiers with complex structures, thereby resulting in poor performance.
In addition to some known features from previous studies, we propose novel detection features that have not been used before to the best of our knowledge, as shown in Table 6. We calculated the CDF for the second feature in Fig 7A, the fifth feature in Fig 7B, the sixth feature in Fig 7C, and feature 7 in Fig 7D, thereby helping us to understand the nature of each feature, and thus the contribution of the features to the classifier's accuracy.
As shown in Fig 7A, 70% of the web spam and borderline pages had 18 links, whereas the benign pages had 10 links. Fig 7B shows that 90% of the benign web pages had 8 meta tags compared with 37 meta tags in the borderline and spam pages.
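Counts such as the number of links and the number of meta tags can be extracted from raw HTML with a simple tag counter. The study's back end uses the Java library jsoup for this purpose; the sketch below substitutes Python's standard `html.parser` purely for illustration, and the exact feature definitions (for example, counting only anchors that carry an `href`) are assumptions rather than the definitions in Table 6.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count hyperlinks and meta tags while the document is parsed."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.meta_tags = 0

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here by HTMLParser.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1
        elif tag == "meta":
            self.meta_tags += 1


def count_links_and_meta(html):
    """Return (number of links, number of meta tags) for an HTML string."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.links, counter.meta_tags


if __name__ == "__main__":
    page = ('<html><head><meta name="keywords" content="a,b"></head>'
            '<body><a href="/x">x</a></body></html>')
    print(count_links_and_meta(page))
```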
Similarly, Fig 7C and 7D show clearly that for features 6 and 7, the benign web pages were easy to distinguish from both borderline and spam web pages. For instance, 90% of the benign pages had 12 unique words from Google Trends compared with 25-30 words in both the borderline and spam pages. Furthermore, 90% of the benign web pages had 70 repeated words from Google Trends compared with 170-230 words in both the borderline and spam web pages. Features 6 and 7 were actually critical for distinguishing between spam and borderline URLs. In almost 50% of cases, the borderline and spam web pages differed from each other by 50-60 words (see Fig 7D). We also calculated the PDF for the same features, as shown in Fig 8.
Fig 9A shows that 6% of the spam pages had one hidden iframe, whereas this was the case for only 2% of the borderline and benign pages. It should be noted that although some detection features might not prove useful in isolation, employing multiple features for detection could result in better detection performance when distinguishing between benign and spam pages because these features may complement each other (see Fig 9B and 9C).
In Fig 10A, we note that there is one obvious peak where the PDF for the non-spam pages was much greater than that for the spam pages (the x-axis represents feature F2 and the y-axis represents feature F5, as in Table 6; the non-spam class is shown in red and the spam class in green). Fig 10B shows the delta values (i.e., $|P_n - P_s|(F2, F5)$, where $P_n$ and $P_s$ denote the PDFs for the non-spam and spam pages, respectively). Similarly, in Fig 10C, when the values of features F2 and F6 were relatively small, there was a clear peak where the PDF for the non-spam pages was greater than that for the spam pages. Fig 10D shows the delta values (i.e., $|P_n - P_s|(F2, F6)$). Similar observations can be made based on Fig 10E and 10F.
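The delta values for a pair of features can be approximated by binning both classes on a common grid and taking the absolute difference of the resulting probabilities per cell. The sketch below is a histogram-based approximation (probability mass per cell rather than a smoothed density); the bin width, the function names, and the way the figures were actually produced are assumptions.

```python
from collections import Counter


def joint_distribution(samples, bin_width):
    """Map each (feature_a, feature_b) pair to a grid cell and return cell probabilities."""
    if not samples:
        raise ValueError("samples must not be empty")
    counts = Counter((int(a // bin_width), int(b // bin_width)) for a, b in samples)
    total = len(samples)
    return {cell: n / total for cell, n in counts.items()}


def distribution_delta(non_spam, spam, bin_width=10):
    """Absolute difference between the non-spam and spam distributions for every cell."""
    p_n = joint_distribution(non_spam, bin_width)
    p_s = joint_distribution(spam, bin_width)
    return {cell: abs(p_n.get(cell, 0.0) - p_s.get(cell, 0.0))
            for cell in set(p_n) | set(p_s)}
```

Cells with a large delta mark the regions of the two-feature plane where the classes are easiest to separate, which is how the peaks in the delta plots are read.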

Classification and Evaluation
We tested four families of machine learning algorithms, with multiple variations, to build our classifier. First, we tested decision trees (C4.5, logistic model tree, random forest, and LogitBoost). Second, we tested a Bayesian network (BayesNet), which is a probabilistic graphical model that represents the relationships and conditional dependencies between a set of random variables using a directed acyclic graph. Third, we tested a support vector machine (SVM), a statistical algorithm that separates classification classes using a set of hyperplanes. Fourth, we tested a multilayer neural network, which comprises a set of interconnected processing units (the weights of these interconnections are calibrated during the training phase to obtain the required knowledge).
Understanding the similarity between spam and borderline web pages is important for the prior training of classification models (see Section 5). To build our classifiers, we considered the following scenarios: (i) two-class classification with only two classes: class 1 for spam and borderline web pages, and class 2 for benign pages; and (ii) three-class classification where we had three classes: spam pages, borderline pages, and benign pages.
The classifiers were configured using Weka (version 3.7.6) for both scenarios [39]. The parameter settings for the four algorithms are shown in Table 7. We performed 10-fold cross-validation for each of the classifiers, using a subset of the observations to build the classifier and then checking whether the classifier correctly flagged the held-out observations. To address the overfitting problem for the decision tree classifier, we used a pruning technique to reduce the size of the tree by eliminating tree nodes with low significance for classifying instances. Pruning also reduces the complexity of the classifier, which in turn helps to reduce the time required to execute the classifier in the browser plug-in. For the other classifiers, a validation threshold was used to stop the training process when the algorithm detected overfitting and misclassification increased in the validation set. In order to deal with the imbalanced data set, we used the Synthetic Minority Oversampling Technique (SMOTE), which oversamples the minority class in an imbalanced data set by creating "synthetic" examples. The letter "S" at the end of an abbreviation in the tables indicates that SMOTE was applied to the data set.
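The idea behind SMOTE is that each synthetic example is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The sketch below illustrates that interpolation step only; it is a simplified variant (the base sample is drawn at random instead of iterating over every minority sample), it is not the Weka filter used in the study, and the function names, `k`, and the seed are assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)                      # hypothetical seed, for repeatability
    synthetic = []
    for _ in range(n_synthetic):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()                         # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic
```

Because every synthetic point lies between two real minority samples, the oversampled class fills in its own region of the feature space instead of duplicating existing points.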
The results obtained after training the classifier in the three-class scenario are shown in Table 8, which demonstrates that decision trees performed the best, followed by the Bayesian network, multilayer neural network, and SVM classifiers. In particular, the random forest (RFT-S) decision tree scores were better than those produced by all of the other algorithms, with the highest precision (84%), F-measure (84%), and ROC (95%). However, we note that the detection accuracy was relatively low, partly due to two main causes: (1) the URLs in the spam and borderline classes (27% of the data set) were actually similar; and (2) spammers use clever tactics to evade detection by Google Penguin. To mitigate this, we then established classification models for the two-class scenario. Table 9 shows that the performance of the decision trees was better than that of the other classifiers (particularly the RFT-S algorithm, where DR = 87% and ROC = 93%). Similarly, the BayesNet-S classifier was ranked second, where DR = 86% and ROC = 93%, followed by the multilayer neural network and SVM classifiers. Tables 10 and 11 show the confusion matrices (i.e., error matrices) for the three-class and two-class classifiers, respectively. In each confusion matrix, the first row represents the actual class and the second row represents the predicted class or that classified by a given classifier. Thus, for the RFT-S algorithm, the number of correctly detected spam instances (i.e., TPs) was 87, the number of spam instances mistakenly flagged as borderline was seven, and the number of spam instances mistakenly flagged as non-spam was five. Similarly, the number of correctly detected non-spam instances (i.e., true negatives) was 85, the number of non-spam instances mistakenly flagged as borderline was nine, and the number of non-spam instances mistakenly flagged as spam was five.

Table 7. Parameters used in the decision tree, Bayesian network, support vector machine (SVM), and multilayer neural network methods (see Part II of the WEKA Manual for descriptions of the various algorithms used in our study [40]).
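The confusion matrix counts quoted above for the RFT-S algorithm can be turned into per-class detection shares by dividing the correctly classified count by the row total. The sketch below uses only the spam and non-spam rows given in the text; the borderline row is not quoted there and is therefore omitted, and the function name and dictionary layout are assumptions.

```python
def per_class_recall(rows):
    """rows maps each actual class to {predicted class: count}; returns correct / row total."""
    return {actual: predicted[actual] / sum(predicted.values())
            for actual, predicted in rows.items()}


# Counts quoted in the text for RFT-S in the three-class scenario.
rft_s_rows = {
    "spam": {"spam": 87, "borderline": 7, "non-spam": 5},
    "non-spam": {"spam": 5, "borderline": 9, "non-spam": 85},
}

if __name__ == "__main__":
    for label, share in per_class_recall(rft_s_rows).items():
        print(label, round(share, 3))
```

Precision cannot be derived from these two rows alone, because it also requires the counts of borderline pages predicted as spam or non-spam.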

Further Discussion
In this study, we used two public data sets (see Section 2) to show that spammers who target different languages behave differently and develop their own new tactics to influence the results obtained by search engine ranking algorithms. In fact, this issue has been recognized by search engine companies and they are considering the development of ranking algorithms that are global and language-independent as far as possible in their new releases. In most web spam data sets, however, search engine ranking algorithms were not considered when the data sets were constructed. In this study, we constructed a new data set to address this issue (see Section 3). Our data set was carefully selected to contain highly ranked web pages according to the Google Penguin ranking algorithm. However, this data set led to a concern about the effectiveness of the Google anti-spamming algorithm against spam pages containing Arabic content as well as other non-English languages. In particular, when the data set was examined using six security scanners, the results showed that a significant number of
In a further study (see Sections 5 and 6), we explored the effectiveness of multiple detection features using our data set and we evaluated different classifiers. Although some of our classifiers obtained a detection rate of 87%, which might be lower than the detection rates reported in other studies, we demonstrated that spammers employ clever techniques to avoid being detected by Google Penguin. We also confirmed the need to build more representative and realistic data sets that are suited to the context of the outputs obtained by search engines.

Related Work
Numerous previous studies have investigated the prevalence of web spam and various detection techniques have been proposed using different approaches. Gyongyi and Garcia-Molina proposed a web spam taxonomy after the web spam problem emerged in the early 2000s [2]. Heymann et al. were the first to survey the detection, demotion, and prevention of web spam [41]. Recent surveys of existing spam detection techniques and mechanisms have analyzed their advantages and disadvantages (e.g., [42] and [43]). It should be noted that spam and automated accounts in social networks have also contributed to the prevalence of web spam (e.g., see [44][45][46][47][48]). The detection features used for web spam in previous studies belong to two categories: (1) those that exploit topology and network-related data; and (2) those that exploit the web page content.
Gyongyi et al. [1] proposed an algorithm for identifying pages that are likely to be spam and those that are likely to be reputable (also see [49] and [50] for improved versions of the algorithm). Fetterly et al. [51] used statistical analysis to show that there are outliers in the statistical distribution of the linkage structure, page content, and page evolution properties in spam pages compared with benign web pages. Wu et al. [52] proposed some alternative methods for propagating trust on the web and used distrust to demote web spam. In addition, Castillo et al. [53] built a machine learning classifier that uses both link-based and content-based detection features, which obtained TP = 88.4% and FP = 6.3%. Svore et al. [33] built a classifier to identify web spam pages by training an SVM classifier based on a selected set of page attributes.
Ntoulas et al. [15] proposed a C4.5 decision tree classifier, which could detect 86.2% of the spam pages examined. Becchetti et al. [37] explored the best combinations of spam detection features and selected classifiers that achieved high precision (DR = 80.4%) using a small set of features. Furthermore, Abernethy et al. [54] proposed a machine learning classifier that employs an SVM variant for detecting web spam using both the page content and hyperlinks. Similarly, Becchetti et al. [55] proposed a link-based technique for detecting web spam pages by using a damping function for rank propagation and an approximate counting technique. By exploiting textual and extra-textual features in HTML source code, Urvoy et al. [56] investigated multiple HTML style similarity measures and proposed a flexible clustering algorithm for identifying web spam pages. In addition, Gan and Suel [57] proposed a classifier that uses the C4.5 decision tree algorithm and many detection features, including content-based and link-based features, which obtained a precision of around 88%. Webb et al. [58] identified a relationship between email and web spam, which they used to identify web spam. They also employed their method to collect a web spam corpus. Lee et al. [59] proposed a simplified swarm optimization method to address the complexity problem that affects statistical classification and machine learning approaches, which grows when there are a large number of web spam detection features.
Previous studies also considered linguistic-based detection features and evaluated their effectiveness at web spam classification (e.g., [36,60]). However, to the best of our knowledge, no previous studies have investigated the advantages of using linguistic-based features to improve web spam detection in a particular language.

Conclusion and Future Work
Google continues to improve its Penguin algorithm, but web spammers are also developing creative evasion mechanisms to increase their web page ranks with the aim of attracting more users. In fact, we consider that web spam will remain an attractive vector for both phishing attacks and malware distribution. In this study, we showed that Google's anti-spamming methods are ineffective against a high proportion of web spam pages that contain non-English content, which raises a concern that the insufficient testing of pages with non-English content could potentially encourage spammers to target these pages.
As an illustrative example, we developed and tested a classifier in the form of a browser anti-spam plug-in for detecting Arabic spam pages, and we showed that our classifier captured most of the web spam pages not detected by the Penguin algorithm. We also created a labeled Arabic web spam data set to evaluate our classifier and to encourage other researchers to build upon our work.
In future work, we plan to extend our web spam data set, create similar data sets for other languages, and develop custom classifiers for these languages. Spammers and Google search engine developers are continually improving their techniques to defeat each other, so future experimental studies are important for understanding new trends and directions. In recent years, large-scale spamming campaigns using compromised Web sites have been performed to corrupt search engine results. These spamming campaigns are an emerging trend that needs to be investigated. Using Google Penguin as a benchmark, our illustrative example shows that language-based web spam classifiers are more effective at capturing spam content. We consider that the web spam problem requires a continuous effort from search engines as well as developers and webmasters based on appropriate vetting of their sites, and end-users should also report spam content.