Abstract
The rapid advancement of technology has enabled the collection of detailed, multi-dimensional user data, paving the way for multi-criteria recommendation systems that consider diverse aspects of user preferences. While traditional recommendation systems aim to satisfy individual users, group recommendation systems are designed to generate suggestions that accommodate the collective preferences of a group. However, the increasing prevalence of group interactions in digital environments has also introduced new vulnerabilities, such as group shilling attacks, where coordinated malicious users manipulate recommendation outcomes. This study conducts the first comprehensive robustness analysis of multi-criteria group recommender systems, addressing a critical research gap. A novel shilling attack strategy is proposed by adapting the group shilling model to multi-criteria settings, allowing a deeper understanding of the risks these systems face. Experimental results indicate that the proposed multi-criteria recommender system achieves notable robustness across datasets. Specifically, the average hit ratio (AvgHR) increases up to approximately 12% on the YM20 dataset and reaches around 15% on the YM10 dataset. Furthermore, among the target item selection strategies, the MUP-NNZ method consistently demonstrates superior resistance to profile injection attacks, confirming its effectiveness in maintaining recommendation accuracy under adversarial conditions.
Citation: Turkoglu Kaya T (2025) Multi-criteria group shilling attacks. PLoS One 20(12): e0338319. https://doi.org/10.1371/journal.pone.0338319
Editor: Qinglin Meng, State Grid Corporation of China, CHINA
Received: August 27, 2025; Accepted: November 17, 2025; Published: December 11, 2025
Copyright: © 2025 Tugba Turkoglu Kaya. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All MATLAB source code used to generate the results in this manuscript, together with the processed datasets, have been deposited in the public GitHub repository: https://github.com/tugba7203/mcrs-shilling-attack-codes.
Funding: This study was supported by Ardahan University Scientific Research Project Commission, Turkey under the grant no: 2025-002. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. No authors received a salary from the funders.
Competing interests: The author declares that there are no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
1 Introduction
The rapid advancement of science and technology in recent years has enabled the collection of large-scale and detailed user data through online platforms. User comments, ratings, and feedback now form extensive datasets that reflect individual preferences and behaviors. Recommender systems analyze these heterogeneous data sources to help users make informed decisions efficiently, enhancing user satisfaction by reducing the time spent on product selection. To further improve recommendation accuracy, many modern systems allow users to provide more detailed, multi-dimensional feedback instead of a single overall score. For instance, major hotel booking platforms enable users to rate criteria such as cleanliness, breakfast, and staff friendliness; Amazon.com offers rating dimensions for video games; and eBay allows users to evaluate sellers based on multiple aspects. Similarly, Yahoo!Movies lets users rate movies across sub-criteria such as acting, direction, story, and visuals. These systems, known as multi-criteria recommender systems (MCRS), enable a more comprehensive understanding of user preferences and contribute to higher recommendation accuracy [1].
Traditional recommender systems—whether single- or multi-criteria—provide personalized recommendations for items users may like but have not yet experienced [2]. However, many real-world activities are shared experiences, such as dining with colleagues, watching movies with family, or exercising with friends. In these contexts, the focus shifts from individual to collective preferences. To address this need, group recommender systems (GRS) have been developed as an advanced extension of traditional recommenders. GRS aim to generate recommendations that satisfy the collective interests of multiple users by aggregating their individual evaluations—whether single- or multi-criteria—thereby improving satisfaction and decision-making efficiency in group settings.
Companies utilizing group recommendation systems, as well as other recommendation system methods (including multi-criteria recommendation systems), must ensure that they provide high-quality recommendations to avoid disappointing their customers. Recommending inappropriate or irrelevant products may lead users to switch to competing e-commerce platforms, resulting in customer loss and a decline in the company’s competitive standing. Additionally, malicious individuals or businesses can exploit these systems to manipulate recommendations for their own benefit. By targeting specific products, they may attempt to artificially increase or decrease their popularity. To achieve this, they can introduce fake user profiles into the system. Such activities are commonly referred to as “shilling attacks” or “profile injection attacks” [3].
Shilling attacks in recommender systems are typically classified into two categories: push attacks, which aim to artificially promote the popularity of targeted items, and nuke attacks, which attempt to diminish their visibility or perceived value. While traditional shilling attacks are often executed through the independent injection of fake user profiles, a more sophisticated and dangerous variant emerges when multiple malicious actors operate in a coordinated and covert manner. This strategy, widely known as a group shilling attack, poses a substantially greater threat to the integrity and trustworthiness of recommender systems compared to conventional individual-based attacks [4].
Despite the significance of this threat, the existing body of research reveals a critical gap: the vast majority of prior studies on shilling attacks have focused on single-criteria recommender systems. Even within the domain of multi-criteria recommenders, recent works [5,6] have concentrated primarily on individual-level attack detection rather than on group-level robustness analysis. To the best of our knowledge, no prior research has systematically investigated group shilling attacks within multi-criteria group recommender systems (MC-GRS), where recommendations are based on multiple user preference dimensions and collective decision-making mechanisms. This absence represents a serious limitation in the literature, given that MC-GRS are increasingly deployed in real-world digital environments and are inherently more vulnerable to coordinated adversarial manipulation.
Motivated by this gap, the present study undertakes the first comprehensive robustness analysis of multi-criteria group recommender systems under group shilling attacks. To this end, we propose a novel attack strategy specifically tailored to multi-criteria group settings, exposing critical vulnerabilities and quantifying the extent to which such systems can be manipulated. Furthermore, we develop a systematic evaluation framework to rigorously assess the impact of these attacks, thereby offering actionable insights for designing more resilient, accurate, and trustworthy recommender systems. By bridging this crucial gap, our work not only advances the theoretical understanding of adversarial threats in recommender systems but also contributes practical guidelines for safeguarding next-generation recommendation technologies against coordinated manipulation.
The contributions of the study are listed below:
- This study provides the first comprehensive robustness analysis of multi-criteria group recommender systems, thereby addressing a significant and previously unexplored gap in the existing body of literature. The analysis offers a holistic perspective on the reliability and security of such systems under adversarial conditions.
- The research makes a substantial contribution to the understanding of group shilling attacks, particularly within the context of multi-criteria recommendation settings, which have been largely neglected in prior studies. It delivers in-depth insights into the ways in which group interactions may be exploited and manipulated, thus filling an important void in current knowledge.
- A novel shilling attack strategy is introduced by specifically adapting the traditional group shilling model to multi-criteria group recommender systems. This adaptation reveals critical system vulnerabilities and, at the same time, provides valuable guidance for the development of more resilient recommendation frameworks and defensive mechanisms.
This study consists of 8 sections. Sect 2 reviews the literature related to GRS, and Sect 3 introduces the methods used throughout the study. Sect 4 presents the proposed method, while Sect 5 describes the dataset, testing methodology, and performance metrics. Sect 6 reports the implementation and results of the proposed methods and approaches, and the final section presents the general conclusions and suggestions drawn from the study.
2 Related work
With the rapid development of the Internet and the emergence of e-commerce sites, recommendation systems create personalized product lists using user–product feedback and the similarities between users. In addition, individuals by nature tend to act together, interacting with people they know for many activities. In the mechanism called group recommendation systems, the target audience is a community of users who come together for various reasons and have to act together, rather than individual users. However, both recommendation systems and group recommendation systems are vulnerable to attacks by malicious users, such as the insertion of fake profiles, the so-called shilling attacks. This vulnerability has led many recent studies to examine the trustworthiness of the presented recommendations from different perspectives. The authors of [3], who first coined the term shilling, suggested two basic attack models, the random and average attack models. The random attack (called RandomBot) [3,7] can be applied easily but is not a very effective model [3,7], whereas the average attack (called AverageBot) [3,8] is more difficult to implement because it requires high knowledge of the system. Other attack studies include [7–15].
In traditional shilling attacks, malicious users work individually to add fake profiles in order to manipulate recommendation systems (including group recommendation systems) into recommending or not recommending certain products. In practice, however, a group of attackers can work together to affect the services of recommendation systems by adding fake profiles to the system. Such coordinated attacks, referred to as group shilling attacks, are more harmful than traditional shilling attacks, and most existing work has focused on detecting them rather than analyzing their impact.
This concept was first introduced by Su et al. [4], who noted the impact of attacker groups exhibiting the same shilling behavior. Wang et al. [16], who demonstrated that high diversity should be taken into account when creating shilling groups, extended the group shilling attack generation algorithm. The detection of such group shilling attacks was addressed by Wang et al. [17], who presented a shilling group detection method based on a graph convolutional network. Another study [18] proposes a graph embedding-based method to detect group shilling attacks in collaborative recommender systems. Other studies on group shilling attacks and their detection include [19–23].
In the context of multi-criteria recommendation, Kaya and Kaleli [6] developed a novel top-n recommendation framework based on a new neighborhood selection process (NSP) utilizing entropy and association rule mining (ARM), effectively improving accuracy through the analysis of user and item characteristics. Turkoglu Kaya et al. [5] further contributed by proposing a classification-based shilling attack detection model specifically designed for multi-criteria recommender systems, focusing on identifying fake profiles through user–item interaction patterns and popularity-driven features. These studies significantly improved robustness and personalization in multi-criteria settings but primarily addressed individual-level recommendation scenarios.
Meanwhile, robustness in deep learning–based recommendation has also attracted significant attention. Wang et al. [24] proposed a Distributionally Robust Graph-based Recommender System (DR-GNN), incorporating Distributional Robust Optimization (DRO) to mitigate performance degradation under distributional shifts, ensuring stability in graph neural network (GNN) recommenders. Similarly, Boratto et al. [25] examined robustness and fairness trade-offs in GNN-based recommenders under edge-level perturbations, revealing how fairness between providers and consumers is affected by adversarial noise. These works highlight the growing importance of robustness under structural and fairness-oriented perturbations.
To bridge these robustness-oriented advances with existing attack and detection research, Table 1 summarizes key studies addressing shilling and group shilling attacks in recommender systems.
Despite these advancements, none of the existing studies have systematically investigated robustness in multi-criteria group recommender systems (MC-GRS) under coordinated group shilling attacks. Prior research has primarily focused on either robustness in single-user multi-criteria recommenders or perturbation-resilient graph-based systems. In contrast, the present study provides the first comprehensive robustness analysis of MC-GRS under adversarial manipulation, introducing a novel group-oriented attack model and an evaluation framework to quantify its impact. This approach bridges the gap between multi-criteria and group-level robustness, contributing new insights into the design of resilient recommendation architectures.
A review of the literature shows that studies addressing attacks on multi-criteria group recommendation systems are lacking and that existing work focuses more on the detection of group shilling attacks than on their impact. To fill this gap, this study examines the robustness of group recommendation systems and the impact of group shilling attacks on multi-criteria systems.
3 Preliminaries
In this section, the necessary information about the methods used in the study is given.
3.1 Group recommender systems
Group recommendation systems (GRS) analyze the characteristics and tendencies of a community of users who act together in a shared environment and, rather than producing individual recommendations, generate group recommendations intended to satisfy that community to the greatest possible extent [30]. In these systems, the first step in generating group suggestions is to identify communities of similar users. Among the four group-formation approaches identified in the literature, the steady growth in the number of users and the constant change in user preferences have shown that automatic group definition is the most successful and appropriate method in such scenarios [31,32]. Automatically defined groups succeed because they contain users with similar tastes who are compatible with one another, rather than individuals brought together at random, which makes it easier to satisfy the group with the suggestions offered [33]. The most common way to identify similar and compatible user groups is to partition users with a clustering algorithm without any restrictions [34]. Among clustering algorithms based on users' evaluations of products, the k-means method is frequently used for automatic group identification due to its efficiency and applicability.
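As a minimal illustration of automatic group identification, the sketch below clusters users by their ratings with MATLAB's kmeans; the toy rating matrix, the number of groups, and the distance settings are illustrative assumptions rather than values prescribed in this paper.

```matlab
% Minimal sketch: automatic group identification with k-means (illustrative only).
% R is a users-by-items matrix of overall ratings, with 0 marking unrated items.
rng(1);                              % fix the seed for reproducibility
R = randi([0 5], 200, 50);           % toy rating data standing in for YM10/YM20
k = 10;                              % illustrative number of groups
groupIdx = kmeans(R, k, 'Distance', 'sqeuclidean', 'Replicates', 5);

% Collect the members of each group Gn for the later aggregation step.
groups = cell(k, 1);
for g = 1:k
    groups{g} = find(groupIdx == g);
end
```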
After user groups are created, products or services that satisfy the individuals in the group to the greatest possible extent must be offered. For this purpose, recommendations are generally produced from group profiles, which are created by combining the rating values of group members for products/services using methods called aggregation techniques, so that the group's preferences for the relevant products/services are reflected [34]. These aggregation techniques play a central role in generating group suggestions and directly affect their quality [35]. Many techniques have been developed in the literature, including Average (Avg) [36–39], Average without Misery (AwM) [40], Multiplicative (Mul) [41], Additive Utilitarian (AU) [34,42], Approval Voting (AV) based on rating frequency [34,43], Most Pleasure (MP) and Least Misery (LM), which use the highest and lowest ratings [34,41,44–46], and Most Respected Person (MRP) [47], which takes influential group members into account.
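A minimal sketch of several of these aggregation techniques is given below, assuming a small group rating block in which zeros mark unrated items; the misery threshold and the toy values are illustrative choices, not parameters reported in this paper.

```matlab
% Minimal sketch of common aggregation techniques for one group (illustrative).
% Rg is a members-by-items block of ratings for a single group; 0 means unrated.
Rg   = [5 4 0 2; 4 4 3 1; 5 3 4 0];       % toy group profile
Rnan = Rg; Rnan(Rnan == 0) = NaN;         % treat zeros as missing values

avgScore = mean(Rnan, 1, 'omitnan');      % Average (Avg)
lmScore  = min(Rnan, [], 1, 'omitnan');   % Least Misery (LM)
mpScore  = max(Rnan, [], 1, 'omitnan');   % Most Pleasure (MP)
mulScore = prod(Rnan, 1, 'omitnan');      % Multiplicative (Mul)

% Average without Misery (AwM): drop items any member rated below a threshold.
tau = 3;                                  % assumed misery threshold
awmScore = avgScore;
awmScore(any(Rnan < tau, 1)) = NaN;

% Group top-n list: rank items by the chosen aggregated score.
[~, order] = sort(avgScore, 'descend', 'MissingPlacement', 'last');
topN = order(1:2)                         % e.g., top-2 items for this toy group
```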
3.2 Group shilling attack
The concept of group shilling attacks in recommendation systems was proposed by Su et al. [4] and two scenarios for such attacks were presented. Based on this concept, Wang et al. [16] presented a new group shilling attack model to increase the diversity within the group and keep the similarity relationships between each pair of attackers to a minimum. The method they propose makes attack detection difficult as well as successful attacks.
In the developed attack model, two new versions have been proposed, denoted GSAGens (strict version) and GSAGenl (loose version). In a shilling group created with the GSAGens model, the attackers have no commonly rated products other than the target product, whereas in the GSAGenl model the attackers share a commonly rated target item, each pair of attackers has rated at most two filler items in common, and each of these filler items is rated by at most two attackers. By these definitions, GSAGens imposes the stricter conditions. Each version can be further divided into two types, (GSAGensRan, GSAGensAvg) and (GSAGenlRan, GSAGenlAvg), formed by the random attack model and the average attack model, respectively. Fig 1 briefly illustrates the group attack models [19].
As shown in Fig 1, GSAGens has stricter conditions for creating group attack profiles, so the size of the attack groups created by this model is limited. Therefore, we use the loose version (GSAGenl) to create group attack profiles in this article to ensure the effect of group shilling attacks.
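To make the loose-version conditions concrete, the following MATLAB sketch checks a candidate shilling group against the two GSAGenl constraints described above (each pair of attackers shares at most two filler items, and each filler item is rated by at most two attackers); the function name and the 0/1 filler-assignment matrix are illustrative, not part of the original algorithm.

```matlab
function ok = isValidGSAGenl(F)
% isValidGSAGenl  Illustrative check of the loose (GSAGenl) group constraints.
% F is an attackers-by-filler-items indicator matrix: F(a,j) = 1 if attacker a
% rates filler item j. The shared target item is handled separately.

    % Condition 1: every pair of attackers has at most two common filler items.
    pairOverlap = F * F.';                      % co-rating counts between attackers
    pairOverlap(1:size(F,1)+1:end) = 0;         % ignore the diagonal (self-overlap)
    cond1 = all(pairOverlap(:) <= 2);

    % Condition 2: every filler item is rated by at most two attackers.
    cond2 = all(sum(F, 1) <= 2);

    ok = cond1 && cond2;
end
```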
4 Multi-criteria group shilling attack
In an attack on multi-criteria group recommendation systems, which contain versatile user data, the attacker's aim is to manipulate the products in the group recommendation list and ensure that fake products appear in the list instead of recommendations that would please the group. Therefore, in the group attack methodology created for this purpose, we assume that the attacker injects shilling groups into the system and that the fake users in these groups rate target products that real users dislike or that are unpopular. The attack intent is thus push, as the aim is to force the wrong products into the list. An important question here is how to distribute the fake profiles homogeneously across the groups so that the attack succeeds for each group. As mentioned in Sect 3.2, the group shilling models designed for this purpose are GSAGens and GSAGenl. The literature shows that the GSAGenl attack model is more suitable for such studies, since the GSAGens model has very strict rules and limits the user groups that can be injected into the system. Therefore, in this study, the GSAGenl model is chosen to create an attack profile called the multi-criteria group shilling attack (MC-GSA). Another critical point for MC-GSA is deciding how to design the attack and determine the attack-specific parametric values in a system consisting of multiple criteria.
The overall framework of the proposed method, which performs the robustness analysis of multi-criteria group recommender systems, is illustrated in Fig 2. In this section, we first explain the design of the GSAGenl attack model and describe how the parametric values specific to the multi-criteria setting are determined.
As shown in Fig 2, the framework outlines the complete workflow of the proposed MC-GSA model, including the generation of attack profiles, clustering of users into groups, and the construction of group-based recommendation lists. To enhance the clarity and readability of the diagram, a legend has been added below, defining all symbols and notations used in the figure.
Legend:
- ui : Normal users in the training dataset
- ai : Attack users in the shilling groups
- rmax : Maximum rating value assigned to target items
- Filler ratings : Rating values assigned to filler items
- Gn : Formed user groups after the clustering process
- Top-n List : Final recommended items for each group
After introducing the overall structure, the operation of the proposed model can be described in four main steps, each representing a distinct stage of the robustness analysis process.
- Training Dataset: The initial dataset consists of genuine users (ui) who provide ratings for multiple items across different criteria. This dataset reflects the normal operation of the recommender system.
- Shilling Groups: To simulate the attack, synthetic user profiles (ai) are generated and injected into the dataset. These attack profiles apply specific strategies, such as assigning maximum ratings (rmax) to target items or using pre-defined filler values, in order to bias the recommendation outcome.
- Clustering Process: The combined dataset of genuine and attack users is processed through a clustering algorithm. Users are grouped into clusters (Gn) according to similarities in their rating patterns, which reflects the group-based recommendation scenario.
- Top-n Recommendation Lists: For each group, a Top-n list is generated. As illustrated in the figure, some of the target items (highlighted in red) are promoted within the recommendation lists, revealing the manipulative effect of group shilling attacks on multi-criteria group recommender systems.
This framework demonstrates how adversarial user profiles can significantly distort group recommendation results and emphasizes the importance of designing robust defense mechanisms against such attacks.
4.1 Designing attack specific-parameters
The contents of the MC-GSA profiles designed for multi-criteria group recommendation systems are shown in Table 2. According to Table 2, a group attack user profile in MC-GSA consists of the following components:
Selected Items (IS) is empty for the MC-GSA (GSAGenl) model.
Filler Items (IF): The set of filler items rated by attacker i in the attack group. Fillers are very important in group shilling attacks; whereas filler sets in traditional shilling attacks are chosen randomly, filler selection in group shilling attacks is more meticulous.
Target Items (IT): The target item receives the maximum or minimum rating value depending on the purpose of the attack. The literature [48] shows that the choice of target product is significant for the success of the attack. Therefore, target item selection is carried out using four different methods, which are described below (a short code sketch follows the list).
- Most Unpopular - Unsatisfaction Ratio (MUP-UsR): The items with the lowest satisfaction ratio in the system are selected, i.e., the items with the highest ratio of ratings below the threshold value to the total number of ratings given for that item. It can be calculated using the formula given in Eq 1:

$\mathrm{UsR}_i = \dfrac{\left|\{u \in U_i : r_{u,i} < \theta\}\right|}{|U_i|}$ (1)

- where |Ui| is the number of users who rated item i, r_{u,i} denotes user u's rating for item i, and θ is the threshold value.
- Most Unpopular - Dominant Ratio (MUP-DR): These are the items for which dissatisfaction is most dominant. The score is the ratio of the number of ratings above a certain threshold to the number of ratings below that threshold, and the items with the lowest ratio are selected. It can be calculated using the formula given in Eq 2:

$\mathrm{DR}_i = \dfrac{\left|\{u \in U_i : r_{u,i} > \theta\}\right|}{\left|\{u \in U_i : r_{u,i} < \theta\}\right|}$ (2)

- where |Ui| is the number of users who rated item i.
- Most Unpopular - Number of Non Zero (MUP-NNZ): Target items are those that have received the fewest user ratings.
- Most Unpopular - Sum (MUP-Sum): Target items are those that have received the lowest sum of user rating values.
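As referenced above, the following is a minimal MATLAB sketch of how the four selection scores could be computed from an overall rating matrix; the toy data, the threshold value, the zero-means-unrated convention, and the candidate-pool size of 50 (matching Sect 5.2) are assumptions made for illustration.

```matlab
% Minimal sketch: scoring items for the four target-selection methods (illustrative).
% Ro is a users-by-items matrix of overall ratings, with 0 meaning "not rated".
Ro    = randi([0 5], 300, 80);            % toy data in place of YM10/YM20
theta = 3;                                % assumed dissatisfaction threshold

rated    = Ro > 0;
numRated = sum(rated, 1);                 % |Ui| per item
below    = sum(Ro > 0 & Ro < theta, 1);   % ratings below the threshold
above    = sum(Ro > theta, 1);            % ratings above the threshold

usr      = below ./ max(numRated, 1);     % MUP-UsR: pick items with the HIGHEST ratio
dr       = above ./ max(below, 1);        % MUP-DR : pick items with the LOWEST ratio
nnzCount = numRated;                      % MUP-NNZ: pick items with the FEWEST ratings
sums     = sum(Ro, 1);                    % MUP-Sum: pick items with the LOWEST rating sum

% Candidate pools (e.g., 50 items each), from which n target items are drawn.
[~, candUsR] = maxk(usr,      50);
[~, candDR ] = mink(dr,       50);
[~, candNNZ] = mink(nnzCount, 50);
[~, candSum] = mink(sums,     50);
```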
Following the design of the attack user groups, the statistical parameters need to be addressed. In single-criteria systems, the statistical parameters required for the attack profile are produced using all rating values. However, the presence of more than one criterion in multi-criteria systems means that these attack-specific features can be estimated in more than one way. For this purpose, in this study the following three different scenarios (overall-based, criteria-based, aggregated-based) are taken into account to calculate these parametric values in the multi-criteria system:
General-based (S1): In this scenario, which relies on the overall rating as the most reliable information in the multi-criteria system, the process proceeds as in a single-criteria system. Therefore, the parametric values and target/filler products required for the attack are selected based only on the overall criterion.
Criteria-based (S2): The criteria-based scenario is intuitively the best way to use all the information available in the system in an encapsulated way, since separate values for each sub-criterion can help attack profiles resemble real users [49]. In this scenario, the attack process is carried out for each criterion one by one, and the average of the performance evaluations is taken. A small sketch contrasting these two settings is given below.
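To make the distinction concrete, the sketch below contrasts how a simple attack statistic (per-item means, as used by average-style profiles) could be derived under the two settings; the tensor layout, the placement of the overall rating in the first slice, and the use of item means are illustrative assumptions, not the exact quantities reported in this paper.

```matlab
% Illustrative sketch of S1 (overall-based) vs S2 (criteria-based) parameter estimation.
% R is a users-by-items-by-criteria tensor; slice 1 is assumed to hold the overall
% rating and slices 2..C the sub-criteria; 0 marks unrated entries.
R  = randi([0 5], 300, 80, 5);
Rn = R; Rn(Rn == 0) = NaN;

% S1: compute item statistics from the overall criterion only.
itemMeanS1 = mean(Rn(:, :, 1), 1, 'omitnan');

% S2: compute item statistics separately for every criterion; the attack is then
% run per criterion and the performance results are averaged afterwards.
itemMeanS2 = squeeze(mean(Rn, 1, 'omitnan'));   % items-by-criteria matrix
```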
4.2 Attack methodology for MC-GSA
In this section, the steps of MC-GSA methodologies performed on multi-criteria group recommendation systems are given.
When the algorithmic steps of the MC-GSA method given in Algorithm 1 are examined, the GSAGenl approach, in which attack profiles have minimal relationships with each other, is adopted for creating attack profiles, unlike traditional methods. After the fake user groups are created, these profiles are added to the system, the grouping process is performed, and the attack process is started. Steps such as producing predictions and preparing recommendation lists for user groups are carried out according to the different scenarios described in Sect 4.1: in S1 the process is run on the overall criterion, while in the second scenario (S2) the process is run for each criterion and the average of the performance evaluations is taken. In the final stage, a recommendation list is created and the success of the attack is evaluated.
Algorithm 1. MC-GSA methodology.
Complexity Note. The overall computational complexity of the proposed MC-GSA algorithm can be expressed in terms of the number of users (U), items (I), criteria (C), injected attack profiles (A), groups (G) and their average size, the number of clusters (kc), the number of clustering iterations (t), and the total number of evaluated items H = 100 + n. The dominant computational cost arises from the clustering and prediction phases, which scale approximately linearly with the number of users and items. Therefore, the method is computationally feasible for medium-scale multi-criteria datasets and can be further optimized through parallelization or sampling for large-scale applications.
In practice, however, the runtime behaviour is also affected by the attack design: since push attacks are applied to multiple target products concurrently, the procedure generates and evaluates synthetic user profiles on-the-fly. These synthetic profiles consist of filler and target item sets (MUP-UsR, MUP-DR, MUP-NNZ, MUP-Sum), as given in Table 2.
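Because Algorithm 1 itself is presented as a figure, the following is a highly simplified, self-contained MATLAB sketch of the workflow described above (profile injection, clustering, group aggregation, top-n evaluation). The toy data, the 10% attack size and 5% filler size, the MUP-NNZ target rule, the average-style filler ratings, and the plain Average aggregation are illustrative assumptions; the GSAGenl overlap constraints and the multi-criteria scenarios are omitted for brevity.

```matlab
% Highly simplified, runnable sketch of the MC-GSA workflow (illustrative only).
rng(2);
Ro      = randi([0 5], 300, 80);             % genuine users' overall ratings (toy)
nList   = 10;                                % top-n list size
nAttack = round(0.10 * size(Ro, 1));         % 10% attack size, as in Sect 5.2
nFiller = round(0.05 * size(Ro, 2));         % 5% filler size

% Target items: here the MUP-NNZ rule (fewest ratings); other rules work the same way.
[~, targets] = mink(sum(Ro > 0, 1), nList);

% Build shilling profiles: rmax on every target item, average-like ratings on
% randomly chosen fillers (a stand-in for the average-attack filler strategy).
itemMean = sum(Ro, 1) ./ max(sum(Ro > 0, 1), 1);
Rattack  = zeros(nAttack, size(Ro, 2));
for a = 1:nAttack
    fillers = randperm(size(Ro, 2), nFiller);
    Rattack(a, fillers) = round(itemMean(fillers));
    Rattack(a, targets) = 5;                 % rmax for a push attack
end

% Cluster genuine and attack users together, then build group top-n lists with
% simple Average aggregation and count how many targets reach each list.
Rmix     = [Ro; Rattack];
k        = 10;
groupIdx = kmeans(Rmix, k, 'Replicates', 5);
hits = zeros(k, 1);
for g = 1:k
    score    = mean(Rmix(groupIdx == g, :), 1);
    [~, ord] = sort(score, 'descend');
    hits(g)  = numel(intersect(ord(1:nList), targets)) / nList;
end
avgHR = mean(hits)                            % fraction of targets reaching the top-n lists
```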
5 Experiments
This section examines the impact of MC-GSA on group recommendation systems. First, the data set, experimental design, and performance evaluation metrics used in the study are briefly described.
5.1 Dataset
In the study, Yahoo! Movies (YM), a well-known multi-criteria data set in the field of RS and group RS, is used [50,51]. These researchers collected rating data from the Yahoo!Movies website with a web crawler. All considered movies are part of the single-rating dataset provided by Yahoo!Research (https://webscope.sandbox.yahoo.com). In the YM data set, users stated their preferences based on four criteria, i.e., acting, story, direction, and visuals. In addition to the individual criteria ratings, users gave an overall grade for each movie to reflect their overall opinion. We utilize two subsets of the dataset, YM10 and YM20, which are converted to a 1–5 scale. The YM10 dataset includes data with at least 10 reviews for each user and movie, while the YM20 dataset sets the threshold at 20 reviews.
Information on the datasets is shown in Table 3.
5.2 Experiment design
Our experiments split the dataset into training and test sets using 10-fold cross-validation, following the attack methodology outlined in the previous sections. Attack profiles are created using two different attack models, i.e., GSAGenlRan and GSAGenlAvg, with the target products obtained from the training data set. The training dataset is then updated to include the attack profiles. Next, predictions are produced by adding n (5, 10, or 15) target products to 100 products that user u has not rated in the test dataset; because the subsets of the dataset are small, the size of this held-out product set is also kept small. From this list of 100 + n products, user u is presented with a list of the n products they are predicted to like most, and a performance evaluation is made.
The parameter selections required for the schema are given below.
Attack user profiles: The synthetic user profiles added to the system consist of IF and IT sets according to the MC-GSA model. How these products are selected has been described in the previous sections. Since the aim of the attack is to include the target products in the top-n list (push attack), the rating value of the target items is set to rmax (= 5). The number of synthetic profiles added to the system varies for each data set. Previous studies have shown that choosing this ratio between 5% and 10% is very effective [52,53]. In this study we set the attack size to 0.1, since the attack size affects the cost/benefit of the attack and a value greater than 10% is not considered appropriate in real-world applications [54]. Therefore, the number of attack users is chosen as 183 and 43 (10%) for YM10 and YM20, respectively.
Filler item selection: The two methods used for selecting filler products were given in the previous section (Sect 3.2). The number of filler items in a synthetic user profile is 15 (1% of the items in the dataset) or 74 (5%) for the YM10 dataset, and 4 and 21, respectively, for the YM20 dataset.
Target item selection: As stated in the previous sections, the target products are selected using four different methods (MUP-UsR, MUP-DR, MUP-NNZ, and MUP-Sum). In the first, MUP-UsR, which requires high knowledge, the products with the lowest satisfaction rates in the system are selected; specifically, the products with the highest ratio of ratings below the threshold value to the total number of ratings given for that product are chosen. MUP-DR, which requires low knowledge, focuses on products where dissatisfaction is particularly dominant. Similarly, the other low-knowledge methods (MUP-NNZ, MUP-Sum) take into account the number of ratings and the rating sums of all products.
The number of target products selected from the first 50 products determined according to these four methods equals the number n of products in the top-n recommendation list presented to the user. Therefore, 5, 10, or 15 target items are randomly selected from these 50 products in this study.
Test variations: In this study, we use 2 filler item selection methods, 2 filler item sizes, 4 target product selection types, 3 target item sizes, 1 attack size, and 2 scenarios on 2 data sets (YM10, YM20). The selection methods and the created attack user profiles are applied to each criterion in the multi-criteria data set, following the approach adopted by Adomavicius [1], as well as to the overall criterion. All calculations and experimental results are produced using MATLAB R2022a.
5.3 Evaluation metrics
Robustness metrics have been developed to evaluate such attacks on RS and GRS [52,53]. For example, the Average Hit Ratio (AvgHR) measures the proportion of users whose target products are included in the top-n list, while the Weighted Rank Value (WRV), a metric recently proposed in [5], reflects the positions at which the target products appear in the top-n list.
AvgHR: This is the rate at which the target products appear in the top-n lists created for the user groups. For each user, the number of target products included in the list is divided by the list size to obtain the hit rate [5]. A high value of the AvgHR metric, calculated using Eq 3, indicates that the attack is highly effective.
Here, IT and UT denote the set of target products and the set of target users, respectively, and n is the size of the recommendation list presented to the user.
WRV: The metric proposed by Kaya and Kaleli [5] considers the positions at which the target products appear in the manipulated recommendation lists, on the premise that target products ranked near the top have a significantly larger impact on the success of the attack. Therefore, the success of the attack is expressed through the ranks at which the target products appear in the top-n list. A high WRV, calculated using Eq 4, shows that the attack is quite effective.
The range of values obtained with this metric is 0–11.4167, 0–29.2897, and 0–49.7734 for top-5, top-10, and top-15, respectively.
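For concreteness, the sketch below computes a per-list hit ratio as described above, together with one possible WRV formulation. The n/rank weighting is an inference from the reported upper bounds (n·(1 + 1/2 + … + 1/n) equals 11.4167, 29.2897, and 49.7734 for n = 5, 10, 15), not the authors' published formula; the toy lists are purely illustrative.

```matlab
% Illustrative computation of a hit ratio and one WRV formulation for a single top-n list.
% topList holds item ids in ranked order; targets is the set of target items.
n       = 10;
topList = [12 7 3 44 21 9 70 15 2 33];        % toy top-10 list
targets = [7 44 15 99 5 61 80 13 27 50];      % toy target items

isHit = ismember(topList, targets);           % which positions hold target items
hr    = sum(isHit) / n;                       % per-list hit ratio (averaged over lists for AvgHR)

% Assumed rank weighting: a target at position k contributes n/k, so a fully
% manipulated list scores n*(1 + 1/2 + ... + 1/n), matching the reported maxima
% (11.4167, 29.2897, 49.7734 for top-5/10/15). This is an inference, not the
% authors' published formula.
positions = find(isHit);
wrv = sum(n ./ positions);

fprintf('HR = %.4f, WRV = %.4f (max %.4f)\n', hr, wrv, n * sum(1 ./ (1:n)));
```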
6 Experimental results
In this study, the effects of group shilling attacks on multi-criteria systems are examined on two different data sets, for different parameter values and two different scenarios. The performance of the attack when user groups are presented with top-5, top-10, and top-15 product recommendation lists for the YM20 data set is given in Table 4. The table examines the effects on the system of the two attack types (GSAGenlRan and GSAGenlAvg), the cluster sizes (1%, 5%, 10%), and the target item selection strategies (MUP-UsR, MUP-DR, MUP-NNZ, MUP-Sum) within the scope of the two scenarios (S1 and S2). Performance is evaluated based on the rates at which the attacked target items appear in different top-n ranges (Top-5, Top-10, Top-15). Table 4 reports the AvgHR results of multi-criteria group shilling attacks on the YM20 dataset for a filler size of 1%. AvgHR reflects the average fraction of target items that appear in the Top-n recommendation lists. For instance, an AvgHR value of 0.05 at Top-10 means that, on average, 0.5 target items are included in each recommendation list, i.e., approximately one fake item appears in every two lists. This provides a clear indication of how strongly the attack is able to infiltrate the recommendation outcomes.
A general examination of the results shows that as the filler size increases, AvgHR values rise significantly. This implies that injecting more attack profiles into the system strengthens the impact of the attack, thereby increasing the number of target items in the recommendation lists. For example, under GSAGenlRan, the MUP-NNZ strategy increases from 0.0500 at Top-10 with 1% filler to 0.0970 at 10% filler, nearly doubling the success of the attack. When analyzing target item selection strategies, MUP-UsR yields relatively low AvgHR values. Since this strategy relies only on dissatisfaction ratios, its ability to penetrate the recommendation process is limited. For example, at 1% filler and Top-10, the AvgHR value is 0.0294, indicating only 0.29 fake items per list on average. MUP-DR performs slightly better than UsR, as it considers the balance between ratings above and below a threshold. At 5% filler and Top-10, its AvgHR value is 0.0442, corresponding to approximately 0.44 fake items per list. The strongest results are observed for MUP-NNZ. This strategy targets items with the fewest ratings, which are typically less popular and therefore easier to push into higher ranks. Under GSAGenlRan with 10% filler and Top-10, the AvgHR value reaches 0.0970, meaning that nearly every recommendation list contains one attack item. This makes MUP-NNZ the most vulnerable strategy against shilling attacks. MUP-Sum (Sum of Ratings) is also considerably affected, although slightly less than NNZ. For instance, under GSAGenlAvg with 5% filler at Top-10, the AvgHR value is 0.0731, implying that around 0.73 fake items are included per list. Overall, MUP-NNZ emerges as the weakest and most vulnerable strategy, while MUP-UsR and MUP-DR exhibit relatively more robust behavior. In these cases, fake items appear less frequently in the recommendation lists, particularly when the filler size is low. Furthermore, AvgHR values under GSAGenlAvg are generally higher than those under GSAGenlRan, indicating that systematic selection of target items increases attack success compared to random selection. Additionally, as the Top-n value increases (e.g., from Top-5 to Top-15), the likelihood of including fake items grows, since longer lists offer more opportunities for target items to appear.
In summary, the results highlight that the effectiveness of multi-criteria group shilling attacks strongly depends on both the filler size and the target item selection strategy. Attacks focusing on sparsely rated items (e.g., MUP-NNZ) demonstrate the highest vulnerability, while dissatisfaction-based strategies (MUP-UsR and MUP-DR) are more resilient. These findings emphasize the necessity of incorporating robust defense mechanisms, particularly against vulnerabilities associated with less popular items, in the design of multi-criteria group recommender systems.
Table 5 presents the AvgHR results of multi-criteria group shilling attacks on the YM20 dataset with a filler size of 5%. The findings clearly show that the impact of the attack becomes more pronounced under this setting. For instance, under GSAGenlRan with the MUP-Sum strategy, the AvgHR at Top-10 reaches 0.1211, which corresponds to approximately 12% of the items in the recommendation list. In practical terms, this implies that on average one fake item is included in every Top-10 list. At Top-15, the value rises to 0.1248, meaning that nearly two manipulated items are regularly injected into the recommendations. The MUP-DR strategy also yields relatively high values. At a 1% cluster size, the Top-10 AvgHR is observed at 0.0929, which equates to around 9% of the items being target products. This ratio is almost three times higher compared to MUP-UsR under the same setting, highlighting that the DR-based selection renders the system more vulnerable to manipulation. MUP-NNZ produces moderately strong outcomes. For example, under GSAGenlAvg with a 5% cluster size, the Top-10 AvgHR is 0.1067, corresponding to about 10% of the recommended items. This result indicates that sparsely rated items, when targeted, can be effectively pushed into the recommendation lists, exposing a significant weakness of the system against this type of strategy. In contrast, the MUP-UsR approach generates the lowest AvgHR values. With a 10% cluster size at Top-10, the metric reaches only 0.0667, implying that roughly 6–7% of the items are fake. This suggests that UsR-based target selection has a limited impact, thereby making the system more resistant to such attacks compared to other strategies. A general trend is also observed regarding the effect of cluster size: as the cluster size increases, AvgHR values consistently rise. For instance, under GSAGenlRan at Top-10, MUP-UsR increases from approximately 3% to over 8% as the cluster size grows. Similarly, MUP-Sum increases from 9% at 5% cluster size to nearly 12% at 10% cluster size. This indicates that larger groups provide greater leverage for injected attack profiles, thereby amplifying the attack’s overall effectiveness.
In summary, when the filler size is set to 5%, the most significant impact is observed for MUP-Sum and MUP-DR, which both achieve AvgHR values in the range of 10–12%. MUP-NNZ remains moderately strong but stable, while MUP-UsR shows the lowest ratios, marking it as the most robust selection strategy against group shilling attacks. These results demonstrate that the choice of target item selection strategy critically influences the attack success, with sum- and dominance-based approaches posing the highest threat to multi-criteria group recommender systems.
A comparative analysis of Tables 4 and 5 reveals that increasing the filler size from 1% to 5% substantially amplifies the success of the attacks. At a filler size of 1%, AvgHR values remain relatively low, with Top-10 lists typically containing only 3–5% of fake items. For example, under MUP-UsR, AvgHR values are observed around 0.029–0.038, indicating that, on average, 0.3–0.4 target items are present in each list of ten. Similarly, MUP-DR yields values in the 4–6% range, while MUP-NNZ produces slightly higher ratios of approximately 5%. Interestingly, MUP-Sum already shows stronger results under GSAGenlRan, achieving 0.094 at Top-10, corresponding to nearly one fake item in every list. When the filler size is raised to 5%, the impact of the attacks intensifies considerably, with AvgHR values nearly doubling to reach the 8–12% range. In particular, MUP-Sum and MUP-DR exhibit the most significant increases, both achieving Top-10 values exceeding 0.10. For instance, under GSAGenlRan, MUP-Sum reaches 0.1211 at Top-10, which means that approximately 12% of the items in the recommendation lists are fake. Similarly, under GSAGenlAvg, both MUP-DR and MUP-NNZ surpass the 10% threshold, highlighting the heightened vulnerability of the system under higher filler conditions. In summary, raising the filler size from 1% to 5% nearly doubles the effectiveness of the attacks. While the influence of shilling profiles remains relatively limited at 1%, with only sporadic infiltration of recommendation lists, a filler size of 5% results in approximately one out of every ten recommended items being fake. This finding emphasizes that filler size is a critical factor in determining attack success, and further demonstrates that strategies such as MUP-Sum and MUP-DR pose the greatest threat to multi-criteria group recommender systems under higher filler conditions.
A comparative examination of Tables 6 and 7 indicates that increasing the filler size from 1% to 5% substantially amplifies the effectiveness of group shilling attacks. Under a filler size of 1%, AvgHR values remain relatively moderate, with the proportion of fake items in the Top-10 lists typically ranging between 6% and 12%. In contrast, when the filler size is set to 5%, these ratios increase to between 9% and 15%, clearly demonstrating that the greater the number of filler ratings injected into the system, the stronger the influence of the attack on recommendation outcomes. For the MUP-UsR strategy, AvgHR values at 1% filler remain around 0.0643 (approximately 6.4%), which corresponds to fewer than one fake item in a typical Top-10 recommendation list. However, with a filler size of 5%, this range increases to 0.0841–0.1185, meaning that 8–11% of the recommended items are manipulated. Although UsR yields the lowest values overall, its effectiveness still grows with larger filler sizes, indicating that even this relatively robust strategy becomes increasingly vulnerable. The MUP-DR method shows AvgHR values of about 0.0838 (8.4%) under 1% filler, but these values climb to 0.1037–0.1210 (10–12%) under 5% filler. This growth highlights the sensitivity of the DR strategy to filler size, as larger proportions of injected profiles significantly increase the number of fake items appearing in the lists. The MUP-NNZ strategy produces the highest AvgHR values across both settings. At 1% filler, Top-10 results reach 0.1231 (around 12%), while under 5% filler the values rise further to 0.1262–0.1385 (13–14%). This demonstrates that sparsely rated items are the most vulnerable targets, as they can be effectively pushed into recommendation lists regardless of filler size, making MUP-NNZ the weakest selection strategy. The MUP-Sum strategy records AvgHR values in the range of 0.0791–0.1072 at 1% filler. When the filler size is increased to 5%, the values rise to 0.0852–0.1194, representing a 2–3% increase in attack penetration. This indicates that the Sum strategy, although not as extreme as NNZ, still exposes the system to significant vulnerabilities.
In conclusion, the analysis on the YM10 dataset highlights that filler size is a critical factor in determining the success of multi-criteria group shilling attacks. With a filler size of 1%, the impact remains relatively limited, whereas at 5% filler nearly 15% of Top-10 recommendations consist of fake items. Among the strategies, MUP-NNZ consistently emerges as the most vulnerable, while MUP-DR and MUP-Sum also display considerable susceptibility. By contrast, MUP-UsR proves to be comparatively more robust, though its resilience diminishes as filler size increases.
A comparison of the results for filler sizes of 1% and 5% on the YM10 dataset clearly demonstrates the substantial increase in attack effectiveness as the filler ratio grows. At a filler size of 1%, AvgHR values remain relatively limited, with the proportion of fake items in the Top-10 recommendation lists generally ranging between 6% and 12%. Under this setting, the influence of shilling profiles on recommendation outcomes is noticeable but still constrained. When the filler size is raised to 5%, the proportion of manipulated items in the Top-10 lists increases to approximately 9–15%. This rise indicates that a greater number of injected profiles directly strengthens the penetration of attacks, resulting in fake items appearing more frequently in recommendation lists. In particular, both MUP-DR and MUP-Sum strategies exceed the 10% threshold, while MUP-NNZ reaches 13–14%, confirming its position as the most vulnerable target selection approach. MUP-UsR, by contrast, remains the most robust, yielding the lowest ratios in both scenarios, with values of around 6% at 1% filler and between 8–11% at 5% filler. Overall, this comparison highlights that filler size is a decisive factor in determining the success of multi-criteria group shilling attacks. While the impact of attacks is relatively modest at 1% filler, at 5% filler nearly one out of every six items in the Top-10 lists is a fake product. These findings underscore the heightened risk posed by higher filler levels, where group shilling attacks become considerably more destructive in multi-criteria recommendation environments.
In summary, the comparative analysis of the two datasets illustrates the behavior of both attack models with respect to varying cluster and filler sizes (see Figs 3 and 4). For each scenario, the average values of the attack models were calculated and visualized in a comparative manner. According to these results, the performance differences between attack strategies and datasets provide valuable insights into the robustness characteristics of multi-criteria recommender systems.
For the first dataset, YM10, both attack types negatively affected the recommendation accuracy. The GSAGenlAvg method produced higher AvgHR values, particularly under the S1 scenario, indicating a stronger attack impact due to the more realistic user profile generation. In contrast, under the S2 scenario, the GSAGenlRan attack exhibited better performance, suggesting that random profile injection can sometimes align more effectively with user diversity in specific scenarios. Moreover, as the filler size increased, the overall AvgHR values consistently decreased, with the most significant degradation observed at the 10% filler level. Additionally, increasing the cluster size (from 1% to 10%) resulted in a gradual reduction of the attack impact, implying that larger user groups tend to dilute the influence of injected profiles.
In contrast, the results obtained from the YM20 dataset revealed a relatively weaker attack influence compared to YM10. At lower filler levels, the GSAGenlAvg method achieved slightly higher AvgHR values; however, the difference between attack types became marginal as the filler size and cluster size increased. The AvgHR trend remained more stable across different cluster sizes, indicating that the greater diversity of users and items in YM20 made the system inherently less susceptible to manipulation. This stability highlights the role of dataset scale and heterogeneity in absorbing or mitigating artificial perturbations.
When comparing both datasets, it becomes evident that the YM10 dataset is more vulnerable to shilling attacks, whereas the YM20 dataset demonstrates higher robustness. The YM10 results exhibit stronger sensitivity to both attack type and scenario, while YM20 presents a more stable and resistant behavior. These findings collectively suggest that dataset scale, user diversity, and item variety are crucial factors in mitigating the effects of profile injection attacks, ultimately enhancing the resilience of recommender systems operating on larger and more heterogeneous datasets.
An examination of the WRV metric results on the YM20 dataset demonstrates that increasing the filler size significantly strengthens the impact of group shilling attacks (see Tables 8 and 9). Since WRV evaluates not only whether target items appear in recommendation lists but also their ranking positions, higher values indicate that fake items are successfully promoted to the top of the lists, thereby intensifying the attack’s effectiveness. At a filler size of 1%, WRV values remain relatively modest; however, certain strategies still manage to push target items toward the top positions (see Table 8). For instance, the MUP-NNZ strategy achieves 1.0507 for Top-10 and 2.9115 for Top-15, suggesting that sparsely rated items can already be placed within upper ranks even under low filler conditions. Similarly, both MUP-DR and MUP-Sum generate noticeable values, indicating that manipulated items appear in favorable positions. In contrast, MUP-UsR yields the lowest WRV scores, meaning that items selected under this strategy tend to remain in lower positions within the lists. Nevertheless, the results confirm that even with only 1% filler, recommendation outputs are not fully resistant to attack.
When the filler size is raised to 5%, the effect of the attacks becomes dramatic according to Table 9. WRV values rise sharply, with target items being consistently promoted to the very top of the lists. Under GSAGenlRan, MUP-DR reaches 3.4846 at Top-10 and 4.9753 at Top-15, while under GSAGenlAvg the MUP-NNZ strategy attains 4.8811 for Top-10 and 6.2493 for Top-15, marking the highest observed values. These findings clearly demonstrate that selecting sparsely rated items makes the system particularly vulnerable, as such items can dominate recommendation lists once manipulated. The MUP-Sum strategy also produces high WRV values, including 5.5510 at Top-15, further illustrating the ability of attackers to promote fake items into upper ranks. Although MUP-UsR remains the least effective strategy, its WRV values also increase noticeably under 5% filler, highlighting that even relatively robust approaches lose resistance under stronger attack scenarios. In summary, the WRV analysis reveals that as the filler size increases, group shilling attacks not only infiltrate recommendation lists but also succeed in positioning target items at the top ranks. At 1% filler, the attacks exert a moderate effect, whereas at 5% filler the upper parts of the Top-10 and Top-15 lists are heavily dominated by fake products. Among the selection strategies, MUP-NNZ consistently emerges as the most vulnerable, while MUP-Sum and MUP-DR also present significant threats. By comparison, MUP-UsR proves to be the most robust, although its resilience diminishes as the filler size grows.
In conclusion, under a filler size of 1% the impact of the attacks remains at a moderate level, whereas at 5% filler the fake items not only infiltrate the recommendation lists but also occupy the top ranks. The MUP-NNZ strategy consistently emerges as the most vulnerable, while both MUP-Sum and MUP-DR demonstrate a high level of threat. By contrast, MUP-UsR yields the lowest values in both scenarios and thus proves to be the most robust strategy against such attacks.
The analysis of the WRV results in Tables 10 and 11 for the YM10 dataset reveals several key insights regarding the effectiveness of attacks on multi-criteria recommender systems. For the filler size of 1%, it is observed that the WRV values for GSAGenlAvg are generally higher than those for GSAGenlRan, indicating that GSAGenlAvg demonstrates superior attack efficiency. Furthermore, as the cluster size increases from 1% to 10%, there is a consistent increase in WRV values, highlighting that a larger cluster size amplifies the attack’s impact. When examining the target product selection strategies, MUP-NNZ and MUP-Sum exhibit notably higher WRV values compared to other strategies across both S1 and S2 systems. These findings underscore the effectiveness of these two strategies in ensuring the successful placement of target items in the top-n recommendations. Particularly in the top-15 recommendations, MUP-NNZ and MUP-Sum consistently outperform alternative methods, establishing their dominance in attack success. A comparison of the S1 and S2 systems indicates that WRV values are generally higher for S1, suggesting that it is more vulnerable to attacks. This trend persists across all cluster sizes and target selection strategies. The findings suggest that system-specific characteristics can influence susceptibility to adversarial attacks.
For the filler size of 5% in Table 11, a significant increase in WRV values is observed compared to the 1% filler size in Table 10. This emphasizes the critical role of filler size in enhancing attack effectiveness. The impact of increased filler size is particularly evident in strategies like MUP-NNZ, where WRV values show remarkable improvements. Across both GSAGenlRan and GSAGenlAvg, the trend of higher WRV values with larger cluster sizes is sustained, further reinforcing the role of cluster size in determining attack efficacy. Notably, GSAGenlRan performs well with smaller cluster sizes and lower filler ratios, while GSAGenlAvg demonstrates superior performance with larger cluster sizes and higher filler ratios. This distinction highlights the varying strengths of these attack types under different configurations. Moreover, the effectiveness of MUP-NNZ and MUP-Sum remains evident, with these strategies achieving the highest WRV values across both S1 and S2 systems under varying filler sizes and cluster configurations.
In summary, the results for the YM10 dataset corroborate the observations made on the YM20 dataset. Larger cluster sizes and higher filler ratios enhance attack effectiveness, while MUP-NNZ and MUP-Sum consistently emerge as the most successful target selection strategies. The GSAGenlAvg attack exhibits superior performance in most scenarios, particularly with larger configurations, while S1 consistently shows greater vulnerability compared to S2.
Table 12 provides a comprehensive analysis of the AvgHR results for different attack configurations, highlighting the effectiveness of various attack types, cluster sizes, and aggregation strategies under S1. The AvgHR metric is a critical indicator of the success of adversarial attacks in increasing the likelihood of targeted items being recommended; higher ratios indicate a greater vulnerability of the system to shilling attacks, while lower ratios correspond to stronger robustness. Under the GSAGenlRan attack, the most vulnerable aggregation techniques are MUL, SC, and, to some extent, AU. For instance, MUL produces values of 0.1223 at Top-10 and 0.2202 at Top-15 with a cluster size of 0.1, while SC reaches 0.0843 at Top-10 and 0.0978 at Top-15 with a cluster size of 0.05. These results demonstrate that such techniques allow fake items to infiltrate the recommendation lists effectively. In contrast, AwM and MP exhibit very low AvgHR values, remaining close to zero in most scenarios, thereby proving to be the most robust approaches against GSAGenlRan. A similar pattern can be observed under the GSAGenlAvg attack. MUL and SC again appear as the most fragile strategies, yielding the highest AvgHR values. For example, MUL records 0.1732 at Top-15 with a cluster size of 0.05 and 0.1715 at Top-15 with a cluster size of 0.1, whereas SC produces 0.1467 at Top-15 with a cluster size of 0.05. These results indicate that both techniques are highly susceptible to manipulation. On the other hand, AwM consistently remains the most robust method, producing either zero or very low values across all scenarios. Furthermore, MP and, to some extent, LM also show relatively low AvgHR values, suggesting better resistance to attacks.
In summary, across both attack types, MUL, SC, and AU emerge as the most vulnerable aggregation techniques, while AwM, MP, and partially LM stand out as the most robust methods. This comparison highlights the importance of selecting appropriate aggregation strategies to mitigate the impact of group shilling attacks in multi-criteria recommender systems.
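To ground this comparison, the sketch below shows how several classical aggregation functions could be applied to a matrix of members' predicted scores. The mapping of the abbreviations to specific functions (AU as additive utility, LM as least misery, MP as most pleasure, MUL as multiplicative, AwM as average without misery) is an assumption based on common usage in the group recommendation literature, not a description of the paper's exact implementations; SC and MRP are omitted because their definitions are not given here, and the misery threshold is an assumed value.

```python
import numpy as np

def aggregate_group_scores(member_scores, strategy="AU"):
    """Combine individual predicted scores (members x items) into one group score per item."""
    s = np.asarray(member_scores, dtype=float)
    if strategy == "AU":                      # additive utility: sum of member scores
        return s.sum(axis=0)
    if strategy == "LM":                      # least misery: minimum member score
        return s.min(axis=0)
    if strategy == "MP":                      # most pleasure: maximum member score
        return s.max(axis=0)
    if strategy == "MUL":                     # multiplicative: product of member scores
        return s.prod(axis=0)
    if strategy == "AwM":                     # average without misery
        misery_threshold = 2.0                # assumed threshold on a 1-5 rating scale
        group_avg = s.mean(axis=0)
        # Exclude items that any member scores below the threshold.
        group_avg[(s < misery_threshold).any(axis=0)] = -np.inf
        return group_avg
    raise ValueError(f"unknown strategy: {strategy}")
```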
The results in Table 13 show that the AvgHR values remain generally low under the S2 scenario, indicating that the attacks achieve only a limited hit ratio against the group recommender systems in this configuration. As the cluster size increases, a partial rise in AvgHR values can be observed, with more noticeable gains when the cluster size is 0.1. In contrast, the GSAGenlRan scenario consistently yields the weakest attack performance. Among the aggregation techniques, AU produces moderate results and is affected somewhat more under the average-based attack with larger clusters, while MUL behaves similarly to AU but occasionally aligns with SC among the higher-scoring techniques. SC is barely affected at smaller cluster sizes but shows considerable increases under the average-based attack at a cluster size of 0.1. MP remains mostly average, contributing more at medium and large cluster sizes. MRP stands out as the most affected technique, showing the highest AvgHR values under the average-based attack with cluster size 0.1, which suggests lower resistance to attacks. Moreover, the AvgHR values consistently increase from Top-5 to Top-15 recommendations, revealing that target items penetrate longer recommendation lists more easily, whereas shorter lists remain harder to manipulate. Overall, GSAGenlRan with small clusters is the weakest attack case, while the average-based attack with cluster size 0.1 yields the strongest results, with MRP, SC, and MUL showing the highest AvgHR values and MRP being affected most consistently.
The results in Tables 14 and 15 show that when the filler size increases to 5%, the AvgHR values rise considerably compared to the 1% filler condition. In particular, the hit ratio values at Top-15 reach around 20–25%, indicating that injected fake profiles have a stronger ability to push target items into recommendation lists. Under the GSAGenlRan scenario, MRP achieves up to 22–23% AvgHR at Top-15, while AU and MUL also produce values in the range of 18–21%. Although SC remains weak for small cluster sizes, its performance improves with larger clusters, reaching nearly 17%. The average-based scenario reveals an even stronger attack effect: for cluster size 0.1, MRP reaches around 23–25% AvgHR at Top-15, while MUL and AU achieve 19–22%, and SC rises to 16–18%. Among the aggregation strategies, MRP is the most vulnerable, consistently producing the highest AvgHR values under both scenarios. AU and MUL also show significant susceptibility, whereas MP and LM remain relatively less affected, with values mostly in the range of 10–15%. Overall, the increase in filler size amplifies the impact of attacks, with the average-based attack combined with larger cluster sizes producing the strongest effects on the recommendation lists.
For the last dataset, YM10, the AvgHR results of the experiments are presented in Tables 16–19. According to the results in Tables 16 and 17, increasing the cluster size consistently enhances AvgHR values across both S1 and S2, with larger clusters proving more effective in influencing recommendations. The average-based attack outperforms GSAGenlRan in all configurations, particularly with larger clusters and in the top-10 and top-15 scenarios, demonstrating its superior ability to exploit system vulnerabilities.
Among target item selection strategies, MUP-NNZ and MUP-Sum achieve the highest AvgHR values, especially in higher-ranked recommendations, due to their effectiveness in prioritizing influential items. Other strategies, such as AU and MUL, exhibit limited impact. The vulnerability comparison shows that S1 is more susceptible to attacks than S2, though the difference diminishes as cluster size grows.
The results in Tables 18 and 19 demonstrate that when the filler size increases to 5%, the AvgHR values rise significantly compared to the 1% filler condition. In the GSAGenlRan scenario, AvgHR values generally range between 10–20%, with MRP reaching nearly 20% and AU and MUL achieving around 16–19%. Although SC remains weak under small cluster sizes, it improves to nearly 17% with larger clusters, while MP and LM remain relatively less affected at around 10–15%. In contrast, the average-based scenario exhibits stronger attack effects: with cluster size 0.1, MRP reaches 22–25% AvgHR at Top-15, followed by AU and MUL in the range of 18–22% and SC around 15–18%. These findings indicate that average-based attacks manipulate the system more effectively than random attacks. Among the aggregation strategies, MRP is the most vulnerable, consistently showing the highest AvgHR values, while AU, MUL, and SC are also significantly influenced. MP and LM, on the other hand, appear more resilient, with comparatively lower hit ratios. Overall, the increase in filler size amplifies the attack impact on YM10, with MRP emerging as the weakest technique.
Overall, the results indicate that increasing the filler size to 5% substantially amplifies the attack impact on YM10, with average-based attacks being more effective than random ones and MRP emerging as the most vulnerable aggregation strategy, while MP and LM remain relatively more robust.
In the final stage, a statistical analysis was conducted to evaluate the robustness of the system against adversarial attacks, focusing on the relationship between the average-based attack and GSAGenlRan (see Figs 5 and 6). According to the results of the paired t-test, the differences between the average-based and GSAGenlRan methods were generally limited.
In the S1 scenario for the YM10 dataset, the average performance of the average-based attack was slightly higher than that of GSAGenlRan across the Top-5, Top-10, and Top-15 metrics (e.g., 0.0615 vs. 0.0580 for S1Top5); however, these differences were not statistically significant, as all p-values were above 0.05 (Fig 5). Therefore, both methods yielded comparable outcomes in the S1 scenario, and the superiority of the average-based attack could not be statistically confirmed. In contrast, more distinct differences were observed in the S2 scenario. For the S2Top5 metric, the GSAGenlRan method achieved a significantly higher mean value than the average-based attack (p = 0.0129, t = –2.961, Cohen’s dz = –0.893). Similarly, the difference was also statistically significant for the S2Top15 metric (p = 0.0160, t = –2.841, Cohen’s dz = –0.857). In both cases, the Cohen’s dz values indicated a large effect size, revealing a strong superiority of GSAGenlRan over the average-based attack. For the S2Top10 metric, the difference was marginally significant (p = 0.0523), suggesting a trend favoring GSAGenlRan, though without strong statistical support. Overall, while no significant performance difference was detected between the average-based attack and GSAGenlRan in the S1 scenario, GSAGenlRan demonstrated a statistically significant advantage with a large effect size in the S2 scenario, particularly in the Top-5 and Top-15 metrics. These results suggest that the performance of the average-based approach may vary depending on the data structure and scenario characteristics, indicating that in certain contexts even a random selection-based strategy can yield relatively higher performance.
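For reference, a paired comparison of this kind can be reproduced with a short SciPy routine such as the one below, where Cohen's dz is computed as the mean paired difference divided by its standard deviation. The function name paired_attack_test and the argument layout are assumptions; this is a minimal sketch, not the analysis script used for Figs 5 and 6.

```python
import numpy as np
from scipy import stats

def paired_attack_test(scores_avg, scores_ran):
    """Paired t-test and Cohen's dz for two attacks evaluated on the same runs.

    scores_avg, scores_ran : 1-D arrays of AvgHR values, one entry per
    experimental configuration, paired by run.
    """
    scores_avg = np.asarray(scores_avg, dtype=float)
    scores_ran = np.asarray(scores_ran, dtype=float)

    # Paired (dependent samples) t-test on the two score sequences.
    t_stat, p_value = stats.ttest_rel(scores_avg, scores_ran)

    # Cohen's dz: mean of the paired differences over their standard deviation.
    diff = scores_avg - scores_ran
    cohens_dz = diff.mean() / diff.std(ddof=1)

    return t_stat, p_value, cohens_dz
```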
When the filler size ratio was increased to 5%, the YM10 results exhibited a partially different pattern compared to the 1% filler experiments (Fig 5). In the S1 scenario, the differences between the average-based attack and GSAGenlRan remained small and statistically insignificant (e.g., p = 0.610 for S1Top5 and p = 0.272 for S1Top10). However, a statistically significant difference was observed for the S1Top15 metric (p < 0.05, t = 6.997, Cohen’s dz = 2.109). The remarkably high t-statistic and effect size indicate a clear superiority of the average-based attack over the random baseline. This finding implies that at larger recommendation list sizes (Top-15), the performance advantage of the average-based attack becomes more pronounced. In the S2 scenario, however, this tendency weakened considerably. For all Top-5, Top-10, and Top-15 metrics, p-values remained above 0.05, indicating no statistically significant difference between the two methods. Moreover, the Cohen’s dz values ranged between –0.06 and 0.35, corresponding to small effect sizes. This outcome suggests that increasing the filler ratio in the S2 scenario diminishes the statistical advantage of the average-based attack, implying that the method’s performance is sensitive to both data volume and sampling density. In summary, while the GSAGenlRan method exhibited significant superiority under the 1% filler condition for specific scenarios, increasing the filler ratio to 5% led to statistically significant improvements in favor of the average-based attack, particularly within the S1 scenario. These findings indicate that recommendation performance depends not only on the employed method but also on factors such as sampling density and data distribution homogeneity. The increase in filler ratio appears to enhance the stability of the average-based approach while reducing the comparative advantage of the GSAGenlRan method.
For the other dataset, YM20, the corresponding results are presented in Fig 6. When the filler size was set to 1%, the results indicated that although the average-based attack produced slightly higher average values overall, most differences were not statistically significant (e.g., p > 0.1). However, significant differences were observed for S1–Top15 and for the S2 metrics, particularly Top-5 and Top-15 (p < 0.05). This suggests that at lower injection levels, the average-based attack tends to adapt more effectively to user similarity patterns, thereby producing a stronger impact. Furthermore, the high Cohen’s dz values (ranging between 1.15 and 1.76) confirm a strong effect size in these cases.
When the filler size increased to 5%, the structure of the attack impact changed (see Fig 6). Although the average-based attack continued to yield higher mean results, the statistical significance of the differences decreased (e.g., p > 0.2). Significant effects were detected only in the S2 scenario (particularly for Top-5, Top-10, and Top-15; p < 0.05), indicating that the system became more resistant to perturbations as the filler ratio increased. This behavior implies that at higher filler levels, the influence of injected profiles diminishes due to the system’s ability to absorb and stabilize rating fluctuations.
Briefly, the YM20 dataset demonstrated relatively high robustness against both the average-based and GSAGenlRan attacks. While the S1 scenario (attacks targeting popular items) exhibited higher vulnerability at lower filler levels, the S2 scenario (attacks focusing on less popular or sparsely rated items) continued to show statistically significant effects even under higher filler sizes. These findings highlight that dataset scale and diversity are critical factors in determining system resilience, as they shape how recommendation models respond to adversarial manipulations under different attack conditions.
Overall, the results highlight that the combination of larger cluster sizes, the average-based attack, and advanced target item selection strategies such as MUP-NNZ and MUP-Sum significantly increases the AvgHR values, making these factors critical to the success of adversarial attacks. The higher vulnerability observed under the MUP-NNZ and MUP-Sum selection strategies can be explained by their focus on items with limited rating information. In both cases, the targeted items are characterized by low popularity and sparse user interactions, which makes their estimated rating profiles less stable and more susceptible to distortion. Consequently, even a small number of injected fake ratings can cause substantial shifts in the predicted scores of these items. This heightened sensitivity amplifies the influence of adversarial users, leading to higher AvgHR values and demonstrating that sparsely rated items are inherently more prone to manipulation. Conversely, items with richer rating histories tend to exhibit greater robustness, as the diversity and volume of genuine feedback dilute the relative effect of artificially introduced profiles. Collectively, these findings underscore the importance of developing defense mechanisms that are specifically tailored to the system’s structural vulnerabilities and data sparsity characteristics.
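The sensitivity argument can be illustrated with simple arithmetic: for an item with n genuine ratings of mean m, injecting k fake ratings of value r shifts the item's simple mean by k(r - m)/(n + k), so sparsely rated items (small n) move far more than richly rated ones. The helper below is an illustrative sketch of that calculation under these assumptions, not a component of the evaluated recommender systems.

```python
def mean_shift_after_injection(n_genuine, genuine_mean, n_fake, fake_rating=5.0):
    """Shift in an item's simple mean rating after injecting n_fake ratings of fake_rating.

    new_mean - old_mean = n_fake * (fake_rating - genuine_mean) / (n_genuine + n_fake)
    """
    return n_fake * (fake_rating - genuine_mean) / (n_genuine + n_fake)

# A sparsely rated item (10 genuine ratings) shifts far more than a popular one (500 ratings):
print(mean_shift_after_injection(n_genuine=10,  genuine_mean=3.0, n_fake=5))   # ~ +0.67
print(mean_shift_after_injection(n_genuine=500, genuine_mean=3.0, n_fake=5))   # ~ +0.02
```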
A comparison of the YM10 and YM20 datasets in terms of the AvgHR metric reveals notable differences in their susceptibility to group shilling attacks. While both datasets exhibit the general trend that higher filler sizes amplify the impact of attacks, the overall intensity and sensitivity to target item selection strategies vary across datasets. For the YM10 dataset, AvgHR values are generally higher. Even at a filler size of 1%, the proportion of fake items in the Top-10 lists ranges between 6% and 12%, and at 5% filler this ratio increases further to 9–15%. This indicates that, due to the structural characteristics of YM10, fake profiles can more easily penetrate the recommendation lists. In particular, the MUP-NNZ strategy consistently produces the highest values, with nearly 13–14% of Top-10 recommendations being manipulated items. MUP-DR and MUP-Sum also yield strong attack success rates, while MUP-UsR remains comparatively more resilient. In contrast, the YM20 dataset exhibits lower AvgHR values, suggesting stronger robustness against attacks. At 1% filler, the proportion of fake items in Top-10 lists typically falls within the 3–6% range, while at 5% filler this ratio increases to 8–12%. Although YM20 is less vulnerable at low filler levels, the effect of attacks becomes more pronounced as the filler size increases. Specifically, both MUP-Sum and MUP-DR exceed the 10% threshold at 5% filler, demonstrating that these strategies can still achieve substantial influence under more aggressive attack scenarios. In summary, the AvgHR results highlight that YM10 is more fragile, with fake items infiltrating recommendation lists even at low filler levels, whereas YM20 shows greater resistance but remains vulnerable as the filler ratio grows. This comparison underscores the critical role of dataset characteristics in shaping the effectiveness of group shilling attacks, as well as the importance of tailoring defense mechanisms to the specific vulnerabilities of each dataset.
7 Conclusion and future work
This study presents a comprehensive analysis of the vulnerabilities of multi-criteria recommender systems to group shilling attacks. The existing literature predominantly focuses on such attacks in single-criteria systems, leaving multi-criteria systems largely unexamined. In this context, our study addresses a critical gap by investigating, modeling, and analyzing shilling attacks specifically in multi-criteria group recommender systems. The proposed attack strategy reveals the susceptibility of multi-criteria recommender systems to contemporary threats. Our findings demonstrate that while these systems offer significant advantages in terms of personalization and detailed evaluations, they can be easily manipulated by malicious user groups. Such manipulations not only decrease the recommendation accuracy but also negatively impact user satisfaction and trust, posing a serious threat to the long-term success and sustainability of recommender systems. The empirical results further show that the impact of attacks increases as the filler size grows. For instance, when the filler size rises from 1% to 5%, the AvgHR values significantly increase, reaching up to 20–25% under average-based attacks. In particular, the MRP aggregation strategy consistently emerged as the most vulnerable, while AU, MUL, and SC were also notably affected. On the other hand, MP and LM demonstrated comparatively greater robustness. These findings emphasize that attack severity is highly dependent on both the attack model and the aggregation technique employed. Also, it is important to note that the robustness observed in this study is specific to the employed datasets and attack models; therefore, the findings should be interpreted within the context of these experimental conditions.
In conclusion, the complexity and user-centric nature of multi-criteria recommender systems bring both advantages and vulnerabilities. The present study not only addresses these vulnerabilities but also provides a foundation for future research, aiming to contribute to the development of secure, efficient, and user-friendly recommender systems. Future studies could focus on developing more sophisticated detection algorithms that leverage machine learning and anomaly detection techniques to identify group shilling attacks more effectively. Additionally, investigating how dynamic user behavior and changing preferences influence the susceptibility of multi-criteria systems to group shilling attacks could enhance the adaptability and resilience of these systems. Moreover, future work could extend this research to different application domains and datasets to examine whether the observed patterns of robustness and vulnerability generalize across various contexts. Such cross-domain analyses would provide deeper insights into the transferability of attack dynamics and the scalability of proposed defense mechanisms.
References
- 1.
Adomavicius G, Manouselis N, Kwon Y. Multi-criteria recommender systems. Recommender systems handbook. Springer US; 2010. p. 769–803. https://doi.org/10.1007/978-0-387-85820-3_24
- 2.
Herlocker JL, Konstan JA, Borchers A, Riedl J. An algorithmic framework for performing collaborative filtering. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 1999. p. 230–7. https://doi.org/10.1145/312624.312682
- 3.
Lam SK, Riedl J. Shilling recommender systems for fun and profit. In: Proceedings of the 13th International Conference on World Wide Web. 2004. p. 393–402. https://doi.org/10.1145/988672.988726
- 4.
Su X-F, Zeng H-J, Chen Z. Finding group shilling in recommendation system. In: Special interest tracks and posters of the 14th international conference on World Wide Web - WWW ’05. 2005. p. 960. https://doi.org/10.1145/1062745.1062818
- 5. Kaya TT, Kaleli C. Robustness analysis of multi-criteria top-n collaborative recommender system. Arab J Sci Eng. 2022;48(8):10189–212.
- 6. Kaya TT, Yalcin E, Kaleli C. A novel classification-based shilling attack detection approach for multi-criteria recommender systems. Computational Intelligence. 2023;39(3):499–528.
- 7.
Burke R, Mobasher B, Zabicki R, Bhaumik R. Identifying attack models for secure recommendation. In: Beyond personalization: a workshop on the next generation of recommender systems. 2005. p. 347–61.
- 8. Hurley NJ, O’Mahony MP, Silvestre GCM. Attacking recommender systems: a cost-benefit analysis. IEEE Intell Syst. 2007;22(3):64–8.
- 9.
Cheng Z, Hurley N. Robust collaborative recommendation by least trimmed squares matrix factorization. In: 2010 22nd IEEE International Conference on Tools with Artificial Intelligence. 2010. p. 105–12. https://doi.org/10.1109/ictai.2010.90
- 10.
Zhang F. Reverse bandwagon profile inject attack against recommender systems. In: 2009 Second International Symposium on Computational Intelligence and Design. 2009. p. 15–8. https://doi.org/10.1109/iscid.2009.11
- 11. Mobasher B, Burke R, Bhaumik R, Sandvig JJ. Attacks and remedies in collaborative recommendation. IEEE Intell Syst. 2007;22(3):56–63.
- 12.
Williams C, Mobasher B. Profile injection attack detection for securing collaborative recommender systems. CTI: DePaul University; 2006.
- 13.
Burke R, Mobasher B, Bhaumik R. Limited knowledge shilling attacks in collaborative filtering systems. In: Proceedings of 3rd international workshop on intelligent techniques for web personalization (ITWP 2005), 19th international joint conference on artificial intelligence (IJCAI 2005); 2005. p. 17–24.
- 14.
O’Mahony MP, Hurley NJ, Silvestre GC. Recommender systems: attack types and strategies. In: AAAI; 2005. p. 334–9.
- 15. O’Mahony M, Hurley N, Kushmerick N, Silvestre G. Collaborative recommendation. ACM Trans Internet Technol. 2004;4(4):344–77.
- 16.
Wang Y, Wu Z, Cao J, Fang C. Towards a tricksy group shilling attack model against recommender systems. In: Advanced Data Mining and Applications: 8th International Conference, ADMA 2012, Nanjing, China, December 15-18, 2012, Proceedings 8. Springer; 2012. p. 675–88.
- 17. Wang S, Zhang P, Wang H, Yu H, Zhang F. Detecting shilling groups in online recommender systems based on graph convolutional network. Information Processing & Management. 2022;59(5):103031.
- 18. Zhang F, Qu Y, Xu Y, Wang S. Graph embedding-based approach for detecting group shilling attacks in collaborative recommender systems. Knowledge-Based Systems. 2020;199:105984.
- 19. Zhang F, Wang S. Detecting group shilling attacks in online recommender systems based on bisecting K-means clustering. IEEE Trans Comput Soc Syst. 2020;7(5):1189–99.
- 20. Hao Y, Meng G, Wang J, Zong C. A detection method for hybrid attacks in recommender systems. Information Systems. 2023;114:102154.
- 21. Cai H, Zhang F. An unsupervised approach for detecting group shilling attacks in recommender systems based on topological potential and group behaviour features. Security and Communication Networks. 2021;2021:1–18.
- 22. Cai H, Ren J, Zhao J, Yuan S, Meng J. KC-GCN: a semi-supervised detection model against various group shilling attacks in recommender systems. Wireless Communications and Mobile Computing. 2023;2023:1–15.
- 23. Xu Y, Zhang P, Yu H, Zhang F. Detecting group shilling attacks in recommender systems based on user multi-dimensional features and collusive behaviour analysis. The Computer Journal. 2023;67(2):604–16.
- 24.
Wang B, Chen J, Li C, Zhou S, Shi Q, Gao Y, et al. Distributionally robust graph-based recommendation system. In: Proceedings of the ACM Web Conference 2024 . 2024. p. 3777–88. https://doi.org/10.1145/3589334.3645598
- 25.
Boratto L, Fabbri F, Fenu G, Marras M, Medda G. Robustness in fairness against edge-level perturbations in GNN-based recommendation. In: European Conference on Information Retrieval. Springer; 2024. p. 38–55.
- 26. Liu S, Yu S, Li H, Yang Z, Duan M, Liao X. A novel shilling attack on black-box recommendation systems for multiple targets. Neural Comput & Applic. 2024;37(5):3399–417.
- 27. Alhwayzee A, Araban S, Zabihzadeh D. A robust recommender system against adversarial and shilling attacks using diffusion networks and self-adaptive learning. Symmetry. 2025;17(2):233.
- 28. Zhou W, Wen J, Qu Q, Zeng J, Cheng T. Shilling attack detection for recommender systems based on credibility of group users and rating time series. PLoS One. 2018;13(5):e0196533. pmid:29742134
- 29. Li K, Zhou H, Ma B, Huang F. SemanticShield: LLM-powered audits expose shilling attacks in recommender systems. arXiv preprint 2025. https://arxiv.org/abs/2509.24961
- 30.
Jameson A, Smyth B. Recommendation to groups. The adaptive web: methods and strategies of web personalization. Springer; 2007. p. 596–627.
- 31. Khazaei E, Alimohammadi A. An automatic user grouping model for a group recommender system in location-based social networks. IJGI. 2018;7(2):67.
- 32.
Boratto L, Carta S. State-of-the-art in group recommendation and new approaches for automatic identification of groups. Information retrieval and mining in distributed environments. Springer. 2011. p. 1–20.
- 33. Yalcin E, Bilge A. Novel automatic group identification approaches for group recommendation. Expert Systems with Applications. 2021;174:114709.
- 34. Boratto L, Carta S, Fenu G. Discovery and representation of the preferences of automatically detected groups: exploiting the link between group modeling and clustering. Future Generation Computer Systems. 2016;64:165–74.
- 35. Seo Y-D, Kim Y-G, Lee E, Seol K-S, Baik D-K. An enhanced aggregation method considering deviations for a group recommendation. Expert Systems with Applications. 2018;93:299–312.
- 36. Ardissono L, Goy A, Petrone G, Segnan M, Torasso P. Intrigue: personalized recommendation of tourist attractions for desktop and hand held devices. Applied Artificial Intelligence. 2003;17(8–9):687–714.
- 37.
McCarthy K, Salamó M, Coyle L, McGinty L, Smyth B, Nixon P. CATS: a synchronous approach to collaborative group recommendation. In: FLAIRS; 2006. p. 86–91.
- 38.
Jameson A. More than the sum of its members: challenges for group recommender systems. In: Proceedings of the Working Conference on Advanced Visual Interfaces; 2004. p. 48–54.
- 39.
Yu Z, Zhou X, Zhang D. An adaptive in-vehicle multimedia recommender for group users. In: 2005 IEEE 61st Vehicular Technology Conference. p. 2800–4. https://doi.org/10.1109/vetecs.2005.1543857
- 40.
Chao DL, Balthrop J, Forrest S. Adaptive radio: achieving consensus using negative preferences. In: Proceedings of the 2005 ACM International Conference on Supporting Group Work. 2005. p. 120–3.
- 41. Christensen IA, Schiaffino S. Entertainment recommender systems for group of users. Expert Systems with Applications. 2011;38(11):14127–35.
- 42.
McCarthy JF. Pocket restaurantfinder: a situated recommender system for groups. In: Workshop on mobile ad-hoc communication at the 2002 ACM conference on human factors in computer systems. vol. 8; 2002.
- 43. Lieberman H, van Dyke N, Vivacqua A. Let’s browse: a collaborative browsing agent. Knowledge-Based Systems. 1999;12(8):427–31.
- 44.
Ahmad HS, Nurjanah D, Rismala R. A combination of individual model on memory-based group recommender system to the books domain. In: 2017 5th International Conference on Information and Communication Technology (ICoIC7). 2017. p. 1–6. https://doi.org/10.1109/icoict.2017.8074655
- 45. Agarwal A, Chakraborty M, Chowdary CR. Does order matter? Effect of order in group recommendation. Expert Systems with Applications. 2017;82:115–27.
- 46.
O’connor M, Cosley D, Konstan JA, Riedl J. PolyLens: a recommender system for groups of users. In: ECSCW 2001 : Proceedings of the Seventh European conference on computer supported cooperative work 16–20 September 2001, Bonn, Germany. Springer; 2001. p. 199–218.
- 47.
Salehi-Abari A, Boutilier C. Preference-oriented social networks: group recommendation and inference. In: Proceedings of the 9th ACM Conference on Recommender Systems. 2015. p. 35–42.
- 48.
Seminario CE, Wilson DC. Attacking item-based recommender systems with power items. In: Proceedings of the 8th ACM Conference on Recommender Systems. 2014. p. 57–64. https://doi.org/10.1145/2645710.2645722
- 49. Turk AM, Bilge A. Robustness analysis of multi-criteria collaborative filtering algorithms against shilling attacks. Expert Systems with Applications. 2019;115:386–402.
- 50. Lakiotaki K, Matsatsinis NF, Tsoukias A. Multicriteria user modeling in recommender systems. IEEE Intell Syst. 2011;26(2):64–76.
- 51.
Jannach D, Karakaya Z, Gedikli F. Accuracy improvements for multi-criteria recommender systems. In: Proceedings of the 13th ACM Conference on Electronic Commerce. 2012. p. 674–89. https://doi.org/10.1145/2229012.2229065
- 52. Mobasher B, Burke R, Bhaumik R, Williams C. Toward trustworthy recommender systems: an analysis of attack models and algorithm robustness. ACM Trans Internet Technol. 2007;7(4):23-es.
- 53.
Burke R, O’Mahony MP, Hurley NJ. Robust collaborative recommendation. Recommender systems handbook. Springer US; 2015. p. 961–95. https://doi.org/10.1007/978-1-4899-7637-6_28
- 54. Batmaz Z, Yilmazel B, Kaleli C. Shilling attack detection in binary data: a classification approach. J Ambient Intell Human Comput. 2019;11(6):2601–11.