
Enhancing the scalability of distance-based link prediction algorithms in recommender systems through similarity selection

  • Zhan Su ,

    Contributed equally to this work with: Zhan Su, Jun Ai

    Roles Funding acquisition, Methodology, Project administration, Writing – review & editing

    suzhan@foxmail.com (ZS); aijun@outlook.com (JA)

    Affiliation School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, P.R. China

  • Zhong Huang,

    Roles Formal analysis, Software, Validation, Writing – original draft

    Affiliation School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, P.R. China

  • Jun Ai ,

    Contributed equally to this work with: Zhan Su, Jun Ai

    Roles Conceptualization, Software, Validation, Writing – original draft, Writing – review & editing

    suzhan@foxmail.com (ZS); aijun@outlook.com (JA)

    Affiliation School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, P.R. China

  • Xuanxiong Zhang,

    Roles Resources, Writing – review & editing

    Affiliation School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, P.R. China

  • Lihui Shang,

    Roles Writing – review & editing

    Affiliation School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, P.R. China

  • Fengyu Zhao

    Roles Resources, Writing – review & editing

    Affiliation School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, P.R. China

Abstract

The Slope One algorithm and its descendants measure user-score distance and use the statistical score distance between users to predict unknown ratings, in contrast to typical collaborative filtering algorithms, which use similarity for neighbor selection and prediction. Compared to collaborative filtering systems that select only similar neighbors, algorithms based on user-score distance typically include all possibly related users in the process, which costs more computation time and memory. To improve the scalability and accuracy of distance-based recommendation algorithms, we provide a user-item link prediction approach that combines user distance measurement with similarity-based user selection. The algorithm computes user similarities, removes related users whose similarity falls below a threshold, and predicts unknown ratings from the filtered users, which removes 26 to 29 percent of neighbors and improves prediction error, ranking, and prediction accuracy overall.

Introduction

Due to the rapid spread of the Internet, with an enormous amount of information appearing every second, people have entered an era of information overload. This overwhelming volume makes it difficult for consumers to find the information they are interested in. Consumers and merchants alike hope to find useful information on the Internet, whether a suitable product or a group of potentially interested clients. Excessive data, however, is a severe impediment to reaching this goal, and the current epidemic, which forces people to turn to the Internet for consumption, learning, and communication, exacerbates the problem.

Fortunately, scientists and engineers have devised systems that automatically calculate user preferences for goods and information from people's existing consumption and evaluation data, and then recommend goods or information that users are likely to enjoy but have yet to discover or consume. The whole process predicts links, and the weights of links, between users and items in a recommender system; hence, it is often referred to as link prediction [1, 2].

This is the primary function of the recommender system: it not only saves the user time in searching for an interesting target but also identifies a potential user group for the merchant. Link prediction and recommendation algorithms are therefore effective ways to streamline excessive information, saving users time searching for products and information while also saving businesses money on advertising [3, 4].

Thus, link prediction and recommendation algorithms have been intensively investigated in related fields due to their commercial value and research significance [5]. Amazon, Facebook, JD, Taobao, and other companies have their own personalized recommender systems and algorithms [6, 7].

According to the general understanding of the area, link prediction in recommender systems can be separated into three primary types: content-based recommendation algorithms, collaborative filtering recommendation algorithms, and hybrid recommendation algorithms.

First, in content-based recommender systems, the descriptive attributes of items are used to make recommendations; the term “content” refers to these descriptions [8, 9]. Second, collaborative filtering algorithms comprise two sub-types: memory-based and model-based collaborative filtering.

Memory-based collaborative filtering algorithms [10], whether item-based [11–13] or user-based [14], are based on the assumption that similar individuals demonstrate similar patterns of rating behavior, and similar objects receive similar ratings [15]. On the other hand, clustering models [16], maximum entropy models [17], matrix decomposition models, and similarity-network models [18–21] are examples of model-based collaborative filtering techniques [22, 23].

Finally, there are weighted type, switching type, cross type, feature combination type, waterfall type, feature enhancement type, and meta-level type hybrid recommendation strategies [24].

Despite the fact that the recommender system has been under development for a long time, some issues remain, such as cold start [25], data sparsity, insufficient scalability [26], accuracy improvement, computational cost, prediction vulnerability [27] and lack of diversity [28].

The most pressing issue remains improving the accuracy of recommendations. To increase the accuracy of the recommendation system, many academics have proposed improved similarity computation algorithms [29]. Hawashin et al. used user interests to measure similarity [30]. PK Singh and others consider consumers' likes and dislikes of similar features of a single item separately when determining similarity [31]. Ai et al. [32] used similarity to model a user-user network and considered centrality measures [33] as a factor to enhance prediction accuracy. MA Yi et al. improved Pearson similarity by introducing a specified distance function [34]. N Joorabloo et al. proposed a new approach of user/item neighborhood reordering that takes the future trend of user/item similarity into account [35]. All of these techniques increase the accuracy of recommendations based on user-similarity prediction.

Other approaches, such as the Slope One algorithm [36] and its variants, treat rating differences as a distance between users and use that distance to predict unknown ratings for recommendation. LY Dong et al. [37] improved this approach by incorporating non-negative matrix factorization into Slope One, which effectively handles the sparsity problem. W Li et al. [38] extended the Slope One algorithm to improve prediction accuracy and reduce sparsity by weighting essential items. L Jiang and colleagues [39] presented a Slope One method that combines trusted data with user similarities. These algorithms improve prediction accuracy by using all possible neighbors.

The Slope One method is not only simple to implement but also highly effective. Its prediction accuracy, however, can still be improved, and its computational cost is higher than that of other algorithms. Although previous studies have attempted to merge distance and similarity, they increase the algorithm's time complexity [39]. Because the Slope One algorithm requires all eligible users in its prediction, the large number of neighbors severely limits its scalability despite its computational simplicity. In practical applications, caching only a small portion of neighbors for prediction would save storage space and speed up prediction.

To solve the problem of Slope One using too many neighbors in prediction, we present a link prediction method for recommender systems based on user difference and similarity selection, which rests on two assumptions. 1) The user-rating distance in a recommender system can be measured and used to predict unknown ratings. 2) If the similarity between two users is less than a certain threshold, their rating distance does not help to accurately measure the differences between them, and it can therefore be discarded in the calculation of user-rating distance.

The following are the primary contributions of the work.

  1. We designed a rating prediction method based on user similarity selection, which reduced more than 40% of the neighbors used in prediction.
  2. The method successfully decreases the time complexity and improves the overall performance in prediction error, ranking of the recommendation list, and prediction accuracy of user preference.
  3. Our study suggests that neighbors who are not similar enough to the prediction target do more harm than good to the prediction and should be discarded.

The rest of the paper is laid out as follows: Section II reviews related research; Section III details the suggested approach; Section IV compares the algorithm established in this study to several state-of-the-art algorithms in the field; and Section V summarizes the conclusion and proposes potential future studies.

Related works

Measurement of user similarity

Among many others, the collaborative filtering recommendation algorithm is widely explored and utilized. The primary approach is to make recommendations to the target user based on the tastes of comparable user groups [40]. The algorithm calculates the similarity between the target and all other users, selects a group of highly similar neighbors, and uses the neighbors' ratings to estimate the target user's rating on an item [5].

The calculation of similarity is the most important phase in collaborative filtering algorithms. As a result, many methods in the field of recommender systems have been proposed to solve this problem, such as the Pearson correlation coefficient [41], cosine similarity [5], Euclidean distance [10, 42], similarity with confidence measures [43], mean square distance [44], user behavior probability [45], Jaccard [46, 47], Spearman correlation [48], vector similarity [2], Bhattacharyya coefficient [49], and user opinion spreading [50, 51].

Rating distance between users

The Slope One [36] algorithm is a model-based recommendation technique that calculates the average distance between users. Its fundamental idea is a linear regression on users' previous ratings: between users U and V, ratings on the same item are assumed to follow a linear relationship, U = V + b, where b is the deviation value between the two users. Assume that user u has rated movie i but user v has not. To calculate the deviation value, the items jointly rated by the two users must first be identified. The average deviation of the ratings between the two users over these items is taken as b [5, 36] and used for prediction when one of U and V is unknown.
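The idea above can be sketched in a few lines; this is a minimal illustration of the user-based Slope One scheme, with helper names of our own, not from the paper.

```python
def deviation(ratings_u, ratings_v):
    """Average deviation b between two users over their jointly rated items."""
    common = set(ratings_u) & set(ratings_v)
    return sum(ratings_u[i] - ratings_v[i] for i in common) / len(common)

def slope_one_predict(ratings_u, ratings_v, item):
    """Predict user u's unknown rating on `item` as v's known rating plus b."""
    return ratings_v[item] + deviation(ratings_u, ratings_v)

u = {"A": 4.0, "B": 3.0}             # user u has rated items A and B
v = {"A": 3.5, "B": 2.5, "C": 4.0}   # user v has additionally rated C
# b = mean(4.0 - 3.5, 3.0 - 2.5) = 0.5, so u's predicted rating on C is 4.5
```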

The proposed method

Symbol description

Assume that a total of m users have rated n movies; the data can be regarded as an m×n rating matrix Rm×n. The user set is represented by U = {u1, u2, …, um}, the collection of movies by I = {i1, i2, …, in}, and rui represents the rating of movie i by user u.

The data set is divided into a training set TrainingSet and a test set TestSet. We use duv = dev(u, v) to represent the average rating distance between users u and v, suv = similarity(u, v) stands for the similarity between users u and v, and Th is a similarity threshold by which neighbors with similarity larger than or equal to Th are selected.

Algorithm description

The problem with collaborative filtering is that it does not consider the impact of distance between users, while the traditional distance method does not consider the influence of similarity on user prediction. In addition, the collaborative filtering algorithm needs to calculate and select neighbors for the target user, so its time complexity is relatively higher than that of the distance approach. The issue is demonstrated by Fig 1: in terms of rating distance, the two user pairings are close (−0.5 against −0.25), yet their similarities are completely opposite (−1.0 against 1.0).

Fig 1. Ratings of three movies by three users are shown in a schematic diagram of traditional algorithms.

s is given by similarity measurement and d is calculated by distance method.

https://doi.org/10.1371/journal.pone.0271891.g001

This research provides a link prediction algorithm based on user distance and similarity selection (DSS) to address the aforementioned issues. As shown in Fig 1, there are three users x, y and z. For the movies i, j, and k, it is necessary to calculate the predicted rating of user y on movie j. According to the distance equation, the distance between users x and y is calculated as dxy = −0.25, and the distance between users y and z as dyz = −0.5. By contrast, the Pearson similarity equation evaluates the similarities between users x, y and users y, z as sxy = 1.0 and syz = −1.0, respectively.

According to our approach, only users with high similarity are selected for the prediction, which means user z is excluded. We use the rating of user x to predict the target user y: ryj = rxj + dxy = 1.5 − 0.25 = 1.25.
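The worked example can be checked numerically. In the sketch below, z's rating on movie j is a hypothetical placeholder (it is not given in the text), since only z's exclusion matters.

```python
# Numeric check of the worked example: z is filtered out because
# s_yz = -1.0 < Th, so only x contributes to y's prediction on j.
Th = 0.0
neighbors = {
    "x": {"sim": 1.0,  "dist": -0.25, "rating_j": 1.5},
    "z": {"sim": -1.0, "dist": -0.5,  "rating_j": 4.0},  # rating_j hypothetical
}
selected = {v: n for v, n in neighbors.items() if n["sim"] >= Th}
prediction = sum(n["rating_j"] + n["dist"] for n in selected.values()) / len(selected)
# prediction = 1.5 - 0.25 = 1.25, matching the text
```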

Algorithm 1 DSS algorithm for unknown rating prediction and Top-k recommendation

1: Calculate similarity for all possible suv between user u and user v.

2: For each target user, the set of all his neighbors with similarity greater than a threshold (Th = 0.0 in our experiments) is recorded (Nu).

3: Calculate the distance duv between user u and each of his neighbors v in Nu.

4: for each prediction target rui ∈ TestSet do

5:  Select all neighbors v of user u with distance duv who have rated i.

6:  Predict by Eq 4.

7:  Add the predicted rating into the predicted set, P ← P ∪ {r̂ui}

8: end for

9: Sort P by descending order of the predicted rating.

10: L ← Select top-k prediction in P as recommendation list.

11: return The predicted rating P and recommendation list L for u.

Main equations of the proposed method

The similarity between users is calculated from the known ratings in the training set. In general, the greater the similarity between the target user and a neighbor, the higher that neighbor's weight in the prediction. We use Pearson's correlation coefficient to calculate the similarity between two users based on their common rating characteristics, defined as Eq 1:

(1) s(u, v) = \frac{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in I_{uv}} (r_{vi} - \bar{r}_v)^2}}

where the set Iuv represents the collection of items that users u and v have both rated, rui and rvi represent the ratings given by users u and v on the same item i, and \bar{r}_u and \bar{r}_v represent the average ratings of users u and v, respectively.

Moreover, the rating distance between two users is defined by Eq 2:

(2) d_{vu} = \frac{1}{|I_{uv}|} \sum_{i \in I_{uv}} (r_{ui} - r_{vi})

where the distance dvu between users equals the mean of the differences of the one-to-one corresponding ratings on the set of items Iuv jointly rated by the two users, rui and rvi are the ratings of users u and v on item i, and |Iuv| is the number of items in the set.

However, we only keep user pairs whose similarity (Eq 1) is larger than or equal to the threshold, filtering them by Eq 3, and consider the qualified users as the neighbors of the target user:

(3) N_u = \{\, v \in U \mid s(v, u) \ge Th,\ v \ne u \,\}

where Nu is the selected neighbor set for user u, s(v, u) is the similarity between users u and v, and Th is the threshold set by experiments.

On this basis, Eq 4 can be used to calculate the unknown rating of user u on item i:

(4) \hat{r}_{ui} = \frac{1}{|N_u|} \sum_{v \in N_u} (r_{vi} + d_{vu})

where |Nu| is the number of neighbors of user u that have rated item i, and dvu is the rating distance between user v and user u.
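Eqs 1 through 4 can be combined into a small end-to-end sketch. This is a simplified reading of the method; the user averages in the Pearson term are taken over co-rated items, and the toy data (users y, x, z mirroring the Fig 1 structure) are our own.

```python
import math

def pearson(ru, rv):
    """Eq 1 (sketch): Pearson correlation over the co-rated items I_uv."""
    common = sorted(set(ru) & set(rv))
    if len(common) < 2:
        return 0.0
    mu = sum(ru[i] for i in common) / len(common)
    mv = sum(rv[i] for i in common) / len(common)
    num = sum((ru[i] - mu) * (rv[i] - mv) for i in common)
    den = (math.sqrt(sum((ru[i] - mu) ** 2 for i in common))
           * math.sqrt(sum((rv[i] - mv) ** 2 for i in common)))
    return num / den if den else 0.0

def distance(ru, rv):
    """Eq 2 (sketch): d_vu, the mean difference r_ui - r_vi over co-rated items."""
    common = set(ru) & set(rv)
    return sum(ru[i] - rv[i] for i in common) / len(common)

def dss_predict(u, i, ratings, Th=0.0):
    """Eqs 3-4 (sketch): average r_vi + d_vu over similarity-filtered neighbors."""
    terms = [ratings[v][i] + distance(ratings[u], ratings[v])
             for v in ratings
             if v != u and i in ratings[v]
             and pearson(ratings[u], ratings[v]) >= Th]
    return sum(terms) / len(terms) if terms else None

# Toy data shaped like the Fig 1 example: x correlates with y, z anti-correlates.
ratings = {
    "y": {"i": 2.0, "k": 3.0},
    "x": {"i": 1.75, "k": 2.75, "j": 1.5},
    "z": {"i": 3.0, "k": 2.0, "j": 4.0},
}
# Only x survives the similarity filter, so y's prediction on j is 1.5 + 0.25 = 1.75.
```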

Experiment

Experimental data set

We use the MovieLens data set (ML-25M, available at https://grouplens.org/datasets/movielens/25m/) in our experiments, which describes 5-star rating and free-text tagging activity from a movie recommendation service [52]. It contains 25,000,095 ratings and 1,093,360 tag applications across 62,423 movies. Ratings range from 0.5 to 5 in steps of 0.5. The data set contains 162,541 users, of which we randomly selected 3,000 for the experiment to save computational resources and time. This subset contains 431,783 ratings, and we use only the rating data.

Additionally, a 10-fold cross-validation approach is used to assess how well the suggested strategy generalizes. The data set is randomly divided into ten similar-sized groups. Each experiment uses one of the groups as the test set, while the other nine serve as the training set. The average over all ten experiments is reported as the final result.
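The splitting protocol above can be sketched as follows; the record layout and seed are illustrative assumptions, not from the paper.

```python
import random

def ten_fold_splits(records, seed=0):
    """Randomly partition rating records into 10 similar-size folds; each fold
    serves once as the test set while the other nine form the training set."""
    records = list(records)
    random.Random(seed).shuffle(records)
    folds = [records[k::10] for k in range(10)]
    for k in range(10):
        test = folds[k]
        train = [r for j, f in enumerate(folds) if j != k for r in f]
        yield train, test

# 100 synthetic (user, item, rating) records split into 10 folds of 10 each.
data = [("u%d" % (n % 7), "i%d" % n, 0.5 + 0.5 * (n % 10)) for n in range(100)]
splits = list(ten_fold_splits(data))
```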

Benchmark algorithms

We compare the proposed DSS algorithm to average rating (AverageRating), Slope One algorithm, cosine similarity collaborative filtering (Cosine), resource allocation collaborative filtering (RA), user opinion spreading (UOS), and multi-level collaborative filtering (MLCF) to demonstrate the effectiveness of DSS. The main ideas of these algorithms used for comparison are described below.

  1. The user’s average rating (AverageRating) predicts the target user’s rating based on his average rating on all other items. If there are no ratings of the target user in the training set, the method takes the average rating of all other users as the result.
  2. Cosine similarity collaborative filtering (Cosine) [53] finds the similarity between two rating vectors by measuring the cosine of the angle between them. The vectors contain the ratings from two users on shared items.
  3. Resource allocation collaborative filtering (RA) [54, 55] applies the concept of link prediction in the recommendation system to improve performance. Popular products should have less impact when determining user similarity: because most users like popular items, sharing them says little about users’ specific tastes. RA is a popularity-based measure of local similarity, and its value for two users is determined by the degree of their commonly rated items. The more popular a commonly rated movie is, the lower its contribution to the RA similarity index. Furthermore, as the number of shared rated items grows, the value of RA grows, and the estimated similarity becomes more reliable.
  4. User opinion spreading (UOS) [56] algorithm is a combination of collaborative filtering algorithm and user opinion dissemination process. If two users have the same positive or negative view on the same item, their opinions are consistent, and the great similarity between them indicates that their tastes are similar. On the other hand, the similarity is low if the two people have opposing viewpoints on the same items. In other words, users share their opinions on items that have been jointly assessed in the user-item bipartite graph model, and UOS measures user similarity based on the attitudes of the items that users have rated in common.
  5. Multi-level collaborative filtering (MLCF) [57] algorithm is designed to improve the similarity method, divide the Pearson similarity into different levels, and impose different constraints, so as to improve the accuracy of the classic collaborative filtering algorithm.

Performance criteria

We evaluate the designed algorithm’s performance in six dimensions. The similarity threshold for filtering is established first. Second, we look at the error level of the rating prediction. Third, after generating the user’s Top-k recommendation list, we examine the ranking performance of the user’s favorite items in the list. Fourth, we evaluate the diversity of the recommendation lists. Fifth, the prediction of users’ likes or dislikes is verified by comparison. Finally, we measure the scalability of the proposed algorithm by comparing the number of neighbors used in the prediction and the time spent on the computation.

Similarity selection threshold.

According to Eq 3, each user has a group of neighbors at varying distances, and the degree of similarity between the user and his neighbors varies. Neighbors with a high degree of similarity are worthy of reference, while neighbors with a low degree of similarity not only provide no reference but also degrade prediction accuracy.

Therefore, we design experiments to determine the proper value of Th ∈ [−1.0, 1.0], as shown in Fig 2. The results suggest that the appropriate similarity threshold is Th = 0.0, at which the proposed DSS achieves its optimal prediction accuracy, with a mean absolute error of 0.667 and a root mean square error of 0.874. Although the accuracy improvement appears trivial in Fig 2, the number of neighbors used in the prediction is much lower (details presented in Fig 12).

Fig 2. Comparison of errors generated by different similarity selection thresholds under the same experimental conditions (mean absolute error and root mean square error on MovieLens with selected users).

https://doi.org/10.1371/journal.pone.0271891.g002

Measure errors in predicting the ratings.

Our work uses mean absolute error (MAE) [5], root mean square error (RMSE) [58] to evaluate the error of rating prediction. The MAE reflects the deviation between the algorithm’s predicted ratings and the user’s actual ratings, while the RMSE represents the root mean square value of the deviation between the predicted user ratings and the actual ratings.

It is important to note that MAE (Eq 5) averages the absolute errors, while RMSE (Eq 6) squares the errors before averaging and taking the root, which widens the error gap and penalizes predictions with large errors. The lower these two measures are, the smaller the prediction error and the better the prediction performance of the algorithm.

(5) MAE = \frac{1}{n} \sum_{u,i} |\hat{r}_{ui} - r_{ui}|

(6) RMSE = \sqrt{\frac{1}{n} \sum_{u,i} (\hat{r}_{ui} - r_{ui})^2}

where n is the number of predictions in the testing set, the user's predicted rating is represented by \hat{r}_{ui}, and the user's actual rating is represented by rui.
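Both error measures are a few lines of code; the sample ratings below are hypothetical and chosen so that one large error shows how RMSE penalizes it more than MAE.

```python
import math

def mae(predicted, actual):
    """Eq 5: mean absolute error of the predictions."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

def rmse(predicted, actual):
    """Eq 6: root mean square error; squares the errors first, so large
    deviations are penalized more heavily than in MAE."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(predicted))

predicted = [3.5, 4.0, 2.0, 5.0]   # hypothetical predicted ratings
actual    = [3.0, 4.5, 2.0, 3.0]   # corresponding ratings in the test set
# MAE = 0.75, while RMSE = sqrt(1.125), larger because of the single 2.0 error
```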

The MAE of all algorithms is shown in Fig 3. The proposed DSS algorithm achieves an MAE of 0.6647, the highest prediction accuracy and best performance among the compared methods. Slope One has an MAE of 0.6707, so DSS reduces the error by 0.9 percent; compared with all other algorithms, the MAE of DSS is improved by 2 percent on average.

Fig 3. MAE comparison results of DSS and other algorithms.

The smaller the result, the smaller the prediction error of the algorithms on the rating.

https://doi.org/10.1371/journal.pone.0271891.g003

As illustrated in Fig 4, the RMSE also shows the extent of the recommender system’s prediction error and, compared to MAE, is more sensitive to larger errors. With a value of 0.8766, the DSS algorithm has the smallest result; the RMSE of the Slope One method, 0.8840, is the second best. DSS lowers the error by 0.83 percent, and its RMSE is improved by 2.3 percent on average compared with all other algorithms.

Fig 4. RMSE comparison results of DSS and other algorithms.

The smaller the result, the smaller the prediction error of the algorithms on the rating.

https://doi.org/10.1371/journal.pone.0271891.g004

Ranking of Top-k recommendations.

When making Top-k recommendations, it is not enough to predict the unknown user-item ratings; the recommendation system must also filter the user’s favorite items based on the prediction results and generate a list of recommendations for these items in descending order of the user’s preferences.

We utilize the average rating of each user in the training set as the threshold for whether the user likes or dislikes a certain item. For instance, suppose the average rating of user u in the training set is 3.5; then user u likes item i when the rating rui is 4.5 and dislikes item i when rui is 3. The recommender system selects the items predicted to be preferred by users and sorts them in descending order of predicted rating.

Under the experimental protocol of this paper, the number of rated items differs for each user in each trial. If user u has ku ratings on different items in the testing set of a trial, the algorithm predicts each of these ratings, and only the items whose predicted ratings exceed the user’s average rating in the training set are chosen as Top-k suggested items.

In our work, we use half-life utility index (HLU) and normalized discounted cumulative gain (NDCG) to evaluate the ranking performance of Top-k suggested items [5].

HLU (Eq 7) measures the practical utility of the recommendation list for users:

(7) HLU = \sum_{j=1}^{m} \frac{\max(r_{u,i_j} - d,\ 0)}{2^{(j-1)/(h-1)}}

where m is the number of items in the recommendation list, r_{u,i_j} represents the rating of user u on the movie ranked at position j, d is the default rating (such as the average rating), and h is the system’s half-life; the value of h in our experiments is set to 2. The larger the value of HLU, the better the ranking of the recommendation list.
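A small sketch of the half-life utility in its common literature form (our reconstruction; the paper's exact variant may differ slightly): each rank position j is discounted by 2**((j-1)/(h-1)), and ratings at or below the default rating d contribute nothing.

```python
def hlu(list_ratings, d, h=2):
    """Half-life utility of one recommendation list (sketch of Eq 7)."""
    return sum(max(r - d, 0.0) / 2 ** ((j - 1) / (h - 1))
               for j, r in enumerate(list_ratings, start=1))

# With default rating d = 3.0 and half-life h = 2, ranking liked items first
# scores higher than the reversed ordering of the same ratings.
front_loaded = hlu([5.0, 4.0, 3.0], d=3.0)   # 2/1 + 1/2 + 0   = 2.5
back_loaded  = hlu([3.0, 4.0, 5.0], d=3.0)   # 0   + 1/2 + 2/4 = 1.0
```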

HLU reflects the user’s level of interest in the recommended list, as seen in Fig 5; the higher the score, the more interested the user is in the top of the list. The DSS method scores 1.050, while the Slope One algorithm scores 1.021, an increase of 2.8 percent. Compared with the other algorithms, the HLU of DSS is 5.5 percent higher on average.

Fig 5. HLU comparison results of DSS and other algorithms.

The larger the result, the higher the probability that the algorithm give a recommendation list with the items that the user likes at the top.

https://doi.org/10.1371/journal.pone.0271891.g005

NDCG (Eq 9) reflects the fact that ranking the user’s favorite movies first in the recommended list improves the user’s experience: the more relevant the items and the higher they rank, the higher the user’s satisfaction with the system.

(8) DCG = \sum_{i=1}^{k} \frac{R_i}{\log_2(i + 1)}

(9) NDCG = \frac{DCG}{IDCG}

where DCG is calculated from the lists of real and predicted ratings of each user, and Ri indicates whether the movie ranked at position i is liked by the user: Ri = 1 indicates that the user likes the movie, and Ri = 0 means that the user does not. NDCG, given by Eq 9, normalizes DCG by the DCG of the ideal ordering (IDCG), bringing the value range to between 0 and 1. A high NDCG usually indicates a favorable item suggestion order, as well as a stronger ability of the algorithm to rank items in the list.
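The binary-relevance form of the metric can be sketched as follows (a common formulation; our reading of the paper's definition):

```python
import math

def dcg(relevances):
    """Eq 8 (sketch): gain R_i discounted by log2 of the rank position."""
    return sum(r / math.log2(pos + 1)
               for pos, r in enumerate(relevances, start=1))

def ndcg(relevances):
    """Eq 9: DCG divided by the DCG of the ideal (sorted) ordering,
    normalizing the score into [0, 1]."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Liked items (R_i = 1) ranked first give NDCG = 1.0; demoting them lowers it.
```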

The results of NDCG are shown in Fig 6. The NDCG shows the recommendation list’s ranking performance, and the higher the score, the better. The DSS method has the largest score of 0.755, about 1.3 percent higher than the Slope One algorithm at 0.745. Compared with the other algorithms, the NDCG of DSS is 2.8 percent higher on average.

Fig 6. NDCG comparison results of DSS and other algorithms.

The larger the result, the higher the probability that the algorithm give a recommendation list with the items that the user likes at the top.

https://doi.org/10.1371/journal.pone.0271891.g006

Diversity of recommendations.

The average similarity between items in a recommendation list is commonly used in the domain to measure the list’s diversity, and diversity in a recommendation list is beneficial to the user experience. According to this hypothesis, the smaller the average similarity of a recommendation list, the greater its diversity [5].

In our work, we propose that a diverse recommendation list should include both low degree and high degree items because the degree of items in recommender systems is frequently used to measure their popularity [59].

Based on this assumption, the ratio of the standard deviation of the item degrees in a recommendation list to the mean of the item degrees can be used to measure the diversity of the recommendation list, i.e., which popular items and niche items are included in the recommendation list.

Therefore, the standard deviation dispersion ratio is proposed to measure the diversity of the recommendation list. The average diversity of the lists created for all users can be calculated using Eqs 10–12 for a given training and test set:

(10) D = \frac{1}{|L|} \sum_{l \in L} \frac{\sigma(l)}{\bar{k}(l)}

(11) \sigma(l) = \sqrt{\frac{1}{|l|} \sum_{i \in l} (k_i - \bar{k}(l))^2}

(12) \bar{k}(l) = \frac{1}{|l|} \sum_{i \in l} k_i

where L is the set of recommendation lists for all users in the test set, σ(l) is the standard deviation of item degrees in a list l, |l| is the size of the list, i is an item in the recommendation list, ki is the degree of item i, indicating the number of users who have rated it, and \bar{k}(l) is the average degree of the items in the list.
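The ratio can be sketched directly from Eqs 10 through 12; the degree lists below are hypothetical and only illustrate that mixing popular (high-degree) and niche (low-degree) items raises the score.

```python
import math

def dispersion_ratio(degrees):
    """Eqs 11-12 (our reading): population standard deviation of the item
    degrees in one recommendation list divided by their mean degree."""
    mean = sum(degrees) / len(degrees)
    sd = math.sqrt(sum((k - mean) ** 2 for k in degrees) / len(degrees))
    return sd / mean

def diversity(rec_lists):
    """Eq 10: the dispersion ratio averaged over all users' lists."""
    return sum(dispersion_ratio(l) for l in rec_lists) / len(rec_lists)

mixed   = diversity([[1000, 10, 5, 800]])   # popular and niche items together
uniform = diversity([[500, 500, 500, 500]]) # identical degrees, ratio is 0
```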

The standard deviation dispersion ratio, as illustrated in Fig 7, indicates the diversity of the recommendation list: the more dispersed the distribution of degrees in a recommended list, the higher the diversity rating. As can be observed from the graph, the proposed DSS has a value of 0.791, slightly lower than all the other approaches; Slope One is 0.1 percent higher, at 0.792. The highest diversity is provided by Cosine CF, which also has the largest prediction errors in Fig 3. The results highlight the frequent tension between prediction accuracy and recommendation diversity in recommender systems.

Fig 7. Diversity comparison results of DSS and other algorithms.

The larger the result, the more disperse the degree distribution of items in the recommendation list given by the algorithm, and the better the diversity of the recommendation list.

https://doi.org/10.1371/journal.pone.0271891.g007

Area under curve and accuracy.

As stated in the preceding section, when the recommendation algorithm predicts that a user likes an item and the test set confirms that the user does like it, the case is classified as a true positive (TP). When a user likes an item but the algorithm predicts that the user does not, the case is a false negative (FN). When a user dislikes an item and the algorithm also predicts that the user does not like it, the case is a true negative (TN). When a user dislikes an item but the algorithm predicts that the user likes it, the case is classified as a false positive (FP).

On this basis, Eqs 13–15 are given for measuring the prediction accuracy of user preference:

(13) Precision = \frac{TP}{TP + FP}

(14) Recall = \frac{TP}{TP + FN}

(15) Accuracy = \frac{TP + TN}{TP + FP + TN + FN}

where TP, FP, TN and FN stand for the number of true positives, false positives, true negatives and false negatives, respectively.
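The three measures follow directly from the confusion counts; the counts used below are hypothetical, for illustration only.

```python
def precision(tp, fp):
    """Eq 13: fraction of predicted likes that the user actually likes."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq 14: fraction of the user's actual likes that were predicted."""
    return tp / (tp + fn)

def accuracy(tp, fp, tn, fn):
    """Eq 15: fraction of all preference predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)

tp, fp, tn, fn = 70, 30, 60, 40   # hypothetical confusion counts
# precision = 0.70, recall = 70/110, accuracy = 130/200 = 0.65
```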

By changing the threshold by which the algorithm determines whether a user likes an item, we obtain the receiver operating characteristic (ROC) curve in Fig 8. The area under the curve (AUC) of DSS is the largest between false positive rates 0.0 and 0.3, the third largest between 0.3 and 0.8375, and the second largest between 0.8375 and 1.0. Overall, the area of DSS is the second largest, second only to RA.

Fig 8. Comparison of receiver operating characteristic curve (ROC).

The area under the curve (AUC) of DSS is the largest between false positive rates 0.0 and 0.3, the third largest between 0.3 and 0.8375, and the second largest between 0.8375 and 1.0, though the margin of the leading algorithm at each stage is very small. For example, DSS has a 1.67 percent advantage over RA at FPR = 0.19 (false positive rate).

https://doi.org/10.1371/journal.pone.0271891.g008
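The threshold-sweeping procedure behind the ROC curve can be sketched as follows: sweep a "like" threshold over the predicted scores, record the (FPR, TPR) point at each threshold, and integrate with the trapezoidal rule. This is an illustrative implementation under our own naming, not the code used in the experiments; it assumes both liked and disliked items are present.

```python
def roc_auc(scores, liked):
    """Build ROC points by sweeping a 'like' threshold over predicted
    scores, then return the area under the curve via the trapezoidal rule.

    `scores` are predicted ratings; `liked` are the ground-truth labels
    (1 = user likes the item, 0 = user does not)."""
    pos = sum(liked)
    neg = len(liked) - pos
    points = [(0.0, 0.0), (1.0, 1.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, liked) if s >= t and l)
        fp = sum(1 for s, l in zip(scores, liked) if s >= t and not l)
        points.append((fp / neg, tp / pos))
    points.sort()
    # Trapezoidal integration over consecutive (FPR, TPR) points
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```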

Fig 9 shows the Precision results, as defined by Eq 13. DSS achieves Precision = 0.7107, compared with 0.7060 for Slope One, an improvement of 0.67 percent. From the definition, it can be concluded that the DSS algorithm produces fewer false positives on the testing set than the comparative algorithms, while also achieving low prediction error and higher ranking accuracy.

Fig 9. Precision results of different algorithms.

The larger the result, the fewer false positives the algorithm has in predicting user preferences, and the less likely the algorithm is to put items that the user does not like into the recommendation list.

https://doi.org/10.1371/journal.pone.0271891.g009

The comparison of Recall, defined by Eq 14, is presented in Fig 10. DSS has Recall = 0.7470, compared with Slope One's 0.7454, a difference of 0.21 percent. In contrast to Precision, it can be concluded that the DSS algorithm has more false negative cases on the testing set than RA, UOS and MLCF. The recall of DSS indicates that the algorithm tends to predict ratings lower than the actual ones, whereas RA and UOS tend to predict ratings higher.

Fig 10. Recall results of different algorithms.

The larger the result, the fewer false negatives the algorithm has in predicting the user’s preferences, and the less likely the algorithm is to give a recommendation list that leaves out the items the user likes.

https://doi.org/10.1371/journal.pone.0271891.g010

The Accuracy results, defined by Eq 15, are presented in Fig 11. DSS has Accuracy = 0.7165, compared with 0.7124 for Slope One, an improvement of 0.58 percent. From the definitions, it can be concluded that the DSS algorithm has a higher overall accuracy in predicting user preferences.

Fig 11. Accuracy of different algorithms.

The larger the result, the more accurate the algorithm is in predicting user preferences.

https://doi.org/10.1371/journal.pone.0271891.g011

Scalability of algorithms.

Scalability concerns the ability of different algorithms to cope with increasing amounts of data in real applications. We examine the number of neighbors used in prediction and the computation time of each method. Because the neighbors relied on at prediction time are often cached in a real system, the fewer neighbors required, the more storage space is saved in practice.

Figs 12 and 13 show the number of neighbors actually used in prediction against the number of neighbors requested in the algorithm parameters. We test both 3000 and 1000 randomly selected users. Compared to the classical distance-based algorithm, the number of neighbors used by DSS is reduced by 29.6 percent and 26.5 percent, respectively. When there are 1000 users in the data set, DSS makes its predictions with fewer neighbors than all other algorithms; the fewer users in the training set, the fewer neighbors DSS uses. In practical applications, the DSS algorithm can further improve scalability by limiting the number of users involved in the calculation.

Fig 12. The actual number of neighbors used in prediction (y-axis) against the number of neighbors that algorithms plan to select (x-axis).

The number of users selected for experiments is 3000.

https://doi.org/10.1371/journal.pone.0271891.g012

Fig 13. The actual number of neighbors used in prediction (y-axis) against the number of neighbors that algorithms plan to select (x-axis).

The number of users selected for the experiment is 1000.

https://doi.org/10.1371/journal.pone.0271891.g013

The DSS algorithm, the second fastest of the six approaches, finishes in a very short time owing to the simplicity of distance-based methods and the similarity selection performed before prediction. Because similarity filtering removes the less important neighbors, DSS employs substantially fewer neighbors in its calculation, making it even faster than Slope One. The detailed results are shown in Table 1.

Table 1. The time it takes to compute each method (in minutes) with identical hardware and parameters.

https://doi.org/10.1371/journal.pone.0271891.t001

Conclusions

To overcome the problem that distance-based recommendation algorithms such as Slope One use too many neighbors in prediction, we combine distance-based and similarity-based collaborative filtering. In this study, users with similarity below a threshold are removed from the set of neighbors of the prediction target, and the selection of thresholds for user rating distance and user similarity in recommender systems is examined in detail. Using the inter-user rating distance to estimate the target user’s ratings of unknown items, a novel algorithm for user-item link prediction in recommender systems is devised based on similarity selection.
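The approach described above can be sketched roughly as follows. This is a simplified, unweighted illustration under our own naming: neighbors below a similarity threshold are discarded, and the remaining neighbors' ratings are adjusted by the average inter-user rating distance on co-rated items. The paper's exact distance measure, weighting scheme, and threshold tuning are not reproduced here.

```python
from statistics import mean

def predict_rating(ratings, target_user, item, sim, threshold=0.0):
    """Distance-based prediction with similarity selection (sketch).

    `ratings` maps user -> {item: rating}; `sim(u, v)` is any user
    similarity (e.g., Pearson). Users with similarity below `threshold`
    are removed before prediction; each retained neighbor's rating is
    shifted by the average rating deviation (user-score distance)
    between the target user and that neighbor on co-rated items.
    Returns None when no usable neighbor remains."""
    preds = []
    for v, v_ratings in ratings.items():
        if v == target_user or item not in v_ratings:
            continue
        if sim(target_user, v) < threshold:
            continue  # similarity selection: drop weakly/negatively correlated users
        common = set(ratings[target_user]) & set(v_ratings) - {item}
        if not common:
            continue
        # average score distance between the two users on co-rated items
        dev = mean(ratings[target_user][j] - v_ratings[j] for j in common)
        preds.append(v_ratings[item] + dev)
    return mean(preds) if preds else None
```

With `threshold = 0.0`, negatively correlated users are excluded from the neighbor set, which is the filtering effect credited below for the accuracy gains.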

In comparison to the original algorithm, the proposed method reduces the computational complexity, the number of neighbors required in the prediction process, and the prediction errors, and improves the rationality of the recommendation list ranking by 1 percent. In our understanding, the main reason the proposed algorithm improves prediction accuracy and recommendation performance is that negatively correlated neighbors can only play a negative role in prediction, so removing them has a positive effect on both prediction and recommendation.

However, the experiments also show that DSS, the distance model-based algorithm designed in this paper, improves prediction accuracy and reduces the number of neighbors used at the cost of a decrease in recommendation diversity and recall.

Therefore, our work can be extended in the following directions. The first is to consider how to improve recall for better preference prediction. The second is to explore what other information in the data set can be used in the distance-based model. The last is to study the effect of system features, such as the long-tail distribution of ratings, on prediction.

Acknowledgments

Zhan Su and Jun Ai would like to express their love to Lingyi Ai and thank her for inspiring us to keep fighting.
