Ranking Reputation and Quality in Online Rating Systems

How to design an accurate and robust ranking algorithm is a fundamental problem with wide applications in many real systems. It is especially significant in online rating systems due to the existence of some spammers. In the literature, many well-performed iterative ranking methods have been proposed. These methods can effectively recognize the unreliable users and reduce their weight in judging the quality of objects, and finally lead to a more accurate evaluation of the online products. In this paper, we design an iterative ranking method with high performance in both accuracy and robustness. More specifically, a reputation redistribution process is introduced to enhance the influence of highly reputed users and two penalty factors enable the algorithm resistance to malicious behaviors. Validation of our method is performed in both artificial and real user-object bipartite networks.

Introduction.-Science is not a monolithic movement, but rather a complex enterprise divided in a multitude of fields and subfields, many of which enjoy rapidly increasing levels of activity [1,2].Even sub-disciplines have grown so broad that individual researchers cannot follow all possibly relevant developments.Despite swift growth of online scientific communities (such as Research-Gate, Mendeley, Academia.edu,VIVO, and SciLink) [3] which facilitate social contacts and exchange of information, finding relevant papers and authors still remains a daunting task, especially in lively research fields.
More generally, reliance of the modern society on computer-mediated transactions has provoked extensive research of reputation systems which compute reputation scores for individual entities and thus reduce the information asymmetry between the involved parties [4,5].Perhaps more important than the immediately useful information is the proverbial shadow of the future-incentives for good behavior and penalties against offenses-generated by these systems [6,7].Reputation systems are now an organic part of most e-commerce web sites [8] and question & answer sites [9].Complex networks [10] have provided a fruitful ground for research of reputation systems with PageRank [11,12] and HITS [13] being the classical examples.In [14], the authors extended HITS by introducing authority score of content providers and apply the resulting EigenRumor algorithm to rank blogs.Building on a bipartite version of HITS [15], [16] presents a QTR algorithm suited for online communities.This algorithm co-determines item quality and user reputation from a multilayer network which consists of a bipartite user-item network and a monopartite social network.
We propose here a reputation algorithm designed especially for online scientific communities where researchers share relevant papers.We first simplify the QTR algorithm by neglecting the social network among users.This simplification reflects the fact that trust relationships are often not available and allows us to better study the algorithm's output with respect to the remaining parameters.We then extend the algorithm by introducing author credit which is however computed differently than in the previously-mentioned EigenRumor.Since author credit is co-determined from the same data as paper quality and user reputation, this extension preserves an important advantage of QTR: Its reliance on implicit ratings which are easier to elicit than explicit ratings [8].We use various standard metrics of research productivity (citation count, impact factor, and h-index) to demonstrate that the new algorithm outperforms other state-of-the-art algorithms.
Algorithms. -An online community is assumed to consist of N users and M items (papers or other sort of scientific artifacts) which are labeled with Latin and Greek letters, respectively.The community is represented by a bipartite user-item network W where a weighted link between user i and item α exists if user i has interacted with p-1 item α.Link weight w iα is decided by the type of interaction between the corresponding user-item pair and reflects the level of importance or intensity of the interaction.It is convenient to introduce an unweighted user-item network E where e iα = 1 if w iα > 0 and e iα = 0 otherwise.The corresponding unweighted user and item degree are denoted as k i and k α , respectively.
We first introduce a bipartite variant of the classical HITS algorithm, biHITS, which assigns reputation values R i to user nodes and quality values Q α to item nodes.The algorithm's definitory equations are where R and Q are user reputation and item quality vector, respectively.Solution to this set of equations is usually found by iterations.Starting with R and then normalized so that R 2 and Q 2 remain one.We stop the iterations when the sum of absolute changes of all vector elements in R and Q is less than 10 −8 .If E represents a connected graph, the solution is unique and independent of R α [13].A weighted bipartite network can be incorporated in the algorithm by replacing the binary matrix E with the matrix of link weights W.
We now simplify the QTR algorithm [16] by omitting Trust among the users-we refer it as the QR algorithm hence.Its definitory equations are where are the average quality and reputation value, respectively.The algorithm is further specified by the choice of θ Q , θ R , ρ Q , ρ R which all lie in the range [0, 1].In particular, θ Q decides whether item quality is obtained as a sum (when θ Q = 0) or average (when θ Q = 1) over reputation of users connected with a particular item; the meaning of θ R is analogous.By contrast, ρ Q decides whether interactions with items of inferior quality harm user reputation (when ρ Q > 0) or not (when ρ Q = 0); the meaning of ρ R is analogous.Solution of Eqs.(3) can be again found iteratively.When θ Q , θ R , ρ Q , ρ R are all zero, ER differs from biHITS only in using the weighted matrix W instead of E.
Algorithms with author credit.HITS-like algorithms that rely only on user feedback have two limitations.First, an item can only score highly after sufficient feedback has accumulated which can require substantial time in practice.Second, an item can attract the attention of users for quality-unrelated reasons (by a witty or provoking title, for example) and the algorithms lack mechanisms to correct for this.EigenRumor algorithm (ER) responds to this by introducing scores for "information providers" [14] which we refer to as author credit here.While this algorithm originally includes only two sets of entities-blog entries and blog authors-it can be easily adapted to our case where users, papers, and authors are present.The bipartite author-paper network can be represented by matrix P whose elements p mα are 1 if author m has (co)authored paper α and 0 otherwise (m = 1, . . ., O where O is the number of authors).Author and paper degree in this network are d m and d α , respectively.Denoting the vector of author credit values as A, the equations of EigenRumor are an extension of Eq. ( 1), where parameter ω ∈ [0, 1] determines the relative contribution of authors and users to paper quality.As noted in [14], matrices E and P can be normalized to reduce the bias towards active users and authors.Normalization is claimed to provide good results.Since the weighted user-paper interaction matrix W contains more information than E, we use W ′ analogous to E ′ here.
To introduce author credit in the QR algorithm and thus obtain a QRC algorithm (Quality-Reputation-Credit), we extend Eqs.(3) to the form Parameter λ plays the same role as ω in EigenRumor.When λ = 0, Q α and R i are the same as obtained by QR and author credit A m is computed simply as an additional set of scores.For any other value λ ∈ (0, 1], all three quantities depend on each other as illustrated by Fig. 1. Eqs. ( 6) can be again solved iteratively.
EigenRumor and QRC, albeit similar, show numerous differences.First, QRC uses three scores as opposed to two scores used by the original EigenRumor.Second, each summation term in QRC has its own normalization exponent (θ R , θ Q , φ A , φ P ) which decides how to aggregate over multiple user actions, authored papers, or coauthors.The absence of explicit normalization in Eigen-Rumor Eqs. ( 4) is compensated by the eventual use of matrices E ′ and P ′ which makes ER's equations for R i and A m similar (up to a different value of exponent) to those of QRC.However, the ER's equation for Q α is based on (E ′ ) T and (P ′ ) T which implies terms without counterparts in Eq. ( 6).
Model evaluation on artificial data.-We now describe an agent-based system [17] which aims at producing data that can be analyzed by the benchmark QR algorithm.We aim to evaluate the algorithm's performance by comparing the true values of quality and reputation with those produced by the algorithm.
In the agent-based system, each user i is endowed with intrinsic ability a i and activity level ν i , whereas each item α is endowed with intrinsic fitness f α .We assume that able users (those with high a i ) preferentially connect with high-quality items (those with high f α ).Ability and activity values are both defined in [0, 1] and drawn from the distribution p(x) = µ x µ−1 where µ ∈ (0, 1] adjusts the mean value x = µ/(µ + 1) as well as the fraction of ability/activity values above 1/2 which is 1 − 2 −µ .
The system evolves in discrete time steps.At each step, user i becomes active with probability ν i .In that case: 1.With probability p U , user i uploads new item α to the system.The item's fitness f α depends on the user's ability as , where x is a random variable drawn from U[0, X].
2. Downloads two items.The probability of choosing item α yet uncollected by user i is assumed proportional to (f α ) hai where h > 0.
We assume N to be fixed (no new users join the community).The number of items thus grows with simulation step t approximately as M (t) = N ν p U t and the number of links as In our simulations, we set µ = 1/2 so that only 30% of users have ability/activity larger than 1/2.We set X = 1/2 which means that despite some level of randomness, ability of a user and fitness of items submitted by them are still related.We set h = 5 so that users with abilclose to 1 are unlikely to accept items of low fitness (by contrast, users with zero ability accept items regardless of their fitness).Finally, we set N = 1000 and p U = 0.1 which implies network density η ≈ 2% which is similar to the values seen in real systems (while density is lower for the real data that we study here, user-item networks corresponding to the classical Movielens and Netflix datasets are of a higher density [18,Ch. 9]).We present results obtained with t = 200 which corresponds to M ≈ 6, 700 items, k i = 140, and k α = 21.Link weights assigned to uploads and downloads are W up = 1 and W down = 0.1 which reflects that uploading a new item is considered to be more demanding than downloading and thus deserves more reward.The influence of individual parameters on results is discussed later in this section.
To evaluate the quality and reputation estimates obtained with the algorithm, we compute the Pearson correlation coefficient between the estimated values and their true values used in the agent-based simulation: c Qf for items and c Ra for users.To assess the bias of results towards old items and active users, we measure c Qt and c Rν , respectively.While high correlation values are desirable for the first two quantities, values close to zero are preferable for the other two.
Results on artificial data.Results for the QR setting corresponding to biHITS and two other well-performing settings are shown in Tab. 1.We see that scores obtained with biHITS correlate least with user ability and item quality and are at the same time biased towards old items and, even more, active users.BiHITS is therefore not a suitable algorithm for situations where item age and user activity are heterogeneous, which is often the case in real systems [19,20].While the problem of correlations between quality estimates and item age is mitigated by aging which is present in most systems of this kind [21], high correlation between user activity and reputation requires additional normalization of the biHITS algorithm as done, for example, by EigenRumor or QR.
For QR, we evaluated all 16 possible choices of parameters (two possible values-zero or one-for all four of them).The setting where θ Q = 0 and θ R = ρ Q = ρ R = 1 is the only one which is numerically unstable and thus no reportable results were obtained for it.Configurations producing the best results (see Tab. 1) share two parameter values: θ Q = 0 and ρ R = 0. That's not surprising as θ Q = 1 would mean that popular items are not favored over unpopular ones and ρ R = 1 would mean that items are "punished" when users of low reputation connect with them.Settings QR1 and QR2 both achieve low correlation between reputation estimates and user activity which is due to θ R = 1 (i.e., user reputation is computed as an average over user actions).The choice of ρ Q = 1 gives QR2 an advantage over QR1 in all four correlation metrics which means that it is indeed beneficial to punish users for uploading or downloading inferior content.The only quantity in which QR1 and QR2 perform badly is c Qt which is strongly negative for both but, as we already said, this is likely to be improved in real systems where aging of items results in eventual saturation of their degree growth.
We conclude with a discussion of the influence of system parameters on the results.The shape of user acceptance probability is determined by h.QR's performance improves with h and eventually saturates at h ≃ 5. Parameters µ and X regulate the fraction of able and active users and the resulting distribution of item fitness.Our choice µ = 0.5 and X = 0.5 results in able/active users being a minority and the fitness distribution being rather uniform.While X is not decisive for the algorithm's performance (though, smaller values of X generally lead to better results), µ is crucial as having too few able/active users makes it impossible to detect quality content.On the other hand, if able users are many, the aggregate judgment is good enough and there is no need for a sophisticated algorithm.Network sparsity η is not particularly important as long as it is not too small (then there is too little information in the system) or too large (if every item is connected to almost all users, the presence of a link loses its information value).Finally, QR results depend only on the ratio ξ := W down /W up of the algorithm's parameters W down and W up .When ξ 10 −2 , download links are of little importance and the bipartite network effectively becomes very sparse to the detriment of the QR's performance.When ξ ∼ 1, the performance deteriorates as well because upload information is almost neglected (note that there are many more downloads than uploads).Our original choice ξ = 0.1 is nearly optimal.
Model evaluation on real data.-Any algorithm needs to be ultimately tested by its performance on real data.To this end, we use data obtained from the Econophysics Forum (EF, see www.unifr.ch/econophysics/)which is an online platform for interdisciplinary physics researchers and finance specialists.
Data description and analysis.From all possible actions, we consider only interactions between users and papers uploaded to the web site: Every user can upload a paper to the site, download a paper, and view a paper's abstract.To obtain the data, we analyzed the site's weblogs created from 6th July 2010 to 31st March 2013 (1000 days in total).We removed entries created by web bots (which cause approximately 75% of the site's traffic) and all papers uploaded before 6th July 2010 (for which we do not have the full record of user actions).To increase the data density, we removed the users who did not upload any papers and had only one action in total.In the case of a user repeatedly interacting with a given paper, only the earliest interaction was considered.Other approaches, such as cumulating all interactions or preferring paper downloads over abstract views, for example, result in inferior performance of QR.This choice is further motivated by the fact that the first interaction does best represent the user's interest: Papers that really capture users' attention are downloaded/read immediately when encountered, whereas a later download indicates other reasons of interest.The final input data contains 5071 users, 844 papers and 29748 links, implying η ≈ 0.7%.Note that the Econophysics Forum has an editor who has uploaded 85% of all papers in the analyzed sample.Paper metadata includes paper submission time, title, and a list of its authors.To avoid the problem of an author's name represented in multiple ways (e.g., 'H.Eugene Stanley' vs 'H.Stanley' vs 'HE Stanley'), we use only the first initial without comma and the surname ('H Stanley').As a result, there are 1527 authors in the analyzed sample.The paper metadata was augmented by citation counts, which were obtained from Google Scholar on 12th December 2013, and by impact factors of the journals where papers were eventually published.We shall use this external information to evaluate rankings of papers produced by various algorithms.
Figure 2 shows cumulative degree distributions for all involved parties: Users, papers, and authors.All distributions are broad and some of them might even pass statistical tests for power-law distributions.As a result, while 92% of users have ten actions in total or less, the most active users downloaded or viewed roughly a hundred of papers.With respect to the time span of the data, this is still a human level of activity which suggests that our removal of automated access was reasonably successful.The degree distribution of papers is shifted to the right as a whole with a negligible number of papers downloaded or viewed less than ten times and the most successful papers being of interest to hundreds of users.The most active authors are well-recognized in the econophysics community: Jean-Philippe Bouchaud, Shlomo Havlin, Dirk Helbing, Didier Sornette, and Eugene Stanley (in alphabetical order) have all authored more than 15 papers in the sample.
Results on real data.To distinguish the three different actions (upload, download, and abstract view), we set the respective link weights W up = 1, W down = 0. most demanding activity and viewing an abstract signalizes paper quality less than its direct download.We begin our analysis by inspecting algorithms without author credit: Random ranking of papers (RAND), popularity ranking (POP), where popularity is measured by the number of downloads, and biHITS.The average characteristics of top twenty papers according to these and other methods are summarized in Tab. 2. The expected bias towards old papers is clearly visible for the POP ranking whose top papers are on average 8 months older than RAND papers.While mean citation count of popular papers exceeds that of random papers, two of the most popular papers have never been published and four have not been cited to date: Wisdom of the crowd appears to be no good guide here.Both RAND and POP provide no information on the ranking of authors.BiHITS shows stronger bias towards old papers than POP which is not surprising as it is, ultimately, also a popularity-driven algorithm.Furthermore, it awards the Econophysics Forum editor who uploaded majority of papers with score which is so high that views and downloads by ordinary users add only small variations to the score of those papers.Even worse, papers that have not been submitted by the editor cannot reach the top of the ranking regardless of their success among the users.Thanks to normalization, the editor's weight does not represent a problem in QR1 and QR2.On the other hand, their top papers are not cited more than papers chosen by biHITS or POP.Furthermore, QR1 and QR2 choose rather popular papers and one could argue that they actually provide little new and useful information to the users.In fact, the excessive tendency of information-filtering algorithms towards popular objects is a long-standing challenge in this field [22,23].
Before analyzing ER and QRC, the parameters of QRC need to be set.We use θ , θ R , ρ Q , ρ R corresponding to QR1 which performed best on artificial data.We have also evaluated a variant of QRC based on QR2 and found that penalization of users connected to low quality papers through ρ Q = 1 leads to negative paper scores and in turn various counter-intuitive results.To avoid assigning high credit to authors of a single successful paper (beware the trap of papers with attractive titles), we use φ A = 0 which results in accumulation of author credit over the course of time.Since φ P = 0 (summing the credit of a paper's authors) gives an advantage to papers with many authors, we use φ P = 1.We have evaluated other possible choices of parameters φ P , φ A (as well as some other choices, such as paper quality contributed by the sum of credit of two most credible authors) and found that φ A = 0 and φ P = 1 indeed produce the most satisfactory results.Fig. 3 shows the average metrics of the top twenty papers obtained with QRC for λ ∈ [0, 1].As λ increases, the average submission day of papers in top 20 grows from 375 (the original QR1 value) to 519 when λ = 0; the inclusion of author credit thus helps to mitigate or even remove the time bias.The average number of downloads decreases with λ and eventually reaches less than 25% of the QR1 value.The average impact factor is improved over a wide range of λ and peaks at 3.8 for λ ≈ 0.57.The same is true for the average citation count which peaks at 34 for λ = 0.57.As can be seen in Table 2, QRC outperforms the other evaluated methods.The Mann-Whitney U test based on top 20 papers chosen by various algorithms confirms that QRC outperforms them at the significance level 0.02 with the exception of ER where, due to the small sample size and large fluctuations, significance is only 0.08.
There are two further points to make.First, top papers chosen by QRC are generally younger than those chosen by other methods and thus have had less time to accumulate citations.Second, QRC is the only method which puts "Catastrophic Cascade of Failures in Interdependent Networks" (available on arXiv under ID 1012.0206)among the top papers.This paper with mere three citations is a summer-school version of a slightly earlier identically entitled work which has accumulated almost 500 citations (it has not been submitted to the Econophysics Forum).The paper's small contribution to the overall citation count achieved by QRC thus severely underestimates the paper's true importance.In summary, QRC's overall citation count improvement is thus likely to be underestimated.
Since citation counts alone provide imperfect information about the quality of scientific work, we now turn to authors.Table 3 lists top ten authors obtained by QRC with λ = 0.57 to show that they indeed include reputed names from the field of econophysics (another outstanding person, J.-P.Bouchaud, is just below the line, on place 11, largely because of collaborating with less active thus less QRC-credible co-authors).As of December 2013, the mean h-index of QRC's top 10 authors obtained by querying the Thomson's Web of Knowledge was 41 ± 11 which is significantly more than 4 ± 2 for top 12 authors (who all have identical credit) according to EigenRumor.
Discussion. -We have proposed QRC, a new reputation algorithm for scientific online communities.QRC is based on three main components: Quality of papers, Reputation of users, and Credit of authors.We have used data from a scientific community web site, the Econophysics Forum, to evaluate the algorithm and compare its performance with that of other reputation algorithms.The newly proposed QRC algorithm outperforms those algorithms in various aspects.Papers scoring high in the resulting QRC algorithm are younger than those selected by bipartite HITS and they have been downloaded considerably fewer times than papers selected by any other algorithm considered here.At the same time, QRC's top papers have attracted significantly more citations and the average impact factors of their publication venues is also higher than for papers chosen by the other algorithms.In short, QRC is able to highlight the papers that have been largely neglected by the Econophysics Forum users (as demonstrated by their relatively low number of downloads), yet they have eventually attracted considerable attention from the scientific community (as indicated by the publication venues and the citation counts).Note that QRC introduces author credit endogenously, relying on no other information than user activity on the given web site.The observed improvements are thus not achieved by providing this algorithm with more information than what is made available to the other algorithms.Finally, QRC's top authors have on average substantially higher h-index than top authors found with other algorithms.
In the context of predicting future citation counts of papers, QRC represents an algorithm-focused alternative to machine-learning approaches [24,25].Note that the algorithm's range of applicability is not strictly limited to scientific online communities.QRC can be used in any community where: (1) shared perceptions of quality can emerge, (2) quality induces popularity, and (3) individual items have various authors.If a scientific community is in divide, for example, and its members deeply disagree on some theories or methods, condition (1) is violated and an attempt to produce a universal quality ranking might be in vain.While the causality between quality and popularity in science is imperfect (effects such as the first-mover advantage have reported [26]), it is still stronger than in music, for example, where condition ( 2) is questionable and the use of QRC is likely to produce dubious results.To overcome these limitations and thus extend the QRC's range of applicability remains a future challenge.
There are several research directions which remain open.The behavior and performance of the QRC algorithm upon non-integer choices of its parameters (such as the exponent 0.5 used in ( 5)) need to be examined.User surveys can be employed as an additional evaluation tool complementing the current quantitative approach based on citations, impact factor and h-index.As any other reputation system, QRC has also inherent preferences (and thus also incentives) for various kinds of behavior.In particular, it favors active authors and those who collaborate with other credible authors.Various forms of gaming of research metrics, ranging from self-citations and mutual citations to plagiarism therefore deserve particular attention.Thanks to its reliance on long-term quality indicators (author credit), QRC has the potential to prove substantially more robust to malicious behavior than its predecessors.For input data exceeding the three-year time span of the presently studied Econophysics Forum data, it may be suitable to introduce time decay of quality and credit values to prevent the oldest contributions and the most active authors from occupying top positions in their respective rankings.Results presented in [21,27] may provide a starting ground for these efforts.One should not forget that the QRC results are community-specific as they are based on feedback of a given group of users.This is not only a limitation but also an opportunity: The QRC algorithm can be eventually used to study the dynamics and differences between various research communities.

Fig. 2 :
Fig.2: Cumulative degree distributions in the Econophysics Forum data with respect to various actions for users, papers, and authors.The editor was removed from the user upload distribution for the sake of clarity.

Fig. 3 :
Fig. 3: Average metrics of QRC's top 20 papers versus λ.The vertical dashed line at λ = 0.57 marks the setting where citation count and impact factor are approximately maximized.

Table 1 :
Performance of three selected parameter settings in the QR algorithm.

Table 2 :
1, and W view = 0.05.This acknowledges paper upload as the Mean and standard error for basic metrics (submission day, number of downloads, citation count, and journal impact factor) of top 20 papers obtained with various algorithms (ER uses ω = 0.20, QRC uses λ = 0.57).

Table 3 :
Top ten authors in the QRC ranking: Their credit, number of authored papers, and the average number of downloads.The overall average number of papers per author and downloads per paper are 1.6 and 13, respectively.