Identifying the perceptive users for online social systems

In this paper, we introduce the perceptive user, a user who can identify high-quality objects early in their lifespan. By tracking the ratings given to rewarded objects, we present a method to measure user perceptibility, defined as a user's capability to identify these objects early in their lifespan. Moreover, we investigate the behavior patterns of perceptive users along three dimensions: user activity, correlation characteristics of user rating series, and user reputation. The experimental results for the empirical networks indicate that high perceptibility users show significantly different behavior patterns from the others: larger degree, stronger correlation of rating series, and higher reputation. Furthermore, in view of the hysteresis in finding the rewarded objects, we present a general framework for identifying high perceptibility users based on user behavior patterns. The experimental results show that this work is helpful for deeply understanding the collective behavior patterns of online users.

Recently, online user behavior patterns have attracted more and more attention [19][20][21]. The abundance of available information makes it harder for users to make choices: which objects to buy, which DVDs to borrow, or which movies to watch. Nowadays, online rating systems provide channels for users to show their preferences in the form of ratings [22][23][24], which can be represented as growing weighted bipartite networks where users are linked with the rated objects over time and the weights are the ratings. Preferential attachment [25,26], in which users connect to objects preferentially according to object degree, has been widely used to interpret user rating or selecting behaviors, presenting a homogeneous population of users driven by object popularity. Meanwhile, Liu et al [17] found that users are heterogeneous in selecting the rated objects: some objects are collected by almost all users, while some small-degree objects are only collected by large-degree users, indicating that users' tastes fall into two categories, popular and special. The work of Ni et al [27] also described this idea. Inspired by these works, we investigate the heterogeneity [28][29][30][31] of users in their rating patterns. An interesting phenomenon is found: while the majority of users usually collect the popular objects, some users frequently attach to high-quality objects (which are eventually rewarded) while these objects still receive little attention. The latter group of users is our focus in this paper.
We divide objects into two sets: high-quality objects and the others, where the high-quality objects are defined as rewarded objects, e.g. those receiving the Oscars for film, the Grammy Awards for music, the Emmy Awards for television, the Tony Awards for theater, etc. Many users focus on the rewarded objects once they become widely accepted, but some users pay attention to the rewarded objects long before they are widely approved (finally rewarded), i.e. early in their lifespan. Here we present two definitions: perceptive user and user perceptibility. A perceptive user is a user who gives high appraisals to the rewarded objects long before they are actually rewarded. User perceptibility is the degree to which a user can identify the rewarded objects in their initial lifespan.
We present a method to identify the user perceptibility based on online user rating behaviors. Then we investigate the behavior patterns of the perceptive users from three aspects: user activity, correlation characteristics of user rating series and user reputation. Experimental results indicate that high perceptibility users show different behavior patterns from the others. Finally, considering the hysteresis in finding the rewarded objects, we present a framework for identifying high perceptibility users based on users' behavior patterns.

Data sets
In this paper, we investigate two empirical data sets containing timestamps and ratings for movies: MovieLens and Netflix. The MovieLens data set, downloaded from GroupLens (http://www.grouplens.org), consists of 943,355 ratings given by 4,295 users to 3,706 movies during 1,039 days. The Netflix data set, provided by the Netflix Prize (http://www.netflixprize.com), consists of 37,755,925 ratings given by 218,319 users to 7,803 movies during 2,241 days. In both data sets, ratings are integers from 1 to 5, and each user has at least 50 ratings. The two object sets mentioned above, high-quality objects and the others, are divided based on the Oscars: we select movies nominated in the Best Picture category at the annual Academy Awards, popularly known as the Oscars (http://www.filmsite.org), as the high-quality objects. There are 162 and 150 rewarded movies in the MovieLens and Netflix data sets, respectively.

Method description
The rating system can be modeled by a weighted bipartite network, where the users and objects are denoted by $U = \{u_1, u_2, \ldots, u_{|U|}\}$ and $O = \{o_1, o_2, \ldots, o_{|O|}\}$. We use Latin and Greek letters to index users and objects, respectively. The rating $r_{i\alpha}$ given by user $u_i$ to object $o_\alpha$ is the weight of the link connecting nodes $u_i$ and $o_\alpha$ in the bipartite network. The timestamp of rating $r_{i\alpha}$ is denoted by $t_{i\alpha}$, and the highest rating is recorded as $r_h$. The user set $U_\alpha$ consists of the users who rate object $o_\alpha$, and the object set $O_i$ consists of the objects rated by user $u_i$. In addition, the degrees of user $u_i$ and object $o_\alpha$ are denoted by $k_i$ and $\rho_\alpha$, respectively. The rewarded and non-rewarded objects form the object sets $O^1$ and $O^2$, respectively, and their sizes are denoted by $n_1$ and $n_2$, with $n_1 + n_2 = |O|$.
For each rewarded object, we track the ratings given by users who give the highest rating $r_h$ early in the object's lifespan. The number of these links $D_i$ created by user $u_i$ can be expressed as
$$D_i = \sum_{o_\alpha \in O^1} D_{i\alpha}, \qquad D_{i\alpha} = \begin{cases} 1, & r_{i\alpha} = r_h \ \text{and} \ t_{i\alpha} - t_{\alpha 1} < \theta \, (t_{\alpha \rho_\alpha} - t_{\alpha 1}), \\ 0, & \text{otherwise}, \end{cases}$$
where $D_{i\alpha}$ is a binary event indicating whether user $u_i$ gives a high evaluation to object $o_\alpha$ ($o_\alpha \in O^1$) during the initial fraction $\theta$ ($0 < \theta < 1$) of its lifespan, and $t_{\alpha 1}$ and $t_{\alpha \rho_\alpha}$ are the timestamps of the first and last ratings that object $o_\alpha$ received, respectively. The quantity $D_i$ is the number of rewarded objects that user $u_i$ identifies early in their lifespan, with $0 \le D_i \le n_1$. The parameter $\theta$ is tunable, and $D_i$ increases with $\theta$. Note that no rating is considered ($D_i = 0$, $i = 1, 2, \ldots, |U|$) when $\theta = 0$, and the whole lifespan is viewed as the initial lifespan when $\theta = 1$. Finally, we define the perceptibility $p_i$ of user $u_i$ as the proportion of $D_i$ in the number of rewarded objects $n_1$,
$$p_i = \frac{D_i}{n_1}.$$
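As a rough illustration of the method described above, the following Python sketch computes the perceptibility of every user from a list of ratings. The tuple layout (user, object, rating, timestamp), the function name, and the strict inequality at the cutoff (chosen so that θ = 0 gives D_i = 0, as the text states) are assumptions made for illustration, not details taken from the paper. Perceptibility is taken as D_i / n_1, following the verbal definition.

```python
from collections import defaultdict


def perceptibility(ratings, rewarded, theta, r_high=5):
    """Return {user: p_i} for a list of (user, obj, rating, timestamp) tuples.

    `ratings` must be a list (it is scanned twice); `rewarded` is the set of
    rewarded object ids. Both layouts are hypothetical.
    """
    if not rewarded:
        raise ValueError("need at least one rewarded object")

    # First and last rating timestamps of each rewarded object.
    first, last = {}, {}
    users = set()
    for user, obj, _, t in ratings:
        users.add(user)
        if obj in rewarded:
            first[obj] = min(first.get(obj, t), t)
            last[obj] = max(last.get(obj, t), t)

    # D_i: number of rewarded objects given the top rating early enough.
    hits = defaultdict(int)
    for user, obj, r, t in ratings:
        if obj not in rewarded or r != r_high:
            continue
        cutoff = first[obj] + theta * (last[obj] - first[obj])
        if t < cutoff:  # strict comparison (assumption): theta = 0 yields no hits
            hits[user] += 1

    n1 = len(rewarded)
    return {u: hits[u] / n1 for u in users}
```

The sketch assumes each user rates an object at most once; with repeated ratings, a user could be counted more than once for the same object.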

Results
The user perceptibility quantitatively measures the degree to which a user can identify the rewarded objects early in their lifespan. To qualitatively determine whether a user is a perceptive user, we introduce a free-parameter bootstrap analysis [32][33][34]. The bootstrap sampling results show that, for the MovieLens and Netflix data sets, there are 5 and 27 identified perceptive users, respectively (accounting for 0.12% and 0.012% of all users, respectively). Here the parameter θ is set to 0.3 and 0.6 for the MovieLens and Netflix data sets, respectively. A larger θ is selected for the Netflix data set because its rewarded objects are few relative to the total numbers of objects and ratings (150 rewarded objects, 7,803 objects and 37,755,925 ratings). Moreover, we investigate whether the identification of user perceptibility is of significance. To this end, for each object $o_\alpha$ in the two empirical data sets, we calculate the average perceptibility of the first L users, in order of time, who give the rating 5 (the highest rating), denoted by $\langle p^L \rangle_\alpha$. The parameter L is set to 10 in the following analysis. All objects are divided into two groups based on their average perceptibility $\langle p^L \rangle$: objects rated by high perceptibility users (object set Θ) and the others (object set Λ), where set Θ consists of the top q (0 < q < 1) fraction of objects ranked by $\langle p^L \rangle$. Firstly, we track the links attached to all objects in the future time window and calculate the average degree $\langle \rho_O(t) \rangle$ of the two object groups as a function of time t, where the length of the future time window is 100 and 200 days for the MovieLens and Netflix data sets, respectively. Fig 1(a) and 1(b) show the degree evolution of the two object groups with q = 10% for the MovieLens and Netflix data sets, respectively.
One can find that the average degrees of objects in set Λ in the future time window are larger than those of objects in set Θ, showing that the objects rated by high perceptibility users become less popular than the others, indicating that user perceptibility has little impact on finding the popular objects.
Subsequently, we investigate the ratio ϕ of rewarded objects in the two object groups for different values of q (Fig 1(c) and 1(d)). One can find that the ratio ϕ of rewarded objects in object set Θ is larger than that in object set Λ for all values of q in both empirical data sets. For instance, the ratio ϕ of rewarded objects in object set Θ is larger than that in object set Λ by 263.0% and 722.0% with q = 5% for the MovieLens and Netflix data sets, respectively. Meanwhile, the ratios ϕ of rewarded objects in the two object groups with θ = 0.2, 0.4 for MovieLens and θ = 0.5, 0.7 for Netflix show similar results. Therefore, the results indicate that the user perceptibility is of significance in finding the rewarded objects rather than popular objects.
Fig 1. The degree evolution of two divided object groups (a,b) and the ratio ϕ of rewarded objects in two divided object groups with different parameter q (c,d) for two empirical data sets, in which the time t is measured in days, and the parameter θ is set to 0.3 and 0.6 for the MovieLens and Netflix data sets, respectively. From the subplots (a,b) one can find that the average degrees of objects rated by high perceptibility users in the future time window are larger than those of the other objects. From the subplots (c,d) one can find that the ratio ϕ of rewarded objects in objects rated by high perceptibility users is higher than that in the other objects with different parameter q. The results indicate that the user perceptibility is helpful to find the potential rewarded objects.
Furthermore, we investigate the relations between user perceptibility and user collective behavior patterns. All users are divided into two groups: high perceptibility users (user set F) and the others (user set Δ), where the high perceptibility users are the top q (0 < q < 1) fraction of users ranked by perceptibility.
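Both splits used above, objects into Θ and Λ and users into F and Δ, take the top fraction q of items ranked by a score. The following Python sketch shows one way to compute the average perceptibility of the first L top raters of each object, split the objects, and measure the ratio ϕ of rewarded objects in a group. The function names, the tuple layout (user, object, rating, timestamp), the omission of objects that never received the highest rating, and the arbitrary tie-breaking are assumptions for illustration only.

```python
def early_rater_perceptibility(ratings, p, L=10, r_high=5):
    """Mean perceptibility of the first L users (by time) who gave r_high to each object.

    `p` maps user -> perceptibility. Objects with no r_high rating are left out
    (an assumption; the text does not say how they are handled).
    """
    first_raters = {}
    for user, obj, r, t in sorted(ratings, key=lambda rec: rec[3]):
        if r != r_high:
            continue
        scores = first_raters.setdefault(obj, [])
        if len(scores) < L:
            scores.append(p.get(user, 0.0))
    return {obj: sum(v) / len(v) for obj, v in first_raters.items()}


def split_top_q(scores, q):
    """Split keys into (top, rest), where top holds the highest-scoring fraction q."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    cut = max(1, int(q * len(ranked)))  # keep at least one item in the top group
    return set(ranked[:cut]), set(ranked[cut:])


def rewarded_ratio(group, rewarded):
    """Ratio of rewarded objects within a group of objects."""
    return len(group & rewarded) / len(group) if group else 0.0
```

The same `split_top_q` helper can be applied to a user-to-perceptibility mapping to obtain the two user groups.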
We investigate the collective behavior patterns of the two user groups from three aspects: user activity, correlation characteristics of user rating series and user reputation. User activity (denoted by $k_U$), namely user degree, is one of the most important user characteristics in social systems [27,35]. The larger the user degree, the more active the user. In our analysis, the correlation characteristics of user rating series are described by detrended fluctuation analysis (DFA for short), which is widely used for analyzing the statistical self-affinity of a time series [36][37][38][39] and is summarized by the scaling exponent η. The exponent satisfies η > 0: 0 < η < 0.5 corresponds to anti-correlated series, η = 0.5 corresponds to uncorrelated white noise, and η > 0.5 corresponds to correlated series. User reputation measures a user's ability to give accurate assessments of various objects [40,41]. So far, many reputation ranking methods have been widely investigated [42][43][44]. In this paper, we use the correlation based ranking algorithm [41] to calculate the user reputation, denoted by μ. The quantity μ lies in [0, 1], and larger μ means higher user reputation. Fig 2 shows the average degree $\langle k_U \rangle$, scaling exponent $\langle \eta \rangle$ and reputation $\langle \mu \rangle$ of the two user groups for different values of q for the MovieLens and Netflix data sets. One can find that the average $\langle k_U \rangle$, $\langle \eta \rangle$ and $\langle \mu \rangle$ of user set F (high perceptibility users) are larger than those of user set Δ (the other users) for all values of q in both empirical data sets. For instance, the average $\langle k_U \rangle$, $\langle \eta \rangle$ and $\langle \mu \rangle$ of user set F are larger than those of user set Δ by 180.1%, 11.8% and 17.3%, respectively, with q = 5% for the MovieLens data set. For the Netflix data set, the increases are 120.5%, 6.3% and 11.6%, respectively, with q = 5%. The collective behavior patterns of the two user groups with θ = 0.2, 0.4 for MovieLens and θ = 0.5, 0.7 for Netflix show similar results. The results indicate that high perceptibility users show larger activity, stronger correlation of rating series and higher reputation than other users.
Fig 2. The average degree $\langle k_U \rangle$, scaling exponent $\langle \eta \rangle$ and reputation $\langle \mu \rangle$ of two divided user groups with different parameter q for (a,c,e) MovieLens and (b,d,f) Netflix data sets, in which the parameter θ is set to 0.3 and 0.6, respectively. One can find that the average $\langle k_U \rangle$, $\langle \eta \rangle$ and $\langle \mu \rangle$ of high perceptibility users are larger than those of the other users with different parameter q for two empirical data sets, which indicates that high perceptibility users show different collective behavior patterns from the other users: larger activity, stronger correlation of rating series and higher reputation. https://doi.org/10.1371/journal.pone.0178118.g002
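To make the scaling exponent η more concrete, the following pure-Python sketch implements a basic first-order DFA: it integrates the mean-subtracted series, removes a linear trend in non-overlapping windows, and fits the slope of log F(n) against log n. The detrending order, the window sizes in `scales`, and the function names are assumptions; the text does not specify which DFA variant or scales were used for the user rating series.

```python
import math


def _linfit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def dfa_exponent(series, scales=(4, 8, 16, 32, 64)):
    """Estimate the DFA scaling exponent of a numeric series (first-order detrending)."""
    mean = sum(series) / len(series)
    profile, acc = [], 0.0
    for x in series:  # cumulative sum of deviations from the mean
        acc += x - mean
        profile.append(acc)

    log_n, log_f = [], []
    for n in scales:
        n_win = len(profile) // n
        if n_win < 2:  # skip scales that do not fit at least two windows
            continue
        t = list(range(n))
        sq = 0.0
        for w in range(n_win):
            seg = profile[w * n:(w + 1) * n]
            a, b = _linfit(t, seg)  # local linear trend of this window
            sq += sum((y - (a * i + b)) ** 2 for i, y in zip(t, seg))
        f = math.sqrt(sq / (n_win * n))  # root-mean-square fluctuation F(n)
        if f > 0:
            log_n.append(math.log(n))
            log_f.append(math.log(f))

    if len(log_n) < 2:
        raise ValueError("series too short for the chosen scales")
    eta, _ = _linfit(log_n, log_f)
    return eta
```

For a user, `series` would be that user's ratings ordered by timestamp. A strictly alternating series gives an exponent well below 0.5, while independent random values give an exponent near 0.5.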

A framework for high perceptibility user identification
High perceptibility users are the top q (0 < q < 1) fraction of users ranked by perceptibility, so identifying them depends on identifying user perceptibility. User perceptibility is calculated by tracking the ratings given to the rewarded objects, but the discovery of rewarded objects has hysteresis: with the growing number of new users, objects and ratings, it is uncertain which objects in the current rating system will be rewarded. Thus, user perceptibility and high perceptibility users cannot be identified in real time. Since high perceptibility users have specific collective behavior patterns, we develop a general framework for identifying high perceptibility users based on users' behavior patterns.
All users are divided into two groups: high perceptibility users and the others. Since identifying high perceptibility users is a classification problem, we introduce random forests [45], one of the most widely used machine learning [46,47] methods, into our framework. The Data Flow Diagram (DFD for short) of the framework is shown in Fig 3. Firstly, the available ratings and the rewarded objects are used to identify the user perceptibility with the presented method (Process P1). Meanwhile, the available ratings are used to analyze the user collective behavior patterns from three aspects: degree, DFA of rating series and reputation (Process P2). Processes P1 and P2 can be performed simultaneously. Then, we use random forests to train on the obtained results, which contain the user perceptibility and behavior patterns (Process P3). When the rating system generates new ratings, the user collective behavior patterns analysed from the new ratings (Process P4) are used to identify high perceptibility users in the current rating system through the generalization of random forests (Process P5).
Fig 3. The Data Flow Diagram of the framework. The available ratings in the rating systems, on the one hand, are applied with the rewarded objects to identify the user perceptibility by the presented method (Process P1). On the other hand, they are used to analyze the user collective behavior patterns described by three aspects: activity, DFA of rating series and reputation (Process P2). Then, we use the random forests to train the obtained results containing the user perceptibility and behavior patterns (activity, DFA of rating series and reputation) (Process P3). Afterwards, the high perceptibility users will be identified based on the user collective behavior patterns analysed from the new ratings in the rating systems (Process P4) by the generalization of random forests (Process P5).
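The training and generalization steps (Processes P3 and P5) can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the features are synthetic stand-ins for activity, DFA exponent and reputation, generated so that labelled users score somewhat higher on average; the helper names, the number of trees and the random seed are hypothetical; and none of the numbers it produces relate to the MovieLens or Netflix results.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def make_synthetic_users(n=400, seed=0):
    """Synthetic (activity, DFA exponent, reputation) features; NOT real data."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0  # 1 = high perceptibility (q = 50%)
        shift = 0.5 * label                     # labelled users score higher on average
        X.append([rng.gauss(shift, 1.0) for _ in range(3)])
        y.append(label)
    return X, y


def train_and_evaluate(X, y, seed=0):
    """Train on 70% of users, predict the remaining 30%, return model and (P, R, F)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    scores = (
        precision_score(y_te, y_pred, zero_division=0),
        recall_score(y_te, y_pred, zero_division=0),
        f1_score(y_te, y_pred, zero_division=0),
    )
    return model, scores
```

After fitting, `model.feature_importances_` gives one weight per behavior pattern, which is one way to read off the relative importance of the three features.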
Moreover, we investigate the performance of high perceptibility user identification using the presented framework. After identifying the user perceptibility based on the rewarded objects and ratings, high perceptibility users are taken as the top q (0 < q < 1) fraction of users ranked by perceptibility. For each of the MovieLens and Netflix data sets, we select 70% of the user data (user perceptibility and behavior patterns) as the training set $S_{tr}$ and the remaining 30% as the test set $S_{te}$. High perceptibility users in the test set $S_{te}$ are denoted as set $H_{te}$. The identified high perceptibility user set $H'_{te}$ in the test set is predicted by the generalization of random forests after training on $S_{tr}$. Then, the performance of high perceptibility user identification is measured by the precision P, recall R and F-measure F,
$$P = \frac{|H_{te} \cap H'_{te}|}{|H'_{te}|}, \qquad R = \frac{|H_{te} \cap H'_{te}|}{|H_{te}|}, \qquad F = \frac{2PR}{P + R},$$
where $|H_{te} \cap H'_{te}|$ is the number of high perceptibility users in the identified set $H'_{te}$, $|H'_{te}|$ is the number of users in the identified set $H'_{te}$, and $|H_{te}|$ is the number of users in the high perceptibility user set $H_{te}$. Precision P, recall R and F-measure F all lie in [0, 1], and larger P, R or F represents better performance of high perceptibility user identification. The precision P, recall R and F-measure F for different values of q for the two empirical data sets are shown in Fig 4, where the parameter q (0 < q < 1) is the ratio of high perceptibility users among all users. One can find that the framework performs well in identifying the high perceptibility users. With q = 50%, the performance reaches P = 0.68, R = 0.66 and F = 0.67 for the MovieLens data set and P = 0.59, R = 0.55 and F = 0.57 for the Netflix data set. Meanwhile, the precision P, recall R and F-measure F all increase with the parameter q in general.
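The three measures can be computed directly from the true and identified user sets. The sketch below assumes the F-measure is the harmonic mean 2PR / (P + R), which is consistent with the reported values (for example, P = 0.68 and R = 0.66 give F ≈ 0.67); the function names are hypothetical.

```python
def top_q_users(perceptibility, q):
    """Return the top fraction q of users ranked by perceptibility (ties broken arbitrarily)."""
    ranked = sorted(perceptibility, key=perceptibility.get, reverse=True)
    return set(ranked[:int(q * len(ranked))])


def precision_recall_f(true_set, predicted_set):
    """Precision, recall and F-measure of a predicted user set against the true set."""
    hit = len(true_set & predicted_set)
    precision = hit / len(predicted_set) if predicted_set else 0.0
    recall = hit / len(true_set) if true_set else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

Here `true_set` plays the role of the high perceptibility users in the test set and `predicted_set` the users flagged by the classifier.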
The performance of high perceptibility user identification for different values of θ indicates that larger θ yields larger precision P, recall R and F-measure F across different values of q.
We can obtain both the performance of high perceptibility user identification and the importance of behavior patterns using random forests. Besides random forests, we also use other machine learning methods, including gradient boosting machine [48,49] (GBM for short) and support vector machine [50,51] (SVM for short), to identify the high perceptibility users. The precision P, recall R and F-measure F of high perceptibility user identification are shown in Fig 5, from which one can find that the performance reaches P = 0.72, R = 0.77 and F = 0.74 using GBM and P = 0.74, R = 0.71 and F = 0.72 using SVM with q = 50% for the MovieLens data set. The recall obtained using GBM and SVM differs little from that obtained using random forests. The precision obtained using GBM and SVM differs between the two methods, and the precision P is large when the parameter q is small. The precision is better using GBM and SVM than using random forests.
Fig 5. Performance of high perceptibility user identification using GBM and SVM with different parameter q for (a-b) MovieLens and (c-d) Netflix data sets, respectively. One can find that the recall R of high perceptibility user identification using GBM and SVM have little difference with the results using random forests. The precision P is better using GBM and SVM than using random forests.

Conclusion and discussions
In this paper, taking into account collective behavior patterns and the heterogeneity of online users, we define the perceptive user as a user who gives high evaluations to the rewarded objects early in their lifespan. In addition, user perceptibility is defined as the degree to which a user can identify the rewarded objects in their initial lifespan. We present a method for identifying the user perceptibility by tracking the ratings given to rewarded objects and their timestamps. To trace the relations between user perceptibility and user collective behavior patterns, we investigate the user behavior patterns from three aspects: user activity, correlation characteristics of user rating series and user reputation. The experimental results for the MovieLens and Netflix data sets indicate that high perceptibility users have larger activity, stronger correlation of rating series and higher reputation than the other users. For example, for the MovieLens data set, the average $\langle k_U \rangle$, $\langle \eta \rangle$ and $\langle \mu \rangle$ of user set F (high perceptibility users) are larger than those of user set Δ by 180.1%, 11.8% and 17.3%, respectively, with q = 5%. Finally, given the hysteresis in finding the rewarded objects, we present a general framework to identify the high perceptibility users in real time based on users' behavior patterns. The experimental results show that the framework performs well in identifying the high perceptibility users. For example, the precision P, recall R and F-measure F reach P = 0.72, R = 0.77 and F = 0.74 using GBM with q = 50% for the MovieLens data set.
The computational complexity of the method presented to identify the user perceptibility is $O(n_1 \cdot \langle \rho_O \rangle + n_1 \cdot |U|)$, where the first term accounts for the calculation of $D_{i\alpha}$, i.e. whether a user gives a high evaluation to each rewarded object in its initial lifespan, and the second term accounts for the calculation of $D_i$, the number of rewarded objects each user identifies early in their lifespan. Substituting the inequality $\langle \rho_O \rangle \le |U|$, we are left with $O(n_1 \cdot |U|)$. Because $n_1$ is a constant in a given rating system, the computational complexity of the user perceptibility identification is $O(|U|)$, a linear function of the number of users.
For a long time, popular objects have attracted most of the attention, while few users recognize and appreciate the rewarded objects when they still receive little attention. The discovery of perceptive users and the identification of user perceptibility provide a new perspective for understanding these special users. The finding that user perceptibility helps to find potential rewarded objects indicates that the identification of user perceptibility is of practical significance in e-commerce and marketing. Meanwhile, the presented framework for high perceptibility user identification, which moves from investigating the behavior patterns of the two user groups to conversely identifying high perceptibility users based on those patterns, gives a systematic study of perceptive users and is also suitable for big data processing. In addition, the following points should be addressed in future work. Firstly, the high-quality objects here are defined as the rewarded ones, and how to construct the high-quality object set remains an open problem. Secondly, the user collective behavior patterns are investigated from only three aspects in this paper, which may be incomplete; as a further improvement, more dimensions could be considered to explore the user behavior patterns more deeply. Thirdly, random forests are applied in the framework for high perceptibility user identification, and econometrics and time series analysis could also be emphasized in our future research.