Quantifying the impact of scholarly papers based on higher-order weighted citations

Quantifying the impact of a scholarly paper is of great significance, yet the effect of the geographical distance between citing and cited papers has not been explored. In this paper, we examine 30,596 papers published in Physical Review C and identify the relationship between citations and the geographical distances between author affiliations. Subsequently, a relative citation weight is applied to assess the impact of a scholarly paper. A higher-order weighted quantum PageRank algorithm is also developed to address the behavior of multi-step citation flow. Capturing the citation dynamics with higher-order dependencies reveals the actual impact of papers, including necessary self-citations that are sometimes excluded in prior studies. Quantum PageRank is used in this paper to help differentiate nodes whose PageRank values are identical.


Introduction
With the rapid growth of scholarly big data [1], there is a crucial need to quantify the impact of scholarly papers in order to assess the performance of individual scholars, institutions, and even countries [2]. Currently, metrics for the impact of scholarly papers fall into two main categories: unstructured metrics and structured metrics [3]. Unstructured metrics evaluate the impact of a scholarly paper from a statistical point of view. Citations [4] [5] are the most representative unstructured metrics, with examples such as the H-index [6], the g-index [7], and the impact factor (IF) [8]. As an alternative measure of scientific impact, Xia et al. [9] have investigated scholarly impact reflected on social media, exploring the correlation between citations and messages/tweets on Facebook and Twitter. Structured metrics mainly consider the importance of scholarly entities in scholarly networks, such as citation networks, co-author networks, and author-paper networks. PageRank [10], a seminal example of structured metrics, has attracted growing attention in scholarly impact evaluation. Sayyadi et al. [11] have estimated future prestige scores of scholarly papers via three features: citations, publication date, and authorship. Wang et al. [12] have quantified the impact of scholarly papers by applying PageRank and HITS [13] to the citation network, author-paper network, and journal-paper network. In unweighted structured metrics, all citations are treated with equal importance. An alternative approach is to evaluate the impact of scholarly papers with a time-aware weighted citation network [14]. In another study, Shah et al. [15] proposed the S-index metric to model influence propagation through a weighted paper-paper citation network.
That study applies a hierarchical model between the citing paper and the cited paper, so that the impact of a scholarly paper decays rapidly across hierarchical levels.
One potential problem for both unstructured and structured metrics is that the impact of individual papers can be manipulated. For instance, aggressive self-citations or induced citations may lead to an inflated impact. Bai et al. [16] evaluated the impact of scholarly papers using a weighted citation network in which Conflict of Interest citation relationships are identified and their citation strengths are weakened. Another potential problem with structured metrics is that little is known about how actual geographic distance influences the impact of a scholarly paper, or about how higher-order dependencies in citation networks affect it. Liben-Nowell et al. [17] investigated the relationship between geographical distance and friendship in the LiveJournal network, indicating that geographical proximity can indeed increase the probability of friendship. This suggests that social network attributes and geographical distance are related, which is an important aspect of small-world theory. Previous research found a strong linear relationship between institutions and distance [18]. Schubert et al. [19] revealed that geopolitical location, cultural relations, and language are important factors in shaping cross-citation preferences. Wu [20] investigated citing distances, citation patterns, and spatial diversity to explore geographical knowledge diffusion. Albarran et al. [21] found that economic, political, sociological, and intellectual factors influence the shape of countries' citation distributions and their research performance. A geographic analysis of citation flows between cities helps to uncover how new scientific paradigms spread and to understand how quickly new research is recognized by academic circles in different geographical areas [22]. Bai et al. [23] explored the relationships between citations and the actual geographic location of institutions for evaluating the impact of scholarly papers.
Building on this previous work, we further explore the relationship between citations and geographical distance, and construct a relative weight to represent the importance of each citation.
The concept of higher-order dependencies was introduced by Xu et al. [24] to ensure the correctness of network analysis. Higher-order dependencies mean that, when movements are simulated on the network, the next movement depends on several previous steps. They are widely used to model various applications, including Web browsing behaviors [25], vehicle and human movements [26], and the stock market [27]. Bohlin et al. [28] modelled citation flow between journals with memory of previous steps, corresponding to zero-, first-, and second-order Markov models. Previous studies evaluate the impact of papers based on the original citation network, ignoring the influence of multi-step citation flow on the impact of papers. In this paper, we construct a higher-order citation network and apply the hierarchical citation structure to quantify the impact of scholarly papers.
Once a citation network is constructed, evaluation methods such as PageRank or HITS can be applied. Although PageRank was introduced to rank Web pages, the algorithm has been deployed in many applications, such as finding important nodes in networks [29], measuring the impact of scholarly papers [30], evaluating the impact of scholars [31] or journals [32], as well as various applications in social networks [33] and graph analysis [34]. A personalized PageRank was developed to find the vertices in a graph [35]. A multilinear PageRank extended PageRank to a higher-order Markov chain and studied a computationally tractable approximation to the higher-order PageRank vector [36]; it models stochastic processes that depend on the previous steps. However, it is well known that using PageRank to evaluate scholarly impact makes the evaluation results depend on the parameter α: when the α value changes, the evaluation results change accordingly.
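The dependence of PageRank scores on α can be illustrated with a minimal sketch. It assumes the usual power-iteration formulation, in which α is the probability of following a citation link and 1 − α the probability of jumping to a random node; the function name `pagerank`, the four-node graph, and the two α values are hypothetical and are not taken from the data set.

```python
def pagerank(links, alpha=0.85, iterations=100):
    """Power-iteration PageRank; links[j] lists the nodes that node j points to."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1.0 - alpha) / n] * n          # random-jump contribution
        for j, targets in enumerate(links):
            if targets:
                share = alpha * rank[j] / len(targets)
                for k in targets:
                    new[k] += share
            else:                              # dangling node: spread rank uniformly
                for k in range(n):
                    new[k] += alpha * rank[j] / n
        rank = new
    return rank


if __name__ == "__main__":
    # Hypothetical toy graph: node 3 is never linked to by any other node.
    toy_links = [[1, 2], [2], [0], [2]]
    for a in (0.5, 0.85):
        print(a, [round(score, 4) for score in pagerank(toy_links, alpha=a)])
```

In this toy graph the score of the uncited node equals (1 − α)/4, so it changes directly with the choice of α, while the scores always sum to one.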
To address this limitation of PageRank, Paparo et al. [37] proposed the quantum PageRank algorithm to unambiguously identify the underlying topology of networks. The quantum PageRank algorithm clearly highlights the structure of secondary hubs in scale-free networks. It recognizes the hierarchical structure in scale-free networks, amplifying the differences in the importance of nodes. The algorithm mainly consists of the following parts: (1) The input state of the algorithm is constructed based on the transition matrix of PageRank. (2) The unitary matrix and the transfer matrix are constructed to generate the total transformation matrix. (3) To obtain the probability of the particle appearing at each node, the square of the total transformation matrix is used to update the initial state. (4) The average value over m steps is calculated for each node of a given network, which gives the quantum PageRank value.
This paper analyzes temporal and geographical attributes of publications and citations and addresses the limitations of conventional techniques in quantifying the impact of scholarly papers. The main contributions of this paper are summarized as follows: (1) Identifying the relationship between citations and geographic locations of affiliations. (2) Introducing a relative citation weight based on geographical distance between institutions to better quantify the impact of scholarly papers. (3) Exploring higher-order dependencies in citation networks. (4) Developing the higher-order weighted quantum PageRank algorithm to rank the impact of scholarly papers.
Fig 1. Citation relationships between institutions by grouping analysis [38,39] and clustering analysis [40]. Red dots represent institutions, and the links between institutions represent citations.
Fig 1A shows the citation relationships between different institutions obtained by grouping analysis. Clustering analysis yields about 200 institutions. As Fig 1 shows, institutions across six continents cite one another. In particular, in the field of physics, citations between North America and Europe are more frequent than those between other continents. Fig 1B shows the frequency of citation more clearly.

Relative citation weight
Geographical distance: Let $I$ represent a set of institutions, $I = \{I_1, I_2, \cdots, I_a, \cdots, I_b, \cdots\}$, and let $D$ represent the geographic distance between two institutions $I_a$ and $I_b$. By approximating the geographic distance using the spherical model, $D$ can be formulated as:
where $R$ is the radius of the Earth, $\theta_a$ and $\theta_b$ are the latitudes of $I_a$ and $I_b$, and $\phi_a$ and $\phi_b$ are the longitudes of $I_a$ and $I_b$. $\Delta\theta = \theta_a - \theta_b$ is the difference in latitude between $I_a$ and $I_b$, and $\Delta\phi = \phi_a - \phi_b$ is the difference in longitude.
Because physical distance raises the barrier to in-person interaction between the author and the citing researcher, citation counts are expected to decline with the geographical distance that separates the researchers. (Further discussion can be found in the Discussion Section.) The decline pattern is modelled with an exponential decay, according to the following equation:
$$y = y_0 + A_1 e^{-x/t_1}$$
where $y$ represents the citation count, $y_0$ is a constant representing an offset of the citation count, $x$ is the physical distance separating the researchers, and $t_1$ is a scaling factor. $A_1$ is the number of citations, less the offset, when the author and the citing researcher are located at the same physical location. Experimental results for the citation pattern are presented in the Results Section.
Upon identifying the citation pattern, we construct a relative citation weight to quantify the impact of scholarly papers. We consider the citation network at the institution level, in which each institution has its actual latitude and longitude. Institutions are represented as nodes, and an edge exists between two institutions if they have a citation relationship. In the citation network, the relative citation weight between two institutions, $W_{I_a, I_b}$, is defined as:
where $G$ represents the set of all institutions, and $I_m$ and $I_n$ denote any two individual institutions with $m \neq n$. $D_{I_m, I_n}$ represents the geographic distance between two different institutions, and $\max D_{I_m, I_n}$ indicates the maximum geographic distance between institutions.
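As a rough illustration of the distance and decay quantities defined in this section, the following sketch computes a great-circle distance and evaluates the exponential decay model. It assumes that the spherical model corresponds to the haversine formula, which uses the radius, the two latitudes, and the latitude and longitude differences; the function names, the Earth radius value, the coordinates, and the decay parameters are illustrative assumptions and are not values reported in this paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed value for R


def great_circle_distance(lat_a, lon_a, lat_b, lon_b, radius=EARTH_RADIUS_KM):
    """Haversine distance between two points given in decimal degrees."""
    theta_a, theta_b = math.radians(lat_a), math.radians(lat_b)
    d_theta = theta_a - theta_b                          # latitude difference
    d_phi = math.radians(lon_a) - math.radians(lon_b)    # longitude difference
    h = (math.sin(d_theta / 2) ** 2
         + math.cos(theta_a) * math.cos(theta_b) * math.sin(d_phi / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(h))


def expected_citations(x, y0, a1, t1):
    """Exponential decay model y = y0 + A1 * exp(-x / t1)."""
    return y0 + a1 * math.exp(-x / t1)


if __name__ == "__main__":
    # Hypothetical coordinates (roughly Boston and Geneva), for illustration only.
    d = great_circle_distance(42.36, -71.06, 46.20, 6.14)
    print(f"distance: {d:.0f} km")
    # Hypothetical decay parameters; fitted values are not given in the text.
    print(expected_citations(d, y0=5.0, a1=100.0, t1=1000.0))
```

At zero distance the model returns $y_0 + A_1$, matching the interpretation of $A_1$ as the citation count less the offset for co-located researchers.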

Higher-order weighted quantum PageRank
In this section, we introduce the proposed higher-order quantum PageRank algorithm. Firstly, we construct higher-order dependencies in the citation network. The specific process is as follows:
(1) We use the random walk method to find citation chains in the original citation network, in order to identify the higher-order dependencies of the citation relationships among the papers.
(2) We traverse all citation chains and add up the number of occurrences of each order of citation for all nodes in the chains. Citation chains can be navigated backwards and forwards to build up a picture of the intellectual base of a topic [41].
(3) For each order, we calculate the probability of each node citing other articles separately.
(4) Across the different orders, we compare the probabilities of the same citation relationship. If the probability changes substantially between orders, the original citation relationship is replaced by the higher-order relationship. At the same time, the node representing the higher-order relationship replaces the original node.
(5) We rewire the higher-order citation edges. This is necessary because replacing the original node with a higher-order node can result in the loss of previous steps.
(6) We establish the probability transfer matrix $G$ according to all the generated citation relationships.
Given a directed graph with $M$ nodes, $i|k$ indicates the $k$th order of node $i$, and $N_{i \to j}$ indicates the number of times that node $i$ cites node $j$. The probability of node $i$ transferring to its neighboring nodes is defined as:
where $k \in [2, \text{order}]$, with order denoting the highest order, and $t$ ranges from 1 to $M$.
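Steps (2) to (4) can be sketched in simplified form as follows. The sketch is limited to first- and second-order dependencies and measures the change in probability with the K-L divergence (in bits). The chains are made up for illustration, and the function names and the threshold value of 0.1 are hypothetical; the rewiring of step (5) is not covered.

```python
import math
from collections import Counter, defaultdict


def transition_counts(chains, order):
    """Count how often each context of `order` nodes is followed by each node."""
    counts = defaultdict(Counter)
    for chain in chains:
        for pos in range(order, len(chain)):
            context = tuple(chain[pos - order:pos])
            counts[context][chain[pos]] += 1
    return counts


def normalize(counter):
    """Turn occurrence counts into transition probabilities."""
    total = sum(counter.values())
    return {node: n / total for node, n in counter.items()}


def kl_divergence(p, q):
    """D(p || q) in bits; every key of p must have nonzero probability in q."""
    return sum(pv * math.log2(pv / q[key]) for key, pv in p.items())


def higher_order_contexts(chains, threshold=0.1):
    """Return second-order contexts whose next-step distribution differs
    from the first-order distribution by more than `threshold` bits."""
    first = transition_counts(chains, 1)
    second = transition_counts(chains, 2)
    selected = {}
    for context, counter in second.items():
        p_second = normalize(counter)
        p_first = normalize(first[context[-1:]])  # same current node, no memory
        if kl_divergence(p_second, p_first) > threshold:
            selected[context] = p_second
    return selected


if __name__ == "__main__":
    # Hypothetical chains: what follows C depends on the node visited before C.
    chains = [("A", "C", "D")] * 5 + [("B", "C", "E")] * 5
    print(higher_order_contexts(chains))
```

In the example, a walker at C continues to D or E with equal probability under the first-order view, but always to D when arriving from A, so the contexts (A, C) and (B, C) are flagged as higher-order relationships.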
In order to calculate the probability, the K-L divergence value $D(P_i)$ needs to be obtained:
Using $i|k$ to replace the previous node $i$, node $i$ obtains an updated transition probability. Secondly, we calculate the transfer matrix $G$ according to the directed graph with $N$ nodes. Subsequently, we need to construct the initial state, namely the input state. The details are as follows: (1) $|i\rangle|j\rangle$ represents the directed edge from node $i$ to node $j$, and $G_{k,i}$ indicates the transition probability between nodes $i$ and $k$ given by the matrix $G$. (2) We calculate the superposition of all nodes according to the following formula:
where $|\psi_j\rangle$ indicates a superposition of the vectors that represent the outgoing edges from node $j$. The vectors $|\psi_j\rangle$ for $j = 0, 1, 2, \ldots, N-1$ are normalized because the matrix $G$ is stochastic. These vectors form an $N$-dimensional orthonormal set, and they are used as the initial state of the quantum walk.
Then, we need to construct the unitary matrix $\pi$ and the transfer matrix $S$ to obtain the general transform matrix. The unitary matrix $\pi$ is
The transfer matrix $S$ is used to move a quantum particle from node $j$ to node $k$:
The general transform matrix is defined as
As the directions of the edges of the graph need to be swapped an even number of times, we use $U^2$ to update the initial state $|\psi_0\rangle$ each time. Then we calculate the probability that the particle appears on node $i$. The probability that the particle appears at node $i$ after $m$ steps of walking, $P_{i,m}$, can be obtained using the following formula:
where $U^{2m}$ indicates $m$ iterations of $U^2$, and $U^{2m\dagger}$ is the conjugate transpose of $U^{2m}$. Finally, in order to guarantee a probabilistic interpretation of the higher-order quantum PageRank, we conduct the following process
$P_{i,m}$ can be interpreted as the relative importance of node $i$, and it can be found by calculating the probability of finding a quantum walker on node $i$. Thus, the impact score of each scholarly paper can be calculated from the $P_{i,m}$ value, as shown in Eq (12).
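A minimal sketch of the quantum walk may help to make these steps concrete. It assumes the construction commonly given for the quantum PageRank of [37]: the states $|\psi_j\rangle = |j\rangle \otimes \sum_k \sqrt{G_{k,j}}\,|k\rangle$, the projector $\sum_j |\psi_j\rangle\langle\psi_j|$ in the role of $\pi$, the swap of the two registers as $S$, and $U = S(2\pi - 1)$. Whether every detail coincides with the equations of this section is an assumption. The higher-order rewiring and the geographic weighting are omitted, and the function names and the toy graph are hypothetical.

```python
import math


def google_matrix(links, alpha=0.85):
    """G[k][j] = probability of moving from node j to node k (columns sum to 1)."""
    n = len(links)
    G = [[(1.0 - alpha) / n] * n for _ in range(n)]
    for j, targets in enumerate(links):
        for k in range(n):
            if targets:
                G[k][j] += alpha * targets.count(k) / len(targets)
            else:  # dangling node: jump uniformly
                G[k][j] += alpha / n
    return G


def quantum_pagerank(G, steps):
    """Return (history, average): history[m][i] is the probability of finding
    the walker at node i after m + 1 applications of U^2."""
    n = len(G)
    # root[j][k] = sqrt(G[k][j]) is the amplitude of |j>|k> in |psi_j>.
    root = [[math.sqrt(G[k][j]) for k in range(n)] for j in range(n)]
    # Initial state: uniform superposition of all |psi_j>.
    amp = [[root[j][k] / math.sqrt(n) for k in range(n)] for j in range(n)]

    def apply_u(state):
        out = [[0.0] * n for _ in range(n)]
        for j in range(n):
            overlap = sum(root[j][k] * state[j][k] for k in range(n))
            for k in range(n):
                reflected = 2 * overlap * root[j][k] - state[j][k]  # (2*pi - 1)
                out[k][j] = reflected  # S swaps the two registers
        return out

    history = []
    for _ in range(steps):
        amp = apply_u(apply_u(amp))  # one application of U^2
        history.append([sum(amp[a][i] ** 2 for a in range(n)) for i in range(n)])
    average = [sum(row[i] for row in history) / steps for i in range(n)]
    return history, average


if __name__ == "__main__":
    G = google_matrix([[1, 2], [2], [0]])  # hypothetical three-paper graph
    _, scores = quantum_pagerank(G, steps=20)
    print([round(s, 4) for s in scores])
```

Because every amplitude stays real for a real stochastic $G$, the sketch avoids complex arithmetic; the averaged values play the role of the per-node scores obtained from $P_{i,m}$.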

Definition of a scholarly paper impact
Based on the observation that citations are inversely related to geographical distance, following an exponential decay, the impact of each scholarly paper is defined as its average higher-order weighted quantum PageRank value:
$$S(P_i) = \langle P_{i,m} \rangle = \frac{1}{M} \sum_{m=1}^{M} P_{i,m}$$
where $S(P_i)$ represents the prestige score of a scholarly paper, $\langle P_{i,m} \rangle$ represents the average value of the higher-order weighted quantum PageRank scores, $M$ represents the number of iterations of the algorithm, and $P_{i,m}$ indicates the $m$-th value of the higher-order weighted quantum PageRank scores. The concept of the prestige score is inherited from the Quantum Google algorithm [37], with the importance of a node corresponding to the prestige score of a scholarly paper in our work.

Data description
Our experiments are conducted on the Physical Review C (PRC) data set, a subset of the American Physical Society (APS) data set (http://publish.aps.org/datasets). PRC consists of 34,443 papers, and each paper includes the title, author names and affiliations, date of publication, and a list of cited papers. We remove 3,587 papers without citation details from the PRC data set. Overall, 212,421 citations are identified from the data set. Geographic coordinates of over 27,000 institutions are obtained by calling the Geocode function of the Google Maps API.

Data processing
To better explore the relationship between citations and geographical distance, we partition geographical distance using two statistical analysis techniques: grouping analysis and clustering analysis. For grouping analysis, we use multiples of 100 km as threshold values to determine the group of any two institutions. For instance, if institutions $I_a$ and $I_b$ are 250 km apart, their citations are counted in the 200-300 km group. For clustering analysis, we use Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [42], a density-based spatial clustering algorithm. The number of clusters is determined by two parameters: (1) the furthest distance between any two points belonging to a cluster; and (2) the smallest number of samples in a cluster. We select DBSCAN for clustering analysis mainly because it clusters quickly, handles noise points efficiently, and can find spatial clusters of arbitrary shape.
In order to characterize the citation trend, we also analyze the relationship between citations and geographical distance by clustering analysis (Fig 4). For both analysis methods, we analyze the citation trend by considering four cases: intra-country, inter-country, raw distance (with oceans), and land distance (without oceans). The citation trend of scholarly papers within countries approximately follows $c(d) \sim y_0 + A_1 e^{-d/t_1}$ (Figs 3A and 4A). Yet we find that the citation trends between countries (Figs 3B and 4B) differ from those within countries. Citations rapidly decrease as the geographical distance between institutions increases from 0 Mm to 5 Mm, then consistently increase and reach a peak at around 7 Mm.
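The grouping analysis described under Data processing, which assigns each pair of institutions to a 100 km band, can be sketched as follows. The function names are hypothetical, and the treatment of a distance that falls exactly on a boundary (assigned to the upper band here) is an assumption, since the text does not specify it.

```python
from collections import Counter

BIN_WIDTH_KM = 100  # group width stated in the text


def distance_group(distance_km, width=BIN_WIDTH_KM):
    """Return the (lower, upper) bounds in km of the group containing distance_km."""
    lower = int(distance_km // width) * width
    return lower, lower + width


def citations_per_group(citation_distances_km, width=BIN_WIDTH_KM):
    """Count citations per distance group; the input holds one distance per citation."""
    return Counter(distance_group(d, width) for d in citation_distances_km)


if __name__ == "__main__":
    # Hypothetical distances between citing and cited institutions, in km.
    print(citations_per_group([250, 299.9, 10, 950]))
```

The resulting counts per band are the kind of data to which an exponential decay curve can then be fitted.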

Citation dynamics
The citation trend exhibits a rapid decline from 7 Mm to 20 Mm. Together, Figs 3C and 4C indicate that citations change with actual geographic distance. We find, however, that this trend is similar to the one between countries. This phenomenon drives us to explore the reason behind the peak in Figs 3B, 3C, 4B and 4C. We find that the width of the Atlantic Ocean plays a significant role: the Atlantic separates America from Europe, about 75% of affiliations and 67% of citations of papers are from America and Europe, and the citations between America and Europe account for around 68% of the total citations. The uneven geographical distribution of institutions causes this trend. This observation drives us to explore the relationship between citations and geographical distance while ignoring the influence of the non-uniform geographical distribution of institutions. To this end, we construct the distance matrix of six continents (Asia, Europe, Africa, North America, South America, and Oceania) using the distance-measuring function of Google Maps. According to Figs 3D and 4D, the change in citations for publications closely relates to geographical distance, with more citations associated with shorter geographical distances, and vice versa. This has clear implications for quantifying the impact of scholarly papers: if the citations of a paper come from long distances, these citations are more valuable than citations from short distances; further elaboration can be found in the Discussion Section. Figs 3D and 4D indicate that citations follow a trend similar to that in Figs 3A and 4A.
In addition, we analyze the citation trend by considering the time factor. To illustrate how the relationship between citation trend and geographical distance differs over time, comparisons over 4 decades ('70s, '80s, '90s and '00s) are shown in Fig 5. Fig 5A compares 1970-1979, 1980-1989, 1990-1999. The differences in citation trends (Fig 5A) are consistent with the change in productivity (Fig 5B) over the four periods. We observe a positive correlation between the number of publications and citations. Fig 5C and 5D show the citation trends in North America and Europe. These trends indicate that citations are closely related to geographical distance. These results inspire us to evaluate the impact of papers based on geographical distance (see Methods).

Comparing the impact of papers
Based on the citation network, we compare the scores of quantum PageRank and PageRank. We observe that many papers share the same PageRank score, so the importance of some nodes in the citation network cannot be distinguished, which is considered a typical drawback of PageRank. To show the difference between the scores of quantum PageRank and PageRank, we randomly select 100 scores out of 27,000 for each algorithm. Fig 6 shows the comparative results of quantum PageRank and PageRank for the same nodes in the citation network. According to Fig 6A, node15-node21 yield the same PageRank scores, while their quantum PageRank scores differ (see Fig 6B). Fig 6 indicates that quantum PageRank can better reveal the hierarchy of levels in hierarchical networks.
In order to explore the performance of higher-order weighted quantum PageRank, we compare the scores of higher-order weighted quantum PageRank and weighted quantum PageRank. We find that the higher-order weighted quantum PageRank algorithm can produce different scores where the weighted quantum PageRank algorithm produces identical scores, as shown in Fig 7. According to Fig 7, the scores of node15-node19 are the same in weighted quantum PageRank, while their scores differ in higher-order quantum PageRank. The comparison between the two algorithms indicates that considering higher-order dependencies in the citation network can better identify the impact of papers.
Fig 8 illustrates the effect of higher-order citation networks on quantifying the impact of self-citation. The pathways represent the citations between different papers. The red arrow indicates self-citation. The green arrow indicates the self-citation chain with higher-order dependencies. In Fig 8A, paper $P_0$ cites paper $P_1$, and the citation is a self-citation. The other citation relationships do not include self-citation. For example, paper $P_2$ cites paper $P_0$, and the two papers have no common author. Four papers cite $P_0$, and $P_0$ cites six papers. $W(P_0 \to P_1)$ represents the weight of paper $P_0$ citing paper $P_1$. $W(P_0 \to P_1)$ is equal to $0.72 \times 10^{-8}$ in the original citation network. However, in the higher-order citation network, $W(P_0 \to P_1)$ is equal to the weight of paper $P_0|P_2$ citing paper $P_1$, $W(P_0|P_2 \to P_1)$, namely $2.27 \times 10^{-8}$. The weight in the higher-order network is higher than the weight in the original citation network, indicating that the impact of the self-citation is increased. The citation structure contributes to the enhancement of the weight of the self-citation.
Because the pre-sequence nodes of paper $P_0$ are cited multiple times in the citation network, the weight of paper $P_0|P_2$ citing paper $P_1$ is increased in the higher-order citation network. In Fig 8B, paper $P_3$ cites paper $P_4$, and the citation is a self-citation. There is no self-citation in the other citation relationships. $W(P_3 \to P_4)$ represents the weight of paper $P_3$ citing paper $P_4$ in the original citation network. $W(P_3|P_5 \to P_4)$ represents the weight of paper $P_3$ citing paper $P_4$ in the higher-order citation network. We observe that $W(P_3|P_5 \to P_4)$ in the higher-order citation network is lower than $W(P_3 \to P_4)$ in the original citation network. The reason is that the pre-sequence nodes of paper $P_3$ are cited by only one paper, and paper $P_5$ is a root node in the higher-order citation chain. The citation structure determines the weight change in the higher-order citation network.

Geographical distance
An interesting finding is that the citation pattern is closely related to the geographical distribution of institutions, once the separation by oceans is discounted. The shorter the actual geographic distance between citing and cited institutions, the more citations there are. We weight the citations between institutions by ignoring the oceans separating them. Rare citations are considered more valuable: "less is more." Intuitively, long distance presents a barrier to disseminating research findings and socializing with other researchers in person. Although publishing over the Internet has become a popular alternative, it is a challenge to promote work among the massive amount of information available on the Web. In addition to a Web presence, additional publicity through conferences, seminars, and workshops helps make the work well known. Given the increased cost and effort of frequent travel to faraway destinations, citations made by geographically distant researchers are considered more valuable. At the same time, long-distance citations involve less manipulated promotion, and thus better reflect the true impact of a paper.
It should be noted that this finding does not conflict with, and can be applied as a weighting factor on top of, other "reputation" metrics, such as citations from a paper written by a leading institute or published in a prestigious journal. Investigating such weighted citations is a different topic, and combining them with the geographical distance analysis is beyond the scope of this paper.

Higher-order dependencies
In this paper, we propose a quantitative approach for evaluating the impact of scholarly papers via higher-order citation networks. Evaluating the impact of papers in higher-order citation networks can more objectively reflect the true influence of scholarly papers. Meanwhile, higher-order dependencies can weaken the effect of manipulated citation activities. For example, when researchers manipulate citations to boost the impact of their papers, they usually deliberately cite papers newly published by themselves or their friends. Such manipulation can distort the true citation networks, and it has more influence on the first-order citation networks.
Higher-order dependencies are more likely to occur for the denser nodes and root nodes in citation networks. To find the higher-order dependencies, we exclude sparse nodes (those whose citation chains appear fewer than 50 times among all the citation chains) from the citation networks. The ignored nodes are regarded as zero-order dependencies, and such nodes make up a large proportion of citation networks. In fact, the number of citation relationships is based on the statistics of the citation chains, which are generated using the random walk method. Therefore, for a given citation pair, we find that the number of papers cited by the preceding nodes and the number of citations of the succeeding nodes determine the number of occurrences of the pair in all the citation chains. Based on this finding, we roughly estimate that the probability of such nodes getting more citations is low if the higher-order dependencies of the nodes appear in the citation chains fewer than 50 times. Given a paper, we trace its citation path and generate a citation tree according to the citation relationships. For the root node in the citation tree, the total number of times the root node appears in all the citation chains is related only to the post-sequence nodes. For a leaf node in the citation tree, the total number of times the leaf node appears in all the citation chains is related only to the pre-sequence nodes. This finding can be extended to other networks, in which researchers can find the corresponding higher-order dependencies to better rank the nodes. The general pattern is that the in-degree and the out-degree of a node determine the number of occurrences of the node in all the communication paths in a given network. Furthermore, this general pattern can be used to evaluate the importance of nodes in different networks.
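The generation of citation chains by random walk and the exclusion of rarely occurring citation pairs can be sketched as follows. The walk policy (uniform choice of start paper and of the next cited paper), the maximum chain length, the seed, and the function names are assumptions made for illustration; only the threshold of 50 occurrences comes from the text.

```python
import random
from collections import Counter


def random_walk_chains(cites, num_walks, max_length, seed=0):
    """Generate citation chains by repeatedly following a random citation.

    cites maps each paper to the list of papers it cites.
    """
    rng = random.Random(seed)
    papers = sorted(cites)
    chains = []
    for _ in range(num_walks):
        node = rng.choice(papers)
        chain = [node]
        while len(chain) < max_length and cites.get(node):
            node = rng.choice(cites[node])
            chain.append(node)
        chains.append(tuple(chain))
    return chains


def frequent_pairs(chains, min_count=50):
    """Keep the citation pairs that occur at least min_count times in the chains."""
    pairs = Counter()
    for chain in chains:
        pairs.update(zip(chain, chain[1:]))
    return {pair: n for pair, n in pairs.items() if n >= min_count}


if __name__ == "__main__":
    # Hypothetical citation lists: P0 cites P1 and P2, P2 cites P3, and so on.
    toy_cites = {"P0": ["P1", "P2"], "P1": [], "P2": ["P3"], "P3": [], "P4": ["P0"]}
    chains = random_walk_chains(toy_cites, num_walks=1000, max_length=4)
    print(frequent_pairs(chains))
```

Pairs whose preceding paper is reached often and whose succeeding paper is one of few outgoing citations accumulate the most occurrences, which is in line with the degree-based pattern described above.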

Ranking algorithm analysis
Because PageRank scores depend strongly on the damping parameter α, they can appear arbitrary. Compared to PageRank, the scores of quantum PageRank are less dependent on α, indicating that quantum PageRank is more robust than PageRank in terms of variation of the damping parameter α [37]. We find that more citations are associated with shorter geographical distances. To weaken the impact contributed by citing papers at short distances and strengthen the impact contributed by citations from long distances, we weight the citation networks by an inverse function of the geographical distance between institutions. Based on the finding that citations are closely related to geographical distance, we construct the higher-order weighted quantum PageRank algorithm for objectively quantifying the impact of scholarly papers. In hierarchical networks, quantum PageRank can better distinguish the impact of nodes compared to PageRank, as shown in Fig 6. Higher-order weighted quantum PageRank can capture deeper structural information and better distinguish the impact of nodes compared to weighted quantum PageRank, as shown in Fig 7.

Supporting information
S1 Data Source. Data source used in this paper. (DOCX)