Modeling the Citation Network by Network Cosmology

Citation between papers can be treated as a causal relationship. In addition, some citation networks share a number of properties with the causal networks in network cosmology, e.g., similar in- and out-degree distributions. Hence, it is possible to model the citation network using network cosmology. The causal network models built on homogeneous spacetimes are restricted in describing some phenomena in citation networks, e.g., that hot papers receive more citations than other papers published at the same time. We propose an inhomogeneous causal network model for the citation network, whose connection mechanism expresses several features of citation well. The node growth trend and degree distributions of the generated networks also fit those of some citation networks well.


Introduction
A citation network of papers is a directed graph that describes the citations between the papers. The network regards papers as nodes and contains a directed link from paper i to paper j if i cites j. The idea or method of a paper is, more or less, inspired by its references. The references can thus be treated as sources or causes of the idea or method of the paper. Therefore, a link in a citation network is a kind of causal relationship [2]. Causal relationships exist extensively in physical, biological and social networks [3]. For example, the relationship defined by the light cone structure induces a directed graph, called a causal network, from universe models [4][5][6]. The nodes of those networks are sprinkled randomly and uniformly onto spacetimes. Two nodes are linked by a directed edge from the younger node to the older one if one node is in the other's light cone.
D. Krioukov et al proposed the concept of network cosmology in 2012 [7], showing that the in- and out-degree distributions in the causal networks of de Sitter space are power-law and Poissonian, respectively [8][9][10]. Some citation networks [11][12][13][14][15][16] also have such degree distributions. However, some assumptions of the existing models in network cosmology are not satisfied by various citation networks. For example, in the causal network on a patch of de Sitter space [7], the growth velocity of nodes at time t is proportional to cosh(t), which is too fast for some empirical data. In addition, the existing causal networks are built on homogeneous spacetimes. Hence the nodes born at the same time have equal opportunities to be linked. However, as the empirical studies of the 'attractiveness' or 'fitness' of scientific papers show, hot papers can receive more links or citations than other contemporaneous papers [17,18].
We propose an inhomogeneous causal network model for citation networks. At each time, we generate a circle whose center is on a fixed axis, and sprinkle some nodes uniformly and randomly onto this circle. The radius of the circle is proportional to the number of nodes on the circle. Each node is assigned an interval of angular coordinates, called its influence region. A directed link is generated from node i to node j if i's angular coordinate belongs to the influence region of j and the birth time of i is later than that of j. The influence region gives a causal relationship between nodes and can be assumed to be inhomogeneous: nodes born simultaneously can have influence regions of different lengths.
The connection mechanism is shown to describe the main features of the citing behavior of papers, including relativity, latest, inheritance, popularity, and aging. Assume that the growth function of nodes is an exponential or a constant function of time and that the length of the influence region is inversely proportional to the number of existing nodes. Then the growth trend of new-born nodes, the evolution of the expected out-degree, and the degree distributions of the network generated by the model are shown to fit those of some citation networks well.

The inhomogeneous causal network model
Consider a (2 + 1)-dimensional spacetime with circumference polar coordinates {r, θ, t}. The nodes of the model are uniformly and randomly sprinkled onto a cluster of circles of the spacetime whose centers are on the time axis (Fig. 1). Hence we name it the concentric circles model (CC model). For each time t between t = 1 and t = t_0:
Step 1. Sprinkle N(t) nodes uniformly and randomly onto a new circle S^1 with radius R(t) = N(t)/(2πδ) centred at the point (0, 0, t), where δ is a positive real number;
Step 2. Give each node i an interval D_i for its angular coordinate θ_i to express its influence region;
Step 3. Connect node i and node j by a directed link from i to j, if θ_i ∈ D_j and t_i > t_j.
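Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the influence region D_j is assumed to be an interval centred on θ_j, and the function names, the callable parameters `n_of_t` and `region_length`, and the example values are hypothetical.

```python
import math
import random


def angular_distance(a, b):
    """Angular distance on the circle for a, b in [0, 2*pi]."""
    return math.pi - abs(math.pi - abs(a - b))


def generate_cc_network(t0, n_of_t, region_length, seed=0):
    """Return (nodes, edges) for times t = 1, ..., t0.

    nodes: list of (theta, birth_time).
    edges: list of (i, j), a directed link from the younger node i to the older node j.
    region_length(theta, t): angular length of the influence region (assumed
    to be centred on the node's own angular coordinate).
    """
    rng = random.Random(seed)
    nodes, edges = [], []
    for t in range(1, t0 + 1):
        first_new = len(nodes)
        # Step 1: sprinkle N(t) nodes uniformly onto the circle at time t.
        for _ in range(n_of_t(t)):
            nodes.append((rng.uniform(0.0, 2.0 * math.pi), t))
        # Step 3: link each new node i to every older node j with theta_i in D_j.
        for i in range(first_new, len(nodes)):
            theta_i = nodes[i][0]
            for j in range(first_new):
                theta_j, t_j = nodes[j]
                # Step 2: D_j is taken as an interval around theta_j (assumption).
                half = 0.5 * region_length(theta_j, t_j)
                if angular_distance(theta_i, theta_j) < half:
                    edges.append((i, j))
    return nodes, edges


# Hypothetical example: 4 nodes per time step, constant region length 1 radian.
nodes, edges = generate_cc_network(
    t0=6, n_of_t=lambda t: 4, region_length=lambda theta, t: 1.0
)
```

The double loop is quadratic in the number of nodes, which is acceptable for small illustrations; a sorted angular index would be a natural choice for larger networks.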
Since the radial coordinates of nodes are not used in what follows, we express nodes by their time and angular coordinates. For node i with coordinates (θ_i, t_i), the arclength of the influence region is assumed to be given by Equation (1), where α(t_i) is inversely proportional to the number of existing nodes at time t_i, and β(θ_i) is a piecewise continuous non-negative function of the angular coordinate. For example,
β(θ) = 4 for θ ∈ [0, 0.5π), and β(θ) = 1 for θ ∈ [0.5π, 2π). (2)
For citation networks, α(·) describes the phenomenon that current research is becoming more and more specialized, and β(·) expresses the inhomogeneous popularity of papers published simultaneously.
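As a small illustration of inhomogeneous popularity, the example function β of Equation (2) can be written directly in Python. The helper `mean_beta` is a hypothetical addition that averages β over uniformly spaced angles; it only shows that nodes in the first quarter of the circle carry four times the weight of the others.

```python
import math


def beta(theta):
    """Example popularity function of Equation (2)."""
    theta = theta % (2.0 * math.pi)
    return 4.0 if theta < 0.5 * math.pi else 1.0


def mean_beta(n=1000):
    """Average of beta over n midpoints of a uniform grid on [0, 2*pi)."""
    return sum(beta(2.0 * math.pi * (k + 0.5) / n) for k in range(n)) / n
```

With this β, a quarter of the nodes are "hot" (β = 4) and the rest are ordinary (β = 1), so the average over a uniform sprinkling is 4 · 1/4 + 1 · 3/4 = 1.75.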
In this paper, we discuss two types of N(t): exponential and constant functions. Some journals publish a fixed number of papers at each time. To deal with this case, we can assume N(t) = m together with Equation (3), where m is an integer. In some journals, e.g., PNAS, the number of published papers grows exponentially with time. To model the citation networks of such journals, we assume N(t) = m[e^{lt}] together with Equation (4), where m is an integer, [·] is the rounding function, and l is a positive real number.
When the influence region is given by Equation (4) and β(·) is a constant function, the model is a time-discrete version of the causal network on a patch of a (1 + 1)-dimensional homogeneous spacetime, whose metric in circumference polar coordinates {t ∈ [1, t_0], θ ∈ ℝ mod 2π} is given by Metric (5). Metric (5) is a solution of the generalized hyperbolic geometric flow [19,20]. This flow consists of the equations obtained by keeping the leading terms of the Einstein equations.
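The two growth functions can be sketched as follows. The function names are hypothetical, and the rounding function [·] is taken here to be round-half-up, which is an assumption since the text does not specify the rounding rule.

```python
import math


def n_constant(t, m):
    """Constant growth: N(t) = m papers at every time step."""
    return m


def n_exponential(t, m, l):
    """Exponential growth: N(t) = m * [e^{l t}], with [.] as round-half-up (assumption)."""
    return m * math.floor(math.exp(l * t) + 0.5)


def total_nodes(t0, n_of_t):
    """Number of nodes born at times 1, ..., t0."""
    return sum(n_of_t(t) for t in range(1, t0 + 1))
```

For instance, with m = 3 and l = 0.1 the tenth time step receives 3 · [e] = 9 nodes, since e ≈ 2.718 rounds to 3.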
In the causal networks of spacetimes, the relationship between nodes is defined by light cones. As Fig. 2 illustrates, the future light cone has a counterpart in the CC model, namely the influence region, but the past light cone does not. In fact, if node i belongs to node j's past light cone, then j must belong to i's future light cone. Hence the connection relationships given by the past light cones are redundant.

Modeling the citation behavior
The connection mechanism (Step 3) of the CC model gives geometric expressions to four features of the citing behavior between papers: relativity, latest, inheritance, and popularity.
In order to show details of the sources of the authors' information, ideas and arguments, it is a basic academic requirement that the papers cite some references which are relevant to themselves. The definition of the influence region expresses the relativity of the nodes: if the angular coordinate of node i belongs to D j , we say that i is relevant to j. Hence the connection mechanism gives a geometric realization that the nodes preferentially connect to the relevant nodes.
Papers tend to cite the latest relevant references, which shows that the authors have a good understanding of recent developments. As Fig. 2(a) shows, a node in the CC model can connect to the latest relevant node.
A paper and the papers it cites usually have some common references. This phenomenon can be called inheritance. In the CC model, the smaller the angular distance Δ(θ_i, θ_j) = π − |π − |θ_i − θ_j|| is, the more likely it is that θ_i ∈ D_j, so that i is relevant to j. If the values of Δ(θ_i, θ_j) and Δ(θ_j, θ_l) are small, the value of Δ(θ_i, θ_l) is necessarily small because of the triangle inequality. This means that the probability of θ_i ∈ D_l is high. Therefore, the connection behavior of the CC model has the inheritance feature.
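The angular distance and the triangle inequality used in the inheritance argument can be checked numerically. The following is a minimal sketch; the function names and the random check are illustrative additions, and angles are assumed to lie in [0, 2π).

```python
import math
import random


def angular_distance(a, b):
    """Delta(a, b) = pi - |pi - |a - b|| for angles a, b in [0, 2*pi)."""
    return math.pi - abs(math.pi - abs(a - b))


def triangle_inequality_holds(n_trials=10000, seed=0, tol=1e-12):
    """Check Delta(i, l) <= Delta(i, j) + Delta(j, l) on random triples."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        ti, tj, tl = (rng.uniform(0.0, 2.0 * math.pi) for _ in range(3))
        if angular_distance(ti, tl) > angular_distance(ti, tj) + angular_distance(tj, tl) + tol:
            return False
    return True
```

The distance wraps around the circle, so two angles just on either side of 0 are close: Δ(0.1, 2π − 0.1) is 0.2, not 2π − 0.2.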
Papers prefer to cite popular or hot papers. Here the popularity of a node is expressed by the length of its influence region. Since the nodes in the model are distributed uniformly, the nodes with larger influence regions have more chances to attract connections. This means that the nodes in the model also prefer connecting to popular nodes.
The popularity of papers has been fully considered in some typical models for citation networks [21][22][23][24][25][26][27][28]. Those inspiring and effective models focus on fitting the in- and out-degree distributions, clustering coefficients, aging, and assortative property of citation networks. Compared with those models, as shown in the following sections, the CC model can fit not only the in- and out-degree distributions of some citation networks, but also the trends of the annual number of published papers and of the annual average reference lengths of some datasets of papers. In terms of other properties of citation networks, e.g., the abundance of triangles (paper i cites paper j, j cites l, and i cites l) [27], the model of Wu et al [28] can generate a network with a given number of triangles that matches the empirical citation networks. The CC model needs to be generalized to have such an ability, which is a problem to consider in the future.
The relativity of contents is one of the reasons for citation behaviors, which is not fully considered in the above models for citation networks. The relativity is called similarity in the Popularity × Similarity optimization (PSO) model [29]. It is an undirected network growth model. In this model, instead of preferring the popular nodes, each new node is connected to a constant number of the existing nodes by optimizing a certain trade-off between popularity and similarity.
Compared with the PSO model, the essential difference is that the popularity is inhomogeneous in the CC model but homogeneous in the PSO model, where the nodes born at the same time have the same popularity.
Inheritance is called copy in the copy model [11]. In this model, a new node attaches to a randomly selected node, as well as all the ancestors of the selected node. It means that if the new node i connects to the existing nodes j and l, there must be a link between j and l. The CC model does not have this property. In fact, it is a general phenomenon in citation networks that two references cited by one paper may not have a citation between them. In addition, the relativity of the nodes is not considered in the copy model.

Degree distributions
We calculate the degree distributions for the case in which the influence region is defined by Equation (4). The distributions for the other case are the same; the calculation differs only slightly and is omitted here. For the approximations '≈' in this section, the value of the neglected term is smaller than one tenth of that of the remaining one.
The node with coordinates (θ, t) belongs to the influence regions of the nodes whose coordinates (ϕ, s) satisfy Δ(θ, ϕ) < β(ϕ)/[e^{ls}] and s < t. When β(ϕ)/[e^{ls}] is small enough, β(ϕ) ≈ β(θ), because β(·) is piecewise continuous. Hence the expected out-degree k^+(θ, t) of the node with coordinates (θ, t) is given by Equation (6). The approximation holds for t > 10. Since the number of nodes increases exponentially with time, the nodes born in times [1, 10] make up a small proportion of all nodes. The expected out-degrees of those nodes are small. As a result, the forepart of the fitting curve is slightly shifted from the synthetic data of the out-degree distribution (Fig. 3(a)).
The influence region of the node with coordinates (θ, t) contains the nodes whose coordinates (ϕ, s) satisfy Δ(θ, ϕ) < β(θ)/[e^{lt}] and s > t. Hence the expected in-degree k^−(θ, t) of the node with coordinates (θ, t) is given by Equation (7). The first approximation holds for e^{lt} > 10 and l < 1 (approximating e^l − 1 by the first two terms of its Taylor expansion). The second approximation holds for e^{l(t_0 − t)} > 10. So the restrictions on time are t > (1/l) log(10) and t_0 − t > (1/l) log(10). Since the nodes that do not satisfy the restrictions are born early or late, the expected in-degrees of those nodes are large or small, respectively. As a result, the forepart and the tail of the fitting curve are shifted from the synthetic data of the in-degree distribution (Fig. 3(b)).
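As a worked example of the two time restrictions, the window of birth times in which both in-degree approximations apply can be computed directly. The function name and the sample values of l and t_0 are hypothetical.

```python
import math


def valid_time_window(l, t0):
    """Return (lower, upper) with lower = log(10)/l and upper = t0 - log(10)/l.

    Birth times t with lower < t < upper satisfy both restrictions
    t > (1/l) log(10) and t0 - t > (1/l) log(10).
    """
    bound = math.log(10.0) / l
    return bound, t0 - bound


lower, upper = valid_time_window(l=0.1, t0=100)
```

For l = 0.1 the bound (1/l) log(10) is about 23 time steps, so with t_0 = 100 roughly the first 23 and the last 23 time steps fall outside the range where the approximations are expected to hold.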
Since the nodes are distributed according to a Poisson point process, the degrees in those networks will not be exactly equal to their expected values. In order to find the correct in- or out-degree distributions, as in Ref. [7], we have to average the Poisson distribution (Equation (8)), which is the probability that a node born at time t ∈ [1, t_0] has in- or out-degree k, with the temporal density ρ(t). In the CC model, the temporal density ρ(t) of nodes born at time t is given by Equation (9), in which the approximation holds for e^{lt} > 10. So the out-degree distribution is given by the integral in Equation (10), where Γ(·, ·) is the upper incomplete gamma function, a_1 = β(θ)m/(2π), and τ = a_1 t. The condition for the first approximation is a_1 > 10l, which is satisfied by letting β(θ)m/(2π) > 10l. In the second approximation we have used the asymptotic relation Γ(s + 1, x) ≈ x^s e^{−x} as x → ∞, which requires a large a_1. The third approximation holds for t_0^k e^{−a_1(t_0 − 1)} < 0.1, which can be satisfied by setting a large t_0. When β(θ) is a piecewise constant function, p(k^+ = k) is close to a weighted sum of Poisson distributions. This sum is called a mixture Poisson distribution.
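The averaging of the Poisson distribution over the temporal density can be illustrated numerically. The sketch below is not the closed-form calculation of the text: it assumes an expected out-degree of a_1 t (the quantity τ above) for a fixed β, takes the temporal density proportional to e^{lt} on the integer times 1, ..., t_0, and replaces the integral by a sum over those times. All function names and parameter values are hypothetical.

```python
import math


def poisson_pmf(k, mu):
    """Poisson probability of k with mean mu > 0, computed in log space."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def out_degree_distribution(a1, l, t0, k_max):
    """p(k) for k = 0..k_max: Poisson(a1 * t) averaged with weights ~ e^{l t}."""
    times = range(1, t0 + 1)
    weights = [math.exp(l * t) for t in times]
    total = sum(weights)
    return [
        sum(w / total * poisson_pmf(k, a1 * t) for t, w in zip(times, weights))
        for k in range(k_max + 1)
    ]


p = out_degree_distribution(a1=2.0, l=0.5, t0=20, k_max=150)
mean_out_degree = sum(k * pk for k, pk in enumerate(p))
```

Because the temporal density grows exponentially, most nodes are born late, and the averaged distribution is dominated by Poisson components with means close to a_1 t_0.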
The in-degree distribution is calculated in Equation (11), where a_2 = β(θ)m/(2πl) and τ = a_2 e^{l(t_0 − t)}. Here we have used the Laplace approximation in the third step and Stirling's approximation (k − 2)! ≈ √(2π(k − 2)) ((k − 2)/e)^{k−2} in the fourth step. The integral in the fourth step is approximately independent of k. This can be verified by Equation (12), where T_1 = a_2 e^{l(t_0 − 1)} and T_2 = a_2. The condition for the approximation is a large a_2 or k, which is satisfied for the same reason as in the third step of Equation (10). The in-degree distribution is thus a power law with exponent 2. The numerical experiments (Fig. 3) confirm the results given by Equations (10, 11).
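The exponent 2 can be checked with a short numerical experiment. The sketch assumes an expected in-degree of a_2 e^{l(t_0 − t)} (the quantity τ above), takes the temporal density proportional to e^{lt} on integer times, and estimates the local slope of log p(k) against log k between two degrees. The function names and the parameter values are hypothetical choices.

```python
import math


def poisson_pmf(k, mu):
    """Poisson probability of k with mean mu > 0, computed in log space."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def in_degree_probability(k, a2, l, t0):
    """p(k): Poisson(a2 * e^{l (t0 - t)}) averaged with weights ~ e^{l t}."""
    times = range(1, t0 + 1)
    weights = [math.exp(l * t) for t in times]
    total = sum(weights)
    return sum(
        w / total * poisson_pmf(k, a2 * math.exp(l * (t0 - t)))
        for t, w in zip(times, weights)
    )


def local_exponent(k1, k2, a2=1.0, l=0.05, t0=200):
    """Estimate gamma in p(k) ~ k^{-gamma} from the two degrees k1 < k2."""
    p1 = in_degree_probability(k1, a2, l, t0)
    p2 = in_degree_probability(k2, a2, l, t0)
    return -math.log(p2 / p1) / math.log(k2 / k1)


gamma_estimate = local_exponent(20, 40)
```

In the continuous-time limit the same average gives p(k) proportional to 1/(k(k − 1)), so the estimate should lie slightly above 2 for moderate k and approach 2 as k grows.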

Fitting the empirical data
In this section, the trends of node growth, the trends of the expected out-degree of nodes, and the degree distributions of some citation data are fitted by the above functions. The parameters of the functions are estimated by cftool, a curve fitting toolbox in MATLAB. Four statistical measures are used to measure the goodness of fit: the sum of squares due to error (SSE), the root mean squared error (RMSE), the coefficient of determination (R^2), and the degree-of-freedom adjusted coefficient of determination (Adjusted R^2). The in-degree distributions are also fitted by the method of Clauset et al [37]. The fitting function is f(k) = k^{−γ}.
A citation network can only include a subset of all papers: if a paper cites, or is cited by, a paper outside the subset, the network does not contain any information about this. Hence a node's out-degree is not exactly equal to the length of the reference list of its corresponding paper, and its in-degree is also not equal to that in the entire citation network containing all papers. We call the in- and out-degrees in the entire network the expected in- and out-degrees.
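The four goodness-of-fit measures can be computed as follows. This is a sketch that follows the usual definitions with residual degrees of freedom n − p (n data points, p fitted parameters); whether cftool uses exactly these conventions is an assumption here, and the function name is hypothetical.

```python
import math


def goodness_of_fit(y, y_hat, n_params):
    """Return SSE, RMSE, R^2 and adjusted R^2 for observations y and fitted values y_hat."""
    n = len(y)
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    dof = n - n_params  # residual degrees of freedom
    return {
        "SSE": sse,
        "RMSE": math.sqrt(sse / dof),
        "R2": 1.0 - sse / sst,
        "Adjusted R2": 1.0 - (sse / dof) / (sst / (n - 1)),
    }


stats = goodness_of_fit([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], n_params=2)
```

SSE and RMSE close to 0 and R^2 and Adjusted R^2 close to 1 indicate a good fit; the adjusted version penalizes fits that use more parameters.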
Consider the dataset of papers published from 1915 to 2012 in the Proceedings of the National Academy of Sciences (PNAS, http://pnas.org). The first fit is the exponentially increasing trend of the number of new-born nodes (Equation (9)). Fig. 4(a) illustrates that the number of papers published in PNAS in a given year grows roughly exponentially with time. The annual number of papers in the DBLP dataset also roughly shows an exponentially increasing trend (Fig. 4(c)).
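An exponential trend N(t) ≈ m e^{lt} can be estimated from annual counts with a simple log-linear least-squares fit. Note that this is a swapped-in technique for illustration, not the nonlinear fit performed by cftool, and the synthetic counts below are hypothetical, not PNAS or DBLP data.

```python
import math


def fit_exponential_trend(times, counts):
    """Fit counts ~ m * exp(l * t) by ordinary least squares on log(counts).

    Returns (m, l). All counts must be positive.
    """
    logs = [math.log(c) for c in counts]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    l = sxy / sxx
    m = math.exp(mean_y - l * mean_t)
    return m, l


# Synthetic example with known parameters m = 3 and l = 0.2.
times = list(range(1, 11))
counts = [3.0 * math.exp(0.2 * t) for t in times]
m_hat, l_hat = fit_exponential_trend(times, counts)
```

Fitting on the log scale weights relative errors equally across years, whereas a direct nonlinear fit weights absolute errors, so the two approaches can give somewhat different parameters on noisy data.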
The second fit is the linearly increasing trend of the expected out-degree of nodes (Equation (6)). Fig. 4(b) illustrates that the annual average number of references per PNAS paper grows with time, approximately as a piecewise linear increasing function of time. The data display a turn around the year 1945. So the data are cut into two segments, 1915-1945 and 1946-2012, to make a more precise fit. In our opinion, the main reason why the number of references tends to grow slowly or even decline during 1915-1945 is the two world wars (World War I: 1914-1918 and World War II: 1939-1945). During this period, many scientists were displaced or suffered hardship. Many results were not published, although they served military purposes. After 1945, the information industry developed so rapidly that many fields of science and technology entered a golden age, and the PNAS dataset shows the corresponding growth in the same period. The slope change around 1945 is thus consistent with the development of science after the wars. Since the DBLP dataset does not release reference information, we do not analyze the trend of its annual average reference length here. However, related data, the papers from the issues from 1893 to 2003 of the Physical Review journals [12], also show the linearly increasing trend.
The third fit is the power-law in-degree distribution (Equation (11)). The empirical data (Table 1) include: the citation networks of papers from the e-print arXiv in the period from 1993-01 to 2003-04 in high energy physics phenomenology (Cit-HepPh) and in high energy physics theory (Cit-HepTh) [13,14], and the citation networks from the DBLP dataset (papers before 2010-05-15, papers before 2013-09-29) collected by Tang et al [15].
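For the power-law fit of the in-degree tail, the exponent can be estimated with the approximate maximum-likelihood formula for discrete data given by Clauset et al: γ ≈ 1 + n / Σ ln(k_i / (k_min − 1/2)), where the sum runs over the n degrees with k_i ≥ k_min. The sketch below implements only this estimator; the full method also selects k_min by a goodness-of-fit criterion, which is omitted. The function name and the sample degrees are hypothetical.

```python
import math


def powerlaw_exponent_mle(degrees, k_min):
    """Approximate MLE of gamma for a discrete power law on degrees >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degrees at or above k_min")
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum


# k_min = 10 keeps only the nodes with in-degree larger than 9.
gamma_hat = powerlaw_exponent_mle([3, 10, 20, 40], k_min=10)
```

The approximation is known to be more accurate for larger k_min, which is one reason to restrict the fit to the tail of the distribution.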
The statistical measures in Table 2 show that the citation networks from the DBLP dataset roughly have power-law in-degree distributions with exponent 2, similar to that of the network generated by the CC model (Fig. 3(b)). The in-degree distributions of the nodes with in-degree larger than 9 fit more accurately the power-law distributions with exponent 2 and with the exponent calculated by the method of Clauset et al [37] (Fig. 5(b, c, e, f)). However, the foreparts of the in-degree distributions of Cit-HepPh and Cit-HepTh do not follow the power-law distributions very well (Fig. 5(a)). A possible reason for this misfit is that the time spans of these two networks (10 years) are not long enough to meet the demands of the CC model (the large time scale assumed for the approximations in Equation (11)). As Fig. 5 shows, the curves given by the method of Clauset et al fit the tails of the in-degree distributions better. Hence, the CC model should be given the ability to adjust the power exponent of the in-degree distribution of the generated network. In the next section, we generalize the CC model to model the aging phenomena of the citation behavior. This generalized model has such an ability.
[Fig. 5: in-degree distributions of the citation networks in Table 1. Panels (b, c, e, f) show the fits of the in-degree distributions of the nodes with in-degree larger than 9 by the power-law function f(k) = ak^{−2} and by the power law fitted with the method of Clauset et al [37].]
The fourth fit is the mixture Poisson distribution for the out-degree (Equation (10)). Here we use a simple mixture Poisson distribution to fit the data, which is given by Equation (13), where a, b ∈ ℝ, c ∈ [0, 1], and k ∈ ℤ^+. The goodness of fit in Table 3 shows that the out-degrees of the citation networks from the DBLP dataset approximately follow Equation (13). But the fits for Cit-HepTh and Cit-HepPh are not good.
Besides the relatively short time scale, a possible reason for these misfits is the independence of occurrences in the Poisson distribution: events that happened in the past have no effect on the probabilities of future occurrences. This kind of independence is not fully satisfied in citation networks: papers are more or less affected by the ideas, theories, and methods of previous papers. The generalized Poisson distribution is able to describe situations where the probability of occurrence of an event is affected by previous occurrences [36]. We next use the mixture generalized Poisson distribution defined by Equation (14) to fit the out-degree distributions, where a, b, d, e ∈ ℝ, c ∈ [0, 1], and k ∈ ℤ^+. As Fig. 6(a-d) shows, the node out-degrees, on the whole, follow the mixture distribution. Meanwhile, the statistical measures in Fig. 6 and in Table 3 show that the fits of Equation (14) are better than those of Equation (13).
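A two-component mixture of generalized Poisson distributions can be sketched as follows. The probability function used is the standard generalized Poisson form P(k) = θ(θ + kλ)^{k−1} e^{−(θ + kλ)} / k! with θ > 0 and 0 ≤ λ < 1, which reduces to the ordinary Poisson distribution for λ = 0. The way the parameters a, b, c, d, e enter the mixture below is an assumption made for illustration, not a transcription of Equation (14), and the function names are hypothetical.

```python
import math


def generalized_poisson_pmf(k, theta, lam):
    """Generalized Poisson probability of k, for theta > 0 and 0 <= lam < 1."""
    log_p = (
        math.log(theta)
        + (k - 1) * math.log(theta + k * lam)
        - (theta + k * lam)
        - math.lgamma(k + 1)
    )
    return math.exp(log_p)


def mixture_generalized_poisson_pmf(k, a, b, c, d, e):
    """Mixture with weight c on GP(a, d) and weight 1 - c on GP(b, e) (assumed form)."""
    return c * generalized_poisson_pmf(k, a, d) + (1.0 - c) * generalized_poisson_pmf(k, b, e)


# Hypothetical parameters: a low-degree component and a high-degree component.
probabilities = [mixture_generalized_poisson_pmf(k, 2.0, 15.0, 0.6, 0.3, 0.2) for k in range(300)]
```

The extra parameter λ makes the variance larger than the mean, which is what allows the distribution to capture the dependence between successive occurrences described above.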

Modeling the aging phenomena
It has been empirically observed that the probability of a paper being cited is a decreasing function of the paper's age [30][31][32]. Some growing network models include the aging of nodes as a feature [33]. In those models, the probability that a paper receives a citation is expressed by a function Γ(k^−, t), which depends on the number of citations k^− already received and on the publication time t. In some models, the two effects are considered to be independent: Γ(k^−, t) = f(k^−)g(t) with some functional forms of f(k^−) and g(t) [34,35]. Inspired by the fitness expression in the PSO model, we give an influence region with an aging effect: the length of the influence region of node i with coordinates (θ_i, t_i) is given by Equation (15), where l > 0, t_c is the current time, and a ∈ [0, 1) is a parameter tuning the velocity of aging. When a > 0, the length of the node influence region is a decreasing function of t_c, which models the phenomenon that the probability of a paper being cited decreases with the paper's age. When the influence region is given by Equation (15), the expected in- and out-degrees of the node with coordinates (θ, t) are given by Equations (16) and (17), respectively; the dependence of the expected out-degree on t enters through the factor (e^{−l(1−a)} − e^{−l(1−a)t}) / (l(1 − a)). The approximations hold for large t and t_0. When t is large enough, k^+(θ, t) tends to a function that is free of t. It has been empirically observed that the annual average number of references per paper is a monotone increasing sequence for some journals, e.g., PNAS (Fig. 4(b)). Meanwhile, it is reasonable to think that the number of references of a paper cannot grow to infinity and should have an upper bound. Hence, the expected out-degree given by Equation (17) is reasonable, because a bounded monotonic sequence has a limit.
[Fig. 6: out-degree distributions of the citation networks in Table 1 and the fitting curves of the distributions. The fitting model is the mixture generalized Poisson distribution (Equation (14)). doi:10.1371/journal.pone.0120687.g006]
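The saturation of the expected out-degree can be illustrated with the time-dependent factor (e^{−l(1−a)} − e^{−l(1−a)t}) / (l(1 − a)) read off from the expression in this section. Treating this factor as the whole t-dependence of the expected out-degree, with any constant prefactor omitted, is an assumption of this sketch, and the function names are hypothetical.

```python
import math


def out_degree_factor(t, l, a):
    """Time-dependent factor (e^{-r} - e^{-r t}) / r with r = l * (1 - a)."""
    rate = l * (1.0 - a)
    return (math.exp(-rate) - math.exp(-rate * t)) / rate


def out_degree_factor_limit(l, a):
    """Limit of the factor as t grows: e^{-r} / r with r = l * (1 - a)."""
    rate = l * (1.0 - a)
    return math.exp(-rate) / rate


values = [out_degree_factor(t, l=0.1, a=0.5) for t in range(1, 51)]
upper_bound = out_degree_factor_limit(l=0.1, a=0.5)
```

The factor starts at 0 for t = 1, increases monotonically, and approaches its limit exponentially fast, which mirrors the argument that a bounded monotone sequence of average reference counts must converge.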
With calculations similar to those in Equations (10, 11), we find that the network generated by the model whose influence region is given by Equation (15) has a power-law in-degree distribution with exponent 1 + 1/a. The out-degree distribution is close to a mixture Poisson distribution.

Conclusions
We propose a model for citation networks using network cosmology, whose connection mechanism gives a geometric expression of the main features of citing behavior: relativity, latest, inheritance, popularity, and aging. The model relaxes the homogeneity assumption of some existing models in network cosmology: the nodes born at the same time can have different popularity. This property expresses the phenomenon that hot papers can receive more citations than other papers published at the same time. We show that the node growth trend, expected node out-degree, and degree distributions of the network generated by the model fit those of some citation networks well.