Scientific X-ray: Scanning and quantifying the idea evolution of scientific publications

The rapid development of modern science makes it challenging to pick out valuable ideas from the massive scientific literature. Existing widely adopted citation-based metrics are not adequate for measuring how well the idea presented by a single publication is developed and whether it is worth following. Here, inspired by traditional X-ray imaging, which returns internal structural images of real objects along with corresponding structural analysis, we propose Scientific X-ray, a framework that quantifies the development degree and development potential of any scientific idea through an assembly of 'X-ray' scanning, visualization and parsing operated on the citation network associated with a target publication. We pick all 71,431 scientific articles with citation counts over 1,000 as high-impact target publications from a total of 204,664,199 publications that cover 16 disciplines spanning from 1800 to 2021. Our proposed Scientific X-ray reproduces how an idea evolves from the original target publication all the way to its up-to-date status via an extracted 'idea tree' that attempts to preserve the most representative idea flow structure underneath each citation network. Interestingly, we observe that while the citation counts of publications may increase without bound, the maximum valid idea inheritance of those target publications, i.e., the valid depth of the idea tree, cannot exceed a limit of six hops, and the idea evolution structure of any arbitrary publication falls, without exception, into six fixed patterns. Combined with a development potential index that we further design based on the extracted idea tree, Scientific X-ray can vividly tell how much further a given idea presented by a given publication can go from any well-established starting point. Scientific X-ray successfully identifies 40 out of 49 Nobel prize topics as high-potential topics via their prize-winning papers, an average of nine years before the prizes are awarded.
Various trials on articles of diverse topics also confirm the power of Scientific X-ray in uncovering influential and promising ideas. Scientific X-ray is friendly to researchers at any level of expertise, thus providing an important basis for grasping research trends, helping scientific policy-making and even promoting social development.


S1 Data details
Scientific X-ray is built on the database of Acemap, which contains 204,664,199 publications in all disciplines collected and integrated from bibliographic databases, including but not limited to IEEE, ACM, arXiv, Elsevier, and Springer. We select all 71,431 high-impact publications with more than 1,000 citations as pioneering works and construct citation networks for them. All leading articles were published between 1800 and 2021, and their research interests cover 294 fields in 16 disciplines: History, Computer science, Environmental science, Geology, Psychology, Mathematics, Physics, Materials science, Philosophy, Biology, Medicine, Sociology, Art, Economics, Chemistry, and Political science. Data details of the leading articles and citation network overviews appearing in the corresponding sections of the main text and supplementary materials are shown as follows.

S1.1 Data in main text

Table S1-1. Data details of the leading articles in the main text. (Table content garbled in extraction; surviving fragments include the journals Proceedings of the National Academy of Sciences of the United States of America, The Cryosphere, and The New England Journal of Medicine; the titles 'Mangroves among the most carbon-rich forests in the tropics' and 'Global, regional, and national comparative risk assessment of 79 behavioural, environmental and occupational, and metabolic risks or clusters of risks in 188 countries, 1990-2013: a systematic analysis for the Global'; and the citation counts 1,283 and 9,708.)

S2.1 The extraction of the idea tree

There are many redundant links in the citation network that have little academic influence on the citing papers. Therefore, repetitive, invalid inheritance relationships need to be removed to clearly and accurately reproduce the flow of the idea, which can be achieved by assessing the similarity between papers. Ideally, we assume that every child article in the network except the leading article is inspired by one most essential citation (the more similar, the more important), so that we can obtain an idea tree that reveals the inheritance of ideas. In this way, we can characterize different evolution patterns of ideas through different idea tree structures.
There are three steps to extracting the idea tree from the citation network. First, the nodes in the network are represented as vectors in a high-dimensional space. Second, we calculate the reduction index of each node and measure the importance of each connection according to the difference in reduction indices. Finally, we obtain the idea tree by cutting the edges between the node pairs that have the largest reduction-index difference.

The distance of academic articles in high-dimensional space
The first step is to measure the distance between the nodes implied by the citation relationships. In particular, we utilize graph embedding to measure such distance in a high-dimensional vector space. For any target publication, we construct its citation network G(V, E), where V is the set of all nodes and E is the set of all edges. We define n = |V| as the number of nodes in the network and m = |E| as the number of edges. A denotes the adjacency matrix of the network, where A_ij = 1 indicates that paper v_i cites paper v_j and A_ij = 0 otherwise. Due to errors in real data, however, A_ij = 0 does not guarantee that paper v_i does not cite paper v_j. In principle, two papers seldom cite each other, since papers are published sequentially; nevertheless, we do find a few such records in the database, and we convert each cite-each-other relationship into a normal reference relationship following the rule that the paper published later cites the paper published earlier. Notably, the leading paper cites none of the papers in the network since it is the earliest one, which would block the later calculation of eigenvalues and eigenvectors. Considering this, we add a self-citation (self-loop) to the leading paper, which allows the subsequent eigenvalue decomposition. After the process above, we obtain W, the adjacency matrix with self-loops. Continuing to process W, we obtain the degree matrix with self-loops D, where D_ii = Σ_j W_ij. Since D is a diagonal matrix, D^{-1/2} is easily obtained. Based on this, we get the Laplacian matrix with self-loops L = D − W and the normalized Laplacian matrix with self-loops L_norm = D^{-1/2}(D − W)D^{-1/2}, with both matrices positive semidefinite.
Based on this, we then perform eigenvalue decomposition on the normalized Laplacian matrix with self-loops and acquire n eigenvalues and the corresponding n eigenvectors. Generally speaking, we select the first k eigenvalues and the corresponding eigenvectors, which projects the original citation network into a k-dimensional space. To adequately exploit the data, we choose k = n for later analysis, while in practice k can be selected from 2 to n to reduce the computational cost on large amounts of data. Then, for any two papers v_i and v_j in the citation network, their distance in the k-dimensional space is d_ij = ||eigvector_{v_i} − eigvector_{v_j}||_2, and we obtain the distance matrix d of the papers in the k-dimensional space. We further use MaxDistance, which can be intuitively understood as the maximum distance in the high-dimensional space over all edges existing in the network.
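The embedding and distance computation above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' code: the function name, the use of `numpy.linalg.eigh`, and the symmetrization of the directed Laplacian before decomposition are our assumptions.

```python
import numpy as np

def spectral_embedding_distances(A):
    """Build the self-looped normalized Laplacian of a citation network
    and return pairwise node distances in the spectral embedding.
    A: n x n adjacency matrix with A[i, j] = 1 if paper i cites paper j."""
    n = A.shape[0]
    W = A + np.eye(n)                        # self-loops (incl. the leading paper)
    deg = W.sum(axis=1)                      # row degrees of W
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.diag(deg) - W                     # Laplacian with self-loops: L = D - W
    L_norm = D_inv_sqrt @ L @ D_inv_sqrt     # normalized: D^{-1/2} (D - W) D^{-1/2}
    # the directed adjacency makes L_norm slightly asymmetric; symmetrize so
    # that eigh applies (an implementation choice, not stated in the paper)
    eigvals, eigvecs = np.linalg.eigh((L_norm + L_norm.T) / 2.0)
    emb = eigvecs                            # k = n: keep all n coordinates
    diff = emb[:, None, :] - emb[None, :, :]
    return np.linalg.norm(diff, axis=2)      # d[i, j] = ||x_i - x_j||_2
```

With k = n the full eigenbasis is kept; truncating `emb` to the first k columns reproduces the cheaper k-dimensional variant mentioned above.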
Based on the distance measurement in the high-dimensional space, we exploit the random walk algorithm to calculate the reduction index of every node with respect to the entire network. However, before defining the reduction index of a specific node to the entire network, we need to define the reduction index of one node to another (a node pair) first.
The reduction index of one node to another (node pair) We first define the reduction index between a node pair. For a node pair (v_i, v_j), to measure the similarity of research content between v_i and v_j, we define the reduction index of the node pair as the sum of the weighted Dijkstra paths [1] from v_i to all v_{j_k} in v_j's reference list, where the weight is the distance between two adjacent nodes on the path in the high-dimensional space.
Specifically, for any v_{j_k}, the weighted Dijkstra path from v_i to v_{j_k} is the weighted sum of the edges on the Dijkstra path when a path exists from v_i to v_{j_k}, and AverageStep is the average number of steps (edges) between every node pair that can reach each other, whether in a single step or multiple steps. Similar to Symeonidis et al. [2], the weighted shortest path is introduced to calculate the similarity between non-neighboring nodes. We are also inspired by the idea that 'if two nodes are connected to a similar node, then the two nodes are similar' [3]. For article v_j, the articles in its reference list can be considered the source of its ideas. Therefore, the closer article v_i is to the articles in the reference list of v_j, the more similar it is to the research content of article v_j.
Besides, the two situations, depending on whether a path exists from v_i to v_{j_k} or not, are interpreted as follows.
1. When a path exists from v_i to v_{j_k}, the Dijkstra path quantifies the distance between v_i and v_{j_k}, whether they are connected directly or indirectly via paths.
2. When no path exists from v_i to v_{j_k}, the calculated distance should be significantly larger than the distance when a path exists. Therefore, we use MaxDistance to make the value significantly large. As for the interpretation of MaxDistance, we deem that the distance of a virtual edge (one not really existing in the network) should also be significantly larger than the distance of really existing edges. As for AverageStep, it can be interpreted as follows: MaxDistance corresponds to a single direct virtual edge, while the virtual path from v_i to v_{j_k} may consist not of a single virtual edge but of multiple virtual edges connected indirectly.
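The node-pair reduction index can be sketched as follows, using a standard Dijkstra search over the high-dimensional edge distances. The function names and the exact penalty MaxDistance × AverageStep for unreachable references are our reading of the description above, not a verified implementation.

```python
import heapq

def dijkstra(adj, src):
    """Weighted shortest-path distances from src.
    adj: {u: {v: weight}}, weights being high-dimensional edge distances."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                          # stale queue entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def reduction_index_pair(adj, refs, v_i, v_j, max_distance, average_step):
    """Reduction index of node pair (v_i, v_j): sum of weighted Dijkstra
    distances from v_i to every paper in v_j's reference list; references
    unreachable from v_i contribute max_distance * average_step instead."""
    dist = dijkstra(adj, v_i)
    total = 0.0
    for ref in refs[v_j]:
        total += dist.get(ref, max_distance * average_step)
    return total
```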
After defining the reduction index of one node to another (node pair), we then define the reduction index of any specific node to the entire network.
The reduction index of any specific node to the entire network For node v, its reduction index to the entire network G is defined as the sum of its reduction indices to all other nodes in the network.
The reduction index to the entire network helps us judge the importance of citations. The greater the difference in the reduction indices of two nodes to the network, the less important the reference relationship between them. Therefore, we find undirected loops and cut the unimportant node pairs according to the difference in reduction index to the entire network while maintaining connectivity under directed-graph conditions. During this process, two fundamental but significant principles should be followed: 1. Cut the node pair with the largest difference in reduction index to the entire network.
For this principle, we sort the node pairs according to the difference in reduction index to the entire network, and attempt to cut them in descending order. The specific criterion for whether to cut is stated in principle 2.

2. Maintain connectivity under directed-graph conditions
For this principle, we do not cut an edge that represents the last reference relationship still existing in the graph, and we skip such edges when sorting and selecting the edges to cut. Under these conditions, if every article except the leading paper retains only one reference relationship, then the output we get after cutting the edges will undoubtedly be a tree structure.
Following the two principles above, we obtain an idea tree that reflects the idea flow of the citation network and satisfies two properties: 1. The leading work is the only root node of the idea tree, and the whole structure is rooted on the leading node.
2. Starting from the root node of the idea tree and traversing the citation relationships in reverse, every node, i.e., every paper in the idea tree, can be reached.
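The cutting procedure above can be sketched as a greedy routine. This is a simplified illustration: it enforces principle 2 by keeping each child's last remaining reference rather than running the full loop-based connectivity check, and the names are hypothetical.

```python
def extract_idea_tree(edges, node_reduction):
    """Greedily cut citation edges until each child keeps one parent.

    edges: list of (child, parent) pairs, meaning child cites parent.
    node_reduction: reduction index of each node to the whole network.
    Principle 1: try to cut edges in descending order of the
    reduction-index difference of their endpoints.
    Principle 2: never cut a child's last remaining reference, so the
    result stays connected and forms a tree rooted at the leading work."""
    ref_count = {}
    for child, _ in edges:
        ref_count[child] = ref_count.get(child, 0) + 1
    ranked = sorted(
        edges,
        key=lambda e: abs(node_reduction[e[0]] - node_reduction[e[1]]),
        reverse=True,
    )
    kept = set(edges)
    for child, parent in ranked:
        if ref_count[child] > 1:          # principle 2: keep the last reference
            kept.discard((child, parent))
            ref_count[child] -= 1
    return kept
```

Since every non-root child ends with exactly one kept reference, the surviving edges form the idea tree's parent links.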

S2.2 The calculation of knowledge entropy
Based on the idea tree, we can start from the structure of the tree and utilize structural information to measure the knowledge quality of academic articles. Specifically, we choose entropy to measure the quality of knowledge because entropy can measure a paper's influence on uncertainty, compared with the situation in which the paper does not exist. The uncertainty differs between the two situations: with the paper involved, the structure of the idea tree is somewhat determined; without the paper involved, some structure of the idea tree remains unknown. Therefore, the more a paper influences the idea tree, the more certain the structure of the idea tree becomes, and the larger the paper's knowledge entropy is.

Subtree entropy
For an academic paper a, the subtree entropy of a is defined as follows:

H(a) = −(g_a / 2m) log(V_a / V_{a^−})

The definition of subtree entropy follows the definition of structure entropy in [4], which measures the high-dimensional information embedded in network structures with the help of a partition tree. In the subtree entropy, g_a represents the number of edges in the original citation network from the nodes in the subtree rooted on a in the idea tree to the nodes outside the subtree, and m represents the number of edges in the idea tree. The larger g_a is, the more complex the structure associated with the subtree rooted on a. Therefore, the term g_a / 2m measures the importance of the subtree rooted on a to the whole idea tree. V_a represents the number of nodes in the subtree rooted on a, while V_{a^−} represents the number of nodes in the subtree rooted on a's parent node. The term −log(V_a / V_{a^−}) measures the uncertainty of the subtree rooted on a relative to its parent subtree. Generally speaking, the subtree entropy measures the effect of the presence or absence of the corresponding subtree on the uncertainty of the whole idea tree. The greater the influence of a subtree on the idea tree, the greater its subtree entropy.
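As a worked example, the subtree entropy can be computed directly from the four counts in its definition. Log base 2 is our assumption; the supplement does not state the base.

```python
import math

def subtree_entropy(g_a, m, V_a, V_a_parent):
    """H(a) = -(g_a / 2m) * log2(V_a / V_{a^-}).
    g_a: cross edges leaving the subtree rooted on a in the original
    citation network; m: edges in the idea tree; V_a: nodes in the
    subtree rooted on a; V_a_parent: nodes in the parent's subtree."""
    return -(g_a / (2.0 * m)) * math.log2(V_a / V_a_parent)
```

For instance, a subtree of 2 nodes under a parent subtree of 8 nodes, with 4 outgoing cross edges in an idea tree of 10 edges, gives H(a) = −(4/20)·log2(2/8) = 0.4.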

Mutual knowledge entropy and conditional knowledge entropy
With the definition of subtree entropy above, the definition of mutual knowledge entropy is given in terms of the following quantities: g_ab represents the number of edges in the original citation network from the nodes in the subtrees rooted on a and on b in the idea tree to the nodes outside the two subtrees; m represents the number of edges in the idea tree; V_a represents the number of nodes in the subtree rooted on a, V_b the number of nodes in the subtree rooted on b, and V_{ab^−} the number of nodes in the subtree rooted on the parent node of a and b, which requires that a and b have the same parent node, or that the two nodes occupy similar positions in the idea tree. The mutual knowledge entropy measures the degree of overlap of the knowledge contained in two subtrees. The overlapping knowledge can be considered not to be created by these subtrees but to be inherited from the parent node. Given its definition and character, the mutual knowledge entropy satisfies I(a, b) = I(b, a), which reflects its symmetry, and I(a, a) = H(a), which reflects its self-symmetry.
With the mutual knowledge entropy defined above, the conditional knowledge entropy is further defined:
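The displayed formula for I(a, b) is missing from this copy. Purely as an illustration, one candidate form that uses exactly the quantities listed above and satisfies both stated properties (symmetry and I(a, a) = H(a)) is sketched below; it is not necessarily the authors' definition.

```python
import math

def mutual_knowledge_entropy(g_ab, m, V_a, V_b, V_ab_parent):
    """Candidate form (an assumption, not the paper's formula):
    I(a, b) = -(g_ab / 2m) * log2(sqrt(V_a * V_b) / V_{ab^-}).
    With a = b it reduces to -(g_a / 2m) * log2(V_a / V_{a^-}) = H(a),
    and it is symmetric in a and b, matching the stated properties."""
    return -(g_ab / (2.0 * m)) * math.log2(math.sqrt(V_a * V_b) / V_ab_parent)
```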

Knowledge entropy
Based on the subtree entropy and mutual knowledge entropy above, the definition of knowledge entropy is given as follows: at timestamp t, the citation network related to the target publication is G_t = (V_t, E_t). The idea tree extracted from the network is IdeaTree(G_t). For any paper v belonging to IdeaTree(G_t) except the leading work, we define its knowledge entropy KE_t(v) as

KE_t(v) = H_t(v) − Σ_{v_i ∈ C_t(v)} H_t(v_i) + Σ_{v_i, v_j ∈ C_t(v), i ≠ j} I_t(v_i, v_j)

where H_t(v) represents the subtree entropy of the subtree led by node v at t, C_t(v) represents the children of v in IdeaTree(G_t) at t, and I_t(v_i, v_j) represents the mutual knowledge entropy of the subtrees led by v_i and v_j at t. Knowledge entropy is composed of two parts. The first part is the subtree entropy of node v minus the subtree entropies of its child nodes, which quantifies the influence of node v itself on the formation of the idea tree structure by excluding the influence of the child nodes C_t(v). The first part may be negative, which indicates that the child nodes have more influence on the network structure than the parent node. The second part, Σ_{v_i, v_j ∈ C_t(v), i ≠ j} I_t(v_i, v_j), reflects, from the side, the amount of knowledge inherited by the child nodes C_t(v) from the parent node v, utilizing the mutual knowledge entropy. Although the first part of KE may be negative when the knowledge of a parent node is inherited by a large number of child nodes, this inheritance causes the second term of the formula to increase, so we still consider such a node to have high knowledge quality. In the actual calculation, when the second term of the formula is very small and KE becomes negative, we directly set KE to 0, considering that the amount of knowledge of a scientific article cannot be negative. In this case, the article neither influences the structure sufficiently nor creates valuable knowledge to be inherited, so we consider it to have low knowledge quality.
As for the leading article, since it has no parent node, its subtree entropy cannot be calculated directly. However, considering the numerical difference between subtree entropy and knowledge entropy, the influence of the subtree entropy on the knowledge entropy can be ignored. Therefore, for the leading article v_s, its knowledge entropy is given as follows.
where C_t(v_s) represents the child nodes of the leading article in IdeaTree(G_t) at t.
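Putting the two parts together, KE_t(v) can be computed from the subtree entropies and the pairwise mutual entropies of the children. This is a direct transcription of the description above; the clipping at zero follows the supplement.

```python
def knowledge_entropy(H_v, H_children, I_child_pairs):
    """KE_t(v) = [H_t(v) - sum of children's subtree entropies]
                 + [sum of mutual entropies over distinct child pairs],
    clipped at 0 since the amount of knowledge cannot be negative.
    H_v: subtree entropy of v; H_children: H_t(c) for each c in C_t(v);
    I_child_pairs: I_t(v_i, v_j) over distinct child pairs of v."""
    ke = H_v - sum(H_children) + sum(I_child_pairs)
    return max(ke, 0.0)
```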

S2.3 The fitting of idea limit formula
Assuming that the idea limit formula takes the form ΔD_t(v) = log(KE_t(v) / (t − t_0)^γ), we use the evolution data of the idea trees and knowledge entropies as sample data to fit the time attenuation coefficient γ with the least squares method. For any high-knowledge-entropy node v in any idea tree, node v becomes visible at t_0, i.e., KE_{t_0}(v) ≥ M; taking node v as a reference, the valid depth of the subtree led by node v is 0 at t_0. The maximum valid depth MaxVD_{subtree_v} of the subtree led by node v up to the current time t_now is VD^{t_now}_{subtree_v}; at any moment t between t_0 and t_now, we take VD^t_{subtree_v} as the sample value of ΔD_t(v). Similarly, knowing the knowledge entropy KE_t(v) of node v at t, we can obtain the sample data (KE_{t_i}(v_j), t_i − t_0, ΔD_{t_i}(v_j)) from all the idea trees to fit the formula. We transform the formula to log KE_t(v) − ΔD_t(v) = γ log(t − t_0), and the value of γ can be obtained by the least squares method: setting the derivative of the objective function with respect to γ to 0, the fitting result is γ = 1.914.

The red node in the network is the leading article. Except for the blue nodes, nodes with the same color belong to the same larger community, and the blue nodes do not belong to these communities. The size of a node is positively related to its citations within the network. Different types of leading articles' ideas make the citing papers associate in different ways.
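The closed-form least-squares fit of γ, a line through the origin in the transformed variables x = log(t − t_0) and y = log KE_t(v) − ΔD_t(v), can be sketched as follows. Natural logarithms and the function name are our assumptions.

```python
import math

def fit_gamma(samples):
    """Least-squares estimate of gamma in
    log KE_t(v) - Delta_D_t(v) = gamma * log(t - t0), with no intercept.
    samples: iterable of (KE, t_minus_t0, delta_D) tuples.
    Setting d/d(gamma) of sum (y_i - gamma * x_i)^2 to zero gives
    gamma = sum(x_i * y_i) / sum(x_i ** 2)."""
    num = den = 0.0
    for ke, dt, dd in samples:
        x = math.log(dt)
        y = math.log(ke) - dd
        num += x * y
        den += x * x
    return num / den
```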

S3 Examples of citation networks pioneered by high-impact publications
As shown in Fig. S3-1(a-c), these networks are led by a textbook, a survey and a software toolkit, respectively. All these networks only form a uniform ring around the summative leading article.

The leading work of the idea tree is 'Principal component analysis: a review and recent developments'. It was published in 2016 and had already attracted 1,203 citations by 2021. The leading work tends to summarize existing knowledge, so it is not very inspiring for child nodes and cannot provide new research ideas. By observing the evolution of the idea tree over time, we find that even though its citations continue to increase, it never breeds new high-impact nodes, which stagnates its VD at zero.

The leading work of the idea tree is 'Multilevel Analysis: Techniques and Applications, Third Edition'. It was published in 2017 and had already attracted 2,321 citations by 2021. The leading work tends to summarize existing knowledge, so it is not very inspiring for child nodes and cannot provide new research ideas. By observing the evolution of the idea tree over time, we find that even though the citations of the target publication continue to increase, it never breeds new high-impact nodes, which stagnates its VD at zero.

Fig S4-3. The evolution of idea tree structures led by 'An Introduction to Medical Statistics'

The leading work of the idea tree is 'An Introduction to Medical Statistics'. It was published in 1987 and had already attracted 2,128 citations by 2021. The leading work tends to summarize existing knowledge, so it is not very inspiring for child nodes and cannot provide new research ideas. By observing the evolution of the idea tree over time, we find that even though the citations of the target publication continue to increase, it never breeds new high-impact nodes, which stagnates its VD at zero.

Fig S4-4. The evolution of idea tree structures led by 'TensorFlow: a system for large-scale machine learning'

The leading work of the idea tree is 'TensorFlow: a system for large-scale machine learning'. It was published in 2016 and had already attracted 5,199 citations by 2021. The leading work tends to summarize existing knowledge, so it is not very inspiring for child nodes and cannot provide new research ideas. By observing the evolution of the idea tree over time, we find that even though the citations of the target publication continue to increase, it never breeds new high-impact nodes, which stagnates its VD at zero.

S4.2 Pattern 2: The increase in VD needs to be driven by non-trivial child nodes
Recent Arctic amplification and extreme mid-latitude weather

The pioneering work of the idea tree is 'Matching networks for one shot learning'. It was published in 2016 and had already attracted 1,515 citations by 2021. The KE of child node A is the first to stand out, but after 2019 its KE almost stopped increasing. At this point, the KE of child node B, which was directly inspired by paper A, began to emerge and exceed that of paper A. Paper B took over the task of motivating the VD increase. The idea tree thus achieves multiple inheritances of ideas from the leading article and ensures that it can continue to attract attention. This made the VD increase continuously to two.

The pioneering work of the idea tree is 'A Meta-Analysis of Global Urban Land Expansion'. It was published in 2011 and had already attracted 1,118 citations by 2021. The KE of child node B is the first to stand out, but after 2013 the growth of its KE started to slow down. At this point, the KE of child node A, which was directly inspired by paper B, began to emerge and exceed that of paper B. Paper A took over the task of motivating the VD increase. The idea tree thus achieves multiple inheritances of ideas from the leading article and ensures that it can continue to attract attention. This made the VD increase continuously to two.
Cleavage of GSDMD by inflammatory caspases determines pyroptotic cell death

The pioneering work of the idea tree is 'Cleavage of GSDMD by inflammatory caspases determines pyroptotic cell death'. It was published in 2015 and had already attracted 1,583 citations by 2021. The KE of child node A is the first to stand out, but after 2019 the KE of child node B, which was directly inspired by paper A, began to exceed that of paper A. Paper B took over the task of motivating the VD increase. The idea tree achieves multiple inheritances of ideas and ensures that it can continue to attract attention. Several child nodes with high KE were born under the subtree led by paper B, which made the VD increase continuously to five.

S4.4 Pattern 4: The presence of overpowered child nodes can ruin the increase in the VD
The Kinetics Human Action Video Dataset   Fig S4-13

The pioneering work of the idea tree is 'Non-ideal interactions in calcic amphiboles and their bearing on amphibole-plagioclase thermometry'. It was published in 1994 and had already attracted 1,612 citations by 2021. In the layer below the leading work, two nodes with high KE were born. In the early stage of the idea tree's development, the subtree led by paper B first became prosperous. When the KE of paper A exceeded that of paper B, the subtree led by paper A began to prosper, and new high-KE nodes were born in it, thus increasing the VD by one. In this process, the newly emerged branches attracted outside attention. In contrast, the subtree led by paper B began to be ignored, thus missing the golden opportunity for development, which led to the stagnation of its development.
A dynamic global vegetation model for studies of the coupled atmosphere-biosphere system

The pioneering work of the idea tree is 'A dynamic global vegetation model for studies of the coupled atmosphere-biosphere system'. It was published in 2005 and had already attracted 1,467 citations by 2021. In the layer below the leading work, two nodes with high KE were born. In the early stage of the idea tree's development, the subtree led by paper A first became prosperous. When the KE of paper B exceeded that of paper A, the subtree led by paper B began to prosper. In contrast, the subtree led by paper A began to be ignored, thus missing the golden opportunity for development, which led to the stagnation of its development.

In the field of computer vision, the scientific publications in the first and third places of development potential are related to the semantic segmentation of images. With the urgent need for scene understanding in many practical applications, such as automatic driving and human-computer interaction, inferring semantic information from images has become a valuable application in the field of computer vision. In addition, there are also a groundbreaking publication utilizing neural networks to process point cloud data (2), the creation (6) and improvement (5) of the important YOLO model in object detection, and the seminal work utilizing deep learning to improve image resolution (8).

Table S5-3. Top ten publications in the field of data mining appearing in the past ten years according to DPI

In the field of data mining, the most promising publication is GraphSAGE, which is related to graph neural networks. GraphSAGE solves the problem that the graph convolutional network (GCN) is too slow when representing new nodes, and enables rapid deployment in production environments to bring practical benefits. In second place is a recommendation system framework based on deep learning proposed by Google (2).
Google has applied this method to its Google Play app recommendation business, and it has also been imitated and applied by many companies. This shows its huge development potential.