Anatomy of Scientific Evolution

The quest for historically impactful science and technology provides invaluable insight into the innovation dynamics of human society, yet many studies are limited to qualitative and small-scale approaches. Here, we investigate scientific evolution through systematic analysis of a massive corpus of digitized English texts between 1800 and 2008. Our analysis reveals great predictability for long-prevailing scientific concepts based on the levels of their prior usage. Interestingly, once a threshold of early adoption rates is passed even slightly, scientific concepts can exhibit sudden leaps in their eventual lifetimes. We developed a mechanistic model to account for such results, indicating that slowly-but-commonly adopted science and technology surprisingly tend to have higher innate strength than fast-and-commonly adopted ones. The model prediction for disciplines other than science was also well verified. Our approach sheds light on unbiased and quantitative analysis of scientific evolution in society, and may provide a useful basis for policy-making.

In 1880, the probability density functions (PDFs) of types-I and -II almost overlap. As time elapses, the PDF for type-II shifts to higher frequency ranges, while that for type-I stays in almost the same frequency range.

Figure E. Average and median frequencies of 1-grams in types-I and -II over the years. Both the average and median frequencies for type-II increase over time, whereas those for type-I barely change.

Figure F. Probability that a scientific word turns out to be type-II as of 2008 if it first passed a particular level of frequency on the horizontal axis in a given range of the past years. The higher frequency a scientific word exceeds, the more likely it is of type-II.

The left panels are for the power-law fitness distribution (γ = 2.0, β = 1/4, N = 4,096, L = 10, pER = 0.1024, ρ = 0.2, α = 0.0001, fc = 0.00025) and the right panels are for the Gaussian fitness distribution (σ = 0.1, N = 4,096, L = 10, pER = 0.0064, ρ = 0.15, α = 0.0001, fc = 0.00025). Lifetime of type-III was treated as zero for this analysis. The upper two panels show the coefficients of variation (CVs) of lifetime and peak for each level of fitness, and the bottom panel shows the probability distributions of fitness for different types. The shaded areas indicate the ranges of fitness toward which the distributions of type-II and type-III are biased (type-II for high fitness and type-III for low fitness). In the left shaded areas, CVs of lifetime are ill-defined because all items in the areas belong to type-III with zero lifetime; the variability of lifetime in these areas can be viewed as effectively zero. For types-I and -II, plotted in logarithmic scale (left axis). For type-III, plotted in linear scale (right axis). The fractions of type-I and type-II tend to increase as fitness increases, but the slope is steeper for type-II. The fraction of type-III is very high in all ranges of fitness, but slightly increases as fitness decreases (in the right panel, open circles with dashed lines do not have a statistically-meaningful number of items).

Table B. List of scientific words predicted to be type-II, which first passed the frequency of 2.0×10^-6 in the years between 2000 and 2008 (the exact years are recorded in the second column). From Figure 1b, the chance of being type-II is estimated to be 97.1%. We assigned a category and a context of use to each word according to its common usage.

Google Books Ngram Corpus
We obtained the annual data of n-gram counts contained in the English section of the Google Books Ngram Corpus Version 2 which spans 8,116,746 books published over the last five centuries [1]. A 1-gram is a string of characters uninterrupted by a space, e.g., a word or number. An n-gram is a sequence of 1-grams, e.g., n=3 for a phrase with three words. We here focus on 1-grams for simplicity of analysis. A sample of the data is given below (from the data file "googlebooks-engall-1gram-20120701-w.csv"): The first row shows that in 2000, the word "work" occurred 7,285,922 times in 100,673 different books. The second and third rows show that in the same year, the word "work" occurred 1,848,009 times as a verb, and 5,377,995 times as a noun. Relative frequency of a 1-gram is defined as the number of occurrences of the 1-gram in a given year divided by the total number of 1-grams in that year.
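The definition of relative frequency above can be illustrated with a minimal sketch. The function name, the tuple layout, and the toy rows are illustrative assumptions, not part of the corpus tooling; in particular, the yearly total is approximated here by summing the rows supplied, whereas a real analysis would take the total number of 1-grams per year from the corpus itself.

```python
from collections import defaultdict

def relative_frequencies(rows):
    """rows: iterable of (ngram, year, count) tuples (hypothetical layout).

    Returns {(ngram, year): count / total count of all supplied 1-grams in that year}.
    """
    rows = list(rows)
    totals = defaultdict(int)
    for _, year, count in rows:
        totals[year] += count
    return {(gram, year): count / totals[year] for gram, year, count in rows}

# Toy data, invented for illustration only.
toy_rows = [("work", 2000, 3), ("atom", 2000, 1), ("work", 2001, 2)]
freqs = relative_frequencies(toy_rows)
```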

Preprocessing
In order to treat different inflectional forms (e.g., singular and plural) of the same word stem as equivalent in their essential meaning, we merged such forms systematically using the Porter Stemming Algorithm [2] when computing the 1-gram frequency. We also limited our analysis to the years between 1800 and 2008 because the amount of data before 1800 is not sufficient to obtain statistically meaningful results [3]. In addition, every 1-gram frequency in the years 1899 and 1905 was replaced by the average for that 1-gram over the adjacent years (±1 year), as the Google Books Ngram Corpus occasionally assigned 1899 or 1905 to books of unknown publication dates [3]. For any 1-gram that appeared in year t but not in t-1 or in t+1 to t+10, contrary to what is expected from the usual patterns (Figure A), we set its frequency in t to zero to avoid possible errors from the optical character recognition (OCR) process.
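The two corrections described above (averaging over 1899 and 1905, and zeroing isolated appearances) might be sketched as follows. The function name and the dictionary layout are assumptions, as are the order of the two steps and the decision to leave years within 10 years of the end of the dataset untouched; the source does not specify these details.

```python
def preprocess(freq, bad_years=(1899, 1905), lookahead=10, last_year=2008):
    """freq: {year: relative frequency} for one 1-gram; missing years count as 0."""
    out = dict(freq)
    # Replace unreliable years by the mean of the two adjacent years.
    for y in bad_years:
        out[y] = (freq.get(y - 1, 0.0) + freq.get(y + 1, 0.0)) / 2
    # Zero out appearances with nothing in t-1 and nothing in t+1..t+lookahead.
    base = dict(out)  # snapshot, so the result does not depend on iteration order
    for y, f in base.items():
        if f > 0 and y + lookahead <= last_year:
            others = [y - 1] + list(range(y + 1, y + lookahead + 1))
            if all(base.get(n, 0.0) == 0 for n in others):
                out[y] = 0.0
    return out

# Toy series, invented for illustration only.
cleaned = preprocess({1898: 2.0, 1899: 10.0, 1900: 4.0, 1950: 1.0, 2005: 1.0})
```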

Identification of scientific and technological words
How do we know whether a given 1-gram belongs to the vocabulary of science and technology? One simple way is to check whether it matches any word in a published science dictionary. We created a list of scientific and technological words (1-grams) from an online science dictionary "AccessScience" [4]. Because the list itself may be biased toward words in common use today, we added words from various other sources, including words used in the past (Table A): we extracted words from patent grant texts in the United States Patent and Trademark Office data provided by Google [5] and from article titles in a number of scientific journals. Among those words, only nouns were selected. A word was considered to be a noun if it was used as a noun in more than 90% of its total usage in the year 2000 (e.g., "work" in section 'Google Books Ngram Corpus' has 5,377,995 / 7,285,922 = 73.8% usage as a noun and thus does not qualify). We filtered out words with the year of birth < 1800 to make them consistent with section 'Preprocessing'. Then, we arranged the remaining words in descending order of usage within their respective sources. In most cases, the words of high usage within the sources were likely to be scientific and technological words. By manual inspection of randomly-sampled words (≥ 10% coverage for journals, ≥ 1% coverage for patents) along the descending order of usage level within each source, we selected all words at usage levels with at least an 80% chance of being scientific and technological words that are not used in too broad a context. If this cutoff covered all words occurring in that source, then we excluded words used only once in the source. In total, we obtained 7,855 scientific and technological words from the dictionary, patents, and journals.

Connection between word usage and events in society
One may ask how word usage is related to empirical events in society. We here present several examples in response to such questions. The original study of the Google Books Ngram Corpus [3] reported that a boost in the frequency of a word can reflect the increasing impact of the relevant event on society. For example, peaks in "influenza" correspond to the dates of known pandemics [3]. Additionally, various studies in sociolinguistics have paid attention to connections between, e.g., social structures and word usage [6]-[7], urbanized population and word usage [8], and events in society and coherent changes in word usage [9]. Those studies seem to support our assumption that the frequency of a scientific word is indicative of the actual impact of the scientific concept on society.

Determination of fc, FPT, lifetime, and peak
The cutoff frequency fc defines the threshold above which a 1-gram can be roughly considered to be common in society. A proper choice of fc is important as the quantification of first passage time and lifetime (see below) depends on it. We chose fc = 10^-7 since 1-grams with frequency > 10^-7 are easily found in published dictionaries [3]. In 2000, there were 79,691 word stems (corresponding to ~200,000 1-grams) with frequency > 10^-7 (Figure B). Our main results presented in this work, however, do not qualitatively change as long as 10^-8 ≤ fc ≤ 2×10^-7. For a given 1-gram, first passage time (FPT) is defined as the number of years from the birth of the 1-gram until its frequency first crosses fc, lifetime is defined as the number of years between the first and the last year in which the frequency was above fc, and peak is defined as the highest frequency of the 1-gram over time. Specifically, we assign lifetimes only to 1-grams whose frequency has stayed below fc for at least 10 years up to the year 2008, since they are rarely expected to bounce back (Figure A). Figure C illustrates the definitions of FPT, lifetime, and peak.
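The three quantities defined above can be computed from a single time series as in the following sketch. The function name is hypothetical, the year of birth is assumed to be the first year with non-zero frequency, and lifetime is taken as the plain difference between the last and first years above fc (whether the end year is counted inclusively is not stated in the text).

```python
def fpt_lifetime_peak(series, fc=1e-7):
    """series: list of (year, frequency) pairs sorted by year."""
    present = [y for y, f in series if f > 0]   # years in which the 1-gram occurs
    above = [y for y, f in series if f > fc]    # years with frequency above fc
    peak = max((f for _, f in series), default=0.0)
    if not above:
        return None, None, peak                 # never crossed fc
    fpt = above[0] - present[0]                 # years from birth to first crossing
    lifetime = above[-1] - above[0]             # span between first and last year above fc
    return fpt, lifetime, peak

# Toy series, invented for illustration only.
toy = [(1900, 1e-9), (1901, 5e-8), (1903, 2e-7),
       (1904, 3e-7), (1910, 1.5e-7), (1911, 1e-8)]
result = fpt_lifetime_peak(toy)
```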

Characterization of different 1-gram types
Most 1-grams could be classified into the following three types. Type-I has 1-grams with well-defined finite lifetimes (section 'Determination of fc, FPT, lifetime, and peak'). Type-II shows a lifetime to a distinctively long extent beyond the time frame, so the exact lifetime cannot presently be defined. Type-III, unlike types-I and -II, never had a frequency higher than fc. One may claim that the distinction between type-I and type-II was merely based on the limited period of observation allowed in our current dataset. Although the distinction was made in a rather heuristic way, we did observe a more fundamental difference between type-I and type-II. Figure D shows the probability density function (PDF) of the frequency for each type of 1-grams in a given year. While the PDFs for type-I and type-II initially overlap, the difference between them grows over time as the PDF of type-II shifts to higher frequency ranges. The growing difference can be quantified by tracking the average and median frequencies of each type over the years, as shown in Figure E. While the average and median frequencies of type-I stay almost steady, the same statistics of type-II keep increasing. The results indicate an intrinsic difference between types-I and -II, manifested in their frequency growth patterns.
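A rough sketch of the three-way classification is given below. It is only one possible reading of the heuristic: the function name is hypothetical, and the rule for type-I (frequency below fc for at least 10 years up to 2008) is taken from section 'Determination of fc, FPT, lifetime, and peak'. Since the text says that most, not all, 1-grams fit these types, the sketch should not be taken as the exact procedure.

```python
def classify(series, fc=1e-7, end_year=2008, quiet_years=10):
    """series: list of (year, frequency) pairs sorted by year."""
    above = [y for y, f in series if f > fc]
    if not above:
        return "III"                      # never exceeded fc
    if end_year - above[-1] >= quiet_years:
        return "I"                        # finite, well-defined lifetime
    return "II"                           # still above fc near the end of the data
```

For example, a word last seen above fc in 1900 would be labelled type-I, whereas one above fc in 2005 would be labelled type-II.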

Predictability
Type-II includes scientific words prevailing in society longer than the other types. Thus, by identifying type-II scientific words at a relatively early stage, we can predict which words will be promising in the future. As demonstrated in Figure E, the frequency of a type-I word tends to stay at a low level, while that of a type-II word continuously grows. This fact implies that if we identify the scientific words whose frequency exceeds a sufficiently high level, many of them will be type-II. Figure F indeed shows that the higher the level of frequency exceeded, the more likely the word belongs to type-II. It also shows that the probability of being type-II varies slightly across the years when the words passed a particular level of frequency. This raises the question of which years are appropriate to choose to estimate the precision of type-II identification. The period of the years should be long enough for a reliable statistical analysis and the years should be old enough for a clear distinction between type-I and type-II in 2008. We selected the period of years between 1800 and 1919, which leaves 89 years until the end year of our dataset, and this 89-year period is longer or comparable to the typical lifespan of a human being.
For the period 1800-1919, the relationship between the level of frequency exceeded and the probability of being type-II in 2008 is presented in Figure 1b. Accordingly, we made a list of scientific words predicted to be future type-II based on the level of frequency passed in the years between 2000 and 2008 (Tables B-E). All entries were classified into respective categories, and we filtered out the words used in too broad a context, not necessarily in a scientific context.
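The estimate behind this relationship is a conditional fraction: among the words that first passed a given frequency level within the chosen period, the share that ended up as type-II. A minimal sketch follows, with a hypothetical function name and record layout, and toy data invented for illustration.

```python
def prob_type2(words, level, years=(1800, 1919)):
    """words: list of {'series': [(year, freq), ...] sorted by year, 'type': 'I'|'II'|'III'}."""
    hits = total = 0
    for w in words:
        # Year in which the word first exceeded the frequency level, if any.
        first = next((y for y, f in w["series"] if f > level), None)
        if first is not None and years[0] <= first <= years[1]:
            total += 1
            hits += w["type"] == "II"
    return hits / total if total else float("nan")

toy_words = [
    {"series": [(1850, 1e-6), (1900, 3e-6)], "type": "II"},
    {"series": [(1860, 3e-6)], "type": "I"},
    {"series": [(1950, 3e-6)], "type": "II"},   # passed the level outside the period
    {"series": [(1850, 1e-7)], "type": "I"},    # never passed the level
]
p_toy = prob_type2(toy_words, level=2e-6)
```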

Significance test
To test the statistical significance of the relation between the level of frequency passed and the probability of being type-II, we performed a two-sided Z-test under the null hypothesis that there is no association between the frequency level and the probability of type-II, so that any correlation between them arises merely by chance. For this analysis, we calculated expected values and standard deviations from the null distributions. Among N scientific words (1-grams) in total, let q be the fraction of words over a certain frequency level and r be the fraction of type-II. The expected number of type-II over the frequency level is Nqr and the variance is Nqr(1-r). The central limit theorem ensured that this null distribution converged well to the Gaussian distribution, giving a Z-score as well as a P-value (Table F).
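The test statistic described above can be written in a few lines. This is a sketch under the stated null model (mean Nqr, variance Nqr(1-r), Gaussian approximation); the function name and the numbers in the example are illustrative assumptions, not values from the study.

```python
import math

def z_test(n_total, q, r, observed):
    """Two-sided Z-test for the observed number of type-II words above a frequency level."""
    expected = n_total * q * r
    sd = math.sqrt(n_total * q * r * (1 - r))
    z = (observed - expected) / sd
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided Gaussian tail probability
    return z, p

# Invented numbers: N = 1000, q = 0.1, r = 0.5, 60 type-II words observed.
z_toy, p_toy_z = z_test(1000, 0.1, 0.5, 60)
```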

Internet webpage volume
To test the validity of our type-II prediction results against an up-to-date independent dataset, we used the Google web search engine that showed the Internet webpage volumes updated annually between 2008 and 2013 for the words of our search queries (accessed in February and March 2014). Because Google provides search results using a stemming algorithm, we submitted the singular forms of the words instead of the word stems themselves. Because Google does not permit automatic search queries by web robots, we manually submitted (i) the type-II-predicted scientific words in Tables B-D, and (ii) their counterparts, randomly selected from the scientific words that first reached any frequency ≤ 2×10^-6 between 2000 and 2008. For the normalization in Figure 1c, we used 100 random words from (ii). For the control group against (i) in Figure 1d, we used 100 random words from (ii) not overlapping with (i). In Figure 1d, the comparison between the search queries for (i) and for the control group shows that the prediction results also work for the webpage volumes since 2008, although the prediction itself is based on the 1-gram data between 2000 and 2008.

Relations between FPT, lifetime, and peak
This section discusses the unique features of scientific words manifested in the relations between FPT, lifetime, and peak. For FPT and lifetime in this section, we use their rescaled values (section 'Rescaling of FPT and lifetime') unless specified.

Adjusted density plot
To find the correlation between two quantities, x and y in the linear scale, we first take a small window of size bx×by, place the lower left corner of the window at the starting (smallest) points of x and y (xmin, ymin), measure the density of data points inside the window, and assign the value to the lower left corner. We repeat the same procedure after shifting the position of the window by bx/kx along the x-axis or by/ky along the y-axis until the entire xy-plane is spanned (kx and ky are constants). If one axis (say x) is in the logarithmic scale, the density at each position is calculated in a similar way except that the window is shifted in the x-direction by multiplying the x-coordinate by a constant factor, and the window length along the x-axis increases by the same factor. Finally, we normalize every density at each x relative to the maximum across the y-axis. We call this density "adjusted density", which is suited for clarifying the dependence of y on x when plotted on the xy plane.
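The linear-scale procedure above can be sketched as follows. The function name and data layout are assumptions, windows are treated as half-open intervals, and the logarithmic-axis variant is omitted; the sketch is meant only to make the sliding-window and normalization steps concrete.

```python
def adjusted_density(points, bx, by, kx=2, ky=2):
    """points: list of (x, y) pairs. Returns {x0: {y0: adjusted density}}."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def corners(lo, hi, step):
        # Lower-left window coordinates from lo up to hi in increments of step.
        return [lo + i * step for i in range(int((hi - lo) / step) + 1)]

    grid = {}
    for x0 in corners(min(xs), max(xs), bx / kx):
        column = {}
        for y0 in corners(min(ys), max(ys), by / ky):
            n = sum(1 for x, y in points
                    if x0 <= x < x0 + bx and y0 <= y < y0 + by)
            column[y0] = n / (bx * by)          # density inside the window
        top = max(column.values())
        # Normalize each x-position by its maximum density across y.
        grid[x0] = {y0: (d / top if top > 0 else 0.0) for y0, d in column.items()}
    return grid

# Toy points, invented for illustration only.
toy_grid = adjusted_density([(0, 0), (0, 1), (1, 0), (1, 0), (1, 0)],
                            bx=1, by=1, kx=1, ky=1)
```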

FPT and lifetime
Figure I (same as Figure 2b and c) shows the density plot between FPT and lifetime, for scientific words (left) and an entire set of 1-grams (right) in type-I. For scientific words, FPT and lifetime are negatively correlated, with a transition at FPT ~ 1.2 giving rise to a sudden appearance of lifetime ~ 2.0 (Pearson's Chi-squared test, P = 4.3×10^-47). For an entire set of 1-grams, there is no such transition.

Peak and lifetime
Figure J shows the density plot between peak and lifetime for scientific words (left) and an entire set of 1-grams (right) in type-I. At small values of peak for scientific words, lifetimes are mostly short. As peak increases, a sudden leap from short to long lifetime is observed at peak ~ 5×10^-7. For an entire set of 1-grams, this transition is barely visible and occurs at a much larger peak (11.3 times larger) than for scientific words.
Figure K shows the density plot between FPT and peak for scientific words (left) and an entire set of 1-grams (right) in type-I. FPT and peak are negatively correlated.

Significance test
To test the statistical significance of the sudden leap to ~2.0 in lifetime at FPT ~ 1.2 for type-I scientific words, we constructed a 2×2 contingency table displaying the numbers of the words at FPT ≥ 1.2 and < 1.2, and lifetime ≥ 2.0 and < 2.0. Then, we computed Pearson's Chi-square value and a P-value based on a Chi-square distribution with 1 degree of freedom, with a null hypothesis that there is no association between FPT and lifetime.
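For a 2×2 table, Pearson's Chi-square statistic and its P-value with 1 degree of freedom have closed forms, sketched below. The function name and the cell counts in the example are illustrative assumptions, and no continuity correction is applied (the text does not say whether one was used).

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson's Chi-square test for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the Chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Invented counts, for illustration only.
chi2_toy, p_toy_chi2 = chi2_2x2(20, 10, 10, 20)
```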

Model description
To build a mechanistic model to account for our observation, we considered the three key factors in the spread of science and technology: preferential adoption, homophily, and fitness, as described in the main text. In this section, we explain further details of how the model accommodates these factors. The model consists of N agents where individual agents represent various forms of social units to invent and adopt items. The items are transmitted from agent to agent. We assume that the adopted ranges of such items are projected into the actual usage levels of the corresponding words in the 1-gram dataset [3].

Homophily
Each agent is assigned ε, which characterizes the level of involvement in specialized areas. In general, ε can be a vector with real-number components. For the simplicity of our model, here ε is a scalar binary number: ε = 1 if the agent belongs to the scientific community, otherwise, ε = 0. In other words, agents such as scientists, engineers, scientific journalists, research institutes, and scientific publishers can take ε = 1, and we call them simply 'scientists' in our model. Scientists occupy only a small fraction of the whole system, with a certain chance of being a scientist (equal to ρ) given to each agent at the beginning of the simulation. Once ε has been determined to be either ε = 1 or 0 for each agent, it never changes during the simulation. To consider the effect of homophily, we introduce a weight function for every pair of agents, w(|εi - εj|), which captures how influential agents i and j in the pair are to each other in the spread of innovation. w(|εi - εj|) should be a decreasing function of |εi - εj| and we chose the form w(|εi - εj|) = exp[-(εi - εj)^2].
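The assignment of ε and the homophily weight can be illustrated directly from the definitions above. The function names and the seeded random generator are illustrative assumptions.

```python
import math
import random

def assign_specialty(n_agents, rho, rng):
    """Each agent becomes a 'scientist' (eps = 1) with probability rho, else eps = 0."""
    return [1 if rng.random() < rho else 0 for _ in range(n_agents)]

def weight(eps_i, eps_j):
    """Homophily weight w(|eps_i - eps_j|) = exp[-(eps_i - eps_j)^2]."""
    return math.exp(-(eps_i - eps_j) ** 2)

eps = assign_specialty(4096, 0.2, random.Random(0))
```

With binary ε, the weight takes only two values: 1 for agents in the same community and exp(-1) ≈ 0.37 for agents in different communities.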

Preferential adoption and homophily
When agent i adopts another j's item q, preferential adoption and homophily work as the following function, p(q, i)×p(q, j). Here, Σr denotes the sum over all agents in the system and δ(q, r) = 1 if agent r holds item q, otherwise, δ(q, r) = 0. w(|εm - εr|) comes from section 'Homophily'. A square root appears in p(q, m) because it makes p(q, i)×p(q, j) linearly proportional to the population having item q in the case that ε's are identical for all agents.

Network for information spread
Adoption of new items takes place through direct information spread between agents. For the simulation results presented in this study, the global network topology of such information channels connecting different agents was set following the Erdős-Rényi model [10]. Specifically, we used a G(N,pER) model where each agent is randomly linked to another with probability pER [11]. To avoid generating isolated agents, we took pER > ln(N)/N.
We also considered another network model, the static model of scale-free networks [12], which is known to produce a fat-tailed, power-law degree distribution in contrast to the Erdős-Rényi model. For the degree exponent between 2.0 and 3.0 (other parameters set equal to those of Figure 3a-d), we found that our main results did not change much with the selection of this network topology.
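The Erdős-Rényi construction and the condition pER > ln(N)/N from this section can be sketched as below. The function names are hypothetical and the generator is a plain O(N^2) loop chosen for clarity rather than speed.

```python
import math
import random

def erdos_renyi(n, p, rng):
    """G(n, p): each pair of agents is linked independently with probability p."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def above_threshold(n, p):
    """Condition used in the text to avoid isolated agents: p > ln(n)/n."""
    return p > math.log(n) / n

small_net = erdos_renyi(200, 0.05, random.Random(1))
```

For N = 4,096, ln(N)/N is about 0.0020, so both values of pER quoted in the figure captions (0.0064 and 0.1024) satisfy the condition.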

Fitness
To each invented item, we assign fitness λ, which gives the intrinsic differences between items in their adoption rates.

Gaussian distribution
Provided that fitness λ is a sum of numerous uncorrelated properties of an item, one can assume that the fitness distribution follows the Gaussian distribution Λg(0.5, σ) ~ exp[-(λ-0.5)^2/(2σ^2)], whose domain is centred at 0.5 and bounded by 0 and 1. In this case, we consider the following function contributing to the probability that a new item qj with fitness replaces an old item qi with fitness in its adoption:

Power-law distribution
Alternatively, one can assume that the fitness distribution follows a fat-tailed distribution such as a power-law, Λp(γ, xmin) ~ (x/xmin)^(-γ), whose domain is bounded by 1 and 11. In this case, we consider the following function contributing to the probability that a new item qj with fitness replaces an old item qi with fitness in its adoption:

Update rule
In our model, every agent has L distinct items at every instant. At every time step, a new item is introduced by randomly-selected agent i with probability α, and is assigned the category simply by following agent i's specialty εi (section 'Homophily'). The new item randomly replaces one of the agent i's old items in the same category as the new one. If there is no such item in the same category, any old item of agent i is randomly chosen and replaced. The new item has fitness with the probability distribution mentioned in section 'Fitness'.
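Drawing the fitness of a new item from the two distributions in section 'Fitness' could be done as in the following sketch. The function names are hypothetical; the bounded Gaussian is sampled by rejection and the bounded power law by inverse-transform sampling, both of which are choices made here for illustration (the text does not say how the samples were generated). The power-law sampler assumes γ ≠ 1 and takes xmin = 1 as the lower bound.

```python
import random

def sample_gaussian_fitness(sigma, rng):
    """Gaussian centred at 0.5, restricted to [0, 1] by rejection."""
    while True:
        lam = rng.gauss(0.5, sigma)
        if 0.0 <= lam <= 1.0:
            return lam

def sample_powerlaw_fitness(gamma, rng, lo=1.0, hi=11.0):
    """Power law ~ x^(-gamma) on [lo, hi), via the inverse of its CDF (gamma != 1)."""
    u = rng.random()
    a, b = lo ** (1 - gamma), hi ** (1 - gamma)
    return (a + u * (b - a)) ** (1 / (1 - gamma))

rng = random.Random(0)
gauss_samples = [sample_gaussian_fitness(0.1, rng) for _ in range(1000)]
power_samples = [sample_powerlaw_fitness(2.0, rng) for _ in range(1000)]
```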
Next, we randomly select a pair of agents j and k in direct contact through pre-assigned information channels (section 'Network for information spread') and their items qj and qk belonging to the same category. If agents j and k have no items in the same category, any pair of their items is selected. Then, agent j adopts item qk by replacing item qj with the probability P(qj, qk, j, k), provided that agent j has never adopted item qk before. If P(qj, qk, j, k) is smaller than 0 (larger than 1), we consider it to be 0 (to be 1). After every N×L repetitions of the above steps, the frequencies of all items in the system are recorded. The frequency of an item is defined as the ratio of the item's copy number to the total count of items (= N×L) in the system. Here, we use such N×L repetitions of the steps as the arbitrary unit of time to measure the FPT and lifetime of items.

Initialization
After the system is set up with given parameters, we start with N agents having no items. We run the simulation as described in section 'Update rule', except that a transmitted (newly generated) item is appended to the receiving (producing) agent's item list if the list contains fewer than L distinct items. If the receiving (producing) agent already has L distinct items, then one of them is replaced with the transmitted (newly generated) item according to the rules in section 'Update rule'. The initialization process is complete once every agent has L distinct items, and the simulation time starts at that moment.

Ergodicity
In section 'Data analysis', all statistics for the 1-gram data were obtained from the long time series data. For the model analysis here, we use the ensemble results assembled from multiple simulations rather than the results from a single long simulation, to save simulation time. Simulations for each ensemble were performed under the same model parameters but can have different initial conditions and network connectivity due to the randomness in the initialization process. One may question the validity of using such ensemble results instead of results from a single long simulation. We claim that our model is ergodic enough so that both ensemble and long-time results give almost equivalent patterns. Two Erdős-Rényi networks with equal pER do not differ much statistically in their structural properties when the network size is large enough [11], so their dynamical properties would not differ much either. Moreover, most items cannot stay above the frequency fc in the system for longer than 50,000 steps, and within 100,000 steps all items of the system fall below fc and are effectively replaced by new ones, leaving little trace of the past. Therefore, a long simulation of our model would be nearly equivalent to an ensemble of different simulations.

Model results
The simulation of our model shows, whether for the scientific category or not, the existence of type-II-like items having distinctively longer lifetimes than the others (Figures L-N; see also section 'Distinct dynamics of type-I and type-II in their adoption'). They appear even if all agents and items are assigned the same ε and the same λ, respectively, indicating that preferential adoption is sufficient for the existence of type-II ( Figure N for the same λ case). However, homophily and fitness effects are also important to explain the observed patterns in scientific words, as discussed below.

Relation between FPT and lifetime
In Figures L-N, we show density plots between FPT and lifetime for different forms of fitness distributions, which supplement the results in Figure 3a. For all three different fitness distributions, we could identify the range of parameters in which (i) type-I and type-II items are clearly distinguishable and (ii) type-I scientific items exhibit the sudden transition of lifetime across FPT. If we do not consider preferential adoption, homophily, and fitness in our model, then the functional form of P(qj, qk, j, k) in section 'Update rule' reduces to P(qj, qk, j, k) = θ, where θ is an arbitrary constant. In this case, feature (i) is observed only in a very narrow range of θ, e.g., only at θ ~ 0.01 under the same conditions as Figure 3a-d. If we now consider preferential adoption, feature (i) appears easily without such parameter fine-tuning. However, preferential adoption alone is not enough for feature (ii), as feature (ii) does not appear if all agents have identical ε's. Therefore, for features (i) and (ii), preferential adoption and homophily are both important. It is noteworthy that features (i) and (ii) can be produced even with the Dirac delta distribution of the fitness. Nonetheless, fitness is also important in our model, as the negative correlation between FPT and lifetime in the regime of (rescaled) long lifetime ≥ 2.0 in Figure 2b after the transition, herein called feature (iii), cannot be reproduced under the Dirac delta distribution of fitness (Figure N). Therefore, the three fundamental components in the model (preferential adoption, homophily, and fitness) are important to explain the observed patterns in scientific evolution.

Significance test
To test the statistical significance of the sudden transition of lifetime across FPT for type-I scientific items, we conducted an analysis similar to section 'Significance test' in 'Data analysis': in Figure 3a, an abruptly long lifetime ~ 2,000 appears at FPT ~ 5,000. We constructed a 2×2 contingency table displaying the numbers of the items at FPT ≥ 5,000 and < 5,000, and lifetime ≥ 2,000 and < 2,000. Then, we computed Pearson's Chi-square value and a P-value based on a Chi-square distribution with 1 degree of freedom, with a null hypothesis that there is no association between FPT and lifetime (P = 5.4×10^-22).

Distinct dynamics of type-I and type-II in their adoption
The right panels of Figures L-N show clear gaps between short and long lifetimes of non-scientific items, giving a straightforward way to split type-I and type-II at lifetimes ~ 12,000, 8,400, and 8,000 for Figures L-N, respectively. We assume that these values of lifetime to split type-I and type-II are approximately equal for both non-scientific and scientific items. Based on this assumption, we separated type-I scientific items from type-II ones along the boundaries defined in the legends of Figures L-N (see also section 'Relation between FPT and lifetime'). One may question the validity of this classification scheme of type-I and type-II for scientific items, as the left bottom panels of Figures L-N show less clear gaps between type-I and type-II than the right bottom panels. Nonetheless, we were able to demonstrate that type-I and type-II scientific items are qualitatively different in their dynamics. Figure O shows the probability distributions of Δtf = tf´ − tf for type-I and type-II scientific items, where tf´ (tf) of each item denotes the last time that item's frequency outside (inside) the scientific community fell below fc. Δtf > 0 (< 0) indicates that the item has been in common use longer outside (inside) than inside (outside) the scientific community. In Figure O, we observe that type-I and type-II tend to occupy different regimes of Δtf: Δtf < 0 for type-I and Δtf > 0 for type-II. In other words, type-II scientific items tend to survive in the outer society even though they are no longer active within the scientific community. The adoption of type-I scientific items shows the opposite trend, largely driven by the internal dynamics of the scientific community itself. In conclusion, the simulation results demonstrate the fundamental difference between type-I and type-II in their dynamics during adoption, supporting the validity of our classification scheme for type-I and type-II.
For this analysis, we excluded the items whose frequency either outside or inside the scientific community never exceeded fc, because of their ill-defined tf´ and tf. These items are unlikely to be found near the boundaries between type-I and type-II, so excluding them does not compromise the examination of the difference between these two types.

Effect of fitness on lifetime and peak
In our model, the spread of an item depends on its fitness as well as social effects. The latter effects do not always favor the spread of a higher-fitness item because they may amplify random fluctuations in the item's spread and strengthen the spread of the item in the majority regardless of its fitness. In this section, we present simulation results showing how critical fitness is in determining the long-term fate of individual items: lifetime and peak. Figure P shows the averages of lifetime and peak steadily increasing with fitness, but also large variability around this average trend. In Figure 3c and Figure Q, we use the coefficient of variation (CV) as an indicator of the variability. CV is defined as the ratio of standard deviation to mean. Figure Q shows that the variability of lifetime and peak changes non-monotonically with fitness, reaching the maximum at the intermediate level of fitness. In other words, the long-term fate of scientific items is less variable at low and high fitness, and type-II and type-III indeed have distributions biased toward these fitness regimes (type-II for high fitness and type-III for low fitness; Figures Q-R).
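The coefficient of variation used here is straightforward to compute; a minimal sketch follows. The function name is hypothetical, the population standard deviation is assumed (the text does not say whether the sample or population form was used), and a zero mean returns NaN, which mirrors the ill-defined CV noted for fitness ranges where every lifetime is zero.

```python
import statistics

def coefficient_of_variation(values):
    """CV = standard deviation / mean; NaN when the mean is zero."""
    mean = statistics.fmean(values)
    if mean == 0:
        return float("nan")
    return statistics.pstdev(values) / mean

# Invented numbers, for illustration only.
cv_toy = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9])
```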

Late bloomers: effect of fitness on FPT
Common intuition suggests that FPT and fitness should be anti-correlated. On the contrary, the simulation results clearly show a positive correlation between them for types-I and -II scientific items (Figure S). These counter-intuitive results can be explained by the fact that high fitness helps the science survive long periods of frequency < fc, allowing for long FPT as well as short FPT (Figure T). In contrast, low-fitness science can hardly survive unless it initially spreads fast, so it either has short FPT or falls into type-III (Figure T). The existence of high-fitness, long-FPT science is reminiscent of the concept of 'late bloomers'.
The above findings raise the possibility that scientific words with very long but finite FPT in the Google Books Ngram Corpus dataset can be good candidates for late bloomers with high fitness. We listed in Table G such late bloomer candidates from type-II scientific words with rescaled FPT ≥ 2.0. For this, we manually excluded the words involving dating or OCR errors, and non-scientific use.

Significance test
To test the statistical significance of a positive correlation between FPT and fitness, we performed a two-sided Z-test under the null hypothesis that there is no association between FPT and the fraction of scientific items with high fitness >10.5 (Figure 3d; power-law fitness distribution with γ = 2.0, β = 1/4, N = 4,096, L = 10, pER = 0.1024, ρ = 0.2, α = 0.0001, fc = 0.00025). We calculated an expected value and a standard deviation from the null distribution. For types-I and -II scientific items, let q be the fraction of FPT > 10000 and r be the fraction of fitness > 10.5. If there are N items in total, the expected number of items with fitness > 10.5 among those with FPT > 10000 is Nqr and the variance is Nqr(1-r). The central limit theorem ensured that this null distribution converged well to the Gaussian distribution, giving a Z-score as well as a P-value.

Evolution of other fields
Our model predicts that other innovative fields such as food and art have similar features to science in FPT-lifetime relation, as demonstrated in Figure 4. However, one of the sources from which we collected food-related words [13] contained 43 (out of 236) type-I words overlapping with those analysed for scientific evolution. To avoid any possible artifact in Figure 4a

Significance test
To test the statistical significance of the sudden leap in lifetime across FPT in Figure 4 and Figure U, we conducted an analysis similar to section 'Significance test' in 'Relations between FPT, lifetime, and peak'.