Evolution of scaling behaviors embedded in sentence series from A Story of the Stone

The novel A Story of the Stone provides precise details of life and social structure in 18th-century China. Its writing lasted about 10 years, during which the author's habits may have changed significantly. It was published anonymously until the beginning of the 20th century, which left the authorship a mystery. In the present work we focus on the scaling behavior embedded in the sentence series of this novel, hoping to find how ideas are organized from single sentences to the whole text. In particular, we are interested in the evolution of scale invariance, both to monitor changes in the author's language habits and to find clues to the authorship. The sentence series is separated into a total of 69 non-overlapping segments of 500 sentences each. The correlation dependent balanced estimation of diffusion entropy (cBEDE) is employed to evaluate the scaling behaviors embedded in these short segments. It is found that the total series, the part currently attributed to Xueqin Cao (X-part), and the part attributed to E Gao (E-part) display scale invariance over a wide range of scales up to $10^3$ sentences, with almost identical scaling exponents. All the segments are scale invariant over considerably wide ranges of scale, most of which reach one third of the segment length. In the curve of scaling exponent versus segment number, the X-part has rich patterns with larger values on average, while the E-part has a U-shape with a markedly low bottom. This finding is a new clue supporting the attribution of the E-part to E Gao.


Introduction
Structural patterns in language can give clues on how our brains process information, how our society is organized and shared, and how the world is structured [1][2][3][4]. They also provide quantitative measures to distinguish the stylistic habits of different authors and to monitor the development of linguistic abilities. Quantitative investigation of human language can be traced back to Zipf's law [5], namely, the frequency of a word in a text generally obeys a power law of its rank when the words are ranked in descending order of frequency. This work has stimulated great efforts to discover structural patterns embedded in texts. For instance, all the distances between successive occurrences of a word form a distance series, where a distance refers to the number of words (characters). The distances follow a stretched exponential distribution [6,7], and the distance series is long-range correlated [8][9][10][11]. If we take each distinct word as a node and link every pair of nodes that occur successively, the procedure generates a typical small-world network [12][13][14] (a small average distance between the words and a high probability that a node's neighbors are also linked).
The findings mentioned above focus on the words in a text. In a very recent paper [15], the statistical properties of punctuation marks were investigated in detail. It was found that they also obey Zipf's law, and that adding them as words to the Zipfian analysis restores the power-law Zipfian behavior from a shifted power law. Besides its grammatical role, a punctuation mark also carries some semantic load, similar to that of articles, conjunctions, pronouns and prepositions. Hence, it is necessary to incorporate punctuation marks in a language analysis, which opens more space for new observations.
Here we pay special attention to the punctuation marks that indicate the end of a sentence. A sentence is the natural unit of a language. From several sentences of context, one can restrict a word to an exact meaning. Sentences are organized logically into meaning groups, paragraphs, chapters and the whole text to display our ideas with a concise structure. The sequential order of sentences is a cooperative result of logic, fluency, rhythm, harmony, intonation, the author's style, etc.
One of the important properties of a sentence series is its scaling behavior. Let us consider a sentence series denoted by
$$\xi = \{\xi_1, \xi_2, \ldots, \xi_N\}, \qquad (1)$$
from which one can extract all the possible segments with a predefined length $s$,
$$X_i(s) = \{\xi_i, \xi_{i+1}, \ldots, \xi_{i+s-1}\}, \quad i = 1, 2, \ldots, N-s+1. \qquad (2)$$
Here $\xi_n$ is the number of words (characters) contained in the $n$th sentence. Now we regard each segment as a trajectory of duration $s$ starting from the origin. The sentence series is thus mapped to an ensemble of $N-s+1$ realizations of a stochastic process, whose displacements read
$$x_i(s) = \sum_{j=i}^{i+s-1} \xi_j, \quad i = 1, 2, \ldots, N-s+1. \qquad (3)$$
The stochastic process can be described by the probability distribution function (PDF) of the displacements, $p(x, s)$. The sentence series is said to be scale invariant if the PDF satisfies the special form [16]
$$p(x, s) = \frac{1}{s^{\delta}} F\left(\frac{x}{s^{\delta}}\right), \qquad (4)$$
namely, its shape remains unchanged under the rescaling of displacements $x \to x / s^{\delta}$. Scale invariance implies that our ideas at different levels are structured with identical rules.
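The mapping from a sentence series to its ensemble of displacements can be sketched in a few lines. The snippet below is only an illustration: the function name `displacements` and the toy sentence lengths are assumptions made here, not part of the original work.

```python
from itertools import accumulate


def displacements(xi, s):
    """Return the N - s + 1 sliding-window sums of the series xi for window size s."""
    if not 1 <= s <= len(xi):
        raise ValueError("window size s must satisfy 1 <= s <= len(xi)")
    # Prefix sums make every window sum an O(1) difference.
    prefix = [0] + list(accumulate(xi))
    return [prefix[i + s] - prefix[i] for i in range(len(xi) - s + 1)]


# Toy sentence lengths (characters per sentence); illustrative values only.
toy_series = [12, 7, 30, 5, 18, 9]
print(displacements(toy_series, 3))  # -> [49, 42, 53, 32]
```

Each returned value is the end point of one trajectory of duration s, so the list is the sample from which the PDF of displacements at that scale is estimated.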
Drozdz et al. [17] found that more than one hundred classical novels from all over the world share a universal characteristic of almost perfect long-range correlation. For all the sentence series constructed from the novels, the autocorrelation coefficient decays according to a power law. Yang et al. [18] also found scale-invariant behavior in the sentence series from A Story of the Stone, the scaling exponent of which turns out to be similar to that of the New Testament in the Holy Bible [17].
A language evolves with time and environmental conditions, the characteristics of which may be stored in classics completed over long periods. For example, the novel A Story of the Stone is one of the top four masterpieces among Chinese novels [19], and is famous for its great value to researchers from diverse fields. Its writing lasted about 10 years, during which the social environment and the personal circumstances of the author(s) changed significantly, which may have led to changes in language habits. In the present paper, we separate the sentence series from A Story of the Stone into many segments as representatives of the habits in the corresponding periods, and evaluate the scaling behavior embedded in every segment, from which we hope to find the evolutionary behavior of the author's language habits.
Moreover, the novel was published anonymously until the beginning of the 20th century, which left some arguments on the authorship [19]. The current opinion is that the part from the 1st to the 80th chapter was written by Xueqin Cao, while the other part from the 81st to the 120th chapter was written by E Gao, though there are still some debates. By monitoring the evolutionary behavior of the habits, we also hope to find some clues to the authorship.
In the literature, there exist some standard tools to evaluate the scaling behaviors of time series, such as the wavelet transform modulus maxima (WTMM) [20][21][22] and the de-trended fluctuation analysis (DFA) [23][24][25][26][27][28][29][30]. But when one tries to find scaling behaviors embedded in the segments of a sentence series, one meets several essential problems [31][32][33]. First, these variance-based methods depend on the dynamical process. For a fractional Brownian motion, they can find the scale-invariant behavior, and the estimated scaling exponent $H$ equals the real one, namely, $H = \delta$. For a Levy walk, the power law of the variance versus the scale $s$ can be found, but its scaling exponent $H$ does not equal $\delta$. The relation between them reads $\delta = 1/(3 - 2H)$. For a Levy flight, one cannot detect the scale invariance even qualitatively. We have almost no knowledge, however, of the dynamical mechanism that produces the sentence series. Second, the tools are developed on the basis of probability theory, which requires the sentence series to have an infinite length, or at least to be long enough. But a segment of a sentence series has only a limited length (several hundred). This finite length may induce unreasonable errors or even mistakes in the statistical quantities. Third, a segment of a sentence series is generally non-stationary.
More than ten years ago, a concept called diffusion entropy analysis (DEA) [31-33] was proposed to overcome the dependence on the dynamical mechanism: instead of the variance, one calculates the Shannon entropy from the PDF of displacements at different scales. It is proved that the entropy has a linear relation with the natural logarithm of the scale $s$, the slope of which is the scaling exponent $\delta$. Detailed works demonstrate its power in evaluating scaling exponents for time series produced by different mechanisms [34][35][36][37][38][39][40][41][42][43][44][45][46][47][48][49][50]. In a very recent paper, Bonachela et al. [51] proposed an estimator of entropy that performs well when the size of a data set is very small (several tens). The key idea is to balance the bias and the fluctuation so that both errors are acceptable. Replacing the original definition of entropy in the DEA method with the balanced estimator, we proposed a new method called correlation dependent balanced estimation of diffusion entropy (cBEDE) [52][53][54][55]. Calculations with a large amount of theoretical and empirical data show that it gives a reliable estimation of the scaling behavior embedded in a very short time series with a length of $\sim 10^2$. To our knowledge, the cBEDE is at present the sole method that can obtain a reliable scaling behavior from very short time series with a length of several hundred.
Consequently, in the present paper we are interested in the evolution of scaling behaviors embedded in sentence series. A long sentence series is separated into non-overlapping segments with a predefined length. The scaling behavior of every segment is taken as representative of the language's characteristics within the corresponding period. Specifically, the classical novel A Story of the Stone is investigated to monitor the change of language habits over a period of about ten years. We also hope to find some new clues to the authorship problem. Each segment covers a total of 500 sentences. To obtain reliable scaling behaviors from so short a time series, the cBEDE is the natural and the sole choice of method.
It is found that the total series, the part attributed to Xueqin Cao and the other part attributed to E Gao display almost perfect scale invariance over a wide scale interval of $\sim 10^3$ sentences, with almost identical scaling exponents. All the segments of the sentence series are scale invariant over considerably wide ranges of scale, most of which reach 1/3 of the segment length. In the curve of scaling exponent versus segment identification number, the part attributed to E Gao is a U-shaped valley with a wide low bottom, which supports the current opinion on the authorship of this part. The part attributed to Xueqin Cao, however, contains rich patterns, and its scaling exponents are larger on average.

Data
A Story of the Stone is one of the four greatest Chinese classical vernacular novels [19]. Through a huge cast of characters and a wide psychological scope, it illustrates precisely the details of life and social structure in 18th-century China. It was completed over a long period of approximately 10 years, from 1749 to 1759. It was published anonymously until the beginning of the 20th century, which left a famous mystery to the present day, i.e., the authorship. The current opinion is that the part from the 1st to the 80th chapter was written by Xueqin Cao, and the remaining 40 chapters were written by E Gao, though there are still some debates.
The text can be downloaded freely (see the Data Availability Statement). From the original text we identify the positions of full stops ("."), question marks ("?"), exclamation marks ("!") and suspension points ("......") as the ends of sentences. The increments between successive positions of the identified stops form the sentence series, an element of which is the number of characters contained in the corresponding sentence. The sentence series for the whole text contains a total of 34,759 elements, of which the part attributed to Xueqin Cao (called the X-part) has 23,504 elements and the other part attributed to E Gao (called the E-part) has 11,255 elements.
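The extraction of a sentence series from raw text can be sketched as follows. This is a minimal illustration under stated assumptions: the terminator set (the Chinese full stop, the full-width question and exclamation marks, and the six-dot suspension mark written as two U+2026 characters), the name `sentence_series`, and the toy string are all choices made here, and the actual edition of the novel may encode these marks differently.

```python
import re

# Assumed sentence terminators; the alternation lists the two-character
# suspension mark first so that it is matched as a single stop.
STOP_PATTERN = re.compile(r"……|[。?!]")


def sentence_series(text):
    """Lengths in characters (terminator included) of successive sentences."""
    lengths = []
    previous_end = 0
    for match in STOP_PATTERN.finditer(text):
        # Increment between successive stop positions = sentence length.
        lengths.append(match.end() - previous_end)
        previous_end = match.end()
    # Trailing text without a terminator is ignored in this sketch.
    return lengths


toy_text = "你好。你是谁?啊……好!"
print(sentence_series(toy_text))  # -> [3, 4, 3, 2]
```

Whitespace and line breaks count as characters here, so a real text would likely need to be cleaned before the lengths are recorded.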

Correlation dependent balanced estimation of diffusion entropy analysis
If a sentence series is scale invariant, as defined in Eq (4), a straightforward computation leads to a linear relation between the Shannon entropy and the natural logarithm of the scale $s$ [33],
$$E(s) = -\int p(x, s) \ln p(x, s)\, dx = A + \delta \ln s. \qquad (5)$$
The slope of this relation equals $\delta$, which is independent of the specific dynamical mechanism the language obeys. The method is called diffusion entropy analysis (DEA) [31-33], because the probability $p(x, s)$ is obtained from a diffusion process. The relation in Eq (5) holds only for stationary sentence series, which is generally not the case in reality. Herein, the central moving average method [56][57][58][59] is used to extract the trend in the displacement series, $x(s) = \{x_1(s), x_2(s), \ldots, x_{N-s+1}(s)\}$. Let a window with a size of $s$ slide along this series. The trend for the element at the center of the window, i.e., the $[\frac{s+1}{2}]$th element in the window, is taken as the average of the elements covered by the window, where $[\cdot]$ denotes the integer part of a real number. Subtracting the trend series from the displacement series produces the de-trended displacement series with a length of $N - 2s + 1$ elements.
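A central moving average of this kind can be sketched as below. The function name and the boundary handling are assumptions of this illustration: a series of M values yields M - s + 1 de-trended values here, and the boundary convention behind the count quoted in the text may differ by one.

```python
def detrend_central_moving_average(x, s):
    """Subtract the window mean from the element at the window centre.

    The centre is the [(s + 1) / 2]-th element of each window of size s,
    with [.] the integer part (1-based, as in the text).
    """
    if not 1 <= s <= len(x):
        raise ValueError("window size s must satisfy 1 <= s <= len(x)")
    prefix = [0.0]
    for value in x:
        prefix.append(prefix[-1] + value)
    centre = (s + 1) // 2 - 1  # zero-based offset of the centre element
    detrended = []
    for start in range(len(x) - s + 1):
        trend = (prefix[start + s] - prefix[start]) / s
        detrended.append(x[start + centre] - trend)
    return detrended


# A purely linear series has no fluctuation around its local trend
# when the window size is odd.
print(detrend_central_moving_average(list(range(10)), 3))
```

For an odd window the centre is unambiguous; for an even window the integer part selects the element just left of the midpoint, which is one possible reading of the rule.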
From the de-trended displacement series, let us estimate the PDF. Selecting a certain fraction of the standard deviation of the sentence series $\xi$, $\frac{1}{\varepsilon}\mathrm{std}(\xi)$, as the size of a bin, one can separate the distribution interval of the de-trended displacement series into a total of $R(s)$ bins. Counting the number of elements of the de-trended series that fall in every bin, denoted by $n(k, s)$, $k = 1, 2, \ldots, R(s)$, the occurring probability in the bins can be approximated by
$$\hat{p}(k, s) = \frac{n(k, s)}{N - 2s + 1}, \quad k = 1, 2, \ldots, R(s). \qquad (6)$$
The Shannon entropy is then estimated simply using
$$\hat{E}_{sim}(s) = -\sum_{k=1}^{R(s)} \hat{p}(k, s) \ln \hat{p}(k, s). \qquad (7)$$
If $\hat{E}_{sim}$ obeys the relation in Eq (5), the sentence series is scale invariant. Though the estimation of $p(k, s)$, $k = 1, 2, \ldots, R(s)$, is unbiased, the estimation of the diffusion entropy $\hat{E}_{sim}$ is biased because it is a nonlinear function of the estimated probabilities. A rough estimation shows that it under-estimates the entropy, with a bias of $-\frac{R(s) - 1}{2(N - 2s + 1)} + O[R(s)]$ [60]. With the increase of $s$, the total number of elements $(N - 2s + 1)$ decreases, while the number of bins $R(s)$ increases according to $s^{\delta}$. Because a segment of a sentence series is generally very short (a length of several hundred), an increase of $s$ leads to an unreasonably large bias in the simple estimation of entropy. Hence, an estimator with acceptably high performance is required.
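The simple (plug-in) estimate and the leading bias term quoted above can be illustrated with a short sketch. It implements only the naive histogram estimator, not the cBEDE; the function names and the toy inputs are assumptions of this illustration.

```python
import math
from collections import Counter


def simple_diffusion_entropy(values, bin_size):
    """Plug-in Shannon entropy (natural log) of values histogrammed with bin_size."""
    if bin_size <= 0 or not values:
        raise ValueError("need a positive bin size and at least one value")
    # Only occupied bins are counted; empty bins contribute nothing to the sum.
    counts = Counter(math.floor(v / bin_size) for v in values)
    total = len(values)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def leading_bias(num_bins, num_samples):
    """Leading-order bias of the plug-in estimate: -(R - 1) / (2 * samples)."""
    return -(num_bins - 1) / (2 * num_samples)


# Two equally occupied bins give an entropy of ln 2.
print(simple_diffusion_entropy([0.1, 0.2, 1.1, 1.2], 1.0))
# With few samples and many bins the bias is no longer negligible.
print(leading_bias(num_bins=40, num_samples=200))
```

The second call shows why short segments are problematic: with 40 bins and 200 displacements the leading bias is already about -0.1, comparable to the differences in entropy that the fit of the scaling exponent relies on.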
Let us denote the true occurring probabilities in the bins by $p(k, s)$, $k = 1, 2, \ldots, R(s)$. For a total of $N - 2s + 1$ (de-trended) displacements, the expected numbers in the bins are $(N - 2s + 1)p(k, s)$, $k = 1, 2, \ldots, R(s)$. The occurring numbers $n(k, s)$, $k = 1, 2, \ldots, R(s)$, are a realization of the true PDF. The deviations of the realization from the expected occurring numbers come from uncorrelated statistical noise. A simple constraint is that the total number occurring in all the bins remains constant. Hence, a correction of the simple estimation in Eq (7) should minimize the bias and the statistical fluctuation simultaneously. By minimizing the statistical average of the sum of the bias and the standard deviation of the estimated diffusion entropy, a tedious computation leads to a new estimator, called correlation dependent balanced estimation of diffusion entropy (cBEDE) [55]. Detailed calculations show that it gives a reliable estimation of the scaling exponent from a very short time series with a length of several hundred (e.g., for a trajectory of fractional Brownian motion with a length of 300, the bias is less than 0.03 and the standard deviation is about 0.05). Accordingly, it is employed in this work to evaluate the scaling behaviors embedded in the segments of the sentence series.
To determine the size of a bin, we select different values of the bin-size parameter $\varepsilon$ (such as 2, 3 and 4), with the bin size being $\mathrm{std}(\xi)/\varepsilon$, and plot the curve of $\hat{E}_{cBEDE}$ versus $\ln s$ for every $\varepsilon$. One finds a special interval of $\varepsilon$ in which the curves collapse almost onto a single curve under vertical shifts. The estimated scaling exponent is then $\varepsilon$-independent and should be a correct estimation. In the calculations, $\varepsilon$ is set to 2.
In the scaling behavior, the scale ranges of, for instance, $e^0$-$e^1$, $e^1$-$e^2$, $e^2$-$e^3$ and $e^3$-$e^4$ should contribute equally to the estimation of the scaling exponent, though the covered range increases exponentially [18]. In the calculations, the window sizes are selected to be $s = [1.2^m]$, $m = 1, 2, \ldots$, with $s \le N/3$, where $[\cdot]$ is the integer part of a real number. The points in the curve of $\hat{E}_{cBEDE}$ versus $\ln s$ are then distributed at almost identical intervals. All the scale ranges contain almost the same number of points and consequently contribute equally to the least-squares regression.
If a sentence series is scale invariant, the distribution region of the displacements extends according to $\sim s^{\delta}$. Hence, if we select the bin size in the procedure of estimating the PDF to be a certain fraction of the standard deviation of the de-trended displacements at time $s$ (i.e., the size increases with $s$), the diffusion entropy should remain constant. Slight deviations from this constant come from the algorithm and noise, which have been corrected in the final results of $\hat{E}_{cBEDE}$ [55].
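The choice of logarithmically spaced window sizes and the least-squares fit of the slope can be sketched as follows. Dropping the repeated integers that arise at small m is an assumption of this illustration, as are the function names; the entropy values in the usage lines are synthetic and serve only to show that the fit returns the slope.

```python
import math


def log_spaced_scales(n, base=1.2):
    """Distinct window sizes s = [base**m], m = 1, 2, ..., with s <= n / 3."""
    scales = []
    m = 1
    while int(base ** m) <= n / 3:
        s = int(base ** m)
        if not scales or s > scales[-1]:  # skip repeated integer parts
            scales.append(s)
        m += 1
    return scales


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


scales = log_spaced_scales(500)
log_s = [math.log(s) for s in scales]
# Synthetic entropies lying exactly on a line with slope 0.65.
entropies = [1.0 + 0.65 * value for value in log_s]
slope, intercept = fit_line(log_s, entropies)
print(scales[:6], round(slope, 3))
```

In an actual analysis the synthetic entropies would be replaced by the estimated diffusion entropy at each scale, and the fit would be restricted to the interval over which the curve is linear.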

Power spectrum
The scaling behavior of the sentence series in Eq (1) can also be displayed simply with its power spectrum [61], $S(f)$. For simplicity, we relabel the series $\xi = \{\xi_1, \xi_2, \ldots, \xi_N\}$ as $\xi = \{\xi_0, \xi_1, \ldots, \xi_{N-1}\}$, the discrete Fourier transform (DFT) of which reads
$$\lambda(f_k) = \sum_{n=0}^{N-1} \xi_n e^{-2\pi i f_k n}, \qquad (9)$$
where $f_k = k/N$, $k = 0, 1, \ldots, N-1$, are the frequencies. The power spectrum is written as
$$S(f) \propto |\lambda(f)|^2. \qquad (10)$$
The existence of scale invariance implies a power law, $S(f) \sim f^{-\beta}$. The relation between $\beta$ and $\delta$ reads
$$\beta = 2\delta - 1. \qquad (11)$$
The power spectrum is sensitive to noise and trends in the sentence series. The series should also be long enough to obtain a reliable result. In the present paper we also calculate the power spectra for the total, X-part and E-part series, to show the clues of scaling behaviors. Welch's method [61] is used to reduce the impact of noise caused by imperfect and finite data.
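An averaged periodogram in the spirit of Welch's method can be sketched in pure Python. The source names the method but gives no parameters, so the segment length, the Hann window, the 50% overlap and the normalization below are all assumptions of this illustration; the direct DFT is quadratic in the segment length and is meant for small demonstrations only.

```python
import cmath
import math


def welch_power_spectrum(xi, segment_length=64, overlap=0.5):
    """Average Hann-windowed periodograms over overlapping segments of xi."""
    if segment_length < 2 or len(xi) < segment_length:
        raise ValueError("need 2 <= segment_length <= len(xi)")
    step = max(1, int(segment_length * (1 - overlap)))
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (segment_length - 1))
              for n in range(segment_length)]
    norm = sum(w * w for w in window)
    n_freq = segment_length // 2 + 1
    accumulated = [0.0] * n_freq
    count = 0
    for start in range(0, len(xi) - segment_length + 1, step):
        segment = xi[start:start + segment_length]
        mean = sum(segment) / segment_length
        tapered = [(v - mean) * w for v, w in zip(segment, window)]
        for k in range(n_freq):
            coefficient = sum(
                t * cmath.exp(-2j * math.pi * k * n / segment_length)
                for n, t in enumerate(tapered))
            accumulated[k] += abs(coefficient) ** 2 / norm
        count += 1
    frequencies = [k / segment_length for k in range(n_freq)]
    return frequencies, [a / count for a in accumulated]


# A sinusoid with frequency 0.125 cycles per sample should peak at that frequency.
signal = [math.sin(2 * math.pi * 0.125 * n) for n in range(256)]
frequencies, spectrum = welch_power_spectrum(signal)
print(frequencies[spectrum.index(max(spectrum))])
```

For a sentence series, the exponent of the power law would then be read from the slope of the logarithm of the spectrum against the logarithm of the frequency, excluding the zero-frequency point.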

Results
From the original text of A Story of the Stone, we identify the positions of the specific punctuation marks indicating the end of a sentence. The number of characters (including punctuation marks) between every pair of successive positions is the length of the corresponding sentence. Recording sequentially the length of every sentence in the text produces the sentence series. Fig 1(a) presents the total sentence series, in which the two parts written by Xueqin Cao (X-part) and E Gao (E-part) are separated by a vertical red dotted line. To monitor the evolutionary behavior, we separate the sentence series into a total of 69 successive non-overlapping segments, denoted by seg01, seg02, ..., seg69, respectively. Each segment has a length of 500 sentences. We illustrate the segments seg10 in Fig 1(b) and seg60 in Fig 1(c) as typical examples.
..., 1/8 and 1/10 of the corresponding series lengths, respectively. The scaling exponents for the total, X-part and E-part series are 0.65, 0.63 and 0.62, which are almost identical. Hence, the global behaviors cannot distinguish between the three series. We also calculate the power spectra of the three series (see Fig 2(b)). One can find that, roughly ...
To obtain the global behavior, we apply a smoothing procedure, namely, we let a window covering five scaling exponents slide along the curve and replace the scaling exponent at the center with the average of the covered scaling exponents. The resulting curve is regarded as the smoothing curve (see the red curve). Interestingly, when the segment number becomes larger than 47, i.e., from then on the segments belong to the E-part, the smoothing curve decreases rapidly to a small value of about 0.57 at the 51st segment, then oscillates around this small value up to the 65th segment, and finally increases quickly to about 0.70 at the last segment. In this way the smoothing curve shows a pronounced wide valley in the E-part.
As for the X-part, though there exist rich structural patterns, the scaling exponents are comparatively larger on average.
To make sure that the detected scaling behaviors depend on the sentence order, we also calculate the scaling exponents for the shuffled sentence series and the shuffled segments. The scaling exponents are almost identical ($\approx 0.5$), with confidence intervals within [−0.05, +0.05] (not shown).
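The segmentation and the five-point smoothing described above can be sketched as follows. Dropping an incomplete final segment and leaving the two points at each end of the curve unsmoothed are assumptions of this illustration, since the treatment of the boundaries is not specified; the function names are hypothetical.

```python
def split_into_segments(series, length=500):
    """Successive non-overlapping segments of equal length; a shorter tail is dropped."""
    count = len(series) // length
    return [series[i * length:(i + 1) * length] for i in range(count)]


def smooth(values, width=5):
    """Centered moving average; points without a full window are left unchanged."""
    half = width // 2
    smoothed = list(values)
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        smoothed[i] = sum(window) / len(window)
    return smoothed


# A series with the length of the whole novel yields 69 segments of 500 sentences.
placeholder_series = [0] * 34759
print(len(split_into_segments(placeholder_series)))  # -> 69

# Smoothing spreads an isolated spike over the covered window.
print(smooth([0, 0, 10, 0, 0]))  # -> [0, 0, 2.0, 0, 0]
```

Applied to the 69 estimated scaling exponents, `smooth` would produce the kind of curve in which the wide valley of the E-part becomes visible.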
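A shuffled surrogate of this kind keeps the distribution of sentence lengths but destroys their order. The helper below is a minimal sketch; its name, the seed argument and the toy values are assumptions of this illustration.

```python
import random


def shuffled_copy(series, seed=None):
    """Return a randomly permuted copy of series; the input is left untouched."""
    rng = random.Random(seed)
    surrogate = list(series)
    rng.shuffle(surrogate)
    return surrogate


toy_series = [12, 7, 30, 5, 18, 9, 21, 14]
surrogate = shuffled_copy(toy_series, seed=1)
# Same values, different order: any order-dependent correlation is removed.
print(sorted(surrogate) == sorted(toy_series))  # -> True
```

Running the same scaling analysis on several such surrogates, each with a different seed, gives the reference against which the exponents of the ordered series are compared.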

Summary and conclusion
The novel A Story of the Stone provides details of life and social structure in 18th-century China, and has consequently attracted attention from diverse research fields. In the present paper, we focus on the scaling behaviors embedded in the sentence series from this novel, from which we hope to find how the description is structured and constructed from the microscopic to the macroscopic level, i.e., from single sentences to paragraphs, chapters and finally the whole text. The novel was completed over a long period of about ten years, during which the author's language habits might have changed significantly. It was published anonymously until the beginning of the 20th century, which left a famous mystery about its authorship. Hence, we are especially interested in how the scaling behavior evolves with time, and in whether the scaling behavior can provide more clues for the debates on the authorship.
To obtain the evolutionary behavior, we separate the whole text into many non-overlapping segments and take the scaling behavior embedded in every segment as the representative of the behavior in the corresponding time interval. In the literature, there are several standard tools to evaluate the scaling behaviors embedded in time series, such as the WTMM, the DFA, and the DEA. These tools are designed on the basis of probability theory, which requires the time series to have an infinite length (or at least to be long enough). However, in our work we separate the sentence series into a total of 69 segments with a limited length of 500 sentences each. Very recently, we developed a new method called cBEDE to obtain reliable scaling behaviors embedded in very short time series. Calculations on a large number of fractional Brownian motions and empirical records from stock markets and physiological experiments demonstrate its high performance. For instance, for time series with a length of 300 the bias is in the interval [−0.03, +0.03] and the confidence interval is [−0.05, +0.05]. Accordingly, the cBEDE is employed to monitor the evolution of scaling behavior.
Figure caption (fragment): This smoothing procedure results in the smoothing curve (the red curve). When the segment number becomes larger than 47 (from then on the segments belong to the E-part), the smoothing curve has a U-shape with a wide bottom. In the X-part, the curve contains rich patterns, and the scaling exponents are comparatively larger.
The current opinion is that the part from the 1st to the 80th chapter was written by Xueqin Cao, and the other part from the 81st to the 120th chapter by E Gao, herein denoted by X-part and E-part, respectively. The total, X-part, and E-part series are scale invariant over a considerably wide interval of scale (up to more than $10^3$ sentences), but the scaling exponents are indistinguishable in value. Hence, the ideas are structured with an identical rule up to a scale of $\sim 10^3$ sentences, covering three to four chapters on average. However, this global scale invariance cannot give us any clue to the authorship.
All the segments of the sentence series display almost perfect scale invariance, the scale ranges of most of which reach 1/3 of the segment length. The values of the scaling exponent are distributed over a wide interval of [0.43, 0.75]. From the curve of scaling exponent versus segment number, one can find that the E-part has a U-shape with a wide low bottom. This finding gives a new clue for attributing the E-part to E Gao. However, though the curve for the X-part has comparatively larger values of scaling exponent on average, it has a complicated shape with rich patterns.
In summary, scale invariance exists up to a scale of $10^3$ sentences (about three chapters). The scaling behaviors of the X-part and the E-part are indistinguishable from that of the whole novel. However, the scaling behaviors of the segments of the sentence series display rich structures and a significant difference between the X-part and the E-part. Hence, structures at different scales, from sentences to chapters, can provide valuable information.
Recent years have witnessed significant progress in detecting structural patterns of time series. For instance, the unbiased estimator of probability moments [62] can evaluate multifractals in very short time series (with a length of $\sim 10^2$). By mapping a time series to a network, one can extract the structural patterns at different scales [63][64][65][66][67][68][69][70][71][72][73][74][75][76][77][78][79][80]. By using a graphlet as the representative of a local state, one can monitor the evolutionary behavior of a complex system [81,82]. We hope our work stimulates the incorporation of these methods to enrich the knowledge and deepen the understanding of the novel A Story of the Stone.