The authors have declared that no competing interests exist.
The performance of deep learning in natural language processing has been spectacular, but the reasons for this success remain unclear because of the inherent complexity of deep learning. This paper provides empirical evidence of both the effectiveness and a limitation of neural networks for language engineering. Specifically, we demonstrate that a neural language model based on long short-term memory (LSTM) effectively reproduces Zipf’s law and Heaps’ law, two representative statistical properties underlying natural language. We discuss the quality of this reproduction and how Zipf’s law and Heaps’ law emerge as training progresses. We also point out that the neural language model has a limitation in reproducing long-range correlation, another statistical property of natural language. This understanding could provide a direction for improving the architectures of neural networks.
Deep learning has performed spectacularly in various natural language processing tasks such as machine translation [
One approach to tackling this problem is mathematical analysis of the potential of neural networks [
We have found that two well acknowledged statistical laws of natural language—Zipf’s law [
We constructed a neural language model that learns from a corpus and generates a pseudo-text, and then investigated whether the model produced any statistical laws of language. The language model estimates the probability of the next element of the sequence,
In all experiments in this article, the model was trained to minimize the cross-entropy by using an Adam optimizer with the proposed hyper-parameters [
In the normal scheme of deep learning research, the model learns from all the samples of the training dataset once per epoch. In this work, however, we redefined the scheme so that the model learns from 1% of the training dataset in every epoch; that is, the model sees all the samples once every 100 epochs. We adopted this definition because Zipf’s law and Heaps’ law emerge so quickly that the corresponding behaviors are already clearly present after the model has learned from all the samples once. Although we discuss this topic in Section 3, we emphasize here that this redefinition, or some comparable approach, was necessary to observe the model’s development with respect to Zipf’s law and Heaps’ law.
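As a point of reference, this training scheme can be sketched as follows. This is a minimal sketch assuming a Keras-style character-level setup with one-hot inputs; the layer sizes, the two-layer stack, and the windowing stride are illustrative placeholders, not the exact configuration described in Section 2.

```python
# Minimal sketch of the modified training scheme: each "epoch" uses a fresh
# random 1% subsample of the training windows, so the model sees the whole
# dataset roughly once every 100 epochs.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

CONTEXT = 128  # character context length


def vectorize(text, char_to_idx, stride=3):
    """Cut the text into one-hot (context, next character) training pairs.
    The stride is an illustrative choice to keep memory usage manageable."""
    vocab_size = len(char_to_idx)
    starts = range(0, len(text) - CONTEXT, stride)
    X = np.zeros((len(starts), CONTEXT, vocab_size), dtype=np.float32)
    y = np.zeros((len(starts), vocab_size), dtype=np.float32)
    for i, s in enumerate(starts):
        for t, ch in enumerate(text[s:s + CONTEXT]):
            X[i, t, char_to_idx[ch]] = 1.0
        y[i, char_to_idx[text[s + CONTEXT]]] = 1.0
    return X, y


def build_model(vocab_size):
    model = keras.Sequential([
        keras.Input(shape=(CONTEXT, vocab_size)),
        layers.LSTM(256, return_sequences=True),
        layers.LSTM(256),
        layers.Dense(vocab_size, activation="softmax"),
    ])
    # Adam with its default (proposed) hyper-parameters; cross-entropy loss.
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model


def train(model, X, y, epochs=500, fraction=0.01, batch_size=128):
    n = len(X)
    for _ in range(epochs):
        # One "epoch" in the redefined sense: a random 1% of the samples.
        idx = np.random.choice(n, size=max(1, int(n * fraction)), replace=False)
        model.fit(X[idx], y[idx], batch_size=batch_size, epochs=1, verbose=0)
```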
Generation of a pseudo-text begins with a context of 128 consecutive characters taken from the original text. The character following this context is sampled according to the probability distribution output by the neural model. The context is then shifted ahead by one character to include the newly generated character. This procedure is repeated to produce a pseudo-text of 2 million characters unless otherwise noted. The following is an example of a generated pseudo-text:
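Concretely, this sampling loop can be sketched as follows, reusing CONTEXT and the hypothetical model, char_to_idx, and idx_to_char mappings from the training sketch above:

```python
def generate(model, text, char_to_idx, idx_to_char, length=2_000_000):
    """Generate a pseudo-text character by character."""
    vocab_size = len(char_to_idx)
    # Start from 128 consecutive characters that actually occur in the corpus.
    start = np.random.randint(0, len(text) - CONTEXT)
    context = text[start:start + CONTEXT]
    out = []
    for _ in range(length):
        x = np.zeros((1, CONTEXT, vocab_size), dtype=np.float32)
        for t, ch in enumerate(context):
            x[0, t, char_to_idx[ch]] = 1.0
        probs = model.predict(x, verbose=0)[0]
        probs = probs / probs.sum()  # guard against floating-point rounding
        # Sample the next character from the model's output distribution.
        next_idx = np.random.choice(vocab_size, p=probs)
        next_ch = idx_to_char[next_idx]
        out.append(next_ch)
        # Shift the context forward by one character to include the new one.
        context = context[1:] + next_ch
    return "".join(out)
```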
We chose a character-level language model because word-level models have a critical problem: by definition, they cannot introduce new words during generation unless special architectures are added. A word-level model typically replaces every word rarer than a certain frequency threshold with a single symbol “unk”. Such a model therefore has a hard limit on vocabulary size, which destroys the tail of the rank-frequency distribution, and Zipf’s law and Heaps’ law cannot be reproduced with it. There have been discussions and proposals regarding this “unk” problem [
Note that the English datasets, consisting of the Complete Works of Shakespeare and The Wall Street Journal (WSJ), were preprocessed according to [
Zipf’s law and Heaps’ law are two representative statistical properties of natural language. Zipf’s law states that, given word rank
Heaps’ law, another statistical law of natural language, describes the growth rate of the vocabulary size (the number of types) with respect to the text length (the number of tokens). Given vocabulary size
The upper-left graph in
All axes in this and subsequent figures in this paper are in logarithmic scale, and the plots were generated using logarithmic bins. The model learned from 4,121,423 characters in the Complete Works of Shakespeare, which was preprocessed as described in the main text. The colored plots in the first row show the rank-frequency distributions of 1-, 2-, 3-, 4-, and 5-grams. The plots in the second row show the vocabulary growth in red. The exponents
The graphs on the right side of
The potential of the stacked LSTM is still apparent even when we change the kind of text. Figs
The model learned from 4,780,916 characters. The length of the pseudo-text is 20 million characters. The rank-frequency distributions are shown for 1-, 2-, 3-, 4-, 5-, 8-, and 16-grams. The preprocessing procedure was the same as for the Complete Works of Shakespeare.
The text was processed at the byte level with word borders.
The observations made for the Complete Works of Shakespeare apply also to Figs
These results indicate that a neural language model can learn the statistical laws behind natural language, and that the stacked LSTM is especially capable of reproducing both patterns of
We also tested language models with different architectures.
The stacked LSTM acquires the behaviors of Zipf’s law and Heaps’ law as learning progresses. Naturally, it starts out at the level of a monkey typing at random.
As shown by
The left-hand graphs are in logarithmic scale for the x-axes and linear scale for the y-axes. The fitting lines for Zipf’s exponent
The middle and lower left graphs in
The right-hand side of
As training progresses, the stacked LSTM first learns short patterns (uni-grams and 2-grams) and then gradually acquires longer patterns (3- to 5-grams). It also learns vocabulary as training progresses, which lowers the exponent of Heaps’ law. There are no tipping points at which the neural nets drastically change their behavior, and the two power laws are both acquired at a fairly early stage of learning.
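For reference, the two quantities tracked in these figures, the n-gram rank-frequency distribution and the vocabulary growth, can be computed from a pseudo-text with a short script. This is a minimal sketch; whitespace tokenization and the sampling step are illustrative choices, not the exact procedure used to produce the figures.

```python
from collections import Counter


def ngrams(tokens, n):
    """All n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rank_frequency(tokens):
    """Zipf plot data: type frequencies sorted in descending order, with ranks."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(range(1, len(freqs) + 1)), freqs


def vocabulary_growth(tokens, step=1000):
    """Heaps plot data: number of distinct types after every `step` tokens."""
    seen, lengths, sizes = set(), [], []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            lengths.append(i)
            sizes.append(len(seen))
    return lengths, sizes


# Usage (both curves are plotted on log-log axes):
#   words = pseudo_text.split()
#   ranks, freqs = rank_frequency(ngrams(words, 2))   # e.g., 2-grams
#   lengths, sizes = vocabulary_growth(words)
```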
Natural language has structural features other than
Long-range correlation is the property that two subsequences within a sequence remain similar even when separated by a long distance. Typically, such sequences exhibit a power-law relationship between the distance and the similarity. This statistical property is observed for various sequences in complex systems. Various studies [
Measurement of long-range correlation is not a simple problem, as we will see, and various methods have been proposed. [
[
The red and blue points represent the mutual information of the datasets and the generated pseudo-texts, respectively. Following [
We doubt, however, that the power-law decay of the mutual information is properly observed for a natural language text when measured with the method proposed in [
There are two reasons for this difference in results between the Wikipedia source and the Shakespeare and WSJ datasets: the kind of data, and the quantification method. Regarding the kind of data, we must emphasize that [
As for the problem of the quantification method, as seen from the plateau appearing in the results for the Shakespeare and WSJ datasets, the mutual information in its basic form is highly susceptible to the low-frequency problem. Therefore, [
Still, [
Quantification of long-range correlation has been studied in the statistical physics domain and has been effective in analyzing extreme events in natural phenomena and financial markets [
The application of long-range correlation to natural language is controversial, because all proposed methods are designed for numerical data, whereas natural language is a sequence of symbols of a different nature. Various reports show how natural language is indeed long-range correlated [
This method is based on the autocorrelation function applied to a sequence
The upper and lower graphs are in log-log and semi-log scale, respectively. The fitting line for The Wall Street Journal was estimated from the data points where
In summary, this analysis provides qualitative evidence regarding a shortcoming of the stacked LSTM: it has a limitation with respect to reproducing long-range correlation, as quantified using a method proposed in the statistical physics domain.
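For completeness, the autocorrelation-based quantification can be sketched as follows. The autocorrelation function itself is standard; the mapping from text to a numeric sequence shown here (intervals between occurrences of rare words) is one option from the long-range correlation literature and is an assumption of this sketch, not necessarily the exact transformation used in this paper.

```python
import numpy as np


def autocorrelation(x, lags):
    """Autocorrelation c(s) of a numeric sequence x for each lag s in `lags`."""
    x = np.asarray(x, dtype=float)
    mu, var = x.mean(), x.var()
    return [float(np.mean((x[:-s] - mu) * (x[s:] - mu)) / var) for s in lags]


def intervals_between(tokens, rare_words):
    """One possible text-to-numbers mapping (an assumption here): the sequence
    of distances between successive occurrences of words in `rare_words`."""
    positions = [i for i, w in enumerate(tokens) if w in rare_words]
    return np.diff(positions)


# Long-range correlation appears as a power-law decay of c(s) with lag s,
# i.e., an approximately straight line on a log-log plot; an exponential
# decay (straight on a semi-log plot) indicates short-range correlation.
```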
To further clarify the stacked LSTM’s performance and limitation, we conducted three experiments from different perspectives. First,
Second,
Third,
Because long-range correlation is a global, scale-free property of a text, one reason for the limitation of the stacked LSTM could lie in the context length of
One possible future approach is to test new neural models with enhanced long-memory features, such as a CNN application [
To understand the effectiveness and limitations of deep learning for natural language processing, we empirically analyzed the capacity of neural language models in terms of the statistical laws of natural language. This paper considered three statistical laws of natural language: Zipf’s law, the power law underlying the rank-frequency distribution; Heaps’ law, the power-law increase in vocabulary size with respect to text size; and long-range correlation, which captures the self-similarity underlying natural language sequences.
The analysis revealed that neural language models satisfy Zipf’s law, not only for uni-grams, but also for longer
Finally, a stacked LSTM showed a limitation with respect to capturing the long-range correlation of natural language. Investigation of a previous work [
Our analysis suggests a direction for improving language models, a problem that has always been central to handling natural language on machines. The current neural language models are unable to handle the global structures underlying texts. Because the Zipf’s law behavior with long
Our future work thus includes exploring conditions to reproduce the long-range correlation in text with language models, including both stochastic and neural language models.
Each pair of graphs consists of the rank-frequency distribution (upper graph) and the vocabulary growth (lower graph). The models had the following specifications. CNN: 8 layers of one-dimensional convolution with 256 filters of width 7 without padding and global max pooling after the last convolutional layer; the activation function was the rectified linear unit, and batch normalization was applied before activation in every convolutional layer. Simple-RNN: 1 layer of RNN with 512 units and an output softmax layer. Single-layer LSTM: 1 layer of LSTM with 512 units and an output softmax layer. Stacked-LSTM: as described in Section 2.
(EPS)
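For concreteness, the first three architectures in the caption above can be sketched in Keras roughly as follows. The one-hot character input, the context length, and the output softmax head are assumptions of this sketch; the stacked LSTM is omitted because it follows Section 2.

```python
from tensorflow import keras
from tensorflow.keras import layers


def cnn_model(vocab_size, context=128):
    # 8 one-dimensional convolutions, 256 filters of width 7, no padding;
    # batch normalization before the ReLU activation in every layer, and
    # global max pooling after the last convolutional layer.
    x = inputs = keras.Input(shape=(context, vocab_size))
    for _ in range(8):
        x = layers.Conv1D(256, 7, padding="valid")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    outputs = layers.Dense(vocab_size, activation="softmax")(x)
    return keras.Model(inputs, outputs)


def simple_rnn_model(vocab_size, context=128):
    # 1 layer of simple RNN with 512 units and an output softmax layer.
    return keras.Sequential([
        keras.Input(shape=(context, vocab_size)),
        layers.SimpleRNN(512),
        layers.Dense(vocab_size, activation="softmax"),
    ])


def single_lstm_model(vocab_size, context=128):
    # 1 layer of LSTM with 512 units and an output softmax layer.
    return keras.Sequential([
        keras.Input(shape=(context, vocab_size)),
        layers.LSTM(512),
        layers.Dense(vocab_size, activation="softmax"),
    ])
```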
Let
(EPS)
Each pair of graphs represents the mutual information (upper) and the autocorrelation function (lower). The results were obtained with a CNN (upper left), simple RNN (upper right), single-layer LSTM (lower left), and stacked LSTM (lower right, the same graphs from
(EPS)
Because of the API’s requirements, the original text was split into chunks of 5,000 characters to obtain the translated text. Despite the results given in Section 4.2, the translated text exhibits long-range correlation as measured by the autocorrelation function. This result does not contradict our observation in Section 4.2, because translation does not radically change the order of words and the translation system has the capacity to output rare words.
(EPS)
We thank JST PRESTO and RISTEX HITE for financial support. Moreover, we thank Ryosuke Takahira of the Tanaka-Ishii Group for his help in creating