G2Basy: A framework to improve the RNN language model and ease overfitting problem

Recurrent neural networks are efficient ways of training language models, and various RNN networks have been proposed to improve performance. However, with the increase of network scales, the overfitting problem becomes more urgent. In this paper, we propose a framework—G2Basy—to speed up the training process and ease the overfitting problem. Instead of using predefined hyperparameters, we devise a gradient increasing and decreasing technique that changes the parameters training batch size and input dropout simultaneously by a user-defined step size. Together with a pretrained word embedding initialization procedure and the introduction of different optimizers at different learning rates, our framework speeds up the training process dramatically and improves performance compared with a benchmark model of the same scale. For the word embedding initialization, we propose the concept of “artificial features” to describe the characteristics of the obtained word embeddings. We experiment on two of the most often used corpora—the Penn Treebank and WikiText-2 datasets—and both outperform the benchmark results and show potential towards further improvement. Furthermore, our framework shows better results with the larger and more complicated WikiText-2 corpus than with the Penn Treebank. Compared with other state-of-the-art results, we achieve comparable results with network scales hundreds of times smaller and within fewer training epochs.


Introduction
Natural language processing (NLP) is the area of artificial intelligence that concerns the automatic generation and understanding of human languages [1]. Language models are an essential part of NLP that can predict upcoming words based on a given context [2]. They can be used in a wide range of applications such as question answering, sentiment analysis, machine translation, and speech recognition. Language models generated by neural networks were first proposed by Xu and Rudnicky [3] in 2000 and were later studied further by Bengio and Ducharme [4]. Their work inspired more sophisticated models that show strong performance on multiple NLP applications [5][6][7][8]. Since then, some of the research objectives of language modeling have drifted from training models to acquiring word embeddings. After the introduction of RNNs in [9], distributed word representations have shown superior performance over traditional NLP routines and have thus become more widely accepted in both academia and industry. Deep learning brings new representations, architectures, and techniques to NLP research. However, by introducing deep neural networks, we also bring their intrinsic defects to those NLP tasks, such as uninterpretable results, overfitting, and difficulty of training. Due to the large size of the training corpus and the number of network parameters, training word embeddings can take weeks or more. Therefore, it makes sense to reduce the training time by initializing new networks with pretrained word embeddings [6], and this use of pretrained initialization can be applied recursively according to [10]. Here, we initialize the input word embeddings with the pretrained GloVe vectors [11] and design our framework based on them. Unlike its preceding models, GloVe takes into consideration the overall co-occurrence statistics of the whole training corpus. Therefore, among its obtained word vectors, the Euclidean distance reflects their semantic similarities.
This prompted us to derive the idea of initializing an RNN with randomly selected pretrained word vectors. Since the space comprising pretrained word vectors is only a small subset of the whole multidimensional space it lies in, we presume that a model initialized with randomly selected pretrained word embeddings can outperform one initialized with random data generated from a certain distribution. After all, the word vectors are obtained from hundreds of millions of training steps, making them more reliable.
To alleviate the overfitting problem and enhance the generalization ability of language models, mechanisms like tied weights [12], dropout [13], and a wide variety of optimization algorithms, such as Momentum [14], Adadelta [15], and Adam [16], have been proposed. However, these techniques do not work well on RNNs, especially on LSTM networks [17], which are designed to solve long time lag tasks. Still, multiple approaches have been used to apply dropout, one of the most successful regularization techniques, to RNNs [16][17][18]. Zaremba et al. [19] apply dropout only to the nonrecurrent connections. Gal et al. [20] randomly drop inputs, outputs, and recurrent connections at each time step. Song et al. [21] conduct dropout with a regular and online generated pattern to eliminate unnecessary computation and data access. In our experiments, we apply dropout only to the network input and tie the input and output vectors, as done in [12]. Then, we treat the dropout hyperparameter as a variable and change its value during the training process.
In this paper, we propose an LSTM-based framework. It uses the pretrained GloVe word embeddings to initialize its input vectors and changes optimization algorithms during training. Together with the hyperparameter updating strategies we propose, our framework offers quicker convergence in much shorter training time compared with the benchmark. We call our framework G2Basy: the Gradient Batch Size and Hybrid optimizers framework with pretrained GloVe vectors. Experiments conducted on the PTB [22] and WikiText-2 [23] corpora show that our framework outperforms the benchmark model. Furthermore, compared with other state-of-the-art regularized multilayer RNN models with much larger scales, our framework still achieves close results.

Materials and methods
As mentioned above, we use two public corpora, PTB and WikiText-2, to train, evaluate, and test our G2Basy framework. The framework is covered in three aspects, i.e., word vector initialization, hyperparameter update strategy, and other auxiliary techniques. In this section, we focus on some hyperparameters that are usually overlooked. When estimating performance, reaching equivalent results within less training time is also one of our goals, besides a smaller test perplexity.

Experiments datasets
We conduct word-level prediction experiments on the PTB dataset, preprocessed by [22], which consists of 929k training words, 73k validation words, and 82k test words. After testing all of our training strategies on the PTB corpus, we apply them to the larger and more sophisticated WikiText-2 corpus to further test the generalization ability of our framework. The statistics for the corpora are listed in Table 1 below.

Initialization
From the informatics point of view, words can be treated as encoders of the entities or meanings to which they refer. Therefore, in addition to morphology information such as word form, tense, and affix [24][25][26], other statistical information such as the direction of word embeddings and training sentence length can be used as indicators for further analysis. If we assume that there exists one certain vector for each specific word, then, compared to the whole multidimensional space, the pretrained word embeddings belong to a much more compact subspace. Before further discussion, we define "unseen words" as those words in the corpus that do not have a counterpart in the vocabulary of the GloVe pretrained embeddings and refer to them as such from now on. We then assume that, when using pretrained word embeddings, it may improve the results if we initialize the unseen words with randomly selected pretrained word vectors (or vectors pointing in the same direction). We also tried to push this further and initialize all the words with random pretrained word vectors.
When we treat the unseen words separately, there are two ways to initialize them, i.e., with data generated from a uniform distribution U(-0.1,0.1] (abbreviated as "Uniform" later) or other distributions, and with vectors randomly selected from GloVe. From now on, we use R(GloVe) to represent initializing words with random GloVe vectors, R(GloVe/2) to represent initializing with random GloVe vectors divided by scalar 2, and R(GloVe/4) to represent initializing with random GloVe vectors divided by scalar 4. A full initialization process is, therefore, written as GloVe+R(GloVe/2), and its two parts represent the initialization of the words that have counterparts in the pretrained embeddings and of the unseen words, respectively.
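The initialization schemes above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `init_embeddings`, its arguments, and the toy `glove` dictionary are hypothetical, and `random.uniform` only approximates the half-open interval U(-0.1,0.1].

```python
import random


def init_embeddings(vocab, glove, unseen="r_glove", scale=2.0, seed=0):
    """Build an initial embedding table (hypothetical helper).

    vocab  : list of corpus words
    glove  : dict mapping word -> pretrained vector (list of floats)
    unseen : "r_glove" -> random GloVe vector divided by `scale`
             "uniform" -> values drawn from roughly U(-0.1, 0.1)
    """
    rng = random.Random(seed)
    pool = list(glove.values())
    dim = len(pool[0])
    table = {}
    for word in vocab:
        if word in glove:
            # Word has a pretrained counterpart: copy its GloVe vector.
            table[word] = list(glove[word])
        elif unseen == "r_glove":
            # Unseen word: borrow a randomly selected GloVe vector, scaled down.
            donor = rng.choice(pool)
            table[word] = [v / scale for v in donor]
        else:
            # Unseen word: plain uniform initialization.
            table[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table


toy_glove = {"the": [0.2, -0.4], "cat": [1.0, 0.6]}
emb = init_embeddings(["the", "cat", "zyx"], toy_glove, unseen="r_glove", scale=2.0)
```

In this sketch, `scale=2.0` plays the role of GloVe+R(GloVe/2), `scale=4.0` that of GloVe+R(GloVe/4), and `unseen="uniform"` that of GloVe+Uniform.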

Hyperparameter updates
If we keep the model structure unchanged when training neural networks, then optimization algorithms become one of the key factors to acquiring better performance. Many optimization algorithms help adjust gradients better and faster, but SGD is still dominantly used [27,28] and remains one of the most robust. Therefore, we choose the mini-batch version of basic SGD with simulated annealing as the benchmark to evaluate the performance of our framework.
Among all the possible hyperparameters, we focus on two sets. The first is the selection of optimization algorithms, and the second is the training batch size and the dropout of the network input. De, Goyal, and Smith et al. [29][30][31] have shown that changing the value of the training batch size improves model performance. Balles [32] proposed a strategy that couples the training batch size and learning rate, which yields faster optimization convergence and simultaneously simplifies learning rate tuning. In this paper, we tune the training batch size and the dropout value simultaneously in an adaptive-like way, since the learning rate is already taken care of by the simulated annealing process in our G2Basy framework. The benchmark training parameter settings are listed in Table 2 below. We referred to other studies [20,23,33,34] that use the same dataset when setting those parameters and then conducted a grid search to fine-tune and verify them. As for the parameters "Layers" and "Word vector Dimension", we choose the values in Table 2 because they suit our computational resources; any increase would make the training time unbearable. For random initialization, we prefer the uniform distribution for two reasons. Theoretically, it makes no a priori assumptions about the data distribution. And in our experiments, it shows better performance and robustness than other distributions such as the Gaussian.
Optimizer algorithms. Almost every optimizer has a default learning rate setting at which it usually performs best. If their mechanisms of gradient updates do not counteract each other, then the use of multiple optimization algorithms can improve training performance, similar to feeding the output of one network as the input to another.
Adadelta accumulates gradients over a fixed window instead of over all past gradients, and it also corrects units with a Hessian approximation. Although it is designed so that the learning rate does not need to be set manually, when it is initialized with network parameters pretrained by SGD, better results can be obtained within fewer training epochs and in a shorter time. It works even better when pretrained word embeddings are used. The pseudocode of Adadelta is depicted in Table 3 below. For more detailed information, refer to [15].
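For orientation, a single Adadelta update can be sketched in a few lines of plain Python. This follows the commonly stated form of the rule (decaying averages of squared gradients and squared updates); the function name `adadelta_step` and the values `rho=0.95` and `eps=1e-6` are illustrative assumptions and need not match the settings used in our experiments.

```python
import math


def adadelta_step(x, grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One Adadelta update on lists of floats.

    Returns the updated parameters and the two running averages.
    """
    new_x, new_g2, new_d2 = [], [], []
    for xi, gi, g2, d2 in zip(x, grad, avg_sq_grad, avg_sq_delta):
        # Decaying average of squared gradients (the "fixed window").
        g2 = rho * g2 + (1.0 - rho) * gi * gi
        # Step scaled by RMS of past updates over RMS of gradients (unit correction).
        delta = -math.sqrt(d2 + eps) / math.sqrt(g2 + eps) * gi
        # Decaying average of squared updates.
        d2 = rho * d2 + (1.0 - rho) * delta * delta
        new_x.append(xi + delta)
        new_g2.append(g2)
        new_d2.append(d2)
    return new_x, new_g2, new_d2


# Toy usage: a few steps on f(x) = x^2, whose gradient is 2x.
x, g2, d2 = [1.0], [0.0], [0.0]
for _ in range(10):
    x, g2, d2 = adadelta_step(x, [2.0 * x[0]], g2, d2)
```

Note that no learning rate appears in the update, which is why the optimizer can take over from SGD without a manually chosen step size.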
One typical way of finding a more appropriate learning rate is simulated annealing, which usually begins with a large initial learning rate that decays during training. Furthermore, because of the intrinsic differences in parameter updating among optimization algorithms, introducing another new optimizer does not necessarily work better. Besides SGD, the other optimizer we choose is averaged SGD, abbreviated as ASGD [35]. It is a recursive algorithm of stochastic approximation type with the averaging of trajectories. ASGD improves the previous stochastic approximation methods so that it does not require a large amount of a priori information. It is based on the following paradoxical idea: a slow algorithm having less than optimal convergence rate must be averaged [36,37]. For stochastic optimization, consider the problem of searching for the minimum $x^*$ of the smooth function $\ell(x)$, $x \in \mathbb{R}^N$. The values of the gradient $y_t = \nabla\ell(x_{t-1} + \xi_t)$ containing random noise $\xi_t$ are available at an arbitrary point $x_{t-1}$ of $\mathbb{R}^N$. To solve this problem, a recursive algorithm of averaging is proposed in [35] under the following conditions:
Assumption 1. $\ell(x)$ is a twice continuously differentiable function and $lI \le \nabla^2 \ell(x) \le LI$ for all $x$ and some $l > 0$ and $L > 0$, where $I$ is the identity matrix.
Assumption 2. $(\xi_t)_{t \ge 1}$ is a sequence of mutually independent and identically distributed random variables with $E\xi_t = 0$.
Assumption 3 and four more conditions, which concern the relationships among $\gamma_t$, $\ell(x^*)$, and $\varphi(x)$, must also be satisfied for the algorithm to work properly. Refer to [35] for more detailed information and the derivation procedure.
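As a rough illustration of trajectory averaging, the sketch below runs plain SGD and keeps a running mean of the iterates. It is a simplification under stated assumptions: the step size is constant, the optional noise is added to the gradient value, and the names `averaged_sgd`, `grad`, and `noise` are hypothetical. The exact recursion and its conditions are those of [35].

```python
def averaged_sgd(grad, x0, lr, steps, noise=None):
    """Run SGD on a scalar parameter and average the iterates.

    grad  : function returning the gradient at x
    noise : optional function t -> additive gradient noise (assumption of this sketch)
    Returns (last iterate, running average of iterates x_1..x_steps).
    """
    x = x0
    avg = 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        if noise is not None:
            g += noise(t)
        x = x - lr * g          # ordinary SGD step
        avg += (x - avg) / t    # incremental mean of the trajectory
    return x, avg


# Toy usage: minimize 0.5 * (x - 3)^2, whose gradient is x - 3.
last, mean = averaged_sgd(lambda x: x - 3.0, x0=0.0, lr=0.5, steps=3)
```

With noisy gradients the averaged iterate fluctuates less than the last iterate, which is the motivation for averaging a deliberately slow base algorithm.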
Batch size and dropout update. Another powerful regularization procedure is dropout [13,18,20,21], but it does not work well on RNNs. Thus, when used in RNNs, it is usually applied only to the input and output layers. During the basic simulated annealing procedure, as the learning rate decreases, the model's improvement tends to slow down and get stuck or show oscillation. As we mentioned at the beginning of this section, training batch size and dropout are tuned simultaneously. When training RNN language models, changing training batch size can not only shorten the training time but also provide input with different contexts. This offers more possible information about the corpus for our model to learn, so better results can be expected if variety is introduced to the value of batch size.
The specific steps of the gradient batch size and dropout updating algorithm are shown in Table 4. The "reset random seed" operation in Step 9 is necessary since accumulated long-term dependencies can harm the RNN's generalization ability. It is also indispensable for training initialized with pretrained word vectors to work. The essential idea of the "reset random seed" step is to add disturbance to the dropout procedure in the forward function of the RNN model. This step brings uncertainty to the dropout as to which input word vectors will make it to the model. In other words, resetting the seed simply changes the dropout pattern of the input vectors, and one way to achieve this is to change the random seed of the dropout procedure.
Taking the gradient batch size training on the PTB dataset as an example, we start the training with batch size 10 and dropout 0.05 and increase them by 5 and 0.05, respectively, after each training epoch until the so-far best valid loss stops updating. We then follow the training procedure of basic simulated annealing, but when annealing happens, we update not only the learning rate but also the batch size and dropout. We stop updating the batch size when it decreases to 30 and keep it at that value until the training stops. The batch size value 30 here corresponds to the hyperparameter min_β in Table 4, and we chose it manually because it offers better results in a grid search. As the Pareto diagram in Fig 1 shows, the sentence lengths of the training text in the PTB corpus fall mostly in the range of 10 to 30. The length distribution of the WikiText-2 corpus, however, differs considerably from that of PTB: most of its sentences are much longer and cover a larger length range.
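The PTB example above can be sketched as a small per-epoch update rule. Several details are assumptions of this sketch rather than statements of the algorithm in Table 4: the first non-improving epoch is treated as an annealing event, batch size and dropout decrease by the same steps they grew by, dropout stops changing together with batch size, and the learning rate starts at 20 and is divided by 4 at each annealing (values mentioned in the results discussion). The "reset random seed" step is omitted, and the names `Schedule` and `update` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Schedule:
    lr: float = 20.0                  # initial learning rate (from the results discussion)
    batch_size: int = 10              # PTB starting batch size
    dropout: float = 0.05             # PTB starting input dropout
    best_valid: float = float("inf")  # so-far best valid loss
    growing: bool = True              # True while in the increasing phase


def update(s, valid_loss, batch_step=5, drop_step=0.05, min_batch=30, lr_div=4.0):
    """Update the schedule after one epoch; return True if annealing happened."""
    if valid_loss < s.best_valid:
        s.best_valid = valid_loss
        if s.growing:
            # Increasing phase: grow batch size and dropout together.
            s.batch_size += batch_step
            s.dropout = round(s.dropout + drop_step, 10)
        return False
    # Valid loss did not improve: anneal and leave the increasing phase.
    s.growing = False
    s.lr /= lr_div
    if s.batch_size > min_batch:
        s.batch_size = max(min_batch, s.batch_size - batch_step)
        s.dropout = round(max(0.0, s.dropout - drop_step), 10)
    return True


sched = Schedule()
for loss in [6.0, 5.5, 5.2, 5.0, 4.9, 4.85, 4.9, 4.95, 5.0]:
    update(sched, loss)
```

Because the search rule and step sizes are fixed in advance, only the starting values, the step sizes, and `min_batch` remain to be chosen.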
Training stopping criteria. Valid loss is a traditional criterion for monitoring the training conditions. If the number of training epochs is preset, usually the model with the minimum valid loss is chosen as the final result. In our situation, since we are training to find the limits of the training strategies we propose, it is not practical to preset the number of training epochs or to train without considering early stopping. During early exploratory training, for each training epoch we observe both the valid loss and the test loss and select a set of stopping criteria according to experimental observations and conventional standards. Before presenting our stopping criteria, we first introduce the parameters used. Our basic training process is simulated annealing, and annealing happens when the current valid loss becomes larger than the so-far best. In an ideal situation, the valid loss drops below the so-far best valid loss right after the annealing epoch. However, in other situations it does not drop, thus causing another annealing to occur, i.e., annealing happens twice in a row. For situations like this, we define the parameter Continued_Anneal_Num as the number of consecutive annealing epochs. Table 5 below is an example of consecutive annealing in which Continued_Anneal_Num equals 3.
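The counter defined above can be computed directly from the sequence of valid losses. The following is a minimal sketch; the function names and the `threshold` argument are hypothetical, and a loss equal to the so-far best is treated here as non-annealing, which is an assumption.

```python
def continued_anneal_counts(valid_losses):
    """Return Continued_Anneal_Num after each epoch."""
    best = float("inf")
    run = 0
    counts = []
    for loss in valid_losses:
        if loss > best:
            # Current valid loss is worse than the so-far best: annealing epoch.
            run += 1
        else:
            best = loss
            run = 0
        counts.append(run)
    return counts


def stop_epoch(valid_losses, threshold):
    """1-based epoch at which the counter first reaches `threshold`, else None."""
    for epoch, count in enumerate(continued_anneal_counts(valid_losses), start=1):
        if count >= threshold:
            return epoch
    return None


counts = continued_anneal_counts([5.0, 4.8, 4.9, 4.85, 4.81, 4.7])
```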
We set different stopping criteria for different training procedures based on the exploratory experiments conducted on the PTB dataset. During training, we set Continued_Anneal_Num as a hyperparameter and set different threshold values of it for different training procedures. For basic benchmark annealing training, if Continued_Anneal_Num reaches 5, the training procedure trains one more epoch and then takes the better valid perplexity (PPL) between the last two training epochs as the final result. For the strategy of different optimizers, we stop the training when the first annealing happens after introducing ASGD and take the epoch before annealing as the final result. The stopping criterion for gradient batch size is similar to that of basic annealing, except that the threshold value of Continued_Anneal_Num changes from 5 to 7.
Learning rate back-tracking with ASGD. There is another way of getting better perplexity results with ASGD, which is changing the optimizer from SGD to ASGD when the learning rate is close to 0 and then increasing the learning rate. In our experiments, we set the increased learning rate back to 0.02, since in the basic simulated annealing procedure ASGD performs best at a learning rate of 0.01953125. We call this strategy Learning Rate Back-tracking. A drawback of back-tracking is that it requires more training time. A more specific description of this method is given in Table 6 below.
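The back-tracking decision can be sketched as a simple rule. This is an illustration under assumptions: optimizers are represented by name only, the function `maybe_backtrack` is hypothetical, and the near-zero threshold of 1e-8 is the value mentioned in the results discussion rather than a tuned constant.

```python
def maybe_backtrack(optimizer, lr, near_zero=1e-8, restart_lr=0.02):
    """Switch from SGD to ASGD and raise the learning rate once lr is close to 0."""
    if optimizer == "SGD" and lr < near_zero:
        return "ASGD", restart_lr
    return optimizer, lr


# Toy usage: anneal from 20 by dividing by 4 until back-tracking triggers.
optimizer, lr, annealings = "SGD", 20.0, 0
while optimizer == "SGD":
    lr /= 4.0
    annealings += 1
    optimizer, lr = maybe_backtrack(optimizer, lr)
```

Under these assumptions many annealing events are needed before the rule fires, which is consistent with the extra training time that back-tracking requires.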

G2Basy: A framework to combine them all
We call this Gradient Batch Size and Hybrid optimizers framework with pretrained GloVe word embeddings initialization G2Basy. Its overall diagram is illustrated in Fig 2. For the G2Basy framework, the best expectation when combining all the strategies mentioned above is that they all add positive effects to the final results. However, the experimental results tell a different story. The combination of gradient batch size and the ASGD optimizer does not work as well with the PTB dataset as we expected. In addition, there is another problem that does not show up in the diagram of Fig 2, i.e., the timing of introducing the ASGD optimizer. The default setting of the learning rate parameter in ASGD is 0.01, which we considered when choosing the introduction time. There are some trade-offs between training time and test perplexity when ASGD is brought in at different learning rates. The larger the learning rate, the more quickly the model reaches a local minimum and overfits, which shows during the training process as a steadily rising valid perplexity. In the next section, we discuss in more detail when to bring in the ASGD optimizer and how it influences the model's performance.

Results and discussion
There are three key factors in our G2Basy framework, i.e., initializing with pretrained word vectors, introducing different optimizers during training, and gradient updating of training batch size and dropout. They can improve results singly and offer even better results if combined. As shown in Table 7 in the next section, pretrained word vectors offer the best outcomes, followed by the gradient batch size and dropout updating procedure.

Initialization
As mentioned earlier, we refer to those words that do not have counterparts in the pretrained GloVe embeddings as unseen words. We use R(GloVe) to represent the procedure of initializing words with random GloVe vectors, R(GloVe/2) to represent initializing with random GloVe vectors divided by scalar 2, and R(GloVe/4) to represent initializing with random GloVe vectors divided by scalar 4. We discard the R(GloVe) initialization due to its poor performance in the experiments. A possible theoretical reason for this is that the pretrained word embeddings are already rather large, so the model skips the process of early parameter growth and misses many possible search areas in the solution space. In Table 7 we list the PTB valid and test perplexity results for different initialization and training strategies. The last three columns need some extra explanation. To evaluate the performance of our proposed strategies, we monitor the test perplexity of each training epoch and then design precise stopping criteria (see the section "Training stopping criteria") for all the training procedures to avoid useless training and ease overfitting. The "Valid PPL" and "Test PPL" in columns 3 and 4 are reached at the epoch numbers listed in column 5, "Stopping Epoch". Those are the results we get when our stopping criteria are satisfied. The last two columns list the best test PPL and the training epoch where it is obtained, which occurs after the stopping epoch. After the stopping criteria are satisfied, the training results oscillate most of the time, but the trend still points in the direction of better results. Although the "Best Test PPL" is better than the "Test PPL", as its name indicates, the improvements are usually not proportional to the training time they consume. The oscillation makes the search much more unpredictable than normal model training.
As shown in Figs 3 and 4 below, the pretrained GloVe embeddings make the largest contribution to performance improvement. The GloVe+Uniform initialization slightly outperforms GloVe+R(GloVe/2), although we expected otherwise. However, as seen in the figures, GloVe+R(GloVe/2) converges more quickly, and the results become more and more similar as training proceeds. During the training on the PTB dataset, when initializing all the words with randomly selected GloVe embeddings or vectors that point in the same direction, the model converges quickly in the first few epochs but soon reaches a not-so-satisfactory point and then either gets stuck or shows overfitting. Therefore, we only replace the words that have counterparts in the pretrained GloVe with randomly selected GloVe vectors, and the others retain the random data generated from the predefined uniform distribution. In contrast, we initialize all the words in the WikiText-2 corpus with random GloVe vectors, and this outperforms the benchmark uniform distribution initialization as we expected earlier, indicating that the random GloVe initialization works better with larger and more complicated training corpora.
The gradient batch size and dropout strategy explained in the section "Batch size and dropout update" improves the performance of all five initialization strategies and delays the occurrence of overfitting. With this strategy, the R(GloVe/4) results outperform R(GloVe/2), whereas with the basic simulated annealing training the situation is the opposite.
We define the above mentioned random GloVe initialization as "artificial features". This describes the phenomenon that word vectors obtained from training, whether matching with the words or not, differ from those generated from random distributions (albeit pseudorandom). Compared with training based on characters, morphology, or so on, this broader abstraction improves the final results to a certain degree. If we can initialize the words with the pretrained word vectors of words that relate to them (which can include their synonyms, antonyms, or even words containing overlapping letters), it is foreseeable that the results are more likely to outperform those initialized by randomly distributed data. This describes a transitional state between disorder (completely random initialization) and order (perfectly matched pretrained word vectors), and the often used character-, prefix-or suffix-based methods can also be included in this framework. In specific applications, appropriate initialization techniques can be found somewhere between disorder and order according to the specific problem we are going to solve to maximize the reuse of previously trained word embeddings and improve the training results.

Hyperparameter updates
In addition to basic SGD, we use two more optimization algorithms, i.e., Adadelta and ASGD. The effect of Adadelta is not so obvious since we only use it in between SGD and ASGD. The ASGD optimizer, however, shortens the training time tremendously while still offering equivalent or better results. When setting the upper limit of the batch size parameter, we referred to the statistical information on the corpus's sentence lengths, which we discussed earlier in the section "Batch size and dropout update", as well as to actual exploratory training performance. The combined strategy of changing the optimizer and using gradient batch size is not shown in the overall result summary in Table 7 because it does not apply to all of the initialization situations. In the R(GloVe/2) initialization, the combined strategy gets a test PPL of 83.5 at training epoch 57 when changing the optimizer from SGD to ASGD at the learning rate 0.078125. However, when using gradient batch size only, we get a result of 83.6 at training epoch 111.
The ASGD optimizer almost halves the training time needed to reach an equivalent test PPL. We get a similar result with the GloVe+Uniform initialization: the test PPL is 78.7 at epoch 69, while the gradient batch size procedure alone achieves the same value at training epoch 87, yet its final result is 0.3 smaller than that of the combined strategy. In other words, if the application is not very sensitive to the final result, the combination of gradient batch size and the ASGD optimizer can shorten the training time significantly.
We try increasing the learning rate at various points during training to see if the search path will deviate to a better local optimum, but it does not work out. Then, we introduce a second variable, the optimizer, and wait until the learning rate is very close to 0 (less than $10^{-8}$). We then set the learning rate to 0.02 and change the optimizer from SGD to ASGD. The training starts to converge slowly for dozens of epochs, and we get the best test PPL result, 77.9, among all the training runs. Table 8 shows an overview of our learning rate back-tracking technique. There are only 3 different results instead of 5 because in the other 2 initializations the training overfits before the learning rate is small enough for us to conduct back-tracking. We change the optimizer at epoch number 100 deliberately, though this is likely not optimal. Epoch 100 was chosen as a byproduct of our exploratory training. Therefore, further research is needed to explore the back-tracking procedure with hybrid optimizers.
The gradient training batch size strategy is a very robust procedure. It introduces two more hyperparameters, yet still reduces the search space greatly by defining the search rules and search step sizes. This strategy converges quickly within the first few training epochs and shows training potential when proper optimizers take over from SGD.
There is an extra latent hyperparameter in our framework, besides the abovementioned settings of learning rate, dropout, and training batch size: the timing of introducing different optimizers. Adadelta is introduced when the learning rate reaches 1.25 (the closest to Adadelta's default parameter settings among all the values that occur during the simulated annealing procedure). Although it does not improve the result noticeably, it delays the annealing procedure, thus ensuring that the model is trained at a rather high learning rate. Therefore, ASGD is the optimizer that makes the most difference, and it works best at learning rates of 0.01953125 (referred to as lr1) and 0.078125 (referred to as lr2). We did not choose these learning rates for any particular reason. They are only two of the values we get when we start the learning rate from 20 and divide it by 4 whenever annealing happens. ASGD still works if we introduce it at a learning rate of 0.3125, but the training begins to overfit after a few epochs. From Fig 4 above, we can see that changing the optimizer to ASGD at lr2 almost cuts the training time in half, and the results are equivalent or even better for certain initializations. For the three randomly initialized training runs, although their lines in the figure are closely entwined, a close examination shows that the initialization with randomly selected GloVe vectors (both R(GloVe/2) and R(GloVe/4)) outperforms the uniform distribution initialization, which partially verifies our assumption about using pretrained word vectors to narrow the search space and improve the final results. Figs 5-7 below show the results of different training strategies under the same distribution initialization. Annotations 'Optimizer1' and 'Optimizer2' correspond to introducing ASGD at the learning rates of 0.01953125 and 0.078125, respectively.
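The learning rates quoted above all lie on one annealing ladder, which the following sketch reproduces together with a possible optimizer assignment. The function names are hypothetical, and the rule "use Adadelta from 1.25 until ASGD takes over" is an assumption made for illustration; only the ladder values and the two ASGD introduction points come from the text.

```python
def annealing_ladder(initial_lr=20.0, divisor=4.0, steps=6):
    """Learning rates visited when dividing by `divisor` at each annealing."""
    return [initial_lr / divisor ** k for k in range(steps)]


def optimizer_for(lr, asgd_lr=0.078125, adadelta_lr=1.25):
    """Pick an optimizer name for a given learning rate (illustrative rule)."""
    if lr <= asgd_lr:
        return "ASGD"       # lr2 by default; pass asgd_lr=0.01953125 for lr1
    if lr <= adadelta_lr:
        return "Adadelta"
    return "SGD"


ladder = annealing_ladder()
plan = [(lr, optimizer_for(lr)) for lr in ladder]
```

Passing `asgd_lr=0.01953125` corresponds to the 'Optimizer1' setting and the default to 'Optimizer2'.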
The gradient batch size procedure provides the best results for most of the initializations, yet introducing ASGD from a rather early training stage, as done in 'Optimizer2', trains much faster than the other strategies and still provides the best results for precise GloVe initializations and second-best for other initializations.

Results on PTB and WikiText-2 datasets
When network scale is considered in language modeling, there are usually 3 different model sizes, i.e., small ones with 200 units in each layer, medium ones with 650 units in each layer, and large ones with 1500 units in each layer. The number of layers is set to 2 most of the time, so we use 2 layers as well. All our models are small due to the insufficiency of computational resources.
Although there exists a gap between our test perplexity and the test perplexity obtained from state-of-the-art models, we still make a comparison with some of them [20,23,33,34] and also with one of the smaller-scale models that is approximately the same size as ours [38]. The comparisons are listed in Tables 9 and 10 below.

Conclusion
We propose a framework, G2Basy, that combines different initializations, multiple optimizers, and the gradient training batch size strategy, and that aims to ease the overfitting problem, obtain better results, and accelerate the training procedure. Our framework offers sufficient perplexity results using a more compact model and within fewer training epochs than conventional techniques. It also provides practical guidance for training RNN language models under limited computing resources. Our research offers a new perspective on the initialization of word embeddings. We propose the concept of "artificial features", which describes the training-developed characteristics that emerge from treating the multidimensional space consisting of all the word embeddings as a whole and taking every word embedding involved as a cross-section of the word vector space. Therefore, when dealing with problems such as polysemous words, the word vectors and those of their contexts can be subjected to basic vector operations. In this case, the vector of a word is not fixed and varies with the context of use. If a distributed environment is available, more training optimizers and techniques can be explored to test and find better solutions. Meanwhile, the ASGD optimizer is a robust one with rather low complexity that can be generalized to other applications easily. Additionally, the strategy of combining training batch size and dropout performs well in reducing training time, and the concept of "artificial features" provides better outcomes and has the potential to be studied further. To sum up, the framework we propose is efficient and robust, generalizes easily, and offers superior results as well as faster convergence compared with the basic simulated annealing procedure.