Fast, Simple and Accurate Handwritten Digit Classification by Training Shallow Neural Network Classifiers with the ‘Extreme Learning Machine’ Algorithm

doi:10.1371/journal.pone.0134254

Table 1.

Comparison of our results on the MNIST data set with published results using other methods.

The percentages listed in brackets are the mean error percentages we obtained from 10 independent realisations of each method. The remaining percentages for the results obtained in this report are from the best-performing trained ELM out of the 10 repeats. Values for each trial shown in Figs 3 and 4 demonstrate small spreads either side of the mean value. The abbreviations CIW, C and RF are explained in later sections. Note: SLFN is “Single-hidden Layer Feedforward Network”; DLFN is “Dual-hidden Layer Feedforward Network.” The result described as ‘Distortions’ was obtained by augmenting the training set using affine and elastic distortions, as described in the main text. ‘Deep. Conv. Net’ is an abbreviation for ‘Deep Convolutional Network’.

Fig 1.

Illustration of the three core methods of shaping ELM input weights.

In (a), which is a cartoon of the Computed Input Weights ELM (CIW-ELM) process [15], two classes of input data are indicated by ‘+’ and ‘o’ symbols. The vectors to the ‘+’ symbols are multiplied by random bipolar binary ({−1, 1}) vectors u₁ and u₂ to produce a biased random weight vector w₁. Similarly, the vectors to the ‘o’ symbols are multiplied by random vectors u₁ and u₂ to produce a biased random weight vector w₂. Note that in practice we would not use the same random binary vectors for both classes. In (b), we show the Constrained ELM (C-ELM) process [21]. The black arrows are weight vectors derived by computing the difference between members of two classes; in this case, the difference between the ‘+’ elements and the ‘o’ elements. In (c), we illustrate the Receptive Field ELM (RF-ELM) method; the weights for each hidden-layer neuron are restricted to being non-zero for only a small random rectangular receptive field in the original image plane.
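The three shaping ideas in this caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy data, the rule of cycling through classes per hidden unit, the `min_size` parameter, and the absence of any weight normalisation are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def ciw_weights(X, y, n_hidden, rng):
    """CIW-style sketch: each weight vector is a random {-1, +1}
    combination of the training vectors of a single class."""
    classes = np.unique(y)
    W = np.empty((n_hidden, X.shape[1]))
    for j in range(n_hidden):
        Xc = X[y == classes[j % len(classes)]]     # assumption: cycle through classes
        u = rng.choice([-1.0, 1.0], size=len(Xc))  # fresh bipolar vector per unit
        W[j] = u @ Xc
    return W

def constrained_weights(X, y, n_hidden, rng):
    """C-ELM-style sketch: each weight vector is the difference of two
    training vectors drawn from different classes (needs >= 2 classes)."""
    W = np.empty((n_hidden, X.shape[1]))
    for j in range(n_hidden):
        while True:
            a, b = rng.integers(0, len(X), size=2)
            if y[a] != y[b]:
                break
        W[j] = X[a] - X[b]
    return W

def receptive_field_mask(n_hidden, height, width, rng, min_size=2):
    """RF-ELM-style sketch: a binary mask per hidden unit that is non-zero
    only inside one random rectangle of the image plane."""
    mask = np.zeros((n_hidden, height, width))
    for j in range(n_hidden):
        h = rng.integers(min_size, height + 1)   # rectangle height
        w = rng.integers(min_size, width + 1)    # rectangle width
        top = rng.integers(0, height - h + 1)
        left = rng.integers(0, width - w + 1)
        mask[j, top:top + h, left:left + w] = 1.0
    return mask.reshape(n_hidden, height * width)

# Toy data: 20 random 4x4 "images" in two classes (purely illustrative).
X = rng.random((20, 16))
y = np.arange(20) % 2

W_ciw = ciw_weights(X, y, 6, rng)
W_c = constrained_weights(X, y, 6, rng)
mask = receptive_field_mask(6, 4, 4, rng)
W_rf_ciw = mask * W_ciw   # receptive-field masking applied to CIW weights
```

Multiplying a shaped weight matrix elementwise by the mask is one simple way to combine the receptive-field restriction with either of the other two methods; how the combination is done in the paper is described in its main text.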

Fig 2.

Combined two-layer RF-CIW-ELM and RF-C-ELM network.

This figure depicts the structure of our multilayer ELM network that combines an RF-CIW-ELM network with a C-ELM network, using what is effectively an autoencoder output. Note that the middle linear layer of neurons can be removed by combining the output layer weights of the first network with the input layer weights of the second; we have not shown this here, in order to clarify the development of the structure.
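The remark about removing the middle linear layer follows from associativity of matrix multiplication: two consecutive linear maps collapse into one. A small NumPy check, with layer sizes and variable names chosen arbitrarily for illustration and biases omitted as a simplifying assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden1, n_linear, n_hidden2, n_samples = 8, 5, 7, 3  # arbitrary sizes

H1 = rng.standard_normal((n_hidden1, n_samples))     # first hidden-layer activations
W_out1 = rng.standard_normal((n_linear, n_hidden1))  # output weights of the first network
W_in2 = rng.standard_normal((n_hidden2, n_linear))   # input weights of the second network

# With the explicit linear middle layer: two multiplications.
two_step = W_in2 @ (W_out1 @ H1)

# With the middle layer removed: one precomputed weight matrix.
W_merged = W_in2 @ W_out1
one_step = W_merged @ H1

assert np.allclose(two_step, one_step)
```

The merged matrix maps the first hidden layer directly to the pre-activations of the second hidden layer, so the linear units in between never need to be evaluated.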

Fig 3.

Error rates for MNIST images for various SLFN ELM methods with shaped input weights.

The first row shows the mean error percentage from 10 different trained networks applied to classify (a) the 10000-point MNIST test data set, and (b) the 60000-point MNIST training data set used to train the networks, for various sizes of hidden layer, M. Markers show the actual error percentage from each of the 10 networks. Note that the data for the combination RF-CIW-C-ELM method is plotted against the M used in just one of the three parts of the overall network; the total number of hidden units used is actually 2M+500. Therefore, for small M, RF-CIW-C-ELM does not outperform the other methods for the same total number of hidden units. However, it can be seen that for large M, RF-CIW-C-ELM produces results below 1% error on the test data set and provides the best error rates overall. The second row illustrates that increasing the number of hidden units above about M = 15000 leads to overfitting, since, as shown in (c), the total number of errors on the test set plateaus, whilst the total number of errors on the training set continues to decrease (shown in (d)). Note that (c) and (d) show results from a single trained network only.
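To compare RF-CIW-C-ELM with the other methods at equal network size, the plotted M has to be converted to the total hidden-unit count using the 2M+500 relation stated in the caption. A tiny helper makes the conversion explicit; the function and parameter names are hypothetical.

```python
def total_hidden_units(m, extra=500):
    """Total hidden units of the combined RF-CIW-C-ELM network, using the
    2M + 500 relation quoted in the caption."""
    return 2 * m + extra

# Example: the plotted value M = 12800 corresponds to 26100 hidden units in total.
total_at_12800 = total_hidden_units(12800)
```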

Fig 4.

ELM-backpropagation error rates for MNIST for various SLFN ELM methods with shaped input weights.

Each trace shows the mean error percentage from 10 different trained networks applied to classify (a) the 10000-point MNIST test data set, and (b) the 60000-point MNIST training data set used to train the networks, for various different sizes of hidden layer, M, when ten iterations of backpropagation were also used. Markers show the actual error percentage from each of the 10 networks. In comparison with Fig 3, it can be seen that backpropagation significantly improves the error rate for small M with all methods, but has little impact when M = 12800. The total number of hidden units used for RF-CIW-C-ELM is actually 2M+500, but each parallel ELM has M hidden-units.

Fig 5.

Mean training times for MNIST for various SLFN ELM training methods with shaped input weights.

Each trace shows the mean run time from 10 different networks trained on all 60000 MNIST training data points, to achieve the test-data error rates shown in Fig 4. The total time for setup and training is shown, excluding the time to load the MNIST data from files. When backpropagation is applied, the runtime scales approximately linearly with the number of iterations, but each backpropagation iteration takes longer than the run times shown in these traces, because both input and output weights are updated in each iteration. The time for testing is not included in the figure, but was approximately 10 seconds for M = 12,800, and increases only linearly with M.

Fig 6.

Error rates for NORB-small for RF-C-ELM.

The error rate on the 24300 stereo-channel NORB-small test images as a function of the number of hidden units, M. The data were preprocessed by downsampling each channel of each image to 13 × 13 pixels, and then contrast normalising. Our best result from all repeats was 94.76% correct classification (an error rate of 5.24%), for M = 10000.
