Gaussian-binary restricted Boltzmann machines for modeling natural image statistics

We present a theoretical analysis of Gaussian-binary restricted Boltzmann machines (GRBMs) from the perspective of density models. The key aspect of this analysis is to show that GRBMs can be formulated as a constrained mixture of Gaussians, which gives much better insight into the model's capabilities and limitations. We further show that GRBMs are capable of learning meaningful features without a regularization term and that the results are comparable to those of independent component analysis. This is illustrated both for a two-dimensional blind source separation task and for modeling natural image patches. Our findings show that reported difficulties in training GRBMs stem from the training algorithm rather than from the model itself. Based on our analysis we derive a better training setup and show empirically that it leads to faster and more robust training of GRBMs. Finally, we compare different sampling algorithms for training GRBMs and show that Contrastive Divergence performs better than training methods that use a persistent Markov chain.


Introduction
Inspired by the hierarchical structure of the visual cortex, recent studies on probabilistic models have used deep hierarchical architectures to learn higher-order statistics of the data [Karklin and Lewicki(2009), Köster and Hyvärinen(2010)]. One widely used architecture is the deep belief network (DBN), which is usually trained as a stack of restricted Boltzmann machines (RBMs) [Hinton and Salakhutdinov(2006), Bengio et al.(2006), Erhan et al.(2010)]. Since the original formulation of RBMs assumes binary input values, the model needs to be modified to handle continuous input. One common way is to replace the binary input units by linear units with independent Gaussian noise, which is known as Gaussian-binary restricted Boltzmann machines (GRBMs) or Gaussian-Bernoulli restricted Boltzmann machines [Krizhevsky(2009), Cho et al.(2011)], first proposed by [Welling et al.(2004)].
The training of GRBMs is known to be difficult, so several modifications have been proposed to improve it. [Lee et al.(2007)] used a sparsity penalty during training, which allowed them to learn meaningful features from natural image patches. [Krizhevsky(2009)] trained GRBMs on natural images and concluded that the difficulties are mainly due to high-frequency noise in the images, which prevents the model from learning the important structures. [Theis et al.(2011)] illustrated that in terms of likelihood estimation GRBMs are already outperformed by simple mixture models. Other researchers focused on improving the model itself.


Gaussian-binary restricted Boltzmann machines (GRBMs)

The model
A Boltzmann machine (BM) is a Markov random field with stochastic visible and hidden units [Smolensky(1986)], denoted as X := (X_1, ..., X_M)^T and H := (H_1, ..., H_N)^T, respectively. In general, we use bold letters to denote vectors and matrices.
The joint probability distribution is defined as

P(X, H) = (1 / Z) exp(−E(X, H) / T_0),

where E(X, H) denotes an energy function as known from statistical physics, which defines the dependence between X and H. The temperature parameter T_0 is usually ignored by setting its value to one, but it can play an important role in inference in BMs [Desjardins et al.(2010)]. The partition function Z normalizes the probability distribution by integrating over all possible values of X and H, which is intractable in most cases. Therefore, when training BMs using gradient descent, the partition function is usually estimated using sampling methods. However, even sampling in BMs remains difficult due to the dependencies between all variables.
An RBM is a special case of a BM whose energy function contains no terms combining two different hidden or two different visible units. Viewed as a graphical model, there are no lateral connections within the visible or hidden layer, which results in a bipartite graph. This implies that the hidden units are conditionally independent given the visible units and vice versa, which allows efficient sampling.
The values of the visible and hidden units are usually assumed to be binary, i.e. X_m, H_n ∈ {0, 1}.
The most common way to extend an RBM to continuous data is the GRBM, which assumes continuous values for the visible units and binary values for the hidden units. Its energy function [Cho et al.(2011), Wang et al.(2012)] is defined as

E(X, H) = ||X − b||² / (2σ²) − c^T H − X^T W H / σ²,

where ||u|| denotes the Euclidean norm of u, b and c denote the visible and hidden bias vectors, and W denotes the weight matrix. In GRBMs the visible units given the hidden values are Gaussian distributed with standard deviation σ. Notice that some authors [Krizhevsky(2009), Cho et al.(2011), Melchior(2012)] use an independent standard deviation for each visible unit, which comes into account if the data is not whitened [Melchior(2012)].
The conditional probability distribution of the visible units given the hidden units is

P(X|h) = N(X; b + W h, σ²),

where w_{i*} and w_{*j} denote the i-th row and the j-th column of the weight matrix, respectively. N(x; µ, σ²) denotes a Gaussian distribution with mean µ and variance σ², and N(X; µ, σ²) denotes an isotropic multivariate Gaussian distribution centered at the vector µ with variance σ² in all directions. The conditional probability distribution of the hidden units given the visible units can be derived as

P(H_j = 1|x) = sigm(c_j + x^T w_{*j} / σ²),

so P(H|x) turns out to be a product of independent sigmoid functions, a frequently used non-linear activation function in artificial neural networks.
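The two conditional distributions can be sketched in a few lines of NumPy. This is a minimal illustration under the energy function above, not the authors' implementation; function names and shapes are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_h_given_x(x, W, c, sigma):
    """P(H_j = 1 | x) = sigmoid(c_j + w_{*j}^T x / sigma^2); returns probs and a binary sample."""
    p = sigmoid(c + x @ W / sigma**2)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_x_given_h(h, W, b, sigma):
    """P(X | h) = N(X; b + W h, sigma^2 I); returns the mean and a Gaussian sample."""
    mean = b + W @ h
    return mean, mean + sigma * rng.standard_normal(mean.shape)
```

With zero weights and biases, every hidden unit fires with probability 0.5 and the visible mean is the visible bias, matching the formulas above.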

Maximum likelihood estimation
Maximum likelihood estimation (MLE) is a frequently used technique for training probabilistic models like BMs. In MLE we have a data set X = {x_1, ..., x_L}, where the observations x_l are assumed to be independent and identically distributed (i.i.d.). The goal is to find the optimal parameters Θ that maximize the likelihood of the data, i.e. maximize the probability that the data is generated by the model [Bishop(2006)]. For practical reasons one often considers the logarithm of the likelihood, which has the same maximum as the likelihood since the logarithm is a monotonic function. We use the average log-likelihood per training case, denoted by l. For RBMs it is defined as

l = ⟨ln P(x)⟩_x, x ∈ X,

where ⟨f(u)⟩_u denotes the expectation of the function f(u) with respect to the variable u.
The gradient of l turns out to be the difference between the expectations of the energy gradient under the data distribution and under the model distribution,

∂l/∂θ = ⟨−∂E(x, h)/∂θ⟩_{P(h|x)} − ⟨−∂E(x, h)/∂θ⟩_{P(x,h)},

where the second expectation is intractable in general. In Contrastive Divergence (CD) [Hinton(2002)] it is approximated using samples x^{(k)} obtained after k steps of Gibbs sampling starting from the data. Taking the derivatives of the GRBM's energy function with respect to the parameters, the corresponding gradient approximations become

∇W ∝ ⟨x P(h = 1|x)^T⟩ / σ² − ⟨x^{(k)} P(h = 1|x^{(k)})^T⟩ / σ²,
∇b ∝ ⟨x − b⟩ / σ² − ⟨x^{(k)} − b⟩ / σ²,
∇c ∝ ⟨P(h = 1|x)⟩ − ⟨P(h = 1|x^{(k)})⟩,

where P(h = 1|x) := (P(h_1 = 1|x), ..., P(h_N = 1|x))^T, i.e. P(h = 1|x) denotes a vector of probabilities.
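As a sketch, one CD-1 parameter update for a GRBM could look as follows. This is a minimal single-sample version under the gradient approximations above; batching, momentum, and the training recipes of Section 4 are omitted:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(x, W, b, c, sigma=1.0, lr=0.01):
    """One CD-1 update: positive phase on the data x, negative phase after one Gibbs step."""
    p0 = sigmoid(c + x @ W / sigma**2)               # P(h=1|x), positive phase
    h0 = (rng.random(p0.shape) < p0).astype(float)   # sampled hidden state
    x1 = b + W @ h0 + sigma * rng.standard_normal(b.shape)  # reconstruction x^(1)
    p1 = sigmoid(c + x1 @ W / sigma**2)              # P(h=1|x^(1)), negative phase
    # data term minus model term, matching the gradient approximations above
    W += lr * (np.outer(x, p0) - np.outer(x1, p1)) / sigma**2
    b += lr * (x - x1) / sigma**2
    c += lr * (p0 - p1)
    return W, b, c
```

Repeating this update over mini-batches of data yields the usual CD-1 training loop.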

The marginal probability distribution of the visible units
From the perspective of density estimation, the performance of the model can be assessed by examining how well it estimates the data distribution. We therefore take a look at the model's marginal probability distribution of the visible units, which can be formalized either as a product of experts (PoE) or as a mixture of Gaussians (MoG).

In the Form of Product of Experts
We derive the marginal probability distribution of the visible units P(X) by factorizing the joint probability distribution over the hidden units. P(X) can be written as a product of N factors, referred to as a product of experts [Hinton(2002)]. Each expert p_j(X) consists of two isotropic Gaussians with the same variance Nσ². The first Gaussian is placed at the visible bias b. The second Gaussian is shifted relative to the first one by N times the weight vector w_{*j} and scaled by a factor that depends on w_{*j} and b. Every hidden unit leads to one expert, each mode of which corresponds to one state of the corresponding hidden unit. Figure 1 (a) and (b) illustrate P(X) of a GRBM-2-2 viewed as a PoE, where GRBM-M-N denotes a GRBM with M visible and N hidden units.

In the Form of Mixture of Gaussians
Using Bayes' theorem, the marginal probability of X can also be formalized as a mixture:

P(X) = Σ_h P(h) P(X|h) = Σ_h P(h) N(X; b + W h, σ²),

where H_k denotes the set of all possible binary vectors with exactly k ones and N − k zeros. As an example, Σ_{h∈H_2} P(h) sums over the probabilities of all binary vectors having exactly two entries set to one. Since this form is similar to a mixture of isotropic Gaussians, we follow its naming convention. Each Gaussian distribution is called a component of the model distribution, and it is exactly the conditional probability of the visible units given a particular state of the hidden units. As in MoGs, each component has a mixing coefficient, which is the marginal probability of the corresponding state and can also be viewed as the prior probability of picking the corresponding component. The total number of components in a GRBM is 2^N, which is exponential in the number of hidden units; see Figure 1 (c) for an example.
The locations of the components in a GRBM are not independent of each other, as is the case in MoGs. They are centered at b + Wh, the vector sum of the visible bias and the selected weight vectors, where the selection is done by the corresponding entries of h taking the value one. This implies that only the N + 1 components that sum over exactly one or zero weight vectors can be placed and scaled independently. We name them the first order components and the anchor component, respectively. All 2^N − N − 1 higher order components are then determined by the choice of the anchor and first order components. This indicates that GRBMs are constrained MoGs with isotropic components.
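For small N the mixture view can be made explicit by enumerating all 2^N components. In the sketch below, the unnormalized mixing weight exp(c^T h + b^T W h / σ² + ||W h||² / (2σ²)) follows from completing the square in the joint distribution under the energy function given above; this derivation is ours and should be checked against the exact mixture form in the text:

```python
import itertools
import numpy as np

def grbm_components(W, b, c, sigma=1.0):
    """Enumerate the 2^N components of a small GRBM: means b + W h and
    normalized mixing coefficients, one per binary hidden state h."""
    M, N = W.shape
    means, weights = [], []
    for bits in itertools.product([0.0, 1.0], repeat=N):
        h = np.array(bits)
        Wh = W @ h
        means.append(b + Wh)
        # unnormalized mixing weight of state h (the anchor h = 0 has weight 1)
        weights.append(np.exp(c @ h + b @ Wh / sigma**2 + Wh @ Wh / (2 * sigma**2)))
    weights = np.array(weights)
    return np.array(means), weights / weights.sum()
```

For a GRBM-2-2 this yields the four components of Figure 1 (c): the anchor component at b, the two first order components at b + w_{*1} and b + w_{*2}, and one higher order component at b + w_{*1} + w_{*2}.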

Two-dimensional blind source separation
The general presumption in the analysis of natural images is that they can be considered a mixture of independent super-Gaussian sources [Bell and Sejnowski(1997)]; see [Zetzsche and Rohrbein(2001)] for an analysis of remaining dependencies. To visualize how GRBMs model natural image statistics, we use a mixture of two independent Laplacian distributions as a toy example.
The independent Laplacian sources s = (s_1, s_2)^T are mixed by a random mixing matrix A, yielding

x' = A s.

It is common to whiten the data (see Section 4.1), resulting in

x = V x' with V = ⟨x' x'^T⟩^{−1/2},

where V is the whitening matrix calculated with principal component analysis (PCA). Throughout this paper, we used the whitened data.
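The toy data and its whitening can be reproduced in a few lines. This is a sketch; the concrete mixing matrix and sample size are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent Laplacian (super-Gaussian) sources, linearly mixed.
L = 10000
s = rng.laplace(size=(2, L))
A = np.array([[1.0, 0.5], [0.3, 1.0]])   # arbitrary mixing matrix
x_mixed = A @ s

# Whitening with V = <x x^T>^(-1/2) (symmetric inverse square root of the
# sample covariance), so the whitened data has identity covariance.
x_mixed = x_mixed - x_mixed.mean(axis=1, keepdims=True)
cov = x_mixed @ x_mixed.T / L
evals, evecs = np.linalg.eigh(cov)
V = evecs @ np.diag(evals ** -0.5) @ evecs.T
x_white = V @ x_mixed
```

After this step the first and second order statistics are removed, and only the super-Gaussian shape of the source distribution remains to be modeled.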
To assess the performance of GRBMs in modeling the data distribution, we ran the experiments 200 times and calculated l for the test data analytically. For comparison, we also calculated l over the test data for ICA, for an isotropic two-dimensional Gaussian distribution, and for the true data distribution. The results are presented in Table 1 and confirm the conclusion of [Theis et al.(2011)] that GRBMs are not as good as ICA in terms of l. To illustrate how GRBMs model the statistical structure of the data, we looked at the probability distributions of the 200 trained GRBMs. About half of them (110 out of 200) recovered the independent components; see Figure 2 (a) for an example. This can be further illustrated by plotting the Amari errors between the true unmixing matrix and the estimated model matrices, i.e. the unmixing matrix of ICA and the weight matrix of the GRBM, as shown in Figure 3. One can see that these 110 GRBMs estimated the unmixing matrix quite well, although not as well as ICA. This is due to the fact that the weight vectors in GRBMs are not restricted to be orthogonal as they are in ICA.
For the remaining 90 GRBMs, the two weight vectors pointed in opposite directions, as shown in Figure 2 (b). Accordingly, these GRBMs failed to estimate the unmixing matrix, but in terms of density estimation these solutions have the same quality as the orthogonal ones. Thus all 200 GRBMs were able to learn the statistical structure of the data and model the data distribution fairly well.
For comparison, we also plotted the probability distribution of a learned GRBM with four hidden units, see Figure 2 (c). With four hidden units, the GRBMs always found the two independent components correctly.
To further show how the components contribute to the model distribution, we randomly chose one of the 110 GRBMs and calculated the mixing coefficients of the anchor and first order components, as shown in Table 2. The large mixing coefficient of the anchor component indicates that the model will most likely reach hidden states in which none of the hidden units is activated. In general, the more activated hidden units a state has, the less likely it is to be reached, which naturally leads to a sparse representation of the data.
Table 2: The mixing coefficients of a successfully trained GRBM-2-2, GRBM-2-4 and an MoG-3.

The Amari error [Bach and Jordan(2002)] between two matrices A and B is defined over R = A B^{−1} as

d(A, B) = (1 / (2N(N − 1))) [ Σ_i (Σ_j |r_ij| / max_j |r_ij| − 1) + Σ_j (Σ_i |r_ij| / max_i |r_ij| − 1) ].

The dominance of Σ_{h∈H_0} P(h) and Σ_{h∈H_1} P(h) can also be seen in Figure 2 by comparing a GRBM-2-2 (a) with a two-dimensional MoG having three isotropic components, denoted MoG-2-3 (d). Although the MoG-2-3 has one component fewer than the GRBM-2-2, their probability distributions are almost the same.
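The Amari error used for Figure 3 can be computed as follows. This implements the common form of the index, which is zero exactly when one matrix equals the other up to row permutation and scaling; the normalization constant is one of several variants used in the literature:

```python
import numpy as np

def amari_error(A, B):
    """Amari error between matrices A and B, computed via P = A B^{-1}."""
    P = np.abs(A @ np.linalg.inv(B))
    n = P.shape[0]
    # Each row/column of a perfect (permuted, scaled) match has a single
    # nonzero entry, making both sums below zero.
    rows = (P.sum(axis=1) / P.max(axis=1) - 1.0).sum()
    cols = (P.sum(axis=0) / P.max(axis=0) - 1.0).sum()
    return (rows + cols) / (2.0 * n * (n - 1))
```

Because the index is invariant to row permutation and scaling, it does not penalize the sign and ordering ambiguities inherent to blind source separation.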

Natural image patches
In contrast to random images, natural images have a common underlying structure which could be used to code them more efficiently than with a pixel-wise representation.
[Olshausen and Field(1996)] showed that sparse coding is such an efficient coding scheme and that it is, in addition, a biologically plausible model for the simple cells in the primary visual cortex. [Bell and Sejnowski(1997)] showed that the independent components provide a comparable representation for natural images. We now test empirically whether GRBMs produce such biologically plausible results like sparse coding and ICA.
We used the imlog natural image database of [van Hateren and van der Schaaf(1998)] and randomly sampled 70000 grey-scale image patches with a size of 14 × 14 pixels. The data was whitened using zero-phase component analysis (ZCA) and afterwards divided into 40000 training and 30000 test image patches. Since training a GRBM on natural image patches is not a trivial task, we followed the training recipes described in Section 4.
In Figure 4, we show the learned weight vectors, also called features or filters, which can be regarded as receptive fields of the hidden units. They are fairly similar to the filters learned by ICA [Bell and Sejnowski(1997)]. As in the 2D experiment, we calculated the anchor and first order mixing coefficients, shown in Table 3. The coefficients are much smaller than the anchor and first order coefficients of the GRBMs in the two-dimensional case. However, they are still significantly large considering that the total number of components in this case is 2^196. As in the two-dimensional experiments, the more activated hidden units a state has, the less likely it is to be reached, which naturally leads to a sparse representation. To support this statement, we plotted the histogram of the number of activated hidden units per training sample, shown in Figure 5.

We also examined the results of GRBMs in the over-complete case, i.e. a GRBM-196-588. There is no prominent difference between these filters and those of the complete case shown in Figure 4. To further compare the filters in the complete and over-complete case, we estimated the spatial frequency, location and orientation of all filters in the spatial and frequency domains, see Figure 6 and Figure 7, respectively. This was achieved by fitting a Gabor function of the form used by [Lewicki and Olshausen(1999)]. Note that the additional filters in the over-complete case increase the variety of spatial frequencies, locations and orientations.

Successful Training of GRBMs on Natural Images
The training of GRBMs has been reported to be difficult [Krizhevsky(2009), Cho et al.(2011)].
Based on our analysis we propose some recipes that improve the success and speed of training GRBMs on natural image patches. Some of them do not depend on the data distribution and should therefore improve training in general.

Preprocessing of the Data
The preprocessing of the data is especially important if the model is highly restricted, as GRBMs are.
Whitening is a common preprocessing step for natural images. It removes the first and second order statistics from the data, so that it has zero mean and unit variance in all directions. This allows training algorithms to focus on higher order statistics like kurtosis, which is assumed to play an important role in natural image representations [Olshausen and Field(1996), Hyvärinen et al.(2001)].

Figure 7: A polar plot of frequency tuning and orientation of the learned filters. The crosshairs describe the selectivity of the filters, which is given by the 1/16-bandwidth in spatial frequency and orientation [Lewicki and Olshausen(1999)].
The components of a GRBM are isotropic Gaussians, so the model would need several components just to model covariances. Whitened data, however, has a spherical covariance matrix, so its distribution can already be modeled fairly well by a single component. The other components can then be used to model higher order statistics. We therefore argue that whitening is an important preprocessing step for GRBMs as well.

Parameter Initialization
The initial choice of the model parameters is important for the optimization process. Using prior knowledge about the optimization problem can help to derive an initialization that improves the speed and success of training.
For GRBMs we know from the analysis above that the anchor component, which is placed at the visible bias, represents most of the whitened data.Therefore it is reasonable in practice to set the visible bias to the data's mean.
Learning the right scaling is usually very slow, since the weights and biases determine both the position and the scaling of the components. In the final stage of training GRBMs on whitened natural images, the first order components are scaled down extremely compared to the anchor component. Therefore, it usually speeds up training to initialize the parameters so that the first order scaling factors are already very small. Considering equation (37), we can set a specific first order scaling factor by initializing the hidden bias to

c_j = ln(τ_j) − b^T w_{*j} / σ² − ||w_{*j}||² / (2σ²),

so that the scaling is determined by τ_j, which should ideally be chosen close to the unknown final scaling factors. In practice, the choice τ_j = 0.01 showed good performance in most cases. The learning rate for the hidden bias can then be set much smaller than the learning rate for the weights.
According to [Bengio(2010)], the weights should be initialized to

w_ij ∼ U(−√(6 / (M + N)), +√(6 / (M + N))),

where U(a, b) is the uniform distribution over the interval [a, b]. In our experience, this works better than the commonly used initialization with small Gaussian-distributed random values.
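Putting the initialization recipes together, a sketch could look as follows. The hidden-bias formula is the one consistent with the scaling argument above (with σ = 1 for whitened data by default) and should be treated as our reading of equation (37):

```python
import numpy as np

def init_grbm(M, N, data_mean, sigma=1.0, tau=0.01, seed=0):
    """Initialize a GRBM: uniform weights, visible bias at the data mean,
    hidden bias chosen so every first order scaling factor equals tau."""
    rng = np.random.default_rng(seed)
    bound = np.sqrt(6.0 / (M + N))                  # normalized uniform range
    W = rng.uniform(-bound, bound, size=(M, N))
    b = np.asarray(data_mean, dtype=float).copy()   # anchor component at the data mean
    # c_j = ln(tau) - b^T w_{*j} / sigma^2 - ||w_{*j}||^2 / (2 sigma^2)
    c = (np.log(tau)
         - b @ W / sigma**2
         - (W * W).sum(axis=0) / (2.0 * sigma**2))
    return W, b, c
```

By construction, every first order component then starts with the scaling factor tau relative to the anchor component, so no early training time is spent shrinking these factors.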

Gradient Restriction and Choices of the Hyperparameters
The choice of the hyperparameters has a significant impact on the speed and success of training GRBMs. For successful training in an acceptable number of updates, the learning rate needs to be sufficiently big. Otherwise the learning process becomes too slow, or the algorithm converges to a local optimum where all components are placed at the data's mean. But if the learning rate is chosen too big, the gradient can easily diverge, resulting in a number overflow of the weights. This effect becomes even more critical as the model dimensionality increases, so that a GRBM with 196 visible and 1000 hidden units already diverges for a learning rate of 0.001.
We therefore propose restricting the column norms ||∇w_{*j}|| of the weight gradient to a meaningful size to prevent divergence. Since we know that the components are placed in the region of the data, there is no need for a weight norm to be bigger than twice the maximal data norm. Consequently, this natural bound also holds for the gradient and can in practice be chosen even smaller. It allows big learning rates even for very large models and therefore enables fast and stable training. In practice, one should restrict the norm of the update matrix rather than the gradient matrix, so that the effects of the momentum term etc. are restricted as well.
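The proposed restriction, applied column-wise to the update matrix, can be sketched as:

```python
import numpy as np

def restrict_norms(update, max_norm):
    """Rescale every column of the update matrix whose Euclidean norm
    exceeds max_norm (e.g. twice the maximal data norm, or smaller)."""
    norms = np.linalg.norm(update, axis=0)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return update * scale
```

Columns within the bound pass through unchanged, so the restriction only acts when an update would otherwise start to diverge.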
Since the components are placed on the data, the weights are naturally restricted, which makes weight decay useless or even counterproductive, because we want the weights to grow to a certain norm. Thus we recommend not using weight-decay regularization.
A momentum term adds a fraction of the previous gradient to the current one, which leads to more robust behavior, especially for small batch sizes. In the early stage of training the gradient usually varies a lot; a large momentum can therefore be used to prevent the weights from converging to zero. In the late stage, however, it can prevent convergence, so in practice a momentum of 0.9 that is reduced to zero in the final stage of training is recommended.
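The momentum recipe can be sketched as a simple schedule; the point at which "the final stage" begins (here 80% of training) is an arbitrary choice of ours:

```python
def momentum_schedule(epoch, total_epochs, mu_max=0.9, final_fraction=0.8):
    """Hold momentum at mu_max, then anneal it linearly to zero over the
    final stage of training."""
    final_start = int(final_fraction * total_epochs)
    if epoch < final_start:
        return mu_max
    return mu_max * (total_epochs - epoch) / max(total_epochs - final_start, 1)
```

The returned value multiplies the previous update before the current gradient is added, so annealing it to zero lets the parameters settle in the late stage.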

Training Method
Using the gradient approximation, RBMs are usually trained as described in Section 2.2. The quality of the approximation highly depends on the set of samples used for estimating the model expectation, which should ideally be i.i.d. But Gibbs sampling usually has a low mixing rate, which means that the samples tend to stay close to the previously presented samples. Therefore, a few steps of Gibbs sampling commonly lead to a biased approximation of the gradient. In order to increase the mixing rate, [Tieleman(2008)] suggested using a persistent Markov chain for drawing samples from the model distribution, which is referred to as Persistent Contrastive Divergence (PCD). [Desjardins et al.(2010)] proposed parallel tempering (PT), which selects samples from persistent Markov chains with different scalings of the energy function. In particular, [Cho et al.(2011)] analyzed the PT algorithm for training GRBMs and proposed a modified version of it.
In our experiments all of the above methods led to meaningful features and comparable l, but they differ in convergence speed, as shown in Figure 8. For PT, we used the original algorithm [Desjardins et al.(2010)] together with weight restrictions and temperatures from 0.1 to 1 with step size 0.1. Although PT performs better than CD, it also has a much higher computational cost, as shown in Table 4.

Discussion
The difficulties of using GRBMs for modeling natural images have been reported by several authors [Krizhevsky(2009), Bengio et al.(2006)], and various modifications have been proposed to address the problem.
[Ranzato and Hinton(2010)] analyzed the problem from the view of generative models and argued that the failure of GRBMs is due to the model's focus on predicting the mean intensity of each pixel rather than the dependencies between pixels. To model the covariance matrices at the same time, they proposed the mean-covariance RBM (mcRBM). In addition to the conventional hidden units h^m, there is a group of hidden units h^c dedicated to modeling the covariance between the visible units.

(Training details for our experiments: the learning rate was 0.1, an initial momentum of 0.9 was multiplied by 0.9 after every fifth epoch, the gradient was restricted to one hundredth of the maximal data norm (0.48), and no weight decay was used.)
From the view of density models, mcRBMs can be regarded as improved GRBMs in which the additional hidden units are used to model the covariances. The conditional probability of the visible units in an mcRBM is a Gaussian whose mean and covariance both depend on the hidden states (47) [Ranzato and Hinton(2010)]. By comparing (47) and (9), it can be seen that the components of an mcRBM can have a covariance matrix that is not restricted to be diagonal, as is the case for GRBMs.
From the view of generative models, another explanation for the failure of GRBMs is provided by [Courville et al.(2011)]. Although they agree on the poor ability of GRBMs to model covariances, [Courville et al.(2011)] argue that the deficiency is due to the binary nature of the hidden units. To overcome this limitation, they developed the spike-and-slab RBM (ssRBM), which splits each binary hidden unit into a binary spike variable h_j and a real-valued slab variable s_j. The conditional probability of the visible units is given by

P(X|s, h) = (1/B) N(X; Λ^{−1} Σ_{j=1}^N w_{*j} s_j h_j, Λ^{−1}),    (48)

where Λ is a diagonal matrix and B is determined by integrating the Gaussian N(X; Λ^{−1} Σ_{j=1}^N w_{*j} s_j h_j, Λ^{−1}) over the ball ||X||² < R [Courville et al.(2011)]. In contrast to the conditional probability of GRBMs (9), w_{*j} in (48) is scaled by the continuous variable s_j, which implies that the components can be shifted along their weight vectors.
We have shown that GRBMs are capable of modeling natural image patches and that the reported failures are due to the training procedure. [Lee et al.(2007)] also showed that GRBMs can learn meaningful filters by using a sparsity penalty, but this penalty changes the objective function and introduces a new hyperparameter. [Cho et al.(2011)] addressed the training difficulties by proposing a modification of PT and an adaptive learning rate. However, we claim that the reported difficulties of training GRBMs with PT are due to the gradient-divergence problem mentioned above. With gradient restriction we were able to overcome this problem and train GRBMs successfully with standard PT.

Conclusion
In this paper, we provided a theoretical analysis of GRBMs and showed that their product-of-experts formulation can be rewritten as a constrained mixture of Gaussians. This representation gives much better insight into the capabilities and limitations of the model. We used a two-dimensional blind source separation task as a toy problem to demonstrate how GRBMs model the data distribution. In our experiments, GRBMs were capable of learning meaningful features both in the toy problem and in modeling natural images.
In both cases, the results are comparable to those of ICA. But in contrast to ICA, the features are not restricted to be orthogonal and can form an over-complete representation. However, the success of training GRBMs depends highly on the training setup, for which we proposed several recipes based on our theoretical analysis. Some of them, like the gradient restriction, can be generalized to other datasets or applied directly.
In our experience, maximizing l does not necessarily imply good features, and vice versa. Prior knowledge about the data distribution is beneficial in the modeling process. For instance, our recipes are based on the prior knowledge that the statistics of natural images are centrally peaked with heavy tails. Integrating prior knowledge of the data distribution into the model, rather than modeling from scratch, is an interesting topic for future work.
Considering the simplicity and ease of training with our proposed recipes, we believe that GRBMs provide a viable way of modeling natural images. Since GRBMs are usually used as the first layer in deep belief networks, successful training of GRBMs should therefore improve the performance of the whole network.

Figure 1 :
Figure 1: Illustration of a GRBM-2-2 as a PoE and MoG, in which arrows indicate the roles of the visible bias vector and the weight vectors. (a) and (b) visualize the two experts of the GRBM. The red (dotted) and blue (dashed) circles indicate the centers of the two Gaussians in each expert. (c) visualizes the components of the GRBM. Denoted by the green (filled) circles, the four components result from the product of the two experts. Notice how each component sits right between a red (dotted) and a blue (dashed) circle.

Figure 2 :
Figure 2: Illustration of the log-probability densities. The data is plotted as blue dots. (a) A GRBM-2-2 that learned two independent components. (b) A GRBM-2-2 that learned one independent component with opposite directions. (c) A GRBM-2-4. (d) An isotropic MoG with three components. The arrows indicate the weight vectors of the GRBMs, while the crosses denote the means of the MoG components. Comparing (a) and (d), the contribution of the second order component is so insignificant that the probability distribution of the GRBM with four components is almost the same as that of the MoG with only three components.

Figure 3 :
Figure 3: The Amari errors between the real unmixing matrix and the estimates from ICA and the 110 GRBMs. Each box extends from the lower to the upper quartile of the data, with a line at the median. The whiskers extend from the box to show the range of the reliable data points; outliers are marked by "+". As a baseline, the Amari errors between the real unmixing matrices and random matrices are provided.

Figure 4 :
Figure 4: Illustration of the 196 learned filters of a GRBM-196-196. The plot is ordered from left to right and top to bottom by the increasing average activation level of the corresponding hidden units.

Figure 5 :
Figure 5: The histogram of the number of activated hidden units per training sample.The histograms before and after training are plotted in blue (dotted) and in green (solid), respectively.

Figure 6 :
Figure 6: The spatial layout and size of the filters, described by the position and size of the bars. Each bar denotes the center position and orientation of a fitted Gabor function within the 14×14 grid. The thickness and length of each bar are proportional to its spatial-frequency bandwidth.

Table 1 :
Comparison of l between different models (l ± std).

Table 3 :
The mixing coefficients per component (the partition function was estimated using AIS).

Table 4 :
Comparison of the CPU time for training a GRBM with different methods.
Figure 8: Evolution of l for a GRBM-196-16 on the whitened natural image dataset for CD and PCD, each with k = 1 and k = 10, and for PT with 10 temperatures. The learning curves are averages over 40 trials.