Restricted Boltzmann Machines

Figure 4.9: Restricted Boltzmann Machines.

The turning point between shallow and deep training techniques is generally placed in 2006, when Hinton and colleagues at the University of Toronto introduced Deep Belief Networks (DBNs) (HOT06), an algorithm that "greedily" trains a layered structure one layer at a time using an unsupervised learning procedure. The distinctive feature of DBNs is that the layers are composed of Restricted Boltzmann Machines (RBMs) (FH94,Smo86).

Let $\mathbf{v} \in \{0,1\}^{n}$ be a binary stochastic variable associated with the visible state and $\mathbf{h} \in \{0,1\}^{m}$ a binary stochastic variable associated with the hidden state. Given a state $(\mathbf{v}, \mathbf{h})$, the energy of the configuration of the visible and hidden layers is given by (Hop82)

\begin{displaymath}
E(\mathbf{v}, \mathbf{h}) = - \sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n} \sum_{j=1}^{m} w_{i,j} v_i h_j
\end{displaymath} (4.81)

where $v_i$ and $h_j$ are the binary states of the visible and hidden layers, respectively, $a_i$ and $b_j$ are the corresponding biases, and $w_{i,j}$ are the weights of the connections between them. A Boltzmann Machine is similar to a Hopfield network, with the distinction that all outputs are stochastic. A Boltzmann Machine can therefore be seen as a special case of the Ising model, which in turn is a particular case of a Markov Random Field. Similarly, RBMs can be interpreted as stochastic neural networks where the nodes and connections correspond to neurons and synapses, respectively.
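To make the notation of Eq. 4.81 concrete, the following minimal sketch evaluates the energy of one joint configuration; the dimensions, parameter values, and variable names are illustrative assumptions, not taken from the text.

\begin{verbatim}
import numpy as np

# Hypothetical tiny RBM: n = 3 visible units, m = 2 hidden units.
a = np.array([0.1, -0.2, 0.0])          # visible biases a_i
b = np.array([0.3, -0.1])               # hidden biases b_j
W = np.array([[ 0.5, -0.3],
              [ 0.2,  0.4],
              [-0.1,  0.6]])            # connection weights w_{i,j}, shape (n, m)

def energy(v, h):
    """Energy E(v, h) of a joint configuration, as in Eq. 4.81."""
    return -a @ v - b @ h - v @ W @ h

v = np.array([1, 0, 1])
h = np.array([0, 1])
print(energy(v, h))   # a.v = 0.1, b.h = -0.1, v.W.h = 0.3, so E = -0.3
\end{verbatim}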

Given the parameters $(\mathbf{a}, \mathbf{b}, \mathbf{W})$, the probability of a joint configuration $(\mathbf{v}, \mathbf{h})$ is given by the Boltzmann distribution:

\begin{displaymath}
P(\mathbf{v}, \mathbf{h} ) = \frac{1}{Z(\cdot)} e^{-E(\mathbf{v}, \mathbf{h}) }
\end{displaymath} (4.82)

where the partition function $Z$ is defined as
\begin{displaymath}
Z=\sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h}) }
\end{displaymath} (4.83)

i.e., the sum of the Boltzmann factors $e^{-E(\mathbf{v}, \mathbf{h})}$ over all possible pairs of visible and hidden states.
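Because the partition function sums over all $2^n \cdot 2^m$ configurations, it can be evaluated exactly only for very small models. As a rough illustration of Eqs. 4.82 and 4.83, reusing the hypothetical parameters and the energy function of the previous sketch, $Z$ and the joint probability can be computed by brute-force enumeration:

\begin{verbatim}
from itertools import product

def partition_function():
    """Exact Z: enumerate every (v, h) pair (feasible only for tiny n, m)."""
    return sum(np.exp(-energy(np.array(vv), np.array(hh)))
               for vv in product([0, 1], repeat=3)
               for hh in product([0, 1], repeat=2))

Z = partition_function()
p_joint = np.exp(-energy(v, h)) / Z   # P(v, h) of Eq. 4.82
\end{verbatim}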

The term restricted refers to the fact that direct interactions between units belonging to the same layer are not permitted; only connections between the two layers (visible and hidden) are allowed.

Given an input $\mathbf{v}$, the hidden binary state $h_j$ is activated with probability:

\begin{displaymath}
p(h_j = 1 \vert \mathbf{v}) = \sigma \left( b_j + \sum_i v_i w_{i,j} \right)
\end{displaymath} (4.84)

where $\sigma(x)$ is the logistic function $1 / (1 + e^{-x})$. Similarly, it is straightforward to obtain the visible state from the hidden state:
\begin{displaymath}
p(v_i = 1 \vert \mathbf{h}) = \sigma \left( a_i + \sum_j h_j w_{i,j} \right)
\end{displaymath} (4.85)
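Equations 4.84 and 4.85 translate directly into one step of block Gibbs sampling between the two layers. The sketch below builds on the hypothetical parameters introduced earlier; drawing binary states from Bernoulli distributions with these probabilities is a standard choice, not something prescribed by the text.

\begin{verbatim}
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, rng):
    """p(h_j = 1 | v) from Eq. 4.84, followed by a Bernoulli draw per unit."""
    p_h = sigmoid(b + v @ W)
    return p_h, (rng.random(p_h.shape) < p_h).astype(int)

def sample_visible(h, rng):
    """p(v_i = 1 | h) from Eq. 4.85, followed by a Bernoulli draw per unit."""
    p_v = sigmoid(a + W @ h)
    return p_v, (rng.random(p_v.shape) < p_v).astype(int)

rng = np.random.default_rng(0)
p_h0, h0 = sample_hidden(v, rng)      # hidden layer given the visible data
p_v1, v1 = sample_visible(h0, rng)    # reconstruction of the visible layer
\end{verbatim}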

Obtaining the parameters $(\mathbf{a}, \mathbf{b}, \mathbf{W})$ that represent all the training inputs well is, however, a very complex task. A much faster procedure, contrastive divergence (CD), was proposed by Hinton in 2002, and it is only since then that RBMs can be trained efficiently (Hin12).
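The update rule itself is not spelled out here; as a hedged sketch only, one CD-1 step (positive statistics from the data, negative statistics after a single Gibbs step) could be written as follows, reusing the sampling functions above. The learning rate and the use of hidden probabilities rather than binary samples in the statistics are common conventions, not requirements taken from the text.

\begin{verbatim}
def cd1_update(v_data, rng, lr=0.1):
    """One approximate CD-1 gradient step for a single training vector."""
    global W, a, b
    # Positive phase: hidden activations driven by the data.
    p_h0, h0 = sample_hidden(v_data, rng)
    # Negative phase: one Gibbs step (reconstruction), then hidden probabilities.
    p_v1, v1 = sample_visible(h0, rng)
    p_h1, _ = sample_hidden(v1, rng)
    # Parameter updates from the difference of the two statistics.
    W += lr * (np.outer(v_data, p_h0) - np.outer(v1, p_h1))
    a += lr * (v_data - v1)
    b += lr * (p_h0 - p_h1)

for _ in range(100):                  # toy training loop on the single vector v
    cd1_update(v, rng)
\end{verbatim}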
