Recurrent Neural Networks and Transformers (for sequences and images)

Figure 4.11: Conceptual diagram of the self-attention mechanism in a Transformer.

Convolutional Neural Networks (CNNs) represented a breakthrough in the processing of static images, due to their ability to capture hierarchies of local features through convolutional filters and pooling. However, traditional CNNs are not well suited for handling sequential data such as text, time signals, or video, where order and temporal dependencies among elements are crucial. To address these challenges, Recurrent Neural Networks (RNNs) were introduced, in which neurons are equipped with recurrent connections that enable the maintenance of a memory of past states. RNNs have been applied to tasks such as speech recognition, machine translation, time-series analysis, and automatic image captioning. Nevertheless, standard RNNs struggle to learn long-term dependencies due to the vanishing and exploding gradient problems. To mitigate these limitations, more sophisticated architectures such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) were developed, capable of regulating which information to retain or discard through gating mechanisms.
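
To make the recurrence concrete, the following is a minimal NumPy sketch (not taken from the text; sizes, weight names, and initialization are purely illustrative) of a vanilla RNN cell: the hidden state $h_t$ acts as the network's compressed memory of everything seen so far.

import numpy as np

# Illustrative sketch of a vanilla RNN cell: the hidden state h_t is updated
# at every time step from the current input and the previous state.
def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: the new state mixes the current input with memory."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

d_in, d_hidden, T = 8, 16, 5            # input size, state size, sequence length (arbitrary)
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(d_in, d_hidden))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b_h  = np.zeros(d_hidden)

h = np.zeros(d_hidden)                   # initial (empty) memory
for x_t in rng.normal(size=(T, d_in)):   # unroll over the sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)                           # (16,): a fixed-size summary of the whole sequence

The fixed size of $h$ is precisely what motivates the discussion that follows: however long the input, its history must fit into this single state vector.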

Despite these advances, the first generation of deep sequential networks relied on an encoder-decoder paradigm, in which the information from the input sequence was compressed into a lower-dimensional tensor, expected to preserve as much relevant information as possible for the downstream task. This approach, however, suffers from an intrinsic limitation: important input elements may be overlooked or attenuated during compression.

A fundamental shift came with the introduction of the attention mechanism. The underlying idea of attention is simple yet powerful: instead of compressing all information into a single vector, the model can dynamically "focus" on the most relevant parts of the input sequence. Attention weights are computed according to the relevance of each element with respect to the others, overcoming the limitations of compressed recurrent representations.

Let $X \in \mathbb{R}^{n \times d}$ be the input sequence matrix, where $n$ is the number of tokens (the individual units into which the algorithm divides the input sequence, varying from sequence to sequence) and $d$ is the embedding dimension (a numerical vector representing each token, fixed for the model and encoding semantic and syntactic information).

The query ($Q$), key ($K$), and value ($V$) matrices are obtained through linear projections:

\begin{displaymath}
Q = X W^Q, \quad K = X W^K, \quad V = X W^V
\end{displaymath} (4.89)

where $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$ are learned projection matrices that map each token embedding into the query, key, and value spaces, respectively, with $d_k$ the projection dimension.

The self-attention mechanism can then be expressed in scalar form (for a single token) as:

\begin{displaymath}
\alpha_{i,j} = \text{softmax}_j\!\left(\frac{Q_i K_j^T}{\sqrt{d_k}}\right), \quad
z_i = \sum_{j=1}^{n} \alpha_{i,j} V_j
\end{displaymath} (4.90)

where $\alpha_{i,j}$ is the attention weight that token $i$ assigns to token $j$, $Q_i$ and $K_j$ are the $i$-th query and $j$-th key rows, $V_j$ is the $j$-th value row, $\sqrt{d_k}$ is a scaling factor that stabilizes the softmax, and $z_i$ is the resulting contextual representation of token $i$, i.e. the $i$-th row of the output matrix $Z$.

Intuitively, attention can be seen as a dynamic generalization of weighting methods such as Bag of Words or TF-IDF. Unlike TF-IDF, which assigns static weights to terms, attention assigns context- and task-dependent weights, enabling the model to focus selectively on the most relevant parts. Semantically, the resulting matrix $Z$ has the same length as the input ($n$ tokens in, $n$ tokens out), but each token is enriched with contextual information from the entire sequence.
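
Equations (4.89) and (4.90) condense into a few lines of NumPy. The sketch below is illustrative (random weights, arbitrary sizes, single head): the linear projections of Eq. (4.89) followed by the scaled dot-product weights and the weighted sum of Eq. (4.90).

import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention, Eqs. (4.89)-(4.90)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # linear projections, Eq. (4.89)
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))        # attention weights alpha_{i,j}
    return A @ V                               # Z: one contextualized vector per token, Eq. (4.90)

n, d, d_k = 6, 32, 8                           # illustrative sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                    # n token embeddings of dimension d
W_Q, W_K, W_V = (rng.normal(scale=d**-0.5, size=(d, d_k)) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)                                 # (6, 8): n tokens in, n tokens out

Note that each row of $Z$ depends on every row of $X$: the context is injected by the attention weights rather than by a recurrent state.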

The attention mechanism directly led to the development of Transformers (VSP$^+$17), which are now the de facto standard for sequence modeling in natural language processing, computer vision, and multimodal learning. In Transformers, the central operator is self-attention, which models relationships among all sequence elements in parallel and directly. Compared to RNNs, Transformers provide substantial advantages in terms of parallelization, numerical stability, and the ability to capture long-range dependencies.

In computer vision, the application of Transformers gave rise to the Vision Transformers (ViT) (DBK$^+$20), where an image is partitioned into small patches treated as a sequence, analogous to words in a text. These models have demonstrated performance that is competitive with or superior to CNNs on various classification, recognition, and segmentation tasks, particularly in the presence of large-scale datasets.
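
As a concrete illustration of the "image as a sequence" idea, the following NumPy sketch (illustrative sizes; a real ViT additionally applies a learned linear projection and positional embeddings to the patches) splits an image into non-overlapping patches, each flattened into a token vector.

import numpy as np

def image_to_patches(img, P):
    """Split an H x W x C image into non-overlapping P x P patches,
    each flattened into a vector of length P*P*C (one 'token' per patch)."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0, "image sides must be multiples of P"
    return (img.reshape(H // P, P, W // P, P, C)
               .transpose(0, 2, 1, 3, 4)          # group the patch grid together
               .reshape(-1, P * P * C))           # flatten each patch into a token

img = np.random.default_rng(0).random((224, 224, 3))   # illustrative RGB image
tokens = image_to_patches(img, P=16)                    # ViT-style 16x16 patches
print(tokens.shape)                                     # (196, 768): n = 196 tokens of dimension 768

From this point on, the patch tokens are processed exactly like word tokens in the self-attention equations above.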

In practical applications, a single self-attention module is insufficient to extract adequate information from the input tokens. The multi-head self-attention mechanism extends the idea of self-attention by allowing the model to examine the sequence from multiple perspectives simultaneously. Specifically, the input is projected $h$ times with distinct learned matrices, each head computes attention independently in a subspace of dimension $d_k = d/h$, and the $h$ head outputs are concatenated and linearly projected back to dimension $d$, as sketched below.
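
The following NumPy sketch (illustrative, random weights; the output projection is here called W_O) shows the multi-head variant: each head attends in its own $d_k = d/h$ subspace, and the concatenated head outputs are mixed by the output projection.

import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head self-attention sketch: h heads, each on a d_k = d/h subspace,
    concatenated and mixed by the output projection W_O."""
    n, d = X.shape
    d_k = d // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                    # (n, d) each
    def split(M):                                          # split into h heads of width d_k
        return M.reshape(n, h, d_k).transpose(1, 0, 2)     # (h, n, d_k)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)     # (h, n, n) scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)           # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)                     # per-head attention weights
    heads = A @ Vh                                         # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)        # concatenate the h heads
    return concat @ W_O                                    # (n, d)

n, d, h = 6, 512, 8                                        # geometry of the base Transformer
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W_Q, W_K, W_V, W_O = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)   # (6, 512)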


Table 4.1: Examples of geometry (layers, embedding dimension $d$, heads $h$) for well-known Transformer models.

Model                          | Layers                 | $d$   | $h$ | $d_k = d/h$
Transformer (base) (VSP$^+$17) | 6 encoder + 6 decoder  | 512   | 8   | 64
BERT-Base                      | 12 encoder             | 768   | 12  | 64
BERT-Large                     | 24 encoder             | 1024  | 16  | 64
GPT-3 (175B)                   | 96 decoder             | 12288 | 96  | 128

Today, RNNs and Transformers are best seen as complementary tools: the former remain useful in scenarios with relatively short sequences or limited resources, while the latter constitute the foundation of modern state-of-the-art deep learning architectures. The evolution from recurrent to attention-based mechanisms has marked a paradigm shift: from the notion of compressed memory to dynamic, context-dependent representations, where the model autonomously decides "what to attend to" for each element of the sequence.
