Transformer
6 minute read
Transformer replaced recurrence with attention, enabling faster training and superior performance in NLP and AI tasks.

Source: Attention is all you need, Vaswani et al., 2017; https://arxiv.org/pdf/1706.03762
Transformer Architecture
The Transformer was designed for machine translation, so it has 2 parts, viz., an encoder and a decoder.
Encoder
The encoder is composed of a stack of N = 6 identical layers.
- Each encoder layer has 2 sub-layers:
  - Multi-Head Self Attention
  - Fully connected Neural Network (Feed Forward)
- A residual connection is applied around each of the sub-layers.
- Each sub-layer is followed by layer normalization.
Decoder
The decoder is also composed of a stack of N = 6 identical layers.
- Each decoder layer has 3 sub-layers:
  - Masked Multi-Head Attention
  - Multi-Head Encoder-Decoder (Cross) Attention
  - Fully connected Neural Network (Feed Forward)
- A residual connection is applied around each of the sub-layers.
- Each sub-layer is followed by layer normalization.
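To make the layer wiring concrete, here is a minimal NumPy sketch (not code from the paper) of how the sub-layers, residual connections, and layer normalization compose in one encoder layer and one decoder layer; the attention and feed-forward callables are placeholder stand-ins.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize across the feature dimension (d_model) for each token.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sub-layer, followed by LayerNorm (post-LN).
    return layer_norm(x + sublayer(x))

def encoder_layer(x, self_attn, ffn):
    x = add_and_norm(x, self_attn)   # 1) multi-head self-attention
    x = add_and_norm(x, ffn)         # 2) position-wise feed-forward network
    return x

def decoder_layer(x, memory, masked_self_attn, cross_attn, ffn):
    x = add_and_norm(x, masked_self_attn)                 # 1) masked self-attention
    x = add_and_norm(x, lambda q: cross_attn(q, memory))  # 2) encoder-decoder (cross) attention
    x = add_and_norm(x, ffn)                              # 3) feed-forward network
    return x

# Toy usage with identity sub-layers, just to show the plumbing:
x = np.random.randn(5, 512)   # (sequence length, d_model)
out = encoder_layer(x, self_attn=lambda t: t, ffn=lambda t: t)
```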
The Transformer replaced recurrence with attention.
Three kinds of attention are used in the Transformer architecture:
- Multi-Head Self Attention (Encoder)
- Masked Multi-Head Attention (Decoder)
- Encoder Decoder (Cross) Attention (Decoder)
Unlike older models that read left-to-right, self-attention sees the whole sentence at once, allowing it to catch relationships whether they are side-by-side or miles apart.
Every word in a sentence looks at every other word (including itself) to decide which word is most relevant to its own meaning in that specific context.
We have discussed self attention in detail in the previous article.
In a single-head attention system, the model has to condense all relationships into a single attention pattern.
e.g., The Reserve Bank of India headquarters in Mumbai sits on the bank of the Mithi River.
If a word like “bank” has to relate to both the “river” context and the “financial institution” context in a complex sentence, a single head tries to average those two different meanings.
By averaging them, we often end up with a blurry representation that does not capture either meaning well.
Multi-head attention allows the model to “dissect” the word.
Each head can look at different aspects of a sentence, such as syntax (subject-verb), semantics (meaning), pronouns, etc.
- Head 1 can focus on the “river” context while
- Head 2 focuses on the “financial institution” context.
- No need to compromise (average).
- Also, each head can be processed in parallel.
Multi-Head Attention

Multi-Head
Each head is an independent attention mechanism.
Each head has its own set of learnable weight matrices (\(W_i^Q, W_i^K, W_i^V\)).
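A minimal NumPy sketch of multi-head attention, assuming \(d_{model}\) = 512 and h = 8 heads as in the paper; the weight matrices here are random stand-ins for the learned \(W_i^Q, W_i^K, W_i^V\) and the output projection \(W^O\).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, h=8, d_model=512, rng=np.random.default_rng(0)):
    d_k = d_model // h
    heads = []
    for _ in range(h):
        # Each head has its own (here randomly initialized) projection matrices.
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) * 0.02 for _ in range(3))
        heads.append(attention(X @ Wq, X @ Wk, X @ Wv))
    Wo = rng.standard_normal((h * d_k, d_model)) * 0.02
    # Concatenate the h heads and project back to d_model.
    return np.concatenate(heads, axis=-1) @ Wo

X = np.random.randn(6, 512)      # 6 tokens, d_model = 512
out = multi_head_attention(X)    # -> shape (6, 512)
```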

In the decoder (auto-regressive), a token can only see itself and the tokens that came before it.
So, we add a mask \(M\) to the attention scores before the softmax, to hide the future tokens:
\[\text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V\]
where, \(M\): Mask (a square matrix of the same size as the attention scores).
- \(M_{ij} = 0\): for positions we want to keep (“past” and “current” tokens)
- \(M_{ij} = -\infty\): for positions we want to hide (“future” tokens).
e.g.
\[M = \begin{bmatrix} 0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0 \end{bmatrix}\]
Note: The softmax function is what actually performs the “masking” by turning those ‘\(-\infty\)’ values into zeros (since \(e^{-\infty} = 0\)).
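A small NumPy illustration (not from the paper) of how the \(-\infty\) mask interacts with softmax: after the softmax, the masked “future” positions get exactly zero weight.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

T = 3
scores = np.random.randn(T, T)   # raw attention scores for 3 tokens
i = np.arange(T)[:, None]        # query positions (rows)
j = np.arange(T)[None, :]        # key positions (columns)
mask = np.where(j > i, -np.inf, 0.0)   # -inf strictly above the diagonal (future tokens)

weights = softmax(scores + mask)
print(np.round(weights, 2))      # row i has non-zero weights only for columns j <= i
```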
In encoder-decoder (cross) attention, the “queries” come from the previous decoder layer, and the “keys” and “values” come from the output of the encoder.
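A hedged sketch of where the inputs come from in cross attention; the projection matrices are random stand-ins for learned weights.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_x, encoder_out, Wq, Wk, Wv):
    Q = decoder_x @ Wq    # queries: from the previous decoder layer
    K = encoder_out @ Wk  # keys: from the encoder output
    V = encoder_out @ Wv  # values: from the encoder output
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

d_model, d_k = 512, 64
rng = np.random.default_rng(0)
decoder_x = rng.standard_normal((4, d_model))    # 4 target-side tokens
encoder_out = rng.standard_normal((7, d_model))  # 7 source-side tokens
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) * 0.02 for _ in range(3))
out = cross_attention(decoder_x, encoder_out, Wq, Wk, Wv)   # -> shape (4, 64)
```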
The fully connected Feed Forward Network introduces non-linear activation functions, which allow the model to learn complex, non-linear relationships that attention (essentially a weighted averaging of value vectors) alone cannot capture.
\[FFN(x) = \max(0,\; xW_1 + b_1)W_2 + b_2\]
Feed Forward Network

Note: The dimensionality of input and output is \(d_{model}\) = 512, and the inner-layer has dimensionality \(d_{ff}\) = 2048.
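A minimal NumPy version of the position-wise FFN with the paper's dimensions (\(d_{model}\) = 512, \(d_{ff}\) = 2048); the weights are random stand-ins for the learned parameters.

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((6, d_model))   # 6 tokens
out = ffn(x)                            # -> shape (6, 512)
```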

- A residual connection is used to mitigate the vanishing gradient problem.
- Skip-connections allow the original input features to skip layers and be preserved, which helps the model avoid losing information in deep networks.
- But the most vital role of the residual connection is to solve the “degradation problem”.
Degradation Problem
Counter-intuitively, as model depth increases, accuracy tends to saturate and then degrade rapidly;
not because of overfitting, but because of optimization failure.
In a deep neural network (without residuals), the optimization surface becomes so rugged and chaotic that it is very difficult to reach a good minimum through gradient descent.

Note: A layer can simply pass the input through unchanged (identity mapping) if it does not find useful features.
Unlike Batch Normalization, which normalizes across the mini-batch for each individual feature, Layer Normalization normalizes across all the features (dimensions) for a single training instance.
- By centering the activations around a mean (\(\mu\)) of zero and scaling them to a variance (\(\sigma^2\)) of one, it prevents exploding or vanishing gradients and enables faster convergence.
- The mean (\(\mu\)) and variance (\(\sigma^2\)) are calculated over all features(dimensions) for that specific token.
Reason: Transformers primarily use LayerNorm because it treats each token independently.
This is critical for natural language processing (NLP) where sentences in a batch often have different lengths.
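A small NumPy comparison of the two normalization axes (illustrative only): LayerNorm normalizes over the feature dimension of each token, while BatchNorm would normalize each feature over the batch.

```python
import numpy as np

x = np.random.randn(4, 6, 512)   # (batch, sequence length, d_model)

# LayerNorm: statistics over the last (feature) axis, computed per token.
mu_ln = x.mean(axis=-1, keepdims=True)        # shape (4, 6, 1)
var_ln = x.var(axis=-1, keepdims=True)
layer_norm = (x - mu_ln) / np.sqrt(var_ln + 1e-6)

# BatchNorm (for contrast): statistics over the batch (and sequence), per feature.
mu_bn = x.mean(axis=(0, 1), keepdims=True)    # shape (1, 1, 512)
var_bn = x.var(axis=(0, 1), keepdims=True)
batch_norm = (x - mu_bn) / np.sqrt(var_bn + 1e-6)
```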
Limitations of Self Attention
The self-attention mechanism treats a sentence as a set of tokens rather than a sequence, so it is inherently permutation-invariant.
e.g. “Cat eats fish.” looks the same to the model as “Fish eats cat.”
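A quick NumPy check of this point (illustrative, with a random projection standing in for learned weights): plain self-attention has no built-in notion of token order, so shuffling the input tokens just produces the same outputs in the shuffled order.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W):
    Q = K = V = X @ W
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))          # 3 tokens ("cat", "eats", "fish")
W = rng.standard_normal((8, 8)) * 0.1
perm = [2, 1, 0]                         # reverse the word order

out = self_attention(X, W)
out_perm = self_attention(X[perm], W)
print(np.allclose(out_perm, out[perm]))  # True: only the order changes, not the values
```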
Positional Encoding
So, in order to give the model information about the position of words, the Transformer architecture introduced positional encoding:
\[PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]
where ‘\(pos\)’ is the position and ‘\(i\)’ is the dimension index.
In simple terms, each position ‘\(pos\)’ is encoded using sine and cosine functions.
This was mainly chosen to easily represent the ‘relative position’ of two words as a linear function.
Note: The formula uses \(i\) (the dimension index) to vary the frequency of the sine/cosine waves across the \(d_{model}\)-dimensional vector.
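A NumPy sketch of the sinusoidal encoding described above: even dimensions get sine, odd dimensions get cosine, with the frequency decreasing as \(i\) grows.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)   # one frequency per pair of dimensions
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# The encoding is simply added to the token embeddings: x = embedding + pe[:seq_len]
```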
Relative Position
Say, we want to know the relative position for a word that is ‘\(k\)’ steps away.
- Position of word 1: ‘\(pos\)’ 
- Position of word 2: ‘\(pos\)’+ ‘\(k\)’
We know that:
\[\sin(A + B) = \sin(A)\cos(B) + \cos(A)\sin(B)\]
\[\cos(A + B) = \cos(A)\cos(B) - \sin(A)\sin(B)\]
\[\sin(pos + k) = \sin(pos)\cos(k) + \cos(pos)\sin(k)\]
\[\cos(pos + k) = \cos(pos)\cos(k) - \sin(pos)\sin(k)\]
So, we can represent the encoding of the second word at ‘\(pos + k\)’ as a linear function of the encoding of the first word at ‘\(pos\)’:
\[\begin{bmatrix} \sin(pos + k) \\ \cos(pos + k) \end{bmatrix} = \begin{bmatrix} \cos(k) & \sin(k) \\ -\sin(k) & \cos(k) \end{bmatrix} \begin{bmatrix} \sin(pos) \\ \cos(pos) \end{bmatrix}\]
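A quick numerical check of the identity above (illustrative): the rotation matrix built from the offset \(k\) alone maps the encoding at \(pos\) to the encoding at \(pos + k\).

```python
import numpy as np

pos, k = 7.0, 3.0
rotation = np.array([[np.cos(k),  np.sin(k)],
                     [-np.sin(k), np.cos(k)]])
lhs = np.array([np.sin(pos + k), np.cos(pos + k)])
rhs = rotation @ np.array([np.sin(pos), np.cos(pos)])
print(np.allclose(lhs, rhs))   # True
```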
The discussion of the original Transformer architecture ends here; the positional encoding techniques below are not part of it. They are just added here for the time being, because the concepts are related.
RoPE (Rotary Position Embedding) is the positional encoding used in the Llama and Mistral LLMs.
It does not add a vector to the embedding; instead, it rotates the Query (\(Q\)) and Key (\(K\)) vectors in 2D planes.
After rotation, when we take the dot product, we automatically get the relative position of the tokens:
\[\begin{bmatrix} q'_1 \\ q'_2 \end{bmatrix} = \begin{bmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{bmatrix} \begin{bmatrix} q_1 \\ q_2 \end{bmatrix}\]
where ‘\(m\)’ is the position (and \(\theta\) is the rotation frequency for that 2D plane).
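A minimal NumPy sketch of the rotation idea for a single 2D plane (illustrative, not the full RoPE implementation): the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m − n.

```python
import numpy as np

def rotate(vec, pos, theta=0.5):
    # Rotate a 2D (query or key) vector by an angle proportional to its position.
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([[c, -s], [s, c]]) @ vec

rng = np.random.default_rng(0)
q, k = rng.standard_normal(2), rng.standard_normal(2)

# Same relative offset (m - n = 2) at different absolute positions:
score_a = rotate(q, 5) @ rotate(k, 3)
score_b = rotate(q, 12) @ rotate(k, 10)
print(np.allclose(score_a, score_b))   # True: the score depends only on m - n
```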
ALiBi (Attention with Linear Biases) ignores the embeddings entirely and modifies the attention scores directly.
For every head, we subtract a penalty from the attention score that is proportional to the distance between the tokens:
\[\text{score}(q_i, k_j) = q_i \cdot k_j - m \cdot (i - j)\]
where ‘\(m\)’ is a head-specific slope.
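A small NumPy sketch of this distance penalty (illustrative; the slope value here is arbitrary): each score between query position i and key position j ≤ i is reduced by m · (i − j).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

T, m = 4, 0.5                              # sequence length, head-specific slope
scores = np.random.randn(T, T)             # raw q.k attention scores
i, j = np.indices((T, T))
bias = -m * (i - j)                        # linear penalty, grows with distance
causal = np.where(j <= i, 0.0, -np.inf)    # keep it causal, as in the decoder
weights = softmax(scores + bias + causal)  # nearby tokens get relatively more weight
```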