Understanding Self-Attention: The Core Mechanism of Transformers

Self-attention is a fundamental building block of transformer models, enabling them to weigh the importance of different tokens relative to each other within a sequence. This mechanism computes a weighted sum of value vectors, where weights are determined by the similarity between query and key vectors derived from input embeddings. The process allows the model to dynamically focus on relevant parts of the input for each position in the sequence.

The core steps of self-attention involve:

Projection to Queries, Keys, and Values: Each input token embedding is linearly projected into three vectors, a query (Q), a key (K), and a value (V), using learned weight matrices. In matrix form:

\[
Q = X W_Q, \quad K = X W_K, \quad V = X W_V
\]

Here, \(X\) is the input matrix of token embeddings, and \(W_Q\), \(W_K\), \(W_V\) are trainable parameters. The projection allows the model to map the same input to different roles (query, key, value) for the attention calculation.

Computing Attention Scores: The attention score for each query-key pair is calculated as a scaled dot product:

\[
\text{scores} = \frac{Q K^T}{\sqrt{d_k}}
\]

Here, \(d_k\) is the dimension of the key vectors. Scaling by \(\sqrt{d_k}\) prevents dot products from becoming too large, which could push the softmax function into regions with very small gradients.

Softmax Normalization: The scores are passed through a softmax function to convert them into attention weights that sum to 1 for each query. This step ensures that each position’s attention is distributed over all tokens.

\[
\text{weights} = \text{softmax}(\text{scores})
\]

Weighted Sum of Values: The output for each token is computed as a weighted sum of all value vectors:

\[
\text{output} = \text{weights} \, V
\]

This allows each token to attend to every other token, including itself, capturing contextual relationships across the entire sequence.
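
Taken together, the four steps compose into a single expression, which is the scaled dot-product attention of Vaswani et al. (2017) and the \(\text{Attention}\) function used in the multi-head formula later in this article:

\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V
\]

The softmax is applied row by row, so each row of the result is a weighted average of the rows of \(V\).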

To illustrate, consider the following Python example using small matrices. In this code, each token’s embedding is a 4-dimensional vector, and the projections are represented by identity matrices for simplicity:

Note: The following code is an illustrative example and has not been verified against official documentation. Please refer to the official docs for production-ready code.

import numpy as np

# Input embeddings for 3 tokens, each with dimension 4
X = np.array([
 [0.1, 0.2, 0.3, 0.4],
 [0.5, 0.6, 0.7, 0.8],
 [0.9, 1.0, 1.1, 1.2]
])

# Identity matrices for simplicity representing W_Q, W_K, W_V
W_Q = np.eye(4)
W_K = np.eye(4)
W_V = np.eye(4)

# Project input to queries, keys, values
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Calculate scaled dot product attention scores
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)

# Row-wise softmax; subtracting the row maximum avoids overflow in exp()
def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

weights = softmax(scores)

# Compute weighted sum of values
output = weights @ V

print("Attention Weights:\n", weights)
print("Output:\n", output)
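
The effect of the \(\sqrt{d_k}\) scaling can also be checked numerically. The sketch below is an illustration rather than part of the example above: it draws random queries and keys from a standard normal distribution (an assumption chosen for simplicity) and reports the average of the largest attention weight with and without scaling. The helper name `largest_weight` and the choice of 8 keys are hypothetical.

```python
import numpy as np

def softmax(x):
    # Subtract the row-wise max for numerical stability
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
num_keys = 8  # hypothetical number of tokens being attended to

def largest_weight(d_k, scale):
    # Average, over 1000 random queries, of the largest softmax weight
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((num_keys, d_k))
    scores = q @ k.T
    if scale:
        scores = scores / np.sqrt(d_k)
    return softmax(scores).max(axis=-1).mean()

for d_k in (4, 64, 512):
    unscaled = largest_weight(d_k, scale=False)
    scaled = largest_weight(d_k, scale=True)
    print(f"d_k={d_k:4d}  unscaled={unscaled:.3f}  scaled={scaled:.3f}")
```

Without scaling, the largest weight should move toward 1 as \(d_k\) grows, meaning the softmax is close to one-hot and passes little gradient to the other positions. With scaling, it should stay at a similar level across the tested sizes.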

Multi-Head Attention and Its Advantages

A limitation of single-head self-attention is that it produces only one set of attention weights per position, so different kinds of relationships between tokens must share the same weighting. Multi-head attention addresses this by running multiple self-attention operations in parallel, each with separate learned projections. Each head can focus on a different representation subspace, enabling the model to capture diverse contextual features simultaneously.

Concretely, each head computes its own queries, keys, and values:

\[
\text{head}_i = \text{Attention}(X W_Q^i, X W_K^i, X W_V^i)
\]

The outputs of all heads are concatenated and linearly transformed:

\[
\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W_O
\]

where \(h\) is the number of heads and \(W_O\) is a learned projection matrix.
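
The two formulas above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the sizes (3 tokens, model width 8, 2 heads of width 4), the random weights, and the function names `attention` and `multi_head_attention` are all hypothetical.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for a single head
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V are lists holding one projection matrix per head
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concatenate head outputs along the feature axis, then mix with W_O
    return np.concatenate(heads, axis=-1) @ W_O

# Assumed sizes for illustration only
seq_len, d_model, h = 3, 8, 2
d_head = d_model // h

rng = np.random.default_rng(0)
X = rng.standard_normal((seq_len, d_model))
W_Q = [rng.standard_normal((d_model, d_head)) for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_head)) for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_head)) for _ in range(h)]
W_O = rng.standard_normal((h * d_head, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (3, 8): one d_model-sized vector per token
```

Keeping one matrix per head in a Python list mirrors the notation \(W_Q^i\), \(W_K^i\), \(W_V^i\). Practical implementations usually fuse the heads into a single larger matrix and reshape, which is faster but harder to read.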

The benefits include:

Capturing different types of dependencies simultaneously. For example, in a translation task, one head might focus on subject-verb relationships, while another might focus on object placement.
Improving model expressiveness at roughly constant computational cost. In Vaswani et al. (2017), each head works in a reduced dimension of \(d_{\text{model}} / h\), so the total cost is similar to that of a single head with full dimensionality.
Facilitating parallelization during training and inference, which makes transformer models well-suited for GPU acceleration.
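
The cost argument can be made concrete by counting projection weights. The sketch below assumes the convention from Vaswani et al. (2017) that each head has width \(d_{\text{model}} / h\), and it ignores bias terms. The function name `projection_params` and the width of 512 are chosen for illustration.

```python
def projection_params(d_model, num_heads):
    # Per-head width is d_model / num_heads (assumed convention)
    d_head = d_model // num_heads
    per_head = 3 * d_model * d_head               # W_Q^i, W_K^i, W_V^i
    output_proj = (num_heads * d_head) * d_model  # W_O
    return num_heads * per_head + output_proj

for h in (1, 2, 4, 8):
    print(h, projection_params(512, h))  # 1048576 for every h
```

Under this convention the number of projection weights does not depend on the number of heads, which is why adding heads does not by itself make the layer larger.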

[Figure: Multi-head attention diagram. Multi-head attention enables parallel attention to different relationships in the sequence.]

To see how to build such mechanisms from scratch, read Build Tiny Character-Level Language Model from Scratch in Python, which provides a step-by-step walkthrough using Python.

Residual Connections and Layer Normalization

Training deep transformer architectures is stabilized by two key components:

Residual Connections: These add the input of a layer back to its output, helping gradients flow and preventing degradation in deep networks:

\[
\text{LayerOutput} = \text{Layer}(X) + X
\]

For example, if a transformer block processes an input vector and produces an output, adding the input back ensures that the original information is preserved, and the network can learn modifications on top of it.

Layer Normalization: This normalizes activations across the feature dimension for each token independently, reducing internal covariate shift and accelerating convergence:

\[
\text{LayerNorm}(x) = \frac{x - \mu}{\sigma + \epsilon} \times \gamma + \beta
\]

Here, \(\mu\) and \(\sigma\) are the mean and standard deviation of a single token's features, \(\epsilon\) is a small constant that avoids division by zero, and \(\gamma\), \(\beta\) are learned scale and shift parameters. Normalization keeps activations in a consistent range from layer to layer, which makes the network easier to train at scale.

Together, these components stabilize training and improve the performance of transformer blocks in large models, making deeper architectures less prone to vanishing or exploding gradients.
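
Both components fit in a few lines of NumPy. The sketch below follows the two formulas above, including the \(\gamma\) and \(\beta\) parameters. The helper name `add_and_norm` and the random linear map standing in for a sub-layer are hypothetical and used only for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, epsilon=1e-6):
    # Normalize each token (row) across its feature dimension
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + epsilon) * gamma + beta

def add_and_norm(x, sublayer, gamma, beta):
    # Residual connection followed by layer normalization
    return layer_norm(sublayer(x) + x, gamma, beta)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))  # 3 tokens, 4 features
gamma = np.ones(4)               # scale, commonly initialized to 1
beta = np.zeros(4)               # shift, commonly initialized to 0

# Stand-in sub-layer: a fixed random linear map instead of real attention
W = rng.standard_normal((4, 4))
y = add_and_norm(x, lambda t: t @ W, gamma, beta)

print(y.mean(axis=-1))  # close to 0 for every token
print(y.std(axis=-1))   # close to 1 for every token
```

With \(\gamma = 1\) and \(\beta = 0\) the output of each token has approximately zero mean and unit standard deviation. During training, the model can adjust \(\gamma\) and \(\beta\) to rescale and shift the features again where that helps.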

Implementing Transformer Block in Python

The combined mechanisms lead to the standard transformer block architecture, which includes:

Multi-head self-attention sub-layer, which computes context-aware representations.
Add & Norm, which applies a residual connection plus layer normalization for stable learning.
Position-wise feed-forward network (typically a multilayer perceptron with a hidden layer), which processes each token independently to increase model capacity.
Another Add & Norm, providing a second round of stabilization and residual learning.

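Below is a simplified Python implementation using NumPy that illustrates these concepts without external deep learning frameworks. To stay short, it uses a single attention head rather than multiple heads and omits the learned scale and shift parameters of layer normalization. The example favors clarity over efficiency or production readiness.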

Note: The following code is an illustrative example and has not been verified against official documentation. Please refer to the official docs for production-ready code.

import numpy as np

def softmax(x):
    """Row-wise softmax; subtracting the max keeps exp() from overflowing."""
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def layer_norm(x, epsilon=1e-6):
    """Normalize each token across its features (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + epsilon)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise two-layer network applied to each token independently."""
    h = np.maximum(0, x @ W1 + b1)  # ReLU activation
    return h @ W2 + b2

def transformer_block(x, W_q, W_k, W_v, W_o, W1, b1, W2, b2):
    """Single-head transformer block: attention, Add & Norm, FFN, Add & Norm."""
    # Project inputs to queries, keys, values
    Q = x @ W_q
    K = x @ W_k
    V = x @ W_v

    # Scaled dot-product attention
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    attn_weights = softmax(scores)
    context = attn_weights @ V

    # Output projection
    attn_output = context @ W_o

    # Add & Norm
    x = layer_norm(x + attn_output)

    # Feed-forward sub-layer
    ff_output = feed_forward(x, W1, b1, W2, b2)

    # Add & Norm
    output = layer_norm(x + ff_output)

    return output

# Example dimensions
seq_len, embed_dim, ff_hidden = 3, 4, 8

np.random.seed(0)
x = np.random.rand(seq_len, embed_dim)

# Initialize weights randomly
W_q = np.random.rand(embed_dim, embed_dim)
W_k = np.random.rand(embed_dim, embed_dim)
W_v = np.random.rand(embed_dim, embed_dim)
W_o = np.random.rand(embed_dim, embed_dim)

W1 = np.random.rand(embed_dim, ff_hidden)
b1 = np.random.rand(ff_hidden)
W2 = np.random.rand(ff_hidden, embed_dim)
b2 = np.random.rand(embed_dim)

output = transformer_block(x, W_q, W_k, W_v, W_o, W1, b1, W2, b2)
print("Transformer block output:\n", output)

This code:

Projects input embeddings into queries, keys, and values for attention calculation.
Computes scaled dot-product attention and applies softmax to get attention weights.
Applies a learned output projection after combining attention results.
Adds residual connections and normalizes after both attention and feed-forward layers for stability.
Uses a simple two-layer feed-forward network with ReLU activation for nonlinearity.

This minimal example captures the essence of the transformer block as introduced in "Attention Is All You Need" by Vaswani et al. (2017) and used extensively in modern large language models. If you want to see how this fits into building an entire LLM pipeline, check out "Building Modern LLMs: A Developer’s Guide from Scratch".

Comparison: Single-Head vs Multi-Head Attention

| Aspect | Single-Head Attention | Multi-Head Attention | Source |
| --- | --- | --- | --- |
| Parallel Attention Mechanisms | 1 | Typically 8-16 | Vaswani et al., 2017 |
| Contextual Subspace Learning | Single representation space | Multiple representation subspaces | Vaswani et al., 2017 |
| Model Expressiveness | Not measured | Improved, captures diverse patterns | Vaswani et al., 2017 |

This comparison shows how introducing multiple heads enables the model to learn a richer set of relationships compared to a single attention mechanism.

Key Takeaways

Self-attention computes weighted sums of values based on query-key similarity, enabling dynamic contextual focus in transformers.
Queries, keys, and values are linear projections of input vectors learned during training.
Multi-head attention allows transformers to attend to multiple representation subspaces simultaneously, enhancing expressiveness.
Residual connections and layer normalization stabilize training by improving gradient flow and reducing internal covariate shift.
A complete transformer block combines multi-head attention, residuals, layer normalization, and a feed-forward network, which can be implemented in Python for educational purposes.

Next Steps

Now that you understand how self-attention and transformer blocks operate, the next part of this series will cover the training process in detail. It will explain how loss functions like cross-entropy connect to next-token prediction, how training loops operate, and what scaling laws reveal about model performance as size and data increase.
