Transformers Explained: Attention, QKV, and Why LLMs Work

A frontend engineer's guide to the architecture behind GPT: self-attention, positional encoding, and the encoder-decoder split.

May 28, 2025 · 2 min read

ML
Transformers
Deep Learning

The 2017 paper "Attention Is All You Need" replaced recurrence with self-attention, and modern AI was born. No RNN loops. No convolutions over sequences. Just matrices.

Self-attention in one paragraph

Each token asks every other token: "How relevant are you to me?" That relevance is a learned attention score, computed from three projections:

Query (Q): what am I looking for?
Key (K): what do I contain?
Value (V): what information do I pass if selected?

# Scaled dot-product attention
scores = (Q @ K.T) / sqrt(d_k)
weights = softmax(scores)
output = weights @ V

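The sqrt(d_k) scaling matters because dot products grow with dimension: if the components of a query and a key have roughly unit variance, their dot product has variance of about d_k. Dividing by sqrt(d_k) brings the scores back to unit scale, so softmax doesn't saturate (and gradients don't vanish) when dimensions grow large.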
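To make the pseudocode concrete, here is a minimal runnable sketch in plain Python, with no NumPy or PyTorch, so every step is visible. The helper names (`softmax`, `matmul`, `transpose`, `attention`) and the toy Q, K, V values are assumptions made up for illustration; a real model would use batched tensor operations instead of nested lists.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention(Q, K, V, scale=True):
    """Scaled dot-product attention. Returns (output, weights)."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    if scale:
        scores = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V), weights

# Toy example: 3 tokens, d_k = 4, value dimension 2 (made-up numbers).
Q = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
K = Q
V = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, 0.5]]

out, w = attention(Q, K, V)
_, w_raw = attention(Q, K, V, scale=False)

# The same row of weights, with and without the sqrt(d_k) scaling.
print([round(x, 3) for x in w[1]])
print([round(x, 3) for x in w_raw[1]])
```

In this toy case the unscaled weights for the second token pile almost entirely onto one key, while the scaled version stays softer. That is the saturation effect in miniature.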

Multi-head attention

One attention pattern isn't enough. Multi-head attention runs several attention "heads" in parallel, each free to learn different relationships (syntax, coreference, long-range dependencies):

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) @ W_O
where head_i = Attention(Q @ W_Qi, K @ W_Ki, V @ W_Vi)
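The formula above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a production layer: the weight matrices are random stand-ins for learned parameters, the sizes (4 tokens, d_model = 8, 2 heads) are arbitrary, and Q, K, and V are all the same input X, as in self-attention.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix (assumption)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concat: glue each token's per-head vectors together along the feature axis.
    concat = [sum((head[t] for head in outputs), []) for t in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

X = random_matrix(n_tokens, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(n_heads)
]
W_O = random_matrix(n_heads * d_head, d_model, rng)

out = multi_head(X, heads, W_O)
print(len(out), len(out[0]))  # one d_model-sized vector per token
```

Each head works in a smaller d_head-dimensional space, so the heads together cost roughly the same as one full-width attention pass.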

Positional encoding

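On its own, self-attention is blind to word order: shuffle the input tokens and each token's output is unchanged, just shuffled the same way (strictly, it is permutation-equivariant). Transformers inject positional encodings (sinusoidal or learned) so "dog bites man" ≠ "man bites dog".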
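Here is a small sketch of the sinusoidal variant, following the formula from the original paper: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the cosine of the same angle. The three-word vocabulary and its embedding values are invented for the example.

```python
import math

def sinusoidal_encoding(n_positions, d_model):
    """Build a table of sinusoidal position vectors, one row per position."""
    table = []
    for pos in range(n_positions):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share one frequency.
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Hypothetical 4-dimensional token embeddings (made-up values).
emb = {
    "dog":   [0.2, 0.1, 0.0, 0.3],
    "bites": [0.5, 0.4, 0.1, 0.0],
    "man":   [0.0, 0.3, 0.6, 0.1],
}

def encode(sentence):
    """Add the position vector to each token's embedding."""
    tokens = sentence.split()
    pe = sinusoidal_encoding(len(tokens), 4)
    return [[e + p for e, p in zip(emb[tok], pe[pos])]
            for pos, tok in enumerate(tokens)]

a = encode("dog bites man")
b = encode("man bites dog")

print(a[0] == b[2])  # "dog" at position 0 vs "dog" at position 2
print(a[1] == b[1])  # "bites" sits at position 1 in both sentences
```

The same word gets a different input vector depending on where it sits, and that difference is what lets attention tell the two sentences apart.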

Encoder vs decoder

Architecture	Used in	Attention pattern
Encoder	BERT, embeddings	Bidirectional
Decoder	GPT, generation	Causal (left-to-right)
Encoder-decoder	TTS, translation	Bidirectional encoder, causal decoder with cross-attention

GPT is a decoder-only stack: causal masking ensures token i never peeks at token i+1 or anything after it.
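Causal masking is usually done by setting the scores for future positions to negative infinity before the softmax, so their weights come out as exactly zero. A minimal sketch, where the function name `causal_softmax` and the score values are illustrative assumptions:

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where row i may only attend to columns j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions (j > i) get -inf, so exp() turns them into 0.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)  # finite, because position i itself is always visible
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up raw scores for a 3-token sequence.
scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]

weights = causal_softmax(scores)
for row in weights:
    print([round(x, 3) for x in row])
```

The first token can only attend to itself, the second to the first two, and so on. The weight matrix is lower-triangular.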

Why it scales

Attention is O(n²) in sequence length: expensive, but embarrassingly parallel on GPUs, because every pairwise score can be computed at once rather than one step at a time as in an RNN. That parallelism, plus scaling data and parameters, is why trillion-parameter models became feasible.
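A quick back-of-the-envelope calculation shows what quadratic growth means in practice. The assumptions here are mine: 2 bytes per score (16-bit floats), one head in one layer, and the full score matrix held in memory at once, which optimized kernels may avoid.

```python
def score_entries(n_tokens):
    """Number of pairwise attention scores for one head."""
    return n_tokens * n_tokens

def score_matrix_mib(n_tokens, bytes_per_entry=2):
    """Size of one full score matrix in MiB (assumes 16-bit scores)."""
    return score_entries(n_tokens) * bytes_per_entry / 2**20

for n in (1024, 2048, 4096):
    print(n, score_entries(n), score_matrix_mib(n))
# 1024 1048576 2.0
# 2048 4194304 8.0
# 4096 16777216 32.0
```

Doubling the sequence length quadruples the number of scores, which is why long contexts get expensive quickly.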

Think of attention as a dynamic routing table. Each token broadcasts a query; others respond with keys; values flow proportional to match strength. No explicit graph: the network learns the wiring.

That's the entire revolution in one mechanism.
