Gradient Descent: The Hill-Climbing Algorithm Behind Deep Learning

Loss surfaces, learning rates, momentum, and Adam: the optimization loop that makes neural networks actually learn.

May 4, 2025 · 2 min read

ML
Math
Deep Learning

Neural networks are function approximators. Gradient descent is how they find those functions: by rolling downhill on a loss surface in million-dimensional space.

The loop

Forward pass: compute predictions
Loss function: measure error (MSE, cross-entropy)
Backprop: compute gradients via the chain rule
Update: nudge weights opposite the gradient

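for epoch in range(epochs):
    for batch in batches:                              # one update per mini-batch
        loss = compute_loss(model, batch)              # forward pass + loss
        grads = autograd(loss, model.params)           # backprop
        for param, grad in zip(model.params, grads):
            param -= learning_rate * grad              # step opposite the gradient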

Repeat until loss plateaus or your cloud bill explodes.
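
As a minimal sketch of the four steps above, here is the loop made concrete for a one-feature linear model in plain Python. The name `train_linear`, the toy data, and the hyperparameters are illustrative assumptions, and the gradients are derived by hand with the chain rule rather than by an autograd library.

```python
# Sketch: fit y = w*x + b by full-batch gradient descent on MSE.
# Data, learning rate, and epoch count are assumptions for illustration.

def train_linear(xs, ys, learning_rate=0.05, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for _ in range(epochs):
        # Forward pass: compute predictions.
        preds = [w * x + b for x in xs]
        # Loss: mean squared error.
        errors = [p - y for p, y in zip(preds, ys)]
        loss = sum(e * e for e in errors) / n
        history.append(loss)
        # Backprop by hand: d(loss)/dw and d(loss)/db via the chain rule.
        grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(2 * e for e in errors) / n
        # Update: step opposite the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]   # ground truth: w = 2, b = 1
w, b, history = train_linear(xs, ys)
```

On this noiseless data the fitted `w` and `b` should land very close to 2 and 1, and `history` records the loss at each epoch so you can see the plateau.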

Learning rate matters

Too high → oscillation or divergence. Too low → weeks of training.
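
To see both failure modes on the simplest possible surface, here is a small sketch on f(x) = x², whose gradient is 2x. The function `descend` and the three learning rates are assumptions chosen for illustration, not recommended values.

```python
def descend(learning_rate, steps=20, x0=1.0):
    """Run gradient descent on f(x) = x**2, whose gradient is 2*x."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x

too_high = descend(1.1)      # each step multiplies x by -1.2: flips sign and grows
reasonable = descend(0.1)    # each step multiplies x by 0.8: steady shrink
too_low = descend(0.0001)    # each step multiplies x by 0.9998: barely moves
```

After 20 steps the high learning rate has moved further from the minimum than where it started, the low one has hardly left the starting point, and the middle one is close to zero.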

Strategy	Idea
Fixed LR	Simple, fragile
Step decay	Drop LR at milestones
Cosine annealing	Smooth decay curve
Warmup	Ramp the LR up from a small value to stabilize early training
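
The schedules in the table can each be sketched as a function from step number to learning rate. The function names, the milestone positions, and the drop factor below are assumptions for illustration; real frameworks ship their own scheduler classes with different interfaces.

```python
import math

def step_decay(step, base_lr, drop=0.1, milestones=(30, 60)):
    """Multiply the LR by `drop` once for every milestone already reached."""
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * (drop ** passed)

def cosine_annealing(step, base_lr, total_steps, min_lr=0.0):
    """Decay smoothly from base_lr to min_lr along half a cosine wave."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def linear_warmup(step, base_lr, warmup_steps):
    """Ramp linearly up to base_lr over warmup_steps, then hold."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

Warmup is usually combined with one of the decay schedules: ramp up for the first few hundred or thousand steps, then hand over to step decay or cosine annealing.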

Momentum

SGD with momentum accumulates velocity, so it can roll past shallow local minima and accelerate through flat regions:

v = β * v + gradient
param -= lr * v

Think ball rolling downhill, gaining inertia.
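
Here is a small sketch that applies the two-line update above to a stretched bowl, f(x, y) = 0.5 * (x² + 100y²), and compares it against plain gradient descent (β = 0) at the same learning rate. The test function, the starting point, and the hyperparameters are assumptions picked to make the effect visible.

```python
def grad(p):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = p
    return [x, 100.0 * y]

def loss(p):
    x, y = p
    return 0.5 * (x * x + 100.0 * y * y)

def run(lr, beta, steps=100):
    p = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(p)
        # v = beta * v + gradient
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        # param -= lr * v
        p = [pi - lr * vi for pi, vi in zip(p, v)]
    return loss(p)

plain_loss = run(lr=0.01, beta=0.0)      # no momentum
momentum_loss = run(lr=0.01, beta=0.9)   # with momentum
```

The steep y direction caps how large the learning rate can be, which leaves plain descent crawling along the shallow x direction. With β = 0.9 the accumulated velocity acts like a step roughly 1 / (1 - β) = 10 times larger along that slow direction, so the momentum run ends with a much lower loss after the same number of steps.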

Adam and friends

Adam adapts per-parameter learning rates using running averages of the gradient and the squared gradient. It is the default optimizer for most deep learning: not always optimal, rarely catastrophic.
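
A minimal sketch of the Adam update in plain Python, assuming the commonly used defaults β1 = 0.9, β2 = 0.999, and ε = 1e-8. The name `adam_minimize`, the toy quadratic, and the learning rate of 0.1 are assumptions for illustration; in practice you would reach for your framework's built-in optimizer.

```python
import math

def adam_minimize(grad_fn, params, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    params = list(params)
    m = [0.0] * len(params)   # running average of the gradient
    v = [0.0] * len(params)   # running average of the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(params)
        for i in range(len(params)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            # Bias correction: both averages start at zero, so early
            # estimates are scaled up to compensate.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            # Per-parameter step: large past gradients shrink the step.
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

def toy_grad(p):
    """Gradient of f(x, y) = (x - 3)**2 + 10 * (y + 1)**2."""
    x, y = p
    return [2.0 * (x - 3.0), 20.0 * (y + 1.0)]

solution = adam_minimize(toy_grad, [0.0, 0.0])
```

The two coordinates have curvatures that differ by a factor of ten, but because each step is divided by that parameter's own gradient scale, both move toward the minimum at (3, -1) at a similar pace.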

In high dimensions, saddle points outnumber true local minima. Modern networks are overparameterized, so many global or near-global solutions exist. The hard part is finding them fast, not escaping traps.
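
To make the saddle-point claim concrete, here is a toy sketch on f(x, y) = x² - y², which curves up along x and down along y, so the origin is a saddle rather than a minimum. The function `descend_saddle` and its starting points are assumptions for illustration, and this two-dimensional surface is unbounded below, unlike a real loss.

```python
def descend_saddle(x, y, lr=0.1, steps=100):
    """Gradient descent on f(x, y) = x**2 - y**2; gradient is (2x, -2y)."""
    for _ in range(steps):
        x, y = x - lr * 2.0 * x, y - lr * (-2.0 * y)
    return x, y

# Starting exactly on the x axis, y never moves and descent slides into the saddle.
stuck = descend_saddle(1.0, 0.0)

# A tiny nudge off the axis is enough: y grows every step and descent escapes.
escaped = descend_saddle(1.0, 1e-6)
```

Only a perfectly aligned start stays at the saddle. Any small perturbation, such as the noise from mini-batch sampling, pushes the iterate onto a downhill direction, which is one reason saddle points slow training down more often than they stop it.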

Takeaway

Every LLM, every classifier, every diffusion model: same core loop. Master gradient descent and backprop, and the rest of deep learning is architecture choices stacked on top.
