Diffusion models have taken the generative AI world by storm, powering systems like DALL-E 2, Stable Diffusion, and Imagen. But how do they actually work? In this post, I'll walk through the core mathematics and intuition behind Denoising Diffusion Probabilistic Models (DDPMs) and show how to implement a minimal version in PyTorch.
The Core Idea
The key insight behind diffusion models is elegant: instead of learning to generate data directly, we learn to reverse a gradual noising process. We define a forward process that slowly destroys structure in the data by adding Gaussian noise over T steps, then train a neural network to reverse this process step by step.
The Forward Process
Given a data sample x₀, the forward process q(xₜ | xₜ₋₁) adds a small amount of Gaussian noise at each step:
q(xₜ | xₜ₋₁) = N(xₜ; √(1 - βₜ) xₜ₋₁, βₜI)
where βₜ is a variance schedule that increases from a small value (e.g., 1e-4) to a larger value (e.g., 0.02) over T = 1000 steps. A key property is that we can sample xₜ directly from x₀ in closed form:
q(xₜ | x₀) = N(xₜ; √ᾱₜ x₀, (1 - ᾱₜ)I)
where αₜ := 1 − βₜ and ᾱₜ := ∏ᵢ₌₁ᵗ αᵢ. This means we can corrupt any training sample to any noise level in a single step — no need to iteratively apply the forward process during training.
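Here is a minimal sketch of that one-shot corruption in PyTorch, assuming the linear schedule described above; `q_sample` and the hyperparameter names are mine, not from any particular library:

```python
import torch

# Hyperparameters matching the schedule described above.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear variance schedule β_t
alphas = 1.0 - betas                         # α_t = 1 - β_t
alpha_bars = torch.cumprod(alphas, dim=0)    # ᾱ_t = ∏ α_i

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in one shot: x_t = √ᾱ_t·x_0 + √(1-ᾱ_t)·ε."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over batch
    return torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * noise

# Corrupt a batch of "images" to random noise levels in one call.
x0 = torch.randn(8, 1, 28, 28)
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t)
```

Note that ᾱ_T is nearly zero under this schedule, so x_T is essentially pure Gaussian noise, which is exactly what the reverse process will start from.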
The Reverse Process
The reverse process p_θ(xₜ₋₁ | xₜ) is what we learn. Ho et al. (2020) showed that parameterizing the reverse process as predicting the noise ε added at each step leads to a simple and effective training objective:
L_simple = E_{t, x₀, ε}[||ε - ε_θ(xₜ, t)||²]

where t is drawn uniformly from {1, …, T} and xₜ is built from x₀ and ε via the closed-form forward process above.
This is just mean squared error between the actual noise and the predicted noise — surprisingly simple for such a powerful model.
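In code, a single training step might look like the following sketch; `ddpm_loss` and the stand-in `dummy_model` are illustrative names, and any network with the signature `model(x_t, t) → ε̂` could be dropped in:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alpha_bars):
    """L_simple: MSE between the true and predicted noise at a random timestep."""
    t = torch.randint(0, alpha_bars.shape[0], (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * noise  # closed-form q(x_t | x_0)
    return F.mse_loss(model(x_t, t), noise)

# Stand-in "model" that predicts zeros; a real ε_θ would be a U-Net.
dummy_model = lambda x, t: torch.zeros_like(x)
alpha_bars = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
loss = ddpm_loss(dummy_model, torch.randn(4, 1, 28, 28), alpha_bars)
```

In a real training loop you would simply call `loss.backward()` and step an optimizer; there is no adversarial game, which is a large part of why diffusion training is so stable.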
The U-Net Architecture
The noise prediction network ε_θ is typically a U-Net with:
- Residual blocks for stable training
- Self-attention layers at lower resolutions to capture global structure
- Sinusoidal time embeddings to condition the network on the current noise level
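As one concrete piece of that conditioning, a sinusoidal time embedding can be sketched as below — a minimal version of the Transformer-style embedding, with `timestep_embedding` as an illustrative name:

```python
import math
import torch

def timestep_embedding(t, dim):
    """Map integer timesteps to sin/cos features at geometrically spaced frequencies."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half
    )
    args = t.float()[:, None] * freqs[None, :]          # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)

emb = timestep_embedding(torch.tensor([0, 500, 999]), 128)
```

The resulting vector is typically passed through a small MLP and added into each residual block, so every layer knows the current noise level.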
Sampling
To generate a new sample, we start from pure Gaussian noise x_T ~ N(0, I) and iteratively denoise:
```python
@torch.no_grad()
def sample(model, x, betas, alphas, alpha_bars):
    # x starts as x_T ~ N(0, I); denoise one step at a time.
    for t in reversed(range(len(betas))):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = model(x, torch.full((x.shape[0],), t, dtype=torch.long))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t]) + torch.sqrt(betas[t]) * z
    return x
```
Key Takeaways
- Diffusion models frame generation as iterative denoising — a much more stable training target than GANs.
- The forward process has a closed-form solution, making training efficient.
- The simple noise-prediction objective is surprisingly effective.
- Sampling is slow (T steps) but can be accelerated with DDIM or DPM-Solver.
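To give a flavor of that acceleration, a single deterministic DDIM step (η = 0) can be sketched as below: it first recovers an estimate of x₀ from the current sample and the predicted noise, then jumps directly to a possibly much earlier noise level, so far fewer than T steps are needed. The names `ddim_step`, `ab_t`, and `ab_prev` (ᾱ at the current and target timesteps) are mine:

```python
import torch

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update: estimate x_0, then re-noise to level ᾱ_prev."""
    x0_hat = (x_t - torch.sqrt(1.0 - ab_t) * eps_pred) / torch.sqrt(ab_t)
    return torch.sqrt(ab_prev) * x0_hat + torch.sqrt(1.0 - ab_prev) * eps_pred

# Sanity check: if eps_pred is the exact noise and we jump to ᾱ = 1,
# the step recovers x_0 exactly.
x0 = torch.randn(2, 3)
eps = torch.randn(2, 3)
ab_t = torch.tensor(0.5)
x_t = torch.sqrt(ab_t) * x0 + torch.sqrt(1.0 - ab_t) * eps
recovered = ddim_step(x_t, eps, ab_t, torch.tensor(1.0))
```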
In my Conditional DDPM project, I extended this framework with class conditioning to control which digit gets generated. Check it out for a practical implementation.