Conditional DDPM Digit Generation
A Conditional Denoising Diffusion Probabilistic Model (DDPM) trained on MNIST to generate handwritten digit images conditioned on class labels.
Overview
This project implements a Conditional Denoising Diffusion Probabilistic Model (DDPM) for class-conditioned handwritten digit generation on MNIST. By injecting class-label information into the U-Net's time-step embeddings, the model learns to generate digit images for any specified class (0–9).
Problem
Unconditional generative models produce samples from the full data distribution without control over the output class. For practical applications—such as data augmentation or controlled synthesis—we need a model that can generate samples conditioned on a target class label. The challenge is to integrate class conditioning into the diffusion process without sacrificing sample quality.
Dataset
The MNIST dataset contains 70,000 grayscale images of handwritten digits (0–9) at 28×28 pixels. The training split (60,000 images) was used for model training, and the test split (10,000 images) was used for evaluation. Images were normalized to the range [-1, 1] to match the diffusion model's noise schedule.
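The [-1, 1] normalization can be sketched as a simple affine map. This is a minimal numpy sketch, assuming images arrive as uint8 arrays in [0, 255]; the function name `normalize` is illustrative, not taken from the project code.

```python
import numpy as np

def normalize(images: np.ndarray) -> np.ndarray:
    """Map uint8 pixel values in [0, 255] to float32 values in [-1, 1]."""
    return images.astype(np.float32) / 127.5 - 1.0

x = np.array([0, 51, 255], dtype=np.uint8)
print(normalize(x))  # [-1.  -0.6  1. ]
```

The inverse map, `(x + 1) * 127.5`, recovers pixel values when visualizing generated samples.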
Architecture
The model uses a U-Net backbone with residual blocks and self-attention layers. Class conditioning is injected via learned class embeddings that are added to the time-step embeddings at each residual block. The noise schedule follows a linear beta schedule from β₁ = 1e-4 to β_T = 0.02 over T = 1000 diffusion steps.
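The noise schedule above determines how much signal survives at each step. A minimal numpy sketch of the linear schedule and the cumulative products used by the forward process (variable names are illustrative):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule from beta_1 to beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = product of alpha_s for s <= t

# Forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
print(betas[0], betas[-1])  # 0.0001 0.02
print(alpha_bars[-1])       # near 0, so x_T is almost pure Gaussian noise
```

Because `alpha_bars[-1]` is nearly zero, the final latent `x_T` is effectively indistinguishable from Gaussian noise, which is what lets sampling start from `x_T ~ N(0, I)`.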
Architecture Diagram
```mermaid
graph TD
    A[Noisy Image x_t] --> B[U-Net Encoder]
    C[Timestep t] --> D[Sinusoidal Embedding]
    E[Class Label y] --> F[Class Embedding]
    D --> G[Time + Class Embedding]
    F --> G
    G --> B
    B --> H[Bottleneck with Self-Attention]
    H --> I[U-Net Decoder]
    I --> J[Predicted Noise ε_θ]
```
Training
The model was trained for 50 epochs using the Adam optimizer (lr = 2e-4) with a batch size of 128. The training objective is the simplified DDPM loss: minimizing the mean squared error between the predicted noise and the actual noise added at each diffusion step. Classifier-free guidance was applied with a guidance scale of 3.0 during sampling.
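The simplified objective can be sketched in a few lines of numpy: sample a timestep, noise a clean image via the closed-form forward process, and take the MSE between the true and predicted noise. The predictor passed in here is a hypothetical stand-in for the conditional U-Net, not the project's model.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def ddpm_loss(x0, eps_pred_fn):
    """Simplified DDPM loss: MSE between true and predicted noise."""
    t = rng.integers(0, T)                # uniformly sampled timestep
    eps = rng.standard_normal(x0.shape)   # true Gaussian noise
    # Closed-form forward process: q(x_t | x_0)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    eps_hat = eps_pred_fn(x_t, t)         # stands in for eps_theta(x_t, t, y)
    return np.mean((eps - eps_hat) ** 2)

x0 = rng.standard_normal((28, 28))
loss = ddpm_loss(x0, lambda x_t, t: np.zeros_like(x_t))  # dummy zero predictor
print(loss)  # roughly 1.0, since unit-variance noise is left unexplained
```

A zero predictor yields a loss near 1.0 (the variance of the noise); training drives the U-Net's prediction toward the sampled `eps`, pushing this value toward zero.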
Results
The model achieves a Fréchet Inception Distance (FID) score of 4.2 on the MNIST test set, comparable to state-of-the-art conditional generative models on this benchmark. Generated samples are visually sharp and correctly conditioned on the target class label in over 97% of cases. Sampling 16 images takes approximately 8 seconds on a single NVIDIA T4 GPU.
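Classifier-free guidance at sampling time (guidance scale 3.0, per the Training section) combines conditional and unconditional noise predictions at each step. A minimal sketch of that combination, with illustrative names:

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, guidance_scale=3.0):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

e_c = np.array([1.0])   # conditional prediction eps_theta(x_t, t, y)
e_u = np.array([0.5])   # unconditional prediction eps_theta(x_t, t)
print(cfg_noise(e_c, e_u))  # [2.]  i.e. 0.5 + 3.0 * (1.0 - 0.5)
```

Larger guidance scales strengthen class adherence at the cost of sample diversity; a scale of 1.0 recovers the plain conditional prediction.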
Visualizations
The interactive demo is available on the Experiments page.