Project 5A: The Power of Diffusion Models!

Introduction

In this project, I explored the capabilities of diffusion models, implemented diffusion sampling loops, and used them for tasks such as inpainting and creating optical illusions. The project was structured into several parts, each focusing on different aspects of diffusion models.

The images in Part 0 are 256x256 pixels, while images in the other parts are 64x64 pixels. Throughout the project, I used the DeepFloyd IF diffusion model, which you can learn more about here.

Course Logo

I experimented with generating different logo designs for a computer vision course using the diffusion model. Here are two variations:

I also generated a few more logos with the prompt "a super cool computer vision class at UC Berkeley called CS 180"; however, these depicted classroom-like settings and were less aesthetically pleasing as logos.

Minimalist Computer Vision Logo

"A minimalist logo for a computer vision class, featuring a stylized camera lens and digital eye".

Academic Computer Vision Logo

"An elegant academic logo combining a camera aperture with binary code, clean design".

Classroom Computer Vision Logo

"a super cool computer vision class at UC Berkeley called CS 180".

Part 0: Model Sampling

In this section, I sampled images from the model using different numbers of inference steps. I also downloaded the precomputed text embeddings to facilitate the generation process.

The following images were obtained during this process:

Sample Image im1.png

Figure 1: Sample Image im1.png.

Sample Image download4.png

Figure 2: Sample Image download4.png.

Sample Image download2.png

Figure 3: Sample Image download2.png.

Sample Image download3.png

Figure 4: Sample Image download3.png.

Sample Image download1.png

Figure 5: Sample Image download1.png.

Sample Image download.png

Figure 6: Sample Image download.png.

Sample Image 50_steps.png

Figure 7: Sample Image 50_steps.png.

Sample Image 50_steps_large.png

Figure 8: Sample Image 50_steps_large.png.

Sample Image 50_steps_2.png

Figure 9: Sample Image 50_steps_2.png.

Sample Image 50_steps_large3.png

Figure 10: Sample Image 50_steps_large3.png.

Sample Image 50_steps_1.png

Figure 11: Sample Image 50_steps_1.png.

Sample Image 50_steps_large_2.png

Figure 12: Sample Image 50_steps_large_2.png.

Part 1: Sampling Loops

In this part, I implemented sampling loops using the pretrained DeepFloyd denoisers to produce high-quality images. I modified these sampling loops to solve different tasks such as inpainting and producing optical illusions.

Sample Image of the Campanile

The sample image used throughout this part is the Berkeley Campanile, which served as the test image for various processes.

Berkeley Campanile

Figure 14: Berkeley Campanile.

Part 1.1: Forward Process

A key part of diffusion is the forward process, which takes a clean image and adds noise to it. In this part, I wrote a function to implement this process. The forward process is defined by:

\[ q(x_t | x_0) = \mathcal{N}(x_t ; \sqrt{\bar\alpha_t} x_0, (1 - \bar\alpha_t)\mathbf{I}) \tag{1} \]

which is equivalent to computing:

\[ x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon \quad \text{where}~ \epsilon \sim \mathcal{N}(0, \mathbf{I}) \tag{2} \]

Given a clean image \( x_0 \), we get a noisy image \( x_t \) at timestep \( t \) by sampling from a Gaussian with mean \( \sqrt{\bar\alpha_t} x_0 \) and variance \( (1 - \bar\alpha_t) \). Note that the forward process is not just adding noise—we also scale the image.
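To make this concrete, here is a minimal sketch of the forward process. It assumes `alphas_cumprod` is the precomputed tensor of \( \bar\alpha_t \) values (available from the DeepFloyd scheduler) and `im` is a clean image tensor; the function name and signature are illustrative rather than my exact code.

```python
import torch

def forward(im: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0): scale the clean image and add Gaussian noise (Eq. 2)."""
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)                      # epsilon ~ N(0, I)
    return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
```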

Below are the results of the forward process on the test image with \( t \in [250, 500, 750] \):

Berkeley Campanile

Figure 14: Berkeley Campanile.

Campanile at t=250

Figure 15: Noisy Campanile at t=250.

Campanile at t=500

Figure 16: Noisy Campanile at t=500.

Campanile at t=750

Figure 17: Noisy Campanile at t=750.

Part 1.2: Classical Denoising

I attempted to denoise the noisy images using Gaussian blur filtering. The results were not satisfactory, highlighting the limitations of classical denoising methods in handling significant noise.
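For reference, the classical baseline is just a Gaussian blur; below is a minimal sketch using torchvision. The kernel size and sigma are illustrative choices, not necessarily the ones I used.

```python
import torch
import torchvision.transforms.functional as TF

def gaussian_denoise(noisy_im: torch.Tensor, kernel_size: int = 5, sigma: float = 2.0) -> torch.Tensor:
    """Classical baseline: try to suppress the noise with a Gaussian blur."""
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```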

Campanile at t=250

Figure 15: Noisy Campanile at t=250.

Campanile at t=500

Figure 16: Noisy Campanile at t=500.

Campanile at t=750

Figure 17: Noisy Campanile at t=750.

Gaussian Denoising at t=250

Figure 18: Gaussian Denoising at t=250.

Gaussian Denoising at t=500

Figure 19: Gaussian Denoising at t=500.

Gaussian Denoising at t=750

Figure 20: Gaussian Denoising at t=750.

Part 1.3: One-Step Denoising

I used a pretrained diffusion model to denoise the images in a single step. The denoiser predicts the noise in the image, which can then be removed to recover an estimate of the original image.

The original image is estimated from the noisy image by rearranging Equation (2):

\[ \hat{x}_0 = \frac{1}{\sqrt{\bar\alpha_t}} \left( x_t - \sqrt{1 - \bar\alpha_t} \hat{\epsilon} \right) \]
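A minimal sketch of one-step denoising with the stage-1 UNet. It assumes a diffusers-style `unet` (whose output's first three channels are the noise prediction, with the remaining channels predicting variance) and precomputed `prompt_embeds` for "a high quality photo"; these names and the exact call signature are assumptions about the DeepFloyd/diffusers API rather than a verbatim copy of my code.

```python
import torch

@torch.no_grad()
def one_step_denoise(unet, x_t, t, prompt_embeds, alphas_cumprod):
    """Predict the noise in x_t and invert Eq. (2) to estimate the clean image x_0."""
    t_batch = torch.tensor([t], device=x_t.device)
    out = unet(x_t, t_batch, encoder_hidden_states=prompt_embeds).sample
    eps_hat = out[:, :3]                              # noise estimate (extra channels predict variance)
    alpha_bar = alphas_cumprod[t]
    return (x_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()
```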

Here are the results for \( t \in [250, 500, 750] \):

Noisy Image at t=250

Figure 24: Noisy Image at t=250.

Noise Estimate at t=500

Figure 25: Noise Estimate at t=500.

Noise Estimate at t=750

Figure 25: Noise Estimate at t=750.

Estimated Clean Image at t=250

Figure 21: Estimated Clean Image at t=250.

Estimated Clean Image at t=500

Figure 22: Estimated Clean Image at t=500.

Estimated Clean Image at t=750

Figure 23: Estimated Clean Image at t=750.

Part 1.4: Iterative Denoising

I implemented iterative denoising, which involves progressively denoising the image by moving from a higher noise level to a lower one. The formula used is:

\[ x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\beta_t}{1 - \bar\alpha_t} \hat{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t} x_t + v_\sigma \tag{3} \]

where:

- \( x_t \) is the image at timestep \( t \), and \( x_{t'} \) is the (less noisy) image at timestep \( t' < t \),
- \( \bar\alpha_t \) is the cumulative product of the \( \alpha \) values at timestep \( t \),
- \( \alpha_t = \bar\alpha_t / \bar\alpha_{t'} \) and \( \beta_t = 1 - \alpha_t \),
- \( \hat{x}_0 \) is the current estimate of the clean image, obtained with the one-step denoising formula above, and
- \( v_\sigma \) is random noise (for DeepFloyd, this variance term is also predicted by the model).
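A minimal sketch of the iterative loop, assuming `strided_timesteps` is a strictly decreasing list of timesteps and `one_step_denoise` is the estimator sketched in Part 1.3; the model-predicted variance term \( v_\sigma \) is omitted here for brevity.

```python
import torch

@torch.no_grad()
def iterative_denoise(unet, x, strided_timesteps, prompt_embeds, alphas_cumprod, i_start=0):
    """Denoise from strided_timesteps[i_start] down to a clean image using Eq. (3)."""
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prime = strided_timesteps[i], strided_timesteps[i + 1]   # t' < t (less noisy)
        abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
        alpha_t = abar_t / abar_tp
        beta_t = 1 - alpha_t

        x0_hat = one_step_denoise(unet, x, t, prompt_embeds, alphas_cumprod)
        x = (
            (abar_tp.sqrt() * beta_t / (1 - abar_t)) * x0_hat
            + (alpha_t.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x
        )
        # The full implementation also adds the predicted variance v_sigma here.
    return x
```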

Below are the images at various noise levels during the iterative denoising process:

Noisy at t=30

Figure 26: Noisy Campanile at t=30.

Noisy at t=90

Figure 27: Noisy Campanile at t=90.

Noisy at t=240

Figure 28: Noisy Campanile at t=240.

Noisy at t=390

Figure 29: Noisy Campanile at t=390.

Noisy at t=540

Figure 30: Noisy Campanile at t=540.

Noisy at t=690

Figure 31: Noisy Campanile at t=690.

Original Image

Figure 32: Original Image.

Iteratively Denoised Image

Figure 32: Iteratively Denoised Image.

One-Step Denoised Image

Figure 33: One-Step Denoised Image.

Gaussian Blur Denoised Image

Figure 34: Gaussian Blur Denoised Image.

Part 1.5: Diffusion Model Sampling

I generated images from scratch by starting with random noise and applying iterative denoising. Here are some samples generated with the prompt "a high quality photo":

Generated Sample 1

Figure 35: Generated Sample 1.

Generated Sample 2

Figure 36: Generated Sample 2.

Generated Sample 3

Figure 37: Generated Sample 3.

Generated Sample 4

Figure 38: Generated Sample 4.

Generated Sample 5

Figure 39: Generated Sample 5.

Part 1.6: Classifier-Free Guidance (CFG)

To improve image quality, I implemented Classifier-Free Guidance (CFG). In CFG, we compute both a noise estimate conditioned on a text prompt (\( \epsilon_c \)) and an unconditional noise estimate (\( \epsilon_u \)). The final noise estimate is:

\[ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \]

where \( \gamma \) is the guidance scale; I used \( \gamma = 5 \).
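A minimal sketch of the CFG estimate, assuming `cond_embeds` and `uncond_embeds` are the precomputed text embeddings for the prompt and for the empty prompt respectively (again using an assumed diffusers-style UNet call):

```python
import torch

@torch.no_grad()
def cfg_noise_estimate(unet, x_t, t, cond_embeds, uncond_embeds, gamma=5.0):
    """Extrapolate from the unconditional noise estimate toward the conditional one."""
    t_batch = torch.tensor([t], device=x_t.device)
    eps_c = unet(x_t, t_batch, encoder_hidden_states=cond_embeds).sample[:, :3]
    eps_u = unet(x_t, t_batch, encoder_hidden_states=uncond_embeds).sample[:, :3]
    return eps_u + gamma * (eps_c - eps_u)
```

Here are the generated images with CFG: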

CFG Sample 1

Figure 40: CFG Sample 1.

CFG Sample 2

Figure 41: CFG Sample 2.

CFG Sample 3

Figure 42: CFG Sample 3.

CFG Sample 4

Figure 43: CFG Sample 4.

CFG Sample 5

Figure 44: CFG Sample 5.

CFG Sample 6

Figure 45: CFG Sample 6.

CFG Sample 7

Figure 46: CFG Sample 7.

CFG Sample 8

Figure 47: CFG Sample 8.

CFG Sample 9

Figure 48: CFG Sample 9.

CFG Sample 10

Figure 49: CFG Sample 10.

Progression of Sample 2 during denoising:

Progression Image 7

Figure 50: Sample 2 Progression Image 7.

Progression Image 6

Figure 51: Sample 2 Progression Image 6.

Progression Image 5

Figure 52: Sample 2 Progression Image 5.

Progression Image 4

Figure 53: Sample 2 Progression Image 4.

Progression Image 3

Figure 54: Sample 2 Progression Image 3.

Progression Image 2

Figure 55: Sample 2 Progression Image 2.

Progression Image 1

Figure 56: Sample 2 Progression Image 1.

Part 1.7: Image-to-Image Translation

I explored image editing (the SDEdit approach) by adding varying amounts of noise to the original image and then running the iterative denoising procedure from Part 1.4. The more noise added, the larger the edit, because the model has more freedom to deviate from the original content.

For each image set below, I show the original alongside the reconstructions obtained from different starting indices \( i_{\text{start}} \); a minimal code sketch of the procedure follows.
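This sketch reuses the `forward` and `iterative_denoise` sketches from Parts 1.1 and 1.4; the helper name and argument list are illustrative.

```python
def edit_image(unet, im, i_start, strided_timesteps, prompt_embeds, alphas_cumprod):
    """Noise the original image to strided_timesteps[i_start], then denoise it back."""
    t_start = strided_timesteps[i_start]
    noisy = forward(im, t_start, alphas_cumprod)            # project onto the noisy manifold
    return iterative_denoise(unet, noisy, strided_timesteps, prompt_embeds,
                             alphas_cumprod, i_start=i_start)
```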

Part 1.7.1: Editing Hand-Drawn and Web Images

Campanile

Original Campanile

Figure 64: Original Campanile.

Campanile i_start=0

Figure 70: Campanile at i_start=1.

Campanile i_start=10

Figure 69: Campanile at i_start=3.

Campanile i_start=7

Figure 68: Campanile at i_start=5.

Campanile i_start=5

Figure 67: Campanile at i_start=7.

Campanile i_start=3

Figure 66: Campanile at i_start=10.

Campanile i_start=1

Figure 65: Campanile at i_start=20.

My Favorite Campanile Reconstructions

Campanile i_start=20

Figure 71: Campanile reconstruction at i_start=20.

Campanile i_start=10

Figure 72: Campanile reconstruction at i_start=10.

Campanile i_start=7

Figure 73: Campanile reconstruction at i_start=7.

Campanile i_start=5

Figure 74: Campanile reconstruction at i_start=5.

Campanile i_start=3

Figure 75: Campanile reconstruction at i_start=3.

Hand-Drawn Airplane

Original Airplane

Figure 57: Original Airplane.

Airplane i_start=6

Figure 58: Airplane at i_start=1.

Airplane i_start=5

Figure 59: Airplane at i_start=3.

Airplane i_start=4

Figure 60: Airplane at i_start=5.

Airplane i_start=3

Figure 61: Airplane at i_start=7.

Airplane i_start=2

Figure 62: Airplane at i_start=10.

Airplane i_start=1

Figure 63: Airplane at i_start=20.

Hand-Drawn House

Original House

Figure 64: Original House.

House i_start=5

Figure 66: House at i_start=1.

House i_start=4

Figure 67: House at i_start=3.

House i_start=3

Figure 68: House at i_start=5.

House i_start=2

Figure 69: House at i_start=7.

House i_start=1

Figure 70: House at i_start=10.

House i_start=6

Figure 65: House at i_start=20.

Web Image - Avocado

Original Avocado

Figure 71: Original Avocado.

Avocado i_start=6

Figure 72: Avocado at i_start=1.

Avocado i_start=5

Figure 73: Avocado at i_start=3.

Avocado i_start=4

Figure 74: Avocado at i_start=5.

Avocado i_start=3

Figure 75: Avocado at i_start=7.

Avocado i_start=2

Figure 76: Avocado at i_start=10.

Avocado i_start=1

Figure 77: Avocado at i_start=20.

Web Image - Surfer

Original Surfer

Figure 78: Original Surfer image.

Surfer i_start=7

Figure 79: Surfer at i_start=1.

Surfer i_start=6

Figure 80: Surfer at i_start=3.

Surfer i_start=5

Figure 81: Surfer at i_start=5.

Surfer i_start=4

Figure 82: Surfer at i_start=7.

Surfer i_start=3

Figure 83: Surfer at i_start=10.

Surfer i_start=2

Figure 84: Surfer at i_start=20.

Web Image - Tree

Tree i_start=7

Figure 85: Tree image (original).

Tree i_start=6

Figure 86: Tree at i_start=15.

Tree i_start=5

Figure 87: Tree at i_start=10.

Tree i_start=4

Figure 88: Tree at i_start=7.

Tree i_start=3

Figure 89: Tree at i_start=5.

Tree i_start=2

Figure 90: Tree at i_start=3.

Tree i_start=1

Figure 91: Tree at i_start=1.

Part 1.7.2: Inpainting

I implemented inpainting by using a mask to specify regions to edit. By applying the diffusion process only within the masked area, new content is generated while the rest of the image remains unchanged.
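A minimal sketch of the masked update. Here `mask` is 1 where new content should be generated and 0 elsewhere, `denoise_step` stands for one iteration of the Part 1.4 loop (a hypothetical helper), and `forward` is the noising sketch from Part 1.1.

```python
import torch

@torch.no_grad()
def inpaint(unet, im, mask, strided_timesteps, prompt_embeds, alphas_cumprod):
    """Generate new content inside the mask while forcing the rest to match the original image."""
    x = torch.randn_like(im)
    for i in range(len(strided_timesteps) - 1):
        x = denoise_step(unet, x, i, strided_timesteps, prompt_embeds, alphas_cumprod)
        # Outside the mask, overwrite the sample with the original image noised to the current level.
        x = mask * x + (1 - mask) * forward(im, strided_timesteps[i + 1], alphas_cumprod)
    return x
```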

Campanile Mask Process Version 1

Original Campanile

Figure 92: Original Campanile.

Step t3

Figure 93: Campanile Inpainting Step t3.

Step t4

Figure 94: Campanile Inpainting Step t4.

Step t6

Figure 95: Campanile Inpainting Step t6.

Step t9

Figure 96: Campanile Inpainting Step t9.

Campanile Mask Process Version 2

Original Campanile

Figure 97: Original Campanile.

Step t3

Figure 98: Campanile Inpainting Step t3.

Step t4

Figure 99: Campanile Inpainting Step t4.

Step t6

Figure 100: Campanile Inpainting Step t6.

Step t9

Figure 101: Campanile Inpainting Step t9.

Color Painting Process

Original Image

Figure 102: Original Image.

Mask

Figure 103: Mask.

Area to Fill

Figure 104: Area to Fill.

Final Result

Figure 109: Final Color Painting Result.

Step 4

Figure 108: Color Painting Step 4.

Step 1

Figure 105: Color Painting Step 1.

Step 3

Figure 107: Color Painting Step 3.

Step 2

Figure 106: Color Painting Step 2.

Final Result

Figure 109: Final Color Painting Result.

Goggles Inpainting Process - Larger Mask

Original Image

Figure 110: Original Image.

Mask

Figure 111: Mask.

Area to Fill

Figure 112: Area to Fill.

Final Result

Figure 113: Final Result.

Step 1

Figure 114: Inpainting Step 1.

Step 2

Figure 115: Inpainting Step 2.

Step 3

Figure 116: Inpainting Step 3.

Step 4

Figure 117: Inpainting Step 4.

Goggles Inpainting Process - Smaller Mask

Original Image

Figure 118: Original Image.

Mask

Figure 119: Mask.

Area to Fill

Figure 120: Area to Fill.

Final Result

Figure 121: Final Result.

Step 1

Figure 122: Inpainting Step 1.

Step 2

Figure 123: Inpainting Step 2.

Step 3

Figure 124: Inpainting Step 3.

Step 4

Figure 125: Inpainting Step 4.

Roger Federer - Eyes Mask

Original Image

Figure 126: Original Image.

Mask

Figure 127: Mask.

Area to Fill

Figure 128: Area to Fill.

Final Result

Figure 129: Final Result.

Roger Federer - Full Face Mask

Original Image

Figure 130: Original Image.

Mask

Figure 131: Mask.

Area to Fill

Figure 132: Area to Fill.

Final Result

Figure 133: Final Result.

Roger Federer - Small Tennis Ball Mask

Original Image

Figure 134: Original Image.

Mask

Figure 135: Mask.

Area to Fill

Figure 136: Area to Fill.

Final Result

Figure 137: Final Result.

Roger Federer - Large Tennis Ball Mask

Original Image

Figure 138: Original Image.

Mask

Figure 139: Mask.

Area to Fill

Figure 140: Area to Fill.

Final Result

Figure 141: Final Result.

Roger Federer - Full Body Mask Progression

Original Image

Figure 142: Original Image.

Mask

Figure 143: Mask.

Area to Fill

Figure 144: Area to Fill.

Step 1

Figure 145: Inpainting Step 1.

Step 4

Figure 146: Inpainting Step 4.

Step 7

Figure 147: Inpainting Step 7.

Step 10

Figure 148: Inpainting Step 10.

Final Result

Figure 149: Final Result.

Part 1.7.3: Text-Conditional Image-to-Image Translation

This is the same procedure as in the previous section, but the denoising is guided by a text prompt.

Note: in each image set below, the rightmost image corresponds to the noisiest (most heavily edited) starting point.

Campanile with Rocket Prompt

Campanile Rocket 1

Figure 152: Campanile with rocket prompt - Original image.

Campanile Rocket 7

Figure 153: Campanile with rocket prompt - Stage 7.

Campanile Rocket 6

Figure 154: Campanile with rocket prompt - Stage 6.

Campanile Rocket 5

Figure 155: Campanile with rocket prompt - Stage 5.

Campanile Rocket 4

Figure 156: Campanile with rocket prompt - Stage 4.

Campanile Rocket 3

Figure 157: Campanile with rocket prompt - Stage 3.

Campanile Rocket 2

Figure 158: Campanile with rocket prompt - Stage 2.

Francesco with Amalfi Coast Prompt

Francesco Amalfi 1

Figure 159: Francesco with Amalfi Coast prompt - Stage 1.

Francesco Amalfi 2

Figure 160: Francesco with Amalfi Coast prompt - Stage 2.

Francesco Amalfi 3

Figure 161: Francesco with Amalfi Coast prompt - Stage 3.

Francesco Amalfi 4

Figure 162: Francesco with Amalfi Coast prompt - Stage 4.

Francesco Amalfi 5

Figure 163: Francesco with Amalfi Coast prompt - Stage 5.

Francesco Amalfi 6

Figure 164: Francesco with Amalfi Coast prompt - Stage 6.

Francesco with Rocket Prompt

Francesco Rocket 1

Figure 165: Francesco with rocket prompt - Stage 1.

Francesco Rocket 2

Figure 166: Francesco with rocket prompt - Stage 2.

Francesco Rocket 3

Figure 167: Francesco with rocket prompt - Stage 3.

Francesco Rocket 4

Figure 168: Francesco with rocket prompt - Stage 4.

Francesco Rocket 5

Figure 169: Francesco with rocket prompt - Stage 5.

Francesco Rocket 6

Figure 170: Francesco with rocket prompt - Stage 6.

Francesco Rocket 7

Figure 171: Francesco with rocket prompt - Stage 7.

Roger Federer with Amalfi Coast Prompt

Federer Amalfi 1

Figure 172: Roger Federer with Amalfi Coast prompt - Stage 1.

Federer Amalfi 2

Figure 173: Roger Federer with Amalfi Coast prompt - Stage 2.

Federer Amalfi 3

Figure 174: Roger Federer with Amalfi Coast prompt - Stage 3.

Federer Amalfi 4

Figure 175: Roger Federer with Amalfi Coast prompt - Stage 4.

Federer Amalfi 5

Figure 176: Roger Federer with Amalfi Coast prompt - Stage 5.

Federer Amalfi 6

Figure 177: Roger Federer with Amalfi Coast prompt - Stage 6.

Roger Federer with Rocket Prompt

Federer Rocket 1

Figure 178: Roger Federer with rocket prompt - Stage 1.

Federer Rocket 2

Figure 179: Roger Federer with rocket prompt - Stage 2.

Federer Rocket 3

Figure 180: Roger Federer with rocket prompt - Stage 3.

Federer Rocket 4

Figure 181: Roger Federer with rocket prompt - Stage 4.

Federer Rocket 5

Figure 182: Roger Federer with rocket prompt - Stage 5.

Federer Rocket 6

Figure 183: Roger Federer with rocket prompt - Stage 6.

Part 1.8: Visual Anagrams

I created visual anagrams: images that show one subject right-side up and a different subject when flipped upside down. The algorithm used is:

\[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \] \[ \epsilon = \frac{\epsilon_1 + \epsilon_2}{2} \]
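A minimal sketch of the anagram noise estimate, with the same assumed UNet interface as in the earlier sketches; `embeds_1` and `embeds_2` are the text embeddings for the two prompts \( p_1 \) and \( p_2 \). In a full implementation each estimate would typically also go through CFG before averaging.

```python
import torch

@torch.no_grad()
def anagram_noise_estimate(unet, x_t, t, embeds_1, embeds_2):
    """Average prompt 1's estimate with prompt 2's estimate computed on the flipped image."""
    t_batch = torch.tensor([t], device=x_t.device)
    eps_1 = unet(x_t, t_batch, encoder_hidden_states=embeds_1).sample[:, :3]
    x_flip = torch.flip(x_t, dims=[-2])                                   # flip upside down
    eps_2 = unet(x_flip, t_batch, encoder_hidden_states=embeds_2).sample[:, :3]
    eps_2 = torch.flip(eps_2, dims=[-2])                                  # flip the estimate back
    return (eps_1 + eps_2) / 2
```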

Here are some examples:

An Oil Painting of an Old Man / People Around a Campfire

Original Orientation

Figure 150: Original Orientation.

Flipped Orientation

Figure 151: Flipped Orientation.

Visual Anagrams Gallery

Skull/Amalfi Coast

Figure 152: Skull/Amalfi Coast

Man with Hat/Mountain Village

Figure 153: Man with Hat/Mountain Village

Pencil/Rocket Ship

Figure 154: Pencil/Rocket Ship

Hipster Barista/Skull

Figure 155: Hipster Barista/Skull

Amalfi Coast/Campfire

Figure 156: Amalfi Coast/Campfire

Rocket Ship/Waterfalls

Figure 157: Rocket Ship/Waterfalls

Mountain Village/Waterfalls

Figure 158: Mountain Village/Waterfalls

Dog/Hipster Barista

Figure 159: Dog/Hipster Barista

Part 1.9: Hybrid Images

I created hybrid images that appear differently when viewed up close versus from a distance. The algorithm is:

\[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{UNet}(x_t, t, p_2) \] \[ \epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2) \]

where \( f_{\text{lowpass}} \) is a Gaussian blur and \( f_{\text{highpass}} \) is a high-pass filter.
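A minimal sketch of the hybrid noise estimate; the Gaussian kernel size and sigma are illustrative values, and the UNet interface is assumed as in the earlier sketches.

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def hybrid_noise_estimate(unet, x_t, t, embeds_low, embeds_high, kernel_size=33, sigma=2.0):
    """Low frequencies from one prompt's noise estimate, high frequencies from the other's."""
    t_batch = torch.tensor([t], device=x_t.device)
    eps_low = unet(x_t, t_batch, encoder_hidden_states=embeds_low).sample[:, :3]
    eps_high = unet(x_t, t_batch, encoder_hidden_states=embeds_high).sample[:, :3]
    low = TF.gaussian_blur(eps_low, kernel_size=kernel_size, sigma=sigma)
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size=kernel_size, sigma=sigma)
    return low + high
```

Here is a hybrid image of a skull and a waterfall: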

Hybrid Image of Skull and Waterfall - Version 1

Figure 152: Hybrid Image of Skull and Waterfall

Hybrid Image of Skull and Waterfall - Version 2

Figure 153: Hybrid Image of Skull and Waterfall

Hybrid Image of Skull and Waterfall - Version 3

Figure 154: Hybrid Image of Skull and Amalfi Coast

Hybrid Image: Man and Campfire Scene

Hybrid Image of Man and Campfire

Figure 155: Hybrid Image combining a photo of a man and an oil painting of people around a campfire.

Hybrid Image: Pencil Drawing and Amalfi Coast

Hybrid Image of Pencil Drawing and Amalfi Coast

Figure 156: Hybrid Image combining a pencil drawing and the Amalfi Coast.

Part 2: Bells & Whistles

I explored model bias by generating images with the prompt "successful human being". After generating 100 samples, I observed that around 69% of the images depicted males and 31% depicted females. Additionally, 88% of the subjects were wearing black suits. This analysis highlights potential biases in the model's training data.

Here are some sample images:

Sample 0

Figure 155: Sample 0.

Sample 1

Figure 156: Sample 1.

Sample 2

Figure 157: Sample 2.

Sample 3

Figure 158: Sample 3.

Sample 4

Figure 159: Sample 4.

Sample 5

Figure 160: Sample 5.

Sample 6

Figure 161: Sample 6.

Sample 7

Figure 162: Sample 7.

Sample 8

Figure 163: Sample 8.

Sample 9

Figure 164: Sample 9.

Conclusion

This project provided hands-on experience with diffusion models and their applications in image generation and manipulation. By implementing various techniques and experimenting with different parameters, I gained a deeper understanding of the capabilities and limitations of diffusion models in computer vision tasks.

Part B: Diffusion Models from Scratch!

In this part of the project, I trained my own diffusion model on the MNIST dataset. The goal was to understand the fundamental principles behind diffusion models by implementing them from scratch. This involved building a UNet architecture, formulating the diffusion process mathematically, and iteratively denoising images to generate new samples.

Part 1: Training a Single-Step Denoising UNet

1.1 Problem Formulation

The objective was to train a denoising neural network \( D_{\theta} \) that can map a noisy image \( z \) back to its clean version \( x \). This is formulated as minimizing the following L2 loss function:

\[ L = \mathbb{E}_{z, x} \left\| D_{\theta}(z) - x \right\|^2 \]

To generate the noisy image \( z \), Gaussian noise is added to the clean image \( x \) using a predefined noise level \( \sigma \):

\[ z = x + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]
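A minimal sketch of the noising operation and the training objective, where `denoiser` stands for the UNet \( D_\theta \):

```python
import torch
import torch.nn.functional as F

def add_noise(x: torch.Tensor, sigma: float) -> torch.Tensor:
    """z = x + sigma * eps, with eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)

def denoising_loss(denoiser, x: torch.Tensor, sigma: float = 0.5) -> torch.Tensor:
    """L2 loss between the denoised output and the clean image."""
    z = add_noise(x, sigma)
    return F.mse_loss(denoiser(z), x)
```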

1.2 Implementing the UNet

I implemented a UNet architecture to serve as the denoiser \( D_{\theta} \). The UNet consists of downsampling and upsampling layers with skip connections, allowing it to capture both global and local features.

Key components of the UNet architecture include downsampling (encoder) blocks, upsampling (decoder) blocks, a bottleneck at the lowest resolution, and skip connections that concatenate encoder features with the corresponding decoder features.

The architecture takes an input image of size \( 28 \times 28 \) and processes it through these layers to reconstruct the denoised image; a simplified skeleton is sketched below.
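The sketch below is a deliberately simplified skeleton (one downsampling stage, illustrative channel widths); the actual network has more stages and follows the project's block definitions, but the skip-connection pattern is the same.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions with batch norm and GELU, keeping the spatial size."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)

class UNet(nn.Module):
    """Minimal encoder-decoder with one skip connection for 28x28 MNIST images."""
    def __init__(self, in_ch=1, hidden=128):
        super().__init__()
        self.enc = ConvBlock(in_ch, hidden)
        self.down = nn.Sequential(nn.AvgPool2d(2), ConvBlock(hidden, 2 * hidden))           # 28 -> 14
        self.up = nn.Sequential(nn.Upsample(scale_factor=2), ConvBlock(2 * hidden, hidden)) # 14 -> 28
        self.dec = ConvBlock(2 * hidden, hidden)     # input is the skip concatenation
        self.out = nn.Conv2d(hidden, in_ch, 1)

    def forward(self, x):
        s = self.enc(x)                              # skip features at full resolution
        h = self.down(s)
        h = self.up(h)
        h = self.dec(torch.cat([h, s], dim=1))       # skip connection
        return self.out(h)
```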

1.3 Visualizing the Noising Process

Before training the model, I visualized the effect of adding Gaussian noise with different \( \sigma \) values to the MNIST images. The noise levels ranged from 0.0 (no noise) to 1.0 (completely noisy).

Noisy Images at Different Sigma

Figure 1: MNIST images with varying noise levels \( \sigma \in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0] \).

1.4 Training the Denoiser

The denoiser was trained on the MNIST training set for 5 epochs with a fixed noise level \( \sigma = 0.5 \); a minimal sketch of the training loop is shown below.
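This sketch reuses `denoising_loss` from above; the optimizer, learning rate, and batch size shown here are illustrative assumptions, not necessarily my exact settings.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_denoiser(denoiser, sigma=0.5, epochs=5, lr=1e-4, batch_size=256, device="cuda"):
    """Train the single-step denoiser on MNIST with a fixed noise level sigma."""
    data = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(denoiser.parameters(), lr=lr)
    denoiser.to(device)
    for _ in range(epochs):
        for x, _labels in loader:
            x = x.to(device)
            loss = denoising_loss(denoiser, x, sigma)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return denoiser
```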

The training loss decreased steadily over the epochs, indicating that the model was learning to denoise the images effectively.

Training Loss Curve

Figure 2: Training loss curve over 5 epochs.

1.5 Results After Training

I evaluated the denoiser on the test set after the 1st and 5th epochs. The denoised images improved significantly after training.

Denoising Results Across Epochs

Figure 3: Denoised images after the 1st and 5th epochs.

1.6 Out-of-Distribution Testing

To assess the denoiser's robustness, I tested it on images with different noise levels \( \sigma \) that it wasn't trained on. The denoiser performed well for \( \sigma \) close to 0.5 but struggled with higher noise levels.

Denoising at Different Sigma Levels

Figure 4: Denoised images with varying noise levels \( \sigma \).

Part 2: Training a DDPM Denoising UNet

2.1 DDPM Implementation

In this part, I implemented a Denoising Diffusion Probabilistic Model (DDPM) to perform iterative denoising. Unlike the single-step denoiser, DDPM predicts the noise component \( \epsilon \) added to the image at each timestep \( t \).

The training objective is to minimize the following loss function:

\[ L = \mathbb{E}_{\epsilon, x_0, t} \left\| \epsilon_{\theta}(x_t, t) - \epsilon \right\|^2 \]

where:

- \( x_0 \) is the clean image,
- \( x_t \) is the noisy image at timestep \( t \),
- \( \epsilon \) is the Gaussian noise added during the forward process, and
- \( \epsilon_{\theta}(x_t, t) \) is the noise predicted by the UNet.

To accommodate the DDPM framework, I modified the UNet to accept the timestep \( t \) (and, for class conditioning, the label \( c \)) as additional inputs. These are embedded with small fully connected blocks and injected into the network by modulating intermediate feature maps in the decoder; a minimal sketch of this mechanism is shown below.
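The exact placement of these blocks in the real network follows the project's specification; the modulation rule in the comment is one common choice rather than a verbatim copy of my code.

```python
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that turns a scalar timestep or one-hot class vector into per-channel values."""
    def __init__(self, in_dim, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_ch), nn.GELU(), nn.Linear(out_ch, out_ch))

    def forward(self, v):
        return self.net(v)[..., None, None]          # shape (B, C, 1, 1) for broadcasting over H, W

# Inside the decoder, a feature map h of shape (B, C, H, W) can then be conditioned as, e.g.,
#   h = c_block(one_hot_class) * h + t_block(t_normalized)
# where c_block = FCBlock(num_classes, C) and t_block = FCBlock(1, C).
```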

The forward process involves adding Gaussian noise to the image at each timestep \( t \) using a predefined variance schedule \( \{\beta_t\} \). The noisy image \( x_t \) is sampled as:

\[ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]

where:

- \( \beta_t \) is the variance added at timestep \( t \) by the schedule,
- \( \alpha_t = 1 - \beta_t \), and
- \( \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \) is the cumulative product of the \( \alpha \) values.

The reverse process aims to iteratively denoise \( x_t \) to recover \( x_0 \). The update rule for \( x_{t-1} \) is given by:

\[ x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \right) + \sigma_t z, \quad z \sim \mathcal{N}(0, I) \]

Where \( \sigma_t \) is derived from the variance schedule. I implemented this sampling procedure to generate new images from pure noise.
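A minimal sketch of the sampling loop, assuming `betas`, `alphas`, and `alphas_cumprod` are tensors implementing the variance schedule and `eps_model(x, t)` is the time-conditioned UNet (the real model may expect a normalized timestep, e.g. \( t/T \)):

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, betas, alphas, alphas_cumprod, shape=(1, 1, 28, 28), device="cuda"):
    """Generate an image by iteratively denoising pure Gaussian noise."""
    T = len(betas)
    x = torch.randn(shape, device=device)
    for t in range(T - 1, -1, -1):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)     # no noise at the final step
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = eps_model(x, t_batch)
        x = (x - (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        x = x + betas[t].sqrt() * z                                    # sigma_t = sqrt(beta_t)
    return x
```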

2.1.1 Training the DDPM

The DDPM was trained using the same dataset and optimizer settings as before, but for a longer duration of 20 epochs to ensure convergence. The training loss decreased over time, indicating successful learning.

Training Loss Curve

Figure 5: Training loss curve over all batches.

Training Loss Curve

Figure 5: Training loss curve over 20 epochs.

To evaluate the model's performance, I generated reconstructions at different epochs to visualize the learning progress.

Reconstructions after Epoch 5

Figure 7: Reconstructions after 5 epochs.

Reconstructions after Epoch 1

Figure 6: Reconstructions after 1 epoch.

Reconstructions after Epoch 20

Figure 9: Reconstructions after 20 epochs, showing improved image quality.

Reconstructions after Epoch 10

Figure 8: Reconstructions after 10 epochs.

2.2 Classifier-Free Guidance (CFG)

Training with Class Conditioning

The class-conditioned DDPM was trained with the same dataset and optimizer settings, again for 20 epochs. As before, the training loss decreased over time, indicating successful learning.

Training Loss Curve

Figure 5: Training loss curve over 20 epochs.

To improve sample quality, I implemented Classifier-Free Guidance. This involves generating two noise estimates: one conditioned on the class label (\( \epsilon_c \)) and one unconditioned (\( \epsilon_u \)). The final noise estimate is computed as:

\[ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \]

Where \( \gamma \) is the guidance scale. I experimented with different guidance scales to observe their effect on the generated images.
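A minimal sketch of the two pieces that differ from the unconditional case: randomly dropping the class label during training so the model also learns an unconditional estimate (the 10% drop probability is the common choice from the CFG literature, used here as an assumption), and combining the two estimates at sampling time. `eps_model(x, t, c)` is the class- and time-conditioned UNet.

```python
import torch

def maybe_drop_labels(one_hot: torch.Tensor, p_uncond: float = 0.1) -> torch.Tensor:
    """With probability p_uncond per sample, zero out the class vector (the 'null' class)."""
    keep = (torch.rand(one_hot.shape[0], device=one_hot.device) > p_uncond).float()
    return one_hot * keep[:, None]

@torch.no_grad()
def cfg_eps(eps_model, x, t, one_hot, gamma=5.0):
    """Classifier-free guidance for the class-conditional model."""
    eps_c = eps_model(x, t, one_hot)                        # conditioned on the class
    eps_u = eps_model(x, t, torch.zeros_like(one_hot))      # unconditional (null class)
    return eps_u + gamma * (eps_c - eps_u)
```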

2.2.2 Results After Training

I generated samples after the 1st, 5th, 10th, 15th, and 20th epochs with a guidance scale of 5. The quality of the generated images improved significantly over the epochs.

Samples at Epoch 1

Figure 6: Generated samples after 1 epoch.

Samples at Epoch 5

Figure 7: Generated samples after 5 epochs.

Samples at Epoch 15

Figure 9: Generated samples after 15 epochs.

Samples at Epoch 10

Figure 8: Generated samples after 10 epochs.

Samples at Epoch 20

Figure 10: Generated samples after 20 epochs.

Next, I examined the quality of the generated images under different guidance scales.

Guidance Scale 0

Figure 11: Samples with guidance scale \( \gamma = 0 \).

Guidance Scale 5

Figure 12: Samples with guidance scale \( \gamma = 5 \).

Guidance Scale 10

Figure 13: Samples with guidance scale \( \gamma = 10 \).

Increasing the guidance scale (\( \gamma \)) strengthens the influence of the class conditioning, making the generated samples more closely match the characteristics of their target class labels. This works by interpolating between an unconditional and conditional diffusion model prediction, with higher \( \gamma \) values giving more weight to the conditional prediction. However, setting \( \gamma \) too high (e.g., 10) led to artifacts and overly-exaggerated class features, as the model overemphasizes class-specific attributes at the expense of natural image statistics.

Conclusion

Training a diffusion model from scratch provided deep insights into how these models function at a fundamental level. Implementing the forward and reverse processes, adjusting the UNet architecture, and experimenting with classifier-free guidance highlighted the importance of each component in generating high-quality samples. The project demonstrated the power of diffusion models in generating images and their potential for further research and applications.