In this project, I explored the capabilities of diffusion models, implemented diffusion sampling loops, and used them for tasks such as inpainting and creating optical illusions. The project was structured into several parts, each focusing on different aspects of diffusion models.
The images in Part 0 are 256x256 pixels, while the images in the other parts are 64x64 pixels. Throughout the project, I used the DeepFloyd IF diffusion model, which you can learn more about here.
I experimented with generating different logo designs for a computer vision course using the diffusion model. Here are two variations:
I also generated a few more logos from the prompt "a super cool computer vision class at UC Berkeley called CS 180"; however, those came out as classroom-like scenes and were less aesthetically pleasing as logos.
"A minimalist logo for a computer vision class, featuring a stylized camera lens and digital eye".
"An elegant academic logo combining a camera aperture with binary code, clean design".
"a super cool computer vision class at UC Berkeley called CS 180".
In this section, I downloaded the precomputed text embeddings and sampled images from the model using different numbers of inference steps.
The following images were obtained during this process:
Figure 1: Sample Image im1.png.
Figure 2: Sample Image download4.png.
Figure 3: Sample Image download2.png.
Figure 4: Sample Image download3.png.
Figure 5: Sample Image download1.png.
Figure 6: Sample Image download.png.
Figure 7: Sample Image 50_steps.png.
Figure 9: Sample Image 50_steps_large3.png.
Figure 10: Sample Image 50_steps_large.png.
Figure 11: Sample Image 50_steps_large_2.png.
Figure 12: Sample Image 50_steps_2.png.
Figure 13: Sample Image 50_steps_1.png.
In this part, I implemented sampling loops using the pretrained DeepFloyd denoisers to produce high-quality images. I modified these sampling loops to solve different tasks such as inpainting and producing optical illusions.
The sample image used throughout this part is the Berkeley Campanile, which served as the test image for various processes.
Figure 14: Berkeley Campanile.
A key part of diffusion is the forward process, which takes a clean image and adds noise to it. In this part, I wrote a function to implement this process. The forward process is defined by:
\[ q(x_t | x_0) = \mathcal{N}(x_t ; \sqrt{\bar\alpha_t} x_0, (1 - \bar\alpha_t)\mathbf{I}) \tag{1} \]
which is equivalent to computing:
\[ x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon \quad \text{where}~ \epsilon \sim \mathcal{N}(0, 1) \tag{2} \]
Given a clean image \( x_0 \), we get a noisy image \( x_t \) at timestep \( t \) by sampling from a Gaussian with mean \( \sqrt{\bar\alpha_t} x_0 \) and variance \( (1 - \bar\alpha_t) \). Note that the forward process is not just adding noise—we also scale the image.
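A minimal sketch of this forward step in PyTorch, assuming the cumulative products \( \bar\alpha_t \) are available as a 1-D tensor `alphas_cumprod` taken from the model's noise schedule:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Add noise to a clean image im = x_0 to obtain x_t (Eq. 2)."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)                                     # eps ~ N(0, I)
    x_t = torch.sqrt(abar_t) * im + torch.sqrt(1 - abar_t) * eps
    return x_t
```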
Below are the results of the forward process on the test image with \( t \in [250, 500, 750] \):
Figure 14: Berkeley Campanile.
Figure 15: Noisy Campanile at t=250.
Figure 16: Noisy Campanile at t=500.
Figure 17: Noisy Campanile at t=750.
I attempted to denoise the noisy images using Gaussian blur filtering. The results were not satisfactory, highlighting the limitations of classical denoising methods in handling significant noise.
Figure 15: Noisy Campanile at t=250.
Figure 16: Noisy Campanile at t=500.
Figure 17: Noisy Campanile at t=750.
Figure 18: Gaussian Denoising at t=250.
Figure 19: Gaussian Denoising at t=500.
Figure 20: Gaussian Denoising at t=750.
I used a pretrained diffusion model to denoise the images in a single step. The denoiser predicts the noise in the image, which can then be removed to recover an estimate of the original image.
The process involves estimating the original image from the noisy image using:
\[ \hat{x}_0 = \frac{1}{\sqrt{\bar\alpha_t}} \left( x_t - \sqrt{1 - \bar\alpha_t} \hat{\epsilon} \right) \tag{3} \]
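A sketch of this estimate, under the same `alphas_cumprod` assumption as above, with `eps_hat` standing in for the noise predicted by the DeepFloyd UNet:

```python
import torch

def estimate_x0(x_t, eps_hat, t, alphas_cumprod):
    """Invert the forward process (Eq. 2) to estimate the clean image (Eq. 3)."""
    abar_t = alphas_cumprod[t]
    return (x_t - torch.sqrt(1 - abar_t) * eps_hat) / torch.sqrt(abar_t)
```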
Here are the results for \( t \in [250, 500, 750] \):
Figure 24: Noisy Image at t=250.
Figure 25: Noise Estimate at t=500.
Figure 26: Noise Estimate at t=750.
Figure 21: Estimated Clean Image at t=250.
Figure 22: Estimated Clean Image at t=500.
Figure 23: Estimated Clean Image at t=750.
I implemented iterative denoising, which involves progressively denoising the image by moving from a higher noise level to a lower one. The formula used is:
\[ x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\beta_t}{1 - \bar\alpha_t} \hat{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t} x_t + v_\sigma \tag{4} \]
where \( t' < t \) is the next, less noisy timestep, \( \alpha_t = \bar\alpha_t / \bar\alpha_{t'} \), \( \beta_t = 1 - \alpha_t \), \( \hat{x}_0 \) is the current estimate of the clean image from equation (3), and \( v_\sigma \) is added random noise (which DeepFloyd also predicts).
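One step of this update can be sketched as follows, reusing the `alphas_cumprod` assumption from above; `eps_hat` is the UNet's noise prediction and `v_sigma` stands in for the extra variance term:

```python
import torch

def iterative_denoise_step(x_t, eps_hat, t, t_prime, alphas_cumprod, v_sigma=0.0):
    """One update of Eq. 4: move from timestep t to the less noisy timestep t' < t."""
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp                 # alpha_t = abar_t / abar_t'
    beta_t = 1 - alpha_t

    # Current clean-image estimate (Eq. 3)
    x0_hat = (x_t - torch.sqrt(1 - abar_t) * eps_hat) / torch.sqrt(abar_t)

    return (torch.sqrt(abar_tp) * beta_t / (1 - abar_t)) * x0_hat \
         + (torch.sqrt(alpha_t) * (1 - abar_tp) / (1 - abar_t)) * x_t \
         + v_sigma
```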
Below are the images at various noise levels during the iterative denoising process:
Figure 26: Noisy Campanile at t=30.
Figure 27: Noisy Campanile at t=90.
Figure 28: Noisy Campanile at t=240.
Figure 29: Noisy Campanile at t=390.
Figure 30: Noisy Campanile at t=540.
Figure 31: Noisy Campanile at t=690.
Figure 32: Original Image.
Figure 32: Iteratively Denoised Image.
Figure 33: One-Step Denoised Image.
Figure 34: Gaussian Blur Denoised Image.
I generated images from scratch by starting with random noise and applying iterative denoising. Here are some samples generated with the prompt "a high quality photo":
Figure 35: Generated Sample 1.
Figure 36: Generated Sample 2.
Figure 37: Generated Sample 3.
Figure 38: Generated Sample 4.
Figure 39: Generated Sample 5.
To improve image quality, I implemented Classifier-Free Guidance (CFG). In CFG, we compute both a noise estimate conditioned on a text prompt (\( \epsilon_c \)) and an unconditional noise estimate (\( \epsilon_u \)). The final noise estimate is:
\[ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \]
where \( \gamma \) is the guidance scale (I used \( \gamma = 5 \)). A sketch of this combination step is shown below, followed by the images generated with CFG:
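A minimal sketch of the CFG combination, treating `unet` as a generic callable that returns a noise prediction for a noisy image, a timestep, and a text embedding (the actual DeepFloyd call signature differs):

```python
def cfg_noise_estimate(unet, x_t, t, cond_emb, uncond_emb, gamma=5.0):
    """Blend the prompt-conditioned and unconditional (empty prompt) estimates."""
    eps_c = unet(x_t, t, cond_emb)     # conditioned on the text prompt
    eps_u = unet(x_t, t, uncond_emb)   # conditioned on the null / empty prompt
    return eps_u + gamma * (eps_c - eps_u)
```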
Figure 40: CFG Sample 1.
Figure 41: CFG Sample 2.
Figure 42: CFG Sample 3.
Figure 43: CFG Sample 4.
Figure 44: CFG Sample 5.
Figure 45: CFG Sample 6.
Figure 46: CFG Sample 7.
Figure 47: CFG Sample 8.
Figure 48: CFG Sample 9.
Figure 49: CFG Sample 10.
Progression of Sample 2 during denoising:
Figure 50: Sample 2 Progression Image 7.
Figure 51: Sample 2 Progression Image 6.
Figure 52: Sample 2 Progression Image 5.
Figure 53: Sample 2 Progression Image 4.
Figure 54: Sample 2 Progression Image 3.
Figure 55: Sample 2 Progression Image 2.
Figure 56: Sample 2 Progression Image 1.
I explored image editing by adding varying amounts of noise to the original image and then denoising it. The more noise added, the larger the edit.
Smaller starting indices i_start correspond to more added noise and therefore larger edits, while larger values keep the result closer to the original image; a sketch of this noise-then-denoise procedure is shown below, followed by the results.
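A sketch of the editing procedure, where `timesteps` is assumed to be the strided list of sampling timesteps (noisiest first) and `denoise_from(x, i)` stands in for the iterative CFG denoising loop from the earlier sections:

```python
import torch

def sdedit(im, i_start, timesteps, alphas_cumprod, denoise_from):
    """Noise the clean image to the level of timesteps[i_start], then hand it
    to the usual iterative denoiser. Smaller i_start -> more noise -> bigger edit."""
    t = timesteps[i_start]
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)
    x_t = torch.sqrt(abar_t) * im + torch.sqrt(1 - abar_t) * eps   # forward process
    return denoise_from(x_t, i_start)
```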
Figure 64: Original Campanile.
Figure 70: Campanile at i_start=1.
Figure 69: Campanile at i_start=3.
Figure 68: Campanile at i_start=5.
Figure 67: Campanile at i_start=7.
Figure 66: Campanile at i_start=10.
Figure 65: Campanile at i_start=20.
Figure 71: Campanile reconstruction at i_start=20.
Figure 72: Campanile reconstruction at i_start=10.
Figure 73: Campanile reconstruction at i_start=7.
Figure 74: Campanile reconstruction at i_start=5.
Figure 75: Campanile reconstruction at i_start=3.
Figure 57: Original Airplane.
Figure 58: Airplane at i_start=1.
Figure 59: Airplane at i_start=3.
Figure 60: Airplane at i_start=5.
Figure 61: Airplane at i_start=7.
Figure 62: Airplane at i_start=10.
Figure 63: Airplane at i_start=20.
Figure 64: Original House.
Figure 66: House at i_start=1.
Figure 67: House at i_start=3.
Figure 68: House at i_start=5.
Figure 69: House at i_start=7.
Figure 70: House at i_start=10.
Figure 65: House at i_start=20.
Figure 71: Original Avocado.
Figure 72: Avocado at i_start=1.
Figure 73: Avocado at i_start=3.
Figure 74: Avocado at i_start=5.
Figure 75: Avocado at i_start=7.
Figure 76: Avocado at i_start=10.
Figure 77: Avocado at i_start=20.
Figure 78: Original Surfer image.
Figure 79: Surfer at i_start=1.
Figure 80: Surfer at i_start=3.
Figure 81: Surfer at i_start=5.
Figure 82: Surfer at i_start=7.
Figure 83: Surfer at i_start=10.
Figure 84: Surfer at i_start=20.
Figure 85: Original Tree image.
Figure 86: Tree at i_start=15.
Figure 87: Tree at i_start=10.
Figure 88: Tree at i_start=7.
Figure 89: Tree at i_start=5.
Figure 90: Tree at i_start=3.
Figure 91: Tree at i_start=1.
I implemented inpainting by using a mask to specify regions to edit. By applying the diffusion process only within the masked area, new content is generated while the rest of the image remains unchanged.
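The key step is re-imposing the (appropriately noised) original image outside the mask after every denoising update; a minimal sketch, reusing the `alphas_cumprod` assumption from earlier:

```python
import torch

def force_unmasked_region(x_t, x_orig, mask, t, alphas_cumprod):
    """After each denoising step, keep new content only where mask == 1 and
    paste back the original image (noised to level t) everywhere else."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x_orig)
    x_orig_t = torch.sqrt(abar_t) * x_orig + torch.sqrt(1 - abar_t) * eps
    return mask * x_t + (1 - mask) * x_orig_t
```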
Figure 92: Original Campanile.
Figure 93: Campanile Inpainting Step t3.
Figure 94: Campanile Inpainting Step t4.
Figure 95: Campanile Inpainting Step t6.
Figure 96: Campanile Inpainting Step t9.
Figure 97: Original Campanile.
Figure 98: Campanile Inpainting Step t3.
Figure 99: Campanile Inpainting Step t4.
Figure 100: Campanile Inpainting Step t6.
Figure 101: Campanile Inpainting Step t9.
Figure 102: Original Image.
Figure 103: Mask.
Figure 104: Area to Fill.
Figure 105: Color Painting Step 1.
Figure 106: Color Painting Step 2.
Figure 107: Color Painting Step 3.
Figure 108: Color Painting Step 4.
Figure 109: Final Color Painting Result.
Figure 110: Original Image.
Figure 111: Mask.
Figure 112: Area to Fill.
Figure 113: Final Result.
Figure 114: Inpainting Step 1.
Figure 115: Inpainting Step 2.
Figure 116: Inpainting Step 3.
Figure 117: Inpainting Step 4.
Figure 118: Original Image.
Figure 119: Mask.
Figure 120: Area to Fill.
Figure 121: Final Result.
Figure 122: Inpainting Step 1.
Figure 123: Inpainting Step 2.
Figure 124: Inpainting Step 3.
Figure 125: Inpainting Step 4.
Figure 126: Original Image.
Figure 127: Mask.
Figure 128: Area to Fill.
Figure 129: Final Result.
Figure 130: Original Image.
Figure 131: Mask.
Figure 132: Area to Fill.
Figure 133: Final Result.
Figure 134: Original Image.
Figure 135: Mask.
Figure 136: Area to Fill.
Figure 137: Final Result.
Figure 138: Original Image.
Figure 139: Mask.
Figure 140: Area to Fill.
Figure 141: Final Result.
Figure 142: Original Image.
Figure 143: Mask.
Figure 144: Area to Fill.
Figure 145: Inpainting Step 1.
Figure 146: Inpainting Step 4.
Figure 147: Inpainting Step 7.
Figure 148: Inpainting Step 10.
Figure 149: Final Result.
This section repeats the image-to-image procedure from the previous section, but guides the denoising with a text prompt.
Note: for each image set, the right-most image corresponds to the blurriest one.
Figure 152: Campanile with rocket prompt - Original image.
Figure 153: Campanile with rocket prompt - Stage 7.
Figure 154: Campanile with rocket prompt - Stage 6.
Figure 155: Campanile with rocket prompt - Stage 5.
Figure 156: Campanile with rocket prompt - Stage 4.
Figure 157: Campanile with rocket prompt - Stage 3.
Figure 158: Campanile with rocket prompt - Stage 2.
Figure 159: Francesco with Amalfi Coast prompt - Stage 1.
Figure 160: Francesco with Amalfi Coast prompt - Stage 2.
Figure 161: Francesco with Amalfi Coast prompt - Stage 3.
Figure 162: Francesco with Amalfi Coast prompt - Stage 4.
Figure 163: Francesco with Amalfi Coast prompt - Stage 5.
Figure 164: Francesco with Amalfi Coast prompt - Stage 6.
Figure 165: Francesco with rocket prompt - Stage 1.
Figure 166: Francesco with rocket prompt - Stage 2.
Figure 167: Francesco with rocket prompt - Stage 3.
Figure 168: Francesco with rocket prompt - Stage 4.
Figure 169: Francesco with rocket prompt - Stage 5.
Figure 170: Francesco with rocket prompt - Stage 6.
Figure 171: Francesco with rocket prompt - Stage 7.
Figure 172: Roger Federer with Amalfi Coast prompt - Stage 1.
Figure 173: Roger Federer with Amalfi Coast prompt - Stage 2.
Figure 174: Roger Federer with Amalfi Coast prompt - Stage 3.
Figure 175: Roger Federer with Amalfi Coast prompt - Stage 4.
Figure 176: Roger Federer with Amalfi Coast prompt - Stage 5.
Figure 177: Roger Federer with Amalfi Coast prompt - Stage 6.
Figure 178: Roger Federer with rocket prompt - Stage 1.
Figure 179: Roger Federer with rocket prompt - Stage 2.
Figure 180: Roger Federer with rocket prompt - Stage 3.
Figure 181: Roger Federer with rocket prompt - Stage 4.
Figure 182: Roger Federer with rocket prompt - Stage 5.
Figure 183: Roger Federer with rocket prompt - Stage 6.
I created visual anagrams: images that show one subject upright and a different one when flipped upside down. The algorithm used is:
\[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \] \[ \epsilon = \frac{\epsilon_1 + \epsilon_2}{2} \]
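A sketch of this averaging step, treating `unet` as a generic noise predictor that takes a noisy image, a timestep, and a prompt embedding (`emb_1` and `emb_2` correspond to prompts \( p_1 \) and \( p_2 \)):

```python
import torch

def anagram_noise_estimate(unet, x_t, t, emb_1, emb_2):
    """Visual anagram: denoise the upright image toward prompt p1 and the
    flipped image toward prompt p2, then average the two noise estimates."""
    eps_1 = unet(x_t, t, emb_1)
    flipped = torch.flip(x_t, dims=[-2])                    # flip vertically
    eps_2 = torch.flip(unet(flipped, t, emb_2), dims=[-2])  # flip the estimate back
    return (eps_1 + eps_2) / 2
```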
Here are some examples:
Figure 150: Original Orientation.
Figure 151: Flipped Orientation.
Figure 152: Skull/Amalfi Coast
Figure 153: Man with Hat/Mountain Village
Figure 154: Pencil/Rocket Ship
Figure 155: Hipster Barista/Skull
Figure 156: Amalfi Coast/Campfire
Figure 157: Rocket Ship/Waterfalls
Figure 158: Mountain Village/Waterfalls
Figure 159: Dog/Hipster Barista
I created hybrid images that appear differently when viewed up close versus from a distance. The algorithm is:
\[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{UNet}(x_t, t, p_2) \] \[ \epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2) \]
where \( f_{\text{lowpass}} \) is a Gaussian blur and \( f_{\text{highpass}} \) is a high-pass filter. A sketch of this noise-mixing step is shown below, followed by a hybrid image of a skull and a waterfall:
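In this sketch the high-pass is implemented as the residual of a Gaussian blur; the kernel size and sigma below are illustrative values, not necessarily the ones I used:

```python
import torch
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, emb_far, emb_near, kernel_size=33, sigma=2.0):
    """Hybrid image: low frequencies of the noise estimate follow the prompt
    seen from far away, high frequencies follow the prompt seen up close."""
    eps_1 = unet(x_t, t, emb_far)
    eps_2 = unet(x_t, t, emb_near)
    low = TF.gaussian_blur(eps_1, kernel_size=kernel_size, sigma=sigma)
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```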
Figure 152: Hybrid Image of Skull and Waterfall
Figure 153: Hybrid Image of Skull and Waterfall
Figure 154: Hybrid Image of Skull and Amalfi Coast
Figure 155: Hybrid Image combining a photo of a man and an oil painting of people around a campfire.
Figure 156: Hybrid Image combining a pencil drawing and the Amalfi Coast.
I explored model bias by generating images with the prompt "successful human being". After generating 100 samples, I observed that around 69% of the images depicted males and 31% depicted females. Additionally, 88% of the people were wearing black suits. This analysis highlights potential biases in the model's training data.
Here are some sample images:
Figure 155: Sample 0.
Figure 156: Sample 1.
Figure 157: Sample 2.
Figure 158: Sample 3.
Figure 159: Sample 4.
Figure 160: Sample 5.
Figure 161: Sample 6.
Figure 162: Sample 7.
Figure 163: Sample 8.
Figure 164: Sample 9.
This project provided hands-on experience with diffusion models and their applications in image generation and manipulation. By implementing various techniques and experimenting with different parameters, I gained a deeper understanding of the capabilities and limitations of diffusion models in computer vision tasks.
In this part of the project, I trained my own diffusion model on the MNIST dataset. The goal was to understand the fundamental principles behind diffusion models by implementing them from scratch. This involved building a UNet architecture, formulating the diffusion process mathematically, and iteratively denoising images to generate new samples.
The objective was to train a denoising neural network \( D_{\theta} \) that can map a noisy image \( z \) back to its clean version \( x \). This is formulated as minimizing the following L2 loss function:
\[ L = \mathbb{E}_{z, x} \left\| D_{\theta}(z) - x \right\|^2 \]
To generate the noisy image \( z \), Gaussian noise is added to the clean image \( x \) using a predefined noise level \( \sigma \):
\[ z = x + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]
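This objective can be written in a few lines; `denoiser` stands in for the UNet \( D_{\theta} \):

```python
import torch
import torch.nn.functional as F

def denoising_loss(denoiser, x, sigma=0.5):
    """L2 denoising objective: z = x + sigma * eps, loss = ||D_theta(z) - x||^2."""
    eps = torch.randn_like(x)
    z = x + sigma * eps
    return F.mse_loss(denoiser(z), x)
```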
I implemented a UNet architecture to serve as the denoiser \( D_{\theta} \). The UNet consists of downsampling and upsampling layers with skip connections, allowing it to capture both global and local features.
Key components of the UNet include convolutional blocks, downsampling blocks that halve the spatial resolution, upsampling blocks that restore it, and skip connections that concatenate encoder features onto the corresponding decoder stages. The architecture takes an input image of size \( 28 \times 28 \) and processes it through these layers to reconstruct the denoised image.
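A compact sketch of such a UNet; the channel widths, activations, and number of blocks here are illustrative rather than the exact configuration I trained:

```python
import torch
import torch.nn as nn

class SimpleUNet(nn.Module):
    """Minimal denoiser sketch: a conv encoder that downsamples 28x28 twice,
    a bottleneck, and a decoder with skip connections."""

    def __init__(self, in_ch=1, d=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, d, 3, padding=1), nn.GELU())
        self.down1 = nn.Sequential(nn.Conv2d(d, d, 3, stride=2, padding=1), nn.GELU())      # 28 -> 14
        self.down2 = nn.Sequential(nn.Conv2d(d, 2 * d, 3, stride=2, padding=1), nn.GELU())  # 14 -> 7
        self.up1 = nn.Sequential(nn.ConvTranspose2d(2 * d, d, 4, stride=2, padding=1), nn.GELU())  # 7 -> 14
        self.up2 = nn.Sequential(nn.ConvTranspose2d(2 * d, d, 4, stride=2, padding=1), nn.GELU())  # 14 -> 28
        self.out = nn.Conv2d(2 * d, in_ch, 3, padding=1)

    def forward(self, x):
        h1 = self.enc1(x)                              # (B, d, 28, 28)
        h2 = self.down1(h1)                            # (B, d, 14, 14)
        h3 = self.down2(h2)                            # (B, 2d, 7, 7)
        u1 = self.up1(h3)                              # (B, d, 14, 14)
        u2 = self.up2(torch.cat([u1, h2], dim=1))      # skip connection, (B, d, 28, 28)
        return self.out(torch.cat([u2, h1], dim=1))    # skip connection
```

The skip connections carry fine spatial detail from the encoder straight into the decoder, which makes reconstructing sharp digit strokes much easier.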
Before training the model, I visualized the effect of adding Gaussian noise with different \( \sigma \) values to the MNIST images. The noise levels ranged from 0.0 (no noise) to 1.0 (completely noisy).
Figure 1: MNIST images with varying noise levels \( \sigma \in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0] \).
The denoiser was trained on the MNIST training set for 5 epochs with a fixed noise level \( \sigma = 0.5 \).
The training loss decreased steadily over the epochs, indicating that the model was learning to denoise the images effectively.
Figure 2: Training loss curve over 5 epochs.
I evaluated the denoiser on the test set after the 1st and 5th epochs. The denoised images improved significantly after training.
Figure 3: Denoised images after the 1st and 5th epochs.
To assess the denoiser's robustness, I tested it on images with different noise levels \( \sigma \) that it wasn't trained on. The denoiser performed well for \( \sigma \) close to 0.5 but struggled with higher noise levels.
Figure 4: Denoised images with varying noise levels \( \sigma \).
In this part, I implemented a Denoising Diffusion Probabilistic Model (DDPM) to perform iterative denoising. Unlike the single-step denoiser, DDPM predicts the noise component \( \epsilon \) added to the image at each timestep \( t \).
The training objective is to minimize the following loss function:
\[ L = \mathbb{E}_{\epsilon, x_0, t} \left\| \epsilon_{\theta}(x_t, t) - \epsilon \right\|^2 \]
Where \( \epsilon_{\theta}(x_t, t) \) is the noise predicted by the UNet for the noisy image \( x_t \) at timestep \( t \), and \( \epsilon \) is the true noise that was added to the clean image \( x_0 \).
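One training step under this objective can be sketched as follows; `eps_model(x_t, t)` stands in for the time-conditioned UNet, and the number of timesteps is an assumed value:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, alphas_cumprod, num_timesteps=300):
    """Sample a random timestep and noise, build x_t with the forward process,
    and regress the predicted noise onto the true noise."""
    b = x0.shape[0]
    t = torch.randint(1, num_timesteps, (b,), device=x0.device)
    abar = alphas_cumprod.to(x0.device)[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(abar) * x0 + torch.sqrt(1 - abar) * eps
    return F.mse_loss(eps_model(x_t, t), eps)
```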
To accommodate the DDPM framework, I modified the UNet to accept the timestep \( t \) and the class label \( c \) as additional inputs. These were embedded using fully connected layers and integrated into the network via conditioning mechanisms: the embeddings modulate intermediate feature maps in the decoder, and the class label is randomly dropped during training so the network also learns an unconditional noise estimate for classifier-free guidance.
The forward process involves adding Gaussian noise to the image at each timestep \( t \) using a predefined variance schedule \( \{\beta_t\} \). The noisy image \( x_t \) is sampled as:
\[ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]
Where \( \alpha_t = 1 - \beta_t \) and \( \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \) is the cumulative product of the \( \alpha \) values up to timestep \( t \).
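The schedule can be precomputed once; the linear range of \( \beta_t \) values below is a standard choice and is only meant as an illustration:

```python
import torch

def make_schedule(num_timesteps=300, beta_start=1e-4, beta_end=0.02):
    """Precompute beta_t, alpha_t = 1 - beta_t, and the cumulative products abar_t."""
    betas = torch.linspace(beta_start, beta_end, num_timesteps)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    return betas, alphas, alphas_cumprod
```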
The reverse process aims to iteratively denoise \( x_t \) to recover \( x_0 \). The update rule for \( x_{t-1} \) is given by:
\[ x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \right) + \sigma_t z, \quad z \sim \mathcal{N}(0, I) \]
Where \( \sigma_t \) is derived from the variance schedule. I implemented this sampling procedure to generate new images from pure noise.
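A sketch of the sampling loop with \( \sigma_t = \sqrt{\beta_t} \), again treating `eps_model` as the trained time-conditioned UNet and reusing the schedule tensors from above:

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, betas, alphas, alphas_cumprod, device="cpu"):
    """Start from pure noise and iterate the reverse update down to t = 1."""
    x = torch.randn(shape, device=device)
    for t in range(len(betas) - 1, 0, -1):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = eps_model(x, t_batch)
        coef = (1 - alphas[t]) / torch.sqrt(1 - alphas_cumprod[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * z        # sigma_t = sqrt(beta_t)
    return x
```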
The DDPM was trained using the same dataset and optimizer settings as before, but for a longer duration of 20 epochs to ensure convergence. The training loss decreased over time, indicating successful learning.
Figure 5: Training loss curve over all batches.
Figure 5: Training loss curve over 20 epochs.
To evaluate the model's performance, I generated reconstructions at different epochs to visualize the learning progress.
Figure 7: Reconstructions across epochs.
Figure 6: Reconstructions after 1 epoch.
Figure 9: Reconstructions after 10 epochs.
Figure 8: Reconstructions after 20 epochs, showing improved image quality.
To improve sample quality, I implemented Classifier-Free Guidance. This involves generating two noise estimates: one conditioned on the class label (\( \epsilon_c \)) and one unconditioned (\( \epsilon_u \)). The final noise estimate is computed as:
\[ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \]
Where \( \gamma \) is the guidance scale. I experimented with different guidance scales to observe their effect on the generated images.
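A sketch of the class-conditioned CFG estimate, assuming the unconditional prediction is obtained by zeroing the one-hot class vector (the same null conditioning used when the class label is dropped during training):

```python
import torch

def class_cfg_eps(eps_model, x_t, t, c_onehot, gamma=5.0):
    """Blend class-conditional and unconditional noise estimates."""
    eps_c = eps_model(x_t, t, c_onehot)
    eps_u = eps_model(x_t, t, torch.zeros_like(c_onehot))
    return eps_u + gamma * (eps_c - eps_u)
```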
I generated samples after the 1st, 5th, 10th, 15th, and 20th epochs with a guidance scale of 5. The quality of the generated images improved significantly over the epochs.
Figure 6: Generated samples after 1 epoch.
Figure 7: Generated samples after 5 epochs.
Figure 8: Generated samples after 10 epochs.
Figure 9: Generated samples after 15 epochs.
Figure 10: Generated samples after 20 epochs.
Next, I measured the quality of the generated images with different guidance scales.
Figure 11: Samples with guidance scale \( \gamma = 0 \).
Figure 12: Samples with guidance scale \( \gamma = 5 \).
Figure 13: Samples with guidance scale \( \gamma = 10 \).
Increasing the guidance scale (\( \gamma \)) strengthens the influence of the class conditioning, making the generated samples more closely match the characteristics of their target class labels. This works by interpolating between an unconditional and conditional diffusion model prediction, with higher \( \gamma \) values giving more weight to the conditional prediction. However, setting \( \gamma \) too high (e.g., 10) led to artifacts and overly-exaggerated class features, as the model overemphasizes class-specific attributes at the expense of natural image statistics.
Training a diffusion model from scratch provided deep insights into how these models function at a fundamental level. Implementing the forward and reverse processes, adjusting the UNet architecture, and experimenting with classifier-free guidance highlighted the importance of each component in generating high-quality samples. The project demonstrated the power of diffusion models in generating images and their potential for further research and applications.