CS180 Project 3

Rishi Nath 2024

A.0 Setup

I generated images from the three provided text prompts using the DeepFloyd IF diffusion model, with the random seed seed=1444; each image was generated with num_inference_steps=20. Only the pre-upsampled (stage 1) images are displayed here. A rough sketch of the generation code appears at the end of this section.

"a man wearing a hat"

"an oil painting of a snowy mountain village"

"a rocket ship"

I also generated the same rocket ship prompt with more inference steps.

"a rocket ship"

num_inference_steps=20

"a rocket ship"

num_inference_steps=40
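
For reference, here is a minimal sketch of how the images above were generated, assuming the diffusers DeepFloyd IF pipeline API (the model id and exact call pattern are assumptions, not a transcript of my actual notebook):

    import torch
    from diffusers import DiffusionPipeline

    # Assumed model id/API; the report above only fixes the seed and step count.
    stage_1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
    )
    generator = torch.Generator().manual_seed(1444)  # seed = 1444, as above
    image = stage_1(
        "a rocket ship",
        generator=generator,
        num_inference_steps=20,  # 40 for the second rocket ship run
    ).images[0]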

A.1.1 Implementing the Forward Process

I used the given formula (A.2), x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps with eps ~ N(0, I), to noise the image at noise levels t = 250, 500, and 750 (a code sketch follows the figures below):

Original Image

t=250

t=500

t=750
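
A minimal sketch of the forward process, assuming alphas_cumprod is the scheduler's cumulative product of alphas (the function and variable names here are mine):

    import torch

    def forward(im, t, alphas_cumprod):
        # Equation (A.2): x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
        a_bar = alphas_cumprod[t]
        eps = torch.randn_like(im)  # eps ~ N(0, I)
        return torch.sqrt(a_bar) * im + torch.sqrt(1 - a_bar) * eps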

A.1.2 Classical Denoising

I used torchvision.transforms.functional.gaussian_blur to denoise classically; here, k is the kernel size (the kernel is k x k). In retrospect, I could have used a much larger k.

t=250

t=500

t=750

k=3

k=5

k=7
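
The baseline itself is a single library call; roughly (the variable names are illustrative):

    from torchvision.transforms.functional import gaussian_blur

    # Classical baseline: blur the noisy image with a k x k Gaussian kernel.
    denoised = gaussian_blur(noisy_im, kernel_size=k)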

A.1.3 One-step denoising

From equation (A.2), I derived the following relation to predict x_0: x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t). We use stage_1.unet (with the prompt "a high quality photo") to estimate eps from x_t given t, then use this relation to recover an estimate of x_0 (see the code sketch after the figures below).

t=250

t=500

t=750
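
In code, the recovery step looks roughly like this sketch, where eps is the UNet's noise estimate (names are mine):

    import torch

    def one_step_denoise(x_t, t, eps, alphas_cumprod):
        # Invert (A.2): x_0 = (x_t - sqrt(1 - a_bar_t) * eps) / sqrt(a_bar_t)
        a_bar = alphas_cumprod[t]
        return (x_t - torch.sqrt(1 - a_bar) * eps) / torch.sqrt(a_bar)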

A.1.4 Iterative denoising

I implemented iterative denoising as described in the project spec. Essentially, we can predict x_t' from x_t, where t' < t (i.e., t is the noisier level), using the given equation (A.3); a code sketch of the update follows the figures below. The iterative denoising algorithm I used has a stride of 30, meaning that at each step t - t' = 30. Here are some intermediate results from the algorithm:

t=690

t=540

t=390

t=240

t=90
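
The per-step update, as I understand equation (A.3), looks roughly like the sketch below (the added-noise term v_sigma from the spec is omitted for brevity; names are mine):

    import torch

    def iterative_step(x_t, t, t_prime, eps, alphas_cumprod):
        a_bar_t, a_bar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
        alpha = a_bar_t / a_bar_tp  # effective per-step alpha for this stride
        beta = 1 - alpha
        # Current clean-image estimate, exactly as in A.1.3:
        x0_est = (x_t - torch.sqrt(1 - a_bar_t) * eps) / torch.sqrt(a_bar_t)
        # Interpolate between the clean estimate and the current noisy image:
        return (torch.sqrt(a_bar_tp) * beta / (1 - a_bar_t)) * x0_est \
            + (torch.sqrt(alpha) * (1 - a_bar_tp) / (1 - a_bar_t)) * x_t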

Here are the results from the methods from the previous sections on this noise level; side-by-side for comparison purposes:

Original Image

Gaussian Denoised

One-step Denoised

Iteratively Denoised

A.1.5 Diffusion Model Sampling

Here, I used the iterative_denoise function described above, setting i_start = 0 and feeding it pure noise. We're still using the "a high quality photo" prompt.

A.1.6 Classifier-Free Guidance (CFG)

Now, we apply CFG. Essentially, we add a scaled difference between a prompted (conditional) noise estimate and an unprompted (unconditional) noise estimate to the final noise estimate. As before, we generate images by passing the new iterative_denoise_cfg function i_start=0 and pure noise.
The images generated with CFG are certainly higher quality, at least in terms of their detail and vibrancy.
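
The CFG combination itself is a one-liner; a sketch (the guidance scale gamma = 7 follows my reading of the spec, and the variable names are mine):

    # Classifier-free guidance: extrapolate from the unconditional noise
    # estimate toward the conditional one; gamma > 1 strengthens the prompt.
    gamma = 7
    eps = eps_uncond + gamma * (eps_cond - eps_uncond)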

A.1.7.0 Image-to-image Translation

Here we use iterative_denoise_cfg to perform image-to-image translation, following the "SDEdit" algorithm. We run SDEdit on some images, with i_start indicated in the captions:
i_start = 1 i_start = 3 i_start = 5 i_start = 7 i_start = 10 i_start = 20 Original Image
i_start = 1 i_start = 3 i_start = 5 i_start = 7 i_start = 10 i_start = 20 Original Image
i_start = 1 i_start = 3 i_start = 5 i_start = 7 i_start = 10 i_start = 20 Original Image

A.1.7.1 Image-to-image Translation

The results of image-to-image translation on two web images (the drawing of Gandalf and the eagle, and the famous painting "Son of Man" by René Magritte) are shown above. Here are the results on a poorly hand-drawn tomato:
i_start = 1 i_start = 3 i_start = 5 i_start = 7 i_start = 10 i_start = 20 Original Image

A.1.7.2 Inpainting

Following the description in the spec, I implemented inpainting. Essentially, we create an image mask; outside the mask we force the pixels to keep the same values as the original image, while inside it we reuse our earlier denoising functions.
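Per denoising step, the masked update is roughly the following, reusing the hypothetical forward() sketch from A.1.1 (m is 1 inside the region to regenerate):

    # Pin everything outside the mask to the (re-noised) original image;
    # only the region where m == 1 is actually generated.
    x_t = m * x_t + (1 - m) * forward(orig_im, t, alphas_cumprod)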
Original Image Mask Inpainting Target Inpainted Reverse Inpainted
In the rightmost image, I inverted the mask and applied the same inpainting algorithm. Here are some more inpainting results. I really liked the Apollo inpainting! I was very pleasantly surprised by the model's "creativity". The dog also fits the scene thanks to the shadow cast by the lunar lander.
Original Image Mask Inpainting Target Inpainted
Original Image Mask Inpainting Target Inpainted

A.1.7.3 Text-Conditional Image-to-image Translation

We again use SDEdit, except instead of restricting ourselves to the prompt "a high quality photo", we use the model to guide the transformed images with various prompts. See A.1.7.0 for a brief explanation of SDEdit and i_start. Here is the Campanile edited to look like "a rocket ship":
i_start = 1 i_start = 3 i_start = 5 i_start = 7 i_start = 10 i_start = 20 Original Image
Here is the "Son of Man" painting edited to look like "a rocket ship":
i_start = 1 i_start = 3 i_start = 5 i_start = 7 i_start = 10 i_start = 20 Original Image
Here is the Gandalf & Eagle drawing edited to look like "a photo of a dog" (warning, cursed results...):
i_start = 1 i_start = 3 i_start = 5 i_start = 7 i_start = 10 i_start = 20 Original Image

A.1.8 Visual Anagrams

Here, we simultaneously denoise with a second text prompt applied to the image flipped upside down, leading to some interesting results; a code sketch of the combined noise estimate follows the figures below.
"an oil painting of people around a campfire" "an oil painting of an old man"
"a photo of a hipster barista" "a photo of the amalfi cost" (sic)
"a lithograph of waterfalls" "a lithograph of a skull"

A.1.9 Hybrid Images

Similarly to A.1.8, we denoise two prompts simultaneously; this time, instead of flipping, one prompt drives the low frequencies and the other the high frequencies, similar to Project 2 (a code sketch follows the figures below). For each image, the first listed prompt is the low-frequency prompt, and the second is the high-frequency prompt.

"a lithograph of a skull"

"a lithograph of waterfalls"

"an oil painting of an old man"

"an oil painting of people around a campfire"

"a photo of a dog"

"a photo of the amalfi cost" (sic)

B.1 Training a Single-Step Denoising Unet

The UNet is tasked with denoising MNIST images; noise corresponds to a sigma level between 0 and 1. I implemented the UNet exactly as specified by the project, and I found the provided architecture diagram very helpful. I also trained it with the hyperparameters provided in the project spec, so I will just display the results after 1 epoch and after 5 epochs of training, respectively, along with the overall training loss across all 5 epochs.

The model was only trained at noise level sigma=0.5. Here is how it performs when denoising other noise levels:
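
Unlike Part A, the noising here is plain additive Gaussian noise rather than the DDPM schedule; a sketch (names are mine):

    import torch

    # z = x + sigma * eps: additive Gaussian noise at level sigma in [0, 1].
    z = x + sigma * torch.randn_like(x)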

B.2 Training a Diffusion Model

Again, I followed the spec exactly, so I'll just display my results after 1, 5, and 20 epochs (the captions are 0-indexed, sorry!), along with more samples at 20 epochs and the overall training loss across all 20 epochs: