Deep learning is a subset of machine learning, which is a subset of artificial intelligence (AI), the technology behind the most exciting capabilities in robotics, natural language processing, image and video recognition, large language models (LLMs), generative AI, etc.
Deep learning algorithms need large amounts of data and substantial computing power to solve complex problems, and they are versatile enough to handle many types of data.
This article explores the diffusion model, a prominent member of the deep learning domain and the driving force behind Stable Diffusion, which is popular and widely used in generative AI today.
Stable Diffusion has been praised for making AI image generation accessible and flexible, becoming one of the key tools for creative professionals and hobbyists working with generative AI.
Before we begin, let’s look at an overview of the Latent Diffusion Model architecture:
What is Deep Learning?
Deep learning is a subfield of machine learning that solves complex problems using artificial neural networks. These neural networks are made up of interconnected nodes arranged in multiple layers that extract features from input data. Large datasets are used to train these models, allowing them to detect patterns and correlations that humans would find difficult or impossible to detect.
The impact of deep learning on artificial intelligence has been substantial. It has paved the way for intelligent systems capable of independent learning, adaptation, and decision-making. Deep learning has led to remarkable advances in many domains, including image and speech recognition, natural language processing, machine translation, text generation, image generation (as reviewed in this article), autonomous driving, and numerous others.
Why Python for Deep Learning?
Python has gained widespread popularity as a programming language due to its versatility and ease of use in diverse domains of computer science, especially in the field of deep learning. Thanks to its extensive range of libraries and frameworks specially tailored for deep learning, Python has emerged as a top choice among many machine learning professionals.
Here are some of the reasons why:
1. Simple to learn and use:
Python is a high-level programming language that is easy to learn and use, even for those who are new to programming. Its concise and uncomplicated syntax makes it easy to write and understand. This allows developers to concentrate on solving problems without worrying about the details of the language.
2. Abundant libraries and frameworks:
Python has a vast ecosystem of libraries and frameworks that cater specifically to deep learning. Some of these libraries include TensorFlow, PyTorch, Keras, and Theano. These libraries provide pre-built functions and modules that simplify the development process, reducing the need to write complex code from scratch.
3. Strong community support:
Python has a large and active community of developers contributing to its development, maintenance, and improvement. This community offers support and guidance to beginners, making it easier to learn and use Python for deep learning.
4. Platform independence:
Python is platform-independent, which means that code written on one platform can be easily executed on another platform without any modification. This makes it easier to deploy deep learning models on different platforms and devices.
5. Easy integration with other languages:
Python can be easily integrated with other programming languages, such as Delphi, C++, and Java, making it ideal for building complex systems that require integrating different technologies.
Overall, Python’s ease of use, abundant libraries and frameworks, strong community support, platform independence, and ease of integration with other languages make it an indispensable tool for machine learning practitioners. Its popularity continues to soar as a result.
What are Diffusion and Latent Diffusion Models?
A diffusion model is a type of generative model in machine learning designed to create data by reversing a noise-adding process. It models the way data can evolve from randomness (pure noise) to meaningful structures, such as images, audio, or other complex data distributions.
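To make the noise-adding process more concrete, here is a minimal sketch in plain Python of the closed-form noising step used in DDPM-style models, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The function name add_noise, the toy four-pixel signal, and the ᾱ values are illustrative assumptions and not part of any library.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Return a noised copy of x0: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, -0.5]  # toy "image" with four pixel values in [-1, 1]

# alpha_bar close to 1 keeps most of the signal; close to 0 leaves almost pure noise.
for alpha_bar in (0.99, 0.5, 0.01):
    xt = add_noise(x0, alpha_bar, rng)
    print(alpha_bar, [round(v, 3) for v in xt])
```

A diffusion model is trained to undo this corruption step by step, which is why generation can start from pure noise.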
The following table shows how diffusion models compare with GANs, another popular family of generative models[8]:
Aspect | Diffusion Models | GANs (Generative Adversarial Networks) |
Training Stability | More stable | Prone to mode collapse |
Output Quality | High detail, fewer artifacts | Sometimes sharper, but less reliable |
Speed | Slower to generate images | Faster at inference |
Mode Coverage | Better at covering the data’s full distribution | GANs may miss some modes |
To dive deeper into GAN, read our previous article below:
On the other hand, a Latent Diffusion Model (LDM) is an advanced type of diffusion model that operates in a compressed (latent) space rather than directly on pixel data, making it more computationally efficient. LDMs, such as Stable Diffusion, enable faster image generation without compromising quality, which is especially useful for large-scale generative tasks like text-to-image synthesis.
The following table shows how LDMs improve over traditional diffusion models[8]:
Aspect | Traditional Diffusion Models | Latent Diffusion Models |
Data Space | Operates directly on pixels | Works in a compressed latent space |
Speed | Slower due to pixel-level steps | Faster due to reduced dimensionality |
Resource Usage | Higher GPU/CPU requirements | More efficient for large-scale models |
Quality | High, but with higher cost | High quality with lower overhead |
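To get a feel for the reduced dimensionality mentioned in the table, the following back-of-the-envelope sketch counts how many values a denoising step has to process in pixel space versus latent space. It assumes a 512×512 RGB image and a latent that is 8 times smaller per side with 4 channels, as in the Stable Diffusion v1 autoencoder; other LDMs may use different factors. The helper name num_values is hypothetical.

```python
def num_values(height, width, channels):
    """Number of scalar values in one image-like tensor."""
    return height * width * channels


pixel_values = num_values(512, 512, 3)   # RGB image in pixel space
latent_values = num_values(64, 64, 4)    # assumed latent: 8x smaller per side, 4 channels

print(pixel_values, latent_values, pixel_values / latent_values)
```

Under these assumptions, every denoising step touches roughly 48 times fewer values, which is where most of the speed and memory savings come from.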
What is Stable Diffusion?
Stable Diffusion is a generative artificial intelligence (generative AI) model that allows us to produce unique, high-quality, or even photorealistic images from text and image prompts[1]. Stable Diffusion leverages the Latent Diffusion model[2][5][10], developed by researchers from the Machine Vision and Learning group at LMU Munich, a.k.a CompVis.
Model checkpoints were publicly released at the end of August 2022 by a collaboration of Stability AI, CompVis, and Runway with support from EleutherAI and LAION[7][9]. For more information, you can check out their official blog post[13][14].
At the time this article was written, Stable Diffusion 3 Medium had already been released. According to Stability AI, Stable Diffusion 3 Medium is the latest and most advanced text-to-image AI model in its Stable Diffusion 3 series, comprising two billion parameters. It excels in photorealism, processes complex prompts, and generates clear text.
Try Stable Diffusion online with a no-code approach
Before we dive deeper into Stable Diffusion with Python, let’s try it online first, using the online Stable Diffusion 2.1 demo:
For faster generation and API access, you can try: DreamStudio Beta.
Or, you can try Playground AI, which enables us to try dozens of different filters and presets, to generate far better outputs:
How do you get started in Stable Diffusion with Python using Hugging Face’s Diffusers library?
The easiest way to get started with Stable Diffusion and other diffusion models with Python is by using Hugging Face’s Diffusers library.
What is 🤗 Diffusers library?
🤗 Diffusers is a leading library for state-of-the-art pre-trained diffusion models, enabling the generation of images, audio, and even 3D structures of molecules. It serves as a modular toolkit suitable both for simple inference tasks and for training your own custom diffusion model.
The 🤗 Diffusers library is designed with a focus on usability over performance, simple over easy, and customizability over abstractions. One goal of the library is to make diffusion models accessible to a wide range of deep learning practitioners.
The underlying model of 🤗 Diffusers library, a neural network, is trained to predict a way to slightly denoise the image in each step. After a certain number of steps, a sample is obtained.
The following is the architecture of the neural network, which commonly follows the U-Net architecture proposed in reference[12] and improved upon in the PixelCNN++ paper:
Some of the highlights of the architecture are:
- this model predicts images of the same size as the input
- the model passes the input image through several blocks of ResNet layers, each of which halves the image size
- then through the same number of blocks that upsample it again
- skip connections link features on the downsample path to corresponding layers in the upsample path.
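The highlights above can be traced with a few lines of plain Python. This is only a sketch of the bookkeeping (spatial sizes and skip connections), not a neural network; the function name trace_unet_resolutions and the choice of three levels are assumptions for illustration.

```python
def trace_unet_resolutions(input_size, num_levels):
    """Track spatial size on the down path, then mirror it on the up path."""
    size = input_size
    skips = []  # resolutions saved for the skip connections
    for _ in range(num_levels):
        skips.append(size)
        size //= 2  # each downsampling block halves height and width
    bottleneck = size

    up_path = []
    for _ in range(num_levels):
        size *= 2  # each upsampling block doubles height and width
        skip_size = skips.pop()  # matching feature map from the down path
        assert skip_size == size  # skip features must share the resolution
        up_path.append(size)
    return bottleneck, up_path


bottleneck, up_path = trace_unet_resolutions(input_size=256, num_levels=3)
print(bottleneck, up_path)
```

The last entry of up_path equals the input size, which is why the model can predict an image of the same size as its input.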
How to install Diffusers on your local machine?
Move to your preferred working directory, then create a new conda virtual environment called “diffusers” with Python 3.10:
    conda create --name diffusers python=3.10
To activate this environment, use:
    conda activate diffusers
To deactivate an active environment, use this command:
    conda deactivate
Before going any further, make sure all the necessary libraries are installed using the following pip command:
    pip install --upgrade diffusers accelerate transformers
Besides Diffusers itself, this command installs the following two libraries:
- Accelerate: To speed up model loading for inference and training.
- Transformers: This is required to run the most popular diffusion models, such as Stable Diffusion.
There are three main components of the library to know about:
1. DiffusionPipeline
The DiffusionPipeline is a high-level end-to-end class designed to rapidly generate samples from popular pre-trained diffusion models for inference, in a user-friendly fashion.
We’ll begin by importing a pipeline. We’ll use the google/ddpm-celebahq-256 model developed by Google and U.C. Berkeley. It uses the Denoising Diffusion Probabilistic Models (DDPM) algorithm and is trained on a dataset of celebrity images.
Hands-on and selected outputs:
The following is a code snippet for the basic use of DiffusionPipeline, followed by an explanation of selected outputs:
    from diffusers import DDPMPipeline

    # Download (or load from cache) the pre-trained pipeline and run it on the CPU
    image_pipe = DDPMPipeline.from_pretrained("google/ddpm-celebahq-256")
    image_pipe.to("cpu")

    # Run the full denoising process; no input is required
    images = image_pipe().images

    # Show an example of the generated image from the Hugging Face Hub (DDPM-CelebHQ)
    images[0].show()

    # Inspect the building blocks of the pipeline (use print(image_pipe) in a script):
    image_pipe
To generate an image, we simply run the pipeline. It does not need any input: it generates a random initial noise sample and then iterates the diffusion process.
The pipeline returns an output object whose images attribute holds the generated samples:
Let’s take a look at the image by running images[0].show() on PyScripter IDE:
Run image_pipe on PyScripter to see what the pipeline is made of, so we can better understand what is going on under the hood:
Now we can see what’s inside the pipeline: A scheduler and a UNet model. Let’s look closely at them and what this pipeline just did under the hood.
2. Pretrained models
Popular pre-trained model architectures and modules can be used as building blocks for creating diffusion systems.
Instances of the model class are neural networks that take a noisy sample and a timestep as inputs to predict a less noisy output sample. In this subsection, we’ll load a pre-trained model and play around with it to understand the model API. We’ll load a simple unconditional image generation model of type UNet2DModel, which was released with the DDPM paper[4], and, as an example, take a look at another checkpoint trained on church images: google/ddpm-church-256.
Hands-on and selected outputs
The following is a code snippet for the basic use of the models, followed by an explanation of selected outputs:
    import os
    os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'

    from diffusers import UNet2DModel

    repo_id = "google/ddpm-church-256"
    model = UNet2DModel.from_pretrained(repo_id, use_safetensors=False)

    # Inspect the architecture and its configuration (evaluate these in the interpreter)
    model
    model.config

    # Create a randomly initialized model with the same configuration and save it
    model_random = UNet2DModel(**model.config)
    model_random.save_pretrained("my_model")

    # Add random gaussian sample
    import torch

    torch.manual_seed(0)

    noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
    noisy_sample.shape

    # Inference
    with torch.no_grad():
        noisy_residual = model(sample=noisy_sample, timestep=2).sample
Now let’s take a look at the model’s configuration. By accessing the config attribute using model.config on PyScripter IDE, we can browse all the necessary parameters to define the model architecture:
You can access the complete output of model and model.config in the repository [3].
A couple of important config parameters are:
- sample_size: defines the height and width dimension of the input sample.
- in_channels: defines the number of input channels of the input sample.
- down_block_types and up_block_types: define the type of down- and upsampling blocks that are used to create the UNet architecture, as seen in the architecture figure earlier in this article.
- block_out_channels: defines the number of output channels of the downsampling blocks, also used in reversed order for the number of input channels of the upsampling blocks.
- layers_per_block: defines how many ResNet blocks are present in each UNet block.
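As a rough illustration of how these parameters relate to each other, here is a small sketch using a plain dictionary in place of model.config. The values are illustrative assumptions and are not read from any checkpoint; inspect model.config for the real ones.

```python
# Hypothetical configuration values, for illustration only.
config = {
    "sample_size": 256,
    "in_channels": 3,
    "block_out_channels": (128, 128, 256, 256, 512, 512),
    "layers_per_block": 2,
}

down_channels = list(config["block_out_channels"])
up_in_channels = list(reversed(down_channels))  # the up path mirrors the down path

# Shape of one input sample: (batch, channels, height, width)
input_shape = (1, config["in_channels"], config["sample_size"], config["sample_size"])

# ResNet blocks on the down path if every block holds layers_per_block of them
resnet_blocks_on_down_path = config["layers_per_block"] * len(down_channels)

print(input_shape)
print(down_channels, up_in_channels, resnet_blocks_on_down_path)
```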
Coming back to the trained model, let’s now see how you can use the model for inference. First, you need a random Gaussian sample in the shape of an image (batch_size × in_channels × sample_size × sample_size). There is a batch axis because a model can receive multiple random noises, and a channel axis because each sample consists of multiple channels (such as red-green-blue). Finally, sample_size corresponds to the height and width. Let’s check the shape of the input using noisy_sample.shape:
The predicted noisy_residual has the exact same shape as the input, and we use it to compute a slightly less noisy image. Let’s confirm the output shapes match using noisy_residual.shape:
3. Schedulers
Schedulers are algorithms wrapped in a Python class. They define the noise schedule, which is used to add noise to the samples during training, and they also define the algorithm that computes the slightly less noisy sample given the model output (noisy_residual). This article only focuses on how to use scheduler classes for inference.
We will use DDPMScheduler, the denoising algorithm proposed in the DDPM paper[4].
Hands-on and selected outputs
The following is a code snippet for the basic use of the schedulers, followed by an explanation of selected outputs:
    import os
    os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'

    from diffusers import UNet2DModel
    from diffusers import DDPMScheduler

    repo_id = "google/ddpm-ema-church-256"
    model = UNet2DModel.from_pretrained(repo_id, use_safetensors=False)

    scheduler = DDPMScheduler.from_pretrained(repo_id)
    scheduler.config

    # Save the scheduler configuration and load it back
    scheduler.save_config("my_scheduler")
    new_scheduler = DDPMScheduler.from_pretrained("my_scheduler")

    # Add random gaussian sample
    import torch

    torch.manual_seed(0)

    noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
    noisy_sample.shape

    # Inference
    with torch.no_grad():
        noisy_residual = model(sample=noisy_sample, timestep=2).sample

    # One scheduler step: compute a slightly less noisy sample from the model output
    less_noisy_sample = scheduler.step(model_output=noisy_residual, timestep=2, sample=noisy_sample).prev_sample
    less_noisy_sample.shape

    # Define the denoising loop
    import PIL.Image
    import numpy as np

    def display_sample(sample, i):
        # Move channels last: (batch, channels, height, width) -> (batch, height, width, channels)
        image_processed = sample.cpu().permute(0, 2, 3, 1)
        # Map [-1, 1] to [0, 255]; clamp so out-of-range values do not wrap around as uint8
        image_processed = ((image_processed + 1.0) * 127.5).clamp(0, 255)
        image_processed = image_processed.numpy().astype(np.uint8)

        image_pil = PIL.Image.fromarray(image_processed[0])
        print(f"Image at step {i}")
        image_pil.show()

    # Display the progress, at every 50th step
    import tqdm

    sample = noisy_sample

    for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)):
        # 1. predict noise residual
        with torch.no_grad():
            residual = model(sample, t).sample

        # 2. compute less noisy image and set x_t -> x_t-1
        sample = scheduler.step(residual, t, sample).prev_sample

        # 3. optionally look at image
        if (i + 1) % 50 == 0:
            display_sample(sample, i + 1)
Let’s take a look at the scheduler configuration by running scheduler.config on PyScripter IDE:
Different schedulers are usually defined by different parameters. The following are the most important ones that we need to know:
- num_train_timesteps: defines the length of the denoising process, i.e., how many timesteps are needed to process random Gaussian noise to a data sample.
- beta_schedule: defines the type of noise schedule that shall be used for inference and training.
- beta_start and beta_end: define the smallest and highest noise values of the schedule.
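To see how these parameters turn into actual numbers, here is a minimal sketch in plain Python of a linear noise schedule. The values (1000 timesteps, beta_start of 1e-4, beta_end of 0.02) are assumptions chosen to match the commonly used DDPM defaults, so check scheduler.config for the values your checkpoint really uses. The function names are hypothetical and not part of Diffusers.

```python
def linear_beta_schedule(num_train_timesteps, beta_start, beta_end):
    """Evenly spaced noise levels from beta_start to beta_end."""
    step = (beta_end - beta_start) / (num_train_timesteps - 1)
    return [beta_start + i * step for i in range(num_train_timesteps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t (remaining signal variance)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


betas = linear_beta_schedule(num_train_timesteps=1000, beta_start=1e-4, beta_end=0.02)
alpha_bars = cumulative_alphas(betas)

print(betas[0], betas[-1])
print(alpha_bars[0], alpha_bars[-1])
```

The cumulative product shrinks towards zero at the last timestep, which means that almost nothing of the original image is left there and the sample is close to pure Gaussian noise.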
We’ll try to use the model output from the previous section. We can see that the computed sample has the exact same shape as the model input, which means that we are ready to pass it to the model again in the next step.
The last step is to bring it all together and define the denoising loop. This loop displays the progressively less noisy samples along the way, so we can visualize the denoising process.
In the code above, we already defined a display function that post-processes the denoised image, converts it to a PIL.Image, and displays it. Here is the output of the 50th step:
It takes quite some time before a meaningful shape appears; one can be seen after 800 steps. Here is the final result, the 1000th step:
By saving the image results after the 50th step and its multiples until the 1000th step and aggregating them, we can see the following denoising progress:
Isn’t it amazing? We should now have a solid foundational understanding of the schedulers and all other components of the 🤗 Diffusers library. The key points to keep in mind are:
1. Schedulers have no trainable weights (parameter-free).
2. During inference, schedulers specify the algorithm that computes the slightly less noisy sample.
To end the subsection about models and schedulers, please also note that the Diffusers library deliberately keeps models and schedulers as independent from each other as possible. This means a scheduler should never accept a model as an input and vice versa. The model predicts the noise residual or slightly less noisy image with its trained weights, while the scheduler computes the previous sample given the model’s output.
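The separation between model and scheduler can be sketched in plain Python. Everything here is an assumption for illustration: TinyScheduler is a hypothetical class (not the Diffusers API), its step computes only the DDPM posterior mean and omits the random noise term, and the "model" is an oracle that knows the clean data instead of a trained network.

```python
import math
import random


class TinyScheduler:
    """Parameter-free scheduler: holds only the noise schedule, no trainable weights."""

    def __init__(self, betas):
        self.betas = betas
        self.alpha_bars = []
        running = 1.0
        for beta in betas:
            running *= 1.0 - beta
            self.alpha_bars.append(running)
        # Timesteps run from the noisiest step down to 0.
        self.timesteps = list(range(len(betas) - 1, -1, -1))

    def step(self, model_output, t, sample):
        """DDPM posterior mean for x_{t-1}; the random noise term is omitted for clarity."""
        beta, alpha_bar = self.betas[t], self.alpha_bars[t]
        scale = beta / math.sqrt(1.0 - alpha_bar)
        return [(x - scale * eps) / math.sqrt(1.0 - beta)
                for x, eps in zip(sample, model_output)]


def make_oracle_model(x0, alpha_bars):
    """Stand-in for a trained network: returns the exact noise, given a known x0."""
    def model(sample, t):
        a = alpha_bars[t]
        return [(x - math.sqrt(a) * v) / math.sqrt(1.0 - a) for x, v in zip(sample, x0)]
    return model


x0 = [0.8, -0.3, 0.1]
scheduler = TinyScheduler(betas=[1e-4 + i * 0.02 for i in range(10)])
model = make_oracle_model(x0, scheduler.alpha_bars)

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in x0]  # start from pure noise

# The loop only passes values between the two; neither object holds the other.
for t in scheduler.timesteps:
    residual = model(sample, t)                   # model: predict the noise
    sample = scheduler.step(residual, t, sample)  # scheduler: compute x_{t-1}

print([round(v, 6) for v in sample])
```

Because the scheduler only sees the model’s output, swapping in a different scheduler does not require touching the model, and the same holds the other way around.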
How to perform text-to-image using Stable Diffusion on 🤗 Diffusers?
In this section, we will try text-to-image, or image generation from text, using the Diffusers library. We will try it using two different classes: DiffusionPipeline and StableDiffusionPipeline.
Using DiffusionPipeline
Load the model with the from_pretrained() method:
    from diffusers import DiffusionPipeline

    pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
The DiffusionPipeline downloads and caches all modeling, tokenization, and scheduling components.
You’ll see that the Stable Diffusion pipeline is composed of the UNet2DConditionModel and PNDMScheduler, among other things:
The following is the complete code example to generate an image from a text prompt using Diffusers:
    from diffusers import DiffusionPipeline

    pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)

    # Generate one image from the text prompt, then display and save it
    image = pipeline("A painting of a cat playing bass guitar").images[0]
    image
    image.save("painting_of_cat_playing_bass2.png")
When run on PyScripter IDE, the “A painting of a cat playing bass guitar” prompt generates the following output:
Using StableDiffusionPipeline
In this second example, we will generate an image from a text prompt, directly using StableDiffusionPipeline from the diffusers library.
Run the following code and your prompt on PyScripter IDE:
    import torch
    from diffusers import StableDiffusionPipeline

    model_id = "runwayml/stable-diffusion-v1-5"
    pipe = StableDiffusionPipeline.from_pretrained(model_id)

    prompt = "A man coding on his laptop"
    pipe = pipe.to("cpu")

    # A seeded generator makes the result reproducible
    generator = torch.Generator("cpu").manual_seed(0)

    image = pipe(prompt, generator=generator).images[0]
    image
    image.save("a_man_coding_on_his_laptop.png")
Other interesting implementations
The advancement of generative AI, particularly Stable Diffusion, enables us to do creativity-demanding tasks such as creating video art with just a few clicks. Below are videos I generated using combinations of text-to-image, image-to-image, frame interpolation, and text/image-to-video with Playground AI and Runway ML, all of which are rooted in Stable Diffusion.
Text/image to video
Text/image to video is a multimodal AI system that can generate novel videos from text, images, or video clips.
Here is the collection of 11 short videos (4 seconds each) that I generated or animated from existing images with additional guidance from text prompts, with help from Runway ML:
The following are the prompts I used to guide the image-to-video generation process:
1. A cat playing bass guitar
2. An astronaut working on ISS
3. Human and robot handshake
4. Jackson Pollock's No. 1 (Lavender Mist) live drip painting
5. Typing on keyboard
6. A woman working with her laptop
7. Two software developers brainstorming
8. The startup founder gives a presentation
9. A PhD student teaches us about her research
10. A young mathematician struggling to solve equations on a blackboard
11. An alien gray staring at us
Text-to-image + frame interpolation
Frame interpolation is a technique to turn a sequence of images into an animated video, by filling in between images with smooth transitions.
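The idea of filling in frames between two images can be sketched with a plain linear cross-fade. This is the simplest possible stand-in, assumed here purely for illustration: production tools typically estimate motion with learned models rather than blending pixel values, and the function name interpolate_frames and the toy three-pixel frames are hypothetical.

```python
def interpolate_frames(frame_a, frame_b, num_inbetween):
    """Return num_inbetween frames that linearly blend frame_a into frame_b."""
    frames = []
    for k in range(1, num_inbetween + 1):
        w = k / (num_inbetween + 1)  # blend weight, strictly between 0 and 1
        frames.append([(1.0 - w) * a + w * b for a, b in zip(frame_a, frame_b)])
    return frames


frame_a = [0.0, 100.0, 200.0]   # toy frames: three grayscale pixel values each
frame_b = [100.0, 100.0, 0.0]

for frame in interpolate_frames(frame_a, frame_b, num_inbetween=3):
    print(frame)
```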
First, I generated 120 images from unusual, obscure, and complex text prompts using Playground AI. The following are the prompts I used to generate the images:
1. Photographs of Space-Time Anomalies
2. Illustration of a future Artificial General Intelligence (AGI)
3. Illustration of deep learning and AI community
4. Very advanced alien civilizations that live inside black hole
5. Vibrating membrane from brane theory and m-theory
6. First ever photo of an atom, First ever photo of a proton taken using electron microscope
7. Deep sea, Deep sea with deep sea creatures, Deep sea monsters
8. A battalion of military robots
9. Dream of the Future of Humanity: Interplanetary, Interstellar, and Intergalactic Colony
10. Surface of exoplanet
11. Reimagine newton's apple and universal law of gravitation
12. Draw Feynman diagram in artistic but scientifically formal way
Then, I created a video by automatically generating smooth transitions between those images using frame interpolation from Runway ML:
Conclusion
In conclusion, leveraging Python for deep learning with diffusion models unlocks immense potential for generative AI, and Stable Diffusion is a perfect example. These models, particularly Latent Diffusion Models (LDMs), have revolutionized the field by combining computational efficiency with high-quality outputs, enabling accessible and versatile applications such as text-to-image synthesis.
We’ve also covered the hands-on side of diffusion models, using Hugging Face’s 🤗 Diffusers library with Python to draw pictures from random noise (we explored pre-trained models, customizable pipelines, and efficient inference methods to draw a realistic church) and to perform text-to-image synthesis.
As we delve deeper into these cutting-edge advancements, diffusion models continue to shape the future of AI-driven innovation, empowering diverse domains ranging from art and design to scientific discovery.
I hope this article has given you a comprehensive and accessible introduction to diffusion models, along with a solid understanding of how to apply them to your own domains and project goals, and that it inspires you to learn more and experiment with diffusion models yourself.
Check out the full repository here: github.com/Embarcadero/DL_Python07_DiffusionModels
Click here to get started with PyScripter, a free, feature-rich, and lightweight Python IDE.
Download RAD Studio to create more powerful Python GUI Windows Apps in 5x less time.
Check out Python4Delphi, which makes it simple to create Python GUIs for Windows using Delphi.
Also, look into DelphiVCL4Python, which makes it simple to create Windows GUIs with Python.
References & further readings
[1] Amazon Web Services, Inc. (2024).
What is Stable Diffusion? AWS What is. aws.amazon.com/what-is/stable-diffusion
[2] Hakim, M. A. (2023).
How to bypass the ChatGPT information cutoff? A busy (or lazy) man guide to read more recent ML/DL papers. Paper-001: “Rombach et al., 2022”. hkaLabs blog. hkalabs.com/blog/how-to-read-deep-learning-papers-using-bing-chat-ai-001
[3] Hakim, M. A. (2024).
Article47 – Deep Learning Python 07 – Diffusion Models. embarcaderoBlog-repo. GitHub repository. github.com/MuhammadAzizulHakim/embarcaderoBlog-repo/tree/main/Article47%20-%20Deep%20Learning%20Python%2007%20-%20Diffusion%20Models
[4] Ho, J., Jain, A., & Abbeel, P. (2020).
Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840-6851.
[5] Hugging Face. (2024).
Diffusers. Hugging Face docs. huggingface.co/docs/diffusers/index
[6] Hugging Face. (2023).
diffusers_intro.ipynb: Introducing Hugging Face’s new library for diffusion models. Hugging Face GitHub repository. colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb
[7] Hugging Face. (2023).
The Stable Diffusion Guide. Hugging Face docs. huggingface.co/docs/diffusers/v0.13.0/en/stable_diffusion
[8] OpenAI. (2024).
ChatGPT (Nov version) [Large language model]. chat.openai.com/chat
[9] Patil, S., Cuenca, P., Lambert, N., & von Platen, P. (2024).
Stable Diffusion with 🧨 Diffusers. Hugging Face blog. huggingface.co/blog/stable_diffusion
[10] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022).
High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695).
[11] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022).
latent-diffusion: High-Resolution Image Synthesis with Latent Diffusion Models. CompVis – Computer Vision and Learning LMU Munich. GitHub repository. github.com/CompVis/latent-diffusion
[12] Ronneberger, O., Fischer, P., & Brox, T. (2015).
U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18 (pp. 234-241). Springer International Publishing.
[13] Stability AI. (2024).
Stability AI: Activating humanity’s potential through generative AI. Stability AI official website. stability.ai
[14] Stability AI. (2024).
Stable Diffusion Public Release. Stability AI news. stability.ai/news/stable-diffusion-public-release
[15] Towards AI Editorial Team. (2023).
Diffusion Models vs. GANs vs. VAEs: Comparison of Deep Generative Models. Towards AI blog. towardsai.net/p/generative-ai/diffusion-models-vs-gans-vs-vaes-comparison-of-deep-generative-models