
Unlock the Power of Python for Deep Learning with Diffusion Model – The Engine behind Stable Diffusion


Deep learning is a subset of machine learning, which is itself a subset of artificial intelligence (AI). It is the technology behind many of the most exciting capabilities in robotics, natural language processing, image and video recognition, large language models (LLMs), and generative AI.

Deep learning algorithms need large amounts of data and substantial computing power to solve complex problems, and they can handle many types of data.

This article explores the diffusion model, a prominent member of the deep learning family and the engine behind Stable Diffusion, one of the most widely used generative AI tools today.

Stable Diffusion has been praised for making AI image generation accessible and flexible, becoming one of the key tools for creative professionals and hobbyists working with generative AI.

Before we begin, here is an overview of the Latent Diffusion Model architecture:

[Figure: Overview of the Latent Diffusion architecture. Image source: reference [8].]


What is Deep Learning?

Deep learning is a subfield of machine learning that solves complex problems using artificial neural networks. These neural networks are made up of interconnected nodes arranged in multiple layers that extract features from input data. Large datasets are used to train these models, allowing them to detect patterns and correlations that humans would find difficult or impossible to detect.

The impact of deep learning on artificial intelligence has been substantial. It has paved the way for intelligent systems capable of independent learning, adaptation, and decision-making. Deep learning has led to remarkable advances in many domains, including image and speech recognition, natural language processing, machine translation, text generation, image generation (the subject of this article), autonomous driving, and numerous others.

[Figure: Example of an AI-generated image from the Stable Diffusion XL model, which I generated using the prompt “Illustration of a future Artificial General Intelligence AGI”.]

Why Python for Deep Learning?

Python has gained widespread popularity as a programming language due to its versatility and ease of use in diverse domains of computer science, especially in the field of deep learning. Thanks to its extensive range of libraries and frameworks specially tailored for deep learning, Python has emerged as a top choice among many machine learning professionals.

Here are some of the reasons why:

1. Simple to learn and use:

Python is a high-level programming language that is easy to learn and use, even for those who are new to programming. Its concise and uncomplicated syntax makes it easy to write and understand. This allows developers to concentrate on solving problems without worrying about the details of the language.

2. Abundant libraries and frameworks:

Python has a vast ecosystem of libraries and frameworks that cater specifically to deep learning. Some of these libraries include TensorFlow, PyTorch, Keras, and Theano. These libraries provide pre-built functions and modules that simplify the development process, reducing the need to write complex code from scratch.

3. Strong community support:

Python has a large and active community of developers contributing to its development, maintenance, and improvement. This community offers support and guidance to beginners, making it easier to learn and use Python for deep learning.

4. Platform independence:

Python is platform-independent, which means that code written on one platform can usually be executed on another with little or no modification. This makes it easier to deploy deep learning models on different platforms and devices.

5. Easy integration with other languages:

Python can be easily integrated with other programming languages, such as Delphi, C++, and Java, making it ideal for building complex systems that require integrating different technologies.

Overall, Python’s ease of use, abundance of libraries and frameworks, strong community support, platform independence, and ease of integration with other languages make it an indispensable tool for machine learning practitioners. Its popularity continues to soar as a result.

What are Diffusion and Latent Diffusion Models?

A diffusion model is a type of generative model in machine learning designed to create data by reversing a noise-adding process. It models the way data can evolve from randomness (pure noise) to meaningful structures, such as images, audio, or other complex data distributions.
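To make the noise-adding direction concrete, here is a minimal sketch in plain Python. It assumes the closed-form noising rule from the DDPM paper[4], x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, where ᾱ_t is the fraction of signal variance that survives at step t. The function name add_noise and the four-value toy signal are illustrative only and not part of any library.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Blend a clean signal with Gaussian noise.

    alpha_bar is the fraction of signal variance kept: 1.0 returns the
    clean signal unchanged, 0.0 returns pure noise.
    """
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    return [keep * x + spread * rng.gauss(0.0, 1.0) for x in x0]


rng = random.Random(0)          # fixed seed so runs are repeatable
clean = [1.0, -1.0, 0.5, 0.0]   # toy "image" with four pixels

# Less signal survives as alpha_bar shrinks from 1.0 to 0.0.
for alpha_bar in (1.0, 0.5, 0.0):
    noisy = add_noise(clean, alpha_bar, rng)
    print(alpha_bar, [round(v, 2) for v in noisy])
```

A trained diffusion model learns to undo this corruption one small step at a time, which is the "reversing" described above.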

The following table shows how diffusion models compare with GANs, another family of generative models[8]:

| Aspect | Diffusion Models | GANs (Generative Adversarial Networks) |
| --- | --- | --- |
| Training Stability | More stable | Prone to mode collapse |
| Output Quality | High detail, fewer artifacts | Sometimes sharper, but less reliable |
| Speed | Slower to generate images | Faster at inference |
| Mode Coverage | Better at covering the data’s full distribution | May miss some modes |

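To dive deeper into GANs, read our previous article, “Unlock the Power of Python for Deep Learning with Generative Adversarial Networks (GANs) - The Engine behind DALL-E”.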

On the other hand, a Latent Diffusion Model (LDM) is an advanced type of diffusion model that operates in a compressed (latent) space rather than directly on pixel data, making it more computationally efficient. LDMs, such as Stable Diffusion, enable faster image generation without compromising quality, which is especially useful for large-scale generative tasks like text-to-image synthesis.

The following table shows how LDMs improve over traditional diffusion models[8]:

| Aspect | Traditional Diffusion Models | Latent Diffusion Models |
| --- | --- | --- |
| Data Space | Operates directly on pixels | Works in a compressed latent space |
| Speed | Slower due to pixel-level steps | Faster due to reduced dimensionality |
| Resource Usage | Higher GPU/CPU requirements | More efficient for large-scale models |
| Quality | High, but with higher cost | High quality with lower overhead |
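As a rough illustration of what "reduced dimensionality" means in practice, the arithmetic below compares how many values the denoiser has to process per image. The shapes are assumptions based on the first Stable Diffusion release (512 × 512 RGB images, and an autoencoder that downsamples by a factor of 8 into 4 latent channels); other latent diffusion models may use different factors.

```python
def tensor_size(channels, height, width):
    """Number of scalar values in a channels x height x width tensor."""
    return channels * height * width


# Assumed shapes: Stable Diffusion v1 style, downsampling factor 8.
pixel_values = tensor_size(3, 512, 512)
latent_values = tensor_size(4, 512 // 8, 512 // 8)

print(pixel_values)                   # 786432
print(latent_values)                  # 16384
print(pixel_values / latent_values)   # 48.0
```

Under these assumptions, every denoising step works on 48 times fewer values than a pixel-space model would, which is where most of the speed and memory savings come from.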

What is Stable Diffusion?

Stable Diffusion is a generative artificial intelligence (generative AI) model that produces unique, high-quality, and even photorealistic images from text and image prompts[1]. It builds on the Latent Diffusion Model[2][5][10], developed by researchers from the Machine Vision and Learning group at LMU Munich, also known as CompVis.

Model checkpoints were publicly released at the end of August 2022 by a collaboration of Stability AI, CompVis, and Runway with support from EleutherAI and LAION[7][9]. For more information, you can check out their official blog post[13][14].

At the time this article was written, Stable Diffusion 3 Medium had already been released. According to Stability AI, it is the latest and most advanced text-to-image model in the Stable Diffusion 3 series, with two billion parameters. It excels at photorealism, handles complex prompts, and generates clear text.

Try Stable Diffusion online with a no-code approach

Before we dive deeper into Stable Diffusion with Python, let’s try it online first, using the online Stable Diffusion 2.1 demo:

[Figure: Outputs of the “A cat playing bass guitar” prompt.]

For faster generation and API access, you can try DreamStudio Beta.

[Figure: Outputs of the “A cat playing bass guitar” prompt in DreamStudio.]

Or you can try Playground AI, which offers dozens of different filters and presets for generating far better outputs:

[Figure: Output of the “A cat playing bass guitar” prompt. The left side shows the dozens of different filters or styles offered by Playground AI.]

[Figure: Outputs of the “A cat playing bass guitar” prompt using 4 different filters.]

How do you get started in Stable Diffusion with Python using Hugging Face’s Diffusers library?

The easiest way to get started with Stable Diffusion and other diffusion models with Python is by using Hugging Face’s Diffusers library.

What is the 🤗 Diffusers library?


🤗 Diffusers is a leading library for state-of-the-art pre-trained diffusion models, enabling the generation of images, audio, and even 3D structures of molecules. It serves as a modular toolkit suitable for both simple inference tasks and training your own custom diffusion model.

The 🤗 Diffusers library is designed with a focus on usability over performance, simple over easy, and customizability over abstractions. One goal of the library is to make diffusion models accessible to a wide range of deep learning practitioners.

The underlying model of 🤗 Diffusers library, a neural network, is trained to predict a way to slightly denoise the image in each step. After a certain number of steps, a sample is obtained.

The architecture of the neural network commonly follows the U-Net architecture proposed in reference[12] and improved upon in the PixelCNN++ paper:

[Figure: U-Net model architecture.]

Some of the highlights of the architecture are:

  • The model predicts images of the same size as the input.
  • The input image goes through several blocks of ResNet layers, each of which halves the image size.
  • It then goes through the same number of blocks that upsample it again.
  • Skip connections link features on the downsample path to the corresponding layers in the upsample path.
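The shape bookkeeping in the list above can be traced with a few lines of Python. This sketch only tracks channel counts and image sizes; it does not build a network. The function name unet_shapes and the channel widths are hypothetical, and it assumes that every level except the last one halves the resolution.

```python
def unet_shapes(size, channels):
    """Trace (channels, height, width) along a U-Net's two paths."""
    down = []
    for level, ch in enumerate(channels):
        down.append((ch, size, size))
        if level < len(channels) - 1:   # no downsampling after the last level
            size //= 2
    # The upsample path revisits the same shapes in reverse order, which is
    # what lets a skip connection join each pair of matching layers.
    up = list(reversed(down))
    return down, up


down, up = unet_shapes(size=256, channels=(128, 256, 512))
for d, u in zip(down, reversed(up)):
    print("down", d, "<- skip ->", "up", u)
```

The last entry of the upsample path has the same height and width as the input, which matches the first highlight: the model predicts an image of the same size as the one it receives.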

How to install Diffusers on your local machine?

Move to your preferred working directory, then create a new virtual environment called “diffusers” with Python 3.10. The exact command depends on your environment manager; assuming conda, it would be: conda create -n diffusers python=3.10

To activate this environment (again assuming conda), use: conda activate diffusers

To deactivate an active environment, use: conda deactivate

Before going any further, make sure all the necessary libraries are installed with pip, for example: pip install diffusers accelerate transformers

Besides Diffusers itself, this installs the following two libraries:

  • Accelerate: To speed up model loading for inference and training.
  • Transformers: This is required to run the most popular diffusion models, such as Stable Diffusion.

There are three main components of the library to know about:

1. DiffusionPipeline

The DiffusionPipeline is a high-level end-to-end class designed to rapidly generate samples from popular pre-trained diffusion models for inference, in a user-friendly fashion.

We’ll begin by importing a pipeline. We’ll use the google/ddpm-celebahq-256 model developed by Google and U.C. Berkeley. It uses the Denoising Diffusion Probabilistic Models (DDPM) algorithm and was trained on a dataset of celebrity images.

Hands-on and selected outputs:

The following code snippet shows the basic use of DiffusionPipeline, followed by an explanation of selected outputs:

[Screenshot: DiffusionPipeline code running in PyScripter.]

To generate an image, we simply run the pipeline. We don’t even need to give it any input: it generates a random initial noise sample and then iterates the diffusion process.
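A minimal sketch of that step is shown below. It assumes the diffusers and torch packages are installed and that the google/ddpm-celebahq-256 checkpoint can be downloaded from the Hugging Face Hub. The wrapper function generate_face and its device parameter are illustrative names, not part of the library.

```python
def generate_face(model_id="google/ddpm-celebahq-256", device="cpu"):
    """Run an unconditional DDPM pipeline once and return a PIL image."""
    # Imported here so the function can be defined even before the heavy
    # dependencies (diffusers, torch) are installed.
    from diffusers import DDPMPipeline

    image_pipe = DDPMPipeline.from_pretrained(model_id)  # downloads on first use
    image_pipe.to(device)                                # e.g. "cuda" if available

    # No prompt or input image: the pipeline draws random noise itself and
    # runs the full denoising process.
    output = image_pipe()
    return output.images[0]
```

Calling generate_face().show() opens the result in the default image viewer. The default DDPM schedule runs many denoising steps, so expect this to be slow on a CPU.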

The pipeline returns a dictionary-like output object that holds the generated sample:

[Screenshot: pipeline output in PyScripter.]

Let’s take a look at the image by running images[0].show() in the PyScripter IDE:

[Screenshot: generated image displayed from PyScripter.]

Evaluate image_pipe in PyScripter to see what the pipeline is made of, so we can better understand what happened under the hood:

[Screenshot: contents of image_pipe in PyScripter.]

Now we can see what’s inside the pipeline: A scheduler and a UNet model. Let’s look closely at them and what this pipeline just did under the hood.

2. Pretrained models

Popular pre-trained model architectures and modules can be used as building blocks for creating diffusion systems.

Instances of the model class are neural networks that take a noisy sample and a timestep as inputs and predict a less noisy output sample. In this subsection, we’ll load a pre-trained model and experiment with it to understand the model API. We’ll load a simple unconditional image generation model of type UNet2DModel, which was released with the DDPM paper[4], this time using another checkpoint trained on church images: google/ddpm-church-256.

Hands-on and selected outputs

The following code snippet shows the basic use of the models, followed by an explanation of selected outputs:

[Screenshot: model loading code in PyScripter.]

Now let’s take a look at the model’s configuration. By accessing the config attribute with model.config in the PyScripter IDE, we can browse all the parameters needed to define the model architecture:

[Screenshot: model.config output in PyScripter.]

You can find the complete output of model and model.config in the repository[3].

A couple of important config parameters are:

  • sample_size: defines the height and width dimension of the input sample.
  • in_channels: defines the number of input channels of the input sample.
  • down_block_types and up_block_types: define the type of down- and upsampling blocks that are used to create the UNet architecture, as seen in the U-Net figure earlier in this article.
  • block_out_channels: defines the number of output channels of the downsampling blocks, also used in reversed order for the number of input channels of the upsampling blocks.
  • layers_per_block: defines how many ResNet blocks are present in each UNet block.

Coming back to the trained model, let’s now see how to use it for inference. First, you need a random Gaussian sample in the shape of an image (batch_size × in_channels × sample_size × sample_size). The batch axis exists because a model can receive multiple random noise samples at once, the channel axis because each sample consists of multiple channels (such as red, green, and blue), and sample_size corresponds to the height and width. Let’s check the input shape using noisy_sample.shape:

[Screenshot: noisy_sample.shape output in PyScripter.]

The predicted noisy_residual has exactly the same shape as the input, and we use it to compute a slightly less noisy image. Let’s confirm that the output shape matches using noisy_residual.shape:

[Screenshot: noisy_residual.shape output in PyScripter.]
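The two shape checks above can be reproduced with a sketch like the following. It assumes diffusers and torch are installed and that the google/ddpm-church-256 checkpoint is downloadable. The wrapper function predict_noise, the timestep value of 2, and the seed are illustrative choices.

```python
def predict_noise(repo_id="google/ddpm-church-256", timestep=2, seed=0):
    """Feed one random Gaussian sample through a pre-trained UNet2DModel."""
    # Lazy imports: torch and diffusers are only needed when this runs.
    import torch
    from diffusers import UNet2DModel

    model = UNet2DModel.from_pretrained(repo_id)
    cfg = model.config

    torch.manual_seed(seed)
    # Shape: batch_size x in_channels x sample_size x sample_size
    noisy_sample = torch.randn(1, cfg.in_channels, cfg.sample_size, cfg.sample_size)

    with torch.no_grad():  # inference only, no gradients needed
        noisy_residual = model(sample=noisy_sample, timestep=timestep).sample

    return noisy_sample, noisy_residual
```

After noisy_sample, noisy_residual = predict_noise(), comparing noisy_sample.shape with noisy_residual.shape should show the same four dimensions for both tensors.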

3. Schedulers

Schedulers are algorithms wrapped in a Python class. They define the noise schedule, which is used to add noise to the training samples during training, and they also define the algorithm that computes the slightly less noisy sample from the model output (noisy_residual). This article focuses only on how to use scheduler classes for inference.

We will use DDPMScheduler, the denoising algorithm proposed in the DDPM Paper[4].

Hands-on and selected outputs

The following code snippet shows the basic use of the schedulers, followed by an explanation of selected outputs:

Let’s take a look at the scheduler configuration by running scheduler.config in the PyScripter IDE:

[Screenshot: scheduler.config output in PyScripter.]

Different schedulers are usually defined by different parameters. The following are the most important ones that we need to know:

  • num_train_timesteps defines the length of the denoising process, i.e. how many timesteps are needed to turn random Gaussian noise into a data sample.
  • beta_schedule defines the type of noise schedule that shall be used for inference and training.
  • beta_start and beta_end define the smallest and highest noise values of the schedule.
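To show how these three parameters interact, here is a plain-Python sketch of a linear noise schedule. The default values (1000 timesteps, with beta rising from 0.0001 to 0.02) are the linear schedule reported in the DDPM paper[4]; a given checkpoint’s scheduler.config may use different values. The function names are illustrative and do not come from the Diffusers library.

```python
def linear_beta_schedule(num_train_timesteps=1000, beta_start=0.0001, beta_end=0.02):
    """Evenly spaced noise variances from beta_start to beta_end."""
    step = (beta_end - beta_start) / (num_train_timesteps - 1)
    return [beta_start + i * step for i in range(num_train_timesteps)]


def cumulative_signal(betas):
    """Fraction of the original signal variance left after each timestep."""
    kept, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta   # each step keeps (1 - beta) of the variance
        kept.append(running)
    return kept


betas = linear_beta_schedule()
alphas_cumprod = cumulative_signal(betas)

print(betas[0], betas[-1])                     # smallest and largest noise values
print(alphas_cumprod[0], alphas_cumprod[-1])   # signal kept after first and last step
```

With these values almost none of the original signal survives the final timestep, which is why the reverse process can start from pure Gaussian noise.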

Next, we pass the model output from the previous section to the scheduler. We can see that the computed sample has exactly the same shape as the model input, which means that we are ready to pass it to the model again in the next step.

[Screenshot: shape of the scheduler’s output sample in PyScripter.]

The last step is to bring it all together and define the denoising loop. The loop displays the progressively less noisy samples along the way, which makes the denoising process easier to follow.
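A minimal sketch of such a loop follows. It is written against the interface described in the previous subsections (a model whose output has a .sample attribute, and a scheduler with .timesteps and .step(...).prev_sample). The function names denoise and to_pil, the every parameter, and the callback mechanism are illustrative choices, not the library’s own API.

```python
def denoise(model, scheduler, sample, every=50, callback=None):
    """Run the full reverse process and return the final sample.

    model(sample, t).sample must give the predicted noise residual and
    scheduler.step(residual, t, sample).prev_sample the less noisy sample,
    which is the interface of UNet2DModel and DDPMScheduler.
    When model is a torch module, call this inside torch.no_grad().
    """
    for i, t in enumerate(scheduler.timesteps):
        residual = model(sample, t).sample                         # 1. predict noise
        sample = scheduler.step(residual, t, sample).prev_sample   # 2. remove a little
        if callback is not None and (i + 1) % every == 0:
            callback(i + 1, sample)                                # 3. report progress
    return sample


def to_pil(sample):
    """Convert a (1, C, H, W) torch tensor in [-1, 1] to a PIL image."""
    import numpy as np          # lazy imports: only needed for display
    from PIL import Image

    array = sample.cpu().permute(0, 2, 3, 1)        # channels last
    array = ((array + 1.0) * 127.5).clamp(0, 255)   # [-1, 1] -> [0, 255]
    return Image.fromarray(array.numpy().astype(np.uint8)[0])
```

To use it with real components, load UNet2DModel.from_pretrained("google/ddpm-church-256") and DDPMScheduler.from_pretrained("google/ddpm-church-256"), start from torch.randn(1, 3, 256, 256), and pass a callback such as lambda step, s: to_pil(s).show() to view every 50th intermediate result.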

The code for this step also defines a display function that post-processes the denoised image, converts it to a PIL.Image, and displays it. Here is the output (displayed using PIL.Image) at the 50th step:

[Screenshot: denoised sample at step 50.]

It takes quite some time before a meaningful shape appears; one becomes visible after about 800 steps. Here is the final result, at the 1000th step:

[Screenshot: denoised sample at step 1000.]

By saving the image after every 50th step up to the 1000th step and combining the results, we can see the following denoising progress:

[Screenshot: denoising progress from step 50 to step 1000.]

Isn’t it amazing? We should now have a solid foundational understanding of the schedulers and all other components of the 🤗 Diffusers library. The key points to keep in mind are:

1. Schedulers have no trainable weights (parameter-free).

2. During inference, schedulers specify the algorithm that computes the slightly less noisy sample.

To end the subsection about models and schedulers, note that the library deliberately keeps models and schedulers as independent from each other as possible. This means a scheduler should never accept a model as an input, and vice versa. The model predicts the noise residual or slightly less noisy image with its trained weights, while the scheduler computes the previous sample given the model’s output.

How to perform text-to-image using Stable Diffusion on 🤗 Diffusers? 

In this section, we will try text-to-image or image generation from text, using the Diffusers library. We will try it using two different classes: DiffusionPipeline and StableDiffusionPipeline.

Using DiffusionPipeline

Load the model with the from_pretrained() method:

The DiffusionPipeline downloads and caches all modeling, tokenization, and scheduling components.

[Screenshot: DiffusionPipeline.from_pretrained() download output in PyScripter.]

You’ll see that the Stable Diffusion pipeline is composed of the UNet2DConditionModel and PNDMScheduler, among other things:

[Screenshot: components of the Stable Diffusion pipeline in PyScripter.]

The following is the complete code example to generate an image from a text prompt using Diffusers:
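A minimal sketch of such a script is given below. It assumes diffusers, transformers, and torch are installed. The checkpoint name CompVis/stable-diffusion-v1-4 is an assumption chosen for illustration, since the text above does not name the checkpoint; any Stable Diffusion checkpoint on the Hugging Face Hub should work the same way. The wrapper function text_to_image is likewise a hypothetical name.

```python
def text_to_image(prompt, model_id="CompVis/stable-diffusion-v1-4", device="cuda"):
    """Generate one image from a text prompt with the generic DiffusionPipeline."""
    # Lazy imports so the definition itself has no heavy dependencies.
    import torch
    from diffusers import DiffusionPipeline

    # Half precision only makes sense on a GPU; fall back to float32 on CPU.
    dtype = torch.float16 if device == "cuda" else torch.float32
    pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)
    pipeline.to(device)

    # The pipeline tokenizes the prompt, runs the denoising loop in latent
    # space, and decodes the result into a PIL image.
    return pipeline(prompt).images[0]
```

For example, text_to_image("A painting of a cat playing bass guitar").save("cat_bass.png") writes the result to a file (the file name is arbitrary). Pass device="cpu" if no GPU is available, at the cost of much slower generation.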

When run in the PyScripter IDE, the “A painting of a cat playing bass guitar” prompt generates output like the following:

[Figure: Output of the “A painting of a cat playing bass guitar” prompt.]

Using StableDiffusionPipeline

In this second example, we will generate an image from a text prompt, directly using StableDiffusionPipeline from the diffusers library.

Run the following code and your prompt on PyScripter IDE:
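The sketch below shows one way to do this with StableDiffusionPipeline, this time with explicit sampling settings and a fixed seed so that results are repeatable. As before, the checkpoint name is an assumption chosen for illustration, and the wrapper function and its default values (50 steps, guidance scale 7.5, seed 0) are illustrative rather than taken from the original script.

```python
def stable_diffusion(prompt, model_id="CompVis/stable-diffusion-v1-4",
                     device="cuda", steps=50, guidance_scale=7.5, seed=0):
    """Text-to-image with explicit sampling settings and a fixed seed."""
    # Lazy imports: torch and diffusers are only needed when this runs.
    import torch
    from diffusers import StableDiffusionPipeline

    dtype = torch.float16 if device == "cuda" else torch.float32
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)
    pipe = pipe.to(device)

    # A seeded generator makes the initial noise, and so the image, repeatable.
    generator = torch.Generator(device=device).manual_seed(seed)
    result = pipe(prompt, num_inference_steps=steps,
                  guidance_scale=guidance_scale, generator=generator)
    return result.images[0]
```

For example, stable_diffusion("a man coding on his laptop").show() displays the generated image. Fewer steps run faster at some cost in detail, and a higher guidance_scale makes the image follow the prompt more closely.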

[Figure: Output of the “a man coding on his laptop” prompt.]

Other interesting implementations

Advances in generative AI, particularly Stable Diffusion, let us carry out creatively demanding tasks such as making video art with just a few clicks. Below are videos I generated using combinations of text-to-image, image-to-image, frame interpolation, and text/image-to-video with Playground AI and Runway ML, all of which are rooted in Stable Diffusion.

Text/image to video

Text/image to video is a multimodal AI system that can generate novel videos from text, images, or video clips.

Here is the collection of 11 short videos (4 seconds each) that I generated or animated from existing images with additional guidance from text prompts, with help from Runway ML:

The following are the prompts I used to guide the image-to-video generation process:

Text-to-image + frame interpolation

Frame interpolation is a technique to turn a sequence of images into an animated video, by filling in between images with smooth transitions. 

First, I generated 120 images from unusual, obscure, and complex text prompts using Playground AI. The following are the prompts I used to generate the images:

Then, I created a video by automatically generating smooth transitions between those images using frame interpolation from Runway ML:

Conclusion

In conclusion, leveraging Python for deep learning with diffusion models unlocks immense potential for generative AI, and Stable Diffusion is a perfect example. These models, particularly Latent Diffusion Models (LDMs), have revolutionized the field by combining computational efficiency with high-quality outputs, enabling accessible and versatile applications such as text-to-image synthesis.

We’ve also worked through the hands-on side of diffusion models with Python and Hugging Face’s 🤗 Diffusers library: drawing a picture from random noise (we explored pre-trained models, customizable pipelines, and inference methods to generate a realistic church image) and performing text-to-image synthesis.

As we delve deeper into these cutting-edge advancements, diffusion models continue to shape the future of AI-driven innovation, empowering diverse domains ranging from art and design to scientific discovery.

I hope this article has given you a comprehensive and accessible introduction to diffusion models, along with a solid understanding of how to apply them to your own domains and project goals, and that it inspires you to learn more and experiment with diffusion models yourself.

Check out the full repository here: github.com/Embarcadero/DL_Python07_DiffusionModels




References & further readings

[1] Amazon Web Services, Inc. (2024). What is Stable Diffusion? AWS What is. aws.amazon.com/what-is/stable-diffusion

[2] Hakim, M. A. (2023). How to bypass the ChatGPT information cutoff? A busy (or lazy) man guide to read more recent ML/DL papers. Paper-001: “Rombach et al., 2022”. hkaLabs blog. hkalabs.com/blog/how-to-read-deep-learning-papers-using-bing-chat-ai-001

[3] Hakim, M. A. (2024). Article47 – Deep Learning Python 07 – Diffusion Models. embarcaderoBlog-repo. GitHub repository. github.com/MuhammadAzizulHakim/embarcaderoBlog-repo/tree/main/Article47%20-%20Deep%20Learning%20Python%2007%20-%20Diffusion%20Models

[4] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840-6851.

[5] Hugging Face. (2024). Diffusers. Hugging Face docs. huggingface.co/docs/diffusers/index

[6] Hugging Face. (2023). diffusers_intro.ipynb: Introducing Hugging Face’s new library for diffusion models. Hugging Face GitHub repository. colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb

[7] Hugging Face. (2023). The Stable Diffusion Guide. Hugging Face docs. huggingface.co/docs/diffusers/v0.13.0/en/stable_diffusion

[8] OpenAI. (2024). ChatGPT (Nov version) [Large language model]. chat.openai.com/chat

[9] Patil, S., Cuenca, P., Lambert, N., & von Platen, P. (2024). Stable Diffusion with 🧨 Diffusers. Hugging Face blog. huggingface.co/blog/stable_diffusion

[10] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695).

[11] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). latent-diffusion: High-Resolution Image Synthesis with Latent Diffusion Models. CompVis – Computer Vision and Learning LMU Munich. GitHub repository. github.com/CompVis/latent-diffusion

[12] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18 (pp. 234-241). Springer International Publishing.

[13] Stability AI. (2024). Stability AI: Activating humanity’s potential through generative AI. Stability AI official website. stability.ai

[14] Stability AI. (2024). Stable Diffusion Public Release. Stability AI news. stability.ai/news/stable-diffusion-public-release

[15] Towards AI Editorial Team. (2023). Diffusion Models vs. GANs vs. VAEs: Comparison of Deep Generative Models. Towards AI blog. towardsai.net/p/generative-ai/diffusion-models-vs-gans-vs-vaes-comparison-of-deep-generative-models
