High-performance image generation using Stable Diffusion in KerasCV
Authors: fchollet, lukewood, divamgupta
Date created: 2022/09/25
Last modified: 2022/09/25
Description: Generate new images using KerasCV's StableDiffusion model.
Overview
In this guide, we will show how to generate novel images based on a text prompt using the KerasCV implementation of stability.ai's text-to-image model, Stable Diffusion.
Stable Diffusion is a powerful, open-source text-to-image generation model. While there exist multiple open-source implementations that allow you to easily create images from textual prompts, KerasCV's offers a few distinct advantages. These include XLA compilation and mixed precision support, which together achieve state-of-the-art generation speed.
In this guide, we will explore KerasCV's Stable Diffusion implementation, show how to use these powerful performance boosts, and explore the performance benefits that they offer.
To get started, let's install a few dependencies and sort out some imports:
!pip install --upgrade keras-cv
import time

import keras_cv
from tensorflow import keras
import matplotlib.pyplot as plt
Introduction
Unlike most tutorials, where we first explain a topic then show how to implement it, with text-to-image generation it is easier to show instead of tell.
Check out the power of keras_cv.models.StableDiffusion().
First, we construct a model:
model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)
Next, we give it a prompt:
images = model.text_to_image("photograph of an astronaut riding a horse", batch_size=3)def plot_images(images): plt.figure(figsize=(20, 20)) for i in range(len(images)): ax = plt.subplot(1, len(images), i + 1) plt.imshow(images[i]) plt.axis("off")plot_images(images)
25/25 [==============================] - 19s 317ms/step
Pretty incredible!
But that's not all this model can do. Let's try a more complex prompt:
images = model.text_to_image( "cute magical flying dog, fantasy art, " "golden color, high quality, highly detailed, elegant, sharp focus, " "concept art, character concepts, digital painting, mystery, adventure", batch_size=3,)plot_images(images)
25/25 [==============================] - 8s 316ms/step
The possibilities are literally endless (or at least extend to the boundaries of Stable Diffusion's latent manifold).
Wait, how does this even work?
Unlike what you might expect at this point, Stable Diffusion doesn't actually run on magic.It's a kind of "latent diffusion model". Let's dig into what that means.
You may be familiar with the idea of super-resolution: it's possible to train a deep learning model to denoise an input image -- and thereby turn it into a higher-resolution version. The deep learning model doesn't do this by magically recovering the information that's missing from the noisy, low-resolution input -- rather, the model uses its training data distribution to hallucinate the visual details that would be most likely given the input. To learn more about super-resolution, you can check out the following Keras.io tutorials (a minimal denoiser sketch follows the list):
- Image Super-Resolution using an Efficient Sub-Pixel CNN
- Enhanced Deep Residual Networks for single-image super-resolution
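To make the idea concrete, here is a minimal, purely illustrative sketch of the kind of setup those tutorials build on: a tiny convolutional network trained to map noisy images back to clean ones. The architecture, noise level, and placeholder data below are assumptions for illustration, not the models used in the linked tutorials.

import numpy as np
from tensorflow import keras

# A tiny denoiser: noisy image in, clean image out (illustrative only).
denoiser = keras.Sequential(
    [
        keras.Input(shape=(64, 64, 3)),
        keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        keras.layers.Conv2D(3, 3, padding="same"),
    ]
)
denoiser.compile(optimizer="adam", loss="mse")

# Train on (noisy, clean) pairs built from any image dataset.
clean = np.random.rand(8, 64, 64, 3).astype("float32")  # stand-in for real images
noisy = clean + 0.1 * np.random.randn(*clean.shape).astype("float32")
denoiser.fit(noisy, clean, epochs=1, verbose=0)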
When you push this idea to the limit, you may start asking -- what if we just run such a model on pure noise? The model would then "denoise the noise" and start hallucinating a brand new image. By repeating the process multiple times, you can turn a small patch of noise into an increasingly clear and high-resolution artificial picture.
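As a rough illustration of that loop (not the actual Stable Diffusion sampler, whose update rule is more involved), repeatedly applying a denoiser to its own output looks something like this, assuming the hypothetical denoiser sketched above:

import numpy as np

# Start from pure noise and repeatedly "denoise the noise".
image = np.random.randn(1, 64, 64, 3).astype("float32")
num_steps = 50
for step in range(num_steps):
    # Each pass nudges the current image toward something the model finds plausible.
    image = denoiser.predict(image, verbose=0)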
This is the key idea of latent diffusion, proposed in High-Resolution Image Synthesis with Latent Diffusion Models in 2021. To understand diffusion in depth, you can check the Keras.io tutorial Denoising Diffusion Implicit Models.
Now, to go from latent diffusion to a text-to-image system, you still need to add one key feature: the ability to control the generated visual contents via prompt keywords. This is done via "conditioning", a classic deep learning technique which consists of concatenating to the noise patch a vector that represents a bit of text, then training the model on a dataset of {image: caption} pairs.
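Here is a hedged sketch of what conditioning means in practice: the text embedding is fed to the denoising network alongside the noisy latent, so the network can use it at every step. The shapes and the naive concatenation strategy below are illustrative assumptions; the real diffusion model injects the prompt through attention layers rather than this simple broadcast-and-concatenate.

from tensorflow import keras
from tensorflow.keras import layers

# Two inputs: a noisy latent patch and a text embedding (sizes are illustrative).
noisy_latent = keras.Input(shape=(64, 64, 4))
text_embedding = keras.Input(shape=(768,))

# Broadcast the text vector across the spatial grid and concatenate it
# onto the latent, so every location "sees" the prompt.
text_map = layers.Dense(4)(text_embedding)
text_map = layers.Reshape((1, 1, 4))(text_map)
text_map = layers.UpSampling2D(size=(64, 64))(text_map)
x = layers.Concatenate()([noisy_latent, text_map])
x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
denoised_latent = layers.Conv2D(4, 3, padding="same")(x)

conditional_denoiser = keras.Model([noisy_latent, text_embedding], denoised_latent)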
This gives rise to the Stable Diffusion architecture. Stable Diffusion consists of three parts:
- A text encoder, which turns your prompt into a latent vector.
- A diffusion model, which repeatedly "denoises" a 64x64 latent image patch.
- A decoder, which turns the final 64x64 latent patch into a higher-resolution 512x512 image.
First, your text prompt gets projected into a latent vector space by the text encoder, which is simply a pretrained, frozen language model. Then that prompt vector is concatenated to a randomly generated noise patch, which is repeatedly "denoised" by the diffusion model over a series of "steps" (the more steps you run, the clearer and nicer your image will be -- the default value is 50 steps).
Finally, the 64x64 latent image is sent through the decoder to properly render it in high resolution.
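If you want to play with these stages yourself, the KerasCV model exposes knobs for them. The num_steps argument below controls the number of denoising steps; the two-stage encode_text / generate_image calls are, to the best of our knowledge, part of the KerasCV API at the time of writing, but treat them as an assumption and check your installed version.

# Fewer steps trade image quality for speed; 50 is the default.
fast_images = model.text_to_image(
    "photograph of an astronaut riding a horse",
    batch_size=3,
    num_steps=25,
)

# The pipeline can also be run in two explicit stages (API assumed as noted above):
# 1. Text encoder: prompt -> latent vector.
encoded_text = model.encode_text("photograph of an astronaut riding a horse")
# 2. Diffusion model + decoder: latent vector -> 512x512 images.
staged_images = model.generate_image(encoded_text, batch_size=3)
plot_images(staged_images)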
All-in-all, it's a pretty simple system -- the Keras implementation fits in four files that represent less than 500 lines of code in total:
- text_encoder.py: 87 LOC
- diffusion_model.py: 181 LOC
- decoder.py: 86 LOC
- stable_diffusion.py: 106 LOC
But this relatively simple system starts looking like magic once you train on billions of pictures and their captions. As Feynman said about the universe: "It's not complicated, it's just a lot of it!"
Perks of KerasCV
With several implementations of Stable Diffusion publicly available, why should you use keras_cv.models.StableDiffusion?
Aside from the easy-to-use API, KerasCV's Stable Diffusion model comes with some powerful advantages, including:
- Graph mode execution
- XLA compilation through jit_compile=True
- Support for mixed precision computation
When these are combined, the KerasCV Stable Diffusion model runs orders of magnitude faster than naive implementations. This section shows how to enable all of these features, and the performance gains that result from using them.
For the purposes of comparison, we ran benchmarks comparing the runtime of the HuggingFace diffusers implementation of Stable Diffusion against the KerasCV implementation. Both implementations were tasked to generate 3 images with a step count of 50 for each image. In this benchmark, we used a Tesla T4 GPU.
All of our benchmarks are open source on GitHub, and may be re-run on Colab to reproduce the results. The results from the benchmark are displayed in the table below:
GPU | Model | Runtime |
---|---|---|
Tesla T4 | KerasCV (Warm Start) | 28.97s |
Tesla T4 | diffusers (Warm Start) | 41.33s |
Tesla V100 | KerasCV (Warm Start) | 12.45s |
Tesla V100 | diffusers (Warm Start) | 12.72s |
That's a 30% improvement in execution time on the Tesla T4! While the improvement is much lower on the V100, we generally expect the results of the benchmark to consistently favor the KerasCV implementation across all NVIDIA GPUs.
For the sake of completeness, both cold-start and warm-start generation times are reported. Cold-start execution time includes the one-time cost of model creation and compilation, and is therefore negligible in a production environment (where you would reuse the same model instance many times). Regardless, here are the cold-start numbers (a minimal sketch of the warm-up-then-time pattern follows the table):
GPU | Model | Runtime |
---|---|---|
Tesla T4 | KerasCV (Cold Start) | 83.47s |
Tesla T4 | diffusers (Cold Start) | 46.27s |
Tesla V100 | KerasCV (Cold Start) | 76.43s |
Tesla V100 | diffusers (Cold Start) | 13.90s |
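To reproduce the warm-start numbers yourself, the key is to run one throwaway generation before starting the clock, so graph tracing and compilation are excluded from the measurement. A minimal sketch, assuming the model constructed earlier in this guide:

import time

# Cold start: the first call pays for graph tracing and compilation.
model.text_to_image("warm-up prompt", batch_size=3)

# Warm start: time only the steady-state generation.
start = time.time()
model.text_to_image("photograph of an astronaut riding a horse", batch_size=3)
print(f"Warm-start runtime: {time.time() - start:.2f}s")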
While the runtime results from running this guide may vary, in our testing the KerasCV implementation of Stable Diffusion is significantly faster than its PyTorch counterpart. This may be largely attributed to XLA compilation.
Note: The performance benefits of each optimization can vary significantly between hardware setups.
To get started, let's first benchmark our unoptimized model:
benchmark_result = []
start = time.time()
images = model.text_to_image(
    "A cute otter in a rainbow whirlpool holding shells, watercolor",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["Standard", end - start])
plot_images(images)

print(f"Standard model: {(end - start):.2f} seconds")
keras.backend.clear_session()  # Clear session to preserve memory.
25/25 [==============================] - 8s 316ms/step
Standard model: 8.17 seconds
Mixed precision
"Mixed precision" consists of performing computation using float16
precision, while storing weights in the float32
format.This is done to take advantage of the fact that float16
operations are backed bysignificantly faster kernels than their float32
counterparts on modern NVIDIA GPUs.
Enabling mixed precision computation in Keras (and therefore for keras_cv.models.StableDiffusion) is as simple as calling:
keras.mixed_precision.set_global_policy("mixed_float16")
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA A100-SXM4-40GB, compute capability 8.0
That's all. Out of the box - it just works.
model = keras_cv.models.StableDiffusion()

print("Compute dtype:", model.diffusion_model.compute_dtype)
print(
    "Variable dtype:",
    model.diffusion_model.variable_dtype,
)
Compute dtype: float16
Variable dtype: float32
As you can see, the model constructed above now uses mixed precision computation: it leverages the speed of float16 operations for computation, while storing variables in float32 precision.
# Warm up model to run graph tracing before benchmarking.
model.text_to_image("warming up the model", batch_size=3)

start = time.time()
images = model.text_to_image(
    "a cute magical flying dog, fantasy art, "
    "golden color, high quality, highly detailed, elegant, sharp focus, "
    "concept art, character concepts, digital painting, mystery, adventure",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["Mixed Precision", end - start])
plot_images(images)

print(f"Mixed precision model: {(end - start):.2f} seconds")
keras.backend.clear_session()
25/25 [==============================] - 15s 226ms/step
25/25 [==============================] - 6s 226ms/step
Mixed precision model: 6.02 seconds
XLA Compilation
TensorFlow comes with the XLA: Accelerated Linear Algebra compiler built-in. keras_cv.models.StableDiffusion supports a jit_compile argument out of the box. Setting this argument to True enables XLA compilation, resulting in a significant speed-up.
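Under the hood, this is the same jit_compile flag that plain TensorFlow exposes. A minimal sketch of the general mechanism, unrelated to Stable Diffusion itself:

import tensorflow as tf


@tf.function(jit_compile=True)  # Compile the function with XLA.
def scaled_sum(x, y):
    return tf.reduce_sum(x * y)


# The first call traces and compiles; subsequent calls reuse the compiled program.
print(scaled_sum(tf.ones((1024, 1024)), tf.ones((1024, 1024))))

For keras_cv.models.StableDiffusion, the same flag is simply passed to the constructor.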
Let's use this below:
# Set back to the default for benchmarking purposes.
keras.mixed_precision.set_global_policy("float32")

model = keras_cv.models.StableDiffusion(jit_compile=True)
# Before we benchmark the model, we run inference once to make sure the TensorFlow
# graph has already been traced.
images = model.text_to_image("An avocado armchair", batch_size=3)
plot_images(images)
25/25 [==============================] - 36s 245ms/step
Let's benchmark our XLA model:
start = time.time()
images = model.text_to_image(
    "A cute otter in a rainbow whirlpool holding shells, watercolor",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["XLA", end - start])
plot_images(images)

print(f"With XLA: {(end - start):.2f} seconds")
keras.backend.clear_session()
25/25 [==============================] - 6s 245ms/step
With XLA: 6.27 seconds
On an A100 GPU, we get about a 2x speedup. Fantastic!
Putting it all together
So, how do you assemble the world's most performant Stable Diffusion inference pipeline (as of September 2022)?
With these two lines of code:
keras.mixed_precision.set_global_policy("mixed_float16")
model = keras_cv.models.StableDiffusion(jit_compile=True)
And to use it...
# Let's make sure to warm up the model
images = model.text_to_image(
    "Teddy bears conducting machine learning research",
    batch_size=3,
)
plot_images(images)
25/25 [==============================] - 39s 157ms/step
Exactly how fast is it? Let's find out!
start = time.time()
images = model.text_to_image(
    "A mysterious dark stranger visits the great pyramids of egypt, "
    "high quality, highly detailed, elegant, sharp focus, "
    "concept art, character concepts, digital painting",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["XLA + Mixed Precision", end - start])
plot_images(images)

print(f"XLA + mixed precision: {(end - start):.2f} seconds")
25/25 [==============================] - 4s 158ms/step
XLA + mixed precision: 4.25 seconds
Let's check out the results:
print("{:<20} {:<20}".format("Model", "Runtime"))for result in benchmark_result: name, runtime = result print("{:<20} {:<20}".format(name, runtime))
Model                Runtime
Standard             8.17177152633667
Mixed Precision      6.022329568862915
XLA                  6.265935659408569
XLA + Mixed Precision 4.252242088317871
It only took our fully-optimized model four seconds to generate three novel images from a text prompt on an A100 GPU.
Conclusions
KerasCV offers a state-of-the-art implementation of Stable Diffusion -- and through the use of XLA and mixed precision, it delivers the fastest Stable Diffusion pipeline available as of September 2022.
Normally, at the end of a keras.io tutorial we leave you with some future directions to continue learning with. This time, we leave you with one idea:
Go run your own prompts through the model! It is an absolute blast!
If you have your own NVIDIA GPU, or an M1 MacBook Pro, you can also run the model locally on your machine. (Note that when running on an M1 MacBook Pro, you should not enable mixed precision, as it is not yet well supported by Apple's Metal runtime.)
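If you want one script that runs on both NVIDIA GPUs and Apple Silicon, a simple guard around the dtype policy works. The platform check below is an illustrative assumption; substitute whatever detection logic fits your setup:

import platform

import keras_cv
from tensorflow import keras

# Skip mixed precision on Apple Silicon, where Metal support for it is still limited.
on_apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"
if not on_apple_silicon:
    keras.mixed_precision.set_global_policy("mixed_float16")

model = keras_cv.models.StableDiffusion(jit_compile=True)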