
What Is the Stable Diffusion Model?


U-Net is most often used for estimating and removing the noise. Interestingly, the network's architecture resembles the letter U, which is how it got its name. U-Net is a fully convolutional neural network, which makes it very useful for processing images.

U-Net is distinguished by its ability to take the image as input and find a low-dimensional representation of it by downsampling, which makes it better suited to processing and finding the important attributes; it then upsamples the representation back to the original resolution.
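As a rough sketch of this downsample-then-upsample shape (a toy model with made-up layer sizes, not the actual Stable Diffusion U-Net):

```python
# Toy U-Net-style network: downsample, process, then upsample back, with a
# skip connection from the encoder. Channel counts and kernel sizes are illustrative.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.mid = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1)

    def forward(self, x):
        skip = self.down(x)              # low-resolution representation (64x64 -> 32x32)
        h = self.mid(skip)               # process the important attributes at low resolution
        h = torch.cat([h, skip], dim=1)  # skip connection: reuse encoder features
        return self.up(h)                # upsample back to the original resolution

out = TinyUNet()(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 3, 64, 64])
```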

Efficiency of Stable Diffusion and DALL·E

Stable Diffusion is much more efficient than DALL·E. First of all, its carbon footprint is smaller. Second, the model can be used by anyone with a graphics card with around 10 GB of VRAM. It runs in a few seconds and doesn't require nearly as much hardware.

Stability AI is a solution studio dedicated to innovative ideas. It is a young organization, but one that promises to release new open models, much as EleutherAI has done.
The true power of the Stable Diffusion model is that it can generate images from text prompts. This is done by modifying the inner diffusion model to accept conditioning inputs.

The model learns to progressively remove the noise until a clear image is achieved: it decomposes and analyzes the noisy image and finally creates the end result.

The original name of Stable Diffusion is “Latent Diffusion Model” (LDM). As the name suggests, the diffusion process happens in latent space, and this is what makes it faster than a pure diffusion model.
U-Net iteratively denoises the random latent image representation while being conditioned on the text embeddings. The output of the U-Net, the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm.
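A minimal sketch of that loop using the diffusers building blocks might look like the following; it assumes `unet` (a UNet2DConditionModel), `scheduler`, the initial `latents`, the concatenated `text_embeddings`, and `guidance_scale` have already been prepared (those pieces are sketched further below):

```python
# Denoising loop sketch: the U-Net predicts the noise residual, classifier-free
# guidance mixes the conditional and unconditional predictions, and the scheduler
# computes the previous (less noisy) latents. All inputs are assumed to exist.
import torch

for t in scheduler.timesteps:
    # Duplicate latents so the unconditional and text-conditioned passes share one forward call.
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, t)

    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # Classifier-free guidance: push the prediction toward the prompt-conditioned direction.
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # The scheduler turns the noise residual into the denoised latents for the next step.
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```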

The Power of U-Net and Text Embeddings

The checkpoint resumed training from stable-diffusion-v1-2: 195,000 steps at resolution 512x512 on "laion-improved-aesthetics", with 10% dropping of the text conditioning to improve classifier-free guidance sampling.

It makes much more sense to apply this step T times than to try to remove all of the noise at once. By repeating the step, the noise is gradually removed and we get a much “cleaner” image. A simplified version of the process: starting from an image to which complete noise has been added, we predict the noise-free image by removing the noise iteratively.
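In standard DDPM notation (an assumption here; the text does not spell the update out), one such denoising step can be written as:

```latex
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}
          \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\,
                 \epsilon_\theta(x_t, t) \right) + \sigma_t z,
\qquad z \sim \mathcal{N}(0, \mathbf{I})
```

where \epsilon_\theta(x_t, t) is the noise the network predicts at step t, \alpha_t = 1 - \beta_t for the noise schedule \beta_t, and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s.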

The diffusing (sampling) process iteratively feeds the full-sized image through the U-Net to get the final result. This makes a pure diffusion model extremely slow when the total number of diffusion steps T and the image size are large.

We also get the unconditional text embeddings for classifier-free guidance, which are just the embeddings for an empty prompt (padding tokens).
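A sketch of how both sets of embeddings can be produced (the checkpoint name is illustrative; the v1 Stable Diffusion models use OpenAI's CLIP ViT-L/14 text encoder):

```python
# Prompt embeddings plus the "empty prompt" embeddings used for classifier-free guidance.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = ["An astronaut riding a horse"]
text_input = tokenizer(prompt, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
uncond_input = tokenizer([""] * len(prompt), padding="max_length",
                         max_length=text_input.input_ids.shape[-1],
                         return_tensors="pt")

with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids)[0]
    uncond_embeddings = text_encoder(uncond_input.input_ids)[0]

# Concatenating both lets the guided and unguided predictions share one U-Net pass.
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
```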

The forward diffusion process adds Gaussian noise to the input image step by step. Nonetheless, it can be done faster using a closed-form formula, shown below, that gives the noisy image at a specific time step t directly. In general, results are better the more steps you use; however, the more steps, the longer generation takes. Stable Diffusion works quite well with a relatively small number of steps, so we recommend the default of 50 inference steps.
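In the same DDPM notation as above, that closed form is commonly written as:

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})
```

so the noisy image at any time step t can be sampled in a single shot from the clean image x_0, without simulating every intermediate step.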

The Stable Diffusion model can be run for inference with just a couple of lines using the StableDiffusionPipeline. The pipeline sets up everything you need to generate images from text with a simple from_pretrained function call.
The guidance_scale parameter controls classifier-free guidance, which in simple terms forces the generation to better match the prompt, potentially at the cost of image quality or diversity. Values between 7 and 8.5 are usually good choices for Stable Diffusion; by default, the pipeline uses a guidance_scale of 7.5.
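A minimal sketch of that workflow (the model ID, prompt, and output file name are only examples):

```python
# Text-to-image inference with the diffusers StableDiffusionPipeline.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # illustrative checkpoint; any SD checkpoint works
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe(
    "An astronaut riding a horse",
    num_inference_steps=50,   # default; more steps = slower but often better
    guidance_scale=7.5,       # classifier-free guidance strength
).images[0]
image.save("astronaut.png")
```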

The model takes the image and progressively adds noise to it. This ‘noise’ consists of tiny spots and dots distributed throughout the image that worsen its quality. We start from a clear image and gradually turn it into an image of pure noise.

Continuous experimentation via prompt engineering is critical to getting the best outcomes. We explored DALL·E 2 and Stable Diffusion and have consolidated our best tips and tricks for getting the most out of your prompts, including prompt length.


Even with their advanced capabilities, diffusion models do have limitations which we will cover later in the guide. But as these models are continuously improved or the next generative paradigm takes over, they will enable humanity to create images, videos, and other immersive experiences with simply a thought.

Tasks and Applications of Diffusion Models

The most interesting part is that a Stable Diffusion model can be run successfully on local hardware with as little as 8 GB of VRAM. Previously, the memory problem was handled by training on smaller 256x256 images and then using an additional neural network to upscale the result to a higher resolution (super-resolution diffusion). The latent diffusion model takes a different approach.
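A sketch of what working in latent space means in practice, using the VAE that ships with the v1 checkpoints (the model ID and the 0.18215 scaling factor follow the v1 convention and are assumptions here):

```python
# Images are compressed by a VAE and diffusion runs on the much smaller latents.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)   # stand-in for a preprocessed RGB image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215
print(latents.shape)                  # torch.Size([1, 4, 64, 64]) -- 48x fewer values

with torch.no_grad():
    decoded = vae.decode(latents / 0.18215).sample
print(decoded.shape)                  # torch.Size([1, 3, 512, 512])
```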

Frames provide a rough guide to the type of output the diffusion model should generate. The applications are endless, and I think they have far-reaching consequences for artists the world over, as well as for creative thinking in general.

Stable Diffusion has gained huge publicity over the last couple of months. The reasons are that the source code is available to everybody, that Stable Diffusion does not claim any rights to the generated images, and that it lets people be creative, as the large number of incredible published images created with this model shows.

Diffusion models have been shown to achieve state-of-the-art results for generating image data. One downside, however, is that the reverse denoising process is slow because of its repeated, sequential nature. In addition, these models consume a lot of memory because they operate in pixel space. Diffusion models work by destroying training data with added noise and then learning to recover the data by reversing this noising process. In other words, diffusion models can generate coherent images from noise.

Sampling means painting an image from Gaussian noise using the trained U-Net. The text encoder is responsible for transforming the input prompt, e.g. "An astronaut riding a horse", into an embedding space that the U-Net can understand. It is usually a simple transformer-based encoder that maps a sequence of input tokens to a sequence of latent text embeddings.

We initialize the scheduler with our chosen num_inference_steps. This computes the sigmas and the exact time step values to be used during the denoising process. Embeddings from models like CLIP can then guide the denoising, which is what provides the powerful text-to-image capabilities.
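A sketch of that initialization, together with the random starting latents (the LMS scheduler and these beta values follow the v1 setup described in the diffusers documentation and are assumptions here):

```python
# Scheduler setup and initial latent noise for a 512x512 generation.
import torch
from diffusers import LMSDiscreteScheduler

scheduler = LMSDiscreteScheduler(
    beta_start=0.00085, beta_end=0.012,
    beta_schedule="scaled_linear", num_train_timesteps=1000,
)
num_inference_steps = 50
scheduler.set_timesteps(num_inference_steps)    # computes sigmas and exact timestep values

generator = torch.manual_seed(0)                # the seed fixes the starting noise
latents = torch.randn((1, 4, 64, 64), generator=generator)  # (batch, channels, height/8, width/8)
latents = latents * scheduler.init_noise_sigma  # scale noise to the scheduler's expected range
```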

Diffusion models can complete various tasks, including image generation, image denoising, inpainting, outpainting, and bit diffusion.
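For example, inpainting can be run with a dedicated pipeline; the sketch below assumes an inpainting-tuned checkpoint, and the checkpoint name, file paths, and prompt are illustrative:

```python
# Inpainting: regenerate only the masked region of an existing image.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("photo.png").convert("RGB").resize((512, 512))
mask_image = Image.open("mask.png").convert("RGB").resize((512, 512))  # white = area to repaint

result = pipe(
    prompt="a small wooden bench in a park",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("inpainted.png")
```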

Diffusion models are more stable than GANs, which are subject to mode collapse, where they represent only a few modes of the true data distribution after training. In the extreme case, mode collapse means that only a single image would be returned for any prompt, though in practice the issue is not quite that severe.


These tools provide a quick and easy way for beginners to start with diffusion models, allowing you to generate images with prompts and perform inpainting and outpainting. DreamStudio offers more control over the output parameters.

If at some point you get a black image, it may be because the content filter built into the model detected an NSFW result. If you believe this shouldn't be the case, try tweaking your prompt or using a different seed. The model predictions include information about whether NSFW content was detected for a particular result; let's see what that looks like:
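A sketch, assuming `pipe` is the StableDiffusionPipeline set up earlier:

```python
# The pipeline output carries one boolean per image indicating whether the
# safety checker flagged NSFW content (flagged images are returned black).
output = pipe("An astronaut riding a horse")
print(output.nsfw_content_detected)   # e.g. [False] -- one entry per generated image
```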