Understanding the Influence of Prompts with DAAM Extension – Text to Image Generation Mastery


Greetings everyone. In this article, I will show you how to see the effect that each word of your prompt has on an image generated by text-to-image. To do that, we will use the DAAM extension for the Stable Diffusion Automatic1111 web UI.

If you don’t yet know what Stable Diffusion or Automatic1111 is, or how to use them, I have a tutorial series playlist on my site. So far it has more than 14 articles on how to use Stable Diffusion with Automatic1111 and how to use Stable Diffusion on Google Colab.

DAAM (Diffusion Attentive Attribution Maps) is the name the authors chose for the method in their paper “What the DAAM: Interpreting Stable Diffusion Using Cross Attention.” The paper aims to understand how an input word influences parts of the generated image. The authors have released their script, and there is an extension for Automatic1111 that lets us use it from the web UI. [Here is the extension repo link]
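To give a rough idea of what such a method computes, here is a minimal toy sketch in Python. As I understand the paper, the cross-attention maps for a word are collected at several resolutions, upscaled to the image size, and summed into a single heatmap. The function names `upscale_nearest` and `aggregate_maps` are my own, the input grids are made-up numbers, and I use nearest-neighbour upscaling for simplicity, so this is not the authors' script.

```python
def upscale_nearest(grid, size):
    """Nearest-neighbour upscale of a square 2D list to size x size."""
    n = len(grid)
    return [[grid[r * n // size][c * n // size] for c in range(size)]
            for r in range(size)]


def aggregate_maps(maps, size):
    """Upscale every attention grid to size x size and sum them element-wise."""
    total = [[0.0] * size for _ in range(size)]
    for grid in maps:
        upscaled = upscale_nearest(grid, size)
        for r in range(size):
            for c in range(size):
                total[r][c] += upscaled[r][c]
    return total


# Hypothetical attention grids for one word: a 1x1 map and a 2x2 map.
maps = [
    [[1.0]],
    [[0.0, 1.0],
     [0.0, 0.0]],
]
print(aggregate_maps(maps, 2))
# [[1.0, 2.0], [1.0, 1.0]]
```

The brightest cell of the summed grid is where the word received the most attention, which is what the heatmap colors show.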

Now I will install this extension on my Automatic1111. To do that, I go to the Extensions tab, click Install from URL, paste the repo URL, and click Install. Once it has been installed, click Apply and restart UI.

So, how do we use it? After the restart there is a new option: Attention Heatmap. This is where we use the extension. I am going to make my examples with the Protogen x3.4 model. [You can download this model here](insert model download link). It’s a very good, very realistic model.

Let’s start by typing our first prompt and see how each word affects the image. The prompt I have used is “a sports car beautiful oasis with palm trees,” and we got a beautiful image.

Before going further, I want to point out the details of my setup, in case you wonder about them: the Python version, Torch version, xformers version, Gradio version, commit hash, and checkpoint hash.

How do we get the heatmap of this image? I copy the prompt, go to Attention Heatmap, and paste it there. You need to separate the words with commas so that the emphasis of each word is shown on the picture. Also, select the “use grid output to grid directory” option. There is an easy way to put a comma between all of the words: in Notepad++, open Replace, put a single space character in “Find what,” put a comma in “Replace with,” and click Replace All. You will get the prompt separated with commas.
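If you prefer a script to a text editor, the same replacement can be sketched in a few lines of Python. The function name `comma_separate` is just my own helper for illustration, not part of the extension.

```python
def comma_separate(prompt: str) -> str:
    """Turn a space-separated prompt into a comma-separated word list."""
    # split() with no arguments also collapses repeated spaces,
    # so we never end up with empty entries between commas.
    return ",".join(prompt.split())


prompt = "a sports car beautiful oasis with palm trees"
print(comma_separate(prompt))
# a,sports,car,beautiful,oasis,with,palm,trees
```

Paste the printed line into the Attention Heatmap text box.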

I also set the seed to the same seed as before, then hit Generate to see the effect of each word in the prompt on the image. There are now two images: the normal output and a grid showing the effect of each word. When we click the heatmap, we can see how much each word affects the result. For example, the “a” keyword, and perhaps the “with” keyword, have a little too much effect. If you want to reduce the effect of some keywords, you can use this heatmap to decide which ones. For example, let’s reduce the effect of the “a” and “with” keywords by lowering their weights in the prompt. This uses the prompt emphasis feature of the Automatic1111 web UI.

Now we can see that the effect of the “a” and “with” keywords has decreased. Let’s see the output image. OK, it is like this. If you are still not satisfied with the image, you can increase the effect of some keywords and decrease the effect of others. For example, if “palm” has too much effect and “beautiful” or “oasis” has too little, we can modify them. Let’s reduce “palm” to 0.8, increase “oasis” to 1.2, increase “beautiful” to 1.2, and reduce “car” to 0.8.
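The weights above are written into the prompt with the `(word:weight)` emphasis syntax of the web UI. Here is a small sketch that builds such a prompt from a table of weights. The helper name `apply_emphasis` is hypothetical, and it only handles single words that match exactly.

```python
def apply_emphasis(prompt: str, weights: dict) -> str:
    """Wrap every word that has a weight in the (word:weight) syntax."""
    words = []
    for word in prompt.split():
        if word in weights:
            words.append(f"({word}:{weights[word]})")
        else:
            words.append(word)  # words without a weight stay unchanged
    return " ".join(words)


weights = {"palm": 0.8, "oasis": 1.2, "beautiful": 1.2, "car": 0.8}
print(apply_emphasis("a sports car beautiful oasis with palm trees", weights))
# a sports (car:0.8) (beautiful:1.2) (oasis:1.2) with (palm:0.8) trees
```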

If you don’t know how attention emphasis works, there is a wiki page in the Automatic1111 repository that explains how it affects the results, both how to increase attention and how to decrease it.
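As a quick reference, the wiki describes three forms: each pair of round brackets multiplies attention by 1.1, each pair of square brackets divides it by 1.1, and `(word:1.5)` sets the multiplier explicitly. The sketch below computes the multiplier for a single bracketed word. The function `emphasis_multiplier` is my own simplified illustration, not the parser used by the web UI, and it does not handle an explicit weight nested inside extra brackets.

```python
def emphasis_multiplier(token: str) -> float:
    """Return the attention multiplier for '((word))', '[word]' or '(word:1.5)'."""
    inner = token.strip()
    # Explicit weight: (word:1.5) means "multiply attention by 1.5".
    if inner.startswith("(") and inner.endswith(")") and ":" in inner:
        return float(inner[1:-1].rsplit(":", 1)[1])
    multiplier = 1.0
    # Peel off one bracket pair at a time.
    while len(inner) >= 2:
        if inner[0] == "(" and inner[-1] == ")":
            multiplier *= 1.1
        elif inner[0] == "[" and inner[-1] == "]":
            multiplier /= 1.1
        else:
            break
        inner = inner[1:-1]
    return multiplier


print(round(emphasis_multiplier("((oasis))"), 2))  # 1.21
print(round(emphasis_multiplier("[palm]"), 2))     # 0.91
print(emphasis_multiplier("(car:0.8)"))            # 0.8
```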

OK, we got our new result. And this is how much effect each word has: car, oasis, beautiful, with, trees, palm, “a.” Based on the result, you can modify the image further and decide which words to decrease in strength and which to increase in emphasis.

There are also some other options, such as Heatmap blend alpha. If you set this to 1, the generated heatmap won’t show the original image underneath. Then you will have a clearer idea of how much effect each word has.

“Trees” is not having much effect, I think because the model is already relying on the word “palm.” So let’s remove “palm” from the prompt and see what we get. Now we don’t have “palm,” only “trees.” This is how the effect appears for “sports,” “beautiful,” “oasis,” and “car,” and this is the new output.

Let’s take it back: I will restore “palm,” remove “trees” instead, and generate again. Now our image is back. This is the heatmap for “car,” “sports,” “oasis,” “beautiful,” and “palm.” “Trees” is no longer in the prompt, so it has zero effect.

Let’s also remove the “with” keyword. By the way, the combination of these keywords produces a chain effect. A word may not show very much emphasis on its own heatmap, but as part of the chain it could still affect the image a lot. So this is our new image, and this is the heatmap we got for “car,” “sports,” “oasis,” and “beautiful.”
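The Heatmap blend alpha option mentioned above can be pictured as an ordinary linear blend between the image and the heatmap. The sketch below assumes that is how the option behaves, which matches what we saw (a value of 1 hides the original image), but I have not checked the extension’s code. The function name `blend` and the pixel values are hypothetical.

```python
def blend(image, heatmap, alpha):
    """Linear blend of two same-sized 2D grids of pixel values in [0, 1].

    alpha = 0 shows only the image, alpha = 1 shows only the heatmap.
    """
    return [
        [alpha * h + (1.0 - alpha) * i for i, h in zip(image_row, heat_row)]
        for image_row, heat_row in zip(image, heatmap)
    ]


image = [[0.2, 0.4]]    # made-up grayscale pixels
heatmap = [[1.0, 0.0]]  # made-up heatmap values

print(blend(image, heatmap, 1.0))  # [[1.0, 0.0]]  heatmap only
print(blend(image, heatmap, 0.0))  # [[0.2, 0.4]]  image only
```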

This is a cool and neat thing to play with. Based on the heatmaps, you can increase or decrease the emphasis of each word and get a better understanding of how prompts affect your results. It’s a cool addition. Hopefully, see you in another article.