
Driven by scalable diffusion models trained on large-scale paired text-image datasets, text-to-image synthesis methods have shown compelling results. However, these models still fail to precisely follow the prompt when it involves multiple objects, attributes, or spatial compositions. In this paper, we identify potential causes of these failures in both the cross-attention and self-attention layers of the diffusion model. We propose two novel losses that refocus the attention maps according to a given layout during the sampling process. We perform comprehensive experiments on the DrawBench and HRS benchmarks using layouts synthesized by Large Language Models, showing that our proposed losses can be easily and effectively integrated into existing text-to-image methods and consistently improve the alignment between the generated images and the text prompts.
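As a rough illustration of how such layout-conditioned losses can plug into an existing sampler, the PyTorch-style sketch below perturbs the noisy latent with the gradient of an attention-refocusing loss before each denoising step, assuming a diffusers-style scheduler interface. The names `unet`, `scheduler`, `refocusing_loss`, and `step_size` are placeholders for illustration, not the paper's implementation.

```python
import torch

def layout_guided_sampling(unet, scheduler, latents, text_emb, masks,
                           refocusing_loss, step_size=0.3):
    """Sketch: update the noisy latent against an attention-refocusing loss,
    then denoise with the predicted score, at every sampling step.

    Assumes `unet(latents, t, text_emb)` returns the noise prediction together
    with the cross- and self-attention maps of that forward pass (placeholder
    interface, not an existing library API).
    """
    for t in scheduler.timesteps:
        # (1) Refocus: move the latent so the attention maps follow the layout.
        latents = latents.detach().requires_grad_(True)
        _, cross_attn, self_attn = unet(latents, t, text_emb)
        loss = refocusing_loss(cross_attn, self_attn, masks)
        grad = torch.autograd.grad(loss, latents)[0]
        latents = (latents - step_size * grad).detach()

        # (2) Denoise: a standard scheduler step with the predicted noise.
        noise_pred, _, _ = unet(latents, t, text_emb)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```

Because the guidance only modifies the latent between scheduler steps, the same loop can wrap different base text-to-image samplers without retraining.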
Example prompts: "A blue airplane and a green car." and "A banana on the left of an apple."
At each denoising step, we update the noise sample by optimizing our Cross-Attention Refocusing (CAR) and Self-Attention Refocusing (SAR) losses (red block) before denoising with the predicted score (yellow block). For each cross-attention map, CAR is designed to encourage a region to attend more to the corresponding token while discouraging the remaining region from attending to that token (green block). For each self-attention map, SAR prevents the pixels in a region from attending to irrelevant regions (blue block).
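One plausible instantiation of the two losses, for a single attention map and a single object region, is sketched below in PyTorch: `cross_attn` is a (pixels × tokens) cross-attention map, `self_attn` is a (pixels × pixels) self-attention map, and `mask` is the flattened binary layout mask of the region. The exact normalizations and weightings are assumptions for illustration, not the paper's definitions; in practice the terms would be summed over all regions, tokens, and attention layers.

```python
import torch

def car_loss(cross_attn, mask, token_idx):
    """Cross-Attention Refocusing (sketch): concentrate the token's attention
    inside its region. Since the inside/outside fractions sum to one, pushing
    attention into the region simultaneously pulls it out of the complement."""
    attn = cross_attn[:, token_idx]                        # (num_pixels,)
    inside = (attn * mask).sum() / (attn.sum() + 1e-8)     # fraction inside the region
    return 1.0 - inside

def sar_loss(self_attn, mask):
    """Self-Attention Refocusing (sketch): pixels inside a region should not
    attend to pixels outside of it."""
    attn_from_region = self_attn[mask.bool()]              # (region_pixels, num_pixels)
    leakage = (attn_from_region * (1 - mask)).sum(dim=-1)  # attention leaving the region
    return leakage.mean()
```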