Grounded Text-to-Image Synthesis with Attention Refocusing

University of Maryland, College Park
CVPR 2024

Controllable text-to-image synthesis with attention refocusing. We introduce a new framework that improves the controllability of text-to-image synthesis from text prompts. We first use GPT-4 to generate a layout from the text prompt, then use a grounded text-to-image method to generate the image from the layout and the prompt. However, existing models often still get details such as quantity, identity, and attributes wrong, or mix them across objects. We propose a training-free method, attention refocusing, that substantially improves these aspects. Our method is model-agnostic and can be applied to enhance the control capacity of methods like GLIGEN (top row) and ControlNet (bottom rows).
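To make the idea of refocusing attention toward a layout more concrete, here is a minimal sketch of a box-based attention loss. It is an illustration under stated assumptions, not the exact formulation used in the paper: the function names `box_mask` and `refocus_loss`, the normalized `(x0, y0, x1, y1)` box format, and the max-based loss are all hypothetical choices made for this example.

```python
import numpy as np


def box_mask(height, width, box):
    """Binary mask for a box given as (x0, y0, x1, y1) in normalized [0, 1] coordinates.

    The box format is an assumption made for this sketch.
    """
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float64)
    r0, r1 = int(round(y0 * height)), int(round(y1 * height))
    c0, c1 = int(round(x0 * width)), int(round(x1 * width))
    mask[r0:r1, c0:c1] = 1.0
    return mask


def refocus_loss(attn, mask):
    """Illustrative loss for one token's attention map (values assumed in [0, 1]).

    The loss is small when the attention peak lies inside the box and
    attention outside the box is weak. This is a hypothetical stand-in,
    not the paper's loss.
    """
    inside_peak = (attn * mask).max()
    outside_peak = (attn * (1.0 - mask)).max()
    return (1.0 - inside_peak) + outside_peak


# Toy 4x4 attention maps for a token whose box covers the top-left quadrant.
mask = box_mask(4, 4, (0.0, 0.0, 0.5, 0.5))

attn_good = np.zeros((4, 4))
attn_good[1, 1] = 1.0  # attention peak inside the box

attn_bad = np.zeros((4, 4))
attn_bad[3, 3] = 1.0  # attention peak outside the box

print(refocus_loss(attn_good, mask))
print(refocus_loss(attn_bad, mask))
```

In a training-free setting, a loss of this kind would typically be differentiated with respect to the noisy latent at sampling time rather than used to update model weights; that step needs an automatic-differentiation framework and is omitted here.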

Plug-and-play attention refocusing for various text-to-image models

Applying Attention Refocusing to GLIGEN [Li et al. CVPR 2023]

Ablation study

Comparison with existing methods

Mask-based guidance

Applying Attention Refocusing to ControlNet

Comparison with DenseDiffusion [Kim et al. ICCV 2023]

Additional applications

Instruct text-to-image model

A hot air balloon in the sky, oil painting

A boat in the river, the sun setting

Object shuffling

Two-stage text-to-image generation: GLIGEN + attention refocusing as the grounded text-to-image model

Comparison between our two-stage text-to-image model and existing state-of-the-art methods

Diverse generation from our method (GLIGEN+Ours)


[1] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
[2] Eslam Mohamed Bakr, Pengzhan Sun, Xiaoqian Shen, Faizan Farooq Khan, Li Erran Li, and Mohamed Elhoseiny. HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models. arXiv preprint arXiv:2304.05390, 2023.
[3] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. SIGGRAPH, 2023.
[4] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. GLIGEN: Open-Set Grounded Text-to-Image Generation. CVPR, 2023.
[5] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. ICML, 2023.
[6] Minghao Chen, Iro Laina, and Andrea Vedaldi. Training-Free Layout Control with Cross-Attention Guidance. arXiv preprint arXiv:2304.03373, 2023.