Region and Boundary

Abstract

Recent text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images from text prompts. However, these models fail to convey the spatial composition specified by a layout instruction. In this work, we study zero-shot grounded T2I generation with diffusion models, that is, generating images that follow the input layout information without training auxiliary modules or finetuning the diffusion model. We propose a Region and Boundary (R&B) aware cross-attention guidance approach that gradually modulates the attention maps of the diffusion model during the generative process and helps the model synthesize images that (1) have high fidelity, (2) are highly compatible with the textual input, and (3) interpret layout instructions accurately. Specifically, we leverage discrete sampling to bridge the gap between consecutive attention maps and discrete layout constraints, and design a region-aware loss to refine the generated layout during the diffusion process. We further propose a boundary-aware loss to strengthen object discriminability within the corresponding regions. Experimental results show that our method outperforms existing state-of-the-art zero-shot grounded T2I generation methods by a large margin, both qualitatively and quantitatively, on several benchmarks.

Visual comparisons with competing training-free methods

Visual comparisons with different baselines across different text prompts

Visual comparisons with different baselines across different random seeds

Visual variations across different text and box prompts

Visual variations across different random seeds

Visual variations across different text prompts

Multiple variations of generated images. The shape and location of the boxes are changed for spatial variations, and the instances in the text prompts are changed for semantic variations.

Quantitative comparisons

Quantitative results on HRS and DrawBench; best results are in bold

Quantitative comparisons of spatial accuracy and text similarity

BibTeX


        @misc{xiao2023rb,
          title={R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation}, 
          author={Jiayu Xiao and Liang Li and Henglei Lv and Shuhui Wang and Qingming Huang},
          year={2023},
          eprint={2310.08872},
          archivePrefix={arXiv},
          primaryClass={cs.CV}
        }

R&B: REGION AND BOUNDARY AWARE ZERO-SHOT GROUNDED TEXT-TO-IMAGE GENERATION

R&B injects spatial instructions into the denoising process through classifier guidance and generates images that closely align with the text description and layout instructions without auxiliary training.
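The idea of steering a quantity with the gradient of a layout loss can be illustrated on a toy problem. The sketch below is not the R&B implementation: it assumes a single 16x16 map of attention logits that is optimized directly, whereas the actual method differentiates through the diffusion U-Net with respect to the noisy latent. The function names, the loss (attention mass falling outside the box), and the step size `eta` are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax2d(logits):
    """Normalize a 2D array of logits into a map that sums to 1."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def outside_mass_and_grad(logits, mask):
    """Toy layout loss: share of attention outside the box.

    Returns (loss, d loss / d logits). For a softmax map p and
    inside = sum(p * mask), d inside / d logits = p * (mask - inside).
    """
    p = softmax2d(logits)
    inside = float((p * mask).sum())
    loss = 1.0 - inside
    grad = -p * (mask - inside)
    return loss, grad

# Hypothetical setup: random logits and a box covering rows 4-11, cols 2-7.
rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 16))
mask = np.zeros((16, 16))
mask[4:12, 2:8] = 1.0

eta = 50.0  # assumed guidance step size
loss_before, _ = outside_mass_and_grad(logits, mask)
for _ in range(20):
    _, grad = outside_mass_and_grad(logits, mask)
    logits = logits - eta * grad  # guidance-style gradient step
loss_after, _ = outside_mass_and_grad(logits, mask)

print(f"outside-box mass: {loss_before:.3f} -> {loss_after:.3f}")
```

Each step raises the logits inside the box and lowers those outside it, so the attention mass moves toward the target region. In a real pipeline the same kind of gradient would be taken with respect to the latent at each denoising step rather than the attention logits themselves.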

Framework

Two loss functions are proposed to modulate the cross-attention of the diffusion U-Net: a region-aware loss and a boundary-aware loss. Optimizing these two losses helps incorporate the layout instructions correctly into the generative process.
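To make the comparison between a soft attention map and a discrete box constraint concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's loss: the relative threshold, the helper names (`binarize`, `bounding_box`, `box_iou`, `region_score`), and the `1 - IoU` score are hypothetical. The score is also not differentiable, so on its own it could only measure layout agreement or weight a differentiable term.

```python
import numpy as np

def binarize(attn, rel_threshold=0.5):
    """Discretize a soft map: 1 where attn >= rel_threshold * max, else 0."""
    return (attn >= rel_threshold * attn.max()).astype(np.float64)

def bounding_box(binary):
    """Smallest axis-aligned box (y0, x0, y1, x1), end-exclusive,
    enclosing all nonzero cells."""
    ys, xs = np.nonzero(binary)
    return int(ys.min()), int(xs.min()), int(ys.max()) + 1, int(xs.max()) + 1

def box_iou(a, b):
    """Intersection over union of two (y0, x0, y1, x1) boxes."""
    ih = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iw = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ih * iw
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def region_score(attn, target_box):
    """1 - IoU between the box around strong attention and the target box."""
    return 1.0 - box_iou(bounding_box(binarize(attn)), target_box)

# Hypothetical 8x8 attention map concentrated on rows 2-5, cols 1-4.
attn = np.zeros((8, 8))
attn[2:6, 1:5] = 1.0

aligned = region_score(attn, (2, 1, 6, 5))  # target matches the attention
shifted = region_score(attn, (2, 3, 6, 7))  # target moved two cells right
print(aligned, shifted)
```

A score of 0 means the box around the strongly attended cells coincides with the requested box, and the score grows as the two drift apart.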
