PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion

Guansong Lu¹, Yuanfan Guo¹, Jianhua Han¹, Minzhe Niu¹, Yihan Zeng¹

Songcen Xu¹, Zeyi Huang², Zhao Zhong², Wei Zhang¹, Hang Xu¹

¹Huawei Noah's Ark Lab, ²Huawei

“一位年轻女性，身着优雅礼服，佩戴毕业帽，微笑着面对镜头，伸出手臂，她的背景是夕阳下的校园。”

“A young woman, wearing an elegant gown and graduation cap, smiles at the camera and extends her arms, with the sunset on campus in the background.”

“一位面带微笑的女子，身穿白色T恤，红色夹克在阳光下熠熠生辉，画面清新，风格像动漫，细节丰富。”

“A smiling woman wearing a white T-shirt and a red jacket shines in the sun. The picture is fresh, anime-like in style, and rich in details.”

“一个巨大的水晶球，内部蕴含着一个微型雨林，雨林中蝴蝶飞舞，阳光透过树叶洒落。”

“A huge crystal ball contains a miniature rainforest inside, with butterflies flying in the rainforest and sunlight shining through the leaves.”

“一只穿着中世纪铠甲的兔子，手持长剑站在一座古老城堡的城墙上，背后是落日的余晖。”

“A rabbit wearing medieval armor and holding a sword stands on the wall of an ancient castle with the setting sun behind him.”

“赛博朋克风格的摄影机，无人机在夜空中飞行，以粒子水墨画风展现，具有强烈的光影效果。”

“Cyberpunk style camera, drone flying in the night sky, presented in particle ink painting style, with strong light and shadow effects.”

“一艘古老的海盗船，完全由糖果和巧克力制成。”

“An ancient pirate ship made entirely of candy and chocolate.”

“旅行者们泛舟在波光粼粼的湖面上，周围是雄伟的山脉，以中国水墨画风格描绘，画面色彩淡雅，具有古典诗意。”

“Travelers are boating on the sparkling lake, surrounded by majestic mountains, painted in the style of Chinese ink painting, with elegant colors and a classical poetic feel.”

“超大广角下，沙漠、河流、绿洲交相辉映，落日余晖洒满大地，此景宛如摄影艺术家的作品，画面开阔，色彩丰富。”

“Under the super wide angle, deserts, rivers, and oasis complement each other, and the setting sun fills the earth. This scene is like the work of a photography artist, with a broad picture and rich colors.”

“一台未来风格的摩托车，闪耀着霓虹灯，停在夜晚的东京街头。”

“A futuristic motorcycle, shining with neon lights, is parked on the streets of Tokyo at night.”

“一座由冰晶和雪花构成的精致城堡，坐落在北极的冰原上。”

“An exquisite castle made of ice crystals and snowflakes, located on the Arctic ice sheet.”

Abstract

Current large-scale diffusion models represent a giant leap forward in conditional image synthesis, capable of interpreting diverse cues like text, human poses, and edges. However, their reliance on substantial computational resources and extensive data collection remains a bottleneck. On the other hand, the integration of existing diffusion models, each specialized for different controls and operating in unique latent spaces, poses a challenge due to incompatible image resolutions and latent space embedding structures, hindering their joint use. Addressing these constraints, we present "PanGu-Draw", a novel latent diffusion model designed for resource-efficient text-to-image synthesis that adeptly accommodates multiple control signals. We first propose a resource-efficient Time-Decoupling Training Strategy, which splits the monolithic text-to-image model into structure and texture generators. Each generator is trained using a regimen that maximizes data utilization and computational efficiency, cutting data preparation by 48% and reducing training resources by 51%. Secondly, we introduce "Coop-Diffusion", an algorithm that enables the cooperative use of various pre-trained diffusion models with different latent spaces and predefined resolutions within a unified denoising process. This allows for multi-control image synthesis at arbitrary resolutions without the necessity for additional data or retraining. Empirical validations of PanGu-Draw show its exceptional prowess in text-to-image and multi-control image generation, suggesting a promising direction for future model training efficiencies and generation versatility. The largest 5B T2I PanGu-Draw model is released on the Ascend platform.

Method

Time-Decoupling Training Strategy

We introduce the Time-Decoupling Training Strategy, which divides a comprehensive text-to-image model into two specialized sub-models: a structure generator and a texture generator. The structure generator is responsible for early-stage denoising across larger time steps and focuses on establishing the foundational outlines of the image. Conversely, the texture generator operates during the later, smaller time steps to elaborate on the textural details. Each generator is half the size of the original and is trained in isolation, which not only alleviates the need for high-memory computation devices but also avoids the complexities associated with model sharding and its accompanying inter-machine communication overhead. Furthermore, the structure generator is trained with high-resolution images and upscaled lower-resolution ones, achieving higher data efficiency and avoiding the problem of semantic degeneration. The texture generator is trained at a lower resolution while still sampling at high resolution, achieving an overall 51% improvement in training efficiency.
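The hand-off between the two generators at sampling time can be sketched as follows. This is a minimal illustration, not the released implementation: the split point `t_struct`, the function names, and the toy denoisers are all assumptions introduced here, and each "latent" is a single float so the routing logic stays visible.

```python
from typing import Callable, List

# A denoiser maps (noisy latent, timestep) to a slightly cleaner latent.
Denoiser = Callable[[float, int], float]


def pick_generator(t: int, t_struct: int) -> str:
    """Name the sub-model that handles timestep t.

    t_struct is a hypothetical split point: timesteps at or above it go
    to the structure generator, smaller ones to the texture generator.
    """
    return "structure" if t >= t_struct else "texture"


def schedule(total_steps: int, t_struct: int) -> List[str]:
    """List the sub-model used at each step, from the largest t down to 0."""
    return [pick_generator(t, t_struct) for t in range(total_steps - 1, -1, -1)]


def sample(z: float, total_steps: int, t_struct: int,
           structure: Denoiser, texture: Denoiser) -> float:
    """Run one reverse pass, switching sub-models at t_struct."""
    for t in range(total_steps - 1, -1, -1):  # large timesteps first
        use_structure = pick_generator(t, t_struct) == "structure"
        model = structure if use_structure else texture
        z = model(z, t)
    return z


# Toy stand-ins (assumptions): they only record which model was called.
calls: List[str] = []


def toy_structure(z: float, t: int) -> float:
    calls.append("S")
    return z * 0.5  # pretend to remove coarse noise


def toy_texture(z: float, t: int) -> float:
    calls.append("T")
    return z  # pretend to refine details without changing scale


result = sample(8.0, total_steps=6, t_struct=3,
                structure=toy_structure, texture=toy_texture)
```

Because only one sub-model is needed at any given timestep, each one can be trained and held in memory separately, which is the property the strategy relies on.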

Prompt Enhancement LLM with RLAIF Algorithm

To further enhance our generation quality, we harness the advanced comprehension abilities of large language models (LLMs) to align users’ succinct inputs with the detailed inputs required by the model. Initially, we fine-tune the LLM on a human-annotated dataset so that it transforms a succinct prompt into a more enriched version. Subsequently, to optimize for PanGu-Draw, we employ the Reward rAnked FineTuning (RAFT) method, which selects the prompt pairs yielding the highest reward for further fine-tuning.
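The selection step of reward-ranked fine-tuning can be sketched as below. The sketch assumes that several enriched candidates are sampled per succinct prompt and that a scalar reward is available for each (prompt, candidate) pair. The function names and the word-count reward are placeholders introduced for illustration, not the reward used for PanGu-Draw.

```python
from typing import Callable, Dict, List, Tuple

RewardFn = Callable[[str, str], float]


def select_best_pairs(candidates: Dict[str, List[str]],
                      reward: RewardFn) -> List[Tuple[str, str]]:
    """For each succinct prompt, keep the enriched candidate with the top reward.

    The returned (succinct, enriched) pairs would form the next
    fine-tuning set for the prompt-enhancement LLM.
    """
    pairs: List[Tuple[str, str]] = []
    for short_prompt, enriched_list in candidates.items():
        if not enriched_list:
            continue  # nothing was sampled for this prompt
        best = max(enriched_list, key=lambda e: reward(short_prompt, e))
        pairs.append((short_prompt, best))
    return pairs


def toy_reward(short_prompt: str, enriched: str) -> float:
    """Placeholder reward (assumption): longer enriched prompts score higher."""
    return float(len(enriched.split()))


sampled = {
    "a cat": ["a cat", "a fluffy cat on a sofa"],
    "a ship": ["an old pirate ship at dusk", "a ship"],
}
best_pairs = select_best_pairs(sampled, toy_reward)
```

In practice the reward would score the images generated from each enriched prompt, so the LLM is steered toward prompts that the image model renders well.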

Controllable Stylized Text-to-Image Generation

While techniques like LoRA allow one to adapt a text-to-image model to a specific style (e.g., human-aesthetic-preferred style, cartoon style), they do not allow one to adjust the degree of the desired style. Inspired by the classifier-free guidance mechanism, we propose to perform controllable stylized text-to-image generation by prepending a special prefix to the original prompt of human-aesthetic-preferred and cartoon samples, denoted as \(c_{aes}\) and \(c_{cartoon}\) respectively, during training. During sampling, we extrapolate the prediction in the direction of \(\epsilon_\theta(z_t, t, c_{style})\) and away from \(\epsilon_\theta(z_t, t, c)\) as follows:

\(\hat{\epsilon}_{\theta}(z_t, t, c) = \epsilon_{\theta}(z_t, t, \emptyset) + s \cdot ({\epsilon}_{\theta}(z_t, t, c) - \epsilon_{\theta}(z_t, t, \emptyset)) + s_{style} \cdot ({\epsilon}_{\theta}(z_t, t, c_{style}) - \epsilon_{\theta}(z_t, t, c)) \),

where \(s\) is the classifier-free guidance scale, \(c_{style} \in \{c_{aes}, c_{cartoon}\}\) and \(s_{style}\) is the style guidance scale.
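The guidance formula can be evaluated element-wise once the three noise predictions are available. The sketch below uses plain Python lists in place of tensors; the function name and the sample values are illustrative assumptions.

```python
from typing import List, Sequence


def style_guided_eps(eps_uncond: Sequence[float],
                     eps_text: Sequence[float],
                     eps_style: Sequence[float],
                     s: float,
                     s_style: float) -> List[float]:
    """Combine three noise predictions as in the style-guidance formula.

    eps_uncond: prediction for the empty prompt
    eps_text:   prediction for the original prompt c
    eps_style:  prediction for the style-prefixed prompt c_style
    """
    return [
        u + s * (c - u) + s_style * (cs - c)
        for u, c, cs in zip(eps_uncond, eps_text, eps_style)
    ]


# With s_style = 0 the style term vanishes and the result is ordinary
# classifier-free guidance; raising s_style strengthens the style.
plain = style_guided_eps([0.0], [1.0], [2.0], s=2.0, s_style=0.0)
styled = style_guided_eps([0.0], [1.0], [2.0], s=2.0, s_style=0.5)
```

Setting \(s_{style}\) to a negative value would push the prediction away from the style, which is consistent with the extrapolation reading of the formula.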

Coop-Diffusion: Multi-Diffusion Fusion Algorithm

We propose the Coop-Diffusion algorithm to fuse diverse pre-trained diffusion models for multi-control or multi-resolution image generation without training a new model. (a) Existing pre-trained diffusion models are each tailored for specific controls and operate within distinct latent spaces and image resolutions. (b) The first sub-module bridges the gap arising from different latent spaces by transforming the model prediction \(\epsilon_t'\) in latent space B to the target latent space A as \(\tilde\epsilon_t\), using the image space as an intermediate. (c) The second sub-module bridges the gap arising from different resolutions by performing upsampling on the predicted clean data \(\hat{x}_{0,t}'\).
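Both bridging steps can be sketched under the standard DDPM relation \(z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\, \epsilon\). The sketch is an assumption-laden illustration rather than the paper's exact procedure: latents are single floats, the decoder and encoder are toy linear maps, both noisy latents are taken as given inputs, and nearest-neighbour upsampling stands in for whatever interpolation is actually used. All function names are hypothetical.

```python
import math
from typing import Callable, List


def predict_clean(z_t: float, eps: float, alpha_bar: float) -> float:
    """Recover the predicted clean latent from a noise prediction."""
    return (z_t - math.sqrt(1.0 - alpha_bar) * eps) / math.sqrt(alpha_bar)


def bridge_eps(eps_b: float, z_t_b: float, z_t_a: float, alpha_bar: float,
               decode_b: Callable[[float], float],
               encode_a: Callable[[float], float]) -> float:
    """Re-express a noise prediction from latent space B in latent space A.

    Path: eps in B -> clean latent in B -> image -> clean latent in A -> eps in A.
    """
    z0_b = predict_clean(z_t_b, eps_b, alpha_bar)
    image = decode_b(z0_b)   # image space is the shared intermediate
    z0_a = encode_a(image)
    return (z_t_a - math.sqrt(alpha_bar) * z0_a) / math.sqrt(1.0 - alpha_bar)


def upsample_nearest(img: List[List[float]], factor: int) -> List[List[float]]:
    """Nearest-neighbour upsampling of a 2D grid of predicted clean data."""
    out: List[List[float]] = []
    for row in img:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# Toy check: space B stores 2x the image value, space A stores 3x.
alpha_bar = 0.64
true_eps, image = 0.5, 1.0
z_t_b = math.sqrt(alpha_bar) * 2.0 * image + math.sqrt(1.0 - alpha_bar) * true_eps
z_t_a = math.sqrt(alpha_bar) * 3.0 * image + math.sqrt(1.0 - alpha_bar) * true_eps
eps_a = bridge_eps(true_eps, z_t_b, z_t_a, alpha_bar,
                   decode_b=lambda z: z / 2.0,
                   encode_a=lambda x: 3.0 * x)
bigger = upsample_nearest([[1.0, 2.0]], factor=2)
```

Once the prediction is expressed in the target latent space and resolution, it can be combined with the target model's own prediction inside a single denoising loop.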

Quantitative Results

Comparison of PanGu-Draw with English text-to-image generation models on the COCO dataset in terms of FID. Our 5B PanGu-Draw model achieves the best FID among released models.

Comparison of PanGu-Draw with Chinese text-to-image generation models on the COCO-CN dataset in terms of FID, IS, and CN-CLIP score.

Quantitative comparison. Results of a user study on ImageVal-prompt in terms of image-text alignment, image fidelity, and aesthetics. PanGu-Draw achieves better results than SD and SDXL across all three metrics. Its performance is also competitive with that of Midjourney 5.2 and DALL-E 3, indicating PanGu-Draw’s excellent text-to-image capabilities.

Qualitative Results

Text-to-Image Generation

Prompt Enhancement LLM with RLAIF Algorithm

Text-to-image generation results without and with prompt enhancement. Enriched text improves image generation through better aesthetic perception (left), a more detailed background (middle), and better interpretation of abstract concepts (right).

Controllable Stylized Text-to-Image Generation

Multi-Diffusion Fusing Results

Generation results from fusing an image variation model and PanGu-Draw with the proposed Coop-Diffusion algorithm.

Generation results guided by fusing signals of text and pose/edge map by our Coop-Diffusion algorithm.

Images generated with a low-resolution (LR) model (first row: text-to-image model; second row: edge-to-image ControlNet) and with the fusion of the LR model and HR PanGu-Draw using our Coop-Diffusion algorithm. This allows for single-stage super-resolution with better details and higher inference efficiency.

BibTeX

@article{lu2023pangudraw,
  title={PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion},
  author={Guansong Lu and Yuanfan Guo and Jianhua Han and Minzhe Niu and Yihan Zeng and Songcen Xu and Zeyi Huang and Zhao Zhong and Wei Zhang and Hang Xu},
  journal={arXiv preprint arXiv:2312.16486},
  year={2023}
}