How far can we go with ImageNet for Text-to-Image generation?

Training high-quality text-to-image models using 1000× less data in just 200 GPU hours

Lucas Degeorge, Arijit Ghosh, Nicolas Dufour, Vicky Kalogeiton, David Picard

Overview

Current text-to-image models face significant challenges: they require enormous datasets (billions of images), substantial computational resources (thousands of GPU hours), and raise concerns about data curation, copyright issues, and inappropriate content generation.

Our key contribution: With careful data augmentation and training strategies, high-quality text-to-image models can be trained using only ImageNet data — reducing the required training data by a factor of 1000× and computational requirements to under 500 GPU hours.

Paper and Code

📜 Paper on arXiv
💻 GitHub Repository

Methods

1. Text Augmentation: detailed captioning

ImageNet’s original labels are simple class names, but the images contain rich visual information. We therefore generate highly detailed captions that describe this visual content, replacing the bare class label with a full textual description of each image.

Example of a detailed caption on ImageNet
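To make the idea concrete, below is a minimal sketch of how such detailed captions could be produced with an off-the-shelf captioner from Hugging Face. BLIP is used here purely as a stand-in: the model choice, output length, and file path are illustrative assumptions, not the exact captioning pipeline used in the paper.

```python
# Hypothetical re-captioning sketch with an off-the-shelf BLIP captioner
# (a stand-in for the detailed-caption model used in the paper).
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large"
).to(device)

def detailed_caption(image_path: str) -> str:
    """Return a free-form caption describing the image content."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    output_ids = captioner.generate(**inputs, max_new_tokens=80)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Example (illustrative path): turn the class label "tabby cat" into a sentence.
# print(detailed_caption("imagenet/train/n02123045/n02123045_0001.JPEG"))
```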

2. Image Augmentation: CutMix

We implement a modified CutMix strategy to improve the compositional understanding of the models. CutMix is applied only at high noise levels of the diffusion process, which prevents the model from learning the pasted border, and re-captioning the mixed image keeps the text description coherent.

Our CutMix strategy combines multiple ImageNet classes with corresponding caption updates
Example generations from our models following various text prompts
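For reference, here is a minimal PyTorch sketch of this idea. It is a hypothetical helper under simple assumptions (a scalar noise level in [0, 1], a fixed threshold, and naive caption concatenation instead of the paper's re-captioning), not the repository's exact implementation.

```python
# Minimal CutMix sketch for diffusion training (hypothetical helper, simplified).
import torch

def cutmix(images, captions, noise_level, threshold=0.7):
    """Paste a rectangular patch from a shuffled copy of the batch into each image.

    Only applied when the diffusion noise level is high (>= threshold), so the
    pasted border is hidden by noise and the model does not learn it.
    """
    if noise_level < threshold:               # skip mixing at low noise levels
        return images, captions

    b, _, h, w = images.shape
    perm = torch.randperm(b)                  # partner image for each sample
    lam = torch.rand(1).item()                # fraction of area kept from the original

    # Box size proportional to sqrt(1 - lam), centered at a random location.
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]

    # Update captions so the text mentions both contents (the paper re-captions
    # the mixed image; plain concatenation is used here for brevity).
    idx = perm.tolist()
    mixed_captions = [f"{captions[i]} In one region of the image, {captions[idx[i]]}"
                      for i in range(b)]
    return mixed, mixed_captions
```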

Gradio demo

Coming soon …

BibTeX

@article{degeorge2025farimagenettexttoimagegeneration,
    title   = {How far can we go with ImageNet for Text-to-Image generation?},
    author  = {Lucas Degeorge and Arijit Ghosh and Nicolas Dufour and David Picard and Vicky Kalogeiton},
    year    = {2025},
    journal = {arXiv},
}

Acknowledgments

This work was granted access to the HPC resources of IDRIS under the allocations 2025-AD011015436 and 2025-AD011015594 made by GENCI, and was supported by the SHARP ANR project (ANR-23-PEIA-0008) funded in the context of the France 2030 program. The authors would like to thank Thibaut Loiseau, Yannis Siglidis, Yohann Perron, Louis Geist, Robin Courant and Sinisa Stekovic for their insightful comments, suggestions, and discussions.