How far can we go with ImageNet for Text-to-Image generation?
Training high-quality text-to-image models using 1000x less data in just 200 GPU hours

Overview
Current text-to-image models face significant challenges: they require enormous datasets (billions of images) and substantial computational resources (thousands of GPU hours), and they raise concerns about data curation, copyright, and inappropriate content generation.
Our key contribution: with careful data augmentation and training strategies, high-quality text-to-image models can be trained using only ImageNet data, cutting the required training data by 1000× and the compute to under 500 GPU hours.
Paper and Code
📜 Paper on arXiv
💻 GitHub Repository
Methods
1. Text Augmentation: detailed captioning
ImageNet’s original labels are just class names, but the images themselves contain much richer visual information. We generate highly detailed captions that describe each image.
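
To make this concrete, here is a minimal sketch of the re-captioning step. It uses an off-the-shelf BLIP captioner purely as a stand-in; the actual vision-language model, prompting, and caption length used in the paper may differ, and the example filename is illustrative.

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Off-the-shelf captioner used here only as an illustrative stand-in.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def detailed_caption(image_path, class_name):
    # Caption an ImageNet image, conditioning on its class name so the
    # generated text stays on topic while adding visual detail.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, text=f"a photo of a {class_name},", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=60)
    return processor.decode(out[0], skip_special_tokens=True)

# Example: re-caption one image from the "tench" class (illustrative path).
print(detailed_caption("n01440764_10026.JPEG", "tench"))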

2. Image Augmentation: CutMix
We implement a modified CutMix strategy to improve the models' compositional understanding. CutMix is applied only at high diffusion noise levels, which prevents the model from learning the artificial border between the pasted patches, and the composited image is re-captioned so that its text describes the mix smoothly.
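
Below is a minimal sketch of this noise-gated CutMix idea, assuming the gate is a simple threshold on the normalized noise level; the function and parameter names (cutmix_at_high_noise, t_threshold, patch_frac) are illustrative rather than the authors' exact implementation, and the re-captioning of the mixed image happens separately.

import torch

def cutmix_at_high_noise(x_a, x_b, t, t_threshold=0.7, patch_frac=0.5):
    # x_a, x_b: image tensors of shape (C, H, W); t: noise level in [0, 1].
    # The patch is pasted only at high noise, where the added Gaussian noise
    # hides the hard border between the two source images.
    if t < t_threshold:
        return x_a  # low noise: keep the original image untouched
    _, h, w = x_a.shape
    ph, pw = int(h * patch_frac), int(w * patch_frac)
    top = int(torch.randint(0, h - ph + 1, (1,)))
    left = int(torch.randint(0, w - pw + 1, (1,)))
    mixed = x_a.clone()
    mixed[:, top:top + ph, left:left + pw] = x_b[:, top:top + ph, left:left + pw]
    return mixed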

Gallery

Gradio demo
Coming soon …
BibTeX
@article{degeorge2025farimagenettexttoimagegeneration,
  title   = {How far can we go with ImageNet for Text-to-Image generation?},
  author  = {Lucas Degeorge and Arijit Ghosh and Nicolas Dufour and David Picard and Vicky Kalogeiton},
  year    = {2025},
  journal = {arXiv},
}
Acknowledgments
This work was granted access to the HPC resources of IDRIS under the allocation 2025-AD011015436 and 2025-AD011015594 made by GENCI, and by the SHARP ANR project ANR-23-PEIA-0008 funded in the context of the France 2030 program. The authors would like to thank Thibaut Loiseau, Yannis Siglidis, Yohann Perron, Louis Geist, Robin Courant and Sinisa Stekovic for their insightful comments, suggestions, and discussions.