How far can we go with ImageNet for Text-to-Image generation?
Training high-quality text-to-image generation models with 1/10th the parameters and 1/1000th the training images, in about 500 H100 hours

Overview
The common belief in text-to-image (T2I) generation is that larger training datasets lead to better performance, pushing the field towards billion-scale datasets. However, this “bigger is better” approach often overlooks data efficiency, reproducibility, and accessibility, as many large datasets are closed-source or decay over time.
Our work challenges this paradigm. We demonstrate that it’s possible to match or even outperform models trained on massive web-scraped collections by using only ImageNet, a widely available and standardized dataset. By enhancing ImageNet with carefully designed text and image augmentations, our approach achieves:
- A +1% improvement in overall score over SD-XL on GenEval and +0.5% on DPGBench, using models with just 1/10th the parameters and trained on 1/1000th the number of images.
- Our models (300M-400M parameters) can be trained with a significantly reduced compute budget (around 500 H100 hours).
- We also show successful scaling to higher-resolution (e.g., $512^2$) generation under these constraints.
This research opens avenues for more reproducible and accessible T2I research, enabling more teams to contribute to the field without requiring massive compute resources or proprietary datasets. All our training data, code, and models are openly available.
Paper and Code
📜 Paper on arXiv
💻 GitHub Repository
Adopting ImageNet for text-to-image generation
- Long informative captions
ImageNet’s original labels are simple class names, but the images contain rich visual information. As Text Augmentation (TA), we generate highly detailed captions that describe these images (see the captioning sketch after this list).
- Image augmentations to reduce overfitting and improve compositional abilities
Models trained on ImageNet (even with TA) can suffer from early overfitting due to its relatively small scale (1.2M images) and struggle with complex compositions due to its object-centric nature. We implement two Image Augmentation (IA) methods (CutMix and Crop) to mitigate these issues (see the augmentation sketch after this list).
These IA strategies, when combined with TA (TA+IA), demonstrably reduce overfitting and significantly improve compositional reasoning (+7 points on GenEval’s “Two Objects” sub-task).
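
The recaptioning step can be sketched as a simple loop over ImageNet images with an off-the-shelf vision-language captioner. The BLIP checkpoint, prompt-free setup, and file path below are stand-in assumptions for illustration; the captioning model and prompting used in the paper may differ.

```python
# Minimal sketch of Text Augmentation (TA): replace ImageNet class labels with
# long, detailed captions produced by an off-the-shelf vision-language model.
# The BLIP checkpoint is an assumed stand-in for the paper's actual captioner.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def caption_image(path: str) -> str:
    """Return a generated caption for a single ImageNet image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=77)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Hypothetical example path inside an ImageNet synset folder.
print(caption_image("n01440764/ILSVRC2012_val_00000293.JPEG"))
```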
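The two image augmentations can likewise be sketched in a few lines: CutMix pastes a patch of a second image (and its caption) into the first to create multi-object scenes, while Crop varies object scale and position. Patch sizes, probabilities, and the caption-merging rule below are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of the two Image Augmentations (IA): CutMix and Crop,
# applied jointly to (image, caption) pairs. Hyperparameters are assumptions.
import random
from PIL import Image

def cutmix(img_a: Image.Image, cap_a: str, img_b: Image.Image, cap_b: str,
           min_frac: float = 0.3, max_frac: float = 0.5):
    """Paste a random rectangle of img_b onto img_a and merge the captions,
    turning two object-centric ImageNet images into one multi-object scene."""
    w, h = img_a.size
    pw = int(w * random.uniform(min_frac, max_frac))
    ph = int(h * random.uniform(min_frac, max_frac))
    x, y = random.randint(0, w - pw), random.randint(0, h - ph)
    mixed = img_a.copy()
    mixed.paste(img_b.resize((pw, ph)), (x, y))
    # Hypothetical caption merge; in practice the combined caption could be
    # rewritten more carefully than by simple concatenation.
    return mixed, f"{cap_a} Next to it, {cap_b}"

def random_crop(img: Image.Image, cap: str, frac: float = 0.75):
    """Crop a random sub-region so objects appear at varied scales and
    positions, reducing overfitting to canonical ImageNet framings."""
    w, h = img.size
    cw, ch = int(w * frac), int(h * frac)
    x, y = random.randint(0, w - cw), random.randint(0, h - ch)
    return img.crop((x, y, x + cw, y + ch)), cap
```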

Comparison with State-of-the-Art
Our approach, despite using significantly fewer resources, demonstrates strong performance compared to established large-scale models.

Gallery

Gradio demo
Coming soon …
BibTeX
@article{degeorge2025farimagenettexttoimagegeneration,
  title   = {How far can we go with ImageNet for Text-to-Image generation?},
  author  = {Lucas Degeorge and Arijit Ghosh and Nicolas Dufour and David Picard and Vicky Kalogeiton},
  journal = {arXiv},
  year    = {2025},
}
Acknowledgments
This work was granted access to the HPC resources of IDRIS under the allocation 2025-AD011015436 and 2025-AD011015594 made by GENCI, and by the SHARP ANR project ANR-23-PEIA-0008 funded in the context of the France 2030 program. The authors would like to thank Alexei A. Efros, Thibaut Loiseau, Yannis Siglidis, Yohann Perron, Louis Geist, Robin Courant and Sinisa Stekovic for their insightful comments, suggestions, and discussions.