How far can we go with ImageNet for Text-to-Image generation? Training high-quality text-to-image models using 1000x less data in just 200 GPU hours