DALL-E is OpenAI's text-to-image generation model. It is part of the GPT family of models but is specifically designed for visual creativity and synthesis.
DALL-E uses a Transformer architecture but incorporates both text and image representations, learning how the two correlate. This multimodal alignment is closely tied to Contrastive Language-Image Pretraining (CLIP): the original DALL-E used CLIP to rerank candidate outputs, and DALL-E 2 conditions generation on CLIP embeddings. DALL-E was trained on a massive dataset of image-text pairs, which gave it the context to understand how descriptive language translates into visual elements.
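To make the contrastive idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss in PyTorch. It is illustrative only: the function name, temperature value, and the assumption that embeddings come from separate image and text encoders are ours, not OpenAI's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from separate encoders;
    row i of each tensor corresponds to the same image-caption pair.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix: logits[i][j] compares image i with text j.
    logits = image_emb @ text_emb.T / temperature

    # Matching pairs lie on the diagonal; train in both directions.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2
```

Training on this loss pulls matching image-text pairs together in a shared embedding space while pushing mismatched pairs apart, which is what lets descriptive language be scored against visual content.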
DALL-E 2 and newer models are diffusion models rather than the auto-regressive generators used in the original DALL-E. A diffusion model starts from random noise and iteratively denoises it into a coherent image, which improves image quality and realism.
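The sketch below shows the core of a DDPM-style reverse (denoising) loop, assuming a model that predicts the noise added at each timestep. The noise schedule, shapes, and `model` interface are placeholder assumptions; DALL-E 2's actual decoder is far larger and also conditions on CLIP text and image embeddings.

```python
import torch

@torch.no_grad()
def sample(model, shape, betas):
    """Start from pure Gaussian noise and iteratively denoise it.

    model(x_t, t) is assumed to predict the noise added at step t.
    betas: 1-D tensor of per-step noise variances (the schedule).
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)  # x_T: pure noise
    for t in reversed(range(len(betas))):
        eps = model(x, t)  # predicted noise at step t
        # Posterior mean of x_{t-1} given x_t (DDPM update rule).
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        # Add noise at every step except the last.
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # x_0: the generated image tensor
```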
DALL-E has zero-shot generalization capabilities, allowing it to generate plausible visuals for concepts it hasn't explicitly seen during training. DALL-E 2 also introduced inpainting: the ability to edit parts of existing images while maintaining coherence with the surrounding content.
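As a usage example, the OpenAI Python SDK (v1.x) exposes both generation and inpainting through the Images API. The prompts and file names below are placeholders; the mask's transparent region marks where the edit is applied.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Zero-shot generation: a concept unlikely to appear verbatim in training data.
gen = client.images.generate(
    model="dall-e-3",
    prompt="an armchair in the shape of an avocado",
    size="1024x1024",
    n=1,
)
print(gen.data[0].url)

# Inpainting with DALL-E 2: only the transparent area of the mask is
# regenerated, keeping the rest of the image coherent.
edit = client.images.edit(
    model="dall-e-2",
    image=open("room.png", "rb"),
    mask=open("mask.png", "rb"),
    prompt="a sunlit window with curtains",
    size="1024x1024",
    n=1,
)
print(edit.data[0].url)
```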