CLIP ViT-Large
OpenAI's foundational text-image embedding model, the text encoder behind a generation of diffusion models.
Specs
- Context window: 77 tokens
- Modalities: text, image
- Tool use: —
- Vision: ✓
- Streaming: —
- License: MIT
- Released: 2021-01-05
Pricing
Free: open weights under an MIT license, so the only cost is your own compute.
Overview
CLIP (January 2021) is OpenAI's contrastive language-image pretraining model: a joint text and image encoder that embeds both modalities into a shared 768-dimensional space (for the ViT-Large variant). It is the foundation for Stable Diffusion, early DALL-E, and countless multimodal retrieval pipelines, image classifiers, and NSFW detectors. The weights are MIT-licensed, freely available, and run on a single GPU. ViT-Large/14 is the most widely used variant; ViT-G/14 (from OpenCLIP) is the largest.
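A minimal sketch of how that shared embedding space is typically used, assuming the Hugging Face transformers library and its openai/clip-vit-large-patch14 checkpoint; the image path and prompts here are purely illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# ViT-L/14 checkpoint: both encoders project into the same 768-dim space.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("photo.jpg")  # hypothetical local image
texts = ["a photo of a cat", "a photo of a dog"]

# truncation=True enforces the 77-token text context window from the specs.
inputs = processor(text=texts, images=image,
                   return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores; softmax gives zero-shot label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```

The same model exposes get_text_features and get_image_features for embedding each modality separately, which is how retrieval pipelines precompute an image index and match free-text queries against it.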
Editor's verdict
Five years old by 2026 but still everywhere: if you do any computer-vision builder work, CLIP is in your stack whether you know it or not. For new image-text retrieval projects, SigLIP or BGE-M3 (multimodal) are stronger; for Chinese retrieval, Chinese-CLIP is purpose-built. Keep CLIP on the radar as the historical foundation; for production-grade modern work, pick the descendant that matches your language and use case.
Reviews
No reviews yet.
Last updated: 2026-04-29