
CLIP ViT-Large

OpenAI's foundational text-image embedding — the bedrock of every diffusion model.

Tags: openai · clip · open source

Specs

Context window: 77 tokens
Modalities: text, image
Tool use: no
Vision: yes
Streaming: no
License: MIT
Released: 2021-01-05


CLIP (January 2021) is OpenAI's contrastive language-image pretraining model: a joint text and image encoder that embeds both modalities into a shared 768-dimensional space (for the ViT-Large variant). It underpins Stable Diffusion's text conditioning, early DALL-E re-ranking, and countless multimodal retrieval, zero-shot classification, and NSFW-detection pipelines. The weights are freely available under an MIT license and run on a single GPU. ViT-Large/14 is the most widely used variant; ViT-G/14 (from OpenCLIP) is the largest.
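The mechanics of that shared space are simple: normalize both embeddings, take cosine similarities, scale by a temperature, and softmax over candidate captions. A minimal sketch with toy vectors (standing in for real 768-dim CLIP outputs; the temperature value and embeddings here are illustrative, not the model's):

```python
import numpy as np

def clip_style_scores(image_emb, text_embs, temperature=100.0):
    """CLIP-style zero-shot scoring: cosine similarity in the shared
    embedding space, scaled by a temperature, then softmaxed over captions."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)   # one logit per candidate caption
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Toy 4-dim embeddings standing in for CLIP's 768-dim ViT-Large space.
rng = np.random.default_rng(0)
image = rng.normal(size=4)
captions = np.stack([
    image + 0.1 * rng.normal(size=4),  # near-duplicate of the image embedding
    rng.normal(size=4),                # unrelated caption
    rng.normal(size=4),                # unrelated caption
])
probs = clip_style_scores(image, captions)
print(probs.argmax())  # index of the caption closest to the image
```

In the real model the temperature is a learned parameter and the encoders are a ViT and a Transformer text tower, but the scoring step is exactly this normalized dot product.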

Editor's verdict

Five years old by 2026 but still everywhere: if you build anything in computer vision, CLIP is in your stack whether you know it or not. For new image-text retrieval projects, SigLIP or BGE-M3 (multimodal) are stronger; for Chinese retrieval, Chinese-CLIP is purpose-built. Keep CLIP on the radar as the historical foundation, but for production-grade modern work, pick the descendant that matches your language and use case.


Last updated: 2026-04-29
