CLIP ViT-Large
OpenAI's foundational text-image embedding model, the text encoder behind a generation of diffusion models.
Specs
- Context window: 77 tokens
- Modalities: text, image
- Tool use: —
- Vision: ✓
- Streaming: —
- License: MIT
- Released: 2021-01-05
Pricing
Free: open weights under an MIT license, so the only cost is your own compute.
Overview
CLIP (January 2021) is OpenAI's contrastive language-image pretraining model: a joint text and image encoder that embeds both modalities into a shared 768-dimensional space (for the ViT-Large variant). It is the foundation for Stable Diffusion, early DALL-E, and countless multimodal retrieval pipelines, image classifiers, and NSFW detectors. The weights are MIT-licensed, freely available, and run on a single GPU. ViT-Large/14 is the most widely used variant; ViT-G/14 (from OpenCLIP) is the largest.
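A minimal sketch of how that shared embedding space is typically used, assuming the Hugging Face transformers library and its openai/clip-vit-large-patch14 checkpoint; the image path and prompts here are purely illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# ViT-L/14 checkpoint: both encoders project into the same 768-dim space.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("photo.jpg")  # hypothetical local image
texts = ["a photo of a cat", "a photo of a dog"]

# truncation=True enforces the 77-token text context window from the specs.
inputs = processor(text=texts, images=image,
                   return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores; softmax gives zero-shot label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```

The same model exposes get_text_features and get_image_features for embedding each modality separately, which is how retrieval pipelines precompute an image index and match free-text queries against it.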
Editor's verdict
Five years old by 2026 but still everywhere: if you do any computer-vision builder work, CLIP is in your stack whether you know it or not. For new image-text retrieval projects, SigLIP or BGE-M3 (multimodal) are stronger; for Chinese retrieval, Chinese-CLIP is purpose-built. Keep CLIP on the radar as the historical foundation; for production-grade modern work, pick the descendant that matches your language and use case.
Reviews
No reviews yet.
Last updated: 2026-04-29