When someone says "Llama 3 is open source," they're using "open source" loosely. In the AI world the more accurate phrase is open weights: the trained model file is downloadable, but the training data, code, and exact recipe usually aren't. The distinction matters because it changes what you can legally and practically do with the model.
Three levels of openness
Most "open" LLMs sit on a spectrum:
Open weights, restrictive license — Llama 3 / 4 (Meta), Qwen (Alibaba), some others. You can download the model, fine-tune it, and run it commercially, but the license adds restrictions: monthly active user thresholds, naming requirements, prohibited-use clauses. Llama's license, for example, requires a separate agreement with Meta if your service has > 700 million MAU.
Open weights, truly permissive license — Mistral 7B / Mixtral (Apache 2.0), Falcon, OLMo (Allen Institute), Pythia. Any commercial use, no strings attached.
Fully open — open weights + open training data + open code + open recipe. Examples: OLMo 2, Pythia, BLOOM. Rare and usually less competitive on benchmarks because the very best training data is proprietary.
The vast majority of "open source LLMs" you'll hear about (Llama, Qwen, DeepSeek, Yi, etc.) are open weights — not open source by the Open Source Initiative's standards. The OSI published an actual "Open Source AI Definition" in 2024 that almost no major model technically meets.
Why this matters in practice
If you're building a product, three things actually affect you:
Can you run it commercially? Almost always yes for the major "open" models. Watch for: Llama's MAU threshold, research-only licenses on some newer Mistral models, and license changes between model versions. Read the license once before depending on it.
Can you fine-tune and redistribute? Mostly yes, but the original license usually follows redistributed fine-tuned weights; Llama's "Built with Llama" attribution requirement is the best-known example.
Can you reproduce or audit it? Without training data and code, no — you have to trust the lab. This matters for regulated industries (healthcare, finance) where reproducibility is a compliance requirement.
What you actually get with open weights
The practical wins of using open-weight models:
- No per-token API fees. You pay for compute (GPU hours) instead of per query. At high volume (millions of queries a month), self-hosting typically undercuts frontier API pricing; see the break-even sketch after this list.
- Privacy. No data leaves your infrastructure. Healthcare, legal, government use cases that can't ship data to OpenAI.
- Customization. Full control over fine-tuning, quantization, deployment. Run on your hardware, your config.
- No vendor lock-in. Anthropic could change pricing tomorrow; your local Llama 3 70B keeps running unchanged.
- Inspection. You can probe, prune, and analyze the model in ways closed APIs don't allow.
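To make the API-versus-GPU math concrete, here is a minimal break-even sketch. Every number in it (API price, GPU hourly rate, throughput) is an illustrative assumption, not a real quote; substitute your own figures.

```python
import math

# All constants below are illustrative assumptions, not real prices.
API_PRICE_PER_1M_TOKENS = 3.00   # assumed blended $/1M tokens on a frontier API
GPU_HOURLY_RATE = 2.50           # assumed $/hour to rent one inference GPU
GPU_TOKENS_PER_SECOND = 1_500    # assumed aggregate throughput of your serving stack

def api_cost_per_day(tokens_per_day: float) -> float:
    """Daily cost if every token goes through the paid API."""
    return tokens_per_day / 1_000_000 * API_PRICE_PER_1M_TOKENS

def self_host_cost_per_day(tokens_per_day: float) -> float:
    """Daily cost of dedicated, always-on GPUs (you pay for idle time too)."""
    daily_capacity = GPU_TOKENS_PER_SECOND * 86_400   # tokens one GPU can serve per day
    gpus_needed = max(1, math.ceil(tokens_per_day / daily_capacity))
    return gpus_needed * 24 * GPU_HOURLY_RATE

for tokens in (100_000, 1_000_000, 10_000_000, 100_000_000):
    print(f"{tokens:>12,} tok/day  api=${api_cost_per_day(tokens):9.2f}"
          f"  self-host=${self_host_cost_per_day(tokens):9.2f}")
```

Under these made-up numbers an always-on GPU costs $60/day regardless of load, so the API wins until volume climbs well past ten million tokens a day. The same logic drives the "volume is low" caveat later in this piece.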
The trade-offs are real. Frontier closed models (Claude, GPT-5) typically lead open-weight models by 6-12 months on raw capability. For most tasks the gap is small enough that price + privacy wins; for the hardest reasoning, frontier is still ahead.
The 2026 open-weight landscape
The major model families:
- Llama 3 / 4 (Meta) — most popular general-purpose family, strong English, broad ecosystem.
- Qwen 3 (Alibaba) — top multilingual, especially strong in Chinese; competitive sizes from 0.6B to 235B.
- DeepSeek V3 / R1 — Chinese, very cost-efficient. R1 was the first major open-weight reasoning model.
- Mistral / Mixtral — French lab, strong European-language quality; core open-weight releases are Apache 2.0, though some newer models ship under a research-only license.
- Gemma (Google) — DeepMind's open-weight line, smaller-scale, high quality.
- Phi (Microsoft) — small models trained on synthetic textbook-style data, surprisingly capable for size.
- Yi (01.AI) — competitive Chinese model family.
- OLMo (Allen AI) — fully open (weights, data, code, recipe); lower benchmark scores but valuable for research and education.
How to actually run an open-weight model
Four deployment patterns:
Local on your laptop:
- Ollama — easiest by far: `ollama run llama3` and you have a chat. Mac and Linux, with basic Windows support. (It's scriptable too; see the sketch after this list.)
- LM Studio — GUI app, good for non-technical users.
- llama.cpp — the engine many tools wrap. Quantized models run on CPU + small GPU.
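Ollama also exposes a local HTTP API (on port 11434 by default), so the same model you chat with in the terminal is scriptable. A minimal sketch, assuming the Ollama daemon is running and you have already pulled `llama3`:

```python
import json
import urllib.request

# Ollama's local server listens on localhost:11434; /api/chat is its chat endpoint.
payload = {
    "model": "llama3",   # assumes you've already run: ollama pull llama3
    "messages": [{"role": "user", "content": "In one sentence, what are open weights?"}],
    "stream": False,     # one JSON response instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())

print(reply["message"]["content"])   # the assistant's answer
```

No API key, no network egress: the request never leaves your machine.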
Self-hosted server:
- vLLM — best-in-class for serving open-weight LLMs, and the default choice for anyone serious about throughput (see the sketch after this list).
- TensorRT-LLM, TGI, SGLang — alternatives, each with their tradeoffs.
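For a feel of what vLLM looks like in code, here is a minimal offline-inference sketch. It assumes `pip install vllm`, a CUDA GPU with enough memory, and Hugging Face access to the model repo named below:

```python
from vllm import LLM, SamplingParams

# Load an open-weight model; vLLM handles batching and KV-cache paging internally.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")   # assumed model repo; swap in your own

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain open weights in one paragraph."], params)

for out in outputs:
    print(out.outputs[0].text)
```

In production you would more likely run vLLM's OpenAI-compatible HTTP server (`vllm serve <model>`) and point standard client libraries at it.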
Hosted inference (someone else's GPUs):
- Together AI, Fireworks, Groq, Cerebras — pay per token, no DevOps. Often cheaper than frontier APIs for the same model class (see the sketch after this list).
- Replicate, Modal, RunPod — flexible per-second GPU rental with deployment helpers.
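Most of these hosts expose OpenAI-compatible endpoints, so moving between providers (or away from a frontier API) is often just a base-URL change. A sketch using the `openai` Python client; the base URL, env var, and model ID below follow Together AI's conventions at the time of writing, so check your provider's docs:

```python
import os
from openai import OpenAI

# Point the standard OpenAI client at a hosted open-weight provider.
# Base URL and model ID are provider-specific; these assume Together AI's conventions.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],   # assumed env var name
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct-Turbo",  # provider-specific model ID
    messages=[{"role": "user", "content": "Summarize the trade-offs of open weights."}],
)
print(resp.choices[0].message.content)
```

The same setup works against a self-hosted vLLM server; only `base_url` and the key change.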
On-prem in your datacenter:
- For regulated industries, run on your own GPUs. The complexity is real (driver versions, memory management, queue scheduling), but the tooling is maturing.
When NOT to use open-weight
- Quality is non-negotiable and API latency is acceptable. Frontier closed models still lead on the hardest tasks.
- You don't have the GPU budget or expertise. Self-hosting LLMs has real ops cost; managed APIs avoid it.
- Volume is low. If you do < 1M tokens/day, frontier APIs are cheaper than running idle GPUs.
- You need OpenAI's specific features. Some workflows depend on closed-API features (Canvas, Tasks, etc.) that don't exist in open-weight ecosystems.
Is open weight "better" for the world?
This is a debate worth being aware of. Proponents argue open weights mean more research, less concentration of power, and more user control; critics argue that open releases accelerate capability proliferation, including for malicious uses. The labs disagree publicly. As a builder, your choice is mostly practical (capability, cost, control), but the politics shape which models exist next year.
Further reading
- What is a Large Language Model (LLM)
- Open-source LLM vs frontier API: which one for which task
- How to self-host an LLM stack on a single GPU box
- How to pick the right LLM for your use case
- Self-host a high-throughput inference server with vLLM