

LLM observability: logging, tracing, evals

If you can't see what your agent did and why, you can't fix it. The 2026 stack and the four signals that matter.

Most LLM products are built blind. The team writes a prompt, ships it, gets bug reports, can't reproduce them, and tries random fixes hoping one sticks. Six months in, no one knows whether the latest model upgrade made things better or worse.

Observability is the discipline of making your LLM system legible. Three layers — logging, tracing, evals — that turn a black box into something you can debug.

Why standard APM doesn't cover it

Datadog, New Relic, Sentry, and friends are great at HTTP request flows. They tell you that a request took 800ms and that endpoint X has a 99% success rate. They are not built to answer LLM-specific questions like these:

  • Why did the model give this answer? You need the full prompt, including dynamically retrieved context.
  • Did the model use the right tool? You need each step of an agent's decision chain.
  • Is this conversation getting worse over time? You need to compare turn 5 to turn 1, not just one HTTP call.
  • Did the new prompt break something? You need before/after on a fixed eval set.

This is what dedicated LLM observability solves.

Layer 1: structured logs

The minimum useful logging captures, per LLM call:

  • Full request body. Model name, system prompt, user prompt, all messages, tool definitions, temperature, max_tokens, all sampling params.
  • Full response body. All output blocks (text, tool_use, thinking blocks if reasoning), stop_reason, finish_reason.
  • Token usage. Input tokens (split into cached vs non-cached if you use prompt caching), output tokens.
  • Latency. Time-to-first-token, total time.
  • Cost. Computed from token counts and current pricing.
  • Trace ID. Unique ID linking this call to the user request that caused it.

Don't redact prompts in dev. In prod, redact PII before storing if your privacy posture requires it.

A tip: store logs in a queryable format (JSON in Postgres, Parquet in S3, or a managed observability tool). Plain text logs become useless past 1000 entries.
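
A minimal sketch of what one such record can look like, assuming a JSONL file as the sink; the field names (including the cached-token key) and the pricing inputs are illustrative, not a standard schema:

import json
import time
import uuid

def log_llm_call(request: dict, response: dict, usage: dict,
                 ttft_ms: float, total_ms: float, trace_id: str,
                 usd_per_input_token: float, usd_per_output_token: float) -> dict:
    # One queryable record per LLM call, capturing the fields listed above.
    record = {
        "id": str(uuid.uuid4()),
        "trace_id": trace_id,              # links the call to the user request that caused it
        "timestamp": time.time(),
        "request": request,                # model, messages, tools, sampling params
        "response": response,              # all output blocks, stop/finish reason
        "input_tokens": usage.get("input_tokens", 0),
        "cached_input_tokens": usage.get("cached_input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "ttft_ms": ttft_ms,
        "total_ms": total_ms,
    }
    # Naive cost estimate; ignores cached-token discounts for brevity.
    record["cost_usd"] = (record["input_tokens"] * usd_per_input_token
                          + record["output_tokens"] * usd_per_output_token)
    # Append as one JSON line; swap for an INSERT into a Postgres JSONB column
    # or a Parquet writer once volume grows.
    with open("llm_calls.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record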

Layer 2: traces

A single LLM call is rarely the whole story. Real systems are multi-step:

user_request -> retrieve(query) -> rerank -> llm_call(plan) -> tool_call(search) -> llm_call(answer) -> response

A trace captures all of these as nested spans, like Jaeger or OpenTelemetry traces but with LLM-specific fields. You see:

  • The full timeline.
  • Which step caused the latency.
  • Which retrieval call returned which docs.
  • Why the model chose tool X (its full reasoning text).
  • Where errors happened.

For agents, traces are non-negotiable. Without them, debugging "the agent did the wrong thing on turn 4" is impossible.
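
A minimal sketch of nested spans using the OpenTelemetry Python SDK: the pipeline steps are trivial stubs, the span names are made up, and the gen_ai.* attributes follow the draft GenAI semantic conventions, so treat all of it as illustrative:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to stdout for the sketch; a real setup exports them to a backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-llm-app")

def retrieve(query: str) -> list[str]:
    return ["doc-1", "doc-2"]          # stand-in for your retriever

def call_llm(prompt: str) -> dict:
    # Stand-in for your model client.
    return {"text": "stub answer", "input_tokens": 120, "output_tokens": 40}

def handle_request(user_query: str) -> str:
    with tracer.start_as_current_span("user_request"):
        with tracer.start_as_current_span("retrieve") as span:
            docs = retrieve(user_query)
            span.set_attribute("retrieval.document_count", len(docs))
        with tracer.start_as_current_span("llm_call.answer") as span:
            span.set_attribute("gen_ai.request.model", "example-model")
            result = call_llm(f"{user_query}\n\nContext: {docs}")
            span.set_attribute("gen_ai.usage.input_tokens", result["input_tokens"])
            span.set_attribute("gen_ai.usage.output_tokens", result["output_tokens"])
            return result["text"]

print(handle_request("What changed in the last release?"))

Swap ConsoleSpanExporter for an OTLP exporter pointed at whichever backend you pick in the next section and the same nested spans show up there.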

Layer 3: evals

Evals are tests for your LLM behavior. Two flavors:

  • Offline evals. A fixed dataset of inputs + expected behaviors. Run on every prompt or model change. Block deploy if regressions appear. Covered in detail in the RAG evaluation post.
  • Online evals. Score real production traffic in near-real-time. LLM-as-judge runs on a sampled X% of conversations and flags low-quality answers for review.

The combination is powerful: offline evals catch regressions from your own deliberate changes before they ship; online evals catch drift in real usage that your eval set doesn't cover.
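
A stripped-down offline eval that can run in CI, where call_model is a stand-in for your own client and the substring check is a deliberately naive pass criterion (swap in exact-match or an LLM-as-judge scorer as your cases demand):

import json
import sys

# Fixed inputs + expected behaviours, kept in version control next to the prompt.
EVAL_SET = [
    {"input": "How do I reset my password?", "must_contain": "reset link"},
    {"input": "What plans do you offer?", "must_contain": "pricing"},
    # ... grow this to ~30 cases
]

def call_model(prompt: str) -> str:
    # Replace with your real LLM client; the canned response keeps the sketch runnable.
    return "Click the reset link we emailed you, or see the pricing page for plans."

def run_evals() -> None:
    failures = []
    for case in EVAL_SET:
        output = call_model(case["input"])
        if case["must_contain"].lower() not in output.lower():
            failures.append({"case": case, "output": output})
    if failures:
        print(json.dumps(failures, indent=2))
        sys.exit(1)        # non-zero exit blocks the deploy in CI
    print(f"all {len(EVAL_SET)} eval cases passed")

if __name__ == "__main__":
    run_evals()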

The 2026 tool landscape

The major players, what they do well, what they don't:

  • Langfuse (open source). Excellent traces, native eval support, self-hostable. Default choice for cost-sensitive teams.
  • LangSmith (LangChain). Best if you're already on LangChain/LangGraph. Tightly integrated, paid.
  • Helicone (open source). Lightweight, easy to drop in via proxy. Good for getting started.
  • Arize Phoenix (open source). Strong on RAG-specific metrics and visualizations.
  • Honeyhive / Galileo / Patronus (commercial). Enterprise focus, more guardrails-as-a-service.
  • OpenTelemetry GenAI semantic conventions. The emerging standard for LLM tracing. Tools above are converging on it.

If you're starting from zero in 2026: install Langfuse self-hosted or Helicone proxy. Both take under an hour. Re-evaluate at scale.

The four signals to alert on

Once you have observability set up, build dashboards and alerts on:

  1. Cost per user / per request. A bug that triples token usage will silently triple your bill. Alert on any single user blowing past 10× the median (a sketch follows this list).
  2. Failure rate. API errors, timeouts, refusals. A spike means something broke (a provider outage, an expired key, a prompt change pushing requests past the context limit).
  3. Latency p95. LLM latency drifts. The same prompt at 2pm UTC and 8pm UTC can differ 3×. p95 over 8 seconds usually means you need streaming.
  4. Quality signal. User thumbs up/down, time-to-resolution, retry rate. Hardest to instrument but most important. If quality drops 10% week-over-week and you don't know it, your product is dying invisibly.
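
For signal 1, the check can be a scheduled query over the Layer 1 log records. A sketch, assuming each record carries (or can be joined to) a user_id; the 10× threshold comes from the rule above and alert() is a placeholder for your pager or Slack hook:

from collections import defaultdict
from statistics import median

def alert(message: str) -> None:
    print("ALERT:", message)           # replace with PagerDuty / Slack / email

def check_cost_per_user(records: list[dict], threshold_multiple: float = 10.0) -> None:
    # Sum spend per user from the structured log records, then flag outliers.
    spend = defaultdict(float)
    for r in records:
        spend[r["user_id"]] += r["cost_usd"]
    if len(spend) < 2:
        return
    typical = median(spend.values())
    if typical == 0:
        return
    for user_id, cost in spend.items():
        if cost > threshold_multiple * typical:
            alert(f"user {user_id} spent ${cost:.2f}, {cost / typical:.1f}x the median")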

What changes when you have observability

Several things become possible that weren't before:

  • Reproduce any bug in seconds. Click a trace, copy the exact request, replay against any model.
  • Compare model upgrades scientifically. Run last month's traffic through GPT-5 and Claude 4.7, see which performs better on your real distribution.
  • Track prompt changes. Tag every prompt with a version and see how each version performed across thousands of conversations (see the sketch after this list).
  • Find the worst 1%. Sort conversations by user thumbs-down or by latency, fix the tail not the average.
  • Cost optimization. See which prompts are bloated, which retrieved docs are overkill, which model tier is overspecced.
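
A sketch of how the version tag pays off, assuming each logged call carries a prompt_version field and a feedback value (+1 / -1) joined back from user events; both are conventions you define yourself, not something the tools impose:

from collections import defaultdict

def thumbs_up_rate_by_prompt_version(records: list[dict]) -> dict[str, float]:
    # Aggregate a crude quality signal (share of thumbs-up) per prompt version.
    ups = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        feedback = r.get("feedback")
        if feedback is None:
            continue                   # most calls never get explicit feedback
        version = r.get("prompt_version", "untagged")
        total[version] += 1
        if feedback > 0:
            ups[version] += 1
    return {version: ups[version] / total[version] for version in total}

# Example output: {"v12": 0.91, "v13": 0.84} -> the v13 rewrite made things worse.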

When NOT to invest yet

  • Solo founder, day 1, no users. Console.log is fine. Spend the time shipping.
  • Prototype phase. Ship the prototype, get 10 users, then add observability. Premature observability is a sunk cost.
  • Single short prompt, no agent, no retrieval. Just log inputs/outputs to a database table. Don't over-tool.

The trigger for serious investment: when you start asking "why did it do that?" more than once a week and can't answer.

A pragmatic starter stack

For a real product with users, in 2026:

  • Langfuse self-hosted for traces and logging (free, open source).
  • 30-question regression eval in version control, run in CI on every prompt change.
  • PostHog or Mixpanel for user-level events (signup, chat_completed, thumbs_down) so quality signals tie back to product metrics.
  • Sentry for actual bugs (TypeErrors, 500s) — not LLM-specific but still needed.
  • A weekly review meeting where someone reads 50 random conversations. The cheapest, highest-signal eval there is.

You can graduate to fancier tools when you've outgrown this.

Further reading

  • Langfuse docs and examples.
  • OpenTelemetry GenAI semantic conventions (draft, still evolving).
  • Honeycomb's blog on observability fundamentals (general, not LLM-specific, but the mental model translates).
  • Look up: LLM-as-judge, online evals, drift detection, prompt versioning.

Last updated: 2026-04-29
