It's 11pm, your error tracker is lighting up, and you're staring at a stack trace from a service you haven't touched in eight months. There's a culture of LLM use in coding tutorials that assumes you have time to chat with the model, refine your prompt, iterate. You don't. You have a customer waiting and a Sentry alert that won't stop pinging.
This article is about real production debugging with an LLM, not greenfield coding. The patterns are different. Speed matters more than elegance. The goal is 'understand → fix → verify → ship', not 'write the most beautiful code'.
The single most useful workflow: feed it the trace
The biggest leverage move is also the most underused: paste the actual stack trace, the actual error message, and the actual relevant code into the LLM in one shot. Not 'I'm getting a TypeError can you help' — the literal trace, all of it, plus the file the trace points to.
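To make that concrete, here's a minimal sketch of the kind of failure this works on; the module, function, and payload are invented for illustration. You'd paste the real traceback verbatim, plus the function it points at, and ask for the bug and the minimal fix.

```python
# invoices.py -- a hypothetical service module the stack trace points to
def build_invoice(order: dict) -> dict:
    line_items = order.get("line_items")                # None when the key is missing
    total = sum(item["amount"] for item in line_items)  # the line the trace points at
    return {"order_id": order["id"], "total": total}

# The failing production input, reconstructed from the error-tracker event:
order = {"id": "ord_123"}   # no "line_items" key at all
build_invoice(order)
# -> TypeError: 'NoneType' object is not iterable  (paste the whole traceback, verbatim)
```

With the trace and this function in one message, the model can point straight at the missing-key case; with only 'I'm getting a TypeError', it has to guess.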
Claude Sonnet 4.5/4.6 and GPT-5 will reliably identify the bug from this in 60-80% of common production cases (off-by-one, null handling, type confusion, race condition hints, async timing, regex bug). Even when they don't fix it, they narrow your search space dramatically.
The key insight: an LLM is much better at reading a stack trace than at reading your project. So the more concrete context you give it about the failure (logs, traces, error text), the better. The less context you give, the more it makes up plausible-sounding nonsense.
The four kinds of production bugs and how LLMs handle each
Pure logic bugs (off-by-one, null deref, type confusion). LLMs find these fast. Paste the function plus the failing input and the error. Solved within minutes most of the time.
Distributed / concurrency bugs (race conditions, deadlocks, eventually-consistent issues). LLMs are mediocre here. They can suggest patterns to look for but rarely diagnose the actual cause from your specific code without runtime instrumentation. Use them to generate hypotheses, then test each one. Don't trust the first 'this looks like the issue' answer.
Performance bugs (slow query, N+1, memory leak). LLMs handle these well IF you give them profiler output, slow query log, or memory snapshot. They handle them badly if you just say 'this is slow'. Always paste numbers.
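If you don't have numbers yet, they're cheap to get. A minimal sketch using Python's built-in profiler; handle_request and the payload are stand-ins for whatever code path is actually slow in your service. The printed table is what you paste, not a description of it.

```python
import cProfile
import pstats

def handle_request(payload):
    # Placeholder for the endpoint or job that is slow in your service.
    return sorted(payload)

payload = list(range(1_000_000, 0, -1))

# Capture real numbers for the suspect code path...
cProfile.run("handle_request(payload)", "slow_endpoint.prof")

# ...then print the most expensive calls by cumulative time. Paste this
# output into the LLM, along with the slow-query log if there is one.
pstats.Stats("slow_endpoint.prof").sort_stats("cumulative").print_stats(15)
```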
Environment bugs (works locally, breaks in prod). LLMs are weak here because the difference is usually in something they can't see — your specific Docker image, your env vars, your DNS, your IAM policy. Use them to enumerate likely causes ('what could differ between local and prod for an S3 SignatureDoesNotMatch error?'), then check each.
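One way to turn 'check each' into something systematic is to dump the same small snapshot in both environments and diff the output. A sketch, assuming a Python service; the environment variable names are examples, not anything your stack necessarily uses.

```python
import json
import os
import platform
import sys

# Print the same snapshot locally and in prod, then diff the two outputs.
snapshot = {
    "python": sys.version,
    "platform": platform.platform(),
    "tz": os.environ.get("TZ"),
    "env_present": {  # presence only -- never print secret values
        name: name in os.environ
        for name in ("DATABASE_URL", "AWS_REGION", "S3_BUCKET", "FEATURE_FLAGS")
    },
}
print(json.dumps(snapshot, indent=2))
```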
When the LLM is wrong, it's wrong with confidence
This is the most important lesson for production use. An LLM debugging your code will sometimes confidently identify a bug that isn't the actual bug. It will write a fix that looks plausible. You'll deploy it. The error will keep happening because the real bug was somewhere else.
Mitigation: always reproduce the bug locally or in staging before deploying the fix. The fix is only verified when the broken case becomes the working case in your repro. 'It looked right' is not verification.
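In practice, verification can be as small as a regression test built from the failing input. A sketch with pytest, reusing the invented build_invoice example from above; the assertion assumes the fix treats a missing line_items key as an empty order, so the test fails the same way production does before the fix and passes after it.

```python
# test_invoices.py -- regression test built from the real failing input
from invoices import build_invoice

def test_build_invoice_without_line_items():
    order = {"id": "ord_123"}        # the exact payload from the error tracker
    invoice = build_invoice(order)   # raises TypeError before the fix
    assert invoice["total"] == 0     # passes only once the fix actually works
```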
If you can't repro locally (intermittent / data-dependent / load-dependent bugs), add observability before fixing. Log the inputs to the failing function. Add a structured event you can grep in your logs. Wait for it to fail again, get the data, then fix with that data in hand. Skip this step and you'll fix three wrong things before the right one.
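A minimal sketch of that 'add observability first' step, using the standard library's logging module; apply_discount and its fields are placeholders for whatever function is actually failing intermittently.

```python
import json
import logging

logger = logging.getLogger(__name__)

def apply_discount(order: dict, coupon: dict) -> float:
    # Temporary instrumentation: one structured, greppable event per call.
    # Log keys and shapes rather than raw values if the payload is sensitive.
    logger.warning("incident_debug apply_discount %s", json.dumps({
        "order_id": order.get("id"),
        "coupon_code": coupon.get("code"),
        "coupon_keys": sorted(coupon.keys()),
        "item_count": len(order.get("line_items") or []),
    }))

    # ...the existing, intermittently failing logic stays unchanged below...
    return order["total"] - coupon["amount"]
```

When it fails again, grepping for incident_debug gives you the exact inputs to hand back to the model alongside the trace.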
Prompt patterns that pay off in incidents
'Suspect the simplest first.' Add this to your prompt. LLMs default to suggesting elaborate root causes. Telling them to suspect the simplest cause first (typo, recent deploy, config) gets you to ground truth faster.
'Show me three possibilities, not one.' When the cause is unclear, asking for three differential hypotheses with severity ranking is more useful than asking for 'the answer'. You want a list to check, not a confident wrong claim.
'Don't write the fix yet — just explain what's happening.' A pattern borrowed from how senior engineers debug: separate diagnosis from fix. Tell the LLM to walk through the call path and explain the failure first. Often you'll spot the real issue before it does.
'What logs would you add to confirm this hypothesis?' Even better than a fix is a way to verify before fixing. Have the model suggest the minimum logging or instrumentation that would prove or disprove its theory.
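Put together, an incident prompt using those four patterns might look like this; everything in brackets is whatever you actually pasted.

```
Here is the full stack trace, the function it points to, and the last 50 lines
of logs before the failure: [paste all three, verbatim].

Don't write a fix yet. Walk through the call path and explain what's happening.
Suspect the simplest cause first (typo, recent deploy, config change).
Give me three ranked hypotheses for the root cause, and for each one, the
minimum logging or check that would confirm or rule it out.
```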
Tools that actually help vs. tools that just claim to
Cursor and Claude Code with your repo loaded are good at 'find the call site' / 'where else is this function used' kinds of questions. They save real time vs. grep when the codebase is large.
Sentry's AI suggestions and similar error-tracker AI features are mid. They tend to suggest defensive code that doesn't fix the root cause. Treat them as a 'hint to investigate', not an 'apply this fix'.
GitHub Copilot Chat in your IDE is fine for the 'fix this small thing' moment but doesn't have enough context for production-level debugging unless you paste the trace in.
Pure terminal LLM CLI tools (Aider, Claude Code, Codex CLI) are surprisingly strong because you can pipe logs, traces, and command output directly. kubectl logs ... | claude is a power move during incidents.
When NOT to use an LLM for production debugging
Don't use it under genuine time pressure if you don't already know how to debug without it. If a customer is screaming and you've never debugged this kind of issue, the LLM might lead you down a wrong path that you can't evaluate. Defer to your senior on-call.
Don't use it for security incidents unless it's a sandboxed model your org has approved. Pasting auth tokens, customer data, or sensitive logs into a third-party API during an incident is a separate compliance event that hurts more than it helps.
Don't use it for the post-mortem. The post-incident write-up needs to be your honest reasoning, not a model's plausible reconstruction. Use it to summarize logs or timeline, not to draft 'why this happened'.
Further reading
- How to pick a coding agent — Cursor vs Claude Code vs others, with debugging specifically in mind
- LLM observability — building observability into your LLM-powered systems so debugging future incidents is easier
- Hallucination — what's happening when the model confidently states a wrong cause