Multi-step agents fail in ways single-prompt LLM apps don't. The model gets the first three steps right and then on step four does something unhinged — calls the wrong tool, hallucinates a parameter, gets stuck repeating the same action. You can't reproduce it by re-running the prompt because the prompt is now different (more history, different intermediate results).
This is the systematic playbook for finding what went wrong, not just shotgunning fixes.
Before you start: get a trace
Everything below assumes you can see the full trace of what happened. If you don't have observability set up (Langfuse, LangSmith, custom logs), stop debugging and add it. Two hours adding trace logging will save twenty hours of "I think it's the prompt? Let me try changing it."
A proper trace shows (a minimal logging sketch follows this list):
- Every LLM call's full input and output
- Every tool call and its result
- The state passed between steps
- Token counts and latency per step
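If you need something today, a rough sketch of that logging using only the standard library is below; log_event, traced_llm_call, and the dict-shaped response are illustrative names and assumptions about your own loop, not any particular library's API.

```python
# Minimal JSONL trace logging: one line per event, enough to reconstruct a run.
import json
import time

TRACE_FILE = "agent_trace.jsonl"

def log_event(run_id, step, kind, payload):
    """Append one trace event; kind is 'llm_call', 'tool_call', or 'state'."""
    with open(TRACE_FILE, "a") as f:
        f.write(json.dumps({"run_id": run_id, "step": step, "kind": kind,
                            "ts": time.time(), **payload}) + "\n")

def traced_llm_call(run_id, step, messages, call_fn):
    """Wrap the raw LLM call so full input/output, usage, and latency are captured."""
    start = time.time()
    response = call_fn(messages)  # your provider call; assumed to return a dict here
    log_event(run_id, step, "llm_call", {
        "input_messages": messages,
        "output": response.get("content"),
        "usage": response.get("usage"),  # token counts, if the provider returns them
        "latency_s": round(time.time() - start, 3),
    })
    return response
```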
The 7 failure modes (in order of frequency)
In my experience debugging agents, every weird behavior reduces to one of these seven patterns. Walk through them in order — the most common ones are first.
1. Context overflow
The most common cause of "agent goes nuts at step 5+." The message history has grown to 80% of the context window, and the model is now ignoring system instructions in favor of recent tool outputs.
Symptoms: Behavior is fine for the first few turns, gets weird around turn 4-6, completely breaks down by turn 8+.
Fix:
- Print the total token count at each turn (one way is sketched after this list).
- If it's growing fast, summarize older turns or drop their tool results.
- Use prompt caching for the system prompt so it isn't reprocessed from scratch every turn (supported by some providers).
- Lower max_model_len and force the agent to be more concise.
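Here is a rough sketch of the first two fixes. The 4-chars-per-token estimate, the budget numbers, and the OpenAI-style message dicts are assumptions; swap in your tokenizer and real limits.

```python
# Per-turn budget enforcement: estimate context size, then stub out old tool results.
def estimate_tokens(messages):
    """Cheap token estimate; replace with your tokenizer for real numbers."""
    return sum(len(str(m.get("content", ""))) for m in messages) // 4

def enforce_budget(messages, max_tokens=100_000, keep_recent=4):
    """If the history is too big, replace older tool results with short stubs."""
    total = estimate_tokens(messages)
    print(f"context estimate: {total} tokens across {len(messages)} messages")
    if total <= max_tokens:
        return messages
    compacted = []
    for i, msg in enumerate(messages):
        is_old = i < len(messages) - keep_recent
        if is_old and msg.get("role") == "tool":
            compacted.append({"role": "tool",
                              "content": f"[tool result at step {i} dropped to save context]"})
        else:
            compacted.append(msg)
    return compacted
```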
2. Stale tool results poisoning the context
The agent did a search early on. The search returned 5000 tokens of garbage. That garbage is now in the context for every subsequent turn, and the model keeps trying to relate new questions to it.
Symptoms: Agent fixates on something irrelevant from earlier in the conversation. Won't let go of a wrong premise.
Fix:
- Truncate tool results to N tokens.
- After a tool result is consumed, replace it in the history with a summary: "[search result for 'X' — see step 2 for full output, summary: Y]."
- Allow the agent to explicitly drop old context with a forget(step_id) tool (sketched below).
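One possible shape for the truncation and forget ideas; the message format and the step_id-as-index convention are illustrative, not a specific framework's API.

```python
# Hypothetical helpers: cap tool output on the way in, let the agent drop stale results later.
def truncate_result(text: str, max_chars: int = 4000) -> str:
    """Hard cap on tool output before it ever enters the history."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + f"\n[truncated, {len(text) - max_chars} chars omitted]"

def forget(messages: list, step_id: int) -> str:
    """Tool the agent can call to blank out a stale result it no longer needs."""
    original_role = messages[step_id].get("role", "tool")
    messages[step_id] = {
        "role": original_role,
        "content": f"[result at step {step_id} forgotten at the agent's request]",
    }
    return f"Dropped step {step_id} from context."
```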
3. Schema drift
The model is supposed to call update_user(id: int, name: str) but it called update_user(user_id: "123", new_name: "Alice"). The agent thinks it succeeded; the tool actually errored or did the wrong thing.
Symptoms: Logs show the tool was called. Database doesn't reflect the change. Or the tool errored but the error wasn't propagated and the agent thinks all is well.
Fix:
- Make tool input schemas strict (use additionalProperties: false, enums, exact field names).
- Validate the model's tool args server-side; return clear errors when validation fails. (A validation sketch follows this list.)
- Make sure tool errors are surfaced back to the agent as tool_result content, not swallowed.
- Consider using structured outputs / strict mode if the provider supports it.
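Here is what the first two fixes can look like with the jsonschema package, mirroring the update_user example above. The tool-result dict shape is an assumption about your own tool runner, not a provider API.

```python
# Server-side validation of tool args against a strict schema.
from jsonschema import ValidationError, validate

UPDATE_USER_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
    },
    "required": ["id", "name"],
    "additionalProperties": False,  # rejects user_id / new_name style drift outright
}

def run_update_user(args: dict) -> dict:
    try:
        validate(instance=args, schema=UPDATE_USER_SCHEMA)
    except ValidationError as e:
        # Surface the failure to the agent as a tool result instead of swallowing it.
        return {"is_error": True, "content": f"ERROR: invalid arguments: {e.message}"}
    # ... perform the real update here ...
    return {"is_error": False, "content": "user updated"}
```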
4. The model thinks it answered when it didn't
The agent's tool returned an error. The error string contained text like "the user's email is empty." The model interprets that text as the answer to your question and stops.
Symptoms: Agent terminates without doing useful work. Output looks plausible but contains text that's clearly from a tool error or fragment.
Fix:
- Format tool errors with an explicit prefix: "ERROR: <reason>", not just <reason>.
- In the system prompt: "If you receive a tool result starting with ERROR, do not include the error text in your final answer. Try a different approach or report the failure."
- Treat unexpected stop_reason values as bugs, not normal completion. (A sketch of both fixes follows the list.)
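Here is that sketch. EXPECTED_STOP_REASONS and the dict-shaped response are assumptions; use your provider's actual field names and values.

```python
# Make tool failures unmistakable, and refuse to treat a cut-off response as done.
EXPECTED_STOP_REASONS = {"end_turn", "tool_use", "stop_sequence"}

def format_tool_result(ok: bool, text: str) -> str:
    """Prefix failures so the model cannot mistake error text for an answer."""
    return text if ok else f"ERROR: {text}"

def check_stop_reason(response: dict) -> None:
    reason = response.get("stop_reason")
    if reason not in EXPECTED_STOP_REASONS:
        # e.g. "max_tokens" means the output was cut off, not that the agent finished.
        raise RuntimeError(f"unexpected stop_reason {reason!r}: treat as a bug, not a final answer")
```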
5. Tool call loops
The agent keeps calling the same tool with slightly different parameters, never converging. Either it can't get the right answer and won't give up, or each tool result confuses it more.
Symptoms: maxSteps is hit. Trace shows 8 search calls with similar queries. Cost climbs.
Fix:
- Detect repeated calls in your loop logic (one detection sketch follows this list). If the same tool is called twice with similar input, intercept and force the agent to try something different.
- Improve tool descriptions: "if this returns no results, the user's data does not exist; do not retry with variations."
- Add a step counter to the system prompt: "You are on step 5 of 10. After step 8, you must give a final answer or admit failure."
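One way to implement the repeated-call check from the first bullet, using a similarity ratio over the serialized arguments. The call/history dict shapes and the 0.9 threshold are assumptions to tune for your loop.

```python
# Detect near-duplicate tool calls before executing them.
import difflib
import json

def is_repeat(call: dict, history: list, threshold: float = 0.9) -> bool:
    """True if `call` closely matches a recent call to the same tool."""
    for prev in history[-5:]:
        if prev["name"] != call["name"]:
            continue
        a = json.dumps(prev["args"], sort_keys=True)
        b = json.dumps(call["args"], sort_keys=True)
        if difflib.SequenceMatcher(None, a, b).ratio() >= threshold:
            return True
    return False

# In the agent loop, instead of executing a repeat, feed back a nudge:
# if is_repeat(call, past_calls):
#     result = ("ERROR: you already tried a nearly identical call. Do not retry "
#               "with small variations; try a different approach or report failure.")
# else:
#     result = execute(call)
```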
6. Reasoning blocks not used / mis-used
For reasoning models (o3, Claude with extended thinking, DeepSeek R1), the model sometimes ignores its own reasoning, contradicts it in the final answer, or thinks-then-loses-track when reasoning gets cut off mid-stream.
Symptoms: Reasoning trace says "the answer is 42" but the final response says 47. Or the model produces 4000 tokens of thinking, then a brief and unrelated final answer.
Fix:
- Increase reasoning token budget if you're hitting the cap.
- Make the final-answer step a separate explicit prompt (sketched after this list): "Given your prior reasoning, what is your final answer in one sentence?"
- For Claude extended thinking: ensure thinking and response token budgets are both adequate.
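A small sketch of that dedicated final-answer pass; call_model is a stand-in for your provider call and the prompt wording is illustrative, not a prescribed recipe.

```python
# Force a second, narrow call so the model cannot drift from its own reasoning.
def final_answer_pass(call_model, reasoning_text: str) -> str:
    prompt = (
        "Here is your prior reasoning:\n\n"
        f"{reasoning_text}\n\n"
        "Given that reasoning, state your final answer in one sentence. "
        "Do not introduce conclusions that the reasoning does not support."
    )
    return call_model(prompt, max_tokens=200)
```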
7. Non-determinism + temperature
You ran the same prompt twice and got different results. One worked, one didn't. You can't reproduce the bug.
Symptoms: Cannot reliably reproduce a failure. The trace from yesterday's bad run is different from today's run.
Fix:
- Set temperature=0 and top_p=1 during debugging. Pin the exact model version.
- Even with temperature 0, modern models aren't fully deterministic due to GPU non-determinism, but they're close enough.
- For real determinism, save the exact responses from a known-bad run to a fixture file. Replay the agent loop using those fixtures instead of live calls (a minimal sketch follows). Now you can step through with a debugger.
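The record-and-replay idea can be as small as this; ReplayClient and its create() method are illustrative stand-ins for whatever client object your loop already uses.

```python
# Replay recorded responses in order instead of calling the API.
import json

class ReplayClient:
    """Deterministic stand-in for the live client during debugging."""
    def __init__(self, fixture_path: str):
        with open(fixture_path) as f:
            self.responses = [json.loads(line) for line in f]
        self.index = 0

    def create(self, **kwargs):
        # Ignore the request and hand back the next recorded response.
        response = self.responses[self.index]
        self.index += 1
        return response

# Recording side, inside the live loop:
# with open("bad_run.jsonl", "a") as f:
#     f.write(json.dumps(response_as_dict) + "\n")
```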
The systematic process
When something goes wrong, work this list in order:
1. Get a clean trace of one failure. Reproducible if possible.
2. Find the exact step where things went wrong. Read each step's input/output. Don't skim.
3. Match it to one of the 7 failure modes above.
4. Form a single hypothesis. "I think it's failure mode 3 because the tool args don't match my schema."
5. Make ONE change. Don't simultaneously change the prompt and the tool schema and the model.
6. Re-run the same input. Did the change fix the symptom?
7. Run on 5 other inputs. Did the change break anything else?
8. If still broken, return to step 4 with a different hypothesis.
The failure mode here is changing six things at once and not knowing which one fixed it (or which one broke something new).
Anti-pattern: changing the model
When an agent misbehaves, the temptation is to swap models. "Let me try Claude 4.7 instead of Sonnet 4.6." Sometimes this works. Often it papers over the underlying bug — the agent is still fragile, you just got lucky with this model. The next prompt edit will resurface it.
Figure out why the agent failed first. Then decide if the right fix is a stronger model.
A debugging tip that helps a lot
When the agent fails, copy the entire trace into Claude or GPT-5 and ask: "This agent is supposed to do X. It failed at step N with output Y. Walk through the trace and tell me where the model went wrong and why."
LLMs are surprisingly good at this. They can read a 5000-token trace and pinpoint the step that triggered the cascade. Use them as your first reviewer before doing the human walk-through.
When NOT to debug
If you're 3+ hours in on the same bug, stop debugging and reconsider the architecture. Recurring weird behavior in agents is often a sign that the task is too open-ended for the model and decomposing it would help more than another prompt edit.
Specifically: if the agent has more than 5 distinct tools and an open task definition, consider splitting it into a planner agent (decides which tool sequence) and an executor agent (calls the tools). Smaller decision spaces fail less.
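A rough sketch of what that split can look like; call_model, the JSON plan format, and the tools dict are assumptions for illustration, not a prescribed framework.

```python
# Minimal planner/executor split: the planner chooses, the executor runs.
import json

def plan(call_model, task: str, tool_catalog: str) -> list[dict]:
    """Planner: chooses a short tool sequence, but never executes anything."""
    prompt = (
        f"Task: {task}\n\nAvailable tools:\n{tool_catalog}\n\n"
        'Return a JSON list of steps, each of the form {"tool": "<name>", "args": {...}}. '
        "Use at most 6 steps."
    )
    return json.loads(call_model(prompt))

def execute(plan_steps: list[dict], tools: dict) -> list:
    """Executor: runs the chosen tools in order, with no open-ended decisions."""
    return [tools[step["tool"]](**step["args"]) for step in plan_steps]
```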
Further reading
- The build-agent-loop-from-scratch post in this Learn library.
- LangSmith and Langfuse trace examples.
- Building effective agents — Anthropic, December 2024.
- Look up: chain-of-thought failure modes, agent decomposition, structured outputs, eval-driven prompt iteration.