
What is an AI agent? And how is it different from a chatbot?

An agent is an LLM that can take actions: click links, run code, query APIs, then check its own work and try again. The "try again" part is what makes it both powerful and unstable.

If a chatbot answers your question, an agent does your task. It reads, decides, takes an action, observes the result, decides again, and keeps going until the job is done. That's the core idea. Everything else — frameworks, memory systems, multi-agent setups — is detail.

The minimal agent loop

Stripped to its essence, every AI agent runs the same loop:

while not done:
    1. Look at current state + goal
    2. Decide on next action (using an LLM)
    3. Execute action (call a tool, API, code)
    4. Observe result
    5. Update state

The LLM is the brain of step 2 — it picks what to do next given what's happened so far. The "tools" in step 3 are real: send a Slack message, query a database, fetch a web page, run a Python snippet, write a file. Modern LLMs from Anthropic, OpenAI, and Google all support tool use (also called function calling) — you describe a tool's input schema, the model decides when to call it.
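Concretely, "describe a tool's input schema" means handing the model a structured description it can read. The exact envelope differs between providers, but the general shape is a name, a description, and a JSON Schema for the arguments. Here is a hypothetical sketch for a `search_web` tool, not any one vendor's exact format:

```python
# Hypothetical tool description in the general shape used by
# function-calling APIs (exact field names vary by provider).
search_web_tool = {
    "name": "search_web",
    "description": "Search the web and return a list of result titles and URLs.",
    "input_schema": {  # JSON Schema for the tool's arguments
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The search query, e.g. 'Mistral AI'",
            }
        },
        "required": ["query"],
    },
}
```

The model never executes anything itself: it emits a call such as `{"name": "search_web", "input": {"query": "Mistral AI"}}`, and your code runs the real function and feeds the result back into the conversation.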

A simple example: an agent that researches a competitor for you.

  1. Goal: "Write a one-page brief on Mistral."
  2. LLM decides: "I should search the web first." Calls search_web("Mistral AI").
  3. Gets back 10 results. LLM picks the most useful 3 and calls fetch_page(url) on each.
  4. Reads the pages. LLM decides: "I have enough; write the brief." Generates the markdown.
  5. Done.

That's the whole pattern. The same skeleton scales up to coding agents (Cursor's composer, Claude Code), customer service agents, research agents, and so on.
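The five-step trace above can be sketched as a plain script. Everything here is a hypothetical stand-in: `search_web` and `fetch_page` return canned data and the LLM's decisions are hard-coded, so only the control flow is real:

```python
# Stub tools standing in for real web access.
def search_web(query):
    # Pretend search: return 10 made-up result URLs.
    return [f"https://example.com/{query.replace(' ', '-')}/{i}" for i in range(10)]

def fetch_page(url):
    # Pretend fetch: return placeholder page text.
    return f"Contents of {url}"

def research_agent(goal):
    """Mimics the research trace: search, fetch the top pages, write the brief."""
    results = search_web(goal)                    # step 2: search the web
    pages = [fetch_page(u) for u in results[:3]]  # step 3: fetch the best 3
    # Step 4: a real agent would have the LLM synthesize; here we just join.
    return f"# Brief on {goal}\n\n" + "\n".join(pages)

brief = research_agent("Mistral AI")
```

In a real agent the "pick the most useful 3" and "I have enough; write the brief" decisions are made by the model each turn; the skeleton around them stays this simple.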

What makes agents different from chatbots

A chatbot is a single LLM call. You ask, it answers, end of transaction.

An agent has three things a chatbot doesn't:

Tools. It can do, not just say. Cursor can edit your files. Claude Code can run shell commands. Operator can click web pages. The set of available tools defines what the agent is.

A loop. The agent decides when it's done. A chatbot does one turn; an agent might do 50 turns to complete a task. Each turn it reads what happened, picks a next action, and executes.

Memory across steps. The agent's growing context — "here's what I tried, here's what worked" — feeds back into each new decision. This is why agents need careful prompt and context management; without it, they forget what they already tried and loop forever.

Why agents are flaky in 2026

Despite a year of breathless press, most agents still fail in production. The core issue: errors compound.

If each step works 95% of the time, a 10-step task succeeds only ~60% of the time (0.95^10 ≈ 0.60), and a 50-step task only ~8% (0.95^50 ≈ 0.08). Agents that operate over many steps need either much higher per-step reliability or a way to detect and recover from errors.
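The compounding math is easy to verify yourself (the step counts and the 95% figure are illustrative, and the model assumes steps fail independently):

```python
# Success probability of an n-step task when each step independently
# succeeds with probability p.
def task_success(p, n):
    return p ** n

print(round(task_success(0.95, 10), 3))  # 0.599: a 10-step task succeeds ~60% of the time
print(round(task_success(0.95, 50), 3))  # 0.077: a 50-step task succeeds ~8% of the time
print(round(task_success(0.99, 50), 3))  # 0.605: per-step reliability matters enormously
```

Note the last line: raising per-step reliability from 95% to 99% takes a 50-step task from nearly hopeless to a coin flip.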

Real agents use defenses:

  • Verification steps — after taking an action, check if it actually worked.
  • Step limits — cap the loop at, say, 20 iterations to avoid runaway costs and infinite loops.
  • Human-in-the-loop — for irreversible actions (sending emails, spending money, deleting data), pause and confirm.
  • Specialized tooling — domain-specific tools (e.g., Cursor's file-edit primitives) reduce surface area for errors.
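These defenses compose naturally inside the loop itself. A hypothetical sketch, with stubbed `decide`, `execute`, `verify`, and `confirm` callables standing in for the real LLM, checks, and approval UI:

```python
# Sketch of an agent loop with the defenses wired in. All of the helper
# callables are hypothetical stand-ins.
MAX_STEPS = 20
IRREVERSIBLE = {"send_email", "delete_file", "spend_money"}

def run_defended(goal, decide, execute, verify, confirm):
    state = {"goal": goal, "history": []}
    for _ in range(MAX_STEPS):                    # step limit
        action, payload = decide(state)
        if action == "finish":
            return payload
        if action in IRREVERSIBLE and not confirm(action, payload):
            state["history"].append((action, "blocked by human"))
            continue                              # human-in-the-loop gate
        result = execute(action, payload)
        if not verify(action, result):            # verification step
            state["history"].append((action, "failed, retrying"))
            continue
        state["history"].append((action, result))
    raise RuntimeError("step limit reached without finishing")

# Demo run with stubs: one irreversible action, blocked by the human.
def _decide(state):
    return ("finish", "done") if state["history"] else ("send_email", "draft")

sent = []
result = run_defended(
    "notify customer",
    decide=_decide,
    execute=lambda a, p: sent.append(p) or "sent",
    verify=lambda a, r: r == "sent",
    confirm=lambda a, p: False,                   # human says no
)
```

In the demo the email is never sent: the confirmation gate blocks it, the agent records the refusal in its history, and the next decision finishes the task.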

Real agent products in 2026

The ones actually shipping useful work:

  • Coding agents — Cursor's composer, Claude Code, Windsurf. They edit code in your repo, run tests, iterate. The most mature category.
  • Web operators — OpenAI's Operator, Anthropic's computer use. Browse, click, fill forms. Useful for repetitive web tasks but still slow and breakable.
  • Research agents — Perplexity Pro's deep research, Claude's research mode, Gemini Deep Research. Search, read, synthesize multi-source briefs.
  • Vertical agents — Sales (Clay, Outreach), recruiting (Mercor), customer support (Decagon, Sierra). Constrained to a narrow domain, which is why they work.
  • Personal-task agents — Booking, scheduling, simple errands. Mostly demos, not yet reliable.

The pattern: narrow, well-tooled agents work. Wide-open "do anything" agents don't.

When NOT to use an agent

  • The task is one prompt. If you can write the answer with a single LLM call, do it. Adding an agent loop just costs more and creates more failure modes.
  • Errors are unrecoverable. Don't let an agent send emails, buy things, delete files, or change production systems without explicit human approval per action.
  • You don't have evals. If you can't measure whether the agent did the task correctly, you can't improve it. Build a small eval set first, even 20 examples.
  • The cost-per-task doesn't make sense. Agents that loop 30 times can cost $0.50-$5 per task. Make sure that's worth less than the alternative (human, simpler tool, or just doing the task once).
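A useful first eval set really can be that small. A hypothetical sketch: each case pairs a task with a predicate that judges the agent's output, and `agent` is whatever callable you are testing:

```python
# Minimal eval harness: each case pairs a task with a check on the output.
def run_evals(agent, cases):
    passed = sum(1 for task, check in cases if check(agent(task)))
    return passed / len(cases)  # pass rate between 0.0 and 1.0

# Example with a trivial stand-in agent.
cases = [
    ("2+2", lambda out: "4" in out),
    ("capital of France", lambda out: "Paris" in out),
]
toy_agent = lambda task: {"2+2": "The answer is 4",
                          "capital of France": "Paris"}.get(task, "")

score = run_evals(toy_agent, cases)  # real agents will score below 1.0
```

Predicates keep the checks cheap; for fuzzier tasks you can swap them for an LLM-as-judge later, but a pass rate over 20 hand-written cases is enough to tell whether a prompt change helped.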

Where to start

If you want to build one, start with:

  1. A clear, narrow task ("draft a sales follow-up given this CRM record").
  2. A small, well-defined toolset (3-5 tools, schemas).
  3. A step limit (10 iterations max).
  4. Logging at every step so you can debug failures.
  5. A human-in-the-loop confirmation for any external action.

Frameworks like LangGraph, Mastra, and CrewAI give you scaffolding, but a plain Python loop calling Claude's tool-use API is often clearer for the first version.

Further reading

  • What is tool use / function calling
  • What is MCP (Model Context Protocol)
  • Build an agent loop from scratch (no framework)
  • Debug a multi-step agent that's behaving weirdly
  • How to pick an agent framework (LangGraph vs CrewAI vs Mastra)

Last updated: 2026-04-29
