Google DeepMind announced Gemini 3 today, the long-rumored successor to the 2.5 family. The headline isn't a single benchmark number — it's a new mode called "Deep Agent" where the model plans, executes, self-evaluates, and resumes work across sessions, with task durations measured in hours-to-days instead of minutes.
The technical paper accompanying the launch describes a continuous tool-use loop with structured memory checkpoints written to a Google Cloud-backed scratchpad. Crucially, the model maintains a running self-model of which sub-goals it has and hasn't accomplished — addressing the "agent forgets the goal halfway through" failure mode that plagues current systems. Google demonstrated a 36-hour data analysis task and a multi-day software refactor, both with human checkpoints rather than full autonomy.
For practitioners the open question is reliability. Anthropic and OpenAI have both published research showing long-horizon agents fail in compounding ways — small mistakes early cascade. Google says Gemini 3 has a learned "verification head" that catches these, but third-party reproduction will take weeks. If the claims hold, this resets what "agent" means as a product category.