OpenAI announces GPT-5.1 with native multimodal video understanding

GPT-5.1's video tower lets it analyze hour-long footage without external chunking — a real step beyond prior frame-sampling hacks.

Published: 2026-04-24 · Deep dive

OpenAI introduced GPT-5.1 today, an incremental update to the GPT-5 family it released last fall. The headline change is a native video-understanding tower trained jointly with the language model: GPT-5.1 can ingest raw video files up to an hour long without the application stack doing frame extraction or scene segmentation first.
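
For context on what that "external chunking" looks like in practice, here is a minimal sketch of the frame-sampling pipeline native ingestion is meant to replace. It uses OpenCV; the file name lecture.mp4 and the one-frame-every-five-seconds interval are illustrative assumptions, not anything from OpenAI's docs.

```python
# Sketch of the frame-sampling workaround that a native video tower obviates.
# Assumptions: OpenCV is installed (pip install opencv-python); "lecture.mp4"
# and the 5-second sampling interval are illustrative, not from any OpenAI API.
import cv2

def sample_frames(video_path: str, every_n_seconds: float = 5.0) -> list:
    """Pull one frame every N seconds; each frame would then be sent
    to the model as a separate image input."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(1, round(fps * every_n_seconds))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

# An hour of footage at one frame per 5 seconds is ~720 images to encode,
# batch, and stitch back together -- the bookkeeping GPT-5.1 skips.
frames = sample_frames("lecture.mp4")
print(f"sampled {len(frames)} frames")
```

Sampling like this also discards fine-grained timing between frames, which is why temporal questions of the "what changed between minute 12 and minute 38?" variety are hard to answer well with frame hacks.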

In the demos, GPT-5.1 answered grounded questions about a 47-minute lecture recording (correctly identifying when the speaker contradicted an earlier claim) and parsed a 30-minute surgical recording for a medical-review use case. Latency for video inputs is roughly 12x that of text-only requests, which OpenAI says will improve as it rolls out optimized inference paths.

Pricing isn't published yet — the model is in restricted preview for video features, with text-only access on a waitlist. The competitive context: Gemini 2.5 Pro has had hour-long video for a while, but GPT-5.1 reportedly handles temporal reasoning ("what changed between minute 12 and minute 38?") notably better. Worth tracking when the bake-offs land.

Tags

openai · gpt-5 · multimodal · video
