TODAY
DeepSeek Quietly Ships Standalone Vision Model with Fast OCR
DeepSeek launched a new vision model that appears to be a separate multimodal model rather than vision tacked onto V4. It handles OCR with Markdown formatting, reconstructs HTML from webpage screenshots, and performs spatial reasoning, spot-the-difference, and pattern recognition. Non-thinking mode responds nearly instantly but can hallucinate details; thinking mode lifts accuracy but can take 4+ minutes. Notably, this shipped well ahead of V4's stated multimodal roadmap. Takeaway: DeepSeek is moving faster than the field expected, giving Chinese-language users more open-source and on-device multimodal options alongside closed-source offerings.
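If the model is exposed through an OpenAI-compatible chat endpoint, as DeepSeek's text models are, an OCR-to-Markdown request might look like the sketch below. The model identifier `deepseek-vision` is hypothetical; DeepSeek has not published one for this model, and the prompt wording is illustrative. The snippet only constructs the request payload and does not call any API.

```python
import base64
import json

# Hypothetical model name -- not a published DeepSeek identifier.
MODEL = "deepseek-vision"

def build_ocr_request(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build an OpenAI-compatible chat payload asking a vision model
    to transcribe an image into Markdown."""
    # Images are commonly passed as base64 data URLs in OpenAI-style APIs.
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode()
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Transcribe all text in this image as "
                                "Markdown, preserving headings and tables.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": data_url},
                    },
                ],
            }
        ],
    }

# Placeholder bytes stand in for a real screenshot.
payload = build_ocr_request(b"\x89PNG placeholder")
print(json.dumps(payload, indent=2)[:80])
```

The payload would then be POSTed to the provider's chat-completions endpoint; thinking vs. non-thinking behavior would presumably be selected server-side or via a model variant, which the announcement does not specify.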
Published: 2026-05-03
Tags
deepseek, multimodal, ocr, vision, open-source