Multi-modal

An AI system that can process or generate multiple types of input/output — text plus images, audio, video — instead of just one modality.

Multi-modal AI handles more than one type of data. A pure-text LLM is unimodal. A model that takes images plus text and reasons across them ("explain this chart", "is this MRI normal?", "transcribe the text in this photo") is multi-modal. The frontier today extends to audio (Whisper, GPT-4o voice), video (Gemini, Sora), and any-to-any generation.

It matters because most real-world tasks aren't text-only. Customer support involves screenshots, medical work involves images, engineering involves diagrams. A multi-modal model can answer "what's wrong with this PCB?" by looking at the photo, instead of asking the user to describe it in words. Whole product categories (visual search, screen-reading agents, accessibility tools) depend on it.

A concrete example: Claude 3, GPT-4o, and Gemini all accept image input. Paste a screenshot of a SQL error and ask for a fix; the model reads the error message, looks at any visible code, and suggests changes. Or send a hand-drawn UI sketch and get back HTML/CSS that matches. (An example API call is sketched below.)

Under the hood, most current models bolt a vision encoder onto an LLM and project the image features into the same embedding space as text tokens; natively multi-modal architectures, which Gemini and GPT-4o claim to be, train all modalities together from the start. Generation across modalities (text-to-image with DALL-E, text-to-video with Sora, voice synthesis) typically relies on separate diffusion or other generative models. (A toy sketch of the bolt-on wiring also follows below.)

Related: vision-language model, CLIP, image generation, text-to-speech.
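
Example: image input through a chat API. This is a minimal sketch using the Anthropic Python SDK; the model name, file name, and prompt are placeholders, and other providers' request shapes differ.

    import base64
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Placeholder file: a screenshot of a SQL error
    with open("sql_error.png", "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # placeholder; use a current vision-capable model
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": image_data}},
                {"type": "text",
                 "text": "This screenshot shows a SQL error. What's wrong, and how do I fix it?"},
            ],
        }],
    )
    print(message.content[0].text)

The image travels as one content block and the question as another; the model reasons over both in a single turn.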
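
Example: the bolt-on architecture. A toy, runnable sketch of LLaVA-style wiring: a vision encoder produces patch features, a linear projector maps them into the LLM's embedding space, and the LLM consumes them as if they were extra text tokens. Every dimension and module name here is an illustrative assumption, and the stand-in transformer is bidirectional where a real LLM would be a causal decoder.

    import torch
    import torch.nn as nn

    VISION_DIM, LLM_DIM, VOCAB = 64, 128, 1000  # toy sizes

    class ToyVisionEncoder(nn.Module):
        # Stand-in for a ViT: pretend each image yields 16 patch embeddings
        def forward(self, pixel_values):
            return torch.randn(pixel_values.shape[0], 16, VISION_DIM)

    class ToyLLM(nn.Module):
        # Stand-in for a decoder-only LLM (no causal mask, for brevity)
        def __init__(self):
            super().__init__()
            self.embed_tokens = nn.Embedding(VOCAB, LLM_DIM)
            self.blocks = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=LLM_DIM, nhead=4, batch_first=True),
                num_layers=2)
            self.lm_head = nn.Linear(LLM_DIM, VOCAB)

        def forward(self, inputs_embeds):
            return self.lm_head(self.blocks(inputs_embeds))

    class VisionLanguageModel(nn.Module):
        # Bolt-on pattern: vision encoder -> linear projection -> the LLM
        # sees image features as if they were extra text tokens.
        def __init__(self):
            super().__init__()
            self.vision_encoder = ToyVisionEncoder()
            self.projector = nn.Linear(VISION_DIM, LLM_DIM)  # into the text embedding space
            self.llm = ToyLLM()

        def forward(self, pixel_values, input_ids):
            image_tokens = self.projector(self.vision_encoder(pixel_values))  # (B, 16, LLM_DIM)
            text_tokens = self.llm.embed_tokens(input_ids)                    # (B, T, LLM_DIM)
            # One sequence: attention runs across image and text positions alike
            return self.llm(torch.cat([image_tokens, text_tokens], dim=1))

    model = VisionLanguageModel()
    logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, VOCAB, (1, 8)))
    print(logits.shape)  # torch.Size([1, 24, 1000])

The projector is the whole trick: once image patches land in the same embedding space as text tokens, the transformer's attention treats them uniformly. Natively multi-modal training skips this after-the-fact projection by learning all modalities together from the start.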

Last updated: 2026-04-29
