Speech-to-text (STT/ASR)

Converting spoken audio into text, also called Automatic Speech Recognition (ASR). The most widely used open model is OpenAI's Whisper.

Speech-to-text (STT), also called Automatic Speech Recognition (ASR), is the task of converting spoken audio into written text. Modern systems handle multiple languages, produce time-stamped output with reasonable punctuation, and can tell speakers apart (speaker diarization, sometimes as a separate step). Many run in near real time on consumer hardware.

It matters because audio is everywhere: meeting recordings, podcast interviews, customer support calls, voice messages, lectures, dictation. STT turns all of these into text that can be searched, summarized, translated, indexed, or analyzed. The combination of accurate STT plus an LLM for summarization is the foundation of most "AI meeting note-taker" products (Otter, Granola, Read, Fireflies, Tactiq).

The game-changer was OpenAI's Whisper (released as open source in 2022): multilingual, robust to accents and background noise, and free to download and run locally. Whisper essentially raised the accuracy floor for the entire industry. Later releases and derivatives have continued to improve speed and quality: Whisper-large-v3 is a newer checkpoint, distil-Whisper is a smaller distilled model, and faster-whisper is a faster reimplementation of Whisper inference.

For Chinese, Whisper handles Mandarin well; specialized models such as Paraformer, available through Alibaba's FunASR toolkit, can be better for some Chinese accents and noisy domains. Real-time streaming STT for live captioning uses its own model variants.

Related: text-to-speech, multi-modal, Whisper, ASR.
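
As a rough illustration of running Whisper locally and getting time-stamped text out, here is a minimal Python sketch. It assumes the open-source `openai-whisper` package and `ffmpeg` are installed. The helper names `format_timestamp`, `format_segments`, and `transcribe_file`, and the choice of the small "base" model, are illustrative assumptions, not part of Whisper itself.

```python
from typing import Dict, Iterable, List


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(segments: Iterable[Dict]) -> List[str]:
    """Turn segments with 'start', 'end', 'text' keys into readable lines."""
    return [
        f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in segments
    ]


def transcribe_file(path: str, model_name: str = "base") -> List[str]:
    """Transcribe an audio file with Whisper and return time-stamped lines.

    Hypothetical helper: imports whisper lazily so the formatting functions
    above can be used without the package installed.
    """
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model(model_name)  # downloads weights on first use
    result = model.transcribe(path)         # dict with "text" and "segments"
    return format_segments(result["segments"])
```

Calling `transcribe_file("meeting.mp3")` (a hypothetical file name) would return one line per recognized segment, which can then be searched, indexed, or passed to an LLM for summarization. Larger model names generally trade speed for accuracy.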