Converting written text into spoken audio — modern neural TTS systems (ElevenLabs, OpenAI TTS, Google) produce near-human-quality voices that can clone, emote, and speak many languages.
Text-to-speech (TTS) is the task of producing spoken audio from written text. Old TTS systems sounded robotic; modern neural TTS — driven by transformer and diffusion architectures — produces voices that are often indistinguishable from human speech, with control over emotion, pace, and language.
It matters because TTS unlocks many audio-first product categories: audiobook narration, podcast generation, accessibility tools (screen readers for the visually impaired), voice-overs for video, language-learning apps, AI agents that speak (voice mode in ChatGPT and Gemini), and IVR / phone-tree replacements. For content creators, TTS makes it possible to produce voiced content at scale without recording-studio sessions.
Leading providers: ElevenLabs (consumer-quality voice cloning, dominant in the podcast/audiobook market), OpenAI TTS (built into the API, multiple voices), Google's WaveNet and its successors, Amazon Polly, and Azure TTS. Open-source options like XTTS-v2 and OpenVoice support voice cloning from a short sample. For Chinese specifically, natively trained Chinese TTS models (from ByteDance, Tencent, and ElevenLabs) typically outperform models that added Chinese as an afterthought.
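To make the API route concrete, here is a minimal sketch of calling OpenAI's TTS endpoint (`/v1/audio/speech`) with only the standard library. The `tts-1` model and `alloy` voice names reflect OpenAI's documented defaults at the time of writing; `build_tts_request` and `synthesize` are hypothetical helper names, not part of any SDK, and you should check the current API reference before relying on the payload shape.

```python
import json
import os
import urllib.request

def build_tts_request(text, model="tts-1", voice="alloy", fmt="mp3"):
    """Build the JSON payload for OpenAI's /v1/audio/speech endpoint.

    Parameter names follow OpenAI's documented schema; verify against
    the current API reference before use.
    """
    return {"model": model, "voice": voice, "input": text, "response_format": fmt}

def synthesize(text, out_path="speech.mp3"):
    """POST the payload and write the returned audio bytes to disk.

    Requires the OPENAI_API_KEY environment variable and network access.
    """
    payload = build_tts_request(text)
    req = urllib.request.Request(
        "https://api.openai.com/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

Most hosted providers (ElevenLabs, Amazon Polly, Azure TTS) follow the same pattern: text plus a voice/model identifier in, compressed audio bytes out.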
Voice cloning is the most controversial area — high-quality cloning from 10-30 seconds of audio raises real fraud and impersonation concerns. ElevenLabs and OpenAI both implement consent verification and watermarking for cloned voices. Related: speech-to-text, multi-modal, voice cloning.