How to pick an AI voice / TTS tool: ElevenLabs vs OpenAI vs Cartesia

Text-to-speech in 2026 is solved enough that the real question isn't "is it good enough?" but "which trade-off do you want?" Pick the wrong tool for the job and you'll either pay 5× more than you need to or hit a quality ceiling for a use case the model wasn't built for.

ElevenLabs: the quality leader, the expensive one

ElevenLabs is still the default high-end choice. The voices sound natural, the emotional range is the best in the industry, and the multilingual story (32 languages with one cloned voice in 2026) is unmatched. If you're doing audiobook production, premium podcast voiceover, or character voicing for indie games, ElevenLabs gets you to broadcast quality faster than anyone else.

Voice cloning is where ElevenLabs really pulls ahead. Instant Voice Clone (30 seconds of audio) is good. Professional Voice Clone (an hour of clean audio) is genuinely indistinguishable from the original speaker for most listeners. The Voice Library has thousands of community voices.

The weakness is cost. The Creator tier ($22/mo) gives you about 2 hours of generated audio. Scaling to production volumes — say 100k characters/day for a podcast service — gets expensive fast. Their API pricing per character is 2-3× what newer competitors charge.

OpenAI TTS / Realtime: cheap, fast, integrated

OpenAI's TTS (tts-1, tts-1-hd) and the Realtime API (with voices like Coral, Marin, Cedar) are dramatically cheaper than ElevenLabs and good enough for most workflows. The Realtime API specifically is built for two-way voice conversation: sub-300ms latency, interruption handling, and native audio output without an intermediate text step.

Use OpenAI for: voice agents, IVR replacement, conversational AI products, anything where the requirement is "natural enough" rather than "audiobook quality." The voices are competent but limited (a small fixed set of preset voices in 2026, no cloning).

Weakness: voice diversity is poor. If your product needs a specific character voice or a custom-cloned voice, OpenAI doesn't offer that. Multilingual output is usable, but accents in non-English languages are weaker than ElevenLabs'.
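
For orientation, here is a minimal sketch of a plain (non-realtime) OpenAI TTS request using only the Python standard library. It assumes the POST /v1/audio/speech endpoint with "model", "voice", and "input" fields and an OPENAI_API_KEY environment variable. The helper names and the "alloy" voice are illustrative choices, not something this guide prescribes, so check the current API reference before relying on them.

```python
import json
import os
import urllib.request
from typing import Optional

# Assumed endpoint for OpenAI's text-to-speech API.
SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text: str, model: str = "tts-1", voice: str = "alloy",
                         api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a text-to-speech request."""
    key = api_key or os.environ.get("OPENAI_API_KEY", "")
    body = json.dumps({"model": model, "voice": voice, "input": text}).encode("utf-8")
    return urllib.request.Request(
        SPEECH_URL,
        data=body,
        headers={"Authorization": f"Bearer {key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, path: str, **kwargs) -> None:
    """Send the request and write the returned audio bytes to `path`."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        with open(path, "wb") as out:
            out.write(response.read())
```

Calling synthesize_to_file("Thanks for calling.", "greeting.mp3") would make one billable request. Swapping model="tts-1-hd" is the only change needed to compare the two tiers on the same script.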

Cartesia Sonic: the latency king

Cartesia is the dark horse builders should know about. Sonic 2 streams audio with about 90ms time-to-first-byte, the lowest in the industry. For real-time agents, where every extra 50ms before the voice responds makes the exchange feel more awkward, Cartesia is in a different league. Quality is ElevenLabs-adjacent, voice cloning is supported, and pricing is roughly half of ElevenLabs.

Where Cartesia wins: real-time voice agents (customer support, language tutoring, voice-controlled products). Anywhere a noticeable pause kills the UX. They've also been ahead on streaming voice cloning workflows (clone, then stream output of the cloned voice in real time).
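
Latency claims like these are worth measuring yourself. The sketch below times how long a streaming response takes to deliver its first audio chunk. It works on any Python iterable of byte chunks; fake_stream is a hypothetical stand-in for a provider's streaming client, and none of these names come from a vendor SDK.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_first_chunk(stream: Iterable[bytes],
                        clock: Callable[[], float] = time.perf_counter
                        ) -> Tuple[float, Iterator[bytes]]:
    """Return (seconds until the first chunk, iterator over all chunks)."""
    start = clock()
    chunks = iter(stream)
    first = next(chunks, None)      # blocks until the first chunk arrives
    elapsed = clock() - start
    if first is None:
        raise ValueError("stream produced no audio")

    def replay() -> Iterator[bytes]:
        # Hand back the chunk consumed while timing, then the rest.
        yield first
        yield from chunks

    return elapsed, replay()


def fake_stream(delay_s: float, chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical provider stream: waits, then yields audio chunks."""
    time.sleep(delay_s)
    for i in range(chunks):
        yield f"chunk-{i}".encode("utf-8")
```

With a real client, create the stream inside the timed region (or start the clock just before sending the request) so that network and model time are both counted. Run the measurement many times from the region your users are in; a single sample says little.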

Weakness: smaller voice library, less mature SDKs in some languages, less brand recognition. They're newer, so the workflow polish ElevenLabs has accumulated isn't all there yet.

Chinese-language voice: a different field

Once you need natural-sounding Mandarin, whether Taiwanese Mandarin (台灣國語) or mainland Putonghua (普通話), the Western leaders weaken and the Chinese ones get more interesting:

MiniMax — top-tier Chinese voices, voice cloning works well. Cheap.
Volcano Engine (ByteDance, 字節跳動) — used inside Doubao, very strong on naturalness. API access via volcengine.com.
Tencent Cloud TTS — broad voice library, integrated with WeChat ecosystem.
iFLYTEK — older incumbent, accuracy is high, voices feel slightly dated.

ElevenLabs' Chinese voices are decent, but you can hear the foreign accent. For Chinese-first products, start with MiniMax or Volcano Engine and only consider ElevenLabs if you also need 30 other languages from the same workflow.

Voice cloning ethics and policy

If you're cloning a voice you don't own, you're walking into a legal mess. ElevenLabs requires a verbal consent statement; OpenAI doesn't offer voice cloning at all. Several US states (Tennessee's ELVIS Act, California, New York) have right-of-publicity laws specifically targeting voice. The EU AI Act imposes transparency obligations on deepfakes: AI-generated or manipulated audio has to be disclosed as such.

Your rule of thumb: only clone voices where you have the speaker's explicit, written, dated permission for the specific use case. Don't clone celebrities. Don't clone competitors. Don't clone deceased public figures even when they have no estate to sue you.
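
If cloning is part of an automated pipeline, the rule of thumb can be enforced as a gate before any clone job runs. This is a minimal sketch with hypothetical names (VoiceConsent, may_clone). It checks only that a written, dated record exists for this speaker and this exact use case, and it is no substitute for legal review.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class VoiceConsent:
    speaker: str
    use_case: str                # the specific use the speaker agreed to
    in_writing: bool             # True only if a written record exists
    signed_on: Optional[date]    # None means the record is undated


def may_clone(consent: Optional[VoiceConsent], speaker: str, use_case: str) -> bool:
    """Allow cloning only with explicit, written, dated, use-specific consent."""
    if consent is None:
        return False
    return (
        consent.in_writing
        and consent.signed_on is not None
        and consent.speaker == speaker
        and consent.use_case == use_case
    )
```

The exact-match comparison on use_case is deliberate: consent given for an audiobook does not carry over to an ad campaign.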

When NOT to use AI voice

Live customer-facing applications where empathy or trust matter (crisis hotlines, healthcare conversations, sensitive sales). Even the best AI voice in 2026 is detectable to about half of careful listeners after 30 seconds. If the customer relationship depends on feeling heard, hire humans.

Long-form content where pronunciation correctness matters (medical content, legal content, anything heavy with proper nouns or technical terms). All TTS systems still mispronounce names and rare words. Audiobook production studios still pay humans to QA every page.

Accessibility content for blind users. Dedicated screen readers (NVDA, VoiceOver) are still better optimized for navigation cues than general TTS. Using ElevenLabs as a screen reader is using a sledgehammer to crack a nut.

Cost comparison at scale

For 100k characters of generated audio per month (rough mid-tier pricing in 2026):

ElevenLabs Pro: ~$50-90/mo plus overage
OpenAI tts-1-hd: ~$30/mo
Cartesia Sonic: ~$25-40/mo
MiniMax (Chinese): ~$10-15/mo

At 10× that volume the picture shifts dramatically — ElevenLabs scales to $500+/mo while Cartesia and OpenAI stay sub-$200.
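
To see why the gap widens with volume, it helps to model the two common pricing shapes: a subscription with an included quota plus overage, and plain pay-as-you-go. The sketch below does that arithmetic. Every number in it is a placeholder chosen for round results, not a vendor's actual rate, and the Plan and monthly_cost names are hypothetical; substitute each provider's published pricing before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    monthly_fee: float      # USD, flat subscription price
    included_chars: int     # characters covered by the flat fee
    overage_per_1k: float   # USD per 1,000 characters beyond the quota


def monthly_cost(plan: Plan, chars: int) -> float:
    """Flat fee plus overage for any characters past the included quota."""
    extra = max(0, chars - plan.included_chars)
    return round(plan.monthly_fee + extra / 1000 * plan.overage_per_1k, 2)


# Placeholder rates for illustration only.
subscription = Plan("subscription-style", monthly_fee=20.0,
                    included_chars=100_000, overage_per_1k=0.25)
pay_as_you_go = Plan("pay-as-you-go", monthly_fee=0.0,
                     included_chars=0, overage_per_1k=0.02)

if __name__ == "__main__":
    for chars in (100_000, 1_000_000):
        print(chars, monthly_cost(subscription, chars),
              monthly_cost(pay_as_you_go, chars))
```

With these placeholder rates the subscription is only ten times the pay-as-you-go price at the included quota, but overage dominates once volume grows, which is the pattern described above.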

Decision tree

Audiobook, premium podcast, character voicing: ElevenLabs
Voice agent, real-time, low latency: Cartesia Sonic
Voice products already built on OpenAI's platform: OpenAI Realtime API
Chinese-first content: MiniMax or Volcano Engine
Self-hosted (no API calls): open-source F5-TTS or Coqui XTTS

Most production setups end up using two: a high-end model for hero content (ElevenLabs) plus a fast/cheap model for variable content (OpenAI or Cartesia).
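
In code, the decision tree and the two-model setup reduce to a lookup plus a one-line router. The sketch below encodes the recommendations from this section. The enum, function names, and the choice of Cartesia Sonic as the default for variable content are illustrative assumptions, and the returned strings are labels, not SDK identifiers.

```python
from enum import Enum


class UseCase(Enum):
    AUDIOBOOK = "audiobook"
    PREMIUM_PODCAST = "premium podcast"
    CHARACTER_VOICING = "character voicing"
    REALTIME_AGENT = "real-time voice agent"
    OPENAI_INTEGRATED = "OpenAI-integrated voice product"
    CHINESE_FIRST = "Chinese-first content"
    SELF_HOSTED = "self-hosted"


# Mirrors the decision tree in this section.
RECOMMENDATION = {
    UseCase.AUDIOBOOK: "ElevenLabs",
    UseCase.PREMIUM_PODCAST: "ElevenLabs",
    UseCase.CHARACTER_VOICING: "ElevenLabs",
    UseCase.REALTIME_AGENT: "Cartesia Sonic",
    UseCase.OPENAI_INTEGRATED: "OpenAI Realtime API",
    UseCase.CHINESE_FIRST: "MiniMax or Volcano Engine",
    UseCase.SELF_HOSTED: "F5-TTS or Coqui XTTS",
}


def pick_provider(use_case: UseCase) -> str:
    """Look up the recommended tool for a single use case."""
    return RECOMMENDATION[use_case]


def route(is_hero_content: bool, fast_provider: str = "Cartesia Sonic") -> str:
    """Two-model setup: premium voice for hero content, cheaper voice otherwise."""
    return "ElevenLabs" if is_hero_content else fast_provider
```

Keeping the mapping in one table makes it cheap to revisit when pricing or quality shifts, which in this market happens often.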

Next steps

Read about voice agent architecture (STT → LLM → TTS pipelines vs end-to-end speech models)
Try voice cloning carefully: only with permission, never with a celebrity
Look at real-time SDKs from LiveKit, Pipecat, Daily for voice agent infrastructure
Compare quality directly with the same script — you'll hear differences a feature chart can't show
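
As a starting point for the first item, here is a minimal sketch of the cascaded STT → LLM → TTS shape. The three stages are plain callables so any provider can be dropped in. The fake_* functions are stubs standing in for real services, and all names here are hypothetical rather than taken from LiveKit, Pipecat, or Daily.

```python
from typing import Callable

Transcribe = Callable[[bytes], str]    # speech-to-text stage
Respond = Callable[[str], str]         # language-model stage
Synthesize = Callable[[str], bytes]    # text-to-speech stage


def make_cascade(stt: Transcribe, llm: Respond,
                 tts: Synthesize) -> Callable[[bytes], bytes]:
    """Compose the three stages into a single turn handler."""
    def handle_turn(audio_in: bytes) -> bytes:
        text_in = stt(audio_in)      # user audio -> transcript
        text_out = llm(text_in)      # transcript -> reply text
        return tts(text_out)         # reply text -> reply audio
    return handle_turn


# Stubs standing in for real providers.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = make_cascade(fake_stt, fake_llm, fake_tts)
```

This version runs each stage to completion before starting the next, so the stage latencies add up. Production cascades stream between stages to hide that cost, and end-to-end speech models skip the text hop entirely. Swapping only the tts argument is also a simple way to compare providers on the same script.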