Voice AI has a dirty secret: most of it was never designed for conversation. The dominant paradigm — feed text in, get audio out — traces its lineage to audiobook narration and voiceover production, where the model never hears the person on the other end. That’s fine when you’re generating a podcast intro. It’s not fine when a frustrated user is trying to get support from an AI agent at 11pm.
Inworld AI is calling that out directly with the launch of Realtime TTS-2, a new voice model released as a research preview via its Inworld API and Inworld Realtime API. The model hears the full audio of the exchange, picks up the user’s tone, pacing and emotional state, then takes voice direction in plain English the way developers prompt an LLM.
What’s Actually Different Here
The meaningful architectural distinction with TTS-2 is that it operates as a closed-loop system. The model takes the actual audio of the prior turns of the exchange as input, not just a transcript — it hears how the user actually sounded. That’s a non-trivial difference. A transcript of “okay, fine” gives you the words. The audio of “okay, fine” tells you whether the person is relieved, resigned, or sarcastic. TTS-2 is designed to use that signal.
The same line lands differently after a joke than after bad news, and the model knows the difference because it heard the prior turn. Tone, pacing, and emotional state carry forward automatically. Practically speaking, audio context flows across turns inside a Realtime session without developers needing to pass explicit prior_audio fields or build additional plumbing.
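For a sense of what that looks like from the developer's side, here is a minimal sketch of a session loop. The endpoint URL, message types, and field names below are illustrative placeholders rather than Inworld's documented schema; the point is that the agent's reply is requested with text only, and the session carries the earlier audio forward.

```python
# Hypothetical sketch of a Realtime session in which prior-turn audio context
# carries forward automatically. The endpoint URL and message fields are
# illustrative placeholders, not Inworld's documented API.
import asyncio
import json

import websockets  # pip install websockets


async def run_session(api_key: str) -> None:
    url = f"wss://realtime.example.com/session?key={api_key}"  # placeholder endpoint
    async with websockets.connect(url) as ws:
        # Turn 1: stream the user's audio up; the session keeps it as context.
        await ws.send(json.dumps({"type": "user_audio", "audio_b64": "<base64 pcm>"}))

        # Turn 2: request the agent's reply. No prior_audio field is passed --
        # the persistent session itself carries the earlier turns forward.
        await ws.send(json.dumps({
            "type": "speak",
            "text": "Okay, I can see why that would be frustrating.",
        }))

        # Drain audio chunks until the turn completes.
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "audio_chunk":
                play(event["audio_b64"])
            elif event.get("type") == "turn_complete":
                break


def play(audio_b64: str) -> None:
    """Decode and buffer or play a chunk (left to the application)."""


if __name__ == "__main__":
    asyncio.run(run_session("YOUR_API_KEY"))
```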
Four Capabilities, One Model
Inworld is shipping TTS-2 with four key features, positioning the combination, rather than any individual piece, as the differentiator.
- Voice Direction: It lets developers steer delivery with plain-language prompts inline at inference time. Instead of selecting from a fixed emotion enum like [sad] or [excited], developers pass a bracket tag like [speak sadly, as if something bad just happened] directly in the text, and long, descriptive prompts beat short labels: the model responds far better to full context than to single-word tags. Inline non-verbal markers like [laugh], [sigh], [breathe], [clear_throat], and [cough] can be dropped anywhere in the text where the moment should occur, and the model renders them as audio events, not pronounced words (the sketch after this list shows how these tags compose in a request).
- Conversational Awareness: This is the closed-loop design described above, and it is the shift that separates TTS-2 from prior-generation models that treat each sentence as a stateless generation call.
- Crosslingual Support: One voice identity is preserved across over 100 languages, including mid-utterance language switches inside a single generation. No language flag is needed; the model handles transitions automatically, keeping timbre, pitch, and character constant across the switch. The top-tier languages ship at native-speaker quality, while the long tail is described as launch-window experimental, consistent with the model releasing as a research preview.
- Advanced Voice Design: It generates a saved voice from a written prompt, with no reference audio required. Developers can describe a person in prose, save the result as a reusable voice, and call it like any other voice in the app. Voice Design ships with three stability modes: Expressive (for live consumer conversation and companions), Balanced (the default for most agent workloads), and Stable (for IVR and professional deployments where pitch drift is unacceptable).
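Here is an illustrative request that composes the controls above in one utterance: a descriptive voice-direction tag, a non-verbal marker, a mid-utterance Spanish-to-English switch with no language flag, and a stability mode. The bracket-tag syntax comes from the release notes; the HTTP endpoint, field names, and voice ID are assumed for the sketch.

```python
# Illustrative payload combining inline voice direction, non-verbal markers,
# and a crosslingual switch. Endpoint, field names, and voice ID are placeholders.
import requests  # pip install requests

payload = {
    "voice_id": "support_agent",      # could also be a Voice Design voice
    "stability_mode": "balanced",     # assumed field for Expressive/Balanced/Stable
    "text": (
        "[speak warmly, like you're reassuring a tired customer] "
        "No te preocupes, lo arreglamos ahora mismo. [breathe] "
        "I've already reissued the refund, so you should see it tomorrow. [sigh] "
        "Long day, huh?"
    ),
}

resp = requests.post(
    "https://api.example.com/tts/v1/synthesize",  # placeholder URL
    json=payload,
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=30,
)
resp.raise_for_status()

with open("reply.wav", "wb") as f:
    f.write(resp.content)
```

Note that the direction tag is a full descriptive clause rather than a one-word label, which is the kind of prompt the model is reported to respond to best.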
The Conversational Layer Underneath
Beyond the four key features, Inworld calls out a set of behaviors that push the speech further into what it describes as “person paying attention” territory. The most technically interesting is disfluencies: the model generates natural “uh” and “um”, self-corrections, mid-noun-phrase pauses, and trailing thoughts that signal warmth and recall rather than malfunction. Critically, different speaker profiles cluster fillers differently, and the model follows that rhythm: filler-as-energy sounds different from filler-as-hesitation. Voice cloning is also supported via a two-step API: upload a reference sample (5–15 seconds, clean, single speaker) to /voices/v1/voices:clone, get a voice ID back, and use it like any other voice.
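A hedged sketch of that two-step cloning flow follows. Only the /voices/v1/voices:clone path comes from the release notes; the base URL, auth header, multipart field names, and response shape are assumptions that Inworld's docs would pin down.

```python
# Hedged sketch of the two-step cloning flow. Base URL, auth header, field
# names, and response fields are assumptions; the clone path is from the post.
import requests  # pip install requests

BASE_URL = "https://api.example.com"  # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Step 1: upload a clean 5-15 second, single-speaker reference sample.
with open("reference.wav", "rb") as sample:
    clone_resp = requests.post(
        f"{BASE_URL}/voices/v1/voices:clone",
        headers=HEADERS,
        files={"audio": sample},          # assumed multipart field name
        data={"name": "support_agent"},   # assumed metadata field
        timeout=60,
    )
clone_resp.raise_for_status()
voice_id = clone_resp.json()["voice_id"]  # assumed response field

# Step 2: use the returned voice ID like any other voice in a TTS call.
tts_resp = requests.post(
    f"{BASE_URL}/tts/v1/synthesize",      # placeholder synthesis endpoint
    headers=HEADERS,
    json={"voice_id": voice_id, "text": "Thanks for holding, um, let me pull that up."},
    timeout=30,
)
tts_resp.raise_for_status()
```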
Where It Fits in the Stack
TTS-2 is one layer in Inworld’s broader Realtime API pipeline. The full stack includes Realtime STT, which transcribes and profiles the speaker in one pass, capturing age, accent, pitch, vocal style, emotional tone, and pacing as structured signals on the same connection; a Realtime Router that selects across 200+ models and tools based on the user’s state and conversation context; and TTS-2 at the output layer. The pipeline runs over a single persistent WebSocket connection, with sub-200 ms median time-to-first-audio for the TTS layer.
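If you want to sanity-check the time-to-first-audio figure on your own connection, the measurement is straightforward: timestamp the speak request and stop at the first audio chunk. The sketch below reuses the same hypothetical session schema as the earlier example; only the timing logic is the point.

```python
# Rough latency check: milliseconds from sending a speak request to receiving
# the first audio chunk. Endpoint and message schema are placeholders.
import asyncio
import json
import time

import websockets  # pip install websockets


async def measure_ttfa(api_key: str, text: str) -> float:
    url = f"wss://realtime.example.com/session?key={api_key}"  # placeholder
    async with websockets.connect(url) as ws:
        start = time.perf_counter()
        await ws.send(json.dumps({"type": "speak", "text": text}))
        async for raw in ws:
            if json.loads(raw).get("type") == "audio_chunk":
                return (time.perf_counter() - start) * 1000.0  # ms to first audio
    raise RuntimeError("session closed before any audio arrived")


if __name__ == "__main__":
    ttfa_ms = asyncio.run(measure_ttfa("YOUR_API_KEY", "Hi, how can I help?"))
    print(f"time to first audio: {ttfa_ms:.0f} ms")
```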
Source: Artificial Analysis Text-to-Speech Leaderboard, https://artificialanalysis.ai/text-to-speech/leaderboard (data as of May 5, 2026).
The Broader Context
Realtime TTS 1.5 already ranks #1 on the Artificial Analysis Speech Arena (as of May 5, 2026), ahead of Google (#2) and ElevenLabs (#3). The launch of TTS-2 signals that Inworld considers raw audio quality a solved problem — and is now competing on the behavioral layer: context-awareness, steerability, and identity consistency across languages.
Check out the Docs and Technical details.

