xAI Launches Standalone Grok Speech-to-Text and Text-to-Speech APIs, Targeting Enterprise Voice Developers

Elon Musk’s AI company xAI has launched two standalone audio APIs — a Speech-to-Text (STT) API and a Text-to-Speech (TTS) API — both built on the same infrastructure that powers Grok Voice on mobile apps, Tesla vehicles, and Starlink customer support. The release moves xAI squarely into the competitive speech API market currently occupied by ElevenLabs, Deepgram, and AssemblyAI.

What Is the Grok Speech-to-Text API?

Speech-to-Text is the technology that converts spoken audio into written text. For developers building meeting transcription tools, voice agents, call center analytics, or accessibility features, an STT API is a core building block. Rather than developing this from scratch, developers call an endpoint, send audio, and receive a structured transcript in return.

The Grok STT API is now generally available, offering transcription across 25 languages with both batch and streaming modes. The batch mode is designed for processing pre-recorded audio files, while streaming enables real-time transcription as audio is captured. Pricing is kept straightforward: Speech-to-Text is $0.10 per hour for batch and $0.20 per hour for streaming.
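To make the pricing concrete, the following is a rough cost sketch based only on the two per-hour rates quoted above. It assumes billing is linear in audio duration; the article does not state the billing granularity or any rounding rules, and the function name `estimate_stt_cost` is hypothetical.

```python
from decimal import Decimal

# Per-hour rates quoted in the article (USD).
RATES_PER_HOUR = {"batch": Decimal("0.10"), "streaming": Decimal("0.20")}
SECONDS_PER_HOUR = Decimal(3600)


def estimate_stt_cost(audio_seconds: float, mode: str = "batch") -> Decimal:
    """Estimate the USD cost of transcribing `audio_seconds` of audio.

    Assumption: cost scales linearly with duration (no minimums, no rounding
    to whole minutes). The real billing rules may differ.
    """
    if mode not in RATES_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = Decimal(str(audio_seconds)) / SECONDS_PER_HOUR
    return (hours * RATES_PER_HOUR[mode]).quantize(Decimal("0.0001"))


if __name__ == "__main__":
    # 1,000 hours of recorded calls processed in batch mode.
    print(estimate_stt_cost(1000 * 3600, "batch"))
    # A 90-minute live meeting transcribed in streaming mode.
    print(estimate_stt_cost(90 * 60, "streaming"))
```

Under these assumptions, 1,000 hours of batch audio comes to about $100, and streaming the same audio would cost twice that.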

The API includes word-level timestamps, speaker diarization, and multichannel support, along with intelligent Inverse Text Normalization that correctly handles numbers, dates, currencies, and more. It also accepts 12 audio formats — nine container formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV) and three raw formats (PCM, µ-law, A-law), with a maximum file size of 500 MB per request.
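A client could check the format and size limits above before uploading. The sketch below is one way to do that locally. The extension mapping is an assumption (the article lists format names, not file extensions), "500 MB" is read conservatively as 500,000,000 bytes, and raw PCM, µ-law, and A-law input is not covered because it has no standard file extension.

```python
# Assumed file extensions for the nine container formats named in the article.
CONTAINER_EXTENSIONS = {
    ".wav", ".mp3", ".ogg", ".opus", ".flac", ".aac", ".mp4", ".m4a", ".mkv",
}

# Assumption: "500 MB" interpreted as 500 * 10^6 bytes (the stricter reading).
MAX_BYTES = 500 * 1000 * 1000


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    dot = filename.rfind(".")
    extension = filename[dot:].lower() if dot != -1 else ""
    if extension not in CONTAINER_EXTENSIONS:
        problems.append(f"unsupported or unknown extension: {extension or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the {MAX_BYTES}-byte limit")
    return problems


if __name__ == "__main__":
    print(check_upload("standup.M4A", 42_000_000))
    print(check_upload("notes.txt", 600_000_000))
```

Files that exceed the limit would need to be split or re-encoded before a batch request; the article does not say how the API itself responds to oversized uploads.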

Speaker diarization is the process of separating audio by individual speakers, answering the question ‘who said what.’ This is critical for multi-speaker recordings like meetings, interviews, or customer calls. Word-level timestamps assign precise start and end times to each word in the transcript, enabling use cases like subtitle generation, searchable recordings, and legal documentation. Inverse Text Normalization converts spoken forms like ‘one hundred sixty-seven thousand nine hundred eighty-three dollars and fifteen cents’ into readable structured output: “$167,983.15.”
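As an illustration of the subtitle use case, the sketch below turns word-level timestamps with speaker labels into SRT cues. The input shape (a list of dicts with `text`, `start`, `end`, and `speaker` keys) is hypothetical and is not taken from the Grok STT response schema, which the article does not describe.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 8) -> str:
    """Group consecutive words by speaker into SRT cues of at most `max_words`."""
    cues = []
    for word in words:
        last = cues[-1] if cues else None
        same_cue = (
            last is not None
            and last["speaker"] == word["speaker"]
            and len(last["words"]) < max_words
        )
        if same_cue:
            last["words"].append(word["text"])
            last["end"] = word["end"]
        else:
            # A speaker change or a full cue starts a new subtitle.
            cues.append({
                "speaker": word["speaker"],
                "start": word["start"],
                "end": word["end"],
                "words": [word["text"]],
            })
    blocks = [
        f"{i}\n{srt_time(c['start'])} --> {srt_time(c['end'])}\n"
        f"Speaker {c['speaker']}: {' '.join(c['words'])}"
        for i, c in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = [  # hypothetical diarized words
        {"text": "Hello", "start": 0.0, "end": 0.4, "speaker": 0},
        {"text": "there", "start": 0.45, "end": 0.8, "speaker": 0},
        {"text": "Hi", "start": 1.0, "end": 1.2, "speaker": 1},
    ]
    print(words_to_srt(sample))
```

Starting a new cue on every speaker change is what makes diarization visible in the subtitles; a real pipeline would map whatever field names the API returns onto this shape.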

Benchmark Performance

The xAI research team is making strong claims on accuracy. On phone call entity recognition (names, account numbers, dates), Grok STT claims a 5.0% error rate versus ElevenLabs at 12.0%, Deepgram at 13.5%, and AssemblyAI at 21.3%. That is a substantial margin if it holds in production. For video and podcast transcription, Grok and ElevenLabs tied at a 2.4% error rate, with Deepgram and AssemblyAI trailing at 3.0% and 3.2% respectively. The xAI team also reports a 6.9% word error rate on general audio benchmarks.
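For readers who want to sanity-check such numbers on their own audio, the sketch below computes word error rate in the standard way: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. This is the textbook definition, not necessarily the exact procedure xAI used; the article does not describe its text normalization or how the entity error rates were scored.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count.

    Assumption: simple lowercasing and whitespace tokenization; published
    benchmarks usually apply heavier normalization before scoring.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "call me at five"
    transcript = "call me at nine"
    print(f"WER: {word_error_rate(truth, transcript):.2%}")
```

Running the same script over transcripts from several providers on an identical test set gives a like-for-like comparison that vendor-reported figures cannot.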

https://x.ai/news/grok-stt-and-tts-apis

What Is the Grok Text-to-Speech API?

Text-to-Speech converts written text into spoken audio. Developers use TTS APIs to power voice assistants, read-aloud features, podcast generation, IVR (interactive voice response) systems, and accessibility tools.

The Grok TTS API delivers fast, natural speech synthesis with detailed control via speech tags, and is priced at $4.20 per 1 million characters. The API accepts up to 15,000 characters per REST request; for longer content, a WebSocket streaming endpoint is available that has no text length limit and begins returning audio before the full input is processed. The API supports 20 languages and five distinct voices: Ara, Eve, Leo, Rex, and Sal — with Eve set as the default.
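When the WebSocket endpoint is not an option, longer text has to be split to fit the 15,000-character REST limit. The sketch below shows one way to do that at sentence boundaries, along with a rough cost estimate from the quoted rate. It assumes the limit and the billing both apply to the raw character count of the submitted text, which the article does not confirm; both function names are hypothetical.

```python
import re

MAX_CHARS_PER_REQUEST = 15_000   # REST limit quoted in the article
USD_PER_MILLION_CHARS = 4.20     # price quoted in the article


def split_for_tts(text: str, limit: int = MAX_CHARS_PER_REQUEST) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence breaks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def estimate_tts_cost(num_chars: int) -> float:
    """Estimated USD cost, assuming billing is linear in character count."""
    return num_chars / 1_000_000 * USD_PER_MILLION_CHARS


if __name__ == "__main__":
    article = "This is one sentence. " * 2000  # about 44,000 characters
    parts = split_for_tts(article)
    print(len(parts), max(len(p) for p in parts))
    print(f"${estimate_tts_cost(len(article)):.4f}")
```

Splitting at sentence boundaries keeps each request self-contained, which should help avoid audible seams in the middle of a sentence when the audio segments are joined.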

Beyond voice selection, developers can inject inline and wrapping speech tags to control delivery. These include inline tags like [laugh], [sigh], and [breath], as well as wrapping tags that enclose a span of text, letting developers create engaging, lifelike delivery without complex markup. This expressiveness addresses one of the core limitations of traditional TTS systems, which often produce technically correct but emotionally flat output.
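Because inline tags live inside the text itself, a typo in a tag name is easy to miss. The sketch below lints a script against the three inline tags named above and strips them to recover a plain caption. The tag set is limited to what this article mentions, so a tag flagged here may still be valid in the actual API; the bracket syntax is the only thing taken from the source.

```python
import re

# Only the inline tags named in the article; the full supported set may be larger.
KNOWN_INLINE_TAGS = {"laugh", "sigh", "breath"}
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def find_unknown_tags(text: str) -> list[str]:
    """Return bracketed tag names that are not in KNOWN_INLINE_TAGS."""
    return sorted({tag for tag in TAG_PATTERN.findall(text) if tag not in KNOWN_INLINE_TAGS})


def strip_inline_tags(text: str) -> str:
    """Remove inline tags, e.g. to display the same script as an on-screen caption."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", without_tags).strip()


if __name__ == "__main__":
    script = "Well [sigh] that was close. [laugh]"
    print(find_unknown_tags(script + " [giggle]"))
    print(strip_inline_tags(script))
```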

Key Takeaways

xAI has launched two standalone audio APIs — Grok Speech-to-Text (STT) and Text-to-Speech (TTS) — built on the same production stack already serving millions of users across Grok mobile apps, Tesla vehicles, and Starlink customer support.
The Grok STT API offers real-time and batch transcription across 25 languages with speaker diarization, word-level timestamps, Inverse Text Normalization, and support for 12 audio formats — priced at $0.10/hour for batch and $0.20/hour for streaming.
On phone call entity recognition benchmarks, Grok STT reports a 5.0% error rate, significantly outperforming ElevenLabs (12.0%), Deepgram (13.5%), and AssemblyAI (21.3%), with particularly strong performance in medical, legal, and financial use cases.
The Grok TTS API supports five expressive voices (Ara, Eve, Leo, Rex, Sal) across 20 languages, with inline and wrapping speech tags like [laugh] and [sigh] giving developers fine-grained control over vocal delivery, priced at $4.20 per 1 million characters.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
