Speech
Learn how AI speech synthesis works, when to use text-to-speech, and how to design natural voice experiences for apps, assistants, and accessibility features.
Speech synthesis turns text into audio that sounds spoken rather than written. It is the capability behind narration, voice assistants, accessibility playback, character voices, and many real-time conversational experiences.
In modern products, speech is rarely just "read this text out loud". The best experiences consider voice selection, latency, emotional tone, playback controls, and whether the audio is a one-off file or part of a live conversation.
Overview
At its core, speech synthesis means taking text as input and producing audio as output. That audio can be generated as a single file, streamed in chunks, or produced in real time as part of an interactive voice system.
Most speech products involve some combination of:
- converting text into spoken audio
- selecting a voice or speaker profile
- controlling style, pace, or delivery
- streaming or downloading the result
- playing the result inside an app or assistant
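The components above can be sketched as a single request shape. The sketch below is illustrative only: `SpeechRequest` and `buildSpeechRequest` are hypothetical names, not part of any specific SDK.

```typescript
// A hypothetical request shape covering the common knobs:
// text, voice, delivery controls, and how the audio is returned.
interface SpeechRequest {
  text: string;
  voice: string;
  speed: number; // 1.0 = normal pace
  output: "file" | "stream";
}

// Build a request with sensible defaults so callers only
// specify what they care about.
function buildSpeechRequest(
  text: string,
  overrides: Partial<Omit<SpeechRequest, "text">> = {}
): SpeechRequest {
  return { text, voice: "default", speed: 1.0, output: "file", ...overrides };
}

const request = buildSpeechRequest("Welcome back!", { output: "stream" });
```

Keeping these options in one object makes it easy to swap providers later, since most speech APIs accept some variant of this shape.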
Where speech is useful
Speech is especially useful when reading is not the best interface. It adds reach, accessibility, and a stronger sense of presence than text alone.
Good fit
Accessibility playback, narration, reading assistants, virtual characters, voice UIs, and spoken summaries are all strong speech use cases.
Where it appears in these docs
See Text to Speech, Voice, and the Eleven Labs provider guide for more applied follow-up.
Not always needed
If the user only needs silent, skimmable output, plain text is often faster, cheaper, and easier to control.
Speech vs voice vs transcription
These terms are related, but they refer to different capabilities. Keeping them separate makes it easier to design the right system.
| Capability | Input | Output | Best for |
|---|---|---|---|
| Speech synthesis | Text | Audio | Narration, playback, spoken responses |
| Transcription | Audio | Text | Captions, notes, search, voice input |
| Real-time voice | Audio and text turns | Interactive conversation | Voice assistants, live agents, low-latency sessions |
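The table above can be expressed directly in code, which is a useful exercise when deciding which capability a feature actually needs. A minimal sketch (the union and helper are illustrative, not from any library):

```typescript
// Model the three capabilities from the table as a discriminated union.
type Capability =
  | { kind: "speech-synthesis"; input: "text"; output: "audio" }
  | { kind: "transcription"; input: "audio"; output: "text" }
  | { kind: "real-time-voice"; input: "audio+text"; output: "conversation" };

// Pick a capability from what the product consumes and produces.
// Returns null when no single capability fits.
function pickCapability(
  input: string,
  output: string
): Capability["kind"] | null {
  if (input === "text" && output === "audio") return "speech-synthesis";
  if (input === "audio" && output === "text") return "transcription";
  if (output === "conversation") return "real-time-voice";
  return null;
}
```

If a feature needs more than one of these at once, that is usually a sign you are building a real-time voice system rather than a simple playback feature.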
AI SDK example
The AI SDK includes a speech generation API for provider-backed text-to-speech flows. The example below shows the basic shape: choose a speech model, send text, and receive audio.
```typescript
import { experimental_generateSpeech as generateSpeech } from "ai";
import { openai } from "@ai-sdk/openai";

const { audio } = await generateSpeech({
  model: openai.speech("tts-1"),
  text: "Hello from the AI SDK!",
  voice: "alloy",
});

console.log(audio);
```

This is the foundation for many speech features. In a real app, you would usually stream or save the audio rather than just logging it.
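Saving the result is often the next step. A minimal sketch, assuming the returned audio object exposes the encoded bytes as a `Uint8Array` (recent AI SDK versions expose a `uint8Array` property; check the types for your version):

```typescript
import { writeFile } from "node:fs/promises";

// Persist synthesized audio bytes to disk so the app can serve
// or cache the file instead of regenerating it on every request.
async function saveAudio(bytes: Uint8Array, path: string): Promise<void> {
  await writeFile(path, bytes);
}

// Usage with the example above (property name may differ by SDK version):
// await saveAudio(audio.uint8Array, "greeting.mp3");
```

For interactive experiences you would stream chunks to the client instead of writing a file, but the file-based flow is the simplest place to start.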
Common product patterns
Speech features tend to fall into a few common UX patterns. Choosing the right one depends on whether the user wants playback, interaction, or audio as a generated asset.
Text-to-speech player
The user enters text, chooses a voice, and plays or downloads the result. This is the most common TTS pattern.
Narration layer
The app adds optional speech playback to written content such as articles, summaries, or onboarding instructions.
Voice response layer
A text or chat system generates the answer, then speech synthesis reads it aloud for a more immersive interaction.
Real-time conversation
Speech synthesis is used as one piece of a live assistant flow alongside transcription, turn-taking, and session control.
Design considerations
Speech can feel magical in demos, but product quality usually comes down to a few practical decisions. These are the areas worth thinking through up front.
Latency
For narration, a short wait is fine. For assistants or live responses, latency has a much bigger effect on whether the interaction feels natural.
Voice selection
The voice communicates brand, tone, and trust. A great voice for accessibility may not be the right one for a playful character or a support assistant.
Playback controls
Speech quality alone is not enough. Users often need pause, replay, speed control, and download options.
Text preparation
Text written for reading is not always pleasant to hear. Long sentences, timestamps, code, or URLs may need pre-processing before synthesis.
Streaming
If speech is streamed as it is generated, the product can feel much more alive, but error handling and buffering become more important.
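Text preparation can start as a few simple substitutions before synthesis. The rules below are illustrative, not from any particular library; real products usually grow a longer list of rewrites.

```typescript
// Rewrite text-for-reading into text-for-listening:
// shorten URLs to their hostname, drop code blocks, and
// loosen terse timestamps so the voice paces them naturally.
function prepareForSpeech(text: string): string {
  return text
    // Replace full URLs with just the hostname ("example.com").
    .replace(/https?:\/\/([^\/\s]+)\S*/g, "$1")
    // Remove fenced code blocks entirely; code rarely reads well aloud.
    .replace(/```[\s\S]*?```/g, "(code omitted)")
    // Break "12:30"-style times apart so they are spoken, not spelled.
    .replace(/\b(\d{1,2}):(\d{2})\b/g, "$1 $2");
}
```

Running a pass like this before every synthesis call is cheap, and it removes the most common sources of awkward-sounding output.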
Beginner mistakes to avoid
Many weak speech features fail for predictable reasons. The model may be good, but the experience still feels awkward if the surrounding design is poor.
Reading raw text without cleanup
Lists, links, code snippets, and long machine-written sentences often sound unnatural if sent directly to speech synthesis.
Ignoring playback controls
Even high-quality audio becomes frustrating if users cannot pause, replay, or adjust speed.
Choosing a voice without context
The same voice can feel warm, robotic, premium, or wrong depending on the product and audience.
Using speech where silence is better
Some tasks are simply easier to scan in text than to hear in audio. Speech should add value, not just novelty.
Related documentation
In this docs set, speech is best understood through the app and provider pages that turn the capability into concrete product flows. These are the best places to continue once you understand the fundamentals.
Text to Speech
See a full speech-synthesis experience with voice selection, playback, and streaming audio.
Voice
See how speech fits into a real-time conversational assistant alongside transcripts and session control.
Eleven Labs
Explore a provider focused on realistic voice synthesis, cloning, and broader audio workflows.
OpenAI
See provider-level speech generation support in the broader OpenAI capabilities surface.
A practical checklist
Before shipping a speech feature, it helps to test whether the output sounds good, behaves well, and actually improves the product rather than just adding novelty.
- Pick voices that match the product tone and audience.
- Clean up text before sending it to speech synthesis.
- Add playback controls early, not as an afterthought.
- Measure latency if the experience is interactive.
- Decide whether audio should be streamed, downloaded, or stored.
Learn more
These references are useful if you want to go deeper into both implementation and product design for speech features.