Speech

Learn how AI speech synthesis works, when to use text-to-speech, and how to design natural voice experiences for apps, assistants, and accessibility features.

Speech synthesis turns text into audio that sounds spoken rather than written. It is the capability behind narration, voice assistants, accessibility playback, character voices, and many real-time conversational experiences.

In modern products, speech is rarely just "read this text out loud". The best experiences consider voice selection, latency, emotional tone, playback controls, and whether the audio is a one-off file or part of a live conversation.

Overview

At its core, speech synthesis means taking text as input and producing audio as output. That audio can be generated as a single file, streamed in chunks, or produced in real time as part of an interactive voice system.

Most speech products involve some combination of:

  • converting text into spoken audio
  • selecting a voice or speaker profile
  • controlling style, pace, or delivery
  • streaming or downloading the result
  • playing the result inside an app or assistant

Where speech is useful

Speech is especially useful when reading is not the best interface. It adds reach, accessibility, and a stronger sense of presence than text alone.

Good fit

Accessibility playback, narration, reading assistants, virtual characters, voice UIs, and spoken summaries are all strong speech use cases.

Where it appears in these docs

See Text to Speech, Voice, and the ElevenLabs provider guide for applied follow-ups.

Not always needed

If the user only needs silent, skimmable output, plain text is often faster, cheaper, and easier to control.

Speech vs voice vs transcription

These terms are related, but they refer to different capabilities. Keeping them separate makes it easier to design the right system.

| Capability | Input | Output | Best for |
| --- | --- | --- | --- |
| Speech synthesis | Text | Audio | Narration, playback, spoken responses |
| Transcription | Audio | Text | Captions, notes, search, voice input |
| Real-time voice | Audio and text turns | Interactive conversation | Voice assistants, live agents, low-latency sessions |

AI SDK example

The AI SDK includes a speech generation API for provider-backed text-to-speech flows. The example below shows the basic shape: choose a speech model, send text, and receive audio.

import { experimental_generateSpeech as generateSpeech } from "ai";
import { openai } from "@ai-sdk/openai";

const { audio } = await generateSpeech({
  model: openai.speech("tts-1"),
  text: "Hello from the AI SDK!",
  voice: "alloy",
});

console.log(audio);

This is the foundation for many speech features. In a real app, you would usually stream or save the audio rather than just logging it.
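In Node, for example, you could persist the generated bytes to disk. The sketch below assumes the audio result exposes its raw bytes as a `Uint8Array` (as in the SDK's generated-audio shape — check your SDK version); the helper name is illustrative.

```typescript
import { writeFile } from "node:fs/promises";

// Minimal sketch: persist generated speech bytes to disk.
// In a real flow, `bytes` would come from the generateSpeech result
// (e.g. audio.uint8Array — property name may vary by SDK version).
async function saveAudio(bytes: Uint8Array, path: string): Promise<string> {
  await writeFile(path, bytes);
  return path;
}

// Stand-in bytes here; real audio would come from the speech call above.
const path = await saveAudio(new Uint8Array([0x49, 0x44, 0x33]), "speech.mp3");
console.log(`Saved ${path}`);
```

Saving to a file suits one-off narration assets; for interactive flows you would stream chunks to the player instead.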

Common product patterns

Speech features tend to fall into a few common UX patterns. Choosing the right one depends on whether the user wants playback, interaction, or audio as a generated asset.

Text-to-speech player

The user enters text, chooses a voice, and plays or downloads the result. This is the most common TTS pattern.

Narration layer

The app adds optional speech playback to written content such as articles, summaries, or onboarding instructions.

Voice response layer

A text or chat system generates the answer, then speech synthesis reads it aloud for a more immersive interaction.

Real-time conversation

Speech synthesis is used as one piece of a live assistant flow alongside transcription, turn-taking, and session control.

Design considerations

Speech can feel magical in demos, but product quality usually comes down to a few practical decisions. These are the areas worth thinking through up front.

Beginner mistakes to avoid

Many weak speech features fail for predictable reasons. The model may be good, but the experience still feels awkward if the surrounding design is poor.

Reading raw text without cleanup

Lists, links, code snippets, and long machine-written sentences often sound unnatural if sent directly to speech synthesis.
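One common fix is a small normalization pass before synthesis. The rules below are illustrative, not exhaustive — tune them for your own content:

```typescript
// Illustrative cleanup pass before sending text to speech synthesis.
function cleanForSpeech(text: string): string {
  return text
    // Drop fenced code blocks entirely — code rarely reads well aloud.
    .replace(/```[\s\S]*?```/g, "")
    // Keep link text, drop the URL: [label](url) → label
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1")
    // Turn list bullets into plain sentences.
    .replace(/^\s*[-*•]\s+/gm, "")
    // Collapse runs of whitespace.
    .replace(/\s+/g, " ")
    .trim();
}

console.log(cleanForSpeech("See [the docs](https://example.com) for:\n- setup\n- usage"));
// → "See the docs for: setup usage"
```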

Ignoring playback controls

Even high-quality audio becomes frustrating if users cannot pause, replay, or adjust speed.
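The control logic itself is small and framework-independent. A sketch, with illustrative names — in a real player you would wire these to an audio element's `playbackRate` and `currentTime`:

```typescript
// Illustrative playback-control helpers.
const RATES = [0.75, 1, 1.25, 1.5, 2] as const;

// Cycle to the next playback speed (falls back to the first rate
// if the current value is not in the list).
function nextRate(current: number): number {
  const i = RATES.indexOf(current as (typeof RATES)[number]);
  return RATES[(i + 1) % RATES.length];
}

// Jump forward or back, clamped to the track bounds.
function skip(currentTime: number, duration: number, deltaSeconds: number): number {
  return Math.min(duration, Math.max(0, currentTime + deltaSeconds));
}

console.log(nextRate(1)); // → 1.25
console.log(skip(5, 60, -15)); // → 0
```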

Choosing a voice without context

The same voice can feel warm, robotic, premium, or wrong depending on the product and audience.

Using speech where silence is better

Some tasks are simply easier to scan in text than to hear in audio. Speech should add value, not just novelty.

In this docs set, speech is best understood through the app and provider pages that turn the capability into concrete product flows. These are the best places to continue once you understand the fundamentals.

A practical checklist

Before shipping a speech feature, it helps to test whether the output sounds good, behaves well, and actually improves the product rather than just adding novelty.

  • Pick voices that match the product tone and audience.
  • Clean up text before sending it to speech synthesis.
  • Add playback controls early, not as an afterthought.
  • Measure latency if the experience is interactive.
  • Decide whether audio should be streamed, downloaded, or stored.
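For the latency item, the metric that usually matters most is time to first audio. A small timing wrapper (names illustrative) makes it easy to record:

```typescript
// Illustrative helper: measure how long a speech call takes before
// audio is available — a rough proxy for time to first audio.
async function timed<T>(fn: () => Promise<T>): Promise<{ result: T; ms: number }> {
  const start = performance.now();
  const result = await fn();
  return { result, ms: performance.now() - start };
}

// Usage sketch — `synthesize` stands in for your actual speech call:
const synthesize = async () => new Uint8Array([1, 2, 3]);
const { result, ms } = await timed(synthesize);
console.log(`Got ${result.length} bytes in ${ms.toFixed(1)} ms`);
```

When you stream audio, measure to the first chunk rather than to the full result — that is what the user perceives as responsiveness.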

Learn more

These references are useful if you want to go deeper into both implementation and product design for speech features.
