Speech

Learn how AI speech synthesis works, when to use text-to-speech, and how to design natural voice experiences for apps, assistants, and accessibility features.

Speech synthesis turns text into audio that sounds spoken rather than written. It is the capability behind narration, voice assistants, accessibility playback, character voices, and many real-time conversational experiences.

In modern products, speech is rarely just "read this text out loud". The best experiences consider voice selection, latency, emotional tone, playback controls, and whether the audio is a one-off file or part of a live conversation.

Overview

At its core, speech synthesis means taking text as input and producing audio as output. That audio can be generated as a single file, streamed in chunks, or produced in real time as part of an interactive voice system.

Most speech products involve some combination of:

  • converting text into spoken audio
  • selecting a voice or speaker profile
  • controlling style, pace, or delivery
  • streaming or downloading the result
  • playing the result inside an app or assistant

Where speech is useful

Speech is especially useful when reading is not the best interface. It adds reach, accessibility, and a stronger sense of presence than text alone.

Good fit

Accessibility playback, narration, reading assistants, virtual characters, voice UIs, and spoken summaries are all strong speech use cases.

Where it appears in these docs

See Text to Speech, Voice, and the ElevenLabs provider guide for applied follow-ups.

Not always needed

If the user only needs silent, skimmable output, plain text is often faster, cheaper, and easier to control.

Speech vs voice vs transcription

These terms are related, but they refer to different capabilities. Keeping them separate makes it easier to design the right system.

| Capability | Input | Output | Best for |
| --- | --- | --- | --- |
| Speech synthesis | Text | Audio | Narration, playback, spoken responses |
| Transcription | Audio | Text | Captions, notes, search, voice input |
| Real-time voice | Audio and text turns | Interactive conversation | Voice assistants, live agents, low-latency sessions |

AI SDK example

The AI SDK includes a speech generation API for provider-backed text-to-speech flows. The example below shows the basic shape: choose a speech model, send text, and receive audio.

import { experimental_generateSpeech as generateSpeech } from "ai";
import { openai } from "@ai-sdk/openai";

const { audio } = await generateSpeech({
  model: openai.speech("tts-1"),
  text: "Hello from the AI SDK!",
  voice: "alloy",
});

console.log(audio);

This is the foundation for many speech features. In a real app, you would usually stream or save the audio rather than just logging it.
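In Node, for example, you could persist the generated bytes to disk. The sketch below assumes the audio result exposes its raw bytes as a `Uint8Array` (as in the SDK's generated-audio shape — check your SDK version); the helper name is illustrative.

```typescript
import { writeFile } from "node:fs/promises";

// Minimal sketch: persist generated speech bytes to disk.
// In a real flow, `bytes` would come from the generateSpeech result
// (e.g. audio.uint8Array — property name may vary by SDK version).
async function saveAudio(bytes: Uint8Array, path: string): Promise<string> {
  await writeFile(path, bytes);
  return path;
}

// Stand-in bytes here; real audio would come from the speech call above.
const path = await saveAudio(new Uint8Array([0x49, 0x44, 0x33]), "speech.mp3");
console.log(`Saved ${path}`);
```

Saving to a file suits one-off narration assets; for interactive flows you would stream chunks to the player instead.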

Common product patterns

Speech features tend to fall into a few common UX patterns. Choosing the right one depends on whether the user wants playback, interaction, or audio as a generated asset.

Text-to-speech player

The user enters text, chooses a voice, and plays or downloads the result. This is the most common TTS pattern.

Narration layer

The app adds optional speech playback to written content such as articles, summaries, or onboarding instructions.

Voice response layer

A text or chat system generates the answer, then speech synthesis reads it aloud for a more immersive interaction.

Real-time conversation

Speech synthesis is used as one piece of a live assistant flow alongside transcription, turn-taking, and session control.

Design considerations

Speech can feel magical in demos, but product quality usually comes down to a few practical decisions. These are the areas worth thinking through up front.

Beginner mistakes to avoid

Many weak speech features fail for predictable reasons. The model may be good, but the experience still feels awkward if the surrounding design is poor.

Reading raw text without cleanup

Lists, links, code snippets, and long machine-written sentences often sound unnatural if sent directly to speech synthesis.
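One common fix is a small normalization pass before synthesis. The rules below are illustrative, not exhaustive — tune them for your own content:

```typescript
// Illustrative cleanup pass before sending text to speech synthesis.
function cleanForSpeech(text: string): string {
  return text
    // Drop fenced code blocks entirely — code rarely reads well aloud.
    .replace(/```[\s\S]*?```/g, "")
    // Keep link text, drop the URL: [label](url) → label
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1")
    // Turn list bullets into plain sentences.
    .replace(/^\s*[-*•]\s+/gm, "")
    // Collapse runs of whitespace.
    .replace(/\s+/g, " ")
    .trim();
}

console.log(cleanForSpeech("See [the docs](https://example.com) for:\n- setup\n- usage"));
// → "See the docs for: setup usage"
```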

Ignoring playback controls

Even high-quality audio becomes frustrating if users cannot pause, replay, or adjust speed.
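The control logic itself is small and framework-independent. A sketch, with illustrative names — in a real player you would wire these to an audio element's `playbackRate` and `currentTime`:

```typescript
// Illustrative playback-control helpers.
const RATES = [0.75, 1, 1.25, 1.5, 2] as const;

// Cycle to the next playback speed (falls back to the first rate
// if the current value is not in the list).
function nextRate(current: number): number {
  const i = RATES.indexOf(current as (typeof RATES)[number]);
  return RATES[(i + 1) % RATES.length];
}

// Jump forward or back, clamped to the track bounds.
function skip(currentTime: number, duration: number, deltaSeconds: number): number {
  return Math.min(duration, Math.max(0, currentTime + deltaSeconds));
}

console.log(nextRate(1)); // → 1.25
console.log(skip(5, 60, -15)); // → 0
```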

Choosing a voice without context

The same voice can feel warm, robotic, premium, or wrong depending on the product and audience.

Using speech where silence is better

Some tasks are simply easier to scan in text than to hear in audio. Speech should add value, not just novelty.

In this docs set, speech is best understood through the app and provider pages that turn the capability into concrete product flows. These are the best places to continue once you understand the fundamentals.

A practical checklist

Before shipping a speech feature, it helps to test whether the output sounds good, behaves well, and actually improves the product rather than just adding novelty.

  • Pick voices that match the product tone and audience.
  • Clean up text before sending it to speech synthesis.
  • Add playback controls early, not as an afterthought.
  • Measure latency if the experience is interactive.
  • Decide whether audio should be streamed, downloaded, or stored.
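For the latency item, the metric that usually matters most is time to first audio. A small timing wrapper (names illustrative) makes it easy to record:

```typescript
// Illustrative helper: measure how long a speech call takes before
// audio is available — a rough proxy for time to first audio.
async function timed<T>(fn: () => Promise<T>): Promise<{ result: T; ms: number }> {
  const start = performance.now();
  const result = await fn();
  return { result, ms: performance.now() - start };
}

// Usage sketch — `synthesize` stands in for your actual speech call:
const synthesize = async () => new Uint8Array([1, 2, 3]);
const { result, ms } = await timed(synthesize);
console.log(`Got ${result.length} bytes in ${ms.toFixed(1)} ms`);
```

When you stream audio, measure to the first chunk rather than to the full result — that is what the user perceives as responsiveness.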

Learn more

These references are useful if you want to go deeper into both implementation and product design for speech features.
