
Index TTS 2: Tone‑Controlled, Stream‑Ready Text‑to‑Speech

Today we’re introducing Index TTS 2 — a major upgrade to our text‑to‑speech engine that brings multiple voice tones, easy voice cloning, and real‑time streaming to developers and creators.


What’s new

  • Multiple voice tones via Tone Synthesis for precise delivery and style.
  • Easy voice cloning from short samples with simple controls.
  • Real‑time streaming with low latency for interactive apps.
  • Studio‑quality output with robust noise removal and consistency.
  • Developer‑friendly APIs with straightforward JSON requests.

Quick start

Generate speech with a voice and optional tone:

curl -X POST https://api.voiceasy.ai/v1/tts \
  -H "Authorization: Bearer $VOICEASY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Introducing Index TTS 2",
    "voice_id": "en-female-1",
    "tone_id": "promo-energetic",
    "format": "mp3",
    "sample_rate": 44100
  }' -o index-tts2.mp3

Prefer a calmer delivery? Set tone_id to "friendly" or to one of your own custom tones.

Tone control and cloning

Use Tone Synthesis to control emphasis, pauses, and intonation, then clone voices from short samples to build brand‑consistent audio across products.
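
Below is a minimal sketch of a cloning request. The /v1/voices endpoint, the multipart field names, and the sample file are assumptions for illustration; consult the official API reference for the actual contract.

# Hypothetical endpoint and field names (assumed, not confirmed):
# upload a short, clean sample and receive a reusable voice ID.
curl -X POST https://api.voiceasy.ai/v1/voices \
  -H "Authorization: Bearer $VOICEASY_API_KEY" \
  -F "name=brand-voice-1" \
  -F "sample=@short-sample.wav"

The returned voice ID can then be passed as voice_id to the /v1/tts request shown in the quick start.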

Learn more: Tone Synthesis · My Tones

Streaming for interactive experiences

Index TTS 2 is designed for real‑time scenarios — from live assistance to creative tools — with low‑latency streaming and consistent audio quality.
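
A low-latency client can start playback as soon as the first chunks arrive. The sketch below assumes a streaming variant of the endpoint (/v1/tts/stream) and chunked MP3 output; both are illustrative rather than confirmed API surface.

# Assumed streaming endpoint. curl -N disables output buffering so
# audio chunks are piped to a stdin-capable player as they arrive.
curl -N -X POST https://api.voiceasy.ai/v1/tts/stream \
  -H "Authorization: Bearer $VOICEASY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Streaming hello from Index TTS 2",
    "voice_id": "en-female-1",
    "format": "mp3"
  }' | mpv -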

Background and team

IndexTTS2 is developed by the Index Team, with public materials and releases hosted on GitHub. The project highlights controllable, expressive, zero‑shot TTS and is associated with the bilibili indextts initiative. Official updates and releases are published via the core repository.

Technical highlights

  • Autoregressive duration control: two generation modes — precise token‑count duration and natural autoregressive duration — to enable audio‑visual sync and flexible prosody.[repo]
  • Emotion/timbre disentanglement: independent control over timbre and emotion (style prompt), improving emotional fidelity without degrading pronunciation.[demo]
  • GPT latents & training strategy: GPT latent representations and a three‑stage training paradigm to stabilize speech in highly expressive deliveries.[repo]
  • Soft instruction via text: fine‑tuned Qwen3 enables natural‑language control of emotion and delivery without complex SSML (see the sketch after this list).[repo]
  • Ecosystem integrations: community nodes and hosted endpoints, including ComfyUI workflows and serverless inference on fal.ai.[ComfyUI, fal.ai]
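
As a sketch of the soft-instruction idea, the request below adds an instruction field carrying a plain-English delivery prompt. That field name is an assumption based on the highlight above, not documented API surface.

# Hypothetical "instruction" parameter for natural-language emotion
# control; it stands in for hand-written SSML markup.
curl -X POST https://api.voiceasy.ai/v1/tts \
  -H "Authorization: Bearer $VOICEASY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "We actually shipped it!",
    "voice_id": "en-female-1",
    "instruction": "excited, slightly breathless, rising intonation"
  }' -o excited.mp3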

Architecture at a glance

[Figure: Index TTS 2 architecture diagram. High‑level overview of IndexTTS2 components and data flow.]

The diagram is illustrative; refer to official docs for implementation specifics.

How it compares

Compared to typical zero‑shot or cloned TTS systems, IndexTTS2 emphasizes precise duration control and disentangled emotion/timbre, while maintaining naturalness. Many commercial systems focus on voice cloning and SSML‑style controls; open‑source systems often excel in zero‑shot timbre but lack easy duration control. IndexTTS2’s design targets dubbing, interactive tooling, and high‑expressivity use cases.

  • Precise duration: explicit token control for sync‑critical scenarios (see the sketch after this list).
  • Expressive emotion: natural‑language prompts vs. SSML handcrafting.
  • Zero‑shot timbre: reconstructs target timbre with separate style prompts.
  • Streaming potential: designed for low‑latency and real‑time pipelines.
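
For sync-critical work such as dubbing, a duration constraint might look like the sketch below. The duration_ms parameter is an assumption chosen for readability; the repository describes precise duration in terms of generated token counts.

# Hypothetical duration constraint: ask for a line that lands in
# exactly two seconds, e.g. to match an on-screen lip movement.
curl -X POST https://api.voiceasy.ai/v1/tts \
  -H "Authorization: Bearer $VOICEASY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This line must land in exactly two seconds.",
    "voice_id": "en-female-1",
    "duration_ms": 2000
  }' -o dubbed-line.mp3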

TTS trends to watch

  • Duration‑aware autoregressive TTS for dubbing and sync‑sensitive media.
  • Text‑driven emotion control replacing bespoke SSML.
  • Multi‑modal prosody prompts (audio + text) for refined delivery.
  • High‑quality streaming for assistants, creative tools, and live content.
  • On‑device and efficient inference with modern vocoders.

Roadmap and future plans

Public materials indicate ongoing releases, with some controls planned but not yet enabled. Expect continued improvements across duration control, stability under expressive styles, and developer tooling.

  • Enable full duration control features across hosted endpoints.
  • Broaden language coverage and style presets.
  • Iterate training data and latency optimizations for streaming.
  • Expand ecosystem integrations and SDKs.