The landscape of Text-to-Speech (TTS) just shifted. In March 2026, Fish Audio officially open-sourced its next-generation model: Fish Audio S2 (S2-Pro).

While previous models focused on “cloning” a voice, S2-Pro focuses on instructing it. This isn’t just about turning text into sound; it’s about giving developers and creators a “Director’s Chair” for AI speech. At The AI FlowHub, we believe this model is the missing link for truly autonomous, human-like voice agents.

What Makes Fish Audio S2-Pro a Game Changer?

The standout feature of S2-Pro is its Dual-Autoregressive (Dual-AR) architecture. It splits the workload between a 4-billion-parameter “Slow” model for semantics and a 400-million-parameter “Fast” model for acoustic detail. This balance delivers high-fidelity audio without the heavy latency typical of large models.
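To make the division of labor concrete, here is a toy sketch of a Dual-AR generation loop. The real models are large neural networks; these stub functions (and the 4-tokens-per-step expansion) are illustrative assumptions, not the actual S2-Pro implementation.

```python
# Conceptual sketch of a dual-autoregressive (Dual-AR) TTS loop.
# Stubs only: they mimic the "slow semantics, fast acoustics" split.

def slow_model(text: str) -> list[str]:
    """Stand-in for the large 'Slow' model: text -> coarse semantic tokens."""
    return [f"sem:{word}" for word in text.split()]

def fast_model(semantic_token: str) -> list[str]:
    """Stand-in for the small 'Fast' model: one semantic token -> acoustic codes.
    The 4x expansion factor here is an arbitrary placeholder."""
    return [f"ac:{semantic_token}:{i}" for i in range(4)]

def synthesize(text: str) -> list[str]:
    acoustic = []
    for sem in slow_model(text):          # slow pass: plan semantics/prosody
        acoustic.extend(fast_model(sem))  # fast pass: fill in acoustic detail
    return acoustic

tokens = synthesize("hello world")
print(len(tokens))  # 2 semantic tokens x 4 acoustic tokens = 8
```

The key point the sketch captures: the expensive model runs once per semantic step, while the cheap model does the high-rate acoustic work.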

1. Fine-Grained Emotional Control (Natural Language Tags)

Unlike older systems that offered a handful of rigid “mood” presets, S2-Pro uses a bracket syntax, [tag], that lets you insert natural-language instructions directly into your text.

  • Example: [whisper in a small voice] I have a secret... [laughing nervously] but I can't tell you.
  • The model supports over 15,000 unique tags, allowing for word-level control over prosody, pitch, and emotion.
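Because the tags are just inline text, composing them is a string-formatting exercise. The `tag()` helper below is our own convenience function, not part of any official SDK:

```python
def tag(instruction: str, text: str) -> str:
    """Prefix text with a bracketed natural-language instruction.
    Helper name and behavior are illustrative, not an official API."""
    return f"[{instruction}] {text}"

line = (tag("whisper in a small voice", "I have a secret...")
        + " "
        + tag("laughing nervously", "but I can't tell you."))
print(line)
# [whisper in a small voice] I have a secret... [laughing nervously] but I can't tell you.
```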

2. Production-Ready Latency (Sub-150ms)

For real-time applications like AI customer support or live gaming agents, speed is everything.

  • Time-to-First-Audio (TTFA): Optimized for ~100ms on high-end hardware (like NVIDIA H200).
  • Throughput: Capable of generating over 3,000 acoustic tokens per second.
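A quick back-of-envelope check shows why these numbers matter for interactivity. The acoustic-token rate per second of audio below is a placeholder assumption (not a published S2-Pro figure); only the TTFA and throughput come from the bullets above.

```python
# Latency budget for one voice-agent turn (back-of-envelope).
TOKENS_PER_SEC_OF_AUDIO = 600   # ASSUMED codec rate, placeholder only
THROUGHPUT = 3_000              # acoustic tokens/sec (figure above)
TTFA_MS = 100                   # time-to-first-audio (figure above)

# How much faster than playback the model generates audio:
real_time_factor = THROUGHPUT / TOKENS_PER_SEC_OF_AUDIO
print(real_time_factor)  # 5.0

# Wall-clock time until a 3-second reply is fully rendered:
total_ms = TTFA_MS + 3 * TOKENS_PER_SEC_OF_AUDIO / THROUGHPUT * 1000
print(round(total_ms))  # 700
```

Under that assumed codec rate, generation comfortably outruns playback, so the user hears audio ~100ms after the request and never waits on the tail of the clip.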

3. Native Multi-Speaker Support

You no longer need separate files for a conversation. A single S2-Pro generation pass can handle multiple distinct speaker identities seamlessly. By using <|speaker:i|> tokens, you can create a full podcast or a multi-character dialogue in one go.
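Assembling such a dialogue is again just prompt construction. The exact surrounding prompt format is our sketch based on the token syntax above; `build_dialogue()` is a hypothetical helper:

```python
def build_dialogue(turns: list[tuple[int, str]]) -> str:
    """Join (speaker_index, text) turns with <|speaker:i|> tokens.
    Sketch only; the real prompt template may differ."""
    return " ".join(f"<|speaker:{i}|> {text}" for i, text in turns)

script = build_dialogue([
    (0, "Welcome back to the show!"),
    (1, "Thanks! Great to be here."),
])
print(script)
# <|speaker:0|> Welcome back to the show! <|speaker:1|> Thanks! Great to be here.
```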

4. Optimized with SGLang

The system is built to run on the SGLang-Omni framework. This integration brings advanced serving optimizations like RadixAttention (prefix caching) and continuous batching, making it significantly cheaper and faster to host than traditional TTS pipelines.
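To see why prefix caching pays off, here is a toy RadixAttention-style cache: requests that share a prompt prefix (say, the same system prompt or reference voice) reuse cached work instead of recomputing it. This is a conceptual sketch, not the SGLang implementation.

```python
# Toy prefix cache: counts reused ("hit") vs recomputed ("miss") tokens.
class PrefixCache:
    def __init__(self):
        self.cache = {}    # token-prefix -> cached state (stubbed as True)
        self.hits = 0
        self.misses = 0

    def compute(self, tokens: tuple[str, ...]) -> None:
        # Find the longest already-cached prefix of this request.
        start = 0
        for cut in range(len(tokens), 0, -1):
            if tokens[:cut] in self.cache:
                self.hits += cut
                start = cut
                break
        # "Compute" and cache the remaining suffix token by token.
        for end in range(start + 1, len(tokens) + 1):
            self.cache[tokens[:end]] = True
            self.misses += 1

cache = PrefixCache()
system = ("You", "are", "a", "voice", "agent.")
cache.compute(system + ("Hello!",))
cache.compute(system + ("Goodbye!",))  # reuses the 5-token system prefix
print(cache.hits, cache.misses)  # 5 7
```

The second request only pays for its one new token; at scale, that shared-prefix reuse is where the hosting savings come from.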

Key Use Cases for your AI Flow

  • Interactive Voice Agents: Combine S2-Pro with an LLM (like GPT-4 or Claude) for a voice assistant that doesn’t just answer, but sounds empathetic or urgent based on the context.
  • Automated Content Dubbing: Scale your YouTube or social media presence into 80+ languages while maintaining the original speaker’s emotional nuances.
  • Gaming & Animation: Create dynamic NPCs (Non-Player Characters) that react with laughter, sighs, or anger in real-time based on player actions.
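For the interactive-agent case, the hand-off between the LLM and the TTS layer can be as simple as having the LLM pick an emotion tag alongside its reply. Both functions below are stubs with invented names, not real API calls:

```python
# Sketch of an LLM -> S2-Pro hand-off for an empathetic voice agent.
def llm_reply(user_message: str) -> tuple[str, str]:
    """Stand-in for an LLM call that returns (emotion_tag, reply_text)."""
    if "refund" in user_message.lower():
        return ("speaking gently and apologetically",
                "I completely understand. Let me fix that for you.")
    return ("cheerful", "Happy to help!")

def to_tts_prompt(user_message: str) -> str:
    """Wrap the LLM's reply in the bracket-tag syntax from section 1."""
    emotion, text = llm_reply(user_message)
    return f"[{emotion}] {text}"

print(to_tts_prompt("I need a refund."))
# [speaking gently and apologetically] I completely understand. Let me fix that for you.
```

The same pattern extends to NPCs: swap the stub for your game's dialogue system and emit [laughing], [sighing], or [angry] tags based on game state.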

How to Get Started

Because Fish Audio is committed to open-source, you can deploy this on your own infrastructure today:
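As a rough self-hosting sketch, serving would look something like the commands below. Note the model-repo path is a placeholder assumption, and the exact package name for the SGLang-Omni variant may differ; check Fish Audio's and SGLang's official repositories for the actual release artifacts.

```shell
# Install the SGLang serving stack (package name may differ for SGLang-Omni)
pip install sglang

# Launch a server; the model path below is a PLACEHOLDER, not a confirmed repo
python -m sglang.launch_server \
  --model-path fishaudio/s2-pro \
  --port 30000
```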