feat: streaming generate_stream() with sub-100ms TTFB

Splits the input text at sentence-ending punctuation (with secondary split on , ; : for sentences over 220 chars), yields one wav chunk per clause. Callers can start playback as soon as chunk 0 arrives (TTFB ~ 50 ms on M4) while the rest synthesise in the background.

API:

    for idx, wav in pipe.generate_stream('Phrase 1. Phrase 2.', voice='F1', lang='fr'):
        play_audio(wav)

For non-streaming consumers:

    chunks = [w for _, w in pipe.generate_stream(text, ...)]
    full = pipe.concat_chunks(chunks, gap_ms=80)

Bench on a 23 s French paragraph (M3 Ultra):

    chunks:  6
    TTFB:    54 ms (first 2.44 s audio chunk ready)
    total:   410 ms (RTF x56)
    Whisper: 98 % word overlap on concat

The 80 ms inter-chunk silence in concat_chunks roughly matches the natural breathing pause between sentences and masks the prosody discontinuity from independent chunk generation. Each chunk uses seed + idx so chunks don't sound identical even on repeated nouns.

Example script in examples/streaming_demo.py.
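
For readers without the library diff at hand, the splitting and concatenation rules described above can be sketched roughly as follows. This is an illustration only, not the code from this commit: `split_clauses` and `concat_with_gap` are hypothetical names, the greedy merging of short clauses is an assumption, and plain Python lists stand in for audio arrays.

```python
import re

MAX_CHARS = 220  # secondary-split threshold quoted in the commit message


def split_clauses(text, max_chars=MAX_CHARS):
    """Hypothetical splitter: sentences first, then , ; : for long sentences."""
    # Primary split: whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for sentence in sentences:
        if len(sentence) <= max_chars:
            chunks.append(sentence)
            continue
        # Secondary split: whitespace that follows , ; or :
        # Assumption: adjacent clauses are greedily merged while they fit.
        current = ""
        for part in re.split(r"(?<=[,;:])\s+", sentence):
            candidate = f"{current} {part}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = part
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


def concat_with_gap(chunks, sample_rate, gap_ms=80):
    """Hypothetical stand-in for concat_chunks: join sample lists with silence."""
    gap = [0.0] * int(sample_rate * gap_ms / 1000)
    out = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            out.extend(gap)  # silence only between chunks, not at the ends
        out.extend(chunk)
    return out


print(split_clauses("Phrase 1. Phrase 2."))
print(split_clauses("One, two, three, four, five and six.", max_chars=20))
print(len(concat_with_gap([[1.0, 1.0], [2.0]], sample_rate=1000)))
```

A splitter this naive will also break after abbreviations such as "M." in French text, so a real implementation probably needs extra handling there.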
2026-05-20 12:23:17 +02:00
parent 485f2ff476
commit ad6bcee30e
2 changed files with 130 additions and 0 deletions
--- a/examples/streaming_demo.py
+++ b/examples/streaming_demo.py
@@ -0,0 +1,47 @@
+"""Streaming TTS demo — start audio playback before synthesis finishes.
+
+For an interactive agent the time-to-first-byte (TTFB) of the TTS pipeline
+determines how snappy the conversation feels. With Supertonic 3 MLX the
+first audio chunk is ready in ~ 50 ms on M4 — well under the 100 ms
+threshold for "instantaneous".
+
+This example plays each chunk through ``sounddevice`` as soon as it is
+yielded. In a real app, push the chunks into a queue or whatever pipe / WS
+connection your app uses.
+
+    pip install sounddevice
+    python examples/streaming_demo.py
+
+If you don't have a speaker, drop ``sounddevice`` and just measure the
+chunk timings (the loop body shows how to do that).
+"""
+import time
+from supertonic_3_mlx import Pipeline
+
+PARAGRAPH = (
+    "Bonjour, je m'appelle Olivier. "
+    "Je travaille sur un projet d'intelligence artificielle. "
+    "Le modèle Supertonic est porté vers MLX pour fonctionner nativement sur Apple Silicon. "
+    "Le streaming permet à l'application de jouer l'audio avant la fin de la synthèse."
+)
+
+pipe = Pipeline.from_pretrained("ambassadia/supertonic-3-mlx")
+
+# Optional playback via sounddevice; falls back to timing-only output if it is not installed
+try:
+    import sounddevice as sd
+    have_audio = True
+except ImportError:
+    have_audio = False
+    print("(install sounddevice for live playback — measuring chunk timings only)")
+
+t_start = time.perf_counter()
+for idx, wav in pipe.generate_stream(PARAGRAPH, voice="F2", lang="fr"):
+    elapsed_ms = (time.perf_counter() - t_start) * 1000
+    label = "← TTFB" if idx == 0 else ""
+    print(f"chunk {idx}: ready in {elapsed_ms:>6.0f} ms  ({len(wav) / pipe.sample_rate:>4.2f}s of audio) {label}")
+    if have_audio:
+        # Blocks until this chunk has finished playing, so later timings include playback.
+        sd.play(wav, pipe.sample_rate, blocking=True)
+
+print("\ndone.")
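
The loop in the demo waits for each chunk to finish playing before it asks the generator for the next one. If `generate_stream` is a lazy generator (this diff does not show its implementation), synthesis and playback would then alternate rather than overlap. A minimal sketch of a queue-based alternative follows. The helper names `stream_to_queue`, `play_from_queue` and `fake_stream` are hypothetical, and `fake_stream` merely stands in for `pipe.generate_stream(...)` so the snippet runs without the model or a sound card.

```python
import queue
import threading


def stream_to_queue(chunk_iter, out_q):
    """Producer: push (idx, wav) pairs as they are generated, then a sentinel."""
    try:
        for item in chunk_iter:
            out_q.put(item)
    finally:
        out_q.put(None)  # sentinel: no more chunks, even if the iterator raised


def play_from_queue(in_q, play):
    """Consumer: call play(wav) for each chunk until the sentinel arrives."""
    played = []
    while True:
        item = in_q.get()
        if item is None:
            break
        idx, wav = item
        play(wav)
        played.append(idx)
    return played


def fake_stream(n):
    """Stand-in for pipe.generate_stream(...): yields (idx, samples) pairs."""
    for idx in range(n):
        yield idx, [0.0] * 4


q = queue.Queue()
producer = threading.Thread(
    target=stream_to_queue, args=(fake_stream(3), q), daemon=True
)
producer.start()

received = []  # stands in for the audio device
order = play_from_queue(q, received.append)
producer.join()
print(order, len(received))
```

With real audio, the `play` callback could be something like `lambda wav: sd.play(wav, pipe.sample_rate, blocking=True)`, so the consumer thread blocks on playback while the producer keeps synthesising.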