feat: streaming generate_stream() with sub-100ms TTFB

Splits the input text at sentence-ending punctuation (with a secondary
split on , ; : for sentences over 220 chars) and yields one wav chunk
per clause. Callers can start playback as soon as chunk 0 arrives
(TTFB ~50 ms on M4) while the rest synthesise in the background.

API:

    for idx, wav in pipe.generate_stream('Phrase 1. Phrase 2.', voice='F1', lang='fr'):
        play_audio(wav)

For non-streaming consumers:

    chunks = [w for _, w in pipe.generate_stream(text, ...)]
    full = pipe.concat_chunks(chunks, gap_ms=80)
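
The splitting rule described above can be sketched as follows. This is a minimal illustration, not the code in this commit: the function name `split_for_streaming`, the regular expressions, and the choice to split only where punctuation is followed by whitespace are all assumptions. Only the 220-character threshold and the two punctuation sets come from the description.

```python
import re

MAX_SENTENCE_CHARS = 220  # threshold quoted in the commit message


def split_for_streaming(text: str, max_chars: int = MAX_SENTENCE_CHARS) -> list[str]:
    """Hypothetical splitter: sentences first, then clauses for long sentences."""
    # Primary split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    for sentence in sentences:
        if not sentence:
            continue  # empty input produces one empty string; skip it
        if len(sentence) <= max_chars:
            chunks.append(sentence)
        else:
            # Secondary split: break after , ; or : when followed by whitespace.
            chunks.extend(c for c in re.split(r"(?<=[,;:])\s+", sentence) if c)
    return chunks


print(split_for_streaming("Phrase 1. Phrase 2."))
```

Keeping the punctuation attached to the preceding chunk (via the lookbehind) means each chunk still ends with its own terminator, which a TTS model can use as a prosody cue.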

Bench on a 23 s French paragraph (M3 Ultra):

    chunks:  6
    TTFB:    54 ms (first chunk ready, 2.44 s of audio)
    total:   410 ms (~56x faster than real time)
    Whisper: 98 % word overlap on concat

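
As a sanity check on the figures above, the speed-up factor and the playback headroom can be recomputed from the reported numbers. The helper names are illustrative, and the headroom check assumes chunks are synthesised one after another.

```python
def speedup(audio_seconds: float, synth_seconds: float) -> float:
    """How many times faster than real time the synthesis ran."""
    return audio_seconds / synth_seconds


def remaining_synth_seconds(total_seconds: float, ttfb_seconds: float) -> float:
    """Synthesis time still outstanding once the first chunk is ready."""
    return total_seconds - ttfb_seconds


factor = speedup(23.0, 0.410)                      # 23 s of audio in 410 ms
remaining = remaining_synth_seconds(0.410, 0.054)  # work left after chunk 0

print(round(factor))                # 56
print(f"{remaining * 1000:.0f} ms")  # 356 ms
# Chunk 0 carries 2.44 s of audio, far more than the 0.356 s of synthesis
# still pending, so playback should not run dry while waiting for chunk 1.
print(2.44 > remaining)             # True
```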
The 80 ms inter-chunk silence in concat_chunks roughly matches the
natural breathing pause between sentences and masks the prosody
discontinuity from independent chunk generation. Each chunk uses
seed + idx so chunks don't sound identical even on repeated nouns.
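
The gap insertion and per-chunk seeding described above can be sketched like this. It is an assumption-laden illustration rather than the implementation of `concat_chunks`: plain Python lists stand in for the waveform arrays, the sample rate is passed in explicitly because the source does not state it, and `concat_with_gaps` and `chunk_seeds` are hypothetical names.

```python
def concat_with_gaps(chunks, sample_rate, gap_ms=80):
    """Join waveform chunks, inserting gap_ms of silence between neighbours."""
    gap = [0.0] * int(sample_rate * gap_ms / 1000)  # silence, in samples
    out = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            out.extend(gap)  # no leading or trailing silence
        out.extend(chunk)
    return out


def chunk_seeds(base_seed, n_chunks):
    """One seed per chunk (seed + idx), so no two chunks share a seed."""
    return [base_seed + idx for idx in range(n_chunks)]


full = concat_with_gaps([[1.0, 1.0], [2.0]], sample_rate=1000, gap_ms=80)
print(len(full))           # 2 + 80 + 1 = 83 samples
print(chunk_seeds(42, 3))  # [42, 43, 44]
```

Deriving the seed from the chunk index keeps the whole stream reproducible from a single base seed while still varying the sampling between chunks.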
Example script in examples/streaming_demo.py.
examples/streaming_demo.py (new file, 47 lines)
@@ -0,0 +1,47 @@
"""Streaming TTS demo: start audio playback before synthesis finishes.

For an interactive agent the time-to-first-byte (TTFB) of the TTS pipeline
determines how snappy the conversation feels. With Supertonic 3 MLX the
first audio chunk is ready in ~50 ms on M4, well under the 100 ms
threshold for "instantaneous".

This example plays each chunk through ``sounddevice`` as soon as it is
yielded. Replace the playback call with whatever queue, pipe, or WebSocket
connection your app uses.

    pip install sounddevice
    python examples/streaming_demo.py

If you don't have a speaker, drop ``sounddevice`` and just measure the
chunk timings (the loop body shows how to do that).
"""
import time
from supertonic_3_mlx import Pipeline

PARAGRAPH = (
    "Bonjour, je m'appelle Olivier. "
    "Je travaille sur un projet d'intelligence artificielle. "
    "Le modèle Supertonic est porté vers MLX pour fonctionner nativement sur Apple Silicon. "
    "Le streaming permet à l'application de jouer l'audio avant la fin de la synthèse."
)

pipe = Pipeline.from_pretrained("ambassadia/supertonic-3-mlx")

# Optional playback via sounddevice; without it the demo only reports timings
try:
    import sounddevice as sd
    have_audio = True
except ImportError:
    have_audio = False
    print("(install sounddevice for live playback; measuring chunk timings only)")

t_start = time.perf_counter()
for idx, wav in pipe.generate_stream(PARAGRAPH, voice="F2", lang="fr"):
    elapsed_ms = (time.perf_counter() - t_start) * 1000
    label = "← TTFB" if idx == 0 else ""
    print(f"chunk {idx}: ready in {elapsed_ms:>6.0f} ms ({len(wav) / pipe.sample_rate:>4.2f}s of audio) {label}")
    if have_audio:
        sd.play(wav, pipe.sample_rate, blocking=False)
        sd.wait()

print("\ndone.")
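
One way to decouple synthesis from playback is a producer/consumer queue: a background thread pulls chunks from the generator while the main thread plays them. The sketch below is a hedged illustration using only the standard library. `fake_generate_stream` is a hypothetical stand-in for `pipe.generate_stream()`, the `play` callback stands in for real audio output, and whether the real pipeline can safely be driven from a background thread is not stated in the source.

```python
import queue
import threading
import time


def fake_generate_stream(n_chunks=3, synth_s=0.01):
    """Hypothetical stand-in for pipe.generate_stream(): yields (idx, wav)."""
    for idx in range(n_chunks):
        time.sleep(synth_s)      # pretend to synthesise a chunk
        yield idx, [0.0] * 100   # dummy waveform


def producer(stream, q):
    """Push every chunk into the queue, then a sentinel to signal the end."""
    for item in stream:
        q.put(item)
    q.put(None)


def consume(q, play):
    """Play chunks in arrival order until the sentinel is seen."""
    played = []
    while True:
        item = q.get()           # blocks until the next chunk is ready
        if item is None:
            break
        idx, wav = item
        play(wav)
        played.append(idx)
    return played


q = queue.Queue()
worker = threading.Thread(target=producer, args=(fake_generate_stream(), q), daemon=True)
worker.start()
played = consume(q, play=lambda wav: time.sleep(0.02))  # simulated playback
worker.join()
print(played)  # [0, 1, 2]
```

With this arrangement, later chunks are synthesised while earlier ones are still playing, so the playback loop does not stall the generator.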