1 Commits

Author SHA1 Message Date
ambassadia
ad6bcee30e feat: streaming generate_stream() with sub-100ms TTFB
Splits the input text at sentence-ending punctuation (with secondary
split on , ; : for sentences over 220 chars), yields one wav chunk
per clause. Callers can start playback as soon as chunk 0 arrives —
TTFB ~ 50 ms on M4 — while the rest synthesise in the background.

API:
    for idx, wav in pipe.generate_stream('Phrase 1. Phrase 2.', voice='F1', lang='fr'):
        play_audio(wav)

For non-streaming consumers:
    chunks = [w for _, w in pipe.generate_stream(text, ...)]
    full   = pipe.concat_chunks(chunks, gap_ms=80)

Bench on a 23 s French paragraph (M3 Ultra):
    chunks:    6
    TTFB:      54 ms  (first 2.44 s audio chunk ready)
    total:    410 ms  (RTF x56)
    Whisper:   98 % word overlap on concat

The 80 ms inter-chunk silence in concat_chunks roughly matches the
natural breathing pause between sentences and masks the prosody
discontinuity from independent chunk generation. Each chunk uses
seed + idx so chunks don't sound identical even on repeated nouns.

Example script in examples/streaming_demo.py.
2026-05-20 12:23:17 +02:00