supertonic-3-mlx

olivier/supertonic-3-mlx


Commit Graph


Author

SHA1

Message

Date

ambassadia

ad6bcee30e

feat: streaming generate_stream() with sub-100ms TTFB

Splits the input text at sentence-ending punctuation (with a secondary
split on , ; : for sentences over 220 chars) and yields one wav chunk
per clause. Callers can start playback as soon as chunk 0 arrives
(TTFB ~ 50 ms on M4) while the remaining chunks synthesise in the
background.
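
The splitting rule described above could look roughly like the sketch below. It is not the code from this commit: the helper name `split_for_streaming`, the choice of `. ! ?` as sentence-ending punctuation, and cutting a long sentence at every `, ; :` are all assumptions made for illustration. Only the 220-character threshold comes from the message.

```python
import re

MAX_SENTENCE_CHARS = 220  # threshold quoted in the commit message


def split_for_streaming(text, max_chars=MAX_SENTENCE_CHARS):
    """Hypothetical helper: split text into clauses for chunked synthesis."""
    # Primary split: break after ., ! or ? when followed by whitespace.
    # (Which characters count as sentence-ending is an assumption.)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    clauses = []
    for sentence in sentences:
        if len(sentence) <= max_chars:
            clauses.append(sentence)
            continue
        # Secondary split: only sentences over the threshold are cut,
        # here after every , ; or : followed by whitespace.
        parts = re.split(r"(?<=[,;:])\s+", sentence)
        clauses.extend(p.strip() for p in parts if p.strip())
    return clauses
```

With this sketch, `split_for_streaming('Phrase 1. Phrase 2.')` gives two clauses, one per sentence. The real implementation may merge short clauses or handle abbreviations differently.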

API:
    for idx, wav in pipe.generate_stream('Phrase 1. Phrase 2.', voice='F1', lang='fr'):
        play_audio(wav)
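
If play_audio() blocks while a chunk plays, the loop above only asks for the next chunk once playback returns. One possible way to keep synthesis running in the background is to drain the generator from a worker thread, as in the sketch below. The function name `play_while_synthesising` is hypothetical, and whether the real pipeline can safely be driven from a worker thread is not stated in the commit message.

```python
import queue
import threading


def play_while_synthesising(chunk_iter, play):
    """Hypothetical helper: call play(wav) per chunk while the producer runs ahead."""
    q = queue.Queue()
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            # Pull (idx, wav) pairs as fast as the generator yields them.
            for item in chunk_iter:
                q.put(item)
        finally:
            # Always unblock the consumer, even if the generator raised.
            q.put(done)

    threading.Thread(target=producer, daemon=True).start()

    # Consumer: plays chunks in order as soon as each one is available.
    while True:
        item = q.get()
        if item is done:
            break
        _idx, wav = item
        play(wav)
```

Usage would be `play_while_synthesising(pipe.generate_stream(text, voice='F1', lang='fr'), play_audio)`. Note that an exception raised inside the generator is swallowed by the worker thread in this sketch.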

For non-streaming consumers:
    chunks = [w for _, w in pipe.generate_stream(text, ...)]
    full   = pipe.concat_chunks(chunks, gap_ms=80)

Bench on a 23 s French paragraph (M3 Ultra):
    chunks:    6
    TTFB:      54 ms  (first 2.44 s audio chunk ready)
    total:    410 ms  (RTF x56)
    Whisper:   98 % word overlap on concat
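
The figures above can be reproduced in spirit with a small timing harness. The sketch below is an assumption about how such numbers might be collected, not the benchmark script used for this commit. `measure_stream` is a hypothetical name, and `sample_rate` must be supplied by the caller because the message does not state it. "RTF x56" is read here as audio duration divided by wall-clock time (23 s / 0.410 s ≈ 56).

```python
import time


def measure_stream(chunk_iter, sample_rate):
    """Hypothetical harness: time-to-first-chunk, total time and speed-up factor."""
    start = time.perf_counter()  # started before the generator does any work
    ttfb = None
    n_samples = 0
    n_chunks = 0
    for _idx, wav in chunk_iter:
        if ttfb is None:
            # Wall-clock time until the first chunk is available.
            ttfb = time.perf_counter() - start
        n_samples += len(wav)
        n_chunks += 1
    total = time.perf_counter() - start
    audio_s = n_samples / sample_rate
    return {
        "chunks": n_chunks,
        "ttfb_s": ttfb,
        "total_s": total,
        "audio_s": audio_s,
        # Seconds of audio produced per second of compute.
        "rtf_x": audio_s / total if total > 0 else float("inf"),
    }
```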

The 80 ms inter-chunk silence in concat_chunks roughly matches the
natural breathing pause between sentences and masks the prosody
discontinuity from independent chunk generation. Each chunk uses
seed + idx so chunks don't sound identical even on repeated nouns.
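
Joining chunks with a fixed silence could be sketched as follows. This is an illustration only: it works on plain Python lists of float samples, whereas the real concat_chunks presumably operates on array types, and `concat_with_gap` and its `sample_rate` parameter are hypothetical names.

```python
def concat_with_gap(chunks, sample_rate, gap_ms=80):
    """Hypothetical helper: join sample lists with gap_ms of silence between them."""
    # Number of zero-valued samples that make up one gap.
    gap = [0.0] * int(sample_rate * gap_ms / 1000)
    out = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            # Silence goes between chunks only, never before the first
            # chunk or after the last one.
            out.extend(gap)
        out.extend(chunk)
    return out
```

For the 6-chunk benchmark above, inserting gaps only between chunks would add 5 x 80 ms = 400 ms of silence to the concatenated audio.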

Example script in examples/streaming_demo.py.

2026-05-20 12:23:17 +02:00

1 Commits