supertonic-3-mlx

olivier/supertonic-3-mlx

Fork 0

Commit Graph

Author	SHA1	Message	Date
ambassadia	d32aaae32d	feat: create_voice() — mix presets to synthesise custom voices The 10 preset voices live on a hypersphere of radius ≈ 7.1 in the 12 800-D style-token space (verified empirically: pairwise cosines 0.86-0.97, SVD shows 7 axes cover 99 % of variance). Linear or spherical interpolation between presets stays in the trained distribution and produces new intelligible voices. API: voice = pipe.create_voice({'F2': 0.7, 'M1': 0.3}) # slerp by default voice = pipe.create_voice({'F2': 0.5, 'M1': 0.5}, interp='lerp') wav = pipe.generate('Bonjour', voice=voice, lang='fr') The voice argument of pipe.generate() now accepts either a preset name (str) or a custom voice descriptor (dict from create_voice). Whisper validation on 6 custom blends (FR test phrase): F2 70 / M1 30 → 100 % (lightly androgyne F voice) F2 50 / M1 50 → 91 % (true androgyne) avg of 5 F voices → 100 % (mean feminine timbre) avg of 5 M voices → 91 % (mean masculine timbre) warm fem (F4+F5) → 91 % bright masc (M1+M5) → 100 % All blends remain intelligible — the trained voice manifold is convex enough that interpolations don't fall out of the model's distribution. Example script in examples/custom_voice_demo.py.	2026-05-20 12:25:15 +02:00
ambassadia	ad6bcee30e	feat: streaming generate_stream() with sub-100ms TTFB Splits the input text at sentence-ending punctuation (with secondary split on , ; : for sentences over 220 chars), yields one wav chunk per clause. Callers can start playback as soon as chunk 0 arrives — TTFB ~ 50 ms on M4 — while the rest synthesise in the background. API: for idx, wav in pipe.generate_stream('Phrase 1. Phrase 2.', voice='F1', lang='fr'): play_audio(wav) For non-streaming consumers: chunks = [w for _, w in pipe.generate_stream(text, ...)] full = pipe.concat_chunks(chunks, gap_ms=80) Bench on a 23 s French paragraph (M3 Ultra): chunks: 6 TTFB: 54 ms (first 2.44 s audio chunk ready) total: 410 ms (RTF x56) Whisper: 98 % word overlap on concat The 80 ms inter-chunk silence in concat_chunks roughly matches the natural breathing pause between sentences and masks the prosody discontinuity from independent chunk generation. Each chunk uses seed + idx so chunks don't sound identical even on repeated nouns. Example script in examples/streaming_demo.py.	2026-05-20 12:23:17 +02:00
transcrilive	12dbf4a821	v0.1.0 — initial release MLX-native port of Supertone's Supertonic 3 multilingual TTS. Runs the full flow-matching + classifier-free-guidance pipeline at ~x100 realtime on Apple Silicon, with audio cosine 1.0 vs the cached MLX path and cosine 0.98 vs the upstream ONNX Runtime reference. Weights are hosted at https://huggingface.co/ambassadia/supertonic-3-mlx and auto-downloaded on first use; this repository ships the port code, the model card, audio samples, and a zero-config setup_and_test.sh. Install: pip install git+https://gitea.tavportal.com/olivier/supertonic-3-mlx.git Quick test: git clone https://gitea.tavportal.com/olivier/supertonic-3-mlx.git cd supertonic-3-mlx && ./setup_and_test.sh Licenses (dual): model weights = BigScience Open RAIL-M (Section 4 propagation), port code = Apache-2.0. See LICENSE, LICENSE-CODE, NOTICE. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 09:17:05 +02:00

Author

SHA1

Message

Date

ambassadia

d32aaae32d

feat: create_voice() — mix presets to synthesise custom voices

The 10 preset voices live on a hypersphere of radius ≈ 7.1 in the
12 800-D style-token space (verified empirically: pairwise cosines
0.86-0.97, SVD shows 7 axes cover 99 % of variance). Linear or
spherical interpolation between presets stays in the trained
distribution and produces new intelligible voices.

API:
    voice = pipe.create_voice({'F2': 0.7, 'M1': 0.3})   # slerp by default
    voice = pipe.create_voice({'F2': 0.5, 'M1': 0.5}, interp='lerp')
    wav   = pipe.generate('Bonjour', voice=voice, lang='fr')

The voice argument of pipe.generate() now accepts either a preset
name (str) or a custom voice descriptor (dict from create_voice).

Whisper validation on 6 custom blends (FR test phrase):
    F2 70 / M1 30          → 100 % (lightly androgyne F voice)
    F2 50 / M1 50          →  91 % (true androgyne)
    avg of 5 F voices      → 100 % (mean feminine timbre)
    avg of 5 M voices      →  91 % (mean masculine timbre)
    warm fem (F4+F5)       →  91 %
    bright masc (M1+M5)    → 100 %

All blends remain intelligible — the trained voice manifold is convex
enough that interpolations don't fall out of the model's distribution.

Example script in examples/custom_voice_demo.py.

2026-05-20 12:25:15 +02:00

ambassadia

ad6bcee30e

feat: streaming generate_stream() with sub-100ms TTFB

Splits the input text at sentence-ending punctuation (with secondary
split on , ; : for sentences over 220 chars), yields one wav chunk
per clause. Callers can start playback as soon as chunk 0 arrives —
TTFB ~ 50 ms on M4 — while the rest synthesise in the background.

API:
    for idx, wav in pipe.generate_stream('Phrase 1. Phrase 2.', voice='F1', lang='fr'):
        play_audio(wav)

For non-streaming consumers:
    chunks = [w for _, w in pipe.generate_stream(text, ...)]
    full   = pipe.concat_chunks(chunks, gap_ms=80)

Bench on a 23 s French paragraph (M3 Ultra):
    chunks:    6
    TTFB:      54 ms  (first 2.44 s audio chunk ready)
    total:    410 ms  (RTF x56)
    Whisper:   98 % word overlap on concat

The 80 ms inter-chunk silence in concat_chunks roughly matches the
natural breathing pause between sentences and masks the prosody
discontinuity from independent chunk generation. Each chunk uses
seed + idx so chunks don't sound identical even on repeated nouns.

Example script in examples/streaming_demo.py.

2026-05-20 12:23:17 +02:00

transcrilive

12dbf4a821

v0.1.0 — initial release

MLX-native port of Supertone's Supertonic 3 multilingual TTS. Runs the
full flow-matching + classifier-free-guidance pipeline at ~x100 realtime
on Apple Silicon, with audio cosine 1.0 vs the cached MLX path and
cosine 0.98 vs the upstream ONNX Runtime reference.

Weights are hosted at https://huggingface.co/ambassadia/supertonic-3-mlx
and auto-downloaded on first use; this repository ships the port code,
the model card, audio samples, and a zero-config setup_and_test.sh.

Install:
    pip install git+https://gitea.tavportal.com/olivier/supertonic-3-mlx.git

Quick test:
    git clone https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
    cd supertonic-3-mlx && ./setup_and_test.sh

Licenses (dual): model weights = BigScience Open RAIL-M (Section 4
propagation), port code = Apache-2.0. See LICENSE, LICENSE-CODE, NOTICE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 09:17:05 +02:00

3 Commits