Fix two coupled bugs that produced structureless audio on the v0.1.0
release (Whisper only hallucinated, e.g. 'Société Radio-Canada').
Fix #1: Euler timestep schedule (PRIMARY, smoking gun)
The ONNX SDK passes current_step = 0..N-1, so for N = 5 steps
t_norm = [0.0, 0.2, 0.4, 0.6, 0.8].
We were passing step + 1, giving [0.2, 0.4, 0.6, 0.8, 1.0].
The flow-matching model is trained on the SDK schedule; the off-by-one
collapses the trajectory to noise (ONNX-only ablation: wav cosine 0.0037
vs ref).
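A minimal sketch of the two schedules, assuming t_norm = step / N as the
values above imply. The names t_norm_schedule and euler_integrate are
hypothetical, and the scalar state and velocity callback stand in for the
real latent tensor and vector estimator.

```python
from typing import Callable, List


def t_norm_schedule(num_steps: int, offset: int = 0) -> List[float]:
    """Normalised time passed to the velocity model at each Euler step.

    offset=0 reproduces the SDK schedule (current_step = 0..N-1);
    offset=1 reproduces the off-by-one (step + 1).
    """
    return [(step + offset) / num_steps for step in range(num_steps)]


def euler_integrate(
    x: float,
    velocity: Callable[[float, float], float],
    num_steps: int,
    offset: int = 0,
) -> float:
    """Toy fixed-step Euler loop: x <- x + dt * v(x, t)."""
    dt = 1.0 / num_steps
    for t in t_norm_schedule(num_steps, offset):
        x = x + dt * velocity(x, t)
    return x


sdk_schedule = t_norm_schedule(5)              # [0.0, 0.2, 0.4, 0.6, 0.8]
buggy_schedule = t_norm_schedule(5, offset=1)  # [0.2, 0.4, 0.6, 0.8, 1.0]
```

Even with a toy velocity v(x, t) = t the two schedules integrate to
different end points, so every step queries the model at a time it was
not trained to see at that point of the trajectory.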
Fix #2: text preprocessing (SECONDARY)
Supertonic 3 wraps utterances in <lang>text</lang> via the SDK's
UnicodeProcessor; we were emitting raw character IDs and ignoring lang.
Min-viable port: NFKD normalisation + whitespace collapse + trailing
period + language token wrap. Whisper output is bit-identical to that
obtained with the full SDK preprocessor (verified inline).
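The four steps of the min-viable port could look roughly like the sketch
below. It is not the SDK's UnicodeProcessor: preprocess_text is a
hypothetical name, the tag is assumed to be the language code itself
(e.g. <fr>...</fr>), and skipping the period when the text already ends
in ".", "!" or "?" is an assumption.

```python
import re
import unicodedata


def preprocess_text(text: str, lang: str) -> str:
    """NFKD-normalise, collapse whitespace, ensure a final period, wrap in a lang tag."""
    text = unicodedata.normalize("NFKD", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Assumption: only add the period when no sentence-final mark is present.
    if text and text[-1] not in ".!?":
        text += "."
    # Assumption: the wrapper tag is the language code, e.g. <fr>...</fr>.
    return f"<{lang}>{text}</{lang}>"
```

Note that NFKD decomposes accented characters (é becomes e followed by
U+0301), so the character-to-ID table has to cover combining marks.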
Measured impact (FR test phrase, Whisper-large-v3):
before: 10/10 voices → 0% word overlap (Whisper hallucinations only)
after:  M2 56%, F1/F2 25%, F3 19%, F5/M1 12%, M4 6%, F4/M3/M5 0%
Audio is now structurally voiced French with target words appearing in
the best voices, but still falls short of the ONNX SDK 81-88% ceiling.
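The exact word-overlap metric is not spelled out here. One plausible
reading, sketched under that assumption, is the fraction of reference
words that reappear in the Whisper transcript (case-insensitive, order
ignored). word_overlap is a hypothetical helper, not the script used for
the numbers above.

```python
import re


def word_overlap(reference: str, transcript: str) -> float:
    """Fraction of reference words found anywhere in the transcript."""
    ref_words = re.findall(r"\w+", reference.lower())
    hyp_words = set(re.findall(r"\w+", transcript.lower()))
    if not ref_words:
        return 0.0
    hits = sum(1 for word in ref_words if word in hyp_words)
    return hits / len(ref_words)
```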
Per-step Euler bisect (same conditioning, ONNX vs MLX VE side-by-side)
shows the residual bug is in the VE velocity prediction: cosine drops
1.000 → 0.9995 → 0.965 → 0.889 → 0.673 → 0.453 across steps 0..5,
exponential compounding from a ~0.05% per-step drift. The investigation
continues in a follow-up commit.
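A per-step bisect of this kind can be sketched as below, assuming the
intermediate states of both backends have been dumped as flat lists of
floats, one per step. cosine and first_divergent_step are hypothetical
helpers and the 0.999 threshold is an arbitrary illustration.

```python
import math
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equally sized flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def first_divergent_step(
    ref_states: List[Sequence[float]],
    test_states: List[Sequence[float]],
    threshold: float = 0.999,
) -> Optional[int]:
    """Index of the first step whose cosine falls below the threshold, else None."""
    for step, (ref, test) in enumerate(zip(ref_states, test_states)):
        if cosine(ref, test) < threshold:
            return step
    return None
```

Feeding both backends the same input state at every step, rather than
letting each run its own trajectory, separates the per-step error from
the compounded one.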
Repos remain PRIVATE on HF + GitHub until the full fix lands.
T.6 post-publish audit caught two leaks in the published artefacts:
1. `conversion_report.json` (4 hits on both HF and GitHub) exposed
absolute paths from the build machine:
"safetensors": "/Users/transcrilive/MLX_CONVERTOR/sub-projects/supertonic3-mlx/hf_release/weights/X.safetensors"
"onnx": "/tmp/supertonic3/model/onnx/X.onnx"
This revealed the dev Mac's username (transcrilive) + the private
monorepo name (MLX_CONVERTOR) + the internal sub-projects layout.
2. `src/supertonic_3_mlx/pipeline.py` docstring (1 hit) had a
from_pretrained example pointing at /tmp/supertonic3/model.
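An audit pass for this class of leak can be approximated with a regex
scan over each published text file. This is a sketch, not the T.6 audit
tool: find_path_leaks and the list of path prefixes (/Users, /home,
/tmp) are assumptions.

```python
import re
from typing import List, Tuple

# Assumption: these prefixes cover the build machines of interest.
ABS_PATH = re.compile(r"""(?<![\w.])/(?:Users|home|tmp)/[^\s"']+""")


def find_path_leaks(text: str) -> List[Tuple[int, str]]:
    """Return (line number, matched path) for every absolute-path hit."""
    return [
        (lineno, match.group(0))
        for lineno, line in enumerate(text.splitlines(), start=1)
        for match in ABS_PATH.finditer(line)
    ]
```

The lookbehind keeps URL path segments such as https://host/tmp/x from
matching; other false positives are still possible and hits should be
reviewed by hand.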
Fixes:
- conversion_report.json regenerated with basenames only
("vector_estimator.onnx" / "weights/vector_estimator.safetensors")
- pipeline.py docstring example updated to use the canonical Hub repo id
- the upstream converter tool (in the dev monorepo) patched so future
regenerations of the report don't reintroduce the leak
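The basename rewrite can be sketched as follows. The report schema is
assumed to be nested dicts with string leaves, sanitize_report is a
hypothetical name, and keeping the "weights/" parent while dropping all
others mirrors the two examples above rather than the actual converter
patch.

```python
import posixpath
from typing import Any, Dict, Tuple


def sanitize_report(report: Dict[str, Any], keep_parent: Tuple[str, ...] = ("weights",)) -> Dict[str, Any]:
    """Replace absolute paths with basenames, keeping whitelisted parent dirs."""
    clean: Dict[str, Any] = {}
    for key, value in report.items():
        if isinstance(value, dict):
            clean[key] = sanitize_report(value, keep_parent)
        elif isinstance(value, str) and value.startswith("/"):
            parent = posixpath.basename(posixpath.dirname(value))
            name = posixpath.basename(value)
            clean[key] = f"{parent}/{name}" if parent in keep_parent else name
        else:
            clean[key] = value  # non-path values pass through unchanged
    return clean
```

Lists of paths are not handled in this sketch; a real report with list
values would need an extra branch.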
No tokens, credentials, or keys were ever exposed; tokens are kept only
in env vars / keyrings and never enter the published artefacts.
Adds the Newton-sentence benchmark numbers measured on two real Macs +
the upstream CoreML and ONNX baselines. Highlights:
- Mac Studio M3 Ultra: 45.8 ms wall median (best 39 ms), RTF x88
- MacBook Air M4: 86.7 ms wall median, RTF x47
- M4 + CoreML: 303.5 ms wall median, RTF x27
- M4 + ONNX SDK: ~1200 ms wall median, RTF ~x3
Same FR utterance, same warmup protocol, 5 warm runs each. The
ms-per-second-of-audio column is the honest backend comparison since the
MLX and CoreML paths produce slightly different audio durations
(DurationPredictor + CoreML's speed=1.05 give different timing). MLX is
1.78× faster than the CoreML build on identical M4 hardware, and
~35-40× faster than the upstream ONNX SDK.
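For reference, the reported columns can be derived from raw timings as
in this sketch. It assumes RTF means audio duration divided by wall time
(so x88 reads as 88 times faster than realtime); summarize_runs is a
hypothetical helper, not the benchmark script itself.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(wall_ms_runs: Sequence[float], audio_seconds: float) -> Dict[str, float]:
    """Median/best wall time, RTF and ms per second of audio for warm runs."""
    median_ms = statistics.median(wall_ms_runs)
    return {
        "wall_median_ms": median_ms,
        "wall_best_ms": min(wall_ms_runs),
        # Assumption: RTF = seconds of audio produced per second of wall time.
        "rtf": audio_seconds / (median_ms / 1000.0),
        # Normalises away differences in generated audio duration.
        "ms_per_audio_second": median_ms / audio_seconds,
    }
```

Comparing ms_per_audio_second rather than raw wall time keeps a backend
that happens to generate a shorter clip from looking faster than it is.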
GPU memory footprint on the Ultra: 750 MB active, 844 MB peak.
MLX-native port of Supertone's Supertonic 3 multilingual TTS. Runs the
full flow-matching + classifier-free-guidance pipeline at ~x100 realtime
on Apple Silicon, with audio cosine 1.0 vs the cached MLX path and
cosine 0.98 vs the upstream ONNX Runtime reference.
Weights are hosted at https://huggingface.co/ambassadia/supertonic-3-mlx
and auto-downloaded on first use; this repository ships the port code,
the model card, audio samples, and a zero-config setup_and_test.sh.
Install:
pip install git+https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
Quick test:
git clone https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
cd supertonic-3-mlx && ./setup_and_test.sh
Licenses (dual): model weights = BigScience Open RAIL-M (Section 4
propagation), port code = Apache-2.0. See LICENSE, LICENSE-CODE, NOTICE.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>