supertonic-3-mlx/README.md
ambassadia d9f43c2531 docs: add multi-machine bench (M3 Ultra 45.8ms / M4 86.7ms / CoreML 303ms / ONNX 1200ms)
Adds the Newton-sentence benchmark numbers measured on two real Macs +
the upstream CoreML and ONNX baselines. Highlights:

- Mac Studio M3 Ultra: 45.8 ms wall median (best 39 ms), RTF x88
- MacBook Air M4:      86.7 ms wall median,               RTF x47
- M4 + CoreML:        303.5 ms wall median,               RTF x27
- M4 + ONNX SDK:     ~1200 ms wall median,               RTF ~x3

Same FR utterance, same warmup protocol, 5 warm runs each. The
ms-per-second-of-audio column is the honest backend comparison since the
two paths produce slightly different audio durations (DurationPredictor
+ CoreML's speed=1.05 give different timing). MLX wins 1.78× over the
CoreML build on identical M4 hardware, and ~35-40× over the upstream
ONNX SDK.

GPU memory footprint on the Ultra: 750 MB active, 844 MB peak.
2026-05-20 09:48:20 +02:00

---
license: openrail
license_link: LICENSE
language:
- en
- fr
- de
- es
- it
- pt
- ja
- ko
- zh
- ru
- pl
- nl
- tr
- ar
- hi
- vi
- th
- id
- cs
- ro
- hu
- el
- da
- sv
- fi
- no
- he
- uk
- bg
- hr
- sk
pipeline_tag: text-to-speech
tags:
- mlx
- apple-silicon
- tts
- text-to-speech
- speech-synthesis
- supertonic
- multilingual
- flow-matching
library_name: supertonic-3-mlx
base_model: Supertone/supertonic-3
inference: false
---
# Supertonic 3 — MLX-native
**31-language text-to-speech, ~x100 realtime on Apple Silicon.**
Native MLX port of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3),
runs the full flow-matching + classifier-free-guidance pipeline (DurationPredictor →
TextEncoder → 24-block VectorEstimator (5 Euler steps) → 10-block Vocos vocoder)
without ONNX, CoreML or any C++ runtime — only MLX + NumPy.
## Install
The package isn't on PyPI yet; install it directly from the Gitea source
repository (or from a local checkout):
```bash
pip install git+https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
```
Runtime dependencies are just `mlx`, `numpy`, and `huggingface_hub` (the
last for the one-line weight download). On first use the ~ 400 MB weight
bundle is downloaded from
[`ambassadia/supertonic-3-mlx`](https://huggingface.co/ambassadia/supertonic-3-mlx)
into your Hugging Face cache.
### One-shot quickstart + sanity test
A zero-config end-to-end test script ships with the repo. Clone the repo,
run the script, and it will create a fresh venv, install everything,
version-check MLX (with an optional auto-upgrade), download the weights
and synthesise an utterance into `hello.wav`:
```bash
git clone https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
cd supertonic-3-mlx
./setup_and_test.sh # en F1, default text
./setup_and_test.sh fr F2 "Bonjour." # custom lang / voice / text
```
Re-runs reuse the venv and the cached weights, so a second invocation costs
only ~20 ms for the warm load plus ~30 ms per `generate` call.
## Quickstart (after install)
```python
import soundfile as sf  # not a runtime dependency: pip install soundfile
from supertonic_3_mlx import Pipeline

# Downloads the weights into the Hugging Face cache on first use
pipe = Pipeline.from_pretrained("ambassadia/supertonic-3-mlx")
wav = pipe.generate("Hello world from Apple Silicon.", voice="F1", lang="en")
# wav is a 1-D numpy.float32 array at 44.1 kHz
sf.write("hello.wav", wav, pipe.sample_rate)
```
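
The quickstart writes the file with `soundfile`, which is not among the runtime dependencies listed under Install. If you would rather not install it, the standard-library `wave` module can write 16-bit PCM. The following is a minimal sketch, not part of the package: `save_wav_pcm16` is a hypothetical helper, and it assumes the samples are mono floats in the range [-1, 1].

```python
import array
import sys
import wave


def save_wav_pcm16(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file.

    Hypothetical helper; accepts any iterable of floats, including a
    1-D NumPy array.
    """
    pcm = array.array("h")  # signed 16-bit integers
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # bytes per sample
        f.setframerate(int(sample_rate))
        f.writeframes(pcm.tobytes())
```

With the objects from the quickstart, the call would be `save_wav_pcm16("hello.wav", wav, pipe.sample_rate)`. The conversion quantises to 16 bits, so keep `soundfile` if you need to preserve the float32 samples.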
## Audio samples
Six samples in five languages, with a mix of male / female voices and of short
and long utterances, all generated by the MLX pipeline at the wall times reported below.
<audio controls src="samples/en_F1_short.wav"></audio> &nbsp; **EN · F1 · 2.79 s**
"Hello world from Apple Silicon. Supertonic 3 runs at one hundred times real time."
<audio controls src="samples/en_M1_long.wav"></audio> &nbsp; **EN · M1 · 3.90 s**
"A gentle breeze moved through the open window while the children, still half-asleep, listened to the distant sound of the harbour bells."
<audio controls src="samples/fr_F2.wav"></audio> &nbsp; **FR · F2 · 3.41 s**
"Bonjour, ceci est un test de synthèse vocale en français. Le modèle gère trente-et-une langues sur une puce M4."
<audio controls src="samples/de_M2.wav"></audio> &nbsp; **DE · M2 · 3.69 s**
"Guten Morgen. Dieses Modell läuft komplett auf Apple Silicon, ohne ONNX und ohne CoreML, in reinem MLX."
<audio controls src="samples/ja_F3.wav"></audio> &nbsp; **JA · F3 · 1.46 s**
"こんにちは。これはアップルシリコン上でMLXを使ったテストです。"
<audio controls src="samples/es_M3.wav"></audio> &nbsp; **ES · M3 · 2.86 s**
"Hola, esto es una prueba de síntesis de voz en español ejecutada en tiempo real sobre Apple Silicon."
## Benchmarks (Apple M4, FP32, median of 3)
| Sample | Duration | MLX wall | RTF | ONNX SDK | Speedup |
|-----------------|---------:|----------:|----------:|---------:|--------:|
| EN · F1 · short | 2.79 s | 36.6 ms | **x76** | 1005 ms | **28 ×**|
| EN · M1 · long | 3.90 s | 38.4 ms | **x102** | 1356 ms | **35 ×**|
| FR · F2 | 3.41 s | 37.9 ms | **x90** | 1196 ms | **32 ×**|
| DE · M2 | 3.69 s | 38.1 ms | **x97** | 1314 ms | **35 ×**|
| JA · F3 | 1.46 s | 32.1 ms | **x46** | 848 ms | **26 ×**|
| ES · M3 | 2.86 s | 37.0 ms | **x77** | 1002 ms | **27 ×**|
Raw numbers are in [`bench_results.csv`](bench_results.csv) (regenerable via
the development monorepo at
[`gitea.tavportal.com/olivier/MLX_CONVERTOR`](https://gitea.tavportal.com/olivier/MLX_CONVERTOR);
this repository ships the consolidated release artefacts only).
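
The derived columns can be recomputed from the raw ones. The sketch below assumes that RTF means seconds of audio produced per second of wall time and that the speedup is the ratio of the two wall times; both readings match the table values. The function names are illustrative only.

```python
def realtime_factor(audio_seconds, wall_ms):
    """Seconds of audio produced per second of wall time."""
    return audio_seconds / (wall_ms / 1000.0)


def ms_per_audio_second(audio_seconds, wall_ms):
    """Wall-clock milliseconds spent per second of generated audio."""
    return wall_ms / audio_seconds


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# "EN · M1 · long" row: 3.90 s of audio, 38.4 ms MLX wall, 1356 ms ONNX SDK
print(round(realtime_factor(3.90, 38.4)))   # 102
print(round(speedup(1356, 38.4)))           # 35
```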
### Multi-machine comparison
Same French sentence
(`"Un jour, Isaac Newton se promène dans son jardin quand une pomme lui tombe sur la tête. Eurêka, j'ai trouvé la loi de la gravitation !"`),
4 s of audio, median of 5 warm runs, MLX FP32:
| Hardware | Wall | RTF | ms / s audio | Notes |
|--------------------------------------------------|--------:|---------:|-------------:|----------------------------------|
| Mac Studio **M3 Ultra** (80 GPU cores, 96 GB) | 45.8 ms | **x88** | 11.3 | best on this test |
| MacBook Air **M4** (10 GPU cores, 16 GB) | 86.7 ms | x47 | 21.1 | reference consumer device |
| MacBook Air M4 — CoreML (mlpackage, CPU + NE) | 303.5 ms| x27 | 37.7 | upstream CoreML build |
| MacBook Air M4 — ONNX SDK (`pip install supertonic`) | ~1200 ms| ~x3 | ~350 | upstream reference Python SDK |
The MLX path is ~**1.78× faster than the CoreML build** on the same M4 hardware
(MLX 21 ms / s of audio vs CoreML 38 ms / s of audio), and ~**35-40×** faster than the
ONNX SDK reference. Memory footprint on the M3 Ultra is 750 MB active /
844 MB peak GPU memory; the M4 footprint is similar since the model size is
fixed. Wall time on short utterances is dispatch-bound (the 24 attention +
ConvNeXt blocks × 5 Euler steps plus the 10-block vocoder all run in ~45 ms
on the Ultra), so the M3 Ultra's 8× as many GPU cores buy only a ~2× speedup:
the workload doesn't fill them.
Cold load: 15 ms from the local safetensors snapshot, ~ 17 s on first
`from_pretrained` from the Hub (downloads 379 MB of weights via
`hf_transfer`).
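
A timing harness in the spirit of the "median of 5 warm runs" protocol could look like the sketch below. It is not the script used for the numbers above: `median_wall_ms` is a hypothetical helper, and the number of warmup calls is an assumption, since the protocol does not state it.

```python
import statistics
import time


def median_wall_ms(fn, warmup=1, runs=5):
    """Median wall time of fn() in milliseconds.

    Calls fn() `warmup` times untimed, then `runs` times timed.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Usage would be along the lines of `median_wall_ms(lambda: pipe.generate(text, voice="F2", lang="fr"))`. Since `generate` returns a NumPy array according to the quickstart, the computation has presumably finished by the time it returns. If you time code that returns lazy `mx.array` values instead, force evaluation with `mx.eval` inside the timed call.
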
Reference comparison: the CoreML build of the same model on the same hardware
runs at ~x27 realtime. The MLX port is **~2-4× faster** end-to-end while
matching the ONNX Runtime reference numerically: cosine similarity 1.00 on the
vocoder output and ≥ 0.98 on the full estimator output.
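
The cosine figures compare two output tensors flattened to vectors. A minimal pure-Python sketch of that metric follows; the function name is illustrative and this is not the repository's verification script.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length sequences of numbers."""
    a = [float(x) for x in a]
    b = [float(x) for x in b]
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm
```

Cosine similarity ignores overall scale, so a value of 1.00 shows that two outputs are proportional to each other. That is a weaker statement than bitwise equality, which needs an element-wise comparison.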
## Voices
10 preset voices — five female (`F1` to `F5`) and five male (`M1` to `M5`). The
`voice_styles/` directory contains both `style_ttl` (50×256 latent style for
the audio path) and `style_dp` (8×16 style for the duration head) for each
voice. Pass the voice name as the `voice=` kwarg to `Pipeline.generate`.
## Languages
31 languages supported. Pass the ISO 639-1 code as the `lang=` kwarg:
`en` `fr` `de` `es` `it` `pt` `ja` `ko` `zh` `ru` `pl` `nl` `tr` `ar` `hi`
`vi` `th` `id` `cs` `ro` `hu` `el` `da` `sv` `fi` `no` `he` `uk` `bg` `hr` `sk`.
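
If you take the voice and language from user input, it can help to validate them before calling `Pipeline.generate`. The sketch below builds the two lists from this README; `check_request` is a hypothetical helper, and the package may well perform its own validation.

```python
# The ten preset voices: F1..F5 and M1..M5
VOICES = tuple(f"{gender}{i}" for gender in ("F", "M") for i in range(1, 6))

# The 31 language codes listed in this README
LANGS = tuple(
    "en fr de es it pt ja ko zh ru pl nl tr ar hi "
    "vi th id cs ro hu el da sv fi no he uk bg hr sk".split()
)


def check_request(voice, lang):
    """Return (voice, lang) unchanged, or raise ValueError if either is unknown."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    if lang not in LANGS:
        raise ValueError(f"unknown language {lang!r}; expected one of {LANGS}")
    return voice, lang
```

A call such as `pipe.generate(text, voice=voice, lang=lang)` can then follow `voice, lang = check_request(voice, lang)`.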
## Architecture (short)
Four sub-models, all in `weights/*.safetensors`:
| Sub-model | Role | Params | Size |
|----------------------|-------------------------------------|--------|---------|
| `vector_estimator` | 24-block CFG flow-matching velocity | ~64 M | 256 MB |
| `text_encoder` | Character → 256-D text embedding | ~9 M | 36 MB |
| `duration_predictor` | Text → seconds | ~1 M | 3.5 MB |
| `vocoder` | Latent (B,144,T) → 44.1 kHz wav | ~25 M | 101 MB |
The pipeline runs **exactly 5 Euler steps** with classifier-free guidance
(`4×cond - 3×uncond`). This schedule is trained-in: reducing the step count
or disabling CFG produces an essentially uncorrelated waveform (verified
empirically — see the `bench_n_steps.py` script in the source repo).
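
The sampling loop described above can be sketched on a toy problem. The code makes two assumptions that the README does not confirm: the five steps are uniform on [0, 1], and the guidance expression reads 4·cond − 3·uncond (the usual classifier-free-guidance form with scale 4). The `velocity` callback signature is hypothetical and stands in for the vector estimator.

```python
def cfg_velocity(v_cond, v_uncond, w_cond=4.0, w_uncond=3.0):
    """Combine conditional and unconditional velocities: 4*cond - 3*uncond."""
    return [w_cond * c - w_uncond * u for c, u in zip(v_cond, v_uncond)]


def euler_sample(x0, velocity, n_steps=5):
    """Integrate dx/dt = guided velocity from t=0 to t=1 with uniform Euler steps.

    velocity(x, t, conditional) -> list of floats (hypothetical signature).
    """
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = cfg_velocity(velocity(x, t, True), velocity(x, t, False))
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, conditional):
    """Constant toy field standing in for the real vector estimator."""
    return [1.0 if conditional else 0.5 for _ in x]


# Guided velocity is 4*1.0 - 3*0.5 = 2.5, so each component ends near 2.5
print(euler_sample([0.0, 0.0], toy_velocity))
```

Each step calls the estimator twice, once with and once without conditioning, so five steps cost ten estimator evaluations unless the two calls are batched together.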
## Loading from a local snapshot
`Pipeline.from_pretrained` auto-detects three kinds of source:
1. **Hugging Face repo id** (e.g. `"ambassadia/supertonic-3-mlx"`) — auto-download
2. **Local path containing `weights/`** (this layout) — fastest cold-load
3. **Local path containing `onnx/`** (upstream snapshot) — converts at load time
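
The three cases above can be told apart with a few filesystem checks. This is a sketch of one possible detection order, not the package's actual logic: `detect_source` and its return labels are hypothetical, and preferring `weights/` when both directories exist is an assumption.

```python
from pathlib import Path


def detect_source(source):
    """Classify a from_pretrained argument as one of three source kinds."""
    path = Path(source)
    if path.is_dir():
        if (path / "weights").is_dir():
            return "local-weights"  # converted safetensors, fastest cold load
        if (path / "onnx").is_dir():
            return "local-onnx"     # upstream snapshot, converted at load time
        raise ValueError(f"{source!r} contains neither weights/ nor onnx/")
    return "hub-repo-id"            # anything else is treated as a Hub repo id
```

Note that a repo id such as `"ambassadia/supertonic-3-mlx"` would be classified as local if a directory with that relative path happened to exist in the working directory.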
## License
This release combines two artefact classes under two distinct licenses:
- **Model weights** (`weights/*.safetensors`) — **BigScience Open RAIL-M**.
See [`LICENSE`](LICENSE) for the full text. The Attachment A use
restrictions are reproduced below and apply to all downstream use of the
model and of generated audio.
- **Port code** (`src/supertonic_3_mlx/`) — **Apache License 2.0**. See
[`LICENSE-CODE`](LICENSE-CODE).
See [`NOTICE`](NOTICE) for the modifications statement and the upstream
attribution.
### OpenRAIL-M Attachment A — use restrictions
You agree not to use the model or derivatives:
(a) In any way that violates any applicable national, federal, state, local or
international law or regulation.
(b) For the purpose of exploiting, harming or attempting to exploit or harm
minors in any way.
(c) To generate or disseminate verifiably false information and/or content
with the purpose of harming others.
(d) To generate or disseminate personal identifiable information that can be
used to harm an individual.
(e) To generate or disseminate information and/or content (e.g. images, code,
posts, articles), and place the information and/or content in any context
(e.g. bot generating tweets) **without expressly and intelligibly disclaiming
that the information and/or content is machine generated**.
(f) To defame, disparage or otherwise harass others.
(g) To impersonate or attempt to impersonate (e.g. **deepfakes**) others
without their consent.
(h) For fully automated decision making that adversely impacts an individual's
legal rights or otherwise creates or modifies a binding, enforceable obligation.
(i) For any use intended to or which has the effect of discriminating against
or harming individuals or groups based on online or offline social behavior or
known or predicted personal or personality characteristics.
(j) To exploit any of the vulnerabilities of a specific group of persons based
on their age, social, physical or mental characteristics, in order to materially
distort the behavior of a person pertaining to that group in a manner that
causes or is likely to cause that person or another person physical or
psychological harm.
(k) For any use intended to or which has the effect of discriminating against
individuals or groups based on legally protected characteristics or categories.
(l) **To provide medical advice and medical results interpretation.**
(m) To generate or disseminate information for the purpose to be used for
administration of justice, law enforcement, immigration or asylum processes,
such as predicting an individual will commit fraud/crime commitment.
## Citation
```bibtex
@misc{supertonic3-mlx,
title = {Supertonic 3 MLX: native Apple Silicon port of Supertone's multilingual TTS},
author = {Dupont, Olivier},
year = {2026},
url = {https://huggingface.co/ambassadia/supertonic-3-mlx},
note = {Derivative of Supertone/supertonic-3 (https://huggingface.co/Supertone/supertonic-3)}
}
```
Please also cite the upstream Supertone Supertonic 3 model when using this
port.