Adds the Newton-sentence benchmark numbers measured on two real Macs, plus the upstream CoreML and ONNX baselines. Highlights:

- Mac Studio M3 Ultra: 45.8 ms wall median (best 39 ms), RTF x88
- MacBook Air M4: 86.7 ms wall median, RTF x47
- M4 + CoreML: 303.5 ms wall median, RTF x27
- M4 + ONNX SDK: ~1200 ms wall median, RTF ~x3

Same FR utterance, same warmup protocol, 5 warm runs each. The ms-per-second-of-audio column is the honest backend comparison since the two paths produce slightly different audio durations (DurationPredictor + CoreML's speed=1.05 give different timing). MLX wins 1.78× over the CoreML build on identical M4 hardware, and ~35-40× over the upstream ONNX SDK. GPU memory footprint on the Ultra: 750 MB active, 844 MB peak.
---
license: openrail
license_link: LICENSE
language:
- en
- fr
- de
- es
- it
- pt
- ja
- ko
- zh
- ru
- pl
- nl
- tr
- ar
- hi
- vi
- th
- id
- cs
- ro
- hu
- el
- da
- sv
- fi
- no
- he
- uk
- bg
- hr
- sk
pipeline_tag: text-to-speech
tags:
- mlx
- apple-silicon
- tts
- text-to-speech
- speech-synthesis
- supertonic
- multilingual
- flow-matching
library_name: supertonic-3-mlx
base_model: Supertone/supertonic-3
inference: false
---
# Supertonic 3 — MLX-native

**31-language text-to-speech, ~x100 realtime on Apple Silicon.**
A native MLX port of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
that runs the full flow-matching + classifier-free-guidance pipeline (DurationPredictor →
TextEncoder → 24-block VectorEstimator (5 Euler steps) → 10-block Vocos vocoder)
without ONNX, CoreML or any C++ runtime: only MLX + NumPy.

## Install

The package isn't on PyPI yet; install it directly from this Gitea source
repository (or from a local checkout):

```bash
pip install git+https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
```

Runtime dependencies are just `mlx`, `numpy`, and `huggingface_hub` (the
last for the one-line weight download). On first use the ~400 MB weight
bundle is downloaded from
[`ambassadia/supertonic-3-mlx`](https://huggingface.co/ambassadia/supertonic-3-mlx)
into your Hugging Face cache.

### One-shot quickstart + sanity test

A zero-config end-to-end test script ships with the repo. Clone the repo and
run the script: it creates a fresh venv, installs everything,
version-checks MLX (with an optional auto-upgrade), downloads the weights
and synthesises an utterance into `hello.wav`:

```bash
git clone https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
cd supertonic-3-mlx
./setup_and_test.sh                      # en F1, default text
./setup_and_test.sh fr F2 "Bonjour."     # custom lang / voice / text
```

Re-runs reuse the venv and the cached weights: a second invocation takes
~20 ms for the warm load plus ~30 ms per `generate` call.

## Quickstart (after install)

```python
import soundfile as sf  # extra dependency, not listed under Install: pip install soundfile

from supertonic_3_mlx import Pipeline

pipe = Pipeline.from_pretrained("ambassadia/supertonic-3-mlx")
wav = pipe.generate("Hello world from Apple Silicon.", voice="F1", lang="en")

# wav is a 1-D numpy.float32 array at 44.1 kHz
sf.write("hello.wav", wav, pipe.sample_rate)
```
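`soundfile` is not among the runtime dependencies listed under Install, so the quickstart needs it installed separately. As a dependency-free alternative, here is a minimal sketch that writes a mono float waveform as 16-bit PCM with Python's standard `wave` module. The helper name `write_wav_pcm16` is hypothetical (it is not part of `supertonic_3_mlx`), and it assumes the samples lie in [-1, 1]; anything outside that range is clipped.

```python
import struct
import wave


def write_wav_pcm16(path, samples, sample_rate):
    """Write a mono waveform (iterable of floats in [-1, 1]) as 16-bit PCM.

    Hypothetical helper, not part of supertonic_3_mlx.
    """
    # Clip to [-1, 1], then scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(str(path), "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
```

With the quickstart objects this would be called as `write_wav_pcm16("hello.wav", wav, pipe.sample_rate)`; iterating a 1-D NumPy array yields scalars, so no conversion is needed first.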

## Audio samples

Six languages, with a mix of male and female voices and of short and long utterances,
all generated by the MLX pipeline at the wall times reported below.

<audio controls src="samples/en_F1_short.wav"></audio> **EN · F1 · 2.79 s**:
"Hello world from Apple Silicon. Supertonic 3 runs at one hundred times real time."

<audio controls src="samples/en_M1_long.wav"></audio> **EN · M1 · 3.90 s**:
"A gentle breeze moved through the open window while the children, still half-asleep, listened to the distant sound of the harbour bells."

<audio controls src="samples/fr_F2.wav"></audio> **FR · F2 · 3.41 s**:
"Bonjour, ceci est un test de synthèse vocale en français. Le modèle gère trente-et-une langues sur une puce M4."

<audio controls src="samples/de_M2.wav"></audio> **DE · M2 · 3.69 s**:
"Guten Morgen. Dieses Modell läuft komplett auf Apple Silicon, ohne ONNX und ohne CoreML, in reinem MLX."

<audio controls src="samples/ja_F3.wav"></audio> **JA · F3 · 1.46 s**:
"こんにちは。これはアップルシリコン上でMLXを使ったテストです。"

<audio controls src="samples/es_M3.wav"></audio> **ES · M3 · 2.86 s**:
"Hola, esto es una prueba de síntesis de voz en español ejecutada en tiempo real sobre Apple Silicon."

## Benchmarks (Apple M4, FP32, median of 3)

| Sample          | Audio duration | MLX wall time | RTF      | ONNX SDK wall time | Speedup  |
|-----------------|---------------:|--------------:|---------:|-------------------:|---------:|
| EN · F1 · short |         2.79 s |       36.6 ms | **x76**  |            1005 ms | **28 ×** |
| EN · M1 · long  |         3.90 s |       38.4 ms | **x102** |            1356 ms | **35 ×** |
| FR · F2         |         3.41 s |       37.9 ms | **x90**  |            1196 ms | **32 ×** |
| DE · M2         |         3.69 s |       38.1 ms | **x97**  |            1314 ms | **35 ×** |
| JA · F3         |         1.46 s |       32.1 ms | **x46**  |             848 ms | **26 ×** |
| ES · M3         |         2.86 s |       37.0 ms | **x77**  |            1002 ms | **27 ×** |

Raw numbers are in [`bench_results.csv`](bench_results.csv). They can be regenerated from
the development monorepo at
[`gitea.tavportal.com/olivier/MLX_CONVERTOR`](https://gitea.tavportal.com/olivier/MLX_CONVERTOR);
this repository ships only the consolidated release artefacts.
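The RTF and speedup columns are simple ratios of the other columns. The sketch below recomputes them for one row; the function names are illustrative, and because the table was presumably computed from unrounded timings, recomputing from the rounded values shown can differ by one in the last digit for some rows.

```python
def realtime_factor(audio_s, wall_ms):
    """Seconds of audio produced per second of wall time."""
    return audio_s / (wall_ms / 1000.0)


def ms_per_audio_second(audio_s, wall_ms):
    """Wall time in ms needed to synthesise one second of audio."""
    return wall_ms / audio_s


def speedup(baseline_ms, wall_ms):
    """How many times faster `wall_ms` is than `baseline_ms`."""
    return baseline_ms / wall_ms


# "EN · M1 · long" row: 3.90 s of audio, 38.4 ms MLX wall, 1356 ms ONNX SDK.
rtf = realtime_factor(3.90, 38.4)        # about 101.6
cost = ms_per_audio_second(3.90, 38.4)   # about 9.8 ms per second of audio
gain = speedup(1356, 38.4)               # about 35.3
print(f"RTF x{rtf:.0f}, {cost:.1f} ms/s audio, {gain:.0f}x vs ONNX SDK")
```

`ms_per_audio_second` is the inverse view of RTF (1000 / RTF) and corresponds to the "ms / s audio" column used in the multi-machine comparison.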

### Multi-machine comparison

Same French sentence
(`"Un jour, Isaac Newton se promène dans son jardin quand une pomme lui tombe sur la tête. Eurêka, j'ai trouvé la loi de la gravitation !"`),
~4 s of audio, median of 5 warm runs. The first two rows are MLX FP32; the last
two are the upstream CoreML and ONNX baselines on the same M4. The backends produce
slightly different audio durations, so ms / s audio is the fairest column to compare.

| Hardware / backend                                   | Wall time | RTF     | ms / s audio | Notes                         |
|------------------------------------------------------|----------:|--------:|-------------:|-------------------------------|
| Mac Studio **M3 Ultra** (80 GPU cores, 96 GB)        |   45.8 ms | **x88** |         11.3 | best on this test             |
| MacBook Air **M4** (10 GPU cores, 16 GB)             |   86.7 ms | x47     |         21.1 | reference consumer device     |
| MacBook Air M4, CoreML (mlpackage, CPU + NE)         |  303.5 ms | x27     |         37.7 | upstream CoreML build         |
| MacBook Air M4, ONNX SDK (`pip install supertonic`)  |  ~1200 ms | ~x3     |         ~350 | upstream reference Python SDK |

The MLX path is ~**1.78× faster than the CoreML build** on the same M4 hardware
(MLX 21 ms / s of audio vs CoreML 38 ms / s of audio), and ~**35–40×** faster than the
ONNX SDK reference. Memory footprint on the M3 Ultra is 750 MB active /
844 MB peak GPU memory; the M4 footprint is similar since the model size is
fixed. Wall time on short utterances is dispatch-bound (the 24 attention +
ConvNeXt blocks × 5 Euler steps plus the 10-block vocoder all run in ~45 ms
on the Ultra), so the M3 Ultra's 8× GPU core count buys only ~2× lower wall time:
the workload doesn't fill the cores.
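The "median of N warm runs" protocol can be reproduced with a small harness. This is a sketch of the general technique rather than the script behind the table: `median_wall_ms` and its parameters are hypothetical names, and the warm-up count is an assumption. It relies on the timed callable returning only once its result is fully computed; MLX evaluates lazily, so a callable that returns an `mx.array` would need to force evaluation (for example with `mx.eval`) inside the timed region.

```python
import statistics
import time


def median_wall_ms(fn, warmup=2, runs=5):
    """Return (median, best) wall time of fn() in milliseconds.

    Hypothetical helper: warm-up calls are executed but not timed, so
    one-off costs (kernel compilation, cache fills) are excluded.
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings), min(timings)
```

With the quickstart pipeline this could be called as `median_wall_ms(lambda: pipe.generate(text, voice="F2", lang="fr"))`.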

Cold load takes 15 ms from a local safetensors snapshot and ~17 s on the first
`from_pretrained` call against the Hub, which downloads 379 MB of weights via
`hf_transfer`.

Reference comparison: the CoreML build of the same model on the same hardware
runs at ~x27 realtime. The MLX port is **~2-4× faster** end-to-end while
remaining bit-identical to the ONNX Runtime reference on the vocoder
(cosine similarity 1.00) and at cosine similarity ≥ 0.98 on the full estimator output.
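The parity figures above are cosine similarities between the MLX output and the ONNX Runtime reference. A minimal sketch of that metric on two flattened signals follows. The function name is illustrative, and the README does not say how the tensors were flattened or aligned before comparison, so equal-length inputs are assumed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A value of 1.00 means the two signals point in the same direction. The metric is insensitive to overall gain, so it complements an element-wise comparison rather than replacing one.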

## Voices

Ten preset voices: five female (`F1`–`F5`) and five male (`M1`–`M5`). The
`voice_styles/` directory contains both `style_ttl` (50×256 latent style for
the audio path) and `style_dp` (8×16 style for the duration head) for each
voice. Pass the voice name as the `voice=` kwarg to `Pipeline.generate`.

## Languages

31 languages are supported. Pass the ISO 639-1 code as the `lang=` kwarg:
`en` `fr` `de` `es` `it` `pt` `ja` `ko` `zh` `ru` `pl` `nl` `tr` `ar` `hi`
`vi` `th` `id` `cs` `ro` `hu` `el` `da` `sv` `fi` `no` `he` `uk` `bg` `hr` `sk`.
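Since both kwargs take short string codes, a typo is easy to make. The sketch below validates a request before it reaches the pipeline. `VOICES`, `LANGS` and `check_request` are hypothetical names built from the lists in this README, not part of the package, and how `Pipeline.generate` itself reacts to an unknown code is not documented here.

```python
# Preset voices F1..F5 and M1..M5, as listed under "Voices".
VOICES = [f"{gender}{i}" for gender in "FM" for i in range(1, 6)]

# The 31 ISO 639-1 codes listed under "Languages".
LANGS = (
    "en fr de es it pt ja ko zh ru pl nl tr ar hi "
    "vi th id cs ro hu el da sv fi no he uk bg hr sk"
).split()


def check_request(voice, lang):
    """Raise ValueError if voice or lang is not in the documented lists."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    if lang not in LANGS:
        raise ValueError(f"unknown language {lang!r}; expected one of {LANGS}")
    return voice, lang
```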

## Architecture (short)

Four sub-models, all in `weights/*.safetensors`:

| Sub-model            | Role                                | Params | Size    |
|----------------------|-------------------------------------|--------|---------|
| `vector_estimator`   | 24-block CFG flow-matching velocity | ~64 M  | 256 MB  |
| `text_encoder`       | Character → 256-D text embedding    | ~9 M   | 36 MB   |
| `duration_predictor` | Text → seconds                      | ~1 M   | 3.5 MB  |
| `vocoder`            | Latent (B,144,T) → 44.1 kHz wav     | ~25 M  | 101 MB  |

The pipeline runs **exactly 5 Euler steps** with classifier-free guidance
(`4×cond − 3×uncond`). This schedule is trained-in: reducing the step count
or disabling CFG produces an essentially uncorrelated waveform (verified
empirically; see the `bench_n_steps.py` script in the source repo).
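The guidance formula `4×cond − 3×uncond` is the usual classifier-free-guidance blend `uncond + s·(cond − uncond)` with scale `s = 4`. The toy sketch below shows that blend inside a 5-step Euler integration, using plain Python lists instead of MLX arrays. It assumes uniform steps over t ∈ [0, 1] and a velocity function `velocity(x, t, conditioned)`; both are illustrative assumptions, since this README does not spell out the time grid or the estimator's call signature.

```python
def cfg_blend(cond, uncond, scale=4.0):
    """uncond + scale * (cond - uncond); scale=4 gives 4*cond - 3*uncond."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def euler_sample(x, velocity, n_steps=5, scale=4.0):
    """Integrate dx/dt = guided velocity from t=0 to t=1 with Euler steps.

    `velocity(x, t, conditioned)` is a stand-in for the vector estimator
    (assumed signature); uniform time steps are also an assumption.
    """
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        v_cond = velocity(x, t, True)
        v_uncond = velocity(x, t, False)
        v = cfg_blend(v_cond, v_uncond, scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In this formulation each step evaluates the estimator twice (conditioned and unconditioned), so 5 steps cost 10 estimator passes unless the two are batched together.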

## Loading from a local snapshot

Three kinds of source are auto-detected by `Pipeline.from_pretrained`:

1. **Hugging Face repo id** (e.g. `"ambassadia/supertonic-3-mlx"`): downloaded automatically
2. **Local path containing `weights/`** (the layout of this repository): fastest cold load
3. **Local path containing `onnx/`** (upstream snapshot): converted at load time
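The three cases can be told apart with a simple directory check. The following sketch illustrates one way to do it; `detect_layout` and its return labels are hypothetical, and the actual detection logic inside `Pipeline.from_pretrained` may differ.

```python
from pathlib import Path


def detect_layout(source):
    """Classify `source` as "mlx", "onnx" or "hub" (hypothetical labels)."""
    path = Path(source)
    if path.is_dir():
        if (path / "weights").is_dir():
            return "mlx"    # converted safetensors, fastest cold load
        if (path / "onnx").is_dir():
            return "onnx"   # upstream snapshot, converted at load time
        raise FileNotFoundError(f"{path} has neither weights/ nor onnx/")
    # Not a local directory: treat it as a Hugging Face repo id.
    return "hub"
```

Checking for a local directory first means a relative path that happens to look like a repo id still resolves locally, which is one reasonable precedence but not necessarily the one the package uses.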

## License

This release combines two artefact classes under two distinct licenses:

- **Model weights** (`weights/*.safetensors`): **BigScience Open RAIL-M**.
  See [`LICENSE`](LICENSE) for the full text. The Attachment A use
  restrictions are reproduced below and apply to all downstream use of the
  model and of generated audio.
- **Port code** (`src/supertonic_3_mlx/`): **Apache License 2.0**. See
  [`LICENSE-CODE`](LICENSE-CODE).

See [`NOTICE`](NOTICE) for the modifications statement and the upstream
attribution.
### OpenRAIL-M Attachment A — use restrictions

You agree not to use the model or derivatives:

(a) In any way that violates any applicable national, federal, state, local or
international law or regulation.

(b) For the purpose of exploiting, harming or attempting to exploit or harm
minors in any way.

(c) To generate or disseminate verifiably false information and/or content
with the purpose of harming others.

(d) To generate or disseminate personal identifiable information that can be
used to harm an individual.

(e) To generate or disseminate information and/or content (e.g. images, code,
posts, articles), and place the information and/or content in any context
(e.g. bot generating tweets) **without expressly and intelligibly disclaiming
that the information and/or content is machine generated**.

(f) To defame, disparage or otherwise harass others.

(g) To impersonate or attempt to impersonate (e.g. **deepfakes**) others
without their consent.

(h) For fully automated decision making that adversely impacts an individual's
legal rights or otherwise creates or modifies a binding, enforceable obligation.

(i) For any use intended to or which has the effect of discriminating against
or harming individuals or groups based on online or offline social behavior or
known or predicted personal or personality characteristics.

(j) To exploit any of the vulnerabilities of a specific group of persons based
on their age, social, physical or mental characteristics, in order to materially
distort the behavior of a person pertaining to that group in a manner that
causes or is likely to cause that person or another person physical or
psychological harm.

(k) For any use intended to or which has the effect of discriminating against
individuals or groups based on legally protected characteristics or categories.

(l) **To provide medical advice and medical results interpretation.**

(m) To generate or disseminate information for the purpose to be used for
administration of justice, law enforcement, immigration or asylum processes,
such as predicting an individual will commit fraud/crime commitment.

## Citation

```bibtex
@misc{supertonic3-mlx,
  title  = {Supertonic 3 MLX: native Apple Silicon port of Supertone's multilingual TTS},
  author = {Dupont, Olivier},
  year   = {2026},
  url    = {https://huggingface.co/ambassadia/supertonic-3-mlx},
  note   = {Derivative of Supertone/supertonic-3 (https://huggingface.co/Supertone/supertonic-3)}
}
```

Please also cite the upstream Supertone Supertonic 3 model when using this
port.