v0.1.0 — initial release

MLX-native port of Supertone's Supertonic 3 multilingual TTS. Runs the full
flow-matching + classifier-free-guidance pipeline at ~x100 realtime on Apple
Silicon, with audio cosine 1.0 vs the cached MLX path and cosine 0.98 vs the
upstream ONNX Runtime reference.

Weights are hosted at https://huggingface.co/ambassadia/supertonic-3-mlx and
auto-downloaded on first use; this repository ships the port code, the model
card, audio samples, and a zero-config setup_and_test.sh.

Install:
    pip install git+https://gitea.tavportal.com/olivier/supertonic-3-mlx.git

Quick test:
    git clone https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
    cd supertonic-3-mlx && ./setup_and_test.sh

Licenses (dual): model weights = BigScience Open RAIL-M (Section 4
propagation), port code = Apache-2.0. See LICENSE, LICENSE-CODE, NOTICE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README.md (new file, 260 lines)
---
license: openrail
license_link: LICENSE
language:
- en
- fr
- de
- es
- it
- pt
- ja
- ko
- zh
- ru
- pl
- nl
- tr
- ar
- hi
- vi
- th
- id
- cs
- ro
- hu
- el
- da
- sv
- fi
- no
- he
- uk
- bg
- hr
- sk
pipeline_tag: text-to-speech
tags:
- mlx
- apple-silicon
- tts
- text-to-speech
- speech-synthesis
- supertonic
- multilingual
- flow-matching
library_name: supertonic-3-mlx
base_model: Supertone/supertonic-3
inference: false
---

# Supertonic 3: MLX-native

**31-language text-to-speech, up to ~x100 realtime on Apple Silicon.**
A native MLX port of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
that runs the full flow-matching + classifier-free-guidance pipeline (DurationPredictor →
TextEncoder → 24-block VectorEstimator (5 Euler steps) → 10-block Vocos vocoder)
without ONNX, CoreML or any C++ runtime: only MLX + NumPy.

## Install

The package isn't on PyPI yet. Install it directly from this Gitea
repository (or from a local checkout):

```bash
pip install git+https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
```

Runtime dependencies are just `mlx`, `numpy`, and `huggingface_hub` (the
last for the one-line weight download). On first use, the ~400 MB weight
bundle is downloaded from
[`ambassadia/supertonic-3-mlx`](https://huggingface.co/ambassadia/supertonic-3-mlx)
into your Hugging Face cache.

### One-shot quickstart + sanity test

A zero-config end-to-end test script ships with the repo. Clone the repo and
run the script: it creates a fresh venv, installs everything,
version-checks MLX (with an optional auto-upgrade), downloads the weights,
and synthesises an utterance into `hello.wav`:

```bash
git clone https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
cd supertonic-3-mlx
./setup_and_test.sh                     # en F1, default text
./setup_and_test.sh fr F2 "Bonjour."    # custom lang / voice / text
```

Re-runs reuse the venv and the cached weights, so a second invocation takes
~20 ms for the warm load plus ~30 ms per generate call.

## Quickstart (after install)

```python
# soundfile is not among the runtime dependencies: pip install soundfile
import soundfile as sf

from supertonic_3_mlx import Pipeline

pipe = Pipeline.from_pretrained("ambassadia/supertonic-3-mlx")
wav = pipe.generate("Hello world from Apple Silicon.", voice="F1", lang="en")

# wav is a 1-D numpy.float32 array at 44.1 kHz
sf.write("hello.wav", wav, pipe.sample_rate)
```
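
If you would rather not install an extra package just to save the result, the float samples can be written with the standard library alone. The helper below is a minimal sketch, not part of `supertonic_3_mlx`: the name `write_wav_pcm16` is hypothetical, and it assumes mono samples in the range [-1.0, 1.0], which it converts to 16-bit PCM.

```python
import array
import sys
import wave


def write_wav_pcm16(path, samples, sample_rate=44100):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Hypothetical helper (not part of supertonic_3_mlx). `samples` can be any
    iterable of floats, including a 1-D NumPy array.
    """
    pcm = array.array("h")  # signed 16-bit integers
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip so the scaling cannot overflow
        pcm.append(int(round(s * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
```

With the quickstart objects, the call would be `write_wav_pcm16("hello.wav", wav, pipe.sample_rate)`. The Python loop is slow for long clips; it trades speed for having no dependency beyond the standard library.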

## Audio samples

Six samples across five languages, with a mix of male and female voices and of
short and long utterances, all generated by the MLX pipeline at the wall times reported below.

<audio controls src="samples/en_F1_short.wav"></audio> **EN · F1 · 2.79 s**:
"Hello world from Apple Silicon. Supertonic 3 runs at one hundred times real time."

<audio controls src="samples/en_M1_long.wav"></audio> **EN · M1 · 3.90 s**:
"A gentle breeze moved through the open window while the children, still half-asleep, listened to the distant sound of the harbour bells."

<audio controls src="samples/fr_F2.wav"></audio> **FR · F2 · 3.41 s**:
"Bonjour, ceci est un test de synthèse vocale en français. Le modèle gère trente-et-une langues sur une puce M4."

<audio controls src="samples/de_M2.wav"></audio> **DE · M2 · 3.69 s**:
"Guten Morgen. Dieses Modell läuft komplett auf Apple Silicon, ohne ONNX und ohne CoreML, in reinem MLX."

<audio controls src="samples/ja_F3.wav"></audio> **JA · F3 · 1.46 s**:
"こんにちは。これはアップルシリコン上でMLXを使ったテストです。"

<audio controls src="samples/es_M3.wav"></audio> **ES · M3 · 2.86 s**:
"Hola, esto es una prueba de síntesis de voz en español ejecutada en tiempo real sobre Apple Silicon."

## Benchmarks (Apple M4, FP32, median of 3)

| Sample          | Duration | MLX wall |      RTF | ONNX SDK | Speedup |
|-----------------|---------:|---------:|---------:|---------:|--------:|
| EN · F1 · short |   2.79 s |  36.6 ms |  **x76** |  1005 ms | **28×** |
| EN · M1 · long  |   3.90 s |  38.4 ms | **x102** |  1356 ms | **35×** |
| FR · F2         |   3.41 s |  37.9 ms |  **x90** |  1196 ms | **32×** |
| DE · M2         |   3.69 s |  38.1 ms |  **x97** |  1314 ms | **35×** |
| JA · F3         |   1.46 s |  32.1 ms |  **x46** |   848 ms | **26×** |
| ES · M3         |   2.86 s |  37.0 ms |  **x77** |  1002 ms | **27×** |
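
The RTF and Speedup columns are consistent with two simple ratios: audio duration divided by MLX wall time, and ONNX wall time divided by MLX wall time. That reading is an assumption inferred from the table, not something the benchmark script is known to compute this way; the sketch below checks it against the EN · M1 row. The function names are illustrative.

```python
def realtime_factor(duration_s: float, wall_ms: float) -> float:
    """Seconds of audio produced per second of wall time (higher is faster)."""
    return duration_s / (wall_ms / 1000.0)


def speedup(baseline_ms: float, wall_ms: float) -> float:
    """How many times faster `wall_ms` is than `baseline_ms`."""
    return baseline_ms / wall_ms


# EN · M1 · long row: 3.90 s of audio, 38.4 ms MLX wall, 1356 ms ONNX SDK
rtf = realtime_factor(3.90, 38.4)
gain = speedup(1356, 38.4)
print(f"x{rtf:.0f} realtime, {gain:.0f}x faster")  # prints: x102 realtime, 35x faster
```

Note that "RTF" here is the inverse of the convention used in some speech papers, where a real-time factor below 1 means faster than realtime.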

Raw numbers are in [`bench_results.csv`](bench_results.csv). They can be
regenerated from the development monorepo at
[`gitea.tavportal.com/olivier/MLX_CONVERTOR`](https://gitea.tavportal.com/olivier/MLX_CONVERTOR);
this repository ships the consolidated release artefacts only.

Reference comparison: the CoreML build of the same model on the same hardware
runs at ~x27 realtime. The MLX port is **~2-4× faster** end-to-end while
numerically matching the ONNX Runtime reference on the vocoder
(cosine 1.00) and reaching cosine ≥ 0.98 on the full estimator output.

## Voices

10 preset voices: five female (`F1`–`F5`) and five male (`M1`–`M5`). The
`voice_styles/` directory contains both `style_ttl` (50×256 latent style for
the audio path) and `style_dp` (8×16 style for the duration head) for each
voice. Pass the voice name as the `voice=` kwarg to `Pipeline.generate`.

## Languages

31 languages supported. Pass the ISO 639-1 code as the `lang=` kwarg:
`en` `fr` `de` `es` `it` `pt` `ja` `ko` `zh` `ru` `pl` `nl` `tr` `ar` `hi`
`vi` `th` `id` `cs` `ro` `hu` `el` `da` `sv` `fi` `no` `he` `uk` `bg` `hr` `sk`.
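
When voice and language come from user input, it can help to check them against the documented presets before calling `Pipeline.generate`. The sketch below is a hypothetical helper, not part of the package; whether `Pipeline.generate` performs its own validation is not stated here, so treat this as an optional pre-check.

```python
# Presets as documented above: F1-F5, M1-M5 and 31 ISO 639-1 codes.
VOICES = {f"{gender}{i}" for gender in "FM" for i in range(1, 6)}
LANGS = set(
    "en fr de es it pt ja ko zh ru pl nl tr ar hi "
    "vi th id cs ro hu el da sv fi no he uk bg hr sk".split()
)


def check_request(voice: str, lang: str) -> None:
    """Raise ValueError if voice or lang is not a documented preset.

    Hypothetical helper for illustration only.
    """
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {sorted(VOICES)}")
    if lang not in LANGS:
        raise ValueError(f"unknown language {lang!r}; expected one of {sorted(LANGS)}")


check_request("F2", "fr")  # passes silently
```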

## Architecture (short)

Four sub-models, all in `weights/*.safetensors`:

| Sub-model            | Role                                | Params | Size   |
|----------------------|-------------------------------------|-------:|-------:|
| `vector_estimator`   | 24-block CFG flow-matching velocity | ~64 M  | 256 MB |
| `text_encoder`       | Character → 256-D text embedding    | ~9 M   | 36 MB  |
| `duration_predictor` | Text → seconds                      | ~1 M   | 3.5 MB |
| `vocoder`            | Latent (B,144,T) → 44.1 kHz wav     | ~25 M  | 101 MB |
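
The Size column lines up with FP32 storage at 4 bytes per parameter. The short calculation below is a sanity check under two assumptions: the rounded parameter counts from the table, and 1 MB taken as 10^6 bytes.

```python
BYTES_PER_FP32 = 4


def fp32_size_mb(n_params: float) -> float:
    """Approximate on-disk size in MB (10^6 bytes) for FP32 weights."""
    return n_params * BYTES_PER_FP32 / 1e6


# Rounded parameter counts taken from the table above.
params = {
    "vector_estimator": 64e6,
    "text_encoder": 9e6,
    "duration_predictor": 1e6,
    "vocoder": 25e6,
}

for name, n in params.items():
    print(f"{name:20s} ~{fp32_size_mb(n):.0f} MB")

total_mb = sum(fp32_size_mb(n) for n in params.values())
print(f"total ~{total_mb:.0f} MB")
```

The total comes to roughly 396 MB, which agrees with the ~400 MB download mentioned in the install section.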

The pipeline runs **exactly 5 Euler steps** with classifier-free guidance
(`4×cond − 3×uncond`). This schedule is trained-in: reducing the step count
or disabling CFG produces an essentially uncorrelated waveform (verified
empirically with the `bench_n_steps.py` script in the source repo).
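
To make the sampling loop concrete, here is a minimal pure-Python sketch of Euler integration with the guidance combination quoted above. It is not the port's implementation: the function names, the uniform step size `dt = 1/5`, and integrating from `t = 0` to `t = 1` are all assumptions for illustration, and the two velocity callables stand in for the conditional and unconditional passes of the `vector_estimator`. Algebraically, `4×cond − 3×uncond` equals `uncond + 4×(cond − uncond)`, that is, a guidance scale of 4.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]

N_STEPS = 5  # the step count stated in the text


def guided_velocity(v_cond: Vector, v_uncond: Vector) -> Vector:
    """Classifier-free guidance: 4*cond - 3*uncond, element-wise."""
    return [4.0 * c - 3.0 * u for c, u in zip(v_cond, v_uncond)]


def euler_sample(x0: Vector, cond_fn: VelocityFn, uncond_fn: VelocityFn,
                 n_steps: int = N_STEPS) -> Vector:
    """Integrate dx/dt = guided velocity with fixed-step Euler updates.

    Assumes uniform steps over t in [0, 1); the real schedule may differ.
    """
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = guided_velocity(cond_fn(x, t), uncond_fn(x, t))
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

With toy constant fields, for example `cond_fn = lambda x, t: [1.0] * len(x)` and `uncond_fn = lambda x, t: [0.0] * len(x)`, the guided velocity is 4 everywhere, so starting from zeros the result is approximately 4 in each coordinate after the five steps.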

## Loading from a local snapshot

`Pipeline.from_pretrained` auto-detects three source layouts:

1. **Hugging Face repo id** (e.g. `"ambassadia/supertonic-3-mlx"`): auto-download
2. **Local path containing `weights/`** (this layout): fastest cold-load
3. **Local path containing `onnx/`** (upstream snapshot): converts at load time
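
The three cases above can be sketched as a small dispatch function. This is an illustration of the idea only, not the actual logic inside `Pipeline.from_pretrained`: the function name, the returned labels, and the precedence of `weights/` over `onnx/` when both exist are assumptions.

```python
from pathlib import Path


def detect_source(src: str) -> str:
    """Classify `src` as one of the three documented source layouts.

    Hypothetical helper; the labels are illustrative.
    """
    path = Path(src)
    if path.is_dir():
        if (path / "weights").is_dir():
            return "mlx-weights"    # converted safetensors, fastest cold-load
        if (path / "onnx").is_dir():
            return "onnx-snapshot"  # upstream snapshot, converted at load time
        raise ValueError(f"{src!r} contains neither weights/ nor onnx/")
    return "hub-repo-id"            # anything else is treated as a Hub repo id
```

One consequence of this ordering is that a local directory whose name happens to match a repo id is treated as local, which is usually what you want for offline use.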

## License

This release combines two artefact classes under two distinct licenses:

- **Model weights** (`weights/*.safetensors`): **BigScience Open RAIL-M**.
  See [`LICENSE`](LICENSE) for the full text. The Attachment A use
  restrictions are reproduced below and apply to all downstream use of the
  model and of generated audio.
- **Port code** (`src/supertonic_3_mlx/`): **Apache License 2.0**. See
  [`LICENSE-CODE`](LICENSE-CODE).

See [`NOTICE`](NOTICE) for the modifications statement and the upstream
attribution.

### OpenRAIL-M Attachment A: use restrictions

You agree not to use the model or derivatives:

(a) In any way that violates any applicable national, federal, state, local or
international law or regulation.

(b) For the purpose of exploiting, harming or attempting to exploit or harm
minors in any way.

(c) To generate or disseminate verifiably false information and/or content
with the purpose of harming others.

(d) To generate or disseminate personal identifiable information that can be
used to harm an individual.

(e) To generate or disseminate information and/or content (e.g. images, code,
posts, articles), and place the information and/or content in any context
(e.g. bot generating tweets) **without expressly and intelligibly disclaiming
that the information and/or content is machine generated**.

(f) To defame, disparage or otherwise harass others.

(g) To impersonate or attempt to impersonate (e.g. **deepfakes**) others
without their consent.

(h) For fully automated decision making that adversely impacts an individual's
legal rights or otherwise creates or modifies a binding, enforceable obligation.

(i) For any use intended to or which has the effect of discriminating against
or harming individuals or groups based on online or offline social behavior or
known or predicted personal or personality characteristics.

(j) To exploit any of the vulnerabilities of a specific group of persons based
on their age, social, physical or mental characteristics, in order to materially
distort the behavior of a person pertaining to that group in a manner that
causes or is likely to cause that person or another person physical or
psychological harm.

(k) For any use intended to or which has the effect of discriminating against
individuals or groups based on legally protected characteristics or categories.

(l) **To provide medical advice and medical results interpretation.**

(m) To generate or disseminate information for the purpose to be used for
administration of justice, law enforcement, immigration or asylum processes,
such as predicting an individual will commit fraud/crime commitment.

## Citation

```bibtex
@misc{supertonic3-mlx,
  title  = {Supertonic 3 MLX: native Apple Silicon port of Supertone's multilingual TTS},
  author = {Dupont, Olivier},
  year   = {2026},
  url    = {https://huggingface.co/ambassadia/supertonic-3-mlx},
  note   = {Derivative of Supertone/supertonic-3 (https://huggingface.co/Supertone/supertonic-3)}
}
```

Please also cite the upstream Supertone Supertonic 3 model when using this
port.