v0.1.0 — initial release

MLX-native port of Supertone's Supertonic 3 multilingual TTS. Runs the full
flow-matching + classifier-free-guidance pipeline at ~x100 realtime on Apple
Silicon, with audio cosine 1.0 vs the cached MLX path and cosine 0.98 vs the
upstream ONNX Runtime reference.

Weights are hosted at https://huggingface.co/ambassadia/supertonic-3-mlx and
auto-downloaded on first use; this repository ships the port code, the model
card, audio samples, and a zero-config setup_and_test.sh.

Install:
  pip install git+https://gitea.tavportal.com/olivier/supertonic-3-mlx.git

Quick test:
  git clone https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
  cd supertonic-3-mlx && ./setup_and_test.sh

Licenses (dual): model weights = BigScience Open RAIL-M (Section 4
propagation), port code = Apache-2.0. See LICENSE, LICENSE-CODE, NOTICE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:17:05 +02:00
commit 12dbf4a821
36 changed files with 3812 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,260 @@
+---
+license: openrail
+license_link: LICENSE
+language:
+- en
+- fr
+- de
+- es
+- it
+- pt
+- ja
+- ko
+- zh
+- ru
+- pl
+- nl
+- tr
+- ar
+- hi
+- vi
+- th
+- id
+- cs
+- ro
+- hu
+- el
+- da
+- sv
+- fi
+- no
+- he
+- uk
+- bg
+- hr
+- sk
+pipeline_tag: text-to-speech
+tags:
+- mlx
+- apple-silicon
+- tts
+- text-to-speech
+- speech-synthesis
+- supertonic
+- multilingual
+- flow-matching
+library_name: supertonic-3-mlx
+base_model: Supertone/supertonic-3
+inference: false
+---
+
+# Supertonic 3 — MLX-native
+
+**31-language text-to-speech, ~x100 realtime on Apple Silicon.**
+A native MLX port of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
+that runs the full flow-matching + classifier-free-guidance pipeline (DurationPredictor →
+TextEncoder → 24-block VectorEstimator (5 Euler steps) → 10-block Vocos vocoder)
+without ONNX, CoreML or any C++ runtime: inference uses only MLX and NumPy.
+
+## Install
+
+The package isn't on PyPI yet, so install it directly from this Gitea
+repository (or from a local checkout):
+
+```bash
+pip install git+https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
+```
+
+Runtime dependencies are just `mlx`, `numpy`, and `huggingface_hub` (the
+last handles the weight download). On first use the ~400 MB weight
+bundle is downloaded from
+[`ambassadia/supertonic-3-mlx`](https://huggingface.co/ambassadia/supertonic-3-mlx)
+into your Hugging Face cache.
+
+### One-shot quickstart + sanity test
+
+A zero-config end-to-end test script ships with the repo. Clone the repo,
+run the script, and it will create a fresh venv, install everything,
+version-check MLX (with an optional auto-upgrade), download the weights
+and synthesise an utterance into `hello.wav`:
+
+```bash
+git clone https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
+cd supertonic-3-mlx
+./setup_and_test.sh                              # en F1, default text
+./setup_and_test.sh fr F2 "Bonjour."             # custom lang / voice / text
+```
+
+Re-runs reuse the venv and the cached weights: a second invocation takes
+~20 ms for the warm load plus ~30 ms per `generate` call.
+
+## Quickstart (after install)
+
+```python
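+import soundfile as sf  # not a runtime dependency of this package: pip install soundfile
+
+from supertonic_3_mlx import Pipeline
+
+# Downloads the weights on first use, then loads them from the Hugging Face cache.
+pipe = Pipeline.from_pretrained("ambassadia/supertonic-3-mlx")
+wav = pipe.generate("Hello world from Apple Silicon.", voice="F1", lang="en")
+
+# wav is a 1-D numpy.float32 array at 44.1 kHz
+sf.write("hello.wav", wav, pipe.sample_rate)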
+```
+
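The quickstart above saves audio with `soundfile`, which is not among the runtime dependencies listed under Install. As a minimal sketch, assuming the samples returned by `generate` lie in [-1.0, 1.0] (the README states only the dtype and sample rate, so the range is an assumption), the same array could be written as 16-bit PCM with the standard library alone. `write_wav_pcm16` is a hypothetical helper, not part of `supertonic_3_mlx`, and the sine tone merely stands in for model output so the snippet runs without the weights.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_pcm16(path, samples, sample_rate):
    """Write mono float samples (assumed to lie in [-1.0, 1.0]) as 16-bit PCM."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip so the int16 conversion cannot overflow
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))


# Stand-in signal (not model output): 0.1 s of a 440 Hz tone at 44.1 kHz.
sample_rate = 44100
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(4410)]
out_path = os.path.join(tempfile.gettempdir(), "supertonic_sketch_tone.wav")
write_wav_pcm16(out_path, tone, sample_rate)
```

With the real pipeline, the call would presumably be `write_wav_pcm16("hello.wav", wav, pipe.sample_rate)`.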
+## Audio samples
+
+Six samples in five languages, with a mix of male and female voices and of
+short and long utterances, all generated by the MLX pipeline at the wall times
+reported below.
+
+<audio controls src="samples/en_F1_short.wav"></audio> &nbsp; **EN · F1 · 2.79 s** —
+"Hello world from Apple Silicon. Supertonic 3 runs at one hundred times real time."
+
+<audio controls src="samples/en_M1_long.wav"></audio> &nbsp; **EN · M1 · 3.90 s** —
+"A gentle breeze moved through the open window while the children, still half-asleep, listened to the distant sound of the harbour bells."
+
+<audio controls src="samples/fr_F2.wav"></audio> &nbsp; **FR · F2 · 3.41 s** —
+"Bonjour, ceci est un test de synthèse vocale en français. Le modèle gère trente-et-une langues sur une puce M4."
+
+<audio controls src="samples/de_M2.wav"></audio> &nbsp; **DE · M2 · 3.69 s** —
+"Guten Morgen. Dieses Modell läuft komplett auf Apple Silicon, ohne ONNX und ohne CoreML, in reinem MLX."
+
+<audio controls src="samples/ja_F3.wav"></audio> &nbsp; **JA · F3 · 1.46 s** —
+"こんにちは。これはアップルシリコン上でMLXを使ったテストです。"
+
+<audio controls src="samples/es_M3.wav"></audio> &nbsp; **ES · M3 · 2.86 s** —
+"Hola, esto es una prueba de síntesis de voz en español ejecutada en tiempo real sobre Apple Silicon."
+
+## Benchmarks (Apple M4, FP32, median of 3)
+
+| Sample          | Duration | MLX wall  | RTF       | ONNX SDK | Speedup |
+|-----------------|---------:|----------:|----------:|---------:|--------:|
+| EN · F1 · short |   2.79 s |   36.6 ms | **x76**   |  1005 ms | **28 ×**|
+| EN · M1 · long  |   3.90 s |   38.4 ms | **x102**  |  1356 ms | **35 ×**|
+| FR · F2         |   3.41 s |   37.9 ms | **x90**   |  1196 ms | **32 ×**|
+| DE · M2         |   3.69 s |   38.1 ms | **x97**   |  1314 ms | **35 ×**|
+| JA · F3         |   1.46 s |   32.1 ms | **x46**   |   848 ms | **26 ×**|
+| ES · M3         |   2.86 s |   37.0 ms | **x77**   |  1002 ms | **27 ×**|
+
+RTF here is audio duration divided by MLX wall time (higher is faster);
+Speedup is ONNX SDK wall time divided by MLX wall time. Raw numbers are in
+[`bench_results.csv`](bench_results.csv) (regenerable via the development
+monorepo at
+[`gitea.tavportal.com/olivier/MLX_CONVERTOR`](https://gitea.tavportal.com/olivier/MLX_CONVERTOR);
+this repository ships the consolidated release artefacts only).
+
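The RTF and Speedup columns in the table are consistent with two simple ratios: audio duration over MLX wall time, and ONNX SDK wall time over MLX wall time. The sketch below recomputes both from the rounded values printed in the table. Because the inputs are rounded, a recomputed figure can differ from the table by about one unit. The function names are illustrative only.

```python
# (sample, audio duration in s, MLX wall in ms, ONNX SDK wall in ms), copied from the table
rows = [
    ("EN F1 short", 2.79, 36.6, 1005),
    ("EN M1 long", 3.90, 38.4, 1356),
    ("FR F2", 3.41, 37.9, 1196),
    ("DE M2", 3.69, 38.1, 1314),
    ("JA F3", 1.46, 32.1, 848),
    ("ES M3", 2.86, 37.0, 1002),
]


def realtime_factor(duration_s, wall_ms):
    """Seconds of audio produced per second of wall time."""
    return duration_s / (wall_ms / 1000.0)


def speedup(onnx_ms, mlx_ms):
    """How many times faster the MLX run is than the ONNX SDK run."""
    return onnx_ms / mlx_ms


for name, duration_s, mlx_ms, onnx_ms in rows:
    rtf = realtime_factor(duration_s, mlx_ms)
    gain = speedup(onnx_ms, mlx_ms)
    print(f"{name:12s} x{rtf:6.1f} realtime  {gain:5.1f}x vs ONNX SDK")
```

The fixed cost per call explains why the shortest sample (JA, 1.46 s) shows the lowest realtime factor: wall time barely changes between samples while the audio duration does.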
+Reference comparison: the CoreML build of the same model on the same hardware
+runs at ~x27 realtime, so the MLX port is **~2-4× faster** than CoreML
+end-to-end. Its vocoder output matches the ONNX Runtime reference (cosine
+similarity 1.00), and the full estimator output reaches cosine ≥ 0.98.
+
+## Voices
+
+10 preset voices — five female (`F1`–`F5`) and five male (`M1`–`M5`). The
+`voice_styles/` directory contains both `style_ttl` (50×256 latent style for
+the audio path) and `style_dp` (8×16 style for the duration head) for each
+voice. Pass the voice name as the `voice=` kwarg to `Pipeline.generate`.
+
+## Languages
+
+31 languages supported. Pass the ISO 639-1 code as the `lang=` kwarg:
+`en` `fr` `de` `es` `it` `pt` `ja` `ko` `zh` `ru` `pl` `nl` `tr` `ar` `hi`
+`vi` `th` `id` `cs` `ro` `hu` `el` `da` `sv` `fi` `no` `he` `uk` `bg` `hr` `sk`.
+
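Since `voice=` and `lang=` are plain strings, a caller may want to reject typos before invoking the model. The following is a minimal sketch of such a check, built only from the voice names and language codes listed above; `check_request` is a hypothetical helper, and the README does not say how `Pipeline.generate` itself reacts to unknown values.

```python
# Ten preset voices: F1..F5 and M1..M5.
VOICES = {f"{gender}{index}" for gender in "FM" for index in range(1, 6)}

# The 31 language codes listed in the README.
LANGS = set(
    "en fr de es it pt ja ko zh ru pl nl tr ar hi "
    "vi th id cs ro hu el da sv fi no he uk bg hr sk".split()
)


def check_request(voice, lang):
    """Return (voice, lang) unchanged, or raise ValueError on an unknown value."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {sorted(VOICES)}")
    if lang not in LANGS:
        raise ValueError(f"unknown language {lang!r}; expected one of {sorted(LANGS)}")
    return voice, lang


print(check_request("F2", "fr"))
```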
+## Architecture (short)
+
+Four sub-models, all in `weights/*.safetensors`:
+
+| Sub-model            | Role                                | Params | Size    |
+|----------------------|-------------------------------------|--------|---------|
+| `vector_estimator`   | 24-block CFG flow-matching velocity | ~64 M  | 256 MB  |
+| `text_encoder`       | Character → 256-D text embedding    | ~9 M   |  36 MB  |
+| `duration_predictor` | Text → seconds                      | ~1 M   | 3.5 MB  |
+| `vocoder`            | Latent (B,144,T) → 44.1 kHz wav     | ~25 M  | 101 MB  |
+
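The sizes in the table line up with 4 bytes per parameter, and their sum accounts for the roughly 400 MB download mentioned under Install. The check below is a sketch that assumes the weights are stored as FP32 (the benchmarks are labelled FP32, but the storage dtype is not stated explicitly) and uses the approximate parameter counts from the table, so small mismatches are expected.

```python
BYTES_PER_PARAM = 4  # assumption: FP32 storage

# Approximate parameter counts (millions) and listed sizes (MB), from the table.
params_m = {"vector_estimator": 64, "text_encoder": 9, "duration_predictor": 1, "vocoder": 25}
listed_mb = {"vector_estimator": 256, "text_encoder": 36, "duration_predictor": 3.5, "vocoder": 101}

# Millions of parameters x bytes per parameter = size in (decimal) megabytes.
estimated_mb = {name: count * BYTES_PER_PARAM for name, count in params_m.items()}

for name in params_m:
    print(f"{name:20s} estimated {estimated_mb[name]:6.1f} MB, listed {listed_mb[name]:6.1f} MB")

total_listed_mb = sum(listed_mb.values())
print(f"total listed: {total_listed_mb} MB")
```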
+The pipeline runs **exactly 5 Euler steps** with classifier-free guidance
+(`4×cond − 3×uncond`). This schedule is trained-in: reducing the step count
+or disabling CFG produces an essentially uncorrelated waveform (verified
+empirically — see the `bench_n_steps.py` script in the source repo).
+
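The guidance expression `4×cond − 3×uncond` is the usual classifier-free-guidance form `uncond + s·(cond − uncond)` with scale `s = 4`. The sketch below shows how such a guided velocity could be integrated with 5 fixed-size Euler steps. It is not the port's implementation: the 0 → 1 time range, the uniform step size, and the `toy_velocity` stand-in for the vector estimator are all assumptions made for illustration.

```python
def cfg_velocity(v_cond, v_uncond, scale=4.0):
    """Classifier-free guidance: v_uncond + scale * (v_cond - v_uncond).

    With scale = 4 this equals 4 * v_cond - 3 * v_uncond.
    """
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(x0, velocity_fn, n_steps=5):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        v_cond, v_uncond = velocity_fn(x, t)  # two estimator evaluations per step
        v = cfg_velocity(v_cond, v_uncond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t):
    """Hypothetical stand-in for the vector estimator: constant velocities."""
    return [1.0 for _ in x], [0.5 for _ in x]


result = euler_sample([0.0, 0.0], toy_velocity)
print(result)
```

Each step needs both a conditional and an unconditional velocity, so 5 steps correspond to 10 estimator evaluations unless the two branches are batched together.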
+## Loading from a local snapshot
+
+`Pipeline.from_pretrained` accepts three kinds of source and auto-detects which one it was given:
+
+1. **Hugging Face repo id** (e.g. `"ambassadia/supertonic-3-mlx"`) — auto-download
+2. **Local path containing `weights/`** (this layout) — fastest cold-load
+3. **Local path containing `onnx/`** (upstream snapshot) — converts at load time
+
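The auto-detection described above can be pictured as a small dispatch on the argument. The following is only a sketch of one way to do it: `detect_source` is a hypothetical function, the returned labels are invented for illustration, and checking `weights/` before `onnx/` is an assumption rather than documented behaviour of `Pipeline.from_pretrained`.

```python
import tempfile
from pathlib import Path


def detect_source(source):
    """Classify a from_pretrained-style argument into one of three source kinds."""
    path = Path(source)
    if path.is_dir():
        if (path / "weights").is_dir():
            return "local-mlx"    # converted safetensors, no conversion needed
        if (path / "onnx").is_dir():
            return "local-onnx"   # upstream snapshot, would be converted at load time
        raise ValueError(f"{path} contains neither weights/ nor onnx/")
    return "hub-repo-id"          # not a local directory: treat as a Hugging Face repo id


# Demo on throwaway directories so the sketch runs without any real snapshot.
with tempfile.TemporaryDirectory() as tmp:
    mlx_dir = Path(tmp) / "mlx_snapshot"
    (mlx_dir / "weights").mkdir(parents=True)
    onnx_dir = Path(tmp) / "upstream_snapshot"
    (onnx_dir / "onnx").mkdir(parents=True)
    print(detect_source(mlx_dir))
    print(detect_source(onnx_dir))
```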
+## License
+
+This release combines two artefact classes under two distinct licenses:
+
+- **Model weights** (`weights/*.safetensors`) — **BigScience Open RAIL-M**.
+  See [`LICENSE`](LICENSE) for the full text. The Attachment A use
+  restrictions are reproduced below and apply to all downstream use of the
+  model and of generated audio.
+- **Port code** (`src/supertonic_3_mlx/`) — **Apache License 2.0**. See
+  [`LICENSE-CODE`](LICENSE-CODE).
+
+See [`NOTICE`](NOTICE) for the modifications statement and the upstream
+attribution.
+
+### OpenRAIL-M Attachment A — use restrictions
+
+You agree not to use the model or derivatives:
+
+(a) In any way that violates any applicable national, federal, state, local or
+international law or regulation.
+
+(b) For the purpose of exploiting, harming or attempting to exploit or harm
+minors in any way.
+
+(c) To generate or disseminate verifiably false information and/or content
+with the purpose of harming others.
+
+(d) To generate or disseminate personal identifiable information that can be
+used to harm an individual.
+
+(e) To generate or disseminate information and/or content (e.g. images, code,
+posts, articles), and place the information and/or content in any context
+(e.g. bot generating tweets) **without expressly and intelligibly disclaiming
+that the information and/or content is machine generated**.
+
+(f) To defame, disparage or otherwise harass others.
+
+(g) To impersonate or attempt to impersonate (e.g. **deepfakes**) others
+without their consent.
+
+(h) For fully automated decision making that adversely impacts an individual's
+legal rights or otherwise creates or modifies a binding, enforceable obligation.
+
+(i) For any use intended to or which has the effect of discriminating against
+or harming individuals or groups based on online or offline social behavior or
+known or predicted personal or personality characteristics.
+
+(j) To exploit any of the vulnerabilities of a specific group of persons based
+on their age, social, physical or mental characteristics, in order to materially
+distort the behavior of a person pertaining to that group in a manner that
+causes or is likely to cause that person or another person physical or
+psychological harm.
+
+(k) For any use intended to or which has the effect of discriminating against
+individuals or groups based on legally protected characteristics or categories.
+
+(l) **To provide medical advice and medical results interpretation.**
+
+(m) To generate or disseminate information for the purpose to be used for
+administration of justice, law enforcement, immigration or asylum processes,
+such as predicting an individual will commit fraud/crime commitment.
+
+## Citation
+
+```bibtex
+@misc{supertonic3-mlx,
+  title  = {Supertonic 3 MLX: native Apple Silicon port of Supertone's multilingual TTS},
+  author = {Dupont, Olivier},
+  year   = {2026},
+  url    = {https://huggingface.co/ambassadia/supertonic-3-mlx},
+  note   = {Derivative of Supertone/supertonic-3 (https://huggingface.co/Supertone/supertonic-3)}
+}
+```
+
+Please also cite the upstream Supertone Supertonic 3 model when using this
+port.