---
license: openrail
license_link: LICENSE
language:
- en
- fr
- de
- es
- it
- pt
- ja
- ko
- zh
- ru
- pl
- nl
- tr
- ar
- hi
- vi
- th
- id
- cs
- ro
- hu
- el
- da
- sv
- fi
- no
- he
- uk
- bg
- hr
- sk
pipeline_tag: text-to-speech
tags:
- mlx
- apple-silicon
- tts
- text-to-speech
- speech-synthesis
- supertonic
- multilingual
- flow-matching
library_name: supertonic-3-mlx
base_model: Supertone/supertonic-3
inference: false
---

# Supertonic 3: MLX-native

**31-language text-to-speech, up to ~x100 realtime on Apple Silicon.**

A native MLX port of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3). It runs the full flow-matching + classifier-free-guidance pipeline (DurationPredictor → TextEncoder → 24-block VectorEstimator (5 Euler steps) → 10-block Vocos vocoder) without ONNX, CoreML, or any C++ runtime, using only MLX and NumPy.

## Install

The package isn't on PyPI yet; install it directly from the Gitea source repository (or from a local checkout):

```bash
pip install git+https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
```

Runtime dependencies are just `mlx`, `numpy`, and `huggingface_hub` (the last for the one-line weight download). On first use, the ~400 MB weight bundle is downloaded from [`ambassadia/supertonic-3-mlx`](https://huggingface.co/ambassadia/supertonic-3-mlx) into your Hugging Face cache.

### One-shot quickstart + sanity test

A zero-config end-to-end test script ships with the repo. Clone the repo and run the script: it creates a fresh venv, installs everything, version-checks MLX (with an optional auto-upgrade), downloads the weights, and synthesises an utterance into `hello.wav`:

```bash
git clone https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
cd supertonic-3-mlx
./setup_and_test.sh                   # en F1, default text
./setup_and_test.sh fr F2 "Bonjour."  # custom lang / voice / text
```

Re-runs reuse the venv and the cached weights: a second invocation takes ~20 ms for the warm load plus ~30 ms per generate call.
## Quickstart (after install)

```python
import soundfile as sf  # not a runtime dependency of this package; install it separately

from supertonic_3_mlx import Pipeline

pipe = Pipeline.from_pretrained("ambassadia/supertonic-3-mlx")
wav = pipe.generate("Hello world from Apple Silicon.", voice="F1", lang="en")
# wav is a 1-D numpy.float32 array at 44.1 kHz

# Write the waveform to disk at the pipeline's sample rate.
sf.write("hello.wav", wav, pipe.sample_rate)
```

## Audio samples

Six samples in five languages, with a mix of male and female voices and of short and long utterances, all generated by the MLX pipeline at the wall times reported below.

**EN · F1 · 2.79 s**: "Hello world from Apple Silicon. Supertonic 3 runs at one hundred times real time."

**EN · M1 · 3.90 s**: "A gentle breeze moved through the open window while the children, still half-asleep, listened to the distant sound of the harbour bells."

**FR · F2 · 3.41 s**: "Bonjour, ceci est un test de synthèse vocale en français. Le modèle gère trente-et-une langues sur une puce M4."

**DE · M2 · 3.69 s**: "Guten Morgen. Dieses Modell läuft komplett auf Apple Silicon, ohne ONNX und ohne CoreML, in reinem MLX."

**JA · F3 · 1.46 s**: "こんにちは。これはアップルシリコン上でMLXを使ったテストです。"

**ES · M3 · 2.86 s**: "Hola, esto es una prueba de síntesis de voz en español ejecutada en tiempo real sobre Apple Silicon."
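The quickstart saves audio with `soundfile`, which is not among the runtime dependencies listed under Install. As a minimal sketch of an alternative, the helper below writes a waveform using only NumPy and the standard-library `wave` module. The name `write_wav_pcm16` is hypothetical (it is not part of `supertonic_3_mlx`), and the code assumes the samples lie nominally in [-1, 1], which this README does not state. A synthetic tone stands in for the output of `pipe.generate`.

```python
import os
import tempfile
import wave

import numpy as np


def write_wav_pcm16(path, samples, sample_rate):
    """Write a 1-D float waveform to a mono 16-bit PCM WAV file."""
    samples = np.asarray(samples, dtype=np.float32)
    if samples.ndim != 1:
        raise ValueError("expected a 1-D mono waveform")
    # Assumption: samples are nominally in [-1.0, 1.0]; clip anything outside.
    clipped = np.clip(samples, -1.0, 1.0)
    pcm = np.round(clipped * 32767.0).astype("<i2")  # little-endian int16
    with wave.open(path, "wb") as f:
        f.setnchannels(1)                  # mono
        f.setsampwidth(2)                  # 2 bytes per sample = 16 bit
        f.setframerate(int(sample_rate))
        f.writeframes(pcm.tobytes())


# Stand-in for pipe.generate(...): one second of a 440 Hz tone at 44.1 kHz.
sr = 44100
t = np.arange(sr, dtype=np.float32) / sr
tone = (0.5 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

out_path = os.path.join(tempfile.gettempdir(), "tone.wav")
write_wav_pcm16(out_path, tone, sr)
```

With a loaded pipeline, the equivalent call would presumably be `write_wav_pcm16("hello.wav", wav, pipe.sample_rate)`. Converting to 16-bit PCM loses precision relative to the float32 array, so keep the array itself if further processing is planned.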
## Benchmarks (Apple M4, FP32, median of 3)

| Sample          | Duration | MLX wall  | RTF       | ONNX SDK | Speedup  |
|-----------------|---------:|----------:|----------:|---------:|---------:|
| EN · F1 · short | 2.79 s   | 36.6 ms   | **x76**   | 1005 ms  | **28 ×** |
| EN · M1 · long  | 3.90 s   | 38.4 ms   | **x102**  | 1356 ms  | **35 ×** |
| FR · F2         | 3.41 s   | 37.9 ms   | **x90**   | 1196 ms  | **32 ×** |
| DE · M2         | 3.69 s   | 38.1 ms   | **x97**   | 1314 ms  | **35 ×** |
| JA · F3         | 1.46 s   | 32.1 ms   | **x46**   | 848 ms   | **26 ×** |
| ES · M3         | 2.86 s   | 37.0 ms   | **x77**   | 1002 ms  | **27 ×** |

RTF is reported here as a speed multiple over real time (audio duration divided by MLX wall time, so higher is faster). Speedup is the ONNX SDK wall time divided by the MLX wall time.

Raw numbers are in [`bench_results.csv`](bench_results.csv). They can be regenerated from the development monorepo at [`gitea.tavportal.com/olivier/MLX_CONVERTOR`](https://gitea.tavportal.com/olivier/MLX_CONVERTOR); this repository ships the consolidated release artefacts only.

Reference comparison: the CoreML build of the same model on the same hardware runs at ~x27 realtime. The MLX port is **~2-4× faster** end-to-end while remaining bit-identical to the ONNX Runtime reference on the vocoder (cosine similarity 1.00) and reaching a cosine similarity ≥ 0.98 on the full estimator output.

## Voices

Ten preset voices: five female (`F1`–`F5`) and five male (`M1`–`M5`). For each voice, the `voice_styles/` directory contains both `style_ttl` (a 50×256 latent style for the audio path) and `style_dp` (an 8×16 style for the duration head). Pass the voice name as the `voice=` kwarg to `Pipeline.generate`.

## Languages

31 languages are supported. Pass the ISO 639-1 code as the `lang=` kwarg:

`en` `fr` `de` `es` `it` `pt` `ja` `ko` `zh` `ru` `pl` `nl` `tr` `ar` `hi` `vi` `th` `id` `cs` `ro` `hu` `el` `da` `sv` `fi` `no` `he` `uk` `bg` `hr` `sk`.
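The preset voice names and language codes listed above can be checked before a request reaches the model. The following is a minimal sketch that assumes nothing about how `Pipeline.generate` itself handles unknown values: `check_request`, `PRESET_VOICES`, and `SUPPORTED_LANGS` are hypothetical names built from the lists in this README, not part of `supertonic_3_mlx`.

```python
# Language codes copied from the Languages section of this README.
SUPPORTED_LANGS = frozenset(
    "en fr de es it pt ja ko zh ru pl nl tr ar hi vi th id cs ro hu el "
    "da sv fi no he uk bg hr sk".split()
)

# F1..F5 and M1..M5, as described in the Voices section.
PRESET_VOICES = frozenset(f"{g}{i}" for g in "FM" for i in range(1, 6))


def check_request(voice, lang):
    """Return (voice, lang) unchanged, or raise ValueError if either is unlisted."""
    if voice not in PRESET_VOICES:
        raise ValueError(
            f"unknown voice {voice!r}; expected one of {sorted(PRESET_VOICES)}"
        )
    if lang not in SUPPORTED_LANGS:
        raise ValueError(
            f"unknown language {lang!r}; expected one of {sorted(SUPPORTED_LANGS)}"
        )
    return voice, lang


voice, lang = check_request("F2", "fr")
```

The validated pair can then be passed on as `pipe.generate(text, voice=voice, lang=lang)`. Failing early with the list of accepted values is a design choice for clearer error messages; the library may already perform an equivalent check.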
## Architecture (short)

Four sub-models, all in `weights/*.safetensors`:

| Sub-model            | Role                                | Params | Size    |
|----------------------|-------------------------------------|--------|---------|
| `vector_estimator`   | 24-block CFG flow-matching velocity | ~64 M  | 256 MB  |
| `text_encoder`       | Character → 256-D text embedding    | ~9 M   | 36 MB   |
| `duration_predictor` | Text → seconds                      | ~1 M   | 3.5 MB  |
| `vocoder`            | Latent (B,144,T) → 44.1 kHz wav     | ~25 M  | 101 MB  |

The pipeline runs **exactly 5 Euler steps** with classifier-free guidance (`4×cond − 3×uncond`). This schedule is trained-in: reducing the step count or disabling CFG produces an essentially uncorrelated waveform (verified empirically; see the `bench_n_steps.py` script in the source repo).

## Loading from a local snapshot

Three layouts are auto-detected by `Pipeline.from_pretrained`:

1. **Hugging Face repo id** (e.g. `"ambassadia/supertonic-3-mlx"`): auto-download
2. **Local path containing `weights/`** (this layout): fastest cold-load
3. **Local path containing `onnx/`** (upstream snapshot): converts at load time

## License

This release combines two artefact classes under two distinct licenses:

- **Model weights** (`weights/*.safetensors`): **BigScience Open RAIL-M**. See [`LICENSE`](LICENSE) for the full text. The Attachment A use restrictions are reproduced below and apply to all downstream use of the model and of generated audio.
- **Port code** (`src/supertonic_3_mlx/`): **Apache License 2.0**. See [`LICENSE-CODE`](LICENSE-CODE).

See [`NOTICE`](NOTICE) for the modifications statement and the upstream attribution.

### OpenRAIL-M Attachment A: use restrictions

You agree not to use the model or derivatives:

(a) In any way that violates any applicable national, federal, state, local or international law or regulation.

(b) For the purpose of exploiting, harming or attempting to exploit or harm minors in any way.


(c) To generate or disseminate verifiably false information and/or content with the purpose of harming others.

(d) To generate or disseminate personal identifiable information that can be used to harm an individual.

(e) To generate or disseminate information and/or content (e.g. images, code, posts, articles), and place the information and/or content in any context (e.g. bot generating tweets) **without expressly and intelligibly disclaiming that the information and/or content is machine generated**.

(f) To defame, disparage or otherwise harass others.

(g) To impersonate or attempt to impersonate (e.g. **deepfakes**) others without their consent.

(h) For fully automated decision making that adversely impacts an individual's legal rights or otherwise creates or modifies a binding, enforceable obligation.

(i) For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics.

(j) To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm.

(k) For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.

(l) **To provide medical advice and medical results interpretation.**

(m) To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment.
## Citation

```bibtex
@misc{supertonic3-mlx,
  title  = {Supertonic 3 MLX: native Apple Silicon port of Supertone's multilingual TTS},
  author = {Dupont, Olivier},
  year   = {2026},
  url    = {https://huggingface.co/ambassadia/supertonic-3-mlx},
  note   = {Derivative of Supertone/supertonic-3 (https://huggingface.co/Supertone/supertonic-3)}
}
```

Please also cite the upstream Supertone Supertonic 3 model when using this port.