supertonic-3-mlx/README.md
transcrilive 12dbf4a821 v0.1.0 — initial release
MLX-native port of Supertone's Supertonic 3 multilingual TTS. Runs the
full flow-matching + classifier-free-guidance pipeline at ~x100 realtime
on Apple Silicon, with audio cosine 1.0 vs the cached MLX path and
cosine 0.98 vs the upstream ONNX Runtime reference.

Weights are hosted at https://huggingface.co/ambassadia/supertonic-3-mlx
and auto-downloaded on first use; this repository ships the port code,
the model card, audio samples, and a zero-config setup_and_test.sh.

Install:
    pip install git+https://gitea.tavportal.com/olivier/supertonic-3-mlx.git

Quick test:
    git clone https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
    cd supertonic-3-mlx && ./setup_and_test.sh

Licenses (dual): model weights = BigScience Open RAIL-M (Section 4
propagation), port code = Apache-2.0. See LICENSE, LICENSE-CODE, NOTICE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:17:05 +02:00


license: openrail
license_link: LICENSE
language: en, fr, de, es, it, pt, ja, ko, zh, ru, pl, nl, tr, ar, hi, vi, th, id, cs, ro, hu, el, da, sv, fi, no, he, uk, bg, hr, sk
pipeline_tag: text-to-speech
tags: mlx, apple-silicon, tts, text-to-speech, speech-synthesis, supertonic, multilingual, flow-matching
library_name: supertonic-3-mlx
base_model: Supertone/supertonic-3
inference: false

Supertonic 3 — MLX-native

31-language text-to-speech at ~x100 realtime on Apple Silicon. This is a native MLX port of Supertone/supertonic-3. It runs the full flow-matching + classifier-free-guidance pipeline (DurationPredictor → TextEncoder → 24-block VectorEstimator with 5 Euler steps → 10-block Vocos vocoder) without ONNX, CoreML or any C++ runtime: only MLX + NumPy.

Install

The package isn't on PyPI yet; install it directly from this Gitea repository (or from a local checkout):

pip install git+https://gitea.tavportal.com/olivier/supertonic-3-mlx.git

Runtime dependencies are just mlx, numpy, and huggingface_hub (the last only for the one-line weight download). On first use, the ~400 MB weight bundle is downloaded from ambassadia/supertonic-3-mlx into your Hugging Face cache.

One-shot quickstart + sanity test

A zero-config end-to-end test script ships with the repo. Clone the repo, run the script, and it will create a fresh venv, install everything, version-check MLX (with an optional auto-upgrade), download the weights and synthesise an utterance into hello.wav:

git clone https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
cd supertonic-3-mlx
./setup_and_test.sh                              # en F1, default text
./setup_and_test.sh fr F2 "Bonjour."             # custom lang / voice / text

Re-runs reuse the venv and the cached weights: a second invocation takes ~20 ms for the warm load plus ~30 ms per generate call.

Quickstart (after install)

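import soundfile as sf  # not a runtime dependency of this package; install separately (pip install soundfile)
from supertonic_3_mlx import Pipeline

# Downloads the weights on first use, then loads them from the Hugging Face cache.
pipe = Pipeline.from_pretrained("ambassadia/supertonic-3-mlx")

# wav is a 1-D numpy.float32 array at 44.1 kHz
wav = pipe.generate("Hello world from Apple Silicon.", voice="F1", lang="en")

sf.write("hello.wav", wav, pipe.sample_rate)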
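
soundfile is not among the runtime dependencies listed in the Install section, so it has to be installed separately. As a dependency-free alternative, here is a minimal sketch that writes mono float samples to a 16-bit PCM WAV file with the standard library only. The helper name write_wav_int16 is hypothetical and not part of the package; it assumes the samples lie in [-1, 1] and clips anything outside that range.

```python
import struct
import wave


def write_wav_int16(path, samples, sample_rate=44100):
    """Write mono float samples in [-1, 1] as 16-bit PCM (hypothetical helper)."""
    # Clip to the valid range, then scale to the signed 16-bit range.
    pcm = [int(round(max(-1.0, min(1.0, float(s))) * 32767)) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)             # mono
        f.setsampwidth(2)             # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        # "<h" = little-endian signed 16-bit, the layout WAV PCM expects.
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
```

With the quickstart objects, this would be called as write_wav_int16("hello.wav", wav, pipe.sample_rate). Note that it quantises to 16 bits, whereas soundfile can also store float data.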

Audio samples

Six samples in five languages, with a mix of male and female voices and of short and long utterances, all generated by the MLX pipeline at the wall times reported below.

  EN · F1 · 2.79 s — "Hello world from Apple Silicon. Supertonic 3 runs at one hundred times real time."

  EN · M1 · 3.90 s — "A gentle breeze moved through the open window while the children, still half-asleep, listened to the distant sound of the harbour bells."

  FR · F2 · 3.41 s — "Bonjour, ceci est un test de synthèse vocale en français. Le modèle gère trente-et-une langues sur une puce M4."

  DE · M2 · 3.69 s — "Guten Morgen. Dieses Modell läuft komplett auf Apple Silicon, ohne ONNX und ohne CoreML, in reinem MLX."

  JA · F3 · 1.46 s — "こんにちは。これはアップルシリコン上でMLXを使ったテストです。"

  ES · M3 · 2.86 s — "Hola, esto es una prueba de síntesis de voz en español ejecutada en tiempo real sobre Apple Silicon."

Benchmarks (Apple M4, FP32, median of 3)

| Sample | Duration | MLX wall | RTF | ONNX SDK | Speedup |
|---|---|---|---|---|---|
| EN · F1 · short | 2.79 s | 36.6 ms | x76 | 1005 ms | 28× |
| EN · M1 · long | 3.90 s | 38.4 ms | x102 | 1356 ms | 35× |
| FR · F2 | 3.41 s | 37.9 ms | x90 | 1196 ms | 32× |
| DE · M2 | 3.69 s | 38.1 ms | x97 | 1314 ms | 35× |
| JA · F3 | 1.46 s | 32.1 ms | x46 | 848 ms | 26× |
| ES · M3 | 2.86 s | 37.0 ms | x77 | 1002 ms | 27× |

Raw numbers are in bench_results.csv (regenerable via the development monorepo at gitea.tavportal.com/olivier/MLX_CONVERTOR; this repository ships the consolidated release artefacts only).

Reference comparison: the CoreML build of the same model on the same hardware runs at ~x27 realtime, so the MLX port is ~2-4× faster end-to-end than CoreML. Against the ONNX Runtime reference, it reaches cosine 1.00 on the vocoder output and cosine ≥ 0.98 on the full estimator output.
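
This README does not say how the cosine figures were computed. A common definition, assumed here, is the cosine similarity between the two outputs flattened to equal-length sample sequences. A minimal standard-library sketch:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length sample sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero signal")
    return dot / (norm_a * norm_b)


reference = [0.0, 0.5, -0.25, 1.0]
louder = [2.0 * x for x in reference]   # same shape, different gain
print(cosine_similarity(reference, louder))
```

Cosine similarity ignores overall gain, so a value of 1.00 says the waveforms have the same shape; checking sample-level equality needs an additional measure such as the maximum absolute difference.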

Voices

10 preset voices: five female (F1–F5) and five male (M1–M5). The voice_styles/ directory contains both style_ttl (50×256 latent style for the audio path) and style_dp (8×16 style for the duration head) for each voice. Pass the voice name as the voice= kwarg to Pipeline.generate.

Languages

31 languages supported. Pass the ISO 639-1 code as the lang= kwarg: en fr de es it pt ja ko zh ru pl nl tr ar hi vi th id cs ro hu el da sv fi no he uk bg hr sk.
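
The voice names and language codes listed in this section and the previous one can be checked on the caller side before synthesis. This is a sketch only: check_request is a hypothetical helper, and the package may already perform its own validation.

```python
VOICES = {f"{g}{i}" for g in "FM" for i in range(1, 6)}   # F1..F5, M1..M5
LANGS = set(
    "en fr de es it pt ja ko zh ru pl nl tr ar hi vi th id cs ro "
    "hu el da sv fi no he uk bg hr sk".split()
)


def check_request(voice, lang):
    """Raise ValueError early for an unknown voice or language (hypothetical helper)."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {sorted(VOICES)}")
    if lang not in LANGS:
        raise ValueError(f"unknown lang {lang!r}; expected one of {sorted(LANGS)}")
    return voice, lang
```

A call such as check_request("F2", "fr") returns the pair unchanged, so it can wrap the arguments passed to Pipeline.generate.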

Architecture (short)

Four sub-models, all in weights/*.safetensors:

| Sub-model | Role | Params | Size |
|---|---|---|---|
| vector_estimator | 24-block CFG flow-matching velocity | ~64 M | 256 MB |
| text_encoder | Character → 256-D text embedding | ~9 M | 36 MB |
| duration_predictor | Text → seconds | ~1 M | 3.5 MB |
| vocoder | Latent (B,144,T) → 44.1 kHz wav | ~25 M | 101 MB |
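
The sizes in the table are consistent with FP32 storage at 4 bytes per parameter. The check below assumes that storage format (the benchmarks are labelled FP32) and uses the rounded parameter counts from the table, so it only approximates the listed sizes.

```python
BYTES_PER_PARAM_FP32 = 4

# Approximate parameter counts from the table, in millions.
PARAMS_M = {
    "vector_estimator": 64,
    "text_encoder": 9,
    "duration_predictor": 1,
    "vocoder": 25,
}


def fp32_size_mb(params_millions):
    """Approximate FP32 storage in MB (1 MB = 10**6 bytes)."""
    return params_millions * BYTES_PER_PARAM_FP32


total_mb = sum(fp32_size_mb(p) for p in PARAMS_M.values())
print(total_mb)
```

The total lands close to the ~400 MB download size quoted in the Install section.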

The pipeline runs exactly 5 Euler steps with classifier-free guidance (4×cond − 3×uncond, i.e. guidance scale 4). This schedule is trained-in: reducing the step count or disabling CFG produces an essentially uncorrelated waveform (verified empirically; see the bench_n_steps.py script in the source repo).
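
The sampling loop described here can be sketched in a few lines. This is an illustration, not the port's code: it assumes the guidance combines the two predictions as 4·cond − 3·uncond (the usual classifier-free-guidance form with scale 4), assumes time runs from 0 to 1 in equal steps, and uses plain Python lists with a toy velocity function in place of the real vector estimator and MLX arrays.

```python
def cfg_velocity(v_cond, v_uncond, scale=4.0):
    """Classifier-free guidance: scale*cond - (scale - 1)*uncond, element-wise."""
    return [scale * c - (scale - 1.0) * u for c, u in zip(v_cond, v_uncond)]


def euler_sample(x0, velocity_fn, n_steps=5, scale=4.0):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps.

    velocity_fn(x, t, conditioned) is a stand-in for the vector estimator.
    """
    x = list(x0)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        v = cfg_velocity(velocity_fn(x, t, True), velocity_fn(x, t, False), scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, conditioned):
    """Toy model: constant velocity of 1 when conditioned, 0 otherwise."""
    return [1.0 if conditioned else 0.0 for _ in x]


print(euler_sample([0.0, 0.0], toy_velocity))
```

Each step evaluates the estimator twice, once with and once without conditioning, which is why the guided schedule costs two forward passes per Euler step unless both are batched together.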

Loading from a local snapshot

Pipeline.from_pretrained accepts three kinds of source and auto-detects which one it was given:

  1. Hugging Face repo id (e.g. "ambassadia/supertonic-3-mlx") — auto-download
  2. Local path containing weights/ (this layout) — fastest cold-load
  3. Local path containing onnx/ (upstream snapshot) — converts at load time
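
One way such auto-detection could work is sketched below. This is not the package's actual implementation: detect_source and its return labels are hypothetical, and the rule of treating any non-directory argument as a Hugging Face repo id is an assumption.

```python
from pathlib import Path


def detect_source(name_or_path):
    """Classify a from_pretrained argument (hypothetical; not the package's code)."""
    path = Path(name_or_path)
    if path.is_dir():
        if (path / "weights").is_dir():
            return "local-mlx"      # converted safetensors, loaded directly
        if (path / "onnx").is_dir():
            return "local-onnx"     # upstream snapshot, converted at load time
        raise FileNotFoundError(f"{path} contains neither weights/ nor onnx/")
    return "hub"                    # anything else is treated as a repo id
```

Checking for weights/ before onnx/ mirrors the list order, so a directory containing both would take the faster path.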

License

This release combines two artefact classes under two distinct licenses:

  • Model weights (weights/*.safetensors) — BigScience Open RAIL-M. See LICENSE for the full text. The Attachment A use restrictions are reproduced below and apply to all downstream use of the model and of generated audio.
  • Port code (src/supertonic_3_mlx/) — Apache License 2.0. See LICENSE-CODE.

See NOTICE for the modifications statement and the upstream attribution.

OpenRAIL-M Attachment A — use restrictions

You agree not to use the model or derivatives:

(a) In any way that violates any applicable national, federal, state, local or international law or regulation.

(b) For the purpose of exploiting, harming or attempting to exploit or harm minors in any way.

(c) To generate or disseminate verifiably false information and/or content with the purpose of harming others.

(d) To generate or disseminate personal identifiable information that can be used to harm an individual.

(e) To generate or disseminate information and/or content (e.g. images, code, posts, articles), and place the information and/or content in any context (e.g. bot generating tweets) without expressly and intelligibly disclaiming that the information and/or content is machine generated.

(f) To defame, disparage or otherwise harass others.

(g) To impersonate or attempt to impersonate (e.g. deepfakes) others without their consent.

(h) For fully automated decision making that adversely impacts an individual's legal rights or otherwise creates or modifies a binding, enforceable obligation.

(i) For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics.

(j) To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm.

(k) For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.

(l) To provide medical advice and medical results interpretation.

(m) To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment.

Citation

@misc{supertonic3-mlx,
  title  = {Supertonic 3 MLX: native Apple Silicon port of Supertone's multilingual TTS},
  author = {Dupont, Olivier},
  year   = {2026},
  url    = {https://huggingface.co/ambassadia/supertonic-3-mlx},
  note   = {Derivative of Supertone/supertonic-3 (https://huggingface.co/Supertone/supertonic-3)}
}

Please also cite the upstream Supertone Supertonic 3 model when using this port.