transcrilive 12dbf4a821 v0.1.0 — initial release

MLX-native port of Supertone's Supertonic 3 multilingual TTS. Runs the
full flow-matching + classifier-free-guidance pipeline at ~x100 realtime
on Apple Silicon, with audio cosine 1.0 vs the cached MLX path and
cosine 0.98 vs the upstream ONNX Runtime reference.

Weights are hosted at https://huggingface.co/ambassadia/supertonic-3-mlx
and auto-downloaded on first use; this repository ships the port code,
the model card, audio samples, and a zero-config setup_and_test.sh.

Install:
    pip install git+https://gitea.tavportal.com/olivier/supertonic-3-mlx.git

Quick test:
    git clone https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
    cd supertonic-3-mlx && ./setup_and_test.sh

Licenses (dual): model weights = BigScience Open RAIL-M (Section 4
propagation), port code = Apache-2.0. See LICENSE, LICENSE-CODE, NOTICE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 09:17:05 +02:00

Supertonic 3 — MLX-native

31-language text-to-speech at ~x100 realtime on Apple Silicon. This is a native MLX port of Supertone/supertonic-3 that runs the full flow-matching + classifier-free-guidance pipeline (DurationPredictor → TextEncoder → 24-block VectorEstimator (5 Euler steps) → 10-block Vocos vocoder) without ONNX, CoreML or any C++ runtime, using only MLX and NumPy.

Install

The package isn't on PyPI yet, so install it directly from this Gitea repository (or from a local checkout):

pip install git+https://gitea.tavportal.com/olivier/supertonic-3-mlx.git

Runtime dependencies are just mlx, numpy, and huggingface_hub (the last is needed only to download the weights). On first use, the ~400 MB weight bundle is downloaded from ambassadia/supertonic-3-mlx into your Hugging Face cache.

One-shot quickstart + sanity test

A zero-config end-to-end test script ships with the repo. Clone the repo, run the script, and it will create a fresh venv, install everything, version-check MLX (with an optional auto-upgrade), download the weights and synthesise an utterance into hello.wav:

git clone https://gitea.tavportal.com/olivier/supertonic-3-mlx.git
cd supertonic-3-mlx
./setup_and_test.sh                              # en F1, default text
./setup_and_test.sh fr F2 "Bonjour."             # custom lang / voice / text

Re-runs reuse the venv and the cached weights: a second invocation takes ~20 ms for the warm load plus ~30 ms per generate call.

Quickstart (after install)

# soundfile is not among the runtime dependencies listed above;
# install it separately (pip install soundfile) to save the audio.
import soundfile as sf

from supertonic_3_mlx import Pipeline

pipe = Pipeline.from_pretrained("ambassadia/supertonic-3-mlx")
wav = pipe.generate("Hello world from Apple Silicon.", voice="F1", lang="en")

# wav is a 1-D numpy.float32 array at 44.1 kHz
sf.write("hello.wav", wav, pipe.sample_rate)
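
If you would rather not add soundfile just to save the result, the standard library can write the same audio as 16-bit PCM. The sketch below is an illustration, not part of the package: write_wav_pcm16 is a hypothetical helper, and it assumes the samples are mono floats in the range [-1, 1], which is how the quickstart describes the output array.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_pcm16(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    pcm = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(bytes(pcm))


# Demo with a synthetic 0.1 s, 440 Hz tone standing in for generated speech.
sample_rate = 44100
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate // 10)]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "tone.wav")
    write_wav_pcm16(out_path, tone, sample_rate)
    with wave.open(out_path, "rb") as f:
        info = (f.getnchannels(), f.getsampwidth(),
                f.getframerate(), f.getnframes())
```

With the quickstart objects, the equivalent call should be write_wav_pcm16("hello.wav", wav, pipe.sample_rate). The per-sample loop is slow for long clips, so soundfile remains the better choice for anything beyond a quick check.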

Audio samples

Six samples across five languages, with a mix of male and female voices and of short and long utterances, all generated by the MLX pipeline at the wall times reported below.

EN · F1 · 2.79 s — "Hello world from Apple Silicon. Supertonic 3 runs at one hundred times real time."

EN · M1 · 3.90 s — "A gentle breeze moved through the open window while the children, still half-asleep, listened to the distant sound of the harbour bells."

FR · F2 · 3.41 s — "Bonjour, ceci est un test de synthèse vocale en français. Le modèle gère trente-et-une langues sur une puce M4."

DE · M2 · 3.69 s — "Guten Morgen. Dieses Modell läuft komplett auf Apple Silicon, ohne ONNX und ohne CoreML, in reinem MLX."

JA · F3 · 1.46 s — "こんにちは。これはアップルシリコン上でMLXを使ったテストです。"

ES · M3 · 2.86 s — "Hola, esto es una prueba de síntesis de voz en español ejecutada en tiempo real sobre Apple Silicon."

Benchmarks (Apple M4, FP32, median of 3)

Sample	Audio duration	MLX wall time	RTF (realtime factor)	ONNX SDK wall time	Speedup vs ONNX SDK
EN · F1 · short	2.79 s	36.6 ms	x76	1005 ms	28 ×
EN · M1 · long	3.90 s	38.4 ms	x102	1356 ms	35 ×
FR · F2	3.41 s	37.9 ms	x90	1196 ms	32 ×
DE · M2	3.69 s	38.1 ms	x97	1314 ms	35 ×
JA · F3	1.46 s	32.1 ms	x46	848 ms	26 ×
ES · M3	2.86 s	37.0 ms	x77	1002 ms	27 ×

Raw numbers are in bench_results.csv (regenerable via the development monorepo at gitea.tavportal.com/olivier/MLX_CONVERTOR; this repository ships the consolidated release artefacts only).
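
The realtime-factor (RTF) and speedup columns in the table above follow directly from the duration and wall-time columns. The short check below recomputes them from the published, rounded figures; realtime_factor and speedup are illustrative names, not functions from the repository. Because the inputs are rounded, the recomputed values can differ from the listed ones by up to about one unit.

```python
# (sample, audio seconds, MLX wall ms, ONNX SDK wall ms, listed RTF, listed speedup)
ROWS = [
    ("EN F1 short", 2.79, 36.6, 1005, 76, 28),
    ("EN M1 long", 3.90, 38.4, 1356, 102, 35),
    ("FR F2", 3.41, 37.9, 1196, 90, 32),
    ("DE M2", 3.69, 38.1, 1314, 97, 35),
    ("JA F3", 1.46, 32.1, 848, 46, 26),
    ("ES M3", 2.86, 37.0, 1002, 77, 27),
]


def realtime_factor(audio_s, wall_ms):
    """Seconds of audio produced per second of wall-clock time."""
    return audio_s / (wall_ms / 1000.0)


def speedup(baseline_ms, wall_ms):
    """How many times faster the MLX run is than the baseline run."""
    return baseline_ms / wall_ms


for name, audio_s, mlx_ms, onnx_ms, listed_rtf, listed_speedup in ROWS:
    rtf = realtime_factor(audio_s, mlx_ms)
    gain = speedup(onnx_ms, mlx_ms)
    print(f"{name:12s} RTF {rtf:6.1f} (listed x{listed_rtf})  "
          f"speedup {gain:5.1f} (listed {listed_speedup})")
```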

Reference comparison: the CoreML build of the same model on the same hardware runs at ~x27 realtime, so the MLX port (x46 to x102 in the table above) is roughly 2–4× faster end-to-end. Against the ONNX Runtime reference, the vocoder output matches at cosine 1.00 and the full estimator output at cosine ≥ 0.98.
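
The cosine figures are presumably cosine similarities between two flattened output arrays; the source does not spell out the exact procedure, so treat the following as a sketch of the metric rather than the benchmark script itself. The signals here are synthetic stand-ins, and cosine_similarity is an illustrative helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero signal")
    return dot / (norm_a * norm_b)


# Synthetic stand-ins: a reference signal and a slightly perturbed copy.
reference = [math.sin(0.01 * n) for n in range(1000)]
candidate = [s + 0.001 * math.cos(0.37 * n) for n, s in enumerate(reference)]

same = cosine_similarity(reference, reference)
close = cosine_similarity(reference, candidate)
print(f"identical: {same:.4f}, perturbed: {close:.4f}")
```

Cosine similarity ignores overall gain and is insensitive to very small element-wise differences, so a value that rounds to 1.00 indicates close numerical agreement rather than identical bits.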

Voices

10 preset voices — five female (F1–F5) and five male (M1–M5). The voice_styles/ directory contains both style_ttl (50×256 latent style for the audio path) and style_dp (8×16 style for the duration head) for each voice. Pass the voice name as the voice= kwarg to Pipeline.generate.

Languages

31 languages supported. Pass the ISO 639-1 code as the lang= kwarg: en fr de es it pt ja ko zh ru pl nl tr ar hi vi th id cs ro hu el da sv fi no he uk bg hr sk.
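
Since both the voice and the language are passed as plain strings, it can help to validate them before calling Pipeline.generate. The helper below is hypothetical and not part of the package; it only encodes the ten voice names and 31 language codes listed in this document, and it assumes the names are case-sensitive exactly as written.

```python
SUPPORTED_LANGS = frozenset(
    "en fr de es it pt ja ko zh ru pl nl tr ar hi vi th id cs ro hu "
    "el da sv fi no he uk bg hr sk".split()
)

# F1..F5 and M1..M5
PRESET_VOICES = frozenset(f"{g}{i}" for g in "FM" for i in range(1, 6))


def check_voice_and_lang(voice, lang):
    """Raise ValueError early if the voice or language is not listed."""
    if voice not in PRESET_VOICES:
        raise ValueError(
            f"unknown voice {voice!r}; expected one of {sorted(PRESET_VOICES)}"
        )
    if lang not in SUPPORTED_LANGS:
        raise ValueError(
            f"unknown language {lang!r}; expected one of {sorted(SUPPORTED_LANGS)}"
        )


check_voice_and_lang("F2", "fr")  # passes silently
```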

Architecture (short)

Four sub-models, all in weights/*.safetensors:

Sub-model	Role	Params	Size
vector_estimator	24-block CFG flow-matching velocity	~64 M	256 MB
text_encoder	Character → 256-D text embedding	~9 M	36 MB
duration_predictor	Text → seconds	~1 M	3.5 MB
vocoder	Latent (B,144,T) → 44.1 kHz wav	~25 M	101 MB
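
The sizes in the table are consistent with FP32 storage at 4 bytes per parameter, and they add up to roughly the ~400 MB download mentioned in the install notes. The check below makes two assumptions: FP32 weights (as in the benchmark setup) and 1 MB = 10^6 bytes. The parameter counts are the rounded values from the table, so the estimates are approximate.

```python
BYTES_PER_PARAM = 4  # assumption: weights stored as FP32

# name: (approximate parameter count, listed size in MB)
SUB_MODELS = {
    "vector_estimator": (64e6, 256.0),
    "text_encoder": (9e6, 36.0),
    "duration_predictor": (1e6, 3.5),
    "vocoder": (25e6, 101.0),
}

estimated_mb = {
    name: params * BYTES_PER_PARAM / 1e6
    for name, (params, _listed) in SUB_MODELS.items()
}
total_listed_mb = sum(listed for _params, listed in SUB_MODELS.values())

for name, (_params, listed) in SUB_MODELS.items():
    print(f"{name:20s} estimated {estimated_mb[name]:6.1f} MB, listed {listed:6.1f} MB")
print(f"total listed: {total_listed_mb:.1f} MB")
```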

The pipeline runs exactly 5 Euler steps with classifier-free guidance (4×cond − 3×uncond). This schedule is trained-in: reducing the step count or disabling CFG produces an essentially uncorrelated waveform (verified empirically — see the bench_n_steps.py script in the source repo).
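
The sampling loop described above can be sketched in a few lines. This is a toy illustration, not the port's code: toy_velocity, cfg_velocity and euler_sample are hypothetical names, the uniform time grid from t = 0 to t = 1 is an assumption, and the real VectorEstimator works on MLX arrays rather than Python lists. What the sketch does show is the arithmetic: 4×cond − 3×uncond is the usual classifier-free-guidance form uncond + s·(cond − uncond) with s = 4.

```python
def cfg_velocity(velocity, x, t, text, scale=4.0):
    """Blend conditional and unconditional velocities.

    scale * cond - (scale - 1) * uncond equals uncond + scale * (cond - uncond).
    """
    v_cond = velocity(x, t, text)
    v_uncond = velocity(x, t, None)  # None stands for "no text conditioning"
    return [scale * c - (scale - 1.0) * u for c, u in zip(v_cond, v_uncond)]


def euler_sample(velocity, x0, text, n_steps=5, scale=4.0):
    """Integrate dx/dt = guided velocity with fixed-step Euler updates."""
    x = list(x0)
    dt = 1.0 / n_steps  # assumption: uniform steps over t in [0, 1]
    for i in range(n_steps):
        t = i * dt
        v = cfg_velocity(velocity, x, t, text, scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, text):
    """Stand-in for the learned network: constant velocities, for illustration only."""
    return [1.0] * len(x) if text is not None else [0.5] * len(x)


result = euler_sample(toy_velocity, [0.0, 0.0], "hello")
print(result)
```

With these constant toy velocities the guided velocity is 4 × 1.0 − 3 × 0.5 = 2.5, so integrating over the unit interval moves each coordinate by 2.5 regardless of the step count. A learned velocity field does not have that property, which is in line with the observation that changing the step count degrades the output.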

Loading from a local snapshot

Pipeline.from_pretrained auto-detects three kinds of source:

Hugging Face repo id (e.g. "ambassadia/supertonic-3-mlx") — auto-download
Local path containing weights/ (this layout) — fastest cold-load
Local path containing onnx/ (upstream snapshot) — converts at load time
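
One plausible way to tell these three cases apart is to look at the filesystem first and fall back to treating the string as a Hub repo id. The function below is a guess at the general idea, not the package's actual logic: detect_source_kind and its return labels are hypothetical, and the real loader may apply different or additional checks.

```python
import tempfile
from pathlib import Path


def detect_source_kind(source):
    """Classify a from_pretrained-style argument (hypothetical helper)."""
    path = Path(source).expanduser()
    if path.is_dir():
        if (path / "weights").is_dir():
            return "local-weights"   # converted safetensors, loaded directly
        if (path / "onnx").is_dir():
            return "local-onnx"      # upstream snapshot, converted at load time
        raise ValueError(f"{path} contains neither weights/ nor onnx/")
    return "hub-repo-id"             # not a local directory: assume a repo id


# Demo on throwaway directories that mimic the two local layouts.
with tempfile.TemporaryDirectory() as tmp:
    mlx_dir = Path(tmp) / "mlx_snapshot"
    (mlx_dir / "weights").mkdir(parents=True)
    onnx_dir = Path(tmp) / "upstream_snapshot"
    (onnx_dir / "onnx").mkdir(parents=True)
    kinds = [
        detect_source_kind(str(mlx_dir)),
        detect_source_kind(str(onnx_dir)),
        detect_source_kind("ambassadia/supertonic-3-mlx"),
    ]

print(kinds)
```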

License

This release combines two artefact classes under two distinct licenses:

Model weights (weights/*.safetensors) — BigScience Open RAIL-M. See LICENSE for the full text. The Attachment A use restrictions are reproduced below and apply to all downstream use of the model and of generated audio.
Port code (src/supertonic_3_mlx/) — Apache License 2.0. See LICENSE-CODE.

See NOTICE for the modifications statement and the upstream attribution.

OpenRAIL-M Attachment A — use restrictions

You agree not to use the model or derivatives:

(a) In any way that violates any applicable national, federal, state, local or international law or regulation.

(b) For the purpose of exploiting, harming or attempting to exploit or harm minors in any way.

(d) To generate or disseminate personal identifiable information that can be used to harm an individual.

(e) To generate or disseminate information and/or content (e.g. images, code, posts, articles), and place the information and/or content in any context (e.g. bot generating tweets) without expressly and intelligibly disclaiming that the information and/or content is machine generated.

(f) To defame, disparage or otherwise harass others.

(g) To impersonate or attempt to impersonate (e.g. deepfakes) others without their consent.

(h) For fully automated decision making that adversely impacts an individual's legal rights or otherwise creates or modifies a binding, enforceable obligation.

(i) For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics.

(j) To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm.

(k) For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.

(l) To provide medical advice and medical results interpretation.

(m) To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment.

Citation

@misc{supertonic3-mlx,
  title  = {Supertonic 3 MLX: native Apple Silicon port of Supertone's multilingual TTS},
  author = {Dupont, Olivier},
  year   = {2026},
  url    = {https://huggingface.co/ambassadia/supertonic-3-mlx},
  note   = {Derivative of Supertone/supertonic-3 (https://huggingface.co/Supertone/supertonic-3)}
}

Please also cite the upstream Supertone Supertonic 3 model when using this port.
