From d9f43c2531354fd4aa50ea1fb95dfaf9bf99fc2f Mon Sep 17 00:00:00 2001
From: ambassadia <ambassadia@users.noreply.github.com>
Date: Wed, 20 May 2026 09:48:20 +0200
Subject: [PATCH] docs: add multi-machine bench (M3 Ultra 45.8ms / M4 86.7ms /
 CoreML 303ms / ONNX 1200ms)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds the Newton-sentence benchmark numbers measured on two real Macs +
the upstream CoreML and ONNX baselines. Highlights:

- Mac Studio M3 Ultra: 45.8 ms wall median (best 39 ms), RTF x88
- MacBook Air M4:      86.7 ms wall median,               RTF x47
- M4 + CoreML:        303.5 ms wall median,               RTF x27
- M4 + ONNX SDK:     ~1200 ms wall median,               RTF ~x3

Same FR utterance, same warmup protocol, 5 warm runs each. The
ms-per-second-of-audio column is the honest backend comparison since the
two paths produce slightly different audio durations (DurationPredictor
+ CoreML's speed=1.05 give different timing). MLX wins 1.78× over the
CoreML build on identical M4 hardware, and ~35-40× over the upstream
ONNX SDK.

GPU memory footprint on the Ultra: 750 MB active, 844 MB peak.
---
 README.md | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/README.md b/README.md
index 9fe5767..8b99793 100644
--- a/README.md
+++ b/README.md
@@ -140,6 +140,32 @@ the development monorepo at
 [`gitea.tavportal.com/olivier/MLX_CONVERTOR`](https://gitea.tavportal.com/olivier/MLX_CONVERTOR);
 this repository ships the consolidated release artefacts only).
 
+### Multi-machine comparison
+
+Same French sentence
+(`"Un jour, Isaac Newton se promène dans son jardin quand une pomme lui tombe sur la tête. Eurêka, j'ai trouvé la loi de la gravitation !"`),
+4 s of audio, median of 5 warm runs, MLX FP32:
+
+| Hardware                                         | Wall    | RTF      | ms / s audio | Notes                            |
+|--------------------------------------------------|--------:|---------:|-------------:|----------------------------------|
+| Mac Studio **M3 Ultra** (80 GPU cores, 96 GB)    | 45.8 ms | **x88**  | 11.3         | best on this test                |
+| MacBook Air **M4** (10 GPU cores, 16 GB)         | 86.7 ms | x47      | 21.1         | reference consumer device        |
+| MacBook Air M4 — CoreML (mlpackage, CPU + NE)    | 303.5 ms| x27      | 37.7         | upstream CoreML build            |
+| MacBook Air M4 — ONNX SDK (`pip install supertonic`) | ~1200 ms| ~x3   | ~350         | upstream reference Python SDK    |
+
+The MLX path is ~ **1.78× faster than the CoreML build** on the same M4 hardware
+(MLX 21 ms / s of audio vs CoreML 38 ms / s of audio), and ~ **35–40×** the
+ONNX SDK reference. Memory footprint on M3 Ultra is 750 MB active /
+844 MB peak GPU memory; the M4 footprint is similar since the model size is
+fixed. The wall on small-utterance inputs is dispatch-bound (24 attention +
+ConvNeXt blocks × 5 Euler steps + the 10-block vocoder all run in ~ 45 ms
+on the Ultra); the M3 Ultra's 8× extra GPU cores buy ~ 2× wall because
+the workload doesn't fill them.
+
+Cold load: 15 ms from the local safetensors snapshot, ~ 17 s on first
+`from_pretrained` from the Hub (downloads 379 MB of weights via
+`hf_transfer`).
+
 Reference comparison: the CoreML build of the same model on the same hardware
 runs at ~x27 realtime. The MLX port is **~2-4× faster** end-to-end while
 remaining bit-identical to the ONNX Runtime reference on the vocoder