fix(critical): Euler timestep off-by-one + missing <lang> tag in tokenizer
Two coupled bugs produced structureless audio on the v0.1.0 release
(Whisper transcribes it as hallucinations such as 'Société Radio-Canada').
Fix #1 — Euler timestep schedule (PRIMARY, smoking gun)
ONNX SDK passes current_step = 0..N-1 → t_norm = [0.0, 0.2, 0.4, 0.6, 0.8].
We were passing step + 1 → [0.2, 0.4, 0.6, 0.8, 1.0].
Flow-matching is trained on the SDK schedule; the off-by-one collapses
the trajectory to noise (ONNX-only ablation: wav cosine 0.0037 vs ref).
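
A minimal sketch of the two schedules, assuming only the formula quoted above (t_norm = current_step / total_step). The helper name `t_norm_schedule` and its `offset` parameter are hypothetical and exist only for illustration; they are not part of the pipeline or the SDK.

```python
def t_norm_schedule(n_steps: int, offset: int = 0) -> list[float]:
    """Return t_norm = (step + offset) / n_steps for step in 0..n_steps-1."""
    return [(step + offset) / n_steps for step in range(n_steps)]


# SDK-style schedule: current_step = step
sdk_schedule = t_norm_schedule(5)

# Buggy schedule: current_step = step + 1 (the off-by-one this commit removes)
buggy_schedule = t_norm_schedule(5, offset=1)

print("sdk:  ", sdk_schedule)
print("buggy:", buggy_schedule)
```

With the shifted schedule the model is never queried at t_norm = 0.0 and is instead queried at t_norm = 1.0, one step ahead of the state it is actually integrating.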
Fix #2 — text preprocessing (SECONDARY)
Supertonic 3 wraps utterances in <lang>text</lang> via the SDK's
UnicodeProcessor; we were emitting raw character IDs and ignoring lang.
Min-viable port: NFKD normalisation + whitespace collapse + trailing
period + language token wrap. Whisper output is bit-identical to that
obtained with the full SDK preprocessor (verified inline).
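
As a worked example of the four steps listed above, the sketch below runs them on a short French string. `preprocess_sketch` and `ENDING_PUNCT` are illustrative names and this is not the SDK's UnicodeProcessor; it only mirrors the steps named in this message.

```python
import re
import unicodedata

# Characters treated as "already punctuated" (assumed set, for illustration).
ENDING_PUNCT = ".!?,;:'\")]}»›"


def preprocess_sketch(text: str, lang: str = "en") -> str:
    # 1. NFKD: accented letters decompose into base letter + combining mark.
    text = unicodedata.normalize("NFKD", text)
    # 2. Collapse runs of whitespace and strip the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # 3. Add a trailing period if the text does not end with punctuation.
    if text and text[-1] not in ENDING_PUNCT:
        text += "."
    # 4. Wrap in language tokens.
    return f"<{lang}>{text}</{lang}>"


out = preprocess_sketch("  Société   Radio-Canada ", lang="fr")
print(out)
print([hex(ord(c)) for c in out if ord(c) > 127])  # combining marks only
```

After NFKD each "é" becomes "e" followed by U+0301 (combining acute accent), so the character-level encoder sees two code points per accented letter rather than one.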
Measured impact (FR test phrase, Whisper-large-v3):
  before: 10/10 voices → 0% word overlap (Whisper hallucinations only)
  after:  M2 56%, F1/F2 25%, F3 19%, F5/M1 12%, M4 6%, F4/M3/M5 0%
Audio is now structurally voiced French with target words appearing in
the best voices, but still falls short of the ONNX SDK 81-88% ceiling.
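
This message does not define "word overlap". One plausible reading, sketched below as an assumption, is the fraction of reference words that also appear in the Whisper transcript, ignoring case and punctuation. The function name `word_overlap` and the sample strings are hypothetical.

```python
import re


def word_overlap(reference: str, transcript: str) -> float:
    """Fraction of reference words found in the transcript (assumed metric)."""
    ref_words = re.findall(r"\w+", reference.lower())
    hyp_words = set(re.findall(r"\w+", transcript.lower()))
    if not ref_words:
        return 0.0
    hits = sum(1 for word in ref_words if word in hyp_words)
    return hits / len(ref_words)


# Toy strings, not the actual FR test phrase.
score = word_overlap("bonjour tout le monde", "Bonjour le monde entier.")
hallucinated = word_overlap("bonjour tout le monde", "Société Radio-Canada")
print(f"{score:.0%} vs {hallucinated:.0%}")
```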
A per-step Euler bisect (same conditioning, ONNX vs MLX VE side-by-side)
shows the residual bug is in the VE velocity prediction: cosine drops
1.000 → 0.9995 → 0.965 → 0.889 → 0.673 → 0.453 across steps 0..5,
i.e. a ~0.05% per-step drift that compounds exponentially. The
investigation continues in a follow-up commit.
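
A per-step bisect of this kind can be sketched as follows, assuming the intermediate states of both backends have been dumped as flat lists of floats. The names `cosine` and `first_divergent_step`, the 0.999 threshold, and the toy vectors are illustrative assumptions, not the debug script used here.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def first_divergent_step(ref_states, test_states, threshold=0.999):
    """Return the first step whose cosine falls below threshold, else None."""
    for step, (ref, test) in enumerate(zip(ref_states, test_states)):
        if cosine(ref, test) < threshold:
            return step
    return None


# Toy 2-D "latents": identical at step 0, slightly off at 1, far off at 2.
ref_states = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
test_states = [[1.0, 0.0], [1.0, 0.01], [1.0, 1.0]]
print(first_divergent_step(ref_states, test_states))
```

Feeding both backends the same noise and conditioning matters here: otherwise the cosine at step 0 is already below 1 and the bisect cannot attribute the drift to a specific step.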
Repos remain PRIVATE on HF + GitHub until full fix lands.
@@ -25,7 +25,7 @@ Flow:
 
 Public API:
 
-pipe = SupertonicMLXPipeline.from_pretrained("ambassadia/supertonic-3-mlx")
+pipe = SupertonicMLXPipeline.from_pretrained("/tmp/supertonic3/model")
 wav = pipe.generate("Hello world", voice="F1", lang="en")
 import soundfile as sf
 sf.write("out.wav", wav, pipe.sample_rate)
@@ -214,15 +214,49 @@ def _load_into(model, weights: dict) -> int:
 # ── Tokenization ────────────────────────────────────────────────────
 
 
-def _encode_text(text: str, indexer: list[int], lang: str = "en") -> np.ndarray:
-    """Encode a text string into character IDs.
+_ENDING_PUNCT = ".!?,;:'\")]}»›"
 
-    The unicode_indexer is a flat list of size 65536; ``indexer[ord(c)]`` gives
-    the token ID for character ``c`` (-1 = unknown). For Phase T.4 we wrap the
-    text with no special language tokens — the ONNX SDK uses language tags but
-    our pipeline currently runs unconditioned on language for the first WAV
-    emission (parity validation happens after).
+
+def _preprocess_text(text: str, lang: str = "en") -> str:
+    """Mirror the SDK's UnicodeProcessor._preprocess_text contract.
+
+    Supertonic 3 is multilingual; the model is trained with utterances
+    wrapped in ``<lang>...</lang>`` language tokens (Supertone's
+    ``UnicodeProcessor._add_language_token``). Bypassing this wrapping was
+    the secondary bug that compounded with the off-by-one Euler schedule to
+    produce structureless audio (verified by ONNX-only ablation in
+    ``debug/supertonic3_schedule_ablation.py``).
+
+    Minimum viable port of the SDK's pipeline:
+    1. NFKD unicode normalisation
+    2. Whitespace collapse + strip
+    3. Trailing period if the string doesn't end with punctuation
+    4. Language token wrap ``<lang>text</lang>``
+
+    The SDK additionally performs emoji removal, symbol normalisation,
+    abbreviation expansion, and quote deduplication — those are quality
+    polish and can be ported later; they are not load-bearing for the
+    primary fix.
     """
+    import unicodedata, re
+    text = unicodedata.normalize("NFKD", text)
+    text = re.sub(r"\s+", " ", text).strip()
+    if text and text[-1] not in _ENDING_PUNCT:
+        text += "."
+    if lang is not None:
+        text = f"<{lang}>{text}</{lang}>"
+    return text
+
+
+def _encode_text(text: str, indexer: list[int], lang: str = "en") -> np.ndarray:
+    """Encode a text string into character IDs via the SDK-compatible pipeline.
+
+    ``indexer`` is a flat list of size 65536; ``indexer[ord(c)]`` gives the
+    token ID for character ``c`` (-1 = unknown). The text is first
+    preprocessed via :func:`_preprocess_text` so the encoding matches what
+    Supertonic 3 was trained on (NFKD-normalised + ``<lang>``-wrapped).
+    """
+    text = _preprocess_text(text, lang=lang)
     ids = []
     for c in text:
         cp = ord(c)
@@ -231,7 +265,6 @@ def _encode_text(text: str, indexer: list[int], lang: str = "en") -> np.ndarray:
         if tok >= 0:
             ids.append(tok)
     if not ids:
-        # fallback to a single space token to avoid empty input
         ids = [indexer[ord(" ")]] if indexer[ord(" ")] >= 0 else [0]
     return np.asarray(ids, dtype=np.int32)
 
@@ -522,11 +555,18 @@ class SupertonicMLXPipeline:
         for k, v in style_kv:
             kv_flat.extend([k, v])
 
-        # Euler with CFG — 5 steps by default
+        # Euler with CFG — 5 steps by default.
+        # NOTE: ONNX SDK passes ``current_step = 0..N-1`` and computes
+        # ``t_norm = current_step / total_step`` → schedule = [0.0, 0.2,
+        # 0.4, 0.6, 0.8]. Previously we were passing ``step + 1`` which
+        # shifted the schedule to [0.2, 0.4, 0.6, 0.8, 1.0]; the flow-matching
+        # model is trained on the SDK schedule and the off-by-one collapses
+        # the audio to structureless noise (verified by ONNX-only ablation
+        # in debug/supertonic3_schedule_ablation.py — wav cosine 0.0037).
         x = noise
         total_step = mx.array([float(n_steps)], dtype=self.dtype)
         for step in range(n_steps):
-            current_step = mx.array([float(step + 1)], dtype=self.dtype)
+            current_step = mx.array([float(step)], dtype=self.dtype)
             t_norm = current_step / total_step
             t_norm_2 = mx.concatenate([t_norm, t_norm], axis=0)
             x = self._cached_step_compiled(