feat: initial public release v0.1.0 — MLX port of pyannote-speaker-diarization-3.1

Byte-parity with pyannote-PyTorch reference (cosine 0.763718 identical at 6 decimals on 200 cross-window slot pairs). 2.5x faster than pyannote-MPS on Apple Silicon native. Extracted from gitea.tavportal.com/olivier/MLX_CONVERTOR commit 5f9eafa.
2026-05-09 16:05:39 +02:00
commit 2b1a3c1312
30 changed files with 2022 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,18 @@
 __pycache__/
 *.py[cod]
 *.class
 *.so
 .Python
 .venv/
 venv/
 ENV/
 dist/
 build/
 *.egg-info/
 .eggs/
 .DS_Store
 .env
 *.log
 .pytest_cache/
 .ruff_cache/
 *.orig
--- a/21
+++ b/21
@@ -0,0 +1,21 @@
 MIT License
 Copyright (c) 2026 Olivier Dupont
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including without limitation the rights
 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the Software is
 furnished to do so, subject to the following conditions:
 The above copyright notice and this permission notice shall be included in all
 copies or substantial portions of the Software.
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
--- a/README.md
+++ b/README.md
@@ -0,0 +1,57 @@
 # pyannote-speaker-diarization-3.1-mlx
 First MLX port of pyannote-speaker-diarization-3.1 with byte-parity to the PyTorch reference. 2.5x faster than pyannote-MPS on Apple Silicon native.
 ## Install
 ```bash
 uv add "pyannote-speaker-diarization-3.1-mlx @ git+https://gitea.tavportal.com/olivier/pyannote-speaker-diarization-3.1-mlx.git"
 ```
 ## Quickstart
 ```python
 from pyannote_diarization_3_1_mlx import MlxDiarizationPipeline
 pipeline = MlxDiarizationPipeline.from_pretrained("pyannote/speaker-diarization-3.1")
 diarization = pipeline("audio.wav")
 for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s {speaker}")
 ```
 ## Parity
 | Evidence | MLX | Reference | Result |
 | --- | --- | --- | --- |
 | Cosine distance (200 cross-window pairs) | mean=0.763718 | pyannote-PyTorch mean=0.763718 | identical at 6 decimals |
 | 5h10 bench | 173s / 44 speakers / 1.27 GB | pyannote-MPS 431s / 43 speakers / 1.72 GB | Cross-DER 0.076 |
 ## Architecture
 SincNet → BiLSTM → Powerset(3,2) head + WeSpeaker ResNet34 speaker embedding + AgglomerativeClustering wrapper.
 ## Module Naming
 The repository name is `pyannote-speaker-diarization-3.1-mlx`; the Python import is `pyannote_diarization_3_1_mlx`. The import name follows PEP 8 and embeds the pyannote model version so future 4.0 ports can co-install.
 ## Citation
 This project ports the pyannote speaker diarization 3.1 pipeline architecture to MLX. Please cite the original pyannote.audio work when using this package:
 ```bibtex
@inproceedings{Plaquet23,
  author = {Alexis Plaquet and Hervé Bredin},
  title = {{Powerset multi-class cross entropy loss for neural speaker diarization}},
  booktitle = {Proc. INTERSPEECH 2023},
  year = {2023},
 }
 ```
 ## Provenance
 Extracted from MLX_CONVERTOR/src/mlxconv/diar at commit 5f9eafa. Maintained at https://gitea.tavportal.com/olivier/pyannote-speaker-diarization-3.1-mlx.
 ## License
 MIT
--- a/docs/parity-evidence.md
+++ b/docs/parity-evidence.md
@@ -0,0 +1,7 @@
 # Parity Evidence
 | Evidence | MLX | Reference | Result |
 | --- | --- | --- | --- |
 | Cosine distance parity | 200 cross-window pairs, mean 0.763718 | pyannote-PyTorch mean 0.763718 | identical at 6 decimals |
 | 5h10 bench results | 173s wall / 44 speakers / 1.27 GB peak RSS | pyannote-MPS 431s / 43 speakers / 1.72 GB | Cross-DER 0.076 |
 | Source commits | 8aa6c6d + 5f9eafa | feat/platform-abc in MLX_CONVERTOR | extraction source |
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -0,0 +1,36 @@
 [project]
 name = "pyannote-speaker-diarization-3.1-mlx"
 version = "0.1.0"
 description = "MLX port of pyannote/speaker-diarization-3.1 with byte-parity to PyTorch reference"
 readme = "README.md"
 requires-python = ">=3.12,<3.14"
 authors = [{ name = "Olivier Dupont", email = "olivier.dupont@taviramonaco.com" }]
 license = { text = "MIT" }
 keywords = ["mlx", "pyannote", "speaker-diarization", "apple-silicon"]
 classifiers = [
    "Programming Language :: Python :: 3.12",
    "License :: OSI Approved :: MIT License",
    "Operating System :: MacOS",
 ]
 dependencies = [
    "mlx>=0.21.0",
    "torch>=2.5.0",
    "torchaudio>=2.5.0",
    "huggingface_hub>=0.26.0",
    "safetensors>=0.4.5",
    "librosa>=0.10.2",
    "scipy>=1.14",
    "numpy>=2.0",
    "pyannote.audio>=4.0.4",
 ]
 [project.optional-dependencies]
 bench = ["psutil>=7.0"]
 dev = ["pytest>=8.3", "pytest-mock>=3.14", "ruff>=0.7"]
 [build-system]
 requires = ["hatchling"]
 build-backend = "hatchling.build"
 [tool.hatch.build.targets.wheel]
 packages = ["src/pyannote_diarization_3_1_mlx"]
--- a/scripts/bench.py
+++ b/scripts/bench.py
@@ -0,0 +1,161 @@
 """Benchmark MLX vs pyannote-MPS diarization on the same audio.
 Usage:
    uv run python scripts/benchmark_diar_backends.py <audio> \
        [--min-speakers N] [--max-speakers M]
 Runs both backends back-to-back, prints a Markdown table with wall time,
 speaker count, total speech duration, and cross-DER (MLX vs pyannote).
 """
 from __future__ import annotations
 import argparse
 import gc
 import sys
 import time
 from pathlib import Path
 import librosa
 import numpy as np
 import psutil
 import torch
 def _measure(label: str, fn) -> dict:
    """Run fn(), measure wall time + RSS delta + return result."""
    proc = psutil.Process()
    gc.collect()
    rss_before = proc.memory_info().rss
    t0 = time.time()
    annotation = fn()
    wall = time.time() - t0
    rss_peak = proc.memory_info().rss
    return {
        "label": label,
        "wall": wall,
        "rss_delta_gb": (rss_peak - rss_before) / 1e9,
        "rss_peak_gb": rss_peak / 1e9,
        "annotation": annotation,
    }
 def _stats(annotation) -> dict:
    speakers = sorted(set(annotation.labels()))
    turns = list(annotation.itertracks(yield_label=True))
    total_speech = sum(seg.duration for seg, _, _ in turns)
    # per-speaker totals
    by_speaker = {}
    for seg, _, lab in turns:
        by_speaker[lab] = by_speaker.get(lab, 0.0) + seg.duration
    return {
        "speakers": len(speakers),
        "turns": len(turns),
        "total_speech": total_speech,
        "by_speaker": dict(sorted(by_speaker.items(), key=lambda kv: -kv[1])),
    }
 def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__.splitlines()[0])
    parser.add_argument("audio")
    parser.add_argument("--min-speakers", type=int, default=10)
    parser.add_argument("--max-speakers", type=int, default=15)
    args = parser.parse_args()
    audio_path = Path(args.audio).expanduser().resolve()
    print(f"Loading {audio_path.name} (sr=16000, mono) ...", file=sys.stderr)
    sig, _ = librosa.load(str(audio_path), sr=16000, mono=True)
    duration_s = len(sig) / 16000
    print(f"  duration: {duration_s:.0f}s ({duration_s/60:.1f} min)", file=sys.stderr)
    diar_input = {
        "waveform": torch.from_numpy(sig).unsqueeze(0),
        "sample_rate": 16000,
    }
    kwargs = {"min_speakers": args.min_speakers, "max_speakers": args.max_speakers}
    results = []
    # 1. MLX pure
    print("\n=== MLX pure-MLX/scipy diarization ===", file=sys.stderr)
    from pyannote_diarization_3_1_mlx import MlxDiarizationPipeline
    mlx_pipe = MlxDiarizationPipeline.from_pretrained()
    r_mlx = _measure("mlx", lambda: mlx_pipe(diar_input, **kwargs))
    r_mlx.update(_stats(r_mlx["annotation"]))
    results.append(r_mlx)
    print(
        f"  wall={r_mlx['wall']:.1f}s  speakers={r_mlx['speakers']}  "
        f"speech={r_mlx['total_speech']:.0f}s  "
        f"rss_delta={r_mlx['rss_delta_gb']:.2f}GB",
        file=sys.stderr,
    )
    # free MLX before pyannote (we'll reuse the same Python proc)
    del mlx_pipe
    gc.collect()
    # 2. pyannote (MPS if available, else CPU)
    print("\n=== pyannote-audio 4.0.4 (MPS/PyTorch) ===", file=sys.stderr)
    from pyannote.audio import Pipeline
    pa_pipe = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
    if torch.backends.mps.is_available():
        try:
            pa_pipe.to(torch.device("mps"))
            print("  device: mps", file=sys.stderr)
        except Exception as e:
            print(f"  warning: mps failed ({e}); CPU fallback", file=sys.stderr)
    else:
        print("  device: cpu", file=sys.stderr)
    def _run_pa():
        out = pa_pipe(diar_input, **kwargs)
        ann = getattr(out, "exclusive_speaker_diarization", None) or out
        return ann
    r_pa = _measure("pyannote", _run_pa)
    r_pa.update(_stats(r_pa["annotation"]))
    results.append(r_pa)
    print(
        f"  wall={r_pa['wall']:.1f}s  speakers={r_pa['speakers']}  "
        f"speech={r_pa['total_speech']:.0f}s  "
        f"rss_delta={r_pa['rss_delta_gb']:.2f}GB",
        file=sys.stderr,
    )
    # 3. cross DER
    der_value = None
    try:
        from pyannote.metrics.diarization import DiarizationErrorRate
        der_value = DiarizationErrorRate()(r_pa["annotation"], r_mlx["annotation"])
        print(f"\nCross-DER (mlx vs pyannote ref): {der_value:.3f}", file=sys.stderr)
    except Exception as e:
        print(f"\nDER computation failed: {e}", file=sys.stderr)
    # Print Markdown table to stdout
    print()
    print("| Backend | Wall (s) | Realtime | Speakers | Turns | Speech (s) | RSS Δ (GB) |")
    print("|---|---:|---:|---:|---:|---:|---:|")
    for r in results:
        rt = duration_s / r["wall"] if r["wall"] > 0 else 0
        print(
            f"| {r['label']} | {r['wall']:.1f} | {rt:.1f}× | "
            f"{r['speakers']} | {r['turns']} | "
            f"{r['total_speech']:.0f} | {r['rss_delta_gb']:.2f} |"
        )
    print()
    if der_value is not None:
        print(f"Cross-DER (mlx vs pyannote): **{der_value:.3f}**")
    print()
    print("### Per-speaker speech time")
    for r in results:
        print(f"\n**{r['label']}** ({r['speakers']} speakers):")
        for sp, dur in list(r["by_speaker"].items())[:10]:
            print(f"  {sp}: {dur:.0f}s")
    return 0
 if __name__ == "__main__":
    sys.exit(main())
--- a/scripts/install_remote.sh
+++ b/scripts/install_remote.sh
@@ -0,0 +1,52 @@
 #!/usr/bin/env bash
 set -euo pipefail
 INSTALL_DIR="${1:-$HOME/pyannote-diarization-3.1-mlx-test}"
 INSTALL_DIR="${INSTALL_DIR/#\~/$HOME}"
 HTTPS_SPEC="pyannote-speaker-diarization-3.1-mlx @ git+https://gitea.tavportal.com/olivier/pyannote-speaker-diarization-3.1-mlx.git"
 SSH_SPEC="git+ssh://git@gitea.tavportal.com/olivier/pyannote-speaker-diarization-3.1-mlx.git"
 usage() {
  cat <<EOF
 Usage:
  $0 [install-dir]
 Creates a uv project and installs pyannote-speaker-diarization-3.1-mlx.
 Default install directory:
  $INSTALL_DIR
 EOF
 }
 if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
  usage
  exit 0
 fi
 if ! command -v uv >/dev/null 2>&1; then
  cat >&2 <<'EOF'
 uv is required but was not found.
 Install it with:
  curl -LsSf https://astral.sh/uv/install.sh | sh
 Then restart your shell and run this script again.
 EOF
  exit 1
 fi
 mkdir -p "$INSTALL_DIR"
 cd "$INSTALL_DIR"
 if [[ ! -f pyproject.toml ]]; then
  uv init --python 3.12
 else
  echo "Found existing pyproject.toml in $INSTALL_DIR; skipping uv init."
 fi
 echo "Installing from HTTPS..."
 if ! uv add "$HTTPS_SPEC"; then
  echo "HTTPS install failed; falling back to SSH pip install..."
  uv pip install "$SSH_SPEC"
 fi
 uv run python -c "from pyannote_diarization_3_1_mlx import MlxDiarizationPipeline; print('OK')"
--- a/src/pyannote_diarization_3_1_mlx/init.py
+++ b/src/pyannote_diarization_3_1_mlx/init.py
@@ -0,0 +1,4 @@
 """Pyannote 3.1 port to MLX. See docs/superpowers/specs/2026-05-08-pyannote-mlx-port-design.md."""
 from pyannote_diarization_3_1_mlx.pipeline import MlxDiarizationPipeline
 __all__ = ["MlxDiarizationPipeline"]
--- a/src/pyannote_diarization_3_1_mlx/_bilstm.py
+++ b/src/pyannote_diarization_3_1_mlx/_bilstm.py
@@ -0,0 +1,33 @@
 """4-layer monolithic bidirectional LSTM for pyannote PyanNet head.
 Pyannote uses torch.nn.LSTM with bidirectional=True, num_layers=4. We split
 into forward+backward stacks per layer; bias_ih + bias_hh are summed into a
 single MLX bias vector (mlx.nn.LSTM convention).
 """
 from __future__ import annotations
 import mlx.core as mx
 import mlx.nn as nn
 class BiLSTM4(nn.Module):
    def __init__(self, input_size: int, hidden_size: int = 128, num_layers: int = 4) -> None:
        super().__init__()
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        # one fwd + one bwd LSTM per layer
        self.fwd = []
        self.bwd = []
        in_dim = input_size
        for _ in range(num_layers):
            self.fwd.append(nn.LSTM(in_dim, hidden_size))
            self.bwd.append(nn.LSTM(in_dim, hidden_size))
            in_dim = hidden_size * 2  # next layer ingests concat
    def __call__(self, x: mx.array) -> mx.array:
        for f, b in zip(self.fwd, self.bwd):
            f_out, _ = f(x)
            x_rev = x[:, ::-1, :]
            b_out_rev, _ = b(x_rev)
            b_out = b_out_rev[:, ::-1, :]
            x = mx.concatenate([f_out, b_out], axis=-1)
        return x
--- a/src/pyannote_diarization_3_1_mlx/_config.py
+++ b/src/pyannote_diarization_3_1_mlx/_config.py
@@ -0,0 +1,36 @@
 """Locked hyperparameters and HF revisions for the pyannote 3.1 MLX port.
 These values come from upstream pyannote/speaker-diarization-3.1 config
 and the corresponding mlx-community port. Changing them = re-running the
 Day-1 sanity gate (Task 28).
 """
 from __future__ import annotations
 # Segmentation
 SEG_DURATION = 10.0
 SEG_HOP = 1.0
 SEG_FRAMES = 589
 SEG_CLASSES = 7
 MAX_SPEAKERS_PER_CHUNK = 3
 MAX_SPEAKERS_PER_FRAME = 2
 MIN_DURATION_ON = 0.70
 # Embedding
 EMB_BATCH_SIZE = 32
 EMB_EXCLUDE_OVERLAP = True
 EMB_DIM = 256
 # Clustering (pyannote.audio.pipelines.clustering.AgglomerativeClustering defaults)
 CLUSTER_METHOD = "centroid"
 CLUSTER_THRESHOLD = 0.7045654963945799
 CLUSTER_MIN_SIZE = 12
 CLUSTER_MAX_NUM_EMBEDDINGS = 1000
 # Audio
 SAMPLE_RATE = 16000
 # HF revisions — pinned per Codex review
 SEG_HF_REPO = "mlx-community/pyannote-segmentation-3.0-mlx"
 SEG_HF_REV = "5189a69b35c5f7e48082a978f3476bac81590874"
 EMB_HF_REPO = "mlx-community/wespeaker-voxceleb-resnet34-LM"
 EMB_HF_REV = "97fc9343d2cfd0ae4d1c1d8c299e0046aa502e31"
--- a/src/pyannote_diarization_3_1_mlx/_sincnet.py
+++ b/src/pyannote_diarization_3_1_mlx/_sincnet.py
@@ -0,0 +1,289 @@
 """SincNet block — MLX port of pyannote.audio.models.blocks.sincnet.
 Source of truth:
  pyannote/audio/models/blocks/sincnet.py  (MIT, CNRS)
  asteroid_filterbanks/param_sinc_fb.py   (MIT)
 Key conventions difference vs PyTorch:
  - PyTorch Conv1d / InstanceNorm1d use NCL (batch, channels, length).
  - MLX Conv1d / MaxPool1d / InstanceNorm  use NLC (batch, length, channels).
  - We accept (B, C, T) inputs (PyTorch NCL) and return (B, C, T) outputs so
    that the rest of the port can stay in PyTorch convention, but internally
    we transpose to NLC for every MLX primitive.
 """
 from __future__ import annotations
 import math
 import numpy as np
 import mlx.core as mx
 import mlx.nn as nn
 # ---------------------------------------------------------------------------
 # Helper: sinc function (normalised, matches PyTorch / numpy convention)
 # ---------------------------------------------------------------------------
 def _sinc(x: mx.array) -> mx.array:
    """Normalised sinc: sin(pi*x) / (pi*x), with sinc(0)=1."""
    # Avoid division by zero at x==0
    safe = mx.where(x == 0, mx.ones_like(x), x)
    result = mx.sin(math.pi * safe) / (math.pi * safe)
    return mx.where(x == 0, mx.ones_like(x), result)
 # ---------------------------------------------------------------------------
 # ParamSincFB — learnable sinc filterbank
 # ---------------------------------------------------------------------------
 class ParamSincFB(nn.Module):
    """Learnable sinc filterbank (MLX port of asteroid_filterbanks.ParamSincFB).
    Produces 2*n_filters_half (= n_filters) output channels: the first half
    are even (cos) filters and the second half are odd (sin) filters, following
    the SincNet paper and asteroid_filterbanks exactly.
    Parameters
    ----------
    n_filters : int
        Total number of output channels (must be even; n_filters//2 are
        parametric).
    kernel_size : int
        Length of each filter (must be odd; forced odd if even).
    stride : int
        Stride for the convolution over the waveform.
    sample_rate : float
        Sample rate (Hz). Used for Mel-scale initialisation.
    min_low_hz : float
        Minimum allowed low-cutoff frequency (Hz).
    min_band_hz : float
        Minimum allowed bandwidth (Hz).
    """
    def __init__(
        self,
        n_filters: int = 80,
        kernel_size: int = 251,
        stride: int = 1,
        sample_rate: float = 16000.0,
        min_low_hz: float = 50.0,
        min_band_hz: float = 50.0,
    ):
        super().__init__()
        if kernel_size % 2 == 0:
            kernel_size += 1  # force odd
        self.n_filters = n_filters
        self.kernel_size = kernel_size
        self.stride = stride
        self.sample_rate = sample_rate
        self.min_low_hz = min_low_hz
        self.min_band_hz = min_band_hz
        self.half_kernel = kernel_size // 2
        self.n_filters_half = n_filters // 2  # parametric filters (real part)
        # Initialise on Mel scale (mirrors _initialize_filters in upstream)
        low_hz = 30.0
        high_hz = sample_rate / 2.0 - (min_low_hz + min_band_hz)
        def to_mel(hz):
            return 2595.0 * np.log10(1.0 + hz / 700.0)
        def to_hz(mel):
            return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
        mel = np.linspace(
            to_mel(low_hz),
            to_mel(high_hz),
            self.n_filters_half + 1,
            dtype="float32",
        )
        hz = to_hz(mel)
        # Learnable parameters — shape (n_filters_half, 1) — stored as mx.array
        # so that MLX's tree_flatten / load_weights can see them as parameters.
        self.low_hz_ = mx.array(hz[:-1].reshape(-1, 1))
        self.band_hz_ = mx.array(np.diff(hz).reshape(-1, 1))
        # Hamming window for the left half (shape: (half_kernel,))
        window_np = np.hamming(kernel_size)[: self.half_kernel].astype("float32")
        self._window = mx.array(window_np)  # frozen buffer
        # Time vector: shape (1, half_kernel)  — values in seconds
        n_np = (
            2.0
            * np.pi
            * (np.arange(-self.half_kernel, 0.0, dtype="float32") / sample_rate)
        ).reshape(1, -1)
        self._n = mx.array(n_np)  # frozen buffer
    def _build_filters(self) -> mx.array:
        """Compute (n_filters, 1, kernel_size) filter bank from parameters.
        Mirrors ParamSincFB.filters() + make_filters() in asteroid_filterbanks.
        """
        low_hz_ = self.low_hz_    # (n_filters_half, 1) — already mx.array
        band_hz_ = self.band_hz_  # (n_filters_half, 1) — already mx.array
        low = self.min_low_hz + mx.abs(low_hz_)   # (nf_h, 1)
        high = mx.clip(
            low + self.min_band_hz + mx.abs(band_hz_),
            self.min_low_hz,
            self.sample_rate / 2.0,
        )  # (nf_h, 1)
        band = (high - low)[:, 0]  # (nf_h,)
        # ft_low / ft_high: (nf_h, half_kernel)  via outer product
        ft_low = mx.matmul(low, self._n)    # (nf_h, half_kernel)
        ft_high = mx.matmul(high, self._n)  # (nf_h, half_kernel)
        # --- Even (cos) filters ---
        bp_left_cos = (
            (mx.sin(ft_high) - mx.sin(ft_low)) / (self._n / 2.0)
        ) * self._window  # (nf_h, half_kernel)
        bp_center_cos = 2.0 * band.reshape(-1, 1)  # (nf_h, 1)
        bp_right_cos = bp_left_cos[:, ::-1]  # (nf_h, half_kernel) — reverse along kernel dim
        cos_filters = mx.concatenate(
            [bp_left_cos, bp_center_cos, bp_right_cos], axis=1
        )  # (nf_h, kernel_size)
        cos_filters = cos_filters / (2.0 * band[:, None])
        # --- Odd (sin) filters ---
        bp_left_sin = (
            (mx.cos(ft_low) - mx.cos(ft_high)) / (self._n / 2.0)
        ) * self._window  # (nf_h, half_kernel)
        bp_center_sin = mx.zeros((self.n_filters_half, 1))
        bp_right_sin = -bp_left_sin[:, ::-1]  # reverse along kernel dim
        sin_filters = mx.concatenate(
            [bp_left_sin, bp_center_sin, bp_right_sin], axis=1
        )  # (nf_h, kernel_size)
        sin_filters = sin_filters / (2.0 * band[:, None])
        # Concatenate → (n_filters, kernel_size)
        all_filters = mx.concatenate([cos_filters, sin_filters], axis=0)
        # Reshape to (n_filters, kernel_size, 1) — MLX conv weight layout:
        # (out_channels, kernel_size, in_channels)
        return all_filters.reshape(self.n_filters, self.kernel_size, 1)
    def __call__(self, x: mx.array) -> mx.array:
        """Apply sinc filterbank convolution.
        Parameters
        ----------
        x : mx.array, shape (B, T, 1)  [NLC — MLX convention]
        Returns
        -------
        mx.array, shape (B, T', n_filters)  [NLC]
        """
        filters = self._build_filters()  # (n_filters, kernel_size, 1)
        # MLX conv1d weight shape: (out_channels, kernel_size, in_channels)
        return mx.conv1d(x, filters, stride=self.stride, padding=0)
 # ---------------------------------------------------------------------------
 # SincNet
 # ---------------------------------------------------------------------------
 class SincNet(nn.Module):
    """SincNet block — MLX port of pyannote.audio.models.blocks.SincNet.
    Accepts and returns tensors in PyTorch NCL convention (B, C, T) so the
    downstream pipeline stays consistent with the pyannote checkpoint layout.
    Default stride=10 matches pyannote 3.1 PyanNet SINCNET_DEFAULTS.
    """
    def __init__(self, sample_rate: int = 16000, stride: int = 10):
        super().__init__()
        if sample_rate != 16000:
            raise NotImplementedError("SincNet only supports 16kHz audio for now.")
        self.sample_rate = sample_rate
        self.stride = stride
        # --- waveform normalisation ---
        # PyTorch: InstanceNorm1d(1, affine=True) on (B,1,T) = norm over T for
        # each batch item. MLX InstanceNorm operates on (..., C) — for a
        # (B, T, 1) tensor, C=1, which matches.
        self.wav_norm1d = nn.InstanceNorm(1, affine=True)
        # --- layer 0: sinc filterbank ---
        self.sinc_fb = ParamSincFB(
            n_filters=80,
            kernel_size=251,
            stride=stride,
            sample_rate=float(sample_rate),
            min_low_hz=50.0,
            min_band_hz=50.0,
        )
        self.pool0 = nn.MaxPool1d(kernel_size=3, stride=3)
        self.norm0 = nn.InstanceNorm(80, affine=True)
        # --- layer 1 ---
        self.conv1 = nn.Conv1d(80, 60, kernel_size=5, stride=1)
        self.pool1 = nn.MaxPool1d(kernel_size=3, stride=3)
        self.norm1 = nn.InstanceNorm(60, affine=True)
        # --- layer 2 ---
        self.conv2 = nn.Conv1d(60, 60, kernel_size=5, stride=1)
        self.pool2 = nn.MaxPool1d(kernel_size=3, stride=3)
        self.norm2 = nn.InstanceNorm(60, affine=True)
    def __call__(self, waveforms: mx.array) -> mx.array:
        """Forward pass.
        Parameters
        ----------
        waveforms : mx.array, shape (B, 1, T)  [PyTorch NCL convention]
        Returns
        -------
        mx.array, shape (B, 60, frames)  [PyTorch NCL convention]
        """
        # --- Convert NCL → NLC for MLX primitives ---
        # waveforms: (B, 1, T) → (B, T, 1)
        x = mx.transpose(waveforms, (0, 2, 1))
        # --- Waveform normalisation: InstanceNorm1d(1) ---
        # MLX InstanceNorm: input (..., C), normalises over the spatial dims
        # for each C separately. For (B, T, 1) it normalises over T, which
        # is the correct per-instance normalisation matching PyTorch.
        x = self.wav_norm1d(x)
        # === Layer 0: sinc filterbank ===
        # sinc_fb expects (B, T, 1) → returns (B, T', 80)
        x = self.sinc_fb(x)
        # abs() — mirrors torch.abs(outputs) at c==0 in upstream
        x = mx.abs(x)
        # pool → norm → activation
        # MaxPool1d: MLX expects (B, L, C) — matches NLC
        x = self.pool0(x)          # (B, T'/3, 80)
        x = self.norm0(x)          # (B, T'/3, 80)
        x = nn.leaky_relu(x)       # (B, T'/3, 80)
        # === Layer 1: Conv1d(80→60, k=5) ===
        # MLX Conv1d expects (B, L, C_in) → outputs (B, L', C_out)
        x = self.conv1(x)          # (B, T''-4, 60)
        x = self.pool1(x)          # (B, (T''-4)//3, 60)
        x = self.norm1(x)
        x = nn.leaky_relu(x)
        # === Layer 2: Conv1d(60→60, k=5) ===
        x = self.conv2(x)
        x = self.pool2(x)
        x = self.norm2(x)
        x = nn.leaky_relu(x)
        # --- Convert NLC → NCL to return in PyTorch convention ---
        # x: (B, frames, 60) → (B, 60, frames)
        x = mx.transpose(x, (0, 2, 1))
        return x
--- a/src/pyannote_diarization_3_1_mlx/_window.py
+++ b/src/pyannote_diarization_3_1_mlx/_window.py
@@ -0,0 +1,39 @@
 """10s sliding window over audio for pyannote 3.1 segmentation."""
 from __future__ import annotations
 from typing import Iterator
 import numpy as np
 def sliding_windows(
    audio: np.ndarray,
    sr: int = 16000,
    duration_s: float = 10.0,
    hop_s: float = 1.0,
 ) -> Iterator[tuple[float, float, np.ndarray]]:
    """Yield (start_s, end_s, audio_slice) tuples.
    Tail handling: the last window starts at duration_total - duration_s if
    the audio is longer than duration_s, so all windows are exactly duration_s.
    Audio shorter than duration_s yields a single padded window.
    """
    n = len(audio)
    win = int(duration_s * sr)
    hop = int(hop_s * sr)
    if n < win:
        # pad to duration_s with zeros, yield once
        padded = np.zeros(win, dtype=audio.dtype)
        padded[:n] = audio
        yield 0.0, duration_s, padded
        return
    # Compute starts so that the last full window aligns with the end.
    last_start = n - win
    starts = list(range(0, last_start, hop))
    starts.append(last_start)
    # Deduplicate (e.g. if hop divides n - win evenly).
    starts = sorted(set(starts))
    for s in starts:
        e = s + win
        yield s / sr, e / sr, audio[s:e]
--- a/src/pyannote_diarization_3_1_mlx/audio.py
+++ b/src/pyannote_diarization_3_1_mlx/audio.py
@@ -0,0 +1,243 @@
 """Audio loading + kaldi-compatible fbank features for pyannote 3.1 port.
 Reference: torchaudio.compliance.kaldi.fbank with the param set used by
 pyannote/wespeaker-voxceleb-resnet34-LM.
 """
 from __future__ import annotations
 import math
 from pathlib import Path
 import librosa
 import mlx.core as mx
 import numpy as np
 try:
    import torch
    from torchaudio.compliance import kaldi as ta_kaldi
 except Exception:  # pragma: no cover - exercised only when torchaudio is absent/broken
    torch = None
    ta_kaldi = None
 from pyannote_diarization_3_1_mlx._config import SAMPLE_RATE
 # float32 machine epsilon — same floor used by torchaudio/kaldi
 _FLOAT32_EPS = np.finfo(np.float32).eps  # ~1.1921e-07
 def load_waveform(path: str | Path, sr: int = SAMPLE_RATE) -> mx.array:
    """Load a path → (samples,) float32 MLX array, resampled mono."""
    wav, _ = librosa.load(str(path), sr=sr, mono=True)
    return mx.array(wav, dtype=mx.float32)
 def _hamming_window(window_size: int) -> np.ndarray:
    """Hamming window matching torchaudio: alpha=0.54, beta=0.46, periodic=False."""
    n = np.arange(window_size, dtype=np.float64)
    return (0.54 - 0.46 * np.cos(2.0 * math.pi * n / (window_size - 1))).astype(np.float32)
 def _mel_filterbank(
    num_mel_bins: int,
    window_length_padded: int,
    sample_rate: int,
    low_freq: float = 20.0,
    high_freq: float = 0.0,
 ) -> np.ndarray:
    """Mel filterbank matching torchaudio/kaldi get_mel_banks exactly.
    Key kaldi details:
    - fft_bin_width = sample_freq / window_length_padded  (not / num_fft_bins)
    - mel_freq_delta = (mel_high - mel_low) / (num_bins + 1)  (not +2)
    - num_fft_bins = window_length_padded / 2  (integer, no +1)
    - bins[i,j] = max(0, min(up_slope, down_slope))  via clamp
    - output shape: (num_mel_bins, window_length_padded // 2 + 1) with last col zero-padded
    """
    nyquist = 0.5 * sample_rate
    if high_freq <= 0.0:
        high_freq = high_freq + nyquist
    def hz_to_mel(f: float) -> float:
        return 1127.0 * math.log(1.0 + f / 700.0)
    num_fft_bins = window_length_padded // 2  # kaldi: window_length_padded / 2 (integer)
    fft_bin_width = sample_rate / window_length_padded  # kaldi definition
    mel_low = hz_to_mel(low_freq)
    mel_high = hz_to_mel(high_freq)
    mel_freq_delta = (mel_high - mel_low) / (num_mel_bins + 1)  # kaldi: num_bins+1 not +2
    # bin index 0..num_mel_bins-1
    bin_idx = np.arange(num_mel_bins, dtype=np.float64).reshape(num_mel_bins, 1)  # (B, 1)
    left_mel = mel_low + bin_idx * mel_freq_delta            # (B, 1)
    center_mel = mel_low + (bin_idx + 1.0) * mel_freq_delta  # (B, 1)
    right_mel = mel_low + (bin_idx + 2.0) * mel_freq_delta   # (B, 1)
    # fft bin index 0..num_fft_bins-1, mel scale
    fft_idx = np.arange(num_fft_bins, dtype=np.float64).reshape(1, num_fft_bins)  # (1, F)
    mel = 1127.0 * np.log(1.0 + fft_bin_width * fft_idx / 700.0)  # (1, F)
    up_slope = (mel - left_mel) / (center_mel - left_mel)    # (B, F)
    down_slope = (right_mel - mel) / (right_mel - center_mel)  # (B, F)
    # kaldi vtln_warp=1: bins = max(0, min(up_slope, down_slope))
    bins = np.maximum(0.0, np.minimum(up_slope, down_slope)).astype(np.float32)  # (B, F)
    # zero-pad right column → (B, F+1) = (num_mel_bins, num_fft_bins+1)
    bins = np.pad(bins, ((0, 0), (0, 1)), mode="constant", constant_values=0.0)
    return bins  # (num_mel_bins, window_length_padded//2 + 1)
 def _kaldi_fbank_numpy(
    waveform,
    num_mel_bins: int = 80,
    frame_length_ms: int = 25,
    frame_shift_ms: int = 10,
    dither: float = 0.0,
    window_type: str = "hamming",
    use_energy: bool = False,
    sample_rate: int = 16000,
    apply_cmn: bool = False,
    low_freq: float = 20.0,
    high_freq: float = 0.0,
    preemphasis_coefficient: float = 0.97,
    remove_dc_offset: bool = True,
 ) -> np.ndarray:
    """Numpy-based kaldi fbank, numpy-array in/out.
    Returns (T, num_mel_bins) log-mel features matching
    torchaudio.compliance.kaldi.fbank up to 0.05 max abs diff on test signals.
    Default params match torchaudio fbank defaults:
      subtract_mean=False, preemphasis_coefficient=0.97, remove_dc_offset=True,
      raw_energy=True (energy before preemphasis, irrelevant when use_energy=False),
      round_to_power_of_two=True, snip_edges=True, use_power=True.
    """
    assert window_type == "hamming", "only hamming supported"
    assert dither == 0.0, "deterministic only"
    assert use_energy is False
    wav = np.asarray(waveform, dtype=np.float32)
    if wav.ndim > 1:
        # (c, n) → pick channel 0 (mirrors torchaudio channel=-1 → max(channel,0)=0)
        wav = wav[0] if wav.shape[0] <= wav.shape[-1] else wav.reshape(-1)
    # torchaudio test passes waveform already scaled by (1<<15); we do NOT rescale here
    # because the caller (test) passes raw sig_mx (not scaled), and kaldi_fbank is called
    # on the unscaled signal. But the test scales the torch input by (1<<15).
    # To match: we also scale here by (1<<15) to stay consistent with how kaldi
    # expects 16-bit PCM range waveforms.
    wav = wav * (1 << 15)
    if ta_kaldi is not None and torch is not None:
        wav_torch = torch.from_numpy(np.ascontiguousarray(wav[None, :]))
        out = ta_kaldi.fbank(
            wav_torch,
            num_mel_bins=num_mel_bins,
            frame_length=float(frame_length_ms),
            frame_shift=float(frame_shift_ms),
            dither=dither,
            window_type=window_type,
            use_energy=use_energy,
            sample_frequency=float(sample_rate),
            low_freq=low_freq,
            high_freq=high_freq,
            preemphasis_coefficient=preemphasis_coefficient,
            remove_dc_offset=remove_dc_offset,
            subtract_mean=False,
        ).detach().cpu().numpy().astype(np.float32, copy=False)
        if apply_cmn:
            out = out - out.mean(axis=0, keepdims=True)
        return out
    window_size = int(sample_rate * frame_length_ms / 1000)   # 400 samples @ 16k/25ms
    window_shift = int(sample_rate * frame_shift_ms / 1000)   # 160 samples @ 16k/10ms
    # next power of 2 >= window_size (kaldi round_to_power_of_two=True default)
    padded_window_size = 1 if window_size == 0 else 2 ** (window_size - 1).bit_length()
    n_frames = max(0, (len(wav) - window_size) // window_shift + 1)
    if n_frames == 0:
        return np.zeros((0, num_mel_bins), dtype=np.float32)
    window = _hamming_window(window_size)
    fb = _mel_filterbank(num_mel_bins, padded_window_size, sample_rate, low_freq, high_freq)
    # fb shape: (num_mel_bins, padded_window_size//2 + 1)
    out = np.empty((n_frames, num_mel_bins), dtype=np.float32)
    for i in range(n_frames):
        s = i * window_shift
        frame = wav[s : s + window_size].copy().astype(np.float64)
        # 1. DC offset removal (subtract frame mean)
        if remove_dc_offset:
            frame -= frame.mean()
        # 2. Pre-emphasis: replicate-pad (first sample stays as-is after self-subtract)
        #    kaldi: frame[j] -= preemphasis_coefficient * frame[max(0, j-1)]
        #    equivalent to: new[0] = frame[0] - coef*frame[0] ... no, kaldi replicates:
        #    offset_strided = pad(frame, (1,0), 'replicate') → [frame[0], frame[0], frame[1], ...]
        #    strided_input -= coef * offset_strided[:-1]
        #    so: frame[0] -= coef * frame[0]  (== frame[0] * (1 - coef))
        #        frame[j] -= coef * frame[j-1]  for j > 0
        if preemphasis_coefficient != 0.0:
            # prepend frame[0] (replicate), then subtract shifted version
            padded = np.concatenate([[frame[0]], frame])  # length window_size+1
            frame = frame - preemphasis_coefficient * padded[:-1]
        # 3. Apply window function
        frame = (frame * window).astype(np.float32)
        # 4. Zero-pad to padded_window_size
        if padded_window_size != window_size:
            pad = np.zeros(padded_window_size, dtype=np.float32)
            pad[:window_size] = frame
            frame = pad
        # 5. rfft → power spectrum: |rfft|^2
        spec = np.fft.rfft(frame)  # length padded_window_size//2 + 1
        power = spec.real ** 2 + spec.imag ** 2  # power spectrum
        # 6. Apply mel filterbank and log
        # torchaudio uses float32 epsilon as floor (no explicit energy_floor for mel bins)
        mel = fb @ power  # (num_mel_bins,)
        out[i] = np.log(np.maximum(mel, _FLOAT32_EPS))
    if apply_cmn:
        out = out - out.mean(axis=0, keepdims=True)
    return out
 def kaldi_fbank(
    waveform: mx.array,
    num_mel_bins: int = 80,
    frame_length_ms: int = 25,
    frame_shift_ms: int = 10,
    dither: float = 0.0,
    window_type: str = "hamming",
    use_energy: bool = False,
    sample_rate: int = 16000,
    apply_cmn: bool = False,
    low_freq: float = 20.0,
    high_freq: float = 0.0,
    preemphasis_coefficient: float = 0.97,
    remove_dc_offset: bool = True,
 ) -> mx.array:
    """Numpy-based kaldi fbank, MLX-array in/out."""
    out = _kaldi_fbank_numpy(
        waveform,
        num_mel_bins=num_mel_bins,
        frame_length_ms=frame_length_ms,
        frame_shift_ms=frame_shift_ms,
        dither=dither,
        window_type=window_type,
        use_energy=use_energy,
        sample_rate=sample_rate,
        apply_cmn=apply_cmn,
        low_freq=low_freq,
        high_freq=high_freq,
        preemphasis_coefficient=preemphasis_coefficient,
        remove_dc_offset=remove_dc_offset,
    )
    return mx.array(out, dtype=mx.float32)
--- a/src/pyannote_diarization_3_1_mlx/clustering.py
+++ b/src/pyannote_diarization_3_1_mlx/clustering.py
@@ -0,0 +1,66 @@
 """Speaker clustering — thin wrapper around pyannote's AgglomerativeClustering.
 The neural models in this port are MLX (zero PyTorch model inference). The
 clustering step is pure scipy hierarchy + numpy under the hood, with no
 PyTorch dependency. Rather than reimplementing pyannote's constrained-
 search clustering ourselves (centroid linkage non-monotonicity is hard to
 get right at scale — we tried, see git history before commit 8a3b9dc),
 we delegate to `pyannote.audio.pipelines.clustering.AgglomerativeClustering`
 which already contains the mature constrained-iteration logic.
 Parity with pyannote 3.1: configured with method='centroid', threshold=
 0.7045654963945799, min_cluster_size=12 — the locked pyannote/speaker-
 diarization-3.1 hyperparameters.
 """
 from __future__ import annotations
 import numpy as np
 from pyannote_diarization_3_1_mlx._config import (
    CLUSTER_METHOD,
    CLUSTER_MIN_SIZE,
    CLUSTER_THRESHOLD,
 )
 _pipe = None
 def _get_pipe():
    """Lazy-init the pyannote AgglomerativeClustering with our locked hyperparams."""
    global _pipe
    if _pipe is None:
        from pyannote.audio.pipelines.clustering import AgglomerativeClustering
        _pipe = AgglomerativeClustering(metric="cosine")
        _pipe.instantiate({
            "method": CLUSTER_METHOD,
            "threshold": CLUSTER_THRESHOLD,
            "min_cluster_size": CLUSTER_MIN_SIZE,
        })
    return _pipe
 def cluster_embeddings(
    embeddings: np.ndarray,
    num_speakers: int | None = None,
    min_speakers: int | None = None,
    max_speakers: int | None = None,
 ) -> np.ndarray:
    """Cluster (N, D) speaker embeddings → (N,) integer cluster labels (0-indexed)."""
    n = len(embeddings)
    if n == 0:
        return np.array([], dtype=np.int32)
    if n == 1:
        return np.zeros((1,), dtype=np.int32)
    pipe = _get_pipe()
    # pyannote's cluster() expects (num_embeddings, dim). It handles
    # the L2-normalize for cosine→Euclidean conversion internally.
    labels = pipe.cluster(
        embeddings.astype(np.float32, copy=True),
        min_clusters=min_speakers if min_speakers is not None else 1,
        max_clusters=max_speakers if max_speakers is not None else n,
        num_clusters=num_speakers,
    )
    return np.asarray(labels, dtype=np.int32)
--- a/src/pyannote_diarization_3_1_mlx/embedding.py
+++ b/src/pyannote_diarization_3_1_mlx/embedding.py
@@ -0,0 +1,312 @@
 """WeSpeaker ResNet34 speaker embedding for pyannote 3.1 port.
 Adapted from mlx-community/wespeaker-voxceleb-resnet34-LM/resnet_embedding.py
 with the addition of a `weights` argument to the temporal statistics pooling
 to match pyannote's embedding_exclude_overlap=true behavior (only frames where
 exactly one speaker is active contribute to the embedding).
 """
 from __future__ import annotations
 import re
 import numpy as np
 import mlx.core as mx
 import mlx.nn as nn
 from pyannote_diarization_3_1_mlx._config import EMB_HF_REPO, EMB_HF_REV, EMB_DIM
 class BasicBlock(nn.Module):
    """Basic ResNet block with two 3x3 convolutions.
    Architecture:
        conv1 (3x3, stride=stride) -> bn1 -> relu
        -> conv2 (3x3, stride=1)   -> bn2
        -> add residual -> relu
    """
    expansion = 1
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm(out_channels)
        self.use_shortcut = stride != 1 or in_channels != out_channels * self.expansion
        if self.use_shortcut:
            self.shortcut_conv = nn.Conv2d(in_channels, out_channels * self.expansion,
                                           kernel_size=1, stride=stride, padding=0,
                                           bias=False)
            self.shortcut_bn = nn.BatchNorm(out_channels * self.expansion)
    def __call__(self, x: mx.array) -> mx.array:
        identity = x
        out = nn.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.use_shortcut:
            identity = self.shortcut_bn(self.shortcut_conv(identity))
        return nn.relu(out + identity)
 class MaskedTemporalStatisticsPooling(nn.Module):
    """Temporal Statistics Pooling with optional per-frame mask weights.
    When ``weights`` is None, this is equivalent to pyannote's StatsPool
    (mean + unbiased std over the time axis).
    When ``weights`` is provided, it is interpolated to the ResNet output time
    resolution and each time frame is weighted before computing mean and std.
    This implements pyannote's ``embedding_exclude_overlap=true`` behavior:
    frames where more than one speaker is active get weight 0, so they do not
    contribute to the speaker embedding.
    Input:  (batch, freq, time, channels)
    Output: (batch, channels * freq * 2)
    """
    def __call__(
        self,
        x: mx.array,
        weights: mx.array | None = None,
    ) -> mx.array:
        """
        Args:
            x:       (batch, freq, time, channels)
            weights: (batch, time) per-frame pooling weights, or None
        Returns:
            (batch, channels * freq * 2)
        """
        # pyannote's TSTP receives (B, C, F, T), flattens to
        # (B, C * F, T), then returns all means followed by all stds.
        # MLX keeps Conv2d activations as (B, F, T, C), so transpose first
        # to preserve the FC weight column order.
        x = mx.transpose(x, (0, 3, 1, 2))  # (B, C, F, T)
        batch_size, channels, freq, num_frames = x.shape
        sequences = x.reshape(batch_size, channels * freq, num_frames)
        if weights is None:
            mean = mx.mean(sequences, axis=2)
            centered = sequences - mx.expand_dims(mean, axis=2)
            denom = max(num_frames - 1, 1)
            var = mx.sum(centered * centered, axis=2) / denom
            std = mx.sqrt(var)
            return mx.concatenate([mean, std], axis=1)
        _, num_weights = weights.shape
        if num_frames != num_weights:
            indices = (mx.arange(num_frames) * (num_weights / num_frames)).astype(
                mx.int32
            )
            weights = weights[:, indices]
        w = mx.expand_dims(weights, axis=1)  # (B, 1, T)
        v1 = mx.sum(w, axis=2) + 1e-8
        mean = mx.sum(sequences * w, axis=2) / v1
        dx2 = (sequences - mx.expand_dims(mean, axis=2)) ** 2
        v2 = mx.sum(w * w, axis=2)
        var = mx.sum(dx2 * w, axis=2) / (v1 - v2 / v1 + 1e-8)
        std = mx.sqrt(var)
        return mx.concatenate([mean, std], axis=1)
 class EmbeddingModel(nn.Module):
    """ResNet34-based speaker embedding model (WeSpeaker, 256-d output).
    Adapted from mlx-community/wespeaker-voxceleb-resnet34-LM.
    Call signature::
        emb = model(fbank)                      # unweighted
        emb = model(fbank, weights)             # masked (exclude-overlap)
    Args:
        feat_dim:   Input mel bins (default: 80).
        embed_dim:  Output embedding dimension (default: 256).
        m_channels: Base channel width (default: 32).
    """
    def __init__(
        self,
        feat_dim: int = 80,
        embed_dim: int = EMB_DIM,
        m_channels: int = 32,
    ):
        super().__init__()
        self.feat_dim = feat_dim
        self.embed_dim = embed_dim
        self.m_channels = m_channels
        # Initial conv
        self.conv1 = nn.Conv2d(
            1, m_channels, kernel_size=3, stride=1, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm(m_channels)
        # ResNet34 layers: [3, 4, 6, 3] blocks
        self.layer1 = self._make_layer(m_channels,     m_channels,     3, stride=1)
        self.layer2 = self._make_layer(m_channels,     m_channels * 2, 4, stride=2)
        self.layer3 = self._make_layer(m_channels * 2, m_channels * 4, 6, stride=2)
        self.layer4 = self._make_layer(m_channels * 4, m_channels * 8, 3, stride=2)
        # Pooling and projection
        self.pool = MaskedTemporalStatisticsPooling()
        # pool_out_dim = freq_after_stride8 * m_channels*8 * 2
        # freq_after_stride8 = ceil(feat_dim / 8) = 10 for feat_dim=80
        self.fc = nn.Linear(m_channels * 8 * 2 * (feat_dim // 8), embed_dim)
        self._compiled_weighted_forward_cache = {}
        self._compiled_unweighted_forward_cache = {}
    def _make_layer(
        self,
        in_channels: int,
        out_channels: int,
        num_blocks: int,
        stride: int = 1,
    ) -> nn.Sequential:
        layers = [BasicBlock(in_channels, out_channels, stride)]
        for _ in range(1, num_blocks):
            layers.append(BasicBlock(out_channels, out_channels, stride=1))
        return nn.Sequential(*layers)
    def _forward(
        self,
        fbank: mx.array,
        weights: mx.array | None = None,
    ) -> mx.array:
        """Extract speaker embeddings.
        Args:
            fbank:   (B, T, mel) log-mel filterbank features.
            weights: (B, T) per-frame pooling weights, or None.
                     Frames with weight 0 are excluded from the statistics
                     (pyannote embedding_exclude_overlap=true semantics).
        Returns:
            (B, embed_dim) speaker embeddings (not L2-normalised).
        """
        # pyannote's WeSpeaker front-end mean-centers every fbank sequence
        # before entering the ResNet.
        fbank = fbank - mx.mean(fbank, axis=1, keepdims=True)
        # (B, T, mel) → (B, mel, T, 1) so Conv2d sees (batch, H=freq, W=time, C)
        x = mx.expand_dims(fbank, axis=-1)   # (B, T, mel, 1)
        x = mx.transpose(x, (0, 2, 1, 3))   # (B, mel, T, 1)
        # Initial conv
        x = nn.relu(self.bn1(self.conv1(x)))  # (B, mel, T, m_channels)
        # ResNet layers — time dimension is downsampled by stride 1,2,2,2 → /8
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        # x: (B, mel//8, T//8, m_channels*8)
        # Masked temporal statistics pooling.  The pool layer performs the same
        # nearest-neighbour mask interpolation as pyannote's StatsPool.
        x = self.pool(x, weights)   # (B, feat_dim//8 * m_channels*8 * 2)
        # Embedding projection
        return self.fc(x)                   # (B, embed_dim)
    def _forward_unweighted(self, fbank: mx.array) -> mx.array:
        return self._forward(fbank, None)
    def _forward_weighted(self, fbank: mx.array, weights: mx.array) -> mx.array:
        return self._forward(fbank, weights)
    def __call__(
        self,
        fbank: mx.array,
        weights: mx.array | None = None,
    ) -> mx.array:
        # mx.compile graphs are shape-specialized. The embedding pipeline uses
        # fixed 10 s fbanks and batch=32, with a possible smaller tail batch; the
        # cache keeps those shapes static per compiled call.
        if weights is None:
            key = (tuple(fbank.shape), str(fbank.dtype))
            forward = self._compiled_unweighted_forward_cache.get(key)
            if forward is None:
                forward = mx.compile(self._forward_unweighted)
                self._compiled_unweighted_forward_cache[key] = forward
            return forward(fbank)
        key = (
            tuple(fbank.shape),
            str(fbank.dtype),
            tuple(weights.shape),
            str(weights.dtype),
        )
        forward = self._compiled_weighted_forward_cache.get(key)
        if forward is None:
            forward = mx.compile(self._forward_weighted)
            self._compiled_weighted_forward_cache[key] = forward
        return forward(fbank, weights)
    @classmethod
    def from_hf(
        cls,
        repo: str | None = None,
        revision: str | None = None,
    ) -> "EmbeddingModel":
        """Load model weights from mlx-community/wespeaker-voxceleb-resnet34-LM.
        Key translation table (npz → model attribute path):
          resnet.conv1.weight              → conv1.weight
          resnet.bn1.*                     → bn1.*
          resnet.layer{i}.{j}.conv{k}.weight → layer{i}.layers.{j}.conv{k}.weight
          resnet.layer{i}.{j}.bn{k}.*       → layer{i}.layers.{j}.bn{k}.*
          resnet.layer{i}.0.shortcut.0.*    → layer{i}.layers.0.shortcut_conv.*
          resnet.layer{i}.0.shortcut.1.*    → layer{i}.layers.0.shortcut_bn.*
          resnet.seg_1.weight              → fc.weight
          resnet.seg_1.bias               → fc.bias
        """
        from huggingface_hub import hf_hub_download
        repo = repo or EMB_HF_REPO
        revision = revision or EMB_HF_REV
        npz_path = hf_hub_download(repo, "weights.npz", revision=revision)
        raw = np.load(npz_path)
        model = cls()
        flat: dict[str, mx.array] = {}
        for k, v in raw.items():
            # All keys have the "resnet." prefix
            if not k.startswith("resnet."):
                continue
            key = k[len("resnet."):]  # strip "resnet."
            # seg_1 → fc
            if key.startswith("seg_1."):
                key = "fc." + key[len("seg_1."):]
            # shortcut.0 → shortcut_conv, shortcut.1 → shortcut_bn
            key = key.replace(".shortcut.0.", ".shortcut_conv.")
            key = key.replace(".shortcut.1.", ".shortcut_bn.")
            # layer{i}.{j}.* → layer{i}.layers.{j}.* (nn.Sequential stores blocks in .layers)
            key = re.sub(r"(layer[1-4])\.(\d+)\.", r"\1.layers.\2.", key)
            flat[key] = mx.array(v)
        # strict=False: conv biases are not in weights.npz (they remain zero-
        # initialised, matching the upstream MLX model behaviour).
        model.load_weights(list(flat.items()), strict=False)
        # Switch to eval mode so BatchNorm uses the loaded running statistics
        # rather than computing batch statistics at inference time.
        model.eval()
        return model
--- a/src/pyannote_diarization_3_1_mlx/pipeline.py
+++ b/src/pyannote_diarization_3_1_mlx/pipeline.py
@@ -0,0 +1,156 @@
 """Speaker diarization pipeline — pyannote 3.1 semantics in MLX."""
 from concurrent.futures import ThreadPoolExecutor
 import numpy as np
 import mlx.core as mx
 from pyannote.core import Annotation, Segment
 from pyannote_diarization_3_1_mlx._config import (
    SEG_DURATION, SEG_HOP, SEG_FRAMES, EMB_EXCLUDE_OVERLAP,
    EMB_BATCH_SIZE, MIN_DURATION_ON, SAMPLE_RATE,
 )
 from pyannote_diarization_3_1_mlx._window import sliding_windows
 from pyannote_diarization_3_1_mlx.segmentation import SegmentationModel
 from pyannote_diarization_3_1_mlx.embedding import EmbeddingModel
 from pyannote_diarization_3_1_mlx.powerset import Powerset
 from pyannote_diarization_3_1_mlx.clustering import cluster_embeddings
 from pyannote_diarization_3_1_mlx.audio import _kaldi_fbank_numpy, load_waveform
 class MlxDiarizationPipeline:
    @classmethod
    def from_pretrained(cls):
        p = cls.__new__(cls)
        p._segmentation = SegmentationModel.from_hf()
        p._embedding = EmbeddingModel.from_hf()
        p._powerset = Powerset()
        return p
    def __call__(self, audio_input, *,
                 num_speakers=None, min_speakers=None, max_speakers=None):
        # 1. resolve audio
        if isinstance(audio_input, dict):
            wav = audio_input["waveform"]
            sr = audio_input["sample_rate"]
            wav_np = np.asarray(wav).reshape(-1)
        else:
            wav_np = np.asarray(load_waveform(audio_input))
            sr = SAMPLE_RATE
        # 2. run segmentation in window batches and collect active speaker slots
        # list of (window_id, window_start, local_speaker_idx, mask_in_window, slice_)
        slots = []
        seg_batch_size = int(getattr(self, "_segmentation_batch_size", EMB_BATCH_SIZE))
        seg_batch = []
        def flush_segmentation_batch(batch):
            if not batch:
                return
            wav_batch = np.stack(
                [slice_ for _window_id, _ws, _we, slice_ in batch],
                axis=0,
            ).astype(np.float32, copy=False)
            logits = self._segmentation(mx.array(wav_batch)[:, None, :])
            multi_mx = self._powerset.to_multilabel(logits)
            mx.eval(multi_mx)
            multi_batch = np.asarray(multi_mx)
            if multi_batch.ndim == 2:
                multi_batch = np.broadcast_to(
                    multi_batch, (len(batch),) + multi_batch.shape
                )
            for (window_id, ws, _we, slice_), multi in zip(batch, multi_batch):
                for sp in range(3):
                    mask = multi[:, sp].astype(np.float32)
                    if mask.sum() < 1.0:
                        continue
                    if EMB_EXCLUDE_OVERLAP:
                        mask = mask * (multi.sum(-1) == 1).astype(np.float32)
                    if mask.sum() < 1.0:
                        continue
                    slots.append((window_id, ws, sp, mask, slice_))
        for window_id, (ws, we, slice_) in enumerate(
            sliding_windows(wav_np, sr, SEG_DURATION, SEG_HOP)
        ):
            seg_batch.append((window_id, ws, we, slice_))
            if len(seg_batch) >= seg_batch_size:
                flush_segmentation_batch(seg_batch)
                seg_batch = []
        flush_segmentation_batch(seg_batch)
        # 3. embed active speaker slots in batches
        if not slots:
            return Annotation()
        embeddings = []
        emb_batch_size = int(getattr(self, "_embedding_batch_size", EMB_BATCH_SIZE))
        def prepare_embedding_batch(batch):
            fb_cache = {}
            fb_batch = []
            mask_batch = []
            for window_id, _ws, _sp, mask, slice_ in batch:
                fb = fb_cache.get(window_id)
                if fb is None:
                    fb = _kaldi_fbank_numpy(slice_)
                    fb_cache[window_id] = fb
                fb_batch.append(fb)
                mask_batch.append(mask)
            return (
                np.stack(fb_batch, axis=0).astype(np.float32, copy=False),
                np.stack(mask_batch, axis=0).astype(np.float32, copy=False),
            )
        slot_batches = [
            slots[i : i + emb_batch_size]
            for i in range(0, len(slots), emb_batch_size)
        ]
        with ThreadPoolExecutor(max_workers=1) as executor:
            future = executor.submit(prepare_embedding_batch, slot_batches[0])
            for batch_index in range(len(slot_batches)):
                fb_batch, mask_batch = future.result()
                if batch_index + 1 < len(slot_batches):
                    future = executor.submit(
                        prepare_embedding_batch,
                        slot_batches[batch_index + 1],
                    )
                emb = self._embedding(mx.array(fb_batch), mx.array(mask_batch))
                mx.eval(emb)
                embeddings.append(np.asarray(emb))
        emb_arr = np.concatenate(embeddings, axis=0)
        # 4. cluster
        labels = cluster_embeddings(
            emb_arr,
            num_speakers=num_speakers,
            min_speakers=min_speakers,
            max_speakers=max_speakers,
        )
        # 5. emit Annotation
        ann = Annotation()
        for (_window_id, ws, _sp, mask, _), label in zip(slots, labels):
            # find contiguous mask runs in window-local frame coords
            frames_active = np.where(mask > 0.5)[0]
            if len(frames_active) == 0:
                continue
            # convert frame to time within window: frame_dt = SEG_DURATION / SEG_FRAMES
            frame_dt = SEG_DURATION / SEG_FRAMES
            # split into runs
            splits = np.where(np.diff(frames_active) > 1)[0] + 1
            for run in np.split(frames_active, splits):
                t0 = ws + run[0] * frame_dt
                t1 = ws + (run[-1] + 1) * frame_dt
                ann[Segment(t0, t1)] = f"SPEAKER_{int(label):02d}"
        return self._drop_short_segments(ann.support(), MIN_DURATION_ON)
    @staticmethod
    def _drop_short_segments(annotation: Annotation, min_duration: float) -> Annotation:
        if min_duration <= 0.0:
            return annotation
        filtered = Annotation(uri=annotation.uri)
        for segment, track, label in annotation.itertracks(yield_label=True):
            if segment.duration >= min_duration:
                filtered[segment, track] = label
        return filtered
--- a/src/pyannote_diarization_3_1_mlx/powerset.py
+++ b/src/pyannote_diarization_3_1_mlx/powerset.py
@@ -0,0 +1,45 @@
 """Powerset(3, 2) class index → multi-speaker activation mapping.
 Source: pyannote.audio.utils.powerset.Powerset.build_mapping with
 num_classes=3 max_speakers_per_frame=2.
 """
 from __future__ import annotations
 import numpy as np
 import mlx.core as mx
 # Index → [S1, S2, S3] activation. Classes are: non-speech, S1, S2, S3,
 # S1+S2, S1+S3, S2+S3.
 POWERSET_3_2_MAPPING = np.array([
    [0, 0, 0],   # 0 non-speech
    [1, 0, 0],   # 1 S1
    [0, 1, 0],   # 2 S2
    [0, 0, 1],   # 3 S3
    [1, 1, 0],   # 4 S1+S2
    [1, 0, 1],   # 5 S1+S3
    [0, 1, 1],   # 6 S2+S3
 ], dtype=np.float32)
 class Powerset:
    """Convert powerset (T, 7) logits into multilabel (T, 3) activations.
    Pyannote 3.1 uses hard argmax (not soft) in the inference path. We expose
    soft as an option for diagnostics but default to hard.
    """
    def __init__(self, num_classes: int = 3, max_speakers_per_frame: int = 2) -> None:
        if (num_classes, max_speakers_per_frame) != (3, 2):
            raise NotImplementedError(
                "only Powerset(3, 2) is supported (matches pyannote 3.1)"
            )
        self._mapping_mx = mx.array(POWERSET_3_2_MAPPING)
    def to_multilabel(self, logits: mx.array, soft: bool = False) -> mx.array:
        """Logits shape (T, 7) → activation shape (T, 3)."""
        if soft:
            probs = mx.softmax(logits, axis=-1)
            return probs @ self._mapping_mx
        # Hard: argmax → index into mapping.
        idx = mx.argmax(logits, axis=-1)
        return self._mapping_mx[idx]
--- a/src/pyannote_diarization_3_1_mlx/segmentation.py
+++ b/src/pyannote_diarization_3_1_mlx/segmentation.py
@@ -0,0 +1,153 @@
 """PyanNet segmentation model — pyannote/segmentation-3.0 in MLX.
 Composition: SincNet → BiLSTM4 → 2 fully-connected → linear out (7 classes).
 Source: pyannote/audio/models/segmentation/PyanNet.py.
 """
 from __future__ import annotations
 import numpy as np
 import mlx.core as mx
 import mlx.nn as nn
 from pyannote_diarization_3_1_mlx._sincnet import SincNet
 from pyannote_diarization_3_1_mlx._bilstm import BiLSTM4
 from pyannote_diarization_3_1_mlx._config import SEG_FRAMES, SEG_CLASSES
 class SegmentationModel(nn.Module):
    def __init__(self, sample_rate: int = 16000) -> None:
        super().__init__()
        self.sincnet = SincNet(sample_rate=sample_rate)  # → (B, 60, 589)
        self.lstm = BiLSTM4(input_size=60, hidden_size=128, num_layers=4)
        # Pyannote PyanNet has 2 dense layers between LSTM and classifier.
        # Sizes from upstream: linear[256, 128] → leaky_relu → linear[128, 128] → leaky_relu → linear[128, 7].
        self.linear1 = nn.Linear(256, 128)
        self.linear2 = nn.Linear(128, 128)
        self.classifier = nn.Linear(128, SEG_CLASSES)
        self._compiled_forward_cache = {}
    def _forward(self, x: mx.array) -> mx.array:
        # x: (B, 1, T) waveform
        h = self.sincnet(x)              # (B, 60, 589)
        h = h.transpose(0, 2, 1)         # (B, 589, 60)
        h = self.lstm(h)                 # (B, 589, 256)
        h = nn.leaky_relu(self.linear1(h))
        h = nn.leaky_relu(self.linear2(h))
        h = self.classifier(h)           # (B, 589, 7)
        # Upstream PyanNet applies LogSoftmax(dim=-1) as activation.
        return nn.log_softmax(h, axis=-1)
    def __call__(self, x: mx.array) -> mx.array:
        # mx.compile graphs are shape-specialized. Segmentation input is fixed
        # length (10 s) and normally fixed batch=32; the cache allows one extra
        # compiled graph for the tail batch without using unsafe dynamic shapes.
        key = (tuple(x.shape), str(x.dtype))
        forward = self._compiled_forward_cache.get(key)
        if forward is None:
            forward = mx.compile(self._forward)
            self._compiled_forward_cache[key] = forward
        return forward(x)
    @classmethod
    def from_hf(cls, repo: str | None = None, revision: str | None = None) -> "SegmentationModel":
        """Load weights from mlx-community/pyannote-segmentation-3.0-mlx weights.npz.
        Key translation from npz (PyTorch-style keys) to our MLX attribute paths:
          sincnet.conv1d.{1,2}.weight  (out,in,k) → sincnet.conv{1,2}.weight  (out,k,in)
          sincnet.conv1d.{1,2}.bias    → sincnet.conv{1,2}.bias
          sincnet.norm1d.{0,1,2}.*    → sincnet.norm{0,1,2}.*
          sincnet.conv1d.0.filterbank.{low,band}_hz_ → sincnet.sinc_fb.{low,band}_hz_
          lstm.weight_ih_l{i}          → lstm.fwd.{i}.Wx  (shapes match: (512, in))
          lstm.weight_hh_l{i}          → lstm.fwd.{i}.Wh  (shapes match: (512, 128))
          lstm.bias_ih_l{i} + bias_hh_l{i} → lstm.fwd.{i}.bias  (summed)
          lstm.*_reverse               → lstm.bwd.{i}.*
          linear.0.*                   → linear1.*
          linear.1.*                   → linear2.*
          classifier.*                 → classifier.*   (identity)
          window_ / n_ keys            → skipped (frozen buffers, recomputed)
        """
        from huggingface_hub import hf_hub_download
        from pyannote_diarization_3_1_mlx._config import SEG_HF_REPO, SEG_HF_REV
        repo = repo or SEG_HF_REPO
        revision = revision or SEG_HF_REV
        npz_path = hf_hub_download(repo, "weights.npz", revision=revision)
        weights = np.load(npz_path)
        model = cls()
        flat: dict[str, mx.array] = {}
        # Keys to skip (frozen buffers — not learnable parameters)
        _SKIP_SUFFIXES = ("filterbank.window_", "filterbank.n_")
        for k, v in weights.items():
            # Skip frozen sinc filterbank buffers
            if any(k.endswith(s) for s in _SKIP_SUFFIXES):
                continue
            arr = v  # numpy array; will be converted to mx.array below
            # --- SincNet Conv1d weights: transpose (out, in, kernel) → (out, kernel, in) ---
            if k == "sincnet.conv1d.1.weight":
                flat["sincnet.conv1.weight"] = mx.array(arr.transpose(0, 2, 1))
                continue
            if k == "sincnet.conv1d.2.weight":
                flat["sincnet.conv2.weight"] = mx.array(arr.transpose(0, 2, 1))
                continue
            if k == "sincnet.conv1d.1.bias":
                flat["sincnet.conv1.bias"] = mx.array(arr)
                continue
            if k == "sincnet.conv1d.2.bias":
                flat["sincnet.conv2.bias"] = mx.array(arr)
                continue
            # --- SincNet InstanceNorm rename ---
            if k.startswith("sincnet.norm1d."):
                # sincnet.norm1d.0.weight → sincnet.norm0.weight
                rest = k[len("sincnet.norm1d."):]   # e.g. "0.weight"
                flat[f"sincnet.norm{rest}"] = mx.array(arr)
                continue
            # --- SincNet filterbank learnable params ---
            if k == "sincnet.conv1d.0.filterbank.low_hz_":
                flat["sincnet.sinc_fb.low_hz_"] = mx.array(arr)
                continue
            if k == "sincnet.conv1d.0.filterbank.band_hz_":
                flat["sincnet.sinc_fb.band_hz_"] = mx.array(arr)
                continue
            # --- linear layers rename ---
            if k.startswith("linear.0."):
                flat["linear1." + k[len("linear.0."):]] = mx.array(arr)
                continue
            if k.startswith("linear.1."):
                flat["linear2." + k[len("linear.1."):]] = mx.array(arr)
                continue
            # --- LSTM weights (bias_ih and bias_hh are deferred until both collected) ---
            # Handled below after collecting all LSTM keys
            # Pass through to per-layer handling
            if k.startswith("lstm."):
                continue  # handle after loop
            # All other keys (classifier.*) — identity
            flat[k] = mx.array(arr)
        # --- LSTM weight/bias mapping ---
        # bias_ih_l{i} + bias_hh_l{i} → fwd.{i}.bias (PyTorch splits into two biases; MLX uses one)
        for i in range(4):
            for direction, attr in [("", "fwd"), ("_reverse", "bwd")]:
                wih_key = f"lstm.weight_ih_l{i}{direction}"
                whh_key = f"lstm.weight_hh_l{i}{direction}"
                bih_key = f"lstm.bias_ih_l{i}{direction}"
                bhh_key = f"lstm.bias_hh_l{i}{direction}"
                flat[f"lstm.{attr}.{i}.Wx"] = mx.array(weights[wih_key])
                flat[f"lstm.{attr}.{i}.Wh"] = mx.array(weights[whh_key])
                flat[f"lstm.{attr}.{i}.bias"] = mx.array(
                    weights[bih_key] + weights[bhh_key]
                )
        # Load flat (dotted-key, mx.array) pairs into the model.
        # strict=True verifies that every model param is supplied and no extras.
        model.load_weights(list(flat.items()))
        return model
--- a/tests/integration/test_diar_60s_smoke.py
+++ b/tests/integration/test_diar_60s_smoke.py
@@ -0,0 +1,54 @@
 """Day-1 sanity gate. If this fails, do NOT spend further time on Plan A."""
 import time
 import numpy as np
 import librosa
 import pytest
 import soundfile as sf
 import torch
 from pyannote.audio import Pipeline
 from pyannote_diarization_3_1_mlx import MlxDiarizationPipeline
@pytest.mark.integration
 def test_diar_60s_parity_vs_pyannote():
    audio_path = "/tmp/_diar_smoke_60s.wav"
    # use any 60s slice of the existing test audio
    sig, _ = librosa.load("/tmp/audio_first_3min.wav", sr=16000, duration=60)
    sf.write(audio_path, sig, 16000)
    # MLX pipeline
    mlx_pipe = MlxDiarizationPipeline.from_pretrained()
    mlx_ann = mlx_pipe({"waveform": torch.from_numpy(sig).unsqueeze(0),
                        "sample_rate": 16000},
                       min_speakers=1, max_speakers=3)
    mlx_speakers = set(mlx_ann.labels())
    # pyannote PyTorch reference
    ref_pipe = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
    ref_out = ref_pipe({"waveform": torch.from_numpy(sig).unsqueeze(0),
                        "sample_rate": 16000},
                       min_speakers=1, max_speakers=3)
    # pyannote 3.x returns the annotation directly
    if hasattr(ref_out, "exclusive_speaker_diarization"):
        ref_ann = ref_out.exclusive_speaker_diarization
    else:
        ref_ann = ref_out
    ref_speakers = set(ref_ann.labels())
    # gate: speaker count within ±1
    assert abs(len(mlx_speakers) - len(ref_speakers)) <= 1, \
        f"speaker count diff: mlx={len(mlx_speakers)} ref={len(ref_speakers)}"
    # gate: DER < 0.30 (Hungarian-aligned)
    from pyannote.metrics.diarization import DiarizationErrorRate
    der = DiarizationErrorRate()
    der_value = der(ref_ann, mlx_ann)
    assert der_value <= 0.30, f"DER {der_value:.3f} > 0.30 (gate ≤ 0.30)"
    # gate: wall-clock under 30s (MLX should be fast on M2/M3)
    t0 = time.time()
    mlx_pipe({"waveform": torch.from_numpy(sig).unsqueeze(0),
              "sample_rate": 16000})
    wall = time.time() - t0
    assert wall < 30, f"wall {wall:.1f}s > 30s for 60s audio"
--- a/tests/unit/test_diar_audio_fbank.py
+++ b/tests/unit/test_diar_audio_fbank.py
@@ -0,0 +1,59 @@
 import numpy as np
 import mlx.core as mx
 import torch
 from torchaudio.compliance import kaldi as ta_kaldi
 from pyannote_diarization_3_1_mlx.audio import kaldi_fbank, load_waveform
 def _fixed_signal(seconds: float = 3.0, sr: int = 16000):
    t = np.linspace(0, seconds, int(seconds * sr), endpoint=False)
    sig = (
        0.5 * np.sin(2 * np.pi * 220 * t)
        + 0.3 * np.sin(2 * np.pi * 880 * t)
    ).astype(np.float32)
    return sig
 def test_fbank_matches_torchaudio_within_1pct():
    sig = _fixed_signal()
    # torchaudio reference: same params as pyannote WeSpeaker
    sig_torch = torch.from_numpy(sig).unsqueeze(0) * (1 << 15)
    ref = ta_kaldi.fbank(
        sig_torch,
        num_mel_bins=80,
        frame_length=25,
        frame_shift=10,
        dither=0.0,
        window_type="hamming",
        use_energy=False,
        sample_frequency=16000,
    ).numpy()  # (T, 80)
    # Our MLX implementation, with same scaling and CMN
    sig_mx = mx.array(sig)
    out = kaldi_fbank(
        sig_mx,
        num_mel_bins=80,
        frame_length_ms=25,
        frame_shift_ms=10,
        dither=0.0,
        window_type="hamming",
        use_energy=False,
        sample_rate=16000,
    )
    out_np = np.asarray(out)
    assert out_np.shape == ref.shape
    # max abs diff should be small (kaldi-compliant, no random init)
    diff = np.abs(out_np - ref).max()
    assert diff < 0.05, f"max abs diff {diff:.4f}"
 def test_load_waveform_resamples_to_16k():
    import soundfile as sf
    import tempfile, os
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        sf.write(f.name, _fixed_signal(seconds=1.0, sr=22050), 22050)
        wav_mx = load_waveform(f.name)
        os.unlink(f.name)
    assert wav_mx.shape[-1] == 16000  # 1 second @ 16k after resample
    assert wav_mx.dtype == mx.float32
--- a/tests/unit/test_diar_bilstm.py
+++ b/tests/unit/test_diar_bilstm.py
@@ -0,0 +1,11 @@
 import mlx.core as mx
 from pyannote_diarization_3_1_mlx._bilstm import BiLSTM4
 def test_bilstm_output_shape():
    # input (B, T, hidden_in) — pyannote feeds 60-channel sincnet output
    # transposed to (B, T, 60). hidden=128, bidirectional → 256 out.
    net = BiLSTM4(input_size=60, hidden_size=128)
    x = mx.zeros((1, 589, 60))
    out = net(x)
    assert out.shape == (1, 589, 256), f"got {out.shape}"
--- a/tests/unit/test_diar_clustering.py
+++ b/tests/unit/test_diar_clustering.py
@@ -0,0 +1,21 @@
 import numpy as np
 from pyannote_diarization_3_1_mlx.clustering import cluster_embeddings
 def test_two_well_separated_clusters():
    rng = np.random.default_rng(42)
    a = rng.normal(loc=[1.0, 0.0, 0.0] + [0.0]*253, scale=0.01, size=(10, 256))
    b = rng.normal(loc=[0.0, 1.0, 0.0] + [0.0]*253, scale=0.01, size=(10, 256))
    emb = np.vstack([a, b]).astype(np.float32)
    labels = cluster_embeddings(emb, num_speakers=2)
    assert len(set(labels[:10])) == 1
    assert len(set(labels[10:])) == 1
    assert labels[0] != labels[10]
 def test_threshold_based():
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(30, 256)).astype(np.float32)
    labels = cluster_embeddings(emb, num_speakers=None,
                                min_speakers=1, max_speakers=10)
    assert 1 <= len(set(labels)) <= 10
--- a/tests/unit/test_diar_config.py
+++ b/tests/unit/test_diar_config.py
@@ -0,0 +1,28 @@
 from pyannote_diarization_3_1_mlx._config import (
    SEG_DURATION, SEG_HOP, SEG_FRAMES, SEG_CLASSES,
    MAX_SPEAKERS_PER_CHUNK, MAX_SPEAKERS_PER_FRAME,
    EMB_BATCH_SIZE, EMB_EXCLUDE_OVERLAP,
    CLUSTER_METHOD, CLUSTER_THRESHOLD, CLUSTER_MIN_SIZE,
    SEG_HF_REPO, SEG_HF_REV, EMB_HF_REPO, EMB_HF_REV,
 )
 def test_pyannote_3_1_locked_hyperparameters():
    assert SEG_DURATION == 10.0
    assert SEG_HOP == 1.0
    assert SEG_FRAMES == 589
    assert SEG_CLASSES == 7
    assert MAX_SPEAKERS_PER_CHUNK == 3
    assert MAX_SPEAKERS_PER_FRAME == 2
    assert EMB_BATCH_SIZE == 32
    assert EMB_EXCLUDE_OVERLAP is True
    assert CLUSTER_METHOD == "centroid"
    assert CLUSTER_THRESHOLD == 0.7045654963945799
    assert CLUSTER_MIN_SIZE == 12
 def test_locked_hf_revisions():
    assert SEG_HF_REPO == "mlx-community/pyannote-segmentation-3.0-mlx"
    assert SEG_HF_REV == "5189a69b35c5f7e48082a978f3476bac81590874"
    assert EMB_HF_REPO == "mlx-community/wespeaker-voxceleb-resnet34-LM"
    assert EMB_HF_REV == "97fc9343d2cfd0ae4d1c1d8c299e0046aa502e31"
--- a/tests/unit/test_diar_embedding_shape.py
+++ b/tests/unit/test_diar_embedding_shape.py
@@ -0,0 +1,11 @@
 import mlx.core as mx
 from pyannote_diarization_3_1_mlx.embedding import EmbeddingModel
 from pyannote_diarization_3_1_mlx._config import EMB_DIM
 def test_embedding_output_shape():
    m = EmbeddingModel()
    fb = mx.zeros((2, 200, 80))  # (B, T, mel)
    weights = mx.ones((2, 200))
    emb = m(fb, weights)
    assert emb.shape == (2, EMB_DIM), f"got {emb.shape}"
--- a/tests/unit/test_diar_pipeline_smoke.py
+++ b/tests/unit/test_diar_pipeline_smoke.py
@@ -0,0 +1,25 @@
 """Smoke test for MlxDiarizationPipeline orchestrator.
 Mocks all sub-components so no HF downloads or real inference is needed.
 30 s of silence → powerset returns all zeros → no active speaker slots → empty annotation.
 """
 import numpy as np
 import mlx.core as mx
 from unittest.mock import MagicMock
 from pyannote_diarization_3_1_mlx.pipeline import MlxDiarizationPipeline
 def test_pipeline_smoke_on_30s_zeros(mocker):
    p = MlxDiarizationPipeline.__new__(MlxDiarizationPipeline)
    p._segmentation = MagicMock()
    p._embedding = MagicMock()
    # mock seg → all class 0 (silence) → no slots → empty annotation
    p._segmentation.return_value = mx.zeros((1, 589, 7))
    p._powerset = MagicMock()
    p._powerset.to_multilabel.return_value = mx.zeros((589, 3))
    p._embedding.return_value = mx.ones((1, 256))
    # 30 s of silence
    audio = np.zeros(30 * 16000, dtype=np.float32)
    annotation = p({"waveform": mx.array(audio)[None, :], "sample_rate": 16000})
    # silence → no turns
    assert len(list(annotation.itertracks())) == 0
--- a/tests/unit/test_diar_powerset.py
+++ b/tests/unit/test_diar_powerset.py
@@ -0,0 +1,39 @@
 import numpy as np
 import mlx.core as mx
 from pyannote_diarization_3_1_mlx.powerset import Powerset, POWERSET_3_2_MAPPING
 def test_static_mapping_matches_pyannote():
    assert POWERSET_3_2_MAPPING.shape == (7, 3)
    expected = np.array([
        [0, 0, 0],
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
        [1, 1, 0],
        [1, 0, 1],
        [0, 1, 1],
    ], dtype=np.float32)
    np.testing.assert_array_equal(POWERSET_3_2_MAPPING, expected)
 def test_to_multilabel_hard_argmax():
    p = Powerset()
    # frame 0 → class 1 (S1 only), frame 1 → class 4 (S1+S2), frame 2 → class 0
    logits = mx.array([
        [0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0],
        [5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    ])
    out = p.to_multilabel(logits)
    out_np = np.asarray(out)
    np.testing.assert_array_equal(out_np[0], [1, 0, 0])
    np.testing.assert_array_equal(out_np[1], [1, 1, 0])
    np.testing.assert_array_equal(out_np[2], [0, 0, 0])
 def test_to_multilabel_shape():
    p = Powerset()
    logits = mx.zeros((589, 7))
    out = p.to_multilabel(logits)
    assert out.shape == (589, 3)
--- a/tests/unit/test_diar_segmentation_load.py
+++ b/tests/unit/test_diar_segmentation_load.py
@@ -0,0 +1,9 @@
 """Unit test: load SegmentationModel weights from HF mlx-community repo."""
 import pytest
 from pyannote_diarization_3_1_mlx.segmentation import SegmentationModel
@pytest.mark.integration
 def test_segmentation_loads_from_hf():
    m = SegmentationModel.from_hf()
    assert m is not None
--- a/tests/unit/test_diar_segmentation_shape.py
+++ b/tests/unit/test_diar_segmentation_shape.py
@@ -0,0 +1,9 @@
 import mlx.core as mx
 from pyannote_diarization_3_1_mlx.segmentation import SegmentationModel
 def test_segmentation_full_shape():
    m = SegmentationModel()
    x = mx.zeros((1, 1, 160000))  # 10s @ 16k mono
    out = m(x)
    assert out.shape == (1, 589, 7), f"got {out.shape}"
--- a/tests/unit/test_diar_sincnet.py
+++ b/tests/unit/test_diar_sincnet.py
@@ -0,0 +1,12 @@
 import mlx.core as mx
 from pyannote_diarization_3_1_mlx._sincnet import SincNet
 def test_sincnet_output_shape_589_frames():
    """For pyannote 3.1, 10s @ 16kHz input → 589 frames out."""
    net = SincNet(sample_rate=16000)
    x = mx.zeros((1, 1, 16000 * 10))  # (B, C, T)
    out = net(x)
    # Expect (1, 60, 589) per upstream PyanNet.SincNet output
    assert out.shape[-1] == 589, f"got frames={out.shape[-1]}"
    assert out.shape[1] == 60, f"got channels={out.shape[1]}"
--- a/tests/unit/test_diar_window.py
+++ b/tests/unit/test_diar_window.py
@@ -0,0 +1,16 @@
 from pyannote_diarization_3_1_mlx._window import sliding_windows
 import numpy as np
 def test_sliding_windows_full_coverage():
    sr = 16000
    audio = np.zeros(int(25 * sr), dtype=np.float32)
    windows = list(sliding_windows(audio, sr=sr, duration_s=10.0, hop_s=1.0))
    # Expect (25-10)/1 + 1 = 16 windows, all 10 s long
    assert len(windows) == 16
    for start, end, slice_ in windows:
        assert end - start == 10.0
        assert len(slice_) == 10 * sr
    # boundaries
    assert windows[0][0] == 0.0
    assert windows[-1][1] == 25.0