# Speaker Embeddings
Extract 256-dimensional L2-normalized speaker vectors using WeSpeaker ResNet34-LM. These embeddings capture the unique vocal characteristics of a speaker and can be used for identification, verification, and voice search.
## Architecture
WeSpeaker ResNet34-LM is a deep residual network trained for speaker representation learning.
| Stage | Details |
|---|---|
| Input | Conv2d (1 to 32 channels) |
| ResNet34 | [3, 4, 6, 3] residual blocks |
| Stats Pooling | Mean + standard deviation over time |
| Projection | Linear (5120 to 256) |
| Output | L2-normalized 256-dim embedding |
Model size: ~6.6M parameters, ~25 MB on disk.
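The 5120-dimensional pooling output is consistent with the final ResNet stage producing 256 channels over 10 frequency bins (80 mel bins downsampled 8x by three stride-2 stages), with mean and standard deviation doubling the feature count: 256 x 10 x 2 = 5120. A minimal illustrative sketch of mean + standard deviation pooling over time (not the library's actual implementation):

```swift
import Foundation

// Temporal statistics pooling: for each feature dimension, concatenate
// the mean and standard deviation computed over the time axis.
// Input: frames[time][features]; output: [2 * features].
func statsPool(_ frames: [[Float]]) -> [Float] {
    let t = Float(frames.count)
    let dims = frames[0].count
    var mean = [Float](repeating: 0, count: dims)
    for frame in frames {
        for d in 0..<dims { mean[d] += frame[d] / t }
    }
    var variance = [Float](repeating: 0, count: dims)
    for frame in frames {
        for d in 0..<dims {
            let diff = frame[d] - mean[d]
            variance[d] += diff * diff / t
        }
    }
    // Concatenated [mean, std] vector
    return mean + variance.map { sqrt($0) }
}
```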
## Mel Features

The model uses 80-dimensional mel-frequency features computed with a Hamming window. Log scaling uses a simple `log(max(mel, 1e-10))` formula without additional normalization. Batch normalization is fused into the Conv2d layers at conversion time for inference efficiency.
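The log scaling described above can be sketched in a few lines; the floor of `1e-10` guards against `log(0)` on silent frames, and no per-utterance mean or variance normalization follows:

```swift
import Foundation

// Log-compress mel filterbank energies with a small floor to avoid log(0).
// No further normalization is applied, matching the description above.
func logMel(_ mel: [Float]) -> [Float] {
    mel.map { log(max($0, 1e-10)) }
}
```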
## CLI Usage

```bash
# Extract speaker embedding
.build/release/audio embed-speaker voice.wav

# JSON output (includes the 256-dim vector)
.build/release/audio embed-speaker voice.wav --json

# Choose inference engine
.build/release/audio embed-speaker voice.wav --engine coreml
```
### Options

| Option | Description |
|---|---|
| `--engine` | Inference engine: `mlx` or `coreml` |
| `--json` | JSON output format with full embedding vector |
## Use Cases

### Speaker Verification

Compare two audio samples to determine whether they come from the same speaker: extract an embedding from each and compute the cosine similarity. The higher the similarity score, the more likely the two samples are from the same speaker.
```swift
import SpeechVAD

let model = try await WeSpeaker.loadFromHub()
let embedding1 = try await model.embed(audioFile: "sample1.wav")
let embedding2 = try await model.embed(audioFile: "sample2.wav")

let similarity = cosineSimilarity(embedding1, embedding2)
print("Similarity: \(similarity)") // > 0.7 typically same speaker
```
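The `cosineSimilarity` helper above is not defined in this snippet; one possible implementation is sketched below. Since WeSpeaker embeddings are L2-normalized, the result effectively reduces to a dot product, but computing the norms keeps the function correct for arbitrary vectors:

```swift
import Foundation

// Hypothetical helper: cosine similarity between two equal-length vectors.
// For L2-normalized embeddings this equals their dot product.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "embeddings must have the same dimension")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB) + 1e-10)
}
```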
### Speaker Identification
Match an unknown audio sample against a database of enrolled speaker embeddings. The enrolled speaker with the highest cosine similarity is the predicted identity.
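A minimal sketch of this lookup, assuming a hypothetical in-memory dictionary of enrolled embeddings. Because the embeddings are L2-normalized, the dot product serves as the cosine similarity:

```swift
import Foundation

// Dot product of two equal-length vectors; for L2-normalized embeddings
// this equals their cosine similarity.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
}

// Return the enrolled speaker with the highest similarity to the query,
// or nil if the database is empty.
func identify(query: [Float], enrolled: [String: [Float]]) -> (speaker: String, score: Float)? {
    enrolled
        .map { (speaker: $0.key, score: dot(query, $0.value)) }
        .max { $0.score < $1.score }
}
```

In practice the top score should also be checked against a threshold so that speakers absent from the database are rejected rather than mapped to the nearest enrolled identity.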
### Voice Search
Index a collection of audio recordings by speaker embedding, then query with a new audio sample to find all recordings from the same speaker.
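A sketch of a threshold-based search over such an index; the `IndexedRecording` type and the 0.7 default threshold are illustrative assumptions, not part of the library:

```swift
// Hypothetical index entry: a recording path plus its speaker embedding.
struct IndexedRecording {
    let path: String
    let embedding: [Float]
}

// Return all recordings whose embedding is at least `threshold` similar
// to the query. Embeddings are assumed L2-normalized, so the dot product
// equals cosine similarity.
func search(query: [Float], index: [IndexedRecording], threshold: Float = 0.7) -> [IndexedRecording] {
    index.filter { recording in
        zip(query, recording.embedding).reduce(0) { $0 + $1.0 * $1.1 } >= threshold
    }
}
```

For large collections, a linear scan like this can be replaced by an approximate nearest-neighbor index over the embedding vectors.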
Speaker embeddings work best with clean speech of at least 2-3 seconds. Very short clips or noisy recordings may produce less reliable embeddings. Consider applying speech enhancement first for noisy audio.
## Model Downloads
| Model | Backend | Size | HuggingFace |
|---|---|---|---|
| WeSpeaker-ResNet34-LM | MLX | ~25 MB | aufklarer/WeSpeaker-ResNet34-LM-MLX |
| WeSpeaker-ResNet34-LM | CoreML | ~25 MB | aufklarer/WeSpeaker-ResNet34-LM-CoreML |
## Swift API

```swift
import SpeechVAD

let model = try await WeSpeaker.loadFromHub()

// Extract an embedding from a file
let fileEmbedding = try await model.embed(audioFile: "voice.wav")
print("Embedding dimensions: \(fileEmbedding.count)") // 256

// Extract an embedding from raw audio samples
let samples: [Float] = loadAudio("voice.wav")
let sampleEmbedding = try await model.embed(samples: samples, sampleRate: 16000)
```