Speaker Embeddings

Extract 256-dimensional L2-normalized speaker vectors using WeSpeaker ResNet34-LM. These embeddings capture the unique vocal characteristics of a speaker and can be used for identification, verification, and voice search.

Architecture

WeSpeaker ResNet34-LM is a deep residual network trained for speaker representation learning.

| Stage | Details |
|---|---|
| Input | Conv2d (1 to 32 channels) |
| ResNet34 | [3, 4, 6, 3] residual blocks |
| Stats Pooling | Mean + standard deviation over time |
| Projection | Linear (5120 to 256) |
| Output | L2-normalized 256-dim embedding |

Model size: ~6.6M parameters, ~25 MB on disk.
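The 5120-dim pooling input is explained by the stats pooling stage: assuming the final ResNet feature map has 2560 features per frame (e.g. 256 channels times 10 frequency bins), concatenating the per-feature mean and standard deviation over time doubles that to 5120. A minimal stdlib-only sketch of the pooling step (the exact feature layout is an assumption, not taken from the model source):

```swift
// Statistics pooling: concatenate the per-feature mean and standard
// deviation over the time axis. `frames` is [time][features]; the
// output has 2 * features elements (mean first, then std).
func statsPool(_ frames: [[Float]]) -> [Float] {
    precondition(!frames.isEmpty, "need at least one frame")
    let t = Float(frames.count)
    let dim = frames[0].count
    var mean = [Float](repeating: 0, count: dim)
    for frame in frames {
        for i in 0..<dim { mean[i] += frame[i] }
    }
    for i in 0..<dim { mean[i] /= t }
    var std = [Float](repeating: 0, count: dim)
    for frame in frames {
        for i in 0..<dim {
            let d = frame[i] - mean[i]
            std[i] += d * d
        }
    }
    for i in 0..<dim { std[i] = (std[i] / t).squareRoot() }
    return mean + std
}
```

Pooling collapses the variable-length time axis into a fixed-size vector, which is what lets the projection layer emit one embedding per utterance regardless of clip duration.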

Mel Features

The model uses 80-dimensional mel-frequency features computed with a Hamming window. Log scaling uses a simple `log(max(mel, 1e-10))` formula without additional normalization. Batch normalization is fused into the Conv2d layers at conversion time for inference efficiency.
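The log-scaling formula above can be written directly as a small helper; the 1e-10 floor clamps silent (zero-energy) bins so `log` never sees zero:

```swift
import Foundation

// Log-scale mel filterbank energies exactly as described:
// log(max(mel, 1e-10)), with no further mean/variance normalization.
func logMel(_ mel: [Float]) -> [Float] {
    mel.map { Float(log(Double(max($0, 1e-10)))) }
}
```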

CLI Usage

```bash
# Extract speaker embedding
.build/release/audio embed-speaker voice.wav

# JSON output (includes the 256-dim vector)
.build/release/audio embed-speaker voice.wav --json

# Choose inference engine
.build/release/audio embed-speaker voice.wav --engine coreml
```

Options

| Option | Description |
|---|---|
| `--engine` | Inference engine: `mlx` or `coreml` |
| `--json` | JSON output format with full embedding vector |

Use Cases

Speaker Verification

Compare two audio samples to determine whether they come from the same speaker: extract an embedding from each and compute their cosine similarity. The higher the score, the more likely the two samples share a speaker.

```swift
import SpeechVAD

let model = try await WeSpeaker.loadFromHub()
let embedding1 = try await model.embed(audioFile: "sample1.wav")
let embedding2 = try await model.embed(audioFile: "sample2.wav")

let similarity = cosineSimilarity(embedding1, embedding2)
print("Similarity: \(similarity)")  // > 0.7 typically same speaker
```
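The `cosineSimilarity` helper in the snippet above is not part of the library API shown; a stdlib-only sketch is:

```swift
// Cosine similarity between two embedding vectors. For L2-normalized
// WeSpeaker embeddings this reduces to a plain dot product, but the
// general form also handles non-normalized inputs.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "dimension mismatch")
    var dot: Float = 0, na: Float = 0, nb: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    return dot / (na.squareRoot() * nb.squareRoot())
}
```

The result lies in [-1, 1]; identical directions score 1, orthogonal vectors score 0.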

Speaker Identification

Match an unknown audio sample against a database of enrolled speaker embeddings. The enrolled speaker with the highest cosine similarity is the predicted identity.
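As a sketch of the lookup step (the `identify` helper and its tuple layout are illustrative, not part of the library; it assumes embeddings were extracted with `model.embed` beforehand):

```swift
// Because WeSpeaker embeddings are L2-normalized, cosine similarity
// reduces to a plain dot product.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
}

// Return the enrolled speaker with the highest similarity to the query.
func identify(query: [Float],
              enrolled: [(name: String, embedding: [Float])]) -> (name: String, score: Float)? {
    enrolled
        .map { (name: $0.name, score: dot(query, $0.embedding)) }
        .max { $0.score < $1.score }
}
```

In practice you would also reject the best match when its score falls below a verification threshold, so unknown speakers are not forced onto the nearest enrolled identity.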

Voice Search

Index a collection of audio recordings by speaker embedding, then query with a new audio sample to find all recordings from the same speaker.
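A minimal sketch of the query step, assuming the index is just an array of precomputed embeddings (the 0.7 cutoff follows the rule of thumb used in the verification example):

```swift
// Embeddings are L2-normalized, so the dot product is the cosine similarity.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
}

// Voice search: return indices of all indexed recordings whose speaker
// embedding is similar enough to the query embedding.
func search(query: [Float], index: [[Float]], threshold: Float = 0.7) -> [Int] {
    index.indices.filter { dot(query, index[$0]) >= threshold }
}
```

For large collections, a brute-force scan like this can be replaced by an approximate nearest-neighbor index without changing the embedding pipeline.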

Important

Speaker embeddings work best with clean speech of at least 2-3 seconds. Very short clips or noisy recordings may produce less reliable embeddings. Consider applying speech enhancement first for noisy audio.

Model Downloads

| Model | Backend | Size | HuggingFace |
|---|---|---|---|
| WeSpeaker-ResNet34-LM | MLX | ~25 MB | aufklarer/WeSpeaker-ResNet34-LM-MLX |
| WeSpeaker-ResNet34-LM | CoreML | ~25 MB | aufklarer/WeSpeaker-ResNet34-LM-CoreML |

Swift API

```swift
import SpeechVAD

let model = try await WeSpeaker.loadFromHub()

// Extract embedding from file
let embedding = try await model.embed(audioFile: "voice.wav")
print("Embedding dimensions: \(embedding.count)")  // 256

// Extract embedding from raw audio samples
// (loadAudio stands in for your own decoding helper returning 16 kHz mono)
let samples: [Float] = loadAudio("voice.wav")
let sampleEmbedding = try await model.embed(samples: samples, sampleRate: 16000)
```