# Speaker Embeddings
Extract 256-dimensional L2-normalized speaker vectors using WeSpeaker ResNet34-LM. These embeddings capture the unique vocal characteristics of a speaker and can be used for identification, verification, and voice search.
## Architecture
WeSpeaker ResNet34-LM is a deep residual network trained for speaker representation learning.
| Stage | Details |
|---|---|
| Input | Conv2d (1 to 32 channels) |
| ResNet34 | [3, 4, 6, 3] residual blocks |
| Stats Pooling | Mean + standard deviation over time |
| Projection | Linear (5120 to 256) |
| Output | L2-normalized 256-dim embedding |
Model size: ~6.6M parameters, ~25 MB on disk.
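The 5120-dimensional pooling output is consistent with the final ResNet stage producing 256 channels over 10 frequency bins (80 mel bins downsampled 8x by three stride-2 stages), with mean and standard deviation doubling the feature count: 256 x 10 x 2 = 5120. A minimal illustrative sketch of mean + standard deviation pooling over time (not the library's actual implementation):

```swift
import Foundation

// Temporal statistics pooling: for each feature dimension, concatenate
// the mean and standard deviation computed over the time axis.
// Input: frames[time][features]; output: [2 * features].
func statsPool(_ frames: [[Float]]) -> [Float] {
    let t = Float(frames.count)
    let dims = frames[0].count
    var mean = [Float](repeating: 0, count: dims)
    for frame in frames {
        for d in 0..<dims { mean[d] += frame[d] / t }
    }
    var variance = [Float](repeating: 0, count: dims)
    for frame in frames {
        for d in 0..<dims {
            let diff = frame[d] - mean[d]
            variance[d] += diff * diff / t
        }
    }
    // Concatenated [mean, std] vector
    return mean + variance.map { sqrt($0) }
}
```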
## Mel Features

The model uses 80-dimensional mel-frequency features computed with a Hamming window. Log scaling uses a simple `log(max(mel, 1e-10))` formula without additional normalization. Batch normalization is fused into the Conv2d layers at conversion time for inference efficiency.
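The log scaling described above can be sketched in a few lines; the floor of `1e-10` guards against `log(0)` on silent frames, and no per-utterance mean or variance normalization follows:

```swift
import Foundation

// Log-compress mel filterbank energies with a small floor to avoid log(0).
// No further normalization is applied, matching the description above.
func logMel(_ mel: [Float]) -> [Float] {
    mel.map { log(max($0, 1e-10)) }
}
```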
## CLI Usage

```bash
# Extract speaker embedding
.build/release/audio embed-speaker voice.wav

# JSON output (includes the 256-dim vector)
.build/release/audio embed-speaker voice.wav --json

# Choose inference engine
.build/release/audio embed-speaker voice.wav --engine coreml
```
### Options

| Option | Description |
|---|---|
| `--engine` | Inference engine: `mlx` or `coreml` |
| `--json` | JSON output format with full embedding vector |
## Use Cases

### Speaker Verification

Compare two audio samples to determine whether they come from the same speaker: extract an embedding from each and compute the cosine similarity. The higher the similarity score, the more likely the two samples are from the same speaker.
```swift
import SpeechVAD

let model = try await WeSpeaker.loadFromHub()
let embedding1 = try await model.embed(audioFile: "sample1.wav")
let embedding2 = try await model.embed(audioFile: "sample2.wav")

let similarity = cosineSimilarity(embedding1, embedding2)
print("Similarity: \(similarity)") // > 0.7 typically same speaker
```
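The `cosineSimilarity` helper above is not defined in this snippet; one possible implementation is sketched below. Since WeSpeaker embeddings are L2-normalized, the result effectively reduces to a dot product, but computing the norms keeps the function correct for arbitrary vectors:

```swift
import Foundation

// Hypothetical helper: cosine similarity between two equal-length vectors.
// For L2-normalized embeddings this equals their dot product.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "embeddings must have the same dimension")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB) + 1e-10)
}
```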
### Speaker Identification
Match an unknown audio sample against a database of enrolled speaker embeddings. The enrolled speaker with the highest cosine similarity is the predicted identity.
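A minimal sketch of this lookup, assuming a hypothetical in-memory dictionary of enrolled embeddings. Because the embeddings are L2-normalized, the dot product serves as the cosine similarity:

```swift
import Foundation

// Dot product of two equal-length vectors; for L2-normalized embeddings
// this equals their cosine similarity.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
}

// Return the enrolled speaker with the highest similarity to the query,
// or nil if the database is empty.
func identify(query: [Float], enrolled: [String: [Float]]) -> (speaker: String, score: Float)? {
    enrolled
        .map { (speaker: $0.key, score: dot(query, $0.value)) }
        .max { $0.score < $1.score }
}
```

In practice the top score should also be checked against a threshold so that speakers absent from the database are rejected rather than mapped to the nearest enrolled identity.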
### Voice Search
Index a collection of audio recordings by speaker embedding, then query with a new audio sample to find all recordings from the same speaker.
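A sketch of a threshold-based search over such an index; the `IndexedRecording` type and the 0.7 default threshold are illustrative assumptions, not part of the library:

```swift
// Hypothetical index entry: a recording path plus its speaker embedding.
struct IndexedRecording {
    let path: String
    let embedding: [Float]
}

// Return all recordings whose embedding is at least `threshold` similar
// to the query. Embeddings are assumed L2-normalized, so the dot product
// equals cosine similarity.
func search(query: [Float], index: [IndexedRecording], threshold: Float = 0.7) -> [IndexedRecording] {
    index.filter { recording in
        zip(query, recording.embedding).reduce(0) { $0 + $1.0 * $1.1 } >= threshold
    }
}
```

For large collections, a linear scan like this can be replaced by an approximate nearest-neighbor index over the embedding vectors.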
Speaker embeddings work best with clean speech of at least 2-3 seconds. Very short clips or noisy recordings may produce less reliable embeddings. Consider applying speech enhancement first for noisy audio.
## Model Downloads
| Model | Backend | Size | HuggingFace |
|---|---|---|---|
| WeSpeaker-ResNet34-LM | MLX | ~25 MB | aufklarer/WeSpeaker-ResNet34-LM-MLX |
| WeSpeaker-ResNet34-LM | CoreML | ~25 MB | aufklarer/WeSpeaker-ResNet34-LM-CoreML |
## Swift API

```swift
import SpeechVAD

let model = try await WeSpeaker.loadFromHub()

// Extract an embedding from a file
let fileEmbedding = try await model.embed(audioFile: "voice.wav")
print("Embedding dimensions: \(fileEmbedding.count)") // 256

// Extract an embedding from raw audio samples
let samples: [Float] = loadAudio("voice.wav")
let sampleEmbedding = try await model.embed(samples: samples, sampleRate: 16000)
```