# audio-server

`audio-server` is a local HTTP + WebSocket server that exposes every Soniqo model through a simple REST API, plus an OpenAI Realtime API-compatible WebSocket at `/v1/realtime`. It ships in the same Homebrew bottle as the audio CLI — `brew install speech` drops both on your PATH.
## Install and run

```bash
brew tap soniqo/speech https://github.com/soniqo/speech-swift
brew install speech
audio-server --port 8080
# Starting server on http://127.0.0.1:8080
# Endpoints:
#   POST /transcribe  - Speech-to-text (WAV body or JSON with audio_base64)
#   POST /speak       - Text-to-speech (JSON: {text, engine?, language?})
#   POST /respond     - Speech-to-speech (WAV body, voice/max_steps via query)
#   POST /enhance     - Speech enhancement (WAV body)
#   GET  /health      - Health check
#   WS   /v1/realtime - OpenAI Realtime API (JSON events, base64 PCM16 audio)
```
## Command-line flags

| Flag | Default | Description |
|---|---|---|
| `--host` | `127.0.0.1` | Bind address. Change to `0.0.0.0` to expose on LAN. |
| `--port` | `8080` | TCP port. |
| `--preload` | off | Load all models eagerly at startup. Slower boot (~30–60 s) but zero first-request latency. |
Models are downloaded on first use and cached in `~/Library/Caches/qwen3-speech/`. The first request for a given model pays the download + load cost (30 s – 2 min depending on model size); subsequent requests are warm.
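If that cold-start cost would otherwise land on a user-facing request, one option is to warm the server right after boot. A minimal sketch in Python (stdlib only; `silent_wav` and `warm_up` are helper names of my own, not part of the API — the endpoint and content type match the `/transcribe` docs below):

```python
import io
import urllib.request
import wave

def silent_wav(seconds: float = 1.0, rate: int = 16000) -> bytes:
    """Build an in-memory WAV of silence (mono, 16-bit PCM)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()

def warm_up(base_url: str = "http://127.0.0.1:8080") -> None:
    """POST a short silent clip to /transcribe to force the model load."""
    req = urllib.request.Request(
        base_url + "/transcribe",
        data=silent_wav(),
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )
    urllib.request.urlopen(req).read()  # blocks until the model is warm

# warm_up()  # run once at deploy time, before real traffic arrives
```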
## REST endpoints

### POST /transcribe — Speech-to-text

Accepts either a raw WAV body or a JSON envelope with base64-encoded audio.

```bash
# WAV body (preferred — lower overhead)
curl -X POST http://127.0.0.1:8080/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @recording.wav

# JSON with base64
curl -X POST http://127.0.0.1:8080/transcribe \
  -H "Content-Type: application/json" \
  -d '{"audio_base64":"'"$(base64 -i recording.wav)"'","language":"en"}'
```
Response: `{"text": "…", "language": "en", "confidence": 0.93}`.
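The JSON envelope from the second curl call can also be built programmatically. A sketch in Python (stdlib only; `make_envelope` and `transcribe_json` are hypothetical helper names — the field names are the ones documented above):

```python
import base64
import json
import urllib.request

def make_envelope(wav_bytes: bytes, language: str = "en") -> dict:
    """Base64-encode raw WAV bytes into the /transcribe JSON envelope."""
    return {"audio_base64": base64.b64encode(wav_bytes).decode("ascii"),
            "language": language}

def transcribe_json(wav_bytes: bytes, language: str = "en",
                    base_url: str = "http://127.0.0.1:8080") -> dict:
    """POST the envelope and return the parsed {text, language, confidence} reply."""
    req = urllib.request.Request(
        base_url + "/transcribe",
        data=json.dumps(make_envelope(wav_bytes, language)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```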
### POST /speak — Text-to-speech

```bash
curl -X POST http://127.0.0.1:8080/speak \
  -H "Content-Type: application/json" \
  -d '{"text":"Hello, world!","engine":"kokoro","language":"en"}' \
  --output hello.wav
```
Response body is a WAV blob. Supported `engine` values: `qwen3` (default), `cosyvoice`, `kokoro`.
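The same request from Python, as a sketch (stdlib only; `speak_payload` and `speak` are helper names of my own):

```python
import json
import urllib.request

def speak_payload(text: str, engine: str = "qwen3", language: str = "en") -> dict:
    """Build the JSON body for /speak; engine defaults to qwen3 as documented."""
    return {"text": text, "engine": engine, "language": language}

def speak(text: str, engine: str = "qwen3", language: str = "en",
          base_url: str = "http://127.0.0.1:8080") -> bytes:
    """POST to /speak and return the WAV response body."""
    req = urllib.request.Request(
        base_url + "/speak",
        data=json.dumps(speak_payload(text, engine, language)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage (server must be running):
# open("hello.wav", "wb").write(speak("Hello, world!", engine="kokoro"))
```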
### POST /respond — Speech-to-speech

```bash
curl -X POST "http://127.0.0.1:8080/respond?voice=en_female_calm&max_steps=256" \
  -H "Content-Type: audio/wav" \
  --data-binary @question.wav \
  --output answer.wav
```
Runs PersonaPlex 7B — audio in, audio out. The transcript is returned as an `X-Response-Text` header. See the PersonaPlex guide for voice preset names.
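Because the transcript travels in a response header rather than the body, a client has to read both. A sketch in Python (stdlib only; `respond_url` and `respond` are hypothetical helper names — the query parameters and header match the docs above):

```python
import urllib.parse
import urllib.request

def respond_url(voice: str, max_steps: int = 256,
                base_url: str = "http://127.0.0.1:8080") -> str:
    """Build the /respond URL with voice and max_steps query parameters."""
    query = urllib.parse.urlencode({"voice": voice, "max_steps": max_steps})
    return f"{base_url}/respond?{query}"

def respond(wav_bytes: bytes, voice: str, max_steps: int = 256) -> tuple:
    """POST audio to /respond; return (answer WAV bytes, transcript text)."""
    req = urllib.request.Request(
        respond_url(voice, max_steps),
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read(), resp.headers.get("X-Response-Text", "")
```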
### POST /enhance — Speech enhancement

```bash
curl -X POST http://127.0.0.1:8080/enhance \
  -H "Content-Type: audio/wav" \
  --data-binary @noisy.wav \
  --output clean.wav
```
DeepFilterNet3 at 48 kHz. Input is resampled if needed.
### GET /health — Liveness probe

```bash
curl http://127.0.0.1:8080/health
# {"status":"ok"}
```
## WebSocket: /v1/realtime

Drop-in compatible with the OpenAI Realtime API — the same JSON event schema (`session.update`, `input_audio_buffer.append`, `response.create`, `response.audio.delta`, …) with base64-encoded PCM16 audio at 24 kHz. Clients written against OpenAI's Realtime SDK work against `audio-server` without code changes (just switch the WebSocket URL).
### JavaScript example

```javascript
const ws = new WebSocket("ws://127.0.0.1:8080/v1/realtime");

// The handler is async because capturePCM16Chunk() is awaited below.
ws.onopen = async () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: { modalities: ["audio", "text"] }
  }));
  // Stream PCM16 mono 24 kHz audio from the mic:
  const audioBase64 = await capturePCM16Chunk();
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: audioBase64
  }));
  ws.send(JSON.stringify({ type: "response.create" }));
};

ws.onmessage = ev => {
  const msg = JSON.parse(ev.data);
  if (msg.type === "response.audio.delta") {
    playPCM16Base64(msg.delta);
  }
};
```
### Python example

```python
import asyncio, base64, json, wave, websockets

async def main():
    async with websockets.connect("ws://127.0.0.1:8080/v1/realtime") as ws:
        await ws.send(json.dumps({"type": "session.update",
                                  "session": {"modalities": ["audio", "text"]}}))
        with wave.open("question.wav", "rb") as wav:
            pcm16 = wav.readframes(wav.getnframes())
        await ws.send(json.dumps({"type": "input_audio_buffer.append",
                                  "audio": base64.b64encode(pcm16).decode()}))
        await ws.send(json.dumps({"type": "response.create"}))
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "response.audio.delta":
                with open("answer.pcm", "ab") as f:
                    f.write(base64.b64decode(msg["delta"]))
            elif msg["type"] == "response.done":
                break

asyncio.run(main())
```
## Deployment notes

- **No auth by default.** `audio-server` binds `127.0.0.1` and trusts every caller. Put it behind a reverse proxy (Caddy, nginx, Tailscale) if you expose it beyond localhost.
- **Models are lazy-loaded.** The first request for `/transcribe` triggers the ASR download + load (~700 MB, 30–60 s). Use `--preload` for zero cold start but a slower boot.
- **No streaming transcription yet.** `POST /transcribe` expects the full WAV up front. Use `/v1/realtime` for streaming.
- **Systemd / launchd.** No service unit ships today. A plain `nohup audio-server &` or your usual process supervisor works.
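As one concrete instance of the reverse-proxy note above, a minimal Caddyfile sketch (the hostname is a placeholder; Caddy's `reverse_proxy` passes WebSocket upgrades through, so `/v1/realtime` keeps working):

```
audio.example.com {
    # TLS is automatic for public hostnames; WebSocket upgrades pass through.
    reverse_proxy 127.0.0.1:8080
}
```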
## Source

- `Sources/AudioServer` — Hummingbird-based HTTP router, lazy model registry, `/v1/realtime` WebSocket handler.
- `Sources/AudioServerCLI` — `@main` entry point, argument parser.
- Upstream: OpenAI Realtime API reference.