# audio-server

`audio-server` is a local HTTP + WebSocket server that exposes every Soniqo model through a simple REST API, plus an OpenAI Realtime API-compatible WebSocket at `/v1/realtime`. It ships in the same Homebrew bottle as the audio CLI — `brew install speech` drops both on your PATH.
## Install and run

```bash
brew tap soniqo/speech https://github.com/soniqo/speech-swift
brew install speech
audio-server --port 8080
# Starting server on http://127.0.0.1:8080
# Endpoints:
#   POST /transcribe  - Speech-to-text (WAV body or JSON with audio_base64)
#   POST /speak       - Text-to-speech (JSON: {text, engine?, language?})
#   POST /respond     - Speech-to-speech (WAV body, voice/max_steps via query)
#   POST /enhance     - Speech enhancement (WAV body)
#   GET  /health      - Health check
#   WS   /v1/realtime - OpenAI Realtime API (JSON events, base64 PCM16 audio)
```
## Command-line flags

| Flag | Default | Description |
|---|---|---|
| `--host` | `127.0.0.1` | Bind address. Change to `0.0.0.0` to expose on LAN. |
| `--port` | `8080` | TCP port. |
| `--preload` | off | Load all models eagerly at startup. Slower boot (~30–60 s) but zero first-request latency. |
Models are downloaded on first use and cached in `~/Library/Caches/qwen3-speech/`. The first request for a given model pays the download + load cost (30 s – 2 min depending on model size); subsequent requests are warm.
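If that cold-start cost would otherwise land on a user-facing request, one option is to warm the server right after boot. A minimal sketch in Python (stdlib only; `silent_wav` and `warm_up` are helper names of my own, not part of the API — the endpoint and content type match the `/transcribe` docs below):

```python
import io
import urllib.request
import wave

def silent_wav(seconds: float = 1.0, rate: int = 16000) -> bytes:
    """Build an in-memory WAV of silence (mono, 16-bit PCM)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()

def warm_up(base_url: str = "http://127.0.0.1:8080") -> None:
    """POST a short silent clip to /transcribe to force the model load."""
    req = urllib.request.Request(
        base_url + "/transcribe",
        data=silent_wav(),
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )
    urllib.request.urlopen(req).read()  # blocks until the model is warm

# warm_up()  # run once at deploy time, before real traffic arrives
```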
## REST endpoints

### POST /transcribe — Speech-to-text

Accepts either a raw WAV body or a JSON envelope with base64-encoded audio.

```bash
# WAV body (preferred — lower overhead)
curl -X POST http://127.0.0.1:8080/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @recording.wav

# JSON with base64
curl -X POST http://127.0.0.1:8080/transcribe \
  -H "Content-Type: application/json" \
  -d '{"audio_base64":"'"$(base64 -i recording.wav)"'","language":"en"}'
```
Response: `{"text": "…", "language": "en", "confidence": 0.93}`.
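The JSON envelope from the second curl call can also be built programmatically. A sketch in Python (stdlib only; `make_envelope` and `transcribe_json` are hypothetical helper names — the field names are the ones documented above):

```python
import base64
import json
import urllib.request

def make_envelope(wav_bytes: bytes, language: str = "en") -> dict:
    """Base64-encode raw WAV bytes into the /transcribe JSON envelope."""
    return {"audio_base64": base64.b64encode(wav_bytes).decode("ascii"),
            "language": language}

def transcribe_json(wav_bytes: bytes, language: str = "en",
                    base_url: str = "http://127.0.0.1:8080") -> dict:
    """POST the envelope and return the parsed {text, language, confidence} reply."""
    req = urllib.request.Request(
        base_url + "/transcribe",
        data=json.dumps(make_envelope(wav_bytes, language)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```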
### POST /speak — Text-to-speech

```bash
curl -X POST http://127.0.0.1:8080/speak \
  -H "Content-Type: application/json" \
  -d '{"text":"Hello, world!","engine":"kokoro","language":"en"}' \
  --output hello.wav
```
Response body is a WAV blob. Supported `engine` values: `qwen3` (default), `cosyvoice`, `kokoro`.
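The same request from Python, as a sketch (stdlib only; `speak_payload` and `speak` are helper names of my own):

```python
import json
import urllib.request

def speak_payload(text: str, engine: str = "qwen3", language: str = "en") -> dict:
    """Build the JSON body for /speak; engine defaults to qwen3 as documented."""
    return {"text": text, "engine": engine, "language": language}

def speak(text: str, engine: str = "qwen3", language: str = "en",
          base_url: str = "http://127.0.0.1:8080") -> bytes:
    """POST to /speak and return the WAV response body."""
    req = urllib.request.Request(
        base_url + "/speak",
        data=json.dumps(speak_payload(text, engine, language)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage (server must be running):
# open("hello.wav", "wb").write(speak("Hello, world!", engine="kokoro"))
```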
### POST /respond — Speech-to-speech

```bash
curl -X POST "http://127.0.0.1:8080/respond?voice=en_female_calm&max_steps=256" \
  -H "Content-Type: audio/wav" \
  --data-binary @question.wav \
  --output answer.wav
```
Runs PersonaPlex 7B — audio in, audio out. The transcript is returned as an `X-Response-Text` header. See the PersonaPlex guide for voice preset names.
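Because the transcript travels in a response header rather than the body, a client has to read both. A sketch in Python (stdlib only; `respond_url` and `respond` are hypothetical helper names — the query parameters and header match the docs above):

```python
import urllib.parse
import urllib.request

def respond_url(voice: str, max_steps: int = 256,
                base_url: str = "http://127.0.0.1:8080") -> str:
    """Build the /respond URL with voice and max_steps query parameters."""
    query = urllib.parse.urlencode({"voice": voice, "max_steps": max_steps})
    return f"{base_url}/respond?{query}"

def respond(wav_bytes: bytes, voice: str, max_steps: int = 256) -> tuple:
    """POST audio to /respond; return (answer WAV bytes, transcript text)."""
    req = urllib.request.Request(
        respond_url(voice, max_steps),
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read(), resp.headers.get("X-Response-Text", "")
```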
### POST /enhance — Speech enhancement

```bash
curl -X POST http://127.0.0.1:8080/enhance \
  -H "Content-Type: audio/wav" \
  --data-binary @noisy.wav \
  --output clean.wav
```
DeepFilterNet3 at 48 kHz. Input is resampled if needed.
### GET /health — Liveness probe

```bash
curl http://127.0.0.1:8080/health
# {"status":"ok"}
```
## WebSocket: /v1/realtime

Drop-in compatible with the OpenAI Realtime API — the same JSON event schema (`session.update`, `input_audio_buffer.append`, `response.create`, `response.audio.delta`, …) with base64-encoded PCM16 audio at 24 kHz. Clients written against OpenAI's Realtime SDK work against `audio-server` without code changes (just switch the WebSocket URL).
### JavaScript example

```javascript
const ws = new WebSocket("ws://127.0.0.1:8080/v1/realtime");

// The handler is async because capturePCM16Chunk() is awaited below.
ws.onopen = async () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: { modalities: ["audio", "text"] }
  }));
  // Stream PCM16 mono 24 kHz audio from the mic:
  const audioBase64 = await capturePCM16Chunk();
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: audioBase64
  }));
  ws.send(JSON.stringify({ type: "response.create" }));
};

ws.onmessage = ev => {
  const msg = JSON.parse(ev.data);
  if (msg.type === "response.audio.delta") {
    playPCM16Base64(msg.delta);
  }
};
```
### Python example

```python
import asyncio, base64, json, wave, websockets

async def main():
    async with websockets.connect("ws://127.0.0.1:8080/v1/realtime") as ws:
        await ws.send(json.dumps({"type": "session.update",
                                  "session": {"modalities": ["audio", "text"]}}))
        with wave.open("question.wav", "rb") as wav:
            pcm16 = wav.readframes(wav.getnframes())
        await ws.send(json.dumps({"type": "input_audio_buffer.append",
                                  "audio": base64.b64encode(pcm16).decode()}))
        await ws.send(json.dumps({"type": "response.create"}))
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "response.audio.delta":
                with open("answer.pcm", "ab") as f:
                    f.write(base64.b64decode(msg["delta"]))
            elif msg["type"] == "response.done":
                break

asyncio.run(main())
```
## Deployment notes

- **No auth by default.** `audio-server` binds `127.0.0.1` and trusts every caller. Put it behind a reverse proxy (Caddy, nginx, Tailscale) if you expose it beyond localhost.
- **Models are lazy-loaded.** The first request for `/transcribe` triggers the ASR download + load (~700 MB, 30–60 s). Use `--preload` for zero cold start but a slower boot.
- **No streaming transcription yet.** `POST /transcribe` expects the full WAV up front. Use `/v1/realtime` for streaming.
- **Systemd / launchd.** No service unit ships today. A plain `nohup audio-server &` or your usual process supervisor works.
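As one concrete instance of the reverse-proxy note above, a minimal Caddyfile sketch (the hostname is a placeholder; Caddy's `reverse_proxy` passes WebSocket upgrades through, so `/v1/realtime` keeps working):

```
audio.example.com {
    # TLS is automatic for public hostnames; WebSocket upgrades pass through.
    reverse_proxy 127.0.0.1:8080
}
```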
## Source

- `Sources/AudioServer` — Hummingbird-based HTTP router, lazy model registry, `/v1/realtime` WebSocket handler.
- `Sources/AudioServerCLI` — `@main` entry point, argument parser.
- Upstream: OpenAI Realtime API reference.