Active (maintaining) · 6 commits · Solo developer

Claude Speaks

Local TTS voice system for Claude Code — persistent Kokoro-82M daemon, Unix socket IPC, and a debugging story in macOS audio architecture.

Python · Bash · macOS · Kokoro-82M · Unix IPC

Giving an AI a voice sounds simple until you try it on macOS.

Claude Speaks is a local text-to-speech system for Claude Code built around Kokoro-82M, a 335MB ONNX neural TTS model that runs entirely on your machine. A persistent Python daemon loads the model once at session start, then speaks Claude's responses with near-zero latency over Unix socket IPC. Five Claude Code hooks handle the full lifecycle — startup, speech, incremental narration, interrupt, and cleanup. The result is a working session where you can hear Claude think, not just read it.

The Problem

Claude Code is text-only. For long working sessions, reading every response breaks flow — you're switching attention from the task to the terminal constantly. Cloud TTS services like ElevenLabs or Google add network latency and per-character cost. macOS's built-in say command is functional but robotic, and even then you're spawning a new process per utterance with no lifecycle integration.

The goal was specific: local, low-latency, natural-sounding voice that plugged cleanly into Claude Code's existing hook system without requiring any changes to how I work. No cloud API calls, no perceptible delay between Claude responding and audio starting, no manual toggling.

How It Works

Persistent Daemon Architecture

The naive approach is to spawn the kokoro-tts CLI on each utterance. It works, but Kokoro-82M carries a 3–5 second cold start on every invocation — the model loads into memory, generates audio, exits. For a tool meant to reduce friction, a five-second lag before every sentence is worse than silence.

The daemon solves this by inverting the model lifecycle. At session start, a Python process loads the 335MB model once and parks it in memory. From that point, speak commands arrive over a Unix domain socket and audio starts within milliseconds — the model is already warm. The IPC protocol is length-prefixed JSON with a 100KB message cap, which keeps the implementation simple while handling any realistic utterance length. Communication is fire-and-forget: the hook sends a command and returns immediately without waiting for playback to finish.
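
A minimal sketch of what length-prefixed JSON framing with a size cap can look like. The 4-byte big-endian header, the exact cap of 100 * 1024 bytes, and the function names are assumptions for illustration; the write-up only states "length-prefixed JSON with a 100KB message cap".

```python
import json
import socket
import struct

# Assumption: the prefix is a 4-byte unsigned big-endian integer.
HEADER = struct.Struct("!I")
# Assumption: "100KB" is taken as 100 * 1024 bytes.
MAX_MESSAGE_BYTES = 100 * 1024


def encode_message(payload: dict) -> bytes:
    """Serialize a command as <length><json-bytes>."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) > MAX_MESSAGE_BYTES:
        raise ValueError("message exceeds the size cap")
    return HEADER.pack(len(body)) + body


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, since recv() may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket mid-message")
        buf.extend(chunk)
    return bytes(buf)


def read_message(sock: socket.socket) -> dict:
    """Read one framed message, rejecting oversized frames before reading them."""
    (length,) = HEADER.unpack(_recv_exact(sock, HEADER.size))
    if length > MAX_MESSAGE_BYTES:
        raise ValueError("declared length exceeds the size cap")
    return json.loads(_recv_exact(sock, length).decode("utf-8"))
```

Checking the declared length before reading the body is what makes the cap useful on the receiving side: a malformed or hostile header cannot make the daemon allocate an arbitrary buffer.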

Hook Integration

Five hooks cover the complete session lifecycle. SessionStart launches the daemon via double-fork, ensuring it outlives the hook process. Stop extracts the assistant's text from the incoming stdin JSON and dispatches it to the daemon. PreToolUse reads the session's transcript JSONL to find the most recent assistant message and narrate it as Claude begins a tool call — useful for understanding what Claude is about to do. UserPromptSubmit sends a synchronous interrupt to cut off any in-progress speech before the next exchange begins. SessionEnd sends a shutdown command and cleans up temp state.
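
The transcript lookup that PreToolUse performs can be sketched roughly as follows. The entry layout used here (a top-level "type" of "assistant" with a "message" whose "content" is a list of text blocks) is an assumption about the transcript JSONL, and last_assistant_text is a hypothetical helper name, not a function from the project.

```python
import json


def last_assistant_text(jsonl_lines):
    """Return the text of the most recent assistant entry, or None.

    Assumed entry shape (not confirmed by the write-up):
    {"type": "assistant", "message": {"content": [{"type": "text", "text": "..."}]}}
    """
    latest = None
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a line that is still being written
        if not isinstance(entry, dict) or entry.get("type") != "assistant":
            continue
        message = entry.get("message") or {}
        content = message.get("content") if isinstance(message, dict) else None
        if isinstance(content, str):
            text = content
        elif isinstance(content, list):
            text = " ".join(
                block.get("text", "")
                for block in content
                if isinstance(block, dict) and block.get("type") == "text"
            )
        else:
            text = ""
        if text.strip():
            latest = text.strip()
    return latest
```

In a hook this would be fed the open transcript file, for example `last_assistant_text(open(transcript_path))`, where transcript_path is whatever path the hook receives.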

The hook-to-daemon boundary is clean: hooks are thin Bash scripts that format a JSON payload and write it to the socket. All model interaction, audio synthesis, and playback management live in the daemon. If the daemon isn't running, hooks exit silently — no error noise, no session disruption.
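
The real hooks are Bash scripts; the same fire-and-forget behaviour is sketched here in Python to keep the examples in one language. The socket path, the "cmd"/"text" field names, the 4-byte length header, and the half-second timeout are all assumptions for illustration.

```python
import json
import socket
import struct

# Hypothetical socket location; the write-up does not name the real path.
SOCKET_PATH = "/tmp/claude-speaks.sock"


def send_command(payload: dict, path: str = SOCKET_PATH) -> bool:
    """Send one framed command and return without waiting for a reply.

    Returns False instead of raising when the daemon is not reachable,
    so a calling hook can exit quietly.
    """
    body = json.dumps(payload).encode("utf-8")
    frame = struct.pack("!I", len(body)) + body  # assumed 4-byte length prefix
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.settimeout(0.5)  # never let a hook hang on a stuck daemon
            sock.connect(path)
            sock.sendall(frame)
        return True
    except OSError:
        # Covers a missing socket file, connection refused, and timeouts.
        return False
```

A hook built on this would call something like `send_command({"cmd": "speak", "text": text})` and ignore the return value, which matches the "exit silently" behaviour described above.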

Voice System

Kokoro-82M exposes speaker embeddings as numpy arrays, which makes voice blending straightforward at the embedding level rather than requiring post-processing on the audio output. The default mix — 65% fable, 25% george, 10% lewis — produces a voice that reads as confident and clear without sounding affected. The weights are configurable without touching the daemon code.
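
Blending at the embedding level amounts to a weighted average. The sketch below uses plain Python lists as stand-ins for the numpy arrays so it runs without dependencies; with real arrays the same arithmetic applies elementwise. blend_voices is a hypothetical helper, and the two-element vectors are toy data, not real speaker embeddings.

```python
def blend_voices(embeddings, weights):
    """Return the weighted average of the named embeddings.

    embeddings: dict of voice name -> sequence of floats (stand-in for an array)
    weights:    dict of voice name -> non-negative weight; normalized here
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    names = list(weights)
    length = len(embeddings[names[0]])
    blended = [0.0] * length
    for name in names:
        vector = embeddings[name]
        if len(vector) != length:
            raise ValueError(f"embedding size mismatch for {name!r}")
        share = weights[name] / total
        for i, value in enumerate(vector):
            blended[i] += share * value
    return blended


# The default mix described above, applied to toy vectors.
DEFAULT_MIX = {"fable": 0.65, "george": 0.25, "lewis": 0.10}
```

Normalizing by the total means the configured weights do not have to sum to exactly 1, which makes hand-edited configuration more forgiving.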

Speech rate adjusts contextually based on response length. Short replies play at normal speed; longer responses step up slightly to avoid each session feeling like an audiobook. Markdown stripping runs via regex on the hot path — no external parser dependencies, because anything that adds import overhead adds latency on every utterance.
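
A rough sketch of both ideas, using only the standard re module. The specific patterns, the character thresholds (200 and 800), and the speed multipliers (1.0, 1.1, 1.2) are illustrative assumptions; the write-up does not give the actual values.

```python
import re

# Patterns are compiled once at import time so each utterance only pays for matching.
_MARKDOWN_RULES = [
    (re.compile(r"```.*?```", re.DOTALL), " "),            # drop fenced code blocks
    (re.compile(r"`([^`]*)`"), r"\1"),                     # inline code -> contents
    (re.compile(r"!\[[^\]]*\]\([^)]*\)"), ""),             # images -> nothing
    (re.compile(r"\[([^\]]+)\]\([^)]*\)"), r"\1"),         # links -> link text
    (re.compile(r"^\s{0,3}#{1,6}\s+", re.MULTILINE), ""),  # heading markers
    (re.compile(r"^\s*[-*+]\s+", re.MULTILINE), ""),       # bullet markers
    (re.compile(r"(\*\*|\*)(.+?)\1"), r"\2"),              # bold / italic asterisks
]
_WHITESPACE = re.compile(r"\s+")


def strip_markdown(text: str) -> str:
    """Reduce Markdown to plain speakable text."""
    for pattern, replacement in _MARKDOWN_RULES:
        text = pattern.sub(replacement, text)
    return _WHITESPACE.sub(" ", text).strip()


def speech_rate(text: str) -> float:
    """Pick a speed multiplier from response length (hypothetical thresholds)."""
    n = len(text)
    if n < 200:
        return 1.0
    if n < 800:
        return 1.1
    return 1.2
```

Underscore emphasis is deliberately left alone in this sketch so that identifiers such as snake_case names are not mangled before they are spoken.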

What I Learned

The first real wall was macOS CoreAudio. Standard Unix daemonization calls os.setsid() to create a new session and detach from the controlling terminal — this is textbook, every daemon tutorial shows it. But macOS ties audio hardware access to the user's login session. When setsid() severs the process from that session, the daemon can generate audio samples perfectly but has no path to the sound hardware. The symptoms were confusing: no errors, no crashes, just silence. The fix was to double-fork without calling setsid(), staying inside the user's session while still orphaning the daemon from the hook process. The model runs, the audio plays, and the hook has long since exited.

The second problem was subtler and harder to find. Python's sys.stdout = open(os.devnull, 'w') looks like it redirects stdout, and in Python-land it does. But the original file descriptor — the actual integer 1 at the OS level — is still open and still connected to Claude Code's pipe. Claude Code polls open file descriptors on hooks it spawns, and as long as that FD was live, the session would freeze waiting for it to close. The fix required os.dup2(os.open(os.devnull, os.O_WRONLY), 1) — operating at the OS level to actually replace the file descriptor, not just rebind a Python name to a different file object. That's a distinction that only matters when another process is watching your FDs.
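
The difference is easy to demonstrate with a pipe. The helper below wraps the same os.dup2 call quoted above; point_fd_at_devnull is a hypothetical name, and closing the temporary descriptor afterwards is an addition in this sketch rather than something the write-up describes.

```python
import os


def point_fd_at_devnull(fd: int) -> None:
    """Replace the OS-level descriptor `fd` so it refers to /dev/null.

    Whatever `fd` pointed at before (for example the write end of a pipe
    held open by a parent process) is closed as a side effect of dup2.
    """
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull, fd)
    finally:
        os.close(devnull)  # fd keeps its own reference to /dev/null


def demo() -> bytes:
    """Show that a reader sees end-of-file once the writer's fd is replaced."""
    read_end, write_end = os.pipe()
    os.set_blocking(read_end, False)
    point_fd_at_devnull(write_end)
    os.write(write_end, b"discarded")  # goes to /dev/null, not the pipe
    data = os.read(read_end, 1)        # empty bytes means EOF: the pipe is closed
    os.close(read_end)
    os.close(write_end)
    return data
```

In the daemon, the equivalent of the fix described above would be `point_fd_at_devnull(1)`. Rebinding sys.stdout alone would leave the pipe's write end open, and a reader waiting on it would never see end-of-file.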

The third problem revealed something about Claude Code's process model I hadn't anticipated. Claude Code tears down the entire process group when a hook exits: not just the hook itself, but every child it spawned. Backgrounding with & and disown in Bash still leaves the process in the hook's process group, so it dies with the hook. The double-fork pattern solves this correctly: the first fork creates an intermediate process that can be waited on, the second fork (made by that intermediate process) creates the actual daemon, and the intermediate process exits immediately. The daemon is now reparented to PID 1 (launchd on macOS), fully orphaned before the hook returns.
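
The fork mechanics described here and in the CoreAudio paragraph can be sketched as below. spawn_detached is a hypothetical name, and the sketch only shows the double fork without os.setsid(); it does not reproduce Claude Code's teardown behaviour or the audio session issue.

```python
import os


def spawn_detached(run) -> None:
    """Run `run()` in a grandchild process and return once it is orphaned.

    Deliberately does NOT call os.setsid(), so the grandchild stays in the
    caller's session, as the write-up describes for audio access on macOS.
    """
    pid = os.fork()
    if pid > 0:
        # Original process (the hook): reap the intermediate child, then return.
        os.waitpid(pid, 0)
        return

    # Intermediate process: fork once more and exit immediately.
    try:
        if os.fork() > 0:
            os._exit(0)
        # Grandchild: its parent has exited, so it is reparented to PID 1.
        run()
    finally:
        # os._exit skips atexit handlers and buffered-stdio flushes that
        # were inherited from the original process.
        os._exit(0)
```

Because the caller waits only on the intermediate process, which exits right away, the call returns almost immediately while `run()` continues in the background.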

The stabilization commit landed at +174 / -471 lines. The system got simpler as it got more reliable. Audio ducking — reducing system volume during speech — was cut entirely: it added implementation surface, required permissions, and the benefit was marginal in a single-user dev environment. The PreToolUse hook was rewritten twice before finding an approach that didn't introduce race conditions. Every removal made the remaining code faster and more predictable. The version that shipped is shorter than the version I started with, and it does more of what actually matters.

Outcome

v1.0.0, 6 commits, open source under MIT. One-command installer. The project demonstrates Unix systems programming on macOS, IPC design, process lifecycle management, and the diagnostic discipline to ship a stable system through iterative failure. The debugging path — CoreAudio session isolation, file descriptor semantics, process group teardown — required understanding each layer of the stack independently before the system could work reliably across all of them together.