feat(voice-server): add TTS provider abstraction with fallback order #173

zitongcharliedeng · 2025-12-09T17:31:50Z

Summary

Adds pluggable TTS provider system with configurable fallback order
Extracts ElevenLabs logic into provider class (backwards compatible)
Adds Piper provider for free local/offline TTS
Refactors MacOSSay provider using native command
Adds cross-platform audio playback (afplay/aplay/powershell)

Motivation

Related to #166 and complements #171. While pai-voice adds a CLI alternative to the HTTP server, this PR adds:

Provider abstraction - Multiple TTS backends can coexist
Fallback order - Configure which providers to try via
Cross-platform audio - Works on macOS, Linux, and WSL
Local TTS option - Piper provides free offline TTS (no API costs)

Configuration

{
  "providers": ["piper", "elevenlabs", "macos"]
}

First available provider is used. If no config exists, falls back to ElevenLabs API directly (fully backwards compatible).
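The selection rule above can be sketched as follows. This is a minimal illustration, not the code in this PR: `pickFirstAvailable`, the `isAvailable()` method, and the registry shape are hypothetical names chosen for the example.

```typescript
// Hypothetical minimal shape; the real provider interface in the PR may differ.
interface AvailabilityChecked {
  isAvailable(): boolean;
}

// Walk the configured names in order and return the first provider that
// reports itself usable, together with the config name it was found under.
function pickFirstAvailable<T extends AvailabilityChecked>(
  order: string[],
  registry: Record<string, () => T>,
): { name: string; provider: T } | null {
  for (const name of order) {
    const create = registry[name];
    if (!create) continue; // unknown name in config: skip instead of crashing
    const provider = create();
    if (provider.isAvailable()) return { name, provider };
  }
  return null; // nothing usable; the caller decides how to degrade
}
```

With the order `["piper", "elevenlabs", "macos"]`, a machine without the Piper binary would skip `piper` and land on `elevenlabs`, assuming its availability check passes.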

Providers

Provider	Type	Platform	Requirements
elevenlabs	Cloud	All	API key in ~/.env
piper	Local	Linux/WSL	piper binary + voice models
macos	Local	macOS	None (built-in)

Test plan

Verify ElevenLabs still works when no config.json present
Test provider fallback when first provider unavailable
Test Piper on Linux/WSL
Test MacOS provider on macOS
Verify /health endpoint shows correct provider

🤖 Generated with Claude Code

.claude/voice-server/providers/index.ts

.claude/voice-server/tts-providers/index.ts

.claude/voice-server/server.ts

.claude/voice-server/providers/Piper.ts

.claude/voice-server/config.json

.claude/voice-server/audio.ts

.claude/voice-server/tts-providers/ElevenLabs.ts

.claude/voice-server/desktop-audio-playback.ts

zitongcharliedeng · 2025-12-09T19:50:34Z

Manual Testing Results

Windows WSL2 Testing (via PowerShell MediaPlayer)

Test	Provider	Result
Audio playback from WSL	ElevenLabs	✅ Passed
Audio playback from WSL	Piper	✅ Passed
Path conversion (wslpath)	N/A	✅ Working
MediaPlayer (no GUI window)	N/A	✅ Working

Test Plan Progress

Verify ElevenLabs still works when no config.json present
Test provider fallback when first provider unavailable
Test Piper on Linux/WSL
Test MacOS provider on macOS (no macOS available)
Verify /health endpoint shows correct provider

Notes

PowerShell MediaPlayer plays audio from WSL without opening a GUI window
wslpath correctly converts WSL paths to Windows format
Provider fallback works correctly (Piper -> ElevenLabs when Piper unavailable)
\ type correctly detects \ via kernel release string
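For illustration, detection from the kernel release string might look like the sketch below. The member names of `ShellEnvironment` and the function `detectShellEnvironment` are assumptions made for this example; in Node the inputs would typically come from `process.platform` and `os.release()`.

```typescript
// Assumed set of environments; the PR's actual ShellEnvironment type may differ.
type ShellEnvironment = "macos" | "linux" | "wsl" | "windows" | "unknown";

// Pure function so it can be tested without touching the real OS.
function detectShellEnvironment(platform: string, kernelRelease: string): ShellEnvironment {
  if (platform === "darwin") return "macos";
  if (platform === "win32") return "windows";
  if (platform === "linux") {
    // WSL kernels include "microsoft" in the release string,
    // e.g. "5.15.153.1-microsoft-standard-WSL2" or "4.4.0-19041-Microsoft".
    return /microsoft/i.test(kernelRelease) ? "wsl" : "linux";
  }
  return "unknown";
}
```

Keeping the check case-insensitive covers both WSL1 and WSL2 release strings, which capitalize "Microsoft" differently.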

🤖 Manually tested by Charlie on Windows 11 :)

zitongcharliedeng · 2025-12-09T19:55:42Z

Re: per-agent voice IDs - This is already supported. The /notify endpoint accepts voice_id in the POST body, which gets passed to the provider. Each agent hook can pass a different voice_id per request. The Piper provider also supports voice mapping via voices.json config.

From Claude.

Adds a provider abstraction layer for text-to-speech with configurable fallback order and cross-platform audio playback support.

Changes:
- Add TTSProvider interface with ElevenLabs, Piper, and MacOS providers
- Add audio-playback module with ShellEnvironment detection
- Support macOS (afplay), Linux (paplay), and Windows/WSL (PowerShell)
- Add config.json for provider priority order
- Maintain backwards compatibility with existing ElevenLabs setup

Closes danielmiessler#166

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

zitongcharliedeng · 2025-12-09T20:04:28Z

Addressed remaining review comments:

Naming fixes:

Renamed SHELL_ENV to currentShellEnvironment (idiomatic camelCase)
Renamed audio-playback.ts to desktop-audio-playback.ts for clarity
Renamed providers/ to tts-providers/ (done previously)

Already addressed:

providerConstructorsFromConfigNames - was already using this name
Provider loading via loadProvider() in tts-providers/index.ts
Config order: elevenlabs, macos-say, piper
Interface uses synthesize() returning AudioResult (buffer + format)
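As a rough sketch of what such an interface could look like (field types, the optional `voiceId` parameter, and the `SilentProvider` and `speak` helpers are assumptions for illustration, not the PR's actual definitions):

```typescript
// Assumed shape: the comment above only states "buffer + format".
interface AudioResult {
  buffer: Uint8Array;
  format: "mp3" | "wav"; // assumed set of formats
}

interface TTSProvider {
  synthesize(text: string, voiceId?: string): Promise<AudioResult>;
}

// Hypothetical stand-in provider that returns empty WAV data, useful for tests.
class SilentProvider implements TTSProvider {
  async synthesize(_text: string, _voiceId?: string): Promise<AudioResult> {
    return { buffer: new Uint8Array(0), format: "wav" };
  }
}

// The caller depends only on the interface, never on a concrete provider.
async function speak(
  provider: TTSProvider,
  text: string,
  play: (audio: AudioResult) => void,
): Promise<void> {
  const audio = await provider.synthesize(text);
  play(audio);
}
```

Returning the format alongside the bytes lets the playback layer pick a suitable player without knowing which provider produced the audio.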

Design decisions:

Provider map is "hardcoded" because it maps config strings to TypeScript classes - this is the factory pattern, config.json controls the order/selection
Server.ts no longer has ElevenLabs-specific logic, just calls provider.synthesize()
Piper implementation follows official CLI usage (--model, --speaker, --output_file, stdin for text)
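A minimal sketch of how the Piper invocation could be assembled from those flags. `PiperVoice` and `buildPiperArgs` are hypothetical names for this example; only the flags themselves come from the comment above.

```typescript
// Hypothetical voice description; speaker applies only to multi-speaker models.
interface PiperVoice {
  model: string;
  speaker?: number;
}

// Build the argument list for the piper binary. The text to speak is not an
// argument: it is written to the process's stdin after spawning.
function buildPiperArgs(voice: PiperVoice, outputFile: string): string[] {
  const args = ["--model", voice.model, "--output_file", outputFile];
  if (voice.speaker !== undefined) {
    args.push("--speaker", String(voice.speaker));
  }
  return args;
}
```

In Node, the resulting array could be passed to `child_process.spawn("piper", args)`, followed by writing the text to `child.stdin` and closing it.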

All commits squashed into single conventional commit: feat(voice-server): add pluggable TTS provider system

.claude/voice-server/tts-providers/ElevenLabs.ts

Each voice entry now has provider-specific config as siblings:
- elevenlabs: voice_name, type
- piper: model, speaker

Shared config (description, rate_multiplier, rate_wpm) remains at voice level.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Provider name now comes from config.json key via loadProvider().
- loadProvider() returns { provider, name } instead of just provider
- Removed hardcoded name property from ElevenLabs, Piper, MacOSSay
- Server uses providerName from the result

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

.claude/voice-server/voices.json

.claude/voice-server/config.json

- Replace DEFAULT_VOICE_ID with agent-based voice lookup
- Load voices.json at startup for provider-specific voice config
- Add getVoiceForAgent() to resolve agent name to voice ID based on active provider
- /notify endpoint now accepts 'agent' parameter instead of 'voice_id'
- Hooks can now send agent name and server determines correct voice based on provider

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
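The agent-to-voice lookup might be sketched like this. The `VoiceEntry` shape is inferred from the commit messages and may not match the real voices.json exactly; `resolveVoiceId` is a hypothetical stand-in, not the PR's getVoiceForAgent().

```typescript
// Inferred, partial shape of one voices.json entry (an assumption).
interface VoiceEntry {
  description?: string;
  elevenlabs?: { voice_id: string; voice_name?: string };
  piper?: { model: string; speaker?: number };
}

type VoicesConfig = Record<string, VoiceEntry>;

// Resolve an agent name to the voice identifier used by the active provider.
function resolveVoiceId(
  voices: VoicesConfig,
  agent: string,
  providerName: string,
): string | undefined {
  const entry = voices[agent];
  if (!entry) return undefined; // unknown agent: let the caller choose a default
  switch (providerName) {
    case "elevenlabs":
      return entry.elevenlabs?.voice_id;
    case "piper":
      return entry.piper?.model;
    default:
      return undefined; // provider without per-agent voices
  }
}
```

Because hooks send only the agent name, switching providers in config.json changes the voice that is resolved without any change to the hooks.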

Each agent now has a distinct ElevenLabs voice_id using default voices. Server looks up voice_id from voices.json based on agent name.

Voice mapping:
- kai: George (UK Male)
- researcher: Sarah (US Female)
- engineer: Laura (US Female)
- architect: Lily (UK Female)
- designer: Jessica (Female)
- artist: Alice (Female)
- pentester: Harry (UK Male)
- writer: Matilda (UK Female)

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Hooks now send agent parameter to voice server, which looks up the voice_id from voices.json. Removes hardcoded voice IDs from hooks.

Updated hooks:
- stop-hook.ts
- initialize-session.ts
- subagent-stop-hook.ts
- context-compression-hook.ts

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Hooks now use process.env.DA for main agent name (user-configurable)
- Falls back to "kai" only if DA is not set
- Restored original voice IDs from subagent-stop-hook to voices.json

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
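A small sketch of the hook-side behaviour described in this commit. The helper names and the `message` field of the request body are assumptions for illustration; only the `agent` parameter and the `DA` variable come from the commits. Treating an empty `DA` as unset is also an assumption.

```typescript
type Env = Record<string, string | undefined>;

// Use DA when configured; fall back to "kai" otherwise.
// Assumption: a blank DA value is treated the same as an unset one.
function mainAgentName(env: Env): string {
  const configured = env["DA"]?.trim();
  return configured ? configured : "kai";
}

// Build a JSON body for the /notify endpoint. "message" is a hypothetical
// field name; "agent" is the parameter named in the commit message.
function buildNotifyBody(env: Env, message: string): string {
  return JSON.stringify({ message, agent: mainAgentName(env) });
}
```

In a real hook the first argument would be `process.env`; passing it in explicitly keeps the functions easy to test.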

…t-stop-hook, set aito voice key

- SpTaskAwareness: Ulysses Pact task tracking via SP-MCP
  - Enforces task selection before any work begins
  - Checks overdue tasks at session start
  - Constitutional: no task = no work
- SelfImprovement: Config surgery over empty promises
  - Triggers on "I'll do better" patterns or user complaints
  - Forces permanent config changes instead of apologies
  - "Make a man out of yourself by editing your source code"
- CORE/SKILL.md: Added AITO identity and self-preservation rule
  - AI Task Orchestrator with multi-agent architecture
  - Prevents mid-session config rebuilds (temporary until NixOS)
- README-AITO-OVERLAY.md: Documents the overlay structure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

These were created in error. The correct structure is:
- DEFINE_AGENTS/SHARED_BY_ALL_AGENTS/LOAD_AT_SESSION_START/ has the files
- Our custom hook dynamically loads ALL .md files from that directory
- Files keep their proper names (ILL-MAKE-A-MAN-OUT-OF-YOU.md, etc.)

This is NOT the PAI skill pattern - these are session-start context that loads automatically, not skills to invoke.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add SubagentStart hook support
- Set DA=AITO in settings.json
- Custom hook implementations for AITO workflow
- Fix execute permissions on hook files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Use .mcp.json instead of settings.json for MCP list
- Use stat -c%Y (GNU/Linux) instead of stat -f%m (BSD/macOS)

🤖 Generated with Claude Code

danielmiessler · 2025-12-20T12:28:27Z

Hey, thank you so much for this submission. We're about to change the project significantly to solve a number of core issues. Once we do that, let's revisit if it makes sense.

zitongcharliedeng force-pushed the feat/tts-provider-abstraction branch from aa9d822 to 3e1f895 Compare December 9, 2025 17:36

zitongcharliedeng marked this pull request as draft December 9, 2025 18:15