
@zitongcharliedeng zitongcharliedeng commented Dec 9, 2025

Summary

  • Adds pluggable TTS provider system with configurable fallback order
  • Extracts ElevenLabs logic into provider class (backwards compatible)
  • Adds Piper provider for free local/offline TTS
  • Refactors MacOSSay provider using native command
  • Adds cross-platform audio playback (afplay/aplay/powershell)

Motivation

Related to #166 and complements #171. While pai-voice adds a CLI alternative to the HTTP server, this PR adds:

  1. Provider abstraction - Multiple TTS backends can coexist
  2. Fallback order - Configure which providers to try via
  3. Cross-platform audio - Works on macOS, Linux, and WSL
  4. Local TTS option - Piper provides free offline TTS (no API costs)

Configuration

{
  "providers": ["piper", "elevenlabs", "macos"]
}

The first available provider in the list is used. If no config exists, the server falls back to calling the ElevenLabs API directly (fully backwards compatible).
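A minimal sketch of the fallback order described above, not the PR's actual code. `ProviderCandidate`, `isAvailable()`, and `pickProvider()` are hypothetical names used for illustration, and the availability check is kept synchronous for brevity (a real check could be async).

```typescript
// Hypothetical shape for illustration; the PR's real interface may differ.
interface ProviderCandidate {
  name: string;
  // True when the provider can run here (binary present, API key set, ...).
  isAvailable(): boolean;
}

// Walk the configured names in order and return the first usable provider.
// With no config, only ElevenLabs is tried, matching the backwards-compatible
// behaviour described in the PR.
function pickProvider(
  configured: string[] | undefined,
  registry: Map<string, ProviderCandidate>,
): ProviderCandidate | undefined {
  const order = configured ?? ["elevenlabs"];
  for (const name of order) {
    const candidate = registry.get(name);
    if (candidate !== undefined && candidate.isAvailable()) {
      return candidate;
    }
  }
  return undefined;
}

// Stand-in registry: Piper and macOS are pretended to be unavailable.
const exampleRegistry = new Map<string, ProviderCandidate>([
  ["piper", { name: "piper", isAvailable: () => false }],
  ["elevenlabs", { name: "elevenlabs", isAvailable: () => true }],
  ["macos", { name: "macos", isAvailable: () => false }],
]);

const chosen = pickProvider(["piper", "elevenlabs", "macos"], exampleRegistry);
console.log(chosen?.name); // prints "elevenlabs"
```

Unknown names in the config are skipped rather than treated as errors here; that is a design choice of this sketch, not something the PR states.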

Providers

| Provider | Type | Platform | Requirements |
| --- | --- | --- | --- |
| elevenlabs | Cloud | All | API key in ~/.env |
| piper | Local | Linux/WSL | piper binary + voice models |
| macos | Local | macOS | None (built-in) |

Test plan

  • Verify ElevenLabs still works when no config.json present
  • Test provider fallback when first provider unavailable
  • Test Piper on Linux/WSL
  • Test MacOS provider on macOS
  • Verify /health endpoint shows correct provider

🤖 Generated with Claude Code

@zitongcharliedeng zitongcharliedeng force-pushed the feat/tts-provider-abstraction branch from aa9d822 to 3e1f895 Compare December 9, 2025 17:36
@zitongcharliedeng zitongcharliedeng marked this pull request as draft December 9, 2025 18:15

zitongcharliedeng commented Dec 9, 2025

Manual Testing Results

Windows WSL2 Testing (via PowerShell MediaPlayer)

| Test | Provider | Result |
| --- | --- | --- |
| Audio playback from WSL | ElevenLabs | ✅ Passed |
| Audio playback from WSL | Piper | ✅ Passed |
| Path conversion (wslpath) | N/A | ✅ Working |
| MediaPlayer (no GUI window) | N/A | ✅ Working |

Test Plan Progress

  • Verify ElevenLabs still works when no config.json present
  • Test provider fallback when first provider unavailable
  • Test Piper on Linux/WSL
  • Test MacOS provider on macOS (no macOS available)
  • Verify /health endpoint shows correct provider

Notes

  • PowerShell MediaPlayer plays audio from WSL without opening a GUI window
  • wslpath correctly converts WSL paths to Windows format
  • Provider fallback works correctly (Piper -> ElevenLabs when Piper unavailable)
  • \ type correctly detects \ via kernel release string
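For reference, detection via the kernel release string could look roughly like the sketch below. The `ShellEnvironment` name comes from the commit message; its member values and the `detectShellEnvironment()` function are assumptions made for illustration. The inputs correspond to what Node's `os.platform()` and `os.release()` return, passed in as parameters so the function stays easy to test.

```typescript
// Member values are assumed; only the ShellEnvironment name appears in the PR.
type ShellEnvironment = "macos" | "linux" | "wsl" | "windows" | "unknown";

// platform: value of os.platform(); release: value of os.release().
function detectShellEnvironment(platform: string, release: string): ShellEnvironment {
  if (platform === "darwin") return "macos";
  if (platform === "win32") return "windows";
  if (platform === "linux") {
    // WSL kernels report a release string containing "microsoft"
    // (capitalised on WSL1), so compare case-insensitively.
    return release.toLowerCase().includes("microsoft") ? "wsl" : "linux";
  }
  return "unknown";
}

console.log(detectShellEnvironment("linux", "5.15.153.1-microsoft-standard-WSL2")); // prints "wsl"
console.log(detectShellEnvironment("linux", "6.8.0-45-generic")); // prints "linux"
```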

🤖 Manually tested by Charlie on Windows 11 :)


zitongcharliedeng commented Dec 9, 2025

Re: per-agent voice IDs - This is already supported. The /notify endpoint accepts voice_id in the POST body, which gets passed to the provider. Each agent hook can pass a different voice_id per request. The Piper provider also supports voice mapping via voices.json config.

From Claude.

@zitongcharliedeng zitongcharliedeng force-pushed the feat/tts-provider-abstraction branch from 3733e5d to a098f46 Compare December 9, 2025 20:01
Adds a provider abstraction layer for text-to-speech with configurable
fallback order and cross-platform audio playback support.

Changes:
- Add TTSProvider interface with ElevenLabs, Piper, and MacOS providers
- Add audio-playback module with ShellEnvironment detection
- Support macOS (afplay), Linux (paplay), and Windows/WSL (PowerShell)
- Add config.json for provider priority order
- Maintain backwards compatibility with existing ElevenLabs setup

Closes danielmiessler#166

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@zitongcharliedeng zitongcharliedeng force-pushed the feat/tts-provider-abstraction branch from a098f46 to e4df291 Compare December 9, 2025 20:04
@zitongcharliedeng
Author

Addressed remaining review comments:

Naming fixes:

  • Renamed SHELL_ENV to currentShellEnvironment (idiomatic camelCase)
  • Renamed audio-playback.ts to desktop-audio-playback.ts for clarity
  • Renamed providers/ to tts-providers/ (done previously)

Already addressed:

  • providerConstructorsFromConfigNames - was already using this name
  • Provider loading via loadProvider() in tts-providers/index.ts
  • Config order: elevenlabs, macos-say, piper
  • Interface uses synthesize() returning AudioResult (buffer + format)

Design decisions:

  • Provider map is "hardcoded" because it maps config strings to TypeScript classes. This is the factory pattern; config.json controls the order and selection
  • Server.ts no longer has ElevenLabs-specific logic, just calls provider.synthesize()
  • Piper implementation follows official CLI usage (--model, --speaker, --output_file, stdin for text)
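To illustrate the Piper invocation mentioned above, here is a small sketch that only assembles the argument list. The flags (`--model`, `--speaker`, `--output_file`) are the ones named in this comment; `PiperOptions` and `buildPiperArgs()` are hypothetical names, and the file paths are placeholders.

```typescript
// Hypothetical option names; only the CLI flags come from the PR comment.
interface PiperOptions {
  model: string; // path to a voice model file
  outputFile: string; // where Piper should write the audio
  speaker?: number; // speaker index, only for multi-speaker models
}

// Build the argument array for the piper binary. The text to speak is not
// part of the arguments; it is written to the process's stdin.
function buildPiperArgs(options: PiperOptions): string[] {
  const args = ["--model", options.model, "--output_file", options.outputFile];
  if (options.speaker !== undefined) {
    args.push("--speaker", String(options.speaker));
  }
  return args;
}

console.log(buildPiperArgs({ model: "voice.onnx", outputFile: "out.wav", speaker: 2 }));
```

The resulting array would typically be handed to a child-process call such as Node's `spawn("piper", args)`, with the text then written to the child's stdin.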

All commits squashed into a single conventional commit: feat(voice-server): add pluggable TTS provider system

@zitongcharliedeng zitongcharliedeng marked this pull request as ready for review December 9, 2025 20:17
zitongcharliedeng and others added 2 commits December 9, 2025 20:35
Each voice entry now has provider-specific config as siblings:
- elevenlabs: voice_name, type
- piper: model, speaker

Shared config (description, rate_multiplier, rate_wpm) remains at voice level.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Provider name now comes from config.json key via loadProvider().
- loadProvider() returns { provider, name } instead of just provider
- Removed hardcoded name property from ElevenLabs, Piper, MacOSSay
- Server uses providerName from the result
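A rough sketch of the factory idea with the name returned alongside the instance. `TTSProvider`, `AudioResult`, `synthesize()`, and `loadProvider()` are names from this PR, but the signatures below are assumptions, `AudioResult.buffer` is typed as `Uint8Array` only to keep the example dependency-free, and the two stub classes stand in for the real providers.

```typescript
interface AudioResult {
  buffer: Uint8Array; // assumed type; the PR only says "buffer + format"
  format: string;
}

interface TTSProvider {
  synthesize(text: string): Promise<AudioResult>;
}

// Stubs standing in for the real provider classes. Note: no `name` property.
class StubElevenLabs implements TTSProvider {
  async synthesize(text: string): Promise<AudioResult> {
    return { buffer: new Uint8Array(text.length), format: "mp3" };
  }
}

class StubPiper implements TTSProvider {
  async synthesize(text: string): Promise<AudioResult> {
    return { buffer: new Uint8Array(text.length), format: "wav" };
  }
}

// Factory map: config strings on the left, TypeScript classes on the right.
const providerConstructors = new Map<string, new () => TTSProvider>([
  ["elevenlabs", StubElevenLabs],
  ["piper", StubPiper],
]);

// Return the first configured provider that has a constructor, together with
// the config key it was loaded under.
function loadProvider(
  configNames: string[],
): { provider: TTSProvider; name: string } | undefined {
  for (const name of configNames) {
    const Ctor = providerConstructors.get(name);
    if (Ctor !== undefined) {
      return { provider: new Ctor(), name };
    }
  }
  return undefined;
}

const loaded = loadProvider(["piper", "elevenlabs"]);
console.log(loaded?.name); // prints "piper"
```

Because the name travels with the result, the server can report it (for example on /health) without each class carrying its own identifier.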

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
zitongcharliedeng and others added 3 commits December 9, 2025 20:47
- Replace DEFAULT_VOICE_ID with agent-based voice lookup
- Load voices.json at startup for provider-specific voice config
- Add getVoiceForAgent() to resolve agent name to voice ID based on active provider
- /notify endpoint now accepts 'agent' parameter instead of 'voice_id'
- Hooks can now send agent name and server determines correct voice based on provider
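The lookup could be sketched as below. `getVoiceForAgent()` is named in this commit, but its signature here is an assumption, as is the exact layout of voices.json (in particular that `voice_id` sits under `elevenlabs` and that the Piper "voice ID" is the model name). The voice values are placeholders, not real IDs.

```typescript
// Assumed layout of one voices.json entry, based on the commit descriptions.
interface VoiceEntry {
  description?: string;
  elevenlabs?: { voice_id: string; voice_name?: string; type?: string };
  piper?: { model: string; speaker?: number };
}

type VoicesConfig = Record<string, VoiceEntry>;

// Resolve an agent name to a voice identifier for the active provider.
// Returns undefined when the agent or the provider section is missing.
function getVoiceForAgent(
  voices: VoicesConfig,
  agent: string,
  providerName: string,
): string | undefined {
  if (!Object.prototype.hasOwnProperty.call(voices, agent)) return undefined;
  const entry = voices[agent];
  if (providerName === "elevenlabs") return entry.elevenlabs?.voice_id;
  if (providerName === "piper") return entry.piper?.model;
  return undefined;
}

// Placeholder data for illustration only.
const exampleVoices: VoicesConfig = {
  kai: {
    description: "Main assistant",
    elevenlabs: { voice_id: "PLACEHOLDER_VOICE_ID", voice_name: "George" },
    piper: { model: "placeholder-model.onnx" },
  },
};

console.log(getVoiceForAgent(exampleVoices, "kai", "piper")); // prints "placeholder-model.onnx"
```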

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Each agent now has a distinct ElevenLabs voice_id using default voices.
Server looks up voice_id from voices.json based on agent name.

Voice mapping:
- kai: George (UK Male)
- researcher: Sarah (US Female)
- engineer: Laura (US Female)
- architect: Lily (UK Female)
- designer: Jessica (Female)
- artist: Alice (Female)
- pentester: Harry (UK Male)
- writer: Matilda (UK Female)

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Hooks now send agent parameter to voice server, which looks up
the voice_id from voices.json. Removes hardcoded voice IDs from hooks.

Updated hooks:
- stop-hook.ts
- initialize-session.ts
- subagent-stop-hook.ts
- context-compression-hook.ts

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
zitongcharliedeng and others added 6 commits December 11, 2025 19:42
- Hooks now use process.env.DA for main agent name (user-configurable)
- Falls back to "kai" only if DA is not set
- Restored original voice IDs from subagent-stop-hook to voices.json
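A hook-side sketch of the agent name resolution, under stated assumptions: `resolveMainAgentName()` and `buildNotifyBody()` are hypothetical helpers, the `message` field name is assumed, and treating an empty `DA` as unset is a choice of this sketch. The environment is passed in as a plain object; in a hook it would be `process.env`.

```typescript
type Env = Record<string, string | undefined>;

// Use DA when set, otherwise fall back to "kai".
// Assumption: an empty string counts as "not set".
function resolveMainAgentName(env: Env): string {
  const configured = env["DA"];
  return configured !== undefined && configured !== "" ? configured : "kai";
}

// Build the JSON body a hook might POST to /notify. Only the `agent`
// parameter is confirmed by the PR; `message` is an assumed field name.
function buildNotifyBody(message: string, env: Env): string {
  return JSON.stringify({ message, agent: resolveMainAgentName(env) });
}

console.log(buildNotifyBody("Task complete", { DA: "AITO" }));
console.log(buildNotifyBody("Task complete", {}));
```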

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- SpTaskAwareness: Ulysses Pact task tracking via SP-MCP
  - Enforces task selection before any work begins
  - Checks overdue tasks at session start
  - Constitutional: no task = no work

- SelfImprovement: Config surgery over empty promises
  - Triggers on "I'll do better" patterns or user complaints
  - Forces permanent config changes instead of apologies
  - "Make a man out of yourself by editing your source code"

- CORE/SKILL.md: Added AITO identity and self-preservation rule
  - AI Task Orchestrator with multi-agent architecture
  - Prevents mid-session config rebuilds (temporary until NixOS)

- README-AITO-OVERLAY.md: Documents the overlay structure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
These were created in error. The correct structure is:
- DEFINE_AGENTS/SHARED_BY_ALL_AGENTS/LOAD_AT_SESSION_START/ has the files
- Our custom hook dynamically loads ALL .md files from that directory
- Files keep their proper names (ILL-MAKE-A-MAN-OUT-OF-YOU.md, etc.)

This is NOT the PAI skill pattern - these are session-start context
that loads automatically, not skills to invoke.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add SubagentStart hook support
- Set DA=AITO in settings.json
- Custom hook implementations for AITO workflow
- Fix execute permissions on hook files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use .mcp.json instead of settings.json for MCP list
- Use stat -c%Y (GNU/Linux) instead of stat -f%m (BSD/macOS)

🤖 Generated with Claude Code
@danielmiessler
Owner

Hey, thank you so much for this submission. We're about to change the project significantly to solve a number of core issues. Once we do that, let's revisit if it makes sense.
