Voice-to-Command for AI coding CLI — Project Overview
VoiceCC is a voice-to-command system that lets you control AI coding CLI (CLI) by speaking. You speak, the system transcribes via Whisper (local GPU), interprets the voice command, and injects it into the terminal session where AI coding CLI runs — both on WSL (tmux) and Windows (PowerShell/Terminal).
It supports direct text commands, special keys (escape, enter, tab), and intelligent interaction with AI coding CLI prompts (option selection, confirmations, custom input, multiselect) — completely hands-free.
graph TB
User["User\nVoice Input"] -->|"hotkey"| VE
subgraph Windows
VE["VoiceEmitter\nAudio Capture + Whisper STT\n+ Hotkey + System Tray"]
WW["WindowsWorker\nTerminal Detection\n+ PostMessage Injection"]
BR["Browser\nDashboard (React)"]
end
subgraph WSL / Linux
HUB["SignalR Hub :5050\nMessage Broker + Filter Pipeline\n+ Choice Detection + Static Dashboard"]
TW["TmuxWorker\nSession Monitor\n+ Tmux Injector"]
TMUX["tmux session 'cc'\nAI coding CLI"]
end
VE -->|"SignalR\nvoice command"| HUB
WW -->|"SignalR\nsession state + results"| HUB
HUB -->|"route command\nto worker"| TW
HUB -->|"route command\nto worker"| WW
TW -->|"send-keys"| TMUX
WW -->|"PostMessage\nWM_CHAR"| Terminal["Windows Terminal"]
BR -->|"SignalR"| HUB
HUB -->|"events + status"| BR
How long would a super senior .NET/React developer take to build this from scratch, writing all code by hand?
| # | Phase | Days | Notes |
|---|---|---|---|
| 0 | Environment + Research | 5 | WSL2 + systemd setup, CUDA + cuDNN (notoriously painful), .NET/Node.js SDK, research Whisper.net, NAudio, SharpHook, PostMessage/Win32, UIAutomation, SignalR polymorphism. Each API has undocumented quirks |
| 1 | Architecture + Foundation | 6 | BBC design, multi-solution structure (WSL/Windows), shared contracts (lots of DTOs/interfaces/events), SignalR Hub with typed client, session-owner routing, Serilog, first E2E wiring test |
| 2 | Audio Capture + Whisper STT | 9 | NAudio WASAPI (device enum, buffer mgmt, WAV format), SharpHook global hotkey (3 slots), Whisper.net CUDA loading (format issues, model loading failures), beam search tuning, confidence scores (broken in stable, need preview build), language prompts for 8 languages, hallucination filtering |
| 3 | WSL Pipeline | 5 | tmux session monitor (state machine: idle/busy/waiting), tmux injector (send-keys escaping is HARD — quotes, special chars, ArgumentList.Add discovery), TmuxWorker orchestrator with retry, systemd service templates + install scripts |
| 4 | Windows Pipeline | 10 | PostMessage POC (discovering lParam scan codes, WM_CHAR vs WM_KEYDOWN — deep Win32 rabbit hole), UIAutomation terminal finder (TextPattern, detecting AI coding CLI state from status bar), injection without focus steal, long messages 200+ chars, dual SignalR architecture, Velopack packaging |
| 5 | Dashboard (React) | 5 | React 18 + Vite + TypeScript scaffold, SignalR connection hook with reconnect, Zustand stores, session list + command log + debug console, VU meter (recharts real-time), static build served from Hub |
| 6 | Intelligent Message | 10 | Filter pipeline (chain of responsibility), escape command filter, manual AI coding CLI keybinding research (test every prompt type, screenshot, document), screen parsing for 5 prompt types, voice command mapping in 8 languages, cross-platform screen adapter (tmux '❯' vs Windows '>'), ChoiceExecutorService orchestration |
| 7 | Testing + Bug Fixing | 8 | 245 unit tests written by hand with TDD, mocking SignalR/tmux/NAudio/Whisper, integration tests. Bug fixing: WebSocket 1006, CORS, quote escaping, timing issues, hub routing broadcast bug, mutex issues, Whisper hallucination, language prompt priming |
| 8 | Deployment + Docs | 4 | WSL install/uninstall scripts (systemd, .sudo, port check), Windows scripts (no-admin, Registry HKCU\Run — discovered Task Scheduler needs admin), DEPLOY.md workflow, mode switching (uninstall→port-check→install), test on clean machine |
| TOTAL | 62 | ~12.5 weeks, ~3 months |
| Scenario | Multiplier | Total |
|---|---|---|
| Optimistic (CUDA works, no rabbit holes, everything clicks) | 1x | ~62 working days (~3 months) |
| Realistic (CUDA hell, Win32 undocumented, Whisper quirks) | 1.4x | ~87 days (~4.5 months) |
| Regular senior (less Win32/audio experience) | 1.6x | ~100 days (~5 months) |
| Junior/Mid (learning curve on everything) | 2.5x | ~155 days (~8 months) |
The Windows Pipeline (Phase 4) dominates because:
The Intelligent Message (Phase 6) is the second time sink:
The Whisper STT (Phase 2) hides complexity:
The project was built with AI coding assistant writing ~90% of the code, directed by a senior developer who designed architecture and made all decisions. Starting from absolute zero: no environment, no tools installed, just the idea. The 6 days include everything: environment setup, research, architecture, development, testing, deployment.
| Day | Commits | What Was Built | Hours | Phase (manual est.) |
|---|---|---|---|---|
| Day 1 (Mon) | 8 | Environment setup, architecture, Docker Hub, VoiceEmitter base, Whisper STT integration, first working prototype | ~12h | Phases 0+1+2 |
| Day 2 (Tue) | 20 | VoiceEmitter complete (settings, tray, settings UI), Velopack installer, Docker→systemd migration, multi-activation feature (3 hotkey slots) | ~11h | Phases 1+2+3+8 |
| Day 3 (Wed) | 46 | Settings bugfix, entire PowerShellWorker from zero (UIAutomation + PostMessage + WindowsWorker + WorkerBase refactor + Hub routing fix), integration into VoiceEmitter, Whisper hallucination fix, message filter pipeline with escape commands | ~14h | Phases 3+4+7+8 |
| Day 4 (Thu) | 29 | Versioned Hub, deploy options (WSL + Windows, no-admin), Feature04 POC casistiche 4-6, Whisper language prompt fix | ~12h | Phases 6+7+8 |
| Day 5 (Fri) | 14 | Complete choice detection: keybinding research, screen parsing, voice mapping, ChoiceExecutorService, Windows support with IScreenAdapter | ~11h | Phase 6 complete |
| Day 6 (Sat) | 3 | Whisper accuracy tuning: upgrade to preview build, beam search, confidence scores, language prompt vocabulary | ~2h | Phase 2 polish |
| TOTAL | 120 | ~62h | ~6 working days |
| Phase | Estimated (manual) | Actual (with AI) | Speedup |
|---|---|---|---|
| Environment + Research | 5 days | ~0.5 days | 10x |
| Architecture + Foundation | 6 days | ~0.5 days | 12x |
| Audio + Whisper STT | 9 days | ~1.5 days | 6x |
| WSL Pipeline | 5 days | ~0.5 days | 10x |
| Windows Pipeline | 10 days | ~1 day | 10x |
| Dashboard (React) | 5 days | ~0.3 days | 17x |
| Intelligent Message | 10 days | ~1.5 days | 7x |
| Testing + Bug Fixing | 8 days | distributed | — |
| Deployment + Docs | 4 days | ~0.5 days | 8x |
| TOTAL | 62 days | ~6 days | 10x |