VoiceCC

Voice-to-Command for AI coding CLI — Project Overview

What It Is

VoiceCC is a voice-to-command system that lets you control an AI coding CLI by speaking. You speak, the system transcribes your speech via Whisper (local GPU), interprets the voice command, and injects it into the terminal session where the AI coding CLI runs, on both WSL (tmux) and Windows (PowerShell/Windows Terminal).

It supports direct text commands, special keys (escape, enter, tab), and intelligent interaction with AI coding CLI prompts (option selection, confirmations, custom input, multiselect) — completely hands-free.

8 Black Boxes · 2 Platforms (WSL + Windows) · 8 Languages Supported · 245 Tests

Architecture

```mermaid
graph TB
    User["User\nVoice Input"] -->|"hotkey"| VE

    subgraph Windows
        VE["VoiceEmitter\nAudio Capture + Whisper STT\n+ Hotkey + System Tray"]
        WW["WindowsWorker\nTerminal Detection\n+ PostMessage Injection"]
        BR["Browser\nDashboard (React)"]
    end

    subgraph WSL["WSL / Linux"]
        HUB["SignalR Hub :5050\nMessage Broker + Filter Pipeline\n+ Choice Detection + Static Dashboard"]
        TW["TmuxWorker\nSession Monitor\n+ Tmux Injector"]
        TMUX["tmux session 'cc'\nAI coding CLI"]
    end

    VE -->|"SignalR\nvoice command"| HUB
    WW -->|"SignalR\nsession state + results"| HUB
    HUB -->|"route command\nto worker"| TW
    HUB -->|"route command\nto worker"| WW
    TW -->|"send-keys"| TMUX
    WW -->|"PostMessage\nWM_CHAR"| Terminal["Windows Terminal"]
    BR -->|"SignalR"| HUB
    HUB -->|"events + status"| BR
```
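
As a rough illustration of the hub's broker role, a minimal SignalR hub might look like the sketch below. This is not the project's actual hub: the `VoiceCommand` record, the group names, and the method names are assumptions for illustration only.

```csharp
using Microsoft.AspNetCore.SignalR;

// Hypothetical DTO; field names are illustrative, not the project's contract.
public record VoiceCommand(string SessionId, string Text, string TargetPlatform);

// Minimal sketch of the broker role: VoiceEmitter calls SendCommand, and the
// hub routes it to the worker group for the target platform
// ("wsl" -> TmuxWorker, "windows" -> WindowsWorker).
public class VoiceHub : Hub
{
    // Workers join a platform group when they connect.
    public Task RegisterWorker(string platform) =>
        Groups.AddToGroupAsync(Context.ConnectionId, $"workers-{platform}");

    // VoiceEmitter pushes a transcribed command; the hub forwards it.
    // The real hub also runs the filter pipeline and choice detection here.
    public Task SendCommand(VoiceCommand command) =>
        Clients.Group($"workers-{command.TargetPlatform}")
               .SendAsync("ExecuteCommand", command);
}
```

Group-based routing is what lets the same hub serve both the TmuxWorker and the WindowsWorker without knowing either one concretely.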

Development Time Estimate

How long would it take a super-senior .NET/React developer to build this from scratch, writing all the code by hand?

Breakdown

| # | Phase | Days | Notes |
|---|-------|------|-------|
| 0 | Environment + Research | 5 | WSL2 + systemd setup, CUDA + cuDNN (notoriously painful), .NET/Node.js SDKs, research on Whisper.net, NAudio, SharpHook, PostMessage/Win32, UIAutomation, SignalR polymorphism. Each API has undocumented quirks |
| 1 | Architecture + Foundation | 6 | BBC design, multi-solution structure (WSL/Windows), shared contracts (lots of DTOs/interfaces/events), SignalR Hub with typed clients, session-owner routing, Serilog, first E2E wiring test |
| 2 | Audio Capture + Whisper STT | 9 | NAudio WASAPI (device enumeration, buffer management, WAV format), SharpHook global hotkeys (3 slots), Whisper.net CUDA loading (format issues, model loading failures), beam search tuning, confidence scores (broken in stable, need the preview build), language prompts for 8 languages, hallucination filtering |
| 3 | WSL Pipeline | 5 | tmux session monitor (state machine: idle/busy/waiting), tmux injector (send-keys escaping is HARD: quotes, special chars, the ArgumentList.Add discovery; see the sketch after this table), TmuxWorker orchestrator with retry, systemd service templates + install scripts |
| 4 | Windows Pipeline | 10 | PostMessage POC (discovering lParam scan codes, WM_CHAR vs WM_KEYDOWN: a deep Win32 rabbit hole), UIAutomation terminal finder (TextPattern, detecting AI coding CLI state from the status bar), injection without focus steal, long messages (200+ chars), dual SignalR architecture, Velopack packaging |
| 5 | Dashboard (React) | 5 | React 18 + Vite + TypeScript scaffold, SignalR connection hook with reconnect, Zustand stores, session list + command log + debug console, VU meter (recharts, real-time), static build served from the Hub |
| 6 | Intelligent Message | 10 | Filter pipeline (chain of responsibility), escape command filter, manual AI coding CLI keybinding research (test every prompt type, screenshot, document), screen parsing for 5 prompt types, voice command mapping in 8 languages, cross-platform screen adapter (tmux '❯' vs Windows '>'), ChoiceExecutorService orchestration |
| 7 | Testing + Bug Fixing | 8 | 245 unit tests written by hand with TDD, mocking SignalR/tmux/NAudio/Whisper, integration tests. Bug fixing: WebSocket 1006, CORS, quote escaping, timing issues, hub routing broadcast bug, mutex issues, Whisper hallucination, language prompt priming |
| 8 | Deployment + Docs | 4 | WSL install/uninstall scripts (systemd, sudo, port check), Windows scripts (no-admin, Registry HKCU\Run; discovered Task Scheduler needs admin), DEPLOY.md workflow, mode switching (uninstall → port check → install), test on a clean machine |
| | **TOTAL** | **62** | ~12.5 weeks (~3 months) |
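
The Phase 3 "ArgumentList.Add discovery" deserves a note: building the tmux command as one string forces you to hand-escape quotes and special characters, while ProcessStartInfo.ArgumentList lets the runtime quote each argument for you. A minimal sketch under assumed names (the session name and helper names are illustrative):

```csharp
using System.Diagnostics;

static class TmuxInjector
{
    // Inject literal text into a tmux session without manual escaping.
    // Each ArgumentList entry is quoted by the runtime, so user text
    // containing quotes, $, or backticks passes through unchanged.
    public static void SendText(string session, string text)
    {
        var psi = new ProcessStartInfo("tmux");
        psi.ArgumentList.Add("send-keys");
        psi.ArgumentList.Add("-t");
        psi.ArgumentList.Add(session);   // e.g. "cc"
        psi.ArgumentList.Add("-l");      // -l = literal, no key-name expansion
        psi.ArgumentList.Add(text);
        Process.Start(psi)?.WaitForExit();
    }

    // Special keys (Enter, Escape, Tab) are sent as tmux key names instead.
    public static void SendKey(string session, string keyName)
    {
        var psi = new ProcessStartInfo("tmux");
        psi.ArgumentList.Add("send-keys");
        psi.ArgumentList.Add("-t");
        psi.ArgumentList.Add(session);
        psi.ArgumentList.Add(keyName);   // e.g. "Enter", "Escape"
        Process.Start(psi)?.WaitForExit();
    }
}
```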

Scenario Comparison

| Scenario | Multiplier | Total |
|----------|------------|-------|
| Optimistic (CUDA works, no rabbit holes, everything clicks) | 1x | ~62 working days (~3 months) |
| Realistic (CUDA hell, undocumented Win32, Whisper quirks) | 1.4x | ~87 days (~4.5 months) |
| Regular senior (less Win32/audio experience) | 1.6x | ~100 days (~5 months) |
| Junior/Mid (learning curve on everything) | 2.5x | ~155 days (~8 months) |

Where The Time Really Goes

The Windows Pipeline (Phase 4) dominates because:

- PostMessage injection is a deep Win32 rabbit hole: lParam scan codes, WM_CHAR vs WM_KEYDOWN, and injecting without stealing focus (a minimal sketch follows this list)
- UIAutomation terminal detection (TextPattern, reading AI coding CLI state from the status bar) is sparsely documented
- Long messages (200+ chars), the dual SignalR architecture, and Velopack packaging each add their own friction

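A minimal sketch of the WM_CHAR approach (the window handle is assumed to come from the UIAutomation finder, which is not shown; the constants are the documented Win32 values):

```csharp
using System.Runtime.InteropServices;

static class CharInjector
{
    // WM_CHAR delivers a UTF-16 code unit directly to a window's message
    // queue, which is why injection works without stealing focus.
    const uint WM_CHAR = 0x0102;

    [DllImport("user32.dll")]
    static extern bool PostMessage(nint hWnd, uint msg, nint wParam, nint lParam);

    // hWnd: the terminal window, assumed found via UIAutomation (not shown).
    public static void SendText(nint hWnd, string text)
    {
        foreach (char c in text)
        {
            // wParam = character code; lParam packs repeat count and scan
            // code bits (the "lParam scan codes" rabbit hole above).
            PostMessage(hWnd, WM_CHAR, c, 1);
        }
    }
}
```
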

The Intelligent Message pipeline (Phase 6) is the second time sink:

- The filter pipeline (chain of responsibility, sketched below) must intercept escape commands before anything reaches the injector
- Every AI coding CLI prompt type has to be researched by hand: test it, screenshot it, document its keybindings
- Screen parsing must cover 5 prompt types, with voice command mapping in 8 languages, and a cross-platform screen adapter has to reconcile tmux's '❯' selection marker with Windows' '>'

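A minimal sketch of that chain of responsibility, under assumed names (IMessageFilter and EscapeCommandFilter are illustrative, not the project's actual contracts):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative contract: each filter may rewrite the message or
// swallow it (return null) so it never reaches the injector.
public interface IMessageFilter
{
    string? Apply(string message);
}

// Example: the spoken word "escape" becomes an ESC key event elsewhere,
// so the literal text is swallowed here.
public class EscapeCommandFilter : IMessageFilter
{
    public string? Apply(string message) =>
        message.Trim().Equals("escape", StringComparison.OrdinalIgnoreCase)
            ? null
            : message;
}

// Runs the chain: each filter sees the previous filter's output, and the
// pipeline stops as soon as a filter swallows the message.
public class FilterPipeline
{
    private readonly List<IMessageFilter> _filters;

    public FilterPipeline(IEnumerable<IMessageFilter> filters) =>
        _filters = filters.ToList();

    public string? Run(string message)
    {
        string? current = message;
        foreach (var filter in _filters)
        {
            current = filter.Apply(current!);
            if (current is null) break;
        }
        return current;
    }
}
```
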

The Whisper STT work (Phase 2) hides complexity:

- Whisper.net CUDA model loading fails in non-obvious ways (format issues, model loading failures)
- Confidence scores are broken in the stable package and require the preview build (see the sketch below)
- Beam search tuning, language prompts for 8 languages, and hallucination filtering all take iteration with real audio

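A sketch of the transcription path as Whisper.net exposes it. The model file, prompt text, and 0.5 threshold are illustrative values, and WithProbabilities() must be called at build time (per observation 3 under Key Observations):

```csharp
using System;
using System.IO;
using Whisper.net;

// Sketch: load a ggml model and transcribe one utterance.
using var factory = WhisperFactory.FromPath("ggml-medium.bin");

await using var processor = factory.CreateBuilder()
    .WithLanguage("en")                               // one prompt per language, 8 total
    .WithPrompt("escape, enter, tab, select option")  // primes command vocabulary
    .WithProbabilities()                              // must be set at build time
    .Build();

using var wavStream = File.OpenRead("command.wav");   // 16 kHz mono WAV

await foreach (var segment in processor.ProcessAsync(wavStream))
{
    // Low-probability segments are dropped as likely hallucinations.
    if (segment.Probability > 0.5f)
        Console.WriteLine(segment.Text);
}
```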

Actual Development Timeline (with AI coding CLI)

The project was built with an AI coding assistant writing ~90% of the code, directed by a senior developer who designed the architecture and made all decisions. It started from absolute zero: no environment, no tools installed, just the idea. The 6 days include everything: environment setup, research, architecture, development, testing, deployment.

120 Commits · 245 Tests · 9 Features + 4 Bugfixes + 3 POCs · 6 Calendar Days

Day-by-Day Breakdown

| Day | Commits | What Was Built | Hours | Phase (manual est.) |
|-----|---------|----------------|-------|---------------------|
| Day 1 (Mon) | 8 | Environment setup, architecture, Hub in Docker, VoiceEmitter base, Whisper STT integration, first working prototype | ~12h | Phases 0+1+2 |
| Day 2 (Tue) | 20 | VoiceEmitter complete (settings, tray, settings UI), Velopack installer, Docker → systemd migration, multi-activation feature (3 hotkey slots) | ~11h | Phases 1+2+3+8 |
| Day 3 (Wed) | 46 | Settings bugfix, entire PowerShellWorker from zero (UIAutomation + PostMessage + WindowsWorker + WorkerBase refactor + Hub routing fix), integration into VoiceEmitter, Whisper hallucination fix, message filter pipeline with escape commands | ~14h | Phases 3+4+7+8 |
| Day 4 (Thu) | 29 | Versioned Hub, deploy options (WSL + Windows, no-admin), Feature04 POC cases 4-6, Whisper language prompt fix | ~12h | Phases 6+7+8 |
| Day 5 (Fri) | 14 | Complete choice detection: keybinding research, screen parsing, voice mapping, ChoiceExecutorService, Windows support via IScreenAdapter | ~11h | Phase 6 complete |
| Day 6 (Sat) | 3 | Whisper accuracy tuning: upgrade to the preview build, beam search, confidence scores, language prompt vocabulary | ~2h | Phase 2 polish |
| **TOTAL** | **120** | | **~62h** | ~6 working days |

Estimate vs Reality

| Phase | Estimated (manual) | Actual (with AI) | Speedup |
|-------|--------------------|------------------|---------|
| Environment + Research | 5 days | ~0.5 days | 10x |
| Architecture + Foundation | 6 days | ~0.5 days | 12x |
| Audio + Whisper STT | 9 days | ~1.5 days | 6x |
| WSL Pipeline | 5 days | ~0.5 days | 10x |
| Windows Pipeline | 10 days | ~1 day | 10x |
| Dashboard (React) | 5 days | ~0.3 days | 17x |
| Intelligent Message | 10 days | ~1.5 days | 7x |
| Testing + Bug Fixing | 8 days | distributed | |
| Deployment + Docs | 4 days | ~0.5 days | 8x |
| **TOTAL** | **62 days** | **~6 days** | **~10x** |

Key Observations

  1. Architecture + contracts + boilerplate = 10–17x speedup. AI generates DTOs, interfaces, events, DI registrations, SignalR hub scaffolding, React components almost instantly. What takes a human days of typing, AI produces in minutes.
  2. Day 3 was superhuman: 46 commits, 5 features, 2 bugfixes in 14 hours. The entire Windows pipeline (UIAutomation + PostMessage + WindowsWorker + Hub routing fix) plus the message filter pipeline were built in a single day. This volume of correct, tested code is physically impossible without AI — it's not typing speed, it's the sheer breadth of simultaneous domain knowledge (Win32, SignalR, tmux, TDD).
  3. Whisper.net + CUDA still required human debugging. Discovering that beam search needs a preview NuGet, that WithProbabilities() must be set during creation, that language prompts need actual vocabulary — these are the kind of undocumented gotchas where AI can research but the human must verify through real execution.
  4. POC → production in hours, not days. PostMessage injection POC was validated and turned into a full WindowsWorker + integrated into VoiceEmitter in a single afternoon. A human would need the POC day plus 3–4 more days for the production implementation.
  5. Tests were "free" — written alongside code, not after. With TDD driven by AI, 245 tests were produced as a natural byproduct of development. A human writing 245 tests by hand would need 5–6 dedicated days. With AI, testing cost was essentially zero — absorbed into every feature.
  6. Total human effort was ~62 hours over 6 days, starting from absolute zero. The human work was: environment setup, architecture decisions, prompting AI, reviewing output, manual testing with real microphone + tmux + Windows Terminal, and debugging Whisper/CUDA behavior. The estimated manual equivalent is 62 working days (~500 hours). Effective acceleration: ~10x.