VoiceCC

Voice-to-Command for AI coding CLI — Project Overview

What It Is

VoiceCC is a voice-to-command system that lets you control an AI coding CLI by speaking. You speak, the system transcribes your speech via Whisper (local GPU), interprets the voice command, and injects it into the terminal session where the AI coding CLI runs, on both WSL (tmux) and Windows (PowerShell/Windows Terminal).

It supports direct text commands, special keys (escape, enter, tab), and intelligent interaction with AI coding CLI prompts (option selection, confirmations, custom input, multiselect) — completely hands-free.

8 Black Boxes · 2 Platforms (WSL + Windows) · 8 Languages Supported · 245 Tests

Architecture

```mermaid
graph TB
    User["User\nVoice Input"] -->|"hotkey"| VE

    subgraph Windows
        VE["VoiceEmitter\nAudio Capture + Whisper STT\n+ Hotkey + System Tray"]
        WW["WindowsWorker\nTerminal Detection\n+ PostMessage Injection"]
        BR["Browser\nDashboard (React)"]
    end

    subgraph WSL["WSL / Linux"]
        HUB["SignalR Hub :5050\nMessage Broker + Filter Pipeline\n+ Choice Detection + Static Dashboard"]
        TW["TmuxWorker\nSession Monitor\n+ Tmux Injector"]
        TMUX["tmux session 'cc'\nAI coding CLI"]
    end

    VE -->|"SignalR\nvoice command"| HUB
    WW -->|"SignalR\nsession state + results"| HUB
    HUB -->|"route command\nto worker"| TW
    HUB -->|"route command\nto worker"| WW
    TW -->|"send-keys"| TMUX
    WW -->|"PostMessage\nWM_CHAR"| Terminal["Windows Terminal"]
    BR -->|"SignalR"| HUB
    HUB -->|"events + status"| BR
```
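
As a rough illustration of the hub's broker role, a minimal SignalR hub might look like the sketch below. This is not the project's actual hub: the `VoiceCommand` record, the group names, and the method names are assumptions for illustration only.

```csharp
using Microsoft.AspNetCore.SignalR;

// Hypothetical DTO; field names are illustrative, not the project's contract.
public record VoiceCommand(string SessionId, string Text, string TargetPlatform);

// Minimal sketch of the broker role: VoiceEmitter calls SendCommand, and the
// hub routes it to the worker group for the target platform
// ("wsl" -> TmuxWorker, "windows" -> WindowsWorker).
public class VoiceHub : Hub
{
    // Workers join a platform group when they connect.
    public Task RegisterWorker(string platform) =>
        Groups.AddToGroupAsync(Context.ConnectionId, $"workers-{platform}");

    // VoiceEmitter pushes a transcribed command; the hub forwards it.
    // The real hub also runs the filter pipeline and choice detection here.
    public Task SendCommand(VoiceCommand command) =>
        Clients.Group($"workers-{command.TargetPlatform}")
               .SendAsync("ExecuteCommand", command);
}
```

Group-based routing is what lets the same hub serve both the TmuxWorker and the WindowsWorker without knowing either one concretely.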

Development Time Estimate

How long would it take a super-senior .NET/React developer to build this from scratch, writing all the code by hand?

Breakdown

| # | Phase | Days | Notes |
|---|-------|------|-------|
| 0 | Environment + Research | 5 | WSL2 + systemd setup, CUDA + cuDNN (notoriously painful), .NET/Node.js SDKs, research on Whisper.net, NAudio, SharpHook, PostMessage/Win32, UIAutomation, SignalR polymorphism. Each API has undocumented quirks |
| 1 | Architecture + Foundation | 6 | BBC design, multi-solution structure (WSL/Windows), shared contracts (lots of DTOs/interfaces/events), SignalR Hub with typed clients, session-owner routing, Serilog, first E2E wiring test |
| 2 | Audio Capture + Whisper STT | 9 | NAudio WASAPI (device enumeration, buffer management, WAV format), SharpHook global hotkeys (3 slots), Whisper.net CUDA loading (format issues, model loading failures), beam search tuning, confidence scores (broken in stable, need the preview build), language prompts for 8 languages, hallucination filtering |
| 3 | WSL Pipeline | 5 | tmux session monitor (state machine: idle/busy/waiting), tmux injector (send-keys escaping is HARD: quotes, special chars, the ArgumentList.Add discovery; see the sketch after this table), TmuxWorker orchestrator with retry, systemd service templates + install scripts |
| 4 | Windows Pipeline | 10 | PostMessage POC (discovering lParam scan codes, WM_CHAR vs WM_KEYDOWN: a deep Win32 rabbit hole), UIAutomation terminal finder (TextPattern, detecting AI coding CLI state from the status bar), injection without focus steal, long messages (200+ chars), dual SignalR architecture, Velopack packaging |
| 5 | Dashboard (React) | 5 | React 18 + Vite + TypeScript scaffold, SignalR connection hook with reconnect, Zustand stores, session list + command log + debug console, VU meter (recharts, real-time), static build served from the Hub |
| 6 | Intelligent Message | 10 | Filter pipeline (chain of responsibility), escape command filter, manual AI coding CLI keybinding research (test every prompt type, screenshot, document), screen parsing for 5 prompt types, voice command mapping in 8 languages, cross-platform screen adapter (tmux '❯' vs Windows '>'), ChoiceExecutorService orchestration |
| 7 | Testing + Bug Fixing | 8 | 245 unit tests written by hand with TDD, mocking SignalR/tmux/NAudio/Whisper, integration tests. Bug fixing: WebSocket 1006, CORS, quote escaping, timing issues, hub routing broadcast bug, mutex issues, Whisper hallucination, language prompt priming |
| 8 | Deployment + Docs | 4 | WSL install/uninstall scripts (systemd, sudo, port check), Windows scripts (no-admin, Registry HKCU\Run; discovered Task Scheduler needs admin), DEPLOY.md workflow, mode switching (uninstall → port check → install), test on a clean machine |
| | **TOTAL** | **62** | ~12.5 weeks (~3 months) |
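
The Phase 3 "ArgumentList.Add discovery" deserves a note: building the tmux command as one string forces you to hand-escape quotes and special characters, while ProcessStartInfo.ArgumentList lets the runtime quote each argument for you. A minimal sketch under assumed names (the session name and helper names are illustrative):

```csharp
using System.Diagnostics;

static class TmuxInjector
{
    // Inject literal text into a tmux session without manual escaping.
    // Each ArgumentList entry is quoted by the runtime, so user text
    // containing quotes, $, or backticks passes through unchanged.
    public static void SendText(string session, string text)
    {
        var psi = new ProcessStartInfo("tmux");
        psi.ArgumentList.Add("send-keys");
        psi.ArgumentList.Add("-t");
        psi.ArgumentList.Add(session);   // e.g. "cc"
        psi.ArgumentList.Add("-l");      // -l = literal, no key-name expansion
        psi.ArgumentList.Add(text);
        Process.Start(psi)?.WaitForExit();
    }

    // Special keys (Enter, Escape, Tab) are sent as tmux key names instead.
    public static void SendKey(string session, string keyName)
    {
        var psi = new ProcessStartInfo("tmux");
        psi.ArgumentList.Add("send-keys");
        psi.ArgumentList.Add("-t");
        psi.ArgumentList.Add(session);
        psi.ArgumentList.Add(keyName);   // e.g. "Enter", "Escape"
        Process.Start(psi)?.WaitForExit();
    }
}
```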

Scenario Comparison

| Scenario | Multiplier | Total |
|----------|------------|-------|
| Optimistic (CUDA works, no rabbit holes, everything clicks) | 1x | ~62 working days (~3 months) |
| Realistic (CUDA hell, undocumented Win32, Whisper quirks) | 1.4x | ~87 days (~4.5 months) |
| Regular senior (less Win32/audio experience) | 1.6x | ~100 days (~5 months) |
| Junior/Mid (learning curve on everything) | 2.5x | ~155 days (~8 months) |

Where The Time Really Goes

The Windows Pipeline (Phase 4) dominates because:

- PostMessage injection is a deep Win32 rabbit hole: lParam scan codes, WM_CHAR vs WM_KEYDOWN, and injecting without stealing focus (a minimal sketch follows this list)
- UIAutomation terminal detection (TextPattern, reading AI coding CLI state from the status bar) is sparsely documented
- Long messages (200+ chars), the dual SignalR architecture, and Velopack packaging each add their own friction

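A minimal sketch of the WM_CHAR approach (the window handle is assumed to come from the UIAutomation finder, which is not shown; the constants are the documented Win32 values):

```csharp
using System.Runtime.InteropServices;

static class CharInjector
{
    // WM_CHAR delivers a UTF-16 code unit directly to a window's message
    // queue, which is why injection works without stealing focus.
    const uint WM_CHAR = 0x0102;

    [DllImport("user32.dll")]
    static extern bool PostMessage(nint hWnd, uint msg, nint wParam, nint lParam);

    // hWnd: the terminal window, assumed found via UIAutomation (not shown).
    public static void SendText(nint hWnd, string text)
    {
        foreach (char c in text)
        {
            // wParam = character code; lParam packs repeat count and scan
            // code bits (the "lParam scan codes" rabbit hole above).
            PostMessage(hWnd, WM_CHAR, c, 1);
        }
    }
}
```
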

The Intelligent Message pipeline (Phase 6) is the second time sink:

- The filter pipeline (chain of responsibility, sketched below) must intercept escape commands before anything reaches the injector
- Every AI coding CLI prompt type has to be researched by hand: test it, screenshot it, document its keybindings
- Screen parsing must cover 5 prompt types, with voice command mapping in 8 languages, and a cross-platform screen adapter has to reconcile tmux's '❯' selection marker with Windows' '>'

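A minimal sketch of that chain of responsibility, under assumed names (IMessageFilter and EscapeCommandFilter are illustrative, not the project's actual contracts):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative contract: each filter may rewrite the message or
// swallow it (return null) so it never reaches the injector.
public interface IMessageFilter
{
    string? Apply(string message);
}

// Example: the spoken word "escape" becomes an ESC key event elsewhere,
// so the literal text is swallowed here.
public class EscapeCommandFilter : IMessageFilter
{
    public string? Apply(string message) =>
        message.Trim().Equals("escape", StringComparison.OrdinalIgnoreCase)
            ? null
            : message;
}

// Runs the chain: each filter sees the previous filter's output, and the
// pipeline stops as soon as a filter swallows the message.
public class FilterPipeline
{
    private readonly List<IMessageFilter> _filters;

    public FilterPipeline(IEnumerable<IMessageFilter> filters) =>
        _filters = filters.ToList();

    public string? Run(string message)
    {
        string? current = message;
        foreach (var filter in _filters)
        {
            current = filter.Apply(current!);
            if (current is null) break;
        }
        return current;
    }
}
```
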

The Whisper STT work (Phase 2) hides complexity:

- Whisper.net CUDA model loading fails in non-obvious ways (format issues, model loading failures)
- Confidence scores are broken in the stable package and require the preview build (see the sketch below)
- Beam search tuning, language prompts for 8 languages, and hallucination filtering all take iteration with real audio

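A sketch of the transcription path as Whisper.net exposes it. The model file, prompt text, and 0.5 threshold are illustrative values, and WithProbabilities() must be called at build time (per observation 3 under Key Observations):

```csharp
using System;
using System.IO;
using Whisper.net;

// Sketch: load a ggml model and transcribe one utterance.
using var factory = WhisperFactory.FromPath("ggml-medium.bin");

await using var processor = factory.CreateBuilder()
    .WithLanguage("en")                               // one prompt per language, 8 total
    .WithPrompt("escape, enter, tab, select option")  // primes command vocabulary
    .WithProbabilities()                              // must be set at build time
    .Build();

using var wavStream = File.OpenRead("command.wav");   // 16 kHz mono WAV

await foreach (var segment in processor.ProcessAsync(wavStream))
{
    // Low-probability segments are dropped as likely hallucinations.
    if (segment.Probability > 0.5f)
        Console.WriteLine(segment.Text);
}
```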

Actual Development Timeline (with AI coding CLI)

The project was built with an AI coding assistant writing ~90% of the code, directed by a senior developer who designed the architecture and made all decisions. It started from absolute zero: no environment, no tools installed, just the idea. The 6 days include everything: environment setup, research, architecture, development, testing, deployment.

120 Commits · 245 Tests · 9 Features + 4 Bugfixes + 3 POCs · 6 Calendar Days

Day-by-Day Breakdown

| Day | Commits | What Was Built | Hours | Phase (manual est.) |
|-----|---------|----------------|-------|---------------------|
| Day 1 (Mon) | 8 | Environment setup, architecture, Hub in Docker, VoiceEmitter base, Whisper STT integration, first working prototype | ~12h | Phases 0+1+2 |
| Day 2 (Tue) | 20 | VoiceEmitter complete (settings, tray, settings UI), Velopack installer, Docker → systemd migration, multi-activation feature (3 hotkey slots) | ~11h | Phases 1+2+3+8 |
| Day 3 (Wed) | 46 | Settings bugfix, entire PowerShellWorker from zero (UIAutomation + PostMessage + WindowsWorker + WorkerBase refactor + Hub routing fix), integration into VoiceEmitter, Whisper hallucination fix, message filter pipeline with escape commands | ~14h | Phases 3+4+7+8 |
| Day 4 (Thu) | 29 | Versioned Hub, deploy options (WSL + Windows, no-admin), Feature04 POC cases 4-6, Whisper language prompt fix | ~12h | Phases 6+7+8 |
| Day 5 (Fri) | 14 | Complete choice detection: keybinding research, screen parsing, voice mapping, ChoiceExecutorService, Windows support via IScreenAdapter | ~11h | Phase 6 complete |
| Day 6 (Sat) | 3 | Whisper accuracy tuning: upgrade to the preview build, beam search, confidence scores, language prompt vocabulary | ~2h | Phase 2 polish |
| **TOTAL** | **120** | | **~62h** | ~6 working days |

Estimate vs Reality

| Phase | Estimated (manual) | Actual (with AI) | Speedup |
|-------|--------------------|------------------|---------|
| Environment + Research | 5 days | ~0.5 days | 10x |
| Architecture + Foundation | 6 days | ~0.5 days | 12x |
| Audio + Whisper STT | 9 days | ~1.5 days | 6x |
| WSL Pipeline | 5 days | ~0.5 days | 10x |
| Windows Pipeline | 10 days | ~1 day | 10x |
| Dashboard (React) | 5 days | ~0.3 days | 17x |
| Intelligent Message | 10 days | ~1.5 days | 7x |
| Testing + Bug Fixing | 8 days | distributed | |
| Deployment + Docs | 4 days | ~0.5 days | 8x |
| **TOTAL** | **62 days** | **~6 days** | **~10x** |

Key Observations

  1. Architecture + contracts + boilerplate = 10–17x speedup. AI generates DTOs, interfaces, events, DI registrations, SignalR hub scaffolding, React components almost instantly. What takes a human days of typing, AI produces in minutes.
  2. Day 3 was superhuman: 46 commits, 5 features, 2 bugfixes in 14 hours. The entire Windows pipeline (UIAutomation + PostMessage + WindowsWorker + Hub routing fix) plus the message filter pipeline were built in a single day. This volume of correct, tested code is physically impossible without AI — it's not typing speed, it's the sheer breadth of simultaneous domain knowledge (Win32, SignalR, tmux, TDD).
  3. Whisper.net + CUDA still required human debugging. Discovering that beam search needs a preview NuGet, that WithProbabilities() must be set during creation, that language prompts need actual vocabulary — these are the kind of undocumented gotchas where AI can research but the human must verify through real execution.
  4. POC → production in hours, not days. PostMessage injection POC was validated and turned into a full WindowsWorker + integrated into VoiceEmitter in a single afternoon. A human would need the POC day plus 3–4 more days for the production implementation.
  5. Tests were "free" — written alongside code, not after. With TDD driven by AI, 245 tests were produced as a natural byproduct of development. A human writing 245 tests by hand would need 5–6 dedicated days. With AI, testing cost was essentially zero — absorbed into every feature.
  6. Total human effort was ~62 hours over 6 days, starting from absolute zero. The human work was: environment setup, architecture decisions, prompting AI, reviewing output, manual testing with real microphone + tmux + Windows Terminal, and debugging Whisper/CUDA behavior. The estimated manual equivalent is 62 working days (~500 hours). Effective acceleration: ~10x.