Voice Interface

Voice is the primary interface for DotAgents. Hold to speak, release to act. Your agents listen, think, and execute — all triggered by your voice.

Voice Modes

DotAgents offers several voice interaction modes:

Hold-to-Record (Dictation)

The default mode for quick voice input:

Hold Ctrl (macOS/Linux) or Ctrl+/ (Windows)
Speak your request
Release to stop recording
Your speech is transcribed and inserted into the active application

This is pure dictation — the AI transcribes your speech and types it wherever your cursor is.
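
The press-and-release cycle above can be pictured as a two-state machine. The sketch below is illustrative only: `HoldToRecord`, its callbacks, and the method names are hypothetical and are not taken from the DotAgents codebase.

```typescript
// Hypothetical sketch: none of these names come from DotAgents itself.
type RecorderState = "idle" | "recording";

class HoldToRecord {
  private state: RecorderState = "idle";

  constructor(
    private readonly onStart: () => void,
    private readonly onStop: () => void,
  ) {}

  // Call when the hotkey goes down. Repeated key-down events from OS
  // key repeat are ignored because the state is already "recording".
  keyDown(): void {
    if (this.state === "recording") return;
    this.state = "recording";
    this.onStart();
  }

  // Call when the hotkey is released; this ends the recording.
  keyUp(): void {
    if (this.state !== "recording") return;
    this.state = "idle";
    this.onStop();
  }

  get isRecording(): boolean {
    return this.state === "recording";
  }
}

const events: string[] = [];
const recorder = new HoldToRecord(
  () => events.push("start"),
  () => events.push("stop"),
);
recorder.keyDown();
recorder.keyDown(); // key repeat while held: no second "start"
recorder.keyUp();
console.log(events.join(",")); // start,stop
```

In a real app, `onStop` would be the place to hand the captured audio to the STT provider and insert the result at the cursor.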

MCP Agent Mode

Voice input that triggers full agent execution with tools:

Hold Ctrl+Alt to start recording
Speak your request (e.g., "Search GitHub for recent issues in my repo")
Release Ctrl+Alt to process
The agent reasons about your request, executes MCP tools, and responds
Watch real-time progress as each tool is called
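
The reason, call tools, respond cycle described above might look roughly like the loop below. This is a sketch under stated assumptions: the `Step`, `Tool`, and `Model` types, the scripted stand-in model, and the `search_issues` tool name are all hypothetical, and the loop is synchronous for brevity where real LLM and MCP tool calls would be asynchronous.

```typescript
// Hypothetical sketch of a reason -> call tool -> respond loop.
// The types, the scripted "model", and the tool name are assumptions.
type Step =
  | { kind: "tool"; name: string; args: Record<string, string> }
  | { kind: "final"; text: string };

type Tool = (args: Record<string, string>) => string;
type Model = (transcript: string, toolResults: string[]) => Step;

function runAgent(
  transcript: string,
  model: Model,
  tools: Record<string, Tool>,
  onProgress: (message: string) => void,
  maxSteps = 5,
): string {
  const toolResults: string[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = model(transcript, toolResults);
    if (step.kind === "final") return step.text;
    const tool = tools[step.name];
    if (!tool) {
      // Feed the failure back to the model instead of crashing.
      toolResults.push(`error: unknown tool ${step.name}`);
      continue;
    }
    onProgress(`calling ${step.name}`); // drives the progress display
    toolResults.push(tool(step.args));
  }
  return "Stopped: step limit reached.";
}

// Stand-in model: asks for one tool call, then answers with its result.
const scriptedModel: Model = (_transcript, results) =>
  results.length === 0
    ? { kind: "tool", name: "search_issues", args: { repo: "my-repo" } }
    : { kind: "final", text: `Found: ${results[0]}` };

const progress: string[] = [];
const answer = runAgent(
  "Search GitHub for recent issues in my repo",
  scriptedModel,
  { search_issues: (args) => `2 open issues in ${args.repo}` },
  (message) => progress.push(message),
);
console.log(progress, answer);
```

The `maxSteps` cap is a design choice in this sketch so that a model that keeps requesting tools cannot loop forever.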

Toggle Dictation (Fn)

Instead of holding a key, toggle dictation on and off:

Press Fn to start recording
Speak freely
Press Fn again to stop and transcribe

Hands-Free Mode (Mobile)

On the mobile app, hands-free mode uses Voice Activity Detection (VAD):

Toggle the microphone icon in the chat header
The app listens continuously
When you speak, it transcribes automatically
When you stop speaking, it sends the message
Perfect for driving, cooking, or multitasking
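
To illustrate how Voice Activity Detection can decide when you have started and stopped speaking, here is a minimal energy-based sketch. It is an assumption for illustration, not the detector the mobile app actually uses: production VAD is often model-based, and the `SimpleVad` class, the threshold, and the frame counts are hypothetical values.

```typescript
// Hypothetical energy-based VAD; thresholds are illustrative values.
function rms(frame: number[]): number {
  if (frame.length === 0) return 0;
  const sumSquares = frame.reduce((sum, s) => sum + s * s, 0);
  return Math.sqrt(sumSquares / frame.length);
}

type VadEvent = "speech-start" | "speech-end";

class SimpleVad {
  private speaking = false;
  private silentFrames = 0;

  constructor(
    private readonly threshold = 0.05, // RMS at or above this counts as speech
    private readonly hangoverFrames = 3, // silent frames before "speech-end"
  ) {}

  // Feed one audio frame (samples in [-1, 1]); returns an event or null.
  process(frame: number[]): VadEvent | null {
    const loud = rms(frame) >= this.threshold;
    if (loud) {
      this.silentFrames = 0;
      if (!this.speaking) {
        this.speaking = true;
        return "speech-start";
      }
      return null;
    }
    if (this.speaking && ++this.silentFrames >= this.hangoverFrames) {
      this.speaking = false;
      this.silentFrames = 0;
      return "speech-end";
    }
    return null;
  }
}

const quietFrame = [0, 0.01, -0.01, 0];
const loudFrame = [0.3, -0.4, 0.35, -0.3];
const vad = new SimpleVad();
const vadEvents = [quietFrame, loudFrame, loudFrame, quietFrame, quietFrame, quietFrame]
  .map((frame) => vad.process(frame))
  .filter((e): e is VadEvent => e !== null);
console.log(vadEvents.join(",")); // speech-start,speech-end
```

In this sketch, "speech-start" would begin transcription and "speech-end" would send the message. The hangover count keeps short pauses between words from ending the utterance early.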

Text Input

For when voice isn't convenient:

Ctrl+T (macOS/Linux) or Ctrl+Shift+T (Windows) opens a text input overlay
Type your message and press Enter
Typed messages go through the same agent processing as voice input

Speech-to-Text (STT)

DotAgents supports multiple STT providers for transcription:

Provider	Models	Speed	Quality	Offline
OpenAI	Whisper	Fast	Excellent	No
Groq	Whisper (accelerated)	Very Fast	Excellent	No
Parakeet	ONNX model	Fast	Good	Yes
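
The table above can also be read as data. The sketch below encodes it and filters for providers that work without a network connection; the values mirror the table, while the `SttProvider` type and `usableProviders` function are hypothetical names introduced for illustration.

```typescript
// Provider facts mirror the table above; the type and function names
// are hypothetical and not part of DotAgents.
type SttProvider = {
  name: "OpenAI" | "Groq" | "Parakeet";
  speed: "Fast" | "Very Fast";
  quality: "Good" | "Excellent";
  offline: boolean;
};

const STT_PROVIDERS: SttProvider[] = [
  { name: "OpenAI", speed: "Fast", quality: "Excellent", offline: false },
  { name: "Groq", speed: "Very Fast", quality: "Excellent", offline: false },
  { name: "Parakeet", speed: "Fast", quality: "Good", offline: true },
];

// Without a network connection, only offline-capable providers are usable.
function usableProviders(hasNetwork: boolean): string[] {
  return STT_PROVIDERS
    .filter((p) => hasNetwork || p.offline)
    .map((p) => p.name);
}

console.log(usableProviders(false).join(",")); // Parakeet
```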

Language Support

DotAgents supports more than 50 languages for speech recognition:

Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh.

Configure your preferred language in Settings > General.

Configuring STT

Go to Settings > General
Under "Speech-to-Text", select your provider
Choose the model variant (if applicable)
Set your preferred language
Test with a voice recording

Text-to-Speech (TTS)

Agent responses can be spoken aloud with multiple TTS providers:

Provider	Voices	Quality	Speed
OpenAI	6 voices (Alloy, Echo, Fable, Onyx, Nova, Shimmer)	High	Fast
Groq	Orpheus voices	High	Very Fast
Google Gemini	Multiple voices	High	Fast
Kitten	Custom voices	Variable	Fast
Supertonic	Custom voices	Variable	Fast

TTS Features

Auto-play — Automatically speak agent responses as they arrive
Voice selection — Choose from 50+ AI voices across providers
Audio player — Play, pause, and replay TTS output
Streaming — Audio begins playing before the full response is generated
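
One common way to start audio before the full response exists is to cut the streamed text into sentences and synthesize each one as soon as it is complete. The sketch below shows that idea only; `SentenceChunker` is a hypothetical helper, and DotAgents may implement streaming differently.

```typescript
// Hypothetical sketch: buffers streamed text and releases whole sentences.
class SentenceChunker {
  private buffer = "";

  // Add a streamed text fragment; returns any sentences that are now complete.
  push(fragment: string): string[] {
    this.buffer += fragment;
    const sentences: string[] = [];
    // A sentence ends at ., !, or ? followed by whitespace.
    const boundary = /[.!?]\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + 1; // keep the punctuation mark
      sentences.push(this.buffer.slice(0, end).trim());
      this.buffer = this.buffer.slice(match.index + match[0].length);
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Call once the response is finished to release the remaining text.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }
}

const chunker = new SentenceChunker();
const spoken: string[] = [];
for (const fragment of ["Found 2 iss", "ues. Open", "ing the first one"]) {
  spoken.push(...chunker.push(fragment));
}
spoken.push(...chunker.flush());
console.log(spoken.join(" | ")); // Found 2 issues. | Opening the first one
```

Each string in `spoken` would be sent to the TTS provider as its own request, so playback of the first sentence can begin while later sentences are still being generated.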

Configuring TTS

Go to Settings > General
Under "Text-to-Speech", select your provider
Choose your preferred voice
Toggle auto-play on/off
Adjust volume and speed (provider-dependent)

Voice Flow Architecture

              ┌──────────────┐
              │  Microphone  │
              └──────┬───────┘
                     │
              ┌──────▼───────┐
              │  Recording   │
              │  (Hold key)  │
              └──────┬───────┘
                     │
              ┌──────▼───────┐
              │  STT Provider│
              │  (Whisper)   │
              └──────┬───────┘
                     │
              ┌──────▼───────┐
              │  Transcribed │
              │  Text        │
              └──────┬───────┘
                     │
         ┌───────────┼───────────┐
         │                       │
    ┌────▼─────┐          ┌──────▼───────┐
    │ Dictation│          │ Agent Mode   │
    │ Mode     │          │ (MCP Tools)  │
    └────┬─────┘          └──────┬───────┘
         │                       │
    ┌────▼─────┐          ┌──────▼───────┐
    │ Insert   │          │ LLM Engine   │
    │ Text     │          │ (Tool Calls) │
    └──────────┘          └──────┬───────┘
                                 │
                          ┌──────▼───────┐
                          │ Response     │
                          ├──────────────┤
                          │ Text Display │
                          │ TTS Playback │
                          │ Text Insert  │
                          └──────────────┘
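
The branch in the diagram above, where transcribed text goes either to dictation or to agent mode, can be sketched as a small routing function. The `VoiceMode` and `PipelineAction` types and the output labels are hypothetical names chosen to match the boxes in the diagram.

```typescript
// Hypothetical sketch of the branch in the diagram above.
type VoiceMode = "dictation" | "agent";

type PipelineAction =
  | { kind: "insert-text"; text: string }
  | { kind: "run-agent"; prompt: string; outputs: string[] };

function routeTranscript(mode: VoiceMode, transcript: string): PipelineAction {
  if (mode === "dictation") {
    // Dictation: the transcript itself is typed at the cursor.
    return { kind: "insert-text", text: transcript };
  }
  // Agent mode: the transcript becomes the prompt for the LLM engine,
  // and the response can be displayed, spoken, or inserted.
  return {
    kind: "run-agent",
    prompt: transcript,
    outputs: ["text-display", "tts-playback", "text-insert"],
  };
}

const dictated = routeTranscript("dictation", "Hello world");
const delegated = routeTranscript("agent", "Open the file at src/index.ts");
console.log(dictated.kind, delegated.kind); // insert-text run-agent
```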

Emergency Stop

If an agent is running a long or undesired operation:

Ctrl+Shift+Escape — Immediately stops all active agent sessions.

This is the kill switch. It aborts all in-flight LLM calls, tool executions, and ACP delegations.
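
A kill switch like this is commonly built on the standard `AbortController` API: every in-flight operation observes a signal, and the stop hotkey aborts them all at once. The `SessionRegistry` class below is a hypothetical sketch of that pattern, not DotAgents' actual implementation.

```typescript
// Hypothetical session registry; AbortController and AbortSignal are the
// standard Web/Node.js APIs, everything else is an assumed name.
class SessionRegistry {
  private readonly controllers = new Set<AbortController>();

  // Start tracking a session and hand back the signal its work should observe.
  start(): AbortSignal {
    const controller = new AbortController();
    this.controllers.add(controller);
    return controller.signal;
  }

  // Kill switch: abort every tracked session, then forget them.
  // Returns how many sessions were stopped.
  stopAll(): number {
    const count = this.controllers.size;
    this.controllers.forEach((controller) => controller.abort());
    this.controllers.clear();
    return count;
  }
}

const registry = new SessionRegistry();
const llmCall = registry.start();
const toolRun = registry.start();
const stopped = registry.stopAll();
console.log(stopped, llmCall.aborted, toolRun.aborted); // 2 true true
```

Each signal could be passed to `fetch` or to a long-running tool so that the underlying request is cancelled rather than merely ignored.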

Tips

Short commands work best — "Search for React tutorials" is better than a paragraph of instructions
Be specific — "Open the file at src/index.ts" beats "open that file"
Use agent mode for actions — Hold Ctrl+Alt when you want the agent to do something, not just transcribe
Check your mic — Ensure the correct microphone is selected in your system settings
Grant permissions — macOS requires accessibility permissions for keyboard monitoring

Next Steps

Desktop App — Full desktop feature guide
Mobile App — Voice on the go
AI Providers — Configure STT/TTS providers
Keyboard Shortcuts — All voice hotkeys
