Voice Interface
Voice is the primary interface for DotAgents. Hold to speak, release to act. Your agents listen, think, and execute — all triggered by your voice.
Voice Modes
DotAgents offers several voice interaction modes:
Hold-to-Record (Dictation)
The default mode for quick voice input:
- Hold Ctrl (macOS/Linux) or Ctrl+/ (Windows)
- Speak your request
- Release to stop recording
- Your speech is transcribed and inserted into the active application
This is pure dictation — the AI transcribes your speech and types it wherever your cursor is.
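The hold-and-release behavior can be pictured as a two-state machine. The sketch below is illustrative only: `HoldToRecord`, `HoldRecorderDeps`, and their methods are hypothetical names, not DotAgents APIs, and transcription is stubbed as a synchronous call.

```typescript
// Hypothetical sketch of hold-to-record; not DotAgents' actual implementation.
type RecorderState = "idle" | "recording";

interface HoldRecorderDeps {
  startCapture(): void;            // begin capturing microphone audio
  stopCapture(): string;           // stop capture and return the transcript (STT stubbed)
  insertText(text: string): void;  // type the text at the cursor position
}

class HoldToRecord {
  private state: RecorderState = "idle";
  private readonly deps: HoldRecorderDeps;

  constructor(deps: HoldRecorderDeps) {
    this.deps = deps;
  }

  // Keyboard auto-repeat can fire key-down many times while held; ignore repeats.
  keyDown(): void {
    if (this.state === "recording") return;
    this.state = "recording";
    this.deps.startCapture();
  }

  // Releasing the key ends the recording and inserts the non-empty transcript.
  keyUp(): void {
    if (this.state !== "recording") return;
    this.state = "idle";
    const transcript = this.deps.stopCapture().trim();
    if (transcript.length > 0) this.deps.insertText(transcript);
  }
}
```

Guarding on the current state keeps repeated key-down events from restarting the capture, and a stray key-up while idle does nothing.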
MCP Agent Mode
Voice input that triggers full agent execution with tools:
- Hold Ctrl+Alt to start recording
- Speak your request (e.g., "Search GitHub for recent issues in my repo")
- Release Ctrl+Alt to process
- The agent reasons about your request, executes MCP tools, and responds
- Watch real-time progress as each tool is called
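The reason, call tools, respond cycle described above can be sketched as a small loop. Everything here is an assumption made for illustration: `runAgent`, `Planner`, and `ToolRegistry` are hypothetical names, the planning step stands in for an LLM call, and tools are synchronous stubs, whereas real MCP tool calls are asynchronous.

```typescript
// Hypothetical sketch of an agent loop with progress reporting.
type ToolCall = { tool: string; args: Record<string, string> };
type Step = { kind: "tool"; call: ToolCall } | { kind: "final"; text: string };
type Planner = (request: string, results: string[]) => Step;
type ToolRegistry = Record<string, (args: Record<string, string>) => string>;

function runAgent(
  request: string,
  plan: Planner,                          // stand-in for the LLM deciding the next step
  tools: ToolRegistry,                    // stand-in for the available MCP tools
  onProgress: (message: string) => void,  // called once per tool invocation
  maxSteps = 5,
): string {
  const results: string[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = plan(request, results);
    if (step.kind === "final") return step.text;
    const tool = tools[step.call.tool];
    if (tool === undefined) {
      // Feed the error back so the planner can recover on the next step.
      results.push(`error: unknown tool ${step.call.tool}`);
      continue;
    }
    onProgress(`calling ${step.call.tool}`);
    results.push(tool(step.call.args));
  }
  return "Stopped: step limit reached.";
}
```

The step limit is a design choice in this sketch: it bounds a run even if the planner never produces a final answer.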
Toggle Dictation (Fn)
Instead of holding a key, toggle dictation on and off:
- Press Fn to start recording
- Speak freely
- Press Fn again to stop and transcribe
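Unlike hold-to-record, toggle mode reacts only to key presses, so it can be modeled as a pure transition function. This is a minimal sketch; `ToggleState` and `onFnPress` are hypothetical names.

```typescript
// Hypothetical sketch: each Fn press flips the recording state and
// reports which action the caller should perform.
type ToggleState = { recording: boolean };
type ToggleResult = { state: ToggleState; action: "start" | "stop-and-transcribe" };

function onFnPress(state: ToggleState): ToggleResult {
  return state.recording
    ? { state: { recording: false }, action: "stop-and-transcribe" }
    : { state: { recording: true }, action: "start" };
}
```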
Hands-Free Mode (Mobile)
On the mobile app, hands-free mode uses Voice Activity Detection (VAD):
- Toggle the microphone icon in the chat header
- The app listens continuously
- When you speak, it transcribes automatically
- When you stop speaking, it sends the message
- Perfect for driving, cooking, or multitasking
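To illustrate what Voice Activity Detection does, here is a deliberately simple energy-threshold sketch. It is not the detector the mobile app uses, which this page does not specify; `detectUtterances`, the threshold, and the silence window are all assumptions chosen for the example.

```typescript
// Hypothetical energy-based VAD: returns [startFrame, endFrame) ranges of speech.
// An utterance starts at the first frame at or above `threshold` and ends once
// `silenceFramesToEnd` consecutive quiet frames have been seen.
function detectUtterances(
  frameEnergies: number[],
  threshold: number,
  silenceFramesToEnd: number,
): Array<[number, number]> {
  const utterances: Array<[number, number]> = [];
  let start = -1;     // frame where the current utterance began, or -1 if none
  let silentRun = 0;  // consecutive quiet frames inside the current utterance

  frameEnergies.forEach((energy, i) => {
    const speaking = energy >= threshold;
    if (start < 0) {
      if (speaking) {
        start = i;
        silentRun = 0;
      }
      return;
    }
    if (speaking) {
      silentRun = 0;
      return;
    }
    silentRun += 1;
    if (silentRun >= silenceFramesToEnd) {
      // End the utterance at the first frame of the silent run.
      utterances.push([start, i - silentRun + 1]);
      start = -1;
      silentRun = 0;
    }
  });

  // Close an utterance that was still open when the audio ended.
  if (start >= 0) utterances.push([start, frameEnergies.length - silentRun]);
  return utterances;
}

// A short pause (one quiet frame) does not split the first utterance.
const ranges = detectUtterances([0, 0, 5, 6, 0, 5, 0, 0, 0, 7, 0], 1, 2);
// ranges is [[2, 6], [9, 10]]
```

In a hands-free flow, each completed range would correspond to one message being transcribed and sent.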
Text Input
For when voice isn't convenient:
- Ctrl+T (macOS/Linux) or Ctrl+Shift+T (Windows) opens a text input overlay
- Type your message and press Enter
- Same agent processing as voice input
Speech-to-Text (STT)
DotAgents supports multiple STT providers for transcription:
| Provider | Models | Speed | Quality | Offline |
|---|---|---|---|---|
| OpenAI | Whisper | Fast | Excellent | No |
| Groq | Whisper (accelerated) | Very Fast | Excellent | No |
| Parakeet | ONNX model | Fast | Good | Yes |
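One way to read the table is as a selection rule: if you must work offline, only Parakeet qualifies; otherwise prefer quality, then speed. The sketch below encodes the table rows as data. The `pickSttProvider` function and its scoring are hypothetical and only illustrate the trade-off.

```typescript
// Rows copied from the table above; the selection logic is an illustrative assumption.
type SttProvider = {
  name: string;
  speed: "Fast" | "Very Fast";
  quality: "Good" | "Excellent";
  offline: boolean;
};

const STT_PROVIDERS: SttProvider[] = [
  { name: "OpenAI", speed: "Fast", quality: "Excellent", offline: false },
  { name: "Groq", speed: "Very Fast", quality: "Excellent", offline: false },
  { name: "Parakeet", speed: "Fast", quality: "Good", offline: true },
];

// Rank quality above speed; returns undefined if nothing matches the constraint.
function pickSttProvider(needsOffline: boolean): SttProvider | undefined {
  const score = (p: SttProvider): number =>
    (p.quality === "Excellent" ? 2 : 0) + (p.speed === "Very Fast" ? 1 : 0);
  const candidates = STT_PROVIDERS.filter((p) => !needsOffline || p.offline);
  return [...candidates].sort((a, b) => score(b) - score(a))[0];
}
```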
Language Support
DotAgents supports more than 50 languages for speech recognition:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh.
Configure your preferred language in Settings > General.
Configuring STT
- Go to Settings > General
- Under "Speech-to-Text", select your provider
- Choose the model variant (if applicable)
- Set your preferred language
- Test with a voice recording
Text-to-Speech (TTS)
Agent responses can be spoken aloud with multiple TTS providers:
| Provider | Voices | Quality | Speed |
|---|---|---|---|
| OpenAI | 6 voices (Alloy, Echo, Fable, Onyx, Nova, Shimmer) | High | Fast |
| Groq | Orpheus voices | High | Very Fast |
| Google Gemini | Multiple voices | High | Fast |
| Kitten | Custom voices | Variable | Fast |
| Supertonic | Custom voices | Variable | Fast |
TTS Features
- Auto-play — Automatically speak agent responses as they arrive
- Voice selection — Choose from 50+ AI voices across providers
- Audio player — Play, pause, and replay TTS output
- Streaming — Audio begins playing before the full response is generated
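Streaming playback generally depends on cutting the response into speakable pieces before it is complete. The sketch below shows one common approach, sentence chunking of incoming text deltas. It is an assumption about how such a feature could work, not a description of DotAgents internals, and `SentenceChunker` is a hypothetical name.

```typescript
// Hypothetical sketch: buffer streamed text and release whole sentences
// so each one can be sent to a TTS provider as soon as it is complete.
class SentenceChunker {
  private buffer = "";

  // Feed a text delta; returns the sentences completed by it.
  // A sentence ends at ., ! or ? followed by whitespace, so "v1.2" is not split.
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];
    const boundary = /[.!?]\s/;
    let index = this.buffer.search(boundary);
    while (index !== -1) {
      sentences.push(this.buffer.slice(0, index + 1).trim());
      this.buffer = this.buffer.slice(index + 1).replace(/^\s+/, "");
      index = this.buffer.search(boundary);
    }
    return sentences;
  }

  // Call when the response is finished to release any trailing text.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}
```

The first sentence can begin playing while later deltas are still arriving, which is the effect the Streaming feature describes.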
Configuring TTS
- Go to Settings > General
- Under "Text-to-Speech", select your provider
- Choose your preferred voice
- Toggle auto-play on/off
- Adjust volume and speed (provider-dependent)
Voice Flow Architecture
                  ┌──────────────┐
                  │  Microphone  │
                  └──────┬───────┘
                         │
                  ┌──────▼───────┐
                  │  Recording   │
                  │  (Hold key)  │
                  └──────┬───────┘
                         │
                  ┌──────▼───────┐
                  │ STT Provider │
                  │  (Whisper)   │
                  └──────┬───────┘
                         │
                  ┌──────▼───────┐
                  │ Transcribed  │
                  │     Text     │
                  └──────┬───────┘
                         │
             ┌───────────┼───────────┐
             │                       │
        ┌────▼─────┐          ┌──────▼───────┐
        │ Dictation│          │  Agent Mode  │
        │   Mode   │          │  (MCP Tools) │
        └────┬─────┘          └──────┬───────┘
             │                       │
        ┌────▼─────┐          ┌──────▼───────┐
        │  Insert  │          │  LLM Engine  │
        │   Text   │          │ (Tool Calls) │
        └──────────┘          └──────┬───────┘
                                     │
                              ┌──────▼───────┐
                              │   Response   │
                              ├──────────────┤
                              │ Text Display │
                              │ TTS Playback │
                              │ Text Insert  │
                              └──────────────┘
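
The branching in the diagram can also be read as a single dispatch function: transcribe once, then either insert the text or hand it to the agent. This is a minimal sketch with hypothetical names (`VoiceDeps`, `handleRecording`) and synchronous stubs; it omits the optional text-insert output of the response stage.

```typescript
// Hypothetical sketch of the voice flow; every dependency is a stub.
type VoiceMode = "dictation" | "agent";

interface VoiceDeps {
  transcribe(audio: Uint8Array): string;  // STT provider
  insertText(text: string): void;         // dictation output
  runAgent(request: string): string;      // LLM engine plus tool calls
  display(text: string): void;            // text display
  speak(text: string): void;              // TTS playback
}

function handleRecording(
  mode: VoiceMode,
  audio: Uint8Array,
  deps: VoiceDeps,
  autoPlay: boolean,
): void {
  const text = deps.transcribe(audio).trim();
  if (text.length === 0) return;  // nothing was said

  if (mode === "dictation") {
    deps.insertText(text);
    return;
  }

  const response = deps.runAgent(text);
  deps.display(response);
  if (autoPlay) deps.speak(response);
}
```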
Emergency Stop
If an agent is running a long or undesired operation:
Ctrl+Shift+Escape — Immediately stops all active agent sessions.
This is the kill switch. It aborts all in-flight LLM calls, tool executions, and ACP delegations.
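A kill switch of this kind is commonly built on cancellation signals. The sketch below uses the standard `AbortController` API to show the idea. `SessionRegistry` is a hypothetical name, and whether DotAgents uses this mechanism is an assumption.

```typescript
// Hypothetical sketch: every session gets an AbortSignal, and one call aborts them all.
class SessionRegistry {
  private readonly controllers = new Map<string, AbortController>();

  // Register a session; pass the returned signal to LLM calls and tool executions.
  start(id: string): AbortSignal {
    const controller = new AbortController();
    this.controllers.set(id, controller);
    return controller.signal;
  }

  // Abort every active session and return how many were stopped.
  stopAll(): number {
    const count = this.controllers.size;
    this.controllers.forEach((controller) => controller.abort());
    this.controllers.clear();
    return count;
  }
}
```

Work that honors the signal, for example a `fetch` call given `{ signal }`, stops as soon as `stopAll()` runs.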
Tips
- Short commands work best — "Search for React tutorials" is better than a paragraph of instructions
- Be specific — "Open the file at src/index.ts" beats "open that file"
- Use agent mode for actions — Hold Ctrl+Alt when you want the agent to do something, not just transcribe
- Check your mic — Ensure the correct microphone is selected in your system settings
- Grant permissions — macOS requires accessibility permissions for keyboard monitoring
Next Steps
- Desktop App — Full desktop feature guide
- Mobile App — Voice on the go
- AI Providers — Configure STT/TTS providers
- Keyboard Shortcuts — All voice hotkeys