
Voice

Overview

Open Claw supports voice interaction through three systems: wake word detection, continuous voice conversation (talk mode), and text-to-speech for spoken responses.

Wake Words

Swabble (macOS)

Swabble is a native macOS daemon that provides always-on, on-device voice wake word detection using Apple's Speech.framework.

Features:

Local-only processing — no audio leaves your device during wake word detection

Default wake word: clawd (with alias claude)

Customizable wake words

Continuous audio capture and transcription

Hook execution — triggers shell commands when the wake word is detected

File transcription — convert audio files to text (TXT or SRT format)

Configurable cooldown, minimum character count, and timeout

How it works:

1. Swabble listens continuously using the system microphone.

2. When it detects the wake word in spoken text, it captures the speech that follows.

3. The captured text is sent to your agent via a configured hook command.

4. The agent processes the voice command and responds.
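The detection steps above can be sketched in a few lines. This is a minimal illustration in Python, not Swabble's implementation (Swabble is a native macOS daemon). `WakeConfig`, `WakeDetector`, `build_hook_argv`, the `{text}` placeholder, and the default cooldown and minimum-length values are all hypothetical names and numbers chosen for the example.

```python
import re
import shlex
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WakeConfig:
    # Illustrative defaults only; these are not Swabble's real settings.
    wake_words: tuple = ("clawd", "claude")
    min_chars: int = 3        # ignore captured commands shorter than this
    cooldown_s: float = 2.0   # ignore re-triggers inside this window


@dataclass
class WakeDetector:
    config: WakeConfig = field(default_factory=WakeConfig)
    _last_trigger: float = float("-inf")

    def extract_command(self, transcript: str, now: Optional[float] = None) -> Optional[str]:
        """Return the speech after the first wake word, or None if nothing fires."""
        now = time.monotonic() if now is None else now
        if now - self._last_trigger < self.config.cooldown_s:
            return None  # still cooling down from the previous trigger
        words = "|".join(re.escape(w) for w in self.config.wake_words)
        match = re.search(rf"\b(?:{words})\b[\s,.:!?]*", transcript, flags=re.IGNORECASE)
        if match is None:
            return None
        command = transcript[match.end():].strip()
        if len(command) < self.config.min_chars:
            return None
        self._last_trigger = now
        return command


def build_hook_argv(hook_template: str, command: str) -> List[str]:
    """Expand a hypothetical '{text}' placeholder into an argv list."""
    return [arg.replace("{text}", command) for arg in shlex.split(hook_template)]
```

Building an argv list, rather than interpolating the transcript into a shell string, keeps spoken text from being interpreted as shell syntax when the hook is eventually launched.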

Node Wake Words

On iOS and Android companion apps, voice wake is handled natively:

Wake word configuration is owned by the Gateway

Nodes receive wake word config on connect

Detection uses platform-native speech recognition
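The ownership model described above, where the Gateway holds the wake word list and nodes receive it on connect, might look roughly like the following. The message shape (`type`, `wake_words`) and the `Gateway` and `Node` classes are assumptions made for illustration; the actual wire format is not specified here.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Gateway:
    # The gateway is the single source of truth for wake words.
    wake_words: List[str] = field(default_factory=lambda: ["clawd", "claude"])

    def on_connect_payload(self) -> str:
        """Serialize the config a node would receive on connect (hypothetical shape)."""
        return json.dumps({"type": "wake_config", "wake_words": self.wake_words})


@dataclass
class Node:
    platform: str
    wake_words: List[str] = field(default_factory=list)

    def connect(self, gateway: Gateway) -> None:
        message = json.loads(gateway.on_connect_payload())
        if message["type"] == "wake_config":
            # The node replaces its local list instead of merging it.
            self.wake_words = list(message["wake_words"])
```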

Talk Mode

Talk mode enables continuous voice conversations — speak naturally and hear your agent respond.

How It Works

1. Speech-to-Text: your voice is transcribed in real time (Deepgram streaming or platform-native STT).

2. Agent Processing: the transcribed text is sent to your agent as a regular message.

3. Text-to-Speech: the agent's response is spoken back to you.
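One turn of this pipeline can be sketched as three pluggable stages. The function name `talk_turn` and its callable parameters are hypothetical; real providers stream audio, while this sketch uses whole buffers to keep the flow visible.

```python
from typing import Callable


def talk_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # stands in for the STT provider
    ask_agent: Callable[[str], str],      # stands in for the agent
    synthesize: Callable[[str], bytes],   # stands in for the TTS provider
) -> bytes:
    """Run one speech -> agent -> speech round trip and return reply audio."""
    text = transcribe(audio)
    if not text.strip():
        return b""  # nothing intelligible was said, so skip the agent call
    reply = ask_agent(text)
    return synthesize(reply)
```

Passing the stages in as callables means an STT or TTS backend can be swapped without touching the turn logic.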

Voice State Machine

Talk mode transitions between four states:

| State | Description |
|-------|-------------|
| Idle | Not actively listening |
| Listening | Capturing and transcribing your speech |
| Thinking | Agent is processing your request |
| Speaking | Agent response is being spoken |
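A small sketch of these four states as an explicit state machine follows. The state names come from the table, but the allowed transitions in `ALLOWED` are an assumption, since the edges between states are not documented here. `TalkSession` is likewise a hypothetical name.

```python
from enum import Enum


class VoiceState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Assumed edges: each state can advance one step or drop back to IDLE.
ALLOWED = {
    VoiceState.IDLE: {VoiceState.LISTENING},
    VoiceState.LISTENING: {VoiceState.THINKING, VoiceState.IDLE},
    VoiceState.THINKING: {VoiceState.SPEAKING, VoiceState.IDLE},
    VoiceState.SPEAKING: {VoiceState.LISTENING, VoiceState.IDLE},
}


class TalkSession:
    def __init__(self) -> None:
        self.state = VoiceState.IDLE

    def transition(self, new_state: VoiceState) -> None:
        """Move to new_state, rejecting edges that are not in ALLOWED."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.name} to {new_state.name}")
        self.state = new_state
```

Rejecting undeclared transitions makes bugs such as speaking before the agent has finished thinking fail loudly instead of silently.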

Text-to-Speech Providers

| Provider | Description |
|----------|-------------|
| ElevenLabs | High-quality voice synthesis with voice selection |
| OpenAI TTS | OpenAI's text-to-speech API |

Voice Preferences

Voice selection — Choose from available TTS voices

Custom system prompt — Override the agent's personality for voice mode

Custom response format — Control how the agent formats spoken responses

Language support — Voice strings localized for 18+ languages

Voice Commands

Multi-Intent Detection

Agents can detect and execute multi-step voice commands:

> "Create a calendar event for tomorrow at 3 PM, then send an email to the team about it, and post a reminder in Slack"

The request is automatically parsed into a sequence of commands that run in order, with each step's result passed to the next step.
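The chaining behavior can be illustrated with a short sketch. It covers only the execution side: how the agent parses speech into steps is not shown, and `run_sequence` plus the three stub steps are hypothetical stand-ins for real tools.

```python
from typing import Any, Callable, List


def run_sequence(steps: List[Callable[[Any], Any]], initial: Any = None) -> List[Any]:
    """Run steps in order, feeding each step the previous step's result."""
    results = []
    previous = initial
    for step in steps:
        previous = step(previous)
        results.append(previous)
    return results


# Stub steps mirroring the spoken example; real tools would call external services.
def create_event(_: Any) -> dict:
    return {"event": "Team sync", "when": "tomorrow 15:00"}


def send_email(event: dict) -> dict:
    return {**event, "email_sent_to": "team"}


def post_reminder(context: dict) -> dict:
    return {**context, "slack_reminder": True}
```

Calling `run_sequence([create_event, send_email, post_reminder])` returns one result per step, and the last entry carries the fields accumulated along the way.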

Tool Execution

During voice conversations, agents can execute tools just like in text conversations — browse the web, run code, manage files, control devices, and more. Results are summarized and spoken back.

Action Truth Enforcement

Voice mode validates that an agent's claims match the actual tool outcomes. If an agent says "I've sent the email" but the email tool failed, the system catches the discrepancy and reports the actual result.
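A minimal sketch of this kind of check is shown below. It assumes the set of tools the reply claims to have used is already known (`claimed_tools`); how those claims are extracted from the agent's wording is outside the sketch. `ToolResult` and `enforce_truth` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ToolResult:
    tool: str
    ok: bool
    detail: str = ""


def enforce_truth(claimed_tools: Set[str], results: List[ToolResult], reply: str) -> str:
    """Return reply unchanged if every claimed tool succeeded, else a correction."""
    outcome = {r.tool: r for r in results}
    problems = []
    for tool in sorted(claimed_tools):
        result = outcome.get(tool)
        if result is None:
            problems.append(f"{tool} was never run")
        elif not result.ok:
            problems.append(f"{tool} failed: {result.detail}")
    if not problems:
        return reply
    return "Correction: " + "; ".join(problems) + "."
```

With a failed `send_email` result, the spoken reply "I've sent the email" would be replaced by a correction that names the failure.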

Voice Calling (Plugin)

The Voice Call plugin adds SIP telephony support:

Inbound call handling

Outbound calls (provider-dependent)

Real-time bidirectional audio (PCM streams)

TTS synthesis injected into the call audio

Quota Management

Voice services may have usage quotas:

Monthly minute allocation for TTS + STT

Per-session tracking

Warning at 80% usage

Automatic cutoff at quota limit
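The quota rules above can be sketched as a small tracker. The 80% warning threshold and the cutoff at the limit come from the list; the class name `VoiceQuota`, the status strings, and the choice to clip usage at the limit are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VoiceQuota:
    monthly_minutes: float
    warn_ratio: float = 0.8  # warning threshold from the list above
    per_session: Dict[str, float] = field(default_factory=dict)

    @property
    def used(self) -> float:
        return sum(self.per_session.values())

    def record(self, session_id: str, minutes: float) -> str:
        """Add usage for a session and return 'ok', 'warning', or 'cutoff'."""
        remaining = self.monthly_minutes - self.used
        if remaining <= 0:
            return "cutoff"
        granted = min(minutes, remaining)  # never count past the monthly limit
        self.per_session[session_id] = self.per_session.get(session_id, 0.0) + granted
        if self.used >= self.monthly_minutes:
            return "cutoff"
        if self.used >= self.warn_ratio * self.monthly_minutes:
            return "warning"
        return "ok"
```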
