ROS Package bob_q3tts
A ROS 2 wrapper for the Qwen3-TTS model, providing high-fidelity, low-latency text-to-speech with streaming aggregation and voice cloning capabilities. It also includes a Qt-based GUI for real-time parameter tuning.
Quick Start
Launch the TTS Service:
ros2 run bob_q3tts tts
Open the Parameter GUI:
ros2 run bob_q3tts gui
Docker Usage
A Docker image of the package is automatically built and pushed to the GitHub Container Registry (GHCR).
Using the Pre-built Image
docker pull ghcr.io/bob-ros2/bob-q3tts:latest
Using Docker Compose (Recommended)
docker-compose build
docker-compose up
Using Docker CLI
Build the Image (locally):
docker build -t bob-q3tts .
Run the Node (with GPU and Audio):
docker run -it --rm \
  --gpus all \
  --device /dev/snd \
  -e Q3TTS_MODEL_DIR=/models \
  -e ROS_DOMAIN_ID=99 \
  -v $(pwd)/models:/models \
  --network host \
  --ipc host \
  ghcr.io/bob-ros2/bob-q3tts:latest
Topics
| Name | Type | Direction | Description |
|---|---|---|---|
| | | Subscriber | Incoming text. Aggregated and synthesized at sentence boundaries. |
| | | Publisher | The text currently being spoken. Published right before playback. |
Parameters
The node uses static configuration for initialization and dynamic parameters for per-sentence tuning.
Core Configuration (Static)
| Parameter | Type | Description |
|---|---|---|
| | | The Hugging Face model ID. Env: |
| | | Local directory for model caching. Env: |
| | | Characters or strings that trigger synthesis (e.g. |
| | | Timeout in ms to flush buffer without delimiter. Env: |
| substitute | | Regex-based [pattern, replacement] pairs for text cleaning (HTML, emojis, etc.). Use |
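The interplay of delimiters and the flush timeout can be illustrated with a minimal sketch. It is not the node's actual implementation: the class name `SentenceAggregator`, its methods, and the default values are hypothetical and only mirror the behavior described in the table (synthesize at a delimiter, otherwise flush after a period of silence).

```python
import re
from typing import List, Optional, Sequence


class SentenceAggregator:
    """Buffers streamed text and releases it at delimiters or after a timeout.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, delimiters: Sequence[str] = (".", "!", "?"),
                 timeout_ms: int = 1000):
        # Longest delimiters first so multi-character delimiters match before prefixes.
        ordered = sorted(delimiters, key=len, reverse=True)
        self._split = re.compile("(" + "|".join(re.escape(d) for d in ordered) + ")")
        self._timeout_s = timeout_ms / 1000.0
        self._buffer = ""
        self._last_input: Optional[float] = None

    def feed(self, chunk: str, now: float) -> List[str]:
        """Append a chunk and return every sentence completed by a delimiter."""
        self._buffer += chunk
        self._last_input = now
        # With a capture group, split() yields: text, delimiter, text, ..., remainder.
        parts = self._split.split(self._buffer)
        sentences = []
        for i in range(0, len(parts) - 1, 2):
            body = parts[i].strip()
            if body:  # skip delimiters that have no text in front of them
                sentences.append(body + parts[i + 1])
        self._buffer = parts[-1]
        return sentences

    def flush_if_stale(self, now: float) -> Optional[str]:
        """Return the leftover text if no input arrived within the timeout."""
        if self._last_input is None or not self._buffer.strip():
            return None
        if now - self._last_input < self._timeout_s:
            return None
        text, self._buffer = self._buffer.strip(), ""
        return text
```

In a node, `feed` would be called from the text subscriber callback and `flush_if_stale` from a periodic timer, with `now` taken from a monotonic clock.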
Generation Settings (Dynamic)
| Parameter | Type | Description |
|---|---|---|
| language | | Speech language. Env: |
| | | Enable sampling for Stage 1. Env: |
| | | Sampling temperature for Stage 1. Env: |
| | | Nucleus sampling threshold. Env: |
| | | Top-k sampling limit. Env: |
| | | Penalty for repeated sounds. Env: |
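For readers unfamiliar with these sampling knobs, the following toy sketch shows how temperature, top-k, and nucleus (top-p) filtering typically narrow a token distribution before a token is drawn. It is a generic illustration on a three-token vocabulary, not the Qwen3-TTS code path; the function name `filter_distribution` and its argument names are hypothetical.

```python
import math
from typing import Dict


def filter_distribution(logits: Dict[str, float], temperature: float = 1.0,
                        top_k: int = 0, top_p: float = 1.0) -> Dict[str, float]:
    """Return the renormalized candidates after temperature, top-k and top-p."""
    # Temperature scaling, then a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the limit).
    if top_k > 0:
        ranked = ranked[:top_k]
        norm = sum(prob for _, prob in ranked)
        ranked = [(tok, prob / norm) for tok, prob in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}
```

Lower temperatures and tighter top-k/top-p values concentrate probability on the most likely tokens, which generally trades variety for stability.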
Subtalker Settings (Dynamic)
| Parameter | Type | Description |
|---|---|---|
| | | Enable sampling for Stage 2. Env: |
| | | Temperature for acoustic texture. Env: |
| | | Nucleus sampling for Stage 2. Env: |
| | | Top-k for Stage 2. Env: |
Voice Clone / ICL (Dynamic)
| Parameter | Type | Description |
|---|---|---|
| | | Instructions to dynamically shape voice without audio samples. Note: If defined, it actively overrides voice-cloning via |
| voice_ref_audio | | Path to reference |
| voice_ref_text | | Transcript or path to transcript file. Reading from file enables dynamic updates. Env: |
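The two rules in this table (the transcript may be inline text or a file path, and an instruction takes precedence over cloning) can be sketched as follows. Both helper functions and the mode labels are hypothetical and only illustrate the described behavior; the node's real logic may differ.

```python
import os


def resolve_ref_text(value: str) -> str:
    """Return the transcript: the file's contents if `value` is a file, else `value`."""
    if os.path.isfile(value):
        # Re-reading the file on every call is what would allow edits to the
        # transcript to take effect without restarting the node.
        with open(value, "r", encoding="utf-8") as handle:
            return handle.read().strip()
    return value.strip()


def select_voice_mode(instruct: str, ref_audio: str) -> str:
    """Pick the conditioning mode; a non-empty instruction overrides cloning."""
    if instruct.strip():
        return "instruct"
    if ref_audio.strip():
        return "clone"
    return "default"
```

Calling `resolve_ref_text` once per sentence keeps the transcript in sync with the file at the cost of a small read per synthesis.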
Voice Cloning & Language Support
The package provides reference files for high-quality voice cloning in both German and English.
- German (Default): Use eva_24khz_de.wav and voice_ref_text_de.txt. Best for Bob’s default personality.
- English: Use eva_24khz_en.wav and voice_ref_text_en.txt. Use this if you want Eva to speak with a native English accent and proper prosody.
To switch languages via CLI:
ros2 run bob_q3tts tts --ros-args -p language:=English -p voice_ref_audio:=/app/ros2_ws/install/bob_q3tts/share/bob_q3tts/config/eva_24khz_en.wav -p voice_ref_text:=/app/ros2_ws/install/bob_q3tts/share/bob_q3tts/config/voice_ref_text_en.txt
Output & Storage (Dynamic)
| Parameter | Type | Description |
|---|---|---|
| | | Enable/disable audio playback. Env: |
| player | | Player: |
| audio_device | | Device ID or name for sounddevice. Env: |
| target_sample_rate | | Force resampling for local playback (0 = auto). Env: |
| | | Prefix for saving audio files. Env: |
| | | Starting index for file naming. Env: |
Audio Architecture & Sample Rates
The node handles different sample rates for local playback and remote streaming:
- Local Playback (target_sample_rate = 0):
  - The model generates audio at 24,000 Hz.
  - If the sound card (e.g., HDMI) does not support 24 kHz, the node automatically falls back to resampling to 48,000 Hz.
  - This fallback is “sticky” (remembered), so subsequent sentences skip the rate detection and play without added delay.
- Remote Streaming (audio_raw topic):
  - Audio published to the ROS topic is always fixed at 44,100 Hz.
  - This ensures compatibility with standard streaming tools and bridges (like Twitch bots) that expect a constant frequency.
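The sticky fallback for local playback can be sketched as a small selector that probes rates once and remembers the result. This is an illustration under assumptions, not the node's code: `PlaybackRateSelector` is a hypothetical name, and the `is_supported` probe is injected so the sketch runs without audio hardware (in practice it might wrap a device check such as `sounddevice.check_output_settings`).

```python
from typing import Callable, Iterable, Optional


class PlaybackRateSelector:
    """Chooses a playback rate and remembers the first one that works."""

    def __init__(self, is_supported: Callable[[int], bool],
                 native_rate: int = 24000,
                 fallback_rates: Iterable[int] = (48000,)):
        self._is_supported = is_supported
        self._candidates = [native_rate, *fallback_rates]
        self._sticky: Optional[int] = None  # remembered working rate

    def select(self, target_sample_rate: int = 0) -> int:
        # A non-zero target_sample_rate is an explicit override: no probing.
        if target_sample_rate > 0:
            return target_sample_rate
        # Reuse the remembered rate so later sentences skip the probe.
        if self._sticky is not None:
            return self._sticky
        for rate in self._candidates:
            if self._is_supported(rate):
                self._sticky = rate
                return rate
        raise RuntimeError("no supported playback rate found")
```

Only the first sentence pays for probing; every later call returns the cached rate immediately.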
Troubleshooting Audio
If you hear no sound or see “Invalid sample rate” errors (common with HDMI/GPU audio), or experience ALSA underrun messages:
- Use aplay fallback: Set the parameter player:=aplay. This bypasses the Python-native sound library and uses the more robust system utility to stream the audio data.
- List Devices (inside container):
  python3 -c "import sounddevice as sd; print(sd.query_devices())"
- Set Device: Find the index or name (e.g., HDA NVidia: HDMI 0 (hw:2,3)) and set the audio_device parameter.
- Automatic Handling: By default (target_sample_rate: 0), the node tries to auto-detect a working rate (falling back to 48 kHz). You only need to set target_sample_rate manually if the auto-detection fails or you have very specific hardware needs.
Text Cleaning Example
The substitute parameter removes “garbage” characters that LLMs commonly stream (such as HTML leftovers or emojis) and that a TTS model cannot speak.
# Example launch snippet
ros2 run bob_q3tts tts --ros-args -p 'substitute:=[
" ","", # Remove non-breaking spaces
"<br>"," ", # Replace <br> with space
"[\\U00010000-\\U0010ffff\\u2600-\\u27BF\\ufe00-\\ufe0f]","", # Remove emojis/icons
"https?://\\S+","", # Remove URLs
"[*~_|<>\\^`\\]\\[]"," ", # Remove markdown formatting chars
"\\s{2,}"," " # Normalize multiple spaces
]'
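To show how such a flat list of alternating patterns and replacements could be applied, here is a minimal sketch using Python's `re` module. The helper names `pair_up` and `clean_text` are hypothetical, and the final `strip()` is an assumption of this sketch rather than documented node behavior.

```python
import re
from typing import List, Sequence, Tuple


def pair_up(flat: Sequence[str]) -> List[Tuple[str, str]]:
    """Turn [pattern, replacement, pattern, replacement, ...] into pairs."""
    if len(flat) % 2 != 0:
        raise ValueError("substitute needs an even number of entries")
    return [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


def clean_text(text: str, substitute: Sequence[str]) -> str:
    """Apply each regex substitution in order, then trim surrounding whitespace."""
    for pattern, replacement in pair_up(substitute):
        text = re.sub(pattern, replacement, text)
    return text.strip()
```

Order matters: the whitespace normalization rule is listed last so that it also collapses gaps left behind by the earlier removals.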