ROS Package bob_coquitts
This ROS package provides a robust node that interfaces with the Coqui TTS library, allowing a ROS 2 system to convert text into speech. It intelligently processes incoming text streams, handles voice cloning with models like XTTS, and offers extensive configuration options.
Features
Intelligent Text Buffering: Waits for pauses in the incoming text stream before processing to ensure complete thoughts are synthesized.
Dual Splitting Modes: Choose between two modes for sentence splitting via the
split_sentencesparameter:Manual Mode (Default): Use custom delimiters (
sentence_delimiters) to precisely control how text is split into sentences.Automatic Mode: Let Coqui’s powerful internal splitter handle long, unstructured text blocks.
Advanced Text Normalization:
Automatically filters pictorial emojis and symbols by default using a Unicode-aware regex filter.
Removes user-defined characters (e.g., typographical quotes
„“).Strips leading and trailing characters (e.g., spaces, punctuation) from sentences before synthesis.
Normalizes numbers by removing thousands separators (e.g.,
2.500->2500) to ensure correct pronunciation.
Real-time Feedback: Publishes the exact text chunk being synthesized to a separate ROS topic (
/text_speaking), allowing other nodes to synchronize with the speech output.Wide Model Support & Voice Cloning: Supports a vast range of Coqui models, including zero-shot voice cloning with XTTS.
Flexible Output: Optionally plays audio directly or saves it to a WAV file with automatic unique filename generation.
Hardware Acceleration: Supports both GPU (
cuda) and CPU inference.
Prerequisites
ROS 2 (Humble, Iron, or newer).
Python 3.8+
NVIDIA GPU with CUDA installed for GPU acceleration (optional but recommended for XTTS).
An audio output device.
System dependencies for
sounddeviceandlibsndfile.
# For Debian/Ubuntu-based systems
sudo apt-get update
sudo apt-get install libportaudio2 libasound-dev libsndfile1
Installation
Clone the Package: Clone this repository into your ROS 2 workspace’s
srcdirectory.Install Python Dependencies: This node requires the
regexlibrary for full Unicode support (e.g., filtering emojis). It is recommended to use a Python virtual environment.cd ~/ros2_ws # If using a virtual environment, activate it first pip install -r src/bob_coquitts/requirements.txt
The
requirements.txtfile should contain:TTS sounddevice numpy soundfile regex
Building
Source your ROS 2 installation and build the package using colcon.
cd ~/ros2_ws
source /opt/ros/humble/setup.bash
colcon build --packages-select bob_coquitts
Usage
After building, source the workspace’s setup.bash file. For detailed troubleshooting, launch the node with --log-level DEBUG.
source ~/ros2_ws/install/setup.bash
ros2 run bob_coquitts tts
Example 1: XTTS Voice Cloning with Text Cleaning
This example uses the powerful XTTS v2 model. We override the sentence_strip_chars parameter to only remove colons, which can cause unnatural-sounding audio.
ros2 run bob_coquitts tts --ros-args \
-p model_name:='tts_models/multilingual/multi-dataset/xtts_v2' \
-p reference_wav:='/path/to/your/voice.wav' \
-p language:='en' \
-p device:='cuda' \
-p sentence_strip_chars:="':'"
# In another terminal, publish text with a colon
ros2 topic pub --once /text std_msgs/msg/String "data: 'Here is my statement:'"
# In a third terminal, listen to the cleaned text being spoken
ros2 topic echo /text_speaking
# Output will be: data: Here is my statement
Example 2: Using Coqui’s Internal Splitter for Long Text
If you are feeding a large, unstructured block of text, it’s best to let Coqui handle the splitting.
ros2 run bob_coquitts tts --ros-args -p split_sentences:=True
# Publish a long paragraph
ros2 topic pub --once /text std_msgs/msg/String "data: 'This is the first sentence. This is the second sentence which is much longer and might exceed the character limit if not handled properly. Coquis splitter will take care of it.'"
ROS Interface
Subscribed Topics
Topic Name |
Message Type |
Description |
|---|---|---|
|
|
The text to be synthesized. The node buffers incoming text and processes it after a pause. |
Published Topics
Topic Name |
Message Type |
Description |
|---|---|---|
|
|
Publishes the cleaned, normalized sentence or chunk of text exactly as it is being sent to the TTS model. |
Parameters
Parameter Name |
Type |
Default Value |
Description |
|---|---|---|---|
General |
|||
|
string |
|
The Coqui TTS model to use. (env: |
|
string |
|
Language code for multi-lingual models (e.g., |
|
string |
|
Compute device for inference ( |
|
string |
|
Path to a reference WAV file for voice cloning. (env: |
Audio Output |
|||
|
integer |
|
Audio sample rate for playback. Must match the model’s native rate. (env: |
|
boolean |
|
If true, plays the generated audio directly. (env: |
|
string |
|
Path to save the output WAV file. (env: |
Text Processing |
|||
|
boolean |
|
Mode switch for splitting. If |
|
string |
|
Characters used for manual splitting (only when |
|
integer |
|
Max number of sentences to process at once in manual mode. (env: |
|
integer |
|
If a text chunk is shorter than this, a period is appended to stabilize TTS synthesis. Set to 0 to disable. (env: |
|
string |
|
Character to remove from between digits (e.g., |
|
string |
|
Characters to remove from the beginning and end of a processed text chunk. (env: |
|
string |
|
Specific characters to remove from the entire text. (env: |
|
string |
|
Regex to remove patterns from the entire text. Requires |
XTTS Tuning |
|||
|
double |
|
Controls randomness. Lower is more deterministic. (env: |
|
double |
|
Factor to penalize longer sequences. (env: |
|
double |
|
Penalty for repeating tokens. (env: |
|
integer |
|
Samples from the k most likely next tokens. (env: |
|
double |
|
Samples from tokens with a cumulative probability of p. (env: |