ROS Package bob_coquitts

This ROS 2 package provides a node that wraps the Coqui TTS library so that a ROS 2 system can convert text into speech. It buffers incoming text streams until a pause, supports voice cloning with models such as XTTS, and exposes its behavior through ROS parameters.

Features

  • Intelligent Text Buffering: Waits for pauses in the incoming text stream before processing to ensure complete thoughts are synthesized.

  • Dual Splitting Modes: Choose between two modes for sentence splitting via the split_sentences parameter:

    1. Manual Mode (Default): Use custom delimiters (sentence_delimiters) to precisely control how text is split into sentences.

    2. Automatic Mode: Let Coqui’s powerful internal splitter handle long, unstructured text blocks.

  • Advanced Text Normalization:

    • Automatically filters pictorial emojis and symbols by default using a Unicode-aware regex filter.

    • Removes user-defined characters (e.g., typographical quotes „“).

    • Strips leading and trailing characters (e.g., spaces, punctuation) from sentences before synthesis.

    • Normalizes numbers by removing thousands separators (e.g., 2.500 -> 2500) to ensure correct pronunciation.

  • Real-time Feedback: Publishes the exact text chunk being synthesized to a separate ROS topic (/text_speaking), allowing other nodes to synchronize with the speech output.

  • Wide Model Support & Voice Cloning: Supports a vast range of Coqui models, including zero-shot voice cloning with XTTS.

  • Flexible Output: Optionally plays audio directly or saves it to a WAV file with automatic unique filename generation.

  • Hardware Acceleration: Supports both GPU (cuda) and CPU inference.
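
The "automatic unique filename generation" under Flexible Output is not specified further in this README. One plausible scheme, shown here purely as a sketch, is to append a numeric suffix when the requested file already exists. The function name unique_wav_path and the _1, _2 suffix format are assumptions made for illustration, not the node's confirmed behavior.

```python
import tempfile
from pathlib import Path


def unique_wav_path(requested: str) -> Path:
    """Return `requested` if it is free, otherwise add _1, _2, ... before the suffix."""
    path = Path(requested)
    candidate = path
    counter = 1
    while candidate.exists():
        # Hypothetical naming scheme: speech.wav -> speech_1.wav -> speech_2.wav
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        counter += 1
    return candidate


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "speech.wav")
    names = []
    for _ in range(3):
        chosen = unique_wav_path(target)
        chosen.touch()  # stand-in for actually writing the audio
        names.append(chosen.name)

print(names)  # ['speech.wav', 'speech_1.wav', 'speech_2.wav']
```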

Prerequisites

  • ROS 2 (Humble, Iron, or newer).

  • Python 3.8+

  • NVIDIA GPU with CUDA installed for GPU acceleration (optional but recommended for XTTS).

  • An audio output device.

  • System dependencies for sounddevice and libsndfile.

# For Debian/Ubuntu-based systems
sudo apt-get update
sudo apt-get install libportaudio2 libasound-dev libsndfile1

Installation

  1. Clone the Package: Clone this repository into your ROS 2 workspace’s src directory.

  2. Install Python Dependencies: This node requires the regex library for full Unicode support (e.g., filtering emojis). It is recommended to use a Python virtual environment.

    cd ~/ros2_ws
    # If using a virtual environment, activate it first
    pip install -r src/bob_coquitts/requirements.txt
    

    The requirements.txt file should contain:

    TTS
    sounddevice
    numpy
    soundfile
    regex
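
The reason regex is listed separately is that Python's built-in re module does not accept Unicode property classes such as \p{Emoji_Presentation}. The following sketch shows how the default text_filter_regex pattern could be applied with the regex package. The helper name strip_emoji and the no-op fallback when regex is missing are assumptions for illustration.

```python
try:
    import regex  # third-party package: pip install regex
    HAVE_REGEX = True
except ImportError:
    # The stdlib `re` module rejects \p{...}, so without `regex` we skip filtering.
    regex = None
    HAVE_REGEX = False

# Default value of the text_filter_regex parameter as documented in this README.
DEFAULT_FILTER = r"[\p{Emoji_Presentation}\p{Extended_Pictographic}]"


def strip_emoji(text: str, pattern: str = DEFAULT_FILTER) -> str:
    """Remove every match of `pattern` from `text` (hypothetical helper)."""
    if not HAVE_REGEX:
        return text
    return regex.sub(pattern, "", text)


print(repr(strip_emoji("All systems ready \U0001F600")))
```

With regex installed, the emoji is removed and the rest of the sentence is left untouched; without it, the text passes through unchanged.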
    

Building

Source your ROS 2 installation and build the package using colcon.

cd ~/ros2_ws
source /opt/ros/humble/setup.bash
colcon build --packages-select bob_coquitts

Usage

After building, source the workspace’s setup.bash file. For detailed troubleshooting, append --ros-args --log-level DEBUG to the ros2 run command.

source ~/ros2_ws/install/setup.bash
ros2 run bob_coquitts tts

Example 1: XTTS Voice Cloning with Text Cleaning

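This example uses the XTTS v2 model with a reference recording for voice cloning. It overrides the sentence_strip_chars parameter so that only colons are stripped from the start and end of each chunk, because a trailing colon can produce unnatural-sounding audio.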

ros2 run bob_coquitts tts --ros-args \
-p model_name:='tts_models/multilingual/multi-dataset/xtts_v2' \
-p reference_wav:='/path/to/your/voice.wav' \
-p language:='en' \
-p device:='cuda' \
-p sentence_strip_chars:="':'"

# In another terminal, publish text with a colon
ros2 topic pub --once /text std_msgs/msg/String "data: 'Here is my statement:'"

# In a third terminal, listen to the cleaned text being spoken
ros2 topic echo /text_speaking
# Output will be: data: Here is my statement
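
The cleaning that turns "Here is my statement:" into "Here is my statement" can be approximated in plain Python. This is a minimal sketch built from the parameter descriptions in this README, not the node's actual code: the function name normalize_chunk and the order of the steps are assumptions, and the separator rule simply removes the character wherever it sits between two digits.

```python
import re


def normalize_chunk(
    text: str,
    filter_chars: str = "„”‘“’*—#<>",
    thousands_separator: str = ".",
    strip_chars: str = ".,:!?",
    min_char_length: int = 3,
) -> str:
    """Approximate the documented text normalization (assumed step order)."""
    # 1. Drop every occurrence of the user-defined filter characters.
    text = text.translate({ord(ch): None for ch in filter_chars})
    # 2. Remove the separator when it sits between two digits (2.500 -> 2500).
    #    Note that this simple rule would also turn 3.14 into 314.
    if thousands_separator:
        sep = re.escape(thousands_separator)
        text = re.sub(rf"(?<=\d){sep}(?=\d)", "", text)
    # 3. Strip whitespace and the configured characters from both ends.
    text = text.strip(" \t\n" + strip_chars)
    # 4. Pad very short chunks with a period; 0 disables the padding.
    if text and len(text) < min_char_length:
        text += "."
    return text


print(normalize_chunk("Here is my statement:", strip_chars=":"))  # Here is my statement
print(normalize_chunk("It costs 2.500 euros."))  # It costs 2500 euros
```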

Example 2: Using Coqui’s Internal Splitter for Long Text

If you are feeding a large, unstructured block of text, it’s best to let Coqui handle the splitting.

ros2 run bob_coquitts tts --ros-args -p split_sentences:=True

# Publish a long paragraph
ros2 topic pub --once /text std_msgs/msg/String "data: 'This is the first sentence. This is the second sentence which is much longer and might exceed the character limit if not handled properly. The Coqui splitter will take care of it.'"
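
For contrast, manual mode (split_sentences set to False) splits on the characters in sentence_delimiters and sends at most sentences_max sentences per chunk. The sketch below is one way such a splitter could work. The function name split_manual and the handling of repeated delimiters (a run such as "..." stays attached to its sentence) are assumptions, not the node's confirmed logic.

```python
from typing import List


def split_manual(text: str, delimiters: str = ".!?\n", sentences_max: int = 1) -> List[str]:
    """Split on delimiter characters, then group sentences into chunks."""
    sentences: List[str] = []
    start = 0
    for i, ch in enumerate(text):
        at_end = i + 1 == len(text)
        # Close a sentence only at the last character of a delimiter run.
        if ch in delimiters and (at_end or text[i + 1] not in delimiters):
            sentence = text[start:i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # trailing text without a closing delimiter

    size = max(1, sentences_max)
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]


print(split_manual("First one. Second one! Third?"))
# ['First one.', 'Second one!', 'Third?']
print(split_manual("First one. Second one! Third?", sentences_max=2))
# ['First one. Second one!', 'Third?']
```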

ROS Interface

Subscribed Topics

| Topic Name | Message Type | Description |
|---|---|---|
| /text | std_msgs/msg/String | The text to be synthesized. The node buffers incoming text and processes it after a pause. |
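
The buffering behavior can be pictured as a small debounce buffer: fragments accumulate, and the combined text is released only after no new fragment has arrived for a while. The following sketch is illustrative only. The class name PauseBuffer and the pause_seconds value are hypothetical, since this README does not document how long the node waits.

```python
import time
from typing import Callable, List, Optional


class PauseBuffer:
    """Collect text fragments and release them after a quiet period."""

    def __init__(self, pause_seconds: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._pause = pause_seconds  # hypothetical value, not a documented parameter
        self._clock = clock          # injectable so the logic can be tested
        self._parts: List[str] = []
        self._last_arrival: Optional[float] = None

    def add(self, text: str) -> None:
        """Store a fragment and restart the quiet-period timer."""
        self._parts.append(text)
        self._last_arrival = self._clock()

    def poll(self) -> Optional[str]:
        """Return the buffered text once the stream has paused, else None."""
        if self._last_arrival is None:
            return None
        if self._clock() - self._last_arrival < self._pause:
            return None
        text = "".join(self._parts)
        self._parts.clear()
        self._last_arrival = None
        return text
```

In a node, add would typically be called from the /text subscription callback and poll from a periodic timer, so that a stream of partial messages is synthesized as one complete thought.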

Published Topics

| Topic Name | Message Type | Description |
|---|---|---|
| /text_speaking | std_msgs/msg/String | Publishes the cleaned, normalized sentence or chunk of text exactly as it is being sent to the TTS model. |
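
Another node can follow the speech output by subscribing to /text_speaking. The sketch below assumes rclpy and std_msgs are available from a sourced ROS 2 environment. The node name speech_monitor and the helper format_spoken are illustrative names, not part of this package.

```python
def format_spoken(text: str) -> str:
    """Build the log line for a chunk that is currently being spoken."""
    return f"[speaking] {text}"


def main() -> None:
    # ROS imports are deferred so the helper above stays importable without ROS.
    import rclpy
    from rclpy.node import Node
    from std_msgs.msg import String

    rclpy.init()
    node = Node("speech_monitor")  # illustrative node name

    def on_text(msg: String) -> None:
        node.get_logger().info(format_spoken(msg.data))

    # Queue depth 10 is a common default, chosen here without special reason.
    node.create_subscription(String, "/text_speaking", on_text, 10)
    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        pass
    finally:
        node.destroy_node()
        if rclpy.ok():
            rclpy.shutdown()
```

Calling main() from a script in a sourced workspace should log each chunk as the TTS node announces it, which is a convenient hook for subtitles or for pausing a microphone while the robot speaks.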

Parameters

General

| Parameter Name | Type | Default Value | Description |
|---|---|---|---|
| model_name | string | tts_models/en/ljspeech/vits | The Coqui TTS model to use. (env: COQUITTS_MODEL_NAME) |
| language | string | '' | Language code for multi-lingual models (e.g., en, de). (env: COQUITTS_LANGUAGE) |
| device | string | cpu | Compute device for inference (cuda or cpu). (env: COQUITTS_DEVICE) |
| reference_wav | string | '' | Path to a reference WAV file for voice cloning. (env: COQUITTS_REFERENCE_WAV) |

Audio Output

| Parameter Name | Type | Default Value | Description |
|---|---|---|---|
| sample_rate | integer | 24000 | Audio sample rate for playback. Must match the model’s native rate. (env: COQUITTS_SAMPLE_RATE) |
| play_audio | boolean | True | If true, plays the generated audio directly. (env: COQUITTS_PLAY_AUDIO) |
| output_wav_path | string | '' | Path to save the output WAV file. (env: COQUITTS_OUTPUT_WAV_PATH) |

Text Processing

| Parameter Name | Type | Default Value | Description |
|---|---|---|---|
| split_sentences | boolean | False | Mode switch for splitting. If True, Coqui handles splitting. If False (default), the node uses manual splitting below. (env: COQUITTS_SPLIT_SENTENCES) |
| sentence_delimiters | string | .!?\n | Characters used for manual splitting (only when split_sentences is False). (env: COQUITTS_SENTENCE_DELIMITERS) |
| sentences_max | integer | 1 | Max number of sentences to process at once in manual mode. (env: COQUITTS_SENTENCES_MAX) |
| min_char_length_for_synthesis | integer | 3 | If a text chunk is shorter than this, a period is appended to stabilize TTS synthesis. Set to 0 to disable. (env: COQUITTS_MIN_CHAR_LENGTH) |
| number_thousands_separator | string | . | Character to remove from between digits (e.g., . in 1.234). (env: COQUITTS_NUMBER_THOUSANDS_SEPARATOR) |
| sentence_strip_chars | string | .,:!? | Characters to remove from the beginning and end of a processed text chunk. (env: COQUITTS_SENTENCE_STRIP_CHARS) |
| text_filter_chars | string | „”‘“’*—#<> | Specific characters to remove from the entire text. (env: COQUITTS_TEXT_FILTER_CHARS) |
| text_filter_regex | string | [\p{Emoji_Presentation}\p{Extended_Pictographic}] | Regex to remove patterns from the entire text. Requires regex pip package. Default filters emojis. (env: COQUITTS_TEXT_FILTER_REGEX) |

XTTS Tuning

| Parameter Name | Type | Default Value | Description |
|---|---|---|---|
| temperature | double | 0.2 | Controls randomness. Lower is more deterministic. (env: COQUITTS_TEMPERATURE) |
| length_penalty | double | 1.0 | Factor to penalize longer sequences. (env: COQUITTS_LENGTH_PENALTY) |
| repetition_penalty | double | 2.0 | Penalty for repeating tokens. (env: COQUITTS_REPETITION_PENALTY) |
| top_k | integer | 40 | Samples from the k most likely next tokens. (env: COQUITTS_TOP_K) |
| top_p | double | 0.9 | Samples from tokens with a cumulative probability of p. (env: COQUITTS_TOP_P) |
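
Each parameter lists an environment variable. The README does not state the precedence, so the sketch below assumes the common pattern in which the environment variable supplies the parameter's default and an explicit -p override on the command line still wins. The helper name env_default and its type-coercion rules are hypothetical.

```python
import os
from typing import Mapping, Union

ParamValue = Union[str, bool, int, float]


def env_default(name: str, fallback: ParamValue,
                environ: Mapping[str, str] = os.environ) -> ParamValue:
    """Read `name` from the environment, coerced to the type of `fallback`."""
    raw = environ.get(name)
    if raw is None:
        return fallback
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(fallback, bool):
        return raw.strip().lower() in ("1", "true", "yes", "on")
    if isinstance(fallback, int):
        return int(raw)
    if isinstance(fallback, float):
        return float(raw)
    return raw


# Example with an explicit mapping instead of the real process environment.
example_env = {"COQUITTS_DEVICE": "cuda", "COQUITTS_PLAY_AUDIO": "false", "COQUITTS_TOP_K": "50"}
print(env_default("COQUITTS_DEVICE", "cpu", example_env))      # cuda
print(env_default("COQUITTS_PLAY_AUDIO", True, example_env))   # False
print(env_default("COQUITTS_TOP_K", 40, example_env))          # 50
print(env_default("COQUITTS_TOP_P", 0.9, example_env))         # 0.9
```

Under that assumption, a node could pass the result as the default value when declaring each parameter, for example declare_parameter("device", env_default("COQUITTS_DEVICE", "cpu")).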