# ROS Package bob_moondream

This ROS 2 package provides a node that wraps vikhyatk/moondream2, a small but powerful vision-language model (VLM). The bob_moondream node subscribes to a raw image topic and exposes a flexible ROS 2 interface for various visual reasoning and perception tasks.

The node loads the model into memory on startup and processes the latest received image whenever a service is called or a prompt is published on a topic. It can caption images, answer visual questions, detect objects matching a text description, and locate the center point of a described object. It can also publish detection results and an annotated video feed for visualization.
## Overview

### Model Loading
On the first launch, the node will automatically download the model weights from the Hugging Face Hub and save them to a local cache directory. This process requires an internet connection and may take a few minutes depending on your network speed. The total download size is approximately 3-4 GB.
The download location is determined by the `cache_dir` parameter, which defaults to a `models` folder inside the directory where the node is launched.
When loaded onto a GPU (the default behavior), the model consumes approximately 4 GB of VRAM. Ensure your GPU has sufficient memory available.
Subsequent launches of the node will be much faster, as the model will be loaded directly from this local cache, allowing for offline operation after the initial download.
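If you want the cache in a fixed location rather than the launch directory, you can override the parameter at startup (the path below is only an example):

```bash
ros2 run bob_moondream moondream --ros-args -p cache_dir:=/opt/models/moondream
```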
## Dependencies

### ROS 2 Dependencies

This package depends on several standard ROS 2 message packages and a custom message package. Ensure they are available in your ROS 2 workspace.

- `std_msgs`
- `sensor_msgs`
- `vision_msgs`
- `bob_moondream_msgs`: This custom service message package must be located and built within the same workspace as `bob_moondream`.
### Python Dependencies
It is highly recommended to use a Python virtual environment (venv) to manage these dependencies and avoid conflicts with other projects or system packages.
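For example, a minimal setup might look like this (the venv path is an arbitrary choice):

```bash
python3 -m venv ~/.venvs/moondream
source ~/.venvs/moondream/bin/activate
```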
The node requires several Python packages to run the vision-language model. These are listed in `requirements.txt` and can be installed using pip:

```bash
pip install -r requirements.txt
```
This will install key libraries such as `transformers`, `torch`, and `Pillow`.
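After installation you can quickly verify that torch is importable and, if you plan to run on GPU, that CUDA is visible:

```bash
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```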
## Build Instructions

To build the `bob_moondream` package, follow these steps:
1. Create a ROS 2 Workspace: If you don’t have one already, create a new ROS 2 workspace.

   ```bash
   mkdir -p ~/ros2_ws/src
   cd ~/ros2_ws
   ```

2. Clone the Repositories: Clone this package and its message dependency into the `src` directory.

   ```bash
   cd src
   git clone https://github.com/bob-ros2/bob_moondream.git
   git clone https://github.com/bob-ros2/bob_moondream_msgs.git
   ```

3. Install Python Dependencies: Navigate to the `bob_moondream` package and install the required Python packages.

   ```bash
   cd bob_moondream
   pip install -r requirements.txt
   cd ../..
   ```

4. Build the Workspace: Use `colcon` to build the packages.

   ```bash
   colcon build
   ```

5. Source the Workspace: Before running any nodes, source your workspace’s setup file.

   ```bash
   source install/setup.bash
   ```
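To confirm the package is now visible to ROS 2, you can ask for its install prefix:

```bash
ros2 pkg prefix bob_moondream
```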
## Usage

### Prerequisites

Before launching the moondream node, ensure you have a source of images being published to the `image_input` topic. This can be from a real camera driver or a rosbag file.
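For the rosbag route, a replay might look like this (the bag name and original topic are placeholders for your own recording):

```bash
ros2 bag play my_camera_bag --remap /camera/image_raw:=/image_input
```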
#### Option 1: Using a Standard Camera Driver
For example, to run a USB camera node:
```bash
# You may need to install a camera driver first
# sudo apt install ros-humble-usb-cam

# Launch the camera node and remap its output to image_input
ros2 run usb_cam usb_cam_node_exe --ros-args -r image_raw:=image_input
```
#### Option 2: Using the Provided Launch File

This package includes a convenience launch file that starts a camera node and automatically handles the topic remapping for you. The launch file uses `topic_tools` to throttle the image stream for better performance, so ensure that package is installed first:
```bash
# For ROS 2 Humble
sudo apt install ros-humble-topic-tools
```
Once installed, you can run the launch file:
```bash
ros2 launch bob_moondream camera.launch.py
```
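Once the launch file is running, you can check that the throttled stream is publishing at the expected rate:

```bash
ros2 topic hz /camera/image_raw/throttled
```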
### Launching the Node
Once your workspace is sourced and an image feed is active, open a new terminal and run the following command to start the node.
Basic Launch with remapping:
```bash
ros2 run bob_moondream moondream --ros-args -r image_input:=/camera/image_raw/throttled -r __ns:=/vision
```
On the first run, this will trigger the model download, which can take several minutes.
Launch on CPU:
To run the model on the CPU, use the `device` parameter:

```bash
ros2 run bob_moondream moondream --ros-args -p device:=cpu
```
Once the node is running and has loaded the model, you can interact with it using the services or the interactive script.
### Real-time Detection and Pointing
You can enable continuous detection or pointing on the live image stream by setting the `prompt_detect` and `prompt_point` parameters at runtime. The node will then process every incoming image frame with the provided prompt.
To start real-time object detection:
Open a new terminal and use the `ros2 param set` command to provide a detection prompt. For example, to continuously look for a cup:

```bash
ros2 param set /moondream prompt_detect "the cup"
```
The node will now publish bounding boxes to `/detected_objects` and the annotated video to `/annotated_image`.
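To visualize the annotated feed, `rqt_image_view` works well (adjust the topic if you launched the node inside a namespace):

```bash
ros2 run rqt_image_view rqt_image_view /annotated_image
```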
To start real-time pointing: Similarly, you can ask the model to continuously point to a specific feature:

```bash
ros2 param set /moondream prompt_point "the tip of the pen"
```
The node will publish center points to `/pointed_objects`.
To stop real-time processing: To disable the continuous processing, simply set the parameter to an empty string:

```bash
ros2 param set /moondream prompt_detect ""
```
### Interactive Query Script

The package includes a convenient helper script, `query.sh`, for interacting with the node’s topic-based interface directly from the command line. This script allows you to send prompts and see the results in real-time without needing to write a separate node.
#### How to Use
To use the script, run it with `ros2 run` and specify the type of query you want to perform.

Syntax:

```bash
ros2 run bob_moondream query.sh <type> [<namespace>]
```

- `<type>`: The mode of interaction. Must be one of `visual_query`, `object_detection`, or `pointing`.
- `[<namespace>]`: (Optional) The ROS 2 namespace where the topics are located. Defaults to `/vision`.
#### Example
To start a visual query session, launch the moondream node and a camera feed. Then, in a new terminal, run the following command:
```
$ ros2 run bob_moondream query.sh visual_query
--- Listening for results on /vision/visual_query_result
--- Enter your prompt below (Press Ctrl+C to exit)
> what do you see
"A man is sitting at a desk, working on a computer. He is focused on his task..."
>
```

```
$ ros2 run bob_moondream query.sh object_detection
--- Listening for results on /vision/object_detection_result
--- Enter your prompt below (Press Ctrl+C to exit)
> mark the cups
'[{"x_min": 0.19673718698322773, "y_min": 0.6845761351287365, "x_max": 0.25443468801677227, "y_max": 0.8037051148712635}, {"x_min": 0.3479478508234024, "y_min": 0.6667342223227024, "x_max": 0.4118177741765976, "y_max": 0.7707657776772976}]'
>
```
## Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| | string | `vikhyatk/moondream2` | The name of the model on the Hugging Face Hub. |
| | string | | The specific model revision (git commit hash or tag) to use. |
| `cache_dir` | string | `models` | The local directory where the model will be downloaded and cached. |
| `frame_id` | string | | The frame ID to be used in the headers of published messages. |
| `device` | string | `cuda` | The device to run the model on (e.g. `cuda` for GPU, `cpu` for CPU). |
| `prompt_point` | string | `""` | If set, continuously runs pointing detection on every new image frame. |
| `prompt_detect` | string | `""` | If set, continuously runs object detection on every new image frame. |
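Putting it together, a typical launch that overrides several of these parameters might look like this (the values are only examples):

```bash
ros2 run bob_moondream moondream --ros-args \
  -p device:=cuda \
  -p cache_dir:=models \
  -p prompt_detect:="the cup"
```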
## Subscribers

| Topic | Message Type | Description |
|---|---|---|
| `image_input` | `sensor_msgs/msg/Image` | The main input topic for raw images that will be processed by the model. |
| `visual_query` | `std_msgs/msg/String` | Receives a text prompt to ask a question. The answer is published on `visual_query_result`. |
| `object_detection` | `std_msgs/msg/String` | Receives a text prompt to detect an object. The result is published on `object_detection_result`. |
| `pointing` | `std_msgs/msg/String` | Receives a text prompt to find a point. The result is published on `pointing_result`. |
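As a sketch of using this topic interface directly (assuming the prompt topic names above and the `/vision` namespace from the earlier launch example), you can publish a question from the command line:

```bash
ros2 topic pub --once /vision/visual_query std_msgs/msg/String "{data: 'what do you see'}"
```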
## Publishers

| Topic | Message Type | Description |
|---|---|---|
| `detected_objects` | `vision_msgs/msg/Detection2DArray` | Publishes bounding boxes of detected objects. |
| `pointed_objects` | `vision_msgs/msg/Detection2DArray` | Publishes points of interest (as zero-size bounding boxes). |
| `annotated_image` | `sensor_msgs/msg/Image` | Publishes the input image with detections drawn on it. |
| `visual_query_result` | `std_msgs/msg/String` | Publishes the text answer from a visual query. |
| `object_detection_result` | `std_msgs/msg/String` | Publishes object detection results as a JSON string. |
| `pointing_result` | `std_msgs/msg/String` | Publishes pointing results as a JSON string. |
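To watch the text results from the command line, echo the corresponding topic:

```bash
ros2 topic echo /vision/visual_query_result
```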
## Services

All services use types from the custom `bob_moondream_msgs` package.

| Service Name | Service Type | Description |
|---|---|---|
| | | Generates a caption for the latest image. |
| | | Asks a question (the request prompt) about the latest image. |
| | | Detects objects described by the prompt in the latest image. |
| | | Finds points of interest described by the prompt in the latest image. |
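Because the exact service names and types are defined by the node and the `bob_moondream_msgs` package, the simplest way to discover them is to list them while the node is running:

```bash
ros2 service list -t
```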