# ROS Package bob_moondream

This ROS 2 package provides a node that wraps vikhyatk/moondream2, a small but powerful vision-language model (VLM). The bob_moondream node subscribes to a raw image topic and exposes a flexible ROS 2 interface for various visual reasoning and perception tasks.

The node loads the model into memory on startup and processes the latest received image whenever a service is called or a prompt is published on a topic. It can caption images, answer visual questions, detect objects matching a text description, and locate the center point of a described object. It can also publish detection results and an annotated video feed for visualization.
## Overview

### Model Loading
On the first launch, the node will automatically download the model weights from the Hugging Face Hub and save them to a local cache directory. This process requires an internet connection and may take a few minutes depending on your network speed. The total download size is approximately 3-4 GB.
The download location is determined by the `cache_dir` parameter, which defaults to a `models` folder inside the directory where the node is launched.
When loaded onto a GPU (the default behavior), the model consumes approximately 4 GB of VRAM. Ensure your GPU has sufficient memory available.
Subsequent launches of the node will be much faster, as the model will be loaded directly from this local cache, allowing for offline operation after the initial download.
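If you want the cache in a fixed location rather than the launch directory, you can override the parameter at startup (the path below is only an example):

```bash
ros2 run bob_moondream moondream --ros-args -p cache_dir:=/opt/models/moondream
```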
## Dependencies

### ROS 2 Dependencies

This package depends on several standard ROS 2 message packages and a custom message package. Ensure they are available in your ROS 2 workspace.

- `std_msgs`
- `sensor_msgs`
- `vision_msgs`
- `bob_moondream_msgs`: This custom service message package must be located and built within the same workspace as `bob_moondream`.
### Python Dependencies
It is highly recommended to use a Python virtual environment (venv) to manage these dependencies and avoid conflicts with other projects or system packages.
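For example, a minimal setup might look like this (the venv path is an arbitrary choice):

```bash
python3 -m venv ~/.venvs/moondream
source ~/.venvs/moondream/bin/activate
```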
The node requires several Python packages to run the vision-language model. These are listed in `requirements.txt` and can be installed using pip:

```bash
pip install -r requirements.txt
```
This will install key libraries such as `transformers`, `torch`, and `Pillow`.
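After installation you can quickly verify that torch is importable and, if you plan to run on GPU, that CUDA is visible:

```bash
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```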
## Build Instructions

To build the `bob_moondream` package, follow these steps:
1. Create a ROS 2 Workspace: If you don’t have one already, create a new ROS 2 workspace.

   ```bash
   mkdir -p ~/ros2_ws/src
   cd ~/ros2_ws
   ```

2. Clone the Repositories: Clone this package and its message dependency into the `src` directory.

   ```bash
   cd src
   git clone https://github.com/bob-ros2/bob_moondream.git
   git clone https://github.com/bob-ros2/bob_moondream_msgs.git
   ```

3. Install Python Dependencies: Navigate to the `bob_moondream` package and install the required Python packages.

   ```bash
   cd bob_moondream
   pip install -r requirements.txt
   cd ../..
   ```

4. Build the Workspace: Use `colcon` to build the packages.

   ```bash
   colcon build
   ```

5. Source the Workspace: Before running any nodes, source your workspace’s setup file.

   ```bash
   source install/setup.bash
   ```
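To confirm the package is now visible to ROS 2, you can ask for its install prefix:

```bash
ros2 pkg prefix bob_moondream
```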
## Usage

### Prerequisites

Before launching the moondream node, ensure you have a source of images being published to the `image_input` topic. This can be from a real camera driver or a rosbag file.
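For the rosbag route, a replay might look like this (the bag name and original topic are placeholders for your own recording):

```bash
ros2 bag play my_camera_bag --remap /camera/image_raw:=/image_input
```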
#### Option 1: Using a Standard Camera Driver
For example, to run a USB camera node:
```bash
# You may need to install a camera driver first
# sudo apt install ros-humble-usb-cam

# Launch the camera node and remap its output to image_input
ros2 run usb_cam usb_cam_node_exe --ros-args -r image_raw:=image_input
```
#### Option 2: Using the Provided Launch File

This package includes a convenience launch file that starts a camera node and automatically handles the topic remapping for you. The launch file uses `topic_tools` to throttle the image stream for better performance, so ensure that package is installed first:
```bash
# For ROS 2 Humble
sudo apt install ros-humble-topic-tools
```
Once installed, you can run the launch file:
```bash
ros2 launch bob_moondream camera.launch.py
```
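Once the launch file is running, you can check that the throttled stream is publishing at the expected rate:

```bash
ros2 topic hz /camera/image_raw/throttled
```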
### Launching the Node
Once your workspace is sourced and an image feed is active, open a new terminal and run the following command to start the node.
Basic Launch with remapping:
```bash
ros2 run bob_moondream moondream --ros-args -r image_input:=/camera/image_raw/throttled -r __ns:=/vision
```
On the first run, this will trigger the model download, which can take several minutes.
Launch on CPU:
To run the model on the CPU, use the `device` parameter:

```bash
ros2 run bob_moondream moondream --ros-args -p device:=cpu
```
Once the node is running and has loaded the model, you can interact with it using the services or the interactive script.
### Real-time Detection and Pointing
You can enable continuous detection or pointing on the live image stream by setting the `prompt_detect` and `prompt_point` parameters at runtime. The node will then process every incoming image frame with the provided prompt.
To start real-time object detection:
Open a new terminal and use the `ros2 param set` command to provide a detection prompt. For example, to continuously look for a cup:

```bash
ros2 param set /moondream prompt_detect "the cup"
```
The node will now publish bounding boxes to `/detected_objects` and the annotated video to `/annotated_image`.
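To visualize the annotated feed, `rqt_image_view` works well (adjust the topic if you launched the node inside a namespace):

```bash
ros2 run rqt_image_view rqt_image_view /annotated_image
```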
To start real-time pointing: Similarly, you can ask the model to continuously point to a specific feature:

```bash
ros2 param set /moondream prompt_point "the tip of the pen"
```
The node will publish center points to `/pointed_objects`.
To stop real-time processing: To disable the continuous processing, simply set the parameter to an empty string:

```bash
ros2 param set /moondream prompt_detect ""
```
### Interactive Query Script

The package includes a convenient helper script, `query.sh`, for interacting with the node’s topic-based interface directly from the command line. This script allows you to send prompts and see the results in real-time without needing to write a separate node.
#### How to Use
To use the script, run it with `ros2 run` and specify the type of query you want to perform.

Syntax:

```bash
ros2 run bob_moondream query.sh <type> [<namespace>]
```

- `<type>`: The mode of interaction. Must be one of `visual_query`, `object_detection`, or `pointing`.
- `[<namespace>]`: (Optional) The ROS 2 namespace where the topics are located. Defaults to `/vision`.
#### Example
To start a visual query session, launch the moondream node and a camera feed. Then, in a new terminal, run the following command:
```
$ ros2 run bob_moondream query.sh visual_query
--- Listening for results on /vision/visual_query_result
--- Enter your prompt below (Press Ctrl+C to exit)
> what do you see
"A man is sitting at a desk, working on a computer. He is focused on his task..."
>
```

```
$ ros2 run bob_moondream query.sh object_detection
--- Listening for results on /vision/object_detection_result
--- Enter your prompt below (Press Ctrl+C to exit)
> mark the cups
'[{"x_min": 0.19673718698322773, "y_min": 0.6845761351287365, "x_max": 0.25443468801677227, "y_max": 0.8037051148712635}, {"x_min": 0.3479478508234024, "y_min": 0.6667342223227024, "x_max": 0.4118177741765976, "y_max": 0.7707657776772976}]'
>
```
## Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| | string | `vikhyatk/moondream2` | The name of the model on the Hugging Face Hub. |
| | string | | The specific model revision (git commit hash or tag) to use. |
| `cache_dir` | string | `models` | The local directory where the model will be downloaded and cached. |
| `frame_id` | string | | The frame ID to be used in the headers of published messages. |
| `device` | string | `cuda` | The device to run the model on (e.g. `cuda` for GPU, `cpu` for CPU). |
| `prompt_point` | string | `""` | If set, continuously runs pointing detection on every new image frame. |
| `prompt_detect` | string | `""` | If set, continuously runs object detection on every new image frame. |
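Putting it together, a typical launch that overrides several of these parameters might look like this (the values are only examples):

```bash
ros2 run bob_moondream moondream --ros-args \
  -p device:=cuda \
  -p cache_dir:=models \
  -p prompt_detect:="the cup"
```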
## Subscribers

| Topic | Message Type | Description |
|---|---|---|
| `image_input` | `sensor_msgs/msg/Image` | The main input topic for raw images that will be processed by the model. |
| `visual_query` | `std_msgs/msg/String` | Receives a text prompt to ask a question. The answer is published on `visual_query_result`. |
| `object_detection` | `std_msgs/msg/String` | Receives a text prompt to detect an object. The result is published on `object_detection_result`. |
| `pointing` | `std_msgs/msg/String` | Receives a text prompt to find a point. The result is published on `pointing_result`. |
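As a sketch of using this topic interface directly (assuming the prompt topic names above and the `/vision` namespace from the earlier launch example), you can publish a question from the command line:

```bash
ros2 topic pub --once /vision/visual_query std_msgs/msg/String "{data: 'what do you see'}"
```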
## Publishers

| Topic | Message Type | Description |
|---|---|---|
| `detected_objects` | `vision_msgs/msg/Detection2DArray` | Publishes bounding boxes of detected objects. |
| `pointed_objects` | `vision_msgs/msg/Detection2DArray` | Publishes points of interest (as zero-size bounding boxes). |
| `annotated_image` | `sensor_msgs/msg/Image` | Publishes the input image with detections drawn on it. |
| `visual_query_result` | `std_msgs/msg/String` | Publishes the text answer from a visual query. |
| `object_detection_result` | `std_msgs/msg/String` | Publishes object detection results as a JSON string. |
| `pointing_result` | `std_msgs/msg/String` | Publishes pointing results as a JSON string. |
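To watch the text results from the command line, echo the corresponding topic:

```bash
ros2 topic echo /vision/visual_query_result
```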
## Services

All services use types from the custom `bob_moondream_msgs` package.

| Service Name | Service Type | Description |
|---|---|---|
| | | Generates a caption for the latest image. |
| | | Asks a question (the request prompt) about the latest image. |
| | | Detects objects described by the prompt in the latest image. |
| | | Finds points of interest described by the prompt in the latest image. |
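Because the exact service names and types are defined by the node and the `bob_moondream_msgs` package, the simplest way to discover them is to list them while the node is running:

```bash
ros2 service list -t
```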