Natural-Language and Vision Control (VLM / LLM)#
Movensys Intelligence is an application layer on top of WMX ROS2 that lets a robot be driven by natural language and vision. A vision-language model (VLM) and large language model (LLM) interpret camera images and spoken or typed instructions, then issue motion commands to the robot through a FastAPI service that bridges to the WMX ROS2 / MoveIt2 stack.
Unlike the other entries in this section, this is not a motion-planning
backend. It runs above a planning backend (MoveIt2) and calls the
/wmx/moveit2/* services to move the arm.
Architecture#
flowchart LR
USER["Voice / text<br/>instruction"]
WHISPER["Whisper STT"]
VLM["VLM / LLM<br/>(vLLM, OpenAI API)"]
MEM["Memory<br/>(Qdrant vector DB)"]
API["FastAPI service<br/>Movensys Manipulator API"]
BRIDGE["ROS2 bridge node"]
WMX["WMX ROS2 / MoveIt2"]
ROBOT["Robot + cameras"]
YOLO["YOLO detectors"]
USER --> WHISPER --> API
USER --> API
API --> VLM
VLM <--> MEM
VLM --> API
API --> BRIDGE --> WMX --> ROBOT
ROBOT --> YOLO --> BRIDGE
BRIDGE --> API
Movensys Intelligence — perception and language to motion#
The flow is: a spoken instruction is transcribed by Whisper, or text is sent directly. The VLM/LLM receives the instruction together with the latest camera images and any relevant context retrieved from the memory store, and produces a motion command. The FastAPI service forwards that command through its ROS2 bridge to the WMX ROS2 / MoveIt2 services, which execute the motion on the robot. Perception nodes (YOLO detectors) publish detected object poses back through the bridge.
Components#
Component |
Role |
|---|---|
FastAPI service ( |
“Movensys Manipulator API” — hosts the REST + WebSocket API, web UI, and the ROS2 bridge node |
vLLM |
Serves the VLM/LLM behind an OpenAI-compatible endpoint
( |
Whisper |
Speech-to-text for voice instructions
( |
Qdrant vector DB |
Long-term memory / retrieval for the language model
( |
Arize Phoenix (optional) |
Tracing and observability for model calls ( |
ROS2 bridge ( |
Subscribes to robot state, camera, and perception topics and calls the WMX ROS2 / MoveIt2 motion services |
Connection to WMX ROS2#
The ROS2 bridge node talks to the WMX ROS2 / MoveIt2 stack through these interfaces:
Interface |
Direction |
Purpose |
|---|---|---|
|
service client |
Read the current end-effector pose |
|
service client |
Move to an absolute Cartesian pose (base frame) |
|
service client |
Move by a relative Cartesian offset (base frame) |
|
service client |
Move by a relative Cartesian offset (tool frame) |
|
service client |
Joint-space moves |
|
service client |
Open/close the gripper ( |
|
subscribe |
Joint feedback for the live UI |
|
subscribe |
RGB, depth, and camera info for the VLM and UI |
|
subscribe |
Detected object poses from the YOLO detectors |
The custom service types (GetEefPose, MovePose, MoveJoints) come
from the movensys_manipulator_moveit_config package.
API Surface#
The FastAPI service exposes (among others):
Language —
POST /api/vlm/infer,GET /api/vlm/system_prompt,GET /api/vlm/memory,POST /api/whisper/transcribeMotion —
POST /api/move/absolute_cartesian_base,/api/move/relative_cartesian_base,/api/move/relative_cartesian_tool,/api/move/joint_absolute,/api/move/joint_relative,POST /api/services/gripperState streams — WebSocket endpoints under
/api/stream/*for end-effector pose, joint states, and camera feedsWeb UI —
/camerasand/vlmpages, with OpenAPI docs at/docs
Running It#
Movensys Intelligence runs as a set of Docker Compose services. Select your
accelerator with the XPU_CORE environment variable
(nvidia-gpu or intel-xpu), then bring up the stack:
export XPU_CORE=nvidia-gpu # or intel-xpu
cd ~/workspaces/movensys-intelligence/movensys_vlm/docker
# 1. Serve the VLM/LLM with vLLM
COMPOSE_PROFILES=$XPU_CORE docker compose -f vllm.yaml up -d --build
# 2. Start the vector-DB memory
COMPOSE_PROFILES=$CPU_ARCH docker compose -f vectordb.yaml up -d --build
# 3. Start the FastAPI service (+ Whisper)
COMPOSE_PROFILES=$XPU_CORE docker compose -f movensys_vlm.yaml up -d --build
COMPOSE_PROFILES=$XPU_CORE docker compose -f whisper.yaml up -d --build
Wait until the vLLM container logs application startup complete before
sending requests. A pick-and-place example is included in the repository:
cd ~/workspaces/movensys-intelligence
python3 movensys_sample/movensys_robopoly/pick_and_place.py red_cube GO true
Note
The Movensys Intelligence services connect to a running WMX ROS2 / MoveIt2 session. Bring up the robot (or simulation) first — see MoveIt2 Motion Planning and Testing WMX ROS2.
Supported accelerators include NVIDIA desktop GPUs and Jetson Thor
(nvidia-gpu) and Intel B60 / Panther Lake (intel-xpu). See the
Movensys Intelligence repository for the full setup
guide and the latest model configuration.