Context
This project focused on building an integrated vision-language manipulation system capable of executing real-world tasks from spoken natural language commands.
When a user issued a command, for example:
“Pick the blue bottle and pour 300 grams.”
The robot autonomously executed the complete sequence:
- Identify the correct bottle
- Grasp the bottle
- Open the cap
- Rotate toward the target container
- Pour using real-time weight feedback
- Stop precisely at the requested weight
- Close the bottle
- Return it to its original position
All steps were performed continuously without human intervention.
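The sequence above can be sketched as an ordered skill pipeline. This is an illustrative stand-in: the skill names and the `robot.run_skill` interface are hypothetical, not the deployed API.

```python
# Illustrative sketch of the task sequence as an ordered skill pipeline.
# Skill names and the robot interface are hypothetical stand-ins for the
# real dual-arm control stack.

TASK_SEQUENCE = [
    "identify_bottle",
    "grasp_bottle",
    "open_cap",
    "rotate_to_container",
    "pour_to_target_weight",   # pour + stop at the requested weight
    "close_cap",
    "return_to_origin",
]

def execute_task(robot, target_weight_g):
    """Run each skill in order; abort the sequence on the first failure."""
    for skill in TASK_SEQUENCE:
        ok = robot.run_skill(skill, target_weight_g=target_weight_g)
        if not ok:
            return False  # no human intervention, but fail safely
    return True
```

Running the whole list in one loop mirrors the continuous, intervention-free execution described above.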
My Contribution
I worked on system integration and learning-based manipulation modules, including:
- Designing modular task-level skills for dual-arm coordination
- Integrating perception with motion planning for bottle manipulation
- Co-developing learning-based control for precise pouring
- Integrating speech understanding and language grounding into executable robot actions
- Building validation logic to ensure safe and feasible task execution
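The validation logic mentioned above might look like the following minimal sketch, assuming a task plan is a dict of the form `{"object": ..., "action": ..., "amount_g": ...}`. The specific checks, schema, and the 1000 g limit are illustrative assumptions, not the deployed rules.

```python
# Minimal sketch of pre-execution plan validation. The plan schema, the
# action vocabulary, and the safety limit are all hypothetical.

KNOWN_ACTIONS = {"pick", "pour", "place"}
MAX_POUR_G = 1000  # hypothetical safety limit

def validate_plan(plan, detected_objects):
    """Return (ok, reason); reject infeasible or unsafe plans before execution."""
    if plan.get("action") not in KNOWN_ACTIONS:
        return False, f"unknown action: {plan.get('action')}"
    if plan.get("object") not in detected_objects:
        return False, f"object not visible: {plan.get('object')}"
    amount = plan.get("amount_g", 0)
    if not 0 < amount <= MAX_POUR_G:
        return False, f"amount out of range: {amount}"
    return True, "ok"
```

Rejecting a plan before any motion starts is what keeps a misheard or infeasible command from reaching the arms.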
Technical Overview
The system combined perception, language understanding, learning-based control, and classical motion planning:
- Multimodal Perception: The robot used onboard cameras to identify bottles and determine object states relevant to manipulation.
- Language Grounding: Spoken instructions were converted into structured task plans using a large language model. These plans were verified before execution.
- Dual-Arm Manipulation: The robot coordinated both arms to grasp, open, pour, and close objects in a continuous sequence.
- Learning-Based Control: An RL policy enabled smooth and precise pouring using real-time weight feedback, allowing the robot to stop accurately at the requested amount.
- Hybrid Skill Composition: The system combined learned behaviors with deterministic motion planning and state-machine-based task management.
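A language-grounding step of the kind described above maps the example command to a structured plan. The deployed system prompted a large language model; the regex parser and the plan schema below are illustrative stand-ins that only show the shape of the output.

```python
import re

# Toy grounding example: map a spoken command to a structured task plan.
# The real system used an LLM; this regex stand-in only illustrates the
# target plan schema, which is itself an assumption.

def ground_command(text):
    """Return a plan dict for commands like 'Pick the X and pour N grams', else None."""
    m = re.search(r"pick the (?P<obj>[\w ]+?) and pour (?P<amt>\d+) grams",
                  text.lower())
    if m is None:
        return None
    return {
        "object": m.group("obj"),
        "action": "pour",
        "amount_g": int(m.group("amt")),
    }
```

Producing a machine-checkable plan rather than free text is what allows the verification step to run before any motion is executed.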
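The closed-loop pouring behavior can be illustrated with a simple proportional stand-in for the learned policy: the pour rate shrinks as the scale reading approaches the target, and the pour stops inside a tolerance band. The real system used an RL policy; the gain and tolerance values here are assumptions chosen only to show the loop structure.

```python
# Simplified stand-in for the learned pouring controller: a proportional
# policy over real-time weight feedback. Gain (50 g ramp) and tolerance
# are illustrative, not the deployed parameters.

def pour_step(current_g, target_g, tol_g=5.0, max_rate=1.0):
    """Map the remaining weight to a pour-rate command in [0, max_rate]."""
    remaining = target_g - current_g
    if remaining <= tol_g:
        return 0.0  # stop: within tolerance of the requested weight
    # slow down proportionally over the final 50 g of the pour
    return min(max_rate, remaining / 50.0)
```

Calling this at the scale's sampling rate yields a pour that is fast far from the target and tapers off near it, which is how a weight-feedback loop can stop within a few grams.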
Results
- Demonstrated reliable execution of language-conditioned manipulation tasks
- Achieved precise weight-based pouring (±5 g) using closed-loop feedback
- Showcased coordinated dual-arm manipulation in a real-world setting
- Validated integration of speech, vision, planning, and learning in a deployable robotic system