Language-Conditioned Dual-Arm Robotic Pouring System

A multimodal robotic manipulation system that executes natural language pouring tasks with precise weight control.

Context

This project focused on building an integrated vision-language manipulation system capable of executing real-world tasks from spoken natural language commands.

Fig 1. Experimental setup featuring an RB-Y1 dual-arm mobile manipulator operating next to an L-shaped table. Color-coded bottles filled with marbles sat on one side, while a long cylindrical container on a weight scale sat on the other.

When a user issued a command, for example:

“Pick the blue bottle and pour 300 grams.”

the robot autonomously executed the complete sequence:

  1. Identify the correct bottle
  2. Grasp the bottle
  3. Open the cap
  4. Rotate toward the target container
  5. Pour using real-time weight feedback
  6. Stop precisely at the requested weight
  7. Close the bottle
  8. Return it to its original position

All steps were performed continuously without human intervention.
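The eight steps above can be sketched as an ordered skill pipeline that runs to completion or aborts on the first failure. This is a minimal illustration; the `Skill` abstraction, skill names, and stub success checks are assumptions, not the project's actual API (pouring and stopping at the target weight are merged into one skill here).

```python
# Minimal sketch of the task sequence as an ordered skill pipeline.
# The Skill abstraction and skill names are illustrative, not the project's API.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Skill:
    name: str
    run: Callable[[Dict], bool]  # returns True on success

def execute_sequence(skills: List[Skill], ctx: Dict) -> bool:
    """Run skills in order without human intervention; abort on first failure."""
    for skill in skills:
        if not skill.run(ctx):
            return False
    return True

# Stub pipeline mirroring the steps above; each lambda stands in for a real skill.
pipeline = [
    Skill("identify_bottle",     lambda ctx: ctx["command"]["color"] in {"red", "green", "blue"}),
    Skill("grasp_bottle",        lambda ctx: True),
    Skill("open_cap",            lambda ctx: True),
    Skill("rotate_to_container", lambda ctx: True),
    Skill("pour_to_weight",      lambda ctx: True),
    Skill("close_bottle",        lambda ctx: True),
    Skill("return_bottle",       lambda ctx: True),
]

ok = execute_sequence(pipeline, {"command": {"color": "blue", "grams": 300}})
```

Structuring the sequence this way makes each step independently testable and lets a state machine resume or abort cleanly at any stage.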

My Contribution

I worked on system integration and learning-based manipulation modules, including:

  • Designing modular task-level skills for dual-arm coordination
  • Integrating perception with motion planning for bottle manipulation
  • Co-developing learning-based control for precise pouring
  • Integrating speech understanding and language grounding into executable robot actions
  • Building validation logic to ensure safe and feasible task execution

Technical Overview

The system combined perception, language understanding, learning-based control, and classical motion planning:

  • Multimodal Perception: The robot used onboard cameras to identify bottles and determine object states relevant to manipulation.

  • Language Grounding: Spoken instructions were converted into structured task plans using a large language model. These plans were verified before execution.

  • Dual-Arm Manipulation: The robot coordinated both arms to grasp, open, pour, and close objects in a continuous sequence.

  • Learning-Based Control: A reinforcement learning (RL) policy enabled smooth and precise pouring using real-time weight feedback, allowing the robot to stop accurately at the requested amount.

  • Hybrid Skill Composition: The system combined learned behaviors with deterministic motion planning and state-machine-based task management.
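The language-grounding step can be illustrated as mapping a transcript to a structured plan and validating it before any motion. Everything here is an assumed stand-in: the schema, the keyword parser replacing the LLM call, and the limits used in the checks.

```python
# Sketch of grounding a spoken command into a structured plan and validating it
# before execution. Schema, parser, and limits are illustrative assumptions.

KNOWN_BOTTLES = {"red", "green", "blue"}
MAX_POUR_GRAMS = 500  # assumed capacity limit for feasibility checking

def ground_command(transcript: str) -> dict:
    """Stand-in for the LLM: map a transcript to a structured task plan."""
    words = transcript.lower().replace(".", "").split()
    color = next(w for w in words if w in KNOWN_BOTTLES)
    grams = next(int(w) for w in words if w.isdigit())
    return {"skill": "pour", "bottle": color, "target_grams": grams}

def validate_plan(plan: dict) -> bool:
    """Reject unsafe or infeasible plans before the robot moves."""
    return (
        plan.get("skill") == "pour"
        and plan.get("bottle") in KNOWN_BOTTLES
        and 0 < plan.get("target_grams", 0) <= MAX_POUR_GRAMS
    )

plan = ground_command("Pick the blue bottle and pour 300 grams.")
```

Verifying the plan against known objects and physical limits before execution is what keeps a mis-transcribed or hallucinated instruction from producing unsafe motion.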

Results

  • Demonstrated reliable execution of language-conditioned manipulation tasks
  • Achieved precise (±5 g) weight-based pouring using closed-loop feedback
  • Showcased coordinated dual-arm manipulation in a real-world setting
  • Validated integration of speech, vision, planning, and learning in a deployable robotic system
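Gram-level stopping accuracy of this kind generally requires compensating for material still in flight when the stop decision is made. The toy loop below illustrates that idea only; the flow model, in-flight mass, and gains are invented for the sketch and do not reflect the actual learned policy.

```python
# Toy sketch of weight-feedback stop logic: taper the tilt near the target and
# stop early by the estimated in-flight mass. All numbers are illustrative.

def pour_to_weight(read_scale_g, set_tilt, target_g, inflight_g=3.0):
    """Pour until the predicted settled weight reaches the target."""
    while True:
        remaining = target_g - read_scale_g() - inflight_g
        if remaining <= 0:
            set_tilt(0.0)  # upright: cut the flow
            return read_scale_g()
        # proportional tilt with a floor so the pour always makes progress
        set_tilt(max(0.05, min(1.0, remaining / 50.0)))

# --- tiny simulation standing in for the real scale and arm ---
state = {"w": 0.0, "tilt": 0.0}

def read_scale_g():
    state["w"] += state["tilt"] * 1.0   # flow proportional to tilt
    return state["w"]

def set_tilt(t):
    if t == 0.0 and state["tilt"] > 0.0:
        state["w"] += 3.0               # in-flight material still lands
    state["tilt"] = t

final = pour_to_weight(read_scale_g, set_tilt, target_g=300.0)
```

In this simulation the loop stops once the reading plus the in-flight estimate reaches 300 g, so the settled weight lands within a few grams of the target.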

Media