Context
This project focused on building an integrated vision-language manipulation system capable of executing real-world tasks from spoken natural language commands.
When a user issued a command, for example:
“Pick the blue bottle and pour 300 grams.”
The robot autonomously executed the complete sequence:
- Identify the correct bottle
- Grasp the bottle
- Open the cap
- Rotate toward the target container
- Pour using real-time weight feedback
- Stop precisely at the requested weight
- Close the bottle
- Return it to its original position
All steps were performed continuously without human intervention.
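The sequence above can be sketched as an ordered skill pipeline. This is an illustrative stand-in: the skill names and the `robot.run_skill` interface are hypothetical, not the deployed API.

```python
# Illustrative sketch of the task sequence as an ordered skill pipeline.
# Skill names and the robot interface are hypothetical stand-ins for the
# real dual-arm control stack.

TASK_SEQUENCE = [
    "identify_bottle",
    "grasp_bottle",
    "open_cap",
    "rotate_to_container",
    "pour_to_target_weight",   # pour + stop at the requested weight
    "close_cap",
    "return_to_origin",
]

def execute_task(robot, target_weight_g):
    """Run each skill in order; abort the sequence on the first failure."""
    for skill in TASK_SEQUENCE:
        ok = robot.run_skill(skill, target_weight_g=target_weight_g)
        if not ok:
            return False  # no human intervention, but fail safely
    return True
```

Running the whole list in one loop mirrors the continuous, intervention-free execution described above.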
My Contribution
I worked on system integration and learning-based manipulation modules, including:
- Designing modular task-level skills for dual-arm coordination
- Integrating perception with motion planning for bottle manipulation
- Co-developing learning-based control for precise pouring
- Integrating speech understanding and language grounding into executable robot actions
- Building validation logic to ensure safe and feasible task execution
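The validation logic mentioned above might look like the following minimal sketch, assuming a task plan is a dict of the form `{"object": ..., "action": ..., "amount_g": ...}`. The specific checks, schema, and the 1000 g limit are illustrative assumptions, not the deployed rules.

```python
# Minimal sketch of pre-execution plan validation. The plan schema, the
# action vocabulary, and the safety limit are all hypothetical.

KNOWN_ACTIONS = {"pick", "pour", "place"}
MAX_POUR_G = 1000  # hypothetical safety limit

def validate_plan(plan, detected_objects):
    """Return (ok, reason); reject infeasible or unsafe plans before execution."""
    if plan.get("action") not in KNOWN_ACTIONS:
        return False, f"unknown action: {plan.get('action')}"
    if plan.get("object") not in detected_objects:
        return False, f"object not visible: {plan.get('object')}"
    amount = plan.get("amount_g", 0)
    if not 0 < amount <= MAX_POUR_G:
        return False, f"amount out of range: {amount}"
    return True, "ok"
```

Rejecting a plan before any motion starts is what keeps a misheard or infeasible command from reaching the arms.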
Technical Overview
The system combined perception, language understanding, learning-based control, and classical motion planning:
- Multimodal Perception: The robot used onboard cameras to identify bottles and determine object states relevant to manipulation.
- Language Grounding: Spoken instructions were converted into structured task plans using a large language model. These plans were verified before execution.
- Dual-Arm Manipulation: The robot coordinated both arms to grasp, open, pour, and close objects in a continuous sequence.
- Learning-Based Control: An RL policy enabled smooth and precise pouring using real-time weight feedback, allowing the robot to stop accurately at the requested amount.
- Hybrid Skill Composition: The system combined learned behaviors with deterministic motion planning and state-machine-based task management.
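A language-grounding step of the kind described above maps the example command to a structured plan. The deployed system prompted a large language model; the regex parser and the plan schema below are illustrative stand-ins that only show the shape of the output.

```python
import re

# Toy grounding example: map a spoken command to a structured task plan.
# The real system used an LLM; this regex stand-in only illustrates the
# target plan schema, which is itself an assumption.

def ground_command(text):
    """Return a plan dict for commands like 'Pick the X and pour N grams', else None."""
    m = re.search(r"pick the (?P<obj>[\w ]+?) and pour (?P<amt>\d+) grams",
                  text.lower())
    if m is None:
        return None
    return {
        "object": m.group("obj"),
        "action": "pour",
        "amount_g": int(m.group("amt")),
    }
```

Producing a machine-checkable plan rather than free text is what allows the verification step to run before any motion is executed.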
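The closed-loop pouring behavior can be illustrated with a simple proportional stand-in for the learned policy: the pour rate shrinks as the scale reading approaches the target, and the pour stops inside a tolerance band. The real system used an RL policy; the gain and tolerance values here are assumptions chosen only to show the loop structure.

```python
# Simplified stand-in for the learned pouring controller: a proportional
# policy over real-time weight feedback. Gain (50 g ramp) and tolerance
# are illustrative, not the deployed parameters.

def pour_step(current_g, target_g, tol_g=5.0, max_rate=1.0):
    """Map the remaining weight to a pour-rate command in [0, max_rate]."""
    remaining = target_g - current_g
    if remaining <= tol_g:
        return 0.0  # stop: within tolerance of the requested weight
    # slow down proportionally over the final 50 g of the pour
    return min(max_rate, remaining / 50.0)
```

Calling this at the scale's sampling rate yields a pour that is fast far from the target and tapers off near it, which is how a weight-feedback loop can stop within a few grams.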
Results
- Demonstrated reliable execution of language-conditioned manipulation tasks
- Achieved precise weight-based pouring (±5 g) using closed-loop feedback
- Showcased coordinated dual-arm manipulation in a real-world setting
- Validated integration of speech, vision, planning, and learning in a deployable robotic system