Promptable Visual Object Manipulation in Cluttered Environments by Memory-augmented Student-Teacher Learning

Abstract

Building models that respond to input prompts represents a transformative shift in machine learning. This paradigm holds significant potential for robotics problems such as targeted manipulation amidst clutter. This work presents a novel approach for combining promptable foundation models with reinforcement learning (RL), enabling robots to perform dexterous manipulation tasks in a prompt-guided manner. While current methods struggle to integrate high-level commands with dexterous low-level control, our method bridges this gap through a memory-augmented student-teacher learning framework. We use the Segment Anything 2 (SAM 2) model as a perception backbone to infer the object of interest from user prompts. Although its detections are imperfect, the sequence of outputs over an episode is highly informative, allowing a memory-augmented student policy to implicitly estimate the true object state. Our approach successfully learns prompt-responsive policies, demonstrated by picking target objects from cluttered scenes.
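
As a rough illustration of the memory-augmented student described above, the sketch below (PyTorch, with hypothetical names and dimensions such as `RecurrentStudentPolicy`) shows a recurrent policy that aggregates noisy SAM 2-derived observations over time and is distilled from a privileged teacher by regressing its actions; the actual architecture and training losses used in this work may differ.

```python
import torch
import torch.nn as nn


class RecurrentStudentPolicy(nn.Module):
    """GRU-based student that consumes noisy per-step observations
    (e.g., features derived from SAM 2 masks plus proprioception) and
    integrates them over time to implicitly estimate the object state."""

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ELU())
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, action_dim)

    def forward(self, obs_seq: torch.Tensor, hidden: torch.Tensor | None = None):
        # obs_seq: (batch, time, obs_dim)
        feats = self.encoder(obs_seq)
        out, hidden = self.gru(feats, hidden)
        return self.head(out), hidden


def distillation_loss(student: RecurrentStudentPolicy,
                      obs_seq: torch.Tensor,
                      teacher_actions: torch.Tensor) -> torch.Tensor:
    """Behaviour-cloning objective: regress the privileged teacher's actions
    from the student's partial, noisy observation history."""
    student_actions, _ = student(obs_seq)
    return nn.functional.mse_loss(student_actions, teacher_actions)


if __name__ == "__main__":
    # Toy rollout shapes for illustration only.
    batch, horizon, obs_dim, action_dim = 8, 50, 64, 7
    policy = RecurrentStudentPolicy(obs_dim, action_dim)
    obs = torch.randn(batch, horizon, obs_dim)              # stand-in for SAM 2-derived observations
    teacher_act = torch.randn(batch, horizon, action_dim)   # stand-in for privileged teacher actions
    loss = distillation_loss(policy, obs, teacher_act)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")
```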

Video
We use reinforcement learning to train interactive grasping policies. Their knowledge is then distilled into real-world deployable policies that can be prompted via language or points, using SAM 2 as their interface.