ShapeGrasp: Zero-Shot Object Manipulation with Large Language Models through Geometric Decomposition

Abstract

Task-oriented grasping of unfamiliar objects is a necessary skill for robots in dynamic in-home environments. Inspired by the human capability to grasp such objects through intuition about their shape and structure, we present a novel zero-shot task-oriented grasping method leveraging a geometric decomposition of the target object into simple, convex shapes that we represent in a graph structure, including geometric attributes and spatial relationships. Our approach uses only minimal essential information, namely the object's name and the intended task, to facilitate zero-shot task-oriented grasping. We utilize the commonsense reasoning capabilities of large language models to dynamically assign semantic meaning to each decomposed part and subsequently reason over the utility of each part for the intended task. Through extensive experiments on a real-world robotics platform, we demonstrate that our decomposition and reasoning pipeline selects the correct part in 92% of cases and that the robot successfully grasps the object in 82% of the tasks we evaluate.

Approach

Given a target object, our RGB+D-based pipeline decomposes the object into approximate convex parts. We propose a heuristic approach to select a suitable decomposition that we then convert into a graph that captures the object's composition. Each decomposed part is approximated as a basic shape and represented as a node in the graph with geometric and color attributes. Edges within the graph are drawn between nodes whose parts are connected in the segmentation. Finally, an LLM is utilized for a two-stage reasoning process. First, the LLM reasons about the semantic significance of each node in the graph and assigns part labels. Then, using this semantic reasoning and the desired task, the LLM reasons about the task-utility of each part before selecting the most appropriate part for the robot to grasp.
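The graph construction and two-stage reasoning described above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the class names `PartNode` and `ObjectGraph`, the attribute fields (shape name, a two-value size in centimeters, a color name), the prompt wording, and the `llm` callable are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class PartNode:
    """One decomposed part, approximated as a basic shape (fields are illustrative)."""
    node_id: int
    shape: str                 # e.g. "cylinder" or "rectangle"
    size: Tuple[float, float]  # hypothetical (length, width) in cm
    color: str


@dataclass
class ObjectGraph:
    """Nodes are parts; an edge joins two parts that touch in the segmentation."""
    nodes: Dict[int, PartNode] = field(default_factory=dict)
    edges: Set[Tuple[int, int]] = field(default_factory=set)

    def add_part(self, node: PartNode) -> None:
        self.nodes[node.node_id] = node

    def connect(self, a: int, b: int) -> None:
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both parts must be added before connecting them")
        # Store undirected edges in a canonical (low, high) order.
        self.edges.add((min(a, b), max(a, b)))

    def describe(self) -> str:
        """Serialize the graph as one text line per node for use in a prompt."""
        lines: List[str] = []
        for n in sorted(self.nodes.values(), key=lambda p: p.node_id):
            neighbors = sorted(
                b if a == n.node_id else a
                for a, b in self.edges
                if n.node_id in (a, b)
            )
            lines.append(
                f"Node {n.node_id}: {n.color} {n.shape}, "
                f"{n.size[0]:.1f} x {n.size[1]:.1f} cm, connected to {neighbors}"
            )
        return "\n".join(lines)


def select_grasp_part(
    graph: ObjectGraph,
    object_name: str,
    task: str,
    llm: Callable[[str], str],  # assumed interface: prompt text in, reply text out
) -> Tuple[str, str]:
    """Run the two reasoning stages and return (part labels, chosen part)."""
    # Stage 1: semantic reasoning, using only the object name and the graph.
    semantic_prompt = (
        f"Object: {object_name}\n"
        f"Parts:\n{graph.describe()}\n"
        "Assign a semantic part label to each node."
    )
    labels = llm(semantic_prompt)

    # Stage 2: task-utility reasoning, conditioned on the stage-1 labels.
    task_prompt = (
        f"Object: {object_name}\n"
        f"Task: {task}\n"
        f"Parts:\n{graph.describe()}\n"
        f"Part labels:\n{labels}\n"
        "Which node should the robot grasp? Answer with the node id."
    )
    choice = llm(task_prompt)
    return labels, choice
```

Passing the language model in as a plain callable keeps the sketch independent of any particular LLM client; a real system would also need to parse and validate the returned node id before planning a grasp.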

Varying Threshold

Selecting an appropriate decomposition is key to meaningful part-level semantic and task-utility reasoning. The slider shows decomposition results at different thresholds on the "sunglasses" object. At our heuristically selected threshold, the assigned semantic labels and the selected part align with the ground truth.

Selected Threshold

Final Grasping Attempt

Affordance Prediction

Understanding the affordances and properties of objects and their parts is important for task-oriented part selection. Here, we ask the LLM affordance-based questions, which it answers using our object graph.
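As a rough illustration of such a query, the sketch below formats a list of labeled parts and a free-form affordance question into a single prompt. The function name `ask_affordance`, the dictionary keys, the prompt wording, and the `llm` callable are hypothetical and chosen only for this example; the actual prompts may differ.

```python
from typing import Callable, Dict, List


def ask_affordance(
    object_name: str,
    labeled_parts: List[Dict[str, str]],  # assumed keys: "label", "color", "shape"
    question: str,
    llm: Callable[[str], str],            # assumed interface: prompt in, reply out
) -> str:
    """Build an affordance question over the labeled parts and return the reply."""
    part_lines = [
        f"- {p['label']}: {p['color']} {p['shape']}" for p in labeled_parts
    ]
    prompt = "\n".join(
        [
            f"Object: {object_name}",
            "Labeled parts:",
            *part_lines,
            f"Question: {question}",
            "Answer by naming one part and giving a one-sentence reason.",
        ]
    )
    return llm(prompt).strip()
```

A question such as "Which part is most fragile?" could be passed as `question`, with the part labels taken from the semantic reasoning stage.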

BibTeX


      @article{Li2024ShapeGraspZT,
        title={ShapeGrasp: Zero-Shot Task-Oriented Grasping with Large Language Models through Geometric Decomposition},
        author={Samuel Li and Sarthak Bhagat and Joseph Campbell and Yaqi Xie and Woojun Kim and Katia P. Sycara and Simon Stepputtis},
        journal={2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
        year={2024},
        pages={10527-10534},
      }
