GIFT: Generalizable Interaction-aware Functional Tool Affordances without Labels

Dylan Turpin, Liquan Wang, Stavros Tsogkas, Sven Dickinson, Animesh Garg
2021 Robotics: Science and Systems XVII   unpublished
[Figure 1 panels: discovering tool affordances by interacting with procedurally-generated tools across three manipulation tasks (hooking, reaching, and hammering); training an affordance model, using the contact data of sampled trajectories, to detect sparse keypoints representing tool geometry from RGBD observations of unknown objects and to predict distributions over pairs of keypoints to grasp and interact with for each task; comparing model predictions with those of a human labeller, e.g., for hammering.]

Fig. 1: Rather than relying on human labels, the GIFT framework discovers affordances from goal-directed interaction with a set of procedurally-generated tools. This interaction experience is collected with a simple sampling-based motion planner that does not require demonstrations or an expert policy. Since the affordances are not prespecified (either explicitly by labels or implicitly by predefined manipulation strategies), they are unbiased, i.e., they emerge only from the constraints of the task.

Abstract: Tool use requires reasoning about the fit between an object's affordances and the demands of a task. Visual affordance learning can benefit from goal-directed interaction experience, but current techniques rely on human labels or expert demonstrations to generate this data. In this paper, we describe a method that grounds affordances in physical interactions instead, thus removing the need for human labels or expert policies. We use an efficient sampling-based method to generate successful trajectories that provide contact data, which are then used to reveal affordance representations. Our framework, GIFT, operates in two phases: first, we discover visual affordances from goal-directed interaction with a set of procedurally generated tools; second, we train a model to predict new instances of the discovered affordances on novel tools in a self-supervised fashion. In our experiments, we show that GIFT can leverage a sparse keypoint representation to predict grasp and interaction points to accommodate multiple tasks, such as hooking, reaching, and hammering. GIFT outperforms baselines on all tasks and matches a human oracle on two of three tasks using novel tools. Qualitative results available at:
doi:10.15607/rss.2021.xvii.060 fatcat:k5bcwfesfrchflqo6vt3sbh6da