CALVIN: A Benchmark for Language-conditioned Policy Learning for Long-horizon Robot Manipulation Tasks [article]

Oier Mees, Lukas Hermann, Erick Rosete-Beas, Wolfram Burgard
<span title="2021-12-08">2021</span> <i > arXiv </i> &nbsp; <span class="release-stage" >pre-print</span>
General-purpose robots coexisting with humans in their environment must learn to relate human language to their perceptions and actions to be useful in a range of daily tasks. Moreover, they need to acquire a diverse repertoire of general-purpose skills that allow composing long-horizon tasks by following unconstrained language instructions. In this paper, we present CALVIN (Composing Actions from Language and Vision), an open-source simulated benchmark to learn long-horizon tasks. Our aim is to make it possible to develop agents that can solve many robotic manipulation tasks over a long horizon, from onboard sensors, and specified only via human language. CALVIN tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets, and CALVIN supports flexible specification of sensor suites. We evaluate the agents zero-shot on novel language instructions and on novel environments and objects. We show that a baseline model based on multi-context imitation learning performs poorly on CALVIN, suggesting that there is significant room for developing innovative agents that learn to relate human language to their world models with this benchmark.
<span class="external-identifiers"> <a target="_blank" rel="external noopener" href="">arXiv:2112.03227v2</a> <a target="_blank" rel="external noopener" href="">fatcat:aw3vvlb7ejeofodzw7xcdjnysm</a> </span>
