Connecting Language and Vision to Actions

Peter Anderson, Abhishek Das, Qi Wu
Proceedings of ACL 2018, Tutorial Abstracts
A long-term goal of AI research is to build intelligent agents that can see the rich visual environment around us, communicate this understanding in natural language to humans and other agents, and act in a physical or embodied environment. To this end, recent advances at the intersection of language and vision have made incredible progress — from generating natural language descriptions of images and videos, to answering questions about them, to even holding free-form conversations about visual content. However, while these agents can passively describe images or answer (a sequence of) questions about them, they cannot act in the world (what if I cannot answer a question from my current view, or I am asked to move or manipulate something?). Thus, the challenge now is to extend this progress in language and vision to embodied agents that take actions and actively interact with their visual environments.
doi:10.18653/v1/p18-5004 dblp:conf/acl/AndersonDW18