
Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion [article]

Alessandro Suglia, Qiaozi Gao, Jesse Thomason, Govind Thattai, Gaurav Sukhatme
2021 arXiv   pre-print
We present Embodied BERT (EmBERT), a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion.  ...  Additionally, we bridge the gap between successful object-centric navigation models used for non-interactive agents and the language-guided visual task completion benchmark, ALFRED, by introducing object  ...  Embodied BERT EmBERT uses a transformer encoder for jointly embedding language and visual tokens and a transformer decoder for long-horizon planning and object-centric navigation predictions (Figure  ...  (a toy encoder-decoder sketch follows this entry)
arXiv:2108.04927v2 fatcat:pq6k7mbrsneuzaaio6egm54hqe
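
A hedged reading of the layout this abstract describes, not the authors' released code: a transformer encoder jointly embeds language and visual tokens, and a transformer decoder makes long-horizon, object-centric predictions. Dimensions and names (EncDecAgent, step_queries) are illustrative assumptions.

```python
# Minimal sketch of the encoder-decoder pattern named in the EmBERT abstract.
import torch
import torch.nn as nn

class EncDecAgent(nn.Module):
    def __init__(self, vocab=1000, d=256, n_actions=12):
        super().__init__()
        self.txt_emb = nn.Embedding(vocab, d)    # language tokens -> d
        self.vis_proj = nn.Linear(2048, d)       # per-object visual features -> d
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), num_layers=2)
        self.act_head = nn.Linear(d, n_actions)  # next-action logits per timestep

    def forward(self, txt_ids, obj_feats, step_queries):
        # Jointly encode language and visual tokens as one multimodal sequence.
        tokens = torch.cat([self.txt_emb(txt_ids), self.vis_proj(obj_feats)], dim=1)
        memory = self.encoder(tokens)
        # Decode one learned query per timestep against the multimodal memory.
        return self.act_head(self.decoder(step_queries, memory))

agent = EncDecAgent()
logits = agent(torch.randint(0, 1000, (1, 16)),  # 16 instruction tokens
               torch.randn(1, 8, 2048),          # 8 detected objects
               torch.randn(1, 4, 256))           # 4 timestep queries
print(logits.shape)                              # torch.Size([1, 4, 12])
```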

Deep Learning for Embodied Vision Navigation: A Survey [article]

Fengda Zhu, Yi Zhu, Vincent CS Lee, Xiaodan Liang, Xiaojun Chang
2021 arXiv   pre-print
The remarkable learning ability of deep learning methods has empowered agents to accomplish embodied visual navigation tasks.  ...  modeling seen scenarios, understanding cross-modal instructions, and adapting to a new environment, etc.  ...  Furthermore, BERT-based methods [147], [148] pretrain a transformer network with proxy tasks and achieve great success in vision, language, and cross-modal tasks.  ... 
arXiv:2108.04097v4 fatcat:46p2p3zlivabbn7dvowkyccufe

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning [article]

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew Hausknecht
2021 arXiv   pre-print
BUTLER's simple, modular design factors the problem to allow researchers to focus on models for improving every piece of the pipeline (language understanding, planning, navigation, and visual scene understanding  ...  Embodied agents require the same abilities, but existing work does not yet provide the infrastructure necessary for both reasoning abstractly and executing concretely.  ...  ACKNOWLEDGMENTS The authors thank Cheng Zhang, Jesse Thomason, Karthik Desingh, Rishabh Joshi, Romain Laroche, Shunyu Yao, and Victor Zhong for insightful feedback and discussions.  ... 
arXiv:2010.03768v2 fatcat:bcj42swdcffjjlo4hll7pubsom

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [article]

Wenlong Huang, Pieter Abbeel, Deepak Pathak, Igor Mordatch
2022 arXiv   pre-print
The human evaluation conducted reveals a trade-off between executability and correctness, but shows a promising sign towards extracting actionable knowledge from language models.  ...  In this paper, we investigate the possibility of grounding high-level tasks, expressed in natural language (e.g. "make breakfast"), to a chosen set of actionable steps (e.g. "open fridge").  ...  Acknowledgment We would like to thank OpenAI for providing academic access to the OpenAI API and Luke Metz for valuable feedback and discussions.  ...  (a prompting sketch follows below)
arXiv:2201.07207v2 fatcat:2ighvy7jsfaxfllziu4yngv3n4
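
The recipe in this abstract fits in a few lines. In the hedged sketch below, `generate` stands in for any text-completion call (an assumed interface, not the paper's API); the comment notes the paper's additional step of matching each generated step to an admissible action.

```python
# Zero-shot planning with a language model: show one worked decomposition,
# then ask for steps for a new task.
def plan(task: str, generate) -> list[str]:
    prompt = ("Task: make breakfast\n"
              "Step 1: open fridge\n"
              "Step 2: take out eggs\n"
              "Step 3: cook eggs in pan\n\n"
              f"Task: {task}\n"
              "Step 1:")
    completion = "Step 1:" + generate(prompt)
    # Keep enumerated steps only; the paper additionally maps each free-form
    # step to the closest admissible action in the environment.
    return [line.split(":", 1)[1].strip()
            for line in completion.splitlines() if line.strip().startswith("Step")]

# Usage with a canned stand-in for the model:
fake_lm = lambda p: " fill kettle\nStep 2: boil water\nStep 3: pour over tea bag"
print(plan("make tea", fake_lm))  # ['fill kettle', 'boil water', 'pour over tea bag']
```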

Core Challenges in Embodied Vision-Language Planning [article]

Jonathan Francis, Nariaki Kitamura, Felix Labelle, Xiaopeng Lu, Ingrid Navarro, Jean Oh
2022 arXiv   pre-print
In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language  ...  Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment  ...  Acknowledgements The authors thank Alessandro Oltramari, Yonatan Bisk, Eric Nyberg, and Louis-Philippe Morency for insightful discussions; we thank Mayank Mali for support throughout the editing process  ... 
arXiv:2106.13948v3 fatcat:tk32nr4jtjekboh33zutellvnm

Syntax-BERT: Improving Pre-trained Transformers with Syntax Trees [article]

Jiangang Bai, Yujing Wang, Yiren Chen, Yaming Yang, Jing Bai, Jing Yu, Yunhai Tong
2021 arXiv   pre-print
Pre-trained language models like BERT achieve superior performance in various NLP tasks without explicit consideration of syntactic information.  ...  Experiments on various natural language understanding datasets verify the effectiveness of syntax trees and achieve consistent improvement over multiple pre-trained models, including BERT, RoBERTa,  ...  to exploit more syntactic and semantic knowledge, including relation types from a dependency parser and concepts from a knowledge graph.  ...  (an attention-mask illustration follows this entry)
arXiv:2103.04350v1 fatcat:wd7hiiqhmreuficb4bgu7bhd44
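
One way a syntax tree can inform a transformer, as a hedged illustration of the idea named in the title: allow self-attention only between tokens within a fixed distance in the dependency tree. Syntax-BERT's actual masking scheme differs in detail; the helper below is invented for exposition.

```python
import torch

def tree_attention_mask(heads: list[int], max_dist: int = 1) -> torch.Tensor:
    """heads[i] is the index of token i's dependency head (-1 for the root)."""
    n = len(heads)
    dist = torch.full((n, n), n, dtype=torch.long)  # n acts as "infinity"
    for i in range(n):
        dist[i, i] = 0
    for i, h in enumerate(heads):
        if h >= 0:
            dist[i, h] = dist[h, i] = 1
    for k in range(n):  # Floyd-Warshall over the tree edges
        dist = torch.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    return dist <= max_dist  # True = attention allowed

# "the quick fox jumps": determiner/adjective attach to "fox", "fox" to "jumps".
print(tree_attention_mask([2, 2, 3, -1], max_dist=1))
```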

LEBP – Language Expectation Binding Policy: A Two-Stream Framework for Embodied Vision-and-Language Interaction Task Learning Agents [article]

Haoyu Liu, Yang Liu, Hongkai He, Hangfang Yang
2022 arXiv   pre-print
People have long desired an embodied agent that can perform a task by understanding language instructions.  ...  The expectation consists of a sequence of sub-steps for the task (e.g., Pick an apple).  ...  modules for a transformer to catch key memory frames [Pashevich et al., 2021].  ...  (a two-stream toy example follows below)
arXiv:2203.04637v1 fatcat:qtkvpvaw5ncdniqgrusltoivwm
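
A toy rendering of the two-stream idea as the abstract sketches it: an expectation stream proposes sub-steps, and a binding policy grounds each sub-step in currently visible objects. Both functions below are invented placeholders, not the LEBP model.

```python
def expectation_stream(instruction: str) -> list[str]:
    # Stand-in for a learned instruction-to-sub-step model.
    return ["find apple", "pick apple"] if "apple" in instruction else []

def binding_policy(sub_step: str, visible_objects: list[str]):
    # Bind the sub-step's target to a visible object, or None if unseen.
    return next((obj for obj in visible_objects if obj in sub_step), None)

for step in expectation_stream("Pick an apple from the table"):
    print(step, "->", binding_policy(step, ["table", "apple", "chair"]))
# find apple -> apple
# pick apple -> apple
```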

Embodied AI-Driven Operation of Smart Cities: A Concise Review [article]

Farzan Shenavarmasouleh, Farid Ghareh Mohammadi, M. Hadi Amini, Hamid R. Arabnia
2021 arXiv   pre-print
Robots and physical machines are inseparable parts of a smart city. Embodied AI is the field of study that takes a deeper look at these and explores how they can fit into real-world environments.  ...  Finally, we address its challenges and identify its potential for future research.  ...  Visual Question Answering (VQA) [17] is the task of receiving an image along with a natural language question about that image as input and attempting to find the accurate natural language answer for  ... 
arXiv:2108.09823v1 fatcat:xcjyq2ad3jgbborpldopgcd3vm

If-Clauses, Their Grammatical Consequents, and Their Embodied Consequence: Organizing Joint Attention in Guided Tours

Elwys De Stefani
2021 Frontiers in Communication  
This article examines if-clauses as a resource available to tour guides for reorienting the visitors' visual attention towards an object of interest.  ...  In this setting, guides recurrently use if-clauses to organize a joint focus of attention, by prompting the visitors to rearrange themselves bodily and visually.  ...  I also thank Matthew Burdelski for his advice on Japanese and Isabelle Heyerick for the fieldwork that led to the recording of guided tours interpreted into Flemish Sign Language and for her help with  ... 
doi:10.3389/fcomm.2021.661165 fatcat:a74zk367dnbfxp6bmq7km3q3r4

Finding common ground: Alternatives to code models for language use

Carol A. Fowler, Bert Hodges
2016 New Ideas in Psychology  
They represent language studies from the perspective of ecological psychology, dynamical systems approaches, the Distributed Language Approach, and others.  ...  Different contributions to the special issue offer critiques of conventional scientific studies of decontextualized language and language processing, and offer new perspectives on such diverse domains  ...  We extend our thanks for that support and to all conference presenters and participants.  ... 
doi:10.1016/j.newideapsych.2016.03.001 fatcat:7utskgrrwjdp5b6vxydl3sppey

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web [article]

Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, Dhruv Batra
2020 arXiv   pre-print
Specifically, we develop VLN-BERT, a visiolinguistic transformer-based model for scoring the compatibility between an instruction ('...stop at the brown sofa') and a sequence of panoramic RGB images captured  ...  that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)?  ...  The BERT [7] model is a large transformer-based [29] architecture for language modeling.  ...  (a scoring sketch follows this entry)
arXiv:2004.14973v2 fatcat:aseehzodgzforll2vnhn4rjruu
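
The compatibility-scoring pattern from this abstract, in a minimal hedged form: embed the instruction and the panorama sequence, pool, and take a cosine score per candidate path. The encoder choices, dimensions, and the PathScorer name are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PathScorer(nn.Module):
    def __init__(self, vocab=1000, d=128):
        super().__init__()
        self.txt = nn.EmbeddingBag(vocab, d)  # bag-of-words instruction encoder
        self.img = nn.Linear(512, d)          # per-panorama feature projection

    def forward(self, instr_ids, pano_feats):
        t = F.normalize(self.txt(instr_ids), dim=-1)               # (B, d)
        v = F.normalize(self.img(pano_feats).mean(dim=1), dim=-1)  # (B, d)
        return (t * v).sum(dim=-1)            # cosine compatibility per path

scorer = PathScorer()
scores = scorer(torch.randint(0, 1000, (3, 10)),  # same instruction, 3 paths
                torch.randn(3, 6, 512))           # 6 panoramas per path
best_path = scores.argmax()                       # re-rank candidates by score
```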

Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation [article]

Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan
2021 arXiv   pre-print
Considering that a single fixed-length vector is often insufficient to capture long-term temporal context, in this paper we introduce Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded  ...  Vision-and-Language Navigation (VLN) is a task in which an agent is required to follow a language instruction to navigate to the goal position, relying on ongoing interactions with the environment  ...  Vision-and-Language Navigation (VLN) is such a task, in which an embodied agent is required to follow a language instruction to  ...  In contrast to most existing methods that utilize a fixed-length vector to represent  ...  (a memory sketch follows below)
arXiv:2111.05759v1 fatcat:eyceb3ftfzd6rmsdx6tlwyut4u
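
A hedged sketch of the variable-length-memory idea: instead of compressing history into one fixed vector, keep one token per past step and let the transformer attend over all of them each step. The MemoryAgent interface is invented for illustration.

```python
import torch
import torch.nn as nn

class MemoryAgent(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.memory = []  # one stored token per past step (grows over time)

    def step(self, obs_tokens):
        # Attend jointly over the current observation and all memory tokens.
        past = torch.cat(self.memory, dim=1) if self.memory else obs_tokens[:, :0]
        h = self.encoder(torch.cat([past, obs_tokens], dim=1))
        self.memory.append(h[:, -1:].detach())  # append the newest step token
        return h[:, -1]                         # current step representation

agent = MemoryAgent()
for _ in range(3):
    out = agent.step(torch.randn(1, 5, 128))    # 5 observation tokens per step
print(len(agent.memory), out.shape)             # 3 torch.Size([1, 128])
```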

AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant [article]

Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, Mike Zheng Shou
2022 arXiv   pre-print
In this paper, we define a new task called Affordance-centric Question-driven Task Completion, where the AI assistant should learn from instructional videos and scripts to guide the user step by step.  ...  To address this unique task, we developed a Question-to-Actions (Q2A) model that significantly outperforms several baseline methods while still leaving large room for improvement.  ...  Conclusion In this paper, we proposed the Affordance-centric Question-driven Task Completion (AQTC) task for AI assistants to learn from instructional videos to guide users.  ... 
arXiv:2203.04203v2 fatcat:akmutm76yjdlrpy7cjr2y2oheq

Experiencing more complexity than we can tell

Bert Timmermans, Bert Windey, Axel Cleeremans
2010 Cognitive Neuroscience  
They are necessary to complete the behavioral task, and if they have limited capacity they will impose a limit on a potentially much richer visual sensation.  ...  What is evident from this small exercise is that introspection is a poor guide to conscious visual sensation.  ... 
doi:10.1080/17588928.2010.497586 pmid:24168344 fatcat:kv3xi4bqazbn7n22pepmqmqa7a

FILM: Following Instructions in Language with Modular Methods [article]

So Yeon Min, Devendra Singh Chaplot, Pradeep Ravikumar, Yonatan Bisk, Ruslan Salakhutdinov
2022 arXiv   pre-print
Recent methods for embodied instruction following are typically trained end-to-end using imitation learning. This often requires the use of expert trajectories and low-level language instructions.  ...  In contrast, we propose a modular method with structured representations that (1) builds a semantic map of the scene and (2) performs exploration with a semantic search policy, to achieve the natural language  ...  Language and visual inputs are transformed into high-level actions and the 3D map, respectively.  ...  (a pipeline sketch follows this entry)
arXiv:2110.07342v3 fatcat:gx2poe4ks5dh5l6sqpzqoi6nje
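
A toy pipeline mirroring the modular factoring this abstract describes: parse a goal from language, update a semantic map from observations, and let a search policy decide where to go. Every function body below is a placeholder standing in for a learned module, not the released implementation.

```python
import numpy as np

def parse_goal(instruction: str) -> str:
    return instruction.split()[-1].rstrip(".")  # toy parser: last word is the goal

def update_semantic_map(sem_map, obs, classes):
    y, x, label = obs                           # toy detection: one labelled cell
    sem_map[classes.index(label), y, x] = 1.0
    return sem_map

def semantic_search_policy(sem_map, goal_idx, rng):
    found = np.argwhere(sem_map[goal_idx] > 0)
    if len(found):                              # goal already mapped: go there
        return ("goto", tuple(int(v) for v in found[0]))
    unexplored = np.argwhere(sem_map.sum(axis=0) == 0)
    cell = unexplored[rng.integers(len(unexplored))]
    return ("explore", tuple(int(v) for v in cell))

classes = ["wall", "table", "apple"]
sem_map = np.zeros((len(classes), 8, 8))        # C x H x W semantic map
sem_map = update_semantic_map(sem_map, (2, 5, "apple"), classes)
goal_idx = classes.index(parse_goal("Pick up the apple"))
print(semantic_search_policy(sem_map, goal_idx, np.random.default_rng(0)))
# ('goto', (2, 5))
```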