40,213 Hits in 7.8 sec

Multi-View Learning for Vision-and-Language Navigation [article]

Qiaolin Xia, Xiujun Li, Chunyuan Li, Yonatan Bisk, Zhifang Sui, Jianfeng Gao, Yejin Choi, Noah A. Smith
2020 arXiv   pre-print
Further, LEO is complementary to most existing models for vision-and-language navigation, allowing for easy integration with the existing techniques, leading to LEO+, which creates the new state of the  ...  In this paper, we present a novel training paradigm, Learn from EveryOne (LEO), which leverages multiple instructions (as different views) for the same trajectory to resolve language ambiguity and improve  ...  Multi-View Learning.  ... 
arXiv:2003.00857v3 fatcat:vhik23mjx5cyraxr4kb5cuspea

Vision-Dialog Navigation by Exploring Cross-modal Memory [article]

Yi Zhu, Fengda Zhu, Zhaohuan Zhan, Bingqian Lin, Jianbin Jiao, Xiaojun Chang, Xiaodan Liang
2020 arXiv   pre-print
Vision-dialog navigation, posed as a new holy-grail task in the vision-language discipline, targets learning an agent endowed with the capability of constant conversation for help with natural language  ...  V-mem learns to associate the current visual views with the cross-modal memory of previous navigation actions.  ...  Introduction Powered by the recent progress in natural language processing and visual scene understanding, vision-language tasks such as Visual Question Answering (VQA) [3, 1, 10] and Vision-Language  ... 
arXiv:2003.06745v1 fatcat:fn3ua4o3bjh33oyy27au2conq4

Deep Learning for Embodied Vision Navigation: A Survey [article]

Fengda Zhu, Yi Zhu, Vincent CS Lee, Xiaodan Liang, Xiaojun Chang
2021 arXiv   pre-print
The remarkable learning ability of deep learning methods has empowered agents to accomplish embodied visual navigation tasks.  ...  Recently, embodied visual navigation has attracted rising attention from the community, and numerous works have been proposed to learn these skills.  ...  [207] propose a multi-task model that jointly learns multi-modal tasks and transfers vision-language knowledge across the tasks.  ... 
arXiv:2108.04097v4 fatcat:46p2p3zlivabbn7dvowkyccufe

Vision-Language Navigation: A Survey and Taxonomy [article]

Wansen Wu, Tao Chang, Xinmeng Li
2022 arXiv   pre-print
Vision-Language Navigation (VLN) tasks require an agent to follow human language instructions to navigate in previously unseen environments.  ...  Depending on whether the navigation instructions are given for once or multiple times, this paper divides the tasks into two categories, i.e., single-turn and multi-turn tasks.  ...  ACKNOWLEDGMENT The work described in this paper was sponsored in part by the National Natural Science Foundation of China under Grant No. 62103420 and 62103428 , the Natural Science Fund of Hunan Province  ... 
arXiv:2108.11544v3 fatcat:qo5g237si5cwtewxiaeqtjwqpy

Active Visual Information Gathering for Vision-Language Navigation [article]

Hanqing Wang, Wenguan Wang, Tianmin Shu, Wei Liang, Jianbing Shen
2020 arXiv   pre-print
This work draws inspiration from human navigation behavior and endows an agent with an active information gathering ability for a more intelligent vision-language navigation policy.  ...  Vision-language navigation (VLN) is the task in which an agent must carry out navigational instructions inside photo-realistic environments.  ...  dialog [6, 27], and vision-language navigation [1].  ... 
arXiv:2007.08037v3 fatcat:7c2drtt2tvhzjnaypj7vmanbla

Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation [article]

Sinan Tan, Mengmeng Ge, Di Guo, Huaping Liu, Fuchun Sun
2022 arXiv   pre-print
In the Vision-and-Language Navigation task, the embodied agent follows linguistic instructions and navigates to a specific goal.  ...  Then, we construct an LSTM-based navigation model and train it with the proposed 3D semantic representations and BERT language features on vision-language pairs.  ...  In our work, we adopt self-supervised learning for 3D semantic representation in the vision-and-language navigation task.  ... 
arXiv:2201.10788v1 fatcat:cczeqqjkobblnbetotq5zvhcl4

The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation [article]

Yuankai Qi, Zizheng Pan, Yicong Hong, Ming-Hsuan Yang, Anton van den Hengel, Qi Wu
2021 arXiv   pre-print
Vision-and-Language Navigation (VLN) requires an agent to find a path to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.  ...  ., left/right/front/back) of each navigable location and the room type (e.g., bedroom, kitchen) of its current and final navigation goal, as such information is widely mentioned in instructions implying  ...  Acknowledgements This work is supported in part by the ARC DE190100539 and the NSF CAREER Grant #1149783.  ... 
arXiv:2104.04167v2 fatcat:cq34ceywgrbdlhhzkuygkde764

Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation [article]

Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang
2019 arXiv   pre-print
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments.  ...  In this paper, we study how to address three critical challenges for this task: the cross-modal grounding, the ill-posed feedback, and the generalization problems.  ...  The authors thank Peter Anderson and Pengchuan Zhang for their helpful discussions, and Ronghang Hu for his visualization code.  ... 
arXiv:1811.10092v2 fatcat:lt4e626n5jedrptnv2rapgwdve

VISITRON: Visual Semantics-Aligned Interactively Trained Object-Navigator [article]

Ayush Shrivastava, Karthik Gopalakrishnan, Yang Liu, Robinson Piramuthu, Gokhan Tür, Devi Parikh, Dilek Hakkani-Tür
2022 arXiv   pre-print
In this paper, we present VISITRON, a multi-modal Transformer-based navigator better suited to the interactive regime inherent to Cooperative Vision-and-Dialog Navigation (CVDN).  ...  Interactive robots navigating photo-realistic environments need to be trained to effectively leverage and handle the dynamic nature of dialogue in addition to the challenges underlying vision-and-language  ...  We would also like to thank the anonymous reviewers for their service and useful feedback.  ... 
arXiv:2105.11589v2 fatcat:7o7dxrlbl5bt7focxxackp6kyy

Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [article]

Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, Jianfeng Gao
2020 arXiv   pre-print
In this paper, we present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.  ...  Further, the learned representation is transferable to other VLN tasks. On two recent tasks, vision-and-dialog navigation and "Help, Anna!"  ...  arXiv preprint supervised imitation learning for vision-language navigation.  ... 
arXiv:2002.10638v2 fatcat:zcqp4cduyzgrbfycj6hvefjrtm

Diagnosing the Environment Bias in Vision-and-Language Navigation [article]

Yubo Zhang, Hao Tan, Mohit Bansal
2020 arXiv   pre-print
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations.  ...  We observe that it is neither the language nor the underlying navigational graph, but the low-level visual appearance conveyed by ResNet features, that directly affects the agent model and contributes to this environment  ...  Zhang, and X. Zhou for their helpful suggestions.  ... 
arXiv:2005.03086v1 fatcat:ljdaa2rqxvdi3csj4qpap35hca

Soft Expert Reward Learning for Vision-and-Language Navigation [article]

Hu Wang, Qi Wu, Chunhua Shen
2020 arXiv   pre-print
Vision-and-Language Navigation (VLN) requires an agent to find a specified spot in an unseen environment by following natural language instructions.  ...  In this paper, we introduce a Soft Expert Reward Learning (SERL) model to overcome the reward engineering and generalisation problems of the VLN task.  ...  Soft Expert Reward Learning Model Overview and Problem Definition The Vision-and-Language Navigation task requires an agent placed in an unknown photo-realistic house to understand multi-modal data comprehensively  ... 
arXiv:2007.10835v1 fatcat:njay4brlfracrbqwntwd4ntt7u

Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation [article]

Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan
2021 arXiv   pre-print
Vision-and-Language Navigation (VLN) is a task in which an agent is required to follow a language instruction to navigate to the goal position, relying on ongoing interactions with the environment  ...  natural language navigation by modelling the temporal context explicitly.  ...  Different from other vision-and-language tasks such as VQA and image captioning, which learn relationships between each single image and its corresponding language, VLN aims to learn the joint representation  ... 
arXiv:2111.05759v1 fatcat:eyceb3ftfzd6rmsdx6tlwyut4u

Contrastive Instruction-Trajectory Learning for Vision-Language Navigation [article]

Xiwen Liang, Fengda Zhu, Yi Zhu, Bingqian Lin, Bing Wang, Xiaodan Liang
2021 arXiv   pre-print
These problems hinder agents from learning distinctive vision-and-language representations, harming the robustness and generalizability of the navigation policy.  ...  for robust navigation.  ...  Related Work Vision-and-Language Navigation Learning navigation with vision-language clues has attracted a lot of attention from researchers.  ... 
arXiv:2112.04138v2 fatcat:3scvi3vzfvbcpglcfxrgcfbkcy

Diagnosing the Environment Bias in Vision-and-Language Navigation

Yubo Zhang, Hao Tan, Mohit Bansal
2020 Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence  
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations.  ...  We observe that it is neither the language nor the underlying navigational graph, but the low-level visual appearance conveyed by ResNet features, that directly affects the agent model and contributes to this environment  ...  Zhang, and X. Zhou for their helpful suggestions.  ... 
doi:10.24963/ijcai.2020/124 dblp:conf/ijcai/ZhangTB20 fatcat:j4igwg7s5nfszmgaj7s22sdt7y
Showing results 1 — 15 out of 40,213 results