A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2021; you can also visit the original URL.
The file type is application/pdf.
Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation
2021
Sensors
Due to the development of computer vision and natural language processing technologies in recent years, there has been growing interest in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data, such as images and text. Vision-and-language navigation (VLN) requires the alignment and grounding of multimodal input data to enable real-time perception of the task status from panoramic images and natural language instructions. This study proposes a
doi:10.3390/s21031012
pmid:33540789
pmcid:PMC7867342
fatcat:wgtyeowcrnbdjpq2b2jcufglhq