123 Hits in 6.4 sec

Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [article]

Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian
2021 arXiv   pre-print
To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations  ...  To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing  ...  A multi-step reasoning scheme is proposed in (Gan et al. 2019) using joint attention via an RNN for generating a multi-modal representation.  ... 
arXiv:2007.03848v2 fatcat:bqvz6lk3szfv7frgcq4fvfz2ji

(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering [article]

Anoop Cherian and Chiori Hori and Tim K. Marks and Jonathan Le Roux
2022 arXiv   pre-print
Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame.  ...  Next, for the video QA task, we present a novel transformer-based reasoning pipeline that embeds the (2.5+1)D graph into a spatio-temporal hierarchical latent space, where the sub-graphs and their interactions  ...  Similarly, video scene graphs are combined with multimodal Transformers for video dialogs and QA in (Geng et al. 2021) .  ... 
arXiv:2202.09277v2 fatcat:avwo43pepvfbneqauppa3dchmi

Saying the Unseen: Video Descriptions via Dialog Agents [article]

Ye Zhu, Yu Wu, Yi Yang, Yan Yan
2021 arXiv   pre-print
limit the visual input for AI systems and seek a more secure and transparent information medium, i.e., the natural language dialog, to supplement the missing visual information.  ...  as a supplement for incomplete implicit visions.  ...  Our overall contributions for this work can be summarized as follows: • We propose a novel and challenging task that aims to describe an unseen video via two multi-modal dialog agents.  ... 
arXiv:2106.14069v1 fatcat:ryovyvqvmjfl5i7xie4bhd5ram

MHMS: Multimodal Hierarchical Multimedia Summarization [article]

Jielin Qiu, Jiacheng Zhu, Mengdi Xu, Franck Dernoncourt, Trung Bui, Zhaowen Wang, Bo Li, Ding Zhao, Hailin Jin
2022 arXiv   pre-print
., automatically generating cover images and titles for news articles or providing introductions to online videos.  ...  Our MHMS method contains video and textual segmentation and summarization modules, respectively.  ...  Img+Trans [34] applied multi-modal video features including video frames, transcripts, and dialog context for dialog generation.  ... 
arXiv:2204.03734v1 fatcat:gzalrqlobvebvm6otawukq6t64

A Roadmap for Big Model [article]

Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, Yankai Lin, Xu Han (+88 others)
2022 arXiv   pre-print
With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks becomes a popular paradigm.  ...  We introduce 16 specific BM-related topics in those four parts, they are Data, Knowledge, Computing System, Parallel Training System, Language Model, Vision Model, Multi-modal Model, Theory & Interpretability  ...  transformer for learning contextualized video embeddings.  ... 
arXiv:2203.14101v4 fatcat:rdikzudoezak5b36cf6hhne5u4

Graph Neural Networks for Natural Language Processing: A Survey [article]

Lingfei Wu, Yu Chen, Kai Shen, Xiaojie Guo, Hanning Gao, Shucheng Li, Jian Pei, Bo Long
2021 arXiv   pre-print
We propose a new taxonomy of GNNs for NLP, which systematically organizes existing research of GNNs for NLP along three axes: graph construction, graph representation learning, and graph-based encoder-decoder  ...  As a result, there is a surge of interest in developing new deep learning techniques on graphs for a large number of NLP tasks.  ...  Second, for each modality, they apply multi-head self-attention to learn the intra-modal representation.  ... 
arXiv:2106.06090v1 fatcat:zvkhinpcvzbmje4kjpwjs355qu
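The intra-modal multi-head self-attention mentioned in the snippet above can be sketched in a few lines of NumPy. This is a generic illustration of the mechanism, not the survey's (or any cited paper's) implementation; the random projection weights, shapes, and function name are assumptions standing in for learned parameters.

```python
import numpy as np

def multi_head_self_attention(x, num_heads, rng):
    """Toy multi-head self-attention over one modality's token sequence x."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    # Random matrices stand in for the learned Q/K/V/output projections.
    wq, wk, wv, wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
    q, k, v = x @ wq, x @ wk, x @ wv
    # Split each projection into heads: (num_heads, seq_len, d_head)
    split = lambda t: t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    # Scaled dot-product attention, computed per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    heads = weights @ v                               # (num_heads, seq_len, d_head)
    # Concatenate heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ wo

rng = np.random.default_rng(0)
out = multi_head_self_attention(rng.standard_normal((5, 8)), num_heads=2, rng=rng)
print(out.shape)  # (5, 8): same sequence length and model width as the input
```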

Pre-Trained Models: Past, Present and Future [article]

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan Yao, Ao Zhang, Liang Zhang, Wentao Han, Minlie Huang (+12 others)
2021 arXiv   pre-print
It is now the consensus of the AI community to adopt PTMs as the backbone for downstream tasks rather than learning models from scratch.  ...  In this paper, we take a deep look into the history of pre-training, especially its special relation with transfer learning and self-supervised learning, to reveal the crucial position of PTMs in the AI  ...  In addition to image-text PTMs, there are also PTMs for other modalities, such as video and audio.  ... 
arXiv:2106.07139v3 fatcat:kn6gk2bg4jecndvlhhvq32x724

History Aware Multimodal Transformer for Vision-and-Language Navigation [article]

Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev
2021 arXiv   pre-print
HAMT efficiently encodes all the past panoramic observations via a hierarchical vision transformer (ViT), which first encodes individual images with ViT, then models spatial relation between images in  ...  We demonstrate HAMT to be particularly effective for navigation tasks with longer trajectories.  ...  requires an agent to arrive at goal regions based on multi-turn question-answering dialogs.  ... 
arXiv:2110.13309v1 fatcat:crkc5hwjmnctjm5yz6bfldhy44

Game-Based Video-Context Dialogue [article]

Ramakanth Pasunuru, Mohit Bansal
2018 arXiv   pre-print
However, several real-world human interactions also involve dynamic visual context (similar to videos) as well as dialogue exchanges among multiple speakers.  ...  We evaluate these models via retrieval ranking-recall, automatic phrase-matching metrics, as well as human evaluation studies.  ...  videos to Causal And-Or graphs).  ... 
arXiv:1809.04560v2 fatcat:c3pbmhwywrcavdjwfigeozk7cm

Game-Based Video-Context Dialogue

Ramakanth Pasunuru, Mohit Bansal
2018 Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing  
However, several real-world human interactions also involve dynamic visual context (similar to videos) as well as dialogue exchanges among multiple speakers.  ...  We evaluate these models via retrieval ranking-recall, automatic phrase-matching metrics, as well as human evaluation studies.  ...  Acknowledgments We thank the reviewers for their helpful comments.  ... 
doi:10.18653/v1/d18-1012 dblp:conf/emnlp/PasunuruB18 fatcat:jfggyhrdwna3lmr3hp7ioskmle
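The "retrieval ranking-recall" evaluation named in the abstracts above can be sketched as a recall@k check: the fraction of queries whose gold response appears in the top-k of the model's ranked candidates. This is a generic sketch of the metric family, not the paper's evaluation code; the function and variable names are ours.

```python
def recall_at_k(ranked_candidates, gold, k):
    """Fraction of queries whose gold response is ranked in the top k."""
    hits = sum(1 for ranking, g in zip(ranked_candidates, gold) if g in ranking[:k])
    return hits / len(gold)

# Three queries, each with three candidate responses ranked by the model.
rankings = [["a", "b", "c"], ["c", "a", "b"], ["b", "c", "a"]]
gold = ["a", "a", "a"]
print(recall_at_k(rankings, gold, 1))  # gold is top-1 for one of three queries
print(recall_at_k(rankings, gold, 2))  # gold is in the top-2 for two of three
```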

Transferable Representation Learning in Vision-and-Language Navigation [article]

Haoshuo Huang, Vihan Jain, Harsh Mehta, Alexander Ku, Gabriel Magalhaes, Jason Baldridge, Eugene Ie
2019 arXiv   pre-print
Specifically, the representations are adapted to solve both a cross-modal sequence alignment and sequence coherence task.  ...  Our approach adapts pre-trained vision and language representations to relevant in-domain tasks making them more effective for VLN.  ...  Acknowledgements We thank the ICCV 2019 reviewers for their helpful reviews.  ... 
arXiv:1908.03409v2 fatcat:fnhobdpucjf2bpjxlvbimnyl3q

Core Challenges in Embodied Vision-Language Planning [article]

Jonathan Francis, Nariaki Kitamura, Felix Labelle, Xiaopeng Lu, Ingrid Navarro, Jean Oh
2022 arXiv   pre-print
Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing  ...  even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for  ...  , and we thank the JAIR reviewers for their valuable feedback.  ... 
arXiv:2106.13948v4 fatcat:esrtfxpun5ae5kaydjnymf3v6u

Core Challenges in Embodied Vision-Language Planning

Jonathan Francis, Nariaki Kitamura, Felix Labelle, Xiaopeng Lu, Ingrid Navarro, Jean Oh
2022 The Journal of Artificial Intelligence Research  
Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing  ...  even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for  ...  , and we thank the JAIR reviewers for their valuable feedback.  ... 
doi:10.1613/jair.1.13646 fatcat:rmgy6whqefeuvpvte7ultyyjlq

Matching Questions and Answers in Dialogues from Online Forums [article]

Qi Jia, Mengxue Zhang, Shengyao Zhang, Kenny Q. Zhu
2020 arXiv   pre-print
outperforms the state-of-the-art and other strong baselines, particularly for matching long-distance QA pairs.  ...  Given scores computed by the trained model between each non-question turn and its candidate questions, a greedy matching strategy is used for final predictions.  ...  Dynamic time warping (DTW) is another algorithm for measuring similarity between two temporal sequences. It is also widely used in the video-text alignment task [11] and the speech recognition task [32].  ... 
arXiv:2005.09276v2 fatcat:ircgottdpbdlhgbsjsunc35njm
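The dynamic time warping measure mentioned in the snippet above can be sketched with the standard cumulative-cost recurrence. This is a minimal textbook DTW for 1-D sequences, not the paper's implementation:

```python
def dtw_distance(a, b):
    """Return the DTW alignment cost between two 1-D sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] matched again
                                 cost[i][j - 1],      # b[j-1] matched again
                                 cost[i - 1][j - 1])  # one-to-one match
    return cost[n][m]

# Sequences with the same shape align at zero cost even when one is stretched.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # → 0.0
```

Unlike a plain Euclidean comparison, DTW lets one sequence repeat elements of the other, which is why it suits video-text alignment and speech, where the same content unfolds at different rates.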

Ego4D: Around the World in 3,000 Hours of Egocentric Video [article]

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan (+73 others)
2022 arXiv   pre-print
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.  ...  Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event.  ...  The universities acknowledge the usage of commercial software for de-identification of video. brighter.ai was used for redacting videos by some of the universities.  ... 
arXiv:2110.07058v3 fatcat:lgh27km63nhcdcpkvbr2qarsru
Showing results 1 — 15 out of 123 results