45,890 Hits in 4.2 sec

ORD: Object Relationship Discovery for Visual Dialogue Generation [article]

Ziwei Wang, Zi Huang, Yadan Luo, Huimin Lu
2020 arXiv   pre-print
With the rapid advancement of image captioning and visual question answering at the single-round level, the question of how to generate multi-round dialogue about visual content has not yet been well explored.  ...  Existing visual dialogue methods encode the image directly into a fixed feature vector, concatenated with the question and history embeddings to predict the response.  ...  Some recent methods tackle the co-reference  ...  HieCoAtt-QI-D [25] : The Hierarchical Question-Image Co-Attention model utilises visual and hierarchical representations of the question in a joint framework for VQA.  ... 
arXiv:2006.08322v1 fatcat:6ic2p2p2zbcj5jeaqp35e5hlwq

DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue

Xiaoze Jiang, Jing Yu, Zengchang Qin, Yingying Zhuang, Xingxing Zhang, Yue Hu, Qi Wu
2020 PROCEEDINGS OF THE THIRTIETH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE TWENTY-EIGHTH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE  
Different from the Visual Question Answering task, which requires answering only one question about an image, Visual Dialogue involves multiple questions which cover a broad range of visual content that could  ...  Furthermore, on top of such multi-view image features, we propose a feature selection framework which is able to adaptively capture question-relevant information hierarchically at a fine-grained level.  ...  The typical solution for visual dialogue is to first fuse visual (i.e. image) features and textual (i.e. dialogue history, current question) features together and then infer the correct answer.  ... 
doi:10.1609/aaai.v34i07.6769 fatcat:pxyktd6kq5gwlg2e36kfyo35my

Video Question Answering via Hierarchical Spatio-Temporal Attention Networks

Zhou Zhao, Qifan Yang, Deng Cai, Xiaofei He, Yueting Zhuang
2017 Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence  
We propose the hierarchical spatio-temporal attention network for learning the joint representation of the dynamic video contents according to the given question.  ...  However, the existing visual question answering works only focus on the static image, and may not apply effectively to video question answering due to the temporal dynamics of video contents.  ...  To exploit complex visual question answering tasks, the QRU method [Li and Jia, 2016] employs a reasoning process with an attention mechanism that iteratively selects the relevant image regions for the question  ... 
doi:10.24963/ijcai.2017/492 dblp:conf/ijcai/ZhaoYCHZ17 fatcat:6pxv556elzejhnl4uw34oaihqu

Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network

Zhou Zhao, Xinghua Jiang, Deng Cai, Jun Xiao, Xiaofei He, Shiliang Pu
2018 Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence  
However, the existing visual question answering methods mainly tackle the problem of single-turn video question answering, and may not apply effectively to multi-turn video question answering directly  ...  We next devise the hierarchical attention context network learning method with a multi-step reasoning process for multi-turn video question answering.  ...  question answering, due to the lack of modeling of the visual conversation context for answer inference.  ... 
doi:10.24963/ijcai.2018/513 dblp:conf/ijcai/ZhaoJCXHP18 fatcat:xv5wzmfa3vgopduberyvbeulxe

DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue [article]

Xiaoze Jiang, Jing Yu, Zengchang Qin, Yingying Zhuang, Xingxing Zhang, Yue Hu, Qi Wu
2019 arXiv   pre-print
Different from the Visual Question Answering task, which requires answering only one question about an image, Visual Dialogue involves multiple questions which cover a broad range of visual content that could  ...  Furthermore, on top of such multi-view image features, we propose a feature selection framework which is able to adaptively capture question-relevant information hierarchically at a fine-grained level.  ...  The typical solution for visual dialogue is to first fuse visual (i.e. image) features and textual (i.e. dialogue history, current question) features together and then infer the correct answer.  ... 
arXiv:1911.07251v1 fatcat:of6xfs5ofndtff4fsaezzzzuwi

Video Question Answering via Hierarchical Dual-Level Attention Network Learning

Zhou Zhao, Jinghao Lin, Xinghua Jiang, Deng Cai, Xiaofei He, Yueting Zhuang
2017 Proceedings of the 2017 ACM on Multimedia Conference - MM '17  
However, the existing visual question answering approaches mainly tackle the problem of static image question answering, and may not apply effectively to video question answering directly, due to  ...  Video question answering is a challenging task in visual information retrieval, which provides the accurate answer from the referenced video contents according to the given question.  ...  [24] propose a hierarchical recurrent neural encoder that exploits temporal information for video representation learning. Xu et al.  ... 
doi:10.1145/3123266.3123364 dblp:conf/mm/ZhaoLJCHZ17 fatcat:vjag7l4gsbdjzcidiuhchdshsm

CQ-VQA: Visual Question Answering on Categorized Questions [article]

Aakansha Mishra, Ashish Anand, Prithwijit Guha
2020 arXiv   pre-print
This paper proposes CQ-VQA, a novel 2-level hierarchical but end-to-end model to solve the task of visual question answering (VQA).  ...  The QC uses attended and fused features of the input question and image.  ...  CQ-VQA: Learning the Model This work proposes a hierarchical model for visual question answering. This hierarchical model has two levels.  ... 
arXiv:2002.06800v1 fatcat:mwkyk3djyracxbvzrwvbh4f6ke

Transfer Learning via Unsupervised Task Discovery for Visual Question Answering [article]

Hyeonwoo Noh, Taehoon Kim, Jonghwan Mun, Bohyung Han
2019 arXiv   pre-print
WordNet) and visual descriptions for unsupervised task discovery, and transfer a learned task conditional visual classifier as an answering unit in a visual question answering model.  ...  We study how to leverage off-the-shelf visual and linguistic data to cope with out-of-vocabulary answers in visual question answering task.  ...  Conclusion We present a transfer learning approach for visual question answering with out-of-vocabulary answers.  ... 
arXiv:1810.02358v2 fatcat:5cnfwaimsbbmff44xofmzpme2y

Learning to Compose Dynamic Tree Structures for Visual Contexts [article]

Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, Wei Liu
2018 arXiv   pre-print
Experimental results on two benchmarks, which require reasoning over contexts: Visual Genome for scene graph generation and VQA2.0 for visual Q&A, show that VCTree outperforms state-of-the-art results  ...  tree encodes the inherent parallel/hierarchical relationships among objects, e.g., "clothes" and "pants" usually co-occur and belong to "person"; 2) the dynamic structure varies from image to image  ...  For the proposed VCTREE, we assigned different learnable matrices to the hidden states from the left-branch parents and right-branch parents.  ... 
arXiv:1812.01880v1 fatcat:dkg6hgwrtjepdp2tdtn3njkezi

Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks

Zhou Zhao, Zhu Zhang, Shuwen Xiao, Zhou Yu, Jun Yu, Deng Cai, Fei Wu, Yueting Zhuang
2018 Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence  
decoder network to generate the natural language answer for open-ended video question answering.  ...  Open-ended long-form video question answering is a challenging problem in visual information retrieval, which automatically generates the natural language answer from the referenced long-form video content  ...  supported by the National Natural Science Foundation of China under Grant No.61602405, No.61702143, No.61622205 and No.61472110, Sponsored by CCF-Tencent Open Research Fund and the China Knowledge Centre for  ... 
doi:10.24963/ijcai.2018/512 dblp:conf/ijcai/ZhaoZXYYCWZ18 fatcat:n6yvgkp54rh6hksn2yi6pxdoki

Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering [article]

Jungin Park, Jiyoung Lee, Kwanghoon Sohn
2021 arXiv   pre-print
To realize this, we learn question-conditioned visual graphs by exploiting the relation between video and question, enabling each visual node, via question-to-visual interactions, to encompass both visual  ...  As a result, our method can learn the question-conditioned visual representations attributed to appearance and motion that show powerful capability for video question answering.  ...  For visual question answering (VQA), Teney et al.  ... 
arXiv:2104.14085v1 fatcat:qguxecsdajbnpo5gds2i2eraii

Cartesian vs. Radial – A Comparative Evaluation of Two Visualization Tools [chapter]

Michael Burch, Felix Bott, Fabian Beck, Stephan Diehl
2008 Lecture Notes in Computer Science  
In this work we compare a radial and a Cartesian variant of a visualization tool for sequences of transactions in information hierarchies.  ...  Many recently developed information visualization techniques are radial variants of originally Cartesian visualizations.  ...  We found that a data set representing the number of ball contacts of players in a sequence of moves contains all the features that we need for the visualization tools.  ... 
doi:10.1007/978-3-540-89639-5_15 fatcat:m26sznu7xradnnjloujjd63bcm

Auto-Parsing Network for Image Captioning and Visual Question Answering [article]

Xu Yang and Chongyang Gao and Hanwang Zhang and Jianfei Cai
2021 arXiv   pre-print
Specifically, we showcase that our APN can strengthen Transformer-based networks in two major vision-language tasks: Captioning and Visual Question Answering.  ...  We propose an Auto-Parsing Network (APN) to discover and exploit the input data's hidden tree structures for improving the effectiveness of Transformer-based vision-language systems.  ...  PGM probabilities on self-attention layers and exploiting hierarchical constraints into PGM probabilities. • We design two different APNs for solving Image Captioning and Visual Question Answering. •
arXiv:2108.10568v1 fatcat:ugkinzzxmfhdlmmiowlmuoecie

HAF-SVG: Hierarchical Stochastic Video Generation with Aligned Features

Zhihui Lin, Chun Yuan, Maomao Li
2020 Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence  
Experiments on Moving-MNIST, BAIR, and KTH datasets demonstrate that the hierarchical structure is helpful for modeling more accurate future uncertainty, and the feature aligner is beneficial for generating realistic  ...  The proposed model is named Hierarchical Stochastic Video Generation network with Aligned Features, referred to as HAF-SVG.  ...  Although they hierarchically model the video representation, they do not fully exploit the textual features, and hence lack the capability to capture the video-related semantics of the question.  ... 
doi:10.24963/ijcai.2020/139 dblp:conf/ijcai/CaiYSLCS20 fatcat:4espn7s2affjzgys6tliidolju

Guest Editorial Introduction to the Special Section on Video and Language

Tao Mei, Jason J. Corso, Gunhee Kim, Jiebo Luo, Chunhua Shen, Hanwang Zhang
2022 IEEE transactions on circuits and systems for video technology (Print)  
visual captioning; and 3) the modeling of action-centric interaction among frames for video question answering.  ...  Video Question Answering In [A6] , Zhang et al. propose an action-centric relation transformer network (ACRTransformer) for video question answering (VideoQA).  ...  He was elected as a Fellow of IAPR in 2016, and a Distinguished Scientist of ACM in 2016, for his contributions to large-scale multimedia analysis and applications.  ... 
doi:10.1109/tcsvt.2021.3137430 fatcat:ksel3hruujgwfpwalj4u5ebebu
Showing results 1 — 15 out of 45,890 results