405 Hits in 7.0 sec

Structured Triplet Learning with POS-tag Guided Attention for Visual Question Answering [article]

Zhe Wang, Xiaoyi Liu, Liangjian Chen, Limin Wang, Yu Qiao, Xiaohui Xie, Charless Fowlkes
2018 arXiv   pre-print
We explore mechanisms of incorporating part-of-speech (POS) tag guided attention, convolutional n-grams, triplet attention interactions between the image, question and candidate answer, and structured  ...  learning for triplets based on image-question pairs.  ...  Jianwei Yang for the helpful discussion.  ... 
arXiv:1801.07853v1 fatcat:tgtksndxdjgzde6qk6hpbz7q4u

Multimodal Differential Network for Visual Question Generation [article]

Badri N. Patro, Sandeep Kumar, Vinod K. Kurmi, Vinay P. Namboodiri
2019 arXiv   pre-print
Images can have multiple visual and language contexts that are relevant for generating questions, namely places, captions, and tags.  ...  Generating natural questions from an image is a semantic task that requires using visual and language modality to learn multimodal representations.  ...  Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451-466. Springer.  ... 
arXiv:1808.03986v2 fatcat:qohveg4uqfbtti6zrvibejrluu

Deep Exemplar Networks for VQA and VQG [article]

Badri N. Patro, Vinay P. Namboodiri
2019 arXiv   pre-print
In this paper, we consider the problem of solving semantic tasks such as 'Visual Question Answering' (VQA), where one aims to answer questions related to an image, and 'Visual Question Generation' (VQG), where one  ...  Thus, just as the incorporation of attention is now considered de facto useful for solving these tasks, incorporating exemplars can similarly be expected to improve any proposed architecture for  ...  We find the part-of-speech (POS) tags present in the caption. POS taggers have been developed for two well-known corpora, the Brown Corpus and the Penn Treebank.  ... 
arXiv:1912.09551v1 fatcat:ofd6uk5ulfhwzdfrdgayoc25ym

Multimodal Differential Network for Visual Question Generation

Badri Narayana Patro, Sandeep Kumar, Vinod Kumar Kurmi, Vinay Namboodiri
2018 Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing  
Images can have multiple visual and language contexts that are relevant for generating questions, namely places, captions, and tags.  ...  Generating natural questions from an image is a semantic task that requires using visual and language modality to learn multimodal representations.  ...  It is a step towards having a natural visual dialog instead of the widely prevalent visual question answering bots.  ... 
doi:10.18653/v1/d18-1434 dblp:conf/emnlp/PatroKKN18 fatcat:2zt3ze4ffjadtgplh22hxakqe4

On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries [article]

Tianze Shi, Chen Zhao, Jordan Boyd-Graber, Hal Daumé III, Lillian Lee
2020 arXiv   pre-print
To explore the utility of fine-grained, lexical-level supervision, we introduce Squall, a dataset that enriches 11,276 WikiTableQuestions English-language questions with manually created SQL equivalents  ...  plus alignments between SQL and question fragments.  ...  Acknowledgments We thank the members of UMD CLIP, Xilun Chen, Jack Hessel, Thomas Müller, Ana Smith, and the anonymous reviewers and meta-reviewer for their suggestions and comments.  ... 
arXiv:2010.11246v1 fatcat:wpmjjsbhffhxpf4ugtfkjjfspq

Customized Image Narrative Generation via Interactive Visual Question Generation and Answering [article]

Andrew Shin, Yoshitaka Ushiku, Tatsuya Harada
2018 arXiv   pre-print
We further attempt to learn the user's interest via repeating such interactive stages, and to automatically reflect the interest in descriptions for new images.  ...  In this paper, we propose a customized image narrative generation task, in which the users are interactively engaged in the generation process by providing answers to the questions.  ...  Acknowledgments This work was partially funded by the ImPACT Program of the Council for Science, Technology, and Innovation (Cabinet Office, Government of Japan), and was partially supported by CREST,  ... 
arXiv:1805.00460v1 fatcat:ly2e4biw7jcdjhgactjofwrga4

Structured Knowledge Discovery from Massive Text Corpus [article]

Chenwei Zhang
2019 arXiv   pre-print
In particular, four problems are studied in this dissertation: Structured Intent Detection for Natural Language Understanding, Structure-aware Natural Language Modeling, Generative Structured Knowledge  ...  with less annotation effort.  ...  Those POS tags are used by the POS tagger in NNID-JM as its default setting.  ... 
arXiv:1908.01837v1 fatcat:j46srlxblfd35cd4z6jkl43iiu

Multimodal Research in Vision and Language: A Review of Current and Emerging Trends [article]

Shagun Uppal, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumder, Soujanya Poria, Roger Zimmermann, Amir Zadeh
2020 arXiv   pre-print
Deep learning and its applications have driven impactful research and development across the diverse range of modalities present in real-world data.  ...  In this paper, we present a detailed overview of the latest trends in research pertaining to visual and language modalities.  ...  For VQA, the dataset D generally consists of visual input-question-answer triplets, wherein the i-th triplet is represented by ⟨I_i, Q_i, A_i⟩.  ... 
arXiv:2010.09522v2 fatcat:l4npstkoqndhzn6hznr7eeys4u
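The triplet representation ⟨I_i, Q_i, A_i⟩ described in the snippet above can be sketched as a simple record type. This is an illustrative sketch only; the field names and string-valued placeholders are assumptions, not the survey's actual data format.

```python
from dataclasses import dataclass

@dataclass
class VQATriplet:
    """One VQA example: visual input I_i, question Q_i, answer A_i."""
    image: str      # image identifier I_i (e.g. a file path)
    question: str   # natural-language question Q_i
    answer: str     # ground-truth answer A_i

# A VQA dataset D is then just a list of such triplets.
dataset = [
    VQATriplet("img_001.jpg", "What is on the table?", "a laptop"),
    VQATriplet("img_002.jpg", "How many dogs are there?", "two"),
]
print(dataset[0].answer)  # a laptop
```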

Natural Language Processing Advancements By Deep Learning: A Survey [article]

Amirsina Torfi, Rouzbeh A. Shirvani, Yaser Keneshloo, Nader Tavaf, Edward A. Fox
2021 arXiv   pre-print
Natural Language Processing (NLP) helps empower intelligent machines by enabling a better understanding of human language for language-based human-computer communication.  ...  This survey categorizes and addresses the different aspects and applications of NLP that have benefited from deep learning.  ...  image, Visual Question Answering (VQA) tries to answer a natural language question about the image [148] .  ... 
arXiv:2003.01200v4 fatcat:riw6vvl24nfvboy56v2zfcidpu

Learning Visual Representations with Caption Annotations [article]

Mert Bulent Sariyildiz, Julien Perez, Diane Larlus
2020 arXiv   pre-print
To tackle this task, we propose hybrid models, with dedicated visual and textual encoders, and we show that the visual representations learned as a by-product of solving this task transfer well to a variety  ...  While one can learn such features on the extensively-annotated ImageNet dataset, recent approaches have looked at ways to allow for noisy, fewer, or even no annotations to perform such pretraining.  ...  Vision and language (VL) have been jointly leveraged to learn cross-modal representations for various VL tasks, such as crossmodal retrieval [21, 61] , visual question answering [25], captioning [56]  ... 
arXiv:2008.01392v1 fatcat:6hf54vnv4bht7ojbyszxbvtxwu

Two-Stage Synthesis Networks for Transfer Learning in Machine Comprehension

David Golub, Po-Sen Huang, Xiaodong He, Li Deng
2017 Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing  
We develop a technique for transfer learning in machine comprehension (MC) using a novel two-stage synthesis network (SynNet).  ...  Given a high-performing MC model in one domain, our technique aims to answer questions about documents in another domain, where we use no labeled data of question-answer pairs.  ...  Acknowledgments We would like to thank Yejin Choi and Luke Zettlemoyer for helpful discussions concerning this work.  ... 
doi:10.18653/v1/d17-1087 dblp:conf/emnlp/GolubHHD17 fatcat:vw7ensnjlre7xprmd5z34a42gu

Two-Stage Synthesis Networks for Transfer Learning in Machine Comprehension [article]

David Golub, Po-Sen Huang, Xiaodong He, Li Deng
2017 arXiv   pre-print
We develop a technique for transfer learning in machine comprehension (MC) using a novel two-stage synthesis network (SynNet).  ...  Given a high-performing MC model in one domain, our technique aims to answer questions about documents in another domain, where we use no labeled data of question-answer pairs.  ...  Acknowledgments We would like to thank Yejin Choi and Luke Zettlemoyer for helpful discussions concerning this work.  ... 
arXiv:1706.09789v3 fatcat:34xfs42eojhtdlbjuckw5v5kqa

Paradigm Shift in Natural Language Processing [article]

Tianxiang Sun, Xiangyang Liu, Xipeng Qiu, Xuanjing Huang
2021 arXiv   pre-print
For example, we usually adopt the sequence labeling paradigm to solve a bundle of tasks such as POS-tagging, NER, and chunking, and adopt the classification paradigm to solve tasks like sentiment analysis.  ...  In the era of deep learning, modeling for most NLP tasks has converged to several mainstream paradigms.  ...  Further, Zhao et al. (2020) formulate the triplet extraction task as multi-turn question answering and solve it with the MRC paradigm.  ... 
arXiv:2109.12575v1 fatcat:vckeva3u3va3vjr6okhuztox4y

Counterfactual Samples Synthesizing for Robust Visual Question Answering

Long Chen, Xin Yan, Jun Xiao, Hanwang Zhang, Shiliang Pu, Yueting Zhuang
2020 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
Although Visual Question Answering (VQA) has realized impressive progress over the last few years, today's VQA models tend to capture superficial linguistic correlations in the train set and fail to generalize  ...  rely on the right visual regions when making decisions. 2) question-sensitive: the model should be sensitive to linguistic variations in the question.  ...  Specifically, we first assign POS tags to each word in the QA using the spaCy POS tagger [19] and extract nouns in QA.  ... 
doi:10.1109/cvpr42600.2020.01081 dblp:conf/cvpr/0016YXZPZ20 fatcat:73eq2b2lizbixceyfdcttaioua
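The noun-extraction step described in the snippet above (POS-tag each word, keep the nouns) can be sketched as follows. The paper uses spaCy's tagger; here the tags are supplied by hand for illustration, and the example sentence and `extract_nouns` helper are assumptions, not code from the paper.

```python
def extract_nouns(tagged_tokens):
    """Return the words whose POS tag marks a noun.

    spaCy's coarse tag set uses NOUN for common nouns and PROPN for
    proper nouns, so both are kept.
    """
    return [word for word, tag in tagged_tokens if tag in ("NOUN", "PROPN")]

# Hypothetical tagging of the QA pair "What color is the man's umbrella? Red"
qa_tokens = [
    ("What", "PRON"), ("color", "NOUN"), ("is", "AUX"), ("the", "DET"),
    ("man", "NOUN"), ("'s", "PART"), ("umbrella", "NOUN"), ("?", "PUNCT"),
    ("Red", "PROPN"),
]
print(extract_nouns(qa_tokens))  # ['color', 'man', 'umbrella', 'Red']
```

In practice the tags would come from running a trained tagger (e.g. spaCy's) over the question and answer text rather than being written out by hand.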

Counterfactual Samples Synthesizing for Robust Visual Question Answering [article]

Long Chen, Xin Yan, Jun Xiao, Hanwang Zhang, Shiliang Pu, Yueting Zhuang
2020 arXiv   pre-print
Although Visual Question Answering (VQA) has realized impressive progress over the last few years, today's VQA models tend to capture superficial linguistic correlations in the train set and fail to generalize  ...  rely on the right visual regions when making decisions. 2) question-sensitive: the model should be sensitive to linguistic variations in the question.  ...  Specifically, we first assign POS tags to each word in the QA using the spaCy POS tagger [19] and extract nouns in QA.  ... 
arXiv:2003.06576v1 fatcat:ojs7jlbmdjc6vfd6noleeqxjt4
Showing results 1–15 of 405