
VisualCOMET: Reasoning about the Dynamic Context of a Still Image [article]

Jae Sung Park, Chandra Bhagavatula, Roozbeh Mottaghi, Ali Farhadi, Yejin Choi
2020 arXiv pre-print
Even from a single frame of a still image, people can reason about the dynamic story of the image before, after, and beyond the frame.  ...  For example, given an image of a man struggling to stay afloat in water, we can reason that the man fell into the water sometime in the past, the intent of that man at the moment is to stay alive, and  ...  Acknowledgements This research was supported in part by NSF (IIS-1524371, IIS-1714566), DARPA under the CwC program through the ARO (W911NF-15-1-0543), DARPA under the MCS program through NIWC Pacific (  ...
arXiv:2004.10796v3 fatcat:xodcxsclmzgp7m6vooygu4e66a

Super-Prompting: Utilizing Model-Independent Contextual Data to Reduce Data Annotation Required in Visual Commonsense Tasks [article]

Navid Rezaei, Marek Z. Reformat
2022 arXiv pre-print
To evaluate our results, we use a dataset focusing on visual commonsense reasoning in time.  ...  Although their performance is impressive, the size of language models can be prohibitive, making them unusable in on-device applications such as sensors or smartphones.  ...  To put it in context, almost $240,000 has been spent on the annotation of the Visual Commonsense Reasoning in Time (VisualCOMET) dataset, and this figure only includes the payment to crowd-workers from  ...
arXiv:2204.11922v1 fatcat:jguoe6vwmjf3tnqs2gnpve64ly

Reasoning about Actions over Visual and Linguistic Modalities: A Survey [article]

Shailaja Keyur Sampat, Maitreya Patel, Subhasish Das, Yezhou Yang, Chitta Baral
2022 arXiv pre-print
While 'Reasoning about Actions & Change' (RAC) has been widely studied in the Knowledge Representation community, it has recently piqued the interest of NLP and computer vision researchers.  ...  'Actions' play a vital role in how humans interact with the world and enable them to achieve desired goals. As a result, most common sense (CS) knowledge for humans revolves around actions.  ...  Reasoning about actions is a relatively new research direction for the vision+language community; still, there exists an array of well-defined tasks and datasets to start with.  ...
arXiv:2207.07568v1 fatcat:os3c266tmnhbdcohedg5pe3dma

Fantastic Data and How to Query Them [article]

Trung-Kien Tran, Anh Le-Tuan, Manh Nguyen-Duc, Jicheng Yuan, Danh Le-Phuoc
2022 arXiv pre-print
It is commonly acknowledged that the availability of huge amounts of (training) data is one of the most important factors for many recent advances in Artificial Intelligence (AI).  ...  In this paper, we present our vision of a unified framework for different datasets so that they can be integrated and queried easily, e.g., using standard query languages.  ...  Other use cases include analytic queries to gain a deeper understanding of the dataset and the performance of DNNs on particular subsets of the data.  ...
arXiv:2201.05026v1 fatcat:prowe54idfbxtkoufpsmlt2uty

PACS: A Dataset for Physical Audiovisual CommonSense Reasoning [article]

Samuel Yu, Peter Wu, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency
2022 arXiv pre-print
Our dataset provides new opportunities to advance the research field of physical reasoning by making audio a core component of this multimodal problem.  ...  In order for AI to be safely deployed in real-world scenarios such as hospitals, schools, and the workplace, it must be able to robustly reason about the physical world.  ...  Acknowledgements This material is based upon work partially supported by the National Science Foundation (Awards #1722822 and #1750439) and National Institutes of Health (Awards #R01MH125740, #R01MH096951  ...
arXiv:2203.11130v3 fatcat:biom6cnkobdutahh52hvgrx65y

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods [article]

Aditya Mogadala and Marimuthu Kalimuthu and Dietrich Klakow
2020 arXiv pre-print
Much of the growth in these fields has been made possible by deep learning, a sub-area of machine learning that uses the principles of artificial neural networks.  ...  Our efforts go beyond earlier surveys, which are either task-specific or concentrate on only one type of visual content, i.e., image or video.  ...  Acknowledgments This work was supported by the German Research Foundation (DFG) as a part of Project-ID 232722074 (SFB 1102).  ...
arXiv:1907.09358v2 fatcat:4fyf6kscy5dfbewll3zs7yzsuq

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

Aditya Mogadala, Marimuthu Kalimuthu, Dietrich Klakow
2021 The Journal of Artificial Intelligence Research  
Much of the growth in these fields has been made possible by deep learning, a sub-area of machine learning that uses artificial neural networks.  ...  Our efforts go beyond earlier surveys, which are either task-specific or concentrate on only one type of visual content, i.e., image or video.  ...  Acknowledgments This work was supported by the German Research Foundation (DFG) as a part of Project-ID 232722074 (SFB 1102).  ...
doi:10.1613/jair.1.11688 fatcat:kvfdrg3bwrh35fns4z67adqp6i

CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images

Shailaja Keyur Sampat, Akshay Kumar, Yezhou Yang, Chitta Baral
2021 Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies   unpublished
Finally, we motivate the development of better vision-language models by providing insights about the capability of diverse architectures to perform joint reasoning over the image-text modality.  ...  Most existing research on visual question answering (VQA) is limited to information explicitly present in an image or a video.  ...  Acknowledgements We are thankful to the anonymous reviewers for the constructive feedback. This work is partially supported by the grants NSF 1816039, DARPA W911NF2020006 and ONR N00014-20-1-2332.  ...
doi:10.18653/v1/2021.naacl-main.289 fatcat:wixwaac2jvf3dew6q2dcob6yxy

Towards Knowledge-capable AI: Agents that See, Speak, Act and Know

Kenneth Marino
2022
We introduce a benchmark for vision and language that requires models with the capability to bring in and reason about knowledge of the world.  ...  We then examine the action modality, first showing that the knowledge inherent in language models can be used to solve a highly complex, semantic crafting task.  ...  Abhinav has been a dogged advocate during my career, providing advice, support, and guidance throughout my PhD.  ...
doi:10.1184/r1/19552225.v1 fatcat:rxhy64wyqfb4rjnizil6hrffge