An experimental study of the vision-bottleneck in VQA [article]

Pierre Marza, Corentin Kervadec, Grigory Antipov, Moez Baccouche, Christian Wolf
2022 arXiv   pre-print
In this paper, we propose an in-depth study of the vision-bottleneck in VQA, experimenting with both the quantity and quality of visual objects extracted from images.  ...  We also study the impact of two methods to incorporate the information about objects necessary for answering a question, in the reasoning module directly, and earlier in the object selection stage.  ...  The impact of object detection quality We evaluate the general impact of the quality of an object detection system on VQA performance.  ... 
arXiv:2202.06858v1 fatcat:idzgrcelu5cezdbyiflcdsgg6i

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text [article]

Difei Gao, Ke Li, Ruiping Wang, Shiguang Shan, Xilin Chen
2020 arXiv   pre-print
Experimental evaluations show that our MM-GNN represents the scene texts better and obviously facilitates the performances on two VQA tasks that require reading scene texts.  ...  Answering questions that require reading texts in an image is challenging for current models.  ...  This work is partially supported by Natural Science Foundation of China under contracts Nos. 61922080, U19B2036, 61772500, and CAS Frontier Science Key Research Project No. QYZDJ-SSWJSC009.  ... 
arXiv:2003.13962v1 fatcat:ifofez5zjjdlrf2i7iptppe4oe

All-in-One Image-Grounded Conversational Agents [article]

Da Ju, Kurt Shuster, Y-Lan Boureau, Jason Weston
2020 arXiv   pre-print
We design an architecture that combines state-of-the-art Transformer and ResNeXt modules fed into a novel attentive multimodal module to produce a combined model trained on many tasks.  ...  Our final models provide a single system that obtains good results on all vision and language tasks considered, and improves the state-of-the-art in image-grounded conversational applications.  ...  Multimodal Combiner We next assess the impact of the multimodal combiner module in our architecture; we first analyze the non-attentive version.  ... 
arXiv:1912.12394v2 fatcat:heqeepgwsbbdfgdixum3cwb33y

C3DVQA: Full-Reference Video Quality Assessment with 3D Convolutional Neural Network [article]

Munan Xu, Junming Chen, Haiqiang Wang, Shan Liu, Ge Li, Zhiqiang Bai
2020 arXiv   pre-print
However, video quality exhibits different characteristics from static image quality due to the existence of temporal masking effects.  ...  We empirically found that 3D convolutional layers are capable to capture temporal masking effects of videos. We evaluated the proposed method on the LIVE and CSIQ datasets.  ...  Besides, motion-related distortions also have an impact on the perceived quality.  ... 
arXiv:1910.13646v2 fatcat:zfad7igdj5dqlkw4nr5jiallgi

In Defense of Grid Features for Visual Question Answering [article]

Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen
2020 arXiv   pre-print
In this paper, we revisit grid features for VQA, and find they can work surprisingly well - running more than an order of magnitude faster with the same accuracy (e.g. if pre-trained in a similar fashion  ...  Through extensive experiments, we verify that this observation holds true across different VQA models (reporting a state-of-the-art accuracy on VQA 2.0 test-std, 72.71), datasets, and generalizes well  ...  The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the  ... 
arXiv:2001.03615v2 fatcat:bxa2fqiq5nfelm3smqpsjmz6oe

Video Quality Assessment Based on Measuring Perceptual Noise From Spatial and Temporal Perspectives

Yin Zhao, Lu Yu, Zhenzhong Chen, Ce Zhu
2011 IEEE transactions on circuits and systems for video technology (Print)  
Video quality assessment (VQA) exploits important properties of the sophisticated human visual system (HVS).  ...  In this paper, we study a series of fundamental HVS characteristics for subjective video quality assessment, and incorporate them into a systematic framework to simulate subjective evaluation on impaired  ...  Politecnico di Milano, Milan, Italy, for kindly providing the VQA databases.  ... 
doi:10.1109/tcsvt.2011.2157189 fatcat:oys4wlb5krer7kiw7kyweu4h3e

Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA [article]

Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach
2020 arXiv   pre-print
Furthermore, it enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification.  ...  For example, a deep water label on a warning sign warns people about the danger in the scene.  ...  What is the danger? Previous work: water Our model: deep water TextVQA Figure 3. Accuracy under different maximum decoding steps T on the validation set of TextVQA, ST-VQA, and OCR-VQA.  ... 
arXiv:1911.06258v3 fatcat:c4zkfcdk2bgkdiljbeig6dcedq

Cross-Modal Knowledge Reasoning for Knowledge-based Visual Question Answering

Jing Yu, Zihao Zhu, Yujing Wang, Weifeng Zhang, Yue Hu, Jianlong Tan
2020 Pattern Recognition  
By stacking the modules multiple times, our model performs transitive reasoning and obtains question-oriented concept representations under the constrain of different modalities.  ...  Inspired by the human cognition theory, in this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.  ...  We further evaluate the influence of different number of reasoning steps T in the GRUC network.  ... 
doi:10.1016/j.patcog.2020.107563 fatcat:ezlkrzacbnddfh7f573vqouhne

LaTr: Layout-Aware Transformer for Scene-Text VQA [article]

Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, R. Manmatha
2021 arXiv   pre-print
Thus, we first investigate the impact of each modality, and reveal the importance of the language module, especially when enriched with layout information.  ...  In particular, +7.6% on TextVQA, +10.8% on ST-VQA and +4.0% on OCR-VQA (all absolute accuracy numbers).  ...  OCR-VQA Results As commonly done by previous work [20], we only evaluate our model using the constrained setting. In this setting, we do not change the OCR system, i.e. we use Rosetta OCR system.  ... 
arXiv:2112.12494v2 fatcat:chdp2ozx5vfmromsdxksjwf63e

Trying Bilinear Pooling in Video-QA [article]

Thomas Winterbottom, Sarah Xiao, Alistair McLean, Noura Al Moubayed
2020 arXiv   pre-print
Bilinear pooling (BLP) refers to a family of operations recently developed for fusing features from different modalities predominantly developed for VQA models.  ...  Our experiments include both simply replacing feature concatenation in the existing models with BLP, and a modified version of the TVQA baseline to accommodate BLP we name the 'dual-stream' model.  ...  The overlap of language and vision has a key area of research, in particular visual question answering (VQA) [1, 2], i.e. answer a question about an image (surveyed here [3, 4]).  ... 
arXiv:2012.10285v1 fatcat:qc4ct34dgrbspjzl46fienmv7m

A Completely Blind Video Integrity Oracle

Anish Mittal, Michele A. Saad, Alan C. Bovik
2016 IEEE Transactions on Image Processing  
The new model does not require the use of any additional information other than the video being quality evaluated.  ...  Here, we attempt to bridge this gap by developing a new VQA model called the video intrinsic integrity and distortion evaluation oracle (VIIDEO).  ...  The performance numbers reported for VIIDEO in what different. However, the test sets used to evaluate VIIDEO in these separate analysis differ.  ... 
doi:10.1109/tip.2015.2502725 pmid:26599970 fatcat:o3lhdbz26fbspcbnosp3yq2nz4

Diagnostically Resilient Encoding, Wireless Transmission, and Quality Assessment of Medical Video

A.S. Panayides, C.S. Pattichis, A. Pitsillides, M.S. Pattichis
2011 Zenodo  
Quality assessment is based on a new clinical rating system that provides for independent evaluations of the different parts of the video (subjective).  ...  A new framework for effective communication and evaluation of wireless medical video over error-prone channels is proposed.  ...  An in-depth analysis of the approaches and methods of each of the components parting the proposed system and how they integrate in an efficient system design is depicted.  ... 
doi:10.5281/zenodo.2592409 fatcat:bq5moo4ctjasjomrnnvw5wcqzu

QoE Modeling for HTTP Adaptive Video Streaming - A Survey and Open Challenges

Nabajeet Barman, Maria G. Martini
2019 IEEE Access  
The main contribution of this paper is to present a comprehensive overview of recent and currently undergoing works in the field of QoE modeling for HAS.  ...  With the recent increased usage of video services, the focus has recently shifted from the traditional quality of service-based video delivery to quality of experience (QoE)-based video delivery.  ...  The model consists of three modules, a video module Pv, an audio module Pa and an audio-visual integration module, Pq.  ... 
doi:10.1109/access.2019.2901778 fatcat:aedkodlyifchndfgosmdvnxp6i

Towards Perceptually Optimized End-to-end Adaptive Video Streaming [article]

Christos G. Bampis, Zhi Li, Ioannis Katsavounidis, Te-Yuan Huang, Chaitanya Ekanadham, Alan C. Bovik
2018 arXiv   pre-print
Using our database, we study the effects of multiple streaming dimensions on user experience and evaluate video quality and quality of experience models.  ...  Our database builds on recent advancements in content-adaptive encoding and incorporates actual network traces to capture realistic network variations on the client device.  ...  In our model system, the video quality module performs perceptual video quality calculations that are fed to the encoding module, in order to determine an appropriate bitrate ladder (the set of target  ... 
arXiv:1808.03898v1 fatcat:i5jkeyopfvc2tiwlkiwkuioqjq

IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning [article]

Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, Song-Chun Zhu
2022 arXiv   pre-print
In this work, we introduce a new challenge of Icon Question Answering (IconQA) with the goal of answering a question in an icon image context.  ...  We conduct extensive user studies and blind experiments and reproduce a wide range of advanced VQA methods to benchmark the IconQA task.  ...  The metric of Top-5 accuracy is used to evaluate different model setups and the evaluation results are summarized in Table 8 .  ... 
arXiv:2110.13214v3 fatcat:u4vh5gtuyzdlfbzggoxu3qchdu