An experimental study of the vision-bottleneck in VQA
[article]
2022
arXiv
pre-print
In this paper, we propose an in-depth study of the vision-bottleneck in VQA, experimenting with both the quantity and quality of visual objects extracted from images. ...
We also study the impact of two methods of incorporating the object information necessary for answering a question: directly in the reasoning module, and earlier, in the object selection stage. ...
The impact of object detection quality: We evaluate the general impact of the quality of an object detection system on VQA performance. ...
arXiv:2202.06858v1
fatcat:idzgrcelu5cezdbyiflcdsgg6i
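The snippet above contrasts injecting question information directly into the reasoning module versus the earlier object selection stage. A minimal sketch of question-guided object selection, assuming hypothetical precomputed detector features and a question embedding (all names and shapes are illustrative, not from the paper):

```python
import numpy as np

def select_objects(obj_feats, q_emb, k=36):
    """Keep the k detected objects most relevant to the question.

    obj_feats: (N, d) features of N detected objects.
    q_emb:     (d,) question embedding.
    """
    scores = obj_feats @ q_emb              # relevance of each object
    keep = np.argsort(scores)[::-1][:k]     # indices of the top-k objects
    return obj_feats[keep], scores[keep]

# Toy usage: 100 detected objects with 512-d features.
rng = np.random.default_rng(0)
objs, q = rng.normal(size=(100, 512)), rng.normal(size=512)
selected, rel = select_objects(objs, q, k=36)
print(selected.shape)  # (36, 512)
```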
Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text
[article]
2020
arXiv
pre-print
Experimental evaluations show that our MM-GNN represents scene texts better and clearly improves performance on two VQA tasks that require reading scene texts. ...
Answering questions that require reading texts in an image is challenging for current models. ...
This work is partially supported by Natural Science Foundation of China under contracts Nos. 61922080, U19B2036, 61772500, and CAS Frontier Science Key Research Project No. QYZDJ-SSWJSC009. ...
arXiv:2003.13962v1
fatcat:ifofez5zjjdlrf2i7iptppe4oe
All-in-One Image-Grounded Conversational Agents
[article]
2020
arXiv
pre-print
We design an architecture that combines state-of-the-art Transformer and ResNeXt modules fed into a novel attentive multimodal module to produce a combined model trained on many tasks. ...
Our final models provide a single system that obtains good results on all vision and language tasks considered, and improves the state-of-the-art in image-grounded conversational applications. ...
Multimodal Combiner: We next assess the impact of the multimodal combiner module in our architecture; we first analyze the non-attentive version. ...
arXiv:1912.12394v2
fatcat:heqeepgwsbbdfgdixum3cwb33y
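The abstract describes an attentive multimodal module that fuses Transformer text features with ResNeXt image features. A rough sketch of attention-weighted fusion over two pooled modality vectors, with assumed dimensions (the paper's module is considerably richer than this):

```python
import numpy as np

def attentive_combine(text_vec, img_vec, w):
    """Weight each modality by an attention score, then sum.

    text_vec, img_vec: (d,) pooled features from each encoder.
    w: (d,) scoring vector standing in for learned parameters.
    """
    feats = np.stack([text_vec, img_vec])        # (2, d)
    logits = feats @ w                           # one score per modality
    att = np.exp(logits) / np.exp(logits).sum()  # softmax over modalities
    return att @ feats                           # (d,) fused representation

rng = np.random.default_rng(1)
fused = attentive_combine(rng.normal(size=16), rng.normal(size=16),
                          rng.normal(size=16))
print(fused.shape)  # (16,)
```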
C3DVQA: Full-Reference Video Quality Assessment with 3D Convolutional Neural Network
[article]
2020
arXiv
pre-print
However, video quality exhibits different characteristics from static image quality due to the existence of temporal masking effects. ...
We empirically found that 3D convolutional layers are capable of capturing temporal masking effects in videos. We evaluated the proposed method on the LIVE and CSIQ datasets. ...
In addition, motion-related distortions also have an impact on perceived quality. ...
arXiv:1910.13646v2
fatcat:zfad7igdj5dqlkw4nr5jiallgi
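The claim is that 3D convolutions capture temporal masking because their kernels span neighbouring frames as well as space. A minimal PyTorch illustration of a 3D convolution over a clip-shaped tensor, such as the residual between reference and distorted frames (layer sizes are placeholders, not the C3DVQA architecture):

```python
import torch
import torch.nn as nn

# A clip-shaped input: (batch, channels, frames, height, width),
# e.g. the residual between reference and distorted frames.
residual = torch.randn(1, 1, 16, 64, 64)

# A 3x3x3 kernel mixes information across 3 consecutive frames, so its
# response depends on temporal context rather than each frame alone.
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
features = conv3d(residual)
print(features.shape)  # torch.Size([1, 8, 16, 64, 64])
```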
In Defense of Grid Features for Visual Question Answering
[article]
2020
arXiv
pre-print
In this paper, we revisit grid features for VQA, and find they can work surprisingly well - running more than an order of magnitude faster with the same accuracy (e.g. if pre-trained in a similar fashion ...
Through extensive experiments, we verify that this observation holds true across different VQA models (reporting a state-of-the-art accuracy on VQA 2.0 test-std, 72.71), datasets, and generalizes well ...
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ...
arXiv:2001.03615v2
fatcat:bxa2fqiq5nfelm3smqpsjmz6oe
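Grid features skip the detector entirely: the backbone's final feature map is flattened into a grid of vectors, which is where the order-of-magnitude speedup over region proposals comes from. A schematic of that extraction step, assuming a generic CNN output shape:

```python
import torch

# Backbone output for one image: (channels, H, W) feature map.
feat_map = torch.randn(2048, 19, 29)

# Region features would require running an object detector; grid
# features simply treat every spatial cell as one "object" vector.
C, H, W = feat_map.shape
grid_feats = feat_map.reshape(C, H * W).T   # (H*W, C) = 551 vectors
print(grid_feats.shape)                     # torch.Size([551, 2048])
```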
Video Quality Assessment Based on Measuring Perceptual Noise From Spatial and Temporal Perspectives
2011
IEEE transactions on circuits and systems for video technology (Print)
Video quality assessment (VQA) exploits important properties of the sophisticated human visual system (HVS). ...
In this paper, we study a series of fundamental HVS characteristics for subjective video quality assessment, and incorporate them into a systematic framework to simulate subjective evaluation on impaired ...
Politecnico di Milano, Milan, Italy, for kindly providing the VQA databases. ...
doi:10.1109/tcsvt.2011.2157189
fatcat:oys4wlb5krer7kiw7kyweu4h3e
Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA
[article]
2020
arXiv
pre-print
Furthermore, it enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification. ...
For example, a deep water label on a warning sign warns people about the danger in the scene. ...
[Figure example] Q: "What is the danger?" Previous work: "water". Our model: "deep water". (TextVQA)
Figure 3. Accuracy under different maximum decoding steps T on the validation set of TextVQA, ST-VQA, and OCR-VQA. ...
arXiv:1911.06258v3
fatcat:c4zkfcdk2bgkdiljbeig6dcedq
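Multi-step decoding with a dynamic pointer lets the model assemble answers like "deep water" token by token, choosing at each step between fixed-vocabulary words and OCR tokens read from the image. A toy sketch of a single pointer step; all names and shapes are invented for illustration:

```python
import torch

def pointer_step(dec_state, vocab_emb, ocr_emb):
    """Score vocabulary words and image OCR tokens jointly.

    dec_state: (d,) decoder state at the current step.
    vocab_emb: (V, d) embeddings of fixed-vocabulary words.
    ocr_emb:   (M, d) embeddings of OCR tokens found in the image.
    """
    logits = torch.cat([vocab_emb @ dec_state,   # (V,) vocab scores
                        ocr_emb @ dec_state])    # (M,) pointer scores
    return logits.argmax()  # index < V: vocab word; otherwise OCR token

# Repeating this step T times, feeding each prediction back into the
# decoder, yields multi-step answers instead of one-shot classification.
d, V, M = 8, 100, 5
step = pointer_step(torch.randn(d), torch.randn(V, d), torch.randn(M, d))
print(int(step))
```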
Cross-Modal Knowledge Reasoning for Knowledge-based Visual Question Answering
2020
Pattern Recognition
By stacking the modules multiple times, our model performs transitive reasoning and obtains question-oriented concept representations under the constraint of different modalities. ...
Inspired by the human cognition theory, in this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views. ...
We further evaluate the influence of different numbers of reasoning steps T in the GRUC network. ...
doi:10.1016/j.patcog.2020.107563
fatcat:ezlkrzacbnddfh7f573vqouhne
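Stacking the module T times means the concept representation is refined iteratively, each step conditioned on the question. A bare-bones sketch of such a recurrent refinement loop; the update rule here is a stand-in, not GRUC itself:

```python
import numpy as np

def reason(concept, question, T=3):
    """Refine a concept vector over T reasoning steps."""
    for _ in range(T):
        # Stand-in update: mix in the question and squash.
        concept = np.tanh(concept + 0.5 * question)
    return concept

refined = reason(np.zeros(4), np.ones(4), T=3)
print(refined)
```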
LaTr: Layout-Aware Transformer for Scene-Text VQA
[article]
2021
arXiv
pre-print
Thus, we first investigate the impact of each modality, and reveal the importance of the language module, especially when enriched with layout information. ...
In particular, +7.6% on TextVQA, +10.8% on ST-VQA and +4.0% on OCR-VQA (all absolute accuracy numbers). ...
OCR-VQA Results: As commonly done in previous work [20], we evaluate our model only in the constrained setting, in which we do not change the OCR system, i.e. we use the Rosetta OCR system. ...
arXiv:2112.12494v2
fatcat:chdp2ozx5vfmromsdxksjwf63e
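"Enriched with layout information" means the language module sees where each OCR token sits on the page, not just what it says. One common way to achieve this (a hedged sketch; LaTr's exact embedding scheme is described in the paper) is to add bounding-box embeddings to the token embeddings:

```python
import torch
import torch.nn as nn

d = 64
token_emb = nn.Embedding(30522, d)   # wordpiece vocabulary (assumed size)
coord_emb = nn.Linear(4, d)          # maps (x1, y1, x2, y2) boxes to d dims

tokens = torch.tensor([101, 2600, 102])   # toy token ids
boxes = torch.rand(3, 4)                  # normalized box per token
layout_aware = token_emb(tokens) + coord_emb(boxes)
print(layout_aware.shape)                 # torch.Size([3, 64])
```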
Trying Bilinear Pooling in Video-QA
[article]
2020
arXiv
pre-print
Bilinear pooling (BLP) refers to a family of operations for fusing features from different modalities, developed predominantly for VQA models. ...
Our experiments include both simply replacing feature concatenation in the existing models with BLP, and a modified version of the TVQA baseline to accommodate BLP, which we name the 'dual-stream' model. ...
The overlap of language and vision is a key area of research, in particular visual question answering (VQA) [1, 2], i.e. answering a question about an image (surveyed in [3, 4]). ...
arXiv:2012.10285v1
fatcat:qc4ct34dgrbspjzl46fienmv7m
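Where concatenation merely stacks the two feature vectors, bilinear pooling models multiplicative interactions between every pair of dimensions. A compact sketch in the spirit of factorized BLP variants such as MFB, with illustrative dimensions and random stand-ins for learned projections:

```python
import numpy as np

def factorized_bilinear(x, y, Wx, Wy, k=5):
    """Low-rank bilinear fusion in the spirit of MFB.

    x, y:   modality features of shapes (dx,) and (dy,).
    Wx, Wy: projections into a shared (k * o)-dimensional space.
    """
    z = (Wx @ x) * (Wy @ y)             # elementwise interaction
    z = z.reshape(-1, k).sum(axis=1)    # sum-pool over rank k -> (o,)
    return z / (np.linalg.norm(z) + 1e-8)

rng = np.random.default_rng(0)
x, y = rng.normal(size=128), rng.normal(size=256)
Wx, Wy = rng.normal(size=(50, 128)), rng.normal(size=(50, 256))
fused = factorized_bilinear(x, y, Wx, Wy, k=5)   # (10,) fused vector
print(fused.shape)
```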
A Completely Blind Video Integrity Oracle
2016
IEEE Transactions on Image Processing
The new model does not require the use of any additional information other than the video being quality evaluated. ...
Here, we attempt to bridge this gap by developing a new VQA model called the video intrinsic integrity and distortion evaluation oracle (VIIDEO). ...
The performance numbers reported for VIIDEO in ... are somewhat different. However, the test sets used to evaluate VIIDEO in these separate analyses differ. ...
doi:10.1109/tip.2015.2502725
pmid:26599970
fatcat:o3lhdbz26fbspcbnosp3yq2nz4
Diagnostically Resilient Encoding, Wireless Transmission, and Quality Assessment of Medical Video
2011
Zenodo
Quality assessment is based on a new clinical rating system that provides for independent evaluations of the different parts of the video (subjective). ...
A new framework for effective communication and evaluation of wireless medical video over error-prone channels is proposed. ...
An in-depth analysis of the approaches and methods of each of the components comprising the proposed system, and of how they integrate into an efficient system design, is presented. ...
doi:10.5281/zenodo.2592409
fatcat:bq5moo4ctjasjomrnnvw5wcqzu
QoE Modeling for HTTP Adaptive Video Streaming - A Survey and Open Challenges
2019
IEEE Access
The main contribution of this paper is to present a comprehensive overview of recent and currently undergoing works in the field of QoE modeling for HAS. ...
With the recent increase in the usage of video services, the focus has shifted from traditional quality-of-service-based video delivery to quality-of-experience (QoE)-based video delivery. ...
The model consists of three modules: a video module Pv, an audio module Pa, and an audio-visual integration module Pq. ...
doi:10.1109/access.2019.2901778
fatcat:aedkodlyifchndfgosmdvnxp6i
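This three-module structure mirrors standardized QoE models such as ITU-T P.1203: separate video and audio scores are computed and then integrated into an overall score Pq. A schematic of the integration step; the weights below are illustrative placeholders, not standardized coefficients:

```python
def integrate_qoe(p_v, p_a, w_v=0.6, w_a=0.2, w_va=0.2):
    """Combine per-modality scores into an overall score Pq.

    p_v, p_a: video and audio quality on a 1-5 MOS-like scale.
    The weights are made up for illustration.
    """
    return w_v * p_v + w_a * p_a + w_va * (p_v * p_a) ** 0.5

print(integrate_qoe(4.2, 3.8))  # ~4.08
```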
Towards Perceptually Optimized End-to-end Adaptive Video Streaming
[article]
2018
arXiv
pre-print
Using our database, we study the effects of multiple streaming dimensions on user experience and evaluate video quality and quality of experience models. ...
Our database builds on recent advancements in content-adaptive encoding and incorporates actual network traces to capture realistic network variations on the client device. ...
In our model system, the video quality module performs perceptual video quality calculations that are fed to the encoding module, in order to determine an appropriate bitrate ladder (the set of target ...
arXiv:1808.03898v1
fatcat:i5jkeyopfvc2tiwlkiwkuioqjq
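The loop described in the snippet feeds per-bitrate perceptual quality scores into the encoder to pick a bitrate ladder, i.e. the set of encoding targets. A simplified sketch: keep only the rungs that add a meaningful quality step (thresholds and scores below are made up):

```python
def build_ladder(candidates, min_gain=2.0):
    """Pick bitrate rungs whose quality rises by at least min_gain.

    candidates: (bitrate_kbps, quality_score) pairs, ascending bitrate.
    """
    ladder, last_q = [], float("-inf")
    for bitrate, quality in candidates:
        if quality - last_q >= min_gain:
            ladder.append(bitrate)
            last_q = quality
    return ladder

# Made-up per-title measurements (e.g. VMAF-like scores).
print(build_ladder([(400, 62), (800, 71), (1200, 72), (2500, 81)]))
# -> [400, 800, 2500]
```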
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning
[article]
2022
arXiv
pre-print
In this work, we introduce a new challenge of Icon Question Answering (IconQA) with the goal of answering a question in an icon image context. ...
We conduct extensive user studies and blind experiments and reproduce a wide range of advanced VQA methods to benchmark the IconQA task. ...
The metric of Top-5 accuracy is used to evaluate different model setups, and the evaluation results are summarized in Table 8. ...
arXiv:2110.13214v3
fatcat:u4vh5gtuyzdlfbzggoxu3qchdu
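Top-5 accuracy counts a prediction as correct if the ground-truth answer appears anywhere among the model's five highest-scoring answers. A minimal reference implementation with toy data:

```python
import numpy as np

def top5_accuracy(scores, labels):
    """scores: (N, C) answer scores; labels: (N,) true answer indices."""
    top5 = np.argsort(scores, axis=1)[:, -5:]       # 5 best per sample
    hits = (top5 == labels[:, None]).any(axis=1)    # true label among them?
    return hits.mean()

rng = np.random.default_rng(0)
print(top5_accuracy(rng.normal(size=(8, 20)), rng.integers(0, 20, size=8)))
```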
Showing results 1 — 15 out of 69 results