2,639 Hits in 10.0 sec

Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives [article]

Shaoning Xiao, Long Chen, Kaifeng Gao, Zhao Wang, Yi Yang, Jun Xiao
2022 arXiv   pre-print
In this paper, we reconsider the multi-modal alignment problem in VideoQA from feature and sample perspectives to achieve better performance.  ...  From the view of features, we break down the video into trajectories and are the first to leverage trajectory features in VideoQA to enhance the alignment between the two modalities.  ...  CONCLUSION In this paper, we explored multi-modal alignment in the video question answering task from feature and sample perspectives.  ... 
arXiv:2204.11544v1 fatcat:booznt7kjjaxfmagfxaunb35j4

ActBERT: Learning Global-Local Video-Text Representations [article]

Linchao Zhu, Yi Yang
2020 arXiv   pre-print
We validate the generalization capability of ActBERT on downstream video-and-language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action  ...  In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data.  ...  In this way, both the linguistic and visual features are incorporated during transformer encoding.  ...  standard multi-head attention encoding features from the same modality, we leverage the  ... 
arXiv:2011.07231v1 fatcat:xh6lvxh4cfhylffq6ewlynftlq

2021 Index IEEE Transactions on Image Processing Vol. 30

2021 IEEE Transactions on Image Processing  
...  that appeared in this periodical during 2021, and items from previous years that were commented upon or corrected in 2021.  ...  Note that the item title is found only under the primary entry in the Author Index.  ...  Gradient-Based Feature Extraction From Raw Bayer Pattern Images. Zhou, W., +, TIP 2021 5122-5137  ...  Graph-Based Multi-Interaction Network for Video Question Answering.  ... 
doi:10.1109/tip.2022.3142569 fatcat:z26yhwuecbgrnb2czhwjlf73qu

MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models [article]

Paul Pu Liang, Yiwei Lyu, Gunjan Chhablani, Nihal Jain, Zihao Deng, Xingbo Wang, Louis-Philippe Morency, Ruslan Salakhutdinov
2022 arXiv   pre-print
are represented in decision-level features, and (4) multimodal prediction: how decision-level features are composed to make a prediction.  ...  to features, (3) perform error analysis on model misclassifications, and (4) use insights from error analysis to debug models.  ...  Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, National Institutes  ... 
arXiv:2207.00056v1 fatcat:vxg2lcvm6jgghldw74b7onwjje

A Roadmap for Big Model [article]

Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, Yankai Lin, Xu Han (+88 others)
2022 arXiv   pre-print
We introduce 16 specific BM-related topics across those four parts: Data, Knowledge, Computing System, Parallel Training System, Language Model, Vision Model, Multi-modal Model, Theory & Interpretability  ...  Researchers have achieved various outcomes in the construction of BMs and in their application to many fields.  ...  -Big Multi-modal Model (Section 8) Humans can learn from multi-modal information in the real world.  ... 
arXiv:2203.14101v4 fatcat:rdikzudoezak5b36cf6hhne5u4

Transcript to Video: Efficient Clip Sequencing from Texts [article]

Yu Xiong, Fabian Caba Heilbron, Dahua Lin
2021 arXiv   pre-print
Quantitative results and user studies demonstrate empirically that the proposed learning framework can retrieve content-relevant shots while creating plausible video sequences in terms of style.  ...  To meet the demands of non-experts, we present Transcript-to-Video -- a weakly-supervised framework that uses texts as input to automatically create video sequences from an extensive collection of shots  ...  A joint sequence fusion model for video question answering and retrieval.  ... 
arXiv:2107.11851v1 fatcat:vfcx7w75kzgg7ppurgswceoi5i

Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline [article]

Vishvak Murahari, Dhruv Batra, Devi Parikh, Abhishek Das
2020 arXiv   pre-print
Our model is pretrained on the Conceptual Captions and Visual Question Answering datasets, and finetuned on VisDial.  ...  This highlights a trade-off between the two primary metrics -- NDCG and MRR -- which we find is due to dense annotations not correlating well with the original ground-truth answers to questions.  ...  AD is supported in part by fellowships from Facebook, Adobe, and Snap Inc.  ... 
arXiv:1912.02379v2 fatcat:s7nmajzjhveq3f43prs37nfasq

User-Generated Content in Social Media (Dagstuhl Seminar 17301)

Tat-Seng Chua, Norbert Fuhr, Gregory Grefenstette, Kalervo Järvelin, Jaakko Peltonen, Marc Herbstritt
2018 Dagstuhl Reports  
In this seminar, we brought together researchers from different subfields of computer science, such as information retrieval, multimedia, natural language processing, machine learning and social media  ...  We formed two working groups, WG1 "Fake News and Credibility", WG2 "Summarizing and Story Telling from UGC".  ...  First, we interpret the questions and answers in CQA as two independent networks.  ... 
doi:10.4230/dagrep.7.7.110 dblp:journals/dagstuhl-reports/ChuaFGJP17 fatcat:bman5u6q5zdg7a6csnzwpba7sm

Rethinking Search: Making Experts out of Dilettantes [article]

Donald Metzler, Yi Tay, Dara Bahri, Marc Najork
2021 arXiv   pre-print
Successful question answering systems offer a limited corpus created on-demand by human experts, which is neither timely nor scalable.  ...  This paper examines how ideas from classical information retrieval and large pre-trained language models can be synthesized and evolved into systems that truly deliver on the promise of expert advice.  ...  Multi-hop Question Answering  ...  [34] Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo. 2017.  ... 
arXiv:2105.02274v1 fatcat:qdghlnv2nnfhnoo6eafdaxqxzy

Are Accelerometers for Activity Recognition a Dead-end? [article]

Catherine Tong, Shyam A. Tailor, Nicholas D. Lane
2020 arXiv   pre-print
Despite continued and prolonged efforts in improving feature engineering and machine learning models, the activities that we can recognize reliably have only expanded slightly and many of the same flaws  ...  This sensor does not offer enough information for us to progress in the core domain of HAR - to recognize everyday activities from sensor data.  ...  It is time to rethink this default HAR sensor and move towards a modality with richer information, in order to identify more activities more robustly.  ... 
arXiv:2001.08111v2 fatcat:a2mlifk77vcvbjfbcxza7if4te

Table of Contents

2019 2019 IEEE/CVF International Conference on Computer Vision (ICCV)  
and Technology of China) RGB-Infrared Cross-Modality Person Re-Identification via Joint Pixel and Feature Alignment  ...  Beyond Human Parts: Dual Part-Aligned Representations for Person Re-Identification 3641  ...  Graphical Feature Learning for the Feature Matching Problem 5086 Zhen Zhang (NUS) and Wee Sun Lee (NUS)  ...  Minimum Delay Object Detection From Video 5096 Dong Lao (KAUST) and Ganesh Sundaramoorthi (  ... 
doi:10.1109/iccv.2019.00004 fatcat:5aouo4scprc75c7zetsimylj2y

Artefacts, practices and pedagogies: teaching writing in English in the NAPLAN era

Susanne Gannon, Jennifer Dove
2021 The Australian Educational Researcher  
It developed a novel method of case study at a distance that required no classroom presence or school visits by the researchers and allowed a multi-sited and geographically dispersed design.  ...  Teachers were invited to select classroom artefacts pertaining to the teaching of writing in their English classes, compile individualised e-portfolios and reflect on these items in writing and in digitally  ...  Students are reminded to use QAR (Question-Answer Relationships) "to work out where the answer is going to come from: in the book or in your head?".  ... 
doi:10.1007/s13384-020-00416-6 pmid:33526956 pmcid:PMC7838853 fatcat:td4o75smyvck7idklw5len52xm

Scientific Visualization (Dagstuhl Seminar 11231)

Min Chen, Hans Hagen, Charles D. Hansen, Arie Kaufman, Marc Herbstritt
2011 Dagstuhl Reports  
Reflecting the heterogeneous structure of Scientific Visualization and the currently unsolved problems in the field, this seminar dealt with key research problems and their solutions in the following subfields  ...  Scientific Visualization (SV) is the transformation of abstract data, derived from observation or simulation, into readily comprehensible images, and has proven to play an indispensable part of the scientific  ...  In particular, we show how to approximate topological structures from hixel data, extract structures from multi-modal distributions, and render uncertain isosurfaces.  ... 
doi:10.4230/dagrep.1.6.1 dblp:journals/dagstuhl-reports/ChenHHK11 fatcat:jvdbpd4q3fddjazkxyhttih36a

Synchronized distribution framework for high-quality multimodal interactive teleimmersion

Zixia Huang
2012 ACM SIGMultimedia Records  
This dissertation investigates issues of performing synchronized distribution of time-correlated multi-modal continuous media data in the distributed interactive teleimmersion, and proposes approaches  ...  I have learnt a lot from her broad knowledge, critical insights and thoughtful directions, and have benefited from her professionalism and dedication.  ...  For example in a remote education application, students at different receiver sites are racing to answer a question asked by an instructor at the sender site.  ... 
doi:10.1145/2458051.2458057 fatcat:oyhv5qkkfrdv7lqmkigqm3wvfm

Automatic Identification of Non-Meaningful Body-Movements and What It Reveals About Humans [article]

Md Iftekhar Tanveer, RuJie Zhao, Mohammed Hoque
2017 arXiv   pre-print
We extracted five types of features from the audio-video recordings: disfluency, prosody, body movements, facial, and lexical.  ...  In a dataset of 84 public speaking videos from 28 individuals, we extract 314 unique body movement patterns (e.g. pacing, gesturing, shifting body weights, etc.).  ...  Acknowledgement This work was supported in part by Grant W911NF-15-1-0542 with the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO).  ... 
arXiv:1707.04790v1 fatcat:pu55p7xj3fdl5nvcgfryhb7eky
Showing results 1 - 15 out of 2,639 results