
Modality-Agnostic Attention Fusion for visual search with text feedback [article]

Eric Dodds, Jack Culpepper, Simao Herdade, Yang Zhang, Kofi Boakye
2020 arXiv   pre-print
Our Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two visual-search-with-modifying-phrase datasets, Fashion IQ and CSS, and performs ... Image retrieval with natural language feedback offers the promise of catalog search based on fine-grained visual features that go beyond objects and binary attributes, facilitating real-world applications ... We propose a simple model we call Modality-Agnostic Attention Fusion (MAAF) to address the text-modified image retrieval task. ...
arXiv:2007.00145v1 fatcat:bw3uiomgzbhafpmazqgafzrlja
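
The snippet names the core idea, attention that does not special-case modality, but not the architecture. Below is a minimal PyTorch sketch of that idea under stated assumptions: image-region and word tokens are projected to a common width and run through one shared transformer stack. The dimensions, layer count, and mean-pooling readout are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ModalityAgnosticFusion(nn.Module):
    """Sketch: one attention stack over image and text tokens alike."""
    def __init__(self, img_dim=2048, txt_dim=300, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)  # image regions -> shared space
        self.txt_proj = nn.Linear(txt_dim, d_model)  # word embeddings -> shared space
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, R, img_dim) region features; txt_tokens: (B, T, txt_dim)
        tokens = torch.cat([self.img_proj(img_tokens), self.txt_proj(txt_tokens)], dim=1)
        fused = self.encoder(tokens)  # no modality-specific attention anywhere
        return fused.mean(dim=1)      # pooled joint query embedding (assumed readout)

model = ModalityAgnosticFusion()
query = model(torch.randn(4, 36, 2048), torch.randn(4, 12, 300))
print(query.shape)  # torch.Size([4, 512])
```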

A Review on Explainability in Multimodal Deep Neural Nets

Gargi Joshi, Rahee Walambe, Ketan Kotecha
2021 IEEE Access  
Visual and sensory text; the paper uses a causal approach to the visual attention mechanism [78], [79]. ... The multimodal fusion task has also been posed as a neural architecture search problem: given an appropriate search space, the search finds a suitable architecture for fusing the modalities [132]. ...
doi:10.1109/access.2021.3070212 fatcat:5wtxr4nf7rbshk5zx7lzbtcram

Image Search with Text Feedback by Additive Attention Compositional Learning [article]

Yuxin Tian, Shawn Newsam, Kofi Boakye
2022 arXiv   pre-print
Effective image retrieval with text feedback stands to impact a range of real-world applications, such as e-commerce.  ...  composing a multi-modal (image-text) query.  ...  Recently, MAAF [15] improved multi-modal image search via a Modality-Agnostic Attention Fusion model.  ... 
arXiv:2203.03809v1 fatcat:wkt7hlq47jelxguutlgbzx5qdq
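
The entry only says that a multi-modal (image-text) query is composed; the exact AACL layers are not reproduced here. As a hedged illustration of additive (Bahdanau-style) attention used for query composition, the sketch below lets the modification text re-weight image regions before summing the two summaries; `AdditiveComposition` and all sizes are hypothetical.

```python
import torch
import torch.nn as nn

class AdditiveComposition(nn.Module):
    """Toy additive-attention composition of an image and a modifying text."""
    def __init__(self, d=512):
        super().__init__()
        self.w_img = nn.Linear(d, d)
        self.w_txt = nn.Linear(d, d)
        self.v = nn.Linear(d, 1)

    def forward(self, img_tokens, txt_vec):
        # img_tokens: (B, R, d) region features; txt_vec: (B, d) sentence embedding
        scores = self.v(torch.tanh(self.w_img(img_tokens) +
                                   self.w_txt(txt_vec).unsqueeze(1)))  # (B, R, 1)
        attn = scores.softmax(dim=1)
        attended = (attn * img_tokens).sum(dim=1)  # text-conditioned image summary
        return attended + txt_vec                  # composed multi-modal query

comp = AdditiveComposition()
q = comp(torch.randn(2, 36, 512), torch.randn(2, 512))
print(q.shape)  # torch.Size([2, 512])
```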

Multimodal Research in Vision and Language: A Review of Current and Emerging Trends [article]

Shagun Uppal, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumder, Soujanya Poria, Roger Zimmermann, Amir Zadeh
2020 arXiv   pre-print
In this paper, we present a detailed overview of the latest trends in research pertaining to visual and language modalities. ... Deep learning and its applications have catalyzed impactful research and development across the diverse range of modalities present in real-world data. ... [258] input an image together with a text query describing the changes to apply to that image while searching for other relevant images to retrieve. ...
arXiv:2010.09522v2 fatcat:l4npstkoqndhzn6hznr7eeys4u

Multimodal Machine Learning: A Survey and Taxonomy [article]

Tadas Baltrušaitis, Chaitanya Ahuja, Louis-Philippe Morency
2017 arXiv   pre-print
It is a vibrant multi-disciplinary field of increasing importance and extraordinary potential. ... We go beyond the typical early and late fusion categorization and identify broader challenges faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. ... While earlier approaches for indexing and searching these multimedia videos were keyword-based [188], new research problems emerged when trying to search the visual and multimodal content directly. ...
arXiv:1705.09406v2 fatcat:262fo4sihffvxecg4nwsifoddm
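
For readers unfamiliar with the "early and late fusion categorization" this survey moves beyond, here is a toy contrast of the two in PyTorch, with made-up feature sizes: early fusion concatenates features before a single classifier, late fusion averages per-modality decisions.

```python
import torch
import torch.nn as nn

img_feat, audio_feat = torch.randn(8, 512), torch.randn(8, 128)

# Early fusion: concatenate raw features, then classify once.
early_clf = nn.Linear(512 + 128, 10)
early_logits = early_clf(torch.cat([img_feat, audio_feat], dim=-1))

# Late fusion: classify each modality separately, then combine decisions.
img_clf, audio_clf = nn.Linear(512, 10), nn.Linear(128, 10)
late_logits = (img_clf(img_feat) + audio_clf(audio_feat)) / 2

print(early_logits.shape, late_logits.shape)  # both torch.Size([8, 10])
```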

Upgrading the Newsroom: An Automated Image Selection System for News Articles [article]

Fangyu Liu, Rémi Lebret, Didier Orel, Philippe Sordet, Karl Aberer
2020 arXiv   pre-print
The text encoder adopts a hierarchical self-attention mechanism to attend to both keywords within a piece of text and informative components of a news article. ... The system is compared with multiple baselines with ablation studies and is shown to beat existing text-image retrieval methods in a weakly-supervised learning setting. ... Searching with this text using our model trained only on captions, the attention scores are visualized in the first row of Figure 9, where attention decreases in ...
arXiv:2004.11449v1 fatcat:4dnkwtwkufaxdg3rs6urz2zgai
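
A hierarchical self-attention text encoder of the kind the snippet describes can be sketched as two stacked attention-pooling stages: words within each sentence, then sentences across the article. The module below is a deliberate simplification with assumed shapes, not the paper's encoder.

```python
import torch
import torch.nn as nn

class AttnPool(nn.Module):
    """Learned attention pooling: weight the items, return their weighted sum."""
    def __init__(self, d):
        super().__init__()
        self.score = nn.Linear(d, 1)

    def forward(self, x):                  # x: (..., N, d)
        w = self.score(x).softmax(dim=-2)  # attention over the N items
        return (w * x).sum(dim=-2)

class HierarchicalEncoder(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.word_pool = AttnPool(d)  # emphasizes keywords within a sentence
        self.sent_pool = AttnPool(d)  # emphasizes informative sentences in the article

    def forward(self, words):             # words: (B, sentences, words, d)
        sents = self.word_pool(words)     # (B, sentences, d)
        return self.sent_pool(sents)      # (B, d) article embedding

enc = HierarchicalEncoder()
print(enc(torch.randn(2, 10, 20, 256)).shape)  # torch.Size([2, 256])
```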

Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions [article]

Anil Rahate, Rahee Walambe, Sheela Ramanna, Ketan Kotecha
2021 arXiv   pre-print
Multimodal deep learning systems, which employ multiple modalities like text, image, audio, and video, show better performance than unimodal (single-modality) systems. ... Our final goal is to discuss challenges and perspectives, along with important ideas and directions for future work, that we hope will benefit the entire research community focusing on this exciting area. ... A multimodal fusion architecture search space [136] is used to decide which layers from each modality to fuse and which non-linear function to use for the fusion. ...
arXiv:2107.13782v2 fatcat:s4spofwxjndb7leqbcqnwbifq4
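
The cited fusion search space [136] is described only at this level of detail, so the toy below merely renders that description: each candidate picks which hidden layer to tap from each modality and which non-linearity fuses them. The evaluation loop is a stub; a real search would train and validate every candidate (or a supernet).

```python
import itertools
import torch

acts = {"relu": torch.relu, "tanh": torch.tanh, "sigmoid": torch.sigmoid}
# Candidate = (image layer index, text layer index, fusion non-linearity).
search_space = list(itertools.product(range(3), range(3), acts))  # 27 candidates

def fuse(cfg, img_layers, txt_layers):
    i, j, act = cfg
    # Fuse the i-th image layer with the j-th text layer via the chosen non-linearity.
    return acts[act](img_layers[i] + txt_layers[j])

img_layers = [torch.randn(4, 64) for _ in range(3)]  # per-layer activations (assumed sizes)
txt_layers = [torch.randn(4, 64) for _ in range(3)]
for cfg in search_space[:3]:  # the search loop would score each candidate here
    print(cfg, fuse(cfg, img_layers, txt_layers).shape)
```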

HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning [article]

Paul Pu Liang, Yiwei Lyu, Xiang Fan, Shentong Mo, Dani Yogatama, Louis-Philippe Morency, Ruslan Salakhutdinov
2022 arXiv   pre-print
In order to accelerate generalization towards diverse and understudied modalities, we investigate methods for high-modality (a large set of diverse modalities) and partially-observable (each task defined on only a small subset of modalities) scenarios. ... Our resulting model generalizes across text, image, video, audio, time-series, sensors, tables, and set modalities from different research areas, and improves the tradeoff between performance and efficiency ... For example, the modality embedding of the image sequence for a video classification task will be shared with that for an image and text question-answering task. ...
arXiv:2203.01311v2 fatcat:vrduxldb4jenxdfbzws2he7lgi
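
The quoted example, one "image" embedding shared between a video task and a VQA task, can be made concrete with a small parameter table keyed by modality name. Shapes and the add-to-tokens convention are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SharedModalityEmbeddings(nn.Module):
    """One learned vector per modality, reused across every task."""
    def __init__(self, modalities=("image", "text", "audio"), d=256):
        super().__init__()
        self.emb = nn.ParameterDict(
            {m: nn.Parameter(torch.randn(d)) for m in modalities})

    def forward(self, tokens, modality):
        # tokens: (B, N, d). The same "image" vector tags video-classification
        # frames and question-answering image regions alike.
        return tokens + self.emb[modality]

tagger = SharedModalityEmbeddings()
video_frames = tagger(torch.randn(2, 16, 256), "image")  # video task
vqa_regions = tagger(torch.randn(2, 36, 256), "image")   # QA task, same embedding
```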

Multimodal Interfaces: A Survey of Principles, Models and Frameworks [chapter]

Bruno Dumas, Denis Lalanne, Sharon Oviatt
2009 Lecture Notes in Computer Science  
This opens a number of associated issues covered by this chapter, such as fusion of heterogeneous data types, architectures for real-time processing, dialog management, machine learning for multimodal interaction ... The chapter starts with the features and advantages associated with multimodal interaction, with a focus on particular findings and guidelines, as well as cognitive foundations underlying multimodal interaction ... Fission of output modalities: when multiple output modalities such as text-to-speech synthesis, audio cues, visual cues, haptic feedback, or animated agents are available, output selection becomes a delicate ...
doi:10.1007/978-3-642-00437-7_1 fatcat:2kpxjb4kqfcupkrxeexlvwi3su

Weakly-Supervised Video Moment Retrieval via Semantic Completion Network [article]

Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, Huasheng Liu
2020 arXiv   pre-print
Next, we build a semantic completion module to measure the semantic similarity between the selected proposals and the query, compute a reward, and provide feedback to the proposal generation module for scoring refinement. ... Video moment retrieval searches for the moment most relevant to a given natural language query. ... The existing weakly-supervised method of (Mithun, Paul, and Roy-Chowdhury 2019) learns a joint visual-text embedding and utilizes the latent alignment produced by an intermediate Text-Guided Attention ...
arXiv:1911.08199v3 fatcat:7vwjsnr6cza7fj74rifxd22sdm
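
The snippet describes a reward-driven feedback loop from a semantic similarity measure back to the proposal scorer. The sketch below substitutes plain cosine similarity for the paper's semantic completion reward (which reconstructs masked query words) and uses a REINFORCE-style update; it shows the shape of the loop, not the actual module.

```python
import torch
import torch.nn.functional as F

def proposal_reinforce_loss(proposal_scores, proposal_embs, query_emb):
    # proposal_scores: (B, P) scorer output; proposal_embs: (B, P, d); query_emb: (B, d)
    log_probs = proposal_scores.log_softmax(dim=-1)
    # Stand-in reward: cosine similarity between each proposal and the query.
    reward = F.cosine_similarity(
        proposal_embs, query_emb.unsqueeze(1).expand_as(proposal_embs), dim=-1)
    baseline = reward.mean(dim=-1, keepdim=True)  # simple baseline, reduces variance
    # REINFORCE: raise the log-probability of above-average proposals.
    return -((reward - baseline).detach() * log_probs).sum(dim=-1).mean()

scores = torch.randn(4, 8, requires_grad=True)
loss = proposal_reinforce_loss(scores, torch.randn(4, 8, 128), torch.randn(4, 128))
loss.backward()  # the reward signal flows back into the proposal scorer
```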

M2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis [article]

Xingbo Wang, Jianben He, Zhihua Jin, Muqiao Yang, Yong Wang, Huamin Qu
2021 arXiv   pre-print
In this paper, we present an interactive visual analytics system, M2Lens, to visualize and explain multimodal models for sentiment analysis.  ...  Moreover, M2Lens identifies frequent and influential multimodal features and supports the multi-faceted exploration of model behaviors from language, acoustic, and visual modalities.  ...  ACKNOWLEDGMENTS The authors wish to thank anonymous reviewers for their feedback. This research was supported in part by grant FSNH20EG01 under Foshan-HKUST Projects.  ... 
arXiv:2107.08264v4 fatcat:2zmqziomuveodlru5vqfcvnpta

Weakly-Supervised Video Moment Retrieval via Semantic Completion Network

Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, Huasheng Liu
2020 Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)
Next, we build a semantic completion module to measure the semantic similarity between the selected proposals and the query, compute a reward, and provide feedback to the proposal generation module for scoring refinement. ... Video moment retrieval searches for the moment most relevant to a given natural language query. ... The existing weakly-supervised method of (Mithun, Paul, and Roy-Chowdhury 2019) learns a joint visual-text embedding and utilizes the latent alignment produced by an intermediate Text-Guided Attention ...
doi:10.1609/aaai.v34i07.6820 fatcat:zveh5blsg5ehvapv2aes7unvye

Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [article]

Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman
2021 arXiv   pre-print
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings, but is often inapplicable in practice for large-scale retrieval given the cost of the cross-attention mechanisms required for each sample at test time. ... We would like to thank Lisa Anne Hendricks for feedback. ...

Latent Structure Mining with Contrastive Modality Fusion for Multimedia Recommendation [article]

Jinghao Zhang, Yanqiao Zhu, Qiang Liu, Mengqi Zhang, Shu Wu, Liang Wang
2022 arXiv   pre-print
to fully understand content information and item relationships. To this end, we propose a latent structure MIning with ContRastive mOdality fusion method (MICRO for brevity). ... To be specific, we devise a novel modality-aware structure learning module, which learns item-item relationships for each modality. ... ACF [7] introduces an item-level and component-level attention model for inferring the underlying users' preferences encoded in implicit user feedback. ...
arXiv:2111.00678v2 fatcat:boqsb2twpjd45gbtol5tpkirqa
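
A minimal sketch of per-modality item-item structure: connect each item to its k nearest neighbors under cosine similarity of that modality's features. MICRO's learned (rather than fixed) graphs, edge weighting, and contrastive fusion step are omitted; this only illustrates building one graph per modality.

```python
import torch
import torch.nn.functional as F

def knn_item_graph(feats, k=5):
    # feats: (N, d) item features from one modality (e.g., image or text encoder output)
    z = F.normalize(feats, dim=-1)
    sim = z @ z.T                        # pairwise cosine similarity, (N, N)
    sim.fill_diagonal_(float("-inf"))    # exclude self-loops
    return sim.topk(k, dim=-1).indices   # (N, k): each item's k nearest neighbors

image_graph = knn_item_graph(torch.randn(100, 64))  # one graph per modality
text_graph = knn_item_graph(torch.randn(100, 32))
```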

GreaseLM: Graph REASoning Enhanced Language Models for Question Answering [article]

Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D. Manning, Jure Leskovec
2022 arXiv   pre-print
However, pretrained language models (LM), the foundation of most modern QA systems, do not robustly represent latent relationships between concepts, which is necessary for reasoning.  ...  While knowledge graphs (KG) are often used to augment LMs with structured representations of world knowledge, it remains an open question how to effectively fuse and reason over the KG representations  ...  ACKNOWLEDGMENT We thank Rok Sosic, Maria Brbic, Jordan Troutman, Rajas Bansal, and our anonymous reviewers for discussions and for providing feedback on our manuscript.  ... 
arXiv:2201.08860v1 fatcat:2idgswwqknhnflhrc4a3tnulla
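
The open question the abstract raises, how to fuse and reason over LM and KG representations, is addressed in GreaseLM by letting a special interaction token from the text stream and an interaction node from the graph stream exchange information at every layer. The sketch below compresses that exchange into one shared MLP; the surrounding transformer and GNN stacks are omitted, and the sizes are assumed.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Toy per-layer exchange between an LM interaction token and a KG node."""
    def __init__(self, d=256):
        super().__init__()
        self.mix = nn.Sequential(nn.Linear(2 * d, 2 * d), nn.GELU(),
                                 nn.Linear(2 * d, 2 * d))

    def forward(self, int_token, int_node):
        # int_token: (B, d) from the LM stream; int_node: (B, d) from the KG stream
        mixed = self.mix(torch.cat([int_token, int_node], dim=-1))
        new_token, new_node = mixed.chunk(2, dim=-1)
        return new_token, new_node  # each stream continues with the fused state

fuse = FusionLayer()
tok, node = fuse(torch.randn(3, 256), torch.randn(3, 256))
print(tok.shape, node.shape)  # torch.Size([3, 256]) torch.Size([3, 256])
```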
Showing results 1 — 15 out of 501 results