A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit the original URL.
The file type is application/pdf.
Modality-Agnostic Attention Fusion for visual search with text feedback
[article]
2020
arXiv
pre-print
Our Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two datasets for visual search with a modifying phrase, Fashion IQ and CSS, and performs ...
Image retrieval with natural language feedback offers the promise of catalog search based on fine-grained visual features that go beyond objects and binary attributes, facilitating real-world applications ...
We propose a simple model we call Modality-Agnostic Attention Fusion (MAAF) to address the text-modified image retrieval task. ...
arXiv:2007.00145v1
fatcat:bw3uiomgzbhafpmazqgafzrlja
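The MAAF entry above describes attending over image and text features without regard to their modality. A minimal sketch of that idea in PyTorch, assuming pre-extracted image-region features and word embeddings; the tensor shapes, pooling choice, and layer counts are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse image and text tokens with modality-agnostic self-attention.

    Sketch of the idea behind MAAF: concatenate tokens from both
    modalities into one sequence, let a transformer encoder attend
    across them freely, then pool to a single retrieval embedding.
    """

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, N_img, dim); text_tokens: (B, N_txt, dim)
        tokens = torch.cat([image_tokens, text_tokens], dim=1)
        fused = self.encoder(tokens)   # attention is blind to modality
        return fused.mean(dim=1)       # pooled query embedding

# Usage: embed a (reference image, modifying phrase) query.
model = AttentionFusion()
query = model(torch.randn(4, 49, 512), torch.randn(4, 12, 512))
print(query.shape)  # torch.Size([4, 512])
```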
A Review on Explainability in Multimodal Deep Neural Nets
2021
IEEE Access
[78], [79]: visual, sensory, text. The paper uses a causal approach to the visual attention mechanism. ...
The multimodal fusion task has also been modeled as a neural architecture search algorithm to find an appropriate search space and a suitable architecture to fuse the modalities [132] . ...
doi:10.1109/access.2021.3070212
fatcat:5wtxr4nf7rbshk5zx7lzbtcram
Image Search with Text Feedback by Additive Attention Compositional Learning
[article]
2022
arXiv
pre-print
Effective image retrieval with text feedback stands to impact a range of real-world applications, such as e-commerce. ...
composing a multi-modal (image-text) query. ...
Recently, MAAF [15] improved multi-modal image search via a Modality-Agnostic Attention Fusion model. ...
arXiv:2203.03809v1
fatcat:wkt7hlq47jelxguutlgbzx5qdq
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
[article]
2020
arXiv
pre-print
In this paper, we present a detailed overview of the latest trends in research pertaining to visual and language modalities. ...
Deep learning and its applications have driven impactful research and development across the diverse range of modalities present in real-world data. ...
[258] input an image together with a text query describing the changes to be made to that image while searching for other relevant images to retrieve. ...
arXiv:2010.09522v2
fatcat:l4npstkoqndhzn6hznr7eeys4u
Multimodal Machine Learning: A Survey and Taxonomy
[article]
2017
arXiv
pre-print
It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. ...
We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and ...
While earlier approaches for indexing and searching these multimedia videos were keyword-based [188] , new research problems emerged when trying to search the visual and multimodal content directly. ...
arXiv:1705.09406v2
fatcat:262fo4sihffvxecg4nwsifoddm
Upgrading the Newsroom: An Automated Image Selection System for News Articles
[article]
2020
arXiv
pre-print
The text encoder adopts a hierarchical self-attention mechanism to attend more to both keywords within a piece of text and informative components of a news article. ...
The system is compared with multiple baselines with ablation studies and is shown to beat existing text-image retrieval methods in a weakly-supervised learning setting. ...
Searching this text with our model trained only on captions, the attention scores are visualized in the first row of Figure 9, where attention decreases in ...
arXiv:2004.11449v1
fatcat:4dnkwtwkufaxdg3rs6urz2zgai
Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions
[article]
2021
arXiv
pre-print
Multimodal deep learning systems, which employ multiple modalities such as text, image, audio, and video, show better performance than systems built on an individual modality (i.e., unimodal systems). ...
Our final goal is to discuss challenges and perspectives along with the important ideas and directions for future work that we hope to be beneficial for the entire research community focusing on this exciting ...
A multimodal fusion architecture search space [136] is used to decide which layers from each modality to use for fusion and which non-linear function to apply in the fusion. ...
arXiv:2107.13782v2
fatcat:s4spofwxjndb7leqbcqnwbifq4
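The co-learning snippet above mentions searching over which layers to fuse and which non-linearity to apply. A toy sketch of such a fusion search space with random search; the configuration fields, dimensions, and the stubbed evaluation are illustrative assumptions, not the cited method [136]:

```python
import random
import torch.nn as nn

# Toy fusion search space: pick one hidden layer from each modality's
# encoder and a non-linearity to apply to the fused representation.
ACTIVATIONS = {"relu": nn.ReLU(), "tanh": nn.Tanh(), "gelu": nn.GELU()}

def sample_config(num_image_layers, num_text_layers):
    return {
        "image_layer": random.randrange(num_image_layers),
        "text_layer": random.randrange(num_text_layers),
        "activation": random.choice(list(ACTIVATIONS)),
    }

def build_fusion(config, dim=256):
    # Concatenate the two selected layer outputs, project, apply the
    # sampled non-linearity.
    return nn.Sequential(nn.Linear(2 * dim, dim),
                         ACTIVATIONS[config["activation"]])

def evaluate(config):
    # Placeholder for real training plus validation scoring.
    return random.random()

# Random search: keep the best-scoring fusion configuration.
best = max((sample_config(4, 4) for _ in range(20)), key=evaluate)
fusion_module = build_fusion(best)
print(best)
```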
HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning
[article]
2022
arXiv
pre-print
In order to accelerate generalization towards diverse and understudied modalities, we investigate methods for high-modality (a large set of diverse modalities) and partially-observable (each task only ...
Our resulting model generalizes across text, image, video, audio, time-series, sensors, tables, and set modalities from different research areas, improves the tradeoff between performance and efficiency ...
For example, the modality embedding of the image sequence for a video classification task will be shared with that for an image and text question-answering task. ...
arXiv:2203.01311v2
fatcat:vrduxldb4jenxdfbzws2he7lgi
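The HighMMT snippet above notes that a modality embedding (e.g., for image sequences) is shared across tasks that consume that modality. A minimal sketch of parameter sharing by modality rather than by task; the module names, feature dimensions, and mean-pooling fusion are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ModalitySharedModel(nn.Module):
    """Share one embedding module per modality across all tasks."""

    def __init__(self, dim=256):
        super().__init__()
        # One encoder per modality, reused by every task that sees it.
        self.modality_encoders = nn.ModuleDict({
            "image": nn.Linear(2048, dim),
            "text": nn.Linear(768, dim),
            "audio": nn.Linear(128, dim),
        })
        # Lightweight task-specific heads on top of the shared encoders.
        self.task_heads = nn.ModuleDict({
            "video_classification": nn.Linear(dim, 10),
            "vqa": nn.Linear(dim, 100),
        })

    def forward(self, task, inputs):
        # inputs: dict mapping modality name -> feature tensor (B, feat_dim)
        encoded = [self.modality_encoders[m](x) for m, x in inputs.items()]
        pooled = torch.stack(encoded, dim=0).mean(dim=0)
        return self.task_heads[task](pooled)

model = ModalitySharedModel()
# The same "image" encoder parameters serve both tasks below.
video_logits = model("video_classification",
                     {"image": torch.randn(2, 2048), "audio": torch.randn(2, 128)})
vqa_logits = model("vqa",
                   {"image": torch.randn(2, 2048), "text": torch.randn(2, 768)})
```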
Multimodal Interfaces: A Survey of Principles, Models and Frameworks
[chapter]
2009
Lecture Notes in Computer Science
This opens a number of associated issues covered by this chapter, such as heterogeneous data types fusion, architectures for real-time processing, dialog management, machine learning for multimodal interaction ...
The chapter starts with the features and advantages associated with multimodal interaction, with a focus on particular findings and guidelines, as well as cognitive foundations underlying multimodal interaction ...
Fission of Output Modalities: When multiple output modalities such as text-to-speech synthesis, audio cues, visual cues, haptic feedback, or animated agents are available, output selection becomes a delicate ...
doi:10.1007/978-3-642-00437-7_1
fatcat:2kpxjb4kqfcupkrxeexlvwi3su
Weakly-Supervised Video Moment Retrieval via Semantic Completion Network
[article]
2020
arXiv
pre-print
Next, we build a semantic completion module to measure the semantic similarity between the selected proposals and the query, compute a reward, and provide feedback to the proposal generation module for scoring ...
Video moment retrieval aims to find the moment most relevant to a given natural language query. ...
The existing weakly-supervised method of (Mithun, Paul, and Roy-Chowdhury 2019) learns a joint visual-text embedding and utilizes the latent alignment produced by an intermediate Text-Guided Attention ...
arXiv:1911.08199v3
fatcat:7vwjsnr6cza7fj74rifxd22sdm
M2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis
[article]
2021
arXiv
pre-print
In this paper, we present an interactive visual analytics system, M2Lens, to visualize and explain multimodal models for sentiment analysis. ...
Moreover, M2Lens identifies frequent and influential multimodal features and supports the multi-faceted exploration of model behaviors from language, acoustic, and visual modalities. ...
ACKNOWLEDGMENTS The authors wish to thank anonymous reviewers for their feedback. This research was supported in part by grant FSNH20EG01 under Foshan-HKUST Projects. ...
arXiv:2107.08264v4
fatcat:2zmqziomuveodlru5vqfcvnpta
Weakly-Supervised Video Moment Retrieval via Semantic Completion Network
2020
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)
Next, we build a semantic completion module to measure the semantic similarity between the selected proposals and the query, compute a reward, and provide feedback to the proposal generation module for scoring ...
Video moment retrieval aims to find the moment most relevant to a given natural language query. ...
The existing weakly-supervised method of (Mithun, Paul, and Roy-Chowdhury 2019) learns a joint visual-text embedding and utilizes the latent alignment produced by an intermediate Text-Guided Attention ...
doi:10.1609/aaai.v34i07.6820
fatcat:zveh5blsg5ehvapv2aes7unvye
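Both records above describe a semantic completion module that scores selected proposals against the query and feeds the score back as a reward to the proposal generator. A schematic sketch of that feedback loop; the cosine-similarity reward, the toy scorer, and the REINFORCE-style update with a mean baseline are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def semantic_reward(proposal_emb, query_emb):
    # Reward a proposal by its semantic similarity to the language query.
    return F.cosine_similarity(proposal_emb,
                               query_emb.expand_as(proposal_emb), dim=-1)

def reinforce_step(log_probs, rewards, optimizer):
    # Push the proposal generator toward proposals that earn high reward;
    # subtracting a mean baseline reduces gradient variance.
    baseline = rewards.mean()
    loss = -((rewards - baseline).detach() * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage with a toy proposal scorer standing in for the generator.
scorer = torch.nn.Linear(256, 1)
optimizer = torch.optim.SGD(scorer.parameters(), lr=0.01)
proposals = torch.randn(8, 256)              # candidate moment features
query = torch.randn(256)                     # query sentence embedding
log_probs = F.log_softmax(scorer(proposals).squeeze(-1), dim=0)
rewards = semantic_reward(proposals, query)  # (8,) feedback signal
reinforce_step(log_probs, rewards, optimizer)
```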
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
[article]
2021
arXiv
pre-print
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings, but is often inapplicable in practice for large-scale retrieval given the cost of the cross-attention mechanisms required for each sample at test time. ...
We would like to thank Lisa Anne Hendricks for feedback. ...
arXiv:2103.16553v1
fatcat:rw2av5leebdx7kcrqowxv6yo54
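The snippet above contrasts cheap joint-embedding retrieval with accurate but costly cross-attention scoring. A sketch of the standard two-stage compromise, retrieving with a single dot product over precomputed gallery embeddings and applying cross-modal scoring only to the shortlist; the model stubs and shapes here are assumptions, not the paper's architecture:

```python
import torch

def fast_retrieve(query_emb, gallery_embs, k=100):
    # Joint-embedding search: one matrix product over precomputed vectors.
    scores = gallery_embs @ query_emb            # (N,)
    return scores.topk(k).indices

def slow_rerank(cross_scorer, query_tokens, gallery_tokens, candidates):
    # Cross-attention-style scoring costs a forward pass per pair, so it
    # is applied only to the shortlist from the fast stage.
    scores = torch.stack([
        cross_scorer(query_tokens, gallery_tokens[i]) for i in candidates
    ])
    return candidates[scores.argsort(descending=True)]

# Toy pairwise scorer: mean of query-gallery token similarities.
def toy_cross_scorer(q_tokens, g_tokens):
    return (q_tokens @ g_tokens.T).mean()

gallery_embs = torch.randn(10_000, 256)          # precomputed offline
gallery_tokens = torch.randn(10_000, 20, 256)
query_emb, query_tokens = torch.randn(256), torch.randn(12, 256)
shortlist = fast_retrieve(query_emb, gallery_embs, k=50)
ranking = slow_rerank(toy_cross_scorer, query_tokens, gallery_tokens, shortlist)
```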
Latent Structure Mining with Contrastive Modality Fusion for Multimedia Recommendation
[article]
2022
arXiv
pre-print
to fully understand content information and item relationships. To this end, we propose a latent structure MIning with ContRastive mOdality fusion method (MICRO for brevity). ...
To be specific, we devise a novel modality-aware structure learning module, which learns item-item relationships for each modality. ...
ACF [7] introduces an item-level and component-level attention model for inferring the underlying users' preferences encoded in implicit user feedback. ...
arXiv:2111.00678v2
fatcat:boqsb2twpjd45gbtol5tpkirqa
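The MICRO snippet above describes a modality-aware structure learning module that learns item-item relationships per modality. A minimal sketch that builds a kNN item graph from each modality's features; the cosine-similarity/kNN construction and the averaged fusion are illustrative assumptions standing in for the learned module and its contrastive fusion:

```python
import torch
import torch.nn.functional as F

def knn_item_graph(features, k=10):
    """Build a sparse item-item affinity graph for one modality."""
    feats = F.normalize(features, dim=-1)   # unit norm -> cosine similarity
    sim = feats @ feats.T
    sim.fill_diagonal_(0)                   # drop self-loops
    vals, idx = sim.topk(k, dim=-1)         # keep top-k neighbors per item
    return torch.zeros_like(sim).scatter_(-1, idx, vals)

# One latent item-item graph per modality; downstream, MICRO-style
# methods fuse them (a simple average stands in for contrastive fusion).
visual_graph = knn_item_graph(torch.randn(500, 64))
text_graph = knn_item_graph(torch.randn(500, 32))
fused_graph = (visual_graph + text_graph) / 2
```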
GreaseLM: Graph REASoning Enhanced Language Models for Question Answering
[article]
2022
arXiv
pre-print
However, pretrained language models (LM), the foundation of most modern QA systems, do not robustly represent latent relationships between concepts, which is necessary for reasoning. ...
While knowledge graphs (KG) are often used to augment LMs with structured representations of world knowledge, it remains an open question how to effectively fuse and reason over the KG representations ...
ACKNOWLEDGMENT We thank Rok Sosic, Maria Brbic, Jordan Troutman, Rajas Bansal, and our anonymous reviewers for discussions and for providing feedback on our manuscript. ...
arXiv:2201.08860v1
fatcat:2idgswwqknhnflhrc4a3tnulla
Showing results 1 — 15 out of 501 results