
Grounded Compositional Semantics for Finding and Describing Images with Sentences

Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, Andrew Y. Ng
2014 Transactions of the Association for Computational Linguistics  
Previous work on Recursive Neural Networks (RNNs) shows that these models can produce compositional feature vectors for accurately representing and classifying sentences or images.  ...  DT-RNNs outperform other recursive and recurrent neural networks, kernelized CCA and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa.  ...  With linear kernels, kCCA does well for image search but is worse for sentence self-similarity and for describing images with sentences close by in embedding space.  ... 
doi:10.1162/tacl_a_00177 fatcat:r2k4gcax3rbbredim4e5mk7pli
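
The bidirectional retrieval task here (find the image for a sentence, and vice versa) reduces to nearest-neighbor search once both modalities are mapped into a shared embedding space. A minimal sketch of that ranking step, assuming encoders (the paper's DT-RNN, or a kCCA projection) have already produced the vectors; the toy data is invented:

```python
import numpy as np

def rank_by_cosine(query_vec, candidate_vecs):
    """Rank candidates by cosine similarity to the query (best first)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores), scores

# Toy example: 3 image vectors and 1 sentence vector in a shared 4-d space.
images = np.random.randn(3, 4)
sentence = np.random.randn(4)
order, scores = rank_by_cosine(sentence, images)
print("image ranking (best first):", order)
```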

Attributes as Semantic Units between Natural Language and Visual Recognition [article]

Marcus Rohrbach
2016 arXiv   pre-print
Specifically, we discuss how attributes allow using knowledge mined from language resources for recognizing novel visual categories, how we can generate sentence descriptions about images and video, how  ...  we can ground natural language in visual content, and finally, how we can answer natural language questions about images.  ...  with natural language sentences (Section 3), how to ground phrases in images (Section 4), and how compositional computation allows for effective question answering about images (Section 5).  ... 
arXiv:1604.03249v1 fatcat:a5dpwgoddvcsvkovik2gupbri4
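
A sketch of the attribute-as-intermediate idea for recognizing novel categories: score an unseen class by how well per-attribute classifier outputs match its attribute signature (essentially direct attribute prediction; the attribute names, class signatures, and probabilities below are invented for illustration):

```python
import numpy as np

# Hypothetical class-attribute matrix: which attributes each novel class has.
classes = ["zebra", "polar_bear"]
attributes = ["striped", "white", "four_legged"]
class_attr = np.array([[1, 0, 1],    # zebra: striped, four-legged
                       [0, 1, 1]])   # polar bear: white, four-legged

def zero_shot_scores(attr_probs, class_attr):
    """Score each unseen class by the probability of its attribute
    signature under independent per-attribute predictions."""
    on = class_attr * attr_probs            # attribute present and predicted
    off = (1 - class_attr) * (1 - attr_probs)  # absent and predicted absent
    return np.prod(on + off, axis=1)

attr_probs = np.array([0.9, 0.1, 0.8])  # e.g. from per-attribute classifiers
print(dict(zip(classes, zero_shot_scores(attr_probs, class_attr))))
```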

On the Automatic Generation of Medical Imaging Reports [article]

Baoyu Jing, Pengtao Xie, Eric Xing
2017 arXiv   pre-print
Medical imaging is widely used in clinical practice for diagnosis and treatment.  ...  First, a complete report contains multiple heterogeneous forms of information, including findings and tags. Second, abnormal regions in medical images are difficult to identify.  ...  Proportion of sentences that describe normalities and abnormalities in the image.  ... 
arXiv:1711.08195v2 fatcat:plcxiky6vndsli2mown7oqh3ee
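
The abstract's first challenge, that a report mixes heterogeneous outputs (free-text findings plus tags), suggests a two-headed output. A toy sketch of assembling such a report from a multi-label tag head and generated findings sentences; the tag vocabulary and logits are invented, and the model in the paper is far richer:

```python
import numpy as np

TAGS = ["cardiomegaly", "effusion", "normal"]  # illustrative tag vocabulary

def assemble_report(tag_logits, findings_sentences, threshold=0.5):
    """Combine the two heterogeneous outputs the abstract mentions:
    a multi-label tag prediction and free-text findings."""
    probs = 1 / (1 + np.exp(-np.asarray(tag_logits)))  # per-tag sigmoid
    tags = [t for t, p in zip(TAGS, probs) if p > threshold]
    return {"tags": tags, "findings": " ".join(findings_sentences)}

print(assemble_report([2.0, -1.0, -3.0],
                      ["The heart is enlarged.",
                       "No pleural effusion is seen."]))
```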

A Corpus for Reasoning about Natural Language Grounded in Photographs

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, Yoav Artzi
2019 Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics  
We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges.  ...  We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language.  ...  We thank Mark Yatskar, Noah Snavely, and Valts Blukis for their comments and suggestions, the workers who participated in our data collection for their contributions, and the anonymous reviewers for their  ... 
doi:10.18653/v1/p19-1644 dblp:conf/acl/SuhrZZZBA19 fatcat:3oiegtibgnfyleyzvldqqbzfci
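
The underlying task pairs a sentence with visually rich images and asks for a truth judgment, so evaluation is plain binary accuracy. A schematic sketch with an invented example format and a trivial always-True baseline:

```python
from typing import Callable, Iterable, Tuple

# One example in this style: a sentence, a pair of images, a truth value.
Example = Tuple[str, object, object, bool]

def accuracy(model: Callable[[str, object, object], bool],
             data: Iterable[Example]) -> float:
    """Fraction of sentence/image-pair examples the model labels correctly."""
    results = [model(s, left, right) == label
               for s, left, right, label in data]
    return sum(results) / len(results)

# Majority-class baseline: always predict True.
data = [("There are two dogs.", None, None, True),
        ("Both images show cats.", None, None, False)]
print(accuracy(lambda s, l, r: True, data))
```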

Learning semantic sentence representations from visually grounded language without lexical knowledge [article]

Danny Merkx, Stefan Frank
2019 arXiv   pre-print
We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings.  ...  We evaluate the semantic content of the resulting sentence embeddings using the data from the Semantic Textual Similarity benchmark task and show that the multimodal embeddings correlate well with human  ...  This work was carried out on the Dutch national einfrastructure with the support of SURF Cooperative. We would like to thank Mirjam Ernestus for commenting on an earlier version of this paper.  ... 
arXiv:1903.11393v1 fatcat:hazhz3qjtva7vagaa6awthycom
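
Evaluating grounded sentence embeddings on STS-style data amounts to correlating their cosine similarities with human ratings. A minimal sketch (Spearman correlation is used here; the benchmark also reports Pearson, and the toy embeddings are random):

```python
import numpy as np
from scipy.stats import spearmanr

def sts_correlation(emb_a, emb_b, human_scores):
    """Correlate cosine similarity of sentence-embedding pairs with
    human similarity ratings, STS-style."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    rho, _pvalue = spearmanr(np.sum(a * b, axis=1), human_scores)
    return rho

# Toy data: 4 sentence pairs in a 3-d space, human ratings on a 0-5 scale.
emb_a, emb_b = np.random.randn(4, 3), np.random.randn(4, 3)
print(sts_correlation(emb_a, emb_b, [4.2, 1.0, 3.5, 0.5]))
```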

Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning [article]

Juncheng Li, Junlin Xie, Long Qian, Linchao Zhu, Siliang Tang, Fei Wu, Yi Yang, Yueting Zhuang, Xin Eric Wang
2022 arXiv   pre-print
Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence.  ...  However, current temporal grounding datasets do not specifically test for compositional generalizability.  ...  When we systematically analyze the SOTA models, we find that previous temporal grounding methods largely neglect the structured semantics in video and language, which is crucial for compositional reasoning  ... 
arXiv:2203.13049v2 fatcat:kpatshcwr5ftxag5nou77vnoau
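
Temporal grounding quality is conventionally measured by the temporal intersection-over-union between the predicted and the ground-truth segment; a small self-contained version:

```python
def temporal_iou(pred, gold):
    """Intersection-over-union of two time segments (start, end) in seconds,
    the standard figure of merit for temporal grounding."""
    (ps, pe), (gs, ge) = pred, gold
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 7.0), (4.0, 10.0)))  # 3 / 8 = 0.375
```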

Visually Grounded Concept Composition [article]

Bowen Zhang, Hexiang Hu, Linlu Qiu, Peter Shaw, Fei Sha
2021 arXiv   pre-print
Meanwhile, we propose a concept composition neural network called Composer to leverage the CRG for visually grounded concept learning.  ...  Specifically, we learn the grounding of both primitive and all composed concepts by aligning them to images and show that learning to compose leads to more robust grounding results, measured in text-to-image  ...  Additionally, we would like to thank Jason Baldbridge for reviewing an early version of this paper, and Kristina Toutanova for helpful discussion.  ... 
arXiv:2109.14115v1 fatcat:2y7q4svqorcg5aeag2fipm5hgm
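
A sketch of the two ingredients this entry describes: composing primitive concept vectors into a composed concept, and grounding it against an image by similarity. The linear-plus-tanh composition below is one simple choice, not the paper's Composer architecture, and all vectors are random stand-ins:

```python
import numpy as np

def compose(child_vecs, W):
    """Compose child concept vectors into a parent vector with a learned
    linear map followed by a nonlinearity (a deliberately simple choice)."""
    stacked = np.concatenate(child_vecs)
    return np.tanh(W @ stacked)

def grounding_score(concept_vec, image_vec):
    """Alignment of a (possibly composed) concept with an image."""
    return float(concept_vec @ image_vec /
                 (np.linalg.norm(concept_vec) * np.linalg.norm(image_vec)))

rng = np.random.default_rng(0)
d = 8
red, car = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(d, 2 * d))          # maps two children to one parent
red_car = compose([red, car], W)         # composed concept "red car"
print(grounding_score(red_car, rng.normal(size=d)))
```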

Incorporating Visual Semantics into Sentence Representations within a Grounded Space [article]

Patrick Bordes, Eloi Zablocki, Laure Soulier, Benjamin Piwowarski, Patrick Gallinari
2020 arXiv   pre-print
We further propose two new complementary objectives ensuring that (1) sentences associated with the same visual content are close in the grounded space and (2) similarities between related elements are  ...  This hypothesis does not hold when representing words, and becomes problematic when used to learn sentence representations (the focus of this paper), as a visual scene can be described by a wide variety  ... 
arXiv:2002.02734v1 fatcat:yihzff6wmbentlgrvrtd7gcceu
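
A sketch of the two complementary objectives as loss functions: a margin loss pulling together sentences that share visual content, and a term asking sentence-space similarities to track visual similarities. Both formulations are simplified guesses at the shape of such objectives, with invented toy inputs:

```python
import numpy as np

def cluster_loss(s1, s2, s_neg, margin=0.5):
    """Objective (1): two sentences describing the same image (s1, s2)
    should be closer in the grounded space than a sentence from another
    image (s_neg), by a margin."""
    d = lambda a, b: np.linalg.norm(a - b)
    return max(0.0, margin + d(s1, s2) - d(s1, s_neg))

def similarity_preservation_loss(sent_sims, visual_sims):
    """Objective (2): similarity structure among grounded sentence vectors
    should track similarity among the associated visual contents
    (squared error is one simple choice)."""
    return float(np.mean((np.asarray(sent_sims) - np.asarray(visual_sims)) ** 2))

rng = np.random.default_rng(1)
s1, s2, s_neg = (rng.normal(size=4) for _ in range(3))
print(cluster_loss(s1, s2, s_neg))
print(similarity_preservation_loss([0.9, 0.2], [0.8, 0.1]))
```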

Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data

Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell
2016 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)  
While recent deep neural network models have achieved promising results on the image captioning task, they rely largely on the availability of corpora with paired image and sentence captions to describe  ...  In contrast, our model can compose sentences that describe novel objects and their interactions with other objects.  ...  Trevor Darrell was supported in part by DARPA; AFRL; DoD MURI award N000141110688; NSF awards IIS-1212798, IIS-1427425, and IIS-1536003, and the Berkeley Vision and Learning Center.  ... 
doi:10.1109/cvpr.2016.8 dblp:conf/cvpr/HendricksVRMSD16 fatcat:wf2poz4sjvfb7kybpvulacgdv4
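
One way to let a captioner name objects absent from paired training data is to transfer output-layer parameters from a semantically similar in-vocabulary word, which is the flavor of transfer this paper explores. A heavily simplified sketch with an invented two-word vocabulary:

```python
import numpy as np

def transfer_novel_word(output_weights, word2idx, novel, similar):
    """Let a captioner emit a word never seen in paired image-sentence
    data by copying the output-layer weights of a semantically similar
    in-vocabulary word (a heavily simplified transfer step)."""
    W = output_weights.copy()
    W[word2idx[novel]] = W[word2idx[similar]]
    return W

vocab = {"dog": 0, "otter": 1}
W = np.random.randn(2, 8)                # one output row per vocabulary word
W_new = transfer_novel_word(W, vocab, novel="otter", similar="dog")
print(np.allclose(W_new[1], W_new[0]))   # True: "otter" now scored like "dog"
```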

A Joint Model of Language and Perception for Grounded Attribute Learning [article]

Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, Dieter Fox
2012 arXiv   pre-print
In this paper, we present an approach for joint learning of language and perception models for grounded attribute induction.  ...  The approach is evaluated on the task of interpreting sentences that describe sets of objects in a physical workspace.  ...  Acknowledgments This work was funded in part by the Intel Science and Technology Center for Pervasive Computing, the Robotics Consortium sponsored by the U.S.  ... 
arXiv:1206.6423v1 fatcat:vaabotfnrrf7zctjbvvyf5ie3q
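
A toy sketch of grounded attribute interpretation in the spirit of this entry: a sentence is mapped to attribute predicates (here by a lookup table standing in for a learned semantic parser), and workspace objects are selected when their perceptual classifiers satisfy every predicate. All names and probabilities are invented:

```python
PARSE = {"the red blocks": ["red", "block"]}   # stand-in for a semantic parser

# Per-object attribute probabilities from hypothetical perception models.
objects = {
    "obj1": {"red": 0.95, "block": 0.90, "ball": 0.05},
    "obj2": {"red": 0.10, "block": 0.85, "ball": 0.10},
}

def interpret(sentence, objects, threshold=0.5):
    """Select the objects whose perceptual classifiers satisfy every
    attribute predicate in the parsed sentence."""
    predicates = PARSE[sentence]
    return [name for name, probs in objects.items()
            if all(probs.get(p, 0.0) > threshold for p in predicates)]

print(interpret("the red blocks", objects))  # ['obj1']
```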

Computer Vision and Natural Language Processing

Peratham Wiriyathammabhum, Douglas Summers-Stay, Cornelia Fermüller, Yiannis Aloimonos
2016 ACM Computing Surveys  
We draw an analogy between distributional semantics in computer vision and in natural language processing, realized as image embeddings and word embeddings, respectively.  ...  We also present a unified view for the field and propose possible future directions.  ...  Linguistics and Information Processing Lab (CLIP) for useful insights and support.  ... 
doi:10.1145/3009906 fatcat:bdgaeoz4w5djhd5spab4lrc4au
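
The survey's analogy between word embeddings and image embeddings can be made concrete by noting that the same nearest-neighbor routine serves both modalities. A small sketch over random stand-in vectors (the query is included among the candidates, so it ranks first):

```python
import numpy as np

def nearest(query, vectors, names, k=2):
    """Nearest neighbors by cosine similarity; the same routine serves
    word embeddings and image embeddings alike."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    idx = np.argsort(-(v @ q))[:k]
    return [names[i] for i in idx]

vecs = np.random.randn(4, 16)            # stand-ins for either modality
print(nearest(vecs[0], vecs, ["a", "b", "c", "d"]))
```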

Natural Language Semantics With Pictures: Some Language & Vision Datasets and Potential Uses for Computational Semantics [article]

David Schlangen
2019 arXiv   pre-print
Specifically, we show that in this way we can create data that can be used to learn and evaluate lexical and compositional grounded semantics, and we show that the "linked to same image" relation tracks  ...  Propelling, and propelled by, the "deep learning revolution", recent years have seen the introduction of ever larger corpora of images annotated with natural language expressions.  ...  I thank Sina Zarrieß and the anonymous reviewers for comments.  ... 
arXiv:1904.07318v1 fatcat:ibsp6lf3k5bojl5u3n4qhtq22a
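
The "linked to same image" relation is easy to operationalize: enumerate caption pairs that describe the same image and treat them as distantly supervised, paraphrase-like positives. A sketch with a two-image toy corpus:

```python
from itertools import combinations

# Captions keyed by the image they describe (toy stand-ins for a corpus
# like MSCOCO, where each image has several independent captions).
captions = {
    "img1": ["a dog runs on grass", "a brown dog outside"],
    "img2": ["a red car on a street"],
}

def same_image_pairs(captions):
    """Enumerate caption pairs linked to the same image; the paper argues
    this relation tracks semantic similarity, so such pairs can serve as
    distantly supervised paraphrase-like data."""
    for caps in captions.values():
        yield from combinations(caps, 2)

print(list(same_image_pairs(captions)))
```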

Incorporating Visual Semantics into Sentence Representations within a Grounded Space

Patrick Bordes, Eloi Zablocki, Laure Soulier, Benjamin Piwowarski, Patrick Gallinari
2019 Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)  
We further propose two new complementary objectives ensuring that (1) sentences associated with the same visual content are close in the grounded space and (2) similarities between related elements are  ...  with image regions (Xiao et al., 2017), but this is not the focus of the present paper.  ... 
doi:10.18653/v1/d19-1064 dblp:conf/emnlp/BordesZSPG19 fatcat:6xhsaf73xfb5bfo2ny5ojhbqyu

Multimodal Convolutional Neural Networks for Matching Image and Sentence

Lin Ma, Zhengdong Lu, Lifeng Shang, Hang Li
2015 2015 IEEE International Conference on Computer Vision (ICCV)  
Our m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities.  ...  In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching image and sentence.  ...  Describing the image with natural sentences is useful for image annotation and captioning [8, 21, 26], while retrieving images with a natural-language query is more natural for image search [13, 16  ... 
doi:10.1109/iccv.2015.301 dblp:conf/iccv/MaLSL15 fatcat:fv3kzu4iz5ghplyzwofob4revy
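
A schematic reduction of the m-CNN recipe: compose words with a convolution plus max-pooling, then score the image-sentence pair with a small network over the joint vector. The dimensions, weights, and single convolution layer are invented simplifications of the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def sentence_conv(word_vecs, kernel):
    """Compose words with a width-3 convolution + max-pool, a simplified
    stand-in for the word-composition stage of a multimodal CNN."""
    windows = [np.concatenate(word_vecs[i:i + 3])
               for i in range(len(word_vecs) - 2)]
    feats = np.tanh(np.stack(windows) @ kernel)   # (num_windows, d)
    return feats.max(axis=0)                      # max-pool over positions

def match_score(img_vec, sent_vec, W, w):
    """Score the image-sentence pair with a small MLP on the joint vector."""
    joint = np.concatenate([img_vec, sent_vec])
    return float(w @ np.tanh(W @ joint))

words = [rng.normal(size=d) for _ in range(5)]
kernel = rng.normal(size=(3 * d, d))
img = rng.normal(size=d)
W, w = rng.normal(size=(8, 2 * d)), rng.normal(size=8)
print(match_score(img, sentence_conv(words, kernel), W, w))
```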

Every Picture Tells a Story: Generating Sentences from Images [chapter]

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth
2010 Lecture Notes in Computer Science  
We describe a system that can compute a score linking an image to a sentence.  ...  While our underlying estimate of meaning is impoverished, it is sufficient to produce very good quantitative results, evaluated with a novel score that can account for synecdoche.  ...  Finding images for sentences: Once the matching in the meaning space is established, we can generate sentences for images (annotation) and also find images that are best described by a sentence.  ... 
doi:10.1007/978-3-642-15561-1_2 fatcat:vamw26hpavhfda3azbzetwjwsq
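
The paper scores an image-sentence link through a shared meaning space of <object, action, scene> triples. A toy sketch that skips the learned mappings into that space and simply counts slot agreement between hand-assigned triples:

```python
def triple_score(image_triple, sentence_triple):
    """Count matching slots between two <object, action, scene> triples."""
    return sum(a == b for a, b in zip(image_triple, sentence_triple))

img = ("horse", "stand", "field")        # triple predicted from the image
sents = {
    "A horse stands in a field.": ("horse", "stand", "field"),
    "A bus parked on a street.": ("bus", "park", "street"),
}
best = max(sents, key=lambda s: triple_score(img, sents[s]))
print(best)
```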