
Image Captioning with Visual-Semantic LSTM

Nannan Li, Zhenzhong Chen
2018 Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence  
Inspired by the visual processing of our cognitive system, we propose a visual-semantic LSTM model to locate the attention objects with their low-level features in the visual cell, and then successively  ...  In this paper, a novel image captioning approach is proposed to describe the content of images.  ...  Then in the LSTM model, the visual cell LSTM_v utilizes the visual features to localize the objects in the image, whilst the semantic cell LSTM_s further integrates the localized objects with their attributes  ... 
doi:10.24963/ijcai.2018/110 dblp:conf/ijcai/LiC18 fatcat:beg636qo7nbzdbm3zesimqbepi
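The two-cell design described above separates a visual LSTM (LSTM_v) that localizes objects from a semantic LSTM (LSTM_s) that folds in their attributes. Below is a minimal PyTorch sketch of that idea; the class name TwoCellStep, the soft-attention localization, and all dimensions are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch: a "visual" LSTM cell attends over object-region
    # features, and a "semantic" cell consumes the attended vector plus an
    # attribute embedding. Not the paper's code.
    import torch
    import torch.nn as nn

    class TwoCellStep(nn.Module):
        def __init__(self, feat_dim=512, hid=512):
            super().__init__()
            self.visual_cell = nn.LSTMCell(feat_dim, hid)        # LSTM_v
            self.semantic_cell = nn.LSTMCell(feat_dim * 2, hid)  # LSTM_s
            self.attn = nn.Linear(hid, feat_dim)

        def forward(self, regions, attrs, state_v, state_s):
            # regions: (B, R, feat_dim) object regions; attrs: (B, feat_dim)
            h_v, c_v = self.visual_cell(regions.mean(1), state_v)
            weights = torch.softmax(regions @ self.attn(h_v).unsqueeze(-1), dim=1)
            attended = (weights * regions).sum(1)   # soft object localization
            h_s, c_s = self.semantic_cell(torch.cat([attended, attrs], -1), state_s)
            return h_s, (h_v, c_v), (h_s, c_s)

    step = TwoCellStep()
    B, H = 2, 512
    zeros = lambda: (torch.zeros(B, H), torch.zeros(B, H))
    out, sv, ss = step(torch.randn(B, 5, 512), torch.randn(B, 512), zeros(), zeros())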

Sketch Recognition with Deep Visual-Sequential Fusion Model

Jun-Yan He, Xiao Wu, Yu-Gang Jiang, Bo Zhao, Qiang Peng
2017 Proceedings of the 2017 ACM on Multimedia Conference - MM '17  
Finally, the visual and sequential representations of the sketches are seamlessly integrated with a fusion layer to obtain the final results.  ...  To learn the patterns of stroke order, sequential networks are constructed by Residual Long Short-Term Memory (R-LSTM) units, which optimize the network architecture by skip connection.  ...  Figure 5: The architecture of Residual LSTM (R-LSTM) unit with ReLU mapping.  ... 
doi:10.1145/3123266.3123321 dblp:conf/mm/HeWJZP17 fatcat:pbbvhkyqwze67j2y3cyluyleqm
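A minimal sketch of a residual LSTM unit with a ReLU mapping follows, assuming the skip connection wraps a standard LSTM layer; where exactly the ReLU sits in the published R-LSTM is not specified here, so treat this as an approximation rather than the paper's unit.

    # Residual LSTM sketch: LSTM output plus a projected skip of the input.
    import torch
    import torch.nn as nn

    class ResidualLSTM(nn.Module):
        def __init__(self, in_dim, hid):
            super().__init__()
            self.lstm = nn.LSTM(in_dim, hid, batch_first=True)
            self.skip = nn.Linear(in_dim, hid)  # projects input so shapes match

        def forward(self, x):
            # x: (B, T, in_dim) stroke-sequence features
            out, _ = self.lstm(x)
            return torch.relu(out + self.skip(x))  # skip connection + ReLU mapping

    y = ResidualLSTM(64, 128)(torch.randn(2, 10, 64))  # -> (2, 10, 128)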

TGIF: A New Dataset and Benchmark on Animated GIF Description

Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, Jiebo Luo
2016 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)  
We use the multi-label classification model of Read et al. [29] as our visual classifier.  ...  semantic roles (e.g., (a) “ball player” and (b) “pool of water”), but most sentences contain syntactic errors.  ... 
doi:10.1109/cvpr.2016.502 dblp:conf/cvpr/LiSCTGJL16 fatcat:olguswylfvhf3le4eblko63day

Neural Motifs: Scene Graph Parsing with Global Context

Rowan Zellers, Mark Yatskar, Sam Thomson, Yejin Choi
2018 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition  
Our analysis motivates a new baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set.  ...  We investigate the problem of producing structured graph representations of visual scenes. Our work analyzes the role of motifs: regularly appearing substructures in scene graphs.  ...  Acknowledgements We thank the anonymous reviewers along with Ali  ... 
doi:10.1109/cvpr.2018.00611 dblp:conf/cvpr/ZellersYTC18 fatcat:nvye4ywmyjdajpfei5zhqhn2lu
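The frequency baseline is simple enough to state in a few lines: for every (subject label, object label) pair seen in training, remember the most frequent relation and predict it at test time. A self-contained sketch with an invented data format:

    # Frequency baseline sketch: predict the training-set-majority relation
    # for each (subject, object) label pair.
    from collections import Counter, defaultdict

    def build_freq_baseline(training_triples):
        # training_triples: iterable of (subj_label, relation, obj_label)
        counts = defaultdict(Counter)
        for subj, rel, obj in training_triples:
            counts[(subj, obj)][rel] += 1
        return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}

    train = [("man", "riding", "horse"), ("man", "riding", "horse"),
             ("man", "near", "horse"), ("dog", "on", "couch")]
    predict = build_freq_baseline(train)
    print(predict[("man", "horse")])  # -> "riding"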

Multi-Networks Joint Learning for Large-Scale Cross-Modal Retrieval

Liang Zhang, Bingpeng Ma, Guorong Li, Qingming Huang, Qi Tian
2017 Proceedings of the 2017 ACM on Multimedia Conference - MM '17  
Moreover, they take feature learning and latent space embedding as two separate steps which cannot generate specific features to accord with the cross-modal task.  ...  Finally, we can simultaneously achieve specific features adapting to the cross-modal task and learn a shared latent space for images and sentences.  ...  Joint training: MNiL is a hybrid deep architecture that consists of ResNet and LSTM for learning the discriminative ranking with the accurate semantic expression.  ... 
doi:10.1145/3123266.3123317 dblp:conf/mm/ZhangMLHT17 fatcat:k2hmlxifinaxbjip2inbh5nglu
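The snippet describes the common recipe of training an image encoder (ResNet) and a sentence encoder (LSTM) jointly into one latent space. Below is a hedged sketch of the usual bidirectional hinge ranking loss over such a space; this is the generic objective, not necessarily MNiL's exact loss.

    # Bidirectional hinge ranking loss over a shared image/sentence space.
    import torch
    import torch.nn.functional as F

    def ranking_loss(img_emb, txt_emb, margin=0.2):
        # img_emb, txt_emb: (B, D), L2-normalized; matching pairs share an index.
        sims = img_emb @ txt_emb.t()               # (B, B) similarity matrix
        pos = sims.diag().unsqueeze(1)             # matched-pair similarities
        mask = torch.eye(sims.size(0), dtype=torch.bool)
        cost_i2t = (margin + sims - pos).clamp(min=0).masked_fill(mask, 0)
        cost_t2i = (margin + sims - pos.t()).clamp(min=0).masked_fill(mask, 0)
        return cost_i2t.mean() + cost_t2i.mean()

    img = F.normalize(torch.randn(8, 256), dim=1)
    txt = F.normalize(torch.randn(8, 256), dim=1)
    print(ranking_loss(img, txt))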

MSRC: Multimodal Spatial Regression with Semantic Context

Kan Chen, Rama Kovvuri, Jiyang Gao, Ram Nevatia
2017 Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval - ICMR '17  
Given an image and a natural language query phrase, a grounding system localizes the mentioned objects in the image according to the query's specifications.  ...  Second, MSRC not only encodes the semantics of a query phrase, but also deals with its relation with other queries in the same sentence (i.e., context) by a context refinement network.  ...  Figure 1: Multimodal Spatial Regression with semantic Context (MSRC) system regresses each proposal based on query's semantics and visual features.  ... 
doi:10.1145/3078971.3078976 dblp:conf/mir/ChenKGN17 fatcat:sv6p5i2lbzh6pcjv4okdcncpeq
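Per the snippet, MSRC both scores proposals against the query's semantics and regresses their boxes. A minimal sketch of those two heads, with all module names and dimensions invented for illustration:

    # Phrase-grounding sketch: score each proposal against the query and
    # regress a box refinement for the winner. Not the MSRC architecture.
    import torch
    import torch.nn as nn

    class ProposalGrounder(nn.Module):
        def __init__(self, vis_dim=512, txt_dim=512, hid=256):
            super().__init__()
            self.score = nn.Sequential(nn.Linear(vis_dim + txt_dim, hid),
                                       nn.ReLU(), nn.Linear(hid, 1))
            self.regress = nn.Sequential(nn.Linear(vis_dim + txt_dim, hid),
                                         nn.ReLU(), nn.Linear(hid, 4))

        def forward(self, proposal_feats, query_emb):
            # proposal_feats: (N, vis_dim); query_emb: (txt_dim,) phrase encoding
            joint = torch.cat([proposal_feats,
                               query_emb.expand(proposal_feats.size(0), -1)], dim=1)
            scores = self.score(joint).squeeze(1)  # relevance of each proposal
            deltas = self.regress(joint)           # (dx, dy, dw, dh) refinement
            best = scores.argmax()
            return best, deltas[best]

    grounder = ProposalGrounder()
    best_idx, refine = grounder(torch.randn(10, 512), torch.randn(512))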

Instance-aware Image and Sentence Matching with Selective Multimodal LSTM [article]

Yan Huang, Wei Wang, Liang Wang
2016 arXiv   pre-print
Effective image and sentence matching depends on how well to measure their global visual-semantic similarity.  ...  selective multimodal Long Short-Term Memory network (sm-LSTM) for instance-aware image and sentence matching.  ...  locations.  ... 
arXiv:1611.05588v1 fatcat:lrvxlemagncmjbd64sfhtmgmou

Geometry Attention Transformer with Position-aware LSTMs for Image Captioning [article]

Chi Wang, Yulin Shen, Luping Ji
2021 arXiv   pre-print
In recent years, transformer structures have been widely applied in image captioning with impressive performance.  ...  For good captioning results, the geometry and position relations of different visual objects are often thought of as crucial information.  ...  This kind of method typically builds captions through syntactic and semantic analysis of images, with visual concept detection, sentence template matching and optimization [10, 11, 12].  ... 
arXiv:2110.00335v1 fatcat:emucqxpc3rdfpeu3xoewpwuj2i
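One way to make attention position-aware, in the spirit of the geometry relations mentioned above, is to add a learned bias computed from relative box geometry to the scaled dot-product scores. The sketch below is an assumption-level illustration, not the paper's architecture.

    # Geometry-biased attention sketch: content scores plus a learned bias
    # from pairwise relative box geometry.
    import torch
    import torch.nn as nn

    def geometry_attention(q, k, v, geom, geom_proj):
        # q, k, v: (N, D) per-object features; geom: (N, N, 4) relative geometry
        d = q.size(-1)
        scores = (q @ k.t()) / d ** 0.5                # content term
        scores = scores + geom_proj(geom).squeeze(-1)  # learned geometry bias
        return torch.softmax(scores, dim=-1) @ v

    N, D = 6, 64
    geom_proj = nn.Linear(4, 1)  # maps (dx, dy, log dw, log dh) to a scalar bias
    out = geometry_attention(torch.randn(N, D), torch.randn(N, D),
                             torch.randn(N, D), torch.randn(N, N, 4), geom_proj)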

Neural Motifs: Scene Graph Parsing with Global Context [article]

Rowan Zellers, Mark Yatskar, Sam Thomson, Yejin Choi
2018 arXiv   pre-print
Our analysis motivates a new baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set.  ...  We investigate the problem of producing structured graph representations of visual scenes. Our work analyzes the role of motifs: regularly appearing substructures in scene graphs.  ...  Acknowledgements We thank the anonymous reviewers along with Ali  ... 
arXiv:1711.06640v2 fatcat:hvuai4lihzhwxhaog674btnqn4

Learning Convolutional Text Representations for Visual Question Answering [chapter]

Zhengyang Wang, Shuiwang Ji
2018 Proceedings of the 2018 SIAM International Conference on Data Mining  
Shallow models like fastText, which can obtain comparable results with deep learning models in tasks like text classification, are not suitable in visual question answering.  ...  Visual question answering is a recently proposed artificial intelligence task that requires a deep understanding of both images and texts.  ...  is consistent among tasks; that is, better image classification models yield better results when used in VQA models. This implies the image classification task shares similar requirements with the VQA task  ... 
doi:10.1137/1.9781611975321.67 dblp:conf/sdm/WangJ18 fatcat:vvr7hmnepvgq3cey73yt7cainy
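A convolutional text encoder of the kind studied here typically runs 1-D convolutions of several widths over word embeddings and max-pools over time. A sketch, with vocabulary size, filter widths, and dimensions as illustrative assumptions:

    # Convolutional text representation sketch: multi-width Conv1d over word
    # embeddings, max-pooled over time and concatenated.
    import torch
    import torch.nn as nn

    class ConvTextEncoder(nn.Module):
        def __init__(self, vocab=10000, emb=300, channels=512, widths=(2, 3, 4)):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.convs = nn.ModuleList(
                nn.Conv1d(emb, channels, w, padding=w // 2) for w in widths)

        def forward(self, tokens):                  # tokens: (B, T) word ids
            x = self.embed(tokens).transpose(1, 2)  # (B, emb, T) for Conv1d
            feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
            return torch.cat(feats, dim=1)          # (B, channels * len(widths))

    vec = ConvTextEncoder()(torch.randint(0, 10000, (4, 12)))  # -> (4, 1536)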

Language Features Matter: Effective Language Representations for Vision-Language Tasks [article]

Andrea Burns, Reuben Tan, Kate Saenko, Stan Sclaroff, Bryan A. Plummer
2019 arXiv   pre-print
retrieval, image captioning, visual question answering, phrase grounding, and text-to-clip retrieval.  ...  To further elevate language features, we also show that knowledge in vision-language problems can be transferred across tasks to gain performance with multi-task training.  ...  Visual Word2Vec Visual Word2Vec [29] is a neural model designed to ground the original Word2Vec representation with visual semantics.  ... 
arXiv:1908.06327v1 fatcat:z7o4dq62aneprlxqmopuw3ymni

Query-adaptive Video Summarization via Quality-aware Relevance Estimation

Arun Balajee Vasudevan, Michael Gygli, Anna Volokitin, Luc Van Gool
2017 Proceedings of the 2017 ACM on Multimedia Conference - MM '17  
We quantify relevance by measuring the distance between frames and queries in a common textual-visual semantic embedding space induced by a neural network.  ...  Furthermore, we introduce a new dataset, annotated with diversity and query-specific relevance labels.  ...  Finally, we introduced a new dataset for thumbnail selection which comes with query-relevance labels and a grouping of the frames according to visual and semantic similarity.  ... 
doi:10.1145/3123266.3123297 dblp:conf/mm/VasudevanGVG17 fatcat:lqe6pebxezc3bbvncz6wmiovla
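The relevance step reduces to ranking frames by similarity to the query in the shared textual-visual space. A numpy sketch, with the learned encoders replaced by stand-in vectors:

    # Query-relevance sketch: cosine similarity between frame and query
    # embeddings in a shared space; the paper induces this space with a network.
    import numpy as np

    def rank_frames(frame_embs, query_emb, k=5):
        # frame_embs: (N, D) frame vectors; query_emb: (D,) query vector
        f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
        q = query_emb / np.linalg.norm(query_emb)
        relevance = f @ q                  # cosine similarity in the shared space
        return np.argsort(-relevance)[:k]  # indices of the k most relevant frames

    frames = np.random.randn(100, 128)
    query = np.random.randn(128)
    print(rank_frames(frames, query))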

Query-adaptive Video Summarization via Quality-aware Relevance Estimation [article]

Arun Balajee Vasudevan, Michael Gygli, Anna Volokitin, Luc Van Gool
2017 arXiv   pre-print
We quantify relevance by measuring the distance between frames and queries in a common textual-visual semantic embedding space induced by a neural network.  ...  Furthermore, we introduce a new dataset, annotated with diversity and query-specific relevance labels.  ...  Finally, we introduced a new dataset for thumbnail selection which comes with query-relevance labels and a grouping of the frames according to visual and semantic similarity.  ... 
arXiv:1705.00581v2 fatcat:w3sijnp4bnev3a4qng3n2oqqma

LSTM-Based Facial Performance Capture Using Embedding Between Expressions [article]

Hsien-Yu Meng, Tzu-heng Lin, Xiubao Jiang, Yao Lu, Jiangtao Wen
2018 arXiv   pre-print
Second, the embeddings are fed into an LSTM network to learn the deformation between frames.  ...  First, to extract the information in the frames, we optimize a triplet loss to learn the embedding space which ensures the semantically closer facial expressions are closer in the embedding space and the  ...  Figure 7: VGG-like framework [Laine et al., 2016]. Figure 8: Experiment: the first row is our pretrained FaceNet + LSTM network, the second row is the VGG-like network trained with the LSTM network.  ... 
arXiv:1805.03874v4 fatcat:htjxxmhk2bawla34wiownkpyzm
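The triplet objective mentioned in the snippet is standard: pull an anchor toward a semantically close (positive) expression and push it from a negative one by a margin. A sketch of that loss, not necessarily the paper's exact formulation:

    # Triplet loss sketch: margin-based separation of positive and negative
    # pairs in the expression embedding space.
    import torch
    import torch.nn.functional as F

    def triplet_loss(anchor, positive, negative, margin=0.2):
        # each: (B, D) embeddings of facial-expression frames
        d_pos = F.pairwise_distance(anchor, positive)
        d_neg = F.pairwise_distance(anchor, negative)
        return F.relu(d_pos - d_neg + margin).mean()

    a, p, n = (torch.randn(4, 64) for _ in range(3))
    print(triplet_loss(a, p, n))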

Fluency-Guided Cross-Lingual Image Captioning

Weiyu Lan, Xirong Li, Jianfeng Dong
2017 Proceedings of the 2017 ACM on Multimedia Conference - MM '17  
Image captioning has so far been explored mostly in English, as most available datasets are in this language. However, the application of image captioning should not be restricted by language.  ...  Only a few studies have been conducted for image captioning in a cross-lingual setting.  ...  [Figure: LSTM-based image caption generation model with word and visual embeddings; candidate captions are accepted by rejection sampling when the estimated fluency exceeds 0.5.]  ... 
doi:10.1145/3123266.3123366 dblp:conf/mm/LanLD17 fatcat:6syfvi5ubva27e6ogi6kgtok3y
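The figure's "p(...) > 0.5" fragment suggests rejection sampling gated by an estimated fluency threshold. A toy sketch under that reading, with a placeholder fluency scorer (the paper trains a dedicated fluency model for this):

    # Fluency-guided rejection sampling sketch: resample candidate captions
    # until one clears an assumed fluency threshold of 0.5.
    import random

    def sample_fluent_caption(sample_caption, fluency_score, threshold=0.5,
                              max_tries=20):
        for _ in range(max_tries):
            caption = sample_caption()
            if fluency_score(caption) > threshold:
                return caption
        return caption  # fall back to the last sample if none clears the bar

    caption = sample_fluent_caption(
        sample_caption=lambda: random.choice(["a dog runs", "dog the run a"]),
        fluency_score=lambda c: 0.9 if c == "a dog runs" else 0.1)
    print(caption)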
Showing results 1 — 15 of 427.