Image Captioning with Visual-Semantic LSTM
2018
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
Inspired by the visual processing of our cognitive system, we propose a visual-semantic LSTM model to locate the attention objects with their low-level features in the visual cell, and then successively ...
In this paper, a novel image captioning approach is proposed to describe the content of images. ...
Then in the LSTM model, the visual cell LSTM_v utilizes the visual features to localize the objects in the image, whilst the semantic cell LSTM_s further integrates the localized objects with their attributes ...
doi:10.24963/ijcai.2018/110
dblp:conf/ijcai/LiC18
fatcat:beg636qo7nbzdbm3zesimqbepi
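The entry above describes a two-cell decoder: a visual cell (LSTM_v) that works on visual features to localize objects, and a semantic cell (LSTM_s) that integrates the localized objects with their attributes. Below is a minimal PyTorch sketch of one plausible wiring; the layer sizes, the attribute input, and the word-prediction head are illustrative assumptions, not the authors' exact architecture.

    import torch
    import torch.nn as nn

    class VisualSemanticDecoder(nn.Module):
        """Hypothetical two-cell decoder: a visual LSTM cell feeding a semantic LSTM cell."""

        def __init__(self, visual_dim=2048, attr_dim=300, hidden_dim=512, vocab_size=10000):
            super().__init__()
            self.lstm_v = nn.LSTMCell(visual_dim, hidden_dim)             # "visual cell"
            self.lstm_s = nn.LSTMCell(hidden_dim + attr_dim, hidden_dim)  # "semantic cell"
            self.word_head = nn.Linear(hidden_dim, vocab_size)

        def forward(self, visual_feats, attr_feats, state_v, state_s):
            # Visual cell processes low-level visual features of the image.
            h_v, c_v = self.lstm_v(visual_feats, state_v)
            # Semantic cell integrates the visual cell's output with attribute features.
            h_s, c_s = self.lstm_s(torch.cat([h_v, attr_feats], dim=1), state_s)
            return self.word_head(h_s), (h_v, c_v), (h_s, c_s)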
Sketch Recognition with Deep Visual-Sequential Fusion Model
2017
Proceedings of the 2017 ACM on Multimedia Conference - MM '17
Finally, the visual and sequential representations of the sketches are seamlessly integrated with a fusion layer to obtain the final results. ...
To learn the patterns of stroke order, sequential networks are constructed by Residual Long Short-Term Memory (R-LSTM) units, which optimize the network architecture by skip connection. ...
Figure 5: The architecture of the Residual LSTM (R-LSTM) unit with ReLU mapping. ...
doi:10.1145/3123266.3123321
dblp:conf/mm/HeWJZP17
fatcat:pbbvhkyqwze67j2y3cyluyleqm
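The R-LSTM unit above is described as an LSTM with a skip connection and a ReLU mapping (its Figure 5). One plausible reading, sketched below in PyTorch, sums the cell output with a ReLU-mapped projection of the input; the exact placement of the mapping in the original unit may differ.

    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualLSTMCell(nn.Module):
        """Hypothetical residual LSTM cell: cell output plus a ReLU-mapped skip path."""

        def __init__(self, input_dim, hidden_dim):
            super().__init__()
            self.cell = nn.LSTMCell(input_dim, hidden_dim)
            self.skip = nn.Linear(input_dim, hidden_dim)  # projects the input so the sum matches

        def forward(self, x, state):
            h, c = self.cell(x, state)
            h = h + F.relu(self.skip(x))  # skip connection with ReLU mapping
            return h, c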
TGIF: A New Dataset and Benchmark on Animated GIF Description
2016
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
We use the multi-label classification model of Read et al. [29] as our visual classifier. ...
semantic roles (e.g., (a) “ball player” and (b) “pool of water”), but most sentences contain syntactic errors. ...
doi:10.1109/cvpr.2016.502
dblp:conf/cvpr/LiSCTGJL16
fatcat:olguswylfvhf3le4eblko63day
Neural Motifs: Scene Graph Parsing with Global Context
2018
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Our analysis motivates a new baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set. ...
We investigate the problem of producing structured graph representations of visual scenes. Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. ...
Acknowledgements We thank the anonymous reviewers along with Ali ...
doi:10.1109/cvpr.2018.00611
dblp:conf/cvpr/ZellersYTC18
fatcat:nvye4ywmyjdajpfei5zhqhn2lu
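The frequency baseline quoted above is simple enough to state in code: tally how often each relation occurs for a given (subject label, object label) pair in the training scene graphs, and at test time predict the most frequent relation for that pair. The Python sketch below uses made-up triples and hypothetical helper names; it illustrates the idea rather than the authors' released implementation.

    from collections import Counter, defaultdict

    def build_frequency_baseline(training_triples):
        """training_triples: iterable of (subject_label, relation, object_label)."""
        stats = defaultdict(Counter)
        for subj, rel, obj in training_triples:
            stats[(subj, obj)][rel] += 1
        return stats

    def predict_relation(stats, subj, obj, default="no_relation"):
        """Return the most frequent training relation for this label pair."""
        counts = stats.get((subj, obj))
        return counts.most_common(1)[0][0] if counts else default

    stats = build_frequency_baseline([
        ("man", "riding", "horse"),
        ("man", "near", "horse"),
        ("man", "riding", "horse"),
    ])
    print(predict_relation(stats, "man", "horse"))  # -> "riding"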
Multi-Networks Joint Learning for Large-Scale Cross-Modal Retrieval
2017
Proceedings of the 2017 ACM on Multimedia Conference - MM '17
Moreover, they take feature learning and latent space embedding as two separate steps which cannot generate specific features to accord with the cross-modal task. ...
Finally, we can simultaneously achieve specific features adapting to the cross-modal task and learn a shared latent space for images and sentences. ...
Joint training MNiL is a hybrid deep architecture that consists of ResNet and LSTM for learning the discriminative ranking with the accurate semantic expression. ...
doi:10.1145/3123266.3123317
dblp:conf/mm/ZhangMLHT17
fatcat:k2hmlxifinaxbjip2inbh5nglu
MSRC
2017
Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval - ICMR '17
Given an image and a natural language query phrase, a grounding system localizes the mentioned objects in the image according to the query's specifications. ...
Second, MSRC not only encodes the semantics of a query phrase, but also deals with its relation with other queries in the same sentence (i.e., context) by a context refinement network. ...
Figure 1: The Multimodal Spatial Regression with semantic Context (MSRC) system regresses each proposal based on the query's semantics and visual features. ...
doi:10.1145/3078971.3078976
dblp:conf/mir/ChenKGN17
fatcat:sv6p5i2lbzh6pcjv4okdcncpeq
Instance-aware Image and Sentence Matching with Selective Multimodal LSTM
[article]
2016
arXiv
pre-print
Effective image and sentence matching depends on how to well measure their global visual-semantic similarity. ...
selective multimodal Long Short-Term Memory network (sm-LSTM) for instance-aware image and sentence matching. ...
arXiv:1611.05588v1
fatcat:lrvxlemagncmjbd64sfhtmgmou
Geometry Attention Transformer with Position-aware LSTMs for Image Captioning
[article]
2021
arXiv
pre-print
In recent years, transformer structures have been widely applied in image captioning with impressive performance. ...
For good captioning results, the geometry and position relations of different visual objects are often thought of as crucial information. ...
This kind of method typically builds captions through syntactic and semantic analysis of images, with visual concept detection, sentence template matching and optimization [10, 11, 12]. ...
arXiv:2110.00335v1
fatcat:emucqxpc3rdfpeu3xoewpwuj2i
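The entry above treats the geometry and position relations of visual objects as key input to the transformer. A common way to expose such relations to an attention layer (borrowed from relation-aware attention models in general, and assumed here purely for illustration rather than taken from this paper) is a small relative-geometry descriptor per object pair:

    import math

    def relative_geometry(box_a, box_b):
        """Relative geometry of two boxes given as (x, y, w, h): log-scaled
        offsets and size ratios. The paper's exact features may differ."""
        xa, ya, wa, ha = box_a
        xb, yb, wb, hb = box_b
        return (
            math.log(abs(xb - xa) / wa + 1e-6),
            math.log(abs(yb - ya) / ha + 1e-6),
            math.log(wb / wa),
            math.log(hb / ha),
        )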
Neural Motifs: Scene Graph Parsing with Global Context
[article]
2018
arXiv
pre-print
Our analysis motivates a new baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set. ...
We investigate the problem of producing structured graph representations of visual scenes. Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. ...
Acknowledgements We thank the anonymous reviewers along with Ali ...
arXiv:1711.06640v2
fatcat:hvuai4lihzhwxhaog674btnqn4
Learning Convolutional Text Representations for Visual Question Answering
[chapter]
2018
Proceedings of the 2018 SIAM International Conference on Data Mining
Shallow models like fastText, which can obtain comparable results with deep learning models in tasks like text classification, are not suitable in visual question answering. ...
Visual question answering is a recently proposed artificial intelligence task that requires a deep understanding of both images and texts. ...
is consistent among tasks; that is, better image classification models yield better results when used in VQA models. This implies the image classification task shares similar requirements with the VQA task ...
doi:10.1137/1.9781611975321.67
dblp:conf/sdm/WangJ18
fatcat:vvr7hmnepvgq3cey73yt7cainy
Language Features Matter: Effective Language Representations for Vision-Language Tasks
[article]
2019
arXiv
pre-print
retrieval, image captioning, visual question answering, phrase grounding, and text-to-clip retrieval. ...
To further elevate language features, we also show that knowledge in vision-language problems can be transferred across tasks to gain performance with multi-task training. ...
Visual Word2Vec Visual Word2Vec [29] is a neural model designed to ground the original Word2Vec representation with visual semantics. ...
arXiv:1908.06327v1
fatcat:z7o4dq62aneprlxqmopuw3ymni
Query-adaptive Video Summarization via Quality-aware Relevance Estimation
2017
Proceedings of the 2017 ACM on Multimedia Conference - MM '17
We quantify relevance by measuring the distance between frames and queries in a common textual-visual semantic embedding space induced by a neural network. ...
Furthermore, we introduce a new dataset, annotated with diversity and query-specific relevance labels. ...
Finally, we introduced a new dataset for thumbnail selection which comes with query-relevance labels and a grouping of the frames according to visual and semantic similarity. ...
doi:10.1145/3123266.3123297
dblp:conf/mm/VasudevanGVG17
fatcat:lqe6pebxezc3bbvncz6wmiovla
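Relevance here is described as a distance between frames and queries in a shared textual-visual embedding space learned by a neural network. The scoring step itself is generic; the NumPy sketch below assumes the embeddings already exist (the encoders that produce them are the paper's contribution and are not reproduced).

    import numpy as np

    def relevance_scores(frame_embeddings, query_embedding):
        """Cosine similarity between each frame and the query in a shared
        embedding space; higher means more relevant to the query."""
        frames = np.asarray(frame_embeddings, dtype=float)
        query = np.asarray(query_embedding, dtype=float)
        frames = frames / np.linalg.norm(frames, axis=1, keepdims=True)
        query = query / np.linalg.norm(query)
        return frames @ query

    scores = relevance_scores(np.random.rand(5, 128), np.random.rand(128))
    ranking = np.argsort(-scores)  # most query-relevant frames first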
Query-adaptive Video Summarization via Quality-aware Relevance Estimation
[article]
2017
arXiv
pre-print
We quantify relevance by measuring the distance between frames and queries in a common textual-visual semantic embedding space induced by a neural network. ...
Furthermore, we introduce a new dataset, annotated with diversity and query-specific relevance labels. ...
Finally, we introduced a new dataset for thumbnail selection which comes with query-relevance labels and a grouping of the frames according to visual and semantic similarity. ...
arXiv:1705.00581v2
fatcat:w3sijnp4bnev3a4qng3n2oqqma
LSTM-Based Facial Performance Capture Using Embedding Between Expressions
[article]
2018
arXiv
pre-print
Second, the embeddings are fed into an LSTM network to learn the deformation between frames. ...
First, to extract the information in the frames, we optimize a triplet loss to learn the embedding space which ensures the semantically closer facial expressions are closer in the embedding space and the ...
Figure 7: VGG-like framework [Laine et al., 2016]. ...
Figure 8: Experiment: the first row is our pretrained FaceNet + LSTM network, the second row is the VGG-like network trained with the LSTM network. ...
arXiv:1805.03874v4
fatcat:htjxxmhk2bawla34wiownkpyzm
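The embedding step above is trained with a triplet loss so that frames with semantically similar facial expressions end up close together. The objective itself is standard; the PyTorch sketch below uses random stand-in embeddings where the paper would use its FaceNet-style encoder outputs.

    import torch
    import torch.nn as nn

    # anchor/positive: frames with similar expressions; negative: a dissimilar frame.
    anchor = torch.randn(8, 128)    # stand-ins for encoder outputs
    positive = torch.randn(8, 128)
    negative = torch.randn(8, 128)

    triplet_loss = nn.TripletMarginLoss(margin=0.2)
    loss = triplet_loss(anchor, positive, negative)
    # mean over the batch of max(0, d(anchor, positive) - d(anchor, negative) + margin)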
Fluency-Guided Cross-Lingual Image Captioning
2017
Proceedings of the 2017 ACM on Multimedia Conference - MM '17
Image captioning has so far been explored mostly in English, as most available datasets are in this language. However, the application of image captioning should not be restricted by language. ...
Only few studies have been conducted for image captioning in a cross-lingual setting. ...
Figure: Model for image caption generation, showing stacked LSTM layers over word and visual embeddings with an output layer and a rejection sampling step (threshold 0.5). ...
doi:10.1145/3123266.3123366
dblp:conf/mm/LanLD17
fatcat:6syfvi5ubva27e6ogi6kgtok3y
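The figure placeholder above mentions a rejection-sampling step with a 0.5 threshold on an estimated probability, used to keep only sufficiently fluent captions. A minimal sketch of that filtering idea, with a hypothetical fluency scorer standing in for the paper's fluency model, could look like this:

    def filter_by_fluency(captions, fluency_model, threshold=0.5):
        """Keep captions whose estimated fluency probability exceeds the threshold.
        'fluency_model' is a hypothetical scorer returning p(fluent | caption)."""
        return [c for c in captions if fluency_model(c) > threshold]

    # Trivial stand-in scorer, for illustration only:
    kept = filter_by_fluency(
        ["a well formed caption", "caption broken fragment of"],
        fluency_model=lambda c: 0.9 if c.endswith("caption") else 0.1,
    )
    # kept == ["a well formed caption"]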
Showing results 1 — 15 out of 427 results