22,948 Hits in 2.8 sec

Joint Commonsense and Relation Reasoning for Image and Video Captioning

Jingyi Hou, Xinxiao Wu, Xiaoxun Zhang, Yayun Qi, Yunde Jia, Jiebo Luo
2020 Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)  
Particularly, our method is implemented by an iterative learning algorithm that alternates between 1) commonsense reasoning for embedding visual regions into the semantic space to build a semantic graph  ...  In this paper, we propose a joint commonsense and relation reasoning method that exploits prior knowledge for image and video captioning without relying on any detectors.  ...  via relation reasoning.  ... 
doi:10.1609/aaai.v34i07.6731 fatcat:ynlfbar46fd4xmcqgl4yido4zu

SAVE: A framework for semantic annotation of visual events

Mun Wai Lee, Asaad Hakeem, Niels Haering, Song-Chun Zhu
2008 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops  
The main contribution of this paper is a framework for an end-to-end system that infers visual events and annotates a large collection of videos.  ...  The second component is an event inference engine, where the Video Event Markup Language (VEML) is adopted for semantic representation, and a grammar-based approach is used for event analysis and detection  ...  Efficient automatic video analysis is required to enable retrieval via human-readable queries, either by searching the meta-data or text description.  ... 
doi:10.1109/cvprw.2008.4562954 dblp:conf/cvpr/LeeHHZ08 fatcat:63q64oofnnggtbvxlgmu6eutum

Video Captioning Using Weak Annotation [article]

Jingyi Hou, Yunde Jia, Xinxiao Wu, Yayun Qi
2020 arXiv   pre-print
To this end, we propose a progressive visual reasoning method that progressively generates fine sentences from weak annotations by inferring more semantic concepts and their dependency relationships for  ...  Accordingly, we develop an iterative refinement algorithm that refines sentences via spanning dependency trees and fine-tunes the captioning model using the refined sentences in an alternative training  ...  Our method performs much better probably due to the following reasons: (1) [21] detects semantic concepts via a detector pre-trained on a large-scale image dataset, while our method infers semantic concepts  ... 
arXiv:2009.01067v1 fatcat:55wjdfuvz5h4vojswkvneuwj7e

Semi-Automatic Annotation For Visual Object Tracking [article]

Kutalmis Gokalp Ince, Aybora Koksal, Arda Fazla, A. Aydin Alatan
2021 arXiv   pre-print
We propose a semi-automatic bounding box annotation method for visual object tracking by utilizing temporal information with a tracking-by-detection approach.  ...  For detection, we use an off-the-shelf object detector which is trained iteratively with the annotations generated by the proposed method, and we perform object detection on each frame independently.  ...  Adhikari et al. also worked on iterative bounding box annotation via two studies [1, 2].  ... 
arXiv:2101.06977v4 fatcat:ylm6mxoq2vgvli6dnqplplpyua

Discover and Learn New Objects from Documentaries [article]

Kai Chen, Hang Song, Chen Change Loy, Dahua Lin
2017 arXiv   pre-print
Towards this goal, we develop a joint probabilistic framework, where individual pieces of information, including video frames and subtitles, are brought together via both visual and linguistic links.  ...  Despite the remarkable progress in recent years, detecting objects in a new context remains a challenging task.  ...  visual and linguistic information in videos.  ... 
arXiv:1707.09593v1 fatcat:rrxoalahx5drvm4uq2iimax6kq

Understanding videos, constructing plots learning a visually grounded storyline model from annotated videos

Abhinav Gupta, Praveen Srinivasan, Jianbo Shi, Larry S. Davis
2009 2009 IEEE Conference on Computer Vision and Pattern Recognition  
We present an approach to learn a visually grounded storyline model of videos directly from weakly labeled data.  ...  Analyzing videos of human activities involves not only recognizing actions (typically based on their appearances), but also determining the story/plot of the video.  ...  Each action-type has a visual appearance model which provides visual grounding for OR-nodes. Each OR-node is connected to other OR-nodes either directly or via an AND-node.  ... 
doi:10.1109/cvpr.2009.5206492 dblp:conf/cvpr/GuptaSSD09 fatcat:tlimlyzwbfczvht4m46k2hswzy

Understanding videos, constructing plots learning a visually grounded storyline model from annotated videos

A. Gupta, P. Srinivasan, Jianbo Shi, L.S. Davis
2009 2009 IEEE Conference on Computer Vision and Pattern Recognition  
We present an approach to learn a visually grounded storyline model of videos directly from weakly labeled data.  ...  Analyzing videos of human activities involves not only recognizing actions (typically based on their appearances), but also determining the story/plot of the video.  ...  Each action-type has a visual appearance model which provides visual grounding for OR-nodes. Each OR-node is connected to other OR-nodes either directly or via an AND-node.  ... 
doi:10.1109/cvprw.2009.5206492 fatcat:osotgcetnvffhagjnbtg5wuhmu

Representation Learning on Visual-Symbolic Graphs for Video Understanding [article]

Effrosyni Mavroudi, Benjamín Béjar Haro, René Vidal
2020 arXiv   pre-print
To capture this rich visual and semantic context, we propose using two graphs: (1) an attributed spatio-temporal visual graph whose nodes correspond to actors and objects and whose edges encode different  ...  Events in natural videos typically arise from spatio-temporal interactions between actors and objects and involve multiple co-occurring activities and object classes.  ...  and object affordance detection, measured via F1-score.  ... 
arXiv:1905.07385v2 fatcat:6fz7xtbmhvh5hms2g5rplcghlm

Real-time landmark detection for precise endoscopic submucosal dissection via shape-aware relation network [article]

Jiacheng Wang, Yueming Jin, Shuntian Cai, Hongzhi Xu, Pheng-Ann Heng, Jing Qin, Liansheng Wang
2021 arXiv   pre-print
Both schemes are beneficial to the model in training, and can be readily unloaded in inference to achieve real-time detection.  ...  We propose a novel shape-aware relation network for accurate and real-time landmark detection in endoscopic submucosal dissection (ESD) surgery.  ...  It then can provide the global-level relation constraint to the detection network via adversarial learning.  ... 
arXiv:2111.04733v1 fatcat:zct3kojajzdetamgthdyhzsbfi

Discover and Learn New Objects from Documentaries

Kai Chen, Hang Song, Chen Change Loy, Dahua Lin
2017 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)  
Towards this goal, we develop a joint probabilistic framework, where individual pieces of information, including video frames and subtitles, are brought together via both visual and linguistic links.  ...  Despite the remarkable progress in recent years, detecting objects in a new context remains a challenging task.  ...  visual and linguistic information in videos.  ... 
doi:10.1109/cvpr.2017.124 dblp:conf/cvpr/ChenSLL17 fatcat:ywfmn4c5bfadhb2eg3hisignqu

Transitive Invariance for Self-supervised Visual Representation Learning [article]

Xiaolong Wang, Kaiming He, Abhinav Gupta
2017 arXiv   pre-print
Specifically, we propose to generate a graph with millions of objects mined from hundreds of thousands of videos.  ...  For object detection, we achieve 63.2% mAP on PASCAL VOC 2007 using Fast R-CNN (compare to 67.3% with ImageNet pre-training).  ...  We set up simple transitive relations on this graph to infer more complex invariance from the data, which are then used to train a Triplet-Siamese network for learning visual representations.  ... 
arXiv:1708.02901v3 fatcat:enmp5l5zezfpzc5spyfx64bueq

Structure Inference Machines: Recurrent Neural Networks for Analyzing Relations in Group Activity Recognition [article]

Zhiwei Deng, Arash Vahdat, Hexiang Hu, Greg Mori
2016 arXiv   pre-print
Rich semantic relations are important in a variety of visual recognition problems.  ...  Instead of using a traditional inference method, we use a sequential inference modeled by a recurrent neural network.  ...  Relations between image entities are an important facet of higher-level visual understanding.  ... 
arXiv:1511.04196v2 fatcat:xlhcepz44jfuxmwo6rc6tf3ckq

Structure Inference Machines: Recurrent Neural Networks for Analyzing Relations in Group Activity Recognition

Zhiwei Deng, Arash Vahdat, Hexiang Hu, Greg Mori
2016 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)  
Rich semantic relations are important in a variety of visual recognition problems.  ...  Instead of using a traditional inference method, we use a sequential inference modeled by a recurrent neural network.  ...  Relations between image entities are an important facet of higher-level visual understanding.  ... 
doi:10.1109/cvpr.2016.516 dblp:conf/cvpr/DengVHM16 fatcat:drhxwplsd5evvkkul6mawzbpia

Discriminative figure-centric models for joint action localization and recognition

Tian Lan, Yang Wang, Greg Mori
2011 2011 International Conference on Computer Vision  
In this paper we develop an algorithm for action recognition and localization in videos. The algorithm uses a figure-centric visual word representation.  ...  Temporal smoothness over video sequences is also enforced.  ...  Kovashka and Grauman [13] consider higher-order relations between visual words, each with discriminatively selected spatial arrangements.  ... 
doi:10.1109/iccv.2011.6126472 dblp:conf/iccv/LanWM11 fatcat:42rfv4vy3fadddg2yyezs23gpm

Learning Human-Object Interactions by Graph Parsing Neural Networks [chapter]

Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, Song-Chun Zhu
2018 Lecture Notes in Computer Science  
Within a message passing inference framework, GPNN iteratively computes the adjacency matrices and node labels.  ...  This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images and videos.  ...  In § 3, GPNN automatically infers graph structures (i.e., parse graph) via learning a soft adjacency matrix.  ... 
doi:10.1007/978-3-030-01240-3_25 fatcat:46wf6rn7dzerpeepmupj2ifvqq
Showing results 1 – 15 out of 22,948 results