Anti-Litter Surveillance based on Person Understanding via Multi-Task Learning

Kangmin Bae, Kimin Yun, Hyung-Il Kim, Youngwan Lee, Jongyoul Park
2020 British Machine Vision Conference  
In addition to collecting data from the real world, we train an effective model to understand the person through multiple datasets such as human poses, human coarse action (e.g., upright, bent), and fine action (e.g., pushing a cart) via multi-task learning. … Multi-task learning and fusion of features also helped to improve performance through a better understanding of a person. …
dblp:conf/bmvc/BaeYKLP20 fatcat:jjg6mgqamzfc3ke6lnlswnb3zq

Interpretable Visual Understanding with Cognitive Attention Network [article]

Xuejiao Tang, Wenbin Zhang, Yi Yu, Kea Turner, Tyler Derr, Mengyu Wang, Eirini Ntoutsi
2021 arXiv   pre-print
… which calls for exploiting the multi-source information as well as learning different levels of understanding and extensive commonsense knowledge. … Extensive experiments on the large-scale Visual Commonsense Reasoning (VCR) benchmark dataset demonstrate the effectiveness of our approach. … [5] further formulated Visual Commonsense Reasoning as the VCR task, which is an important step towards reliable visual understanding, and benchmarked the VCR dataset. …
arXiv:2108.02924v2 fatcat:iqtxzkuym5bmjd6wdf4cadhpjm

Multi-modal embeddings using multi-task learning for emotion recognition [article]

Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram
2020 arXiv   pre-print
In this paper, we extend the work from natural language understanding to multi-modal architectures that use audio, visual and textual information for machine learning tasks. … We use person identification and automatic speech recognition as the tasks in our embedding generation framework. … Context is typically set by what is being communicated and how through multi-modal cues. …
arXiv:2009.05019v1 fatcat:vig36g65ljcddabd6jbfnpwyba

Multi-Modal Embeddings Using Multi-Task Learning for Emotion Recognition

Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram
2020 Interspeech 2020  
In this paper, we extend the work from natural language understanding to multi-modal architectures that use audio, visual and textual information for machine learning tasks. … We use person identification and automatic speech recognition as the tasks in our embedding generation framework. … Context is typically set by what is being communicated and how through multi-modal cues. … (a minimal illustrative code sketch follows this entry)
doi:10.21437/interspeech.2020-1827 dblp:conf/interspeech/KharePS20 fatcat:kwcrzgivefhcpbotxrph7fv36m
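
Both records above describe the same framework: multi-modal embeddings shaped by person identification and automatic speech recognition as auxiliary tasks. The sketch below is a minimal, assumption-laden illustration (the feature sizes, concatenation fusion, and a single-vector "ASR" head standing in for a real sequence decoder are all invented for clarity); it is not the authors' model.

# Minimal sketch, not the published architecture: per-modality encoders are
# fused into one embedding, which two auxiliary task heads shape during training.
import torch
import torch.nn as nn

class MultiModalEmbedder(nn.Module):
    def __init__(self, d_audio=40, d_visual=512, d_text=300,
                 d_embed=256, num_speakers=100, vocab_size=1000):
        super().__init__()
        # One small encoder per modality (assumed linear projections).
        self.audio_enc = nn.Linear(d_audio, d_embed)
        self.visual_enc = nn.Linear(d_visual, d_embed)
        self.text_enc = nn.Linear(d_text, d_embed)
        # Fuse by concatenation, then project to the shared embedding space.
        self.fuse = nn.Sequential(nn.Linear(3 * d_embed, d_embed), nn.ReLU())
        # Auxiliary task heads: person identification and a heavily simplified
        # token-prediction stand-in for speech recognition.
        self.person_head = nn.Linear(d_embed, num_speakers)
        self.asr_head = nn.Linear(d_embed, vocab_size)

    def forward(self, audio, visual, text):
        z = self.fuse(torch.cat([self.audio_enc(audio),
                                 self.visual_enc(visual),
                                 self.text_enc(text)], dim=-1))
        return z, self.person_head(z), self.asr_head(z)

emb, person_logits, token_logits = MultiModalEmbedder()(
    torch.randn(4, 40), torch.randn(4, 512), torch.randn(4, 300))
# `emb` is the multi-modal representation that would later be reused for
# downstream emotion recognition.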

UniNet: A Unified Scene Understanding Network and Exploring Multi-Task Relationships through the Lens of Adversarial Attacks [article]

Naresh Kumar Gurulingan, Elahe Arani, Bahram Zonooz
2022 arXiv   pre-print
In multi-task learning (MTL), on the other hand, these single tasks are jointly learned, thereby providing an opportunity for tasks to share information and obtain a more comprehensive understanding. … We evaluate the task relationships in UniNet through the lens of adversarial attacks, based on the notion that they can exploit learned biases and task interactions in the neural network. … Multi-Task Learning and Task Relationships: Multi-task learning concerns the joint prediction of multiple tasks. … (a minimal code sketch of this shared-backbone setup follows this entry)
arXiv:2108.04584v2 fatcat:nlfnkd23mzed7brotxiqzbw4xu
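
The UniNet snippet above leans on the standard hard-parameter-sharing formulation of multi-task learning: one shared backbone feeds several task-specific heads, and a weighted sum of the task losses is minimized jointly. The sketch below uses made-up layer sizes, task heads, and loss weights; it is only meant to make that sharing concrete and is not the UniNet architecture.

# Minimal sketch of hard parameter sharing for multi-task scene understanding.
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, num_seg_classes=21):
        super().__init__()
        # Shared encoder: every task reads the same features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Task-specific heads (illustrative: segmentation and depth).
        self.seg_head = nn.Conv2d(64, num_seg_classes, 1)
        self.depth_head = nn.Conv2d(64, 1, 1)

    def forward(self, x):
        feats = self.backbone(x)
        return self.seg_head(feats), self.depth_head(feats)

model = MultiTaskNet()
x = torch.randn(2, 3, 64, 64)
seg_logits, depth = model(x)

# Joint training: one step minimizes a weighted sum of task losses, which is
# how the tasks exchange information through the common backbone.
seg_target = torch.randint(0, 21, (2, 16, 16))
depth_target = torch.rand(2, 1, 16, 16)
loss = nn.CrossEntropyLoss()(seg_logits, seg_target) \
       + 0.5 * nn.L1Loss()(depth, depth_target)
loss.backward()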

Disjoint Multi-task Learning between Heterogeneous Human-centric Tasks [article]

Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, Youngjin Yoon, In So Kweon
2018 arXiv   pre-print
However, multi-task learning relies on task-specific datasets, and constructing such datasets can be cumbersome: it requires huge amounts of data, labeling effort, statistical consideration, etc. … In order to make efficient use of data, multi-task learning has been studied in diverse computer vision tasks, including human behavior understanding. … Related Work: Previous works extend over multiple contexts: human understanding, multi-task learning, and disjoint setups. … (a minimal code sketch of disjoint-dataset training follows this entry)
arXiv:1802.04962v1 fatcat:flqul2wjrbax7gndviagr5atjy
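
The entry above concerns multi-task learning when each task comes from its own dataset. One common recipe for that disjoint setting, shown in the hedged sketch below, is to alternate mini-batches from the task-specific datasets and apply only the loss whose labels are present in each batch; the tensors, heads, and loop are hypothetical and do not reproduce the paper's transfer method.

# Illustrative sketch only: alternate batches from disjoint datasets and
# back-propagate only the loss whose labels exist in the current batch.
import torch
import torch.nn as nn
from itertools import cycle

backbone = nn.Linear(128, 64)                        # shared representation
heads = {"pose": nn.Linear(64, 34),                  # e.g. 17 keypoints (x, y)
         "action": nn.Linear(64, 10)}                # e.g. 10 action classes
params = list(backbone.parameters()) + [p for h in heads.values() for p in h.parameters()]
optim = torch.optim.SGD(params, lr=1e-2)

# Two disjoint "datasets": each carries labels for only one task.
pose_batches = [(torch.randn(8, 128), torch.randn(8, 34)) for _ in range(3)]
action_batches = [(torch.randn(8, 128), torch.randint(0, 10, (8,))) for _ in range(3)]

for (xp, yp), (xa, ya) in zip(cycle(pose_batches), action_batches):
    optim.zero_grad()
    # Pose loss from the pose-labeled batch, action loss from the action-labeled batch;
    # the shared backbone receives gradients from both.
    loss = nn.MSELoss()(heads["pose"](backbone(xp)), yp)
    loss = loss + nn.CrossEntropyLoss()(heads["action"](backbone(xa)), ya)
    loss.backward()
    optim.step()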

Multi-domain and multi-task prediction of extraversion and leadership from meeting videos

Ahmet Alp Kindiroglu, Lale Akarun, Oya Aran
2017 EURASIP Journal on Image and Video Processing  
Our results indicate that multi-task learning methods, using 10 personality annotations as tasks and with a transfer from two different datasets from different domains, improve the overall recognition performance … We use feature analysis and multi-task learning methods in conjunction with the non-verbal features and crowd-sourced annotations from the Video bLOG (VLOG) corpus to perform a multi-domain and multi-task … Availability of data and materials: The datasets supporting the conclusions of this article are available for download from the IDIAP dataset access pages at https://www.idiap.ch/dataset/youtube-personality …
doi:10.1186/s13640-017-0224-z fatcat:enzkhmqambbgjikukndstnf7hm

Understanding Humans in Crowded Scenes

Jian Zhao, Jianshu Li, Yu Cheng, Terence Sim, Shuicheng Yan, Jiashi Feng
2018 ACM Multimedia Conference (MM '18)
Despite the noticeable progress in perceptual tasks like detection, instance segmentation and human parsing, computers still perform unsatisfactorily on visually understanding humans in crowded scenes … NAN consistently outperforms existing state-of-the-art solutions on our MHP and several other datasets, and serves as a strong baseline to drive the future research for multi-human parsing. … different person instances and refines results simultaneously through deep nested adversarial learning in an effective yet time-efficient manner. …
doi:10.1145/3240508.3240509 dblp:conf/mm/ZhaoLCSYF18 fatcat:rby6klh6ozflzptt23vo2x6xzy

Compositional action recognition with multi-view feature fusion

Zhicheng Zhao, Yingan Liu, Lei Ma, Ayan Seal
2022 PLoS ONE  
We validate our approach on two action recognition datasets, IKEA ASM and LEMMA. … In particular, on the IKEA ASM dataset, the multi-view fusion approach improves top-1 performance by 18.1% over the single-view approach. … The IKEA ASM dataset [15] is a multi-modal and multi-view video dataset of assembly tasks to enable rich analysis and understanding of human activities. …
doi:10.1371/journal.pone.0266259 pmid:35421122 pmcid:PMC9009598 fatcat:g4yvhgdlfvhfnctj3u4pzfovve

Fine-Grained Multi-human Parsing

Jian Zhao, Jianshu Li, Hengzhu Liu, Shuicheng Yan, Jiashi Feng
2019 International Journal of Computer Vision  
Despite the noticeable progress in perceptual tasks like detection, instance segmentation and human parsing, computers still perform unsatisfactorily on visually understanding humans in crowded scenes … With the above innovations and contributions, we have organized the CVPR 2018 Workshop on Visual Understanding of Humans in Crowd Scene (VUHCS 2018) and the Fine-Grained Multi-human Parsing and Pose Estimation … The work of Jiashi Feng was partially supported by NUS IDS R-263-000-C67-646, ECRA R-263-000-C87-133 and MOE Tier-II R-263-000-D17-112. …
doi:10.1007/s11263-019-01181-5 fatcat:ezqzloyjizfwnajvz5qc4ntljq

Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing [article]

Jian Zhao, Jianshu Li, Yu Cheng, Li Zhou, Terence Sim, Shuicheng Yan, Jiashi Feng
2018 arXiv   pre-print
Despite the noticeable progress in perceptual tasks like detection, instance segmentation and human parsing, computers still perform unsatisfactorily on visually understanding humans in crowded scenes … NAN consistently outperforms existing state-of-the-art solutions on our MHP and several other datasets, and serves as a strong baseline to drive the future research for multi-human parsing. … The work of Jiashi Feng was partially supported by NUS startup R-263-000-C08-133, MOE Tier-I R-263-000-C21-112, NUS IDS R-263-000-C67-646 and ECRA R-263-000-C87-133. …
arXiv:1804.03287v2 fatcat:m6goc3nfwrhx5mh4tvxd32jmau

Video Question Answering with Iterative Video-Text Co-Tokenization [article]

AJ Piergiovanni and Kairo Morton and Weicheng Kuo and Michael S. Ryoo and Anelia Angelova
2022 arXiv   pre-print
Video question answering is a challenging task that requires understanding jointly the language input, the visual information in individual video frames, as well as the temporal information about the events … In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of … VideoQA has the inherent challenges of VQA tasks: it needs to understand the visual and language inputs and how they relate to each other. …
arXiv:2208.00934v1 fatcat:np2ndxet7fdwja74n2zqtdhr2y

LEMMA: A Multi-view Dataset for Learning Multi-agent Multi-task Activities [article]

Baoxiong Jia, Yixin Chen, Siyuan Huang, Yixin Zhu, Song-chun Zhu
2020 arXiv   pre-print
We introduce the LEMMA dataset to provide a single home to address these missing dimensions with meticulously designed settings, wherein the number of tasks and agents varies to highlight different learning … However, a few imperative components of daily human activities are largely missed in prior literature, including the goal-directed actions, concurrent multi-tasks, and collaborations among multi-agents … VCLA for assisting the endeavor of post-processing this massive dataset. …
arXiv:2007.15781v1 fatcat:c4hzp7ohnvh57caehrxu5vfwua

OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation [article]

Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Mingzhen Sun, Weining Wang, Hanqing Lu, Shiyu Zhou, Jiajun Zhang, Jinqiao Wang
2021 arXiv   pre-print
For the OPT's pre-training, we design a multi-task pretext learning scheme to model multi-modal resources from three different data granularities, i.e., token-, modality-, and sample-level modeling, through … Experimental results show that OPT can learn strong image-text-audio multi-modal representations and achieve promising results on a variety of cross-modal understanding and generation tasks. … OPT is pretrained on large amounts of language-vision-audio triplets with a multi-task pretext learning scheme, and can effectively adapt to downstream understanding and generation tasks given single-, …
arXiv:2107.00249v2 fatcat:m62kzdqj5zga3ezzgjnohvxnve

Attention Guided Semantic Relationship Parsing for Visual Question Answering [article]

Moshiur Farazi, Salman Khan, Nick Barnes
2020 arXiv   pre-print
… is trying to solve a multi-modal task. … Humans explain inter-object relationships with semantic labels that demonstrate a high-level understanding required to perform complex Vision-Language tasks such as Visual Question Answering (VQA). … However, for achieving high-level visual understanding, one needs to learn both mono-modal and multi-modal interactions, which we propose in this work. …
arXiv:2010.01725v1 fatcat:dcvmigzvsvh7zbjjk7m3ezyj24
Showing results 1–15 of 57,078.