
Real-Time MDNet [article]

Ilchae Jung, Jeany Son, Mooyeol Baek, Bohyung Han
2018 arXiv   pre-print
We present a fast and accurate visual tracking algorithm based on the multi-domain convolutional neural network (MDNet). The proposed approach accelerates feature extraction procedure and learns more discriminative models for instance classification; it enhances representation quality of target and background by maintaining a high resolution feature map with a large receptive field per activation. We also introduce a novel loss term to differentiate foreground instances across multiple domains and learn a more discriminative embedding of target objects with similar semantics. The proposed techniques are integrated into the pipeline of a well known CNN-based visual tracking algorithm, MDNet. We accomplish approximately 25 times speed-up with almost identical accuracy compared to MDNet. Our algorithm is evaluated in multiple popular tracking benchmark datasets including OTB2015, UAV123, and TempleColor, and outperforms the state-of-the-art real-time tracking methods consistently even without dataset-specific parameter tuning.
arXiv:1808.08834v1 fatcat:2ux6d7jobjfhhowah26yaiuroy

Streamlined Dense Video Captioning [article]

Jonghwan Mun, Linjie Yang, Zhou Ren, Ning Xu, Bohyung Han
2019 arXiv   pre-print
Dense video captioning is an extremely challenging task since accurate and coherent description of events in a video requires holistic understanding of video contents as well as contextual reasoning of individual events. Most existing approaches handle this problem by first detecting event proposals from a video and then captioning on a subset of the proposals. As a result, the generated sentences are prone to be redundant or inconsistent since they fail to consider temporal dependency between events. To tackle this challenge, we propose a novel dense video captioning framework, which models temporal dependency across events in a video explicitly and leverages visual and linguistic context from prior events for coherent storytelling. This objective is achieved by 1) integrating an event sequence generation network to select a sequence of event proposals adaptively, and 2) feeding the sequence of event proposals to our sequential video captioning network, which is trained by reinforcement learning with two-level rewards at both event and episode levels for better context modeling. The proposed technique achieves outstanding performances on ActivityNet Captions dataset in most metrics.
arXiv:1904.03870v1 fatcat:x5kjgrzjw5fgzkbyqvf5bfzzzq

Learning Deconvolution Network for Semantic Segmentation [article]

Hyeonwoo Noh, Seunghoon Hong, Bohyung Han
2015 arXiv   pre-print
We propose a novel semantic segmentation algorithm by learning a deconvolution network. We learn the network on top of the convolutional layers adopted from VGG 16-layer net. The deconvolution network is composed of deconvolution and unpooling layers, which identify pixel-wise class labels and predict segmentation masks. We apply the trained network to each proposal in an input image, and construct the final semantic segmentation map by combining the results from all proposals in a simple manner. The proposed algorithm mitigates the limitations of the existing methods based on fully convolutional networks by integrating deep deconvolution network and proposal-wise prediction; our segmentation method typically identifies detailed structures and handles objects in multiple scales naturally. Our network demonstrates outstanding performance in PASCAL VOC 2012 dataset, and we achieve the best accuracy (72.5%) among the methods trained with no external data through ensemble with the fully convolutional network.
arXiv:1505.04366v1 fatcat:rualme4krfdodkqjkdoxp2nqaa
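The unpooling layers mentioned in the abstract above rely on "switches" recorded during max pooling: each pooled value remembers where it came from, so the decoder can place activations back at their original locations. A minimal NumPy sketch of that mechanism (a simplified stand-in for illustration, not the paper's implementation; function names are ours):

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """k x k max pooling over a 2D map that also records argmax
    locations ("switches") as flat indices into the input."""
    h, w = x.shape
    out = np.zeros((h // k, w // k))
    switches = np.zeros_like(out, dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            patch = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            idx = int(np.argmax(patch))           # flat index within the patch
            out[i, j] = patch.flat[idx]
            switches[i, j] = (i * k + idx // k) * w + (j * k + idx % k)
    return out, switches

def unpool(pooled, switches, shape):
    """Place each pooled value back at its recorded location; zeros elsewhere."""
    up = np.zeros(shape)
    up.flat[switches.ravel()] = pooled.ravel()
    return up
```

Because patches are disjoint, `unpool(pool(x))` reconstructs a sparse map that keeps only the maxima, which is exactly the information the deconvolution layers then densify.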

Text-guided Attention Model for Image Captioning [article]

Jonghwan Mun, Minsu Cho, Bohyung Han
2016 arXiv   pre-print
Ba, Mnih, and Kavukcuoglu 2015; Kantorov et al. 2016), image generation (Gregor et al. 2015), semantic segmentation (Hong et al. 2016), and visual question answering (Andreas et al. 2016; Noh and Han  ... 
arXiv:1612.03557v1 fatcat:jsupp6wmsreyzbcz2wyh2sicyi

Multi-Level Branched Regularization for Federated Learning [article]

Jinkyu Kim, Geeho Kim, Bohyung Han
2022 arXiv   pre-print
A critical challenge of federated learning is data heterogeneity and imbalance across clients, which leads to inconsistency between local networks and unstable convergence of global models. To alleviate the limitations, we propose a novel architectural regularization technique that constructs multiple auxiliary branches in each local model by grafting local and global subnetworks at several different levels and that learns the representations of the main pathway in the local model congruent to the auxiliary hybrid pathways via online knowledge distillation. The proposed technique is effective to robustify the global model even in the non-iid setting and is applicable to various federated learning frameworks conveniently without incurring extra communication costs. We perform comprehensive empirical studies and demonstrate remarkable performance gains in terms of accuracy and efficiency compared to existing methods. The source code is available at our project page.
arXiv:2207.06936v1 fatcat:supek6iw4bcnrel3xsiywveixi
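The online knowledge distillation step described above typically matches the main pathway's softened predictions to those of an auxiliary branch with a KL-divergence term. A hedged NumPy sketch of that generic loss (illustrative only, not the paper's code; the temperature `T` is an assumed hyperparameter):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax, computed stably."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(main_logits, branch_logits, T=2.0):
    """KL(branch || main): pulls the main pathway's predictions toward
    those of the auxiliary hybrid branch, averaged over the batch."""
    p = softmax(branch_logits, T)             # branch plays the teacher role
    log_q = np.log(softmax(main_logits, T))
    return float((p * (np.log(p) - log_q)).sum(axis=-1).mean())
```

The loss is zero when the two pathways agree exactly and positive otherwise, so minimizing it keeps the local model's representations consistent with the grafted global subnetworks.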

Local-Global Video-Text Interactions for Temporal Grounding [article]

Jonghwan Mun, Minsu Cho, Bohyung Han
2020 arXiv   pre-print
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query. We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query, which corresponds to important semantic entities described in the query (e.g., actors, objects, and actions), and reflect bi-modal interactions between the linguistic features of the query and the visual features of the video in multiple levels. The proposed method effectively predicts the target time interval by exploiting contextual information from local to global during bi-modal interactions. Through in-depth ablation studies, we find out that incorporating both local and global context in video and text interactions is crucial to the accurate grounding. Our experiment shows that the proposed method outperforms the state of the arts on Charades-STA and ActivityNet Captions datasets by large margins, 7.44% and 4.61% points at Recall@tIoU=0.5 metric, respectively. Code is available at https://github.com/JonghwanMun/LGI4temporalgrounding.
arXiv:2004.07514v1 fatcat:uzu4t6nubff5fiz2qbmiwohnr4

Learning Debiased and Disentangled Representations for Semantic Segmentation [article]

Sanghyeok Chu, Dongwan Kim, Bohyung Han
2021 arXiv   pre-print
Deep neural networks are susceptible to learn biased models with entangled feature representations, which may lead to subpar performances on various downstream tasks. This is particularly true for under-represented classes, where a lack of diversity in the data exacerbates the tendency. This limitation has been addressed mostly in classification tasks, but there is little study on additional challenges that may appear in more complex dense prediction problems including semantic segmentation. To this end, we propose a model-agnostic and stochastic training scheme for semantic segmentation, which facilitates the learning of debiased and disentangled representations. For each class, we first extract class-specific information from the highly entangled feature map. Then, information related to a randomly sampled class is suppressed by a feature selection process in the feature space. By randomly eliminating certain class information in each training iteration, we effectively reduce feature dependencies among classes, and the model is able to learn more debiased and disentangled feature representations. Models trained with our approach demonstrate strong results on multiple semantic segmentation benchmarks, with especially notable performance gains on under-represented classes.
arXiv:2111.00531v1 fatcat:flnzsnkpx5httdmx7vhy2riy3q
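The stochastic suppression step described above, randomly eliminating one class's information each training iteration, can be sketched roughly as follows (a toy stand-in; `class_masks` is an assumed, precomputed map of class-specific feature locations, simplifying the paper's feature selection process):

```python
import numpy as np

def suppress_random_class(features, class_masks, rng):
    """Zero out the features attributed to one randomly sampled class.

    features:    (channels, H, W) feature map
    class_masks: list of boolean (H, W) maps, one per class, marking the
                 spatial locations attributed to that class (assumed input)
    rng:         numpy Generator used to sample the class to suppress
    """
    c = int(rng.integers(len(class_masks)))
    out = features.copy()
    out[:, class_masks[c]] = 0.0   # drop that class's information everywhere
    return out, c
```

Resampling `c` each iteration is what breaks persistent feature dependencies between classes: no class can rely on another class's features always being present.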

Class-Incremental Learning for Action Recognition in Videos [article]

Jaeyoo Park, Minsoo Kang, Bohyung Han
2022 arXiv   pre-print
Main Results: We compare the proposed method, referred to as Time-Channel Distillation (TCD), with existing class-incremental learning baselines (https://github.com/mit-han-lab/temporal-shift-module), which  ... 
arXiv:2203.13611v1 fatcat:f436lu5mhvhirgze5h3ncp5gpi

Regularizing Neural Networks via Stochastic Branch Layers [article]

Wonpyo Park, Paul Hongsuck Seo, Bohyung Han, Minsu Cho
2019 arXiv   pre-print
Han et al. (2017) introduce a regularized ensemble method for single object tracking, which branches out intermediate layers to learn different target representations.  ... 
arXiv:1910.01467v1 fatcat:b7sccp23ijemrexu4vyfvh2mzq

Fine-Grained Neural Architecture Search [article]

Heewon Kim, Seokil Hong, Bohyung Han, Heesoo Myeong, Kyoung Mu Lee
2019 arXiv   pre-print
We present an elegant framework of fine-grained neural architecture search (FGNAS), which allows to employ multiple heterogeneous operations within a single layer and can even generate compositional feature maps using several different base operations. FGNAS runs efficiently in spite of significantly large search space compared to other methods because it trains networks end-to-end by a stochastic gradient descent method. Moreover, the proposed framework allows to optimize the network under predefined resource constraints in terms of number of parameters, FLOPs and latency. FGNAS has been applied to two crucial applications in resource demanding computer vision tasks---large-scale image classification and image super-resolution---and demonstrates the state-of-the-art performance through flexible operation search and channel pruning.
arXiv:1911.07478v1 fatcat:zbx4p5ogdjh55huoyjoavti2nu

Communication-Efficient Federated Learning with Acceleration of Global Momentum [article]

Geeho Kim, Jinkyu Kim, Bohyung Han
2022 arXiv   pre-print
Federated learning often suffers from unstable and slow convergence due to heterogeneous characteristics of participating clients. Such tendency is aggravated when the client participation ratio is low since the information collected from the clients at each round is prone to be more inconsistent. To tackle the challenge, we propose a novel federated learning framework, which improves the stability of the server-side aggregation step, which is achieved by sending the clients an accelerated model estimated with the global gradient to guide the local gradient updates. Our algorithm naturally aggregates and conveys the global update information to participants with no additional communication cost and does not require to store the past models in the clients. We also regularize local update to further reduce the bias and improve the stability of local updates. We perform comprehensive empirical studies on real data under various settings and demonstrate the remarkable performance of the proposed method in terms of accuracy and communication-efficiency compared to the state-of-the-art methods, especially with low client participation rates. Our code is available at https://github.com/ninigapa0/FedAGM
arXiv:2201.03172v1 fatcat:mcwcqny36vb2riu4kknjyws6oi
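A rough sketch of a server round with global momentum acceleration, in the spirit of the abstract above (a generic illustration under assumed update rules, not the paper's exact algorithm; `beta` and `lam` are hypothetical coefficients):

```python
import numpy as np

def server_round(global_w, client_deltas, momentum, beta=0.9, lam=0.85):
    """One server step: average the client updates, fold them into a global
    momentum buffer, and compute a look-ahead ("accelerated") model to send
    back to the clients for the next round."""
    avg_delta = np.mean(client_deltas, axis=0)      # aggregate client updates
    momentum = beta * momentum + avg_delta          # global momentum buffer
    new_w = global_w + momentum                     # accelerated global update
    send_w = new_w + lam * momentum                 # look-ahead model for clients
    return new_w, send_w, momentum
```

Note that only `send_w` travels to the clients, which is why this style of acceleration adds no communication cost: the momentum buffer lives entirely on the server.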

Towards Oracle Knowledge Distillation with Neural Architecture Search [article]

Minsoo Kang, Jonghwan Mun, Bohyung Han
2019 arXiv   pre-print
We present a novel framework of knowledge distillation that is capable of learning powerful and efficient student models from ensemble teacher networks. Our approach addresses the inherent model capacity issue between teacher and student and aims to maximize benefit from teacher models during distillation by reducing their capacity gap. Specifically, we employ a neural architecture search technique to augment useful structures and operations, where the searched network is appropriate for knowledge distillation towards student models and free from sacrificing its performance by fixing the network capacity. We also introduce an oracle knowledge distillation loss to facilitate model search and distillation using an ensemble-based teacher model, where a student network is learned to imitate oracle performance of the teacher. We perform extensive experiments on the image classification datasets---CIFAR-100 and TinyImageNet---using various networks. We also show that searching for a new student model is effective in both accuracy and memory size and that the searched models often outperform their teacher models thanks to neural architecture search with oracle knowledge distillation.
arXiv:1911.13019v1 fatcat:pqxlzkumibgxzewmszam6y63cq
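The oracle idea above, letting the student imitate whichever ensemble member performs best on each example, can be sketched as per-sample teacher selection (an illustrative NumPy simplification, not the paper's actual loss):

```python
import numpy as np

def oracle_targets(teacher_logits, labels):
    """Pick, per example, the ensemble member most confident on the true
    class, and return its softmax prediction as the distillation target.

    teacher_logits: (num_teachers, batch, num_classes) array of logits
    labels:         (batch,) integer ground-truth class labels
    """
    e = np.exp(teacher_logits - teacher_logits.max(-1, keepdims=True))
    probs = e / e.sum(-1, keepdims=True)
    batch = np.arange(len(labels))
    true_p = probs[:, batch, labels]     # (num_teachers, batch) true-class prob
    best = true_p.argmax(axis=0)         # oracle teacher index per example
    return probs[best, batch]            # (batch, num_classes) soft targets
```

The student would then be trained against these oracle targets (e.g. with a cross-entropy or KL term), so it aims at the ensemble's best-case behavior rather than its average.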

Scenario-based video event recognition by constraint flow

Suha Kwak, Bohyung Han, Joon Hee Han
2011 CVPR 2011  
We present a novel approach to representing and recognizing composite video events. A composite event is specified by a scenario, which is based on primitive events and their temporal-logical relations, to constrain the arrangements of the primitive events in the composite event. We propose a new scenario description method to represent composite events fluently and efficiently. A composite event is recognized by a constrained optimization algorithm whose constraints are defined by the scenario. The dynamic configuration of the scenario constraints is represented with constraint flow, which is generated from scenario automatically by our scenario parsing algorithm. The constraint flow reduces the search space dramatically, alleviates the effect of preprocessing errors, and guarantees the globally optimal solution for recognition. We validate our method to describe scenario and construct constraint flow for real videos and illustrate the effectiveness of our composite event recognition algorithm for natural video events.
doi:10.1109/cvpr.2011.5995435 dblp:conf/cvpr/KwakHH11 fatcat:lue4pfutpfdino3q35tywzz6oq

Online Graph-Based Tracking [chapter]

Hyeonseob Nam, Seunghoon Hong, Bohyung Han
2014 Lecture Notes in Computer Science  
., Han, B.: Orderless tracking through model-averaged posterior estimation. In: ICCV.  ... 
doi:10.1007/978-3-319-10602-1_8 fatcat:glse7pclajd4xgxlb5b42liuku

Real-Time MDNet [chapter]

Ilchae Jung, Jeany Son, Mooyeol Baek, Bohyung Han
2018 Lecture Notes in Computer Science  
We present a fast and accurate visual tracking algorithm based on the multi-domain convolutional neural network (MDNet). The proposed approach accelerates feature extraction procedure and learns more discriminative models for instance classification; it enhances representation quality of target and background by maintaining a high resolution feature map with a large receptive field per activation. We also introduce a novel loss term to differentiate foreground instances across multiple domains and learn a more discriminative embedding of target objects with similar semantics. The proposed techniques are integrated into the pipeline of a well known CNN-based visual tracking algorithm, MDNet. We accomplish approximately 25 times speed-up with almost identical accuracy compared to MDNet. Our algorithm is evaluated in multiple popular tracking benchmark datasets including OTB2015, UAV123, and TempleColor, and outperforms the state-of-the-art real-time tracking methods consistently even without dataset-specific parameter tuning.
doi:10.1007/978-3-030-01225-0_6 fatcat:xwqyw3yxzzdafjuqcq5bwwc32u