2,142 Hits in 2.3 sec

Grid R-CNN [article]

Xin Lu, Buyu Li, Yuxin Yue, Quanquan Li, Junjie Yan
2018 arXiv   pre-print
This paper proposes a novel object detection framework named Grid R-CNN, which adopts a grid guided localization mechanism for accurate object detection. Different from the traditional regression based methods, the Grid R-CNN captures the spatial information explicitly and enjoys the position sensitive property of fully convolutional architecture. Instead of using only two independent points, we design a multi-point supervision formulation to encode more clues in order to reduce the impact of
more » ... accurate prediction of specific points. To take the full advantage of the correlation of points in a grid, we propose a two-stage information fusion strategy to fuse feature maps of neighbor grid points. The grid guided localization approach is easy to be extended to different state-of-the-art detection frameworks. Grid R-CNN leads to high quality object localization, and experiments demonstrate that it achieves a 4.1% AP gain at IoU=0.8 and a 10.0% AP gain at IoU=0.9 on COCO benchmark compared to Faster R-CNN with Res50 backbone and FPN architecture.
arXiv:1811.12030v1 fatcat:u4mx6dkmezespphp5w2gi5ahc4

Cross-dataset Training for Class Increasing Object Detection [article]

Yongqiang Yao, Yan Wang, Yu Guo, Jiaojiao Lin, Hongwei Qin, Junjie Yan
2020 arXiv   pre-print
We present a conceptually simple, flexible and general framework for cross-dataset training in object detection. Given two or more already labeled datasets that target for different object classes, cross-dataset training aims to detect the union of the different classes, so that we do not have to label all the classes for all the datasets. By cross-dataset training, existing datasets can be utilized to detect the merged object classes with a single model. Further more, in industrial
more » ... , the object classes usually increase on demand. So when adding new classes, it is quite time-consuming if we label the new classes on all the existing datasets. While using cross-dataset training, we only need to label the new classes on the new dataset. We experiment on PASCAL VOC, COCO, WIDER FACE and WIDER Pedestrian with both solo and cross-dataset settings. Results show that our cross-dataset pipeline can achieve similar impressive performance simultaneously on these datasets compared with training independently.
arXiv:2001.04621v1 fatcat:rzqrarzpuzentolvjuiaozmrx4

CRAFT Objects from Images [article]

Bin Yang, Junjie Yan, Zhen Lei, Stan Z. Li
2016 arXiv   pre-print
CRAFT on ILSVRC detection val2 set in comparison with other state-of-the-art detectors. 92.37 93.61 93.75 94.13 93.04 method proposal classifier ilsvrc Ouyang et al. [24] SS+EB RCNN 45.0 Yan  ... 
arXiv:1604.03239v1 fatcat:aic7b3snhjfa7dkdxdozchq3ji

Peephole: Predicting Network Performance Before Training [article]

Boyang Deng, Junjie Yan, Dahua Lin
2017 arXiv   pre-print
The quest for performant networks has been a significant force that drives the advancements of deep learning in recent years. While rewarding, improving network design has never been an easy journey. The large design space combined with the tremendous cost required for network training poses a major obstacle to this endeavor. In this work, we propose a new approach to this problem, namely, predicting the performance of a network before training, based on its architecture. Specifically, we
more » ... p a unified way to encode individual layers into vectors and bring them together to form an integrated description via LSTM. Taking advantage of the recurrent network's strong expressive power, this method can reliably predict the performances of various network architectures. Our empirical studies showed that it not only achieved accurate predictions but also produced consistent rankings across datasets -- a key desideratum in performance prediction.
arXiv:1712.03351v1 fatcat:klb2axrnsrcsfcotleptfu6hse

Impression Network for Video Object Detection [article]

Congrui Hetang, Hongwei Qin, Shaohui Liu, Junjie Yan
2017 arXiv   pre-print
Video object detection is more challenging compared to image object detection. Previous works proved that applying object detector frame by frame is not only slow but also inaccurate. Visual clues get weakened by defocus and motion blur, causing failure on corresponding frames. Multi-frame feature fusion methods proved effective in improving the accuracy, but they dramatically sacrifice the speed. Feature propagation based methods proved effective in improving the speed, but they sacrifice the
more » ... ccuracy. So is it possible to improve speed and performance simultaneously? Inspired by how human utilize impression to recognize objects from blurry frames, we propose Impression Network that embodies a natural and efficient feature aggregation mechanism. In our framework, an impression feature is established by iteratively absorbing sparsely extracted frame features. The impression feature is propagated all the way down the video, helping enhance features of low-quality frames. This impression mechanism makes it possible to perform long-range multi-frame feature fusion among sparse keyframes with minimal overhead. It significantly improves per-frame detection baseline on ImageNet VID while being 3 times faster (20 fps). We hope Impression Network can provide a new perspective on video feature enhancement. Code will be made available.
arXiv:1712.05896v1 fatcat:mt2izdtq3zfvdh3ey7w35eascu

Synaptic Strength For Convolutional Neural Network [article]

Chen Lin, Zhao Zhong, Wei Wu, Junjie Yan
2018 arXiv   pre-print
Convolutional Neural Networks(CNNs) are both computation and memory intensive which hindered their deployment in mobile devices. Inspired by the relevant concept in neural science literature, we propose Synaptic Pruning: a data-driven method to prune connections between input and output feature maps with a newly proposed class of parameters called Synaptic Strength. Synaptic Strength is designed to capture the importance of a connection based on the amount of information it transports.
more » ... t results show the effectiveness of our approach. On CIFAR-10, we prune connections for various CNN models with up to 96% , which results in significant size reduction and computation saving. Further evaluation on ImageNet demonstrates that synaptic pruning is able to discover efficient models which is competitive to state-of-the-art compact CNNs such as MobileNet-V2 and NasNet-Mobile. Our contribution is summarized as following: (1) We introduce Synaptic Strength, a new class of parameters for CNNs to indicate the importance of each connections. (2) Our approach can prune various CNNs with high compression without compromising accuracy. (3) Further investigation shows, the proposed Synaptic Strength is a better indicator for kernel pruning compared with the previous approach in both empirical result and theoretical analysis.
arXiv:1811.02454v1 fatcat:r7whuj7ljbguni5qsbyhilllfm

Domain Adaptive Person Search [article]

Junjie Li, Yichao Yan, Guanshuo Wang, Fufu Yu, Qiong Jia, Shouhong Ding
2022 arXiv   pre-print
Yan et al. [43] introduce a graph model to explore the impact of contextual information for identity matching. Chen et al.  ... 
arXiv:2207.11898v1 fatcat:pl4qnodjfbcfrgedypsj7g2g6e

Scale-Aware Face Detection [article]

Zekun Hao, Yu Liu, Hongwei Qin, Junjie Yan, Xiu Li, Xiaolin Hu
2017 arXiv   pre-print
Convolutional neural network (CNN) based face detectors are inefficient in handling faces of diverse scales. They rely on either fitting a large single model to faces across a large scale range or multi-scale testing. Both are computationally expensive. We propose Scale-aware Face Detector (SAFD) to handle scale explicitly using CNN, and achieve better performance with less computation cost. Prior to detection, an efficient CNN predicts the scale distribution histogram of the faces. Then the
more » ... le histogram guides the zoom-in and zoom-out of the image. Since the faces will be approximately in uniform scale after zoom, they can be detected accurately even with much smaller CNN. Actually, more than 99% of the faces in AFW can be covered with less than two zooms per image. Extensive experiments on FDDB, MALF and AFW show advantages of SAFD.
arXiv:1706.09876v1 fatcat:n6tjh7td45hhbecurtjd6xsccq

Towards Flops-constrained Face Recognition [article]

Yu Liu, Guanglu Song, Manyuan Zhang, Jihao Liu, Yucong Zhou, Junjie Yan
2019 arXiv   pre-print
Large scale face recognition is challenging especially when the computational budget is limited. Given a flops upper bound, the key is to find the optimal neural network architecture and optimization method. In this article, we briefly introduce the solutions of team 'trojans' for the ICCV19 - Lightweight Face Recognition Challenge . The challenge requires each submission to be one single model with the computational budget no higher than 30 GFlops. We introduce a searched network architecture
more » ... Efficient PolyFace' based on the Flops constraint, a novel loss function 'ArcNegFace', a novel frame aggregation method 'QAN++', together with a bag of useful tricks in our implementation (augmentations, regular face, label smoothing, anchor finetuning, etc.). Our basic model, 'Efficient PolyFace', takes 28.25 Gflops for the 'deepglint-large' image-based track, and the 'PolyFace+QAN++' solution takes 24.12 Gflops for the 'iQiyi-large' video-based track. These two solutions achieve 94.198% @ 1e-8 and 72.981% @ 1e-4 in the two tracks respectively, which are the state-of-the-art results.
arXiv:1909.00632v1 fatcat:7fgsak3clrccndav4zskhdvmt4

Equalization Loss for Large Vocabulary Instance Segmentation [article]

Jingru Tan, Changbao Wang, Quanquan Li, Junjie Yan
2019 arXiv   pre-print
Recent object detection and instance segmentation tasks mainly focus on datasets with a relatively small set of categories, e.g. Pascal VOC with 20 classes and COCO with 80 classes. The new large vocabulary dataset LVIS brings new challenges to conventional methods. In this work, we propose an equalization loss to solve the long tail of rare categories problem. Combined with exploiting the data from detection datasets to alleviate the effect of missing-annotation problems during the training,
more » ... r method achieves 5.1\% overall AP gain and 11.4\% AP gain of rare categories on LVIS benchmark without any bells and whistles compared to Mask R-CNN baseline. Finally we achieve 28.9 mask AP on the test-set of the LVIS and rank 1st place in LVIS Challenge 2019.
arXiv:1911.04692v1 fatcat:oyar7lj4pbdktly5tesimkjiyu

Localization Guided Learning for Pedestrian Attribute Recognition [article]

Pengze Liu, Xihui Liu, Junjie Yan, Jing Shao
2018 arXiv   pre-print
Pedestrian attribute recognition has attracted many attentions due to its wide applications in scene understanding and person analysis from surveillance videos. Existing methods try to use additional pose, part or viewpoint information to complement the global feature representation for attribute classification. However, these methods face difficulties in localizing the areas corresponding to different attributes. To address this problem, we propose a novel Localization Guided Network which
more » ... gns attribute-specific weights to local features based on the affinity between proposals pre-extracted proposals and attribute locations. The advantage of our model is that our local features are learned automatically for each attribute and emphasized by the interaction with global features. We demonstrate the effectiveness of our Localization Guided Network on two pedestrian attribute benchmarks (PA-100K and RAP). Our result surpasses the previous state-of-the-art in all five metrics on both datasets.
arXiv:1808.09102v1 fatcat:uox3ri45jfezjfdg3grb6qsmbe

Quality Aware Network for Set to Set Recognition [article]

Yu Liu, Junjie Yan, Wanli Ouyang
2017 arXiv   pre-print
This paper targets on the problem of set to set recognition, which learns the metric between two image sets. Images in each set belong to the same identity. Since images in a set can be complementary, they hopefully lead to higher accuracy in practical applications. However, the quality of each sample cannot be guaranteed, and samples with poor quality will hurt the metric. In this paper, the quality aware network (QAN) is proposed to confront this problem, where the quality of each sample can
more » ... e automatically learned although such information is not explicitly provided in the training stage. The network has two branches, where the first branch extracts appearance feature embedding for each sample and the other branch predicts quality score for each sample. Features and quality scores of all samples in a set are then aggregated to generate the final feature embedding. We show that the two branches can be trained in an end-to-end manner given only the set-level identity annotation. Analysis on gradient spread of this mechanism indicates that the quality learned by the network is beneficial to set-to-set recognition and simplifies the distribution that the network needs to fit. Experiments on both face verification and person re-identification show advantages of the proposed QAN. The source code and network structure can be downloaded at
arXiv:1704.03373v1 fatcat:5766tt2oifd6fc4d4moxfnluza

1st Place Solutions for OpenImage2019 – Object Detection and Instance Segmentation [article]

Yu Liu, Guanglu Song, Yuhang Zang, Yan Gao, Enze Xie, Junjie Yan, Chen Change Loy, Xiaogang Wang
2020 arXiv   pre-print
This article introduces the solutions of the two champion teams, 'MMfruit' for the detection track and 'MMfruitSeg' for the segmentation track, in OpenImage Challenge 2019. It is commonly known that for an object detector, the shared feature at the end of the backbone is not appropriate for both classification and regression, which greatly limits the performance of both single stage detector and Faster RCNN based detector. In this competition, we observe that even with a shared feature,
more » ... t locations in one object has completely inconsistent performances for the two tasks. E.g. the features of salient locations are usually good for classification, while those around the object edge are good for regression. Inspired by this, we propose the Decoupling Head (DH) to disentangle the object classification and regression via the self-learned optimal feature extraction, which leads to a great improvement. Furthermore, we adjust the soft-NMS algorithm to adj-NMS to obtain stable performance improvement. Finally, a well-designed ensemble strategy via voting the bounding box location and confidence is proposed. We will also introduce several training/inferencing strategies and a bag of tricks that give minor improvement. Given those masses of details, we train and aggregate 28 global models with various backbones, heads and 3+2 expert models, and achieves the 1st place on the OpenImage 2019 Object Detection Challenge on the both public and private leadboards. Given such good instance bounding box, we further design a simple instance-level semantic segmentation pipeline and achieve the 1st place on the segmentation challenge.
arXiv:2003.07557v1 fatcat:fri42ao7tff43f2peu4q3wuchu

Visualization of Convolutional Neural Networks for Monocular Depth Estimation [article]

Junjie Hu, Yan Zhang, Takayuki Okatani
2019 arXiv   pre-print
Recently, convolutional neural networks (CNNs) have shown great success on the task of monocular depth estimation. A fundamental yet unanswered question is: how CNNs can infer depth from a single image. Toward answering this question, we consider visualization of inference of a CNN by identifying relevant pixels of an input image to depth estimation. We formulate it as an optimization problem of identifying the smallest number of image pixels from which the CNN can estimate a depth map with the
more » ... minimum difference from the estimate from the entire image. To cope with a difficulty with optimization through a deep CNN, we propose to use another network to predict those relevant image pixels in a forward computation. In our experiments, we first show the effectiveness of this approach, and then apply it to different depth estimation networks on indoor and outdoor scene datasets. The results provide several findings that help exploration of the above question.
arXiv:1904.03380v1 fatcat:ybn56lstgbdpvbd6nlp73qfh4m

Learning Statistical Texture for Semantic Segmentation [article]

Lanyun Zhu, Deyi Ji, Shiping Zhu, Weihao Gan, Wei Wu, Junjie Yan
2021 arXiv   pre-print
Existing semantic segmentation works mainly focus on learning the contextual information in high-level semantic features with CNNs. In order to maintain a precise boundary, low-level texture features are directly skip-connected into the deeper layers. Nevertheless, texture features are not only about local structure, but also include global statistical knowledge of the input image. In this paper, we fully take advantages of the low-level texture features and propose a novel Statistical Texture
more » ... earning Network (STLNet) for semantic segmentation. For the first time, STLNet analyzes the distribution of low level information and efficiently utilizes them for the task. Specifically, a novel Quantization and Counting Operator (QCO) is designed to describe the texture information in a statistical manner. Based on QCO, two modules are introduced: (1) Texture Enhance Module (TEM), to capture texture-related information and enhance the texture details; (2) Pyramid Texture Feature Extraction Module (PTFEM), to effectively extract the statistical texture features from multiple scales. Through extensive experiments, we show that the proposed STLNet achieves state-of-the-art performance on three semantic segmentation benchmarks: Cityscapes, PASCAL Context and ADE20K.
arXiv:2103.04133v1 fatcat:75vtx6zblvghdjywbl3b7kshmm
« Previous Showing results 1 — 15 out of 2,142 results