2,787 Hits in 7.8 sec

You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation

Dezhuang Li, Ruoqi Li, Lijun Wang, Yifan Wang, Jinqing Qi, Lu Zhang, Ting Liu, Qingquan Xu, Huchuan Lu
2022 Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)
We present YOFO (You Only inFer Once), a new paradigm for referring video object segmentation (RVOS) that operates in a one-stage manner.  ...  regression labels for the meta-transfer module.  ...  We make the first attempt towards this goal by proposing YOFO (You Only inFer Once), a one-stage RVOS method.  ...
doi:10.1609/aaai.v36i2.20017 fatcat:iw762sycobcspkpe55fza4j64m

Multimodal One-shot Learning of Speech and Images

Ryan Eloff, Herman A. Engelbrecht, Herman Kamper
2019 ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
cross-modal matching from limited weakly supervised data.  ...  This model outperforms our other approaches on our most difficult benchmark with a cross-modal matching accuracy of 40.3% for 10-way 5-shot learning.  ...  We refer to this task as one-shot cross-modal matching.  ... 
doi:10.1109/icassp.2019.8683587 dblp:conf/icassp/EloffEK19 fatcat:47yfbmhsg5bbbdeiivcglj3vtu
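
As an aside on what "one-shot cross-modal matching" means operationally: given a spoken query, the model must pick the matching visual class from a small support set. A minimal Python sketch of this evaluation protocol follows, using random vectors as stand-ins for the speech and image embeddings; the embedding model and all names here are illustrative assumptions, not the paper's code.

    import numpy as np

    rng = np.random.default_rng(0)

    def cosine(a, b):
        # Cosine similarity between two embedding vectors.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    def episode_accuracy(n_way=10, k_shot=5, dim=64, n_episodes=1000):
        # One episode: average each class's k support image embeddings into a
        # prototype, then match a speech query to the most similar prototype.
        correct = 0
        for _ in range(n_episodes):
            prototypes = [rng.normal(size=(k_shot, dim)).mean(axis=0)
                          for _ in range(n_way)]
            target = rng.integers(n_way)
            # Stand-in query: a noisy copy of the target class prototype.
            query = prototypes[target] + 0.5 * rng.normal(size=dim)
            sims = [cosine(query, p) for p in prototypes]
            correct += int(np.argmax(sims) == target)
        return correct / n_episodes

    print(f"10-way 5-shot matching accuracy: {episode_accuracy():.3f}")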

Transformers in Vision: A Survey [article]

Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah
2021 arXiv   pre-print
We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution  ...  We would also like to thank Mohamed Afham for his help with a figure.  ...
arXiv:2101.01169v4 fatcat:ynsnfuuaize37jlvhsdki54cy4

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

Khaled Bayoudh, Raja Knani, Fayçal Hamdaoui, Abdellatif Mtibaa
2021 The Visual Computer  
integration and combination of heterogeneous visual cues across sensory modalities.  ...  Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content.  ...  To significantly increase the processing speed of object detection pipelines, Redmon et al. [25] implemented a one-stage lightweight detection strategy called YOLO (You Only Look Once).  ...
doi:10.1007/s00371-021-02166-7 pmid:34131356 pmcid:PMC8192112 fatcat:jojwyc6slnevzk7eaiutlmlgfe
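
For context on the "one-stage" detection idea mentioned in the snippet: YOLO predicts boxes and class scores for every cell of a fixed grid in a single forward pass, rather than first generating region proposals and then classifying them. Below is a toy PyTorch sketch of that grid-shaped output only; the layer sizes are illustrative assumptions, and this is not Redmon et al.'s actual architecture.

    import torch
    import torch.nn as nn

    class TinyOneStageHead(nn.Module):
        # Toy YOLO-style head: one forward pass yields an S x S grid where each
        # cell predicts B boxes (x, y, w, h, confidence) plus C class scores.
        def __init__(self, S=7, B=2, C=20):
            super().__init__()
            self.S, self.B, self.C = S, B, C
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.pool = nn.AdaptiveAvgPool2d((S, S))  # force an S x S grid
            self.pred = nn.Conv2d(32, B * 5 + C, kernel_size=1)

        def forward(self, x):
            x = self.pred(self.pool(self.backbone(x)))
            return x.permute(0, 2, 3, 1)              # (N, S, S, B*5 + C)

    out = TinyOneStageHead()(torch.randn(1, 3, 224, 224))
    print(out.shape)  # torch.Size([1, 7, 7, 30])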

Recent Advances in Vision-Based On-Road Behaviors Understanding: A Critical Survey

Rim Trabelsi, Redouane Khemmar, Benoit Decoux, Jean-Yves Ertaud, Rémi Butteau
2022 Sensors  
To push for a holistic understanding, we investigate the complementary relationships between different elementary tasks that we define as the main components of road behavior understanding to achieve  ...  For this, five related topics have been covered in this review: situational awareness, driver-road interaction, road scene understanding, trajectory forecasting, and driving activities and status  ...  For a similar objective, in [18] the crossing behavior, i.e., the pedestrians' intention to cross, is predicted.  ...
doi:10.3390/s22072654 pmid:35408269 pmcid:PMC9003377 fatcat:2vrmgz3b25eyxbijeurx5aijv4

Speech2Action: Cross-modal Supervision for Action Recognition [article]

Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman
2020 arXiv   pre-print
Using the predictions of this model, we obtain weak action labels for over 800K video clips.  ...  We then apply this model to the speech segments of a large unlabelled movie corpus (188M speech segments from 288K movies).  ...  We are grateful to Carl Vondrick for early discussions.  ... 
arXiv:2003.13594v1 fatcat:tlhposqu4zdfhj2g7msd2l42t4
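
The weak-labeling step the snippet describes, applying a trained text classifier to transcribed speech segments and keeping only confident predictions as action labels, is straightforward to sketch. The threshold value and the stub classifier below are placeholder assumptions, not the paper's actual model.

    from dataclasses import dataclass

    @dataclass
    class WeakLabel:
        clip_id: str
        action: str
        confidence: float

    def weak_label(speech_segments, classify, threshold=0.9):
        # classify(text) -> (action, confidence); a placeholder for the
        # speech-to-action text classifier trained on screenplays.
        labels = []
        for clip_id, text in speech_segments:
            action, conf = classify(text)
            if conf >= threshold:  # keep only confident predictions
                labels.append(WeakLabel(clip_id, action, conf))
        return labels

    # Toy usage with a stub classifier:
    stub = lambda text: ("drive", 0.95) if "car" in text else ("unknown", 0.2)
    print(weak_label([("c1", "get in the car"), ("c2", "hello")], stub))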

Meta-learning with implicit gradients in a few-shot setting for medical image segmentation [article]

Rabindra Khadga, Debesh Jha, Steven Hicks, Vajira Thambawita, Michael A. Riegler, Sharib Ali, Pål Halvorsen
2022 arXiv   pre-print
To this end, we propose to exploit an optimization-based implicit model agnostic meta-learning (iMAML) algorithm under few-shot settings for medical image segmentation.  ...  To our knowledge, this is the first work that exploits iMAML for medical image segmentation and explores the strength of the model on scenarios such as meta-training on unique and mixed instances of lesion  ...  For the experiments, two skin datasets, two standard colonoscopy datasets, and a video capsule endoscopy dataset (a different modality) were used.  ...
arXiv:2106.03223v2 fatcat:tl5tx5dr4batrfedsj4ou3arma
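
The implicit gradient at the heart of iMAML (Rajeswaran et al., 2019) avoids backpropagating through the inner-loop optimization: with inner regularization strength lambda, the meta-gradient solves (I + Hessian(L_train)/lambda) v = grad(L_test) at the adapted parameters. A toy NumPy sketch on a quadratic inner problem, where the Hessian is known in closed form, follows; in practice the linear system is solved approximately with conjugate gradient and Hessian-vector products, and all names here are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    d, lam = 5, 1.0                 # parameter dim, regularization strength

    # Quadratic inner loss L_train(w) = 0.5 w^T A w - b^T w, so its Hessian is A.
    M = rng.normal(size=(d, d))
    A = M @ M.T + np.eye(d)         # symmetric positive definite Hessian
    g_test = rng.normal(size=d)     # gradient of the outer (test) loss
                                    # at the adapted parameters

    # iMAML implicit meta-gradient: (I + A / lam)^{-1} g_test. With deep nets,
    # replace the explicit solve by conjugate gradient using Hessian-vector
    # products from autograd.
    meta_grad = np.linalg.solve(np.eye(d) + A / lam, g_test)
    print(meta_grad)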

Eye-2-I: Eye-tracking for just-in-time implicit user profiling [article]

Keng-Teck Ma, Qianli Xu, Liyuan Li, Terence Sim, Mohan Kankanhalli, Rosary Lim
2016 arXiv   pre-print
For many applications, such as targeted advertising and content recommendation, knowing users' traits and interests is a prerequisite. User profiling is a helpful approach for this purpose.  ...  We propose a novel just-in-time implicit profiling method, Eye-2-I, which learns the user's interests, demographic and personality traits from the eye-tracking data while the user is watching videos.  ...  Leave-one-out cross validation was used to evaluate the meta-classifiers, i.e. a single subject is left out of the training set in each round.  ... 
arXiv:1507.04441v2 fatcat:xscco244mfcvxlpjfoajdnuzei
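
Leaving one subject out per round, as the snippet describes, corresponds to grouped cross-validation rather than plain leave-one-sample-out. A small sketch with scikit-learn's LeaveOneGroupOut on synthetic data follows; the classifier and data shapes are placeholders, not the paper's meta-classifiers.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneGroupOut

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))          # 60 samples, 5 gaze features
    y = rng.integers(0, 2, size=60)       # binary trait label
    groups = np.repeat(np.arange(10), 6)  # 10 subjects, 6 samples each

    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        # Train on 9 subjects, test on the single held-out subject.
        clf = LogisticRegression().fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    print(f"mean leave-one-subject-out accuracy: {np.mean(scores):.3f}")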

Deep Learning Serves Traffic Safety Analysis: A Forward-looking Review [article]

Abolfazl Razi, Xiwen Chen, Huayu Li, Hao Wang, Brendan Russo, Yan Chen, Hongbin Yu
2022 arXiv   pre-print
This processing framework includes several steps, including video enhancement, video stabilization, semantic and incident segmentation, object detection and classification, trajectory extraction, speed  ...  This paper explores Deep Learning (DL) methods that are used or have the potential to be used for traffic video analysis, emphasizing driving safety for both Autonomous Vehicles (AVs) and human-operated  ...  Junsuo Qu and Greg Leeming for their insightful comments.  ...
arXiv:2203.10939v2 fatcat:oml733wvjfh3blne4h7kg5y3du

2021 Index IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 43

2022 IEEE Transactions on Pattern Analysis and Machine Intelligence  
Note that the item title is found only under the primary entry in the Author Index.  ...  The Author Index contains the primary entry for each item, listed under the first author's name.  ...  Liu, H., +, TPAMI May 2021 1791-1807 You Only Search Once: Single Shot Neural Architecture Search via Direct Sparse Optimization.  ... 
doi:10.1109/tpami.2021.3126216 fatcat:h6bdbf2tdngefjgj76cudpoyia

Learning To Recognize Procedural Activities with Distant Supervision [article]

Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani
2022 arXiv   pre-print
of the steps needed for the execution of a wide variety of complex activities.  ...  This problem is dramatically different from traditional action classification, where models are typically optimized on videos that span only a few seconds and that are manually trimmed to contain simple  ...  Acknowledgments Thanks to Karl Ridgeway, Michael Iuzzolino, Jue Wang, Noureldien Hussein, and Effrosyni Mavroudi for valuable discussions.  ... 
arXiv:2201.10990v3 fatcat:ghjybqtitjf5thlsgoknmpgf2a

Front Matter: Volume 10646

Ivan Kadar
2018 Signal Processing, Sensor/Information Fusion, and Target Recognition XXVII  
The publisher is not responsible for the validity of the information or for any outcomes resulting from reliance thereon.  ...  Please use the following format to cite material from these proceedings  ...  Publication of record for individual papers is online in the SPIE Digital Library.  ...  "Learning Rich Features from RGB-D Images for Object Detection and Segmentation", ECCV 2014.  ...
doi:10.1117/12.2500434 fatcat:wfvvakrbsrfrbiiglzdvnp34o4

Empowering Things with Intelligence: A Survey of the Progress, Challenges, and Opportunities in Artificial Intelligence of Things [article]

Jing Zhang, Dacheng Tao
2020 arXiv   pre-print
Then, we present progress in AI research for IoT from four perspectives: perceiving, learning, reasoning, and behaving.  ...  In the Internet of Things (IoT) era, billions of sensors and devices collect and process data from the environment, transmit them to cloud centers, and receive feedback via the internet for connectivity  ...  Image/video captioning and text-to-image generation are two generative tasks related to cross-modal matching, where captioning refers to generating a piece of text description for a given image or video  ... 
arXiv:2011.08612v1 fatcat:dflut2wdrjb4xojll34c7daol4

A Metaverse: taxonomy, components, applications, and open challenges

Sang-Min Park, Young-Gab Kim
2022 IEEE Access  
Finally, we summarize the limitations and directions for implementing the immersive Metaverse as social influences, constraints, and open challenges.  ...  The integration of enhanced social activities and neural-net methods requires a new definition of Metaverse suitable for the present, different from the previous Metaverse.  ...  Anaphora and co-reference resolution are used to infer cross-references in questions and conversations [164, 165].  ...
doi:10.1109/access.2021.3140175 fatcat:fnraeaz74vh33knfvhzrynesli

ForgeryNet: A Versatile Benchmark for Comprehensive Forgery Analysis [article]

Yinan He, Bei Gan, Siyu Chen, Yichun Zhou, Guojun Yin, Luchuan Song, Lu Sheng, Jing Shao, Ziwei Liu
2021 arXiv   pre-print
However, existing face forgery datasets either have limited diversity or only support coarse-grained analysis.  ...  the manipulated area of fake images compared to their corresponding source real images. 3) Video Forgery Classification, which re-defines the video-level forgery classification with manipulated frames  ...  For classification methods, we use the default cross-entropy loss for training. As for localization methods, we also add a segmentation loss in addition to the classification loss.  ... 
arXiv:2103.05630v2 fatcat:b6gz6ugdmjhfnfawno2d6ummxa
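
The training recipe in the snippet, cross-entropy for image-level forgery classification plus an added segmentation loss for localizing the manipulated area, composes two standard PyTorch losses. A minimal sketch follows; the loss weight alpha and all tensor shapes are assumptions, since the snippet does not specify them.

    import torch
    import torch.nn.functional as F

    N, H, W, alpha = 4, 64, 64, 1.0  # batch size, mask size, loss weight

    cls_logits = torch.randn(N, 2, requires_grad=True)        # real vs. fake
    cls_target = torch.randint(0, 2, (N,))                    # image-level labels
    seg_logits = torch.randn(N, 1, H, W, requires_grad=True)  # per-pixel logits
    seg_target = torch.randint(0, 2, (N, 1, H, W)).float()    # manipulated mask

    # Classification loss plus an added segmentation loss for localization.
    loss = (F.cross_entropy(cls_logits, cls_target)
            + alpha * F.binary_cross_entropy_with_logits(seg_logits, seg_target))
    loss.backward()  # in practice both heads would share a backbone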
Showing results 1 — 15 out of 2,787 results