Cross-Class Relevance Learning for Temporal Concept Localization [article]

Junwei Ma, Satya Krishna Gorti, Maksims Volkovs, Ilya Stanevich, Guangwei Yu
2019 arXiv   pre-print
We present a novel Cross-Class Relevance Learning approach for the task of temporal concept localization.  ...  This facilitates learning of shared information between classes, and allows for arbitrary class-specific feature engineering.  ...  CCRL is able to correctly identify all relevant segments. Results Conclusion We propose cross-class relevance learning approach for temporal concept localization.  ... 
arXiv:1911.08548v1 fatcat:dot2ofwk5bhpvpftdwiimsimme

Transferring Cross-domain Knowledge for Video Sign Language Recognition [article]

Dongxu Li, Xin Yu, Chenchen Xu, Lars Petersson, Hongdong Li
2020 arXiv   pre-print
We also demonstrate the effectiveness of our method on automatically localizing signs from sign news, achieving 28.1 for AP@0.5.  ...  In order to learn domain-invariant features within each class and suppress domain-specific features, our method further resorts to an external memory to store the class centroids of the aligned news signs  ...  We thank all anonymous reviewers and ACs for their constructive comments.  ... 
arXiv:2003.03703v2 fatcat:5o7q6fd6pbcufbmzad5q4in2iy

Cross-Modal Graph with Meta Concepts for Video Captioning [article]

Hao Wang, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao
2021 arXiv   pre-print
We further build meta concept graphs dynamically with the learned cross-modal meta concepts.  ...  named cross-modal meta concepts.  ...  Specifically, we use a weakly-supervised learning approach to localize the attended visual regions and their semantic classes for objects shown in captions, in an attempt to cover some undefined classes  ... 
arXiv:2108.06458v2 fatcat:ud6awh36iba67gacokl25uccim

A multimodal tensor-based late fusion approach for satellite image search in Sentinel 2 images

Ilias Gialampoukidis, Anastasia Moumtzidou, Marios Bakratsas, Stefanos Vrochidis, Ioannis Kompatsiaris
2021 Zenodo  
We first create a K-order tensor from the results of separate searches by visual features, concepts, spatial and temporal information.  ...  The multimodal character of EO Big Data requires effective combination of multiple modalities for similarity search.  ...  Deep learning [7] makes use of deep auto-encoders to learn features from different modalities in the task of cross-modal retrieval.  ... 
doi:10.5281/zenodo.4293265 fatcat:lqibjpr27zbyljyz3um53xciny

Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions [article]

Ke Ning, Linchao Zhu, Ming Cai, Yi Yang, Di Xie, Fei Wu
2018 arXiv   pre-print
We propose a novel attentive sequence to sequence translator (ASST) for clip localization in videos by natural language descriptions. We make two contributions.  ...  The 2D representation not only preserves the temporal dependencies of frames but also provides an effective way to perform frame-level video-language matching.  ...  Ablation studies We then perform ablation studies from the following three aspects: input visual modality, the importance of cross-modal local relevance and temporal endpoint feature.  ... 
arXiv:1808.08803v1 fatcat:rfxl44c5e5e7be3t4b3wh3d6ra

Experimenting with musically motivated convolutional neural networks

Jordi Pons, Thomas Lidy, Xavier Serra
2016 2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI)  
In this article we explore various architectural choices of relevance for music signals classification tasks in order to start understanding what the chosen networks are learning.  ...  The classes in this dataset have a strong correlation with tempo, what allows assessing if the proposed architectures are learning frequency and/or time dependencies.  ...  We would like to thank Marius Miron for many helpful discussions.  ... 
doi:10.1109/cbmi.2016.7500246 dblp:conf/cbmi/PonsLS16 fatcat:yfnqfa6lpnefnp2fr7ad7ektkm

Future-Supervised Retrieval of Unseen Queries for Live Video

Spencer Cappallo, Cees G.M. Snoek
2017 Proceedings of the 2017 ACM on Multimedia Conference - MM '17  
We introduce the use of future frame representations as a supervision signal for learning temporally aware semantic representations on unlabeled video data.  ...  Its live nature means that video representations should be relevant to current content, and not necessarily to past content.  ...  In a stream, it's necessary for our representation to be temporally local.  ... 
doi:10.1145/3123266.3123437 dblp:conf/mm/CappalloS17 fatcat:t3wgpjthpfhnnbq6eeu56flsyu

French prominence: A probabilistic framework

Nicolas Obin, Xavier Rodet, Anne Lacheret-Dujour
2008 Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing  
The proposed method for automatic prominence labelling is based on well-known machine learning techniques in a three step procedure: i) a feature extraction step in which we propose a framework for systematic  ...  and multi-level speech acoustic feature extraction, ii) a feature selection step for identifying the more relevant prominence acoustic correlates, and iii) a modelling step in which a gaussian mixture  ...  We suggest to heuristically define different temporal horizons for the comparison of acoustic data relevant for prominence detection.  ... 
doi:10.1109/icassp.2008.4518529 dblp:conf/icassp/ObinRL08 fatcat:vc7iujkq3vgrlmzaerb65lc57u

Multi-Instance Multi-Label Action Recognition and Localization Based on Spatio-Temporal Pre-Trimming for Untrimmed Videos

Xiao-Yu Zhang, Haichao Shi, Changsheng Li, Peng Li
2020 PROCEEDINGS OF THE THIRTIETH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE TWENTY-EIGHTH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE  
After convergence, temporal localization is further achieved with local-global temporal class activation map. Extensive experiments are conducted on two benchmark datasets, i.e.  ...  Weakly supervised action recognition and localization for untrimmed videos is a challenging problem with extensive applications.  ...  To be specific, we propose the local-global T-CAM for temporal localization.  ... 
doi:10.1609/aaai.v34i07.6986 fatcat:bxywbawmkfetxjy2ji5ln6k55i

End-to-End Deep Learning Approach for Perfusion Data: A Proof-of-Concept Study to Classify Core Volume in Stroke CT

Andreas Mittermeier, Paul Reidler, Matthias P. Fabritius, Balthasar Schachtner, Philipp Wesp, Birgit Ertl-Wagner, Olaf Dietrich, Jens Ricke, Lars Kellert, Steffen Tiedt, Wolfgang G. Kunz, Michael Ingrisch
2022 Diagnostics  
External independent validation resulted in an ensembled mean ROC-AUC of 0.61. (4) Conclusions: In this proof-of-concept study, the proposed end-to-end deep learning approach bypasses conventional perfusion  ...  Further studies can easily extend to additional clinically relevant endpoints.  ...  Median core volume for the training cohort was 32.4 mL which yields, per definition, a balanced class split.  ... 
doi:10.3390/diagnostics12051142 fatcat:ngksvtcz7vbtllyskbtjyr7nui

Skeleton based Activity Recognition by Fusing Part-wise Spatio-temporal and Attention Driven Residues [article]

Chhavi Dhiman, Dinesh Kumar Vishwakarma, Paras Aggarwal
2019 arXiv   pre-print
To extract and learn salient features for action recognition, attention driven residues are used which enhance the performance of residual components for effective 3D skeleton-based Spatio-temporal action  ...  The RIAFNet architecture is greatly inspired by the InceptionV4 architecture which unified the ResNet and Inception based Spatio-temporal feature representation concept and achieving the highest top-1  ...  Skeletons were split into anatomically relevant parts, which were fed into each independent subnet to extract local features. Shahroudy et al.  ... 
arXiv:1912.00576v1 fatcat:4pg77axdxbd43p6sa6lt6fmnoe

Weakly Supervised Temporal Adjacent Network for Language Grounding [article]

Yuechen Wang, Jiajun Deng, Wengang Zhou, Houqiang Li
2021 arXiv   pre-print
Specifically, WSTAN learns cross-modal semantic alignment by exploiting temporal adjacent network in a multiple instance learning (MIL) paradigm, with a whole description paragraph as input.  ...  In this task, it is critical to learn a strong cross-modal semantic alignment between sentence semantics and visual content.  ...  Complementary Branch Cross-modal semantic alignment learning is a straightforward idea to learn temporal relevance without explicit annotations.  ... 
arXiv:2106.16136v1 fatcat:i7nwetztpraf5bjwxfrn2gpa2a

High-level event recognition in unconstrained videos

Yu-Gang Jiang, Subhabrata Bhattacharya, Shih-Fu Chang, Mubarak Shah
2012 International Journal of Multimedia Information Retrieval  
Finally, we discuss promising directions for future research.  ...  In this paper, we review current technologies for complex event recognition in unconstrained videos.  ...  Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.  ... 
doi:10.1007/s13735-012-0024-2 fatcat:mfzttic3svb4tho2xb6aczgp4y

A Novel Dictionary Learning based Multiple Instance Learning Approach to Action Recognition from Videos

Abhinaba Roy, Biplab Banerjee, Vittorio Murino
2017 Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods  
In contrast, we propose a dictionary learning based strategy to MIL which first identifies class-specific discriminative codewords, and then projects the bag-level instances into a probabilistic embedding  ...  only from negative classes.  ...  We ran 500 fold cross validation for KTH and 300 fold cross validation for Weizmann dataset. Number of local codewords are experimentally determined.  ... 
doi:10.5220/0006200205190526 dblp:conf/icpram/RoyBM17 fatcat:j7pvnjii6ndgjghycqlt7sqxfq

A Spatiotemporal Deep Learning Solution For Automatic Micro-Expressions Recognition From Local Facial Regions

Mouath Aouayeb, Wassim Hamidouche, Kidiyo Kpalma, Amel Benazza-Benyahia
2019 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP)  
The proposed solution applies on regions of interest instead of the whole face, and uses a combination of CNN and LSTM to extract the most relevant spatio-temporal features.  ...  Deep Learning approach Deep Learning (CNN + LSTM) was used by Kim et al. [11] to encode spatial and temporal characteristics. MicroExpSTCNN has been proposed by Reddy et al. [16].  ... 
doi:10.1109/mlsp.2019.8918771 dblp:conf/mlsp/AouayebHKB19 fatcat:gawfbl6ibvclrf7tmydxqo6jyq