Cross-Class Relevance Learning for Temporal Concept Localization
[article]
2019
arXiv
pre-print
We present a novel Cross-Class Relevance Learning approach for the task of temporal concept localization. ...
This facilitates learning of shared information between classes, and allows for arbitrary class-specific feature engineering. ...
CCRL is able to correctly identify all relevant segments.
Conclusion: We propose a cross-class relevance learning approach for temporal concept localization. ...
arXiv:1911.08548v1
fatcat:dot2ofwk5bhpvpftdwiimsimme
Transferring Cross-domain Knowledge for Video Sign Language Recognition
[article]
2020
arXiv
pre-print
We also demonstrate the effectiveness of our method on automatically localizing signs from sign news, achieving 28.1 for AP@0.5. ...
In order to learn domain-invariant features within each class and suppress domain-specific features, our method further resorts to an external memory to store the class centroids of the aligned news signs ...
We thank all anonymous reviewers and ACs for their constructive comments. ...
arXiv:2003.03703v2
fatcat:5o7q6fd6pbcufbmzad5q4in2iy
Cross-Modal Graph with Meta Concepts for Video Captioning
[article]
2021
arXiv
pre-print
We further build meta concept graphs dynamically with the learned cross-modal meta concepts. ...
named cross-modal meta concepts. ...
Specifically, we use a weakly-supervised learning approach to localize the attended visual regions and their semantic classes for objects shown in captions, in an attempt to cover some undefined classes ...
arXiv:2108.06458v2
fatcat:ud6awh36iba67gacokl25uccim
A multimodal tensor-based late fusion approach for satellite image search in Sentinel 2 images
2021
Zenodo
We first create a K-order tensor from the results of separate searches by visual features, concepts, spatial and temporal information. ...
The multimodal character of EO Big Data requires effective combination of multiple modalities for similarity search. ...
Deep learning [7] makes use of deep auto-encoders to learn features from different modalities in the task of cross-modal retrieval. ...
doi:10.5281/zenodo.4293265
fatcat:lqibjpr27zbyljyz3um53xciny
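The late-fusion idea in the entry above can be sketched in plain Python. The paper builds a K-order tensor from per-modality result lists; the rank-averaging below is a simplified stand-in, and all item ids, modality names, and weights are illustrative assumptions:

```python
# Hypothetical sketch of multimodal late fusion for satellite image
# search: fuse per-modality ranked result lists into one ranking by
# weighted average rank (a simplification of the paper's K-order
# tensor construction; names and data are toy assumptions).

def late_fusion(rankings, weights=None):
    """rankings: dict modality -> list of item ids, best first.
    Returns item ids sorted by weighted average rank (lower is better)."""
    if weights is None:
        weights = {m: 1.0 for m in rankings}
    items = set()
    for lst in rankings.values():
        items.update(lst)
    worst = max(len(lst) for lst in rankings.values())
    scores = {}
    for item in items:
        total, wsum = 0.0, 0.0
        for m, lst in rankings.items():
            # Items missing from a modality's list get the worst rank.
            rank = lst.index(item) if item in lst else worst
            total += weights[m] * rank
            wsum += weights[m]
        scores[item] = total / wsum
    return sorted(items, key=lambda i: scores[i])

fused = late_fusion({
    "visual":   ["img3", "img1", "img2"],
    "concepts": ["img1", "img3", "img2"],
    "temporal": ["img1", "img2", "img3"],
})
# fused == ["img1", "img3", "img2"]
```

Per-modality weights can be tuned to reflect how informative each modality is for a given query; with uniform weights this reduces to average-rank fusion.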
Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions
[article]
2018
arXiv
pre-print
We propose a novel attentive sequence to sequence translator (ASST) for clip localization in videos by natural language descriptions. We make two contributions. ...
The 2D representation not only preserves the temporal dependencies of frames but also provides an effective way to perform frame-level video-language matching. ...
Ablation studies: We then perform ablation studies from the following three aspects: input visual modality, the importance of cross-modal local relevance and temporal endpoint feature. ...
arXiv:1808.08803v1
fatcat:rfxl44c5e5e7be3t4b3wh3d6ra
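The 2D frame-by-word representation described in the entry above can be illustrated with a toy score map. The embeddings and plain dot-product similarity here are assumptions; ASST itself learns these cross-modal interactions with attention:

```python
# Toy sketch of frame-level video-language matching: a 2D map whose
# entry (t, w) scores how well frame t matches word w. Embeddings and
# dot-product similarity are illustrative assumptions, not the
# paper's learned attentive translator.

def match_map(frame_feats, word_feats):
    """frame_feats: T x D, word_feats: W x D (lists of lists).
    Returns a T x W matrix of dot-product similarities."""
    return [[sum(f * w for f, w in zip(fv, wv)) for wv in word_feats]
            for fv in frame_feats]

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 frame embeddings
words  = [[1.0, 0.0], [0.5, 0.5]]               # 2 word embeddings
m = match_map(frames, words)
# m == [[1.0, 0.5], [0.0, 0.5], [1.0, 1.0]]
```

Such a map preserves temporal order along one axis and word order along the other, which is what makes frame-level video-language matching possible.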
Experimenting with musically motivated convolutional neural networks
2016
2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI)
In this article we explore various architectural choices of relevance for music signals classification tasks in order to start understanding what the chosen networks are learning. ...
The classes in this dataset have a strong correlation with tempo, which allows assessing whether the proposed architectures are learning frequency and/or time dependencies. ...
We would like to thank Marius Miron for many helpful discussions. ...
doi:10.1109/cbmi.2016.7500246
dblp:conf/cbmi/PonsLS16
fatcat:yfnqfa6lpnefnp2fr7ad7ektkm
Future-Supervised Retrieval of Unseen Queries for Live Video
2017
Proceedings of the 2017 ACM on Multimedia Conference - MM '17
We introduce the use of future frame representations as a supervision signal for learning temporally aware semantic representations on unlabeled video data. ...
Its live nature means that video representations should be relevant to current content, and not necessarily to past content. ...
In a stream, it's necessary for our representation to be temporally local. ...
doi:10.1145/3123266.3123437
dblp:conf/mm/CappalloS17
fatcat:t3wgpjthpfhnnbq6eeu56flsyu
French prominence: A probabilistic framework
2008
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing
The proposed method for automatic prominence labelling is based on well-known machine learning techniques in a three-step procedure: i) a feature extraction step in which we propose a framework for systematic and multi-level speech acoustic feature extraction, ii) a feature selection step for identifying the most relevant prominence acoustic correlates, and iii) a modelling step in which a Gaussian mixture ...
We suggest heuristically defining different temporal horizons for the comparison of acoustic data relevant to prominence detection. ...
doi:10.1109/icassp.2008.4518529
dblp:conf/icassp/ObinRL08
fatcat:vc7iujkq3vgrlmzaerb65lc57u
Multi-Instance Multi-Label Action Recognition and Localization Based on Spatio-Temporal Pre-Trimming for Untrimmed Videos
2020
Proceedings of the AAAI Conference on Artificial Intelligence
After convergence, temporal localization is further achieved with local-global temporal class activation map. Extensive experiments are conducted on two benchmark datasets, i.e. ...
Weakly supervised action recognition and localization for untrimmed videos is a challenging problem with extensive applications. ...
To be specific, we propose the local-global T-CAM for temporal localization. ...
doi:10.1609/aaai.v34i07.6986
fatcat:bxywbawmkfetxjy2ji5ln6k55i
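The local-global T-CAM in the entry above builds on the basic temporal class activation map: with a linear classifier over per-snippet features, the class activation at time t is a dot product, and localization thresholds the resulting sequence. A minimal sketch (features, weights, and the threshold are toy assumptions; the paper's local-global variant adds further machinery):

```python
# Minimal temporal class activation map (T-CAM) sketch: per-snippet
# activations for one class, then contiguous above-threshold runs as
# localized segments. All values are illustrative assumptions.

def temporal_cam(features, class_weights):
    """features: T x D (lists); class_weights: length-D list.
    Returns a length-T activation sequence for the class."""
    return [sum(f_d * w_d for f_d, w_d in zip(f, class_weights))
            for f in features]

def localize(tcam, threshold):
    """Return (start, end) index pairs of contiguous runs >= threshold."""
    segments, start = [], None
    for t, a in enumerate(tcam):
        if a >= threshold and start is None:
            start = t
        elif a < threshold and start is not None:
            segments.append((start, t - 1))
            start = None
    if start is not None:
        segments.append((start, len(tcam) - 1))
    return segments

tcam = temporal_cam([[1, 0], [2, 1], [0, 1], [3, 0]], [1.0, 0.5])
# tcam == [1.0, 2.5, 0.5, 3.0]
segs = localize(tcam, 1.0)
# segs == [(0, 1), (3, 3)]
```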
End-to-End Deep Learning Approach for Perfusion Data: A Proof-of-Concept Study to Classify Core Volume in Stroke CT
2022
Diagnostics
External independent validation resulted in an ensembled mean ROC-AUC of 0.61. (4) Conclusions: In this proof-of-concept study, the proposed end-to-end deep learning approach bypasses conventional perfusion ...
Further studies can easily extend to additional clinically relevant endpoints. ...
Median core volume for the training cohort was 32.4 mL, which by definition yields a balanced class split. ...
doi:10.3390/diagnostics12051142
fatcat:ngksvtcz7vbtllyskbtjyr7nui
Skeleton based Activity Recognition by Fusing Part-wise Spatio-temporal and Attention Driven Residues
[article]
2019
arXiv
pre-print
To extract and learn salient features for action recognition, attention driven residues are used which enhance the performance of residual components for effective 3D skeleton-based Spatio-temporal action ...
The RIAFNet architecture is greatly inspired by the InceptionV4 architecture, which unified the ResNet and Inception based spatio-temporal feature representation concept and achieved the highest top-1 ...
Skeletons were split into anatomically relevant parts, which were fed into each independent subnet to extract local features. Shahroudy et al. ...
arXiv:1912.00576v1
fatcat:4pg77axdxbd43p6sa6lt6fmnoe
Weakly Supervised Temporal Adjacent Network for Language Grounding
[article]
2021
arXiv
pre-print
Specifically, WSTAN learns cross-modal semantic alignment by exploiting a temporal adjacent network in a multiple instance learning (MIL) paradigm, with a whole description paragraph as input. ...
In this task, it is critical to learn a strong cross-modal semantic alignment between sentence semantics and visual content. ...
Complementary Branch: Cross-modal semantic alignment learning is a straightforward idea to learn temporal relevance without explicit annotations. ...
arXiv:2106.16136v1
fatcat:i7nwetztpraf5bjwxfrn2gpa2a
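The MIL paradigm mentioned in the entry above can be illustrated with bag-level score aggregation: a video is a bag of temporal proposals, only the bag label (does the description match the video?) is supervised, and the bag score pools instance scores. A hedged sketch (the scores and top-k pooling choice are illustrative; WSTAN's scoring network and adjacent-proposal refinement are omitted):

```python
# Multiple instance learning (MIL) bag scoring sketch for weakly
# supervised grounding: pool per-proposal matching scores into one
# bag-level score, to which the weak supervision is applied.
# Scores and pooling choice are toy assumptions.

def bag_score(instance_scores, top_k=1):
    """Average the top-k instance scores (k=1 reduces to max pooling)."""
    top = sorted(instance_scores, reverse=True)[:top_k]
    return sum(top) / len(top)

# Proposal scores for one (video, sentence) pair; only the bag
# label is known, so the loss is computed on the bag score.
scores = [0.1, 0.8, 0.3, 0.7]
best = bag_score(scores)   # max pooling over proposals
```

Top-k averaging is a common middle ground between max pooling (sensitive to one outlier proposal) and mean pooling (diluted by background proposals).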
High-level event recognition in unconstrained videos
2012
International Journal of Multimedia Information Retrieval
Finally, we discuss promising directions for future research. ...
In this paper, we review current technologies for complex event recognition in unconstrained videos. ...
Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. ...
doi:10.1007/s13735-012-0024-2
fatcat:mfzttic3svb4tho2xb6aczgp4y
A Novel Dictionary Learning based Multiple Instance Learning Approach to Action Recognition from Videos
2017
Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods
In contrast, we propose a dictionary learning based strategy to MIL which first identifies class-specific discriminative codewords, and then projects the bag-level instances into a probabilistic embedding ...
only from negative classes. ...
We ran 500-fold cross-validation for KTH and 300-fold cross-validation for the Weizmann dataset. The number of local codewords is experimentally determined. ...
doi:10.5220/0006200205190526
dblp:conf/icpram/RoyBM17
fatcat:j7pvnjii6ndgjghycqlt7sqxfq
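The n-fold cross-validation protocol mentioned in the entry above partitions the data into n held-out folds, training on the rest each time. A minimal sketch with toy sizes (the 500-fold and 300-fold settings would use the same scheme):

```python
# Plain-Python k-fold cross-validation split: partition sample
# indices into k disjoint test folds and yield (train, test) pairs.
# Sample count and fold count here are toy values.

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Distribute any remainder across the first folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
# 5 folds, each holding out 2 samples; together they cover all 10.
```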
A Spatiotemporal Deep Learning Solution For Automatic Micro-Expressions Recognition From Local Facial Regions
2019
2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP)
The proposed solution applies on regions of interest instead of the whole face, and uses a combination of CNN and LSTM to extract the most relevant spatio-temporal features. ...
Deep Learning approach: Deep Learning (CNN + LSTM) was used by Kim et al. [11] to encode spatial and temporal characteristics. MicroExpSTCNN has been proposed by Reddy et al. [16]. ...
doi:10.1109/mlsp.2019.8918771
dblp:conf/mlsp/AouayebHKB19
fatcat:gawfbl6ibvclrf7tmydxqo6jyq