3,552 Hits in 9.0 sec

To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression [article]

Yitian Yuan, Tao Mei, Wenwu Zhu
2018 arXiv   pre-print
In this paper, we propose a novel Attention Based Location Regression (ABLR) approach to solve the temporal sentence localization from a global perspective.  ...  Finally, a novel attention based location regression network is designed to predict the temporal coordinates of sentence query from the previous attention.  ...  The authors would like to thank Dr. Ting Yao, Dr. Jun Xu, Linjun Zhou and Xumin Chen for their great supports and valuable suggestions on this work.  ... 
arXiv:1804.07014v4 fatcat:7ngjxiv3kfgzhfemkf2pxe3pzq

To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression

Yitian Yuan, Tao Mei, Wenwu Zhu
To address these issues, we propose a novel Attention Based Location Regression (ABLR) approach to localize sentence descriptions in videos in an efficient end-to-end manner.  ...  Finally, a novel attention based location prediction network is designed to regress the temporal coordinates of sentence from the previous attentions.  ...  The authors would like to thank Dr. Ting Yao, Dr. Jun Xu, Linjun Zhou and Xumin Chen for their great supports and valuable suggestions on this work.  ... 
doi:10.1609/aaai.v33i01.33019159 fatcat:t5ckqvm4kre5njg4m6m2uh5z7m
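The ABLR idea summarized in the snippets above — attend over clip features with the sentence query, then regress the segment's temporal coordinates directly from the attention — can be sketched minimally. This is an illustrative NumPy toy, not the authors' implementation; all shapes, weight matrices, and the bilinear attention form are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ablr_style_regression(clip_feats, query_feat, W_att, w_reg):
    """Toy attention-based location regression (illustrative only).

    clip_feats : (T, d) features for T video clips
    query_feat : (d,) sentence-query embedding
    W_att      : (d, d) bilinear attention weights (hypothetical form)
    w_reg      : (T, 2) weights regressing attention to (start, end)
    Returns normalized (start, end) coordinates in (0, 1).
    """
    # Attention of the sentence query over the T clips.
    scores = clip_feats @ W_att @ query_feat            # (T,)
    att = softmax(scores)                               # (T,), sums to 1
    # Regress temporal coordinates directly from the attention weights,
    # mirroring the paper's "location regression from attention" idea.
    start, end = 1.0 / (1.0 + np.exp(-(att @ w_reg)))   # sigmoid -> (0, 1)
    return start, end
```

Regressing from the attention distribution itself (rather than scoring sliding-window proposals) is what makes the approach end-to-end and proposal-free, per the abstract.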

Lifelike talking faces for interactive services

E. Cosatto, J. Ostermann, H.P. Graf, J. Schroeter
2003 Proceedings of the IEEE  
Using an RTP-based protocol, face animation can be driven with only 800 bits/s in addition to the rate for transmitting audio.  ...  The sample-based approach, on the other hand, concatenates segments of recorded videos, instead of trying to model the dynamics of the animations in detail.  ...  In order to find facial features with subpixel accuracy, we proceed in two steps, where first we locate several features only approximately and then zoom in to determine their locations more precisely.  ... 
doi:10.1109/jproc.2003.817141 fatcat:tentbcv2nndanotgumf57brnnu

Eye'm talking to you: speakers' gaze direction modulates co-speech gesture processing in the right MTG

Judith Holler, Idil Kokal, Ivan Toni, Peter Hagoort, Spencer D. Kelly, Aslı Özyürek
2014 Social Cognitive and Affective Neuroscience  
The comprehension of Speech&Gesture relative to SpeechOnly utterances recruited middle occipital, middle temporal and inferior frontal gyri, bilaterally.  ...  Such cues may modulate neural activity in regions associated either with the processing of ostensive cues, such as eye gaze, or with the processing of semantic information, provided by speech and gesture  ...  Fig. 3 Anatomical location of a cluster along the right middle temporal gyrus (in red, overlaid on a rendered brain) showing a significant differential response to Speech&Gesture (SG) utterances [as  ... 
doi:10.1093/scan/nsu047 pmid:24652857 pmcid:PMC4321622 fatcat:3m5nclxalbf3bo24nt3vrddoj4

Complex Communication Dynamics: Exploring the Structure of an Academic Talk

Camila Alviar, Rick Dale, Alexia Galati
2019 Cognitive Science  
These findings, although tentative, do suggest that the cognitive system is integrating body, slides, and speech in a coordinated manner during natural language use.  ...  Further research is needed to clarify the specific coordination patterns that occur between the different modalities.  ...  Acknowledgments This project was supported in part by the Graduate Dean's Recruitment Fellowship awarded by UC Merced to the first author.  ... 
doi:10.1111/cogs.12718 pmid:30900289 fatcat:z4uxqaaq4nek3jlesrterbdj5a

Self-supervised Learning for Semi-supervised Temporal Language Grounding [article]

Fan Luo, Shaoxiang Chen, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
2021 arXiv   pre-print
Given a text description, Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.  ...  Since manual annotations are expensive, to cope with limited annotations, we tackle TLG in a semi-supervised way by incorporating self-supervised learning, and propose Self-Supervised Semi-Supervised Temporal  ... 
arXiv:2109.11475v2 fatcat:2qmfaum4off4dmxzbvgpgj2hty

LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach [article]

Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Basura Fernando, Hiroya Takamura, Qi Wu
2021 arXiv   pre-print
our sampling technique is more effective than competing counterparts and that it consistently improves the performance of prior work, by up to 3.13% in the mean temporal IoU, ultimately leading to a  ...  LocFormer is designed for tasks where it is necessary to process the entire long video and at its core lie two main contributions.  ... 
arXiv:2112.10066v1 fatcat:xip6uokv7fhb3gtao5ncb5sv3y
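LocFormer's reported gain is in mean temporal IoU, the standard metric for moment localization. The metric itself is simple — overlap over union of two time intervals; a minimal sketch (function name mine):

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) pairs."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))   # overlap length
    union = max(pe, ge) - min(ps, gs)             # span of the two segments
    return inter / union if union > 0 else 0.0
```

When the segments are disjoint the intersection is zero, so the hull-length denominator is harmless; mean temporal IoU is then the average of this score over all query-prediction pairs.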

From image to language and back again

2018 Natural Language Engineering  
Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution  ...  In this article, we touch upon all of these topics as we review work involving images and text under the three main headings of image description (Section 2), visually grounded referring expression generation  ...  Acknowledgements This work was funded in part by NSF CAREER awards to DP and DB, an ONR YIP award to DP, ONR Grant N00014-14-1-0679 to DB, a Sloan Fellowship to DP, ARO YIP awards to DB and DP, an Allen  ... 
doi:10.1017/s1351324918000086 fatcat:fvxkgjlolra4vns2r5qx4xvg3i

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval [article]

Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal
2020 arXiv   pre-print
The dataset contains 109K queries collected on 21.8K videos from 6 TV shows of diverse genres, where each query is associated with a tight temporal window.  ...  The queries are also labeled with query types that indicate whether each of them is more related to video or subtitle or both, allowing for in-depth analysis of the dataset and the methods that built on  ... 
arXiv:2001.09099v2 fatcat:npokf5n7tbca7bf6a44shlnlim

ActivityNet Challenge 2017 Summary [article]

Bernard Ghanem, Juan Carlos Niebles, Cees Snoek, Fabian Caba Heilbron, Humam Alwassel, Ranjay Krishna, Victor Escorcia, Kenji Hata, Shyamal Buch
2017 arXiv   pre-print
We would like to thank the authors of the Kinetics dataset for their kind support; and Joao Carreira and Brian Zhang for helpful discussions.  ...  We also employ a location regression scheme similar to practices in [5] to further refine the temporal extent of positive proposals.  ...  In particular, we utilize KNN to find the visually similar video segments based on the extracted video representations.  ... 
arXiv:1710.08011v1 fatcat:bc5qhp2cungrdj4j3lebxeoane
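The last snippet mentions using KNN over extracted video representations to find visually similar segments. A minimal cosine-similarity version might look like the following — the function name and the choice of cosine similarity are my assumptions, not details from the challenge summary:

```python
import numpy as np

def knn_segments(query_feat, segment_feats, k=5):
    """Return indices of the k segments most similar to the query.

    query_feat    : (d,) feature vector of the query segment
    segment_feats : (N, d) feature vectors of candidate segments
    """
    # L2-normalize so the dot product equals cosine similarity.
    q = query_feat / np.linalg.norm(query_feat)
    S = segment_feats / np.linalg.norm(segment_feats, axis=1, keepdims=True)
    sims = S @ q                        # (N,) cosine similarities
    return np.argsort(-sims)[:k]        # top-k indices, most similar first
```

For large segment pools this brute-force scan would typically be replaced by an approximate index, but the retrieval logic is the same.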

Distinct roles of temporal and frontoparietal cortex in representing actions across vision and language

Moritz F. Wurm, Alfonso Caramazza
2019 Nature Communications  
Both temporal and frontoparietal brain areas are associated with the representation of knowledge about the world, in particular about actions.  ...  Here, we reveal distinct functional profiles of lateral temporal and frontoparietal cortex using fMRI-based MVPA.  ...  We thank Valentina Brentari for assistance in preparing the verbal stimulus material and with data acquisition.  ... 
doi:10.1038/s41467-018-08084-y pmid:30655531 pmcid:PMC6336825 fatcat:tiksjn7czjgqjndhwgbhj2u2e4

Deep Learning-based Automated Lip-Reading: A Survey

Souheil Fenghour, Daqing Chen, Kun Guo, Bo Li, Perry Xiao
2021 IEEE Access  
advantages of Attention-Transformers and Temporal Convolutional Networks to Recurrent Neural Networks for classification; 3) A comparison of different classification schemas used for lip-reading including  ...  A survey on automated lip-reading approaches is presented in this paper with the main focus being on deep learning related methodologies which have proven to be more fruitful for both feature extraction  ...  The LRS3-TED [67] dataset is another sentence-based dataset compiled in a similar fashion by extracting videos from TEDx talks, where 150,000 sentences were extracted from TED programs.  ... 
doi:10.1109/access.2021.3107946 fatcat:enjgwdrwzragredhck2xvozsdy

Markers of Topical Discourse in Child-Directed Speech

Hannah Rohde, Michael C. Frank
2014 Cognitive Science  
Our findings suggest that many cues used to signal topicality in adult discourse are also available in child-directed speech.  ...  Research in pragmatics, however, points to ways in which each subsequent utterance provides new opportunities for listeners to infer speaker meaning.  ...  Based on both of these metrics, it appears that children only gradually became engaged in the discourse, rather than shifting their attention immediately to the topic, and in these videos they did not  ... 
doi:10.1111/cogs.12121 pmid:24731080 fatcat:3dkh5ssvjrcclosehvk3auglse

Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision [article]

Andrew Shin, Masato Ishii, Takuya Narihira
2021 arXiv   pre-print
Its success also implies drastic changes in cross-modal tasks with language and vision, and many researchers have already tackled the issue.  ...  Furthermore, we discuss its current limitations and speculate upon some of the prospects that we find imminent.  ...  For example, the speaker may be talking about cars, where the video shows the speaker himself.  ... 
arXiv:2103.04037v2 fatcat:ws2djb722bat7nc53uodjqi7ki

A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer [article]

Vladimir Iashin, Esa Rahtu
2020 arXiv   pre-print
Dense video captioning aims to localize and describe important events in untrimmed videos.  ...  We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence  ...  Attention clusters: Purely attention based local feature integration for video classification.  ... 
arXiv:2005.08271v2 fatcat:6mnjiwvugrba5kq4tyyuvc5dga
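The bi-modal transformer's core idea — each modality attends to the other before fusion — can be sketched with single-head scaled dot-product attention. This is a bare illustration under my own simplifications (no learned projections, no multi-head split), not the paper's module:

```python
import numpy as np

def cross_attention(x, y):
    """Single-head scaled dot-product attention: x attends to y.

    x : (Tx, d) query sequence (e.g. visual features)
    y : (Ty, d) key/value sequence (e.g. audio features)
    Learned Q/K/V projections are omitted for brevity.
    """
    d = x.shape[1]
    scores = x @ y.T / np.sqrt(d)                          # (Tx, Ty)
    att = np.exp(scores - scores.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)                  # row-wise softmax
    return att @ y                                         # (Tx, d)

def bimodal_fusion(visual, audio):
    """Each modality attends to the other; both enriched streams returned."""
    v_att = cross_attention(visual, audio)   # visual enriched by audio
    a_att = cross_attention(audio, visual)   # audio enriched by visual
    return v_att, a_att
```

Because attention handles sequences of different lengths naturally, the same module can digest any two modalities, which is the flexibility the abstract highlights.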
Showing results 1 — 15 out of 3,552 results