3,778 Hits in 6.7 sec

MPEG-4: Audio/video and synthetic graphics/audio for mixed media

Peter K. Doenges, Tolga K. Capin, Fabio Lavagetto, Joern Ostermann, Igor S. Pandzic, Eric D. Petajan
1997 Signal Processing: Image Communication
MPEG-4 addresses coding of digital hybrids of natural and synthetic, aural and visual (A/V) information.  ...  Integrated spatial-temporal coding is sought for audio, video, and 2D/3D computer graphics as standardized A/V objects.  ...  Artificial textures can be described by a function generating the values of the texture map or by an image.  ... 
doi:10.1016/s0923-5965(97)00007-6 fatcat:miw7fareavhbjkjfl7ntjgmoqa
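The snippet's point that an artificial texture can be described by a generating function rather than a stored image can be illustrated with a minimal procedural texture. A checkerboard is used here only as an example; the function and resolution are illustrative, not anything specified by the standard:

```python
def checker(u, v, tiles=8):
    """Procedural texture: the map value at (u, v) is computed by a
    function instead of being looked up in a stored image."""
    return ((int(u * tiles) + int(v * tiles)) % 2) * 255

# Sample the function on a 64x64 grid to materialize a texture map.
texture = [[checker(x / 64, y / 64) for x in range(64)] for y in range(64)]
```

The same idea extends to noise- or fractal-based generators; only the function changes, not the sampling loop.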


Huijie Lin, Jia Jia, Hanyu Liao, Lianhong Cai
2013 Proceedings of the 21st ACM international conference on Multimedia - MM '13  
the face alignment algorithm; 2) emotional audio-visual speech synchronization algorithm based on DBN.  ...  Given user-input greeting text and a facial image, WeCard intelligently and automatically generates personalized speech with expressive, lip-motion-synchronized facial animation.  ...  For audio-visual synchronization, the inputs of the DBN-based AVCM are: 1) acoustic features extracted from emotional speech, and 2) the FAPs generated by merging the facial expression with the viseme.  ... 
doi:10.1145/2502081.2502278 dblp:conf/mm/LinJLC13 fatcat:k3f65jh5qbghlm6cziu4xbszgq

SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning [article]

Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip W Robinson, Kristen Grauman
2022 arXiv   pre-print
Together with existing 3D visual assets, it supports an array of audio-visual research tasks, such as audio-visual navigation, mapping, source localization and separation, and acoustic matching.  ...  Given a 3D mesh of a real-world environment, SoundSpaces can generate highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations.  ...  Another line of research facilitated by simulation is visual-acoustic learning [65, 16, 13, 44] , where the goal is to either match or remove the room acoustics implied by the image.  ... 
arXiv:2206.08312v1 fatcat:vnp5x42covcv3hix3ug7zd2clm
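SoundSpaces simulates a room impulse response (RIR) from the 3D mesh and renders audio at a microphone position by convolving the dry source signal with that RIR. The simulation itself is out of scope here, but the final rendering step reduces to discrete convolution, sketched below with a toy 3-tap RIR (direct path plus two reflections):

```python
def convolve(signal, rir):
    """Discrete convolution of a dry signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out

dry = [1.0, 0.0, 0.0]        # unit impulse as the "source audio"
rir = [0.6, 0.3, 0.1]        # toy impulse response (illustrative values)
wet = convolve(dry, rir)     # a unit impulse reproduces the RIR itself
```

Real RIRs run to tens of thousands of taps, so practical systems use FFT-based convolution; the semantics are identical.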

Visual Object Detector for Cow Sound Event Detection

Yagya Raj Pandeya, Bhuwan Bhattarai, Joonwhoan Lee
2020 IEEE Access  
by treating acoustic signals as RGB images [12].  ...  We first applied a conventional CNN structure with certain improvements and then proceeded to two-stage visual object detection for audio by treating acoustic signals as RGB images.  ...  AUDIO EVENT DETECTION IN SPECTROGRAM This experiment used a visual object detector for audio in a time-frequency representation.  ... 
doi:10.1109/access.2020.3022058 fatcat:pwmpbnoumbgqfbvomqvlilpfna
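Treating an acoustic signal as an RGB image, as in the snippet, amounts to computing a time-frequency representation and presenting it in three channels to an image-based detector. A minimal sketch using a naive framewise DFT (the frame size, hop, and channel-replication scheme are assumptions for illustration; the paper's exact front end may differ):

```python
import cmath, math

def stft_mag(x, n_fft=8, hop=4):
    """Magnitude spectrogram via a naive DFT over overlapping frames."""
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        frame = x[start:start + n_fft]
        mags = []
        for k in range(n_fft // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                      for n in range(n_fft))
            mags.append(abs(acc))
        frames.append(mags)
    return frames  # time x frequency

def to_rgb(spec):
    """Replicate the single-channel spectrogram into three identical
    channels so an RGB-input detector can consume it unchanged."""
    return [[[v, v, v] for v in row] for row in spec]

tone = [math.sin(2 * math.pi * 0.25 * n) for n in range(16)]  # freq = fs/4
spec = stft_mag(tone)
img = to_rgb(spec)
```

The sinusoid at fs/4 concentrates its energy in bin 2 of an 8-point DFT, which is what a detector's bounding box would localize in the image.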

Visual Acoustic Matching [article]

Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman
2022 arXiv   pre-print
Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials  ...  To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output.  ...  Acknowledgements UT Austin is supported in part by a gift from Google and the IFML NSF AI Institute.  ... 
arXiv:2202.06875v2 fatcat:qt6he2ckazgaxp6chln2m73kwa
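The audio-visual attention the snippet describes, where visual properties are injected into the audio stream, is cross-attention: audio tokens act as queries over visual tokens as keys and values. A single-head, projection-free sketch (the real model learns query/key/value projections and stacks many such layers; this only shows the attention mechanics):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(audio_q, visual_kv):
    """Each audio query vector attends over visual feature vectors,
    mixing visual properties into the audio representation."""
    d = len(audio_q[0])
    out = []
    for q in audio_q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in visual_kv]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, visual_kv))
                    for j in range(d)])
    return out

audio_q = [[1.0, 0.0]]                  # one audio time step (query)
visual_kv = [[10.0, 0.0], [0.0, 10.0]]  # two visual patches (keys = values)
fused = cross_attention(audio_q, visual_kv)
```

The query aligned with the first visual patch pulls in almost all of that patch's features, which is the "injection" behavior the abstract refers to.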

Cyberspatial audio technology

Michael Cohen, Jens Herder, William L. Martens
1999 Journal of the Acoustical Society of Japan (E)  
, or synesthetically generated cues, like an infrared meter displayed as an appropriately localized audio alarm.  ...  Binaural cues can be generated by stereo microphones.  ...  Besides the interest in spatial audio manifested by this paper, Cohen has research interests in telecommunication semiotics and hypermedia; Herder has interests in computer graphics, software engineering  ... 
doi:10.1250/ast.20.389 fatcat:37wpewb45jgl3a3xlr6tfn47ae
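The binaural cues mentioned in the snippet can also be synthesized rather than recorded: the dominant lateralization cue is the interaural time difference (ITD). A crude sketch using Woodworth's spherical-head approximation for the ITD and a pure sample delay for the far ear (real cyberspatial audio systems use full HRTF filtering; head radius and sample rate below are illustrative defaults):

```python
import math

def spatialize(mono, azimuth_deg, fs=8000, head_radius=0.0875, c=343.0):
    """Apply an interaural time difference by delaying the far ear.
    ITD = (a/c) * (theta + sin(theta)), Woodworth's approximation."""
    az = math.radians(azimuth_deg)
    itd = head_radius / c * (az + math.sin(az))   # seconds
    delay = round(abs(itd) * fs)                  # whole samples
    delayed = [0.0] * delay + mono[:len(mono) - delay] if delay else list(mono)
    # Positive azimuth = source on the right: the left ear hears it later.
    return (delayed, list(mono)) if itd > 0 else (list(mono), delayed)

left, right = spatialize([1.0] + [0.0] * 99, azimuth_deg=90)
```

A source hard right delays the left channel by a few samples at 8 kHz, which the auditory system reads as lateral position.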

Analysis and Modeling of Affective Audio Visual Speech Based on PAD Emotion Space

Shen Zhang, Yingjin Xu, Jia Jia, Lianhong Cai
2008 2008 6th International Symposium on Chinese Spoken Language Processing  
This paper analyzes acoustic and visual features for affective audio-visual speech based on PAD (Pleasure-Arousal-Dominance) emotion space.  ...  The variation of acoustic features is predicted by PAD values, and a PAD-PEP mapping function for facial expression synthesis is built.  ...  CONCLUSION AND FUTURE WORK In this paper we analyze acoustic and visual features for affective audio-visual speech based on PAD emotion space.  ... 
doi:10.1109/chinsl.2008.ecp.82 dblp:conf/iscslp/ZhangXJC08 fatcat:xoudfpbr5vdknoiuakdatx7em4
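The snippet's prediction of acoustic-feature variation from PAD values can be sketched as a linear model over the three emotion dimensions. The coefficients below are illustrative placeholders, not values estimated in the paper; the point is only the shape of the mapping:

```python
def predict_f0_shift(p, a, d, w=(0.5, 2.0, -0.3), bias=0.0):
    """Predicted F0 (pitch) shift as a weighted sum of Pleasure,
    Arousal, and Dominance, each in [-1, 1]. Weights are placeholders."""
    return bias + w[0] * p + w[1] * a + w[2] * d

# High-arousal positive emotion (e.g. joy) raises predicted pitch;
# low-arousal negative emotion (e.g. sadness) lowers it.
shift_joy = predict_f0_shift(0.8, 0.9, 0.4)
shift_sad = predict_f0_shift(-0.7, -0.6, -0.4)
```

The paper's PAD-PEP mapping for facial expression has the same structure: a regression from the 3-D emotion point to a vector of synthesis parameters.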

An Overview of Recent Work in Media Forensics: Methods and Threats [article]

Kratika Bhagtani, Amit Kumar Singh Yadav, Emily R. Bartusiak, Ziyue Xiang, Ruiting Shao, Sriram Baireddy, Edward J. Delp
2022 arXiv   pre-print
In this paper, we review recent work in media forensics for digital images, video, audio (specifically speech), and documents.  ...  Acknowledgments This paper is based on research sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under agreement numbers FA8750-20-2-1004  ...  We will use the term "audio" to indicate any type of acoustic signal.  ... 
arXiv:2204.12067v2 fatcat:jjeaeqy5zrbwdp62uejenndcja

Lip Movements Synthesis Using Time Delay Neural Networks

S. Curinga, F. Lavagetto, F. Vignoli
1996 Zenodo  
The audio was sampled at 48 kHz, represented with 16 bit/sample, and distorted by a synthesized telephone-line noise on one stereo channel.  ...  This correlation between the acoustic and visual modalities could be used to devise a reliable acoustic-to-visual conversion system able to improve video quality in model-based coding systems [4] or  ... 
doi:10.5281/zenodo.36281 fatcat:unws22wa2rhxtjw6d67dnxmhqa
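The defining property of a time-delay neural network is that each output depends on a sliding context window of input frames, i.e. a 1-D convolution across time. One such unit, reduced to its essentials (a real TDNN stacks these with nonlinearities and learned weights; the weights and feature below are illustrative):

```python
def tdnn_layer(frames, weights, bias=0.0):
    """One time-delay unit: a dot product over a sliding context
    window of acoustic frames (a 1-D convolution across time)."""
    ctx = len(weights)
    return [sum(w * f for w, f in zip(weights, frames[t:t + ctx])) + bias
            for t in range(len(frames) - ctx + 1)]

energy = [0.0, 0.2, 0.9, 1.0, 0.8, 0.1]         # toy per-frame audio feature
smooth = tdnn_layer(energy, [0.25, 0.5, 0.25])  # 3-frame context window
```

The context window is what lets an acoustic-to-visual converter exploit coarticulation: a lip shape at time t is predicted from audio before and after t.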

Towards Robust Real-time Audio-Visual Speech Enhancement [article]

Mandar Gogate, Kia Dashtipour, Amir Hussain
2021 arXiv   pre-print
In particular, a generative adversarial network (GAN) is proposed to address the practical issue of visual imperfections in AV SE.  ...  In this paper, we present a novel framework for low-latency speaker-independent AV SE that can generalise to a range of visual and acoustic noises.  ...  synthetic GRID- and acoustic speech signal for robust AV SE.  ... 
arXiv:2112.09060v1 fatcat:zockfnqivnbi7nna2yoxngpwui

Audio-Visual Speech Recognition Using Bimodal-Trained Bottleneck Features for a Person with Severe Hearing Loss

Yuki Takashima, Ryo Aihara, Tetsuya Takiguchi, Yasuo Ariki, Nobuyuki Mitani, Kiyohiro Omori, Kaoru Nakazono
2016 Interspeech 2016  
We propose a novel visual feature extraction approach that connects the lip image to audio features efficiently, and the use of convolutive bottleneck networks (CBNs) increases robustness with respect to speech fluctuations caused by hearing loss.  ...  model parameter is estimated by a constrained local model (CLM) and a lip image is extracted.  ... 
doi:10.21437/interspeech.2016-721 dblp:conf/interspeech/TakashimaATAMON16 fatcat:hvaz4pgqorcuvobnhdz3ptkkx4

Evaluation of A Viseme-Driven Talking Head [article]

Priya Dey, Steve Maddock, Rod Nicolson
2010 Computer Graphics and Visual Computing  
The audiovisual speech animation was found to give higher intelligibility of isolated words than acoustic speech alone.  ...  The system achieves audiovisual speech synthesis using viseme-driven animation and a coarticulation model, to automatically generate speech from text.  ...  Acknowledgements This work is sponsored by the ESRC and EPSRC.  ... 
doi:10.2312/localchapterevents/tpcg/tpcg10/139-142 dblp:conf/tpcg/DeyMN10 fatcat:t7ir72j6pbg6law4ornxed4pwy
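The viseme-driven pipeline in the snippet, text to phonemes to visemes to animation with a coarticulation model, can be sketched with a toy phoneme-to-viseme table and a carryover weight standing in for coarticulation. The table, inventory, and blend weight are hypothetical; real systems use larger viseme sets and dominance-function coarticulation:

```python
# Hypothetical phoneme-to-viseme table (real inventories are much larger).
VISEME = {"p": "closed", "a": "open", "t": "teeth", "o": "round"}

def viseme_track(phonemes, blend=0.3):
    """Map phonemes to target visemes, then attach a carryover weight so
    each target is blended with its predecessor rather than hit exactly,
    a crude stand-in for a coarticulation model."""
    targets = [VISEME[p] for p in phonemes]
    track = []
    for i, v in enumerate(targets):
        prev = targets[i - 1] if i > 0 else None
        track.append({"viseme": v,
                      "carryover": prev,
                      "blend": blend if prev else 0.0})
    return track

frames = viseme_track(["p", "a", "t"])
```

An animation layer would interpolate the face model between each frame's target and carryover visemes according to the blend weight.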

Sound-to-Imagination: An Exploratory Study on Unsupervised Crossmodal Translation Using Diverse Audiovisual Data [article]

Leonardo A. Fanzeres, Climent Nadeu
2022 arXiv   pre-print
The motivation of our research is to explore the possibilities of automatic sound-to-image (S2I) translation for enabling a human receiver to visually infer the occurrence of sound-related events.  ...  Additionally, we present a solution using informativity classifiers to perform quantitative evaluation of the generated images.  ...  The authors organize the text into four main topics: audiovisual separation and localization, audiovisual correspondence learning, audio and visual generation, and audiovisual representation.  ... 
arXiv:2106.01266v2 fatcat:mly5i3bqljcinmd2m2cl74fozq

Audio-Visual Scene Understanding

Di Hu
2021 Zenodo  
Audio-Visual Scene Understanding slides.  ...  CVPR 2021 Tutorial on Audio-Visual Scene Understanding, Audio Scene Understanding session, 6/19/2021.  ...  Sound source localization: what does each source sound like?  ... 
doi:10.5281/zenodo.5013725 fatcat:zzkh6dxfjzdd7apq46jsx3qvve

Learning Sparse Generative Models Of Audiovisual Signals

Gianluca Monaci, Friedrich Sommer, Pierre Vandergheynst
2008 Zenodo  
plausible codes for acoustic [16] and visual data [14, 17].  ...  In general, $s = \langle a, \phi_0(t-\rho_0)\rangle\,\phi_0 + \langle v, \phi_0(x-p_0, y-q_0, t-\rho_0)\rangle\,\phi_0 + R^1 s$; the audio and video modalities are weighted by different coefficients, $c_{ki}^{(a)}$ and $c_{ki}^{(v)}$, since the same audio-video  ... 
doi:10.5281/zenodo.41060 fatcat:ey444s5bprgobgyrali5p3qehq
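The residual decomposition in the snippet (signal = best-matching atom's contribution + residual $R^1 s$) is the matching pursuit recursion that underlies this family of sparse generative models. A minimal 1-D sketch with a toy two-atom dictionary (the real model uses learned audio-video atom pairs; this shows only the greedy decomposition):

```python
def matching_pursuit(signal, dictionary, n_iter=2):
    """Greedy sparse decomposition: at each step pick the atom with the
    largest inner product with the residual, record its coefficient,
    and subtract its contribution from the residual."""
    residual = list(signal)
    atoms = []
    for _ in range(n_iter):
        best_k = max(range(len(dictionary)),
                     key=lambda k: abs(sum(r * d for r, d in
                                           zip(residual, dictionary[k]))))
        coeff = sum(r * d for r, d in zip(residual, dictionary[best_k]))
        atoms.append((best_k, coeff))
        residual = [r - coeff * d
                    for r, d in zip(residual, dictionary[best_k])]
    return atoms, residual

D = [[1.0, 0.0], [0.0, 1.0]]   # orthonormal toy dictionary
atoms, res = matching_pursuit([3.0, 4.0], D, n_iter=2)
```

With an orthonormal dictionary two iterations recover the signal exactly; with overcomplete audiovisual dictionaries the residual shrinks but rarely vanishes, and sparsity comes from stopping early.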
Showing results 1–15 of 3,778