ObamaNet: Photo-realistic lip-sync from text [article]

Rithesh Kumar, Jose Sotelo, Kundan Kumar, Alexandre de Brebisson, Yoshua Bengio
2017 arXiv   pre-print
We present ObamaNet, the first architecture that generates both audio and synchronized photo-realistic lip-sync videos from any new text.  ...  Contrary to other published lip-sync approaches, ours is only composed of fully trainable neural modules and does not rely on any traditional computer graphics methods.  ...  A video example can be found there: Although we showcase the method on Barack Obama because his videos are commonly used to benchmark lip-sync methods (see for example  ... 
arXiv:1801.01442v1 fatcat:kczda6izyvfpfdhanrljh4m5xi

Supplementary Evidence: Towards Higher Levels of Assurance in Remote Identity Proofing [article]

Jongkil Jeong, Syed Wajid Ali Shah, Ashish Nanda, Robin Doss
Reenactment (Mouth) ObamaNet [9] Leverages text-to-speech processing, time-delayed LSTM, and CNN to convert text to photo-realistic lip-synced videos.  ...  Reenactment (Mouth) Mining Audio [13] Leverages LSTM and self-attention mechanism to transform audio or text to a realistic video with enhanced lip-syncing.  ... 
doi:10.6084/m9.figshare.19119680.v2 fatcat:ijki7jkshzbrfhk7ufsfuh2ri4

RIDP__IEEE_CEM.pdf [article]

Ashish Nanda, Syed Wajid Ali Shah, Jongkil Jeong, Robin Ram Mohan Doss
Reenactment (Mouth) ObamaNet [12] Leverages text-to-speech processing, time-delayed LSTM, and CNN to convert text to photo-realistic lip-synced videos.  ...  Reenactment (Mouth) Mining Audio [16] Leverages LSTM and self-attention mechanism to transform audio or text to a realistic video with enhanced lip-syncing.  ... 
doi:10.6084/m9.figshare.21067912.v1 fatcat:wm6ygsbv5nhatfn3lqoonyoopa

A Survey of Audio Synthesis and Lip-syncing for Synthetic Video Generation

Anup Kadam, Sagar Rane, Arpit Mishra, Shailesh Sahu, Shubham Singh, Shivam Pathak
2021 EAI Endorsed Transactions on Creative Technologies  
Voice cloning procedures include state-of-the-art methods like WaveNet and other text-to-speech approaches. Lip synchronization methods describe constrained and unconstrained methods.  ...  To synthesize a high-grade artificial video, the lip must be synchronized with the audio. Here we have compared the various methods for voice-cloning and lip synchronization.  ...  But this method works mainly for a specific person, not a generic audience. Puppetry. Obamanet: Photorealistic lip-syncing from text. This method [29] works accurately with static images or video  ... 
doi:10.4108/eai.14-4-2021.169187 fatcat:cu2ghjbzn5dx7nib5klifujwye

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person [article]

Xinsheng Wang, Qicong Xie, Jihua Zhu, Lei Xie, Scharenborg
2021 arXiv   pre-print
To generate the talking head videos from the face images, a facial landmark-based method that can predict both lip movements and head rotations is proposed.  ...  Automatically generating videos in which synthesized speech is synchronized with lip movements in a talking head has great potential in many human-computer interaction scenarios.  ...  However, even with these ground-truth landmarks, there are still differences between the generated images and real target images, as different lip shapes from the photo-realistic images could lead to the  ... 
arXiv:2108.04325v2 fatcat:64vpm5cz7za27ov6ltjl6dutui

Synthesizing a Talking Child Avatar to Train Interviewers Working with Maltreated Children

Pegah Salehi, Syed Zohaib Hassan, Myrthe Lammerse, Saeed Shafiee Sabet, Ingvild Riiser, Ragnhild Klingenberg Røed, Miriam S. Johnson, Vajira Thambawita, Steven A. Hicks, Martine Powell, Michael E. Lamb, Gunn Astrid Baugerud (+2 others)
2022 Big Data and Cognitive Computing  
Due to recent advances in artificial intelligence, we propose to generate a realistic and interactive child avatar, aiming to mimic a child.  ...  The insights and feedback from these studies have contributed to the refined and improved architecture of the child avatar system which we present here.  ...  The lip-sync quality was not explored in the data analyses because all of the animated characters used the same technique for lip-sync generation utilizing the Salsa component in the Unity game engine.  ... 
doi:10.3390/bdcc6020062 fatcat:axsbe6m6rncejkh5lhm3p73kry

A Survey and Taxonomy of Adversarial Neural Networks for Text-to-Image Synthesis [article]

Jorge Agnese, Jonathan Herrera, Haicheng Tao, Xingquan Zhu
2019 arXiv   pre-print
Recent progress in deep learning (DL) has brought a new set of unsupervised deep learning methods, particularly deep generative models which are able to generate realistic visual images using suitably  ...  to summarize the list of contemporaneous solutions that utilize GANs and DCNNs to generate enthralling results in categories such as human faces, birds, flowers, room interiors, object reconstruction from  ...  to achieve photo-realistic images from text descriptions.  ... 
arXiv:1910.09399v1 fatcat:4zqrooqcm5cgrk74kgon5vwkzm

Iterative Text-based Editing of Talking-heads Using Neural Retargeting [article]

Xinwei Yao, Ohad Fried, Kayvon Fatahalian, Maneesh Agrawala
2020 arXiv   pre-print
We present a text-based tool for editing talking-head video that enables an iterative editing workflow.  ...  ObamaNet [Kumar et al. 2017] synthesizes both audio and video from text, using a large dataset of 17 hours of the president's speeches.  ...  Our approach is to retarget lip motion from a repository of source actor video to the target actor. The frame shown for each iteration corresponds to the red edit text/gesture below the frame.  ... 
arXiv:2011.10688v1 fatcat:odu63nsc5bdyrenfp3hbrvin3u

The Creation and Detection of Deepfakes: A Survey [article]

Yisroel Mirsky, Wenke Lee
2020 arXiv   pre-print
Improvements in Lip-sync. Noting a human's sensitivity to temporal coherence, the authors of [147] use a GAN with three discriminators: on the frames, video, and lip-sync.  ...  Compared to direct models such as [147, 179], the authors of [27] improve the lip-syncing by preventing the model from learning irrelevant correlations between the audiovisual signal  ... 
arXiv:2004.11138v3 fatcat:xqabyslmdfhyznm7msqp3wznnq