Speech Animation using Dynamic Visemes

Lead Research Organisation: University of East Anglia
Department Name: Computing Sciences

Abstract

This project will investigate new methods for automatically producing speech animation. For animators in the movie industry this is typically a tedious, iterative process that involves key-framing static lip poses and then handcrafting a blending function to transition from one key pose to another. It is not uncommon for an animator to spend several hours producing animation for just a few seconds of speech.

We have previously worked on identifying a new dynamic unit for speech animation, termed dynamic visemes, and have shown that these units produce better animation than more traditional phoneme-based units. In this project we will integrate dynamic visemes into state-of-the-art approaches to further improve the quality of automated animation that is currently possible. Furthermore, we will investigate how dynamic visemes relate to speech acoustics so that animation can be generated directly from the voice of an actor.

We will build tools that can be integrated into commercial animation pipelines, so that animation studios can use them as a basis for animating any speech on their own models. This will leave their artists free to focus on the overall performance of the character.

The proposed project is ambitious in its aims, proposing new approaches for producing better speech animation. However, the impact of the work is wide-reaching and has the potential to influence the production of speech content in all animated movies and computer games.

Planned Impact

During the course of this project we will develop new techniques for producing better quality speech animation than is available using the current state of the art. These techniques will be implemented in easy-to-use tools that will remove the burden from professional artists in the production of animated speech content. The artist need only provide the audio assets from the actor, and our tools will automatically generate the lip motion synchronised with the spoken words. The artist can then focus on the character performance, e.g. adding the expression and head-pose variation that bring the character to life.

The manual effort required to create production-quality animated speech cannot be overstated. It is not unusual for even very skilled artists to spend many hours lip-syncing a character for only a very short scene. We will work with our industrial partner and industrial advisors to develop the research ideas into tools that work with industry-standard software so that they can easily fit within animation pipelines currently used by the various studios and VFX houses. These new approaches will allow production-quality animation to be created in real time, and so will provide significant savings to studios. Furthermore, we have demonstrated that our approach is not dependent on a particular model or rig and so can be applied to all animation, from cartoons to video-realistic characters. This will also mean that high quality speech animation is available to all, and not just the largest studios producing the biggest budget content.

More broadly, this work will impact research areas that involve the analysis of the (visible) speech articulators. For example, our approaches could find application in speech therapy, whereby a patient is shown speech-related exercises on a model of their own face. Our tools could automatically analyse the patient's motions, compare them against the expected motion and highlight errors in their production. This form of assessment might be beneficial to stroke patients, who need to re-train their facial muscles to properly articulate their speech. A face-to-face virtual speech therapist would always be at hand to provide useful and immediate feedback, and our analysis tools could be used to log progress and provide progress-related information to a real speech therapist.

There has been work showing the effectiveness of computer-generated characters as learning aids, and our tools could be developed into a language tutoring system. Speech-related movements can be tracked on the face of a student, and the virtual tutor can compare the observed motion with the expected motion of a native speaker. A range of face models, from cartoon-like to video-realistic, will make foreign language learning more 'fun' and hopefully re-engage school children in learning foreign languages.
 
Description Our first finding is that it is possible to estimate visual speech features from audio speech features using deep neural networks. This allows the mouth movements of an animated character to be generated automatically just from the audio speech. The challenge in this scenario is to make the estimated mouth movements look realistic, and subjective tests have shown that this is possible. Two methods to do this have been developed. The first uses an automatic speech recogniser to identify phonetic/linguistic features that can then be transformed into visual features and used for animation. The second method dispenses with the speech recogniser and transforms audio features directly into visual features. Furthermore, using similar techniques it is also possible to do the reverse, which is to estimate audio speech from visual speech. That is, from a video of a person talking it is possible to estimate an audio speech signal. The challenge in this scenario is to create an intelligible speech signal. Applications in this area include silent speech interfaces that can be used in, for example, medical settings (for patients who have undergone a laryngectomy) and for surveillance. Having the ability to estimate audio speech features from visual features has also led to the development of audio-visual methods of speech enhancement.
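As an illustration of the second, recogniser-free method, the sketch below shows a minimal deep-network regressor from a window of audio features to per-frame visual parameters. The feature types, dimensions and toolkit (PyTorch) are illustrative assumptions, not details taken from the project.

    import torch
    import torch.nn as nn

    # Hypothetical dimensions: 13 MFCCs over an 11-frame context window as input,
    # 30 visual parameters (e.g. AAM or blendshape weights) per frame as output.
    AUDIO_DIM = 13 * 11
    VISUAL_DIM = 30

    class AudioToVisual(nn.Module):
        """Feed-forward regressor from a window of audio features to visual features."""
        def __init__(self, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(AUDIO_DIM, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, VISUAL_DIM),
            )

        def forward(self, x):
            return self.net(x)

    model = AudioToVisual()
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    # One training step on a placeholder batch of paired audio/visual features;
    # in practice these would come from a tracked audio-visual speech corpus.
    audio_batch = torch.randn(32, AUDIO_DIM)
    visual_batch = torch.randn(32, VISUAL_DIM)
    optimiser.zero_grad()
    loss = loss_fn(model(audio_batch), visual_batch)
    loss.backward()
    optimiser.step()

At synthesis time the predicted visual parameters would drive the character rig frame by frame; the training objective here is plain mean-squared error for simplicity.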
We have also demonstrated that it is possible to perform audio-to-visual animation in close to real time. By extracting audio features from the speech signal in an asymmetric manner, the delay can be minimised, which makes the method suitable for real-time networked applications such as gaming.
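The low-latency idea can be sketched as follows: the context stacked around each audio frame uses many past frames (already available) but only a few future frames, so the look-ahead alone determines the added delay. The frame shift, window lengths and feature dimensionality below are illustrative assumptions, not values reported by the project.

    import numpy as np

    FRAME_SHIFT_MS = 10   # hop between analysis frames
    PAST_FRAMES = 20      # generous history: already available, adds no delay
    FUTURE_FRAMES = 3     # small look-ahead: added latency = 3 * 10 ms = 30 ms

    def asymmetric_windows(features):
        """Stack an asymmetric context window around each frame of audio features.

        `features` is a (num_frames, feat_dim) array; edge frames are padded by
        repeating the first/last frame.
        """
        n, _ = features.shape
        padded = np.vstack([
            np.repeat(features[:1], PAST_FRAMES, axis=0),
            features,
            np.repeat(features[-1:], FUTURE_FRAMES, axis=0),
        ])
        return np.stack([
            padded[i:i + PAST_FRAMES + 1 + FUTURE_FRAMES].ravel()
            for i in range(n)
        ])

    # Example: 100 frames of 13-dimensional features -> per-frame context vectors
    ctx = asymmetric_windows(np.random.randn(100, 13))
    print(ctx.shape)  # (100, 24 * 13)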
We have also developed a method of speaker-independent animation. Rather than requiring the system to be trained on and applied to only a single speaker, the new method allows animation to be created for any speaker: a new speaker speaks, and the character is animated accordingly. Subjective tests using human subjects have shown the resulting animations to be almost indistinguishable from those produced by a speaker-dependent system.
Exploitation Route Automatic animation of faces in the film/TV industry. Silent speech interfaces for medical applications and surveillance. Real-time animation for use in online gaming.
Sectors Creative Economy,Digital/Communication/Information Technologies (including Software),Education,Healthcare,Leisure Activities, including Sports, Recreation and Tourism,Culture, Heritage, Museums and Collections,Security and Diplomacy

 
Description One of the main non-academic impacts has been the advancement of the researchers employed on the project. One is now about to start work in a senior role in a games company, working on animation directly related to the project. A second researcher went on to apply some of the techniques developed in the project to a completely different field, sonar processing, where they have had significant impact. That person has now been taken on as a permanent member of staff by the company.
First Year Of Impact 2021
Sector Digital/Communication/Information Technologies (including Software),Electronics
Impact Types Economic

 
Description Proof of Concept Funding
Amount £14,000 (GBP)
Organisation University of East Anglia 
Sector Academic/University
Country United Kingdom
Start 05/2017 
End 08/2017
 
Title YouTube AV Speech database 
Description A large AV speech dataset derived from YouTube videos that have been face-tracked and processed, containing many thousands of hours of data. 
Type Of Material Database/Collection of data 
Provided To Others? No  
Impact A tool has been developed for tracking facial features and assessing audio-visual speech synchrony, which we plan to release to the speech processing community. 
 
Description Collaboration with Disney Research 
Organisation Disney Research
Country United States 
Sector Private 
PI Contribution Weekly meetings to discuss research from both sides, joint publications (in preparation).
Collaborator Contribution Weekly meetings to discuss research from both sides, joint publications (in preparation) and large-scale audio-visual speech databases. Internships.
Impact Publications under review.
Start Year 2015
 
Description Paper presentation at SIGGRAPH 2017 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation of paper at SIGGRAPH conference in Los Angeles, USA. SIGGRAPH is the world's largest and most influential conference in computer graphics and interactive techniques.
Year(s) Of Engagement Activity 2017
URL http://s2017.siggraph.org/
 
Description Presentation at Interspeech 2018 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Policymakers/politicians
Results and Impact Presentation delivered at the Interspeech 2018 conference in India.
Year(s) Of Engagement Activity 2018
 
Description Presentation at NorDev conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Industry/Business
Results and Impact Presented my research at a booth to participants at a regional developers' conference.
Year(s) Of Engagement Activity 2017
 
Description Research-led teaching 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Undergraduate students
Results and Impact Embedded the results of the latest research into a Level 6 module on Audio-visual Processing. This took the form of lectures and additions to laboratory classes, and raised students' general awareness of the latest work in this area.
Year(s) Of Engagement Activity 2015,2016
 
Description School visit to Fakenham Academy 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Gave a talk to high school girls about my research, UEA and higher education.
Year(s) Of Engagement Activity 2016