Speech Animation using Dynamic Visemes
Lead Research Organisation:
University of East Anglia
Department Name: Computing Sciences
Abstract
This project will investigate new methods for automatically producing speech animation. For animators in the movie industry this is typically a tedious, iterative process that involves key-framing static lip poses and then handcrafting a blending function to transition from one key pose to another. It is not uncommon for an animator to spend several hours producing animation for just a few seconds of speech.
We have previously worked on identifying a new dynamic unit for speech animation, termed dynamic visemes, and have shown that these units produce better animation than more traditional phoneme-based units. In this project we will integrate dynamic visemes into state-of-the-art approaches to further improve upon the quality of automated animation that is currently possible. Furthermore, we will investigate how dynamic visemes relate to speech acoustics so that animation can be generated directly from the voice of an actor.
We will build tools that can be implemented in commercial animation pipelines so animation studios can use our tools as a basis for animating any speech on their own models. This will leave their artists free to focus on the overall performance of the character.
The proposed project is ambitious in its aims, proposing new approaches for producing better speech animation. However, the impact of the work is wide-reaching and has the potential to influence the production of speech content in all animated movies and computer games.
Planned Impact
During the course of this project we will develop new techniques for producing better quality speech animation than is available using the current state of the art. These techniques will be implemented in easy-to-use tools that will remove the burden from professional artists in the production of animated speech content. The artist need only provide the audio assets from the actor, and our tools will automatically generate the lip motion synchronised with the spoken words. The artist can then focus on the character performance, e.g. adding the expression and head pose variation to bring the character to life.
The manual effort required to create production-quality animated speech cannot be overstated. It is not unusual for even very skilled artists to spend many hours lip-syncing a character for only a very short scene. We will work with our industrial partner and industrial advisors to develop the research ideas into tools that work with industry standard software so that they can easily fit within animation pipelines currently used by the various studios and VFX houses. These new approaches will allow production-quality animation to be created in real-time, and so will provide significant savings to studios. Furthermore, we have demonstrated that our approach is not dependent on a particular model or rig and so can be applied to all animation, from cartoons to videorealistic characters. This will also mean that high quality speech animation is available to all, and not just the largest studios producing the biggest budget content.
More broadly, this work will impact research areas that involve the analysis of the (visible) speech articulators. For example, our approaches could find application in speech therapy, whereby a patient is shown speech-related exercises on a model of their own face. Our tools could automatically analyse the patient's motions, compare them against the expected motion and show errors in their production. This form of assessment might benefit stroke patients, who need to re-train their facial muscles to properly articulate their speech. A face-to-face virtual speech therapist is always at hand to provide useful and immediate feedback, and our analysis tools can be used to log progress and provide progress-related information to a real speech therapist.
There has been work showing the effectiveness of computer-generated characters as learning aids, and our tools could be developed into a language tutoring system. Speech-related movements can be tracked on the face of a student and the virtual tutor can compare the observed motion with the expected motion of a native speaker. A range of face models, from cartoon-like to videorealistic, will make foreign language learning more 'fun' and hopefully re-engage school children in learning foreign languages.
Publications
Le Cornu T
(2017)
Generating Intelligible Audio Speech From Visual Speech
in IEEE/ACM Transactions on Audio, Speech, and Language Processing
Lines J
(2018)
Time Series Classification with HIVE-COTE: The Hierarchical Vote Collective of Transformation-Based Ensembles
in ACM Transactions on Knowledge Discovery from Data
Taylor S
(2017)
A deep learning approach for generalized speech animation
in ACM Transactions on Graphics
Taylor S
(2016)
Audio-to-Visual Speech Conversion Using Deep Neural Networks
Thangthai A
(2019)
Synthesising visual speech using dynamic visemes and deep learning architectures
in Computer Speech & Language
Thangthai A
(2016)
Visual Speech Synthesis Using Dynamic Visemes, Contextual Features and DNNs
Thangthai Ausdang
(2018)
Visual speech synthesis using dynamic visemes and deep learning architectures
Websdale D
(2022)
Speaker-Independent Speech Animation Using Perceptual Loss Functions and Synthetic Data
in IEEE Transactions on Multimedia
Websdale D
(2018)
The Effect of Real-Time Constraints on Automatic Speech Animation
Description | Our first finding is that it is possible to estimate visual speech features from audio speech features using deep neural networks. This allows the mouth movements of an animated character to be generated automatically just from the audio speech. The challenge in this scenario is to make the estimated mouth movements look realistic, and subjective tests have shown that this is possible. Two methods to do this have been developed. The first uses an automatic speech recogniser to identify phonetic/linguistic features that can then be transformed into visual features and used for animation. The second method dispenses with the speech recogniser and transforms audio features directly into visual features. Furthermore, using similar techniques it is also possible to do the reverse, which is to estimate audio speech from visual speech: from a video of a person talking it is possible to estimate an audio speech signal. The challenge in this scenario is to create an intelligible speech signal. Applications for this area are silent speech interfaces that can be used in, for example, medical settings (for patients who have undergone a laryngectomy) and for surveillance. The ability to estimate audio speech features from visual features has also led to the development of audio-visual methods of speech enhancement. We have also demonstrated that it is possible to perform audio-to-visual animation in close to real time. By extracting audio features from the speech signal in an asymmetric manner, delay can be minimised, which makes the method suitable for real-time networked applications such as gaming. We have also developed a method of speaker-independent animation. Rather than requiring the system to be trained on and applied to only a single speaker, the new method allows animation to be created for any speaker. The new speaker speaks, and the character is animated accordingly.
Subjective tests using human subjects have shown the resulting animations to be almost indistinguishable from those produced by a speaker-dependent system. |
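The direct audio-to-visual mapping described above can be sketched as a windowed regression from per-frame audio features to visual (mouth-shape) features, where a context window of audio frames captures coarticulation around each target frame. The sketch below is illustrative only: it uses synthetic stand-in data, assumed feature dimensions, and a simple least-squares regressor in place of the deep neural networks used in the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the project's actual setup):
# 13 audio features per frame (e.g. MFCC-like), 6 visual mouth-shape
# parameters per frame, and an 11-frame audio context window.
N_FRAMES, AUDIO_DIM, VISUAL_DIM, WINDOW = 500, 13, 6, 11

# Synthetic stand-in data: audio features and visual features that are
# linearly related plus a small amount of noise.
audio = rng.standard_normal((N_FRAMES, AUDIO_DIM))
true_map = rng.standard_normal((AUDIO_DIM, VISUAL_DIM))
visual = audio @ true_map + 0.05 * rng.standard_normal((N_FRAMES, VISUAL_DIM))

def window_stack(feats, w):
    """Stack a centred window of w frames around each frame, so the
    regressor sees audio context (coarticulation), not a single frame."""
    pad = w // 2
    padded = np.pad(feats, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([padded[i:i + w].ravel() for i in range(len(feats))])

X = window_stack(audio, WINDOW)          # (N_FRAMES, WINDOW * AUDIO_DIM)

# Least-squares mapping from windowed audio features to visual features;
# the project replaces this linear map with a deep network.
W, *_ = np.linalg.lstsq(X, visual, rcond=None)
pred = X @ W
mse = float(np.mean((pred - visual) ** 2))
print(f"training MSE: {mse:.4f}")
```

In a real-time setting the context window would be asymmetric (more past frames than future frames) to minimise the look-ahead delay, as the description notes.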
Exploitation Route | Automatic animation of face in film/TV industry. Silent speech interfaces for medical applications and surveillance. Real-time animation for use in on-line gaming. |
Sectors | Creative Economy Digital/Communication/Information Technologies (including Software) Education Healthcare Leisure Activities including Sports Recreation and Tourism Culture Heritage Museums and Collections Security and Diplomacy |
Description | One of the main non-academic impacts has been the advancement of researchers employed on the project. One is now about to start work in a senior role in a games company, working on animation directly related to the project. A second researcher went on to apply some of the techniques developed in the project to a completely different field of sonar processing, which has had significant impact in that area. That person has now been taken on as a permanent member of staff by the company. |
First Year Of Impact | 2021 |
Sector | Digital/Communication/Information Technologies (including Software),Electronics |
Impact Types | Economic |
Description | Proof of Concept Funding |
Amount | £14,000 (GBP) |
Organisation | University of East Anglia |
Sector | Academic/University |
Country | United Kingdom |
Start | 04/2017 |
End | 08/2017 |
Title | YouTube AV Speech database |
Description | A large AV speech dataset derived from YouTube video that has been face tracked and processed, and contains many thousands of hours of data. |
Type Of Material | Database/Collection of data |
Provided To Others? | No |
Impact | A tool was developed for tracking facial features and assessing audio-visual speech synchrony, which we plan to release to the speech processing community.
Description | Collaboration with Disney Research |
Organisation | Disney Research |
Country | United States |
Sector | Private |
PI Contribution | Weekly meetings to discuss research from both sides, joint publications (in preparation). |
Collaborator Contribution | Weekly meetings to discuss research from both sides, joint publications (in preparation) and large-scale audio-visual speech databases. Internships. |
Impact | Publications under review. |
Start Year | 2015 |
Description | Paper presentation at SIGGRAPH 2017 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Presentation of paper at SIGGRAPH conference in Los Angeles, USA. SIGGRAPH is the world's largest and most influential conference in computer graphics and interactive techniques. |
Year(s) Of Engagement Activity | 2017 |
URL | http://s2017.siggraph.org/ |
Description | Presentation at Interspeech 2018 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Policymakers/politicians |
Results and Impact | Presentation delivered at the Interspeech 2018 conference in India. |
Year(s) Of Engagement Activity | 2018 |
Description | Presentation at NorDev conference |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Industry/Business |
Results and Impact | Presented my research at a booth to participants at a regional developers' conference. |
Year(s) Of Engagement Activity | 2017 |
Description | Research-led teaching |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Undergraduate students |
Results and Impact | Embedded results of the latest research into a Level 6 module on Audio-visual Processing. This took the form of lectures and additions to laboratory classes, raising students' awareness of the latest work in this area.
Year(s) Of Engagement Activity | 2015,2016 |
Description | School visit to Fakenham Academy |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Schools |
Results and Impact | Talk to High School girls about my research, UEA and higher education. |
Year(s) Of Engagement Activity | 2016 |