Dynamically Accurate Avatars
Lead Research Organisation: University of East Anglia
Department Name: Computing Sciences
Abstract
Our bodies move as we speak. Evidently, movement of the jaw, lips and tongue is required to produce coherent speech. Furthermore, additional body gestures both synchronise with the voice and significantly contribute to speech comprehension. For example, a person's eyebrows rise when they are stressing a point, their head shakes when they disagree, and a shrug might express doubt.
The goal is to build a computational model that learns the relationship between speech and upper body motion so that we can automatically predict face and body posture for any given audio speech. The predicted body pose can be transferred to computer graphics characters, or avatars, to automatically create character animation directly from speech, on the fly.
A number of approaches have previously been used for mapping from audio to facial motion or head motion, but the limited amount of speech and body motion data that is available has hindered progress. Our research programme will use a field of machine learning called transfer learning to overcome this limitation.
Our research will be used to automatically and realistically animate the face and upper body of a graphics character in sync with a user's voice in real time. This is valuable for a) controlling the body motion of avatars in multiplayer online gaming, b) driving a user's digital presence in virtual reality (VR) scenarios, and c) automating character animation in television and film production. The work will enhance the realism of avatars during live interaction between users in computer games and social VR without the need for full body tracking. Additionally, we will significantly reduce the time required to produce character animation by removing the need for expensive and time-consuming hand-animation or motion capture.
We will develop novel artificial intelligence approaches to build a robust speech-to-body motion model. For this, we will design and collect a video and motion capture dataset of people speaking, and this will be made publicly available.
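As a purely illustrative sketch, and not the project's actual architecture, the transfer-learning idea described above could be set up in PyTorch as follows: a speech encoder pretrained on a large audio corpus is kept frozen, and only a small recurrent head is trained on the comparatively scarce speech-and-motion data to regress per-frame body pose. The feature dimension, joint count and the identity stand-in for the encoder are assumptions made for this example.

```python
# Illustrative sketch only: transfer learning for speech-to-body-motion prediction.
import torch
import torch.nn as nn

class SpeechToPose(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, feat_dim: int = 768, num_joints: int = 20):
        super().__init__()
        self.encoder = pretrained_encoder          # assumed pretrained on a large speech corpus
        for p in self.encoder.parameters():        # freeze the encoder; only the head is trained
            p.requires_grad = False
        self.head = nn.GRU(feat_dim, 256, batch_first=True)
        self.out = nn.Linear(256, num_joints * 3)  # e.g. 3 rotation parameters per joint

    def forward(self, audio_feats):                # (batch, frames, feat_dim)
        with torch.no_grad():
            feats = self.encoder(audio_feats)
        h, _ = self.head(feats)
        return self.out(h)                         # (batch, frames, num_joints * 3)

# Toy usage with an identity stand-in for a real pretrained encoder.
model = SpeechToPose(nn.Identity())
pose = model(torch.randn(1, 100, 768))             # 100 audio frames -> 100 pose frames
```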
The project team comprises Dr. Taylor and a postdoctoral research associate (PDRA) at the University of East Anglia, Norwich, UK.
Planned Impact
Economic Impact
The global games audience is estimated at around 2.4 billion people and the global market is expected to grow to an estimated $129 billion by the end of 2020. The UK consumer spend on games was valued at £4.33bn in 2016, with a record £1.2bn coming from online game sales, so the opportunities for the UK online games industry have never been greater. At present, a player can speak with other players during live gameplay, yet their avatar does not move in sync with their speech. Our software will add significant value to the games industry since it will address this challenge and yield a more compelling gaming experience.
A further £61 million of the UK consumer spend on video games came from the sale of virtual reality (VR) hardware. VR is a fast-developing sector of the creative digital industries, and our technology will allow a user's digital presence to move in sync with their voice without the need for intrusive and expensive full body tracking.
Our methods will significantly reduce the time required to produce character animation by removing the need for expensive and time-consuming hand-animation or motion capture. Consequently, the cost of production will reduce accordingly.
Scientific Impact
Our project introduces a ground-breaking technique for creating highly realistic character animation automatically for any speech. Thus, we envisage that computer graphics researchers will shift their focus towards improving the fidelity of real-time rendered graphics characters, which will consequently expedite the advancement of human realism in computer graphics.
We expect the academic impact of our work in the field of psychology to be considerable. Our technology allows the dynamics or the appearance of an animated character to be manipulated in precise ways, which would, for example, allow psychologists to conduct experiments that dissociate human behaviour from appearance.
Societal Impact
Our technology will bring an equivalent level of realism to potentially every animated production and every computer game, and it will be available for all game content and not just for cut-scenes. Furthermore, it will be possible to generate character animation dynamically and in response to actions by the player. This will be a significant step forward for an industry that strives for ever more realistic content, and crucially will provide children with characters that are consistently animated with realistic face and body behaviours at a time when their own speech is developing.
The proposed research can also be used in social interaction training tools for people with autism spectrum disorder (ASD), who can use the technology for practising conversations and for learning how to interpret human emotions. This research has the potential to positively impact the lives of the 700,000 people in the UK and 3.5 million in the USA alone who have been diagnosed with ASD. We will ensure that colleagues in the relevant faculties and institutions are kept informed of the research, and we will work with them to develop applications through future bids to the research councils.
Outreach and Engagement
Dr. Taylor will continue to deliver lectures as part of outreach events at local schools and, since our work will have influenced the content of the computer games that these students play, she will be able to demonstrate that cutting-edge computing science research at UEA has practical use. It will help students to understand how characters in animated shows are brought to life, and inspire them to get involved with science.
We will interactively demo the work at the Norwich Science Festival and create character animations using voices recorded from members of the public. The video, augmented with the UEA logo, will be emailed to them and they will be encouraged to share it on social media, broadening public awareness of the university and of the research.
Organisations
- University of East Anglia (Lead Research Organisation)
- AHRC (Co-funder)
- Assessment Micro-Analytics (AMA) (Collaboration)
- Carnegie Mellon University (Collaboration)
- Uneeq (Collaboration)
- FXhome Limited (Project Partner)
- The Foundry Visionmongers Ltd (UK) (Project Partner)
- SyncNorwich (Project Partner)
- Emteq Ltd (Project Partner)
- FaceMe (Project Partner)
Publications
- Greenwood D (2019) Joint Estimation of Face and Camera Pose from a Collection of Images
- Taylor S (2021) Speech-Driven Conversational Agents using Conditional Flow-VAEs
- Thangthai A (2019) Synthesising visual speech using dynamic visemes and deep learning architectures, in Computer Speech & Language
- Websdale D (2022) Speaker-Independent Speech Animation Using Perceptual Loss Functions and Synthetic Data, in IEEE Transactions on Multimedia
- Windle J (2022) Arm motion symmetry in conversation, in Speech Communication
- Zhou H (2021) Self-Supervised Monocular Depth Estimation with Internal Feature Fusion, in 32nd British Machine Vision Conference, BMVC 2021
| Description | A new deep learning architecture has been developed for predicting body motion from speech. The method outperforms the state of the art in this field, and we expect the approach to generalise to many other applications. |
| Exploitation Route | A paper has been published and the code is available. |
| Sectors | Creative Economy Digital/Communication/Information Technologies (including Software) |
| Description | EPSRC DTP PhD Studentship |
| Amount | £77,556 (GBP) |
| Organisation | University of East Anglia |
| Sector | Academic/University |
| Country | United Kingdom |
| Start | 09/2020 |
| End | 02/2024 |
| Title | UEA Digital Humans Dataset |
| Description | The dataset contains many hours of actors speaking. It contains natural dialogue, acted expressive monologue and heated debates. The actors are filmed using 3 cameras from different angles so that we can reconstruct their 3D body motion (an illustrative triangulation sketch follows this record). The actors are required to sign a model release form so that we can freely distribute the dataset once capture is complete. We are currently in the process of capturing this dataset and it will be made publicly available when complete. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2019 |
| Provided To Others? | No |
| Impact | We are currently using this data to learn a model to predict body motion from speech. As the database grows, the new data will be used to improve the generalisability of the model to new speakers. |
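Purely as an illustration of the multi-camera reconstruction step mentioned in the dataset description above, and not the project's actual pipeline, a 3D joint position can be triangulated from matching 2D detections in two calibrated views, for example with OpenCV. The projection matrices and image points below are made-up values.

```python
# Illustrative sketch: triangulating one body joint from two calibrated camera views.
import numpy as np
import cv2

# Hypothetical 3x4 projection matrices for two calibrated cameras.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])

# Matching 2D detections of one joint in each view (made-up coordinates).
pt1 = np.array([[0.31], [0.42]])
pt2 = np.array([[0.29], [0.42]])

X_h = cv2.triangulatePoints(P1, P2, pt1, pt2)   # homogeneous 4x1 result
X = (X_h[:3] / X_h[3]).ravel()                  # 3D joint position
print(X)
```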
| Description | EIRA project with Assessment Micro-Analytics |
| Organisation | Assessment Micro-Analytics (AMA) |
| Country | United Kingdom |
| Sector | Private |
| PI Contribution | I led a research team to explore the efficacy of automated detection of the face and body in video for human gesture and expression recognition. The team found that automatic detection of a set of body landmarks is possible using existing tools (an illustrative sketch follows this record), and provided full code for fitting to an image. A set of recommendations was made for maximising detection accuracy by controlling the capture environment. The team also explored the performance of face trackers on a diverse population, which revealed that detections on images of subjects from certain ethnic groups were more accurate than those on others. The exploratory research also found that detections on the younger age group achieved good accuracy. Finally, a pipeline for processing multimodal data in a machine learning framework for human behaviour recognition was proposed. |
| Collaborator Contribution | The partner brought expertise and knowledge of real-world challenges. |
| Impact | This was a proof-of-concept research project to determine whether existing face and body trackers could be used for tracking student behaviour in online assessment. The results have provided practical guidance for Assessment Micro-Analytics to integrate this functionality into their products, and will form the basis of further grant applications and collaborative projects. The collaboration resulted in an EIRA case study. (Link not yet available.) |
| Start Year | 2020 |
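As a minimal illustration of the kind of off-the-shelf body landmark detection referred to in the record above, and not the code delivered to the partner, the open-source MediaPipe Pose detector can return a set of normalised body landmarks from a single image. The image filename below is hypothetical.

```python
# Illustrative sketch: body landmark detection on one image with MediaPipe Pose.
import cv2
import mediapipe as mp

image = cv2.imread("frame.jpg")                             # hypothetical input image
with mp.solutions.pose.Pose(static_image_mode=True) as pose:
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    for i, lm in enumerate(results.pose_landmarks.landmark):
        print(i, lm.x, lm.y, lm.visibility)                 # normalised image coordinates
```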
| Description | FaceMe/Uneeq |
| Organisation | Uneeq |
| Country | New Zealand |
| Sector | Private |
| PI Contribution | We have been working with Uneeq (formerly FaceMe) to design a dataset of face and body motion along with speech. |
| Collaborator Contribution | Uneeq will record the data and make it available to our research team. This will be valuable to the project since they have the resources to capture high quality facial and body motion. |
| Impact | None as yet. |
| Start Year | 2018 |
| Description | Tongue research with CMU |
| Organisation | Carnegie Mellon University |
| Country | United States |
| Sector | Academic/University |
| PI Contribution | We performed an analysis of tongue EMA (electromagnetic articulography) data to investigate lateral tongue motion during speech (an illustrative sketch of this kind of analysis follows this record). |
| Collaborator Contribution | Provided tongue motion data and worked together on analysis. |
| Impact | Publication at Interspeech 2021 (doi: 10.21437/Interspeech.2021-1732), and ongoing discussion. |
| Start Year | 2021 |
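As an illustration of the type of analysis described in the record above, and a deliberate simplification of the published work, lateral tongue movement from an EMA sensor trajectory could be summarised as follows. The axis convention and the synthetic trajectory are assumptions made for the example.

```python
# Illustrative sketch: summarising lateral (left-right) motion of one EMA sensor.
import numpy as np

def lateral_motion_stats(traj_mm: np.ndarray, lateral_axis: int = 1) -> dict:
    lateral = traj_mm[:, lateral_axis]
    lateral = lateral - lateral.mean()              # centre on the sensor's mean position
    return {
        "rms_mm": float(np.sqrt(np.mean(lateral ** 2))),
        "range_mm": float(lateral.max() - lateral.min()),
    }

# Synthetic example trajectory: 500 frames of small lateral oscillation.
t = np.linspace(0, 5, 500)
traj = np.stack([np.zeros_like(t), 1.5 * np.sin(2 * np.pi * t), np.zeros_like(t)], axis=1)
print(lateral_motion_stats(traj))
```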
| Description | Talk at Norwich Science Festival |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | Regional |
| Primary Audience | Public/other audiences |
| Results and Impact | I gave a talk at Norwich Science Festival on my research. Approximately 60 people came along, and between 5 and 10 stayed afterwards to discuss the work and the wider applications of the approaches. Representatives of a few local companies passed on their business cards for further discussion on possible collaboration with the School of Computing Sciences. |
| Year(s) Of Engagement Activity | 2019 |
| URL | https://norwichsciencefestival.co.uk/events/automatically-animating-faces/ |
| Description | Talk at School |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | National |
| Primary Audience | Schools |
| Results and Impact | Around 100 pupils attended a remote school event at which I presented my research. Questions and discussion followed, and the school reported increased interest in related subject areas. |
| Year(s) Of Engagement Activity | 2021 |
