Dynamically Accurate Avatars

Lead Research Organisation: University of East Anglia
Department Name: Computing Sciences

Abstract

Our bodies move as we speak. Most obviously, movement of the jaw, lips and tongue is required to produce coherent speech, but additional body gestures also synchronise with the voice and contribute significantly to speech comprehension. For example, a person's eyebrows rise when they are stressing a point, their head shakes when they disagree, and a shrug might express doubt.

The goal is to build a computational model that learns the relationship between speech and upper body motion so that we can automatically predict face and body posture for any given audio speech. The predicted body pose can be transferred to computer graphics characters, or avatars, to automatically create character animation directly from speech, on the fly.

A number of approaches have previously been used to map from audio to facial or head motion, but progress has been hindered by the limited amount of paired speech and body motion data available. Our research programme will use transfer learning, a machine learning technique for reusing knowledge learned on related tasks, to overcome this limitation.
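To make the mapping concrete, the sketch below shows one minimal way such a speech-to-pose model could be structured, with a frozen, pretrained audio encoder standing in for the transfer-learning idea. The encoder, feature dimensions, and pose parameterisation are illustrative assumptions, not the architecture developed in this project.

```python
# Illustrative sketch only: a small speech-to-pose regressor in PyTorch with a
# frozen (assumed pretrained) audio encoder standing in for transfer learning.
import torch
import torch.nn as nn

class SpeechToPose(nn.Module):
    def __init__(self, audio_dim=128, hidden_dim=256, pose_dim=63):
        super().__init__()
        # Audio encoder (assumed pretrained); frozen so the limited
        # motion-capture data is only used to train the small decoder.
        self.encoder = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        for p in self.encoder.parameters():
            p.requires_grad = False
        # Temporal decoder maps encoded speech frames to per-frame pose vectors
        # (e.g. 21 upper-body joints x 3 rotation parameters = 63 values).
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, pose_dim)

    def forward(self, audio_feats):       # (batch, frames, audio_dim)
        h = self.encoder(audio_feats)     # frame-wise speech embedding
        h, _ = self.rnn(h)                # temporal context across frames
        return self.head(h)               # (batch, frames, pose_dim)

model = SpeechToPose()
audio = torch.randn(2, 100, 128)          # 2 clips, 100 frames of features
pose = model(audio)                       # predicted pose sequence
print(pose.shape)                         # torch.Size([2, 100, 63])
```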

Our research will be used to automatically and realistically animate the face and upper body of a graphics character along with a user's voice in real time. This is valuable for a) controlling the body motion of avatars in multiplayer online gaming, b) driving a user's digital presence in virtual reality (VR) scenarios, and c) automating character animation in television and film production. The work will enhance the realism of avatars during live interaction between users in computer games and social VR without the need for full body tracking. Additionally, we will significantly reduce the time required to produce character animation by removing the need for expensive and time-consuming hand-animation or motion capture.

We will develop novel artificial intelligence approaches to build a robust speech-to-body motion model. For this, we will design and collect a video and motion capture dataset of people speaking, and this will be made publicly available.

The project team comprises Dr. Taylor and a postdoctoral research associate (PDRA) at the University of East Anglia, Norwich, UK.

Planned Impact

Economic Impact
The global games audience is estimated at around 2.4 billion people, and the global market is expected to grow to an estimated $129 billion by the end of 2020. UK consumer spend on games was valued at £4.33bn in 2016, with a record £1.2bn coming from online game sales, so the opportunities for the UK online games industry have never been greater. At present, a player can speak with other players during live gameplay, yet their avatar does not move in sync with their speech. Our software will add significant value to the games industry since it will address this challenge and yield a more compelling gaming experience.

A further £61 million of the UK consumer spend on video games came from the sale of virtual reality (VR) hardware. VR is a fast-developing sector of the creative digital industries, and our technology will allow a user's digital presence to move in sync with their voice without the need for intrusive and expensive full body tracking.

Our methods will significantly reduce the time required to produce character animation by removing the need for expensive and time-consuming hand-animation or motion capture. Consequently, the cost of production will reduce accordingly.

Scientific Impact
Our project introduces a ground-breaking technique for automatically creating highly realistic character animation for any speech. We therefore envisage computer graphics researchers shifting their focus towards improving the fidelity of real-time rendered characters, which will in turn accelerate progress towards human realism in computer graphics.

We expect the academic impact of our work in the field of psychology to be considerable. Our technology allows the dynamics or the appearance of an animated character to be manipulated in precise ways; for example, it would allow psychologists to conduct experiments that dissociate human behaviour from appearance.

Societal Impact
Our technology will bring a consistently high level of realism to potentially every animated production and every computer game, and it will be available for all game content rather than just pre-rendered cut-scenes. Furthermore, it will be possible to generate character animation dynamically, in response to actions by the player. This will be a significant step forward for an industry that strives for ever more realistic content, and it will crucially provide children with characters whose faces and bodies are animated consistently and realistically at a time when their own speech is developing.

The proposed research can also be used in social-interaction training tools for people with autism spectrum disorder (ASD), who could use the technology to practise conversations and to learn how to interpret human emotions. This research has the potential to positively impact the lives of the 700,000 people in the UK and 3.5 million in the USA who have been diagnosed with ASD. We will ensure that colleagues in the relevant faculties and institutions are kept informed of the research, and we will work with them to develop applications through future bids to the research councils.

Outreach and Engagement
Dr. Taylor will continue to deliver lectures as part of outreach events at local schools and, since our work will have influenced the content of the computer games that these students play, she will be able to demonstrate that cutting-edge computing science research at UEA has practical use. These talks will help students understand how characters in animated productions are brought to life and inspire them to get involved with science.

We will demonstrate the work interactively at the Norwich Science Festival, creating character animations from voices recorded from members of the public. The resulting videos, with the UEA logo added, will be emailed to participants, who will be encouraged to share them on social media, broadening public awareness of the university and of the research.
 
Description A new deep learning architecture has been developed for predicting body motion from speech. The method outperforms the state of the art in this field, and we expect the approach to generalise to many other applications.
Exploitation Route A paper has been published and the code is available.
Sectors Creative Economy

Digital/Communication/Information Technologies (including Software)

 
Description EPSRC DTP PhD Studentship
Amount £77,556 (GBP)
Organisation University of East Anglia 
Sector Academic/University
Country United Kingdom
Start 09/2020 
End 02/2024
 
Title UEA Digital Humans Dataset 
Description The dataset contains many hours of actors speaking, including natural dialogue, acted expressive monologue and heated debate. Each actor is filmed by 3 cameras from different angles so that their 3D body motion can be reconstructed (a sketch of this multi-view reconstruction step follows this entry). Actors sign a model release form allowing us to distribute the dataset freely. Capture is ongoing, and the dataset will be made publicly available when complete. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? No  
Impact We are currently using this data to learn a model to predict body motion from speech. As the database grows, the new data will be used to improve the generalisability of the model to new speakers. 
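As a concrete illustration of why three camera views allow 3D reconstruction, the sketch below triangulates a single 3D point from its 2D positions in several calibrated views using linear least squares. The camera matrices and point values are placeholders, not the project's actual calibration or pipeline.

```python
# Hypothetical multi-view triangulation sketch (not the project's code):
# recover a 3D point from its 2D image positions in N calibrated cameras.
import numpy as np

def triangulate(proj_mats, points_2d):
    """proj_mats: list of 3x4 camera projection matrices.
       points_2d: list of (x, y) observations, one per camera."""
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        rows.append(x * P[2] - P[0])   # each view contributes two linear
        rows.append(y * P[2] - P[1])   # constraints on the homogeneous point
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)        # null vector of A is the solution
    X = vt[-1]
    return X[:3] / X[3]                # dehomogenise to (X, Y, Z)

# Toy example: three translated cameras observing the point (0.1, 0.2, 3.0).
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=float)
def cam(tx):
    Rt = np.hstack([np.eye(3), np.array([[tx], [0.0], [0.0]])])
    return K @ Rt
Ps = [cam(-0.5), cam(0.0), cam(0.5)]
X_true = np.array([0.1, 0.2, 3.0, 1.0])
obs = [(P @ X_true)[:2] / (P @ X_true)[2] for P in Ps]
print(triangulate(Ps, obs))            # ~ [0.1, 0.2, 3.0]
```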
 
Description EIRA project with Assessment Micro-Analytics 
Organisation Assessment Micro-Analytics (AMA)
Country United Kingdom 
Sector Private 
PI Contribution I led a research team to explore the efficacy of automated face and body detection in video for human gesture and expression recognition. We found that a set of body landmarks can be detected automatically using existing tools, and we provided full code for fitting these landmarks to an image (an illustrative sketch follows this entry). We also recommended ways to maximise detection accuracy by controlling the capture environment. The team examined the performance of face trackers on a diverse population, which revealed that detections on images of subjects from certain ethnic groups were more accurate than those from others, while detections for younger subjects achieved good accuracy. Finally, we proposed a pipeline for processing multimodal data in a machine learning framework for human behaviour recognition.
Collaborator Contribution The partner brought expertise and knowledge of real-world challenges.
Impact This was a proof-of-concept research project to determine whether existing face and body trackers could be used for tracking student behaviour in online assessment. The results have provided practical guidance for Assessment Micro-Analytics to integrate this functionality into their products, and will form the basis of further grant applications and collaborative projects. The collaboration resulted in an EIRA case study. (Link not yet available.)
Start Year 2020
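The entry above refers to fitting body landmarks with existing tools but does not name a specific tool. Purely as an illustration, the sketch below runs one widely used off-the-shelf detector (MediaPipe Pose, an assumption on our part, with a placeholder file name) on a single image and prints pixel-space landmark positions.

```python
# Hypothetical example of off-the-shelf body landmark fitting on one image.
# MediaPipe Pose and "frame.png" are illustrative assumptions.
import cv2
import mediapipe as mp

image = cv2.imread("frame.png")                  # placeholder file name
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)     # MediaPipe expects RGB input

with mp.solutions.pose.Pose(static_image_mode=True) as pose:
    results = pose.process(rgb)

if results.pose_landmarks:
    h, w = image.shape[:2]
    for i, lm in enumerate(results.pose_landmarks.landmark):
        # Landmarks are normalised to [0, 1]; scale to pixel coordinates.
        print(i, int(lm.x * w), int(lm.y * h), round(lm.visibility, 2))
```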
 
Description FaceMe/Uneeq 
Organisation Uneeq
Country New Zealand 
Sector Private 
PI Contribution We have been working with Uneeq (formerly FaceMe) to design a dataset of face and body motion along with speech.
Collaborator Contribution Uneeq will record the data and make it available to our research team. This will be valuable to the project since they have the resources to capture high quality facial and body motion.
Impact None as yet.
Start Year 2018
 
Description Tongue research with CMU 
Organisation Carnegie Mellon University
Country United States 
Sector Academic/University 
PI Contribution We analysed tongue electromagnetic articulography (EMA) data to investigate lateral tongue motion during speech (a minimal analysis sketch follows this entry).
Collaborator Contribution The partner provided tongue motion data and worked with us on the analysis.
Impact Publication at Interspeech 2021 (doi: 10.21437/Interspeech.2021-1732), and ongoing discussion.
Start Year 2021
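As a rough illustration of the kind of lateral-motion measurement described in this entry, the sketch below summarises the left-right displacement of a single synthetic EMA coil trajectory. The axis convention and the data are assumptions for illustration only; the published analysis may differ.

```python
# Loose sketch: summarise lateral (left-right) displacement of one EMA coil.
# Axis convention, coil behaviour, and values are synthetic assumptions.
import numpy as np

def lateral_stats(coil_xyz):
    """coil_xyz: (frames, 3) array of x (front-back), y (lateral), z (up-down) in mm."""
    lateral = coil_xyz[:, 1] - coil_xyz[:, 1].mean()   # centre the lateral axis
    return {
        "rms_mm": float(np.sqrt(np.mean(lateral ** 2))),
        "range_mm": float(lateral.max() - lateral.min()),
    }

# Synthetic stand-in for a tongue-tip coil sampled at 200 Hz for 2 seconds.
t = np.linspace(0, 2, 400)
coil = np.stack([10 * np.sin(2 * np.pi * t),        # front-back movement
                 1.5 * np.sin(2 * np.pi * 3 * t),   # small lateral wobble
                 5 * np.cos(2 * np.pi * t)], axis=1)
print(lateral_stats(coil))
```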
 
Description Talk at Norwich Science Festival 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact I gave a talk on my research at the Norwich Science Festival. Approximately 60 people attended, and 5 to 10 stayed afterwards to discuss the work and the wider applications of the approaches. Representatives of a few local companies passed on their business cards for further discussion of possible collaboration with the School of Computing Sciences.
Year(s) Of Engagement Activity 2019
URL https://norwichsciencefestival.co.uk/events/automatically-animating-faces/
 
Description Talk at School 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Schools
Results and Impact Around 100 pupils attended a remote school event at which I presented my research. Questions and discussion followed, and the school reported increased interest in related subject areas.
Year(s) Of Engagement Activity 2021