Dynamically Accurate Avatars
Lead Research Organisation: University of East Anglia
Department Name: Computing Sciences
Abstract
Our bodies move as we speak. Evidently, movement of the jaw, lips and tongue is required to produce coherent speech. Furthermore, additional body gestures both synchronise with the voice and significantly contribute to speech comprehension. For example, a person's eyebrows rise when they are stressing a point, their head shakes when they disagree, and a shrug might express doubt.
The goal is to build a computational model that learns the relationship between speech and upper body motion so that we can automatically predict face and body posture for any given audio speech. The predicted body pose can be transferred to computer graphics characters, or avatars, to automatically create character animation directly from speech, on the fly.
A number of approaches have previously been used for mapping from audio to facial motion or head motion, but the limited amount of speech and body motion data that is available has hindered progress. Our research programme will use a field of machine learning called transfer learning to overcome this limitation.
Our research will be used to automatically and realistically animate the face and upper body of a graphics character in sync with a user's voice in real time. This is valuable for a) controlling the body motion of avatars in multiplayer online gaming, b) driving a user's digital presence in virtual reality (VR) scenarios, and c) automating character animation in television and film production. The work will enhance the realism of avatars during live interaction between users in computer games and social VR without the need for full body tracking. Additionally, we will significantly reduce the time required to produce character animation by removing the need for expensive and time-consuming hand-animation or motion capture.
We will develop novel artificial intelligence approaches to build a robust speech-to-body motion model. For this, we will design and collect a video and motion capture dataset of people speaking, and this will be made publicly available.
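As a purely illustrative sketch, and not the project's actual architecture, the transfer-learning idea described above could be set up in PyTorch as follows: a speech encoder pretrained on a large audio corpus is kept frozen, and only a small recurrent head is trained on the comparatively scarce speech-and-motion data to regress per-frame body pose. The feature dimension, joint count and the identity stand-in for the encoder are assumptions made for this example.

```python
# Illustrative sketch only: transfer learning for speech-to-body-motion prediction.
import torch
import torch.nn as nn

class SpeechToPose(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, feat_dim: int = 768, num_joints: int = 20):
        super().__init__()
        self.encoder = pretrained_encoder          # assumed pretrained on a large speech corpus
        for p in self.encoder.parameters():        # freeze the encoder; only the head is trained
            p.requires_grad = False
        self.head = nn.GRU(feat_dim, 256, batch_first=True)
        self.out = nn.Linear(256, num_joints * 3)  # e.g. 3 rotation parameters per joint

    def forward(self, audio_feats):                # (batch, frames, feat_dim)
        with torch.no_grad():
            feats = self.encoder(audio_feats)
        h, _ = self.head(feats)
        return self.out(h)                         # (batch, frames, num_joints * 3)

# Toy usage with an identity stand-in for a real pretrained encoder.
model = SpeechToPose(nn.Identity())
pose = model(torch.randn(1, 100, 768))             # 100 audio frames -> 100 pose frames
```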
The project team comprises Dr. Taylor and a postdoctoral research associate (PDRA) at the University of East Anglia, Norwich, UK.
Planned Impact
Economic Impact
The global games audience is estimated at around 2.4 billion people and the global market is expected to grow to an estimated $129 billion by the end of 2020. The UK consumer spend on games was valued at £4.33bn in 2016, with a record £1.2bn coming from online game sales, so the opportunities for the UK online games industry have never been greater. At present, a player can speak with other players during live gameplay, yet their avatar does not move in sync with their speech. Our software will add significant value to the games industry since it will address this challenge and yield a more compelling gaming experience.
A further £61 million of the UK consumer spend on video games came from the sale of virtual reality (VR) hardware. VR is a fast-developing sector of the creative digital industries, and our technology will allow a user's digital presence to move in sync with their voice without the need for intrusive and expensive full body tracking.
Our methods will significantly reduce the time required to produce character animation by removing the need for expensive and time-consuming hand-animation or motion capture. Consequently, the cost of production will reduce accordingly.
Scientific Impact
Our project introduces a ground-breaking technique for creating highly realistic character animation automatically for any speech. Thus, we envisage that computer graphics researchers will shift their focus towards improving the fidelity of real-time rendered graphics characters, which will consequently expedite the advancement of human realism in computer graphics.
We expect the academic impact of our work in the field of psychology to be considerable. Our technology allows the dynamics or the appearance of an animated character to be manipulated in precise ways, which would, for example, allow psychologists to conduct experiments that dissociate human behaviour from appearance.
Societal Impact
Our technology will bring an equivalent level of realism to potentially every animated production and every computer game, and it will be available for all game content and not just for cut-scenes. Furthermore, it will be possible to generate character animation dynamically and in response to actions by the player. This will be a significant step forward for an industry that strives for ever more realistic content, and crucially will provide children with characters that are consistently animated with realistic face and body behaviours at a time when their own speech is developing.
The proposed research can also be used in social interaction training tools for people with autism spectrum disorder (ASD), who can use the technology for practising conversations and for learning how to interpret human emotions. This research has the potential to positively impact the lives of the 700,000 people in the UK and 3.5 million in the USA alone who have been diagnosed with ASD. We will ensure that colleagues in the relevant faculties and institutions are kept informed of the research, and we will work with them to develop applications through future bids to the research councils.
Outreach and Engagement
Dr. Taylor will continue to deliver lectures as part of outreach events at local schools and, since our work will have influenced the content of the computer games that these students play, she will be able to demonstrate that cutting-edge computing science research at UEA has practical use. It will help students to understand how characters in animated shows are brought to life, and inspire them to get involved with science.
We will interactively demo the work at the Norwich Science Festival and create character animations using voices recorded from members of the public. The video, augmented with the UEA logo, will be emailed to them and they will be encouraged to share it on social media, broadening public awareness of the university and of the research.
Organisations
- University of East Anglia (Lead Research Organisation)
- AHRC (Co-funder)
- Assessment Micro-Analytics (AMA) (Collaboration)
- Carnegie Mellon University (Collaboration)
- Uneeq (Collaboration)
- FXhome Limited (Project Partner)
- The Foundry Visionmongers Ltd (UK) (Project Partner)
- SyncNorwich (Project Partner)
- Emteq Ltd (Project Partner)
- FaceMe (Project Partner)
Publications
- Greenwood D (2019) Joint Estimation of Face and Camera Pose from a Collection of Images
- Taylor S (2021) Speech-Driven Conversational Agents using Conditional Flow-VAEs
- Thangthai A (2019) Synthesising visual speech using dynamic visemes and deep learning architectures, in Computer Speech & Language
- Websdale D (2022) Speaker-Independent Speech Animation Using Perceptual Loss Functions and Synthetic Data, in IEEE Transactions on Multimedia
- Windle J (2022) Arm motion symmetry in conversation, in Speech Communication
- Zhou H (2021) Self-Supervised Monocular Depth Estimation with Internal Feature Fusion, in 32nd British Machine Vision Conference, BMVC 2021
| Description | A new deep learning architecture has been developed for predicting body motion from speech. The method outperforms the state of the art in this field, and we expect the approach to generalise to many other applications. |
| Exploitation Route | A paper has been published and the code is available. |
| Sectors | Creative Economy Digital/Communication/Information Technologies (including Software) |
| Description | EPSRC DTP PhD Studentship |
| Amount | £77,556 (GBP) |
| Organisation | University of East Anglia |
| Sector | Academic/University |
| Country | United Kingdom |
| Start | 09/2020 |
| End | 02/2024 |
| Title | UEA Digital Humans Dataset |
| Description | The dataset contains many hours of actors speaking. It contains natural dialogue, acted expressive monologue and heated debates. The actors are filmed using 3 cameras from different angles so that we can reconstruct their 3D body motion (an illustrative triangulation sketch follows this record). The actors are required to sign a model release form so that we can freely distribute the dataset once capture is complete. We are currently in the process of capturing this dataset and it will be made publicly available when complete. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2019 |
| Provided To Others? | No |
| Impact | We are currently using this data to learn a model to predict body motion from speech. As the database grows, the new data will be used to improve the generalisability of the model to new speakers. |
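Purely as an illustration of the multi-camera reconstruction step mentioned in the dataset description above, and not the project's actual pipeline, a 3D joint position can be triangulated from matching 2D detections in two calibrated views, for example with OpenCV. The projection matrices and image points below are made-up values.

```python
# Illustrative sketch: triangulating one body joint from two calibrated camera views.
import numpy as np
import cv2

# Hypothetical 3x4 projection matrices for two calibrated cameras.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])

# Matching 2D detections of one joint in each view (made-up coordinates).
pt1 = np.array([[0.31], [0.42]])
pt2 = np.array([[0.29], [0.42]])

X_h = cv2.triangulatePoints(P1, P2, pt1, pt2)   # homogeneous 4x1 result
X = (X_h[:3] / X_h[3]).ravel()                  # 3D joint position
print(X)
```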
| Description | EIRA project with Assessment Micro-Analytics |
| Organisation | Assessment Micro-Analytics (AMA) |
| Country | United Kingdom |
| Sector | Private |
| PI Contribution | I led a research team to explore the efficacy of automated detection of the face and body in video for human gesture and expression recognition. The team found that automatic detection of a set of body landmarks is possible using existing tools (an illustrative sketch follows this record), and provided full code for fitting to an image. A set of recommendations was made for maximising detection accuracy by controlling the capture environment. The team also explored the performance of face trackers on a diverse population, which revealed that detections on images of subjects from certain ethnic groups were more accurate than those on others. The exploratory research also found that detections on the younger age group achieved good accuracy. Finally, a pipeline for processing multimodal data in a machine learning framework for human behaviour recognition was proposed. |
| Collaborator Contribution | The partner brought expertise and knowledge of real-world challenges. |
| Impact | This was a proof-of-concept research project to determine whether existing face and body trackers could be used for tracking student behaviour in online assessment. The results have provided practical guidance for Assessment Micro-Analytics to integrate this functionality into their products, and will form the basis of further grant applications and collaborative projects. The collaboration resulted in an EIRA case study. (Link not yet available.) |
| Start Year | 2020 |
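As a minimal illustration of the kind of off-the-shelf body landmark detection referred to in the record above, and not the code delivered to the partner, the open-source MediaPipe Pose detector can return a set of normalised body landmarks from a single image. The image filename below is hypothetical.

```python
# Illustrative sketch: body landmark detection on one image with MediaPipe Pose.
import cv2
import mediapipe as mp

image = cv2.imread("frame.jpg")                             # hypothetical input image
with mp.solutions.pose.Pose(static_image_mode=True) as pose:
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    for i, lm in enumerate(results.pose_landmarks.landmark):
        print(i, lm.x, lm.y, lm.visibility)                 # normalised image coordinates
```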
| Description | FaceMe/Uneeq |
| Organisation | Uneeq |
| Country | New Zealand |
| Sector | Private |
| PI Contribution | We have been working with Uneeq (formerly FaceMe) to design a dataset of face and body motion along with speech. |
| Collaborator Contribution | Uneeq will record the data and make it available to our research team. This will be valuable to the project since they have the resources to capture high quality facial and body motion. |
| Impact | None as yet. |
| Start Year | 2018 |
| Description | Tongue research with CMU |
| Organisation | Carnegie Mellon University |
| Country | United States |
| Sector | Academic/University |
| PI Contribution | We performed an analysis of tongue EMA (electromagnetic articulography) data to investigate lateral tongue motion during speech (an illustrative sketch of this kind of analysis follows this record). |
| Collaborator Contribution | Provided tongue motion data and worked together on analysis. |
| Impact | Publication at Interspeech 2021 (doi: 10.21437/Interspeech.2021-1732), and ongoing discussion. |
| Start Year | 2021 |
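As an illustration of the type of analysis described in the record above, and a deliberate simplification of the published work, lateral tongue movement from an EMA sensor trajectory could be summarised as follows. The axis convention and the synthetic trajectory are assumptions made for the example.

```python
# Illustrative sketch: summarising lateral (left-right) motion of one EMA sensor.
import numpy as np

def lateral_motion_stats(traj_mm: np.ndarray, lateral_axis: int = 1) -> dict:
    lateral = traj_mm[:, lateral_axis]
    lateral = lateral - lateral.mean()              # centre on the sensor's mean position
    return {
        "rms_mm": float(np.sqrt(np.mean(lateral ** 2))),
        "range_mm": float(lateral.max() - lateral.min()),
    }

# Synthetic example trajectory: 500 frames of small lateral oscillation.
t = np.linspace(0, 5, 500)
traj = np.stack([np.zeros_like(t), 1.5 * np.sin(2 * np.pi * t), np.zeros_like(t)], axis=1)
print(lateral_motion_stats(traj))
```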
| Description | Talk at Norwich Science Festival |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | Regional |
| Primary Audience | Public/other audiences |
| Results and Impact | I gave a talk at Norwich Science Festival on my research. Approximately 60 people came along, and between 5 and 10 stayed afterwards to discuss the work and the wider applications of the approaches. Representatives of a few local companies passed on their business cards for further discussion on possible collaboration with the School of Computing Sciences. |
| Year(s) Of Engagement Activity | 2019 |
| URL | https://norwichsciencefestival.co.uk/events/automatically-animating-faces/ |
| Description | Talk at School |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | National |
| Primary Audience | Schools |
| Results and Impact | Around 100 pupils attended a remote school event at which I presented my research. Questions and discussion followed, and the school reported increased interest in related subject areas. |
| Year(s) Of Engagement Activity | 2021 |
