Improving audio-visual speech recognition with augmented facial-mapping.

Lead Research Organisation: University of Southampton
Department Name: Faculty of Engineering & the Environment

Abstract

Research questions:
Can audio-visual speech recognition be improved by augmenting it with emerging facial mapping technology?
Can the application of real-time 3D face mapping and sound compartmentalisation improve audio-visual speech recognition accuracy?

Potential applications
At the time of writing, no known research exists on the use of the TrueDepth camera's facial recognition for audio-visual speech recognition. This may be due to the infancy of the technology. The potential applications for an improved, integrated audio-visual speech recognition system are:

Improved human-computer interaction for AI systems.
A cheaper means of autonomous speech therapy.
Language learning.

Objectives and Aims
This research will apply machine learning principles to develop a more effective end-to-end solution for speech and facial (visual speech) recognition. This will then be used, through a precise feedback engine, to improve human accuracy and communication in these areas. The objective is to integrate the latest infrared and proximity sensors used for real-time face mapping in order to improve audio-visual speech recognition.

Methodology
As this research is inherently interdisciplinary between computer science and linguistics, this paper will first investigate current deep learning audio-visual speech recognition methodologies and broader historical speechreading and natural language processing techniques. This paper will then explore the individual accuracy of Apple's TrueDepth camera in terms of its potential application for visual speech recognition. The TrueDepth system is primarily used for facial recognition and animation, and is essentially the same technology contained within Microsoft's 3D tracking Kinect accessory. This has since been miniaturised and improved by a middleware layer of machine learning software, to achieve the real-time mapping and articulation of 37 facial features with millimetre accuracy. This research will first test the TrueDepth camera's recognition accuracy on a set of visemes (visual phonemes) by recording a large native language learning dataset and iterating through a supervised deep learning algorithm. Once an acceptable level of viseme recognition accuracy is achieved, this will then be combined with an existing audio-based speech recognition engine. The final stage will assess whether the augmentation of the TrueDepth camera system results in a statistically significant improvement when tested against standalone speech recognition engines.
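
One way the final comparison could be framed is as a paired test on per-utterance outcomes, assuming both the standalone engine and the augmented system are scored on the same test utterances. The sketch below uses an exact McNemar test for illustration only. The proposal does not specify a statistical test, and the function name and input format are hypothetical.

```python
from math import comb


def mcnemar_exact(audio_only_correct, audio_visual_correct):
    """Two-sided exact McNemar test on paired per-utterance outcomes.

    Both arguments are equal-length sequences of booleans, one entry per
    test utterance, saying whether each system recognised it correctly.
    (Hypothetical input format, assumed for this sketch.)
    Returns (only_audio, only_av, p_value).
    """
    if len(audio_only_correct) != len(audio_visual_correct):
        raise ValueError("both systems must be scored on the same utterances")

    pairs = list(zip(audio_only_correct, audio_visual_correct))
    # Discordant pairs: exactly one of the two systems was correct.
    only_audio = sum(1 for a, v in pairs if a and not v)
    only_av = sum(1 for a, v in pairs if v and not a)

    n = only_audio + only_av
    if n == 0:
        # The systems never disagree, so there is no evidence either way.
        return only_audio, only_av, 1.0

    # Under the null hypothesis, discordant pairs split 50/50,
    # so the smaller count follows Binomial(n, 0.5).
    k = min(only_audio, only_av)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return only_audio, only_av, p_value
```

A small p-value would indicate that the two systems differ on the utterances where they disagree. Whether the difference favours the augmented system is read from the two discordant counts.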

Planned Impact

The proposed CDT in Web Science Innovation will have significant economic and societal impact, as it develops a substantial cohort of students equipped to navigate the disruptive transition to a digital economy. The training methods utilised by the CDT are based on a model of intensive industry partnership, thereby situating students directly in contemporary industry contexts and engaging them in a range of communications with industrial partners. This training context will develop important leadership skills, and will contribute to the formation of a better-skilled and more entrepreneurial workforce in the Digital Economy. Graduates will be able to understand the challenges and opportunities of the web from a variety of disciplinary perspectives, and will therefore have an impact on a range of local, national and international businesses within diverse sectors. As the proposed CDT combines both technical and societal approaches to Web Science, graduates will be able to identify and deploy effective digital solutions that have social traction. This will have considerable social impact, creating a workforce capable of a holistic and therefore more effective approach to innovating in the digital economy.

Engagement with government will also allow for impact on a policy level. As interdisciplinary approaches to topical issues are developed through the CDT training, a cohort of graduates who can analyse and synthesise across perspectives will develop, providing cogent expert advice to policy makers. This societal impact will be significant, as key contemporary topics, such as online privacy or internet child pornography, are becoming increasingly complex and significant.

The CDT will also cultivate graduates who are adept at public engagement and outreach, having developed skills by engaging in a range of public activities throughout the course of their training. The ability to communicate broadly and clearly with a range of audiences, and to engage as leaders in a broad economic field, will be at the heart of the training.

The research undertaken by the postgraduate cohort, directed towards Web Science Innovation, and conducted in close co-operation with a network of industry partners, will generate significant new intellectual property within the UK economy. A particular innovation focus for the CDT will be Open Data, which amplifies the opportunities for value creation downstream from the original data creators and publishers. Our students will have the skills and opportunities to develop a range of novel and socially authentic Web services, through partnerships brokered by the Open Data Institute with government organisations, large firms, SMEs and startups.
