Understanding Speech by Leveraging Both Audio and Lexical Information Channels

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

The over-arching research area of this project is spoken language processing (SLP) with a focus on improving methods for extracting information from spontaneous, natural, spoken communication.

The key objectives of this project are to exploit information encoded in the audio channel of spoken language, both to improve methods for extracting the semantic content of speech and to deepen our understanding of spoken conversations.

The past decade has seen remarkable advances in NLP research; however, one medium that has yet to benefit to the same degree is spontaneous, spoken dialogue. This modality of communication is arguably the most fundamental: humans and their languages have evolved side by side, optimising the ability to express and understand dialogue. There is already massive public interest in enabling human-machine interaction through conversational speech, as demonstrated by the burgeoning development of systems like Alexa and Siri and by academic research into understanding human-human conversation. However, much of modern NLP has been developed with an almost exclusive focus on written text.

Though related, spoken and written domains are intrinsically different. Spoken language provides a second, distinct channel of information transfer: audio. The voice contains prosodic and temporal cues which convey semantic and syntactic information alongside the lexical content of speech. The discrepancy between written and spoken language widens when examining spontaneous, spoken dialogue, which is generated incrementally and riddled with disfluencies. These features provide a rich bandwidth for information transmission which humans exploit effectively and with little to no effort; current SLP methods, however, make almost no use of them. Standard SLP pipelines use a cascading architecture in which audio is used only to produce a transcript, and downstream tasks are performed entirely on the written words.
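The information loss in such a cascading pipeline can be sketched as follows. This is a minimal, hypothetical illustration, not any particular system: `transcribe` and `classify` are stand-ins for an ASR component and a downstream text-only NLP model, and the utterance representation is invented for the example. The point is that prosodic cues present in the audio are discarded at the transcription step and so can never influence the downstream task.

```python
# Minimal sketch of a cascading SLP pipeline (hypothetical components).
# Audio carries both lexical content and prosodic cues, but only the
# transcript survives the first stage.

def transcribe(audio: dict) -> str:
    """Stand-in ASR: returns only the lexical content, discarding prosody."""
    return audio["words"]

def classify(transcript: str) -> str:
    """Stand-in downstream task operating on text alone,
    e.g. dialogue-act tagging from surface form."""
    return "question" if transcript.endswith("?") else "statement"

# An utterance whose rising final pitch marks it as a question,
# even though the words alone read as a statement.
utterance = {"words": "you are leaving",
             "prosody": {"final_pitch": "rising"}}

transcript = transcribe(utterance)   # prosodic information is lost here
label = classify(transcript)
print(label)  # -> "statement": the rising pitch never reaches the classifier
```

A human listener would likely hear this utterance as a question; the text-only cascade cannot, which is exactly the gap the project targets.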

Given the relatively young state of the field of SLP, there is little agreement on meaningful tasks and corresponding evaluations of speech representations. Thus, the first component of this project will involve a critical review of current approaches. The second component will focus on augmenting representations of speech with audio-based information. Finally, we will evaluate which information channels are important, and when, for both machine and human understanding.

As research currently stands, the vast majority of spoken interaction cannot be adequately processed by machines. Improving methods for conversation comprehension could provide a much more natural and accessible interface between people and technology, and unlock the swathes of information transmitted in spoken language. I am particularly interested in health applications. Speech is one of the most informative observable probes of neurological health; however, its use in practice has been limited. Improving our understanding of human spoken interaction could enable the detection of abnormal speech, allowing diseases like Alzheimer's to be diagnosed much earlier and more efficiently.

Publications


Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/T517884/1                                      01/10/2020   30/09/2025
2424038             Studentship    EP/T517884/1   01/09/2020   29/02/2024   Sarenne Wallbridge