Scaling up Statistical Spoken Dialogue Systems for real user goals using automatic belief state compression

Lead Research Organisation: Heriot-Watt University
Department Name: School of Mathematical and Computer Sciences

Abstract

Spoken dialogue systems (SDS) are increasingly being deployed in a variety of commercial applications, ranging from traditional call centre automation (e.g. travel information) to new "troubleshooting" or customer self-service lines (e.g. help fixing broken internet connections). SDS are notoriously fragile (especially to speech recognition errors), do not offer natural ease of use, and do not adapt to different users. One of the main problems for SDS is to maintain an accurate view of the user's goals in the conversation (e.g. find a good Indian restaurant nearby, or repair a broadband connection) under uncertainty, and thereby to compute the optimal next system dialogue action (e.g. offer a restaurant, ask for clarification). Recent research in statistical spoken dialogue systems (SSDS) has successfully addressed aspects of these problems but, we shall show, it is currently hamstrung by an impoverished representation of user goals, which has been adopted to enable tractable learning with standard techniques.

In the field as a whole, currently only small and unrealistic dialogue problems (usually fewer than 100 searchable entities) are tackled with statistical learning methods, for reasons of computational tractability. In addition, current user goal state approximations in SSDS make it impossible to represent some plausible user goals, e.g. someone who wants to know about nearby cheap restaurants and high-quality ones further away. This renders dialogue management sub-optimal and makes it impossible to deal adequately with the following types of user utterance: "I'm looking for French or Italian food" and "Not Italian, unless it's expensive". User utterances with negations and disjunctions of various sorts are very natural, and exploit the full power of natural language input, but current SSDS are unable to process them adequately. Moreover, much work in dialogue system evaluation shows that real user goals are generally sets of items with different features, rather than a single item: people like to explore possible trade-offs between features of items.

Our main proposal is therefore: a) to develop realistic large-scale SSDS with an accurate, extended representation of user goals, and b) to use new Automatic Belief Compression (ABC) techniques to plan over the large state spaces thus generated. Techniques such as Value-Directed Compression demonstrate that compressible structure can be found automatically in the SSDS domain (for example, compressing a test problem of 433 states to 31 basis functions). These techniques have their roots in methods for handling the large state spaces required for robust robot navigation in real environments, and may lead to breakthroughs in the development of robust, efficient, and natural human-computer dialogue systems, with the potential to radically improve the state of the art in dialogue management.
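To make the compression idea concrete, the following is a minimal sketch in standard POMDP notation, following the published Value-Directed Compression formulation (Poupart and Boutilier, 2002); the symbols used here (b, F, T, r) are illustrative and are not taken from the proposal itself.

% Belief tracking: the dialogue manager maintains a distribution b over
% hidden user-goal states s, updated after each system action a and
% observation o (the recognised user utterance):
\[
  b'(s') \;=\; \frac{O(o \mid s', a)\,\sum_{s} T(s' \mid s, a)\, b(s)}{\Pr(o \mid a, b)}
\]
% Value-Directed Compression seeks a linear map F with k rows and n
% columns, k << n, giving a compressed belief \tilde{b} = F b. Writing
% the unnormalised update as b' = T^{a,o} b, the compression is lossless
% for planning whenever rewards and dynamics factor through F:
\[
  r_a = F^{\top} \tilde{r}_a
  \qquad\text{and}\qquad
  F\, T^{a,o} = \tilde{T}^{a,o} F \quad \text{for all } a, o,
\]
% in which case V(b) = \tilde{V}(F b), so planning can run entirely in
% the k-dimensional compressed space (e.g. 31 basis functions rather
% than 433 states in the test problem mentioned above).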

Publications

 
Description This project has helped to develop more robust, efficient, and natural human-computer speech interfaces. Such interfaces are increasingly used in everyday life -- for example in the Apple iPhone speech interface "Siri" and Google's "Now" and Voice Search applications. In this project we experimented with new computational models and statistical machine learning methods for tackling two main problems for such interfaces: 1) allowing users of speech systems to express more complex and natural goals, and 2) scaling these systems up to handle larger spoken dialogue problems. To do this, we invented new representations of complex user goals (for example "I want French food, or else Italian if there's one close to me"), and we investigated techniques for "Automatic Belief Compression" that allow such large-scale, high-dimensional computational problems to be reduced to a lower, more tractable dimension, as sketched below.
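As a rough illustration of the general idea (and not the project's exact algorithms, which are described in the project publications), a linear belief compression can be sketched as a truncated principal-component decomposition over sampled belief vectors. The function names, the use of plain PCA, and the toy data below are assumptions for exposition only.

import numpy as np

def fit_belief_compression(beliefs, k):
    # beliefs: (m, n) array; each row is a probability distribution over
    # n dialogue states (e.g. logged from a running system).
    mean = beliefs.mean(axis=0)
    # SVD of the centred sample matrix yields the principal subspace.
    _, _, vt = np.linalg.svd(beliefs - mean, full_matrices=False)
    return mean, vt[:k]                      # basis has shape (k, n)

def compress(b, mean, basis):
    # Project an n-dimensional belief onto k compressed coordinates.
    return basis @ (b - mean)

def decompress(b_tilde, mean, basis):
    # Reconstruct an approximate belief and renormalise it.
    b = mean + basis.T @ b_tilde
    b = np.clip(b, 0.0, None)                # remove tiny negative values
    return b / b.sum()

# Toy usage: 500 sampled beliefs over 433 states, compressed to 31
# dimensions (mirroring the 433-to-31 example in the abstract).
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.full(433, 0.1), size=500)
mean, basis = fit_belief_compression(samples, k=31)
b = samples[0]
b_hat = decompress(compress(b, mean, basis), mean, basis)
print("L1 reconstruction error:", np.abs(b - b_hat).sum())

One design note: plain PCA minimises squared reconstruction error, which is not ideal for probability vectors; the robot-navigation work mentioned in the abstract uses an exponential-family variant (E-PCA) for exactly this reason, and this project compared the effectiveness of several such compression methods.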



In practical terms, we developed and deployed real telephone-based speech interfaces that implemented these ideas, and we tested them both in simulation and with members of the public, using crowdsourcing methods. We collected and analysed data from 2,193 calls by 85 users.


Our key findings have been that methods for automatically compressing such problems can produce speech systems that are almost as effective as those for which expert human designers have hand-crafted a suitable lower-dimensional problem space. We also developed new knowledge about the effectiveness of a variety of automatic compression methods. In addition, we developed a new method for automatic belief compression which overcomes several problems with previous approaches.



We published a number of conference papers reporting this work, and contributed to 2 books on new statistical learning methods for the development of speech interfaces. We have recently written 3 journal papers reporting our findings.
Exploitation Route The outputs of this research can be used in industrial and commercial development of novel speech and natural language interfaces, such as future variants and extensions of Apple's iPhone speech interface Siri and Google's Now and Voice Search applications. Similar future application domains include interaction with virtual characters in areas such as education, healthcare, games, automated customer service, and human-robot interaction. Other applications are in hands-busy and eyes-busy operating situations, such as while driving and in medical contexts, where speech interfaces to information services are useful. In addition, advanced and natural speech interfaces are useful for blind, disabled, and ageing users who cannot easily use traditional interaction devices such as keyboards and screens. Finally, speech interfaces can be used to open up information services for illiterate users, for example in some developing countries. More generally, this research can be used in new interfaces and technologies for human-computer interaction -- in particular in future speech interfaces and multimodal systems (interfaces which combine human communication channels such as speech, gesture, facial expression, body pose, gaze, graphics, natural language, and touch). The research allows more natural expression of user search goals using natural language, and develops computational methods for decision-making in such systems. The exploitation routes are therefore primarily in speech and multimodal interfaces, for example those used with mobile phones, in cars, in human-robot interaction, or by disabled users.
Sectors Digital/Communication/Information Technologies (including Software)

URL https://sites.google.com/site/abcpomdp/
 
Description The outputs of this research are useful in industrial and commercial development of novel speech and natural language interfaces, such as future variants and extensions of Apple's iPhone speech interface Siri, Microsoft's Cortana, and Google's Now and Voice Search applications. Similar future application domains include interaction with virtual characters in areas such as education, healthcare, games, automated customer service, and human-robot interaction. Other applications are in hands-busy and eyes-busy operating situations, such as while driving and in medical contexts, where speech interfaces to information services are useful. In addition, advanced and natural speech interfaces are useful for blind, disabled, and ageing users who cannot easily use traditional interaction devices such as keyboards and screens. Finally, such advanced speech interfaces can be used to open up information services for illiterate users, for example in some developing countries.
First Year Of Impact 2012
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Societal, Economic

 
Description Amazon Alexa Challenge 2017
Amount $100,000 (USD)
Organisation Amazon.com 
Sector Private
Country United States
Start 11/2016 
End 11/2017
 
Description EC FP7 ICT grant: SpaceBook
Amount £645,000 (GBP)
Funding ID 270019 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 03/2011 
End 02/2014
 
Description EC FP7 ICT project: JAMES: Joint Action for Multimodal Embodied Social Systems
Amount € 3,209,918 (EUR)
Funding ID 270435 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 02/2011 
End 09/2014
 
Description ERC Advanced Research Grant (STAC)
Amount € 1,930,000 (EUR)
Funding ID 269427 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 06/2011 
End 05/2017
 
Description Horizon 2020 ICT : MuMMER project - Multimodal Mall Entertainment Robot
Amount € 900,000 (EUR)
Funding ID 688147 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 03/2016 
End 02/2020
 
Title ABC dialogue management algorithms 
Description A set of new statistical algorithms for spoken dialogue management -- see Crook et al. 2014 
Type Of Material Computer model/algorithm 
Year Produced 2014 
Provided To Others? Yes  
Impact Some of the algorithms developed are used for dialogue management in current/recent projects such as the EC FP7 projects PARLANCE, SpaceBook, and JAMES 
URL https://sites.google.com/site/abcpomdp/home
 
Title Spoken dialogue data - ABC 
Description A collection of real user spoken dialogues with our automated dialogue systems, as described in Crook et al. 2014 
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes  
Impact Use of data in EC FP7 projects such as SpaceBook and PARLANCE 
URL https://sites.google.com/site/abcpomdp/home
 
Title End-to-end statistical spoken dialogue systems software and architecture 
Description Automated spoken dialogue system using a fully statistical end-to-end architecture (see publications). 
Type Of Technology Webtool/Application 
Year Produced 2009 
Impact Used in subsequent projects, e.g. an EPSRC follow-on project and EC FP7 projects such as SpaceBook and PARLANCE