Exemplar-based Expressive Speech Synthesis

Lead Research Organisation: University of Sheffield
Department Name: Computer Science

Abstract

Synthetic voices are becoming ubiquitous: `smart' speakers at home, announcement systems on public transport, and voice-enabled assistants on call lines. There exists a strong public demand for `smarter' assistants capable of laughing at our jokes; interacting with our children as encouraging and empathic tutors; calling to check up on our parents; providing a reassuring `ear' for an isolated person; and offering calming and supportive virtual therapy. To support current and future applications, voice synthesis technology needs to satisfy a number of requirements. First, it needs to be customisable for rapid research and development, and second, it needs to be able to produce any spoken content, including expressive voice characteristics. However, none of the current synthesis technologies can satisfy all of these requirements simultaneously. For instance, current non-machine learning approaches allow pre-recorded phrases to be efficiently combined into complete sentences, but any phrase that is missing must be recorded first, which limits their flexibility and efficiency. Current machine learning models, on the other hand, can seamlessly synthesise any spoken content. However, creating such models is a very costly, time-consuming and computationally demanding process. Furthermore, these models offer very limited control over voice characteristics and lack interpretability, both of which are highly desirable in research and commercial settings.

The objective of this project is to develop computationally efficient, customisable, expressive and interpretable speech synthesis by drawing on the concept of `exemplars' from cognitive science.

In the field of cognitive science, the notions of `exemplars' and `prototypes' form part of a prominent view on how humans categorise concepts. In particular, exemplar theory argues that singular examples, rather than prototypes (an average of examples), form the basic building blocks of how we understand and interact with the world. The key argument in favour of exemplar theory is our ability as humans to solve complex tasks based on just a few examples, which makes the theory appealing for applications that involve complex phenomena or that require high computational efficiency. Furthermore, expressive speech synthesis combines expressivity and speech production, two complex phenomena that remain poorly understood. Unlike prototype theory, exemplar theory makes it possible, at least in theory, to produce expressive speech, provided that at least one recording of the desired spoken content and one recording featuring the desired expressivity are available. Lastly, adopting exemplar theory promotes transparency in the decision-making process through the use of real examples that can be inspected, modified, replaced or added within the task.

The objective will be achieved through three innovative means: i) formulating a methodological framework for exemplar-based speech synthesis, ii) building an exemplar-based representation of speech expressivity from pre-recorded examples, and iii) presenting a novel methodology for integrating this expressivity-based representation into the framework of i).

Publications

 
Description The award made it possible to set up a speech synthesis group at the host institution by adding a more senior researcher alongside the only PhD student working in the area at the time. This has led to a number of speech synthesis activities in the group, such as a reading and discussion club, mock challenges, and mentoring.

The award objectives have been pursued in two directions as part of a risk mitigation strategy. The first direction was described in the project proposal and has been led by the hired research associate. The second direction, opened to mitigate risk, is a more constrained form of the first in which decisions as to which exemplars to combine are hard (and hence more interpretable) rather than soft. As an additional risk mitigation measure, the publication of research findings may be moved to the post-award stage.

The risk mitigation strategy within the primary direction was introduced mainly because of the large search space of possible designs capable of meeting the original objectives. This suggests that follow-up work on the most promising alternative designs will be conducted by the PI once the current award period expires.
Exploitation Route The hired research associate and the PI have produced a large experimental framework that enables others to experiment with exemplar-based speech synthesis and to explore variants of the proposed research not covered within the current award.
Sectors Digital/Communication/Information Technologies (including Software)

 
Description Hybrid approaches for multilingual speech recognition
Amount £106,228 (GBP)
Funding ID 2431591 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 09/2020 
End 09/2024
 
Title Deep learning framework for conducting experiments with exemplar-based speech synthesis 
Description Two frameworks have been created to conduct experiments with exemplar-based speech synthesis. One framework, led by the employed research associate, implements soft exemplar selection. The other, led by the PI, implements hard exemplar selection. The two frameworks have different pros and cons and make it possible to conduct investigations from two different principled positions.
Type Of Material Improvements to research infrastructure 
Year Produced 2022 
Provided To Others? No  
Impact The two deep learning frameworks make better use of the available high-performance computing environment and make it possible to run a significantly larger number of experiments than would otherwise be possible.
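
The distinction between soft and hard exemplar selection can be illustrated with a minimal sketch. This is not the code of either framework, whose designs are not described in this record; it is a toy example under stated assumptions: each exemplar is a hypothetical (key, value) pair of small feature vectors, similarity is negative squared Euclidean distance, and the names `similarities`, `soft_select` and `hard_select` are invented for illustration. Soft selection blends all exemplars using softmax weights, whereas hard selection returns exactly one exemplar, so its choice can be inspected directly.

```python
import math

def similarities(query, exemplars):
    """Score each exemplar by negative squared Euclidean distance to the query."""
    return [-sum((q - k) ** 2 for q, k in zip(query, key)) for key, _ in exemplars]

def soft_select(query, exemplars, temperature=1.0):
    """Soft selection: softmax-weighted average of all exemplar values."""
    scores = [s / temperature for s in similarities(query, exemplars)]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(exemplars[0][1])
    output = [
        sum(w * value[d] for w, (_, value) in zip(weights, exemplars))
        for d in range(dim)
    ]
    return output, weights

def hard_select(query, exemplars):
    """Hard selection: return the value of the single best-matching exemplar."""
    scores = similarities(query, exemplars)
    best = max(range(len(scores)), key=scores.__getitem__)
    return list(exemplars[best][1]), best

# Hypothetical toy data: (key, value) pairs standing in for recorded exemplars.
exemplars = [
    ([0.0, 0.0], [1.0, 0.0]),
    ([1.0, 1.0], [0.0, 1.0]),
    ([4.0, 4.0], [5.0, 5.0]),
]
query = [0.9, 0.8]

soft_out, weights = soft_select(query, exemplars)
hard_out, best = hard_select(query, exemplars)
print("soft output:", soft_out, "weights:", weights)
print("hard output:", hard_out, "chosen exemplar index:", best)
```

In this sketch the hard variant reports which exemplar was used (index 1 for the toy query), which is the sense in which hard decisions are easier to interpret; the soft variant instead exposes a weight for every exemplar.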
 
Description DIET Chatbot 
Organisation University of Aberdeen
Country United Kingdom 
Sector Academic/University 
PI Contribution In this collaboration we are aiming to develop an empathic, stress- and emotion-aware chatbot that will interact with users by means of speech synthesis. The collaboration is at the stage of proposal submission to EPSRC, and I hope to contribute the expertise I gain in expressive speech synthesis to this project.
Collaborator Contribution The partner provides a real-life app for testing expressive speech synthesis solutions, both existing and experimental, such as those developed during the award, with real users. They provide an exciting opportunity to test human responses to different types of expressive synthetic speech.
Impact This is a multidisciplinary collaboration spanning computer science, nutrition and psychology.
Start Year 2022
 
Description DORA 
Organisation Centre for Research and Technology Hellas (CERTH)
Country Greece 
Sector Academic/University 
PI Contribution As a part of this partnership I will be looking into creating expressive voices that have high trustworthiness. This is an interesting special case of general expressive speech synthesis.
Collaborator Contribution The partners will provide an opportunity to work with user groups to find out which attributes of synthetic voices increase/decrease trustworthiness.
Impact The partnership is at the early proposal stage and has not yet produced any output. This is a highly multidisciplinary project spanning computer science, ethics, disinformation studies and psychology.
Start Year 2023
 
Description DORA 
Organisation Dublin City University
Country Ireland 
Sector Academic/University 
PI Contribution As a part of this partnership I will be looking into creating expressive voices that have high trustworthiness. This is an interesting special case of general expressive speech synthesis.
Collaborator Contribution The partners will provide an opportunity to work with user groups to find out which attributes of synthetic voices increase/decrease trustworthiness.
Impact The partnership is at the early proposal stage and has not yet produced any output. This is a highly multidisciplinary project spanning computer science, ethics, disinformation studies and psychology.
Start Year 2023
 
Description DORA 
Organisation SINTEF
Country Norway 
Sector Multiple 
PI Contribution As a part of this partnership I will be looking into creating expressive voices that have high trustworthiness. This is an interesting special case of general expressive speech synthesis.
Collaborator Contribution The partners will provide an opportunity to work with user groups to find out which attributes of synthetic voices increase/decrease trustworthiness.
Impact The partnership is at the early proposal stage and has not yet produced any output. This is a highly multidisciplinary project spanning computer science, ethics, disinformation studies and psychology.
Start Year 2023
 
Description DORA 
Organisation University of Greenwich
Country United Kingdom 
Sector Academic/University 
PI Contribution As a part of this partnership I will be looking into creating expressive voices that have high trustworthiness. This is an interesting special case of general expressive speech synthesis.
Collaborator Contribution The partners will provide an opportunity to work with user groups to find out which attributes of synthetic voices increase/decrease trustworthiness.
Impact The partnership is at the early proposal stage and has not yet produced any output. This is a highly multidisciplinary project spanning computer science, ethics, disinformation studies and psychology.
Start Year 2023