Exemplar-based Expressive Speech Synthesis

Lead Research Organisation: University of Sheffield

Department Name: Computer Science

Abstract

Synthetic voices are becoming ubiquitous: 'smart' speakers at home, announcement systems on public transport, and voice-enabled assistants on call lines. There is strong public demand for 'smarter' assistants capable of laughing at our jokes; interacting with our children as encouraging and empathetic tutors; calling to check up on our parents; providing a reassuring 'ear' for an isolated person; and offering calming and supportive virtual therapy. To support current and future applications, voice synthesis technology needs to satisfy a number of requirements. First, it needs to be customisable for rapid research and development; second, it needs to be able to produce any spoken content, including expressive voice characteristics. However, no current synthesis technology can satisfy all of these requirements simultaneously. For instance, current non-machine learning approaches allow pre-recorded phrases to be combined efficiently into complete sentences, but any missing phrase must be recorded first, which limits their flexibility and efficiency. Current machine learning models, on the other hand, can seamlessly synthesise any spoken content, but creating such models is a very costly, time-consuming and computationally demanding process. Furthermore, these models offer very limited control over voice characteristics and lack interpretability; both controllability and interpretability are highly desirable in research and commercial settings.

The objective of this project is to develop computationally efficient, customisable, expressive and interpretable speech synthesis by drawing on the concept of 'exemplars' from cognitive science.

In cognitive science, the notions of 'exemplars' and 'prototypes' form part of a prominent view on how humans categorise concepts. In particular, exemplar theory argues that singular examples, rather than prototypes (an average of examples), form the basic building blocks of how we understand and interact with the world. The key argument in favour of exemplar theory is our ability as humans to solve complex tasks based on just a few examples, which makes the theory appealing for applications that involve complex phenomena or require high computational efficiency. Furthermore, expressive speech synthesis combines expressivity and speech production, two complex phenomena that remain poorly understood. Unlike prototype theory, exemplar theory makes it possible, at least in principle, to produce expressive speech provided that at least one recording of the desired spoken content and one recording featuring the desired expressivity are available. Lastly, adopting exemplar theory promotes transparency in the decision-making process through the use of real examples that can be inspected, modified, replaced or added within the task.

The objective will be achieved through three innovative steps: i) formulating a methodological framework for exemplar-based speech synthesis, ii) building an exemplar-based representation of speech expressivity from pre-recorded examples, and iii) presenting a novel methodology for integrating this expressivity representation into the framework of i).

Funded Value: £218,290

Funded Period: Dec 21 - Nov 23

Funder: EPSRC

Project Status: Closed

Project Category: Research Grant

Project Reference: EP/V046772/1

Principal Investigator: Anton Ragni

Research Subject: Info. & commun. Technol. (100%)

Research Topic: Human Communication in ICT (100%)

Organisations

People	ORCID iD
Anton Ragni (Principal Investigator)

Publications

Mogridge R (2024) Learning from memory-based models

Mogridge R (2024) Non-intrusive speech intelligibility prediction for hearing-impaired users using intermediate ASR features and human memory models

Key Findings
Impact Summary
Further Funding
Research Tools and Methods
Collaboration


Description	The award made it possible to set up a speech synthesis group at the host institution by adding a more senior researcher alongside the only PhD student working in the area at the time. This has created a number of speech synthesis activities in the group, such as a reading and discussion club, mock challenges, and mentoring. The award objectives have been pursued in two directions as part of a risk mitigation strategy. The first direction was described in the project proposal and has been led by the hired research associate. The second direction can be described as a more constrained form of the first, in which decisions about which exemplars to combine are hard (and hence more interpretable) rather than soft. The risk mitigation strategy was introduced primarily because of the large search space of possible designs capable of meeting the original objectives. This suggests that follow-up work on the most promising alternative designs will be conducted by the PI once the current award period expires. The publication of research findings has been moved to the post-award stage. The award has produced a sufficient number of findings, which the hired research associate is currently writing up for publication.
Exploitation Route	The hired research associate and the PI have produced a large experimental framework that enables others to experiment with exemplar-based speech synthesis, and have explored variants of the proposed research not covered within the current award.
Sectors	Digital/Communication/Information Technologies (including Software)


Description	The true quality, lack of interpretability and computational costs associated with modern machine learning approaches appear not to be fully appreciated by the public, including university students. As part of this award I have created a set of resources that I have since used to educate university students about those pitfalls and to argue for the use of computationally efficient, interpretable and high-quality exemplar-based approaches. Also as part of this award, I have broadened the application of exemplar-based approaches to speech intelligibility prediction, which is becoming important in countries with rapidly ageing populations.
First Year Of Impact	2021
Sector	Digital/Communication/Information Technologies (including Software), Education
Impact Types	Cultural, Societal


Description	Hybrid approaches for multilingual speech recognition
Amount	£106,228 (GBP)
Funding ID	2431591
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	08/2020
End	09/2024


Title	Deep learning framework for conducting experiments with exemplar-based speech synthesis
Description	Two frameworks have been created to conduct experiments with exemplar-based speech synthesis. One framework, led by the employed research associate, implements soft exemplar selection. The other framework, led by the PI, implements hard exemplar selection. The two frameworks have different pros and cons and enable investigation from two different principled positions.
Type Of Material	Improvements to research infrastructure
Year Produced	2022
Provided To Others?	No
Impact	The two deep learning frameworks make better use of the available high-performance computing environment and make it possible to run a significantly larger number of experiments than would otherwise be possible.
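
The distinction between soft and hard exemplar selection can be illustrated with a minimal sketch. This is not the project's framework, which is not publicly described here; the function names, the use of negative squared Euclidean distance as a similarity measure, and the `temperature` parameter are all assumptions made purely for illustration. Soft selection blends every exemplar with similarity-based weights, whereas hard selection commits to a single exemplar that can be inspected directly.

```python
import math


def similarities(query, exemplars):
    """Negative squared Euclidean distance between the query and each exemplar
    (an assumed similarity measure, chosen only for illustration)."""
    return [-sum((q - e) ** 2 for q, e in zip(query, ex)) for ex in exemplars]


def soft_select(query, exemplars, temperature=1.0):
    """Soft selection: softmax-weighted average over all exemplars."""
    scores = [s / temperature for s in similarities(query, exemplars)]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(exemplars[0])
    blended = [sum(w * ex[d] for w, ex in zip(weights, exemplars))
               for d in range(dim)]
    return blended, weights


def hard_select(query, exemplars):
    """Hard selection: return the single most similar exemplar and its index."""
    scores = similarities(query, exemplars)
    best = max(range(len(exemplars)), key=scores.__getitem__)
    return exemplars[best], best


# Hypothetical two-dimensional feature vectors standing in for exemplars.
exemplars = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
query = [0.9, 0.9]

blended, weights = soft_select(query, exemplars)
chosen, index = hard_select(query, exemplars)
```

In this sketch the hard decision exposes exactly which exemplar was used, while the soft decision returns a mixture whose weights must be examined to understand the result. As the temperature approaches zero, the soft output approaches the hard one.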


Description	DIET Chatbot
Organisation	University of Aberdeen
Country	United Kingdom
Sector	Academic/University
PI Contribution	In this collaboration we aim to develop an empathetic, stress- and emotion-aware chatbot that will interact with users by means of speech synthesis. The collaboration is at the stage of proposal submission to EPSRC, and I hope to contribute the expertise I gain in expressive speech synthesis to this project.
Collaborator Contribution	The partner provides a real-life app for testing expressive speech synthesis solutions, both existing and experimental, such as those developed during the award, with real users. They provide an exciting opportunity to test human responses to different types of expressive synthetic speech.
Impact	This is a multidisciplinary collaboration spanning computer science, nutrition and psychology.
Start Year	2022


Description	DORA
Organisation	Centre for Research and Technology Hellas (CERTH)
Country	Greece
Sector	Academic/University
PI Contribution	As a part of this partnership I will be looking into creating expressive voices that have high trustworthiness. This is an interesting special case of general expressive speech synthesis.
Collaborator Contribution	The partners will provide an opportunity to work with user groups to find out which attributes of synthetic voices increase/decrease trustworthiness.
Impact	The partnership is in the early stages of proposal and has not yet produced any output. This is a highly multidisciplinary project spanning computer science, ethics, disinformation studies and psychology.
Start Year	2023


Description	DORA
Organisation	Dublin City University
Country	Ireland
Sector	Academic/University
PI Contribution	As a part of this partnership I will be looking into creating expressive voices that have high trustworthiness. This is an interesting special case of general expressive speech synthesis.
Collaborator Contribution	The partners will provide an opportunity to work with user groups to find out which attributes of synthetic voices increase/decrease trustworthiness.
Impact	The partnership is in the early stages of proposal and has not yet produced any output. This is a highly multidisciplinary project spanning computer science, ethics, disinformation studies and psychology.
Start Year	2023


Description	DORA
Organisation	SINTEF
Country	Norway
Sector	Multiple
PI Contribution	As a part of this partnership I will be looking into creating expressive voices that have high trustworthiness. This is an interesting special case of general expressive speech synthesis.
Collaborator Contribution	The partners will provide an opportunity to work with user groups to find out which attributes of synthetic voices increase/decrease trustworthiness.
Impact	The partnership is in the early stages of proposal and has not yet produced any output. This is a highly multidisciplinary project spanning computer science, ethics, disinformation studies and psychology.
Start Year	2023


Description	DORA
Organisation	University of Greenwich
Country	United Kingdom
Sector	Academic/University
PI Contribution	As a part of this partnership I will be looking into creating expressive voices that have high trustworthiness. This is an interesting special case of general expressive speech synthesis.
Collaborator Contribution	The partners will provide an opportunity to work with user groups to find out which attributes of synthetic voices increase/decrease trustworthiness.
Impact	The partnership is in the early stages of proposal and has not yet produced any output. This is a highly multidisciplinary project spanning computer science, ethics, disinformation studies and psychology.
Start Year	2023
