Exemplar-based Expressive Speech Synthesis
Lead Research Organisation:
University of Sheffield
Department Name: Computer Science
Abstract
Synthetic voices are becoming ubiquitous: `smart' speakers at home, announcement systems on public transport, and voice-enabled assistants on call lines. There exists a strong public demand for `smarter' assistants capable of laughing at our jokes; interacting with our children as encouraging and empathic tutors; calling to check up on our parents; providing a reassuring `ear' for an isolated person; and offering calming and supportive virtual therapy. To support current and future applications, voice synthesis technology needs to satisfy a number of requirements. First, it needs to be customisable for rapid research and development, and second, it needs to be able to produce any spoken content, including expressive voice characteristics. However, none of the current synthesis technologies can satisfy all of these requirements simultaneously. For instance, current non-machine-learning approaches allow pre-recorded phrases to be combined efficiently into complete sentences, but any missing phrase must be recorded first, which limits their flexibility and efficiency. Current machine learning models, on the other hand, can seamlessly synthesise any spoken content, but creating such models is a costly, time-consuming and computationally demanding process. Furthermore, these models offer very limited control over voice characteristics and lack interpretability, both of which are highly desirable in research and commercial settings.
In this project, the objective is to develop a computationally efficient, customisable, expressive and interpretable approach to speech synthesis by drawing on the concept of `exemplars' from cognitive science.
In the field of cognitive science, the notions of `exemplars' and `prototypes' form part of a prominent view on how humans categorise concepts. In particular, exemplar theory argues that singular examples, rather than prototypes (an average of examples), form the basic building blocks of how we understand and interact with the world. The key argument in favour of exemplar theory is our ability as humans to solve complex tasks based on just a few examples, which makes this theory appealing for applications that involve complex phenomena or that require high computational efficiency. Furthermore, expressive speech synthesis combines expressivity and speech production, two complex phenomena that remain poorly understood. Unlike prototype theory, exemplar theory makes it possible, at least in theory, to produce expressive speech provided that at least one recording of the desired spoken content and one recording featuring the desired expressivity are available. Lastly, adopting exemplar theory promotes transparency in the decision-making process through the use of real examples that can be inspected, modified, replaced or added within the task.
The objective will be achieved through three innovative means: i) formulating a methodological framework for exemplar-based speech synthesis, ii) building an exemplar-based representation of speech expressivity from pre-recorded examples and iii) presenting a novel methodology for integrating this expressivity-based representation into the framework of i).
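The distinction between prototypes and exemplars drawn in the abstract can be illustrated with a toy categorisation task. The sketch below is not taken from the project; the data, the Euclidean distance measure and the function names `prototype_classify` and `exemplar_classify` are assumptions chosen purely for illustration.

```python
import math

def prototype_classify(x, examples):
    """Assign x to the category whose prototype (mean of its examples) is nearest."""
    best_label, best_dist = None, math.inf
    for label, points in examples.items():
        dim = len(points[0])
        # The prototype is the average of all stored examples of the category.
        prototype = [sum(p[i] for p in points) / len(points) for i in range(dim)]
        d = math.dist(x, prototype)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

def exemplar_classify(x, examples):
    """Assign x to the category that owns the single nearest stored example."""
    best_label, best_dist = None, math.inf
    for label, points in examples.items():
        for p in points:
            d = math.dist(x, p)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label

# Hypothetical toy data: category "A" has two widely separated examples,
# category "B" has a single example lying between them.
toy_examples = {
    "A": [(0.0, 0.0), (10.0, 0.0)],
    "B": [(5.0, 1.0)],
}
query = (5.0, 0.4)

print(prototype_classify(query, toy_examples))
print(exemplar_classify(query, toy_examples))
```

In this toy setting the two views disagree: averaging the two `A` examples yields a prototype that no real example resembles, whereas the exemplar view only ever compares the query against examples that actually exist and can be inspected.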
People |
ORCID iD |
Anton Ragni (Principal Investigator) |
Description | The award made it possible to set up a speech synthesis group in the host institution by adding a more senior researcher to the only PhD student working in the area at the time. This has led to a number of speech synthesis activities in the group, such as a reading and discussion club, mock challenges, and mentoring. The award objectives have been pursued along two directions as part of a risk mitigation strategy. The first direction was described in the project proposal and has been led by the hired research associate. The second direction is a more constrained form of the first, in which decisions as to which exemplars to combine are hard (and hence more interpretable) rather than soft; it was opened as a risk mitigation measure. The publication of research findings may be moved to the post-award stage as an additional risk mitigation measure. Risk mitigation within the primary direction was introduced mainly because of the large search space of possible designs capable of meeting the original objectives. This suggests that follow-up work on the most promising alternative designs will be conducted by the PI once the current award period expires. |
Exploitation Route | The hired research associate and the PI have produced a large experimental framework that enables others to experiment with exemplar-based speech synthesis and to explore variants of the proposed research not covered within the current award. |
Sectors | Digital/Communication/Information Technologies (including Software) |
Description | Hybrid approaches for multilingual speech recognition |
Amount | £106,228 (GBP) |
Funding ID | 2431591 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 09/2020 |
End | 09/2024 |
Title | Deep learning framework for conducting experiments with exemplar-based speech synthesis |
Description | Two frameworks have been created for conducting experiments with exemplar-based speech synthesis. One framework, led by the employed research associate, implements soft exemplar selection. The other, led by the PI, implements hard exemplar selection. The two frameworks have different pros and cons and make it possible to investigate the problem from two different principled positions. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2022 |
Provided To Others? | No |
Impact | The two deep learning frameworks make better use of the available high-performance computing environment and allow a significantly larger number of experiments to be run than would otherwise be possible. |
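The record does not describe how either framework implements exemplar selection. As a rough, generic illustration of the soft versus hard distinction only, the sketch below assumes that exemplars are fixed-length feature vectors scored by negative squared distance to a query; the function names, the scoring rule and the `temperature` parameter are all hypothetical and are not taken from the project.

```python
import math

def similarity_scores(query, exemplars):
    """Score each exemplar by negative squared Euclidean distance (an assumption)."""
    return [-sum((q - e) ** 2 for q, e in zip(query, ex)) for ex in exemplars]

def soft_select(query, exemplars, temperature=1.0):
    """Soft selection: blend all exemplars with softmax weights."""
    scores = similarity_scores(query, exemplars)
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(exemplars[0])
    blended = [sum(w * ex[i] for w, ex in zip(weights, exemplars)) for i in range(dim)]
    return blended, weights

def hard_select(query, exemplars):
    """Hard selection: pick exactly one exemplar, the best-scoring one."""
    scores = similarity_scores(query, exemplars)
    index = max(range(len(scores)), key=scores.__getitem__)
    return exemplars[index], index

# Hypothetical toy exemplars and query.
toy_exemplars = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
toy_query = [0.9, 0.9]

blended, weights = soft_select(toy_query, toy_exemplars)
chosen, chosen_index = hard_select(toy_query, toy_exemplars)
print(weights, blended)
print(chosen_index, chosen)
```

With hard selection the output can be traced back to a single stored exemplar, which is one sense in which it is easier to interpret; with soft selection every exemplar contributes to some degree, and the result is a weighted mixture.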
Description | DIET Chatbot |
Organisation | University of Aberdeen |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | In this collaboration we are aiming to develop an empathic, stress- and emotion-aware chatbot that will interact with users by means of speech synthesis. The collaboration is at the stage of proposal submission to EPSRC, and I hope to contribute the expertise I gain in expressive speech synthesis to this project. |
Collaborator Contribution | The partner provides a real-life app for testing expressive speech synthesis solutions, both existing and experimental, such as those developed during the award, with real users. They provide an exciting opportunity to test human responses to different types of expressive synthetic speech. |
Impact | This is a multidisciplinary collaboration between computer science, nutrition and psychology. |
Start Year | 2022 |
Description | DORA |
Organisation | Centre for Research and Technology Hellas (CERTH) |
Country | Greece |
Sector | Academic/University |
PI Contribution | As a part of this partnership I will be looking into creating expressive voices that have high trustworthiness. This is an interesting special case of general expressive speech synthesis. |
Collaborator Contribution | The partners will provide an opportunity to work with user groups to find out which attributes of synthetic voices increase/decrease trustworthiness. |
Impact | The partnership is in the early stages of proposal and has not yet produced any output. This is a highly multidisciplinary project spanning computer science, ethics, disinformation studies and psychology. |
Start Year | 2023 |
Description | DORA |
Organisation | Dublin City University |
Country | Ireland |
Sector | Academic/University |
PI Contribution | As a part of this partnership I will be looking into creating expressive voices that have high trustworthiness. This is an interesting special case of general expressive speech synthesis. |
Collaborator Contribution | The partners will provide an opportunity to work with user groups to find out which attributes of synthetic voices increase/decrease trustworthiness. |
Impact | The partnership is in the early stages of proposal and has not yet produced any output. This is a highly multidisciplinary project spanning computer science, ethics, disinformation studies and psychology. |
Start Year | 2023 |
Description | DORA |
Organisation | SINTEF |
Country | Norway |
Sector | Multiple |
PI Contribution | As a part of this partnership I will be looking into creating expressive voices that have high trustworthiness. This is an interesting special case of general expressive speech synthesis. |
Collaborator Contribution | The partners will provide an opportunity to work with user groups to find out which attributes of synthetic voices increase/decrease trustworthiness. |
Impact | The partnership is in the early stages of proposal and has not yet produced any output. This is a highly multidisciplinary project spanning computer science, ethics, disinformation studies and psychology. |
Start Year | 2023 |
Description | DORA |
Organisation | University of Greenwich |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | As a part of this partnership I will be looking into creating expressive voices that have high trustworthiness. This is an interesting special case of general expressive speech synthesis. |
Collaborator Contribution | The partners will provide an opportunity to work with user groups to find out which attributes of synthetic voices increase/decrease trustworthiness. |
Impact | The partnership is in the early stages of proposal and has not yet produced any output. This is a highly multidisciplinary project spanning computer science, ethics, disinformation studies and psychology. |
Start Year | 2023 |