SCRIPT: Speech Synthesis for Spoken Content Production
Lead Research Organisation:
University of Edinburgh
Department Name: Centre for Speech Technology Research
Abstract
The cost of producing dynamically-updated media content - such as online video news packages - across multiple languages is very high. Maintaining substantial teams of journalists per language is expensive and inflexible. Modern media organisations like the BBC or the Financial Times need a more agile approach: they must be able to react quickly to changing world events (e.g., breaking news or emerging markets), dynamically allocating their limited resources in response to external demands. Ideally, they would like to create 'pop-up' services and products in previously-unsupported languages, then scale them up or down later.
The government has set the BBC a target of reaching a global audience of 500 million people by 2022, compared with today's 308 million. The only way to reach such a huge audience is through new language services and efficient production techniques. Text-to-speech - which automatically produces speech from text - offers an attractive solution to this challenge, and the BBC have identified computer-assisted translation and text-to-speech as key technologies that will provide them with new ways of creating and reversioning their content across many languages.
This project's objectives are to push text-to-speech technology towards "broadcast quality" computer-generated speech (i.e., good enough for the BBC to broadcast) in many languages, and to make it cheap and easy to add more languages later. We will do this by combining and extending several distinct pieces of our previous basic research on text-to-speech. We will use the latest data-driven machine learning techniques, and extend them to produce much higher quality output speech. At the same time, we will enable the possibility of human control over the speech. This will allow the user (e.g., a BBC journalist) to adjust the speech to make sure the quality and the speaking style is right for their purposes (e.g., correcting the pronunciation of a difficult word, or putting emphasis in the right place).
The technology we will create for the likes of the BBC will also enable smaller companies and other organisations, state bodies, charities, and individuals to rapidly create high-quality spoken content, in whatever language or domain they are operating. We will work with other types of organisation during the project, to make sure that the technology we create has broad appeal and will be useful to a wide range of companies and individuals.
Planned Impact
Who might benefit from this research?
-----------------------------------------------------------
This project has significant potential for almost immediate impact on the broadcast and wider media / news industries, starting with the BBC use case, but quickly moving on to address use cases proposed by the Financial Times and Deutsche Welle (who will sit on our Advisory Board). Their UK and global audiences will benefit from news and information becoming available in more languages.
Smaller companies will benefit too, because they will have access to a cost-effective way to add new languages to their products and services. Currently, small and medium-sized companies simply cannot afford to commission a bespoke language front end from a commercial text-to-speech provider, because of the large amount of manual work this requires. With our new technology, the cost of commissioning a new language will be dramatically lower.
Commercial providers of text-to-speech (e.g., our collaborators ReadSpeaker) will benefit from access to new techniques for text-to-speech synthesis. They will find it easier to compete with multinationals (e.g., Nuance) because they will be able to bring cost-effective language products to the market, and will be able to create bespoke products more quickly and cheaply than at present.
The UK as a whole will benefit, as a consequence of the BBC World Service's ability to reach a wider global audience. As noted in the proposal, the BBC World Service is an important way for the UK to exert soft power around the world. Our technology will enable the BBC to do this in many more languages than at present.
The availability of affordable text-to-speech for the great many languages spoken in, for example, Africa could benefit development in many countries. It would be useful for disseminating, for example, agricultural or health information from governments or the United Nations.
How might they benefit from this research?
-----------------------------------------------------------
Our goal is to deliver economic impact for the BBC, and societal impact for their audience, within the duration of the project. Other media organisations, and their respective audiences, will benefit in the same way.
Commercial providers of text-to-speech can incorporate our new techniques into their products and so will be able to penetrate new markets of two types: 1) markets demanding very high quality, which our methods for editorial control will deliver; 2) markets requiring provision of many languages, especially the hundreds, even thousands, of languages that currently have no prospect of text-to-speech provision.
Delivering longer-term benefits to people in developing countries is challenging, even when done through well-established not-for-profit organisations, such as the United Nations Development Programme. Dr. John Quinn will sit on our Advisory Board and provide his expertise in overcoming barriers to deployment of spoken language technology in developing countries.
Publications
Watts O (2018) "Exemplar-based Speech Waveform Generation"
Valentini-Botinhao C (2018) "Exemplar-based speech waveform generation for text-to-speech"
Aubin A (2019) "Improving speech synthesis with discourse relations"
Description | Amongst other goals, the project pioneered the use of "human-in-the-loop" corrections for computer-generated speech. Within the project, the application was for the BBC to produce broadcasts in many languages (especially on the World Service). After the grant, our approaches found applications in other areas too. Since the grant finished, our understanding of the key issues in computer-generated speech from text input (called "Text-to-Speech") has continued to deepen, building on what we learned in that project, and we now have a very clear understanding of why additional inputs (not contained in the text) are essential. |
Exploitation Route | Additional human-provided inputs are now becoming commonplace in commercial applications of Text-to-Speech. |
Sectors | Creative Economy,Digital/Communication/Information Technologies (including Software),Other |
Description | The SpeakUnique spinout benefitted from advances in core speech synthesis technology that we made in SCRIPT. Papercup Technologies is also using human-in-the-loop corrections to synthetic speech. |
First Year Of Impact | 2019 |
Sector | Digital/Communication/Information Technologies (including Software),Healthcare |
Impact Types | Societal,Economic |
Description | Foundations for Expressive Speech Synthesis |
Amount | $194,600 (USD) |
Organisation | Samsung |
Sector | Private |
Country | Korea, Republic of |
Start | 01/2019 |
End | 12/2019 |
Description | Foundations for Expressive Speech Synthesis (Year 2) |
Amount | $191,950 (USD) |
Organisation | Samsung |
Sector | Private |
Country | Korea, Republic of |
Start | 01/2020 |
End | 12/2020 |
Description | BBC |
Organisation | British Broadcasting Corporation (BBC) |
Country | United Kingdom |
Sector | Public |
PI Contribution | Project partner |
Collaborator Contribution | Provision of data |
Impact | See publications - some are based on the BBC data |
Start Year | 2016 |
Title | Merlin |
Description | Merlin: The Neural Network (NN) based Speech Synthesis System. This repository contains the Neural Network (NN) based Speech Synthesis System developed at the Centre for Speech Technology Research (CSTR), University of Edinburgh. Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD). The system is written in Python and relies on the Theano numerical computation library. Merlin comes with recipes (in the spirit of the Kaldi automatic speech recognition toolkit) to show you how to build state-of-the-art systems. Merlin is free software, distributed under an Apache License Version 2.0, allowing unrestricted commercial and non-commercial use alike. |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | Reproducible research |
URL | https://github.com/CSTR-Edinburgh/merlin |
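As a rough illustration of the statistical parametric pipeline Merlin implements, the sketch below maps frame-level linguistic features to vocoder parameters with a small feed-forward network. All dimensions, weights, and names here are invented for the example; Merlin's real recipes define these in configuration files and use Theano rather than plain NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
LINGUISTIC_DIM = 425   # e.g. phone identity + positional features per frame
ACOUSTIC_DIM = 187     # e.g. mel-cepstra + F0 + aperiodicity (vocoder params)
HIDDEN = 512

# A minimal feed-forward acoustic model: linguistic features in,
# vocoder parameters out, applied frame-by-frame.
W1 = rng.normal(0, 0.01, (LINGUISTIC_DIM, HIDDEN))
W2 = rng.normal(0, 0.01, (HIDDEN, ACOUSTIC_DIM))

def predict(x):
    h = np.tanh(x @ W1)   # hidden layer
    return h @ W2         # linear output layer

frames = rng.normal(size=(10, LINGUISTIC_DIM))  # 10 frames of fake input
y = predict(frames)
print(y.shape)  # (10, 187)
```

In the full pipeline, the front-end (e.g., Festival) would produce the linguistic features and the vocoder (e.g., WORLD) would turn the predicted parameters into a waveform.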
Title | Ophelia |
Description | A modified version of Kyubyong Park's dc_tts repository, which implements a variant of the system described in Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention. |
Type Of Technology | Software |
Year Produced | 2018 |
Open Source License? | Yes |
Impact | Reproducible research |
URL | https://github.com/CSTR-Edinburgh/ophelia |
Title | Ossian |
Description | Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision. Updates to this repository occurred in the SCRIPT project. |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | Reproducible research |
URL | https://github.com/CSTR-Edinburgh/Ossian |
Title | Snickery |
Description | This repository contains code used to build the systems presented in the papers "Exemplar-based speech waveform generation" (Watts, Valentini-Botinhao, Espic and King, Interspeech 2018) and "Exemplar-based speech waveform generation for text-to-speech". The scripts script/train_simple.py and script/synth_simple.py build a restricted class of system (selection of epoch-based fragments, with greedy search only) and can be used to replicate the system proposed in the first of those papers. |
Type Of Technology | Software |
Year Produced | 2017 |
Open Source License? | Yes |
Impact | Reproducible research |
URL | https://github.com/CSTR-Edinburgh/snickery |
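To illustrate the kind of exemplar selection Snickery's simple scripts perform, here is a sketch of greedy fragment selection: for each target frame, pick the stored fragment that minimises a target cost plus a join cost to the previously chosen fragment. The function, feature dimensions, and weighting are invented for this example and do not reproduce the repository's actual code.

```python
import numpy as np

def greedy_select(targets, candidates, join_weight=1.0):
    """Greedy exemplar selection (a sketch, not the Snickery implementation).

    targets:    (T, D) desired acoustic features, one row per fragment slot
    candidates: (N, D) features of the stored speech fragments ("exemplars")
    Returns the index of the chosen fragment for each slot.
    """
    chosen = []
    prev = None
    for t in targets:
        # Target cost: distance from each candidate to the desired features.
        cost = np.linalg.norm(candidates - t, axis=1)
        # Join cost: distance to the previously selected fragment,
        # encouraging smooth concatenation.
        if prev is not None:
            cost = cost + join_weight * np.linalg.norm(
                candidates - candidates[prev], axis=1)
        prev = int(np.argmin(cost))
        chosen.append(prev)
    return chosen

rng = np.random.default_rng(1)
cands = rng.normal(size=(50, 8))                       # 50 stored fragments
targs = cands[[3, 17, 17, 42]] + 0.01 * rng.normal(size=(4, 8))
print(greedy_select(targs, cands, join_weight=0.0))    # [3, 17, 17, 42]
```

With join_weight=0 the search reduces to nearest-neighbour selection; a positive join cost trades target accuracy for smoother joins between consecutive fragments.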
Title | Waffler |
Description | This repository contains code used to build the systems presented in the paper "Speech waveform reconstruction using convolutional neural networks with noise and periodic inputs" (Watts, Valentini-Botinhao and King, ICASSP 2019). The repository's instructions explain how to produce a system comparable to the new system (P0) proposed in that paper. |
Type Of Technology | Software |
Year Produced | 2019 |
Open Source License? | Yes |
Impact | Reproducible research |
URL | https://github.com/CSTR-Edinburgh/waffler |
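The core idea behind Waffler can be sketched as follows: rather than generating a waveform from nothing, a convolutional network is conditioned on two sample-rate signals, a periodic input at the target F0 and a noise input. The sketch below uses a single hand-rolled 1-D convolution as a stand-in for the full CNN stack; the sample rate, frame length, F0, and filter sizes are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
SR = 16000    # sample rate (assumption for this sketch)
N = 400       # one 25 ms frame of samples
f0 = 120.0    # fundamental frequency of a voiced frame, in Hz

# Two conditioning signals at sample rate: periodic and noise.
t = np.arange(N) / SR
periodic = np.sin(2 * np.pi * f0 * t)   # periodic input at the target F0
noise = rng.normal(size=N)              # noise input
inputs = np.stack([periodic, noise])    # (2, N)

# One 1-D convolution layer as a stand-in for the full CNN stack.
K = 9                                   # filter width
filters = rng.normal(0, 0.1, (2, K))    # one filter per input channel

def conv1d_same(x, w):
    # 'same'-length 1-D convolution via zero padding.
    pad = len(w) // 2
    xp = np.pad(x, pad)
    return np.array([xp[i:i + len(w)] @ w for i in range(len(x))])

wave = sum(conv1d_same(inputs[c], filters[c]) for c in range(2))
print(wave.shape)  # (400,)
```

The periodic channel gives the network an explicit handle on voicing and pitch, while the noise channel supplies the stochastic component of the waveform.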
Company Name | SPEAKUNIQUE LIMITED |
Description | Commercialisation of voice reconstruction technology. Some of this builds on work done in SCRIPT, and the CTO (Oliver Watts) worked on SCRIPT. |
Year Established | 2018 |
Impact | Product release expected 2020 |
Website | https://www.speakunique.org |
Description | Does "end-to-end" speech synthesis mean we don't need text processing or signal processing any more? |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Invited talk in Dublin |
Year(s) Of Engagement Activity | 2019 |
Description | Finding your (artificial) voice |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | A Pint of Science |
Year(s) Of Engagement Activity | 2017 |
Description | If you lose your voice, how can you speak? |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Undergraduate students |
Results and Impact | Invited talk at ULAB 2018 |
Year(s) Of Engagement Activity | 2018 |
Description | Is identifying people using their voice a good idea? |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Invited talk in Trento |
Year(s) Of Engagement Activity | 2018 |
Description | Multiple talks in Japan |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | A series of 4 invited talks in Nagoya, Japan |
Year(s) Of Engagement Activity | 2018 |
Description | Speech Synthesis |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Undergraduate students |
Results and Impact | Invited talk in York |
Year(s) Of Engagement Activity | 2018 |
Description | What is "end-to-end" text-to-speech synthesis? |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Undergraduate students |
Results and Impact | Talk to student society at Lancaster University |
Year(s) Of Engagement Activity | 2019 |