SCRIPT: Speech Synthesis for Spoken Content Production

Lead Research Organisation: University of Edinburgh
Department Name: Centre for Speech Technology Research

Abstract

The cost of producing dynamically updated media content - such as online video news packages - across multiple languages is very high. Maintaining substantial teams of journalists per language is expensive and inflexible. Modern media organisations like the BBC or the Financial Times need a more agile approach: they must be able to react quickly to changing world events (e.g., breaking news or emerging markets), dynamically allocating their limited resources in response to external demands. Ideally, they would like to create 'pop-up' services and products in previously unsupported languages, then scale them up or down later.

The government has set the BBC a target of reaching a global audience of 500 million people by 2022, compared with today's 308 million. The only way to reach such a huge audience is through new language services and efficient production techniques. Text-to-speech - which automatically produces speech from text - offers an attractive solution to this challenge, and the BBC have identified computer-assisted translation and text-to-speech as key technologies that will provide them with new ways of creating and reversioning their content across many languages.

This project's objectives are to push text-to-speech technology towards "broadcast quality" computer-generated speech (i.e., good enough for the BBC to broadcast) in many languages, and to make it cheap and easy to add more languages later. We will do this by combining and extending several distinct pieces of our previous basic research on text-to-speech. We will use the latest data-driven machine learning techniques, and extend them to produce much higher quality output speech. At the same time, we will enable human control over the speech. This will allow the user (e.g., a BBC journalist) to adjust the speech to make sure the quality and the speaking style are right for their purposes (e.g., correcting the pronunciation of a difficult word, or putting emphasis in the right place).
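To make the idea of editorial control concrete, here is a minimal sketch using the W3C SSML markup standard, which many text-to-speech engines accept. It illustrates the kind of control described above and is not the project's own mechanism; the place name, its IPA transcription, and the sentence are invented examples.

    # A minimal sketch of editorial control over synthetic speech, expressed as
    # W3C SSML markup (https://www.w3.org/TR/speech-synthesis11/).
    # Illustrative only: this is not the control mechanism built in the project.

    def correct_pronunciation(word: str, ipa: str) -> str:
        """Wrap a word in an SSML <phoneme> tag giving its IPA pronunciation."""
        return f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>'

    def emphasise(phrase: str, level: str = "strong") -> str:
        """Wrap a phrase in an SSML <emphasis> tag."""
        return f'<emphasis level="{level}">{phrase}</emphasis>'

    # A journalist fixes the pronunciation of a difficult name and places the
    # emphasis where it belongs (name and transcription are invented examples):
    sentence = (
        '<speak>The situation in '
        + correct_pronunciation("Nakuru", "nɑˈkuːruː")
        + ' remains ' + emphasise("highly uncertain") + '.</speak>'
    )
    print(sentence)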

The technology we will create for the likes of the BBC will also enable smaller companies and other organisations, state bodies, charities, and individuals to rapidly create high-quality spoken content, in whatever language or domain they are operating. We will work with other types of organisation during the project, to make sure that the technology we create has broad appeal and will be useful to a wide range of companies and individuals.

Planned Impact

Who might benefit from this research?
-----------------------------------------------------------

This project has significant potential for almost immediate impact on the broadcast and wider media/news industries, starting with the BBC use case, but quickly moving on to address use cases proposed by the Financial Times and Deutsche Welle (both of whom will sit on our Advisory Board). Their UK and global audiences will benefit from news and information becoming available in more languages.

Smaller companies will benefit too, because they will have access to a cost-effective way to add new languages to their products and services. Currently, small and medium-sized companies simply cannot afford to commission a bespoke language front end from a commercial text-to-speech provider, because of the large amount of manual work this requires. With our new technology, the cost of commissioning a new language will be dramatically lower.

Commercial providers of text-to-speech (e.g., our collaborators ReadSpeaker) will benefit from access to new techniques for text-to-speech synthesis. They will find it easier to compete with multinationals (e.g., Nuance) because they will be able to bring cost-effective language products to the market, and will be able to create bespoke products more quickly and cheaply than at present.

The UK as a whole will benefit, as a consequence of the BBC World Service's ability to reach a wider global audience. As noted in the proposal, the BBC World Service is an important way for the UK to exert soft power around the world. Our technology will enable the BBC to do this in many more languages than at present.

The availability of affordable text-to-speech for the great many languages spoken in, for example, Africa could benefit development in many countries: it would be useful for disseminating agricultural or health information from governments or the United Nations.


How might they benefit from this research?
-----------------------------------------------------------

Our goal is to deliver economic impact for the BBC, and societal impact for their audience, within the duration of the project. Other media organisations, and their respective audiences, will benefit in the same way.

Commercial providers of text-to-speech can incorporate our new techniques into their products and so will be able to penetrate new markets of two types: 1) markets demanding very high quality, which our methods for editorial control will deliver; 2) markets requiring provision of many languages, especially the hundreds or thousands of languages that currently have no prospect of text-to-speech provision.

Delivering longer-term benefits to people in developing countries is challenging, even when done through well-established not-for-profit organisations, such as the United Nations Development Programme. Dr. John Quinn will sit on our Advisory Board and provide his expertise in overcoming barriers to deployment of spoken language technology in developing countries.
 
Description Amongst other goals, the project pioneered the use of "human-in-the-loop" corrections for computer-generated speech. Within the project, the application was for the BBC to produce broadcasts in many languages (especially on the World Service). After the grant, our approaches have found applications in other areas too. Since the grant finished, our understanding of the key issues in computer-generated speech from text input (called "Text-to-Speech") has continued to deepen, building on what we learned in that project, and we now have a very clear understanding of why additional inputs (not contained in the text) are essential.
Exploitation Route Additional human-provided inputs are now becoming commonplace in commercial applications of Text-to-Speech.
Sectors Creative Economy, Digital/Communication/Information Technologies (including Software), Other

 
Description The SpeakUnique spinout benefited from advances in core speech synthesis technology that we made in SCRIPT. Papercup Technologies is also using human-in-the-loop corrections to synthetic speech.
First Year Of Impact 2019
Sector Digital/Communication/Information Technologies (including Software), Healthcare
Impact Types Societal, Economic

 
Description Foundations for Expressive Speech Synthesis
Amount $194,600 (USD)
Organisation Samsung 
Sector Private
Country Korea, Republic of
Start 01/2019 
End 12/2019
 
Description Foundations for Expressive Speech Synthesis (Year 2)
Amount $191,950 (USD)
Organisation Samsung 
Sector Private
Country Korea, Republic of
Start 01/2020 
End 12/2020
 
Description BBC 
Organisation British Broadcasting Corporation (BBC)
Country United Kingdom 
Sector Public 
PI Contribution Project partner
Collaborator Contribution Provision of data
Impact See publications - some are based on the BBC data
Start Year 2016
 
Title Merlin 
Description Merlin is a neural network (NN) based speech synthesis system developed at the Centre for Speech Technology Research (CSTR), University of Edinburgh. It is a toolkit for building deep neural network models for statistical parametric speech synthesis, and must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD). The system is written in Python and relies on the Theano numerical computation library. Merlin comes with recipes (in the spirit of the Kaldi automatic speech recognition toolkit) that show how to build state-of-the-art systems. Merlin is free software, distributed under the Apache License Version 2.0, allowing unrestricted commercial and non-commercial use alike. A conceptual sketch of the kind of acoustic model Merlin trains appears after this record.
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Reproducible research 
URL https://github.com/CSTR-Edinburgh/merlin
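For readers unfamiliar with statistical parametric synthesis, the sketch below shows, in plain NumPy, the kind of mapping a Merlin acoustic model learns: per-frame linguistic features in, per-frame vocoder parameters out. It is a conceptual illustration only, not Merlin's actual API; the layer sizes and feature dimensions are merely typical values.

    # Conceptual sketch (plain NumPy, untrained weights) of the regression at
    # the core of Merlin-style synthesis: a feedforward network mapping
    # linguistic features (from a front end such as Festival) to vocoder
    # parameters (for a vocoder such as WORLD). Dimensions are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)

    def init_layer(n_in: int, n_out: int):
        return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)

    W1, b1 = init_layer(425, 512)   # 425 linguistic features per frame (typical)
    W2, b2 = init_layer(512, 512)
    W3, b3 = init_layer(512, 187)   # 187 acoustic features per frame (typical)

    def acoustic_model(linguistic_feats: np.ndarray) -> np.ndarray:
        """Map a (frames x 425) linguistic matrix to (frames x 187) acoustics."""
        h = np.tanh(linguistic_feats @ W1 + b1)
        h = np.tanh(h @ W2 + b2)
        return h @ W3 + b3          # linear output layer, as usual for regression

    frames = acoustic_model(rng.normal(size=(300, 425)))
    print(frames.shape)   # (300, 187): one vector of vocoder parameters per frame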
 
Title Ophelia 
Description A modified version of Kyubyong Park's dc_tts repository, which implements a variant of the system described in "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention". The guided attention idea at the heart of that system is sketched after this record.
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Reproducible research 
URL https://github.com/CSTR-Edinburgh/ophelia
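The guided attention idea can be stated briefly: because text and speech are roughly monotonically aligned, the attention matrix is penalised during training for straying from the diagonal. The NumPy sketch below follows the penalty given in the paper cited above; the bandwidth g = 0.2 is the paper's value, and the toy attention matrices are invented examples.

    # Guided attention loss, per the paper cited in the record above:
    # W[n, t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)), near zero on the
    # diagonal, so monotonic (diagonal) attention goes almost unpenalised.
    import numpy as np

    def guided_attention_weights(N: int, T: int, g: float = 0.2) -> np.ndarray:
        n = np.arange(N)[:, None] / N       # normalised text positions
        t = np.arange(T)[None, :] / T       # normalised spectrogram frames
        return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

    def guided_attention_loss(A: np.ndarray) -> float:
        """Mean penalty over an (N text positions x T frames) attention matrix A."""
        N, T = A.shape
        return float(np.mean(A * guided_attention_weights(N, T)))

    # A perfectly diagonal attention matrix incurs zero penalty:
    print(guided_attention_loss(np.eye(50)))                    # 0.0
    # A uniform (unfocused) one does not:
    print(guided_attention_loss(np.full((50, 60), 1.0 / 60)))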
 
Title Ossian 
Description Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision. Updates to this repository occurred in the SCRIPT project. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Reproducible research 
URL https://github.com/CSTR-Edinburgh/Ossian
 
Title Snickery 
Description Snickery contains the code used to build the systems proposed in the papers "Exemplar-based speech waveform generation" (Watts, Valentini-Botinhao, Espic and King, Interspeech 2018) and "Exemplar-based speech waveform generation for text-to-speech". The simplest entry points, script/train_simple.py and script/synth_simple.py, build only a restricted class of system (selection of epoch-based fragments, greedy search only); they can be used to replicate the system proposed in the first of the two papers. A toy sketch of that greedy search appears after this record.
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact Reproducible research 
URL https://github.com/CSTR-Edinburgh/snickery
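For orientation, the toy sketch below illustrates the greedy search mentioned in the description: each target frame is matched to the database fragment that minimises a target cost plus a join cost with the previously selected fragment. It runs on random data; the feature dimensions, cost weights, and names are invented for illustration and are not the repository's own.

    # Toy sketch of greedy exemplar selection: a random "database" of
    # fragments, each with target features and (simplified, one set per
    # fragment) boundary features for the join cost. Not the actual code.
    import numpy as np

    rng = np.random.default_rng(1)
    db_feats = rng.normal(size=(1000, 60))   # per-fragment target features
    db_joins = rng.normal(size=(1000, 12))   # per-fragment boundary features

    def greedy_select(targets: np.ndarray, join_weight: float = 0.5) -> list:
        """Greedily pick one database fragment index per target vector."""
        chosen = []
        for tgt in targets:
            target_cost = np.sum((db_feats - tgt) ** 2, axis=1)
            if chosen:   # mismatch against the previously chosen fragment
                join_cost = np.sum((db_joins - db_joins[chosen[-1]]) ** 2, axis=1)
            else:
                join_cost = 0.0
            chosen.append(int(np.argmin(target_cost + join_weight * join_cost)))
        return chosen

    indices = greedy_select(rng.normal(size=(20, 60)))
    print(indices)   # the output waveform concatenates the selected fragments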
 
Title Waffler 
Description Waffler contains the code used to build the systems proposed in the paper "Speech waveform reconstruction using convolutional neural networks with noise and periodic inputs" (Watts, Valentini-Botinhao and King, ICASSP 2019). The repository's instructions explain how to produce a system comparable to the new system (P0) proposed in that paper. A minimal sketch of the idea behind the noise and periodic inputs appears after this record.
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact Reproducible research 
URL https://github.com/CSTR-Edinburgh/waffler
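To convey the idea behind the paper, the PyTorch sketch below builds a small convolutional network whose two conditioning streams are a sinusoid at the target F0 (the periodic input) and Gaussian noise. It illustrates the input design only; it is not the paper's P0 architecture, and all layer sizes are invented.

    # Toy waveform generator conditioned on a periodic input plus a noise
    # input, in the spirit of (but much smaller than) the system in the paper.
    import math
    import torch
    import torch.nn as nn

    class NoisePeriodicCNN(nn.Module):
        def __init__(self, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(2, hidden, kernel_size=9, padding=4),   # 2 input channels
                nn.Tanh(),
                nn.Conv1d(hidden, hidden, kernel_size=9, padding=4),
                nn.Tanh(),
                nn.Conv1d(hidden, 1, kernel_size=1),              # 1 output channel
            )

        def forward(self, periodic: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
            x = torch.stack([periodic, noise], dim=1)   # (batch, 2, samples)
            return self.net(x).squeeze(1)               # (batch, samples)

    sr, f0 = 16000, 120.0                    # sample rate (Hz), target pitch (Hz)
    t = torch.arange(sr) / sr                # one second of timestamps
    periodic = torch.sin(2 * math.pi * f0 * t).unsqueeze(0)   # sinusoid at F0
    noise = torch.randn(1, sr)
    print(NoisePeriodicCNN()(periodic, noise).shape)          # torch.Size([1, 16000])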
 
Company Name SPEAKUNIQUE LIMITED 
Description Commercialisation of voice reconstruction technology. Some of this builds on work done in SCRIPT, and the CTO (Oliver Watts) worked on SCRIPT. 
Year Established 2018 
Impact Product release expected 2020
Website https://www.speakunique.org
 
Description Does "end-to-end" speech synthesis mean we don't need text processing or signal processing any more? 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited talk in Dublin
Year(s) Of Engagement Activity 2019
 
Description Finding your (artificial) voice 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact A Pint of Science
Year(s) Of Engagement Activity 2017
 
Description If you lose your voice, how can you speak? 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Undergraduate students
Results and Impact Invited talk at ULAB 2018
Year(s) Of Engagement Activity 2018
 
Description Is identifying people using their voice a good idea? 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited talk in Trento
Year(s) Of Engagement Activity 2018
 
Description Multiple talks in Japan 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A series of 4 invited talks in Nagoya, Japan
Year(s) Of Engagement Activity 2018
 
Description Speech Synthesis 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Undergraduate students
Results and Impact Invited talk in York
Year(s) Of Engagement Activity 2018
 
Description What is "end-to-end" text-to-speech synthesis? 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Undergraduate students
Results and Impact Talk to student society at Lancaster University
Year(s) Of Engagement Activity 2019