SCRIPT: Speech Synthesis for Spoken Content Production

Lead Research Organisation: University of Edinburgh
Department Name: Centre for Speech Technology Research

Abstract

The cost of producing dynamically updated media content - such as online video news packages - across multiple languages is very high. Maintaining substantial teams of journalists per language is expensive and inflexible. Modern media organisations like the BBC or the Financial Times need a more agile approach: they must be able to react quickly to changing world events (e.g., breaking news or emerging markets), dynamically allocating their limited resources in response to external demands. Ideally, they would like to create 'pop-up' services and products in previously unsupported languages, then scale them up or down later.

The government has set the BBC a target of reaching a global audience of 500 million people by 2022, compared with today's 308 million. The only way to reach such a huge audience is through new language services and efficient production techniques. Text-to-speech - which automatically produces speech from text - offers an attractive solution to this challenge, and the BBC have identified computer-assisted translation and text-to-speech as key technologies that will provide them with new ways of creating and reversioning their content across many languages.

This project's objectives are to push text-to-speech technology towards "broadcast quality" computer-generated speech (i.e., good enough for the BBC to broadcast) in many languages, and to make it cheap and easy to add more languages later. We will do this by combining and extending several distinct pieces of our previous basic research on text-to-speech. We will use the latest data-driven machine learning techniques, and extend them to produce much higher quality output speech. At the same time, we will enable human control over the speech. This will allow the user (e.g., a BBC journalist) to adjust the speech to make sure the quality and the speaking style are right for their purposes (e.g., correcting the pronunciation of a difficult word, or putting emphasis in the right place).
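To make the idea of editorial control concrete, here is a minimal sketch using the W3C SSML markup standard, which many text-to-speech engines accept. It illustrates the kind of control described above and is not the project's own mechanism; the place name, its IPA transcription, and the sentence are invented examples.

    # A minimal sketch of editorial control over synthetic speech, expressed as
    # W3C SSML markup (https://www.w3.org/TR/speech-synthesis11/).
    # Illustrative only: this is not the control mechanism built in the project.

    def correct_pronunciation(word: str, ipa: str) -> str:
        """Wrap a word in an SSML <phoneme> tag giving its IPA pronunciation."""
        return f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>'

    def emphasise(phrase: str, level: str = "strong") -> str:
        """Wrap a phrase in an SSML <emphasis> tag."""
        return f'<emphasis level="{level}">{phrase}</emphasis>'

    # A journalist fixes the pronunciation of a difficult name and places the
    # emphasis where it belongs (name and transcription are invented examples):
    sentence = (
        '<speak>The situation in '
        + correct_pronunciation("Nakuru", "nɑˈkuːruː")
        + ' remains ' + emphasise("highly uncertain") + '.</speak>'
    )
    print(sentence)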

The technology we will create for the likes of the BBC will also enable smaller companies and other organisations, state bodies, charities, and individuals to rapidly create high-quality spoken content, in whatever language or domain they are operating. We will work with other types of organisation during the project, to make sure that the technology we create has broad appeal and will be useful to a wide range of companies and individuals.

Planned Impact

Who might benefit from this research?
-----------------------------------------------------------

This project has significant potential for almost immediate impact on the broadcast and wider media/news industries, starting with the BBC use case, but quickly moving on to address use cases proposed by the Financial Times and Deutsche Welle (both of whom will sit on our Advisory Board). Their UK and global audiences will benefit from news and information becoming available in more languages.

Smaller companies will benefit too, because they will have access to a cost-effective way to add new languages to their products and services. Currently, small and medium-sized companies simply cannot afford to commission a bespoke language front end from a commercial text-to-speech provider, because of the large amount of manual work this requires. With our new technology, the cost of commissioning a new language will be dramatically lower.

Commercial providers of text-to-speech (e.g., our collaborators ReadSpeaker) will benefit from access to new techniques for text-to-speech synthesis. They will find it easier to compete with multinationals (e.g., Nuance) because they will be able to bring cost-effective language products to the market, and will be able to create bespoke products more quickly and cheaply than at present.

The UK as a whole will benefit, as a consequence of the BBC World Service's ability to reach a wider global audience. As noted in the proposal, the BBC World Service is an important way for the UK to exert soft power around the world. Our technology will enable the BBC to do this in many more languages than at present.

The availability of affordable text-to-speech for the great many languages spoken in, for example, Africa could benefit development in many countries: it would be useful for disseminating agricultural or health information from governments or the United Nations.


How might they benefit from this research?
-----------------------------------------------------------

Our goal is to deliver economic impact for the BBC, and societal impact for their audience, within the duration of the project. Other media organisations, and their respective audiences, will benefit in the same way.

Commercial providers of text-to-speech can incorporate our new techniques into their products and so will be able to penetrate new markets of two types: 1) markets demanding very high quality, which our methods for editorial control will deliver; 2) markets requiring provision of many languages, especially the hundreds or thousands of languages that currently have no prospect of text-to-speech provision.

Delivering longer-term benefits to people in developing countries is challenging, even when done through well-established not-for-profit organisations, such as the United Nations Development Programme. Dr. John Quinn will sit on our Advisory Board and provide his expertise in overcoming barriers to deployment of spoken language technology in developing countries.
 
Description Amongst other goals, the project pioneered the use of "human-in-the-loop" corrections for computer-generated speech. Within the project, the application was for the BBC to produce broadcasts in many languages (especially on the World Service). After the grant, our approaches have found applications in other areas too. Since the grant finished, our understanding of the key issues in computer-generated speech from text input (called "Text-to-Speech") has continued to deepen, building on what we learned in that project, and we now have a very clear understanding of why additional inputs (not contained in the text) are essential.
Exploitation Route Additional human-provided inputs are now becoming commonplace in commercial applications of Text-to-Speech.
Sectors Creative Economy, Digital/Communication/Information Technologies (including Software), Other

 
Description The SpeakUnique spinout benefited from advances in core speech synthesis technology that we made in SCRIPT. Papercup Technologies is also using human-in-the-loop corrections to synthetic speech.
First Year Of Impact 2019
Sector Digital/Communication/Information Technologies (including Software), Healthcare
Impact Types Societal, Economic

 
Description Foundations for Expressive Speech Synthesis
Amount $194,600 (USD)
Organisation Samsung 
Sector Private
Country Korea, Republic of
Start 01/2019 
End 12/2019
 
Description Foundations for Expressive Speech Synthesis (Year 2)
Amount $191,950 (USD)
Organisation Samsung 
Sector Private
Country Korea, Republic of
Start 01/2020 
End 12/2020
 
Description BBC 
Organisation British Broadcasting Corporation (BBC)
Country United Kingdom 
Sector Public 
PI Contribution Project partner
Collaborator Contribution Provision of data
Impact See publications - some are based on the BBC data
Start Year 2016
 
Title Merlin 
Description Merlin is a neural network (NN) based speech synthesis system developed at the Centre for Speech Technology Research (CSTR), University of Edinburgh. It is a toolkit for building deep neural network models for statistical parametric speech synthesis, and must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD). The system is written in Python and relies on the Theano numerical computation library. Merlin comes with recipes (in the spirit of the Kaldi automatic speech recognition toolkit) that show how to build state-of-the-art systems. Merlin is free software, distributed under the Apache License Version 2.0, allowing unrestricted commercial and non-commercial use alike. A conceptual sketch of the kind of acoustic model Merlin trains appears after this record.
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Reproducible research 
URL https://github.com/CSTR-Edinburgh/merlin
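For readers unfamiliar with statistical parametric synthesis, the sketch below shows, in plain NumPy, the kind of mapping a Merlin acoustic model learns: per-frame linguistic features in, per-frame vocoder parameters out. It is a conceptual illustration only, not Merlin's actual API; the layer sizes and feature dimensions are merely typical values.

    # Conceptual sketch (plain NumPy, untrained weights) of the regression at
    # the core of Merlin-style synthesis: a feedforward network mapping
    # linguistic features (from a front end such as Festival) to vocoder
    # parameters (for a vocoder such as WORLD). Dimensions are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)

    def init_layer(n_in: int, n_out: int):
        return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)

    W1, b1 = init_layer(425, 512)   # 425 linguistic features per frame (typical)
    W2, b2 = init_layer(512, 512)
    W3, b3 = init_layer(512, 187)   # 187 acoustic features per frame (typical)

    def acoustic_model(linguistic_feats: np.ndarray) -> np.ndarray:
        """Map a (frames x 425) linguistic matrix to (frames x 187) acoustics."""
        h = np.tanh(linguistic_feats @ W1 + b1)
        h = np.tanh(h @ W2 + b2)
        return h @ W3 + b3          # linear output layer, as usual for regression

    frames = acoustic_model(rng.normal(size=(300, 425)))
    print(frames.shape)   # (300, 187): one vector of vocoder parameters per frame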
 
Title Ophelia 
Description A modified version of Kyubyong Park's dc_tts repository, which implements a variant of the system described in "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention". The guided attention idea at the heart of that system is sketched after this record.
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Reproducible research 
URL https://github.com/CSTR-Edinburgh/ophelia
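The guided attention idea can be stated briefly: because text and speech are roughly monotonically aligned, the attention matrix is penalised during training for straying from the diagonal. The NumPy sketch below follows the penalty given in the paper cited above; the bandwidth g = 0.2 is the paper's value, and the toy attention matrices are invented examples.

    # Guided attention loss, per the paper cited in the record above:
    # W[n, t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)), near zero on the
    # diagonal, so monotonic (diagonal) attention goes almost unpenalised.
    import numpy as np

    def guided_attention_weights(N: int, T: int, g: float = 0.2) -> np.ndarray:
        n = np.arange(N)[:, None] / N       # normalised text positions
        t = np.arange(T)[None, :] / T       # normalised spectrogram frames
        return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

    def guided_attention_loss(A: np.ndarray) -> float:
        """Mean penalty over an (N text positions x T frames) attention matrix A."""
        N, T = A.shape
        return float(np.mean(A * guided_attention_weights(N, T)))

    # A perfectly diagonal attention matrix incurs zero penalty:
    print(guided_attention_loss(np.eye(50)))                    # 0.0
    # A uniform (unfocused) one does not:
    print(guided_attention_loss(np.full((50, 60), 1.0 / 60)))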
 
Title Ossian 
Description Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision. Updates to this repository occurred in the SCRIPT project. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Reproducible research 
URL https://github.com/CSTR-Edinburgh/Ossian
 
Title Snickery 
Description Snickery contains the code used to build the systems proposed in the papers "Exemplar-based speech waveform generation" (Watts, Valentini-Botinhao, Espic and King, Interspeech 2018) and "Exemplar-based speech waveform generation for text-to-speech". The simplest entry points, script/train_simple.py and script/synth_simple.py, build only a restricted class of system (selection of epoch-based fragments, greedy search only); they can be used to replicate the system proposed in the first of the two papers. A toy sketch of that greedy search appears after this record.
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact Reproducible research 
URL https://github.com/CSTR-Edinburgh/snickery
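For orientation, the toy sketch below illustrates the greedy search mentioned in the description: each target frame is matched to the database fragment that minimises a target cost plus a join cost with the previously selected fragment. It runs on random data; the feature dimensions, cost weights, and names are invented for illustration and are not the repository's own.

    # Toy sketch of greedy exemplar selection: a random "database" of
    # fragments, each with target features and (simplified, one set per
    # fragment) boundary features for the join cost. Not the actual code.
    import numpy as np

    rng = np.random.default_rng(1)
    db_feats = rng.normal(size=(1000, 60))   # per-fragment target features
    db_joins = rng.normal(size=(1000, 12))   # per-fragment boundary features

    def greedy_select(targets: np.ndarray, join_weight: float = 0.5) -> list:
        """Greedily pick one database fragment index per target vector."""
        chosen = []
        for tgt in targets:
            target_cost = np.sum((db_feats - tgt) ** 2, axis=1)
            if chosen:   # mismatch against the previously chosen fragment
                join_cost = np.sum((db_joins - db_joins[chosen[-1]]) ** 2, axis=1)
            else:
                join_cost = 0.0
            chosen.append(int(np.argmin(target_cost + join_weight * join_cost)))
        return chosen

    indices = greedy_select(rng.normal(size=(20, 60)))
    print(indices)   # the output waveform concatenates the selected fragments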
 
Title Waffler 
Description Waffler contains the code used to build the systems proposed in the paper "Speech waveform reconstruction using convolutional neural networks with noise and periodic inputs" (Watts, Valentini-Botinhao and King, ICASSP 2019). The repository's instructions explain how to produce a system comparable to the new system (P0) proposed in that paper. A minimal sketch of the idea behind the noise and periodic inputs appears after this record.
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact Reproducible research 
URL https://github.com/CSTR-Edinburgh/waffler
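To convey the idea behind the paper, the PyTorch sketch below builds a small convolutional network whose two conditioning streams are a sinusoid at the target F0 (the periodic input) and Gaussian noise. It illustrates the input design only; it is not the paper's P0 architecture, and all layer sizes are invented.

    # Toy waveform generator conditioned on a periodic input plus a noise
    # input, in the spirit of (but much smaller than) the system in the paper.
    import math
    import torch
    import torch.nn as nn

    class NoisePeriodicCNN(nn.Module):
        def __init__(self, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(2, hidden, kernel_size=9, padding=4),   # 2 input channels
                nn.Tanh(),
                nn.Conv1d(hidden, hidden, kernel_size=9, padding=4),
                nn.Tanh(),
                nn.Conv1d(hidden, 1, kernel_size=1),              # 1 output channel
            )

        def forward(self, periodic: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
            x = torch.stack([periodic, noise], dim=1)   # (batch, 2, samples)
            return self.net(x).squeeze(1)               # (batch, samples)

    sr, f0 = 16000, 120.0                    # sample rate (Hz), target pitch (Hz)
    t = torch.arange(sr) / sr                # one second of timestamps
    periodic = torch.sin(2 * math.pi * f0 * t).unsqueeze(0)   # sinusoid at F0
    noise = torch.randn(1, sr)
    print(NoisePeriodicCNN()(periodic, noise).shape)          # torch.Size([1, 16000])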
 
Company Name SPEAKUNIQUE LIMITED 
Description Commercialisation of voice reconstruction technology. Some of this builds on work done in SCRIPT, and the CTO (Oliver Watts) worked on SCRIPT. 
Year Established 2018 
Impact Product release expected 2020
Website https://www.speakunique.org
 
Description Does "end-to-end" speech synthesis mean we don't need text processing or signal processing any more? 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited talk in Dublin
Year(s) Of Engagement Activity 2019
 
Description Finding your (artificial) voice 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact A Pint of Science
Year(s) Of Engagement Activity 2017
 
Description If you lose your voice, how can you speak? 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Undergraduate students
Results and Impact Invited talk at ULAB 2018
Year(s) Of Engagement Activity 2018
 
Description Is identifying people using their voice a good idea? 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited talk in Trento
Year(s) Of Engagement Activity 2018
 
Description Multiple talks in Japan 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A series of 4 invited talks in Nagoya, Japan
Year(s) Of Engagement Activity 2018
 
Description Speech Synthesis 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Undergraduate students
Results and Impact Invited talk in York
Year(s) Of Engagement Activity 2018
 
Description What is "end-to-end" text-to-speech synthesis? 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Undergraduate students
Results and Impact Talk to student society at Lancaster University
Year(s) Of Engagement Activity 2019