SCRIPT: Speech Synthesis for Spoken Content Production

Lead Research Organisation: University of Edinburgh
Department Name: Centre for Speech Technology Research


The cost of producing dynamically updated media content - such as online video news packages - across multiple languages is very high. Maintaining substantial teams of journalists per language is expensive and inflexible. Modern media organisations like the BBC or the Financial Times need a more agile approach: they must be able to react quickly to changing world events (e.g., breaking news or emerging markets), dynamically allocating their limited resources in response to external demands. Ideally, they would like to create 'pop-up' services and products in previously unsupported languages, and then scale them up or down later.

The government has set the BBC a target of reaching a global audience of 500 million people by 2022, compared with today's 308 million. The only way to reach such a huge audience is through new language services and efficient production techniques. Text-to-speech - which automatically produces speech from text - offers an attractive solution to this challenge, and the BBC have identified computer-assisted translation and text-to-speech as key technologies that will provide them with new ways of creating and reversioning their content across many languages.

This project's objectives are to push text-to-speech technology towards "broadcast quality" computer-generated speech (i.e., good enough for the BBC to broadcast) in many languages, and to make it cheap and easy to add more languages later. We will do this by combining and extending several distinct pieces of our previous basic research on text-to-speech. We will use the latest data-driven machine learning techniques, and extend them to produce much higher quality output speech. At the same time, we will enable human control over the speech. This will allow the user (e.g., a BBC journalist) to adjust the speech to make sure the quality and the speaking style are right for their purposes (e.g., correcting the pronunciation of a difficult word, or putting emphasis in the right place).

The technology we will create for the likes of the BBC will also enable smaller companies and other organisations, state bodies, charities, and individuals to rapidly create high-quality spoken content, in whatever language or domain they are operating. We will work with other types of organisation during the project, to make sure that the technology we create has broad appeal and will be useful to a wide range of companies and individuals.

Planned Impact

Who might benefit from this research?

This project has significant potential for almost immediate impact on the broadcast and wider media / news industries, starting with the BBC use case, but quickly moving on to address use cases proposed by the Financial Times and Deutsche Welle (who will sit on our Advisory Board). Their UK and global audiences will benefit from news and information becoming available in more languages.

Smaller companies will benefit too, because they will have access to a cost-effective way to add new languages to their products and services. Currently, small and medium-sized companies simply cannot afford to commission a bespoke language front end from a commercial text-to-speech provider, due to the large amount of manual work that this requires. With our new technology, the cost of commissioning a new language will be dramatically lower.

Commercial providers of text-to-speech (e.g., our collaborators ReadSpeaker) will benefit from access to new techniques for text-to-speech synthesis. They will find it easier to compete with multinationals (e.g., Nuance) because they will be able to bring cost-effective language products to the market, and will be able to create bespoke products more quickly and cheaply than at present.

The UK as a whole will benefit, as a consequence of the BBC World Service's ability to reach a wider global audience. As noted in the proposal, the BBC World Service is an important way for the UK to exert soft power around the world. Our technology will enable the BBC to do this in many more languages than at present.

The availability of affordable text-to-speech for the great many languages spoken in, for example, Africa could benefit development in many countries. It would be useful for disseminating, for example, agricultural or health information from governments or the United Nations.

How might they benefit from this research?

Our goal is to deliver economic impact for the BBC, and societal impact for their audience, within the duration of the project. Other media organisations, and their respective audiences, will benefit in the same way.

Commercial providers of text-to-speech can incorporate our new techniques into their products and so will be able to penetrate new markets of two types: 1) markets demanding very high quality, which our methods for editorial control will deliver; 2) markets requiring provision of many languages, especially the hundreds and thousands of languages that currently have no prospect of text-to-speech provision.

Delivering longer-term benefits to people in developing countries is challenging, even when done through well-established not-for-profit organisations, such as the United Nations Development Programme. Dr. John Quinn will sit on our Advisory Board and provide his expertise in overcoming barriers to deployment of spoken language technology in developing countries.


Description: Foundations for Expressive Speech Synthesis
Amount: $194,600 (USD)
Organisation: Samsung
Sector: Private
Country: Global
Start: 01/2019
End: 12/2019