<?xml version="1.0" encoding="UTF-8"?><ns2:project xmlns:ns1="http://gtr.rcuk.ac.uk/gtr/api" xmlns:ns2="http://gtr.rcuk.ac.uk/gtr/api/project" xmlns:ns3="http://gtr.rcuk.ac.uk/gtr/api/fund" xmlns:ns4="http://gtr.rcuk.ac.uk/gtr/api/person" xmlns:ns5="http://gtr.rcuk.ac.uk/gtr/api/project/outcome" xmlns:ns6="http://gtr.rcuk.ac.uk/gtr/api/organisation" ns1:created="2026-06-03T15:52:43Z" ns1:href="http://gtr.ukri.org/gtr/api/projects/57EC7E1A-63FD-4FDF-BC6D-F02BAC4ECA21" ns1:id="57EC7E1A-63FD-4FDF-BC6D-F02BAC4ECA21"><ns1:links><ns1:link ns1:href="http://gtr.ukri.org/gtr/api/persons/D1B200E5-03E1-40D4-B910-755F8F781B32" ns1:rel="PM_PER"/><ns1:link ns1:href="http://gtr.ukri.org/gtr/api/organisations/D901698C-1237-43DF-B2C1-5804E3AD5E86" ns1:rel="LEAD_ORG"/><ns1:link ns1:href="http://gtr.ukri.org/gtr/api/organisations/D901698C-1237-43DF-B2C1-5804E3AD5E86" ns1:rel="PARTICIPANT_ORG"/><ns1:link ns1:end="2021-09-29T23:00:00Z" ns1:href="http://gtr.ukri.org/gtr/api/funds/0C99942E-19F4-445D-8EB7-3027DCC75925" ns1:rel="FUND" ns1:start="2020-09-30T23:00:00Z"/></ns1:links><ns2:identifiers><ns2:identifier ns2:type="RCUK">73674</ns2:identifier></ns2:identifiers><ns2:title>Highly expressive voices for machine video content localisation</ns2:title><ns2:status>Closed</ns2:status><ns2:grantCategory>Study</ns2:grantCategory><ns2:leadFunder>Innovate UK</ns2:leadFunder><ns2:abstractText>Imagine any video available in any language, with both the unique qualities of the original actors' voices, and the unique way in which they delivered their lines, preserved in the new language. This is the ambitious vision that Papercup will make reality by harnessing the latest developments in the world of machine learning.

In 2016, Google's DeepMind created the WaveNet vocoder, a revolution in speech synthesis. Before WaveNet, speech synthesis methods were either concatenative, meaning that they work by glueing together short audio samples of recorded speech, or modelled, meaning that they generate speech &quot;from scratch&quot; using a model of how the human speech production system works. Concatenative synthesis typically produced more natural-sounding voices, but with unnatural flow, because the audio samples come from unrelated sections of speech. Modelled methods tended to produce better flow, but the voices sounded robotic. WaveNet is a deep-learning method, trained directly on audio samples, that combines the natural variation of modelled methods with the natural sound of concatenative methods. This development means that synthesised speech could become essentially indistinguishable from human speech.

A vocoder (such as WaveNet), however, is not even half the story. You still have to tell it what to say, and how to say it. For a computer to do this, it must first recognise what was said in the original video, by whom, and in what way. Papercup exploits the latest developments in deep learning and has developed a patent-pending method for analysing the unique acoustic features of each speaker and the way in which they delivered their lines. Our algorithms encode this in an internal learned representation, which enables the stresses, intonation, and emotion to be transferred across languages, in a manner analogous to the way translation tools translate text from one language to another.

In this way, Papercup's approach replicates both the actors' unique vocal characteristics and their delivery. This has the potential to revolutionise the voiceover translation industry by creating faithful voiceover translations that accurately convey the original content in additional languages, and by doing so at scale and at significantly lower cost than traditional voiceover translation services that rely on voice actors.</ns2:abstractText></ns2:project>