Deep architectures for statistical speech synthesis

Lead Research Organisation: University of Edinburgh
Department Name: Centre for Speech Technology Research

Abstract

Speech synthesis is the conversion of written text into speech output. Applications range from telephone dialogue systems to computer games and clinical applications. Current speech synthesis systems have a very limited range of different voices available, because creating them is complex and expensive.

Unfortunately, that is a big problem for many interesting applications, including one we are focusing on in this proposal: assistive communication aids for people with vocal problems due to Motor Neurone Disease and other conditions. At the moment, these people are forced to use devices with inappropriate voices, very often in the wrong accent and sometimes even of the wrong sex! This is a disincentive for them to communicate, even with their own family, since they do not "own" the voice and it does not reflect their identity. The voice is an integral part of identity, and we are creating the technology to allow people to communicate in their own voice, when their natural speech has become hard to understand or they can no longer speak at all.

The technology we will develop has a lot of other applications too: it will enable a speech synthesiser to adjust not only the speaker identity but many other properties too. For example, adjusting speaking effort will simulate what human talkers do in noisy conditions to make their speech more intelligible. Our starting point is a technique we have pioneered, called speaker adaptation.

Speaker adaptation has proven highly successful in enabling flexible transformation of the characteristics of a text-to-speech synthesis system, based on a small amount of recorded speech. It can be used to change the characteristics of the speech to a different speaker or speaking style. However, current methods do not use any deep knowledge about speech and do not generalise across similar situations. This makes them considerably less natural and flexible than human speech production, in which talkers control their speech effortlessly, drawing simply on prior experience. For instance, we effortlessly adapt our speech in noisy environments, relative to quiet ones, in order to increase intelligibility. The adaptation techniques that we have pioneered are completely automatic, but they do not allow this prior knowledge to be incorporated in a straightforward way.
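
A minimal sketch of the flavour of this technique, assuming an MLLR-style linear transform applied to the Gaussian mean vectors of an average-voice model (NumPy only; the function names and the single global transform are illustrative assumptions, not our actual implementation):

```python
import numpy as np

def estimate_mllr_transform(adapt_frames, state_means, state_posteriors):
    """Least-squares sketch of MLLR mean adaptation with one global transform
    (assumes identity covariances; illustrative only).
    adapt_frames:     (T, dim) acoustic frames from the target speaker
    state_means:      (S, dim) means of the average-voice model
    state_posteriors: (T, S) soft state-occupancy weights per frame
    """
    dim = state_means.shape[1]
    ext_means = np.hstack([np.ones((state_means.shape[0], 1)), state_means])  # xi = [1; mu]
    G = np.zeros((dim + 1, dim + 1))  # accumulates gamma * xi xi^T
    K = np.zeros((dim + 1, dim))      # accumulates gamma * xi o^T
    for t, o_t in enumerate(adapt_frames):
        for s, gamma in enumerate(state_posteriors[t]):
            xi = ext_means[s]
            G += gamma * np.outer(xi, xi)
            K += gamma * np.outer(xi, o_t)
    return np.linalg.solve(G, K).T    # W, shape (dim, dim + 1)

def adapt_means(state_means, W):
    """Shift every average-voice mean towards the target speaker: mu' = W [1; mu]."""
    ext_means = np.hstack([np.ones((state_means.shape[0], 1)), state_means])
    return ext_means @ W.T
```

Because only a single dim x (dim + 1) matrix has to be estimated, a few minutes of recorded speech from the target speaker is enough, which is why adaptation is far cheaper than building a new voice from scratch.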

In some preliminary work, we have developed a model which includes information about the movement of the speech articulators: the tongue, lips and so on. Then, using our knowledge of how humans alter their speech production in the presence of noise (hyper- & hypo-articulation), we have demonstrated that it is possible to improve the intelligibility of synthetic speech in noise.
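
To give a concrete flavour of how an articulatory layer enables this kind of control, here is an illustrative sketch (the simple least-squares mapping and all names are assumptions, not the statistical model actually used in the project): it fits a regression from articulatory trajectories to acoustic features, then exaggerates or reduces the articulatory excursions before mapping back, mimicking hyper- and hypo-articulation.

```python
import numpy as np

def fit_articulatory_to_acoustic(articulatory, acoustic):
    """Fit a least-squares map from articulatory trajectories (e.g. tongue and
    lip coordinates per frame) to acoustic features (e.g. mel-cepstra).
    Purely illustrative; the real system uses a statistical synthesis model."""
    X = np.hstack([articulatory, np.ones((articulatory.shape[0], 1))])  # add bias column
    W, *_ = np.linalg.lstsq(X, acoustic, rcond=None)
    return W

def synthesise_with_effort(articulatory, W, effort=1.2):
    """Scale articulatory excursions around their mean to mimic hyper-articulation
    (effort > 1) or hypo-articulation (effort < 1), then map to acoustics."""
    centre = articulatory.mean(axis=0, keepdims=True)
    modified = centre + effort * (articulatory - centre)
    X = np.hstack([modified, np.ones((modified.shape[0], 1))])
    return X @ W
```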

The current proposal is to extend and generalise this preliminary work, in order to integrate many other types of knowledge about human speech into this model. We will develop a new model which allows us to include more information about how speech is produced, as well as information about how it is perceived and how external factors, such as background noise, affect speech.

One important application of this technology is to create personalised speech synthesis for people with disordered speech (caused by Motor Neurone Disease, for example). Current technology for creating voices does not work for these people, because their speech is usually already disordered. Our technique can actually correct this, and produce speech which sounds like the person, but is more intelligible than their current natural speech. We have already produced a proof-of-concept system demonstrating that this works. The current proposal will make the technology available and affordable to a wide range of people.

Planned Impact

Societal Impact

Our research into personalised speech technology for assistive applications will significantly improve the quality of life of people with communication disorders, enabling them to play a full part in society. The voice is such an integral part of identity that when it is damaged, sufferers may withdraw from social interaction, even with their own family. Ironically, current voice communication aids compound this effect because of their small and inappropriate range of voices, sometimes of the wrong gender and almost always of the wrong accent (US-accented voices are the default, even in the UK market). In contrast, our new technology provides a voice that sounds like the user. This is something long requested by the users of AAC devices: for them, speech synthesis is not just an optional extra for reading out text, but a means of social communication that carries their identity.

We will conduct voice reconstruction trials at both the Euan MacDonald Centre for Motor Neurone Disease Research and the Anne Rowling Regenerative Neurology Clinic, under the approval of NHS Lothian, aiming for 50 new patients annually. In addition, our creation of a 'voice banking' service will raise awareness amongst the public of more general issues surrounding vocal health, which is important in itself but can also be an early indicator of other problems. The MND Association of Scotland has promised to help us raise awareness of the service and its benefits nationwide.


Economic impact

I am a member of the teams developing the two main free open-source research software packages for speech synthesis: HTS and Festival. Festival is included in most major Linux distributions. HTS has been used in many academic institutions, and there are products based on HTS on the market around the world. These two toolkits are very influential and provide an immediate pathway to impact.

The outcomes of the proposed fellowship will be released under open-source licences in the form of software and data. The proposed deep model will add a new capability to the HTS toolkit: controllability of synthetic speech. Control is one of the fundamental challenges facing speech synthesis, and if this problem can be solved, many new applications become possible, leading to new commercial opportunities and economic impact.

The proposed deep multi-layer models will introduce a new method of direct and detailed control, based on specification of articulatory, formant, loudness or glottal features. This provides control both over the overall speaking style and over local properties. Commercial applications that will benefit from this include speech output in noisy environments (e.g., in-car navigation), computer games and other applications requiring more variety and expressivity in their speech output.


Academic impact

Our new model combines the controllability of conventional articulatory synthesisers such as "VocalTractLab" and formant synthesisers such as the Klatt model (the basis of DECtalk, as used by Prof. Stephen Hawking) with the automation and quality of modern statistical speech synthesis. This will be very useful in other fields such as speech perception and phonetics research, where the Klatt model (which produces very poor quality speech) is currently the main tool. Another important layer is the auditory layer, and we expect that the links we will make between speech audition and production will provide novel capabilities for speech synthesisers.

Publications

 
Description As planned in the original proposal, we have been exploring the usefulness of a range of possible vocal tract layers, glottal layers and external information layers, such as background noise, within the new statistical, speech-production-oriented speech synthesis framework.

These achievements have significantly changed the meaning and role of speech synthesis. Speech synthesis used to be just a function for reading out given text. Thanks to the outcomes of the current project, it now offers useful and advanced functionality that goes clearly beyond that.

It has 'ears' to listen to its environment, a 'brain' to mimic or clone somebody's voice, and a 'tongue' and 'glottis' like a human. This makes TTS applications such as spoken dialogue systems, robots and assistive technology more attractive and meaningful to society and to future researchers. For instance, we have reconstructed the voices of around a hundred MND patients and confirmed that the new speech synthesis can change their quality of life.
Exploitation Route As described earlier, the findings of our research project make applications that use speech synthesis, such as spoken dialogue systems, robots and assistive technology, more attractive and meaningful.

Meanwhile, these attractive speech synthesis techniques are starting to bring us massive amounts of voice data - more than has ever been used for speech synthesis before - and will enable us to tackle various challenging new research topics in speech synthesis.

All subcomponents of conventional systems currently assume a fixed amount of voice and text data, processed in "batch mode". Ideally, the statistical speech models at the core of the system should continuously improve as this incoming data arrives: this is not possible using current batch-based approaches. New algorithms need to be found to take advantage of such a massive and never-ending data stream.
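
As a toy illustration of the contrast between batch and streaming estimation (a standard incremental update of Gaussian sufficient statistics, not an algorithm developed in this project), the sketch below refines a model component every time a new chunk of voice data arrives, without re-training from scratch:

```python
import numpy as np

class StreamingGaussian:
    """Running estimate of a diagonal Gaussian over acoustic feature frames.
    Sufficient statistics are accumulated incrementally, so the model improves
    whenever new voice data arrives instead of being re-trained in batch."""

    def __init__(self, dim):
        self.n = 0
        self.feat_sum = np.zeros(dim)
        self.feat_sq_sum = np.zeros(dim)

    def update(self, frames):
        """Fold a new chunk of frames (num_frames x dim) into the statistics."""
        self.n += frames.shape[0]
        self.feat_sum += frames.sum(axis=0)
        self.feat_sq_sum += (frames ** 2).sum(axis=0)

    @property
    def mean(self):
        return self.feat_sum / max(self.n, 1)

    @property
    def variance(self):
        return self.feat_sq_sum / max(self.n, 1) - self.mean ** 2
```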

This will be a new research area. Advances in this area would benefit the next generation of personalised user interfaces, and also approaches to analyse streams of speech and audio data which are based on similar statistical modelling approaches. Areas of impact will thus extend to multimodal user interfaces, and analysis and indexing of online and broadcast media.

This is an excellent fit to EPSRC's "Towards an Intelligent Information Infrastructure" cross-ICT priority theme, and is strongly linked to proposed activities in "Data to Knowledge".
Sectors Digital/Communication/Information Technologies (including Software),Healthcare

URL http://researchmap.jp/read0205283/?lang=english
 
Description Speech synthesis is the conversion of written text into speech output. Applications range from telephone dialogue systems to computer games and clinical applications. Current speech synthesis systems have a very limited range of different voices available, because creating them is complex and expensive. The aim of this project was to extend and generalise state-of-the-art speech synthesis in order to integrate many other types of knowledge about human speech into the models. We have developed a series of new statistical models which allow us to include more information about how speech is produced, as well as information about how it is perceived and how external factors, such as background noise, affect speech. One important application of this technology is to create personalised speech synthesis for people with disordered speech (caused by Motor Neurone Disease, for example). Through the clinical trials we have carried out, our technique has been shown to correct such disorders and to produce speech which sounds like the person but is more intelligible than their current natural speech.

Below are a few examples of the achievements through which we have integrated other types of knowledge about human speech into the models.

First, we proposed and published a technique that exploits articulation information in speech synthesis more effectively than before: Zhenhua Ling, Korin Richmond, Junichi Yamagishi, "Articulatory control of HMM-based parametric speech synthesis using feature-space-switched multiple regression", IEEE Transactions on Audio, Speech, and Language Processing, volume 21, issue 1, pp. 207-219, January 2013.

We also published a journal paper describing a novel framework that makes better use of glottal information from speech production: João P. Cabral, Korin Richmond, Junichi Yamagishi, and Steve Renals, "Glottal Spectral Separation for Speech Synthesis", IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 2, pp. 195-208, April 2014.

We further proposed a way to adaptively use information about the noise environment to change the speech synthesis output so that intelligibility is increased automatically: Cassia Valentini-Botinhao, Junichi Yamagishi, Simon King, Ranniery Maia, "Intelligibility enhancement of HMM-generated speech in additive noise by modifying Mel cepstral coefficients to increase the Glimpse Proportion", Computer Speech & Language, volume 28, issue 2, March 2014, pp. 665-686. (A simplified illustration of the Glimpse Proportion metric is sketched after this entry.)

For the clinical application of speech synthesis, we have built infrastructure that allows clinicians to automatically construct personalised voices for MND patients from speech recordings, together with a new communication app that makes these voices available on iPad/iPhone devices. We have delivered personalised voices to about 100 MND patients as personalised communication devices, and most of them gave us very positive feedback about their quality of life. This is strong evidence of the social impact of the fundamental research we have carried out; the work has been covered many times on TV, radio and in newspapers. The result will be published as a chapter of the following book:
Christophe Veaux, Junichi Yamagishi, Simon King, Shuna Colville, Philippa Rewaj, Siddharthan Chandran, Gergely Bakos, "Speech Synthesis Technologies for Individuals with Vocal Disabilities: Voice banking and voice reconstruction", in Evaluating the Role of Speech Technology in Medical Case Management, Hemant Patil and Manisha Kulshreshtha (Eds), De Gruyter Studium.
Sector Digital/Communication/Information Technologies (including Software),Healthcare
Impact Types Societal,Economic
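
For readers unfamiliar with the Glimpse Proportion objective mentioned in the description above, the following sketch computes a simplified, STFT-based version of the metric (the published work uses an auditory filterbank, so this is only an approximation of the idea); the intelligibility enhancement modifies the synthetic speech so that this proportion increases for a given noise.

```python
import numpy as np

def glimpse_proportion(speech_mag, noise_mag, local_snr_db=3.0):
    """Simplified Glimpse Proportion: the fraction of time-frequency cells in
    which the speech magnitude exceeds the noise magnitude by at least
    `local_snr_db` dB. Inputs are magnitude spectrograms of identical shape
    (frequency bins x frames); an STFT-based stand-in for the
    auditory-filterbank definition used in the literature."""
    eps = 1e-12
    local_snr = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
    return float((local_snr > local_snr_db).mean())
```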

 
Title Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015) Database 
Description The database has been used in the first Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015). Genuine speech is collected from 106 speakers (45 male, 61 female) and with no significant channel or background noise effects. Spoofed speech is generated from the genuine data using a number of different spoofing algorithms. The full dataset is partitioned into three subsets, the first for training, the second for development and the third for evaluation. More details can be found in the evaluation plan in the summary paper. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact Automatic speaker verification (ASV) offers a low-cost and flexible biometric solution to person authentication. While the reliability of ASV systems is now considered sufficient to support mass-market adoption, there are concerns that the technology is vulnerable to spoofing, also referred to as presentation attacks. Spoofing refers to an attack whereby a fraudster attempts to manipulate a biometric system by masquerading as another, enrolled person. Acknowledged vulnerabilities include attacks through impersonation, replay, speech synthesis and voice conversion. This database has been used for the 2015 ASVspoof challenge, which aims to encourage further progress through (i) the collection and distribution of a standard dataset with varying spoofing attacks implemented with multiple, diverse algorithms and (ii) a series of competitive evaluations. The first ASVspoof challenge was held during the 2015 edition of INTERSPEECH in Dresden, Germany. The challenge has been designed to support, for the first time, independent assessments of vulnerabilities to spoofing and of countermeasure performance and to facilitate the comparison of different spoofing countermeasures on a common dataset, with standard protocols and metrics. 
 
Title CSTR VCTK Corpus -- Multi-speaker English Corpus for CSTR Voice Cloning Toolkit 
Description This CSTR VCTK Corpus includes speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences, most of which were selected from a newspaper, plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. The newspaper texts were taken from The Herald (Glasgow), with permission from Herald & Times Group. Each speaker reads a different set of the newspaper sentences, where each set was selected using a greedy algorithm designed to maximise contextual and phonetic coverage. The Rainbow Passage and elicitation paragraph are the same for all speakers. The Rainbow Passage can be found in the International Dialects of English Archive (http://web.ku.edu/~idea/readings/rainbow.htm). The elicitation paragraph is identical to the one used for the Speech Accent Archive (http://accent.gmu.edu); details of the Speech Accent Archive can be found at http://www.ualberta.ca/~aacl2009/PDFs/WeinbergerKunath2009AACL.pdf. All speech data were recorded using an identical recording setup: an omni-directional head-mounted microphone (DPA 4035), a 96 kHz sampling frequency at 24 bits, in a hemi-anechoic chamber at the University of Edinburgh. All recordings were converted to 16 bits, downsampled to 48 kHz using SPTK, and manually end-pointed. This corpus was recorded for the purpose of building HMM-based text-to-speech synthesis systems, especially speaker-adaptive HMM-based speech synthesis using average voice models trained on multiple speakers and speaker adaptation technologies. 
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? Yes  
Impact This is the first free corpus designed for, and appropriate for, speaker-adaptive speech synthesis. It is becoming a standard database for building and comparing speaker-adaptive speech synthesis systems and voice conversion systems, and has even been used for speaker verification systems. 
URL http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
 
Title Spoofing and Anti-Spoofing (SAS) corpus v1.0 
Description This dataset is associated with the paper "SAS: A speaker verification spoofing database containing diverse attacks", which presents the first version of a speaker verification spoofing and anti-spoofing database, named the SAS corpus. The corpus includes nine spoofing techniques: two based on speech synthesis and seven on voice conversion. We designed two protocols, one for standard speaker verification evaluation and the other for producing spoofing materials. Hence, they allow the speech synthesis community to produce spoofing materials incrementally without knowledge of speaker verification spoofing and anti-spoofing. To provide a set of preliminary results, we conducted speaker verification experiments using two state-of-the-art systems. Without any anti-spoofing techniques, the two systems were found to be extremely vulnerable to the spoofing attacks implemented in the SAS corpus. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact This SAS database is the first version of a standard dataset for spoofing and anti-spoofing research. Currently, the SAS corpus includes speech generated using nine spoofing methods, each of which comprises around 300000 spoofed trials. To the best of our knowledge, this is the first attempt to include such a diverse range of spoofing attacks in a single database. The SAS corpus is publicly available at no cost. 
 
Title The Voice Conversion Challenge 2016 database 
Description The Voice Conversion Challenge (VCC) 2016, one of the special sessions at Interspeech 2016, deals with speaker identity conversion, referred to as Voice Conversion (VC). The task of the challenge was speaker conversion, i.e., to transform the voice identity of a source speaker into that of a target speaker while preserving the linguistic content. Using a common dataset consisting of 162 utterances for training and 54 utterances for evaluation from each of 5 source and 5 target speakers, 17 groups working in VC around the world developed their own VC systems for every combination of the source and target speakers, i.e., 25 source-target pairs, and generated voice samples converted by the developed systems. The objective of the VCC was to compare various VC techniques on identical training and evaluation speech data. The samples were evaluated in terms of target speaker similarity and naturalness by 200 listeners in a controlled environment. This dataset consists of the participants' VC submissions and the listening test results for naturalness and similarity. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact 17 groups working in VC around the world have used this database and have developed their own VC systems. 
URL http://datashare.is.ed.ac.uk/handle/10283/2211
 
Title High-quality speech synthesizer, HTS voice 
Description High-quality speech synthesis software based on speech technologies developed during my fellowship. 
IP Reference  
Protection Copyrighted (e.g. software)
Year Protection Granted 2012
Licensed Yes
Impact I have formally licensed the high-quality speech synthesizer to two companies on a commercial basis.
 
Title Clinical trial of personalized speech synthesis voices for MND patients 
Description Adaptive speech synthesis may be used to develop personalised synthetic voices for people who have a vocal pathology. In 2009, Dr Sarah Creer from the University of Sheffield and I successfully applied it to clinical voice banking for laryngectomees (individuals who have had their larynx removed because of cancer) to reconstruct their voices. In 2010, I "implanted" the personalised synthetic voice of a patient who has motor neurone disease into their assistive communication device. Such a personalised voice can lead to far more natural communication for patients, particularly with family. A "voice reconstruction" trial has been carried out with about 100 patients in total at the Euan MacDonald Centre for MND Research and the Anne Rowling Regenerative Neurology Clinic in Edinburgh. 
Type Health and Social Care Services
Current Stage Of Development Initial development
Year Development Stage Completed 2015
Development Status Actively seeking support
Impact We have recorded about 100 MND patients at the Euan MacDonald Centre for MND Research and the Anne Rowling Regenerative Neurology Clinic in Edinburgh and have constructed personalized speech synthesizers based on their disordered voices. We have received and analyzed feedback from the patients and we have confirmed that this new speech synthesis technology can improve their quality-of-life. 
 
Title HTS ver 2.3 
Description HTS is an open-source toolkit for statistical speech synthesis. I am a member of the team developing this free, open-source research software package for speech synthesis. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact The HTS toolkit is used worldwide by both academic and commercial organisations, such as Microsoft, Nuance, Toshiba, Pentax, and Google. The number of downloads of HTS exceeds 10,000 and various commercial products using HTS are on the market. Therefore, this toolkit is a very influential platform for me to disseminate outcomes and form an immediate pathway to impact. 
URL http://hts.sp.nitech.ac.jp
 
Description Invited talk at the XHUMED event 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact Christophe Veaux gave a talk on computer-based voice reconstruction techniques for MND patients at the scientific event xHumed | Dead Good Thinking in Birmingham.
Year(s) Of Engagement Activity 2013
URL http://xhumed.co.uk