Natural Speech Technology

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

Humans are highly adaptable, and speech is our natural medium for informal communication. When communicating, we continuously adjust to other people, to the situation, and to the environment, using previously acquired knowledge to make this adaptation seem almost instantaneous. Humans generalise, enabling efficient communication in unfamiliar situations and rapid adaptation to new speakers or listeners. Current speech technology works well for certain controlled tasks and domains, but is far from natural, a consequence of its limited ability to acquire knowledge about people or situations, to adapt, and to generalise. This accounts for the uneasy public reaction to speech-driven systems. For example, text-to-speech synthesis can be as intelligible as human speech, but lacks expression and is not perceived as natural. Similarly, the accuracy of speech recognition systems can collapse if the acoustic environment or task domain changes, conditions which a human listener would handle easily. Research approaches to these problems have hitherto been piecemeal, and as a result progress has been patchy. In contrast, NST will focus on the integrated theoretical development of new joint models for speech recognition and synthesis. These models will allow us to incorporate knowledge about speakers, the environment, the communication context and awareness of the task, and will learn and adapt from real-world data in an online, unsupervised manner. This theoretical unification is already underway within the NST labs and, combined with our record of turning theory into practical state-of-the-art applications, will enable us to bring a naturalness to speech technology that is not currently attainable.

The NST programme will yield technology which (1) approaches human adaptability to new communication situations, (2) is capable of personalised communication, and (3) takes account of speaker intention and expressiveness in speech recognition and synthesis. This is an ambitious vision. Its success will be measured in terms of how the theoretical development reshapes the field over the next decade, the take-up of the software systems that we shall develop, and through the impact of our exemplar interactive applications.

We shall establish a strong User Group to maximise the impact of the project, with members concerned with clinical applications as well as more general speech technology. Members of the User Group include Toshiba, EADS Innovation Works, Cisco, Barnsley Hospital NHS Foundation Trust, and the Euan MacDonald Centre for MND Research. An important interaction with the User Group will be validating our systems on their data and tasks, discussed at an annual user workshop.

Planned Impact

Leading market analysts predict that revenues from speech technology in North America alone will reach $1 billion by 2011. Reality has lagged behind such predictions in the past because the technology has not yet been refined enough, but paradigms are shifting. The revolutionary change in connectivity and mobile computing in recent years gives rise to a number of compelling application drivers for the proposed research programme: (1) Rapid developments in mobile computing - decreasing power consumption, high network bandwidth and cloud computing - are stimulating demand for new interfaces. (2) Demographic and economic pressures mean that home care and support systems will become commonplace; such systems will benefit from personalised spoken interaction. (3) Remote meetings are becoming standard, stimulated by economic conditions and climate change; natural speech technology will enable much richer interactions. (4) As data access becomes more open, the volume of available audio data will increase exponentially; natural speech transcription will make such oceans of data searchable and structured. (5) There is a potentially huge market (entertainment, consumer apps, robotics) that would be opened up by the availability of adaptive, controllable, expressive speech synthesis. (6) Clinical applications of speech technology will be substantially enriched by the personalised systems proposed in NST. Just as these drivers have reached a critical level, the NST team has made a number of crucial breakthroughs in adaptive speech synthesis, in conversational speech transcription, and in new algorithms for robustly handling changing environments. The research potential is thus poised to meet the application drivers. Beneficiaries of the research can be found in the commercial sector (e.g., remote meeting technology, speech synthesis for computer games, speech archive search), the public sector (e.g., voice reconstruction services for the National Health Service), the third sector (e.g., charities providing support for people with neurodegenerative diseases), art and design, policy makers (e.g., investment in the use of spoken language technology can reduce travel and therefore carbon emissions; it can also enable people to live longer in their own homes, thus reducing the need for residential care services), and the general public (e.g., prospective voice banking and donation could become as commonplace and as widely known as blood donation). The programme's direct training and development impact will be large, through the PhD students and researchers who will work on the project and through researchers on associated projects drawn in alongside NST; indirectly, the training impact will be even larger through other students, researchers and visitors at the three universities, as well as through programme workshops.
 
Description The aim of the project is to significantly advance the state of the art in speech technology through the recognition and synthesis of natural speech, approaching human levels of flexibility, reliability, and fluency.

We have made advances in several areas.

1. Learning and adaptation. We have developed new approaches to learning representations for speech and language based on deep neural networks and recurrent neural networks. In contrast to previous approaches, these require less feature engineering and human design, and they have been applied to both speech recognition and speech synthesis. We have also developed new approaches for adapting systems to a new voice given just a few seconds of speech, and new factorised modelling approaches which, for example, enable us to model the effects of the talker separately from the effects of the recording channel.
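As an illustration of the kind of rapid speaker adaptation described above, the following is a minimal sketch, in PyTorch, of adapting a frozen acoustic model to a new speaker by learning per-layer scaling factors on its hidden units (in the spirit of published "learning hidden unit contributions" methods). The architecture, sizes, and synthetic adaptation data are invented for illustration and do not describe the project's actual systems.

import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, n_feats=40, n_hidden=256, n_states=100):
        super().__init__()
        self.layer1 = nn.Linear(n_feats, n_hidden)
        self.layer2 = nn.Linear(n_hidden, n_hidden)
        self.out = nn.Linear(n_hidden, n_states)
        # One learnable scaling vector per hidden layer; 2*sigmoid(0) = 1,
        # so at initialisation the unadapted model is recovered exactly.
        self.scale1 = nn.Parameter(torch.zeros(n_hidden))
        self.scale2 = nn.Parameter(torch.zeros(n_hidden))

    def forward(self, x):
        h1 = torch.relu(self.layer1(x)) * 2 * torch.sigmoid(self.scale1)
        h2 = torch.relu(self.layer2(h1)) * 2 * torch.sigmoid(self.scale2)
        return self.out(h2)

model = AcousticModel()  # in practice: a model trained on many speakers
# Freeze the speaker-independent weights; only the scaling vectors adapt.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("scale")

# A few seconds of adaptation data: acoustic frames with state labels
# (synthetic stand-ins here; real labels might come from a first-pass decode).
feats = torch.randn(300, 40)
labels = torch.randint(0, 100, (300,))
opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)
for _ in range(10):  # a handful of adaptation updates
    opt.zero_grad()
    nn.functional.cross_entropy(model(feats), labels).backward()
    opt.step()

Because only a few hundred parameters are updated while everything else stays fixed, adaptation of this kind remains cheap and comparatively robust even with seconds of data.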

2. Speech transcription. We have developed several new acoustic modelling techniques: for example, techniques that model phonetic context more efficiently, and a new approach to recognising speech captured using multiple microphones. We have also developed more accurate language models based on recurrent neural networks, and have introduced a new algorithm to automatically learn a pronunciation lexicon.
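As a toy illustration of recurrent neural network language modelling of the general kind mentioned above, the sketch below (PyTorch, with invented vocabulary size, dimensions and random data) shows the essential computation: each word is predicted from its left context, and the cross-entropy exponentiates to the perplexity used to evaluate such models.

import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab=1000, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.RNN(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, words):
        h, _ = self.rnn(self.embed(words))
        return self.out(h)  # next-word logits at every position

model = RNNLM()
seq = torch.randint(0, 1000, (8, 21))      # toy batch of word-id sequences
inputs, targets = seq[:, :-1], seq[:, 1:]  # predict each word from its history
loss = nn.functional.cross_entropy(
    model(inputs).reshape(-1, 1000), targets.reshape(-1))
print("perplexity:", loss.exp().item())    # exp(cross-entropy) = perplexity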

3. Speech synthesis. We have introduced new models for synthesising speech based on multiple average voices, and using prior information automatically extracted from talker characteristics. We have developed a new approach to characterising the perceptual effects of modelling assumptions in speech synthesis, through perceptual experiments using stimuli constructed from repeated natural speech. We have also developed new techniques for the synthesis of conversational speech, for example through automatic pause insertion.
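To make the pause-insertion idea concrete, here is a minimal sketch that inserts silent and filled pauses into fluent text at punctuation boundaries with fixed probabilities. A trained model would predict insertion points and pause types from context; the markers, rates and example sentence below are invented for illustration.

import random

FILLERS = ["<sil>", "um", "uh"]   # one silent-pause marker, two filled pauses
RATES = [0.5, 0.3, 0.2]           # illustrative relative frequencies
INSERT_PROB = 0.4                 # illustrative chance of a pause per boundary

def insert_pauses(text, seed=0):
    rng = random.Random(seed)
    out = []
    for token in text.split():
        out.append(token)
        # Treat trailing punctuation as a candidate clause boundary.
        if token[-1] in ",;." and rng.random() < INSERT_PROB:
            out.append(rng.choices(FILLERS, weights=RATES)[0])
    return " ".join(out)

print(insert_pauses("Well, I think the results, on the whole, look promising."))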

4. Applications. This work has been applied in a number of areas.
a) transcription of broadcast speech for subtitling, metadata extraction, and archive search. This is in collaboration with user group partners BBC and Red Bee Media.
b) adaptive speech recognition and dialogue management for users with speech disorders, which is currently undergoing trials in users' homes.
c) voice banking and cloning, to create personalised voice output communication aids for people with diseases such as Motor Neurone Disease and Parkinson's Disease. This is also undergoing trials with users.
Exploitation Route Our findings are already having considerable impact. In particular, we have released many of the techniques developed in the project through open source toolkits (Kaldi, HTK, HTS and Festival), which has resulted in significant take-up. Several of our techniques for speech recognition and speech synthesis are being further developed by other groups.

Our techniques have been put to use by several members of the project user group including the BBC and Ericsson (broadcast speech transcription); the Euan MacDonald Centre for Motor Neurone Disease Research, and the Motor Neurone Disease Association (voice banking); Quorate Technology (audio search and browsing); Toshiba (speech synthesis); Emotech (distant speech recognition).
Sectors Creative Economy; Digital/Communication/Information Technologies (including Software); Education; Healthcare; Government, Democracy and Justice; Culture, Heritage, Museums and Collections

URL http://www.natural-speech-technology.org
 
Description 1. Contributions to widely-used open source software including HTK, Kaldi, HTS, the CUED RNN Toolkit, and Merlin. The impact of NST on other researchers, and on industry, has been enhanced through the implementation and release, via the main open source platforms used in speech technology (HTK, Kaldi, and HTS), of many of the key models and algorithms developed in the project. NST research also resulted in the release of two widely-used open source software toolkits: the CUED RNN toolkit for speech recognition language modelling, and Merlin for neural text-to-speech synthesis. NST speech recognition was made available to researchers through the webASR system.
2. Personalised speech synthesis used for voice banking and reconstruction, and deployed in assistive technology communication aids: this was developed in collaboration with the Euan MacDonald Centre for Motor Neurone Disease Research and the Anne Rowling Clinic for Regenerative Neurology at the University of Edinburgh. This work included a successful clinical pilot study, started during the NST project, and has resulted in the formation of a spinout company, SpeakUnique.
3. Transcription of multi-genre broadcast speech. Media companies such as Ericsson/Red Bee Media and the BBC have used NST technology to automatically transcribe a wide range of broadcast speech. Red Bee Media have worked closely with the University of Edinburgh and spinout company Quorate Technology on the development of their real-time subtitling services, increasing accessibility to live television and streaming for the 11 million people with hearing loss in the UK.
4. Transcription of parliamentary proceedings. Hansard provides a "substantially verbatim" record of proceedings in both Houses of the UK Parliament - the House of Commons and the House of Lords - as well as transcripts of Select Committee sittings. Using speech recognition technology from Edinburgh NST spinout Quorate Technology, Hansard uses automatically generated transcriptions as a first draft of the official record, as well as to enable searches linking the audio, video, and transcription of parliamentary recordings.
5. Deployment of an application with English Heritage for browsing oral histories: speech recognition developed in the project was used to browse through spoken interviews in 'Duty Calls', an exhibition by English Heritage centred on events at Brodsworth Hall in World War II, and in 'Village Memories', a lottery-funded project exploring life in three South Yorkshire villages.
6. Development of academic-industry research centres. An important aspect of the project was the development of industry-focussed research, and this has resulted in the formation of a number of academic-industry research centres, including the BBC Data Science Research Partnership, the VoiceBase Centre for Speech & Language Technology at the University of Sheffield, and a joint research lab with Huawei at the University of Edinburgh, as well as collaborative projects funded by Bloomberg, Ericsson, Samsung, Toshiba, and Zoo Digital.
7. Distant speech recognition. Emotech have used the distant speech recognition technology developed in NST to build a robust commercial system used in the prototype personal robot Olly. At CES 2017 in Las Vegas, Olly became the robotics product that has won the most awards in CES history (Smart home; Drones and unmanned [sic] systems; Smart appliances; Home audio-video accessories). Speech recognition was central to this and based on University of Edinburgh research. The underlying speech recognition technology is now being used by Emotech, in partnership with Huawei, for classroom-based language learning systems, including automatic pronunciation assessment.
First Year Of Impact 2012
Sector Aerospace, Defence and Marine; Creative Economy; Digital/Communication/Information Technologies (including Software); Education; Healthcare; Government, Democracy and Justice; Culture, Heritage, Museums and Collections
Impact Types Cultural, Societal, Economic

 
Description CITIA
Geographic Reach Europe 
Policy Influence Type Participation in a guidance/advisory committee
Impact Steve Renals is founding chairperson of the EU Conversational Interaction Technologies Innovation Alliance, a group which has advised the EU on policy relating speech technology to the multilingual digital single market.
URL http://citia.eu
 
Description ROCKIT/CITIA Roadmap
Geographic Reach Europe 
Policy Influence Type Citation in other policy documents
Impact The ROCKIT/CITIA strategic roadmap for conversational interaction technologies forms the basis of a research and innovation agenda in this area. In 2014 and 2015 we constructed the roadmap to enable the conversational interaction technologies vision to be realised. The roadmapping process was carried out at the European level, connecting the strong R&D base with commercial and industrial activity and with policy makers, at the EU and national levels.
URL http://www.sharpcloud.com/ROCKIT
 
Description Adapting end-to-end speech recognition systems (year 1)
Amount £137,365 (GBP)
Organisation Samsung 
Sector Private
Country Korea, Republic of
Start 12/2018 
End 11/2019
 
Description Adapting end-to-end speech recognition systems (year 2)
Amount £113,989 (GBP)
Organisation Samsung 
Sector Private
Country Korea, Republic of
Start 12/2019 
End 11/2020
 
Description Bloomberg PhD Studentship
Amount £42,677 (GBP)
Organisation Bloomberg 
Sector Private
Country United States
Start 01/2015 
End 12/2015
 
Description EPSRC Impact Acceleration Award
Amount £37,716 (GBP)
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 02/2015 
End 09/2015
 
Description EPSRC Responsive Mode
Amount £1,402,097 (GBP)
Funding ID EP/R012180/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 03/2018 
End 02/2021
 
Description EPSRC Responsive Mode
Amount £533,268 (GBP)
Funding ID EP/P011586/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 12/2016 
End 11/2019
 
Description EU FP7-ICT-2011-1.5
Amount € 540,000 (EUR)
Funding ID 287872 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 11/2011 
End 10/2014
 
Description EU FP7-ICT-2013-10
Amount € 520,000 (EUR)
Funding ID 611092 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 12/2013 
End 11/2015
 
Description EU H2020 ICT Programme
Amount € 1,999,113 (EUR)
Funding ID 688139 
Organisation European Commission 
Department Horizon 2020
Sector Public
Country European Union (EU)
Start 02/2016 
End 01/2019
 
Description European Community's Seventh Framework Programme (FP7/2007-2013)
Amount € 1,100,000 (EUR)
Funding ID 287678 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 11/2011 
End 10/2014
 
Description European Union Seventh Framework Programme
Amount € 1,100,000 (EUR)
Funding ID 287658 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 02/2012 
End 01/2015
 
Description IIKE Early Career Research Scheme
Amount £15,000 (GBP)
Organisation University of Sheffield 
Sector Academic/University
Country United Kingdom
Start 11/2015 
End 02/2016
 
Description Innovation Seed Funding
Amount £5,000 (GBP)
Organisation University of Sheffield 
Sector Academic/University
Country United Kingdom
Start 06/2015 
End 08/2015
 
Description ItsLanguage pronunciation assessment
Amount € 75,000 (EUR)
Organisation ITSLanguage bv 
Sector Private
Country Netherlands
Start 11/2012 
End 08/2014
 
Description Leverhulme International Network
Amount £125,000 (GBP)
Organisation The Leverhulme Trust 
Sector Charity/Non Profit
Country United Kingdom
Start 01/2015 
End 12/2018
 
Description Response to Tender (1)
Amount £73,726 (GBP)
Organisation Defence Science & Technology Laboratory (DSTL) 
Sector Public
Country United Kingdom
Start 10/2012 
End 04/2013
 
Description Response to Tender (2)
Amount £98,982 (GBP)
Organisation Defence Science & Technology Laboratory (DSTL) 
Sector Public
Country United Kingdom
Start 12/2013 
End 04/2014
 
Description Response to Tender (3)
Amount £78,684 (GBP)
Organisation Defence Science & Technology Laboratory (DSTL) 
Sector Public
Country United Kingdom
Start 01/2015 
End 08/2016
 
Description The DataLab Industry PhD
Amount £102,000 (GBP)
Organisation The Datalab 
Sector Charity/Non Profit
Start 09/2016 
End 04/2020
 
Description Toshiba PhD Studentship
Amount £144,485 (GBP)
Organisation Toshiba Research Europe Ltd 
Sector Private
Country United Kingdom
Start 09/2017 
End 04/2021
 
Title MGB Challenge Speech Recognition Systems 
Description Speech recognition software, based on the open source Kaldi toolkit, was released to enable the construction of lightly supervised multi-genre broadcast speech recognition systems. 
Type Of Material Improvements to research infrastructure 
Year Produced 2015 
Provided To Others? Yes  
Impact These systems provided the baselines for the 2015 MGB Challenge 
URL http://mgb-challenge.org
 
Title Artificial Personality 
Description This dataset is associated with the paper "Artificial Personality and Disfluency" by Mirjam Wester, Matthew Aylett, Marcus Tomalin and Rasmus Dall published at Interspeech 2015, Dresden. The focus of this paper is artificial voices with different personalities. Previous studies have shown links between an individual's use of disfluencies in their speech and their perceived personality. Here, filled pauses (uh and um) and discourse markers (like, you know, I mean) have been included in synthetic speech as a way of creating an artificial voice with different personalities. We discuss the automatic insertion of filled pauses and discourse markers (i.e., fillers) into otherwise fluent texts. The automatic system is compared to a ground truth of human "acted" filler insertion. Perceived personality (as defined by the big five personality dimensions) of the synthetic speech is assessed by means of a standardised questionnaire. Synthesis without fillers is compared to synthesis with either spontaneous or synthetic fillers. Our findings explore how the inclusion of disfluencies influences the way in which subjects rate the perceived personality of an artificial voice. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
 
Title Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015) Database 
Description The database has been used in the first Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015). Genuine speech is collected from 106 speakers (45 male, 61 female) and with no significant channel or background noise effects. Spoofed speech is generated from the genuine data using a number of different spoofing algorithms. The full dataset is partitioned into three subsets, the first for training, the second for development and the third for evaluation. More details can be found in the evaluation plan in the summary paper. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact Automatic speaker verification (ASV) offers a low-cost and flexible biometric solution to person authentication. While the reliability of ASV systems is now considered sufficient to support mass-market adoption, there are concerns that the technology is vulnerable to spoofing, also referred to as presentation attacks. Spoofing refers to an attack whereby a fraudster attempts to manipulate a biometric system by masquerading as another, enrolled person. Acknowledged vulnerabilities include attacks through impersonation, replay, speech synthesis and voice conversion. This database has been used for the 2015 ASVspoof challenge, which aims to encourage further progress through (i) the collection and distribution of a standard dataset with varying spoofing attacks implemented with multiple, diverse algorithms and (ii) a series of competitive evaluations. The first ASVspoof challenge was held during the 2015 edition of INTERSPEECH in Dresden, Germany. The challenge has been designed to support, for the first time, independent assessments of vulnerabilities to spoofing and of countermeasure performance and to facilitate the comparison of different spoofing countermeasures on a common dataset, with standard protocols and metrics. 
 
Title CSTR VCTK Corpus -- Multi-speaker English Corpus for CSTR Voice Cloning Toolkit 
Description This CSTR VCTK Corpus includes speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences, most of which were selected from a newspaper, plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. The newspaper texts were taken from The Herald (Glasgow), with permission from Herald & Times Group. Each speaker reads a different set of the newspaper sentences, where each set was selected using a greedy algorithm designed to maximise the contextual and phonetic coverage. The Rainbow Passage and elicitation paragraph are the same for all speakers. The Rainbow Passage can be found in the International Dialects of English Archive (http://web.ku.edu/~idea/readings/rainbow.htm). The elicitation paragraph is identical to the one used for the speech accent archive (http://accent.gmu.edu). The details of the speech accent archive can be found at http://www.ualberta.ca/~aacl2009/PDFs/WeinbergerKunath2009AACL.pdf. All speech data was recorded using an identical recording setup: an omni-directional head-mounted microphone (DPA 4035), 96 kHz sampling frequency at 24 bits, in a hemi-anechoic chamber at the University of Edinburgh. All recordings were converted into 16 bits, were downsampled to 48 kHz based on STPK, and were manually end-pointed. This corpus was recorded for the purpose of building HMM-based text-to-speech synthesis systems, especially for speaker-adaptive HMM-based speech synthesis using average voice models trained on multiple speakers and speaker adaptation technologies. 
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? Yes  
Impact This is the first free corpus that is designed and appropriate for speaker-adaptive speech synthesis. This starts to become a standard database to build and compare speaker-adaptive speech synthesis systems and voice conversion systems. This was also used even for speaker verification systems. 
URL http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
 
Title Data Underpinning "Joint Optimisation of Tandem Systems Using Gaussian Mixture Density Neural Network Discriminative Sequence Training" 
Description Description of the speech recognition training and test data used for the experiments and its availability, together with the key speech recognition outputs and detailed scoring results used in the paper. 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
 
Title Experiment materials for "Disfluencies in change detection in natural, vocoded and synthetic speech." 
Description The current dataset is associated with the DiSS paper "Disfluencies in change detection in natural, vocoded and synthetic speech." In this paper we investigate the effect of filled pauses, a discourse marker and silent pauses in a change detection experiment in natural, vocoded and synthetic speech. In natural speech, change detection has been found to increase in the presence of filled pauses; we extend this work by replicating earlier findings and exploring the effect of a discourse marker, like, and silent pauses. Furthermore we report how the use of "unnatural" speech, namely synthetic and vocoded, affects change detection rates. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
 
Title Experiment materials for "The temporal delay hypothesis: Natural, vocoded and synthetic speech." 
Description Including disfluencies in synthetic speech is being explored as a way of making synthetic speech sound more natural and conversational. How to measure whether the resulting speech is actually more natural, however, is not straightforward. Conventional approaches to synthetic speech evaluation fall short, as listeners either are primed to prefer stimuli with filled pauses or, when not primed, prefer more fluent speech. Reaction time experiments from psycholinguistics may circumvent this issue. In this paper, we revisit one such reaction time experiment. For natural speech, delays in word onset were found to facilitate word recognition regardless of the type of delay, be it a filled pause (um), silence or a tone. We reused the materials for natural speech, and extended them to vocoded and synthetic speech. The results partially replicate previous findings. For natural and vocoded speech, if the delay is a silent pause, significant increases in the speed of word recognition are found. If the delay comprises filled pauses there is a significant increase in reaction time for vocoded speech but not for natural speech. For synthetic speech, no clear effects of delay on word recognition are found. We hypothesise this is because it takes longer (requires more cognitive resources) to process synthetic speech than natural or vocoded speech. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
 
Title Human vs Machine Spoofing 
Description Listening test materials for "Human vs Machine Spoofing Detection on Wideband and Narrowband data." They include lists of the speech material selected from the SAS spoofing database and the listeners' responses. The main data file has been split into five smaller files (labelled "aa" to "ae") for ease of download. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
 
Title Improving Interpretability and Regularisation in Deep Learning 
Description The provided .ctm and scoring .sys files correspond to the MPE systems of Table VI (Javanese) and Table X (BN) of this paper. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
 
Title MGB Challenge 
Description The MGB Challenge data was released to support the MGB Challenge evaluation of multi-genre broadcast speech recognition systems. It consists of approximately 1,600 hours of broadcast audio taken from seven weeks of BBC output across all TV channels, captions as originally broadcast on TV, accompanied by baseline lightly-supervised alignments using an ASR system with confidence measures, several hundred million words of subtitle text from BBC TV output collected over a 15-year period, and a hand-compiled British English lexicon derived from Combilex. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact This research database supported the MGB Challenge at the IEEE ASRU-2015 workshop 
URL http://mgb-challenge.org
 
Title MGB database 
Description The MGB database is the official database of the MGB challenge. It contains 2,000 hours of audio, 700 million words of transcripts plus other metadata. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact Featured in the MGB challenge and the accompanying workshop at ASRU 2015. 
 
Title REHASP 
Description Studio recording of female native British English talker producing three sets of Harvard sentences (thirty prompts), each prompt repeated forty times. Available both as unprocessed 96 kHz recordings and standardised 16 kHz files. 
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes  
Impact The following paper has been published: G. E. Henter, T. Merritt, M. Shannon, C. Mayo, and S. King, "Measuring the perceptual effects of modelling assumptions in speech synthesis using stimuli constructed from repeated natural speech," in Proc. Interspeech, 2014 
 
Title Spoofing and Anti-Spoofing (SAS) corpus v1.0 
Description This dataset is associated with the paper "SAS: A speaker verification spoofing database containing diverse attacks", which presents the first version of a speaker verification spoofing and anti-spoofing database, named the SAS corpus. The corpus includes nine spoofing techniques, two based on speech synthesis and seven on voice conversion. We designed two protocols, one for standard speaker verification evaluation and the other for producing spoofing materials; together they allow the speech synthesis community to produce spoofing materials incrementally, without knowledge of speaker verification spoofing and anti-spoofing. To provide a set of preliminary results, we conducted speaker verification experiments using two state-of-the-art systems. Without any anti-spoofing techniques, the two systems are extremely vulnerable to the spoofing attacks implemented in the SAS corpus. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact This SAS database is the first version of a standard dataset for spoofing and anti-spoofing research. Currently, the SAS corpus includes speech generated using nine spoofing methods, each of which comprises around 300000 spoofed trials. To the best of our knowledge, this is the first attempt to include such a diverse range of spoofing attacks in a single database. The SAS corpus is publicly available at no cost. 
 
Title The Voice Conversion Challenge 2016 database 
Description The Voice Conversion Challenge (VCC) 2016, one of the special sessions at Interspeech 2016, deals with speaker identity conversion, referred to as Voice Conversion (VC). The task of the challenge was speaker conversion, i.e., to transform the voice identity of a source speaker into that of a target speaker while preserving the linguistic content. Using a common dataset consisting of 162 utterances for training and 54 utterances for evaluation from each of 5 source and 5 target speakers, 17 groups working in VC around the world developed their own VC systems for every combination of the source and target speakers, i.e., 25 systems in total, and generated voice samples converted by the developed systems. The objective of the VCC was to compare various VC techniques on identical training and evaluation speech data. The samples were evaluated in terms of target speaker similarity and naturalness by 200 listeners in a controlled environment. This dataset consists of the participants' VC submissions and the listening test results for naturalness and similarity. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact 17 groups working in VC around the world have used this database and have developed their own VC systems. 
URL http://datashare.is.ed.ac.uk/handle/10283/2211
 
Title Wargames Day 2 and 3 
Description Two further days of recordings of the game Warhammer being played, yielding a total of 20 hours of transcribed speech. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact No direct impact has been recorded; however, the Kaldi team has proposed refining the system scripts included in the corpus. 
URL http://mini.dcs.shef.ac.uk/resources/sheffield-wargames-corpus/
 
Title Wargames I 
Description Recordings of groups of people playing the Warhammer game, recorded in 96 audio channels and 3 media streams, fully transcribed. 
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes  
Impact It was discussed at ASRU 2016 as a potential candidate for future tasks. The University of Sheffield is recording parts II and III, which will be made available shortly for this purpose. 
 
Title the homeService corpus 
Description an audio corpus of spontaneous dysarthric speech 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact First example of a semi-spontaneous dysarthric speech corpus 
URL http://mini.dcs.shef.ac.uk/
 
Description BBC 
Organisation British Broadcasting Corporation (BBC)
Department BBC Research & Development
Country United Kingdom 
Sector Public 
PI Contribution Development of systems and showcases of the use of automatic speech processing of media archives
Collaborator Contribution Provided audio and video broadcast data and gave feedback on their requirements for future systems
Impact Several systems for media transcription are now available in webASR (www.webasr.org); a showcase for transcription of YouTube clips is also available (http://staffwww.dcs.shef.ac.uk/people/O.Saztorralba/youtube/)
Start Year 2012
 
Description BBC Data Science Partnership 
Organisation British Broadcasting Corporation (BBC)
Department BBC Research & Development
Country United Kingdom 
Sector Public 
PI Contribution Development of speech and language technology applied to broadcasting and media production
Collaborator Contribution R&D work from BBC researchers; data sharing.
Impact MGB Challenge iCASE studentships EPSRC SCRIPT Project
Start Year 2017
 
Description Barnsley Hospital NHS Foundation Trust 
Organisation Barnsley Hospital NHS Foundation Trust
Country United Kingdom 
Sector Public 
PI Contribution recruitment of homeService users
Collaborator Contribution recruitment of homeService users
Impact recruitment of homeService users
Start Year 2012
 
Description Bloomberg PhD Studentship 
Organisation Johns Hopkins University
Department Johns Hopkins Bloomberg School of Public Health
Country United States 
Sector Academic/University 
PI Contribution Research, systems, evaluation of multi-domain speech recognition
Collaborator Contribution Full funding for year 1 of a PhD studentship
Impact PhD student commenced work in Sept 2015
Start Year 2015
 
Description Cereproc 
Organisation Cereproc Ltd.
Country United Kingdom 
Sector Private 
PI Contribution Steve Renals is non-executive director of Cereproc Ltd.
Collaborator Contribution Brought a deeper understanding of commercial exploitation of speech technology
Impact Cereproc has developed into one of the leading companies in the speech synthesis area
Start Year 2014
 
Description Dysarthric speech organisation in Sheffield 
Organisation University of Sheffield
Country United Kingdom 
Sector Academic/University 
PI Contribution recruitment of hS users
Collaborator Contribution recruitment of hS users
Impact recruitment of hS users
Start Year 2015
 
Description English Heritage 
Organisation English Heritage
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution Development of a system and a platform for information retrieval and content linking in oral archives
Collaborator Contribution Provided audio data and summaries from interviews and gave feedback on their system requirements
Impact A demonstrator for the use of the technology on a set of oral history interviews was developed (http://brodsworthhall.azurewebsites.net/)
Start Year 2013
 
Description Julia Olcoz, visiting researcher 
Organisation University of Zaragoza
Country Spain 
Sector Academic/University 
PI Contribution Provided a baseline system for the task of lightly supervised alignment in media broadcasts
Collaborator Contribution Developed novel techniques for improving the lightly supervised alignment task
Impact The enhanced system for lightly supervised alignment is available in webASR. A paper detailing the system was submitted to Interspeech 2016.
Start Year 2015
 
Description MGB Challenge 
Organisation British Broadcasting Corporation (BBC)
Department BBC Research & Development
Country United Kingdom 
Sector Public 
PI Contribution Organised an international speech recognition challenge using BBC data: the MGB Challenge at ASRU-2015. We provided baseline and state-of-the-art systems, and defined the challenge procedures
Collaborator Contribution BBC provided 2000 hrs of multi-genre TV recordings, and 634M words of subtitle transcriptions
Impact P Bell, MJF Gales, T Hain, J Kilgour, P Lanchantin, X Liu, A McParland, S Renals, O Saz, M Wester, and P Woodland, The MGB Challenge: Evaluating multi-genre broadcast media recognition, IEEE ASRU-2015
Start Year 2014
 
Description Mediaeval 
Organisation Medieval Settlement Research Group
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution Provided automatic transcription of media data for their evaluation campaigns
Collaborator Contribution Provided the data
Impact The transcribed data was used by participants in the evaluation campaign and features in several of their publications related to the evaluation. MediaEval is an evaluation campaign that aims to improve information retrieval for media data. It is organised by Maria Eskevich at Eurecom (France).
Start Year 2014
 
Description NII 
Organisation National Institute of Informatics (NII)
Country Japan 
Sector Public 
PI Contribution Joint research in speech synthesis
Collaborator Contribution Joint research in speech synthesis
Impact many joint publications; collaboration on open source software (HTS); joint work on voice banking; joint position for Dr Junichi Yamagishi
Start Year 2013
 
Description NITech 
Organisation Nagoya Institute of Technology
Country Japan 
Sector Academic/University 
PI Contribution Joint research in particular focussed on HTS speech synthesis and user generated spoken dialogue systems.
Collaborator Contribution Joint research in particular focussed on HTS speech synthesis and user generated spoken dialogue systems.
Impact multiple joint publications
Start Year 2011
 
Description Pengyuang Zhang - visitor 
Organisation Chinese Academy of Sciences
Country China 
Sector Public 
PI Contribution Hosting the visitor, and providing access to research facilities at the University of Sheffield
Collaborator Contribution Research collaboration with the group
Impact The research visit resulted in the output of two research papers: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7078564 http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6854663
Start Year 2013
 
Description The Centre for Assistive Technology and Connected Healthcare (CATCH) 
Organisation University of Sheffield
Country United Kingdom 
Sector Academic/University 
PI Contribution sharing of knowledge and resources
Collaborator Contribution sharing of knowledge and resources
Impact unknown
Start Year 2013
 
Description Visitor Dr Yan-Xiong Li 
Organisation South China University of Technology
Department School of Environment and Energy
Country China 
Sector Academic/University 
PI Contribution Collaboration on diarisation work on meeting and broadcast media data.
Collaborator Contribution Joint annotation and testing experiments on BBC data and AMI meeting data. Public Release of new data.
Impact Annotation of BBC and RT'07 data.
Start Year 2013
 
Description uSTAR collaborative R&D project 
Organisation NICT National Institute of Information and Communications Technology
Country Japan 
Sector Academic/University 
PI Contribution Building online speech recognition for an iPhone platform for speech-to-speech translation
Collaborator Contribution Building online speech recognition for an iPhone platform for speech-to-speech translation
Impact Building online speech recognition for an iPhone platform for speech-to-speech translation
Start Year 2012
 
Title Combilex-ASR 
Description Combilex-ASR is a large-scale lexicon for speech recognition in British English. It is licensed under a Creative Commons BY-NC licence.
IP Reference  
Protection Copyrighted (e.g. software)
Year Protection Granted 2016
Licensed Yes
Impact This lexicon underpinned the ASRU-2015 MGB Challenge, and is planned to be used in the US IARPA Babel Programme
 
Title High-quality speech synthesizer, HTS voice 
Description High-quality speech synthesis software based on speech technologies developed during my fellowship. 
IP Reference  
Protection Copyrighted (e.g. software)
Year Protection Granted 2012
Licensed Yes
Impact I have formally licensed the high-quality speech synthesizer to two companies on a commercial basis.
 
Title Clinical trial of personalized speech synthesis voices for MND patients 
Description Adaptive speech synthesis may be used to develop personalised synthetic voices for people who have a vocal pathology. In 2009, Dr Sarah Creer from the University of Sheffield and I successfully applied it to clinical voice banking for laryngectomees (individuals who have had their vocal cords removed due to cancer), reconstructing their voices. In 2010, I "implanted" the personalised synthetic voice of a patient who has motor neurone disease into their assistive communication device. Such a personalised voice can lead to far more natural communication for patients, particularly with family. A "voice reconstruction" trial has been conducted with about 100 patients in total at the Euan MacDonald Centre for MND Research and the Anne Rowling Regenerative Neurology Clinic in Edinburgh. 
Type Health and Social Care Services
Current Stage Of Development Initial development
Year Development Stage Completed 2015
Development Status Actively seeking support
Impact We have recorded about 100 MND patients at the Euan MacDonald Centre for MND Research and the Anne Rowling Regenerative Neurology Clinic in Edinburgh and have constructed personalized speech synthesizers based on their disordered voices. We have received and analyzed feedback from the patients and have confirmed that this new speech synthesis technology can improve their quality of life. 
 
Title Alignment task and scoring software 
Description Implementation of the alignment scoring rules that the University of Sheffield defined as part of organising the MGB Challenge 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact Used in the MGB Challenge and used already for several publications by other research groups. 
 
Title CUED-RNNLM Toolkit 
Description Software to train recurrent neural network language models (RNNLMs) for speech recognition and other applications. The software features efficient training (on a GPU) and efficient evaluation (on a CPU). It includes a patch to HTK 3.4.1 that enables the application of RNNLMs to HTK-based speech recognition lattices. RNNLMs can be trained with additional features in the input layer for better performance (e.g. topic adaptation using an LDA-based topic vector; a toy illustration of this feature-augmented input follows this entry). 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact This software was used to train RNNLMs that were used in the Cambridge University transcription systems used in the 2015 ASRU multi-genre broadcast challenge (international challenge involving the automatic transcription of BBC broadcast audio). The Cambridge system gave the lowest error rates in the challenge and the use of RNNLMs efficiently trained on a large corpus of subtitle material was a key component, as well as the use of topic adaptation. 
URL http://mi.eng.cam.ac.uk/projects/cued-rnnlm/
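As a toy illustration of the feature-augmented input mentioned in the description above, the sketch below (PyTorch; not the CUED-RNNLM implementation, and with invented dimensions and random data) concatenates a per-utterance topic vector to every word embedding before the recurrent layer.

import torch
import torch.nn as nn

class TopicRNNLM(nn.Module):
    def __init__(self, vocab=1000, emb=64, topic_dim=20, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        # The auxiliary feature vector is appended to each word embedding.
        self.rnn = nn.GRU(emb + topic_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, words, topic):
        e = self.embed(words)                              # (B, T, emb)
        t = topic.unsqueeze(1).expand(-1, e.size(1), -1)   # (B, T, topic_dim)
        h, _ = self.rnn(torch.cat([e, t], dim=-1))
        return self.out(h)                                 # next-word logits

model = TopicRNNLM()
words = torch.randint(0, 1000, (4, 12))  # toy word-id sequences
topic = torch.rand(4, 20)                # e.g. LDA topic posteriors per utterance
logits = model(words, topic)             # shape (4, 12, 1000)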
 
Title Diarisation scoring tools 
Description Implementation of new diarisation methods as published in ICASSP 2016 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Not yet. 
 
Title HTK 3.5 
Description HTK is a portable toolkit for building and manipulating hidden Markov models which has been developed over many years, primarily at the Cambridge University Engineering Department. HTK is primarily used for speech recognition research, although it is also widely used for speech synthesis and other applications. HTK 3.5 adds built-in support for artificial neural network (ANN) models while maintaining compatibility with most existing functions (including hybrid and tandem models, sequence training and CPU/GPU math kernels), as well as support for decoding with RNN language models. HTK is supplied in source form with a specific licence that allows any use of the models produced but does not allow software re-distribution. HTK has over 100,000 registered users. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact HTK 3.5 has been used as a platform to develop various types of speech technology research at Cambridge, building on developments over many years to focus on the use of deep neural network acoustic models and recurrent neural network language models. A particular outcome has been the development of the Cambridge University systems for the 2015 ASRU multi-genre broadcast (MGB) challenge. This required the processing of more than 1600 hours of BBC TV audio data and developing systems for transcription, subtitle alignment and diarisation. This was embodied in 4 tasks in the MGB challenge, and Cambridge University systems based on HTK 3.5 had the best performance for all these tasks. Many HTK users have downloaded HTK 3.5 and are actively using it to develop both research and commercial systems. 
URL http://htk.eng.cam.ac.uk/
 
Title HTS ver 2.3 
Description HTS is an open source toolkit for statistical speech synthesis. I am a member of the team developing this free open-source research software package for speech synthesis. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact The HTS toolkit is used worldwide by both academic and commercial organisations, such as Microsoft, Nuance, Toshiba, Pentax, and Google. The number of downloads of HTS exceeds 10,000 and various commercial products using HTS are on the market. Therefore, this toolkit is a very influential platform for me to disseminate outcomes and form an immediate pathway to impact. 
URL http://hts.sp.nitech.ac.jp
 
Title Merlin 
Description Merlin is the Neural Network (NN) based Speech Synthesis System developed at the Centre for Speech Technology Research (CSTR), University of Edinburgh 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Since its release at the end of the Natural Speech Technology project, Merlin has established a significant base of users and developers. 
URL https://github.com/CSTR-Edinburgh/merlin
 
Title The Festival Speech Synthesis System 
Description Festival offers a general framework for building speech synthesis systems, as well as including examples of various modules. As a whole it offers full text-to-speech through a number of APIs: from the shell level (a toy example of this interface follows this entry), through a Scheme command interpreter, as a C++ library, from Java, and via an Emacs interface. Festival is multi-lingual (currently English (British and American) and Spanish), though English is the most advanced. Other groups release new languages for the system, and full tools and documentation for building new voices are available through Carnegie Mellon's FestVox project (http://festvox.org). The software was first released in the 1990s, but has been under continuous development, improvement, and maintenance since then. v2.1 was released in November 2010. 
Type Of Technology Software 
Open Source License? Yes  
Impact Festival is distributed by default in a number of standard Linux distributions including Arch Linux, Fedora, CentOS, RHEL, Scientific Linux, Debian, Ubuntu, openSUSE, Mandriva, Mageia and Slackware, and can easily be installed on any Linux distribution that supports apt-get. More recently, our work on statistical parametric speech synthesis and the algorithms for adaptation has been incorporated in the HTS toolkit (one of whose coordinators, Yamagishi, is from Edinburgh), which integrates with Festival. These toolkits are the most used open-source speech synthesis systems and have also formed the high-performing baseline systems for the international Blizzard evaluation of (commercial and research) speech synthesis, also organised by Edinburgh. 
URL http://www.cstr.ed.ac.uk/projects/festival/
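As a toy example of the shell-level interface mentioned in the description above, the snippet below drives Festival's text-to-speech mode from Python. It assumes the festival binary is installed and on the PATH, and that an audio output device is available.

import subprocess

# Equivalent to the shell pipeline: echo "Hello from Festival." | festival --tts
subprocess.run(["festival", "--tts"], input="Hello from Festival.",
               text=True, check=True)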
 
Title The Festival Speech Synthesis system - version 2.4 
Description The de facto industry standard toolkit for developing text-to-speech systems. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact Commercial products from AT&T. Spinout company Rhetorical Systems. 
URL http://www.cstr.ed.ac.uk/projects/festival/
 
Title homeService protocol 
Description The homeService protocol for developing human-machine interaction for users with speech and mobility impairments; an example of the project's virtuous cycle. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact Not at this point. 
 
Title webASR 
Description Publicly available webtool (www.webasr.org) and two showcases on media transcription (http://staffwww.dcs.shef.ac.uk/people/O.Saztorralba/youtube/) and alignment of lecture subtitles (http://staffwww.dcs.shef.ac.uk/people/O.Saztorralba/ted/) 
Type Of Technology Webtool/Application 
Year Produced 2015 
Impact New version of webASR (www.webasr.org) with new systems and demonstrators 
URL http://www.webasr.org/
 
Company Name Quorate Technology 
Description Quorate Technology provides Quorate, a Speech Recognition and Analysis Suite that unlocks information in recordings. The company's technology uses speech recognition for:
- Search: recorded speech is made searchable and transcripts can be generated automatically. Quorate enables search terms to be located in both the recording and in a computer-generated transcript.
- Browsing: Quorate generates keyword summaries of recordings. These keywords represent the terms that best characterise a single recording in the context of a group of recordings.
- Analysis: recorded speech can be examined from many perspectives, via the extraction of rich metadata, in order to enable recordings to be segmented, filtered and connected to related materials. 
Year Established 2012 
Impact The company has a variety of commercial contracts. There are at least 10 full-time scientific positions.
Website http://quoratetechnology.com
 
Company Name SPEAKUNIQUE LIMITED 
Description This company is commercialising the personalised speech synthesis technology (voice banking and voice cloning) developed in the NST project, following a successful initial clinical trial with the University of Edinburgh Medical School. 
Year Established 2018 
Impact The company is still in an early phase.
 
Description A talk about the homeService experience/project 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact A talk about the homeService experience/project to Birmingham University (Feb 2016).
Year(s) Of Engagement Activity 2016
 
Description A talk about the homeService experience/project to Medical Humanity Sheffield 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact A talk about the homeService experience/project to Medical Humanity Sheffield
Year(s) Of Engagement Activity 2015
 
Description CATCH kickoff event 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Introduction of the homeService project to an interested audience
Year(s) Of Engagement Activity 2013
 
Description COST AAPELE meeting
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Engaging academics and industry across Europe
Year(s) Of Engagement Activity 2014
URL http://aapele.eu/
 
Description Data Science for Media Summit 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Summit was organised by the Alan Turing Institute to bring together researchers and media specialists to discuss future directions of research in data science for media
Year(s) Of Engagement Activity 2015
 
Description How technology is changing speech and language therapy (Guardian on-line) 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact Article in the Guardian which includes information about progress in speech recognition technology and specific mention of the EPSRC Natural Speech Technology project.
Year(s) Of Engagement Activity 2015
URL http://www.theguardian.com/higher-education-network/2015/apr/15/how-technology-is-changing-speech-an...
 
Description Mobile University outreach event 2015 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact Our research on speech recognition and machine translation was presented as part of a 'Mobile University' outreach to general public in Sheffield City Centre.
Year(s) Of Engagement Activity 2015
URL http://mini.dcs.shef.ac.uk/mobileuni2015/
 
Description Multi-Genre Broadcast (MGB) Challenge 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Organised and participated in the Multi-Genre Broadcast (MGB) challenge. The challenge took place at ASRU 2015, which served as a meeting for the participants. 20+ research groups participated in the challenge.
Year(s) Of Engagement Activity 2015
URL http://www.mgb-challenge.org/
 
Description Talk at AIST Tsukuba 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk at AIST, Tsukuba, Japan - "Improving speech transcription using out-of-domain data"
Year(s) Of Engagement Activity 2012
 
Description The future of Languages - more than just words 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact A public lecture at the Public Library in Amsterdam, followed by a debate with an audience.

Interactions with the audience.
Year(s) Of Engagement Activity 2012
URL http://www.clubofamsterdam.com/event.asp?contentid=854
 
Description Using speech synthesis to give everyone their own voice 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact Discussions with the audience afterwards.

Follow-up emails from members of the public.
Year(s) Of Engagement Activity 2012
 
Description seminar at INESC 2012 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk entitled "Assistive Speech Technology" at INESC-ID, Lisbon
Year(s) Of Engagement Activity 2012