Natural Speech Technology
Lead Research Organisation:
University of Edinburgh
Department Name: Sch of Informatics
Abstract
Humans are highly adaptable, and speech is our natural medium for informal communication. When communicating, we continuously adjust to other people, to the situation, and to the environment, using previously acquired knowledge to make this adaptation seem almost instantaneous. Humans generalise, enabling efficient communication in unfamiliar situations and rapid adaptation to new speakers or listeners. Current speech technology works well for certain controlled tasks and domains, but is far from natural, a consequence of its limited ability to acquire knowledge about people or situations, to adapt, and to generalise. This accounts for the uneasy public reaction to speech-driven systems. For example, text-to-speech synthesis can be as intelligible as human speech, but lacks expression and is not perceived as natural. Similarly, the accuracy of speech recognition systems can collapse if the acoustic environment or task domain changes, conditions which a human listener would handle easily. Research approaches to these problems have hitherto been piecemeal and as a result progress has been patchy. In contrast, NST will focus on the integrated theoretical development of new joint models for speech recognition and synthesis. These models will allow us to incorporate knowledge about the speakers, the environment, the communication context and awareness of the task, and will learn and adapt from real world data in an online, unsupervised manner. This theoretical unification is already underway within the NST labs and, combined with our record of turning theory into practical state-of-the-art applications, will enable us to bring a naturalness to speech technology that is not currently attainable. The NST programme will yield technology which (1) approaches human adaptability to new communication situations, (2) is capable of personalised communication, and (3) takes account of speaker intention and expressiveness in speech recognition and synthesis.
This is an ambitious vision. Its success will be measured in terms of how the theoretical development reshapes the field over the next decade, the take-up of the software systems that we shall develop, and the impact of our exemplar interactive applications. We shall establish a strong User Group to maximise the impact of the project, whose members are concerned with clinical applications as well as more general speech technology. Members of the User Group include Toshiba, EADS Innovation Works, Cisco, Barnsley Hospital NHS Foundation Trust, and the Euan MacDonald Centre for MND Research. An important interaction with the User Group will be validating our systems on their data and tasks, discussed at an annual user workshop.
Planned Impact
Leading market analysts predict that revenues from speech technology in North America alone will reach $1 billion by 2011. The reality has lagged behind such predictions in the past because the technology has not been sufficiently refined, but paradigms are shifting. The revolutionary change in connectivity and mobile computing in recent years gives rise to a number of compelling application drivers for the proposed research programme: (1) Rapid developments in mobile computing - decreasing power consumption, high network bandwidth and cloud computing - are stimulating demand for new interfaces. (2) Demographic and economic pressures mean that home care and support systems will become commonplace; such systems will benefit from personalised spoken interaction. (3) Remote meetings are becoming standard, stimulated by the economic conditions and climate change; natural speech technology will enable much richer interactions. (4) As data access becomes more open, the volume of available audio data will increase exponentially; natural speech transcription will make such data oceans searchable and structured. (5) There is a potentially huge market (entertainment, consumer apps, robotics) that would be opened up by the availability of adaptive, controllable, expressive speech synthesis. (6) Clinical applications of speech technology will be substantially enriched by the personalised systems proposed in NST. As these drivers have reached a critical level, the NST team has made a number of crucial breakthroughs in adaptive speech synthesis, in conversational speech transcription and in new algorithms to robustly handle changing environments. The research potential is thus poised to meet the application drivers.
Beneficiaries of the research can be found in the commercial sector (e.g., remote meeting technology; speech synthesis for computer games; speech archive search), the public sector (e.g., voice reconstruction services for the National Health Service), the third sector (e.g., charities providing support for sufferers of neurodegenerative diseases), art and design, policy makers (e.g., investment in the use of spoken language technology can reduce travel and therefore carbon emissions; it can also enable people to live longer in their own homes, thus reducing the need for residential care services), and the general public (e.g., prospective voice banking and donation could become as commonplace and as widely known as blood donation). The programme's direct training and development impact will be large, through the PhD students and researchers who will work on the project and through researchers on associated projects drawn in alongside NST; indirectly the training impact will be even larger through other students, researchers and visitors at the three universities, as well as programme workshops.
Organisations
- University of Edinburgh (Lead Research Organisation)
- NICT National Institute of Information and Communications Technology (Collaboration)
- Cereproc Ltd. (Collaboration)
- South China University of Technology (Collaboration)
- Medieval Settlement Research Group (Collaboration)
- BARNSLEY HOSPITAL NHS FOUNDATION TRUST (Collaboration)
- Nagoya Institute of Technology (Collaboration)
- Chinese Academy of Sciences (Collaboration)
- University of Sheffield (Collaboration)
- Johns Hopkins University (Collaboration)
- English Heritage (Collaboration)
- British Broadcasting Corporation (BBC) (Collaboration)
- University of Zaragoza (Collaboration)
- National Institute of Informatics (NII) (Collaboration)
Publications
Ali A
(2015)
Multi-reference WER for evaluating ASR for languages with no orthographic rule
in Proc IEEE ASRU
Ali A
(2015)
Multi-Reference Evaluation for Dialectal Speech Recognition System: A Study for Egyptian ASR
in Proc WANLP
Andersson S
(2012)
Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis
in Speech Communication
Astrinaki M.
(2013)
Reactive accent interpolation through an interactive map application
in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Aylett, M.
(2014)
A flexible front-end for HTS
Bell P
(2015)
Regularization of context-dependent deep neural networks with context-independent multi-task training
in Proc IEEE ICASSP
Bell P
(2015)
Complementary tasks for context-dependent deep neural network acoustic models
in Proc Interspeech
Bell P
(2015)
The MGB Challenge: Evaluating Multi-genre Broadcast Media Recognition
in Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
Description | The aim of the project is to significantly advance the state-of-the-art in speech technology through the recognition and synthesis of natural speech, approaching human levels of flexibility, reliability, and fluency. We have made advances in several areas. 1. Learning and adaptation. We have developed new approaches to learning representations for speech and language based on deep neural networks and recurrent neural networks. In contrast to previous approaches, these new approaches require less feature engineering and human design. They have been applied to both speech recognition and speech synthesis. We have also developed new approaches for the adaptation of systems to a new voice, given just a few seconds of speech, as well as new factorised modelling approaches which, for example, enable us to separately model the effects of the talker as distinct from the effects of the recording channel. 2. Speech transcription. We have developed several new acoustic modelling techniques: for example, techniques that model phonetic context more efficiently, and a new approach to recognising speech captured using multiple microphones. We have also developed more accurate language models, based on recurrent neural networks, and have introduced a new algorithm to automatically learn a pronunciation lexicon. 3. Speech synthesis. We have introduced new models for synthesising speech based on multiple average voices, and using prior information automatically extracted from talker characteristics. We have developed a new approach to characterising the perceptual effects of modelling assumptions in speech synthesis through perceptual experiments using stimuli constructed from repeated natural speech. We have developed new techniques for synthesis of conversational speech, for example through automatic pause insertion. 4. Applications. This work has been applied in a number of areas.
a) transcription of broadcast speech for subtitling, metadata extraction, and archive search, in collaboration with user group partners BBC and Red Bee Media. b) adaptive speech recognition and dialogue management for users with speech disorders, currently undergoing trials in users' homes. c) voice banking and cloning, to create personalised voice output communication aids for people with diseases such as Motor Neurone Disease and Parkinson's Disease. This is also undergoing trials with users. |
Exploitation Route | Our findings are already having considerable impact. In particular, we have released many of the findings made in the project through open source toolkits (Kaldi, HTK, HTS and Festival), which has resulted in significant take-up. Several of our techniques for speech recognition and speech synthesis are being further developed by other groups. Our techniques have been put to use by several members of the project user group, including the BBC and Ericsson (broadcast speech transcription); the Euan MacDonald Centre for Motor Neurone Disease Research and the Motor Neurone Disease Association (voice banking); Quorate Technology (audio search and browsing); Toshiba (speech synthesis); and Emotech (distant speech recognition). |
Sectors | Creative Economy Digital/Communication/Information Technologies (including Software) Education Healthcare Government Democracy and Justice Culture Heritage Museums and Collections |
URL | http://www.natural-speech-technology.org |
Description | 1. Contributions to widely-used open source software including HTK, Kaldi, HTS, CUED RNN Toolkit, and Merlin. The impact of NST on other researchers, and on industry, has been enhanced through the implementation and release of many of the key models and algorithms developed in the project via the main open source platforms used in speech technology: HTK, Kaldi, and HTS. NST research also resulted in the release of two widely-used open source software toolkits: the CUED RNN toolkit for speech recognition language modelling, and Merlin for neural text-to-speech synthesis. NST speech recognition was made available to researchers through the webASR system. 2. Personalised speech synthesis used for voice banking and reconstruction and deployed in assistive technology communication aids: this was developed in collaboration with the Euan MacDonald Centre for Motor Neurone Disease Research and the Anne Rowling Clinic for Regenerative Neurology at the University of Edinburgh. This work included a successful clinical pilot study, started during the NST project, and has resulted in the formation of a spinout company, SpeakUnique. 3. Transcription of multi-genre broadcast speech. Media companies such as Ericsson/Red Bee Media and the BBC have used NST technology to automatically transcribe a wide range of broadcast speech. Red Bee Media have worked closely with the University of Edinburgh and spinout company Quorate Technology on the development of their real-time subtitling services, increasing the accessibility of live television and streaming for the 11 million people with hearing loss in the UK. 4. Transcription of parliamentary proceedings. Hansard provides a "substantially verbatim" record of proceedings in both Houses of the UK Parliament - the House of Commons and the House of Lords - as well as transcripts of Select Committee sittings.
Using speech recognition technology from Edinburgh NST spinout Quorate Technology, Hansard uses automatically generated transcriptions as a first draft of the official record, as well as to enable searches linking the audio, video, and transcription of parliamentary recordings. 5. Deployment of an application with English Heritage for browsing oral histories: speech recognition developed in the project was used to browse through spoken interviews in 'Duty Calls', an exhibition by English Heritage centred on events at Brodsworth Hall in World War II, and in 'Village Memories', a lottery-funded project exploring life in three South Yorkshire villages. 6. Development of academic-industry research centres. An important aspect of the project was the development of industry-focussed research, and this has resulted in the formation of a number of academic-industry research centres including the BBC Data Science Research Partnership, the VoiceBase Centre for Speech & Language Technology at the University of Sheffield, a joint research lab with Huawei at the University of Edinburgh, and collaborative funded projects with Bloomberg, Ericsson, Samsung, Toshiba, and Zoo Digital. 7. Distant speech recognition. Emotech have used the distant speech recognition technology developed in NST to build a robust commercial system used in the prototype personal robot Olly. At CES 2017 in Las Vegas, Olly became the robotics product that won the most awards in CES history (Smart home; Drones and unmanned [sic] systems; Smart appliances; Home audio-video accessories). Speech recognition was central to this and based on UoE research. The underlying speech recognition technology is now being used by Emotech, in partnership with Huawei, for classroom-based language learning systems, including automatic pronunciation assessment. |
First Year Of Impact | 2012 |
Sector | Aerospace, Defence and Marine,Creative Economy,Digital/Communication/Information Technologies (including Software),Education,Healthcare,Government, Democracy and Justice,Culture, Heritage, Museums and Collections |
Impact Types | Cultural Societal Economic |
Description | CITIA |
Geographic Reach | Europe |
Policy Influence Type | Participation in a guidance/advisory committee |
Impact | Steve Renals is founding chairperson of the EU Conversational Interaction Technologies Innovation Alliance, a group which has advised the EU on policy relating speech technology to the multilingual digital single market.
URL | http://citia.eu |
Description | ROCKIT/CITIA Roadmap |
Geographic Reach | Europe |
Policy Influence Type | Citation in other policy documents |
Impact | The ROCKIT/CITIA strategic roadmap for conversational interaction technologies forms the basis of a research and innovation agenda in this area. In 2014 and 2015 we constructed this technology roadmap to enable the conversational interaction technologies vision to be realised. The roadmapping process was carried out at the European level, connecting the strong R&D base with commercial and industrial activity and with policy makers, at the EU and national levels. |
URL | http://www.sharpcloud.com/ROCKIT |
Description | Adapting end-to-end speech recognition systems (year 1) |
Amount | £137,365 (GBP) |
Organisation | Samsung |
Sector | Private |
Country | Korea, Republic of |
Start | 12/2018 |
End | 11/2019 |
Description | Adapting end-to-end speech recognition systems (year 2) |
Amount | £113,989 (GBP) |
Organisation | Samsung |
Sector | Private |
Country | Korea, Republic of |
Start | 12/2019 |
End | 11/2020 |
Description | Bloomberg PhD Studentship
Amount | £42,677 (GBP) |
Organisation | Bloomberg |
Sector | Private |
Country | United States |
Start | 01/2015 |
End | 12/2015 |
Description | EPSRC Impact Acceleration Award |
Amount | £37,716 (GBP) |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 02/2015 |
End | 09/2015 |
Description | EPSRC Responsive Mode |
Amount | £533,268 (GBP) |
Funding ID | EP/P011586/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 12/2016 |
End | 11/2019 |
Description | EPSRC Responsive Mode |
Amount | £1,402,097 (GBP) |
Funding ID | EP/R012180/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 03/2018 |
End | 02/2021 |
Description | EU FP7-ICT-2011-1.5 |
Amount | € 540,000 (EUR) |
Funding ID | 287872 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 11/2011 |
End | 10/2014 |
Description | EU FP7-ICT-2013-10 |
Amount | € 520,000 (EUR) |
Funding ID | 611092 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 12/2013 |
End | 11/2015 |
Description | EU H2020 ICT Programme |
Amount | € 1,999,113 (EUR) |
Funding ID | 688139 |
Organisation | European Commission |
Department | Horizon 2020 |
Sector | Public |
Country | European Union (EU) |
Start | 02/2016 |
End | 01/2019 |
Description | European Community's Seventh Framework Programme (FP7/2007-2013) |
Amount | € 1,100,000 (EUR) |
Funding ID | 287678 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 11/2011 |
End | 10/2014 |
Description | European Union Seventh Framework Programme |
Amount | € 1,100,000 (EUR) |
Funding ID | 287658 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 02/2012 |
End | 01/2015 |
Description | IIKE Early Career Research Scheme |
Amount | £15,000 (GBP) |
Organisation | University of Sheffield |
Sector | Academic/University |
Country | United Kingdom |
Start | 11/2015 |
End | 02/2016 |
Description | Innovation Seed Funding |
Amount | £5,000 (GBP) |
Organisation | University of Sheffield |
Sector | Academic/University |
Country | United Kingdom |
Start | 05/2015 |
End | 08/2015 |
Description | ItsLanguage pronunciation assessment |
Amount | € 75,000 (EUR) |
Organisation | ITSLanguage bv |
Sector | Private |
Country | Netherlands |
Start | 11/2012 |
End | 08/2014 |
Description | Leverhulme International Network |
Amount | £125,000 (GBP) |
Organisation | The Leverhulme Trust |
Sector | Charity/Non Profit |
Country | United Kingdom |
Start | 01/2015 |
End | 12/2018 |
Description | Response to Tender (1) |
Amount | £73,726 (GBP) |
Organisation | Defence Science & Technology Laboratory (DSTL) |
Sector | Public |
Country | United Kingdom |
Start | 09/2012 |
End | 04/2013 |
Description | Response to Tender (2) |
Amount | £98,982 (GBP) |
Organisation | Defence Science & Technology Laboratory (DSTL) |
Sector | Public |
Country | United Kingdom |
Start | 12/2013 |
End | 04/2014 |
Description | Response to Tender (3) |
Amount | £78,684 (GBP) |
Organisation | Defence Science & Technology Laboratory (DSTL) |
Sector | Public |
Country | United Kingdom |
Start | 01/2015 |
End | 08/2016 |
Description | The DataLab Industry PhD |
Amount | £102,000 (GBP) |
Organisation | The Datalab |
Sector | Charity/Non Profit |
Start | 08/2016 |
End | 04/2020 |
Description | Toshiba PhD Studentship |
Amount | £144,485 (GBP) |
Organisation | Toshiba Research Europe Ltd |
Sector | Private |
Country | United Kingdom |
Start | 08/2017 |
End | 04/2021 |
Title | MGB Challenge Speech Recognition Systems |
Description | Speech recognition software, based on the open source Kaldi toolkit, was released to enable the construction of lightly supervised multi-genre broadcast speech recognition systems.
Type Of Material | Improvements to research infrastructure |
Year Produced | 2015 |
Provided To Others? | Yes |
Impact | These systems provided the baselines for the 2015 MGB Challenge |
URL | http://mgb-challenge.org |
Title | Artificial Personality |
Description | This dataset is associated with the paper "Artificial Personality and Disfluency" by Mirjam Wester, Matthew Aylett, Marcus Tomalin and Rasmus Dall published at Interspeech 2015, Dresden. The focus of this paper is artificial voices with different personalities. Previous studies have shown links between an individual's use of disfluencies in their speech and their perceived personality. Here, filled pauses (uh and um) and discourse markers (like, you know, I mean) have been included in synthetic speech as a way of creating an artificial voice with different personalities. We discuss the automatic insertion of filled pauses and discourse markers (i.e., fillers) into otherwise fluent texts. The automatic system is compared to a ground truth of human "acted" filler insertion. Perceived personality (as defined by the big five personality dimensions) of the synthetic speech is assessed by means of a standardised questionnaire. Synthesis without fillers is compared to synthesis with either spontaneous or synthetic fillers. Our findings explore how the inclusion of disfluencies influences the way in which subjects rate the perceived personality of an artificial voice. |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Title | Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015) Database |
Description | The database has been used in the first Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015). Genuine speech is collected from 106 speakers (45 male, 61 female) and with no significant channel or background noise effects. Spoofed speech is generated from the genuine data using a number of different spoofing algorithms. The full dataset is partitioned into three subsets, the first for training, the second for development and the third for evaluation. More details can be found in the evaluation plan in the summary paper. |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Impact | Automatic speaker verification (ASV) offers a low-cost and flexible biometric solution to person authentication. While the reliability of ASV systems is now considered sufficient to support mass-market adoption, there are concerns that the technology is vulnerable to spoofing, also referred to as presentation attacks. Spoofing refers to an attack whereby a fraudster attempts to manipulate a biometric system by masquerading as another, enrolled person. Acknowledged vulnerabilities include attacks through impersonation, replay, speech synthesis and voice conversion. This database has been used for the 2015 ASVspoof challenge, which aims to encourage further progress through (i) the collection and distribution of a standard dataset with varying spoofing attacks implemented with multiple, diverse algorithms and (ii) a series of competitive evaluations. The first ASVspoof challenge was held during the 2015 edition of INTERSPEECH in Dresden, Germany. The challenge has been designed to support, for the first time, independent assessments of vulnerabilities to spoofing and of countermeasure performance and to facilitate the comparison of different spoofing countermeasures on a common dataset, with standard protocols and metrics. |
Title | CSTR VCTK Corpus -- Multi-speaker English Corpus for CSTR Voice Cloning Toolkit |
Description | This CSTR VCTK Corpus includes speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences, most of which were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. The newspaper texts were taken from The Herald (Glasgow), with permission from Herald & Times Group. Each speaker reads a different set of the newspaper sentences, where each set was selected using a greedy algorithm designed to maximise the contextual and phonetic coverage. The Rainbow Passage and elicitation paragraph are the same for all speakers. The Rainbow Passage can be found in the International Dialects of English Archive (http://web.ku.edu/~idea/readings/rainbow.htm). The elicitation paragraph is identical to the one used for the speech accent archive (http://accent.gmu.edu). The details of the speech accent archive can be found at http://www.ualberta.ca/~aacl2009/PDFs/WeinbergerKunath2009AACL.pdf All speech data was recorded using an identical recording setup: an omni-directional head-mounted microphone (DPA 4035), 96 kHz sampling frequency at 24 bits, in a hemi-anechoic chamber of the University of Edinburgh. All recordings were converted to 16 bits, downsampled to 48 kHz based on STPK, and manually end-pointed. This corpus was recorded for the purpose of building HMM-based text-to-speech synthesis systems, especially for speaker-adaptive HMM-based speech synthesis using average voice models trained on multiple speakers and speaker adaptation technologies. |
Type Of Material | Database/Collection of data |
Year Produced | 2012 |
Provided To Others? | Yes |
Impact | This is the first free corpus that is designed and appropriate for speaker-adaptive speech synthesis. This starts to become a standard database to build and compare speaker-adaptive speech synthesis systems and voice conversion systems. This was also used even for speaker verification systems. |
URL | http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html |
Title | Computer, Speech and Language - Experiment results for paper "Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations" |
Description | The files in the dataset correspond to results that have been generated for the Computer, Speech and Language article "Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations", http://dx.doi.org/10.1016/j.csl.2016.06.008. The files in the zip file are of three types: .ctm, which correspond to the output of the automatic speech recognition system, with columns giving segment information as well as transcripts of the recognition; .sys, which correspond to the scoring of the automatic speech recognition system, including the overall word error rate as well as the number of insertions, deletions and substitutions of the overall system; and .lur, which provide a more detailed decomposition of the word error rate across different tags. The naming convention of the files is as follows: TableX-LineY is the recognition and scoring output corresponding to Line Y of Table X in the article; FigureX-BarY is the recognition and scoring output corresponding to Bar Y (starting on the left hand side) of Figure X in the article. All three file types are standard outputs recognised by the automatic speech recognition community and can be opened using any text editor. |
Type Of Material | Database/Collection of data |
Year Produced | 2016 |
Provided To Others? | Yes |
URL | https://figshare.shef.ac.uk/articles/dataset/Computer_Speech_and_Language_-_Experiment_results_for_p... |
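The .sys scoring files described above report the overall word error rate alongside the insertion, deletion and substitution counts, following the conventional definition WER = (S + D + I) / N, where N is the number of reference words. As an illustrative sketch (not code from the project), WER can be computed from a Levenshtein alignment between the reference and hypothesis word sequences:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: minimum edit distance between reference and
    hypothesis word sequences, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Scoring tools such as NIST sclite additionally trace back through this alignment to report the separate substitution, deletion and insertion counts found in the .sys summaries.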
Title | Data Underpinning "Joint Optimisation of Tandem Systems Using Gaussian Mixture Density Neural Network Discriminative Sequence Training" |
Description | Description of the Speech Recognition Training and Test Data and its Availability used for Experiments. Key Speech Recognition Outputs/Detailed Scoring Results used in the paper. |
Type Of Material | Database/Collection of data |
Year Produced | 2017 |
Provided To Others? | Yes |
Title | Experiment materials for "Disfluencies in change detection in natural, vocoded and synthetic speech." |
Description | The current dataset is associated with the DiSS paper "Disfluencies in change detection in natural, vocoded and synthetic speech." In this paper we investigate the effect of filled pauses, a discourse marker and silent pauses in a change detection experiment in natural, vocoded and synthetic speech. In natural speech, change detection has been found to increase in the presence of filled pauses; we extend this work by replicating earlier findings and exploring the effect of a discourse marker, like, and silent pauses. Furthermore, we report how the use of "unnatural" speech, namely synthetic and vocoded speech, affects change detection rates.
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Title | Experiment materials for "The temporal delay hypothesis: Natural, vocoded and synthetic speech." |
Description | Including disfluencies in synthetic speech is being explored as a way of making synthetic speech sound more natural and conversational. How to measure whether the resulting speech is actually more natural, however, is not straightforward. Conventional approaches to synthetic speech evaluation fall short: listeners are either primed to prefer stimuli with filled pauses or, when not primed, prefer more fluent speech. Reaction time experiments from psycholinguistics may circumvent this issue. In this paper, we revisit one such reaction time experiment. For natural speech, delays in word onset were found to facilitate word recognition regardless of the type of delay, be it a filled pause (um), silence or a tone. We reused the materials for natural speech and extended them to vocoded and synthetic speech. The results partially replicate previous findings. For natural and vocoded speech, if the delay is a silent pause, significant increases in the speed of word recognition are found. If the delay comprises filled pauses, there is a significant increase in reaction time for vocoded speech but not for natural speech. For synthetic speech, no clear effects of delay on word recognition are found. We hypothesise that this is because it takes longer (requires more cognitive resources) to process synthetic speech than natural or vocoded speech.
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Title | Experimental results for IEEE/ACM Transaction on Audio, Speech and Language Processing Journal Paper: "Recurrent Neural Network Language Model Adaptation for Multi-Genre Broadcast Speech Recognition and Alignment" |
Description | The files in the dataset correspond to results generated for the IEEE/ACM Transactions on Audio, Speech and Language Processing paper "Recurrent Neural Network Language Model Adaptation for Multi-Genre Broadcast Speech Recognition and Alignment", DOI: 10.1109/TASLP.2018.2888814. The paper deals with language model adaptation for the MGB Challenge 2015 transcription and alignment tasks. The files in the zip file are of three types:
- .ctm, which correspond to the output of the automatic speech recognition system; the columns include segment information as well as the recognised transcripts.
- .ctm.filt.sys, which correspond to the scoring of the automatic speech recognition system and include the overall word error rate as well as the numbers of insertions, deletions and substitutions of the overall system.
- .ctm.filt.lur, which provide a more detailed decomposition of the word error rate across multiple genres.
The three file types are repeated for all the results described in Tables 4, 5 and 6 of the paper (27 entries in total). The naming convention of the files is as follows:
- 4gram.amlm.baseline refers to the 4-gram LM baseline on LM1 and LM2 text.
- rnnlm refers to a Recurrent Neural Network Language Model.
- the amrnnlm prefix refers to an acoustic model text RNNLM.
- the amlmrnnlm prefix refers to an acoustic model + language model text RNNLM.
- the .baseline.lattice.rescore suffix refers to baseline results generated with lattice rescoring.
- the .nbest.baseline.rescore suffix refers to baseline results generated with n-best rescoring.
- .noadaptation refers to RNNLM results with no adaptation.
- .genre.finetune refers to genre fine-tuning of the RNNLMs.
- .genre.adaptationlayer refers to genre LHN adaptation-layer fine-tuning of the RNNLMs.
- .ldafeat.hiddenlayer refers to text-based Latent Dirichlet Allocation (LDA) features at the hidden layer.
- .acousticldafeat.hiddenlayer refers to acoustic LDA features at the hidden layer.
- .acoustictextldafeat.hiddenlayer refers to acoustic and text LDA features at the hidden layer.
- .genrefeat.hiddenlayer refers to genre 1-hot auxiliary codes at the hidden layer.
- .genrefeat.adaptationlayer refers to genre 1-hot auxiliary codes at the adaptation layer.
- .2layer.ldafeat.hiddenlayer refers to a 2-layer RNNLM with text LDA features at the hidden layer and no features at the adaptation layer.
- .2layer.ldafeat.hiddenlayer.genrefinetune refers to a 2-layer RNNLM with text LDA features at the hidden layer, no features at the adaptation layer, and genre fine-tuning.
- .kcomponent refers to K-Component Adaptive Topic fine-tuning using LDA posteriors.
All three file types are standard outputs recognised by the automatic speech recognition community and can be opened in any text editor. |
Type Of Material | Database/Collection of data |
Year Produced | 2021 |
Provided To Others? | Yes |
URL | https://figshare.shef.ac.uk/articles/dataset/Experiments_results_for_IEEE_ACM_Transaction_on_Audio_S... |
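The .ctm files listed in these records follow the standard NIST CTM layout: one recognised word per line, with columns for source file, channel, start time, duration, word and an optional confidence score. As an illustrative sketch (the file name and word values in the test below are invented, not taken from this dataset), the transcript of each recording can be recovered with a few lines of Python:

```python
from collections import defaultdict

def read_ctm(path):
    """Parse a NIST-style CTM file: one recognised word per line,
    with columns <file> <channel> <start> <duration> <word> [<conf>]."""
    transcripts = defaultdict(list)          # recording id -> [(start, word)]
    with open(path) as f:
        for line in f:
            if not line.strip() or line.startswith(";;"):
                continue                     # skip blank and comment lines
            fields = line.split()
            rec, _chan, start, _dur, word = fields[:5]
            transcripts[rec].append((float(start), word))
    # order each recording's words by start time before joining
    return {rec: " ".join(w for _, w in sorted(words))
            for rec, words in transcripts.items()}
```

Sorting on start time means the recovered word order does not depend on the order of lines in the file.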
Title | Experiments results for IEEE/ACM Transaction on Audio, Speech and Language Processing Journal Paper: "Recurrent Neural Network Language Model Adaptation for Multi-Genre Broadcast Speech Recognition and Alignment" |
Description | The files in the dataset correspond to results generated for the IEEE/ACM Transactions on Audio, Speech and Language Processing paper "Recurrent Neural Network Language Model Adaptation for Multi-Genre Broadcast Speech Recognition and Alignment", DOI: 10.1109/TASLP.2018.2888814. The paper deals with language model adaptation for the MGB Challenge 2015 transcription and alignment tasks. The files in the zip file are of three types:
- .ctm, which correspond to the output of the automatic speech recognition system; the columns include segment information as well as the recognised transcripts.
- .ctm.filt.sys, which correspond to the scoring of the automatic speech recognition system and include the overall word error rate as well as the numbers of insertions, deletions and substitutions of the overall system.
- .ctm.filt.lur, which provide a more detailed decomposition of the word error rate across multiple genres.
The three file types are repeated for all the results described in Tables 4, 5 and 6 of the paper (27 entries in total). The naming convention of the files is as follows:
- 4gram.amlm.baseline refers to the 4-gram LM baseline on LM1 and LM2 text.
- rnnlm refers to a Recurrent Neural Network Language Model.
- the amrnnlm prefix refers to an acoustic model text RNNLM.
- the amlmrnnlm prefix refers to an acoustic model + language model text RNNLM.
- the .baseline.lattice.rescore suffix refers to baseline results generated with lattice rescoring.
- the .nbest.baseline.rescore suffix refers to baseline results generated with n-best rescoring.
- .noadaptation refers to RNNLM results with no adaptation.
- .genre.finetune refers to genre fine-tuning of the RNNLMs.
- .genre.adaptationlayer refers to genre LHN adaptation-layer fine-tuning of the RNNLMs.
- .ldafeat.hiddenlayer refers to text-based Latent Dirichlet Allocation (LDA) features at the hidden layer.
- .acousticldafeat.hiddenlayer refers to acoustic LDA features at the hidden layer.
- .acoustictextldafeat.hiddenlayer refers to acoustic and text LDA features at the hidden layer.
- .genrefeat.hiddenlayer refers to genre 1-hot auxiliary codes at the hidden layer.
- .genrefeat.adaptationlayer refers to genre 1-hot auxiliary codes at the adaptation layer.
- .2layer.ldafeat.hiddenlayer refers to a 2-layer RNNLM with text LDA features at the hidden layer and no features at the adaptation layer.
- .2layer.ldafeat.hiddenlayer.genrefinetune refers to a 2-layer RNNLM with text LDA features at the hidden layer, no features at the adaptation layer, and genre fine-tuning.
- .kcomponent refers to K-Component Adaptive Topic fine-tuning using LDA posteriors.
All three file types are standard outputs recognised by the automatic speech recognition community and can be opened in any text editor. |
Type Of Material | Database/Collection of data |
Year Produced | 2018 |
Provided To Others? | Yes |
URL | https://figshare.shef.ac.uk/articles/dataset/Experiments_results_for_IEEE_ACM_Transaction_on_Audio_S... |
Title | Human vs Machine Spoofing |
Description | Listening test materials for "Human vs Machine Spoofing Detection on Wideband and Narrowband data." They include lists of the speech material selected from the SAS spoofing database and the listeners' responses. The main data file has been split into five smaller files (labelled "aa" to "ae") for ease of download. |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Title | Improving Interpretability and Regularisation in Deep Learning |
Description | The provided .ctm and scoring .sys files correspond to the MPE systems of Table VI (Javanese) and Table X (BN) of this paper. |
Type Of Material | Database/Collection of data |
Year Produced | 2018 |
Provided To Others? | Yes |
Title | MGB Challenge |
Description | The MGB Challenge data was released to support the MGB Challenge evaluation of multi-genre broadcast speech recognition systems. It consists of approximately 1,600 hours of broadcast audio taken from seven weeks of BBC output across all TV channels; the captions as originally broadcast on TV, accompanied by baseline lightly-supervised alignments from an ASR system with confidence measures; several hundred million words of subtitle text from BBC TV output collected over a 15-year period; and a hand-compiled British English lexicon derived from Combilex. |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Impact | This research database supported the MGB Challenge at the IEEE ASRU-2015 workshop |
URL | http://mgb-challenge.org |
Title | MGB database |
Description | The MGB database is the official database of the MGB challenge. It contains 2,000 hours of audio, 700 million words of transcripts plus other metadata. |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Impact | Features in the MGB challenge and the accompanying workshop at ASRU 2015. |
Title | Multimedia Tools and Applications - Experiments results for paper "Lightly supervised alignment of subtitles on multigenre broadcasts" |
Description | The files in the dataset correspond to results that have been generated for the Multimedia Tools and Applications (Springer ISSN: 1380-7501 / 1573-7721) article "Lightly supervised alignment of subtitles on multigenre broadcasts". The files in the zip file are of three types: - .ctm, which correspond to the output of the automatic speech recognition system or lightly supervised alignment system. - .rttm, which correspond to the output of the speech segmentation system. - .sys, which correspond to the scoring of the speech segmentation, automatic speech recognition or lightly supervised alignment system. The naming convention of the files is as follows: TableX-LineY-[ser|wer|f1] is the output and scoring results corresponding to Line Y of Table X in the article, in terms of SER, WER or F1 score. All three file types are standard outputs recognised by the speech technology community and can be opened in any text editor. |
Type Of Material | Database/Collection of data |
Year Produced | 2016 |
Provided To Others? | Yes |
URL | https://figshare.shef.ac.uk/articles/dataset/Multimedia_Tools_and_Applications_-_Experiments_results... |
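The .rttm segmentation files mentioned above use the standard NIST RTTM layout, in which SPEAKER records carry the recording name, segment start time and segment duration in the second, fourth and fifth whitespace-separated fields. A minimal sketch for totalling detected speech time per recording (the example records in the test are invented; the field positions follow the RTTM convention):

```python
from collections import defaultdict

def speech_time(rttm_lines):
    """Sum SPEAKER segment durations per recording from RTTM records.
    RTTM columns: type file chan tbeg tdur ortho stype name conf [slat]."""
    totals = defaultdict(float)
    for line in rttm_lines:
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "SPEAKER":
            totals[fields[1]] += float(fields[4])   # tdur, in seconds
    return dict(totals)
```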
Title | REHASP |
Description | Studio recording of female native British English talker producing three sets of Harvard sentences (thirty prompts), each prompt repeated forty times. Available both as unprocessed 96 kHz recordings and standardised 16 kHz files. |
Type Of Material | Database/Collection of data |
Year Produced | 2014 |
Provided To Others? | Yes |
Impact | The following paper has been published: G. E. Henter, T. Merritt, M. Shannon, C. Mayo, and S. King, "Measuring the perceptual effects of modelling assumptions in speech synthesis using stimuli constructed from repeated natural speech," in Proc. Interspeech, 2014 |
Title | Spoofing and Anti-Spoofing (SAS) corpus v1.0 |
Description | This dataset is associated with the paper "SAS: A speaker verification spoofing database containing diverse attacks", which presents the first version of a speaker verification spoofing and anti-spoofing database, named the SAS corpus. The corpus includes nine spoofing techniques, two based on speech synthesis and seven on voice conversion. We designed two protocols, one for standard speaker verification evaluation and the other for producing spoofing materials; these allow the speech synthesis community to produce spoofing materials incrementally without knowledge of speaker verification spoofing and anti-spoofing. To provide a set of preliminary results, we conducted speaker verification experiments using two state-of-the-art systems. Without any anti-spoofing techniques, both systems are extremely vulnerable to the spoofing attacks implemented in the SAS corpus. |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Impact | The SAS database is the first version of a standard dataset for spoofing and anti-spoofing research. Currently, the SAS corpus includes speech generated using nine spoofing methods, each of which comprises around 300,000 spoofed trials. To the best of our knowledge, this is the first attempt to include such a diverse range of spoofing attacks in a single database. The SAS corpus is publicly available at no cost. |
Title | The Voice Conversion Challenge 2016 database |
Description | The Voice Conversion Challenge (VCC) 2016, one of the special sessions at Interspeech 2016, deals with speaker identity conversion, referred to as Voice Conversion (VC). The task of the challenge was speaker conversion, i.e., to transform the voice identity of a source speaker into that of a target speaker while preserving the linguistic content. Using a common dataset consisting of 162 utterances for training and 54 utterances for evaluation from each of 5 source and 5 target speakers, 17 groups working on VC around the world developed their own VC systems for every combination of the source and target speakers, i.e., 25 combinations in total, and generated voice samples converted by the developed systems. The objective of the VCC was to compare various VC techniques on identical training and evaluation speech data. The samples were evaluated in terms of target speaker similarity and naturalness by 200 listeners in a controlled environment. This dataset consists of the participants' VC submissions and the listening test results for naturalness and similarity. |
Type Of Material | Database/Collection of data |
Year Produced | 2016 |
Provided To Others? | Yes |
Impact | 17 groups working in VC around the world have used this database and have developed their own VC systems. |
URL | http://datashare.is.ed.ac.uk/handle/10283/2211 |
Title | Wargames Day 2 and 3 |
Description | Further recordings, made over two days, of groups playing the game Warhammer, yielding a total of 20 hours of transcribed speech. |
Type Of Material | Database/Collection of data |
Year Produced | 2016 |
Provided To Others? | Yes |
Impact | No direct impact has been recorded; however, the Kaldi team has proposed to refine the system scripts included in the corpus. |
URL | http://mini.dcs.shef.ac.uk/resources/sheffield-wargames-corpus/ |
Title | Wargames I |
Description | Recordings of groups of people playing the Warhammer game, recorded in 96 audio channels and 3 media streams, fully transcribed. |
Type Of Material | Database/Collection of data |
Year Produced | 2014 |
Provided To Others? | Yes |
Impact | It was discussed at ASRU 2016 as a potential candidate for future tasks. The University of Sheffield is recording parts II and III, which will be made available shortly for this purpose.
Title | the homeService corpus |
Description | an audio corpus of spontaneous dysarthric speech |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Impact | first example of a semi-spontaneous dysarthric speech corpus
URL | http://mini.dcs.shef.ac.uk/
Description | BBC |
Organisation | British Broadcasting Corporation (BBC) |
Department | BBC Research & Development |
Country | United Kingdom |
Sector | Public |
PI Contribution | Development of systems and showcases of the use of automatic speech processing of media archives |
Collaborator Contribution | Provided audio and video broadcast data and gave feedback on their requirements for future systems |
Impact | Several systems for media transcription are available in webASR now (www.webasr.org), a showcase for transcription of Youtube clips is also available (http://staffwww.dcs.shef.ac.uk/people/O.Saztorralba/youtube/) |
Start Year | 2012 |
Description | BBC Data Science Partnership |
Organisation | British Broadcasting Corporation (BBC) |
Department | BBC Research & Development |
Country | United Kingdom |
Sector | Public |
PI Contribution | Development of speech and language technology applied to broadcasting and media production |
Collaborator Contribution | R&D work from BBC researchers; data sharing. |
Impact | MGB Challenge; iCASE studentships; EPSRC SCRIPT project
Start Year | 2017 |
Description | Barnsley Hospital NHS Foundation Trust |
Organisation | Barnsley Hospital NHS Foundation Trust |
Country | United Kingdom |
Sector | Public |
PI Contribution | recruitment of homeService users |
Collaborator Contribution | recruitment of homeService users |
Impact | recruitment of homeService users |
Start Year | 2012 |
Description | Bloomberg PhD Studentship |
Organisation | Johns Hopkins University |
Department | Johns Hopkins Bloomberg School of Public Health |
Country | United States |
Sector | Academic/University |
PI Contribution | Research, systems, evaluation of multi-domain speech recognition |
Collaborator Contribution | Full funding for year 1 of a PhD studentship |
Impact | PhD student commenced work in Sept 2015 |
Start Year | 2015 |
Description | Cereproc |
Organisation | Cereproc Ltd. |
Country | United Kingdom |
Sector | Private |
PI Contribution | Steve Renals is non-executive director of Cereproc Ltd. |
Collaborator Contribution | Brought a deeper understanding of commercial exploitation of speech technology |
Impact | Cereproc has developed into one of the leading companies in the speech synthesis area
Start Year | 2014 |
Description | Dysarthric speech organisation in Sheffield |
Organisation | University of Sheffield |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | recruitment of homeService users
Collaborator Contribution | recruitment of homeService users
Impact | recruitment of homeService users
Start Year | 2015 |
Description | English Heritage |
Organisation | English Heritage |
Country | United Kingdom |
Sector | Charity/Non Profit |
PI Contribution | Development of a system and a platform for information retrieval and content linking in oral archives |
Collaborator Contribution | Provided audio data and summaries from interviews and gave feedback on their system requirements
Impact | A demonstrator for the use of the technology on a set of oral history interviews was developed (http://brodsworthhall.azurewebsites.net/) |
Start Year | 2013 |
Description | Julia Olcoz, visiting researcher |
Organisation | University of Zaragoza |
Country | Spain |
Sector | Academic/University |
PI Contribution | Provided a baseline system for the task of lightly supervised alignment in media broadcasts |
Collaborator Contribution | Developed novel techniques for improving the lightly supervised alignment task |
Impact | The enhanced system for lightly supervised alignment is available in webASR. A paper detailing the system was submitted to Interspeech 2016. |
Start Year | 2015 |
Description | MGB Challenge |
Organisation | British Broadcasting Corporation (BBC) |
Department | BBC Research & Development |
Country | United Kingdom |
Sector | Public |
PI Contribution | Organised an international speech recognition challenge using BBC data: the MGB Challenge at ASRU-2015. We provided baseline and state-of-the-art systems and defined the challenge procedures.
Collaborator Contribution | BBC provided 2000 hrs of multi-genre TV recordings, and 634M words of subtitle transcriptions |
Impact | P Bell, MJF Gales, T Hain, J Kilgour, P Lanchantin, X Liu, A McParland, S Renals, O Saz, M Wester, and P Woodland, The MGB Challenge: Evaluating multi-genre broadcast media recognition, IEEE ASRU-2015 |
Start Year | 2014 |
Description | Mediaeval |
Organisation | Medieval Settlement Research Group |
Country | United Kingdom |
Sector | Charity/Non Profit |
PI Contribution | Provided automatic transcription of media data for their evaluation campaigns |
Collaborator Contribution | Provided the data |
Impact | The transcribed data was used by participants in the evaluation campaign and features in several of their publications related to the evaluation. MediaEval is an evaluation campaign that aims to improve information retrieval on media data; it is organised by Maria Eskevich at Eurecom (France).
Start Year | 2014 |
Description | NII |
Organisation | National Institute of Informatics (NII) |
Country | Japan |
Sector | Public |
PI Contribution | Joint research in speech synthesis |
Collaborator Contribution | Joint research in speech synthesis |
Impact | many joint publications; collaboration on open source software (HTS); joint work on voice banking; joint position for Dr Junichi Yamagishi |
Start Year | 2013 |
Description | NITech |
Organisation | Nagoya Institute of Technology |
Country | Japan |
Sector | Academic/University |
PI Contribution | Joint research in particular focussed on HTS speech synthesis and user generated spoken dialogue systems. |
Collaborator Contribution | Joint research in particular focussed on HTS speech synthesis and user generated spoken dialogue systems. |
Impact | multiple joint publications |
Start Year | 2011 |
Description | Pengyuang Zhang - visitor |
Organisation | Chinese Academy of Sciences |
Country | China |
Sector | Public |
PI Contribution | Hosting the visitor, and providing access to research facilities at the University of Sheffield |
Collaborator Contribution | Research collaboration with the group |
Impact | The research visit resulted in the output of two research papers: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7078564 http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6854663 |
Start Year | 2013 |
Description | The Centre for Assistive Technology and Connected Healthcare (CATCH) |
Organisation | University of Sheffield |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | sharing of knowledge and resources |
Collaborator Contribution | sharing of knowledge and resources |
Impact | unknown |
Start Year | 2013 |
Description | Visitor Dr Yan-Xiong Li |
Organisation | South China University of Technology |
Department | School of Environment and Energy |
Country | China |
Sector | Academic/University |
PI Contribution | Collaboration on diarisation work on meeting and broadcast media data. |
Collaborator Contribution | Joint annotation and testing experiments on BBC data and AMI meeting data. Public Release of new data. |
Impact | Annotation of BBC and RT'07 data. |
Start Year | 2013 |
Description | uSTAR collaborative R&D project |
Organisation | NICT (National Institute of Information and Communications Technology)
Country | Japan |
Sector | Academic/University |
PI Contribution | Building online speech recognition on an IP-phone platform for speech-to-speech translation
Collaborator Contribution | Building online speech recognition on an IP-phone platform for speech-to-speech translation
Impact | Building online speech recognition on an IP-phone platform for speech-to-speech translation
Start Year | 2012 |
Title | Combilex-ASR |
Description | Combilex-ASR is a large-scale lexicon for speech recognition in British English. It is licensed under a Creative Commons BY-NC license.
IP Reference | |
Protection | Copyrighted (e.g. software) |
Year Protection Granted | 2016 |
Licensed | Yes |
Impact | This lexicon underpinned the ASRU-2015 MGB Challenge, and is planned to be used in the US IARPA Babel Programme |
Title | High-quality speech synthesizer, HTS voice |
Description | High-quality speech synthesis software based on speech technologies developed during my fellowship. |
IP Reference | |
Protection | Copyrighted (e.g. software) |
Year Protection Granted | 2012 |
Licensed | Yes |
Impact | I have formally licensed the high-quality speech synthesizer to two companies on a commercial basis.
Title | Clinical trial of personalized speech synthesis voices for MND patients |
Description | Adaptive speech synthesis may be used to develop personalised synthetic voices for people who have a vocal pathology. In 2009, Dr Sarah Creer from the University of Sheffield and I successfully applied it to clinical voice banking for laryngectomees (individuals who have had their vocal cords removed due to cancer) to reconstruct their voices. In 2010, I "implanted" the personalised synthetic voice of a patient who has motor neurone disease into their assistive communication device. Such a personalised voice can lead to far more natural communication for patients, particularly with family. A "voice reconstruction" trial has been carried out with about 100 patients in total at the Euan MacDonald Centre for MND Research and the Anne Rowling Regenerative Neurology Clinic in Edinburgh. |
Type | Health and Social Care Services |
Current Stage Of Development | Initial development |
Year Development Stage Completed | 2015 |
Development Status | Actively seeking support |
Impact | We have recorded about 100 MND patients at the Euan MacDonald Centre for MND Research and the Anne Rowling Regenerative Neurology Clinic in Edinburgh and have constructed personalized speech synthesizers based on their disordered voices. We have received and analyzed feedback from the patients and we have confirmed that this new speech synthesis technology can improve their quality-of-life. |
Title | Alignment task and scoring software |
Description | Implementation of the alignment scoring rules that the University of Sheffield defined as part of organising the MGB Challenge
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | Used in the MGB Challenge and used already for several publications by other research groups. |
Title | CUED-RNNLM Toolkit |
Description | Software to train recurrent neural network language models (RNNLMs) for speech recognition and other applications. The software features efficient training (on a GPU) and efficient evaluation (on a CPU). It includes a patch to HTK 3.4.1 that enables the application of RNNLMs to HTK-based speech recognition lattices. RNNLMs can be trained with additional features in the input layer for better performance (e.g. topic adaptation using an LDA-based topic vector). |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | This software was used to train RNNLMs that were used in the Cambridge University transcription systems used in the 2015 ASRU multi-genre broadcast challenge (international challenge involving the automatic transcription of BBC broadcast audio). The Cambridge system gave the lowest error rates in the challenge and the use of RNNLMs efficiently trained on a large corpus of subtitle material was a key component, as well as the use of topic adaptation. |
URL | http://mi.eng.cam.ac.uk/projects/cued-rnnlm/ |
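The auxiliary-feature idea described for CUED-RNNLM, feeding e.g. an LDA topic vector into the input layer alongside the current word, can be illustrated with a toy numpy forward pass. This is an illustrative sketch, not the toolkit's implementation; all dimensions, names and weights below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, topic_dim, hidden = 100, 8, 32        # toy sizes, not from the toolkit

# weights: word embedding, auxiliary-feature projection, recurrence, output
E = rng.normal(0, 0.1, (hidden, vocab))      # one-hot word -> hidden
F = rng.normal(0, 0.1, (hidden, topic_dim))  # topic vector -> hidden
R = rng.normal(0, 0.1, (hidden, hidden))     # hidden recurrence
O = rng.normal(0, 0.1, (vocab, hidden))      # hidden -> vocabulary logits

def step(word_id, topic_vec, h_prev):
    """One RNNLM step with the topic vector added at the input layer."""
    h = np.tanh(E[:, word_id] + F @ topic_vec + R @ h_prev)
    logits = O @ h
    probs = np.exp(logits - logits.max())    # softmax over next-word logits
    return h, probs / probs.sum()

h = np.zeros(hidden)
topic = rng.dirichlet(np.ones(topic_dim))    # stand-in for an LDA posterior
for w in [3, 17, 42]:                        # toy word-id sequence
    h, p = step(w, topic, h)                 # p: next-word distribution
```

The same topic vector is reused at every step, mirroring how a document-level LDA posterior conditions the whole utterance.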
Title | Diarisation scoring tools |
Description | Implementation of new diarisation methods as published in ICASSP 2016 |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | Not yet. |
Title | HTK 3.5 |
Description | HTK is a portable toolkit for building and manipulating hidden Markov models which has been developed over many years, primarily at the Cambridge University Engineering Department. HTK is primarily used for speech recognition research, although it is also widely used for speech synthesis and other applications. HTK 3.5 adds built-in support for artificial neural network (ANN) models while maintaining compatibility with most existing functions (including hybrid and tandem models, sequence training and CPU/GPU math kernels), as well as support for decoding with RNN language models. HTK is supplied in source form with a specific licence that allows any use of the models produced but does not allow software redistribution. HTK has over 100,000 registered users.
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | HTK 3.5 has been used as a platform to develop various types of speech technology research at Cambridge, building on developments over many years to focus on the use of deep neural network acoustic models and recurrent neural network language models. A particular outcome has been the development of the Cambridge University systems for the 2015 ASRU multi-genre broadcast (MGB) challenge. This required the processing of more than 1,600 hours of BBC TV audio data and the development of systems for transcription, subtitle alignment and diarisation. These were embodied in 4 tasks in the MGB challenge, and the Cambridge University systems based on HTK 3.5 had the best performance for all of these tasks. Many HTK users have downloaded HTK 3.5 and are actively using it to develop both research and commercial systems.
URL | http://htk.eng.cam.ac.uk/ |
Title | HTS ver 2.3 |
Description | HTS is an open-source toolkit for statistical parametric speech synthesis. I am a member of the team developing this free, open-source research software package for speech synthesis. |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | The HTS toolkit is used worldwide by both academic and commercial organisations, including Microsoft, Nuance, Toshiba, Pentax, and Google. HTS has been downloaded more than 10,000 times, and various commercial products using it are on the market. This toolkit is therefore a very influential platform for disseminating outcomes and forms an immediate pathway to impact. |
URL | http://hts.sp.nitech.ac.jp |
Title | Merlin |
Description | Merlin is a neural network (NN) based speech synthesis system developed at the Centre for Speech Technology Research (CSTR), University of Edinburgh. |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | Since its release at the end of the Natural Speech Technology project, Merlin has established a significant base of users and developers. |
URL | https://github.com/CSTR-Edinburgh/merlin |
Title | The Festival Speech Synthesis System |
Description | Festival offers a general framework for building speech synthesis systems, as well as examples of various modules. As a whole it offers full text-to-speech through a number of APIs: from the shell level, through a Scheme command interpreter, as a C++ library, from Java, and via an Emacs interface. Festival is multilingual (currently British and American English, and Spanish), though English is the most advanced. Other groups release new languages for the system, and full tools and documentation for building new voices are available through Carnegie Mellon's FestVox project (http://festvox.org). The software was first released in the 1990s and has been under continuous development, improvement and maintenance since then; v2.1 was released in November 2010. |
Type Of Technology | Software |
Open Source License? | Yes |
Impact | Festival is distributed by default in a number of standard Linux distributions, including Arch Linux, Fedora, CentOS, RHEL, Scientific Linux, Debian, Ubuntu, openSUSE, Mandriva, Mageia and Slackware, and can easily be installed on any Linux distribution that supports apt-get. More recently, our work on statistical parametric speech synthesis and algorithms for adaptation has been incorporated into the HTS toolkit (one of whose coordinators, Yamagishi, is from Edinburgh), which integrates with Festival. These toolkits are the most widely used open-source speech synthesis systems and have also provided the high-performing baseline systems for the international Blizzard evaluation of (commercial and research) speech synthesis, also organised by Edinburgh. |
URL | http://www.cstr.ed.ac.uk/projects/festival/ |
Title | The Festival Speech Synthesis system - version 2.4 |
Description | The de facto industry standard toolkit for developing text-to-speech systems. |
Type Of Technology | Software |
Year Produced | 2014 |
Open Source License? | Yes |
Impact | Festival technology has been used in commercial products from AT&T and led to the spinout company Rhetorical Systems. |
URL | http://www.cstr.ed.ac.uk/projects/festival/ |
Title | homeService protocol |
Description | The homeService protocol for developing human-machine interaction systems for users with speech and mobility impairments; an example of the virtuous-cycle approach. |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | Not at this point. |
Title | webASR |
Description | Publicly available web tool (www.webasr.org) with two showcases: media transcription (http://staffwww.dcs.shef.ac.uk/people/O.Saztorralba/youtube/) and alignment of lecture subtitles (http://staffwww.dcs.shef.ac.uk/people/O.Saztorralba/ted/) |
Type Of Technology | Webtool/Application |
Year Produced | 2015 |
Impact | New version of webASR (www.webasr.org) with new systems and demonstrators |
URL | http://www.webasr.org/ |
Company Name | Quorate Technology |
Description | Quorate Technology develops QSpeech, speech recognition software that records and transcribes audio, and analyses the data to make it searchable. |
Year Established | 2011 |
Impact | The company has a variety of commercial contracts and supports at least 10 full-time scientific positions. |
Website | http://www.quoratetechnology.com |
Company Name | Speak:Unique |
Description | Speak:Unique develops synthetic voices based on a person's own voice, allowing those losing their voice to use one of their own instead of a robotic-sounding one. |
Year Established | 2018 |
Impact | The company is still in an early phase. |
Website | https://www.speakunique.co.uk/ |
Description | A talk about the homeService experience/project |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Professional Practitioners |
Results and Impact | A talk about the homeService experience/project, given at Birmingham University (Feb 2016). |
Year(s) Of Engagement Activity | 2016 |
Description | A talk about the homeService experience/project to Medical Humanities Sheffield |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Professional Practitioners |
Results and Impact | A talk about the homeService experience/project to Medical Humanities Sheffield |
Year(s) Of Engagement Activity | 2015 |
Description | CATCH kickoff event |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Professional Practitioners |
Results and Impact | Introduction of the homeService project to an interested audience |
Year(s) Of Engagement Activity | 2013 |
Description | COST APPELE meeting |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | Engaging academics and industry across Europe |
Year(s) Of Engagement Activity | 2014 |
URL | http://aapele.eu/ |
Description | Data Science for Media Summit |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | The Summit was organised by the Alan Turing Institute to bring together researchers and media specialists to discuss future directions of research in data science for media |
Year(s) Of Engagement Activity | 2015 |
Description | How technology is changing speech and language therapy (Guardian on-line) |
Form Of Engagement Activity | A press release, press conference or response to a media enquiry/interview |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Media (as a channel to the public) |
Results and Impact | Article in Guardian which includes information about progress in speech recognition technology and specific mention of the EPSRC Natural Speech Technology project. |
Year(s) Of Engagement Activity | 2015 |
URL | http://www.theguardian.com/higher-education-network/2015/apr/15/how-technology-is-changing-speech-an... |
Description | Mobile University outreach event 2015 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Public/other audiences |
Results and Impact | Our research on speech recognition and machine translation was presented as part of a 'Mobile University' outreach event for the general public in Sheffield City Centre. |
Year(s) Of Engagement Activity | 2015 |
URL | http://mini.dcs.shef.ac.uk/mobileuni2015/ |
Description | Multi-Genre Broadcast (MGB) Challenge |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Organised and participated in the Multi-Genre Broadcast (MGB) challenge. The challenge culminated at ASRU 2015, which served as a meeting for the participants; more than 20 research groups took part. |
Year(s) Of Engagement Activity | 2015 |
URL | http://www.mgb-challenge.org/ |
Description | Talk at AIST Tsukuba |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Talk at AIST, Tsukuba, Japan - "Improving speech transcription using out-of-domain data" |
Year(s) Of Engagement Activity | 2012 |
Description | The future of Languages - more than just words |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | A public lecture at the Public Library in Amsterdam, followed by a debate and interactions with the audience. |
Year(s) Of Engagement Activity | 2012 |
URL | http://www.clubofamsterdam.com/event.asp?contentid=854 |
Description | Using speech synthesis to give everyone their own voice |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Public/other audiences |
Results and Impact | Discussions with the audience afterwards. Follow up emails from members of the public. |
Year(s) Of Engagement Activity | 2012 |
Description | seminar at INESC 2012 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Talk entitled "Assistive Speech Technology" at INESC-ID, Lisbon |
Year(s) Of Engagement Activity | 2012 |