Natural Speech Technology
Lead Research Organisation:
University of Edinburgh
Department Name: Sch of Informatics
Abstract
Humans are highly adaptable, and speech is our natural medium for informal communication. When communicating, we continuously adjust to other people, to the situation, and to the environment, using previously acquired knowledge to make this adaptation seem almost instantaneous. Humans generalise, enabling efficient communication in unfamiliar situations and rapid adaptation to new speakers or listeners. Current speech technology works well for certain controlled tasks and domains, but is far from natural, a consequence of its limited ability to acquire knowledge about people or situations, to adapt, and to generalise. This accounts for the uneasy public reaction to speech-driven systems. For example, text-to-speech synthesis can be as intelligible as human speech, but lacks expression and is not perceived as natural. Similarly, the accuracy of speech recognition systems can collapse if the acoustic environment or task domain changes, conditions which a human listener would handle easily. Research approaches to these problems have hitherto been piecemeal and as a result progress has been patchy. In contrast, NST will focus on the integrated theoretical development of new joint models for speech recognition and synthesis. These models will allow us to incorporate knowledge about the speakers, the environment, the communication context and awareness of the task, and will learn and adapt from real world data in an online, unsupervised manner. This theoretical unification is already underway within the NST labs and, combined with our record of turning theory into practical state-of-the-art applications, will enable us to bring a naturalness to speech technology that is not currently attainable. The NST programme will yield technology which (1) approaches human adaptability to new communication situations, (2) is capable of personalised communication, and (3) takes account of speaker intention and expressiveness in speech recognition and synthesis.
This is an ambitious vision. Its success will be measured in terms of how the theoretical development reshapes the field over the next decade, the take-up of the software systems that we shall develop, and the impact of our exemplar interactive applications. We shall establish a strong User Group to maximise the impact of the project, whose members are concerned with clinical applications as well as more general speech technology. Members of the User Group include Toshiba, EADS Innovation Works, Cisco, Barnsley Hospital NHS Foundation Trust, and the Euan MacDonald Centre for MND Research. An important interaction with the User Group will be validating our systems on their data and tasks, discussed at an annual user workshop.
Planned Impact
Leading market analysts predict that revenues from speech technology in North America alone will reach $1 billion by 2011. The reality has lagged behind such predictions in the past because the technology has not been sufficiently refined, but paradigms are shifting. The revolutionary change in connectivity and mobile computing in recent years gives rise to a number of compelling application drivers for the proposed research programme: (1) Rapid developments in mobile computing - decreasing power consumption, high network bandwidth and cloud computing - are stimulating demand for new interfaces. (2) Demographic and economic pressures mean that home care and support systems will become commonplace; such systems will benefit from personalised spoken interaction. (3) Remote meetings are becoming standard, stimulated by the economic conditions and climate change; natural speech technology will enable much richer interactions. (4) As data access becomes more open, the volume of available audio data will increase exponentially; natural speech transcription will make such data oceans searchable and structured. (5) There is a potentially huge market (entertainment, consumer apps, robotics) that would be opened up by the availability of adaptive, controllable, expressive speech synthesis. (6) Clinical applications of speech technology will be substantially enriched by the personalised systems proposed in NST. As these drivers have reached a critical level, the NST team has made a number of crucial breakthroughs in adaptive speech synthesis, in conversational speech transcription and in new algorithms to robustly handle changing environments. The research potential is thus poised to meet the application drivers.
Beneficiaries of the research can be found in the commercial sector (e.g., remote meeting technology; speech synthesis for computer games; speech archive search), the public sector (e.g., voice reconstruction services for the National Health Service), the third sector (e.g., charities providing support for sufferers of neurodegenerative diseases), art and design, policy makers (e.g., investment in the use of spoken language technology can reduce travel and therefore carbon emissions; it can also enable people to live longer in their own homes, thus reducing the need for residential care services), and the general public (e.g., prospective voice banking and donation could become as commonplace and as widely known as blood donation). The programme's direct training and development impact will be large, through the PhD students and researchers who will work on the project and through researchers on associated projects drawn in alongside NST; indirectly the training impact will be even larger through other students, researchers and visitors at the three universities, as well as programme workshops.
Organisations
- University of Edinburgh (Lead Research Organisation)
- NICT National Institute of Information and Communications Technology (Collaboration)
- Cereproc Ltd. (Collaboration)
- South China University of Technology (Collaboration)
- Medieval Settlement Research Group (Collaboration)
- BARNSLEY HOSPITAL NHS FOUNDATION TRUST (Collaboration)
- Nagoya Institute of Technology (Collaboration)
- Chinese Academy of Sciences (Collaboration)
- University of Sheffield (Collaboration)
- Johns Hopkins University (Collaboration)
- English Heritage (Collaboration)
- British Broadcasting Corporation (BBC) (Collaboration)
- University of Zaragoza (Collaboration)
- National Institute of Informatics (NII) (Collaboration)
Publications
Ali A
(2015)
Multi-reference WER for evaluating ASR for languages with no orthographic rule
in Proc IEEE ASRU
Ali A
(2015)
Multi-Reference Evaluation for Dialectal Speech Recognition System: A Study for Egyptian ASR
in Proc WANLP
Andersson S
(2012)
Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis
in Speech Communication
Astrinaki M.
(2013)
Reactive accent interpolation through an interactive map application
in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Aylett, M.
(2014)
A flexible front-end for HTS
Bell P
(2015)
Regularization of context-dependent deep neural networks with context-independent multi-task training
in Proc IEEE ICASSP
Bell P
(2015)
Complementary tasks for context-dependent deep neural network acoustic models
in Proc Interspeech
Bell P
(2015)
The MGB Challenge: Evaluating Multi-genre Broadcast Media Recognition
in Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
Description | The aim of the project is to significantly advance the state-of-the-art in speech technology through the recognition and synthesis of natural speech, approaching human levels of flexibility, reliability, and fluency. We have made advances in several areas. 1. Learning and adaptation. We have developed new approaches to learning representations for speech and language based on deep neural networks and recurrent neural networks. In contrast to previous approaches, these new approaches require less feature engineering and human design. They have been applied to both speech recognition and speech synthesis. We have also developed new approaches for the adaptation of systems to a new voice, given just a few seconds of speech, as well as new factorised modelling approaches which, for example, enable us to separately model the effects of the talker as distinct from the effects of the recording channel. 2. Speech transcription. We have developed several new acoustic modelling techniques: for example, techniques that model phonetic context more efficiently, and a new approach to recognising speech captured using multiple microphones. We have also developed more accurate language models, based on recurrent neural networks, and have introduced a new algorithm to automatically learn a pronunciation lexicon. 3. Speech synthesis. We have introduced new models for synthesising speech based on multiple average voices, and using prior information automatically extracted from talker characteristics. We have developed a new approach to characterising the perceptual effects of modelling assumptions in speech synthesis through perceptual experiments using stimuli constructed from repeated natural speech. We have developed new techniques for synthesis of conversational speech, for example through automatic pause insertion. 4. Applications. This work has been applied in a number of areas.
a) transcription of broadcast speech for subtitling, metadata extraction, and archive search, in collaboration with user group partners BBC and Red Bee Media. b) adaptive speech recognition and dialogue management for users with speech disorders, currently undergoing trials in users' homes. c) voice banking and cloning, to create personalised voice output communication aids for people with diseases such as Motor Neurone Disease and Parkinson's Disease. This is also undergoing trials with users. |
Exploitation Route | Our findings are already having considerable impact. In particular, we have released many of the findings made in the project through open source toolkits (Kaldi, HTK, HTS and Festival), which has resulted in significant take-up. Several of our techniques for speech recognition and speech synthesis are being further developed by other groups. Our techniques have been put to use by several members of the project user group, including the BBC and Ericsson (broadcast speech transcription); the Euan MacDonald Centre for Motor Neurone Disease Research and the Motor Neurone Disease Association (voice banking); Quorate Technology (audio search and browsing); Toshiba (speech synthesis); and Emotech (distant speech recognition). |
Sectors | Creative Economy Digital/Communication/Information Technologies (including Software) Education Healthcare Government Democracy and Justice Culture Heritage Museums and Collections |
URL | http://www.natural-speech-technology.org |
Description | 1. Contributions to widely-used open source software including HTK, Kaldi, HTS, CUED RNN Toolkit, and Merlin. The impact of NST on other researchers, and on industry, has been enhanced through the implementation and release of many of the key models and algorithms developed in the project via the main open source platforms used in speech technology: HTK, Kaldi, and HTS. NST research also resulted in the release of two widely-used open source software toolkits: the CUED RNN toolkit for speech recognition language modelling, and Merlin for neural text-to-speech synthesis. NST speech recognition was made available to researchers through the webASR system. 2. Personalised speech synthesis used for voice banking and reconstruction and deployed in assistive technology communication aids: this was developed in collaboration with the Euan MacDonald Centre for Motor Neurone Disease Research and the Anne Rowling Clinic for Regenerative Neurology at the University of Edinburgh. This work included a successful clinical pilot study, started during the NST project, and has resulted in the formation of a spinout company, SpeakUnique. 3. Transcription of multi-genre broadcast speech. Media companies such as Ericsson/Red Bee Media and the BBC have used NST technology to automatically transcribe a wide range of broadcast speech. Red Bee Media have worked closely with the University of Edinburgh and spinout company Quorate Technology on the development of their real-time subtitling services, increasing the accessibility of live television and streaming for the 11 million people with hearing loss in the UK. 4. Transcription of parliamentary proceedings. Hansard provides a "substantially verbatim" record of proceedings in both Houses of the UK Parliament - the House of Commons and the House of Lords - as well as transcripts of Select Committee sittings.
Using speech recognition technology from Edinburgh NST spinout Quorate Technology, Hansard uses automatically generated transcriptions as a first draft of the official record, as well as to enable searches linking the audio, video, and transcription of parliamentary recordings. 5. Deployment of an application with English Heritage for browsing oral histories: speech recognition developed in the project was used to browse through spoken interviews in 'Duty Calls', an exhibition by English Heritage centred on events at Brodsworth Hall in World War II, and in 'Village Memories', a lottery-funded project exploring life in three South Yorkshire villages. 6. Development of academic-industry research centres. An important aspect of the project was the development of industry-focussed research, and this has resulted in the formation of a number of academic-industry research centres including the BBC Data Science Research Partnership, the VoiceBase Centre for Speech & Language Technology at the University of Sheffield, a joint research lab with Huawei at the University of Edinburgh, and collaborative funded projects with Bloomberg, Ericsson, Samsung, Toshiba, and Zoo Digital. 7. Distant speech recognition. Emotech have used the distant speech recognition technology developed in NST to build a robust commercial system used in the prototype personal robot Olly. At CES 2017 in Las Vegas, Olly became the robotics product that won the most awards in CES history (Smart home; Drones and unmanned [sic] systems; Smart appliances; Home audio-video accessories). Speech recognition was central to this and based on UoE research. The underlying speech recognition technology is now being used by Emotech, in partnership with Huawei, for classroom-based language learning systems, including automatic pronunciation assessment. |
First Year Of Impact | 2012 |
Sector | Aerospace, Defence and Marine,Creative Economy,Digital/Communication/Information Technologies (including Software),Education,Healthcare,Government, Democracy and Justice,Culture, Heritage, Museums and Collections |
Impact Types | Cultural Societal Economic |
Description | CITIA |
Geographic Reach | Europe |
Policy Influence Type | Participation in a guidance/advisory committee |
Impact | Steve Renals is founding chairperson of the EU Conversational Interaction Technologies Innovation Alliance, a group which has advised the EU on policy relating speech technology to the multilingual digital single market.
URL | http://citia.eu |
Description | ROCKIT/CITIA Roadmap |
Geographic Reach | Europe |
Policy Influence Type | Citation in other policy documents |
Impact | The ROCKIT/CITIA strategic roadmap for conversational interaction technologies forms the basis of a research and innovation agenda in this area. In 2014 and 2015 we constructed this technology roadmap to enable the conversational interaction technologies vision to be realised. The roadmapping process was carried out at the European level, connecting the strong R&D base with commercial and industrial activity and with policy makers, at the EU and national levels. |
URL | http://www.sharpcloud.com/ROCKIT |
Description | Adapting end-to-end speech recognition systems (year 1) |
Amount | £137,365 (GBP) |
Organisation | Samsung |
Sector | Private |
Country | Korea, Republic of |
Start | 12/2018 |
End | 11/2019 |
Description | Adapting end-to-end speech recognition systems (year 2) |
Amount | £113,989 (GBP) |
Organisation | Samsung |
Sector | Private |
Country | Korea, Republic of |
Start | 12/2019 |
End | 11/2020 |
Description | Bloomberg PhD Studentship
Amount | £42,677 (GBP) |
Organisation | Bloomberg |
Sector | Private |
Country | United States |
Start | 01/2015 |
End | 12/2015 |
Description | EPSRC Impact Acceleration Award |
Amount | £37,716 (GBP) |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 02/2015 |
End | 09/2015 |
Description | EPSRC Responsive Mode |
Amount | £533,268 (GBP) |
Funding ID | EP/P011586/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 12/2016 |
End | 11/2019 |
Description | EPSRC Responsive Mode |
Amount | £1,402,097 (GBP) |
Funding ID | EP/R012180/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 03/2018 |
End | 02/2021 |
Description | EU FP7-ICT-2011-1.5 |
Amount | € 540,000 (EUR) |
Funding ID | 287872 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 11/2011 |
End | 10/2014 |
Description | EU FP7-ICT-2013-10 |
Amount | € 520,000 (EUR) |
Funding ID | 611092 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 12/2013 |
End | 11/2015 |
Description | EU H2020 ICT Programme |
Amount | € 1,999,113 (EUR) |
Funding ID | 688139 |
Organisation | European Commission |
Department | Horizon 2020 |
Sector | Public |
Country | European Union (EU) |
Start | 02/2016 |
End | 01/2019 |
Description | European Community's Seventh Framework Programme (FP7/2007-2013) |
Amount | € 1,100,000 (EUR) |
Funding ID | 287678 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 11/2011 |
End | 10/2014 |
Description | European Union Seventh Framework Programme |
Amount | € 1,100,000 (EUR) |
Funding ID | 287658 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 02/2012 |
End | 01/2015 |
Description | IIKE Early Career Research Scheme |
Amount | £15,000 (GBP) |
Organisation | University of Sheffield |
Sector | Academic/University |
Country | United Kingdom |
Start | 11/2015 |
End | 02/2016 |
Description | Innovation Seed Funding |
Amount | £5,000 (GBP) |
Organisation | University of Sheffield |
Sector | Academic/University |
Country | United Kingdom |
Start | 05/2015 |
End | 08/2015 |
Description | ItsLanguage pronunciation assessment |
Amount | € 75,000 (EUR) |
Organisation | ITSLanguage bv |
Sector | Private |
Country | Netherlands |
Start | 11/2012 |
End | 08/2014 |
Description | Leverhulme International Network |
Amount | £125,000 (GBP) |
Organisation | The Leverhulme Trust |
Sector | Charity/Non Profit |
Country | United Kingdom |
Start | 01/2015 |
End | 12/2018 |
Description | Response to Tender (1) |
Amount | £73,726 (GBP) |
Organisation | Defence Science & Technology Laboratory (DSTL) |
Sector | Public |
Country | United Kingdom |
Start | 09/2012 |
End | 04/2013 |
Description | Response to Tender (2) |
Amount | £98,982 (GBP) |
Organisation | Defence Science & Technology Laboratory (DSTL) |
Sector | Public |
Country | United Kingdom |
Start | 12/2013 |
End | 04/2014 |
Description | Response to Tender (3) |
Amount | £78,684 (GBP) |
Organisation | Defence Science & Technology Laboratory (DSTL) |
Sector | Public |
Country | United Kingdom |
Start | 01/2015 |
End | 08/2016 |
Description | The DataLab Industry PhD |
Amount | £102,000 (GBP) |
Organisation | The Datalab |
Sector | Charity/Non Profit |
Start | 08/2016 |
End | 04/2020 |
Description | Toshiba PhD Studentship |
Amount | £144,485 (GBP) |
Organisation | Toshiba Research Europe Ltd |
Sector | Private |
Country | United Kingdom |
Start | 08/2017 |
End | 04/2021 |
Title | MGB Challenge Speech Recognition Systems |
Description | Speech recognition software, based on the open source Kaldi toolkit, was released to enable the construction of lightly supervised multi-genre broadcast speech recognition systems.
Type Of Material | Improvements to research infrastructure |
Year Produced | 2015 |
Provided To Others? | Yes |
Impact | These systems provided the baselines for the 2015 MGB Challenge |
URL | http://mgb-challenge.org |
Title | Artificial Personality |
Description | This dataset is associated with the paper "Artificial Personality and Disfluency" by Mirjam Wester, Matthew Aylett, Marcus Tomalin and Rasmus Dall published at Interspeech 2015, Dresden. The focus of this paper is artificial voices with different personalities. Previous studies have shown links between an individual's use of disfluencies in their speech and their perceived personality. Here, filled pauses (uh and um) and discourse markers (like, you know, I mean) have been included in synthetic speech as a way of creating an artificial voice with different personalities. We discuss the automatic insertion of filled pauses and discourse markers (i.e., fillers) into otherwise fluent texts. The automatic system is compared to a ground truth of human "acted" filler insertion. Perceived personality (as defined by the big five personality dimensions) of the synthetic speech is assessed by means of a standardised questionnaire. Synthesis without fillers is compared to synthesis with either spontaneous or synthetic fillers. Our findings explore how the inclusion of disfluencies influences the way in which subjects rate the perceived personality of an artificial voice. |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Title | Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015) Database |
Description | The database has been used in the first Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015). Genuine speech is collected from 106 speakers (45 male, 61 female) and with no significant channel or background noise effects. Spoofed speech is generated from the genuine data using a number of different spoofing algorithms. The full dataset is partitioned into three subsets, the first for training, the second for development and the third for evaluation. More details can be found in the evaluation plan in the summary paper. |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Impact | Automatic speaker verification (ASV) offers a low-cost and flexible biometric solution to person authentication. While the reliability of ASV systems is now considered sufficient to support mass-market adoption, there are concerns that the technology is vulnerable to spoofing, also referred to as presentation attacks. Spoofing refers to an attack whereby a fraudster attempts to manipulate a biometric system by masquerading as another, enrolled person. Acknowledged vulnerabilities include attacks through impersonation, replay, speech synthesis and voice conversion. This database has been used for the 2015 ASVspoof challenge, which aims to encourage further progress through (i) the collection and distribution of a standard dataset with varying spoofing attacks implemented with multiple, diverse algorithms and (ii) a series of competitive evaluations. The first ASVspoof challenge was held during the 2015 edition of INTERSPEECH in Dresden, Germany. The challenge has been designed to support, for the first time, independent assessments of vulnerabilities to spoofing and of countermeasure performance and to facilitate the comparison of different spoofing countermeasures on a common dataset, with standard protocols and metrics. |
Title | CSTR VCTK Corpus -- Multi-speaker English Corpus for CSTR Voice Cloning Toolkit |
Description | This CSTR VCTK Corpus includes speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences, most of which were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. The newspaper texts were taken from The Herald (Glasgow), with permission from Herald & Times Group. Each speaker reads a different set of the newspaper sentences, where each set was selected using a greedy algorithm designed to maximise the contextual and phonetic coverage. The Rainbow Passage and elicitation paragraph are the same for all speakers. The Rainbow Passage can be found in the International Dialects of English Archive (http://web.ku.edu/~idea/readings/rainbow.htm). The elicitation paragraph is identical to the one used for the speech accent archive (http://accent.gmu.edu). The details of the speech accent archive can be found at http://www.ualberta.ca/~aacl2009/PDFs/WeinbergerKunath2009AACL.pdf All speech data was recorded using an identical recording setup: an omni-directional head-mounted microphone (DPA 4035), 96 kHz sampling frequency at 24 bits, in a hemi-anechoic chamber of the University of Edinburgh. All recordings were converted to 16 bits, downsampled to 48 kHz based on STPK, and manually end-pointed. This corpus was recorded for the purpose of building HMM-based text-to-speech synthesis systems, especially for speaker-adaptive HMM-based speech synthesis using average voice models trained on multiple speakers and speaker adaptation technologies. |
Type Of Material | Database/Collection of data |
Year Produced | 2012 |
Provided To Others? | Yes |
Impact | This is the first free corpus that is designed and appropriate for speaker-adaptive speech synthesis. This starts to become a standard database to build and compare speaker-adaptive speech synthesis systems and voice conversion systems. This was also used even for speaker verification systems. |
URL | http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html |
Title | Computer, Speech and Language - Experiment results for paper "Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations" |
Description | The files in the dataset correspond to results that have been generated for the Computer, Speech and Language article "Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations", http://dx.doi.org/10.1016/j.csl.2016.06.008. The files in the zip file are of three types: .ctm, which correspond to the output of the automatic speech recognition system, with columns giving segment information as well as transcripts of the recognition; .sys, which correspond to the scoring of the automatic speech recognition system, including the overall word error rate as well as the number of insertions, deletions and substitutions of the overall system; and .lur, which provide a more detailed decomposition of the word error rate across different tags. The naming convention of the files is as follows: TableX-LineY is the recognition and scoring output corresponding to Line Y of Table X in the article; FigureX-BarY is the recognition and scoring output corresponding to Bar Y (starting on the left hand side) of Figure X in the article. All three file types are standard outputs recognised by the automatic speech recognition community and can be opened using any text editor. |
Type Of Material | Database/Collection of data |
Year Produced | 2016 |
Provided To Others? | Yes |
URL | https://figshare.shef.ac.uk/articles/dataset/Computer_Speech_and_Language_-_Experiment_results_for_p... |
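The .sys scoring files described above report the overall word error rate alongside the insertion, deletion and substitution counts, following the conventional definition WER = (S + D + I) / N, where N is the number of reference words. As an illustrative sketch (not code from the project), WER can be computed from a Levenshtein alignment between the reference and hypothesis word sequences:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: minimum edit distance between reference and
    hypothesis word sequences, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Scoring tools such as NIST sclite additionally trace back through this alignment to report the separate substitution, deletion and insertion counts found in the .sys summaries.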
Title | Data Underpinning "Joint Optimisation of Tandem Systems Using Gaussian Mixture Density Neural Network Discriminative Sequence Training" |
Description | Description of the Speech Recognition Training and Test Data and its Availability used for Experiments. Key Speech Recognition Outputs/Detailed Scoring Results used in the paper. |
Type Of Material | Database/Collection of data |
Year Produced | 2017 |
Provided To Others? | Yes |
Title | Experiment materials for "Disfluencies in change detection in natural, vocoded and synthetic speech." |
Description | The current dataset is associated with the DiSS paper "Disfluencies in change detection in natural, vocoded and synthetic speech." In this paper we investigate the effect of filled pauses, a discourse marker and silent pauses in a change detection experiment in natural, vocoded and synthetic speech. In natural speech, change detection has been found to increase in the presence of filled pauses; we extend this work by replicating earlier findings and exploring the effect of a discourse marker, like, and silent pauses. Furthermore, we report how the use of "unnatural" speech, namely synthetic and vocoded speech, affects change detection rates.
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Title | Experiment materials for "The temporal delay hypothesis: Natural, vocoded and synthetic speech." |
Description | Including disfluencies in synthetic speech is being explored as a way of making synthetic speech sound more natural and conversational. How to measure whether the resulting speech is actually more natural, however, is not straightforward. Conventional approaches to synthetic speech evaluation fall short: listeners are either primed to prefer stimuli with filled pauses or, when not primed, prefer more fluent speech. Reaction time experiments from psycholinguistics may circumvent this issue. In this paper, we revisit one such reaction time experiment. For natural speech, delays in word onset were found to facilitate word recognition regardless of the type of delay, be it a filled pause (um), silence or a tone. We reused the materials for natural speech and extended them to vocoded and synthetic speech. The results partially replicate previous findings. For natural and vocoded speech, if the delay is a silent pause, significant increases in the speed of word recognition are found. If the delay comprises filled pauses, there is a significant increase in reaction time for vocoded speech but not for natural speech. For synthetic speech, no clear effects of delay on word recognition are found. We hypothesise that this is because it takes longer (requires more cognitive resources) to process synthetic speech than natural or vocoded speech.
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Title | Experimental results for IEEE/ACM Transaction on Audio, Speech and Language Processing Journal Paper: "Recurrent Neural Network Language Model Adaptation for Multi-Genre Broadcast Speech Recognition and Alignment" |
Description | The files in the dataset correspond to results generated for the IEEE/ACM Transactions on Audio, Speech and Language Processing paper "Recurrent Neural Network Language Model Adaptation for Multi-Genre Broadcast Speech Recognition and Alignment", DOI: 10.1109/TASLP.2018.2888814. The paper deals with language model adaptation for the MGB Challenge 2015 transcription and alignment tasks. The files in the zip file are of three types:
- .ctm, which correspond to the output of the automatic speech recognition system; the columns include segment information as well as the recognised transcripts.
- .ctm.filt.sys, which correspond to the scoring of the automatic speech recognition system and include the overall word error rate as well as the numbers of insertions, deletions and substitutions of the overall system.
- .ctm.filt.lur, which provide a more detailed decomposition of the word error rate across multiple genres.
The three file types are repeated for all the results described in Tables 4, 5 and 6 of the paper (27 entries in total). The naming convention of the files is as follows:
- 4gram.amlm.baseline refers to the 4-gram LM baseline on LM1 and LM2 text.
- rnnlm refers to a Recurrent Neural Network Language Model.
- the amrnnlm prefix refers to an acoustic model text RNNLM.
- the amlmrnnlm prefix refers to an acoustic model + language model text RNNLM.
- the .baseline.lattice.rescore suffix refers to baseline results generated with lattice rescoring.
- the .nbest.baseline.rescore suffix refers to baseline results generated with n-best rescoring.
- .noadaptation refers to RNNLM results with no adaptation.
- .genre.finetune refers to genre fine-tuning of the RNNLMs.
- .genre.adaptationlayer refers to genre LHN adaptation-layer fine-tuning of the RNNLMs.
- .ldafeat.hiddenlayer refers to text-based Latent Dirichlet Allocation (LDA) features at the hidden layer.
- .acousticldafeat.hiddenlayer refers to acoustic LDA features at the hidden layer.
- .acoustictextldafeat.hiddenlayer refers to acoustic and text LDA features at the hidden layer.
- .genrefeat.hiddenlayer refers to genre 1-hot auxiliary codes at the hidden layer.
- .genrefeat.adaptationlayer refers to genre 1-hot auxiliary codes at the adaptation layer.
- .2layer.ldafeat.hiddenlayer refers to a 2-layer RNNLM with text LDA features at the hidden layer and no features at the adaptation layer.
- .2layer.ldafeat.hiddenlayer.genrefinetune refers to a 2-layer RNNLM with text LDA features at the hidden layer, no features at the adaptation layer, and genre fine-tuning.
- .kcomponent refers to K-Component Adaptive Topic fine-tuning using LDA posteriors.
All three file types are standard outputs recognised by the automatic speech recognition community and can be opened in any text editor. |
Type Of Material | Database/Collection of data |
Year Produced | 2021 |
Provided To Others? | Yes |
URL | https://figshare.shef.ac.uk/articles/dataset/Experiments_results_for_IEEE_ACM_Transaction_on_Audio_S... |
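The .ctm files listed in these records follow the standard NIST CTM layout: one recognised word per line, with columns for source file, channel, start time, duration, word and an optional confidence score. As an illustrative sketch (the file name and word values in the test below are invented, not taken from this dataset), the transcript of each recording can be recovered with a few lines of Python:

```python
from collections import defaultdict

def read_ctm(path):
    """Parse a NIST-style CTM file: one recognised word per line,
    with columns <file> <channel> <start> <duration> <word> [<conf>]."""
    transcripts = defaultdict(list)          # recording id -> [(start, word)]
    with open(path) as f:
        for line in f:
            if not line.strip() or line.startswith(";;"):
                continue                     # skip blank and comment lines
            fields = line.split()
            rec, _chan, start, _dur, word = fields[:5]
            transcripts[rec].append((float(start), word))
    # order each recording's words by start time before joining
    return {rec: " ".join(w for _, w in sorted(words))
            for rec, words in transcripts.items()}
```

Sorting on start time means the recovered word order does not depend on the order of lines in the file.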
Title | Experiments results for IEEE/ACM Transaction on Audio, Speech and Language Processing Journal Paper: "Recurrent Neural Network Language Model Adaptation for Multi-Genre Broadcast Speech Recognition and Alignment" |
Description | The files in the dataset correspond to results generated for the IEEE/ACM Transactions on Audio, Speech and Language Processing paper "Recurrent Neural Network Language Model Adaptation for Multi-Genre Broadcast Speech Recognition and Alignment", DOI: 10.1109/TASLP.2018.2888814. The paper deals with language model adaptation for the MGB Challenge 2015 transcription and alignment tasks. The files in the zip file are of three types:
- .ctm, which correspond to the output of the automatic speech recognition system; the columns include segment information as well as the recognised transcripts.
- .ctm.filt.sys, which correspond to the scoring of the automatic speech recognition system and include the overall word error rate as well as the numbers of insertions, deletions and substitutions of the overall system.
- .ctm.filt.lur, which provide a more detailed decomposition of the word error rate across multiple genres.
The three file types are repeated for all the results described in Tables 4, 5 and 6 of the paper (27 entries in total). The naming convention of the files is as follows:
- 4gram.amlm.baseline refers to the 4-gram LM baseline on LM1 and LM2 text.
- rnnlm refers to a Recurrent Neural Network Language Model.
- the amrnnlm prefix refers to an acoustic model text RNNLM.
- the amlmrnnlm prefix refers to an acoustic model + language model text RNNLM.
- the .baseline.lattice.rescore suffix refers to baseline results generated with lattice rescoring.
- the .nbest.baseline.rescore suffix refers to baseline results generated with n-best rescoring.
- .noadaptation refers to RNNLM results with no adaptation.
- .genre.finetune refers to genre fine-tuning of the RNNLMs.
- .genre.adaptationlayer refers to genre LHN adaptation-layer fine-tuning of the RNNLMs.
- .ldafeat.hiddenlayer refers to text-based Latent Dirichlet Allocation (LDA) features at the hidden layer.
- .acousticldafeat.hiddenlayer refers to acoustic LDA features at the hidden layer.
- .acoustictextldafeat.hiddenlayer refers to acoustic and text LDA features at the hidden layer.
- .genrefeat.hiddenlayer refers to genre 1-hot auxiliary codes at the hidden layer.
- .genrefeat.adaptationlayer refers to genre 1-hot auxiliary codes at the adaptation layer.
- .2layer.ldafeat.hiddenlayer refers to a 2-layer RNNLM with text LDA features at the hidden layer and no features at the adaptation layer.
- .2layer.ldafeat.hiddenlayer.genrefinetune refers to a 2-layer RNNLM with text LDA features at the hidden layer, no features at the adaptation layer, and genre fine-tuning.
- .kcomponent refers to K-Component Adaptive Topic fine-tuning using LDA posteriors.
All three file types are standard outputs recognised by the automatic speech recognition community and can be opened in any text editor. |
Type Of Material | Database/Collection of data |
Year Produced | 2018 |
Provided To Others? | Yes |
URL | https://figshare.shef.ac.uk/articles/dataset/Experiments_results_for_IEEE_ACM_Transaction_on_Audio_S... |
Title | Human vs Machine Spoofing |
Description | Listening test materials for "Human vs Machine Spoofing Detection on Wideband and Narrowband data." They include lists of the speech material selected from the SAS spoofing database and the listeners' responses. The main data file has been split into five smaller files (labelled "aa" to "ae") for ease of download. |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Title | Improving Interpretability and Regularisation in Deep Learning |
Description | The provided .ctm and scoring .sys files correspond to the MPE systems of Table VI (Javanese) and Table X (BN) of this paper. |
Type Of Material | Database/Collection of data |
Year Produced | 2018 |
Provided To Others? | Yes |
Title | MGB Challenge |
Description | The MGB Challenge data was released to support the MGB Challenge evaluation of multi-genre broadcast speech recognition systems. It consists of approximately 1,600 hours of broadcast audio taken from seven weeks of BBC output across all TV channels; the captions as originally broadcast on TV, accompanied by baseline lightly-supervised alignments from an ASR system with confidence measures; several hundred million words of subtitle text from BBC TV output collected over a 15-year period; and a hand-compiled British English lexicon derived from Combilex. |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Impact | This research database supported the MGB Challenge at the IEEE ASRU-2015 workshop |
URL | http://mgb-challenge.org |
Title | MGB database |
Description | The MGB database is the official database of the MGB challenge. It contains 2,000 hours of audio, 700 million words of transcripts plus other metadata. |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Impact | Features in the MGB challenge and the accompanying workshop at ASRU 2015. |
Title | Multimedia Tools and Applications - Experiments results for paper "Lightly supervised alignment of subtitles on multigenre broadcasts" |
Description | The files in the dataset correspond to results that have been generated for the Multimedia Tools and Applications (Springer ISSN: 1380-7501 / 1573-7721) article "Lightly supervised alignment of subtitles on multigenre broadcasts". The files in the zip file are of three types: - .ctm, which correspond to the output of the automatic speech recognition system or lightly supervised alignment system. - .rttm, which correspond to the output of the speech segmentation system. - .sys, which correspond to the scoring of the speech segmentation, automatic speech recognition or lightly supervised alignment system. The naming convention of the files is as follows: TableX-LineY-[ser|wer|f1] is the output and scoring results corresponding to Line Y of Table X in the article, in terms of SER, WER or F1 score. All three file types are standard outputs recognised by the speech technology community and can be opened in any text editor. |
Type Of Material | Database/Collection of data |
Year Produced | 2016 |
Provided To Others? | Yes |
URL | https://figshare.shef.ac.uk/articles/dataset/Multimedia_Tools_and_Applications_-_Experiments_results... |
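The .rttm segmentation files mentioned above use the standard NIST RTTM layout, in which SPEAKER records carry the recording name, segment start time and segment duration in the second, fourth and fifth whitespace-separated fields. A minimal sketch for totalling detected speech time per recording (the example records in the test are invented; the field positions follow the RTTM convention):

```python
from collections import defaultdict

def speech_time(rttm_lines):
    """Sum SPEAKER segment durations per recording from RTTM records.
    RTTM columns: type file chan tbeg tdur ortho stype name conf [slat]."""
    totals = defaultdict(float)
    for line in rttm_lines:
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "SPEAKER":
            totals[fields[1]] += float(fields[4])   # tdur, in seconds
    return dict(totals)
```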
Title | REHASP |
Description | Studio recording of female native British English talker producing three sets of Harvard sentences (thirty prompts), each prompt repeated forty times. Available both as unprocessed 96 kHz recordings and standardised 16 kHz files. |
Type Of Material | Database/Collection of data |
Year Produced | 2014 |
Provided To Others? | Yes |
Impact | The following paper has been published: G. E. Henter, T. Merritt, M. Shannon, C. Mayo, and S. King, "Measuring the perceptual effects of modelling assumptions in speech synthesis using stimuli constructed from repeated natural speech," in Proc. Interspeech, 2014 |
Title | Spoofing and Anti-Spoofing (SAS) corpus v1.0 |
Description | This dataset is associated with the paper "SAS: A speaker verification spoofing database containing diverse attacks", which presents the first version of a speaker verification spoofing and anti-spoofing database, named the SAS corpus. The corpus includes nine spoofing techniques, two based on speech synthesis and seven on voice conversion. We designed two protocols, one for standard speaker verification evaluation and the other for producing spoofing materials; these allow the speech synthesis community to produce spoofing materials incrementally without knowledge of speaker verification spoofing and anti-spoofing. To provide a set of preliminary results, we conducted speaker verification experiments using two state-of-the-art systems. Without any anti-spoofing techniques, both systems are extremely vulnerable to the spoofing attacks implemented in the SAS corpus. |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Impact | The SAS database is the first version of a standard dataset for spoofing and anti-spoofing research. Currently, the SAS corpus includes speech generated using nine spoofing methods, each of which comprises around 300,000 spoofed trials. To the best of our knowledge, this is the first attempt to include such a diverse range of spoofing attacks in a single database. The SAS corpus is publicly available at no cost. |
Title | The Voice Conversion Challenge 2016 database |
Description | The Voice Conversion Challenge (VCC) 2016, one of the special sessions at Interspeech 2016, deals with speaker identity conversion, referred to as Voice Conversion (VC). The task of the challenge was speaker conversion, i.e., to transform the voice identity of a source speaker into that of a target speaker while preserving the linguistic content. Using a common dataset consisting of 162 utterances for training and 54 utterances for evaluation from each of 5 source and 5 target speakers, 17 groups working on VC around the world developed their own VC systems for every combination of the source and target speakers, i.e., 25 combinations in total, and generated voice samples converted by the developed systems. The objective of the VCC was to compare various VC techniques on identical training and evaluation speech data. The samples were evaluated in terms of target speaker similarity and naturalness by 200 listeners in a controlled environment. This dataset consists of the participants' VC submissions and the listening test results for naturalness and similarity. |
Type Of Material | Database/Collection of data |
Year Produced | 2016 |
Provided To Others? | Yes |
Impact | 17 groups working in VC around the world have used this database and have developed their own VC systems. |
URL | http://datashare.is.ed.ac.uk/handle/10283/2211 |
Title | Wargames Day 2 and 3 |
Description | Further recordings, made over two days, of groups playing the game Warhammer, yielding a total of 20 hours of transcribed speech. |
Type Of Material | Database/Collection of data |
Year Produced | 2016 |
Provided To Others? | Yes |
Impact | No direct impact has been recorded; however, the Kaldi team has proposed to refine the system scripts included in the corpus. |
URL | http://mini.dcs.shef.ac.uk/resources/sheffield-wargames-corpus/ |
Title | Wargames I |
Description | Recordings of groups of people playing the Warhammer game, recorded in 96 audio channels and 3 media streams, fully transcribed. |
Type Of Material | Database/Collection of data |
Year Produced | 2014 |
Provided To Others? | Yes |
Impact | It was discussed at ASRU 2016 as a potential candidate for future tasks. The University of Sheffield is recording parts II and III, which will be made available shortly for this purpose.
Title | the homeService corpus |
Description | an audio corpus of spontaneous dysarthric speech |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Impact | first example of a semi-spontaneous dysarthric speech corpus
URL | http://mini.dcs.shef.ac.uk/
Description | BBC |
Organisation | British Broadcasting Corporation (BBC) |
Department | BBC Research & Development |
Country | United Kingdom |
Sector | Public |
PI Contribution | Development of systems and showcases of the use of automatic speech processing of media archives |
Collaborator Contribution | Provided audio and video broadcast data and gave feedback on their requirements for future systems |
Impact | Several systems for media transcription are available in webASR now (www.webasr.org), a showcase for transcription of Youtube clips is also available (http://staffwww.dcs.shef.ac.uk/people/O.Saztorralba/youtube/) |
Start Year | 2012 |
Description | BBC Data Science Partnership |
Organisation | British Broadcasting Corporation (BBC) |
Department | BBC Research & Development |
Country | United Kingdom |
Sector | Public |
PI Contribution | Development of speech and language technology applied to broadcasting and media production |
Collaborator Contribution | R&D work from BBC researchers; data sharing. |
Impact | MGB Challenge; iCASE studentships; EPSRC SCRIPT project
Start Year | 2017 |
Description | Barnsley Hospital NHS Foundation Trust |
Organisation | Barnsley Hospital NHS Foundation Trust |
Country | United Kingdom |
Sector | Public |
PI Contribution | recruitment of homeService users |
Collaborator Contribution | recruitment of homeService users |
Impact | recruitment of homeService users |
Start Year | 2012 |
Description | Bloomberg PhD Studentship |
Organisation | Johns Hopkins University |
Department | Johns Hopkins Bloomberg School of Public Health |
Country | United States |
Sector | Academic/University |
PI Contribution | Research, systems, evaluation of multi-domain speech recognition |
Collaborator Contribution | Full funding for year 1 of a PhD studentship |
Impact | PhD student commenced work in Sept 2015 |
Start Year | 2015 |
Description | Cereproc |
Organisation | Cereproc Ltd. |
Country | United Kingdom |
Sector | Private |
PI Contribution | Steve Renals is non-executive director of Cereproc Ltd. |
Collaborator Contribution | Brought a deeper understanding of commercial exploitation of speech technology |
Impact | Cereproc has developed into one of the leading companies in the speech synthesis area
Start Year | 2014 |
Description | Dysarthric speech organisation in Sheffield |
Organisation | University of Sheffield |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | recruitment of homeService users
Collaborator Contribution | recruitment of homeService users
Impact | recruitment of homeService users
Start Year | 2015 |
Description | English Heritage |
Organisation | English Heritage |
Country | United Kingdom |
Sector | Charity/Non Profit |
PI Contribution | Development of a system and a platform for information retrieval and content linking in oral archives |
Collaborator Contribution | Provided audio data and summaries from interviews and gave feedback on their system requirements
Impact | A demonstrator for the use of the technology on a set of oral history interviews was developed (http://brodsworthhall.azurewebsites.net/) |
Start Year | 2013 |
Description | Julia Olcoz, visiting researcher |
Organisation | University of Zaragoza |
Country | Spain |
Sector | Academic/University |
PI Contribution | Provided a baseline system for the task of lightly supervised alignment in media broadcasts |
Collaborator Contribution | Developed novel techniques for improving the lightly supervised alignment task |
Impact | The enhanced system for lightly supervised alignment is available in webASR. A paper detailing the system was submitted to Interspeech 2016. |
Start Year | 2015 |
Description | MGB Challenge |
Organisation | British Broadcasting Corporation (BBC) |
Department | BBC Research & Development |
Country | United Kingdom |
Sector | Public |
PI Contribution | Organised an international speech recognition challenge using BBC data: the MGB Challenge at ASRU-2015. We provided baseline and state-of-the-art systems and defined the challenge procedures.
Collaborator Contribution | BBC provided 2000 hrs of multi-genre TV recordings, and 634M words of subtitle transcriptions |
Impact | P Bell, MJF Gales, T Hain, J Kilgour, P Lanchantin, X Liu, A McParland, S Renals, O Saz, M Wester, and P Woodland, The MGB Challenge: Evaluating multi-genre broadcast media recognition, IEEE ASRU-2015 |
Start Year | 2014 |
Description | Mediaeval |
Organisation | Medieval Settlement Research Group |
Country | United Kingdom |
Sector | Charity/Non Profit |
PI Contribution | Provided automatic transcription of media data for their evaluation campaigns |
Collaborator Contribution | Provided the data |
Impact | The transcribed data was used by participants in the evaluation campaign and features in several of their publications related to the evaluation. MediaEval is an evaluation campaign that aims to improve information retrieval on media data; it is organised by Maria Eskevich at Eurecom (France).
Start Year | 2014 |
Description | NII |
Organisation | National Institute of Informatics (NII) |
Country | Japan |
Sector | Public |
PI Contribution | Joint research in speech synthesis |
Collaborator Contribution | Joint research in speech synthesis |
Impact | many joint publications; collaboration on open source software (HTS); joint work on voice banking; joint position for Dr Junichi Yamagishi |
Start Year | 2013 |
Description | NITech |
Organisation | Nagoya Institute of Technology |
Country | Japan |
Sector | Academic/University |
PI Contribution | Joint research in particular focussed on HTS speech synthesis and user generated spoken dialogue systems. |
Collaborator Contribution | Joint research in particular focussed on HTS speech synthesis and user generated spoken dialogue systems. |
Impact | multiple joint publications |
Start Year | 2011 |
Description | Pengyuang Zhang - visitor |
Organisation | Chinese Academy of Sciences |
Country | China |
Sector | Public |
PI Contribution | Hosting the visitor, and providing access to research facilities at the University of Sheffield |
Collaborator Contribution | Research collaboration with the group |
Impact | The research visit resulted in the output of two research papers: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7078564 http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6854663 |
Start Year | 2013 |
Description | The Centre for Assistive Technology and Connected Healthcare (CATCH) |
Organisation | University of Sheffield |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | sharing of knowledge and resources |
Collaborator Contribution | sharing of knowledge and resources |
Impact | unknown |
Start Year | 2013 |
Description | Visitor Dr Yan-Xiong Li |
Organisation | South China University of Technology |
Department | School of Environment and Energy |
Country | China |
Sector | Academic/University |
PI Contribution | Collaboration on diarisation work on meeting and broadcast media data. |
Collaborator Contribution | Joint annotation and testing experiments on BBC data and AMI meeting data. Public Release of new data. |
Impact | Annotation of BBC and RT'07 data. |
Start Year | 2013 |
Description | uSTAR collaborative R&D project |
Organisation | NICT (National Institute of Information and Communications Technology)
Country | Japan |
Sector | Academic/University |
PI Contribution | Building online speech recognition on an IP-phone platform for speech-to-speech translation
Collaborator Contribution | Building online speech recognition on an IP-phone platform for speech-to-speech translation
Impact | Building online speech recognition on an IP-phone platform for speech-to-speech translation
Start Year | 2012 |
Title | Combilex-ASR |
Description | Combilex-ASR is a large-scale lexicon for speech recognition in British English. It is licensed under a Creative Commons BY-NC license.
IP Reference | |
Protection | Copyrighted (e.g. software) |
Year Protection Granted | 2016 |
Licensed | Yes |
Impact | This lexicon underpinned the ASRU-2015 MGB Challenge, and is planned to be used in the US IARPA Babel Programme |
Title | High-quality speech synthesizer, HTS voice |
Description | High-quality speech synthesis software based on speech technologies developed during my fellowship. |
IP Reference | |
Protection | Copyrighted (e.g. software) |
Year Protection Granted | 2012 |
Licensed | Yes |
Impact | I have formally licensed the high-quality speech synthesizer to two companies on a commercial basis.
Title | Clinical trial of personalized speech synthesis voices for MND patients |
Description | Adaptive speech synthesis may be used to develop personalised synthetic voices for people who have a vocal pathology. In 2009, Dr Sarah Creer from the University of Sheffield and I successfully applied it to clinical voice banking for laryngectomees (individuals who have had their vocal cords removed due to cancer) to reconstruct their voices. In 2010, I "implanted" the personalised synthetic voice of a patient who has motor neurone disease into their assistive communication device. Such a personalised voice can lead to far more natural communication for patients, particularly with family. A "voice reconstruction" trial has been carried out with about 100 patients in total at the Euan MacDonald Centre for MND Research and the Anne Rowling Regenerative Neurology Clinic in Edinburgh. |
Type | Health and Social Care Services |
Current Stage Of Development | Initial development |
Year Development Stage Completed | 2015 |
Development Status | Actively seeking support |
Impact | We have recorded about 100 MND patients at the Euan MacDonald Centre for MND Research and the Anne Rowling Regenerative Neurology Clinic in Edinburgh and have constructed personalized speech synthesizers based on their disordered voices. We have received and analyzed feedback from the patients and we have confirmed that this new speech synthesis technology can improve their quality-of-life. |
Title | Alignment task and scoring software |
Description | Implementation of the alignment scoring rules that the University of Sheffield defined as part of organising the MGB Challenge
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | Used in the MGB Challenge and used already for several publications by other research groups. |
Title | CUED-RNNLM Toolkit |
Description | Software to train recurrent neural network language models (RNNLMs) for speech recognition and other applications. The software features efficient training (on a GPU) and efficient evaluation (on a CPU). It includes a patch to HTK 3.4.1 that enables the application of RNNLMs to HTK-based speech recognition lattices. RNNLMs can be trained with additional features in the input layer for better performance (e.g. topic adaptation using an LDA-based topic vector). |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | This software was used to train RNNLMs that were used in the Cambridge University transcription systems used in the 2015 ASRU multi-genre broadcast challenge (international challenge involving the automatic transcription of BBC broadcast audio). The Cambridge system gave the lowest error rates in the challenge and the use of RNNLMs efficiently trained on a large corpus of subtitle material was a key component, as well as the use of topic adaptation. |
URL | http://mi.eng.cam.ac.uk/projects/cued-rnnlm/ |
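The auxiliary-feature idea described for CUED-RNNLM, feeding e.g. an LDA topic vector into the input layer alongside the current word, can be illustrated with a toy numpy forward pass. This is an illustrative sketch, not the toolkit's implementation; all dimensions, names and weights below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, topic_dim, hidden = 100, 8, 32        # toy sizes, not from the toolkit

# weights: word embedding, auxiliary-feature projection, recurrence, output
E = rng.normal(0, 0.1, (hidden, vocab))      # one-hot word -> hidden
F = rng.normal(0, 0.1, (hidden, topic_dim))  # topic vector -> hidden
R = rng.normal(0, 0.1, (hidden, hidden))     # hidden recurrence
O = rng.normal(0, 0.1, (vocab, hidden))      # hidden -> vocabulary logits

def step(word_id, topic_vec, h_prev):
    """One RNNLM step with the topic vector added at the input layer."""
    h = np.tanh(E[:, word_id] + F @ topic_vec + R @ h_prev)
    logits = O @ h
    probs = np.exp(logits - logits.max())    # softmax over next-word logits
    return h, probs / probs.sum()

h = np.zeros(hidden)
topic = rng.dirichlet(np.ones(topic_dim))    # stand-in for an LDA posterior
for w in [3, 17, 42]:                        # toy word-id sequence
    h, p = step(w, topic, h)                 # p: next-word distribution
```

The same topic vector is reused at every step, mirroring how a document-level LDA posterior conditions the whole utterance.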
Title | Diarisation scoring tools |
Description | Implementation of new diarisation methods as published in ICASSP 2016 |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | Not yet. |
Title | HTK 3.5 |
Description | HTK is a portable toolkit for building and manipulating hidden Markov models which has been developed over many years, primarily at the Cambridge University Engineering Department. HTK is primarily used for speech recognition research, although it is also widely used for speech synthesis and other applications. HTK 3.5 adds built-in support for artificial neural network (ANN) models while maintaining compatibility with most existing functions (including hybrid and tandem models, sequence training and CPU/GPU math kernels), as well as support for decoding with RNN language models. HTK is supplied in source form with a specific licence that allows any use of the models produced but does not allow software redistribution. HTK has over 100,000 registered users.
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | HTK 3.5 has been used as a platform to develop various types of speech technology research at Cambridge, building on developments over many years to focus on the use of deep neural network acoustic models and recurrent neural network language models. A particular outcome has been the development of the Cambridge University systems for the 2015 ASRU multi-genre broadcast (MGB) challenge. This required the processing of more than 1,600 hours of BBC TV audio data and the development of systems for transcription, subtitle alignment and diarisation. These were embodied in 4 tasks in the MGB challenge, and the Cambridge University systems based on HTK 3.5 had the best performance for all of these tasks. Many HTK users have downloaded HTK 3.5 and are actively using it to develop both research and commercial systems.
URL | http://htk.eng.cam.ac.uk/ |
Title | HTS ver 2.3 |
Description | HTS is an open-source toolkit for statistical parametric speech synthesis. I am a member of the team developing this free, open-source research software package for speech synthesis. |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | The HTS toolkit is used worldwide by both academic and commercial organisations, including Microsoft, Nuance, Toshiba, Pentax, and Google. HTS has been downloaded more than 10,000 times, and various commercial products using it are on the market. This toolkit is therefore a very influential platform for disseminating outcomes and forms an immediate pathway to impact. |
URL | http://hts.sp.nitech.ac.jp |
Title | Merlin |
Description | Merlin is a neural network (NN) based speech synthesis system developed at the Centre for Speech Technology Research (CSTR), University of Edinburgh. |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | Since its release at the end of the Natural Speech Technology project, Merlin has established a significant base of users and developers. |
URL | https://github.com/CSTR-Edinburgh/merlin |
Title | The Festival Speech Synthesis System |
Description | Festival offers a general framework for building speech synthesis systems, as well as examples of various modules. As a whole it offers full text-to-speech through a number of APIs: from the shell level, through a Scheme command interpreter, as a C++ library, from Java, and via an Emacs interface. Festival is multilingual (currently British and American English, and Spanish), though English is the most advanced. Other groups release new languages for the system, and full tools and documentation for building new voices are available through Carnegie Mellon's FestVox project (http://festvox.org). The software was first released in the 1990s and has been under continuous development, improvement and maintenance since then; v2.1 was released in November 2010. |
Type Of Technology | Software |
Open Source License? | Yes |
Impact | Festival is distributed by default in a number of standard Linux distributions, including Arch Linux, Fedora, CentOS, RHEL, Scientific Linux, Debian, Ubuntu, openSUSE, Mandriva, Mageia and Slackware, and can easily be installed on any Linux distribution that supports apt-get. More recently, our work on statistical parametric speech synthesis and algorithms for adaptation has been incorporated into the HTS toolkit (one of whose coordinators, Yamagishi, is from Edinburgh), which integrates with Festival. These toolkits are the most widely used open-source speech synthesis systems and have also provided the high-performing baseline systems for the international Blizzard evaluation of (commercial and research) speech synthesis, also organised by Edinburgh. |
URL | http://www.cstr.ed.ac.uk/projects/festival/ |
Title | The Festival Speech Synthesis system - version 2.4 |
Description | The de facto industry standard toolkit for developing text-to-speech systems. |
Type Of Technology | Software |
Year Produced | 2014 |
Open Source License? | Yes |
Impact | Festival technology has been used in commercial products from AT&T and led to the spinout company Rhetorical Systems. |
URL | http://www.cstr.ed.ac.uk/projects/festival/ |
Title | homeService protocol |
Description | The homeService protocol for developing human-machine interaction systems for users with speech and mobility impairments; an example of the virtuous-cycle approach. |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | Not at this point. |
Title | webASR |
Description | Publicly available web tool (www.webasr.org) with two showcases: media transcription (http://staffwww.dcs.shef.ac.uk/people/O.Saztorralba/youtube/) and alignment of lecture subtitles (http://staffwww.dcs.shef.ac.uk/people/O.Saztorralba/ted/) |
Type Of Technology | Webtool/Application |
Year Produced | 2015 |
Impact | New version of webASR (www.webasr.org) with new systems and demonstrators |
URL | http://www.webasr.org/ |
Company Name | Quorate Technology |
Description | Quorate Technology develops QSpeech, speech recognition software that records and transcribes audio, and analyses the data to make it searchable. |
Year Established | 2011 |
Impact | The company has a variety of commercial contracts and supports at least 10 full-time scientific positions. |
Website | http://www.quoratetechnology.com |
Company Name | Speak:Unique |
Description | Speak:Unique develops synthetic voices based on a person's own voice, allowing those losing their voice to use one of their own instead of a robotic-sounding one. |
Year Established | 2018 |
Impact | The company is still in an early phase. |
Website | https://www.speakunique.co.uk/ |
Description | A talk about the homeService experience/project |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Professional Practitioners |
Results and Impact | A talk about the homeService experience/project, given at Birmingham University (Feb 2016). |
Year(s) Of Engagement Activity | 2016 |
Description | A talk about the homeService experience/project to Medical Humanities Sheffield |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Professional Practitioners |
Results and Impact | A talk about the homeService experience/project to Medical Humanities Sheffield |
Year(s) Of Engagement Activity | 2015 |
Description | CATCH kickoff event |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Professional Practitioners |
Results and Impact | Introduction of the homeService project to an interested audience |
Year(s) Of Engagement Activity | 2013 |
Description | COST APPELE meeting |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | Engaging academics and industry across Europe |
Year(s) Of Engagement Activity | 2014 |
URL | http://aapele.eu/ |
Description | Data Science for Media Summit |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | The Summit was organised by the Alan Turing Institute to bring together researchers and media specialists to discuss future directions of research in data science for media |
Year(s) Of Engagement Activity | 2015 |
Description | How technology is changing speech and language therapy (Guardian on-line) |
Form Of Engagement Activity | A press release, press conference or response to a media enquiry/interview |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Media (as a channel to the public) |
Results and Impact | Article in Guardian which includes information about progress in speech recognition technology and specific mention of the EPSRC Natural Speech Technology project. |
Year(s) Of Engagement Activity | 2015 |
URL | http://www.theguardian.com/higher-education-network/2015/apr/15/how-technology-is-changing-speech-an... |
Description | Mobile University outreach event 2015 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Public/other audiences |
Results and Impact | Our research on speech recognition and machine translation was presented as part of a 'Mobile University' outreach event for the general public in Sheffield City Centre. |
Year(s) Of Engagement Activity | 2015 |
URL | http://mini.dcs.shef.ac.uk/mobileuni2015/ |
Description | Multi-Genre Broadcast (MGB) Challenge |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Organised and participated in the Multi-Genre Broadcast (MGB) challenge. The challenge culminated at ASRU 2015, which served as a meeting for the participants; more than 20 research groups took part. |
Year(s) Of Engagement Activity | 2015 |
URL | http://www.mgb-challenge.org/ |
Description | Talk at AIST Tsukuba |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Talk at AIST, Tsukuba, Japan - "Improving speech transcription using out-of-domain data" |
Year(s) Of Engagement Activity | 2012 |
Description | The future of Languages - more than just words |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | A public lecture at the Public Library in Amsterdam, followed by a debate and interactions with the audience. |
Year(s) Of Engagement Activity | 2012 |
URL | http://www.clubofamsterdam.com/event.asp?contentid=854 |
Description | Using speech synthesis to give everyone their own voice |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Public/other audiences |
Results and Impact | Discussions with the audience afterwards. Follow up emails from members of the public. |
Year(s) Of Engagement Activity | 2012 |
Description | seminar at INESC 2012 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Talk entitled "Assistive Speech Technology" at INESC-ID, Lisbon |
Year(s) Of Engagement Activity | 2012 |