Data mining word pronunciations

Lead Research Organisation: University of Edinburgh
Department Name: School of Informatics

Abstract

One of the most convenient and important ways for humans to interact with robots and other machines in the future will be through speech. Speech technology like automatic speech recognition (ASR) and text-to-speech (TTS) conversion is thus critical to realising this future. Such technology has indeed already begun to see mainstream deployment, for example in voice-activated personal assistants such as Siri, Cortana or Alexa, but this is just the beginning.

To work with many of the world's languages, speech technology relies heavily upon the availability of a lexicon: a list of words along with their pronunciation(s) and other useful information such as part-of-speech tags. English is a prime example. Whereas some languages (e.g. Czech or Spanish) have very regular spelling, English is inconsistent, with many arbitrary contradictions and exceptions. Machine learning models can be trained to predict word pronunciation, a task called grapheme-to-phoneme conversion (G2P). However, performance for English, for example, is only around 75-80% of words correct, depending on the data set and evaluation criteria used (Bisani & Ney (2008); Yao & Zweig (2015); Richmond et al. (2009)). The problem is especially acute for names and acronyms (which I include in the blanket term "word"). This difficulty means a lexicon is crucial to ensuring words are pronounced and recognised correctly.
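The "words correct" figure quoted above can be made concrete with a small sketch. It assumes one common criterion, namely that a word counts as correct only if the whole predicted phone sequence exactly matches one of the reference pronunciations; the cited studies may use different criteria. The function name, the phone symbols and the four-word lexicon are all hypothetical and serve only as illustration.

```python
from typing import Dict, List, Tuple

Pron = Tuple[str, ...]  # one pronunciation, as a sequence of phone symbols


def word_accuracy(predicted: Dict[str, Pron],
                  reference: Dict[str, List[Pron]]) -> float:
    """Fraction of reference words whose predicted pronunciation exactly
    matches any of the reference pronunciations listed for that word."""
    if not reference:
        raise ValueError("reference lexicon is empty")
    correct = sum(
        1 for word, prons in reference.items()
        if predicted.get(word) in prons
    )
    return correct / len(reference)


# Hypothetical reference lexicon (illustrative symbols, not a real phone set).
reference = {
    "cat": [("k", "a", "t")],
    "read": [("r", "ii", "d"), ("r", "e", "d")],  # two valid pronunciations
    "though": [("dh", "ou")],
    "colonel": [("k", "er", "n", "l")],
}

# Hypothetical G2P output: "though" is wrong, the rest match a reference.
predicted = {
    "cat": ("k", "a", "t"),
    "read": ("r", "e", "d"),
    "though": ("th", "ou", "g"),
    "colonel": ("k", "er", "n", "l"),
}

accuracy = word_accuracy(predicted, reference)
print(f"{accuracy:.0%}")  # prints "75%"
```

Under this criterion a single wrong phone makes the whole word count as an error, which is one reason word-level scores for English look low.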

Combilex is a lexicon for English that was created at CSTR. It has many advanced features for speech technology purposes, including rich lexical information, explicit grapheme-phoneme linking, morphological derivation capabilities, broad coverage and demonstrated high consistency (Richmond et al. (2010)). Importantly, it is also an accent-independent lexicon. Pronunciations are entered as baseform transcriptions using metaphone symbols, each of which can be considered a model of how a word will be pronounced in all accents of English simultaneously. Finite state transducers can then process these to give surface form transcriptions tailored to a given accent. Combilex has been commercially licensed by Google, Amazon, Samsung, and Microsoft, among many others.

Though expert-produced lexicons such as Combilex have been successful, they do have drawbacks. They are laborious to compile, making them costly even for languages like English (Combilex cost over £120k) and prohibitively expensive for the low-resourced languages of the world. Whereas neologisms are constantly being introduced by users of a language (e.g. "Brexit", "lidar"), expert-written lexicons struggle to keep pace. They also tend to be rather general and rudimentary models of pronunciation, not capturing the nuances of variation that linguists deal with routinely in relation to factors such as geographical location, domain of use, social setting and so on. To address these drawbacks, I propose to explore new data-driven techniques to help lexicographers create better lexicons with less time and effort.
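The idea of deriving accent-specific surface forms from a single accent-independent baseform can be illustrated with a toy sketch. This is not Combilex's implementation: a plain lookup table stands in for the finite state transducers, and the symbol "(r)", the accent names and the example transcription are hypothetical, chosen only to show the principle.

```python
from typing import Dict, List, Tuple

# Hypothetical accent rules. Each rule maps one baseform symbol to zero or
# more surface phones; symbols without a rule pass through unchanged.
ACCENT_RULES: Dict[str, Dict[str, Tuple[str, ...]]] = {
    "rhotic": {"(r)": ("r",)},    # the optional r is pronounced
    "non_rhotic": {"(r)": ()},    # the optional r is dropped
}


def to_surface(baseform: List[str], accent: str) -> List[str]:
    """Rewrite an accent-independent baseform into a surface form
    for the given accent by applying that accent's symbol rules."""
    rules = ACCENT_RULES[accent]
    surface: List[str] = []
    for symbol in baseform:
        surface.extend(rules.get(symbol, (symbol,)))
    return surface


# One hypothetical baseform entry covers both accents.
car = ["k", "aa", "(r)"]
rhotic = to_surface(car, "rhotic")          # ['k', 'aa', 'r']
non_rhotic = to_surface(car, "non_rhotic")  # ['k', 'aa']
print(rhotic, non_rhotic)
```

A real transducer would also handle context-dependent rules, which this symbol-by-symbol lookup cannot express.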

The aim of this research is to explore superior alternatives to hand-written lexicons, by applying data mining principles to "harvest" word pronunciations. New techniques and tools will be developed for trawling through vast quantities of audio/text data from sources such as the internet or the British National Corpus to continually enrich a lexicon with pronunciation data. I aim to make it far easier to i) create large lexical resources, both for English and other languages; ii) tailor these to particular regions and domains of use; and iii) keep these continually refreshed. There has been little previous work in this direction. Schlippe et al. (2014) addressed similar issues but used very different internet resources (i.e. Wiktionary) to those I propose. As far as I know, this will be among the very first attempts at harvesting lexicons at scale, accounting for broad pronunciation variation, from massive audio streams.

Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
ES/R500938/1                                      30/09/2017   29/09/2021
1939413             Studentship    ES/R500938/1   30/09/2017   29/06/2021   Jason Taylor