Word segmentation from noisy data with minimal supervision

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

In recent years, the field of natural language processing (NLP) has made great advances in a wide range of areas, such as machine translation, document summarization, and topic identification. However, much of this success is due to systems that are built using large quantities of human-annotated data in a supervised machine learning approach. This means that languages with fewer annotated resources (low-density languages) are left without much useful language technology. An important direction in NLP research is therefore to improve our ability to develop successful systems using as little annotated data as possible. Research on completely unsupervised systems is particularly interesting not only for its potential to broaden the reach of NLP technology, but also because it may shed light on the ways in which human infants manage to learn language with little or no explicit instruction.

We propose to focus on the particular problem of word segmentation, and to develop a new type of probabilistic model, the infinite noisy channel model, for solving this problem in settings where little or no annotated data is available. Word segmentation refers to the problem of identifying word boundaries in either text or speech. It arises in NLP systems for many Asian languages, where words are not separated by whitespace, and also for infants learning language, because most spoken words are not separated by pauses. Previous work on unsupervised word segmentation has assumed that every time a particular word occurs, it is realized in exactly the same way. However, this is not the case for infants learning language, since words are subject to phonetic variability and noise in pronunciation; nor is it always true in NLP, where the input text may contain errors, such as those produced by an optical character recognition (OCR) system. Our new model will address this shortcoming by simultaneously performing word segmentation and correction of noise and variability, recovering a sequence of de-noised words from the unsegmented noisy input.

We plan to develop two versions of our model. The first will be designed to correct for phonetic variability, and will be evaluated as a cognitive model of human language acquisition. With this model, we hope to gain insight into the computational mechanisms that allow infants to successfully extract words from noisy input, and in particular to show that the Bayesian inference techniques used in our model are a plausible explanation of infants' learning behavior. The second version will be designed to correct for errors resulting from optical character recognition, and will be evaluated as a combined word segmentation and error correction NLP application in several different languages. We hope to show that the model reduces the number of character errors in a document while also producing successful segmentations. We expect these improvements to be particularly pronounced in low-density language settings.
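To make the noisy channel idea concrete, the sketch below shows a toy version of the approach: a Viterbi search that jointly segments an unsegmented character string and corrects single-character errors. This is only an illustration, not the proposed infinite model (which is nonparametric and learns its lexicon from data); the fixed lexicon, the word probabilities, and the substitution-only channel are all assumptions made for the example.

import math

# Toy unigram lexicon: word -> probability (illustrative values only).
LEXICON = {"the": 0.3, "dog": 0.2, "cat": 0.2, "saw": 0.3}

# Channel model: each character survives intact with probability P_MATCH,
# or is replaced by another character (a substitution-only assumption).
P_MATCH = 0.95
P_SUBST = 0.05

def channel_logprob(word, observed):
    """Log P(observed characters | intended word); same length only."""
    if len(word) != len(observed):
        return float("-inf")
    return sum(math.log(P_MATCH if w == o else P_SUBST)
               for w, o in zip(word, observed))

def segment(text):
    """Viterbi search for the highest-probability de-noised segmentation."""
    n = len(text)
    best = [(float("-inf"), [])] * (n + 1)  # best (score, words) ending at each position
    best[0] = (0.0, [])
    for i in range(n):
        score, words = best[i]
        if score == float("-inf"):
            continue
        for word, p in LEXICON.items():
            j = i + len(word)
            if j <= n:
                s = score + math.log(p) + channel_logprob(word, text[i:j])
                if s > best[j][0]:
                    best[j] = (s, words + [word])
    return best[n][1]

# The OCR-style error "dig" is recovered as "dog" during segmentation:
print(segment("thedigsawthecat"))  # ['the', 'dog', 'saw', 'the', 'cat']

In the full model, the fixed lexicon would be replaced by a nonparametric prior over an unbounded vocabulary, and the toy channel by learned models of phonetic variability or OCR errors; the key point the sketch captures is that segmentation and error correction are scored jointly rather than in a pipeline.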

Planned Impact

The research proposed here has the potential to create impact through several different routes: the cognitive modeling component addresses fundamental questions about the nature of language acquisition, the development of new machine learning methodologies will advance the state of the art in unsupervised language processing in general, and the application to OCR post-processing will create improvements in a specific technology. Many of the potential impacts of these advances (especially in the area of language acquisition) have long time horizons and will best be realized by disseminating our work first to other academics, who will be able to develop it further and realize some of the benefits described below. Since our track record in dissemination to academics is detailed elsewhere, we do not discuss it further here.

As for more immediate knowledge transfer to non-academic parties (as might be appropriate for some of the NLP/OCR applications described below), Edinburgh has a strong infrastructure in place. This is centered around the £8.25m ProspeKT program, a five-year program of activities in Knowledge Transfer, Entrepreneurism, and Public Outreach within the School of Informatics. ProspeKT provides a dedicated commercialization arm to proactively engage with industry, as well as activities in Informatics to encourage entrepreneurship and company formation. The ProspeKT team will be able to help in identifying and exploiting any commercial opportunities arising from our work. In addition, the University provides services to help with patenting and other protection of intellectual property.

Some of the potential beneficiaries of our work include:

1. Language-impaired individuals, their families, and communities. Basic research into the mechanisms of language acquisition has the long-term potential to improve diagnosis and treatment of various language disorders.

2. Speakers of low-density languages. These individuals will have better access to language technology, which will improve business and government efficiency, and quality of life for private individuals.

3. Governments and businesses with interests in low-density language areas. Examples include businesses moving into new markets in linguistically diverse areas of the world, and governments desiring better local information for diplomatic and intelligence purposes.

4. Speakers of endangered languages and their communities. Unsupervised language technology has the potential to be useful in the documentation and analysis of endangered languages by field linguists. For example, the OCR correction system described here could be particularly useful for digitizing handwritten fieldnotes that may have been collected before the advent of portable computers. This kind of technological improvement would benefit communities hoping to preserve their linguistic heritage.

5. Digital libraries and their users. Document recognition software has already changed the way in which users access information in libraries, often allowing them to search complete texts from their own homes. However, automatic recognition of older documents with historical and cultural importance is far less accurate than recognition of modern texts. Improvements in this process could significantly decrease the amount of effort required to digitize further documents, thereby increasing their accessibility to both specialized audiences (e.g., libraries and historical societies) and the general public.

Publications

 
Description We have developed a model of early language acquisition covering the period when infants learn to segment the speech stream into individual words. Unlike previous models, our new model integrates the segmentation process with the process of learning about phonetic variability. The performance of this model matches what we know about infant segmentation more closely than previous models do, providing evidence that infants, like our model, learn about segmentation and phonetic processes simultaneously. We have also investigated the effects of non-linguistic contextual information on the word learning process, showing that this kind of information can be helpful for phonetic learning even if the learner does not have detailed knowledge of word meanings.
Exploitation Route Some of our early results served as inspiration for work developed during and following the 2012 JHU CLSP Workshop on Zero-Resource Speech Technologies and Models of Early Language Acquisition, a two-week hands-on workshop attended by around 25 researchers, which has led to several publications and ongoing collaborations. We expect that our more recent work on contextual information in phonetic learning will inspire follow-on behavioral experiments to test our hypotheses regarding early word representations and contextual information.
Sectors Digital/Communication/Information Technologies (including Software), Education, Healthcare

 
Description Understanding synergies in language acquisition through computational modeling
Amount £400,000 (GBP)
Organisation James S. McDonnell Foundation 
Sector Charity/Non Profit
Country United States
Start 09/2013 
End 08/2019
 
Title Pronunciation-varied Bernstein-Ratner corpus 
Description Dataset used in our ACL-12 and EMNLP-13 papers: the Bernstein-Ratner corpus used by Brent for word segmentation, with added pronunciation variation from the Buckeye corpus. 
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? No  
 
Description Feldman 
Organisation University of Maryland, College Park
Department Department of Linguistics
Country United States 
Sector Academic/University 
PI Contribution Postdoctoral and PI research time
Collaborator Contribution Consultation on aspects of the research and paper writing
Impact Papers: Elsner et al. 2013, Frank et al. 2014 (see publications list). Multi-disciplinary; the collaborator is a developmental linguist.
Start Year 2012
 
Title Beamseg 
Description Joint model of word segmentation and phonetic learning from our EMNLP-13 paper, including code (C++), analysis scripts (Python), and sample output. 
Type Of Technology Software 
Year Produced 2013