Word segmentation from noisy data with minimal supervision

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

In recent years, the field of natural language processing (NLP) has made great advances in a wide range of areas, such as machine translation, document summarization, and topic identification. However, much of this success is due to systems that are built using large quantities of human-annotated data in a supervised machine learning approach. This means that languages with fewer annotated resources (low-density languages) are left without much useful language technology. An important direction in NLP research is therefore to improve our ability to develop successful systems using as little annotated data as possible. Research on completely unsupervised systems is particularly interesting not only for its potential to broaden the reach of NLP technology, but also because it may shed light on the ways in which human infants manage to learn language with little or no explicit instruction.

We propose to focus on the particular problem of word segmentation, and to develop a new type of probabilistic model, the infinite noisy channel model, for solving this problem in settings where little or no annotated data is available. Word segmentation refers to the problem of identifying word boundaries in either text or speech. It arises in NLP systems for many Asian languages, where words are not separated by whitespace, and also for infants learning language, because most spoken words are not separated by pauses. Previous work on unsupervised word segmentation has assumed that every time a particular word occurs, it is realized in exactly the same way. However, this is not the case for infants learning language, since words are subject to phonetic variability and noise in pronunciation; nor is it always true in NLP, where the input text may contain errors, such as those produced by an optical character recognition (OCR) system. Our new model will address this shortcoming by simultaneously performing word segmentation and correction of noise and variability, recovering a sequence of de-noised words from the unsegmented noisy input.

We plan to develop two versions of our model. The first will be designed to correct for phonetic variability, and will be evaluated as a cognitive model of human language acquisition. With this model, we hope to gain insight into the computational mechanisms that allow infants to successfully extract words from noisy input, and in particular to show that the Bayesian inference techniques used in our model are a plausible explanation of infants' learning behavior. The second version will be designed to correct for errors resulting from optical character recognition, and will be evaluated as a combined word segmentation and error correction NLP application in several different languages. We hope to show that the model reduces the number of character errors in a document while also producing successful segmentations. We expect these improvements to be particularly pronounced in low-density language settings.
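To make the noisy channel idea concrete, the sketch below shows a toy version of the approach: a Viterbi search that jointly segments an unsegmented character string and corrects single-character errors. This is only an illustration, not the proposed infinite model (which is nonparametric and learns its lexicon from data); the fixed lexicon, the word probabilities, and the substitution-only channel are all assumptions made for the example.

import math

# Toy unigram lexicon: word -> probability (illustrative values only).
LEXICON = {"the": 0.3, "dog": 0.2, "cat": 0.2, "saw": 0.3}

# Channel model: each character survives intact with probability P_MATCH,
# or is replaced by another character (a substitution-only assumption).
P_MATCH = 0.95
P_SUBST = 0.05

def channel_logprob(word, observed):
    """Log P(observed characters | intended word); same length only."""
    if len(word) != len(observed):
        return float("-inf")
    return sum(math.log(P_MATCH if w == o else P_SUBST)
               for w, o in zip(word, observed))

def segment(text):
    """Viterbi search for the highest-probability de-noised segmentation."""
    n = len(text)
    best = [(float("-inf"), [])] * (n + 1)  # best (score, words) ending at each position
    best[0] = (0.0, [])
    for i in range(n):
        score, words = best[i]
        if score == float("-inf"):
            continue
        for word, p in LEXICON.items():
            j = i + len(word)
            if j <= n:
                s = score + math.log(p) + channel_logprob(word, text[i:j])
                if s > best[j][0]:
                    best[j] = (s, words + [word])
    return best[n][1]

# The OCR-style error "dig" is recovered as "dog" during segmentation:
print(segment("thedigsawthecat"))  # ['the', 'dog', 'saw', 'the', 'cat']

In the full model, the fixed lexicon would be replaced by a nonparametric prior over an unbounded vocabulary, and the toy channel by learned models of phonetic variability or OCR errors; the key point the sketch captures is that segmentation and error correction are scored jointly rather than in a pipeline.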

Planned Impact

The research proposed here has the potential to create impact through several different routes: the cognitive modeling component addresses fundamental questions about the nature of language acquisition, the development of new machine learning methodologies will advance the state of the art in unsupervised language processing in general, and the application to OCR post-processing will create improvements in a specific technology. Many of the potential impacts of these advances (especially in the area of language acquisition) have long time horizons and will best be realized by disseminating our work first to other academics, who will be able to develop it further and realize some of the benefits described below. Since our track record in dissemination to academics is detailed elsewhere, we do not discuss it further here.

As for more immediate knowledge transfer to non-academic parties (as might be appropriate for some of the NLP/OCR applications described below), Edinburgh has a strong infrastructure in place. This is centered around the £8.25m ProspeKT program, a five-year program of activities in Knowledge Transfer, Entrepreneurism, and Public Outreach within the School of Informatics. ProspeKT provides a dedicated commercialization arm to proactively engage with industry, as well as activities in Informatics to encourage entrepreneurship and company formation. The ProspeKT team will be able to help in identifying and exploiting any commercial opportunities arising from our work. In addition, the University provides services to help with patenting and other protection of intellectual property.

Some of the potential beneficiaries of our work include:

1. Language-impaired individuals, their families, and communities. Basic research into the mechanisms of language acquisition has the long-term potential to improve diagnosis and treatment of various language disorders.

2. Speakers of low-density languages. These individuals will have better access to language technology, which will improve business and government efficiency, and quality of life for private individuals.

3. Governments and businesses with interests in low-density language areas. Examples include businesses moving into new markets in linguistically diverse areas of the world, and governments desiring better local information for diplomatic and intelligence purposes.

4. Speakers of endangered languages and their communities. Unsupervised language technology has the potential to be useful in the documentation and analysis of endangered languages by field linguists. For example, the OCR correction system described here could be particularly useful for digitizing handwritten fieldnotes that may have been collected before the advent of portable computers. This kind of technological improvement would benefit communities hoping to preserve their linguistic heritage.

5. Digital libraries and their users. Document recognition software has already changed the way in which users access information in libraries, often allowing them to search complete texts from their own homes. However, automatic recognition of older documents with historical and cultural importance is far less accurate than recognition of modern texts. Improvements in this process could significantly decrease the amount of effort required to digitize further documents, thereby increasing their accessibility to both specialized audiences (e.g., libraries and historical societies) and the general public.

Publications

 
Description We have developed a model of early language acquisition covering the period when infants learn to segment the speech stream into individual words. Unlike previous models, our new model integrates the segmentation process with the process of learning about phonetic variability. The performance of this model matches what we know about infant segmentation more closely than previous models do, providing evidence that infants, like our model, learn about segmentation and phonetic processes simultaneously. We have also investigated the effects of non-linguistic contextual information on the word learning process, showing that this kind of information can be helpful for phonetic learning even if the learner does not have detailed knowledge of word meanings.
Exploitation Route Some of our early results served as inspiration for work developed during and following the 2012 JHU CLSP Workshop on Zero-Resource Speech Technologies and Models of Early Language Acquisition, a two-week hands-on workshop attended by around 25 researchers, which has led to several publications and ongoing collaborations. We expect that our more recent work on contextual information in phonetic learning will inspire follow-on behavioral experiments to test our hypotheses regarding early word representations and contextual information.
Sectors Digital/Communication/Information Technologies (including Software), Education, Healthcare

 
Description Understanding synergies in language acquisition through computational modeling
Amount £400,000 (GBP)
Organisation James S. McDonnell Foundation 
Sector Charity/Non Profit
Country United States
Start 09/2013 
End 08/2019
 
Title Pronunciation-varied Bernstein-Ratner corpus 
Description Dataset used in our ACL-12 and EMNLP-13 papers: the Bernstein-Ratner corpus used by Brent for word segmentation, with added pronunciation variation from the Buckeye corpus. 
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? No  
 
Description Feldman 
Organisation University of Maryland, College Park
Department Department of Linguistics
Country United States 
Sector Academic/University 
PI Contribution Postdoctoral and PI research time
Collaborator Contribution Consultation on aspects of the research and paper writing
Impact Papers: Elsner et al. 2013, Frank et al. 2014 (see publications list). Multi-disciplinary; the collaborator is a developmental linguist.
Start Year 2012
 
Title Beamseg 
Description Joint model of word segmentation and phonetic learning from our EMNLP-13 paper, including code (C++), analysis scripts (Python), and sample output. 
Type Of Technology Software 
Year Produced 2013