Bayesian Models of Grammar Induction and Translation

Lead Research Organisation: University of Oxford
Department Name: Computer Science

Abstract

The processes by which humans learn the rules that govern what is and is not a valid sentence in a language constitute an enduring theme of linguistic research. The development of computational models able to reproduce these processes holds great promise both for increasing our understanding of how children learn languages and for developing advanced language technologies for processing online data.

The fields of Computational Linguistics and Machine Learning seek to provide the technologies necessary to enable people to interact seamlessly with the vast quantities of multilingual text published each day on the World Wide Web. Core amongst these technologies are those that assign syntactic structure to text (parsing) and automatically translate between languages (machine translation). Traditionally, researchers have relied upon supervised machine learning techniques to build their systems, first annotating data by hand with the desired output of the system, then training the system to replicate and generalise from these annotations on new data. However, this process of hand annotation is both time consuming and expensive, and as a result such data exists only for dominant languages (e.g. English).

The research programme set out for this fellowship will provide a solution to the problem of obtaining syntactic analyses for large quantities of real world language data. The overarching aim of this project is to develop large scale and language independent algorithms for learning syntactic structure from unannotated text using techniques from non-parametric Bayesian probability. In tandem, a new syntactic model of machine translation will be developed to evaluate and validate these algorithms.

The project will consist of two major components:
1. Develop scalable models of unsupervised grammar induction suitable for use with languages exhibiting a wide range of morphological and syntactic phenomena.
2. Develop a syntactic model of translation based upon the analyses produced by the unsupervised model. This system will be composed of a source reordering model which reorders the input conditioned on the induced syntactic structure, and a phrasal translation model which maps the reordered source to a translation.
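
The two-stage design in component 2 can be illustrated with a minimal sketch. Everything in it is a hypothetical toy and not the project's system: the `Node` class, the hand-written `REORDER_RULES` and `PHRASE_TABLE`, and the English-to-romanised-Japanese example are assumptions made for illustration. In the proposed work the syntactic structure would be induced from unannotated text and both models would be learned from data, not written by hand.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Node:
    """A syntax tree node; children are sub-nodes or source words (hypothetical)."""
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)


# Hypothetical reordering rules: node label -> permutation of its children.
# Here a verb phrase is flipped from verb-object to object-verb order.
REORDER_RULES: Dict[str, Tuple[int, ...]] = {"VP": (1, 0)}

# Hypothetical phrase table: source phrase -> target phrase (toy entries).
PHRASE_TABLE: Dict[Tuple[str, ...], List[str]] = {
    ("she",): ["kanojo", "wa"],
    ("the", "book"): ["hon", "o"],
    ("reads",): ["yomu"],
}


def reorder(node: Union[Node, str]) -> List[str]:
    """Stage 1: emit source words, permuting children according to the tree."""
    if isinstance(node, str):
        return [node]
    identity = tuple(range(len(node.children)))
    order = REORDER_RULES.get(node.label, identity)
    if len(order) != len(node.children):
        order = identity  # rule does not fit this node; keep original order
    words: List[str] = []
    for i in order:
        words.extend(reorder(node.children[i]))
    return words


def translate(words: List[str], max_len: int = 2) -> List[str]:
    """Stage 2: greedy longest-match phrase lookup, left to right.

    Words with no phrase-table entry are copied through unchanged.
    """
    out: List[str] = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in PHRASE_TABLE:
                out.extend(PHRASE_TABLE[phrase])
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return out


# Toy source sentence "she reads the book" with a hand-built tree.
tree = Node("S", [
    Node("NP", ["she"]),
    Node("VP", [Node("V", ["reads"]), Node("NP", ["the", "book"])]),
])

reordered = reorder(tree)
translated = translate(reordered)
```

Separating the stages means the phrasal model only has to translate monotonically, since long-distance word order differences have already been handled using the syntactic structure.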

The specific scientific contributions of this project are:
1. The first accurate large scale grammar induction algorithms applicable to a wide range of language processing tasks.
2. New advanced machine learning algorithms for latent variable induction and approximate inference within Bayesian non-parametric models.
3. An investigation of the cognitive implications of the developed grammar induction algorithms, which are considerably more powerful than those previously used for language acquisition simulations within computational cognitive science.
4. An extrinsic evaluation of the induced grammars within a novel machine translation system.
5. A state-of-the-art open source machine translation system capable of producing high quality translations for a much larger range of languages than those handled by current systems.

The success of this research programme will have wide ranging impacts beyond the core contribution of a large scale syntactic induction system, from advanced new algorithms for machine learning and machine translation to a powerful new tool for simulating child language acquisition within cognitive science.
The aims of this project are adventurous, but the contribution of an effective and scalable model for grammar induction will be transformative for a wide range of text processing applications.

Planned Impact

The long term impacts of the proposed research programme span both the scientific and commercial spheres. The availability of computational models capable of automatically learning the syntactic structure of human languages will allow us to build sophisticated language technologies with minimal supervision while also furthering our understanding of human language acquisition.

Current language technologies have been extensively adopted but are limited by the difficulties involved in developing systems for new languages or domains. The wide availability of high quality unsupervised language processing technologies that can be affordably built from raw language data holds extraordinary promise for reducing communication barriers in many sectors of the community. The research funded by this proposal will also have a significant impact on a range of related research fields such as Machine Translation, Information Extraction, Cognitive Science, and Machine Learning.

Commercial:
Britain's economic strength is founded on its origins as a trading nation, and the development of high quality language technologies holds the potential to greatly strengthen the competitiveness of British companies doing business in a global online commercial environment. Leading international technology companies, such as Google, Microsoft and IBM, rapidly absorb the latest advances in research to improve their online information processing and translation systems, advancing a revolution in the ability of individuals and companies to communicate across language divides on the web. Existing commercial language technology systems achieve acceptable performance for a limited number of closely related languages, typically European languages. The development of the algorithms described in this proposal will broaden these horizons by allowing systems to be built quickly for any language for which raw textual data is available.

Public:
The British people form a multicultural and multilingual society with a continuing need for high quality language services. Government departments are large producers of multilingual information in their role of communicating with the public. While much progress has been made in automatically extracting information from and translating between English and some European languages, other languages of great significance to British society, such as Hindi and Urdu, lack the high quality annotated data required to build the current generation of supervised systems. The proposed research will address such limitations by designing machine learning algorithms that learn the structure of languages directly from widely available raw, unannotated data.

Research Community:
In the wider Computational Linguistics research community, the cutting edge of research is focused on integrating sophisticated linguistic resources, such as treebank-trained parsers, into translation systems. While such systems hold great promise for translations involving English, the high expense (currently millions of pounds) of creating such resources for other languages is a barrier to such technologies being widely applicable to all the languages of the world.
The successful execution and dissemination of the proposed research will serve to demonstrate the potential for unsupervised systems to realise similar performance gains without the need for language specific resources.
In addition to contributions specific to Computational Linguistics, the algorithms developed by this research will push the boundaries of current structured Machine Learning technologies, providing general algorithms which will have wide applicability to research in other areas of artificial intelligence, such as the discovery of structure in images, genetic sequences etc.

Publications

Hermann K.M. (2014) Multilingual distributed representations without word alignment in 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings

Hermann K.M. (2013) Multilingual distributed representations without word alignment in arXiv e-prints