Bayesian Synchronous Grammar Induction

Lead Research Organisation: University of Oxford
Department Name: Computer Science

Abstract

Statistical Machine Translation (SMT) is the technology that allows computers to learn to translate between human languages (English, French, Chinese, etc.) by being shown large numbers of example translations. This is the technology that drives popular online translation tools such as those provided by Google and Bing (Microsoft).

The last decade of research in SMT has seen rapid progress, as small-scale research systems have matured into large commercial products and popular online tools. Unfortunately, the success of SMT has not been uniform; current state-of-the-art translation output varies markedly in quality depending on the languages being translated. Language pairs that are closely related (e.g. English and French) can be translated with a high degree of precision, while for distant pairs (e.g. English and Chinese) the result is far from acceptable. This effect is clearly discernible when comparing the state of the art for two well-studied language pairs: Arabic-English and Chinese-English. While the quality of Arabic-English translation could be described as remarkable, translating Chinese into English often results in unreadable output. Clearly SMT has a long way to go before being usable across a large range of languages.

It has been tempting to argue that SMT's current limitations can be overcome simply by increasing the amount of data on which the systems are trained. However, large-scale evaluation campaigns for Chinese-English translation, with ever-increasing model sizes, have not yielded the hoped-for gains. The failure to adequately translate between languages such as Chinese and English can be attributed to two significant shortcomings of current translation models:

1. an inability to model large changes in word order between input and output languages (referred to as reordering);
2. no reliable mechanism for directly learning phrasal (non-word based) translation units, a significant issue for non-segmenting languages (languages such as Chinese which don't use spaces to separate words) and languages with complex morphology (e.g. German).

While a significant amount of research effort is currently being applied to tackling these issues, the proposed solutions are limited by focusing on more expressive models for producing translations rather than addressing the issue of how the translation units are learnt in the first place. In this research proposal I argue that both the fundamental structure and estimation methods of SMT models must change. By recasting the problem of learning translation models as synchronous grammar induction I aim to build models capable of handling complex translation phenomena, bringing us closer to the goal of readily available translation between all the world's languages. I propose to go beyond current research on inducing statistical translation models by using non-parametric Bayesian methods to directly learn a state-of-the-art synchronous grammar translation model from parallel sentences.

This research will have the following impacts:

1. At present, synchronous grammars for translation are learnt from non-hierarchical word alignment models, losing much of the benefit of the grammar's ability to represent difficult translation phenomena not captured in the alignments. By simultaneously learning the alignments and the grammar in one model, the full power of these hierarchical models will be unlocked.
2. This research will advance the state of the art for learning complex structured models within a non-parametric formulation, an important contribution for both machine translation and many other areas of machine learning.

Planned Impact

The wide availability of high-quality machine translation (MT) holds extraordinary promise for reducing communication barriers in many sectors of the community. In its 2009 Strategy for American Innovation report, the Executive Office of the President of the USA recognised this impact by identifying the development of such technologies as a grand challenge of the 21st century, calling for: "Automatic, highly accurate and real-time translation between the major languages of the world - greatly lowering the barriers to international commerce and collaboration."

The research funded by this proposal will have a significant impact on the state of the art for MT by:

* increasing the acceptability of translations produced by current systems, by improving the quality of the translation rules which drive the underlying processing;
* broadening the range of languages for which high-quality systems can be built, by allowing accurate translation grammars to be extracted for languages with complex morphology and large differences in word order, without reliance on expensive language-specific resources.

Below we sketch the specific impact of the proposed research across the public and commercial sectors:

Commercial: Britain's economic strength is founded on its origins as a trading nation, and the development of high-quality translation systems holds the potential to greatly strengthen the competitiveness of British companies doing business in a global online commercial environment. Leading international technology companies, such as Google, Microsoft and IBM, rapidly absorb the latest advances in MT research to improve their online translation systems, advancing a revolution in the ability of individuals and companies to communicate across language divides on the web. Existing commercial translation systems achieve acceptable performance for a limited number of closely related language pairs, typically European languages. The development of the algorithms described in this proposal will broaden these horizons by leading to translation systems particularly suited to translating the languages of emerging economic powers such as India and China. In addition, the establishment of a world-leading machine translation research group at the University of Oxford will give local companies direct access to state-of-the-art MT research.

Public: The British people form a multicultural and multilingual society with a continuing need for high-quality translation services. Government departments are large producers of multilingual information in their role of communicating with the public. While much progress has been made in automatically translating between English and European languages, other languages of great significance to British society, such as Hindi and Urdu, exhibit linguistic phenomena beyond the capabilities of current systems. The proposed research will address such limitations by designing machine learning algorithms to directly model such phenomena.
 
Description Google
Amount £43,000 (GBP)
Organisation Google 
Sector Private
Country United States
Start 06/2012 
End 06/2013