Statistical Machine Translation in constrained domains

Lead Research Organisation: University of Cambridge
Department Name: Engineering

Abstract

Statistical Machine Translation (SMT) automatically generates translations between human languages by analysing the characteristics of existing bilingual text. Good translation systems are becoming increasingly important as international communication grows, including on global social media.

In particular, I am interested in situations where the domain, or subject, of the source language is known or can be inferred. This might be the case when translating a speech on a particular topic, or when translating for a given set of readers.

This project is relevant to the EPSRC's Natural Language Processing (NLP) research area. Inferring the domain, and adapting to a particular domain or to changing domains, may require novel machine learning approaches from research areas related to Artificial Intelligence.

Publications


Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
EP/N509620/1      |              |              | 01/10/2016 | 30/09/2022 |
1750003           | Studentship  | EP/N509620/1 | 01/10/2016 | 30/09/2020 | Danielle Saunders
 
Description During the research funded by this grant I have been exploring how to fine-tune existing translation models to perform better on specific domains, or genres, of text. As an interim report, the key findings so far are:
- I have developed a technique for combining different sentence representations, such as syntax, in translation. The results suggest this area is not currently promising for further research.
- I have shown that translation models can be adapted quickly and effectively to new domains with minimal loss of performance on the original domain. This requires adaptive techniques both for training a translation model and for generating translations. I have developed these techniques, which allow further research on domain adaptation and combination.
- I have shown that fine-tuning is sensitive to the quantity and quality of data in the new domain, and developed a form of continued training that counters this sensitivity.
- I have applied these translation schemes to the problem of gender bias in translation datasets, which leads to incorrect, biased translations. My scheme strongly and efficiently reduces gender-inflection errors in translations. It also respects data privacy, as it does not require access to the original model's data.
Exploitation Route My findings might be used in the machine translation industry to generate custom translation models tailored to the data handled by individual customers. This is an area of great interest at present, with multiple companies developing tools for adaptive or custom machine translation.

In academia, my findings could be applied to other fields of Natural Language Processing which are also domain-dependent.
Sectors Digital/Communication/Information Technologies (including Software)

 
Description SDL plc is a company which takes generic translation engines and customises them to customer data and requirements. My findings and implementations for fast machine translation domain adaptation are used at SDL plc to build custom products for direct commercial sales.
First Year Of Impact 2018
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic

 
Title SGNMT 
Description SGNMT is an open-source framework for neural machine translation (NMT) and other sequence prediction tasks. The tool provides a flexible platform which allows pairing different NMT frameworks with various other models such as language models, length models, or bag2seq models. It supports rescoring both n-best lists and lattices. A wide variety of search strategies is available for complex decoding problems. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact The software has been used in multiple research investigations into machine translation, translation adaptation and grammatical error correction in the group, several of which have resulted in published work. The software has also been used for multiple final year MEng / MPhil projects in the Cambridge University Engineering Department. 
URL https://github.com/ucam-smt/sgnmt
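Illustration The n-best rescoring mentioned in the description can be pictured as a weighted sum of log-probabilities from several models. The following is a minimal sketch under stated assumptions, not SGNMT's actual API: the function name `rescore_nbest`, the model names `nmt` and `lm`, the weights and the scores are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# One n-best entry: (hypothesis text, {model name: log-probability}).
Hypothesis = Tuple[str, Dict[str, float]]


def rescore_nbest(nbest: List[Hypothesis],
                  weights: Dict[str, float]) -> List[Hypothesis]:
    """Sort hypotheses by a weighted sum of per-model log-probabilities.

    Hypothetical helper for illustration; it does not use or mirror
    any SGNMT function.
    """
    def combined(scores: Dict[str, float]) -> float:
        # Log-linear combination: only models named in `weights` contribute.
        return sum(weight * scores[name] for name, weight in weights.items())

    return sorted(nbest, key=lambda item: combined(item[1]), reverse=True)


# Made-up scores: the translation model slightly prefers the second
# hypothesis, while the language model clearly prefers the first.
nbest = [
    ("the house is small", {"nmt": -1.2, "lm": -4.0}),
    ("the home is small", {"nmt": -1.0, "lm": -6.0}),
]

best_combined = rescore_nbest(nbest, {"nmt": 1.0, "lm": 0.3})[0][0]
best_nmt_only = rescore_nbest(nbest, {"nmt": 1.0})[0][0]
print(best_combined)
print(best_nmt_only)
```

In this toy example, adding the language model score changes which hypothesis is ranked first, which is the effect that pairing a translation model with a second model is meant to achieve.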
 
Description Participation in general research outreach day 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact A collection of undergraduates, postgraduates, research fellows and staff associated with Clare College, Cambridge attended a day of research presentations directed at lay audiences, with Q&A.
Year(s) Of Engagement Activity 2019
URL http://www.clareity.co.uk/clareity-symposium-2019-programme/
 
Description Participation in language sciences symposium 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact Presented a research poster at a local symposium for language sciences. The event drew attendees from many fields in the social sciences and humanities who would otherwise have been unlikely to engage with the research. The poster prompted many interested questions.
Year(s) Of Engagement Activity 2018
 
Description Visit to discuss applying adaptive NMT to legal translation 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact A visit with a legal and financial translation company to advise on recent developments in my research area, including questions and discussion. The company fed back that our discussion affected its product research.
Year(s) Of Engagement Activity 2019