Statistical Machine Translation in constrained domains
Lead Research Organisation:
University of Cambridge
Department Name: Engineering
Abstract
Statistical Machine Translation (SMT) involves automatically generating translations between human languages by analysing the characteristics of existing bilingual text. Good translation systems are becoming increasingly important as international communication grows, including through global social media.
In particular, I am interested in situations where the domain, or subject, of the source language is known or can be inferred. This might be the case when translating a speech on a particular topic, or when translating for a given set of readers.
This project is relevant to the EPSRC's Natural Language Processing (NLP) research area. Inferring the domain, and adapting to a particular domain or to changing domains, may require novel machine learning approaches from Artificial Intelligence-related research areas.
Organisations
People | ORCID iD |
William Byrne (Primary Supervisor) | |
Danielle Saunders (Student) | |
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name |
---|---|---|---|---|---|
EP/N509620/1 | | | 01/10/2016 | 30/09/2022 | |
1750003 | Studentship | EP/N509620/1 | 01/10/2016 | 30/09/2020 | Danielle Saunders |
Description | During research funded by this grant I have been exploring the problem of fine-tuning existing translation models to perform better on specific domains, or genres, of text. As an interim report, the key findings so far are:
- I have developed a technique for combining different sentence representations, such as syntax, in translation. The results suggest that this area is not currently promising for further research.
- I have shown that translation models can be adapted quickly and effectively to new domains with minimal loss of performance on the original domain. This requires adaptive techniques both for training a translation model and for generating translations. I have developed these techniques, enabling further research on domain adaptation and combination.
- I have shown that fine-tuning is sensitive to the quantity and quality of data in the new domain, and have developed a form of continued training that is effective in countering this.
- I have applied these translation schemes to the problem of gender bias in translation datasets, which leads to incorrect, biased translations. My scheme strongly and efficiently reduces gender-inflection errors in translations. It also respects data privacy, as it does not require access to the original model's data. |
Exploitation Route | My findings might be used in the machine translation industry to generate custom translation models depending on the data handled by individual customers. This is an area of great interest at present, with multiple companies developing tools for adaptive or custom machine translation. In academia, my findings could be applied to other fields of Natural Language Processing which are also domain-dependent. |
Sectors | Digital/Communication/Information Technologies (including Software) |
Description | SDL plc is a company which takes generic translation engines and customises them to customer data and requirements. My findings and implementations for fast machine translation domain adaptation are used at SDL plc to build custom products for direct commercial sales. |
First Year Of Impact | 2018 |
Sector | Digital/Communication/Information Technologies (including Software) |
Impact Types | Economic |
Title | SGNMT |
Description | SGNMT is an open-source framework for neural machine translation (NMT) and other sequence prediction tasks. The tool provides a flexible platform which allows pairing different NMT frameworks with various other models such as language models, length models, or bag2seq models. It supports rescoring both n-best lists and lattices. A wide variety of search strategies is available for complex decoding problems. |
Type Of Technology | Software |
Year Produced | 2018 |
Open Source License? | Yes |
Impact | The software has been used in multiple research investigations into machine translation, translation adaptation and grammatical error correction in the group, several of which have resulted in published work. The software has also been used for multiple final year MEng / MPhil projects in the Cambridge University Engineering Department. |
URL | https://github.com/ucam-smt/sgnmt |
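The SGNMT description above mentions rescoring n-best lists by pairing a translation model with other models, such as language models or length models. The following is a minimal sketch of that general idea only: a weighted log-linear combination of scores over an n-best list. It does not use SGNMT's actual API. The function `rescore_nbest`, the toy scorers, the weights and the example hypotheses are all hypothetical and chosen purely for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Hypothesis = Sequence[str]
Scorer = Callable[[Hypothesis], float]  # returns a log-domain score

# Hypothetical toy bigram "language model": known bigrams are cheap,
# unknown bigrams are penalised. A real system would use a trained model.
KNOWN_BIGRAMS = {("the", "cat"), ("cat", "sat"), ("sat", "down")}


def toy_lm_score(hyp: Hypothesis) -> float:
    """Sum of per-bigram log-scores under the toy language model."""
    return sum(
        -0.5 if (prev, cur) in KNOWN_BIGRAMS else -3.0
        for prev, cur in zip(hyp, hyp[1:])
    )


def toy_length_score(hyp: Hypothesis, target_len: int = 3) -> float:
    """Penalise hypotheses whose length differs from an assumed target."""
    return -float(abs(len(hyp) - target_len))


def rescore_nbest(
    nbest: List[Tuple[Hypothesis, float]],
    scorers: Dict[str, Scorer],
    weights: Dict[str, float],
) -> List[Tuple[Hypothesis, float]]:
    """Add weighted auxiliary scores to each base score and sort best-first."""
    rescored = []
    for hyp, base_score in nbest:
        total = base_score
        for name, scorer in scorers.items():
            # Scorers without an explicit weight default to 1.0.
            total += weights.get(name, 1.0) * scorer(hyp)
        rescored.append((hyp, total))
    rescored.sort(key=lambda item: item[1], reverse=True)
    return rescored


# Illustrative n-best list: (tokens, translation-model log-score).
nbest = [
    (["cat", "the", "sat", "down"], -1.8),
    (["the", "cat", "sat"], -2.0),
]
scorers = {"lm": toy_lm_score, "length": toy_length_score}
weights = {"lm": 0.5, "length": 0.2}

ranked = rescore_nbest(nbest, scorers, weights)
best_hyp, best_score = ranked[0]
print(" ".join(best_hyp), best_score)
```

In this toy example the second hypothesis has the lower translation-model score, but it overtakes the first once the language-model and length scores are added. In practice the weights would be tuned on held-out data rather than set by hand.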
Description | Participation in general research outreach day |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Postgraduate students |
Results and Impact | A collection of undergraduates, postgraduates, research fellows and staff associated with Clare College, Cambridge attended a day of research presentations directed at lay audiences, with Q&A. |
Year(s) Of Engagement Activity | 2019 |
URL | http://www.clareity.co.uk/clareity-symposium-2019-programme/ |
Description | Participation in language sciences symposium |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Postgraduate students |
Results and Impact | Presented a research poster at a local symposium for language sciences. The event drew attendees from many fields in the social sciences and humanities who would otherwise have been unlikely to engage with the research. The poster prompted many interested questions. |
Year(s) Of Engagement Activity | 2018 |
Description | Visit to discuss applying adaptive NMT to legal translation |
Form Of Engagement Activity | Participation in an open day or visit at my research institution |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Professional Practitioners |
Results and Impact | A visit with a legal and financial translation company to advise on recent developments in my research area, including questions and discussion. The company's feedback indicated that the discussion affected its product research. |
Year(s) Of Engagement Activity | 2019 |