Improving Target Language Fluency in Statistical Machine Translation
Lead Research Organisation:
University of Cambridge
Department Name: Engineering
Abstract
Recent years have seen great improvement in the quality of statistical machine translation (SMT). Automatic translation has benefitted from increasing amounts of monolingual and translated data, from advancements in core modelling algorithms, and from a growing understanding of how best to integrate automatic translation into large-scale language processing systems. Despite these improvements, even the best SMT output is rarely of human quality. Any casual inspection of MT output will quickly find syntactic and semantic errors that only a machine would make. New modelling techniques, capable of extracting the best possible models from all available data, are needed.
This proposal aims to overcome one of the technical barriers to delivering 'human quality' statistical machine translation (SMT): the production of grammatical output. We propose here to use multiple grammars in SMT. One grammar is focused on translation of the source language, as in current practice. The second grammar is focused on production, with the aim of producing fluent and grammatical sentences in the target language. We will develop a decoding framework in which translation and production are closely linked but independent processes driven by these two grammars. Our systems will be based on state-of-the-art syntactic SMT, and our aim will be to dramatically improve the fluency of the translation output, particularly in situations where the original source language text is noisy and difficult to translate fluently.
This work will be of value to UK industry. The UK translation and interpretation market was estimated at EURO 290M - EURO 434M in 2009, and UK localisation and language service providers are strong competitors in the worldwide language industry, forecast to grow to EURO 16B by 2015. Reducing the cost of high-quality translation is a concern for this industry which we will address directly, in that improving target language fluency is a key factor in translation post-editing efficiency.
In academia, SMT systems are now used to build systems incorporating speech recognition, speech synthesis, and dialogue systems. Edinburgh University, Heriot-Watt University, Oxford University, and Sheffield University are among the universities with groups working on these problems. Our project will enable SMT researchers to apply their expertise in translation grammar induction, large-scale language modelling, and parameterisation to target language production.
Motivated by these needs, our research hypotheses are that: (1) modelling techniques from syntax-based SMT can be used to build stochastic production systems; (2) production quality can be improved using 'Big Data' and machine learning statistical modelling techniques; and (3) target language production systems can be integrated into syntax-based statistical machine translation systems using risk-based decoding procedures, yielding improvements in translation quality, robustness, and fluency. The novelty in this proposal is in: (1) the use of separate grammars for syntax-based statistical machine translation, one grammar for translation and a second for production; (2) coupling them into a risk-based consensus decoding procedure; (3) incorporation of phrase-based production grammars and search procedures; (4) an explicit focus on fluency.
Our research will yield new models and algorithms in the form of open source software and systems. We take the view that the best pathway to economic impact for this type of research is by: publishing research results; releasing software and data under generous Open Source licenses for unconstrained use by industry; and by training students and PDRAs who can take their skills and knowledge from the university to industry. We believe this is the broadest and surest way to enhance the research capacity, knowledge and skills of businesses and organisations. All results of this research project will be distributed in the public domain.
Planned Impact
Our pathways to impact will follow our close links with the UK language technologies industry; with international sponsored research programmes; our participation in the Cambridge Language Sciences Initiative; and our involvement in undergraduate and post-graduate teaching in the Department of Engineering and Computer Laboratory at the University of Cambridge. Our impact activities are:
1. Publication, Release of Open Source Tools and Data, Participation in International Evaluations
Our main focus is academic impact in the form of traditional peer-reviewed publication. However, the machine translation and natural language processing (NLP) research communities rely heavily on shared experimental infrastructure. Substantial research effort within the field is devoted to: distribution of open source software; creation of common data sets to train and evaluate systems; workshops and short courses for research students; and participation in international competitions evaluated by impartial third-party judges. Publication impact is greatly enhanced when research results are used in a good evaluation system and when the software and data underlying the publication are released with accompanying tutorial documentation. This enables others in the field to quickly replicate and assess the published results. We will participate in all these activities.
2. Involvement of Undergraduates, MPhil Students, and PhD Students in Sponsored Research
- PhD students in the Cambridge Engineering Department and the Computer Laboratory will be offered opportunities to work on the project.
- Students on the Cambridge Advanced Computer Science MPhil are given a project option. We will offer suitably scoped research projects based on topics in this proposal. These will be closely tied to the main themes of this proposal, with suitable resources and a limited agenda so that students can finish successfully within their course.
- Cambridge Engineering students are required to do a Fourth Year Project. We will offer projects in Information Engineering to interest students. Projects are scoped for the skills and abilities of students. We offer software engineering projects, e.g. iOS/Android applications for mobile interaction with server-based SMT systems. Research-minded students can undertake projects on specific technologies or translation problems.
3. Cambridge Language Sciences Initiative
Cambridge Language Sciences was established in mid-2012 as a University-level Strategic Research Initiative. These Initiatives are intended to influence national and international research, policy and funding agendas; to strengthen internal cross-disciplinary research collaborations; and to provide a platform for large-scale funding applications, recruitments and international research partnerships. Human Language Technologies has been named as one of the key Research Themes within the initiative. There are regular seminars, workshops, and discussion groups, which are widely attended, and to which we will contribute research results from this project.
4. Interaction with Industry
We take the view that the best pathway to economic impact for this type of research is by: publishing research results; releasing software and data under generous Open Source licenses for unconstrained use by industry; and by training students and PDRAs who can take their skills and knowledge from the university to industry. All results of this research project will be distributed in the public domain. This is the best way to ensure they can be easily used by industry. We have two additional, specific paths to commercialisation and exploitation of scientific knowledge:
- Close ties to Language Weaver / SDL plc, a UK language technologies and language services provider
- Cambridge Enterprise Limited is a wholly owned subsidiary of the University of Cambridge responsible for the commercialisation of technology arising from University departments.
People |
William Byrne (Principal Investigator) |
Publications
Elliott, D
(2015)
Multilingual Image Description With Neural Sequence Models
Hasler E
(2017)
A Comparison of Neural Models for Word Ordering
Hasler E
(2017)
Source sentence simplification for statistical machine translation
in Computer Speech & Language
Hasler E.
(2018)
Neural machine translation decoding with terminology constraints
in NAACL HLT 2018 - 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference
Description | Language technology has recently been revolutionized by the development of deep learning / neural networks, which are particularly noted for their fluency in producing text in problems such as machine translation. The proposal for this project pre-dated these developments, and our research plans were changed in response. We developed minimum Bayes risk decoding procedures that allow a loose coupling of multiple translation systems so that individual systems can work independently to produce a consensus translation (see https://arxiv.org/abs/1612.03791). In this approach the systems `vote` for hypotheses at the word and phrase level, based on their confidence in their hypotheses. We found the (very) surprising result that even very good `neural` systems could still be improved by this form of loose combination with previous, syntactic machine translation systems. Using these approaches we submitted machine translation systems (http://www.aclweb.org/anthology/W18-6427) to the international competition run by the Third Conference on Machine Translation (WMT18) (http://www.statmt.org/wmt18/). Our systems were not at the absolute top in terms of BLEU (a common metric used to measure translation quality), but we were in 2nd place in English<->German, and within 2 BLEU of the best system in Chinese->English.
Although these UCAM systems were not at the absolute top in BLEU score, they were singled out in subsequent independent manual assessments of the translation systems, which found that the UCAM systems (http://www.statmt.org/wmt18/pdf/WMT063.pdf):
- 'achieves the highest accuracies in most linguistic phenomena, as compared to the rest of the systems'
- 'achieves a significantly better performance than all other systems concerning verb tense/aspect/mood, reaching a 86.9% accuracy'
- these results 'may be explained by the fact that this system differs from others, since it combines several different neural models together with a phrase-based SMT system in an syntactic MBR-based scheme.'
In the above, we have found that formally syntactic approaches to machine translation can still be useful alongside more powerful neural machine translation, provided that decoding strategies are used that make the best of the strengths of these two very distinct modelling approaches.
Other findings of the project (and partially supported work):
- Text summarization techniques can be used to `simplify` source text to improve translation
- Neural sequence to sequence models can be used for multilingual captioning of images
- Neural sequence to sequence models can be used in automated lyric annotation to annotate rap lyrics
- An `Operation Sequence Model for Neural Machine Translation` has been developed that exploits the ability of NMT sequence-to-sequence models to generate a syntactic representation of their output. We have developed a `grammar` that is not a formal syntactic grammar that describes the target language, but is instead a (possible) description of how the system produces its translation output from its source. This syntactic description 'accompanies' the translation, although the translation is not constrained by the syntax of the explanation. |
Exploitation Route | Our findings could be useful for any applications in which text (or possibly speech) is presented directly to users in conditions where fluency is needed. We have worked with colleagues at the Cambridge Department of Computer Science and Technology to develop grammatical error correction systems for improving the writing of second language learners of English. The MBR decoding technologies developed within the project have been used directly by colleagues at SDL plc in their commercial machine translation systems (http://www.aclweb.org/anthology/N18-3013). My colleagues and I are particularly excited by our `Operation Sequence` approach to explainable neural machine translation. It is very possible that this approach can be applied more broadly to build neural sequence to sequence systems that can provide their own explanation as to how they generate their output. |
Sectors | Digital/Communication/Information Technologies (including Software) |
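As a rough illustration of the word- and phrase-level `voting` described in the findings above, the following Python sketch picks a consensus translation from the pooled n-best lists of several systems. It is a simplified n-best approximation written for this summary, not the lattice-based procedure of the cited paper: the function names, the `word_penalty` parameter, and the toy hypotheses and probabilities are all hypothetical.

```python
from collections import defaultdict


def ngrams(tokens, max_order=2):
    """Return the set of n-grams (as tuples) up to max_order in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_order + 1)
        for i in range(len(tokens) - n + 1)
    }


def ngram_posteriors(nbest, max_order=2):
    """Probability, under one system, that each n-gram appears in its output.

    nbest is a list of (hypothesis_string, probability) pairs; the
    probabilities are assumed to sum to 1 for that system.
    """
    posteriors = defaultdict(float)
    for hyp, prob in nbest:
        for gram in ngrams(hyp.split(), max_order):
            posteriors[gram] += prob
    return posteriors


def mbr_consensus(systems, max_order=2, word_penalty=0.0):
    """Return the pooled hypothesis that collects the largest n-gram vote.

    Each system votes for an n-gram with its posterior probability; votes
    are summed across systems. word_penalty (hypothetical) can offset the
    bias of a summed score towards longer hypotheses.
    """
    votes = defaultdict(float)
    for nbest in systems:
        for gram, p in ngram_posteriors(nbest, max_order).items():
            votes[gram] += p

    def score(hyp):
        tokens = hyp.split()
        gain = sum(votes[g] for g in ngrams(tokens, max_order))
        return gain - word_penalty * len(tokens)

    candidates = sorted({hyp for nbest in systems for hyp, _ in nbest})
    return max(candidates, key=score)


# Toy n-best lists standing in for a neural and a syntactic system.
nmt_nbest = [("the cat sat on the mat", 0.6), ("a cat sat on the mat", 0.4)]
smt_nbest = [("the cat sat on a mat", 0.7), ("the cat sat on the mat", 0.3)]
print(mbr_consensus([nmt_nbest, smt_nbest]))
```

In this toy case the selected hypothesis is the one whose words and bigrams are best supported across both systems, even though it is not the syntactic system's own first choice.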
Description | The neural machine translation decoding strategies described in the papers on minimum Bayes risk decoding have been incorporated into translation products by SDL plc. These techniques deliver the fluency of neural machine translation with the robustness of previous syntactic approaches to translation. In addition to being elegant and efficient, the combination of quality and robustness is extremely valuable commercially. |
First Year Of Impact | 2017 |
Sector | Digital/Communication/Information Technologies (including Software) |
Impact Types | Economic |
Description | Courses for the Cambridge MPhil in Machine Learning and Machine Intelligence: (1) Probabilistic Automata and (2) Machine Translation |
Geographic Reach | Multiple continents/international |
Policy Influence Type | Influenced training of practitioners or researchers |
Impact | Students graduating from the Cambridge MPhil in Machine Learning and Machine Intelligence (formerly the MPhil in Machine Learning, Speech, and Language Technology) are in great demand by the UK-based language technology industry. They have skills in speech recognition, speech synthesis, dialogue systems, and machine translation, which are the technologies that make possible interactive computing systems such as Siri/Alexa/Cortana. Technology developed within this research project has been used to create practical coursework (that formed most of the assessed mark for the courses) for one course in Probabilistic Automata and a second course in Machine Translation. |
URL | http://www.mlmi.eng.cam.ac.uk/ |
Description | Giving Voice to Digital Democracies |
Amount | £725,000 (GBP) |
Organisation | Humanities and Social Change International Foundation |
Sector | Private |
Country | Germany |
Start | 09/2018 |
End | 10/2022 |
Title | Corpus of original and simplified English sentences |
Description | In order to investigate the potential of input simplification for automatic translation systems, we created a corpus of 3000 pairs of original and simplified English sentences, using the crowdsourcing platform Crowdflower. We chose to annotate test sets from publicly available datasets, which are widely used in the research community, in order to allow other researchers to use our data and promote interest in this research topic. |
Type Of Material | Improvements to research infrastructure |
Provided To Others? | No |
Impact | Thus far, we have only used the dataset internally, but we are planning to make it available to the public in the near future. The dataset has allowed us to study the potential of input simplification for automatic translation and provides us with a guideline for automatic approaches to input simplification. |
Title | SGNMT - Syntactically Guided Neural Machine Translation |
Description | SGNMT is an open-source framework for neural machine translation (NMT) and other sequence prediction tasks. The tool provides a flexible platform which allows pairing NMT with various other models such as language models, length models, or bag2seq models. It supports rescoring both n-best lists and lattices. A wide variety of search strategies is available for complex decoding problems. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2017 |
Provided To Others? | Yes |
Impact | This decoder is being used for two courses in the Cambridge MPhil in Machine Learning, Speech, and Language Technology: the Weighted Automata course and the course in Statistical Machine Translation. The tools are being used for quick prototyping of translation systems by SDL plc, a UK-based language service provider. The tools have also been crucial to several papers arising from this project. |
URL | https://ucam-smt.github.io/sgnmt/html/ |
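The idea of pairing an NMT model with other models to rescore an n-best list can be sketched as a weighted sum of per-model scores. The Python below is only an illustration of that general idea and does not use SGNMT's actual interface: `rescore_nbest`, the toy scorers, the example hypotheses, and the weights are all hypothetical.

```python
def rescore_nbest(nbest, scorers, weights):
    """Re-rank hypotheses by a weighted sum of scorer outputs (best first).

    nbest   : list of hypothesis strings
    scorers : list of functions mapping a hypothesis to a score
              (higher is better, e.g. a log-probability)
    weights : one weight per scorer
    """
    def total(hyp):
        return sum(w * scorer(hyp) for scorer, w in zip(scorers, weights))

    return sorted(nbest, key=total, reverse=True)


# Hypothetical log-probabilities standing in for a neural translation model.
TOY_NMT_LOGPROB = {
    "the house is small": -1.2,
    "house small": -1.0,
    "the house is very small": -2.5,
}


def toy_nmt_score(hyp):
    """Look up the stand-in NMT log-probability of a hypothesis."""
    return TOY_NMT_LOGPROB[hyp]


def length_score(hyp):
    """A trivial length model: reward each output word equally."""
    return float(len(hyp.split()))


nbest = list(TOY_NMT_LOGPROB)
nmt_only = rescore_nbest(nbest, [toy_nmt_score], [1.0])
combined = rescore_nbest(nbest, [toy_nmt_score, length_score], [1.0, 0.2])
print(nmt_only[0])
print(combined[0])
```

With the NMT score alone the truncated hypothesis ranks first; adding the length model with a small weight promotes the complete sentence instead.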
Title | Research Data Supporting "Syntactically Guided Neural Machine Translation" |
Description | The data set includes two NMT models trained on the English-German WMT'15 parallel data, a German NPLM language model, and an n-best list and translation lattices generated with HiFST for the WMT'15 test set. It is intended to support the tutorial for the SGNMT tool. |
Type Of Material | Computer model/algorithm |
Year Produced | 2016 |
Provided To Others? | Yes |
Impact | These baseline systems have made it possible for others to replicate our published results and to verify that the SGNMT models are running correctly. While not a direct impact, making these baselines available makes it much easier for other research groups to adopt and continue this line of work. |
URL | https://www.repository.cam.ac.uk/handle/1810/256339 |
Title | Research data supporting "Source Sentence Simplification for Statistical Machine Translation" |
Description | This data set contains subsets of English-German test sets from the Workshop for Machine Translation (WMT) which have been annotated with manual text simplification information on the source side in the form of gap begin and gap end symbols (, ). The data was tokenized and truecased using the processing scripts distributed with the Moses SMT system. The source simplifications were produced by workers recruited on the crowdsourcing platform Crowdflower (https://www.crowdflower.com). We asked workers to simplify a sentence by deleting words and punctuation, while trying to retain the most important information in the shortened sentence. Their performance was controlled using test questions and a second Crowdflower task which asked workers to identify bad simplifications from the first task. The outcomes of the second task were aggregated by combining an agreement score and the average worker trust score for each simplification. We selected randomly from the remaining simplifications with a combined score of at least 0.5. |
Type Of Material | Database/Collection of data |
Year Produced | 2016 |
Provided To Others? | Yes |
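The quality filter described in this record (a combined score with a threshold of 0.5, followed by random selection) might look roughly like the Python sketch below. The record does not say how the agreement and trust scores were combined or how the random selection was grouped, so the unweighted mean and the one-per-sentence choice are assumptions made purely for illustration; all names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Simplification:
    sentence_id: str
    text: str
    agreement: float   # agreement score from the validation task, in [0, 1]
    mean_trust: float  # average trust score of the validating workers, in [0, 1]


def combined_score(s):
    # ASSUMPTION: the source does not state the combination rule;
    # an unweighted mean is used here only as an example.
    return 0.5 * (s.agreement + s.mean_trust)


def select_simplifications(candidates, threshold=0.5, seed=0):
    """Drop candidates below the threshold, then pick one per sentence at random.

    ASSUMPTION: choosing one simplification per source sentence is a guess
    at what "selected randomly" means in the description.
    """
    rng = random.Random(seed)
    by_sentence = {}
    for s in candidates:
        if combined_score(s) >= threshold:
            by_sentence.setdefault(s.sentence_id, []).append(s)
    return {sid: rng.choice(group) for sid, group in by_sentence.items()}


candidates = [
    Simplification("s1", "the committee approved the plan", 0.9, 0.8),
    Simplification("s1", "committee plan", 0.2, 0.4),
    Simplification("s2", "prices", 0.3, 0.5),
]
selected = select_simplifications(candidates)
print(sorted(selected))
```

Sentences for which no candidate reaches the threshold are simply absent from the result.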
Description | Research on multilingual multi-modal models |
Organisation | University of Amsterdam |
Country | Netherlands |
Sector | Academic/University |
PI Contribution | We contributed to this collaboration by providing input in regular meetings and discussions, setting up baseline systems, carrying out analyses of system outputs and revising article drafts. We also helped to promote interest in the research by giving a talk at a nearby institution (Computer Laboratory in Cambridge). |
Collaborator Contribution | Our collaborators have provided the main software for this project as well as the infrastructure to carry out the necessary experiments. |
Impact | The collaboration has so far resulted in a pre-publication on http://arxiv.org as listed above and has further led to a new shared task at the annual Workshop for Machine Translation (http://www.statmt.org/wmt16/multimodal-task.html) which will help to promote further interest and activity in this field. |
Start Year | 2015 |
Title | SGNMT |
Description | SGNMT is an open-source framework for neural machine translation (NMT). The tool provides a flexible platform which allows pairing NMT with various other models such as language models, length models, or bag2seq models. It supports rescoring both n-best lists and lattices. A wide variety of search strategies is available for complex decoding problems. SGNMT is compatible with Blocks/Theano and TensorFlow. |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | The algorithms and modelling approaches implemented in SGNMT have been incorporated into research and teaching in the Cambridge University Department of Engineering. The software is a key component of two Cambridge MPhil modules: Weighted Automata and Statistical Machine Translation. The tools remain under active development. SGNMT has been adopted by SDL plc, a UK-based translation services provider, for fast prototyping of translation systems. The SGNMT tools are being used by students on the MPhil in Machine Learning, Speech, and Language Technology at the University of Cambridge. Two MPhil dissertations in the 2015-16 academic year were done using SGNMT. Marcin Tomczak did a thesis titled `Bachbot` on automatic music generation. He trained sequence-to-sequence neural models to generate scores for Bach chorales, and also implemented constraints on valid compositions using weighted automata (acceptors). SGNMT allowed him to use the neural models to generate scores that obeyed the `grammar` of the musical form. This is a particularly pleasing application of machine translation technology to a completely different problem domain. Jiameng Gao did a thesis titled `Variable length word encodings for neural translation models`. This is a more straightforward machine translation thesis involving the use of transducers to combine word-level and sub-word-level models. http://www.mlsalt.eng.cam.ac.uk/foswiki/pub/Main/CurrentMPhils/Marcin_Tomczak_8224841_assignsubmission_file_Tomczak_dissertation.pdf http://www.mlsalt.eng.cam.ac.uk/foswiki/pub/Main/CurrentMPhils/Jiameng_Gao_8224881_assignsubmission_file_J_Gao_MPhil_dissertation.pdf |
URL | https://ucam-smt.github.io/sgnmt/html/index.html |
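The combination of a neural sequence model with an acceptor that enforces a `grammar`, as in the music generation thesis mentioned above, can be illustrated with a small constrained greedy decoder. This Python sketch is an assumption-laden toy: it uses an unweighted acceptor stored as a dictionary, a fixed score table in place of a neural model, and greedy search only. None of the names come from SGNMT or from the thesis.

```python
def constrained_greedy_decode(step_scores, transitions, start, finals, max_len=20):
    """Greedy decoding that only emits symbols the acceptor allows.

    step_scores : function(prefix) -> dict of symbol -> score
                  (a stand-in for a neural model's next-symbol scores)
    transitions : dict of state -> dict of symbol -> next state
    start       : initial acceptor state
    finals      : set of accepting states
    Returns (output symbols, whether an accepting state was reached).
    """
    state, output = start, []
    while state not in finals and len(output) < max_len:
        allowed = transitions.get(state, {})
        if not allowed:
            break  # dead end: greedy search has no way to back off
        scores = step_scores(output)
        # Choose the best-scoring symbol among those the acceptor permits.
        symbol = max(sorted(allowed), key=lambda s: scores.get(s, float("-inf")))
        output.append(symbol)
        state = allowed[symbol]
    return output, state in finals


# Hypothetical acceptor: a phrase must start on C, move to F or G, and end on C.
TRANSITIONS = {0: {"C": 1}, 1: {"F": 2, "G": 2}, 2: {"C": 3}}
FINALS = {3}


def toy_model(prefix):
    """A stand-in model that always prefers G, whatever the prefix."""
    return {"G": 0.0, "F": -1.0, "C": -2.0}


sequence, accepted = constrained_greedy_decode(toy_model, TRANSITIONS, 0, FINALS)
print(sequence, accepted)
```

Left unconstrained, the toy model would emit its preferred symbol at every step; the acceptor restricts each choice so that the output is a valid path. A practical system would use weighted automata and beam search so that dead ends can be avoided.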
Title | UCAM-SMT |
Description | This open source package contains the Cambridge SMT system, a set of tools for statistical machine translation, which rely on the Google OpenFST Weighted Finite State Automata toolkit. It includes the following features: HiFST -- Hierarchical phrase-based statistical machine translation system based on OpenFST; Direct production of translation lattices as Weighted Finite State Automata; Efficient WFSA rescoring procedures; OpenFst wrappers for direct inclusion of KenLM and ARPA language models as WFSAs; Lattice Minimum Bayes Risk decoding; Lattice Minimum Error Rate training; Client/Server mode; WFSA true-casing; Hadoop-based rule extraction. Tutorials are also provided to implement recently published research results. |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | We have found that the syntactic approach implemented in these tools is a good complement for the recent `neural` approaches to machine translation. We use these tools to generate very large target language automata which are used to loosely guide the NMT systems. In doing so, we find that translation quality is improved over either individual approach. The minimum Bayes risk decoding approach we have developed for using syntactic and neural MT together will form the basis for including generation in MT, which is one of the major objectives of this project. |
URL | https://ucam-smt.github.io/ |
Description | Cambridge Conversations in Translation: Translation and Technology |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Postgraduate students |
Results and Impact | The Cambridge CENTRE FOR RESEARCH IN THE ARTS, SOCIAL SCIENCES AND HUMANITIES (CRASSH) is running a year-long programme on Translation. Dr Marcus Tomalin (RA on this grant) is one of the Conveners of the programme. Overall Aims of the `Conversations in Translation Programme`: In recent decades, the theory and practice of translation has become an increasingly prominent area of academic discussion and debate. Offering important opportunities for interdisciplinary research, this flourishing field inevitably promotes interactions across and within a wide range of different discourses. However, the University of Cambridge currently has no institutional infrastructure devoted to such work, and those interested in translation tend to be confined to informal fragmentary clusters that rarely converge. The Cambridge Conversations in Translation (CCiT) research group seeks to rectify this by providing a forum in which anyone and everyone with an active interest in translation can meet to exchange ideas about this rich and complex subject. A series of panel discussions and workshops will bring together practitioners and scholars from fields as diverse as literary studies, linguistics, theology, history, music, philosophy, and anthropology to consider and respond to what we anticipate will be provocative insights from our invited speakers. The resulting discussions will encourage an engagement with both practice and theory as they draw on the experiences of professional translators alongside more speculative theoretical frameworks and methodologies. Translation and Technology (Panel Discussion), 23 January 2017, 14:00 - 16:00. Dr Marcus Tomalin (RA on this grant), Dr Adrià de Gispert (RA on this grant). In recent years, the art of translation has witnessed an unprecedented technological revolution. 
For many people, websites such as Google Translate are rapidly becoming the primary resource for obtaining a rough-and-ready translation of a given source-language text. If a Hungarian rendering of the first sentence of this current paragraph is required, then it can be obtained instantaneously: 'Az elmúlt években, a muvészet fordítás tanúja technológiai forradalmat'. The need for long years of patient tussling with conjugations, declensions, and the mysteries of vowel harmony is (seemingly) eliminated. However, few of the so-called 'naïve users' of these online translation systems know how they work. And even if they are dimly aware that some kind of modelling is being deployed, they generally do not know how or why it is applied, or whether a given system is rule-based, example-based, or statistical in nature (Trujillo 2012; Bhattacharyya 2015). Yet in order to evaluate the significance of any such systems, it is important to understand how they are trained, what kinds of bilingual corpora are used, and which particular kinds of linguistic patterns are modelled. There are also important distinctions between the kinds of texts translated. Machine translation systems struggle with poetry, but cope more successfully with certain kinds of genre-specific technical writing. This discussion panel will explore different aspects of the impact of recent technology on the art and craft of translation. It will assess the professional contexts of use of machine translation systems, and it will offer a chance to reflect upon the overarching anxiety that such systems pose a potential threat to human-produced translations. |
Year(s) Of Engagement Activity | 2016 |
URL | http://www.crassh.cam.ac.uk/events/26901 |
Description | Giving Voice to Digital Democracies: a Series of Panel Discussions hosted by the Centre for Research in the Arts, Social Sciences, and Humanities. |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Professional Practitioners |
Results and Impact | A series of panel discussions is planned for the 2018 academic year. The general area is AICT: the intersection of Artificial Intelligence and Information and Communication Technologies. The first event addressed the `Future of AI` and was held on 21 Feb 2018 in the Department of Engineering. The panelists were Prof Steve Young, U Cambridge and Apple Inc; Dr Hugo Zaragoza, Amazon; and Prof David Runciman, POLIS, U Cambridge. The aim of the discussion was for leading technological practitioners to give their views on the future and risks of artificial intelligence, with commentary from a leading political theorist on the role of individuals, organisations, and agents in the political process. The seminar was very well attended: 50+ in the audience, with a mix of students, academics, and passers-by. The presentations ran for 1 hour, followed by another hour of Q&A and general discussion. The consensus is that it was extremely valuable and thought-provoking: the audience stayed for the full two hours. Two more events are planned for the series: `The Economics of AICT` and `AICT and Social Change`. The series conveners are Dr Marcus Tomalin (RA on EP/L027623/1), Prof Ann Copestake (Head of the Cambridge Department of Computer Science and Technology), and Prof Bill Byrne (PI on EP/L027623/1). |
Year(s) Of Engagement Activity | 2018 |
Description | Hay Festival presentation: `Lost in Translation?`, with Dr Marcus Tomalin, Dr Helena Sanson, Prof Bill Byrne |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | A `Conversation` led by Dr Helena Sanson, Cambridge MML, discussing the possibilities and limits of machine translation. Dr Tomalin gave a high-level overview of the technology and how it works, after which Dr Sanson posed `questions` to Tomalin and Byrne of the sort that a literary scholar or linguist might ask about the technology and how it differs from human translation. It was a lively discussion, with much interaction with the audience. |
Year(s) Of Engagement Activity | 2018 |
URL | https://hayfestival.com/m-127-hay-festival-2018.aspx?sectionfilterid=0&genrefilterID=7 |
Description | MT Marathon includes keynote talks |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Machine Translation Marathon 2016 is a week-long gathering of machine translation researchers, developers, students and users. It features: - MT Lectures and Labs covering the basics and tutorials. - Technical Talks about open source tools. - Hacking Projects to advance tools or research in one week. The MT Marathon took place on 12-17 September 2016 in Prague, Czech Republic, organised by the Institute of Formal and Applied Linguistics (ÚFAL) of the Faculty of Mathematics and Physics, Charles University in Prague. Dr Adrià de Gispert (RA on this grant) gave an invited keynote lecture titled `Directed MT Research for Commercial Settings`. Dr de Gispert is part-time on this grant, with the remainder of his time spent as Senior Research Scientist, SDL Research, Cambridge, UK. Dr de Gispert spoke about successful strategies for migrating research from academia to industry. Abstract: Successfully deploying MT systems in commercial settings offers challenging problems not usually encountered in academic research. Customer and use case requirements need to be considered along with general translation quality. When training and optimizing MT systems, factors like decoding speed, memory and disk footprint, usability, robustness, ability to train with relevant data, or training time, are key to success. In this talk I will present recent work done at SDL Research to bring MT to users, and discuss other aspects of doing research in industry. |
Year(s) Of Engagement Activity | 2016 |
URL | http://ufal.mff.cuni.cz/mtm16/keynote.html |
Description | Recent Developments in Neural Machine Translation, Cambridge Computational Biology Institute Annual Symposium |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Invited presentation on recent developments in machine translation pitched at computational biologists and their interest in processing sequences. |
Year(s) Of Engagement Activity | 2018 |
URL | http://talks.cam.ac.uk/talk/index/104587 |
Description | THIRD CONFERENCE ON MACHINE TRANSLATION (WMT18) shared translation tasks |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | The annual WMT evaluation is by far the most recognized evaluation campaign in the field of MT and usually receives submissions from leading MT groups around the globe. Entries from the Cambridge MT group have repeatedly ranked among the top systems in past evaluations. For example, in 2018 Cambridge submitted systems for three high-profile language pairs (English-German, German-English, and Chinese-English) and achieved second place in terms of human judgments of translation quality in all three language pairs. A report titled `Fine-grained evaluation of German-English Machine Translation based on a Test Suite` (http://www.statmt.org/wmt18/pdf/WMT063.pdf) found that our system `achieves the highest accuracy in most linguistic phenomena, as compared to the rest of the systems` and that it `obtains a significantly better performance than all other systems concerning verb tense/aspect/mood`. The investigators go on to note that this `performance may be explained by the fact that UCAM differs from others, since it combines several difference neural models together with a phrase-based SMT system in an syntactic MBR-based scheme (Stahlberg et al., 2016).` This `MBR-based scheme` is a key result of this research project. |
Year(s) Of Engagement Activity | 2018 |
URL | http://www.statmt.org/wmt18 |
Description | Talk at Computer Laboratory, Cambridge |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Other audiences |
Results and Impact | About 40 people attended the talk, coming from different departments of the University of Cambridge as well as from companies in the area, such as Microsoft Research. The talk sparked an interesting discussion about the potential of multilingual multi-modal modelling, which is relevant not only to our immediate research focus but also to other natural language applications. |
Year(s) Of Engagement Activity | 2015 |
URL | http://talks.cam.ac.uk/talk/index/62971 |