Modelling Discourse in Statistical Machine Translation

Lead Research Organisation: University of Sheffield
Department Name: Computer Science


Automatic translation of human languages is an increasing necessity in our global society: large amounts of text are constantly produced in various languages and fast, cheap and accurate translation into a number of other languages is required to foster business and communication within and across nations. This high demand for translations cannot be fulfilled by human translators because of its sheer volume, cost and the lack of skilled professionals.

Different Machine Translation (MT) approaches have been proposed to automate translation. The most widely adopted approach is Statistical MT (SMT): the broad availability of free, open source SMT systems, along with significant improvements in their quality in recent years, has made SMT a very promising technology. This is evidenced by the many commercially successful SMT systems, such as those developed by Google, Microsoft and IBM.

Despite their recent success, SMT systems are still far from producing translations that reach human quality levels. A major limitation is that they translate sentences one by one, in isolation, without resorting to any information about the context in which such sentences appear. This keeps the systems computationally feasible; however, more advanced approaches that overcome this limitation are needed to improve SMT quality and make it a de facto translation technology. The context surrounding a sentence -- its discourse -- contains information about dependencies connecting words or expressions across sentences. Neglecting such connections can lead to incoherent and inconsistent translations:

-- Humans use different words to refer to the same concepts in different sentences. If the links between these words are not identified, sentences can be incoherently translated. E.g.: in "The man bought a leather bag" and "It was soft", Bing Translator misses the connection between "it" and "bag". For Portuguese, it produces "[...]. *Ele *foi *suave", which conveys a completely inadequate meaning: "He went smooth".

-- The same text can appear in different sentences. If the links between these occurrences are not identified, they can be translated inconsistently. E.g.: in "He took cash from the bank" and "The bank was far away", only the first sentence has enough information about the correct meaning of "bank", and thus the second occurrence gets translated as "*margem" in Portuguese (river bank).

SMT is a young area and researchers have so far focused on overcoming issues within sentence boundaries. Most of these issues have been addressed to a large extent in recent years and it is now time to turn to discourse-level challenges. Very few attempts to deal with these challenges have been proposed. These are limited to pre- or post-processing strategies.

This project aims at explicitly modelling discourse-level relationships across sentences in SMT at translation time without compromising the scalability of existing approaches. The proposed approach includes (i) a novel framework to model discourse-level relationships by learning valid transitions across sentences based on rich linguistic information for both source and target languages and (ii) a constraint-based inference algorithm to use these relationships to guide the translation process while keeping it tractable. By decoupling model learning and inference, a basic SMT model will be augmented at inference time with document-wide constraints representing expected discourse relationships that are too expensive or unavailable at model learning time.
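As a purely illustrative sketch of the idea of adding a document-wide constraint at inference time, and not the project's actual framework or algorithm, the toy example below picks one candidate translation per sentence so that a term such as "bank" is rendered consistently across the document. The n-best lists, scores, Portuguese candidates, penalty value and function names are all hypothetical assumptions made for this illustration.

```python
from itertools import product

# Hypothetical n-best lists: for each source sentence, candidate translations
# with an invented sentence-level model score and the rendering used for "bank".
nbest = [
    [("Ele tirou dinheiro do banco", -1.0, "banco"),
     ("Ele tirou dinheiro da margem", -3.0, "margem")],
    [("A margem ficava longe", -1.2, "margem"),
     ("O banco ficava longe", -1.5, "banco")],
]


def document_score(choice, penalty=1.0):
    """Sum of sentence scores, minus a penalty per extra rendering of the term."""
    base = sum(score for _, score, _ in choice)
    renderings = {term for _, _, term in choice}
    return base - penalty * (len(renderings) - 1)


def decode_document(nbest, penalty=1.0):
    """Exhaustively pick one candidate per sentence (feasible for toy inputs only)."""
    return max(product(*nbest), key=lambda c: document_score(c, penalty))


best = decode_document(nbest)
translations = [text for text, _, _ in best]
print(translations)
```

With the consistency penalty switched off (`penalty=0.0`), each sentence simply keeps its highest-scoring candidate, which in this toy data gives the inconsistent "banco" / "margem" pair. The exhaustive search is exponential in the number of sentences, so a realistic system would need a tractable inference procedure instead.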

Planned Impact

This project has major potential impact in Machine Translation (MT) research and use as it proposes a significant change in the way translations are produced: in the context of a document, as opposed to sentence by sentence, in isolation. The impact of this project spans four main areas:

-- Economy: It is expected that improvements resulting from discourse-informed MT will yield better quality translations, which will have a strong impact on the translation industry and among industrial users of multilingual content, with the potential to further reduce translation costs and turnaround times. It is estimated that more than 40% of Language Service Providers worldwide already use MT as part of their translation workflow, and that this results in 30-40% cost reduction and 70% productivity increase. In addition, 28% of the large corporations worldwide use MT to translate their content. Low quality translations due to, among other things, inconsistencies and incoherences are reported as the main reason preventing an even wider adoption of MT. Improvements in translation quality resulting from using discourse can thus magnify the usefulness of MT.

-- Society: Better quality translations will also affect individuals who use translations. These include millions of users who benefit from freely available online systems such as Google Translate and Microsoft Bing Translator. These systems are popular among users on the internet for a number of purposes, from the gisting of content in foreign languages to enabling communication (through chats, forums, etc.) with speakers of other languages. For these end-users, translation can have a big impact, as evidenced during the Haiti earthquake in 2010, when Microsoft and Google built basic Statistical MT systems for Haitian Creole in four days. These systems were successfully used to help the relief efforts by improving communication between locals and support teams.

-- Knowledge: This project will advance the state of the art in Statistical MT (SMT) research and in the use of linguistic information for this problem, which is a recent and promising direction. The findings of this project will also impact research in a number of related fields: other approaches to MT, namely the rule-based and example-based approaches, which also translate sentences in isolation, and could benefit from the general methodology for discourse processing proposed in this project; NLP for other cross-lingual applications, by providing a bilingual discourse framework that could be adapted to such applications; discourse processing, by providing better understanding of how discourse models can affect bilingual applications; translation studies, by providing a framework to represent and study several linguistic phenomena related to (human or machine) translation.

-- People: The project will have a positive impact on the careers of the PI and RA, who will both gain additional knowledge and experience in discourse modelling, constraint-based learning frameworks and development of scalable software. It will also provide greater exposure for both researchers, allowing the PI to consolidate her position as a new lecturer and raise her profile nationally and internationally in Natural Language Processing (NLP) circles. Having a first UK-funded project is a very important step in the PI's academic career. Despite her recent success with EU/US grants, the PI has never been involved in UK-funded projects. Given her goal of establishing herself as one of the leading MT researchers in the UK (and subsequently worldwide), it is essential that she engages in such projects. An EPSRC first grant is the ideal opportunity to start this process: its duration and scale will help ensure the success of this endeavour.


Aziz W (2014) Exact Decoding for Phrase-Based Statistical Machine Translation in Conference on Empirical Methods in Natural Language Processing

Smith Sim K (2016) The Trouble with Machine Translation Coherence in Baltic J. Modern Computing

Steele D (2014) Text, Speech and Dialogue

Steele D (2015) WA-continuum: Visualising word alignments across multiple parallel sentences simultaneously in ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Proceedings of System Demonstrations

Description: Current machine translation approaches translate sentence by sentence, without access to wider context. This was a small-scale, exploratory research project (a first grant) on the impact of document context in machine translation. We developed coherence models that are able to evaluate translations of entire documents, corpora to evaluate such models, and algorithms to potentially incorporate them into translation approaches.
Exploitation Route: Larger research projects in the same area can further develop the models created here.
Sectors: Digital/Communication/Information Technologies (including Software)

Description: Horizon 2020: H2020-ICT-2014-1
Amount: €290,625 (EUR)
Organisation: European Commission
Sector: Public
Country: European Union (EU)
Start: 02/2015
End: 01/2018