Prodigy: Probabilistic Deep Generation

Lead Research Organisation: University of Brighton
Department Name: Sch of Computing, Engineering & Maths

Abstract

Computational methods for generating language are lagging behind computational methods for analysing language in several ways, most obviously in that they are not used commercially. The main reasons are that systems for generating language take inordinate amounts of time to build, yet once built cannot be reused, and tend to be severely lacking in language variation, something that is easily perceived as a lack of quality. The current situation in language generation research is reminiscent of language analysis research in the late 1980s, when symbolic and statistical methods briefly formed entirely separate research paradigms. Language analysis soon moved towards a paradigm merger, realising that symbolic methods lacked the efficiency and robustness that probabilistic methods could provide, which in turn would benefit from the accuracy and subtlety of symbolic methods. A similar development is currently underway in the field of machine translation where - after several years of purely statistical methods dominating the field - researchers are now beginning to bring linguistic knowledge back in. The experience from these research fields suggests that higher quality can be achieved when the symbolic and statistical paradigms join forces. Recent research shows that this is likely to be true for language generation too. The purpose of the Prodigy project is to develop, for the first time, a comprehensive, linguistically informed, probabilistic methodology for generating language that substantially improves development time, reusability and language variation in language generation systems, and thereby enhances their commercial viability. Taking the principal investigator's previous EPSRC-funded research on probabilistic NLG as a starting point, the Prodigy project will explore whether the combination of the probabilistic and the linguistic can be as beneficial for the field of language generation as it has been for language analysis. 
We will focus on two aspects in particular: (i) developing reusable data representation and encoding strategies, and (ii) developing specific probabilistic techniques for guiding language generation processes. We will test and evaluate our representations and techniques on five different data sets which have been collected from real-world text production tasks and include weather forecasts, descriptions of museum exhibits, and nurses' reports. The Prodigy project will produce research outcomes that are of potential benefit to industry, the research community and individual end-users. The research community will primarily benefit through advances in our understanding of reusable language generation technology, and industry through improvements in commercial viability; the technology itself can help individual users by speeding up text production, as well as by making available a modality that does not always exist (e.g. enabling visually impaired readers to access graphical information).
 
Description: The experience from other language processing research fields suggested that higher quality can be achieved when symbolic and statistical paradigms join forces. Recent research had shown that this is likely to be true for language generation too. The Prodigy project developed, for the first time, a comprehensive, linguistically informed, probabilistic methodology for generating language that substantially improved development time, reusability and variation in language generation systems, and thereby enhanced their commercial viability.
Exploitation Route: Prodigy resulted in two major research outputs which were not publications:

1. The Prodigy-METEO corpus of paired numerical and textual data, which is freely available and has been used by several research teams as a benchmark.

2. Freely available language generation technology.
Sectors: Digital/Communication/Information Technologies (including Software)