Large-scale Unsupervised Parsing for Resource-Poor Languages

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

This project focuses on the automatic induction of grammatical structure from raw text. Automatic inference of the syntax of sentences is an old problem in natural language processing, which originates in studies attempting to build computational models for the way humans learn language.

This problem is still far from being solved. There is as yet no fully-fledged computer program that takes raw text and returns a computational representation of its syntax (for example, identifying the noun phrases, the verb phrases, the prepositional phrases, and how they relate to each other in the text).

This research aims to make a major step toward building such a system. The goal is to derive a new algorithm that recovers, at least partially, the syntax of raw text. The algorithm is based on the assumption that words which frequently tend to co-occur should usually be linked, not just semantically, but also syntactically. For example, if the word "deep" often co-occurs with the word "puddle", the algorithm will assume that "deep" tends to modify the word "puddle."
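The co-occurrence assumption above can be illustrated with a small sketch. This is not the project's algorithm: it uses pointwise mutual information (PMI), a standard association statistic swapped in here for illustration, and the function name `pmi_scores` and the toy corpus are hypothetical. It only shows how "deep" and "puddle" might receive a higher association score than a pair that co-occurs no more often than chance.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(sentences):
    """Return sentence-level PMI for every word pair that co-occurs.

    Hypothetical helper for illustration only; keys are alphabetically
    ordered (word_a, word_b) tuples.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # Count each word at most once per sentence.
        words = sorted(set(sentence.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    n = len(sentences)
    scores = {}
    for (a, b), joint in pair_counts.items():
        # PMI = log( P(a, b) / (P(a) * P(b)) ), probabilities estimated per sentence.
        scores[(a, b)] = math.log((joint * n) / (word_counts[a] * word_counts[b]))
    return scores


# Toy corpus (an assumption, not project data).
corpus = [
    "the deep puddle froze",
    "a deep puddle formed",
    "the dog barked",
    "a dog slept",
]
scores = pmi_scores(corpus)
print(scores[("deep", "puddle")] > scores[("deep", "the")])
```

On a corpus this small, rare words also receive inflated scores, which is one reason a practical system would need far more text and a more principled estimator than raw PMI.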

The algorithm is based on a new learning paradigm developed in the machine learning community called "spectral learning". This paradigm has many advantages, most notably its well-motivated mathematical component. This means that we can derive mathematical proofs that guarantee that the algorithm will be able to learn the syntax of a language if it is exposed to sufficiently large amounts of raw text.

Such proofs are important partially because they explain the learnability of language by humans. If these proofs show that we do not require much data to learn syntax, they can shed light on humans' ability to learn language from (relatively) short exposure to language through their childhood.

Planned Impact

From the societal/educational perspective:

Language is the main topic of research in NLP, and this research area offers an approachable topic in computer science for a broad audience. This is especially true for syntactic parsing. Most of us have an intuitive notion of what syntax means, as well as an understanding of computer applications. Combining the two should therefore be comprehensible to a young audience, and can potentially give them a window through which to explore ideas about computation and language.

Our goal will be to develop a website that makes all of the material we develop accessible to this audience. We will explain the basics of computational parsing, and build a demo website that will enable visitors to experiment with our syntactic parsers in various languages.

From the economic perspective:

The development of our algorithms can be very useful for the language technology industry. Syntactic parsers are core components used in natural language analysis and applications, including machine translation, information retrieval, information extraction and question answering.

The availability of parsers for resource-poor languages has the potential to advance such applications for these languages. One of the greatest impediments to creating natural language applications for various languages has been the lack of tools which analyse language at its basic level, such as syntax. Making our parsing tools more mature for companies that develop natural language applications for resource-poor languages can therefore be very valuable for these companies.

 
Description For a broad audience: We discovered that new techniques in machine learning that combine linear algebra with statistics can be used efficiently to improve the accuracy of natural language understanding systems.

In more detail: We have discovered that one can efficiently learn syntactic parsers (functions that map sentences to their grammatical structure) with incomplete data using spectral methods. We discovered that spectral methods can advance the state of the art for grammar and structured prediction models, and that they can serve as a viable alternative to existing learning frameworks in the unsupervised/latent-variable setting, such as deep learning and log-likelihood maximisation.
Exploitation Route This project simplified and improved the accessibility of a set of algorithms (in the context of spectral learning) that have strong theoretical guarantees and also give state-of-the-art results on certain problems (or close to state of the art).

One of the main issues with the spectral learning framework is that it has a "steep learning curve", and one of our goals in this work was to make the algorithms more accessible. We believe we managed to do so in at least one of our papers.

We believe that the contributions we currently make with spectral learning will serve as the basis for long term impact for machine learning and natural language processing -- similarly to the impact that deep learning is now making, after years of hard work in trying to prove it as a good learning framework.
Sectors Digital/Communication/Information Technologies (including Software), Education, Financial Services, and Management Consultancy, Other

 
Description Research Grant from Xerox
Amount $30,000 (USD)
Organisation European Centre for Research and Teaching of Environmental Geosciences (CEREGE) 
Sector Academic/University
Country France
Start 01/2016 
End 01/2017
 
Description Work on social media with grammar algorithms 
Organisation University of Edinburgh
Department School of Informatics Edinburgh
Country United Kingdom 
Sector Academic/University 
PI Contribution Collaborated with another postdoctoral researcher (Annie Louis) who is a Newton fellow (http://www.newtonfellowships.org/) at the School of Informatics. The algorithm we developed in our EMNLP paper as part of the grant work was used for her work as well, and led to a publication in the same conference.
Collaborator Contribution Annie Louis led the experimental evaluation and the data collection.
Impact Conversation Trees: A Grammar Model for Topic Structure in Forums, Annie Louis and Shay B. Cohen, In Proceedings of Empirical Methods in Natural Language Processing, 2015
Start Year 2015
 
Description Work on unsupervised learning of grammars using spectral learning 
Organisation Carnegie Mellon University
Department Mellon College of Science
Country United States 
Sector Academic/University 
PI Contribution I worked with Ankur Parikh from Carnegie Mellon University (now at Google) on techniques for unsupervised learning of grammars. Our contribution was guiding the experiments done by CMU, and developing some of the theoretical understanding of the algorithms.
Collaborator Contribution Ankur Parikh was the main investigator for this collaboration, and as such was in charge of programming all the applications required for this collaboration and developing the theoretical results to their fullest.
Impact Spectral Unsupervised Parsing with Additive Tree Metrics, Ankur P. Parikh, Shay B. Cohen and Eric Xing, In Proceedings of ACL, 2014
Start Year 2014