Large-scale Unsupervised Parsing for Resource-Poor Languages

Lead Research Organisation: University of Edinburgh

Department Name: Sch of Informatics

Abstract

This project focuses on the automatic induction of grammatical structure from raw text. Automatic inference of the syntax of sentences is an old problem in natural language processing, which originates in studies attempting to build computational models for the way humans learn language.

This problem is still far from being solved. There is as yet no fully fledged computer program that takes raw text and returns a computational representation of its syntax (for example, identifying the noun phrases, the verb phrases and the prepositional phrases, and how they relate to each other in the text).

This research aims to make a major step toward building such a system. The goal is to derive a new algorithm that recovers, at least partially, the syntax of raw text. The algorithm is based on the assumption that words which frequently tend to co-occur should usually be linked, not just semantically, but also syntactically. For example, if the word "deep" often co-occurs with the word "puddle", the algorithm will assume that "deep" tends to modify the word "puddle."
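
As a toy illustration of the co-occurrence assumption only (this is not the project's spectral algorithm), the sketch below links each word to the sentence-mate with which it has the highest pointwise mutual information. The function name `cooccurrence_links`, the use of PMI as the association score, and the six-sentence corpus are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_links(sentences):
    """Map each word to the word it is most strongly associated with.

    Association is pointwise mutual information (PMI) computed over
    sentence-level co-occurrence. Hypothetical helper for illustration.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # Count each word (and each unordered word pair) once per sentence.
        tokens = sorted(set(sentence.lower().split()))
        word_counts.update(tokens)
        pair_counts.update(combinations(tokens, 2))

    n = len(sentences)
    best = {}  # word -> (partner, pmi)
    for (a, b), joint in pair_counts.items():
        pmi = math.log(joint * n / (word_counts[a] * word_counts[b]))
        for word, other in ((a, b), (b, a)):
            if word not in best or pmi > best[word][1]:
                best[word] = (other, pmi)
    return {word: partner for word, (partner, _) in best.items()}


# Made-up corpus: "deep" always appears with "puddle", "red" with "car".
corpus = [
    "the deep puddle",
    "a deep puddle",
    "the red car",
    "a red car",
    "the dog",
    "a dog",
]
links = cooccurrence_links(corpus)
print(links["deep"])  # puddle
```

Frequent function words such as "the" co-occur with everything, so PMI scores them low against any single partner; that is why "deep" is linked to "puddle" rather than to "the" in this toy corpus.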

The algorithm is based on a new learning paradigm developed in the machine learning community called "spectral learning". This paradigm has many advantages, most notably its well-motivated mathematical foundation. This means that we can derive mathematical proofs guaranteeing that the algorithm will learn the syntax of a language if it is exposed to sufficiently large amounts of raw text.

Such proofs are important partly because they explain the learnability of language by humans. If these proofs show that we do not require much data to learn syntax, they can shed light on humans' ability to learn language from (relatively) short exposure to it during childhood.

Planned Impact

From the societal/educational perspective:

Language is the main topic of research in NLP, and this makes the area an approachable topic in computer science for a broad audience. This is especially true for syntactic parsing. Most of us have an intuitive notion of what syntax means, as well as some understanding of computer applications. Bringing the two together should therefore be comprehensible to a young audience, and can potentially give them a window through which to explore further ideas about computation and language.

Our goal is to develop a website that makes all of the material we develop accessible to this audience. We will explain the basics of computational parsing and build a demo website that lets visitors experiment with our syntactic parsers in various languages.

From the economic perspective:

The development of our algorithms can be very useful for the language technology industry. Syntactic parsers are core components of natural language analysis and its applications, including machine translation, information retrieval, information extraction and question answering.

The availability of parsers for resource-poor languages has the potential to advance such applications for these languages. One of the greatest impediments to creating natural language applications for many languages has been the lack of tools that analyse language at its basic level, such as syntax. Making our parsing tools mature enough for companies that develop natural language applications for resource-poor languages can therefore be very valuable to these companies.

Funded Value:

£100,650

Funded Period:

Nov 14 - Feb 16

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/L02411X/1

Principal Investigator:

Shay Cohen

Research Subject:

Info. & commun. Technol. (50%)

Linguistics (50%)

Research Topic:

Artificial Intelligence (50%)

Comput./Corpus Linguistics (50%)

Organisations

People	ORCID iD
Shay Cohen (Principal Investigator)

Publications

Author Name

Title Publication Date Published

Cohen S (2016) Parsing Linear Context-Free Rewriting Systems with Fast Matrix Multiplication in Computational Linguistics

Cohen S (2015) Parsing Linear Context-Free Rewriting Systems with Fast Matrix Multiplication

Louis A (2015) Conversation Trees: A Grammar Model for Topic Structure in Forums

Narayan S (2016) Paraphrase Generation from Latent-Variable PCFGs for Semantic Parsing

Narayan S (2016) Optimizing Spectral Learning For Parsing

Narayan S (2016) Unsupervised Sentence Simplification Using Deep Semantics

Key Findings

Description	For a broad audience: We discovered that new machine learning techniques that combine linear algebra with statistics can be used efficiently to improve the accuracy of natural language understanding systems. In more detail: We discovered that one can efficiently learn syntactic parsers (functions that map sentences to their grammatical structure) from incomplete data using spectral methods. We also discovered that spectral methods can advance the state of the art for grammar and structured prediction models, and that they can serve as a viable alternative to existing learning frameworks in the unsupervised/latent-variable setting, such as deep learning and log-likelihood maximisation.
Exploitation Route	This project simplified and improved the accessibility of a set of algorithms (in the context of spectral learning) that have strong theoretical guarantees and also give state-of-the-art, or close to state-of-the-art, results on certain problems. One of the main issues with the spectral learning framework is that it has a "steep learning curve", and one of our goals in this work was to make the algorithms more accessible. We believe we managed to do so in at least one of our papers. We believe that the contributions we are currently making with spectral learning will serve as the basis for long-term impact on machine learning and natural language processing, similar to the impact that deep learning is now making after years of hard work to establish it as a good learning framework.
Sectors	Digital/Communication/Information Technologies (including Software); Education; Financial Services and Management Consultancy; Other

Further Funding

Description	Research Grant from Xerox
Amount	$30,000 (USD)
Organisation	European Centre for Research and Teaching of Environmental Geosciences (CEREGE)
Sector	Academic/University
Country	France
Start	01/2016
End	01/2017

Collaboration

Description	Work on social media with grammar algorithms
Organisation	University of Edinburgh
Department	School of Informatics Edinburgh
Country	United Kingdom
Sector	Academic/University
PI Contribution	Collaborated with another postdoctoral researcher, Annie Louis, who is a Newton Fellow (http://www.newtonfellowships.org/) at the School of Informatics. The algorithm we developed in our EMNLP paper as part of the grant work was also used in her work, and led to a publication at the same conference.
Collaborator Contribution	Annie Louis led the experimental evaluation and the data collection.
Impact	Conversation Trees: A Grammar Model for Topic Structure in Forums, Annie Louis and Shay B. Cohen, In Proceedings of Empirical Methods in Natural Language Processing, 2015
Start Year	2015


Description	Work on unsupervised learning of grammars using spectral learning
Organisation	Carnegie Mellon University
Department	Mellon College of Science
Country	United States
Sector	Academic/University
PI Contribution	I worked with Ankur Parikh from Carnegie Mellon University (now at Google) on techniques for unsupervised learning of grammars. My contribution was to guide the experiments done at CMU and to develop some of the theoretical understanding of the algorithms.
Collaborator Contribution	Ankur Parikh was the main investigator for this collaboration, and as such was in charge of programming all the applications required for this collaboration and developing the theoretical results to their fullest.
Impact	Spectral Unsupervised Parsing with Additive Tree Metrics, Ankur P. Parikh, Shay B. Cohen and Eric Xing, In Proceedings of ACL, 2014
Start Year	2014
