Creating anaphorically annotated resources through semantic wikis (AnaWiki)

Lead Research Organisation: University of Essex

Department Name: Computer Science and Electronic Engineering

Abstract

Progress in Natural Language Processing (NLP), both in developing better NLP systems and in developing better theories of how humans process language, depends on the availability of large annotated corpora: collections of documents annotated with human judgments about, say, the interpretation of ambiguous words such as 'bank' or 'stock' in a particular context, or the interpretation of anaphoric expressions like 'the corpus'. The fact that current corpora annotated for semantic information are not large enough, and do not collect the judgments of a large enough number of subjects, is therefore a major obstacle for NLP. Creating larger hand-annotated corpora with current methods is very expensive and time-consuming; in practice, annotating more than 1M words is infeasible. A variety of techniques for addressing the problem through semi-automatic annotation, such as bootstrapping and active learning, have been proposed in the literature, but their usefulness has not yet been convincingly demonstrated. The success of Wikipedia, however, shows that another approach might be possible: taking advantage of the willingness of the Web population to contribute to collaborative resource creation efforts. This willingness has already been harnessed to tag images through the ESP game; we propose to develop tools that will make it possible for large numbers of volunteers over the Web to collaborate in the creation of semantically annotated corpora (specifically, a corpus annotated with coreference information). In this, we will build on existing efforts to develop versions of MediaWiki that support work on the Semantic Web, and on our own efforts to develop reliable and easy-to-follow instructions for marking semantic judgments about anaphora. At the very least, these tools will make it possible for the community of NLP researchers themselves to collaborate in the creation of an Anaphoric Bank.
We will, however, also run a pilot developing methods to attract the interest of the Web community at large; if these tests are successful, we may be able to use the power of collaborative effort through the Web to create really large annotated corpora. A distinctive feature of our approach is that volunteers will be able to mark differences in semantic judgments and to comment on previously expressed judgments, so as to identify the judgments on which there is wide agreement and those on which there is disagreement.

Funded Value:

£143,320

Funded Period:

Nov 07 - Sep 09

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/F00575X/1

Principal Investigator:

Massimo Poesio

Research Subject:

Info. & commun. Technol. (60%)

Linguistics (40%)

Research Topic:

Comput./Corpus Linguistics (40%)

Information & Knowledge Mgmt (60%)

Organisations

University of Essex (Lead Research Organisation)

People	ORCID iD
Massimo Poesio (Principal Investigator)
Udo Kruschwitz (Co-Investigator)

Publications

Chamberlain J (2009) A demonstration of human computation using the Phrase Detectives annotation game

Chamberlain J (2018) Optimising crowdsourcing efficiency: Amplifying human computation with validation in it - Information Technology

Chamberlain J. (2008) Addressing the resource bottleneck to create large-scale annotated texts in Semantics in Text Processing, STEP 2008 - Conference Proceedings

Chamberlain J. (2009) Constructing an anaphorically annotated corpus with non-experts: Assessing the quality of collaborative annotations in People's Web 2009 - 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources at the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL-IJCNLP 2009 - Proceedings

Chamberlain J. (2016) Phrase detectives corpus 1.0 crowdsourced anaphoric coreference in Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016

Hopfgartner F (2014) The annotation-validation (AV) model

Poesio M. (2008) ANAWIKI: Creating anaphorically annotated resources through Web cooperation in Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008

Yu J. (2023) Aggregating Crowdsourced and Automatic Judgments to Scale Up a Corpus of Anaphoric Reference for Fiction and Wikipedia Texts in EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
