Creating anaphorically annotated resources through semantic wikis (AnaWiki)

Lead Research Organisation: University of Essex
Department Name: Computer Sci and Electronic Engineering

Abstract

Progress in Natural Language Processing - both in developing better NLP systems and in developing better theories of how humans process language - depends on the availability of large annotated corpora: collections of documents annotated with human judgments about, say, the interpretation of ambiguous words such as 'bank' or 'stock' in a particular context, or the interpretation of anaphoric expressions like 'the corpus'. The fact that current corpora annotated for semantic information are not large enough, and do not collect the judgments of a large enough number of subjects, is therefore a major obstacle for NLP. Creating larger hand-annotated corpora with current methods, however, is very expensive and time-consuming; in practice, it is infeasible to annotate more than 1M words. A variety of techniques for addressing the problem through semi-automatic annotation, such as bootstrapping and active learning, have been proposed in the literature, but their usefulness has not yet been convincingly demonstrated. The success of Wikipedia shows that another approach might be possible: taking advantage of the willingness of the Web population to contribute to collaborative resource creation efforts. This willingness has already been harnessed to tag images through the ESP game; we propose to develop tools that will make it possible for large numbers of volunteers over the Web to collaborate in the creation of semantically annotated corpora (specifically, a corpus annotated with coreference information). In this, we will build on existing efforts to develop versions of MediaWiki to support work on the Semantic Web, and on our own efforts to develop reliable and easy-to-follow instructions for marking semantic judgments about anaphora. At the very least, these tools will make it possible for the community of NLP researchers themselves to collaborate in the creation of an Anaphoric Bank.
We will, however, also run a pilot to develop methods for attracting the interest of the Web community at large; if these tests are successful, we may be able to use the power of collaborative effort through the Web to create really large annotated corpora. A distinctive feature of the approach we will adopt is that we will allow volunteers to mark differences in semantic judgments, and to comment on previously expressed semantic judgments, so as to identify those judgments on which there is wide agreement and those on which there is disagreement.
