Disambiguation of Verbs by Collocation

Lead Research Organisation: University of Wolverhampton
Department Name: Research Inst of Info & Lang Processing

Abstract

In this project, we propose using statistical approaches to the analysis of corpus data in order to discover Typical Usage Patterns (TUPs) and hence create a resource for the Disambiguation of Verbs by Collocation (DVC). This project goes beyond the current state-of-the-art represented by word sense disambiguation based on machine-readable dictionaries; typical valency-based approaches, which rarely pay attention to collocations; WordNet, which does not analyse lexical syntagmatics or collocates; and semantic role labelling, which tags mainly thematic roles (e.g. Agent, Patient, Location), rather than semantic types (e.g. Human, Firearm, Route). In our project we propose to associate meanings with normal usage patterns, rather than words in isolation, and to integrate lexical collocations with valency, providing an empirically well-founded resource for use in mapping meaning onto word use in free text. DVC will show the comparative frequency of each pattern of each verb, enabling programs to develop statistically based probabilistic reasoning about meanings, rather than trying to evaluate all possibilities equally.

The internal structure of lexical arguments of verbs will be analysed using computational linguistics techniques, so that for example the relationship between "repair the roof" and "repair the damage" is recognized. Even though the nouns "roof" and "damage" have different semantic types, they activate the same meaning of "repair". Once this has been done, the structural relationship is applied to other verbs, e.g. "treat a patient" and "treat his injuries".

In a pilot project at Masaryk University, Brno, CZ (http://nlp.fi.muni.cz/projects/cpa/), involving analysis of 700 verbs, Prof. Hanks, the co-investigator of the project, showed that, while words may be highly ambiguous, patterns are rarely ambiguous and, furthermore, most uses of most verbs can be assigned unambiguously to a pattern. The existing verbs will be used to train a statistical method, the output of which will be verified lexicographically. As the number of annotated verbs increases, the training procedure will be repeated, improving the accuracy and speed of the annotation. At each step, the researchers employed by the project will analyse the computer output and correct errors. The objective of the DVC project is to analyse 3000 common English verbs and annotate at least 250 corpus lines for each verb. An in-depth data analysis of 100 verbs will be carried out. The resource will be made publicly available at the end of the project.

The DVC project is based on and will contribute to the Theory of Norms and Exploitations (TNE) of Prof. Hanks. TNE says, in essence, that a language consists of two interlinked systems of rules governing word use: a set of rules for the normal uses of words and a second-order set of rules governing the ways in which normal patterns are exploited. Exploitations are deliberately unusual utterances. They play a large role in linguistic change (word-meaning change), as today's exploitation may become tomorrow's norm.

The value of DVC will be proven by textual entailment and paraphrasing, in this way demonstrating its potential usefulness in a large number of fields of computational linguistics which benefit from these two applications.

The project will be disseminated using a wide variety of means. A fully user-friendly publicly available website will contain news about the progress of the project and will provide links to project research papers. It will also host interactive demos that will enable visitors to see the patterns collected and test the technologies developed in the project. Papers will be submitted to international conferences and peer-reviewed journals. Evaluation conferences such as SEMEVAL and RTE will be used to assess the methods developed in this project in a standard environment. An important outcome of the project will be a monograph (theory, methodology, empirical findings).

Planned Impact

The immediate beneficiaries will be computational linguists working both in academic and commercial environments, who will be able to use DVC as a resource in improving applications such as machine text interpretation, question answering, information retrieval, machine translation, and idiomatic text generation. DVC will provide an inventory of natural, normal and largely unambiguous phraseology (accompanied by meanings for each phrase), with a mechanism for interpreting unusual and imaginative phraseology by "best-match" preferential procedures. For these researchers, it will be a reference resource with a much-needed focus on phraseology and methods for disambiguation, areas of information that are neglected in currently available dictionaries.

It will also provide a resource that can contribute to making a reality of Berners-Lee's original dream of the semantic web, in which machines will be able to process the information contained in the natural language of raw text (in contrast to the adding and processing of tags which is the current focus of "semantic web" research).

E-book reading devices such as the Amazon Kindle enable a dictionary to interact with any text that a user is reading. It is technically straightforward to set up a link connecting any word in any electronic text with the relevant entry in an appropriate electronic dictionary. However, for such an application to be fully effective, it is also necessary to enable the software to select, not only the most relevant word, but also the most relevant sense of that word. For this application, currently available research on word-sense relevance is not satisfactory. Research into normal phraseology such as that entailed by the DVC project can be shown to be an essential prerequisite, enabling programmers to match the contexts of word uses in a text with the best-match pattern(s) in a pattern dictionary. By embedding technology like the one developed in the DVC project into e-book readers, it will be possible to enhance the reading experience.

At a theoretical level, DVC results will support developments in empirically well-founded theories of language and meaning. They will provide supporting evidence for use by the many researchers who are now beginning to re-evaluate speculative linguistic theories received from a previous generation of linguists. They will do this because they will be based on painstaking analysis of how people actually use words to make meanings in everyday language use, rather than on speculations about boundary-case possibilities. A new generation of linguists is focusing increasingly on the lexicon and aspects such as lexical preferences rather than on deterministic syntax. DVC is a lexical project par excellence, aiming to show how words combine (syntagmatics and collocations) to make meanings.

Researchers into metaphor and figurative language will welcome the many corpus-derived examples of anomalous uses of words, which will be a by-product of DVC's corpus analysis.

DVC will also be of benefit to English language teachers, course-book writers, and lexicographers, showing how words can be grouped into sets and how they combine syntagmatically to make meanings. The Pattern Dictionary of English Verbs, which will be an eventual outcome of DVC research, will aim among other things to identify the normal patterns in which each verb is used. Feedback from the language-teaching community has already indicated that language teachers who are aware of the complexity of corpus data regard such a dictionary as a much-needed resource.
 
Description The DVC project has successfully developed new insights into the nature of meaning in language and has created new tools for lexical analysis and lexicography. It has demonstrated, by painstaking analysis of nearly 463,000 corpus lines from the 'balanced and representative' British National Corpus (BNC), that the meaning of clauses and sentences depends in large part on collocational patterns of word use, not merely on a simple concatenation of lexical items, each one having a distinct meaning.

All normal patterns of use of 1729 English verbs have been analysed and identified. We have demonstrated through empirical analysis of corpus evidence that meanings are, in many cases, not properties of individual words but rather that there is a close link between meanings and patterns of word use.

In close collaboration with our subcontractor, we have designed, tested, debugged, and implemented a purpose-built database for storing and processing verb patterns and their implicatures. This database is intended as a fundamental resource for use in both computational linguistics and language teaching. It is a basis for future work on activities such as pattern prediction, analysis of entailments and presuppositions, machine translation, and text generation. We also foresee applications in language teaching such as development of a lexical syllabus component and error correction.

DVC's results have been published online (www.pdev.org.uk) along with the associated empirically well-founded shallow ontology of 230 semantic types, developed throughout the course of the project and applied to nouns for the purpose of verb sense disambiguation. A semantic type represents an intrinsic property of each lexical item. The ontology of semantic types is supplemented in relevant contexts by contextually assigned roles and semantic prosody. For example, consider the verb "execute": the meaning of "They executed the rebels" is distinguished from "They executed his orders" only by means of the contrasting semantic types of the two nouns in the direct-object position, namely [[Human]] vs. [[Speech Act]]. Unlike any other ontology, the DVC shallow ontology has been created by empirical observation of the smallest number of contrasting semantic types needed for verb sense disambiguation. A corpus-driven mechanism for populating each semantic type with sets of lexical items has been developed.
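The "execute" contrast can be illustrated with a minimal Python sketch. It is not the project's implementation: the toy lexicon, the `Pattern` class, the `match_pattern` function, and the glosses are all hypothetical, introduced here only to show how a semantic type in direct-object position might select between two patterns of the same verb. Only the type names [[Human]] and [[Speech Act]] come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy lexicon mapping nouns to semantic types.
# The real ontology and its lexical sets are not reproduced here.
SEMANTIC_TYPE = {
    "rebels": "Human",
    "prisoner": "Human",
    "orders": "Speech Act",
    "instructions": "Speech Act",
}


@dataclass(frozen=True)
class Pattern:
    verb: str
    object_type: str  # semantic type expected in direct-object position
    gloss: str        # informal paraphrase written for this sketch only


# Two illustrative patterns for "execute", distinguished only by object type.
PATTERNS = [
    Pattern("execute", "Human", "put someone to death"),
    Pattern("execute", "Speech Act", "carry out what was ordered"),
]


def match_pattern(verb: str, direct_object: str) -> Optional[Pattern]:
    """Return the first pattern whose object type matches the noun, else None."""
    obj_type = SEMANTIC_TYPE.get(direct_object)
    for pattern in PATTERNS:
        if pattern.verb == verb and pattern.object_type == obj_type:
            return pattern
    return None


for noun in ("rebels", "orders", "piano"):
    result = match_pattern("execute", noun)
    print(noun, "->", result.gloss if result else "no pattern matched")
```

A noun that is absent from the toy lexicon matches no pattern; a fuller treatment would need the corpus-driven population of semantic types that the project describes.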

Instead of being confined to just 100 verbs, in-depth analysis was carried out on all completed verbs in the database. However, dependencies were not analysed, because no parsed version of the BNC was deemed to be sufficiently reliable for this purpose.
- Number of verbs analysed: 1729
- Number of patterns: 6848 (phrasal verbs - 566; idioms - 528)

The overall results of corpus analysis in DVC in terms of corpus lines as at 16.09.2015 are as follows:
- 403,561 corpus lines classified as normal uses and linked to a specific pattern
- 19,405 corpus lines (4.52%) classified as creative exploitations of normal uses, likewise linked to patterns
- 5,097 corpus lines marked as unassignable, because of ambiguity or unclarity
- 33,407 corpus lines rejected because of wrong part-of-speech assignment in BNC

These numbers support the conclusion that the majority of clauses in everyday English use are highly patterned, but around 5% are creative exploitations of normal usage.
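The proportions behind this conclusion can be recomputed from the reported counts. The short Python sketch below makes one assumption that the record does not state: that the share of exploitations is taken over the lines that were actually classified (normal, exploitation, or unassignable), leaving out the lines rejected for wrong part-of-speech assignment. The variable names are illustrative.

```python
# Corpus-line counts as reported above (as at 16.09.2015).
normal = 403_561
exploitations = 19_405
unassignable = 5_097
rejected = 33_407  # wrong part-of-speech assignment in the BNC

# All lines examined, including those later rejected.
total_lines = normal + exploitations + unassignable + rejected

# Assumed denominator: lines that were classified, excluding rejected lines.
classified = normal + exploitations + unassignable

exploitation_share = exploitations / classified
normal_share = normal / classified

print(f"total lines examined: {total_lines}")
print(f"exploitations as a share of classified lines: {exploitation_share:.2%}")
print(f"normal uses as a share of classified lines: {normal_share:.2%}")
```

Under this assumption the exploitation share comes out close to, though not exactly at, the reported 4.52%, so the project's own denominator may differ slightly.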

Work on DVC continues. In addition to completing the analysis of verbs, there is a need for corpus pattern analysis of nouns and adjectives. There is also a need to create more precise links between the patterns and a parsed corpus, which could in itself contribute to improvements in parsing technology. All of these will be the subject of future research proposals.
Exploitation Route The prototypical patterns of word use established by DVC can be used as a reference point for the comparative investigation of specialized corpora and literary texts.

A by-product of DVC is a set of new insights into the nature of and possibilities for creative uses, or 'exploitations', of normal, conventional word uses. It is our intention that these insights will be used to improve English Language teaching methodologies and resources. Applied linguists studying creative and non-literal language have also expressed an interest in using this data.

Based on the positive feedback received from many professionals at the conferences where we disseminated the project's work (Euralex, Italy; LREC, Iceland; AIETI7, Spain; RANLP, Bulgaria), we are considering future applications for our data in the field of Natural Language Processing, e.g. to improve machine translation systems or post-editing tools used by language professionals.

Sectors Digital/Communication/Information Technologies (including Software),Education,Other

URL http://pdev.org.uk
 
Description The Pattern Dictionary of English Verbs is becoming widely accepted among computational linguists in Europe as a resource similar to FrameNet, a foundation tool for computational applications, language teaching, and (of course) future dictionaries. For instance, preliminary research conducted within the DVC project has already shown that the proposed methodology and tools can be successfully applied to Machine Translation Evaluation (Bechara et al., 2015), thus demonstrating their potential in achieving a wider economic impact. Industrial partners attending international conferences such as AIETI7 and eLex2015 have expressed a keen interest in using the data in post-editing and in developing vocabulary-building mobile apps. A semantic evaluation exercise was accepted for the 2015 edition of SEMEVAL. This involves a collaboration between research centres across Europe, including Fondazione Bruno Kessler, Trento and The Centre for Natural Language Processing, Brno. The goal of this exercise is to bring the resources created by the DVC project to the attention of the Natural Language Processing community and encourage their use in real-world applications such as statistical machine translation.

The DVC project has made a substantial contribution to the way that we understand the inner workings of language use. It has shown that a verb pattern consists of a combination of collocations (lexical sets) and syntagmatics (clause roles, a.k.a. arguments). Each pattern is used repeatedly an infinite number of times, with lexical variations, in everyday usage. But in addition, the project has shed light on the creative use of language. It has been shown that normal, conventional patterns of word use are exploited in various ways. The project discovered evidence for three basic types of exploitation of normal usage patterns: figurative uses such as metaphors, metonymy, and similes; anomalous arguments; and syntactic exploitations. Exploitations are used for rhetorical effect, but also to create meanings expressing new and unfamiliar situations. The evidence that we have analysed suggests that exploitations are rule-governed - but governed by a quite different set of rules from those that govern syntactic well-formedness. Analysis of exploitation rules has been earmarked as a topic for a future research project.

Professor Hanks' monograph, Lexical Analysis, the Theory of Norms and Exploitations, is recognised internationally as an important contribution to the theory of meaning in language.
First Year Of Impact 2013
Sector Digital/Communication/Information Technologies (including Software),Education,Other
Impact Types Cultural,Societal

 
Description Establishment of a Masters course in "Practical Corpus Linguistics"
Geographic Reach Europe 
Policy Influence Type Influenced training of practitioners or researchers
 
Title PDEV 
Description DVC results have been published online, in a user-friendly version at www.pdev.org.uk, along with the associated empirically well-founded shallow ontology of 220 semantic types, which was developed in the course of the project and applied to nouns for the purpose of verb sense disambiguation. A semantic type represents an intrinsic property of each lexical item. The CPA ontology of semantic types is supplemented in relevant contexts by contextually assigned roles and semantic prosody.
Type Of Material Database/Collection of data 
Provided To Others? No  
Impact The DVC project has shown that a verb pattern consists of a combination of collocations (lexical sets) and syntagmatics (clause roles, a.k.a. arguments). Each pattern is used repeatedly an infinite number of times, with lexical variations, in everyday usage. But in addition, the project has shed light on the creative use of language. It has been shown that normal, conventional patterns of word use are exploited in various ways. The project discovered evidence for three basic types of exploitation of normal usage patterns: figurative uses such as metaphors, metonymy, and similes; anomalous arguments; and syntactic exploitations. Exploitations are used for rhetorical effect, but also to create meanings expressing new and unfamiliar situations. The evidence that we have analysed suggests that exploitations are rule-governed - but governed by a quite different set of rules from those that govern syntactic well-formedness. Analysis of exploitation rules has been earmarked as a topic for a future research project. 
URL http://www.pdev.org.uk