Corpus-based grammar in contrast: the cross-linguistic distributional analysis of Nepali grammatical categories

Lead Research Organisation: Lancaster University
Department Name: Linguistics and English Language

Abstract

The proposed research concerns the application of a new set of methods to the study of the grammar of nouns and verbs in the Nepali language.

In grammar, certain types of word are associated with particular categories of grammatical markers. The grammatical categories associated with nouns, for example, include markers of case - in English these are mostly prepositions, e.g. in phrases such as 'in the bath', 'on Saturday', 'with a hammer'. Verbs, by contrast, are associated with grammatical elements indicating time (tense) or the probability/possibility of an event (mood) - examples of these categories in English include auxiliary verbs such as 'can', 'may', and 'shall', or forms such as 'have [done]' or 'be [doing]'.

In Nepali, complex sequences of grammatical categories may be found after both nouns and verbs. For example:

(1) Example of a noun: 'manisharulai'
manis-haru-lai
man-PLRL-ACC/DAT
'(to the) men'

(2) Another example noun: 'kalamle'
kalam-le
pen-ERG/INSTR
'with (a/the) pen'

(3) Example of a verb: 'garirahancha'
gar-i-rahan-cha
do-PASS-stay-be.3SG
's/he is doing'

(NB: in the examples above, the grammatical categories are labelled with abbreviations in capital letters of the terms usually used for them.)

The nature of these categories, and the nature of the structures which they occur in, are the topics of this proposal.

Grammatical categories have often been analysed by looking at their distribution: that is, in terms of what contexts they can and cannot occur in. To give one very basic example, an important feature of nouns in English is that they may co-occur with the word 'the'. Distribution is also important in corpus linguistics - the approach to studying language based on analysing large collections of text on computer. In corpus linguistics, distribution can be analysed statistically - that is, by looking at what words and what forms commonly occur together or in close proximity to one another. This link between words which occur frequently together in a large corpus of text is described as 'collocation'.
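As a rough illustration of how collocation can be measured statistically, the sketch below scores adjacent word pairs by pointwise mutual information (PMI), one commonly used association measure. The proposal does not say which measures will be used, so the choice of PMI, the function name `pmi_scores`, the `min_count` threshold and the toy English sentence are all assumptions made purely for illustration.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Hypothetical helper for illustration only; it is not the method
    specified in the proposal.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_pairs = max(n_tokens - 1, 1)  # number of adjacent pairs

    scores = {}
    for (w1, w2), pair_count in bigrams.items():
        # Skip rare pairs: PMI is unreliable at very low frequencies.
        if pair_count < min_count:
            continue
        p_pair = pair_count / n_pairs
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        # PMI compares the observed pair probability with the
        # probability expected if the two words were independent.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy input (invented text, not corpus data).
toy_tokens = "the pen is on the table and the pen is in the bath".split()
toy_scores = pmi_scores(toy_tokens)

# List pairs from the strongest association to the weakest.
for pair, score in sorted(toy_scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A higher score means the two words occur together more often than their individual frequencies would predict. In a real study the input would be a large tokenised corpus, and the units counted could be grammatical markers as well as whole words.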

The advantage of using corpus methods in the study of grammar is that findings from a corpus reflect grammar in texts, as it is used in everyday writing (and, in some cases, speech as well). It allows our knowledge of grammar to emerge from a wide view of the ways in which people actually use their language.

This study will develop methods based on statistical measures of co-occurrence and collocation, and use them to look at the distribution of grammatical categories in Nepali. This will enable an investigation of the nature of these grammatical categories and the patterns in which they are used in language. This will extend our knowledge of the types of elements in the Nepali language seen in the examples above.

It has long been established that some features of grammar are specific to particular languages, whereas others are more general and are found in many different languages. The final stage of the investigation proposed here will be to look at the contrast between Nepali and other languages, from the perspective of the quantitative methods outlined above. To do this, the same set of methods will be applied to corpora of two languages - English and Russian - which are from the same family of languages as Nepali, although they come from different regions of the world. This will allow a finely-grained, closely focussed contrastive analysis. The patterns found in the three languages will be compared, allowing us to build up a picture of how these languages use grammar differently in real texts.

In all cases, the raw material for analysis will be the output of computerised searches of a corpus. However, this statistical information must be carefully analysed to establish what patterns of interest can be shown to exist, in terms of the meaning or the grammar of the statistical links that the computer detects.
