Lexicography in Motion: A History of the Tibetan Verb

Lead Research Organisation: School of Oriental & African Studies
Department Name: History, Religions and Philosophy

Abstract

At one point or another, most language users rely on dictionaries as an authoritative source of lexicographical information. The first recorded dictionaries date back to Sumerian times (3rd millennium BCE) and were compiled in the course of the linguistic convergence ('Sprachbund') between Akkadian and Sumerian. Since then, dictionaries have played a key role in intercultural communication and advanced scientific research across languages and nation states. Modern-day lexicography still serves these goals, but its methods have changed beyond recognition. Card catalogues have given way to databases and digital resources that offer access to a much larger pool of linguistic data. Today, practically all lexicographers deploy text corpora and corpus querying tools, both to sharpen the empirical base of definitions and to provide contextual examples for the end user. We propose to take advantage of these developments to create a corpus-based diachronic lexicon of Tibetan verbs.

Verbs play a central role in most sentences. Knowledge of the meaning of a verb leads to the arguments it requires and to the semantic roles the arguments, in turn, assume. Our lexicon draws on these links. It will allow the user to infer the complete structure of a sentence, based primarily on the terminal verb and the type of accompanying arguments. Furthermore, it charts the morphological and semantic changes of the verbs from the earliest records of Tibetan in the 8th century CE to contemporary times. Each verb is tracked to its earliest occurrence in the Old Tibetan material within the corpus and then compared with its applications in Classical and Modern Tibetan. Some of the existing dictionaries contain sporadic diachronic information, but this is never analysed or juxtaposed with other data. We propose to identify, examine and contextualise the diachronic evidence in a systematic fashion in order to obtain a better grasp of the evolution of the Tibetan language overall.

Corpus resources and processing tools constitute indispensable components in modern lexicography. For Tibetan, some of these tools are now available. 'Tibetan in Digital Communication' (TDC) produced a large corpus of Tibetan language material, with part-of-speech tagging, spanning Old, Classical and Modern Tibetan. For the lexicon, we mine its content by running a series of automated queries drawing on Natural Language Processing (NLP) software. First, we create an internal workflow tool. This allows us to categorise, both systematically and comprehensively, all the Tibetan verbs within the corpus. The different forms of the verbs are then grouped together in discrete entries; we analyse in depth verbal stems that display semantic ambiguity, repeated change or morphological irregularity. In parallel, we identify and label the arguments connected with each verb. We use this data, individually and cumulatively, to generate the citations and definitions for the lexicon.
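
The grouping step described above can be illustrated with a minimal sketch. It is not the project's workflow tool: the tag names (`v.past`, `case.agn`), the `STEMS` lookup table and the two toy sentences are assumptions made for illustration, with Tibetan given in Wylie transliteration. The sketch collects the stems of one verb under a single entry and records the case markers that occur in the same sentence.

```python
from collections import defaultdict

# Hypothetical stem table: surface form -> (lemma, tense).
# A real table would be derived from the tagged corpus, not written by hand.
STEMS = {
    "byed": ("byed", "present"),
    "byas": ("byed", "past"),
    "bya": ("byed", "future"),
    "byos": ("byed", "imperative"),
}


def build_entries(sentences):
    """Group verb forms under one lemma and note co-occurring case markers.

    Each sentence is a list of (word, tag) pairs. The tag scheme is assumed:
    verb tags start with "v", case-marker tags start with "case".
    """
    entries = defaultdict(lambda: {"forms": {}, "cases": set()})
    for sentence in sentences:
        # Case markers stand in for the arguments attached to the verb.
        cases = {tag for _word, tag in sentence if tag.split(".")[0] == "case"}
        for word, tag in sentence:
            if tag.split(".")[0] == "v" and word in STEMS:
                lemma, tense = STEMS[word]
                entries[lemma]["forms"][tense] = word
                entries[lemma]["cases"].update(cases)
    return dict(entries)


# Toy corpus (illustrative only): "khos las byas" and "las byed".
corpus = [
    [("kho", "n"), ("s", "case.agn"), ("las", "n"), ("byas", "v.past")],
    [("las", "n"), ("byed", "v.pres")],
]
entries = build_entries(corpus)
```

Here `entries["byed"]` holds the present and past stems together with the agentive marker observed alongside them. A fuller version would also have to resolve stems shared by several verbs, which is the kind of ambiguity the in-depth analysis is meant to address.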

The verb lexicon will become an indispensable asset for students and scholars alike, working on any one of the many facets of Tibetan culture, past and present. Outside academia, through its modern component, the lexicon improves access to Tibet-related content in the political and economic sphere. Development aid, humanitarian assistance, medical provisions and educational support are best delivered in conversation with the recipients. These conversations must be conducted in Tibetan. Very few Tibetans are fluent in English and most do not wish to communicate in Chinese. The software we create also advances the creation of new digital tools for Tibetan speakers. The IT sector is reluctant to invest in the language of a people that holds little political or economic influence. As a result, its speakers are excluded from the vast resources of the web. Key to developing such tools is the availability of a Basic Language Resource Kit (BLARK). The lexicon and predicate software bring us one step closer to the completion of a BLARK for Tibetan.

Planned Impact

Tibetan speakers fall into two broad categories: the population of Tibet in China and members of the diaspora dispersed throughout South Asia, Europe and the USA. Both face acute pressures. Chinese interference within Tibet continues to erode the cultural and linguistic identity of its people. Mandarin is taught as the main language in schools and serves as the lingua franca in public life. The diaspora struggles to maintain Tibetan since its members are encouraged to adopt the language(s) of their host nations. We are not able to change state policies, but we can arrest the decline of Tibetan on the web through carefully calibrated language resources. The verb lexicon, with its tools, strengthens Tibetan in the digital world. Digital technologies are slow to develop for languages that command little commercial attention. Basic Language Resource Kits (BLARKs) help create much improved start-up conditions for research, education and developments in such technologies. They are fundamental to text messaging, spell-checking, speech recognition and machine-aided translation. TDC created the first three components of a Tibetan BLARK. We now deliver two of the remaining components: a lemmatiser and a lexical resource. The resultant BLARK reduces start-up investment; it prepares for private sector involvement, without which Tibetan would continue to be all but invisible on the internet.
The lexicon, on its own, improves access to English language materials for Tibetan speakers. It possesses dual search capability: it allows English speakers to learn about Tibetan verbs and helps Tibetans to identify English equivalents. Since it is an online resource, the lexicon is available to all, independent of economic status, social profile or political persuasion.
We propose to offer the lexicon free of charge to three Tibet-related charitable organisations: Esukhia (esukhia.org), TBRC (tbrc.org) and THL (thlib.org). Esukhia is an NGO based in India, staffed by Tibetans and closely aligned with the needs of the diaspora. Even though it commands an international presence, Esukhia is essentially a local organisation with strong roots in community service. We support its work in three ways: we employ two members of staff for three years, enhance its international profile and upgrade its educational provisions through the lexicon.
TBRC and THL work closely with Tibetans in China; indeed, most of TBRC's web traffic comes from China. TBRC provides primarily textual and historical resources; THL has a broader cultural remit: it connects local communities in Tibet with research initiatives in the West in order to document and preserve Tibet's cultural heritage. The presence of our lexicon on these sites adds functionality and helps them to strengthen their work in Tibet.
The lexicon also benefits staff at Western NGOs posted in locations in Tibet where English is rarely spoken. Three prominent organisations have already expressed interest in our work: Machik, Bridge Fund and Trace Foundation. Access to a reliable online lexicon that not only covers spoken Tibetan but also gives information about local dialects holds good practical value in day-to-day communication. It is light, fast and flexible; it provides example sentences and can be readily deployed in conversation.
Even the private sector gains from our work. Firms with an investment in Natural Language Processing, such as Lexical Computing, LinguaSys and Basis Technology, would be able to incorporate the corpus with its tools into their own systems. The project also reaches into mainstream computing. The technologies that Google adopts for its search engines and those behind Twitter's social media software could both be enhanced through our tools. Google, Microsoft and Basis Technology have appointed staff to serve on our Advisory Board. Our resources readily transfer into the corporate arena, at little expense, and help drive forward new language technologies and their application across boundaries.
