Lexicography in Motion: A History of the Tibetan Verb

Lead Research Organisation: School of Oriental and African Studies
Department Name: Sch of History, Religions and Philosophy

Abstract

At one point or another, most language users rely on dictionaries as an authoritative source of lexicographical information. The first recorded dictionaries date back to Sumerian times (3rd millennium BCE), when they were compiled in the course of the linguistic convergence ('Sprachbund') between Akkadian and Sumerian. Since then, dictionaries have played a key role in intercultural communication and advanced scientific research across languages and nation states. Modern-day lexicography still serves these goals, but its methods have changed beyond recognition. Card catalogues have given way to databases and digital resources that offer access to a much larger pool of linguistic data. Today, practically all lexicographers deploy text corpora and corpus querying tools, both to sharpen the empirical base of definitions and to provide contextual examples for the end user. We propose to take advantage of these developments to create a corpus-based diachronic lexicon of Tibetan verbs.

Verbs play a central role in most sentences. Knowledge of the meaning of a verb leads to the arguments it requires and to the semantic roles the arguments, in turn, assume. Our lexicon draws on these links. It will allow the user to infer the complete structure of a sentence, based primarily on the terminal verb and the type of accompanying arguments. Furthermore, it charts the morphological and semantic changes of the verbs from the earliest records of Tibetan in the 8th century CE to contemporary times. Each verb is tracked to its earliest occurrence in the Old Tibetan material within the corpus and then compared with its applications in Classical and Modern Tibetan. Some of the existing dictionaries contain sporadic diachronic information, but this is never analysed or juxtaposed with other data. We propose to identify, examine and contextualise the diachronic evidence in a systematic fashion in order to obtain a better grasp of the evolution of the Tibetan language overall.

Corpus resources and processing tools constitute indispensable components in modern lexicography. For Tibetan, some of these tools are now available. 'Tibetan in Digital Communication' produced a large corpus of Tibetan language material, with part-of-speech tagging, spanning Old, Classical and Modern Tibetan. For the lexicon, we mine its content by running a series of automated queries drawing on Natural Language Processing (NLP) software. First, we create an internal workflow tool. This allows us to categorise, both systematically and comprehensively, all the Tibetan verbs within the corpus. The different forms of the verbs are then grouped together in discrete entries; we analyse in depth verbal stems that display semantic ambiguity, repeated change or morphological irregularity. In parallel, we identify and label the arguments connected with each verb. We use this data, individually and cumulatively, to generate the citations and definitions for the lexicon.
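The grouping step described above can be pictured with a minimal Python sketch. It is an illustration only, not the project's workflow tool: the `Token` record, the `v.` tag prefix, the period labels and the sample tokens (given in Wylie transliteration) are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    form: str    # surface form as it appears in the text (Wylie transliteration)
    lemma: str   # headword assigned by a lemmatiser
    pos: str     # part-of-speech tag
    period: str  # "old", "classical" or "modern"


# Assumption for this sketch: every verb tag starts with "v."
VERB_PREFIX = "v."


def group_verb_forms(tokens):
    """Collect the attested forms of each verb lemma, split by period."""
    entries = defaultdict(lambda: defaultdict(set))
    for tok in tokens:
        if tok.pos.startswith(VERB_PREFIX):
            entries[tok.lemma][tok.period].add(tok.form)
    # Convert to plain dicts with sorted lists so the result is stable
    return {
        lemma: {period: sorted(forms) for period, forms in by_period.items()}
        for lemma, by_period in entries.items()
    }


# Illustrative tokens only; not drawn from the project corpus
sample = [
    Token("bsgrubs", "sgrub", "v.past", "classical"),
    Token("sgrub", "sgrub", "v.pres", "classical"),
    Token("sgrub", "sgrub", "v.pres", "modern"),
    Token("rgyal po", "rgyal po", "n.count", "classical"),
    Token("byas", "byed", "v.past", "old"),
]

entries = group_verb_forms(sample)
```

Keying each entry by lemma and then by period keeps the diachronic comparison explicit: the forms attested for one verb in Old, Classical and Modern Tibetan sit side by side in a single entry.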

The verb lexicon will become an indispensable asset for students and scholars alike, working on any one of the many facets of Tibetan culture, past and present. Outside academia, through its modern component, the lexicon improves access to Tibet-related content in the political and economic sphere. Development aid, humanitarian assistance, medical provisions and educational support are best delivered in conversation with the recipients. These conversations must be conducted in Tibetan. Very few Tibetans are fluent in English and most do not wish to communicate in Chinese. The software we create also advances the creation of new digital tools for Tibetan speakers. The IT sector is reluctant to invest in the language of a people that holds little political or economic influence. Its speakers are excluded from the vast resources of the web. Key to such technologies is the availability of a Basic Language Resource Kit (BLARK). The lexicon and predicate software bring us one step closer to the completion of a BLARK for Tibetan.

Planned Impact

Tibetan speakers fall into two broad categories: the population of Tibet in China and members of the diaspora dispersed throughout South Asia, Europe and the USA. Both face acute pressures. Chinese interference within Tibet continues to erode the cultural and linguistic identity of its people. Mandarin is taught as the main language in schools and serves as lingua franca in public life. The diaspora struggles to maintain Tibetan since its members are encouraged to adopt the language(s) of their host nations. We are not able to change state policies, but we can arrest the decline of Tibetan on the web through carefully calibrated language resources. The verb lexicon, with its tools, strengthens Tibetan in the digital world. Digital technologies are slow to develop for languages that command little commercial attention. Basic Language Resource Kits (BLARKs) help create much improved start-up conditions for research, education and developments in such technologies. They are fundamental to text messaging, spell-checking, speech recognition and machine-aided translation. TDC created the first three components of a Tibetan BLARK. We now deliver two of the remaining components: a lemmatiser and a lexical resource. The resultant BLARK reduces start-up investment; it prepares for private sector involvement, without which Tibetan would continue to be all but invisible on the internet.
The lexicon, on its own, improves access to English language materials for Tibetan speakers. It possesses dual search capability: it allows English speakers to learn about Tibetan verbs and helps Tibetans to identify English equivalents. Since it is an online resource, the lexicon is available to all, independent of economic status, social profile or political persuasion.
We propose to offer the lexicon free of charge to three Tibet-related charitable organisations: Esukhia (esukhia.org), TBRC (tbrc.org) and THL (thlib.org). Esukhia is an NGO based in India, staffed by Tibetans and closely aligned with the needs of the diaspora. Even though it commands an international presence, Esukhia is essentially a local organisation with strong roots in community service. We support its work in three ways: we employ two members of staff for three years, enhance its international profile and upgrade its educational provisions through the lexicon.
TBRC and THL work closely with Tibetans in China; indeed, most of TBRC's web traffic comes from China. TBRC provides primarily textual and historical resources; THL has a broader cultural remit: it connects local communities in Tibet with research initiatives in the West in order to document and preserve Tibet's cultural heritage. The presence of our lexicon on these sites adds functionality and helps them to strengthen their work in Tibet.
The lexicon also benefits staff at Western NGOs posted in locations in Tibet where English is rarely spoken. Three prominent organisations have already expressed interest in our work: Machik, Bridge Fund and Trace Foundation. Access to a reliable online lexicon that not only covers spoken Tibetan but also gives information about local dialects holds good practical value in day-to-day communication. It is light, fast and flexible; it provides example sentences and can be readily deployed in conversation.
Even the private sector gains from our work. Firms with an investment in Natural Language Processing, such as Lexical Computing, LinguaSys and Basis Technology, would be able to incorporate the corpus with its tools into their own systems. The project also reaches into mainstream computing. The technologies that Google adopts for its search engines and Twitter's social media software could both be enhanced through our tools. Google, Microsoft and Basis Technology have appointed staff to serve on our Advisory Board. Our resources readily transfer into the corporate arena, at little expense, and help drive forward new language technologies and their application across boundaries.

Publications

 
Description Christian Faggionato: creation of a Tibetan Sketch Grammar on the Sketch Engine website. Edward Garrett: organised the Tibetan Text Processing Workshop at the University of Virginia, Charlottesville (17-18 October 2018), which was attended by colleagues working on Tibetan cultural databases and by representatives of NGOs working in Tibet at schools and other educational institutions. Ligeia Lugli: is working towards re-using the Gdex+collocation system developed for this project for the Limburgish Corpus Dictionary of the Stichting Limborgse Academie. Christian Faggionato and Edward Garrett: presentation of the project to the Namgyal Institute of Tibetology in Gangtok, Sikkim, and to the Central University of Higher Tibetan Studies in Sarnath. Both institutions are deeply involved in the study and preservation of the Tibetan language among the Tibetan-speaking communities residing in India. Our Tibetan partners regard the tools and research developed so far as an invaluable resource for facilitating and strengthening their educational agenda in relation to the Tibetan language.
First Year Of Impact 2019
Sector Digital/Communication/Information Technologies (including Software); Education; Culture, Heritage, Museums and Collections
Impact Types Cultural

 
Description Computing the Dharma: a natural language processing infrastructure to explore word meanings in Buddhist Sanskrit literature
Amount $97,384 (USD)
Funding ID HAA-277246-21 
Organisation Mangalam Research Center for Buddhist Languages 
Sector Academic/University
Country United States
Start 05/2021 
End 04/2022
 
Description Digital Humanities Advancement Grants
Amount $75,000 (USD)
Funding ID HAA-290402-23 
Organisation NEH National Endowment For The Humanities 
Sector Public
Country United States
Start 09/2023 
End 08/2025
 
Description International Partnership Grant for a Visiting Professorship in Corpus Linguistics and Lexicography at Universidade Estadual Paulista
Amount R$ 20,790 (BRL)
Organisation National Council for Scientific Research 
Sector Public
Country Lebanon
Start 06/2021 
End 07/2021
 
Title Lexicography in Motion Annodoc 
Description This site documents the annotation scheme for Tibetan language texts used by the project Lexicography in Motion (LIM) based at SOAS University of London and the Bavarian Academy of Sciences and Humanities (Bayerische Akademie der Wissenschaften) in Munich. The project is focused on verbs. Our primary objective is to annotate the predicate-argument structure of verbs in the service of building a corpus-based Tibetan verb lexicon. The annotation scheme follows the guidelines of the Universal Dependencies project. 
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact It has led to the development of the annotated SOAS corpus. Its use outside SOAS is not known at this time. 
URL https://tibetan-nlp.github.io/lim-annodoc/
 
Title A corpus of tagged and annotated Tibetan texts 
Description This resource contains Old, Classical and Modern Tibetan texts that have been POS-tagged and annotated for verbal argument structure, or annotated for verbal argument structure only.
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact The corpus may have been used by researchers outside of our project, in their own work on Tibetan corpora and natural language processing. 
URL https://github.com/tibetan-nlp/soas-corpus
 
Title A lexicon of Tibetan verb stems 
Description A digital version of the following print dictionary, suitable for automated processing and use by natural language processing tools: Hill, Nathan W. (2010) A Lexicon of Tibetan Verb Stems as Reported by the Grammatical Tradition. Munich: Bayerische Akademie der Wissenschaften. (Studia Tibetica)
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
Impact This research database has had significant impact on the development of other databases and methods used within the funded project. Its use outside the project is not known. 
URL https://github.com/tibetan-nlp/lexicon-of-tibetan-verb-stems
 
Title Awesome Tibetan NLP 
Description An "awesome list" or curated list of resources related to Tibetan natural language processing. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
Impact People outside our project have contributed to the resource and are using it to guide their investigations into Tibetan natural language processing. 
URL https://github.com/tibetan-nlp/awesome-tibetan-nlp
 
Title Classical Tibetan corpus annotated for verb-argument dependency relations 
Description Classical Tibetan corpus annotated for verb-argument dependency relations 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact to be established 
URL http://doi.org/10.5281/zenodo.4727108
 
Title Lexicon of Tibetan Verb Stems (v. 2.0) 
Description Lexicon of Tibetan Verb Stems (v2.0) 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact none as yet 
URL https://doi.org/10.5281/zenodo.4726991
 
Title Modern Tibetan corpus annotated for verb-argument dependency relations (v1.0) 
Description Modern Tibetan corpus annotated for verb-argument dependency relations (v1.0) 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact none available yet 
URL https://doi.org/10.5281/zenodo.4727129
 
Title Old Tibetan Corpus and Normalization Grammar (Version v1.0) 
Description Old Tibetan Corpus and Normalization Grammar (Version v1.0) 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact to be established 
URL http://doi.org/10.5281/zenodo.4727552
 
Title Visual Dictionary of Tibetan Verb Valency: Data 
Description Visual Dictionary of Tibetan Verb Valency: Data 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact to be established 
URL https://doi.org/10.5281/zenodo.5596064
 
Title A Visual Dictionary of Tibetan Verb Valency Version 1 
Description The software consists of an R-Shiny web application that powers an online digital dictionary. 
Type Of Technology Webtool/Application 
Year Produced 2021 
Open Source License? Yes  
Impact There are no tangible impacts yet, as the software is still to be released to the public. Expected release date April 20 2021. 
URL https://mangalamresearch.shinyapps.io/VisualDictionaryOfTibetanVerbValency
 
Title Cg3 grammar to normalize Old Tibetan Unicode to Classical Tibetan Unicode 
Description Cg3 grammar to normalize Old Tibetan Unicode to Classical Tibetan Unicode: this set of Cg3 rules converts Old Tibetan texts into Classical Tibetan texts. In this way it is possible to apply existing NLP tools for Classical Tibetan to Old Tibetan texts, facilitating future research in the field of historical linguistics and Tibetan studies in general. 
Type Of Technology Webtool/Application 
Year Produced 2020 
Open Source License? Yes  
Impact This tool is yet to generate impact since it has only very recently been produced. 
URL https://github.com/tibetan-nlp/tibcg3
 
Title Constraint grammars for Tibetan dependency parsing 
Description Constraint grammars for Tibetan dependency parsing 
Type Of Technology Webtool/Application 
Year Produced 2021 
Open Source License? Yes  
Impact to be established 
 
Title Constraint grammars for Tibetan language processing 
Description A GitHub repository that collects various grammars for processing Tibetan texts using the VISL CG3 formalism. Grammars are included for: (1) normalizing Old Tibetan texts to Classical Tibetan; (2) automatically annotating the verbal argument structure of POS-tagged Tibetan texts; (3) attaching additional grammatical dependencies to a POS-tagged Tibetan text that has been annotated for verbal argument structure.
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact This software has been used internally within the project, and is available for wider use. We do not know what the uptake has been outside of our project. 
URL https://github.com/tibetan-nlp/tibcg3
 
Title Lexonomy dictionary of Tibetan verbs 
Description Lexonomy is a cloud-based, open-source system for writing and publishing dictionaries. We have implemented a Tibetan verb dictionary in Lexonomy. This resource is currently for internal use only; it will be released to the public at the end of the project.
Type Of Technology Webtool/Application 
Year Produced 2019 
Open Source License? Yes  
Impact This has been used within the project. Its impact outside the project is not known. 
URL https://www.lexonomy.eu/zv8iruxkg/
 
Title NER for Tibetan and Mongolian Newspapers
Description NER for Tibetan and Mongolian Newspapers. Cambridge: Cambridge Open Engage.
Type Of Technology Webtool/Application 
Year Produced 2021 
Open Source License? Yes  
Impact to be established 
 
Title Visual Dictionary of Tibetan Verb Valency 
Description Lexicon (Tibetan/English) of Tibetan Verbs with a focus on Valency 
Type Of Technology Webtool/Application 
Year Produced 2021 
Open Source License? Yes  
Impact to be established 
URL https://mangalamresearch.shinyapps.io/VisualDictionaryOfTibetanVerbValency/
 
Title Visual Dictionary of Tibetan Verb Valency Version 0.1 
Description This web application is a visual dictionary of Tibetan verb valency implemented using R on shinyapps. It is currently under development and the URL is not discoverable. It will be released and discoverable at the conclusion of the project. 
Type Of Technology Webtool/Application 
Year Produced 2020 
Impact Since the URL is not discoverable, this web application has not yet had impact outside the project. 
URL http://mangalamresearch.shinyapps.io/VisualDictionaryOfTibetanVerbValency/
 
Description Conference Presentation 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Lugli, Ligeia, "Smart Lexicography for low resource languages: lessons learned from Sanskrit and Tibetan", eLex 2019, Sintra, October 2019.
Year(s) Of Engagement Activity 2019
 
Description Conference Presentation 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Christian Faggionato, "Constraint Grammars for Tibetan Language Processing", Proceedings of the 22nd Nordic Conference on Computational Linguistics, Department of Future Technologies, University of Turku, Finland, 2019 (30-09-2019)
Year(s) Of Engagement Activity 2019
 
Description Conference Presentation 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Edward Garrett, "Learning Tibetan from a Tagged and Parsed Corpus". Fifteenth Seminar of the International Association for Tibetan Studies. Institut National des Langues et Civilisations Orientales, Paris (7-13 July 2019)
Year(s) Of Engagement Activity 2019
 
Description Conference Presentation 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Ligeia Lugli and Edward Garrett. "Diachronic Valency Dictionary of Tibetan Verbs: A progress report". Cardiff University, Cardiff (22-26 July 2019)
Year(s) Of Engagement Activity 2019
 
Description Conference Presentation 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Faggionato, Christian, "Developing the Old Tibetan Treebank", NSURL 2019 Workshop on NLP Solutions for Under Resourced Languages, Department of Computer Science and Information Engineering, University of Trento, Italy, 2019 (11-09-2019)
Year(s) Of Engagement Activity 2019
 
Description Conference Presentation 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Faggionato, Christian, "The Old Tibetan Treebank", 2019 Conference on Recent Advances in Natural Language Processing, Varna, Bulgaria, 2019 (04-09-2019)
Year(s) Of Engagement Activity 2019
 
Description Conference paper 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Christian Faggionato, Natural Language Processing of Tibetan Texts; Seminar lecture for the course in "Introduction to Machine Learning and Evolutionary Robotics", Department of Engineering and Architecture, University of Trieste, 2020.
Year(s) Of Engagement Activity 2020
 
Description Conference paper: Faggionato, Garrett, Rode & Solmsdorf: A Dependency Tagged Corpus of Tibetan Texts (34th South Asian Languages Analysis Roundtable, University of Konstanz, June 2018) 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Academic audience of ca 60 colleagues and PhD students was informed of the first findings of Lexicography in Motion.
Year(s) Of Engagement Activity 2018
 
Description Edward Garrett: Learning Tibetan from a Tagged and Parsed Corpus (Accepted, International Association for Tibetan Studies, 15th Seminar, Paris, July 2019) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The purpose of this presentation is to introduce the findings of LIM to a wider international academic audience for feedback, assessment and future directions of study. IATS conferences are attended by NGOs and other agencies working in Tibet and in India with Tibetan refugees.
Year(s) Of Engagement Activity 2019
 
Description Edward Garrett: Towards a Universal Dependency Treebank of Tibetan Texts (Tibetan Text Processing Workshop, Charlottesville, October 2018) 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Presentation exploring impact avenues of LIM research for Tibetan speakers in Tibet and China.
Year(s) Of Engagement Activity 2018
 
Description Ligeia Lugli & Edward Garrett: Diachronic Valency Dictionary of Tibetan Verbs: A progress report (Accepted, Corpus Linguistics 2019, Cardiff, July 2019) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact The purpose of this presentation is to introduce the findings of LIM to a wider academic audience for feedback, assessment and future directions of study.
Year(s) Of Engagement Activity 2019
 
Description Samyo Rode & Nikolai Solmsdorf: Neural Dependency Parsing (LMU, January 29, 2019)
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Release of research findings springing from LIM to wider academic audience at the University of Munich
Year(s) Of Engagement Activity 2019
 
Description Samyo Rode & Nikolai Solmsdorf: Relationsextraktion - Ein Werkstattbericht (LMU, January 21, 2019) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Progress report of research and research methodologies driving forward work on LIM to academic audience at the University of Munich.
Year(s) Of Engagement Activity 2019
 
Description Samyo Rode & Nikolai Solmsdorf: Wortsegmentierung im Tibetischen (LMU, December 18, 2018) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Release of research findings springing from LIM to wider academic audience at the University of Munich. This generated questions about Tibetan digital research and broader corpus linguistics.
Year(s) Of Engagement Activity 2018
 
Description Workshop Participation 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Faggionato, Christian, RANLP '19 - Summer School on Deep Learning in NLP, Varna, Bulgaria (29-08-2019 - 01-09-2019)
Year(s) Of Engagement Activity 2019
 
Description Workshop Participation 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Supporters
Results and Impact Lugli, Ligeia, "Not smart enough (yet): Smart lexicography for Classical Tibetan", Lexicography in Motion Peer Review Workshop, SOAS University of London, 5 March 2019
Year(s) Of Engagement Activity 2019
 
Description Workshop Participation 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Supporters
Results and Impact Lugli, Ligeia, "Automatic filtering of good dictionary examples for Classical Tibetan", Lexicography in Motion Peer Review Workshop, SOAS University of London, 8 December 2017.
Year(s) Of Engagement Activity 2017
 
Description Workshop Participation 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact Solmsdorf, Nikolai and Samyo Rode, "Exploring a Classical Corpus", Lexicography in Motion, Peer Review Workshop, SOAS University of London, 5 March 2019
Year(s) Of Engagement Activity 2019