Tibetan in Digital Communication: Corpus Linguistics and Lexicography

Lead Research Organisation: School of Oriental and African Studies
Department Name: Study of Religions

Abstract

In age, breadth and diversity of genre, Tibetan literature is in every way comparable to English. The Tibetan alphabet was invented in 650 CE. The earliest currently available securely dateable document dates to ca. 763 CE. Literary production has continued from that time unabated until today. Yet, the lexicographical resources of Tibetan are very inadequate and vastly inferior to what is available to English speakers. In total, students of Tibetan can draw on about a dozen dictionaries, most for Classical Tibetan. The scope of these lexicons tends to be poorly defined, and none of them meets the standards of scientific lexicography. Moreover, there is not a single work that covers the earliest period of Tibetan literature, Old Tibetan (650-1000 CE). The corpus and tools we propose to create will serve as the first step to advance the compilation of a comprehensive historical Tibetan dictionary akin to the Oxford English Dictionary.

In order to achieve this, we propose to produce a large corpus of Tibetan texts spanning the language's entire history, drawn from Old, Classical and Modern Tibetan. In the past, scholars compiled large dictionaries from laboriously assembled collections of slips, organised and stored in vast filing cabinets. Advances in computational linguistics mean that this work can now be achieved more thoroughly and effectively through the creation of annotated digital corpora. Our corpus, once carefully analysed and tagged, will not only pave the way for the compilation of Tibetan dictionaries of hitherto inconceivable calibre; it will also prepare the ground for a wide range of other significant research initiatives. Once it is mounted on the Web, scholars from a wide range of disciplines (history, religion, literature, linguistics, etc.) working with Tibetan language materials will be able to search it and use its content for their own research. It is thus likely to become foundational to a vast array of research initiatives, benefiting many different constituencies in academia.

Outside academia, in the modern world of electronic communication, our corpus will lay the foundation for the creation of new digital technologies for Tibetan (text messaging, automated translation, etc.). The high investment required to develop language software leaves languages without commercial or political power isolated and poorly resourced. Digital communication technologies are built on basic language processing tools (e.g. word-segmentation programmes, part-of-speech taggers) of the very type we propose to create. Our work will reduce the cost of developing such technologies and thus attract commercial interest. Although Tibetan is spoken by more than two million people, it is barely represented in electronic media as a spoken language. We seek to remedy this by creating an electronic resource that will restore to Tibetans, irrespective of their residence or adopted nationality, the choice to use their language as they see fit in a world that is increasingly shaped by digital communication.

Planned Impact

We propose to create a carefully tagged corpus of Tibetan texts, spanning fourteen centuries. This resource prepares the ground for improvements in Tibetan lexicography. The corpus and electronic tools we seek to develop benefit several constituencies, holding the promise to serve anyone interested in Tibetan language and culture in a digital context. Their design and structures are sufficiently flexible to lend themselves to multiple applications. In 1997 natural language processing research developed the notion of a Basic Language Resource Kit (BLARK). BLARKs consist of sets of computational tools foundational to the creation of language technologies. BLARKs are a prerequisite for text messaging systems and spellcheckers. They are also central to the production of more specialised applications, such as speech recognition software, optical character recognition and screen-reading devices for the blind. BLARKs also enable cross-linguistic communication, including machine-aided translation and inter-lingual software for the Internet. Our project sets out to develop two key components of a BLARK for the Tibetan language: a carefully tagged corpus identifying part-of-speech and an automatic part-of-speech tagger.

Although Tibetan is the primary language of over two million people, its speakers continue to be excluded from even the most rudimentary language technologies. In a country as vast and remote as Tibet, mobile networks constitute the most effective means of telecommunication. Tibetan does not have a fully functional spellchecker with the ability to analyse word composition. Mobile text communication is only available to a privileged elite with sufficient resources to purchase specialist handsets. To most Tibetans, these are wholly out of reach. As a result, virtually everyone is forced to text (SMS) in either English or Chinese. These are serious limitations since very few Tibetans are fluent in English and most do not wish to communicate in Chinese. As a result, Tibetans are reluctant to use these services. This has profound consequences. (1) It creates a barrier between Tibetans in China (who generally do not know English) and Tibetans in the diaspora (who often do not speak Chinese). (2) The imposed use of Chinese gives Beijing tight control over text messaging traffic in Tibet. (3) It undermines the status of Tibetan in social, cultural and commercial situations. In education, Tibetan has long been replaced by Chinese. Most public discourse is conducted in Chinese. (4) It disenfranchises Tibetan speakers living in rural areas who, with little access to formal education, possess only a rudimentary grasp of Chinese.

We are not in a position to develop those much-needed technologies ourselves. This is best left to established software companies. We propose to prepare the way for the creation of a Tibetan BLARK by publishing our corpus and electronic tools online with open-content and open-source licenses. We shall also deposit them with the Oxford Text Archive and the Tibetan and Himalayan Library. Microsoft (Thierry Fontenelle, Senior Program Manager, Natural Language Group) has expressed keen interest in our initiative. Inevitably perhaps, commercial constraints will always limit their investment in digital technologies for 'minor' languages. This explains why they have produced a plethora of tools for English and French, for example, but so little for Basque or Latvian. It is thus not without self-interest that Microsoft has pledged to increase Tibetan language support for its software once granted access to our resources. In other words, the proposed research will form a bridge connecting fourteen centuries of Tibetan language with the age of digital communication. It will help millions of Tibetan speakers, in Tibet and the diaspora, to maintain their language and to communicate with each other.

Publications

Garrett, E (2014) A Rule-based Part-of-speech Tagger for Classical Tibetan in Himalayan Linguistics

Garrett, E (2015) Constituent Order in the Tibetan Noun Phrase in SOAS Working Papers in Linguistics

Hill, N (2014) Tibetan √lan 'reply' in Journal of the Royal Asiatic Society of Great Britain & Ireland

Hill, N (2015) Tibetan part-of-speech conundrums: maṅ and yun riṅ in Rocznik Orientalistyczny

 
Title A part-of-speech (POS) lexicon of Classical Tibetan for NLP 
Description This part-of-speech (POS) lexicon of Classical Tibetan was prepared in the course of the research project 'Tibetan in Digital Communication' (2012-2015) hosted at SOAS, University of London and funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1). The data for verbs come from a digitized version of A Lexicon of Tibetan Verb Stems as Reported by the Grammatical Tradition (Munich: Bayerische Akademie der Wissenschaften, 2010) by Nathan W. Hill. The remaining data come from the manually part-of-speech tagged training data produced for the corpus and from a few lexical items added by hand to improve rule-based tagging.
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact This software tool created the foundation for another AHRC award, secured in December 2016 
URL https://zenodo.org/record/574876#.WR8gm8m1unc
 
Title A part-of-speech (POS) tagged corpus of Classical Tibetan 
Description This part-of-speech (POS) tagged corpus of Classical Tibetan was prepared in the course of the research project 'Tibetan in Digital Communication' (2012-2015) hosted at SOAS, University of London and funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1). For a description of the tag set see Garrett et al. 2014 and Garrett et al. 2015. This corpus includes the Mdzaṅs blun (9th century, canonical), the Bu ston chos ḥbyuṅ (13th century, ecclesiastical history), the Mi la ras paḥi rnam thar and Mar paḥi rnam thar (15th century, biography).
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact The production of this software led to a follow-up AHRC award aimed at the production of a Tibetan verb lexicon.
URL https://zenodo.org/record/574878#.WR8f3Mm1unc
 
Title A rule based Tibetan part-of-speech (POS) tagger for the creation of gold standard training data 
Description This rule-based Tibetan part-of-speech (POS) tagger was prepared in the course of the research project 'Tibetan in Digital Communication' (2012-2015) hosted at SOAS, University of London and funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1). For a description of the tag set see Garrett et al. 2014 and Garrett et al. 2015. For a description of the tagger itself see Garrett et al. 2014. Note that the tagger must be used together with a lexicon (for example Hill & Garrett 2017a). One must use one's own script to tag all words with all tags in the lexicon and then apply the tagger to remove incorrect tags. On the associated corpus of 318,230 words (Hill & Garrett 2017b), the lexical tagger (i.e. simply applying all available tags to all words) tags 141,911 words with the correct unique tag and achieves an accuracy of 1.000 (by definition, since the right tag is always among those assigned to each word) with an ambiguity of 2.73111. In contrast, the Rule Tagger tags 241,256 words with the correct unique tag and achieves an accuracy of 0.99893 with an ambiguity of 1.38577. Because this tagger does not achieve an ambiguity of 1.000, it is not suitable for tagging large-scale corpora, but it is useful for the creation of gold standard training data.
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact Preparation of further AHRC application: award granted in December 2017 
URL https://zenodo.org/record/574882#.WR8d0Mm1unc
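
The two-step workflow described for the rule-based tagger (tag every word with all of its lexicon tags, then let rules remove incorrect tags) and the three reported figures can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy lexicon entries, the tag names, the single rule and the function names are hypothetical and are not taken from the published lexicon or tagger. Ambiguity is assumed here to mean the average number of tags left on each word, and accuracy the share of words whose remaining tags still include the correct one.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon (word form -> candidate tags). The entries and tag
# names are illustrative assumptions, not data from the published lexicon.
LEXICON: Dict[str, Set[str]] = {
    "rgyal-po": {"n.count"},
    "la": {"case.all", "n.count"},
    "smras": {"v.past", "v.imp", "n.v.past"},
}


def lexical_tag(words: List[str], lexicon: Dict[str, Set[str]]) -> List[Set[str]]:
    """Give each word every tag the lexicon lists for it (empty set if unknown)."""
    return [set(lexicon.get(word, set())) for word in words]


def prefer_case_after_noun(tagged: List[Set[str]]) -> List[Set[str]]:
    """Hypothetical rule: after an unambiguous noun, keep only the case reading."""
    result = [set(tags) for tags in tagged]
    for i in range(1, len(result)):
        if result[i - 1] == {"n.count"} and "case.all" in result[i]:
            result[i] = {"case.all"}
    return result


def evaluate(tagged: List[Set[str]], gold: List[str]) -> Tuple[int, float, float]:
    """Return (words with the correct unique tag, accuracy, ambiguity)."""
    total = len(gold)
    # Words left with exactly one tag, and that tag is the gold tag.
    unique_correct = sum(1 for tags, g in zip(tagged, gold) if tags == {g})
    # Share of words whose remaining tags still include the gold tag.
    accuracy = sum(1 for tags, g in zip(tagged, gold) if g in tags) / total
    # Assumed definition: average number of tags remaining per word.
    ambiguity = sum(len(tags) for tags in tagged) / total
    return unique_correct, accuracy, ambiguity


if __name__ == "__main__":
    words = ["rgyal-po", "la", "smras"]
    gold = ["n.count", "case.all", "v.past"]
    lexical = lexical_tag(words, LEXICON)
    ruled = prefer_case_after_noun(lexical)
    print("lexical tagger:", evaluate(lexical, gold))
    print("after rule:    ", evaluate(ruled, gold))
```

In this toy run the rule raises the number of uniquely and correctly tagged words and lowers the ambiguity without losing any correct tag, which mirrors the direction of the reported figures. On those figures, the share of the 318,230 words carrying the correct unique tag rises from about 44.6% (141,911) with the lexical tagger to about 75.8% (241,256) with the Rule Tagger.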