The Emergence of Egophoricity: a diachronic investigation into the marking of the conscious self

Lead Research Organisation: SOAS University of London
Department Name: East Asian Languages and Cultures

Abstract

This project looks at the way certain Tibetan and Newar varieties express the perspective of the speaker in the sentence. In Lhasa Tibetan, for example, the auxiliary verb 'yin' can be used in sentences where the speaker is the subject (nga em-chi yin '*I'm* a doctor'), if the speaker wants to identify their personal relation or possession ('di nga'i bu-mo yin 'This is *my* daughter') or if the speaker chooses to emphasise who performed an action ('di khyed-rang-gi gsol-ja yin 'This is your tea [that *I* have made for you]'). Other Tibetan varieties, such as Jirel or South Mustang Tibetan, also exhibit egophoric markers like Lhasa Tibetan 'yin', but not always in the same contexts. In Newar varieties that are also spoken in Nepal, however, egophoric marking consists of long vowels in verbal endings rather than separate (auxiliary) verbs (ji Manaj napalan-aa 'I (the speaker) met Manoj as planned' vs. ji Manaj napalan-a 'I met Manoj by coincidence'). Finally, in older stages of both Tibetan and Newar varieties, this egophoric marking cannot be found. The central question that this project aims to answer is how and why specific grammatical markers that indicate the speaker's involvement emerge over time in ways that slightly differ, even in closely related languages. What subtle grammatical clues can be found in older stages of these languages that in later stages result in egophoric marking?

In this project we first investigate how Present-Day Tibetan and Newar varieties grammatically express the speaker's involvement. For this purpose we will create annotated corpora: digital text collections enriched with linguistic information about the structure and meaning of each element in the sentence. Because there is no data available yet for the highly endangered Lalitpur Newar variety, we will conduct fieldwork in Nepal to document the language and collect texts for our corpora. We then add the same linguistic information to historical texts. Older archive texts in South Mustang Tibetan, for example, will be compared to 18th- and 19th-century texts written in standard Classical Tibetan to investigate the development of the Present-Day Lhasa Tibetan egophoric marker 'byung', which indicates the speaker is the recipient of an action (khong gis ngar yige btang byung 'He sent *me* a letter.'). Present-Day South Mustang Tibetan also has a verb 'byung', which goes back to Old and Classical Tibetan 'byung' meaning 'receive, get'. But unlike in Lhasa Tibetan, this verb in South Mustang Tibetan has not changed into an egophoric auxiliary verb. Because of the extensive and consistent linguistic annotation of our corpora, we will be able to systematically study subtle differences in the use of verbs like 'byung'. Since our corpora will contain not only morphosyntactic annotation but also information about meaning and function in discourse context, we will be in a unique position to investigate complex grammatical phenomena like egophoricity. Investigating this in a historical context gives us the opportunity to test theories of language change that make predictions about triggers and mechanisms of change in particular. Are language-internal factors (e.g. changes in phonology) responsible for the emergence of egophoric marking, can language-external factors (language contact) play a role, and/or can we observe a combination of factors in these languages that have throughout history been spoken by people in close proximity in Nepal?

Finally, since even closely related Tibetan and Newar varieties exhibit some significant differences, comparison with egophoric marking in other languages can provide further clues to this complex phenomenon. In the final year of the project, we will therefore put our findings from Tibetan and Newar in a crosslinguistic perspective.

Publications

 
Description In 2023, this project made significant strides in understanding the development and use of egophoric markers, which signify the conscious self, in the Tibetan and Newar languages. The creation of the Diachronic Annotated Corpus of Newar (DACON) represented a groundbreaking effort to facilitate linguistic research across a millennium, enabling scholars to track changes in Newar language use over time. Similarly, our study presented at the Himalayan Languages Symposium 2023 elucidated the evolution of evidentiality markers in Tibetan, demonstrating how semantic and information-structural factors influence the choice of subordinate markers. Furthermore, this project's efforts in language preservation, particularly through the development of Automatic Speech Recognition (ASR) models for the endangered Newar and Dzardzongke languages, highlighted the project's commitment to safeguarding linguistic diversity. The project's comprehensive approach, combining diachronic corpus analysis with cutting-edge NLP tools, has significantly advanced our ability to understand egophoricity and its lexical origins in these languages.
Exploitation Route The economic implications of this project extend beyond its academic achievements. By fostering advancements in Natural Language Processing (NLP) and Automatic Speech Recognition (ASR) technologies, the project has contributed to the UK's tech sector, particularly in the areas of language technology and AI. The development of ASR models for low-resource languages not only addresses critical issues in language preservation but also positions the UK as a leader in the application of technology to linguistics. The employment of research assistants in Nepal for transcription work further supports international collaboration and capacity building, potentially opening new markets for UK-based technological solutions. Additionally, the project's success in generating a highly specialized corpus and developing NLP tools for under-researched languages can stimulate innovation in the broader field of computational linguistics, contributing to economic growth through the creation of new knowledge, technologies, and applications that enhance the UK's competitive edge in the global technology landscape.
Sectors Digital/Communication/Information Technologies (including Software)

Education

 
Title Diaspora Kathmandu Newar 2019 
Description Dataset of recordings, plus transcriptions, for the study of the Kathmandu Newar dialect in a diaspora setting. Recordings of read materials contributed by Sanyukta Shrestha in London, 2019, with permission for his voice to be used for research purposes. Recorded by Nathan Hill, curated by Alexander James O'Neill. 
Type Of Material Database/Collection of data 
Year Produced 2024 
Provided To Others? Yes  
Impact 1 download, 13 views 
URL https://zenodo.org/doi/10.5281/zenodo.10611827
 
Title Lalitpur Newar 2022 
Description Dataset of recordings, plus metadata, for the study of the Lalitpur Newar dialect. Collected in 2022 with the consent of participants for its use for research purposes. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact Dataset is too new to have produced much impact yet, but it is the first open access corpus of spoken Newar. 
URL https://zenodo.org/record/7501051
 
Title OCR model for Pracalit for Sanskrit and Newar MSS 16th to 19th C., Ground Truth 
Description Ground truth data (png and xml files) for an OCR model. Will be continually updated. Originally trained on Transkribus with a PyLaia model created from ground truth data based on transcripts into Pracalit Unicode of four Nepalese manuscripts. The manuscripts used to create this model are Staatsbibliothek zu Berlin's Hitopadeśa (MIK I 4851) (mixed Newar and Sanskrit dating to 1561) and Vetālapañcaviṃśati (HS. Or. 6414) (Newar dating to 1675) as well as Cambridge Digital Library's Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) (Sanskrit, 18th century) and the Royal Asiatic Society Online Collection's Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) (Newar and Sanskrit dating to c. 1800). The training was done on 441 pages and validation on 242 pages. This model does not recognise spacing, except for large gaps (i.e. for pictures or string holes). Newar word divider markers may not be represented or may be transcribed as virama. In general, the model is made for MSS with scriptio continua and will transcribe into scriptio continua in Pracalit Unicode. Transcription was performed by Dr Alexander O'Neill (SOAS University of London). Transcription of the Vetālapañcaviṃśati (HS. Or. 6414) and Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) was aided by unpublished materials provided by Dr Felix Otter (Philipps-Universität Marburg), as well as the published transcription in Shakya, Min Bahadur, and Shanta Harsha Bajracharya, eds. "Svayambhū Purāṇa." Lalitpur: Nagarjuna Institute of Exact Methods, 2001. The transcription of Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) was aided by the transcription provided by the Digital Sanskrit Buddhist Canon Project based on Lokesh Chandra, "Guṇakāraṇḍavyūhasūtram," New Delhi: International Academy of Indian Culture, 1999. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact This dataset has made HTR for Newar Pracalit possible for the first time. 
URL https://zenodo.org/record/6967421
 
Description 2023 Language Preservation through ASR, poster presentation given at Cambridge Language Sciences Annual Symposium, 2023. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact 2023 Language Preservation through ASR, poster presentation given at Cambridge Language Sciences Annual Symposium, 2023.
Year(s) Of Engagement Activity 2023
 
Description 2023 The Emergence of Egophoricity: NLP Pipeline for Diachronic Tibetan and Newar Corpus Analysis, paper given at Perspectives of Digital Humanities in the Field of Buddhism, Universität Hamburg, Numata Zentrum für Buddhismuskunde & Khyentse Center for Tibetan Buddhist Textual Scholarship. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact 2023 The Emergence of Egophoricity: NLP Pipeline for Diachronic Tibetan and Newar Corpus Analysis, paper given at Perspectives of Digital Humanities in the Field of Buddhism, Universität Hamburg, Numata Zentrum für Buddhismuskunde & Khyentse Center for Tibetan Buddhist Textual Scholarship.
Year(s) Of Engagement Activity 2023
 
Description 2024 Participant in the roundtable "Transkribus for Asian Languages" at the Transkribus User Conference 24, Innsbruck, Austria. Convened by Rachael Griffiths. 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact 2024 Participant in the roundtable "Transkribus for Asian Languages" at the Transkribus User Conference 24, Innsbruck, Austria. Convened by Rachael Griffiths.
Year(s) Of Engagement Activity 2024
 
Description Handwritten Text Recognition for Pracalit Script Manuscripts 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact 2024 Handwritten Text Recognition for Pracalit Script Manuscripts, paper given for the section "Transcribe faster, Discover more: Studying handwritten and printed documents in the age of Transkribus" at the Transkribus User Conference 24, Innsbruck, Austria.
Year(s) Of Engagement Activity 2024