The Emergence of Egophoricity: a diachronic investigation into the marking of the conscious self

Lead Research Organisation: SOAS University of London
Department Name: East Asian Languages and Cultures

Abstract

This project looks at the way certain Tibetan and Newar varieties express the perspective of the speaker in the sentence. In Lhasa Tibetan, for example, the auxiliary verb 'yin' can be used in sentences where the speaker is the subject (nga em-chi yin '*I'm* a doctor'), if the speaker wants to identify their personal relation or possession ('di nga'i bu-mo yin 'This is *my* daughter') or if the speaker chooses to emphasise who performed an action ('di khyed-rang-gi gsol-ja yin 'This is your tea [that *I* have made for you]'). Other Tibetan varieties, such as Jirel or South Mustang Tibetan, also exhibit egophoric markers like Lhasa Tibetan 'yin', but not always in the same contexts. In Newar varieties that are also spoken in Nepal, however, egophoric marking consists of long vowels in verbal endings rather than separate (auxiliary) verbs (ji Manaj napalan-aa 'I (the speaker) met Manoj as planned' vs. ji Manaj napalan-a 'I met Manoj by coincidence'). Finally, in older stages of both Tibetan and Newar varieties, this egophoric marking cannot be found. The central question that this project aims to answer is how and why specific grammatical markers that indicate the speaker's involvement emerge over time in ways that slightly differ, even in closely related languages. What subtle grammatical clues can be found in older stages of these languages that in later stages result in egophoric marking?

In this project we first investigate how Present-Day Tibetan and Newar varieties grammatically express the speaker's involvement. For this purpose we will create annotated corpora: digital text collections enriched with linguistic information about the structure and meaning of each element in the sentence. Because there is no data available yet for the highly endangered Lalitpur Newar variety, we will conduct fieldwork in Nepal to document the language and collect texts for our corpora. We then add the same linguistic information to historical texts. Older archive texts in South Mustang Tibetan, for example, will be compared to 18th-19th-century texts written in standard Classical Tibetan to investigate the development of the Present-Day Lhasa Tibetan egophoric marker 'byung', which indicates the speaker is the recipient of an action (khong gis ngar yige btang byung 'He sent *me* a letter.'). Present-Day South Mustang Tibetan also has a verb 'byung', which goes back to Old and Classical Tibetan 'byung' meaning 'receive, get'. But unlike in Lhasa Tibetan, this verb in South Mustang Tibetan has not changed into an egophoric auxiliary verb. Because of the extensive and consistent linguistic annotation of our corpora, we will be able to systematically study subtle differences in the use of verbs like 'byung'. Since our corpora will contain not only morphosyntactic annotation but also information about meaning and function in discourse context, we will be in a unique position to investigate complex grammatical phenomena like egophoricity. Investigating this in a historical context gives us the opportunity to test theories of language change that make predictions about triggers and mechanisms of change in particular. Are language-internal factors (e.g. changes in phonology) responsible for the emergence of egophoric marking, can language-external factors (language contact) play a role, and/or can we observe a combination of factors in these languages, which have throughout history been spoken by people in close proximity in Nepal?

Finally, since even closely related Tibetan and Newar varieties exhibit some significant differences, comparison with egophoric marking in other languages can provide further clues about this complex phenomenon. In the final year of the project, we will therefore put our findings from Tibetan and Newar in crosslinguistic perspective.

Publications

Meelen M. (2024) End-to-End Speech Recognition for Endangered Languages of Nepal in ComputEL 2024 - 7th Workshop on the Use of Computational Methods in the Study of Endangered Languages, Proceedings of the Workshop

O'Neill A (2022) Text Recognition for Nepalese Manuscripts in Pracalit Script in Journal of Open Humanities Data

O'Neill A (2024) The Diachronic Annotated Corpus of Newar: From manuscript to morphosyntax in Cahiers de Linguistique Asie Orientale

 
Description In 2023, this project made significant strides in understanding the development and use of egophoric markers, which signify the conscious self, in the Tibetan and Newar languages. The creation of the Diachronic Annotated Corpus of Newar (DACON) represented a groundbreaking effort to facilitate linguistic research across a millennium, enabling scholars to track changes in Newar language use over time. Similarly, our study presented at the Himalayan Languages Symposium 2023 elucidated the evolution of evidentiality markers in Tibetan, demonstrating how semantic and information-structural factors influence the choice of subordinate markers. Furthermore, this project's efforts in language preservation, particularly through the development of Automatic Speech Recognition (ASR) models for the endangered Newar and Dzardzongke languages, highlighted the project's commitment to safeguarding linguistic diversity. The project's comprehensive approach, combining diachronic corpus analysis with cutting-edge NLP tools, has significantly advanced our ability to understand egophoricity and its lexical origins in these languages.
Exploitation Route The economic implications of this project extend beyond its academic achievements. By fostering advancements in Natural Language Processing (NLP) and Automatic Speech Recognition (ASR) technologies, the project has contributed to the UK's tech sector, particularly in the areas of language technology and AI. The development of ASR models for low-resource languages not only addresses critical issues in language preservation but also positions the UK as a leader in the application of technology to linguistics. The employment of research assistants in Nepal for transcription work further supports international collaboration and capacity building, potentially opening new markets for UK-based technological solutions. Additionally, the project's success in generating a highly specialized corpus and developing NLP tools for under-researched languages can stimulate innovation in the broader field of computational linguistics, contributing to economic growth through the creation of new knowledge, technologies, and applications that enhance the UK's competitive edge in the global technology landscape.
Sectors Digital/Communication/Information Technologies (including Software)

Education

 
Title Classical Newar Annotation Manual: Part I: Preprocessing 
Description This is part one of the Classical Newar Annotation Manual for the Diachronic Annotated Corpus of Newar (DACON), detailing preprocessing procedures for corpus annotation. Creators: Marieke Meelen (Researcher) and Alexander O'Neill (Researcher). For full details, please see: O'Neill & Meelen, "The Diachronic Annotated Corpus of Newar: from Manuscript to Morphosyntax," Cahiers de Linguistique Asie Orientale (2024).
Type Of Material Data analysis technique 
Year Produced 2024 
Provided To Others? Yes  
Impact This manual is an essential tool in our research. 
 
Title Classical Newar Annotation Manual: Part II: Segmentation and Part-of-Speech Tagging 
Description This is part two of the Classical Newar Annotation Manual for the Diachronic Annotated Corpus of Newar (DACON), detailing segmentation and part-of-speech tagging procedures for corpus annotation. Creators: Alexander O'Neill (Researcher) and Marieke Meelen (Researcher). For full details, please see: O'Neill & Meelen, "The Diachronic Annotated Corpus of Newar: from Manuscript to Morphosyntax," Cahiers de Linguistique Asie Orientale (2024).
Type Of Material Data analysis technique 
Year Produced 2024 
Provided To Others? Yes  
Impact This is an essential guide for data processing in our project. 
 
Title Diachronic Annotated Corpus of Newar 
Description This dataset contains segmented and part-of-speech-tagged files that comprise the ongoing Diachronic Annotated Corpus of Newar (DACON). Files are provided in .txt format. File names are built as follows: cnew (Classical Newar); century (e.g. 12); short text name (and other information such as manuscript name (e.g. MSB) and line number completed for incomplete texts (e.g. 10000)); SEG or POS (segmented or part-of-speech-tagged). Example: cnew19-manicuda-10000_SEG.txt. For full details, including text citations with discussion, please see: O'Neill & Meelen, "The Diachronic Annotated Corpus of Newar: from Manuscript to Morphosyntax," Cahiers de Linguistique Asie Orientale (2024). Related materials: Annotation Manual Part I (Preprocessing); Annotation Manual Part II (Segmentation and POS Tagging); Tools for the Diachronic Annotated Corpus of Newar (DACON).
Type Of Material Database/Collection of data 
Year Produced 2024 
Provided To Others? Yes  
URL https://zenodo.org/doi/10.5281/zenodo.12887385
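
The file-naming scheme described for DACON can be unpacked programmatically. The following is a minimal sketch only, not part of the project's own tooling: the function and class names are hypothetical, and it assumes that the parts between the century and the SEG/POS suffix are separated by hyphens, as in the single example cnew19-manicuda-10000_SEG.txt. It also assumes that an all-digit extra part is the line number completed and that any other extra part is a manuscript label.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed shape, inferred from the one documented example:
#   cnew<century>-<text name>[-<extras>...]_<SEG|POS>.txt
_PATTERN = re.compile(
    r"^cnew(?P<century>\d{1,2})-(?P<middle>.+)_(?P<layer>SEG|POS)\.txt$"
)


@dataclass
class DaconFileName:
    century: int                    # e.g. 19
    text_name: str                  # short text name, e.g. "manicuda"
    manuscript: Optional[str]       # e.g. "MSB", if present
    lines_completed: Optional[int]  # e.g. 10000, for incomplete texts
    layer: str                      # "SEG" or "POS"


def parse_dacon_filename(name: str) -> DaconFileName:
    """Split a DACON-style file name into its labelled parts (hypothetical helper)."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised DACON file name: {name!r}")

    # First hyphen-separated part is taken to be the short text name.
    text_name, *extras = match.group("middle").split("-")

    manuscript = None
    lines_completed = None
    for part in extras:
        if part.isdigit():
            # Assumption: a purely numeric part is the line number completed.
            lines_completed = int(part)
        else:
            # Assumption: any other extra part is a manuscript label.
            manuscript = part

    return DaconFileName(
        century=int(match.group("century")),
        text_name=text_name,
        manuscript=manuscript,
        lines_completed=lines_completed,
        layer=match.group("layer"),
    )


if __name__ == "__main__":
    print(parse_dacon_filename("cnew19-manicuda-10000_SEG.txt"))
```

File names in the deposit that follow a different separator convention would need an adjusted pattern; the annotation manuals remain the authoritative description.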
 
Title Diachronic Annotated Corpus of Newar 
Description This dataset contains segmented and part-of-speech-tagged files that comprise the ongoing Diachronic Annotated Corpus of Newar (DACON). Files are provided in .txt format. File names are built as follows: cnew (Classical Newar); century (e.g. 12); short text name (and other information such as manuscript name (e.g. MSB) and line number completed for incomplete texts (e.g. 10000)); SEG or POS (segmented or part-of-speech-tagged). Example: cnew19-manicuda-10000_SEG.txt. For full details, including text citations with discussion, please see: O'Neill & Meelen, "The Diachronic Annotated Corpus of Newar: from Manuscript to Morphosyntax," Cahiers de Linguistique Asie Orientale (2024). Related materials: Annotation Manual Part I (Preprocessing); Annotation Manual Part II (Segmentation and POS Tagging); Tools for the Diachronic Annotated Corpus of Newar (DACON).
Type Of Material Database/Collection of data 
Year Produced 2024 
Provided To Others? Yes  
URL https://zenodo.org/doi/10.5281/zenodo.12887386
 
Title Diaspora Kathmandu Newar 2019 
Description Dataset of recordings, plus transcriptions, for the study of the Kathmandu Newar dialect in a diaspora setting. Recordings of read materials contributed by Sanyukta Shrestha in London, 2019, with permission for his voice to be used for research purposes. Recorded by Nathan Hill, curated by Alexander James O'Neill. 
Type Of Material Database/Collection of data 
Year Produced 2024 
Provided To Others? Yes  
Impact 1 download, 13 views 
URL https://zenodo.org/doi/10.5281/zenodo.10611827
 
Title Ground Truth Model for Pracalit for Sanskrit and Newar MSS 16th to 19th C. 
Description Ground truth data for an OCR model. Will be continually updated. Originally trained on Transkribus with a PyLaia model created from ground truth data based on transcripts into Pracalit Unicode of four Nepalese manuscripts. The manuscripts used to create this model are Staatsbibliothek zu Berlin's Hitopadeśa (MIK I 4851) (mixed Newar and Sanskrit dating to 1561) and Vetālapañcaviṃśati (HS. Or. 6414) (Newar dating to 1675) as well as Cambridge Digital Library's Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) (Sanskrit, 18th century) and the Royal Asiatic Society Online Collection's Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) (Newar and Sanskrit dating to c. 1800). The training was done on 441 pages and validation on 242 pages. This model does not recognise spacing, except for large gaps (i.e. for pictures or string holes). Newar word divider markers may not be represented or may be transcribed as virama. In general, the model is made for MSS with scriptio continua and will transcribe into scriptio continua in Pracalit Unicode. Transcription was performed by Dr Alexander O'Neill (SOAS University of London). Transcription of the Vetālapañcaviṃśati (HS. Or. 6414) and Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) was aided by unpublished materials provided by Dr Felix Otter (Philipps-Universität Marburg), as well as the published transcription in Shakya, Min Bahadur, and Shanta Harsha Bajracharya, eds. "Svayambhū Purāṇa." Lalitpur: Nagarjuna Institute of Exact Methods, 2001. The transcription of Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) was aided by the transcription provided by the Digital Sanskrit Buddhist Canon Project based on Lokesh Chandra, "Guṇakāraṇḍavyūhasūtram," New Delhi: International Academy of Indian Culture, 1999.
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://heidata.uni-heidelberg.de/citation?persistentId=doi:10.11588/data/WI9184
 
Title Lalitpur Newar 2022 
Description Dataset of recordings, plus metadata, for the study of the Lalitpur Newar dialect. Collected in 2022 with the consent of participants for its use for research purposes. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact Dataset is too new to have produced much impact yet, but it is the first open access corpus of spoken Newar. 
URL https://zenodo.org/record/7501051
 
Title OCR model for Pracalit for Sanskrit and Newar MSS 16th to 19th C., Ground Truth 
Description Ground truth data (png and xml files) for an OCR model. Will be continually updated. Originally trained on Transkribus with a PyLaia model created from ground truth data based on transcripts into Pracalit Unicode of four Nepalese manuscripts. The manuscripts used to create this model are Staatsbibliothek zu Berlin's Hitopadeśa (MIK I 4851) (mixed Newar and Sanskrit dating to 1561) and Vetālapañcaviṃśati (HS. Or. 6414) (Newar dating to 1675) as well as Cambridge Digital Library's Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) (Sanskrit, 18th century) and the Royal Asiatic Society Online Collection's Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) (Newar and Sanskrit dating to c. 1800). The training was done on 441 pages and validation on 242 pages. This model does not recognise spacing, except for large gaps (i.e. for pictures or string holes). Newar word divider markers may not be represented or may be transcribed as virama. In general, the model is made for MSS with scriptio continua and will transcribe into scriptio continua in Pracalit Unicode. Transcription was performed by Dr Alexander O'Neill (SOAS University of London). Transcription of the Vetālapañcaviṃśati (HS. Or. 6414) and Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) was aided by unpublished materials provided by Dr Felix Otter (Philipps-Universität Marburg), as well as the published transcription in Shakya, Min Bahadur, and Shanta Harsha Bajracharya, eds. "Svayambhū Purāṇa." Lalitpur: Nagarjuna Institute of Exact Methods, 2001. The transcription of Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) was aided by the transcription provided by the Digital Sanskrit Buddhist Canon Project based on Lokesh Chandra, "Guṇakāraṇḍavyūhasūtram," New Delhi: International Academy of Indian Culture, 1999.
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact This dataset has made HTR for Newar Pracalit possible for the first time. 
URL https://zenodo.org/record/6967421
 
Title PACTib - PArsed Corpus of Tibetan (11th-21st c.) 
Description This PArsed Corpus of Tibetan (PACTib) contains >5000 historical Tibetan texts (>82m words) from over 10 different centuries. The original texts are from the Buddhist Digital Resource Center (BDRC), automatically enriched with linguistic annotation in the form of segmentation (tokenisation), Part-of-Speech tags and constituency parses. Files in this deposit are: a csv file with an overview of all texts with metadata linking file IDs + date ranges; segmented & POS-tagged txt files (using the ACTib segmenter & tagger); parsed txt files (using the ACTib parser, forthcoming). Note that only the dated files are part of this collection. More information about the corpus can be found in: Meelen, M., & Roux, É. (2020). Meta-dating the PArsed Corpus of Tibetan (PACTib). In Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories (pp. 31-42).
Type Of Material Database/Collection of data 
Year Produced 2024 
Provided To Others? Yes  
URL https://zenodo.org/doi/10.5281/zenodo.12104250
 
Title PACTib - PArsed Corpus of Tibetan (11th-21st c.) 
Description This PArsed Corpus of Tibetan (PACTib) contains >5000 historical Tibetan texts (>82m words) from over 10 different centuries. The original texts are from the Buddhist Digital Resource Center (BDRC), automatically enriched with linguistic annotation in the form of segmentation (tokenisation), Part-of-Speech tags and constituency parses. Files in this deposit are: a csv file with an overview of all texts with metadata linking file IDs + date ranges; segmented & POS-tagged txt files (using the ACTib segmenter & tagger); parsed txt files (using the ACTib parser, forthcoming). Note that only the dated files are part of this collection. More information about the corpus can be found in: Meelen, M., & Roux, É. (2020). Meta-dating the PArsed Corpus of Tibetan (PACTib). In Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories (pp. 31-42).
Type Of Material Database/Collection of data 
Year Produced 2024 
Provided To Others? Yes  
URL https://zenodo.org/doi/10.5281/zenodo.12104249
 
Description 2023 Language Preservation through ASR, poster presentation given at Cambridge Language Sciences Annual Symposium, 2023. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact 2023 Language Preservation through ASR, poster presentation given at Cambridge Language Sciences Annual Symposium, 2023.
Year(s) Of Engagement Activity 2023
 
Description 2023 The Emergence of Egophoricity: NLP Pipeline for Diachronic Tibetan and Newar Corpus Analysis, paper given at Perspectives of Digital Humanities in the Field of Buddhism, Universität Hamburg, Numata Zentrum für Buddhismuskunde & Khyentse Center for Tibetan Buddhist Textual Scholarship. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact 2023 The Emergence of Egophoricity: NLP Pipeline for Diachronic Tibetan and Newar Corpus Analysis, paper given at Perspectives of Digital Humanities in the Field of Buddhism, Universität Hamburg, Numata Zentrum für Buddhismuskunde & Khyentse Center for Tibetan Buddhist Textual Scholarship.
Year(s) Of Engagement Activity 2023
 
Description 2024 Participant in the roundtable "Transkribus for Asian Languages" at the Transkribus User Conference 24, Innsbruck, Austria. Convened by Rachael Griffiths. 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact 2024 Participant in the roundtable "Transkribus for Asian Languages" at the Transkribus User Conference 24, Innsbruck, Austria. Convened by Rachael Griffiths.
Year(s) Of Engagement Activity 2024
 
Description Handwritten Text Recognition for Pracalit Script Manuscripts 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact 2024 Handwritten Text Recognition for Pracalit Script Manuscripts, paper given for the section "Transcribe faster, Discover more: Studying handwritten and printed documents in the age of Transkribus" at the Transkribus User Conference 24, Innsbruck, Austria.
Year(s) Of Engagement Activity 2024