The Emergence of Egophoricity: a diachronic investigation into the marking of the conscious self

Lead Research Organisation: SOAS University of London

Department Name: East Asian Languages and Cultures

Abstract

This project looks at the way certain Tibetan and Newar varieties express the perspective of the speaker in the sentence. In Lhasa Tibetan, for example, the auxiliary verb 'yin' can be used in sentences where the speaker is the subject (nga em-chi yin '*I'm* a doctor'), if the speaker wants to identify their personal relation or possession ('di nga'i bu-mo yin 'This is *my* daughter') or if the speaker chooses to emphasise who performed an action ('di khyed-rang-gi gsol-ja yin 'This is your tea [that *I* have made for you]'). Other Tibetan varieties, such as Jirel or South Mustang Tibetan, also exhibit egophoric markers like Lhasa Tibetan 'yin', but not always in the same contexts. In Newar varieties that are also spoken in Nepal, however, egophoric marking consists of long vowels in verbal endings rather than separate (auxiliary) verbs (ji Manaj napalan-aa 'I (the speaker) met Manoj as planned' vs. ji Manaj napalan-a 'I met Manoj by coincidence'). Finally, in older stages of both Tibetan and Newar varieties, this egophoric marking cannot be found. The central question that this project aims to answer is how and why specific grammatical markers to indicate the speaker's involvement emerge over time in ways that slightly differ, even in closely related languages. What subtle grammatical clues can be found in older stages of these languages that in later stages result in egophoric marking?

In this project we first investigate how Present-Day Tibetan and Newar varieties grammatically express the speaker's involvement. For this purpose we will create annotated corpora: digital text collections enriched with linguistic information about the structure and meaning of each element in the sentence. Because there is no data available yet for the highly endangered Lalitpur Newar variety, we will conduct fieldwork in Nepal to document the language and collect texts for our corpora. We then add the same linguistic information to historical texts. Older archive texts in South Mustang Tibetan, for example, will be compared to 18th-19th-century texts written in standard Classical Tibetan to investigate the development of the Present-Day Lhasa Tibetan egophoric marker 'byung', which indicates the speaker is the recipient of an action (khong gis ngar yige btang byung 'He sent *me* a letter.'). Present-Day South Mustang Tibetan also has a verb 'byung', which goes back to Old and Classical Tibetan 'byung' meaning 'receive, get'. But unlike in Lhasa Tibetan, this verb in South Mustang Tibetan has not changed into an egophoric auxiliary verb. Because of the extensive and consistent linguistic annotation of our corpora, we will be able to systematically study subtle differences in the use of verbs like 'byung'. Since our corpora will contain not only morphosyntactic annotation but also information about meaning and function in discourse context, we will be in a unique position to investigate complex grammatical phenomena like egophoricity. Investigating this in a historical context gives us the opportunity to test theories of language change that make predictions about triggers and mechanisms of change in particular. Are language-internal factors (e.g. changes in phonology) responsible for the emergence of egophoric marking, can language-external factors (language contact) play a role and/or can we observe a combination of factors in these languages that have throughout history been spoken by people in close proximity in Nepal?

Finally, since even closely related Tibetan and Newar varieties exhibit some significant differences, comparison with egophoric marking in other languages can provide further clues about this complex phenomenon. In the final year of the project, we will therefore put our findings from Tibetan and Newar in a crosslinguistic perspective.

Funded Value:

£733,596

Funded Period:

Jan 22 - Jan 27

Funder:

AHRC

Project Status:

Active

Project Category:

Research Grant

Project Reference:

AH/V011235/1

Principal Investigator:

Nathan Hill

Research Subject:

Languages & Literature (20%)

Linguistics (80%)

Research Topic:

Asiatic & Oriental Studies (20%)

Corpus Linguistics (20%)

Language Variation & Change (20%)

Morphology & Phonology (20%)

Syntax (20%)

Organisations

SOAS University of London (Lead Research Organisation)

People	ORCID iD
Nathan Hill (Principal Investigator)
Marieke Meelen (Co-Investigator)
Alexander O'Neill (Researcher)

Publications

Christian Faggionato (2022) NLP Pipeline for Annotating (Endangered) Tibetan and Newar Varieties

Li S. (2023) Printed Text Recognition for Lexical Lists in Chinese-International Phonetic Alphabet (IPA) Glossing in Journal of Open Humanities Data

Meelen M. (2024) End-to-End Speech Recognition for Endangered Languages of Nepal in ComputEL 2024 - 7th Workshop on the Use of Computational Methods in the Study of Endangered Languages, Proceedings of the Workshop

O'Neill A (2023) Language Preservation through ASR

O'Neill A (2022) Text Recognition for Nepalese Manuscripts in Pracalit Script in Journal of Open Humanities Data

O'Neill A (2024) The Diachronic Annotated Corpus of Newar: From Manuscript to Morphosyntax in Cahiers de Linguistique Asie Orientale

Key Findings
Research Databases and Models
Engagement Activities


Description	In 2023, this project made significant strides in understanding the development and use of egophoric markers, which signify the conscious self, in the Tibetan and Newar languages. The creation of the Diachronic Annotated Corpus of Newar (DACON) represented a groundbreaking effort to facilitate linguistic research across a millennium, enabling scholars to track changes in Newar language use over time. Similarly, our study presented at the Himalayan Languages Symposium 2023 elucidated the evolution of evidentiality markers in Tibetan, demonstrating how semantic and information-structural factors influence the choice of subordinate markers. Furthermore, this project's efforts in language preservation, particularly through the development of Automatic Speech Recognition (ASR) models for the endangered Newar and Dzardzongke languages, highlighted the project's commitment to safeguarding linguistic diversity. The project's comprehensive approach, combining diachronic corpus analysis with cutting-edge NLP tools, has significantly advanced our ability to understand egophoricity and its lexical origins in these languages.
Exploitation Route	The economic implications of this project extend beyond its academic achievements. By fostering advancements in Natural Language Processing (NLP) and Automatic Speech Recognition (ASR) technologies, the project has contributed to the UK's tech sector, particularly in the areas of language technology and AI. The development of ASR models for low-resource languages not only addresses critical issues in language preservation but also positions the UK as a leader in the application of technology to linguistics. The employment of research assistants in Nepal for transcription work further supports international collaboration and capacity building, potentially opening new markets for UK-based technological solutions. Additionally, the project's success in generating a highly specialized corpus and developing NLP tools for under-researched languages can stimulate innovation in the broader field of computational linguistics, contributing to economic growth through the creation of new knowledge, technologies, and applications that enhance the UK's competitive edge in the global technology landscape.
Sectors	Digital/Communication/Information Technologies (including Software); Education


Title	Classical Newar Annotation Manual: Part I: Preprocessing
Description	This is part one of the Classical Newar Annotation Manual for the Diachronic Annotated Corpus of Newar (DACON), detailing preprocessing procedures for corpus annotation. Creators: Marieke Meelen and Alexander O'Neill. For full details, please see: O'Neill & Meelen, "The Diachronic Annotated Corpus of Newar: from Manuscript to Morphosyntax," Cahiers de Linguistique Asie Orientale (2024).
Type Of Material	Data analysis technique
Year Produced	2024
Provided To Others?	Yes
Impact	This manual is an essential tool in our research.


Title	Classical Newar Annotation Manual: Part II: Segmentation and Part-of-Speech Tagging
Description	This is part two of the Classical Newar Annotation Manual for the Diachronic Annotated Corpus of Newar (DACON), detailing segmentation and part-of-speech tagging procedures for corpus annotation. Creators: Alexander O'Neill and Marieke Meelen. For full details, please see: O'Neill & Meelen, "The Diachronic Annotated Corpus of Newar: from Manuscript to Morphosyntax," Cahiers de Linguistique Asie Orientale (2024).
Type Of Material	Data analysis technique
Year Produced	2024
Provided To Others?	Yes
Impact	This is an essential guide for data processing in our project.


Title	Diachronic Annotated Corpus of Newar
Description	This dataset contains segmented and part-of-speech-tagged files that comprise the ongoing Diachronic Annotated Corpus of Newar (DACON). Files are provided in .txt format. File names are composed as follows: cnew (Classical Newar); century (e.g. 12); short text name, plus other information such as manuscript name (e.g. MSB) and, for incomplete texts, the line number completed (e.g. 10000); SEG or POS (segmented or part-of-speech-tagged). Example: cnew19-manicuda-10000_SEG.txt. For full details, including text citations with discussion, please see: O'Neill & Meelen, "The Diachronic Annotated Corpus of Newar: from Manuscript to Morphosyntax," Cahiers de Linguistique Asie Orientale (2024). See also: Annotation Manual Part I (Preprocessing); Annotation Manual Part II (Segmentation and POS Tagging); Tools for the Diachronic Annotated Corpus of Newar (DACON).
Type Of Material	Database/Collection of data
Year Produced	2024
Provided To Others?	Yes
URL	https://zenodo.org/doi/10.5281/zenodo.12887385
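
The file-naming convention in the DACON description above can be unpacked mechanically. The following is a minimal Python sketch, not part of the project's own tooling: the regular expression, the `DaconFileName` class and the `parse_dacon_name` function are hypothetical names introduced here, and the use of hyphens between components and an underscore before SEG/POS is an assumption inferred solely from the single example file name cnew19-manicuda-10000_SEG.txt.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed pattern, inferred only from the example "cnew19-manicuda-10000_SEG.txt":
# "cnew" + century digits, a hyphen-separated body, "_SEG" or "_POS", then ".txt".
_DACON_NAME = re.compile(
    r"^cnew(?P<century>\d{1,2})-(?P<body>[^_]+)_(?P<layer>SEG|POS)\.txt$"
)


@dataclass
class DaconFileName:
    """Hypothetical container for the components of a DACON file name."""
    century: int        # e.g. 19
    text_name: str      # short text name, e.g. "manicuda"
    extras: List[str]   # optional extra parts, e.g. manuscript name or line number
    layer: str          # "SEG" (segmented) or "POS" (part-of-speech-tagged)


def parse_dacon_name(filename: str) -> Optional[DaconFileName]:
    """Split a file name into its components; return None if it does not match."""
    match = _DACON_NAME.match(filename)
    if match is None:
        return None
    parts = match.group("body").split("-")
    return DaconFileName(
        century=int(match.group("century")),
        text_name=parts[0],
        extras=parts[1:],
        layer=match.group("layer"),
    )


if __name__ == "__main__":
    print(parse_dacon_name("cnew19-manicuda-10000_SEG.txt"))
```

Returning None for non-matching names, rather than raising, lets a caller filter a mixed directory listing; the extra components are kept as an uninterpreted list because the description does not say in which order manuscript name and line number appear.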


Title	Diachronic Annotated Corpus of Newar
Description	This dataset contains segmented and part-of-speech-tagged files that comprise the ongoing Diachronic Annotated Corpus of Newar (DACON). Files are provided in .txt format. File names are composed as follows: cnew (Classical Newar); century (e.g. 12); short text name, plus other information such as manuscript name (e.g. MSB) and, for incomplete texts, the line number completed (e.g. 10000); SEG or POS (segmented or part-of-speech-tagged). Example: cnew19-manicuda-10000_SEG.txt. For full details, including text citations with discussion, please see: O'Neill & Meelen, "The Diachronic Annotated Corpus of Newar: from Manuscript to Morphosyntax," Cahiers de Linguistique Asie Orientale (2024). See also: Annotation Manual Part I (Preprocessing); Annotation Manual Part II (Segmentation and POS Tagging); Tools for the Diachronic Annotated Corpus of Newar (DACON).
Type Of Material	Database/Collection of data
Year Produced	2024
Provided To Others?	Yes
URL	https://zenodo.org/doi/10.5281/zenodo.12887386


Title	Diaspora Kathmandu Newar 2019
Description	Dataset of recordings, plus transcriptions, for the study of the Kathmandu Newar dialect in a diaspora setting. Recordings of read materials contributed by Sanyukta Shrestha in London, 2019, with permission for his voice to be used for research purposes. Recorded by Nathan Hill, curated by Alexander James O'Neill.
Type Of Material	Database/Collection of data
Year Produced	2024
Provided To Others?	Yes
Impact	1 download, 13 views
URL	https://zenodo.org/doi/10.5281/zenodo.10611827


Title	Ground Truth Model for Pracalit for Sanskrit and Newar MSS 16th to 19th C.
Description	Ground truth data for an OCR model. Will be continually updated. Originally trained on Transkribus with a PyLaia model created from ground truth data based on transcripts into Pracalit Unicode of four Nepalese manuscripts. The manuscripts used to create this model are Staatsbibliothek zu Berlin's Hitopadeśa (MIK I 4851) (mixed Newar and Sanskrit dating to 1561) and Vetālapañcaviṃśati (HS. Or. 6414) (Newar dating to 1675) as well as Cambridge Digital Library's Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) (Sanskrit, 18th century) and the Royal Asiatic Society Online Collection's Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) (Newar and Sanskrit dating to c. 1800). The training was done on 441 pages and validation on 242 pages. This model does not recognise spacing, except for large gaps (i.e. for pictures or string holes). Newar word divider markers may not be represented or may be transcribed as virama. In general, the model is made for MSS with scriptio continua and will transcribe in scriptio continua into Pracalit Unicode. Transcription was performed by Dr Alexander O'Neill (SOAS University of London). Transcription of the Vetālapañcaviṃśati (HS. Or. 6414) and Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) was aided by unpublished materials provided by Dr Felix Otter (Philipps-Universität Marburg), as well as the published transcription in Shakya, Min Bahadur, and Shanta Harsha Bajracharya, eds. "Svayambhū Purāṇa." Lalitpur: Nagarjuna Institute of Exact Methods, 2001. The transcription of Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) was aided by the transcription provided by the Digital Sanskrit Buddhist Canon Project based on Lokesh Chandra, "Guṇakāraṇḍavyūhasūtram," New Delhi: International Academy of Indian Culture, 1999.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://heidata.uni-heidelberg.de/citation?persistentId=doi:10.11588/data/WI9184


Title	Lalitpur Newar 2022
Description	Dataset of recordings, plus metadata, for the study of the Lalitpur Newar dialect. Collected in 2022 with the consent of participants for its use for research purposes.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
Impact	Dataset is too new to have produced much impact yet, but it is the first open access corpus of spoken Newar.
URL	https://zenodo.org/record/7501051


Title	OCR model for Pracalit for Sanskrit and Newar MSS 16th to 19th C., Ground Truth
Description	Ground truth data (png and xml files) for an OCR model. Will be continually updated. Originally trained on Transkribus with a PyLaia model created from ground truth data based on transcripts into Pracalit Unicode of four Nepalese manuscripts. The manuscripts used to create this model are Staatsbibliothek zu Berlin's Hitopadeśa (MIK I 4851) (mixed Newar and Sanskrit dating to 1561) and Vetālapañcaviṃśati (HS. Or. 6414) (Newar dating to 1675) as well as Cambridge Digital Library's Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) (Sanskrit, 18th century) and the Royal Asiatic Society Online Collection's Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) (Newar and Sanskrit dating to c. 1800). The training was done on 441 pages and validation on 242 pages. This model does not recognise spacing, except for large gaps (i.e. for pictures or string holes). Newar word divider markers may not be represented or may be transcribed as virama. In general, the model is made for MSS with scriptio continua and will transcribe in scriptio continua into Pracalit Unicode. Transcription was performed by Dr Alexander O'Neill (SOAS University of London). Transcription of the Vetālapañcaviṃśati (HS. Or. 6414) and Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) was aided by unpublished materials provided by Dr Felix Otter (Philipps-Universität Marburg), as well as the published transcription in Shakya, Min Bahadur, and Shanta Harsha Bajracharya, eds. "Svayambhū Purāṇa." Lalitpur: Nagarjuna Institute of Exact Methods, 2001. The transcription of Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) was aided by the transcription provided by the Digital Sanskrit Buddhist Canon Project based on Lokesh Chandra, "Guṇakāraṇḍavyūhasūtram," New Delhi: International Academy of Indian Culture, 1999.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
Impact	This dataset has made HTR for Newar Pracalit possible for the first time.
URL	https://zenodo.org/record/6967421


Title	PACTib - PArsed Corpus of Tibetan (11th-21st c.)
Description	This PArsed Corpus of Tibetan (PACTib) contains >5000 historical Tibetan texts (>82m words) from over 10 different centuries. The original texts are from the Buddhist Digital Resource Center (BDRC), automatically enriched with linguistic annotation in the form of segmentation (tokenisation), Part-of-Speech tags and constituency parses. Files in this deposit are: (1) a csv file with an overview of all texts, with metadata linking file IDs and date ranges; (2) segmented and POS-tagged txt files (using the ACTib segmenter and tagger); (3) parsed txt files (using the ACTib parser, forthcoming). Note that only the dated files are part of this collection. More information about the corpus can be found in: Meelen, M., & Roux, É. (2020). Meta-dating the PArsed Corpus of Tibetan (PACTib). In Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories (pp. 31-42).
Type Of Material	Database/Collection of data
Year Produced	2024
Provided To Others?	Yes
URL	https://zenodo.org/doi/10.5281/zenodo.12104250


Title	PACTib - PArsed Corpus of Tibetan (11th-21st c.)
Description	This PArsed Corpus of Tibetan (PACTib) contains >5000 historical Tibetan texts (>82m words) from over 10 different centuries. The original texts are from the Buddhist Digital Resource Center (BDRC), automatically enriched with linguistic annotation in the form of segmentation (tokenisation), Part-of-Speech tags and constituency parses. Files in this deposit are: (1) a csv file with an overview of all texts, with metadata linking file IDs and date ranges; (2) segmented and POS-tagged txt files (using the ACTib segmenter and tagger); (3) parsed txt files (using the ACTib parser, forthcoming). Note that only the dated files are part of this collection. More information about the corpus can be found in: Meelen, M., & Roux, É. (2020). Meta-dating the PArsed Corpus of Tibetan (PACTib). In Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories (pp. 31-42).
Type Of Material	Database/Collection of data
Year Produced	2024
Provided To Others?	Yes
URL	https://zenodo.org/doi/10.5281/zenodo.12104249


Description	2023 Language Preservation through ASR, poster presentation given at Cambridge Language Sciences Annual Symposium, 2023.
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Postgraduate students
Results and Impact	2023 Language Preservation through ASR, poster presentation given at Cambridge Language Sciences Annual Symposium, 2023.
Year(s) Of Engagement Activity	2023


Description	2023 The Emergence of Egophoricity: NLP Pipeline for Diachronic Tibetan and Newar Corpus Analysis, paper given at Perspectives of Digital Humanities in the Field of Buddhism, Universität Hamburg, Numata Zentrum für Buddhismuskunde & Khyentse Center for Tibetan Buddhist Textual Scholarship.
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	2023 The Emergence of Egophoricity: NLP Pipeline for Diachronic Tibetan and Newar Corpus Analysis, paper given at Perspectives of Digital Humanities in the Field of Buddhism, Universität Hamburg, Numata Zentrum für Buddhismuskunde & Khyentse Center for Tibetan Buddhist Textual Scholarship.
Year(s) Of Engagement Activity	2023


Description	2024 Participant in the roundtable "Transkribus for Asian Languages" at the Transkribus User Conference 24, Innsbruck, Austria. Convened by Rachael Griffiths.
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	2024 Participant in the roundtable "Transkribus for Asian Languages" at the Transkribus User Conference 24, Innsbruck, Austria. Convened by Rachael Griffiths.
Year(s) Of Engagement Activity	2024


Description	Handwritten Text Recognition for Pracalit Script Manuscripts
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	2024 Handwritten Text Recognition for Pracalit Script Manuscripts, paper given for the section "Transcribe faster, Discover more: Studying handwritten and printed documents in the age of Transkribus" at the Transkribus User Conference 24, Innsbruck, Austria.
Year(s) Of Engagement Activity	2024
