Natural Language Processing Working Together with Arabic and Islamic Studies

Lead Research Organisation: University of Leeds
Department Name: Sch of Computing

Abstract

Summary

This is an interdisciplinary project which addresses the ICT call of "working together" by aligning ICT expertise and research interests from Computational and Corpus Linguistics, with Humanities research streams in Arabic and Islamic Studies, focusing on the Qur'an as a core text. It is also an international collaboration between the Universities of Leeds and Jordan, and further addresses the "working together" call via incoming and outgoing mobility in the form of Visiting Researcher placements in the School of Computing at Leeds (incoming) and the Centre for the Study of Islam in the Contemporary World at Jordan (outgoing). This agreement is proactive and novel, and has high impact, ensuring knowledge transfer from different methodological perspectives and cultures.

The study of Tajwid or Qur'anic recitation is a sub-field and taught module in Islamic Studies programmes at both universities and elsewhere, and the original insight informing this project is to view Tajwid mark-up in the Qur'an as additional text-based data for computational analysis. This mark-up is already incorporated into Qur'anic Arabic script, and identifies prosodic-syntactic phrase boundaries of different strengths, plus gradations of prosodic and semantic salience through colour-coded highlighting of pitch accented syllables, and hence prosodically and semantically salient words.

The Computational Linguistics Module in Year 1 entails development and evaluation of software for generating a phonetically-transcribed, stressed and syllabified version of the entire text of the Qur'an, using the International Phonetic Alphabet (IPA). This canonical pronunciation tier for Classical Arabic will be informed and evaluated by Arabic linguists, Tajwid scholars, and phoneticians, and published in an updated version of the open-source Boundary-Annotated Qur'an Corpus [1], [2], preferably for LREC 2014. The software will also be re-usable for Natural Language Engineering applications for Modern Standard Arabic, and for constructing dictionaries for Arabic language learners.
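As an illustration only, the kind of pipeline described above (Arabic script to IPA, then syllabification) can be sketched in a few lines of Python. This is not the project's software: the function names `transcribe` and `syllabify`, the tiny letter inventory, and the rule that every syllable opens with exactly one consonant are all assumptions made for this sketch. It expects fully vowelled input, and it omits stress assignment and all Tajwid-specific rules.

```python
# Toy Arabic-to-IPA transcription and syllabification.
# Assumptions (not from the project): a tiny letter inventory, fully
# vowelled input, and syllables that always start with one consonant.

CONSONANTS = {
    "\u0643": "k",  # kaf
    "\u062A": "t",  # ta
    "\u0628": "b",  # ba
    "\u0645": "m",  # mim
    "\u0644": "l",  # lam
}
SHORT_VOWELS = {
    "\u064E": "a",  # fatha
    "\u064F": "u",  # damma
    "\u0650": "i",  # kasra
}
SUKUN = "\u0652"   # marks a consonant with no following vowel
SHADDA = "\u0651"  # marks a doubled (geminate) consonant
ALIF = "\u0627"    # lengthens a preceding fatha in this sketch

CONSONANT_SOUNDS = set(CONSONANTS.values())
VOWEL_SOUNDS = {"a", "i", "u", "aː"}


def transcribe(word):
    """Map a vowelled Arabic word to a list of IPA phoneme strings."""
    phonemes = []
    for ch in word:
        if ch in CONSONANTS:
            phonemes.append(CONSONANTS[ch])
        elif ch in SHORT_VOWELS:
            phonemes.append(SHORT_VOWELS[ch])
        elif ch == ALIF and phonemes and phonemes[-1] == "a":
            phonemes[-1] = "aː"  # fatha + alif gives a long vowel
        elif ch == SHADDA and phonemes and phonemes[-1] in CONSONANT_SOUNDS:
            phonemes.append(phonemes[-1])  # double the consonant
        elif ch == SUKUN:
            continue  # no sound of its own
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return phonemes


def syllabify(phonemes):
    """Split phonemes so each syllable starts with a single consonant."""
    syllables, current = [], []
    for i, p in enumerate(phonemes):
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        # A consonant directly followed by a vowel opens a new syllable.
        if p in CONSONANT_SOUNDS and nxt in VOWEL_SOUNDS and current:
            syllables.append("".join(current))
            current = []
        current.append(p)
    if current:
        syllables.append("".join(current))
    return syllables


if __name__ == "__main__":
    kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
    kitab = "\u0643\u0650\u062A\u064E\u0627\u0628"   # "book"
    print(".".join(syllabify(transcribe(kataba))))  # ka.ta.ba
    print(".".join(syllabify(transcribe(kitab))))   # ki.taːb
```

Keeping transcription and syllabification as separate steps means the phoneme list can be checked by linguists independently of the syllable boundaries, which matches the evaluation-by-experts workflow the summary describes.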

The Text Analytics Module in Year 2 implements statistical techniques such as keyword extraction to explore semiotic relationships between sound and meaning in the Qur'an, invoking a Saussurean-type view of the sign as '...a bi-unity of expression and content...' [5]. Our investigation entails: (i) text data mining for statistically significant phonemes, syllables, words, and correlates of rhythmic juncture [6], [7]; and (ii) interpretation of results from interdisciplinary perspectives: Corpus Linguistics (ICT); Tajwid science, plus Tafsir or Qur'anic exegesis (Islamic Studies); Arabic (Language and Literature); and Phonetics and Phonology (Linguistics).
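One common way to find "statistically significant" items in corpus linguistics is log-likelihood keyness, which compares an item's frequency in a study text against a reference text. The summary does not say which statistic the project uses, so the sketch below is an assumption offered for illustration; the function names and the toy syllable counts are invented. The default threshold of 3.84 is the chi-square critical value for p < 0.05 with one degree of freedom.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score.

    a, b: frequency of the item in the study and reference corpora.
    c, d: total number of tokens in the study and reference corpora.
    """
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_study)
    if b > 0:
        score += b * math.log(b / expected_ref)
    return 2 * score


def keywords(study_tokens, reference_tokens, threshold=3.84):
    """Return (item, score) pairs over-represented in the study corpus."""
    if not study_tokens or not reference_tokens:
        raise ValueError("both corpora must be non-empty")
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    results = []
    for item, a in study.items():
        b = ref[item]  # 0 if the item never occurs in the reference
        if a / c > b / d:  # keep only over-represented items
            score = log_likelihood(a, b, c, d)
            if score >= threshold:
                results.append((item, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Invented toy data: syllable tokens from one passage vs. a reference.
    passage = ["qaf"] * 30 + ["nun"] * 70
    reference = ["qaf"] * 5 + ["nun"] * 95
    for item, score in keywords(passage, reference):
        print(f"{item}\t{score:.2f}")
```

The same function works unchanged on phonemes, syllables, or words, since it only needs two lists of tokens; interpreting why an item is key is left to the exegetical and linguistic perspectives listed in (ii).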

In terms of ICT applications, the team will collaborate with stakeholders and beneficiaries to develop an associated or follow-on funding proposal for the UK Research Councils, to include publication of project software as an advanced corpus-query and visualization tool for Islamic Studies and Humanities scholars, plus Arabic language learners. This again represents an extension of the "working together" theme.

Finally, our approach is interdisciplinary and pioneers stylistic analysis of sound and rhythm encoded in writing as a semiotic system for religious and other literary texts. As such it is entirely novel and has direct implications for research-led teaching in both partner institutions plus a broad cross-section of research groups and user communities, namely: Natural Language Processing and Artificial Intelligence; Qur'anic and Islamic Studies; Arabic Language and Literature; Linguistics and Phonetics; Digital Humanities; and Psychology.

All references appear in Case for Support.

Planned Impact

Impact Summary

Novelty and Originality: The original insight in this project [1] is to use Tajwid recitation markup of chunk boundaries in the Qur'an, plus other orthographical features, as untapped sources of text data for computational analysis. A principal objective is to establish quantifiable, linguistically rigorous links between the prosodic markup and the semantics of the Qur'an. This falls within exegetical science (Tafsir); it will make an original contribution to the Tafsir literature, which as yet provides no detailed analysis in this area, and be of interest to educated pious Muslims worldwide.

Debate and Controversy: Another objective is to develop a Tajwid-IPA and this might be seen as controversial since the Muslim mainstream has traditionally not treated any Qur'anic transliteration as the Qur'an proper. Any suggestion of a movement away from Arabic script to another language system would be seen as unacceptable by the mainstream - despite the advantages of doing so - and so this would almost certainly open up a debate, placing at least traditionalists and modernists in opposition to one another.

Sustainability: The project delivers re-usable algorithms for phoneticizing Arabic script for use in Arabic speech and language applications and lexicography. The Boundary-Annotated Qur'an Corpus [1] with canonical pronunciation tier will be the largest resource of its kind for training Arabic statistical language models and for Arabic linguistics and Islamic Studies. Re-usable algorithms for quantitative analysis of implicit prosody [3] in text will inform research on the link between text and spoken form in other languages. New interdisciplinary techniques for exploring religious and literary texts will inform AI research on deep text analytics. For Islamic Studies this research will open up the possibility of developing a new Qur'anic orthography comfortably rooted within modern phonetics, thus bringing an old oral tradition into a dynamic academic area of study. Current editions of the colour-coded Tajwid Qur'an are still inadequate in terms of providing the full gamut of Tajwid-phonetic values. This research is a stepping-stone towards a new Tajwid-IPA which we envisage would be welcomed by Arab linguists (though resisted by traditionalists). In this respect, we believe it to be a path-breaking research project.

Beneficiaries and how they will benefit: Beneficiaries span a broad cross-section of research groupings and user communities.

Science and Engineering: AI/NLP researchers gain a rich, gold-standard dataset for machine learning; language technologists gain re-usable software to incorporate into their systems; computational and corpus linguists benefit as developers and users of annotated corpora and associated software; psycholinguists gain conceptual models of prosody and of language as a semiotic system.

Humanities: Richly annotated corpora are tools of the trade for corpus linguists, Arabic scholars and researchers in Applied Linguistics, and can be exploited by lexicographers for dictionary construction; online Qur'anic resources have been identified as a priority by groups such as the Muslims in Britain Research Network and the UK Islamic Studies Network [4].

Economic: Re-usable software for generating IPA transcriptions of Arabic words has market potential for publishers of dictionaries and language learning materials; the project aims to develop its re-usable software and annotated online resources into an industry-standard corpus-query and visualization tool.

Societal: The research will generate interest in the Middle East, particularly in key centres of Islamic scholarship. It is also sustainable beyond the scope of this project, since a principal long-term objective is to provide a complete Tajwid markup system which can be used by institutions of Islamic learning in Britain and worldwide.

References in Case for Support.
 
Description Software for generating a phonetically-transcribed, stressed and syllabified version of the entire Arabic text of the Qur'an, using a character set based on the Roman alphabet found on British keyboards; this helps British and other Muslims who do not read or write Arabic to recite the Qur'an correctly.
Exploitation Route We want to include this analysis of the Quran in the major Quran website http://quran.com/, which is used by millions of Muslims and others interested in Islamic texts, and to apply this research to the computational analysis and understanding of other religious texts. Islamic text analytics is also useful to government agencies monitoring Islamist groups, for example CPNI.
Sectors Communities and Social Services/Policy,Creative Economy,Digital/Communication/Information Technologies (including Software),Education,Leisure Activities, including Sports, Recreation and Tourism,Government, Democracy and Justice,Culture, Heritage, Museums and Collections,Security and Diplomacy,Other

 
Description Brierley C; Sawalha M; Heselwood B; Atwell ES A verified Arabic-IPA mapping for Arabic transcription technology, informed by Quranic recitation, traditional Arabic linguistics, and modern phonetics. Journal of Semitic Studies, To appear.
Sawalha M; Atwell E A standard tag set expounding traditional morphological features for Arabic language part-of-speech tagging. Word Structure, vol. 6, pp.43-99. 2013.
Dukes K; Atwell ES; Habash N Supervised Collaboration for Syntactic Annotation of Quranic Arabic. Language Resources and Evaluation, vol. 47, pp.33-62. 2013.
Hassan H; Daud NM; Atwell ES Connectives in the World Wide Web Arabic Corpus. World Applied Sciences Journal, vol. 21 (Special Issue of Studies in Language Teaching and Learning), pp.67-72. 2013.
Alfaifi, A., Atwell, E. and Hedaya, I. (2014). Arabic Learner Corpus (ALC) v2: A New Written and Spoken Corpus of Arabic Learners. In: Ishikawa, S (ed.) Learner corpus studies in Asia and the world. Vol. 2. Papers from LCSAW2014, pp. 77-89. Kobe, Japan: School of Languages and Communication, Kobe University. Paper
Alfaifi, Abdullah, and Atwell, Eric. (2014). Arabic Learner Corpus: A New Resource for Arabic Language Research. The 7th Saudi Students Conference, 1-2 February 2014, Edinburgh, UK. Poster
Alfaifi, Abdullah and Atwell, Eric (2014). Tools for Searching and Analysing Arabic Corpora: an Evaluation Study. BAAL / Cambridge University Press Applied Linguistics, 14 Jun 2014. Leeds Metropolitan University, UK.
Alfaifi, Abdullah and Atwell, Eric (2014). Arabic Learner Corpus and Its Potential Role in Teaching Arabic to Non-Native Speakers. The 7th Biennial IVACS conference, 19 - 21 Jun 2014. Newcastle, UK.
Alfaifi, Abdullah and Atwell, Eric (2014). An Evaluation of the Arabic Error Tagset v2. The American Association for Corpus Linguistics (AACL) conference. 26-28 September 2014, Flagstaff, USA.
Alrehaili SM; Atwell E Computational ontologies for semantic tagging of the Quran: A survey of past approaches in: LREC 2014 Proceedings. European Language Resources Association. 2014.
Alzahrani A; Atwell E Multimodality in Arabic virtual learning environments in: Multimodality in Language Research. 2014.
Sawalha M; Brierley C; Atwell E Automatically generated, phonemic Arabic-IPA pronunciation tiers for the Boundary Annotated Qur'an Dataset for Machine Learning (version 2.0) in: Proceedings of LRE-Rel 2: 2nd Workshop on Language Resource and Evaluation for Religious Texts, Reykjavik, Iceland, pp.42-42. 2014.
Sawalha M; Atwell ES; Abushariah M SALMA: Standard Arabic Language Morphological Analysis in: Proceedings ICCSPA International Conference on Communications, Signal Processing, and their Applications, pp.1-6. 2013.
Alfaifi AYG; Atwell ES Arabic Learner Corpus v1: A New Resource for Arabic Language Research in: Second Workshop on Arabic Corpus Linguistics. 2013.
Alfaifi AYG; Atwell E; Abuhakema G Error Annotation of the Arabic Learner Corpus: A New Error Tagset in: Language Processing and Knowledge in the Web, vol. 8105, pp.14-22. Springer. 2013.
Alfaifi AYG; Atwell E Potential Uses of the Arabic Learner Corpus in: Leeds Language, Linguistics and Translation PGR Conference 2013. University of Leeds. 2013.
Abbas N; Atwell ES Annotating the Arabic Quran with a classical semantic ontology in: Proceedings of WACL'2 Second Workshop on Arabic Corpus Linguistics. 2013.
Abbas N; Aldhubayi L; Al-Khalifa H; Alqassem Z; Atwell ES; Dukes K; Sawalha M; Sharaf M Unifying linguistic annotations and ontologies for the Arabic Quran in: Proc WACL2 Second Workshop on Arabic Corpus Linguistics, pp.13-13. 2013.
Sawalha M; Atwell ES Accelerating the processing of large corpora: using Grid Computing for lemmatizing the 176 million words Arabic Internet Corpus in: Proceedings of the 2nd Workshop of Arabic Corpus Linguistics WACL-2. 2013.
Atwell ES; Sawalha M Comparing morphological tag-sets for Arabic and English in: Proceedings of the 7th International Corpus Linguistics Conference CL2013. 2013.
First Year Of Impact 2013
Sector Creative Economy,Digital/Communication/Information Technologies (including Software),Education,Leisure Activities, including Sports, Recreation and Tourism,Culture, Heritage, Museums and Collections,Other
Impact Types Cultural,Societal

 
Description Quranic Arabic Corpus http://corpus.quran.com/ 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact More than 1 million hits (visits to the website)

Feedback on the website: see Quranic Arabic Corpus http://corpus.quran.com/
Year(s) Of Engagement Activity 2013,2014
URL http://corpus.quran.com/