Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction

Lead Research Organisation: Cardiff University

Department Name: School of English, Communication and Philosophy

Abstract

This project will create a major corpus of Welsh language: CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes: National Corpus of Contemporary Welsh). A corpus is a principled collection of language data sampled from real-life contexts, presented as a searchable database. This will be the first corpus to represent spoken, written and electronically-mediated Welsh, and the first in any language with a functional design informed, from the outset, by representatives of all anticipated academic and community user groups. CorCenCC will provide societal, economic and academic benefits by:
- Facilitating uses of Welsh in public, commercial, educational and governmental settings.
- Redefining the scope, relevance and design infrastructure of corpus development methodology.

A corpus allows users to identify and explore language as it is actually used, rather than relying on intuition or prescriptive accounts of how it 'should' be used. This evidence-based approach is used by academic researchers, lexicographers, teachers, language learners, assessors, resource developers, policy makers, publishers, translators and others, and is essential to the development of technologies such as predictive text production, word processing tools, machine translation, voice recognition and web search tools. Welsh has had no comprehensive corpus facility able to meet these requirements.

CorCenCC will capitalise on extensive community interest in sustaining and 'growing' Welsh, using the novel integration of crowdsourcing, a powerful data collection method which has the potential to revolutionize corpus construction. Recruited through social and broadcast media, roadshows and existing networks, Welsh speakers will record and upload their own data via a mobile app, and even contribute to data coding. This approach promises representative language across genres, language varieties (regional and social) and contexts. Traditional data collection will supplement the crowdsourcing, ensuring a representative balance of data as specified in the project targets.

Preliminary engagement with stakeholders (including a briefing event at the Senedd) generated collaboration from the Welsh Government, Welsh Language Commissioner, Welsh Joint Education Committee, Welsh for Adults, BBC, Gwasg y Lolfa press, and University of Wales Dictionary; all have identified current needs which CorCenCC can meet, and all will be represented in the project advisory group, so the corpus design is user-informed throughout. A language corpus able to inform delivery of Welsh has been called for by e.g. National Foundation for Educational Research (2008:48) and Welsh Government (2013:27,71). CorCenCC, with its integrated pedagogical toolkit, will impact significantly on Welsh language teaching practice, enabling data-driven, inductive learning and assessment.

CorCenCC will be open-source and publicly accessible, with user interfaces for specific groups. It will enable, for example, community users to investigate dialect variation or idiosyncrasies of their own language use; professional users to profile texts for readability or develop digital language tools; language learners to learn from real-life models of Welsh; and researchers to investigate patterns of language use and change. In order to ensure that CorCenCC remains a sustainable, permanent and user-oriented record of language, an in-built facility will allow data to be added and moderated beyond the life of the project.

The project team comprises experts in corpus linguistics, Welsh, and language pedagogy and assessment, who specialise in the application of linguistic tools to real world issues. Working with an advisory body of stakeholder representatives, they are optimally placed to meet the project aims: creating a permanent, sustainable and fit-for-purpose record of the living language, and pioneering an approach to content generation and user-driven applications that will provide a model for future corpus creation.

Planned Impact

CorCenCC will be a freely available resource under an open licence which, when combined with the user-driven design and construction, will maximise its potential impact, enabling it to inform the work and activities of current and future users of Welsh in a number of critical areas, including:
- Second language teaching and learning: Reports on the teaching of Welsh for Adults (Mac Giolla Chriost et al., 2012; Welsh Government report, 2013) have drawn attention to the need for a corpus of contemporary Welsh. CorCenCC will meet this need, informing curriculum writing, language assessment and language learning resources as similar corpora do effectively in English (e.g. the Cambridge English Corpus (CEC) which informs Cambridge English Language teaching resources, the British National Corpus (BNC) which informs Pearson Longman's resources). CorCenCC will facilitate data-driven learning, enhancing the effectiveness of teaching Welsh as a second language (compulsory in all schools in Wales up to the end of Key Stage 4).
- The Welsh Government and National Assembly of Wales (Language Policy): CorCenCC will facilitate the realisation of action points in the Welsh Language Commissioner's strategy relating to digital content and applications, translation, terminology, language planning and research. These reflect the priorities of the Welsh Government in its Welsh Language Strategy for 2012-17 'A living language: a language for living'.
- The translation industry in Wales: CorCenCC outputs fit with the mid-term development of Microsoft Translate software: preliminary research (Screen, 2014) shows that example-based machine translation alone can improve the productivity of human translators by up to 55%, and by contributing to an eventual hybrid machine translation system, CorCenCC could further improve translation efficiency.
- The media in Wales: CorCenCC will increase the accessibility of the content of Welsh language media across all platforms and, by ensuring the language is appropriately pitched, will encourage more people to interface with the media in Welsh. CorCenCC offers TV and radio broadcasters the potential to produce language guidelines similar to those developed by Catalan language broadcaster TV3. BBC Cymru Wales is working with CorCenCC to provide data and to ensure that it can inform their work on all media platforms.
- Welsh language publishers and lexicographers: CorCenCC provides the means to target content at audiences of different reading abilities and enhance the language tools available to authors for constructing graded readers. It will enable the commissioning of dictionaries of modern Welsh based on actual language use (see letters of support E and D from University of Wales Dictionary and Gwasg y Lolfa).
- Language technology companies: a core requirement for companies using web-based and online social media data is a large, high-quality training corpus, and CorCenCC will provide this. Data analytics and big data are predicted to account for cumulative benefits of £216 billion to the UK economy between 2012 and 2017. Availability of CorCenCC will help to stimulate related research in Wales and for Welsh textual analytics.
- In the public domain: Via the project engagement strategy (National Eisteddfod interaction, short story competitions, etc.), and facilitated by the crowdsourcing approach, future users will be directly involved in the construction and design of the corpus to ensure it is user-friendly, accessible and appropriate to their needs. This will build on existing interest in Welsh language and heritage, to foster community 'ownership' of the corpus.

Funded Value:

£1,440,883

Funded Period:

Mar 16 - Nov 20

Funder:

ESRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

ES/M011348/1

Principal Investigator:

Dawn Knight

Research Subject:

Languages & Literature (16%)

Linguistics (80%)

Research Topic:

Applied Linguistics (16%)

Celtic Studies (16%)

Corpus Linguistics (16%)

Language Acquisition (32%)

Sociolinguistics (16%)

Organisations

People	ORCID iD
Dawn Knight (Principal Investigator)
Enlli Thomas (Co-Investigator)
Irena Spasic (Co-Investigator)
Jeremy Evas (Co-Investigator)
Alexander Lovell (Co-Investigator)
Jonathan Morris (Co-Investigator)
Edmund Stonelake (Co-Investigator)
Steven Dyfrig Morris (Co-Investigator)
Paul Rayson (Co-Investigator)
Tess Fitzpatrick (Co-Investigator)
Scott Piao (Researcher)
Jennifer Needs (Researcher)

Publications

Author Name

Title Publication Date Published

Corcoran P (2021) Creating Welsh Language Word Embeddings in Applied Sciences

El-Haj M (2022) Creation of an evaluation corpus and baseline evaluation scores for Welsh text summarisation

Espinosa-Anke L (2021) English-Welsh Cross-Lingual Embeddings in Applied Sciences

Ezeani I (2022) Introducing the Welsh Text Summarisation Dataset and Baseline Systems

Ezeani I (2019) Leveraging Pre-Trained Embeddings for Welsh Taggers

Fitzpatrick T (2016) Creating pedagogical wordlists: a comparison of thematic and corpus approaches

Fitzpatrick T (2016) Creating pedagogical wordlists without a corpus: a replication study for the Welsh L2 curriculum

Knight D (2022) Developing vocabulary lists for adult learners of Welsh: a user-driven iterative approach (conference presentation)

Knight D (2018) CorCenCC: applying the sociolinguistics of new speakers within a contemporary corpus of Welsh

Knight D (2021) Building a National Corpus - A Welsh Language Case Study

Key Findings
Impact Summary
Policy Influence
Further Funding
Research Databases and Models
Collaboration
Software and Technical Products
Engagement Activities


Description	CorCenCC is an inter-disciplinary and multi-institutional project that has created a large-scale, open-source corpus of contemporary Welsh. A corpus, in this context, is a collection of spoken, written and/or e-language examples from real-life contexts that allows users to identify and explore language as it is actually used, rather than relying on intuition or prescriptive accounts of how it 'should' be used. Corpora let us investigate how we use language across different genres and communicative mediums (i.e. spoken, written or digital), and how it varies according to the speaker/writer and the communicative purpose. This evidence-based approach is used by academic researchers, lexicographers, teachers, language learners, assessors, resource developers, policy makers, publishers, translators and others, and is essential to the development of technologies such as predictive text production, word processing tools, machine translation, speech recognition and web search tools.
CorCenCC is the first corpus of the Welsh language that covers all three aspects of contemporary Welsh: spoken, written and electronically mediated (e-language). It offers a snapshot of the Welsh language across a range of contexts of use, e.g. private conversations, group socialising, business and other work situations, in education, in the various published media, and in public spaces. It includes examples of news headlines, personal and professional emails and correspondence, academic writing, formal and informal speech, blog posts and text messaging. Language data was sampled from a range of different speakers and users of Welsh, from all regions of Wales, of all ages and genders, with a wide range of occupations, and with a variety of linguistic backgrounds (e.g. how they came to speak Welsh), to reflect the diversity of text types and of Welsh speakers found in contemporary Wales (with the corpus including data from over 2000 contributors).
In this way, the CorCenCC corpus provides the means for empowering users of Welsh to better understand and observe the language across diverse settings, and creates a solid evidence base for the teaching of contemporary Welsh to those who aspire to use it. Over time, the corpus has the potential to make a significant contribution to the transformation of Welsh as the language of public, commercial, education and governmental discourse.
Work on the CorCenCC project was distributed across six coordinated work packages (WPs), each with specific tasks, aims and objectives. Led by Knight, WP0 attended to the on-going design, scoping and training activities, and involved all members of the project team. The other WPs were:
- WP1: Collect, transcribe and anonymise the data
- WP2: Develop the part-of-speech tag-set/tagger
- WP3: Develop a semantic tagger for Welsh and semantically tag all data
- WP4: Scope, design and construct Y Tiwtiadur
- WP5: Construct the infrastructure to host CorCenCC and build the corpus
While work was distributed across these work packages, colleagues had a mutual understanding of the shared vision for the project and worked collaboratively to achieve it, with a considerable measure of interdependence between WPs that required discussion and coordination. For example, WP3 built on the research undertaken in WP1 for corpus collection, and employed WP2's part-of-speech (POS) tagger as a first step in the semantic analysis of the Welsh language data. WP3's output then fed into WP4 for the online pedagogic toolkit (Y Tiwtiadur), which used the multiple levels of corpus annotation to improve the engagement with and affordances of the toolkit for teachers and learners. Additionally, WP3's semantically tagged corpus fed directly into the corpus infrastructure developed in WP5.
Key outputs/contributions from each WP follow.
WP1: The main contribution of WP1 is the 11.2 million words of data that form the core of the corpus.
In addition, the following range of resources, created as part of the data generation process, help achieve one of the stated aims of the CorCenCC project: to increase capacity and expand the interface between the Welsh language (and by extension other minoritised languages around the world) and the discipline of applied linguistics (including, in particular, corpus linguistics, sociolinguistics, and language planning and policy):
- A sampling frame for the creation of a general corpus of a minority language;
- A definition of 'inappropriate language' suitable for the context of a minority language where speakers are all bilinguals;
- A bespoke set of transcription conventions which can be applied to contemporary spoken Welsh;
- A team of Welsh-speaking research assistants who have been trained in the principles of corpus creation, who have had the opportunity to work with international experts across the world, and who are able to apply their skills to future projects.
WP2: Key contributions include:
- The CorCenCC POS tagset (https://cytag.corcencc.org/tagset?lang=en)
- A gold-standard evaluation corpus. A gold-standard corpus is one that has been manually annotated and checked by multiple individuals. This effectively provides a model that can train and evaluate the automated (computerised) approach. This gold-standard evaluation corpus has also been released for other researchers to use in the development of their own tools.
- The CyTag website (i.e. the CorCenCC Welsh language tagger): (https://cytag.corcencc.org)
- CyTag on GitHub (https://github.com/CorCenCC/CyTag)
WP3: A key contribution of the WP3 research is the freely available software tools and linguistic resources which augment the resource bank for Welsh language analysis and text mining. The CySemTag Java code is released on GitHub and has been incorporated into the Wmatrix (Rayson et al., 2004) corpus annotation and analysis system.
This system is very widely used in corpus linguistics, and the integration means that future researchers can use it for Welsh corpora. Overall, the work in WP3 has extended the scope of research in corpus and computational linguistics in at least two ways. Firstly, it has demonstrated a method for effectively extending semantic analysis techniques to the specific challenges of the Welsh language. Secondly, it has shown that crowdsourcing methods can be used to contribute to the development of such resources.
WP4: Through the development of a pedagogical interface (Y Tiwtiadur, which includes Gap Filling (Cloze), Vocabulary Profiler, Word Identification and Word-in-Context exercises), led by the concept of data-driven learning and assessment, WP4 has contributed:
- a new pedagogical resource, that is
- drawn from an online corpus of contemporary Welsh, of a kind that
- has never existed for the teaching of Welsh before, and that
- can serve as a model for similar work with other minority languages.
WP4 has made an invaluable contribution to language teaching and learning and, being open-source, is available to support users in their inductive learning, irrespective of age, ability level and geographical location. The resource offers a new and unique opportunity for schools in Wales to embrace the concept of Data Driven Learning and to engage with and develop their own corpus-led pedagogies. In addition to this, teachers and learners, once introduced to the corpus via Y Tiwtiadur, might also feel confident enough to explore the corpus in other ways via the main CorCenCC query tools.
WP5: This work package constructed a new corpus infrastructure and an innovative crowdsourcing data collection app. The infrastructure had to be built before the texts we had collected could be suitably indexed for entry into the corpus.
The key pillars of the infrastructure included a framework that supports metadata collection, the mobile app for collecting spoken data (utilising a crowdsourcing approach), a backend database that stores curated data and a web-based interface that allows users to query the data online. By using Welsh language tags, we have ensured that the corpus is not, and cannot be perceived as, an external (English) tool superimposed onto Welsh, but rather belongs to Wales and the Welsh language.
Exploitation Route	The CorCenCC corpus and associated software tools are licensed under Creative Commons CC-BY-SA v4 and thus are freely available for use by professional communities and individuals with an interest in language. The tools/dataset are available via the following sites:
- To access the corpus visit: www.corcencc.org/explore
- To access the GitHub site: https://github.com/CorCenCC
These tools can be adapted by others when creating their own corpora. We were particularly committed to supporting the building of corpora for other minority languages, and the user-driven model utilised throughout the project directly informs such projects by providing a template for corpus development in any other language.
The 11.2-million-word CorCenCC dataset itself is designed to enable, for example, community users to investigate dialect variation or idiosyncrasies of their own language use; professional users to profile texts for readability or develop digital language tools; Welsh language learners to draw on real-life models of Welsh; and researchers to investigate patterns of language use and change. The corpus is also anticipated to reveal new insights into the vocabulary and language patterns of Welsh and to serve as a major resource for teaching the Welsh language to both those who have it as their first language and new speakers of it. This multifaceted impact potential has been made possible by CorCenCC's significant contribution at the methodological level, in extending the scope, relevance and design infrastructure of language corpora.
Specifically, the project has involved the development of important new tools and processes, including a unique user-driven corpus design in which language data was collected and validated through crowdsourcing, and an in-built pedagogic toolkit (Y Tiwtiadur) developed in consultation with representatives of all anticipated academic and community user groups (for a detailed discussion of CorCenCC's user-driven design, see Knight et al., 2021a, Knight et al., 2021b).
Every part of the project (as characterised by the work packages) had valuable applications that offer societal, economic and/or academic benefits. At a societal level, the CorCenCC corpus provides the opportunity to understand Welsh as a living language in use. In economic terms, the corpus offers scope to develop valuable new resources for Welsh learners and users, including potential for a corpus-based dictionary and a range of data-informed technological tools that might include language learning apps, predictive text production, word processing tools, machine translation, speech recognition and web search tools. Indirectly, the support of the corpus for these social and economic outcomes will promote the recognition of Welsh as a significant element of the UK and world linguistic landscape.
The potential academic applications are broad and varied, including applications in the following fields: corpus linguistics; language acquisition and bilingualism; sociolinguistics, dialectology and morphosyntactic analyses; language planning; lexicography; computer and translation technology. Potential non-academic practitioner and professional domains include: second language teaching and learning; The Welsh Government and National Assembly of Wales (Language Policy); the translation industry in Wales; the media in Wales; Welsh language publishers and lexicographers; language technology companies. For more detail please refer to the project report, here: http://orca.cf.ac.uk/135540/
Sectors	Communities and Social Services/Policy,Creative Economy,Digital/Communication/Information Technologies (including Software),Education,Culture, Heritage, Museums and Collections
URL	https://www.corcencc.org/outputs/
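The WP2 description above says that a gold-standard corpus is used to evaluate the automated tagger. As a minimal sketch of what such an evaluation could look like, the Python below compares predicted tags with gold tags token by token and reports accuracy plus the most common confusions. The function name `evaluate_tagger`, the sample sentence and the tag labels are illustrative assumptions only; they are not taken from the CorCenCC tagset or the CyTag code.

```python
from collections import Counter


def evaluate_tagger(gold, predicted):
    """Return (accuracy, confusions) for two aligned lists of (token, tag) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of tokens")
    correct = 0
    confusions = Counter()  # maps (gold_tag, predicted_tag) to a count of errors
    for (gold_token, gold_tag), (pred_token, pred_tag) in zip(gold, predicted):
        if gold_token != pred_token:
            raise ValueError("token sequences are not aligned")
        if gold_tag == pred_tag:
            correct += 1
        else:
            confusions[(gold_tag, pred_tag)] += 1
    accuracy = correct / len(gold) if gold else 0.0
    return accuracy, confusions


# Hypothetical data: placeholder tags, not the CorCenCC POS tagset.
gold = [("Mae", "VERB"), ("hi", "PRON"), ("yn", "PART"), ("bwrw", "VERB"), ("glaw", "NOUN")]
predicted = [("Mae", "VERB"), ("hi", "PRON"), ("yn", "PART"), ("bwrw", "NOUN"), ("glaw", "NOUN")]

accuracy, confusions = evaluate_tagger(gold, predicted)
print(f"accuracy: {accuracy:.2f}")
for (gold_tag, pred_tag), count in confusions.most_common():
    print(f"gold {gold_tag} tagged as {pred_tag}: {count}")
```

Listing confusions alongside the overall score is a common design choice because it shows which tag pairs a tagger mixes up, which a single accuracy figure hides.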


Description	Data from CorCenCC and Yr Amliadur (the Welsh language frequency lists, which were based on the CorCenCC dataset) have been used by the National Centre for Learning Welsh to ensure that the top 100 words of the Welsh language are included in their Entry level (A1) course books for beginners (revisions of which were published in January 2021, so the data was extracted/used in late 2020). The National Centre for Learning Welsh was set up in 2016 to provide strategic leadership for the Welsh for Adults sector in Wales. In 2018-19 there were 13,260 unique learners of Welsh taught by around 500 part-time and full-time tutors.
We were awarded ESRC IAA funding (details on main Research Fish submission, 2021) to extend and refine these lists and to create online resources for learners of Welsh. In this project we extended the wordlists to include the 500 highest frequency words in Welsh, to provide more in-depth lexical information for learners at the Foundation (A2) level. For this frequency-based list to be a more effective and targeted resource, we also sought to bridge the gap between a purely corpus-based wordlist and one which required input from end-users themselves. To this end, the project team established a novel user-driven iterative approach to the development of our corpus-and-user informed vocabulary list, Geirfan. This approach can be replicated and adapted for use in any other language context. The Geirfan vocabulary list is currently being used by NCLW to revamp their 2023 A2-level coursebook. This is the first time ever that the Learning Welsh curriculum has been based on frequency information from a corpus. With an estimate of circa 12,500 students engaging with the Centre's courses at A1 and A2 levels, the impact and reach of this work is extensive. Geirfan is to be published online (with its own DOI), in the same way that the original Yr Amliadur was.
As an extension to this project, we have also developed a prototype online dictionary to partner/support the resource (currently containing 60 items, but to be extended to the 500-item list, subject to funding). This is available at: https://geirfan.cymru/alpha_index.html
In the dictionary version of Geirfan, we aim to include items from the Geirfan vocabulary list, with level-appropriate examples from CorCenCC itself, in addition to expanded information on collocations (i.e. words that commonly co-occur with the target word), etymological information, pronunciation details and other relevant information identified through consultation with the project partners. We intend the Geirfan website to be openly available for all learners and users of the Welsh language. This resource has the potential to be extended in the future, should wordlists for students at B1/B2 (intermediate) and C1/C2 (advanced) levels be developed (note: if this were to happen, these wordlists would again be used by NCLW in their future curricula and materials).
First Year Of Impact	2020
Sector	Education
Impact Types	Cultural,Policy & public services


Description	Membership of the Panel for Standardisation of the Welsh Language
Geographic Reach	National
Policy Influence Type	Membership of a guideline committee


Description	Reference to CorCenCC in Welsh Government's rapid review of the National Centre for Learning Welsh
Geographic Reach	National
Policy Influence Type	Citation in other policy documents
Impact	For the first time ever, Welsh has a contemporary corpus from which CEFR based frequency lists can be created to inform the teaching of the language to adult learners.
URL	https://gov.wales/rapid-review-national-centre-learning-welsh-html


Description	Reference to the project in the section on 'Linguistic Infrastructure' in the Welsh Government policy/work programme document 'Cymraeg 2050: A million Welsh speakers - work programme 2017-21'. This emphasises the link between CorCenCC and the programme to develop more Welsh language technology tools and resources.
Geographic Reach	National
Policy Influence Type	Citation in other policy documents


Description	Learning English-Welsh bilingual embeddings and applications in text categorisation
Amount	£90,000 (GBP)
Organisation	Government of Wales
Sector	Public
Country	United Kingdom
Start	05/2020
End	04/2021


Description	British Council Funding (for the CorCenCC launch)
Amount	£2,000 (GBP)
Organisation	British Council
Sector	Charity/Non Profit
Country	United Kingdom
Start	02/2017
End	04/2017


Description	Cardiff University CUROP (Cardiff University Research Opportunity) internal funding. Project name: 'Corpws Cenedlaethol Cymraeg Cyfoes: National Corpus of Contemporary Welsh - a focus on spoken data'
Amount	£2,100 (GBP)
Organisation	Cardiff University
Sector	Academic/University
Country	United Kingdom
Start	07/2018
End	08/2018


Description	Cardiff University CUROP (Cardiff University Research Opportunity) internal funding. Project name: 'Corpws Cenedlaethol Cymraeg Cyfoes: National Corpus of Contemporary Welsh - semantic tagging and data annotation'
Amount	£2,100 (GBP)
Organisation	Cardiff University
Sector	Academic/University
Country	United Kingdom
Start	07/2018
End	08/2018


Description	Competitive commission from Welsh Government to provide a rapid evidence assessment of effective second language teaching approaches and methods
Amount	£24,992 (GBP)
Funding ID	Contract 171802
Organisation	Government of Wales
Sector	Public
Country	United Kingdom
Start	10/2017
End	03/2018


Description	Creating vocabulary lists from CorCenCC (National Corpus of Contemporary Welsh) [ESRC IAA funding]
Amount	£14,988 (GBP)
Organisation	Economic and Social Research Council
Sector	Public
Country	United Kingdom
Start	05/2021
End	09/2022


Description	Cymraeg 2050 2017-2018 Grant Scheme GC2050/17-18/20: Welsh WordNet
Amount	£19,964 (GBP)
Organisation	Government of Wales
Sector	Public
Country	United Kingdom
Start	01/2018
End	04/2018


Description	ESRC DTP Collaborative Studentship - Welsh and Applied Linguistics : ESRC Wales Doctoral Training Partnership PhD Studentship "Strategic bilingualism: identifying optimal context for Welsh as a second language in the curriculum"
Amount	£81,253 (GBP)
Funding ID	2096320
Organisation	Economic and Social Research Council
Sector	Public
Country	United Kingdom
Start	10/2018
End	09/2021


Description	FreeTxt: supporting bilingual free-text survey and questionnaire data analysis
Amount	£80,647 (GBP)
Funding ID	AH/W004844/1
Organisation	Arts & Humanities Research Council (AHRC)
Sector	Public
Country	United Kingdom
Start	01/2022
End	01/2023


Description	Get Creative with Cymraeg
Amount	£20,000 (GBP)
Organisation	Government of Wales
Sector	Public
Country	United Kingdom
Start	01/2018
End	04/2018


Description	RIAH - Research Institute for Arts and Humanities, Swansea University Funding (for the CorCenCC launch)
Amount	£1,000 (GBP)
Organisation	Swansea University
Sector	Academic/University
Country	United Kingdom
Start	02/2017
End	04/2017


Description	School Research and Innovation Fund (for the CorCenCC launch)
Amount	£1,500 (GBP)
Organisation	Cardiff University
Department	School of English, Communication & Philosophy
Sector	Academic/University
Country	United Kingdom
Start	02/2017
End	03/2017


Description	Swansea University: SPIN (Swansea paid internship) placement for data collection, transcription and interviewing of teachers/tutors 2017-18
Amount	£1,200 (GBP)
Organisation	Swansea University
Sector	Academic/University
Country	United Kingdom
Start	03/2018
End	08/2018


Description	Welsh Automatic Text Summarisation
Amount	£90,000 (GBP)
Organisation	Government of Wales
Sector	Public
Country	United Kingdom
Start	05/2021
End	04/2022


Description	Welsh Government Technology Funding - funding for the Welsh Stemmer project
Amount	£20,000 (GBP)
Organisation	Government of Wales
Sector	Public
Country	United Kingdom
Start	01/2019
End	04/2019


Description	Welsh for Adults - B1 Canolradd core vocabulary research project
Amount	£1,968 (GBP)
Funding ID	Project 102497
Organisation	Welsh Joint Education Committee
Sector	Academic/University
Country	United Kingdom
Start	01/2018
End	03/2018


Description	Welsh language processing infrastructure: Welsh word embeddings
Amount	£90,000 (GBP)
Organisation	Government of Wales
Sector	Public
Country	United Kingdom
Start	08/2019
End	04/2020


Title	CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes - the National Corpus of Contemporary Welsh
Description	The CorCenCC corpus contains over 11 million words (circa 14.4m tokens) from written, spoken and electronic (online, digital texts) Welsh language sources, taken from a range of genres, language varieties (regional and social) and contexts. The contributors to CorCenCC are representative of the over half a million Welsh speakers in the country. The creation of CorCenCC was a community-driven project, which offered users of Welsh an opportunity to be proactive in contributing to a Welsh language resource that reflects how Welsh is currently used. To make CorCenCC as representative of contemporary Welsh as possible, the project team designed a bespoke sampling framework. Extracts were collected from sources including, for example, journals, emails, sermons, road signs, TV programmes, meetings, magazines and books. Conversations were recorded by the research team, and a specially designed crowdsourcing app (see: https://www.corcencc.org/app/) enabled Welsh speakers in the community to record and upload samples of their own language use to the corpus. The published corpus therefore contains data from Welsh speakers from all kinds of backgrounds, abilities and contexts, capturing how Welsh is truly used today across the country. Beta versions of some bilingual corpus query tools have also been created as part of the CorCenCC project (see: www.corcencc.org/explore). These include simple query, full query, frequency list, n-gram, keyword and collocation functionalities. The CorCenCC website also contains Y Tiwtiadur, a collection of data-driven teaching and learning tools designed to help supplement Welsh language learning at all ages and levels. Y Tiwtiadur contains four distinct corpus-based exercises: Gap Filling (Cloze), Vocabulary Profiler, Word Identification and Word-in-Context (see: https://www.corcencc.org/y-tiwtiadur/).
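To give a sense of what the n-gram and collocation functionalities listed above compute, here is a minimal Python sketch. It is not the code behind the CorCenCC query tools: the function names `ngrams` and `collocates`, the window size and the sample tokens are assumptions made for illustration, and real collocation tools usually add an association measure on top of raw co-occurrence counts.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def collocates(tokens, node, window=2):
    """Count the words that occur within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical, pre-tokenised sample text (not corpus data).
tokens = "mae hi yn bwrw glaw ac mae hi yn oer".split()

bigram_counts = Counter(ngrams(tokens, 2))
hi_collocates = collocates(tokens, "hi", window=1)

print(bigram_counts.most_common(2))
print(hi_collocates)
```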
The CorCenCC project was led by Dawn Knight (KnightD5@cardiff.ac.uk), at the Centre for Language and Communication Research, Cardiff University. The full project team comprised 1 Principal Investigator (PI: Dawn Knight) and 2 Co-Investigators (CIs: Steve Morris and Tess Fitzpatrick), who together made up the CorCenCC Management Team, plus a total of 7 other CIs and 8 Research Assistants/Associates over the course of the project. In addition, there were 11 advisory board members, 6 consultants (from 4 countries around the world), 2 PhD students, 4 undergraduate summer placement students, 4 professional service support staff, 4 project ambassadors and 2 project volunteers. More information can be found on the project website: www.corcencc.org
Dataset: The CorCenCC dataset includes 14,338,149 tokens (circa 11.2 million words). The data in CorCenCC represents a wide range of contexts, genres and topics. This data has, as far as possible, been anonymised using a combination of manual and automated techniques, and has been fully tagged in terms of part-of-speech (POS) and semantic categories. The POS and semantic tagging was carried out using the CyTag and SemCyTag tools, available from CorCenCC's GitHub site: https://github.com/CorCenCC
The following files are included in this dataset:
- categorisation_guide: guide to interpreting columns in CorCenCC's corpus tables/files.
- categorization: links each contribution_id to specific taxonomy_id values (from the corpus design frame). Refer to the taxonomy file for details.
- complete_corpus: zipped folder containing all individual contribution files (data is fully POS and semantic tagged).
- contrib_links: links each contributor_id to individual contributions.
- contribution: list of all contributions in the corpus (linking to specific modes).
- contributor: contributor metadata for the complete corpus.
- corpus_data: fully POS and semantically tagged CorCenCC corpus data.
- electronic: metadata associated with each contribution_id (electronic mode).
- spoken: metadata associated with each contribution_id (spoken mode).
- taxonomy: metadata taxonomy guide, used as a basis for classifying contributions according to their genre, context, location, target audience, topic, who (i.e. interlocutors), and source.
- written: metadata associated with each contribution_id (written mode).
Type Of Material	Database/Collection of data
Year Produced	2020
Provided To Others?	Yes
Impact	CorCenCC enables users to investigate how we use language across different genres and communicative mediums (i.e. spoken, written or digital), and how it varies according to the speaker/writer and the communicative purpose. This evidence-based approach is used by academic researchers, lexicographers, teachers, language learners, assessors, resource developers, policy makers, publishers, translators and others, and is essential to the development of technologies such as predictive text production, word processing tools, machine translation, speech recognition and web search tools. CorCenCC is the first corpus of the Welsh language that covers all three aspects of contemporary Welsh: spoken, written and electronically mediated (e-language). It offers a snapshot of the Welsh language across a range of contexts of use, e.g. private conversations, group socialising, business and other work situations, in education, in the various published media, and in public spaces.
URL	https://research.cardiff.ac.uk/converis/portal/detail/Dataset/119878310?auxfun=&lang=en_GB


Title	The Geirfan wordlist: A Vocabulary list for adult learners of Welsh
Description	The Geirfan wordlist is a curated list of 500 of the most frequent words in the Welsh language, designed for use by learners at A1/A2[1] levels of proficiency (Council of Europe, 2021). This vocabulary list was developed using an innovative symbiosis of corpus-based methods (using data from the CorCenCC corpus) and expert-led introspection and reflection; an approach which can be replicated and adapted for use in any other language context. The lists are included in the appendices of the Geirfan document, comprising the following: Appendix A contains the most frequent 750 words from CorCenCC, the result of tagging the corpus with CyTag2. This was the list that was curated by the project team, resulting in the basic 500-word wordlist (Appendix B). Appendix B contains the basic 500-word list, without additions. These 500 words are those which are directly drawn from CorCenCC's frequency data. Appendix C contains the working list of additions, as an alphabetical list. The 500-word basic list plus these additions is the initial batch of headwords for the dictionary on the Geirfan website. Full details on how to interpret the lists are included in the main body of the documentation. [1] A2 refers to the Common European Framework of Reference for Languages (CEFR) basic user, waystage level. All references to levels in this paper are as defined by CEFR and in the context of Welsh. See https://www.wjec.co.uk/qualifications/welsh-for-adults-qualification-suite/#tab_overview [Accessed 26.08.22]
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
Impact	The Geirfan vocabulary list is currently being used by the National Centre for Learning Welsh (NCLW) to revamp their 2023 A2-level coursebook. This is the first time ever that the Learning Welsh curriculum has been based on frequency information from a corpus. With an estimate of circa 12,500 students engaging with the Centre's courses at A1 and A2 levels, the impact and reach of this work is extensive.
URL	https://research.cardiff.ac.uk/converis/portal/detail/Dataset/234583226?auxfun=&lang=en_GB


Title	Yr Amliadur: Frequency Lists for Contemporary Welsh
Description	Yr Amliadur contains the following sample frequency lists of contemporary Welsh language usage:
- All frequency data, sorted alphabetically (Excel file)
- All frequency data, in frequency order (Excel file)
- The most frequent 5000 words, with separate sheets for each 500-word frequency band (Excel file)
- PDF file containing the following lists:
  - Top 100 words in CorCenCC (rank ordered list)
  - Top 1000 words in CorCenCC (ordered alphabetically)
  - Top 100 lemmas in CorCenCC (rank ordered list)
  - Top 1000 lemmas in CorCenCC (ordered alphabetically)
  - Top 100 lemmas in CorCenCC (open-class words only)
  - Top 1000 words in CorCenCC (open-class words only; ordered alphabetically)
  - Top 500 nouns in CorCenCC (rank ordered list)
  - Top 500 verbs in CorCenCC (rank ordered list)
  - Top 500 adjectives in CorCenCC (rank ordered list)
  - Top 50 adverbs in CorCenCC (rank ordered list)
  - Top 50 interjections in CorCenCC (rank ordered list)
  - Top 100 open-class words in the written component of CorCenCC (rank ordered list)
  - Top 100 open-class words in the spoken component of CorCenCC (rank ordered list)
  - Top 100 open-class words in the e-language component of CorCenCC (rank ordered list)
The sample frequency lists are based on CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes - National Corpus of Contemporary Welsh; Knight et al., 2020), which includes 14,338,149 tokens (circa 11.2 million words). The data in CorCenCC represents a wide range of contexts, genres and topics and has, as far as possible, been anonymised using a combination of manual and automated techniques, and fully tagged in terms of part-of-speech (POS) and semantic categories.
Type Of Material	Database/Collection of data
Year Produced	2020
Provided To Others?	Yes
Impact	Yr Amliadur enables individuals to identify some of the most frequently used words in the Welsh language, which is invaluable for creating, for example, teaching and linguistic reference materials.
URL	https://research.cardiff.ac.uk/converis/portal/detail/Dataset/120164107?auxfun=&lang=en_GB


Description	BBC
Organisation	British Broadcasting Corporation (BBC)
Department	BBC Cymru Wales
Country	United Kingdom
Sector	Charity/Non Profit
PI Contribution	BBC Wales have become official partners of the project (a collaborative contract has been signed). BBC Wales will provide extensive amounts of data for us to use on the project and we will involve them in our user-driven consultations regarding the design and construction of CorCenCC.
Collaborator Contribution	BBC Wales will provide extensive amounts of data for us to use on the project and we will involve them in our user-driven consultations regarding the design and construction of CorCenCC.
Impact	BBC contributed data which has been included in the released version of the CorCenCC corpus.
Start Year	2017


Description	S4C
Organisation	S4C
Country	United Kingdom
Sector	Private
PI Contribution	S4C have become official partners of the project (a collaborative contract has been signed). S4C will provide extensive amounts of data for us to use on the project and we will involve them in our user-driven consultations regarding the design and construction of CorCenCC.
Collaborator Contribution	S4C will provide extensive amounts of data for us to use on the project and we will involve them in our user-driven consultations regarding the design and construction of CorCenCC.
Impact	S4C contributed data which has been included in the released version of the CorCenCC corpus.
Start Year	2016


Title	CorCenCC crowdsourcing app
Description	As part of the CorCenCC (National Corpus of Contemporary Welsh) project, the CorCenCC Crowdsourcing Application has been designed to allow Welsh speakers to record conversations between themselves and others across a range of contexts and to upload them for inclusion in the final corpus. Crowdsourced corpus data is a relatively new direction that complements more traditional language data collection methods, and is ideally suited to the positive community spirit that exists among speakers and learners of the Welsh language. Using our Crowdsourcing Application, Welsh speakers can engage with the CorCenCC project easily and at their own convenience. Users are able to:
- Create and adjust a user profile based around the context of their Welsh language background,
- Make audio and video recordings of their Welsh language conversations and exchanges,
- Include focused additional information about recordings as metadata,
- Upload recordings for inclusion in CorCenCC - the National Corpus of Contemporary Welsh.
In making contributions to the corpus a much more personal experience, the CorCenCC team wants to give users ownership and control of their own language data, and the opportunity to share the most natural and accurate representation possible of their Welsh in the contemporary context with the new National Corpus.
Type Of Technology	Webtool/Application
Year Produced	2017
Impact	The app has only just been released - too early to comment.
URL	https://itunes.apple.com/gb/app/ap-torfoli-corcencc/id1199426082


Title	CorCenCC full corpus query tools
Description	The beta version of CorCenCC's full bilingual corpus query tools (which are accompanied by a complete user guide) includes the following functionalities:
- Simple Query: explores any word and/or lemma form in the corpus, and one or many part-of-speech (POS) tags, mutation types, or semantic category tags of a specific word and/or lemma. A randomised selection of results is presented in a KWIC (Key Word in Context) output. Results can then be filtered by mode, geographical area, context, genre, topic, target audience and source.
- Full Query: searches for longer sequences of patterns (multi-word expressions) separated by spaces, using CorCenCC's bespoke query syntax. Results are presented in a KWIC (Key Word in Context) output, which can be filtered according to mode, geographical area, context, genre, topic, target audience and source.
- Frequency List: produces a list of words or lemmas in the corpus, ranked according to frequency of occurrence.
- N-Gram Analysis: lists patterns of n-grams/clusters of 2-7 words, lemmas, or POS tags in the corpus, ranked according to frequency of occurrence.
- Keyword Analysis: displays words that are unusually frequent in one sub-set of the corpus compared with a different 'reference' sub-set of the corpus.
- Collocation Analysis: displays information on the relationships between word types that appear together within a given context window.
All data in CorCenCC has been fully tagged in terms of part-of-speech (POS) and semantic category. These tags are fully searchable within the corpus and, in the case of Simple and Full Queries, POS tags are also colour coded to ease the examination of patterns in query results. All data is also categorised according to its context of use, genre, topic etc., enabling users to examine patterns within/across specific types of text and demographic information in the corpus.
Details of the tags and taxonomies used are available in the user guide on the main query tools page and via CorCenCC's GitHub site.
Type Of Technology	Webtool/Application
Year Produced	2020
Impact	A group of corpus linguists evaluated the usability and functionality of the CorCenCC web-based query interface. This process of evaluation involved a combination of questionnaires and talk-aloud exercises. Overall, the participants found the system useful in terms of meeting their information needs within the scope of their professional activities. The functionality was easy to understand without having to resort to help screen assistance. All participants agreed that they were likely to adopt the system and recommend it to other linguists.
URL	https://corpus.corcencc.org/?language=en
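As an illustration of the Frequency List and N-Gram Analysis functionalities described above, the following is a minimal Python sketch of how ranked frequency and n-gram lists can be computed from a list of tokens. It is not the CorCenCC implementation: the function names `frequency_list` and `ngram_list`, the alphabetical tie-breaking rule and the toy token list are assumptions made for this example. Only the 2-7 word range for n-grams is taken from the description.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Sequence, Tuple


def frequency_list(items: Iterable[Hashable]) -> List[Tuple[Hashable, int]]:
    """Rank items by descending count; ties are broken alphabetically (an assumption)."""
    counts = Counter(items)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


def ngram_list(tokens: Sequence[str], n: int) -> List[Tuple[Hashable, int]]:
    """Rank all contiguous n-grams (2 <= n <= 7) by descending count."""
    if not 2 <= n <= 7:
        raise ValueError("n must be between 2 and 7")
    # Slide a window of width n over the token sequence.
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return frequency_list(grams)


if __name__ == "__main__":
    # Invented toy tokens, not drawn from the corpus.
    toy_tokens = "mae hi yn braf heddiw ac mae hi yn oer".split()
    print(frequency_list(toy_tokens)[:3])
    print(ngram_list(toy_tokens, 2)[:2])
```

The same functions could be applied to lemmas or POS tags instead of word forms by passing a different token sequence.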


Title	CyTag - Welsh Part of Speech Tagger
Description	CyTag is an innovative Welsh tagger (complete with bespoke tagset) designed and constructed for the project. It is being used in conjunction with the semantic tagger to tag all lexical items in the corpus.
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	CyTag will be demoed at future project roadshows and public events (including Tafwyl and the Eisteddfod).
URL	http://cytag.corcencc.org


Title	Demo version of the CorCenCC query tools
Description	Demo version of the query tools that will be used for CorCenCC
Type Of Technology	Webtool/Application
Year Produced	2019
Open Source License?	Yes
Impact	Links to these tools have been circulated and the iterative development of them will continue until the final release of the corpus at the end of the project
URL	https://corpusdemo.corcencc.org/home?language=en


Title	PyMUSAS
Description	The Python Multilingual UCREL Semantic Analysis System (PyMUSAS) is currently a rule-based, token-level semantic tagger which can be added to any spaCy pipeline, or linked with an external POS tagger or lemmatiser. For Welsh, it links with the CyTag toolkit developed in CorCenCC. Similarly, for Indonesian, it links with TreeTagger. For other languages (Chinese, Dutch, French, Italian, Portuguese, Spanish) it links with the spaCy pipeline.
Type Of Technology	Software
Year Produced	2021
Open Source License?	Yes
Impact	After a soft release of pymusas in December 2021, we have already presented pymusas at a Newton / British Council Researcher Links Workshop, and an invited talk at the Edge Hill Corpus Research Group seminar. It will also be a core element of an invited talk at the Korean Association for the Study of English Language and Linguistics (KASELL) Spring Conference in May 2022 in Busan, South Korea.
URL	https://ucrel.github.io/pymusas/


Title	Welsh Semantic Tagger Version 1
Description	We have created a first version of the software prototype to apply corpus annotation automatically to Welsh language data. This first version incorporates word-level and coarse-grained grammatical analysis but no semantic disambiguation so far. The potential meanings assigned have been derived automatically by converting English dictionaries through bilingual dictionaries and small parallel corpora.
Type Of Technology	Software
Year Produced	2017
Open Source License?	Yes
Impact	This first prototype was publicly demonstrated at the project launch in Cardiff in March 2017 to a large audience including members of the Welsh assembly and other external project stakeholders.
URL	http://ucrel.lancs.ac.uk/usas/


Title	Y Tiwtiadur
Description	Y Tiwtiadur is a collection of data-driven teaching and learning tools designed to help supplement Welsh language learning at all ages and levels. Y Tiwtiadur contains four distinct corpus-based tools:
- a Gap-Fill (Cloze) tool allowing teachers to delete words from a text at specified intervals to encourage or assess comprehension abilities and prediction strategies
- a Word Profiler tool that enables the grading of texts by word frequency
- a Word Identification tool testing learners' ability to guess a word in context
- a Word Task Creator tool that facilitates intensive work on a specified vocabulary item.
The tools all use information from the 10-million-word CorCenCC corpus. All the language in the corpus is from real-life communication, so the word frequencies and the language samples reflect how Welsh is really used across a range of data types from different speakers/contributors, in different situations, and discussing a range of topics. Some of the tools allow you to choose different types of language samples to work with, based, for example, on the topic under discussion or the type of language (spoken, written, etc.).
Type Of Technology	Webtool/Application
Year Produced	2020
Impact	Y Tiwtiadur was developed in consultation with representatives of all anticipated academic and community user groups, including teachers and learners.
URL	https://ytiwtiadur.corcencc.org
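The Gap-Fill (Cloze) tool described above deletes words from a text at specified intervals. A minimal Python sketch of that idea follows. It is not Y Tiwtiadur's code: the function name `make_cloze`, the gap marker and the simple whitespace tokenisation (which leaves punctuation attached to words) are assumptions made for illustration.

```python
from typing import List, Tuple


def make_cloze(text: str, interval: int, gap: str = "____") -> Tuple[str, List[str]]:
    """Replace every `interval`-th word with a gap marker.

    Returns the gapped text and the list of removed words (the answer key).
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    words = text.split()  # naive whitespace tokenisation (assumption)
    answers: List[str] = []
    # Start at the interval-th word (index interval - 1) and step by interval.
    for i in range(interval - 1, len(words), interval):
        answers.append(words[i])
        words[i] = gap
    return " ".join(words), answers


if __name__ == "__main__":
    gapped, key = make_cloze("un dau tri pedwar pump chwech", interval=3)
    print(gapped)
    print(key)
```

A teacher-facing tool would likely also need to handle punctuation and to avoid gapping very short or very rare words; those choices are left out of this sketch.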


Description	A TV interview explaining the project and appealing for data
Form Of Engagement Activity	A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Media (as a channel to the public)
Results and Impact	An invited interview on 6 February 2019 as part of the contents of the S4C TV Welsh language programme 'Prynhawn da'. The main purpose of the interview was to explain to the audience what the aims of the CorCenCC project are and to appeal for more data from viewers, either through the CorCenCC app or by contacting us directly. The programme is a general entertainment programme and was an excellent way of reaching a non-academic Welsh speaking audience.
Year(s) Of Engagement Activity	2019


Description	BBC Radio Wales interview (App/project launch - Dawn Knight)
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Media (as a channel to the public)
Results and Impact	Project PI Dawn Knight was interviewed on Good Morning Wales, discussing the launch of the CorCenCC crowdsourcing app and encouraging listeners to 'Give us your Welsh' (i.e. contribute data to the project). 28/02/17 (2:24:53 in).
Year(s) Of Engagement Activity	2017
URL	http://www.bbc.co.uk/programmes/b08d6d7q


Description	Bimonthly project newsletter
Form Of Engagement Activity	A magazine, newsletter or online publication
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Study participants or study members
Results and Impact	We issue a bimonthly bilingual newsletter which is circulated to all members of the team, stakeholders, participants and supporters of the project (as well as members of the general public). The newsletter provides project updates and encourages individuals to sign up to contribute data to the corpus.
Year(s) Of Engagement Activity	2016,2017
URL	http://www.corcencc.org/news_events/


Description	Business Wales Advances Magazine coverage
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Industry/Business
Results and Impact	Business Wales Advances Newsletter
Year(s) Of Engagement Activity	2017
URL	https://businesswales.gov.wales/sites/business-wales/files/documents/Advances82_English_FINAL.pdf


Description	Cardiff University college newsletter (online)
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Other audiences
Results and Impact	Cardiff University College newsletter project coverage
Year(s) Of Engagement Activity	2017
URL	http://sites.cardiff.ac.uk/ahss/introducing%E2%80%AFthe-corcencc-project/


Description	CorCenCC project newsletter
Form Of Engagement Activity	A magazine, newsletter or online publication
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Study participants or study members
Results and Impact	The CorCenCC project newsletter was produced monthly from April 2016 to November 2016, then bimonthly after this date (latest edition = issue 15, January 2018). The Welsh version of the newsletters can be found here: http://www.corcencc.cymru/y_diweddaraf/#s2
Year(s) Of Engagement Activity	2016,2017,2018
URL	http://www.corcencc.org/news_events/#s2


Description	CorCenCC website
Form Of Engagement Activity	Engagement focused website, blog or social media channel
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Public/other audiences
Results and Impact	Bilingual public-facing website which will be used to host the corpus when it is constructed. The website contains information on what the project aims to do; how individuals can get involved (and how they can sign up to the newsletter); and provides updates on 'where we are' with the work.
Year(s) Of Engagement Activity	2017
URL	http://www.corcencc.org/


Description	CorCenCC: Corpus planning towards a million speakers of Welsh
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Invited talk at research seminar
Year(s) Of Engagement Activity	2020


Description	Cwis y Corpws Cenedlaethol
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Public/other audiences
Results and Impact	Online quiz run by the BBC, providing some basic information on what a corpus is and quizzing readers about patterns in word frequency and usage.
Year(s) Of Engagement Activity	2018
URL	https://www.bbc.co.uk/cymrufyw/46391607


Description	Formal CorCenCC project launch at the Pierhead Building, Cardiff Bay
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Policymakers/politicians
Results and Impact	The CorCenCC launch event was an opportunity for (invited) attendees to learn more about the project, view a demonstration of the new data collection app, and experience the corpus tools in action. In a series of short presentations, the following people shared their impressions of how CorCenCC will impact on research, policy, and on the Welsh language community more widely:
- Bethan Jenkins AM, Chair of the Welsh Language and Communications Committee
- Professor Elizabeth Treasure, Deputy Vice-Chancellor, Cardiff University
- Professor Damian Walford Davies, Head of the School of English, Communication and Philosophy, Cardiff University
- Dr Dawn Knight, Principal Investigator of the CorCenCC project, Cardiff University
- Alun Davies AM, Minister for Lifelong Learning and Welsh Language
- Professor Martin Stringer, Pro-Vice-Chancellor, Swansea University
Attendees included representatives from the BBC, S4C, National Library of Wales, Welsh Language Commissioner's office, National Assembly for Wales, various academic institutions, Welsh for Adults and project partners and collaborators.
Year(s) Of Engagement Activity	2017


Description	Heno TV appearance/project plug by project ambassador Nia Parry
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Media (as a channel to the public)
Results and Impact	Project Ambassador Nia Parry was involved in a TV interview on Heno (S4C) and mentioned the project - briefly discussing the aims and objectives of the work, in an effort to engage members of the public and encourage them to contribute data to the corpus (16/6/16 - at minute 41).
Year(s) Of Engagement Activity	2016
URL	http://www.bbc.co.uk/iplayer/episode/p03w7wcm/heno-mon-06-jun-2016


Description	Interview on Radio Cymru - 4th February 2019
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Public/other audiences
Results and Impact	RA Laura Arman was interviewed on Aled Hughes' programme on Radio Cymru, discussing the progress on the CorCenCC project to date and outlining to the general public how they may get involved in the future.
Year(s) Of Engagement Activity	2019
URL	https://www.bbc.co.uk/programmes/m0002bx7?fbclid=IwAR0Ovac120CiPcgeroAAFzseafsxCRgsKfzlGhFuB1RuJ14RK...


Description	Invited on-line response to article about Welsh language version of Wordle
Form Of Engagement Activity	Engagement focused website, blog or social media channel
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Public/other audiences
Results and Impact	A short invited online response to an article about a Welsh language version of Wordle. The response connects the difficulty of the words chosen in Wordle with the word frequency information that CorCenCC can provide, and outlines future plans to create learner-oriented Wordles based on knowledge of word frequency/CEFR levels.
Year(s) Of Engagement Activity	2022
URL	https://golwg.360.cymru/newyddion/2085450-gairglo-cymhwyso-iaith-ddatrys-cliwiau


Description	Invited public talk - Cymdeithas y Llan a'r Bryn, Llangennech, Carmarthenshire
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Public/other audiences
Results and Impact	I was invited to talk about the project as part of the annual programme of public talks by Cymdeithas y Llan a'r Bryn of Llangennech. Around 50 people attended on the night and many subsequently agreed to give data to the project. There was a lively debate as to what constitutes 'correct' or 'acceptable' Welsh and therefore what should or should not be included in the Corpus. I have been invited to return to talk about the Corpus at a later stage when it has been completed.
Year(s) Of Engagement Activity	2017


Description	Media engagement/announcement
Form Of Engagement Activity	A magazine, newsletter or online publication
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Schools
Results and Impact	One of the project RAs was featured in a local newspaper (as a former student of a local school), discussing the aims and objectives of the project.
Year(s) Of Engagement Activity	2018
URL	https://www.gllm.ac.uk/news/2147491168/


Description	National newspaper project mention
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Public/other audiences
Results and Impact	Newspaper coverage of project
Year(s) Of Engagement Activity	2017
URL	http://www.dailymail.co.uk/news/article-4304544/Mucking-playschool-goes-right-window.html


Description	Newyddion 9 on S4C - TV interview by CI Steve Morris
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Media (as a channel to the public)
Results and Impact	Project CI Steve Morris was involved in a TV news interview on S4C (Newyddion 9 on S4C) to discuss the aims and objectives of the project, in an effort to engage members of the public and encourage them to contribute data to the corpus.
Year(s) Of Engagement Activity	2016
URL	http://www.bbc.co.uk/cymrufyw/34509519


Description	Press release (University)
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Public/other audiences
Results and Impact	University press release to update on the final stages of work on CorCenCC.
Year(s) Of Engagement Activity	2019
URL	https://www.cardiff.ac.uk/news/view/1545380-language-project-approaches-target


Description	Press release to mark the start of the CorCenCC project (featured predominantly on academic websites and in research newsletters)
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Public/other audiences
Results and Impact	Press release issued to mark the start of the CorCenCC project. This press release was published on the following academic websites and in academic publications (site/publication details; date of publication; link (where appropriate)):
- Cardiff University website (English); 02/03/2016; http://www.cardiff.ac.uk/news/view/212132-st-davids-day-kick-off-for-welsh-language-project
- Cardiff University website (Welsh); 02/03/2016; http://www.cardiff.ac.uk/cy/news/view/212132-st-davids-day-kick-off-for-welsh-language-project
- Swansea University website (English); 01/02/2016; http://www.swansea.ac.uk/riah/research-projects/corcencc/
- Swansea University website (Welsh); 01/02/2016; http://www.swansea.ac.uk/cy/riah/prosiectau-ymchwil/corcencc/
- Cardiff University Digital Cultures blog (relating to an internal launch); 25/03/2016; https://cardiffdigitalnetwork.org/2016/03/25/corcencc-launch/
- Swansea University research website (Welsh); 01/03/2016; http://www.swansea.ac.uk/media/Momentwm%20rhifyn%2021.pdf
- Swansea University research website (English); 01/03/2016; http://www.swansea.ac.uk/media/Momentum%20issue%2021.pdf
Year(s) Of Engagement Activity	2016


Description	Project launch event and app launch press release (16 different sources/places of publication)
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Media (as a channel to the public)
Results and Impact	To mark twelve months since the start of the project, and to mark the completion of the crowdsourcing app and the public launch event (28/02/17), we issued a press release which sought to inform members of the public, policy makers, government officials etc. about the progress made on the project and to encourage them to 'give us their Welsh'. The following list documents the websites/newspapers that the media release was published on (and whether it was in Welsh or English); date of publication; and a link to the site (or the title of the piece, as relevant):
- Swansea University website (Welsh); 14/02/2017; http://www.swansea.ac.uk/cy/canolfan-y-cyfryngau/newyddion-diweddaraf/gallsiaradwyrcymraegymmhobmangyfrannuatadnoddiaithcenedlaetholdrwyddefnyddioapnewydd.php
- Swansea University website (English); 14/02/17; http://www.swansea.ac.uk/media-centre/latest-news/welshspeakerseverywherecancontributetoanationallanguageresourcethoughnewapp.php
- Y Cymro Welsh paper; 17/01/17; link not available
- Techdragons Wales blog; 2017; http://techdragons.wales/academics-launch-app-to-promote-welsh-language/
- Bangor University website (English); 15/02/2017; https://www.bangor.ac.uk/news/latest/we-need-your-welsh-31042
- Bangor University website (Welsh); 15/02/2017; https://www.bangor.ac.uk/addysg/newyddion/mae-angen-eich-cymraeg-arnom-31042
- Denbighshire Free Press Local Paper (Welsh Language Section); 22/02/17; link not available
- BBC Wales news site (English); 28/02/2017; http://www.bbc.co.uk/news/uk-wales-39120536
- BBC Wales news site (Welsh); 28/02/2017; http://www.bbc.co.uk/cymrufyw/39109825
- Cardiff University homepage (English); 01/03/2017; http://www.cardiff.ac.uk/news/view/616189-national-corpus-of-contemporary-welsh
- Cardiff University homepage (Welsh); 01/03/2017; http://www.cardiff.ac.uk/cy/news/view/616189-national-corpus-of-contemporary-welsh
- Lancaster University webpage; 01/03/2017; http://www.lancaster.ac.uk/news/articles/2017/national-corpus-of-contemporary-welsh/
- My Science Blog; 01/03/2017; https://www.myscience.org.uk/wire/national_corpus_of_contemporary_welsh-2017-cardiff
- Daily Mail online; 11/03/2017; http://www.dailymail.co.uk/news/article-4304544/Mucking-playschool-goes-right-window.html
Year(s) Of Engagement Activity	2017


Description	Project launch press release (10 different sources/places of publication)
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Media (as a channel to the public)
Results and Impact	Having obtained the funding for the CorCenCC project, we had an initial press release in 2015 which sought to inform members of the public, policy makers, government officials etc., about the aims and objectives of the project from the very start. The following list documents the websites/newspapers that the media release was published on (and whether it was in Welsh or English); date of publication and a link to the site (or the title of the piece, as relevant): Wales Online; Online news (Welsh); 08/10/2015; http://www.walesonline.co.uk/news/wales-news/welsh-language-10-million-words-10217359 Lleol.Cymru; Online news (Welsh) 08/10/2015; http://www.lleol.cymru/blog/detail.php?blog=corpws-cyntaf-yn-y-gymraeg-yn-cael-ei-sefydlu Y Cymro; Welsh newspaper; 09/10/2015; Corpws cyntaf o'r iaith Gymraeg Tab Student paper; 12/10/2015; http://thetab.com/uk/cardiff/2015/10/12/1-8m-granted-cardiff-university-save-welsh-language-11686 ENCAP website; Uni website; 14/10/2015; http://www.cardiff.ac.uk/news/view/147217-1.8m-for-online-resource-of-contemporary-welsh-language COMSC site; Uni website; 16/10/2015; http://www.cs.cf.ac.uk/newsandevents/corpus.html Bangor Uni website; 13/10/2015; http://www.bangor.ac.uk/addysg/news/-1-8m-funding-for-large-scale-online-resource-of-contemporary-welsh-language-24635 AcSS Website; 01/10/2015; https://www.acss.org.uk/news/new-large-scale-open-source-corpus-of-contemporary-welsh-language-to-be-created/ Lancaster University website; 02/11/2015; http://www.lancaster.ac.uk/news/articles/2015/18m-for-first-ever-large-scale-online-resource-of-contemporary-welsh-language/ WISERD webpage; 04/11/2015; http://www.wiserd.ac.uk/news/latest-news/corcencc-commence-march-2016/#sthash.CpIdAo3S.dpbs
Year(s) Of Engagement Activity	2015


Description	Radio Cymru: Post Prynhawn (Welsh) - project description with Steve Morris
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Media (as a channel to the public)
Results and Impact	Project CI Steve Morris was interviewed on BBC Radio Cymru about the aims and objectives of the project, in an effort to engage members of the public and encourage them to contribute data to the corpus (17/2/14 - 56:15 onward).
Year(s) Of Engagement Activity	2016
URL	http://www.bbc.co.uk/programmes/b08d6d7q


Description	Radio Cymru: Post Prynhawn (Welsh) interview (App/project launch - Nia Parry)
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Media (as a channel to the public)
Results and Impact	Project ambassador Nia Parry was interviewed on Radio Cymru, discussing the launch of the CorCenCC crowdsourcing app and encouraging listeners to 'Give us your Welsh' (i.e. contribute data to the project). 28/02/17 (at around 7:40am).
Year(s) Of Engagement Activity	2017
URL	http://www.bbc.co.uk/programmes/b08d6d7q


Description	Social media campaign by Lancaster University promoting Global Lancaster focussed on our multilingual semantic tagging software.
Form Of Engagement Activity	Engagement focused website, blog or social media channel
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Public/other audiences
Results and Impact	This was a social media campaign by Lancaster University promoting Global Lancaster. Paul Rayson, Scott Piao and Mahmoud El-Haj were interviewed and featured in the video, talking about the need for Natural Language Processing and AI research. The video focussed on our multilingual semantic tagging software and on how the general public and other groups could engage in the research and benefit from it.
Year(s) Of Engagement Activity	2018
URL	https://twitter.com/LancasterUni/status/1022138287035764736


Description	Talks/activity at the National Eisteddfod of Wales festival
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Public/other audiences
Results and Impact	We deliver 1-2 presentations (in Welsh) during the week of the National Eisteddfod of Wales festival on an annual basis. The presentations outline the aims and objectives of the project to the general public and function to recruit participants to contribute data to the corpus, and to disseminate findings from the research. Attendees are also encouraged to sign up to the project newsletter to receive further information about the project as time progresses. 2-4 members of the project team are involved in this event on an annual basis.
Year(s) Of Engagement Activity	2016,2017,2018
URL	https://eisteddfod.wales/


Description	Tafwyl festival
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Public/other audiences
Results and Impact	We delivered presentations (in Welsh) during the weekend of the annual Tafwyl festival (a Welsh arts and culture festival). The presentations outlined the aims and objectives of the project to the general public and served to recruit participants to contribute data to the corpus. Attendees were also encouraged to sign up to the project newsletter to receive further information about the project as time progresses. 2-4 members of the project team are involved in this event each year.
Year(s) Of Engagement Activity	2016,2017,2018
URL	http://tafwyl.org


Description	WordNet funding details (Government press release)
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Media (as a channel to the public)
Results and Impact	WordNet project funding details - Government Press Release. Welsh version can be found here: http://gov.wales/newsroom/welshlanguage/2017/projects-which-get-creative-with-cymraeg-announced/?skip=1?=cy
Year(s) Of Engagement Activity	2017
URL	http://gov.wales/newsroom/welshlanguage/2017/projects-which-get-creative-with-cymraeg-announced/?lan...


Description	WordNet project press release (Cardiff University)
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Media (as a channel to the public)
Results and Impact	WordNet project funding press release (Cardiff University). Welsh version can be found here: http://www.cardiff.ac.uk/cy/news/view/1013418-wordnet-cymraeg?utm_content=buffere2611&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Year(s) Of Engagement Activity	2017
URL	http://www.cardiff.ac.uk/news/view/1013418-wordnet-cymraeg?utm_content=buffere2611&utm_medium=social...


Description	Workshop and 'Gogglebox' type data gathering session as part of Being Human Festival 2017
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Public/other audiences
Results and Impact	As part of the 2017 'Being Human' Festival (and as its only event held through the medium of Welsh), a workshop was held at Tŷ'r Gwrhyd (Welsh Language Centre) in Pontardawe, where members of the public were invited (i) to learn more about the project and (ii) to contribute their data by watching and reacting to videos (which did not include any spoken language), in a similar way to the Channel 4 programme Gogglebox. The session used one of the project's straplines, "Rho dy Gymraeg i ni!" [Give us your Welsh], to attract members of the public to the event, and many hours of spoken data were collected.
Year(s) Of Engagement Activity	2017
URL	https://beinghumanfestival.org/event/give-us-your-welshrho-dy-gymraeg-i-ni/
