British National Corpus (BNC) as a sociolinguistic dataset: Exploring individual and social variation

Lead Research Organisation: Lancaster University
Department Name: Linguistics and English Language

Abstract

The project exploits an existing dataset, the British National Corpus (BNC), for the study of informal spoken British English as used by different age and social groups across the UK. In addition, new developments in British English will be investigated by comparing the BNC with BNC2014, a new dataset that is being developed at Lancaster University in collaboration with Cambridge University Press. This allows us, for the first time, to look at language change in spoken British English, on a large scale, over twenty years. By combining methodologies from the fields of corpus linguistics and sociolinguistics as well as using novel analytical methods for in-depth exploration of the data, the project will offer new insights into social variation in British English that have previously not been possible. The focus of the sociolinguistic analyses will be on age, an important aspect of everyday social life that has so far received only limited attention from researchers studying language.

The main contribution of the project is not only to our knowledge of British English but also to enabling future systematic research in this area. The results of the project will be applied in the teaching of English Language at secondary schools (AS and A-level) and in ESL/EFL classes for students whose mother tongue is not English. Internationally, there is a growing demand for EFL/ESL teaching, which also represents an important part of the British economy. The results of the project will also be disseminated via a free online course (our Corpus Linguistics MOOC) run by the ESRC Centre for Corpus Approaches to Social Science, Lancaster University, as well as via different channels of the project partners (project ambassadors). The project has been endorsed by Cambridge University Press, a leading global academic publisher and part of the University of Cambridge; the English and Media Centre, an important educational charity working with secondary teachers of English Language and Media Studies in the UK and abroad; and Trinity College London, a major international testing board operating in over 40 countries worldwide.

Planned Impact

Our project will have five principal audiences for impact: established researchers in different disciplines, postgraduate researchers, language testers/material developers, non-academics in the UK and non-academics worldwide.
Who - established academics: The proposed research will have interdisciplinary impact. Apart from researchers in sociolinguistics and corpus linguistics, researchers investigating different aspects of British English and society (sociologists, social psychologists, educators etc.) will directly benefit from the findings of the research (see Academic beneficiaries).
How: Age is an important variable in social research especially in connection with the overall aging of society. To maximise impact, an online platform for searching the data and carrying out multi-variate analyses (advanced mode) will be developed. In this way, researchers will be able to test their own hypotheses about language and society using the subset of the BNC developed for the secondary data analysis (see Objectives).
Who - Master's and PhD students and junior researchers from the UK and overseas who participate every year in the free Lancaster Summer Schools in Corpus Linguistics: Future cohorts from these summer schools will have an opportunity to engage with the findings of the study and use the techniques in their own research. Since 2011, the summer schools have attracted almost 400 participants from over 15 countries.
How: The Summer school workshops will include training in corpus linguistics and quantitative sociolinguistic techniques using the online platform developed in the project. The participants will be able to use the findings of the study as well as to test their own sociolinguistic hypotheses and include these in their theses/dissertations and academic publications.
Who - language testers/material developers: The project has been supported by three large UK organisations - CUP, EMC and TCL (see Pathways to Impact) which are involved in materials development/language testing.
How: Our impact partners will not only provide channels for the dissemination of the results but will also gain early access to these results. This will allow them to incorporate the findings into their own material development/language testing activities. This will give them an advantage on the competitive international market.
Who - non-academic beneficiaries in the UK: These fall into two groups: i) secondary level teachers/students of English Language and ii) teachers/learners of English as a second/foreign language. The sociolinguistic perspective has been given a prominent position in the syllabi and assessment of English Language taught at AS and A-level by several major examination boards. In addition, for the acquisition of English as a second language, awareness of what language is appropriate/typical for different situations or groups of speakers is very important (see Pathways to Impact).
How: The project will represent a valuable educational resource in both AS and A-level English Language classes as well as ESL/EFL classes through materials and an interactive online platform (simple mode) that will allow teachers and students easy access to sociolinguistic data in a user-friendly manner. The teaching materials will be deposited at the ESRC resources archive Social Science for Schools.
Who - non-academic audiences worldwide: Finally, via our successful free online course (Corpus Linguistics MOOC) the research findings will be disseminated to a large international audience (of academics but also importantly of non-academics). The course has attracted over 30,000 participants worldwide in its three runs in 2014 and 2015.
How: The team will develop a sociolinguistics module which will summarise the findings of the research and allow the audience to actively engage with it. The module will also provide links to the online platform and the teaching materials that will be deposited at the ESRC resources archive.
 
Description The key findings for this project can be categorised in four core areas:

1) Significant new knowledge. The research brought significant new understanding of the evolution of spoken British English over the period of twenty years based on a large-scale analysis of two corpora of spoken British English, the Spoken British National Corpus 1994 and the Spoken British National Corpus 2014 (see collected volume Brezina et al. 2018). Specifically,

a. It contributed to sociolinguistic theory, by describing the role of age as one of the major social factors in linguistic shift. Findings of this project highlighted the role of speakers' age in both the development of language across their lifetime and in the evolution of the language as a whole.

b. It contributed to a new, sociolinguistically-informed and empirically-based description of linguistic (lexical, grammatical and pragmatic) changes in spoken English over the period of twenty years. We focused on a wide range of linguistic variables from low level processes (e.g. employment of intensifiers) to higher level social interaction (e.g. use of politeness markers).

c. It enhanced the knowledge underlying corpus building, as it contributed to the understanding of the significance of social variables in language use and the need to represent them in corpus creation. We proposed a new sociolinguistically informed corpus sampling procedure with age being the crucial structuring variable.

2) Important new questions. The key findings contributed to opening new areas of research across several disciplines. First, they demonstrated the contribution of sociolinguistic theory to corpus-based studies; here, for example, communities of practice were introduced as an explanatory principle. Second, this also resulted in a call for an examination of the role of social factors in corpus-based studies of second language acquisition and use, something previously not taken into consideration. Important conceptual issues related to interdisciplinary research in second language acquisition are highlighted in the lead article in a forthcoming issue of the Annual Review of Applied Linguistics (McEnery et al. 2019).

3) Innovative methods, tools and techniques. The key theoretical findings and the new research questions drew upon methodological innovation that emerged from the project as part of a close cooperation between sociolinguistics and corpus linguistics (see comprehensive review of statistical methods in Brezina 2018). These improved methods have been implemented in three tools (freely accessible online): BNClab, an online tool allowing powerful sociolinguistic searches; #LancsBox v4, a software package analysing users' own data and comparing it with existing datasets; and Lancaster Stats Tools online, an easy-to-use online environment for statistical analysis and visualisation of linguistic data.

4) Increased research capability through a new dataset
For comparison purposes, a new corpus, the Spoken British National Corpus 2014, was created, which consists of 10 million words (downloadable for free as XML files). A balanced subset of the Spoken British National Corpus 2014 and the original British National Corpus was extracted for direct comparison and made available via BNClab.
Exploitation Route There are two major routes by which the outcomes have been and can be further taken forward by different groups of beneficiaries:

1) Use of the outcomes to generate new scholarly knowledge. The resources generated in the project (i.e. the corpora and the tools for their analysis) provide ideal research instruments for pursuing a wide variety of research questions by scholars in the fields of sociolinguistics, applied linguistics, and first and second language acquisition. The specialist training offered through a number of channels (e.g. workshops, summer schools, MOOC) has provided academics from these communities with the skills necessary to use these tools. The resources have already been used in proposing new research activities (e.g. grant proposals submitted by the ESRC Centre for Corpus Approaches to Social Science). We are also preparing further work with the British Library directly informed by the outcomes of this project.

2) Improvement of education in the UK at primary and secondary levels. The outcomes of the project (the interactive platform for language analysis and teaching resources) can be integrated by A-level English Language teachers as well as ELT teachers to enhance their lessons. In addition to using the teaching materials and lesson plans, teachers can also develop their own materials and use the platform flexibly in response to their teaching aims and the needs of their students. The outcomes of the project have a strong potential to inform education policy in the area of language teaching.
Sectors Education,Culture, Heritage, Museums and Collections

 
Description There are two major areas in which the research findings of the project have been used.

1) Development of BNClab and the Corpus for Schools project. Results of the research have been used to develop an online interactive platform, BNClab (http://corpora.lancs.ac.uk/bnclab/), giving users access to two key corpora of British English - the British National Corpus (ESRC dataset) and the Spoken British National Corpus 2014, a newly created counterpart to the British National Corpus. The platform has had 4,545 page visits (by 1,584 users from 81 countries worldwide) to date since being made accessible to the public in September 2018. The results of the research were further used to create teaching materials both for use in A-level English Language classes and for English language teaching (ELT). The website 'Corpus for Schools' (http://wp.lancs.ac.uk/corpusforschools/) has been created to support the uptake of the materials by secondary school teachers as well as ELT teachers. To date, we have organised seven engagement activities with secondary school teachers, ELT teachers, students and the general public, with a total of over 250 face-to-face participants. For example, we organised a training event for A-level students and A-level teachers with over 60 participants, and we also trained the participants of three Lancaster Summer schools in corpus linguistics (110 participants in 2018) in the use of the tool and discussed the main results in the context of the summer schools. In addition, the learning environment (the BNClab platform and teaching materials) has been integrated into two units of a highly successful MOOC in Corpus Linguistics with over 6,000 participants in September - November 2018. In the MOOC, the participants are both trained in the use of the resource and made aware of the key findings of the project. Finally, the findings have been used to inform two new ESRC grant bids; one of the grants has already been awarded (£750,905, PI: E. Semino), and the decision on the second one is pending (£4.256M; PI: E. Semino).

2) Development and use of #LancsBox. The project also allowed the creation and further development of #LancsBox v. 4, a flexible desktop tool for the analysis of linguistic data. The success of #LancsBox lies in the fact that it allows users to upload and analyse their own data and compare them to the data provided, including the spoken subset of the British National Corpus. To date, #LancsBox has attracted 18,080 users from 137 countries worldwide. It is available for free download for all major operating systems from http://corpora.lancs.ac.uk/lancsbox/
First Year Of Impact 2017
Sector Education
Impact Types Cultural,Societal

 
Title Spoken BNC2014 
Description The Spoken BNC2014 is an 11 million word collection of modern British English conversations, transcribed and annotated, for linguistic analysis. It was developed by CASS in collaboration with Cambridge University Press and first released online in 2017. It is accessible at zero cost to anyone, subject to the terms of an end user licence that permits any noncommercial use in research and teaching (but, for reasons of IP, not redistribution). 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
Impact So far (as of September 2017, when the corpus was first released), one journal special issue and one edited volume have been compiled containing research undertaken using a pre-release subset of the corpus. Both are due for publication by Q1 2018. Other impacts will follow now that the corpus has been made publicly available. 
URL http://corpora.lancs.ac.uk/bnc2014
 
Title #LancsBox v. 3 
Description A new generation corpus analysis tool and data visualisation tool. 
Type Of Technology Software 
Year Produced 2017 
Impact This tool has been introduced to a large number of researchers and students via the Corpus Linguistics MOOC. 
URL http://corpora.lancs.ac.uk/lancsbox/
 
Title BNClab 
Description The web-based software allows efficient analysis and visualisation of sociolinguistic data; it analyses data according to gender, age, social class, region as well as individual speaker performance. It also compares language development over the period of 20 years from 1994 to 2014. The software employs complex multi-variate statistical analysis to test different sociolinguistic hypotheses about the dataset. 
Type Of Technology Webtool/Application 
Year Produced 2018 
Open Source License? Yes  
Impact The impact activities for this software are planned for the period of May - July 2018. 
URL http://corpora.lancs.ac.uk/bnclab/search
 
Title Lancaster Stats Tools online 
Description Lancaster Stats Tools online offers access to powerful statistical tools through a simple 'click and analyse' user interface, into which the data can be directly copy/pasted from a spreadsheet (e.g. Excel or Calc). The statistical tools offer the power of the R package in the background combined with a user-friendly interface designed specifically for analyses of data in corpus linguistics. To search corpora and obtain frequencies for statistical analysis a range of software tools can be used. 
Type Of Technology Webtool/Application 
Year Produced 2018 
Open Source License? Yes  
Impact The software tool brings innovation to corpus linguistics. It offers a comprehensive overview of methods that can be used to analyse linguistic data. It is based on extensive research that was enabled by the ESRC-funded project. 
 
Description #LancsBox: a new tool for researchers, teachers and students 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact This workshop introduced a new analytical tool #LancsBox that can be used for both research and teaching purposes.
Year(s) Of Engagement Activity 2017
 
Description Cambridge ELT blogpost: Stories behind pronouns: evidence from real spoken British English 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This interactive blogpost on the http://www.cambridge.org/elt/ website and its Twitter coverage targeted a large number of practitioners, textbook writers and policy makers (over 61,200 Twitter followers). It brought some highlights of the project to one of the target audiences (beneficiaries). As a result, there was an increased uptake in the use of the teaching materials based on the project. Further impact activities are planned with Cambridge University Press and Cambridge Language Testing.
Year(s) Of Engagement Activity 2019
URL http://www.cambridge.org/elt/blog/2019/01/04/pronouns-spoken-english/?utm_source=twitter&utm_medium=...
 
Description Corpus MOOC - new #LancsBox training 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Development of new training videos and teaching activities for a massive open online course (MOOC) in corpus linguistics.
Year(s) Of Engagement Activity 2017
 
Description Corpus linguistics and sociolinguistics - public engagement event 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact This event, aimed at the general public, was organised as part of Lancaster University's Campus in the City series. The university hired a shop in Lancaster city centre where we introduced the interested public (teachers, school children and their parents, people from local businesses etc.) to the tools developed as part of the project. Participants could search any word or phrase of interest in BNClab (http://corpora.lancs.ac.uk/bnclab/search). More than fifty people attended the event, which sparked many interesting discussions about language and society in Britain. Primary and secondary school students were exposed to both the process and the product of academic research, showing them the possibilities of careers in cutting-edge computational research.
Year(s) Of Engagement Activity 2019
URL http://cass.lancs.ac.uk/cass-in-the-city/
 
Description Corpus MOOC - new units on sociolinguistics and language learning 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact We created two brand new units featuring the results of the ESRC-funded project. The international audience (over 5000 participants from almost 100 countries in the 2018 run) allowed wide dissemination of the research findings. The corpus MOOC was also instrumental in helping the University of Mosul to rebuild their language studies programme.
Year(s) Of Engagement Activity 2018
URL https://esrc.ukri.org/news-events-and-publications/news/news-items/esrc-centre-helps-mosul-universit...
 
Description Lancaster Summer Schools 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact This large international event took place in June 2018 at Lancaster University. During the four-day intensive training the participants learnt to use new software tools designed as part of the ESRC-funded project. There was a significant increase in the use of the new tools: Lancaster Stats Tools online (242 28-day active users) and #LancsBox (505 28-day active users).
Year(s) Of Engagement Activity 2018
URL http://wp.lancs.ac.uk/corpussummerschools/
 
Description NATE conference: Corpus for schools: Using corpus resources in A level English Language classes 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact This event took place at the National Association of Teachers of English conference in Birmingham on 23rd June 2018. Head teachers from across the country attended this event. The event sparked a lively debate and wide interest in the newly developed tool (BNClab). In the following months, we received several dozen requests for the teaching materials available for free on the BNClab platform.
Year(s) Of Engagement Activity 2018
URL http://wp.lancs.ac.uk/corpusforschools/2018/09/06/bnclab-at-the-nate-conference-in-birmingham/
 
Description School visit - Corpus linguistics: Scientific approach to language. 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact This half-day workshop for A-level English Language students at Ripley St Thomas secondary school took place in Lancaster on 16th July 2018. The workshop was jointly led by Dr. Dana Gablasova and Dr. Vaclav Brezina. Students learnt how to use a new online tool, BNClab, that was created as part of this ESRC-funded project. The workshop was well received and stimulated discussion and follow-up conversations. Early feedback on the tool was provided.
Year(s) Of Engagement Activity 2018
 
Description Using corpora to explore the English language 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact Over 50 secondary-school students and 5 English Language teachers attended this A-level workshop at the A-level conference. The event took place at Lancaster University on 3rd July 2018. The participants were introduced to using corpora and corpus techniques. Separate instruction was provided to teachers (lecture led by Dr. Dana Gablasova) and students (practical session led by Dr. Vaclav Brezina). The focus of the event was to empower teachers and students to use software tools that were developed as part of the ESRC-funded project. After the event, there was an increased uptake in the use of #LancsBox (505 28-day active users in July 2018).
Year(s) Of Engagement Activity 2012,2018