Investigating the diffusion of morphosyntactic innovations using social media

Lead Research Organisation: University of Cambridge
Department Name: Linguistics

Abstract

We propose to research the way in which changes in the grammar of languages ("innovations") spread out from a small number of speakers to a larger section of the population ("diffusion"). The types of data gathered by traditional means are not especially well-suited to studying diffusion processes. Dialectologists have traditionally gathered data by surveying speakers at a large range of different localities. However, it is usually impractical to survey a large number of speakers at each locality, meaning that such surveys do poorly when there is variation between speakers living in the same places (as is usually the case during ongoing change). Sociolinguists typically interview a large number of speakers at a single locality; it is possible to then investigate diffusion processes by comparing such datasets, but this only ever offers a very limited geographical range. Accordingly, we propose to use data from social-media platforms such as Twitter to investigate diffusion processes. Although a form of written language, social-media data tend to be highly informal and often provide a good approximation to spoken data. By using data from social media, we will be able to gather a very large quantity of localised data from many places across large areas. We will explore new ways of using social media data to investigate language variation and change and demonstrate the effectiveness of using such data to investigate geospatial diffusion and change in grammatical patterns.

We will use Twitter to collect three datasets ("corpora") of tweets in English and Welsh in Britain, and in Norwegian in Norway, each covering a period of nine months. The selection of these three languages enables us to compare the effect of different demographic and geographic scenarios on patterns of diffusion: Welsh as a minority/lesser-used language vs. English and Norwegian as majority languages; low population density in Norway vs. higher population density in much of England; etc. We will supplement these Twitter corpora with additional data obtained by data-scraping from Norwegian- and Welsh-language-specific social media.

We will then identify language changes currently diffusing in these populations and investigate their distribution in these corpora. Changes we expect to investigate include the spread of a new second-person pronoun 'chdi' ('you') in Welsh, the spread of future constructions with 'komme til å' and 'bli å' in Norwegian and changes in the syntax of constructions with 'need' in English (such as 'you need your hair washing' vs. 'you need your hair washed'). By identifying all instances of both the innovative and the older option(s) in our corpora we will be able to map where each different option is typically used. By comparing these findings with geographical patterns known from earlier studies or identified in other datasets, we will be able to map the spread of new forms over time and so identify the properties of processes of diffusion. In this way, we will be able to answer questions such as: do changes in these speech communities spread continuously over land ("contagious diffusion") or jump from city to city before reaching rural regions ("hierarchical diffusion")? Is this mode of diffusion affected by the type of change in question, by the demographics of the region or by some other factors?
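As a rough illustration of the mapping step described above, the following minimal Python sketch counts innovative and older variants per region and reports the share of the innovative form. It is not the project's actual pipeline: the function name `variant_shares`, the region labels, the example texts and the choice of 'ti' as the older counterpart of 'chdi' are all assumptions made for illustration, and simple token matching would need refining for multi-word constructions such as 'komme til å'.

```python
import re
from collections import Counter, defaultdict


def variant_shares(tweets, innovative, conservative):
    """Return {region: share of innovative tokens} for geolocated texts.

    tweets: iterable of (region, text) pairs (hypothetical input format).
    innovative / conservative: sets of lower-case word forms to count.
    """
    counts = defaultdict(Counter)
    for region, text in tweets:
        for token in re.findall(r"\w+", text.lower()):
            if token in innovative:
                counts[region]["innovative"] += 1
            elif token in conservative:
                counts[region]["conservative"] += 1
    shares = {}
    for region, c in counts.items():
        # A region only appears in counts once a variant has been seen,
        # so the total is always at least 1.
        total = c["innovative"] + c["conservative"]
        shares[region] = c["innovative"] / total
    return shares


# Invented placeholder data, not real tweets or real regions.
toy_tweets = [
    ("region_a", "diolch i chdi"),
    ("region_a", "diolch i chdi"),
    ("region_a", "diolch i ti"),
    ("region_b", "diolch i ti"),
    ("region_b", "diolch i ti"),
]

shares = variant_shares(toy_tweets, innovative={"chdi"}, conservative={"ti"})
for region, share in sorted(shares.items()):
    print(region, round(share, 2))
```

Shares of this kind, computed per locality, are the quantities that could then be plotted on a map and compared with distributions known from earlier surveys.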

We will use interactions between users in our corpora (retweets, @direct messages, mutual following) to construct a model of these users' social network. We will then be able to compare the effectiveness of this network model as a predictor of the pathway of diffusion to the geographical model. Our results will be demonstrated in action through web-apps that predict users' origins using their responses to questions about their language use and they will furthermore be made available to the public via an online atlas-style website.
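One way to picture the network model described above is sketched below in Python. It is a simplified illustration under stated assumptions, not the project's method: interactions are given as hypothetical (source, target, kind) triples, every interaction type is weighted equally, and "exposure" is defined here as the share of a user's interaction weight that involves users already observed using an innovative form.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected weighted graph from (source, target, kind) triples.

    Each interaction (retweet, mention, follow, ...) adds 1 to the edge
    weight; weighting all kinds equally is an assumption of this sketch.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for source, target, _kind in interactions:
        if source == target:
            continue  # ignore self-interactions
        graph[source][target] += 1
        graph[target][source] += 1
    return graph


def neighbour_exposure(graph, adopters):
    """Share of each user's interaction weight that involves adopters."""
    exposure = {}
    for user, neighbours in graph.items():
        total = sum(neighbours.values())
        with_adopters = sum(w for n, w in neighbours.items() if n in adopters)
        exposure[user] = with_adopters / total
    return exposure


# Invented placeholder users and interactions.
toy_interactions = [
    ("ann", "bob", "retweet"),
    ("ann", "bob", "mention"),
    ("ann", "cat", "follow"),
    ("bob", "dan", "retweet"),
]

network = build_network(toy_interactions)
exposure = neighbour_exposure(network, adopters={"bob"})
for user in sorted(exposure):
    print(user, round(exposure[user], 2))
```

A network-based exposure measure of this kind could be set against a purely geographical measure (for example, the share of adopters among nearby users) to see which better predicts who adopts an innovation next.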

Planned Impact

The project will benefit non-academic parties in a variety of ways. We will communicate our findings to education professionals and the general public via online dialect-prediction apps, via a flexible online 'atlas' of linguistic variation in social media, and face-to-face through a series of public events. We will also involve such users in the design of the research through engagement on social media.

During the design phase of the project, we will engage with potential users through Facebook discussion groups on language and dialect. The experience and observations of members of these groups will provide useful input to our choice of variables to investigate, particularly in bringing very recent innovations to our attention.

Once data has been collected, it will be used as the basis for a major impact element of the project, namely the construction of three dialect web-apps after the model of the New York Times (Katz & Andrews 2013) and Der Spiegel (Leemann et al. 2015) dialect surveys. These apps will take the form of a series of questions about respondents' language use. The app will then offer a prediction of the respondent's birthplace on the basis of their answers to these questions and the distributions of these answers identified in our geolocated social-media corpus. Finally, users will be offered the opportunity to rate the prediction and submit social metadata including any social-media identities. These apps will form an integral part of the research project: they will allow us to collect a comparator dataset of self-reporting data for future research and social metadata for any users who also contributed data to the social-media corpus. However, they will also allow us to engage directly and positively with the public about the project, bringing people's attention to ways in which their language use might reflect their geographical origin that they are less likely to be aware of and informing them about processes of language change. For example, people are rarely conscious of regional differences in syntax.
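The prediction step could work in many ways. The Python sketch below shows one simple possibility, a naive Bayes classifier with add-alpha smoothing and a uniform prior over regions, and is offered only as an illustration under those assumptions. The function name `predict_region`, the region labels, the question names and all counts are invented for the example and do not come from the project's data.

```python
import math


def predict_region(answers, region_counts, alpha=1.0):
    """Return (best_region, log_scores) for a respondent's answers.

    answers: {question: chosen_variant}
    region_counts: {region: {question: {variant: corpus_count}}}
    Assumes a uniform prior over regions and independent questions.
    """
    # Collect every variant attested for each question in any region,
    # so that smoothing uses the same number of options everywhere.
    options = {}
    for per_question in region_counts.values():
        for question, counts in per_question.items():
            options.setdefault(question, set()).update(counts)

    scores = {}
    for region, per_question in region_counts.items():
        log_score = 0.0
        for question, answer in answers.items():
            counts = per_question.get(question, {})
            total = sum(counts.values())
            n_options = len(options.get(question, set()) | {answer})
            prob = (counts.get(answer, 0) + alpha) / (total + alpha * n_options)
            log_score += math.log(prob)
        scores[region] = log_score
    return max(scores, key=scores.get), scores


# Invented counts for two hypothetical regions.
toy_counts = {
    "region_a": {
        "need": {"washing": 60, "washed": 40},
        "be_past": {"you was": 30, "you were": 70},
    },
    "region_b": {
        "need": {"washing": 10, "washed": 90},
        "be_past": {"you was": 20, "you were": 80},
    },
}

best, scores = predict_region(
    {"need": "washing", "be_past": "you was"}, toy_counts
)
print(best)
```

In practice the regions would be far more numerous and the counts would come from the geolocated corpus; user ratings of each prediction could then be used to check how well such a model performs.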

As with previous similar surveys (cf. Leemann's collaboration with Der Spiegel), we will work with news media in order to reach the largest audience possible, both by promoting links to the surveys themselves on national news websites in the relevant countries and by being available for interview by other news outlets. In this way we will also raise the public profile of linguistics and contribute to well-informed media coverage of the discipline.

We will also help to raise the profile of the research outside academic circles by constructing an atlas website. This will be a set of interactive maps in which users can explore the geographical distribution of each of the different variables we have examined, presented clearly and in language that will be accessible to the general public. As such, it will represent an excellent teaching resource for teachers of English Language at GCSE and A-level in the UK when exploring the notion of dialect variation, and for teachers in Norway at videregående 3 level teaching 'characteristics of spoken dialects'. It will also be valuable as an accessible resource for teachers of Welsh as a second language in Wales, where there is a need for greater awareness of dialect variation and the distance between Standard Literary Welsh and regional spoken Welsh.

We will give a series of public engagement talks over the entire timeline of the project. These will comprise two larger public engagement events with invited speakers aimed at education professionals (teachers, adult education etc.), as well as individual talks at existing public-engagement events such as the Cambridge Festival of Ideas and the Cambridge Modern Foreign Language Annual Conference for Teachers. In these talks we will discuss dialect variation and change in modern languages, especially in syntax, and the relationship between spoken languages and written language in social media.
 
Description We have built on and improved the tools we developed in the previous year for associating individual Twitter users with geographical locations, allowing us to create geographically tagged corpora of language use. We have extended these tools, which we had previously developed and tested for English and Welsh, to the mainland Scandinavian languages (Norwegian, Swedish, Danish), and improved the accuracy of our existing tools.

With these localisation methods, we have undertaken investigations into features of these languages believed to be undergoing change. In English, we have looked at the phenomenon of 'preposition drop', in which constructions involving a motion verb and destination are produced without a preposition (as in 'go (the) pub' for 'go to the pub'). This is a relatively new pattern historically and has not attracted a great deal of attention in the literature thus far. Our methods have allowed us to describe its geographical distribution much better, going from individual points where it had been observed in previous studies to a fully defined continuous region. We have identified grammatical variation across this region, hinting at its recent historical development. We have also examined non-standard usage in the past tense of 'to be' with forms like 'you was', 'I were', etc., which are assumed to be being lost in favour of the standard form; this has allowed us to map the regions in which non-standard usage remains with more specificity than has previously been possible. In the Scandinavian languages, we have identified initial results for several variables, including allomorphic variation in the definite article in Norwegian and Swedish, and different constructions for expressing possession relations in all three languages.

These new datasets are all much more comprehensive and associated with better spatial metadata than traditional studies have been able to achieve, and so we anticipate that our work will provide a basis for future work on these phenomena.
Exploitation Route The method for associating Twitter users with geographic locations may be used by researchers or non-academic users interested in linguistic variation, or indeed any other feature represented in social media that could vary geographically (mentions of any topic of interest to the researcher). Dialect maps of features produced by the project may be used in teaching (either first or second-language teaching in contexts where awareness of dialect variation could add depth to the teaching) and are of interest to the general public, stimulating interest in and discussion of the nature of dialect variation, above all in British English and Welsh, but also potentially in the other languages of study.
Sectors Digital/Communication/Information Technologies (including Software),Education,Culture, Heritage, Museums and Collections

URL http://www.ling.cam.ac.uk/socmedia/
 
Description As outlined in our portfolio, our findings so far have appeared on social media via our two Twitter accounts, @tweetolectology in English (5,500 followers) and @trydarieitheg in Welsh (1,300 followers), through a programme of informal interaction. We have created a public platform for the discussion of language variation along with the presentation of historical dialectological data in an up-to-date and accessible form. Discussions there have fed into the choice of variables for further impact and investigation. We have also used this as a platform to put out informal surveys about current language variation among the followers of the accounts, which have provoked further discussion. Group members have appeared on the radio (Rhaglen Aled Hughes, Radio Cymru) and are scheduled to appear at science festivals (Scientifica, Zürich) both in the UK and abroad.
First Year Of Impact 2018
Sector Education,Culture, Heritage, Museums and Collections
Impact Types Cultural,Societal

 
Description Cambridge Humanities Research Grant Scheme (CHRG) (grant title: "Investigating variation and change in Haitian Creole using social media")
Amount £8,600 (GBP)
Organisation University of Cambridge 
Sector Academic/University
Country United Kingdom
Start 01/2020 
End 06/2020
 
Description Mapping Language Variation and Change
Amount £3,660 (GBP)
Organisation German Academic Exchange Service (DAAD) 
Sector Academic/University
Country United States
Start 03/2019 
End 06/2019
 
Description 'Tweetolectology' - Wie man mit Twitter Veränderung in der Sprache untersuchen kann. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Outreach talk (Leemann, on behalf of Leemann, Blaxter, Gopal, Willis) at 'Scientifica' (the University of Zurich research days), 30 Aug/1 September 2019.
Year(s) Of Engagement Activity 2019
URL http://www.scientifica.ch
 
Description Clare Politics Society panel discussion event 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Undergraduate students
Results and Impact An audience of 100-150 attended for a panel discussion on the social effects of attitudes towards dialects. Panel members (D. Willis and A. Leemann) gave short presentations followed by a question and answer session and lively discussion, raising awareness of a broad range of issues related to the project.
Year(s) Of Engagement Activity 2019
 
Description Seminar series, Language Variation and Change 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact A series of seminars was held on current issues in language variation and change with an emphasis on how these issues related to social media research. Speakers were Prof Jack Grieve (Aston), Prof Benedikt Szmrecsanyi (Leuven) and Prof Isabelle Buchstaller (Duisburg-Essen). Around 20 participants, primarily postgraduate students, attended, sparking discussion of these issues and promoting use of the relevant approaches in postgraduate research.
Year(s) Of Engagement Activity 2018
 
Description Twitter account @trydarieitheg 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact Welsh-language Twitter account (~1300 followers at time of submission), posting dialect maps and project outputs, and participating in discussions with followers.
Year(s) Of Engagement Activity 2018,2019
URL https://twitter.com/trydarieitheg
 
Description Twitter account @tweetolectology 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact English-language Twitter account (~3200 followers at time of submission), posting dialect maps and project outputs, and participating in discussions with followers.
Year(s) Of Engagement Activity 2018,2019
URL https://twitter.com/tweetolectology