Investigating the diffusion of morphosyntactic innovations using social media

Lead Research Organisation: University of Cambridge
Department Name: Linguistics


We propose to research the way in which changes in the grammar of languages ("innovations") spread out from a small number of speakers to a larger section of the population ("diffusion"). The types of data gathered by traditional means are not especially well suited to studying diffusion processes. Dialectologists have traditionally gathered data by surveying speakers at a wide range of localities. However, it is usually impractical to survey a large number of speakers at each locality, meaning that such surveys perform poorly when there is variation between speakers living in the same place (as is usually the case during ongoing change). Sociolinguists typically interview a large number of speakers at a single locality; it is then possible to investigate diffusion processes by comparing such datasets, but this only ever offers a very limited geographical range. Accordingly, we propose to use data from social-media platforms such as Twitter to investigate diffusion processes. Although they are a form of written language, social-media data tend to be highly informal and often provide a good approximation to spoken data. By using data from social media, we will be able to gather a very large quantity of localised data from many places across large areas. We will explore new ways of using social-media data to investigate language variation and change, and demonstrate the effectiveness of such data for investigating geospatial diffusion and change in grammatical patterns.

We will use Twitter to collect three datasets ("corpora") of tweets, in English and Welsh in Britain and in Norwegian in Norway, covering a period of nine months. The selection of these three languages enables us to compare the effect of different demographic and geographic scenarios on patterns of diffusion: Welsh as a minority/lesser-used language vs. English and Norwegian as majority languages; low population density in Norway vs. higher population density in much of England; etc. We will supplement these Twitter corpora with additional data scraped from Norwegian- and Welsh-language-specific social media.

We will then identify language changes currently diffusing in these populations and investigate their distribution in these corpora. Changes we expect to investigate include the spread of a new second-person pronoun 'chdi' ('you') in Welsh, the spread of future constructions with 'komme til å' and 'bli å' in Norwegian, and changes in the syntax of constructions with 'need' in English (such as 'you need your hair washing' vs. 'you need your hair washed'). By identifying all instances of both the innovative and the older option(s) in our corpora, we will be able to map where each option is typically used. By comparing these findings with geographical patterns known from earlier studies or identified in other datasets, we will be able to map the spread of new forms over time and so identify the properties of processes of diffusion. In this way, we will be able to answer questions such as: do changes in these speech communities spread continuously over land ("contagious diffusion") or jump from city to city before reaching rural regions ("hierarchical diffusion")? Is the mode of diffusion affected by the type of change in question, by the demographics of the region or by some other factor?
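The step of identifying competing variants and mapping where each is used could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual pipeline: the regular expression is a deliberately crude, hypothetical pattern for the English 'need' construction, the region labels and example tweets are invented, and the names `NEED_PATTERN`, `classify` and `variant_shares` are made up for this sketch.

```python
import re
from collections import Counter, defaultdict

# Hypothetical pattern: "need(s/ed) <determiner> <noun> <participle>".
# A real study would need far more careful matching and manual checking.
NEED_PATTERN = re.compile(
    r"\bneed(?:s|ed)?\s+(?:my|your|his|her|our|their|the)\s+\w+\s+(\w+?)(ing|ed)\b",
    re.IGNORECASE,
)


def classify(text):
    """Return 'V-ing', 'V-ed', or None if the construction is not found."""
    match = NEED_PATTERN.search(text)
    if match is None:
        return None
    return "V-ing" if match.group(2).lower() == "ing" else "V-ed"


def variant_shares(tweets):
    """Map each region to the share of 'V-ing' among all matched tokens.

    `tweets` is an iterable of (region, text) pairs; regions with no
    matching tweets are omitted from the result.
    """
    counts = defaultdict(Counter)
    for region, text in tweets:
        variant = classify(text)
        if variant is not None:
            counts[region][variant] += 1
    return {
        region: c["V-ing"] / (c["V-ing"] + c["V-ed"])
        for region, c in counts.items()
    }


# Invented example data, for illustration only.
example_tweets = [
    ("Leeds", "you need your hair washing"),
    ("Leeds", "my car needs the tyres changing"),
    ("Leeds", "you need your hair washed"),
    ("Bristol", "you need your hair washed"),
    ("Bristol", "nice weather today"),
]
shares = variant_shares(example_tweets)
```

Per-region shares of this kind are what would then be plotted on a map and compared with distributions reported in earlier studies.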

We will use interactions between users in our corpora (retweets, @direct messages, mutual following) to construct a model of these users' social network. We will then be able to compare the effectiveness of this network model as a predictor of the pathway of diffusion with that of the geographical model. Our results will be demonstrated in action through web-apps that predict users' origins from their responses to questions about their language use. They will also be made available to the public via an online atlas-style website.
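As a rough illustration of how interaction records might be turned into a network model, the sketch below builds an undirected, weighted graph and computes a simple "exposure" measure for one user. It is a minimal sketch under assumptions: the (source, target, kind) record format, the equal weighting of all interaction types, and the function names `build_network` and `network_exposure` are hypothetical and are not taken from the project.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected weighted graph from (source, target, kind) records.

    Every interaction adds 1 to the tie strength between the two users;
    treating all interaction kinds alike is an assumption of this sketch.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for source, target, _kind in interactions:
        if source == target:
            continue  # ignore self-interactions
        graph[source][target] += 1
        graph[target][source] += 1
    return graph


def network_exposure(graph, adopters, user):
    """Weighted share of a user's ties that lead to users of the new form."""
    neighbours = graph.get(user, {})
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    adopted = sum(w for name, w in neighbours.items() if name in adopters)
    return adopted / total


# Invented example data, for illustration only.
example_interactions = [
    ("a", "b", "retweet"),
    ("a", "b", "mention"),
    ("a", "c", "follow"),
    ("d", "d", "retweet"),
]
network = build_network(example_interactions)
exposure_a = network_exposure(network, {"b"}, "a")
```

A measure like this could be set beside a purely geographical predictor (for example, the share of the new form in a user's region) to see which better accounts for who adopts an innovation.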

Planned Impact

The project will benefit non-academic parties in a variety of ways. We will communicate our findings to education professionals and the general public via online dialect-prediction apps, via a flexible online 'atlas' of linguistic variation in social media, and face-to-face through a series of public events. We will also involve such users in the design of the research through engagement on social media.

During the design phase of the project, we will engage with potential users through Facebook discussion groups on language and dialect. The experience and observations of members of these groups will provide useful input to our choice of variables to investigate, particularly in bringing very recent innovations to our attention.

Once the data have been collected, they will be used as the basis for a major impact element of the project, namely the construction of three dialect web-apps after the model of the New York Times (Katz & Andrews 2013) and Der Spiegel (Leemann et al. 2015) dialect surveys. These apps will take the form of a series of questions about respondents' language use. Each app will then offer a prediction of the respondent's birthplace on the basis of their answers to these questions and the distributions of these answers identified in our geolocated social-media corpus. Finally, users will be offered the opportunity to rate the prediction and submit social metadata including any social-media identities. These apps will form an integral part of the research project: they will allow us to collect a comparator dataset of self-reported data for future research and social metadata for any users who also contributed data to the social-media corpus. However, they will also allow us to engage directly and positively with the public about the project, informing them about processes of language change and drawing attention to ways in which their language use might reflect their geographical origin without their being aware of it. For example, people are rarely conscious of regional differences in syntax.
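One simple way such a prediction could work is to score each region by how probable the respondent's answers are under that region's answer distributions. The sketch below uses a naive-Bayes-style score with a uniform prior over regions. This is an assumed technique chosen for illustration, not a description of the method used by the project or by the earlier surveys; the function name `predict_region`, the smoothing constant and the example distributions are all hypothetical.

```python
import math


def predict_region(answers, distributions, smoothing=1e-3):
    """Return (best_region, scores) for a respondent's answers.

    `answers` maps question -> chosen answer.
    `distributions` maps region -> question -> answer -> relative frequency.
    Answers are treated as independent, and all regions are assumed
    equally likely a priori (both simplifying assumptions).
    """
    scores = {}
    for region, per_question in distributions.items():
        log_score = 0.0
        for question, answer in answers.items():
            p = per_question.get(question, {}).get(answer, 0.0)
            # Smoothing avoids log(0) for answers unseen in a region.
            log_score += math.log(p + smoothing)
        scores[region] = log_score
    best = max(scores, key=scores.get)
    return best, scores


# Invented frequencies, for illustration only.
example_distributions = {
    "North": {"need": {"V-ing": 0.7, "V-ed": 0.3}},
    "South": {"need": {"V-ing": 0.1, "V-ed": 0.9}},
}
best_region, region_scores = predict_region(
    {"need": "V-ing"}, example_distributions
)
```

Users' ratings of the prediction, as described above, would give a direct check on how well any such scoring scheme performs.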

As with previous similar surveys (cf. Leemann's collaboration with Der Spiegel), we will work with news media in order to reach the largest audience possible, both by promoting links to the surveys themselves on national news websites in the relevant countries and by being available for interview by other news outlets. In this way we will also raise the public profile of linguistics and contribute to well-informed media coverage of the discipline.

We will also help to raise the profile of the research outside academic circles by constructing an atlas website. This will be a set of interactive maps in which users can explore the geographical distribution of each of the variables we have examined, presented clearly and in language accessible to the general public. As such, it will be an excellent resource for teachers of English Language at GCSE and A-level in the UK when exploring the notion of dialect variation, and for teachers in Norway at videregående 3 level teaching 'characteristics of spoken dialects'. It will also be valuable as an accessible resource for teachers of Welsh as a second language in Wales, where there is a need for greater awareness of dialect variation and of the distance between Standard Literary Welsh and regional spoken Welsh.

We will give a series of public engagement talks over the entire timeline of the project. These will comprise two larger public engagement events with invited speakers aimed at education professionals (teachers, adult education etc.), as well as individual talks at existing public-engagement events such as the Cambridge Festival of Ideas and the Cambridge Modern Foreign Language Annual Conference for Teachers. In these talks we will discuss dialect variation and change in modern languages, especially in syntax, and the relationship between spoken languages and written language in social media.