Investigating the diffusion of morphosyntactic innovations using social media

Lead Research Organisation: University of Cambridge

Department Name: Linguistics

Abstract

We propose to research the way in which changes in the grammar of languages ("innovations") spread out from a small number of speakers to a larger section of the population ("diffusion"). The types of data gathered by traditional means are not especially well-suited to studying diffusion processes. Dialectologists have traditionally gathered data by surveying speakers at a large range of different localities. However, it is usually impractical to survey a large number of speakers at each locality, meaning that such surveys do poorly when there is variation between speakers living in the same places (as is usually the case during ongoing change). Sociolinguists typically interview a large number of speakers at a single locality; it is possible to then investigate diffusion processes by comparing such datasets, but this only ever offers a very limited geographical range. Accordingly, we propose to use data from social-media platforms such as Twitter to investigate diffusion processes. Although a form of written language, social-media data tend to be highly informal and often provide a good approximation to spoken data. By using data from social media, we will be able to gather a very large quantity of localised data from many places across large areas. We will explore new ways of using social media data to investigate language variation and change and demonstrate the effectiveness of using such data to investigate geospatial diffusion and change in grammatical patterns.

We will use Twitter to collect three datasets ("corpora") of tweets in English and Welsh in Britain, and in Norwegian in Norway covering a period of nine months. The selection of these three languages enables us to compare the effect of different demographic and geographic scenarios on patterns of diffusion: Welsh as a minority/lesser-used language vs. English and Norwegian as majority languages; low population density in Norway vs. higher population density in much of England; etc. We will supplement these Twitter corpora with additional data obtained by data-scraping from Norwegian- and Welsh-language-specific social media.

We will then identify language changes currently diffusing in these populations and investigate their distribution in these corpora. Changes we expect to investigate include the spread of a new second-person pronoun 'chdi' ('you') in Welsh, the spread of future constructions with 'komme til å' and 'bli å' in Norwegian and changes in the syntax of constructions with 'need' in English (such as 'you need your hair washing' vs. 'you need your hair washed'). By identifying all instances of both the innovative and the older option(s) in our corpora we will be able to map where each different option is typically used. By comparing these findings with geographical patterns known from earlier studies or identified in other datasets, we will be able to map the spread of new forms over time and so identify the properties of processes of diffusion. In this way, we will be able to answer questions such as: do changes in these speech communities spread continuously over land ("contagious diffusion") or jump from city to city before reaching rural regions ("hierarchical diffusion")? Is this mode of diffusion affected by the type of change in question, by the demographics of the region or by some other factors?
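The counting step described above (find every instance of each competing variant, then compare their rates by place) can be sketched roughly as follows. This is a minimal illustration only, not the project's method: the regular expression is a deliberately crude stand-in for proper linguistic annotation, and the function names, region labels and sample tweets are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Crude, hypothetical pattern for "need + up to two words + V-ing / V-ed".
# It will over-match (e.g. "need to go shopping"); real work would need tagging.
NEED_PATTERN = re.compile(r"\bneeds?\s+(?:\w+\s+){1,2}\w+?(ing|ed)\b", re.IGNORECASE)


def classify(text):
    """Return 'ing' or 'ed' for the first matching 'need' construction, else None."""
    match = NEED_PATTERN.search(text)
    return match.group(1).lower() if match else None


def variant_shares(tweets):
    """tweets: iterable of (region, text). Returns {region: {variant: share}}."""
    counts = defaultdict(Counter)
    for region, text in tweets:
        variant = classify(text)
        if variant is not None:
            counts[region][variant] += 1
    return {
        region: {v: n / sum(c.values()) for v, n in c.items()}
        for region, c in counts.items()
    }


# Invented sample data, for illustration only.
sample = [
    ("Region A", "you need your hair washing"),
    ("Region A", "I need my car fixing"),
    ("Region A", "you need your hair washed"),
    ("Region B", "we need the roof repaired"),
]
shares = variant_shares(sample)
```

Working with shares rather than raw counts keeps regions with very different numbers of tweets comparable, which matters when population density varies as much as it does between the study areas.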

We will use interactions between users in our corpora (retweets, @direct messages, mutual following) to construct a model of these users' social network. We will then be able to compare the effectiveness of this network model as a predictor of the pathway of diffusion with that of the geographical model. Our results will be demonstrated in action through web-apps that predict users' origins using their responses to questions about their language use, and they will furthermore be made available to the public via an online atlas-style website.
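As a rough sketch of how interactions might be turned into a network model, the example below builds a weighted, undirected graph and computes, for one user, the weighted share of their contacts who already use an innovative form. The data layout, the equal weighting of interaction types and both function names are assumptions made for illustration; the proposal does not specify them.

```python
from collections import defaultdict


def build_network(interactions):
    """interactions: iterable of (user_a, user_b, kind).

    Returns a weighted undirected adjacency map. Every interaction type
    counts equally here, which is a simplifying assumption.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for a, b, _kind in interactions:
        if a == b:
            continue  # ignore self-interactions
        graph[a][b] += 1
        graph[b][a] += 1
    return graph


def neighbour_adoption(graph, adopters, user):
    """Weighted share of a user's neighbours who are in `adopters`.

    Returns None when the user has no neighbours in the graph.
    """
    neighbours = graph.get(user, {})
    total = sum(neighbours.values())
    if total == 0:
        return None
    return sum(w for n, w in neighbours.items() if n in adopters) / total


# Invented sample data, for illustration only.
interactions = [
    ("u1", "u2", "retweet"),
    ("u1", "u2", "mention"),
    ("u1", "u3", "follow"),
    ("u4", "u4", "retweet"),
]
graph = build_network(interactions)
exposure = neighbour_adoption(graph, {"u2"}, "u1")
```

A measure of this kind could then be set against a purely geographical predictor, such as the share of adopters in the user's region, to see which better anticipates adoption.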

Planned Impact

The project will benefit non-academic parties in a variety of ways. We will communicate our findings to education professionals and the general public via online dialect-prediction apps, via a flexible online 'atlas' of linguistic variation in social media, and face-to-face through a series of public events. We will also involve such users in the design of the research through engagement on social media.

During the design phase of the project, we will engage with potential users through Facebook discussion groups on language and dialect. The experience and observations of members of these groups will provide useful input to our choice of variables to investigate, particularly in bringing very recent innovations to our attention.

Once data has been collected, it will be used as the basis for a major impact element of the project, namely the construction of three dialect web-apps after the model of the New York Times (Katz & Andrews 2013) and Der Spiegel (Leemann et al. 2015) dialect surveys. These apps will take the form of a series of questions about respondents' language use. The app will then offer a prediction of the respondent's birthplace on the basis of their answers to these questions and the distributions of these answers identified in our geolocated social-media corpus. Finally, users will be offered the opportunity to rate the prediction and submit social metadata including any social-media identities. These apps will form an integral part of the research project: they will allow us to collect a comparator dataset of self-reporting data for future research and social metadata for any users who also contributed data to the social-media corpus. However, they will also allow us to engage directly and positively with the public about the project, bringing people's attention to ways in which their language use might reflect their geographical origin that they are less likely to be aware of and informing them about processes of language change. For example, people are rarely conscious of regional differences in syntax.
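One simple way to turn answers and regional distributions into a prediction is a naive Bayes ranking, sketched below. This is an assumption chosen for illustration, not a description of how the project's apps work: the function name, the add-one smoothing, the uniform prior over regions and all counts are hypothetical.

```python
import math


def predict_region(answers, variant_counts, smoothing=1.0):
    """Rank regions by how well their variant distributions fit the answers.

    answers: {question: chosen_variant}
    variant_counts: {region: {question: {variant: count}}}
    Treats questions as independent and all regions as equally likely a priori.
    """
    # Collect every variant attested for each question across all regions.
    variants = {}
    for by_question in variant_counts.values():
        for question, counts in by_question.items():
            variants.setdefault(question, set()).update(counts)

    scores = {}
    for region, by_question in variant_counts.items():
        log_score = 0.0
        for question, choice in answers.items():
            counts = by_question.get(question, {})
            options = variants.get(question, set()) | {choice}
            total = sum(counts.values()) + smoothing * len(options)
            log_score += math.log((counts.get(choice, 0) + smoothing) / total)
        scores[region] = log_score
    return sorted(scores, key=scores.get, reverse=True)


# Invented counts, for illustration only.
corpus_counts = {
    "Region A": {
        "need": {"ing": 80, "ed": 20},
        "dative": {"give it me": 60, "give me it": 40},
    },
    "Region B": {
        "need": {"ing": 10, "ed": 90},
        "dative": {"give it me": 20, "give me it": 80},
    },
}
ranking = predict_region({"need": "ing", "dative": "give it me"}, corpus_counts)
```

Smoothing prevents a single variant that happens to be unattested in a region's corpus from ruling that region out entirely, which is useful where some places contribute few tweets.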

As with previous similar surveys (cf. Leemann's collaboration with Der Spiegel), we will work with news media in order to reach the largest audience possible, both by promoting links to the surveys themselves on national news websites in the relevant countries and by being available for interview by other news outlets. In this way we will also raise the public profile of linguistics and contribute to well-informed media coverage of the discipline.

We will also help to raise the profile of the research outside academic circles by constructing an atlas website. This will be a set of interactive maps in which users can explore the geographical distribution of each of the different variables we have examined, presented clearly and in language that will be accessible to the general public. As such, it will represent an excellent teaching resource for teachers of English Language at GCSE and A-level in the UK when exploring the notion of dialect variation, and for teachers in Norway at videregående 3 level teaching 'characteristics of spoken dialects'. It will also be valuable as an accessible resource for teachers of Welsh as a second language in Wales, where there is a need for greater awareness of dialect variation and the distance between Standard Literary Welsh and regional spoken Welsh.

We will give a series of public engagement talks over the entire timeline of the project. These will comprise two larger public engagement events with invited speakers aimed at education professionals (teachers, adult education etc.), as well as individual talks at existing public-engagement events such as the Cambridge Festival of Ideas and the Cambridge Modern Foreign Language Annual Conference for Teachers. In these talks we will discuss dialect variation and change in modern languages, especially in syntax, and the relationship between spoken languages and written language in social media.

Funded Value:

£373,465

Funded Period:

Aug 17 - Jun 20

Funder:

ESRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

ES/P00752X/1

Principal Investigator:

David Willis

Research Subject:

Linguistics (100%)

Research Topic:

Language Variation & Change (50%)

Linguistics (General) (50%)

Organisations

University of Cambridge (Lead Research Organisation)

People	ORCID iD
David Willis (Principal Investigator)	http://orcid.org/0000-0003-0755-9248
Tam Blaxter (Co-Investigator)
Adrian Leemann (Co-Investigator)
Deepthi Gopal (Researcher)	http://orcid.org/0000-0002-0433-9648

Publications

Blaxter T (2019) Localising morphosyntactic variation in Twitter data

Willis D (2018) Localising morphosyntactic variation in Twitter data

Willis D (2019) Innovation and obsolescence in the Syntactic Atlas of Welsh Dialects and Trydarieitheg projects

Willis D (2018) Localising morphosyntactic variation in Welsh Twitter data

Willis D (2019) Big data for a small language: Mapping variation in Welsh on social media

Willis D (2020) Using social-media data to investigate morphosyntactic variation and dialect syntax in a lesser-used language: Two case studies from Welsh in Glossa: a journal of general linguistics

Willis D (2020) Variation in British English morphosyntax in the Tweetolectology corpus

Willis D (2019) Apparent-time and spatial diffusion in large social-media corpora

Willis D (2019) Localizing morphosyntactic variation in Welsh Twitter data

Related Projects

Project Reference	Relationship	Related To	Start	End	Award Value
ES/P00752X/1			29/08/2017	29/06/2020	£373,465
ES/P00752X/2	Transfer	ES/P00752X/1	30/06/2020	30/07/2021	£83,080

Key Findings
Impact Summary
Further Funding
Research Databases and Models
Engagement Activities

Key Findings


Description	We have built on and improved tools for associating individual Twitter users with geographical locations, allowing us to create geographically tagged corpora of language use. We have extended these tools, which we had previously developed and tested for English and Welsh, to the mainland Scandinavian languages (Norwegian, Swedish, Danish), and improved the accuracy of our existing tools. With these localisation methods, we have undertaken investigations into features of these languages believed to be undergoing change. For instance, we have looked at the phenomenon of 'preposition drop' in English, in which constructions involving a motion verb and destination are produced without a preposition (as in 'go (the) pub' for 'go to the pub'). A number of these case studies concern historically relatively recent patterns that have not attracted a great deal of attention in the literature thus far. Our methods have allowed us to describe much better the geographical distribution of the linguistic constructions under investigation, identifying grammatical variation across geographical space, and deriving or substantiating hypotheses about historical grammatical development and diffusion of innovations. Other features of particular note include: the dative alternation (give it me vs. give me it vs. give it to me); the syntax of the verbs need and want; and the spread of do-support with have (do you have it? replacing have you got it?). We have been able to map the regions in which older usage remains with more specificity than has previously been possible. The same tools have been applied to our work on Welsh and the Scandinavian languages to produce new datasets that are much more comprehensive, and associated with better spatial metadata, than traditional studies have been able to achieve. We anticipate that our work will provide a basis for future work on these phenomena.
Exploitation Route	The method for associating Twitter users with geographic locations may be used by researchers or non-academic users interested in linguistic variation, or indeed any other feature represented in social media that could vary geographically (mentions of any topic of interest to the researcher). Our data sets may be used for further investigation both of individual constructions in the languages of the project and of how innovations spread through space. Dialect maps of features produced by the project may be used in teaching (either first- or second-language teaching in contexts where awareness of dialect variation could add depth to the teaching) and are of interest to the general public, stimulating interest in and discussion of the nature of dialect variation, above all in British English and Welsh, but also potentially in the other languages of study.
Sectors	Digital/Communication/Information Technologies (including Software); Education; Culture, Heritage, Museums and Collections
URL	http://tweetolectology.com

Impact Summary


Description	Our findings appeared on social media through a programme of informal interaction on our two Twitter/X accounts, @tweetolectology in English (5,500 followers) and @trydarieitheg in Welsh (1,300 followers). We created a public platform for the discussion of language variation along with the presentation of historical dialectological data in an up-to-date and accessible form. Discussions there fed into the choice of variables for further impact and investigation. We also used this as a platform to put out informal surveys about current language variation among the followers of the accounts, which provoked further discussion. Group members appeared on the radio (Rhaglen Aled Hughes, Radio Cymru) and at science festivals (Scientifica, Zürich), both in the UK and abroad. We developed an online atlas to showcase our results and datasets for both specialist users and the general public, and a dialect app for the general public which attempts to infer people's place of origin from their answers to various questions about language use. This provides an accessible and entertaining way for the general public to understand our work.
First Year Of Impact	2019
Sector	Education; Culture, Heritage, Museums and Collections
Impact Types	Cultural, Societal

Further Funding


Description	Cambridge Humanities Research Grant Scheme (CHRG) (grant title: "Investigating variation and change in Haitian Creole using social media")
Amount	£8,600 (GBP)
Organisation	University of Cambridge
Sector	Academic/University
Country	United Kingdom
Start	01/2020
End	06/2020


Description	Mapping Language Variation and Change
Amount	£3,660 (GBP)
Organisation	German Academic Exchange Service (DAAD)
Sector	Academic/University
Country	United States
Start	03/2019
End	06/2019

Research Databases and Models


Title	Tweetolectology Atlas
Description	This is an online atlas, providing user-controlled visualizations of the data produced by the project, namely a database of linguistic variants produced in tweets by Twitter users, annotated for geographic location of user and linguistically relevant conditioning factors as appropriate.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
Impact	The atlas dataset forms the basis for project publications and conference and seminar papers.
URL	http://atlas.tweetolectology.com


Title	Underlying dataset for the publication 'Using social-media data to investigate morphosyntactic variation and dialect syntax in a lesser-used language: Two case studies from Welsh'
Description	The dataset contains instances of the following, annotated for context and other linguistic factors, in a corpus of tweets in Welsh: (i) forms of the second-person singular pronoun; (ii) deletion of auxiliary verbs in clauses in the second-person singular. Full metadata can be found at https://doi.org/10.5334/gjgl.1073.s2/. The accompanying research output is at https://www.glossa-journal.org/article/10.5334/gjgl.1073/. For terms of use, see https://www.glossa-journal.org/ and https://creativecommons.org/licenses/by/3.0/.
Type Of Material	Database/Collection of data
Year Produced	2020
Provided To Others?	Yes
Impact	This research developed methods that were subsequently enhanced for other project publications and fed into other outputs and public engagement, notably the project online atlas and dialect app.
URL	https://doi.org/10.5334/gjgl.1073.s1

Engagement Activities


Description	'Tweetolectology' - Wie man mit Twitter Veränderung in der Sprache untersuchen kann.
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Public/other audiences
Results and Impact	Outreach talk (Leemann, on behalf of Leemann, Blaxter, Gopal, Willis) at 'Scientifica' (the University of Zurich research days), 30 Aug/1 September 2019.
Year(s) Of Engagement Activity	2019
URL	http://www.scientifica.ch


Description	Analyzing phonetic and morphosyntactic variation and change in British English using app and Twitter data
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	Invited talk at the University of Basel, Switzerland. Between 50 and 70 students (UG and PG level) were in attendance. The audience was regional as well as international. The talk was given in the context of a seminar on language variation and change in British English.
Year(s) Of Engagement Activity	2020


Description	Clare Politics Society panel discussion event
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Undergraduate students
Results and Impact	An audience of 100-150 attended for a panel discussion on the social effects of attitudes towards dialects. Panel members (D. Willis and A. Leemann) gave short presentations followed by a question and answer session and lively discussion, raising awareness of a broad range of issues related to the project.
Year(s) Of Engagement Activity	2019


Description	Dialect-guessing app
Form Of Engagement Activity	Engagement focused website, blog or social media channel
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Public/other audiences
Results and Impact	A dialect-guessing web-based app in which the general public pick options from a list of variant sentences and then see maps of their distribution, culminating in a prediction about where they are from. Users gain greater awareness of grammatical variation and may be encouraged to look more at the project's work. Also available at http://app.tweetolectology.com/.
Year(s) Of Engagement Activity	2022
URL	http://wheredotheysay.com/


Description	Discovering Linguistics: Linguistic Discoveries
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Undergraduate students
Results and Impact	Presentation to around 50 undergraduate students from a wide range of backgrounds aimed at sparking their interest in linguistics.
Year(s) Of Engagement Activity	2022
URL	https://www.dlld.ugent.be/


Description	Selwyn Linguists' Society
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Undergraduate students
Results and Impact	A talk to around 30 undergraduates across modern languages, linguistics and Asian and Middle Eastern studies, highlighting the project's research and language variation and change more broadly.
Year(s) Of Engagement Activity	2022


Description	Seminar series, Language Variation and Change
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Postgraduate students
Results and Impact	A series of seminars was held on current issues in language variation and change with an emphasis on how these issues related to social media research. Speakers were Prof Jack Grieve (Aston), Prof Benedikt Szmrecsanyi (Leuven) and Prof Isabelle Buchstaller (Duisburg-Essen). Around 20 participants, primarily postgraduate students, attended, sparking discussion of these issues and promoting use of the relevant approaches in postgraduate research.
Year(s) Of Engagement Activity	2018


Description	Twitter account @trydarieitheg
Form Of Engagement Activity	Engagement focused website, blog or social media channel
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Public/other audiences
Results and Impact	Welsh-language Twitter account (~1300 followers at time of submission), posting dialect maps and project outputs, and participating in discussions with followers.
Year(s) Of Engagement Activity	2018,2019
URL	https://twitter.com/trydarieitheg


Description	Twitter account @tweetolectology
Form Of Engagement Activity	Engagement focused website, blog or social media channel
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Public/other audiences
Results and Impact	English-language Twitter account (~3200 followers at time of submission), posting dialect maps and project outputs, and participating in discussions with followers.
Year(s) Of Engagement Activity	2018,2019
URL	https://twitter.com/tweetolectology


Description	Workshop on language variation and change
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	Workshop on 'Language variation and change' organised in December 2019. Speakers were the project members, Nanna Hilton (Groningen), Péter Jeszensky (Bern), and Steven Coats (Oulu), with around 20 attendees, mostly postgraduate and undergraduate students.
Year(s) Of Engagement Activity	2019
URL	http://www.ling.cam.ac.uk/socmedia/resources/workshop_december2019_programme.pdf