Investigating the diffusion of morphosyntactic innovations using social media
Lead Research Organisation:
University of Cambridge
Department Name: Linguistics
Abstract
We propose to research the way in which changes in the grammar of languages ("innovations") spread out from a small number of speakers to a larger section of the population ("diffusion"). The types of data gathered by traditional means are not especially well-suited to studying diffusion processes. Dialectologists have traditionally gathered data by surveying speakers at a large range of different localities. However, it is usually impractical to survey a large number of speakers at each locality, meaning that such surveys do poorly when there is variation between speakers living in the same places (as is usually the case during ongoing change). Sociolinguists typically interview a large number of speakers at a single locality; it is possible to then investigate diffusion processes by comparing such datasets, but this only ever offers a very limited geographical range. Accordingly, we propose to use data from social-media platforms such as Twitter to investigate diffusion processes. Although a form of written language, social-media data tend to be highly informal and often provide a good approximation to spoken data. By using data from social media, we will be able to gather a very large quantity of localised data from many places across large areas. We will explore new ways of using social media data to investigate language variation and change and demonstrate the effectiveness of using such data to investigate geospatial diffusion and change in grammatical patterns.
We will use Twitter to collect three datasets ("corpora") of tweets in English and Welsh in Britain, and in Norwegian in Norway covering a period of nine months. The selection of these three languages enables us to compare the effect of different demographic and geographic scenarios on patterns of diffusion: Welsh as a minority/lesser-used language vs. English and Norwegian as majority languages; low population density in Norway vs. higher population density in much of England; etc. We will supplement these Twitter corpora with additional data obtained by data-scraping from Norwegian- and Welsh-language-specific social media.
We will then identify language changes currently diffusing in these populations and investigate their distribution in these corpora. Changes we expect to investigate include the spread of a new second-person pronoun 'chdi' ('you') in Welsh, the spread of future constructions with 'komme til å' and 'bli å' in Norwegian and changes in the syntax of constructions with 'need' in English (such as 'you need your hair washing' vs. 'you need your hair washed'). By identifying all instances of both the innovative and the older option(s) in our corpora we will be able to map where each different option is typically used. By comparing these findings with geographical patterns known from earlier studies or identified in other datasets, we will be able to map the spread of new forms over time and so identify the properties of processes of diffusion. In this way, we will be able to answer questions such as: do changes in these speech communities spread continuously over land ("contagious diffusion") or jump from city to city before reaching rural regions ("hierarchical diffusion")? Is this mode of diffusion affected by the type of change in question, by the demographics of the region or by some other factors?
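As an illustration only, the variant-mapping step described above might be sketched as follows. The records, region names and the helper `variant_shares` are hypothetical toy values invented for this sketch; they are not project data or results, and the project's own tooling is not described in this record.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: one (region, variant) pair per matched tweet.
# These values are invented for illustration and are not project findings.
observations = [
    ("North West", "washing"),
    ("North West", "washing"),
    ("North West", "washed"),
    ("South East", "washed"),
    ("South East", "washed"),
]

def variant_shares(records):
    """Return, for each region, the proportion of tokens using each variant."""
    counts = defaultdict(Counter)
    for region, variant in records:
        counts[region][variant] += 1
    shares = {}
    for region, counter in counts.items():
        total = sum(counter.values())
        shares[region] = {v: n / total for v, n in counter.items()}
    return shares

shares = variant_shares(observations)
```

Per-region proportions of this kind are the sort of quantity that could then be plotted on a map and compared with distributions reported in earlier surveys.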
We will use interactions between users in our corpora (retweets, @direct messages, mutual following) to construct a model of these users' social network. We will then be able to compare the effectiveness of this network model as a predictor of the pathway of diffusion with that of the geographical model. Our results will be demonstrated in action through web-apps that predict users' origins from their responses to questions about their language use, and they will furthermore be made available to the public via an online atlas-style website.
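A minimal sketch of how such a network model could be represented, assuming interactions are reduced to undirected user pairs. The user IDs, the `uses_innovation` flags and both helper functions are hypothetical and chosen for illustration; "neighbour exposure" is one simple network predictor among many, not necessarily the measure the project uses.

```python
from collections import defaultdict

# Hypothetical interaction pairs (e.g. a retweet or mention between two users).
interactions = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

# Hypothetical flags: does each user produce the innovative variant?
uses_innovation = {"a": True, "b": True, "c": False, "d": False}

def build_network(pairs):
    """Build an undirected adjacency map from interaction pairs."""
    graph = defaultdict(set)
    for u, v in pairs:
        if u != v:  # ignore self-interactions
            graph[u].add(v)
            graph[v].add(u)
    return graph

def neighbour_exposure(graph, adopters):
    """Share of each user's network neighbours who use the innovation."""
    return {
        user: sum(adopters.get(n, False) for n in nbrs) / len(nbrs)
        for user, nbrs in graph.items()
    }

network = build_network(interactions)
exposure = neighbour_exposure(network, uses_innovation)
```

Exposure scores like these could be set against a purely geographical predictor (for example, the share of adopters within a given distance) to see which better accounts for who adopts a form.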
Planned Impact
The project will benefit non-academic parties in a variety of ways. We will communicate our findings to education professionals and the general public via online dialect-prediction apps, via a flexible online 'atlas' of linguistic variation in social media, and face-to-face through a series of public events. We will also involve such users in the design of the research through engagement on social media.
During the design phase of the project, we will engage with potential users through Facebook discussion groups on language and dialect. The experience and observations of members of these groups will provide useful input to our choice of variables to investigate, particularly in bringing very recent innovations to our attention.
Once data has been collected, it will be used as the basis for a major impact element of the project, namely the construction of three dialect web-apps after the model of the New York Times (Katz & Andrews 2013) and Der Spiegel (Leemann et al. 2015) dialect surveys. These apps will take the form of a series of questions about respondents' language use. Each app will then offer a prediction of the respondent's birthplace on the basis of their answers to these questions and the distributions of these answers identified in our geolocated social-media corpus. Finally, users will be offered the opportunity to rate the prediction and submit social metadata including any social-media identities. These apps will form an integral part of the research project: they will allow us to collect a comparator dataset of self-reporting data for future research and social metadata for any users who also contributed data to the social-media corpus. However, they will also allow us to engage directly and positively with the public about the project, drawing people's attention to ways in which their language use might reflect their geographical origin that they are less likely to be aware of and informing them about processes of language change. For example, people are rarely conscious of regional differences in syntax.
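One simple way such a prediction could work is a naive-Bayes-style score, sketched below. This is an assumption made for illustration, not the apps' documented method: it treats answers as independent, assumes equal prior probability for every region, and uses invented region names, question IDs and probabilities.

```python
import math

# Hypothetical per-region probabilities of each answer to each question,
# as might be estimated from a geolocated corpus. All values are invented.
region_variant_probs = {
    "Region A": {"q1": {"x": 0.8, "y": 0.2}, "q2": {"x": 0.6, "y": 0.4}},
    "Region B": {"q1": {"x": 0.3, "y": 0.7}, "q2": {"x": 0.1, "y": 0.9}},
}

def predict_region(answers, probs, floor=1e-6):
    """Return the best-scoring region and all log-likelihood scores.

    Assumes answers are independent and regions have equal priors;
    `floor` avoids log(0) for answers never observed in a region.
    """
    scores = {}
    for region, per_question in probs.items():
        log_score = 0.0
        for question, answer in answers.items():
            p = per_question.get(question, {}).get(answer, 0.0)
            log_score += math.log(max(p, floor))
        scores[region] = log_score
    best = max(scores, key=scores.get)
    return best, scores

best_region, region_scores = predict_region({"q1": "y", "q2": "y"}, region_variant_probs)
```

Working in log space keeps the scores numerically stable when many questions are combined; a real app would also need to handle sparse regions and correlated answers.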
As with previous similar surveys (cf. Leemann's collaboration with Der Spiegel), we will work with news media in order to reach the largest audience possible, both by promoting links to the surveys themselves on national news websites in the relevant countries and by being available for interview by other news outlets. In this way we will also raise the public profile of linguistics and contribute to well-informed media coverage of the discipline.
We will also help to raise the profile of the research outside academic circles by constructing an atlas website. This will be a set of interactive maps in which users can explore the geographical distribution of each of the different variables we have examined, presented clearly and in language that will be accessible to the general public. As such, it will represent an excellent teaching resource for teachers of English Language at GCSE and A-level in the UK when exploring the notion of dialect variation, and for teachers in Norway at videregående 3 level teaching 'characteristics of spoken dialects'. It will also be valuable as an accessible resource for teachers of Welsh as a second language in Wales, where there is a need for greater awareness of dialect variation and the distance between Standard Literary Welsh and regional spoken Welsh.
We will give a series of public engagement talks over the entire timeline of the project. These will comprise two larger public engagement events with invited speakers aimed at education professionals (teachers, adult education etc.), as well as individual talks at existing public-engagement events such as the Cambridge Festival of Ideas and the Cambridge Modern Foreign Language Annual Conference for Teachers. In these talks we will discuss dialect variation and change in modern languages, especially in syntax, and the relationship between spoken languages and written language in social media.
Publications
Willis, D. W. E.
(2017)
Welsh pronouns and auxiliary deletion: A comparison of social media and traditional methods
Willis, D.
(2020)
Variation in British English morphosyntax in the Tweetolectology corpus
Willis D
(2020)
Using social-media data to investigate morphosyntactic variation and dialect syntax in a lesser-used language: Two case studies from Welsh
in Glossa: a journal of general linguistics
Willis, D. W. E.
(2017)
Using social-media data to investigate morphosyntactic variation and change
Willis D
(2018)
Urbanisation and morphosyntactic variation in Twitter data
Willis D
(2019)
Localizing morphosyntactic variation in Welsh Twitter data
Willis D
(2018)
Localising morphosyntactic variation in Welsh Twitter data
D Willis
(2018)
Localising morphosyntactic variation in Twitter data
Blaxter T
(2019)
Localising morphosyntactic variation in Twitter data
Willis D.
(2018)
Introducing the Tweetolectology project
Description | We have built on and improved tools for associating individual Twitter users with geographical locations, allowing us to create geographically tagged corpora of language use. We have extended these tools, which we had previously developed and tested for English and Welsh, to the mainland Scandinavian languages (Norwegian, Swedish, Danish), and improved the accuracy of our existing tools. With these localisation methods, we have undertaken investigations into features of these languages believed to be undergoing change. For instance, we have looked at the phenomenon of 'preposition drop' in English, in which constructions involving a motion verb and destination are produced without a preposition (as in 'go (the) pub' for 'go to the pub'). A number of these case studies concern historically recent patterns that have not attracted a great deal of attention in the literature thus far. Our methods have allowed us to describe the geographical distribution of the linguistic constructions under investigation much more precisely, identifying grammatical variation across geographical space, and deriving or substantiating hypotheses about historical grammatical development and diffusion of innovations. Other features of particular note include: the dative alternation ('give it me' vs. 'give me it' vs. 'give it to me'); the syntax of the verbs 'need' and 'want'; and the spread of do-support with 'have' ('do you have it?' replacing 'have you got it?'). We have been able to map the regions in which older usage remains with more specificity than has previously been possible. The same tools have been applied to our work on Welsh and the Scandinavian languages to produce new datasets that are much more comprehensive, and associated with better spatial metadata, than traditional studies have been able to achieve. We anticipate that our work will provide a basis for future work on these phenomena. |
Exploitation Route | The method for associating Twitter users with geographic locations may be used by researchers or non-academic users interested in linguistic variation, or indeed in any other feature represented in social media that could vary geographically (such as mentions of any topic of interest to the researcher). Our datasets may be used for further investigation both of individual constructions in the languages of the project and of how innovations spread through space. Dialect maps of features produced by the project may be used in teaching (either first- or second-language teaching in contexts where awareness of dialect variation could add depth to the teaching) and are of interest to the general public, stimulating interest in and discussion of the nature of dialect variation, above all in British English and Welsh, but also potentially in the other languages of study. |
Sectors | Digital/Communication/Information Technologies (including Software), Education, Culture, Heritage, Museums and Collections |
URL | http://tweetolectology.com |
Description | As outlined in our portfolio, our findings so far have appeared on social media, through a programme of informal interaction, via our two Twitter accounts: @tweetolectology in English (5,500 followers) and @trydarieitheg in Welsh (1,300 followers). We have created a public platform for the discussion of language variation along with the presentation of historical dialectological data in an up-to-date and accessible form. Discussions there have fed into the choice of variables for further impact and investigation. We have also used this as a platform to put out informal surveys about current language variation among the followers of the accounts, which have provoked further discussion. Group members have appeared on the radio (Rhaglen Aled Hughes, Radio Cymru) and are scheduled to appear at science festivals (Scientifica, Zürich) both in the UK and abroad. We have developed an online atlas to showcase our results and datasets for both specialist users and the general public, and a dialect app for the general public which attempts to infer people's place of origin from their answers to various questions about language use. This provides an accessible and entertaining way for the general public to understand our work. |
First Year Of Impact | 2019 |
Sector | Education, Culture, Heritage, Museums and Collections |
Impact Types | Cultural Societal |
Description | Cambridge Humanities Research Grant Scheme (CHRG) (grant title: "Investigating variation and change in Haitian Creole using social media") |
Amount | £8,600 (GBP) |
Organisation | University of Cambridge |
Sector | Academic/University |
Country | United Kingdom |
Start | 01/2020 |
End | 06/2020 |
Description | Mapping Language Variation and Change |
Amount | £3,660 (GBP) |
Organisation | German Academic Exchange Service (DAAD) |
Sector | Academic/University |
Country | United States |
Start | 03/2019 |
End | 06/2019 |
Title | Tweetolectology Atlas |
Description | This is an online atlas, providing user-controlled visualizations of the data produced by the project, namely a database of linguistic variants produced in tweets by Twitter users, annotated for geographic location of user and linguistically relevant conditioning factors as appropriate. |
Type Of Material | Database/Collection of data |
Year Produced | 2022 |
Provided To Others? | Yes |
Impact | The atlas dataset forms the basis for project publications and conference and seminar papers. |
URL | http://atlas.tweetolectology.com |
Title | Underlying dataset for the publication 'Using social-media data to investigate morphosyntactic variation and dialect syntax in a lesser-used language: Two case studies from Welsh' |
Description | The dataset contains instances of the following, annotated for context and other linguistic factors, in a corpus of tweets in Welsh: (1) forms of the second-person singular pronoun; (2) deletion of auxiliary verbs in clauses in the second-person singular. Full metadata can be found at https://doi.org/10.5334/gjgl.1073.s2/. The accompanying research output is at https://www.glossa-journal.org/article/10.5334/gjgl.1073/. For terms of use, see https://www.glossa-journal.org/ and https://creativecommons.org/licenses/by/3.0/. |
Type Of Material | Database/Collection of data |
Year Produced | 2020 |
Provided To Others? | Yes |
Impact | This research developed methods that were subsequently enhanced for other project publications and fed into other outputs and public engagement, notably the project online atlas and dialect app. |
URL | https://doi.org/10.5334/gjgl.1073.s1 |
Description | 'Tweetolectology' - Wie man mit Twitter Veränderung in der Sprache untersuchen kann. |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | Outreach talk (Leemann, on behalf of Leemann, Blaxter, Gopal, Willis) at 'Scientifica' (the University of Zurich research days), 30 Aug/1 September 2019. |
Year(s) Of Engagement Activity | 2019 |
URL | http://www.scientifica.ch |
Description | Analyzing phonetic and morphosyntactic variation and change in British English using app and Twitter data |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Invited talk at the University of Basel, Switzerland. Between 50 and 70 students (UG and PG level) were in attendance. The audience was regional as well as international. The talk was given in the context of a seminar on language variation and change in British English. |
Year(s) Of Engagement Activity | 2020 |
Description | Clare Politics Society panel discussion event |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Undergraduate students |
Results and Impact | An audience of 100-150 attended for a panel discussion on the social effects of attitudes towards dialects. Panel members (D. Willis and A. Leemann) gave short presentations followed by a question and answer session and lively discussion, raising awareness of a broad range of issues related to the project. |
Year(s) Of Engagement Activity | 2019 |
Description | Dialect-guessing app |
Form Of Engagement Activity | Engagement focused website, blog or social media channel |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | A dialect-guessing web-based app in which the general public pick options from a list of variant sentences and then see maps of their distribution, culminating in a prediction about where they are from. Users gain greater awareness of grammatical variation and may be encouraged to look more at the project's work. Also available at http://app.tweetolectology.com/. |
Year(s) Of Engagement Activity | 2022 |
URL | http://wheredotheysay.com/ |
Description | Discovering Linguistics: Linguistic Discoveries |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Undergraduate students |
Results and Impact | Presentation to around 50 undergraduate students from a wide range of backgrounds aimed at sparking their interest in linguistics. |
Year(s) Of Engagement Activity | 2022 |
URL | https://www.dlld.ugent.be/ |
Description | Selwyn Linguists' Society |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Undergraduate students |
Results and Impact | A talk to around 30 undergraduates across modern languages, linguistics and Asian and Middle Eastern studies highlighting the research and variation and change more broadly. |
Year(s) Of Engagement Activity | 2022 |
Description | Seminar series, Language Variation and Change |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Postgraduate students |
Results and Impact | A series of seminars was held on current issues in language variation and change with an emphasis on how these issues relate to social media research. Speakers were Prof Jack Grieve (Aston), Prof Benedikt Szmrecsanyi (Leuven) and Prof Isabelle Buchstaller (Duisburg-Essen). Around 20 participants, primarily postgraduate students, attended, sparking discussion of these issues and promoting use of the relevant approaches in postgraduate research. |
Year(s) Of Engagement Activity | 2018 |
Description | Twitter account @trydarieitheg |
Form Of Engagement Activity | Engagement focused website, blog or social media channel |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | Welsh-language Twitter account (~1300 followers at time of submission), posting dialect maps and project outputs, and participating in discussions with followers. |
Year(s) Of Engagement Activity | 2018,2019 |
URL | https://twitter.com/trydarieitheg |
Description | Twitter account @tweetolectology |
Form Of Engagement Activity | Engagement focused website, blog or social media channel |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | English-language Twitter account (~3200 followers at time of submission), posting dialect maps and project outputs, and participating in discussions with followers. |
Year(s) Of Engagement Activity | 2018,2019 |
URL | https://twitter.com/tweetolectology |
Description | Workshop on language variation and change |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Workshop on 'Language variation and change' organised in December 2019. Speakers were the project members, Nanna Hilton (Groningen), Péter Jeszensky (Bern), and Steven Coats (Oulu), with around 20 attendees, mostly postgraduate and undergraduate students. |
Year(s) Of Engagement Activity | 2019 |
URL | http://www.ling.cam.ac.uk/socmedia/resources/workshop_december2019_programme.pdf |