Constraints on the adaptiveness of information content in language (CAIL): Improving communication and detecting failures in audience design

Lead Research Organisation: Newcastle University
Department Name: Sch of English Lit, Lang & Linguistics

Abstract

This project takes a new approach to one of the most difficult questions in the study of human evolution: what underlies our unique communication system, language? Our project considers language as both a cognitive and a social phenomenon, and shows how language solves the problem of efficient communication for the deeply social animals that we are. Underpinning the idea of "efficient" communication is Information Theory. Pioneered by Claude Shannon in 1948, Information Theory is one of the highest impact mathematical and conceptual frameworks of the last century, influencing fields from genetics to network science. At its core is the insight that receiving information reduces uncertainty about a situation in a quantifiable way. This key insight led to the ability to measure information in a new unit, "bits", which quantifies how many binary circuits a computing device would need to store a particular amount of information - a concept which underlies all modern computing.
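
As an illustration of how information is quantified in bits, the following is a minimal sketch in Python. It is not part of the project's code: the function names `surprisal_bits` and `entropy_bits` are hypothetical, and it assumes only the standard definition that an event with probability p carries -log2(p) bits of information.

```python
import math


def surprisal_bits(p: float) -> float:
    """Information content, in bits, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -math.log2(p)


def entropy_bits(probs) -> float:
    """Average uncertainty, in bits, of a distribution over outcomes."""
    # Outcomes with zero probability contribute nothing to the sum.
    return sum(p * surprisal_bits(p) for p in probs if p > 0)


# A fair coin flip resolves exactly one bit of uncertainty.
fair_coin = entropy_bits([0.5, 0.5])

# A rarer event (probability 1/4) is more informative than a likely one.
rare_event = surprisal_bits(0.25)
```

Under this definition, less predictable words carry more bits than predictable ones, which is what allows the information in a sentence to be measured word by word.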

Our project applies Information Theory to the social nature of human language, and will show that people subconsciously structure their language with an even spread of information, in order to reliably transmit information. In other words, rather than clustering the most informative words at the beginning or end of a sentence, language users prefer to distribute information evenly across a sentence. Crucially, they do this for people they are communicating with, as part of a phenomenon known as audience design. More evenly spread information makes for a higher likelihood that a hearer will understand an utterance - even if there is interference in transmission. Evenly spread information makes it less likely that random interference would disrupt the communication and corrupt a message beyond recognition. We suggest that humans have evolved to subconsciously manipulate language in this way because language is fundamentally social: it strengthens the social relationship between a speaker and interlocutor, and must be designed for what an audience or interlocutor can most easily hear, understand, and remember. To show this, the project will focus on the distribution of information across sentences in different sociolinguistic contexts (e.g. political speeches and personal letters), in language over long historical periods (e.g. from Old English to modern English), and in the language produced by different kinds of people (e.g. speakers with autism spectrum disorders, or ASD, vs neurotypical speakers). We will use this data to develop an online tool called InfoWave, which will allow anyone to assess the spread of information in any text.

Our Information Theoretic approach to analysing text has several exciting real-world applications. First, language-related diagnostics currently play only a minor role in ASD diagnosis alongside more arduous social cognitive tests, particularly for adults. This project provides an innovative solution: the InfoWave tool assesses an information-patterning ability that can be observed in language, but which is also basic to the human social brain. Secondly, InfoWave could help anyone writing messages that need to be maximally understandable to a diverse audience, like the general public. We see a particular application for public health practitioners who need to craft easily understandable public health materials, and for the policy makers who commission them. Thirdly, our research homes in on a property of language that is specifically adapted to human communication; artificial, "bot"-generated text will exhibit non-human patterns of information density. This means that InfoWave could be valuable in the detection of "fake news" and other automatically generated text masquerading as human communication. Finally, our project will take the opportunity to engage the public in general scientific questions about the evolution of humans and the nature of information through the bridging theme of language they encounter every day.

Planned Impact

The project's combination of evolutionary, cognitive and computational approaches positions it particularly well to have far-reaching clinical, public health, and technological impacts. First, our understanding of language as a social cognitive ability is relevant to clinicians who treat Autism Spectrum Disorders (ASD). Second, the ability to easily quantify how understandable sentences are has consequences for public communication. Finally, information theoretic analysis can play a key role in detecting automatically generated text and "fake news" - issues of increasing urgency in the current political and journalistic climate. These impacts will revolve around our InfoWave tool: an open-source, freely available, web-based tool, which will allow for rapid information density analysis of any text using the algorithm at the core of the project.
Clinical: Current language-related diagnostics for ASD are vulnerable to false positives and are generally designed for early childhood diagnosis. As a result, they can only diagnose adult ASD in conjunction with additional, more arduous social cognitive tests, and are often mismatched for adult language. InfoWave will assess an information-patterning ability in language which reflects more general socio-cognitive capacities. Thus, the tool can assess expressive language as an index for social cognition, and can be applied to adult discourses. We are liaising with ASD clinical practitioners on the Methods for Diagnosis of ASC in Adults project at Newcastle University, to explore new avenues for efficient, low-stress adult ASD diagnosis using InfoWave.
Public Health: Clarity in public health communication is an urgent area of need, with more and more people accessing health information online. Practitioners will be able to use our InfoWave tool to identify suboptimal distributions of information content in texts, and, in turn, improve their clarity. We will actively partner with the infant feeding organisation Feed UK to adapt InfoWave to assist with improving the clarity and effectiveness of public health and policy messaging on issues surrounding infant feeding. A core part of Feed UK's mission is to digest complex primary literature on best practices for infant feeding, and disseminate evidence-based resources on infant feeding to the public, policy makers, and stakeholders.
Technological: While many services provide word frequency information on an ad-hoc basis (i.e., word by word), automatic analysis of information content in text is not currently widely accessible. Such data has potential impact in areas ranging from machine learning to language education, where specific estimates for the information content of language could improve natural language processing and readability. An application of current interest is the identification of "bot"-generated text, which will not have information distributions like true human language. InfoWave could be highly valuable in the detection of "fake news" and other machine-generated text masquerading as human. An information content database and API (Application Programming Interface) will form the backbone of the InfoWave tool. The database will be open to queries from users across academia, industrial natural language processing, education, and other sectors.
Finally, the CAIL team will develop two public exhibits featuring InfoWave in the broader context of information theory and the evolution of language to engage members of the public with the impacts outlined above. The first will be on display at the Edinburgh International Science Festival, Spring 2021, followed by a longer-term exhibit at The National Centre for the Written Word. They will give visitors an introduction to information theory as an important advance in human intellectual and technological history, before moving on to its insights regarding the evolution of human language, and finally, the ways in which these concepts can be applied to technology and public health.

Publications

 
Description We have provided a new measure for the uniformity of information across an utterance, the DORM, which is currently available for use on other datasets as open code published alongside our paper in Cognition. This paper applies the measure to the Penn York Computer annotated Corpus of a Large amount of English (PYCCLE), showing that in actual use the information content of real utterances is more uniform than we would expect by chance, and that optimising for uniformity allows utterances to be more resistant to noise in communication. In other words, more uniform utterances are more likely to be communicatively successful.

We also applied this measure to historical corpora of Icelandic and English. All languages have a basic word order; for example, English is generally Subject Verb Object (SVO), e.g. the tree (S) hit (V) the ground (O). Both English and Icelandic underwent a shift from SOV to SVO in the last thousand years; our analyses show that during this change, when speakers could use either SVO or SOV, they displayed a preference for the order that was more uniform given other contextual constraints in the sentence. While our first paper showed that utterances that are more uniform are more resistant to noise, this paper shows that speakers optimise for noise resistance.

Ongoing work is applying this measure to look at uniformity and linguistic planning, contrasting prepared speeches (the British Archive of Political Speech), prepared debate (Hansard Corpus), and live interview (Andrew Marr Interview Archive). In contrasting these we will create a new combined tertiary dataset, the Database of British Political Speech (DoBPS), which will track speakers both over time and across registers. DoBPS is being prepared and annotated with information theoretic measures for wider release, in preparation for submission to the International Journal of Corpus Linguistics.
Exploitation Route The code for the DORM measure is already widely available, and the Database of British Political Speech will be made open and can be used by a variety of researchers across different fields, from politics to linguistics.
Sectors Digital/Communication/Information Technologies (including Software),Education,Government, Democracy and Justice,Other

 
Title Deviation (variance) of the rolling mean of information content in a temporal sequence
Description This provides open source code for a novel dependent measure of information content in temporal sequences. The implementation is specifically applied to language, but the code is modular and can be applied to any kind of temporal sequence in which units occur probabilistically, to assess the uniformity of the distribution of information over time.
Type Of Material Physiological assessment or outcome measure 
Year Produced 2021 
Provided To Others? Yes  
Impact The measure has already been adapted in our subsequent work (Wallenberg, Bailes, Cuskley & Ingason, 2021) and we are looking to apply the measure to additional contexts, particularly in seeking further funding.
URL https://github.com/CCuskley/NoiseResistance
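
The idea behind a measure of this kind can be illustrated with a minimal sketch in Python. This is not the code from the repository above: the function names `rolling_means` and `rolling_mean_variance`, the window size of 2, and the per-word information values are all assumptions chosen for illustration, and the published implementation may differ in its details.

```python
from statistics import mean, pvariance


def rolling_means(values, window=2):
    """Mean of each run of `window` consecutive values in the sequence."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        mean(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]


def rolling_mean_variance(values, window=2):
    """Variance of the rolling mean; lower values mean a more even spread."""
    return pvariance(rolling_means(values, window))


# Hypothetical per-word information content (in bits) for two utterances
# that carry the same total information (16 bits each).
even = [4.0, 4.0, 4.0, 4.0]
front_loaded = [10.0, 4.0, 1.0, 1.0]

even_score = rolling_mean_variance(even)
front_loaded_score = rolling_mean_variance(front_loaded)
```

In this sketch the evenly spread utterance scores zero, while the front-loaded utterance scores higher because its rolling mean drops sharply over the sequence. Because the functions only take a list of numbers, the same calculation could in principle be applied to any temporal sequence with per-unit information estimates.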
 
Description Public facing website 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact We built a public facing website for our project, to explain the overall aims and give more information about the team. So far this has mainly been shared through social media as part of the launch of the project, but will act as a general hub for more complex engagement efforts going forward. For example, this will link to our InfoWave API, which we're currently developing as a writing exploration tool for members of the general public, as well as a more specialised information content API tied to particular corpora for use in research.
Year(s) Of Engagement Activity 2020,2021
URL https://cail-project.github.io/