Data Stories: Engaging citizens with data in a post-truth society

Lead Research Organisation: King's College London
Department Name: Informatics

Abstract

In the post-truth society we live in, experts must find novel ways to bring hard, factual data to citizens. Data must entertain as well as inform, and excite as well as educate. It must be built with sharing through social channels in mind and become part of our everyday activities and interactions with others. Data Stories will look at novel frameworks and technologies for bringing data to people through art, games, and storytelling. It will examine the impact that varying levels of localisation, topicalisation, participation, and shareability have on the engagement of the general public with factual evidence substantiated by different forms of digital content derived and repurposed from a variety of sources. It will deliver the tools and guidance that community and civic groups need to achieve broader participation and support for their initiatives at local and national level, and empower artists, designers, statisticians, analysts, and journalists to communicate with data in inspiring, informative ways.

Our research hypotheses are as follows:
1. People engage more with data that is made relevant to them by localisation (data related to a specific geographic or geopolitical area of interest) and topicalisation (data about a particular entity, theme, or event).
2. People engage more with data and understand it better when said data is provided through interactive and participatory methods that help build a coherent narrative.
3. Data is more likely to be shared, and therefore reach more people, if shareability is built into its presentation.

We will test these hypotheses and propose a data experience framework supported by models, algorithms, tools, and guidelines that help individuals and groups in creating bespoke, participatory content (for example, art, games, and stories) from data. The framework design will be informed by practice-led research in three main areas: (i) finding and enriching data; (ii) generating content; and (iii) sharing and engaging with content. It will draw upon methods from several disciplines: data and content management; machine learning; human data interaction; game design and gamification; crowdsourcing; online communities; social and political sciences; creative writing; and visual arts. The research will be prototypically showcased in four contexts: (i) within the Data as Culture programme at the ODI, working together with artists, designers, and open data activists; (ii) as part of the Datapolis project run by the ODI, which looks at the use of game interfaces to demystify data, with the support of game designers and local communities; (iii) in a fact-checking & journalism showcase together with the BBC, Full Fact, and the Parliament Digital Service; and (iv) via datathons and our own Data Stories challenge, run by WSI and the ODI, alongside initiatives such as Bath:Hacked and ODCamp UK, which will build community-relevant data narratives from open data enriched with other media, using creative writing techniques.

Our proposal is well aligned with the EPSRC call, addressing several themes to varying degrees. The majority of the research is focused on enabling and facilitating content creation. Specifically, we look at providing intelligent tools to make it easier for people to create data experiences. The beneficiaries are artists, storytellers (such as journalists or analysts), game makers, and those in community and civil society groups wishing to use the modes of art, games, and narration to raise broader awareness of their work. The research will include using data to create immersive experiences through art, games, and virtual reality environments that are built from structured data alongside other forms of digital content. Ultimately, these novel ways to get to know and interact with data, relevant to one's context and presented creatively and innovatively, will inform and educate the public, leading to more sustainable digital ecosystems and to a more inclusive society.

Planned Impact

Less than a month since the EU referendum, our research could not be timelier. The lack of public engagement with facts and the distrust of experts are core challenges in the UK and elsewhere as the world will face fundamental questions over the next decades. As a society, we will be dealing with significant economic, social, and environmental challenges: a lack of international investment, inequality and divisions, and a changing climate. The decisions that we make must be informed by evidence, but our appetite is for entertainment. To avoid being misled, it is essential for the public to question and understand the figures and statistics that they are presented with. This research will target the role of the creative industries in enabling better decision making, capitalising on areas of expertise in which the UK is internationally recognised: data-driven technologies and creativity, two of the fastest growing sectors of the economy.

The UK leads the world in open data; considerable effort and resources have been devoted over the last five years to publishing and promoting open data sets to create growth and stimulate innovation. Data Stories will help the UK remain at the forefront of new developments in this space by exploring an open data theme that focuses specifically on interdisciplinary contexts at the intersection between arts, design, and technology. The proposal complements and expands existing programmes such as ODI's Data as Culture and the European-funded Open Data Incubator for Europe (ODINE), which looks at the use of open data in industrial settings. In addition, the work around data search has the potential for substantial impact on the UK's national data infrastructure; this topic is still underexplored and our research outputs will contribute directly to the success of existing investments in this space.

From an end-user and societal point of view, our showcases will prioritise the needs of local and national communities in the ODI Nodes network, with a special focus on triple bottom line impact and the three P's (people, planet, profit). In terms of academic impact, Data Stories will help maintain UK excellence in data-driven technologies, in particular in a cross-disciplinary context that seeks input from arts, design, social sciences, and HCI to define more engaging, immersive data experiences, which in turn will lead to more informed citizens and better decisions in virtually all areas from the economy to the environment. The project will shape the research agenda in this emerging field, leveraging the collaborations with the national and international ODI Nodes network, as well as the outstanding position of Southampton's WSI as a pioneer of interdisciplinary research in Web and data science. Given the increasing importance of data literacy in society, Data Stories will impact the state of the art by proposing a practice-led design and scalable implementation of data discovery and search mechanisms based around localisation and topicality; and by designing frameworks, templates, and tools to produce novel ways to interact with data, which appeal to experts and non-experts alike.

From an EPSRC point of view, our main focus is on enabling and facilitating content creation, providing intelligent tools to make it easier for people to experience data in a different way and advocating the use of open data, which anyone can access, use, and share. We believe that the ability to understand and engage with data is necessary for inclusion, in particular in the democratic process. Turning data into art, stories, and games should enable more people to engage with it, use it to inform their arguments, and thus empower them. Our proposal hence responds to two of the challenge areas of the RCUK Digital Economy theme: Sustainable society, which is based on people being able to make better choices; and Communities and culture, and the responsible use of digital means.

Publications

Blount T (2020) Smoking gun in Interactions

Chapman A (2019) Dataset search: a survey in The VLDB Journal

Koesten L (2020) Dataset Reuse: Translating Principles to Practice in SSRN Electronic Journal

Koesten L (2020) Everything you always wanted to know about a dataset: Studies in data summarisation in International Journal of Human-Computer Studies

Koesten L (2021) Talking datasets - Understanding data sensemaking behaviours in International Journal of Human-Computer Studies

Related Projects

Project Reference   Relationship   Related To     Start        End          Award Value
EP/P025676/1                                      01/10/2017   31/01/2020   £704,835
EP/P025676/2        Transfer       EP/P025676/1   01/02/2020   31/01/2021   £167,019
 
Description The framework design of the Data Stories project was informed by practice-led research in three main areas:

Finding and enriching data
Generating content from data
Sharing and engaging with content

Finding and enriching data -
A number of peer-reviewed publications in the area of data discovery and collaboration with structured data have resulted from this project. These, together with the datasearch workshops described in the section "Engagement activities", demonstrate the importance and recognition of dataset search within the research community. We believe this to be a significant achievement, contributing new knowledge to this rapidly evolving topic. Given the exploratory nature of current research on data discovery, we see a large space for future research taking our findings forward. Our findings on dataset search and dataset-specific selection criteria have been used by the European Data Portal to inform the development of their data search functionality.

Based on the results of a mixed-methods study on dataset summaries for human consumption, we also proposed guidelines to help people write meaningful dataset summaries for the purpose of dataset reuse. These insights can inform the design of data discovery and exploration tools, by tailoring functionalities to user needs specifically directed at structured data. We further used our results to develop a small prototype that guides data publishers through the summary writing process.

To better understand the patterns and specific attributes that data consumers use to search for data, and how this compares with general web search, we performed a query log analysis based on logs from four national open data portals and conducted a qualitative analysis of user data requests issued to one of them. In addition, we conducted a crowdsourcing experiment in which we asked crowdworkers to create queries for datasets described in a data request. The queries they provided were aimed at finding a dataset to answer a specific user need. It appeared that portals' search functionalities are currently used in an exploratory manner, rather than to retrieve a specific resource, which reinforced our hypothesis that dataset search is different from general web search and needs tailored approaches that take advantage of the dataset structure.

After identifying that lack of context in dataset retrieval is a big factor in how users assess whether a dataset is suitable for their task, we looked into possible approaches to adding such context to the data inside the dataset. Approaches exist that assign semantic labels from knowledge bases to specific columns, disambiguating their meaning, but their primary focus until now has been on columns with textual data rather than numbers. Given that numerical columns are the most popular column type on open data platforms, we proposed an approach to add semantic meaning to numerical columns. The approach was evaluated using a benchmark generated for the purpose of this work. We showed the influence of the different levels of analysis on the success of assigning semantic labels to numerical values within tables. Further, we compared our work with state-of-the-art approaches to this problem and showed that our approach is less affected by the structure of the data and by data quality issues.

One reason to engage in dataset search is to find data that can be reused for other purposes. To understand whether a dataset can be reused, people need to make sense of it and determine its "fitness for use". We identified a gap in research aiming to understand sensemaking specifically for structured data, as opposed to information seeking more generally. To this end we conducted a qualitative mixed-methods study, looking at how researchers make sense of and reuse existing data. We were able to identify clusters of activity patterns and related data attributes important in data exploration and sensemaking. We derived concrete recommendations for how these activity patterns and data characteristics can inform tool design and documentation practices to support data-centric sensemaking behaviours.

Through a number of partner workshops, we have been interrogating the diversity implications of the structuring of data. Data is usually categorised and structured by "neurotypicals". We have reports of neurodiverse data users being frustrated by the incoherence or illogicality of categorisation; it appears that neurotypical people have a greater capacity to cope with inconsistency and illogicality. Hierarchical classifications and seemingly subjective schema design can be difficult for neuro-atypical individuals to comprehend. "Data-based decision making" is a term used in relation to evidence-based processes, but the data can be illogical to certain people. Neurodiversity may therefore bring a unique perspective to the difficulties of categorisation and the process of creating standards. These findings ask us to question who makes the rules behind database structures and presentation, and whether the designers of these rules consider a diverse user base.

Generating content from data -
In investigating the use of data games and the effects of play on recall and engagement, a simple data game was implemented (based on the work of Togelius and Friberger (2013)), aiming to help players memorise simple data sets. An experiment was carried out in which participants were shown either a variant of this "gamified" visualisation or a set of traditional bar charts. However, experimental results have shown that participants who were shown the gamified visualisation did not necessarily perform better in terms of recall than those who saw a traditional visualisation. There are a number of reasons this may be the case, such as participants focusing on using the in-game mechanics to achieve a higher score, rather than taking in the data. This leads us to conclude that simply incorporating the notion of play into a data visualisation is, alone, insufficient, and does not inherently help to better communicate the message behind the data (and in some cases may distract from it). As such, ongoing work seeks to understand the way in which the individual mechanics of games can be used to encourage exploration of, and focus on, data, and how mid-game "gating" (or tasks/quizzes) can encourage a deeper understanding. We have a paper under review describing aspects of this work.

As more of a theoretical contribution, we published a paper about sensemaking with data, based on a mixed-methods study in which we identified three distinct clusters of sensemaking activity patterns and their related data attributes. These can be used to discuss the user needs that matter when understanding and reusing data created by others, and we propose design recommendations for tools to support data sensemaking and reuse.

We further worked on the development of a tool prototype to support the work of data journalists in their creation of stories that incorporate a semi-automated generative logical structure and intelligently recommended visualisations. We included data journalists in the iterative development of the tool through feedback cycles and testing sessions in the form of a contextual inquiry.

Sharing and engaging with content -
Numeric data:
Investigations of the shareability of data, in terms of reach and engagement, have led to a public dataset of socially derived "numeric data", a unique corpus of more than 20 million occurrences of numeric data identified as appearing in social media feeds. The use of data-rich language in natural language communication has not been the subject of significant research focus, and this dataset allows studies of the references to and the reliance on data in human communication. We conducted an analysis (and refinement) of the data to model the use of data, which is incorporated into the WebData RA tool.

Chart identification:
To train and test our system for identifying chart images on social media, we used crowdsourcing to build a new corpus of 3k image tweets posted by the Twitter accounts of some major news agencies (e.g. nytgraphics, ReutersGraphics, and GuardianData). We formed the corpus because we found differences between the chart images available in benchmark corpora and those shared on social media platforms. The latter are often augmented with additional elements, such as text and images. This makes the task of identifying them more challenging, especially for systems built on idealised examples. Based on the statistics from this new corpus, we found that bar charts (incl. column charts) are the most common type of visualisation used by data journalism-oriented accounts, with 378 images showing solely a bar chart and 89 showing a bar chart accompanied by a different chart type; the second most common visualisation type is maps, with respective quantities of 382 and 14.

Furthermore, we built an architecture based on deep neural networks for predicting the virality potential of a chart image on Twitter. Our system predicts the expected virality as a function of the total number of its retweets and likes. Using this architecture, we tested the separate contribution of different signals (i.e. the chart image, its original poster, and the accompanying text) to the prediction of the expected number of likes and retweets. We evaluated our results using Spearman's rank correlation and Root Mean Square Error (RMSE) of the predicted values with respect to the actual retweet and like counts in our test set. We found that coupling the textual information with author-related cues (e.g. number of friends, followers, and likes) yields a larger performance gain for like count prediction than combining it with visual features extracted from the corresponding chart. On the other hand, combining textual features with author-related cues and combining them with chart-related cues are equally important for predicting the total number of expected retweets. In general, we found that the most accurate predictions are obtained when all three types of information (i.e. visual from the chart, textual from its accompanying text, and social from its original poster's characteristics) are taken into consideration.
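The two evaluation measures named above can be illustrated with a minimal sketch. It assumes plain Python lists of predicted and actual counts; the function names and the sample numbers are hypothetical and are not taken from the project's code or data. Spearman's rank correlation is computed here as the Pearson correlation of average ranks, which handles tied counts.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(predicted, actual):
    """Spearman's rank correlation: Pearson correlation of the two rank lists."""
    rp, ra = average_ranks(predicted), average_ranks(actual)
    n = len(rp)
    mean_p, mean_a = sum(rp) / n, sum(ra) / n
    cov = sum((x - mean_p) * (y - mean_a) for x, y in zip(rp, ra))
    spread_p = math.sqrt(sum((x - mean_p) ** 2 for x in rp))
    spread_a = math.sqrt(sum((y - mean_a) ** 2 for y in ra))
    if spread_p == 0 or spread_a == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_p * spread_a)


def rmse(predicted, actual):
    """Root Mean Square Error between predicted and actual values."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical like counts for four chart tweets (illustration only).
    predicted_likes = [12.0, 40.0, 7.0, 95.0]
    actual_likes = [10, 55, 3, 80]
    print("Spearman:", spearman(predicted_likes, actual_likes))
    print("RMSE:", rmse(predicted_likes, actual_likes))
```

The two measures are complementary: Spearman's correlation rewards getting the ordering of tweets right regardless of scale, while RMSE penalises the size of the errors in the predicted counts.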

To analyse how data-rich content is currently being shared, information is being collected from Twitter to classify the kinds of data used, the presentation mechanisms chosen, the role played by the data in the shared content and the individuals who share data-rich content.

Data experiences:
A different aspect of engaging with content was addressed in our work on "data experiences", which resulted in two game-based artworks. One was created in a participatory design process with a neurodiverse person to express a personal response to data in an artistic context. The outcome facilitates the engagement of citizens with neurodiversity through the combination of a game (a playable pinball machine) and data. One of the key findings from the design process was how categorisations inherent to data are tailored towards neurotypical experiences. The second piece offers insights into collaborative decision making with data, used to make sense of story fragments in the context of a game.
The goals were to explore how to use narrative and game mechanics to change the way the public engages with data. The project asked questions such as: Can the game experience encourage people to engage with types of data with which they might not otherwise engage? Can it encourage them to engage more thoroughly and rigorously than they would have otherwise?
Exploitation Route (To add to URLs: https://fastfamiliar.com/research/smoking-gun/ , https://www.youtube.com/watch?v=M9-TfvYw7g4)
Sectors Creative Economy,Digital/Communication/Information Technologies (including Software),Education,Culture, Heritage, Museums and Collections

URL http://datastories.co.uk/
 
Description Our work has received interest, and a number of collaborations based on its findings have emerged. We started a direct collaboration with the European Data Portal, including a webinar series and research activities. This resulted in interest in conducting more research on dataset search, informed by the related studies we published in the Data Stories project, which will start in April 2021 with additional user studies. The findings have also informed work in the project TheyBuyForYou, where we advised public administration in Human Data Interaction around decisions in procurement intelligence. Parts of the Data Stories team have won a project with Google and a project with Nesta on sensemaking of data charts, which is informed by the outcomes of our data-centric sensemaking work. Another project, a follow-up from the interactions with artists in Data Stories, is an H2020 project that started in 2020, in which we will be working with 40+ artists doing work with data and AI. We are further planning a future collaboration with BT about communicating quality dimensions of data, in the form of a consultancy project that builds on key findings in Data Stories.
First Year Of Impact 2020
Sector Digital/Communication/Information Technologies (including Software),Education,Culture, Heritage, Museums and Collections
Impact Types Cultural,Societal,Economic,Policy & public services

 
Description Nesta
Amount £29,452 (GBP)
Organisation Nesta 
Sector Charity/Non Profit
Country United Kingdom
Start 04/2021 
End 12/2021
 
Description Understanding charts 
Organisation Google
Department Google Crowdsource
Country India 
Sector Private 
PI Contribution In this project we aim to build a dataset of charts that vary according to a range of design choices and data properties commonly displayed in data visualizations. The dataset will then be annotated, via a crowdsourcing task, with ratings of whether or not the chart is perceived to be readable and trustworthy. This is a collaboration in which we (ES, LK) create a study design for a large-scale crowdsourcing experiment, led by us, in collaboration with a team from Google Crowdsource. We contribute expertise, project management, data to be used in the experiment, as well as the original idea.
Collaborator Contribution Our partners at the Crowdsource team contribute expertise, resources, and the task implementation. This includes implementing and running the experiment and advertising the task to their user base to increase participation.
Impact No outcomes yet, collaboration is ongoing.
Start Year 2020
 
Title Data Storytelling Tool 
Description A tool to support the authors of data stories, by recommending a narrative structure based on features of interest from a dataset they have uploaded. The tool is a client-side, javascript-powered, html page (i.e. while the tool will be deployed/hosted on a server, all data-processing happens on the user's machine - no sensitive data of any sort will leave the user's machine). The objective of this tool is to assist the analysis of data through authored (and semi-automated) narrative, making it useful and applicable to data journalists, procurement specialists, and any other authors of data stories. The tool will allow users to import their own data, provide an overview of the data, recommend suitable story-beats and visualisations, and export the story to a number of formats.

Workflow Overview
User uploads a (.csv) datafile (alternatively, the user can load a previously saved story)
User selects fields of interest
User selects possible/likely dependencies/correlations
Tool recommends story-beats/visualisations
User completes story-template with text/images/manually selected visualisations/etc.
User exports story to one of several formats

Completed Features
1. Data upload/overview: Data can be uploaded from CSV (comma separated value) files; an overview of the data (including data type, selection of values, and min/max values and value distribution, if applicable) is shown to the user, allowing them to browse the data at a high level, and select values of interest
2. Visualisation generation: Bar chart, scatterplot, and line chart (time series) visualisations are currently generated based on user-submitted data, using the d3.js library
3. Simple visualisation recommendation: Visualisations are currently recommended based on user submitted context of dependencies/correlations
4. Narrative authoring: users are currently able to author their narrative (based on simple recommendations based on their submitted correlations); users can supply text, images, and generate additional charts to construct the narrative
5. Export to html/json: Data stories can be exported in a number of formats including html (tailored to stand-alone pages, embeddable content, or (in conjunction with additional js libraries) slide-based content) and json
6. Story saving/loading: the tool supports saving/loading of data stories to user-controlled files (as no data is passed to a server)

Future Features
1. Narrative template system: a rule-based system to enhance the authoring experience, by guiding the user step-by-step through the narrative process
2. Advanced visualisation recommendation: enhanced recommendation, that may include elements such as (for example) trend-detection, correlation-detection, and/or anomaly detection
3. Visualisation annotation: allow users to add additional annotations on top of the generated visualisations to highlight any elements that, given their contextual knowledge, would be valuable to their audience

Codebase: https://github.com/data-stories/storytelling
Demo: https://TBFY.github.io/storytelling
Type Of Technology Webtool/Application 
Year Produced 2020 
Open Source License? Yes  
Impact The tool has been evaluated with support from industry and project partners using a contextual inquiry approach, and there has been interest in further specialisation of the tool for relevant domains. It has also been presented at the Data Stories Symposium 2020.
URL https://github.com/data-stories/storytelling
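
The data overview and simple visualisation recommendation steps described for the Data Storytelling Tool can be sketched roughly as follows. This is an illustration only: the tool itself is a client-side JavaScript page, whereas this sketch uses Python, and the type-inference rules, the chart-selection rules, the function names, and the sample data are all assumptions made for the example rather than the tool's actual logic.

```python
import csv
import io
from datetime import date


def infer_type(values):
    """Classify a column as 'number', 'date' (ISO YYYY-MM-DD), or 'text'."""
    def all_parse(parser):
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if not values:
        return "text"
    if all_parse(float):
        return "number"
    if all_parse(date.fromisoformat):
        return "date"
    return "text"


def overview(csv_text):
    """Summarise each column: inferred type, a few sample values, min/max."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = {}
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        info = {"type": infer_type(values), "sample": values[:3]}
        if info["type"] == "number":
            numbers = [float(v) for v in values]
            info["min"], info["max"] = min(numbers), max(numbers)
        summary[name] = info
    return summary


def recommend_chart(summary, x_field, y_field):
    """Assumed rule of thumb for a user-selected dependency (x -> y)."""
    x_type = summary[x_field]["type"]
    y_type = summary[y_field]["type"]
    if y_type != "number":
        return None  # nothing to plot on the value axis
    if x_type == "date":
        return "line chart"  # time series
    if x_type == "number":
        return "scatterplot"  # two numeric fields
    return "bar chart"  # category against a number


# Hypothetical sample data (not from the project).
SAMPLE = """region,date,spend,population
North,2020-01-01,120.5,1000
South,2020-02-01,98.0,1500
East,2020-03-01,143.2,1200
"""

if __name__ == "__main__":
    columns = overview(SAMPLE)
    for pair in [("date", "spend"), ("population", "spend"), ("region", "spend")]:
        print(pair, "->", recommend_chart(columns, *pair))
```

The three chart types returned match those the description lists as currently generated (bar chart, scatterplot, line chart); how the real tool maps user-submitted dependencies to them is not stated in the record.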
 
Description Data Stories Symposium 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact The Data Stories Symposium 2020 brought together experts from academia, industry, and the third sector to discuss, generate ideas, and inspire future interdisciplinary collaborations exploring Human Data Interaction in relation to storytelling with data. The event took place online, due to COVID-19, over two half days. Around 100 participants were present throughout the event, out of over 400 sign-ups; more than 50% of sign-ups came from academia, and the rest self-reported as coming from a mix of industry, public, and third sector. The event sparked many interesting discussions and resulted in collaboration opportunities as well as explicit interest in repeating the event.
Year(s) Of Engagement Activity 2020
URL http://datastories.co.uk/symposium/
 
Description Invited Talk at the Annual Open Data Conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Policymakers/politicians
Results and Impact Presented research on insights in Human Data Interaction to civil servants working on Open Data in Ireland
Year(s) Of Engagement Activity 2020
URL https://data.gov.ie/blog/annual-conference-2020
 
Description Invited Talk: University of Bristol 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Invited talk at the University of Bristol as part of the Data Visualization Seminar, Department of Informatics
Year(s) Of Engagement Activity 2021
URL https://dataviz.blogs.bristol.ac.uk/2020/11/16/upcoming-january-2021-talk-elena-simperl/