Data Stories: Engaging citizens with data in a post-truth society

Lead Research Organisation: University of Southampton
Department Name: Sch of Electronics and Computer Sci

Abstract

In the post-truth society we live in, experts must find novel ways to bring hard, factual data to citizens. Data must entertain as well as inform, and excite as well as educate. It must be built with sharing through social channels in mind and become part of our everyday activities and interactions with others. Data Stories will look at novel frameworks and technologies for bringing data to people through art, games, and storytelling. It will examine the impact that varying levels of localisation, topicalisation, participation, and shareability have on the engagement of the general public with factual evidence substantiated by different forms of digital content derived and repurposed from a variety of sources. It will deliver the tools and guidance that community and civic groups need to achieve broader participation and support for their initiatives at local and national level, and empower artists, designers, statisticians, analysts, and journalists to communicate with data in inspiring, informative ways.

Our research hypotheses are as follows:
1. People engage more with data that is made relevant to them by localisation (data related to a specific geographic or geopolitical area of interest) and topicalisation (data about a particular entity, theme, or event).
2. People engage more with data and understand it better when said data is provided through interactive and participatory methods that help build a coherent narrative.
3. Data is more likely to be shared, and therefore reach more people, if shareability is built into its presentation.

We will test these hypotheses and propose a data experience framework supported by models, algorithms, tools, and guidelines that help individuals and groups in creating bespoke, participatory content (for example, art, games, and stories) from data. The framework design will be informed by practice-led research in three main areas: (i) finding and enriching data; (ii) generating content; and (iii) sharing and engaging with content. It will draw upon methods from several disciplines: data and content management; machine learning; human data interaction; game design and gamification; crowdsourcing; online communities; social and political sciences; creative writing; and visual arts. The research will be prototypically showcased in four contexts: (i) within the Data as Culture programme at the ODI, working together with artists, designers, and open data activists; (ii) as part of the Datapolis project run by the ODI, which looks at the use of game interfaces to demystify data, with the support of game designers and local communities; (iii) in a fact-checking & journalism showcase together with the BBC, Full Fact, and the Parliament Digital Service; and (iv) via datathons and our own Data Stories challenge, run by WSI and the ODI, alongside initiatives such as Bath:Hacked and ODCamp UK, which will build community-relevant data narratives from open data enriched with other media, using creative writing techniques.

Our proposal is well aligned with the EPSRC call, addressing several themes to varying degrees. The majority of the research is focused on enabling and facilitating content creation. Specifically, we look at providing intelligent tools to make it easier for people to create data experiences. The beneficiaries are artists, storytellers (such as journalists or analysts), game makers, and those in community and civil society groups wishing to use the modes of art, games, and narration to raise broader awareness of their work. The research will include using data to create immersive experiences through art, games, and virtual reality environments that are built from structured data alongside other forms of digital content. Ultimately, these novel ways to get to know and interact with data, relevant to one's context and presented creatively and innovatively, will inform and educate the public, leading to more sustainable digital ecosystems and to a more inclusive society.

Planned Impact

Less than a month since the EU referendum, our research could not be timelier. The lack of public engagement with facts and the distrust of experts are core challenges in the UK and elsewhere as the world will face fundamental questions over the next decades. As a society, we will be dealing with significant economic, social, and environmental challenges: a lack of international investment, inequality and divisions, and a changing climate. The decisions that we make must be informed by evidence, but our appetite is for entertainment. To avoid being misled, it is essential for the public to question and understand the figures and statistics that they are presented with. This research will target the role of the creative industries in enabling better decision making, capitalising on areas of expertise in which the UK is internationally recognised: data-driven technologies and creativity, two of the fastest growing sectors of the economy.

The UK leads the world in open data; considerable effort and resources have been devoted over the last five years in publishing and promoting open data sets to create growth and stimulate innovation. Data Stories will help the UK remain at the forefront of new developments in this space by exploring an open data theme that focuses specifically on interdisciplinary contexts at the intersection between arts, design, and technology. The proposal complements and expands existing programmes such as ODI's Data as Culture and the European funded Open Data Incubator for Europe (ODINE), which looks at the use of open data in industrial settings. In addition, the work around data search has the potential for substantial impact on the UK's national data infrastructure; this topic is still underexplored and our research outputs will contribute directly to the success of existing investments in this space.

From an end-user and societal point of view, our showcases will prioritise the needs of local and national communities in the ODI Nodes network, with a special focus on triple bottom line impact and the three P's (people, planet, profit). In terms of academic impact, Data Stories will help maintain UK excellence in data-driven technologies, in particular in a cross-disciplinary context that seeks input from arts, design, social sciences, and HCI to define more engaging, immersive data experiences, which in turn will lead to more informed citizens and better decisions in virtually all areas from the economy to the environment. The project will shape the research agenda in this emerging field, leveraging the collaborations with the national and international ODI Nodes network, as well as the outstanding position of Southampton's WSI as a pioneer of interdisciplinary research in Web and data science. Given the increasing importance of data literacy in society, Data Stories will impact the state of the art by proposing a practice-led design and scalable implementation of data discovery and search mechanisms based around localisation and topicality; and by designing frameworks, templates, and tools to produce novel ways to interact with data, which appeal to experts and non-experts alike.

From an EPSRC point of view, our main focus is on enabling and facilitating content creation, providing intelligent tools to make it easier for people to experience data in a different way and advocating the use of open data, which anyone can access, use, and share. We believe that the ability to understand and engage with data is necessary for inclusion, in particular in the democratic process. Turning it into art, stories, and games should enable more people to engage with it, use it to inform their arguments, and thus empower them. Our proposal hence responds to two of the challenge areas of the RCUK Digital economy theme: Sustainable society, which is based on people being able to make better choices; and Communities and culture, and the responsible use of digital means.

Publications

 
Title Bar Chart Ball 
Description A "serious game" which takes a data-driven approach. Players must traverse a bar chart of real-world data to guide a ball to the end of the level. The aim of the game is to encourage people to engage with simple forms of data. 
Type Of Art Artefact (including digital) 
Year Produced 2018 
Impact This game has been used to study the effectiveness of novel modes of visualisation when compared to traditional bar charts and the impact of different modes of play on data recall. The game is being used in engagement activities such as the UoS Science Festival (March 2019). 
 
Description The framework design will be informed by practice-led research in three main areas:

Finding and enriching data
Generating content from data
Sharing and engaging with content

Finding and enriching data - A number of peer-reviewed publications in the area of data discovery and collaboration with structured data have resulted from this project so far. These, together with the data search workshops described in the Section "Engagement activities", demonstrate the importance and recognition of dataset search within the research community. We believe this to be a significant achievement, contributing new knowledge to this rapidly evolving topic. Given the exploratory nature of current research on data discovery, we see a large space for future research taking our findings forward.

Based on the results of a mixed methods study on dataset summaries for human consumption, we proposed guidelines to help people write meaningful dataset summaries for the purpose of dataset reuse. These insights can inform the design of data discovery and exploration tools by tailoring functionalities to user needs specifically directed at structured data. We further used our results to develop a small prototype that guides data publishers through the summary writing process.

In order to better understand the patterns and specific attributes that data consumers use to search for data, and how this compares with general web search, we performed a query log analysis based on logs from four national open data portals and conducted a qualitative analysis of the user data requests issued to one of them. In addition, we conducted a crowdsourcing experiment in which we asked crowdworkers to create queries for a dataset described in a data request. The queries they provided were aimed at finding a dataset to answer a specific user need. It appeared that the portals' search functionalities are currently used in an exploratory manner, rather than to retrieve a specific resource, which reinforced our hypothesis that dataset search is different to general web search and needs tailored approaches that take advantage of the dataset structure.

After identifying that a lack of context in dataset retrieval is a big factor in how users assess whether a dataset is suitable for their task, we looked into possible approaches to adding such context to the data inside the dataset. Approaches exist that assign semantic labels from knowledge bases to specific columns to disambiguate their meaning, but their primary focus until now has been on columns with textual data rather than numbers. Given that numerical columns are the most popular column type on open data platforms, we proposed an approach to add semantic meaning to numerical columns. The approach was evaluated using a benchmark generated for the purpose of this work. We showed the influence of the different levels of analysis on the success of assigning semantic labels to numerical values within tables. Further, we compared our work with state-of-the-art approaches to this problem and showed that our approach is less affected by the structure of the data and by data quality issues.
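As a rough illustration of what assigning a semantic label to a numerical column can look like, the sketch below picks the candidate label whose reference values are closest to the column in log space. It is a toy nearest-median matcher, not the approach proposed in this project; the label names, the reference values, and the function names are all hypothetical, and it assumes strictly positive numbers.

```python
import math
from statistics import median

# Hypothetical reference values per candidate label (illustrative only;
# a real system would draw these from a knowledge base).
REFERENCE_VALUES = {
    "height_cm": [150, 160, 170, 180, 190],
    "year": [1950, 1980, 2000, 2015],
    "population": [10_000, 500_000, 2_000_000, 8_000_000],
}


def log_median(values):
    """Median of the log10 values; assumes every value is strictly positive."""
    return median(math.log10(v) for v in values)


def label_numeric_column(column, references=REFERENCE_VALUES):
    """Return the label whose reference values have the closest log-median."""
    column_centre = log_median(column)
    return min(
        references,
        key=lambda label: abs(log_median(references[label]) - column_centre),
    )


# An unlabelled column of numbers, as might appear in an open data table.
unlabelled_column = [165, 172, 181, 158]
best_label = label_numeric_column(unlabelled_column)
```

A single summary statistic is easily fooled by columns with overlapping ranges, which is one reason why context from the rest of the table matters for this task.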

Through a number of partner workshops, we have been interrogating the diversity implications of the structuring of data. Data is usually categorised and structured by "neurotypicals". We have reports of neurodiverse data users being frustrated by the incoherence and illogicality of categorisation; it appears that neurotypical people have a greater capacity to cope with inconsistency and illogicality. Hierarchical classifications and seemingly subjective schema design can be difficult for neuro-atypical individuals to comprehend. "Data based decision making" is a term used in relation to evidence-based processes, but the data can be illogical to certain people. Neurodiverse users may bring a unique perspective to the difficulties of categorisation and the process of creating standards. These findings ask us to question who makes the rules behind database structures and presentation, and whether the designers of these rules consider a diverse user base.

Generating content from data - In investigating the use of data games and the effects of play on recall and engagement, a simple data game was implemented (based on the work of Togelius and Friberger (2013)) aiming to help players memorise simple data sets. An experiment was carried out in which participants were shown either a variant of this "gamified" visualisation or a set of traditional bar charts. However, experimental results have shown that participants who were shown the gamified visualisation did not necessarily perform better in terms of recall than those who saw a traditional visualisation. There are a number of reasons this may be the case, such as participants focusing on using the in-game mechanics to achieve a higher score rather than taking in the data. This leads us to conclude that simply incorporating the notion of play into a data visualisation is, alone, insufficient, and does not inherently help to communicate the message behind the data (and in some cases may distract from it). As such, ongoing work seeks to understand the way in which the individual mechanics of games can be used to encourage exploration of, and focus on, data, and how mid-game "gating" (or tasks/quizzes) can encourage a deeper understanding.

Work on the development of a tool to support the work of data journalists in their creation of stories that incorporate a semi-automated generative logical structure and intelligently recommended visualisations is on-going, with meetings with data journalists planned, and a prototype being developed.

Sharing and engaging with content - To train and test our system for identifying chart images on social media, we used crowdsourcing to build a new corpus of 3k image tweets posted by the Twitter accounts of some major news agencies (e.g. nytgraphics, ReutersGraphics, and GuardianData). The corpus was formed because we found that there are differences between the chart images that are made available in benchmark corpora and those that are shared on social media platforms. The latter are often augmented with additional elements, such as text and images. This makes the task of identifying them more challenging, especially for systems that have been built based on idealised examples.

Based on the statistics from this new corpus, we found that bar charts (incl. column charts) are the most common type of visualisation used by data journalism-oriented accounts, with 378 images showing solely a bar chart and 89 showing a bar chart accompanied by a different chart type; the second most common visualisation type is maps, with respective counts of 382 and 14.

Furthermore, we built an architecture based on deep neural networks for predicting the virality potential of a chart image on Twitter. Our system predicts the expected virality as a function of the total number of its retweets and likes. Using this architecture, we tested the separate contribution of different signals (i.e. the chart image, its original poster, and the accompanying text) to the prediction of the expected number of likes and retweets. We evaluated our results using Spearman's rank correlation and Root Mean Square Error (RMSE) of the predicted values with respect to the actual retweet and like counts in our test set. We found that coupling the textual information with author-related cues (e.g. number of friends, followers, and likes) results in a greater performance gain for like count prediction than combining it with visual features extracted from the corresponding chart. On the other hand, combining textual features with either author- or chart-related cues is equally effective for predicting the total number of expected retweets. In general, we found that the most accurate predictions are computed when all three types of information (i.e. visual from the chart, textual from its accompanying text, and social from its original poster's characteristics) are taken into consideration.
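The two evaluation measures named above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the project's evaluation code; the like counts are invented for the example and the function names are assumptions.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(predicted, actual):
    """Root mean square error between predictions and observed values."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Invented like counts for five tweets (illustrative data only).
actual_likes = [12, 340, 58, 7, 1500]
predicted_likes = [20, 300, 40, 10, 900]

rho = spearman(predicted_likes, actual_likes)
error = rmse(predicted_likes, actual_likes)
```

In this toy data the predictions order the tweets exactly as the observed counts do, so the rank correlation is perfect even though the error on the most popular tweet is large; the two measures capture different aspects of prediction quality.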

To analyse how data-rich content is currently being shared, information is being collected from Twitter to classify the kinds of data used, the presentation mechanisms chosen, the role played by the data in the shared content, and the individuals who share data-rich content.
Exploitation Route Finding & enriching data - Understanding how people search for data can inform the building of systems or functionalities for data discovery that take user needs into account and support access to structured data.

Generating content from data - Our work could inform the optimal features for how data can be used to generate (or be embedded within) popular consumable media; for example, how different genres of game (puzzle, shooter, role-play) or different aesthetics of play (fantasy, cooperation, competition, abnegation) affect retention/comprehension of data and the enjoyment of the media.

Sharing & engaging with content - We see our work as a first step towards identifying misinformative data visualisations before they become viral. A natural extension of our work would be the implementation of a system capable of cross-checking facts presented in a data visualisation against information in knowledge bases and open data sources. We believe that our findings, along with the technologies involved, can be useful to businesses, including but not limited to social media platforms, that seek to protect their users from the dissemination of fake news.
Sectors Creative Economy; Digital/Communication/Information Technologies (including Software); Education; Culture, Heritage, Museums and Collections

URL http://datastories.co.uk/
 
Description EC Horizon 2020 TheyBuyForYou
Amount € 2,925,693 (EUR)
Funding ID 780247 
Organisation European Commission H2020 
Sector Public
Country Belgium
Start 01/2018 
End 12/2020
 
Description Birmingham Open Media co-created artwork: Tribes, Treasure Hunts & Truth Seekers 
Organisation Birmingham Open Media
PI Contribution Put together a brief for the artists to develop a participatory artwork with members of the neurodiverse community in and around Birmingham.
Collaborator Contribution Managed by Birmingham Open Media, two artist fellows designed a series of co-creation workshops with neurodiverse artists and residents of Birmingham with the intention of developing one or more 'data experiences'.
Impact Research activities are still in progress.
Start Year 2018
 
Title Numer 
Description Numer: a two-level (row- and column-based) approach to add semantic meaning to numerical values (columns) within tables. Additionally, we generated a DBpedia-based benchmark, NumDB, with which we evaluate Numer. 
Type Of Technology New/Improved Technique/Technology 
Year Produced 2018 
Impact With the increasing amount of structured data on the web the need to understand and support search over this emerging data space is growing. Adding semantics to structured data can help address existing challenges in data discovery, as it facilitates understanding the values in their context. While there are approaches on how to lift structured data to semantic web formats to enrich it and facilitate discovery, most work to date focuses on textual fields rather than numerical data. 
URL https://github.com/chabrowa/semantification
 
Title Web Data RA 
Description In our work on evaluating the current use of social media channels for sharing "data rich content", we implemented a data gathering tool for social media platforms such as Twitter and Facebook. It allows tweets / status updates to be gathered from common web platforms, including historic data gathering. WebDataRA is open source and available on the Chrome Browser Extension store (bit.ly/WebDataRA). The software is usable by non-programmer researchers, and converts social media platform data into spreadsheets for easy subsequent analysis. It has been used to investigate different communities' use of Twitter for communicating data rich stories, and has enabled a variety of MSc projects in conjunction with Data Stories. The software has been disseminated through the "Follow the Data" workshop at "Social Media and Society" 2018 in Copenhagen, a tutorial run by Professor Carr and attended by 50 researchers and PhD students. 
Type Of Technology Webtool/Application 
Year Produced 2018 
Open Source License? Yes  
Impact Not at this stage 
URL http://bit.ly/WebDataRA
 
Description Can Google Make us Smarter 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Keynote talk at ELearning Symposium at UoS
Year(s) Of Engagement Activity 2019
 
Description DATA:SEARCH'18: International Workshop on Searching Data on the Web 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This workshop explores challenges in data search, with a particular focus on data on the web. We want to stimulate an interdisciplinary discussion around how to improve the description, discovery, ranking and presentation of structured and semi-structured data, across data formats and domain applications. We welcome contributions describing algorithms and systems, as well as frameworks and studies exploring human data interaction. We see a large space for discussion and future research in the development of federated data discovery and search technologies, which leverages recent advances in information retrieval, Semantic Web and databases, and is mindful of human factors. The workshop aims to bring together communities interested in making the web of data more discoverable, easier to search and more user friendly.
Year(s) Of Engagement Activity 2018
URL http://sigir.org/sigir2018/
 
Description Data Stories: Engaging with data in a post-truth world 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact A seminar at UoS, resulting in student discussions.

ABSTRACT: One of the interpretations of the EU referendum result and the rise of Donald Trump in the US is that we are now living in a post-truth society - a world in which anecdotes shared on social media and invented numbers thrown on the sides of buses are more trusted and influential than official statistics, extensive research, and proven expertise. In this world, scientists, statisticians, analysts, and journalists must find new ways to bring hard, factual data to citizens. Data must entertain as well as inform, excite as well as educate. It must be built with social media sharing in mind, and become part of our everyday activities and digital interactions with others.
Data Stories looks at frameworks and technology to bring data closer to people through art, games, and storytelling. It examines the impact that varying levels of localisation, topicalisation, participation, and shareability have on the engagement of the public with factual evidence. It delivers tools and guidance for communities and civic groups to achieve wider participation and support for their initiatives, and empowers artists, designers, statisticians, analysts, and journalists to communicate through data in inspiring, informative ways.
Year(s) Of Engagement Activity 2018
 
Description Data Stories: Tribes, Treasure Hunts & Truth Seekers 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Panel discussion "how foraging for meaningful data can help us to understand who we are and reinvent the world we live in". Part of the Open Data Summit 2018 at the Guardian HQ in London, attended by 500 people.
Year(s) Of Engagement Activity 2018
URL http://theodi.org/odi-summit-2018-date-value-speakers/
 
Description Data and Storytelling: Data Stories Launch Workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Industry/Business
Results and Impact A gathering of industrial and academic professionals working in spaces related to the Data Stories project (including those partnered directly with the project); a series of talks and discussion panels in which participants discussed on-going research projects in the field, cultural and artwork projects, and industrial applications of these sorts of technology.
Year(s) Of Engagement Activity 2018
URL http://datastories.co.uk/events/data-and-storytelling-workshop/
 
Description Loops of humans and bots in Wikidata 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A talk given in connection with Data Stories at HumL@www2018

Abstract:
Human-in-the-loop is a model of interaction where a machine process and one or more humans have an iterative interaction. In this paradigm the user has the ability to heavily influence the outcome of the process by providing feedback to the system as well as the opportunity to grab different perspectives about the underlying domain and understand the step by step machine process leading to a certain outcome. Amongst the current major concerns in Artificial Intelligence research are being able to explain and understand the results as well as avoiding bias in the underlying data that might lead to unfair or unethical conclusions. Typically, computers are fast and accurate in processing vast amounts of data. People, however, are creative and bring in their perspectives and interpretation power. Bringing humans and machines together creates a natural symbiosis for accurate interpretation of data at scale. The goal of this workshop is to bring together researchers and practitioners in various areas of AI (i.e., Machine Learning, NLP, Computational Advertising, etc.) to explore new pathways of the human-in-the-loop paradigm.
Year(s) Of Engagement Activity 2018
URL https://humlworkshop.github.io/HumL-WWW2018/
 
Description PROFILES & Data:Search - International Workshop on Profiling and Searching Data on the Web 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Workshop at the Web Conference 2018:
The Web of Data has seen tremendous growth recently. In addition, new forms of structured data have emerged in the form of Web markup, such as schema.org, and entity-centric data in Web tables. Considering these rich, heterogeneous and evolving data sources which cover a wide variety of domains, exploitation of Web Data becomes increasingly important in the context of various applications, including federated search, entity linking, question answering, and fact verification. These applications require reliable information on dataset characteristics, including general metadata, quality features, statistical information, dynamics, licensing and provenance. Lack of a thorough understanding of the nature, scope and characteristics of data from particular sources limits their take-up and reuse, such that applications are often limited and focused on well-known reference datasets. The PROFILES workshop series aims at gathering approaches to analyse, describe and discover data sources - including but not limited to semantic search and SPARQL endpoints - as a facilitator for applications and tasks such as query distribution, entity retrieval and recommendation. PROFILES offers a highly interactive forum for researchers and practitioners, bringing together experts in the fields of Web, Semantic Web, Web Data, Semantic Search, Databases, NLP, IR and application domains.
Year(s) Of Engagement Activity 2018
URL http://www2018.thewebconf.org
 
Description The Data We Want: Framework and Tools to Engage with Data 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact An invited talk at Office for National Statistics Data Science Campus

Abstract:
One of the interpretations of the EU referendum result and the rise of Donald Trump in the US is that we are now living in a post-truth society - a world in which anecdotes shared on social media and invented numbers thrown on the sides of buses are more trusted and influential than official statistics, extensive research, and proven expertise. In this world, scientists, statisticians, analysts, and journalists are continuously looking for new ways to bring hard, factual data to citizens. Data must entertain as well as inform, excite as well as educate. It must be easy to find, built with social media sharing in mind, and become part of our everyday activities and digital interactions with others.
In this talk, we will introduce the Data Stories framework and toolkit, which aim to bring data closer to people through novel interfaces and experiences. We will present studies that try to understand how people search and make sense of data, as it is currently made available on the web or on data portals. We will also explore emerging technologies, including intelligent assistants, decentralised ledgers, and personal data economies and their potential role in enriching human data interactions.
Year(s) Of Engagement Activity 2018
 
Description The data we want: Interfaces, methods and experiences to engage with data 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Industry/Business
Results and Impact An invited talk to Amazon (Cambridge).

One of the interpretations of the EU referendum result and the rise of Donald Trump in the US is that we are now living in a post-truth society - a world in which anecdotes shared on social media and invented numbers thrown on the sides of buses are more trusted and influential than official statistics, extensive research, and proven expertise. In this world, scientists, statisticians, analysts, and journalists are continuously looking for new ways to bring hard, factual data to citizens. Data must entertain as well as inform, excite as well as educate. It must be easy to find, built with sharing and reuse in mind, and become part of our everyday activities and digital interactions with others. In this talk, we will introduce Data Stories, a toolkit of interfaces, techniques, and experiences to bring data closer to people. We will present studies which explore how people look for and engage with different kinds of data, from CSV files published on open government portals to knowledge graphs created by large online communities and visual representations of data shared through social media.
Year(s) Of Engagement Activity 2019
 
Description Tribes, Treasure Hunts and Truth Seekers 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Study participants or study members
Results and Impact This was a workshop at Birmingham Open Media in which (self-identified) neurodiverse participants from the local area were invited to assist two artists in creating a piece of artwork exploring the relationship between neurodiversity and data.
Year(s) Of Engagement Activity 2018
URL https://www.bom.org.uk/engagement/bom-autism/tribes-treasure-hunts-truth-seekers/