Data Stories: Engaging citizens with data in a post-truth society

Lead Research Organisation: University of Southampton
Department Name: Sch of Electronics and Computer Sci

Abstract

In the post-truth society we live in, experts must find novel ways to bring hard, factual data to citizens. Data must entertain as well as inform, and excite as well as educate. It must be built with sharing through social channels in mind and become part of our everyday activities and interactions with others. Data Stories will look at novel frameworks and technologies for bringing data to people through art, games, and storytelling. It will examine the impact that varying levels of localisation, topicalisation, participation, and shareability have on the engagement of the general public with factual evidence substantiated by different forms of digital content derived and repurposed from a variety of sources. It will deliver the tools and guidance that community and civic groups need to achieve broader participation and support for their initiatives at local and national level, and empower artists, designers, statisticians, analysts, and journalists to communicate with data in inspiring, informative ways.

Our research hypotheses are as follows:
1. People engage more with data that is made relevant to them by localisation (data related to a specific geographic or geopolitical area of interest) and topicalisation (data about a particular entity, theme, or event).
2. People engage more with data and understand it better when said data is provided through interactive and participatory methods that help build a coherent narrative.
3. Data is more likely to be shared, and therefore reach more people, if shareability is built into its presentation.

We will test these hypotheses and propose a data experience framework supported by models, algorithms, tools, and guidelines that help individuals and groups create bespoke, participatory content from data (for example, art, games, and stories). The framework design will be informed by practice-led research in three main areas: (i) finding and enriching data; (ii) generating content; and (iii) sharing and engaging with content. It will draw upon methods from several disciplines: data and content management; machine learning; human data interaction; game design and gamification; crowdsourcing; online communities; social and political sciences; creative writing; and visual arts. The research will be showcased through prototypes in four contexts: (i) within the Data as Culture programme at the ODI, working together with artists, designers, and open data activists; (ii) as part of the Datapolis project run by the ODI, which looks at the use of game interfaces to demystify data, with the support of game designers and local communities; (iii) in a fact-checking & journalism showcase together with the BBC, Full Fact, and the Parliament Digital Service; and (iv) via datathons and our own Data Stories challenge, run by WSI and the ODI, alongside initiatives such as Bath:Hacked and ODCamp UK, which will build community-relevant data narratives from open data enriched with other media, using creative writing techniques.

Our proposal is well aligned with the EPSRC call, addressing several themes to varying degrees. The majority of the research is focused on enabling and facilitating content creation. Specifically, we look at providing intelligent tools to make it easier for people to create data experiences. The beneficiaries are artists, storytellers (such as journalists or analysts), game makers, and those in community and civil society groups wishing to use the modes of art, games, and narration to raise broader awareness of their work. The research will include using data to create immersive experiences through art, games, and virtual reality environments that are built from structured data alongside other forms of digital content. Ultimately, these novel ways to get to know and interact with data, relevant to one's context and presented creatively and innovatively, will inform and educate the public, leading to more sustainable digital ecosystems and to a more inclusive society.

Planned Impact

Less than a month since the EU referendum, our research could not be timelier. The lack of public engagement with facts and the distrust of experts are core challenges in the UK and elsewhere as the world will face fundamental questions over the next decades. As a society, we will be dealing with significant economic, social, and environmental challenges: a lack of international investment, inequality and divisions, and a changing climate. The decisions that we make must be informed by evidence, but our appetite is for entertainment. To avoid being misled, it is essential for the public to question and understand the figures and statistics that they are presented with. This research will target the role of the creative industries in enabling better decision making, capitalising on areas of expertise in which the UK is internationally recognised: data-driven technologies and creativity, two of the fastest growing sectors of the economy.

The UK leads the world in open data; considerable effort and resources have been devoted over the last five years to publishing and promoting open data sets to create growth and stimulate innovation. Data Stories will help the UK remain at the forefront of new developments in this space by exploring an open data theme that focuses specifically on interdisciplinary contexts at the intersection between arts, design, and technology. The proposal complements and expands existing programmes such as ODI's Data as Culture and the European-funded Open Data Incubator for Europe (ODINE), which looks at the use of open data in industrial settings. In addition, the work around data search has the potential for substantial impact on the UK's national data infrastructure; this topic is still underexplored and our research outputs will contribute directly to the success of existing investments in this space.

From an end-user and societal point of view, our showcases will prioritise the needs of local and national communities in the ODI Nodes network, with a special focus on triple bottom line impact and the three P's (people, planet, profit). In terms of academic impact, Data Stories will help maintain UK excellence in data-driven technologies, in particular in a cross-disciplinary context that seeks input from arts, design, social sciences, and HCI to define more engaging, immersive data experiences, which in turn will lead to more informed citizens and better decisions in virtually all areas from the economy to the environment. The project will shape the research agenda in this emerging field, leveraging the collaborations with the national and international ODI Nodes network, as well as the outstanding position of Southampton's WSI as a pioneer of interdisciplinary research in Web and data science. Given the increasing importance of data literacy in society, Data Stories will advance the state of the art by proposing a practice-led design and scalable implementation of data discovery and search mechanisms based around localisation and topicality, and by designing frameworks, templates, and tools to produce novel ways to interact with data that appeal to experts and non-experts alike.

From an EPSRC point of view, our main focus is on enabling and facilitating content creation, providing intelligent tools to make it easier for people to experience data in a different way and advocating the use of open data, which anyone can access, use, and share. We believe that the ability to understand and engage with data is necessary for inclusion, in particular in the democratic process. Turning it into art, stories, and games should enable more people to engage with it, use it to inform their arguments, and thus empower them. Our proposal hence responds to two of the challenge areas of the RCUK Digital economy theme: Sustainable society, which is based on people being able to make better choices; and Communities and culture, and the responsible use of digital means.
 
Title Bar Chart Ball 
Description A "serious game" which takes a data driven approach. Players must traverse a bar-chart of real-world data to guide a ball to the end of the level. The aim of the game is to encourage people to engage with simple forms of data. 
Type Of Art Artefact (including digital) 
Year Produced 2018 
Impact This game has been used to study the effectiveness of novel modes of visualisation when compared to traditional bar charts and the impact of different modes of play on data recall. The game is being used in engagement activities such as the UoS Science Festival (March 2019). 
 
Title Mood Pinball 
Description "Mood Pinball" is an artwork by two artist fellows who were commissioned to develop a participatory artwork with members of the neurodiverse community. The work is a full-size digital pinball machine with custom software and bespoke graphics on a wooden frame. The aim was to communicate the neurodiverse community's perspectives by creating an experience which puts the player in a young autistic woman's shoes, using data embedded in a playable pinball game. It was created following a series of consultation workshops for autistic and neurodiverse adults at BOM (Birmingham Open Media) in 2018. The artwork playfully re-imagines how city-wide data might be used by an individual to find their comfort zones and improve their experience of the city. The goal of Mood Pinball is to keep the 'Mood-o-Meter' happy by responding to noise level data revealed by gameplay.
Type Of Art Artwork 
Year Produced 2019 
Impact Mood Pinball was shown at three exhibitions and prompted discussion amongst the collaborators and with the participating members of the neurodiverse community. The participatory process of the artwork's creation raised awareness of neurodiversity, which is reflected in the artwork. The project contributed to building relationships between the project partners, was received very positively at exhibitions, and sparked numerous conversations with those experiencing it. The artist Edie Murray gained experience in participating and collaborating in the project, which has contributed to her securing an Arts Council England Develop Your Creative Practice grant. BOM (Birmingham Open Media, a centre for art, technology and science dedicated to creative innovation with purpose) reports that its involvement gave it a platform to reach a wider audience for its work, gain more followers among its peers across sectors, and in particular widen its network in London, UK. BOM also gained significant experience in working with autistic adults, which has fed into its wider work and programmes directed specifically at autistic people aged 25 and over.
URL https://www.bom.org.uk/2019/09/20/mood-pinball-at-the-v-as-digital-design-weekend/
 
Title Smoking Gun 
Description A gameplay experience involving collaborative puzzle solving through data analysis and group communication has been developed by fanShen in collaboration with the STARTS artist-in-residency program and DataStories. Players must collaborate to solve data-based puzzles across a five-day period and unravel a mystery using different types of data. 
Type Of Art Artistic/Creative Exhibition 
Year Produced 2020 
Impact Too early to tell but Smoking Gun is being launched officially at the end of February 2020 and will be showcased at the STARTS Residency Days in Paris from Feb 29 - March 1 (https://vertigo.starts.eu/starts-residencies-days/). 
URL https://www.youtube.com/watch?v=Ol1NIJ3hWRM
 
Description The framework design is informed by practice-led research in three main areas:

Finding and enriching data
Generating content from data
Sharing and engaging with content

Finding and enriching data -
A number of peer-reviewed publications in the area of data discovery and collaboration with structured data have resulted from this project so far. These, together with the datasearch workshops described in the Section "Engagement activities", demonstrate the importance and recognition of dataset search within the research community. We believe this to be a significant achievement, contributing new knowledge to this rapidly evolving topic. Because current research on data discovery is exploratory in nature, we see a large space for future research taking our findings forward.

Based on the results of a mixed methods study on dataset summaries for human consumption, we proposed guidelines to help people write meaningful dataset summaries for the purpose of dataset reuse. These insights can inform the design of data discovery and exploration tools by tailoring functionalities to user needs specific to structured data. We further used our results to develop a small prototype that guides data publishers through the summary writing process.

In order to better understand the patterns and specific attributes that data consumers use to search for data, and how this compares with general web search, we performed a query log analysis based on logs from four national open data portals and conducted a qualitative analysis of user data requests issued to one of them. In addition, we conducted a crowdsourcing experiment in which we asked crowdworkers to create queries for the dataset described in a data request. The queries they provided were aimed at finding a dataset to answer a specific user need. It appeared that the portals' search functionalities are currently used in an exploratory manner, rather than to retrieve a specific resource, which reinforced our hypothesis that dataset search is different from general web search and needs tailored approaches that take advantage of the dataset structure.

After identifying that lack of context in dataset retrieval is a big factor in how users assess whether a dataset is suitable for their task, we looked into possible approaches to adding such context to the data inside the dataset. Approaches that assign semantic labels from knowledge bases to specific columns, disambiguating their meaning, exist, but their primary focus until now has been on columns with textual data rather than numbers. Given that numerical columns are the most popular column type on open data platforms, we proposed an approach to add semantic meaning to numerical columns. The approach was evaluated using a benchmark generated for the purpose of this work. We showed the influence of the different levels of analysis on the success of assigning semantic labels to numerical values within tables. Further, we compared our work with state-of-the-art approaches to this problem and showed that our approach is less affected by the structure of the data and by data quality issues.

One reason to engage in dataset search is to find data that can be reused for other purposes. In order to understand whether a dataset can be reused, people need to make sense of it and determine its "fitness for use". We identified a gap in research aiming to understand sensemaking specifically for structured data, as opposed to information seeking more generally. To this end we conducted a qualitative mixed methods study, looking at how researchers make sense of and reuse existing data. We were able to identify clusters of activity patterns and related data attributes important in data exploration and sensemaking. We derived concrete recommendations for how these activity patterns and data characteristics can inform tool design and documentation practices to support data-centric sensemaking behaviours.

Through a number of partner workshops, we have been interrogating the diversity implications of the structuring of data. Data is usually categorised and structured by "neurotypicals". We have reports of neurodiverse data users being frustrated by incoherent or illogical categorisation; it appears that neurotypical people have a greater capacity to cope with inconsistency and illogicality. Hierarchical classifications and seemingly subjective schema design can be difficult for neuro-atypical individuals to comprehend. "Data based decision making" is a term used in relation to evidence-based processes, but the data can be illogical to certain people. Neurodiversity may bring a unique perspective to the difficulties of categorisation and the process of creating standards. These findings ask us to question who makes the rules behind database structures and presentation, and whether the designers of these rules consider a diverse user base.

Generating content from data -
In investigating the use of data games and the effects of play on recall and engagement, a simple data game was implemented (based on the work of Togelius and Friberger (2013)) aiming to help players memorise simple data sets. An experiment was carried out in which participants were shown either a variant of this "gamified" visualisation or a set of traditional bar charts. However, experimental results have shown that participants who were shown the gamified visualisation did not necessarily perform better in terms of recall than those who saw a traditional visualisation. There are a number of reasons this may be the case, such as participants focusing on using the in-game mechanics to achieve a higher score rather than taking in the data. This leads us to conclude that simply incorporating the notion of play into a data visualisation is, alone, insufficient, and does not inherently help to communicate the message behind the data (and in some cases may distract from it). As such, ongoing work seeks to understand how the individual mechanics of games can be used to encourage exploration of, and focus on, data, and how mid-game "gating" (or tasks/quizzes) can encourage a deeper understanding.

Work is ongoing on a tool to support data journalists in creating stories that incorporate a semi-automated generative logical structure and intelligently recommended visualisations; a prototype is being developed and meetings with data journalists are planned.

Sharing and engaging with content -
Numeric data:
Investigations of the shareability of data, in terms of reach and engagement, have led to a public dataset of socially derived "numeric data", a unique corpus of more than 20 million occurrences of numeric data identified as appearing in social media feeds. The use of data-rich language in natural language communication has not been the subject of significant research focus, and this dataset allows studies of the references to and the reliance on data in human communication. Analysis (and refinement) of the data is ongoing and a model of the use of data will be incorporated into the WebData RA tool.

Chart identification:
To train and test our system for identifying chart images on social media, we used crowdsourcing to build a new corpus of 3k image tweets posted by the Twitter accounts of some major news agencies (e.g. nytgraphics, ReutersGraphics, and GuardianData). The corpus was formed because we found that there are differences between the chart images that are made available in benchmark corpora and those that are shared on social media platforms. The latter are often augmented with additional elements, such as text and images. This makes the task of identifying them more challenging, especially for systems that have been built based on idealised examples. Based on the statistics from this new corpus, we found that bar charts (incl. column charts) are the most common type of visualisation used by data journalism-oriented accounts, with 378 images showing solely a bar chart and 89 showing a bar chart accompanied by a different chart type; the second most common visualisation type is maps, with respective quantities of 382 and 14.

Furthermore, we built an architecture based on deep neural networks for predicting the virality potential of a chart image on Twitter. Our system predicts the expected virality as a function of the total number of its retweets and likes. Using this architecture, we tested the separate contribution of different signals (i.e. the chart image, its original poster, and the accompanying text) to the prediction of the expected number of likes and retweets. We evaluated our results using Spearman's rank correlation and Root Mean Square Error (RMSE) of the predicted values with respect to the actual retweet and like counts in our test set. We found that coupling the information from the accompanying text with author-related cues (e.g. number of friends, followers and likes) yields a larger performance gain for like count prediction than combining it with visual features extracted from the corresponding chart. On the other hand, combining textual features with either author- or chart-related cues is equally important for predicting the total number of expected retweets. In general, we found that the most accurate predictions are obtained when all three types of information (i.e. visual from the chart, textual from its accompanying text, and social from its original poster's characteristics) are taken into consideration.
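The two evaluation measures named above can be illustrated with a minimal sketch. The like counts below are made up for illustration and are not results from the project; the function names are hypothetical, and whether the project transformed the counts (for example, with a logarithm) before scoring is not stated here, so raw counts are assumed.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rmse(predicted, actual):
    """Root Mean Square Error between predicted and actual values."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up like counts for five chart tweets (illustrative only).
actual = [120, 15, 300, 42, 7]
predicted = [100, 20, 250, 10, 5]

print("Spearman:", round(spearman(actual, predicted), 3))
print("RMSE:", round(rmse(predicted, actual), 3))
```

Spearman's correlation rewards getting the ordering of posts right even when the absolute counts are off, while RMSE penalises large absolute errors, which is one reason the two are often reported together.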

To analyse how data-rich content is currently being shared, information is being collected from Twitter to classify the kinds of data used, the presentation mechanisms chosen, the role played by the data in the shared content and the individuals who share data-rich content.

Data experiences:
A different aspect of engaging with content was addressed in our work on "data-experiences", which resulted in two game-based artworks. One was created in a participatory design process with a neurodiverse person to express a personal response to data in an artistic context. The outcome facilitates the engagement of citizens with neurodiversity through the combination of game (a playable pinball machine) and data. One of the key findings from the design process was how categorisations inherent to data are tailored towards neurotypical experiences. The second piece will result in insights on collaborative decision making with data, to make sense of story fragments in the context of a game, and will be launched soon.
Exploitation Route Finding & enriching data - Understanding how people search for data can inform the building of systems or functionalities for data discovery that take user needs into account and support access to structured data. This includes an understanding of selection criteria for datasets which can inform interface design for dataset search; as well as insights into dataset summaries for human consumption which can inform the creation of semi-automatic data summarisation approaches.

Generating content from data - Our work could inform the choice of features when data is used to generate (or is embedded within) popular consumable media; for example, how different genres of game (puzzle, shooter, role-play) or different aesthetics of play (fantasy, cooperation, competition, abnegation) affect retention/comprehension of data and the enjoyment of the media.

Sharing & engaging with content - We see our work as a first step towards identifying misinformative data visualisations before they become viral. A natural extension of our work would be the implementation of a system capable of cross-checking facts presented in a data visualisation against information in knowledge bases and open data sources. We believe that our findings, along with the technologies involved, can be useful to businesses, including but not limited to social media platforms, that seek to protect their users from the dissemination of fake news.
Sectors Creative Economy,Digital/Communication/Information Technologies (including Software),Education,Culture, Heritage, Museums and Collections

URL http://datastories.co.uk/
 
Description NARRATIVE IMPACT

Finding and enriching data -
There are a number of initiatives aiming to improve dataset search and reuse on the web, driven by, for instance, the open data and open government data movements as well as the research data sharing communities. Recent attempts to facilitate web-scale dataset discovery include cross-domain dataset portals and data portals for open governmental or research data. However, because these solutions are limited to particular communities and/or based on rather incomplete and noisy metadata, transparent discovery of data on the web remains a challenge. The need for better discoverability of datasets is nonetheless increasingly recognised, and Google recently released Dataset Search. This dataset-specific search engine, which is still in its early days, takes advantage of initiatives such as schema.org, which has released a dataset-specific vocabulary to mark up structured data on the web. By initiating a workshop series on data search in 2018 we have started to bring researchers, and to some extent practitioners, together to discuss challenges in data discovery. The workshop was designed to be interdisciplinary, reflecting the range of challenges data discovery presents. We aimed to gain a better understanding of the extent to which techniques, methods and lessons learned from document retrieval (broadly construed) could be applied to data-centric contexts, providing an opportunity for an in-depth exploration of the differences between these two areas from both a technical and an interaction perspective. We received interest from governmental agencies in the UK and from data archiving institutes, and we were able to build a network of researchers and practitioners interested in bringing the topic forward. We are also contributing to the European Data Portal; this contribution consists of analytical reports about the future of open data portals and prototypes for alternative architectures.
This is joint work with Cap Gemini and others for the EC publication office and DG Connect. With the availability of Google Dataset Search, the open data landscape could become more accessible to less specialised users, which makes our research on dataset-specific selection criteria and dataset summaries even more timely and points towards future research potential in this area.

Generating content from data -
In the "post-truth" era, communication of data to the general public is increasingly important. However, it must be communicated in such a way as to allow it to be both critically evaluated and absorbed. To this end, work is being undertaken to create and evaluate novel methods of creating visualisations and narratives from data. One novel method of communicating data in an entertaining way is through the use of "data games": games in which play is driven through exploring a particular dataset. A simple, "casual" data game has been created (based on existing work in the field) that can be generically applied to visualise any numerical CSV data file. A second, physical instantiation of a pinball-based data game (created through collaboration between the ODI and BOM) is described below. A third gameplay experience involving collaborative puzzle solving through data analysis and group communication has been developed by fanShen in collaboration with the STARTS artist-in-residency program. In addition, we aim to support those working to use data to bring news to the general public (data journalists) by developing technology to assist in creating a data story (a news article supported by a dataset, incorporating pertinent visualisations, etc.).
The Data Storytelling Tool will allow authors to create a story directly from one or more datasets, using author input combined with intelligent recommendation systems and underlying templates to highlight key points of interest and to suggest (and generate) suitable visualisations and a logical story structure. The tool has been developed in collaboration with TheyBuyForYou (TBFY), an H2020 project which is focusing on increased accessibility of European procurement data. While it is still under development, early prototypes have received feedback from TBFY project partners as well as from the Bureau Local of Investigative Journalism in the UK, as part of a partnership initiated by Data Stories. These examples are linked by their characteristic of basing a game, an artwork or a story directly on an underlying dataset.

Sharing and engaging with content: through charts -
In today's data-driven society, data visualisations are frequently used by experts who seek to communicate quantitative information to the general public. In addition, the wide participation of users on the various social media platforms offers professionals a dynamic audience with whom they are able to share their data stories. This is exemplified by the fact that, among other experts, popular news agencies have Twitter accounts, such as nytgraphics, ReutersGraphics, and GuardianData, that specialise in the dissemination of data-driven information using charts. Recent efforts (e.g. the Facebook initiative), fuelled by ever-increasing concerns about the propagation of misinformation, have looked at how fake news in the form of textual articles spreads through the various social media channels. Nonetheless, the case of data visuals as a potential agent of misinformation is still an unexplored domain. To this end, we have focused on understanding how data visuals are shared on social media.
We have built a data-driven approach based on deep neural networks that first identifies whether a posted image displays a chart and, if it does, its exact chart type, and subsequently predicts its virality potential. One of the challenges that we encountered in developing our end-system is that both the format and the quality of charts published on social media platforms vary. This, along with the lack of available corpora of realistic data visualisations, introduces difficulties in training automatic approaches for chart identification. We explored a training strategy in which we train our algorithm on a corpus of idealised chart images and subsequently fine-tune it on a much smaller dataset of realistic data visualisations, with promising results. We use the above system as part of a multi-modal neural architecture that jointly learns to predict the total number of times that a chart-related post on Twitter will be liked and retweeted. Our approach uses different modules to process: (i) the visual features from the shared chart image, (ii) its accompanying text, and (iii) the original author's social cues (e.g. number of posts, friends and followers). We experiment with variations of this model using different input signals (e.g. with or without the author- and the image-related features) in order to determine their separate contribution towards the expected number of likes and retweets. While we have focused our investigation on the Twitter use case, our approach is transferable to other social media platforms. We see this work as a first step towards identifying misinformative data visualisations before they become viral. Having the ability to identify a chart among other images and compute an estimate of its virality potential could be useful for prioritising the resources required to certify its authenticity.
We believe that, besides the social network platforms that seek to check the accuracy of posts on their channels, many companies working in the fact-checking domain can immediately benefit from such a technology.

Sharing and engaging with content: through art and game -
The artistic element of the Data Stories project aims to go beyond basic visualisations of data as graphs or infographics, and to focus specifically on creative works, gaming and narrative design, produced collaboratively through co-creation. This includes the engagement of community groups in the creation process of an artwork or game but also the engagement of a general audience with those outputs. Working with a local community group, we developed projects that explore or play with the narratives of data from a positive perspective, specifically local and/or civic data which reflects and affects the community, or specific parts of the community. The emerging impact of this approach is a new perspective, for the arts organisation and its audience, on how data affects their lives and how it can be used as a rich source of inspiration for artists and community groups. Interesting challenges have arisen from misplaced assumptions that data and its meaning could be universal. We have noticed that different groups conceptualise and use data differently, specifically those who are neurodiverse. As the project continued we explored these differences with artists and the neurodiverse community. The final outcome aims to raise awareness of the multiplicity of data interpretations and usage. The underlying data presented in the artworks is either realistic or a close reflection of reality, but the story that unfolds is imaginary and based on unpredictable events caused by playing the game. Taking a different approach, we also worked with an artist collective to create a mobile-based collaborative mystery artistic experience as an alternate-reality game.
Throughout the project the artists and those engaging with their work became more aware that data is useful to a narrative when it has a purpose or meaning, ideally one with a personal connection.
First Year Of Impact 2019
Sector Digital/Communication/Information Technologies (including Software),Education,Government, Democracy and Justice
Impact Types Cultural,Societal,Policy & public services

 
Description EC Horizon 2020 TheyBuyForYou
Amount € 2,925,693 (EUR)
Funding ID 780247 
Organisation European Commission H2020 
Sector Public
Country Belgium
Start 01/2018 
End 12/2020
 
Title Initial method to assess the reusability of datasets within a corpus 
Description An increasing amount of data is published openly on the web, ideally with the aim of reuse. We created a method, including a prediction model, that can help to identify how likely it is that a given dataset from a collection of datasets will be reused. Based on a corpus of datasets published on GitHub, we show how to identify a basket of engagement metrics and predict the reusability of a dataset from attributes such as its structure, the way it was published, and its documentation. The method consists of a number of steps, applicable to data repositories more generally:
1. Scoping the assessment, for instance by deciding the specific collection of datasets that will be considered.
2. Defining reuse metrics. These depend on the capabilities of the data repository and the underlying technical infrastructure. If direct metrics cannot be defined, proxy metrics can be used. The metrics should be validated by exploring whether they are quantifiably linked to dataset reuse.
3. Collecting reuse metrics (or proxies). This requires technical capabilities, which may be built into the data publishing software, or aggregated metrics derived from lower-level system logs.
4. Defining reuse indicators. These need to be measurable and will be used as features in a prediction model for reusability.
5. Analysing the distribution of these indicators for the top-reused group of datasets.
6. Building a statistical model from a combination of those features to predict reusability.
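The last steps of the method (reuse indicators used as features of a statistical model) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the indicator names (has_readme, has_licence, column count), the toy labels, and the choice of logistic regression trained by gradient descent. The entry does not state which statistical model was actually used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict_reuse_probability(weights, bias, x):
    """Estimated probability that a dataset belongs to the top-reused group."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical indicators per dataset: [has_readme, has_licence, columns / 100]
FEATURES = [
    [1, 1, 0.10], [1, 1, 0.25], [1, 0, 0.05], [0, 0, 0.90],
    [0, 1, 0.60], [0, 0, 0.40], [1, 1, 0.15], [0, 0, 0.75],
]
# Toy labels: 1 = dataset falls in the top-reused group, 0 = otherwise.
LABELS = [1, 1, 1, 0, 0, 0, 1, 0]

WEIGHTS, BIAS = train_logistic(FEATURES, LABELS)
print(predict_reuse_probability(WEIGHTS, BIAS, [1, 1, 0.10]))
```

In practice the indicators would come from steps 2 to 5 (validated reuse metrics and their distribution in the top-reused group) rather than from hand-written lists.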
Type Of Material Improvements to research infrastructure 
Year Produced 2019 
Provided To Others? No  
Impact This work is currently in the submission process to a scientific journal, so no impact can be measured yet. 
 
Title Point at the Triple: Identifying Chart Types from images 
Description We created an algorithm to identify charts in images, with the aim of predicting the virality of charts on social media. The aim was to understand how data rendered visually as charts or infographics "travels" on Twitter. To this end, we proposed a neural network architecture that is partially trained on the DataTweet+ corpus to identify whether a post includes a chart, distinguish among different types of charts, for instance line graphs or scatter plots, and predict how much they will be shared. The prediction is performed by a multi-modal neural architecture, which learns to predict the number of times a chart post will be retweeted and liked given the set of visual, textual and author features that accompany a chart-related tweet. 
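One common way to combine visual, textual and author signals is late fusion by concatenation; a minimal sketch follows. It is an illustrative assumption and not the published architecture: the feature values, dimensionalities, weights, and the function names (`fuse`, `ablate`, `predict_engagement`) are hypothetical, and a real model would learn its weights and use a neural encoder for each modality.

```python
import math

def fuse(visual, text, author):
    """Late fusion: concatenate the per-modality feature vectors."""
    return list(visual) + list(text) + list(author)

def ablate(visual, text, author, use_visual=True, use_author=True):
    """Zero out a modality so variants with and without it can be compared."""
    v = visual if use_visual else [0.0] * len(visual)
    a = author if use_author else [0.0] * len(author)
    return fuse(v, text, a)

def predict_engagement(features, weights, bias):
    """Linear head on the fused vector; read as log(1 + likes + retweets)."""
    return sum(w * f for w, f in zip(weights, features)) + bias

# Hypothetical, hand-picked numbers purely for illustration.
visual = [0.2, 0.7]            # stand-in for a chart-image embedding
text = [0.1, 0.4, 0.3]         # stand-in for a tweet-text embedding
author = [math.log1p(1500)]    # stand-in for a social cue (follower count)
weights = [0.5, 0.5, 0.2, 0.2, 0.2, 0.3]
bias = 0.1

full = predict_engagement(fuse(visual, text, author), weights, bias)
no_author = predict_engagement(
    ablate(visual, text, author, use_author=False), weights, bias
)
print(full, no_author)
```

Comparing `full` with `no_author` mirrors, in a very simplified form, the idea of measuring what each input signal contributes to the prediction.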
Type Of Material Improvements to research infrastructure 
Year Produced 2019 
Provided To Others? No  
Impact Currently under review in an international journal. 
URL https://github.com/pvougiou/Point-at-the-Triple
 
Title DataTweet+ 
Description A collection of 3000 annotated tweets containing images, collected from Twitter accounts dedicated to data journalism (e.g. GuardianData and nytgraphics). Each image is accompanied by the names of the chart types (e.g. bar chart, line graph) it displays. The images were annotated in a crowdsourcing experiment using the FigureEight platform. The aim is to understand how data rendered visually as charts or infographics "travels" on Twitter. To this end, we propose a neural network architecture that is partially trained on the DataTweet+ corpus to identify whether a post includes a chart, distinguish among different types of charts, for instance line graphs or scatter plots, and predict how much they will be shared. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? No  
Impact A publication including a model which was trained on this dataset is currently under review and accepted with minor corrections at an international journal. 
URL https://github.com/pvougiou/Pie-Chart-or-Pizza
 
Title Virality Measures of "Data Tweets" 
Description This is a public dataset of socially derived "numeric data", a unique corpus of more than 20 million occurrences of numeric data identified in social media feeds (on Twitter). The use of data-rich language in natural language communication has not been the subject of significant research focus, and this dataset allows studies of the references to and the reliance on data in human communication. The dataset consists of two files in TSV format derived from a large number of tweets (16754250) that were identified as containing different forms of "numeric data" in an extended collection of tweets from Twitter's 1% public sample over 11 months from September 2018. Both files have a key column labelled "TweetID", which holds the Twitter API ID that can be used to retrieve the full Twitter data (recommended retrieval via TWARC). 
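Because both files share the "TweetID" key, they can be joined with standard tooling. A minimal sketch using Python's `csv` module follows; the in-memory file contents and the non-key column names (`NumericToken`, `Retweets`) are hypothetical placeholders, since the actual column layout is documented with the dataset itself.

```python
import csv
import io

def load_tsv(handle):
    """Read a TSV file-like object into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))

def join_on_tweet_id(left_rows, right_rows):
    """Inner-join two row lists on the shared 'TweetID' column."""
    # If an ID occurs more than once on the right, the last row wins.
    right_by_id = {row["TweetID"]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_id.get(row["TweetID"])
        if match is not None:
            joined.append({**row, **match})
    return joined

# Hypothetical in-memory stand-ins for the two TSV files.
numeric_file = io.StringIO("TweetID\tNumericToken\n101\t42%\n102\t3.5m\n")
virality_file = io.StringIO("TweetID\tRetweets\n101\t17\n103\t2\n")

rows = join_on_tweet_id(load_tsv(numeric_file), load_tsv(virality_file))
print(rows)
```

With the real files, the `io.StringIO` objects would be replaced by `open(path, newline="", encoding="utf-8")` handles; for files of this size a streaming or database-backed join may be more practical than loading everything into memory.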
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact Analysis (and refinement) of the data is currently ongoing. 
URL https://figshare.com/articles/Virality_Measures_of_Data_Tweets_/11940426
 
Description "Smoking Gun" an artistic innovation created in collaboration with fanShen 
Organisation European Commission
Department Horizon 2020
Country European Union (EU) 
Sector Public 
PI Contribution We contributed to the design elements of the project to ensure it was focused on research questions relevant to the Data Stories project, as well as participating in project scoping and providing general support to the artist collective during development. We assisted in generating a portion of a synthetic dataset to be used within the game, and also participated in testing and feedback sessions, as well as in dissemination activities (such as SOTSEF).
Collaborator Contribution The artist collective fanShen led the design and development portion of the project, which was created iteratively over the course of several months. As part of the process they created a large amount of content used in the game including facsimiles of "real world" websites, such as mirroring real news articles, conference websites, and email servers, to create an alternate-reality game.
Impact The outcome of this partnership is a mobile-based collaborative mystery game/artistic experience called Smoking Gun, which plays out in real time over the course of several days. Each morning, players are given new clues to the mystery that must be solved through data analysis skills; each evening, players are able to communicate and discuss their findings and share information. This was a multi-disciplinary partnership involving technical artists and game designers from fanShen, and researchers and data scientists from the Data Stories project. Smoking Gun is being launched officially at the end of February and will be showcased at the STARTS Residency Days in Paris from Feb 29 - March 1 (https://vertigo.starts.eu/starts-residencies-days/). We also have a project description on YouTube: https://www.youtube.com/watch?v=Ol1NIJ3hWRM STARTS residency days: https://www.starts.eu/agenda/starts-residencies-days/detail/
Start Year 2019
 
Description Birmingham Open Media co-created artwork: Tribes, Treasure Hunts & Truth Seekers 
Organisation Birmingham Open Media
Country United Kingdom 
Sector Private 
PI Contribution We put together a brief for the artists to develop a participatory artwork with members of the neurodiverse community in and around Birmingham. Two artist fellows, Harmeet Chagger Kahn and Ben Neale, were commissioned to develop the artwork.
Collaborator Contribution Managed by Birmingham Open Media, the two artist fellows designed a series of co-creation workshops with neurodiverse artists and residents of Birmingham, with the intention of developing one or more 'data experiences'. Two successful workshops were held with neurodiverse participants and teams from Birmingham Open Media (BOM) and the ODI. Data Stories team members for each workshop were Tom Blount, Rachel Wilson and Hannah Redler Hawes. The workshops were attended by around five participants interested in data and art from the neurodiverse community in and around Birmingham. An early community-building workshop was held in January 2018: the ODI organised and hosted a workshop as a kick-off for the Data Stories project, bringing in 30 people from journalism, art, civil society and academia around the theme of "bringing data to citizens, and vice versa". The purpose of this workshop was to introduce the Data Stories project to data journalists and data activists, and to survey the state of the art in data narratives and data engagement. Whilst the workshop was a success and achieved its objective of convening a variety of stakeholders around an interesting topic, the ODI felt it could play a more effective and unique leading role by bringing a DAC flavour, rather than following the initial plan to facilitate further workshops in support of the various work streams.
Impact The artwork resulting from this collaboration was showcased in three exhibitions: 1) BOM Hacked!, 2) the V&A's Digital Design Weekend, and 3) the "Copy That? Surplus Data in an Age of Repetitive Duplication" exhibition. This was a multidisciplinary partnership between the Data as Culture programme at the ODI, artists and an art collective, and the Data Stories research staff.
Start Year 2018
 
Description Talking Datasets: A Study on Verbal Dataset Description 
Organisation University of Amsterdam
Country Netherlands 
Sector Academic/University 
PI Contribution We have started to connect researchers and to some extent practitioners to come together and discuss challenges in data discovery, aiming to gain a better understanding of the extent to which techniques, methods and lessons learned from document retrieval (broadly construed) could be applied to data-centric contexts, providing an opportunity for an in-depth exploration of the differences between these two areas from both a technical and interaction perspective. Together with the Data Archiving Networked Services (Royal Netherlands Academy for Arts and Sciences) & Informatics Institute, University of Amsterdam, we conducted an interview study with 30 participants: design, set-up, execution and analysis of the study.
Collaborator Contribution Working together with University of Southampton as above.
Impact We conducted a mixed-methods study and wrote up the analysis as a paper submitted to the International Journal of Human-Computer Studies, where it is currently under review. We were able to build a network of researchers and practitioners interested in taking the topic forward. We plan to continue with a workshop more focused on non-academic audiences and to present the work at the Data Stories Symposium, which we are organising in June 2020.
Start Year 2018
 
Title Data Storytelling Tool 
Description We have made the Data Storytelling Tool freely available under the MIT License, a copy of which is present in the software repository, so that the general public can benefit from this research. The licence allows members of the public to reuse the software, including on a commercial basis, provided the copyright and licence notice are retained in copies of the work. 
IP Reference  
Protection Copyrighted (e.g. software)
Year Protection Granted
Licensed Yes
Impact The tool is under development and will be evaluated for functionality, usability and impact when finalised.
 
Title Data Storytelling Tool 
Description A tool to support the authors of data stories, by recommending a narrative structure based on features of interest from a dataset they have uploaded. The tool is a client-side, JavaScript-powered HTML page (i.e. while the tool will be deployed/hosted on a server, all data processing happens on the user's machine; no sensitive data of any sort will leave the user's machine). The objective of this tool is to assist the analysis of data through authored (and semi-automated) narrative, making it useful and applicable to data journalists, procurement specialists, and any other authors of data stories. The tool will allow users to import their own data, provide an overview of the data, recommend suitable story-beats and visualisations, and export the story to a number of formats.
Workflow Overview
1. User uploads a (.csv) datafile (alternatively, the user can load a previously saved story)
2. User selects fields of interest
3. User selects possible/likely dependencies/correlations
4. Tool recommends story-beats/visualisations
5. User completes story-template with text/images/manually selected visualisations/etc.
6. User exports story to one of several formats
Completed Features
1. Data upload/overview: Data can be uploaded from CSV (comma separated value) files; an overview of the data (including data type, selection of values, and min/max values and value distribution, if applicable) is shown to the user, allowing them to browse the data at a high level, and select values of interest
2. Visualisation generation: Bar chart, scatterplot, and line chart (time series) visualisations are currently generated based on user-submitted data, using the d3.js library
3. Simple visualisation recommendation: Visualisations are currently recommended based on user-submitted context of dependencies/correlations
4. Narrative authoring: users are currently able to author their narrative (based on simple recommendations based on their submitted correlations); users can supply text, images, and generate additional charts to construct the narrative
5. Export to html/json: Data stories can be exported in a number of formats including html (tailored to stand-alone pages, embeddable content, or (in conjunction with additional js libraries) slide-based content) and json
6. Story saving/loading: the tool supports saving/loading of data stories to user-controlled files (as no data is passed to a server)
Future Features
1. Narrative template system: a rule-based system to enhance the authoring experience, by guiding the user step-by-step through the narrative process
2. Advanced visualisation recommendation: enhanced recommendation, that may include elements such as (for example) trend-detection, correlation-detection, and/or anomaly detection
3. Visualisation annotation: allow users to add additional annotations on top of the generated visualisations to highlight any elements that, given their contextual knowledge, would be valuable to their audience
Codebase: https://github.com/TBFY/storytelling
Live demo: https://TBFY.github.io/storytelling (pending) 
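The data overview and simple recommendation features can be illustrated with a short sketch. It is written in Python for brevity and is not the tool's own JavaScript code; the function names, the type-inference rule, the chart-selection rule of thumb, and the sample CSV are all assumptions made for illustration.

```python
import csv
import io

def infer_type(values):
    """Label a column 'number' if every non-empty value parses as float."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "text"
    return "number"

def overview(csv_text):
    """Summarise each column: inferred type, plus min/max for numeric ones."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        kind = infer_type(values)
        info = {"type": kind}
        if kind == "number":
            numbers = [float(v) for v in values if v != ""]
            info["min"], info["max"] = min(numbers), max(numbers)
        summary[name] = info
    return summary

def recommend_chart(summary, x, y, x_is_time=False):
    """Hypothetical rule of thumb mapping a user-selected field pair to a chart."""
    if summary[y]["type"] != "number":
        return None
    if x_is_time:
        return "line chart"
    if summary[x]["type"] == "number":
        return "scatterplot"
    return "bar chart"

# Hypothetical sample data standing in for an uploaded CSV file.
SAMPLE = "supplier,year,value\nAcme,2018,120.5\nBolt,2019,80\nCore,2020,200\n"

summary = overview(SAMPLE)
print(summary["value"])
print(recommend_chart(summary, "supplier", "value"))
```

The `x_is_time` flag stands in for context the user supplies, since a column such as `year` parses as a number and cannot be recognised as a time axis from its values alone.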
Type Of Technology Webtool/Application 
Year Produced 2020 
Open Source License? Yes  
Impact The software is still in development, so impact has not yet been measured. Preliminary evaluation has been carried out with support from industry and project partners, and there has been interest in further specialisation of the tool for relevant domains. 
URL http://tomblount.co.uk/storytelling/
 
Title Numer 
Description Numer: a two-level (row- and column-based) approach to adding semantic meaning to numerical values (columns) within tables. Additionally, we generated a DBpedia-based benchmark, NumDB, with which we evaluate Numer. 
Type Of Technology New/Improved Technique/Technology 
Year Produced 2018 
Impact With the increasing amount of structured data on the web the need to understand and support search over this emerging data space is growing. Adding semantics to structured data can help address existing challenges in data discovery, as it facilitates understanding the values in their context. While there are approaches on how to lift structured data to semantic web formats to enrich it and facilitate discovery, most work to date focuses on textual fields rather than numerical data. 
URL https://github.com/chabrowa/semantification
 
Title Web Data RA 
Description In our work on evaluating the current use of social media channels for sharing "data-rich content", we implemented a data gathering tool for social media platforms such as Twitter and Facebook. It allows tweets / status updates to be gathered from common web platforms, including historic data gathering. WebDataRA is open source and available on the Chrome Browser Extension store (bit.ly/WebDataRA). The software is usable by non-programmer researchers, and converts social media platform data into spreadsheets for easy subsequent analysis. It has been used to investigate different communities' use of Twitter for communicating data-rich stories, and has enabled a variety of MSc projects in conjunction with Data Stories. The software has been disseminated through a workshop at "Social Media and Society" 2018 in Copenhagen. The workshop, called "Follow the Data", took the form of a tutorial run by Professor Carr and was attended by 50 researchers and PhD students. 
Type Of Technology Webtool/Application 
Year Produced 2018 
Open Source License? Yes  
Impact Not at this stage 
URL http://bit.ly/WebDataRA
 
Description Can Google Make us Smarter 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Keynote talk at ELearning Symposium at UoS
Year(s) Of Engagement Activity 2019
 
Description DATA:SEARCH'18: International Workshop on Searching Data on the Web 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This workshop explores challenges in data search, with a particular focus on data on the web. We want to stimulate an interdisciplinary discussion around how to improve the description, discovery, ranking and presentation of structured and semi-structured data, across data formats and domain applications. We welcome contributions describing algorithms and systems, as well as frameworks and studies exploring human data interaction. We see a large space for discussion and future research in the development of federated data discovery and search technologies, which leverages recent advances in information retrieval, Semantic Web and databases, and is mindful of human factors. The workshop aims to bring together communities interested in making the web of data more discoverable, easier to search and more user friendly.
Year(s) Of Engagement Activity 2018
URL http://sigir.org/sigir2018/
 
Description Data Stories: Engaging with data in a post-truth world 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact A seminar at UoS, resulting in student discussions.

ABSTRACT: One of the interpretations of the EU referendum result and the rise of Donald Trump in the US is that we are now living in a post-truth society - a world in which anecdotes shared on social media and invented numbers thrown on the sides of buses are more trusted and influential than official statistics, extensive research, and proven expertise. In this world, scientists, statisticians, analysts, and journalists must find new ways to bring hard, factual data to citizens. Data must entertain as well as inform, excite as well as educate. It must be built with social media sharing in mind, and become part of our everyday activities and digital interactions with others.
Data Stories looks at frameworks and technology to bring data closer to people through art, games, and storytelling. It examines the impact that varying levels of localisation, topicalisation, participation, and shareability have on the engagement of the public with factual evidence. It delivers tools and guidance for communities and civic groups to achieve wider participation and support for their initiatives, and empowers artists, designers, statisticians, analysts, and journalists to communicate through data in inspiring, informative ways.
Year(s) Of Engagement Activity 2018
 
Description Data Stories: Tribes, Treasure Hunts & Truth Seekers 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Panel discussion "how foraging for meaningful data can help us to understand who we are and reinvent the world we live in". Part of the Open Data Summit 2018 at the Guardian HQ in London, attended by 500 people.
Year(s) Of Engagement Activity 2018
URL http://theodi.org/odi-summit-2018-date-value-speakers/
 
Description Data Study Group 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact A researcher in my team attended a week-long data study group with the aim of creating data stories about government contract opportunities.
Year(s) Of Engagement Activity 2019
 
Description Data and Storytelling: Data Stories Launch Workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Industry/Business
Results and Impact A gathering of industrial and academic professionals working in spaces related to the Data Stories project (including those partnered directly with the project); a series of talks and discussion panels in which participants discussed on-going research projects in the field, cultural and artwork projects, and industrial applications of these sorts of technology.
Year(s) Of Engagement Activity 2018
URL http://datastories.co.uk/events/data-and-storytelling-workshop/
 
Description Loops of humans and bots in Wikidata 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A talk given in connection with Data Stories at HumL@www2018

Abstract:
Human-in-the-loop is a model of interaction where a machine process and one or more humans have an iterative interaction. In this paradigm the user has the ability to heavily influence the outcome of the process by providing feedback to the system, as well as the opportunity to gain different perspectives on the underlying domain and understand the step-by-step machine process leading to a certain outcome. Amongst the current major concerns in Artificial Intelligence research are being able to explain and understand the results as well as avoiding bias in the underlying data that might lead to unfair or unethical conclusions. Typically, computers are fast and accurate in processing vast amounts of data. People, however, are creative and bring in their perspectives and interpretation power. Bringing humans and machines together creates a natural symbiosis for accurate interpretation of data at scale. The goal of this workshop is to bring together researchers and practitioners in various areas of AI (i.e., Machine Learning, NLP, Computational Advertising, etc.) to explore new pathways of the human-in-the-loop paradigm.
Year(s) Of Engagement Activity 2018
URL https://humlworkshop.github.io/HumL-WWW2018/
 
Description PROFILES & Data:Search - International Workshop on Profiling and Searching Data on the Web 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Workshop at the Web Conference 2018:
The Web of Data has seen tremendous growth recently. In addition, new forms of structured data have emerged in the form of Web markup, such as schema.org, and entity-centric data in Web tables. Considering these rich, heterogeneous and evolving data sources which cover a wide variety of domains, exploitation of Web Data becomes increasingly important in the context of various applications, including federated search, entity linking, question answering, and fact verification. These applications require reliable information on dataset characteristics, including general metadata, quality features, statistical information, dynamics, licensing and provenance. Lack of a thorough understanding of the nature, scope and characteristics of data from particular sources limits their take-up and reuse, such that applications are often limited and focused on well-known reference datasets. The PROFILES workshop series aims at gathering approaches to analyse, describe and discover data sources - including but not limited to semantic search and SPARQL endpoints - as a facilitator for applications and tasks such as query distribution, entity retrieval and recommendation. PROFILES offers a highly interactive forum for researchers and practitioners, bringing together experts in the fields of Web, Semantic Web, Web Data, Semantic Search, Databases, NLP, IR and application domains.
Year(s) Of Engagement Activity 2018
URL http://www2018.thewebconf.org
 
Description Participation at an exhibition (BOM Hacked!) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact One of the artworks resulting from the project ("Mood Pinball") was exhibited at Birmingham Open Media as part of their "Hacked! Games Re-designed" exhibition from September 13 to December 21 2019. The exhibition showcased games and instruments that were made with, for and by disabled people. The exhibition intended to capture 'a unique moment in time when games designed from alternative all-ability perspectives could lead us towards an altogether more immersive future.' Exhibition Assistants reported that visitors were naturally drawn to Mood Pinball and found it easy to access the game.
Year(s) Of Engagement Activity 2019
URL https://www.bom.org.uk/event/hacked/
 
Description Participation at an exhibition (V&A Digital Design Weekend) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact One of the project's artistic outputs was on display during the Victoria and Albert Museum's Digital Design Weekend in London, where hundreds of visitors had the opportunity to engage with the work. Assistants reported that many visitors expressed interest in the project and were able to put questions to the artists, who were on hand to talk about the piece and its concept development. One visitor said that it was the most accessible piece in the whole Digital Design Weekend exhibition. The Data Stories project team carried out their own assessment of visitors' perception and understanding of the data as revealed by gameplay.
Year(s) Of Engagement Activity 2019
URL https://www.vam.ac.uk/event/q0nvJ80O/digital-design-weekend-sep-2019
 
Description STARTS residency days 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The projects of the STARTS Residency initiative were exhibited at the Centquatre gallery in Paris to an audience of 100+ delegates from technology partners, professional artists and the general public. fanShen and the University of Southampton hosted a workshop/engagement event, presented in both French and English, describing the collaboration, the art piece, and the intended outcomes for both partners, to an audience of ~50 people. A general discussion followed about the purpose of the work, the perception of data, and the implications of people's own data in their daily lives, with suggestions made to reuse the artwork as a means of training in data literacy, unconscious bias, and investigative skills in industry, particularly in organisations with international departments.
Year(s) Of Engagement Activity 2020
URL https://www.starts.eu/agenda/starts-residencies-days/detail/
 
Description Southampton Science & Engineering Festival (SOTSEF) 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Schools
Results and Impact Public outreach event at the Southampton Science & Engineering Festival (SOTSEF), in which the public was able to observe and interact with novel data visualisations and technologies stemming from the Data Stories project. Four core Data Stories team members took part in numerous discussions with the audience and engaged them in a data game (Bar Chart Ball).
Year(s) Of Engagement Activity 2019
URL https://www.sotsef.co.uk/
 
Description The Data We Want: Framework and Tools to Engage with Data 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact An invited talk at Office for National Statistics Data Science Campus

Abstract:
One of the interpretations of the EU referendum result and the rise of Donald Trump in the US is that we are now living in a post-truth society - a world in which anecdotes shared on social media and invented numbers thrown on the sides of buses are more trusted and influential than official statistics, extensive research, and proven expertise. In this world, scientists, statisticians, analysts, and journalists are continuously looking for new ways to bring hard, factual data to citizens. Data must entertain as well as inform, excite as well as educate. It must be easy to find, built with social media sharing in mind, and become part of our everyday activities and digital interactions with others.
In this talk, we will introduce the Data Stories framework and toolkit, which aim to bring data closer to people through novel interfaces and experiences. We will present studies that try to understand how people search and make sense of data, as it is currently made available on the web or on data portals. We will also explore emerging technologies, including intelligent assistants, decentralised ledgers, and personal data economies and their potential role in enriching human data interactions.
Year(s) Of Engagement Activity 2018
 
Description The data we want: Interfaces, methods and experiences to engage with data 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Industry/Business
Results and Impact An invited talk to Amazon (Cambridge).

One of the interpretations of the EU referendum result and the rise of Donald Trump in the US is that we are now living in a post-truth society - a world in which anecdotes shared on social media and invented numbers thrown on the sides of buses are more trusted and influential than official statistics, extensive research, and proven expertise. In this world, scientists, statisticians, analysts, and journalists are continuously looking for new ways to bring hard, factual data to citizens. Data must entertain as well as inform, excite as well as educate. It must be easy to find, built with sharing and reuse in mind, and become part of our everyday activities and digital interactions with others. In this talk, we will introduce Data Stories, a toolkit of interfaces, techniques, and experiences to bring data closer to people. We will present studies which explore how people look for and engage with different kinds of data, from CSV files published on open government portals to knowledge graphs created by large online communities and visual representations of data shared through social media.
Year(s) Of Engagement Activity 2019
 
Description Tribes, Treasure Hunts and Truth Seekers 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Study participants or study members
Results and Impact This was a workshop at Birmingham Open Media in which (self-identified) neurodiverse participants from the local area were invited to assist the artists in creating a piece of artwork exploring the relationship between neurodiversity and data.
Year(s) Of Engagement Activity 2018
URL https://www.bom.org.uk/engagement/bom-autism/tribes-treasure-hunts-truth-seekers/