Social Sciences, Social Data and the Semantic Web (S3W)

Lead Research Organisation: University of Bristol
Department Name: Sociology

Abstract

Recent years have seen phenomenal growth in the quantity and range of digital data that might be used for social research. The ESRC has already invested in harnessing administrative and business data as well as 'new and emerging forms of data' (e.g. social media and sensor data) for the social sciences. Now a new opportunity arises. 'Semantic linked data' (SLD) offers a new method for structuring and organizing digital data, which promises to have a profound effect on research capacity for data linkage and analysis across multiple, heterogeneous sources, at hitherto unimaginable speed and scale. Indeed, within the Computer Sciences, the proponents of SLD argue that if data are published following shared standards and protocols, the Web will be transformed from a library of documents into a single linked database, described as the 'semantic web'.

The value of data linkage is already well established in the social sciences, but existing methods are labour intensive, involve the retrospective matching of records across small numbers of data sets, and address particular pre-determined questions, with the linkage made for that specific purpose. In contrast, SLD techniques focus on the prospective production of data, so that information about people, places, businesses, artefacts and even conceptual categories like 'race' or 'class' can be matched and accumulated on an ongoing basis and drawn together however the subsequent user determines, at the scale of the World Wide Web (Halford, Pope and Weal 2012).

However, whilst SLD offers great promise to the social sciences there is - to date - negligible use of SLD by social scientists. The agenda for SLD is being driven by computer scientists and demonstrations are based on relatively straightforward examples such as transport timetables or estates data. Whilst these work well technically, they offer no substantial investigation of how appropriate the techniques might be in addressing more complex social science questions. At a time of financial constraint, when funding for major new data collection is uncertain, it is essential that we explore these opportunities.

The research proposed here will be a detailed investigation into whether and how SLD might be harnessed for social science research. To achieve this we have drawn together a strong team of social and computational scientists, with a well-established track record of collaboration. This team will be supported by an outstanding Advisory Group of experts, who have already agreed to participate in this project (see Impact Summary below).

We will explore three research questions:

(i) What are the implications of using SLD methods to describe social data?
(ii) What does SLD contribute to our capacity to understand health inequalities across the life course?
(iii) What are the implications of SLD for data archiving and re-use?

To answer these questions we will:

(i) Carry out a detailed study of the processes involved in converting existing data into SLD. This will be undertaken by the research team, with the participation of experts from our Advisory Group. Specifically, we will work with the English Longitudinal Study of Ageing (ELSA), the Great British Class Survey (GBCS) and other related data already in the 'linked data cloud'.
(ii) Develop a 'demonstrator' of SLD (using the data sets developed at (i) above) to examine the specific question of health inequalities across the life course.
(iii) Collaborate with the UK Data Service and the GESIS-Leibniz Institute in Germany (which provides a similar data infrastructure to UKDS) to explore the opportunities for data archiving.

In this way, we seek to engage social science in the ongoing development of SLD and the emerging Semantic Web; and to explore the implications of SLD for building next generation data infrastructures in the social sciences.

Planned Impact

Two specific non-academic user groups will benefit from this research.

i) Data archive professionals and policy makers: the value of secondary data analysis is now increasingly recognised, both in terms of securing maximum returns from initial financial investment and in terms of generating knowledge and understanding. This has been a policy priority for Research Councils UK and for the UK data archives, now consolidated as the UK Data Service. The rapid development of open and semantically linked data (SLD) promises a step-change in the findability and reusability of existing data, using computational tools to enable analysis across multiple data sets at a far greater scale and range of data linking than is currently possible. To date, there has been very little application of SLD techniques to social science data. Meanwhile the development of these techniques in the computational sciences continues apace but with little reference to the social sciences, or to the particular challenges that our data may present for these techniques. This project will harness current computational advances for the social sciences, and work closely with key members of our Advisory Group to ensure that these reach non-academic users directly. In particular we will work with the UK Data Service, who have time costed into this bid, to explore how SLD might support their current shift towards 'data as a platform'. The Office for National Statistics will also be closely involved, as the project offers insights to several of their teams, including the Big Data team, ONS Geography and the Health Inequalities team, as well as to those concerned with data infrastructure.

ii) Semantic Web developers: this group seeks to further the development of tools and inference techniques to encompass a broader range of quantitative and qualitative datasets. With a better understanding of the way open data providers produce data, the community will be able to improve tool support for the creation, publishing and analysis of linked data. This project will provide a case study in modelling qualitative data, such as interview transcripts, which the Semantic Web community currently largely ignores. The longitudinal and aggregate nature of the datasets being considered in these studies presents a particular challenge in moving beyond the current consensus on identifier use on the Semantic Web.

Both groups have been involved in the preparation of this research proposal and will be integrally involved in the project as it unfolds. This will provide firm grounding for impact with these users, on which we can build to develop wider policy, skills and public impact. The mechanisms that we will use to achieve this are described in the attached 'Pathways to Impact' document.

Publications

 
Description We started the analysis phase of the project in January 2020. Outputs are currently being planned. Key findings will be added at the next opportunity.
Exploitation Route Award still ongoing, too early to say
Sectors Communities and Social Services/Policy; Digital/Communication/Information Technologies (including Software); Education; Healthcare; Leisure Activities, including Sports, Recreation and Tourism

 
Title S3W - SLD design and development and ethnography
Description Recent years have seen profound changes in our data landscape. The ESRC has already invested in harnessing administrative and business data as well as 'new and emerging forms of data' (e.g. social media and sensor data) for the social sciences. Now a new opportunity arises. 'Semantic linked data' (SLD) offers a new way to structure and organize digital data, which promises to have a profound effect on research capacity for data linkage and analysis across multiple, heterogeneous sources, at hitherto unimaginable speed and scale. It is well established in the social sciences that linkage across data sets provides important insights. However, current methods are restricted to retrospective linkage across a small number of data sets, pursuing specific pre-defined questions and usually based on unique individual identifiers (e.g. National Insurance number). By contrast, SLD entails a shift towards prospective data construction using shared standards and protocols for (i) the naming of data entities (e.g. people, places, social classes), (ii) the description of the relationships between data entities (in formal ontologies) and (iii) the development of computational tools for rapid, flexible, machine-readable data analysis. This promises a step-change in our capacity to integrate and interrogate data across silos, potentially at web scale. If all data were published online as SLD, the Web would be transformed from a library of documents into a single linked database (Berners-Lee et al.). This project conducts a benchmarking examination of the challenges and opportunities of SLD for social science research by investigating the method itself as an object of study and developing a demonstrator application of SLD to the study of social inequality. We have constructed an SLD demonstrator to (1) explore the affordances of SLD in facilitating complex interrogation of ELSA across its qualitative, quantitative and longitudinal elements and (2) enhance ELSA by making links to GBCS and to other relevant datasets in the Linked Data Cloud. Both data sets have now been converted to SLD, which makes it possible to identify similar entities (social groups, occupations, etc.) in each and to conduct querying and inference across the linked datasets. Our remaining work includes exploring the affordances of SLD for understanding how class practices and assets, alongside better-known identifiers of occupation and income, shape health over the life course.
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? Yes  
Impact Over the past decade, over 150bn pieces of interlinked data have been published in the Linked Data 'cloud' http://lod-cloud.net/. This already shapes the presentation of web-scale data: web browsers now offer coherently aggregated data about particular entities (places or organizations, for example) rather than a list of word-matched 'hits'. Meanwhile, computer scientists have developed applications for SLD, including bus timetables and building facilities. However, the challenges and opportunities of SLD for social science research are not yet understood. The potential significance of SLD for social science may be profound, making viable the radical expansion of cross-data-set analysis at vast scale and high speed. But social data are more difficult to describe than those used in existing applications of SLD, for example because they involve (i) competing descriptions of the same entities (e.g. social class), (ii) qualitative as well as quantitative data and (iii) longitudinal data, where the nature of entities and their relations with each other change over time. In short, the promise of SLD for social science research may be transformational, but to date its actual value remains unknown. The novel interdisciplinary methodology used for the semantic linkage of two social science data sets, and the lessons learnt from the ethnographic observations in this project, could be applied to other data sets.
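
The three ingredients described in this record (shared names for entities, described relationships between them, and machine-readable querying) can be illustrated with a minimal sketch in plain Python. This is not the project's implementation: the `example.org` identifiers, property names and values are all invented for illustration, and a real system would use RDF and SPARQL tooling rather than Python tuples.

```python
# Illustrative only: every URI, property name and value below is hypothetical.
EX = "http://example.org/"

# Each dataset is published as (subject, predicate, object) triples.
# Both datasets name the occupation with the SAME identifier, which is
# what lets them be joined later without a bespoke matching exercise.
survey_a = [
    (EX + "a/person/1", EX + "hasOccupation", EX + "occupation/nurse"),
    (EX + "a/person/1", EX + "selfRatedHealth", "good"),
    (EX + "a/person/2", EX + "hasOccupation", EX + "occupation/teacher"),
    (EX + "a/person/2", EX + "selfRatedHealth", "fair"),
]
survey_b = [
    (EX + "b/respondent/9", EX + "hasOccupation", EX + "occupation/nurse"),
    (EX + "b/respondent/9", EX + "culturalCapitalScore", 42),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def values_by_occupation(triples, prop):
    """Group the values of `prop` by the occupation of the subject holding them."""
    grouped = {}
    for subject, _, occupation in match(triples, p=EX + "hasOccupation"):
        for _, _, value in match(triples, s=subject, p=prop):
            grouped.setdefault(occupation, []).append(value)
    return grouped


# Merging is just concatenation: the shared identifiers do the linking.
merged = survey_a + survey_b
health = values_by_occupation(merged, EX + "selfRatedHealth")
capital = values_by_occupation(merged, EX + "culturalCapitalScore")

# Occupations described in both sources can now be examined together.
shared = sorted(set(health) & set(capital))
```

Note that the link here is made at the level of a shared category (an occupation), not by matching individuals, which is the contrast with identifier-based retrospective linkage drawn in the description.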
 
Title S3W ELSA SLD data 
Description As part of our proof-of-concept demonstrator, we have converted the ELSA wave 6 dataset to Semantic Linked Data (SLD), which enables linkage with the GBCS SLD and future integration with other datasets. The resulting SLD data are re-usable and, when used alongside other datasets such as GBCS, have the potential to support queries on social science questions (e.g. health inequalities) that were not easily answered using conventional data linkage techniques.
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact Impacts include enabling re-use and linkage with other datasets to ask research questions about health inequalities that were previously very complex and cumbersome to address using conventional data linkage methods. Our remaining work includes exploring the affordances of SLD for understanding how class practices and assets, alongside better-known identifiers of occupation and income, shape health over the life course.
 
Title S3W GBCS SLD Data 
Description As part of our proof-of-concept demonstrator, we have converted GBCS's GfK dataset to Semantic Linked Data (SLD), which enables linkage with the ELSA SLD and future integration with other datasets. The resulting SLD data are re-usable and, when used alongside other datasets such as ELSA, have the potential to support queries on social science questions (e.g. health inequalities) that were not easily answered using conventional data linkage techniques.
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact Impacts include enabling re-use and linkage with other datasets to ask research questions about health inequalities that were previously very complex and cumbersome to address using conventional data linkage methods. Our remaining work includes exploring the affordances of SLD for understanding how class practices and assets, alongside better-known identifiers of occupation and income, shape health over the life course.
 
Title S3W Social class ontology 
Description As part of our proof-of-concept demonstrator, we have designed and developed a social class ontology that formally models entities and relationships related to the notion of social class as conceptualised in the GBCS and ELSA data. The ontology is re-usable and, when used alongside SLD datasets such as the S3W ELSA and GBCS data, has the potential to support queries on social science questions (e.g. health inequalities) that were not easily answered using conventional data linkage techniques.
Type Of Material Computer model/algorithm 
Year Produced 2020 
Provided To Others? Yes  
Impact Impacts include enabling re-use and linkage with other datasets to ask research questions about health inequalities that were previously very complex and cumbersome to address using conventional data linkage methods. Also, if used and updated over time, the ontology can support archiving and the study of how social class is modelled in the future. Our remaining work includes exploring the affordances of SLD for understanding how class practices and assets, alongside better-known identifiers of occupation and income, shape health over the life course.
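
As a rough illustration of what formally modelling class categories makes possible, the sketch below records "is a narrower kind of" links between terms from two competing schemes and infers the broader terms each belongs to. The scheme prefixes, class names and links are hypothetical placeholders rather than the contents of the S3W ontology; a real ontology would typically be expressed in RDFS or OWL.

```python
# Hypothetical class labels and links, for illustration only.
SUBCLASS_OF = {
    "schemeA:HigherManagerial": "schemeA:ManagerialAndProfessional",
    "schemeA:ManagerialAndProfessional": "onto:SocialClass",
    "schemeB:Elite": "onto:SocialClass",
}


def ancestors(term, subclass_of):
    """Follow subclass links upward, returning every broader term in order."""
    found = []
    seen = {term}
    while term in subclass_of:
        term = subclass_of[term]
        if term in seen:  # guard against accidental cycles in the mapping
            break
        seen.add(term)
        found.append(term)
    return found


def is_a(term, broader, subclass_of):
    """True if `term` equals `broader` or sits somewhere beneath it."""
    return term == broader or broader in ancestors(term, subclass_of)


# Records coded under different schemes can be retrieved with one question.
records = [
    ("dataset-one/r1", "schemeA:HigherManagerial"),
    ("dataset-two/r7", "schemeB:Elite"),
    ("dataset-two/r8", "other:Unclassified"),
]
classified = [
    rid for rid, label in records
    if is_a(label, "onto:SocialClass", SUBCLASS_OF)
]
```

Because the links live in data, not in analysis code, a revised scheme can be added as new entries without changing the query logic.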
 
Title S3W proof of concept demonstrator 
Description As part of the proof-of-concept demonstrator we designed and developed (1) ELSA and GBCS ontologies; (2) algorithms for assigning a GBCS conception of social class in the GBCS and ELSA datasets, which enable querying on social science questions (e.g. health inequalities) across these linked datasets; and (3) visualisations.
Type Of Technology New/Improved Technique/Technology 
Year Produced 2020 
Impact Impacts include enabling re-use and linkage with other datasets to ask research questions about health inequalities that were previously very complex, cumbersome and inflexible to address using conventional data linkage methods. Our remaining work includes exploring the affordances of SLD for understanding how class practices and assets, alongside better-known identifiers of occupation and income, shape health over the life course.
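
The idea of applying one class-assignment procedure across two linked datasets can be sketched as follows. This is emphatically not the project's algorithm: the three score names, the 0-100 scale, the thresholds and the output labels are all placeholder assumptions, chosen only to show that a single rule can run over records from both sources once their variables share a common description.

```python
# Placeholder thresholds and labels: NOT the rules used in the project.
def assign_class(record, high=70, low=30):
    """Label a record from the mean of three harmonised 0-100 scores."""
    keys = ("economic", "cultural", "social")
    if any(record.get(k) is None for k in keys):
        return None  # cannot classify without all three measures
    mean = sum(record[k] for k in keys) / len(keys)
    if mean >= high:
        return "class_high"
    if mean <= low:
        return "class_low"
    return "class_mid"


# Hypothetical records from two sources, already mapped to shared field names.
dataset_one = [
    {"id": "one-1", "economic": 80, "cultural": 75, "social": 90},
]
dataset_two = [
    {"id": "two-1", "economic": 20, "cultural": 35, "social": None},
    {"id": "two-2", "economic": 50, "cultural": 40, "social": 60},
]

# The same function is applied to both sources.
labels = {r["id"]: assign_class(r) for r in dataset_one + dataset_two}
```

Returning `None` for incomplete records keeps missing measures visible instead of silently assigning a class, which matters when one dataset lacks a variable the other collects.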
 
Description S3W workshop 2 - GBCS ontology
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The workshop participants were prominent social scientists and computer scientists from a wide range of academic and governmental institutions, such as the London School of Economics, the Office for National Statistics (ONS) and the Leibniz Institute for the Social Sciences (GESIS). The workshop started with presentations from the S3W team, providing an overview of the project and the progress that had been made. The domain experts participating in the workshop contributed to outlining and discussing potential reasons for, and approaches to, linking the data sets, as well as the opportunities and challenges of the SLD approach to data linkage for social science research. After the progress was presented, the domain experts were asked to comment and give feedback. Their contributions were key to shaping the project in the months that followed and to avoiding pitfalls. Outcomes from the activity include interest in the research from the domain experts and contributions that shaped the development of the proof-of-concept demonstrator showcasing SLD linkage across the GBCS and ELSA datasets. The workshop also contributed ethnographic data that are currently being analysed and will result in publications in the coming months. The project and any collaborations on it are inherently interdisciplinary, spanning sociology, epidemiology, computer science and archival studies.
Year(s) Of Engagement Activity 2019
 
Description S3W workshop 1 - ELSA ontology
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The workshop participants were prominent social scientists and computer scientists from a wide range of academic and governmental institutions, such as the University of Manchester, the Office for National Statistics (ONS) and the Leibniz Institute for the Social Sciences (GESIS). The workshop started with presentations from the S3W team, providing an overview of the project and the progress that had been made. The domain experts participating in the workshop contributed to outlining and discussing potential reasons for, and approaches to, linking the data sets, as well as the opportunities and challenges of the SLD approach to data linkage for social science research. After the progress was presented, the domain experts were asked to comment and give feedback. Advisory Group contributions from this meeting were key to shaping the project in the months that followed and to avoiding pitfalls. Outcomes from the activity include interest in the research from the domain experts and contributions that shaped the development of the proof-of-concept demonstrator showcasing SLD linkage across the GBCS and ELSA datasets. The workshop also contributed ethnographic data that are currently being analysed and will result in publications in the coming months. The project and any collaborations on it are inherently interdisciplinary, spanning sociology, epidemiology, computer science and archival studies.
Year(s) Of Engagement Activity 2019
 
Description S3W workshop 3 - The demonstrator
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The workshop started with presentations from the S3W team on the progress of the work and the process of developing the demonstrator. The domain experts were then given time to engage with the demonstrator, while the S3W team observed their interactions for the ethnographic side of the project. The domain experts contributed to outlining and discussing the potential, opportunities and challenges of the demonstrator for social science research. They also provided specific feedback on the visualisations and the design of the demonstrator.
Year(s) Of Engagement Activity 2020