Social Sciences, Social Data and the Semantic Web (S3W)

Lead Research Organisation: University of Bristol
Department Name: Sociology

Abstract

Recent years have seen phenomenal growth in the quantity and range of digital data that might be used for social research. The ESRC has already invested in harnessing administrative and business data as well as 'new and emerging forms of data' (e.g. social media and sensor data) for the social sciences. Now a new opportunity arises. 'Semantic linked data' (SLD) offers a new method for structuring and organizing digital data, which promises to have a profound effect on research capacity for data linkage and analysis across multiple, heterogeneous sources, at hitherto unimaginable speed and scale. Indeed, within the computer sciences, the proponents of SLD argue that if data are published following shared standards and protocols the Web will be transformed from a library of documents into a single linked database, described as the 'semantic web'.

The value of data linkage is already well established in the social sciences, but existing methods are labour intensive, involve the retrospective matching of records across small numbers of data sets, and address particular pre-determined questions, with the linkage made for that specific purpose. In contrast, SLD techniques focus on the prospective production of data, so that information about people, places, businesses, artefacts and even conceptual categories such as 'race' or 'class' can be matched and accumulated on an on-going basis and drawn together in whatever way the subsequent user determines, at the scale of the World Wide Web (Halford, Pope and Weal 2012).

However, whilst SLD offers great promise to the social sciences there is - to date - negligible use of SLD by social scientists. The agenda for SLD is being driven by computer scientists and demonstrations are based on relatively straightforward examples such as transport timetables or estates data. Whilst these work well technically, they offer no substantial investigation of how appropriate the techniques might be in addressing more complex social science questions. At a time of financial constraint, when funding for major new data collection is uncertain, it is essential that we explore these opportunities.

The research proposed here will be a detailed investigation into whether and how SLD might be harnessed for social science research. To achieve this we have drawn together a strong team of social and computational scientists, with a well-established track record of collaboration. This team will be supported by an outstanding Advisory Group of experts, who have already agreed to participate in this project (see Impact Summary below).

We will explore three research questions:

(i) What are the implications of using SLD methods to describe social data?
(ii) What does SLD contribute to our capacity to understand health inequalities across the life course?
(iii) What are the implications of SLD for data archiving and re-use?

To answer these questions we will:

(i) Carry out a detailed study of the processes involved in converting existing data into SLD. This will be undertaken by the research team, with the participation of experts from our Advisory Group. Specifically, we will work with the English Longitudinal Survey of Ageing, the Great British Class Survey and other related data already in the 'linked data cloud'.
(ii) Develop a 'demonstrator' of SLD (using the data sets developed at (i) above) to examine the specific question of health inequalities across the life-course.
(iii) Collaborate with the UK Data Service and the GESIS-Leibniz Institute in Germany (which provides a similar data infrastructure to UKDS) to explore the opportunities for data archiving.

In this way, we seek to engage social science in the ongoing development of SLD and the emerging Semantic Web; and to explore the implications of SLD for building next generation data infrastructures in the social sciences.

Planned Impact

Two specific non-academic user groups will benefit from this research.

i) Data archive professionals and policy makers: the value of secondary data analysis is now increasingly recognised, both in terms of securing maximum returns from initial financial investment and in terms of generating knowledge and understanding. This has been a policy priority for Research Councils UK and for the UK data archives, now consolidated as the UK Data Service. The rapid development of open and semantically linked data (SLD) promises a step-change in the findability and reusability of existing data, allowing computational tools to link and analyse multiple data sets at far greater scale and range than is currently possible. To date, there has been very little application of SLD techniques to social science data. Meanwhile the development of these techniques in the computational sciences continues apace but with little reference to the social sciences, or to the particular challenges that our data may present for these techniques. This project will harness current computational advances for the social sciences, and work closely with key members of our Advisory Group to ensure that these reach non-academic users directly. In particular we will work with the UK Data Service, who have time costed into this bid, to explore how SLD might support their current shift towards 'data as a platform'. The Office for National Statistics will also be closely involved, as the project offers insights to several of their teams including the Big Data team, ONS Geography, and the Health Inequalities team as well as to those concerned with data infrastructure.

ii) Semantic Web developers: this group seeks to further the development of tools and inference techniques to encompass a broader range of quantitative and qualitative datasets. With a better understanding of the way open data providers produce data, the community will be able to improve tool support for the creation, publishing and analysis of linked data. This study will provide a case study in modelling qualitative data, such as interview transcripts, which are currently largely ignored by the Semantic Web community. The longitudinal and aggregate nature of the datasets being considered in these studies presents a particular challenge in moving beyond the current consensus of identifier use on the Semantic Web.

Both groups have been involved in the preparation of this research proposal and will be integrally involved in the project as it unfolds. This will provide firm grounding for impact with these users, on which we can build to develop wider policy, skills and public impact. The mechanisms that we will use to achieve this are described in the attached 'Pathways to Impact' document.
 
Description We have completed a systematic and in-depth investigation of the opportunities, challenges and affordances of 'semantic linked data' (SLD) for social research, focussing on the specific case study of health inequalities. We have four core achievements:

(1) We have pioneered an in-depth interdisciplinary approach to the creation of SLD for social research. To the best of our knowledge (across social and computational sciences) this is the first time this has been done.
(2) We have converted two existing social science survey-based data sets (the Great British Class Survey and sections of ELSA waves 6 & 8) into SLD, complete with ontologies (i) modelling the entities and relationships in these data sets, (ii) modelling Bourdieusian approaches to social class and (iii) modelling temporal change to enable analysis of individual survey responses over time. The resulting SLD is now prospectively linkable with a range of other SLD (e.g. prescribing data, health services data) and will be made available for use by other researchers through UKDS.
(3) We have created a 'demonstrator' that allows us to query the data sets in highly flexible ways to explore complex questions about social class and health over the life-course. The source code, software and tools to create this demonstrator for reuse will be archived with UKDS.
(4) We have completed an auto-ethnography (over a full year) of the work undertaken. This allows us to (i) trace the epistemological and ontological issues that arise when using SLD for social research; (ii) explore the challenges and processes of interdisciplinary research across the social and computational sciences.

Our initial findings are as follows:
(1) We have generated new knowledge of the challenges and compromises of using SLD for social data and research. This poses difficulties for computer scientists e.g. in capturing longitudinal change across data sets with large numbers of variables and individuals (SLD is more usually used for smaller and/or static data sets). It also poses difficulties for sociologists where analysis of data can be more open and iterative than can be recognised by SLD.
(2) We have generated significant new insight into the differences and similarities between the everyday knowledge practices of sociologists and computer scientists, and an understanding of how differences and tensions are addressed in practice. This is important to support and further the growing expectations for interdisciplinary research across the computational and social sciences.
(3) We have created new resources for sociological research on social class and health inequalities. Our data sets, research tools and software will be archived with UKDS and are available for re-use.
(4) We have created the means to interrogate health inequalities using the new 7-class typology created by the Great British Class Survey.
(5) We have enhanced the skill-sets of both the computer science team and the sociology team, as a platform for further interdisciplinary research and teaching.
Exploitation Route The specific motivation for the creation of semantic linked data is to open up prospective opportunities for querying across related data sets. We will archive our data sets with UKDS, ready prepared for further semantic data linkage and computational analysis. Our data sets and associated tools allow other researchers to recreate a 'demonstrator' for querying across the Great British Class Survey and key sections of ELSA (Waves 6 & 8). Our ontologies are intended to be re-usable by other semantic linked data projects.
Sectors Communities and Social Services/Policy, Digital/Communication/Information Technologies (including Software), Education, Healthcare, Leisure Activities (including Sports, Recreation and Tourism)

 
Title Capital Thresholds for GBCS class assignment 
Description This tool defines the thresholds for economic, social and cultural capital measures that were operationalised as part of the process to assign GBCS social classes to the original GBCS participants and to participants in ELSA waves 6 & 8. PLEASE NOTE: these capital thresholds need to be read alongside the class logic description and the algorithm for assigning social class, also described in Researchfish and deposited as part of the S3W archive held at UKDS. An overall description of how these tools fit together can be found in the PowerPoint also described in Researchfish and deposited as part of the S3W archive held by UKDS. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? No  
Impact This tool (as part of the overall set of tools developed for S3W) allows users to assign GBCS social class to ELSA waves 6 & 8. 
 
Title S3W - SLD design and development and ethnography 
Description Recent years have seen profound changes in our data landscape. The ESRC has already invested in harnessing administrative and business data as well as 'new and emerging forms of data' (e.g. social media and sensor data) for the social sciences. Now a new opportunity arises. 'Semantic linked data' (SLD) offers a new way to structure and organize digital data, which promises to have a profound effect on research capacity for data linkage and analysis across multiple, heterogeneous sources, at hitherto unimaginable speed and scale. It is well-established in the social sciences that linkage across data sets provides important insights. However, current methods are restricted to retrospective linkage across a small number of data sets, pursuing specific pre-defined questions and usually based on unique individual identifiers (e.g. National Insurance number). By contrast, SLD entails a shift towards prospective data construction using shared standards and protocols for (i) the naming of data entities (e.g. people, places, social classes etc.), (ii) the description of the relationships between data entities (in formal ontologies) and (iii) the development of computational tools for rapid, flexible, machine-readable data analysis. This promises a step-change in our capacity to integrate and interrogate data across silos, potentially at web-scale. If all data were published online as SLD the Web would be transformed from a library of documents into a single linked database (Berners-Lee et al.). This project conducts a benchmarking examination of the challenges and opportunities of SLD for social science research by investigating the method itself as an object of study and developing a demonstrator application of SLD to the study of social inequality. We have constructed an SLD demonstrator to (1) explore the affordances of SLD in facilitating complex interrogation of ELSA across its qualitative, quantitative and longitudinal elements and (2) enhance ELSA by making links to GBCS and to other relevant datasets in the Linked Data Cloud. Both data sets have now been converted to SLD, enabling us to identify similar entities (social groups, occupations, etc.) in each and to conduct querying and inference across these linked datasets. Our remaining work includes exploring the affordances of SLD for a broader understanding of how class practices and assets, alongside better-known identifiers of occupation and income, shape health over the life course. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? Yes  
Impact Over the past decade, over 150bn pieces of interlinked data have been published in the Linked Data 'cloud' (http://lod-cloud.net/). This already shapes the presentation of web-scale data: web browsers now offer coherently aggregated data about particular entities - places or organizations for example - rather than a list of word-matched 'hits'. Meanwhile, computer scientists have developed applications for SLD, including bus timetables and building facilities. However, the challenges and opportunities of SLD for social science research are not yet understood. The potential significance of SLD for social science may be profound, making viable the radical expansion of cross-data set analysis at vast scale and high speed. However, social data are more difficult to describe than those used in existing applications of SLD, for example involving (i) competing descriptions for the same entities (e.g. for social class), (ii) qualitative as well as quantitative data and (iii) longitudinal data, where the nature of entities and their relations with each other changes over time. In short, the promise of SLD for social science research may be transformational but to date its actual value remains unknown. The novel interdisciplinary methodology used for the semantic linkage of two social science data sets, and the lessons learnt from the ethnographic observations in this project, could be applied to other data sets. 
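The principle described in this record, naming entities with shared identifiers so that separately published data sets can be queried together, can be sketched in a few lines of Python. This is an illustrative sketch only: the example.org namespace, the respondent identifiers, the predicate names and the data values are all hypothetical and are not taken from the S3W data sets or ontologies.

```python
# Hypothetical namespace and identifiers, for illustration only.
EX = "http://example.org/s3w/"

# Each fact is a (subject, predicate, object) triple. Both data sets
# name the occupation "nurse" with the same identifier.
gbcs_triples = {
    (EX + "gbcs/resp/1", EX + "hasOccupation", EX + "occupation/teacher"),
    (EX + "gbcs/resp/2", EX + "hasOccupation", EX + "occupation/nurse"),
}
elsa_triples = {
    (EX + "elsa/resp/10", EX + "hasOccupation", EX + "occupation/nurse"),
    (EX + "elsa/resp/10", EX + "selfRatedHealth", "good"),
}


def subjects_with(triples, predicate, obj):
    """Return, in sorted order, every subject that has the given predicate and object."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


# Because the occupation identifier is shared, merging the two sets of
# triples is enough to query across both surveys at once.
merged = gbcs_triples | elsa_triples
nurses = subjects_with(merged, EX + "hasOccupation", EX + "occupation/nurse")
```

No record matching step is needed here: the link between the two sources exists because both use the same name for the same concept, which is the sense in which linkage is prospective rather than retrospective.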
 
Title Social Class Logic description 
Description This provides a description of the class logics that were derived by the S3W team from the Great British Class Survey in order to assign social class to participants in both GBCS (which allowed us to compare with the class assignment made originally by the GBCS team) and waves 6 & 8 of the English Longitudinal Survey of Ageing. The social class logics which allowed us to do this were derived from two sources: (1) research publications from the GBCS team and (2) reverse engineering the BBC Class Calculator. It is important to note that the GBCS project was inductive, with seven social classes derived from cluster analysis of a large research survey. The BBC Class Calculator was deductive, applying a set of predetermined rules to allocate individuals to the seven social classes depending on their answers to a short survey. We created an algorithm to allocate GBCS social class. It applies a set of rules that also draw on the theoretical and methodological considerations of the GBCS project and are more flexible and nuanced than those of the class calculator. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? No  
Impact Our tool allows us to assign one of seven social classes to participants in both the original GBCS and ELSA waves 6 & 8. Comparing our class assignments with the original GBCS outcomes, we see good correspondence. 
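A rule-based allocation of this general kind can be sketched as follows. This is a minimal sketch under stated assumptions, not the S3W algorithm: the cut-points, the three-band scheme and the two rules shown are hypothetical, and the actual capital thresholds and class logics are those deposited with the S3W archive. Only the class labels 'Elite' and 'Precariat' are taken from the GBCS seven-class typology.

```python
# Hypothetical cut-points (low/medium boundary, medium/high boundary).
# These are NOT the thresholds deposited with the S3W archive.
THRESHOLDS = {
    "economic": (20000.0, 60000.0),
    "social": (30.0, 50.0),
    "cultural": (10.0, 20.0),
}


def band(capital, value):
    """Convert a raw capital measure into a 'low', 'medium' or 'high' band."""
    low, high = THRESHOLDS[capital]
    if value < low:
        return "low"
    if value < high:
        return "medium"
    return "high"


# Hypothetical ordered rules: the first rule whose bands all match wins.
RULES = [
    ("Elite", {"economic": "high", "social": "high", "cultural": "high"}),
    ("Precariat", {"economic": "low", "social": "low", "cultural": "low"}),
]


def assign_class(measures):
    """Return the first matching class label, or 'unassigned' if no rule matches."""
    bands = {capital: band(capital, value) for capital, value in measures.items()}
    for label, required in RULES:
        if all(bands.get(capital) == level for capital, level in required.items()):
            return label
    return "unassigned"
```

For example, assign_class({"economic": 80000, "social": 55, "cultural": 25}) returns "Elite" under these illustrative cut-points. Keeping the thresholds and the rules in separate tables mirrors the way the record describes them as separate tools that must be read together.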
 
Title Summary of Research Tools Developed for the S3W Project 
Description This PowerPoint provides a chronological description of each stage in our project to create semantic linked data from two existing (non-semantic) data sets (the Great British Class Survey and the English Longitudinal Survey of Ageing). The PowerPoint shows the tools that we developed at each stage. In sum, the tools are as follows: (1) the GBCS ontology; (2) the ELSA ontology; (3) the social class ontology, allowing us to model different conceptualisations of social class; (4) the temporal ontology, allowing us to model change over time; (5) the algorithm developed to generate Bourdieusian social class from capital measures; (6) the R2RML converter that was used to convert our data to RDF. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? No  
Impact The aim of this project was to investigate the sociodigital processes involved in creating semantic linked data and their epistemological and ontological underpinnings. We are currently working on applying these tools to a Bourdieusian class analysis of health inequalities. 
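The conversion step at the end of this tool chain, turning rows of tabular survey data into RDF, can be illustrated with a small sketch. It imitates the idea behind an R2RML mapping (a subject template plus a column-to-predicate map) in plain Python; it is not R2RML itself and not the converter used by the project. The example.org namespace, the column names and the sample row are hypothetical.

```python
# Hypothetical namespace; real S3W identifiers are not reproduced here.
BASE = "http://example.org/s3w/"

# An R2RML-like mapping expressed as a plain dictionary: how to build the
# subject identifier, and which predicate each column maps to.
MAPPING = {
    "subject_template": BASE + "elsa/respondent/{id}",
    "columns": {
        "wave": BASE + "wave",
        "self_rated_health": BASE + "selfRatedHealth",
    },
}


def row_to_ntriples(row, mapping):
    """Serialise one tabular row as a list of N-Triples lines with plain literals."""
    subject = mapping["subject_template"].format(**row)
    lines = []
    for column, predicate in mapping["columns"].items():
        value = row.get(column)
        if value is None:
            continue  # a missing answer produces no triple
        # Escape only backslashes and double quotes; a full serialiser
        # would also handle newlines and other control characters.
        literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject}> <{predicate}> "{literal}" .')
    return lines


row = {"id": 42, "wave": 6, "self_rated_health": "good"}
triples = row_to_ntriples(row, MAPPING)
```

Keeping the mapping as data rather than code means the same function can be reused for a second data set by supplying a different mapping, which is the design idea that declarative mapping languages such as R2RML formalise.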
 
Title ELSA ontology 
Description This ontology provides a formal description of the objects, concepts and entities that exist in a subsection of the ELSA wave 6 and wave 8 datasets, in a way that is compatible and interoperable with the S3W GBCS ontology, and can potentially be used to run queries on health inequalities. 
Type Of Material Computer model/algorithm 
Year Produced 2020 
Provided To Others? Yes  
Impact The main impact of this ontology comes from constructing an infrastructure that enables anonymised linkage of concepts across datasets and uses formal logic in a way that allows research to go beyond the usual correlation-only analysis. 
 
Title GBCS ontology 
Description This ontology provides a formal description of the objects, concepts and entities that exist in the GfK dataset, in a way that is compatible and interoperable with the S3W ELSA ontology, and can potentially be used to run queries on health inequalities. 
Type Of Material Computer model/algorithm 
Year Produced 2020 
Provided To Others? Yes  
Impact The main impact of this ontology comes from constructing an infrastructure that enables anonymised linkage of concepts across datasets and uses formal logic in a way that allows research to go beyond the usual correlation-only analysis. 
 
Title S3W ELSA SLD data 
Description As part of our proof of concept demonstrator, we have converted the ELSA wave 6 dataset to Semantic Linked Data (SLD) that enables linkage with the GBCS SLD and future integration with other datasets. The resulting SLD is re-usable and, when used alongside other datasets such as GBCS, has the potential to support queries on social science questions (e.g. health inequalities) that were not easily answered using conventional data linkage techniques. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact Impacts include enabling re-usability and linkage with other datasets for asking research questions about health inequalities that were previously very complex and cumbersome to address using conventional data linkage methods. Our remaining work includes exploring the affordances of SLD for a broader understanding of how class practices and assets, alongside better-known identifiers of occupation and income, shape health over the life course. 
 
Title S3W GBCS SLD Data 
Description As part of our proof of concept demonstrator, we have converted GBCS's GfK dataset to Semantic Linked Data (SLD) that enables linkage with the ELSA SLD and future integration with other datasets. The resulting SLD is re-usable and, when used alongside other datasets such as ELSA, has the potential to support queries on social science questions (e.g. health inequalities) that were not easily answered using conventional data linkage techniques. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact Impacts include enabling re-usability and linkage with other datasets for asking research questions about health inequalities that were previously very complex and cumbersome to address using conventional data linkage methods. Our remaining work includes exploring the affordances of SLD for a broader understanding of how class practices and assets, alongside better-known identifiers of occupation and income, shape health over the life course. 
 
Title S3W Social class ontology 
Description As part of our proof of concept demonstrator, we have designed and developed a social class ontology that formally models entities and relationships related to the notion of social class as conceptualised in the GBCS and ELSA data. The ontology is re-usable and, when used alongside SLD datasets such as the S3W ELSA and GBCS data, has the potential to support queries on social science questions (e.g. health inequalities) that were not easily answered using conventional data linkage techniques. 
Type Of Material Computer model/algorithm 
Year Produced 2020 
Provided To Others? Yes  
Impact Impacts include enabling re-usability and linkage with other datasets for asking research questions about health inequalities that were previously very complex and cumbersome to address using conventional data linkage methods. Also, if used and updated over time, it can enable archiving and the study of how social class is modelled in the future. Our remaining work includes exploring the affordances of SLD for a broader understanding of how class practices and assets, alongside better-known identifiers of occupation and income, shape health over the life course. 
 
Title S3W proof of concept demonstrator 
Description As part of the proof of concept demonstrator we designed and developed (1) the ELSA and GBCS ontologies; (2) algorithms for assigning a GBCS conception of social class in the GBCS and ELSA datasets, which enable queries on social science questions (e.g. health inequalities) across these linked datasets; and (3) visualisations. 
Type Of Technology New/Improved Technique/Technology 
Year Produced 2020 
Impact Impacts include enabling re-usability and linkage with other datasets for asking research questions about health inequalities that were previously very complex, cumbersome and inflexible to address using conventional data linkage methods. Our remaining work includes exploring the affordances of SLD for a broader understanding of how class practices and assets, alongside better-known identifiers of occupation and income, shape health over the life course. 
 
Description S3W workshop 2 - GBCS ontology 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The workshop participants were prominent social scientists and computer scientists from a wide range of academic and governmental institutions such as the London School of Economics, the Office for National Statistics (ONS), the Leibniz Institute for the Social Sciences (GESIS), etc. The workshop started with various presentations from the S3W team, providing an overview of the project and the progress that had been made. The domain experts participating in the workshop contributed to outlining and discussing potential reasons for and approaches to linking the data sets, as well as the opportunities and challenges of the SLD approach to data linkage for social science research. The domain experts were then asked to comment and provide feedback on the project progress presented to them. Contributions from the domain experts were key to shaping the project in the following months and avoiding pitfalls. The outcomes of the activity include interest in the research from the domain experts, contributions to shaping the development of the proof of concept demonstrator showcasing SLD linkage across the GBCS and ELSA datasets, and ethnographic data that are currently being analysed and will result in publications in the coming months. The project and any collaborations on it are inherently interdisciplinary and commonly span a wide range of fields, from sociology and epidemiology to computer science and archival studies.
Year(s) Of Engagement Activity 2019
 
Description S3W workshop 1 - ELSA ontology 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The workshop participants were prominent social scientists and computer scientists from a wide range of academic and governmental institutions such as the University of Manchester, the Office for National Statistics (ONS), the Leibniz Institute for the Social Sciences (GESIS), etc. The workshop started with various presentations from the S3W team, providing an overview of the project and the progress that had been made. The domain experts participating in the workshop contributed to outlining and discussing potential reasons for and approaches to linking the data sets, as well as the opportunities and challenges of the SLD approach to data linkage for social science research. The domain experts were then asked to comment and provide feedback on the project progress presented to them. Advisory Group contributions from this meeting were key to shaping the project in the following months and avoiding pitfalls. The outcomes of the activity include interest in the research from the domain experts, contributions to shaping the development of the proof of concept demonstrator showcasing SLD linkage across the GBCS and ELSA datasets, and ethnographic data that are currently being analysed and will result in publications in the coming months. The project and any collaborations on it are inherently interdisciplinary and commonly span a wide range of fields, from sociology and epidemiology to computer science and archival studies.
Year(s) Of Engagement Activity 2019
 
Description S3W workshop 3 - The demonstrator 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The workshop started with presentations from the S3W team on the progress made and the processes of developing the demonstrator. The domain experts were then given time to engage with the demonstrator, whilst their interactions were observed by the S3W team for the ethnographic side of the project. The domain experts contributed to outlining and discussing the potential, opportunities and challenges of the demonstrator for social science research. They also provided specific feedback on the visualizations and the design of the demonstrator.
Year(s) Of Engagement Activity 2020