Digitising Scotland

Lead Research Organisation: University of Edinburgh
Department Name: School of Geosciences

Abstract

This project aims to digitise the 24 million vital events record images (births, marriages and deaths) for all people in Scotland since 1855 (i.e. transcribe them into machine-encoded text). This will allow research access to individual-level information on some 18 million individuals, a large proportion of those who have lived in Scotland between 1855 and the present day. At the moment these records are kept as indexed images. This means that, to extract any data, a researcher must search for an individual record by name and then manually transcribe the information they need (e.g. cause of death, occupation); this makes any large-scale research impossible. A one-off investment would mean that these records could be made available for major population research.

This dataset will be prepared for linkage to existing longitudinal studies, primarily the Scottish Longitudinal Study (SLS), and more generally to the already highly developed Scottish health informatics systems. This will allow the characteristics (place of birth, age at marriage, occupation, longevity, cause of death etc.) of parents, grandparents and other relatives of those followed by the SLS to be analysed, and will therefore enhance contemporary Scottish and UK health datasets, health informatics systems, longitudinal datasets and genetic studies. People surviving to have children, grandchildren and other descendants are unlikely to be fully representative of the historic population of Scotland, and the complete transcription will importantly enable a complete linkage exercise between the different records to create full or partial life histories for all those experiencing vital events in Scotland since 1855. This will mean that for the first time the UK will have a data system of similar potential depth and breadth to those of the Scandinavian and Low Countries, whose population registers provide such life histories. In those countries, work using "The Demographic Database" at Umeå University, Sweden; the "Historical Population Registers" Project at the Norwegian Historical Data Centre, University of Tromsø, Norway; and the Historical Sample of the Netherlands, based at the International Institute of Social History, Amsterdam, Netherlands, is currently extending knowledge of demography as well as economic and social history over the nineteenth century and the early decades of the twentieth century. Such a dataset will bring the possibility of exploring the condition of the present Scottish population within the context of their families through multiple generations of micro-data.

Planned Impact

This project will not involve the production of research output, although it will play an important role in maximising the impact of outputs of projects by researchers using the dataset produced (see Pathways to Impact). However, in other parts of the application we have outlined how the data may be used. From this it is possible to identify research areas that this data will support and therefore potential beneficiaries. There will of course be many ways that innovative researchers will use this data that we cannot yet envisage.

Those working in Social policy will benefit from a deeper understanding:
- of social mobility across 150 years
- of the impact of a developing welfare state
- of the nature of industrialisation and its impact on populations
- of fertility

Those working in Educational policy will have an insight into
- the impact of changing educational policy over 150 years

Those working in public health will come to better understand
- the effect of food shortages for mothers during pregnancy on the later life (intergenerational) health of the child
- the impact of urbanisation and severe environmental insults on health
- the long-term effect of serious viruses (eg the 1918 influenza pandemic) on the surviving population

Those working in the NHS
- will benefit potentially from better family history risk assessment tools
- will more easily be able to produce genealogies in support of clinical genetic counselling services
- will benefit from refinements to genetic studies gained from extensive pedigrees; this is likely to be a substantial impact

The public
- the Scottish public will benefit from the enhancements to their healthcare system
- may very directly benefit from enhancements to the genetic cancer counselling service and family history as used in clinical practice
- they will also benefit from the added attractiveness to researchers across many sectors - including the biomedical field - of the enhanced data environment that exists in Scotland and that will be enhanced with the dataset.

Related Projects

Project Reference  Relationship  Related To    Start       End         Award Value
ES/K00574X/1                                   01/09/2012  30/10/2014  £2,557,074
ES/K00574X/2       Transfer      ES/K00574X/1  31/10/2014  31/10/2020  £2,457,746
 
Description This project aims to digitise the 24 million vital events record images (births, marriages and deaths) for all people in Scotland since 1855 (i.e. transcribe them into machine-encoded text). This will allow research access to individual-level information on some 18 million individuals, a large proportion of those who have lived in Scotland between 1855 and the present day. At the moment these records are kept as indexed images. This means that, to extract any data, a researcher must search for an individual record by name and then manually transcribe the information they need (e.g. cause of death, occupation); this makes any large-scale research impossible.

This project has 4 main objectives within 4 work packages, to:
[1] digitise vital events records back to 1855.
[2] develop a method and software package to automatically code occupational descriptions to the Historical International Standard Classification of Occupations (HISCO).
[3] develop a frame with which to code cause of death descriptions to a standardised Classification of Disease and produce a software package for separating different causes and then classifying them automatically.
[4] develop a method and software package to link the addresses on the records to a geographical reference that is consistent through time.

The research in work packages 2-4 has demonstrated that it is possible to automatically code the textual information in the records using machine learning techniques. This is of course vital to an endeavour that needs to code 24 million records.

Because this project is currently developing a research infrastructure, there have been no research discoveries as yet; however, one might expect the dataset to enable the following types of questions to be answered.

Those working in Social policy will benefit from a deeper understanding:
- of social mobility across 150 years
- of the impact of a developing welfare state
- of the nature of industrialisation and its impact on populations

Those working in Educational policy will have an insight into
- the impact of changing educational policy over 150 years

Those working in public health will come to better understand
- the effect of food shortages for mothers during pregnancy on the later life (intergenerational) health of the child
- the impact of urbanisation and severe environmental insults on health
- the long-term effect of serious viruses (eg the 1918 influenza pandemic) on the surviving population

Those working in the NHS
- will benefit potentially from better family history risk assessment tools
- will more easily be able to produce genealogies in support of clinical genetic counselling services
- will benefit from refinements to genetic studies gained from extensive pedigrees; this is likely to be substantial

The public
- the Scottish public may benefit from the enhancements to their healthcare system
- may very directly benefit from enhancements to the genetic cancer counselling service and family history as used in clinical practice
- they will also benefit from the added attractiveness to researchers across many sectors - including the biomedical field - of the enhanced data environment that exists in Scotland and that will be enhanced with the dataset.
Exploitation Route Similar data enhancement projects are being attempted in a number of countries. There has been considerable interest in the techniques and software products we have developed. We are actively sharing this learning and these products with those projects.

Specifically, the automatic classification software developed within DS was used to classify (1) English language occupation strings to the HISCO classification system, for Prof. Marco van Leeuwen of the Sociology Department at Utrecht University, and (2) occupations from the newly created Scottish Longitudinal Study (SLS) Birth Cohort of 1936 (SLSBC1936), which includes occupations from the Census-like 1939 National Register. In addition, (3) we are working closely with the CLARIAH project in the Netherlands, a distributed research infrastructure for the humanities and social sciences (http://www.clariah.nl/en/), to share our machine learning software.
Sectors Communities and Social Services/Policy,Digital/Communication/Information Technologies (including Software),Education,Healthcare,Culture, Heritage, Museums and Collections

URL http://www.lscs.ac.uk/projects/digitising-scotland/
 
Description The DS project is not involved in the production of research output and at the present time is focused on data enhancement. However, it is possible to identify areas of research that this data will support and therefore potential beneficiaries. There will of course be many ways that innovative researchers will use this data that we cannot yet envisage. Such research and analysis will enhance contemporary Scottish and UK health datasets, health informatics systems, longitudinal datasets and genetic studies. Many of the new potential research opportunities made available through the DS project are listed on the DS website (http://www.lscs.ac.uk/projects/digitising-scotland/research-potential-of-digitising-scotland). The LSCS centre has put considerable time into organising and structuring this DS work, sharing knowledge and building networks across Europe. Through close collaboration with other countries working on similar data we successfully won a Horizon 2020 Marie Curie Innovative Training Networks (ITN) European Training Networks (ETN) award, 'Methodologies and Data mining techniques for the analysis of Big Data based on Longitudinal Population and Epidemiological Registers' (http://cordis.europa.eu/project/rcn/200475_en.html). We will have 2 early stage researchers (ESRs) based in Edinburgh in close association with DS, along with 4 research stays from other ESRs within the network, all working in historical demography on short research projects aligned with the DS project. The 4 visiting ESRs will take back learning from their stays within the LSCS centre to their home institutions to enhance further research projects. Similar historical data enhancement projects are being attempted in a number of countries. Through workshops and networks there has been considerable interest in the techniques and software products that we have developed in DS for automatically coding text (occupations and causes of death) and for geolocation.
We are sharing this learning and these products with those projects, given that the DS work will have further applications; it has already been used by 2 projects. Specifically, the automatic classification software developed within DS was used to classify English language occupation strings to the HISCO classification system, for Prof. Marco van Leeuwen of the Sociology Department at Utrecht University. Additionally, another project within the wider LSCS centre involves the creation of the new Scottish Longitudinal Study (SLS) Birth Cohort of 1936 (SLSBC1936), which includes information from the Scottish Mental Survey of 1947 (SMS1947) - a cognitive ability test that almost all Scottish children born in 1936 sat - and links it to the Census-like 1939 National Register, the current NHS Central Register (NHSCR) and the SLS. We coded the occupations from this cohort using the DS automatic classification software.
First Year Of Impact 2015
Sector Digital/Communication/Information Technologies (including Software),Culture, Heritage, Museums and Collections
Impact Types Societal

 
Title Automatic classification for occupation and cause of death strings 
Description The software uses Apache Mahout machine learning algorithms to classify unseen strings from historical records. It uses a sample of human-classified records to train the machine learning models. The software has been used to classify occupation and cause-of-death strings to the HISCO and ICD-10 coding systems, respectively. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact None as yet. 
URL http://digitisingscotland.cs.st-andrews.ac.uk/record_classification/index.html
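
The general idea of training a classifier on human-classified strings and applying it to unseen ones can be sketched as follows. This is only an illustrative sketch in plain Python, not the project's Mahout-based implementation: the naive Bayes model, the function names, the training strings and the labels (which are placeholders, not real HISCO or ICD-10 codes) are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokens(text):
    """Lower-case a raw string and split it into word tokens."""
    return text.lower().split()


def train(labelled):
    """Count tokens per label from (string, label) pairs."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled:
        label_counts[label] += 1
        for tok in tokens(text):
            token_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, token_counts, vocab


def classify(text, model):
    """Return the label with the highest add-one-smoothed log-probability."""
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens(text):
            score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical human-classified sample; labels are placeholders, not real codes.
TRAINING = [
    ("farm servant", "AGRICULTURE"),
    ("ploughman", "AGRICULTURE"),
    ("farm labourer", "AGRICULTURE"),
    ("cotton weaver", "TEXTILES"),
    ("hand loom weaver", "TEXTILES"),
    ("wool spinner", "TEXTILES"),
]

model = train(TRAINING)
print(classify("farm worker", model))
print(classify("linen weaver", model))
```

A real system would need far more training data, handling of spelling variants and a way to flag low-confidence strings for manual review; none of that is modelled here.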
 
Title Historical Address Geocoder (HAG-GIS) 1.0.0 
Description The Historical Address Geocoder (HAG-GIS) is a Python 2.7 program for automating the geocoding process for the Digitising Scotland project. The geocoding process involves fuzzy-matching historical records with contemporary addresses. This automated system takes into account spatial information derived from historical administrative data, improving the accuracy of the geocoded historical addresses and producing geographic boundaries at small administrative scales where geographical boundaries are not available. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact The HAG-GIS software is a core tool for the geocoding process of the Digitising Scotland project, which has the potential to be of high impact in a variety of different fields, whether through the digitisation of the records alone or through the further coding and linking of the data. It will allow researchers to study events for some 18 million individuals over 150 years, allowing the quantification of change in the important characteristics of Scottish society over time and space. 
URL http://lscs-projects.github.io/HAGGIS/
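
Fuzzy-matching a historical address against a contemporary address list can be illustrated with a short sketch. This is not the HAG-GIS code: it uses Python 3 and the standard-library `difflib.SequenceMatcher`, and the gazetteer entries, placeholder coordinates, abbreviation table, function names and the 0.8 threshold are all assumptions chosen for the example. The spatial constraints from historical administrative data described above are not modelled.

```python
from difflib import SequenceMatcher

# Hypothetical modern gazetteer: address -> made-up placeholder (x, y) coordinates.
GAZETTEER = {
    "12 high street": (100.0, 200.0),
    "3 mill lane": (150.0, 220.0),
    "7 church road": (180.0, 260.0),
}


def normalise(address):
    """Lower-case, strip simple punctuation and expand a few abbreviations."""
    words = address.lower().replace(",", " ").replace(".", " ").split()
    expansions = {"st": "street", "rd": "road", "ln": "lane"}
    return " ".join(expansions.get(w, w) for w in words)


def geocode(historical, gazetteer=GAZETTEER, threshold=0.8):
    """Return (matched address, coordinates, score), or None if no match is close enough."""
    query = normalise(historical)
    best, best_score = None, 0.0
    for candidate in gazetteer:
        # ratio() is a similarity in [0, 1]; 1.0 means identical strings.
        score = SequenceMatcher(None, query, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best is None or best_score < threshold:
        return None
    return best, gazetteer[best], best_score


print(geocode("12 High St."))
print(geocode("3 Mil Lane"))
print(geocode("Unknown Croft"))
```

Returning `None` below the threshold, rather than the nearest candidate, keeps doubtful matches visible for manual checking instead of silently assigning a location.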
 
Title SMARTS (SQL, Management, Accountable, Reliable, and Tracking System) 
Description Developed in house by CDDA, SMARTS allows for the remote tracking of data entry projects, including time spent entering data, accuracy levels by individual, rekeying activities and supervisor management. It is being rolled out by CDDA to other data capture projects. 
Type Of Technology Webtool/Application 
Year Produced 2018 
Impact Allows for accurate tracking of data capture work 
 
Title Synthetic population generator 
Description The software generates synthetic human populations. It is configurable with a number of probability distributions including longevity, number of children, occupation, age at marriage, parenthood etc. The software requires a Java runtime. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact None as yet; it is hoped that the application will be useful in the evaluation and comparison of population linkage techniques. 
URL http://digitisingscotland.cs.st-andrews.ac.uk/population_model/
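
The idea of generating a synthetic population from configurable probability distributions can be sketched in a few lines. This is a simplified illustration in Python, not the project's Java software: the configuration keys, the `Person` record, the single-parent model and every number in the distributions are made-up assumptions for the example.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional

# Made-up illustrative distributions; not estimates from any real data.
CONFIG = {
    "start_year": 1855,
    "lifespan": {"values": [1, 35, 60, 80], "weights": [0.15, 0.2, 0.35, 0.3]},
    "children": {"values": [0, 1, 3, 6], "weights": [0.2, 0.2, 0.4, 0.2]},
    "parent_age": (20, 40),  # inclusive range of the parent's age at a birth
}


@dataclass
class Person:
    ident: int
    birth_year: int
    death_year: int
    parent: Optional[int] = None
    children: List[int] = field(default_factory=list)


def draw(rng, dist):
    """Sample one value from a discrete distribution."""
    return rng.choices(dist["values"], weights=dist["weights"])[0]


def generate(founders, generations, config=CONFIG, seed=0):
    """Return a list of Person records for founders and their descendants."""
    rng = random.Random(seed)
    people = []

    def add(birth_year, parent=None):
        person = Person(len(people), birth_year,
                        birth_year + draw(rng, config["lifespan"]), parent)
        people.append(person)
        return person

    current = [add(config["start_year"]) for _ in range(founders)]
    for _ in range(generations):
        following = []
        for parent in current:
            for _ in range(draw(rng, config["children"])):
                birth = parent.birth_year + rng.randint(*config["parent_age"])
                if birth <= parent.death_year:  # parent must be alive at the birth
                    child = add(birth, parent.ident)
                    parent.children.append(child.ident)
                    following.append(child)
        current = following
    return people


population = generate(founders=10, generations=3, seed=1)
print(len(population), "synthetic people generated")
```

Because the true family links are known by construction, a population like this can serve as ground truth when comparing linkage methods; fixing the seed makes runs repeatable.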
 
Description BitBlaster: A fast complete similarity search algorithm - Graham Kirby 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Research talk at a meeting of the St Andrews Institute for Data-Intensive Research
Year(s) Of Engagement Activity 2018
URL http://www.idir.st-andrews.ac.uk/wp-content/uploads/2018/09/BitBlaster.pdf
 
Description Digitising Scotland Summer Seminar 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact A seminar organised by the Digitising Scotland team over two days - day one was a day of sharing information on procedural catch ups and progress updates. Day two was a workshop format focusing on research from the various Digitising Scotland groupings along with planning and the next steps following transcription.
Year(s) Of Engagement Activity 2017
 
Description Digitising Scotland Summer Workshop 2018 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact The Digitising Scotland Summer one-day workshop was held on 28th August 2018 in Edinburgh to bring together the various DS collaborators for progress updates, information sharing and discussion of planning and next steps.
Year(s) Of Engagement Activity 2018
 
Description Geocoding 24 million historical addresses in Scotland (HGIS meeting UoE) 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The talk sparked questions and discussion afterwards; its purpose was to share H-GIS methods.
Year(s) Of Engagement Activity 2014
 
Description Geocoding 24 million historical addresses in Scotland from 1855 to 1974 (EHPS-Net Meeting) 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The talk sparked questions and discussion afterwards; its purpose was to share H-GIS methods.
Year(s) Of Engagement Activity 2014
URL http://www.ehps-net.eu/news/workshop-working-group-9
 
Description Involvement in Health in Port Cities project, Dr Eilidh Garrett 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Dr Eilidh Garrett attended a formal working group to discuss the cause of death coding system devised for the Digitising Scotland project with international colleagues wishing to devise a scheme capable of being used across the countries of Europe and beyond, and for a variety of time periods.
Year(s) Of Engagement Activity 2018
URL https://www.ru.nl/historicaldemography/research-projects/ship/
 
Description On 3-4 July 2015, Working Group 9 - GIS - held a workshop 'Integrating time, space and individual life stories'. 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact 20 European experts in GIS and historical research met in Edinburgh to discuss advances in their particular areas of research. New ideas and potential future collaborations were identified.
Year(s) Of Engagement Activity 2015
 
Description Participation in a 'Hackathon' ('Thyynge') at St Andrews' Department of Computing Science 2nd - 4th November in St Andrews 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact The Digitising Scotland team was involved in this 'Hackathon', which was organised following the Skye Data Linkage Workshop in August 2016. The 'Hackathon' (also called 'Thyynge') was held over three days, 2nd - 4th November, at St Andrews' Department of Computing Science. It brought together teams of computer scientists from the Universities of Edinburgh, St Andrews, Heriot-Watt and ANU with demographic historians from Cambridge, Albany, Madrid and Edinburgh.
Year(s) Of Engagement Activity 2016
 
Description Participation in the EHPS-NET workshop, University of Edinburgh, UK 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation on "HAG-GIS: An advanced system for geocoding historical addresses in Scotland" at the EHPS-NET workshop, Working Group 9, "GIS - integrating time, space and individual life stories", University of Edinburgh, UK (Daras K., Feng Z., Dibben C. & Williamson L., 2015).

The main purpose of the workshop was to discuss, compare and develop methods and standards for storage, integration, analyses and visualization of data.
Year(s) Of Engagement Activity 2015
URL http://www.research.ed.ac.uk/portal/en/activities/ehpsnet-meeting-working-group-9--gis(82e20892-af17...
 
Description Presentation at Skye Data Linkage Workshop 27th August 2016 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact The Digitising Scotland team presented 'Historical record linkage on the Isle of Skye: A colloquium for historians and computer scientists' at the Skye Data Linkage event for PhD students and early career researchers. The workshop brought together teams of computer scientists from the Universities of Edinburgh, St Andrews, Heriot-Watt and ANU with demographic historians from Cambridge, Albany, Madrid and Edinburgh. Much of the workshop was spent in dialogue between the historians and the computer scientists, to learn each other's language(s) and ways of thinking. The historians realised that not everyone can read historic documents and that data can be organised and thought of in a myriad of ways. The computer scientists learned that very little historical data is straightforward and that historians do not look at data in the same way at all. Progress was made and mutual understanding moved forward. A further activity, a 'hackathon', was arranged to follow up the event.
Year(s) Of Engagement Activity 2016
 
Description Reconstructing Historical Populations: How Do You Know if it Worked? - Graham Kirby 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Research talk at a meeting of the St Andrews Institute for Data-Intensive Research, October 2018.
Year(s) Of Engagement Activity 2018
URL http://www.idir.st-andrews.ac.uk
 
Description Spatial Humanities Expert Meeting, Lancaster 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Alice Reid, Eilidh Garrett and Joe Day were invited to attend, as experts, a Spatial Humanities Expert Meeting on Monday 28 November 2016 to discuss prospects for research in demographic history and the history of public health and health inequalities as part of the Spatial Humanities Project at the University of Lancaster (ERC funded). There were a number of presentations from the Spatial Humanities team, the experts attending, and a general discussion about possible new routes of enquiry.
Year(s) Of Engagement Activity 2016
 
Description Understanding the linking possibilities in Scottish records and an algorithmic approach to full linkage - Graham Kirby 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The purpose was to disseminate research on multiple linkage opportunities within rich demographic datasets, and to promote discussion.
Year(s) Of Engagement Activity 2018
URL https://ijpds.org/article/view/508
 
Description Validating synthetic longitudinal populations for evaluation of population data linkage - Graham Kirby 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The purpose was to disseminate research on synthesising realistic population data for evaluation of linkage approaches, and to promote discussion.
Year(s) Of Engagement Activity 2018
URL https://ijpds.org/article/view/504