Digitising Scotland

Lead Research Organisation: University of St Andrews
Department Name: School of Geography and Geosciences

Abstract

This project aims to digitise the 24 million vital events record images (births, marriages and deaths) for all people in Scotland since 1855 (i.e. transcribe them into machine-encoded text). This will allow research access to individual-level information on some 18 million individuals, a large proportion of those who have lived in Scotland between 1855 and the present day. At the moment these records are kept as indexed images. This means that to extract any data, a researcher must search for an individual record by name and then manually transcribe the information they need (e.g. cause of death, occupation); this makes any large-scale research impossible. A one-off investment would mean that these records could be made available for major population research.

This dataset will be prepared for linkage to existing longitudinal studies, primarily the Scottish Longitudinal Study (SLS), and more generally to the already highly developed Scottish health informatics systems. This will allow the characteristics (place of birth, age at marriage, occupation, longevity, cause of death etc.) of parents, grandparents and other relatives of those followed by the SLS to be analysed, and will therefore enhance contemporary Scottish and UK health datasets, health informatics systems, longitudinal datasets and genetic studies. People surviving to have children, grandchildren and other descendants are unlikely to be fully representative of the historic population of Scotland, so the complete transcription will, importantly, enable a complete linkage exercise between the different records to create full or partial life histories for all those experiencing vital events in Scotland since 1855. This will mean that for the first time the UK will have a data system of similar potential depth and breadth to those of the Scandinavian countries and the Low Countries, whose population registers provide such life histories. In those countries, work using "The Demographic Database" at Umeå University, Sweden; the "Historical Population Registers" Project at the Norwegian Historical Data Centre, University of Tromsø, Norway; and the Historical Sample of the Netherlands, based at the International Institute of Social History, Amsterdam, Netherlands, is currently extending knowledge of demography as well as economic and social history over the nineteenth century and the early decades of the twentieth. Such a dataset will bring the possibility of exploring the condition of the present Scottish population within the context of their families through multiple generations of micro-data.

Planned Impact

This project will not involve the production of research output, although it will play an important role in maximising the impact of outputs of projects by researchers using the dataset produced (see Pathways to Impact). However, in other parts of the application we have outlined how the data may be used. From this it is possible to identify research areas that this data will support and therefore potential beneficiaries. There will of course be many ways, which we cannot yet envisage, in which innovative researchers will use this data.

Those working in social policy will benefit from a deeper understanding of:
- social mobility across 150 years
- the impact of a developing welfare state
- the nature of industrialisation and its impact on populations
- fertility

Those working in educational policy will have an insight into:
- the impact of changing educational policy over 150 years

Those working in public health will come to better understand:
- the effect of food shortages for mothers during pregnancy on the later-life (intergenerational) health of the child
- the impact of urbanisation and severe environmental insults on health
- the long-term effect of serious viral outbreaks (e.g. the 1918 influenza pandemic) on the surviving population

Those working in the NHS:
- will potentially benefit from better family history risk assessment tools
- will more easily be able to produce genealogies in support of clinical genetic counselling services
- will benefit from the refinement to genetic studies that will be gained from extensive pedigrees; this is likely to be a substantial impact

The public:
- the Scottish public will benefit from the enhancements to their healthcare system
- they may very directly benefit from enhancements to the genetic cancer counselling service and to family history as used in clinical practice
- they will also benefit from the added attractiveness to researchers across many sectors, including the biomedical field, of the enhanced data environment that exists in Scotland and that will be further enhanced by the dataset
 
Description This project aims to digitise the 24 million vital events record images (births, marriages and deaths) for all people in Scotland since 1855 (i.e. transcribe them into machine-encoded text). This will allow research access to individual-level information on some 18 million individuals, a large proportion of those who have lived in Scotland between 1855 and the present day. At the moment these records are kept as indexed images. This means that to extract any data, a researcher must search for an individual record by name and then manually transcribe the information they need (e.g. cause of death, occupation); this makes any large-scale research impossible.

This project has 4 main objectives within 4 work packages (WP), to:
[WP1] digitise vital events records back to 1855.
[WP2] develop a method and software package to automatically code occupational descriptions to the Historical International Standard Classification of Occupations (HISCO).
[WP3] develop a frame with which to code cause of death descriptions to a standardised classification of disease, and produce a software package for separating different causes and then classifying them automatically.
[WP4] develop a method and software package to link the addresses on the records to a geographical reference that is consistent through time.
For WP1, 6 million records so far - a fifth of the entire transcription project - have been transcribed, which equates to just over 950 million characters. The research in WPs 2-4 has demonstrated that it is possible to automatically code the textual information in the records using machine learning techniques. This is of course vital to an endeavour that needs to code 24 million records.
Because this project is currently developing a research infrastructure, there have been no research discoveries as yet; however, one might expect the dataset to enable the following types of questions to be answered.
Those working in social policy will benefit from a deeper understanding of:
- social mobility across 150 years
- the impact of a developing welfare state
- the nature of industrialisation and its impact on populations
Those working in educational policy will have an insight into:
- the impact of changing educational policy over 150 years
Those working in public health will come to better understand:
- the effect of food shortages for mothers during pregnancy on the later-life (intergenerational) health of the child
- the impact of urbanisation and severe environmental insults on health
- the long-term effect of serious viral outbreaks (e.g. the 1918 influenza pandemic) on the surviving population
Those working in the NHS:
- will potentially benefit from better family history risk assessment tools
- will more easily be able to produce genealogies in support of clinical genetic counselling services
- will benefit from the refinement to genetic studies that will be gained from extensive pedigrees; this is likely to be substantial
The public:
- the Scottish public may benefit from the enhancements to their healthcare system
- they may very directly benefit from enhancements to the genetic cancer counselling service and to family history as used in clinical practice
- they will also benefit from the added attractiveness to researchers across many sectors, including the biomedical field, of the enhanced data environment that exists in Scotland and that will be further enhanced by the dataset
Exploitation Route Similar data enhancement projects are being attempted in a number of countries. There has been considerable interest in the techniques and software products we have developed, and we are actively sharing this learning and these products with those projects. Specifically: (1) the automatic classification software developed within DS was used to classify English-language occupation strings to the HISCO classification system for Prof. Marco van Leeuwen of the Sociology Department at Utrecht University; (2) the same software was used to classify occupations from the newly created Scottish Longitudinal Study (SLS) Birth Cohort of 1936 (SLSBC1936), which includes occupations from the Census-like 1939 National Register; and (3) we are working closely with the CLARIAH project in the Netherlands, a distributed research infrastructure for the humanities and social sciences (http://www.clariah.nl/en/), to share our machine learning software.
Sectors Communities and Social Services/Policy, Education, Healthcare, Culture, Heritage, Museums and Collections

URL http://www.lscs.ac.uk/projects/digitising-scotland/
 
Description The DS project will not involve the production of research output and at present is focused on data enhancement. However, it is possible to identify areas of research that this data will support and therefore potential beneficiaries. There will of course be many ways, which we cannot yet envisage, in which innovative researchers will use this data. Such research and analysis will enhance contemporary Scottish and UK health datasets, health informatics systems, longitudinal datasets and genetic studies. Many of the new potential research opportunities made available through the DS project are listed on the new DS website (https://digitisingscotland.ac.uk). The LSCS centre has put considerable time into organising and structuring this DS work, sharing knowledge and building networks across Europe. Through close collaboration with other countries working on similar data, we won a Horizon 2020 Marie Curie Innovative Training Networks (ITN) European Training Network (ETN), 'Methodologies and Data mining techniques for the analysis of Big Data based on Longitudinal Population and Epidemiological Registers' (http://cordis.europa.eu/project/rcn/200475_en.html). We have 2 early stage researchers (ESRs) based in Edinburgh in close association with DS, along with 4 research stays from other ESRs within the network, all working in historical demography on short research projects aligned with the DS project. The 4 visiting ESRs will take back learning from their stays within the LSCS centre to their home institutions to enhance further research projects. Similar historical data enhancement projects are being attempted in a number of countries. At workshops and within networks there has been considerable interest in the techniques and software products that DS has developed for automatically coding text (occupations and causes of death) and for geolocation.
We are sharing this learning and these products with those projects, given that the DS work will have further applications; it has already been used by 2 projects. Specifically, the automatic classification software developed within DS was used to classify English-language occupation strings to the HISCO classification system for Prof. Marco van Leeuwen of the Sociology Department at Utrecht University. Additionally, another project within the wider LSCS centre involves the creation of the new Scottish Longitudinal Study (SLS) Birth Cohort of 1936 (SLSBC1936), which takes information from the Scottish Mental Survey of 1947 (SMS1947), a cognitive ability test that almost all Scottish children born in 1936 sat, and links it to the Census-like 1939 National Register, the current NHS Central Register (NHSCR) and the SLS. We coded the occupations from this cohort using the DS automatic classification software.
First Year Of Impact 2014
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Societal

 
Description Digitising Scotland (DS) Project 
Organisation National Records of Scotland
Country United Kingdom 
Sector Public 
PI Contribution Work package 1 of the DS project will digitise the 24 million Scottish vital events record images (births, marriages and deaths) since 1855. At the moment these records are kept as indexed images accessible from the Scotland's People website. Digitisation will enhance the currently held electronic index and allow research access to individual-level information on some 18 million individuals, a large proportion of those who have lived in Scotland since 1855. For this to be possible, National Records of Scotland (NRS) needed to work in partnership with the DS team. We have established this partnership with NRS over a 5-year period, successfully encouraging NRS to contribute their own staff time and facilities to support this work package.
Collaborator Contribution National Records of Scotland are responsible for appointing and managing a supplier of digitising services to transcribe textual information from images to enhance the currently held electronic index.
Impact None of the DS outputs so far have resulted directly from the partnership, as transcription is only in the pilot phase.
Start Year 2012
 
Title Automatic classification for occupation and cause of death strings 
Description The software uses Apache Mahout machine learning algorithms to classify unseen strings from historical records. It uses a sample of human-classified records to train the machine learning models. The software has been used to classify occupation and cause-of-death strings to the HISCO and ICD-10 coding systems, respectively. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact None as yet. 
URL http://digitisingscotland.cs.st-andrews.ac.uk/record_classification/index.html
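
As an illustration of the general approach described in this entry (training on a sample of human-classified strings, then classifying unseen strings), the following is a minimal sketch in Python. It is not the project's software and does not use Apache Mahout: it swaps in a small hand-written multinomial naive Bayes classifier over word tokens. The class name, the training strings and the labels (AGRICULTURE, MINING, TEXTILES) are hypothetical stand-ins and are not real HISCO or ICD-10 codes.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case a raw string and split it into word tokens."""
    return text.lower().split()


class NaiveBayesStringClassifier:
    """Multinomial naive Bayes over word tokens with add-one smoothing.

    Illustrative stand-in only; the real software is described as
    using Apache Mahout, which this sketch does not reproduce.
    """

    def __init__(self):
        self.label_counts = Counter()             # training strings per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocabulary = set()

    def train(self, labelled_strings):
        """Learn from (string, label) pairs classified by a human coder."""
        for text, label in labelled_strings:
            self.label_counts[label] += 1
            for token in tokenize(text):
                self.token_counts[label][token] += 1
                self.vocabulary.add(token)

    def classify(self, text):
        """Return the label with the highest log-probability for text."""
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label in sorted(self.label_counts):
            # Log prior: how common the label is in the training sample.
            score = math.log(self.label_counts[label] / total)
            label_total = sum(self.token_counts[label].values())
            for token in tokenize(text):
                count = self.token_counts[label][token]
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                score += math.log((count + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical human-classified sample; labels are placeholders, not HISCO codes.
TRAINING_SAMPLE = [
    ("farm servant", "AGRICULTURE"),
    ("ploughman", "AGRICULTURE"),
    ("farm labourer", "AGRICULTURE"),
    ("coal miner", "MINING"),
    ("ironstone miner", "MINING"),
    ("cotton weaver", "TEXTILES"),
    ("power loom weaver", "TEXTILES"),
]

classifier = NaiveBayesStringClassifier()
classifier.train(TRAINING_SAMPLE)
```

With this toy sample, classifier.classify("hand loom weaver") returns "TEXTILES" even though "hand" never appears in the training data, because the remaining tokens are shared with strings already classified under that label.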
 
Title Synthetic population generator 
Description The software generates synthetic human populations. It is configurable with a number of probability distributions, including those for longevity, number of children, occupation, age at marriage and parenthood. The software requires a Java runtime.
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact None as yet; it is hoped that the application will be useful in the evaluation and comparison of population linkage techniques. 
URL http://digitisingscotland.cs.st-andrews.ac.uk/population_model/
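
The idea of a population generator configured by probability distributions can be sketched as follows. This is a hedged illustration in Python, not the project's Java software: the names SyntheticPerson, generate_population and EXAMPLE_DISTRIBUTIONS are hypothetical, the distributions and their weights are invented for demonstration only, and the rule that only married people have children is a simplifying assumption.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyntheticPerson:
    birth_year: int
    age_at_death: int
    age_at_marriage: Optional[int]  # None if the person died before marrying
    number_of_children: int


# Each entry maps an attribute to a sampler taking a random.Random instance.
# The values and weights are made up for illustration, not historical estimates.
EXAMPLE_DISTRIBUTIONS = {
    "age_at_death": lambda rng: rng.choices([1, 30, 60, 80], weights=[2, 1, 3, 4])[0],
    "age_at_marriage": lambda rng: rng.randint(18, 35),
    "number_of_children": lambda rng: rng.choices(
        [0, 1, 2, 3, 4, 5, 6], weights=[1, 2, 3, 3, 2, 1, 1]
    )[0],
}


def generate_population(size, distributions, seed=0, start_year=1855, end_year=1900):
    """Generate `size` synthetic people by sampling each configured distribution.

    A fixed seed makes the output reproducible, which matters when the
    population is used to compare linkage techniques on identical data.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(size):
        age_at_death = distributions["age_at_death"](rng)
        marriage_age = distributions["age_at_marriage"](rng)
        # Simplifying assumption: a person marries only if they live long enough.
        married = marriage_age <= age_at_death
        people.append(
            SyntheticPerson(
                birth_year=rng.randint(start_year, end_year),
                age_at_death=age_at_death,
                age_at_marriage=marriage_age if married else None,
                number_of_children=(
                    distributions["number_of_children"](rng) if married else 0
                ),
            )
        )
    return people


population = generate_population(200, EXAMPLE_DISTRIBUTIONS, seed=42)
```

Because the true relationships in a synthetic population are known by construction, a generated population of this kind could serve as ground truth when evaluating a linkage method.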
 
Description Geocoding 24 million historical addresses in Scotland (HGIS meeting UoE) 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The talk sparked questions and discussion afterwards, as its purpose was to share H-GIS methods.
Year(s) Of Engagement Activity 2014
 
Description Geocoding 24 million historical addresses in Scotland from 1855 to 1974 (EHPS-Net Meeting) 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The talk sparked questions and discussion afterwards, as its purpose was to share H-GIS methods.
Year(s) Of Engagement Activity 2014
URL http://www.ehps-net.eu/news/workshop-working-group-9
 
Description Spatial Humanities Expert Meeting, Lancaster 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Alice Reid, Eilidh Garrett and Joe Day were invited to attend, as experts, a Spatial Humanities Expert Meeting on Monday 28 November 2016 to discuss prospects for research in demographic history and the history of public health and health inequalities as part of the Spatial Humanities Project at the University of Lancaster (ERC funded). There were a number of presentations from the Spatial Humanities team, the experts attending, and a general discussion about possible new routes of enquiry.
Year(s) Of Engagement Activity 2016