Heritage Connector

Lead Research Organisation: Science Museum Group
Department Name: Science Museum Research


As with almost all data, museum collection catalogues are largely unstructured, variable in consistency and overwhelmingly composed of thin records. This is largely a legacy of the development of these catalogues from handwritten paper records. The depth and form of collection catalogues have been primarily guided by collection management needs (records of acquisition, administration of loans, provenance documentation, etc.) where unstructured data can fulfil the needs of the organisation and comply with collection management standards. When computer technology was adopted for collection management in the 1980s, it was implemented to handle these same back-office tasks rather than to support public access. The resulting form of the catalogues means that the potential for new forms of digital research, access and scholarly enquiry remains dormant, and searching across collections is currently possible only through aggregation, which is labour-intensive to implement, or through third-party search engines, whose results are unreliable. In this project, we will apply a battery of digital techniques to connect similar, identical and related items within and across collections. Our primary research question is "How can existing digital tools and methods be used to build relationships at scale between poorly and inconsistently catalogued digitised collection objects and other content sources?"

Since the turn of the twenty-first century enormous and growing volumes of material have been digitised, and catalogues have begun to address the needs of digital public access. However, this has been mainly at an institutional level or via a handful of content aggregators and thus the enhancement of catalogues for the purposes of public access has been driven by the needs of individual collection websites, with little or no interlinking to other collections or content sources. Where that linking (people, places, events, objects, etc.) does exist it has been undertaken by human intervention, and because of the number of records, it has been limited in scale and scope and rarely an ongoing endeavour despite the evolving nature of the catalogue.

Alongside the digitisation of collections, recent years have seen a growth in the publication online of scholarly research related to heritage collections: open access journals, theses, and other online resources. However, beyond the host institution, references to this material are rarely, if ever, ingested into the underlying collections systems and made available via links from related collection websites. This project will therefore also use computer analysis to attempt to identify and build links to this material. Finally, structured data and rich linking are an increasingly urgent concern as new forms of discovery and access emerge - notably artificial intelligence powered discovery and new interfaces such as voice search - that rely on these for their functionality.

This project will explore an alternative approach - a "Heritage Connection Engine" - that will analyse catalogues, published material and knowledge graphs, and build links at massive scale between these that can then be used for new forms of research. It will explore the opportunity for computer generated links with Wikidata to provide new levels of structure and machine-readable data that can form the foundation of new types of discovery and access. The "Heritage Connection Engine" will use a range of technologies including machine learning; named entity recognition; open data; and persistent IDs. These methods will create a large-scale data source of links, each with a confidence ranking. Computational enquiry into the generated links via an application programming interface (API) will enable the creation of a range of proof-of-concept research and discovery tools. All software will be documented and released under an open source licence. All datasets will be released under the Creative Commons Zero licence.
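
As a rough illustration of what "links, each with a confidence ranking" could look like to a tool that consumes them, the sketch below models a link as a small record and ranks the links for one catalogue record. The field names, the identifier formats, the 0 to 1 confidence scale and the threshold are all assumptions made for this example; the project's actual data model and API are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One generated link. All field names are hypothetical."""
    source: str        # catalogue record identifier (made-up format)
    target: str        # linked entity identifier (placeholder values below)
    relation: str      # kind of link, e.g. "same_as" or "made_by"
    confidence: float  # assumed to be a score between 0 and 1


def ranked_links(links, source, min_confidence=0.5):
    """Return links from `source` at or above the threshold, most confident first."""
    kept = [l for l in links if l.source == source and l.confidence >= min_confidence]
    return sorted(kept, key=lambda l: l.confidence, reverse=True)


# Illustrative data only: identifiers and scores are invented for the example.
LINKS = [
    Link("object:1", "entity:A", "same_as", 0.92),
    Link("object:1", "entity:B", "same_as", 0.35),
    Link("object:1", "entity:C", "made_by", 0.71),
    Link("object:2", "entity:D", "same_as", 0.88),
]

for link in ranked_links(LINKS, "object:1"):
    print(link.target, link.relation, link.confidence)
```

Keeping the score on every link, rather than discarding low-scoring links up front, lets each downstream tool choose its own threshold.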

Planned Impact

As a public-facing institution, Science Museum Group (SMG) will disseminate the results of Heritage Connector beyond the academy to broad publics through the development of a thoroughly integrated and cross-referenced digital catalogue. This resource will unite files from different museum holdings, producing more relevant search results from databases. Users will gain unprecedented access to museum collections and archives, enriching knowledge of Britain's past and legacy in the arts and sciences, with long-term value for materials in store and on display. Improved collections knowledge will inform future gallery and exhibition development plans.
VISITORS TO SMG: With improved access to museum collections and archive holdings, SMG will be able to transform its programme of cultural events, broadening its scope to include hitherto overlooked or neglected topics and objects. These initiatives will directly impact the experiences of SMG's 6 million annual visitors to our five national museum sites. Public events such as the two-day hackathon will provide further opportunities for a specific public to engage critically with project activities, and contribute to the research findings.
VISITORS TO UK HERITAGE ATTRACTIONS MORE BROADLY: The ability to discover related material across heritage organisations representing different disciplinary areas will lead to new and richer narratives, enhancing the interdisciplinary offer in displays and programming across organisations to the benefit of visitors who will be exposed to a new, wide, range of experiences.
MUSEUM & HERITAGE PROFESSIONALS: Museum and heritage professionals will benefit from our aim to transform digital cataloguing practices. The production of more thoroughly integrated and cross-referenced national museum records will directly benefit curatorial activities and perspectives. The end of project conference and publications will extend the findings of the project further, reaching colleagues across the world. Cross-institutional collaboration between SMG and the V&A will also foster long-term research partnerships and enrich knowledge of their shared history.
WEBSITE AND ONLINE RESOURCES: SMG's 11 million online visitors will have improved digital access to museum objects and records. Improved image libraries with links to contextual information will allow users to realise the full breadth of SMG's collection and engage with material typically unavailable to the public in galleries. Academic findings about the application of new software can impact other digital developers, online communities, and projects, exposing new theories and practices, and collaborative opportunities.
AMATEUR & PROFESSIONAL HISTORIANS: Planned events will foster dialogues between these groups. Improved access to objects and records will allow those researchers working outside the Academy to pursue sustained investigations into the material history of science, technology, engineering and medicine. Providing an open-access catalogue on a digital platform ensures that all types of researchers regardless of their socio-economic background or location can use museum materials.
LIBRARIANS AND ARCHIVISTS: The impact on librarians and archivists will be in the form of knowledge exchange between them and project investigators. Archivists and librarians from across the country will be invited to convenings, allowing them to benefit from the expertise of Heritage Connector's investigator team, who will in turn benefit from improved knowledge of collections. Participants in the project will help librarians and archivists consider the relevance of these new digital tools for their collections, and how scholars and the wider public might use these digitally revamped collections for research activities.


Description Headline survey results of the project's June 2020 webinar:
- 26.7% of respondents were using Wikidata IDs and other IDs with their collections, 4.8% were using Wikidata IDs only, 24.0% were using other IDs but not Wikidata, and 44.5% were using no external IDs.
- 59% of respondents from cultural heritage institutions said that a major hurdle to them adopting Wikidata IDs in their collection was time, resources, or the large amount of work required.

Findings from literature review:
- Motivations for GLAM institutions working with Linked Open Data include: a concern to make cultural heritage more visible; an interest in exposing 'hidden' collections, or 'hidden' aspects of relatively well-known collections; the enrichment of catalogues and metadata; the encouragement of data reuse in new contexts; the desire to create a better user experience; the challenges of dealing with large volumes of data when resources are scarce.
- Many projects involve only one or at most two institutions, and international collaboration is relatively rare.
- Cultural heritage databases are rich, large and complex, and there is limited standardisation.
- Institutional histories and cultures can make standardisation challenging.
- Barriers to Linked Open Data (LOD) in the cultural heritage sector fall under four broad headings: technical, conceptual, legal and financial.
- Working with LOD at any kind of scale is both time consuming and resource intensive.
- A great deal of LOD work to date has focused on people rather than objects.
- It is not a question of whether human intervention and curation are needed, but at what point in the pipeline they should be introduced and how they may be most usefully focused.
- Many LOD projects envisage personalisation as an important outcome, but this remains a mid- to long-term goal.
- Quality, authority and trust are crucial for cultural heritage organisations, but these can hold back experimentation and present a challenge for scalability.
- It is rare for promising experimental projects to move beyond the prototype stage.

Findings from software development:
- Aligning specific free-text fields to entities (in our case collection item types and locations) is important but can take a significant time using existing tools such as OpenRefine. Faster and more robust methods therefore exist in the Heritage Connector.
- You can expect varying success disambiguating records with Wikidata depending on their type, due to the nature of the records in Wikidata. We've had most success with people and organisations and expect that we'll be able to find Wikidata links for a much smaller proportion of objects as they are less likely to exist on Wikidata.
- The separately described steps of creating external and internal links work better when used iteratively rather than when they're treated as two separate 'run-once' processes. As you use NER to create more entities and relations in the graph, the effectiveness of the disambiguator will increase.
- Where possible, it's best not to bulk query Wikidata, especially through SPARQL. We've circumvented this by creating an Elasticsearch index we can use to perform text searches on Wikidata in a faster and more stable way.
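
The last finding, searching a local index instead of bulk querying a remote endpoint, can be illustrated with a small stand-in. The sketch below is plain Python rather than Elasticsearch: it builds an in-memory inverted index over entity labels and aliases and ranks candidates by token overlap. The class name, the scoring and the entity identifiers are assumptions made for the example and do not describe the project's index.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


class LocalLabelIndex:
    """A tiny in-memory inverted index (hypothetical stand-in for a search engine)."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> set of entity ids
        self._labels = {}                  # entity id -> display label

    def add(self, entity_id, label, aliases=()):
        self._labels[entity_id] = label
        for name in (label, *aliases):
            for token in tokenize(name):
                self._postings[token].add(entity_id)

    def search(self, query, limit=5):
        """Rank entities by the fraction of distinct query tokens they contain."""
        tokens = set(tokenize(query))
        if not tokens:
            return []
        hits = defaultdict(int)
        for token in tokens:
            for entity_id in self._postings.get(token, ()):
                hits[entity_id] += 1
        ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
        return [(eid, self._labels[eid], count / len(tokens)) for eid, count in ranked[:limit]]


index = LocalLabelIndex()
# Placeholder identifiers, not real Wikidata QIDs.
index.add("entity:A", "Charles Babbage", aliases=["Babbage"])
index.add("entity:B", "Charles Darwin")

print(index.search("charles babbage"))
```

Because every lookup is local, a disambiguation step can issue many text searches per record without being subject to a remote service's rate limits.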
Exploitation Route The software and methods are open source and documented and can be repurposed by others in the future.
Sectors Culture, Heritage, Museums and Collections

URL https://www.nationalcollection.org.uk/interim
Description In a survey taken at the project's June 2020 webinar, 59% of respondents from cultural heritage institutions said that a major hurdle to them adopting Wikidata IDs in their collection was time, resources, or the large amount of work required. Therefore, for every component in Heritage Connector, we design approaches that account for the constraints on time and budget that are common in the heritage sector. In the first year of the project we have shown that small machine learning models that can be run on large datasets on a developer's laptop are useful tools for creating Linked Open Data from museum collections, particularly in tackling the problems of record linkage and information retrieval. We have shown that small machine learning models are effective for creating links between museum collections and Wikidata using a small amount of labelled data. Using the Science Museum Group collection we also show that both using collection item labels as a dictionary and encoding some heritage-specific knowledge as rules can improve the performance of Named Entity Recognition (NER) methods as part of the process of creating new knowledge graph links from text data, without the need for time-consuming data annotation. This performance increase is particularly significant on smaller models, which museum developers are able to run on their own laptops on datasets the size of digital museum collections.
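
The idea of using collection labels as a dictionary and encoding domain knowledge as rules can be sketched without any machine learning at all. The example below is a minimal standalone matcher, not the project's NER pipeline: the dictionary entries, the entity type names and the single year rule are invented for illustration, and in practice such matches would typically supplement a statistical model rather than replace it.

```python
import re


def find_entities(text, dictionary, rules):
    """Return non-overlapping (start, end, surface, label) spans, preferring longer matches."""
    candidates = []
    # Dictionary matches: whole-word, case-insensitive lookups of known labels.
    for term, label in dictionary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            candidates.append((m.start(), m.end(), m.group(0), label))
    # Rule matches: regular expressions that encode domain knowledge.
    for pattern, label in rules:
        for m in re.finditer(pattern, text):
            candidates.append((m.start(), m.end(), m.group(0), label))
    # Keep the longest spans first, then drop anything that overlaps a kept span.
    candidates.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    chosen = []
    for span in candidates:
        if all(span[1] <= kept[0] or span[0] >= kept[1] for kept in chosen):
            chosen.append(span)
    return sorted(chosen)


# Hypothetical dictionary built from collection item labels, and one hypothetical rule.
DICTIONARY = {
    "steam locomotive": "OBJECT",
    "locomotive": "OBJECT",
    "Science Museum": "ORG",
}
RULES = [(r"\b1[5-9]\d{2}\b", "DATE")]  # four-digit years from 1500 to 1999

TEXT = "Model of a steam locomotive, made in 1829 and shown at the Science Museum."
for start, end, surface, label in find_entities(TEXT, DICTIONARY, RULES):
    print(surface, label)
```

Preferring the longest match keeps "steam locomotive" as one entity instead of also tagging the shorter "locomotive" inside it.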
First Year Of Impact 2020
Sector Culture, Heritage, Museums and Collections
Impact Types Cultural

Title CLI for loading Wikidata subsets into Elasticsearch 
Description Running text search programmatically on Wikidata means using the MediaWiki query API, either directly or through the Wikidata query service/SPARQL. There are a couple of reasons you may not want to do this:
- time constraints/large volumes: APIs are rate-limited, and you can only do one text search per SPARQL query
- better search: using Elasticsearch allows for more flexible and powerful text search capabilities.
The created software is a set of simple CLI tools to load a subset of Wikidata into Elasticsearch.
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? Yes  
Impact The software has been favourited 24 times on GitHub and "forked" 4 times meaning that others are reusing the software. 
URL https://github.com/TheScienceMuseum/elastic-wikidata
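
Before a subset of Wikidata can be loaded into a search index, each entity generally has to be reduced to a flat document. The sketch below shows one way this might be done for an entity in the standard Wikidata JSON layout (labels, descriptions, aliases, claims). The function name, the output field names and the choice of properties to keep are assumptions for illustration and are not taken from the tool above; the sample entity is hand-written and heavily abridged.

```python
def flatten_entity(entity, lang="en", keep_properties=("P31",)):
    """Reduce a Wikidata JSON entity to a flat dict suitable for text indexing."""
    doc = {
        "id": entity["id"],
        "label": entity.get("labels", {}).get(lang, {}).get("value", ""),
        "description": entity.get("descriptions", {}).get(lang, {}).get("value", ""),
        "aliases": [a["value"] for a in entity.get("aliases", {}).get(lang, [])],
    }
    for pid in keep_properties:
        values = []
        for statement in entity.get("claims", {}).get(pid, []):
            datavalue = statement.get("mainsnak", {}).get("datavalue")
            # Snaks of type "novalue"/"somevalue" have no datavalue; skip them.
            if datavalue and datavalue.get("type") == "wikibase-entityid":
                values.append(datavalue["value"]["id"])
        doc[pid] = values
    return doc


# Abridged, hand-written sample in the shape of a Wikidata entity.
SAMPLE = {
    "id": "Q42",
    "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
    "descriptions": {"en": {"language": "en", "value": "English writer"}},
    "aliases": {"en": [{"language": "en", "value": "Douglas Noel Adams"}]},
    "claims": {
        "P31": [  # P31 is "instance of"
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P31",
                    "datavalue": {
                        "type": "wikibase-entityid",
                        "value": {"entity-type": "item", "id": "Q5"},
                    },
                }
            }
        ]
    },
}

print(flatten_entity(SAMPLE))
```

Keeping only a handful of properties is what makes a "subset" small enough to index and search on modest hardware.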
Title Natural Language (NLP) tools for heritage collections 
Description Text processing for the Heritage Connector: a set of NLP utilities for the Heritage sector. Includes:
- information extraction (NER, NEL, relation classification)
- labelling (Label Studio)
- test suite for models
Type Of Material Computer model/algorithm 
Year Produced 2020 
Provided To Others? Yes  
Impact This is one of the main outputs of the project and demonstrates the capability of NLP to process cultural heritage collections data, extract entities and build knowledge graphs.
URL https://github.com/TheScienceMuseum/heritage-connector-nlp
Title Heritage Connector 
Description A set of software tools to:
- load tabular collection data to a knowledge graph
- find links between collection entities and Wikidata
- perform NLP to create more links in the graph
- explore and analyse a collection graph in ways that aren't possible in existing collections systems
Type Of Technology Webtool/Application 
Year Produced 2020 
Open Source License? Yes  
Impact The software has been favourited 9 times in the GitHub repository. 
URL https://github.com/TheScienceMuseum/heritage-connector
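
The first step listed for this software, loading tabular collection data into a knowledge graph, amounts to turning each populated cell into a subject-predicate-object statement. The sketch below shows that idea with plain tuples. The column names, predicate names and base URI are hypothetical, and the representation is an assumption for illustration rather than the one the software uses.

```python
import csv
import io


def rows_to_triples(rows, id_column, column_to_predicate, base_uri):
    """Turn table rows into (subject, predicate, object) triples, skipping empty cells."""
    triples = []
    for row in rows:
        subject = base_uri + row[id_column]
        for column, predicate in column_to_predicate.items():
            value = (row.get(column) or "").strip()
            if value:  # thin records often leave fields blank
                triples.append((subject, predicate, value))
    return triples


# Invented sample data standing in for a collection export.
CSV_TEXT = """id,title,maker
obj1,Model steam locomotive,Example Maker Ltd
obj2,Microscope,
"""

# Hypothetical mapping from column names to predicate names.
MAPPING = {"title": "ex:title", "maker": "ex:maker"}

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
TRIPLES = rows_to_triples(rows, "id", MAPPING, "https://example.org/object/")
for triple in TRIPLES:
    print(triple)
```

Once records are in this form, later steps can add further triples for the same subjects without changing the original table.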
Description Hands-on activity in linking and enriching geo-data, part of the Linked Pasts 6 conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact 39 participants took part in a technical workshop with hands-on activity as part of the Linked Pasts 6 conference hosted by the British Library and University of London. The Heritage Connector project team presented elements of the project's software and participants were able to test and play with the software. The team then took part in a roundtable discussion. The technical workshop was hosted by the AHRC TaNC Locating a National Collection project.
Year(s) Of Engagement Activity 2020
URL https://www.eventbrite.co.uk/e/linking-geo-data-through-test-and-play-tickets-129858356841#
Description Heritage Connector Project Blog 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact 1,302 users have made 2,164 visits to the project blog which collates project outcomes, documentation of events and links to reports, recordings and software developed.
Year(s) Of Engagement Activity 2020,2021
URL https://thesciencemuseum.github.io/heritageconnector
Description Heritage Connector YouTube Channel 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Webinars, presentations and demonstrations presented as part of the Heritage Connector project are collated on this YouTube channel which has received 222 video plays.
Year(s) Of Engagement Activity 2020,2021
URL https://www.youtube.com/channel/UCzO6jroIvj-JbFuiQ9BpZdQ
Description Heritage Connector Zotero Library 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Public project Zotero library established to collate literature review, case studies and related projects. The library currently contains 223 items organised by type and theme.
Year(s) Of Engagement Activity 2020,2021
URL https://www.zotero.org/groups/2439363/heritage_connector
Description Project Lightning Talk on AI4LAM Community Call 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact 61 participants viewed the lightning talk of the Heritage Connector's software on the AI4LAM (Artificial Intelligence for Libraries, Archives and Museums) Community Call, and participated in a Q&A.
Year(s) Of Engagement Activity 2021
URL https://docs.google.com/document/d/1gOQEPqSEBAkqpy6KtRsEIm5g1vCjsxdmnlkeO3YJM3Y/
Description Towards a National Collection: Persistent Identifiers as IRO Infrastructure - Project Launch Webinar 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The AHRC TaNC Persistent Identifiers as IRO Infrastructure project brings together best practices in the use of PIDs in the UK heritage sector, with a focus on those organisations that are Independent Research Organisations. This webinar was run to introduce the community to our project and find out what stakeholders were interested in hearing more about as the project evolves. Heritage Connector was presented to align the two projects from the outset as there are areas of common interest. 118 people participated in the event.
Year(s) Of Engagement Activity 2020
URL https://www.pidforum.org/t/webinar-on-a-new-pids-in-glam-project-6th-april-2020/917
Description Wikidata and cultural heritage collections webinar 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact On 19 June 2020, the Science Museum Group hosted a free, public webinar on Wikidata and cultural heritage collections. This was the first in a series of convenings as part of the Heritage Connector project. The webinar brought together a set of short case studies from practitioners who have worked in this field to present their work and the opportunities and challenges as they saw them. 296 people participated in the webinar, which brought together international speakers, and attendees took part in the Q&A session and online survey.
Year(s) Of Engagement Activity 2020
URL https://thesciencemuseum.github.io/heritageconnector/events/2020/06/22/wikidata-and-cultural-heritag...