Harmony: A natural language processing approach to data discoverability and harmonisation

Lead Research Organisation: UNIVERSITY COLLEGE LONDON
Department Name: Quantitative Social Science

Abstract

The UK is a world leader in data resources for health, social and economic research, which can ultimately improve many people's lives. However, several issues prevent these resources from being used to their full potential. This project addresses two key barriers to optimal data usage: data discoverability and harmonisation. We aim to achieve this by piloting the integration of our established harmonisation tool 'Harmony' [1,2] with data discoverability platforms in the UK research infrastructure.

Harmony screens study meta-data (in Word, CSV or PDF format) and uses Artificial Intelligence (AI), specifically natural language processing (NLP), to identify variables that are comparable across datasets based on their semantic content. Harmony's underlying NLP models calculate cosine similarity scores that indicate how similar two pieces of text are. Harmony was originally developed for mental health questionnaires. However, its technology can match any text content, including across languages. We propose to use this state-of-the-art technology to expand the focus of Harmony and maximise its usage for data users.
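To illustrate the cosine similarity scoring described above, the following is a minimal sketch and not Harmony's actual implementation. It assumes a simple word-count vector as a stand-in for the NLP model's text representation; the function names (bag_of_words, cosine_similarity, match_items) and the example questionnaire items are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an NLP embedding: lowercase word counts (an assumption)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def match_items(items_a, items_b):
    """For each item in items_a, return (item, best match in items_b, score)."""
    vectors_b = [bag_of_words(text) for text in items_b]
    matches = []
    for text in items_a:
        vec = bag_of_words(text)
        scores = [cosine_similarity(vec, vb) for vb in vectors_b]
        best = max(range(len(items_b)), key=scores.__getitem__)
        matches.append((text, items_b[best], scores[best]))
    return matches


if __name__ == "__main__":
    # Hypothetical questionnaire items from two studies.
    study_a = ["Feeling nervous, anxious or on edge", "Trouble falling asleep"]
    study_b = ["I had trouble falling asleep", "I felt nervous or anxious"]
    for item, match, score in match_items(study_a, study_b):
        print(f"{item!r} -> {match!r} (score {score:.2f})")
```

Word counts only capture shared vocabulary, so this sketch would miss items that express the same idea in different words or in different languages. A semantic model of the kind the abstract describes would replace bag_of_words with a learned text representation, while the similarity calculation itself would stay the same.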

We developed Harmony as part of Wellcome's Mental Health Data Prize, where we competed successfully against 11 other teams over 3 consecutive rounds, repeatedly securing further funding. Within only 6 months and with an initial budget of £40k, our team delivered a fully functioning prototype that could automate data harmonisation processes. Our team won Wellcome's Data Prize with Harmony, which is freely available online and attracts close to 600 users per month from over 100 countries.

Harmony will provide faster and easier data discovery and harmonisation processes. Harmony's technology will allow us to connect to and 'speak' with other data platforms and to present the information they hold to users. Thus, users can utilise the Harmony platform as a one-stop shop from which they can find and access meta-data located on other data platforms (e.g. data catalogues, repositories and trusted research environments). For this project we will demonstrate the immediate and long-term benefits of integrating Harmony with data platforms from the Catalogue of Mental Health Measures (CMHM), the UK-Longitudinal Linkage Collaboration (UK-LLC) and the Centre for Longitudinal Studies (CLS; in development). Our goal is to enable reliable two-way communication pathways between platforms to achieve more efficient user journeys for researchers, data and survey managers and the public. In addition to establishing Harmony as a central discoverability service, we will enable users to import variable meta-data from partner platforms directly into Harmony to facilitate faster data comparison and harmonisation across datasets. Together, these new pathways will offer users more efficient means of finding, comparing and pooling data from different sources, thus providing an innovative solution to enhance data discoverability and interoperability in the UK data infrastructure.

To achieve these goals, we have partnered with experts and leaders in the field, including the above-mentioned platforms and DATAMIND, Administrative Data Research-UK (ADR-UK), Health Data Research-UK (HDR-UK) and the UK Data Service (UKDS). These collaborations will allow us to pilot Harmony with the key platforms of the UK research data infrastructure. Additionally, bringing together a network of like-minded initiatives will secure a wide reach within and outside the research data community, which we will utilise to involve stakeholders throughout the project and disseminate training resources and outputs.
