Talk data to me! Evaluating the potential for large language models to enhance data discoverability across federated data services

Lead Research Organisation: University of Liverpool
Department Name: Geography and Planning

Abstract

Advances in artificial intelligence (AI) are revolutionising how we search for information. Large language models (LLMs), such as OpenAI's 'ChatGPT' or Google's 'Bard', are good at interpreting what we say and the meaning behind our words. By conversing with these tools, people can refine what they are looking for and find information more accurately. Existing search tools rely on 'keywords', which do not always give good answers. LLMs help people who might not know the exact words to use, because they capture the context and relationships within our language. They can adapt to different ways of asking questions, and can explain why they returned particular information.

We believe that these maturing technologies can help researchers search for data. By training existing LLMs on what UKRI-supported research data exist, we can build on their ability to understand human language to create a powerful data search tool. Their potential as a data search tool is unknown, and we are not aware of any existing tools of this kind for UK research datasets. Our project will develop, pilot and evaluate an LLM-based search tool for this purpose.

The main output of this work will be a fully deployable 'chat box' search tool that researchers will be able to use to discover research datasets. To achieve this, we will collate the metadata of data catalogues across a range of UKRI research investments, including the Consumer Data Research Centre, NERC Environmental Data Service, Administrative Data Research UK and UK Data Service. By combining data catalogues across these unconnected services, we will provide a new single 'port of call' for searching research data. We will design our project so that it can easily be adapted to integrate new datasets. The collated metadata will then be used to develop a new AI-derived search tool based on LLMs.
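As a minimal sketch of the collation step described above, the Python below maps records from separate catalogues onto one shared schema. The field names, the `FIELD_MAPS` table and the functions `normalise` and `collate` are all hypothetical, chosen for illustration only; the real catalogues may expose quite different metadata, and this is not the project's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """A shared schema for one dataset's metadata (illustrative only)."""
    source: str
    title: str
    description: str


# ASSUMPTION: hypothetical per-catalogue field names, not the real ones.
FIELD_MAPS = {
    "CDRC": {"title": "name", "description": "notes"},
    "UKDS": {"title": "title", "description": "abstract"},
}


def normalise(source, raw):
    """Map one raw catalogue entry onto the shared schema."""
    fields = FIELD_MAPS[source]
    return DatasetRecord(
        source=source,
        title=raw.get(fields["title"], "").strip(),
        description=raw.get(fields["description"], "").strip(),
    )


def collate(catalogues):
    """Combine entries from every catalogue into one list.

    Entries without a title are skipped, as are repeated titles
    within the same catalogue (compared case-insensitively).
    """
    records = []
    seen = set()
    for source, raw_records in catalogues.items():
        for raw in raw_records:
            record = normalise(source, raw)
            key = (record.source, record.title.lower())
            if record.title and key not in seen:
                seen.add(key)
                records.append(record)
    return records
```

Under this design, integrating a new catalogue would only require one more entry in `FIELD_MAPS`, which is one way the adaptability goal could be met.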

We want to understand how these technologies can be used effectively by researchers and whether they will give more useful search results. Our mixed methods approach will test and evaluate the acceptability, suitability, and performance of our new search tool in comparison to existing UKRI search tools. This will include focus groups to qualitatively examine the acceptability of LLMs for data discovery, a quantitative comparison of how our new tool performs against existing keyword search tools, and tests that task participants with searching for data. We will report the strengths and limitations of LLMs to examine how useful they are. We will make recommendations for how they can be deployed, refined and sustained as the ways in which researchers search for data change.
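The abstract does not say which metric the quantitative comparison will use. As a hedged illustration, the sketch below scores each tool with precision at k, a common measure of ranked search results: the share of the top k results that a human judge marked as relevant. The function names `precision_at_k` and `compare_tools`, and the idea of a pre-judged relevant set, are assumptions made for this example.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k returned dataset IDs judged relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_ids[:k]
    hits = sum(1 for dataset_id in top_k if dataset_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalises tools that return
    # fewer than k results.
    return hits / k


def compare_tools(results_by_tool, relevant_ids, k=5):
    """Score each tool's ranked results for one query.

    results_by_tool maps a tool name to its ranked list of dataset IDs
    (hypothetical structure, for illustration).
    """
    return {
        tool: precision_at_k(ranked, relevant_ids, k)
        for tool, ranked in results_by_tool.items()
    }
```

Averaging these per-query scores over a shared set of test queries would give one number per tool, which could sit alongside the qualitative findings from the focus groups.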

Our project will add value to existing UKRI data discovery resources by creating a new tool that takes account of the context and meaning of search queries, providing a broader and more accurate list of datasets for each search. We hope that this will help researchers to find exactly the data they need for their research.
