TRANSFORMATIVE TECHNOLOGIES - Mining for the best side of bioscience data, with machine learning

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Biological Sciences

Abstract

Biology is being transformed by large-scale data sets, but how do you find the right one? The data needs a 'dating profile'. That sounds easy, but it turns out to be a bottleneck that is holding back research progress and research culture. This project tackles that metadata bottleneck. We aim to help researchers to show the best side of their data.

Data-intensive bioscience depends upon online repositories that share the "Big Data". There is little value in sharing data if you can't tell which organism, sample, or conditions it came from, so the databases also need descriptions of the data, termed metadata. Top-tier repositories pay professional data curators to deal with their metadata, but many other repositories cannot do so. Even curators can't invent metadata: the original researchers have to describe their research for the curators.

This project first aims to understand current data descriptions in research data repositories, using text mining and machine learning, in particular named entity recognition in free-text descriptions. Based on this evidence, you will research the simplest ways to improve the descriptions in future. The project will test real-time feedback that encourages researchers to provide better descriptions, for example using controlled vocabularies. You will work with software developers to test and evaluate simple feedback processes, in practice, for biological data repositories. By then, you will also be an expert data steward.
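As a rough illustration of the kind of feedback described above, the sketch below checks a free-text data description against a small controlled vocabulary and reports which categories are still missing. It is a minimal sketch under stated assumptions, not the project's method: the vocabulary, the category names and the functions `find_entities` and `feedback` are all hypothetical, and simple dictionary matching stands in for a trained named entity recognition model.

```python
import re

# Hypothetical mini-vocabulary: preferred term -> free-text synonyms.
# A real repository would draw on established ontologies instead.
CONTROLLED_VOCAB = {
    "organism": {
        "Arabidopsis thaliana": ["arabidopsis", "thale cress"],
        "Homo sapiens": ["human", "homo sapiens"],
        "Mus musculus": ["mouse", "mus musculus"],
    },
    "sample": {
        "leaf": ["leaf", "leaves"],
        "HeLa cell": ["hela"],
    },
}


def find_entities(description):
    """Return {category: sorted preferred terms matched in the description}."""
    text = description.lower()
    found = {}
    for category, terms in CONTROLLED_VOCAB.items():
        hits = set()
        for preferred, synonyms in terms.items():
            for synonym in synonyms:
                # Whole-word match, so "human" does not fire inside other words.
                if re.search(r"\b" + re.escape(synonym) + r"\b", text):
                    hits.add(preferred)
        found[category] = sorted(hits)
    return found


def feedback(description):
    """Build one short message per category, as a researcher might see on entry."""
    messages = []
    for category, hits in find_entities(description).items():
        if hits:
            messages.append(f"{category}: recognised {', '.join(hits)}")
        else:
            messages.append(f"{category}: none recognised, please add one")
    return messages


if __name__ == "__main__":
    for line in feedback("Time-lapse imaging of Arabidopsis leaves under drought"):
        print(line)
```

Mapping synonyms onto one preferred term is what makes descriptions comparable across submissions. A production version would need a far larger vocabulary and a way to handle ambiguous or misspelled terms.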

Improving data descriptions will accelerate data-intensive bioscience across many research fields, as this bottleneck applies to many repositories and even electronic lab notebooks. Making the data easier to re-use will also reward the researchers who share their data, supporting the "Open Science" aspect of the new research culture.

The team: Andrew Millar (Edinburgh) and Jason Swedlow (Dundee) are biologists who also develop and run data repositories and help researchers to manage their data. We have access to internationally-adopted repositories (e.g. https://idr.openmicroscopy.org), to their metadata, and to their software developers, who can help to implement feedback processes.
Ian Simpson (Edinburgh Informatics) applies natural language processing (text mining) software tools to analyse bioscience literature, and has worked on several "Big Data" bioscience projects, including with Andrew Millar.

The project is based with the BioRDM team in the C.H. Waddington building, at the focal point of SynthSys, the interdisciplinary biology research centre at the University of Edinburgh, where many labs generate, analyse and model large-scale biological data. More information at https://www.ed.ac.uk/biology/synthsys

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
BB/T00875X/1                                   01/10/2020  30/09/2028
2890716            Studentship   BB/T00875X/1  01/10/2023  30/09/2027