Creating an AI based data assistant to bridge genotype to metadata - linking primary clinical data to biobank samples.

Lead Research Organisation: University of Birmingham
Department Name: Institute of Cancer and Genomic Sciences

Abstract

Current methods of integrating data require manual curation and conversion, because current medical data discovery and search systems rely on data being standardised. In biobanking, data standards are poorly followed: biobanks tend to have their own terminology sets both for the characteristics of the donor (such as disease and treatments) and for the data surrounding the samples. It is known that millions of samples in the UK remain unused. One of the main reasons is the lack of sufficient clinical annotation, as biobanks struggle to convert textual reports into a standard such as ICD or SNOMED. Projects such as the UKCRC Tissue Directory and Coordination Centre then struggle to match the needs of researchers with the availability of samples because so little standardised data is available. The result is that 75% of SMEs based in the UK source their samples from outside the UK, which does not help to support inward investment.
In the use case of the UKCRC Tissue Directory and Coordination Centre (TDCC), the purpose of the Directory is to allow researchers to find the biobank that is most likely to have the samples required to support their research. However, the biobanks registering in the Directory are required to do an exact conversion between data terms. For those searching, it is important to find a short-list of biobanks that are worth contacting based on an in-depth search; their preference would be greater depth of queries at the expense of accuracy. To integrate and harmonise such data, so as to make them amenable not only as tractable entities but also, perhaps more crucially, as vital elements of research hypothesis formation, technical/syntactic interoperability and semantic interoperability need to be orchestrated across all accessible datasets. The aim of this project is to utilise the UKCRC TDCC to research and define approaches to technical/syntactic interoperability, and guidelines building upon those already in place. We have support from BC Platforms to establish an inter-connected set of tools that would allow a researcher to discover, enquire about and procure the necessary samples. We aim to research semantic interoperability in terms of how biomedical data can be represented in different content models and coded with different terminology systems. We will utilise emerging artificial intelligence technologies in the fields of text mining, semantic interoperability and semantic AI to develop a health research information environment that is based on integrated structured and unstructured, clinical and research datasets and is amenable to data mining and analysis, with documents used as input vectors for ML techniques, allowing these techniques to potentially discover novel insights/associations by identifying important classification features of their own accord.

Publications


Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
MR/S502431/1                                   01/10/2018  31/03/2022
2084640            Studentship   MR/S502431/1  01/10/2018  30/06/2022  Samantha Pendleton
 
Title Max Perutz Science Writing Award (2019 entry) 
Description Submitted a written piece for the 2019 Max Perutz Science Writing Award, a competition in which MRC students write about their science for the general public. Students are asked to communicate their PhD research in a way that a wider, non-scientific audience can understand, which helps us build our communication skills. 
Type Of Art Creative Writing 
Year Produced 2019 
Impact Writing the piece gave me the chance to communicate my research in layman's terms; I then published it on my blog. 
URL https://sap218.github.io/blogs/blog013.html
 
Title The Ocular Immune-Mediated Inflammatory Diseases Ontology 
Description A novel ontology, abbreviated as OCIMIDO, for ocular inflammatory disorders with their relevant complications, associated systemic diseases, and therapeutic interventions. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? Yes  
Impact Other biomedical ontologies exist for human phenotypes or diseases, but none comprehensively describes ocular inflammatory disorders or formalises the conditions. OCIMIDO is novel: the first of its kind, and the first to include therapeutic interventions. 
URL https://github.com/sap218/ocimido
 
Title Method of clustering and genome-wide association study (GWAS) on patients with inflammation 
Description Using the UK Biobank, we curated patients with a single inflammatory condition, performed a clustering analysis on their blood assays (counts/biochemistry), then ran a GWAS on the output clusters to observe gene associations. 
Type Of Material Computer model/algorithm 
Year Produced 2021 
Provided To Others? No  
Impact Whilst clustering and GWAS are each common, our pipeline combines the two to observe gene associations within unique clusters. 
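
As a rough illustration of the cluster-then-associate idea described in this record, the sketch below groups toy patients by two made-up assay values and then tests one made-up variant for an allele-frequency difference between a cluster and the rest. It is not the project's pipeline: the data, the function names, the farthest-first k-means initialisation and the simple allelic chi-square test are all assumptions chosen to keep the example self-contained. A real analysis would standardise the assays and use dedicated GWAS software with covariate adjustment.

```python
import math


def farthest_first_centroids(points, k):
    """Deterministic start: take the first point, then repeatedly the point farthest from all chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=10):
    """Plain k-means; returns one cluster label per point."""
    centroids = farthest_first_centroids(points, k)
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def allelic_chi_square(genotypes, labels, cluster):
    """Chi-square statistic (1 d.f.) comparing alternate-allele counts in one cluster against all others.

    genotypes holds the alternate-allele count (0, 1 or 2) per patient.
    """
    a = sum(g for g, lab in zip(genotypes, labels) if lab == cluster)
    b = 2 * sum(1 for lab in labels if lab == cluster) - a
    c = sum(g for g, lab in zip(genotypes, labels) if lab != cluster)
    d = 2 * sum(1 for lab in labels if lab != cluster) - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else (a + b + c + d) * (a * d - b * c) ** 2 / denom


# Hypothetical data: two assay values per patient and one variant's genotypes.
assays = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (10.0, 10.0), (10.2, 9.8), (9.9, 10.1)]
genotypes = [2, 2, 1, 0, 0, 1]

labels = kmeans(assays, k=2)
statistic = allelic_chi_square(genotypes, labels, cluster=0)
print(labels, round(statistic, 3))
```

Repeating the test for every variant and every cluster would give one association scan per cluster, which is the general shape of the approach described above.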
 
Title Method of using term frequency-inverse document frequency (tf-idf) for synonym curation 
Description Using tf-idf, we can curate a set of terms and add relevant terms to an ontology. 
Type Of Material Data analysis technique 
Year Produced 2020 
Provided To Others? No  
Impact Proved that tf-idf is much faster than manual curation and picks up more synonyms. 
URL https://github.com/sap218/ocimido-scripts
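
To make the tf-idf idea in this record concrete, here is a minimal sketch using only the Python standard library. It scores terms across a few invented free-text posts and lists high-scoring terms that are not already in a set of known ontology terms, as candidates for a curator to review. The posts, the `known_terms` set and the function names are hypothetical, and this is not the code in the linked repository.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per document, with tf = count / length and idf = ln(N / df)."""
    tokenised = [tokenise(doc) for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append(
            {
                term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return scores


def candidate_terms(documents, known_terms, top_n=3):
    """Rank terms not already known by their best tf-idf score in any document."""
    best = {}
    for doc_scores in tf_idf(documents):
        for term, score in doc_scores.items():
            if term not in known_terms:
                best[term] = max(score, best.get(term, 0.0))
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, score in ranked[:top_n] if score > 0]


# Hypothetical corpus and ontology terms, for illustration only.
posts = [
    "my eye is red and my eye is sore",
    "the doctor said uveitis and my eye is sore",
    "my vision is blurry and the eye is red",
]
known_terms = {"uveitis"}

print(candidate_terms(posts, known_terms))
```

Terms that occur in every document get an idf of zero and drop out, so common words are filtered without a stop-word list. A curator would still need to decide which of the remaining candidates are genuine synonyms.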
 
Title Jabberwocky: an ontology-aware toolkit for manipulating text 
Description Toolkit for those nonsensical ontologies 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact Many tools for natural language processing do not include the ability to interact with an ontology. Jabberwocky works with various ontology formats to curate text from a corpus, extract important terms, and update an ontology with new synonyms. 
URL https://zenodo.org/record/3922261
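
The sketch below illustrates what ontology-aware text handling of this kind can look like in general: it tags text with the ontology classes whose labels or synonyms it mentions, then adds a new synonym so that later text is recognised. It does not use or reproduce Jabberwocky's interface. The in-memory `ontology` dictionary, its example classes and the function names are assumptions made for illustration, and a real tool would read an ontology file format instead.

```python
import re

# Hypothetical mini-ontology: class label -> set of known synonyms.
ontology = {
    "uveitis": {"synonyms": {"intraocular inflammation"}},
    "blurred vision": {"synonyms": {"blurry vision"}},
}


def build_lookup(ontology):
    """Map every label and synonym (lower-cased) to its class label."""
    lookup = {}
    for label, entry in ontology.items():
        lookup[label.lower()] = label
        for synonym in entry["synonyms"]:
            lookup[synonym.lower()] = label
    return lookup


def annotate(text, ontology):
    """Return the sorted class labels whose label or synonym occurs as a whole phrase in the text."""
    lowered = text.lower()
    found = set()
    for phrase, label in build_lookup(ontology).items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(label)
    return sorted(found)


def add_synonym(ontology, label, synonym):
    """Record a newly curated synonym against an existing class."""
    ontology[label]["synonyms"].add(synonym.lower())


print(annotate("I have had blurry vision since my uveitis flared", ontology))

add_synonym(ontology, "uveitis", "eye inflammation")
print(annotate("The eye inflammation is back again", ontology))
```

Exact phrase matching is the simplest choice here. It misses spelling variants and inflected forms, which is one reason a term-ranking step such as tf-idf is useful for finding new synonyms to add.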
 
Description Images of Research (2020 entry) 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Westmere University Graduate School, University of Birmingham, opens "Images of Research" to students, who submit a piece of art, photography, or other work that shows their research, along with a written piece describing it.
Year(s) Of Engagement Activity 2020