Submitted thesis 03/02/2021 - now has corrections - resubmission deadline Nov-2022

Lead Research Organisation: University College London
Department Name: Institute of Health Informatics

Abstract

The information needed to discover a new indication for an existing drug (repurposing) may come from several sources. Electronic health records and databases of adverse drug events give information about "real life" use of drugs. GWAS provide information on gene-phenotype associations. From biology we may know which genes code for which proteins, and some of the biological pathways in which these proteins participate. From experiments we know which proteins each molecule (drug) attaches to, or might attach to. All these pieces of information are stored in different databases, and often the data is not linkable, in the sense that we do not have information for the same individual in all the databases.
We will extract information on relevant associations and combine it in such a way that we can provide a measurable degree of evidence or "belief" in the potential effect of a drug on a particular disease (i.e. provide evidence for a particular disease-gene-protein-drug connection). The key problem here is to define a way of representing the degree of evidence or belief in each piece of information, and to devise logically consistent rules for combining them.
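As a purely illustrative sketch of what "logically consistent rules to combine" evidence could look like, the Python below treats each link in a disease-gene-protein-drug chain as a belief between 0 and 1. It multiplies beliefs along one chain and merges several independent chains with a noisy-OR rule. Both rules, the independence assumption, and the names `chain_belief` and `combine_paths` are assumptions made for this example; the abstract does not say which representation the project adopts.

```python
from math import prod


def _check(beliefs):
    """Reject values that are not valid degrees of belief."""
    for b in beliefs:
        if not 0.0 <= b <= 1.0:
            raise ValueError(f"belief must lie in [0, 1], got {b}")


def chain_belief(link_beliefs):
    """Belief in a whole chain (e.g. disease-gene, gene-protein, protein-drug).

    Assumption: links are independent, so the chain is only as credible
    as the product of its links.
    """
    link_beliefs = list(link_beliefs)
    _check(link_beliefs)
    return prod(link_beliefs)


def combine_paths(path_beliefs):
    """Merge several independent chains that support the same drug-disease pair.

    Assumption: noisy-OR, i.e. the pair is supported unless every chain fails.
    """
    path_beliefs = list(path_beliefs)
    _check(path_beliefs)
    return 1.0 - prod(1.0 - b for b in path_beliefs)
```

Under these assumptions, a chain with three links of belief 0.5 each scores 0.125, and two independent chains of belief 0.5 combine to 0.75. Other formalisms would give different numbers from the same inputs.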

Grant 552247 replaces old CHAPTER grant 513157

Publications


Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
MR/T502583/1                                      01/01/2019   30/09/2020
2227769             Studentship    MR/T502583/1   26/09/2016   30/09/2019
 
Title Protein Set Enrichment Analysis, SomaLogic Annotation Package 
Description For years genomics has dominated the omics field. As a result, the tools and statistical methods developed are tailored to analysing genetic data. However, proteomics is well poised to reach the same level of throughput and utility (both clinical and research) in the coming years, having already achieved 25% coverage of the human proteome (SomaLogic V4). An annotation package is companion software that bridges the gap between the probes on high throughput assays (the hardware, i.e. the SOMAmers) and biological datasets (the software). Annotation packages can be used to provide biological context and statistical analysis. They are normally developed by biotech companies. However, SomaLogic's current scope is biomarker discovery rather than software engineering, so it does not provide a SomaLogic V4 annotation package (nor does it have any immediate plans to do so). As a result, high throughput analysis and biological interpretation are severely impeded. The PhD led to the development of a bespoke SomaLogic annotation package that provides biological context (in the form of Gene Ontology and Reactome) and enables biostatistical analysis (protein set enrichment analysis). 
Type Of Material Improvements to research infrastructure 
Year Produced 2019 
Provided To Others? No  
Impact In collaboration with Harvard (Veterans Affairs, VA), we are carrying out a validation study to systematically check the involvement of >5,000 proteins in 3 cardiovascular diseases. A multiple testing threshold is then applied to adjust for the number of tests conducted. We are using the tool to annotate the top hits in our validation study with biological context. Next, we use the tool to apply a threshold-free method (protein set enrichment analysis) to identify over- and under-enriched pathways associated with the 3 phenotypes. 
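The threshold-free step described above can be sketched as a running-sum enrichment score in the style of gene set enrichment analysis. The Python below is a minimal, unweighted version written for illustration only. The function name `enrichment_score` and its arguments are hypothetical, the ranking is assumed to contain each protein once, and the significance testing (for example by permutation) that a real analysis needs is omitted. It is not the annotation package's own code.

```python
def enrichment_score(ranked_ids, protein_set):
    """Unweighted running-sum enrichment score for one protein set.

    ranked_ids: protein identifiers ordered from most to least associated
                with the phenotype (assumed unique).
    protein_set: identifiers belonging to the pathway being tested.
    Returns the running-sum value with the largest absolute deviation
    from zero: positive if the set clusters near the top of the ranking,
    negative if it clusters near the bottom.
    """
    members = set(protein_set) & set(ranked_ids)
    n_total = len(ranked_ids)
    n_hits = len(members)
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("set must contain some, but not all, ranked proteins")

    hit_step = 1.0 / n_hits                # step up when a set member is met
    miss_step = 1.0 / (n_total - n_hits)   # step down otherwise

    running = 0.0
    best = 0.0
    for protein_id in ranked_ids:
        running += hit_step if protein_id in members else -miss_step
        if abs(running) > abs(best):
            best = running
    return best
```

With six ranked proteins, a two-protein set occupying the top two positions scores 1.0 and a set occupying the bottom two scores -1.0, which matches the intuition of over- and under-enrichment.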
 
Title SiREN - The Single Research Network 
Description Drug repurposing (finding new uses for existing drugs) is a fast-paced, cross-disciplinary research area. As a result, the literature is extensive and a systematic review is nearly impossible to carry out. An alternative way to identify the influential papers in the field is to model the life science literature (i.e. PubMed) as a massive citation network and apply network theory to mine it. I have built this network, stored it in a specialised graph database (Neo4j), connected the database to my statistical software and analysed the network. It is not closely relevant to the PhD, hence it is still very immature. 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? No  
Impact A senior PI at the host institute (formerly the Farr Institute) carried out a review of the impact the Farr Institute had on electronic health records. I acted as a consultant on their analysis and visualisation because of my experience in analysing and visualising large citation networks. The software I recommended was ultimately used to produce visualisations for the paper and for the MRC's Delivery Plan 2019 (http://dx.doi.org/10.2139/ssrn.3312791 and https://t.co/SstikjiSJd?amp=1, page 14). The Million Veteran Program (USA) had a similar goal to the Farr Institute and wanted to use my database directly, but I advised against it. My upgrade examiner requested one of the graphs from the database for her grant application. 
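One common way to apply network theory to a citation network in order to surface influential papers is PageRank. The Python below is a small, self-contained power-iteration sketch over an in-memory dictionary. It is an assumption chosen for illustration and does not reproduce the Neo4j-backed analysis described above; the function name `pagerank`, the damping factor and the iteration count are likewise illustrative choices.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Rank papers in a citation network by power iteration.

    citations: dict mapping a paper ID to the list of paper IDs it cites.
    Returns a dict mapping every paper ID to a score; scores sum to 1.
    """
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every paper receives a small baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        dangling = 0.0
        for node in nodes:
            targets = citations.get(node, [])
            if targets:
                # A paper passes its score on to the papers it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Papers citing nothing spread their score evenly.
                dangling += damping * rank[node]
        for node in nodes:
            new_rank[node] += dangling / n
        rank = new_rank
    return rank
```

For a toy network in which papers A and B both cite paper C, C receives the highest score. On a PubMed-scale graph this loop would be replaced by a graph database or a sparse-matrix implementation.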
 
Description UCL (Institute of Health Informatics, IHI & Health Data Research UK, HDR-UK) Harvard (Veterans Affairs, VA) 
Organisation United States Department of Veterans Affairs
Country United States 
Sector Public 
PI Contribution This UCL - Harvard collaboration adapts the statistical methods and computational tools of gene set enrichment analysis (GSEA) to protein data (protein set enrichment analysis, PSEA) in order to identify over- and under-expressed pathways. More specifically, PSEA is applied in a drug discovery context, picking up where UCL's drug target identification (protein-wide association study, PWAS) and Harvard's drug target validation (Mendelian randomisation, MR) projects end. We have contributed bioinformatics expertise at all stages of the VA's Mendelian randomisation project. This includes data preprocessing, visualisation, providing biological context for statistically significant hits and providing positive controls. We have also followed up the experiment with secondary pathway analysis to maximise the use of the summary statistics generated by the Mendelian randomisation phase. Our contributions span the domains of genomics, proteomics, clinical trials and biostatistics.
Collaborator Contribution The Million Veteran Program (MVP) is a cohort of 1 million veterans with genomic data coupled with electronic health records. It is one of the largest cohorts in the world but, unlike UK Biobank, its data is not freely available to the public. By collaborating with the VA we have access to the data. By pooling the UK Biobank data with the MVP we could have access to ~1.5 million patients. Additionally, there have been a few instances where we have had early access to unpublished data ahead of preprints. The VA has also provided 2 cardiology fellows to manually curate clinical trials data, and provided me with a 3-month placement at Brigham and Women's Hospital in Boston.
Impact Our contributions span the domains of genomics, proteomics, clinical trials, medicinal chemistry and biostatistics. Our collaborators' expertise spans statistical genetics, biostatistics, network theory and medicine.
Start Year 2019
 
Title ClinicalTrials.Gov Miner 
Description From https://clinicaltrials.gov/ct2/about-site/background: "ClinicalTrials.gov is a Web-based resource that provides patients, their family members, health care professionals, researchers, and the public with easy access to information on publicly and privately supported clinical studies on a wide range of diseases and conditions. The Web site is maintained by the National Library of Medicine (NLM) at the National Institutes of Health (NIH)." ClinicalTrials.gov is notoriously difficult to work with, for two reasons. First, the data is stored in a tree-like structure (XML) while most biological datasets are tabular. Second, the data is stored as free text and is often incomplete, misspelled or haphazardly uploaded. Our software tackles the first bottleneck: it rapidly extracts and formats the trials for downstream analysis (either manual curation or natural language processing). 
Type Of Technology Software 
Year Produced 2017 
Impact We're using the package to identify drugs, and in turn drug targets, for our experiments at UCL (drug target identification) and Harvard (drug target validation) for 3 cardiovascular phenotypes. Drug targets in phase 4 clinical trials are our positive controls across both experiments. Additionally, our Harvard arm uses data from trials in phases 1-3 for target validation. That is, we can reasonably predict whether a drug target-disease pairing is going to be successful. This, when combined with additional streams of evidence, gives good grounds to promote or terminate a trial. Finally, the package connects to the chemogenomic database ChEMBL to provide additional data for drug repurposing.
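The first bottleneck described above, turning tree-structured XML into tabular rows, can be sketched with Python's standard library. The example below is not the ClinicalTrials.Gov Miner itself. The element names are modelled on the registry's legacy XML export and should be treated as assumptions, the function name `flatten_trial` is hypothetical, and the embedded record (including its NCT number and drug names) is made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# Made-up record; element names are assumed, not taken from the package.
SAMPLE_XML = """<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <brief_title>Example trial (made-up record)</brief_title>
  <phase>Phase 4</phase>
  <condition>Heart Failure</condition>
  <intervention>
    <intervention_type>Drug</intervention_type>
    <intervention_name>Example Drug A</intervention_name>
  </intervention>
  <intervention>
    <intervention_type>Drug</intervention_type>
    <intervention_name>Example Drug B</intervention_name>
  </intervention>
</clinical_study>"""


def flatten_trial(xml_text):
    """Return one flat row (dict) per intervention in a single trial record."""
    root = ET.fromstring(xml_text)

    # Trial-level fields are repeated on every row so the result is tabular.
    nct_id = root.findtext("id_info/nct_id", default="").strip()
    title = root.findtext("brief_title", default="").strip()
    phase = root.findtext("phase", default="").strip()
    conditions = "; ".join(
        (c.text or "").strip() for c in root.findall("condition")
    )

    rows = []
    for intervention in root.findall("intervention"):
        rows.append({
            "nct_id": nct_id,
            "brief_title": title,
            "phase": phase,
            "conditions": conditions,
            "intervention_type": intervention.findtext(
                "intervention_type", default="").strip(),
            "intervention_name": intervention.findtext(
                "intervention_name", default="").strip(),
        })
    return rows
```

Calling `flatten_trial(SAMPLE_XML)` yields two rows, one per intervention, which can be written out with `csv.DictWriter` or loaded into a data frame for manual curation or natural language processing.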