Development of software to model multi-modal genomic data as an integrated system: application to understanding the gene regulatory landscape

Lead Research Organisation: UNIVERSITY OF EXETER
Department Name: Institute of Biomed & Clinical Science

Abstract

To date, over 4,500 genetic studies have been performed, identifying almost 200,000 genetic risk factors for more than 1,800 diseases and traits. However, the biological consequences of the majority of these genetic risk factors are unknown. They are anticipated to influence when and where (i.e. in which organ) genes are active by regulating this activity. Advances in technology mean we can now profile the complex layers of gene regulation in unprecedented detail. There is a wealth of data available to explore how gene regulation works as a biological system. The challenge is how to efficiently analyse this huge quantity of data and represent it in a meaningful manner. The aim of this Fellowship is to develop tools that are capable of building the most comprehensive model of gene regulation and are flexible enough to accommodate new data sets as they inevitably arise. These tools will take advantage of multiple different yet complementary data types and unite them as a single system. They will look for patterns across these data types that define different states of gene regulation. What makes this project unique is that the tools will be optimised for the analysis of large sample cohorts. My approach will extend existing research by focusing on identifying where the system varies across individuals. Knowing where gene regulation varies is the key to understanding how it influences the development of disease.
The final part of the project will focus on how to share the output of the software in a usable format, so that other researchers can integrate it with their own data. Specifically, I will create a model of gene regulation for human brain cell types that will provide a unique resource to improve our knowledge of diseases that affect the brain (e.g. Alzheimer's disease and schizophrenia). The data will be shared through a web-based application, developed as part of the Fellowship. Crucially, researchers will be able to investigate how different combinations of genetic risk factors influence gene activity and identify which genes are affected. For example, they could identify which genes are disrupted by genetic risk factors that increase an individual's risk of developing Alzheimer's disease. At present, no method is available to provide this kind of insight.
There are a number of research groups and global consortia generating data that could be analysed with the planned software. The methodology is forward-thinking, focused on maximising the information gained from existing data, and relevant to the study of any organ, disease or organism. This Fellowship, therefore, has the potential to transform our understanding of health and disease.
 
Description Defining Best Practises for Data Science Education across Disciplines
Amount £16,191 (GBP)
Funding ID ELAT2\100015 
Organisation Alan Turing Institute 
Sector Academic/University
Country United Kingdom
Start 03/2023 
End 03/2024
 
Title CEll TYpe deconvolution GOodness (CETYGO) 
Description The majority of epigenetic epidemiology studies to date have generated genome-wide profiles from bulk tissues (e.g. whole blood); however, these are vulnerable to confounding from variation in cellular composition. Proxies for cellular composition can be mathematically derived from the bulk tissue profiles using a deconvolution algorithm; however, there is no method to assess the validity of these estimates for a dataset where the true cellular proportions are unknown. We have developed an accuracy metric, the CEll TYpe deconvolution GOodness (CETYGO) score, for a set of cellular heterogeneity variables derived from a genome-wide DNA methylation (DNAm) profile for an individual sample. The CETYGO score captures the deviation between a sample's DNAm profile and its expected profile given the estimated cellular proportions and cell type reference profiles. We have made our method available via GitHub as a standalone R package, CETYGO, which calculates the CETYGO score alongside the estimation of cellular composition variables using Houseman's algorithm. In this way it can easily be adapted for use with other available reference panels, both now and in the future. 
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact We are currently using this tool to evaluate different models for estimating the cellular composition of the brain. Two critical findings here are that the models are not appropriate for the study of either the cerebellum or prenatal and early postnatal samples. 
URL https://github.com/ds420/CETYGO/blob/main/README.md
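
The idea behind the score described above can be illustrated with a minimal Python sketch. It is not the CETYGO R package or its API. It assumes that the "deviation" is measured as a root mean square error (RMSE) between the observed bulk profile and the profile expected from the estimated proportions; the function names and toy values are hypothetical, and the proportions are supplied directly rather than estimated with Houseman's algorithm.

```python
import math


def expected_profile(reference, proportions):
    """Expected bulk DNAm value at each site: the proportion-weighted
    sum of the cell type reference values for that site."""
    return [
        sum(p * beta for p, beta in zip(proportions, site_refs))
        for site_refs in reference
    ]


def deviation_score(bulk, reference, proportions):
    """Assumed form of the deviation: RMSE between the observed bulk
    profile and the expected profile (lower means a better fit)."""
    expected = expected_profile(reference, proportions)
    squared = [(b - e) ** 2 for b, e in zip(bulk, expected)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy reference panel: 4 DNAm sites (rows) x 2 cell types (columns).
reference = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
    [0.1, 0.7],
]

# Simulate a bulk sample that is a 60/40 mixture of the two cell types.
bulk = expected_profile(reference, [0.6, 0.4])

good = deviation_score(bulk, reference, [0.6, 0.4])  # accurate estimates
poor = deviation_score(bulk, reference, [0.1, 0.9])  # inaccurate estimates
```

In this toy setting the accurate proportions reproduce the bulk profile and give a score of zero, while the inaccurate proportions give a larger score. This is the property that lets such a metric flag unreliable estimates when the true proportions are unknown.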
 
Description PhD student Calum Harvey 
Organisation University of Sheffield
Department Sheffield Biorepository
Country United Kingdom 
Sector Academic/University 
PI Contribution Calum spent 4 weeks visiting my team in Exeter and learning about DNA methylation analysis. I have since been asked to become his second supervisor. We focused on modelling cfDNA with motor neurons and cortical neurons spiked in to test the sensitivity of detecting these at varying read depths and with varying numbers of regions.
Collaborator Contribution Johnathan Cooper-Knock (University of Sheffield) is his primary supervisor.
Impact NA
Start Year 2022
 
Description Day of Ideas 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact The Institute for Data Science and Artificial Intelligence (IDSAI) Day of Ideas was an industry networking event at Exeter Castle showcasing some of the interdisciplinary data science and artificial intelligence research that is ongoing at the University of Exeter. The event was opened with a welcome from Professor Lisa Roberts, the Vice-Chancellor of the University, who then invited Professor Sir Adrian Smith, the Director and Chief Executive of the Alan Turing Institute and President of the Royal Society, to speak about the landscape of data science and artificial intelligence within the UK. Guests were then invited by Professor Richard Everson, Director of the Institute for Data Science and Artificial Intelligence, to see some of the work we do here at IDSAI. I developed and hosted the Health Stand, showcasing the genomic profiling and modelling happening as part of my Fellowship.
Year(s) Of Engagement Activity 2022
URL https://www.exeter.ac.uk/research/idsai/events/idsaidayofideas/#:~:text=IDSAI%20holds%20an%20industr...