Development of software to model multi-modal genomic data as an integrated system: application to understanding the gene regulatory landscape

Lead Research Organisation: UNIVERSITY OF EXETER
Department Name: Institute of Biomed & Clinical Science


To date, over 4,500 genetic studies have been performed, identifying almost 200,000 genetic risk factors for more than 1,800 diseases and traits. However, the biological consequences of the majority of these genetic risk factors are unknown. They are anticipated to influence when and where (i.e. which organ) genes are active by controlling or regulating this activity. Advances in technology mean we can now profile the complex layers of gene regulation in unprecedented detail. There is a wealth of data available to explore how gene regulation works as a biological system. The challenge is how to efficiently analyse this huge quantity of data and represent it in a meaningful manner. The aim of this Fellowship is to develop tools that are capable of building the most comprehensive model of gene regulation and are flexible enough to accommodate new data sets as they inevitably arise. These tools will take advantage of multiple different yet complementary data types and unite them as a single system. They will look for patterns across these data types which define different states of gene regulation. What makes this project unique is that the tools will be optimised for the analysis of large sample cohorts. My approach will extend existing research by focusing on identifying where the system varies across individuals. Knowing where gene regulation varies is the key to understanding how it influences the development of disease.
The final part of the project will focus on how to share the output of the software in a usable format, so that other researchers can integrate it with their own data. Specifically, I will create a model of gene regulation for human brain cell types that will provide a unique resource to improve our knowledge of diseases that affect the brain (e.g. Alzheimer's disease and schizophrenia). The data will be shared through a web-based application, developed as part of the Fellowship. Crucially, researchers will be able to investigate how different combinations of genetic risk factors influence gene activity and identify which genes are affected. For example, they could identify which genes are disrupted by genetic risk factors that increase an individual's risk of developing Alzheimer's disease. At present there is no method available to provide this kind of insight.
There are a number of research groups and global consortia generating data that could be analysed with the planned software. The methodology is forward-thinking, focused on maximising the information gained from existing data, and relevant for the study of any organ, disease or organism. This Fellowship, therefore, has the potential to transform our understanding of health and disease.
Description ATI AIDSE programme
Geographic Reach National 
Policy Influence Type Contribution to new or improved professional practice
Description Coding for Reproducible Research
Geographic Reach Local/Municipal/Regional 
Policy Influence Type Influenced training of practitioners or researchers
Impact This programme provides the training PhD students need to complete their studies. We have also contributed to upskilling researchers across all domains in technical skills.
Description DSIT AI Skills Expert Panel
Geographic Reach National 
Policy Influence Type Contribution to a national consultation/review
Description GW4 BioMed2 DTP Data Science Training Lead
Geographic Reach Local/Municipal/Regional 
Policy Influence Type Influenced training of practitioners or researchers
Impact My training supports the development of the PhD students in the DTP, benefiting not only their projects but also their career prospects post-award.
Description Defining Best Practises for Data Science Education across Disciplines
Amount £16,191 (GBP)
Funding ID ELAT2\100015 
Organisation Alan Turing Institute 
Sector Academic/University
Country United Kingdom
Start 03/2023 
End 03/2024
Description Elucidating the neural cell types affected by epigenetic dysregulation in Alzheimer's disease
Amount £110,379 (GBP)
Funding ID AS-PhD-22-043 
Organisation Alzheimer's Society 
Sector Charity/Non Profit
Country United Kingdom
Start 09/2023 
End 09/2027
Description Midlife Aging in the Dunedin Study Phase 52
Amount £1,321,366 (GBP)
Funding ID MR/X021149/1 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 09/2023 
End 09/2028
Description Validating a 3rd-generation methylation measure of accelerated aging: DunedinPACE
Amount $2,999,216 (USD)
Funding ID 1R01AG073207-01A1 
Organisation National Institutes of Health (NIH) 
Sector Public
Country United States
Start 01/2022 
End 12/2026
Title CEll TYpe deconvolution GOodness (CETYGO) 
Description The majority of epigenetic epidemiology studies to date have generated genome-wide profiles from bulk tissues (e.g. whole blood); however, these are vulnerable to confounding from variation in cellular composition. Proxies for cellular composition can be mathematically derived from the bulk tissue profiles using a deconvolution algorithm; however, there is no method to assess the validity of these estimates for a dataset where the true cellular proportions are unknown. We have developed an accuracy metric that quantifies the CEll TYpe deconvolution GOodness (CETYGO) score of a set of cellular heterogeneity variables derived from a genome-wide DNA methylation (DNAm) profile for an individual sample. The CETYGO score captures the deviation between a sample's DNAm profile and its expected profile given the estimated cellular proportions and cell type reference profiles. We have made our method available via GitHub as a standalone R package, CETYGO, which calculates the CETYGO score alongside the estimation of cellular composition variables using Houseman's algorithm. In this way it can easily be adapted for use with other available reference panels, both now and in the future. In 2023, the package was updated to include our novel reference panel for brain, including 5 different brain cell types. 
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact We are currently using this tool to evaluate different models for estimating the cellular composition of the brain. Two critical findings here are that the models are not appropriate for the study of either the cerebellum or prenatal and early postnatal samples. 
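The idea behind the CETYGO score described above can be illustrated with a minimal sketch in Python. This is not the CETYGO R package or its API. It assumes the "deviation" is measured as a root-mean-square error between the observed and expected DNAm profiles, and it swaps Houseman's algorithm for a simple projected-gradient non-negative least squares fit. The function names (`estimate_proportions`, `cetygo_score`) and the simulated reference panel are hypothetical and exist only for illustration.

```python
import numpy as np


def estimate_proportions(bulk, reference, n_iter=2000):
    """Estimate non-negative cell type proportions for one bulk sample.

    Illustrative stand-in for Houseman's constrained projection: minimises
    ||reference @ p - bulk||^2 subject to p >= 0 by projected gradient descent.
    """
    gram = reference.T @ reference
    target = reference.T @ bulk
    # A step of 1 / (largest eigenvalue of the Gram matrix) guarantees convergence.
    step = 1.0 / np.linalg.eigvalsh(gram)[-1]
    p = np.full(reference.shape[1], 1.0 / reference.shape[1])
    for _ in range(n_iter):
        p = np.maximum(p - step * (gram @ p - target), 0.0)
    return p


def cetygo_score(bulk, reference, proportions):
    """Deviation between observed and expected profiles (assumed here to be an RMSE)."""
    expected = reference @ proportions
    return float(np.sqrt(np.mean((bulk - expected) ** 2)))


# Simulated data (hypothetical): 50 DNAm sites, 3 reference cell types.
rng = np.random.default_rng(0)
reference = rng.uniform(0.0, 1.0, size=(50, 3))
true_props = np.array([0.6, 0.3, 0.1])

# A bulk profile fully explained by the reference panel, and a noisy one that is not.
clean_bulk = reference @ true_props
noisy_bulk = clean_bulk + rng.normal(0.0, 0.1, size=50)

clean_props = estimate_proportions(clean_bulk, reference)
noisy_props = estimate_proportions(noisy_bulk, reference)
clean_score = cetygo_score(clean_bulk, reference, clean_props)
noisy_score = cetygo_score(noisy_bulk, reference, noisy_props)
```

In this sketch a sample that the reference panel explains well receives a score close to zero, while a sample that deviates from any mixture of the reference profiles receives a larger score. This is the property that lets the score flag unreliable composition estimates when the true proportions are unknown.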
Description PhD student Calum Harvey 
Organisation University of Sheffield
Department Sheffield Biorepository
Country United Kingdom 
Sector Academic/University 
PI Contribution Calum spent 4 weeks visiting my team in Exeter and learning about DNA methylation analysis. I have since been asked to become his second supervisor. We focused on modelling cfDNA with motor neurons and cortical neurons spiked in to test the sensitivity of detecting these at varying read depths and with varying numbers of regions.
Collaborator Contribution Johnathan Cooper-Knock (University of Sheffield) is his primary supervisor.
Impact NA
Start Year 2022
Title ejh243/BrainFANS: QC pipeline 
Description Stable DNAm preprocessing scripts and neural cell type deconvolution analysis scripts 
Type Of Technology Software 
Year Produced 2023 
Open Source License? Yes  
Impact This has standardized the processing of DNA methylation and other regulatory 'omics data for a number of projects affiliated to our group and beyond. 
Title ejh243/CETYGO: Publication release 
Description Link to Zenodo for doi 
Type Of Technology Software 
Year Produced 2023 
Open Source License? Yes  
Impact This software library has facilitated the integration of our metric for quantifying the accuracy of estimation of cellular composition from bulk tissue profiles with existing and future pipelines. 
Description Day of Ideas 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact The Institute for Data Science and Artificial Intelligence (IDSAI) Day of Ideas was an industry networking event at Exeter Castle showcasing some of the interdisciplinary data science and artificial intelligence research that is ongoing at the University of Exeter. The event was opened with a welcome from Professor Lisa Roberts, the Vice-Chancellor of the University, who then invited Professor Sir Adrian Smith, the Director and Chief Executive of the Alan Turing Institute and President of the Royal Society, to speak about the landscape of data science and artificial intelligence within the UK. Guests were then invited by Professor Richard Everson, Director of IDSAI, to see some of the work we do here at the Institute. I developed and hosted the Health Stand showcasing the genomic profiling and modelling happening as part of my Fellowship.
Year(s) Of Engagement Activity 2022