Machine Learning Methods to Re-annotate Histone Modifications with Locus-specific Functional Classification

Lead Research Organisation: University of Dundee


The human body contains about 200 different cell types, e.g. nerve or blood cells, each with its own specific appearance and function. To carry out their proper roles, they all execute different sets of genetic programs while containing an identical copy of the complete genomic instructions (the DNA), which is passed down from a single parent cell. In specialized cells, the majority of programs are switched off, allowing them to efficiently focus on a given task. This is what epigenetic mechanisms do: they package and organize the DNA, such that certain bits are shielded away and silenced, while other parts are accessible and readily executable.
As such, epigenetic mechanisms are vital for normal development and health. For instance, in the absence of certain epigenetic factors embryonic stem cells fail to differentiate. Epigenetic malfunctioning has also been observed in various diseases: For example, if normally silenced programs become activated, cells may change their identity; white blood cells, for instance, can turn into cancerous cells when their epigenetic machinery is faulty.
The epigenome comprises a number of chemical alterations, which exist 'on top' of the DNA sequence itself. For example, at the occurrence of certain DNA sequence features, methyl groups can be added to the DNA to silence corresponding genetic elements. Additionally, the DNA sequence is wrapped around histone proteins, forming a "beads-on-a-string" type of architecture. By chemically modifying individual histone proteins, neighboring 'beads' can be brought into tight contact with each other, thus forming dense and inaccessible regions of DNA. Alternatively, a different set of histone modifications can result in open and accessible DNA domains.
Histone modifications are dynamically established by a large set of different enzymes, so-called 'epigenetic writers'. They can also be actively removed by a number of specific 'epigenomic erasers'. The resulting epigenomic patterns are recognized by 'epigenetic readers'. Interestingly, some steady-state epigenomic modifications are remarkably well correlated with transcriptional activity, suggesting that effector proteins are indeed providing a read-out of epigenomic patterns. These findings have led to the histone code hypothesis, according to which transcriptional activity is regulated by epigenomic modifications. However, despite intense research and substantial progress in our understanding of epigenetic mechanisms, the histone code has remained enigmatic.
Technological advances in the measurement of epigenomic snapshots have led to an explosion of available data. Yet owing to the high complexity and changing nature of these marks, a precise understanding of their meaning and readout is lacking. Today, I see a unique opportunity to tackle this challenge with the help of sophisticated machine learning technologies: these methods use computer systems to 'learn' hidden relationships from large data sets. I will build new computational tools to capture the molecular mechanisms underpinning the dynamic changes of epigenomic marks. Along with my co-investigator, I propose cycling between sophisticated computational predictions and wet lab experiments that provide dynamic profiles of epigenomic patterns. In particular, we plan to disturb the epigenetic machinery by rapidly degrading individual writers to observe how their action orchestrates the operations of other writers and readers. I will also use statistical methods to analyse the spatiotemporal correlation between dynamic epigenomes and changing gene expression. This project will benefit from the existing epigenomic expertise at Dundee University, and our efforts will in turn inform on-going projects to understand epigenetic contributions to healthy development and disease. In addition, parts of the project will be carried out at the Cyber Valley Campus Tuebingen, which hosts some of the world leaders in causal machine learning techniques.

Planned Impact

Given that machine learning (ML) and artificial intelligence (AI) are enormous growth areas with impacts on industry, health, social services, education and the basics of everyday life, there are multiple beneficiaries of our research.

The Public: People are curious and anxious about the possible impacts of ML and AI. They need a chance to explore possible benefits and impacts in an environment of trust. They find applications of AI/ML to healthcare most likely to be useful. My research offers an exemplar with which to reach out to the public and contribute to an open two-way debate that may both reassure them and help inform policymakers/industry about concerns to be addressed. This can be achieved within the duration of the fellowship. Facilitating public discussion may help AI/ML ultimately gain public acceptance. Major impacts on global economic performance and the competitiveness of the UK are foreseen by 2030 if AI technology is accepted and introduced, e.g. global GDP could be 14% higher than baseline projections ($15.7 trillion additional activity) and UK GDP will be 10.3% higher (additional £232 billion). This will have knock-on effects on public earnings and well-being, contributing £1,800-£2,300 additional spending power per household ('The economic impact of artificial intelligence on the UK economy'). In the longer term (5 years onwards), increased understanding of epigenetic patterns gained in this project will contribute to new insights into the epigenomic factors underlying human disease, including cancer, neurological and autoimmune disorders. This may directly contribute to the development of epigenomic biomarkers for early diagnosis. It may also contribute to the identification of potential target regions for epigenome editing, which holds great promise for reversing pathological epigenomic conditions.

Policymakers: Putting an effective regulatory framework and policies in place will be crucial to the success of AI and ML technologies ('Artificial Intelligence: Public Perception, Attitude and Trust'). Our work in Cyber Valley has received substantial attention from policymakers. The engagement that we envisage with the public can also help inform policymakers of the regulatory issues people are concerned about, while academics can also inform on the technology, potentially in a more impartial way than industry.

Industry: Public acceptance of AI and ML will foster global and UK economic productivity. By illustrating how ML can help biomedical research, our work could contribute to the debate and to the ultimate use of these technologies. More specific commercial uptake of our strategies and tools is also possible and will be promoted by my association with Cyber Valley and the extensive industry contacts made by the School of Life Sciences in Dundee. Cyber Valley partner companies include Amazon, BMW AG, Daimler AG, IAV GmbH, Porsche AG, Robert Bosch GmbH, and ZF Friedrichshafen AG, plus numerous smaller companies and startups.

Workforce: We will contribute to the development of a skilled workforce versed in general and specific applications of ML through training workshops and supervision of undergraduate and postgraduate students.

Researchers: The computational tools we develop will directly benefit a large number of scientists studying epigenomic control mechanisms. Our intensive training sessions will reach researchers from different backgrounds and make them familiar with our software solutions. Our loss-of-function experiment datasets will be made Open Access and can be used as a testing playground for future ML developments. The long-term impact will be that more ML experts will work on epigenomic data sets, which will help bridge the gap between disciplines and bring additional benefit to the epigenomics community.
Description Epigenomic patterns are chemical modifications on the DNA that do not change the sequence per se but influence which genetic programs are accessible or inaccessible in particular cell types. They provide a memory for cell identity as well as plasticity during differentiation. They are altered in disease, e.g. cancer or neurodevelopmental diseases, and have the potential to be early markers of disease. Most commonly, epigenomic marks are mapped using a technique called ChIP-Seq. We found and investigated biases in these measurements, which need to be corrected for if subtle changes in epigenetic patterns are to be picked up for precision medicine.
Exploitation Route We have created and released software to correct for biases in ChIP-Seq data. The tool will be further investigated by the data analysis group (DAG) in Dundee. This group provides research service for many other groups in Dundee's school of Life Science and Medicine. In the next step we will propagate the tool more widely in the UK and internationally.
Sectors Healthcare,Pharmaceuticals and Medical Biotechnology

Description As AI applications to biological sciences and health are having an ever bigger impact on our societies as a whole, I have created a new network called SaiREN (social AI research and education network) with colleagues from the political and social sciences and philosophy. We are working on an application to the PlusFund to enlarge this network. We are planning to study the wider aspects of AI applications in health for our society and also to provide interdisciplinary training as well as public engagement activities.
First Year Of Impact 2021
Sector Other
Impact Types Cultural,Societal,Economic,Policy & public services

Description GPU-based Machine Learning System for fundamental biological research
Amount £406,349 (GBP)
Funding ID BB/V019805/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 04/2021 
End 05/2022

Title DecoDen: Tool to remove biases from ChIP-Seq data 
Description Uses multiple histone modification assays to remove measurement bias. Based on non-negative matrix factorisation (NMF) and half-sibling regression. 
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact Will help make sense of histone modification patterns for precision medicine 
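
The two ingredients named in the description above, NMF and half-sibling regression, can be illustrated with a minimal sketch on synthetic data. This is not the DecoDen implementation: the function names (nmf, half_sibling_correct), the simulated tracks and all parameter values are assumptions made for illustration only. The sketch assumes that several control tracks share a non-negative bias profile with the treatment track and that the true signal is independent of that bias.

```python
import numpy as np


def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Lee-Seung multiplicative updates for V ~ W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def half_sibling_correct(treatment, background):
    """Remove the part of `treatment` that is linearly predictable
    from `background` (half-sibling regression with a linear model)."""
    X = np.column_stack([background, np.ones_like(background)])
    coef, *_ = np.linalg.lstsq(X, treatment, rcond=None)
    return treatment - X @ coef


# Synthetic example (all values are illustrative assumptions).
rng = np.random.default_rng(42)
n_bins = 2000
bias = rng.gamma(2.0, 1.0, n_bins)  # shared measurement bias per genomic bin
signal = np.zeros(n_bins)           # sparse "true" enrichment
signal[rng.choice(n_bins, 100, replace=False)] = 10.0

# Three control tracks see only the bias; the treatment sees signal + bias.
controls = np.vstack(
    [a * bias + 0.1 * rng.random(n_bins) for a in (0.8, 1.0, 1.2)]
)
treatment = signal + 1.5 * bias + 0.1 * rng.random(n_bins)

# Rank-1 NMF summarises the controls into one shared background profile.
W, H = nmf(controls, rank=1)
background = H[0]

# Regress the treatment on the background and keep the residual.
corrected = half_sibling_correct(treatment, background)

r_raw = np.corrcoef(treatment, signal)[0, 1]
r_corrected = np.corrcoef(corrected, signal)[0, 1]
print(f"correlation with true signal: raw={r_raw:.2f}, corrected={r_corrected:.2f}")
```

In this toy setting the corrected track should follow the simulated signal more closely than the raw treatment track does. Real ChIP-Seq data would call for a count-appropriate noise model and more than one NMF component.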

Title eDICE: Deep learning tool to impute personal tissue-specific histone modification patterns 
Description Deep learning tool to impute personal tissue-specific histone modification patterns 
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact We used the predecessor model, imp, to participate in the international ENCODE Imputation Challenge and won 3rd place.