Machine Learning Methods to Re-annotate Histone Modifications with Locus-specific Functional Classification

Lead Research Organisation: University of Dundee
Department Name: School of Life Sciences

Abstract

The human body contains about 200 different cell types, e.g. nerve or blood cells, each with its own specific appearance and function. To carry out their proper roles, they all execute different sets of genetic programs while containing an identical copy of the complete genomic instructions (the DNA), which is passed down from a single parent cell. In specialized cells, the majority of programs are switched off, allowing them to efficiently focus on a given task. This is what epigenetic mechanisms do: they package and organize the DNA, such that certain bits are shielded away and silenced, while other parts are accessible and readily executable.
As such, epigenetic mechanisms are vital for normal development and health. For instance, in the absence of certain epigenetic factors embryonic stem cells fail to differentiate. Epigenetic malfunctioning has also been observed in various diseases: For example, if normally silenced programs become activated, cells may change their identity; white blood cells, for instance, can turn into cancerous cells when their epigenetic machinery is faulty.
The epigenome comprises a number of chemical alterations, which exist 'on top' of the DNA sequence itself. For example, at the occurrence of certain DNA sequence features, methyl groups can be added to the DNA to silence corresponding genetic elements. Additionally, the DNA sequence is wrapped around histone proteins, forming a "beads-on-a-string" type of architecture. By chemically modifying individual histone proteins, neighboring 'beads' can be brought into tight contact with each other, thus forming dense and inaccessible regions of DNA. Alternatively, a different set of histone modifications can result in open and accessible DNA domains.
Histone modifications are dynamically established by a large set of different enzymes, so-called 'epigenetic writers'. They can also be actively removed by a number of specific 'epigenomic erasers'. The epigenomic patterns thus established are recognized by 'epigenetic readers'. Interestingly, some steady-state epigenomic modifications are remarkably well correlated with transcriptional activity, suggesting that effector proteins are indeed providing a read-out of epigenomic patterns. These findings have led to the histone code hypothesis, according to which transcriptional activity is regulated by epigenomic modifications. However, despite intense research and substantial progress in our understanding of epigenetic mechanisms, the histone code has remained enigmatic.
Technological advances in the measurement of epigenomic snapshots have led to an explosion of available data. Yet owing to the high complexity and changing nature of these marks, a precise understanding of their meaning and readout is lacking. Today, I see a unique opportunity to tackle this challenge with the help of sophisticated machine learning technologies: these methods use computer systems to 'learn' hidden relationships from large data sets. I will build new computational tools to capture the molecular mechanisms underpinning the dynamic changes of epigenomic marks. Along with my co-investigator, I suggest cycling between sophisticated computational predictions and wet lab experiments that provide dynamic profiles of epigenomic patterns. In particular, we plan to disturb the epigenetic machinery by rapidly degrading individual writers to observe how their action orchestrates operations of other writers and readers. I will also use statistical methods to analyse the spatiotemporal correlation between dynamic epigenomes and changing gene expression. This project will benefit from the existing epigenomic expertise at Dundee University, and our efforts will in turn inform on-going projects to understand epigenetic contributions to healthy development and disease. In addition, parts of the project will be carried out at the Cyber Valley Campus Tuebingen, which hosts some of the world leaders in causal machine learning techniques.

Planned Impact

Given that machine learning (ML) and artificial intelligence (AI) are enormous growth areas with impacts on industry, health, social services, education and the basics of everyday life, there are multiple beneficiaries of our research.

The Public: People are curious and anxious about the possible impacts of ML and AI. They need a chance to explore possible benefits and impacts in an environment of trust. They find applications of AI/ML to healthcare most likely to be useful. My research offers an exemplar with which to reach out to the public and contribute to an open two-way debate that may both reassure them and help inform policymakers/industry about concerns to be addressed. This can be achieved within the duration of the fellowship. Facilitating public discussion may help AI/ML ultimately gain public acceptance. Major impacts on global economic performance and the competitiveness of the UK are foreseen by 2030 if AI technology is accepted and introduced, e.g. global GDP could be 14% higher than baseline projections ($15.7 trillion additional activity) and UK GDP will be 10.3% higher (additional £232 billion). This will have knock-on effects on public earnings and well-being, contributing £1,800-£2,300 additional spending power per household ('The economic impact of artificial intelligence on the UK economy' https://www.pwc.co.uk/economic-services/assets/ai-uk-report-v2.pdf). In the longer term (5 years onwards), increased understanding of epigenetic patterns gained in this project will contribute to new insights into the epigenomic factors underlying human disease, including cancer, neurological and autoimmune disorders. This may directly contribute to the development of epigenomic biomarkers for early diagnosis. It may also contribute to the identification of potential target regions for epigenome editing, which holds great promise for reversing pathological epigenomic conditions.

Policymakers: Putting an effective regulatory framework and policies in place will be crucial to the success of AI and ML technologies ('Artificial Intelligence: Public Perception, Attitude and Trust' https://www.bristows.com/assets/pdf/Artificial%20Intelligence_%20Public%20Perception%20Attitude%20and%20Trust%20(Bristows).pdf). Our work in Cyber Valley has received substantial attention from policymakers. The engagement that we envisage with the public can also help inform policymakers of the regulatory issues people are concerned about, while academics can also inform on the technology, potentially in a more impartial way than industry.

Industry: Public acceptance of AI and ML will foster global and UK economic productivity - our work illustrating how ML can help biomedical research could contribute to the debate and their ultimate use. More specific commercial uptake of our strategies and tools is also possible and will be promoted by my association with Cyber Valley and the extensive industry contacts made by the School of Life Sciences in Dundee. Cyber Valley partner companies include Amazon, BMW AG, Daimler AG, IAV GmbH, Porsche AG, Robert Bosch GmbH, and ZF Friedrichshafen AG plus numerous smaller companies and startups.

Workforce: We will contribute to the development of a skilled workforce versed in general and specific applications of ML through training workshops and supervision of undergraduate and postgraduate students.

Researchers: The computational tools we develop will directly benefit a large number of scientists studying epigenomic control mechanisms. Our intensive training sessions will reach researchers from different backgrounds and make them familiar with our software solutions. Our loss-of-function experiment datasets will be made Open Access and can be used as a testing playground for future ML developments. The long-term impact will be that more ML experts will work on epigenomic data sets, which will help bridge the gap between disciplines and bring additional benefit to the epigenomics community.
 
Description Using advanced machine learning models, we have developed several computational tools that analyse, interpret and impute large amounts of biomedical data on the human epigenome.
Exploitation Route Epigenomic changes are involved in a large number of diseases, from neurodevelopmental disorders to tumorigenesis. However, despite their importance, the 'epigenomic code' is not yet solved. We are providing computational tools that will help to achieve this task. Solving it will be essential for making accurate predictions of intervention outcomes in a precision medicine context.
Sectors Healthcare

Pharmaceuticals and Medical Biotechnology

 
Description As AI applications to biological sciences and health are having an ever bigger impact on our societies as a whole, I have created a new network called SaiREN (social AI research and education network) with colleagues from the political and social sciences and philosophy. We are working on an application to the PlusFund to enlarge this network. We are planning to study the wider aspects of AI applications in health for our society and also to provide interdisciplinary training as well as public engagement activities.
First Year Of Impact 2021
Sector Other
Impact Types Cultural

Societal

Economic

Policy & public services

 
Description GPU-based Machine Learning System for fundamental biological research
Amount £406,349 (GBP)
Funding ID BB/V019805/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 07/2021 
End 05/2022
 
Description Unlocking the Alternative Splicing Code
Amount £162,556 (GBP)
Funding ID BB/Y513040/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 03/2024 
End 02/2025
 
Title DecoDen: Tool to remove biases from ChIP-Seq data 
Description Uses multiple histone modification assays to remove measurement bias. Based on non-negative matrix factorisation (NMF) and half-sibling regression. 
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact Will help make sense of histone modification patterns for precision medicine 
URL https://github.com/ntanmayee/DecoDen
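
The half-sibling regression idea named in the DecoDen description can be illustrated with a minimal sketch. This is not the DecoDen implementation: it assumes a single control track that shares the measurement bias but carries none of the signal, and it omits the NMF step entirely. The function name and the toy values are hypothetical.

```python
def half_sibling_residual(observed, control):
    """Regress `observed` on `control` (ordinary least squares with an
    intercept) and return the residual, i.e. the part of `observed`
    that the shared-bias control track cannot explain."""
    n = len(observed)
    mean_y = sum(observed) / n
    mean_c = sum(control) / n
    cov = sum((y - mean_y) * (c - mean_c) for y, c in zip(observed, control))
    var = sum((c - mean_c) ** 2 for c in control)
    if var == 0:
        raise ValueError("control track is constant; cannot regress on it")
    slope = cov / var
    intercept = mean_y - slope * mean_c
    return [y - (intercept + slope * c) for y, c in zip(observed, control)]


# Toy data (assumed): a zero-mean signal that is uncorrelated with the
# control track, plus a bias term equal to twice the control track.
control = [1.0, 2.0, 3.0, 4.0]
signal = [1.0, -1.0, -1.0, 1.0]
observed = [s + 2.0 * c for s, c in zip(signal, control)]

cleaned = half_sibling_residual(observed, control)
```

In this toy case the residual recovers the signal because the signal was constructed to be uncorrelated with the control. With real assays, any signal component that correlates with the control would be removed as well.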
 
Title DecoFlex 
Description Tool to infer the cellular composition of complex samples (tissues) from bulk RNA-Seq data, making use of partially matched reference single-cell data sets. 
Type Of Material Data analysis technique 
Year Produced 2023 
Provided To Others? Yes  
Impact Tumor samples as well as complex tissues like the human brain are composed of many different cell types. Bulk measurements are still widely used to infer transcriptional properties of these samples. These measurements provide averaged values across all cells in the sample. The average is, however, affected by changes in the transcription of individual cells as much as by the proportions of different cell types relative to each other. Single-cell technologies, on the other hand, can provide cellular resolution. These experiments remain significantly more expensive than bulk measurements, and in addition, they only provide incomplete pictures, with many lowly expressed genes escaping detection (drop-outs). Using our computational tool, DecoFlex, we can now have the best of both worlds: we can accurately infer the composition of bulk measurements and also make predictions of cell-type-specific expression patterns for cell types that are not present in the single-cell reference. 
URL https://github.com/crhisto/DecoFlex
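
The point that a bulk measurement is a proportion-weighted average of cell-type profiles can be made concrete with a small sketch. It is not DecoFlex itself: it assumes only two cell types with fully known reference profiles, and the function names and expression values are hypothetical.

```python
def mix(profile_a, profile_b, frac_a):
    """Simulate a bulk profile as a proportion-weighted average."""
    return [frac_a * a + (1.0 - frac_a) * b
            for a, b in zip(profile_a, profile_b)]


def estimate_fraction(bulk, profile_a, profile_b):
    """Least-squares estimate of the fraction of cell type A.

    Solves bulk - b = p * (a - b) for p across all genes, then clips
    the result to the valid range [0, 1]."""
    num = sum((y - b) * (a - b)
              for y, a, b in zip(bulk, profile_a, profile_b))
    den = sum((a - b) ** 2 for a, b in zip(profile_a, profile_b))
    if den == 0:
        raise ValueError("reference profiles are identical")
    return min(1.0, max(0.0, num / den))


# Toy reference profiles over four genes (assumed values).
neuron = [10.0, 0.0, 5.0, 2.0]
glia = [1.0, 8.0, 5.0, 4.0]

bulk = mix(neuron, glia, 0.3)          # 30% neurons, 70% glia
estimated = estimate_fraction(bulk, neuron, glia)
```

The third gene has the same expression in both profiles and therefore contributes nothing to the estimate, which is why informative marker genes matter in practice. Handling more than two cell types, or cell types missing from the reference, needs a constrained multi-variable fit rather than this closed form.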
 
Title eDICE: Deep learning tool to impute personal tissue-specific histone modification patterns 
Description Deep learning tool to impute personal tissue-specific histone modification patterns 
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact We used the predecessor model, imp, to participate in the international ENCODE Imputation Challenge and won 3rd place. 
URL https://github.com/alex-hh/eDICE
 
Description Science Award Ceremony (Gips-Schuele Stiftung, Stuttgart, Germany) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact I was the invited speaker at a science award dinner. The audience was composed of invited guests from politics, business leaders, leading academics, and charitable organisations. The main theme was responsible and sustainable solutions to pressing scientific and societal problems.
Year(s) Of Engagement Activity 2023