Leveraging functional profiling datasets with machine learning to uncover proteins and cellular processes important for ageing

Lead Research Organisation: University College London
Department Name: Genetics Evolution and Environment

Abstract

Ageing is the largest risk factor for most human diseases in developed countries, including progressive diseases such as Alzheimer's and Parkinson's, diseases like cancer that show variable rates of onset, and catastrophic system failures such as heart attack and stroke. While the study of specific disease processes has long been a major focus of research, there is a growing realization of the importance of studying the normal ageing process itself as an essential part of the problem, and of exploring ways to slow or reverse its effects. Ageing is a multi-factorial process that can be seen as an inevitable feature of the ravages of time. Recent discoveries, however, demonstrate that ageing can be modified in dramatic ways by simple interventions. For example, single gene knockouts can delay ageing and improve health late in the life of laboratory animals. The processes involved in ageing are similar in different organisms, and genetic mutations affecting these processes are associated with longevity in humans. A central challenge of ageing research, however, remains to tease out a complete and unified picture of the biological factors and processes determining lifespan.

Ageing is highly complex and affected by diverse proteins and processes. Modern biological assays can simultaneously measure properties and interactions of thousands of proteins or genes, but it is challenging to make sense of such large datasets. Advances in computational data-analysis methods, called 'machine learning', provide exciting opportunities to get the most from large biological datasets and thus increase our understanding of complex processes like ageing. Machine learning can find hidden patterns in data that are too complex for humans to process. Advances in computer power, algorithms and data sizes allow recent machine-learning architectures (known as 'deep learning') to accurately find and classify intricate patterns in combined datasets of different types.

We plan to use fission yeast as a model organism, together with multi-step machine learning, to comprehensively identify biological processes with fundamental importance for ageing. Remarkably, many of these processes are similar from yeast to human, but are much easier to study in the simple yeast. Yeast cells enter a dormant, non-dividing state under limiting nutrients. Such dormant cells provide a useful system to analyse proteins and processes affecting the lifespan in this state. In previous studies, we have identified 116 proteins that, when absent, allow the yeast to live longer (long-lived knockout mutants). So these proteins are involved in ageing, and can be used to train machine-learning programs to predict new ageing proteins by a method known as 'guilt by association'. We will combine large systematic data on mutant features (phenotypes) with diverse existing data to empower the machine-learning predictor. We will test the predicted ageing proteins in the laboratory for lifespan effects in yeast, and feed this information back to the computer for it to learn more about ageing proteins. We will then use mutants of the new ageing proteins identified by the computer and confirmed in yeast to measure links with all other mutants. Such 'genetic-interaction' data provide rich information on functional relationships, which will be used to explore other, potentially more powerful deep-learning methods to predict the biological processes that are involved in ageing. We will then test the most attractive predictions with laboratory experiments. Moreover, we will make all the new data, methods and predictions available to interested scientists to help with their research. We anticipate that this project, using intimate cycles of experiments and machine-learning, will provide a valuable platform to better understand all the biological factors involved in ageing, to eventually develop interventions that extend healthy lifespan in humans.
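The 'guilt by association' idea described above can be illustrated with a minimal sketch. It assumes a weighted protein network stored as nested dictionaries and scores each candidate by the share of its link weight that connects to known ageing proteins. The protein names, weights and scoring rule are hypothetical and chosen only for illustration; this is not the predictor used in the project.

```python
from typing import Dict, List, Set, Tuple


def guilt_by_association(
    network: Dict[str, Dict[str, float]],
    seeds: Set[str],
) -> List[Tuple[str, float]]:
    """Rank non-seed proteins by the fraction of their link weight going to seeds."""
    scores: Dict[str, float] = {}
    for protein, neighbours in network.items():
        if protein in seeds:
            continue  # known ageing proteins are not candidates
        total = sum(neighbours.values())
        to_seeds = sum(w for n, w in neighbours.items() if n in seeds)
        scores[protein] = to_seeds / total if total > 0 else 0.0
    # Highest score first; ties are broken alphabetically for reproducibility
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy network with hypothetical protein names and association weights
toy_network = {
    "seedA": {"cand1": 1.0, "cand2": 0.5},
    "seedB": {"cand1": 1.0},
    "cand1": {"seedA": 1.0, "seedB": 1.0, "cand3": 1.0},
    "cand2": {"seedA": 0.5, "cand3": 1.5},
    "cand3": {"cand1": 1.0, "cand2": 1.5},
}

ranking = guilt_by_association(toy_network, {"seedA", "seedB"})
for name, score in ranking:
    print(f"{name}\t{score:.2f}")
```

In this toy network the candidate linked most strongly to the two seed proteins ranks first, and a candidate with no direct links to seeds ranks last.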

Technical Summary

We want to establish comprehensive sets of proteins and biological processes involved in cellular ageing in fission yeast. This project combines functional-profiling experiments (large-scale phenotyping and genetic-interaction assays) with powerful new machine-learning prediction algorithms. Our integrated approach will benefit from iterated computational predictions and experimental validation, with different techniques used in two computational stages to get the most from the rich experimental data. The first computational stage will apply Bayes Multiple Kernel Learning, informed by phenotyping and heterogeneous network/homology datasets, to rank proteins based on their predicted associations with 116 ageing-associated proteins that we recently identified. In each of 5 iterations, we will test 50 top-ranked proteins for altered lifespans in the corresponding mutants to improve the predicted ranking in the next iteration. We will then screen the top-125 validated ageing proteins for genetic interactions using Synthetic Genetic Array analyses. In the second computational stage, we will exploit our functional-profiling data, integrated with in-house homology and network data, to build deep-learning predictors for GO Biological Processes relevant to ageing.
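The iterated cycle of prediction and validation in the first computational stage can be sketched as a simple loop. The defaults of 5 rounds and 50 proteins per round come from the summary above; everything else is an assumption made for illustration. In particular, `score` is a placeholder for any ranking model (it is not Bayesian multiple kernel learning) and `assay` is a placeholder for the laboratory lifespan test.

```python
from typing import Callable, Dict, List, Set


def iterate_predictions(
    candidates: Set[str],
    seeds: Set[str],
    score: Callable[[str, Set[str]], float],
    assay: Callable[[str], bool],
    rounds: int = 5,
    batch_size: int = 50,
) -> Dict[str, List[str]]:
    """Rank untested proteins, test the top batch, and feed results back."""
    positives = set(seeds)
    untested = set(candidates) - positives
    validated: List[str] = []
    rejected: List[str] = []
    for _ in range(rounds):
        # Re-rank with the current positives, so earlier results shape later rounds
        ranked = sorted(untested, key=lambda p: (-score(p, positives), p))
        batch = ranked[:batch_size]
        if not batch:
            break
        for protein in batch:
            if assay(protein):
                positives.add(protein)
                validated.append(protein)
            else:
                rejected.append(protein)
        untested.difference_update(batch)
    return {"validated": validated, "rejected": rejected}


# Toy example: hypothetical proteins placed on a line; closeness to a known
# positive stands in for predicted association, and position < 6 stands in
# for a confirmed lifespan effect.
position = {f"p{i}": i for i in range(20)}


def toy_score(protein: str, positives: Set[str]) -> float:
    return -min(abs(position[protein] - position[q]) for q in positives)


def toy_assay(protein: str) -> bool:
    return position[protein] < 6


result = iterate_predictions(
    set(position), {"p0"}, toy_score, toy_assay, rounds=3, batch_size=2
)
print(result)
```

The point of the loop is that each round's confirmed proteins enlarge the positive set before the next ranking, so the ranking can reach candidates that were not close to the original seeds.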

CAFA-2 recently ranked our CATH homology-based predictor top; CATH is unique in providing functional sub-families that outperform Pfam for functional purity. Combining this unique data with the functional-profiling data generated in this project will enhance the power of our predictors optimized for ageing-related processes. Deep learning is computationally expensive, but advances in computing (e.g. GPUs are ~15x faster than CPUs) and efficient code bases (e.g. TensorFlow) are helping in this respect. Furthermore, we have access to the JADE Centre for Deep Learning Computation, which provides excellent computational facilities to speed up our training and investigation of many architectures.

Planned Impact

Who will benefit from this research?
This proposed research is basic by its nature, and the immediate impacts from this work relate to scientific and knowledge advancement and the development of skills, capacity and capability. In the longer term, this research has the potential to impact areas of wealth and health. Beneficiaries beyond academia therefore are the commercial private sector and the wider public.

How will they benefit from this research?
The proposed research takes state-of-the-art experimental and computational approaches to address fundamental questions relating to biological processes involved in ageing. The research will deliver increased capacity and capability in strategically relevant areas of genomics and machine learning, through the provision of inter-disciplinary training and the further development of key methods and resources. Establishment of these methods is significant as they have a wide range of applications that reach beyond basic science into fields relating to human healthy ageing, the commercial (pharmaceutical) sector and beyond. The commercial sector might benefit by recruiting highly skilled and experienced scientists trained through this project.

Ultimately, the pharmaceutical sector will benefit from the experimental data that we will make publicly available through the DeepAge resource. It might also benefit by exploiting fresh drug targets (ageing-associated proteins that slow ageing when down-regulated) to reduce the effects of ageing as the major risk factor for multiple diseases. The ageing population is a huge and increasing problem in our society, with enormous cost implications due to the economic and social burden of the rise in associated diseases and diminished quality of life for both patients and carers. Any measures that promote healthy ageing will be of broad-ranging benefit to our society with respect to the economy, quality of life, health and creative output. In the longer term, the general public may thus benefit from our fundamental contribution to the understanding of genetic mechanisms and universal principles involved in ageing-related phenotypes, which will guide and empower research in more complex systems and may help to develop safe, broad-spectrum preventative measures against age-associated diseases.

Immediate and concrete deliverables with respect to impact beyond academia will be in public engagement, which we recognize as an important responsibility of scientists. We already have experience and established links that will facilitate good communication and public engagement of the research outputs. Details of our specific plans and timelines with respect to public engagement are outlined in the Pathways to Impact.

Publications

Description Comprehensive functional information for all available deletion mutants in fission yeast, encompassing over 100 experimental conditions, including dozens of conditions relevant for cellular ageing. This provides a rich resource to better understand gene function and diverse new factors that are potentially important for cellular lifespan.
Computational predictions of gene function based on machine-learning approaches.
Exploitation Route The experimental and computational predictions provide a rich resource and specific hypotheses for follow-on research.
Sectors Healthcare, Pharmaceuticals and Medical Biotechnology

 
Title Microbial phenomics platform 
Description Microbial fitness screens are a key technique in functional genomics. We developed an all-in-one solution, pyphe, for automating and improving data analysis pipelines associated with large-scale fitness screens, including image acquisition and quantification, data normalisation, and statistical analysis. Pyphe is versatile and processes fitness data from colony sizes, viability scores from phloxine B staining or colony growth curves, all obtained with inexpensive transilluminating flatbed scanners. 
Type Of Material Technology assay or reagent 
Year Produced 2020 
Provided To Others? Yes  
Impact Pyphe is user-friendly, open-source and fully documented, illustrated by applications to diverse fitness analysis scenarios. 
 
Title Pyphe: A python toolbox for assessing microbial growth and cell viability in high-throughput colony screens 
Description Microbial fitness assays are a powerful genetic approach for discovering gene function and for drug screens. Despite the popularity and importance of microbial screens, a consensus data framework has not emerged so far. We present a toolbox and underlying python package, named pyphe, to assemble a versatile pipeline for analysing fitness-screen data. Pyphe is an all-in-one solution for image acquisition, quantification, batch/plate bias correction, data reporting and hit calling, which enables the implementation of data analysis pipelines associated with large-scale assays for both microbial colony-growth and cell viability. Pyphe automates and improves phenome analysis at massive scale, so that hundreds of thousands to millions of growth assays can be efficiently analysed in parallel. Using a set of diverse wild yeast strains, we show that the fitness correction approach implemented in pyphe effectively reduces noise in the data. We find that late endpoint measurements of colony sizes perform similarly to maximum growth slopes from time series. Moreover, we show that accurate colony viability quantified by pyphe provides a largely orthogonal and independent readout to colony sizes, thus offering a complementary trait for genetic profiling. Pyphe is an open-source toolbox designed to be flexible, modular and user-friendly, here illustrated by its applications to a variety of fitness-analysis scenarios. This versatile tool will be widely useful to microbiologists, geneticists and pharmacologists interested in functional genomics, phenomics and drug screens with prokaryotic or eukaryotic microbes.
Type Of Technology Webtool/Application 
Year Produced 2020 
Impact Used for large-scale phenotyping of S. pombe gene-deletion collection and libraries of non-coding RNA mutants. 
URL https://www.biorxiv.org/content/10.1101/2020.01.22.915363v1
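The two fitness readouts compared in the description above, a late endpoint colony size and the maximum growth slope from a time series, can be illustrated with a short sketch. It is written in plain Python and does not use or reproduce pyphe's own functions; the function name, time points and colony sizes are hypothetical.

```python
from typing import Sequence, Tuple


def endpoint_and_max_slope(
    times: Sequence[float], sizes: Sequence[float]
) -> Tuple[float, float]:
    """Return the final colony size and the steepest slope between time points."""
    if len(times) != len(sizes) or len(times) < 2:
        raise ValueError("need at least two matched time points and sizes")
    # Slope between each pair of consecutive measurements
    slopes = [
        (sizes[i + 1] - sizes[i]) / (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    ]
    return sizes[-1], max(slopes)


# Hypothetical growth curve: scan times in hours, colony sizes in pixels
hours = [0, 2, 4, 6, 8]
pixels = [10, 12, 20, 26, 28]

endpoint, max_slope = endpoint_and_max_slope(hours, pixels)
print(endpoint, max_slope)
```

The endpoint needs only one late scan per plate, whereas the slope needs the full time series, which is why similar performance of the two readouts matters for screen throughput.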
 
Description Hosting of in2science UK pupils 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? Yes
Geographic Reach Local
Primary Audience Schools
Results and Impact Two pupils decided to study biology at university.

Inspiring talented students from disadvantaged backgrounds to study science.
Year(s) Of Engagement Activity 2012,2013,2014
URL http://in2scienceuk.org/
 
Description Hosting of in2science UK pupil 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Other audiences
Results and Impact in2science UK (http://in2scienceuk.org/) is a charity originating from UCL. This scheme promotes science and research to pupils from disadvantaged backgrounds by giving talented students, currently completing Science AS levels in deprived schools, the opportunity to work alongside practising scientists for ~2-week stints. Such placements and support help gifted pupils gain access to top universities. Given the impact of high fees on this group of budding scientists, in2science UK provides extra encouragement. We hosted two pupils in summer 2019.
Year(s) Of Engagement Activity 2016,2017,2018,2019,2020
URL http://in2scienceuk.org/