Predicting Mammalian Gene Essentiality

Lead Research Organisation: University of Manchester
Department Name: Life Sciences

Abstract

All organisms have a variety of genes encoded in their DNA. These genes perform diverse functions, and together control the overall attributes of an individual. Although many genes are needed for the survival and health of each individual, some genes are more critical than others. Those genes that are absolutely required for the survival of an organism are termed 'essential genes'. Determining which genes are essential in an organism has important benefits. For example, knowing which genes are essential in pathogenic organisms such as bacteria or fungi enables the development of new drugs targeted at those essential genes to eradicate infection.

In terms of human biology, the identification of essential genes is a key progression from the sequencing of the human genome. We have gained great knowledge of the complexity of genome architecture and the diversity of functions encoded by the approximately 21,000 protein-coding genes found in the human genome. Yet we still do not know specifically which minimal set of genes is absolutely required for survival. As it is not possible to examine gene essentiality experimentally in humans, experimental data from the mouse can be used to infer mammalian gene essentiality. The mouse shows great similarity to the human both in genome structure and gene content, and as such is a valuable source of data for modelling mammalian gene essentiality. From experiments that deleted single genes, it is estimated that approximately 40% of mouse genes are essential. However, specific annotations of essential or non-essential have only been applied to approximately 6,000 mouse genes, leaving an additional 17,000 genes to annotate.

This project will fulfil the need for genome-wide annotations of essentiality by developing a computational model to predict if a gene better fits the profile of an essential gene or a non-essential gene. We will then generate predictions for the 17,000 mouse genes lacking experimental data. We will perform experimental validation on a small set of our predicted genes to test the accuracy of our model, and create a searchable database of our predictions as a public resource.

The availability of this database will be of benefit to scientists in diverse fields. Developmental biologists study the function of genes during embryonic stages to gain insights into how a fertilised egg becomes a mature organism. By definition, the genes we identify as essential will be required during development. Therefore, a large number of developmentally important genes will be annotated from our work. Additionally, our work will facilitate the identification of human genetic disease candidates, because the disease symptoms can be matched to genes with a similar essentiality prediction. Deficiencies in developmental genes are often associated with birth defects, which affect 1 in 40 births in Europe, generating large social, medical, and economic impacts. This work will be informative for evolutionary biologists, as there is much interest in determining how gene essentiality is conserved from simple to more complex organisms. Synthetic biologists defining the minimal components for cellular function will also find our predictions useful, because they will provide annotations for a complete mammalian genome to serve as instructions for building a complex mammalian cell. Finally, an international large-scale project to examine hundreds of mouse genes through gene deletion experiments has recently started. Although significant advances in technology have expedited this research, it is still costly and time-consuming. Therefore, our essentiality predictions will inform these experimental projects by allowing investigators not interested in developmental processes to select non-essential genes for further experimental work, and vice versa. Overall, this work will inform many future experiments and further our understanding of the hallmarks of essential genes.

Technical Summary

Essential genes are those that are required for the survival of an organism. Because knowledge of mammalian essential genes on a genomic scale is lacking, we propose to fill this gap by using machine learning to build a classifier that predicts mammalian gene essentiality. We have generated training datasets of known essential and non-essential mammalian genes from mouse knockout experiments. We have collected features of these gene sets, including sequence, function, and network connectivity, and found many parameters that vary significantly between the two groups. Using random forests, we will develop a machine-learning model to predict mammalian gene essentiality. Feature selection will be employed to refine our model. We will then collect informative parameters for the remainder of the genes in the mouse genome lacking experimentally defined essentiality data (currently 17,000), and use our classifier to predict essentiality of this new test gene set. The test set will include genes from mouse knockout experiments performed after the start of the project; because these genes are not in our training sets, they will allow us to confirm the accuracy of our model. We will evaluate the applicability of our model to other mutagenesis techniques by sequencing embryonic lethal mouse mutants created in a random chemical mutagenesis screen. As these mutants display embryonic lethality, the mutated genes inherently meet our definition of essential genes. We will compare essentiality predictions for genes in the mutagenesis candidate region to the experimental data, determining the utility of our predictions for prioritising candidate genes in mouse mutagenesis experiments. We will provide a searchable database of our essentiality predictions as a resource, which will be useful for selecting genes for further study in mouse knockout experiments and identifying candidate genes for human genome-wide association studies.
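
The train-then-validate workflow described above can be illustrated with a minimal sketch. It is not the project's model: it is a small random forest written with the Python standard library only, trained on synthetic data. The feature names (network degree, conservation score, two noise columns), the distributions they are drawn from, and all function names are assumptions made for illustration. Feature selection is omitted for brevity.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_split(rows, labels, feature_ids):
    """Return the (feature, threshold) that most reduces weighted Gini, or None."""
    best, best_score, n = None, gini(labels), len(rows)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, rng, max_depth=5):
    """Grow one tree; each node considers a random subset of features."""
    n_features = len(rows[0])
    k = max(1, int(n_features ** 0.5))
    split = None
    if max_depth > 0 and len(set(labels)) > 1:
        split = best_split(rows, labels, rng.sample(range(n_features), k))
    if split is None:
        return sum(labels) / len(labels)  # leaf: fraction of essential genes
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]

    def grow(ids):
        return build_tree([rows[i] for i in ids], [labels[i] for i in ids],
                          rng, max_depth - 1)

    return (f, t, grow(left), grow(right))

def tree_predict(node, row):
    """Walk from the root to a leaf and return the leaf's essential fraction."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample of the training genes."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], rng))
    return forest

def predict_proba(forest, row):
    """Average the trees' votes into an essentiality score between 0 and 1."""
    return sum(tree_predict(tree, row) for tree in forest) / len(forest)

def make_genes(n, rng):
    """Synthetic genes (hypothetical features): [degree, conservation, noise, noise]."""
    rows, labels = [], []
    for i in range(n):
        essential = i % 2
        degree = rng.gauss(20 if essential else 5, 3)
        conservation = rng.gauss(0.9 if essential else 0.4, 0.1)
        rows.append([degree, conservation, rng.random(), rng.random()])
        labels.append(essential)
    return rows, labels

rng = random.Random(42)
train_rows, train_labels = make_genes(200, rng)  # stands in for known knockouts
test_rows, test_labels = make_genes(100, rng)    # stands in for later knockouts
forest = train_forest(train_rows, train_labels)
scores = [predict_proba(forest, row) for row in test_rows]
predicted = [int(s >= 0.5) for s in scores]
accuracy = sum(p == y for p, y in zip(predicted, test_labels)) / len(test_labels)
print(f"held-out accuracy: {accuracy:.2f}")
```

Scoring genes that were never seen during training mirrors the use of knockouts performed after the start of the project as an independent check. The same scores could also be sorted to rank the genes within a mutagenesis candidate region.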

Planned Impact

The beneficiaries of this work include biomedical academic and industrial researchers, as detailed in the academic beneficiaries section. In addition, there are commercial beneficiaries of this research. For example, industrial researchers seeking to minimise mammalian developmental toxicity of their compounds will benefit from the identification of mammalian essential genes. We have already identified users at Syngenta UK (see letter of support, Drs Jayne Wright and Dick Lewis). Our predictions can be used in the modelling of the toxicity of compounds to facilitate selection of those with minimal developmental damage. It is expected that other companies seeking to minimise off-target or developmental toxicity will use our resource to inform compound selection.

We will achieve third-sector impacts on charities supporting research into the causes and treatments of developmental birth defects such as the Newlife Foundation, British Heart Foundation, March of Dimes, and Sparks. These organisations will benefit from this project as an increased understanding of genes needed during embryonic development will facilitate the identification of genetic causes underlying birth defects. Our resource will be useful to the charities for publicity and fundraising.

This project will also have impacts on the general public through engagement activities aimed at providing a greater understanding of developmental biology and developmental defects. The researchers involved in this project also regularly participate in public engagement events such as "World Heart Day" and the Manchester "Science Spectacular" where we educate the public about human development and birth defects caused by mutations in essential genes. Additionally, we have a widening participation programme where students from schools located in deprived geographic areas that are under-represented at University attend a laboratory practical at the University of Manchester. In this practical exercise students and their teachers examine chicken embryo cardiac development and learn about essential genes. These activities will broaden the impact of our research to the general public.
 
Description We have developed a computer model for predicting essential genes, that is, the genes an organism needs in order to survive development. We have used the model to predict which genes in the genome are essential.
Exploitation Route Our predictions may eventually be used to match mouse knockout experiments to the appropriate researchers. They may also be used to investigate genes associated with abnormal development, mainly birth defects.
Sectors Healthcare

URL http://essentiality.ls.manchester.ac.uk/ http://blog.mousephenotype.org/using-machine-learning-to-identify-mammalian-essential-genes/
 
Description Kids Kidney Research Project Grant
Amount £85,000 (GBP)
Organisation Kids Kidney Research 
Sector Charity/Non Profit
Country United Kingdom
Start 10/2016 
End 10/2018
 
Title MED 
Description A model for predicting mammalian essential genes. 
Type Of Material Computer model/algorithm 
Provided To Others? No  
Impact None yet. 
URL http://essentiality.ls.manchester.ac.uk/
 
Description Kidney Gene Prediction 
Organisation NHS Manchester
Country United Kingdom 
Sector Public 
PI Contribution Provided expertise in machine learning.
Collaborator Contribution Provided expertise in kidney development.
Impact Obtained grant from Kids Kidney Research.
Start Year 2016
 
Description Loreto College Visit 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact 160 students visited the research organisation to perform a laboratory practical and meet scientists. The school reported increased enthusiasm for applying to STEM subjects for university.
Year(s) Of Engagement Activity 2017