Development and experimental validation of a deep-learning based pipeline for user-centric protein design.

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Biological Sciences

Abstract

Proteins are the molecules that provide most of the complex functionality in all living things. They are made of 20 different building-block types called amino acids, which are combined in different sequences to make long chains. The varying shapes and chemistries of the amino acids cause the chains to fold into a distinct 3D structure. It is this structure that enables proteins to perform the different roles they have in nature, whether it's digesting your food, moving you around or simply keeping the top of your head warm.

Even though life emerged over 4 billion years ago, only a small number of possible protein structures have been explored by evolution due to their inherent complexity. As protein structure is directly related to function, this means that there is a huge pool of unexplored proteins with functions that could be applied to solve problems in medicine, biotechnology, energy and agriculture. If we can design new proteins from scratch, we can address some of these problems with the new proteins that we create.

Because proteins are complex, designing new ones is difficult. To make it easier, we can write programs that create and test huge numbers of designs in computer simulations. This improves the chance of designing a sequence of amino acids that will adopt our desired structure when we create it in the laboratory. However, even with state-of-the-art methods for designing proteins on a computer, only a small number of sequences adopt the structures we intend them to, making protein design costly and unreliable.

I intend to create a new method for designing proteins that uses a type of artificial intelligence called a deep-neural network (see http://playground.tensorflow.org for an interactive example). This technique will be used to learn the complex rules for generating stable proteins that are hidden inside the amino-acid sequences of protein structures we have already observed. Once the rules have been learned, we can use them to create new sequences of amino acids that are good candidates for adopting the structure we require. This method will form part of an automated pipeline that will create and test protein structures in computer simulations, before recommending the best designs for our intended application. This will make the process of protein design much more reliable.

To get an understanding of how effective this method is, I will test it by creating hundreds of the protein designs recommended by the pipeline in the laboratory, using robotics to accelerate this process. Once tested, I plan to showcase the method by designing new proteins that can perform chemical reactions that are useful industrially. This will make performing these chemical reactions much cheaper and more environmentally friendly, paving the way for the design of many more proteins with useful functions that address the challenges that the human race currently faces.

Planned Impact

RESEARCH COMMERCIALISATION

The user-centric protein design pipeline and designed proteins described in this proposal have clear routes to commercialisation. There are three main approaches that could be taken:

(1) The software produced could be sold through licensing to industrial partners. The software is likely to be of significant interest to the biotechnology/pharmaceutical industry, as it will allow for the robust design of proteins to strict design specifications in order to perform required functions.
(2) The proteins that are designed could be sold as products. Some of the proteins designed in this proposal have clear applications in industrial biotechnology. The amino-acid sequences for these and related designs could be patented if appropriate.
(3) Protein design as a service. The full user-centric protein design pipeline has been designed to produce proteins that fit a strict specification for a targeted application. As a result, it could form the core enabling technology behind a company that performs protein design as a service for the biotechnology industry.

All research-commercialisation activities will be performed with support from Edinburgh Innovations, the commercialisation office at the University of Edinburgh, which has a strong track record in tech transfer from academia to industry.

PUBLIC ENGAGEMENT

Protein design is an excellent vehicle for demonstrating interdisciplinary research. It covers many STEM areas and has a creative aspect, allowing it to reach a broad audience. I have previously developed a workshop aimed at secondary school pupils in years 9 and 10, which was run both in schools and at the university. The sessions were very well received, with excellent feedback from both teachers and students.

I will continue to develop this workshop, refining the activities as well as creating versions to fit the requirements of different teachers and schools, given their curriculum and available time. Furthermore, I will design a new workshop focused on protein structure aimed at primary school pupils, which will be taken to schools and science festivals. These activities will be organised in cooperation with the Beltane Network.

Publications

 
Description There are three aspects to this project: (1) the development of novel machine-learning-based protein-design methods that should be more robust and easier to use than existing methods; (2) the creation of an evaluation pipeline for assessing the quality of protein designs; and (3) the application of these tools to design a new enzyme that incorporates an unnatural chemical group. Aspects 1 and 2 have been published in peer-reviewed journals.

Deep-neural networks (DNNs) are a promising route to generating novel protein-design methods. This is an active research field with many recent publications describing new neural networks. While developing our own DNNs for protein design, we quickly realised that published methods were very difficult to compare and were, in some cases, impossible to recreate. The metrics that are commonly reported mask problems that are important to know about if you intend to apply these methods. To address these issues, we have created a benchmarking suite that can be used to evaluate any protein-design method. Guided by this, we have significantly improved the performance of our own networks, to the point that they outperform some well-established methods for sequence design. We have published our benchmarking suite and made the source code available so that other groups can use this tool to assess the quality of their design methods.
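
As a rough illustration of how a commonly reported metric can mask a problem, the sketch below compares overall sequence recovery (accuracy) with macro-averaged recall on a toy example. The function names, the toy sequences and the choice of these two metrics are assumptions made for illustration only; they are not taken from the published benchmarking suite.

```python
from collections import defaultdict


def accuracy(native, designed):
    """Fraction of positions where the designed residue matches the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    matches = sum(n == d for n, d in zip(native, designed))
    return matches / len(native)


def macro_recall(native, designed):
    """Mean of per-residue-type recall, so rare residue types weigh as much as common ones."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    total = defaultdict(int)
    correct = defaultdict(int)
    for n, d in zip(native, designed):
        total[n] += 1
        if n == d:
            correct[n] += 1
    return sum(correct[aa] / total[aa] for aa in total) / len(total)


# Toy native sequence dominated by alanine (A), with one tryptophan (W) and one cysteine (C).
native = "AAAAAAAAWC"
# Output of a hypothetical method that predicts alanine at every position.
designed = "AAAAAAAAAA"

print(accuracy(native, designed))      # 0.8
print(macro_recall(native, designed))  # one third: W and C are never recovered
```

In this toy case the accuracy looks respectable, while the macro-averaged recall shows that two of the three residue types are never predicted correctly.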

Our design evaluation pipeline has been published and is available through the following URL: http://destressprotein.design. The website is made with modern web technologies and great care has been taken to make it as user friendly as possible, enabling non-experts to assess the quality of their designs. We are currently using this tool to develop novel therapeutic antibodies and enzymes.

Work on methods for incorporating unnatural cofactors is ongoing. It is already well developed, and follow-up studies are underway to develop this technology further.
Exploitation Route The technology that has been developed is free and open source, and it has started to be adopted by the protein engineering and design community within academia and industry. Protein engineering is a required step in industrial settings in order to create enzymes, therapeutics or materials that are fit for purpose. The tools that we have created should make this a more reliable process. We are exploring ways to commercialise the software and methods that we have produced, and currently have funding to perform a pilot of this with a local SME.
Sectors Chemicals, Environment, Manufacturing, including Industrial Biotechnology, Pharmaceuticals and Medical Biotechnology

 
Description The tools and methods for designing and evaluating proteins that were developed during the course of this grant have generated interest from multiple companies locally, nationally and internationally. We are currently exploring how best to exploit our technology commercially, and have a small grant to trial this with a local biotech company. We have also shared our findings with the general public and discussed the future direction of protein design. The major activity in this area was our participation in the Royal Society Summer Exhibition, which was attended by almost 10,000 people. As part of this, we produced a video with the Royal Society (https://youtu.be/Am45c83iLg4) describing our work, which was posted on YouTube and has been viewed by thousands of people.
First Year Of Impact 2022
Sector Manufacturing, including Industrial Biotechnology, Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Description 21ENGBIO - High-Throughput Design of Novel Sensors to Help Address the Impending Phosphate Crisis
Amount £100,809 (GBP)
Funding ID BB/W013320/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 01/2022 
End 01/2023
 
Description Development of Genetically-Encodable Nitrate Sensors for Visualizing Nitrogen Flux in Plants
Amount £5,996 (GBP)
Funding ID IES\R2\212095 
Organisation The Royal Society 
Sector Charity/Non Profit
Country United Kingdom
Start 12/2021 
End 12/2022
 
Description Generalised Photocatalysis by Enzymes (GENPENZ)
Amount £3,178,051 (GBP)
Funding ID BB/X003027/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 02/2023 
End 08/2028
 
Title BAlaS: Fast, interactive and accessible computational alanine-scanning 
Description BAlaS is an interactive web application for performing computational alanine-scanning mutagenesis and visualizing its results. BAlaS is interactive and intuitive to use. Results are displayed directly in the browser on the structure being interrogated, enabling their rapid inspection. BAlaS has broad applications in areas such as drug discovery and protein-interface design. 
Type Of Technology Webtool/Application 
Year Produced 2020 
Open Source License? Yes  
Impact Good initial response through social media, but slightly too early to determine usage statistics. 
URL https://balas.app/
 
Title DE-STRESS: Designed Structure Evaluation Services 
Description DE-STRESS is a web application for evaluating protein designs, to identify those that are most likely to express successfully and suit the needs of a particular application. The web server runs a suite of metrics on the design and reports these back to the user. It also provides "reference sets", which are sets of known protein structures that the user can compare their design to in order to contextualise the reported metrics. Finally, the user can define a "design specification", which describes the properties that the designed proteins should have, and this can be applied to filter for designs that meet these criteria. 
Type Of Technology Webtool/Application 
Year Produced 2021 
Open Source License? Yes  
Impact It has just been deployed, so it's too early to evaluate the impact in the community. 
URL http://destressprotein.design/
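
The idea of filtering designs against a "design specification", as described for DE-STRESS above, can be sketched as a simple range check. This is a minimal illustration under stated assumptions, not the DE-STRESS implementation: the metric names, the (minimum, maximum) representation of a specification and the example values are all hypothetical.

```python
def meets_specification(metrics, specification):
    """Return True if every metric named in the specification lies within its (low, high) range."""
    for name, (low, high) in specification.items():
        value = metrics.get(name)
        # A design that is missing a required metric cannot satisfy the specification.
        if value is None or not (low <= value <= high):
            return False
    return True


def filter_designs(designs, specification):
    """Return the names of the designs that satisfy the specification, in input order."""
    return [
        name
        for name, metrics in designs.items()
        if meets_specification(metrics, specification)
    ]


# Hypothetical metric values for three designs (names and numbers are illustrative only).
designs = {
    "design_1": {"isoelectric_point": 6.1, "hydrophobic_fraction": 0.35},
    "design_2": {"isoelectric_point": 9.4, "hydrophobic_fraction": 0.30},
    "design_3": {"isoelectric_point": 6.8, "hydrophobic_fraction": 0.62},
}

# Hypothetical specification: the allowed range for each metric.
specification = {
    "isoelectric_point": (5.0, 8.0),
    "hydrophobic_fraction": (0.20, 0.50),
}

print(filter_designs(designs, specification))  # ['design_1']
```

Representing the specification as named ranges keeps it declarative, so the same filter can be reused for different target applications by changing only the ranges.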
 
Title PDBench 
Description PDBench is a dataset and software package for evaluating fixed-backbone sequence design algorithms. The structures included in PDBench have been chosen to account for the diversity and quality of observed protein structures, giving a more holistic view of performance. 
Type Of Technology Webtool/Application 
Year Produced 2022 
Open Source License? Yes  
Impact This benchmarking suite has already started to be adopted by the community that is developing protein sequence design methods, as evidenced by early citations. 
URL https://doi.org/10.1093/bioinformatics/btad027
 
Description Interview and demonstration - How to make a brand new protein 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact The Royal Society made a video describing our protein-design work for a general audience. It was incorporated into the live stream that took place during the RS Summer Exhibition and was shared on YouTube, where it had been viewed over 10,000 times as of 03/03/2023.
Year(s) Of Engagement Activity 2022
URL https://youtu.be/Am45c83iLg4
 
Description Royal Society Summer Exhibition - Programming Proteins 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact The Royal Society's annual Summer Science Exhibition offers a free interactive experience for anyone curious about the latest advances in science and technology. The event took place over 5 days, attracting around 10,000 visitors. There were also two formal "soirees" to host dignitaries, fellows of the RS and members of the media. I also participated in dedicated "Meet the Scientist" sessions, where school children could ask questions about my route into science. The team had thousands of meaningful interactions with a wide range of people, in which we discussed the benefits that protein design offers, as well as the associated risks.
Year(s) Of Engagement Activity 2022
URL https://royalsociety.org/science-events-and-lectures/2022/summer-science-exhibition/