Investigating B-cell repertoire data using deep learning approaches to aid in the development of antibody therapeutics

Lead Research Organisation: University of Oxford

Department Name: Sustain Approach to Biomedical Sci CDT

Abstract

Antibodies are important proteins of the immune system. They recognize potentially harmful molecules, binding to them and initiating their removal from the body. With approximately 100 antibodies approved for clinical use to date, they have become an important and growing class of pharmaceuticals. However, therapeutic antibody development is complicated by numerous requirements, including chemical stability, solubility, low viscosity, bioavailability, long serum half-life, non-immunogenicity, and resistance to fragmentation, aggregation, post-translational modification and proteolytic cleavage, while also retaining their desired functions (binding affinity, specificity and functional activity).
Whilst methodologies for therapeutic antibody discovery are constantly evolving, it remains an expensive and cumbersome process. Insights from in silico approaches, along with machine learning (ML) algorithms that can extract and utilize information from previous experiments to predict the properties and functions of new antibodies, are therefore highly sought. Recently, methods such as UniRep have shown the potential of improving protein predictions by applying Natural Language Processing (NLP) inspired methods, notably transfer learning, to protein data. Transfer learning uses information learned in one domain in another, related domain, which can be particularly powerful when only small data sets are available for the latter domain, a common occurrence with antibody data. Developing new ML tools for antibodies built on state-of-the-art NLP techniques can therefore have a large impact on therapeutic antibody discovery.
The aim of this project is to explore antibody data and develop novel machine learning tools for improving the predictions of antibody properties and functions. This includes:
Investigating the large amount of sequence data available in the Observed Antibody Space database.
Developing novel ML techniques and adapting novel NLP techniques to work on biological data.
Exploring the use of these new techniques in antibody property and function prediction.
This DPhil project is a collaboration between Prof. Charlotte Deane at the Oxford Protein Informatics Group (OPIG), University of Oxford and Dr. Iain H. Moal at GlaxoSmithKline (GSK), London. This project aligns with several of EPSRC's strategies and research areas. It mainly falls within the EPSRC Biological Informatics research area for its development of novel computational techniques to model and analyse biological data (machine learning tools for antibody predictions). Additionally, the project also falls within the EPSRC Analytical Science and Artificial Intelligence Technologies research areas, for our use of novel ML techniques to extract information from large datasets for analyzing and predicting properties of antibodies.

Planned Impact

The UK's world-leading position in biomedical research is critically dependent upon training scientists with the cutting-edge research skills and technological know-how needed to drive future scientific advances. Since 2009, the EPSRC and MRC CDT in Systems Approaches to Biomedical Science (SABS) has been working with its consortium of 22 industrial and institutional partners to meet this training need.

Over this period, our partners have identified a growing training need caused by the increasing reliance on computational approaches and research software. The new EPSRC CDT in Sustainable Approaches to Biomedical Science: Responsible and Reproducible Research - SABS:R^3 will address this need. By embedding a sustainable approach to software and computational model development into all aspects of the existing SABS training programme, we aim to foster a culture change in how the computational tools and research software that now underpin much of biomedical research are developed, and hence how quantitative and predictive translational biomedical research is undertaken.

As with all CDT Programmes, the future impact of SABS:R^3 will be through its alumni, and by the culture change that its training engenders. By these measures, our existing SABS CDT is already proving remarkably successful. Our alumni have gone on to a wide range of successful careers: 21 in academic research, 19 in industry (including 5 in SABS partner companies) and the other 10 working in organisations from the Office for National Statistics to the EPSRC. SABS' unique Open Innovation framework has facilitated new company connections and a high level of operational freedom, facilitating 14 multi-company, pre-competitive, collaborative doctoral research projects between 11 companies, each focused on a SABS student.

The impact of sustainable and open computational approaches on biomedical research is clear from existing SABS student projects. Examples include SAbDab, which resulted from the first-ever co-sponsored doctorate in SABS, by UCB and Roche. It was released as open source software, is embedded in the pipelines of several pharmaceutical companies (including UCB, MedImmune, GSK, and Lonza) and has resulted in 13 papers. The SABS student who developed SAbDab was initially seconded to MedImmune, sponsored by EPSRC IAA funding; he went on to work at Roche, and is now at BenevolentAI. Similarly, PanDDA, multi-dataset X-ray crystallographic software to detect ligand-bound states in protein complexes, is in CCP4 and is an integral part of Diamond Light Source's XChem Pipeline. The SABS student who developed PanDDA was awarded an EMBO Fellowship.

Future SABS:R^3 students will undertake research supported by both our industrial partners and academic supervisors. These supervisors have a strong track record of high impact research through the release of open source software, computational tools, and databases, and through commercialisation and licensing of their research. All of this research has been undertaken in collaboration with industrial partners, with many examples of these tools now in routine use within partner companies.

The newly focused SABS:R^3 will permit new industrial collaborations. Six new partners have joined the consortium to support this new bid, ranging from major multinationals (e.g. Unilever) to SMEs (e.g. Lhasa). SABS:R^3 will continue to make all of its research and teaching resources publicly available and will continue to help to create other centres with similar aims. To promote a wider cultural change, SABS:R^3 will also engage with the academic publishing industry (Elsevier, OUP, and Taylor & Francis). We will explore novel ways of disseminating the outputs of computational biomedical research, to engender trust in the released tools and software and to facilitate more uptake and re-use.

Student:

Tobias Olsen

Period of Study:

Oct 19 - Sep 23

Funder:

EPSRC

Project Status:

Closed

Project Category:

Studentship

Project Reference:

2271214

Research Topic:

Unclassified

Organisations

People	ORCID iD
Charlotte Deane (Primary Supervisor)	http://orcid.org/0000-0003-1388-2252
David Gavaghan (Primary Supervisor)
Tobias Olsen (Student)

Publications

Author Name Title Publication

Date Published

Olsen T (2021) Observed Antibody Space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences in Protein Science

Olsen T (2022) AbLang: An antibody language model for completing antibody sequences

Studentship Projects

Project Reference	Relationship	Related To	Start	End	Student Name
EP/S024093/1			01/10/2019	31/03/2028
2271214	Studentship	EP/S024093/1	01/10/2019	30/09/2023	Tobias Olsen

Key Findings

Description	I have developed a database of more than 1.5 billion cleaned and annotated antibody sequences. This database is an extremely useful resource for discovering new antibodies to use as therapeutics and for training machine learning models to predict antibody properties. Further, I have built and trained a model, based on the aforementioned database, to learn the semantics of antibodies. This model can help design new antibodies or optimize current antibody therapeutics, making it a valuable tool for drug discovery.
Exploitation Route	The created database is freely available online, giving antibody researchers an easy-to-obtain, high-quality dataset to work with. Further, the developed antibody model is open source and available through GitHub, so anyone can work with it and use it for antibody discovery and design.
Sectors	Digital/Communication/Information Technologies (including Software), Healthcare, Pharmaceuticals and Medical Biotechnology

Impact Summary

Description	The created antibody database has been used by multiple companies as a source of antibody sequences in their work on discovering and designing new therapeutic antibodies.
First Year Of Impact	2022
Sector	Pharmaceuticals and Medical Biotechnology
Impact Types	Economic