Securing and developing the IPD-MHC database to enhance research into livestock diseases

Lead Research Organisation: The Pirbright Institute
Department Name: Immunogenetics

Abstract

The Immuno Polymorphism Database (IPD) (https://www.ebi.ac.uk/ipd/) is a set of specialist databases that contain curated sequence datasets of polymorphic immune genes. Developed and maintained by Anthony Nolan, these databases help save the lives of people with blood cancer by matching donors' major histocompatibility complex (MHC; also known as HLA in humans) to prevent rejection. One of these databases is IPD-MHC, a repository of non-human MHC genes that includes the major farmed animal species: cattle, sheep, pigs, trout and salmon. All these species exhibit high degrees of variation within their MHC genes. This diversity differentially influences how the adaptive and innate immune systems respond to pathogens and vaccines. Therefore, a greater understanding of this diversity, and the tools to analyse it, offer a significant advance in our broad capability to examine immune function in these fundamental food-producing species.
Since the first release in 2003, IPD-MHC has become the central source of curated and annotated comparative MHC data and nomenclature globally. The website now receives nearly 250 visitors every day and over 500 sequences were submitted in 2013 from all over the world. However, this success has created problems. The database and website have never been specifically funded. IPD-MHC has existed as an 'in kind' project that the research community and bioinformaticians at Anthony Nolan have considered important enough to create and continue. The demands that the level of traffic has created mean that this model is no longer sustainable. Indeed, there is already a chronic lack of development and this data repository is in danger of becoming redundant.
This proposal will support a dedicated IPD-MHC bioinformatician located at Anthony Nolan to work alongside the other IPD bioinformaticians. Their role will be to unify the individual species websites, incorporate the extensive MHC data for chicken, a key farmed animal, and create one streamlined data submission process. Once this is in place, the capability of IPD-MHC to accept a greater range of data will be significantly enhanced, and an expanded suite of analysis tools will be embedded to allow advanced analysis by non-bioinformaticians. IPD-MHC still has overwhelming support from the research community and Anthony Nolan, and the offer of free infrastructure from the European Bioinformatics Institute. This project is aimed at securing this central resource and expertise in the UK, to benefit this important UK research community and reach out to the rest of the world as part of the global food security agenda.

Technical Summary

The unusual complexity and importance of the MHC led the research community to drive the creation of a curated database and bioinformatics resource dedicated to comparative major histocompatibility complex (MHC) sequences: IPD-MHC. The MHC is a highly variable region of all vertebrate genomes that contains genes that dictate how an individual's immune system responds to challenge, influence reproduction and are linked with production in livestock. Since 2003 the considerable expansion and success of IPD-MHC has been sustained solely by 'in kind' contributions from curators, nomenclature committees, bioinformaticians and infrastructure resources.
Advances in sequencing technology and the growth in comparative immunology, especially for livestock species as part of national and international food security agendas, have created a demand for IPD-MHC that can only be met by providing a dedicated bioinformatician. Through the long-term relationship between Anthony Nolan and the European Bioinformatics Institute (EBI), the infrastructure and access to the EBI suite of analysis tools will still be provided 'in kind'. However, the modernisation of IPD-MHC to be compatible with recent EBI upgrades is urgent, and the tools for submitting and curating sequences must be made to reflect current sequencing technologies and community needs. These upgrades require a bioinformatician to work exclusively on this project. With these upgrades, the opportunity will exist to significantly expand the analytical capabilities of IPD-MHC to be of maximum use to the MHC community. By making current and bespoke bioinformatic analysis tools available through one portal, this project will ultimately provide a uniquely high quality sequence resource that allows non-bioinformaticians to interrogate MHC data. As current traffic is nearly 200 unique hits per day, the demand for this UK based resource can only increase, nationally and globally.

Planned Impact

A majority of the data in IPD-MHC is accessible through other publicly accessible databases, such as GenBank and DDBJ. The key to IPD-MHC, and hence the demand for it, is that it adds significant value and impact to this data by having species-specific teams of experts who ensure that the quality and naming of sequences are correct. This level of curation is essential for highly variable immune genes; low quality or wrongly named data can fragment research fields, cause confusion and lower the impact of research. This is evidenced by the numerous comparable resources for different human genes, many of which are also within the IPD framework.
Currently IPD-MHC is the central accurate repository for MHC data globally. The improvements and extensions in this proposal will maintain this status and enhance its impact considerably. It should also be noted that this is a free web based resource and the data will be available to all researchers and interested parties in the widest sense.
The fundamental nature of the MHC, combined with the farmed and companion species within IPD-MHC, means that researchers in many fields will be interested in this data. This will be enhanced by the capability to analyse MHC sequences knowing they are accurate and that the analysis tools have been tailored appropriately. As the importance of MHC and our analytical ability continue to increase, IPD-MHC will become a more valuable tool not only for academics, but also for veterinarians and livestock breeding companies. These groups will be able to understand MHC in the precise context of their species and associate phenotypic data. As the IPD-MHC capabilities increase, it is hoped that allele frequencies and other information directly applicable to livestock breeding companies and potentially farmers will also be available.
The accurate information that IPD-MHC provides can also be adopted by other sequence projects and resources. Public repositories contain MHC data that is incorrectly named or misinterpreted due to the highly polymorphic nature of these genomic regions. This has also caused the current genome assemblies of IPD-MHC species to contain inaccurate or very low confidence MHC assemblies. By targeting the MHC and taking on these quality and assembly challenges, IPD-MHC will have the opportunity to feed back to other databases and repositories to improve quality. The impact of this will be very broad, as academic and commercial genotyping strategies, livestock diversity measures and genome wide association studies are based on the current genome builds. Outputs from such studies, as well as more unbiased transcriptomic and proteomic studies, are interpreted using public sequence databases. Accurate data underlying these studies can therefore have a wide impact across academia and industry by improving the accuracy of analysis. Confirming that the MHC is not associated with an important trait is as important as confirming that it is.
A BBSRC supported bioinformatician within Anthony Nolan working with the EBI would become a highly skilled asset and possess skills now broadly identified as lacking in the UK. Future impact from this position would also be achieved by raising the profile of The Pirbright Institute and the BBSRC in an area currently dominated by other funding streams and pharmaceutical companies.
As part of improving food security this research will have a beneficial impact on UK society in general and ultimately the rest of the world. Any effect on reducing the burden of disease will have a major beneficial effect on social welfare, wealth creation through the development of livestock industries and the removal of barriers to trade. As such, this project directly addresses BBSRC strategic priority areas in Food Security and therefore contributes to meeting its targets. This project also facilitates data sharing within the animal genetics community, and several other bioscience areas, to facilitate global research.

Publications

 
Description During the first year of this award, the IPD-MHC Database has been reorganized and updated, consolidating the data already present and improving the overall quality of data publicly available.
High-throughput sequencing technologies allow the MHC research community to generate a growing amount of high quality data, providing the potential for extending the database coverage to include genomic sequences, rather than individual exons. To allow the storage and analysis of genomic data, the IPD-MHC Database has been further reorganized during the last year. This expansion raised a number of bioinformatics challenges, spurring the development of new tools for the automatic analysis and annotation of MHC alleles.
The database has been redesigned to accommodate sequences of arbitrary size, making it scalable and allowing the introduction of an indefinite number of sequences in the future. Its scalability has been tested by importing all the known human MHC (HLA) alleles, more than 17 thousand sequences. To compare genomic data with the cDNA sequences already present in the database, genomic sequences need to be annotated in order to discern between exonic and intronic data. For this reason, an automatic mapping tool has been developed to detect exon boundaries in highly variable sequences from a number of different organisms, without using a priori rules that might vary between taxonomic groups. This tool has been successfully integrated in the database pipeline, allowing the automatic annotation of genomic data.
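The record does not describe how the mapping tool works internally. As a rough illustration of the general idea (carrying exon boundaries over from an annotated reference to a novel sequence by alignment rather than by fixed rules), here is a minimal sketch. The function name `project_boundaries` and the toy sequences are hypothetical, and Python's standard `difflib.SequenceMatcher` stands in for whatever alignment method the real tool uses.

```python
from difflib import SequenceMatcher


def project_boundaries(reference, ref_exons, novel):
    """Project exon intervals from an annotated reference onto a novel sequence.

    Intervals are 0-based with an exclusive end. An exon whose first or last
    base cannot be matched in the novel sequence is reported as None.
    This is an illustrative stand-in, not the IPD-MHC algorithm.
    """
    # autojunk=False keeps frequent bases (A, C, G, T) from being treated as junk
    matcher = SequenceMatcher(None, reference, novel, autojunk=False)
    blocks = matcher.get_matching_blocks()

    def project(pos):
        # Find the matching block that covers this reference position
        for ref_start, novel_start, size in blocks:
            if ref_start <= pos < ref_start + size:
                return novel_start + (pos - ref_start)
        return None

    projected = []
    for start, end in ref_exons:
        new_start = project(start)
        new_last = project(end - 1)
        if new_start is None or new_last is None:
            projected.append(None)
        else:
            projected.append((new_start, new_last + 1))
    return projected


# Toy example: the novel sequence has two extra bases before the exon
reference = "CCCC" + "ATGGAT" + "TTTT"
novel = "CCCCCC" + "ATGGAT" + "TTTT"
print(project_boundaries(reference, [(4, 10)], novel))
```

A production tool would need a proper pairwise aligner and handling for boundaries that fall in unaligned regions; the sketch only shows how annotation can be transferred without taxon-specific rules.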
The alignment tool has been further improved to handle the comparison of genomic data from different species. The new algorithm provides for the first time the ability to visualise fractions of the genomic alignment, allowing users to focus only on the region of interest. Furthermore, in order to improve the speed and readability of particularly big alignments, the user is able to choose the allelic resolution to visualize, reducing the number of displayed sequences to the bare minimum. All these improvements in data analysis and visualization will be the subject of further studies, leading to a publication about the efficient representation of large amounts of data, a hot topic in this big-data era.
With over 30,000 visits to the site in 2018 from all over the world, the utility of this database is evident.
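One way to picture the choice of allelic resolution is to truncate allele names to a given number of fields and keep a single representative per truncated name. The sketch below assumes names of the form `LOCUS*field1:field2`; the example names and the function `collapse_to_resolution` are illustrative only and are not taken from the database.

```python
def collapse_to_resolution(allele_names, fields):
    """Keep the first allele seen for each name truncated to `fields` fields.

    Assumes names look like 'LOCUS*001:01' (an illustrative format).
    Returns a dict mapping the truncated name to its representative allele.
    """
    representatives = {}
    for name in allele_names:
        prefix, _, digits = name.partition("*")
        truncated = prefix + "*" + ":".join(digits.split(":")[:fields])
        # setdefault keeps the first allele encountered for each group
        representatives.setdefault(truncated, name)
    return representatives


names = ["BoLA-3*001:01", "BoLA-3*001:02", "BoLA-3*002:01"]
print(collapse_to_resolution(names, fields=1))
```

With `fields=1` the three example names reduce to two displayed sequences, which is the kind of reduction that keeps large alignments readable.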
API access
With the growth of the IPD-MHC Database in the number of manually curated alleles and the amount of annotated information, the requirements of the research community for consuming allele data have evolved. An increasing number of IPD-MHC database users require access to alignment information and sequence data in different formats. For this reason, the IPD-MHC database was recently updated to provide the scientific community with an API (application programming interface) that allows intermediate and advanced users to systematically access the information stored in the database. The API consists of a series of endpoints providing search functionality as well as access to allele sequences and general information, with the ability to filter by taxonomic group, species, locus or search terms.
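The record names the available filters (taxonomic group, species, locus, search terms) but not the endpoint paths or parameter names. The sketch below shows only how such a filtered request might be assembled in a pipeline. The base URL is a placeholder on example.org and every parameter name is hypothetical; the real values should be taken from the IPD-MHC API documentation.

```python
from urllib.parse import urlencode

# Placeholder endpoint for illustration only; not the real IPD-MHC API address
BASE_URL = "https://example.org/ipd-mhc/api/allele"


def build_allele_query(group=None, species=None, locus=None, search=None, limit=100):
    """Build a query URL from optional filters (all parameter names are hypothetical)."""
    params = {
        "group": group,
        "species": species,
        "locus": locus,
        "search": search,
        "limit": limit,
    }
    # Drop filters that were not supplied so the URL stays minimal
    params = {key: value for key, value in params.items() if value is not None}
    return BASE_URL + "?" + urlencode(params)


print(build_allele_query(group="BoLA", locus="BoLA-3"))
```

Keeping URL construction separate from the network call makes the filters easy to test before a pipeline sends any request.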
Functional classification of MHC class I alleles
The current MHC allele nomenclature is based on the alignment of one or more portions of the nucleotide sequence, and the nomenclature rules vary for each taxonomic group. These rules don't necessarily reflect the ability of alleles to recognise pathogenic protein fragments. A metric that takes into account the ability of each allele to recognise a dataset of peptides has been developed to measure the functional similarity of MHC molecules that belong to different taxonomic groups and have different nucleotide sequences. This measure of similarity will give insight into the evolution of these molecules, allowing the comparison of homologous alleles from different, unrelated species that show a similar surface shape but do not share significant sequence similarity. In the long run, functional similarity can be used to harmonise the nomenclature rules across different taxonomic groups.
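The exact form of the metric is not given in this record. A simple way to illustrate the idea of comparing alleles by what they bind, rather than by sequence, is the Jaccard index over the sets of peptides each allele is predicted to recognise. The choice of Jaccard and the peptide labels below are assumptions made for illustration.

```python
def functional_similarity(binders_a, binders_b):
    """Jaccard index of two predicted binder sets (an assumed stand-in metric).

    Returns a value between 0.0 (no shared peptides) and 1.0 (identical sets).
    """
    set_a, set_b = set(binders_a), set(binders_b)
    union = set_a | set_b
    if not union:
        # Neither allele has predicted binders, so there is no evidence of similarity
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical binder sets for two alleles from unrelated species
allele_x = {"P1", "P2", "P3"}
allele_y = {"P2", "P3", "P4"}
print(functional_similarity(allele_x, allele_y))  # 0.5
```

Because the score depends only on the predicted peptide repertoire, two alleles with little sequence identity can still score as functionally close.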

IPD-MHC Genomic browser
The IPD-MHC Database will provide a tool for the visualization of IPD-MHC data and annotations directly on reference genome assemblies. Additional tools are integrated in the curator website to map loci to specific regions on genome assemblies, and initially the existing loci will be associated with a specific region on reference genomes, when possible. A single species can have multiple genome assemblies; the reference genome of each species or taxonomic group will be used as default, but custom genome assemblies can be made available on the website. The tool is now in a beta version. A link to the browser will be available for each taxonomic group that provides the reference information, with a public version of the tool expected in the coming months. This functionality will add value to the curated data that the IPD-MHC database is hosting, providing an additional way to compare and integrate the available information with genomic data. It will also allow The Pirbright Institute to share genome assemblies that are not currently hosted anywhere.

Insight into the function and evolution of BoLA class I haplotypes
In the past year we have focused our studies on understanding the evolution of BoLA MHC class I alleles and their functional relationships. This work is based on the NGS sequencing pipeline developed by Dorothea Harrison for sequencing the bovine MHC region in a number of breeds. The functional analysis of MHC class I alleles has been performed by analysing the predicted peptide repertoire of the sequences available in the IPD-MHC database, providing insight into bovine MHC evolution and a review of strategies for naming MHC alleles in a way that reflects their functionality. A manuscript has been published explaining the role of different loci in cattle haplotypes and investigating the generalist and specialist alleles hypothesis, providing evidence from experimental and in-silico analysis.
Exploitation Route A public version of the database is already available online, replacing the previous one and integrating completely with the EBI tool set. Furthermore, a new submission system has been implemented to make submitting data easier and to allow the inclusion of extensive metadata regarding the sequence origin and features. In the first eight weeks following release more than 500 new sequences were successfully submitted and curated.

As the main objective of the first year was to establish a core structure of the database and import all data from the previous version, the award objectives have been fully met. Additionally, the findings achieved so far will form the basis for our group to develop new analysis tools and for the target audience to further extend the volume of data in the IPD-MHC database.
The increasing knowledge of the molecular variation of MHC molecules would allow better understanding of the mechanisms behind the high levels of polymorphism that provide diversity in antigen-binding repertoires. Furthermore, the project would potentially impact the pharmaceutical and medical biotechnology fields, helping to understand the mechanisms of alloreactive immune responses, enabling a more rational approach to donor selection and avoiding HLA mismatches most likely to evoke a strong alloantibody response.
Following the success of the IPD-MHC Database, a database sharing the same structure was developed and released in 2018 to provide the research community with a resource for non-human KIR alleles. Killer-cell Immunoglobulin-like Receptors (KIR) have been shown to be highly polymorphic at the allelic and haplotypic level, like the MHC.
For these reasons, the knowledge generated during the development of the IPD-MHC Database has provided a strong background for the realization of this side project.
At the moment, two taxonomic groups representing non-human primates and bovines are present, but the scalable structure of the database provides the ability to expand them in the future.
Furthermore, the database provides basic sequence analysis tools, including multi-locus inter- and intra-species alignment and a BLAST tool for sequence similarity search.
API access
Advanced and expert users can systematically access the API to integrate the curated IPD-MHC database data with their own pipelines.
Functional classification of MHC class I alleles
This metric is currently used by our group to investigate the functional similarity of bovine haplotypes, and a manuscript with the outcomes will shortly be published. Moreover, the method will in future be integrated with the analysis tools in the IPD-MHC Database to provide additional similarity metrics to the end-user.

March 2021 update
The genomic browser will shortly be available to the research community on the IPD-MHC Database, with an initial set of genomic assemblies to choose from.

The genome browser will allow users to quickly visualise the curated IPD-MHC data on official genomic assemblies and can be used, for example, to provide insight into the conservation of specific loci with the objective of designing primers/probes.

The functional analysis of BoLA class I alleles will affect not only the current nomenclature rules for BoLA alleles, but also the understanding of how haplotypes influence the susceptibility of cattle to pathogens.

The database is now secure for the near future and updatable/expandable with relatively little input from administrators.
Sectors Agriculture, Food and Drink,Education,Healthcare,Pharmaceuticals and Medical Biotechnology

URL https://www.ebi.ac.uk/ipd/mhc/
 
Description NARRATIVE IMPACT (GM) In the first year of the award, the database was updated to meet the required objectives and a first version of the redesigned IPD-MHC Database was released in November 2016. Following its release, the number of unique visitors doubled, showing an increasing interest of the research community in the IPD-MHC Database. Furthermore, analysis of the distribution of visitors around the world shows an increasing impact, not only in high-income countries but also in low- and middle-income countries (LMIC). The new submission system has greatly eased the task of both the submitter and the individual MHC group curators. On one side, the user is guided through the whole submission process, which simplifies the task and allows already validated data to be provided. On the other side, the curator can rely on extensive metadata regarding the sequence origin and features, and on additional tools for the analysis of data. As a consequence, the number of submitted sequences has increased, bringing the overall count of unique sequences to well over 15k. New sections are being added to IPD-MHC and the number of sequences and requests continues to rise. Most importantly, human and non-human MHC databases are now becoming ever more closely aligned. This dramatically increases the power of our comparative analyses, a key use of the website, and also promotes shared infrastructure, driving sustainability and future-proofing.
First Year Of Impact 2018
Sector Agriculture, Food and Drink
 
Title Cattle MHC genotyping 
Description Using the sequence data generated through the targeted pull down of MHC, we developed a full-gene and a more targeted PCR approach to genotype cattle for the MHC class I region. This has been applied to many hundreds of samples to enable us to select individuals for breeding as well as survey genetic diversity in beef and dairy herds. 
Type Of Material Technology assay or reagent 
Year Produced 2018 
Provided To Others? No  
Impact After publication, which we anticipate in 2019, we will apply this method to targeted herds; we are already attracting industry interest. 
 
Title Full length cattle MHC genes 
Description Using our method for full-length MHC gene amplification, the largest reference set ever produced from common haplotypes has been sequenced and deposited in the publicly available IPD-MHC database with associated research tools. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? No  
Impact None yet, but it will allow far greater resolution when analysing MHC polymorphism and evolution in cattle populations. 
URL https://www.ebi.ac.uk/ipd/mhc/group/BoLA/
 
Title Functional characterization of MHC Class I molecules 
Description In an effort to functionally characterize MHC class I molecules and understand their binding requirements, a scoring function was designed to capture structural features on the basis of the interaction between the binding pocket and the associated peptide. A machine learning approach is used to compare the peptide binding repertoires of MHC class I molecules. An in-silico binding profile is generated for each known livestock MHC class I allele by challenging a machine learning prediction algorithm with a dataset of potential binder peptides, composed of nonamers generated from a set of randomly selected soluble proteins from the UniProt database. This procedure provides a functional measure of allele diversity, allowing sequences to be discriminated on the basis of their biological function. 
Type Of Material Data analysis technique 
Year Produced 2018 
Provided To Others? No  
Impact This research will have immediate application:
- In the structure prediction of MHC Class I alleles, allowing the best template for homology modelling of unknown structures to be identified.
- In MHC allele nomenclature, providing an alternative measure of allele similarity.
From a practical point of view, this is the first step of our effort to provide the research community with a centralized resource for MHC study in livestock, allowing the introduction of high-resolution MHC Class I structures in the IPD-MHC Database. 
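The in-silico binding profile described for this technique can be pictured in two steps: cut each protein into overlapping nonamers, then keep the peptides that a predictor scores above a threshold for a given allele. The sketch below is only an outline under stated assumptions. The record does not name the prediction algorithm, so `toy_predictor` is a made-up stand-in, and the protein string and threshold are arbitrary.

```python
def nonamers(protein, k=9):
    """Return all overlapping peptides of length k (nonamers by default)."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]


def binding_profile(peptides, predictor, threshold):
    """Return the set of peptides whose predicted score reaches the threshold."""
    return {peptide for peptide in peptides if predictor(peptide) >= threshold}


def toy_predictor(peptide):
    """Made-up scoring rule for illustration: favour peptides ending in L or M.

    A real analysis would call a trained MHC class I binding predictor here.
    """
    return 1.0 if peptide[-1] in "LM" else 0.0


peptides = nonamers("ACDEFGHIKLM")
print(peptides)
print(binding_profile(peptides, toy_predictor, threshold=0.5))
```

Running the same peptide set against each allele gives one profile per allele, and those profiles are what a functional comparison between alleles would operate on.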
 
Title Identification of full-length bovine MHC class I alleles 
Description Through a PCR and next-generation sequencing approach we identified 87 full-length bovine MHC class I alleles. These sequences allowed 45 alleles already in IPD to be extended with intron sequence. The rest of the sequences were novel alleles. All of them are now available to the wider research community in IPD. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
Impact The availability of full-length MHC class I alleles makes it possible to investigate the evolution of the MHC system and to understand MHC diversity in cattle in more detail. It also enables the development of an MHC typing system that covers all of exons 2 and 3 (the main part used for peptide binding) by using primer binding sites in introns. 
 
Title Sequence mapping tool 
Description The sequence mapping tool was designed to annotate novel alleles starting from already annotated reference sequences. This kind of approach can potentially solve the problem of annotating highly variable sequences, as no a priori rules are used to infer feature boundaries. The algorithm has been tested with sequences belonging to different taxonomic groups and performs well with evolutionarily distant organisms. 
Type Of Material Computer model/algorithm 
Year Produced 2018 
Provided To Others? No  
Impact The sequence mapping tool will be tested, published and made available in 2018. It is hoped that this will provide valuable functionality to the overall database. 
 
Description EBI 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution We maintain and curate the IPD-MHC database
Collaborator Contribution They host the database and provide generic tools as an in kind contribution
Impact All usage of the website and other outcomes associated with this award
Start Year 2015
 
Description Extended group meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Every two months our research group organises a meeting to discuss the current status of research activity, and collaborators from other groups are also present. Each member of the group is invited to prepare a 20-minute presentation highlighting new findings and any problems encountered in the research activity.
This activity usually brings new ideas and solutions, and often highlights connections between research activities, promoting inter- and intra-group collaborations.
Year(s) Of Engagement Activity 2016,2017,2018
 
Description IPD-MHC Steering committee meeting 2016 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This was the first meeting of the Steering Committee for the BBSRC Bioinformatics and Biological Resources Grant "Securing and developing the IPD-MHC database to enhance research into livestock diseases". Plans were made on how to increase the impact of the project and the number of sequences, as well as how to integrate other taxonomic groups.
Year(s) Of Engagement Activity 2016
 
Description Internal seminar at The Pirbright Institute 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact An internal seminar at The Pirbright Institute about the development of new tools for the accurate analysis of MHC data. I presented some of the work done so far and future developments to a highly specialized audience. I received some interesting feedback and ideas to further improve the usability of the IPD-MHC database.
Year(s) Of Engagement Activity 2016
 
Description Oral presentation at the 36th International Society for Animal Genetics Conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact I was selected to give an oral presentation on the extension and update of the IPD-MHC Database at the 36th International Society for Animal Genetics (ISAG) Conference. The ISAG conference is an international forum for scientists and practitioners of animal genetics applied to economically important and domesticated species. It comprised a series of plenary sessions with invited presentations from the world's leading scientists.
Year(s) Of Engagement Activity 2017
 
Description Presentation of the current status of the IPD-MHC Database 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Presentation for the Anthony Nolan Research Institute to provide an update on the current status of the IPD-MHC database and on how the newly developed tools will also help the interrogation of the other databases of the IPD project. There was interest from the audience and suggestions were made to further improve the database.
Year(s) Of Engagement Activity 2021
 
Description Sir John Kingman (UKRI) visit 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Policymakers/politicians
Results and Impact Visit to highlight Pirbright science to UKRI
Year(s) Of Engagement Activity 2019