Delivering accurate structural bioinformatics to the yeast community with the HHprY database

Lead Research Organisation: University College London
Department Name: Institute of Ophthalmology

Abstract

The understanding of cells has increased with new technology that has developed from genome sequencing. Experiments are run by robots to produce huge sets of results. As a result, our understanding of living cells is now so detailed that we can easily imagine a future where an entire organism is understood at the molecular level. The most likely candidate to be this organism is baker's (or brewer's) yeast, which was the pioneer cell type for many revolutionary experiments, including the first to have its genome sequenced. Because of the surprising degree of similarity at the molecular level between yeast and man, ground-breaking discoveries in yeast often reveal much about equivalent events in human cells.

Proteins are the major players that do things inside cells. So one way to understand any organism is to classify what its proteins do. In some cases pure proteins can be studied, but this is too challenging to do for every protein, and so another way to classify proteins is needed. Using genome sequences, we can very easily determine the sequence of the proteins coded by the genes. We can then look at the sequence of each protein in turn to find out if it is similar to a protein whose function we already know. Proteins whose sequences are similar, even if one is in yeast and another in human, are then said to be in a single protein family. As the families get bigger a new phenomenon occurs from looking at all the sequences together: we often find subtle patterns that the proteins share. The patterns are very useful, because often we can use the patterns to find even more sequences, slightly more distantly related but still in the family.

This approach is the one that has been applied universally to all new genomes and it helps identify what many of the proteins are doing. But it is far from universally successful. For yeast proteins there is a problem of perspective. The place where we typically start looking at a protein family is in humans. However, there are very many sequenced genomes for other animals, particularly vertebrates. So the patterns we find are very strongly biased to the vertebrate members, and sometimes the similarity shown by the yeast family member is too vague to be noticed. A second problem is that the whole approach of using a family to find a new member has now been rendered out of date. A new approach is to work out for a new protein what proteins are in its close-knit family among other closely related species, and to use this family to find the pattern of shared sequence. Then, instead of using the pattern to find another sequence, the pattern is compared only to other patterns. Because each pattern holds within it much more information than one sequence can, this comparison sees far more subtle similarities, so it ends up identifying more distant relationships that we could not see before.

We suspected that comparing patterns would increase what is currently known about the relationships between yeast proteins and proteins in other well understood organisms, including humans. In a sample of 130 proteins (2% of yeast's total) we found over 20 new relationships for at least part of the protein - one new piece of information for every six proteins. This ratio rose to one in three for proteins where no family relationship had been known previously. Finding these new relationships is a considerable step towards the complete mapping of this model organism.

We will now carry out our analysis for the whole yeast genome and create a web resource for yeast researchers to freely access. No genome-wide analysis of patterns has been done before. The patterns will be made and compared by computers, with minimal input from the research team. A major part of the project will be raising awareness of our results by linking them to the most prominent web resource used by yeast

Technical Summary

The eukaryotic model system that is understood in the most detail is budding yeast. However, the value of systems wide "omics" experiments is limited by the lack of information on the likely function of ~1/6 of the yeast proteome, as 1000 proteins have no discernible homology that points to possible function. Many other proteins that do have homologs elsewhere appear to be lacking key domains, so it is not clear if the yeast protein is an ortholog.

Current approaches to detect yeast proteins all involve iterating to make sequence profiles and searching for significant matches in yeast. We have found two ways of enhancing detection of homologous domains in yeast. The first is to initiate searches with yeast sequences. Reversing the direction of search adds information because iterative searches are non-commutative. The second enhancement is to change from profile-sequence to profile-profile searches, which are known to be more sensitive. Our pilot work showed that these two advances will likely add ~1000 new domains to the yeast proteome, and will reduce the proteins with no functional homologue from ~1/6 to 1/10.
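
To illustrate why comparing two profiles can carry more signal than comparing a profile with a single sequence, the following is a minimal sketch only. It is not the HHsearch algorithm (which aligns profile hidden Markov models); it assumes pre-aligned, ungapped toy families of equal length, and the function names, the pseudocount value and the dot-product score are all hypothetical choices made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_seqs, pseudocount=1.0):
    """Turn equal-length aligned sequences into per-column residue frequencies.

    Toy stand-in for a sequence profile; the pseudocount is an assumed value.
    """
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all sequences must have the same aligned length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_seqs if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def column_similarity(col_a, col_b):
    """Dot product of two frequency columns (higher means more alike)."""
    return sum(col_a[aa] * col_b[aa] for aa in AMINO_ACIDS)


def profile_similarity(profile_a, profile_b):
    """Mean column similarity of two ungapped profiles of equal length."""
    if len(profile_a) != len(profile_b):
        raise ValueError("this sketch only compares profiles of equal length")
    scores = [column_similarity(a, b) for a, b in zip(profile_a, profile_b)]
    return sum(scores) / len(scores)


# Hypothetical toy families, not real yeast data.
yeast_family = build_profile(["ACDK", "ACEK", "ACDR"])
distant_family = build_profile(["ACDK", "GCDK"])
unrelated_family = build_profile(["WWYF", "WFYF"])
```

In this toy setting, `profile_similarity(yeast_family, distant_family)` is larger than `profile_similarity(yeast_family, unrelated_family)`, because each column pools the residues tolerated across a whole family rather than relying on one sequence.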

In this project we will create the first ever proteome-wide profile-profile map of all domains in yeast. We have already accumulated all the data for the map in a database of >100,000 searches, so in the project the first main task will be to parse this database for matches to make a draft of where the matches are. The second task will be to create readily interpretable diagrams and linked descriptions of each match. This will include developing explicit rules on including matches of borderline significance. The third task will be dissemination of results to maximise access to our data. The final task will be automating the pathway for maintenance and upgrade.

The new resource will benefit not only genomic work in academia and pharma, but also individual researchers working on the yeast proteins we annotate or their homologs in fungi, crops and humans.

Planned Impact

Industry

There is huge interest in specific organisms and biological pathways in industry, either to use (micro)organisms to make specific biological molecules or to study larger organisms with economic value, particularly in the food chain. The approaches open to academia are also used in industry, so our work will also have a big impact there.

Micro-organisms as factories and biologicals including Biofuels

Yeast itself is being actively investigated as a bioreactor for production of high value biologicals such as edible polyunsaturated oils, industrial lubricants, diesel fuel and drugs. High throughput genetic approaches are used to identify blockages preventing better expression of the desired pathway in the yeast cell. However, in all these cases genes of unknown function are a barrier to understanding the data. With 15% of hits in this category, a key portion of results can only be framed in terms of other similar experiments. Being able to factor in data from a different dimension, the structural homology to protein families with a range of likely functions, may make a data set much more interpretable.

Other microorganisms, in particular other fungi, share proteins with yeast, so our findings will be particularly helpful there.

Crops (and fungi again)

Plants and yeast are only distantly related, yet there are still domains of unknown function that they share, so advances in yeast will inform advances in crop biology and hence lead to economic benefit. An example of this is the family of SRPBCCs that we have discovered using HHsuite, which has multiple as yet undiscovered members not only in fungi and man, but also plants. The project as a whole will reveal links that advance understanding of how economically important plants deal with stresses including infestations and drought, and so further benefit the agricultural community.

Increased food security will come not only by increasing botanical knowledge, but also through increased understanding of fungi, which cause losses of >$200Bn / year world-wide, a figure that would be even greater if not for repeated antifungal treatment. In the UK alone fungicides prevent £200M/y wheat wastage by septoria leaf blotch. However, resistance is increasing, and to ensure food security we must develop new antifungal strategies. Therefore, beneficiaries will include agriculture overall.

Human health and well-being
Better identification of structural homologies in yeast will also have implications for human biology. If HHprY works well enough to be a proof of principle, it could be applied to humans, where many proteins of completely unknown function (e.g. c9orf72) are being linked to disease by modern genetics, where disease loci are easily tracked by sequencing.

Industrial Structural Bioinformatics
Companies need to understand their own, highly valuable big datasets, and so they too need to eliminate problems of proteins of unknown function. Therefore work that we are carrying out here will be a proof of principle for similar projects in industry, particularly where companies have invested in technology to carry out genome-wide experiments. Taking just the single area of biofuels, the need to work in microorganisms with large numbers of proteins of unknown function is severe. The best models are specialised fungi (e.g. Yarrowia, which is more adapted to make diesel than budding yeast) and various algae, which have ecological advantages for the development of biofuels. All these species have many proteins of unknown function, so an HHpr-Alg or HHpr-yarrow would be an attractive proposition.

* * * * *

Impact on society

The eventual understanding of a single cell in all its detail is an achievement that science can only dream of at present. The impact of reaching this goal at some stage in the future will probably be even bigger for society than for science. This cell is likely to be budding yeast, and the work of this project will contribute to that.

Publications

Description We have discovered that applying the HHpred/HHsearch algorithm in a parallel processing supercomputer requires many changes to be made to the code, compared to running it in series on a single unit. We have found that sensitivity for single small domains within whole proteins is low compared to "chunking" up the protein into small segments. However, since we have a whole proteome dataset based on whole proteins, we have instigated a second round in which gaps between known domains are searched, thus increasing sensitivity. We have created an interface for viewing the results which allows user input in terms of the threshold and the way hits are clustered. We have yet to analyse the full data set of hits, so the key finding we want most, which is the number of new domains predicted and the confidence with which they are predicted, is not yet known.
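
The second-round idea, searching the stretches left uncovered by first-pass domain hits, can be sketched as below. This is an illustration under stated assumptions, not the project's actual pipeline: the function name `find_gaps`, the 1-based inclusive coordinates and the `min_gap` cut-off of 30 residues are all hypothetical.

```python
def find_gaps(protein_length, domains, min_gap=30):
    """Return uncovered regions of a protein as (start, end) tuples.

    domains: iterable of (start, end) first-pass hits, 1-based and inclusive.
    min_gap: assumed minimum length (in residues) worth re-searching.
    """
    gaps = []
    cursor = 1  # first residue not yet covered by any domain seen so far
    for start, end in sorted(domains):
        # The uncovered stretch runs from cursor to start - 1 inclusive.
        if start - cursor >= min_gap:
            gaps.append((cursor, start - 1))
        # Overlapping or nested hits must not move the cursor backwards.
        cursor = max(cursor, end + 1)
    # Tail of the protein after the last domain.
    if protein_length - cursor + 1 >= min_gap:
        gaps.append((cursor, protein_length))
    return gaps


# Hypothetical 500-residue protein with three first-pass hits, two overlapping.
second_round_regions = find_gaps(500, [(50, 120), (100, 200), (400, 450)])
```

Each returned region could then be submitted as its own query, which is in the spirit of the "chunking" observation above: a small domain is searched on its own rather than as part of the whole protein.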
Exploitation Route We plan to take this forward by creating similar databases for other proteomes, particularly of evolutionarily diverse eukaryotes, as were studied previously by Margaret Robinson's group (in their Reverse HHpred database). Results will be linked across this wide set of proteomes.
We also are very keen to provide our dataset to SGD, the largest WWW resource.
Sectors Agriculture, Food and Drink, Healthcare, Pharmaceuticals and Medical Biotechnology

URL http://hhyeast.ucl.ac.uk/
 
Title Supervised by James Hetherington (Honorary Lecturer, Department of Computer Science), members of that department have been finding out how to implement HHsearch on a clustered computer such as Legion. This new knowledge will be disseminated when the whole work-stream has finally been made to function and we publish our results, i.e. these are essential first steps.
Description Implementing software on a distributed/clustered supercomputer identified weaknesses in the code that are missed when implementing on a laptop. Our progress was useful for other researchers who undertake to implement HHsearch and other parts of HHsuite. With a first pass of HHsearch analysis we realised that a second pass was required to look at large gaps in the first data set. With both first and second passes we will analyse how much of the overall "gaps" in the proteome we have succeeded in filling in. The first pass information is currently available on a beta-testing website. On publication of our data this website will go live for a wider public and the data set will become available for download.
Type Of Material Improvements to research infrastructure 
Year Produced 2018 
Provided To Others? No  
Impact none yet - still in development 
URL http://hhyeast.ucl.ac.uk:5000/