Building the PTM map of the human genome through commensal computing

Lead Research Organisation: University of Liverpool
Department Name: Institute of Integrative Biology

Abstract

In recent years, the concept of "crowd-sourcing" has emerged as an exciting new paradigm for engaging large groups of people to solve a common task - one particularly high-profile example is Wikipedia. Crowd-sourcing can also be applied to data analysis - engaging many distributed machines to solve a problem. This is a hugely exciting development for science, since massive data sets are now commonplace, and it remains a major unsolved problem how research organisations should fund the computing equipment needed to analyse this data explosion. The traditional route has been to purchase large farms of computers (clusters) in a dedicated location. This route is expensive to purchase and expensive to keep up to date: a cluster of 10 computers purchased in 2003 would have similar computing power to a modern desktop PC available today at a fraction of the cost. An alternative model that has received attention recently is cloud computing, in which companies such as Amazon and Google provide access to massive compute farms hosted in distributed locations, on a pay-as-you-use basis. This model is attractive for high-powered, short-term jobs, as purchasing 1 hour of analysis time on 1000 computers costs approximately the same as 1000 hours on 1 computer. This model does not ultimately save any cost in real terms, though, since the service providers aim to profit from the cluster provision. The crowd-sourcing model instead takes advantage of the fact that devices containing CPUs are now ubiquitous - not just in PCs, but also in tablets and mobile phones - and that the vast majority of CPU time on these devices goes unused.

In this application, we are going to put the crowd-sourcing model to work to help annotate the human genome. The completion of the genome sequence was an important scientific landmark, but the crucial task now is to study the functional units within the genome - the genes, and the protein(s) encoded by each gene. We wish to understand the basic function of each protein, what happens if a protein malfunctions - for example if the gene encoding it carries a mutation in some individuals - and how these proteins change in the cell. An important process that happens to proteins is post-translational modification. These are chemical changes made after the protein has been produced from the genetic code, altering its function - making it active or inactive, and influencing which other proteins it can interact with. The genetic code gives us no clues as to which sites in proteins can or will be modified with particular chemical groups, so we must study these modifications experimentally. Mass spectrometry is widely used to study proteins on a very large scale, with a single experiment producing data on thousands of proteins at once. The computational analysis of the data is difficult to perform optimally, so many researchers ignore data on protein modifications because they do not have access to sufficient computing power to analyse them properly. In this project, we are going to build a tool that runs in any browser platform (PC, tablet, phone etc.), which will perform large-scale analysis of proteomics data. Our tool can be embedded in social media platforms, such as Facebook, so that the public can get personally involved in an important scientific endeavour, simply by leaving a near-silent application running in browser windows they already have open, or by playing an interactive game we will build that maps the problem onto a solvable puzzle. This will provide us with a very large amount of CPU time for analysing the data fully, as well as engaging human brains to interpret challenging data. All results will be fed back into the genome annotation effort, so we can start to fully understand how every protein encoded in the human genome can be modified in different cell types. Other researchers will be able to mine this important data for their own studies in a wide variety of biological and biomedical contexts.

Technical Summary

We are developing a crowd-sourcing tool for massively parallel re-analysis of mass spectral data from proteomics studies, called the Human Proteome Modifier (HPM). The HPM tool features an unrestricted search for all types of variable modifications on proteins, including post-translational modifications (PTMs) as well as chemical artefacts. Such searches are rarely performed by proteomics groups at present because of the CPU time required. HPM will function as a browser-embedded application (running on PCs, tablets or phones) which will make use of a small amount of client-side CPU time while users are browsing websites, using social media applications or playing games. We have adapted this computing model from the concept of "parasitic computing" - that is, stealing CPU time - into "commensal computing", since users will be aware that their CPU time is being used for the public good (human genome annotation), at no noticeable cost to themselves.
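The client-side pattern can be illustrated with a minimal sketch (TypeScript, browser-side). The endpoint names, work-unit format and scoring routine below are illustrative assumptions, not the actual HPM protocol; the essential idea is a quiet background loop that requests a small work unit, scores candidate peptides against a spectrum using spare CPU time, and returns the result.

```typescript
// Minimal sketch of a browser-side "commensal computing" worker.
// The endpoints (/work, /result) and data shapes are hypothetical,
// for illustration only - not the actual HPM protocol.

interface WorkUnit {
  id: string;
  spectrum: { mz: number; intensity: number }[];              // observed peaks
  candidates: { peptide: string; theoreticalMz: number[] }[]; // peptides to score
}

interface Result {
  workUnitId: string;
  bestPeptide: string;
  score: number;
}

// Crude peak-matching score: count theoretical fragment masses that fall
// within a tolerance of an observed peak (a stand-in for a real search score).
function scoreCandidate(
  spectrum: WorkUnit["spectrum"],
  theoreticalMz: number[],
  tol = 0.02
): number {
  let matches = 0;
  for (const mz of theoreticalMz) {
    if (spectrum.some(peak => Math.abs(peak.mz - mz) <= tol)) {
      matches++;
    }
  }
  return matches;
}

async function processOneUnit(server: string): Promise<void> {
  // Ask the server for a small work unit (one spectrum plus its candidates).
  const unit: WorkUnit = await (await fetch(`${server}/work`)).json();

  // Score every candidate and keep the best match.
  let best: Result = { workUnitId: unit.id, bestPeptide: "", score: -1 };
  for (const cand of unit.candidates) {
    const s = scoreCandidate(unit.spectrum, cand.theoreticalMz);
    if (s > best.score) {
      best = { workUnitId: unit.id, bestPeptide: cand.peptide, score: s };
    }
  }

  // Report the result back to the server.
  await fetch(`${server}/result`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(best),
  });
}

// Run quietly in the background while the user keeps browsing:
// fetch a unit, score it, report back, pause briefly, repeat.
export async function commensalLoop(server: string): Promise<void> {
  while (true) {
    try {
      await processOneUnit(server);
    } catch {
      // Server unreachable or no work available - back off and retry.
    }
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
}
```

Keeping each work unit small is what makes the model "commensal" rather than parasitic: the loop only ever borrows brief slices of otherwise idle CPU time, and stops as soon as the user closes the page.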

The HPM tool will be available for proteomics labs to upload new data sets for unrestricted modification searching, and for re-analysis of all (human proteome) data sets available in public databases. The results will be fed into a database we will develop, called HPM-DB, which will be mined by the Human Proteome Project with the aim of discovering all experimentally observable modification sites on proteins. Visualisation software will be provided for specialists to analyse spectrum-level evidence, and for non-specialists to appreciate the strength of evidence for a given PTM site identified by HPM. Proteomics groups will also use HPM-DB to learn about the frequencies of all types of modifications that can occur on proteins. We will work with industrial collaborators to embed HPM in distributed games, to increase the uptake of the tool, maximise the amount of CPU time available for data analysis and employ the lay public as problem solvers.
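The kind of record HPM-DB might hold for each reported site can be sketched as follows; the field names and example values are illustrative assumptions rather than the eventual schema.

```typescript
// Illustrative sketch of a PTM-site record that a database such as HPM-DB
// could store; field names are assumptions, not the actual HPM-DB schema.

interface PtmSiteRecord {
  proteinAccession: string;         // e.g. a UniProt accession
  residue: string;                  // modified amino acid, e.g. "S"
  position: number;                 // position within the protein sequence
  modification: string;             // named PTM, or a delta mass for open searches
  deltaMass: number;                // observed mass shift in Daltons
  datasetId: string;                // public data set the evidence came from
  spectrumCount: number;            // number of supporting spectra
  bestScore: number;                // best peptide-spectrum match score
  localisationProbability: number;  // confidence the modification sits on this residue
}

// A specialist view would drill down to the spectrum-level evidence behind a
// record, while a non-specialist view might summarise spectrumCount and bestScore.
const example: PtmSiteRecord = {
  proteinAccession: "P04637", // p53, as a well-known illustration
  residue: "S",
  position: 15,
  modification: "Phospho",
  deltaMass: 79.966,
  datasetId: "PXD000001",
  spectrumCount: 12,
  bestScore: 87.4,
  localisationProbability: 0.98,
};
```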

Planned Impact

- Both large pharma and smaller biotech SMEs will see direct benefits from the provision of HPM-DB as a resource for studying human proteins in a wide variety of contexts.
- Commercial software developers working in proteomics will benefit from HPM-DB through the provision of very high-quality data sets for training their peptide / modification identification algorithms.
- Research councils and charities funding computationally intensive Life Sciences research will see indirect benefits if the crowd-sourcing model can be effectively deployed for data analysis, with potentially enormous savings in high-performance computing costs.

The staff employed on the project will benefit through the development of skills and understanding in this cutting-edge area of software development.

Description In this grant, we have developed a new software approach to distributed computing that can perform large-scale bioinformatics processing for a complex task (identifying modified peptides from proteomics mass spectrometry data). In this model, we use the CPUs of distributed machines (desktops, laptops, tablets, phones) over the internet while users are browsing particular websites. The software has now been demonstrated to scale up, running on 100s of PCs, without requiring any software installs: to obtain almost limitless CPU potential, users simply have to open a web browser and point it at a webpage that hosts our application (a minimal sketch of this embedding approach follows this record). We have demonstrated exceptional speed-up for performing very large proteomics searches at comparable sensitivity to regular search engines. Our software is called Dracula and has been submitted for review.
Exploitation Route The platform demonstrates that the CPU crowd-sourcing methodology is viable for bioinformatics as an alternative to HPC clusters, and we anticipate it will become a new mode of data processing.
Sectors Healthcare, Manufacturing, including Industrial Biotechnology

URL https://github.com/PGB-LIV/crowdsource-server
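How a hosting webpage might hand the computation off without disturbing browsing can be sketched as below. The worker file name and the commensalLoop entry point are assumptions used for illustration, not the published Dracula code (available at the repository above); the point is that a dedicated Web Worker keeps the hosting page responsive and releases the CPU as soon as the participant leaves.

```typescript
// Sketch of how a hosting page could start the computation off the main
// thread. File names and the commensalLoop entry point are hypothetical,
// shown only to illustrate the no-install, browser-only model.

// hpm-worker.ts - runs inside a dedicated Web Worker:
//   import { commensalLoop } from "./commensal";
//   commensalLoop("https://example.org/hpm");

// Host page script: spin up the worker when the page loads...
const worker = new Worker("hpm-worker.js");

// ...and stop it the moment the participant navigates away.
window.addEventListener("beforeunload", () => {
  worker.terminate();
});
```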
 
Description BBSRC-NSF/BIO PTMeXchange: Globally harmonized re-analysis and sharing of data on post-translational modifications
Amount £310,483 (GBP)
Funding ID BB/S017054/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 10/2019 
End 09/2022
 
Title Crowdsourcing platform for PTM finding 
Description We have developed a computational platform for distributed processing of proteomics mass spectrometry data, so that CPU time can be accessed on users' machines via a web browser. 
Type Of Technology Webtool/Application 
Year Produced 2017 
Open Source License? Yes  
Impact Impact is ongoing; we hope to demonstrate that this method can match HPC levels of compute performance without needing a cluster, instead using spare CPUs on desktops, tablets and phones. 
URL http://pgb.liv.ac.uk/dracula
 
Title phpMs 
Description A web-based library for processing proteomics data, written in the popular web development language PHP. 
Type Of Technology Webtool/Application 
Year Produced 2018 
Open Source License? Yes  
Impact The library supports the community of web developers wishing to work with proteomics data and public data standards. 
URL http://pgb.liv.ac.uk/phpMs/