FPGA supercomputing technology for high-throughput identification and quantitation in proteomics

Lead Research Organisation: University of Sheffield

Department Name: Automatic Control and Systems Eng

Abstract

Proteomics is the study of the entire protein complement of a cell in a particular state. It is the proteins that 'act out' the information in the genome, and we cannot really understand cellular function without a detailed knowledge of the activity, dynamics and interplay between the 'actors'. However, the science and technology of proteomics do not lend themselves to the same highly multiplexed approaches that can be applied to nucleic acids, and strategies for protein identification and quantification are still highly serial, require complex and sometimes arcane data processing, and are slow. We have almost completed a proof-of-concept BBSRC e-science project that aimed to implement two common methods in proteomics, mass spectrum preprocessing and peptide mass fingerprint database searching, in hardware using reconfigurable computer chips known as field-programmable gate arrays (FPGAs). A key feature of this computational platform is that the bioinformatics algorithms, which are normally implemented as software programs, were translated into optimized digital hardware processors that could process data significantly faster by running multiple analyses in parallel. The successful outcome of this project was a complete implementation that has achieved a phenomenal 2000-fold speed increase. We now wish to build on our previous success, capitalize upon the capabilities we have developed thus far, and deliver similar speed gains to the most commonly used method of proteome analysis, which is based on tandem mass spectrometry. At the same time, we will address an emergent and pressing need for faster and enhanced quantification to deliver new quantitative approaches and capabilities to proteomics researchers. Such tools are critical if proteomics is to deliver what we expect of it as a science.
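To make the idea of peptide mass fingerprint database searching concrete, the following is a minimal software sketch, not the project's hardware design. It assumes a simple shared-peak count as the score (real search engines use more elaborate probabilistic scoring), and the function names, tolerance and toy mass values are all hypothetical illustrations rather than real peptide data. Each protein is scored independently of the others, which is the property that allows many comparisons to run in parallel.

```python
from bisect import bisect_left, bisect_right


def count_matches(observed, theoretical_sorted, tol):
    """Count observed peak masses lying within +/- tol of any theoretical mass."""
    hits = 0
    for mass in observed:
        lo = bisect_left(theoretical_sorted, mass - tol)
        hi = bisect_right(theoretical_sorted, mass + tol)
        if hi > lo:  # at least one theoretical mass falls inside the window
            hits += 1
    return hits


def rank_proteins(observed, database, tol=0.5):
    """Score every protein independently and return (name, score), best first."""
    scores = {
        name: count_matches(observed, sorted(masses), tol)
        for name, masses in database.items()
    }
    # Sort by descending score, then by name so ties are ordered deterministically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy values chosen for illustration only; they are not real peptide masses.
toy_database = {
    "protein_A": [500.0, 750.0, 1200.0],
    "protein_B": [500.0, 900.0, 1500.0],
}
toy_observed = [500.2, 749.8, 1200.3]

print(rank_proteins(toy_observed, toy_database))
```

In this toy case all three observed peaks fall within the tolerance of a protein_A mass and only one matches protein_B, so protein_A ranks first.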

Technical Summary

This project aims to develop a high-performance FPGA-based bioinformatics solution for high-throughput LC-MS/MS-based protein identification and quantification. This proposal builds on the results of a successful BBSRC project which has resulted in the development of the first complete reconfigurable computing solution for protein identification. The prototype system has achieved a staggering 2000-fold increase in computational speed compared with a standard software solution. The FPGA hardware, which incorporates a raw mass spectra processor and a parallel search engine, delivers a match in less than a quarter of a second when searching the entire MSDB protein database. Developing a similar bioinformatics platform to address the computational challenges in tandem mass spectrometry and quantitative proteomics will involve designing an MS/MS protein identification engine and a separate quantification engine. The hardware platform will consist of a reconfigurable computing motherboard which can hold three additional FPGA modules. The on-board FPGA will be used to perform quantification. An additional FPGA module, with 1Gb SDRAM memory to hold the protein database, will be used to run the search engine. A key feature of the computational platform is the ability to perform computations in hardware which exploit algorithm-level and instruction-level parallelism. This leads to significant increases in performance, while retaining much of the flexibility of a software solution. The main challenges relate to redesigning, partitioning and mapping the protein identification algorithms onto the reconfigurable hardware. The proposed solution will dramatically enhance the efficiency of the proteomics-related algorithms. Matching the 2000-fold speed increase achieved with the peptide mass fingerprinting solution would mean that a quantitative analysis that currently takes one hour could be completed in less than two seconds.
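The closing estimate can be checked with a line of arithmetic. The sketch below assumes the entire workload is accelerated uniformly by the quoted factor, which ignores data transfer and host-side overheads; the function name is a hypothetical helper, not part of the project software.

```python
def accelerated_seconds(baseline_seconds: float, speedup: float) -> float:
    """Run time after acceleration, assuming the whole workload speeds up uniformly."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_seconds / speedup


one_hour = 60 * 60  # seconds
print(accelerated_seconds(one_hour, 2000))  # 1.8
```

One hour is 3600 seconds, and 3600 / 2000 = 1.8 seconds, consistent with the "less than two seconds" figure.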

Funded Value:

£356,480

Funded Period:

Mar 08 - Feb 12

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/F004893/1

Principal Investigator:

Daniel Coca

Research Subject:

Omic sciences & technologies (60%)

Tools, technologies & methods (20%)

Research Topic:

Bioinformatics (20%)

Proteomics (60%)


Publications


Bogdan IA (2009) Peptide mass fingerprinting using field-programmable gate arrays. in IEEE Transactions on Biomedical Circuits and Systems

Bogdán IA (2008) High-performance hardware implementation of a parallel database search engine for real-time peptide mass fingerprinting. in Bioinformatics (Oxford, England)

Daniel Coca (Author) (2010) FPGA Implementation of Database Search Engine for Protein Identification by Peptide Fragment Fingerprinting

Hubbard, Simon; Jones, Andy (2009) Proteome Bioinformatics

I. Bogdan (Author) (2009) Reconfigurable computing solution for Peptide Mass Fingerprinting

Iniewski, Krzysztof (2013) Embedded Systems: Hardware, Design and Implementation

Key Findings
Impact Summary
Research Tools and Methods
Software and Technical Products
Engagement Activities


Description	We developed two complementary high-performance FPGA co-processing solutions (FPGA-I and FPGA-II) for protein identification using tandem mass spectra. Each system consists of a multi-FPGA PCI card which can be attached to any standard PC; Intellectual Property Cores (i.e. the software-like FPGA logic designs of digital processors that implement the database searching and matching algorithms), which may be licensed to another party; and C routines that implement the interfacing protocols between the PCI board and the server-side software. To enable existing users to exploit the benefits of FPGA technology without changing their established proteomic workflow, the FPGA systems have been tightly integrated with the popular protein search engine X!Tandem. We developed additional functionality for this open-source software, which in effect allows users to integrate the FPGA systems seamlessly with existing proteomics data analysis pipelines such as the Trans-Proteomic Pipeline (TPP). The design of the FPGA-I system has been optimized for real-time processing of MS/MS data. The FPGA-I system is over 100 times faster than the X!Tandem software solution running on a Dual Quad-Core system with 4 GB RAM. By increasing the number of search processors, the system could easily be scaled up to take advantage of the dramatic increase in logic capacity and performance offered by the latest Virtex 7 devices, which would enable fitting 13 times more search processors on a single chip, resulting in a 13-fold increase in performance. This provides the means to process data in real time as it is being produced by the instruments (modern instruments can generate more than 200 spectra per second), enabling the instrument parameters to be optimized in real time rather than repeating the experiments.
The design of the FPGA-II system has been optimized for processing very large MS/MS data files containing tens or even hundreds of thousands of spectra. The reason for developing a second solution was that the search strategy adopted for the FPGA-I system was not optimal for batch processing. The FPGA-II system achieves a more than 100-fold speed-up compared with a high-performance Dual Quad-Core PC with 12 GB RAM running 64-bit Windows 7. In effect, searches that take more than an hour on the PC are performed in less than a minute by the FPGA-II system.
Exploitation Route	The challenges of mass spectrometry based proteomics have been largely met in terms of sheer instrument capability, but there remain major obstacles to effective data analysis, such that substantially more time is spent analyzing data than acquiring it. With the normal computing power found in a typical proteomics lab, the time spent processing large data sets is an impediment, and an obstacle to re-evaluation of data streams using different algorithms or parameter streams. We eliminated this processing bottleneck by developing generic FPGA hardware solutions which reduce processing times from hours to seconds, and which would greatly enhance the delivery of high-quality, quantitative proteomics methods in small-scale and, particularly, large-scale studies, such as the proteomics component of large-scale systems biology programmes and biomarker profiling. The technology developed as part of this project is readily delivered to virtually all proteomics labs. We have recently learned that the designs of our Intellectual Property Cores may be patented. Together with Fusion IP, the University's commercialization partner, we have been exploring the possibility of licensing the designs. We plan to demonstrate the system further to academic and commercial users. These activities are ongoing.
Sectors	Digital/Communication/Information Technologies (including Software); Healthcare; Pharmaceuticals and Medical Biotechnology
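The real-time and scaling figures quoted for FPGA-I in the description above can be turned into a rough budget. The sketch below takes the quoted 200 spectra per second and the 13-fold increase in search processors at face value, and assumes, as the description does, that performance scales linearly with the number of search processors. Both function names are hypothetical helpers, and I/O or memory-bandwidth limits are not modelled.

```python
def per_spectrum_budget_ms(spectra_per_second: float) -> float:
    """Time available per spectrum, in milliseconds, to keep pace with the instrument."""
    if spectra_per_second <= 0:
        raise ValueError("spectra_per_second must be positive")
    return 1000.0 / spectra_per_second


def scaled_speedup(base_speedup: float, processor_ratio: float) -> float:
    """Projected speed-up under the assumption of linear scaling with processor count."""
    return base_speedup * processor_ratio


print(per_spectrum_budget_ms(200))  # 5.0 ms per spectrum at 200 spectra per second
print(scaled_speedup(100, 13))      # 1300 under the linear-scaling assumption
```

At 200 spectra per second, a search must finish within about 5 ms per spectrum to keep up with acquisition. Under linear scaling, the quoted 100-fold speed-up would become roughly 1300-fold with 13 times as many search processors.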


Description	Our findings have informed the development of new strategies for processing mass spectra by Shimadzu, a major MS instrument manufacturer.
First Year Of Impact	2011
Sector	Pharmaceuticals and Medical Biotechnology
Impact Types	Economic


Title	FPGA System for Protein Identification and Quantitation
Description	Developed a high-performance FPGA-based bioinformatics solution for high-throughput proteomics. The solution consists of a PC server equipped with an FPGA board, FPGA implementations of the data processing and database search algorithms, a server-side user interface and data processing software.
Type Of Material	Improvements to research infrastructure
Year Produced	2010
Provided To Others?	Yes
Impact	The hardware-implemented algorithms for de-noising, baseline correction, peak identification and deisotoping, running on a Xilinx Virtex 2 FPGA at 180 MHz, generate a mass fingerprint over 100 times faster than an equivalent algorithm written in C, running on a Dual 3 GHz Xeon workstation.
URL	https://www.liverpool.ac.uk/pfg/Pubs/files/00fe1160f83e43a71c582afd3e5ef1dc-28.html


Title	FPGA-accelerated X!Tandem software
Description	We have modified the open-source X!Tandem software, which is widely used for analysing tandem mass spectrometry data, so that all database searches are run on the FPGA board rather than on the host PC system.
Type Of Technology	Software
Year Produced	2012
Open Source License?	Yes
Impact	No actual Impacts realised to date
URL	http://scale.engin.brown.edu/theses/mandle.pdf


Title	High Performance Protein Identification Server
Description	PC equipped with FPGA co-processing board and software for protein identification
Type Of Technology	Software
Year Produced	2013
Impact	No actual Impacts realised to date


Description	Kratos Analytical Ltd, Manchester, 2008
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Professional Practitioners
Results and Impact	Invited talk with slides. Initial meeting to explore potential knowledge transfer routes and collaboration. No actual impacts realised to date.
Year(s) Of Engagement Activity	2008
