FPGA supercomputing technology for high-throughput identification and quantitation in proteomics

Lead Research Organisation: University of Sheffield
Department Name: Automatic Control and Systems Eng

Abstract

Proteomics is the study of the entire protein complement of a cell in a particular state. It is the proteins that 'act out' the information in the genome, and we cannot really understand cellular function without a detailed knowledge of the activity, dynamics and interplay of the 'actors'. However, the science and technology of proteomics do not lend themselves to the same highly multiplexed approaches that can be applied to nucleic acids, and strategies for protein identification and quantification are still highly serial, slow, and dependent on complex and sometimes arcane data processing. We have almost completed a proof-of-concept BBSRC e-science project that aimed to implement two common methods in proteomics, mass spectrum preprocessing and peptide mass fingerprint database searching, in hardware using reconfigurable computer chips known as field programmable gate arrays (FPGAs). A key feature of this computational platform is that the bioinformatics algorithms, which are normally implemented as software programs, were translated into optimized digital hardware processors that process data significantly faster by running multiple analyses in parallel. The outcome of this project was a complete implementation that achieved a 2000-fold speed increase. We now wish to build on this success, capitalize on the capabilities we have developed thus far, and deliver similar speed gains to the most commonly used method of proteome analysis, which is based on tandem mass spectrometry. At the same time, we will address an emergent and pressing need for faster and enhanced quantification, delivering new quantitative approaches and capabilities to proteomics researchers. Such tools are critical if proteomics is to deliver what we expect of it as a science.

Technical Summary

This project aims to develop a high-performance FPGA-based bioinformatics solution for high-throughput LC-MS/MS-based protein identification and quantification. The proposal builds on a successful BBSRC project that produced the first complete reconfigurable computing solution for protein identification. The prototype system achieved a 2000-fold increase in computational speed compared with a standard software solution. The FPGA hardware, which incorporates a raw mass spectra processor and a parallel search engine, delivers a match in less than a quarter of a second when searching the entire MSDB protein database. Developing a similar bioinformatics platform to address the computational challenges in tandem mass spectrometry and quantitative proteomics will involve designing an MS/MS protein identification engine and a separate quantification engine. The hardware platform will consist of a reconfigurable computing motherboard that can hold three additional FPGA modules. The on-board FPGA will be used to perform quantification. An additional FPGA module, with 1Gb SDRAM memory to hold the protein database, will be used to run the search engine. A key feature of the computational platform is the ability to perform computations in hardware that exploit algorithm-level and instruction-level parallelism. This leads to significant increases in performance while retaining much of the flexibility of a software solution. The main challenges relate to redesigning, partitioning and mapping the protein identification algorithms onto the reconfigurable hardware. The proposed solution will dramatically enhance the efficiency of the proteomics-related algorithms. Matching the 2000-fold speed increase achieved with the peptide mass fingerprinting solution would mean that a quantitative analysis that currently takes one hour could be completed in less than two seconds.
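
The closing estimate can be checked with simple arithmetic. The sketch below assumes that the 2000-fold speed-up carries over unchanged to the quantitative workflow, which is the stated target of the project rather than a measured result; the function name is illustrative only.

```python
def accelerated_seconds(baseline_seconds: float, speedup: float) -> float:
    """Run time after applying a uniform speed-up factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_seconds / speedup


one_hour = 60 * 60  # baseline analysis time, in seconds

# Assumption: the full 2000-fold gain applies to the whole analysis.
print(accelerated_seconds(one_hour, 2000))  # 1.8
```

One hour divided by 2000 is 1.8 seconds, consistent with the "less than two seconds" figure above.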

Publications

 
Description We developed two complementary high-performance FPGA co-processing solutions (FPGA-I and FPGA-II) for protein identification using tandem mass spectra. Each system consists of a multi-FPGA PCI card that can be attached to any standard PC; Intellectual Property Cores (i.e. the software-like FPGA logic designs of digital processors that implement the database searching and matching algorithms), which may be licensed to another party; and C routines that implement the interfacing protocols between the PCI board and the server-side software. To enable existing users to exploit the benefits of FPGA technology without changing their established proteomic workflow, the FPGA systems have been integrated tightly with the popular protein search engine X!Tandem. We developed additional functionality for this open-source software which, in effect, allows users to integrate the FPGA systems seamlessly with existing proteomics data analysis pipelines such as the Trans-Proteomic Pipeline (TPP).

The design of the FPGA-I system has been optimized for real-time processing of MS/MS data. The FPGA-I system is over 100 times faster than the X!Tandem software solution running on a Dual Quad Core system with 4 GB RAM. By increasing the number of search processors, the system could be scaled up to take advantage of the dramatic increase in logic capacity and performance offered by the latest Virtex 7 devices, which would allow 13 times more search processors to fit on a single chip, resulting in a 13-fold increase in performance. This provides the means to process data in real time as it is produced by the instruments (modern instruments can generate more than 200 spectra per second), enabling the instrument parameters to be optimized in real time rather than by repeating the experiments.
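
To make the real-time requirement concrete, the following sketch works out the time budget per spectrum at the quoted acquisition rate, and the software throughput that a given speed-up would need to start from in order to keep pace. It assumes, as the scaling argument above does, that throughput grows linearly with the speed-up factor; the function names are illustrative and no measured software throughput is implied.

```python
def per_spectrum_budget_ms(spectra_per_second: float) -> float:
    """Time available to search one spectrum without falling behind."""
    return 1000.0 / spectra_per_second


def min_baseline_rate(acquisition_rate: float, speedup: float) -> float:
    """Software throughput (spectra/s) needed so that the accelerated
    system matches the instrument, assuming linear scaling."""
    return acquisition_rate / speedup


acquisition_rate = 200  # spectra per second, the figure quoted above

print(per_spectrum_budget_ms(acquisition_rate))       # 5.0 ms per spectrum
print(min_baseline_rate(acquisition_rate, 100))       # 2.0 spectra/s at 100-fold
print(min_baseline_rate(acquisition_rate, 100 * 13))  # with the further 13-fold scaling
```

At 200 spectra per second each search has 5 ms available, so a 100-fold accelerated system keeps up whenever the software baseline handles at least 2 spectra per second.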

The design of the FPGA-II system has been optimized for processing very large MS/MS data files containing tens or even hundreds of thousands of spectra. A second solution was developed because the search strategy adopted for the FPGA-I system was not optimal for batch processing. The FPGA-II system achieves a more than 100-fold speed-up compared with a high-performance Dual Quad-Core PC with 12 GB RAM running 64-bit Windows 7. In practice, searches that take more than an hour on the PC are performed in less than a minute by the FPGA-II system.
Exploitation Route The challenges of mass spectrometry based proteomics have been largely met in terms of sheer instrument capability, but there remain major obstacles to effective data analysis, such that substantially more time is spent analyzing data than acquiring it. With the computing power normally found in a typical proteomics lab, the time spent processing large data sets is an impediment, and an obstacle to re-evaluating data streams using different algorithms or parameter sets. We eliminated this processing bottleneck by developing generic FPGA hardware solutions that reduce processing times from hours to seconds. These solutions would greatly enhance the delivery of high-quality, quantitative proteomics methods in small-scale and, particularly, large-scale studies, such as the proteomics component of large-scale systems biology programmes and biomarker profiling. The technology developed as part of this project can readily be delivered to virtually all proteomics labs.
We have recently learned that the designs of our Intellectual Property Cores may be patentable. Together with Fusion IP, the University's commercialization partner, we have been exploring the possibility of licensing the designs. We plan to demonstrate the system further to academic and commercial users. These activities are ongoing.
Sectors Digital/Communication/Information Technologies (including Software),Healthcare,Pharmaceuticals and Medical Biotechnology

 
Description Our findings have informed the development of new strategies for processing mass spectra by Shimadzu, a major MS instrument manufacturer.
First Year Of Impact 2011
Sector Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Title FPGA System for Protein Identification and Quantitation 
Description Developed high-performance FPGA-based bioinformatics solution for high-throughput proteomics. Solution consists of a PC server equipped with FPGA board, FPGA implementation of data processing and database search algorithms, server side user interface and data processing software. 
Type Of Material Improvements to research infrastructure 
Year Produced 2010 
Provided To Others? Yes  
Impact The hardware-implemented algorithms for de-noising, baseline correction, peak identification and deisotoping, running on a Xilinx Virtex 2 FPGA at 180 MHz, generate a mass fingerprint over 100 times faster than an equivalent algorithm written in C, running on a Dual 3 GHz Xeon workstation. 
URL https://www.liverpool.ac.uk/pfg/Pubs/files/00fe1160f83e43a71c582afd3e5ef1dc-28.html
 
Title FPGA accelerated XTandem software 
Description We have modified the open-source X!Tandem software, which is widely used for analysing tandem mass spectrometry data, so that all database searches are run on the FPGA board rather than on the host PC system. 
Type Of Technology software 
Year Produced 2012 
Open Source License? Yes  
Impact No actual Impacts realised to date 
URL http://scale.engin.brown.edu/theses/mandle.pdf
 
Title High Performance Protein Identification Server 
Description PC equipped with FPGA co-processing board and software for protein identification 
Type Of Technology Software 
Year Produced 2013 
Impact No actual Impacts realised to date 
 
Description Kratos Analytical Ltd Manchester 2008 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Invited talk (slides). Initial meeting to explore potential knowledge transfer routes and collaboration. No actual impacts realised to date.
Year(s) Of Engagement Activity 2008