PIT-DB: A Resource for Sharing, Annotating and Analysing Translated Genomic Elements

Lead Research Organisation: Queen Mary University of London

Department Name: Sch of Biological and Chemical Sciences

Abstract

The publication of the human genome in 2001 was rightly hailed as a major scientific achievement, but over a decade later we are still far from a complete understanding of the structure of the genome and the role of the various elements within it. While protein coding regions of the genome were identified and used to annotate the genome soon after it was sequenced, many more exotic genomic elements have subsequently attracted interest including pseudogenes, non-coding RNAs and short open reading frames (sORFs).

In recent years, post-genomic bioanalytical techniques such as RNA-seq transcriptomics (which tells us which genomic elements are expressed) and mass spectrometry-based proteomics (which tells us which of the expressed elements are translated into peptides or proteins) have helped refine our understanding of the human genome at a fundamental level. Just this year, two proteomics studies published in Nature caused a stir by showing that no experimental evidence could be found for the expression of several genomic elements widely accepted to code for protein, while other regions of the genome that were not previously thought to be protein coding were in fact found to produce proteins. If this is the situation for the intensively studied human genome, we must assume that the genome annotations for less studied species (so-called non-model organisms) are even less accurate.

We recently developed (and tested, and published) a methodology called proteomics informed by transcriptomics (PIT) that rapidly generates large numbers of genome annotations underpinned by multiple sources of experimental evidence. In PIT, every sample is analysed using both RNA-seq and proteomic mass spectrometry and the data from these two analyses integrated to provide a list of observed proteins and any other translated genomic elements (TGEs), together with the detailed transcriptomic and spectral evidence that underpins these observations. The beauty of PIT compared with traditional proteomics is that no prior sequence knowledge is needed, so novel TGEs (be they proteins or other more exotic features) can be detected. RNA-seq can be used by itself to rapidly generate genome annotations without prior knowledge, but without PIT's mass spectrometry step the confidence in these annotations is limited and there is no guarantee that transcribed elements actually get translated.

In a recent BBSRC TRDF project we developed easy-to-use web-based software workflows, implemented in the popular Galaxy platform, to process the data from PIT experiments in a repeatable way with uniformly formatted output files. This has proven very useful for answering individual biological questions, but there is currently no meaningful way to share the results of PIT experiments. In this project we propose to plug this gap by developing PIT-DB, a web-accessible database of results produced by PIT. This publicly available database will immediately be populated with data from experiments conducted on various species at the University of Bristol, but other groups will be actively encouraged to submit their own data.

Having data from multiple PIT experiments in one database will deliver exciting new scientific insights. As well as simply allowing researchers to share their results from individual PIT experiments, PIT-DB will pool information about individual novel TGEs from multiple experiments so evidence can be accumulated for each individual TGE. Improving the quality of results by using data from replicate experiments is a fundamental concept in science and the utility of doing this on a community-wide basis has been repeatedly demonstrated by other bioinformatics databases such as Ensembl, UniProt and PRIDE. As well as being of interest individually, the well evidenced TGEs in PIT-DB will provide large numbers of experimentally derived (as opposed to computationally predicted) genome annotations for all of the species for which data is present in the database.

Technical Summary

We previously developed PIT (Proteomics Informed by Transcriptomics), a methodology in which a given sample is analysed by both RNA-seq and proteomic mass spectrometry (MS) followed by integration of the acquired data to provide genome-wide information about which genomic elements are transcribed and translated within a given sample. Unlike traditional shotgun proteomics this does not require prior knowledge of the sequences that may be expressed, so provides an unbiased analysis that is as suited to finding novel translated genomic elements (TGEs) as it is to finding established proteins. This type of analysis is getting a lot of attention thanks to recent studies that have questioned the accuracy of widely accepted genome annotations and have found evidence that there are many other molecules translated from RNA - not just proteins.

To make the processing of data from PIT experiments tractable for the typical lab scientist we have developed Galaxy-based data analysis workflows that integrate RNA-seq and MS data to produce uniform output files containing information about all the observed TGEs. We now have a growing collection of results from experiments on several species, and our aim in this project is to produce a web-accessible database called PIT-DB for sharing these results and results collected by other groups around the world.

PIT-DB will be created using standard methods for developing databases and web front ends, but additional work will be done to pool submitted data to build up evidence of TGEs over multiple experiments. This is expected to provide large numbers of novel genome annotations backed up by significant experimental evidence. We will conduct a small validation experiment to check for the existence of a number of novel TGEs from the database.
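The pooling of evidence across experiments described above could look roughly like the following. This is a minimal sketch, not the PIT-DB implementation: the `Observation` record layout, the use of the amino acid sequence as the pooling key, and the `min_experiments` threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One TGE seen in one PIT experiment (hypothetical record layout)."""
    experiment_id: str
    tge_sequence: str   # amino acid sequence of the translated element
    peptide_count: int  # peptides identified by mass spectrometry


def pool_evidence(observations):
    """Group observations by TGE sequence and summarise the supporting evidence."""
    grouped = defaultdict(list)
    for obs in observations:
        grouped[obs.tge_sequence].append(obs)

    pooled = {}
    for sequence, group in grouped.items():
        pooled[sequence] = {
            # Distinct experiments in which this TGE was observed
            "experiments": sorted({o.experiment_id for o in group}),
            # Total peptide-level evidence across all experiments
            "total_peptides": sum(o.peptide_count for o in group),
        }
    return pooled


def well_evidenced(pooled, min_experiments=2):
    """Return TGE sequences observed in at least `min_experiments` experiments.

    The default threshold of 2 is an illustrative assumption.
    """
    return sorted(
        seq for seq, ev in pooled.items()
        if len(ev["experiments"]) >= min_experiments
    )


# Made-up example data: one TGE seen in two experiments, one seen in only one.
observations = [
    Observation("EXP1", "MKTAYIAK", 3),
    Observation("EXP2", "MKTAYIAK", 2),
    Observation("EXP2", "MSLLQR", 1),
]
pooled = pool_evidence(observations)
print(well_evidenced(pooled))
```

Keying on the exact sequence keeps the sketch simple; a real system would probably also need to reconcile overlapping or near-identical TGEs, which is not attempted here.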

Planned Impact

The principal groups who will benefit from this project are:

1. Researchers from academia and industry seeking a better understanding of genomes

Genomics is now the cornerstone of a large proportion of biological research, covering a wide range of applications from medicine and food science through to ecology and industrial biotechnology. In all these areas a detailed understanding of the structure and function of the genome of the species under study is important in answering key research questions. The increased understanding of the genome that PIT-DB provides will accelerate progress towards answering these questions. Given the wide range of biological areas in which genomics is used, this will translate into impact across a broad range of strategically important research areas across BBSRC's remit, including bioenergy, infectious diseases, food security, healthy ageing, animal welfare and synthetic biology.

2. Industry

The range of companies that stand to benefit directly from PIT-DB is large and diverse. To give just three examples:

(i) Small companies dedicated to the discovery of biomarkers and drug targets will find that the plentiful supply of experimentally derived novel translated genomic elements (TGEs) provides a valuable source of material for new research projects.

(ii) Pharmaceutical companies who are becoming increasingly interested in the possible role of novel TGEs such as fusion proteins will benefit from having access to a large catalogue of such TGEs.

(iii) The agri-food industry, in which the analysis of genomes from a wide range of species including farmed animals, parasites, pathogens and multiple cultivars of popular crops is a core activity, will have the opportunity to benefit from the improved annotations of these genomes afforded by pooling of data within PIT-DB.

There will also be a general benefit from the greater transparency and repeatability of published PIT experiments, through the easy sharing of results. This will give industry more confidence in the findings of such research, increasing the likelihood of this research being translated into economic benefit.

3. General public

The ultimate beneficiaries of this project should be the general public, for whom the improved biological insight revealed by the groups above has the potential to lead to new medical treatments, increased food security, greener energy and an improved economy. It is impossible to predict which, if any, of these benefits will come to fruition but by making significant amounts of otherwise difficult-to-access PIT results freely available to researchers via an intuitive web-based user interface we aim to make a significant contribution towards this goal.

Funded Value:

£122,711

Funded Period:

Aug 15 - Aug 16

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/M020118/1

Principal Investigator:

Conrad Bessant

Research Subject:

Tools, technologies & methods (98%)

Research Topic:

Bioinformatics (42%)

Tools for the biosciences (28%)

eScience (28%)

Organisations

People	ORCID iD
Conrad Bessant (Principal Investigator)
DA Matthews (Co-Investigator)

Publications

Maringer K (2017) Proteomics informed by transcriptomics for characterising active transposable elements and genome annotation in Aedes aegypti. in BMC genomics

Chatzimichali EA (2016) Novel application of heuristic optimisation enables the creation and thorough evaluation of robust support vector machine ensembles for machine learning applications. in Metabolomics : Official journal of the Metabolomic Society

Davidson AD (2017) Proteomics technique opens new frontiers in mobilome research. in Mobile genetic elements

Saha S (2018) PITDB: a database of translated genomic elements. in Nucleic acids research

Key Findings
Impact Summary
Research Databases and Models
Research Tools and Methods
Collaboration
Software and Technical Products


Description	A database, accessible to anyone via their web browser, at https://pitdb.sbcs.qmul.ac.uk/ was produced to share data from experiments where samples had been analysed using proteomics and transcriptomics and the data from the two analytical methods integrated. Initial experiments deposited in the database include data from human cells, mice, and two important carriers of viruses: mosquito and flying fox bat.
Exploitation Route	Researchers can share their experimental data via the database, and others can browse and analyse the data for their own computer-based research. In 2020, the Bessant Lab embarked on a major upgrade of PITDB, and in 2021 will be releasing PITDB 2.0, which adds support for quantitative data (at both the RNA and protein level) and contains additional datasets and a streamlined submission protocol.
Sectors	Agriculture, Food and Drink; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology
URL	https://pitdb.sbcs.qmul.ac.uk/


Description	The database has been accessed hundreds of times by researchers around the world. The PDRA employed on this project used the computational skills acquired to transition into the financial sector, becoming Data Science Lead at a major multinational bank. Several MSc Bioinformatics project students have gained valuable experience by being embedded in the Bessant Lab to work on the next version of PITDB.
Sector	Healthcare
Impact Types	Economic


Title	PIT-DB
Description	PITDB is a publicly available database for sharing of results from PIT (proteomics informed by transcriptomics) experiments. PIT involves the analysis of a given sample by both RNA-seq and proteomic mass spectrometry followed by integration of the acquired data to provide an unprecedented level of information about which genomic elements are being transcribed and translated within a given sample, even if the organism under study does not have an annotated genome. Observed translated genomic elements (TGEs) are BLASTed against reference proteomes using a published workflow to determine whether they are known proteins, protein variants or novel gene products.
Type Of Material	Improvements to research infrastructure
Year Produced	2017
Provided To Others?	Yes
Impact	The database was published in Nucleic Acids Research and has been accessed by groups around the world.
URL	https://pitdb.sbcs.qmul.ac.uk/


Title	PITDB
Description	PITDB is a publicly available database for sharing of results from PIT (proteomics informed by transcriptomics) experiments. PIT involves the analysis of a given sample by both RNA-seq and proteomic mass spectrometry followed by integration of the acquired data to provide an unprecedented level of information about which genomic elements are being transcribed and translated within a given sample, even if the organism under study does not have an annotated genome. Observed translated genomic elements (TGEs) are BLASTed against reference proteomes using a published workflow to determine whether they are known proteins, protein variants or novel gene products.
Type Of Material	Database/Collection of data
Year Produced	2017
Provided To Others?	Yes
Impact	Too early to state.
URL	http://pitdb.org


Description	Understanding Alternative Splicing in Human Cancer by Proteomics Informed by Transcriptomics
Organisation	Queen Mary University of London
Department	Centre for Molecular Oncology
Country	United Kingdom
Sector	Hospitals
PI Contribution	We are applying our PIT methodology to experimental data obtained by the group of Dr Pabhakar Rajan, in an effort to help him understand the role that alternative splicing plays in cancer.
Collaborator Contribution	The partner has provided high quality multi-omic data, and valuable domain knowledge.
Impact	This collaboration is multi-disciplinary, leading to novel software tools and improved biological understanding. These will be published in due course.
Start Year	2018


Title	PIT-DB
Description	PITDB is a publicly available database for sharing of results from PIT (proteomics informed by transcriptomics) experiments. PIT involves the analysis of a given sample by both RNA-seq and proteomic mass spectrometry followed by integration of the acquired data to provide an unprecedented level of information about which genomic elements are being transcribed and translated within a given sample, even if the organism under study does not have an annotated genome. Observed translated genomic elements (TGEs) are BLASTed against reference proteomes using a published workflow to determine whether they are known proteins, protein variants or novel gene products.
Type Of Technology	Webtool/Application
Year Produced	2017
Open Source License?	Yes
Impact	The database was published in Nucleic Acids Research and has been accessed by groups around the world.
URL	https://pitdb.sbcs.qmul.ac.uk/
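
The PITDB entries above state that observed TGEs are BLASTed against reference proteomes to decide whether they are known proteins, protein variants or novel gene products. The decision rule is not given here, so the following is only a sketch of how such a classification might work. The function name, the representation of the best hit as a (percent identity, query coverage) pair, and all thresholds are assumptions for illustration and are not taken from the published workflow.

```python
from typing import Optional, Tuple


def classify_tge(
    best_hit: Optional[Tuple[float, float]],
    variant_identity: float = 90.0,  # assumed threshold, percent
    min_coverage: float = 90.0,      # assumed threshold, percent
) -> str:
    """Classify a TGE from its best BLAST hit against a reference proteome.

    `best_hit` is (percent_identity, query_coverage), or None if BLAST
    reported no hit. Returns "known", "variant" or "novel".
    """
    if best_hit is None:
        return "novel"

    percent_identity, query_coverage = best_hit

    # A hit covering only a small part of the TGE is treated as no real match.
    if query_coverage < min_coverage:
        return "novel"

    # A perfect match over (almost) the whole TGE is a known protein.
    if percent_identity >= 100.0:
        return "known"

    # A close but imperfect match is treated as a variant of a known protein.
    if percent_identity >= variant_identity:
        return "variant"

    return "novel"


# Made-up example hits
for hit in [(100.0, 100.0), (97.5, 100.0), (60.0, 100.0), None]:
    print(hit, classify_tge(hit))
```

Returning a plain label keeps the sketch easy to follow; a fuller version would presumably keep the matched reference accession alongside the label so that users can trace each classification back to its evidence.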