A curated, publicly accessible database of protein nanoscale organisation
Lead Research Organisation: University of Birmingham
Department Name: Institute of Immunology & Immunotherapy
Abstract
Current descriptions of cellular systems are incomplete. They can be characterised at the genomic, transcriptomic and proteomic levels, but there is a final level: how those proteins are organised in 3D space. This information is now accessible to scientists thanks to the advent of super-resolution microscopy, especially single-molecule localisation microscopy (SMLM), which allows the positions of biomolecules to be mapped with nanometre precision.
Well-curated, publicly accessible databases have been transformative across biology. Well-established databases such as GenBank are used ubiquitously, but the data they contain are relatively simple. More complex data sets are held in protein structure databases, for example the PDB and the associated predicted structures from DeepMind's AlphaFold2.
This application builds on initial Alan Turing Institute, EPSRC and BBSRC investment, with the aim of creating a national and global resource for the storage, sharing, curation and processing of SMLM data. Once it is established, we will lay the foundations for a new field of -omics, nano-omics: the study of protein nanoscale organisation.
Ultimately, the resource will be a database where users can store, share and disseminate their SMLM data, benefitting public engagement with science, aiding collaboration and helping to meet the data-sharing mandates of funders and publishers. A community management structure will ensure the database follows best practice for research ethics and scientific excellence. The database will also feature advanced data analysis tools running in the cloud, allowing users to extract biologically relevant information from uploaded datasets. This brings advanced statistical analysis to those without the means to perform it themselves and helps to democratise advanced imaging. Finally, we will conduct primary research into the meta-analysis of the uploaded data and initiate a new field of nano-omics: the study of the diversity of protein nanoscale organisation between proteins, cells and organisms.
Technical Summary
The data from SMLM are not in the form of traditional microscopy images (i.e. arrays of pixels) and therefore cannot be incorporated into the microscopy image databases that are already emerging. Instead, SMLM data are pointillist: a point cloud, i.e. a list of the xyz coordinates of labelled molecules. Each coordinate represents the position of an imaged molecule of interest.
This database will store regions-of-interest (ROIs) containing such coordinates. A typical ROI is a 3000 x 3000 nm area (or volume) and in most data sets will consist of 100 to 10,000 xy coordinates. The data will be in the form of .csv files. A typical condition (e.g. cell type X, control condition) will consist of 10 to 100 ROIs stored together in a folder. Each condition will have a set of associated metadata (e.g. cell type, protein name, fluorophore, microscope hardware settings).
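As an illustration of this storage layout, the following is a minimal sketch of how one condition folder might be read into memory. The column names `x` and `y`, the nanometre units and the names `ROI`, `Condition`, `load_roi` and `load_condition` are assumptions made for the example; the actual file schema of the database is not specified here.

```python
import csv
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ROI:
    """One region of interest: a list of (x, y) localisations, assumed to be in nm."""
    name: str
    points: list[tuple[float, float]]


@dataclass
class Condition:
    """A folder of ROIs plus the metadata describing how they were acquired."""
    metadata: dict[str, str]
    rois: list[ROI] = field(default_factory=list)


def load_roi(csv_path: Path) -> ROI:
    # Assumed (hypothetical) layout: a header row with at least "x" and "y" columns.
    with csv_path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        points = [(float(row["x"]), float(row["y"])) for row in reader]
    return ROI(name=csv_path.stem, points=points)


def load_condition(folder: Path, metadata: dict[str, str]) -> Condition:
    # Every .csv file in the folder is treated as one ROI of the condition.
    rois = [load_roi(path) for path in sorted(folder.glob("*.csv"))]
    return Condition(metadata=dict(metadata), rois=rois)
```

A call such as `load_condition(Path("cell_type_X_control"), {"cell_type": "X"})` would then return a single object holding every ROI in that (hypothetical) folder together with its metadata.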
Each condition can be kept private by the uploader, made publicly accessible or shared with selected, registered users. The database is searchable via the metadata.
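The three sharing modes and the metadata search could be modelled along the following lines. This is a sketch only: `Visibility`, `ConditionRecord`, `can_view` and `search` are hypothetical names, and exact, case-insensitive matching of metadata values is an assumption rather than a description of the real search behaviour.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"   # visible to the uploader only
    SHARED = "shared"     # visible to the uploader and selected registered users
    PUBLIC = "public"     # visible to everyone


@dataclass
class ConditionRecord:
    condition_id: str
    owner: str
    visibility: Visibility
    metadata: dict[str, str]
    shared_with: set[str] = field(default_factory=set)


def can_view(record: ConditionRecord, user: Optional[str]) -> bool:
    """Decide whether a user (None for an anonymous visitor) may see a record."""
    if record.visibility is Visibility.PUBLIC:
        return True
    if user is None:
        return False
    if user == record.owner:
        return True
    return record.visibility is Visibility.SHARED and user in record.shared_with


def search(records: list[ConditionRecord], user: Optional[str], **criteria: str) -> list[ConditionRecord]:
    """Return the visible records whose metadata match every criterion."""
    hits = []
    for record in records:
        if not can_view(record, user):
            continue
        if all(record.metadata.get(key, "").lower() == value.lower()
               for key, value in criteria.items()):
            hits.append(record)
    return hits
```

Filtering on visibility before matching metadata means a private condition never appears in another user's results, even when its metadata match the query.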
We will implement a new similarity score that quantifies the statistical similarity of the point distributions in two ROIs. This enables a second kind of database search: finding the condition that is statistically most similar to a condition of interest. This is analogous to searching a protein structure database for proteins with similar structural motifs, or searching a gene sequence database for similar sequences.
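The similarity score itself is not defined in this summary. To make the idea of a similarity search concrete, the sketch below substitutes a simple stand-in: it compares the nearest-neighbour distance distributions of two ROIs using the two-sample Kolmogorov-Smirnov statistic and reports one minus that value. This is explicitly not the proposed score, and all function names are hypothetical. The brute-force distance search is quadratic in the number of points, so it is only practical for small ROIs.

```python
import math
from bisect import bisect_right


def nearest_neighbour_distances(points):
    """Distance from each point to its closest other point (brute force, O(n^2))."""
    distances = []
    for i, (xi, yi) in enumerate(points):
        best = math.inf
        for j, (xj, yj) in enumerate(points):
            if i != j:
                best = min(best, math.hypot(xi - xj, yi - yj))
        distances.append(best)
    return distances


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two non-empty samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def similarity(roi_a, roi_b):
    """Stand-in score in [0, 1]; 1 means identical nearest-neighbour distributions."""
    return 1.0 - ks_statistic(nearest_neighbour_distances(roi_a),
                              nearest_neighbour_distances(roi_b))


def most_similar(query, candidates):
    """Name of the candidate ROI (dict of name -> points) that scores highest."""
    return max(candidates, key=lambda name: similarity(query, candidates[name]))
```

Because the stand-in depends only on inter-point distances, it is unaffected by translating or rotating an ROI, which is one property a practical score would plausibly need.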
Once uploaded, ROIs can be processed in the cloud (i.e. via the database website) using advanced analysis algorithms to extract biologically relevant descriptions. One example is cluster analysis, in which each ROI is tested for whether its points form clusters and, if so, how many clusters there are, what percentage of molecules are monomers, and so on. Similarly, the network architecture of actin or tubulin fibres can be extracted, including the degrees of curvature and branching.
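The specific clustering algorithms are not named in this summary. As one possible illustration, the sketch below applies DBSCAN, a widely used density-based method, written in plain Python. The parameter values (`eps` of 50 nm, `min_pts` of 3) are arbitrary assumptions, and treating every unclustered localisation as a monomer is a simplification made for the example.

```python
import math


def region_query(points, index, eps):
    """Indices of all points within eps of points[index], itself included."""
    xi, yi = points[index]
    return [j for j, (xj, yj) in enumerate(points)
            if math.hypot(xi - xj, yi - yj) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = noise  # may later be relabelled as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


def summarise(points, eps=50.0, min_pts=3):
    """Number of clusters and percentage of points left unclustered."""
    labels = dbscan(points, eps, min_pts)
    n_clusters = len({label for label in labels if label != -1})
    unclustered = sum(1 for label in labels if label == -1)
    percent = 100.0 * unclustered / len(points) if points else 0.0
    return {"n_clusters": n_clusters, "percent_unclustered": percent}
```

The summary dictionary corresponds to the kind of per-ROI descriptors mentioned above: how many clusters were found and what fraction of points fall outside any cluster.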
People
Dylan Owen (Principal Investigator)