Building a Next Generation Image Repository: Molecular Annotation and Cloud-based Data Processing and Analysis

Lead Research Organisation: University of Dundee

Department Name: School of Life Sciences

Abstract

Access to primary research data is vital for the advancement of the scientific enterprise. It facilitates the validation of existing observations and provides the raw materials to build on those observations. In the life sciences, there are numerous examples where members of a research community determined that a particular type of data would be useful and necessary to share. These include gene sequences, protein structural data, and gene and protein expression profiles. In these cases the community united to standardize the structure of the data and its associated metadata, and to create centralized repositories to facilitate deposition, promote discoverability, and ensure the longevity of the data.

Imaging in the life sciences has undergone a revolution in recent years and is now used as a quantitative assay technology throughout the life and biomedical sciences. Imaging is used to understand the behavior of organisms, the formation of embryos, the structure and dynamics of cells, and the function and interactions of molecules that are the building blocks of life. Imaging datasets are complex, heterogeneous, and often extremely large, so they are rarely shared or published.

Based on the recent development of several image data management technologies and the rapidly decreasing cost of large data storage facilities, we propose to create a resource to host, serve, and make available original scientific image data that underpins life sciences research. Our proposal is based on open source technologies with proven utility and performance that already run on-line resources serving several terabytes (TBs) of image data. We propose to place this resource at EMBL-EBI, which is the established home of molecular and structural life sciences data, and to interface the resource with ELIXIR, Europe's research infrastructure for life science informatics. In particular, we will build links with established molecular and structural resources and work towards a seamless integration of these data, so that any scientist can easily browse, query and compute on genomic, structural and phenotypic data across several scales.

Technical Summary

We will construct the Image Data Repository (IDR) based on hardware infrastructure located at EMBL-EBI and integrated with its existing resources for hosting and delivering large datasets to the world's scientific community. These resources will serve as the storage and archive for IDR data. OME's Bio-Formats and OMERO will be used to read, manage, serve, and link the data to EMBL-EBI's molecular and structural resources. We will build custom user interfaces and workflows for the IDR to ensure easy access to, and browsing of, the datasets it holds. To enable computational re-analysis of the data, we will extend OMERO's distributed compute capacity and make use of EMBL-EBI's Embassy system to allow virtual access to IDR data. This virtual resource will provide a 'sandbox' for processing and reanalysis of data deposited in the IDR, and a working example of a next generation data repository that not only stores and manages data but also provides community services for scientific data.

Planned Impact

The resource has the potential to impact all branches of basic life sciences research. If the IDR is built and delivered, the impact on the community will be substantial: datasets that have never previously been accessible will be available for the community to search, view, mine, and even process and analyze. Rich visualization and annotation will make both interactive browsing and programmatic mining possible for the first time. This project will deliver a resource valuable for scientists, funders, and journals by promoting the validation of experimental methods and scientific conclusions, comparison with new data obtained by scientists around the world, and data re-use by developers of new analysis and processing tools. In particular, the IDR will provide an opportunity to test concepts and measure the true value of reproducibility in science. Finally, the IDR can serve as a model for how large, complex, multidimensional datasets can be shared with worldwide scientific communities.

Funded Value:

£1,788,151

Funded Period:

Jan 15 - Jun 16

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/M018423/1

Principal Investigator:

Jason Swedlow

Research Subject:

Info. & commun. Technol. (96%)

Research Topic:

Image & Vision Computing (12%)

Information & Knowledge Mgmt (84%)

Organisations

People	ORCID iD
Jason Swedlow (Principal Investigator)	http://orcid.org/0000-0002-2198-1958
Rafael Edgardo Carazo Salas (Co-Investigator)
Alvis Brazma (Co-Investigator)

Publications

Abbott S (2018) EMDB Web Resources. in Current protocols in bioinformatics

Antal B (2015) Mineotaur: a tool for high-content microscopy screen sharing and visual analytics. in Genome biology

Burel JM (2015) Publishing and sharing multi-dimensional image data with OMERO. in Mammalian genome : official journal of the International Mammalian Genome Society

Ellenberg J (2018) A call for public archives for biological image data. in Nature methods

Iudin A (2016) EMPIAR: a public archive for raw electron microscopy image data. in Nature methods

Jupp S (2016) The cellular microscopy phenotype ontology. in Journal of biomedical semantics

Kwakwa K (2016) easySTORM: a robust, lower-cost approach to localisation and TIRF microscopy. in Journal of biophotonics

Li S (2016) Metadata management for high content screening in OMERO. in Methods (San Diego, Calif.)

Patwardhan A (2016) Databases and Archiving for CryoEM. in Methods in enzymology

Key Findings
Impact Summary
Further Funding
Research Databases and Models
Collaboration
Software and Technical Products


Description	The Image Data Resource (IDR): To demonstrate the capability and utility of publishing complete scientific image data, we have built the open source and publicly available Image Data Repository (IDR), populated with community-submitted image datasets, experimental and analytical metadata, and phenotypic annotations linked to the original papers. This resource is deployed on EMBL-EBI's Embassy cloud at idr.openmicroscopy.org. IDR currently holds ~90 TB of image data in ~43 million images from >50 studies, and includes all associated experimental (e.g., genes, RNAi, chemistry, geographic location), analytic (e.g., submitter-calculated regions and features), and functional annotations. Wherever possible, metadata in IDR links to the external resources that are authoritative for that metadata (Ensembl, NCBI, PubChem, etc.). IDR includes datasets from human cells (e.g., http://goo.gl/1zoIIk), Drosophila (http://goo.gl/jPfM3j), and fungi (e.g., http://goo.gl/yFPQCw; http://goo.gl/n3ix5v). The full Mitocheck dataset (http://goo.gl/2FfBwd), a comprehensive chemical screen in human cells (http://goo.gl/BlFjQS), and a training dataset for deep learning applied to human cardiac biopsies (https://goo.gl/Dedsx8) are all included. Finally, imaging from Tara Oceans, a global survey of plankton and other marine organisms, is also included (http://goo.gl/2UWWnj). IDR contains imaging data from super-resolution, high content screening, timelapse imaging using conventional fluorescence and light-sheet microscopy, and histological whole slide imaging.
IDR Added Value: IDR provides browse and search functions and a virtual analysis environment, and allows download of full original image datasets. IDR holds datasets ranging from a few Mbytes to 20 Tbytes. Wherever possible, functional annotations (e.g., "increased peripheral actin") have been converted to defined terms in the EFO, CMPO or other ontologies, always in collaboration with the data submitters (e.g., goo.gl/mvKarG). >80% of the functional annotations have links to defined, published controlled vocabularies. IDR provides a unified interface that supports searches for genes (goo.gl/wivV3i), small molecules (goo.gl/ntQsbA) and phenotypes (goo.gl/Va8vnr). The integration of image-based phenotypes and calculated features makes IDR an attractive candidate for computational re-analysis. To ease access to IDR's TByte-scale datasets, we have connected IDR to a Jupyter notebook-based computational resource (idr.openmicroscopy.org/jupyter) that exposes IDR datasets via a web-based computational portal. We include exemplar notebooks that provide visualization of image features using PCA, access to images annotated with CMPO phenotypes, calculation of gene networks, and calculation of WND-CHARM features for individual images (github.com/IDR/idr-notebooks). We also maintain a public API for data re-analysis (idr.openmicroscopy.org/about/api.html). To allow re-use of IDR metadata locally, we have made all IDR databases, metadata and thumbnails available for download and have built Ansible scripts that automate the deployment of the IDR software stack (github.com/IDR/deployment). Anyone can leverage our work to build their own IDR and manage, integrate and publish their own imaging data. During 2018, IDR saw >100,000 hits/day from over 40,000 unique IP addresses, making it a heavily used, valuable public resource.
Exploitation Route	Data can be browsed and downloaded for further analysis. We have installed a JupyterHub-based computational resource coupled to the IDR data resource that enables computational reuse of the data. See https://idr-analysis.openmicroscopy.org. We have made the IDR application stack freely available using scripts published at https://github.com/idr/deployment.
Sectors	Digital/Communication/Information Technologies (including Software), Healthcare, Pharmaceuticals and Medical Biotechnology
URL	http://idr.openmicroscopy.org
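
The public API mentioned in the description can be reached over plain HTTP. The following is a minimal sketch, not the project's documented client: it only builds request URLs. The host comes from this record, but the two path patterns (`api/v0/m/projects/` and `mapr/api/<key>/`) and the `limit`, `offset` and `value` parameters are assumptions based on the usual OMERO JSON API and Mapr layouts and should be checked against idr.openmicroscopy.org/about/api.html before use. The helper names are hypothetical.

```python
from urllib.parse import urlencode, urljoin

# Host taken from the record above.
IDR_BASE = "https://idr.openmicroscopy.org/"


def projects_url(limit=10, offset=0):
    """Build a URL listing projects (studies).

    ASSUMPTION: the path and the limit/offset parameters follow the
    OMERO JSON API convention; they are not stated in this record.
    """
    query = urlencode({"limit": limit, "offset": offset})
    return urljoin(IDR_BASE, "api/v0/m/projects/") + "?" + query


def mapr_url(key, value):
    """Build a URL searching one annotation category (e.g. 'gene').

    ASSUMPTION: the 'mapr/api/<key>/?value=...' pattern is illustrative.
    """
    query = urlencode({"value": value})
    return urljoin(IDR_BASE, f"mapr/api/{key}/") + "?" + query


if __name__ == "__main__":
    print(projects_url(limit=5))
    print(mapr_url("gene", "CDC20"))
```

The resulting URLs can be passed to any HTTP client; building them separately keeps the sketch runnable without network access.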


Description	1. Google has downloaded all the data in the IDR (as of mid-2017). Typically, they have not reported back on any use, benefits or other outcomes. 2. Core Life Analytics (https://www.corelifeanalytics.com/) markets a High Content Screening analysis product called HCStratominer, and the company uses dynamic links to IDR datasets to demonstrate its software (e.g., http://edinburgh.eventful.com/events/workshop-data-analysis-following-high-content-/E0-001-111024466-2). 3. QuPath, an open source digital pathology analysis software package, uses links to data in IDR to demonstrate the use of its tools (https://www.youtube.com/watch?v=IzfYbQhJtkg). 4. The UKRI-funded BioImage Archive (https://www.ebi.ac.uk/bioimage-archive/) is ingesting data from IDR. 5. Springer Nature journals have named IDR a Recommended Repository for their authors. 6. As of this reporting date, >250 TBytes and >11.4 M multi-dimensional images have been published from >100 independent studies.
First Year Of Impact	2019
Sector	Digital/Communication/Information Technologies (including Software),Healthcare,Pharmaceuticals and Medical Biotechnology
Impact Types	Economic


Description	Next Generation Data Formats For 21st Century Biology
Amount	£3,265,180 (GBP)
Funding ID	313803/Z/24/Z
Organisation	Wellcome Trust
Sector	Charity/Non Profit
Country	United Kingdom
Start	11/2024
End	04/2028


Description	The Image Data Resource: Making Biological Imaging Data FAIR
Amount	£1,323,597 (GBP)
Funding ID	212962/Z/18/Z
Organisation	Wellcome Trust
Sector	Charity/Non Profit
Country	United Kingdom
Start	09/2018
End	10/2022


Title	Image Data Repository (IDR)
Description	A collection of image data and metadata, including all experimental, acquisition, and analytic metadata.
Type Of Material	Database/Collection of data
Year Produced	2015
Provided To Others?	Yes
Impact	The scripts used for importing datasets into the IDR form the basis of proposed standards for experimental and analytic metadata for image-based phenotypic studies. A proposal to fund the full development of these standards has been submitted.
URL	http://idr-demo.openmicroscopy.org


Title	McDole et al Dataset in IDR
Description	Addition of the KLB reader to Bio-Formats made it possible to publish the definitive fate map of the mouse embryo (Publication: https://doi.org/10.1016/j.cell.2018.09.031).
Type Of Material	Database/Collection of data
Year Produced	2018
Provided To Others?	Yes
Impact	These are the original data that underlie the publication by McDole et al. and demonstrate the definitive fate map of the mouse embryo.
URL	http://idr.openmicroscopy.org/webclient/?show=project-502


Description	IDR
Organisation	EMBL European Bioinformatics Institute (EMBL - EBI)
Country	United Kingdom
Sector	Academic/University
PI Contribution	We have built the OMERO and Bio-Formats technology that forms the basis of the IDR.
Collaborator Contribution	Alvis Brazma is a collaborator on our BBSRC IDR award (BB/M018423/1).
Impact	The IDR is the major current output. Publications are now in prep.
Start Year	2015


Title	IDR Infrastructure
Description	Scripts to build and deploy the IDR
Type Of Technology	Software
Year Produced	2016
Impact	These scripts make all IDR technology available to anyone, making it possible for anyone to build their own image publication system.
URL	https://github.com/IDR/infrastructure


Title	Mapr indexing and querying tool
Description	Mapr defines metadata categories that are "privileged", that is, they are likely to be key concepts for search queries (genes, antibodies, drugs, etc.). It works as a configuration of OMERO.web and is useful for turning open source OMERO into a custom domain-specific querying tool.
Type Of Technology	Software
Year Produced	2017
Open Source License?	Yes
Impact	Mapr is used heavily in IDR to provide a querying infrastructure.
URL	https://github.com/ome/omero-mapr
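
The idea of "privileged" metadata categories can be illustrated with a small index. This is a conceptual sketch only, not Mapr's implementation or configuration format: it assumes annotations arrive as `(image_id, key, value)` tuples, and the set `PRIVILEGED_KEYS` and both function names are hypothetical. Only annotations whose key is privileged are indexed, so searches are restricted to the categories expected to matter.

```python
from collections import defaultdict

# Hypothetical set of privileged categories; a real deployment would
# define its own in configuration.
PRIVILEGED_KEYS = {"gene", "antibody", "compound"}


def build_index(annotations):
    """Index image ids by privileged key and normalised value.

    annotations: iterable of (image_id, key, value) tuples.
    Keys outside PRIVILEGED_KEYS are ignored.
    """
    index = defaultdict(lambda: defaultdict(set))
    for image_id, key, value in annotations:
        norm_key = key.strip().lower()
        if norm_key in PRIVILEGED_KEYS:
            index[norm_key][value.strip().lower()].add(image_id)
    return index


def query(index, key, value):
    """Return sorted image ids annotated with key=value (case-insensitive)."""
    by_value = index.get(key.strip().lower(), {})
    return sorted(by_value.get(value.strip().lower(), set()))


if __name__ == "__main__":
    annotations = [
        (1, "Gene", "CDC20"),
        (2, "gene", "cdc20"),
        (3, "Comment", "CDC20"),  # not privileged, so not indexed
        (4, "Compound", "Nocodazole"),
    ]
    idx = build_index(annotations)
    print(query(idx, "gene", "CDC20"))
```

Normalising keys and values at index time keeps lookups case-insensitive without scanning every annotation at query time.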