Building a Next Generation Image Repository: Molecular Annotation and Cloud-based Data Processing and Analysis

Lead Research Organisation: University of Dundee
Department Name: School of Life Sciences

Abstract

Access to primary research data is vital for the advancement of the scientific enterprise. It facilitates the validation of existing observations and provides the raw materials to build on those observations. In the life sciences, there are numerous examples where members of a research community determined that a particular type of data would be useful and necessary to share. These include gene sequences, protein structural data, and gene and protein expression profiles. In these cases the community united to standardize the structure of the data and its associated metadata, and to create centralized repositories to facilitate deposition, promote discoverability, and ensure the longevity of the data.

Imaging in the life sciences has undergone a revolution in recent years and is now used as a quantitative assay technology throughout the life and biomedical sciences. Imaging is used to understand the behavior of organisms, the formation of embryos, the structure and dynamics of cells, and the function and interactions of molecules that are the building blocks of life. Imaging datasets are complex, heterogeneous, and often extremely large, so they are rarely shared or published.

Based on the recent development of several image data management technologies and the rapidly decreasing cost of large data storage facilities, we propose to create a resource to host, serve, and make available the original scientific image data that underpin life sciences research. Our proposal is based on open source technologies with proven utility and performance that already run on-line resources serving several terabytes (TBs) of image data. We propose to place this resource at EMBL-EBI, which is the established home of molecular and structural life sciences data, and to interface the resource with ELIXIR, Europe's research infrastructure for life science informatics. In particular, we will build links with established molecular and structural resources and work towards a seamless integration of these data, so that any scientist can easily browse, query and compute on genomic, structural and phenotypic data across several scales.

Technical Summary

We will construct the Image Data Repository (IDR) based on hardware infrastructure located at EMBL-EBI and integrated with its existing resources for hosting and delivering large datasets to the world's scientific community. These resources will serve as the storage and archive for IDR data. OME's Bio-Formats and OMERO will be used to read, manage, serve, and link the data to EMBL-EBI's molecular and structural resources. We will build custom user interfaces and workflows for the IDR to ensure that the datasets it holds are easy to access and browse. To enable computational re-analysis of the data, we will extend OMERO's distributed compute capacity and make use of EMBL-EBI's Embassy system to allow virtual access to IDR data. This virtual resource will provide a 'sandbox' for processing and reanalysis of data deposited in the IDR, and a working example of a next generation data repository that not only stores and manages data but also provides community services for scientific data.

Planned Impact

The resource has the potential to impact all branches of basic life sciences research. Datasets that have never previously been accessible will be available for the community to search, view, mine and even process and analyze. Rich visualization and annotation will make both interactive browsing and programmatic mining possible for the first time. This project will deliver a resource valuable for scientists, funders, and journals by promoting the validation of experimental methods and scientific conclusions, the comparison with new data obtained by scientists around the world, and the re-use of data by developers of new analysis and processing tools. In particular, the IDR will provide an opportunity to test concepts and measure the true value of reproducibility in science. Finally, the IDR can serve as a model for how large, complex, multidimensional datasets can be shared with worldwide scientific communities.

Publications

Abbott S (2018) EMDB Web Resources. in Current protocols in bioinformatics

Burel JM (2015) Publishing and sharing multi-dimensional image data with OMERO. in Mammalian genome : official journal of the International Mammalian Genome Society

Ellenberg J (2018) A call for public archives for biological image data. in Nature methods

Jupp S (2016) The cellular microscopy phenotype ontology. in Journal of biomedical semantics

Li S (2016) Metadata management for high content screening in OMERO. in Methods (San Diego, Calif.)

Patwardhan A (2017) Trends in the Electron Microscopy Data Bank (EMDB). in Acta crystallographica. Section D, Structural biology

 
Description The Image Data Resource (IDR)
To demonstrate the capability and utility of publishing complete scientific image data, we have built the open source and publicly available Image Data Repository (IDR), populated with community-submitted image datasets, experimental and analytical metadata, and phenotypic annotations linked to the original papers. This resource is deployed on EMBL-EBI's Embassy cloud at idr.openmicroscopy.org. IDR currently holds ~90 TB of image data in ~43 million images from >50 studies, and includes all associated experimental (e.g., genes, RNAi, chemistry, geographic location), analytic (e.g., submitter-calculated regions and features), and functional annotations. Wherever possible, metadata in IDR links to the external resources that are authoritative for that metadata (Ensembl, NCBI, PubChem, etc.). IDR includes datasets from human cells (e.g., http://goo.gl/1zoIIk), Drosophila (http://goo.gl/jPfM3j), and fungi (e.g., http://goo.gl/yFPQCw; http://goo.gl/n3ix5v). The full Mitocheck dataset (http://goo.gl/2FfBwd), a comprehensive chemical screen in human cells (http://goo.gl/BlFjQS) and a training dataset for deep learning applied to human cardiac biopsies (https://goo.gl/Dedsx8) are all included. Finally, imaging from Tara Oceans, a global survey of plankton and other marine organisms, is also included (http://goo.gl/2UWWnj). IDR contains imaging data from super-resolution microscopy, high content screening, timelapse imaging using conventional fluorescence and light-sheet microscopy, and histological whole slide imaging.

IDR Added Value
IDR provides browse and search functions and a virtual analysis environment, and allows download of full original image datasets. IDR holds datasets ranging from a few Mbytes to 20 Tbytes. Wherever possible, functional annotations (e.g., "increased peripheral actin") have been converted to defined terms in the EFO, CMPO or other ontologies, always in collaboration with the data submitters (e.g., goo.gl/mvKarG). >80% of the functional annotations have links to defined, published controlled vocabularies. IDR provides a unified interface that supports searches for genes (goo.gl/wivV3i), small molecules (goo.gl/ntQsbA) and phenotypes (goo.gl/Va8vnr).
The integration of image-based phenotypes and calculated features makes IDR an attractive candidate for computational re-analysis. To ease access to IDR's TByte-scale datasets, we have connected IDR to a Jupyter notebook-based computational resource (idr.openmicroscopy.org/jupyter) that exposes IDR datasets via a web-based computational portal. We include exemplar notebooks that provide visualization of image features using PCA, access to images annotated with CMPO phenotypes, calculation of gene networks, and calculation of WND-CHARM features for individual images (github.com/IDR/idr-notebooks). We also maintain a public API for data re-analysis (idr.openmicroscopy.org/about/api.html). To allow re-use of IDR metadata locally, we have made all IDR databases, metadata and thumbnails available for download and have built Ansible scripts that automate the deployment of the IDR software stack (github.com/IDR/deployment). Anyone can leverage our work to build their own IDR and manage, integrate and publish their own imaging data.
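As a rough illustration of programmatic access through the public API mentioned above, the following Python sketch builds request URLs and reads a decoded JSON response. It is a minimal sketch under stated assumptions, not a description of the IDR team's own client code: it assumes that IDR serves the OMERO JSON API under api/v0/m/ and the Mapr API under mapr/api/, and that list responses carry a top-level "data" array whose items have a "Name" field. The helper names (projects_url, mapr_search_url, project_names) are hypothetical.

```python
from urllib.parse import urlencode, urljoin

# Assumption: the OMERO JSON API and the Mapr API are both served from this base URL.
IDR_BASE = "https://idr.openmicroscopy.org/"


def projects_url(limit=10, offset=0, base=IDR_BASE):
    """Build a paginated URL for listing projects (hypothetical helper)."""
    query = urlencode({"limit": limit, "offset": offset})
    return urljoin(base, "api/v0/m/projects/") + "?" + query


def mapr_search_url(category, value, base=IDR_BASE):
    """Build a Mapr lookup URL for a category such as 'gene' (hypothetical helper)."""
    query = urlencode({"value": value, "case_sensitive": "false"})
    return urljoin(base, f"mapr/api/{category}/") + "?" + query


def project_names(payload):
    """Return project names from an already-decoded JSON response.

    Assumes a top-level "data" list whose items each have a "Name" field.
    """
    return [item["Name"] for item in payload.get("data", [])]
```

A caller would typically fetch one of these URLs with an HTTP client, decode the JSON body, and pass the result to project_names; the network call is left out here so that the sketch stays self-contained.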

During 2018, IDR saw >100,000 hits/day from over 40,000 unique IP addresses, making it a heavily used, valuable public resource.
Exploitation Route Data can be browsed and downloaded for further analysis. We have installed a JupyterHub-based computational resource coupled to the IDR data resource that enables computational reuse of the data. See https://idr-analysis.openmicroscopy.org.

We have made the IDR application stack freely available using scripts published at https://github.com/idr/deployment.
Sectors Digital/Communication/Information Technologies (including Software),Healthcare,Pharmaceuticals and Medical Biotechnology

URL http://idr.openmicroscopy.org
 
Description 1. Google has downloaded all the data in the IDR (as of mid-2017). Typically, they have not reported back on any use, benefits or other outcomes.
2. Core Life Analytics (https://www.corelifeanalytics.com/) markets a High Content Screening analysis product called HCStratominer, and the company uses dynamic links to IDR datasets to demonstrate its software (e.g., http://edinburgh.eventful.com/events/workshop-data-analysis-following-high-content-/E0-001-111024466-2).
3. QuPath, an open source digital pathology analysis software package, uses links to data in IDR to demonstrate the use of its tools (https://www.youtube.com/watch?v=IzfYbQhJtkg).
4. The UKRI-funded BioImage Archive (https://www.ebi.ac.uk/bioimage-archive/) is ingesting data from IDR.
5. Springer Nature journals have named IDR a Recommended Repository for their authors.
6. As of this reporting date, >250 TBytes and >11.4 M multi-dimensional images have been published from >100 independent studies.
First Year Of Impact 2019
Sector Digital/Communication/Information Technologies (including Software),Healthcare,Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Description The Image Data Resource: Making Biological Imaging Data FAIR
Amount £1,323,597 (GBP)
Funding ID 212962/Z/18/Z 
Organisation Wellcome Trust 
Sector Charity/Non Profit
Country United Kingdom
Start 10/2018 
End 10/2021
 
Title Image Data Repository (IDR) 
Description A collection of image data and metadata, including all experimental, acquisition, and analytic metadata. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact The scripts used for importing datasets into the IDR form the basis of proposed standards for experimental and analytic metadata in image-based phenotypic studies. A proposal to fund the full development of these standards has been submitted. 
URL http://idr-demo.openmicroscopy.org
 
Title McDole et al Dataset in IDR 
Description Addition of the KLB reader to Bio-Formats made it possible to publish the definitive fate map of the mouse embryo (Publication: https://doi.org/10.1016/j.cell.2018.09.031) 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact These are the original data that underlie the publication by McDole et al. and demonstrate the definitive fate map of the mouse embryo. 
URL http://idr.openmicroscopy.org/webclient/?show=project-502
 
Description IDR 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution We have built the OMERO and Bio-Formats technology that forms the basis of the IDR.
Collaborator Contribution Alvis Brazma is a collaborator on our BBSRC IDR award (BB/M018423/1).
Impact The IDR is the major current output. Publications are now in prep.
Start Year 2015
 
Title IDR Infrastructure 
Description Scripts to build and deploy the IDR 
Type Of Technology Software 
Year Produced 2016 
Impact These scripts make all IDR technology available to anyone, making it possible for anyone to build their own image publication system. 
URL https://github.com/IDR/infrastructure
 
Title Mapr indexing and querying tool 
Description Mapr defines metadata categories that are "privileged", that is, they are likely to be key concepts for search queries (genes, antibodies, drugs, etc.). It works as a configuration of OMERO.web and is useful for turning open source OMERO into a custom domain-specific querying tool. 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact Mapr is used heavily in IDR to provide a querying infrastructure. 
URL https://github.com/ome/omero-mapr
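
The idea of "privileged" metadata categories described in the Mapr record can be illustrated with a small Python sketch. This is a conceptual toy under stated assumptions, not Mapr's actual implementation or configuration format: the set of privileged keys, the in-memory annotation layout, and the function names build_index and search are all hypothetical.

```python
from collections import defaultdict

# Hypothetical privileged categories; a real deployment defines its own in configuration.
PRIVILEGED_KEYS = {"gene", "antibody", "compound"}


def build_index(annotations, privileged=PRIVILEGED_KEYS):
    """Index image ids by (key, value), keeping only privileged keys.

    `annotations` maps an image id to a list of (key, value) pairs.
    Keys and values are lowercased so that lookups are case-insensitive.
    """
    index = defaultdict(set)
    for image_id, pairs in annotations.items():
        for key, value in pairs:
            normalized_key = key.lower()
            if normalized_key in privileged:
                index[(normalized_key, value.lower())].add(image_id)
    # Convert sets to sorted lists for stable, predictable results.
    return {entry: sorted(ids) for entry, ids in index.items()}


def search(index, key, value):
    """Return the image ids annotated with the given privileged key and value."""
    return index.get((key.lower(), value.lower()), [])
```

For example, with annotations such as {1: [("Gene", "CDC20"), ("Comment", "blurry")], 2: [("gene", "cdc20")]}, a search for gene "CDC20" finds both images, while the non-privileged "Comment" key is never indexed and so cannot be queried.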