BioStudies and the Image Data Resource: Expanding Imaging Datasets, Linkage, Metadata, and Value

Lead Research Organisation: University of Dundee
Department Name: School of Life Sciences

Abstract

Access to primary research data is vital for the advancement of the scientific enterprise. It facilitates the validation of existing observations and provides the raw materials to build on those observations. In the life sciences, research communities have repeatedly collaborated to build resources that allow public submission and access to particular types of datasets. These include gene sequences, protein structural data, and gene and protein expression profiles. In these cases the community united to standardize the structure of the data and its associated metadata, and to create centralized repositories to facilitate deposition, promote discoverability, and ensure the longevity of the data.

Much of the published research in the life sciences carries with it detailed image data. These images are routinely used for quantitative measures of biological processes and structures that form the foundation of many of the results published in peer-reviewed life sciences journals. In almost all cases, however, images are presented in published articles in processed, compressed formats that do not accurately convey the quality and complexity of the original image data. Those original data are housed in thousands of individual labs in hundreds of different file formats. The data are difficult for researchers to share and in practice impossible to publish.

The sheer size and complexity of image data sets, and even of individual multi-dimensional images, make data submission, handling and publication extremely complex. An image-based genome-wide "high content" screen (HCS) may have over a million images, and new "virtual slide" and "light sheet" tissue imaging technologies generate individual images that contain gigapixels of data showing tissues or whole organisms at subcellular resolution. Many of these datasets, acquired on the latest generation of imaging systems, are valuable resources that contain so much data that their full value can only be achieved if a large community is given the opportunity to view, analyse and re-analyse the data, sometimes in combination with other datasets.

Just as genomics and structural biology have already done, the imaging community must address the challenges posed by multidimensional image datasets so that scientists, educators, students and the wider public can find, share, and validate the data that underlie published scientific results.

This proposal connects a growing community resource called BioStudies with a previous BBSRC-funded project that developed a public Image Data Repository to deliver the next step in public resources for scientific images.

Technical Summary

Much of the published research in the life sciences includes multidimensional, quantitative image data. These images are routinely used for quantitative measures of biological processes and structures that form the foundation of many of the results published in peer-reviewed life sciences journals. In almost all cases, however, images are presented in published articles in processed, compressed formats that do not accurately convey the quality and complexity of the original image data. The sheer size and heterogeneity of image data sets (multi-dimensional image stacks combined with experimental metadata and analytic results) make image data handling and publication extremely complex and, in practice, rarely achieved.

In this project we aim to build the submission pipeline for deposition of reference imaging data in BioStudies and then into IDR, growing the datasets that are publicly available in both resources. Alongside the pipeline, we will update the data submission templates and build metadata validators for use by submitters. This will ensure correct metadata submission and reduce the time IDR staff spend curating submitted studies. We will also extend the value of the stored data by adding links to several valuable resources and extending the metadata that the IDR holds.
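
As a rough illustration of what a submitter-side metadata validator might do, the following minimal sketch checks a study record for required fields before submission. The field names and the function are hypothetical and chosen only for illustration; the actual BioStudies and IDR templates define their own fields and rules.

```python
# Hypothetical required fields, assumed for illustration only;
# the real submission templates define their own field names.
REQUIRED_FIELDS = ("Study Title", "Study Organism", "Imaging Method")


def validate_study_metadata(record):
    """Return a list of problems found in a study metadata record.

    An empty list means every required field is present and non-blank.
    """
    problems = []
    for name in REQUIRED_FIELDS:
        value = record.get(name, "")
        # Treat missing keys and whitespace-only values the same way.
        if not str(value).strip():
            problems.append(f"missing required field: {name}")
    return problems


if __name__ == "__main__":
    draft = {"Study Title": "Example screen", "Study Organism": ""}
    for problem in validate_study_metadata(draft):
        print(problem)
```

Returning the full list of problems, rather than stopping at the first one, lets a submitter fix a template in a single pass, which is the kind of curation time saving described above.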

Planned Impact

There are several forms of impact from this project. The first will derive from the imaging datasets we make available in BioStudies and IDR. These datasets can be accessed through the interactive interfaces presented by the two resources, and thus meet two recent requirements for scientific data: that datasets be findable and accessible. Those reference datasets that are included in IDR will further be integrated with other datasets through curation and normalisation, thus starting to make them interoperable, and will be available via the IDR Jupyter resource and downloadable via Aspera, so they are reusable.

One of the aims of BioStudies is to catalyze the development of data standards in the life sciences: data can be initially described using the lightweight structures offered by BioStudies, and then tighter requirements can be defined in an incremental fashion. The proposed project will serve as a proof of concept of this process.

In addition, this project will help support the emerging movement to make the publication of imaging data routine and, possibly in the future, mandatory for scientific publications. Currently, journals, funders and community scientists are debating this issue. We hope to energise this debate and provide both technical solutions and scientific examples and rationales for publishing imaging data routinely. This potential impact is demonstrated by the letters of support from several leading journals.

Finally, the datasets are all available for download from BioStudies or IDR, providing resources for the development of new tools of image processing and analysis. Moreover, from the IDR, the application stacks and the metadata databases are all available, which allows others to download and re-use IDR data and systems, and integrate their datasets and analytics.

Publications

Besson S (2019) Bringing Open Data to Whole Slide Imaging. in Digital Pathology : 15th European congress, ECDP 2019, Warwick, UK, April 10-13, 2019 : proceedings. European Congress on Digital Pathology (15th : 2019 : Warwick, England)

Hartley M (2023) Providing open imaging data at scale: An EMBL-EBI perspective. in Histochemistry and Cell Biology

Hartley M (2022) The BioImage Archive - Building a Home for Life-Sciences Microscopy Data. in Journal of molecular biology

Ellenberg J (2018) A call for public archives for biological image data. in Nature methods

 
Description We have grown and solidified the Image Data Resource, one of the world's largest bioimage data resources, which now holds >300 TByte of data from >100 independent studies.
Exploitation Route The public datasets are proving to be a valuable foundation for several research projects, for teaching and training, and for the development of new imaging data analysis tools.
Sectors Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology

URL https://idr.openmicroscopy.org
 
Description IDR is now a flagship resource and is spurring the development of several national scale resources. These are emerging in several EU countries, the USA and Australia.
First Year Of Impact 2020
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Policy & public services

 
Description Connecting and Expanding RIKEN's Systems Science of Biological Dynamics and OME's Image Data Resource Databases
Amount £2,980 (GBP)
Funding ID BB/S013032/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 10/2018 
End 11/2018
 
Title Image Data Repository (IDR) 
Description A collection of image data and metadata, including all experimental, acquisition, and analytic metadata. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact The scripts used for importing datasets into the IDR form the basis of proposed standards for experimental and analytic metadata in image-based phenotypic studies. A proposal to fund the full development of these standards has been submitted. 
URL http://idr-demo.openmicroscopy.org
 
Title McDole et al Dataset in IDR 
Description Addition of the KLB reader to Bio-Formats made it possible to publish the definitive fate map of the mouse embryo (Publication: https://doi.org/10.1016/j.cell.2018.09.031) 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact These are the original data that underlie the publication by McDole et al and demonstrate the definitive fate map of the mouse embryo. 
URL http://idr.openmicroscopy.org/webclient/?show=project-502
 
Description Euro-BioImaging 
Organisation Euro-BioImaging
Country European Union (EU) 
Sector Public 
PI Contribution BioImagingUK connects with Euro-BioImaging to provide feedback and updates on the status and priorities of the UK imaging community.
Collaborator Contribution Euro-BioImaging
Impact Ongoing work during Euro-BioImaging Interim Phase
Start Year 2009
 
Description IDR 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution We have built the OMERO and Bio-Formats technology that forms the basis of the IDR.
Collaborator Contribution Alvis Brazma is a collaborator on our BBSRC IDR award (BB/M018423/1).
Impact The IDR is the major current output. Publications are now in prep.
Start Year 2015
 
Title Bio-Formats 6.0 
Description Bio-Formats 6.0.0 is a major update that includes support for the updated OME-TIFF file format, which now supports multi-resolution tiled images (so-called pyramidal file format). For more info, see http://blog.openmicroscopy.org/file-formats/community/2018/11/29/ometiffpyramid/. This new version of Bio-Formats also includes support for the KLB format for light sheet microscopy.

Bio-Formats API changes:
Java 8 is now the minimum supported version.
Sub-resolution reading: added MetadataList and CoreMetadataList classes; added a new SubResolutionFormatReader abstract class for handling pyramidal format readers; updated all pyramid format readers to use SubResolutionFormatReader; deprecated getCoreMetadataList, seriesToCoreIndex, coreIndexToSeries, getCoreIndex and setCoreIndex in IFormatWriter; added a new IPyramidHandler interface with the resolution getter methods.
Sub-resolution writing changes: IFormatWriter now extends IPyramidHandler (breaking); added setResolutions and getResolutions methods to IFormatWriter (breaking); added examples of using the sub-resolution writing API.
Tiled writing API changes: updated IFormatWriter to use setTileSizeX(0) and setTileSizeY(0) as a way to disable tiling (breaking); updated FormatWriter to set 0 as the default values of getTileSizeX() and getTileSizeY (breaking); IFormatWriter.getCompressionTypes now returns the types for the selected writer only.
Metadata handling: added getter methods to MetadataTools for retrieving OME enumerations by value; deprecated OME enumeration getter methods in FormatReader.
Refactor FilePatternReader logic in a new WrappedReader abstract class.

New file formats:
KLB: added a new reader for Keller Lab Block (KLB) files.
CV7000: added a new reader for Yokogawa CV7000 datasets.
GE MicroCT: added a new reader for GE MicroCT datasets.

File format fixes and improvements:
Aperio SVS/AFI: removed pyramidal resolutions of mismatching pixel types; fixed exposure times, improved image naming of AFI datasets; displayed original metadata keys for each channel of AFI datasets; added support for multiple Z sections.
DICOM: improved file grouping and file-to-series mapping for multi-file datasets.
Fake: added support for multi-resolution test images; now populating WellSample positions when present using Plane data.
Gatan Digital Micrograph: adjusted endianness and record byte count for long values; allowed ROIs to be stored in DocumentObjectList groups; no longer creating an empty ROI when an unsupported shape type is encountered.
Image Pro: added support for Image Pro Plus .ips set.
GE InCell: added support for parsing minimum and maximum pixel values.
Lambert Instruments FLIM: fixed an integer overflow error with large files (thanks to Rolf Harkes).
Leica LIF: unified metadata parsing to use DataTools.parseDouble.
Leica SCN: improved support for Versa datasets.
Micro-Manager: improved handling of very large metadata.txt files; prevented NumberFormatException for invalid double values; add support for parsing ChannelColor from metadata.txt files.
Metamorph: added support for multi-dimensional .scan dataset created from Scan Slide (thanks to Jeremy Muhlich).
MRC (Medical Research Council): fixed endian detection for old-style headers.
Nikon ND2: prevented integer overflow when reading chunkmaps from files larger than 2GB; fixed handling of duplicate and incomplete exposure time lists; fixed chunk map handling when CustomData blocks are between ImageDataSeqs.
OME-TIFF: added support for reading OME-TIFF with pyramidal resolutions stored as SubIFDs; added support for writing OME-TIFF with pyramidal resolutions; added support for companion OME-TIFF filesets where TIFF does not link back to the metadata file; improved handling of missing planes in TiffData.
PerkinElmer Operetta: improved support to handle datasets generated by the Harmony software.
TIFF: split IFDs into separate series if the dimensions or pixel type mismatch; restricted use case for legacy TIFF JAI reader; fixed a bug with FillOrder which resulted in 0 pixel values.
Zeiss CZI: reduced duplicate original metadata when reading a pyramid file.
Zeiss TIFF: added support for AVI files acquired with Keyence software.
Zeiss ZVI: reuse stream for sequential calls to openBytes on the same plane.
Updated all pyramidal format readers to consume SubResolutionReader.
Updated all readers to consume MetadataTools getter to retrieve enumerations.
Reviewed all readers and plugins to close open instances of RandomAccessInputStream.
Fixed some deprecation warnings in a number of readers.
For RGB images using ChannelSeparator all channel metadata is now copied instead of just names.

ImageJ plugin improvements: updated the updater message in the Fiji plugin (thanks to Jan Eglinger); disabled LUT writing for any plane that has a default grayscale lookup table; added macro option to always skip LUT writing.

MATLAB toolbox improvements: improved performance of bfGetPlane by removing an unnecessary data copy (thanks to Cris Luengo).

Command-line tools improvements:
bfconvert utility: added -no-flat option to the command-line tools to convert files with sub-resolutions; added -pyramid-scale and -pyramid-resolutions options to generate sub-resolutions during conversion; removed Plate elements when -series is passed as an option; extended usage to describe available formats, extensions and compressions.
xmlvalid utility: added new validate methods to loci.formats.tools.XMLValidate returning the validation status; added a return code to xmlvalid.

Component changes: ome-common was upgraded to 6.0.0; ome-codecs was upgraded to 0.2.3; ome-model was upgraded to 6.0.0.

Automated test changes: added testng.allow-missing property allowing to skip unconfigured filesets; added testUnflattenedSaneOMEXML to compare series count to OME-XML images count when resolution flattening is disabled; added test-equivalent target to compare pixel data between two files; added support for storing resolution index and resolution count in the configuration files used for automated testing; tests now fail when a configured file throws UnknownFormatException.

Documentation improvements: fixed the xmlvalid documentation page (thanks to Kouichi C. Nakamura); improved the memory section of the MATLAB documentation page (thanks to Kouichi C. Nakamura); extended IFormatReader Javadocs to reflect the reader guide; added reference to current Adobe TIFF specification; switched to image.sc as the reference location for public feedback.

Full details can be found at: https://docs.openmicroscopy.org/bio-formats/6.0.0/about/whats-new.html The software is available at: https://www.openmicroscopy.org/bio-formats/downloads/ 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact This version of Bio-Formats contains three major types of improvements: 1. Full reader/writer/specification support for an updated OME-TIFF that supports multi-dimensional, multi-resolution tiled ("pyramidal") image files as used in imaging of large blocks of tissue in research and in clinical applications. This is the first open source, fully open, fully implemented file format for whole slide imaging and other tissue imaging applications. 2. Support for the Keller Lab Block (KLB) image file format, a format used by several labs performing light sheet microscopy of large biological specimens. 3. Not yet released, but coming soon, is support for the BigDataViewer format, another commonly used light sheet microscopy format. 
URL https://docs.openmicroscopy.org/bio-formats/6.0.0/about/whats-new.html
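
To make the idea of a multi-resolution ("pyramidal") image concrete, the sketch below computes the pixel dimensions of each level of a pyramid from a full-resolution size, a downsampling scale, and a number of resolutions, which are the quantities that the bfconvert options -pyramid-scale and -pyramid-resolutions mentioned above control. The function name is hypothetical, and the use of integer (floor) division for rounding is an assumption made for illustration; Bio-Formats may compute level sizes differently.

```python
def pyramid_levels(width, height, scale=2, resolutions=4):
    """Return (width, height) for each pyramid level, full resolution first.

    Each level is downsampled from the original by scale ** level.
    Floor division is an assumption here, not confirmed Bio-Formats behaviour.
    """
    if scale < 2:
        raise ValueError("scale must be at least 2")
    if resolutions < 1:
        raise ValueError("resolutions must be at least 1")
    levels = []
    for level in range(resolutions):
        factor = scale ** level
        # Clamp to 1 so that very small images never reach a zero dimension.
        levels.append((max(1, width // factor), max(1, height // factor)))
    return levels


if __name__ == "__main__":
    for level, (w, h) in enumerate(pyramid_levels(4096, 2048, scale=2, resolutions=3)):
        print(f"level {level}: {w} x {h}")
```

A viewer can then fetch only the tiles of the level that matches the current zoom, which is what makes gigapixel whole slide images practical to browse.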