BioStudies and the Image Data Resource: Expanding Imaging Datasets, Linkage, Metadata, and Value

Lead Research Organisation: University of Dundee

Department Name: School of Life Sciences

Abstract

Access to primary research data is vital for the advancement of the scientific enterprise. It facilitates the validation of existing observations and provides the raw materials to build on those observations. In the life sciences, research communities have repeatedly collaborated to build resources that allow public submission and access to particular types of datasets. These include gene sequences, protein structural data, and gene and protein expression profiles. In these cases the community united to standardize the structure of the data and its associated metadata, and to create centralized repositories to facilitate deposition, promote discoverability, and ensure the longevity of the data.

Much of the published research in the life sciences carries with it detailed image data. These images are routinely used for quantitative measures of biological processes and structures that form the foundation of many of the results published in peer-reviewed life sciences journals. In almost all cases, however, images are presented in published articles in processed, compressed formats that do not accurately convey the quality and complexity of the original image data. Those original data are housed in thousands of individual labs in hundreds of different file formats. The data are difficult for researchers to share and in practice impossible to publish.

The sheer size and complexity of image data sets, and even of individual multi-dimensional images, make data submission, handling and publication extremely complex. An image-based genome-wide "high content" screen (HCS) may have over a million images, and new "virtual slide" and "light sheet" tissue imaging technologies generate individual images that contain gigapixels of data showing tissues or whole organisms at subcellular resolution. Many of these datasets, acquired on the latest generation of imaging systems, are valuable resources that contain so much data that their full value can only be realised if a large community is given the opportunity to view, analyse and re-analyse the data, sometimes in combination with other datasets.

Just as genomics and structural biology have already done, the imaging community must address the challenges posed by multidimensional image datasets so that scientists, educators, students and the wider public can find, share, and validate the data that underlie published scientific results.

This proposal connects a growing community resource called BioStudies with a previous BBSRC-funded project that developed a public Image Data Repository to deliver the next step in public resources for scientific images.

Technical Summary

Much of the published research in the life sciences includes multidimensional, quantitative image data. These images are routinely used for quantitative measures of biological processes and structures that form the foundation of many of the results published in peer-reviewed life sciences journals. In almost all cases, however, images are presented in published articles in processed, compressed formats that do not accurately convey the quality and complexity of the original image data. The sheer size and heterogeneity of image data sets (multi-dimensional image stacks combined with experimental metadata and analytic results) make image data handling and publication extremely complex and, in practice, rarely achieved.

In this project we aim to build the submission pipeline for deposition of reference imaging data in BioStudies and then into IDR. This will grow the datasets that are publicly available in both BioStudies and IDR. Alongside the pipeline, we will update the data submission templates and build metadata validators for use by submitters. This will ensure correct metadata submission and reduce the time IDR staff spend curating submitted studies. We will also extend the value of the stored data by adding links to several valuable resources and extending the metadata the IDR holds.
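
As a rough illustration of what a submitter-side metadata validator could look like, the sketch below checks a submission record for required fields before it is sent for curation. The field names and the DOI check are hypothetical, chosen only for illustration; the actual BioStudies and IDR submission templates are not described in this record.

```python
from typing import Dict, List

# Hypothetical required fields; the real BioStudies/IDR templates may differ.
REQUIRED_FIELDS = ("study_title", "organism", "imaging_method", "publication_doi")


def validate_submission(record: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        # A field counts as present only if it is a non-blank string.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {name}")
    doi = record.get("publication_doi")
    # Illustrative sanity check: DOIs begin with the "10." directory prefix.
    if isinstance(doi, str) and doi.strip() and not doi.strip().startswith("10."):
        problems.append("publication_doi should start with '10.'")
    return problems
```

Reporting every problem at once, rather than stopping at the first, lets a submitter fix a template in a single pass, which is one way such a validator could reduce curation time.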

Planned Impact

There are several forms of impact from this project. The first will derive from the imaging datasets we make available in BioStudies and IDR. These datasets can be accessed through the interactive interfaces presented by the two resources, and thus meet two recent requirements for scientific data: that datasets be findable and accessible. Those reference datasets that are included in IDR will further be integrated with other datasets through curation and normalisation, thus starting to make them interoperable, and will be available via the IDR Jupyter resource and downloadable by Aspera, so they are reusable.

One of the aims of BioStudies is to catalyze the development of data standards in life sciences: data can be initially described using the lightweight structures offered by BioStudies, and then tighter requirements can be defined in an incremental fashion. The proposed project will serve as a proof of concept of this process.

In addition, this project will help support the emerging movement to make the publication of imaging data routine and, possibly in the future, mandatory for scientific publications. Currently, journals, funders and community scientists are debating this issue. We hope to energise this debate and provide both technical solutions and scientific examples and rationales for publishing imaging data routinely. This potential impact is demonstrated by the letters of support from several leading journals.

Finally, the datasets are all available for download from BioStudies or IDR, providing resources for the development of new tools of image processing and analysis. Moreover, from the IDR, the application stacks and the metadata databases are all available, which allows others to download and re-use IDR data and systems, and integrate their datasets and analytics.

Funded Value:

£583,944

Funded Period:

Jun 18 - Jun 21

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/R015384/1

Principal Investigator:

Jason Swedlow

Research Subject:

Tools, technologies & methods (96%)

Research Topic:

Bioinformatics (24%)

Tools for the biosciences (48%)

eScience (24%)

Organisations

People	ORCID iD
Jason Swedlow (Principal Investigator)	http://orcid.org/0000-0002-2198-1958

Publications

Besson S (2019) Bringing Open Data to Whole Slide Imaging. in Digital Pathology : 15th European congress, ECDP 2019, Warwick, UK, April 10-13, 2019 : proceedings. European Congress on Digital Pathology (15th : 2019 : Warwick, England)

Ellenberg J (2018) A call for public archives for biological image data. in Nature methods

Hartley M (2021) The BioImage Archive - building a home for life-sciences microscopy data

Hartley M (2023) Providing open imaging data at scale: An EMBL-EBI perspective. in Histochemistry and cell biology

Hartley M (2022) The BioImage Archive - Building a Home for Life-Sciences Microscopy Data. in Journal of molecular biology

Iudin A (2023) Volume Electron Microscopy

Kunis S (2021) MDEmic: a metadata annotation tool to facilitate management of FAIR image data in the bioimaging community. in Nature methods

Moore J (2023) OME-Zarr: a cloud-optimized bioimaging file format with international community support. in Histochemistry and cell biology

Moore J (2023) OME-Zarr: a cloud-optimized bioimaging file format with international community support. in bioRxiv : the preprint server for biology

Moore J (2021) OME-NGFF: a next-generation file format for expanding bioimaging data-access strategies. in Nature methods

Key Findings
Impact Summary
Further Funding
Research Databases and Models
Collaboration
Software and Technical Products


Description	We have grown and solidified the Image Data Resource, one of the world's largest bioimage data resources, which now holds >300 TB of data from >100 independent studies.
Exploitation Route	The public datasets are proving to be a valuable foundation for several research projects, for teaching and training, and for the development of new imaging data analysis tools.
Sectors	Digital/Communication/Information Technologies (including Software); Pharmaceuticals and Medical Biotechnology
URL	https://idr.openmicroscopy.org


Description	IDR is now a flagship resource and is spurring the development of several national-scale resources. These are emerging in several EU countries, the USA, Japan and Australia. Moreover, IDR stands as a public demonstrator of the possibility and value of public bioimaging data. Data from IDR have been downloaded by scientists in the USA, Latin America, Australia and elsewhere as foundations for downstream computational and AI research.
First Year Of Impact	2020
Sector	Digital/Communication/Information Technologies (including Software)
Impact Types	Policy & public services


Description	Connecting and Expanding RIKEN's Systems Science of Biological Dynamics and OME's Image Data Resource Databases
Amount	£2,980 (GBP)
Funding ID	BB/S013032/1
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	09/2018
End	11/2018


Description	Next Generation Data Formats For 21st Century Biology
Amount	£3,265,180 (GBP)
Funding ID	313803/Z/24/Z
Organisation	Wellcome Trust
Sector	Charity/Non Profit
Country	United Kingdom
Start	11/2024
End	04/2028


Title	Image Data Repository (IDR)
Description	A collection of image data and metadata, including all experimental, acquisition, and analytic metadata.
Type Of Material	Database/Collection of data
Year Produced	2015
Provided To Others?	Yes
Impact	The scripts used for importing datasets into the IDR form the basis of proposed standards for experimental and analytic metadata in image-based phenotypic studies. A proposal to fund the full development of these standards has been submitted.
URL	http://idr-demo.openmicroscopy.org


Title	McDole et al Dataset in IDR
Description	Addition of the KLB reader to Bio-Formats made it possible to publish the definitive fate map of the mouse embryo (Publication: https://doi.org/10.1016/j.cell.2018.09.031)
Type Of Material	Database/Collection of data
Year Produced	2018
Provided To Others?	Yes
Impact	These are the original data that underlie the publication by McDole et al. and demonstrate the definitive fate map of the mouse embryo.
URL	http://idr.openmicroscopy.org/webclient/?show=project-502


Description	Euro-BioImaging
Organisation	Euro-BioImaging
Country	European Union (EU)
Sector	Public
PI Contribution	BioImagingUK connects with Euro-BioImaging to provide feedback and updates on the status and priorities of the UK imaging community.
Collaborator Contribution	Euro-BioImaging
Impact	Ongoing work during Euro-BioImaging Interim Phase
Start Year	2009


Description	IDR
Organisation	EMBL European Bioinformatics Institute (EMBL - EBI)
Country	United Kingdom
Sector	Academic/University
PI Contribution	We have built the OMERO and Bio-Formats technology that forms the basis of the IDR.
Collaborator Contribution	Alvis Brazma is a collaborator on our BBSRC IDR award (BB/M018423/1).
Impact	The IDR is the major current output. Publications are now in prep.
Start Year	2015


Title	Bio-Formats 6.0
Description	Bio-Formats 6.0.0 is a major update that includes support for the updated OME-TIFF file format, which now supports multi-resolution tiled images (so-called pyramidal file format). For more info, see http://blog.openmicroscopy.org/file-formats/community/2018/11/29/ometiffpyramid/. This new version of Bio-Formats also includes support for the KLB format for light sheet microscopy.
Bio-Formats API changes:
- Java 8 is now the minimum supported version
- Sub-resolution reading:
  - added MetadataList and CoreMetadataList classes
  - added a new SubResolutionFormatReader abstract class for handling pyramidal format readers
  - updated all pyramid format readers to use SubResolutionFormatReader
  - deprecated getCoreMetadataList, seriesToCoreIndex, coreIndexToSeries, getCoreIndex and setCoreIndex in IFormatWriter
  - added a new IPyramidHandler interface with the resolution getter methods
- Sub-resolution writing changes:
  - IFormatWriter now extends IPyramidHandler (breaking)
  - added setResolutions and getResolutions methods to IFormatWriter (breaking)
  - added examples of using the sub-resolution writing API
- Tiled writing API changes:
  - updated IFormatWriter to use setTileSizeX(0) and setTileSizeY(0) as a way to disable tiling (breaking)
  - updated FormatWriter to set 0 as the default values of getTileSizeX() and getTileSizeY (breaking)
- IFormatWriter.getCompressionTypes now returns the types for the selected writer only
- Metadata handling:
  - added getter methods to MetadataTools for retrieving OME enumerations by value
  - deprecated OME enumeration getter methods in FormatReader
- Refactor FilePatternReader logic in a new WrappedReader abstract class
New file formats:
- KLB: added a new reader for Keller Lab Block (KLB) files
- CV7000: added a new reader for Yokogawa CV7000 datasets
- GE MicroCT: added a new reader for GE MicroCT datasets
File format fixes and improvements:
- Aperio SVS/AFI:
  - removed pyramidal resolutions of mismatching pixel types
  - fixed exposure times, improved image naming of AFI datasets
  - displayed original metadata keys for each channel of AFI datasets
  - added support for multiple Z sections
- DICOM: improved file grouping and file-to-series mapping for multi-file datasets
- Fake:
  - added support for multi-resolution test images
  - now populating WellSample positions when present using Plane data
- Gatan Digital Micrograph:
  - adjusted endianness and record byte count for long values
  - allowed ROIs to be stored in DocumentObjectList groups
  - no longer creating an empty ROI when an unsupported shape type is encountered
- Image Pro: added support for Image Pro Plus .ips set
- GE InCell: added support for parsing minimum and maximum pixel values
- Lambert Instruments FLIM: fixed an integer overflow error with large files (thanks to Rolf Harkes)
- Leica LIF: unified metadata parsing to use DataTools.parseDouble
- Leica SCN: improved support for Versa datasets
- Micro-Manager:
  - improved handling of very large metadata.txt files
  - prevented NumberFormatException for invalid double values
  - add support for parsing ChannelColor from metadata.txt files
- Metamorph: added support for multi-dimensional .scan dataset created from Scan Slide (thanks to Jeremy Muhlich)
- MRC (Medical Research Council): fixed endian detection for old-style headers
- Nikon ND2:
  - prevented integer overflow when reading chunkmaps from files larger than 2GB
  - fixed handling of duplicate and incomplete exposure time lists
  - fixed chunk map handling when CustomData blocks are between ImageDataSeqs
- OME-TIFF:
  - added support for reading OME-TIFF with pyramidal resolutions stored as SubIFDs
  - added support for writing OME-TIFF with pyramidal resolutions
  - added support for companion OME-TIFF filesets where TIFF does not link back to the metadata file
  - improved handling of missing planes in TiffData
- PerkinElmer Operetta: improved support to handle datasets generated by the Harmony software
- TIFF:
  - split IFDs into separate series if the dimensions or pixel type mismatch
  - restricted use case for legacy TIFF JAI reader
  - fixed a bug with FillOrder which resulted in 0 pixel values
- Zeiss CZI: reduced duplicate original metadata when reading a pyramid file
- Zeiss TIFF: added support for AVI files acquired with Keyence software
- Zeiss ZVI: reuse stream for sequential calls to openBytes on the same plane
- updated all pyramidal format readers to consume SubResolutionReader
- updated all readers to consume MetadataTools getter to retrieve enumerations
- reviewed all readers and plugins to close open instances of RandomAccessInputStream
- fixed some deprecation warnings in a number of readers
- for RGB images using ChannelSeparator all channel metadata is now copied instead of just names
ImageJ plugin improvements:
- updated the updater message in the Fiji plugin (thanks to Jan Eglinger)
- disabled LUT writing for any plane that has a default grayscale lookup table
- added macro option to always skip LUT writing
MATLAB toolbox improvements:
- improved performance of bfGetPlane by removing an unnecessary data copy (thanks to Cris Luengo)
Command-line tools improvements:
- bfconvert utility:
  - added -no-flat option to the command-line tools to convert files with sub-resolutions
  - added -pyramid-scale and -pyramid-resolutions options to generate sub-resolutions during conversion
  - removed Plate elements when -series is passed as an option
  - extended usage to describe available formats, extensions and compressions
- xmlvalid utility:
  - added new validate methods to loci.formats.tools.XMLValidate returning the validation status
  - added a return code to xmlvalid
Component changes:
- ome-common was upgraded to 6.0.0
- ome-codecs was upgraded to 0.2.3
- ome-model was upgraded to 6.0.0
Automated test changes:
- added testng.allow-missing property allowing to skip unconfigured filesets
- added testUnflattenedSaneOMEXML to compare series count to OME-XML images count when resolution flattening is disabled
- added test-equivalent target to compare pixel data between two files
- added support for storing resolution index and resolution count in the configuration files used for automated testing
- tests now fail when a configured file throws UnknownFormatException
Documentation improvements:
- fixed the xmlvalid documentation page (thanks to Kouichi C. Nakamura)
- improved the memory section of the MATLAB documentation page (thanks to Kouichi C. Nakamura)
- extended IFormatReader Javadocs to reflect the reader guide
- added reference to current Adobe TIFF specification
- switched to image.sc as the reference location for public feedback
Full details can be found at: https://docs.openmicroscopy.org/bio-formats/6.0.0/about/whats-new.html
The software is available at: https://www.openmicroscopy.org/bio-formats/downloads/
Type Of Technology	Software
Year Produced	2019
Open Source License?	Yes
Impact	This version of Bio-Formats contains three major types of improvements: 1. Full reader, writer and specification support for an updated OME-TIFF that supports multi-dimensional, multi-resolution tiled ("pyramidal") image files, as used in imaging of large blocks of tissue in research and in clinical applications. This is the first open-source, fully open, fully implemented file format for whole slide imaging and other tissue imaging applications. 2. Support for the Keller Lab Block (KLB) image file format, a format used by several labs performing light sheet microscopy of large biological specimens. 3. Not yet released, but coming soon, is support for the BigDataViewer format, another commonly used light sheet microscopy format.
URL	https://docs.openmicroscopy.org/bio-formats/6.0.0/about/whats-new.html
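
The release notes above mention the -pyramid-scale and -pyramid-resolutions options that bfconvert gained for generating sub-resolutions. To give a feel for what a pyramid involves, the sketch below computes the pixel dimensions of each level and assembles a matching bfconvert argument list. It assumes that each level is the previous one divided by the scale factor with integer division, and that the resolution count includes the full-resolution level; Bio-Formats' actual rounding and counting may differ, and the file names are placeholders.

```python
from typing import List, Tuple


def pyramid_levels(width: int, height: int, scale: int, resolutions: int) -> List[Tuple[int, int]]:
    """Estimate (width, height) per level, full resolution first.

    Assumption: integer division by scale**level, clamped to at least 1 pixel.
    """
    if scale < 2 or resolutions < 1:
        raise ValueError("scale must be >= 2 and resolutions >= 1")
    levels = []
    for level in range(resolutions):
        factor = scale ** level
        levels.append((max(1, width // factor), max(1, height // factor)))
    return levels


def bfconvert_args(src: str, dst: str, scale: int, resolutions: int) -> List[str]:
    """Build an argument list using the two options named in the release notes."""
    return [
        "bfconvert",
        "-pyramid-scale", str(scale),
        "-pyramid-resolutions", str(resolutions),
        src,
        dst,
    ]


if __name__ == "__main__":
    # Placeholder slide size and file names, for illustration only.
    for w, h in pyramid_levels(40000, 30000, scale=2, resolutions=4):
        print(f"{w} x {h}")
    print(" ".join(bfconvert_args("slide.svs", "slide.ome.tiff", 2, 4)))
```

The argument list could be passed to subprocess.run on a machine where the Bio-Formats command-line tools are installed; that step is left out here so the sketch runs without them.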