'Omics Data Sharing: the Investigation / Study / Assay (ISA) Infrastructure

Lead Research Organisation: NERC CEH (Up to 30.11.2019)
Department Name: Hails

Abstract

There is a pressing and recognized need in the biological domain for improved data sharing and unified access to data from a wide range of sources. The use of 'omics technologies (such as genomics, metagenomics, transcriptomics, proteomics and metabolomics) is now widespread, and the rate at which these technologies generate data is revolutionizing the scientific landscape. This massive influx of data brings both unprecedented scientific opportunities and a range of challenges that must be met if these data, and the public investment in science that they represent, are to be fully exploited. While there are many obstacles to overcome if we are to realize large-scale multi-omic data sharing at the community level, solutions are now possible due to the activities of a range of grass-roots standardisation projects, including the 'Minimum Information for Biological and Biomedical Investigations' (MIBBI) project (http://mibbi.org/) and the Open Biological Ontologies (OBO) Foundry (http://obofoundry.org/). We propose to make more widely available our 'omics data sharing software based on the 'Investigation / Study / Assay' (ISA) concept (http://isatab.sf.net). The ISA concept allows the description of any 'Investigation' comprising one or more 'Studies' in which biological samples have been studied using one or more 'Assays' (technologies). The ISA concept is supported by the MIBBI community and has been used to structure a universal file format, ISA-Tab. The ISA-Tab file format leverages biologists' familiarity with, and trust of, spreadsheet-based input and manipulation of information. Descriptive experimental information (metadata) captured in ISA-Tab format is made compliant with MIBBI-registered standards (for transcriptomics, MIAME; for proteomics, MIAPE; and for genomics, MIGS/MIMS) using pre-defined extensions. ISA-Tab can be configured to hold additional fields, allowing users to comply with emerging standards as well.
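The nesting described by the ISA concept (an Investigation comprising one or more Studies, each using one or more Assays) can be pictured as a small data structure. The following is a minimal sketch for illustration only: the class names, field names and the `validate` helper are hypothetical and are not taken from the ISA-Tab format or from the ISA software.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    # One technology applied to samples, e.g. "transcriptomics" (hypothetical field names).
    technology: str
    sample_names: List[str] = field(default_factory=list)


@dataclass
class Study:
    identifier: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)

    def technologies(self) -> List[str]:
        """Return the distinct assay technologies used across all studies, sorted."""
        return sorted({a.technology for s in self.studies for a in s.assays})


def validate(investigation: Investigation) -> List[str]:
    """Check the 'one or more' constraints and return a list of problems found."""
    problems = []
    if not investigation.studies:
        problems.append("investigation has no studies")
    for study in investigation.studies:
        if not study.assays:
            problems.append(f"study {study.identifier} has no assays")
    return problems


# Example: one study in which the same sample is measured with two technologies.
example = Investigation(
    title="Example multi-omic investigation",
    studies=[
        Study(
            identifier="S1",
            assays=[
                Assay("transcriptomics", ["sample1"]),
                Assay("proteomics", ["sample1"]),
            ],
        )
    ],
)
print(validate(example))        # an empty list means both constraints hold
print(example.technologies())
```

A structure of this kind makes the multi-omic case explicit: a single study can carry several assays, so metadata for different technologies stays attached to the same samples.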
The availability of this universal file format has enabled the creation of a set of tools and a database to hold data sets captured in it. The current pilot-stage ISA Infrastructure provides a complete solution for managing multi-omic metadata at the community level. A core aspect of the design of the ISA Infrastructure is its integral use of OBO Foundry ontologies to describe investigations, rendering data descriptions unambiguous and computationally accessible. In the course of this proposed project, we will extend the current ISA Infrastructure implementation and work with identified research communities and their bioinformatic service providers to set up 'ISA Networks' in the UK and around the globe, covering a wide range of data types. These portals will serve as 'one-stop shops' for the aggregation and display of relevant datasets at the community level. The metadata captured will support searching and data discovery across organisms, technologies and data types. The shared use of minimum information standards, ontologies and a single file format will support exchange of data between communities and the transfer of data to and from public repositories. At the international level, we will work closely with the MIBBI and OBO Foundry communities to further unify MIBBI checklists and OBO Foundry ontologies to support descriptions of multi-omic investigations. The development of the ISA Infrastructure must be consensus-driven and is therefore best carried out under the auspices of an international working group. We will therefore formalise the collaboration between ISA Networks and work within the data standardisation community to increase linkages between currently separated groups by launching the BioSharing Consortium (http://biosharing.org).

Technical Summary

Despite the many obvious benefits of data sharing, unification of our global, invaluable, and now vast, biological data stores has proven elusive. Associated concerns over the inaccessibility of data, leading both to lost opportunities for discovery and unnecessary duplication of effort, are driving a focus on 'omics data sharing. In 2009, major international groups of researchers held workshops to promote improved data sharing of pre- and post-publication resources. Funders are also concerned, as evidenced by the publication of data policies aimed at improving stewardship of billions of pounds of hard-won research data, especially in the field of 'omics research. Obstacles include the long-standing lack of software solutions for supporting data sharing that suit the needs of data submitters and users alike. To overcome these challenges, we have designed and developed the ISA infrastructure, the first pilot-stage freely available software suite for curating, aggregating and sharing multi-omics investigations. In this project we will complete the software suite and help our wide range of collaborators to deploy several ISA Networks environments to: (i) assist in the reporting and local management of experimental metadata, (ii) empower their user communities to take up community-defined MIBBI-registered checklists, OBO ontologies and the ISA-Tab format and (iii) facilitate submission of metadata to international public repositories. We will also continue our consensus-building standards activities, mapping/matching concepts in MIBBI to those in OBO Foundry ontologies and making sure that all can be captured in ISA-Tab format and manipulated/displayed in the ISA Infrastructure. Lastly, under the large BioSharing Consortium umbrella, we will formalize linkages between a wide range of communities, including MIBBI, OBO Foundry and the ISA Networks, as well as journals, funders, industry, databases, biocurators, and next-generation technology providers.

Planned Impact

Public bioscience: Funders and journals increasingly require that researchers make more of their data public, e.g. by submitting it to public repositories, and that they seek to comply with community-defined standards. However, non-compliance may be difficult to overcome because the use of 'raw' standards, and checking compliance, can be challenging. The only feasible solution is better annotation at source using a tool that simplifies standards use and that provides some automated content validation to reduce the burden on reviewers and database curators. More generally, the availability of richly and uniformly annotated bio-investigations will increase efficiency and accelerate scientific advance. By reducing the chance that data will be misinterpreted, confidence in, and therefore reuse of, public data will increase, reducing the waste of resources and time inherent in redundant repetition.
Industrial bioscience: This sector increasingly regards public domain omics-based bio-investigations as an important input to their internal research and discovery activities. However, to maximise their value, bio-investigations must be richly and uniformly annotated so that the resulting data can be correctly interpreted through the context provided by their metadata. Thus, this proposal is timely. We can leverage and marshal the ongoing MIBBI, OBO Foundry and ISA-Tab efforts, and provide the community with a set of tools (the ISA Infrastructure) to make the job of capturing, richly annotating, integrating and sharing experimental data much simpler. Using the ISA infrastructure will therefore increase the prima facie value of data contributed to the public domain to various industries, and by extension, increase the return on the investment of (public and private) funds that supported their generation. We have gone to great lengths to engage with companies such as AstraZeneca, Pfizer and Syngenta, operating in pharma- and agri-business in the UK and internationally. The ISA infrastructure offers a potential solution for their internal omic data management and services if used 'in house'. We are also looking into the possibility of starting an ISA Knowledge Transfer programme (here, 'ISA' denotes the reporting guidelines, ontologies and tools we cover in this proposal). We will also continue our outreach to industry, to knowledge service providers and to journals via our engagement with the Pistoia Alliance.
The standards community: Expertise and knowledge gained through the project will be disseminated through a variety of channels. The BioSharing Consortium will continue to promote integration generally, and common community standards in particular, to an ever broader audience.
Resource providers: There are many benefits accruing to the development, acceptance and implementation of reporting standards, but the real value and challenge is overcoming their fragmentation and moving towards their integration, as outlined in our proposal. By limiting the range and variability of standards, the development and maintenance costs for commercial (and academic) software developers of standards-compliant products drop, and thus software and instrument vendors (and by extension, their users) benefit.
Public policymakers: The production of more richly annotated bio-investigations will improve the evidence base for policymakers by providing greater interpretability of experimental context, simplifying the job of data integration and study comparison. More detail for those forming policy on biological and biomedical issues should produce better decisions.
The general public: The lay person is becoming increasingly aware of the impact of post-genomic technologies. Our work in support of the sharing of well-characterised data will not only increase the pace of advance in post-genomic science, ultimately benefitting the public, but also provide an excellent example of how science can make best use of the taxation that funds it.

Publications

Sansone SA (2012) Toward interoperable bioscience data. in Nature genetics

Maguire E (2012) Taxonomy-Based Glyph Design—with a Case Study on Visualizing Workflows of Biological Experiments. in IEEE transactions on visualization and computer graphics

Maguire E (2013) OntoMaton: a bioportal powered ontology widget for Google Spreadsheets. in Bioinformatics (Oxford, England)

Maguire E (2013) Visual Compression of Workflow Visualizations with Automated Detection of Macro Motifs. in IEEE transactions on visualization and computer graphics