'Omics Data Sharing: the Investigation / Study / Assay (ISA) Infrastructure

Lead Research Organisation: European Bioinformatics Institute
Department Name: Proteomics Services Team

Abstract

There is a pressing and recognized need in the biological domain for improved data sharing and unified access to data from a wide range of sources. The use of 'omics technologies (such as genomics, metagenomics, transcriptomics, proteomics and metabolomics) is now widespread and the rate at which these technologies generate data is revolutionizing the scientific landscape. This massive influx of data brings both unprecedented scientific opportunities and a range of challenges that must be met if these data, and the public investment in science that they represent, are to be fully exploited. While there are many obstacles to overcome if we are to realize large-scale multi-omic data sharing at the community level, solutions are now possible due to the activities of a range of grass-roots standardisation projects including the 'Minimum Information for Biological and Biomedical Investigations' (MIBBI) project (http://mibbi.org/) and the Open Biological Ontologies (OBO) Foundry (http://obofoundry.org/). We propose to make more widely available our 'omics data sharing software based on the 'Investigation / Study / Assay' (ISA) concept (http://isatab.sf.net). The ISA concept allows the description of any 'Investigation' comprising one or more 'Studies' in which biological samples have been studied using one or more 'Assays' (technologies). The ISA concept is supported by the MIBBI community and has been used to structure a universal file format, ISA-Tab. The ISA-Tab file format leverages biologists' familiarity with, and trust of, spreadsheet-based input and manipulation of information. Descriptive experimental information (metadata) captured in ISA-Tab format is made compliant with MIBBI-registered standards (for transcriptomics, MIAME; for proteomics, MIAPE; and for genomics, MIGS/MIMS) using pre-defined extensions. ISA-Tab can be configured to hold additional fields, allowing users to comply with emerging standards as well.
The availability of this universal file format has enabled the creation of a set of tools and a database to hold data sets captured in it. The current pilot-stage ISA Infrastructure provides a complete solution for managing multi-omic metadata at the community level. A core aspect of the design of the ISA Infrastructure is its integral use of OBO Foundry ontologies to describe investigations, rendering data descriptions unambiguous and computationally accessible. In the course of this proposed project, we will extend the current ISA Infrastructure implementation and work with identified research communities and their bioinformatic service providers to set up 'ISA Networks' in the UK and around the globe, covering a wide range of data types. These portals will serve as 'one-stop shops' for the aggregation and display of relevant datasets at the community level. The metadata captured will support searching and data discovery across organisms, technologies and data types. The shared use of minimum information standards, ontologies and a single file format will support exchange of data between communities and the transfer of data to and from public repositories. At the international level, we will work closely with the MIBBI and OBO Foundry communities to further unify MIBBI checklists and OBO Foundry ontologies to support descriptions of multi-omic investigations. The development of the ISA Infrastructure must be consensus-driven and is therefore best developed under the auspices of an international working group. We will therefore formalise the collaboration between ISA Networks and work within the data standardisation community to increase linkages between currently separated groups by launching the BioSharing Consortium (http://biosharing.org).

Technical Summary

Despite the many obvious benefits of data sharing, unification of our global, invaluable, and now vast, biological data stores has proven elusive. Associated concerns over the inaccessibility of data, leading both to lost opportunities for discovery and unnecessary duplication of effort, are driving a focus on 'omics data sharing. In 2009, major international groups of researchers held workshops to promote improved data sharing of pre- and post-publication resources. Funders are also concerned, as evidenced by the publication of data policies aimed at improving stewardship of billions of pounds of hard-won research data, especially in the field of 'omics research. Obstacles include the long-standing lack of software solutions for supporting data sharing that suit the needs of data submitters and users alike. To overcome these challenges we have designed and developed the ISA infrastructure, the first pilot-stage freely available software suite for curating, aggregating and sharing multi-omics investigations. In this project we will complete the software suite and help our wide range of collaborators to deploy several ISA Networks environments to: (i) assist in the reporting and local management of experimental metadata, (ii) empower their user communities to adopt community-defined MIBBI-registered checklists, OBO ontologies and the ISA-Tab format and (iii) facilitate submission of metadata to international public repositories. We will also continue our consensus-building standards activities, mapping/matching concepts in MIBBI to those in OBO Foundry ontologies and making sure all can be captured in ISA-Tab format and manipulated/displayed in the ISA Infrastructure. Lastly, under the large BioSharing Consortium umbrella, we will formalize linkages between a wide range of communities, including MIBBI, OBO Foundry and the ISA Networks, as well as journals, funders, industry, databases, biocurators, and next-generation technology providers.

Planned Impact

See lead organisation form.

Publications

 
Description 'Omics Data Sharing: the Investigation / Study / Assay (ISA) Infrastructure is an infrastructure project (http://isa-tools.org). The ISA metadata tracking tools help to manage an increasingly diverse set of life science, environmental and biomedical experiments that employ one or a combination of technologies.

Built around the 'Investigation' (the project context), 'Study' (a unit of research) and 'Assay' (analytical measurement) general-purpose Tabular format, the ISA tools help you to provide a rich description of the experimental metadata (i.e. sample characteristics, technology and measurement types, sample-to-data relationships) so that the resulting data and discoveries are reproducible and reusable.
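
The nesting of Investigation, Study and Assay described above can be pictured as a small data model. The following is a minimal sketch only, assuming nothing about the real ISA-Tab file layout or the ISA software API: the class names, field names, identifiers and file names are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Assay:
    """An analytical measurement (hypothetical fields, not the ISA-Tab schema)."""
    measurement_type: str
    technology_type: str
    # Sample-to-data relationship: sample name -> data files derived from it.
    sample_to_data: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Study:
    """A unit of research, holding samples and the assays run on them."""
    identifier: str
    # Sample characteristics: sample name -> {characteristic: value}.
    samples: Dict[str, Dict[str, str]] = field(default_factory=dict)
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """The project context, comprising one or more studies."""
    identifier: str
    studies: List[Study] = field(default_factory=list)


def data_files_for_sample(investigation: Investigation, sample_name: str) -> List[str]:
    """Collect every data file linked to a sample across all studies and assays."""
    files: List[str] = []
    for study in investigation.studies:
        for assay in study.assays:
            files.extend(assay.sample_to_data.get(sample_name, []))
    return files


# Illustrative usage with made-up identifiers and file names.
example = Investigation(
    identifier="INV-1",
    studies=[
        Study(
            identifier="STU-1",
            samples={"sample-A": {"organism": "Homo sapiens"}},
            assays=[
                Assay("metabolite profiling", "mass spectrometry",
                      {"sample-A": ["run1.mzML"]}),
                Assay("transcription profiling", "DNA microarray",
                      {"sample-A": ["chipA.CEL"]}),
            ],
        )
    ],
)
```

Keeping the sample-to-data links on each assay, as in this sketch, is one way to let a single sample be traced through several technologies within the same study.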

The project
- completed the core development of the ISA infrastructure:
Several new components of the modular ISA infrastructure have been delivered, others have been extended and new, enhanced versions have been released by the ISA Operational Team in Oxford.

The software components meet the requirements of the Software Sustainability Institute. Several components have also passed User Acceptance Testing carried out by industrial users and collaborators at Janssen Research & Development in Belgium, the Novartis Institutes for BioMedical Research in the USA, and the FDA's Center for Bioinformatics at the National Center for Toxicological Research in the USA. In addition, the ISA infrastructure has been adopted by Eagle Genomics Ltd, a bioinformatics consulting company in Cambridge, UK, which has signed a Memorandum of Understanding with the ISA Operational Team and the University of Oxford.

- set up a constellation of ISA networks - local environments for curating, aggregating and sharing multi-omics investigations. Another website (http://www.isacommons.org) has been set up to provide an at-a-glance view of the community we have been working with and for, named the ISA Commons. These groups are the ISA user base. The majority of these groups are service providers running ISA-powered systems that are (i) local, institute-based, (ii) project, consortium-based, or (iii) global, international repositories.

- supported the 'integration' of reporting standards through continued consensus-building activities.
The ISA Operational Team has also developed the BioSharing prototype catalogue (http://biosharing.org), extending the work of the MIBBI portal (a predecessor of BioSharing), to map the landscape of community standards (minimum reporting requirements, terminologies and exchange formats), databases and policies. The initial work to collect the descriptions of key databases has been done in collaboration with Oxford University Press' NAR Database and DATABASE journals, and with the support of the International Society of Biocuration.
Launched in 2011, the BioSharing prototype catalogue has registered a total of approximately 541 standards (66 minimum reporting requirements, 329 terminologies, 145 exchange formats), 18 data sharing policies, and 618 databases. The BioSharing website has 30,000 visitors, its Twitter account has approximately 500 followers, and it has a growing community of 40 communities and prospective user groups (publishers, standards groups, service providers, data scientist associations and research consortia), including Nature Publishing Group, BioMedCentral, Genomics Standards Consortium, International Society of Biocuration, Proteomics Standards Initiative, Science Commons, Digital Curation Centre, DataCite, SEEK and the industry-driven Pistoia Alliance. Each of these groups serves hundreds to thousands of researchers in diverse life science domains, which attests to the widespread interest.
BioSharing is part of several NIH BD2K Centres and is nominated as a service of the EU RI ELIXIR.
Exploitation Route

We know that the user bases of the groups within the ISA Commons range from hundreds to thousands of researchers in an increasingly diverse set of life, natural and biomedical sciences. However, providing an exact number of users for each ISA-powered instance is impossible. For example, the ISA-Tab compliant SysMO-SEEK serves the 15 European consortia and over 300 researchers; but recently the SEEK platform, principles and methods have also been adopted by other multi-site Systems Biology projects. Similarly, the newly established public MetaboLights at EBI (fully powered by the ISA tools) is set to become the central repository for all metabolomics-based datasets, also in the context of ELIXIR. Furthermore, emerging data publication platforms using ISA-Tab, such as GigaScience (by BioMedCentral and BGI) and Scientific Data (by Nature Publishing Group), are also set to contribute to the wider uptake of the format and of several of the software components. A first example is the agreement announced between Scientific Data and the Global Biodiversity Information Facility (GBIF) to help users of the GBIF Integrated Publishing Toolkit submit data in ISA-Tab to the Nature Publishing Group platform. At this stage, we can safely estimate that collectively there are over 10,000 users associated with the different ISA-powered instances; the number is expected to grow.
Sectors Agriculture, Food and Drink, Healthcare, Pharmaceuticals and Medical Biotechnology

URL http://www.isa-tools.org/
 
Description We know that the user bases of the groups within the ISA Commons range from hundreds to thousands of researchers in an increasingly diverse set of life, natural and biomedical sciences. However, providing an exact number of users for each ISA-powered instance is impossible. For example, the ISA-Tab compliant SysMO-SEEK serves the 15 European consortia and over 300 researchers; but recently the SEEK platform, principles and methods have also been adopted by other multi-site Systems Biology projects. Similarly, the newly established public MetaboLights at EBI (fully powered by the ISA tools) is set to become the central repository for all metabolomics-based datasets, also in the context of ELIXIR. Furthermore, emerging data publication platforms using ISA-Tab, such as GigaScience (by BioMedCentral and BGI) and Scientific Data (by Nature Publishing Group), are also set to contribute to the wider uptake of the format and of several of the software components. A first example is the agreement announced between Scientific Data and the Global Biodiversity Information Facility (GBIF) to help users of the GBIF Integrated Publishing Toolkit submit data in ISA-Tab to the Nature Publishing Group platform. At this stage, we can safely estimate that collectively there are over 10,000 users associated with the different ISA-powered instances; the number is expected to grow.
First Year Of Impact 2015
Sector Agriculture, Food and Drink, Pharmaceuticals and Medical Biotechnology
 
Title ISA Tools Infrastructure 
Description The ISA tool infrastructure has been successfully created. Built around the 'Investigation' (the project context), 'Study' (a unit of research) and 'Assay' (analytical measurement) general-purpose Tabular format, the ISA tools help to provide rich description of the experimental metadata (i.e. sample characteristics, technology and measurement types, sample-to-data relationships) so that the resulting data and discoveries are reproducible and reusable. http://isa-tools.org/ 
Type Of Technology Webtool/Application 
Year Produced 2011 
Impact ISA tools are accepted and used by the community, among others by the BBSRC-funded MetaboLights resource and the data-oriented journals GigaScience and Scientific Data (Nature Group). 
URL http://isacommons.org/
 
Description Career Q&A 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact This career Q&A with year 10 students was carried out virtually for the local college, and it is hoped that it will encourage more students to think about entering not only science in general but also the field of bioinformatics.
Year(s) Of Engagement Activity 2020