Making data FAIR using InterMine

Lead Research Organisation: University of Cambridge
Department Name: Genetics

Abstract

Against a backdrop of ever-increasing data generation by the world's bioscience community there is growing recognition that it is essential to describe data more precisely and so make it easier to find and reuse, including integrating it with other datasets. In order to achieve these goals, the community needs to describe the data better (i.e. create better "metadata") and also to store and transport the data according to rigorously defined standards. This is hard in the biosciences because of the huge diversity in experimental techniques and the extremely rapid pace of technological change.

Accordingly, the recently published FAIR principles are defining a consensus that data should be "Findable, Accessible, Interoperable and Reusable". These principles are providing an opportunity for a concerted effort towards their increased adoption by the bioscience research community together with science funders and publishers, who are starting to ask for data to be managed according to the FAIR principles.

InterMine is a data integration framework that has been developed for over a decade and is used for large-scale data integration projects around the world, including by many of the main plant and animal model organism databases (MODs). These MODs are integrated repositories representing the output of much of the world's basic research, and have correspondingly large user communities.

Since starting in 2002, the InterMine project has been under continuous development to reflect changes in best practice and to exploit the best available technologies. Consistent with this approach, in this proposal, we aim to make extensive and coherent changes to InterMine, to enable InterMine database operators to create FAIR resources for their user communities. Thus this project should positively and directly impact current and future providers of integrated data resources based on InterMine, indirectly impact their collectively large user communities by providing better-described data, and facilitate one of the key aims of the FAIR principles, which is large-scale data re-use.

Technical Summary

We propose to extend InterMine to make it a better provider of FAIR data.

We will provide unique and stable URIs for InterMine data objects, so that they can be accessed RESTfully and safely used in external datasets and systems. We will register these identifiers with third parties and embed search-engine-friendly metadata in web pages, to make InterMine-served data more findable.
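As a rough illustration of the idea, a stable URI can be derived purely from values that do not change between database releases. The outcomes reported later in this record describe URIs built from InterMine class names combined with local IDs supplied by the data providers; the sketch below assumes a hypothetical "<base>/<Class>/<id>" layout, a hypothetical base URL and a placeholder gene ID, none of which are taken from InterMine itself.

```python
from urllib.parse import quote


def build_object_uri(base_url: str, class_name: str, local_id: str) -> str:
    """Build a URI from a mine base URL, a data-model class name, and a provider ID.

    The "<base>/<Class>/<id>" layout is a hypothetical pattern for illustration.
    """
    if not class_name or not local_id:
        raise ValueError("class_name and local_id are both required")
    # Percent-encode both parts so characters such as ':' or '/' stay in one path segment.
    class_part = quote(class_name, safe="")
    id_part = quote(local_id, safe="")
    return f"{base_url.rstrip('/')}/{class_part}/{id_part}"


# The URI depends only on its inputs, so it is identical in every database release.
print(build_object_uri("https://example.org/mine/", "Gene", "GENE0001"))
print(build_object_uri("https://example.org/mine", "GOTerm", "GO:0008150"))
```

Because nothing in the URI comes from an internal row number or a release-specific value, an external dataset that stores it should still point at the same object after the database is rebuilt.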

To enhance interoperability, we will generalize the facility for attaching the Sequence Ontology to the data model into one that can attach any arbitrary ontology. This will allow us to fully annotate the core model, and allow InterMine operators to annotate their own extensions to it. InterMine's existing JSON and XML query output formats will be extended with this new metadata, and these formats will be available for individual objects as well.

RDF will become a new InterMine output format allowing users who want to integrate all InterMine data to bulk download RDF triples. For piecemeal integration, there will be RDF representations of InterMine objects, lists and query results. We will also provide a SPARQL endpoint Docker image for experimentation rather than production, as SPARQL is a powerful technology but still subject to performance and uptime issues.
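To make the RDF output concrete, the following minimal sketch serializes a few statements about one object as N-Triples, one of the simplest RDF serializations. It is not InterMine's exporter: the subject URI and the label are placeholders, and the helper names `escape_literal` and `to_ntriples` are invented for this example. The predicate URIs are the standard `rdf:type` and `rdfs:label`, and the type is the Sequence Ontology term for "gene".

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(value: str) -> str:
    """Escape the characters that must be escaped inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(subject_uri, statements):
    """Serialize (predicate_uri, value, is_iri) statements about one subject."""
    lines = []
    for predicate_uri, value, is_iri in statements:
        # IRIs go in angle brackets; everything else becomes a plain string literal.
        obj = f"<{value}>" if is_iri else f'"{escape_literal(str(value))}"'
        lines.append(f"<{subject_uri}> <{predicate_uri}> {obj} .")
    return "\n".join(lines)


gene_uri = "https://example.org/mine/Gene/GENE0001"  # hypothetical object URI
print(to_ntriples(gene_uri, [
    (RDF_TYPE, "http://purl.obolibrary.org/obo/SO_0000704", True),  # Sequence Ontology "gene"
    (RDFS_LABEL, "example gene", False),  # placeholder label
]))
```

Each output line is a complete triple, which is what makes the format convenient for bulk download: files can be concatenated or streamed without any enclosing document structure.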

Data interoperability means adding links between datasets and data objects. We will create as many links as possible. In some cases these will target primary data providers, in other cases intermediate FAIR link registries.

Increasing data integration also makes recording and providing data license information more important. We will improve InterMine's capabilities in this area and advocate its importance.

Because InterMine is a long-lived, professionally developed open source project, we will do all this following best practices. We will also make it usable by writing documentation, put it in front of people by presenting papers and running workshops, and adapt our plans in response to community feedback and contributions.

Planned Impact

(Please see the Academic Beneficiaries section for how the academic community will benefit from this proposal. Below we outline potential economic and societal benefits.)

Agritech, biotech and pharmaceutical companies, which depend extensively on the academic community for tools and access to data, will benefit by being able to find and access data more easily and subsequently to integrate and reuse it, so increasing their efficiency with consequent economic benefits.

Google, Yahoo, Yandex, Microsoft and other companies involved in schema.org will benefit from our involvement in bioschemas.org: this proposal will yield a compelling and high value use case for the Bioschemas community through thousands of end-users accessing the terabytes of data provided through InterMine databases.

Funders, such as the Research Councils, will benefit from the increased impact of the projects they support. Data will be made FAIR more easily, and so data generated from grants will be re-used more and thus will provide better value for taxpayers' money. Similarly by increasing the availability of FAIR data, the overall research effort will be made more efficient.

Local schools and school trips visiting Cambridge for possible entry, together with the general public through outreach activities, will benefit from greater awareness of the importance of big data to modern bioscience, and how this can benefit the economy and society.

Publications

 
Description 1. We have adapted InterMine so that all data objects are now identified by unique and stable URIs based on the InterMine class names combined with local IDs provided by the data resource providers. This is important to allow data to be found consistently between database releases.

2. We have applied ontologies to the InterMine data model, including the EDAM Ontology, the Semanticscience Integrated Ontology, the National Cancer Institute Thesaurus and Dublin Core. These changes mean that data are labelled and organised more consistently with established standards, which improves searching of the data.

3. Bioschemas.org is working to improve the descriptive data (metadata) provided by web pages in order to improve internet search. We have added JSON-LD structured data to make data more findable using DataSet, BioChemEntity, Gene and Protein defined by Bioschemas.org.

4. We have improved the way that external data sources are referenced by making use of Identifiers.org, improving the ability of users to understand the origins of the data they are using.

5. It is becoming increasingly important to only use data in a way that fits the licence under which the data were released. We updated the InterMine core model and data parsers to record data licences and to display them alongside the dataset in order to facilitate this.

6. We have registered InterMine databases in community bioinformatics systems such as bio.tools and FAIRsharing.org, improving the ease with which they can be found.

7. Results tables can now be downloaded in RDF or N-Triples formats, in addition to the various existing formats, using the Export functionality.

8. The R2RML mapping language has been applied so that it is now possible to dynamically generate a mapping from the data in any InterMine database to data in RDF format. As a result, InterMine administrators can now provide SPARQL endpoints for their InterMine databases using the third-party Ontop software. We have set up a FlyMine SPARQL endpoint to demonstrate this functionality.

9. We have generated a simple InterMine ontology which describes classes and their attributes defined in the InterMine core model. This is used where there are no appropriate terms from existing popular ontologies (e.g. Sequence Ontology, Semantic Science, EDAM, MeSH, Dublin Core) and provides actionable and unique URIs.
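
Items 3 and 4 above can be sketched together: a small JSON-LD document that describes a protein using a schema.org/Bioschemas type and links it to its source record through Identifiers.org. This is an illustration under stated assumptions, not the markup InterMine actually emits. The function names, the page URL, the protein name and the accession are placeholders, and the property set shown is not checked against the full Bioschemas profiles. The `https://identifiers.org/<prefix>:<accession>` form is the Identifiers.org compact-identifier URL pattern.

```python
import json


def identifiers_org_url(prefix: str, accession: str) -> str:
    """Build a resolvable Identifiers.org URL from a namespace prefix and an accession."""
    return f"https://identifiers.org/{prefix}:{accession}"


def protein_jsonld(name: str, accession: str, page_url: str) -> dict:
    """Describe one protein page as a JSON-LD document (illustrative property set)."""
    return {
        "@context": "https://schema.org",
        "@type": "Protein",
        "name": name,
        "identifier": accession,
        "url": page_url,
        # Link back to the primary record so users can trace the origin of the data.
        "sameAs": identifiers_org_url("uniprot", accession),
    }


def as_script_tag(document: dict) -> str:
    """Wrap a JSON-LD document in the script element that search engines read."""
    # Escape "</" so that no string value can close the script element early.
    payload = json.dumps(document, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


doc = protein_jsonld(
    name="Example protein",  # placeholder name
    accession="P12345",  # placeholder accession
    page_url="https://example.org/mine/Protein/P12345",  # hypothetical page URL
)
print(as_script_tag(doc))
```

Placing the resulting element in a report page's HTML lets crawlers read the same identifiers that appear in the database, without changing what human visitors see.
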
Exploitation Route The outcomes of this funding will benefit the growing number of users of the InterMine platform worldwide, both in academia and industry. The changes made through this funding mean that the platform now conforms well to the FAIR data principles, meaning that data presented through the InterMine platform can be Found, Accessed, Interoperated with and Re-used. This will benefit the many thousands of InterMine database end users around the world.

Through presentation of our work we have promoted the benefits of complying with the FAIR principles and set an example of the benefits that this brings.
Sectors Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology; Other

URL http://intermine.org/
 
Description This grant funded adaptations to the InterMine platform to improve the degree to which it complies with the FAIR data principles. The InterMine platform is in use by the biotechnology company STORM Therapeutics Ltd. Because consistent URIs now identify database objects, STORM Therapeutics has been able to share experimental data between the different interested departments and run comparisons over it. The changes have also allowed the use of concepts from widely known ontologies such as EDAM to perform more fine-grained queries with a vocabulary that the STORM InterMine users were familiar with.
First Year Of Impact 2020
Sector Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Description BOSC training workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact InterMine workshop and Birds of a Feather meeting at the BOSC conference
Year(s) Of Engagement Activity 2018
 
Description Big Biology Day 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Big Biology Day (BBD) is an annual one-day science festival at Hills Road Sixth Form College, Cambridge, and part of National Biology Week. BBD celebrates the life sciences and engages the public with an array of biological topics with lots of hands-on activity and a career fair.
Year(s) Of Engagement Activity 2019
 
Description Biohackathon 2020 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact At the biohackathon, Ontop and R2RML were used to provide the ability to query InterMine databases using SPARQL.
Year(s) Of Engagement Activity 2020
 
Description Bioinformatics Community Conference 2020 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Training workshop on how to handle integrated biological data using Python, Jupyter, and InterMine
Year(s) Of Engagement Activity 2020
 
Description Cambridge Science Festival 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact Puzzle-based game to engage the general public with the issues and challenges of data.
Year(s) Of Engagement Activity 2018
 
Description ECCB Conference webinar 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact An introductory workshop on analysing biological data using the InterMine platform
Year(s) Of Engagement Activity 2020
 
Description ELIXIR Open Day 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The ELIXIR Open Day was to promote the achievements of the pan-European data organisation ELIXIR. InterMine is an ELIXIR Recommended Interoperability Resource.
Year(s) Of Engagement Activity 2020
 
Description ELIXIR SME and Innovation Event 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact To raise awareness of our activities within ELIXIR-UK among SMEs
Year(s) Of Engagement Activity 2018
 
Description F.A.I.R. blog post 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Blog post to increase awareness of the importance of F.A.I.R. data practices
Year(s) Of Engagement Activity 2018
 
Description InterMine training workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact An introductory workshop on analysing biological data using the InterMine platform
Year(s) Of Engagement Activity 2020
 
Description Science festival, Department of Genetics 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact The Genetics Department held outreach events as part of the Cambridge Science Festival
Year(s) Of Engagement Activity 2019