Making data FAIR using InterMine

Lead Research Organisation: University of Cambridge

Department Name: Genetics

Abstract

Against a backdrop of ever-increasing data generation by the world's bioscience community there is growing recognition that it is essential to describe data more precisely and so make it easier to find and reuse, including integrating it with other datasets. In order to achieve these goals, the community needs to describe the data better (i.e. create better "metadata") and also to store and transport the data according to rigorously defined standards. This is hard in the biosciences because of the huge diversity in experimental techniques and the extremely rapid pace of technological change.

Accordingly, the recently published FAIR principles are defining a consensus that data should be "Findable, Accessible, Interoperable and Reusable". These principles are providing an opportunity for a concerted effort towards their increased adoption by the bioscience research community together with science funders and publishers, who are starting to ask for data to be managed according to the FAIR principles.

InterMine is a data integration framework that has been developed for over a decade and is used for large-scale data integration projects around the world, including by many of the main plant and animal model organism databases (MODs). These MODs are integrated repositories representing the output of much of the world's basic research, and have correspondingly large user communities.

Since starting in 2002, the InterMine project has been under continuous development to reflect changes in best practice and to exploit the best available technologies. Consistent with this approach, in this proposal, we aim to make extensive and coherent changes to InterMine, to enable InterMine database operators to create FAIR resources for their user communities. Thus this project should positively and directly impact current and future providers of integrated data resources based on InterMine, indirectly impact their collectively large user communities by providing better-described data, and facilitate one of the key aims of the FAIR principles, which is large-scale data re-use.

Technical Summary

We propose to extend InterMine to make it a better provider of FAIR data.

We will provide unique and stable URIs for InterMine data objects, so that they can be accessed RESTfully and safely used in external datasets and systems. We will register these identifiers with third parties and embed search-engine-friendly metadata in web pages, to make InterMine-served data more findable.
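
As a rough illustration of the stable-URI idea, the sketch below builds a URI from a data-model class name and a provider-assigned local ID, so the URI does not change when the database is rebuilt. The base URL, the path layout and the function name `object_uri` are assumptions made for this example and are not taken from InterMine itself.

```python
from urllib.parse import quote

# Hypothetical base URL; a real deployment would use its own domain.
BASE_URL = "https://example.org/mymine"


def object_uri(class_name: str, local_id: str, base_url: str = BASE_URL) -> str:
    """Build a URI from a class name and a provider-assigned identifier.

    Only values that persist across database releases are used, never an
    internal row number that is regenerated at each build.
    """
    if not class_name or not local_id:
        raise ValueError("class_name and local_id are both required")
    # Lower-case the class name and percent-encode both parts so that the
    # result is always a valid URL path (the layout itself is an assumption).
    class_part = quote(class_name.lower(), safe="")
    id_part = quote(local_id, safe="")
    return f"{base_url.rstrip('/')}/{class_part}/{id_part}"


if __name__ == "__main__":
    # FBgn0000606 is used here only as an example FlyBase-style identifier.
    print(object_uri("Gene", "FBgn0000606"))
```

Because the URI is derived only from the class and the external identifier, two releases of the same database yield the same URI for the same object.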

To enhance interoperability, we will generalize the facility for attaching the Sequence Ontology to the data model into one that can attach any arbitrary ontology. This will allow us to fully annotate the core model, and allow InterMine operators to annotate their own extensions to it. InterMine's existing JSON and XML query output formats will be extended with this new metadata, and these formats will be available for individual objects as well.
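
One way to picture attaching arbitrary ontology terms to a data model is sketched below. It is a minimal sketch under stated assumptions: the `ClassAnnotation` structure, the attribute names and the JSON layout are hypothetical and do not reproduce InterMine's model format. The two term URIs are the Sequence Ontology term for gene (SO:0000704) and the Dublin Core `title` term.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class ClassAnnotation:
    """A data-model class plus the ontology terms that describe it."""

    name: str
    term_uri: str = ""  # term describing the class as a whole
    attribute_terms: Dict[str, str] = field(default_factory=dict)  # attribute -> term


# Hypothetical two-class model; any ontology can be used, not only the Sequence Ontology.
MODEL = {
    "Gene": ClassAnnotation("Gene", "http://purl.obolibrary.org/obo/SO_0000704"),
    "DataSet": ClassAnnotation(
        "DataSet", attribute_terms={"name": "http://purl.org/dc/terms/title"}
    ),
}


def describe(model: Dict[str, ClassAnnotation]) -> str:
    """Serialise the annotated model so the terms travel with JSON output."""
    return json.dumps({n: asdict(a) for n, a in model.items()}, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(describe(MODEL))
```

Keeping the annotation as plain URIs means an operator who extends the core model only needs to add entries, not change code.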

RDF will become a new InterMine output format allowing users who want to integrate all InterMine data to bulk download RDF triples. For piecemeal integration, there will be RDF representations of InterMine objects, lists and query results. We will also provide a SPARQL endpoint Docker image for experimentation rather than production, as SPARQL is a powerful technology but still subject to performance and uptime issues.
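
To make the idea of an RDF representation of a single object concrete, the sketch below writes one object as N-Triples lines using only the Python standard library. The subject URI is hypothetical, the helper names are invented for this example, and the escaping covers only the most common cases; it is not InterMine's exporter.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def triple(subject: str, predicate: str, obj: str, literal: bool = False) -> str:
    """Format one triple; the object is either an IRI or a plain string literal."""
    rendered = f'"{escape_literal(obj)}"' if literal else f"<{obj}>"
    return f"<{subject}> <{predicate}> {rendered} ."


def object_to_ntriples(uri: str, type_uri: str, attributes: dict) -> str:
    """Emit a type triple followed by one literal triple per attribute."""
    lines = [triple(uri, RDF_TYPE, type_uri)]
    for predicate, value in attributes.items():
        lines.append(triple(uri, predicate, str(value), literal=True))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        object_to_ntriples(
            "https://example.org/mymine/gene/FBgn0000606",  # hypothetical URI
            "http://purl.obolibrary.org/obo/SO_0000704",
            {RDFS_LABEL: "example gene"},
        ),
        end="",
    )
```

A line-oriented format like this suits bulk download because each triple stands alone and files can be concatenated.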

Data interoperability means adding links between datasets and data objects. We will create as many links as possible. In some cases these will target primary data providers, in other cases intermediate FAIR link registries.
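
A link through an intermediate registry can be as simple as a resolvable compact identifier. The sketch below assumes Identifiers.org as the registry (the one the project outcomes later report using) and builds a resolver URL of the form `https://identifiers.org/<prefix>:<accession>`. The function names are invented for this example, and `uniprot:P12345` is only a placeholder identifier.

```python
def compact_identifier(prefix: str, accession: str) -> str:
    """Join a registry prefix and a local accession into a compact identifier."""
    if not prefix or not accession:
        raise ValueError("prefix and accession are both required")
    return f"{prefix.lower()}:{accession}"


def resolver_url(prefix: str, accession: str) -> str:
    """Return the Identifiers.org URL that redirects to the primary data provider."""
    return f"https://identifiers.org/{compact_identifier(prefix, accession)}"


if __name__ == "__main__":
    print(resolver_url("uniprot", "P12345"))
```

Linking through a registry rather than directly to a provider means the link survives if the provider changes its own URL scheme.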

Increasing data integration also makes recording and providing data license information more important. We will improve InterMine's capabilities in this area and advocate its importance.

As a long-lived professionally developed open source project, we will do all this following best practices. We will also make it usable by writing documentation, put it in front of people by presenting papers and running workshops, and adapt our plans in response to community feedback and contributions.

Planned Impact

(Please see the Academic Beneficiaries section to see how the academic community will benefit from this proposal. Below we outline potential Economic and Societal benefits)

Agritech, biotech and pharmaceutical companies that depend extensively on the academic community for tools and access to data will benefit by being able to find and access data more easily, and subsequently to integrate and reuse it, so increasing their efficiency with consequent economic benefits.

Google, Yahoo, Yandex, Microsoft and other companies involved in schema.org will benefit from our involvement in bioschemas.org: this proposal will yield a compelling and high value use case for the Bioschemas community through thousands of end-users accessing the terabytes of data provided through InterMine databases.

Funders, such as the Research Councils, will benefit from the increased impact of the projects they support. Data will be made FAIR more easily, and so data generated from grants will be re-used more and thus will provide better value for taxpayers' money. Similarly by increasing the availability of FAIR data, the overall research effort will be made more efficient.

Local schools and school trips visiting Cambridge for possible entry, together with the general public through outreach activities, will benefit from greater awareness of the importance of big data to modern bioscience, and how this can benefit the economy and society.

Funded Value:

£597,726

Funded Period:

Jul 17 - Mar 21

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/P024335/1

Principal Investigator:

Gos Micklem

Research Subject:

Tools, technologies & methods (96%)

Research Topic:

Bioinformatics (32%)

Tools for the biosciences (64%)

Organisations

University of Cambridge (Lead Research Organisation)

People	ORCID iD
Gos Micklem (Principal Investigator)


Key Findings
Impact Summary
Engagement Activities


Description	1. We have adapted InterMine so that all data objects are now identified by unique and stable URIs, based on the InterMine class names combined with local IDs provided by the data resource providers. This is important to allow data to be found consistently between database releases.
2. We have applied ontologies to the InterMine data model, including the EDAM Ontology, the Semanticscience Integrated Ontology, the National Cancer Institute Thesaurus and Dublin Core. These changes mean that data are labelled and organised more consistently with established standards, which improves searching of the data.
3. Bioschemas.org is working to improve the descriptive data (metadata) provided by web pages in order to improve internet search. We have added JSON-LD structured data, using the DataSet, BioChemEntity, Gene and Protein types defined by Bioschemas.org, to make data more findable.
4. We have improved the way that external data sources are referenced by making use of Identifiers.org, improving the ability of users to understand the origins of the data they are using.
5. It is becoming increasingly important to use data only in a way that fits the licence under which the data were released. To facilitate this, we updated the InterMine core model and data parsers to record data licences and to display them alongside the dataset.
6. We have registered InterMine databases in community bioinformatics systems such as bio.tools and FAIRsharing.org, improving the ease with which they can be found.
7. Results tables can now be downloaded in RDF or N-Triples formats, in addition to the various existing formats, using the Export functionality.
8. The R2RML mapping language has been applied so that it is now possible to dynamically generate a mapping from the data in any InterMine database to data in RDF format. As a result, InterMine administrators can now provide SPARQL endpoints for their InterMine databases using the third-party Ontop software. We have set up a FlyMine SPARQL endpoint to demonstrate this functionality.
9. We have generated a simple InterMine ontology which describes the classes and attributes defined in the InterMine core model. This is used where there are no appropriate terms from existing popular ontologies (e.g. Sequence Ontology, Semantic Science, EDAM, MeSH, Dublin Core) and provides actionable and unique URIs.
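
The JSON-LD structured data mentioned in the findings above can be sketched as follows. This is a minimal illustration, assuming a small set of properties (`identifier`, `name`, `url`) that has not been checked against any particular Bioschemas profile version; the function names and the example URL are hypothetical.

```python
import json


def gene_jsonld(identifier: str, name: str, url: str) -> dict:
    """Describe a gene with schema.org vocabulary (property choice is an assumption)."""
    return {
        "@context": "https://schema.org",
        "@type": "Gene",
        "identifier": identifier,
        "name": name,
        "url": url,
    }


def as_script_tag(markup: dict) -> str:
    """Wrap the markup in the script element that search engines read from a page."""
    # Escape "</" so that a value can never close the script element early;
    # "\/" is a valid JSON escape for "/", so the payload still parses.
    body = json.dumps(markup, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    markup = gene_jsonld(
        "FBgn0000606",  # example FlyBase-style identifier
        "example gene",
        "https://example.org/mymine/gene/FBgn0000606",  # hypothetical URL
    )
    print(as_script_tag(markup))
```

Embedding the block in the report page for each object lets a crawler pick up the identifier and name without parsing the visible HTML.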
Exploitation Route	The outcomes of this funding will benefit the growing number of users of the InterMine platform worldwide, both in academia and industry. The changes made through this funding mean that the platform now conforms well to the FAIR data principles, meaning that data presented through the InterMine platform can be Found, Accessed, Interoperated with and Re-used. This will benefit the many thousands of InterMine database end users around the world. Through presentation of our work we have promoted the benefits of complying with the FAIR principles and set an example of the benefits that this brings.
Sectors	Agriculture, Food and Drink, Digital/Communication/Information Technologies (including Software), Manufacturing, including Industrial Biotechnology, Pharmaceuticals and Medical Biotechnology, Other
URL	http://intermine.org/


Description	This grant funded adaptations to the InterMine platform to improve the degree to which it complies with the FAIR data principles. The InterMine platform is in use by the biotechnology company STORM Therapeutics Ltd. Because the platform now provides consistent URIs to identify database objects, STORM Therapeutics has been able to share experimental data between the different interested departments and run comparisons over it. The work has also allowed concepts from widely known ontologies such as EDAM to be used for more fine-grained queries, with a vocabulary that the STORM InterMine users were familiar with.
First Year Of Impact	2020
Sector	Pharmaceuticals and Medical Biotechnology
Impact Types	Economic


Description	BOSC training workshop
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	InterMine workshop and Birds of a Feather meeting at the BOSC conference
Year(s) Of Engagement Activity	2018


Description	Big Biology Day
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	Big Biology Day (BBD) is an annual one-day science festival at Hills Road Sixth Form College, Cambridge, and part of National Biology Week. BBD celebrates the life sciences and engages the public with an array of biological topics with lots of hands-on activity and a career fair.
Year(s) Of Engagement Activity	2019


Description	Biohackathon 2020
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	In the biohackathon Ontop and R2RML were used to provide the ability to use SPARQL to query InterMine databases.
Year(s) Of Engagement Activity	2020


Description	Bioinformatics Community Conference 2020
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Training workshop on how to handle integrated biological data using Python, Jupyter, and InterMine
Year(s) Of Engagement Activity	2020


Description	Cambridge Science Festival
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Public/other audiences
Results and Impact	Puzzle-based game to engage the general public with the issues and challenges of data.
Year(s) Of Engagement Activity	2018


Description	ECCB Conference webinar
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	An introductory workshop on analysing biological data using the InterMine platform
Year(s) Of Engagement Activity	2020


Description	ELIXIR Open Day
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The ELIXIR Open Day was held to promote the achievements of the pan-European data organisation ELIXIR. InterMine is an ELIXIR Recommended Interoperability Resource.
Year(s) Of Engagement Activity	2020


Description	ELIXIR SME and Innovation Event
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Industry/Business
Results and Impact	To raise awareness of our activities within ELIXIR-UK among SMEs
Year(s) Of Engagement Activity	2018


Description	F.A.I.R. blog post
Form Of Engagement Activity	Engagement focused website, blog or social media channel
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Blog post to increase awareness of the importance of F.A.I.R. data practices
Year(s) Of Engagement Activity	2018


Description	InterMine training workshop
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	An introductory workshop on analysing biological data using the InterMine platform
Year(s) Of Engagement Activity	2020


Description	Science festival, Department of Genetics
Form Of Engagement Activity	Participation in an open day or visit at my research institution
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Public/other audiences
Results and Impact	The Genetics Department held outreach events as part of the Cambridge Science Festival
Year(s) Of Engagement Activity	2019