Linking data with Identifiers.org

Lead Research Organisation: European Bioinformatics Institute
Department Name: Computational Neurobiology Group

Abstract

Annotating life science datasets with cross-references to other sources of knowledge has always been very important. These metadata are often what separate valuable information from heaps of unusable data. With the advent of systems biology, the size and complexity of datasets shifted the balance from direct human interaction to automated computer processing. Such operations are greatly facilitated if the metadata are encoded following standard procedures and using controlled vocabularies. If those procedures and vocabularies are shared between different types of data, it becomes possible to align, compare and integrate different datasets. A key part of any cross-reference is the identifier of the resource it points to. This identifier must be unique, perennial, resolvable and free. Most data providers create identifiers for their own records; for example, '9606' identifies 'Homo sapiens' in the Taxonomy, and '22140103' identifies the latest publication about Identifiers.org in PubMed. However, those identifiers are only unique within a given dataset, so their usefulness is limited when considering records in a wider context.

Identifiers.org provides globally unique identifiers and resolves them to the relevant dataset. To achieve this, it uses the information recorded in the MIRIAM Registry (http://www.ebi.ac.uk/miriam/). Both projects therefore provide a distinct part of the final technical solution. Identifiers generated with the Registry make use of the accession numbers supplied by data providers, but also contain information about the collection they come from. All identifiers are unique, resolvable and robust. They allow people or software tools to directly access the identified pieces of data on the web, via alternative providers. Although a prototype, Identifiers.org has been adopted by a number of communities and projects, as it fulfils their need for perennial cross-references and removes their previous need to maintain and keep up to date long lists of ever-changing web links (or URLs).
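
As a minimal sketch of the scheme described above, the Python snippet below composes a global identifier from a collection name and a provider's accession number, and splits such an identifier back into its two parts. The URI pattern http://identifiers.org/<collection>/<accession> and the collection names "taxonomy" and "pubmed" are assumptions made for illustration, and the function names are hypothetical rather than part of any Identifiers.org API.

```python
from urllib.parse import quote, unquote, urlsplit

# Assumed base of the global identifiers (illustrative, not taken from the Registry).
BASE = "http://identifiers.org"


def make_uri(collection: str, accession: str) -> str:
    """Combine a collection name and a provider accession into one URI."""
    return f"{BASE}/{quote(collection, safe='')}/{quote(accession, safe=':')}"


def split_uri(uri: str) -> tuple[str, str]:
    """Recover (collection, accession) from a URI built by make_uri."""
    path = urlsplit(uri).path.lstrip("/")
    collection, _, accession = path.partition("/")
    if not collection or not accession:
        raise ValueError(f"not a <collection>/<accession> URI: {uri!r}")
    return unquote(collection), unquote(accession)


if __name__ == "__main__":
    # The two accession numbers quoted in the abstract.
    print(make_uri("taxonomy", "9606"))
    print(make_uri("pubmed", "22140103"))
```

The accession number alone is ambiguous, while the pair of collection and accession is not, which is why the sketch keeps both parts in the URI path.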

As more and more communities realise the benefits of using Identifiers.org URIs, new needs and use cases have appeared. This proposal seeks to strengthen and extend the services provided by the resource in order to respond to those new user requests. We will make the resource easier to use in automated procedures, especially for semantic web applications. This involves providing the content of the Registry in Resource Description Framework (RDF) format and supplying a SPARQL endpoint for query purposes. Users will be able to fine-tune the way identifiers are resolved by creating 'profiles' that record their preferences. The resource will allow the communities (in particular the data providers themselves) to get involved in the maintenance of the Registry. This will take place via a system of "ownership" by data providers of their record in the Registry. Although we currently have automatic systems in place to detect obsolete information, having the actual data providers contribute to the maintenance would ensure better quality of the recorded information, and therefore better quality of the services provided. Finally, we will improve and extend the underlying computing infrastructure. By deploying it in more data centres, we will provide more reliable services to an ever-growing number of users.

The resulting resource will provide a way to seamlessly link all data annotated with the same URI to represent the same concept, a key step towards data integration. By providing a semantic glue between those datasets, Identifiers.org will facilitate data retrieval, comparison, integration, locally or through the semantic web. It will also facilitate the reasoning on the integrated datasets and lead to new, possibly automated discovery in the biomedical domain.

Technical Summary

Providing annotated life science data (which include cross-references to other sources of knowledge) has always been very important. For both human and tool consumers, the key requirement for such metadata is globally unique, perennial, resolvable and free identifiers. The MIRIAM Registry and Identifiers.org are tools providing such identifiers in the form of HTTP URIs. The aim of this proposal is to address our users' most important requests, in order to transform a prototype system into a fully featured and reliable service to the community.

We will fully support semantic web applications by upgrading the services to follow the standards of this area, namely RDF and SPARQL. By making the data of the Registry available in RDF and supplying a SPARQL endpoint, users will be able to efficiently integrate data coming from multiple sources, even if they do not use the same URIs for identifying the same concepts.

Moreover, the services currently resolving Identifiers.org URLs only lead to HTML pages. This is very useful for users but needs to be extended for tools. It will require listing, for each physical location (or resource), the different formats it can provide. By doing so, tools only able to handle a given format could still traverse the various linked datasets. The system will also provide customised services, by allowing users to create 'profiles'. Those will record, for example, their resolving and format preferences. Web services and data export will make use of this information to provide targeted services.

Also, to sustainably maintain a Registry that is growing at an increasing rate, we will involve the community, and in particular the data providers, in updating their own records.

Finally, to provide more reliable services, we will migrate the infrastructure to new, fully redundant data centres, with the aim of ultimately providing a framework deployable on more mirrors and possibly cloud infrastructures.

Planned Impact

The new developments of Identifiers.org infrastructure presented in this project will have several major impacts.

Firstly, this will extend the services provided as well as improve their usability. It will thus become easier and more convenient for anybody to use the system. Any data provider having to record cross-references (that is, most of them) will benefit greatly from it. Having such a central resource prevents the duplication of effort when maintaining up-to-date lists of URLs. Moreover, it reduces the amount of code necessary for handling identifiers: the same identifier can be stored in the database and directly used on the user interface. This removes the need to convert internal identifiers into resolvable URLs. From the data providers' point of view, the whole cross-reference handling process will become much easier, more efficient and more reliable.

Moreover, the proposed developments on the front of semantic web technologies will directly benefit providers of open and linked data. These can be the primary data providers (such as UniProt or Ensembl) or secondary providers (such as Bio2RDF or Pathway Commons), some of them already using MIRIAM URIs. Those efforts will gain much from using the same standardised URIs. This will make processes such as validation or integration easier, as entities in different datasets can be linked together, with Identifiers.org URIs providing a key element for linking disparate pieces of information. The discovery and reasoning capabilities added by the planned developments related to RDF and SPARQL will also improve those aspects.

Users' specific needs will also be catered for by the provision of customised services. For instance, given projects and consortia will be able to utilise only the subset of the Registry they need, and to make use of their own preferences.

Finally, as the system better fulfils users' needs (especially as those new features come directly from user requests), it should become more useful and attractive, since specific use cases are catered for. It is therefore expected that the community of users of the Identifiers.org URIs will expand. This will create a feedback loop: as more data providers use those unique URIs, the benefits of the system will increase and be strengthened, making more and more datasets interoperable.

All the planned developments will be available freely for both academic and commercial users. This is especially important as numerous companies nowadays provide services based on the distribution, analysis or integration of publicly available datasets.

Overall, this project should strengthen the economic competitiveness of the United Kingdom, helping it to be at the forefront of linked data technologies and provision, catering for the next challenge in this domain: the semantic web.

Publications

Chelliah V (2015) BioModels: ten-year anniversary. in Nucleic acids research

Juty N (2015) BioModels: Content, Features, Functionality, and Use. in CPT: pharmacometrics & systems pharmacology

Wimalaratne SM (2014) BioModels linked dataset. in BMC systems biology

Wimalaratne SM (2015) SPARQL-enabled identifier conversion with Identifiers.org. in Bioinformatics (Oxford, England)

 
Description This grant provided the means to further develop Identifiers.org's infrastructure and services, in order to provide the features that users have requested and to make the resource more reliable.

The existing services have been consolidated by migrating them to the EBI London data centre, which provides a redundant hosting infrastructure.

A SPARQL endpoint was created to allow users to easily integrate heterogeneous datasets from multiple sources, using semantic web technologies. This service relies on the Registry being able to record alternative URI schemes, and performs URI scheme conversion so that users do not need to worry about the type of URIs used in each dataset.
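
The idea behind URI scheme conversion can be sketched offline in a few lines of Python. The snippet does not call the SPARQL endpoint; it only illustrates how recording several URI prefixes per collection makes it possible to rewrite a URI from one scheme into its equivalents. The table ALTERNATIVE_PREFIXES, the prefixes listed in it and the function name are assumptions chosen for this example, not content taken from the Registry.

```python
# Illustrative table: for each collection, URI prefixes treated as equivalent.
# These prefixes are examples for this sketch, not actual Registry records.
ALTERNATIVE_PREFIXES = {
    "uniprot": [
        "http://identifiers.org/uniprot/",
        "http://purl.uniprot.org/uniprot/",
        "http://bio2rdf.org/uniprot:",
    ],
}


def equivalent_uris(uri: str) -> list[str]:
    """Return the other known forms of `uri`, or [] if its scheme is unknown."""
    for prefixes in ALTERNATIVE_PREFIXES.values():
        for prefix in prefixes:
            if uri.startswith(prefix):
                accession = uri[len(prefix):]
                # Re-attach the accession to every other recorded prefix.
                return [p + accession for p in prefixes if p != prefix]
    return []


if __name__ == "__main__":
    for alt in equivalent_uris("http://purl.uniprot.org/uniprot/P12345"):
        print(alt)
```

With such a mapping, two datasets that identify the same record with different URI schemes can be joined on the converted URIs without either dataset being rewritten.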

Previously, Identifiers.org URIs only resolved to HTML pages. While this is sufficient in most cases, it does not take into consideration that a lot of resources do provide their information in a wide range of formats. Therefore, support for additional formats has been added. This allows users to get direct access to the information they need in their preferred format (availability depends on data provider's offering). Automated services can also be built on top of that new feature, such as semantic web applications relying solely on RDF encoded data.

Customised services were developed to build on a long-standing strength of the system, namely resolving to potentially multiple locations on the web, which nevertheless annoyed some users who wanted to use a specific resource for a given type of data. These services take the form of "profiles", which allow anyone to customise the resolving of Identifiers.org URIs. This can be useful, for example, to databases that want to display their cross-references and have preferences for which resources or formats to point to.

Finally, to ease the curation effort on our team and make the resource more sustainable in the long term, maintenance of the Registry has been opened to the community. People are now able to log in to the Registry (using an existing account with, for example, Google or Yahoo) and update records to which they have been granted access. This facility is aimed at data providers, but can be used by keen users too.
Exploitation Route Identifiers.org is a tool which can be useful to anyone needing to store or handle cross-references, as it simplifies the tasks previously necessary. Identifiers.org URIs are so versatile that they can be used unmodified in backend databases, in import/export formats, in web-based user interfaces and as a way to access the identified data.

We expect that more data providers will adopt the system and retire their ad hoc mechanism as the advantages are numerous, such as decreasing the need for each resource to maintain this necessary, but not core, aspect of their data provision.

Additionally, the new features specifically targeting the semantic web community (such as the SPARQL endpoint for URI scheme conversion) open up possibilities for integrating heterogeneous datasets.

Finally, although the focus has been so far on the life sciences, the system is perfectly suited to any type of data accessible on the web.

Update 02/2020: Identifiers.org is now a cloud-deployed Elixir Recommended Interoperability Resource. In 2019, an average of 10,000 unique hosts per month generated more than 1 million requests per month.
Sectors Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Energy,Environment,Healthcare,Pharmaceuticals and Medical Biotechnology

URL http://identifiers.org
 
Description Identifiers.org is a tool for data providers with cross-referencing needs and researchers wanting to explore large datasets to extract new findings. Therefore the societal impact is mainly indirect. The same applies to the economic impact: this should be mostly visible to funding agencies (like the BBSRC) if other resources rely on Identifiers.org and do not require support to develop a similar system (as a lot of resources have such needs). Update 02/2020: Identifiers.org is now a cloud-deployed Elixir Recommended Interoperability Resource. In 2019, an average of 10,000 unique hosts per month generated more than 1 million requests per month.
First Year Of Impact 2016
Sector Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Description EU FREYA
Amount £4,998,650 (GBP)
Funding ID 777523 
Organisation European Commission H2020 
Sector Public
Country Belgium
Start 12/2017 
End 11/2020
 
Title Identifiers.org 
Description Identifiers.org is a system providing resolvable persistent URIs used to identify data for the scientific community, with a current focus on the Life Sciences domain. 
Type Of Material Database/Collection of data 
Year Produced 2011 
Provided To Others? Yes  
Impact Database operators, in particular in the domain of systems biology, are using identifiers.org for stable, reliable references to external databases, to overcome the challenge of URLs for specific resources often being unstable and changing. 
URL http://identifiers.org
 
Title Identifiers.org SPARQL endpoint 
Description Identifiers.org SPARQL endpoint allows the conversion of URI schemes. 
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact This is particularly useful when querying (using Semantic Web technologies) multiple heterogeneous datasets making use of different types of URIs. 
URL http://identifiers.org/services/sparql
 
Description BioMedBridges Workshop 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The aims of the workshop were to explore several aspects of standardisation, primarily defining entity identifiers and identifier best practice, as well as developing a standards registry which will document the standards in use within the biomedical sciences research infrastructures, facilitate data integration by matching data elements, and integrate other resources within BioMedBridges and externally.

Presentations emphasized the importance of standards for biological data sharing and outlined the current standards landscape. Gap analyses were carried out in group sessions to identify common challenges faced in each field and how working solutions could be found through collaboration. A best practice document for identifiers is intended to be published later in the year. Based on the discussions, the standards registry developed within BioMedBridges will be integrated with existing resources such as BioSharing.org, identifiers.org, the EDAM ontology and the BioMedBridges Service Registry. Information on key standards used in the biomedical sciences research infrastructure domains will be included as a priority and regular community input will be sought to keep the information up to date.
Year(s) Of Engagement Activity 2014
URL http://www.biomedbridges.eu/news/standards-and-data-harmonization-prerequisites-data-integration
 
Description Biohackathon 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Improvement to Identifiers.org services, discussion with users of their needs.

Several users of Identifiers.org were present and provided valuable feedback regarding the services, which was then fed into our development roadmap.
Year(s) Of Engagement Activity 2013,2014
URL http://biohackathon.org/
 
Description Career Q&A 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact This career Q&A with year 10 students was carried out virtually for the local college, and it is hoped that it will encourage more students to think about entering not only science but also the field of bioinformatics.
Year(s) Of Engagement Activity 2020
 
Description ELIXIR Workshop: A common vocabulary to classify resources in the life science domain 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Kick-start agreement on a common topics vocabulary to classify resources in the life sciences domain, including databases, tools, courses, training materials, meetings, jobs and publications.

Discussed use cases and how a common vocabulary will make an impact in the community.
Year(s) Of Engagement Activity 2014
URL http://www.elixir-europe.org/events/workshop-common-vocabulary-classify-resources-life-science-domai...
 
Description International Symposium on Integrative Bioinformatics 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation of the BioModels and Reactome linked datasets and their usage of Identifiers.org URIs.

Raised the awareness of the audience.
Year(s) Of Engagement Activity 2014
URL http://www.imbio.de/ib2014/
 
Description RDF summit: Identifiers.org: new developments 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Discussed and collected feedback and user requirements for new Identifiers.org developments.

Several missing but useful features were identified.
Year(s) Of Engagement Activity 2014