From data to knowledge / the ONDEX System for integrating Life Sciences data sources

Lead Research Organisation: Rothamsted Research
Department Name: Computational & Systems Biology

Abstract

The biological sciences generate many different types of data from different specialist disciplines (e.g. genetics, biochemistry, molecular biology). Bringing data together coherently is a major undertaking in any systems biology project. While new databases of biological thesauri and classification systems (ontologies) for the component parts of biology make it easier to link specialist databases, this only solves part of the problem of data integration for systems biologists, who need a much richer body of information. For example, there are many different ways that biological components can be related (e.g. by function, location, size) which need to be captured, and information about the provenance (history or source) of data can be important when it is interpreted. New types of information are also important in systems biology, including descriptions of the biological processes and pathways for metabolism and information flow. Many of these have been created by extracting information from the scientific literature to form the basis for the predictive dynamic models and simulations of system function. Because systems biology has a need for complex data integration and scientific text mining that is not met by readily available bioinformatics software in the biological research community, a prototype system (ONDEX) has been developed by Rothamsted Research. This project will combine ONDEX with leading technologies in workflow, graph analysis and text mining, to develop a powerful and professional tool that will underpin systems biology research. Three systems biology research projects, run by our BBSRC-funded systems biology centre partners, will drive the development of ONDEX and will validate new features on real scientific problems. Biological areas addressed cover: bioenergy crops; yeast metabolome models; and telomere function in ageing.
The research partners bring important technical expertise that will enhance ONDEX with new capabilities known to be required by systems biologists at their centres. These include:
* Extensions to methods that map data into ONDEX to broaden the range of data that can be integrated and capture more of the information about it (the metadata).
* State of the art text mining capabilities, for extracting biological concepts and relationships from online text to enable new data buried in the scientific literature to be extracted and structured into models and databases.
* Extensions to handle the statistical uncertainty inherent in many biological relationships, to enable new relationships to be identified in the integrated datasets using modern statistical inference techniques.
* Enhanced graphical visualisations of the complex network of relationships to accommodate new information and scale to huge data networks, to enable a better understanding of new interactions, and better ways of interrogating the data in a richly integrated dataset.
* Exploitation of the latest in distributed computing techniques and scientific workflows to simplify, automate and scale the complex task of integration.
* Extended range of data interfaces relevant to both programmers and users to enable shared access over the Internet of the integrated datasets, which are important information resources in their own right.
A number of actions and engineering developments will make ONDEX easier to use by biologists and support uptake in new areas of systems biology. These include new training resources, workshops for users and developers and providing direct help for new applications through an outreach programme. At the end of the project ONDEX will be delivered in a well-engineered and robust form to existing and new users that will be more readily used by a greatly expanded user and developer community that should make it sustainable in the long term as an open software project.

Technical Summary

The current ONDEX system enables data from diverse biological data sets to be linked, integrated and visualised through graph analysis techniques. It uses a semantically rich Core data structure based on graphs, has explicit support for workflow and has the ability to bring together information from structured databases and unstructured sources such as sequence data and free text. Extensions for Systems Biology include:
Enhancing the ONDEX Core:
- Methods to map data into the core data structures to exploit synteny and sequence similarity for applications needing comparative analysis of genetic and genomic organisation of multiple organisms.
- Techniques for probabilistic interpretation of relations, allowing uncertainty in the integrated data and in biological relationships to be modelled, and combining relations using probabilistic models such as naive Bayesian and Bayesian graphical Gaussian approaches.
Exploiting the ONDEX data graph:
- A graph structure analysis toolkit, using standard and advanced graph analysis algorithms, that traverses the data graph and allows modules representing common structural and functional components to be identified.
Populating the ONDEX model:
- Orchestrating data integration and analysis steps in ONDEX applications, using Taverna workflows and services (myGrid), including the running of workflows. Using Taverna will allow ONDEX to retain data on workflow provenance, which can be used to track, verify and validate data.
- Enhanced text mining methods to extract and map terms from text in databases and online literature sources, to detect synonymy and ambiguity, and to identify and extract biologically relevant relations.
Exposing ONDEX to tools:
- New data access interfaces to allow ONDEX data to be used by third party tools, e.g. within workflows, and data export tools to provide easy access to ONDEX data for users of Cytoscape and for export in standard systems biology model exchange formats (e.g. SBML, BioPAX etc).
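
As an illustration of the naive Bayesian combination of relations mentioned above, the following minimal Python sketch combines independent pieces of evidence for one candidate relation. It is not ONDEX code: the function name `combine_evidence`, the prior and the likelihood ratios are hypothetical, and the sketch assumes each data source contributes a likelihood ratio that is conditionally independent of the others.

```python
import math


def combine_evidence(prior, likelihood_ratios):
    """Naive Bayes combination of independent evidence for one relation.

    prior             -- assumed prior probability that the relation is true (0 < prior < 1)
    likelihood_ratios -- one ratio P(evidence | true) / P(evidence | false)
                         per data source, assumed conditionally independent
    Returns the posterior probability that the relation is true.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    # Work in log-odds so that many small ratios do not underflow.
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        if ratio <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(ratio)
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Hypothetical example: a rare relation supported by two sources
    # and weakly contradicted by a third.
    print(combine_evidence(0.1, [9.0, 4.0, 0.5]))
```

Under the independence assumption, a ratio above 1 raises the posterior and a ratio below 1 lowers it, so contradictory sources can cancel each other out.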

Publications

Hassani-Pak K (2016) Developing integrated crop knowledge networks to advance candidate gene discovery. in Applied & translational genomics

Weile J (2012) Bayesian integration of networks without gold standards. in Bioinformatics (Oxford, England)

Mironov V (2012) Gauging triple stores with actual biological data. in BMC bioinformatics

Nawaz R (2013) Negated bio-events: analysis and identification. in BMC bioinformatics

Zappa A (2012) Towards linked open gene mutations data. in BMC bioinformatics

James K (2012) Is newer better?--evaluating the effects of data curation on integrated analyses in Saccharomyces cerevisiae. in Integrative biology : quantitative biosciences from nano to macro

Balaur I (2017) EpiGeNet: A Graph Database of Interdependencies Between Genetic and Epigenetic Events in Colorectal Cancer. in Journal of computational biology : a journal of computational molecular cell biology

Pesch R (2008) Graph-based sequence annotation using a data integration approach. in Journal of integrative bioinformatics

Lesk V (2011) WIBL: Workbench for Integrative Biological Learning. in Journal of integrative bioinformatics

 
Title Lightdrawings of networks by Hugo Dalton 
Description When science meets art, or new meets old. KnetMiner has inspired the artist Hugo Dalton. He has created lightdrawings of our networks and has projected them on old sculptures to bring them back to life. 
Type Of Art Artwork 
Year Produced 2018 
Impact Our work has been displayed for 3 months to visitors of the Fitzwilliam Museum. 
URL https://www.instagram.com/p/BflFDaHlwdv/
 
Title Video Introducing KnetMiner 
Description A short 90-sec clip introducing KnetMiner to the general public. What is it? Who developed it? Who uses it? 
Type Of Art Film/Video/Animation 
Year Produced 2017 
Impact Video inspires general public, students and scientists visiting the KnetMiner website. 
URL https://www.youtube.com/watch?v=4aOv5QXqvLI
 
Description The development of a full-feature general data integration software platform (Ondex) for the life sciences with associated websites, documentation and training materials. A range of demonstrator projects and publications which illustrate the benefits of data integration.

The development of an integrated knowledgebase of data relating to the genetics and genomics of the Poplar tree which can be accessed using a geneticist-friendly web-based user interface to support research and breeding that will improve the sustainability of willow trees as a second generation bioenergy crop.

The development of new and general methods of data integration and visualisation which have been incorporated into the Ondex system that support the selection and evaluation of functional candidate genes from combined genetic and genomic studies of complex traits.
Exploitation Route The Ondex software is entirely general in its application domains and could potentially be relevant to any problem needing data integration technology. The most obvious commercial sectors are the life science communities undertaking research using multi-omics and systems biology approaches.
Sectors Agriculture, Food and Drink,Education,Environment,Pharmaceuticals and Medical Biotechnology

URL http://www.ondex.org
 
Description The Ondex software has been downloaded from the project website by 900 users since the start of this project. It is unclear how many have used it and for what purposes. The major users have been the named collaborators and follow-on projects with Syngenta and at Newcastle University.
First Year Of Impact 2011
Sector Agriculture, Food and Drink
Impact Types Economic

 
Description Bioinformatics to advance wheat breeding
Geographic Reach Europe 
Policy Influence Type Influenced training of practitioners or researchers
Impact Trained researchers to use evidence-based practices in biological decision making.
URL http://www.wheatinitiative.org/events/durum-ewg-workshop-bioinformatics-advance-wheat-breeding
 
Description A FAIR community resource for pathogens, hosts and their interactions to enhance global food security and human health
Amount £557,820 (GBP)
Funding ID BB/S020020/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 06/2019 
End 06/2022
 
Description Accelerating Discovery by Mining and Visualising Integrated Chemogenomics Data
Amount £199,359 (GBP)
Funding ID 100938 
Organisation Innovate UK 
Sector Public
Country United Kingdom
Start 11/2011 
End 03/2013
 
Description Finding Value in Complex Biological Data - Integrated 'omics FS
Amount £22,973 (GBP)
Organisation Innovate UK 
Sector Public
Country United Kingdom
Start 04/2016 
End 10/2017
 
Description QTLNetMiner Project Funding
Amount £125,356 (GBP)
Funding ID BB/I023860/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 12/2011 
End 02/2013
 
Description What determines protein abundance in plants?
Amount £3,354,456 (GBP)
Funding ID BB/T002182/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 11/2019 
End 10/2024
 
Title AraKNET Release 42 - Feb 2019 
Description Integrated database of Arabidopsis genome, genotype, phenotype, omics and homology information. Available through www.knetminer.org or as RDF and Neo4j graph databases. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
Impact It takes away the pain of connecting data from multiple sources and finding useful clues in data. A task that can take biologists weeks or months can be done in a few minutes using the resource. 
URL http://knetminer.rothamsted.ac.uk/Arabidopsis_thaliana/
 
Title AraKNET Release 45 - Oct 2019 
Description Release 45 of the Arabidopsis Knowledge Graph in OXL, RDF and Neo4j formats 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
Impact FAIR data. 
 
Title Ondex Integrated Plant Genomics Databases 
Description The Ondex project produced a portfolio of pre-integrated datasets for use in plant genomics research 
Type Of Material Database/Collection of data 
Year Produced 2009 
Provided To Others? Yes  
Impact None 
URL http://www.ondex.org/doc.shtml
 
Title RiceKNET Release 42 - Feb 2019 
Description Integrated database of rice genome, genotype, phenotype, omics and homology information. Available through www.knetminer.org or as RDF and Neo4j graph databases. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
Impact It takes away the pain of connecting data from multiple sources and finding useful clues in data. A task that can take biologists weeks or months can be done in a few minutes using the resource. 
URL http://knetminer.rothamsted.ac.uk/Oryza_sativa/
 
Title Wheat Knowledge Network - Release Nov 2017 
Description Integrated database of wheat genome, genotype, phenotype and homology information (Hassani-Pak et al, 2016) 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
Impact This database powers the KnetMiner application 
URL http://knetminer.rothamsted.ac.uk/
 
Title Wheat pathogens knowledge network - Release Nov 2017 
Description Knowledge networks of wheat pathogens Fusarium and Zymoseptoria 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
Impact Will help to understand wheat diseases 
URL http://knetminer.rothamsted.ac.uk/
 
Title WheatKNET Release 42 - Feb 2019 
Description Integrated database of wheat genome, genotype, phenotype, omics and homology information. Available through knetminer.org or as RDF and Neo4j graph databases. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
Impact It takes away the pain of connecting data from multiple sources and finding useful clues in data. A task that can take biologists weeks or months can be done in a few minutes using the resource. 
URL http://knetminer.rothamsted.ac.uk/Triticum_aestivum/
 
Title WheatKNET Release 45 - Oct 2019 
Description Release 45 of the wheat knowledge graph in OXL, RDF and Neo4j formats 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
Impact Used by over 1000 users through the KnetMiner UI and API 
URL http://knetminer-data.cyverseuk.org/lodestar/
 
Description GeneStack 
Organisation Genestack
Country United Kingdom 
Sector Private 
PI Contribution We are collaborating in an Innovate UK funded Feasibility Study to translate our bioinformatics software and methods to a commercial cloud-based software platform and are investigating future commercial licensing.
Collaborator Contribution They are providing the expertise to integrate our software into their platform
Impact Too soon to report outputs
Start Year 2016
 
Description Participation in Syngenta UIC at Imperial College 
Organisation Imperial College London
Department Department of Computing
Country United Kingdom 
Sector Academic/University 
PI Contribution The Rothamsted Ondex team have participated in the University Innovation Centre funded by Syngenta at Imperial College, London. The participation was largely technical, providing support in the use of the Ondex platform for the research there in machine learning applications.
Start Year 2009
 
Title KnetMiner v1.0 
Description Helps users to analyse biological experiments and put findings into the context of published knowledge. Follow the link to see the release notes. 
Type Of Technology Webtool/Application 
Year Produced 2018 
Open Source License? Yes  
Impact Accelerates gene discovery and plant breeding. 
URL http://knetminer.rothamsted.ac.uk/
 
Title KnetMiner v3.1 
Description Minor improvements and bug fixes 
Type Of Technology Webtool/Application 
Year Produced 2019 
Open Source License? Yes  
Impact Benefits KnetMiner end users and developers 
URL https://github.com/Rothamsted/knetminer/releases/tag/v3.1
 
Title KnetMiner v3.2 
Description Minor improvements and bug fixes 
Type Of Technology Webtool/Application 
Year Produced 2019 
Open Source License? Yes  
Impact Benefits to KnetMiner end users and developers 
URL https://github.com/Rothamsted/knetminer/releases/tag/v3.2
 
Title Ondex Suite 
Description Data integration using semantic integration methods for lifesciences research and systems biology 
Type Of Technology Software 
Year Produced 2010 
Open Source License? Yes  
Impact It led to further research projects and enabled a collaboration with Syngenta 
URL http://www.ondex.org
 
Title Ondex Web 
Description A web based biological network visualisation tool. 
Type Of Technology Webtool/Application 
Year Produced 2012 
Impact It was used in a collaboration with Syngenta 
URL http://www.ondex.org/projects.shtml#ondexweb
 
Title Ondex to RDF Exporter 
Description Ondex components and applications that are necessary for building genome-scale knowledge networks used in projects like KnetMiner. It includes the Ondex base, CLI, workflow engine and a set of plugins (parsers, mappers, transformers, filters and exporters) that are relevant for building genome-scale knowledge networks 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact It is part of a suite of tools that help KnetMiner networks, including those developed for wheat, to be shared through linked open data methods. 
URL https://github.com/Rothamsted/ondex-knet-builder/tree/master/modules/rdf-export-2
 
Title Ondex-Knet-Builder v2.1 
Description Command line based workflow engine for building knowledge graphs in OXL and RDF format. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Improvements to several data loaders. 
URL https://github.com/Rothamsted/ondex-knet-builder/releases/tag/v2.1
 
Title QTLNetMiner 
Description QTLNetMiner is a user-friendly web application that can interrogate plant and animal knowledge networks and be used to show candidate genes and QTL associated with given input terms (e.g. early flowering, disease resistance). The relevance of a gene to particular query terms is weighted using information retrieval and network inference methods. The supporting evidence networks for selected candidate genes are visualized in the Ondex Web Java-applet. QTLNetMiner is designed in a generic way and can be created for any organism with an integrated Ondex knowledge network 
Type Of Technology Webtool/Application 
Year Produced 2013 
Impact None to date 
URL https://ondex.rothamsted.ac.uk/QTLNetMiner/
 
Title RDF to Neo4j converter 
Description RDF-Neo4j converter and config to load KnetMiner data 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Exports our biological knowledge networks into formats that can be more easily re-used. 
 
Title Wheat KnetMiner - Release Nov 2017 
Description Wheat KnetMiner - Release Nov 2017. Added disease related RNA-seq studies and wheat GWAS data. 
Type Of Technology Webtool/Application 
Year Produced 2017 
Open Source License? Yes  
Impact Helps wheat researchers in gene discovery and knowledge visualization. 
URL http://knetminer.rothamsted.ac.uk/Triticum_aestivum/
 
Description Durum EWG Workshop: Bioinformatics to advance wheat breeding 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact A 2-day workshop organised by Roberto Tuberosa and Luigi Cattivelli and attended by 100 wheat breeders, geneticists and researchers to learn about cutting-edge bioinformatics tools and resources available for wheat. The KnetMiner training led to a collaboration with Roberto Tuberosa's lab to identify potential candidate genes in hundreds of wheat QTL using KnetMiner networks and APIs.
Year(s) Of Engagement Activity 2017
URL http://www.wheatinitiative.org/events/durum-ewg-workshop-bioinformatics-advance-wheat-breeding
 
Description EBI Training Workshop - Integrative 'OMICS 2015, 2016 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact We taught data integration and visualisation using Ondex and QTLNetMiner at the European Bioinformatics Institute
Year(s) Of Engagement Activity 2015,2016
URL http://www.ebi.ac.uk/training/events/2015/introduction-integrative-omics
 
Description Introduction to Integrative 'omics 2016 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Training workshop in integrative omics software methods. Features both Ondex and QTLNetMiner
Year(s) Of Engagement Activity 2016
URL http://www.ebi.ac.uk/training/events/2016/introduction-integrative-omics-0
 
Description Ondex Training workshops 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The Ondex SABR project ran a series of outreach and training events for research scientists in academia and industry.
Year(s) Of Engagement Activity 2008,2009,2010
 
Description Press release on KnetMiner software and collaboration with Genestack 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact Press release introducing the KnetMiner software developed in the Hassani-Pak lab at Rothamsted and a recent collaboration to make it available as an App in the Genestack bioinformatics platform. News covered by Rothamsted, Genestack, BBSRC, Aafarmer, Farmbusiness and other websites.
Year(s) Of Engagement Activity 2017
URL https://www.rothamsted.ac.uk/news/visualising-data-connections-promises-faster-discoveries
 
Description Revival Exhibition by Hugo Dalton at Fitzwilliam Museum, Cambridge 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact Hugo Dalton's lightdrawings were inspired by two major innovations, i.e. Omega3 plants and KnetMiner software, from Rothamsted Research. His light projections were on display at the Fitzwilliam Museum in Cambridge from Nov 2017 - Feb 2018.
Year(s) Of Engagement Activity 2017,2018
URL http://www.fitzmuseum.cam.ac.uk/calendar/whatson/hugo-dalton-revival-lightdrawings