From data to knowledge / the ONDEX System for integrating Life Sciences data sources
Lead Research Organisation:
Rothamsted Research
Department Name: Computational & Systems Biology
Abstract
The biological sciences generate many different types of data from different specialist disciplines (e.g. genetics, biochemistry, molecular biology). Bringing data together coherently is a major undertaking in any systems biology project. While new databases of biological thesauri and classification systems (ontologies) for the component parts of biology make it easier to link specialist databases, this only solves part of the problem of data integration for systems biologists, who need a much richer body of information. For example, there are many different ways that biological components can be related (e.g. by function, location, size), which need to be captured, and information about the provenance (history or source) of data can be important when it is interpreted. New types of information are also important in systems biology, including descriptions of the biological processes and pathways for metabolism and information flow. Many of these have been created by extracting information from the scientific literature to form the basis for the predictive dynamic models and simulations of system function. Because systems biology has a need for complex data integration and scientific text mining that is not met by readily available bioinformatics software in the biological research community, a prototype system (ONDEX) has been developed by Rothamsted Research. This project will combine ONDEX with leading technologies in workflow, graph analysis and text mining to develop a powerful and professional tool that will underpin systems biology research. Three systems biology research projects, run by our BBSRC-funded systems biology centre partners, will drive the development of ONDEX and will validate new features on real scientific problems. The biological areas addressed are bioenergy crops, yeast metabolome models, and telomere function in ageing.
The research partners bring important technical expertise that will enhance ONDEX with new capabilities known to be required by systems biologists at their centres. These include:
* Extensions to methods that map data into ONDEX to broaden the range of data that can be integrated and capture more of the information about it (the metadata).
* State of the art text mining capabilities, for extracting biological concepts and relationships from online text to enable new data buried in the scientific literature to be extracted and structured into models and databases.
* Extensions to handle the statistical uncertainty inherent in many biological relationships, to enable new relationships to be identified in the integrated datasets using modern statistical inference techniques.
* Enhanced graphical visualisations of the complex network of relationships to accommodate new information and scale to huge data networks, to enable a better understanding of new interactions, and better ways of interrogating the data in a richly integrated dataset.
* Exploitation of the latest in distributed computing techniques and scientific workflows to simplify, automate and scale the complex task of integration.
* Extended range of data interfaces relevant to both programmers and users to enable shared access over the Internet of the integrated datasets, which are important information resources in their own right.
A number of actions and engineering developments will make ONDEX easier to use by biologists and support uptake in new areas of systems biology. These include new training resources, workshops for users and developers and providing direct help for new applications through an outreach programme. At the end of the project, ONDEX will be delivered in a well-engineered and robust form to existing and new users. It should be more readily usable by a greatly expanded user and developer community, which should make it sustainable in the long term as an open software project.
Technical Summary
The current ONDEX system enables data from diverse biological data sets to be linked, integrated and visualised through graph analysis techniques. It uses a semantically rich Core data structure based on graphs, has explicit support for workflow and has the ability to bring together information from structured databases and unstructured sources such as sequence data and free text. Extensions for Systems Biology include:
Enhancing the ONDEX Core:
- Methods to map data into the core data structures to exploit synteny and sequence similarity for applications needing comparative analysis of genetic and genomic organisation of multiple organisms.
- Techniques for probabilistic interpretation of relations, allowing uncertainty in the integrated data and in biological relationships to be modelled, combining relations using probabilistic models such as naive Bayesian and Bayesian graphical Gaussian approaches.
Exploiting the ONDEX data graph:
- A graph structure analysis toolkit, using standard and advanced graph analysis algorithms, that traverses the data graph and allows modules representing common structural and functional components to be identified.
Populating the ONDEX model:
- Orchestrating data integration and analysis steps in ONDEX applications, using Taverna workflows and services (myGrid), including the running of workflows. Using Taverna will allow ONDEX to retain data on workflow provenance, which can be used to track, verify and validate data.
- Enhanced text mining methods to extract and map terms from text in databases and online literature sources, to detect synonymy and ambiguity, and to identify and extract biologically relevant relations.
Exposing ONDEX to tools:
- New data access interfaces to allow ONDEX data to be used by third party tools, e.g. within workflows, and data export tools to provide easy access to ONDEX data for users of Cytoscape and for export in standard systems biology model exchange formats (e.g. SBML, BioPAX etc).
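As an illustration of the naive Bayesian combination of relations mentioned in the Technical Summary, the following is a minimal sketch, not the ONDEX implementation. It assumes each evidence source (for example a text-mining hit or a sequence-similarity match) is summarised by a likelihood ratio and that the sources are conditionally independent, which is the naive Bayes assumption. The function name `combine_evidence` and the example numbers are hypothetical.

```python
import math


def combine_evidence(prior, likelihood_ratios):
    """Posterior probability that a relation holds, under naive Bayes.

    prior             -- prior probability of the relation, strictly between 0 and 1
    likelihood_ratios -- one ratio P(evidence | relation) / P(evidence | no relation)
                         per evidence source, assumed conditionally independent
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    # Work in log-odds so that many sources can be combined without underflow.
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        if ratio <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(ratio)
    # Convert the combined log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical example: a relation with a 10% prior, supported by two
# sources with likelihood ratios 3 and 3 (combined ratio 9).
posterior = combine_evidence(0.1, [3.0, 3.0])
```

Under these assumptions the prior odds of 1:9 are multiplied by a combined ratio of 9, so the posterior is an even chance. Correlated evidence sources would violate the independence assumption and overstate the posterior.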
Publications
Addinall SG
(2011)
Quantitative fitness analysis shows that NMD proteins and many other protein complexes suppress or enhance distinct telomere cap defects.
in PLoS genetics
Alcaraz N
(2011)
KeyPathwayMiner: Detecting Case-Specific Biological Pathways Using Expression Data
in Internet Mathematics
Ananiadou S
(2011)
Named entity recognition for bacterial Type IV secretion systems.
in PLoS ONE
Balaur I
(2017)
EpiGeNet: A Graph Database of Interdependencies Between Genetic and Epigenetic Events in Colorectal Cancer.
in Journal of computational biology : a journal of computational molecular cell biology
Balaur I
(2017)
Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks.
in Bioinformatics (Oxford, England)
Brandizi M
(2018)
Towards FAIRer Biological Knowledge Networks Using a Hybrid Linked Data and Graph Database Approach.
in Journal of integrative bioinformatics
Burger A
(2012)
Semantic Web applications and tools for the life sciences: SWAT4LS 2010.
in BMC bioinformatics
Canevet C
(2010)
Analysis and visualisation of RDF resources in Ondex
in Nature Precedings
Title | Lightdrawings of networks by Hugo Dalton |
Description | When science meets art, or new meets old. KnetMiner has inspired the artist Hugo Dalton. He has created lightdrawings of our networks and has projected them on old sculptures to bring them back to life. |
Type Of Art | Artwork |
Year Produced | 2018 |
Impact | Our work has been displayed for 3 months to visitors of the Fitzwilliam Museum. |
URL | https://www.instagram.com/p/BflFDaHlwdv/ |
Title | Video Introducing KnetMiner |
Description | A short 90-sec clip introducing KnetMiner to the general public. What is it? Who developed it? Who uses it? |
Type Of Art | Film/Video/Animation |
Year Produced | 2017 |
Impact | Video inspires general public, students and scientists visiting the KnetMiner website. |
URL | https://www.youtube.com/watch?v=4aOv5QXqvLI |
Description | The development of a full-feature general data integration software platform (Ondex) for the life sciences with associated websites, documentation and training materials. A range of demonstrator projects and publications which illustrate the benefits of data integration. The development of an integrated knowledgebase of data relating to the genetics and genomics of the Poplar tree which can be accessed using a geneticist-friendly web-based user interface to support research and breeding that will improve the sustainability of willow trees as a second generation bioenergy crop. The development of new and general methods of data integration and visualisation which have been incorporated into the Ondex system that support the selection and evaluation of functional candidate genes from combined genetic and genomic studies of complex traits. |
Exploitation Route | The Ondex software is entirely general in its application domains and could potentially be relevant to any problem needing data integration technology. The most obvious commercial sectors are the life science communities undertaking research using multi-omics and systems biology approaches. |
Sectors | Agriculture Food and Drink Education Environment Pharmaceuticals and Medical Biotechnology |
URL | http://www.ondex.org |
Description | The Ondex software has been downloaded from the project website by 900 users since the start of this project. It is unclear how many have used it and for what purposes. The major users have been the named collaborators and follow-on projects with Syngenta and at Newcastle University. |
First Year Of Impact | 2011 |
Sector | Agriculture, Food and Drink |
Impact Types | Economic |
Description | Bioinformatics to advance wheat breeding |
Geographic Reach | Europe |
Policy Influence Type | Influenced training of practitioners or researchers |
Impact | Trained researchers to use evidence-based practices in biological decision making. |
URL | http://www.wheatinitiative.org/events/durum-ewg-workshop-bioinformatics-advance-wheat-breeding |
Description | A FAIR community resource for pathogens, hosts and their interactions to enhance global food security and human health |
Amount | £557,820 (GBP) |
Funding ID | BB/S020020/1 |
Organisation | Biotechnology and Biological Sciences Research Council (BBSRC) |
Sector | Public |
Country | United Kingdom |
Start | 06/2019 |
End | 06/2023 |
Description | Accelerating Discovery by Mining and Visualising Integrated Chemogenomics Data |
Amount | £199,359 (GBP) |
Funding ID | 100938 |
Organisation | Innovate UK |
Sector | Public |
Country | United Kingdom |
Start | 11/2011 |
End | 03/2013 |
Description | Finding Value in Complex Biological Data - Integrated 'omics FS |
Amount | £22,973 (GBP) |
Organisation | Innovate UK |
Sector | Public |
Country | United Kingdom |
Start | 03/2016 |
End | 10/2017 |
Description | QTLNetMiner Project Funding |
Amount | £125,356 (GBP) |
Funding ID | BB/I023860/1 |
Organisation | Biotechnology and Biological Sciences Research Council (BBSRC) |
Sector | Public |
Country | United Kingdom |
Start | 12/2011 |
End | 02/2013 |
Description | What determines protein abundance in plants? |
Amount | £3,354,456 (GBP) |
Funding ID | BB/T002182/1 |
Organisation | Biotechnology and Biological Sciences Research Council (BBSRC) |
Sector | Public |
Country | United Kingdom |
Start | 09/2020 |
End | 10/2025 |
Title | AraKNET Release 42 - Feb 2019 |
Description | Integrated database of Arabidopsis genome, genotype, phenotype, omics and homology information. Available through www.knetminer.org or as RDF and Neo4j graph databases. |
Type Of Material | Database/Collection of data |
Year Produced | 2019 |
Provided To Others? | Yes |
Impact | It takes away the pain of connecting data from multiple sources and finding useful clues in data. A task that can take biologists weeks or months, can be done in a few minutes using the resource. |
URL | http://knetminer.rothamsted.ac.uk/Arabidopsis_thaliana/ |
Title | AraKNET Release 45 - Oct 2019 |
Description | Release 45 of the Arabidopsis Knowledge Graph in OXL, RDF and Neo4j formats |
Type Of Material | Database/Collection of data |
Year Produced | 2019 |
Provided To Others? | Yes |
Impact | FAIR data. |
Title | Ondex Integrated Plant Genomics Databases |
Description | The Ondex project produced a portfolio of pre-integrated datasets for use in plant genomics research |
Type Of Material | Database/Collection of data |
Year Produced | 2009 |
Provided To Others? | Yes |
Impact | None |
URL | http://www.ondex.org/doc.shtml |
Title | RiceKNET Release 42 - Feb 2019 |
Description | Integrated database of rice genome, genotype, phenotype, omics and homology information. Available through www.knetminer.org or as RDF and Neo4j graph databases. |
Type Of Material | Database/Collection of data |
Year Produced | 2019 |
Provided To Others? | Yes |
Impact | It takes away the pain of connecting data from multiple sources and finding useful clues in data. A task that can take biologists weeks or months, can be done in a few minutes using the resource. |
URL | http://knetminer.rothamsted.ac.uk/Oryza_sativa/ |
Title | Wheat Knowledge Network - Release Nov 2017 |
Description | Integrated database of wheat genome, genotype, phenotype and homology information (Hassani-Pak et al, 2016) |
Type Of Material | Database/Collection of data |
Year Produced | 2017 |
Provided To Others? | Yes |
Impact | This database powers the KnetMiner application |
URL | http://knetminer.rothamsted.ac.uk/ |
Title | Wheat pathogens knowledge network - Release Nov 2017 |
Description | Knowledge networks of wheat pathogens Fusarium and Zymoseptoria |
Type Of Material | Database/Collection of data |
Year Produced | 2017 |
Provided To Others? | Yes |
Impact | Will help to understand wheat diseases |
URL | http://knetminer.rothamsted.ac.uk/ |
Title | WheatKNET Release 42 - Feb 2019 |
Description | Integrated database of wheat genome, genotype, phenotype, omics and homology information. Available through knetminer.org or as RDF and Neo4j graph databases. |
Type Of Material | Database/Collection of data |
Year Produced | 2019 |
Provided To Others? | Yes |
Impact | It takes away the pain of connecting data from multiple sources and finding useful clues in data. A task that can take biologists weeks or months, can be done in a few minutes using the resource. |
URL | http://knetminer.rothamsted.ac.uk/Triticum_aestivum/ |
Title | WheatKNET Release 45 - Oct 2019 |
Description | Release 45 of the wheat knowledge graph in OXL, RDF and Neo4j formats |
Type Of Material | Database/Collection of data |
Year Produced | 2019 |
Provided To Others? | Yes |
Impact | Used by over 1000 users through the KnetMiner UI and API |
URL | http://knetminer-data.cyverseuk.org/lodestar/ |
Description | GeneStack |
Organisation | Genestack |
Country | United Kingdom |
Sector | Private |
PI Contribution | We are collaborating in an Innovate UK funded Feasibility Study to translate our bioinformatics software and methods to a commercial cloud based software platform and investigating future commercial licensing. |
Collaborator Contribution | They are providing the expertise to integrate our software into their platform |
Impact | Too soon to report outputs |
Start Year | 2016 |
Description | Participation in Syngenta UIC at Imperial College |
Organisation | Imperial College London |
Department | Department of Computing |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | The Rothamsted Ondex team have participated in the University Innovation Centre funded by Syngenta at Imperial College, London. The participation was largely technical providing support in the use of the Ondex platform for supporting the research there in machine learning applications. |
Start Year | 2009 |
Title | KnetMiner v1.0 |
Description | Helps users to analyse biological experiments and put findings into the context of published knowledge. Follow the link to see the release notes. |
Type Of Technology | Webtool/Application |
Year Produced | 2018 |
Open Source License? | Yes |
Impact | Accelerates gene discovery and plant breeding. |
URL | http://knetminer.rothamsted.ac.uk/ |
Title | KnetMiner v3.1 |
Description | Minor improvements and bug fixes |
Type Of Technology | Webtool/Application |
Year Produced | 2019 |
Open Source License? | Yes |
Impact | Benefits KnetMiner end users and developers |
URL | https://github.com/Rothamsted/knetminer/releases/tag/v3.1 |
Title | KnetMiner v3.2 |
Description | Minor improvements and bug fixes |
Type Of Technology | Webtool/Application |
Year Produced | 2019 |
Open Source License? | Yes |
Impact | Benefits KnetMiner end users and developers |
URL | https://github.com/Rothamsted/knetminer/releases/tag/v3.2 |
Title | Ondex Suite |
Description | Data integration using semantic integration methods for lifesciences research and systems biology |
Type Of Technology | Software |
Year Produced | 2010 |
Open Source License? | Yes |
Impact | It led to further research projects and enabled a collaboration with Syngenta |
URL | http://www.ondex.org |
Title | Ondex Web |
Description | A web based biological network visualisation tool. |
Type Of Technology | Webtool/Application |
Year Produced | 2012 |
Impact | It was used in a collaboration with Syngenta |
URL | http://www.ondex.org/projects.shtml#ondexweb |
Title | Ondex to RDF Exporter |
Description | Ondex components and applications that are necessary for building genome-scale knowledge networks used in projects like KnetMiner. It includes the Ondex base, CLI, workflow engine and a set of plugins (parsers, mappers, transformers, filters and exporters) that are relevant for building genome-scale knowledge networks |
Type Of Technology | Software |
Year Produced | 2018 |
Open Source License? | Yes |
Impact | It is part of a suite of tools that help KnetMiner networks, including those developed for wheat, to be shared through linked open data methods. |
URL | https://github.com/Rothamsted/ondex-knet-builder/tree/master/modules/rdf-export-2 |
Title | Ondex-Knet-Builder v2.1 |
Description | Command line based workflow engine for building knowledge graphs in OXL and RDF format. |
Type Of Technology | Software |
Year Produced | 2018 |
Open Source License? | Yes |
Impact | Improvements to several data loaders. |
URL | https://github.com/Rothamsted/ondex-knet-builder/releases/tag/v2.1 |
Title | QTLNetMiner |
Description | QTLNetMiner is a user-friendly web application that can interrogate plant and animal knowledge networks and be used to show candidate genes and QTL associated with given input terms (e.g. early flowering, disease resistance). The relevance of a gene to particular query terms is weighted using information retrieval and network inference methods. The supporting evidence networks for selected candidate genes are visualized in the Ondex Web Java-applet. QTLNetMiner is designed in a generic way and can be created for any organism with an integrated Ondex knowledge network |
Type Of Technology | Webtool/Application |
Year Produced | 2013 |
Impact | None to date |
URL | https://ondex.rothamsted.ac.uk/QTLNetMiner/ |
Title | RDF to Neo4j converter |
Description | RDF-Neo4j converter and config to load KnetMiner data |
Type Of Technology | Software |
Year Produced | 2018 |
Open Source License? | Yes |
Impact | Exports our biological knowledge networks into formats that can be more easily re-used. |
Title | Wheat KnetMiner - Release Nov 2017 |
Description | Wheat KnetMiner - Release Nov 2017. Added disease related RNA-seq studies and wheat GWAS data. |
Type Of Technology | Webtool/Application |
Year Produced | 2017 |
Open Source License? | Yes |
Impact | Helps wheat researchers in gene discovery and knowledge visualization. |
URL | http://knetminer.rothamsted.ac.uk/Triticum_aestivum/ |
Description | Durum EWG Workshop: Bioinformatics to advance wheat breeding |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | A 2-day workshop organised by Roberto Tuberosa and Luigi Cattivelli and attended by 100 wheat breeders, geneticists and researchers to learn about cutting-edge bioinformatics tools and resources available for wheat. The KnetMiner training led to a collaboration with Roberto Tuberosa's lab to identify potential candidate genes in hundreds of wheat QTL using KnetMiner networks and APIs. |
Year(s) Of Engagement Activity | 2017 |
URL | http://www.wheatinitiative.org/events/durum-ewg-workshop-bioinformatics-advance-wheat-breeding |
Description | EBI Training Workshop - Integrative 'OMICS 2015, 2016 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | We taught data integration and visualisation using Ondex and QTLNetMiner at the European Bioinformatics Institute |
Year(s) Of Engagement Activity | 2015,2016 |
URL | http://www.ebi.ac.uk/training/events/2015/introduction-integrative-omics |
Description | Introduction to Integrative 'omics 2016 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Training workshop in integrative omics software methods. Features both Ondex and QTLNetMiner |
Year(s) Of Engagement Activity | 2016 |
URL | http://www.ebi.ac.uk/training/events/2016/introduction-integrative-omics-0 |
Description | Ondex Training workshops |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Other academic audiences (collaborators, peers etc.) |
Results and Impact | The Ondex SABR project ran a series of outreach and training events for research scientists in academia and industry. |
Year(s) Of Engagement Activity | 2008,2009,2010 |
Description | Press release on KnetMiner software and collaboration with Genestack |
Form Of Engagement Activity | A press release, press conference or response to a media enquiry/interview |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Media (as a channel to the public) |
Results and Impact | Press release introducing the KnetMiner software developed in the Hassani-Pak lab at Rothamsted and a recent collaboration to make it available as an App in the Genestack bioinformatics platform. News covered by Rothamsted, Genestack, BBSRC, Aafarmer, Farmbusiness and other websites. |
Year(s) Of Engagement Activity | 2017 |
URL | https://www.rothamsted.ac.uk/news/visualising-data-connections-promises-faster-discoveries |
Description | Revival Exhibition by Hugo Dalton at Fitzwilliam Museum, Cambridge |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Public/other audiences |
Results and Impact | Hugo Dalton's lightdrawings were inspired by two major innovations, i.e. Omega3 plants and KnetMiner software, from Rothamsted Research. His light projections were on display at the Fitzwilliam Museum in Cambridge from Nov 2017 - Feb 2018. |
Year(s) Of Engagement Activity | 2017,2018 |
URL | http://www.fitzmuseum.cam.ac.uk/calendar/whatson/hugo-dalton-revival-lightdrawings |