From data to knowledge / the ONDEX System for integrating Life Sciences data sources

Lead Research Organisation: Newcastle University
Department Name: Computing Sciences

Abstract

The biological sciences generate many different types of data from different specialist disciplines (e.g. genetics, biochemistry, molecular biology). Bringing these data together coherently is a major undertaking in any systems biology project. While new databases of biological thesauri and classification systems (ontologies) for the component parts of biology make it easier to link specialist databases, this solves only part of the data integration problem for systems biologists, who need a much richer body of information. For example, there are many different ways in which biological components can be related (e.g. by function, location, size), and these relationships need to be captured. Information about the provenance (history or source) of data can also be important when the data are interpreted. New types of information are also important in systems biology, including descriptions of the biological processes and pathways for metabolism and information flow. Many of these have been created by extracting information from the scientific literature to form the basis for predictive dynamic models and simulations of system function. Because systems biology needs complex data integration and scientific text mining that are not provided by the bioinformatics software readily available to the biological research community, a prototype system (ONDEX) has been developed by Rothamsted Research. This project will combine ONDEX with leading technologies in workflow, graph analysis and text mining to develop a powerful and professional tool that will underpin systems biology research. Three systems biology research projects, run by our BBSRC-funded systems biology centre partners, will drive the development of ONDEX and will validate new features on real scientific problems. The biological areas addressed are bioenergy crops, yeast metabolome models, and telomere function in ageing.
The research partners bring important technical expertise that will enhance ONDEX with new capabilities known to be required by systems biologists at their centres. These include:
* Extensions to methods that map data into ONDEX to broaden the range of data that can be integrated and capture more of the information about it (the metadata).
* State of the art text mining capabilities, for extracting biological concepts and relationships from online text to enable new data buried in the scientific literature to be extracted and structured into models and databases.
* Extensions to handle the statistical uncertainty inherent in many biological relationships, to enable new relationships to be identified in the integrated datasets using modern statistical inference techniques.
* Enhanced graphical visualisations of the complex network of relationships to accommodate new information and scale to huge data networks, to enable a better understanding of new interactions, and better ways of interrogating the data in a richly integrated dataset.
* Exploitation of the latest in distributed computing techniques and scientific workflows to simplify, automate and scale the complex task of integration.
* Extended range of data interfaces relevant to both programmers and users to enable shared access over the Internet of the integrated datasets, which are important information resources in their own right.
A number of actions and engineering developments will make ONDEX easier to use by biologists and support uptake in new areas of systems biology. These include new training resources, workshops for users and developers and providing direct help for new applications through an outreach programme. At the end of the project ONDEX will be delivered to existing and new users in a well-engineered and robust form that can be more readily used by a greatly expanded user and developer community, which should make it sustainable in the long term as an open software project.

Technical Summary

The current ONDEX system enables data from diverse biological data sets to be linked, integrated and visualised through graph analysis techniques. It uses a semantically rich Core data structure based on graphs, has explicit support for workflow and has the ability to bring together information from structured databases and unstructured sources such as sequence data and free text. Extensions for Systems Biology include:
Enhancing the ONDEX Core:
- Methods to map data into the core data structures to exploit synteny and sequence similarity for applications needing comparative analysis of the genetic and genomic organisation of multiple organisms.
- Techniques for probabilistic interpretation of relations, allowing uncertainty in the integrated data and in biological relationships to be modelled, and combining relations using probabilistic models such as naive Bayesian and Bayesian graphical Gaussian approaches.
Exploiting the ONDEX data graph:
- A graph structure analysis toolkit, using standard and advanced graph analysis algorithms, that traverses the data graph and allows modules representing common structural and functional components to be identified.
Populating the ONDEX model:
- Orchestrating data integration and analysis steps in ONDEX applications using Taverna workflows and services (myGrid), including the running of workflows. Using Taverna will allow ONDEX to retain data on workflow provenance, which can be used to track, verify and validate data.
- Enhanced text mining methods to extract and map terms from text in databases and online literature sources, to detect synonymy and ambiguity, and to identify and extract biologically relevant relations.
Exposing ONDEX to tools:
- New data access interfaces to allow ONDEX data to be used by third party tools, e.g. within workflows, and data export tools to provide easy access to ONDEX data for users of Cytoscape and for export in standard systems biology model exchange formats (e.g. SBML, BioPAX).
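
The naive Bayesian combination of relations mentioned above can be illustrated with a minimal sketch. This is not ONDEX code: the function name, the prior, and the likelihood ratios are hypothetical, and the sketch assumes that each evidence source (for example a co-expression dataset or a text-mined co-occurrence) is conditionally independent of the others and can be summarised as a likelihood ratio for the relation being true.

```python
import math


def combine_evidence(prior, likelihood_ratios):
    """Combine independent evidence for one relation (hypothetical helper).

    Naive Bayes assumption: posterior odds = prior odds * product of the
    likelihood ratios, one ratio per evidence source. The sum is done in
    log space to avoid overflow when many sources are combined.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        if ratio <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(ratio)
    # Convert the combined log odds back into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Illustrative values only: a relation with a 10% prior, supported by two
# sources (ratios 4.0 and 3.0) and weakly contradicted by a third (0.5).
posterior = combine_evidence(0.10, [4.0, 3.0, 0.5])
```

In an integrated graph, a value of this kind could be stored as an attribute on the edge between two concepts, so that later graph analysis can filter or weight relations by their support. The Bayesian graphical Gaussian approaches named in the summary are a different technique and are not covered by this sketch.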

Publications

Hallinan JS (2011) Network approaches to the functional analysis of microbial proteins. in Advances in microbial physiology

James K (2012) Is newer better?--evaluating the effects of data curation on integrated analyses in Saccharomyces cerevisiae. in Integrative biology : quantitative biosciences from nano to macro

Lister AL (2010) Annotation of SBML models through rule-based semantic integration. in Journal of biomedical semantics

Lister AL (2009) Saint: a lightweight integration environment for model annotation. in Bioinformatics (Oxford, England)

Misirli G (2013) BacillOndex: an integrated data resource for systems and synthetic biology. in Journal of integrative bioinformatics

Weile J (2012) Bayesian integration of networks without gold standards. in Bioinformatics (Oxford, England)

Weile J (2011) Customizable views on semantically integrated networks for systems biology. in Bioinformatics (Oxford, England)

 
Description The main features of the project were the development of data integration technologies and the demonstration of their potential using a range of systems biology application cases. The aim was to develop a software product with as broad an application potential as possible. Given the range of biological problems that we were able to tackle, it is unlikely that this would have been possible in a single standard response mode project without the SABR funding. The application cases ranged from yeast metabolic model development, through analysis of bacterial genomics, to the mining of crop genetic and genomic datasets.
Exploitation Route The Ondex project resulted in a suite of tools for data integration in systems and synthetic biology.

The training strategy for the project staff and students was based on a regular series of internal workshops where both progress relating to application cases and feature requests to be implemented in the software were debated and prioritised. Any new features that had been implemented in the previous period were presented in training tutorials organised by the staff involved in outreach.
Regular hackathons (approximately 2 per year) were organised, involving most of the programmers on the project, at which refactoring or other major software engineering efforts were delivered. These nearly always involved some element of training in software architecture or in the development of interfaces or protocols.
The most important aspect of training in the project, and a key added value of the resources made possible by the SABR initiative, was the outreach and training workshops that we presented to the systems biology research community. Over the duration of the project we gave a total of 17 tutorials and training workshops on Ondex to approximately 250 participants, who were mainly post-docs and students from outside the project.
The unintended, but welcome, consequence of this was that we started a range of collaborations with other groups who wanted help to get started with Ondex. Some of these turned into long-term collaborations and led to publications and/or grant applications, several of which were funded and helped extend the project.
Training in Ondex was also delivered (and is currently still delivered) to students on the MSc Systems Biology, MSc Synthetic Biology, MSc Neuroinformatics and MSc Bioinformatics at Newcastle University.
Sectors Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology

URL http://www.ondex.org/index.shtml
 
Description The Ondex system has been used to aid in the discovery of new drugs by facilitating data mining for drug repurposing opportunities. The additional value of this SABR grant came from the ability to recruit a team of software developers and bioinformaticians from different backgrounds, working together with life scientists carrying out wet-laboratory based science. Each team brought different technological expertise to the project, and this enabled the successful interoperability that was achieved between the major software components of the system, i.e. Taverna workflows, text mining, and graph-based integration and visualisation. The diversity of the application cases explored at Rothamsted, Manchester and Newcastle enabled different aspects of the software to be evaluated on different scientific problems. This enriched the product and ensured that it remained relevant to a wide range of biological research topics. The scale of the project enabled bioinformaticians and professional computer scientists (in Manchester and Newcastle) to work together, which made possible a major refactoring of the software so that it was easier to maintain and more straightforward to develop. This refactoring improved the sustainability of the Ondex software and has facilitated many of the follow-on projects. The parallel operation of the doctoral training grant (unique to the SABR programme) provided an important and complementary research approach through the recruitment of PhD students. The students were free to carry out more exploratory research that underpinned important concepts ultimately implemented in the Ondex platform. Work by Prof. Wipat using the Ondex platform in collaboration with e-therapeutics Ltd. resulted in an impact case that was submitted to the REF by the School of Computing Science at Newcastle. Newcastle was ranked first for impact in Computing Science in the UK.
First Year Of Impact 2009
Sector Digital/Communication/Information Technologies (including Software),Healthcare,Pharmaceuticals and Medical Biotechnology
Impact Types Economic