DISCOVERING GENETIC FUNCTIONS USING LINKED OPEN DATASETS ON THE WEB
Lead Research Organisation:
University of Oxford
Department Name: Zoology
Abstract
Although rapid advances in bioinformatics research and technologies have produced a vast amount of data, translating the genetic understanding obtained from studies of model organisms into human clinical efficacy has had limited success. Until the large-scale analysis of this data deluge is computerised and automated, the productivity of genetic research and its application to drug development will lag behind the fast growth of data. The World Wide Web offers a scalable platform for publishing the fast-growing research datasets. Its maturing standards and protocols ensure its linking capability and the interoperability of data shared on the Web. The Linking Open Data project (LOD, http://linkeddata.org/) is an emerging community that promotes the publication of data on the Web using an interoperable format. This provides a unique opportunity to investigate and prove the feasibility of supporting genetic research at Web scale using interoperable webs of linked research data.
I propose a new practice of interacting with the segmented data on the Web that will improve the productivity of genetic research as applied to medicine, drug development, and epidemiology. The data will be automatically connected to form webs of linked datasets, building upon an interoperable representation format and machine-processable metadata. Analysis of the datasets will be automated by a set of computer tools that I will create in response to concrete user needs. This will form a 'genetic platform' on top of the Web that will assist with the task of gathering and analysing research datasets at Web scale. Through ongoing collaborative interaction with biomedical researchers, the true barriers that presently hinder productive genetic research will be identified and removed in a user-led, incremental approach.
By experimenting with current technologies, I will uncover their strengths and shortcomings, and identify the missing technical, biological and social components in supporting data-intensive genetic studies.
I will use two test beds, one for genomics studies and the other for proteomics studies. I will work together with Drosophila researchers from Oxford and Cardiff, bioinformaticians from Amsterdam, and computer scientists from the LOD and the Semantic Web Health Care and Life Sciences (HCLS) (http://www.w3.org/2001/sw/hcls/) communities. Biological use cases will drive the research agenda, and new computational practices for interacting with large-scale biological research data will be evaluated by the extent to which they remove the identified impediments to Drosophila genetic research productivity or improve productivity compared to conventional research practice. The efficiency of my new practice is built upon the linked research datasets, enabled by a new practice of data publication. A demonstration of its efficiency will drive adoption of this new data publication practice within the biomedical research community, assisted by the maturing tools and standards emerging from the web community. This will lead to a cultural shift. I will devote continuous effort to transferring this cultural evolution to the wider community, in order to promote step changes in the convergence of biomedical research and its clinical application, such as personalised medicine. The emergence of a new bioinformatics science is foreseeable, one that passes the knowledge required for adopting this new practice of data publication and analysis to the younger generations of biological and medical students, who will become the new scientists working with research data on a large scale and will lead new innovations in biomedical research and its applications.
Organisations
People |
ORCID iD |
Jun Zhao (Principal Investigator) |
Publications
Anja Jentzsch (Author)
(2009)
Linking Open Drug Data
Cheung KH
(2009)
A journey to Semantic Web query federation in the life sciences.
in BMC bioinformatics
Deus HF
(2012)
Translating standards into practice - one Semantic Web API for Gene Expression.
in Journal of biomedical informatics
Jun Zhao (Author)
(2010)
Provenance Requirements for the Next Version of RDF
Jun Zhao (Co-Author)
(2009)
Using web data provenance for quality assessment
Jun Zhao (Co-Author)
(2010)
Provenance of microarray experiments for a better understanding of experiment results.
Luciano JS
(2011)
The Translational Medicine Ontology and Knowledge Base: driving personalized medicine by bridging the gap between bench and bedside.
in Journal of biomedical semantics
MARCO BORGHESI (Recipient)
(2011)
Describing Linked Datasets with the VoID Vocabulary
Description | In this LSI fellowship project I set out to identify the barriers that cause inefficiency in genetic research and to address some of them through technical and social solutions. The mechanism by which research datasets are made available is a first-step impediment to the productivity of research as well as to application building. Coping with the heterogeneity of scientific data is a perennial challenge. Representing the data in the latest standards-compliant formats and exposing them through standard access protocols provides a higher starting point for data integration. However, to take full advantage of the latest semantic web tools and technologies for accessing and reasoning over this Web of Data as one connected data Web, we must go one step further by establishing links between research datasets at both the individual instance level and the higher conceptual level, including their temporal, spatial and thematic aspects. In this fellowship project, vocabularies or ontologies that are key to discovering and integrating the distributed datasets on the Linked Data Web, and to establishing their trustworthiness, have been developed and widely adopted by the community and by high-impact projects such as the UK data.gov.uk project. Tools implementing these vocabularies have been co-developed or adapted in this project to test these technologies in real case studies. Platforms that enable integrated access to datasets have been developed to support research in specific subject areas and have been successfully adapted and applied to new sub-domains, including the identification of potential chemical compounds underpinning the effect of traditional Chinese medicines, and the exploration of biological networks related to a specific disease. The former led to an award-winning application.
Beyond the need for efficient, integrative access to distributed research datasets, an even stronger and more fundamental impediment to the productivity of genetic research, as well as of many other scientific research activities, is the lack of access to raw and processed research datasets and of support for transparent and reproducible research practices. The fellowship project set out to focus on primary data resources published in existing public databases. However, interaction with different scientists and collaborators has increasingly highlighted the value of raw research datasets and of datasets that underpin the conclusions or claims made in publications. This lack of data sharing and publication is largely caused by social reluctance: researchers fear losing credit for their data or having errors exposed. It is also caused by a lack of technical support, both for publishing and sharing these data effectively in a persistent, citable manner, and for evaluating and rewarding scientists' research outcomes based on their data sharing and data citation/reuse index. Reuse of research data must be tracked automatically, to establish credit for data publishers and to detect any misconduct involving original datasets. Vocabularies that define the data reuse relationships between experiments and studies have been developed to support the exploration of novel data citation evaluation metrics based on these data reuse networks. Lastly, this research has also identified one key limitation of the current technologies for supporting the target applications proposed in the original research, namely the quality and trustworthiness of this distributed approach to data publication and access. Building upon my existing expertise in provenance research, I made pioneering contributions to this key missing piece.
Again, by collaborating with other international academic and industrial partners, I produced several data models and vocabularies to represent key provenance information, which tracks how data was generated, replicated, published, accessed, etc., to enable quality and trust assessment. I also produced methodologies and technologies that use provenance information to enable quality-aware data access and integration. This pioneering work has led to several of my key academic publications and to ongoing standards output from the World Wide Web Consortium (W3C). |
Exploitation Route | The findings from this work have contributed to the development of several community standards, such as: 1. W3C Health Care and Life Science (HCLS) Linked Data Guide: https://www.w3.org/TR/hcls-dataset/ 2. W3C Dataset Descriptions: HCLS Community Profile https://www.w3.org/TR/hcls-dataset/ The research work by the PI has also contributed to several standardisation reports for the W3C Provenance Incubator Group (https://www.w3.org/2005/Incubator/prov/wiki/Main_Page), and to the final standardisation of the W3C PROV model (https://www.w3.org/TR/prov-o/). |
Sectors | Digital/Communication/Information Technologies (including Software) Education Healthcare |
Description | - Participant in the movement driving a fast-growing number of life science datasets to be published in linked data format - Establishment as community leader and expert in provenance research, especially for the Web of Data domain - Establishment as community leader and expert in applying lightweight semantic web technologies to the building of life science data integration applications. |
First Year Of Impact | 2010 |
Sector | Digital/Communication/Information Technologies (including Software),Healthcare |
Impact Types | Cultural |
Description | Workflow4Ever: Advanced Workflow Preservation Technologies for Enhanced Science |
Amount | £2,673,000 (GBP) |
Funding ID | FP7 270192 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 12/2010 |
End | 11/2013 |
Title | FlyKit |
Description | A JavaScript framework that has been used to improve the efficiency of data integration to support gene expression studies. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2010 |
Provided To Others? | Yes |
Impact | The tool has also been successfully adapted to support queries across Chinese and western medicine data; see: Integrating findings of traditional medicine with modern pharmaceutical research: the potential role of linked open data. Samwald et al. Chinese medicine 5 (1), 43. 2010 |
URL | https://code.google.com/p/open-biomed/ |
Title | open-biomed-data |
Description | This is a collection of datasets in RDF format, including those from the following databases: - FlyTED - FlyAtlas - FlyBase - BDGP - EBI GeneAtlas/ArrayExpress - TCM. Having this set of very heterogeneous datasets in RDF format greatly alleviates the data heterogeneity issues that we have to face. This not only makes our sense-making tasks much easier but also produces datasets that can be reused for future data integration tasks. |
Type Of Material | Database/Collection of data |
Year Produced | 2009 |
Provided To Others? | Yes |
Impact | People have been very keen to learn from our experience of publishing these biological datasets in the novel RDF format. We have worked with other groups to help with, or give feedback on, their similar tasks. |
URL | https://code.google.com/p/open-biomed/w/list |
Description | Cracking the quality puzzle with provenance pieces |
Form Of Engagement Activity | Scientific meeting (conference/symposium etc.) |
Part Of Official Scheme? | No |
Type Of Presentation | paper presentation |
Geographic Reach | International |
Primary Audience | Other academic audiences (collaborators, peers etc.) |
Results and Impact | Invited talk on provenance at the Principles of Provenance Seminar in Dagstuhl, Germany. |
Year(s) Of Engagement Activity | 2012 |
Description | Linked Data for Biomedical Science: A Tale of Two Success Stories |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Professional Practitioners |
Results and Impact | An invited talk on Linked Data for Health Care and Life Science applications at the Talis open day. Follow-up collaboration and networking. |
Year(s) Of Engagement Activity | 2010 |
Description | Linked Data for Health Care and Life Science Research |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Invited talk on Linking Open Drug Data at the Life Science SIG (Special Interest Group). Washington D.C., USA. |
Year(s) Of Engagement Activity | 2009 |
Description | Member of the W3C Provenance Incubator Group |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | Yes |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | Member of the W3C Provenance Incubator Group. Awarding Body - World Wide Web Consortium, Name of Scheme - Provenance Incubator Group. A follow-up working group was set up as a result of outcomes from this incubator group. |
Year(s) Of Engagement Activity | 2009,2010 |
Description | Member of the W3C Provenance Working Group |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | Yes |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | A series of recommendations, see http://www.w3.org/2011/prov/ . We have organised several very successful outreach activities following the publication of our recommendation series, see http://www.w3.org/2001/sw/wiki/OutreachInformation. The PROV recommendations have been widely adopted by both academic and commercial organisations, see http://www.w3.org/2001/sw/wiki/PROV#Implementations. |
Year(s) Of Engagement Activity | 2011,2013 |
Description | Member of the World Wide Web Consortium Health Care Life Science Interest Group |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | Yes |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | We have produced several influential journal papers and technical notes. A selection of them can be found below: - A journey to Semantic Web query federation in the life sciences. Cheung K.H. et al. BMC bioinformatics 10 (Suppl 10), S10. 2009 - Publishing Chinese medicine knowledge as Linked Data on the Web. J Zhao. Chinese medicine 5 (1), 1-12. 2010 - Integrating findings of traditional medicine with modern pharmaceutical research: the potential role of linked open data. Samwald et al. Chinese medicine 5 (1), 43. 2010 - The Translational Medicine Ontology and Knowledge Base: driving personalized medicine by bridging the gap between bench and bedside. Luciano, J.S. et al. Journal of biomedical semantics 2 (Suppl 2), S1. 2011 - Emerging practices for mapping and linking life sciences data using RDF-A case series. Marshall M.S. et al. Web Semantics: Science, Services and Agents on the World Wide Web Vol 14, 2-13. 2012 Many of these activities continue in the HCLS interest group, which uses and improves the resources and knowledge they produced. |
Year(s) Of Engagement Activity | 2009,2013 |
Description | Open Genomic Data Web |
Form Of Engagement Activity | Scientific meeting (conference/symposium etc.) |
Part Of Official Scheme? | No |
Type Of Presentation | keynote/invited speaker |
Geographic Reach | International |
Primary Audience | Other academic audiences (collaborators, peers etc.) |
Results and Impact | Invited talk on open-biomed at the 2009 GMOD Meeting Europe. Oxford. The GMOD project team initiated a follow-up visit to assist with their data management design, but it was declined because of a scheduling clash. |
Year(s) Of Engagement Activity | 2009 |
Description | OpenFlyData, a Data Web for Drosophila |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Other academic audiences (collaborators, peers etc.) |
Results and Impact | An invited talk on openflydata at the Linked Data and Practical Semantic Web Workshop. Oxford. As a result of the talk, we attracted a summer student to our lab. |
Year(s) Of Engagement Activity | 2009,2010 |
Description | Provenance in the Dynamic, Collaborative New Science |
Form Of Engagement Activity | Scientific meeting (conference/symposium etc.) |
Part Of Official Scheme? | No |
Type Of Presentation | keynote/invited speaker |
Geographic Reach | National |
Primary Audience | Other academic audiences (collaborators, peers etc.) |
Results and Impact | Invited talk on provenance at the Provenance Workshop at the Edinburgh e-Science Center. A follow-up workshop was planned. |
Year(s) Of Engagement Activity | 2011 |
Description | The Web as the Platform for Sharing and Consuming Biomedical Research Data |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Other academic audiences (collaborators, peers etc.) |
Results and Impact | Invited talk on open-biomed at the Information System Group Seminar, Computing Lab, Oxford University. Oxford. Follow-up inter-group seminars. |
Year(s) Of Engagement Activity | 2010 |
Description | Using the Web as the Platform for Sharing and Consuming Biomedical Data |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other academic audiences (collaborators, peers etc.) |
Results and Impact | Invited talk on open-biomed at the Workshop on the Influence and Impact of Web 2.0 on Various Applications. Edinburgh. A book was planned around the workshop topic. |
Year(s) Of Engagement Activity | 2010 |