DISCOVERING GENETIC FUNCTIONS USING LINKED OPEN DATASETS ON THE WEB

Lead Research Organisation: University of Oxford
Department Name: Zoology

Abstract

Although rapid advances in bioinformatics research and technologies have led to the production of a vast amount of data, translating the genetic understanding obtained from studies of model organisms into human clinical efficacy has had limited success. Until the large-scale analysis of this data deluge is computerised and automated, the productivity of genetic research and its application to drug development will lag behind the fast growth of data. The World Wide Web offers a scalable platform for publishing the fast-growing research datasets. Its maturing standards and protocols ensure its linking capability and the interoperability of data shared on the Web. The Linking Open Data project (LOD, http://linkeddata.org/) is an emerging community that promotes the publication of data on the Web using an interoperable format. This provides a unique opportunity to investigate and prove the feasibility of supporting genetic research at Web scale using interoperable webs of linked research data.
I propose a new practice of interacting with the segmented data on the Web that will improve the productivity of genetic research as applied to medicine, drug development, and epidemiology. The data will be automatically connected to form webs of linked datasets, building upon an interoperable representation format and machine-processable metadata. Analysis of the datasets will be automated by a set of computer tools that I will create in response to concrete user needs. This will form a 'genetic platform' on top of the Web that will assist with the task of gathering and analysing research datasets at Web scale. Through ongoing collaborative interaction with biomedical researchers, the true barriers that presently hinder productive genetic research will be identified, and the impediments will be removed in a user-led, incremental approach.
By experimenting with current technologies, I will uncover their strengths and shortcomings, and identify the missing technical, biological and social components in supporting data-intensive genetic studies.
I will use two test beds, one for genomics studies and the other for proteomics studies. I will work together with Drosophila researchers from Oxford and Cardiff, bioinformaticians from Amsterdam, and computer scientists from the LOD and the Semantic Web Health Care and Life Sciences (HCLS) (http://www.w3.org/2001/sw/hcls/) communities. Biological use cases will drive the research agenda, and new computational practices for interacting with large-scale biological research data will be evaluated by the extent to which they remove the identified impediments to Drosophila genetic research productivity or improve productivity compared with conventional research practice. The efficiency of my new practice is built upon linked research datasets, enabled by a new practice of data publication. A demonstration of its efficiency will drive adoption of this new data publication practice within the biomedical research community, assisted by the maturing tools and standards emerging from the web community. This will lead to a cultural shift! I will devote continuous effort to transferring this cultural evolution to the wider community, in order to promote step changes in the convergence of biomedical research and its clinical application, such as personalised medicine. The emergence of a new bioinformatics science is foreseeable, passing the knowledge required for adopting this new practice of data publication and analysis to the younger generations of biological and medical students, who will become the new scientists working with research data on a large scale and will lead new innovations in biomedical research and its applications.
 
Description In this LSI fellowship project I set out to identify barriers that cause inefficiency in genetic research and to address some of these barriers through technical and social solutions.



The mechanism by which research datasets are made available is a first-step impediment to the productivity of research as well as to application building. Coping with the heterogeneity of scientific data is a perennial challenge. Representing the data in the latest standards-compliant formats and exposing them through standard access protocols provides a higher starting point for data integration. However, to take full advantage of the latest semantic web tools and technologies for accessing and reasoning over this Web of Data as one connected data Web, we must go one step further by establishing linkage between research datasets at both the individual instance level and the higher conceptual level, including their temporal, spatial and thematic aspects. In this fellowship project, vocabularies or ontologies that are key to facilitating the discovery and integration of distributed datasets on the Linked Data Web, and to establishing their trustworthiness, have been developed and widely adopted by the community as well as by high-impact projects such as the UK data.gov.uk project. Tools implementing these vocabularies have been co-developed or adapted in this project to test these technologies in real case studies. Platforms enabling integrated access to datasets have been developed to support research in specific subject areas and have been successfully adapted and applied to new sub-domains, including the identification of potential chemical compounds underpinning the effect of traditional Chinese medicines, and the exploration of biological networks related to a specific disease. The former has led to an award-winning application.
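
As a rough illustration of instance-level linkage, the sketch below groups identifiers that are connected by owl:sameAs statements and pools the facts recorded about them in different datasets. It is a minimal sketch only: the identifiers, the `ex:` properties and the function name `merge_by_same_as` are hypothetical and are not taken from the project's datasets or tools.

```python
def merge_by_same_as(triples):
    """Pool facts about identifiers that are linked by owl:sameAs.

    triples: iterable of (subject, predicate, object) strings.
    Returns a dict mapping one representative identifier per linked
    group to the set of (predicate, object) facts known for the group.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # First pass: join identifiers declared to denote the same thing.
    for s, p, o in triples:
        if p == "owl:sameAs":
            parent[find(s)] = find(o)

    # Second pass: attach every other fact to its group representative.
    merged = {}
    for s, p, o in triples:
        if p != "owl:sameAs":
            merged.setdefault(find(s), set()).add((p, o))
    return merged


# Hypothetical statements from two datasets (dsA and dsB).
EXAMPLE_TRIPLES = [
    ("dsA:gene_1", "owl:sameAs", "dsB:gene_x"),
    ("dsA:gene_1", "ex:expressedIn", "ex:testis"),
    ("dsB:gene_x", "ex:label", "example gene"),
    ("dsB:gene_y", "ex:label", "other gene"),
]

if __name__ == "__main__":
    for representative, facts in merge_by_same_as(EXAMPLE_TRIPLES).items():
        print(representative, sorted(facts))
```

Once the two identifiers are linked, a query about either one can see the expression fact from one dataset and the label from the other, which is the practical benefit of instance-level links.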





Apart from the need for efficient, integrative access to distributed research datasets, an even stronger and more fundamental impediment to the productivity of genetic research, as well as of many other scientific research activities, is the lack of access to raw and processed research datasets and of support for transparent and reproducible research practices. The fellowship project set out to focus on primary data resources published in existing public databases. However, the value of raw research datasets, and of those underlying research conclusions or claims made in publications, has been increasingly highlighted through interaction with different scientists and collaborators. This lack of data sharing and publication is largely caused by social reluctance: scientists fear losing credit for their data or having errors exposed. It is also caused by the lack of technical support for effectively publishing and sharing these data in a persistent, citable manner, and for evaluating and rewarding scientists' research outcomes based on their data sharing and data citation/reuse index. Reuse of research data must be automatically tracked, to establish data credit for data publishers and to detect any misconduct involving original datasets. Vocabularies defining the data reuse relationships between experiments and studies have been developed to support the exploration of novel data citation evaluation metrics based on these data reuse networks.
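
To make the idea of a metric over a data reuse network more concrete, the following sketch counts how often a dataset is reused directly and how many downstream studies depend on it transitively. It is only one possible reading of such a metric: the edge list, the resource names and the function names are hypothetical and do not come from the vocabularies developed in the project.

```python
from collections import defaultdict

# Hypothetical reuse statements: (resource_that_reuses, resource_reused).
REUSE_EDGES = [
    ("study_B", "dataset_A"),
    ("study_C", "dataset_A"),
    ("study_D", "study_B"),
]


def build_reverse_index(edges):
    """Map each resource to the set of resources that directly reuse it."""
    reused_by = defaultdict(set)
    for user, source in edges:
        reused_by[source].add(user)
    return reused_by


def reuse_counts(edges, resource):
    """Return (direct, transitive) reuse counts for one resource."""
    reused_by = build_reverse_index(edges)
    direct = set(reused_by.get(resource, ()))
    seen, stack = set(), list(direct)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(reused_by.get(node, ()))
    # Ignore the resource itself if the network happens to contain a cycle.
    seen.discard(resource)
    return len(direct), len(seen)


if __name__ == "__main__":
    print(reuse_counts(REUSE_EDGES, "dataset_A"))
```

In this toy network, `dataset_A` is reused directly by two studies and indirectly by a third, so a transitive count gives its publisher credit for downstream work that a plain citation count would miss.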



Lastly, this research has also identified one key limitation of the current technologies for supporting the target applications proposed in the original research, i.e. the quality and trustworthiness of data under this distributed approach to data publication and access. Building upon my existing expertise in provenance research, I made pioneering contributions to this key missing piece. Again, by collaborating with other international academic and industrial partners, I produced several data models and vocabularies to represent key provenance information, which tracks how data was generated, replicated, published, accessed, etc., to enable quality and trust assessment. I also produced methodologies and technologies that use provenance information to enable quality-aware data access and integration. This pioneering work has led to several of my key academic publications and to ongoing standards output from the World Wide Web Consortium (W3C).
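
The sketch below gives a simplified picture of how provenance statements could support a quality-aware access decision. It borrows two property names from the W3C PROV-O vocabulary (prov:wasDerivedFrom and prov:wasAttributedTo) purely for illustration; the resource names, the trust rule and the function names are hypothetical and should not be read as the methodology produced in the project.

```python
# Provenance statements as (subject, predicate, object) triples.
# Predicate names follow W3C PROV-O; all resource names are hypothetical.
TRIPLES = [
    ("ex:integrated_table", "prov:wasDerivedFrom", "ex:rdf_dump"),
    ("ex:rdf_dump", "prov:wasDerivedFrom", "ex:source_db_export"),
    ("ex:source_db_export", "prov:wasAttributedTo", "ex:curated_database"),
    ("ex:scraped_copy", "prov:wasAttributedTo", "ex:unknown_mirror"),
]


def derivation_chain(triples, entity):
    """Follow prov:wasDerivedFrom links from an entity back to its origins."""
    parents = {}
    for s, p, o in triples:
        if p == "prov:wasDerivedFrom":
            parents.setdefault(s, []).append(o)
    chain, stack, seen = [], [entity], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        chain.append(node)
        stack.extend(parents.get(node, []))
    return chain


def is_trusted(triples, entity, trusted_agents):
    """Assumed rule: trust an entity if anything in its derivation chain
    is attributed to an agent on the trusted list."""
    chain = set(derivation_chain(triples, entity))
    return any(
        p == "prov:wasAttributedTo" and s in chain and o in trusted_agents
        for s, p, o in triples
    )


if __name__ == "__main__":
    trusted = {"ex:curated_database"}
    print(derivation_chain(TRIPLES, "ex:integrated_table"))
    print(is_trusted(TRIPLES, "ex:integrated_table", trusted))
    print(is_trusted(TRIPLES, "ex:scraped_copy", trusted))
```

A real assessment would weigh many more signals (who published the data, when, and through which processing steps), but the same pattern of walking provenance links before accepting a dataset applies.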
Exploitation Route The findings from this work have contributed to the development of several community standards, such as:
1. W3C Health Care and Life Science (HCLS) Linked Data Guide: https://www.w3.org/TR/hcls-dataset/
2. W3C Dataset Descriptions: HCLS Community Profile https://www.w3.org/TR/hcls-dataset/

The research work by the PI has also contributed to several standardisation reports for the W3C Provenance Incubator Group (https://www.w3.org/2005/Incubator/prov/wiki/Main_Page), and the final standardisation of the W3C PROV model (https://www.w3.org/TR/prov-o/).
Sectors Digital/Communication/Information Technologies (including Software),Education,Healthcare

 
Description
- Participant in the movement driving a fast-growing number of life science datasets to be published in linked data format
- Establishment as community leader and expert in provenance research, especially for the Web of Data domain
- Establishment as community leader and expert in applying lightweight semantic web technologies to the building of life science data integration applications
First Year Of Impact 2010
Sector Digital/Communication/Information Technologies (including Software),Healthcare
Impact Types Cultural

 
Description Workflow4Ever: Advanced Workflow Preservation Technologies for Enhanced Science
Amount £2,673,000 (GBP)
Funding ID FP7 270192 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 12/2010 
End 11/2013
 
Title FlyKit 
Description A JavaScript framework that has been used to improve the efficiency of data integration to support gene expression studies. 
Type Of Material Improvements to research infrastructure 
Year Produced 2010 
Provided To Others? Yes  
Impact The tool has also been successfully adapted to support queries across Chinese and western medicine data; see: Integrating findings of traditional medicine with modern pharmaceutical research: the potential role of linked open data. Samwald et al. Chinese medicine 5 (1), 43. 2010
URL https://code.google.com/p/open-biomed/
 
Title open-biomed-data 
Description This is a collection of datasets in RDF format, including those from the following databases: - FlyTED - FlyAtlas - FlyBase - BDGP - EBI GeneAtlas/ArrayExpress - TCM. Having this set of very heterogeneous datasets in RDF format greatly alleviates the data heterogeneity issues that we have to face. This not only makes our sense-making tasks much easier but also produces datasets that can be reused for future data integration tasks. 
Type Of Material Database/Collection of data 
Year Produced 2009 
Provided To Others? Yes  
Impact People have been very keen to learn from our experience of publishing these biological datasets in the novel RDF format. We have worked with other groups to help them with, or give feedback on, their similar tasks. 
URL https://code.google.com/p/open-biomed/w/list
 
Description Cracking the quality puzzle with provenance pieces 
Form Of Engagement Activity Scientific meeting (conference/symposium etc.)
Part Of Official Scheme? No
Type Of Presentation paper presentation
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Invited talk on provenance at the Principles of Provenance Seminar in Dagstuhl, Germany.

N/A
Year(s) Of Engagement Activity 2012
 
Description Linked Data for Biomedical Science: A Tale of Two Success Stories 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact An invited talk on Linked Data for Health Care and Life Science applications at the Talis open day.

follow-up collaboration and network
Year(s) Of Engagement Activity 2010
 
Description Linked Data for Health Care and Life Science Research 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited talk on Linking Open Drug Data at Life Science SIG (Special Interest Group). Washington D.C., USA.

None
Year(s) Of Engagement Activity 2009
 
Description Member of the W3C Provenance Incubator Group 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Member of the W3C Provenance Incubator Group. Awarding Body - World Wide Web Consortium, Name of Scheme - Provenance Incubator Group

A follow-up working group was set up as a result of outcomes from this incubator group.
Year(s) Of Engagement Activity 2009,2010
 
Description Member of the W3C Provenance Working Group 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact A series of recommendations, see http://www.w3.org/2011/prov/

We have organised several very successful outreach activities following the publication of our recommendation series, see http://www.w3.org/2001/sw/wiki/OutreachInformation.

The PROV recommendations have been widely adopted by both academic and commercial organisations, see http://www.w3.org/2001/sw/wiki/PROV#Implementations.
Year(s) Of Engagement Activity 2011,2013
 
Description Member of the World Wide Web Consortium Health Care Life Science Interest Group 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact We have produced several influential journal papers and technical notes. A selection of them can be found below:

- A journey to Semantic Web query federation in the life sciences. Cheung K.H. et al. BMC bioinformatics 10 (Suppl 10), S10. 2009

- Publishing Chinese medicine knowledge as Linked Data on the Web. J Zhao. Chinese medicine 5 (1), 1-12. 2010

- Integrating findings of traditional medicine with modern pharmaceutical research: the potential role of linked open data. Samwald et al. Chinese medicine 5 (1), 43. 2010

- The Translational Medicine Ontology and Knowledge Base: driving personalized medicine by bridging the gap between bench and bedside. Luciano, J.S. et al. Journal of biomedical semantics 2 (Suppl 2), S1. 2011

- Emerging practices for mapping and linking life sciences data using RDF-A case series. Marshall M.S. et al. Web Semantics: Science, Services and Agents on the World Wide Web Vol 14, 2-13. 2012

A lot of these activities are continued in the HCLS interest group, by using and improving the resources and knowledge produced by these activities.
Year(s) Of Engagement Activity 2009,2013
 
Description Open Genomic Data Web 
Form Of Engagement Activity Scientific meeting (conference/symposium etc.)
Part Of Official Scheme? No
Type Of Presentation keynote/invited speaker
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Invited talk on open-biomed at 2009 GMOD Meeting Europe. Oxford.

A follow-up visit was proposed by the GMOD project team to assist with their data management design, but it was declined due to a scheduling clash.
Year(s) Of Engagement Activity 2009
 
Description OpenFlyData, a Data Web for Drosophila 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact An invited talk on openflydata at the Linked Data and Practical Semantic Web Workshop. Oxford.

We attracted a summer student to our lab as a result of the talk.
Year(s) Of Engagement Activity 2009,2010
 
Description Provenance in the Dynamic, Collaborative New Science 
Form Of Engagement Activity Scientific meeting (conference/symposium etc.)
Part Of Official Scheme? No
Type Of Presentation keynote/invited speaker
Geographic Reach National
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Invited talk on provenance at the Provenance Workshop at the Edinburgh e-Science Center.

A follow-up workshop was planned.
Year(s) Of Engagement Activity 2011
 
Description The Web as the Platform for Sharing and Consuming Biomedical Research Data 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Invited talk on open-biomed at Information System Group Seminar, Computing Lab, Oxford University. Oxford.

Follow up inter-group seminars.
Year(s) Of Engagement Activity 2010
 
Description Using the Web as the Platform for Sharing and Consuming Biomedical Data 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Invited talk on open-biomed at the Workshop on the Influence and Impact of Web 2.0 on Various Applications. Edinburgh.

A book was planned around the workshop topic.
Year(s) Of Engagement Activity 2010