COpenPlantOmics (COPO): a Collaborative Bioinformatics Plant Science Platform

Lead Research Organisation: EMBL - European Bioinformatics Institute
Department Name: Ensembl Genomes

Abstract

We live in a digital age where we increasingly rely on interconnected resources in our daily lives. Biological science, due to the very nature of the complexity of worldwide research avenues, is typically fragmented. Even though scientific information is published in peer-reviewed articles, it is often badly described and, until very recently, often unavailable to the general public because of journal licensing issues and expensive subscription costs.

The field of bioinformatics (the analysis and management of biological data using computational methods) produces many freely available tools for data analysis and exposure that are incredibly useful to researchers. However, these tools often do not interoperate well, so great effort is spent converting or tweaking datasets to fit other tools further along a bioinformatics process, which hinders timely, accurate and reusable research. Couple this with the lack of descriptive information noted earlier, and knowledge that can be vital to one researcher, team or community can become at best unreproducible (thus preventing others from confirming findings) and at worst unusable.

Life scientists are people focused on investigating biological processes. This requires a lot of time, effort and fastidiousness in experimental observation, data collection and analysis. Typically for life scientists, more time is spent on the former: defining and publishing experimental methods and results. The latter, i.e. the data behind these results, is usually badly defined and largely unpublished. For computer scientists, the story is reversed - the focus is on getting to the data. This platform will bridge the gap between these two groups by providing tools and training to both life and computer scientists in the plant bioscience field, in order to help them get their data into the right formats and described uniformly for open research.

To do this, the management, interoperability and curation of scientific datasets are key. Researchers need clear guidance and help to:

- Manage their data in a concise relevant way that allows immediate reuse by others: Generating data is only one part of the picture. To back up scientific findings, data needs to be made available to others to allow the same degree of rigour and peer review that is enforced for published material. This is not an easy task because the tools and resources required to describe data well and to make data available are typically designed for the computer scientist.
- Analyse their data easily: Large software development projects like Galaxy provide access to complex analytical tools - we are not aiming to reinvent the wheel in this regard. We aim to engage and collaborate with these existing providers to develop and exploit interfaces to these specialised software projects, so as to let descriptive tools and analytical tools communicate efficiently.

This project will address these issues directly, providing tools for storing, annotating and sharing valuable information as well as promoting clear guidance and training. Overall this promises to be a major boost to UK plant sciences research.

This project aims to promote and build links between scientific knowledge and the tools used to generate that knowledge, addressing the lack of descriptive information about underlying data. By doing so, we will provide a platform comprising both existing tools and novel interoperability processes, allowing researchers easy access to methods of describing their work, feeding directly into analytical software, thus promoting clear and robust best practices in science.

Open science is vital to future generations of researchers, especially to realise the goals of transparent knowledge sharing. This project will remove the barriers that restrict researchers in making their findings freely available to everyone in a consolidated, seamless, easy-to-use fashion.

Technical Summary

Accessibility to biological data has been hindered by a lack of standards, a lack of awareness of the benefits of, and pathways to, releasing data described by those standards, and a lack of services whereby data can be analysed, published and retrieved easily. Recently, there has been a large commitment by the BBSRC to push for open access data and publishing to further bioscience research in the UK. However, barriers still prevent scientists from openly depositing their data and metadata, chiefly a lack of interoperability between metadata annotation services, data repositories, data analysis platforms and data publishing platforms. As a result, plant scientists might not: be aware that the services exist; have the expertise to use them; or see the value in properly describing their data.
This project aims to build COPO, the software infrastructure required to reach the level of interoperability that plant researchers need to describe their data using community-recognised ontologies, move data seamlessly in both directions to and from relevant repositories, and then publish these data for open access. COPO will manage the hardware infrastructure at TGAC to deliver a consistent, robust staging area and database that will support unique accessioned artefacts representing the corpus of data and metadata a user wants to expose. The resulting marked-up datasets processed and published using COPO will allow greater potential for integrative analysis using existing tools such as iPlant and Galaxy.
New Application Programming Interfaces (APIs) will interconnect existing tools and services, and by developing new RESTful user interfaces that wrap up these APIs, COPO will be a single point-of-entry for plant researchers to disseminate their data all the way from generation to publication. By federating the TGAC iRODS data grid system with others, e.g. Texas Advanced Computing Center's iPlant installation, access to worldwide analytical infrastructure and data will be facilitated.

Planned Impact

Impact Summary
Academic, Economic and Commercial Impacts
With the renewed interest and push from all areas of bioscience to promote publicly available research, the COPO project will be a pioneering national and international effort to facilitate sharing of all aspects of plant research with the public. COPO aims to be the vehicle to bring together the tools required to harmonise open plant omics research. This sector has obvious ties with industry. Public domain omics-based bioscience is a relevant and important input into industry internal research and discovery activities. To make such bioscience data truly reusable and ensure scientific robustness, it must be uniformly annotated. Uniform annotation allows integration through equivalence of terminology, increases efficiency in data production and re-use, and allows correct interpretation by means of the context provided by the metadata. A collaborative platform for frictionless bioinformatics built with and for the academic and industrial community is long overdue. Alongside data processing, industry also works on finding solutions for integration and management of large 'omics data sets, e.g. efforts like the Pistoia Alliance. Together with COPO industry partners (Eagle Genomics) we will develop use-cases for the platform in industry, propose acceptance criteria required for commercial use, supply technical advice/support on meeting acceptance criteria, evaluate the platform on 3rd party infrastructure, and maximise knowledge exchange and commercialisation.

COPO and the standards community
Expertise and knowledge gained throughout the lifetime of the project and beyond will be disseminated through a variety of channels. The presence of a direct link with the plant science community (through GARNet, UK Plant Sciences Federation (UKPSF)) is key to the success and adoption of the platform and associated standards. The project will have a continuous dialogue, through face-to-face events as well as online tools and social media, between those working on the platform and the plant bioscience community. The several letters of support show a clear interest in working together, using and adopting a platform that implicitly confers standards compliance. COPO will provide a solution to overcome the challenges in standards fragmentation by (i) fostering development, acceptance and implementation of reporting standards that are immediately suitable for plant research, and (ii) limiting the range and variability of standards. This will have a direct impact on the development and maintenance costs for commercial and academic software developers of standards-compliant products.

Societal impacts
Historically there has been reluctance to adopt some of the standards and open-data principles in the plant bioscience community, especially in the field of food sustainability and security, so openness and transparency in these areas are vital to continue improving the public perception. The presentation of the research data will play a key role in opening the dialogue with the general public and will contribute to the development of stronger links with sectors in society (such as school teachers) that are less familiar with the scientific activities in plant research and the beneficial impact this has on their lives. It is widely recognised that the shortage of expertise and skill in biomathematics and informatics across the world is a major risk to the future development of key areas in life sciences. The objectives of this proposal will help to attract talented staff to work with the COPO partners, and offer alternative career paths.

Publications

Description We have gathered substantial information from the community about the relevant metadata related to their experiments; about data standards in use for diverse experimental data being generated by various research communities; and about how these concepts map onto the metadata collected by the major database archives. We have used this information to find generic common factors and to develop data submission and validation tools to ease the capture and archiving of plant omics data. These tools have been pre-released at https://copo-project.org; the code and documentation can be found at https://github.com/collaborative-open-plant-omics. At the end of 2019 COPO was feeding archives such as ENA, figshare, DSpace, CKAN and Dataverse, with nearly 50 institutional users and a total volume of 10 TB of brokered data. We have also worked with the community to formalise the standards for plant-related metadata (MIAPPE) and crop ontologies, which are being integrated into the submission tool.
We have completed an indexing project to automatically search for EBI plant samples, find their associated data files (across archives such as ENA, EVA and ArrayExpress) and output them in JSON format at ftp://ftp.ensemblgenomes.org/pub/misc_data/plant_index. The code has recently been updated to cope with changes to the ENA API and can be found at https://github.com/EnsemblGenomes/ebi_plant_index. The indexed data is now regularly imported into INRAE's Genetic and Genomic Information System at https://urgi.versailles.inra.fr/faidare, which allows users to search for germplasm and plant phenotype experiments across several plant breeding institutes.
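The indexing step described above (collect the data files linked to each plant sample across several archives, then write them out as JSON) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the ebi_plant_index repository: the record keys (`sample`, `archive`, `file`), the function names and the example records are all hypothetical, and no archive API is called.

```python
import json
from collections import defaultdict


def build_sample_index(records):
    """Group file records by sample, then by archive.

    Each record is a dict with the hypothetical keys
    'sample', 'archive' and 'file' (assumed for illustration).
    """
    index = defaultdict(lambda: defaultdict(list))
    for record in records:
        index[record["sample"]][record["archive"]].append(record["file"])
    # Convert to plain dicts with sorted keys and file lists so the
    # JSON output is stable between runs.
    return {
        sample: {archive: sorted(files) for archive, files in sorted(archives.items())}
        for sample, archives in sorted(index.items())
    }


def to_json(index):
    """Serialise the index as indented JSON text."""
    return json.dumps(index, indent=2, sort_keys=True)


# Made-up records standing in for results gathered from each archive.
example_records = [
    {"sample": "SAMPLE_A", "archive": "ENA", "file": "reads_2.fastq.gz"},
    {"sample": "SAMPLE_A", "archive": "ENA", "file": "reads_1.fastq.gz"},
    {"sample": "SAMPLE_A", "archive": "EVA", "file": "variants.vcf.gz"},
    {"sample": "SAMPLE_B", "archive": "ArrayExpress", "file": "expression.tsv"},
]

if __name__ == "__main__":
    print(to_json(build_sample_index(example_records)))
```

Keying the output by sample rather than by archive means a consumer that imports the index can look up everything known about one sample in a single step, whichever archive holds each file.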
Exploitation Route The generic tools in development as COPO can be configured to meet the needs of other research communities, allowing a single technological solution to be deployed in any domain with customised complex experiments that generate multiple data types, are deposited in different persistent archives and are subject to different formalised standards. The MIAPPE standards implemented in COPO have potential application in any other software or database handling the same data types.
Sectors Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software)

 
Description ENA Facilities Day 2019 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Reviewed current data exchanges between Ensembl Plants, the European Nucleotide Archive (ENA) and ArrayExpress and discussed problems plant community members face when submitting new genomic data to archives such as the ENA. The primary audience was teams involved in submissions of biological sequences. The most important impact was to raise awareness of the challenges of large plant genomes such as wheat and barley, which require different cut-offs.
Year(s) Of Engagement Activity 2019
URL https://www.ebi.ac.uk/ena/support/facilities-day
 
Description EU-China expert seminar on identifying potential joint priorities for research and innovation in food, agriculture and biotechnology 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact I participated in an EU-China expert seminar on identifying potential joint priorities for research and innovation in food, agriculture and biotechnology, designed to identify future priorities for joint funding schemes based on the direction of current research.
Year(s) Of Engagement Activity 2016
 
Description Elixir meeting attendance
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Undergraduate students
Results and Impact Introduced the EBI plant sample indexing proposal to other Elixir plant nodes (breakout meeting with slides). First connection made with the Italian node and their variation study on common apple cultivars.
Year(s) Of Engagement Activity 2018
URL https://www.elixir-europe.org/events/elixir-all-hands-2018
 
Description Marc Rossello attended "PhenoHarmonis Phenotyping Workshop"
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Meetings with Elixir plant nodes and CGIAR community about MIAPPE usage and scope
Year(s) Of Engagement Activity 2018
URL https://bit.ly/2TnKXnL
 
Description Participation in meeting on Plant genetic resources and SDGs: needs, rights and opportunities
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact The sharing of biological data related to plant genetic resources, and ensuring that the benefits from this sharing are equitably distributed throughout the world, are a matter of important societal concern. A meeting of interested parties was convened to advise the DivSeek organisation, which had been asked to prepare a position paper for the secretariat of the International Treaty on Plant Genetic Resources on behalf of a number of organisations involved in the generation, management and usage of such data. Publications aimed at other audiences are also expected to result from this meeting.
Year(s) Of Engagement Activity 2016
URL http://www.divseek.org/news/
 
Description Presentation at the Conference "The Future of Science: The Digital Revolution: What is changing for humankind" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Undergraduate students
Results and Impact A presentation at a conference attended mostly by undergraduate and high-school students, focused on far-reaching changes in scientific practice.
Year(s) Of Engagement Activity 2016
URL http://www.futureofscience.org/press/first-world-conference-on-the-future-of-science-science-and-soc...