Bioinformatics tools for plant genetic resources

Lead Research Organisation: University of East Anglia
Department Name: Computing Sciences

Abstract

Modern agriculture needs crop varieties with improved performance for the consumer (e.g. flavour, shape, texture) and the producer (e.g. high yield, resistance to pests), and reduced environmental impact (e.g. lower fertiliser or pesticide input). These developments are all possible using conventional breeding backed by modern biotechnology, without the need for genetically modified (GM) plants. The improved properties are found in 'genebanks', which are collections of thousands of plant samples taken from the wild or from old crop varieties, together with the many cultivars resulting from decades of selective breeding around the world. The problem with harnessing this potentially useful biodiversity in future breeding programmes is working out which samples to use. The solution is to 'genetically fingerprint' every sample and take accurate measurements of all the useful properties mentioned above. In principle, these experiments can tell us which plants are likely to carry useful genes. However, this huge quantity of information remains difficult to use (hundreds of measurements in thousands of samples means millions of data points), because improvements in computer databases to store, analyse and display the results have lagged behind our ability to do the lab experiments. This project proposes to bridge that gap by developing a powerful, versatile and accessible computer database and associated computational tools, which can be applied to data collected from crop plant genebanks to identify promising plant samples for further experimental analysis. All of these computational resources will be freely available to the world's genetic resources community.

Technical Summary

Efficient utilisation of plant genetic resources (gene banks) requires versatile, powerful databases for storing, accessing and combining the wide variety of data that are becoming available in rapidly increasing amounts. We have developed a functioning database, GERMINATE, which can accommodate a wide variety of data types, from descriptive (morphology, geography) to molecular (DNA sequence, marker scores, map position etc.). We now seek funds to complete its development into an integrated data and analytical resource for the world's plant genetic resources community. Currently, the GERMINATE database stores passport and multi-crop descriptor data, together with data for every popular molecular marker type except single nucleotide polymorphisms (SNPs). We propose to extend this capability to include SNP data, in a format that is acceptable to the world's plant genetic resources and genomics communities. We will also deploy an ontology module, which will provide standard nomenclatures for phenotypes, developmental stages and mutant or disease ontologies for crop plants, allowing rational searching for these previously inaccessible characters. Additionally, we will increase the functionality of GERMINATE by greatly expanding the number of linked, web-accessible bioinformatic tools, including the existing suites STRUCTURE (for deducing and visualising the population structure of germplasm), TASSEL (for tree drawing and linkage disequilibrium estimation) and DIVA-GIS (for visualising geographical data associated with accessions). A new set of tools will also be designed and developed, including GERMANE (for managing workflows of multiple, chained analytical routines), CORE (for management of genetic resources, including identification of core collections in response to user requirements), and NETWORK (for analysing non-treelike evolution via introgression, using a marker model-based approach). Lastly, the GERMINATE web interfaces will be improved to allow easier and more powerful uploading, retrieval and analysis of the data.
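
As a rough illustration of what "combining" descriptive and molecular data per accession can look like, the following Python sketch links passport, descriptor and marker data in one record and then queries across them. It is not GERMINATE's actual schema or API: the class, field names, descriptor name, marker names and example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Accession:
    """Hypothetical record linking passport, descriptor and marker data."""
    accession_id: str                      # passport: identifier (made up)
    country: str                           # passport: collection country code
    descriptors: Dict[str, float] = field(default_factory=dict)   # phenotype measurements
    marker_scores: Dict[str, str] = field(default_factory=dict)   # marker name -> allele score


def select_by_descriptor(accessions: List[Accession], name: str,
                         minimum: float) -> List[Accession]:
    """Keep accessions whose descriptor `name` is recorded and >= minimum."""
    return [a for a in accessions
            if name in a.descriptors and a.descriptors[name] >= minimum]


def marker_table(accessions: List[Accession],
                 markers: List[str]) -> Dict[str, List[str]]:
    """Build accession_id -> marker scores, using 'NA' for missing scores."""
    return {a.accession_id: [a.marker_scores.get(m, "NA") for m in markers]
            for a in accessions}


# Invented example data, for illustration only.
EXAMPLE_ACCESSIONS = [
    Accession("ACC001", "ETH", {"plant_height_cm": 95.0}, {"M1": "A", "M2": "B"}),
    Accession("ACC002", "SYR", {"plant_height_cm": 80.0}, {"M1": "B"}),
    Accession("ACC003", "TUR", {"plant_height_cm": 110.0}, {"M1": "A", "M2": "A"}),
]

tall = select_by_descriptor(EXAMPLE_ACCESSIONS, "plant_height_cm", 90.0)
tall_markers = marker_table(tall, ["M1", "M2"])
```

Here a phenotype filter (tall plants) is applied first and the marker scores of the surviving accessions are then tabulated, which is the kind of cross-data-type query the summary describes.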

 
Description Improvements in computer databases to store, analyse and display genebank data have lagged behind our ability to do the lab experiments. This project helped to bridge that gap by developing a computational tool which can be applied to data collected from crop plant genebanks to identify promising plant samples for further experimental analysis.

In particular, we developed Core Collection Detector (CCD), a software tool written in the Java programming language at UEA in collaboration with JIC. CCD facilitates core subset selection analysis of plant genetic resources. It incorporates most (if not all) of the popular core subset selection strategies identified from the published literature (e.g. the C, P, L and M strategies) and provides interfaces to more computationally intensive stand-alone tools (e.g. PowerCore and CoreHunter). In addition, it incorporates four new core subset selection strategies developed within this project (the PD_T, PD_N, GREEDY_H and GREEDY_SH strategies). The PD_T and PD_N strategies are based on algorithms for computing phylogenetic diversity relative to phylogenetic trees and networks. PD_T uses a greedy algorithm to choose a core subset from a germplasm collection, taking into account factors such as the geographic locations of collected accessions. Simple evaluation methodologies have also been implemented within CCD, which allow different strategies to be evaluated and compared. The CCD tool is freely available for download.
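
The idea of greedily choosing a core subset by phylogenetic diversity on a tree can be sketched as follows. This is a minimal Python illustration, not CCD's Java implementation of PD_T: it assumes a rooted tree with branch lengths, takes the diversity of a subset to be the total branch length on the paths from the chosen leaves to the root, and ignores geographic and other factors. The toy tree, the function names and the tie-breaking rule are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy tree: node -> (parent, length of the branch to the parent).
Tree = Dict[str, Tuple[Optional[str], float]]

TOY_TREE: Tree = {
    "root": (None, 0.0),
    "X": ("root", 1.0),
    "Y": ("root", 0.5),
    "A": ("X", 1.0),
    "B": ("X", 2.0),
    "C": ("Y", 3.0),
    "D": ("Y", 0.5),
}


def path_branches(tree: Tree, leaf: str) -> Set[str]:
    """Return the nodes whose parent branch lies on the path leaf -> root."""
    branches: Set[str] = set()
    node = leaf
    while tree[node][0] is not None:
        branches.add(node)
        node = tree[node][0]
    return branches


def greedy_pd_core(tree: Tree, leaves: List[str],
                   k: int) -> Tuple[List[str], float]:
    """Pick up to k leaves, each time adding the leaf that contributes the
    most branch length not already covered. Ties go to the leaf that comes
    first alphabetically (an arbitrary choice made for this sketch)."""
    covered: Set[str] = set()
    core: List[str] = []
    total = 0.0
    remaining = sorted(leaves)
    for _ in range(min(k, len(remaining))):
        best, best_gain = remaining[0], -1.0
        for leaf in remaining:
            new_branches = path_branches(tree, leaf) - covered
            gain = sum(tree[b][1] for b in new_branches)
            if gain > best_gain:
                best, best_gain = leaf, gain
        core.append(best)
        covered |= path_branches(tree, best)
        total += best_gain
        remaining.remove(best)
    return core, total


core, diversity = greedy_pd_core(TOY_TREE, ["A", "B", "C", "D"], 2)
```

On the toy tree, the first pick is the leaf with the longest path to the root, and the second pick is the leaf that adds the most branch length not already covered by the first.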
Exploitation Route The CCD tool can be used by researchers to help them select core collections of diverse taxa (e.g. from germplasm collections) that are intended to capture the genetic diversity of the input dataset.
Sectors Agriculture, Food and Drink

URL https://www.uea.ac.uk/computing/ccd-core-collection-of-diverse-taxa-
 
Description It has been used to understand the genetic structure of germplasm collections.
First Year Of Impact 2008
Sector Agriculture, Food and Drink
Impact Types Economic

 
Title CCD (Core collection of diverse taxa) 
Description CCD is a Java software package for the selection of core collections of diverse taxa (e.g. from germplasm collections) that are intended to capture the genetic diversity of the input dataset. 
Type Of Technology Software 
Year Produced 2009 
Impact No actual Impacts realised to date 
URL http://www.mybiosoftware.com/phylogenetic-analysis/10672