Big Data Infrastructure for Crop Genomics

Lead Research Organisation: Earlham Institute
Department Name: Research Faculty

Abstract

Recent advances in sequencing technologies and computational tools have made it possible to sequence the genomes of some of the world's most important crop species, such as rice, barley, rapeseed, maize, soya and wheat. These crops constitute a substantial part of the daily food intake for most of the population of the world and any improvements in the breeding for more efficient and nutritious varieties will have a direct impact on ensuring global food security.

Whilst obtaining the genome sequences for these crops provides a hugely useful resource for giving insights into the differences between species, it is sequencing different individuals from the same or closely related species that allows us to identify useful genetic variants which can be selected for during plant breeding. These approaches require a combination of sequence and phenotypic data, plus analysis tools. We propose to develop a crop bioinformatics platform which enables users to access this genetic and phenotypic variation and perform analyses to explore gene expression and associations between genetic variation and traits.

The platform will be developed using open source principles and publicly available data. Population-wide genetic variants will be represented on a genomic data structure; an archiving system for storing plant phenotype data will be developed; tools to allow the querying of these datasets and analyses to link genotype to phenotype will be implemented; and the platform will be accessible via TGAC and EBI servers but also packaged into a virtual machine for easy installation on users' local hardware. This novel platform for crop bioinformatics will promote opportunities for collaborative work with R&D groups in industry, research and academia. The availability of data generated by publicly funded resources, and the concomitant development of new, production-quality tools will lower the barriers to information-enabled crop science, stimulating new opportunities for research and application. The platform will also open up new opportunities for the UK bioinformatics community, traditionally focused on biomedical applications, by developing alternative career paths around biotechnology and agri-food.

Technical Summary

We propose the strategic development and deployment of a bioinformatics platform to enable genomics research in crop science. The platform will directly address the needs of the scientific community by integrating and facilitating the use of available genomics and bioinformatics resources, and will be developed in collaboration with a broad user base including plant biologists, geneticists and crop breeders.

The main output of this initiative will be an infrastructure to accommodate data from the large-scale genomic resequencing projects that are already underway within the plant research community, for the model species Arabidopsis thaliana and for crop species such as rice (whole genome), brassica (transcriptome), wheat and barley (exomes and genotyping-by-sequencing). This is an area of active research in crop genomics as a direct consequence of the availability of novel inexpensive sequence-based genotyping technologies. We propose to develop a suite of tools (extending from existing software where possible) and an application programming interface (API) to interact with genomic representations of population-derived sequences. Tools will range from simple querying mechanisms to the implementation of more advanced expression and association analyses. We will also develop infrastructure to enable the archival and querying of plant phenotypic data, using existing ontological terms and building on the software developed by the International Mouse Phenotyping Consortium.

The platform will be accessible via servers located at TGAC and EBI, and will also be available as a virtual machine for local installation.

Planned Impact

The recent advances in data-generating technologies have opened a gap between the ability to generate data and the capacity to effectively store and analyse them. The objectives set for the infrastructure we propose to develop will directly target this issue by contributing solutions in areas of research relevant to the BBSRC in food security, bioenergy and biology underpinning health.

Academic, Economic and Commercial Impacts
The development of the platform will generate new opportunities for collaborative work with academic institutions and with industry R&D groups working in crop breeding. TGAC and EBI are members of large international consortia such as the wheat (IWGSC) and barley (IBSC) genome sequencing projects. The transformative effect of the availability of large diversity datasets is one of the main drivers supporting next-generation crop breeding programmes. One example is the effect that genomics-assisted methods will have on breeding for disease resistance traits. The availability of data generated by the public sector and the translation of the research tools into production pipelines will have a direct impact on the generation of new service-based business.

The most important traits, such as yield and drought tolerance, generally involve multiple genes, identified through Quantitative Trait Loci (QTLs), and complex interactions with the environment. High-density molecular markers are one of the most important tools for informing the characterisation of complex agricultural traits and the design of sophisticated breeding strategies (e.g. genomic selection). This initiative is focused on the development of a data infrastructure to support these kinds of datasets.

Societal impacts
The development and availability of the infrastructure for crop bioinformatics will directly impact the local community with the generation of new jobs and funding opportunities. TGAC's presence in the Norwich Research Park and EBI in the Cambridge area have strengthened the position of the region as a technology hub hosting specific expertise in informatics applied to life sciences and biotechnology. This will create new opportunities around the development of services in genomics and bioinformatics, which will translate into job opportunities. We also expect this development will bring a renewed interest in the application of genomics and bioinformatics to areas of agriculture and biotechnology research.

Policy: BBSRC, research councils and UK
A direct consequence of the implementation of this initiative will be to position TGAC and EBI as international leaders in informatics for crops research. Around this, we expect the emergence of a high-class scientific base in computational research placing the UK in a unique position in a future where technology, data and multidisciplinary work will be the common denominators. This is aligned with the general principles set by the UK Agri-Tech Strategy which emphasises the importance of using scientific knowledge to drive agricultural innovation.

Publications

Aken BL (2017) Ensembl 2017. in Nucleic acids research

Aken BL (2016) The Ensembl gene annotation system. in Database : the journal of biological databases and curation

Bevan MW (2017) Genomic innovation for crop improvement. in Nature

Cunningham F (2015) Improving the Sequence Ontology terminology for genomic variant annotation. in Journal of biomedical semantics

Hoopen PT (2016) Plant specimen contextual data consensus. in GigaScience

Kersey PJ (2016) Ensembl Genomes 2016: more genomes, more complexity. in Nucleic acids research

Yates A (2016) Ensembl 2016. in Nucleic acids research

 
Description Software was developed for the submission, identification, genomic alignment, organisation and dissemination of RNA-seq data. Using this software, large quantities of RNA-seq data, which previously were relatively inaccessible in the sequence archives, have been identified, annotated, aligned to genomes, and made available (in "Track hub" format) through Ensembl Plants. The pipeline is in continuous operation and allows us to automatically detect newly submitted experiments and process these accordingly. Ensembl Plants currently exposes 1,258 track hubs (one for each study submitted to the European Nucleotide Archive) comprising a total of 21,452 tracks (one for each 'Run' in ENA) for 42 of the 44 plant species currently represented in the resource.

Two wheat genomes have also been sequenced (Robigus and Claire) enabling pan-genome comparisons for wheat:
https://wheatis.tgac.ac.uk/grassroots-portal/blast
In addition, a tool for performing genome-wide association to link genetic variants to traits of interest has been implemented on CyVerse, an accessible high-performance computing platform:
http://cyverseuk.org/applications/gwasser/
Exploitation Route Large quantities of RNA-seq data which were previously inaccessible are now available through Ensembl Plants, providing easy access to expression data for 42 plant species. The additional wheat genomes are available to breeders and researchers for exploration of variety-specific regions and identification of conserved regions. The GWAS tool is available on CyVerse and can be used to perform simple association studies. The approach is adapted from a technique used at NIAB to test the feasibility of a genome-wide association study of the awning phenotype in wheat (Triticum aestivum) on lines of their MAGIC population.
Sectors Agriculture, Food and Drink

 
Description We have used the infrastructure developed to produce almost 40,000 RNA alignment tracks from over 1,600 experiments for over 40 species, and these data are available in Ensembl Plants. Ensembl Plants had over 150,000 unique visitors (by IP address) in 2017 and users are known to include plant breeders and herbicide/fungicide developers. In conjunction with BBSRC LoLa funding, this project also contributed to the sequencing of eight wheat varieties, two of which, Robigus and Claire, are fully assembled and available to the community under the terms of the Toronto agreement.
First Year Of Impact 2017
Sector Agriculture, Food and Drink
Impact Types Economic

 
Description Wheat pan-genomics
Geographic Reach Multiple continents/international 
Policy Influence Type Membership of a guideline committee
Impact We described the EI technological advances, and our plans to use these for improved wheat breeding, to interested parties from developed and developing countries who wish to improve their wheat breeding programmes by sequencing key accessions from their countries. This has led to the establishment of a large multi-national collaboration with substantial cross links and obvious synergies. The worry prior to this meeting was that there would be disparate efforts using different technological approaches, which would prevent comparison and meta-analysis and lead to a split in wheat genetics/genomics. This has been avoided and we expect great progress to be made in short time frames.
 
Description BBSRC responsive mode
Amount £1,504,558 (GBP)
Funding ID BB/P010768/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 04/2017 
End 03/2020
 
Title Barley genome database 
Description The barley genome and associated data (including polymorphism data, of potential interest to breeders) have been incorporated within the Ensembl Plants database, which offers access to genomic data from a large number of plant species.
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? Yes  
Impact Ensembl Plants (as a whole) had a total of 156,037 unique visitors in 2017. 
URL http://plants.ensembl.org/Hordeum_vulgare/Info/Index
 
Title Sequencing and assembly of Claire, Paragon, Robigus, Cadenza and Weebil hexaploid wheat lines 
Description Sequencing and assembly of five hexaploid wheat cultivars: four UK elite lines (Claire, Paragon, Robigus and Cadenza) and one Mexican (CIMMYT) line (Weebil).
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
Impact Large dataset covering more than 50% of UK genetic diversity; the first Mexican line (heat and drought tolerant) to be made publicly available.
URL https://www.ebi.ac.uk/ena/browser/view/PRJEB35709
 
Title Wheat genome database 
Description Data from the bread wheat Triticum aestivum and related species have been included in Ensembl Plants, an integrative resource offering access to genome-scale data from a variety of plant species. Successive improvements to the genome sequence have been accommodated, including polymorphism data important for breeding (increasingly, in collaboration with CerealsDB) and the incorporation of the new TGAC 1.0 genome assembly. The database was developed originally with funding from BB/J00328X/1 and development has continued with funding from BBS/E/J/000PR9782.
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? Yes  
Impact The data structure, analysis pipelines and visualisation interfaces have been developed to accommodate polyploid crops in general, of which bread wheat is our first example. Ensembl Plants (as a whole) had 156,037 unique visitors in 2017.
URL http://plants.ensembl.org/Triticum_aestivum/Info/Index
 
Title GWASer - an iPlant/CyVerse Genome Wide Association Study tool for MAGIC populations
Description Standard tools for mapping genetic traits assume a bi-parental cross, or a (largely) unstructured population. Multi-parent Advanced Generation Inter-Cross (MAGIC) populations are neither of these and need a specific tool. The NIAB eight-parent UK Recommended List wheat MAGIC population has nearly 1000 offspring, and is a powerful tool to understand wheat genetics. We have used the CyVerse cloud infrastructure to give wheat geneticists access to high-performance computing and existing genotype data so that, using their own phenotype data and our tool, they can find loci associated with their traits.
Type Of Technology Webtool/Application 
Year Produced 2016 
Impact This has lowered the bar for breeders to use the MAGIC populations without a powerful compute cluster of their own or an expert quantitative geneticist. Other communities, e.g. brassica, are also looking at this tool for their uses.
URL https://de.cyverse.org/de/
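
As a rough illustration of the kind of simple association analysis mentioned in this record (not the GWASer implementation, which is not documented here), a single-marker test can be sketched as a least-squares regression of phenotype on allele dosage. This simple form ignores the multi-parental structure of a MAGIC population, which is why a dedicated tool is needed. The function name and the example data are hypothetical.

```python
def marker_effect(dosages, phenotypes):
    """Regress phenotype on allele dosage (0, 1 or 2) for one marker.

    Returns (slope, r_squared). Illustrative only: no correction for
    population structure, founder haplotypes or multiple testing.
    """
    if len(dosages) != len(phenotypes) or not dosages:
        raise ValueError("dosages and phenotypes must be non-empty and equal length")
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_p = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((p - mean_p) ** 2 for p in phenotypes)
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(dosages, phenotypes))
    if sxx == 0 or syy == 0:
        # Monomorphic marker or constant phenotype: no association can be estimated.
        return 0.0, 0.0
    slope = sxy / sxx              # estimated effect per allele copy
    r_squared = sxy ** 2 / (sxx * syy)  # share of phenotypic variance explained
    return slope, r_squared


# Hypothetical data: six lines scored at one marker and for one trait.
example_dosages = [0, 1, 2, 0, 1, 2]
example_phenotypes = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
example_slope, example_r2 = marker_effect(example_dosages, example_phenotypes)
```

In practice each marker would be tested in turn and the results corrected for multiple testing; those steps are left out here to keep the sketch short.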
 
Title RNA sequencing processing pipeline 
Description Software was developed for the submission, identification, genomic alignment, organisation and dissemination of RNA-seq data. 
Type Of Technology Software 
Year Produced 2016 
Impact Large quantities of RNA-seq data, which previously were relatively inaccessible in the sequence archives, have been identified, aligned to genomes, and made available (in "Track hub" format) through Ensembl Plants. The pipeline is in continuous operation and allows us to automatically detect newly submitted experiments and process these accordingly. Ensembl Plants currently exposes 1,258 track hubs (one for each study submitted to the European Nucleotide Archive) comprising a total of 21,452 tracks (one for each 'Run' in ENA) for 42 of the 44 plant species currently represented in the resource.
URL http://plants.ensembl.org
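
The organisation described in this record (one track hub per ENA study, one track per ENA run) can be sketched as a simple grouping step. This is a minimal illustration only, not the pipeline's actual code. The function names and the accession strings are hypothetical.

```python
from collections import defaultdict


def group_runs_into_hubs(runs):
    """Group (study_id, run_id) pairs into one hub per study.

    Each hub is a list of run identifiers, one track per run.
    """
    hubs = defaultdict(list)
    for study_id, run_id in runs:
        hubs[study_id].append(run_id)
    return dict(hubs)


def count_hubs_and_tracks(hubs):
    """Return (number of hubs, total number of tracks)."""
    return len(hubs), sum(len(tracks) for tracks in hubs.values())


# Hypothetical placeholder accessions, not real ENA identifiers.
example_runs = [
    ("STUDY_A", "RUN_1"),
    ("STUDY_A", "RUN_2"),
    ("STUDY_B", "RUN_3"),
]
example_hubs = group_runs_into_hubs(example_runs)
example_counts = count_hubs_and_tracks(example_hubs)
```

Counting in this way is how totals such as the numbers of hubs and tracks quoted in this record could be derived from a list of processed runs.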
 
Description EU-China expert seminar on identifying potential joint priorities for research and innovation in food, agriculture and biotechnology 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact I participated in an EU-China expert seminar on identifying potential joint priorities for research and innovation in food, agriculture and biotechnology, designed to identify future priorities for joint funding schemes based on the direction of current research.
Year(s) Of Engagement Activity 2016
 
Description Illumina users group meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Presentation: "Plant genomics - assembling genomes and understanding haplotypes"
Year(s) Of Engagement Activity 2018
 
Description Invited presentation in the C3BI seminar series held at the Institut Pasteur, Paris.
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact An invited presentation was given in the C3BI seminar series held at the Institut Pasteur, Paris.
Year(s) Of Engagement Activity 2017
URL https://c3bi.pasteur.fr/seminars-tba-non-vertebrate-genomics/
 
Description Our Broken Planet - NHM exhibition 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact I worked with the Natural History Museum team to contribute to the exhibition "Our broken planet: How we got here and ways to fix it", which was open from May to August 2022 but was planned months in advance. Specifically, I contributed to the design of a cabinet explaining how modern agriculture is carried out and how research, particularly in genetics, is working to maximise yields and minimise environmental impacts. The cabinet contained different types of wheat, and in a recording I explained how they differ morphologically and genetically, and how we are seeking to breed new varieties of wheat that use less water, pesticides and fungicides by developing better wheat genetics.
Year(s) Of Engagement Activity 2022
URL https://www.nhm.ac.uk/visit/our-broken-planet.html
 
Description Participation in meeting on Plant genetic resources and SDGs: needs rights and opportunities 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact The sharing of biological data related to plant genetic resources, and ensuring that the benefits from this sharing are equitably distributed throughout the world, are a matter of important societal concern. A meeting of interested parties was convened to advise the DivSeek organisation, which had been asked to prepare a position paper for the secretariat of the International Treaty on Plant Genetic Resources on behalf of a number of organisations involved in the generation, management and usage of such data. Publications aimed at other audiences are also expected to result from this meeting.
Year(s) Of Engagement Activity 2016
URL http://www.divseek.org/news/
 
Description Participation in meeting on Wheat Genomic Resources in the Post-Reference Genome Era 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation to a meeting convened by the Wheat Initiative to identify future priorities for international wheat research.
Year(s) Of Engagement Activity 2016
URL http://www.wheatinitiative.org/events/wheat-genomic-resources-post-reference-sequence-era
 
Description Poster at PAG 2016 Conference - A novel strategy to assemble the hexaploid wheat genome: beyond Chinese Spring and towards a pan-genome 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Poster at the Plant and Animal Genomes Conference - 11 January 2016 - San Diego, CA, USA

The aim was to discuss the work with peers across the field.
Year(s) Of Engagement Activity 2016
URL https://pag.confex.com/pag/xxiv/webprogram/Paper21578.html
 
Description Poster at PAG 2016 Conference - Wheat Research at TGAC 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Poster at Plant and Animal Genome Conference - January 2016 - San Diego, CA, USA

At The Genome Analysis Centre (TGAC), we are focused on decoding the wheat genome and understanding how variation within and across genomes affects the plant and its interaction with the environment.
Year(s) Of Engagement Activity 2016
URL https://pag.confex.com/pag/xxiv/webprogram/Paper21311.html
 
Description Presentation at Rothamsted Research Bioinformatics Forum 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact A presentation to the Rothamsted Research Bioinformatics Forum on "Creating a pipeline that generates track hubs from plant RNA-Seq alignments and registers them in the Track Hub Registry".
Year(s) Of Engagement Activity 2016
 
Description Presentation at the Conference "The Future of Science: The Digital Revolution: What is changing for humankind" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Undergraduate students
Results and Impact A presentation at a conference attended mostly by undergraduate and high-school students, focused on far-reaching changes in scientific practice.
Year(s) Of Engagement Activity 2016
URL http://www.futureofscience.org/press/first-world-conference-on-the-future-of-science-science-and-soc...
 
Description Presentation on Database Interoperability at Plant and Animal Genomes meeting, San Diego, 2016 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A presentation was made to a workshop on Interoperability and Federation Across Bioinformatic Platforms and Resources at the Plant and Animal Genomes meeting, San Diego, 2016, covering the infrastructure for interoperability developed under the Big Data Infrastructure for Crop Genomics award. The audience was a mix of professionals from academia and industry and post-graduate students.
Year(s) Of Engagement Activity 2017
URL http://app.core-apps.com/pag-2017/abstract/1a72fbe5da697beb0993e254b2da1f2d
 
Description Release of 6 wheat genome assemblies 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Coupled to the EI talk at Plant and Animal Genomes in the USA, "From Zero to Many: Assembling Wheat Genomes with w2rap", we released 6 wheat genome assemblies.
Year(s) Of Engagement Activity 2017
URL http://www.earlham.ac.uk/can-we-produce-better-wheat-crop-feed-world-single-multiple-wheat-genomics
 
Description Seminar at the Institute of Plant Genetics, Polish Academy of Science, Poznan. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Invited to present on various subjects, including the outcomes of the Big Data Infrastructure for Crop Genomics project, at the Institute of Plant Genetics, part of the Polish Academy of Sciences in Poznan, Poland, as part of the EU-funded Bio-Talent project.
Year(s) Of Engagement Activity 2017
URL http://www.biotalent.eu
 
Description Talk at conference in MPI Golm Germany 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Attended the "plants and people" conference at MPI Golm, Germany. This conference is organised by the PhD students at the institute who are also presenting. The meeting attracted speakers from across the world and also from across the field of plant science from research, industry, EU regulations, and public engagements. It was very interesting to be able to meet with so many professionals across the field, and with young scientists doing their PhDs.
Year(s) Of Engagement Activity 2019
URL https://plants-and-people.mpg.de/node/7
 
Description Wheat breeders' workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact The Genome Analysis Centre (TGAC) hosted the workshop "Towards a sustainable future for wheat genomics", which introduced the latest freely available wheat genomic resources and the new opportunities these bring for sophisticated analysis.
Year(s) Of Engagement Activity 2014
URL http://www.tgac.ac.uk/news/103/15/TGAC-takes-the-lead-towards-a-sustainable-future-for-wheat-genomic...
 
Description Wheat breeders' workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact The second Wheat Breeders' Workshop, entitled "Towards A Sustainable Future for Wheat Genomics", was held at The Genome Analysis Centre (TGAC) on 20-21 October 2015 to show new and upcoming resources to wheat breeders and involve them in the future development of the tools they stand to benefit from.
Year(s) Of Engagement Activity 2015
URL http://www.tgac.ac.uk/news/240/15/Wheat-breeders-techno-remix/
 
Description Workshop talk 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presented at the "workshop on multi-parental populations", which was held in Cambridge, UK, and attended by international researchers from Europe and scientists from ICRISAT and IRRI (CGIAR breeding institutes). My talk, "Combining genome technologies with genetics to analyse MAGIC populations", introduced our work on wheat and MAGIC populations. There was a lot of interest in how genomics can be used in breeding, as shown in our work, and the workshop attendees have written a review on "multi-parental populations" for the journal Heredity.
Year(s) Of Engagement Activity 2019
URL http://mtweb.cs.ucl.ac.uk/mus/www/MAGICdiverse/MAGIC_workshop.htm