Triticeae Genomics for Sustainable Agriculture

Lead Research Organisation: Earlham Institute
Department Name: Directorate Office


Securing food supply on a global scale requires solutions to a complex set of unprecedented problems, including rising demand due to major population increases and social mobility, global climate change, rising energy costs, and land, water and nutrient limitations. Finding and implementing these solutions is a top priority for governments and scientists worldwide, and has been articulated as a key BBSRC strategic objective. Opportunities for plant science to contribute to global food security include increasing the yield and quality of crops, combatting diseases, enabling maximal crop productivity in sub-optimal growth conditions, and increasing maximal yield potential. Utilising non-food components of food crops, such as cell wall material and waste products of food production, to produce energy and industrial feedstocks has a major role in reaching sustainability and maximising the overall yield of renewable resources from limited land and soils. Grass crops are essential for human existence, serving directly or indirectly as the primary source of human nutrition. Wheat, rice and coarse grains such as maize are the most important crops for human food production; increasing grain production sustainably is therefore a critically important strategic and scientific objective. Wheat is the main arable crop in the UK, planted on 60% of arable land, with an annual farm gate value of ~£2.5bn and a processed product value of approximately £150bn. Yield increases in wheat are slowing, both compared to past gains (achieved primarily through improved agronomy) and in relation to other grain crops, notably maize. Genetic and transgenic improvement of wheat is therefore a very high priority in the UK and worldwide, and large international programmes for wheat genetic improvement are underway.
A high-quality genome sequence provides a complete, accurate and durable record of genes, predicted proteins and other genomic elements, which today are a fundamental foundation for nearly all areas of biological research. This proposal describes a UK component of an internationally coordinated wheat genome sequencing project that will make decisive and innovative contributions to sequencing the wheat genome and supporting crop improvement through genomics.

Technical Summary

Bread wheat has an exceptionally complex genome composed of three independently maintained genomes, each of which is approximately 6 Gb, more than the entire human genome. Wheat genes are found predominantly in small clusters (1-4 genes), with an average density of between 1 gene/86 kb in proximal regions and 1 gene/180 kb in distal regions of the chromosome. Genes and gene islands are separated by extensive tracts of nested retrotransposon repeats comprising approximately 85% of the genome. The gene content of diploid grasses is approximately 30,000-35,000, suggesting bread wheat has approximately three times this number of genes. The scale and complexity of this genome requires a large coordinated effort and the development and application of new technologies. Work in the International Wheat Genome Sequencing Consortium aims to generate accurate sequences of nearly all genes, annotate these and place them in a syntenic framework. Four chromosomes will be sequenced to high-quality reference standards using a combination of established methods and novel sequencing technologies. Re-sequencing methods will be developed to access sequence variation in the Triticeae in concert with the pre-breeding programme. Finally, bioinformatics resources for the long-term maintenance and analysis of the sequence will be established.
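
The scale implied by these figures can be worked through as a rough back-of-envelope calculation. The sketch below uses only the approximate numbers quoted in this summary; the function and constant names are illustrative and the results are order-of-magnitude estimates, not measured values.

```python
# Approximate figures quoted in the technical summary (all values are rough).
SUBGENOMES = 3
SUBGENOME_SIZE_BP = 6_000_000_000   # ~6 Gb per constituent genome
REPEAT_PERCENT = 85                 # ~85% nested retrotransposon repeats
DIPLOID_GENE_RANGE = (30_000, 35_000)


def wheat_estimates():
    """Return (total bp, non-repeat bp, (min genes, max genes))."""
    total_bp = SUBGENOMES * SUBGENOME_SIZE_BP
    # Integer arithmetic keeps the result exact for these round numbers.
    non_repeat_bp = total_bp * (100 - REPEAT_PERCENT) // 100
    gene_range = tuple(SUBGENOMES * g for g in DIPLOID_GENE_RANGE)
    return total_bp, non_repeat_bp, gene_range


total_bp, non_repeat_bp, gene_range = wheat_estimates()
print(f"Total genome size: ~{total_bp // 10**9} Gb")
print(f"Non-repeat fraction: ~{non_repeat_bp / 10**9:.1f} Gb")
print(f"Estimated gene count: {gene_range[0]:,} to {gene_range[1]:,}")
```

On these assumptions the hexaploid genome is roughly 18 Gb, of which only about 2.7 Gb lies outside the repeat fraction, with an expected gene count in the region of 90,000 to 105,000.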

Planned Impact

The transformative effect of access to high-quality genome sequence that is carefully analyzed, and directly and freely available to all users, is well known. Wheat is one of the three major crop plants of global importance, and the predicted impact of a high-quality wheat genome resource on crop improvement will be profound, as genomics provides a framework for new breeding methods that are substantially faster and more effective. The wheat genome project will have two immediate impacts: on a wide range of new research in wheat by researchers worldwide, and on the application of genomics to breeding and crop improvement by the breeding and agricultural biotechnology industries. Thus plant and crop scientists working in academia and industry are direct beneficiaries of the outcomes of the project. The impact of a genome sequence on these researchers will be profound. Access to and systematic study of all proteins, sequence variation in the Triticeae, global gene expression, and the systems-level analysis of biological functions will transform research in crop improvement. Because many agronomic traits in wheat, such as yield and abiotic stress responses, are due to the effects of many genes, such traits will now be accessible to the full range of experimentation possible in modern biology. Consequently progress towards increasing yield stability and sustainable production will be substantially accelerated. The agricultural biotechnology and crop breeding industries, and bioinformatics and computer scientists working on genome assembly and analysis, will benefit from a revolutionary effect of genomics similar to that seen in rice and maize breeding. A key impact will be the direct and permanent improvements in the rate and scope of wheat breeding, leading to the production of new wheat varieties that can maintain high levels of productivity with reduced inputs.
Research funding organizations are also direct beneficiaries of this project, as it enables transformative research in wheat improvement, particularly through international collaborations. The impact is a major tangible contribution to meeting important societal goals in food security and sustainable production worldwide. Many indirect beneficiaries of the research can be predicted. Wheat growers will benefit from new varieties that will be more productive and have new end-uses, leading to more stable incomes and diversified production. By addressing the environmental sustainability of crop production through new genomics-led research in nutrient- and water-use efficiency, the major environmental footprint of wheat production could be reduced, with a beneficial impact on the ecology and sustainability of the agricultural landscape. Other indirect beneficiaries are food processors, who will have access to a more affordable and more secure supply of a global staple product. In turn consumers will benefit from more stable prices and access to a staple food.
Description We have made considerable progress in achieving objective 1. In November 2015 we released a new draft genome sequence of the bread wheat reference cultivar Chinese Spring 42. This assembly is ~40% more complete and contains sequences 40 times longer than previous assemblies. This data is publicly available on the EI and ENSEMBL websites for BLAST searching, browsing gene models and download. To build this assembly we wrote a new scaffolder to integrate additional data types, e.g. longer-range mate pairs (~20 kb) and fosmid libraries in vectors that are compatible with Illumina sequencing of fosmid ends (FosIlls) for scaffolding in the 40 kb range. A manuscript describing this work was submitted to bioRxiv and subsequently published in Genome Research in May 2017 (Clavijo, Venturini et al. 2017). To this Chinese Spring reference assembly we have now added another four hexaploid assemblies and one tetraploid assembly, which are publicly available.

For objective 2 we have annotated genes within the genome using new data generated as part of this project, including full-length transcript sequences from PacBio long-read technology and strand-specific Illumina-compatible RNA-seq libraries with long (250 bp on Illumina) reads. See Clavijo, Venturini et al. 2017.

For objective 3 we released our genome assembly, assigning scaffolds to chromosome arms using flow-sorted data that we published in 2014. We have now integrated PopSeq genetic markers, which allows us to identify known and novel translocations: see Clavijo, Venturini et al. 2017. We made a high-coverage 40 kb "jumping" library of Fosill clones for assessing the long-range integrity of different wheat genome assemblies. This showed that the TGACv1 assembly is as accurate as the subsequent NRGene (IWGSC RefSeq v1) wheat assembly, although more fragmentary. Fosill paired-end reads were used to scaffold TGACv1, leading to a three-fold increase in contiguity (bioRxiv 219352).

For objective 4, as discussed in the mid-term review, we have developed a high-throughput, low-cost BAC sequencing pipeline to prepare and sequence the samples in a rapid, cost-effective manner. We have parallelised processes to sequence first 384 samples (BACs) at once, then 2,304 samples (a 6-fold increase), and most recently 9,216 samples. We have also decreased the cost and increased the throughput to the extent that it is possible to sequence the whole bread wheat genome using randomly generated BACs at low cost. This would achieve further cost savings, as creating a Minimum Tiling Path (MTP) is time-consuming and costly. We have integrated these random BACs with those of the already sequenced 3DL MTP, a useful test case for the whole genome. BACs from chromosome 3DL have been integrated with PacBio long reads from the Triticum V3 wheat assembly to generate large scaffolds with an N50 of 2.8 Mb. These are being used for detailed comparison with chromosome 3L from Aegilops tauschii. We have also used these BACs for quality control of our TGACv1 assembly (Clavijo, Venturini et al. Genome Research 2017) and supplied them to the IWGSC.
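
For readers unfamiliar with the N50 contiguity statistic quoted above, the following is a minimal sketch of how it is conventionally computed from a list of scaffold lengths. It is a generic illustration, not the code used in the project, and the function name is our own.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: total = 54, so the N50 is reached once 27 bases are covered.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because the statistic depends only on the length distribution, a higher N50 indicates a more contiguous assembly but says nothing by itself about correctness, which is why independent checks such as the BAC comparison are also reported.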

For objective 5 we have tested Moleculo synthetic long-range data, 10x Genomics data, and in vitro (Dovetail) and in vivo Hi-C data for longer-range scaffolding, especially in areas of poor recombination.

For objective 6, the re-sequencing of genes of mutagenised populations and genes of diverse Triticeae genomes relevant to wheat pre-breeding research, we have developed an automated exome resequencing pipeline that starts with leaf tissue and proceeds to identify EMS-induced SNPs in the coding regions of the majority of wheat genes. This pipeline is useful for wheat and numerous other crops and was used to identify over six million EMS-induced mutations in 1,200 lines of cv. Cadenza. The data is available for public searches on the website, where users can also request seed for lines of interest. This work is now published in Krasileva et al. PNAS 2017. We have recently sought to complement the TILLING resources using an orthogonal mutagenesis technique (X-ray induced deletions) with different biases, and so capture remaining gene function. Here we combine the ultra-cheap library approach (developed here for sequencing BACs) to shallow-sequence ~1,000 Paragon deletion lines, and use the optimised alignment pipeline (developed here for scaffolding contigs) to map reads with very high discrimination and call induced deletions.
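
As an illustration of one sanity check commonly applied to EMS mutation calls: EMS is generally reported to induce mainly G-to-A and C-to-T transitions, so the fraction of called SNPs of that type is a rough indicator of call quality. The sketch below is a hypothetical helper written for this note, not part of the project's pipeline; the names `is_ems_type` and `ems_fraction` are our own.

```python
# Substitutions typical of EMS mutagenesis (reference base, alternate base).
EMS_TYPE = {("G", "A"), ("C", "T")}


def is_ems_type(ref, alt):
    """Return True if a SNP is a G>A or C>T transition."""
    return (ref.upper(), alt.upper()) in EMS_TYPE


def ems_fraction(snps):
    """Return the fraction of (ref, alt) pairs that are EMS-type."""
    snps = list(snps)
    if not snps:
        return 0.0
    return sum(is_ems_type(ref, alt) for ref, alt in snps) / len(snps)


# Toy input: three EMS-type changes and one transversion.
calls = [("G", "A"), ("C", "T"), ("A", "T"), ("c", "t")]
print(ems_fraction(calls))
```

A call set with a low EMS-type fraction would usually suggest residual varietal polymorphism or alignment error rather than induced mutations, although the appropriate threshold is a matter of judgement.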

For objective 7, to maximize the impact of the research through training and outreach, we provided an annual training workshop for breeders on using the wheat genomic resources. We were excited to hear about the breeders' work, to receive their positive feedback on our progress, and to hear which data or tools they would like us to prioritise to serve the community. These meetings were also used to establish a priority list of UK wheat genomes for sequencing.
Exploitation Route Developing new genotyping platforms, identifying tightly linked markers and cloning genes underlying key traits. Sequenced EMS lines of cv. Cadenza are being used for wheat functional genomics, hypothesis testing and crop improvement, by both academic labs and commercial wheat breeders.
Sectors Agriculture

Food and Drink

Description In three of the 5 years of the project we held a wheat breeders' workshop to update breeders on the latest tools and data available. Shortly after this we released a new, greatly improved genome assembly for the reference wheat cultivar, which is available online and now as a pre-site from ENSEMBL, with the genes and markers transferred from the earlier assembly. Already 56 publications cite our wheat genome paper (Clavijo et al 2017), and we are aware of a new exome capture reagent set that has been designed based on our work. This assembly is now available for use by UK and worldwide wheat breeders, who can download it for their own use, helping them breed better wheat cultivars. The exome sequencing of EMS-mutagenised lines of cv. Cadenza has provided a valuable resource for wheat breeders. Many commercial and public sector breeders have identified mutations of interest and requested seed, including one major seed company that obtained seed for all 1,200 lines. The mutations have been integrated into Plant Ensembl and are also available at a dedicated project website. Already 43 publications cite our exome sequencing paper (Krasileva et al 2017).
First Year Of Impact 2015
Sector Agriculture, Food and Drink
Impact Types Economic

Title Tools to anchor wheat genomic assembly on genetic map 
Description Bioinformatics tools to anchor wheat genomic assemblies 
Type Of Material Technology assay or reagent 
Year Produced 2016 
Provided To Others? Yes  
Impact N/A 
Title tandem 
Description Software to find head-to-head gene pairs in a genome 
Type Of Material Technology assay or reagent 
Year Produced 2017 
Provided To Others? Yes  
Impact This tool made it possible to elucidate the role of head-to-head gene pairs in the formation of integrated domain fusions (cf. Bailey et al, 2018 Genome Biology)
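
To make the notion of a head-to-head gene pair concrete, here is a minimal sketch of one way such pairs could be detected from gene coordinates: two neighbouring genes on the same sequence are head-to-head when the upstream gene is on the minus strand and the downstream gene is on the plus strand, so that their 5' ends face each other. This is an illustrative assumption about the approach, not the algorithm of the tandem tool itself; the `Gene` record and the `max_gap` cutoff are hypothetical.

```python
from collections import namedtuple

# Hypothetical minimal gene record; coordinates are 1-based and inclusive.
Gene = namedtuple("Gene", "name chrom start end strand")


def head_to_head_pairs(genes, max_gap=10_000):
    """Return (left, right) names of adjacent divergently oriented genes.

    max_gap is an assumed cutoff on the intergenic distance in bp.
    """
    pairs = []
    ordered = sorted(genes, key=lambda g: (g.chrom, g.start))
    for left, right in zip(ordered, ordered[1:]):
        if left.chrom != right.chrom:
            continue
        gap = right.start - left.end
        if left.strand == "-" and right.strand == "+" and 0 <= gap <= max_gap:
            pairs.append((left.name, right.name))
    return pairs


genes = [
    Gene("g1", "chr1", 100, 500, "-"),
    Gene("g2", "chr1", 900, 1500, "+"),   # g1/g2: head-to-head
    Gene("g3", "chr1", 2000, 2500, "+"),
    Gene("g4", "chr1", 3000, 3500, "-"),  # g3/g4: tail-to-tail, not reported
    Gene("g5", "chr2", 100, 200, "+"),
]
print(head_to_head_pairs(genes))
```

Only immediately adjacent genes are compared, so overlapping or nested gene models would need separate handling in a real annotation.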
Title Crop Haplotypes 
Type Of Material Data handling & control 
Year Produced 2020 
Provided To Others? Yes  
Impact Identify haplotypes from 10 pangenome project 
Title PolyMarker 
Description Update on Primer design for polyploid species 
Type Of Material Computer model/algorithm 
Year Produced 2019 
Provided To Others? Yes  
Impact New biology 
Title Recombination landscape of hexaploid bread wheat 
Description Sequence exchange between homologous chromosomes through crossing over and gene conversion is highly conserved among eukaryotes, contributing to genome stability and genetic diversity. Lack of recombination limits breeding efforts in crops, therefore increasing recombination rates can reduce linkage-drag and generate new genetic combinations. We use computational analysis of open access data from 13 recombinant inbred mapping populations to assess crossover and gene conversion frequency in the hexaploid genome of wheat (Triticum aestivum). We find that high frequency crossover sites are shared between populations and that closely related parental founders lead to populations with more similar crossover patterns. We have identified QTL for altered gene conversion and crossover frequency and confirm functionality for a novel candidate RecQ helicase gene that belongs to an ancient clade that is missing in some plant lineages. Harnessing the RecQ helicase has the potential to break linkage-drag utilizing widespread gene conversions conserved across recombination sparse centromeric regions. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
Impact Demonstrates high rates of recombination in wheat not previously seen
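
As a simplified illustration of how crossovers can be counted in a recombinant inbred line, the sketch below counts parental genotype switches along an ordered series of marker calls on one chromosome. It is a toy example under stated assumptions (calls coded "A" or "B" for the two parents, "-" for missing data), not the analysis used for this dataset; in particular, a real analysis would distinguish short converted tracts (gene conversions) from true crossovers, which this sketch does not.

```python
def count_crossovers(calls, missing="-"):
    """Count parental genotype switches along ordered marker calls.

    Missing calls are skipped, so a switch across a gap is counted once.
    """
    informative = [c for c in calls if c != missing]
    return sum(1 for prev, cur in zip(informative, informative[1:]) if prev != cur)


# One line, one chromosome: parent A, then B, then A again.
print(count_crossovers("AAAABBBBAA"))
```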
Title Sequencing and assembly of Claire, Paragon, Robigus, Cadenza and Weebill hexaploid wheat lines
Description Sequencing and assembly of four UK elite cultivars (Claire, Paragon, Robigus, Cadenza) and one Mexican (CIMMYT) cultivar (Weebill), all hexaploid wheat
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
Impact Large dataset covering >50% of UK genetic diversity; first Mexican line (heat and drought tolerant) made publicly available
Title Supporting data for "Efficient and accurate detection of splice junctions from RNA-Seq with Portcullis" 
Description Next generation sequencing (NGS) technologies enable rapid and cheap genome-wide transcriptome analysis, providing vital information about gene structure, transcript expression and alternative splicing. Key to this is the accurate identification of exon-exon junctions from RNA sequenced (RNA-Seq) reads. A number of RNA-Seq aligners capable of splitting reads across these splice junctions (SJs) have been developed; however, it has been shown that while they correctly identify most genuine SJs available in a given sample, they also often produce large numbers of incorrect SJs.
Herein we describe the extent of this problem using popular RNA-Seq mapping tools, and present a new method, called Portcullis, to rapidly filter false SJs derived from spliced alignments. We show that Portcullis distinguishes between genuine and false positive junctions to a high degree of accuracy across different species, samples, expression levels, error profiles and read lengths. Portcullis is portable, efficient and to our knowledge is currently the only SJ prediction tool that reliably scales for use with large RNA-Seq datasets and large, highly fragmented genomes, whilst delivering accurate SJs.
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Title Supporting data for "Leveraging multiple transcriptome assembly methods for improved gene structure annotation" 
Description The performance of RNA-Seq aligners and assemblers varies greatly across different organisms and experiments, and often the optimal approach is not known beforehand. Here we show that the accuracy of transcript reconstruction can be boosted by combining multiple methods, and we present a novel algorithm to integrate multiple RNA-Seq assemblies into a coherent transcript annotation. Our algorithm can remove redundancies and select the best transcript models according to user-specified metrics, while solving common artefacts such as erroneous transcript chimerisms. We have implemented this method in an open-source Python3 and Cython program, Mikado.
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Title The Grassroots DFW Data Portal 
Description Continually updated large dataset repository for the DFW project. Houses a variety of key wheat and associated datasets that are either under the Toronto licence or others as appropriate for the level of open access.
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact To date, we house 24TB of wheat datasets that have been accessed by over 4000 researchers from 64 countries. 
Title Triticum aestivum cultivars in Ensembl
Description Five wheat lines chosen for their importance in breeding and research in the United Kingdom have been sequenced and displayed in Ensembl as part of our contribution to the Designing Future Wheat project. These are Claire, Cadenza, Paragon, Robigus and Weebill. These scaffold-level assemblies were sequenced at the Earlham Institute as part of the wheat pan genome. Sequencing was performed on an Illumina HiSeq 2500 instrument with a 2x250 bp read metric, targeting 45x raw coverage of the amplification-free library and 25x coverage of a combination of mate-pair libraries with insert sizes >7 Kbp. Between 44 and 51x paired-end genome coverage was generated per line. Contigging was performed using the w2rap-contigger with k=200. Two mate-pair libraries were produced for each line except Weebill, where five libraries were used. Mate-pairs were processed, filtered and used to scaffold contigs as described in the w2rap pipeline. Scaffolds shorter than 500 bp were removed from the final assemblies. The K-mer Analysis Toolkit was used to validate scaffolds by generating a k-mer histogram from the matrix of k-mers shared between the paired-end reads and the scaffolds.
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact Provides other wheat researchers easy access to the new assemblies and gene models. 
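
The final length filter described for these assemblies (removing scaffolds shorter than 500 bp) can be sketched in a few lines. This is a generic illustration using a hand-written FASTA reader, not the w2rap pipeline code; the function names are our own and the 500 bp default simply mirrors the threshold stated in the description.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_scaffolds(records, min_length=500):
    """Keep only records whose sequence is at least min_length bases long."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Toy input: s1 spans two lines (600 bp), s2 is 499 bp, s3 is exactly 500 bp.
fasta = [">s1", "A" * 300, "C" * 300, ">s2", "G" * 499, ">s3", "T" * 500]
kept = filter_scaffolds(read_fasta(fasta))
print([name for name, _ in kept])
```

With a real assembly the same functions could be applied to an open file handle in place of the toy list, since `read_fasta` accepts any iterable of lines.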
Title Wheat Expression 
Description The majority of RNA-seq expression studies in plants remain underutilised and inaccessible due to the use of disparate transcriptome references and the lack of skills and resources to analyse and visualise this data. We have developed expVIP, an expression Visualisation and Integration Platform, which allows easy analysis of RNA-seq data combined with an intuitive and interactive interface. Users can analyse public and user-specified datasets with minimal bioinformatics knowledge using the expVIP virtual machine. This generates a custom web browser to visualise, sort and filter the RNA-seq data and provides outputs for differential gene expression analysis. We demonstrate expVIP's suitability for polyploid crops and evaluate its performance across a range of biologically relevant scenarios. To exemplify its use in crop research we developed a flexible wheat expression browser, which can be expanded with user-generated data in a local virtual machine environment. The open-access expVIP platform will facilitate the analysis of gene expression data from a wide variety of species by enabling the easy integration, visualisation and comparison of RNA-seq data across experiments.
Type Of Material Computer model/algorithm 
Year Produced 2016 
Provided To Others? Yes  
Impact This database provides open access to over 400 RNA-seq studies in wheat, and the underlying algorithms are all available to adapt to any species with a reference genome. Over 11,000 sessions from >5,000 users and >50,000 pageviews
Title Wheat TILLING 
Description Database with data from exome captured wheat mutants 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
Impact Over 2,000 unique users across >5,000 sessions. 
Title Wheat TILLING 
Description This resource consists of TILLING populations developed in tetraploid durum wheat cv 'Kronos' and hexaploid bread wheat cv 'Cadenza' as part of a joint project between the University of California Davis, Rothamsted Research, The Earlham Institute, and John Innes Centre. We have re-sequenced the exome of 1,535 Kronos and 1,200 Cadenza mutants using Illumina next-generation sequencing, aligned this raw data to the IWGSC Chinese Spring chromosome arm survey sequence, identified mutations, and predicted their effects based on the protein annotation available at Ensembl Plants. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact Over 5,000 accesses
Title Wheat TILLING in EnsemblPlants 
Description Update of sequenced mutants and coordination with ensemblPlants 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact over 7000 mutants distributed from over 300 orders 
Title Wheat Training 
Description This website provides background information and practical resources to help both budding wheat scientists and researchers looking to expand their work into wheat. There is a need to improve crops to feed the world's growing population against the backdrop of climate change. Translation of fundamental plant biology research (e.g. from Arabidopsis thaliana) into crops such as wheat provides a potential route to deal with this challenge. However, learning even simple tasks such as growing and crossing wheat plants requires time and effort, while materials and methods sections in published articles are often short and cannot substitute for teaching aids. This is also true for more complex topics such as the genomics aspects of wheat.
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact >4,500 sessions from >2,700 users 
Title Wheat Training 
Description Wheat Training website to help new researchers engage with wheat 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Title Wheat reTILLING 
Description Rebuild of the Wheat TILLING resource using novel technologies (DRAGEN BioIT Processor + custom software) based on the new wheat reference genome (RefSeq 1.0) and TGAC/Earlham Institute cultivar-specific genome references. This is a collaboration between Earlham Institute, Rothamsted Research, John Innes Centre and UC Davis. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? No  
Impact Once the output is available (expected within 2018), it will replace the original Wheat TILLING data and continue to be of use for researchers and breeders. Variant annotation will be hosted by Ensembl Plants.
Title eFP expression browser 
Description Expression browser for wheat data 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact New biology 
Title Plant prolamins and glutenins 
Description A method for detecting and sequencing genes encoding plant prolamins and glutenins (such as gluten, hordein, secalin, avenin, zein, gliadin, farinin) in a sample of a plant (such as barley, wheat, rye, oats, or maize) or a foodstuff, comprising targeted enrichment of gluten encoding nucleic acids using baits or probes. The sequencing may be a real-time sequencing method such as SMRT sequencing. 
IP Reference GB2559540 
Protection Patent granted
Year Protection Granted 2018
Licensed No
Impact The inventors have moved on from TGAC/EI and there is limited potential for exploitation and impact realisation.
Title EMBER - GlutenSeq Pipeline 
Description Pipeline for processing of GlutenSeq data. 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact This pipeline allows the analysis and downstream processing of sequencing data obtained via targeted Gluten gene captures. 
Title Expression Browser in wheat 
Description Expression browser for wheat gene expression data 
Type Of Technology Webtool/Application 
Year Produced 2018 
Open Source License? Yes  
Impact New biology 
Title Galaxy id_fusion pipeline 
Description Galaxy implementation of integrated domain fusion detection pipeline (cf. Sarris et al, 2016 BMC Biology) 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Tool available for use on Galaxy platforms. 
Title Royal Dragen 
Description Royal Dragen is software developed by Dr. Rob King (Rothamsted Research) and Dr. Christian Schudoma (Earlham Institute). It performs variant filtering on Dragen-generated variant calls and was designed to mimic the behaviour of MAPS in wheat TILLING experiments (cf. Krasileva et al., PNAS 2017).
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact The software, together with the Dragen BioIT processor, makes it possible to significantly speed up TILLING analyses on wheat exome capture data.
Title scvep - super cereal variant effect predictor 
Description This software allows effect prediction of single nucleotide polymorphisms without a reference annotation. 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact This software was used to filter wheat candidate lines for (virus?) resistance in a collaboration with IPK Gatersleben, Germany. 
Title tandem 
Description Software to detect (head-to-head) gene pairs (tandems) in a genome. 
Type Of Technology Software 
Year Produced 2017 
Impact This software made it possible to analyse whether gene tandems might play a role in the formation of integrated domain fusions (cf. Bailey et al, 2018, Genome Biology)
