A bioinformatics tool for the accelerated diagnosis of multiple viral infections in crops using next generation sequencing

Lead Research Organisation: University of St Andrews
Department Name: Biology

Abstract

Viruses that infect crop plants in the UK cause significant losses in terms of yield and quality. In the UK, the production value of the potato harvest is £684 million per year, but the losses due to some viruses are estimated to be £30 million. Hence, the need to quickly and accurately identify virus-infected crops is of importance both economically and to ensure the continued supply of food adequate for a growing population.

Current techniques for virus identification only allow the detection of a single virus or, at best, a very small number of related viruses. This makes disease diagnosis slow and expensive. Plant viruses can be identified from their genetic material, which is commonly RNA. It is possible to sequence the genetic material contained within samples extracted from an infected plant. This genetic material comprises a mixed collection of the host plant's RNA and the RNA of multiple viruses (and other organisms) that infect the plant. Technology can be used to sequence this mixed genetic material, which gives a very large data set of millions of short reads of RNA. A major difficulty is identifying the virus sequences within the mixed data set, and doing so in a short enough time period to allow for successful disease diagnosis. Ideally we require a software tool that can take RNA samples and produce a list of viruses present within a matter of days rather than weeks. To date there is no software that can make a successful plant virus diagnosis in a sufficiently short timeframe. The aim of the project is to develop software that will take the millions of short reads of RNA from a mixed sample and produce a list of viruses present, so that an accurate diagnosis can be made and effective disease treatments can be deployed.

The software comprises two elements: (a) identification and removal of plant host RNA reads and (b) identification of known and potential new viruses. The identification of RNA viruses in mixed RNA samples is difficult because of their high sequence variability: even if a related sequence is present in a reference database, the differences may be too great for the similarity to be detected by alignment. In addition, alignment methods, in which short RNA reads are aligned against a reference genome and assembled, are too slow for diagnostic purposes. In this project we will develop a bioinformatics tool that will overcome both of these problems.

We will use a method known as k-mer counting to identify the viruses present. RNA sequences can be treated as character strings and divided into multiple substrings of length k. In this way a sequence can be represented by a k-mer profile, and these profiles can be compared to identify which species are present in a mixed sample. In addition, we will test the use of a fast aligner that will enable us to identify the host RNA more quickly if a reference genome is available. We will integrate these methods to create a pipeline. The tool will be delivered through Galaxy, an open platform for intensive data analysis, making it widely available to researchers. It will be designed to be used by the non-expert user. The tool will be tested on RNA sequence data from infected raspberry plants and from potato plant material. The tool will have direct applications in plant health, quarantine and certification procedures, which are used to stop the spread of crop diseases.
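The k-mer idea described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names, the toy reference sequences and the scoring rule (the fraction of a read's distinct k-mers that also occur in a reference) are assumptions chosen only to show how a sequence becomes a k-mer profile and how profiles can be compared. Sequences are written with the letters A, C, G and T.

```python
from collections import Counter


def kmer_profile(sequence, k):
    """Count every overlapping substring of length k in a sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def shared_kmer_fraction(read, reference_profile, k):
    """Fraction of the read's distinct k-mers that also occur in the reference."""
    read_kmers = set(kmer_profile(read, k))
    if not read_kmers:
        return 0.0
    hits = sum(1 for kmer in read_kmers if kmer in reference_profile)
    return hits / len(read_kmers)


def best_match(read, references, k):
    """Name of the reference sharing the most k-mers with the read, or None."""
    best_name, best_score = None, 0.0
    for name, sequence in references.items():
        score = shared_kmer_fraction(read, kmer_profile(sequence, k), k)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical toy references; real virus genomes are thousands of bases long.
references = {
    "virus_A": "ACGTACGTTAGC",
    "virus_B": "GGGCCCAAATTT",
}

print(kmer_profile("ACGTACGT", 4))
print(best_match("GTACGTTA", references, 4))
```

In practice the reference profiles would be computed once and stored in a database, so that each read needs only k-mer lookups and no alignment.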

Technical Summary

Viruses cause significant yield and quality losses in a wide variety of agricultural and horticultural crops, and have an important negative economic impact. Hence, plant virus diagnosis is a field of great significance in terms of the UK's food security and the agricultural economy. Next generation sequencing (NGS) of infected plant material is now a principal focus for viral diagnostics, but it requires fast and robust bioinformatics tools for host sequence and virus identification. Both of these aspects are missing in current software, which in general has been developed for clinical diagnosis. No tools for crops go beyond sequence homology for virus identification while remaining usable by the diagnostician and giving results in a rapid time frame.

The aim of this project is to develop a bioinformatics tool that uses mixed RNA sequence reads from infected plant material to produce a viral index in an accelerated timeframe to support disease diagnosis. Such a bioinformatics tool would have direct applications in plant health, quarantine and certification procedures. The tool will include a method for k-mer profiling, which goes beyond sequence homology, for the detection and identification of known and new viruses. The project will also explore the use of a fast alignment method for plant host extraction when a reference genome is available. The tool will be developed within Galaxy as a workflow, making it widely available to the non-expert. By using an open workflow platform, the tool will have the potential to be used on a cloud-based Galaxy server, which makes it available to researchers without significant computing infrastructure, such as those in developing countries. We will develop and test the bioinformatics pipeline on already collected RNA-seq data from virus-infected raspberry plants and potato plant material, but the tool will be applicable to a wide variety of crop plants.

Planned Impact

The purpose of this project is to develop a bioinformatics tool that uses RNA sequence reads from infected plant material to rapidly produce a viral index for disease diagnosis, with applications in plant health, quarantine and certification procedures. Rapid and accurate virus detection is an essential part of efficient crop management, offering protection against economic losses due to low yield and poor quality. Viral diagnostics is a key component of ensuring security and sustainability of food production both in the UK and in developing economies with low-input farming systems.

The key output from the project is the bioinformatics "pipeline". The primary impact of the project will be realised when the pipeline is used for virus identification in new and existing next generation sequence datasets by the beneficiaries worldwide.

There are a large number of academic beneficiaries of this project, including researchers involved in disease diagnosis for plant quarantine and certification purposes, crop breeding and those establishing diagnostic tools for crops in developing countries (see previous section). In addition, this project will have direct impact on farmers, horticulturists and other food producers who need to monitor the ongoing health of their crops and investigate new instances of disease as they arise. The speed of the bioinformatics tool will enable them to deploy appropriate crop management practices in time to maximise productivity and profit. James Hutton Limited, a commercial subsidiary of the institute, has established the JHL Molecular Diagnostics Unit, which uses diagnostic tests for plant health assessment and crop genotyping. JHL has an existing customer base with excellent links to growers and agronomists. Therefore, the software pipeline developed by this project will be a major boost to the capabilities of this Unit.

Impact of the bioinformatics pipeline will be measured by recording the number of installs made from the Galaxy Tool Shed, the number of requests for collaboration using the pipeline, the number of references to the pipeline in scientific and other publications, and the approaches made by other research institutions and commercial companies for use or further development of the pipeline.
 
Description Progress
Work has been conducted in two aspects of this project.
• A Galaxy workflow has been created using standard NGS analysis tools (including host removal and virus identification by mapping) to take raw RNA-seq data and produce an initial list of viruses present. This workflow matches reads to the RefSeq dataset of viruses (which comprises > 7000 genomes). This workflow has been tested on African groundnut RNA-sequence data, which has been sequenced as part of a Royal Society network grant. We are currently analysing initial results and working on a protocol for further analysis of the unmapped reads to recover additional viruses.
• We have created a synthetic dataset of RNA-sequence reads which comprises a plant host and a number of known viruses. This dataset is being used to assess a tool that can match reads to sequence databases using k-mers. We are currently testing and optimizing parameters for virus recovery in the synthetic dataset, and are also creating our own k-mer database of plant viruses.
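As an illustration of the second point, a synthetic dataset with known composition can be used to score any read classifier. The short Python sketch below rests on assumptions: the function names, the toy sequences and the error-free sampling of reads are hypothetical and are not the project's own procedure. It samples labelled reads from a host and a virus sequence, then reports the fraction of reads from each source that a classifier labels correctly.

```python
import random
from collections import Counter


def simulate_reads(genomes, reads_per_genome, read_length, seed=0):
    """Sample fixed-length, error-free reads, keeping each read's true source."""
    rng = random.Random(seed)
    reads = []
    for name, sequence in genomes.items():
        max_start = len(sequence) - read_length
        if max_start < 0:
            raise ValueError(f"{name} is shorter than the read length")
        for _ in range(reads_per_genome[name]):
            start = rng.randint(0, max_start)
            reads.append((name, sequence[start:start + read_length]))
    rng.shuffle(reads)  # mix host and virus reads, as in a real sample
    return reads


def recall_per_source(reads, classify):
    """Fraction of reads from each source that the classifier labels correctly."""
    totals, correct = Counter(), Counter()
    for truth, sequence in reads:
        totals[truth] += 1
        if classify(sequence) == truth:
            correct[truth] += 1
    return {name: correct[name] / totals[name] for name in totals}


# Hypothetical toy sequences standing in for a host transcript and a virus.
genomes = {"host": "ACGT" * 10, "virus_X": "GGCCTTAA" * 5}
reads = simulate_reads(genomes, {"host": 20, "virus_X": 5}, read_length=10)

# A deliberately naive classifier that labels every read as host.
print(recall_per_source(reads, lambda sequence: "host"))
```

Because the true source of every read is known, parameter changes in the classifier can be compared directly by their per-source recall.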
Exploitation Route Outcomes will be relevant to a number of groups in due course. The description of the new bioinformatics pipeline has been published.
Sectors Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Education

 
Description We have been successful in applying our findings to an ongoing project on identifying groundnut virus diseases in Kenya and training Kenyan scientists, funded by the CONNECTED network.
First Year Of Impact 2018
Sector Agriculture, Food and Drink,Education
Impact Types Economic

 
Title Kodoja: A workflow for virus detection in plants using k-mer analysis of RNA-sequencing data 
Description Background: RNA-sequencing of plant material allows for hypothesis-free detection of multiple viruses simultaneously. This methodology relies on bioinformatics workflows for virus identification. Most workflows are designed for human clinical data, and few go beyond sequence mapping for virus identification. Methods: We present a new workflow (Kodoja) for the detection of plant virus sequences in RNA-sequence data. Kodoja uses k-mer profiling at the nucleotide level and sequence mapping at the protein level by integrating two existing tools, Kraken and Kaiju. Results and Discussion: Kodoja was tested on 3 existing RNA-seq datasets from grapevine, and 2 new RNA-seq datasets from raspberry. For grapevine, Kodoja was shown to be more sensitive than a method based on contig building and BLAST alignments (27 viruses detected compared to 19). The application of Kodoja to raspberry showed that field-grown raspberries were infected by multiple viruses, and that RNA-seq can identify lower amounts of virus material than RT-PCR. This work enabled the design of new PCR primers for detection of Raspberry yellow net virus and Beet ringspot virus. Kodoja is a sensitive method for plant virus discovery in field samples and enables the design of more accurate primers for detection. Kodoja is available to install through Bioconda and as a tool within Galaxy. Some modules were updated in Aug 2019 to account for error checking with software downloads. See https://github.com/abaizan/kodoja_galaxy for details of the latest commits to GitHub. 
Type Of Material Model of mechanisms or symptoms - non-mammalian in vivo 
Year Produced 2018 
Provided To Others? Yes  
Impact This workflow is freely available to researchers to use and we have used it to develop new PCR primers for the detection of viruses in red raspberry plants. 
URL https://github.com/abaizan/kodoja_galaxy
 
Title K-Mer Databases Of Plant Virus Sequences For Use With The Kodoja Workflow 
Description Details This is a gzipped tar file that includes the plant virus database files required to run the Kodoja workflow (https://github.com/abaizan/kodoja) [1]. Kodoja is a workflow for the detection of plant virus sequences in RNA-seq data files that uses two previously published tools, Kraken [2] and Kaiju [3]. This file contains databases for Kraken [2] and Kaiju [3]. The file includes the kraken database files: database.idx, database.kdb, nodes.dmp, names.dmp and the kaiju database file kaij_library.fmi. These k-mer databases are based on virus sequences in RefSeq [4] (https://www.ncbi.nlm.nih.gov/refseq/) with plant hosts as defined in the Virus-Host Database [5] (https://www.genome.jp/virushostdb/). Version 1.0 kodojaDB_v1.0 is based on RefSeq v89 and the Virus-Host Database (accessed 03/09/2018, which is based on RefSeq 89 and Genbank 226.0). The viral partition of RefSeq v89 genome comprises 7946 viruses (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt). kodojaDB_v1.0 was created using kodoja_retrieve.py, which is part of the kodoja workflow (v0.05) (https://github.com/abaizan/kodoja). References 
[1] Baizan-Edge, A, Cock, P, MacFarlane, S, McGavin, W, Torrance, T, Jones, S. Kodoja: A workflow for virus detection in plants using k-mer analysis of RNA-sequencing data (under review Nucleic Acids Research). 
[2] Wood,D.E. and Salzberg,S.L. (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol., 15, R46. 
[3] Menzel,P., Ng,K.L. and Krogh,A. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun., 7, 1-9. 
[4] O'Leary,N.A., Wright,M.W., Brister,J.R., Ciufo,S., Haddad,D., McVeigh,R., Rajput,B., Robbertse,B., Smith-White,B., Ako-Adjei,D., et al. (2016) Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation. Nucleic Acids Res., 44, D733-D745. 
[5] Mihara,T., Nishimura,Y., Shimizu,Y., Nishiyama,H., Yoshikawa,G., Uehara,H., Hingamp,P., Goto,S. and Ogata,H. (2016) Linking virus genomes with host taxonomy. Viruses, 8, 10-15 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
 
Description Collaboration with Prof H Were, MMUST, Kakamega, Kenya 
Organisation Masinde Muliro University of Science and Technology
Country Kenya 
Sector Academic/University 
PI Contribution A Galaxy workflow has been created using standard NGS analysis tools (including host removal and virus identification by mapping) to take raw RNA-seq data and produce an initial list of viruses present. This workflow matches reads to the RefSeq dataset of viruses (which comprises > 7000 genomes). This workflow has been tested on African groundnut RNA-sequence data, which has been sequenced as part of a Royal Society network grant. We are currently analysing initial results and working on a protocol for further analysis of the unmapped reads to recover additional viruses. Two Kenyan students came to Scotland for training in bioinformatics techniques.
Collaborator Contribution Partner collected the groundnut samples
Impact Sequence information from groundnut samples
Start Year 2016
 
Description 1. Oct 2016: Sue Jones: Talk at University of St Andrews bioinformatics forum: "Plant virus diagnostics using NGS" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Talk at Bioinformatics Forum
Year(s) Of Engagement Activity 2016
 
Description 2. Nov 2016: Sue Jones: Talk at RNA-sequence data workshop at the James Hutton Institute: "Searching for African groundnut viruses using NGS" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact talk at workshop
Year(s) Of Engagement Activity 2016
 
Description 3. Feb 2017: Amanda Edge: informal discussions about NGS analysis methods at the NextGenBUG meeting at the Centre for Virus Research at the University of Glasgow 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact talk to conference
Year(s) Of Engagement Activity 2016
 
Description Conference presentation by Amanda Baizan-Edge at the Science Protecting Plant Health Conference, Brisbane, Australia: Title of talk: A new bioinformatics pipeline for the rapid detection of plant viruses using next generation sequencing data. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Oral presentation by Amanda Baizan-Edge at the Science Protecting Plant Health Conference, Brisbane, Australia. Title of talk: A new bioinformatics pipeline for the rapid detection of plant viruses using next generation sequencing data.
Year(s) Of Engagement Activity 2017
URL http://sciplant2017.com.au/
 
Description Talk and Poster at COST ACTION meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk and poster presentation at the COST Action meeting "Deep investigation on viral associated sequences" at the University of Liege, Belgium
Year(s) Of Engagement Activity 2018
URL http://www.cost-divas.eu/final-meeting/meeting-venue/