Development of computational strategies for identification and characterisation of viruses in metagenomic samples

Lead Research Organisation: Earlham Institute
Department Name: Research Faculty

Abstract

Metagenomics is the study of the DNA of mixed environmental samples that include the genomes of many different organisms. We can sequence metagenomic samples using the same next generation sequencing technology that we use to sequence the genome of a single organism, but analysing the data is much more complicated because it is difficult to know in advance which organisms are present in a sample and therefore difficult to know which organism a particular fragment of DNA (a 'read') has come from.

Assembly is the process of putting together short reads into contigs that represent a much longer fragment of DNA, enabling more useful analysis. Assembly is a difficult but relatively mature field when it involves DNA from a single organism. However, many of the simplifying assumptions made by assembly tools are invalid when dealing with metagenomic data, making the process of metagenomic assembly much harder and the field much less mature.
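To make the idea of assembly concrete, the following is a minimal sketch of one widely used approach, a de Bruijn graph built from overlapping k-mers, in which contigs are read off as unbranched paths. It is an illustration only and is not a description of how any particular tool works: the function names (build_graph, assemble), the value of k and the toy reads are all assumptions, and sequencing errors, reverse complements and coverage are ignored.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Link each (k-1)-mer to the (k-1)-mers that follow it in any read."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            left, right = kmer[:-1], kmer[1:]
            successors[left].add(right)
            predecessors[right].add(left)
    return successors, predecessors


def assemble(reads, k):
    """Return contigs spelled out by unbranched paths in the graph.

    Toy simplification: paths that form an isolated cycle are skipped.
    """
    successors, predecessors = build_graph(reads, k)
    nodes = set(successors) | set(predecessors)

    def is_interior(node):
        # Exactly one way in and one way out: the path cannot branch here.
        return (len(predecessors.get(node, ())) == 1
                and len(successors.get(node, ())) == 1)

    contigs = []
    for node in nodes:
        if is_interior(node):
            continue  # contigs start only at path ends or branch points
        for nxt in successors.get(node, ()):
            contig = node + nxt[-1]
            while is_interior(nxt):
                nxt = next(iter(successors[nxt]))
                contig += nxt[-1]
            contigs.append(contig)
    return sorted(contigs)


# Hypothetical reads drawn from a single 12-base sequence.
single = ["ATGGCGTG", "GCGTGCAA", "GTGCAAT"]
print(assemble(single, k=4))

# Adding a read from a close variant (one base different) creates a branch,
# so the same walk now yields several shorter contigs.
mixed = single + ["ATGGCGTTCAAT"]
print(assemble(mixed, k=4))
```

With reads from a single sequence the graph is one unbranched path and a single contig is recovered. Adding a read from a near-identical variant introduces a branch that fragments the result, which is a small-scale picture of why single-genome assumptions break down on mixed samples.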

The aim of this project is to develop computational algorithms for metagenomic assembly and to produce a tool that is sensitive and able to accurately differentiate between very similar species. We have targeted a particular type of metagenomic data involving viral detection because this is an important area and one that is particularly under-addressed by the small number of metagenomic assembly tools that already exist. Such a tool will enable scientists to gain vital information from metagenomic samples, including understanding the mechanisms of disease in animals and humans, detecting novel viruses and monitoring the spread of viruses in order to prevent and contain outbreaks.

Technical Summary

The analysis of data from next generation sequencing of metagenomic samples has emerged as an important tool in recent years. In the past, much of this analysis has involved targeted 16S ribosomal sequencing followed by taxonomic classification. However, the increase in throughput and reduction in cost of NGS, combined with the lack of resolution provided by 16S approaches, has encouraged the adoption of whole genome shotgun approaches. While read mapping is still a useful tool for analysing this data, greater insights are possible from assembly of reads. However, metagenomic assembly is a very immature field with only a handful of assemblers having emerged. One of these is our own MetaCortex, a proof-of-concept assembly tool that has shown promising results when applied to the analysis of the virome of a species of bats from West Africa. The purpose of this project is to develop the algorithms necessary to turn the proof-of-concept into an efficient and sensitive assembly tool that will benefit the metagenomics community. Though we feel the tool should have applicability to a wide range of metagenomic datasets, we are targeting the particular problem of viral detection, as this is an important and under-explored area of metagenomic analysis that has important implications for animal and human health.

In order to validate the effectiveness of the assembly algorithms, we plan to test on simulated datasets and, crucially, on new metagenomic sequence data generated for this project. This will include samples from humans, cows and insects that carry known viruses, or have been artificially infected with known viruses. Additionally, we have access to a set of rodent samples collected in Africa that are expected to contain many zoonotic viruses. These will be used as a case study to demonstrate the effectiveness of the tool in real world experiments.

Planned Impact

Academic impact

Techniques for assembly of metagenomic sequence data are in their infancy. As presented in the BBSRC's Review of Next Generation Sequencing, provision of assembly software for metagenomics is "highly deficient" (Conclusion 10). An important academic impact of this work will be to drive forward methods for metagenomic assembly by increasing understanding of the problems, by developing new algorithmic approaches and by encouraging best-practice techniques for analysis. The BBSRC's expert working group on metagenomics identified that the UK had failed to take full advantage of metagenomic techniques, something that is reflected in the current research highlight. This project will contribute to addressing this shortfall by helping to support the establishment of a research group focused on metagenomic tools and by increasing the knowledge and expertise of UK researchers, both through training of the postdoctoral researcher and research assistant, and through wider training in the use of the tool that will be developed.

The specific focus of the project on viral detection applications will have an impact on those working in diagnostics and surveillance, providing them with tools, techniques and knowledge that enable them to carry out their work more efficiently and effectively. This might include epidemiologists tracking virus-borne disease in the UK and overseas, as well as those seeking to understand the often complex, interlinked nature of animal and human disease mechanisms.

The development of the tools will generate new opportunities for collaborative work with R&D groups in industry and with academic institutions, particularly those also funded by BBSRC. TGAC already collaborates very closely on virus work with the Pirbright Institute, the University of Cambridge Veterinary School and with the Centre for Viral Research, Glasgow University.

The two staff employed for the project will gain important knowledge of bioinformatics, metagenomics and virology. They will develop extremely valuable skills in the use of high performance computing environments and will gain further opportunities to develop their written and verbal communication skills.

Economic and societal impacts

Metagenomics is a powerful tool for the study of health and disease in animals of agricultural importance and in carriers of zoonotic infections, enabling us to understand the role viruses play in disease outbreaks and enabling interventions to be applied before outbreaks occur. An indirect impact of the project will be to inform policy makers about circulation of pathogens and to enable them to better plan for outbreaks, both in the UK and abroad. Within a human clinical setting, metagenomics also has the potential to be a powerful diagnostic and monitoring tool.

The knowledge that will come from metagenomic analyses of viral datasets could lead to economic benefits, as there is a need for cheap diagnostic tests to be developed for animals (e.g. livestock) and humans, presenting opportunities for current and start-up biotech companies.

Developing metagenomics and bioinformatics skills in the UK is vitally important and this project will contribute towards that. We believe it will attract talented people and encourage them to consider a career in UK genomics. This project will also contribute towards the UK and BBSRC being recognised as leaders in metagenomics, bioinformatics and viral genomics.

Publications

 
Description
1) We conducted a review of current metagenomic assembly tools and have published a paper in Briefings in Bioinformatics.
2) We have developed a new tool, MetaCortex, for assembly and analysis of metagenomic sequence data. The tool is general purpose, but particularly targeted at viral datasets. One of the key innovations is the ability to output assemblies in one of two emerging graph formats rather than only as assembled contigs. This allows differentiation of, for example, viral haplotypes. The tool is available to download from GitHub.
3) We sequenced and analysed a set of rodent samples from rural and urban locations in Africa with a view to better understanding zoonotic transmission. A paper is in preparation.
4) We developed a read preparation tool called NextClip, which was published in Bioinformatics.
5) We developed a taxonomy processing tool called acc2tax, which is available on GitHub.
6) We trained 3 postdoctoral researchers.
7) Knowledge gained through this project has resulted in other successful grant applications.
Exploitation Route We believe MetaCortex will be of great interest to those struggling to assemble next generation sequencing data from metagenomic samples. We expect to see it applied to a wide range of datasets in the next few years. The graph output, offering the potential for more detailed analysis of viral haplotypes, should be particularly attractive. The review paper will help to inform inexperienced researchers on the best approaches to metagenomic assembly. The publishing of the African rodent analysis should be of great interest to researchers in disease transmission and will hopefully inform future studies.
Sectors Agriculture, Food and Drink; Energy; Environment; Healthcare; Pharmaceuticals and Medical Biotechnology

 
Description DNA sequencing for biological threat monitoring
Amount $5,270,000 (USD)
Funding ID HR001119C0031 
Organisation Defense Advanced Research Projects Agency (DARPA) 
Sector Public
Country United States
Start 12/2018 
End 12/2023
 
Description US/UK Microbiome Workshop
Amount £1,800 (GBP)
Organisation Foreign Commonwealth and Development Office (FCDO) 
Sector Public
Country United Kingdom
Start 03/2016 
End 03/2016
 
Description Beth Okamura (Natural History Museum) collaboration 
Organisation Natural History Museum
Country United Kingdom 
Sector Public 
PI Contribution Richard Leggett is co-I on Beth Okamura's Leverhulme Trust funded project aiming to characterise parasites in frog populations using metagenomic techniques.
Collaborator Contribution Obtained grant funding from Leverhulme Trust
Impact Successful application to Leverhulme Trust.
Start Year 2020
 
Description US/UK Microbiome Workshop 
Organisation Foreign Commonwealth and Development Office (FCDO)
Country United Kingdom 
Sector Public 
PI Contribution Invited to attend a US/UK Microbiome workshop in San Diego, March 2016. This workshop was aimed at fostering collaboration between US and UK scientists working in metagenomics. The invitation came as a result of this grant award.
Collaborator Contribution The workshop was organised and facilitated by the FCO Science & Innovation network, who also covered my travel and accommodation expenses (listed separately under Further Funding).
Impact A workshop report. Regular email communication highlighting US/UK collaborative opportunities.
Start Year 2016
 
Title MetaCortex 
Description MetaCortex is an assembly tool for metagenomic sequence data. We are preparing a paper for publication, but the software is available already. 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact A number of research groups are already using the software. 
 
Title acc2tax 
Description Tool for batch processing of taxonomy - e.g. to convert multiple accessions to taxonomy strings. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Used internally at EI and by researchers around the world. 
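As an illustration of what converting accessions to taxonomy strings involves, here is a minimal sketch. It is not acc2tax's implementation or interface: the function names, the in-memory lookup tables and the toy accessions and taxon IDs below are all hypothetical. It assumes one table mapping each accession to a taxon ID and another mapping each taxon ID to its parent ID and name, with the root listed as its own parent.

```python
def lineage(taxid, nodes):
    """Walk from a taxon up to the root and return a root-first lineage string."""
    names = []
    while True:
        parent, name = nodes[taxid]
        names.append(name)
        if parent == taxid:  # the root is recorded as its own parent
            break
        taxid = parent
    return "; ".join(reversed(names))


def accessions_to_taxonomy(accessions, acc_to_taxid, nodes):
    """Batch lookup: map each accession to a lineage string, or None if unknown."""
    result = {}
    for acc in accessions:
        taxid = acc_to_taxid.get(acc)
        result[acc] = lineage(taxid, nodes) if taxid is not None else None
    return result


# Hypothetical toy tables (invented IDs and accessions, for illustration only).
nodes = {
    1: (1, "root"),
    2: (1, "Viruses"),
    3: (2, "Example virus family"),
    4: (3, "Example virus species"),
}
acc_to_taxid = {"ACC0001": 4, "ACC0002": 3}

print(accessions_to_taxonomy(["ACC0001", "ACC0002", "ACC9999"], acc_to_taxid, nodes))
```

Returning None for an unknown accession, rather than raising an error, is a design choice made here so that a large batch is not interrupted by a few missing entries.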
 
Description Conference talk: MetaCortex: Assembling variation in metagenomics at ISMB/ECCB 2017 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Martin Ayling presented his work on MetaCortex at the HitSeq 2017 meeting at ISMB/ECCB 2017.
Year(s) Of Engagement Activity 2017
 
Description Conference talk: MetaCortex: Assembling variation in metagenomics at ALCoB 2017 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Martin Ayling gave a presentation at the international conference ALCoB 2017 (4th International Conference on Algorithms for Computational Biology).
Year(s) Of Engagement Activity 2017
 
Description Metagenomic viral assembly: the current crop 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other audiences
Results and Impact The post-doc employed on this project, Martin Ayling, delivered a talk at the UK Genome Science meeting, Birmingham, 7-9 September. The audience was composed of genomics researchers from around the UK.

The talk prompted discussion with a number of potential collaborators afterwards. These discussions about future projects are ongoing.

Martin's slides were made available on F1000Research - see link.
Year(s) Of Engagement Activity 2015
URL http://f1000research.com/slides/4-769
 
Description New approaches for metagenome assembly with short reads 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact EI website news article about our publication on approaches for metagenomic assembly with short reads.
Year(s) Of Engagement Activity 2019
URL http://www.earlham.ac.uk/newsroom/new-approaches-metagenome-assembly-short-reads
 
Description Next generation sequencing data analysis for metagenomics 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact The PI (Richard Leggett) and the postdoc (Martin Ayling) taught metagenomics data analysis and assembly sessions on the week long TGAC course "Metagenomics: From bench to data analysis", attended by 20-30 people.

The sessions prompted a lot of discussion and questions.
Year(s) Of Engagement Activity 2015
URL http://www.tgac.ac.uk/361_Division/training-programme/courses-workshops/tgac-events/metagenomics-fro...
 
Description Project launch press release 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact A press release at the start of the project, available in the news section of the institute's website. Resulted in some enquiries about the work.
Year(s) Of Engagement Activity 2014
URL http://www.earlham.ac.uk/ei-leads-research-help-identify-animal-human-transmitted-diseases
 
Description Scientists on the Loose 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact "Scientists on the Loose" is a series of public engagement activities run by TGAC. Martin Ayling (postdoc on this project) gave a talk on metagenomic assembly to a group from The Norfolk Humanists, at The Coachmakers pub in Norwich.
Year(s) Of Engagement Activity 2015
URL http://www.tgac.ac.uk/news/245/68/Scientists-on-the-Loose-when-genetics-met-humanism/
 
Description TGAC Open Day 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact The PI (Richard Leggett) and post-doc (Martin Ayling) manned a stand at the TGAC Open Day describing the project's work on metagenomic assembly algorithms. Lots of people, from school children to the retired, passed through and we had many interesting conversations.
Year(s) Of Engagement Activity 2015
URL http://www.tgac.ac.uk/news/242/68/Discovering-the-decoding-of-living-systems/
 
Description Triple Science Network Science Fair 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact 180 year 9 pupils attended Norwich Research Park for a science fair. The PDR on this project attended and was part of a Scientist Question Time, answering questions about his work and life as a scientist, seeking to inspire the next generation.
Year(s) Of Engagement Activity 2017
 
Description Year 10 work experience placement 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact A one week work experience placement as part of EI's work experience pilot. A successful event for a very gifted year 10 student. The student visited a number of scientists, including spending a day with the PDR on this project understanding his work and learning basic bioinformatics skills (e.g. cluster access, BLAST, online resources).
Year(s) Of Engagement Activity 2016