Development of computational strategies for identification and characterisation of viruses in metagenomic samples
Lead Research Organisation:
Earlham Institute
Department Name: Research Faculty
Abstract
Metagenomics is the study of the DNA of mixed environmental samples that include the genomes of many different organisms. We can sequence metagenomic samples using the same next generation sequencing technology that we use to sequence the genome of a single organism, but analysing the data is much more complicated because it is difficult to know in advance which organisms are present in a sample and therefore difficult to know which organism a particular fragment of DNA (a 'read') has come from.
Assembly is the process of putting together short reads into contigs that represent a much longer fragment of DNA, enabling more useful analysis. Assembly is a difficult but relatively mature field when it involves DNA from a single organism. However, many of the simplifying assumptions made by assembly tools are invalid when dealing with metagenomic data, making the process of metagenomic assembly much harder and the field much less mature.
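As a rough illustration of what "putting together short reads into contigs" means, the following is a minimal sketch of greedy overlap merging on toy reads. It is not the algorithm used by MetaCortex or any other tool named in this record; the function names `overlap` and `greedy_assemble`, the `min_len` threshold and the example reads are all assumptions made for illustration.

```python
def overlap(a, b, min_len):
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, provided it is at least `min_len` long; otherwise return 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap
    until no pair overlaps by at least `min_len` bases."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Toy reads (hypothetical): the first three overlap end to end, the last
# two come from an unrelated sequence, as in a mixed sample.
reads = ["ATGGCGT", "GCGTGCA", "TGCAATC", "CCCTTTAAA", "TTAAAGGGC"]
contigs = greedy_assemble(reads)
```

On this toy input the reads collapse into two contigs, one per underlying sequence. Real metagenomic data is harder precisely because closely related organisms share long stretches of sequence, so a greedy merge of this kind would readily join reads from different genomes.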
The aim of this project is to develop computational algorithms for metagenomic assembly and to produce a tool that is sensitive and able to accurately differentiate between very similar species. We have targeted a particular type of metagenomic data involving viral detection because this is an important area and one that is particularly under-addressed by the small number of metagenomic assembly tools that already exist. Using such a tool enables scientists to gain vital information from metagenomic samples, including understanding the mechanisms of disease in animals and humans, detecting novel viruses and monitoring the spread of viruses in order to prevent and contain outbreaks.
Technical Summary
The analysis of data from next generation sequencing of metagenomic samples has emerged as an important tool in recent years. In the past, much of this analysis has involved targeted 16S ribosomal sequencing followed by taxonomic classification. However, the increase in throughput and reduction in cost of NGS, combined with the lack of resolution provided by 16S approaches, has encouraged the adoption of whole genome shotgun approaches. While read mapping is still a useful tool for analysing this data, greater insights are possible from assembly of reads. However, metagenomic assembly is a very immature field with only a handful of assemblers having emerged. One of these is our own MetaCortex, a proof-of-concept assembly tool that has shown promising results when applied to the analysis of the virome of a species of bats from West Africa. The purpose of this project is to develop the algorithms necessary to turn the proof-of-concept into an efficient and sensitive assembly tool that will benefit the metagenomics community. Though we feel the tool should have applicability to a wide range of metagenomic datasets, we are targeting the particular problem of viral detection, as this is an important and under-explored area of metagenomic analysis that has important implications for animal and human health.
To validate the effectiveness of the assembly algorithms, we plan to test them on simulated datasets and, crucially, on new metagenomic sequence data generated for this project. This will include samples from humans, cows and insects that carry known viruses, or have been artificially infected with known viruses. Additionally, we have access to a set of rodent samples collected in Africa that are expected to contain many zoonotic viruses. These will be used as a case study to demonstrate the effectiveness of the tool in real-world experiments.
Planned Impact
Academic impact
Techniques for assembly of metagenomic sequence data are in their infancy. As presented in the BBSRC's Review of Next Generation Sequencing, provision of assembly software for metagenomics is "highly deficient" (Conclusion 10). An important academic impact of this work will be to drive forward methods for metagenomic assembly by increasing understanding of the problems, by developing new algorithmic approaches and by encouraging best-practice techniques for analysis. The BBSRC's expert working group on metagenomics identified that the UK had failed to take full advantage of metagenomic techniques, something that is reflected in the current research highlight. This project will contribute to addressing this shortfall by helping to support the establishment of a research group focused on metagenomic tools and by increasing the knowledge and expertise of UK researchers, both through training of the postdoctoral researcher and research assistant, and through wider training in the use of the tool that will be developed.
The specific focus of the project on viral detection applications will have impact on those working in diagnostics and surveillance, providing them with tools, techniques and knowledge to enable them to carry out their work more efficiently and effectively. This might include epidemiologists tracking virus-borne disease in the UK and overseas, as well as those seeking to understand the often complex, interlinked nature of animal and human disease mechanisms.
The development of the tools will generate new opportunities for collaborative work with R&D groups in industry and with academic institutions, particularly those also funded by BBSRC. TGAC already collaborates very closely on virus work with the Pirbright Institute, the University of Cambridge Veterinary School and with the Centre for Viral Research, Glasgow University.
The two staff employed for the project will gain important knowledge of bioinformatics, metagenomics and virology. They will develop extremely valuable skills in the use of high performance computing environments and will gain further opportunities to develop their written and verbal communication skills.
Economic and societal impacts
Metagenomics is a powerful tool for the study of health and disease in animals of agricultural importance and in carriers of zoonotic infections, enabling us to understand the role viruses play in disease outbreaks and enabling interventions to be applied before outbreaks occur. An indirect impact of the project will be to inform policy makers about circulation of pathogens and to enable them to better plan for outbreaks, both in the UK and abroad. Within a human clinical setting, metagenomics also has the potential to be a powerful diagnostic and monitoring tool.
The knowledge that will come from metagenomic analyses of viral datasets could lead to economic benefits, as there is a need for cheap diagnostic tests to be developed for animals (e.g. livestock) and humans, presenting opportunities for current and start-up biotech companies.
Developing metagenomics and bioinformatics skills in the UK is vitally important and this project will contribute towards that. We believe it will attract talented people and encourage them to consider a career in UK genomics. This project will also contribute towards the UK and BBSRC being recognised as leaders in metagenomics, bioinformatics and viral genomics.
People
Richard Leggett (Principal Investigator)
Publications
Ayling M (2020) New approaches for metagenome assembly with short reads, in Briefings in Bioinformatics
Leggett RM (2014) NextClip: an analysis and read preparation tool for Nextera Long Mate Pair libraries, in Bioinformatics (Oxford, England)
Martin S (2023) Capturing variation in metagenomic assembly graphs with MetaCortex, in Bioinformatics (Oxford, England)
Description | 1) We conducted a review of current metagenomic assembly tools and published a paper in Briefings in Bioinformatics. 2) We developed a new tool, MetaCortex, for assembly and analysis of metagenomic sequence data. The tool is general purpose, but particularly targeted at viral datasets. One of the key innovations is the ability to output assemblies in one of two emerging graph formats rather than only as assembled contigs. This allows differentiation of, for example, viral haplotypes. The tool is available to download from GitHub. 3) We sequenced and analysed a set of rodent samples from rural and urban locations in Africa with a view to better understanding zoonotic transmission. A paper is in preparation. 4) We developed a read preparation tool called NextClip, which was published in Bioinformatics. 5) We developed a taxonomy processing tool called acc2tax, which is available on GitHub. 6) We trained three postdoctoral researchers. 7) Knowledge gained through this project has resulted in other successful grant applications. |
Exploitation Route | We believe MetaCortex will be of great interest to those struggling to assemble next generation sequencing data from metagenomic samples. We expect to see it applied to a wide range of datasets in the next few years. The graph output, offering the potential for more detailed analysis of viral haplotypes, should be particularly attractive. The review paper will help to inform inexperienced researchers on the best approaches to metagenomic assembly. The publishing of the African rodent analysis should be of great interest to researchers in disease transmission and will hopefully inform future studies. |
Sectors | Agriculture, Food and Drink; Energy; Environment; Healthcare; Pharmaceuticals and Medical Biotechnology |
Description | DNA sequencing for biological threat monitoring |
Amount | $5,270,000 (USD) |
Funding ID | HR001119C0031 |
Organisation | Defense Advanced Research Projects Agency (DARPA) |
Sector | Public |
Country | United States |
Start | 12/2018 |
End | 12/2023 |
Description | US/UK Microbiome Workshop |
Amount | £1,800 (GBP) |
Organisation | Foreign Commonwealth and Development Office (FCDO) |
Sector | Public |
Country | United Kingdom |
Start | 03/2016 |
End | 03/2016 |
Description | Beth Okamura (Natural History Museum) collaboration |
Organisation | Natural History Museum |
Country | United Kingdom |
Sector | Public |
PI Contribution | Richard Leggett is co-I on Beth Okamura's Leverhulme Trust funded project aiming to characterise parasites in frog populations using metagenomic techniques. |
Collaborator Contribution | Obtained grant funding from Leverhulme Trust |
Impact | Successful application to Leverhulme Trust. |
Start Year | 2020 |
Description | US/UK Microbiome Workshop |
Organisation | Foreign Commonwealth and Development Office (FCDO) |
Country | United Kingdom |
Sector | Public |
PI Contribution | Invited to attend a US/UK Microbiome workshop in San Diego, March 2016. This workshop was aimed at fostering collaboration between US and UK scientists working in metagenomics. The invitation came as a result of this grant award. |
Collaborator Contribution | The workshop was organised and facilitated by the FCO Science & Innovation network, who also covered my travel and accommodation expenses (listed separately under Further Funding). |
Impact | A workshop report. Regular email communication highlighting US/UK collaborative opportunities. |
Start Year | 2016 |
Title | MetaCortex |
Description | MetaCortex is an assembly tool for metagenomic sequence data. We are preparing a paper for publication, but the software is available already. |
Type Of Technology | Software |
Year Produced | 2017 |
Open Source License? | Yes |
Impact | A number of research groups are already using the software. |
Title | acc2tax |
Description | Tool for batch processing of taxonomy - e.g. to convert multiple accessions to taxonomy strings. |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | Used internally at EI and by researchers around the world. |
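To make "convert multiple accessions to taxonomy strings" concrete, here is a minimal sketch of the general idea: look up a taxon identifier for each accession, then walk parent links up to the root to build a lineage string. This is not acc2tax's implementation or interface; the function names, the in-memory tables, the accessions and the taxon names are all hypothetical toy values chosen for illustration.

```python
def lineage(taxid, parents, names):
    """Walk parent links from `taxid` to the root and return the lineage
    as a '; '-separated string, root first."""
    path = []
    seen = set()
    while taxid in parents and taxid not in seen:
        seen.add(taxid)  # guard against cycles in malformed tables
        path.append(names.get(taxid, str(taxid)))
        parent = parents[taxid]
        if parent == taxid:  # the root is its own parent in this toy table
            break
        taxid = parent
    return "; ".join(reversed(path))


def accessions_to_taxonomy(accessions, acc_to_taxid, parents, names):
    """Map each accession to a lineage string, or None if it is unknown."""
    result = {}
    for acc in accessions:
        taxid = acc_to_taxid.get(acc)
        result[acc] = None if taxid is None else lineage(taxid, parents, names)
    return result


# Hypothetical toy tables: taxon id -> parent id, taxon id -> name,
# and accession -> taxon id.
parents = {1: 1, 10: 1, 20: 10, 30: 20}
names = {1: "root", 10: "Viruses", 20: "ExampleFamily", 30: "Example virus"}
acc_to_taxid = {"ACC0001": 30, "ACC0002": 20}

table = accessions_to_taxonomy(
    ["ACC0001", "ACC0002", "MISSING"], acc_to_taxid, parents, names
)
```

Returning `None` for unknown accessions, rather than raising an error, is a design choice suited to batch use, where one missing entry should not abort the whole run.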
Description | Conference talk: MetaCortex: Assembling variation in metagenomics at ISMB/ECCB 2017 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Martin Ayling presented his work on MetaCortex at the HitSeq 2017 meeting at ISMB/ECCB 2017. |
Year(s) Of Engagement Activity | 2017 |
Description | Conference talk: MetaCortex: Assembling variation in metagenomics at ALCoB 2017 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Martin Ayling gave a presentation at the international conference ALCoB 2017 (4th International Conference on Algorithms for Computational Biology). |
Year(s) Of Engagement Activity | 2017 |
Description | Metagenomic viral assembly: the current crop |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Other audiences |
Results and Impact | The post-doc employed on this project, Martin Ayling, delivered a talk at the UK Genome Science meeting, Birmingham, 7-9 September. The audience was composed of genomics researchers from around the UK. The talk prompted discussion with a number of potential collaborators afterwards. These discussions about future projects are ongoing. Martin's slides were made available on F1000Research - see link. |
Year(s) Of Engagement Activity | 2015 |
URL | http://f1000research.com/slides/4-769 |
Description | New approaches for metagenome assembly with short reads |
Form Of Engagement Activity | Engagement focused website, blog or social media channel |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | EI website news article about our publication on approaches for metagenomic assembly with short reads. |
Year(s) Of Engagement Activity | 2019 |
URL | http://www.earlham.ac.uk/newsroom/new-approaches-metagenome-assembly-short-reads |
Description | Next generation sequencing data analysis for metagenomics |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | The PI (Richard Leggett) and the postdoc (Martin Ayling) taught metagenomics data analysis and assembly sessions on the week-long TGAC course "Metagenomics: From bench to data analysis", attended by 20-30 people. The sessions prompted a lot of discussion and questions. |
Year(s) Of Engagement Activity | 2015 |
URL | http://www.tgac.ac.uk/361_Division/training-programme/courses-workshops/tgac-events/metagenomics-fro... |
Description | Project launch press release |
Form Of Engagement Activity | A press release, press conference or response to a media enquiry/interview |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Media (as a channel to the public) |
Results and Impact | A press release at the start of the project, available in the news section of the institute's website. Resulted in some enquiries about the work. |
Year(s) Of Engagement Activity | 2014 |
URL | http://www.earlham.ac.uk/ei-leads-research-help-identify-animal-human-transmitted-diseases |
Description | Scientists on the Loose |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Public/other audiences |
Results and Impact | "Scientists on the Loose" is a series of public engagement activities run by TGAC. Martin Ayling (postdoc on this project) gave a talk on metagenomic assembly to a group from The Norfolk Humanists, at The Coachmakers pub in Norwich. |
Year(s) Of Engagement Activity | 2015 |
URL | http://www.tgac.ac.uk/news/245/68/Scientists-on-the-Loose-when-genetics-met-humanism/ |
Description | TGAC Open Day |
Form Of Engagement Activity | Participation in an open day or visit at my research institution |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | The PI (Richard Leggett) and post-doc (Martin Ayling) manned a stand at the TGAC Open Day describing the project's work on metagenomic assembly algorithms. Lots of people, from school children to the retired, passed through and we had many interesting conversations. |
Year(s) Of Engagement Activity | 2015 |
URL | http://www.tgac.ac.uk/news/242/68/Discovering-the-decoding-of-living-systems/ |
Description | Triple Science Network Science Fair |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Schools |
Results and Impact | 180 Year 9 pupils attended a science fair at Norwich Research Park. The PDR on this project took part in a Scientist Question Time, answering questions about his work and life as a scientist, with the aim of inspiring the next generation. |
Year(s) Of Engagement Activity | 2017 |
Description | Year 10 work experience placement |
Form Of Engagement Activity | Participation in an open day or visit at my research institution |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Schools |
Results and Impact | A one week work experience placement as part of EI's work experience pilot. A successful event for a very gifted year 10 student. The student visited a number of scientists, including spending a day with the PDR on this project understanding his work and learning basic bioinformatics skills (e.g. cluster access, BLAST, online resources). |
Year(s) Of Engagement Activity | 2016 |