Development of robust analytical pipelines for the analysis of microbial community data from clinical samples

Lead Research Organisation: University of Birmingham
Department Name: Sch of Biosciences

Abstract

Cystic fibrosis is an inherited disease which affects around 9,000 people in the UK. It is a recessive disorder, meaning that both parents have to carry a faulty copy of a gene for a child to be affected. Approximately 1 in 25 people carry this copy. Cystic fibrosis dramatically shortens the life of those affected and almost half do not live beyond their 30s. Cystic fibrosis has a negative effect on many parts of the body, but it particularly affects the lungs. The result of the defective gene is that thick secretions cannot be effectively cleared from the lungs, resulting in the airways becoming congested and damaged.

This congestion results in frequent and severe exacerbations, sometimes requiring admission to hospital and treatment. We think exacerbations are often a result of infections, which may be caused by viruses and bacteria. One of the most common bacteria found is Pseudomonas aeruginosa. Microbiologists diagnose Pseudomonas by putting patient samples onto plates and identifying the bacteria that grow in visible colonies. Pseudomonas is often treated with antibiotics. Other bacteria may also be found, including Streptococcus and Staphylococcus.

However, when looking for causes of exacerbations we are limited to finding pathogens that we know about, and which grow on the types of culture plates we use. It may be that other bacteria are present that we can't see because they do not grow easily. Sometimes those bacteria may be a cause. However, in a similar way to the human gut, it may be that there are "good" and "bad" bacteria. It is known that certain types of bacteria can prevent other types from infecting, and therefore they may help protect against exacerbations.

A new technology utilises the idea of molecular barcodes which identify bacterial species from fragments of DNA in their cells. New instruments termed high-throughput sequencers permit these barcodes to be read from many samples easily and cheaply. This technology gives us a "parts list" of the bacterial species in a particular sample, and a rough idea of how frequently they occur. By reading these parts lists from patients with cystic fibrosis (when they are well, when they are very sick and when they are recovering), we may be able to tell the relative contribution of these unseen bacteria to the condition. For example, a particular species increasing or reducing in abundance may be associated with recovery, giving us a potential therapeutic target.
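As a rough illustration of what such a "parts list" looks like in practice, the sketch below turns a list of per-read genus labels into relative abundances. The function name and the read labels are invented for illustration and are not data or software from this study; real pipelines assign labels by comparing barcode sequences against a reference database, which is not shown here.

```python
from collections import Counter

def relative_abundance(read_assignments):
    """Turn per-read genus labels into fractions of the total read count."""
    counts = Counter(read_assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {genus: n / total for genus, n in counts.items()}

# Hypothetical labels for ten reads from one sample (invented for illustration).
reads = ["Pseudomonas"] * 6 + ["Streptococcus"] * 3 + ["Staphylococcus"]
profile = relative_abundance(reads)

# Print the most abundant genera first.
for genus, fraction in sorted(profile.items(), key=lambda kv: -kv[1]):
    print(f"{genus}: {fraction:.0%}")
```

Comparing such profiles from the same patient when well, unwell and recovering is one simple way to spot a genus whose share rises or falls with the course of an exacerbation.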

We are also interested in seeing how patients end up being colonised with particular, commonly seen bacteria. For example, in our local patients, about 30% have a particular type of Pseudomonas infection called Midlands 1. However, very little is known about how it is that so many patients end up being infected by the same strain. We are now able to sequence all the DNA in a bacterial cell (the genome), which gives a very high-resolution view of how it has evolved. By comparing genomes of strains from different patients, we can help determine whether patients are infecting each other with the same strain, or whether the strains are quite different and come from many different sources. We can also see how the Pseudomonas evolves whilst it is in a patient's lungs. Previous studies have shown that the Pseudomonas adapts to the specific environment, which may give us clues as to why some people with the same cystic fibrosis mutation have different courses and end up with more hospital admissions or exacerbations than others.
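One common way to compare genomes is to count the positions at which two aligned sequences differ; a small count suggests the strains are closely related. The sketch below is a minimal example of that idea, assuming the sequences have already been aligned to the same length. The function name and the short sequences are hypothetical, and this is not necessarily the method used in the study.

```python
def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences carry different bases.

    Positions where either sequence has an ambiguous or missing base
    (anything other than A, C, G or T) are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in valid and b in valid
    )

# Toy aligned sequences, invented for illustration only.
patient_1 = "ACGTACGTAC"
patient_2 = "ACGTACGTAT"
environment = "ATGTTCGAAC"

print(snp_distance(patient_1, patient_2))    # isolates from two patients
print(snp_distance(patient_1, environment))  # patient isolate vs environmental isolate
```

In this toy example the two patient isolates differ at fewer positions than the patient and environmental isolates do, which is the kind of pattern that would point towards a shared source or patient-to-patient spread.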

The technology we are using is very new, and there are a number of difficulties to overcome before it can be a routine clinical test. One problem is that the machines generate plenty of sequencing "noise", which may make it look as though species are present when they are not. I want to develop bioinformatics methods that increase the reliability of these techniques and generate information that would be useful for clinicians. We expect these techniques to enter the clinic within the next five years.
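A very simple way to reduce the effect of such noise is to discard any species supported by only a tiny fraction of the reads in a sample. The sketch below shows that idea only; the threshold, function name and counts are assumptions made for illustration, and the methods to be developed in this fellowship are expected to be more sophisticated.

```python
def filter_noise(counts, min_fraction=0.001):
    """Keep only taxa whose share of all reads is at least min_fraction."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        taxon: n
        for taxon, n in counts.items()
        if n / total >= min_fraction
    }

# Hypothetical read counts for one sample (invented for illustration).
sample_counts = {
    "Pseudomonas": 9000,
    "Streptococcus": 950,
    "Staphylococcus": 48,
    "Unknown_A": 2,  # so few reads that it may be sequencing noise
}

filtered = filter_noise(sample_counts)
print(sorted(filtered))
```

A fixed cut-off like this risks throwing away genuinely rare species along with the noise, which is one reason more careful statistical approaches are needed before clinical use.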

Technical Summary

Cystic fibrosis (CF) affects around 9,000 people in the UK and is one of the most common inherited causes of premature death in adults. CF is inherited recessively through mutations in the gene encoding the CF transmembrane conductance regulator (CFTR), an epithelial chloride ion channel. In the respiratory tract, defective chloride ion transport leads to dehydration of the mucosal surface, consequent failure of the mucociliary escalator and plugging of the airways by hyperviscous mucus. Although premature death is inevitable without treatment, in developed countries management of the condition has resulted in a dramatically increased life expectancy, with most patients now surviving into adult life. However, still only half of CF patients survive into their 40s.

During this fellowship, I aim to answer several key questions relating to chronic lung infection in cystic fibrosis, particularly:
- Is there a typical CF microbiota, and what is it like?
- How does the CF microbiota vary according to disease severity?
- Why has the Pseudomonas aeruginosa Midlands-1 clone established itself in our local cohort, and what are the routes of transmission?
- How do Pseudomonas and other species evolve to adapt to the host during colonisation?

Answering these questions requires the concurrent addressing of important technical questions, specifically:
- Which benchtop instrument (Ion Torrent or Illumina MiSeq) performs best for 16S phylogenetic surveys?
- How well does whole-genome shotgun metagenomics perform compared to 16S analysis for phylogenetic surveys?
- What other information does whole-genome shotgun metagenomics provide compared to 16S?
- Are these methods statistically robust enough to enter clinical practice?
- Can 16S and metagenomics data be presented in a way useful for clinicians?

The methods employed will include use of 16S amplicon sequencing, whole-genome shotgun sequencing and whole-genome metagenomics (see case for support).

Planned Impact

This work will have a direct impact on patients, and in particular cystic fibrosis sufferers and their carers, who will benefit from an enhanced understanding of the polymicrobial nature of CF lung infection and bronchiectasis. These findings will be useful for those looking at ways of developing novel therapeutics targeting the polymicrobial communities, for example probiotics and bacteriophage therapy. The community will be engaged and kept abreast of the progress of the study through the CF patient representative, the CF patient group, the Heartlands CF newsletter and the Heartlands CF website (www.heartlandscf.org). An executive summary plus the anonymised results of the final report will be sent to GPs of the subjects involved in the study, published on the CF website and publicised via the CF newsletter. The final report will also be offered for circulation to relevant organisations interested in cystic fibrosis, for example the Cystic Fibrosis Trust and Cystic Fibrosis Medicine.

The economic and social impact of this research will stem from the ability to make more timely and accurate diagnosis. A better understanding of the effects of antimicrobial treatments may lead to the development of more personalised treatments, and potentially a change in the way we treat cystic fibrosis exacerbations. The result of improved cystic fibrosis care with the aim of reducing exacerbations will improve the quality of life of both patients and their carers.

The work on metagenomic data analysis will have an impact on those studying microbiomics of disease, and more broadly the whole field of microbial ecology, due to the novel methods developed during this fellowship. The trials of metagenomics using benchtop sequencers will bring the knowledge required to start using these cutting-edge techniques in the clinical arena, complementing existing conventional microbiological diagnostics and providing a richer picture of the microbial ecology of the respiratory tract. These techniques can be applied to the field as a whole.
 
Description Asked to contribute expertise to drafting of POSTnote on surveillance of infectious disease
Geographic Reach National 
Policy Influence Type Implementation circular/rapid advice/letter to e.g. Ministry of Health
 
Description Contribution to PHG Foundation Pathogen Genomics into Practice report
Geographic Reach National 
Policy Influence Type Participation in a guidance/advisory committee
URL http://www.phgfoundation.org/reports/16857/
 
Title Beta testing and implementation of nanopore sequencing 
Description I was selected to be one of the first recipients of the MinION(tm) sequencing instrument from Oxford Nanopore Technologies, a British-owned biotechnology company. This technology is able to produce whole-genome sequences on a "USB stick" sequencer no larger than a typical smartphone. 
Type Of Material Improvements to research infrastructure 
Year Produced 2014 
Provided To Others? Yes  
Impact Our laboratory were responsible for early beta-testing of the instrument, providing critical feedback on the instrument and laboratory and bioinformatics methods development which we shared back with the community. We were the first to generate data from this instrument (figshare.com/articles/A_P_aeruginosa_serotype_defining_single_read_from_our_first_Oxford_Nanopore_run/1052996) and then subsequently the first to publish a bacterial genome sequenced with it (www.gigasciencejournal.com/content/3/1/22). Work is ongoing in implementing real-time genome sequencing for the clinic with this technology, and for outbreaks e.g. Ebola. 
 
Title CLIMB: Cloud Infrastructure for Microbial Bioinformatics 
Description The Cloud Infrastructure for Microbial Bioinformatics (CLIMB) was funded by the MRC in 2014. I am one of three research fellows on this project. I have been responsible, with Tom Connor (Cardiff), for the design and technical deployment of the >£4m investment in computer hardware and storage for this project. We have successfully deployed a cloud infrastructure using the OpenStack software across four sites. The service is currently in its beta-testing phase, and has over 50 research groups using it. 
Type Of Material Improvements to research infrastructure 
Year Produced 2015 
Provided To Others? Yes  
Impact Provided computational infrastructure support for the recent Ebola real-time genome sequencing project. 
URL http://www.climb.ac.uk
 
Title Implementation of whole-genome sequencing service at University of Birmingham 
Description Development of whole-genome sequencing service at the University of Birmingham and associated software and laboratory infrastructure to support genome sequencing for researchers at a fixed price of £50/sample. 
Type Of Material Improvements to research infrastructure 
Year Produced 2013 
Provided To Others? Yes  
Impact New collaborations with the University of East Anglia, University of Liverpool, University of Nottingham and internally with Institute of Microbiology and Infection researchers 
 
Title Pseudomonas aeruginosa whole-genome sequence database 
Description A database of whole-genome sequences from Pseudomonas aeruginosa, collected from cystic fibrosis patients, burns patients, the hospital water and the environment. 
Type Of Material Database/Collection of Data/Biological Samples 
Year Produced 2013 
Provided To Others? Yes  
Impact Ongoing discussions with the Department of Health over the use of whole-genome sequencing to help determine the role of hospital water in nosocomial infections, and development of a scheme of prospective pilot sequencing in three hospital centres. 
URL http://pathogenomics.bham.ac.uk/clinicogenomics
 
Title Established real-time Ebola genome surveillance in Guinea for World Health Organisation 
Description The Ebola virus epidemic of 2014-2016 was the largest and most lethal reported, responsible for >13,000 deaths. We established a portable genome surveillance laboratory in Guinea working with the European Mobile Laboratories, Public Health England and the World Health Organisation to provide real-time genomic surveillance of the outbreak from April 2015. The laboratory was based around the new Oxford Nanopore MinION portable genome sequencer. 
Type Diagnostic Tool - Non-Imaging
Current Stage Of Development Early clinical assessment
Year Development Stage Completed 2015
Development Status Under active development/distribution
Impact After establishment of the laboratory we were able to obtain genome sequences from patient isolates within 24 hours of the patient sample being received, unprecedented anywhere in the world, but even more notable for being done in an extremely resource-limited setting. We were able to sequence >50% of all the Ebola cases in Guinea from April 2014 onwards. The information we generated was shared in real-time. Results were communicated to epidemiologists involved in the epidemic response via the WHO, and openly, through websites and online repositories including the ebola.Nextstrain.org website. The work was featured widely in the press, including an interview with Nick Loman on the BBC World Service and even on Bill Gates' Twitter feed. 
URL http://lab.loman.net/2016/02/03/behind-the-paper-real-time-portable-sequencing-for-ebola-surveillanc...
 
Title CONCOCT 
Description CONCOCT is software for reconstructing genomes from metagenomics. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact Publication in Nature Methods http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3103.html and featured in their research highlights http://blogs.nature.com/methagora/2014/09/microbial-sequencing-at-nature-methods.html 
URL https://github.com/BinPro/CONCOCT
 
Title Nanopore error correction software: nanocorrect & nanopolish 
Description We developed two new open source software packages for analysis of Oxford Nanopore MinION sequencing data. The tools, nanocorrect and nanopolish, can be used to assemble genomes without relying on sequencing data from other instruments, which was previously necessary in order to use Oxford Nanopore data. The software can also be used to detect single nucleotide variants, for example in outbreak investigation. 
Type Of Technology Software 
Year Produced 2015 
Impact The software was initially used to generate the first bacterial genome assembly, of E. coli K-12, using just nanopore data, and was published in Nature Methods. The software was later adapted to provide highly accurate variant calls in support of the real-time Ebola genomic surveillance project (Quick, Loman, et al., Nature 2016). 
URL http://github.com/jts/nanopolish
 
Title poretools 
Description This is the first published piece of software for analysing data from the Oxford Nanopore sequencing platform. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact n/a 
URL http://bioinformatics.oxfordjournals.org/content/early/2014/08/19/bioinformatics.btu555.abstract
 
Description WebValley 2014 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Schools
Results and Impact International students between 17 and 18 years old learnt how to operate the Oxford Nanopore portable genome sequencer, and the basics of metagenomics analysis including assembly.

Interest in studying bioinformatics and genomics at an undergraduate level.
Year(s) Of Engagement Activity 2014