Machine-learning to predict and understand the zoonotic threat of E. coli O157 isolates

Lead Research Organisation: University of Edinburgh
Department Name: The Roslin Institute


Enterohemorrhagic Escherichia coli (EHEC) O157 are bacteria that have their main reservoir in food production animals, predominantly cattle, and can be responsible for serious and life-threatening infections in humans. There are specific factors that define EHEC O157, including a micro-injection (type 3 secretion) system and production of specific Shiga toxins. However, we have known for nearly twenty years that not all subtypes represent the same threat to human health and significant effort has gone into understanding why this is the case. One key reason is that there are different Shiga toxin types, some potentially more toxic than others, and their production levels differ between isolates. This variability comes from the fact that Shiga toxins are introduced into the bacteria by infection with bacterial viruses, known as bacteriophages. These integrate their DNA into the bacterial genome in a 'prophage' state. When the bacterial cell is threatened, this can activate the prophage to produce copies of itself and new bacteriophages. From whole genome sequencing of E. coli we are now aware that multiple prophages are present in E. coli genomes, some in different states of decay, but they can impact on each other and recombine to produce new variants. Many of the differences between E. coli O157 isolates are down to their prophage content, yet sequence identification methods generally use only 'core' genes for epidemiological studies.
We have recently applied machine-learning approaches to examine whole genome sequences of E. coli O157 from cattle and humans. We use these sequences as training sets and then ask the trained model to predict which group other E. coli O157 isolates should be assigned to. Surprisingly, it assigns only a small proportion (<10%) of isolates from cattle to the human grouping, indicating that only this small subset may be more of a threat to human health. This grant is to investigate the biological basis of this selection process. We know that the machine-learning assignment is based on discriminatory protein variants predicted to be expressed from mainly prophage genes, so this fits with our understanding of the variation present in these isolates. The proposed work will be a combination of bioinformatics research and 'wet' infection biology research. For the bioinformatics we can use subjective and objective approaches to swap gene variants, including whole prophages, between isolate sequences and re-calculate their host prediction scores. This will allow us to define the most important combinations of genes being used for the prediction of zoonotic potential. It may also highlight specific genes to simplify the identification process. In the laboratory we will initially compare isolates that are very similar at the core genome level but differ markedly in their prediction scores. We will examine their gene expression profiles, metabolic profiles and key phenotypes such as Shiga toxin production, cellular interactions and pathology in a mouse model. Then we will swap or mutate genes identified by the bioinformatics and test these strain variants in the same laboratory assays.
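The swap-and-rescore idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the protein variant names, the prophage groupings, the toy presence/absence profiles and the simple sub-gradient training loop are hypothetical and do not reproduce the project's data or pipeline. It trains a small linear support vector machine on human and cattle profiles, copies one prophage block from a human isolate into a cattle isolate, and recalculates the raw decision score (not a calibrated probability).

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical protein variants (PVs), grouped by the prophage carrying them.
PV_NAMES = ["pvA1", "pvA2", "pvB1", "pvB2", "pvC1", "pvC2"]
PROPHAGE_BLOCKS: Dict[str, List[int]] = {"A": [0, 1], "B": [2, 3], "C": [4, 5]}

# Toy presence/absence profiles (1 = PV present). Not real isolate data.
HUMAN = [[1, 1, 1, 0, 1, 0], [1, 1, 0, 1, 0, 1], [1, 1, 1, 1, 1, 0]]
CATTLE = [[0, 0, 0, 0, 1, 0], [0, 0, 1, 1, 0, 1], [0, 0, 0, 1, 1, 1]]


def train_linear_svm(X: Sequence[Sequence[int]], y: Sequence[int],
                     lam: float = 0.01, lr: float = 0.1,
                     epochs: int = 500) -> Tuple[List[float], float]:
    """Fit a linear SVM by batch sub-gradient descent on the hinge loss."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # only margin violators contribute
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def score(w: Sequence[float], b: float, x: Sequence[int]) -> float:
    """Decision score: positive leans 'human', negative leans 'cattle'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def swap_prophage(recipient: Sequence[int], donor: Sequence[int],
                  block: str) -> List[int]:
    """Return a copy of recipient with one prophage block taken from donor."""
    swapped = list(recipient)
    for j in PROPHAGE_BLOCKS[block]:
        swapped[j] = donor[j]
    return swapped


if __name__ == "__main__":
    w, b = train_linear_svm(HUMAN + CATTLE, [1] * 3 + [-1] * 3)
    recipient, donor = CATTLE[2], HUMAN[0]
    for block in PROPHAGE_BLOCKS:
        swapped = swap_prophage(recipient, donor, block)
        delta = score(w, b, swapped) - score(w, b, recipient)
        print(f"prophage {block}: score change {delta:+.3f}")
```

Repeating the swap for each block and comparing the score changes gives a crude indication of which prophage regions the model relies on most. In practice the swaps would be made at the sequence level and the protein variants re-extracted before rescoring.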
The research should help validate this exciting new approach to understanding bacterial virulence and identify genes involved in the zoonotic threat of this dangerous pathogen. We should then be able to develop simpler approaches to identifying these specific variants on farms and intervene, for example with a vaccine, to reduce the threat to human health. The approach may also work to predict differences in virulence between human isolates and this could have repercussions for how specific outbreaks are managed. This research is timely as it builds on our recent and unique application of machine learning to predict zoonotic potential and on access to fully annotated PacBio sequences of UK cattle and human E. coli O157 isolates generated in partnership with Dr James Bono (USDA, Nebraska).

Technical Summary

Enterohaemorrhagic E. coli (EHEC) O157 lysogenized with Shiga toxin 2a (Stx2a)-encoding bacteriophages have become prevalent in cattle in the UK in the last 30 years and this timeframe matches the emergence of serious EHEC-associated human disease. Cattle are an asymptomatic primary reservoir for this zoonosis, which can cause bloody diarrhoea and kidney/brain damage in humans. Whole genome sequencing has demonstrated the mosaic nature of the E. coli O157 genome and multiple prophages contribute to the diversity of this serotype. Based on whole genome sequences, we have recently used a support vector machine (SVM), a machine-learning algorithm, to predict the zoonotic potential of cattle isolates. The main conclusion was that only a small fraction of the bovine isolates (<10%) may be a threat to human health, even within previously defined pathogenic lineages. The prediction probabilities are based primarily on prophage-associated differential protein variants (PVs) extracted from sequence assemblies. The proposed study will combine bioinformatics and laboratory research to define key prophage regions important for prediction and investigate how they impact on pathogen biology. The computer-based studies will focus on in silico recombination, decision trees/random forests and genetic algorithms to define critical combinations of PVs. The laboratory work will initially study paired isolates with similar core genomes but markedly different prediction scores. Transcriptomic analysis, metabolic profiling, Shiga toxin production, cellular interactions and toxin pathology in a mouse model will be studied. Isogenic mutants of prophage regions identified by the bioinformatics analyses will then be characterised in the same laboratory assays. The research aims to identify differential genes responsible for zoonotic potential and use this information to simplify assessment of farm isolates to allow targeted interventions.
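A basic building block of the decision-tree analyses mentioned above is a measure of how well a single PV separates human-derived from cattle-derived isolates. The sketch below is illustrative only: the PV names and toy profiles are hypothetical, and Gini impurity reduction is assumed here as the split criterion because it is a common default, not because the project specifies it. It ranks PVs by the impurity reduction obtained when isolates are split on presence or absence of each PV.

```python
from typing import List, Sequence, Tuple

# Hypothetical PV names and toy presence/absence profiles (1 = PV present).
PV_NAMES = ["pvA1", "pvA2", "pvB1", "pvB2", "pvC1", "pvC2"]
HUMAN = [[1, 1, 1, 0, 1, 0], [1, 1, 0, 1, 0, 1], [1, 1, 1, 1, 1, 0]]
CATTLE = [[0, 0, 0, 0, 1, 0], [0, 0, 1, 1, 0, 1], [0, 0, 0, 1, 1, 1]]


def gini(n_human: int, n_cattle: int) -> float:
    """Gini impurity of a group with the given class counts."""
    total = n_human + n_cattle
    if total == 0:
        return 0.0  # an empty group contributes no impurity
    p = n_human / total
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)


def split_gain(j: int, human: Sequence[Sequence[int]],
               cattle: Sequence[Sequence[int]]) -> float:
    """Impurity reduction from splitting isolates on presence of PV j."""
    n = len(human) + len(cattle)
    h_present = sum(x[j] for x in human)
    c_present = sum(x[j] for x in cattle)
    h_absent = len(human) - h_present
    c_absent = len(cattle) - c_present
    parent = gini(len(human), len(cattle))
    children = ((h_present + c_present) / n * gini(h_present, c_present)
                + (h_absent + c_absent) / n * gini(h_absent, c_absent))
    return parent - children


def rank_pvs(human: Sequence[Sequence[int]],
             cattle: Sequence[Sequence[int]]) -> List[Tuple[str, float]]:
    """Return (PV name, gain) pairs, most discriminatory first."""
    gains = [(PV_NAMES[j], split_gain(j, human, cattle))
             for j in range(len(PV_NAMES))]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, gain in rank_pvs(HUMAN, CATTLE):
        print(f"{name}: gain {gain:.3f}")
```

A random forest repeats this kind of split selection over many resampled trees and feature subsets, so it can pick out combinations of PVs that a single-PV ranking would miss. A short list of high-ranking PVs is the sort of candidate panel that a simplified farm test might be built on.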

Planned Impact

The Edinburgh EHEC grouping has been growing since 1999 and now has links to an extensive network of scientists nationally and internationally covering epidemiology, molecular biology, health impact and possible interventions; this includes collaborations with basic research groups, animal scientists, diagnostic and public health laboratories (PHE, SERL, HPS). A good relationship with FSS/FSA further links us through to food producers, politicians and the general public. Knowledge exchange will be maintained with these groups by twice-yearly meetings, which currently occur under our FSS programme (end 08/17), and this momentum will be maintained under this grant. We will also host a specific symposium at the Roslin Institute in 2019 to discuss the application of machine learning to interrogate both zoonotic potential and host source. We are currently helping SERL with the installation of a bioinformatics pipeline based on core SNP differences (developed at PHE) and we aim for this to be expanded to include prophage profiles and host prediction scores. A longer-term objective is to work with Health Protection Scotland (HPS) to understand how the SVM prediction scores may relate to patient pathology, with a direct impact on outbreak management. Our group works hard to ensure we obtain published outputs from our research and we have a good track record in this and in delivery of seminars across the country and abroad. We are currently part of two international partnership awards, one with researchers in Argentina, which has some of the highest rates of EHEC disease in the world. This application builds on this award by using their in vivo infection model and further researcher exchange. Another partnership award is focused on vaccine development against bacterial zoonoses originating from livestock with groups in the USA, and our partnership with Jim Bono (USDA) will further develop this network.
Another important impact is through the training of veterinary undergraduates at the University of Edinburgh through lectures and tutorials that benefit from the advances made in this research. This feeds through to the important role that veterinary clinicians have in working with commercial producers and the public to raise awareness of such infections.
We envisage the possibility of herd testing and application of a vaccine or alternative intervention based on identification of strains with high zoonotic potential using the SVM method. Development of a multiplex PCR or alternative test could be applied to screen herds to identify those that should be targeted. Impact for this is through a commercial partner potentially allied to ongoing research on E. coli O157 vaccines. We are currently in negotiation over the licencing of our vaccine patents and with an intention to test the vaccine in a feedlot trial in collaboration with the USDA (Dr Jim Bono, Nebraska). The machine-learning (SVM) method is also proving accurate for predicting the isolation host of E. coli in general (not just EHEC) and this could have important repercussions for food, health and environmental sampling. We will work to achieve this with our dedicated business development operatives from Edinburgh Research and Innovation (ERI), a non-profit subsidiary company of the University of Edinburgh, who are based at the Roslin Institute.
EHEC O157:H7 and other Stx-associated infections generate considerable public interest and we are committed to disseminating our findings as widely as possible. The Roslin Institute provides information about our research through our web site, talks and discussion groups and direct interaction with the media. Each investigator & PDRA on the grant will be expected to spend ~2 days/yr in direct engagement with the public & schools, including participation in our yearly 'open doors' events. Direct impact is also achieved through training of these staff in diverse skills, including bioinformatics and molecular biology.


Description We are studying exactly how such duplications and inversions alter the predictive scores and how they transition through to changes in phenotype. We propose that such re-arrangements provide fundamental plasticity to the organism to switch phenotype in different niches. We are investigating exactly which phenotypes relating to virulence may be affected, including Shiga toxin production and type 3 secretion. Our work is also examining why a specific subset of UK E. coli O157 strains are more of a threat to human health than others based on extensive sequencing of human and cattle isolates. We are currently trying to obtain more metadata in relation to human outbreaks against which genomic features can be correlated.
Exploitation Route We continue to collaborate with the Scottish E. coli reference laboratory and Public Health England to extrapolate our findings to help investigate human outbreaks, in particular to understand the likely source and geography of the infection and the threat to human health of the isolate concerned.
Sectors Agriculture, Food and Drink, Healthcare

Description Infections with bacteria encoding Shiga toxins can be lethal and are also associated with long-term morbidity, often as a result of kidney damage requiring repeated dialysis treatments and an eventual transplant. The aim of this work is to define the genetic regions of isolates predicted to have higher zoonotic potential. Routine whole genome sequencing of human EHEC isolates is underway at Public Health England (PHE) and planned at the Scottish E. coli Reference Laboratory (SERL). We already collaborate with both agencies and progress on our project will allow improved predictive capacity of the threat posed by specific isolates based on their sequences. We are currently helping SERL with the installation of a bioinformatics pipeline for providing EHEC outbreak isolates with unique identifiers based on core SNP differences (developed by Tim Dallman at PHE) and this will be expanded to consider their prophage profiles and host prediction scores. A longer-term objective is to work with Health Protection Scotland (HPS) to understand how our scores may relate to levels of pathology in patients, with the hypothesis that isolates with more bovine-associated scores may be less virulent. If this is the case then it could alter how specific infections are handled depending on the perceived threat of the isolate. To achieve this we will arrange twice-yearly meetings with PHE/SERL and HPS to update them on our progress and discuss pipeline changes that could be instigated. Our recent EHEC O157 research programme was funded by the Food Standards Agency and Food Standards Scotland. There is increasing concern over the contamination of fresh produce with Shiga toxigenic bacteria and we will continue our close working relationship, especially with FSS, through twice-yearly reports and yearly meetings on our progress.
Tracking the source of infections is often critical to outbreak investigations, as in the current concern over the death of a 3-year-old in Scotland, allegedly through consumption of unpasteurized blue cheese. We consider that the machine learning approaches will have value in epidemiology investigations and will be of interest to FSA/FSS and their stakeholders, from farmers and food producers to packagers, consumers and politicians. Understanding the genetic basis of our zoonotic prediction scores should provide confidence in the approach at a commercial level. At this point we would envisage the possibility of herd testing and application of a vaccine or alternative intervention based on identification of strains with high zoonotic potential. Currently, a whole genome sequencing approach is probably too expensive, although costs continue to fall, so an important aim of our research is to define a limited number of protein variants (PVs) that work well as a proxy for the SVM scoring. At this point a multiplex PCR or alternative test could be applied to screen herds to identify those that should be targeted. The pathway for this is allied to our ongoing research on E. coli O157 vaccines and this continues to have commercial interest. We are currently in negotiation with Pacific Gene Tech (PGT) over the licencing of our vaccine patents and they aim to test the vaccine in a feedlot trial in collaboration with USDA (Dr Jim Bono, Nebraska) next summer (2017). If this is successful, then a pathway to development of the diagnostic could be through working with PGT or an alternative commercial partner. To achieve this we will hold meetings with them to discuss both the machine-learning approach and development of a more focused PCR-diagnostic. The machine-learning (SVM) method is also proving accurate for predicting the isolation host of E. coli in general (not just EHEC) and this could have important repercussions for food, health and environmental sampling.
First Year Of Impact 2017
Sector Agriculture, Food and Drink, Healthcare
Impact Types Policy & public services

Description Collaboration with Public Health England 
Organisation Public Health England
Country United Kingdom 
Sector Public 
PI Contribution Provision of animal and human STEC for sequencing, working with PHE to analyse strain phylogeny and epidemiology. We have contributed through further analysis of long read strain sequences to understand changes in strains that occur during outbreaks. We have co-supervised 2 PhD students on STEC bioinformatics projects.
Collaborator Contribution Reduced rate sequencing of STEC, analysis of data, provision of metadata. Co-publication.
Impact Publications as in main list
Start Year 2013
Description Artificial Intelligence (AI) workshop at Earlham 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Inter-institute workshop that discussed main applications of AI in their fields and potential for further research collaboration
Year(s) Of Engagement Activity 2018
Description International workshop on Shiga toxin-producing Escherichia coli at The Roslin Institute 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact A two-day international workshop was held at The Roslin Institute on Shiga toxin-producing Escherichia coli (STEC), funded partly by this award (for travel of US collaborators) and partly by the Food Standards Agency of Scotland via a £2m award for collaborative research by a consortium led by Professor Gally. The workshop attracted leading academics working on E. coli O157 and other STEC from the US (Jim Bono, Guy Loneragan, Tom Edrington), Canada (Tim McAllister, Kim Stanford), Germany (Christian Menge), Belgium (Eric Cox), Sweden (Erik Eriksson, Lena-Mari Tamminen, Robert Soderlund) and the United Kingdom (Claire Jenkins, Tim Dallman, Dominic Mellor, Norval Strachan [Chief Scientific Advisor for FSA Scotland]). The workshop shared the latest advances in understanding of the biology of E. coli O157 and other STEC, including epidemiology, genomics, virulence, super-shedding and control strategies.
Year(s) Of Engagement Activity 2017