Investigating how non-homologous recombination structures genes, proteins, operons, clusters, genomes and ecosystems

Lead Research Organisation: University of Nottingham
Department Name: School of Life Sciences

Abstract

Since Darwin's time we have thought of the evolution of life on the planet in terms of a great unifying tree of life. Darwin, writing to Thomas Henry Huxley, said "The time will come (though I will not live to see it), when we shall have fairly true genealogies of each great kingdom of life". For most of the intervening years, the focus has been on trying to construct this great tree of life. However, while much of life is tree-like and diversifying, much of life is also involved in the process of merging. From simple symbioses, which might or might not become permanent (e.g. the chloroplast that we see powering plant life on the planet is the descendant of a once free-living bacterium), to the hybridization of plants or animals, to the fusion of genes, we see many instances of merging. Unfortunately, our knowledge of these mergings lags well behind our knowledge of diversifying evolution. In this proposal, we will broaden our understanding of mergings and make such analyses easier and more comprehensive.

Our first objective is to develop software to help analyse genetic data. In this effort we have been helped enormously - and perhaps quite surprisingly - by entities such as Google, Facebook and Twitter. These companies have based their technology around the kinds of graphs we are using for molecular sequences. When a person joins Facebook, they are represented by the software as a "node" on a graph. When they "friend" somebody, an "edge" is drawn between these two nodes. When they "like" a post or a page, a different kind of edge is drawn between the person node and the page node. People and pages form a bipartite graph. Pages are characterised, say, as being political pages, or pages with an interest in sport or furniture, etc. Therefore, there is another level for pages. Overall, across people, pages, groups, interests, etc., Facebook represents its entire business as a multi-level graph. We are now doing the same kind of thing for evolving entities.

In the case of multilevel analysis of evolving objects, we can represent the smallest of evolving objects (say, a protein domain) as a node. If two domains are homologous (they share a common ancestor and are related), then we can draw an edge between them. If they appear on the same protein/gene, then we can draw edges between the domains and that gene (much as two people on Facebook might share an interest in fishing). We can then characterise the gene as being of a particular "kind", say metabolic, or membrane-embedded. We can also indicate on our network genes that are sitting on the same chromosome (analogous to saying they have the same "interest"). We can also have a network level where we indicate whether the organism is free-living, pathogenic, anaerobic, involved in a metabolic consortium, etc.
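
The multilevel representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `MultilevelGraph`, the level names, the edge kinds ("homologous", "part_of", "located_on") and all node names are hypothetical and are not taken from the software developed in this project.

```python
from collections import defaultdict


class MultilevelGraph:
    """Tiny typed graph: every node has a level, every edge has a kind."""

    def __init__(self):
        self.level = {}                # node name -> level name
        self.edges = defaultdict(set)  # node name -> {(neighbour, edge kind)}

    def add_node(self, name, level):
        self.level[name] = level

    def add_edge(self, a, b, kind):
        # Edges are undirected, so store them in both directions.
        self.edges[a].add((b, kind))
        self.edges[b].add((a, kind))

    def neighbours(self, node, kind):
        # All nodes joined to `node` by an edge of the given kind.
        return {n for n, k in self.edges[node] if k == kind}


g = MultilevelGraph()

# Level 1: protein domains. Level 2: genes. Level 3: replicons.
for domain in ("domA1", "domA2", "domB"):
    g.add_node(domain, "domain")
for gene in ("gene1", "gene2"):
    g.add_node(gene, "gene")
g.add_node("chromosome1", "replicon")

# Homology edge within the domain level.
g.add_edge("domA1", "domA2", "homologous")

# Membership edges between the domain level and the gene level.
g.add_edge("domA1", "gene1", "part_of")
g.add_edge("domB", "gene1", "part_of")
g.add_edge("domA2", "gene2", "part_of")

# Location edges between the gene level and the replicon level.
g.add_edge("gene1", "chromosome1", "located_on")
g.add_edge("gene2", "chromosome1", "located_on")

domains_of_gene1 = sorted(g.neighbours("gene1", "part_of"))
print(domains_of_gene1)
```

Keeping the edge kind alongside each edge means that one structure can answer questions at different levels, for example which domains sit on a gene, or which genes share a replicon.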

Just as communities form on social networks, communities form on sequence networks. There are many parallels and we can gain significant insights into how evolution is really structuring life on the planet. For instance, preliminary studies have shown that some sequences are promiscuous and some are not. Certain domains are widespread in genes, while some are only found in one kind of sequence and no other. We see plasmids, such as those found in the Lyme-disease-causing bacterium Borrelia, that have unique kinds of genes, but these genes are found across the diversity of Borrelia plasmids. In other words, the genes are species-restricted, but not plasmid-restricted.
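
One simple way to put a number on the promiscuity described above is to count how many different kinds of gene each domain appears in. The sketch below assumes that definition purely for illustration; the domain names, gene names and gene kinds are made up, and the project may measure promiscuity differently.

```python
from collections import defaultdict

# Hypothetical observations: (domain, gene it occurs in, kind of that gene).
observations = [
    ("domX", "gene1", "signalling"),
    ("domX", "gene2", "metabolic"),
    ("domX", "gene3", "membrane-embedded"),
    ("domY", "gene4", "regulatory"),
    ("domY", "gene5", "regulatory"),
]


def promiscuity(observations):
    """Map each domain to the number of distinct gene kinds it is found in."""
    kinds = defaultdict(set)
    for domain, _gene, kind in observations:
        kinds[domain].add(kind)
    return {domain: len(found) for domain, found in kinds.items()}


scores = promiscuity(observations)
print(scores)
```

Here the hypothetical `domX` is found in three kinds of gene and would count as promiscuous, whereas `domY` is restricted to one kind.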

The outcome of this programme will be flexible software and several new insights into how evolution has structured genes and genomes.

Technical Summary

The merging of evolving entities, known as introgression, is the focus of this proposal. The challenge is to move from the current situation where analyses of genetic mergers are relatively ad hoc, to a situation where introgression is as widely understood as phylogenetics, where analysis tools are as widely available, user-friendly and flexible and where evolutionary biologists investigate their data as easily for introgressive processes as they currently investigate the data for treelike processes. This transformation requires careful analysis of concepts, the development of exceptional software and the analysis of data from the diversity of evolving entities.

We will develop a flexible, robust network analysis program that will be a major new addition to the toolset for evolutionary biology. We will develop N-Rooted Fusion Graphs that will facilitate entirely new insights into what happens to sequences post-fusion. We will develop approaches based on graph theory in order to explain how nature structures evolving objects (domains, genes, operons, clusters, genomes, consortia and ecosystems). We will identify communities in bipartite/k-partite graphs with a view to understanding sequence promiscuity, co-occurrence of sequences, major gene flows from one lineage to another, and the level that is most important for understanding an ecosystem (the gene level, or the species level, or the protein domain level, etc.).
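
As a rough illustration of finding communities in a bipartite graph, the sketch below projects a hypothetical gene-family/genome graph onto the gene families, links families whose genome sets overlap strongly (Jaccard similarity at or above a threshold), and reports the connected groups. This is a deliberately simple stand-in, not the community detection method proposed here; the family names, genome names and the 0.5 threshold are assumptions.

```python
from itertools import combinations

# Hypothetical bipartite graph: gene family -> genomes it occurs in.
# Every family is assumed to occur in at least one genome.
occurs_in = {
    "famA": {"g1", "g2", "g3"},
    "famB": {"g1", "g2", "g3"},
    "famC": {"g4", "g5"},
    "famD": {"g3", "g4", "g5"},
}


def jaccard(a, b):
    """Shared genomes divided by all genomes containing either family."""
    return len(a & b) / len(a | b)


def communities(occurs_in, threshold=0.5):
    """Group families linked by Jaccard similarity >= threshold."""
    parent = {family: family for family in occurs_in}

    def find(x):
        # Union-find root lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(occurs_in, 2):
        if jaccard(occurs_in[a], occurs_in[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for family in occurs_in:
        groups.setdefault(find(family), set()).add(family)
    return sorted(sorted(members) for members in groups.values())


result = communities(occurs_in)
print(result)
```

With these made-up data, `famA` and `famB` always co-occur and form one group, while `famC` and `famD` share two of three genomes and form another.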

Planned Impact

The impact of this research will be felt in a variety of different areas, for instance:

Epidemiology:
In the field of infectious disease epidemiology, it is important to have tools to understand associations. Occasionally these associations will indicate causal relationships. In this work, we will provide tools and methods of analysis that can easily be used in applied infectious disease epidemiology. Sequence sharing networks display the flows of genetic information from one genome to another and can be used to highlight recombination events and horizontal gene transfers, as well as track viral and bacterial transmission histories.

Synthetic Biology:
In the field of synthetic biology, the merging together of evolving entities is used to generate DNA sequences with interesting functions. We will develop tools that allow the large-scale analysis of mergings of evolving entities. This is the best way we can devise to discover the rules by which nature merges DNA sequences. As a result, we can begin to understand how nature selects successful sequences, and this understanding can be applied to designing new sequences.

Environmental understandings:
The k-partite graph analyses outlined in this proposal are centered on the environment or the ecosystem or the niche. In other words, I view niches as being multilevel: protein domains have niches, as do genes, clusters and genomes. The research in this proposal will delve into this issue using homology information, genetic co-occurrence information and contextual information about ecological niches. Therefore, this work will have an impact on how we view niches at the genic level and indeed at different genic levels.


New forms of energy:
In the microbial world, we see that the phenotypes of microorganisms rarely follow a treelike pattern. Photosynthesis, for instance, appears in several organisms that don't seem otherwise to be closely related. This is how energy and metabolism work in the microbial world and this project will help us understand the emergence of energy mechanisms. The horizontal transfer of almost 1,000 eubacterial genes at the origins of the haloarchaea resulted in new forms of energy production. This project will develop ways of investigating these kinds of events.


Wider societal impact:
I have appeared extensively in the media - print, TV, radio and online - talking about scientific research. This entire area is one that I have found fascinates the public. That DNA strands merge with other DNA strands in a form of "natural DNA manipulation" is something that interests a great many people. I have made videos explaining my research e.g.: https://www.youtube.com/watch?v=owcRGzIwcAg and I plan on continuing this public engagement.
 
Description When we analyse the genome sequences of Bacteria and Archaea we see that there is an enormous amount of variation in which genes are present or absent. Before our work, these patterns of presence and absence were not well understood. We approached this problem from the perspective that we could represent and analyse the data if we constructed a network to identify connections between genes and genomes. These networks are not too dissimilar to the networks that are used by Facebook, Twitter and Google in order to understand the connections between people, football teams, musicians, etc. What we found was that, in many cases, if a genome contained a particular kind of gene, then this strongly influenced which other genes were likely to be in the genome. This means that we can start looking at the attractions and repulsions that exist within genomes. Future work might be able to use this information in order to construct a "suggestion engine" for sets of genes that would work well together.
Exploitation Route This finding is key to understanding pangenomes and indeed genome content. We hope to take this forward in order to construct new kinds of genomes and new kinds of functioning multi-gene units.
Sectors Agriculture, Food and Drink; Energy; Environment; Healthcare

 
Title CoinFinder 
Description https://github.com/fwhelan/coinfinder A tool for the identification of coincident (associating and dissociating) genes in pangenomes. Written in collaboration with Martin Rusilowicz & Fiona Whelan. What is it? Coinfinder is an algorithm and software tool that detects genes which associate and dissociate with other genes more often than expected by chance in pangenomes. Coinfinder is written primarily in C++ and is a command line tool which generates text, gexf, and pdf outputs for the user. Coinfinder uses a Bonferroni-corrected Binomial exact test statistic of the expected and observed rates of gene-gene association to evaluate whether a given gene pair is coincident. When and why should I use it? Coinfinder is designed to take as input a dataset of pangenomes and their genes. Ideally, genes will be clustered into homologous gene clusters using a pangenomic tool such as Roary, PIRATE, or Pandora. Coinfinder should be used to identify coincident gene sets within a given pangenomic dataset. Coinfinder was written to identify coincident genes among strains of prokaryote species (i.e. a species pangenome) but can be extended to other pangenomic datasets. Manuscript published in Microbial Genomics: Fiona J. Whelan, Martin Rusilowicz, & James O. McInerney. "Coinfinder: detecting significant associations and dissociations in pangenomes." doi: https://doi.org/10.1099/mgen.0.000338 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? Yes  
Impact We now have a tool that enables the large-scale analysis of microbial pangenomes. 
URL http://github.com/fwhelan/coinfinder
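
The kind of test named in the description above, a Bonferroni-corrected binomial exact test on observed versus expected gene-gene co-occurrence, can be illustrated with a small Python sketch. This is not Coinfinder's C++ implementation or its command line interface. It assumes the expected co-occurrence rate is the product of the two genes' individual frequencies, tests only for association (the upper tail), and uses made-up gene names and presence data.

```python
from itertools import combinations
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def associating_pairs(presence, n_genomes, alpha=0.05):
    """Return gene pairs that co-occur more often than expected by chance.

    presence maps each gene to the set of genome indices containing it.
    """
    pairs = list(combinations(sorted(presence), 2))
    hits = []
    for a, b in pairs:
        # Expected co-occurrence rate if the two genes were independent.
        rate = (len(presence[a]) / n_genomes) * (len(presence[b]) / n_genomes)
        observed = len(presence[a] & presence[b])
        p_value = binom_upper_tail(observed, n_genomes, rate)
        # Bonferroni correction: multiply by the number of pairs tested.
        corrected = min(1.0, p_value * len(pairs))
        if corrected <= alpha:
            hits.append((a, b, corrected))
    return hits


# Hypothetical pangenome of 40 genomes.
presence = {
    "geneA": set(range(0, 20)),     # genomes 0..19
    "geneB": set(range(0, 20)),     # always found together with geneA
    "geneC": set(range(0, 40, 2)),  # every second genome
}

hits = associating_pairs(presence, n_genomes=40)
print([(a, b) for a, b, _ in hits])
```

A test for dissociation would use the lower tail, P(X <= k), in the same way.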