# Algebraic Invariants for Phylogenetic Network Inference

Lead Research Organisation: Earlham Institute

Department Name: Research Faculty

### Abstract

A key goal in phylogenetics is to infer the evolutionary histories of species from the DNA sequence data of their living relatives. This has applications in many fields, such as tracing the mutations of viruses during outbreaks, understanding speciation events to aid conservation, and even tracing the histories of ancient manuscripts that were copied by hand through generations.

Most evolutionary histories can be described with a phylogenetic tree, where the "leaves" of the tree represent species that are alive today, and the vertices higher up the tree represent common ancestor species. However, for many biological problems, a tree cannot properly represent the evolutionary history of the species involved. In such cases the species are said to have undergone "horizontal evolution". One example occurs in microbiomes, where different microbial species are able to share portions of their DNA in a process called horizontal gene transfer. This is one mechanism by which antibiotic resistance can spread between bacteria, so being able to detect when such events have occurred has important implications for human health. To describe horizontal evolution, biologists use a phylogenetic network. Here, a tree structure serves as a backbone, onto which further edges are drawn to represent horizontal evolution events.

Inferring the evolutionary histories of species where horizontal evolution has occurred is particularly challenging, and is the focus of much of the research in phylogenetics today. One method of phylogenetic inference is to use algebraic invariants. These have seen significant development for inferring evolution along a tree, and in some cases have been shown to outperform other methods. For phylogenetic networks, however, very little research on algebraic invariants has been done. This project will develop and test the use of algebraic invariants for phylogenetic network inference.

For a particular phylogenetic network, the process of evolution along it can be modelled using a type of probabilistic model called a Markov model. Under this model, one can calculate the probability of observing particular patterns of DNA at the leaves of the network, and these probabilities can be expressed as polynomials in the numerical parameters of the model. By allowing the numerical parameters to vary freely (i.e. treating them as variables), we can represent the network as the set of solutions to the equations describing the probabilities. Such a set of solutions forms an object that algebraists call an algebraic variety. This representation lets us use the powerful machinery of algebraic geometry to determine whether observed DNA sequence data are a good fit for the network. In particular, we can describe the variety corresponding to a network using expressions called algebraic invariants: polynomials that evaluate to zero at every point of the variety. To determine whether a particular network is a good fit for observed DNA sequence data, the idea is to calculate the frequencies of patterns in the data, and then apply the network's algebraic invariants to these frequencies. The closer the resulting quantities are to zero, the more closely the data match the network.
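
As a rough illustration of the idea, the sketch below applies it in the much simpler setting of a four-leaf tree under a two-state Markov model, not a network and not the four-state DNA models this project targets. It relies on a standard fact from the tree case: when the pattern probabilities are arranged into a 4 x 4 "flattening" matrix along a split that is present in the tree, the matrix has rank at most 2, so all of its 3 x 3 minors are invariants. All function names and parameter values here are hypothetical and chosen only for the example.

```python
from itertools import combinations, product

STATES = (0, 1)  # two-state model, assumed here only to keep the example small


def pattern_frequencies(sequences):
    """Relative frequency of each site pattern (one column of the alignment)."""
    columns = list(zip(*sequences))
    freqs = {}
    for col in columns:
        pattern = tuple(int(ch) for ch in col)
        freqs[pattern] = freqs.get(pattern, 0.0) + 1.0 / len(columns)
    return freqs


def quartet_probabilities(root, leaf_mats, internal):
    """Exact pattern probabilities on the four-leaf tree with split 12|34.

    Leaves 1 and 2 hang off internal node u, leaves 3 and 4 off internal node v.
    root[u] is the distribution at u; internal[u][v] is the u-v transition matrix.
    """
    m1, m2, m3, m4 = leaf_mats
    probs = {}
    for i1, i2, i3, i4 in product(STATES, repeat=4):
        probs[(i1, i2, i3, i4)] = sum(
            root[u] * internal[u][v]
            * m1[u][i1] * m2[u][i2] * m3[v][i3] * m4[v][i4]
            for u in STATES for v in STATES
        )
    return probs


def flattening(freqs, split):
    """Arrange pattern frequencies in a 4 x 4 matrix: rows index one side of the split."""
    (a, b), (c, d) = split
    mat = [[0.0] * 4 for _ in range(4)]
    for pattern in product(STATES, repeat=4):
        row = 2 * pattern[a] + pattern[b]
        col = 2 * pattern[c] + pattern[d]
        mat[row][col] = freqs.get(pattern, 0.0)
    return mat


def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def invariant_score(freqs, split):
    """Largest absolute 3 x 3 minor of the flattening; near zero suggests a good fit."""
    mat = flattening(freqs, split)
    return max(
        abs(det3([[mat[r][c] for c in cols] for r in rows]))
        for rows in combinations(range(4), 3)
        for cols in combinations(range(4), 3)
    )


# Illustrative parameters (made up): the same transition matrix on every edge.
edge = [[0.9, 0.1], [0.2, 0.8]]
probs = quartet_probabilities([0.6, 0.4], [edge] * 4, edge)

true_score = invariant_score(probs, ((0, 1), (2, 3)))   # split 12|34, in the tree
wrong_score = invariant_score(probs, ((0, 2), (1, 3)))  # split 13|24, not in the tree
print(true_score, wrong_score)
```

With exact probabilities, the score for the split in the tree is zero up to floating-point rounding, while the score for the other split is clearly positive. With real data, `pattern_frequencies` would supply empirical frequencies in place of `probs`, and the scores would only be approximately zero for a well-fitting model, so some statistical treatment of the values would be needed.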

This project will examine how effective this method is for inferring phylogenetic networks from DNA sequence data. To do this, we will use the most recent developments in the field to calculate the invariants for a small class of phylogenetic networks. Next, we will develop a computational tool that uses these invariants to infer the network that best describes the evolutionary history underlying a set of DNA sequence data. We will then test our tool on both simulated and real DNA sequence data, and compare the results to state-of-the-art methods.

### Publications

Description | We evaluated the performance of a novel algebraic method of inferring evolutionary relationships between taxa from their DNA sequence data. We found that the method performed well, but is currently limited in scope by a lack of mathematical understanding. |

Exploitation Route | The outcomes of this award will provide the basis for further study. New phylogenetic inference methods developed from this work will be of use in evolutionary biology. |

Sectors | Other |

Description | Algorithms for Phylogenetic Network Inference from DNA Sequence Data |

Amount | £3,000 (GBP) |

Funding ID | BB/X005186/1 |

Organisation | Biotechnology and Biological Sciences Research Council (BBSRC) |

Sector | Public |

Country | United Kingdom |

Start | 04/2022 |

End | 06/2022 |

Description | Collaboration with Elizabeth Gross, University of Hawaii |

Organisation | University of Hawaii |

Country | United States |

Sector | Academic/University |

PI Contribution | Research collaboration around phylogenetic network models. |

Collaborator Contribution | Research collaboration around phylogenetic network models. |

Impact | Sam Martin visited Elizabeth Gross in Hawaii and we are now working on two papers for publication. |

Start Year | 2022 |

Description | FTMA: Collaboration with Benjamin Hollering, Max Planck Institute for Mathematics in the Sciences |

Organisation | Max Planck Institute for Mathematics in the Sciences |

Country | Germany |

Sector | Public |

PI Contribution | Collaboration to determine phylogenetic invariants for complex group-based models of evolution on certain phylogenetic networks. Contributed mathematical and computational expertise. |

Collaborator Contribution | Collaboration to determine phylogenetic invariants for complex group-based models of evolution on certain phylogenetic networks. Contributed mathematical and computational expertise. |

Impact | Work is ongoing. |

Start Year | 2022 |

Description | Talk at Algebraic Statistics 2022 |

Form Of Engagement Activity | A talk or presentation |

Part Of Official Scheme? | No |

Geographic Reach | International |

Primary Audience | Other audiences |

Results and Impact | Talk at conference Algebraic Statistics 2022, held at the University of Hawai'i at Manoa, May 2022. |

Year(s) Of Engagement Activity | 2022 |

URL | https://sites.google.com/iit.edu/as2022 |

Description | Talk at BAMC 2022 |

Form Of Engagement Activity | A talk or presentation |

Part Of Official Scheme? | No |

Geographic Reach | National |

Primary Audience | Other audiences |

Results and Impact | Contributed talk at conference "British Applied Mathematics Colloquium 2022". Audience consisted of academics working in various areas of applied mathematics. |

Year(s) Of Engagement Activity | 2022 |

URL | https://bamc2022.lboro.ac.uk/ |

Description | Talk at Emerging Mathematical Frontiers in Molecular Evolution |

Form Of Engagement Activity | A talk or presentation |

Part Of Official Scheme? | No |

Geographic Reach | International |

Primary Audience | Other audiences |

Results and Impact | Contributed talk (online) at the conference Emerging Mathematical Frontiers in Molecular Evolution, held at the Institut Mittag-Leffler in Sweden, August 2022. |

Year(s) Of Engagement Activity | 2022 |

URL | http://www.mittag-leffler.se/konferens/emerging-mathematical-frontiers-molecular-evolution |