Mechanistic models of protein sequence evolution

Lead Research Organisation: University College London
Department Name: Infection

Abstract

All life on earth is the result of molecular evolution, so it is not surprising that evolutionary viewpoints have yielded so much understanding of living systems. By modelling how proteins have evolved and co-evolved, by re-creating ancestral forms in the laboratory, by analysing patterns of conservation and change in individual proteins, and by identifying related proteins about which more is known, phylogenetic analysis has become a powerful tool throughout the life sciences.

Analyses of protein evolution rely on appropriate models of sequence change. Currently, such analyses are dominated by empirical models created by comparing closely related proteins. To make model creation tractable and model use computationally feasible, a number of seemingly unjustifiable assumptions are made, such as that all sites in the protein evolve independently and in a similar manner. This simplicity has been the source of much of the models' power, success, and widespread use. Unfortunately, these simplifications have three major negative consequences. Firstly, the parameters in these models do not necessarily correspond to biologically relevant quantities, making interpretation of the results difficult. Secondly, some of the analyses are very sensitive to the assumptions made, with unrealistic assumptions yielding erroneous or misleading results. Thirdly, we are often interested precisely in situations where the assumptions break down, because these breakdowns provide important information about the proteins' properties and roles and how these change. These limitations have led to an interest in mechanistic models of the substitution process based on the underlying protein biophysics, molecular biology, and evolutionary dynamics. Progress in this area has been limited, however, by the complexity of protein evolution and our limited understanding of it, especially the extent, significance, and impact of epistatic interactions, where the substitution patterns at one site depend on the amino acids found at other sites.

We propose a systematic integrated approach to the problem. This will involve:

- Performing computational simulations of protein evolution, where proteins evolve to maintain thermodynamic stability or ability to bind a constant or changing ligand. The purpose of these simulations is to allow us to investigate various phenomena that occur when selection acts on properties of the entire protein that cannot be reduced to the sum of contributions of single sites.

- Evaluating, validating, and characterising the hypotheses generated by the simulations through phylogenetic analyses of well-sampled proteins, including mitochondrial proteins.

- Connecting the results of these simulations with well-developed approaches in the physical sciences, such as the theory of chemical reaction rates.

- Using the insights drawn from the computational simulations and phylogenetic analysis of real proteins to construct more powerful and accurate models of the substitution process that are firmly rooted in the underlying biology.

- Using these substitution models to develop important applications, including the analysis of the degree, nature, and time dependence of the selection acting on proteins, and the identification of deleterious mutations.

- Applying these tools to study selection on G-protein coupled receptors and the detection of disease-related nsSNPs.

- Incorporating the new models and applications into the powerful publicly available phylogenetics analysis program PLEX.
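
As a rough illustration of the first item in the list above, the following is a minimal sketch of a simulation in which a protein evolves under selection for thermodynamic stability. It is not the project's simulation code. The random contact map, the random pairwise energies, the constant `offset`, the two-state folding model with energies in units of kT, and the haploid diffusion-approximation fixation probability are all assumptions chosen for brevity. Because the free energy is a sum over pairs of residues in contact, the effect of a mutation at one site depends on the amino acids at other sites, which is the kind of epistasis described above.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_contact_energies(rng):
    """Random symmetric pairwise energies; a stand-in, not a real potential."""
    energies = {}
    for i, a in enumerate(AMINO_ACIDS):
        for b in AMINO_ACIDS[i:]:
            energies[(a, b)] = energies[(b, a)] = rng.gauss(0.0, 1.0)
    return energies


def folding_free_energy(seq, contacts, energies, offset):
    """Toy free energy of folding: constant offset plus contact energies."""
    return offset + sum(energies[(seq[i], seq[j])] for i, j in contacts)


def fraction_folded(dg):
    """Two-state equilibrium fraction folded, with dg in units of kT."""
    if dg >= 0:
        e = math.exp(-dg)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(dg))


def fixation_probability(s, pop_size):
    """Haploid diffusion approximation (assumed): (1-e^-2s)/(1-e^-2Ns)."""
    if abs(s) < 1e-12:
        return 1.0 / pop_size  # neutral limit
    exponent = -2.0 * pop_size * s
    if exponent > 700.0:  # strongly deleterious; avoid overflow
        return 0.0
    return math.expm1(-2.0 * s) / math.expm1(exponent)


def simulate(length=30, n_contacts=40, pop_size=100, steps=2000, seed=1):
    """Evolve one sequence by successive proposed point mutations."""
    rng = random.Random(seed)
    energies = make_contact_energies(rng)
    pairs = [(i, j) for i in range(length) for j in range(i + 2, length)]
    contacts = rng.sample(pairs, n_contacts)  # hypothetical contact map
    offset = 5.0  # hypothetical cost of folding, in kT
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    dg = folding_free_energy(seq, contacts, energies, offset)
    trajectory = [dg]
    for _ in range(steps):
        site = rng.randrange(length)
        mutant = list(seq)
        mutant[site] = rng.choice(AMINO_ACIDS.replace(seq[site], ""))
        new_dg = folding_free_energy(mutant, contacts, energies, offset)
        # Fitness is taken to be the fraction of folded protein.
        s = fraction_folded(new_dg) / fraction_folded(dg) - 1.0
        if rng.random() < fixation_probability(s, pop_size):
            seq, dg = mutant, new_dg
        trajectory.append(dg)
    return "".join(seq), trajectory


if __name__ == "__main__":
    final_seq, dgs = simulate()
    print(final_seq, round(dgs[0], 2), round(dgs[-1], 2))
```

The returned trajectory records the free energy after each proposed mutation, so the approach to and fluctuation around a stability equilibrium can be inspected directly.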

By increasing the accuracy, interpretability, and scope of phylogenetic analysis, we will impact a wide range of investigations throughout the life sciences, including how proteins and other biomolecules function and interact, how organisms adapt to new environments, how pathogens change in order to defeat host immune systems and infect new hosts, how humans and other organisms adjust to new and changing pathogens, and how we can modify proteins to our own specifications.

Technical Summary

Modelling protein evolution has become an important tool throughout the life sciences. Such analyses require appropriate models of sequence change, and are limited by the inaccuracies and simplifying assumptions of those models. In particular, standard empirical models ignore epistasis, the interactions between substitutions at different sites, despite its importance: including epistasis in the substitution model can provide important information about protein structure and function, while neglecting it can result in erroneous conclusions.

It is difficult to make more accurate substitution models without unacceptable increases in the number of adjustable parameters. By creating mechanistic models that represent the underlying protein biophysics, molecular biology, and evolutionary biology, it is possible to embed our understanding of these areas into the structure of the model, reducing the number of adjustable parameters while preserving fidelity to the evolutionary process. Progress in this approach has been limited, however, by the complexity of protein evolution and our limited understanding of it, especially the extent, significance, and impact of these epistatic interactions.
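
To give a sense of scale for the parameter-count problem, the short sketch below counts the free parameters of a general time-reversible substitution model on a given number of states. Using this particular model family as the yardstick is an assumption made for illustration; the function name is hypothetical. A reversible model has one exchangeability per unordered pair of states (one of which is fixed by normalising the overall rate) plus equilibrium frequencies that sum to one.

```python
def reversible_model_parameters(n_states):
    """Free parameters of a general time-reversible model on n_states states."""
    exchangeabilities = n_states * (n_states - 1) // 2 - 1  # rate scale fixed
    frequencies = n_states - 1  # frequencies sum to one
    return exchangeabilities + frequencies


# Nucleotides, single amino acid sites, and pairs of amino acid sites
# treated jointly (20 * 20 = 400 states).
for n in (4, 20, 400):
    print(n, reversible_model_parameters(n))
```

Under these assumptions a single-site amino acid model has 208 free parameters, while treating just one pair of sites jointly would require 80198, which is why an unconstrained empirical treatment of epistasis quickly becomes impractical.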

We propose a systematic integrated approach to the problem, involving:

- Studying the process of protein evolution through computational simulations, phylogenetic analyses, and theoretical formulation based on the theory of chemical reaction rates.

- Using the new insights and understandings to construct more powerful and accurate mechanistic models of the substitution process, as well as developing important applications.

- Incorporating the new models and applications into the powerful publicly available phylogenetics program PLEX.

Increasing the accuracy, interpretability, and scope of phylogenetic analysis will have an impact across the wide range of disciplines throughout the biological and medical sciences where this type of analysis is used.

Planned Impact

The proposed research will create a deeper understanding of the process of molecular evolution, more accurate models of amino acid substitutions that better represent the constraints acting on protein sequences, and computational tools targeting specific important biological questions. This work will have a significant impact on phylogenetics and evolutionary biology as well as other areas throughout the biological and medical sciences, epidemiology, pharmacology, and biotechnology, resulting in contributions to health and the economy.

- Phylogenetics and evolutionary biology:
The resulting substitution models will not only be more accurate and more powerful for phylogenetic analysis, but will represent a different type of model, implementing more sophisticated representations of the evolutionary process.

Many of the hypotheses of evolutionary biology, even those that do not involve molecular phenomena, are proven or disproven by phylogenetic analysis done at the molecular level. Increased accuracy of such analyses will have an impact on these studies.

Finally, many of the conceptual aspects of evolutionary biology, such as arms races, group selection, policing, and altruism, have equivalents in molecular evolution. The study of these phenomena is often most tractable at the molecular level.

- Biological sciences:
With the growing number of sequences from different individuals and organisms come increased possibilities for comparing genes and gene products. Patterns of conservation and variation can yield important information about protein structure and function. Identifying and characterising various forms of purifying, neutral, and adaptive selection can provide insight into how organisms adapt to their environment and changes in their environment, and how the current observed species, organisms, and genes arose. The identification of evolutionary relationships between proteins can provide evidence for the role of horizontal transfer and gene or genome duplication. Insights about one protein can come from the identification of related proteins about which more is known.

- Medical sciences:
Pathogens and hosts are often involved in an 'arms race' as both evolve to counter changes in the other. Identifying such patterns is important for characterising host-pathogen interactions. Tumour cells can be sequenced and their origins determined through phylogenetic analysis, helping us understand the process of tumourigenesis.

- Epidemiology:
We can determine the relationship of fast-evolving viruses from different individuals, allowing us to model transmission pathways. Identification of transmission networks, natural reservoirs, and zoonotic disease origins can assist in the monitoring, prevention, and control of future pandemics.

- Biotechnology and pharmacology:
Better understanding of the types of selection acting on proteins can help us identify deleterious mutations, providing opportunities for targeted monitoring and prophylactic measures, as well as hypotheses about disease mechanisms leading to the identification of potential drug targets. Evolutionary analyses, such as those that characterise adaptive events, deleterious mutations, and pathogen-host interactions can also help in the identification of drug targets. Understanding the impact of mutations would assist in modifying proteins to our specifications.

Publications

 
Description We have generated increased understanding of the process of speciation, as well as understanding how protein properties are reflected in their evolutionary process.

We have developed new models for analysing longitudinal deep sequencing data that allow increased understanding of the intrahost population genetics and evolutionary dynamics of pathogens. These new models have been applied to a number of different pathogens including norovirus, cytomegalovirus, and SARS-CoV-2. We have developed and applied new techniques to the study of SARS-CoV-2 evolution.

We have investigated the role of selection for cooperative folding of proteins on epistasis, and find that epistasis tends to be stronger in proteins under such circumstances.
Exploitation Route We are still working on adapting our models to generate better tools for sequence analysis.

Our haplotype reconstruction software is of broad applicability.
Sectors Healthcare,Pharmaceuticals and Medical Biotechnology,Other

 
Description Our methods for analysing intrahost evolutionary dynamics through haplotype reconstruction contributed to an important analysis (Kemp et al. 2021 Nature) which has informed the debate regarding Covid public health policy.
First Year Of Impact 2021
Sector Healthcare
Impact Types Policy & public services

 
Title Development of novel method for reconstructing haplotypes from longitudinal deep sequencing samples. 
Description We have developed a new method that uses sequential deep sequencing samples to decompose a population into haplotypes. This has provided much deeper insight into the dynamics of intrahost pathogen population genetics and evolution. The approach was developed to be applicable to evolving entities where the mutation rate is sufficiently low that the distance between variant sites is longer than the length of the reads, requiring a different source of information for resolving these haplotypes. This is done by considering co-evolution of variant frequencies.
Type Of Material Improvements to research infrastructure 
Year Produced 2018 
Provided To Others? Yes  
Impact As described in collaborations, this approach has contributed to a number of high-impact publications: Goldstein R, Tamuri A, Roy S, Breuer J, (2018). Haplotype assignment of longitudinal viral deep-sequencing data using co-variation of variant frequencies. Pang J, Slyker JA, Roy S, Bryant J, Atkinson C, Cudini J..... Breuer J, (2020). Mixed cytomegalovirus genotypes in HIV-positive mothers show compartmentalization and distinct patterns of transmission to infants. eLife. Kemp S, Collier D, Datir R, Ferreira I, Gayed S..... Gupta R, (2021). SARS-CoV-2 evolution during treatment of chronic infection. Nature.
URL https://github.com/RichardAGoldstein/HaROLD
 
Title Reconstructing substitutions on phylogenetic trees 
Description This is a software tool that determines where substitutions have occurred in a phylogenetic tree. 
Type Of Material Improvements to research infrastructure 
Year Produced 2018 
Provided To Others? Yes  
Impact We have used this tool extensively to map substitutions onto phylogenetic trees. As this tool is new, we hope it will be used by others in the future. 
URL https://github.com/chrismonit/SubRecon
 
Title Development of novel method for reconstructing haplotypes from longitudinal deep sequencing samples. 
Description We have developed a new method that uses sequential deep sequencing samples to decompose a population into haplotypes. This has provided much deeper insight into the dynamics of intrahost pathogen population genetics and evolution. The approach was developed to be applicable to evolving entities where the mutation rate is sufficiently low that the distance between variant sites is longer than the length of the reads, requiring a different source of information for resolving these haplotypes. This is done by considering co-evolution of variant frequencies.
Type Of Material Data analysis technique 
Year Produced 2018 
Provided To Others? Yes  
Impact As described in collaborations, this approach has contributed to a number of high-impact publications: Goldstein R, Tamuri A, Roy S, Breuer J, (2018). Haplotype assignment of longitudinal viral deep-sequencing data using co-variation of variant frequencies. Pang J, Slyker JA, Roy S, Bryant J, Atkinson C, Cudini J..... Breuer J, (2020). Mixed cytomegalovirus genotypes in HIV-positive mothers show compartmentalization and distinct patterns of transmission to infants. eLife. Kemp S, Collier D, Datir R, Ferreira I, Gayed S..... Gupta R, (2021). SARS-CoV-2 evolution during treatment of chronic infection. Nature.
URL https://github.com/RichardAGoldstein/HaROLD
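
The idea of resolving haplotypes from co-varying variant frequencies can be sketched with a toy example. This is not the algorithm implemented in HaROLD. It is a deliberately simple greedy grouping, assumed here for illustration, in which variant sites whose frequency trajectories across time points are strongly correlated are taken to sit on the same haplotype. The function names, the correlation threshold, and the example frequencies are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length frequency trajectories."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat trajectory carries no co-variation signal
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def group_variants(freqs, threshold=0.95):
    """Greedily group sites whose trajectories correlate with a group's first site.

    freqs maps a variant site to its frequency at each sampled time point.
    """
    groups = []
    for site in sorted(freqs):
        for group in groups:
            if pearson(freqs[site], freqs[group[0]]) >= threshold:
                group.append(site)
                break
        else:
            groups.append([site])
    return groups


# Made-up frequencies at four time points for five variant sites that are
# too far apart to be linked by individual reads.
example = {
    101: [0.10, 0.30, 0.60, 0.80],
    2050: [0.12, 0.29, 0.61, 0.79],
    5400: [0.50, 0.40, 0.15, 0.05],
    7000: [0.20, 0.50, 0.20, 0.50],
    9001: [0.48, 0.41, 0.16, 0.06],
}
print(group_variants(example))  # [[101, 2050], [5400, 9001], [7000]]
```

A correlation threshold is a crude stand-in for a proper statistical model of read counts, but it shows why longitudinal sampling supplies linkage information that short reads alone cannot.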
 
Description Haplotype reconstruction applied to intrahost pathogen evolution 
Organisation University College London
Department Institute of Child Health
Country United Kingdom 
Sector Academic/University 
PI Contribution We have developed novel methods for reconstructing haplotypes in intra-host pathogen evolution based on longitudinal deep-sequencing data. These methods make it possible to characterise the process of intra-host pathogen population dynamics and evolution. This method has been described in Goldstein R, Tamuri A, Roy S, Breuer J, (2018). Haplotype assignment of longitudinal viral deep-sequencing data using co-variation of variant frequencies.
Collaborator Contribution We have worked together to better understand intra-host evolution of SARS-CoV-2, cytomegalovirus, and norovirus.
Impact This has contributed to a number of publications, including: Pang J, Slyker JA, Roy S, Bryant J, Atkinson C, Cudini J..... Breuer J, (2020). Mixed cytomegalovirus genotypes in HIV-positive mothers show compartmentalization and distinct patterns of transmission to infants. eLife. Kemp S, Collier D, Datir R, Ferreira I, Gayed S..... Gupta R, (2021). SARS-CoV-2 evolution during treatment of chronic infection. Nature. Cudini J, Roy S, Houldcroft CJ, Bryant JM, Depledge DP, Tutill H..... Breuer J, (2019). Human cytomegalovirus haplotype reconstruction reveals high diversity due to superinfection and evidence of within-host recombination. Proceedings of the National Academy of Sciences of the United States of America, 116 (12), pp. 5693-5698. The work in collaboration with the Oxford team has resulted in a publication (Kemp et al. 2021, Nature) that has significantly contributed to the current discussions about the rise of new SARS-CoV-2 variants.
Start Year 2018
 
Description Haplotype reconstruction applied to intrahost pathogen evolution 
Organisation University of Cambridge
Department Department of Medicine
Country United Kingdom 
Sector Academic/University 
PI Contribution We have developed novel methods for reconstructing haplotypes in intra-host pathogen evolution based on longitudinal deep-sequencing data. These methods make it possible to characterise the process of intra-host pathogen population dynamics and evolution. This method has been described in Goldstein R, Tamuri A, Roy S, Breuer J, (2018). Haplotype assignment of longitudinal viral deep-sequencing data using co-variation of variant frequencies.
Collaborator Contribution We have worked together to better understand intra-host evolution of SARS-CoV-2, cytomegalovirus, and norovirus.
Impact This has contributed to a number of publications, including: Pang J, Slyker JA, Roy S, Bryant J, Atkinson C, Cudini J..... Breuer J, (2020). Mixed cytomegalovirus genotypes in HIV-positive mothers show compartmentalization and distinct patterns of transmission to infants. eLife. Kemp S, Collier D, Datir R, Ferreira I, Gayed S..... Gupta R, (2021). SARS-CoV-2 evolution during treatment of chronic infection. Nature. Cudini J, Roy S, Houldcroft CJ, Bryant JM, Depledge DP, Tutill H..... Breuer J, (2019). Human cytomegalovirus haplotype reconstruction reveals high diversity due to superinfection and evidence of within-host recombination. Proceedings of the National Academy of Sciences of the United States of America, 116 (12), pp. 5693-5698. The work in collaboration with the Oxford team has resulted in a publication (Kemp et al. 2021, Nature) that has significantly contributed to the current discussions about the rise of new SARS-CoV-2 variants.
Start Year 2018
 
Description Protein evolution 
Organisation University of Colorado
Country United States 
Sector Academic/University 
PI Contribution Modelling of the evolution of a simple model of proteins
Collaborator Contribution Active joint collaboration
Impact
- Yanlong O. Xu, Randall W. Hall, Richard A. Goldstein, and David D. Pollock (2005), Divergence, recombination, and retention of functionality during protein evolution, Human Genomics, 2:158-167.
- Paul D. Williams, David D. Pollock and Richard A. Goldstein (2006), Functionality and the evolution of marginal stability in proteins: Inferences from lattice simulations, Evol. Bioinform. Online, 2:1-11.
- Paul D. Williams, David D. Pollock and Richard A. Goldstein (2006), Selective advantage of recombination in evolving protein populations: A lattice model study, Int. J. Mod. Phys. C, 17:75-90.
- Paul D. Williams, David D. Pollock, Benjamin P. Blackburne, and Richard A. Goldstein (2006), Assessing the accuracy of ancestral protein reconstruction methods, PLoS Computational Biology, 2:e69, PMID: 16789817.
- Richard A. Goldstein and David D. Pollock (2006), Observations of amino acid gain and loss during protein evolution are explained by statistical bias, Mol. Biol. Evol., 23: 1444, PMID: 16698770.
- Richard A. Goldstein (2007), Amino-acid interactions in psychrophiles, mesophiles, thermophiles, and hyperthermophiles: Insights from the quasi-chemical approximation. Protein Sci. 16, 1887-1895, PMID: 17766385.
- Richard A. Goldstein (2008), The structure of protein evolution and the evolution of protein structure, Curr. Opinion Struct. Biol., 18, 170-177.
- Richard A. Goldstein (2011), The evolution and evolutionary consequences of marginal thermostability in proteins, Proteins, 79:1396-1407.
- Richard A. Goldstein and David D. Pollock (2012), Modeling protein evolution, in Computational Modeling of Biological Systems (Nikolay Dokholyan, ed.), Springer, pp. 426-431.
- Ivan Coluzza, James T. MacDonald, Michael I. Sadowski, William R. Taylor, and Richard A Goldstein (2012), Analytic Markovian rates for generalized protein structure evolution, PLoS One, 7:e34228.
- David A. Liberles et al. (2012), The Interface of Protein Structure, Protein Biophysics, and Molecular Evolution, Protein Science, 21:769-785.
- David D. Pollock, Grant Thiltgen, and Richard A. Goldstein (2012), Relaxation of amino acid propensities: An evolutionary Stokes shift, Proceedings of the National Academy of Sciences U.S.A., 109:E1352-1359, PMID: 22547823.
- Grant Thiltgen and Richard A. Goldstein (2012), Assessing predictors of changes in protein stability upon mutation without using experimental data, PLoS One, 7:e46084.
- Richard A. Goldstein (2013), Population size dependence of fitness effect distribution and substitution rate probed by biophysical model of protein thermostability. Genome Biol Evol., 5:1584-1593, PMID: 23884461.
- David D. Pollock and Richard A. Goldstein (2014), Strong evidence for protein epistasis, weak evidence against it. Proceedings of the National Academy of Sciences U.S.A., 111:E1450.
- Richard A. Goldstein, Stephen T. Pollard, Seena D. Shah, David D. Pollock (2015), Non-adaptive amino acid convergence rates decrease over time. Mol Biol Evol, 32:1373-81.
- Bhavin S. Khatri and Richard A. Goldstein (2015), A coarse-grained biophysical model of sequence evolution and the population size dependence of the speciation rate, J Theor Biol, 378:56-64.
- Bhavin S. Khatri and Richard A. Goldstein (2015), Simple Biophysical Model Predicts Faster Accumulation of Hybrid Incompatibilities in Small Populations Under Stabilizing Selection. Genetics. 201:1525-1537.
- Richard A. Goldstein, David D. Pollock (2016) The tangled bank of amino acids. Protein Science 25:1354-1362.
- Grant Thiltgen, Mario dos Reis, Richard A. Goldstein (2017) Finding Direction in the Search for Selection. Journal of Molecular Evolution, doi:10.1007/s00239-016-9765-5.
- Richard A. Goldstein and David D. Pollock (2017), Sequence entropy of folding and the absolute rate of amino acid substitutions, Nature Ecology & Evolution 1:1923-1930.
- David D. Pollock, Stephen T. Pollard, Jonathan A. Shortt, Richard A. Goldstein (2017) Mechanistic Models of Protein Evolution, in Evolutionary Biology: Self/Nonself Evolution, Species and Complex Traits Evolution, Methods and Concepts, P. Pontarotti (ed.), Springer, Cham, Switzerland, pages 277-296.