Mechanistic models of protein sequence evolution

Lead Research Organisation: University College London
Department Name: Infection

Abstract

All life on earth is the result of molecular evolution, so it is not surprising that evolutionary viewpoints have yielded so much understanding of living systems. By modelling how proteins have evolved and co-evolved, by re-creating ancestral forms in the laboratory, by analysing patterns of conservation and change in individual proteins, and by identifying related proteins about which more is known, phylogenetic analysis has become a powerful tool throughout the life sciences.

Analyses of protein evolution rely on appropriate models of sequence change. Currently, such analyses are dominated by empirical models created by comparing closely related proteins. In order to make model creation tractable and their use computationally feasible, a number of seemingly unjustifiable assumptions are made, such as that all sites in the protein evolve independently and in a similar manner. This simplicity has been the source of much of their power, success, and widespread use. Unfortunately, these simplifications have three major negative consequences. Firstly, the parameters in these models do not necessarily correspond to biologically relevant quantities, making interpretations of the results difficult. Secondly, some of the analyses are very sensitive to the assumptions made, with unrealistic assumptions yielding erroneous or misleading results. Thirdly, we are often interested in situations where the assumptions break down, providing important information about the proteins' properties and roles and how these change. These limitations have led to an interest in mechanistic models of the substitution process based on the underlying protein biophysics, molecular biology, and evolutionary dynamics. Progress in this area has been limited, however, by the complexity of protein evolution and our limited understanding of it, especially the extent, significance, and impact of epistatic interactions, where the substitution patterns at one site depend on the amino acids found at other sites.
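
To make the site-independence assumption concrete, it can be written as a factorisation of the alignment likelihood. The notation is an illustrative assumption and does not come from the proposal: $D_i$ is column $i$ of an alignment of $N$ sites, $T$ is the phylogenetic tree, and $Q$ is a single substitution rate matrix shared by all sites.

```latex
P(D \mid T, Q) \;=\; \prod_{i=1}^{N} P(D_i \mid T, Q)
```

When epistasis is present, the substitution process at site $i$ depends on the residues occupying other sites, so the likelihood no longer factorises in this way.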

We propose a systematic integrated approach to the problem. This will involve:

- Performing computational simulations of protein evolution, where proteins evolve to maintain thermodynamic stability or ability to bind a constant or changing ligand. The purpose of these simulations is to allow us to investigate various phenomena that occur when selection acts on properties of the entire protein that cannot be reduced to the sum of contributions of single sites.

- Evaluating, validating, and characterising the hypotheses generated by the simulations through phylogenetic analyses of well-sampled proteins, including mitochondrial proteins.

- Connecting the results of these simulations with well-developed approaches in the physical sciences, such as the theory of chemical reaction rates.

- Using the insights drawn from the computational simulations and phylogenetic analysis of real proteins to construct more powerful and accurate models of the substitution process that are firmly rooted in the underlying biology.

- Using these substitution models to develop important applications, including the analysis of the degree, nature, and time dependence of the selection acting on proteins, and the identification of deleterious mutations.

- Applying these tools to study selection on G-protein coupled receptors and the detection of disease-related nsSNPs.

- Incorporating the new models and applications into the powerful publicly available phylogenetics analysis program PLEX.
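
The kind of simulation described in the first step above can be sketched in a few lines. This is a toy sketch under stated assumptions, not the project's simulation code: the random contact map, the Gaussian contact energies, the fraction-folded fitness function, and all function names are hypothetical choices made for illustration. Fixation uses a Kimura-style diffusion approximation for a haploid population.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_model(length, n_contacts, rng):
    """Hypothetical toy protein: a random contact map and symmetric pair energies."""
    pairs = [(i, j) for i in range(length) for j in range(i + 2, length)]
    contacts = rng.sample(pairs, n_contacts)
    energy = {}
    for a in AMINO_ACIDS:
        for b in AMINO_ACIDS:
            # Reuse the (b, a) value so that energy[(a, b)] == energy[(b, a)].
            energy[(a, b)] = energy[(b, a)] if (b, a) in energy else rng.gauss(0.0, 1.0)
    return contacts, energy


def delta_g(seq, contacts, energy, offset=0.0):
    """Folding free energy in units of kT (assumed form): a sum over contacting pairs."""
    return offset + sum(energy[(seq[i], seq[j])] for i, j in contacts)


def fitness(dg):
    """Fitness taken as the fraction of folded protein; negative dg means stable."""
    return 1.0 / (1.0 + math.exp(dg))


def fixation_probability(f_old, f_new, pop_size):
    """Diffusion approximation for a single new mutant in a haploid population."""
    s = f_new / f_old - 1.0
    if abs(s) < 1e-12:
        return 1.0 / pop_size  # neutral limit
    exponent = -2.0 * pop_size * s
    if exponent > 700.0:
        return 0.0  # strongly deleterious; also avoids overflow in exp
    return (1.0 - math.exp(-2.0 * s)) / (1.0 - math.exp(exponent))


def evolve(seq, contacts, energy, pop_size, n_steps, rng, offset=0.0):
    """Propose single-site mutations and fix them one at a time."""
    seq = list(seq)
    f_cur = fitness(delta_g(seq, contacts, energy, offset))
    n_fixed = 0
    for _ in range(n_steps):
        site = rng.randrange(len(seq))
        old, new = seq[site], rng.choice(AMINO_ACIDS)
        if new == old:
            continue
        seq[site] = new
        f_new = fitness(delta_g(seq, contacts, energy, offset))
        if rng.random() < fixation_probability(f_cur, f_new, pop_size):
            f_cur = f_new
            n_fixed += 1
        else:
            seq[site] = old  # mutation lost; revert
    return "".join(seq), f_cur, n_fixed


if __name__ == "__main__":
    rng = random.Random(1)
    contacts, energy = make_model(length=30, n_contacts=40, rng=rng)
    start = "".join(rng.choice(AMINO_ACIDS) for _ in range(30))
    final, f, n_fixed = evolve(start, contacts, energy, pop_size=100, n_steps=5000, rng=rng)
    print(final, round(f, 3), n_fixed)
```

Because the stability in this sketch is a sum over contacting pairs, the effect of a substitution at one site depends on the residues at its partner sites, which is the kind of whole-protein selection that cannot be reduced to a sum of single-site contributions.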

By increasing the accuracy, interpretability, and scope of phylogenetic analysis, we will impact a wide range of investigations throughout the life sciences, including how proteins and other biomolecules function and interact, how organisms adapt to new environments, how pathogens change in order to defeat host immune systems and infect new hosts, how humans and other organisms adjust to new and changing pathogens, and how we can modify proteins to our own specifications.

Technical Summary

Modelling protein evolution has become an important tool throughout the life sciences. Such analyses require appropriate models of sequence change, and are limited by their inaccuracies and simplifying assumptions. In particular, standard empirical models ignore epistasis, the interactions between substitutions at different sites, despite its importance: including epistasis in the substitution model can provide important information about protein structure and function, while neglecting it can result in erroneous conclusions.

It is difficult to make more accurate substitution models without unacceptable increases in the number of adjustable parameters. By creating mechanistic models that represent the underlying protein biophysics, molecular biology, and evolutionary biology, it is possible to embed our understanding of these areas into the structure of the model, reducing the number of adjustable parameters while preserving fidelity to the evolutionary process. Progress in this approach has been limited, however, by the complexity of protein evolution and our limited understanding of it, especially the extent, significance, and impact of these epistatic interactions.

We propose a systematic integrated approach to the problem, involving:

- Studying the process of protein evolution through computational simulations, phylogenetic analyses, and theoretical formulation based on the theory of chemical reaction rates.

- Using the new insights and understandings to construct more powerful and accurate mechanistic models of the substitution process, as well as developing important applications.

- Incorporating the new models and applications into the powerful publicly available phylogenetics program PLEX.

Increasing the accuracy, interpretability, and scope of phylogenetic analysis will have an impact across the wide range of disciplines throughout the biological and medical sciences where this type of analysis is used.

Planned Impact

The proposed research will create a deeper understanding of the process of molecular evolution, more accurate models of amino acid substitutions that better represent the constraints acting on protein sequences, and computational tools targeting specific important biological questions. This work will have a significant impact on phylogenetics and evolutionary biology as well as other areas throughout the biological and medical sciences, epidemiology, pharmacology, and biotechnology, resulting in contributions to health and the economy.

- Phylogenetics and evolutionary biology:
The resulting substitution models will not only be more accurate and more powerful for phylogenetic analysis, but will represent a different type of model, implementing more sophisticated representations of the evolutionary process.

Many of the hypotheses of evolutionary biology, even those that do not involve molecular phenomena, are proven or disproven by phylogenetic analysis done at the molecular level. Increased accuracy of such analyses will have an impact on these studies.

Finally, many of the conceptual aspects of evolutionary biology, such as arms races, group selection, policing, and altruism, have equivalents in molecular evolution. Studies of these phenomena are often most tractable at the molecular level.

- Biological sciences:
With the growing number of sequences from different individuals and organisms come increased possibilities for comparing genes and gene products. Patterns of conservation and variation can yield important information about protein structure and function. Identifying and characterising various forms of purifying, neutral, and adaptive selection can provide insight into how organisms adapt to their environment and changes in their environment, and how the current observed species, organisms, and genes arose. The identification of evolutionary relationships between proteins can provide evidence for the role of horizontal transfer and gene or genome duplication. Insights about one protein can come from the identification of related proteins about which more is known.

- Medical sciences:
Pathogens and hosts are often involved in an 'arms race' as both evolve to counter changes in the other. Identifying such patterns is important for characterising host-pathogen interactions. Tumour cells can be sequenced and their origins determined through phylogenetic analysis, helping us understand the process of tumourigenesis.

- Epidemiology:
We can determine the relationship of fast-evolving viruses from different individuals, allowing us to model transmission pathways. Identification of transmission networks, natural reservoirs, and zoonotic disease origins can assist in the monitoring, prevention, and control of future pandemics.

- Biotechnology and pharmacology:
Better understanding of the types of selection acting on proteins can help us identify deleterious mutations, providing opportunities for targeted monitoring and prophylactic measures, as well as hypotheses about disease mechanisms leading to the identification of potential drug targets. Evolutionary analyses, such as those that characterise adaptive events, deleterious mutations, and pathogen-host interactions can also help in the identification of drug targets. Understanding the impact of mutations would assist in modifying proteins to our specifications.

Publications

Monit C (2019) Positive selection in dNTPase SAMHD1 throughout mammalian evolution in Proceedings of the National Academy of Sciences

Thiltgen G (2017) Finding Direction in the Search for Selection in Journal of Molecular Evolution

 
Description Protein evolution 
Organisation University of Colorado
Country United States 
Sector Academic/University 
PI Contribution Modelling of evolution of a simple model of proteins
Collaborator Contribution Active joint collaboration
Impact
- Yanlong O. Xu, Randall W. Hall, Richard A. Goldstein, and David D. Pollock (2005), Divergence, recombination, and retention of functionality during protein evolution, Human Genomics, 2:158-167.
- Paul D. Williams, David D. Pollock and Richard A. Goldstein (2006), Functionality and the evolution of marginal stability in proteins: Inferences from lattice simulations, Evol. Bioinform. Online, 2:1-11.
- Paul D. Williams, David D. Pollock and Richard A. Goldstein (2006), Selective advantage of recombination in evolving protein populations: A lattice model study, Int. J. Mod. Phys. C, 17:75-90.
- Paul D. Williams, David D. Pollock, Benjamin P. Blackburne, and Richard A. Goldstein (2006), Assessing the accuracy of ancestral protein reconstruction methods, PLoS Computational Biology, 2:e69, PMID: 16789817.
- Richard A. Goldstein and David D. Pollock (2006), Observations of amino acid gain and loss during protein evolution are explained by statistical bias, Mol. Biol. Evol., 23: 1444, PMID: 16698770.
- Richard A. Goldstein (2007), Amino-acid interactions in psychrophiles, mesophiles, thermophiles, and hyperthermophiles: Insights from the quasi-chemical approximation. Protein Sci. 16, 1887-1895, PMID: 17766385.
- Richard A. Goldstein (2008), The structure of protein evolution and the evolution of protein structure, Curr. Opinion Struct. Biol., 18, 170-177.
- Richard A. Goldstein (2011), The evolution and evolutionary consequences of marginal thermostability in proteins, Proteins, 79:1396-1407.
- Richard A. Goldstein and David D. Pollock (2012), Modeling protein evolution, in Computational Modeling of Biological Systems (Nikolay Dokholyan, ed.), Springer, pp. 426-431.
- Ivan Coluzza, James T. MacDonald, Michael I. Sadowski, William R. Taylor, and Richard A Goldstein (2012), Analytic Markovian rates for generalized protein structure evolution, PLoS One, 7:e34228.
- David A. Liberles et al. (2012), The Interface of Protein Structure, Protein Biophysics, and Molecular Evolution, Protein Science, 21:769-785.
- David D. Pollock, Grant Thiltgen, and Richard A. Goldstein (2012), Relaxation of amino acid propensities: An evolutionary Stokes shift, Proceedings of the National Academy of Sciences U.S.A., 109:E1352-1359, PMID: 22547823.
- Grant Thiltgen and Richard A. Goldstein (2012), Assessing predictors of changes in protein stability upon mutation without using experimental data, PLoS One, 7:e46084.
- Richard A. Goldstein (2013), Population size dependence of fitness effect distribution and substitution rate probed by biophysical model of protein thermostability. Genome Biol Evol., 5:1584-1593, PMID: 23884461.
- David D. Pollock and Richard A. Goldstein (2014), Strong evidence for protein epistasis, weak evidence against it. Proceedings of the National Academy of Sciences U.S.A., 111:E1450.
- Richard A. Goldstein, Stephen T. Pollard, Seena D. Shah, David D. Pollock (2015), Non-adaptive amino acid convergence rates decrease over time. Mol Biol Evol, 32:1373-81.
- Bhavin S. Khatri and Richard A. Goldstein (2015), A coarse-grained biophysical model of sequence evolution and the population size dependence of the speciation rate, J Theor Biol, 378:56-64.
- Bhavin S. Khatri and Richard A. Goldstein (2015), Simple Biophysical Model Predicts Faster Accumulation of Hybrid Incompatibilities in Small Populations Under Stabilizing Selection. Genetics. 201:1525-1537.
- Richard A. Goldstein, David D. Pollock (2016) The tangled bank of amino acids. Protein Science 25:1354-1362.
- Grant Thiltgen, Mario dos Reis, Richard A. Goldstein (2017) Finding Direction in the Search for Selection. Journal of Molecular Evolution, doi:10.1007/s00239-016-9765-5.
- Richard A. Goldstein and David D. Pollock (2017), Sequence entropy of folding and the absolute rate of amino acid substitutions, Nature Ecology & Evolution 1:1923-1930.
- David D. Pollock, Stephen T. Pollard, Jonathan A. Shortt, Richard A. Goldstein (2017) Mechanistic Models of Protein Evolution, in Evolutionary Biology: Self/Nonself Evolution, Species and Complex Traits Evolution, Methods and Concepts, P. Pontarotti (ed.), Springer, Cham, Switzerland, pages 277-296.