The effects of natural selection on genome-wide patterns of genetic variation

Lead Research Organisation: University of Sheffield
Department Name: Animal and Plant Sciences

Abstract

It is well known that mutations create differences between individuals, and therefore provide the raw material for natural selection and evolution. Depending on their effects on the host's well-being, mutations can be divided into three categories: (1) deleterious mutations, which are harmful to the fitness of their host; (2) advantageous mutations, which increase survival or fertility; (3) neutral mutations, which have little or no effect. A question that has been central to evolutionary genetics is the role of natural selection on these three types of mutations in shaping patterns of genetic variation within populations. In fact, this is one of the questions that have motivated major ongoing DNA sequencing efforts in humans (e.g., the 1000 Genomes Project) and a number of other organisms, such as the fruit fly Drosophila (e.g., the Drosophila Population Genomics Project) and the weedy plant Arabidopsis (e.g., the 1001 Genomes Project).

The answer to the above question is fundamentally important for biologists who intend to use these large-scale datasets to decipher the genetic basis of phenotypic variation (e.g., disease susceptibility), to infer evolutionary history, and to identify mutations underlying key functional innovations that have helped organisms adapt to their environment. It underlies our understanding of the nature of genetic variation, which is critical for developing reliable methods for drawing accurate inferences from the data. Unfortunately, despite the practical and theoretical significance of this question, we still know rather little about the roles that negative selection against deleterious mutations, referred to as background selection or BGS, and positive selection on advantageous mutations, referred to as selective sweeps or SSW, play in controlling the genetic make-up of a population.

A major stumbling block is the lack of suitable theoretical tools for predicting the effects of BGS on sequence variability. This has hampered progress towards a better understanding of the nature of genetic variation: multiple lines of evidence suggest that most mutations (especially those in functional parts of the genome) are deleterious, yet the consequences of such mutations for genetic variability in nearby genomic regions remain poorly understood.

The first objective of this project is, therefore, to construct a set of BGS models that are not only biologically realistic and computationally efficient, but are also suitable for analysing large-scale datasets. This is made feasible by a BGS model I have recently published. I will improve on this by developing a set of extended models that incorporate several essential biological features that are missing in the original model. I will also develop theoretical tools for analysing sequence variability that incorporate both the BGS and SSW processes. I will apply these new models to whole-genome sequence datasets, such as that for the house mouse Mus musculus castaneus, which is being generated by my collaborators at the University of Edinburgh. The goal is to understand the relative importance of BGS and SSW in controlling patterns of variation within populations, and to refine the new methods so that they become a useful set of tools for analysing the large-scale sequencing datasets that are currently being generated by researchers working on a variety of organisms.

Technical Summary

Understanding how natural selection on mutations with different fitness effects shapes patterns of genetic variation within populations is central to evolutionary genetics. However, although multiple lines of evidence have suggested that most mutations (especially those in functional parts of the genome) are deleterious, the study of the effects of the continual removal of deleterious mutations by selection on variability at linked sites, known as background selection (BGS), has been hampered by the difficulty in modelling selection and genetic recombination simultaneously. Here I propose to construct a set of BGS models that take into account key biological processes such as changes in population size, the probability distribution of fitness effects of new mutations, and recombination hotspots. These models will be based on a structured coalescent framework that I recently developed, which can accurately and efficiently capture the effects of BGS and recombination on local gene genealogies. I will use these new models to design methods for jointly estimating the distribution of fitness effects and changes in population size. I also intend to study the joint effects of BGS and selective sweeps on patterns of variation, and to design better methods for distinguishing between them. Finally, I will use these new methods to examine the relative contributions of BGS and selective sweeps to patterns of variation observed in at least one whole-genome polymorphism dataset, such as that for the house mouse Mus musculus castaneus, which is being generated by my collaborators at the University of Edinburgh. Overall, these models will enhance our understanding of the nature of genetic variation, and will be important for analysing the data currently being collected by next-generation sequencing technologies in many different species.
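
To give a concrete sense of what a BGS prediction looks like, the sketch below evaluates the classical approximation for the reduction in neutral diversity at a focal site, B = pi/pi0 = exp(-sum_i u_i t / (t + r_i (1 - t))^2), usually attributed to Hudson and Kaplan (1995) and Nordborg et al. (1996). This is not the structured coalescent framework proposed here; it is only the simpler textbook result that such models extend. The function name and all parameter values are illustrative assumptions, not values taken from this project.

```python
import math


def bgs_diversity_reduction(u, t, recomb_rates):
    """Classical background selection approximation (not the project's model).

    u            -- deleterious mutation rate per selected site per generation
    t            -- selection coefficient against heterozygous carriers (t = hs)
    recomb_rates -- recombination frequency between the focal neutral site
                    and each selected site
    Returns B, the expected neutral diversity relative to its value
    in the absence of background selection.
    """
    total = 0.0
    for r in recomb_rates:
        total += u * t / (t + r * (1.0 - t)) ** 2
    return math.exp(-total)


# Hypothetical parameter values, chosen only for illustration.
u = 1e-8          # deleterious mutation rate per site
t = 0.01          # heterozygous selection coefficient
rho = 1e-8        # recombination rate per base pair
n_sites = 10000   # selected sites lying to one side of the focal site

# Selected sites at distances 1..n_sites base pairs from the focal site.
recomb = [rho * d for d in range(1, n_sites + 1)]

B_linked = bgs_diversity_reduction(u, t, recomb)
B_no_rec = bgs_diversity_reduction(u, t, [0.0] * n_sites)

print(f"B with recombination:    {B_linked:.6f}")
print(f"B without recombination: {B_no_rec:.6f}")
```

Without recombination the expression reduces to exp(-U/t), where U is the total deleterious mutation rate across the selected sites, so recombination can only weaken the predicted reduction in diversity. Features such as changes in population size or a distribution of fitness effects are not captured by this simple formula, which is part of the motivation for the models proposed above.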

Planned Impact

The proposed research intends to address how fundamental evolutionary processes such as natural selection, mutation, genetic recombination, and changes in population size interact with each other and shape patterns of genetic variation; it also attempts to construct methods for estimating key parameters from genome-level datasets. This kind of research has become increasingly important for modern biological research, as it provides the theoretical underpinning for making sense of the massive amount of DNA sequence data now being rapidly generated in species ranging from humans and many domesticated species to model organisms such as Drosophila and Arabidopsis. Thus, the proposed project will have significant impact in the following areas:

1) Theoretical underpinnings: it will provide valuable theoretical tools for industrial and academic researchers who are working on more applied subjects of medical and economic importance, such as association mapping of disease-causing mutations or detection of "domestication genes" in domesticated species that underlie important traits (e.g., seed or fruit size), which have been selected for by breeders.

2) Human capital: the proposed project takes a multidisciplinary approach, and will train workers in genomics, statistics, mathematics, and computer science. This kind of expertise is currently in short supply. Thus, such training will make contributions of importance for both the academic and commercial sectors.

3) Education: the project will promote the general public's awareness of the significance of evolutionary research in policy-making and well-being, and will inform youngsters (e.g., school children and undergraduate students) who are still on their way to building up their skill-sets about the huge advantage of possessing multidisciplinary and quantitative abilities for the success of their careers.

To engage with these potential beneficiaries, I will undertake the following activities:

1) Setting up a website about population genetics to spark interest and to facilitate applications. This website will have two sections. The first section is for the general public. It will use accessible language and interactive features to introduce key concepts in population genetics, and explain their relevance to biological/medical research, policy-making and well-being using concrete, inspiring examples from the literature. The second section is for non-population geneticists who want to use the theory to analyse DNA sequence data. It will contain external links to useful computer programs, tutorials showing how to calculate useful statistics from data, and guidelines on how to design experiments and how to interpret results. The goal is to disseminate knowledge of methods that are essential tools for analysing high-throughput sequencing data to a wider community.

2) Engaging with the general public. The website will be publicised to the general public via on-line advertisement (e.g., New Scientist). I will also get involved in outreach activities organised by the University of Sheffield for Sheffield schools and prospective undergraduate students (University Open Days) by, e.g., putting up posters with materials selected from the first section of the website. In addition, I intend to give introductory lectures to undergraduate students studying biology or non-biology degrees (e.g., statistics, computer science), in order to increase their awareness of the importance of multidisciplinary and quantitative abilities for the success of their careers.
 
Description This project focuses on understanding how natural selection shapes patterns of genetic variation across the genome, with a special emphasis on purifying selection against deleterious mutations because this type of mutation is known to be prevalent and ubiquitous in all organisms. We have so far produced five manuscripts. These are discussed in detail below.

We have succeeded in constructing more realistic models of background selection (BGS). First, in Zeng (2013), we developed a structured coalescent model of BGS that accommodates biologically-important features including changes in population size, variation in selection coefficients against deleterious mutations across sites, and recombination. In addition, the model has the ability to generate sequence variability at both selected and neutral sites. More recently, we have further extended the framework, so that it can also predict the joint effects of BGS and population structure on patterns of polymorphism within and between subpopulations (Zeng and Corcoran 2015). In addition, we showed that, in the interference regime of many tightly linked selected sites subject to recurrent deleterious mutations, neutral diversity patterns obtained from a subdivided population may be virtually indistinguishable for models that have identical variance in fitness, but are nonetheless different with respect to the number of selected sites and the strength of purifying selection. This equivalence in neutral diversity patterns suggests that data collected from subdivided populations may have limited power for differentiating among the selective pressures to which closely linked selected sites are subject.

The models of BGS developed above are the most realistic constructed to date. This type of model is essential for making sense of polymorphism data collected from natural populations. We have implemented these models in user-friendly packages, named msbgs and Forwards, and have made them publicly available on the PI's website (http://zeng-lab.group.shef.ac.uk).

In addition to model developments, we have also made significant progress in the data analysis part of the project. Taking advantage of the data available locally from the PI's collaborator Prof Jon Slate in Sheffield, in Gossmann et al. (2014), we examined how recombination and natural selection interact to shape patterns of divergence between several avian species (great tit, zebra finch, and chicken). This is important because previous empirical examinations in other organisms have produced inconsistent results, with evidence of recombination enhancing the efficacy of selection found in some species, but not in others. We approached the matter by exploiting the fact that the recombination rate is much more variable in avian genomes than in other species, including humans and Drosophila. Our key finding is that natural selection (both purifying and positive) is indeed more effective in high-recombination regions than in regions with much reduced recombination. To further our study of the great tit, in collaboration with Prof Jon Slate and a Dutch team led by Prof Marcel Visser and Prof Martien Groenen, we took part in the Great Tit Genome Project. Our role was to analyse data from 29 whole-genome resequenced great tit individuals. Several important findings were made: (1) The great tit population across Europe is largely panmictic, and has experienced recent population size expansion; (2) The great tit genome has been subject to recent episodes of selective sweeps, especially at genes related to neuronal functions, learning and cognition, consistent with the fact that great tits can learn socially and solve complex learning tasks. This manuscript, with the PDRA Toni Gossmann as a co-first author, was published in Nature Communications (Laine et al. 2016).

In addition to great tits, we have also published a genome-wide analysis of polymorphism patterns in Drosophila melanogaster using the data provided by the Drosophila Population Genomics Project (Jackson et al. 2015). In particular, we obtained convincing evidence that purifying selection acting on functionally important sites is a key factor shaping diversity patterns both within and between D. melanogaster populations. These results are of great value to researchers working on Drosophila population genetics, as they reveal the importance of taking into account the effects of purifying selection in attempts to, e.g., detect genes under positive selection. Additionally, the results make a strong case for the modelling efforts described in the proposal, and suggest that the new methods being developed in our group can be applied to these data.
Exploitation Route The project represents basic research, so the most direct beneficiaries will be academics working on population/evolutionary genetics. Because a large part of the project is about developing generic models that are applicable to many different organisms, it is entirely possible that the research will generate long-lasting impact in the field. Additionally, understanding how the interaction between selection, demography and recombination affects genome evolution is essential for medical genetics, animal/plant breeding, and conservation genetics. Therefore, our work may also facilitate progress in these more applied areas of research. Finally, the high-quality genomic data obtained from the Great Tit Genome Project will play an instrumental role in furthering the integration of ecological, evolutionary, behavioural and genomic approaches in this model species.
Sectors Agriculture, Food and Drink; Education; Environment

 
Title Forwards: a program for generating samples under selection and demographic models using forwards-in-time simulations 
Description Forwards is a user-friendly computer package written in Java. Its main purpose is to help researchers generate simulated samples under complex demographic and selection models using forwards-in-time simulations. This kind of simulation algorithm is very useful for modelling, as can be seen from the PI's papers published with the support of this grant. In fact, Forwards originated from the modelling work the PI carried out during the grant. 
Type Of Material Computer model/algorithm 
Year Produced 2015 
Provided To Others? Yes  
Impact Modelling the joint effects of demography and selection on the process of evolution is very difficult, and forwards-in-time simulation algorithms play a critical role in this area of research. Although there are many other algorithms available, Forwards is unique in that it allows the user to keep records of the entire local genealogy in user-determined genomic positions, which provides a useful way to obtain detailed information about how selection and demography change the shape of the genealogy. This is the reason why several researchers (e.g., the group led by Prof Brian Charlesworth FRS) have contacted me and asked for the program. 
URL http://zeng-lab.group.shef.ac.uk/wordpress/?page_id=28
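
For readers unfamiliar with the term, the following is a minimal sketch of what a forwards-in-time simulation does, written in Python for brevity. It is not the Forwards program and does not reproduce its algorithm or its genealogy-recording feature; it is a textbook haploid Wright-Fisher model of a single deleterious allele under a changing population size. The function name, parameters and numerical values are all illustrative assumptions.

```python
import random


def simulate_wright_fisher(pop_sizes, s, p0, seed=None):
    """Forward-in-time haploid Wright-Fisher simulation of one biallelic site.

    pop_sizes -- population size in each successive generation (demography)
    s         -- selection coefficient against the mutant allele, 0 <= s < 1
                 (mutant fitness is 1 - s, wild-type fitness is 1)
    p0        -- initial frequency of the mutant allele
    Returns the list of mutant frequencies, starting with p0.
    """
    rng = random.Random(seed)
    freqs = [p0]
    p = p0
    for n in pop_sizes:
        # Expected frequency after selection.
        p_sel = p * (1.0 - s) / (p * (1.0 - s) + (1.0 - p))
        # Genetic drift: sample n offspring from the post-selection pool.
        count = sum(1 for _ in range(n) if rng.random() < p_sel)
        p = count / n
        freqs.append(p)
    return freqs


# Hypothetical scenario: a population of 200 that expands to 1000
# after 50 generations, with a mildly deleterious allele.
pop_sizes = [200] * 50 + [1000] * 50
trajectory = simulate_wright_fisher(pop_sizes, s=0.05, p0=0.2, seed=1)
print(f"final mutant frequency: {trajectory[-1]:.3f}")
```

Each generation applies selection first and random sampling second, so demography enters only through the number of offspring drawn. Simulators of this kind extend the same loop to many linked sites with mutation and recombination, which is what makes them computationally demanding.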
 
Title msbgs: a coalescent simulator for generating samples under background selection and demographic changes 
Description msbgs is a simulation program for generating variability under background selection models using the coalescent framework first described in Zeng and Charlesworth (2011). It can accommodate biologically important factors including recombination (crossover), variation in selection coefficient against deleterious mutations across sites, and changes in population size, population structure and migration (Zeng, 2013; Zeng and Corcoran, 2015). 
Type Of Material Computer model/algorithm 
Year Produced 2015 
Provided To Others? Yes  
Impact Modelling the effects of background selection on patterns of genetic diversity is an essential task. Previously, there was no available software that could simultaneously generate random samples in the presence of background selection and many biologically important factors, such as changes in population size and population structure. By providing such a program, msbgs is likely to be of interest to many researchers working on DNA sequence polymorphism data. 
URL http://zeng-lab.group.shef.ac.uk/wordpress/?page_id=28
 
Description Lecturer of Biostatistics for the International Biology Olympiad, February 2014 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Schools
Results and Impact Toni Gossmann was involved in the organization of the 3rd round of federal selection of the German National International Biology Olympiad (IBO, http://www.ibo-info.org/) Team, which took place in February 2014 at the University of Kiel and at the Eppendorf Center in Hamburg. Participants were 45 high-school students aged 16-19 from the whole of Germany who had been selected from more than 1500 students. During the event, the students received training to work in a laboratory environment and also attended a variety of lectures covering different aspects of biology. In particular, Toni Gossmann was responsible for the biostatistics/bioinformatics seminars, which aimed to introduce basic concepts of statistics and biomathematics, such as segregation of Mendelian traits, Hardy-Weinberg equilibrium, statistical testing and regression analysis.
Year(s) Of Engagement Activity 2014
 
Description Lecturer of Biostatistics for the International Biology Olympiad, February 2015 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Schools
Results and Impact Toni Gossmann was involved in the organization of the 3rd round of federal selection of the German National International Biology Olympiad (IBO, http://www.ibo-info.org/), which took place in February 2015 at the University of Kiel and at the Eppendorf Center in Hamburg. Participants were 45 high-school students aged 16-19 from the whole of Germany who had been selected from more than 1500 students. During the event, the students received training to work in a laboratory environment and also attended a variety of lectures covering different aspects of biology. In particular, Toni Gossmann was responsible for the biostatistics/bioinformatics seminars, which aimed to introduce basic concepts of statistics and biomathematics, such as segregation of Mendelian traits, Hardy-Weinberg equilibrium, statistical testing and regression analysis.
Year(s) Of Engagement Activity 2015