Detecting signatures of natural selection in the human genome with geographically explicit models

Lead Research Organisation: European Bioinformatics Institute

Department Name: Vertebrate Genomics

Abstract

Modern sequencing techniques have provided us with very large genetic datasets, on a scale that was hard to imagine only a couple of years ago. As these datasets comprise human populations from the entire globe, it is tempting to look at the geographic distribution of genetic variants and try to explain why some variants are more common in some places than in others. After all, we have known for a long time that sickle cell anaemia is found in regions where malaria was prevalent, as it can confer resistance to the deadly disease. So, could we find other important genetic variants that have been affected by natural selection by examining their geographic distribution? While this approach sounds promising, it raises the issue of distinguishing between patterns that truly reflect past and present selection and patterns that might simply have arisen by chance. In this project, we propose to develop a population genetics framework that will allow us to reconstruct the spread of anatomically modern humans around the globe, taking into account past changes in climate and the shape of continents. By knowing how and when people got to different parts of the world, we will then be able to identify which genetic variants have geographic distributions too extreme to be the result of mere chance, and thus have been the target of natural selection. Besides looking for regions under selection in the nuclear genome, we will also consider the small amount of genetic material contained in the mitochondria, small organelles that act as the biochemical powerhouses in our cells. Mitochondrial DNA is arguably the most widely used source of information for reconstructing past human history, but such reconstructions rely on the assumption that mitochondrial DNA has not been affected by natural selection.
Our new framework, together with a better geographic coverage of mitochondrial genetic variability that will be achieved in this project, will allow us to test the assumption of neutrality and to find any deviation that should be taken into account in future work on human settlement history.

Technical Summary

We propose to exploit the recently available datasets on worldwide human genomic diversity to test for possible targets of natural selection in the genome. We will first develop a demographic, geographically explicit inference framework for the analysis of genetic data. Using this tool, we will reconstruct the expansion out of Africa by anatomically modern humans, taking into account climatic changes over the last 100k years. We will then run stochastic simulations within this well parameterised demography to characterise genomic regions likely to have been affected by natural selection. The analyses will be run on the 650k SNPs already typed for the HGDP-CEPH panel (~1,000 individuals from 51 populations) and subsequently on larger datasets, which will be sourced from ongoing dense re-sequencing projects. To get further insights into the underlying selective forces, plausible targets of natural selection will be tested for their spatial association with environmental variables such as climate and diseases. We will also expand our approach to investigate natural selection on human mitochondrial DNA (mtDNA). Our group has recently uncovered new strong evidence that worldwide mtDNA diversity has been partly shaped by climate. We will sequence complete mtDNA genomes for 1,400 individuals belonging to 76 populations (the HGDP-CEPH panel and 25 Amerindian and Siberian populations previously genotyped at a large number of neutral autosomal loci). We will then investigate whether the current geographic distribution of mitochondrial haplotypes is compatible with our understanding of past human migrations as inferred from nuclear markers. Our demographic, spatially explicit model will provide a formal framework to test whether the association between some haplotypes and temperature that we detected in our previous work can be explained by stochastic events, or whether selection has to be invoked.
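
The simulation-based test described in this summary can be illustrated with a minimal sketch. It assumes that each locus is summarised by a single differentiation statistic and that the same statistic has been computed on data simulated under the neutral demography. The function names, the choice of statistic and the threshold `alpha` are hypothetical and are not taken from the proposal.

```python
def empirical_p_value(observed, simulated):
    """Share of neutral simulations at least as extreme as the observed value.

    Uses the add-one correction so the p-value is never exactly zero.
    """
    exceed = sum(1 for s in simulated if s >= observed)
    return (exceed + 1) / (len(simulated) + 1)


def flag_outliers(observed_by_locus, simulated, alpha=0.01):
    """Return {locus: p-value} for loci too extreme under the neutral model."""
    flagged = {}
    for locus, stat in observed_by_locus.items():
        p = empirical_p_value(stat, simulated)
        if p <= alpha:
            flagged[locus] = p
    return flagged


if __name__ == "__main__":
    # Placeholder values standing in for statistics from neutral simulations.
    neutral = [i / 100 for i in range(99)]
    # Hypothetical observed statistics for two loci.
    observed = {"locus_a": 0.99, "locus_b": 0.5}
    print(flag_outliers(observed, neutral, alpha=0.01))
```

In practice the neutral distribution would come from the spatially explicit simulations, and a genome-wide scan would also need a correction for multiple testing, which this sketch omits.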

Planned Impact

The research proposed here comprises four different objectives, which are likely to appeal to different parts of the scientific community and wider society. We intend to fill a major gap in the toolbox of population biologists with an eco-geographic inference framework. This should be of interest to human population biologists. However, so far the framework has met with the most enthusiasm from population biologists outside the human genetics community. Despite very limited publicity so far, we have been approached by numerous groups working on organisms as diverse as plant pathogens and marine mammals. We wish to encourage the use of the framework by making it freely available and producing extensive and user-friendly documentation. We also hope that the approach will be adopted by epidemiologists in the longer term. Our reconstruction of human settlement history should provide a richer, more detailed picture of human evolution over the last 100,000 years. We expect the results to be of interest to our colleagues in human genetics as well as to anthropologists and archaeologists. This is also a topic of interest to the general public. In addition to peer-reviewed publications aimed at the academic community, we wish to engage with a wider audience. To this end, we are planning to produce a series of interactive Flash applets capturing the main results. These will be made available through our websites but will also be used in talks and exhibitions. The new analyses on selection in the human genome should again appeal to scientists and non-scientists alike. This part of the project is really a leap into the unknown and it is thus difficult to make plans on how to publicise the results. Our methodology combined with the extraordinary increase in human genomic data should provide us with unprecedented power, making it likely that we will identify previously unsuspected genes of interest.
The appeal of the results, in particular to the general public, will largely depend on the new genes we identify. Irrespective of the results, we expect that the wider community of geneticists will be interested due to the novelty of the approach and the high statistical power of the analysis. Selection in the mitochondrial genome is a completely different situation from the genome-wide data mining, as we will test a very specific hypothesis. We have previously shown that mitochondrial diversity correlates with minimum temperature and have identified two plausible SNPs that make sense from a functional perspective. The manuscript was reviewed by Nature, Science and PLoS; the reviewers eventually rejected it in all three instances, mainly because they felt that the results had such far-reaching consequences that not the slightest doubt could be allowed to exist. Indeed, probably over 80% of the literature on human settlement history relies on inferences from mtDNA, and a correlation with climate would require revisiting it entirely. While we ran considerable controls, we were unable to perform the final control analysis, as this requires matched samples for mtDNA and neutral genomic markers, which we did not have. Our proposed research will remedy this problem and clarify whether the previous results stemmed from a sampling artifact or an unknown complex demographic mechanism, or whether our original results hold. In the latter case, this would arguably constitute one of the most important results in human population genetics and would lead to several paradigm shifts, such as reconsidering the pervasive notion of an 'out of Africa bottleneck'. We have no doubt that such a result would significantly impact large parts of the scientific community and generate considerable media attention.

Funded Value:

£146,934

Funded Period:

Sep 10 - Aug 13

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/H008691/1

Principal Investigator:

Paul Flicek

Research Subject:

Agri-environmental science (26%)

Genetics & development (26%)

Tools, technologies & methods (24%)

Research Topic:

Bioinformatics (12%)

Earth & environmental (26%)

Evolution & populations (26%)

Theoretical biology (12%)

Organisations

People	ORCID iD
Paul Flicek (Principal Investigator)

Publications


Eriksson A (2012) Effect of ancient population structure on the degree of polymorphism shared between modern human populations and ancient hominins. in Proceedings of the National Academy of Sciences of the United States of America

Eriksson A (2012) Late Pleistocene climate change and the global expansion of anatomically modern humans. in Proceedings of the National Academy of Sciences of the United States of America

Key Findings

Description	The most significant achievement from our portion of the grant is a robust, scalable and flexible software infrastructure for the management and analysis of human variation data. We have tested and used this software in several very demanding situations and will continue to do so in the context of the research questions in this grant and as appropriate in other situations.
Exploitation Route	We created computer software to enable the effective management of data arising from model biology research. Specifically, methods for determining how people's genomes differ from each other generate very large amounts of data called sequences. Genome sequences contain the information from our parents that not only makes us look like our parents, but also gives us some of the same disease risks. Genome sequences also contribute to what makes us different from each other in various ways, including height, hair colour and other details. These differences are known to be associated with where people come from, including both obvious characteristics like red hair in northern Europeans and less obvious characteristics, such as sensitivity to temperature, which are the focus of this proposal. The software is therefore applicable to other datasets of this type; all of our software is open source and freely available for those who wish to use it. We originally used our software to support the data coordination and management for the worldwide 1000 Genomes Project, and it is especially suited to cases where there are genome sequences or other genetic information from a large number of individuals. In this project we updated and extended the capabilities of our software in several key ways. For example, we added new features to support other types of data analysis, including methods for estimating the full genome sequence from tests (like those currently provided by 23andMe) that actually measure only a small fraction of the genome. We have also engineered our software to be much more efficient. Additional software development was done to enable it to be run in a fully secure environment, which is required by some research studies to protect the privacy of the research participants' genome data.
Sectors	Environment,Healthcare,Pharmaceuticals and Medical Biotechnology

Impact Summary


Description	The primary software resource is the ReseqTrack infrastructure, which was originally developed to support data coordination and management in the 1000 Genomes Project. Results from common human genotyping arrays were tested with the imputation pipeline and compared with reference implementations and data sets to ensure accuracy before release. We have extensively validated our variant calling infrastructure using the above software for both whole genome sequencing and exome sequencing use cases, using the independently generated 1000 Genomes and UK10K project data sets for validation. We have also developed modules that enable the rapid identification of functional variation based on Ensembl annotations and using the Ensembl Variant Effect Predictor (VEP) software.
First Year Of Impact	2011
Sector	Healthcare,Pharmaceuticals and Medical Biotechnology
Impact Types	Policy & public services

Research Tools and Methods


Title	Improved ReseqTrack infrastructure
Description	The primary software resource is the ReseqTrack infrastructure, which was originally developed to support data coordination and management in the 1000 Genomes Project. ReseqTrack includes a database for file and metadata management essential for the processing of data from a large number of individuals and analysis pipeline software.
Type Of Material	Improvements to research infrastructure
Provided To Others?	No
Impact	New capabilities include modules for genotyping, phasing, alignment, variant calling and imputation algorithms to process sequence-based variation data. The software was used to support the data coordination and management for the 1000 Genomes Project.
