Computational genetic methods to identify human structural variation using short-read data

Lead Research Organisation: The Wellcome Trust Sanger Institute
Department Name: Wellcome Trust Genome Campus

Abstract

In genetic terms, there is relatively little variation between humans. The differences between any two individuals, regardless of how diverse their origins, account for only a small fraction of the total genetic content stored in the DNA of either one. But the variation that does exist is important: some of it affects not only physical characteristics like size or shape, but also differences in disease susceptibility or the risk of genetic disorders.

Until recently, it has only been feasible to study genetic variation at very short length scales within the DNA sequence. Now a new technology is becoming available for reading DNA much more quickly and at much less cost than before. This has made it possible to begin a comprehensive study of all types of variation amongst hundreds of individuals from all around the world.

Analysing the data produced in this study will require powerful and sophisticated computing techniques. Our research will develop and refine these techniques, and will use them to make discoveries about some of the factors that have shaped human evolution. We will investigate how the human DNA sequence has been affected in different ways by the environments we inhabit, the threats we face, and the ancestors we share.

Technical Summary

The Wellcome Trust Sanger Institute is participating in an international project to obtain high-throughput short-read sequence data for hundreds of individuals. We will develop and implement probabilistic computational methods for inferring sequence-level structural variation in this data.

Structural variation in the human genome comprises a substantial proportion of the differences between human individuals, and has important phenotypic effects, including susceptibility to diseases such as HIV, Parkinson's disease and Alzheimer's disease. Using short-read sequencing technology, it will be possible to study structural variation over a greater range of scales and at higher sequence resolution than is possible with current microarray methods. However, short-read data requires the development of mathematically sophisticated and computationally intensive analysis techniques. Moreover, structural variation involves issues which are fundamental to the idea of a comparison between genome sequences. Our work will make use of mathematical approaches such as the de Bruijn graph, a compact and accurate representation of sequence structure which has previously been introduced in de novo sequence assembly. We will account for divergence between the reference and sample sequences and errors in the sequencing process, features which pose particular challenges for the task of short-read mapping and assembly. We will also develop methods for detecting and characterising structural variation in low-coverage sequence data from many individuals, which is important in the context of a variation study.
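
As an illustration of the de Bruijn graph idea mentioned above, the following is a minimal sketch and not the project's actual method. It assumes error-free sequences and a fixed k-mer length; the function names and the two toy sequences are hypothetical. Each (k-1)-mer is a node, each k-mer an edge, and a node with more than one successor marks a point where the sample diverges from the reference.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some sequence."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branch_nodes(graph):
    """Return nodes with more than one successor, i.e. where paths diverge."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Hypothetical toy data: the sample lacks one base ("C") present in the reference.
reference = "ACGTTGCAAGGC"
sample = "ACGTTGAAGGC"

graph = build_de_bruijn([reference, sample], k=4)
print(branch_nodes(graph))  # ['TTG']
```

In this toy case the two paths leave the shared node TTG and rejoin at AAG, forming the kind of "bubble" that signals a difference between sequences. Real short-read data would additionally require handling of sequencing errors, repeats and coverage, which this sketch ignores.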

Using these techniques we will quantify the extent and nature of structural variation in human populations as observed in the variation study, and explore its implications for population history and selection effects in human genetic evolution. This will involve established population genetics methods, modified as necessary for application to structural variation, as well as novel methods such as looking for evidence of selection in Ancestral Recombination Graphs.

Publications

 
Description Isaac Newton Trust / Wellcome Trust ISSF / University of Cambridge Joint Research Grant
Amount £70,000 (GBP)
Organisation Wellcome Trust 
Sector Charity/Non Profit
Country United Kingdom
Start 02/2016 
End 03/2017
 
Description Royal Society Research Grant
Amount £14,500 (GBP)
Organisation The Royal Society 
Sector Academic/University
Country United Kingdom
Start 12/2013 
End 12/2014
 
Title Gorilla genome assembly 
Description Assembly and annotation of a whole genome for gorilla 
Type Of Material Biological samples 
Year Produced 2009 
Provided To Others? Yes  
Impact Enables the comparison of all extant great apes across the whole genome. 
 
Title HMMCNV 
Description A computer software package for identifying copy number variation in an individual based on next generation genome sequence data. 
Type Of Material Physiological assessment or outcome measure 
Year Produced 2008 
Provided To Others? Yes  
Impact Contribution to the set of structural variants identified in the first individuals to be fully sequenced using next generation technology. Also part of the tools used in the 1000 Genomes Project.
 
Description Coalescent population modelling 
Organisation Aarhus University
Department Bioinformatics Research Centre (BiRC)
Country Denmark 
Sector Academic/University 
PI Contribution sequence data, alignment and variant calls; model development and analysis
Collaborator Contribution Development and implementation of methods for coalescent inference using sequence data for individuals from different populations. Alignment of orangutan sequence data.
Impact Multi-disciplinary: computer science, mathematics, genomics
Start Year 2010
 
Description Coalescent population modelling 
Organisation Medical Research Council (MRC)
Department MRC Functional Genomics Unit
Country United Kingdom 
Sector Public 
PI Contribution sequence data, alignment and variant calls; model development and analysis
Collaborator Contribution Development and implementation of methods for coalescent inference using sequence data for individuals from different populations. Alignment of orangutan sequence data.
Impact Multi-disciplinary: computer science, mathematics, genomics
Start Year 2010
 
Description Gorilla Genome Consortium 
Organisation Aarhus University
Department Bioinformatics Research Centre (BiRC)
Country Denmark 
Sector Academic/University 
PI Contribution Lead consortium; produced assembly; analysis of sequence loss and gain; analysis of male/female mutation rate bias
Collaborator Contribution Modelling and inference of great ape phylogenomics. Structural variation analysis of gorilla genome and gorillas. Additional sequence data. Indel analysis of gorilla genome. Structural variation comparison of humans and gorillas. Provision of transcriptome sequence analysis and comparison between humans and other African apes. Protein evolutionary analysis.
Impact Release of annotated gorilla genome assembly in ENSEMBL; release of great ape primary DNA sequence data, RNA-seq data and ChIP-seq data in Genbank; submission of manuscript to Nature. Multi-disciplinary: Genomics, Genetics, Computer science, Primatology, Paleoanthropology
Start Year 2009
 
Description Gorilla Genome Consortium 
Organisation Pompeu Fabra University
Department Institute of Evolutionary Biology
Country Spain 
Sector Academic/University 
PI Contribution Lead consortium; produced assembly; analysis of sequence loss and gain; analysis of male/female mutation rate bias
Collaborator Contribution Modelling and inference of great ape phylogenomics. Structural variation analysis of gorilla genome and gorillas. Additional sequence data. Indel analysis of gorilla genome. Structural variation comparison of humans and gorillas. Provision of transcriptome sequence analysis and comparison between humans and other African apes. Protein evolutionary analysis.
Impact Release of annotated gorilla genome assembly in ENSEMBL; release of great ape primary DNA sequence data, RNA-seq data and ChIP-seq data in Genbank; submission of manuscript to Nature. Multi-disciplinary: Genomics, Genetics, Computer science, Primatology, Paleoanthropology
Start Year 2009
 
Description Gorilla Genome Consortium 
Organisation University of Cambridge
Department Department of Zoology
Country United Kingdom 
Sector Academic/University 
PI Contribution Lead consortium; produced assembly; analysis of sequence loss and gain; analysis of male/female mutation rate bias
Collaborator Contribution Modelling and inference of great ape phylogenomics. Structural variation analysis of gorilla genome and gorillas. Additional sequence data. Indel analysis of gorilla genome. Structural variation comparison of humans and gorillas. Provision of transcriptome sequence analysis and comparison between humans and other African apes. Protein evolutionary analysis.
Impact Release of annotated gorilla genome assembly in ENSEMBL; release of great ape primary DNA sequence data, RNA-seq data and ChIP-seq data in Genbank; submission of manuscript to Nature. Multi-disciplinary: Genomics, Genetics, Computer science, Primatology, Paleoanthropology
Start Year 2009
 
Description Gorilla Genome Consortium 
Organisation University of Geneva
Department Faculty of Medicine
Country Switzerland 
Sector Academic/University 
PI Contribution Lead consortium; produced assembly; analysis of sequence loss and gain; analysis of male/female mutation rate bias
Collaborator Contribution Modelling and inference of great ape phylogenomics. Structural variation analysis of gorilla genome and gorillas. Additional sequence data. Indel analysis of gorilla genome. Structural variation comparison of humans and gorillas. Provision of transcriptome sequence analysis and comparison between humans and other African apes. Protein evolutionary analysis.
Impact Release of annotated gorilla genome assembly in ENSEMBL; release of great ape primary DNA sequence data, RNA-seq data and ChIP-seq data in Genbank; submission of manuscript to Nature. Multi-disciplinary: Genomics, Genetics, Computer science, Primatology, Paleoanthropology
Start Year 2009
 
Description Gorilla Genome Consortium 
Organisation University of Oxford
Department Department of Physiology, Anatomy and Genetics
Country United Kingdom 
Sector Academic/University 
PI Contribution Lead consortium; produced assembly; analysis of sequence loss and gain; analysis of male/female mutation rate bias
Collaborator Contribution Modelling and inference of great ape phylogenomics. Structural variation analysis of gorilla genome and gorillas. Additional sequence data. Indel analysis of gorilla genome. Structural variation comparison of humans and gorillas. Provision of transcriptome sequence analysis and comparison between humans and other African apes. Protein evolutionary analysis.
Impact Release of annotated gorilla genome assembly in ENSEMBL; release of great ape primary DNA sequence data, RNA-seq data and ChIP-seq data in Genbank; submission of manuscript to Nature. Multi-disciplinary: Genomics, Genetics, Computer science, Primatology, Paleoanthropology
Start Year 2009
 
Description Gorilla Genome Consortium 
Organisation Washington University in St Louis
Department Genome Center
Country United States 
Sector Academic/University 
PI Contribution Lead consortium; produced assembly; analysis of sequence loss and gain; analysis of male/female mutation rate bias
Collaborator Contribution Modelling and inference of great ape phylogenomics. Structural variation analysis of gorilla genome and gorillas. Additional sequence data. Indel analysis of gorilla genome. Structural variation comparison of humans and gorillas. Provision of transcriptome sequence analysis and comparison between humans and other African apes. Protein evolutionary analysis.
Impact Release of annotated gorilla genome assembly in ENSEMBL; release of great ape primary DNA sequence data, RNA-seq data and ChIP-seq data in Genbank; submission of manuscript to Nature. Multi-disciplinary: Genomics, Genetics, Computer science, Primatology, Paleoanthropology
Start Year 2009
 
Description Press coverage of gorilla genome project 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Press conference; numerous interviews for press & broadcast media, March 2012

Widespread secondary coverage and discussion on the web and other public forums for discussion of science.
Year(s) Of Engagement Activity 2012
 
Description WTSI website 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Contributed material for publication on a public website.

Press inquiries about the project
Year(s) Of Engagement Activity 2008