NIRG: FARSPhase: a Flexible, widely Applicable, Robust, and Scalable phasing algorithm for human genetics

Lead Research Organisation: University of Edinburgh
Department Name: The Roslin Institute

Abstract

In computational genetics, phasing is the modelling of the underlying haploid structure of diploid genotypes. It is important for many genetic studies because inheritance actually takes place at the haploid level, even though we can only directly observe diploid genotypes with current mainstream technologies. In many applications haplotypes provide richer and more useful information than genotypes alone. Applications of haplotype phase include understanding the interplay of genetic variation and disease, enabling identity-by-descent models for use in heritability analysis, gene association studies and genomic prediction, imputation of un-typed genetic variation, prioritizing individuals for sequencing, calling genotypes, detecting genotype error, inferring human demographic history, inferring points of recombination, detecting recurrent mutation and signatures of selection, and modelling cis-regulation of gene expression.
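To make the ambiguity that phasing resolves concrete, the following minimal sketch enumerates every pair of haplotypes consistent with one diploid genotype. It assumes genotypes are coded as the count of the alternative allele at each site (0, 1 or 2) with no missing data; the function name and coding are illustrative assumptions, not part of the proposed algorithm.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with a diploid genotype.

    genotype: sequence of 0, 1 or 2 (assumed coding: alternative-allele count).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    # Each heterozygous site can place its alternative allele on either haplotype.
    for assignment in product((0, 1), repeat=len(het_sites)):
        # Homozygous sites are fixed: 0 -> allele 0 on both, 2 -> allele 1 on both.
        hap_a = [g // 2 for g in genotype]
        hap_b = [g // 2 for g in genotype]
        for site, allele in zip(het_sites, assignment):
            hap_a[site] = allele
            hap_b[site] = 1 - allele
        # Sort the pair so that (a, b) and (b, a) count as the same solution.
        pairs.add(tuple(sorted((tuple(hap_a), tuple(hap_b)))))
    return sorted(pairs)


if __name__ == "__main__":
    for pair in compatible_haplotype_pairs([1, 2, 1, 0]):
        print(pair)
```

A genotype with k heterozygous sites (k of at least 1) has 2^(k-1) such pairs, which is why phasing methods need extra information, such as haplotypes shared with other individuals, to choose among them.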
Human genetics data sets that will likely be phased in the future can be categorised into: (i) huge populations of nominally unrelated individuals (e.g. 500,000 individuals, UK Biobank); (ii) smaller subsets of such populations (e.g. data collected in individual studies); (iii) large (e.g. 50,000 individuals) or small (e.g. 1,000 individuals) data sets collected from isolated populations with high degrees of relatedness within them (e.g. Orcades - Orkney, deCODE - Iceland, VIKING - Sweden); (iv) data sets with and without pedigree information; (v) data sets that combine several of these features (e.g. Generation Scotland); and (vi) data sets with different types of genomic information (e.g. single nucleotide polymorphisms, low- or high-coverage sequence, short or longer sequence reads, etc.).
There are many phasing methods for human genetics data and these can be broadly classified into two groups: (i) heuristic methods (e.g. Long-Range Phasing (LRP)); and (ii) probabilistic methods (e.g. Hidden Markov Models (HMM)). Phasing is computationally intensive and the size and features of different data sets make them more or less suited to particular methods. LRP is computationally fast in comparison to HMM, but is only applicable to situations where individuals share relatively recent ancestry (e.g. within 10 generations) and thus share relatively long haplotypes (e.g. 5 to 10 cM length). Isolated populations (e.g. as in Orcades, Orkney) are ideally suited to LRP but huge populations with hundreds of thousands of nominally unrelated individuals may also be suitable (e.g. UK Biobank). Application of current HMM to such huge populations is computationally intractable. However, HMM are more suited to subsets of such populations than LRP because HMM only require that individuals share short haplotypes (e.g. <1 cM) due to sharing very distant relatives (e.g. 50 to 100 generations ago).
LRP and HMM methods are complementary in many ways. One models long haplotypes, the other short haplotypes. HMM methods are more flexible and can better model uncertainty in the data. LRP methods are computationally much more efficient and are also more accurate in scenarios to which they are suited. LRP methods are also more amenable to incorporation of pedigree information. A combined algorithm could exploit this complementarity.
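As an illustration of the kind of heuristic that long-range phasing relies on, the sketch below screens for individuals that could share a long haplotype with a target individual by counting opposing homozygous genotypes within a window. This screening step is taken from the general LRP literature rather than from this proposal, and the function names, the 0/1/2 genotype coding, and the zero-mismatch threshold are assumptions made for the example.

```python
def opposing_homozygotes(geno_a, geno_b):
    """Count sites where one individual is homozygous 0 and the other homozygous 2."""
    return sum(1 for a, b in zip(geno_a, geno_b) if {a, b} == {0, 2})


def candidate_haplotype_sharers(genotypes, target, max_opposing=0):
    """Return ids of individuals that may share a haplotype with `target`.

    genotypes: dict mapping individual id -> list of 0/1/2 genotypes over one
    window (assumed complete, i.e. no missing calls).
    Two individuals that share a haplotype across the whole window cannot be
    opposing homozygotes at any site, so pairs exceeding `max_opposing` are excluded.
    """
    target_genotype = genotypes[target]
    return [
        ind
        for ind, genotype in genotypes.items()
        if ind != target
        and opposing_homozygotes(target_genotype, genotype) <= max_opposing
    ]


if __name__ == "__main__":
    window = {
        "target": [0, 1, 2, 1],
        "ind1": [1, 1, 2, 0],
        "ind2": [2, 1, 0, 1],
    }
    print(candidate_haplotype_sharers(window, "target"))
```

The check is a simple comparison per pair of individuals and per site, which hints at why heuristic methods are cheap relative to HMM, and also why they need long shared haplotypes: over a short window, many unrelated pairs pass the screen by chance.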
The objective of this proposal is to develop FARSPhase: a Flexible, widely Applicable, Robust, and Scalable phasing algorithm for human genetics that combines the best features of LRP, other heuristics, and HMM methods into a single framework. As well as meeting the phasing needs of small data sets, this research will, if successful, enable huge data sets to be phased, thereby opening the possibility of more powerful analysis. The developed algorithm will be incorporated into a user-friendly software package built using best practices in software engineering, and its performance will be tested in a wide range of simulated and real data sets that reflect the likely future phasing scenarios for human genetics.

Technical Summary

Phasing is the modelling of the underlying haploid structure of diploid genotypes. It is important because inheritance actually takes place at the haploid level, even though only diploid genotypes are observed. The many phasing methods for human data can be broadly classified into: (i) heuristic methods (e.g. Long-range phasing (LRP)); and (ii) probabilistic methods (e.g. Hidden Markov Models (HMM)). Phasing is computationally intensive and the size and features of data sets make them more or less suited to particular methods.
Future data sets could be: (i) huge populations of nominally unrelated individuals (e.g. 500,000 individuals, UK Biobank); (ii) smaller subsets of such populations; (iii) isolated populations with high degrees of relatedness within them (e.g. Orkney, Iceland); (iv) data sets with/without pedigree information; (v) data sets with several of these features (e.g. Generation Scotland); and (vi) data sets with different types of genomic information (e.g. single nucleotide polymorphisms, low/high-coverage sequence, short/longer sequence reads).
Heuristic methods, such as LRP, are suited to isolated populations with/without pedigree information or, due to their computational efficiency, to large populations of nominally unrelated individuals. Small nominally unrelated populations may not comprise enough individuals for LRP to work, but they are suited to HMM because they are small enough to be computationally tractable and individuals within them share short haplotypes that HMM can model. Some data sets may be best addressed with a combination of algorithms. A combined algorithm could exploit the complementarity that exists between heuristic and probabilistic algorithms and thus be more powerful than the component algorithms.
The objective of this proposal is to develop and test a Flexible, widely Applicable, Robust, and Scalable phasing algorithm that combines the best features of LRP, other heuristic methods, and HMM methods into a single framework.

Planned Impact

This project will develop a practical tool enabling genotype phasing in a wide variety of human genetics scenarios opening up the potential for generating huge volumes of rich genomic information at low cost. It will develop fundamental scientific knowledge primarily in bioinformatics applied to genomics. The outcomes will be beneficial for:
(i) The academic community. Scientifically, the project constitutes a novel approach for combining heuristic and probabilistic phasing methods into a single scalable, flexible, and accurate algorithm that will be suited to a wide variety of scenarios in human genetics. In human genetics research applications of haplotype phase include understanding the interplay of genetic variation and disease, enabling identity-by-descent models for use in heritability analysis, gene association studies and genomic prediction, imputation of untyped genetic variation, prioritizing individuals for sequencing, calling genotypes in microarray and sequence data, detecting genotype error, inferring human demographic history, inferring points of recombination, detecting recurrent mutation and signatures of selection, and modelling cis-regulation of gene expression. In summary the method will enable richer analysis and creation of larger data sets.
(iii) Commercial sequence and genotype providers. Companies providing genotype data will be able to add value to the data that they generate.
(iv) Society. All members of society who depend on scientific research to unravel the biology of humans: healthy people, sick people, and at-risk people.
(v) UK science base. The proposed algorithm will provide a platform for increased R&D capabilities in the UK, maintaining its scientific reputation and associated institutions, with increased capability for a healthier population.
(vi) Training. The proposed research will be embedded within training courses that the PI is regularly invited to give, and the post-doc working on the project will have the opportunity to be trained at a world-class institute in a cutting edge area of research.
(vii) Policy. Genomic data is expensive, but the research and practical benefits are potentially large. Therefore much investment will be made in genomic data by human geneticists, charities, and the government in the coming years. To maximise efficiency of investment a co-ordinated national and perhaps international effort may be needed. The method to be developed in this proposal could enhance and underpin such an effort.

Publications

 
Description Newton Fund Workshop Brazil
Amount £52,000 (GBP)
Funding ID 228949780 
Organisation British Council 
Sector Charity/Non Profit
Country United Kingdom
Start 04/2016 
End 09/2016
 
Description Newton Fund Workshop Mexico
Amount £37,550 (GBP)
Funding ID 2016-RLWK7-10399 
Organisation British Council 
Sector Charity/Non Profit
Country United Kingdom
Start 04/2017 
End 03/2018
 
Title AlphaPeel, a probabilistic method that integrates whole genome sequence of any coverage, genotype and pedigree data 
Description We developed AlphaPeel, a probabilistic method that integrates whole genome sequence of any coverage, genotype and pedigree data into a single data source and simultaneously uses all of this data to call genotypes for sequence data of any coverage, phase such data, and impute such data to any individual in the pedigree. 
Type Of Material Improvements to research infrastructure 
Year Produced 2018 
Provided To Others? Yes  
Impact Human studies of the genetic factors underlying disease may require panel sizes (i.e. the number of individuals in the study) that reach hundreds of thousands or more. Dense genotyping or fully sequencing all individuals at high sequencing depth would be very expensive. Phasing and genotype imputation of incomplete data sets is an important strategy to limit the wet-lab costs. The algorithm developed in this project, in concert with other algorithms developed by our group, significantly advances the theory and practice of sequencing and imputation in human genetics, livestock genetics and crop genetics. Cost-effective genotyping will benefit medical research, and is of great economic value for breeding companies.
URL https://alphagenes.roslin.ed.ac.uk/wp/software/alphapeel/
 
Description Sequencing of beef cattle in Ireland 
Organisation Illumina Inc.
Department Illumina
Country United Kingdom 
Sector Private 
PI Contribution The objective of this project is to generate a large data set for the Irish beef and cattle market, analyse it, and obtain insights into the mechanics of the resulting predictions underlying the biology of the beef and dairy population. The AlphaSuite is a collection of software that we have developed to perform many of the common tasks in animal breeding, plant breeding, and human genetics including genomic prediction, breeding value estimation, variance component estimation, GWAS, imputation, phasing, optimal contributions, simulation, field trial designs, and various data recoding and handling tools.
Collaborator Contribution Illumina is providing the DNA sequencing data on more than 1000 cattle.
Impact At this stage of the collaboration the outputs have not been generated.
Start Year 2016
 
Title AlphaPhase 
Description The use of phased sequencing data has been shown to significantly increase the accuracy of imputation. AlphaPhase has been used as part of an imputation pipeline. Existing phasing programs have generally scaled poorly to large datasets, placing a long and expensive burden on the available computational resources. Additionally, the increasing production of large sequencing data bundles and their heterogeneity complicate the phasing process. The current version of AlphaPhase implements methods to determine phase using an extended Long Range Phasing and Haplotype Library Imputation.
Type Of Technology Software 
Year Produced 2016 
Impact The AlphaPhase package is freely available in AlphaSuite and includes a supporting manual and access to technical support, with the aim of benefiting the academic research community in animal breeding. Since its recent publication in the AlphaSuite, AlphaPhase has been downloaded 5 times. The AlphaPhase program is closely related to AlphaImpute, and is playing a key role in the Innovate UK funded project in collaboration with PIC, Innovate UK, Aviangen Innovate UK and ICBF.
URL http://www.alphagenes.roslin.ed.ac.uk/alphasuite-softwares/
 
Title AlphaPlantImpute 
Description AlphaPlantImpute is a software package designed for phasing and imputing genotype data in plant breeding populations. AlphaPlantImpute can be implemented within and across bi-parental populations to phase and impute focal individuals genotyped at low-density to high-density. 
Type Of Technology Software 
Year Produced 2018 
Impact This package was found to be extremely useful by our project partner global breeder KWS Saat SE. 
URL https://alphagenes.roslin.ed.ac.uk/wp/software/alphaplantimpute/
 
Description AlphaGenes Twitter channel 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The AlphaGenes Twitter channel updates the scientific community and a broader audience with news about our research group, scientific output and engagement activities.
Year(s) Of Engagement Activity 2012,2013,2014,2015,2016,2017,2018,2019,2020
URL https://twitter.com/Alpha_Genes
 
Description AlphaGenes website 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The AlphaGenes website informs the scientific community about the group's research activities, outputs, courses and available software tools.
Year(s) Of Engagement Activity 2017,2018,2019,2020
URL https://alphagenes.roslin.ed.ac.uk
 
Description Contribution to the New York Times article: Open Season Is Seen in Gene Editing of Animals
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Open Season Is Seen in Gene Editing of Animals was a feature article on gene editing by Amy Harmon. Professor John Hickey was interviewed as a specialist in the field of quantitative genetics.
Year(s) Of Engagement Activity 2016
URL https://www.nytimes.com/2015/11/27/us/2015-11-27-us-animal-gene-editing.html?_r=0
 
Description John Hickey Guest in Farming Today (BBC Radio 4) 
Form Of Engagement Activity A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact On Monday 26th September, BBC Radio 4's Farming Today featured Professor John Hickey as a specialist scientist on the subject of breeding programs and scientific impact.
Year(s) Of Engagement Activity 2016
URL http://www.bbc.co.uk/programmes/b07w5xxq
 
Description Public engagement at the Royal Highland Show 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact All members of the research group engaged with visitors to the RHS to show the importance of their research to the enhancement of the agricultural sector, in direct or indirect ways.
Year(s) Of Engagement Activity 2019
URL https://www.royalhighlandshow.org
 
Description Short course in Evolutionary Quantitative Genetics 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Study participants or study members
Results and Impact The Evolutionary Quantitative Genetics course was a comprehensive review of modern concepts in evolutionary quantitative genetics. The course covered basic statistics, population genetics, quantitative genetics, evolutionary response in quantitative traits, estimating the fitness of traits, and mixed models and their extensions. The instructor was Dr Bruce Walsh, Department of Ecology and Evolutionary Biology, University of Arizona, and co-author of Genetics and Analysis of Quantitative Traits. The course was hosted by Professor John Hickey at the Roslin Institute.
Year(s) Of Engagement Activity 2016
URL http://www.alphagenes.roslin.ed.ac.uk/bruce-walsh-visit/
 
Description Teaching course: Next Generation Plant and Animal Breeding Programs, Animal Science Department, University of Nebraska, Lincoln. 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact A series of lectures and workshops on Plant and Animal Breeding Programs exploring current practices and future areas of research. The course was designed and delivered by John Hickey and key members of his team.
Year(s) Of Engagement Activity 2016
URL http://animalscience.unl.edu/next-generation-plant-and-animal-breeding-programs
 
Description The Expert Working Group on Wheat Breeding Methods and Strategies 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Expert Working Group on Wheat Breeding Methods and Strategies seeks to exchange breeding methods research information and germplasm, and to build expertise, capacity and support in wheat breeding programs, with more efficient breeding methods consistent with the latest scientific advances. The EWG is working on activities such as workshops, training courses, communications, and sharing of germplasm and information to reach a larger pool of wheat breeders trained in state-of-the-art breeding methods.
Year(s) Of Engagement Activity 2015,2016,2017
URL http://www.wheatinitiative.org/activities/expert-working-groups/wheat-breeding-methods-and-strategie...