Next generation imputation for huge data sets

Lead Research Organisation: University of Edinburgh
Department Name: The Roslin Institute

Abstract

Knowledge gained from genome sequencing has great potential for increasing the direction and rate of genetic change in livestock breeding, and biological discovery in animal science. However, huge numbers of individuals will need to be sequenced to unlock this potential, and the current cost of sequencing for livestock is several hundreds or thousands of pounds per individual. This will remain a barrier to using these data routinely until the unit cost is of the order of tens of pounds. One promising approach to reducing costs whilst maintaining the quality of the resulting data is to use a technology called next-generation sequencing with low coverage (lcNGS). With lcNGS, large numbers of individuals can have their sequences sampled at low cost per individual, but each individual sequence will have substantial missing information. Accuracy is restored by inferring missing data using a process known as imputation. In livestock, this process is made more efficient by the pedigree structures of the populations.
Imputation using single nucleotide polymorphism (SNP) data from chips has been successfully applied in livestock. However, these methods are not optimal for the imputation from lcNGS data for several reasons. (i) SNP-chip genotypes are highly accurate and data points are missing only occasionally due to technical issues. In contrast, lcNGS data has much less certainty over the true genotype at a particular locus, and the missing data is randomly spread over the whole genome. (ii) SNP-chip genotypes cover only a small fraction of the genetic variation present in the genome in comparison to sequence data, so the computational techniques for imputing sequence data need to be much more efficient for practical use. (iii) The range of the data produced by lcNGS is rapidly evolving, requiring next-generation imputation algorithms to be very flexible.
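Point (i) can be illustrated with a small calculation. The sketch below is not part of the proposal: it assumes a simple binomial read model with a fixed sequencing error rate (`error_rate`) and a fixed genotype prior (`prior`), both hypothetical, and shows how little a handful of reads constrains the genotype at a locus.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """P(reads | genotype) for genotypes carrying 0, 1 or 2 alternative alleles.

    Assumes each read samples one of the two chromosomes at random and is
    misread with probability `error_rate` (an illustrative assumption).
    """
    n = ref_reads + alt_reads
    likelihoods = []
    for alt_copies in (0, 1, 2):
        frac_alt = alt_copies / 2
        # Probability that a single read shows the alternative allele.
        p_alt = frac_alt * (1 - error_rate) + (1 - frac_alt) * error_rate
        likelihoods.append(
            comb(n, alt_reads) * p_alt**alt_reads * (1 - p_alt) ** ref_reads
        )
    return likelihoods


def genotype_posteriors(ref_reads, alt_reads, error_rate=0.01,
                        prior=(0.25, 0.5, 0.25)):
    """Combine the read likelihoods with a genotype prior (Bayes' rule)."""
    lik = genotype_likelihoods(ref_reads, alt_reads, error_rate)
    weighted = [l * p for l, p in zip(lik, prior)]
    total = sum(weighted)
    return [w / total for w in weighted]
```

Under these assumptions, a locus with no reads simply returns the prior, a single read leaves two genotypes almost equally plausible, and only deep coverage approaches the near certainty of a SNP-chip call.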
The imputation algorithm proposed will address these issues from a novel direction by combining two approaches: heuristic and probabilistic. Heuristic algorithms use basic principles of inheritance and so are fast and accurate. They are well-suited to animal breeding since they use pedigree to make inferences from the abundance of closely-related individuals from large families, with large portions of the genome shared between pairs of individuals. However, heuristic methods can fail if such data is lacking or is unreliable across all or parts of the genome. Probabilistic algorithms primarily use Hidden Markov Models to mimic inheritance statistically and are computationally more demanding, slower, and inherently less accurate than heuristic algorithms. They have been developed primarily for application to human populations in which the pedigree structures, for example small sibships, are not well-suited to exploiting the power of heuristic algorithms. The proposed algorithm will obtain synergy from combining the two approaches as they have complementary strengths in the recovery of information and computational efficiency.
The overall objective is therefore to develop a generic imputation system that is capable of imputing in data sets of the order of millions of animals and can cope with the wide variety of data types that may appear from lcNGS. New heuristic approaches will be adopted to develop data that can be integrated with probabilistic approaches and combined into a novel hybrid algorithm. Efficient data handling and storage frameworks, and a user interface, will be developed to ensure the algorithm is computationally efficient, easy to use, and readily available to users. The algorithm will be benchmarked using a range of real and simulated data sets, and historical, real SNP-chip data, to ensure it remains backwards compatible with current or previous technology. The availability of the algorithm will enable breeders to accumulate sequence data on millions of animals at low unit cost, and in turn prompt greater accuracy of selection and innovation in breeding goals.

Technical Summary

Realising the potential of sequencing livestock genomes will require sequence for huge numbers of animals. This can only be achieved when the cost of acquiring sequence is much lower than at present. One approach to reducing cost is to use low-coverage sequencing and infer missing data with the process of imputation. Existing imputation algorithms for livestock are unable to use such probabilistic data as they are designed to impute from genotype data that are known with near certainty, as generated by SNP chips. These and other probabilistic approaches using Hidden Markov Models (HMM) will also be unable to cope with the computational demands of the millions of animals that will be sequenced. This proposal will develop a generic imputation algorithm that is (1) flexible in utilising multiple types of genomic and ancillary information (e.g. pedigree), (2) scalable to datasets with millions of animals, and (3) accurate in livestock settings. The algorithm will start by developing new heuristic approaches to encompass probabilistic data obtained from low-coverage sequence data and, after applying heuristic principles, will produce data that is suitable for the application of HMM, so producing a novel hybrid algorithm. The heuristic component will target large haplotypes shared by many individuals in livestock populations by capitalising on pedigree and abundant, large families. The probabilistic component will target genomic regions where haplotypes are too short for the heuristic component to work effectively, or where information (e.g. pedigree) is unreliable. This will create synergy between the scalability and computational efficiency of heuristic algorithms and the robustness of the HMM. The hybrid algorithm will be benchmarked by comparing performance with existing algorithms on datasets from large industry populations, huge simulated populations, and small prototype data sets. Software for the algorithm will be provided to allow ease of use.
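The division of labour between the two components can be sketched for a single locus. This is only an illustration under strong simplifying assumptions, not the proposed algorithm: the heuristic step is a bare Mendelian rule, and the probabilistic step is a Hardy-Weinberg prior standing in for the HMM. All function names and the genotype coding (0, 1, 2 copies of the alternative allele, `None` for missing) are hypothetical.

```python
def mendelian_fill(sire, dam):
    """Heuristic step: call the offspring genotype only when it is certain.

    If both parents are homozygous (0 or 2), each must transmit a known
    allele, so the offspring genotype follows directly.
    """
    if sire is None or dam is None:
        return None
    if sire in (0, 2) and dam in (0, 2):
        return (sire + dam) // 2
    return None


def hwe_probabilities(alt_freq):
    """Probabilistic stand-in: Hardy-Weinberg genotype probabilities.

    A real implementation would use an HMM over neighbouring loci; a
    population prior is used here purely to keep the sketch short.
    """
    q = alt_freq
    p = 1 - q
    return [p * p, 2 * p * q, q * q]


def impute_locus(child, sire, dam, alt_freq):
    """Return (genotype, source) for one individual at one locus."""
    if child is not None:
        return child, "observed"
    call = mendelian_fill(sire, dam)
    if call is not None:
        return call, "heuristic"
    probs = hwe_probabilities(alt_freq)
    return probs.index(max(probs)), "probabilistic"
```

For example, `impute_locus(None, 0, 2, 0.3)` is resolved by the heuristic rule alone, whereas `impute_locus(None, 1, None, 0.1)` falls through to the probabilistic step because the pedigree information is incomplete.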

Planned Impact

This project will develop a practical tool enabling sequence to be imputed from a wide variety of sources, opening up the potential for generating huge volumes of sequence information at low cost. It will develop fundamental scientific knowledge primarily in bioinformatics applied to genomics. The outcomes will be beneficial for:
(i) The academic community. Scientifically, the project constitutes a novel approach for combining heuristic and probabilistic imputation methods into a single scalable, flexible, and accurate imputation algorithm. This algorithm will enable the generation of large volumes of sequence information at low cost and will have the flexibility to handle new types of genomic information as they emerge. This will enable larger and hence more powerful experiments than currently feasible, and greater ability to combine data obtained with old technology with those with new technologies. The direct application of the method will benefit researchers in animal genetics (both natural and commercial populations) and those who study isolated human populations. Methodological developments will benefit plant and human geneticists concerned with outbred populations. The prototype data generated in this project will be a unique resource for livestock researchers and evolutionary biologists.
(ii) Breeding companies, breed societies, and levy boards. As indicated by the attached letters of support from four representatives of the livestock production industry (covering the three economically most important livestock species in the UK), a successful outcome of the project is expected to open new possibilities that will be highly beneficial to breeding companies and organisations that carry out genetic evaluations of domestic livestock. Such organisations will be provided with the tool so it can be embedded within their research, development and operational pipelines. This will increase the efficiency and sustainability of genetic improvement in the long term. We also anticipate similar application in pedigreed companion animal populations in the future.
(iii) Commercial sequence and genotype providers. Companies providing SNP or sequence data will be able to use imputation to add value to the data that they generate.
(iv) Society. All members of society who work to improve or depend upon the competitiveness and sustainability of agriculture will benefit from the downstream practical applications outlined above. The application of the algorithm by breeding organisations will lead to faster and more sustainable genetic progress, leading to healthier food, and food production that is more resource efficient and affordable. Increased efficiencies in agriculture have direct societal benefits in greater food security with less environmental impact.
(v) UK science base. The proposed algorithm will provide a platform for increased R&D capabilities in the UK, maintaining its scientific reputation and associated institutions, with increased capability for sustainable agricultural production.
(vi) Training. The proposed research will be embedded within training courses that the PI is regularly invited to give, and the post-doc working on the project will have the opportunity to be trained at a world-class institute in a cutting edge area of research.
(vii) Policy. Sequence data is expensive, but the research and practical benefits are potentially large. Therefore much investment will be made in sequence data in the livestock sector in the coming years. To maximise efficiency of investment a co-ordinated national and perhaps international effort may be needed. The method to be developed in this proposal could enhance and underpin such an effort.

Publications

 
Description Some breeding programs require high-density genotype information on all the individuals in the population. Acquiring all this information is expensive, so algorithms able to infer it are crucial. Algorithms based on pedigree information are accurate and fast, but fail when the genetic information is too sparse or the pedigree information is inconsistent. Algorithms based on information from the whole population can infer all the missing genetic information but are slower and not as accurate. We implemented a fast and accurate algorithm that takes advantage of information from both the pedigree and the population to infer all the genetic information for all the individuals.
Exploitation Route AlphaImpute and AlphaPhase software are freely available for researchers to download on the AlphaGenes webpage.
Sectors Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Environment

URL http://www.alphagenes.roslin.ed.ac.uk/alphasuite-softwares/alphaimpute/
 
Description We have developed novel phasing and imputation methods. These involve heuristic methods that scale to huge data sets (e.g., the UK Biobank) and which can work with data sets that have been genotyped with highly heterogeneous genotyping platforms. We have developed multi-locus iterative software (AlphaPeel) that can utilise pedigree information to infer phase and impute genotypes for large pedigrees (e.g., >100,000 individuals). This method can exploit genotype and sequence data of any density or coverage. This software enables cost-effective production of genotype datasets without the need to fully cover all individuals in a study, thus saving costs in research and breeding programs. The results of this project triggered (and are still triggering) the interest of breeding companies (KWS, BASF, GENUS plc) in conducting follow-up projects on the development of novel imputation methods, each for their specific needs.
First Year Of Impact 2016
Sector Agriculture, Food and Drink,Education
Impact Types Societal,Economic

 
Description Newton Fund Workshop Brazil
Amount £52,000 (GBP)
Funding ID 228949780 
Organisation British Council 
Sector Charity/Non Profit
Country United Kingdom
Start 04/2016 
End 09/2016
 
Description Newton Fund Workshop Mexico
Amount £37,550 (GBP)
Funding ID 2016-RLWK7-10399 
Organisation British Council 
Sector Charity/Non Profit
Country United Kingdom
Start 04/2017 
End 03/2018
 
Title Next generation imputation for huge livestock data sets 
Description AlphaImpute is a software package for imputing and phasing genotype data in populations with pedigree information available. The program uses segregation analysis and haplotype library imputation to impute alleles and genotypes. A complete description of the methods is given in Hickey et al. (2012). AlphaImpute consists of a single program; however, it calls both AlphaPhase1.1 (Hickey et al., 2011) and GeneProbForAlphaImpute (Kerr and Kinghorn, 1996). All information on the model of analysis, and on the input files and their layout, is specified in a single parameter file. 
Type Of Material Computer model/algorithm 
Year Produced 2015 
Provided To Others? Yes  
Impact Previously we have developed AlphaImpute, which is software for imputing classical marker genotype data in livestock breeding programs. AlphaImpute is unique in that it remains highly accurate even with marker densities as low as 384 SNP. For this reason AlphaImpute has been widely taken up by commercial pig and poultry breeding programs. In the current project, which is funded for 3 years by the BBSRC, we are developing the next version of AlphaImpute with scalability to data sets of millions of individuals and whole genome sequence information. 
URL http://www.alphagenes.roslin.ed.ac.uk/alphasuite/alphaimpute/
 
Description Aviagen support and upkeep AlphaImpute 
Organisation Aviagen Group
Country United States 
Sector Private 
PI Contribution AlphaImpute is a software package for imputing and phasing genotype data in populations with pedigree information available. The program uses segregation analysis and haplotype library imputation to impute alleles and genotypes.
Collaborator Contribution Aviagen's investment supports the maintenance and improvement of AlphaImpute by contributing to the salary of a post-doctoral researcher.
Impact This collaboration has not produced any outputs to date.
Start Year 2015
 
Description Continued development of AlphaImpute 
Organisation Geno Global Ltd.
Country Norway 
Sector Private 
PI Contribution AlphaImpute is a software package for imputing and phasing genotype data in populations with pedigree information available. The program uses segregation analysis and haplotype library imputation to impute alleles and genotypes.
Collaborator Contribution Trygve Solberg Geno investment supports the maintenance and improvement of AlphaImpute by contributing to the salary of a post-doctoral researcher.
Impact This collaboration has not produced any outputs to date.
Start Year 2016
 
Description Partnership with Geno Global 
Organisation Geno Global Ltd.
Country Norway 
Sector Private 
PI Contribution In this partnership we will work together to develop and implement genotyping, sequencing and imputation strategies and tools in the Norwegian Red population central to Geno Global. The work seeks to integrate AlphaImpute, our imputation software, into the routine breeding value estimation pipeline at Geno. We will also perform analysis of the resulting data to aid the use of genomic prediction methods in the Geno breeding program and to help the discovery of causal variants that segregate in the Norwegian Red population.
Collaborator Contribution Geno has provided historical data from their herd and made it available to develop and train new imputation strategies.
Impact Developed an imputation strategy with Geno.
Start Year 2016
 
Description Sequencing of beef cattle in Ireland 
Organisation Illumina Inc.
Department Illumina
Country United Kingdom 
Sector Private 
PI Contribution The objective of this project is to generate a large data set for the Irish beef and cattle market, analyse it, and obtain insights into the mechanics of the resulting predictions underlying the biology of the beef and dairy population. The AlphaSuite is a collection of software that we have developed to perform many of the common tasks in animal breeding, plant breeding, and human genetics, including genomic prediction, breeding value estimation, variance component estimation, GWAS, imputation, phasing, optimal contributions, simulation, field trial designs, and various data recoding and handling tools.
Collaborator Contribution Illumina is providing the DNA sequencing data on more than 1000 cattle.
Impact At this stage of the collaboration the outputs have not been generated.
Start Year 2016
 
Title AlphaAssign 
Description AlphaAssign is a parentage assignment program for genotype data. It uses a likelihood-based model to determine the sire of an individual from a list of potential sires. It is intended for large-scale genotyping/resequencing projects in livestock breeding. 
Type Of Technology Software 
Year Produced 2019 
Impact AlphaAssign has seen adoption in both the academic and industry communities. It has been used to perform parentage assignment for ecological populations, such as squirrels and vervet monkeys. In the industry community it has been used to perform parentage assignment as part of a pig-breeding program, and was used as the basis of a maternal grand-parent assignment algorithm, AlphaMGS assign. 
URL https://alphagenes.roslin.ed.ac.uk/wp/software-2/alphaassign/
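 
The idea of likelihood-based sire assignment described in the AlphaAssign record above can be sketched as follows. This is not AlphaAssign's actual model or interface: it assumes unlinked biallelic loci, an unknown dam whose transmitted allele is drawn from population allele frequencies, and a flat genotyping error term. All names (`offspring_probs`, `assign_sire`, `error_rate`) are hypothetical.

```python
from math import log


def offspring_probs(sire_genotype, alt_freq):
    """P(offspring genotype = 0, 1, 2 | sire genotype, population frequency)."""
    t = sire_genotype / 2  # probability the sire transmits the alternative allele
    q = alt_freq           # probability the unknown dam transmits it
    return [(1 - t) * (1 - q), t * (1 - q) + (1 - t) * q, t * q]


def log_likelihood(child, sire, alt_freqs, error_rate=0.01):
    """Sum of per-locus log-likelihoods of the child given one candidate sire."""
    total = 0.0
    for c, s, q in zip(child, sire, alt_freqs):
        probs = offspring_probs(s, q)
        # The error term keeps a single Mendelian mismatch from giving log(0).
        total += log((1 - error_rate) * probs[c] + error_rate / 3)
    return total


def assign_sire(child, candidates, alt_freqs):
    """Return the best-scoring candidate and the score of every candidate."""
    scores = {
        name: log_likelihood(child, genotypes, alt_freqs)
        for name, genotypes in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

A usage sketch: with `child = [1, 0, 2, 1]`, `alt_freqs = [0.5] * 4` and candidates `{"sireA": [2, 0, 2, 0], "sireB": [0, 2, 0, 2]}`, `assign_sire` prefers `sireA`, because `sireB` is an opposing homozygote to the child at two loci.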
 
Title AlphaFamImpute 
Description AlphaFamImpute is a genotype calling, phasing, and imputation software package for large full-sib families in diploid plants and animals which supports individuals genotyped with SNP array or GBS data. 
Type Of Technology Software 
Year Produced 2019 
Impact The software package is currently used by our industrial partners in crop breeding 
URL https://alphagenes.roslin.ed.ac.uk/wp/software-2/alphafamimpute/
 
Title AlphaImpute 
Description Imputation can cost-effectively generate high-density genotypes of many individuals. Typical genotyping strategies involve genotyping a small number of individuals with expensive high-density marker panels, and a large number of individuals with cheaper low-density panels. Imputation is then used to infer the un-typed high-density markers in the individuals genotyped at low density. AlphaImpute is a flexible tool that imputes genotypes and alleles accurately and quickly for datasets with large pedigrees and large numbers of genotyped markers. It combines basic rules of Mendelian inheritance, probabilistic inferences of genotypes, phasing of long stretches of haplotypes, and imputation of genotypes from a haplotype library. AlphaImpute consists of a single program; however, it calls both AlphaPhase1.1 and GeneProbForAlphaImpute. All information on the model of analysis, and on the input files and their layout, is specified in a single parameter file. 
Type Of Technology Software 
Year Produced 2016 
Impact The AlphaImpute package is freely available in AlphaSuite and includes a supporting manual and access to technical support, with the aim of benefiting the academic research community in animal breeding. The program has been downloaded over 200 times in recent years, attracting users from a number of different academic institutions internationally. AlphaImpute has supported collaboration with a number of industrial partners. One such example is the Innovate UK funded project in collaboration with PIC. This project has accelerated the rate of genetic gain by 35% in pigs, enabled by AlphaImpute. Major emphasis has been put on making AlphaImpute more computationally efficient and accessible to small animal breeding operations and academic institutions, and we have succeeded in improving the computing time by 75%. 
URL http://www.alphagenes.roslin.ed.ac.uk/alphasuite-softwares/
 
Title AlphaPhase 
Description The use of phased sequencing data has been shown to significantly increase the accuracy of imputation. AlphaPhase has been used as part of an imputation pipeline. Existing programs for phasing have generally scaled poorly to large datasets, placing a long and expensive burden on the available computational resources. Additionally, the increasing production of large, heterogeneous sequencing data sets complicates the phasing process. The current version of AlphaPhase implements methods to determine phase using extended Long Range Phasing and Haplotype Library Imputation. 
Type Of Technology Software 
Year Produced 2016 
Impact The AlphaPhase package is freely available in AlphaSuite and includes a supporting manual and access to technical support, with the aim of benefiting the academic research community in animal breeding. Since its recent publication in the AlphaSuite, AlphaPhase has been downloaded 5 times. The AlphaPhase program is closely related to AlphaImpute, and is playing a key role in the Innovate UK funded project in collaboration with PIC, Innovate UK, Aviagen Innovate UK and ICBF. 
URL http://www.alphagenes.roslin.ed.ac.uk/alphasuite-softwares/
 
Description AlphaGenes Twitter channel 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The AlphaGenes Twitter channel updates the scientific community and a broader audience with news about our research group, scientific output and engagement activities.
Year(s) Of Engagement Activity 2012,2013,2014,2015,2016,2017,2018,2019,2020
URL https://twitter.com/Alpha_Genes
 
Description AlphaGenes website 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The AlphaGenes website informs the scientific community about the group's research activities, outputs, courses and available software tools.
Year(s) Of Engagement Activity 2017,2018,2019,2020
URL https://alphagenes.roslin.ed.ac.uk
 
Description Contribution to the New York Times article: Open Season Is Seen in Gene Editing of Animals 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Open Season Is Seen in Gene Editing of Animals was a feature article on gene editing by Amy Harmon. Professor John Hickey was interviewed as a specialist in the field of quantitative genetics.
Year(s) Of Engagement Activity 2016
URL https://www.nytimes.com/2015/11/27/us/2015-11-27-us-animal-gene-editing.html?_r=0
 
Description John Hickey Guest in Farming Today (BBC Radio 4) 
Form Of Engagement Activity A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact On Monday 26th September, BBC Radio 4's Farming Today had Professor John Hickey as a specialist scientist on the subject of breeding programs and scientific impact.
Year(s) Of Engagement Activity 2016
URL http://www.bbc.co.uk/programmes/b07w5xxq
 
Description Modern plant and animal applied genomics driven by genotype and sequence data, University of Zagreb, Croatia, 17-19 July 2018 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Workshop organised and given by me and two other members of my group.
Year(s) Of Engagement Activity 2018
 
Description Public engagement at the Royal Highland Show 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact All members of the research group engaged the visitors of the RHS, to show the importance of their research towards the enhancement of the agricultural sector in direct or indirect ways.
Year(s) Of Engagement Activity 2019
URL https://www.royalhighlandshow.org
 
Description Short course in Evolutionary Quantitative Genetics 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Study participants or study members
Results and Impact The Evolutionary Quantitative Genetics course was a comprehensive review of modern concepts in evolutionary quantitative genetics. The course covered basic statistics, population genetics, quantitative genetics, evolutionary response in quantitative traits, estimating the fitness of traits, and mixed models and their extensions. The instructor was Dr Bruce Walsh, Department of Ecology and Evolutionary Biology, University of Arizona, and co-author of Genetics and Analysis of Quantitative Traits. The course was hosted by Professor John Hickey at the Roslin Institute.
Year(s) Of Engagement Activity 2016
URL http://www.alphagenes.roslin.ed.ac.uk/bruce-walsh-visit/
 
Description Teaching course: Next Generation Plant and Animal Breeding Programs, Animal Science Department, University of Nebraska, Lincoln. 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact A series of lectures and workshops on Plant and Animal Breeding Programs exploring current practices and future areas of research. The course was designed and delivered by John Hickey and key members of his team.
Year(s) Of Engagement Activity 2016
URL http://animalscience.unl.edu/next-generation-plant-and-animal-breeding-programs
 
Description The Expert Working Group on Wheat Breeding Methods and Strategies 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Expert Working Group on Wheat Breeding Methods and Strategies seeks to exchange research information on breeding methods and germplasm in order to build expertise and capacity in wheat breeding programs, with more efficient breeding methods consistent with the latest scientific advances. The EWG is working on activities such as workshops, training courses, communications, and sharing of germplasm and information to reach a larger pool of wheat breeders and train them in state-of-the-art breeding methods.
Year(s) Of Engagement Activity 2015,2016,2017
URL http://www.wheatinitiative.org/activities/expert-working-groups/wheat-breeding-methods-and-strategie...