An integrated variant calling pipeline for third-generation sequencing technologies

Lead Research Organisation: University of Oxford

Department Name: Wellcome Trust Centre for Human Genetics

Abstract

Rationale: The majority of high-throughput sequencing (HTS) applications require two initial processing steps: read mapping and variant calling. The fast-paced nature of HTS research has led to the development of numerous tools of varying and often poorly understood quality, usually with a narrow application range in terms of sequencing technology and experimental design. This situation is unlikely to improve with the emergence of third-generation platforms (including, for instance, Pacific Biosciences, IonTorrent, Helicos, Oxford Nanopore, Avantome, and Halcyon Molecular). A particular technical challenge for many third-generation HTS platforms is their relatively high indel error rate (inserted or missed bases) compared to second-generation technologies (particularly Illumina and SOLiD), which requires the development of new tools. We have recently developed a read mapper ('Stampy') and an integrated SNP and indel caller ('Platypus'), both of which were designed to cope with indel errors and mutations. Recently published and unpublished data show that our tools are state of the art for Illumina data.

Aims: This proposal aims to improve and streamline the initial analysis of HTS data for both existing and future HTS platforms, and to provide relevant tools to the academic community. Building upon our current tool set, it proposes to do this by developing an integrated tool chain, built on solid statistical principles and applicable to a range of experimental designs. To make the tool chain technology-agnostic, we will develop a generic representation of uncertainty in sequencing reads, implemented as a generic file format for HT sequence reads. Briefly, this representation encodes a sequencer's output as a graphical network (specifically, a transducer), allowing the encoding of the local probability of base mismatches as well as insertions and deletions. This scheme is efficient, and sufficiently rich to represent the error profile of the Illumina, SOLiD and 454 platforms, as well as currently known third-generation platforms. Current technologies do not provide rich error models; in particular, no current technology annotates reads with per-base indel error rates. To compute these from existing data, as well as to tune factory-provided error models to the particular conditions of a library, lane or flow cell, a recalibration tool will also be developed. Parameterizing the tool chain with the resulting statistical error model allows it to cope transparently with technology improvements, as well as with new technologies, provided their error profiles fit within the generic statistical framework. In addition, we will continue to develop the current tool set to cope with a larger range of variants, and widen the spectrum of experimental designs to which it is applicable.

Work plan overview: We will first show feasibility by targeting three currently widely used platforms: Illumina and 454 (both available in-house at the WTCHG), and SOLiD (for which we have access to data through the 1000 Genomes project, to which both applicants contribute). Following successful development of the tool chain for these platforms, and having established a standard for representing uncertainty in sequence reads, we will adapt these tools for third-generation platforms. Since the ability to deal successfully with indel errors will be crucial here, we will be helped by our previous experience in developing the read mapper 'Stampy', which shows particularly good sensitivity and specificity for indels.
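As an illustration of the recalibration idea described above, the following minimal sketch estimates an indel error rate per covariate bin from counted observations and converts it to a Phred-scaled score. It is not the tool proposed here: the function names, the choice of covariates (lane and homopolymer length), the pseudocounts and the example data are all assumptions made for illustration, and it presumes that each observation has already been labelled as an error or not (for example by comparison against sites believed to be non-variant).

```python
import math
from collections import defaultdict


def phred(p, floor=1e-10):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(max(p, floor))


def recalibrate(observations, pseudo_errors=1.0, pseudo_total=2.0):
    """Estimate an error rate for each covariate bin.

    observations: iterable of (covariate_key, is_error) pairs, where
    covariate_key is any hashable description of the context
    (hypothetically: lane and homopolymer length).
    Pseudocounts keep the estimate away from 0 and 1 in sparse bins.
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    for key, is_error in observations:
        totals[key] += 1
        errors[key] += int(is_error)
    return {
        key: (errors[key] + pseudo_errors) / (totals[key] + pseudo_total)
        for key in totals
    }


# Hypothetical observations, binned by (lane, homopolymer length).
observations = (
    [(("lane1", 1), False)] * 97 + [(("lane1", 1), True)] * 1
    + [(("lane1", 5), False)] * 6 + [(("lane1", 5), True)] * 2
)
rates = recalibrate(observations)
for key in sorted(rates):
    print(key, round(rates[key], 3), round(phred(rates[key]), 1))
```

A table of this kind could then parameterize downstream mapping and calling, so that a change of library, lane or platform only changes the table rather than the tools.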

Technical Summary

Background: The majority of high-throughput sequencing (HTS) applications require two initial processing steps: read mapping and variant calling. Current tools will have to be adapted to deal with emerging third-generation platforms, such as Pacific Biosciences, IonTorrent, Helicos, Oxford Nanopore, Avantome, and Halcyon Molecular. A particular technical challenge for many third-generation platforms is their relatively high indel error rate (inserted or missed bases) compared to, for example, Illumina and SOLiD technologies.

Aims: This proposal aims to improve and streamline the initial analysis of HTS data for both existing and future HTS platforms, and to provide relevant tools to the academic community. Building upon our current tool set (including the read mapper Stampy), it proposes to do this by developing a generic and integrated tool chain. To make the tool chain technology-agnostic, we will develop a generic representation of uncertainty in sequencing reads, implemented as a generic file format for HT sequence reads. Briefly, this representation encodes a sequencer's output as a graphical network (specifically, a transducer), allowing the encoding of the local probability of base mismatches as well as insertions and deletions. This scheme is efficient, and sufficiently rich to represent the error profile of existing platforms, as well as future third-generation platforms. Parameterizing the tool chain with the resulting statistical error model allows it to cope transparently with technology improvements, as well as with new technologies, provided their error profiles fit within the generic statistical framework.
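To make the transducer idea concrete, here is a toy sketch under stated assumptions; it is not the proposed file format, nor the model used in Stampy or Platypus. Each called base carries three hypothetical local probabilities: `p_sub` (the call is a mismatch), `p_ins` (the called base is spurious) and `p_del` (a true base was missed just before this call). A forward-style dynamic programme then sums over all alignments to give the probability that the read model produces a given candidate true sequence. The class and function names, the uniform choice of substituted and missed bases, and the absence of missed bases after the last call are all simplifications chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReadPosition:
    base: str            # called base
    p_sub: float = 0.0   # probability the call is a mismatch
    p_ins: float = 0.0   # probability the called base is spurious
    p_del: float = 0.0   # probability a true base was missed before this call


def likelihood(read, true_seq):
    """Probability that the read model emits true_seq, summed over alignments."""
    n, m = len(read), len(true_seq)
    # f[i][j]: probability of having consumed i calls and emitted true_seq[:j]
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n):
        pos = read[i]
        for j in range(m + 1):
            cur = f[i][j]
            if cur == 0.0:
                continue
            if j < m:
                # A missed base (uniform over A, C, G, T); stay on the same call.
                f[i][j + 1] += cur * pos.p_del * 0.25
            proceed = cur * (1.0 - pos.p_del)
            # The called base is spurious: consume it, emit nothing.
            f[i + 1][j] += proceed * pos.p_ins
            if j < m:
                # The called base corresponds to a true base (match or mismatch).
                if true_seq[j] == pos.base:
                    emit = 1.0 - pos.p_sub
                else:
                    emit = pos.p_sub / 3.0
                f[i + 1][j + 1] += proceed * (1.0 - pos.p_ins) * emit
    return f[n][m]


# Hypothetical two-base read whose second call is uncertain.
read = [ReadPosition("A"), ReadPosition("C", p_sub=0.03, p_ins=0.1, p_del=0.2)]
for candidate in ("AC", "AG", "A", "AGC"):
    print(candidate, likelihood(read, candidate))
```

Because every choice at a position is a proper probability, the scores form a distribution over candidate sequences, which is what lets a mapper or caller compare hypotheses without knowing which platform produced the read.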

Planned Impact

Beneficiaries: The proposed tool chain is expected to be beneficial wherever high sensitivity and specificity are required for identifying polymorphisms in genomes from high-throughput sequence data. This is the case in many clinical settings, as well as in many research settings, a large fraction of which will fall within the remit of the BBSRC. In addition, accurate variant calls or genotype likelihoods, as provided by the proposed tool chain, are essential when the technology is pushed to its limits in large projects, for instance in large sequencing-based GWAS studies, where for reasons of economy or statistical power the aim is to extract as much information as possible from relatively low-coverage sequencing per individual. It is expected that providers of sequencing technology will also benefit from the proposed standardization of uncertainty in sequencing reads, since this will enhance interoperability and ease the transition to new generations of sequencing platforms. Similarly, technology providers are expected to benefit from the generic tool chain, since it will allow them to focus on the sequencing platform and less on downstream software development. The project outcomes are also expected to benefit providers of analysis services for high-throughput sequencing data. Examples are the PI's group at the Wellcome Trust Centre for Human Genetics, and the group of Dr. Mario Caccamo at TGAC in Norwich. Groups in this position need trusted, comprehensive and broadly applicable analysis tools in order to help their clients effectively.

Impact: By enabling users to analyze heterogeneous data across multiple sequencing platforms in a uniform manner, the project outcomes will reduce the time between sequencing and analysis, interpretation and results. By enabling researchers to switch to other technologies, cost savings or increases in power may be achieved. The higher sensitivity and lower false-positive rates that are suggested by our initial results, and that are the intended outcomes of the planned tool chain, will in a clinical context lead to fewer missed genetic variants that may be causative for the phenotype under study, and to higher rates of correct diagnoses. Access to a trusted and sensitive generic analysis pipeline will enable bioinformaticians to provide end-users with the required analysis results more quickly and more confidently, saving costs and reducing turnaround times. The proposed technology is primarily *enabling*. A substantial fraction of the expected impact can be summarized as increased efficiency. This includes cases where take-up of a new technology is expected to be brought forward by the availability of analysis tools, the effect of which could be significant. To a lesser extent, the proposed project may lead to research that otherwise would not have been contemplated, because of the access to efficient sequencing technology that is appropriate for the research question.

People: The two PDRAs who will be trained in this project will acquire sought-after skills in statistical modelling, software development, and bioinformatics research and analysis. They should be well placed for a further career in research, software engineering or data analysis. The WTCHG is a world-class research centre and the second-largest UK sequencing centre, employing around 500 scientists and staff, a large fraction of whom work in bioinformatics, and has an excellent track record of launching young researchers into future careers.

Funded Value:

£626,047

Funded Period:

Apr 12 - Apr 16

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/I02593X/1

Principal Investigator:

Gerard Lunter

Research Subject:

Omic sciences & technologies (26%)

Tools, technologies & methods (26%)

Research Topic:

Bioinformatics (26%)

Genomics (13%)

Transcriptomics (13%)

Organisations

People	ORCID iD
Gerard Lunter (Principal Investigator)
Gil McVean (Co-Investigator)

Publications


1000 Genomes Project Consortium (2015) A global reference for human genetic variation. in Nature

1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. in Nature

Bull K (2013) Unlocking the Bottleneck in Forward Genetics Using Whole-Genome Sequencing and Identity by Descent to Isolate Causative Mutations. in PLoS Genetics

Cazier JB (2014) Whole-genome sequencing of bladder cancers reveals somatic CDKN1A mutations and clinicopathological associations with mutation burden. in Nature communications

Cooke D (2018) A unified haplotype-based method for accurate and comprehensive variant calling

Delaneau O (2014) Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. in Nature Communications

Eizirik DL (2012) The human pancreatic islet transcriptome: expression of candidate genes for type 1 diabetes and the impact of pro-inflammatory cytokines. in PLoS genetics

Lamble S (2013) Improved workflows for high throughput library preparation using the transposome-based Nextera system. in BMC biotechnology

Lise S (2012) Recessive mutations in SPTBN2 implicate β-III spectrin in both cognitive and motor development. in PLoS genetics

MacArthur DG (2012) A systematic survey of loss-of-function variants in human protein-coding genes. in Science (New York, N.Y.)

Key Findings
Impact Summary
Policy Influence
Further Funding
Collaboration
Software and Technical Products
Spin Outs


Description	We have gained an in-depth understanding of the technical and biological complexities of DNA sequencing data produced by current state-of-the-art sequencing machines (Illumina HiSeq and MiSeq). Building on this, we have developed analytical software to help analyze these data and allow clinical conclusions to be drawn from them.
Exploitation Route	Our findings will help design future algorithms, and our software will be of direct use to many users of DNA sequencing technology, including in industry.
Sectors	Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Environment; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology
URL	http://www.well.ox.ac.uk/platypus, https://github.com/luntergroup/octopus


Description	We have developed an in-depth understanding of the error profile of Illumina short-read data, and building on that we have developed a variant caller for these data. The algorithm has been used widely within the Wellcome Trust Centre for Human Genetics (as a standard part of the WTCHG processing pipeline), in flagship projects such as WGS500 (the success of which led directly to the establishment of Genomics England by the UK government), and in consortia (e.g. 1000 Genomes). It has also contributed to the setting up of a new company (Genomics plc) and has generated collaborations (e.g. with Nazneen Rahman, ICR, Sutton). The last collaboration has resulted in a broader, cheaper and faster cancer predisposition test that is now being applied in several NHS centres. The company Genomics plc was initially aiming to provide analytical support for the Genomics England 100k Genomes project. Its focus has since shifted towards drug discovery.
First Year Of Impact	2014
Sector	Digital/Communication/Information Technologies (including Software); Healthcare; Government, Democracy and Justice; Pharmaceuticals and Medical Biotechnology
Impact Types	Societal; Economic; Policy & public services


Description	Developed analytical pipeline as part of WT-funded strategic award to improve genetic testing within NHS
Geographic Reach	National
Policy Influence Type	Influenced training of practitioners or researchers
Impact	The new screening methodology has allowed medical practitioners to identify rare mutations in cancer predisposition genes, which has on occasion led to improved prognosis, improved management, and genetic counselling of family members. The new screening methodology is cheaper and faster and has thereby made an immediate economic impact.
URL	http://mcgprogramme.com/


Description	BBSRC Impact Acceleration Award
Amount	£10,759 (GBP)
Funding ID	KCD00490 Task H501.01
Organisation	University of Oxford
Sector	Academic/University
Country	United Kingdom
Start	02/2016
End	05/2016


Description	Improved HLA genotyping for NHS BTU
Organisation	NHS Blood Transfusion Unit
Country	United Kingdom
Sector	Public
PI Contribution	We are developing an improved algorithm to determine the HLA type of a sample, to help identify tissues suitable for transplantation.
Collaborator Contribution	Our collaborators have contributed partially validated data to help us assess our methods and have indicated requirements and existing issues that they would like an improved algorithm to address.
Impact	1. We have developed an improved algorithm, which has been validated against the data provided by the NHSBTU, showing that it outperforms the currently available commercial algorithms that were selected by the NHSBTU. 2. We are currently working together to design a front-end that will allow clinicians to easily use the algorithm. A manuscript describing the algorithm is currently being written.
Start Year	2013


Description	Mainstreaming Cancer Genetics
Organisation	Institute of Cancer Research UK
Country	United Kingdom
Sector	Academic/University
PI Contribution	An informal collaboration on indel calling, involving the Centre director and others at the Institute of Cancer Research, has evolved into a formal collaboration in the form of a Wellcome Trust Strategic Award. Our role in this collaboration is to provide the software for small variant calling from Illumina targeted capture data.
Collaborator Contribution	Our partners have contributed targeted sequencing data across a large number of patients, and have generated validation data for a subset of variant calls that were generated by our pipeline.
Impact	Publication: 10.1186/s13073-015-0195-6 (2015), CSN and CAVA: variant annotation tools for rapid, robust next-generation sequencing analysis in the clinical setting. Software: CAVA, http://www.well.ox.ac.uk/cava
Start Year	2012


Title	LGENU - inferring haplotypes in ENU-mutagenized mice
Description	LGENU is a program designed to infer ancestral haplotypes in ENU-mutated mice, from next-generation SNP calls, using the Lander-Green algorithm. The program currently works properly only for a fixed pedigree, and needs SNP data from 3 sibling mice. It takes a VCF file of SNP calls, a map of recombination rates across the mouse genome, a window size, and a value for the ENU mutation rate. The output is a text file consisting of inferred ancestral haplotypes across the genome for each mouse. A test VCF file is provided with the program.
Type Of Technology	Software
Year Produced	2013
Open Source License?	Yes
Impact	Publication: Unlocking the Bottleneck in Forward Genetics Using Whole-Genome Sequencing and Identity by Descent to Isolate Causative Mutations. Bull, K.R., Rimmer, A.J., Siggs, O.M., Miosge, L.A., Roots, C.M., Enders, A., Bertram, E.M., Crockford, T.L., Whittle, B., Potter, P.K., Simon, M.M., Mallon, A.-M., Brown, S.D.M., Beutler, B., Goodnow, C.C., Lunter, G., Cornall, R.J. PLoS Genetics 9(1) 2013. 15 citations.
URL	http://www.well.ox.ac.uk/lgenu


Title	Octopus - general and tumour/normal variant calling
Description	Octopus is a modular haplotype-based variant caller. It includes bespoke models for several common experimental designs, including tumour/normal, trio, population and single-genome variant calling, and outperforms the state of the art in these designs.
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	The software is used by Invitae as part of their core variant calling pipeline.
URL	https://www.biorxiv.org/content/10.1101/456103v1


Title	Platypus
Description	A DNA variant caller for Illumina sequencing data
Type Of Technology	Software
Year Produced	2014
Open Source License?	Yes
Impact	Led to: 1888 downloads (Mar 2016) and 40 citations of the Platypus paper; key involvement in the 1000 Genomes project; involvement in the WGS500 project, which formed the blueprint of the UK government's 100,000 Genomes Project; a spin-out company, Genomics plc, whose first employee was Andy Rimmer, main developer of Platypus; several other collaborations and publications.
URL	http://www.well.ox.ac.uk/platypus


Company Name	Genomics
Description	Genomics develops databases and algorithms providing genome sequence data analysis for use in preventative medicine and drug discovery.
Year Established	2014
Impact	Secured US$15M in funding. Signed collaborations with Vertex Pharmaceuticals; Oxford University and Oxford University Hospitals NHS Trust; Biogen; and Eisai. Appointed Sekar Kathiresan as Chair of the SAB.
Website	http://www.genomicsplc.com