An integrated variant calling pipeline for third-generation sequencing technologies

Lead Research Organisation: University of Oxford
Department Name: Wellcome Trust Centre for Human Genetics

Abstract

Rationale: The majority of high-throughput sequencing (HTS) applications require two initial processing steps: read mapping and variant calling. The fast-paced nature of HTS research has led to the development of numerous tools of varying and often ill-understood quality, usually with a narrow application range in terms of sequencing technology and experimental design. This situation will likely not improve with the emergence of 3rd generation platforms (including, for instance, Pacific Biosciences, IonTorrent, Helicos, Oxford Nanopore, Avantome, and Halcyon Molecular). A particular technical challenge for many third-generation HTS platforms is their relatively high indel error rate (inserted or missed bases) compared to 2nd generation technologies (particularly Illumina and SOLiD), requiring the development of new tools. We have recently developed a read mapper ('Stampy') and an integrated SNP and indel caller ('Platypus'), both of which were designed to cope with indel errors and mutations. Recently published and unpublished data show that our tools are state of the art for Illumina data.

Aims: This proposal aims to improve and streamline the initial analysis of HTS data for both existing and future HTS platforms, and to provide relevant tools to the academic community. Building upon our current tool set, it proposes to do this by developing an integrated tool chain, built on solid statistical principles, and applicable to a range of experimental designs. To make the tool chain technology-agnostic, we will develop a generic representation of uncertainty in sequencing reads, implemented as a generic file format for HT sequence reads. Briefly, this representation encodes a sequencer's output as a graphical network (specifically, a transducer), allowing the encoding of the local probability of base mismatches as well as insertions and deletions.
This scheme is efficient, and sufficiently rich to represent the error profile of the Illumina, SOLiD and 454 platforms, as well as currently known 3rd generation platforms. Current technologies do not provide rich error models; in particular, no current technology annotates reads with per-base indel error rates. To compute these from existing data, as well as to tune factory-provided error models to the particular conditions of a library, lane or flow cell, a recalibration tool will also be developed. Parameterizing the tool chain with the resulting statistical error model allows it to transparently cope with technology improvements, as well as with new technologies provided their error profiles fit within the generic statistical framework. In addition, we will continue to develop the current tool set to cope with a larger range of variants, and widen the spectrum of experimental designs to which it is applicable.

Work plan overview: We will first show feasibility by targeting three currently widely used platforms: Illumina and 454 (both available in-house at the WTCHG), and SOLiD (for which we have access to data through the 1000 Genomes project, to which both applicants contribute). Following successful development of the tool chain for these platforms, and having established a standard for representing uncertainty in sequence reads, we will adapt these tools for 3rd generation platforms. Since the ability to deal successfully with indel errors will be crucial here, we will be helped by our previous experience in developing the read mapper 'Stampy', which shows particularly good sensitivity and specificity for indels.
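To make the recalibration idea concrete, the following is a minimal sketch of how empirical indel rates might be estimated from existing alignments. It is not the proposal's method, which is not specified here. It assumes reads have already been aligned and are summarised by SAM-style CIGAR strings, and it treats every indel relative to the reference as a sequencing error (a real tool would need to exclude true variants). The function name `indel_event_rates` is hypothetical.

```python
import re
from typing import Iterable, Tuple

# SAM-style CIGAR operations: a length followed by one operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_event_rates(cigars: Iterable[str]) -> Tuple[float, float]:
    """Estimate (insertion, deletion) events per aligned base.

    Assumption (illustration only): every indel in an alignment is a
    sequencing error rather than a true variant.
    """
    aligned_bases = 0
    insertions = 0
    deletions = 0
    for cigar in cigars:
        for length, op in CIGAR_OP.findall(cigar):
            if op in "M=X":      # read base aligned to a reference base
                aligned_bases += int(length)
            elif op == "I":      # one insertion event, whatever its length
                insertions += 1
            elif op == "D":      # one deletion event, whatever its length
                deletions += 1
    if aligned_bases == 0:
        raise ValueError("no aligned bases in input")
    return insertions / aligned_bases, deletions / aligned_bases


# Hypothetical alignments from one lane.
ins_rate, del_rate = indel_event_rates(["50M1I49M", "100M", "30M2D70M"])
```

Computing such rates separately per library, lane or flow cell would give the kind of condition-specific parameters the text describes; stratifying further by sequence context is a natural extension that is not shown here.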

Technical Summary

Background: The majority of high-throughput sequencing (HTS) applications require two initial processing steps: read mapping and variant calling. Current tools will have to be adapted to deal with emerging 3rd-generation platforms, such as Pacific Biosciences, IonTorrent, Helicos, Oxford Nanopore, Avantome, and Halcyon Molecular. A particular technical challenge for many third-generation platforms is their relatively high indel error rate (inserted or missed bases) compared to, for example, Illumina and SOLiD technologies.

Aims: This proposal aims to improve and streamline the initial analysis of HTS data for both existing and future HTS platforms, and to provide relevant tools to the academic community. Building upon our current tool set (including the read mapper Stampy), it proposes to do this by developing a generic and integrated tool chain. To make the tool chain technology-agnostic, we will develop a generic representation of uncertainty in sequencing reads, implemented as a generic file format for HT sequence reads. Briefly, this representation encodes a sequencer's output as a graphical network (specifically, a transducer), allowing the encoding of the local probability of base mismatches as well as insertions and deletions. This scheme is efficient, and sufficiently rich to represent the error profile of existing platforms, as well as future 3rd generation platforms. Parameterizing the tool chain with the resulting statistical error model allows it to transparently cope with technology improvements, as well as with new technologies provided their error profiles fit within the generic statistical framework.
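As an illustration of what a per-position representation of mismatch and indel uncertainty could look like, the sketch below models a read as a simple chain-structured transducer and computes the probability that it arose from a candidate true sequence. This is an assumed toy model, not the file format or transducer defined by the project: the names `ReadPosition`, `confident_call` and `likelihood`, the uniform distribution over missed bases, and the restriction of missed bases to positions before a call are all choices made for this example.

```python
from typing import Dict, List, NamedTuple

BASES = "ACGT"


class ReadPosition(NamedTuple):
    base_probs: Dict[str, float]  # P(true base = b) for this call; sums to 1
    p_ins: float                  # P(this called base is spurious)
    p_del: float                  # P(a true base was missed just before this call)


def confident_call(base: str, p_err: float = 0.03,
                   p_ins: float = 0.0, p_del: float = 0.0) -> ReadPosition:
    """Build a position whose mismatch probability is spread evenly."""
    probs = {b: p_err / 3 for b in BASES}
    probs[base] = 1.0 - p_err
    return ReadPosition(probs, p_ins, p_del)


def likelihood(read: List[ReadPosition], truth: str) -> float:
    """P(truth | read) under the toy transducer, summed over all alignments.

    f[i][j] is the probability of having consumed i read positions while
    generating the first j bases of the true sequence.
    """
    n, m = len(read), len(truth)
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i, pos in enumerate(read):
        for j in range(m + 1):
            p = f[i][j]
            if p == 0.0:
                continue
            if j < m:
                # Missed base: a true base (assumed uniform) with no call.
                f[i][j + 1] += p * pos.p_del * 0.25
            no_del = p * (1.0 - pos.p_del)
            # Spurious call: the read advances, the true sequence does not.
            f[i + 1][j] += no_del * pos.p_ins
            if j < m:
                # Match or mismatch: both sequences advance by one base.
                f[i + 1][j + 1] += (no_del * (1.0 - pos.p_ins)
                                    * pos.base_probs[truth[j]])
    return f[n][m]


# Mismatch-only read versus a read that also carries indel uncertainty.
plain = [confident_call(b) for b in "ACG"]
indel_aware = [confident_call(b, p_ins=0.05, p_del=0.05) for b in "ACG"]
```

With indel probabilities set to zero, `likelihood(plain, "AG")` is exactly zero because no alignment exists, whereas the indel-aware read assigns that shorter sequence a small positive probability. This is the kind of distinction that per-base indel annotation is meant to expose to downstream mapping and calling.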

Planned Impact

Beneficiaries: The proposed tool chain is expected to be beneficial in cases where high sensitivity and specificity for identifying polymorphisms in genomes from high-throughput sequence data is required. This is the case in many clinical settings, as well as in many research settings, a large fraction of which will fall within the remit of the BBSRC. In addition, when the technology is pushed to its limits in large projects, for instance in large sequencing-based GWAS studies, and for reasons of economy or statistical power it is intended to extract maximal amounts of information from relatively low-coverage sequencing per individual, accurate variant calls or genotype likelihoods as provided by the proposed tool chain are essential. It is expected that providers of sequencing technology will also benefit from the proposed standardization of uncertainty in sequencing reads, since this will enhance inter-operability, and ease transition to new generations of sequencing platforms. Similarly, technology providers are expected to benefit from the generic tool chain, which allows them to focus on the sequencing platform and less on downstream software development. The project outcomes are intended and expected to be beneficial for providers of analysis services of high-throughput sequencing data. Examples are the PI's group at the Wellcome Trust Centre for Human Genetics, and the group of Dr. Mario Caccamo at TGAC in Norwich. Groups in this position have a need for trusted, comprehensive and broadly applicable analysis tools in order to be able to effectively help their clients.

Impact: By enabling users to analyze heterogeneous data across multiple sequencing platforms in a uniform manner, the project outcomes will reduce the time between sequencing and analysis, interpretation and results. By enabling researchers to switch to other technologies, cost savings or power increase may be achieved.
The higher sensitivity and lower false-positive rates that are suggested by our initial results, and that are the intended outcomes of the planned tool chain, will in a clinical context lead to fewer missed genetic variants that may be causative for the phenotype under study, and higher rates of correct diagnoses. Access to a trusted and sensitive generic analysis pipeline will enable bioinformaticians to provide end-users with the required analysis results more quickly and more confidently, saving costs and reducing turnaround times. The proposed technology is primarily *enabling*. A substantial fraction of the expected impact can be summarized as increased efficiency. This includes expected cases where take-up of a new technology will move forwards because of the availability of analysis tools, and the effect of this could be significant. To a lesser expected extent, the proposed project may lead to research that otherwise would not have been contemplated, because of the access to efficient sequencing technology that is appropriate for the research question.

People: The two PDRAs who will be trained in this project will acquire sought-after skills in statistical modelling, software development, and bioinformatics research and analysis. They should be well placed for a further career in research, software engineering or data analysis. The WTCHG is a world-class research centre and the second-largest UK sequencing centre, employing around 500 scientists and staff, a large fraction of whom work in bioinformatics, and has an excellent track record in springboarding young researchers into future careers.

Publications

Description We have gained an in-depth understanding of the technical and biological complexities of DNA sequencing data produced by current state-of-the-art sequencing machines (Illumina HiSeq and MiSeq). Building on this, we have developed analytical software to help analyze these data and allow clinical conclusions to be drawn from them.
Exploitation Route Our findings will help design future algorithms, and our software will be of direct use to many users, including in industry, of DNA sequencing technology.
Sectors Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Environment,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology

URL http://www.well.ox.ac.uk/platypus,https://github.com/luntergroup/octopus
 
Description We have developed an in-depth understanding of the error profile of Illumina short read data, and building on that we have developed a variant caller for this data. The algorithm has been used widely within the Wellcome Trust Centre for Human Genetics (as a standard part of the WTCHG processing pipeline), in flagship projects such as WGS500 (the success of which led directly to the establishment of Genomics England by the UK government), in consortia (e.g. 1000 Genomes), has contributed to setting up a new company (Genomics plc), and has generated collaborations (e.g. with Nazneen Rahman, ICR, Sutton). The last collaboration has resulted in a broader, cheaper and faster cancer predisposition test that is now being applied in several NHS centres. The company Genomics plc was initially aiming to provide analytical support for the Genomics England 100k Genomes project. Its focus has since shifted towards drug discovery.
First Year Of Impact 2014
Sector Digital/Communication/Information Technologies (including Software),Healthcare,Government, Democracy and Justice,Pharmaceuticals and Medical Biotechnology
Impact Types Societal,Economic,Policy & public services

 
Description Developed analytical pipeline as part of WT-funded strategic award to improve genetic testing within NHS
Geographic Reach National 
Policy Influence Type Influenced training of practitioners or researchers
Impact The new screening methodology has allowed medical practitioners to identify rare mutations in cancer predisposition genes, which has on occasion led to improved prognosis, improved management, and genetic counselling of family members. The new screening methodology is cheaper and faster and has thereby made an immediate economic impact.
URL http://mcgprogramme.com/
 
Description BBSRC Impact Acceleration Award
Amount £10,759 (GBP)
Funding ID KCD00490 Task H501.01 
Organisation University of Oxford 
Sector Academic/University
Country United Kingdom
Start 02/2016 
End 05/2016
 
Description Improved HLA genotyping for NHS BTU 
Organisation NHS Blood Transfusion Unit
Country United Kingdom 
Sector Public 
PI Contribution We are developing an improved algorithm to determine the HLA type of a sample, to help identify tissues suitable for transplantation.
Collaborator Contribution Our collaborators have contributed partially validated data to help us assess our methods and have indicated requirements and existing issues that they would like an improved algorithm to address.
Impact 1. We have developed an improved algorithm, which has been validated against the data provided by the NHSBTU, showing that it outperforms the currently available commercial algorithms that were selected by the NHSBTU. 2. We are currently working together to design a front-end that will allow clinicians to easily use the algorithm. A manuscript describing the algorithm is currently being written.
Start Year 2013
 
Description Mainstreaming Cancer Genetics 
Organisation Institute of Cancer Research UK
Country United Kingdom 
Sector Academic/University 
PI Contribution An informal collaboration around indel calling, involving the Centre director and others at the Institute of Cancer Research, has evolved into a formal collaboration in the form of a Wellcome Trust Strategic Award. Our role in this collaboration is to provide the software for small variant calling from Illumina targeted capture data.
Collaborator Contribution Our partners have contributed targeted sequencing data across a large number of patients, and have generated validation data for a subset of variant calls that were generated by our pipeline.
Impact Publication: CSN and CAVA: variant annotation tools for rapid, robust next-generation sequencing analysis in the clinical setting (2015), doi:10.1186/s13073-015-0195-6. Software: CAVA, http://www.well.ox.ac.uk/cava
Start Year 2012
 
Title LGENU - inferring haplotypes in ENU-mutagenized mice 
Description LGENU is a program designed to infer ancestral haplotypes in ENU-mutated mice, from next-generation SNP calls, using the Lander-Green algorithm. The program currently works properly only for a fixed pedigree, and needs SNP data from 3 sibling mice. It takes a VCF file of SNP calls, a map of recombination rates across the mouse genome, a window size, and a value for the ENU mutation rate. The output is a text file consisting of inferred ancestral haplotypes across the genome for each mouse. A test VCF file is provided with the program. 
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact - Publication: Unlocking the Bottleneck in Forward Genetics Using Whole-Genome Sequencing and Identity by Descent to Isolate Causative Mutations. Bull, K.R., Rimmer, A.J., Siggs, O.M., Miosge, L.A., Roots, C.M., Enders, A., Bertram, E.M., Crockford, T.L., Whittle, B., Potter, P.K., Simon, M.M., Mallon, A.-M., Brown, S.D.M., Beutler, B., Goodnow, C.C., Lunter, G., Cornall, R.J. PLoS Genetics 9(1) 2013. 15 citations. 
URL http://www.well.ox.ac.uk/lgenu
 
Title Octopus - general and tumour/normal variant calling 
Description Octopus is a modular haplotype-based variant caller, which includes bespoke models for several common experimental designs including tumour/normal, trio, population and single-genome variant calling, and which outperforms the state of the art in these designs. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact The software is used by Invitae as part of their core variant calling pipeline. 
URL https://www.biorxiv.org/content/10.1101/456103v1
 
Title Platypus 
Description a DNA variant caller for Illumina sequencing data 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact Led to:
- 1888 downloads (Mar 2016); 40 citations of the Platypus paper
- key involvement in the 1000 Genomes project
- involvement in the WGS500 project, which formed the blueprint of the UK government's 100,000 Genomes Project
- a spin-out company, Genomics plc; the first employee was Andy Rimmer, main developer of Platypus
- several other collaborations and publications
URL http://www.well.ox.ac.uk/platypus
 
Company Name Genomics plc 
Description From the website: We are aiming to lead the genomic transformation of healthcare. Our vision is simple: to fulfil the potential of genomics to change the world. We will lead the way through this challenge, using our expertise and experience to unleash the potential of genomics and set the standards by which healthcare professionals can benefit from genomic data. 
Year Established 2014 
Impact Secured US$15M in funding. Signed collaborations with Vertex Pharmaceuticals; Oxford University and Oxford University Hospitals NHS Trust; Biogen; and Eisai. Appointed Sekar Kathiresan as Chair of the SAB.
Website http://www.genomicsplc.com/