An integrated variant calling pipeline for third-generation sequencing technologies
Lead Research Organisation:
University of Oxford
Department Name: Wellcome Trust Centre for Human Genetics
Abstract
Rationale: The majority of high-throughput sequencing (HTS) applications require two initial processing steps: read mapping and variant calling. The fast-paced nature of HTS research has led to the development of numerous tools of varying and often poorly understood quality, usually with a narrow application range in terms of sequencing technology and experimental design. This situation is unlikely to improve with the emergence of third-generation platforms (including, for instance, Pacific Biosciences, IonTorrent, Helicos, Oxford Nanopore, Avantome, and Halcyon Molecular). A particular technical challenge for many third-generation HTS platforms is their relatively high indel error rate (inserted or missed bases) compared to second-generation technologies (particularly Illumina and SOLiD), which requires the development of new tools. We have recently developed a read mapper ('Stampy') and an integrated SNP and indel caller ('Platypus'), both of which were designed to cope with indel errors and mutations. Recently published and unpublished data show that our tools are state of the art for Illumina data.

Aims: This proposal aims to improve and streamline the initial analysis of HTS data for both existing and future HTS platforms, and to provide relevant tools to the academic community. Building upon our current tool set, it proposes to do this by developing an integrated tool chain, built on solid statistical principles and applicable to a range of experimental designs. To make the tool chain technology-agnostic, we will develop a generic representation of uncertainty in sequencing reads, implemented as a generic file format for HT sequence reads. Briefly, this representation encodes a sequencer's output as a graphical network (specifically, a transducer), allowing the encoding of the local probability of base mismatches as well as insertions and deletions. This scheme is efficient, and sufficiently rich to represent the error profile of the Illumina, SOLiD and 454 platforms, as well as currently known third-generation platforms. Current technologies do not provide rich error models; in particular, no current technology annotates reads with per-base indel error rates. To compute these from existing data, as well as to tune factory-provided error models to the particular conditions of a library, lane or flow cell, a recalibration tool will also be developed. Parameterizing the tool chain with the resulting statistical error model allows it to cope transparently with technology improvements, as well as with new technologies, provided their error profiles fit within the generic statistical framework. In addition, we will continue to develop the current tool set to cope with a larger range of variants, and widen the spectrum of experimental designs to which it is applicable.

Work plan overview: We will first show feasibility by targeting three currently widely used platforms: Illumina and 454 (both available in-house at the WTCHG), and SOLiD (for which we have access to data through the 1000 Genomes project, to which both applicants contribute). Following successful development of the tool chain for these platforms, and having established a standard for representing uncertainty in sequence reads, we will adapt these tools for third-generation platforms. Since the ability to deal successfully with indel errors will be crucial here, we will be helped by our previous experience in developing the read mapper 'Stampy', which shows particularly good sensitivity and specificity for indels.
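As a rough illustration of the recalibration idea described above, the following Python sketch estimates per-lane insertion and deletion error rates from read alignments. It is a minimal sketch under stated assumptions, not the project's tool: the function name `indel_rates`, the `(group, cigar)` input shape and the pseudocount smoothing are all hypothetical choices made for illustration. It also treats every indel in an alignment as a sequencing error, so a real recalibration step would have to exclude sites of true variation.

```python
import re
from collections import defaultdict

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_rates(alignments, pseudocount=1.0):
    """Estimate indel events per aligned base for each group (e.g. lane).

    `alignments` is an iterable of (group, cigar) pairs; both the input
    shape and the smoothing scheme are assumptions made for this sketch.
    """
    counts = defaultdict(lambda: {"aligned": 0, "ins": 0, "del": 0})
    for group, cigar in alignments:
        c = counts[group]
        for length, op in CIGAR_OP.findall(cigar):
            if op in "M=X":        # bases aligned to the reference
                c["aligned"] += int(length)
            elif op == "I":        # one insertion event, whatever its length
                c["ins"] += 1
            elif op == "D":        # one deletion event, whatever its length
                c["del"] += 1
    rates = {}
    for group, c in counts.items():
        # Smoothed estimate (k + a) / (n + 2a), so sparse groups avoid rates of 0.
        denom = c["aligned"] + 2 * pseudocount
        rates[group] = {
            "ins": (c["ins"] + pseudocount) / denom,
            "del": (c["del"] + pseudocount) / denom,
        }
    return rates


example = [("lane1", "50M1I49M"), ("lane1", "100M"), ("lane2", "30M2D70M")]
rates = indel_rates(example)
```

Grouping by lane is only one choice; the same counting could be keyed on library, flow cell or sequence context such as homopolymer length.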
Technical Summary
Background: The majority of high-throughput sequencing (HTS) applications require two initial processing steps: read mapping and variant calling. Current tools will have to be adapted to deal with emerging third-generation platforms, such as Pacific Biosciences, IonTorrent, Helicos, Oxford Nanopore, Avantome, and Halcyon Molecular. A particular technical challenge for many third-generation platforms is their relatively high indel error rate (inserted or missed bases) compared to, for example, Illumina and SOLiD technologies.

Aims: This proposal aims to improve and streamline the initial analysis of HTS data for both existing and future HTS platforms, and to provide relevant tools to the academic community. Building upon our current tool set (including the read mapper Stampy), it proposes to do this by developing a generic and integrated tool chain. To make the tool chain technology-agnostic, we will develop a generic representation of uncertainty in sequencing reads, implemented as a generic file format for HT sequence reads. Briefly, this representation encodes a sequencer's output as a graphical network (specifically, a transducer), allowing the encoding of the local probability of base mismatches as well as insertions and deletions. This scheme is efficient, and sufficiently rich to represent the error profile of existing platforms as well as future third-generation platforms. Parameterizing the tool chain with the resulting statistical error model allows it to cope transparently with technology improvements, as well as with new technologies, provided their error profiles fit within the generic statistical framework.
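The proposal does not specify the transducer in detail, so the following Python sketch shows only one plausible reading of it: each read position carries a distribution over the true base, plus local probabilities that the read base is a spurious insertion or that a true base was missed just before it. A forward recursion over this chain then gives the probability that the read was produced by a given candidate sequence. The names `ReadPosition`, `called` and `likelihood`, and the uniform 0.25 distribution over missed bases, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReadPosition:
    base_probs: Dict[str, float]  # P(true base), given this read base is real
    p_ins: float                  # P(this read base is a spurious insertion)
    p_del: float                  # P(a true base was missed just before it)


def called(base: str, err: float = 0.03, p_ins: float = 0.0,
           p_del: float = 0.0) -> ReadPosition:
    """Build a position from a called base and a mismatch probability."""
    probs = {b: err / 3 for b in "ACGT"}
    probs[base] = 1 - err
    return ReadPosition(probs, p_ins, p_del)


def likelihood(read: List[ReadPosition], truth: str) -> float:
    """Forward algorithm: total probability over all alignments of read to truth.

    f[i][j] is the probability of having consumed i read positions while
    emitting the first j bases of `truth`.
    """
    n, m = len(read), len(truth)
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0:
                prev = read[i - 1]
                consume = 1.0 - prev.p_del  # no missed base before this position
                if j > 0:  # read base corresponds to a true base (match or mismatch)
                    total += (f[i - 1][j - 1] * consume * (1.0 - prev.p_ins)
                              * prev.base_probs[truth[j - 1]])
                total += f[i - 1][j] * consume * prev.p_ins  # inserted read base
            if j > 0 and i < n:  # true base missed by the sequencer
                total += f[i][j - 1] * read[i].p_del * 0.25
            f[i][j] = total
    return f[n][m]


clean = [called(b) for b in "ACG"]
noisy = [called(b, p_ins=0.05, p_del=0.05) for b in "ACG"]
```

With the indel probabilities set to zero, the model reduces to ordinary per-base quality scores; non-zero values let a caller weigh a candidate haplotype that differs from the read by an indel, which is the situation the proposal highlights for third-generation platforms.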
Planned Impact
Beneficiaries: The proposed tool chain is expected to be beneficial wherever high sensitivity and specificity are required for identifying polymorphisms in genomes from high-throughput sequence data. This is the case in many clinical settings, as well as in many research settings, a large fraction of which will fall within the remit of the BBSRC. In addition, accurate variant calls or genotype likelihoods, as provided by the proposed tool chain, are essential when the technology is pushed to its limits in large projects, for instance large sequencing-based GWAS, where for reasons of economy or statistical power the aim is to extract as much information as possible from relatively low-coverage sequencing per individual. It is expected that providers of sequencing technology will also benefit from the proposed standardization of uncertainty in sequencing reads, since this will enhance interoperability and ease the transition to new generations of sequencing platforms. Similarly, technology providers are expected to benefit from the generic tool chain, which will allow them to focus on the sequencing platform and less on downstream software development. The project outcomes are also expected to benefit providers of analysis services for high-throughput sequencing data. Examples are the PI's group at the Wellcome Trust Centre for Human Genetics, and the group of Dr. Mario Caccamo at TGAC in Norwich. Groups in this position need trusted, comprehensive and broadly applicable analysis tools in order to help their clients effectively.

Impact: By enabling users to analyze heterogeneous data across multiple sequencing platforms in a uniform manner, the project outcomes will reduce the time between sequencing and analysis, interpretation and results. Enabling researchers to switch to other technologies may yield cost savings or increased power. The higher sensitivity and lower false-positive rates that are suggested by our initial results, and that are the intended outcomes of the planned tool chain, will in a clinical context lead to fewer missed genetic variants that may be causative for the phenotype under study, and to higher rates of correct diagnoses. Access to a trusted and sensitive generic analysis pipeline will enable bioinformaticians to provide end-users with the required analysis results more quickly and more confidently, saving costs and reducing turnaround times. The proposed technology is primarily *enabling*. A substantial fraction of the expected impact can be summarized as increased efficiency. This includes cases where uptake of a new technology is expected to be brought forward by the availability of analysis tools, an effect that could be significant. To a lesser extent, the proposed project may lead to research that otherwise would not have been contemplated, because of the access to efficient sequencing technology that is appropriate for the research question.

People: The two PDRAs who will be trained in this project will acquire sought-after skills in statistical modelling, software development, and bioinformatics research and analysis. They should be well placed for a further career in research, software engineering or data analysis. The WTCHG is a world-class research centre and the second-largest UK sequencing centre, employing around 500 scientists and staff, a large fraction of whom work in bioinformatics, and has an excellent track record in springboarding young researchers into future careers.
People |
ORCID iD |
Gerard Lunter (Principal Investigator) | |
Gil McVean (Co-Investigator) |
Publications
MacArthur DG
(2012)
A systematic survey of loss-of-function variants in human protein-coding genes.
in Science (New York, N.Y.)
1000 Genomes Project Consortium
(2012)
An integrated map of genetic variation from 1,092 human genomes.
in Nature
Eizirik DL
(2012)
The human pancreatic islet transcriptome: expression of candidate genes for type 1 diabetes and the impact of pro-inflammatory cytokines.
in PLoS genetics
Lise S
(2012)
Recessive mutations in SPTBN2 implicate β-III spectrin in both cognitive and motor development.
in PLoS genetics
Palles C
(2013)
Germline mutations affecting the proofreading domains of POLE and POLD1 predispose to colorectal adenomas and carcinomas.
in Nature genetics
Montgomery SB
(2013)
The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes.
in Genome research
Lamble S
(2013)
Improved workflows for high throughput library preparation using the transposome-based Nextera system.
in BMC biotechnology
Rimmer A
(2014)
Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications.
in Nature genetics
Delaneau O
(2014)
Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel.
in Nature communications
Description | We have gained an in-depth understanding of the technical and biological complexities of DNA sequencing data produced by current state-of-the-art sequencing machines (Illumina HiSeq and MiSeq). Building on this, we have developed analytical software to help analyze these data and allow clinical conclusions to be drawn from them. |
Exploitation Route | Our findings will help design future algorithms, and our software will be of direct use to many users, including in industry, of DNA sequencing technology. |
Sectors | Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Environment; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology |
URL | http://www.well.ox.ac.uk/platypus, https://github.com/luntergroup/octopus |
Description | We have developed an in-depth understanding of the error profile of Illumina short-read data, and building on that we have developed a variant caller for these data. The algorithm has been used widely within the Wellcome Trust Centre for Human Genetics (as a standard part of the WTCHG processing pipeline), in flagship projects such as WGS500 (the success of which led directly to the establishment of Genomics England by the UK government) and in consortia (e.g. 1000 Genomes). It has also contributed to setting up a new company (Genomics plc) and has generated collaborations (e.g. with Nazneen Rahman, ICR, Sutton). The last collaboration has resulted in a broader, cheaper and faster cancer predisposition test that is now being applied in several NHS centres. The company Genomics plc was initially aiming to provide analytical support for the Genomics England 100k Genomes project. Its focus has since shifted towards drug discovery. |
First Year Of Impact | 2014 |
Sector | Digital/Communication/Information Technologies (including Software); Healthcare; Government, Democracy and Justice; Pharmaceuticals and Medical Biotechnology |
Impact Types | Societal, Economic, Policy & public services |
Description | Developed analytical pipeline as part of WT-funded strategic award to improve genetic testing within NHS |
Geographic Reach | National |
Policy Influence Type | Influenced training of practitioners or researchers |
Impact | The new screening methodology has allowed medical practitioners to identify rare mutations in cancer predisposition genes, which has on occasion led to improved prognosis, improved management, and genetic counselling of family members. The new screening methodology is cheaper and faster and has thereby made an immediate economic impact. |
URL | http://mcgprogramme.com/ |
Description | BBSRC Impact Acceleration Award |
Amount | £10,759 (GBP) |
Funding ID | KCD00490 Task H501.01 |
Organisation | University of Oxford |
Sector | Academic/University |
Country | United Kingdom |
Start | 02/2016 |
End | 05/2016 |
Description | Improved HLA genotyping for NHS BTU |
Organisation | NHS Blood Transfusion Unit |
Country | United Kingdom |
Sector | Public |
PI Contribution | We are developing an improved algorithm to determine the HLA type of a sample, to help identify tissues suitable for transplantation. |
Collaborator Contribution | Our collaborators have contributed partially validated data to help us assess our methods and have indicated requirements and existing issues that they would like an improved algorithm to address. |
Impact | 1. We have developed an improved algorithm, which has been validated against the data provided by the NHSBTU, showing that it outperforms the currently available commercial algorithms that were selected by the NHSBTU. 2. We are currently working together to design a front-end that will allow clinicians to easily use the algorithm. A manuscript describing the algorithm is currently being written. |
Start Year | 2013 |
Description | Mainstreaming Cancer Genetics |
Organisation | Institute of Cancer Research UK |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | An informal collaboration on indel calling, involving the Centre director and others at the Institute of Cancer Research, has evolved into a formal collaboration in the form of a Wellcome Trust Strategic Award. Our role in this collaboration is to provide the software for small variant calling from Illumina targeted capture data. |
Collaborator Contribution | Our partners have contributed targeted sequencing data across a large number of patients, and have generated validation data for a subset of variant calls that were generated by our pipeline. |
Impact | Publication: "CSN and CAVA: variant annotation tools for rapid, robust next-generation sequencing analysis in the clinical setting" (2015), doi:10.1186/s13073-015-0195-6. Software: CAVA, http://www.well.ox.ac.uk/cava |
Start Year | 2012 |
Title | LGENU - inferring haplotypes in ENU-mutagenized mice |
Description | LGENU is a program designed to infer ancestral haplotypes in ENU-mutated mice from next-generation SNP calls, using the Lander-Green algorithm. The program currently works properly only for a fixed pedigree, and needs SNP data from 3 sibling mice. It takes a VCF file of SNP calls, a map of recombination rates across the mouse genome, a window size, and a value for the ENU mutation rate. The output is a text file consisting of inferred ancestral haplotypes across the genome for each mouse. A test VCF file is provided with the program. |
Type Of Technology | Software |
Year Produced | 2013 |
Open Source License? | Yes |
Impact | Publication: Unlocking the Bottleneck in Forward Genetics Using Whole-Genome Sequencing and Identity by Descent to Isolate Causative Mutations. Bull, K.R., Rimmer, A.J., Siggs, O.M., Miosge, L.A., Roots, C.M., Enders, A., Bertram, E.M., Crockford, T.L., Whittle, B., Potter, P.K., Simon, M.M., Mallon, A.-M., Brown, S.D.M., Beutler, B., Goodnow, C.C., Lunter, G., Cornall, R.J. PLoS Genetics 9(1) 2013. 15 citations. |
URL | http://www.well.ox.ac.uk/lgenu |
Title | Octopus - general and tumour/normal variant calling |
Description | Octopus is a modular haplotype-based variant caller. It includes bespoke models for several common experimental designs, including tumour/normal, trio, population and single-genome variant calling, and outperforms the state of the art in these designs. |
Type Of Technology | Software |
Year Produced | 2018 |
Open Source License? | Yes |
Impact | The software is used by Invitae as part of their core variant calling pipeline. |
URL | https://www.biorxiv.org/content/10.1101/456103v1 |
Title | Platypus |
Description | A DNA variant caller for Illumina sequencing data |
Type Of Technology | Software |
Year Produced | 2014 |
Open Source License? | Yes |
Impact | Led to: 1888 downloads (Mar 2016) and 40 citations of the Platypus paper; key involvement in the 1000 Genomes project; involvement in the WGS500 project, which formed the blueprint of the UK government's 100,000 Genomes Project; a spin-out company, Genomics plc, whose first employee was Andy Rimmer, main developer of Platypus; several other collaborations and publications. |
URL | http://www.well.ox.ac.uk/platypus |
Company Name | Genomics plc |
Description | From the website: "We are aiming to lead the genomic transformation of healthcare. Our vision is simple: to fulfil the potential of genomics to change the world. We will lead the way through this challenge, using our expertise and experience to unleash the potential of genomics and set the standards by which healthcare professionals can benefit from genomic data." |
Year Established | 2014 |
Impact | Secured US$15M in funding. Signed collaborations with Vertex Pharmaceuticals; Oxford University and Oxford University Hospitals NHS Trust; Biogen; and Eisai. Appointed Sekar Kathiresan as Chair of the SAB. |
Website | http://www.genomicsplc.com/ |